
Background

Information Retrieval (IR) research has traditionally focused on serving the best results for a single query, so-called ad hoc retrieval. However, users typically search iteratively, refining and reformulating their queries during a session. A key challenge in studying this interaction is the creation of suitable evaluation resources to assess the effectiveness of retrieval systems over whole sessions. The TREC Interactive, Session, and Tasks tracks attempted to approach this problem, but did not manage to produce a reusable test collection for evaluating entire sessions of interaction between a user and a machine. The problem remains open.

It has become urgent for the community, and especially for forums such as TREC, CLEF, and NTCIR, to focus on providing such a setup, one that will put IR at the forefront of developing dynamic systems that better fit users' information needs. The Dynamic Search Lab attempts to construct such reusable test collections and metrics that will allow the development of dynamic search algorithms. The objective of the lab is threefold:

  • to produce the methodology and algorithms that will lead to a dynamic test collection by simulating users;
  • to understand and quantify, in terms of evaluation measures, what constitutes a good ranking of documents at different stages of a session, and a good session overall;
  • to develop algorithms that can provide an optimal ranking throughout a session.

Lab Overview

We view the problem of dynamic search as the development of two agents, a question agent and an answer agent. The two agents interact with each other towards fulfilling a user's information need.

The tasks of the Lab can be viewed in the figure below: the development of a Q-agent that produces queries to be submitted to a given retrieval system, and the development of a system that composes the results of the multiple rounds of interaction between the two agents into a single ranking.

Collection and Retrieval System

The collection that will be used in the 2018 DynSe Lab is the New York Times corpus (https://catalog.ldc.upenn.edu/ldc2008t19). The corpus consists of articles published in The New York Times between January 1, 1987 and June 19, 2007, with metadata provided by the New York Times Newsroom, the New York Times Indexing Service, and the online production staff at nytimes.com. Most articles are manually summarized and tagged by professional staff. The original form of the dataset is News Industry Text Format (NITF). The dataset supports research in document categorization, information retrieval, entity extraction, and more. The corpus has been indexed with Indri, and a query language model with Dirichlet smoothing has been implemented on top of the Indri index. Participants will be provided with a RESTful API to query the index (https://bitbucket.org/cvangysel/pyndri-flask).
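
As a rough illustration, the sketch below queries such a service over HTTP with Python's requests library. The host, the /search endpoint, and the query and count parameter names are placeholders invented for this example; the actual interface will be the one documented with the pyndri-flask service once the API is released.

  import requests

  API = "http://localhost:5000"  # placeholder host; the lab will announce the real one

  def search(query, top_k=10):
      # Hypothetical endpoint and parameter names, for illustration only;
      # consult the pyndri-flask documentation for the actual interface.
      response = requests.get(API + "/search", params={"query": query, "count": top_k})
      response.raise_for_status()
      return response.json()  # assumed to contain document numbers and retrieval scores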

The RESTful API will be provided by April 15, 2018.

Topics

The topics have been developed by NIST assessors. A topic (which is like a query) contains a few words and is the main search target for one complete run of dynamic search. Each topic contains multiple subtopics, each of which addresses one aspect of the topic. The NIST assessors have tried hard to produce a complete set of subtopics for each topic, so we will treat them as the complete set and use them in the interactions and in the evaluation.

Ten (10) training topics along with their subtopics will be released on April 15, 2018.

Fifty (50) test topics will be released on May 5, 2018. Subtopics will not be released.

Task 1: Query Suggestion

Objective:

Given a verbose description of a task (topic), generate a sequence of queries and their corresponding rankings of the collection.

Submission Guidelines:

Each Q-Agent is allowed up to 10 rounds of query suggestions. At each round, one query is submitted to the A-Agent and the top 10 results are collected. By the end of round 10, 100 search results will have been collected.
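
The sketch below lays out this interaction loop for a single topic: 10 rounds, one query per round, 10 results per round, 100 results in total. The next_query and search methods are assumed interfaces standing in for a participant's Q-agent and the provided A-agent; they are not part of the lab's specification.

  def run_topic(topic_description, q_agent, a_agent, rounds=10, top_k=10):
      # Collect up to rounds * top_k = 100 (query, docno, score) triples for one topic.
      collected = []
      for round_no in range(1, rounds + 1):
          query = q_agent.next_query(topic_description, collected)  # assumed Q-agent interface
          results = a_agent.search(query, top_k)                    # assumed A-agent interface
          collected.extend((query, docno, score) for docno, score in results)
      return collected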

Submission Format:

The lab will use TREC-style submissions. In TREC, a "run" is the output of a search system over ALL topics. Run Format:
TOPIC QUESTION DOCNO RANK SCORE RUN
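
For concreteness, a Task 1 run file might look like the lines below, where the second column is read here as the index of the query round; the topic identifiers, document numbers, and scores are made up for illustration, and the exact identifier formats will follow the released topics and the corpus.

  dynse1 1 1234567 1 -5.8213 myGroup-run1
  dynse1 1 2345678 2 -6.0142 myGroup-run1
  ...
  dynse1 2 3456789 1 -5.9327 myGroup-run1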

Evaluation:

Runs will be evaluated with the Cube Test, sDCG, and Expected Utility; other diagnostic measures such as precision and recall may also be reported.

The Cube Test is a search effectiveness measure that evaluates the speed of gaining relevant information (whether documents or passages) in a dynamic search process. It jointly measures the amount of relevant information a system gathers and the time spent over the entire search process. The higher the Cube Test score, the better the IR system.

sDCG extends the classic DCG to a search session consisting of multiple iterations. Results that are ranked lower, or returned in later iterations, receive larger discounts on their relevance scores. The discounted cumulative relevance score over the whole session is the final value of the metric.
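
A minimal sketch of this idea, following the usual formulation of session DCG (within-query DCG further discounted by a log of the query's position), is given below; the log bases b and bq and any normalization used by the lab's official scoring scripts may differ.

  import math

  def dcg(gains, b=2.0):
      # Classic DCG: no discount for the first b ranks, log_b(rank) discount afterwards.
      total = 0.0
      for rank, gain in enumerate(gains, start=1):
          discount = 1.0 if rank <= b else math.log(rank, b)
          total += gain / discount
      return total

  def sdcg(session, b=2.0, bq=4.0):
      # session: one list of relevance gains per query, in the order the queries were issued.
      # Each query's DCG is further discounted by (1 + log_bq of the query's position).
      return sum(dcg(gains, b) / (1.0 + math.log(q_pos, bq))
                 for q_pos, gains in enumerate(session, start=1))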

Expected Utility scores runs by weighing the relevant information a system finds against the length of the documents it returns. Relevance scores are discounted by ranking position and novelty, while document length is discounted by ranking position only. The final score of a run is the difference between the cumulative discounted relevance score and the aggregated discounted document length.

Task 2: Results Composition

Objective:

Given the rankings produced in Task 1, merge them into a single composite ranking.

Submission Guidelines:

By the end of round 10, 100 search results will have been collected. These 100 results, coming from 10 queries, should be re-ranked into a single, optimal ranking.
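
As a trivial baseline to make the task concrete, the sketch below deduplicates the collected results by document number (keeping the best score seen) and sorts by score; since retrieval scores from different queries are not necessarily comparable, this is only an illustration of the input and output shape, not a recommended method.

  def compose(round_results):
      # round_results: a list of 10 result lists, each holding (docno, score) pairs
      # for one query round. Returns a single ranking of (docno, score) pairs.
      best = {}
      for results in round_results:
          for docno, score in results:
              if docno not in best or score > best[docno]:
                  best[docno] = score
      return sorted(best.items(), key=lambda item: item[1], reverse=True)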

Submission Format:

The lab will use TREC-style submissions. In TREC, a "run" is the output of a search system over ALL topics. Run Format:
TOPIC DUMMY DOCNO RANK SCORE RUN
  • TOPIC is the topic id and can be found in the released topics
  • DUMMY is a dummy column to be filled in with 0.
  • DOCNO is the document number in the corpus
  • RANK is the rank of the document in the composed ranking (in increasing order)
  • SCORE is the score of the ranking/classification algorithm
  • RUN is an identifier/name for the system producing the run
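
Putting the pieces together, a composed ranking can be serialized in the format above with a few lines of Python; the whitespace separator and the score formatting below are assumptions in the spirit of usual TREC run files, so check the lab's parsing scripts for the exact requirements.

  def write_run(path, topic_id, ranking, run_tag):
      # ranking: (docno, score) pairs already sorted from best to worst.
      # Emits: TOPIC DUMMY DOCNO RANK SCORE RUN, with the dummy column fixed to 0.
      with open(path, "w") as out:
          for rank, (docno, score) in enumerate(ranking, start=1):
              out.write("{} 0 {} {} {:.4f} {}\n".format(topic_id, docno, rank, score, run_tag))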

Evaluation:

nDCG@100 will be the main evaluation measure; other evaluation measures may also be reported.
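
For reference, a common formulation of nDCG@100 is sketched below: the DCG of the submitted ranking, truncated at rank 100 with a log2(rank + 1) discount, divided by the DCG of an ideal reordering of the relevance judgments; the official scoring may use a different gain or discount variant.

  import math

  def dcg_at_k(gains, k=100):
      # Standard DCG with a 1 / log2(rank + 1) discount, truncated at rank k.
      return sum(gain / math.log2(rank + 1)
                 for rank, gain in enumerate(gains[:k], start=1))

  def ndcg_at_k(gains, all_judged_gains, k=100):
      # gains: relevance of the submitted ranking, in rank order.
      # all_judged_gains: relevance values of all judged documents for the topic.
      ideal = dcg_at_k(sorted(all_judged_gains, reverse=True), k)
      return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0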

Important Dates

The overall schedule for the labs and the CEUR-WS Lab Working Notes is as follows:

The schedule for the conference and for LNCS Publication: