Information Retrieval (IR) research has traditionally focused on serving the best results for a single query — so-called ad hoc retrieval. However, users typically search iteratively, refining and reformulating their queries during a session. A key challenge in the study of this interaction is the creation of suitable evaluation resources to assess the effectiveness of retrieval systems over whole sessions. The TREC Interactive, Session, and Tasks tracks approached this problem, but did not manage to produce a reusable test collection for evaluating the entire sessions of a conversation between a user and a machine. The problem remains open.
It has become urgent for the community, and especially for forums such as TREC, CLEF, and NTCIR, to focus on and provide such a setup, one that will put IR at the frontline in developing dynamic systems that better fit users' information needs. The Dynamic Search Lab attempts to construct such reusable test collections and metrics that will allow the development of dynamic search algorithms. The objective of the lab is threefold:
We view the problem of dynamic search as the development of two agents, a question agent and an answer agent. The two agents interact with each other towards fulfilling a user's information need.
The tasks of the Lab, shown in the figure below, are the development of a Q-agent that produces queries to be submitted to a given retrieval system, and the development of a system that composes the results of the multiple rounds of interaction between the two agents into a single ranking.
The collection that will be used in the 2018 DynSe Lab is the New York Times corpus (https://catalog.ldc.upenn.edu/ldc2008t19). The corpus consists of articles published in The New York Times from January 1, 1987 to June 19, 2007, with metadata provided by the New York Times Newsroom, the New York Times Indexing Service, and the online production staff at nytimes.com. Most articles are manually summarized and tagged by professional staff. The original form of the dataset is News Industry Text Format (NITF). The dataset can aid research in Document Categorization, Information Retrieval, Entity Extraction, etc. The corpus has been indexed with Indri, and a Query Language Model with Dirichlet Smoothing has been implemented on top of the Indri index. Participants will be provided with a RESTful API to query the index (https://bitbucket.org/cvangysel/pyndri-flask).
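For reference, a query language model with Dirichlet smoothing typically scores a document by summing the log-probabilities of the query terms under the smoothed document model, as sketched below. This is the standard textbook form; the smoothing parameter μ actually used on the provided index is not specified here.

```latex
% Standard Dirichlet-smoothed query likelihood (textbook form; the value of
% \mu used on the provided index is an unspecified deployment detail).
\[
  \operatorname{score}(q, d) = \sum_{w \in q} \log p(w \mid d),
  \qquad
  p(w \mid d) = \frac{\mathit{tf}(w, d) + \mu\, p(w \mid C)}{|d| + \mu}
\]
% tf(w, d): term frequency of w in document d; |d|: document length;
% p(w | C): collection language model; \mu: Dirichlet smoothing parameter.
```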
The RESTful API will be provided by April 15, 2018.
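The concrete interface of the pyndri-flask service will be defined by the provided deployment. Purely as an illustration, a Q-agent might talk to it along the following lines, where the host, the /search endpoint, and the parameter names are assumptions rather than the official API.

```python
import requests

# Hypothetical base URL and endpoint of the provided pyndri-flask service;
# the host, path, and parameter names below are assumptions for illustration.
API_URL = "http://localhost:5000/search"

def submit_query(query, top_k=10):
    """Send a query to the (assumed) search endpoint and return the ranked hits."""
    response = requests.get(API_URL, params={"q": query, "count": top_k})
    response.raise_for_status()
    # Assumed to contain document identifiers and retrieval-model scores.
    return response.json()

if __name__ == "__main__":
    print(submit_query("new york city marathon", top_k=5))
```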
The topics have been developed by NIST assessors. A topic (which is like a query) contains a few words and is the main search target for one complete run of dynamic search. Each topic contains multiple subtopics, each of which addresses one aspect of the topic. The NIST assessors have tried very hard to produce a complete set of subtopics for each topic, so we will treat them as the complete set and use them in the interactions and in the evaluation.
Ten (10) training topics along with their subtopics will be released on April 15, 2018.
Fifty (50) test topics will be released on May 5, 2018. Subtopics will not be released.
| TOPIC | QUESTION | DOCNO | RANK | SCORE | RUN |
| --- | --- | --- | --- | --- | --- |
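The exact run-file serialization will be specified in the submission guidelines. Purely as an illustration of the columns above, a run could be written as tab-separated rows like the following, where every identifier and value is a placeholder.

```python
import csv

# Placeholder rows illustrating the columns above
# (TOPIC, QUESTION, DOCNO, RANK, SCORE, RUN); the topic id, query strings,
# document ids, scores, and the tab delimiter are all assumptions.
rows = [
    ("DS-1", "new york city marathon route", "NYT_PLACEHOLDER_0001", 1, 12.37, "myGroup-run1"),
    ("DS-1", "new york city marathon route", "NYT_PLACEHOLDER_0042", 2, 11.90, "myGroup-run1"),
]

with open("run1.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for row in rows:
        writer.writerow(row)
```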
The Cube Test is a search effectiveness metric that evaluates the speed of gaining relevant information (documents or passages) in a dynamic search process. It measures the amount of relevant information a system gathers and the time needed over the entire search process. The higher the Cube Test score, the better the IR system.
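The official Cube Test definition additionally caps and weights gain per subtopic (the "cube filling" analogy); the snippet below is only a minimal sketch of the idea described above, relevant information gained per unit of time, and not the lab's evaluation script.

```python
def cube_test_sketch(gains, times):
    """Minimal sketch of the Cube Test idea: total relevance gain divided by
    the total time spent over the whole dynamic search process.

    gains -- relevance gain credited to each returned document
    times -- time cost charged for each returned document (same length)

    The official metric also caps per-subtopic gain and weights subtopics;
    those details are deliberately omitted here.
    """
    total_time = sum(times)
    return sum(gains) / total_time if total_time > 0 else 0.0

# Toy usage with made-up numbers: a higher score means more gain per unit time.
print(cube_test_sketch(gains=[1.0, 0.0, 0.5], times=[1.0, 1.0, 1.0]))
```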
sDCG extends the classic DCG to a search session consisting of multiple iterations. The relevance scores of results that are ranked lower, or returned in later iterations, receive larger discounts. The discounted cumulative relevance score is the final result of this metric.
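Below is a minimal sketch of this double discounting, following the commonly cited sDCG formulation of Järvelin et al.; the logarithm bases and other discount details may differ from the lab's exact implementation.

```python
import math

def sdcg_sketch(session, b=2, bq=4):
    """Sketch of session DCG: relevance is discounted both by rank within each
    result list (log base b) and by the position of the query iteration in the
    session (log base bq).

    session -- list of result lists, one per query iteration, each a list of
               graded relevance scores in rank order.
    """
    score = 0.0
    for q, ranking in enumerate(session, start=1):
        query_discount = 1.0 + math.log(q, bq)    # later iterations count less
        for r, rel in enumerate(ranking, start=1):
            rank_discount = 1.0 + math.log(r, b)  # lower ranks count less
            score += rel / (rank_discount * query_discount)
    return score

# Toy usage: two query iterations with graded relevance in rank order.
print(sdcg_sketch([[3, 2, 0], [1, 0, 2]]))
```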
Expected Utility scores different runs by measuring the relevant information a system finds and the length of the returned documents. The relevance score of a document is discounted based on ranking position and novelty; the document length is discounted based on ranking position only. The difference between the cumulative relevance score and the aggregated document length is the final score of each run.
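Again, the snippet below is only a rough sketch of the structure just described (cumulative discounted relevance minus aggregated discounted document length); the rank discount and the length penalty are placeholders rather than the official parameterization.

```python
def expected_utility_sketch(docs, gamma=0.9, length_penalty=0.001):
    """Sketch of the Expected Utility idea described above.

    docs -- list of (relevance, novelty, length) tuples in rank order, where
            novelty in [0, 1] down-weights redundant documents.

    Relevance is discounted by rank and novelty, length by rank only; the
    score is the difference of the two sums. gamma and length_penalty are
    illustrative placeholders, not official values.
    """
    relevance_sum = 0.0
    length_sum = 0.0
    for rank, (rel, novelty, length) in enumerate(docs):
        rank_discount = gamma ** rank
        relevance_sum += rank_discount * novelty * rel
        length_sum += rank_discount * length_penalty * length
    return relevance_sum - length_sum

# Toy usage with made-up numbers.
print(expected_utility_sketch([(2.0, 1.0, 300), (1.0, 0.5, 500), (0.0, 1.0, 200)]))
```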
| TOPIC | DUMMY | DOCNO | RANK | SCORE | RUN |
| --- | --- | --- | --- | --- | --- |
The overall schedule for the labs and the CEUR-WS Lab Working Notes is as follows:
The schedule for the conference and for the LNCS publication is as follows: