Quantity-Centric Search and Retrieval
- Friday, 18 October 2024, 10:00
- Mathematikon, Room 2.414
- Shideh Satya Almasian
Address: Mathematikon, Room 2.414
Organizer: Dean
Event Type: Doctoral Examination
Quantities are essential in documents for describing factual information in domains such as finance, business, medicine, and science. This thesis alone contains 1,423 quantities in its text. Although they account for just 1% of the overall word count, these values carry the most precise and crucial information needed for analysis and system comparison. Despite the importance of quantities, only a handful of studies focus on their representation in text and their impact on Information Retrieval (IR). In many cases, the information need of a user revolves around quantities and cannot be resolved without understanding their semantics. For instance, in the query “a used car that has less than 200hp”, the user is looking for a car within a specific parameter range. To provide an accurate response, the retrieval method must not only recognize the connection between the car and the quantity in the query but also comprehend value comparisons and units. Furthermore, the retrieved results should contain values less than “200” for this specific attribute of a car, requiring an understanding of numerical proximity. However, current quantity models often analyze values and units in isolation, disregarding their relationships to other tokens in the text. Additionally, modern search engines apply the same ranking mechanisms to both words and quantities, overlooking magnitude and unit information. As a result, quantity-centric queries yield sub-par results and often cost users valuable time navigating through irrelevant content.

In this thesis, we address these shortcomings and aim to enhance the quantity understanding of current IR models. We start by presenting a holistic quantity model that efficiently captures combinations of values and units, changes in the behavior of a quantity in the given context (e.g., rising or falling), and the concept (related entities or events) of a quantity. This quantity model leads to the development of an extraction framework called Comprehensive Quantity Extraction (CQE), which is designed to detect and normalize quantities in text. Additionally, we introduce a novel benchmark dataset tailored to evaluating quantity extraction.
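To make the quantity model concrete, the following is a minimal Python sketch of such a representation. The class and attribute names (Quantity, value, unit, change, concept) are illustrative assumptions, not the actual data structures used by CQE.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Quantity:
    """Hypothetical container mirroring the holistic quantity model:
    a normalized value-unit pair, the behavior of the quantity in context,
    and the concept (related entity or event) it refers to."""
    value: float                    # normalized numerical value, e.g. 200.0
    unit: str                       # normalized unit, e.g. "horsepower"
    change: Optional[str] = None    # contextual behavior or bound, e.g. "rising", "less than"
    concept: Optional[str] = None   # related entity or event, e.g. "used car"

# The quantity expressed in the query "a used car that has less than 200hp"
q = Quantity(value=200.0, unit="horsepower", change="less than", concept="used car")
print(q)
```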
Using the quantity extractor, we introduce two quantity-aware retrieval techniques that cover both classical and neural models. These models are designed to rank documents based on the proximity of quantities in the text as well as on the textual content. The first method is the disjoint quantity-aware ranker, which separates the ranking of quantities and textual tokens by means of a quantity index structure. The second method is the joint quantity-aware ranker, which focuses on the joint ranking of quantities and textual tokens by fine-tuning a neural retrieval model on quantity-rich data. These techniques incorporate quantity information during ranking in both neural and lexical models, with minimal efficiency overhead and without changes to the underlying system. The models can answer queries containing the numerical conditions equal, greater than, and less than, in combination with keyword search. To evaluate the effectiveness of our ranking models, we introduce two novel benchmark datasets in the domains of finance and medicine. We compare our methods on these benchmarks against various classical and neural retrieval systems and show significant improvements in answering quantity-centric queries.
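The disjoint ranking idea can be illustrated with a small sketch: a hypothetical quantity index maps each document to its normalized (value, unit) pairs, a separate text ranker (e.g., BM25) supplies keyword scores, and the two signals are combined after checking the numerical condition. All function names, the scoring heuristic, and the weighting parameter alpha are assumptions for illustration, not the implementation described in the thesis.

```python
from typing import Dict, List, Tuple

def quantity_score(doc_quantities: List[Tuple[float, str]],
                   query_value: float, query_unit: str, condition: str) -> float:
    """Score how well a document's extracted quantities satisfy the query condition."""
    candidates = [v for v, u in doc_quantities if u == query_unit]
    if not candidates:
        return 0.0
    if condition == "equal":
        # reward numerical proximity to the queried value
        return max(1.0 / (1.0 + abs(v - query_value)) for v in candidates)
    if condition == "less":
        return 1.0 if any(v < query_value for v in candidates) else 0.0
    if condition == "greater":
        return 1.0 if any(v > query_value for v in candidates) else 0.0
    return 0.0

def rank(quantity_index: Dict[str, List[Tuple[float, str]]],
         text_scores: Dict[str, float],
         query_value: float, query_unit: str, condition: str,
         alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Combine the textual relevance score with the quantity score and sort documents."""
    combined = {
        doc_id: alpha * text_scores.get(doc_id, 0.0)
        + (1.0 - alpha) * quantity_score(quants, query_value, query_unit, condition)
        for doc_id, quants in quantity_index.items()
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Query: "a used car that has less than 200hp"
quantity_index = {"d1": [(150.0, "horsepower")], "d2": [(300.0, "horsepower")]}
text_scores = {"d1": 0.4, "d2": 0.6}
print(rank(quantity_index, text_scores, 200.0, "horsepower", "less"))
```

In this sketch the quantity condition acts as a soft signal blended with the text score rather than a hard filter, which is one simple way to trade off keyword relevance against satisfying the numerical constraint.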