TL;DR: Wang et al. proposed three weighting models based on tf-idf, k-nearest neighbor (kNN) based cosine similarity, and correlation score.
TL;DR: This work proposes simple knowledge-driven query expansion based on possible answers (values) of a query (attribute) for QA-based AVE, and trains a model with two tricks, knowledge dropout and knowledge token mixing, which mimic the imperfection of the value knowledge in testing.
Abstract: A key challenge in attribute value extraction (AVE) from e-commerce sites is how to handle a large number of attributes for diverse products. Although this challenge is partially addressed by a question answering (QA) approach which finds a value in product data for a given query (attribute), it does not work effectively for rare and ambiguous queries. We thus propose simple knowledge-driven query expansion based on possible answers (values) of a query (attribute) for QA-based AVE. We retrieve values of a query (attribute) from the training data to expand the query. We train a model with two tricks, knowledge dropout and knowledge token mixing, which mimic the imperfection of the value knowledge in testing. Experimental results on our cleaned version of AliExpress dataset show that our method improves the performance of AVE (+6.08 macro F1), especially for rare and ambiguous attributes (+7.82 and +6.86 macro F1, respectively).
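As a rough illustration of the query expansion described in this abstract, the sketch below collects the values seen for each attribute in training data and appends them to the query. The function names, the `[SEP]` separator, the `max_values` cap, and the reading of knowledge dropout as "drop all retrieved values with some probability" are assumptions made for illustration, not details taken from the paper; knowledge token mixing is not modeled.

```python
import random
from collections import defaultdict


def build_value_knowledge(training_pairs):
    """Map each attribute to the distinct values observed for it in training data."""
    knowledge = defaultdict(list)
    for attribute, value in training_pairs:
        if value not in knowledge[attribute]:
            knowledge[attribute].append(value)
    return dict(knowledge)


def expand_query(attribute, knowledge, dropout=0.0, rng=None, max_values=5):
    """Append known values to the attribute query.

    With probability `dropout` the values are withheld, so a model trained on
    these queries also sees queries without value knowledge (hypothetical
    simplification of knowledge dropout).
    """
    rng = rng or random
    values = knowledge.get(attribute, [])[:max_values]
    if values and rng.random() < dropout:
        values = []
    return " [SEP] ".join([attribute] + values)


if __name__ == "__main__":
    pairs = [("color", "red"), ("color", "blue"), ("material", "cotton")]
    knowledge = build_value_knowledge(pairs)
    print(expand_query("color", knowledge))
    print(expand_query("color", knowledge, dropout=1.0))
```

An attribute with no known values falls back to the bare attribute name, which corresponds to the unexpanded QA-style query.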
TL;DR: Wang et al. propose a query-aware video encoder to capture query-related visual content, a challenging task that requires a joint understanding of natural language queries and video contents.
TL;DR: This paper proposes a novel privacy-preserving aggregate reverse skyline query (PPARS) scheme on a single server model while ensuring full query privacy and proposes an interpolation-based packing technique to improve the communication efficiency of PPARS.
Abstract: It has been witnessed that Aggregate Reverse Skyline (ARS) query has recently received a wide range of practical applications due to its marvelous property of identifying the influence of query requests. Nevertheless, the query users may hesitate to participate in such query services as the query requests and query results may leak sensitive personal data or valuable business data assets to the service providers. To tackle the concerns, a promising solution is to encrypt the query requests, conduct the ARS queries over encrypted query requests without decrypting, and return the encrypted query results. Unfortunately, many existing solutions are either deployed over a two-server model or unable to fully preserve query privacy. In this paper, we propose a novel privacy-preserving aggregate reverse skyline query (PPARS) scheme on a single server model while ensuring full query privacy. Specifically, we first transform the problem of ARS query into a combination of set membership test and logical expressions. Then, by employing the prefix encoding technique, bloom filter technique, and fully homomorphic encryption, we run the transformed logical expressions to obtain the encrypted aggregate values without leaking query requests, query results, and access patterns. Furthermore, we propose an interpolation-based packing technique to improve the communication efficiency of PPARS. Detailed and formal security analysis demonstrates that our proposed schemes can guarantee strong security. In addition, extensive experiments are conducted, and the results validate the efficiency of our proposed schemes.
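The set membership step in this abstract can be illustrated in plaintext with the standard prefix encoding idea: a value lies in a range exactly when its prefix family shares an element with the minimal prefix cover of that range. The sketch below assumes unsigned integers of a fixed bit width and uses a plain Python set where the paper's scheme would use a Bloom filter evaluated under fully homomorphic encryption. The function names and the 4-bit width are illustrative, and nothing here is encrypted.

```python
def prefix_family(x, width):
    """All prefixes of x's binary form, padded with '*' (width + 1 strings)."""
    bits = format(x, f"0{width}b")
    return {bits[:i] + "*" * (width - i) for i in range(width + 1)}


def range_prefixes(lo, hi, width):
    """Minimal set of prefixes whose union covers exactly [lo, hi]."""
    cover = set()
    while lo <= hi:
        # Grow the largest aligned power-of-two block that starts at lo
        # and still fits inside [lo, hi].
        size = 1
        while (lo % (size * 2) == 0
               and lo + size * 2 - 1 <= hi
               and size * 2 <= 2 ** width):
            size *= 2
        wildcards = size.bit_length() - 1
        bits = format(lo, f"0{width}b")
        cover.add(bits[:width - wildcards] + "*" * wildcards)
        lo += size
    return cover


def in_range(x, lo, hi, width):
    """Range test reduced to a set intersection (set membership) test."""
    return bool(prefix_family(x, width) & range_prefixes(lo, hi, width))


if __name__ == "__main__":
    print(sorted(range_prefixes(6, 14, 4)))
    print(in_range(9, 6, 14, 4), in_range(15, 6, 14, 4))
```

Reducing a comparison to membership in a small set is what makes a Bloom filter a natural fit: the server only needs to test whether any element of one set appears in the other.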
TL;DR: A way forward is charted for IR to research, document and understand this important question, with a view to creating search engines that provide more consistent, accurate and relevant search results regardless of the searcher's framing of the query.
Abstract: Where do queries -- the words searchers type into a search box -- come from? The Information Retrieval community understands the performance of queries and search engines extensively, and has recently begun to examine the impact of query variation, showing that different queries for the same information need produce different results. In an information environment where bad actors try to nudge searchers toward misinformation, this is worrisome. The source of query variation -- searcher characteristics, contextual or linguistic prompts, cognitive biases, or even the influence of external parties -- while studied in a piecemeal fashion by other research communities has not been studied by ours. In this paper we draw on a variety of literatures (including information seeking, psychology, and misinformation), and report some small experiments to describe what is known about where queries come from, and demonstrate a clear literature gap around the source of query variations in IR. We chart a way forward for IR to research, document and understand this important question, with a view to creating search engines that provide more consistent, accurate and relevant search results regardless of the searcher's framing of the query.
TL;DR: Zhang et al. proposed a low-cost SPARQL generator named Light-QAWizard, which integrates multi-label classification into a recurrent neural network (RNN), builds a template classifier, and generates corresponding query grammars based on the results of the template classifier.
Abstract: Much like traditional database querying, the question answering process in a Question Answering (QA) system involves converting a user’s question input into query grammar, querying the knowledge base through the query grammar, and finally returning the query result (i.e., the answer) to the user. The accuracy of query grammar generation is therefore important in determining whether a Question Answering system can produce a correct answer. Generally speaking, incorrect query grammar will never find the right answer. SPARQL is the most frequently used query language in question answering systems. In the past, SPARQL was generated based on graph structures, such as dependency trees, syntax trees and so on. However, the query cost of generating SPARQL is high, which creates long processing times to answer questions. To reduce the query cost, this work proposes a low-cost SPARQL generator named Light-QAWizard, which integrates multi-label classification into a recurrent neural network (RNN), builds a template classifier, and generates corresponding query grammars based on the results of the template classifier. Light-QAWizard reduces query frequency to DBpedia by aggregating multiple outputs into a single output using multi-label classification. In the experimental results, Light-QAWizard’s performance on Precision, Recall and F-measure metrics was evaluated on the QALD-7, QALD-8 and QALD-9 datasets. Not only did Light-QAWizard outperform all other models, but it also had a lower query cost that was nearly half that of QAWizard.
TL;DR: This work presents a new, alternative approach to search query bias identification that includes less high-level suggestions to deepen the data basis of bias analyses and employs recursive algorithm interrogation techniques and suggestion trees that enable access to more subliminal search query suggestions.
Abstract: Despite their important role in online information search, search query suggestions have not been researched as much as most other aspects of search engines. Although reasons for this are multi-faceted, the sparseness of context and the limited data basis of up to ten suggestions per search query pose the most significant problem in identifying bias in search query suggestions. The most proven method to reduce sparseness and improve the validity of bias identification of search query suggestions so far is to consider suggestions from subsequent searches over time for the same query. This work presents a new, alternative approach to search query bias identification that includes less high-level suggestions to deepen the data basis of bias analyses. We employ recursive algorithm interrogation techniques and create suggestion trees that enable access to more subliminal search query suggestions. Based on these suggestions, we investigate topical group bias in person-related searches in the political domain.
TL;DR: This paper reveals the intuitions and behaviours of expert searchers by analysing the query logs of a specialised tool developed to assist them in refining complex Boolean queries; these logs permit a deeper understanding of user behaviour than previous studies.
Abstract: The creation of high-quality medical systematic reviews requires the development of a complex Boolean query to retrieve medical literature. An effective query in this context is critical, as it determines how many documents are to be assessed for inclusion in the resulting systematic review, as all retrieved documents must be screened. Therefore an effective query must balance a reasonable assessment workload with an estimate for how many relevant documents exist for a given topic. Getting this balance correct is naturally a difficult challenge, and there is a certain level of intuition involved in how a query should be formulated and refined. This paper reveals such intuitions and behaviours by analysing the query logs of a specialised tool developed to assist expert searchers in refining complex Boolean queries. These query logs contain unique information that permits a deeper understanding of user behaviour than previous studies. The approximately 6,000 queries collected over one year are available for further analysis at https://github.com/ielab/searchrefiner-logs-collection.
TL;DR: This paper distills the existing solutions into several generic techniques that, when put together, can be utilized to solve a great variety of IQS problems with attractive performance guarantees.
Abstract: Unlike a reporting query that returns all the elements satisfying a predicate, query sampling returns only a sample set of those elements and has long been recognized as an important method in database systems. PODS'14 saw the introduction of independent query sampling (IQS), which extends traditional query sampling with the requirement that the sample outputs of all the queries be mutually independent. The new requirement improves the precision of query estimation, facilitates the execution of randomized algorithms, and enhances the fairness and diversity of query answers. IQS calls for new index structures because conventional indexes are designed to report complete query answers and thus become too expensive for extracting only a few random samples. The phenomenon has created an exciting opportunity to revisit the structure for every reporting query known in computer science. There has been considerable progress since 2014 in this direction. This paper distills the existing solutions into several generic techniques that, when put together, can be utilized to solve a great variety of IQS problems with attractive performance guarantees.
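A minimal sketch of the simplest IQS setting, one-dimensional range sampling over a sorted array, may help make the requirement concrete. It is not taken from the paper: the function name `sample_range` and the choice of sampling with replacement are assumptions. The point it shows is that the query locates the qualifying index interval by binary search and then draws fresh random indexes, so it never reports the full answer and repeated queries produce independent samples.

```python
import bisect
import random


def sample_range(sorted_keys, lo, hi, s, rng=None):
    """Draw s independent uniform samples (with replacement) from keys in [lo, hi].

    Cost is O(log n + s): two binary searches plus s random index draws,
    regardless of how many keys fall inside the range.
    """
    rng = rng or random
    left = bisect.bisect_left(sorted_keys, lo)
    right = bisect.bisect_right(sorted_keys, hi)
    if left == right:  # no key falls in the range
        return []
    return [sorted_keys[rng.randrange(left, right)] for _ in range(s)]


if __name__ == "__main__":
    keys = [1, 3, 5, 7, 9, 11]
    # Two calls with the same arguments draw fresh, independent samples.
    print(sample_range(keys, 4, 10, 3))
    print(sample_range(keys, 4, 10, 3))
```

A reporting index would instead enumerate every key in the range before sampling, which is the cost the abstract describes as too expensive when only a few samples are needed.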
TL;DR: In this article, the authors discuss the principles of relational query visualization and its potential for simplifying user interactions with relational data, and distinguish query visualization from designing a visual query language that helps a user compose a query.
Abstract: Query Visualization (QV) is the problem of transforming a given query into a graphical representation that helps humans understand its meaning. This task is notably different from designing a Visual Query Language (VQL) that helps a user compose a query. This article discusses the principles of relational query visualization and its potential for simplifying user interactions with relational data.
Xin Tian, Yingzhan Lin, Mengfei Song, Siqi Bao, Fan Wang, Haibo He, Shuangqing Sun, Hua Wang
1 Jan 2022
TL;DR: Q-TOD is a novel query-driven task-oriented dialogue system that alleviates domain adaption and knowledge base scalability issues.
Abstract: Existing pipelined task-oriented dialogue systems usually have difficulties adapting to unseen domains, whereas end-to-end systems are plagued by large-scale knowledge bases in practice. In this paper, we introduce a novel query-driven task-oriented dialogue system, namely Q-TOD. The essential information from the dialogue context is extracted into a query, which is further employed to retrieve relevant knowledge records for response generation. Firstly, as the query is in the form of natural language and not confined to the schema of the knowledge base, the issue of domain adaption is alleviated remarkably in Q-TOD. Secondly, as the query enables the decoupling of knowledge retrieval from the generation, Q-TOD gets rid of the issue of knowledge base scalability. To evaluate the effectiveness of the proposed Q-TOD, we collect query annotations for three publicly available task-oriented dialogue datasets. Comprehensive experiments verify that Q-TOD outperforms strong baselines and establishes a new state-of-the-art performance on these datasets.
TL;DR: This work investigates robust query processing techniques for Linked Data Fragment clients by proposing CROP, a query plan optimizer that explores the cost and robustness of alternative query plans, and proposes a new class of adaptive operators: Polymorphic Join Operators.
Abstract: Linked Data Fragments (LDFs) refer to interfaces that allow for publishing and querying Knowledge Graphs on the Web. These interfaces primarily differ in their expressivity and allow for exploring different trade-offs when balancing the workload between clients and servers in decentralized SPARQL query processing. To devise efficient query plans, clients typically rely on heuristics that leverage the metadata provided by the LDF interface, since obtaining fine-grained statistics from remote sources is a challenging task. However, these heuristics are prone to potential estimation errors based on the metadata which can lead to inefficient query executions with a high number of requests, large amounts of data transferred, and, consequently, excessive execution times. In this work, we investigate robust query processing techniques for Linked Data Fragment clients to address these challenges. We first focus on robust plan selection by proposing CROP, a query plan optimizer that explores the cost and robustness of alternative query plans. Then, we address robust query execution by proposing a new class of adaptive operators: Polymorphic Join Operators. These operators adapt their join strategy in response to possible cardinality estimation errors. The results of our first experimental study show that CROP outperforms state-of-the-art clients by exploring alternative plans based on their cost and robustness. In our second experimental study, we investigate how different planning approaches can benefit from polymorphic join operators and find that they enable more efficient query execution in the majority of cases.
TL;DR: In this paper, different feature selection and machine learning techniques, together with the methods, performance, and datasets applied in the classification of web pages, are presented; the authors note that earlier research has not been able to produce accurate results.
Abstract: Search engines are in widespread use on today's web for finding Web documents through particular keywords, which ultimately returns both relevant and irrelevant documents that contain the same keywords. Web information is also increasing exponentially. Manual classification of Web documents is therefore redundant, inaccurate, complex, and time-consuming. Although several researchers have done sustained research on classifying Web documents, they have not been able to produce accurate results. Hence, automatic classification of Web documents is needed to solve these issues. This paper presents the different feature selection methods, machine learning techniques, performance, and datasets applied in the classification of Web pages.
TL;DR: In this paper, a proxy-terms based query obfuscation technique is proposed that allows users to retrieve information from an IR system through proxy queries without submitting true queries. The IR system generates cover queries and true queries from the proxy queries and cannot differentiate whether the user is trying to retrieve information for the cover queries or the true query.
Abstract: Search engines store users’ queries in a query log for performing personalized information retrieval. However, query logs cause privacy concerns and reveal a lot of information about individuals if used against them. Private web search (PWS) provides a privacy-preserving information retrieval (IR) facility which allows users to retrieve information from an IR system without revealing true search queries. Current PWS techniques that are explored in the domain of web search are query obfuscation-based private web search (OB-PWS). These techniques achieve web privacy by injecting cover queries into the user profiles. However, existing OB-PWS techniques submit true queries along with cover queries and achieve query obfuscation in an isolated manner without considering the similarity between consecutive queries. In this article, we propose a proxy-terms based query obfuscation technique that allows users to retrieve information from an IR system through proxy queries without submitting true queries. The IR system automatically generates cover queries and true queries from the proxy queries and cannot differentiate whether the user is trying to retrieve information for the cover queries or the true query. We analyzed the effectiveness of the proposed technique on test queries, and developed a similarity metric for testing the accuracy of the proposed technique. Promising results of experiments confirm the effectiveness of the proposed technique.
TL;DR: In this paper, a word embedding-based skip-gram model is developed for the query expansion task: vocabulary terms are obtained from the top-k retrieved documents using the pseudo relevance feedback model and are then trained using the skip-gram model to find the expansion terms for the user query.
Abstract: Neural networks have attracted researchers immensely in the last couple of years due to their wide applications in various areas such as data mining, natural language processing, image processing, and information retrieval. Word embedding has been applied by many researchers for information retrieval tasks. In this paper, a word embedding-based skip-gram model has been developed for the query expansion task. Vocabulary terms are obtained from the top “k” initially retrieved documents using the pseudo relevance feedback model and then they are trained using the skip-gram model to find the expansion terms for the user query. The performance of the model based on mean average precision (MAP) is 0.3176. The proposed model is compared with other existing models. An improvement of 6.61%, 6.93%, and 9.07% on MAP value is observed compared to the original query, the BM25 model, and query expansion with the Chi-Square model, respectively. The proposed model also retrieves 84, 25, and 81 additional relevant documents compared to the original query, query expansion with the Chi-Square model, and the BM25 model, respectively, and thus also improves the recall value. The per-query analysis reveals that the proposed model performs well in 30, 36, and 30 queries compared to the original query, query expansion with the Chi-Square model, and the BM25 model, respectively.
TL;DR: In this article, the authors consider three common types of Why-questions: Why-not, Why, and Why-rank, which suggest query manipulations that are responsible for user-specified missing, unexpected, and undesirably ranked entities.
Abstract: Subgraph queries are routinely used to search for entities in richly attributed graphs e.g., social networks and knowledge graphs. With little knowledge of underlying data, users often need to rewrite queries multiple times to reach desirable answers. Why-questions are studied to clarify missing or unexpected query results. This paper makes a first step to answer Why-questions for entity search in attributed graphs. We consider three common types of Why-questions: Why-not, Why, and Why-rank, which suggest query manipulations that are responsible for user-specified missing, unexpected, and undesirably ranked entities, respectively. (1) We approach a general query rewriting paradigm that suggests to identify desired entities that are specified by Why-questions. We introduce measures that characterize good query rewrites by incorporating both query editing cost and answer closeness. (2) While computing optimal query rewrites is intractable, we develop feasible algorithms, from approximation to fast heuristics, and provide query rewrites with (near) optimality guarantees whenever possible, for Why, Why-not and Why-rank questions. We further show that our results remain intact for Why questions that (1) request a single query rewrite to clarify multiple types of entities, and (2) variants such as Why-empty and Why-so-many, by providing the matching algorithms. Using real-world graphs, we experimentally verify that our algorithms are effective and feasible for large graphs. Our case study also verifies their application in e.g., knowledge exploration.
TL;DR: Results prove that the query reformulation method with WPS and word-embedding can improve the search for potential IQAS answers.
Abstract: Query reformulation is one of the tasks in Information Retrieval (IR), which automatically creates new queries based on previous queries. The main challenge of query reformulation is to create a new query whose meaning or context is similar to the old query. Query reformulation can improve the search for relevant documents for Open-domain Question Answering (OpenQA). The more queries are given to the search system, the more documents will be generated. We propose a Word Predicted and Substituted (WPS) method for query reformulation using word2vec word embeddings. We tested this method on the Indonesian Question Answering System (IQAS). The test results obtained an E-1 value of 81% and an E-2 value of 274%. These results prove that the query reformulation method with WPS and word-embedding can improve the search for potential IQAS answers.
TL;DR: In this paper, the authors propose a query expansion and entity weighting method (QEEW) which leverages the relationships between entities in the entity catalog (consisting of users' queries, assistant's responses, and corresponding entities) to enhance the query reformulation performance.
Abstract: Voice assistants such as Alexa, Siri, and Google Assistant have become increasingly popular worldwide. However, linguistic variations, variability of speech patterns, ambient acoustic conditions, and other such factors are often correlated with the assistants misinterpreting the user's query. In order to provide better customer experience, retrieval based query reformulation (QR) systems are widely used to reformulate those misinterpreted user queries. Current QR systems typically focus on neural retrieval model training or direct entities retrieval for the reformulating. However, these methods rarely focus on query expansion and entity weighting simultaneously, which may limit the scope and accuracy of the query reformulation retrieval. In this work, we propose a novel Query Expansion and Entity Weighting method (QEEW), which leverages the relationships between entities in the entity catalog (consisting of users' queries, assistant's responses, and corresponding entities), to enhance the query reformulation performance. Experiments on Alexa annotated data demonstrate that QEEW improves all top precision metrics, particularly 6% improvement in top10 precision, compared with baselines not using query expansion and weighting; and more than 5% improvement in top10 precision compared with other baselines using query expansion and weighting.
TL;DR: In this article, the authors propose a three-layer query recommendation method that benefits from a query community graph in the first layer and is enriched by a query-flow graph, which models the transitional patterns made by users inside sessions.
TL;DR: In this paper, the authors propose an approach to derive entity-based query interpretations by deriving all reasonable ways of linking suitable parts of the query to semantically compatible entities in a background knowledge base.
Abstract: Web search queries can be ambiguous: is "source of the nile" meant to find information on the actual river or on a board game of that name? We tackle this problem by deriving entity-based query interpretations: given some query, the task is to derive all reasonable ways of linking suitable parts of the query to semantically compatible entities in a background knowledge base. Our suggested approach focuses on effectiveness but also on efficiency since web search response times should not exceed some hundreds of milliseconds. In our approach, we use query segmentation as a pre-processing step that finds promising segment-based "interpretation skeletons". The individual segments from these skeletons are then linked to entities from a knowledge base and the reasonable combinations are ranked in a final step. An experimental comparison on a combined corpus of all existing query entity linking datasets shows our approach to have a better interpretation accuracy at a better run time than the previously most effective methods.
TL;DR: A two-layer hierarchical structure is described for understanding the space of query combination behavior types, some key aspects of this behavior are characterized, and a classifier is proposed that can automatically classify types of query combination behavior using behavioral features.
Abstract: In exploratory search, users sometimes combine two or more issued queries into new queries. We present such a kind of search behavior as query combination behavior. We find that the queries after combination usually can better meet users’ information needs. We also observe that users combine queries for different motivations, which leads to different types of query combination behaviors. Previous work on understanding user exploratory search behaviors has focused on how people reformulate queries, but not on how and why they combine queries. Being able to answer these questions is important for exploring how users search and learn during information retrieval processes and further developing support to assist searchers. In this paper, we first describe a two-layer hierarchical structure for understanding the space of query combination behavior types. We manually classify query combination behavior sessions from AOL and Sogou search engines and explain the relationship from combining queries to success. We then characterize some key aspects of this behavior and propose a classifier that can automatically classify types of query combination behavior using behavioral features. Finally, we summarize our findings and show how search engines can better assist searchers.
TL;DR: In this article, the authors explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems: dense dual-encoder models and traditional BM25.
Abstract: Search systems control the exposure of ranked content to searchers. In many cases, creators value not only the exposure of their content but, moreover, an understanding of the specific searches where the content is surfaced. The problem of identifying which queries expose a given piece of content in the ranked results is an important and relatively underexplored search transparency challenge. Exposing queries are useful for quantifying various issues of search bias, privacy, data protection, security, and search engine optimization. Exact identification of exposing queries in a given system is computationally expensive, especially in dynamic contexts such as web search. We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems: dense dual-encoder models and traditional BM25. We then improve upon this approach through metric learning over the retrieval embedding space. We further derive an evaluation metric to measure the quality of a ranking of exposing queries, as well as conducting an empirical analysis of various practical aspects of approximate EQI. Overall, our work contributes a novel conception of transparency in search systems and computational means of achieving it.
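The role reversal described in this abstract can be sketched for the BM25 case: the query log becomes the indexed collection, and the document is issued as the "query" against it. This is only an illustration under stated assumptions, not the paper's implementation. The helper names, the whitespace tokenizer, the parameters `k1=1.2` and `b=0.75`, and the non-negative idf variant are all choices made here, and the metric learning step is not covered.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, collection, k1=1.2, b=0.75):
    """Score every tokenized item in `collection` against `query_tokens` with BM25."""
    n = len(collection)
    avg_len = sum(len(item) for item in collection) / n
    # Document frequency: number of items containing each term.
    df = Counter(term for item in collection for term in set(item))
    scores = []
    for item in collection:
        tf = Counter(item)
        score = 0.0
        for term in set(query_tokens):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(item) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def exposing_query_candidates(document, queries, top_k=2):
    """Reverse retrieval: rank logged queries by BM25 score for one document."""
    tokenized = [q.lower().split() for q in queries]
    scores = bm25_scores(document.lower().split(), tokenized)
    ranked = sorted(zip(queries, scores), key=lambda pair: pair[1], reverse=True)
    return [q for q, s in ranked[:top_k] if s > 0]


if __name__ == "__main__":
    log = ["cheap flights paris", "bm25 ranking function", "python list sort"]
    doc = "An introduction to the BM25 ranking function for search"
    print(exposing_query_candidates(doc, log))
```

The output is only a candidate list: whether a candidate query actually exposes the document still depends on how the forward search system ranks that document against everything else in its index.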
TL;DR: This work presents a learning method based on a counterfactual learning approach that is able to leverage feedback on the items to evaluate query suggestions through an unbiased estimator, thus avoiding the assumption that users write or select optimal queries.
Abstract: Conventional methods for query autocompletion aim to predict which completed query a user will select from a list. A shortcoming of this approach is that users often do not know which query will provide the best retrieval performance on the current information retrieval system, meaning that any query autocompletion methods trained to mimic user behavior can lead to suboptimal query suggestions. To overcome this limitation, we propose a new approach that explicitly optimizes the query suggestions for downstream retrieval performance. We formulate this as a problem of ranking a set of rankings, where each query suggestion is represented by the downstream item ranking it produces. We then present a learning method that ranks query suggestions by the quality of their item rankings. The algorithm is based on a counterfactual learning approach that is able to leverage feedback on the items (e.g., clicks, purchases) to evaluate query suggestions through an unbiased estimator, thus avoiding the assumption that users write or select optimal queries. We establish theoretical support for the proposed approach and provide learning-theoretic guarantees. We also present empirical results on publicly available datasets, and demonstrate real-world applicability using data from an online shopping store.
TL;DR: This article proposes a Multitask-based Semantic Search Neural Network (MSSNN) for candidate retrieval, which accomplishes query-response interactions through multiple tasks: it employs a Seq2Seq modeling task to learn a good query encoder, performs a word prediction task to build response embeddings, and finally uses a simple matching model to form the dot-product scorer.
Abstract: Semantic search for candidate retrieval is an important yet neglected problem in retrieval-based Chatbots, which aims to select a bunch of candidate responses efficiently from a large pool. The existing bottleneck is to ensure the model architecture having two points: 1) rich interactions between a query and a response to produce query-relevant responses; 2) ability of separately projecting the query and the response into latent spaces to apply efficiently in semantic search during online inference. To tackle this problem, we propose a novel approach, called Multitask-based Semantic Search Neural Network (MSSNN) for candidate retrieval, which accomplishes query-response interactions through multi-tasks. The method employs a Seq2Seq modeling task to learn a good query encoder, and then performs a word prediction task to build response embeddings, finally conducts a simple matching model to form the dot-product scorer. Experimental studies have demonstrated the potential of the proposed approach.
TL;DR: In this paper, the authors describe basic query transformation rules for processing approximate queries using synthetic data tables or inferential models and, based on preliminary experimental results, confirm that ML models can be used to provide approximate query results in response times acceptable for applications.
Abstract: Approximate query processing can help reduce response time for aggregate queries in exploratory data analysis. In this study, we describe basic query transformation rules for processing approximate queries using synthetic data tables or inferential models. Based on the preliminary experimental results, we confirm that ML models can be used to provide approximate query results in response times acceptable for applications.
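As a rough illustration of answering an aggregate query from an inferential model instead of scanning the table, the sketch below rewrites `AVG(y) WHERE x BETWEEN lo AND hi` into an average of model predictions. It is not taken from the study: the least-squares line, the assumption that `x` is roughly uniform over the queried range, and the function names are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def approx_avg(model: Tuple[float, float], lo: float, hi: float, steps: int = 100) -> float:
    """Approximate AVG(y) for lo <= x <= hi from the model alone.

    Assumes x is roughly uniformly distributed over [lo, hi].
    """
    intercept, slope = model
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return sum(intercept + slope * x for x in grid) / steps


def exact_avg(xs: Sequence[float], ys: Sequence[float], lo: float, hi: float) -> float:
    """Exact answer by scanning the table, for comparison only."""
    selected: List[float] = [y for x, y in zip(xs, ys) if lo <= x <= hi]
    return sum(selected) / len(selected)


xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
model = fit_line(xs, ys)            # built once, offline
approx = approx_avg(model, 2.0, 6.0)
exact = exact_avg(xs, ys, 2.0, 6.0)
```

On this noise-free toy table the two answers agree; on real data the gap depends on how well the model fits and on how far the uniformity assumption holds.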
TL;DR: A weighting strategy and a query node replacement strategy are designed for the community search problem on heterogeneous networks; they enable the induced homogeneous graph to contain more semantic information and can find better-quality nodes for community search.
TL;DR: In this article, the authors propose a new type of query, the top-$k$ spatial textual cluster retrieval ($k$-STC) query, which returns the top-$k$ clusters that are located close to a query location, contain objects that are relevant with regard to given query keywords, and have an object density that exceeds a given threshold.
Abstract: So-called spatial web queries retrieve web content representing points of interest, such that the points of interest have descriptions that are relevant to query keywords and are located close to a query location. Two broad categories of such queries exist. The first encompasses queries that retrieve single spatial web objects that each satisfy the query arguments. Most proposals belong to this category. The second category, to which this paper's proposal belongs, encompasses queries that support exploratory user behavior and retrieve sets of objects that represent regions of space that may be of interest to the user. Specifically, the paper proposes a new type of query, the top-$k$ spatial textual cluster retrieval ($k$-STC) query that returns the top-$k$ clusters that (i) are located close to a query location, (ii) contain objects that are relevant with regard to given query keywords, and (iii) have an object density that exceeds a given threshold. To compute this query, we propose a DBSCAN-based approach and an OPTICS-based approach that rely on on-line density-based clustering and that exploit early stop conditions. Empirical studies on real data sets offer evidence that the paper's proposals can find good quality clusters and are capable of excellent performance.
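The three conditions of the query can be sketched with a naive, batch version of DBSCAN. This is an illustration only and not the paper's algorithm: it clusters all keyword-matching objects up front rather than on-line, it has no early stop conditions, it treats any keyword overlap as relevance, and it ranks clusters by the distance of their nearest member to the query location. The names `dbscan` and `top_k_clusters` and all parameter values are assumptions.

```python
import math
from typing import Dict, List, Optional, Set, Tuple

Point = Tuple[float, float]
SpatialObject = Tuple[Point, Set[str]]  # (location, keywords)


def dbscan(points: List[Point], eps: float, min_pts: int) -> List[int]:
    """Naive O(n^2) DBSCAN; returns a cluster id per point, or -1 for noise."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbors(i: int) -> List[int]:
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:          # density threshold not met
            labels[i] = -1
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:          # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:       # only core points expand the cluster
                queue.extend(nb)
        cluster += 1
    return [label if label is not None else -1 for label in labels]


def top_k_clusters(objects: List[SpatialObject], location: Point, keywords: Set[str],
                   k: int, eps: float, min_pts: int) -> List[List[Point]]:
    """Return the k dense clusters of relevant objects closest to the location."""
    relevant = [loc for loc, words in objects if words & keywords]
    clusters: Dict[int, List[Point]] = {}
    for loc, label in zip(relevant, dbscan(relevant, eps, min_pts)):
        if label != -1:
            clusters.setdefault(label, []).append(loc)
    ranked = sorted(clusters.values(),
                    key=lambda members: min(math.dist(location, m) for m in members))
    return ranked[:k]


objects: List[SpatialObject] = [
    ((0.0, 0.0), {"cafe"}), ((0.0, 1.0), {"cafe"}), ((1.0, 0.0), {"cafe"}),
    ((10.0, 10.0), {"cafe"}), ((10.0, 11.0), {"cafe"}), ((11.0, 10.0), {"cafe"}),
    ((0.5, 0.5), {"bank"}), ((50.0, 50.0), {"cafe"}),
]
top = top_k_clusters(objects, (9.0, 9.0), {"cafe"}, k=1, eps=1.5, min_pts=3)
```

In this toy data the isolated cafe at (50, 50) fails the density threshold and the bank fails the keyword test, so only the two cafe clusters compete on distance.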
TL;DR: A new hybrid query reformulation through QE and term re-weighting techniques is suggested, exploiting both the relationships defined by knowledge resources and the distributed semantics recently revealed by neural network analysis.
Abstract: Query reformulation is a well-known technique intended to improve the performance of Information Retrieval Systems. Among the several available techniques, Query Expansion (QE) reformulates the initial query by adding similar terms, drawn from several sources (corpus, knowledge resources), to the query terms in order to retrieve more relevant documents. Most QE methods are based on the relationships between the original query term and candidate terms (new terms) in order to select the most similar expansion terms. In this paper, we suggested a new hybrid query reformulation through QE and term re-weighting techniques. The suggested approach aimed to demonstrate the effectiveness of QE with a semantic selection of candidate terms according to the specificity of original query terms in the improvement of retrieval performance. To this end, we exploited both relationships defined by knowledge resources and the distributed semantics, recently revealed by neural network analysis. For term re-weighting, we proposed a new semantic method based on semantic similarity measure that assigns a weight to each term of the expanded query. The conducted experiments on OHSUMED and TREC 2014 CDS test collections, including long and short queries, yielded significant results that outperformed the baseline and state-of-the-art approaches.
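The general idea of expanding a query with similar terms and then weighting each term by semantic similarity can be sketched as below. This is a simplified illustration, not the paper's method: it uses only cosine similarity over toy embedding vectors, ignores knowledge resources and term specificity, and the threshold, the vectors, and the name `expand_query` are assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def expand_query(
    query_terms: List[str],
    embeddings: Dict[str, Sequence[float]],
    threshold: float = 0.8,
) -> Dict[str, float]:
    """Return term -> weight for the expanded query.

    Original terms keep weight 1.0; each added term is weighted by its best
    similarity to any original query term.
    """
    weights = {term: 1.0 for term in query_terms}
    for term in query_terms:
        if term not in embeddings:
            continue
        for candidate, vec in embeddings.items():
            if candidate in query_terms:
                continue
            sim = cosine(embeddings[term], vec)
            if sim >= threshold:
                weights[candidate] = max(weights.get(candidate, 0.0), sim)
    return weights


# Toy two-dimensional embeddings, invented for the example.
embeddings = {
    "heart": [1.0, 0.0],
    "cardiac": [0.9, 0.1],
    "attack": [0.0, 1.0],
    "banana": [-1.0, 0.2],
}
expanded = expand_query(["heart", "attack"], embeddings)
```

Here "cardiac" is added with a weight just below 1.0 because it is close to "heart", while "banana" falls under the threshold for both original terms and is left out.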
TL;DR: In this paper, the authors propose a k-ary query scheme with k ≥ 2 samples in a query that identifies (dis)similar items in the set while effectively exploiting the associated transitive relations.
Abstract: We consider the basic problem of querying an expert oracle for labeling a dataset in machine learning. This is typically an expensive and time-consuming process, and we therefore seek ways to do so efficiently. The conventional approach involves comparing each sample with (the representative of) each class to find a match. In a setting with N equally likely classes, this involves N/2 pairwise comparisons (queries per sample) on average. We consider a k-ary query scheme with k ≥ 2 samples in a query that identifies (dis)similar items in the set while effectively exploiting the associated transitive relations. We present a randomized batch algorithm that operates on a round-by-round basis to label the samples and achieves a query rate of [Formula: see text]. In addition, we present an adaptive greedy query scheme, which achieves an average rate of ≈ 0.2N queries per sample with triplet queries. For the proposed algorithms, we investigate the query rate performance analytically and with simulations. Empirical studies suggest that each triplet query takes an expert at most 50% more time compared with a pairwise query, indicating the effectiveness of the proposed k-ary query schemes. We generalize the analyses to nonuniform class distributions when possible.
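The N/2 figure for the conventional pairwise baseline can be checked with a short exact calculation. The sketch below assumes one particular reading of that baseline, which the abstract does not spell out: the sample is compared with class representatives one at a time in a fixed order, a sample of class i is matched on the i-th comparison, and the last class is inferred after N - 1 misses without a further query. The function name is hypothetical.

```python
from fractions import Fraction


def expected_pairwise_queries(n: int) -> Fraction:
    """Exact expected number of pairwise queries per sample for n equally likely classes.

    Classes 1..n-1 cost i queries each; class n costs n-1 because it is
    inferred once every other class has been ruled out (an assumption).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total = sum(range(1, n)) + (n - 1)
    return Fraction(total, n)


for n in (10, 100, 1000):
    rate = expected_pairwise_queries(n)
    # The ratio to n approaches 1/2 as n grows, matching the N/2 figure.
    print(n, float(rate), float(rate / n))
```

In closed form this expectation is (N - 1)(N + 2) / (2N), which tends to N/2 for large N and gives a reference point for the ≈ 0.2N rate reported for triplet queries.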