TL;DR: This paper surveys QE techniques in IR from 1960 to 2017 with respect to core techniques, data sources used, weighting and ranking methodologies, user participation and applications – bringing out similarities and differences.
Abstract: With the ever-increasing size of the web, extracting relevant information from the Internet with a query formed by a few keywords has become a big challenge. Query Expansion (QE) plays a crucial role in improving searches on the Internet. Here, the user’s initial query is reformulated by adding meaningful terms with similar significance. QE – as part of information retrieval (IR) – has long attracted researchers’ attention. It has become very influential in the fields of personalized social documents, question answering, cross-language IR, information filtering and multimedia IR. Research in QE has gained further prominence because of IR-dedicated conferences such as TREC (Text REtrieval Conference) and CLEF (Conference and Labs of the Evaluation Forum). This paper surveys QE techniques in IR from 1960 to 2017 with respect to core techniques, data sources used, weighting and ranking methodologies, user participation and applications – bringing out similarities and differences.
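As a rough illustration of the reformulation step described above, the sketch below appends related terms to a user's query. The `RELATED_TERMS` table, the function name `expand_query` and the `max_new_terms` parameter are hypothetical and chosen only for this example; the surveyed techniques derive candidate terms from many different data sources and weight them in different ways.

```python
# Hypothetical related-term table (an assumption for illustration only).
# A real system might derive candidates from a thesaurus, co-occurrence
# statistics, or relevance feedback.
RELATED_TERMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["affordable", "budget"],
}


def expand_query(query, related=RELATED_TERMS, max_new_terms=2):
    """Return the original terms followed by up to max_new_terms related terms per term."""
    original = query.lower().split()
    expanded = list(original)
    for term in original:
        for candidate in related.get(term, [])[:max_new_terms]:
            if candidate not in expanded:  # avoid duplicate terms
                expanded.append(candidate)
    return expanded


print(expand_query("cheap car"))
```

Terms without an entry in the table are passed through unchanged, so the expanded query always contains the user's original keywords.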
TL;DR: In this paper, the authors investigate place-related questions submitted to search systems and their human-generated answers using a large-scale dataset of questions/answers, MS MARCO V2.1.
Abstract: This paper investigates place-related questions submitted to search systems and their human-generated answers. Place-based search is motivated by the need to identify places matching some criteria, to identify them in space or relative to other places, or to characterize the qualities of such places. Human place-related questions have thus far been insufficiently studied and differ strongly from typical keyword queries. They thus challenge today’s search engines providing only rudimentary geographic information retrieval support. We undertake an analysis of the patterns in place-based questions using a large-scale dataset of questions/answers, MS MARCO V2.1. The results of this study reveal patterns that can inform the design of conversational search systems and in-situ assistance systems, such as autonomous vehicles.
TL;DR: This work shows how an existing product-taxonomy mapping can improve query classification, and reduce the need for labeled data, using transfer learning, and makes available a large query data set of 535,506 unique e-commerce labeled queries, mapped over 58 categories.
Abstract: In web search, query classification (QC) is used to map a query to a user’s search intent. In the e-commerce domain, users’ product search queries can be broadly categorised into product-specific queries and category-specific queries [9]. In these instances, accurate classification of queries will help with identifying the right product categories from which relevant products can be retrieved. Thus, mapping a query to a pre-defined product taxonomy is an important step in the e-commerce query understanding pipeline. A typical e-commerce website has thousands of categories, and curating a labeled data set for query classification is expensive, time-consuming, and labor-intensive. In addition, product search queries are short, and the vocabulary changes over time as the catalogue evolves. Reducing the effort of generating query-category labels would save time and resources. In this work we show how an existing product-taxonomy mapping can improve query classification, and reduce the need for labeled data, using transfer learning. Our results demonstrate that such an approach can match, and often exceed, the performance of direct training with a smaller computational budget. We further explore how performance varies as the amount of available training data varies, and show that transfer learning is most useful when the target data set size is small. In addition, we make available a large query data set of 535,506 unique e-commerce labeled queries, mapped over 58 categories. The results and transfer learning approaches presented in this work can act as strong baselines for this collection and task.
TL;DR: A novel Clicked-URL (CURL) feature is proposed that uses semantic information extracted from user-clicked URLs in the search results and compares it with the "key word list" to identify the transactional query type.
Abstract: Understanding query intents can help search engines to effectively improve their search quality. Click-through data has proven to be a valuable resource for query classification. In this paper, we propose a novel Clicked-URL (CURL) feature that uses semantic information extracted from user-clicked URLs in the search results and compares it with the "key word list" to identify the transactional query type. Experiments show that we can obtain relatively high accuracy in transactional query identification with CURL, and achieve improved performance in query classification when combining it with other features.
TL;DR: A Customised Grammar Framework for text classification is proposed, which exploits domain-related information and a new way to represent text as a series of syntactic categories forming syntactic patterns.
Abstract: In real-life classification problems, prior information about the problem and expert knowledge about the domain are often used to obtain reliable and consistent solutions. This is especially true in fields where the data is ambiguous, such as text, in which the same words can be used in seemingly similar texts, but have a different meaning. A promising avenue for text classification is machine learning, which has been shown to perform well in a variety of applications including query classification and sentiment analysis. Many of the proposed approaches rely on the bag-of-words representation, which loses the information about the structure of the text. In this paper, we propose a Customised Grammar Framework for text classification, which exploits domain-related information and a new way to represent text as a series of syntactic categories forming syntactic patterns. The framework employs a formal grammar approach for transforming the text into the syntactic patterns representation. We applied the framework for the query classification problem and our results show that our approach outperforms previous ones in terms of classification performance.
TL;DR: A continuous reverse skyline query processing scheme that considers the mobility of mobile devices and removes the devices that do not affect a query by using a pruning method and continuously monitors the areas of candidate devices to update the query result incrementally is proposed.
Abstract: Recently, various query processing schemes in mobile environments have been studied. In particular, the reverse skyline query, a variation of the skyline query, has been receiving much attention for multimedia data. However, the existing reverse skyline query processing schemes do not consider the mobility of devices. In this paper, we propose a continuous reverse skyline query processing scheme that considers the mobility of mobile devices. The proposed scheme removes the devices that do not affect a query by using a pruning method and continuously monitors the areas of candidate devices to update the query result incrementally.
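For readers unfamiliar with the query type, the following is a minimal brute-force sketch of a static reverse skyline under the usual dynamic-dominance definition: a point p belongs to the reverse skyline of query q if no other point is at least as close to p as q in every dimension and strictly closer in at least one. It does not reproduce the proposed scheme, which adds pruning and incremental updates for moving devices; the function names are illustrative assumptions.

```python
def dynamically_dominates(a, b, origin):
    """True if a is at least as close to origin as b in every dimension
    and strictly closer in at least one dimension."""
    da = [abs(x - o) for x, o in zip(a, origin)]
    db = [abs(x - o) for x, o in zip(b, origin)]
    return all(x <= y for x, y in zip(da, db)) and any(x < y for x, y in zip(da, db))


def reverse_skyline(points, query):
    """Brute-force reverse skyline: keep every point p for which the query
    is not dynamically dominated (relative to p) by any other point."""
    result = []
    for i, p in enumerate(points):
        others = [r for j, r in enumerate(points) if j != i]
        if not any(dynamically_dominates(r, query, p) for r in others):
            result.append(p)
    return result


points = [(1, 1), (5, 5), (2, 2)]
print(reverse_skyline(points, (3, 3)))
```

This version compares every pair of points for each query, which is the cost that pruning and incremental monitoring are meant to avoid.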
TL;DR: This talk describes how to apply personalization and recommendation techniques in three common voice scenarios which can be defined in terms of request types: known-item, thematic and broad open-ended, and discusses differences and challenges regarding evaluation of voice powered recommendation systems.
Abstract: The adoption of voice-enabled devices has seen an explosive growth in the last few years and music consumption is among the most popular use cases. Music personalization and recommendation play a major role at Pandora in providing a delightful listening experience for millions of users daily. In turn, providing the same perfectly tailored listening experience through these novel voice interfaces brings new interesting challenges and exciting opportunities. In this talk we will describe how we apply personalization and recommendation techniques in three common voice scenarios which can be defined in terms of request types: known-item, thematic and broad open-ended. Known-item search requests are the most common scenario, in which users have a well-defined and clear intent: looking for a specific item in the catalog or their personal collection. A voice interface makes the task natural and easy to accomplish since the user is not required to type on a small keyboard. Solving for this specific task involves performing an entity search against a large music catalog and personal user collection. This can be very challenging due to imperfect voice utterance transcriptions, unconventional entity names and the numerous combinations of ways a user can ask for music entities. We employ personalization algorithms for entity disambiguation, the need for which can be caused by the presence of homonym, homograph and homophone terms in the catalog. Another common voice use case is to ask for music regarding a specific theme or context such as a genre, an activity, a mood, an occasion or any combination of those. This scenario differs sharply from the known-item case in that multiple results might, based on users' varying contexts, be relevant rather than a single clearly relevant one. For example, a rap music fan would not enjoy a country workout playlist when asking for "music for working out" but may like a hip hop one.
This problem can be quite complex to solve as it involves different areas such as voice spoken language understanding, content tagging and personalization. We will describe how we use deep learning slot filling techniques and query classification to interpret the user intent and identify the main concepts in the query. After that, we will discuss some of the content tagging work we have done to classify music according to these voice specific themes. Lastly, we will touch upon how we use recommendation techniques to deliver personalized and unique results to each individual and describe the challenge of balancing the delicate trade-off between query relevance and personalization. The third category of voice queries we will describe are broad or open-ended requests. Voice users often skip the hard work of thinking about what they actually want to hear and command: "just play something awesome". A music service should still meet these expectations instead of interpreting those commands as literal requests. We discuss exploit and explore trade-offs made in the recommendation item pool generation process. Here the exploit pool contains items aimed at re-consumption, while the explore pool contains new items with specific context match. Finally, we will discuss differences and challenges regarding evaluation of voice powered recommendation systems. The first key difference is that in the standard recommendation system settings evaluations are based on UI signals such as impressions and clicks or other explicit forms of feedback. Since pure voice interfaces do not contain visual UI elements, relevance labels need to be inferred through implicit actions such as play time, query reformulations or other types of session level information. Another difference is that while the typical recommendation task corresponds to recommending a ranked list of items, a voice play request translates into a single item play action. 
Thus, some considerations about closed feedback loops need to be made. In summary, improving the quality of voice interactions in music services is a relatively new challenge and many exciting opportunities for breakthroughs still remain. There are many new aspects of recommendation system interfaces to address to bring a delightful and effortless experience for voice users. We will share a few open challenges to solve for the future.
TL;DR: This paper proposes dynamic navigation of web queries using a B-tree to retrieve relevant documents efficiently by reducing non-relevant documents; the resulting documents are sorted using an improved page rank algorithm.
Abstract: Most of the web search queries submitted by users are short, uncertain and ambiguous. The information retrieval process is very complicated today, as it depends on many factors, such as the classification of a web query, the number of phrases present in the query, the number of in-links and out-links to and from the documents, vocabulary, and the changing nature of the meaning of words. The information retrieval process mainly involves two steps: relevant documents are retrieved for user queries, and the retrieved documents are then sorted using efficient page rank algorithms. Most existing systems use static navigation of the web query and ranking. They mainly depend on the number of in-links and out-links of a web page. In this paper, we propose dynamic navigation of web queries using a B-tree to retrieve relevant documents efficiently by reducing non-relevant documents; the resulting documents are sorted using an improved page rank algorithm.
TL;DR: This paper introduces a new approach to automatically detect the type of each thread in online forums, and aims to identify book search queries by syntactic patterns, association rules between terms and textual sequence mining.
Abstract: We argue that verbose natural language queries used for software retrieval contain many terms that follow specific discourse rules, yet hinder retrieval. Through verbose queries, users can express complex or highly specific information needs. However, it is difficult for search engines to deal with this type of query. Moreover, the emergence of social media allows users to get opinions, suggestions, or recommendations from other users about complex information needs. To improve the understanding of user needs, tasks such as the CLEF Social Book Search Classification Track aim to investigate how systems can automatically identify book search requests in online forums. In this respect, we introduce in the present paper a new approach to automatically detect the type of each thread. Our proposal aims to identify book search queries by syntactic patterns, association rules between terms and textual sequence mining.
TL;DR: Query term weighting based on query performance prediction (QPP) is effective for improving search re-ranking performance.
Abstract: This work presents a general query term weighting approach based on query performance prediction (QPP). To this end, a given term is weighed according to its predicted effect on query performance. Such an effect is assumed to be manifested in the responses made by the underlying retrieval method for the original query and its (simple) variants in the form of a single-term expanded query. Focusing on search re-ranking as the underlying application, the effectiveness of the proposed term weighting approach is demonstrated using several state-of-the-art QPP methods evaluated over TREC corpora.
TL;DR: This paper uses a supervised learning method to automatically learn query vector representations based on semantic features for the query classification task, and shows that the automatically learned query vector model outperforms other models.
Abstract: Both commercial websites and search engines require people to enter query keywords to get useful information. It is becoming increasingly important to understand the user's intentions. The query classification task is quite helpful to commercial websites and search engines; its goal is to classify the query into predefined categories to better understand the needs of the user. In this paper, we use a supervised learning method to automatically learn query vector representations based on semantic features for the query classification task. Our experiments use three different neural networks to model the query: a convolutional neural network, a Long Short-Term Memory (LSTM) network and a two-layer LSTM. The experimental results show that our automatically learned query vector model outperforms other models. The two-layer LSTM model performs better than the other two models, and the F-score improves by 4% compared to Logistic Regression.
TL;DR: A new automatic approach for identifying and characterizing queries in the health domain that makes use of the search engine counts through a semantic similarity measure called Normalized Google Distance (NGD) combined with Support Vector Machines to classify queries into three dimensions: health-related, severity and semantic type.
Abstract: Classifying web queries into a set of categories is a crucial task to better understand the user's intent behind a query, contextualize their search and provide more relevant results to the user. However, web queries are typically short and ambiguous, making query classification a non-trivial problem. In this article, we present a new automatic approach for identifying and characterizing queries in the health domain. This method makes use of search engine counts through a semantic similarity measure called Normalized Google Distance (NGD) combined with Support Vector Machines to classify queries into three dimensions: health-related, severity and semantic type. To evaluate our methods, we used two datasets in different languages, Portuguese and English, and built another for evaluating the last dimension. Overall, the results achieved were satisfactory. The most generic classification obtains better results than more specific ones. The NGD proved to be a valuable asset in query classification.
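The Normalized Google Distance mentioned above has a standard closed form based on page counts: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) and f(y) are the hit counts for each term, f(x, y) is the count for both terms together, and N is the number of indexed pages. The sketch below computes it from counts supplied by the caller; how the counts are obtained and how the resulting distances are fed to the Support Vector Machines is not shown, and the function name and the handling of zero counts are assumptions made for this example.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from hit counts: fx and fy for each term, fxy for both terms
    together, n for the total number of indexed pages."""
    # Assumption for this sketch: treat any zero count as maximal distance.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Terms that always co-occur are at distance 0.
print(normalized_google_distance(1000, 1000, 1000, 10**6))
```

Smaller values indicate terms that tend to appear on the same pages; the value grows as co-occurrence becomes rarer relative to the individual counts.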
TL;DR: This system compares the performance between Neo4j-based and OWL-based WQCA to show the effectiveness of using graph database in the query classification process.
Abstract: Web query classification is emphasized by various search engines nowadays due to the increase in the size of the web, as millions of web pages are added to it every day. Web query classification classifies a user query Qi into a list of n categories ci1, ci2, ..., cin. Search result pages can be grouped according to the categories predicted by the query classification method. Providing query classification can help information providers to understand users’ needs based on the categories searched by the users. To build the domain corpus, most query classification systems use an ontology, the Wikipedia category source, a graph database, etc. In this system, a graph database and an ontology are built as domain corpora for the query classification process by using Neo4j and the Web Ontology Language (OWL). A five-step Web Query Classification Algorithm (WQCA) is implemented as a web service by using XML web service technology. The proposed system classifies each domain term of the user query into its relevant categories according to WQCA, using the different domain corpora. Finally, this system compares the performance of Neo4j-based and OWL-based WQCA to show the effectiveness of using a graph database in the query classification process.
Abstract: The ever-increasing growth of the Internet of Things (IoT) has attracted a considerable amount of research attention from the Semantic Web community in order to address the challenge of poor interoperability. However, our survey of research work has shown that the goal of providing an intelligent processing and analysis engine for IoT has still not been fully achieved. Central to this problem is the requirement for a semantic spatio-temporal query processing engine that is able not only to analyze spatio-temporal correlations in a massive amount of IoT data, but also to generate an effective query plan for a given query to execute in a timely manner. Needless to say, query planning for multidimensional data such as IoT data is a costly operation. The best-known techniques either are based on a cost model or use spatio-temporal data statistics and heuristics. In this paper, we propose an alternative solution that uses query similarity identification in conjunction with machine learning techniques to recommend a previously generated query plan to the optimizer for a given query. Our approach also aims to predict the query execution time for the purposes of workload management and capacity planning. Our extensive experiments indicate the efficiency of our learning approach with an impressive prediction accuracy on test queries.
Abstract: Search over the Semantic Web corpus frequently leads to queries having large result sets. So, in order to discover relevant data elements, users must rely on ranking techniques to sort results according to their relevance. At the same time, applications oftentimes deal with information needs, which do not require complete and exact results. In this thesis, we face the problem of how to process queries over Web data in an approximate and rank-aware fashion.
TL;DR: This paper addresses the problem of sharing person-specific genomic sequences without violating the privacy of their data subjects to support large-scale biomedical research projects by offering a scheme that is twice as fast as theirs but uses twice the storage space.
Abstract: This paper addresses the problem of sharing person-specific genomic sequences without violating the privacy of their data subjects to support large-scale biomedical research projects. The proposed method builds on the framework proposed by Kantarcioglu et al. [1] but extends the results in a number of ways. One improvement is that our scheme is deterministic, with zero probability of a wrong answer (as opposed to a low probability). We also provide a new operating point in the space-time tradeoff, by offering a scheme that is twice as fast as theirs but uses twice the storage space. This point is motivated by the fact that storage is cheaper than computation in current cloud computing pricing plans. Moreover, our encoding of the data makes it possible for us to handle a richer set of queries than exact matching between the query and each sequence of the database, including: (i) counting the number of matches between the query symbols and a sequence; (ii) logical OR matches where a query symbol is allowed to match a subset of the alphabet thereby making it possible to handle (as a special case) a “not equal to” requirement for a query symbol (e.g., “not a G”); (iii) support for the extended alphabet of nucleotide base codes that encompasses ambiguities in DNA sequences (this happens on the DNA sequence side instead of the query side); (iv) queries that specify the number of occurrences of each kind of symbol in the specified sequence positions (e.g., two ‘A’ and four ‘C’ and one ‘G’ and three ‘T’, occurring in any order in the query-specified sequence positions); (v) a threshold query whose answer is ‘yes’ if the number of matches exceeds a query-specified threshold (e.g., “7 or more matches out of the 15 query-specified positions”). (vi) For all query types, we can hide the answers from the decrypting server, so that only the client learns the answer. 
(vii) In all cases, the client deterministically learns only the query's answer, except for query type (v) where we quantify the (very small) statistical leakage to the client of the actual count.
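To make the semantics of query types (i), (ii), (iv) and (v) concrete, the sketch below evaluates them on plaintext sequences. It deliberately omits all encoding and encryption and therefore shows only what the answers mean, not how the scheme computes them privately. The query representation (a mapping from sequence position to a set of allowed symbols) and all function names are assumptions made for this example.

```python
from collections import Counter


def count_matches(sequence, query):
    """Types (i) and (ii): count the query positions whose symbol is allowed.
    query maps position -> set of allowed symbols; a one-element set is an
    exact match, and 'not a G' is the set of all other symbols."""
    return sum(1 for pos, allowed in query.items() if sequence[pos] in allowed)


def symbol_counts(sequence, positions):
    """Type (iv): occurrences of each symbol at the specified positions."""
    return Counter(sequence[p] for p in positions)


def threshold_match(sequence, query, threshold):
    """Type (v): 'yes' if at least threshold positions match, following the
    '7 or more matches' reading of the example in the abstract."""
    return count_matches(sequence, query) >= threshold


seq = "ACGTAC"
query = {0: {"A"}, 1: {"A", "C", "T"}, 2: {"T"}, 5: {"C"}}  # position 1: "not a G"
print(count_matches(seq, query), threshold_match(seq, query, 3))
```

In the plaintext setting these are one-line computations; the difficulty addressed by the paper lies in answering them without revealing the sequences or, for some query types, the intermediate counts.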
TL;DR: The experimental results proved that the proposed model with weighted dominant query algorithm effectively improves scalability by reducing the computational cost.
Abstract: The arrival of large, continuous sets of queries for a given query yields few meaningful insights. Certain data tuples are eliminated in order to balance the system load. The streaming query removes the improper data tuples and uses the proper data tuples in the form of defined tables or sets. However, a major drawback arises from unbounded streaming and inadequate access to end data. Due to such constraints, many stream processing methods make the processed data unavailable to applications or to the related queries of neighboring branches. This paper avoids such problems while processing data tuples at query generation. The study uses a streaming model that executes effective query plans over continuous data. The streaming model aims to reduce the communication cost and improve the scalability of continuous aggregation queries. It subdivides the client query and executes it over data aggregators within the incoherent limit. A weighted dominant query algorithm is formulated to provide the top dominant value in each sub-query cluster. This reduces the computational cost in synthetic databases. The experimental results showed that the proposed model with the weighted dominant query algorithm effectively improves scalability by reducing the computational cost.
Mathu Uthaman, PG scholar*1, Suresh Kumar N, Associate Professor 2
4 Apr 2019
TL;DR: This study customizes query mining and processing techniques for business analysts, comparing query mining and facet mining to identify growth potential in business products, highlighting merits and demerits of each approach.
Abstract: Query mining and query processing are the two main research domains that need customization. Every business analyst needs to identify the growth potential of business products. Analysts therefore depend on querying techniques to obtain connected data, including query-based summarization, query recommendation, facet mining and so on. These techniques have merits and demerits. Facet mining is a recent technique in query searching; however, it poses many challenges in the web community. This paper presents a comparative study of query mining techniques and facet mining to support researchers working in this area.