TL;DR: This paper reflects on the state of the art of Approximate Query Processing, and discusses two promising avenues to pursue towards integrating Approximates Query Processing into data platforms.
Abstract: In this paper, we reflect on the state of the art of Approximate Query Processing. Although much technical progress has been made in this area of research, we are yet to see its impact on products and services. We discuss two promising avenues to pursue towards integrating Approximate Query Processing into data platforms.
TL;DR: Both query expansion experiments on four TREC collections and query classification experiments on the KDD Cup 2005 dataset suggest that the relevance-based word embedding models significantly outperform state-of-the-art proximity-based embedding model, such as word2vec and GloVe.
Abstract: Learning a high-dimensional dense representation for vocabulary terms, also known as a word embedding, has recently attracted much attention in natural language processing and information retrieval tasks. The embedding vectors are typically learned based on term proximity in a large corpus. This means that the objective in well-known word embedding algorithms, e.g., word2vec, is to accurately predict adjacent word(s) for a given word or context. However, this objective is not necessarily equivalent to the goal of many information retrieval (IR) tasks. The primary objective in various IR tasks is to capture relevance instead of term proximity, syntactic, or even semantic similarity. This is the motivation for developing unsupervised relevance-based word embedding models that learn word representations based on query-document relevance information. In this paper, we propose two learning models with different objective functions; one learns a relevance distribution over the vocabulary set for each query, and the other classifies each term as belonging to the relevant or non-relevant class for each query. To train our models, we used over six million unique queries and the top ranked documents retrieved in response to each query, which are assumed to be relevant to the query. We extrinsically evaluate our learned word representation models using two IR tasks: query expansion and query classification. Both query expansion experiments on four TREC collections and query classification experiments on the KDD Cup 2005 dataset suggest that the relevance-based word embedding models significantly outperform state-of-the-art proximity-based embedding models, such as word2vec and GloVe.
TL;DR: A customized sequence-to-sequence model for session-based query suggestion that employs a query-aware attention mechanism to capture the structure of the session context and outperforms the baselines both in terms of the generating queries and scoring candidate queries for the task of query suggestion.
Abstract: Users try to articulate their complex information needs during search sessions by reformulating their queries. To make this process more effective, search engines provide related queries to help users in specifying the information need in their search process. In this paper, we propose a customized sequence-to-sequence model for session-based query suggestion. In our model, we employ a query-aware attention mechanism to capture the structure of the session context. is enables us to control the scope of the session from which we infer the suggested next query, which helps not only handle the noisy data but also automatically detect session boundaries. Furthermore, we observe that, based on the user query reformulation behavior, within a single session a large portion of query terms is retained from the previously submitted queries and consists of mostly infrequent or unseen terms that are usually not included in the vocabulary. We therefore empower the decoder of our model to access the source words from the session context during decoding by incorporating a copy mechanism. Moreover, we propose evaluation metrics to assess the quality of the generative models for query suggestion. We conduct an extensive set of experiments and analysis. e results suggest that our model outperforms the baselines both in terms of the generating queries and scoring candidate queries for the task of query suggestion.
TL;DR: This article proposed relevance-based word embedding models that learn word representations based on query-document relevance information and classify each term as belonging to the relevant or non-relevant class for each query.
Abstract: Learning a high-dimensional dense representation for vocabulary terms, also known as a word embedding, has recently attracted much attention in natural language processing and information retrieval tasks. The embedding vectors are typically learned based on term proximity in a large corpus. This means that the objective in well-known word embedding algorithms, e.g., word2vec, is to accurately predict adjacent word(s) for a given word or context. However, this objective is not necessarily equivalent to the goal of many information retrieval (IR) tasks. The primary objective in various IR tasks is to capture relevance instead of term proximity, syntactic, or even semantic similarity. This is the motivation for developing unsupervised relevance-based word embedding models that learn word representations based on query-document relevance information. In this paper, we propose two learning models with different objective functions; one learns a relevance distribution over the vocabulary set for each query, and the other classifies each term as belonging to the relevant or non-relevant class for each query. To train our models, we used over six million unique queries and the top ranked documents retrieved in response to each query, which are assumed to be relevant to the query. We extrinsically evaluate our learned word representation models using two IR tasks: query expansion and query classification. Both query expansion experiments on four TREC collections and query classification experiments on the KDD Cup 2005 dataset suggest that the relevance-based word embedding models significantly outperform state-of-the-art proximity-based embedding models, such as word2vec and GloVe.
TL;DR: The principle of maximum entropy is exploited to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations and which lead to increasingly faster response times for future queries.
Abstract: In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries. We call this novel idea---learning from past query answers---Database Learning. We exploit the principle of maximum entropy to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that database learning supports 73.7% of these queries, speeding them up by up to 23.0x for the same accuracy level compared to existing AQP systems.
TL;DR: This paper presents a new method for QE based on fuzzy logic considering the top-retrieved document as relevance feedback documents for mining additional QE terms and increases the precision rates and the recall rates of information retrieval systems for dealing with document retrieval.
Abstract: Efficient query expansion (QE) terms selection methods are really very important for improving the accuracy and efficiency of the system by removing the irrelevant and redundant terms from the top-retrieved feedback documents corpus with respect to a user query. Each individual QE term selection method has its weaknesses and strengths. To overcome the weaknesses and to utilize the strengths of the individual method, we used multiple terms selection methods together. In this paper, we present a new method for QE based on fuzzy logic considering the top-retrieved document as relevance feedback documents for mining additional QE terms. Different QE terms selection methods calculate the degrees of importance of all unique terms of top-retrieved documents collection for mining additional expansion terms. These methods give different relevance scores for each term. The proposed method combines different weights of each term by using fuzzy rules to infer the weights of the additional query terms. Then, the weights of the additional query terms and the weights of the original query terms are used to form the new query vector, and we use this new query vector to retrieve documents. All the experiments are performed on TREC and FIRE benchmark datasets. The proposed QE method increases the precision rates and the recall rates of information retrieval systems for dealing with document retrieval. It gets a significant higher average recall rate, average precision rate and F measure on both datasets.
TL;DR: This survey aims to provide a thorough review of a wide range of result diversification techniques including various definitions of diversifications, corresponding algorithms, diversification technique specified for some applications including database, search engines, recommendation systems, graphs, time series and data streams as well as result diversify systems.
Abstract: Nowadays, in information systems such as web search engines and databases, diversity is becoming increasingly essential and getting more and more attention for improving users' satisfaction. In this sense, query result diversification is of vital importance and well worth researching. Some issues such as the definition of diversification and efficient diverse query processing are more challenging to handle in information systems. Many researchers have focused on various dimensions of diversify problem. In this survey, we aim to provide a thorough review of a wide range of result diversification techniques including various definitions of diversifications, corresponding algorithms, diversification technique specified for some applications including database, search engines, recommendation systems, graphs, time series and data streams as well as result diversification systems. We also propose some open research directions, which are challenging and have not been explored up till now, to improve the quality of query results.
TL;DR: This work presents a novel pricing system, called QIRANA, that performs query-based data pricing for a large class of SQL queries (including aggregation) in real time, and provides prices with formal guarantees.
Abstract: Users are increasingly engaging in buying and selling data over the web. Facilitated by the proliferation of online marketplaces that bring such users together, data brokers need to serve requests where they provide results for user queries over the underlying datasets, and price them fairly according to the information disclosed by the query. In this work, we present a novel pricing system, called QIRANA, that performs query-based data pricing for a large class of SQL queries (including aggregation) in real time. QIRANA provides prices with formal guarantees: for example, it avoids prices that create arbitrage opportunities. Our framework also allows flexible pricing, by allowing the data seller to choose from a variety of pricing functions, as well as specify relation and attribute-level parameters that control the price of queries and assign different value to different portions of the data. We test QIRANA on a variety of real-world datasets and query workloads, and we show that it can efficiently compute the prices for queries over large-scale data.
TL;DR: This paper proposes a new scheme to perform k -NN query over encrypted data in cloud while protecting the privacy of both data owner and query users from cloud, and presents a new scalar product protocol for gaining the properties.
TL;DR: This work proposes a novel model to construct enriched user profiles with the help of an external corpus for personalized query expansion, and builds two novel query expansion techniques based on topical weights-enhanced word embeddings and topical relevance between the query and the terms inside a user profile.
Abstract: Query expansion has been widely adopted in Web search as a way of tackling the ambiguity of queries. Personalized search utilizing folksonomy data has demonstrated an extreme vocabulary mismatch problem that requires even more effective query expansion methods. Co-occurrence statistics, tag-tag relationships, and semantic matching approaches are among those favored by previous research. However, user profiles which only contain a user's past annotation information may not be enough to support the selection of expansion terms, especially for users with limited previous activity with the system. We propose a novel model to construct enriched user profiles with the help of an external corpus for personalized query expansion. Our model integrates the current state-of-the-art text representation learning framework, known as word embeddings, with topic models in two groups of pseudo-aligned documents. Based on user profiles, we build two novel query expansion techniques. These two techniques are based on topical weights-enhanced word embeddings, and the topical relevance between the query and the terms inside a user profile, respectively. The results of an in-depth experimental evaluation, performed on two real-world datasets using different external corpora, show that our approach outperforms traditional techniques, including existing non-personalized and personalized query expansion methods.
TL;DR: This paper investigates privacy-preserving similarity join queries, a pivotal primitive of similarity search that finds pairwise similar data points across two data sets, and formalizes the leakage functions in the context of similarity joins, and conducts rigorous security analysis.
Abstract: Similarity search on high-dimensional data has been intensively studied for data processing and analytics. Despite its broad applicability, data security and privacy concerns along the trend of data outsourcing have not been fully addressed. In this paper, we investigate privacy-preserving similarity join queries, i.e., a pivotal primitive of similarity search that finds pairwise similar data points across two data sets. We start from locality-sensitive hashing and searchable symmetric encryption, i.e., the most practical techniques for similarity search and encrypted search, respectively. However, the immediate combination of two techniques discloses the distribution of the query set, which is exploitable to compromise the confidentiality of queries. To enhance the security, we propose the frequency hiding query scheme, which allows the server to see the flattened query distribution only. To improve the scalability, we further design the result sharing query scheme, which processes a small portion of query points and shares the results with other nearby points. Besides, we set up a strict constraint to carefully select query points to achieve “as-strong-as-possible” guarantees. We formalize the leakage functions in the context of similarity joins, and conduct rigorous security analysis. We implement and evaluate the proposed query schemes on Azure cloud. Experimental results indicate that they have different tradeoffs on security, efficiency, and accuracy, which can flexibly be used for different deployment scenarios.
TL;DR: This work proposes a privacy-enhanced search scheme by allowing the data user to generate random query trapdoor every time, and uses Bloom filter and bilinear pairing operation to construct secure index for each data file, which enables the cloud to perform search without obtaining any useful information.
TL;DR: A query-based summarization method, which uses a combination of semantic relations between words and their syntactic composition, to extract meaningful sentences from document sets is introduced and demonstrates better performance as compared to other existing techniques on DUC 2005 and DUC 2006 datasets.
Abstract: In this paper, a query-based summarization method, which uses a combination of semantic relations between words and their syntactic composition, to extract meaningful sentences from document sets is introduced. The problem with current statistical methods is that they fail to capture the meaning when comparing a sentence and a user query; hence there is often a conflict between the extracted sentences and users' requirements. However, this particular method can improve the quality of document summaries because it is able to avoid extracting a sentence whose similarity with the query is high but whose meaning is different. The method is executed by computing the semantic and syntactic similarity of the sentence-to-sentence and sentence-to-query. To reduce redundancy in summary, this method uses the greedy algorithm to impose diversity penalty on the sentences. In addition, the proposed method expands the words in both the query and the sentences to tackle the problem of information limit. It bridges the lexical gaps for semantically similar contexts that are expressed using different wording. The experimental results display that the proposed method is able to improve performance compared with the participating systems in DUC 2006. The experimental results also showed that the proposed method demonstrates better performance as compared to other existing techniques on DUC 2005 and DUC 2006 datasets.
TL;DR: This paper proposes a hybrid index, the GIM-tree, which indexes locations, keywords, and social information of geo-tagged users and objects, and then presents efficient RkGSK query processing algorithms that exploit several pruning strategies.
Abstract: Identifying prospective customers is an important aspect of marketing research In this paper, we provide support for a new type of query, the Reverse Top-k Geo-Social Keyword (RkGSK) query This query takes into account spatial, textual, and social information, and finds prospective customers for geotagged objects As an example, a restaurant manager might apply the query to find prospective customers To address this, we propose a hybrid index, the GIM-tree, which indexes locations, keywords, and social information of geo-tagged users and objects, and then, using the GIM-tree, we present efficient RkGSK query processing algorithms that exploit several pruning strategies The effectiveness of RkGSK retrieval is characterized via a case study, and extensive experiments using real datasets offer insight into the efficiency of the proposed index and algorithms
TL;DR: A novel Web service discovery approach based on topic models is presented that can maintain the performance of service discovery at an elevated level by greatly decreasing the number of candidate Web services, thus leading to faster response time.
Abstract: Web services have attracted much attention from distributed application designers and developers because of their roles in abstraction and interoperability among heterogeneous software systems, and a growing number of distributed software applications have been published as Web services on the Internet. Faced with the increasing numbers of Web services and service users, researchers in the services computing field have attempted to address a challenging issue, i.e., how to quickly find the suitable ones according to user queries. Many previous studies have been reported towards this direction. In this paper, a novel Web service discovery approach based on topic models is presented. The proposed approach mines common topic groups from the service-topic distribution matrix generated by topic modeling, and the extracted common topic groups can then be leveraged to match user queries to relevant Web services, so as to make a better trade-off between the accuracy of service discovery and the number of candidate Web services. Experiment results conducted on two publicly-available data sets demonstrate that, compared with several widely used approaches, the proposed approach can maintain the performance of service discovery at an elevated level by greatly decreasing the number of candidate Web services, thus leading to faster response time.
TL;DR: This paper proposes a novel trajectory query that not only considers the spatio-temporal closeness of trajectories but also leverages probabilistic topic modelling to capture the semantic relevance of the activities between data and query.
Abstract: Nowadays, users of social networks like tweets and weibo have generated massive geo-tagged records, and these records reveal their activities in the physical world together with spatio-temporal dynamics. Existing trajectory data management studies mainly focus on analyzing the spatio-temporal properties of trajectories, while leaving the understanding of their activities largely untouched. In this paper, we incorporate the semantic analysis of the activity information embedded in trajectories into query modelling and processing, with the aim of providing end users more accurate and meaningful trip recommendations. To this end, we propose a novel trajectory query that not only considers the spatio-temporal closeness but also, more importantly, leverages probabilistic topic modelling to capture the semantic relevance of the activities between data and query. To support efficient query processing, we design a novel hybrid index structure, namely ST-tree, to organize the trajectory points hierarchically, which enables us to prune the search space in spatial and topic dimensions simultaneously. The experimental results on real datasets demonstrate the efficiency and scalability of the proposed index structure and search algorithms.
TL;DR: The experiments show that for personalized ranking, behavioral information helps to improve retrieval effectiveness; and given a query, merging information inferred from behavior of a particular user and from behaviors of other users with a user‐dependent adaptive weight outperforms any combination with a fixed weight.
Abstract: Personalized search approaches tailor search results to users' current interests, so as to help improve the likelihood of a user finding relevant documents for their query. Previous work on personalized search focuses on using the content of the user's query and of the documents clicked to model the user's preference. In this paper we focus on a different type of signal: We investigate the use of behavioral information for the purpose of search personalization. That is, we consider clicks and dwell time for reranking an initially retrieved list of documents. In particular, we i investigate the impact of distributions of users and queries on document reranking; ii estimate the relevance of a document for a query at 2 levels, at the query-level and at the word-level, to alleviate the problem of sparseness; and iii perform an experimental evaluation both for users seen during the training period and for users not seen during training. For the latter, we explore the use of information from similar users who have been seen during the training period. We use the dwell time on clicked documents to estimate a document's relevance to a query, and perform Bayesian probabilistic matrix factorization to generate a relevance distribution of a document over queries. Our experiments show that: i for personalized ranking, behavioral information helps to improve retrieval effectiveness; and ii given a query, merging information inferred from behavior of a particular user and from behaviors of other users with a user-dependent adaptive weight outperforms any combination with a fixed weight.
TL;DR: In this article, a query reformulation system based on a neural network that rewrites a query to maximize the number of relevant documents returned is introduced. But the results are often far from satisfactory.
Abstract: Search engines play an important role in our everyday lives by assisting us in finding the information we need. When we input a complex query, however, results are often far from satisfactory. In this work, we introduce a query reformulation system based on a neural network that rewrites a query to maximize the number of relevant documents returned. We train this neural network with reinforcement learning. The actions correspond to selecting terms to build a reformulated query, and the reward is the document recall. We evaluate our approach on three datasets against strong baselines and show a relative improvement of 5-20% in terms of recall. Furthermore, we present a simple method to estimate a conservative upper-bound performance of a model in a particular environment and verify that there is still large room for improvements.
TL;DR: This paper develops a series of query processing algorithms for answering the GBCK query, which aims to find a region containing a set of POIs that covers all the query keywords and these POIs areclose to the group of users and are close to each other.
TL;DR: It is proved the LCSK query is NP-hard, and the exact algorithm as well as approximate algorithm with provable approximation bound to this problem are devised.
TL;DR: A replication-aware framework named LILAC, sparqL query decomposItion against federations of repLicAted data sourCes, that relies on replicated fragment descriptions to accurately identify sources that provide replicated data is proposed.
TL;DR: This paper presents query processing algorithms for a fragment of SPARQL 1.1 incorporating regular path queries (property path queries), extended with query approximation and relaxation operators, and formally shows the soundness, completeness and termination properties of the query rewriting algorithm.
Abstract: Flexible querying techniques can enhance users' access to complex, heterogeneous datasets in settings such as Linked Data, where the user may not always know how a query should be formulated in order to retrieve the desired answers. This paper presents query processing algorithms for a fragment of SPARQL 1.1 incorporating regular path queries (property path queries), extended with query approximation and relaxation operators. Our flexible query processing approach is based on query rewriting and returns answers incrementally according to their ``distance'' from the exact form of the query. We formally show the soundness, completeness and termination properties of our query rewriting algorithm. We also present empirical results that show promising query processing performance for the extended language.
TL;DR: A novel technique is proposed --ACER-- that takes an initial query, identifies appropriate search terms from the source code using a novel term weight --CodeRank, and then suggests effective reformulation to the initial query by exploiting the source document structures, query quality analysis and machine learning.
Abstract: During software maintenance, developers usually deal with a significant number of software change requests. As a part of this, they often formulate an initial query from the request texts, and then attempt to map the concepts discussed in the request to relevant source code locations in the software system (a.k.a., concept location). Unfortunately, studies suggest that they often perform poorly in choosing the right search terms for a change task. In this paper, we propose a novel technique --ACER-- that takes an initial query, identifies appropriate search terms from the source code using a novel term weight --CodeRank, and then suggests effective reformulation to the initial query by exploiting the source document structures, query quality analysis and machine learning. Experiments with 1,675 baseline queries from eight subject systems report that our technique can improve 71% of the baseline queries which is highly promising. Comparison with five closely related existing techniques in query reformulation not only validates our empirical findings but also demonstrates the superiority of our technique.
TL;DR: The central idea is to employ parallelism in a novel way, whereby parallel matching/decision attempts are initiated, each using a query rewriting and/or an alternate algorithm, which is shown to be highly beneficial across algorithms and datasets.
Abstract: Subgraph queries are central to graph analytics and graph
DBs. We analyze this problem and present key novel discoveries
and observations on the nature of the problem which
hold across query sizes, datasets, and top-performing algorithms.
Firstly, we show that algorithms (for both the decision
and matching versions of the problem) suffer from
straggler queries, which dominate query workload times. As
related research caps query times not reporting results for
queries exceeding the cap, this can lead to erroneous conclusions
of the methods’ relative performance. Secondly, we
study and show the dramatic effect that isomorphic graph
queries can have on query times. Thirdly, we show that
for each query, isomorphic queries based on proposed query
rewritings can introduce large performance benefits. Fourthly,
that straggler queries are largely algorithm-specific: many
challenging queries to one algorithm can be executed effi-
ciently by another. Finally, the above discoveries naturally
lead to the derivation of a novel framework for subgraph
query processing. The central idea is to employ parallelism
in a novel way, whereby parallel matching/decision attempts
are initiated, each using a query rewriting and/or an alternate
algorithm. The framework is shown to be highly beneficial
across algorithms and datasets.
TL;DR: This work proposes a model for the query-based summarization task based on the encode-attend-decode paradigm with two key additions: a query attention model which learns to focus on different portions of the query at different time steps and a new diversity based Attention model which aims to alleviate the problem of repeating phrases in the summary.
Abstract: ive summarization aims to generate a shorter version of the document covering all the salient points in a compact and coherent fashion. On the other hand, query-based summarization highlights those points that are relevant in the context of a given query. The encode-attend-decode paradigm has achieved notable success in machine translation, extractive summarization, dialog systems, etc. But it suffers from the drawback of generation of repeated phrases. In this work we propose a model for the query-based summarization task based on the encode-attend-decode paradigm with two key additions (i) a query attention model (in addition to document attention model) which learns to focus on different portions of the query at different time steps (instead of using a static representation for the query) and (ii) a new diversity based attention model which aims to alleviate the problem of repeating phrases in the summary. In order to enable the testing of this model we introduce a new query-based summarization dataset building on debatepedia. Our experiments show that with these two additions the proposed model clearly outperforms vanilla encode-attend-decode models with a gain of 28% (absolute) in ROUGE-L scores.
TL;DR: This paper used the Language Model to build the query context, which is composed of the most similar queries to the query to expand and their top-ranked documents, and applied a query expansion approach based on thequery context and the Latent Semantic Analyses method.
Abstract: One of the key difficulties for users in information retrieval is to formulate appropriate queries to submit to the search engine. In this paper, we propose an approach to enrich the user's queries by additional context. We used the Language Model to build the query context, which is composed of the most similar queries to the query to expand and their top-ranked documents. Then, we applied a query expansion approach based on the query context and the Latent Semantic Analyses method. Using a web test collection, we tested our approach on short and long queries. We varied the number of recommended queries and the number of expansion terms to specify the appropriate parameters for the proposed approach. Experimental results show that the proposed approach improves the effectiveness of the information retrieval system by 19.23 % for short queries and 52.94 % for long queries according to the retrieval results using the original users' queries.
TL;DR: A novel method to translate anatural language query into a keyword query relevant to the natural language query for retrieving better search results without change of the engines is proposed.
Abstract: The number of natural language queries submitted to search engines is increasing as search environments get diversified. However, legacy search engines are still optimized for short keyword queries. Thus, the use of natural language queries at legacy search engines degrades the retrieval performance of the engines. This paper proposes a novel method to translate a natural language query into a keyword query relevant to the natural language query for retrieving better search results without change of the engines. The proposed method formulates the translation as a generation task. That is, the method generates a keyword query from a natural language query by preserving the semantics of the natural language query. A recurrent neural network encoder-decoder architecture is adopted as a generator of keyword queries from natural language queries. In addition, an attention mechanism is also used to cope with long natural language queries.
TL;DR: This paper introduces a novel strategy for vertical Web page classification, which is called Classification using Multi-layered Domain Ontology (CMDO), which employs several Web mining techniques, and depends mainly on proposed multi-layering domain ontology.
Abstract: World Wide Web is a continuously growing giant, and within the next few years, Web contents will surely increase tremendously Hence, there is a great requirement to have algorithms that could accurately classify Web pages Automatic Web page classification is significantly different from traditional text classification because of the presence of additional information, provided by the HTML structure Recently, several techniques have been arisen from combinations of artificial intelligence and statistical approaches However, it is not a simple matter to find an optimal classification technique for Web pages This paper introduces a novel strategy for vertical Web page classification, which is called Classification using Multi-layered Domain Ontology (CMDO) It employs several Web mining techniques, and depends mainly on proposed multi-layered domain ontology In order to promote the classification accuracy, CMDO implies a distiller to reject pages related to other domains CMDO also employs a novel classification technique, which is called Graph Based Classification (GBC) The proposed GBC has pioneering features that other techniques do not have, such as outlier rejection and pruning Experimental results have shown that CMDO outperforms recent techniques as it introduces better precision, recall, and classification accuracy
TL;DR: An original approach based on Bat Algorithm is proposed to improve the retrieval effectiveness of query expansion in medical field, using Bat Al algorithm to find the best expanded query among a set of expanded query candidates, while maintaining low computational complexity.
Abstract: With the increasing amount of medical data available on the Web, looking for health information has become one of the most widely searched topics on the Internet. Patients and people of several backgrounds are now using Web search engines to acquire medical information, including information about a specific disease, medical treatment or professional advice. Nonetheless, due to a lack of medical knowledge, many laypeople have difficulties in forming appropriate queries to articulate their inquiries, which deem their search queries to be imprecise due the use of unclear keywords. The use of these ambiguous and vague queries to describe the patients' needs has resulted in a failure of Web search engines to retrieve accurate and relevant information. One of the most natural and promising method to overcome this drawback is Query Expansion. In this paper, an original approach based on Bat Algorithm is proposed to improve the retrieval effectiveness of query expansion in medical field. In contrast to the existing literature, the proposed approach uses Bat Algorithm to find the best expanded query among a set of expanded query candidates, while maintaining low computational complexity. Moreover, this new approach allows the determination of the length of the expanded query empirically. Numerical results on MEDLINE, the on-line medical information database, show that the proposed approach is more effective and efficient compared to the baseline.