Proceedings Article, DOI: 10.1145/3540250.3558935
CodeMatcher: a tool for large-scale code search based on query semantics matching
Chao Liu,Xuanlin Bao,Xin Xia,Meng Yan,David Lo,Ting Zhang +5 more
- 07 Nov 2022
2
TL;DR: CodeMatcher is an IR-based code search tool that inherits the query semantics matching advantages of DL-based tools and achieves an industrial-level response time (0.3s) on a common server with an Intel i7 CPU.
Abstract: With the emergence of large-scale codebases such as GitHub and Gitee, searching for and reusing existing code can substantially improve developers' productivity. Over the years, many code search tools have been developed. Early tools leveraged information retrieval (IR) techniques to search frequently changing large-scale codebases efficiently. However, their search accuracy was low because of the semantic mismatch between query and code. In recent years, many tools have leveraged deep learning (DL) techniques to address this issue, but DL-based tools are slow and their search accuracy is unstable. In this paper, we present CodeMatcher, an IR-based tool that inherits the advantages of DL-based tools in query semantics matching. CodeMatcher first builds an index for a large-scale codebase to shorten the search response time. For a given search query, it filters out irrelevant and noisy words in the query, then retrieves candidate code from the indexed codebase via iterative fuzzy search, and finally reranks the candidates using two designed measures of semantic matching between the query and the candidates. We implemented CodeMatcher as a search engine website. To verify the effectiveness of our tool, we evaluated CodeMatcher on 41k+ open-source Java repositories. Experimental results show that CodeMatcher achieves an industrial-level response time (0.3s) on a common server with an Intel i7 CPU. In terms of search accuracy, CodeMatcher significantly outperforms three state-of-the-art tools (DeepCS, UNIF, and CodeHow) and two online search engines (GitHub search and Google search).
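The pipeline described in the abstract (index, clean the query, iterative fuzzy search, rerank) can be illustrated with a toy in-memory sketch. This is not the authors' implementation: the stop-word list, the rule of dropping the last query word on each fuzzy-search round, and the scoring tuple are all assumptions made for illustration, and every function name below is hypothetical.

```python
import re
from collections import defaultdict

# Assumed stop-word list; the paper's actual noise-word rules are not given here.
STOP_WORDS = {"how", "to", "a", "an", "the", "in", "of", "do", "i", "can", "with", "for"}


def tokenize(text):
    """Split camelCase identifiers and free text into lowercase word tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z0-9]+", spaced.lower())


def build_index(method_names):
    """Inverted index: token -> set of method names containing that token."""
    index = defaultdict(set)
    for name in method_names:
        for tok in tokenize(name):
            index[tok].add(name)
    return index


def clean_query(query):
    """Drop words assumed to be irrelevant or noisy."""
    return [t for t in tokenize(query) if t not in STOP_WORDS]


def fuzzy_search(index, words):
    """Require every word first; relax by dropping the last word until a hit appears."""
    words = list(words)
    while words:
        hits = set.intersection(*(index.get(w, set()) for w in words))
        if hits:
            return hits
        words.pop()  # assumed relaxation rule, chosen for simplicity
    return set()


def score(words, name):
    """Assumed matching measures: word overlap, then word order, then brevity."""
    toks = tokenize(name)
    overlap = sum(1 for w in words if w in toks) / len(words)
    positions = [toks.index(w) for w in words if w in toks]
    in_order = positions == sorted(positions)
    return (overlap, in_order, -len(toks))


def search(index, query, top_k=3):
    """Clean the query, retrieve candidates, and rerank them."""
    words = clean_query(query)
    if not words:
        return []
    candidates = sorted(fuzzy_search(index, words))  # alphabetical for stable ties
    return sorted(candidates, key=lambda n: score(words, n), reverse=True)[:top_k]
```

For example, with an index built from `["readTextFile", "writeTextFile", "readFileLines"]`, the query "how to read a text file" is reduced to `read`, `text`, `file`, and only `readTextFile` contains all three tokens. A production system would index a real codebase with a search engine rather than a Python dictionary.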
Citations
SECON: Maintaining Semantic Consistency in Data Augmentation for Code Search
Xu Zhang,Z.-P. Lin,Xiaoyu Hu,J.Q. Wang,Wenpeng Lü,Deyu Zhou +5 more
TL;DR: This study introduces SECON, a novel data augmentation method for code search that maintains semantic consistency by interacting with code and query representations, outperforming traditional approaches and enhancing model performance on diverse programming languages.
VisRepo: A Visual Retrieval Tool for Large-Scale Open-Source Projects
Xiaoqi Yue,Chao Liu,Neng Zhang,Haibo Hu,Xiaohong Zhang +4 more
- 24 Jul 2024
References
WordNet : an electronic lexical database
TL;DR: Presents the WordNet lexical database, including nouns in WordNet, a semantic network of English verbs, and applications of WordNet such as building semantic concordances.
14.4K
google,我,萨娜
方华
- 01 Jan 2006
TL;DR: After you change your VT Google password, you will be unable to log on to VT Google Apps services including Mail, Drive, Groups, etc.
3.8K
A Fast and Accurate Dependency Parser using Neural Networks
Danqi Chen,Christopher D. Manning +1 more
- 01 Jan 2014
TL;DR: This work proposes a novel way of learning a neural network classifier for use in a greedy, transition-based dependency parser that runs very fast while achieving an improvement of about 2% in unlabeled and labeled attachment scores on both English and Chinese datasets.
Posted Content
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search.
TL;DR: Describes the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task.
799
Deep code search
Xiaodong Gu,Hongyu Zhang,Sunghun Kim +2 more
- 27 May 2018
TL;DR: A novel deep neural network named CODEnn (Code-Description Embedding Neural Network) is proposed, which jointly embeds code snippets and natural language descriptions into a high-dimensional vector space such that a code snippet and its corresponding description have similar vectors.
712