Revisiting Binary Code Similarity Analysis Using Interpretable Feature Engineering and Lessons Learned

doi:10.1109/tse.2022.3187689

Open Access•Journal Article•10.1109/tse.2022.3187689

01 Apr 2023

- IEEE Transactions on Software Engineerin...

- Vol. 49, Iss: 4, pp 1661-1682

TL;DR: This paper presents the first systematic study of the basic features used in binary code similarity analysis (BCSA), leveraging interpretable feature engineering on a large-scale benchmark.

Abstract: Binary code similarity analysis (BCSA) is widely used for diverse security applications such as plagiarism detection, software license violation detection, and vulnerability discovery. Despite the surging research interest in BCSA, it is significantly challenging to perform new research in this field for several reasons. First, most existing approaches focus only on the end results, namely, increasing the success rate of BCSA by adopting uninterpretable machine learning. Moreover, they utilize their own benchmark sharing neither the source code nor the entire dataset. Finally, researchers often use different terminologies or even use the same technique without citing the previous literature properly, which makes it difficult to reproduce or extend previous work. To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for BCSA. Why does a certain technique or a feature show better results than the others? Specifically, we conduct the first systematic study on the basic features used in BCSA by leveraging interpretable feature engineering on a large-scale benchmark. Our study reveals various useful insights on BCSA. For example, we show that a simple interpretable model with a few basic features can achieve a comparable result to that of recent deep learning-based approaches. Furthermore, we show that the way we compile binaries or the correctness of underlying binary analysis tools can significantly affect the performance of BCSA. Lastly, we make all our source code and benchmark public and suggest future directions in this field to help further research.

Citations

Proceedings Article•10.1109/imcec55388.2022.10019958

Identification of binary file compilation information

16 Dec 2022

TL;DR: This paper identifies the compilation information of binary files, which includes the compilation architecture, compiler, optimization option, and obfuscation strategy, and which helps improve the accuracy of binary code similarity detection. The experimental results show that the proposed identification model reaches 100% accuracy for both compilation architecture and compiler identification.

Proceedings Article•10.1145/3579856.3582818

Binary Function Clone Search in the Presence of Code Obfuscation and Optimization over Multi-CPU Architectures

Abdullah A. Qasem, +3 more

- 10 Jul 2023

TL;DR: Binary function clone search is an essential capability that enables multiple applications and use cases, including reverse engineering, patch security inspection, threat analysis, and vulnerable function detection. BinFinder employs a neural network to learn binary function embeddings based on a set of extracted features that are resilient to both code obfuscation and compiler optimization techniques.
