Revisiting Binary Code Similarity Analysis Using Interpretable Feature Engineering and Lessons Learned
TL;DR: This paper conducts the first systematic study of the basic features used in binary code similarity analysis (BCSA), leveraging interpretable feature engineering on a large-scale benchmark.
Abstract: Binary code similarity analysis (BCSA) is widely used for diverse security applications such as plagiarism detection, software license violation detection, and vulnerability discovery. Despite the surging research interest in BCSA, it is significantly challenging to perform new research in this field for several reasons. First, most existing approaches focus only on the end results, namely, increasing the success rate of BCSA by adopting uninterpretable machine learning. Moreover, they utilize their own benchmark sharing neither the source code nor the entire dataset. Finally, researchers often use different terminologies or even use the same technique without citing the previous literature properly, which makes it difficult to reproduce or extend previous work. To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for BCSA. Why does a certain technique or a feature show better results than the others? Specifically, we conduct the first systematic study on the basic features used in BCSA by leveraging interpretable feature engineering on a large-scale benchmark. Our study reveals various useful insights on BCSA. For example, we show that a simple interpretable model with a few basic features can achieve a comparable result to that of recent deep learning-based approaches. Furthermore, we show that the way we compile binaries or the correctness of underlying binary analysis tools can significantly affect the performance of BCSA. Lastly, we make all our source code and benchmark public and suggest future directions in this field to help further research.
Citations
Identification of binary file compilation information
16 Dec 2022
TL;DR: The compilation information of a binary file includes its target architecture, compiler, optimization option, and obfuscation strategy, and identifying it helps improve the accuracy of binary code similarity detection. The experimental results show that the proposed identification model reaches 100% accuracy for both architecture and compiler identification.
Binary Function Clone Search in the Presence of Code Obfuscation and Optimization over Multi-CPU Architectures
Abdullah A. Qasem, Mourad Debbabi, Bernard Lebel, Marthe Kassouf +3 more
- 10 Jul 2023
TL;DR: Binary function clone search is an essential capability that enables multiple applications and use cases, including reverse engineering, patch security inspection, threat analysis, and vulnerable function detection. BinFinder employs a neural network to learn binary function embeddings from a set of extracted features that are resilient to both code obfuscation and compiler optimization techniques.
References
Long short-term memory
TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Convolutional Neural Networks for Sentence Classification
Yoon Kim
- 25 Aug 2014
TL;DR: The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification. A simple modification to the architecture is also proposed to allow for the use of both task-specific and static vectors.
The NumPy Array: A Structure for Efficient Numerical Computation
TL;DR: This article shows how to improve the performance of computations on NumPy arrays by vectorizing calculations, avoiding copying data in memory, and minimizing operation counts.
Extremely randomized trees
TL;DR: A new tree-based ensemble method for supervised classification and regression problems that strongly randomizes both attribute and cut-point choice while splitting a tree node and, in the extreme case, builds totally randomized trees whose structures are independent of the output values of the learning sample.
Proceedings Article
Signature Verification using a "Siamese" Time Delay Neural Network
Jane Bromley, Isabelle Guyon, Yann LeCun, E. Sackinger, Roopak Shah +4 more
- 29 Nov 1993
TL;DR: An algorithm for verification of signatures written on a pen-input tablet based on a novel, artificial neural network called a "Siamese" neural network, which consists of two identical sub-networks joined at their outputs.