What is the advantage of using CP to evaluate an expert’s confidence?

One of the advantages of using CP to evaluate an expert’s confidence is that the authors can use samples with low credibility to improve an expert model over time.

How do the authors learn to port a network model to another language?

Since the authors use the same network structure, transfer learning is achieved by copying the weights of a model built for one language to initialize the network for another.
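
The weight-copy step can be sketched as follows. This is a minimal illustration, assuming each model's parameters are held in a dictionary of named weight lists with identical layout; the `init_from_source` helper, the layer names, and all values are hypothetical and not taken from the paper.

```python
import copy

def init_from_source(source_weights, target_weights):
    """Return weights for the target model, initialized from the source model.

    Both arguments are hypothetical {layer_name: nested list of floats} dicts.
    """
    if source_weights.keys() != target_weights.keys():
        raise ValueError("models must share the same network structure")
    # Deep-copy so later fine-tuning of the target does not alter the source.
    return {name: copy.deepcopy(values) for name, values in source_weights.items()}

# Illustrative values only.
java_model = {"embedding": [[0.1, 0.2], [0.3, 0.4]], "output": [0.5, -0.5]}
php_model = {"embedding": [[0.0, 0.0], [0.0, 0.0]], "output": [0.0, 0.0]}

php_model = init_from_source(java_model, php_model)
php_model["output"][0] = 0.9  # stands in for a fine-tuning update on the new language
```

Copying rather than sharing the weights keeps the source-language model unchanged while the target-language model is trained further.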

How do the authors build and use an expert model?

The authors describe how to build and use an expert model following the three-step process of supervised learning: (1) training data generation; (2) model training; (3) using the model.

What is the process of updating node states for a fixed number of iterations?

After repeating this process of updating node states for a fixed number of iterations a readout function is used to aggregate the node states to a single embedding vector.
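
The iterate-then-readout pattern can be sketched in plain Python. This is an illustration under stated assumptions: a simple averaging rule stands in for the learned update function, the readout is an element-wise sum, and the `propagate` and `readout` helpers and the toy graph are hypothetical.

```python
def propagate(adjacency, states, iterations):
    """Update every node state from its neighbors for a fixed number of iterations."""
    for _ in range(iterations):
        new_states = []
        for v, state in enumerate(states):
            # Aggregate: element-wise sum of the neighbors' current states.
            agg = [0.0] * len(state)
            for u in adjacency[v]:
                agg = [a + s for a, s in zip(agg, states[u])]
            # Update: average of own state and aggregated message
            # (an assumed stand-in for a learned update function).
            new_states.append([(s + a) / 2.0 for s, a in zip(state, agg)])
        states = new_states
    return states

def readout(states):
    """Aggregate all node states into one graph-level embedding vector."""
    return [sum(dim) for dim in zip(*states)]

# Toy path graph 0 - 1 - 2 with two-dimensional node states.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
states = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
graph_embedding = readout(propagate(adjacency, states, iterations=2))
```

All nodes are updated from the previous iteration's states, so the result does not depend on the order in which nodes are visited.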

How do the authors calculate the confidence score for a class label?

The authors compute, offline, the calibration scores, $a_1^{y_p}, a_2^{y_p}, \ldots, a_n^{y_p}$, by applying function $A$ to each of the $n$ instances in the calibration set using the probability ($y_p$) given by model $h$ for each class label, $y$.
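
The offline calibration step can be sketched as follows for a single class label. This is a minimal illustration, not the authors' implementation: it assumes the function A is `1 - probability`, and it uses the standard inductive conformal prediction p-value; the helper names and probabilities are hypothetical.

```python
def nonconformity(prob):
    """Assumed function A: the less probable the label, the higher the score."""
    return 1.0 - prob

def calibration_scores(calib_probs):
    """Offline step: apply A to the model probability of each calibration instance."""
    return [nonconformity(p) for p in calib_probs]

def p_value(scores, test_prob):
    """Share of calibration scores at least as nonconforming as the test sample."""
    a_test = nonconformity(test_prob)
    at_least = sum(1 for a in scores if a >= a_test)
    return (at_least + 1) / (len(scores) + 1)

# Illustrative model probabilities for one class label.
calib_probs = [0.9, 0.8, 0.6, 0.3]
scores = calibration_scores(calib_probs)

confident = p_value(scores, 0.95)  # test sample the model is sure about
doubtful = p_value(scores, 0.2)    # test sample the model is unsure about
```

A low p-value means the test sample looks unlike the calibration data for that label; repeating the calculation per class label gives one p-value per candidate label.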

What is the edge used to capture where a variable is used?

This edge captures where a variable or buffer is used and is useful for detecting vulnerabilities like “NULL pointer dereference”.

What is the geometric mean of the evaluation metrics across the cross-validation folds?

The authors report the geometric mean of the aforementioned evaluation metrics across the cross-validation folds because it is widely seen as a more reliable performance metric over the arithmetic mean [51].
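
A short sketch of the difference between the two means, using Python's standard `statistics` module. The per-fold accuracy values are made up for illustration.

```python
from statistics import geometric_mean, mean

# Hypothetical accuracy on three cross-validation folds, one of them weak.
fold_accuracy = [0.5, 0.8, 0.8]

geo = geometric_mean(fold_accuracy)  # cube root of 0.5 * 0.8 * 0.8
arith = mean(fold_accuracy)
```

For positive values the geometric mean never exceeds the arithmetic mean, so a single weak fold pulls it down more strongly.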

How do the authors extract code samples from GitHub?

To simplify the process for extracting vulnerability code samples, their current implementation only considers code revisions that modify one source file at a time.
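
The single-file restriction can be sketched as a filter over commit metadata. This is a minimal illustration, assuming each commit is a dictionary with a `files` list of changed paths. The commit records, the `touches_one_source_file` helper, and the extension list (based on the languages named in the abstract) are hypothetical, and non-source files are assumed to be ignored when counting.

```python
# Assumed source-file extensions for the languages named in the abstract.
SOURCE_EXTENSIONS = (".c", ".java", ".swift", ".php")

def touches_one_source_file(changed_files):
    """Return True when exactly one changed path is a source file."""
    sources = [path for path in changed_files
               if path.lower().endswith(SOURCE_EXTENSIONS)]
    return len(sources) == 1

# Hypothetical commit records; real data would come from a repository's history.
commits = [
    {"id": "a1", "files": ["src/parser.c"]},
    {"id": "b2", "files": ["src/parser.c", "src/lexer.c"]},
    {"id": "c3", "files": ["README.md", "app/Login.java"]},
]

kept = [c["id"] for c in commits if touches_one_source_file(c["files"])]
```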

What is the definition of a vulnerability-relevant code commit?

In this work, a vulnerability-relevant code commit refers to a code revision log that provides a fix for a vulnerability, while vulnerable code refers to a piece of code that contains a type of vulnerability defined in the CWE.

What is the way to model a vulnerability?

By modeling the subtle program structures that lead to a vulnerability, their approach can reuse the knowledge learnt from the Java training samples to model the same vulnerability for Php programs or vice versa.

Open Access•Journal Article•10.1109/TIFS.2020.3044773

Combining Graph-Based Learning With Automated Data Collection for Code Vulnerability Detection

Huanting Wang, +8 more

- 01 Jan 2021

- IEEE Transactions on Information Forensi...

- Vol. 16, pp 1943-1958

250

TL;DR: Funded leverages the advances in graph neural networks to develop a novel graph-based learning method to capture and reason about the program’s control, data, and call dependencies to identify software vulnerabilities at the function level from program source code.

Abstract: This paper presents FUNDED (Flow-sensitive vUl-Nerability coDE Detection), a novel learning framework for building vulnerability detection models. Funded leverages the advances in graph neural networks (GNNs) to develop a novel graph-based learning method to capture and reason about the program’s control, data, and call dependencies. Unlike prior work that treats the program as a sequential sequence or an untyped graph, Funded learns and operates on a graph representation of the program source code, in which individual statements are connected to other statements through relational edges. By capturing the program syntax, semantics and flows, Funded finds better code representation for the downstream software vulnerability detection task. To provide sufficient training data to build an effective deep learning model, we combine probabilistic learning and statistical assessments to automatically gather high-quality training samples from open-source projects. This provides many real-life vulnerable code training samples to complement the limited vulnerable code samples available in standard vulnerability databases. We apply Funded to identify software vulnerabilities at the function level from program source code. We evaluate Funded on large real-world datasets with programs written in C, Java, Swift and Php, and compare it against six state-of-the-art code vulnerability detection models. Experimental results show that Funded significantly outperforms alternative approaches across evaluation settings.

Most frequently asked questions

1. What contributions have the authors mentioned in the paper "Combining graph-based learning with automated data collection for code vulnerability detection"?

This paper presents FUNDED, a novel learning framework for building vulnerability detection models. To provide sufficient training data to build an effective deep learning model, the authors combine probabilistic learning and statistical assessments to automatically gather high-quality training samples from open-source projects. This provides many real-life vulnerable code training samples to complement the limited vulnerable code samples available in standard vulnerability databases.

2. What are the future works mentioned in the paper "Combining graph-based learning with automated data collection for code vulnerability detection"?

Naturally, there is room for future work and further improvement. The authors plan to explore a language model that is specifically built for modeling program source code, like code2vec [66]. Providing a theoretical proof of the underlying working mechanism of FUNDED is also left as future work.

3. How does the embedding layer compute the representation vector of a graph node?

The 100-dimensional embedding vector, $h_v$, of a graph node, $v$, is computed by the embedding layer through recursively aggregating and transforming the representation vectors of its neighboring nodes.

4. What is the effect of a higher classification threshold?

Lowering the classification threshold (i.e., a higher FPR) increases the likelihood for labeling more samples as vulnerability-relevant, thus increasing both true and false positives.
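
The trade-off can be seen by counting positives at two thresholds. This is a minimal sketch with made-up scores and labels; the `count_positives` helper is hypothetical.

```python
def count_positives(scores, labels, threshold):
    """Count true and false positives when scores >= threshold are labeled positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp, fp

# Hypothetical model scores; label 1 marks a vulnerability-relevant sample.
scores = [0.95, 0.80, 0.65, 0.40, 0.30]
labels = [1, 0, 1, 1, 0]

strict = count_positives(scores, labels, threshold=0.9)
loose = count_positives(scores, labels, threshold=0.35)
```

With these values the lower threshold recovers more of the relevant samples but also admits a false positive that the stricter threshold rejects.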

Fig. 11. Evaluation on GitHub samples. FUNDED gives the best Accuracy, Recall and F1 score.

Fig. 12. Apply transfer learning to port a detection model for a new programming language.

Fig. 10. Evaluation on standard vulnerability databases. Min-max bars show performance across vulnerability types.

Fig. 9. The number of vulnerabilities identified by each approach for each open-source project. A solid symbol represents a successfully detected vulnerability, where a circle means the vulnerability is reported in NVD or CVE, while a square means the vulnerability is not reported in the standard databases. FUNDED successfully detects more vulnerabilities than others.

Table V. Evaluation dataset of five open-source projects.

Fig. 19. Training overhead (a) and accuracy (b). The min-max bars show the variances across evaluation settings.

Citations

Introduction to Google TensorFlow

김종영

- 01 Dec 2015

TL;DR: TensorFlow 2.0 in ActionTensor Flow 1.x Deep Learning Cookbook machine Learning with TensorFlow, Second EditionTensor flow 2 Pocket PrimerProgramming with Tensing, Tensor Flow Machine Learning Projects, and Hands-On Neural Networks.

306

Proceedings Article•10.1145/3540250.3549098

VulRepair: a T5-based automated software vulnerability repair

Michael Fu, +4 more

- 07 Nov 2022

TL;DR: VulRepair as discussed by the authors is a T5-based automated software vulnerability repair approach that leverages the pre-training and BPE components to address various technical limitations of prior work.

131

•Proceedings Article•10.1145/3510003.3510219

MVD: Memory-Related Vulnerability Detection Based on Flow-Sensitive Graph Neural Networks

Sicong Cao, +5 more

- 05 Mar 2022

TL;DR: MVD, a statement-level Memory-related Vulnerability Detection approach based on flow-sensitive graph neural networks (FS-GNN), achieves better detection accuracy, outperforming both state-of-the-art DL-based and static analysis-based approaches.

Journal Article•10.1016/J.COSE.2021.102417

VDSimilar: Vulnerability detection based on code similarity of vulnerabilities and patches

Hao Sun, +7 more

- 01 Nov 2021

- Computers & Security

TL;DR: This work argues that similarity in the view of vulnerability is the key in detecting vulnerabilities, and prepares a relatively smaller data set composed of both vulnerabilities and associated patches, and attempts to realize security similarity from the similarity between pair of vulnerabilities and the difference between a pair of vulnerability and patch.

•Proceedings Article•10.1145/3453483.3454054

Automated Conformance Testing for JavaScript Engines via Deep Compiler Fuzzing

Guixin Ye, +8 more

- 15 Apr 2021

- arXiv: Software Engineering

TL;DR: Comfort as discussed by the authors is a compiler fuzzing framework for detecting JS engine bugs and behaviors that deviate from the ECMAScript standard, which leverages the recent advance in deep learning-based language models.

References

•Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

138.5K

Journal Article•10.1162/NECO.1997.9.8.1735

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997

- Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

99K

•Posted Content

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 22 Dec 2014

- arXiv: Learning

TL;DR: In this article, the adaptive estimates of lower-order moments are used for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimate of lowerorder moments.

82.5K

•Journal Article

Visualizing Data using t-SNE

Laurens van der Maaten, +1 more

- 01 Jan 2008

- Journal of Machine Learning Research

TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.

45.8K

•Proceedings Article•10.3115/V1/D14-1179

Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation

Kyunghyun Cho, +8 more

- 01 Jan 2014

TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.

28.6K

Related Papers (5)

Cross-Project Transfer Representation Learning for Vulnerable Function Discovery

Long short-term memory

Program Classification Using Gated Graph Attention Neural Network for Online Programming Service.

A Practical Approach to the Automatic Classification of Security-Relevant Commits

Multi-task learning based pre-trained language model for code completion