Interpretable linear models for predicting security vulnerabilities in source code

Toby D. Hocking, Joseph R. Barr, Tyler Thatcher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In our increasingly digital and networked society, computer code is responsible for many essential tasks. There is an increasing number of attacks on such code that exploit unpatched security vulnerabilities. It is therefore important to create tools that can automatically identify or predict security vulnerabilities in code, in order to prevent such attacks. In this paper we focus on methods for predicting security vulnerabilities based on analysis of the source code as a text file. In recent years, many attempts to solve this problem have involved natural language processing (NLP) methods, which use neural-network-based techniques in which tokens in the source code are mapped to vectors in a Euclidean space whose dimension is much lower than that of the original token encoding. Such embedding methods have been shown to be effective at solving problems like sentence completion, indexing large corpora of text, and classifying and organizing documents. However, it is often necessary to interpret which features are important for the decision rule of the learned model, and a weakness of neural-network-based methods is their lack of such interpretability. In this paper we show how L1-regularized linear models can be used with engineered features that supplement neural-network embedding features. Our approach yields models that are more interpretable and more accurate than models that use only neural-network feature embeddings. Our empirical results in cross-validation experiments show that the linear models with interpretable features are significantly more accurate than models with neural-network embedding features alone. We additionally show that nearly all of the features are used in the learned models, and that trained models generalize to some extent to other data sets.
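To make the general approach concrete, the following is a minimal, hypothetical sketch (not the authors' implementation or data) of an L1-regularized logistic regression in Python/scikit-learn that combines a few illustrative engineered source-code features with neural-network embedding features. The feature names, synthetic data, and labels are placeholders; the point of the sketch is that the L1 penalty drives uninformative coefficients to exactly zero, so the surviving nonzero weights provide the kind of interpretability the abstract describes.

    # Hypothetical sketch: L1-regularized logistic regression combining
    # engineered source-code features with neural-network embedding features.
    # All feature names, data, and labels below are synthetic placeholders.
    import numpy as np
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    n_files = 500

    # Engineered features (e.g., counts of risky calls) -- names are illustrative.
    engineered_names = ["n_strcpy_calls", "n_malloc_calls", "cyclomatic_complexity"]
    X_engineered = rng.poisson(lam=3, size=(n_files, len(engineered_names)))

    # Neural-network embedding features (e.g., a 16-dimensional code embedding).
    embed_dim = 16
    X_embedding = rng.normal(size=(n_files, embed_dim))
    embedding_names = [f"embed_{i}" for i in range(embed_dim)]

    X = np.hstack([X_engineered, X_embedding]).astype(float)
    feature_names = engineered_names + embedding_names

    # Synthetic labels with some signal: 1 = vulnerable, 0 = not vulnerable.
    logit = 0.8 * X_engineered[:, 0] - 2.0 + 0.5 * X_embedding[:, 0]
    y = (rng.random(n_files) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

    # L1 penalty zeroes out uninformative weights, which is what makes the
    # learned linear model interpretable; C is chosen by cross-validation.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegressionCV(penalty="l1", solver="liblinear",
                             Cs=10, cv=5, max_iter=5000),
    )
    model.fit(X, y)

    coefs = model.named_steps["logisticregressioncv"].coef_.ravel()
    selected = [(name, w) for name, w in zip(feature_names, coefs) if w != 0.0]
    print("features with nonzero weight:")
    for name, w in sorted(selected, key=lambda t: -abs(t[1])):
        print(f"  {name}: {w:+.3f}")

Printing only the nonzero coefficients mirrors how an analyst would read such a model: each retained feature contributes a signed weight to the predicted log-odds of a file being vulnerable, which is the interpretability advantage over an opaque neural-network classifier.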

Original language: English (US)
Title of host publication: Proceedings - 2022 4th International Conference on Transdisciplinary AI, TransAI 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 149-155
Number of pages: 7
ISBN (Electronic): 9781665471848
DOIs
State: Published - 2022
Event: 4th International Conference on Transdisciplinary AI, TransAI 2022 - Laguna Hills, United States
Duration: Sep 20, 2022 - Sep 22, 2022

Publication series

Name: Proceedings - 2022 4th International Conference on Transdisciplinary AI, TransAI 2022

Conference

Conference: 4th International Conference on Transdisciplinary AI, TransAI 2022
Country/Territory: United States
City: Laguna Hills
Period: 9/20/22 - 9/22/22

Keywords

  • L1 regularization
  • interpretable linear models
  • static code analysis
  • vulnerability detection

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Modeling and Simulation
