TY - CPAPER
T1 - Interpretable linear models for predicting security vulnerabilities in source code
AU - Hocking, Toby D.
AU - Barr, Joseph R.
AU - Thatcher, Tyler
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - In our increasingly digital and networked society, computer code is responsible for many essential tasks. Attacks that exploit unpatched security vulnerabilities in such code are increasingly common. It is therefore important to create tools that can automatically identify or predict security vulnerabilities in code, in order to prevent such attacks. In this paper we focus on methods for predicting security vulnerabilities based on analysis of the source code as a text file. In recent years, many attempts to solve this problem have involved natural language processing (NLP) methods that use neural network-based techniques, in which tokens in the source code are mapped to vectors in a Euclidean space whose dimension is much lower than the dimensionality of the token encoding. Such embedding methods have been shown to be effective at solving problems such as sentence completion, indexing large corpora of text, and classifying and organizing documents. However, it is often necessary to interpret which features are important for the decision rule of the learned model, and a weakness of neural network-based methods is their lack of such interpretability. In this paper we show how L1-regularized linear models can be used with engineered features to supplement neural network embedding features. Our approach yields models that are more interpretable and more accurate than models that use only neural network-based feature embeddings. Our empirical results from cross-validation experiments show that linear models with interpretable features are significantly more accurate than models with neural network embedding features alone. We additionally show that nearly all of the features were used in the learned models, and that trained models generalize to some extent to other data sets.
AB - In our increasingly digital and networked society, computer code is responsible for many essential tasks. Attacks that exploit unpatched security vulnerabilities in such code are increasingly common. It is therefore important to create tools that can automatically identify or predict security vulnerabilities in code, in order to prevent such attacks. In this paper we focus on methods for predicting security vulnerabilities based on analysis of the source code as a text file. In recent years, many attempts to solve this problem have involved natural language processing (NLP) methods that use neural network-based techniques, in which tokens in the source code are mapped to vectors in a Euclidean space whose dimension is much lower than the dimensionality of the token encoding. Such embedding methods have been shown to be effective at solving problems such as sentence completion, indexing large corpora of text, and classifying and organizing documents. However, it is often necessary to interpret which features are important for the decision rule of the learned model, and a weakness of neural network-based methods is their lack of such interpretability. In this paper we show how L1-regularized linear models can be used with engineered features to supplement neural network embedding features. Our approach yields models that are more interpretable and more accurate than models that use only neural network-based feature embeddings. Our empirical results from cross-validation experiments show that linear models with interpretable features are significantly more accurate than models with neural network embedding features alone. We additionally show that nearly all of the features were used in the learned models, and that trained models generalize to some extent to other data sets.
KW - L1 regularization
KW - interpretable linear models
KW - static code analysis
KW - vulnerability detection
UR - http://www.scopus.com/inward/record.url?scp=85143424546&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85143424546&partnerID=8YFLogxK
U2 - 10.1109/TransAI54797.2022.00032
DO - 10.1109/TransAI54797.2022.00032
M3 - Conference contribution
AN - SCOPUS:85143424546
T3 - Proceedings - 2022 4th International Conference on Transdisciplinary AI, TransAI 2022
SP - 149
EP - 155
BT - Proceedings - 2022 4th International Conference on Transdisciplinary AI, TransAI 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 4th International Conference on Transdisciplinary AI, TransAI 2022
Y2 - 20 September 2022 through 22 September 2022
ER -