TY - GEN
T1 - Classifying Imbalanced Data with AUM Loss
AU - Barr, Joseph R.
AU - Hocking, Toby D.
AU - Morton, Garinn
AU - Thatcher, Tyler
AU - Shaw, Peter
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - We present a significant improvement to a methodology described in several earlier articles by Barr et al., in which we demonstrated a workflow that classifies the source code of large open-source projects for vulnerability. Whereas in the past we dealt with the dearth of minority examples by applying upsampling and simulation techniques, the present approach demonstrates that a judicious choice of cost function, sans upsampling, yields excellent performance surpassing previous results. In this iteration a feed-forward neural network classifier was trained with the Area Under Min(FP, FN) (AUM) loss; the AUM method is described in Hillman & Hocking. As in earlier work, to overcome the out-of-vocabulary challenge, an intermediate Byte-Pair Encoding step 'compresses' the data, and a long short-term memory (LSTM) network is then used to embed the tokens of the compressed data, from which we assemble an embedding of function labels. This yields a 128-dimensional embedding which, together with additional 'interpretable', heuristics-based features, is used to classify CVEs. The resulting labeled dataset is extremely imbalanced, with the minority class constituting roughly 0.5% of the total. The AUM cost function is undeterred by this scarcity of minority examples, as the performance of the classifier amply demonstrates.
AB - We present a significant improvement to a methodology described in several earlier articles by Barr et al., in which we demonstrated a workflow that classifies the source code of large open-source projects for vulnerability. Whereas in the past we dealt with the dearth of minority examples by applying upsampling and simulation techniques, the present approach demonstrates that a judicious choice of cost function, sans upsampling, yields excellent performance surpassing previous results. In this iteration a feed-forward neural network classifier was trained with the Area Under Min(FP, FN) (AUM) loss; the AUM method is described in Hillman & Hocking. As in earlier work, to overcome the out-of-vocabulary challenge, an intermediate Byte-Pair Encoding step 'compresses' the data, and a long short-term memory (LSTM) network is then used to embed the tokens of the compressed data, from which we assemble an embedding of function labels. This yields a 128-dimensional embedding which, together with additional 'interpretable', heuristics-based features, is used to classify CVEs. The resulting labeled dataset is extremely imbalanced, with the minority class constituting roughly 0.5% of the total. The AUM cost function is undeterred by this scarcity of minority examples, as the performance of the classifier amply demonstrates.
KW - AUC
KW - Area Under Min(FP, FN) (AUM)
KW - LSTM
KW - ROC
KW - byte-pair encoding
KW - classification of imbalanced data
KW - Common Vulnerabilities and Exposures (CVE)
KW - source code embedding
KW - static code analysis
KW - vulnerability detection
UR - http://www.scopus.com/inward/record.url?scp=85143412255&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85143412255&partnerID=8YFLogxK
U2 - 10.1109/TransAI54797.2022.00030
DO - 10.1109/TransAI54797.2022.00030
M3 - Conference contribution
AN - SCOPUS:85143412255
T3 - Proceedings - 2022 4th International Conference on Transdisciplinary AI, TransAI 2022
SP - 135
EP - 141
BT - Proceedings - 2022 4th International Conference on Transdisciplinary AI, TransAI 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 4th International Conference on Transdisciplinary AI, TransAI 2022
Y2 - 20 September 2022 through 22 September 2022
ER -