Classifying Imbalanced Data with AUM Loss

Joseph R. Barr, Toby D. Hocking, Garinn Morton, Tyler Thatcher, Peter Shaw

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We present a significant improvement to a methodology described in several earlier articles by Barr et al., in which we demonstrated a workflow that classifies the source code of large open-source projects for vulnerability. Whereas in the past we dealt with the dearth of minority examples by applying upsampling and simulation techniques, the present approach demonstrates that a clever choice of cost function, without upsampling, yields excellent performance that surpasses previous results. In this iteration, a feed-forward neural network classifier was trained on the Area Under Min(FP, FN) (AUM) loss. The AUM method is described in Hillman & Hocking. As in earlier work, to overcome the out-of-vocabulary challenge we apply an intermediate Byte-Pair Encoding step, which 'compresses' the data; a long short-term memory (LSTM) network is then applied to the compressed data to embed the tokens, from which we assemble an embedding of function labels. This results in a 128-dimensional embedding which, along with additional 'interpretable', heuristics-based features, is used to classify CVEs. The resulting labeled dataset is extremely sparse, with a minority class consisting of roughly 0.5% of the total. The AUM cost function is undeterred by this sparsity, as amply demonstrated by the performance of the classifier.
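As a rough illustration of the quantity named in the abstract, the sketch below evaluates the area under min(FP, FN) for a set of real-valued scores as the decision threshold sweeps the real line. It is a minimal sketch and not the authors' implementation: the function name `aum`, the use of raw error counts rather than rates, and the rule "predict positive when score > threshold" are all assumptions made here. Training a network on this loss would additionally need a (sub)gradient with respect to the scores, which is not shown.

```python
import numpy as np

def aum(scores, labels):
    """Area under min(FP, FN) over all decision thresholds (illustrative sketch).

    scores: real-valued predictions, one per example.
    labels: 1 for the positive (minority) class, 0 for the negative class.
    Assumption: FP and FN are raw counts, and an example is predicted
    positive when its score is strictly greater than the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    order = np.argsort(scores)      # ascending by score
    s = scores[order]
    y = labels[order]

    # For a threshold t strictly between s[k] and s[k+1]:
    #   examples 0..k are predicted negative, examples k+1.. are predicted positive.
    fn = np.cumsum(y)[:-1]                     # positives at or below s[k]
    fp = np.cumsum((1 - y)[::-1])[::-1][1:]    # negatives at or above s[k+1]

    # FP and FN are constant on each interval, so the integral is a finite sum.
    # Outside [min(s), max(s)] one of FP, FN is zero and contributes nothing.
    widths = np.diff(s)
    return float(np.sum(np.minimum(fp, fn) * widths))
```

Under these assumptions the value is zero whenever every positive example scores above every negative one, regardless of how few positives there are, and it grows with both the number of misordered examples and the distance by which they are misordered.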

Original language: English (US)
Title of host publication: Proceedings - 2022 4th International Conference on Transdisciplinary AI, TransAI 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 135-141
Number of pages: 7
ISBN (Electronic): 9781665471848
State: Published - 2022
Event: 4th International Conference on Transdisciplinary AI, TransAI 2022 - Laguna Hills, United States
Duration: Sep 20 2022 to Sep 22 2022

Publication series

Name: Proceedings - 2022 4th International Conference on Transdisciplinary AI, TransAI 2022

Conference

Conference: 4th International Conference on Transdisciplinary AI, TransAI 2022
Country/Territory: United States
City: Laguna Hills
Period: 9/20/22 to 9/22/22

Keywords

  • AUC
  • Area Under Min(FP, FN) (AUM)
  • LSTM
  • ROC
  • byte-pair encoding
  • classification of imbalanced data
  • common vulnerabilities & exposures (CVE)
  • source code embedding
  • static code analysis
  • vulnerability detection

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Modeling and Simulation
