Security level classification of confidential documents written in Turkish

Erdem Alparslan, Hayretdin Bahsi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

This article introduces a methodology for classifying the security level of confidential documents written in Turkish. Internal documents of TUBITAK UEKAE, holding various security levels (unclassified, restricted, secret), were classified using Support Vector Machines (SVMs) [1] and naïve Bayes classifiers [3][9]. To represent term-document relations, the recommended "TF-IDF" metric [2] was chosen to construct a weight matrix. Compared with English, Turkic languages pose a particularly difficult natural language processing problem: stemming. The Turkish stemming tool "zemberek" was used to extract features with their suffixes removed. The article closes with experimental results and success metrics.
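As an illustration of the term-document weight matrix the abstract refers to, the following is a minimal sketch of one common TF-IDF variant (relative term frequency multiplied by the logarithm of inverse document frequency). The abstract does not state which TF-IDF formula the authors used, so the weighting here is an assumption. The function name `tfidf_matrix` and the toy documents are hypothetical; the tokens stand in for stems that a tool such as zemberek might produce and are not taken from the paper's data.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build a TF-IDF weight matrix from already-stemmed, tokenized documents.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    Returns the sorted vocabulary and one row of weights per document.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        row = [
            (counts[term] / total) * math.log(n_docs / df[term]) if total else 0.0
            for term in vocab
        ]
        rows.append(row)
    return vocab, rows


# Hypothetical stemmed documents (illustrative only, not the paper's corpus).
docs = [
    ["gizli", "rapor", "proje"],
    ["rapor", "toplantı"],
    ["gizli", "anahtar", "proje"],
]
vocab, weights = tfidf_matrix(docs)
```

Each row of `weights` is the feature vector for one document; vectors of this kind are what a classifier such as an SVM or naïve Bayes model would be trained on, paired with the documents' security labels.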

Original language: English (US)
Title of host publication: User Centric Media - First International Conference, UCMedia 2009, Revised Selected Papers
Pages: 329-334
Number of pages: 6
State: Published - 2010
Externally published: Yes
Event: 1st International Conference on User Centric Media, UCMedia 2009 - Venice, Italy
Duration: Dec 9 2009 to Dec 11 2009

Publication series

Name: Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering
Volume: 40 LNICST
ISSN (Print): 1867-8211

Conference

Conference: 1st International Conference on User Centric Media, UCMedia 2009
Country/Territory: Italy
City: Venice
Period: 12/9/09 to 12/11/09

Keywords

  • Data loss prevention
  • Document classification
  • Naïve Bayes
  • Security
  • Stemming
  • Support vector machine
  • TF-IDF
  • Turkish

ASJC Scopus subject areas

  • Computer Networks and Communications
