Machine Learning for Host-based Misuse and Anomaly Detection in UNIX Environment

by Ehsan Aghaei

Institution: University of Toledo
Year: 2017
Keywords: Computer Science; Computer Engineering
Posted: 02/01/2018
Record ID: 2204639
Full text PDF: http://rave.ohiolink.edu/etdc/view?acc_num=toledo1493255965690437


This thesis focuses on three individual studies about intrusion detection systems using different pre-processing techniques and classifiers on the ADFA-LD dataset. ADFA-LD entails thousands of system call traces, which are collected during seven different situations including normal and six types of attack in the UNIX environment. The first study presents the development and application of a frequency-based misuse intrusion detection system, which is accomplished through ensemble classification. It entails preprocessing the raw ADFA-LD system call traces with the N-gram feature extraction methodology, and generating fixed-size patterns whose attributes are N-grams, for N values in the range 1 to 10, for training and testing. In order to generate the signature of each class and to reduce the dimensionality, we filtered the features in two steps: selecting the most frequent unique attributes, and picking the most frequent features regardless of uniqueness. The five-random-neighbor SMOTE algorithm is used to balance the classes in terms of pattern counts. The classifier design is based on a majority voting ensemble with base classifiers of naive Bayes, support vector machine, PART, decision tree and random forest as they are implemented in the Weka machine-learning framework. The proposed misuse detection system demonstrated very high performance in detecting attacks. In the second study, the misuse detection system employs ADFA-LD system call traces to extract features using principal component analysis (PCA). In this study, fixed-size patterns for both training and testing, namely Eigentraces, are generated by preprocessing the ADFA-LD system call traces with the PCA methodology. Eigentraces serve as templates for known normal and attack class traces. Classification of system call trace data that is in the form of feature vectors is accomplished using the k-nearest-neighbor algorithm. A simulation study was conducted to evaluate the performance of the proposed system. The proposed misuse intrusion detection system demonstrated very high performance in detecting attacks and predicting the type of the attacks, given that there were six classes of attacks, and as such appears very promising. In the third study, we modeled a host-based anomaly detection system within the framework of one-class classification using the ADFA-LD dataset. Pre-processing and feature extraction procedures employed windowing on the system call trace data, followed by the application of the PCA-based Eigentraces technique. The target or normal class probability function is modeled by two separate machine learners: a Radial Basis Function neural network and a Random Forest. The normal class density function is estimated using Bayes' theorem. A simulation study showed that the proposed intrusion detection system offers high performance in detecting anomalies and normal activities accurately.

Advisors/Committee Members: Serpen, Gursel (Committee Chair).
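
The N-gram frequency features described for the first study can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the toy traces, and the single top-k frequency filter are assumptions made for illustration, and the filter is a simplification of the two-step selection the abstract describes. A trace is assumed to be a list of integer system call numbers.

```python
from collections import Counter

def ngram_counts(trace, max_n=10):
    """Count every contiguous N-gram in one trace, for N = 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(trace) - n + 1):
            counts[tuple(trace[i:i + n])] += 1
    return counts

def build_vocabulary(traces, max_n=10, top_k=50):
    """Keep the top_k most frequent N-grams over all traces.

    Assumption: a single frequency cut-off stands in for the
    two-step filtering mentioned in the abstract.
    """
    total = Counter()
    for trace in traces:
        total.update(ngram_counts(trace, max_n))
    return [gram for gram, _ in total.most_common(top_k)]

def to_feature_vector(trace, vocabulary, max_n=10):
    """Map a variable-length trace to a fixed-size frequency vector."""
    counts = ngram_counts(trace, max_n)
    return [counts[gram] for gram in vocabulary]

# Hypothetical toy traces (not taken from ADFA-LD).
traces = [[6, 11, 45, 33], [6, 11, 45, 6, 11]]
vocabulary = build_vocabulary(traces, max_n=2, top_k=3)
vectors = [to_feature_vector(t, vocabulary, max_n=2) for t in traces]
```

Every trace maps to a vector of the same length regardless of how many system calls it contains, which is what allows fixed-size patterns to be passed to standard classifiers.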