This thesis covers pattern recognition in large log files. It uses cluster analysis, in the form of mini-batch K-means clustering and data fitting, to find abnormal traffic in network flows provided by DeIC, formerly The Danish Research Network.

The implementation is a modified clustering algorithm using the Mahalanobis distance. In the analysis, more than 10^9 network flows from a single day were split into different clusters, and outliers were detected. The clustering analysis took less than 13 hours to compute, which means that outliers can be detected the following day. The implementation and analysis could be further improved by selecting a different set of fields from the log files, by implementing the mini-batch K-means clustering algorithm in parallel, and by analysing the detected outliers more thoroughly.
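To make the approach summarised above concrete, the following is a minimal sketch and not the thesis implementation. It assumes plain mini-batch K-means with Euclidean assignment and per-centre learning rates, followed by Mahalanobis-distance scoring within each cluster. The function names, the synthetic data and the threshold value are hypothetical and chosen for illustration only. How the thesis modifies the algorithm to use the Mahalanobis distance is not specified here.

```python
import numpy as np


def mahalanobis_sq(points, center, cov_inv):
    """Squared Mahalanobis distance from each row of `points` to `center`."""
    diff = points - center
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


def minibatch_kmeans(data, k, batch_size=100, n_iter=200, seed=0):
    """Mini-batch K-means with per-centre learning rates (assumed variant)."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iter):
        # Draw a random mini-batch instead of scanning the whole data set.
        batch = data[rng.choice(len(data), size=batch_size)]
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for x, j in zip(batch, labels):
            counts[j] += 1
            # Move the centre towards x with step size 1 / counts[j].
            centers[j] += (x - centers[j]) / counts[j]
    return centers


def find_outliers(data, centers, threshold=3.0):
    """Flag points whose Mahalanobis distance within their cluster exceeds `threshold`."""
    dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    flags = np.zeros(len(data), dtype=bool)
    for j in range(len(centers)):
        mask = labels == j
        members = data[mask]
        if len(members) <= data.shape[1]:
            continue  # too few points to estimate a covariance matrix
        cov = np.atleast_2d(np.cov(members, rowvar=False))
        cov_inv = np.linalg.pinv(cov)
        dist = np.sqrt(mahalanobis_sq(members, members.mean(axis=0), cov_inv))
        flags[mask] = dist > threshold
    return flags


if __name__ == "__main__":
    # Synthetic stand-in for flow features: one dense blob plus one far-away point.
    rng = np.random.default_rng(1)
    flows = np.vstack([rng.normal(size=(500, 2)), [[50.0, 50.0]]])
    centers = minibatch_kmeans(flows, k=1)
    print("flagged:", int(find_outliers(flows, centers, threshold=5.0).sum()))
```

Only a small random batch is touched per iteration, which is what makes the mini-batch variant attractive for very large inputs. The Mahalanobis distance scales each direction by the cluster's own spread, so a point is judged relative to the shape of its cluster and not by raw Euclidean distance alone.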



1 Introduction

2 Preliminaries
2.1 Machine learning