Analysis of large log files

Kasper Laursen s093078

Kongens Lyngby 2012 IMM-B.Sc.-2012-37

Technical University of Denmark
Informatics and Mathematical Modelling
Building 321, DK-2800 Kongens Lyngby, Denmark
Phone +45 4525 3351, Fax +45 4588 2673
reception@imm.dtu.dk
www.imm.dtu.dk

Summary

This thesis covers pattern recognition in large log files using cluster analysis, in the form of mini-batch K-means clustering and data fitting, to find abnormal traffic in network flows provided by DeIC, formerly the Danish Research Network.

The implementation is a modified clustering algorithm using the Mahalanobis distance. In the analysis, more than 10^9 network flows from a single day were split into clusters, and outliers were detected. The clustering calculations took less than 13 hours, which means that outliers can be detected the following day. The implementation and analysis could be further improved by selecting a different set of fields from the log files, by implementing the mini-batch K-means clustering algorithm in parallel, and by analysing the detected outliers more thoroughly.


Preface

This bachelor thesis was prepared at the Department of Informatics and Mathematical Modelling at the Technical University of Denmark in fulfillment of the requirements for acquiring a B.Sc.Eng. degree in Software Technology.

Lyngby, 14 December 2012

Kasper Laursen


Acknowledgements

I would like to thank my supervisor Robin Sharp for the weekly meetings and his support throughout the project.

Thanks to the Danish Research Network for providing the network log files for this analysis.

I would like to give special thanks to Rasmus Jul Hansen for proofreading this project, to Simon Laursen for discussion and help finalizing the report, and to Søren Løvborg for proofreading, help, and discussion throughout the project.


Contents

1 Introduction

2 Preliminaries
2.1 Machine learning