Text Mining Research: A Survey

Essay by chinwei, University, Master's, November 2004



Text mining, also known as knowledge discovery from text or document information mining, refers to the process of extracting interesting patterns from very large text corpora for the purpose of discovering knowledge. Text mining is an interdisciplinary field involving information retrieval, text understanding, information extraction, clustering, categorization, visualization, database technology, machine learning, and data mining. Regarded by many as the next wave of knowledge discovery, text mining has very high commercial value. This paper presents a general framework for text mining consisting of two stages: text refining, which transforms unstructured text documents into an intermediate form, and knowledge distillation, which deduces patterns or knowledge from the intermediate form. I then explain two text refining methods, information retrieval and information extraction. Next, I survey different document representation methods and algorithms, compare them, and discuss some of their advantages and limitations.

I then survey the state-of-the-art text mining approaches, products, and applications, aligning them according to their text refining and knowledge distillation functions as well as the intermediate form they adopt. Finally, I highlight the upcoming challenges of text mining and the opportunities it offers, and give a short conclusion.
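
To make the two-stage framework concrete, the following is a minimal sketch in Python, not an implementation taken from any surveyed system. It assumes the simplest possible intermediate form (a set of lowercase terms per document) and one illustrative distillation step (counting term pairs that co-occur across documents). The function names `refine` and `distill`, the `min_support` parameter, and the sample documents are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def refine(documents):
    """Text refining: map each raw document to an intermediate form.

    Assumption: the intermediate form is simply the set of lowercase
    alphabetic terms found in the document.
    """
    return [set(re.findall(r"[a-z]+", doc.lower())) for doc in documents]


def distill(intermediate, min_support=2):
    """Knowledge distillation: deduce patterns from the intermediate form.

    Assumption: a "pattern" is a pair of terms that co-occurs in at
    least `min_support` documents.
    """
    counts = Counter()
    for terms in intermediate:
        # Sort the terms so each pair is always counted under the same key.
        counts.update(combinations(sorted(terms), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical sample documents, used only to exercise the sketch.
docs = [
    "Text mining extracts patterns from text.",
    "Data mining extracts patterns from databases.",
    "Clustering groups similar documents.",
]
patterns = distill(refine(docs))
```

Keeping the two stages as separate functions mirrors the framework: a different intermediate form (for example, extracted entities instead of terms) or a different distillation method (for example, clustering) could be substituted without changing the other stage.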


Text mining, also known as text data mining [25] or knowledge discovery from textual databases [19], is an emerging technology for analyzing large collections of unstructured documents for the purposes of extracting interesting and non-trivial patterns or knowledge. It can be envisaged as a leap from data mining or knowledge discovery from (structured) databases [17; 58].

Because the most natural form of storing and exchanging information is the written word, text mining has very high commercial potential. In fact, a recent study indicated that 80% of a company's information was...