Title :
Mining the Web with active hidden Markov models
Author :
Scheffer, Tobias ; Decomain, Christian ; Wrobel, Stefan
Author_Institution :
Univ. of Magdeburg, Germany
Abstract :
Given the enormous amounts of information available only in unstructured or semi-structured textual documents, tools for information extraction (IE) have become critically important. IE tools identify the relevant information in such documents and convert it into a structured format such as a database or an XML document. While the first IE algorithms were hand-crafted sets of rules, researchers soon turned to learning extraction rules from hand-labeled documents. Unfortunately, rule-based approaches sometimes fail to provide the necessary robustness against the inherent variability of document structure, which has led to the recent interest in using hidden Markov models (HMMs). By using additional unlabeled documents, which are readily available in most applications, we can perform active learning of HMMs. The idea of active learning algorithms is to identify the unlabeled observations that would be most useful when labeled by the user. Such algorithms are known for classification, clustering, and regression; we present the first algorithm for active learning of hidden Markov models.
Keywords :
data mining; hidden Markov models; information resources; information retrieval; learning (artificial intelligence); Web mining; active hidden Markov models; active learning; information extraction; semi-structured textual documents; unlabeled documents; unstructured textual documents; Clustering algorithms; Data mining; Databases; Hidden Markov models; Probability; Robustness; Sequences; Speech recognition; Tin; XML;
Conference_Title :
Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
Conference_Location :
San Jose, CA
Print_ISBN :
0-7695-1119-8
DOI :
10.1109/ICDM.2001.989591