Trinity: On Using Trinary Trees for Unsupervised Web Data Extraction

Author

Sleiman, Hassan A.; Corchuelo, Rafael

Author_Institution

ETSI Inf., Univ. of Sevilla, Sevilla, Spain

Volume

26

Issue

6

fYear

2014

fDate

June 2014

Firstpage

1544

Lastpage

1556

Abstract

Web data extractors are used to extract data from web documents in order to feed automated processes. In this article, we propose a technique that works on two or more web documents generated by the same server-side template and learns a regular expression that models it and can later be used to extract data from similar documents. The technique builds on the hypothesis that the template introduces some shared patterns that do not provide any relevant data and can thus be ignored. We have evaluated and compared our technique to others in the literature on a large collection of web documents; our results demonstrate that our proposal performs better than the others and that input errors do not have a negative impact on its effectiveness; furthermore, its efficiency can be easily boosted by means of a couple of parameters, without sacrificing its effectiveness.

Keywords

Internet; document handling; tree data structures; unsupervised learning; Trinity; Web documents; automatic wrapper generation; server-side template; trinary trees; unsupervised Web data extraction; Algorithm design and analysis; Data mining; HTML; Java; Particle separators; Partitioning algorithms; Proposals; Computing Methodologies; Information extraction; Knowledge and data engineering tools and techniques; Machine learning; Pattern Recognition; Web data extraction; wrappers

fLanguage

English

Journal_Title

Knowledge and Data Engineering, IEEE Transactions on

Publisher

IEEE

ISSN

1041-4347

Type

jour

DOI

10.1109/TKDE.2013.161

Filename

6616554