DocumentCode
3427702
Title
HTML Pattern Generator--Automatic Data Extraction from Web Pages
Author
Cosulschi, Mirel ; Giurca, Adrian ; Udrescu, Bogdan ; Constantinescu, Nicolae ; Gabroveanu, Mihai
Author_Institution
Dept. of Comput. Sci., Craiova Univ.
fYear
2006
fDate
Sept. 2006
Firstpage
75
Lastpage
78
Abstract
Existing methods of information extraction from HTML documents include the manual approach, supervised learning, and automatic techniques. The manual method achieves high precision and recall but is difficult to apply to a large number of pages. Supervised learning requires human interaction to create positive and negative samples. Automatic techniques require less human effort, but they are not highly reliable with respect to the information retrieved.
Keywords
Web sites; hypermedia markup languages; information retrieval; knowledge acquisition; learning (artificial intelligence); HTML documents; HTML pattern generator; Web pages; automatic data extraction; information extraction; information retrieval; supervised learning; Computer science; Costs; Data mining; Databases; HTML; Humans; Internet; Manuals; Supervised learning; Web pages
fLanguage
English
Publisher
IEEE
Conference_Titel
Symbolic and Numeric Algorithms for Scientific Computing, 2006. SYNASC '06. Eighth International Symposium on
Conference_Location
Timisoara
Print_ISBN
0-7695-2740-X
Type
conf
DOI
10.1109/SYNASC.2006.43
Filename
4090300