DocumentCode
2830395
Title
Text Extraction from the Web via Text-to-Tag Ratio
Author
Weninger, Tim ; Hsu, William H.
Author_Institution
Comput. & Inf. Sci., Kansas State Univ., Manhattan, KS
fYear
2008
fDate
1-5 Sept. 2008
Firstpage
23
Lastpage
28
Abstract
We describe a method to extract content text from diverse Web pages by using the HTML document's text-to-tag ratio rather than specific HTML cues that may not be constant across various Web pages. We describe how to compute the text-to-tag ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach, we show surprisingly high levels of recall for all levels of precision, as well as large space savings.
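The line-by-line computation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the regular expression used to find tags, the handling of lines with no tags (the tag count is treated as 1), and the simple one-dimensional two-means split standing in for the clustering step are all assumptions, and the names `text_to_tag_ratio`, `two_means_threshold`, and `extract_content` are hypothetical.

```python
import re

# Assumption: anything between '<' and '>' on a single line counts as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Return the number of non-tag characters divided by the number of tags."""
    tag_count = len(TAG_RE.findall(line))
    text_length = len(TAG_RE.sub("", line).strip())
    # Assumption: a line without tags is treated as if it had one tag,
    # so its ratio equals its text length.
    return text_length / max(tag_count, 1)


def two_means_threshold(values: list[float], iterations: int = 20) -> float:
    """Split 1-D values into a low and a high group; return the midpoint.

    This is a stand-in for the clustering step, not the paper's method.
    """
    low_center, high_center = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - low_center) <= abs(v - high_center)]
        high = [v for v in values if abs(v - low_center) > abs(v - high_center)]
        if not high:  # all values fell into one group
            break
        new_low, new_high = sum(low) / len(low), sum(high) / len(high)
        if (new_low, new_high) == (low_center, high_center):
            break  # centers stopped moving
        low_center, high_center = new_low, new_high
    return (low_center + high_center) / 2


def extract_content(html: str) -> list[str]:
    """Return the text of lines whose ratio falls in the high group."""
    lines = [line for line in html.splitlines() if line.strip()]
    if not lines:
        return []
    ratios = [text_to_tag_ratio(line) for line in lines]
    threshold = two_means_threshold(ratios)
    return [
        TAG_RE.sub("", line).strip()
        for line, ratio in zip(lines, ratios)
        if ratio > threshold
    ]
```

Lines dominated by markup (navigation bars, footers) tend to get low ratios under this scheme, while long runs of prose get high ones, which is what makes a two-group split plausible as a first approximation.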
Keywords
Internet; hypermedia markup languages; information retrieval; text analysis; HTML document; diverse Web pages; text extraction; text-to-tag ratio; Art; Cascading style sheets; Data mining; Databases; Expert systems; HTML; Histograms; Testing; Web pages; Histogram; Information Extraction; Web
fLanguage
English
Publisher
IEEE
Conference_Titel
Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on
Conference_Location
Turin
ISSN
1529-4188
Print_ISBN
978-0-7695-3299-8
Type
conf
DOI
10.1109/DEXA.2008.12
Filename
4624686