Text Extraction from the Web via Text-to-Tag Ratio

Author

Weninger, Tim ; Hsu, William H.

Author_Institution

Comput. & Inf. Sci., Kansas State Univ., Manhattan, KS

fYear

2008

fDate

1-5 Sept. 2008

Firstpage

Lastpage

Abstract

We describe a method to extract content text from diverse Web pages by using the HTML document's text-to-tag ratio rather than specific HTML cues that may not be constant across various Web pages. We describe how to compute the text-to-tag ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach, we show surprisingly high levels of recall at all levels of precision, as well as large space savings.
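The line-by-line computation described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so the details here are assumptions: the ratio is taken as the number of non-tag characters on a line divided by the number of tags on that line (with a tag count of zero treated as one), and a simple mean threshold stands in for the clustering step. The names `text_to_tag_ratios` and `content_lines` are hypothetical and chosen only for illustration.

```python
import re
from statistics import mean

# Naive tag matcher: anything between '<' and '>'. An assumption for
# illustration; real HTML (comments, scripts) needs a proper parser.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html: str) -> list[float]:
    """Return one text-to-tag ratio per line of the input."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        text_len = len(TAG_RE.sub("", line).strip())
        # Assumption: a line with no tags is divided by 1, not 0.
        ratios.append(text_len / max(tag_count, 1))
    return ratios


def content_lines(html: str) -> list[str]:
    """Keep the text of lines whose ratio exceeds the mean ratio.

    The mean threshold is a stand-in for the clustering step, not the
    authors' method.
    """
    lines = html.splitlines()
    ratios = text_to_tag_ratios(html)
    if not ratios:
        return []
    threshold = mean(ratios)
    return [
        TAG_RE.sub("", line).strip()
        for line, ratio in zip(lines, ratios)
        if ratio > threshold
    ]
```

Under these assumptions, a navigation line that is mostly markup receives a low ratio and is discarded, while a paragraph line with long text and few tags receives a high ratio and is kept.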

Keywords

Internet; hypermedia markup languages; information retrieval; text analysis; HTML document; diverse Web pages; text extraction; text-to-tag ratio; Art; Cascading style sheets; Data mining; Databases; Expert systems; HTML; Histograms; Testing; Web pages; Histogram; Information Extraction; Web

fLanguage

English

Publisher

IEEE

Conference_Titel

Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on

Conference_Location

Turin

ISSN

1529-4188

Print_ISBN

978-0-7695-3299-8

Type

conf

DOI

10.1109/DEXA.2008.12

Filename

4624686

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2830395