Regional Information Center for Science and Technology - Content Code Blurring: A New Approach to Content Extraction

DocumentCode :

2830412

Title :

Content Code Blurring: A New Approach to Content Extraction

Author :

Gottron, Thomas

Author_Institution :

Inst. für Inf., Johannes Gutenberg-Univ. Mainz, Mainz

fYear :

2008

fDate :

1-5 Sept. 2008

Firstpage :

Lastpage :

Abstract :

Most HTML documents on the World Wide Web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements, or commercial banners are typical examples of additional contents. Content extraction is the process of identifying the main content and/or removing the additional contents. We introduce content code blurring, a novel content extraction algorithm. As the main text content is typically a long, homogeneously formatted region in a web document, the aim is to identify exactly these regions in an iterative process. Comparing its performance with existing content extraction solutions, we show that for most documents content code blurring delivers the best results.
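
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of the idea: mark every character as content (outside a tag) or code (inside a tag), smooth that sequence repeatedly, and keep the text that sits in regions that stay content-dominated. The function names, the three-point smoothing kernel, the iteration count, and the threshold are all assumptions made for illustration and are not taken from the paper.

```python
def content_code_vector(html):
    """Return 1.0 for each character outside a tag and 0.0 for each inside one.

    Assumption: a naive '<' / '>' state machine is good enough for a sketch;
    it does not handle comments, scripts, or '>' inside attribute values.
    """
    vector = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True
            vector.append(0.0)
        elif ch == ">":
            in_tag = False
            vector.append(0.0)
        else:
            vector.append(0.0 if in_tag else 1.0)
    return vector


def blur_once(values):
    """One smoothing pass with the weights (0.25, 0.5, 0.25).

    Edge values are repeated at the boundaries. The kernel is an assumed
    stand-in for whatever blurring the paper actually uses.
    """
    n = len(values)
    out = []
    for i in range(n):
        left = values[i - 1] if i > 0 else values[i]
        right = values[i + 1] if i < n - 1 else values[i]
        out.append(0.25 * left + 0.5 * values[i] + 0.25 * right)
    return out


def extract_main_content(html, iterations=50, threshold=0.5):
    """Keep content characters whose blurred value reaches the threshold.

    'iterations' and 'threshold' are hypothetical tuning parameters.
    """
    code = content_code_vector(html)
    blurred = code
    for _ in range(iterations):
        blurred = blur_once(blurred)
    return "".join(
        ch
        for ch, is_content, score in zip(html, code, blurred)
        if is_content == 1.0 and score >= threshold
    )


if __name__ == "__main__":
    article = "This sentence stands in for a long article body. " * 6
    page = (
        "<div><a href='x'>Home</a><a href='y'>News</a></div>"
        "<p>" + article + "</p>"
    )
    print(extract_main_content(page))
```

Short link labels surrounded by markup are averaged down by the repeated smoothing, while a long uninterrupted run of text keeps high values throughout, which is the intuition the abstract describes. A fixed iteration count is used here because smoothing indefinitely would flatten the whole sequence.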

Keywords :

content management; document handling; iterative methods; knowledge acquisition; HTML document; Web document; content code blurring; content extraction; iterative process; Databases; Expert systems; Filters; HTML; Image analysis; Image segmentation; Iterative algorithms; Navigation; Proposals; Web sites; Content Extraction; content code blurring; main content detection; web information retrieval;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on

Conference_Location :

Turin

ISSN :

1529-4188

Print_ISBN :

978-0-7695-3299-8

Type :

conf

DOI :

10.1109/DEXA.2008.43

Filename :

4624687

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2830412