DocumentCode :
2392433
Title :
Towards automatic Web genre identification: a corpus-based approach in the domain of academia by example of the Academic's Personal Homepage
Author :
Rehm, Georg
Author_Institution :
Res. Unit for Appl. & Computational Linguistics, Justus-Liebig-Universität, Giessen, Germany
fYear :
2002
fDate :
7-10 Jan. 2002
Firstpage :
1143
Lastpage :
1152
Abstract :
We argue for a systematic analysis of one particular, well-structured domain, academic Web pages, with regard to a special class of digital genres: Web genres. For this purpose, we have developed a database-driven system that will ultimately consist of more than 3,000,000 HTML documents, written in German, which form the empirical basis for our research. We introduce the notion of a Web genre type, which constitutes the basic framework for a certain Web genre, and the notions of compulsory and optional Web genre modules. These modules act as building blocks that together make up the structure characterised by the Web genre type and, furthermore, operate as modifiers for the default assignment involved. The analysis of a 200-document sample illustrates our notion of a Web genre hierarchy, into which Web genre types and modules are embedded. The analysis of four different documents of the Web genre Academic's Personal Homepage illustrates not only our approach but also our long-term goal of automatically extracting the contents of Web genre modules in order to build structured XML documents from groups of unstructured HTML documents.
Keywords :
hypermedia markup languages; information resources; Academic Personal Homepage; German language; HTML documents; Web genre hierarchy; Web genre type; academic Web pages; compulsory modules; database-driven system; optional modules; structured XML documents; unstructured HTML documents; Computational linguistics; Data mining; Databases; Educational institutions; HTML; Search engines; Tagging; Web pages; Web sites; XML;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
System Sciences, 2002. HICSS. Proceedings of the 35th Annual Hawaii International Conference on
Print_ISBN :
0-7695-1435-9
Type :
conf
DOI :
10.1109/HICSS.2002.994036
Filename :
994036