DocumentCode :
3423265
Title :
Classifying XML Documents by Using Genre Features
Author :
Clark, Malcolm ; Watt, Stuart
Author_Institution :
Robert Gordon Univ., Aberdeen
fYear :
2007
fDate :
3-7 Sept. 2007
Firstpage :
242
Lastpage :
248
Abstract :
The categorization of documents is traditionally topic-based. This paper presents a complementary analysis of research and experiments on genre to show that encouraging results can be obtained by using genre structure (form) features. We conducted an experiment to assess the effectiveness of using eXtensible Mark-Up Language (XML) tag information, and part-of-speech (P-O-S) features, for the classification of genres, testing the hypothesis that if a focus on genre can lead to high precision on normal textual documents, then good results can be achieved using XML tag information in addition to P-O-S information. An experiment was carried out on a subsection of the Initiative for the Evaluation of XML (INEX) 1.4 collection. The features were extracted and documents were classified using machine learning algorithms, which yielded encouraging results for logistic regression and neural networks. We propose that utilizing these features and training a classifier may benefit retrieval for most World Wide Web (WWW) technologies such as XML and eXtensible Hypertext Markup Language (XHTML).
Keywords :
XML; classification; learning (artificial intelligence); neural nets; World Wide Web; XML document classification; document categorization; eXtensible Mark-up Language; genre features; logistic regression; machine learning; neural networks; Data mining; Feature extraction; Logistics; Machine learning algorithms; Markup languages; Neural networks; Testing; Web sites; World Wide Web; XML;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Database and Expert Systems Applications, 2007. DEXA '07. 18th International Workshop on
Conference_Location :
Regensburg
ISSN :
1529-4188
Print_ISBN :
978-0-7695-2932-5
Type :
conf
DOI :
10.1109/DEXA.2007.120
Filename :
4312894