DocumentCode
2135199
Title
Experiment with a hierarchical text categorization method on the WIPO-alpha patent collection
Author
Tikk, Domonkos ; Biró, György
Author_Institution
Dept. of Telecommun. & Media Informatics, Budapest Univ. of Technol. & Econ.
fYear
2003
fDate
24-24 Sept. 2003
Firstpage
104
Lastpage
109
Abstract
Text categorization is the task of assigning a text document to an appropriate category from a predefined set of categories. We focus on the special case in which the categories are organized in a hierarchy. We present a new approach to this recently emerged subfield of text categorization. The algorithm applies an iterative learning module that allows a classifier to be created gradually by a trial-and-error-like method. We present software developed on the basis of the algorithm to illustrate its capability on large data collections. We experimented on a very large benchmark collection, the WIPO-alpha (World Intellectual Property Organization, Geneva, Switzerland, 2002) English patent database, which consists of about 75000 XML documents distributed over 5000 categories. Our software is able to index the corpus quickly and creates a classifier in a few iteration cycles. We present the results achieved by the classifier w.r.t. various test settings.
Keywords
XML; iterative methods; patents; text analysis; English patent database; WIPO; World Intellectual Property Organization; XML document; alpha patent collection; benchmark collection; hierarchical text categorization; iterative learning module; text document; trial-error-like method; Distributed databases; Informatics; Intellectual property; Iterative algorithms; Software algorithms; Taxonomy; Telecommunications; Testing; Text categorization
fLanguage
English
Publisher
IEEE
Conference_Titel
Uncertainty Modeling and Analysis, 2003. ISUMA 2003. Fourth International Symposium on
Conference_Location
College Park, MD
Print_ISBN
0-7695-1997-0
Type
conf
DOI
10.1109/ISUMA.2003.1236148
Filename
1236148