DocumentCode :
1745862
Title :
Knowledge discovery from text documents based on paragraph maps
Author :
Visa, Ari ; Toivonen, Jarmo ; Ruokonen, Piia ; Vanharanta, Hannu ; Back, Barbro
Author_Institution :
Lappeenranta Univ. of Technol., Finland
fYear :
2000
fDate :
4-7 Jan. 2000
Abstract :
Fields such as law, physics, and business produce large numbers of documents, and organising them is essential. A well-chosen organisation reveals a good deal about the information content of the documents. Text documents are commonly characterised and classified by keywords, which are usually defined by the authors. Today there is a tremendous amount of uncharacterised text, owing to the Internet and to old paper-based archives. It is important that this information can be managed and the knowledge in it retrieved, ideally without reading each document. We propose a new technology based on multilevel hierarchies; here we concentrate only on the highest level. The technology is based on a hierarchy of self-organizing maps (SOM) and on smart encoding of words. Our experiment with a text document (an annual report) shows that it is possible to distinguish between different types of paragraphs, and between an original paragraph and one containing the same words in random order. It is also possible to categorise paragraphs or, for instance, to find all unauthorised citations of paragraphs within a long text document. The only requirement is that a considerable number of text documents be available for the training process. Finally, text documents can be classified based on the trained types of paragraphs, which means that unknown documents can be categorised without reading them. This facility can be called knowledge discovery.
Keywords :
classification; data mining; self-organising feature maps; text analysis; Internet; SOM; annual report; information contents; information management; keywords; knowledge discovery; knowledge retrieval; multilevel hierarchies; paragraph maps; self-organizing maps; smart word encoding; text documents; Data mining; Databases; Information retrieval; Knowledge based systems; Learning automata; Neural networks; Organizing; Physics; Read only memory
fLanguage :
English
Publisher :
IEEE
Conference_Title :
System Sciences, 2000. Proceedings of the 33rd Annual Hawaii International Conference on
Print_ISBN :
0-7695-0493-0
Type :
conf
DOI :
10.1109/HICSS.2000.926659
Filename :
926659