DocumentCode
1877730
Title
Processing of unstructured data for information extraction
Author
Ingle, V.A.
Author_Institution
Dept. of Comput. Sci. & IT, Dr. B.A.M. Univ., Aurangabad, India
fYear
2012
fDate
6-8 Dec. 2012
Firstpage
1
Lastpage
4
Abstract
Unstructured data have no predetermined form or structure and consist largely of text, so they do not fit well into relational tables. Most enterprise data today can be considered unstructured. Typical unstructured sources include emails, reports, contracts, transcripts of telephone conversations, and other communications. Web pages also contain links and references to external, often unstructured content such as images, XML files, animations and databases. This paper focuses on extracting features from HTML pages using tokenization and non-negative matrix factorization (NMF). Text is classified using a bag-of-words approach. The workbench is a dataset collected from university-domain web pages.
Keywords
Web sites; XML; information retrieval; text analysis; text detection; HTML pages; XML files; animations; contracts; databases; emails; enterprise data; feature extraction; images; information extraction; nonmatrix factorization; relational tables; reports; telephone conversation transcripts; textual data; tokenization; university domain Web pages; unstructured content; unstructured data processing; unstructured systems; Information Extraction; NMF; Text Mining; Tokenization; Unstructured data;
fLanguage
English
Publisher
IEEE
Conference_Titel
Engineering (NUiCONE), 2012 Nirma University International Conference on
Conference_Location
Ahmedabad
Print_ISBN
978-1-4673-1720-7
Type
conf
DOI
10.1109/NUICONE.2012.6493202
Filename
6493202