Title :
Processing of unstructured data for information extraction
Author_Institution :
Dept. of Comput. Sci. & IT, Dr. B.A.M. Univ., Aurangabad, India
Abstract :
Unstructured data have no predetermined form or structure and consist largely of text. Such data do not fit well into relational tables. Most enterprise data today can be considered unstructured. Typical unstructured sources include emails, reports, contracts, transcripts of telephone conversations, and other communications. Web pages also contain links and references to external, often unstructured content such as images, XML files, animations, and databases. This paper focuses on extracting features from HTML pages using tokenization and non-negative matrix factorization (NMF). Text is classified using a bag-of-words approach. The workbench is a dataset of web pages collected from the university domain.
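The pipeline named in the abstract (tokenizing HTML pages, building bag-of-words counts, and factorizing them) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that NMF refers to non-negative matrix factorization, uses the standard multiplicative update rules as one possible solver, and every name in it (TextExtractor, tokenize, bag_of_words, nmf) is hypothetical. It relies only on the Python standard library.

```python
import random
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def tokenize(html):
    """Strip markup, lowercase, and split into alphabetic tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z]+", " ".join(parser.parts).lower())


def bag_of_words(docs):
    """Return (vocabulary, document-term count matrix) for HTML strings."""
    token_lists = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in token_lists for t in toks})
    rows = []
    for toks in token_lists:
        counts = Counter(toks)
        rows.append([counts[t] for t in vocab])
    return vocab, rows


def transpose(a):
    return [list(r) for r in zip(*a)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Approximate v (n x m) as w (n x k) times h (k x m), all non-negative.

    Assumption: multiplicative updates; the paper does not state its solver.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h
```

In this sketch each row of w gives a page's weights over k latent features, and each row of h gives a feature's weights over the vocabulary. Either could serve as input to a text classifier, although the abstract does not say how the two steps are combined.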
Keywords :
Web sites; XML; information retrieval; text analysis; text detection; HTML pages; XML files; animations; contracts; databases; emails; enterprise data; feature extraction; images; information extraction; nonmatrix factorization; relational tables; reports; telephone conversation transcripts; textual data; tokenization; university domain Web pages; unstructured content; unstructured data processing; unstructured systems; Information Extraction; NMF; Text Mining; Tokenization; Unstructured data;
Conference_Title :
Engineering (NUiCONE), 2012 Nirma University International Conference on
Conference_Location :
Ahmedabad
Print_ISBN :
978-1-4673-1720-7
DOI :
10.1109/NUICONE.2012.6493202