Processing of unstructured data for information extraction

Author

Ingle, V.A.

Author_Institution

Dept. of Comput. Sci. & IT, Dr. B.A.M. Univ., Aurangabad, India

fYear

2012

fDate

6-8 Dec. 2012

Firstpage

1

Lastpage

4

Abstract

Unstructured data have no predetermined form or structure and consist largely of text, so they do not fit well into relational tables. Most enterprise data today can be considered unstructured. Typical unstructured sources include emails, reports, contracts, transcripts of telephone conversations, and other communications. Web pages also contain links and references to external, often unstructured content such as images, XML files, animations and databases. This paper focuses on extracting features from HTML pages using tokenization and non-negative matrix factorization (NMF). Text is classified using a bag-of-words approach. The workbench is a dataset collected from university-domain web pages.
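
The pipeline named in the abstract (tokenization, bag of words, matrix factorization) might be sketched as follows. This is a minimal illustration and not the paper's implementation: it assumes the factorization meant is non-negative matrix factorization (the keyword list gives "NMF"), it uses Lee-Seung multiplicative updates as one common way to compute it, and every function name and sample page below is hypothetical.

```python
import random
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def tokenize(html):
    """Strip markup and return lowercase alphabetic tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z]+", " ".join(parser.parts).lower())


def bag_of_words(docs_tokens):
    """Return (vocabulary, document-term count matrix)."""
    vocab = sorted({t for toks in docs_tokens for t in toks})
    index = {t: j for j, t in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in docs_tokens]
    for i, toks in enumerate(docs_tokens):
        for t in toks:
            matrix[i][index[t]] += 1.0
    return vocab, matrix


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(row) for row in zip(*A)]


def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximate V (n x m) as W (n x k) times H (k x m), all non-negative.

    Assumption: Lee-Seung multiplicative updates for the squared
    Frobenius error; the paper does not say which algorithm it used.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H


if __name__ == "__main__":
    # Hypothetical pages standing in for university-domain web pages.
    pages = [
        "<h1>Admissions</h1><p>Apply for admission to the university.</p>",
        "<h1>Library</h1><p>The library lends books and journals.</p>",
        "<p>Admission forms and admission fees.</p>",
    ]
    vocab, V = bag_of_words([tokenize(p) for p in pages])
    W, H = nmf(V, k=2)
    for page_weights in W:
        print([round(w, 3) for w in page_weights])
```

Each row of `W` gives one page's weights over the `k` latent features, and each row of `H` gives a feature's weights over the vocabulary. The rows of `W` could then serve as compact inputs to whatever classifier is chosen; that step is left out here because the abstract does not name one.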

Keywords

Web sites; XML; information retrieval; text analysis; text detection; HTML pages; XML files; animations; contracts; databases; emails; enterprise data; feature extraction; images; information extraction; nonmatrix factorization; relational tables; reports; telephone conversation transcripts; textual data; tokenization; university domain Web pages; unstructured content; unstructured data processing; unstructured systems; Information Extraction; NMF; Text Mining; Tokenization; Unstructured data;

fLanguage

English

Publisher

IEEE

Conference_Titel

Engineering (NUiCONE), 2012 Nirma University International Conference on

Conference_Location

Ahmedabad

Print_ISBN

978-1-4673-1720-7

Type

conf

DOI

10.1109/NUICONE.2012.6493202

Filename

6493202