Automatic news extraction system for Indian online news papers

Author

Wanjari, Yogesh W. ; Mohod, Vivek D. ; Gaikwad, Dipali B. ; Deshmukh, Sachin N.

Author_Institution

Dept. of CS & IT, Dr. Babasaheb Ambedkar Marathwada Univ., Aurangabad, India

fYear

2014

fDate

8-10 Oct. 2014

Firstpage

1

Lastpage

6

Abstract

Web technology is becoming increasingly important in day-to-day life. Everyone is familiar with surfing the Web, uploading personal or important data to the Web, and sharing data with friends on social networks. Indian online newspapers produce more data on the Web every day. Various technologies and research efforts focus on extracting relevant information from large Web data stores. However, there is still a need for automatic annotation of the extracted information in a systematic form so that it can be processed further for various purposes. This paper presents an effective approach for extracting content from the Web databases of Indian online newspapers. First, we browse Web pages according to the input URL given by the user. Next, we generate a DOM tree of the news Web page. Finally, we identify and extract the valuable news content from the Indian news Web pages and remove noisy data. Moreover, we propose a novel approach for extracting data from online Indian newspapers written in many popular languages such as Marathi, Hindi, Tamil, Gujarati, Kannada, Oriya, Telugu, and Punjabi. Experimental results can be analysed easily in this domain. The proposed system is the first attempt in India at news extraction from online Web pages available in various Indian languages.
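
The pipeline outlined in the abstract (build a DOM tree, extract news text, drop noisy data) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Node`, `DomBuilder`, `extract_news`, the `NOISE_TAGS` list, and the rule "keep text inside `<p>` elements" are all assumptions made for illustration, and the page is supplied as a string rather than fetched from a user-supplied URL.

```python
from html.parser import HTMLParser

# Tags treated as noise in this sketch (an assumption, not the paper's list).
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # mix of Node objects and text strings, in order


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore strays.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def text_of(node):
    """Concatenate the text under a node, skipping noisy subtrees."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in NOISE_TAGS:
            parts.append(text_of(child))
    return " ".join(p for p in parts if p)


def extract_paragraphs(node, out):
    """Collect <p> text in document order, pruning noise tags entirely."""
    if node.tag in NOISE_TAGS:
        return
    if node.tag == "p":
        out.append(text_of(node))
        return
    for child in node.children:
        if not isinstance(child, str):
            extract_paragraphs(child, out)


def extract_news(html):
    """Parse an HTML string and return the retained paragraph texts."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    paragraphs = []
    extract_paragraphs(builder.root, paragraphs)
    return paragraphs


# Hypothetical page: navigation, script, and footer are noise.
SAMPLE = """<html><body>
<nav><a href="/">Home</a><p>Menu text</p></nav>
<div id="story"><h1>Headline</h1>
<p>First <b>news</b> paragraph.</p>
<p>यह एक समाचार है।</p>
<script>var ad = 1;</script></div>
<footer><p>Copyright</p></footer>
</body></html>"""

if __name__ == "__main__":
    for paragraph in extract_news(SAMPLE):
        print(paragraph)
```

Because Python strings are Unicode, the same traversal applies unchanged to pages in Hindi, Marathi, or other Indian scripts; a real system would still need to handle each site's character encoding and tag patterns, which this sketch does not attempt.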

Keywords

Internet; electronic publishing; hypermedia markup languages; information retrieval; natural language processing; DOM tree; Gujarati language; HTML; Hindi language; Indian language; Indian online news Web papers; Kannada language; Marathi language; Oriya language; Punjabi language; Tamil language; Telugu language; URL; Web pages browsing; Web surfing; Web technology; automatic annotation; automatic news extraction system; contents extraction; data extraction; data sharing; document object model; information extraction; large Web data storage; news Web databases; noisy data removal; personal data uploading; social communities; Browsers; Data mining; Databases; HTML; Manuals; Web pages; DOM tree generation; Data extraction; Tag pattern generation; Wrapper;

fLanguage

English

Publisher

IEEE

Conference_Titel

Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), 2014 3rd International Conference on

Conference_Location

Noida

Print_ISBN

978-1-4799-6895-4

Type

conf

DOI

10.1109/ICRITO.2014.7014750

Filename

7014750