Title :
Automatic news extraction system for Indian online news papers
Author :
Wanjari, Yogesh W. ; Mohod, Vivek D. ; Gaikwad, Dipali B. ; Deshmukh, Sachin N.
Author_Institution :
Dept. of CS & IT, Dr. Babasaheb Ambedkar Marathwada Univ., Aurangabad, India
Abstract :
Web technology is becoming increasingly important in day-to-day life. Most people are familiar with surfing the Web, uploading personal or important data to it, and sharing data with friends on social networks. Indian online newspapers publish more data on the Web every day. Various technologies and research efforts focus on extracting relevant information from large Web data stores, but there is still a need for automatic annotation that organises the extracted information systematically so that it can be processed further for various purposes. This paper presents an effective approach for extracting content from the Web databases of Indian online newspapers. First, we browse the Web page at the URL given by the user. Next, we generate a DOM tree of the news Web page. Finally, we identify and extract the valuable news content from the page and remove noisy data. Moreover, we propose a novel approach for extracting data from online Indian newspapers written in many popular languages, such as Marathi, Hindi, Tamil, Gujarati, Kannada, Oriya, Telugu and Punjabi. Experimental results can be analysed easily in this domain. The proposed system is the first attempt in India to extract news from online Web pages available in various Indian languages.
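The three steps named in the abstract (fetch the page at a user-supplied URL, build a DOM tree, keep the news text and drop noise) can be illustrated with a minimal Python sketch that uses only the standard library. It is not the authors' implementation: the abstract does not describe their tag-pattern or wrapper rules, so the noise list (`NOISE_TAGS`), the choice to keep only `<p>` text, and the names `Node`, `DomBuilder`, `text_of`, `extract_news` and `fetch` are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Assumption: tags treated as noise in this sketch (not the paper's list).
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "form"}
# Void elements have no closing tag, so they are never made the current node.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """Minimal DOM node: a tag name plus children (Node or str) in order."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a simple DOM tree from HTML text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def text_of(node):
    """Join the text under a node, skipping noisy subtrees."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in NOISE_TAGS:
            parts.append(text_of(child))
    return " ".join(p for p in parts if p)


def extract_news(html):
    """Return the text of every <p> that is not inside a noise element."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    paragraphs = []

    def walk(node):
        for child in node.children:
            if isinstance(child, str) or child.tag in NOISE_TAGS:
                continue
            if child.tag == "p":
                text = text_of(child)
                if text:
                    paragraphs.append(text)
            else:
                walk(child)

    walk(builder.root)
    return paragraphs


def fetch(url):
    """Download a page and decode it; falls back to UTF-8 (an assumption)."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Hypothetical page fragment with Hindi text, menu, script and footer noise.
SAMPLE = """
<html><body>
<nav><p>Home | Sports</p></nav>
<div id="story"><h1>शीर्षक</h1><p>यह एक <b>समाचार</b> है।</p><script>var x = 1;</script></div>
<footer><p>Copyright</p></footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_news(SAMPLE))
```

For a live page, `extract_news(fetch(url))` would chain the steps. Because Python strings are Unicode, the same code handles Marathi, Tamil or Telugu text unchanged, provided the page is decoded with the correct charset. Joining text fragments with single spaces is a simplification and can misplace spacing around inline tags.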
Keywords :
Internet; electronic publishing; hypermedia markup languages; information retrieval; natural language processing; DOM tree; Gujarati language; HTML; Hindi language; Indian language; Indian online news Web papers; Kannada language; Marathi language; Oriya language; Punjabi language; Tamil language; Telugu language; URL; Web pages browsing; Web surfing; Web technology; automatic annotation; automatic news extraction system; contents extraction; data extraction; data sharing; document object model; information extraction; large Web data storage; news Web databases; noisy data removal; personal data uploading; social communities; Browsers; Data mining; Databases; Manuals; Web pages; DOM tree generation; Tag pattern generation; Wrapper
Conference_Title :
Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), 2014 3rd International Conference on
Conference_Location :
Noida
Print_ISBN :
978-1-4799-6895-4
DOI :
10.1109/ICRITO.2014.7014750