DocumentCode :
2146546
Title :
Creation and Analysis of a Corpus of Text Rich Indian TV Videos
Author :
Chattopadhyay, T. ; Sengupta, Soumik ; Sinha, Aniruddha ; Rampuria, Nisha
Author_Institution :
Innovation Lab., Tata Consultancy Services, Kolkata, India
fYear :
2011
fDate :
18-21 Sept. 2011
Firstpage :
849
Lastpage :
853
Abstract :
Considerable research is under way on extracting the context of a TV show in order to provide additional information related to it. One major method of extracting context from TV is to recognize the text in the video, which is also known as video Optical Character Recognition (VOCR). VOCR is more difficult for the TV shows of a multilingual country like India. In India, more than 90% of TV viewers still use RF cable as the input to their TV, and nearly 90% of channels show multilingual text in their TV shows. As a result, the video quality is poor compared to modern digital TV signals, and different text scripts are present in a single video frame. These factors make Indian TV context recognition more challenging. This paper therefore concerns the construction of a video corpus of text-rich Indian TV shows. The proposed database contains more than 100 videos, each of nearly 10 min duration, containing text in the video frame. A statistical analysis of the corpus is also presented, which can be used to identify the genre of a TV show. The analysis also revealed that the distribution of numerals, special characters, and uppercase and lowercase characters can be used to classify a news video frame. The corpus is useful for a wide variety of research problems, namely (i) localization of the text regions in a video frame, (ii) recognition of text from a video frame, (iii) extraction of context from video, and (iv) performance evaluation of a video OCR system.
Keywords :
optical character recognition; statistical analysis; television; text analysis; ubiquitous computing; video signal processing; Indian TV context recognition; RF cable; VOCR; context extraction; corpus analysis; corpus creation; lower case character; modern digital TV signal; multilingual country; multilingual text region localization; statistical analysis; text recognition; text rich Indian TV shows; text rich Indian TV video; text script; video OCR system; video corpus construction; video frame; video optical character recognition; Context; Motion pictures; Optical character recognition software; Statistical analysis; TV; Text recognition; Videos; Corpus; Indian TV Video Analysis; Indian TV video; Video OCR;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
ISSN :
1520-5363
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2011.174
Filename :
6065431