Regional Information Center for Science and Technology - Identification of Latin-Based Languages through Character Stroke Categorization

DocumentCode :

2022289

Title :

Identification of Latin-Based Languages through Character Stroke Categorization

Author :

Shijian Lu ; Linlin Li ; Chew Lim Tan

Author_Institution :

Nat. Univ. of Singapore, Singapore

Volume :

fYear :

2007

fDate :

23-26 Sept. 2007

Firstpage :

352

Lastpage :

356

Abstract :

This paper presents a language identification technique that detects the Latin-based language of an imaged document without OCR. The technique works through word shape coding: each word image is converted into a word shape code, and each document image is thereby transformed into an electronic document vector. For each Latin-based language under study, a language template is first constructed through a corpus-based learning process. The underlying language of a query document is then determined from the similarity between the query document vector and each of the constructed language templates. Compared with previously reported methods, the proposed technique is fast, accurate, and tolerant of text segmentation errors caused by noise and various types of document degradation. Experimental results are promising.
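The template-matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how strokes are categorized, how templates are aggregated, or which similarity measure is used. The sketch therefore assumes that word shape codes are already available as plain strings, that a template is the mean of the per-document code frequencies, and that cosine similarity is the comparison. All function names and the toy codes are hypothetical.

```python
from collections import Counter
from math import sqrt


def document_vector(word_codes):
    """Map a list of word shape codes to relative code frequencies."""
    counts = Counter(word_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}


def build_template(corpus_documents):
    """Average the document vectors of one language's training corpus.

    Assumption: the template is a plain mean of document vectors; the
    abstract only says it is learned from a corpus.
    """
    if not corpus_documents:
        return {}
    summed = Counter()
    for word_codes in corpus_documents:
        for code, freq in document_vector(word_codes).items():
            summed[code] += freq
    return {code: value / len(corpus_documents) for code, value in summed.items()}


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(value * b.get(code, 0.0) for code, value in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_language(query_codes, templates):
    """Return the language whose template is most similar to the query."""
    query = document_vector(query_codes)
    return max(templates, key=lambda lang: cosine_similarity(query, templates[lang]))


if __name__ == "__main__":
    # Toy word shape codes, invented purely for illustration.
    templates = {
        "english": build_template([["axa", "xd", "axa"], ["xd", "axa"]]),
        "german": build_template([["dxx", "xxg"], ["dxx", "dxx", "xxg"]]),
    }
    print(identify_language(["axa", "xd"], templates))
```

Because only the relative frequencies of whole word codes are compared, a few mis-segmented words shift the query vector slightly rather than breaking the match, which is consistent with the tolerance to segmentation error claimed in the abstract.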

Keywords :

natural language processing; text analysis; word processing; Latin-based languages; character stroke categorization; imaged documents; language identification; query document vector; text segmentation error; word shape coding; Computer science; Engines; Image coding; Image converters; Image segmentation; Labeling; Optical character recognition software; Optical noise; Shape; Switches;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on

Conference_Location :

Parana

ISSN :

1520-5363

Print_ISBN :

978-0-7695-2822-9

Type :

conf

DOI :

10.1109/ICDAR.2007.4378731

Filename :

4378731

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2022289