DocumentCode :
2149594
Title :
Mathematical Formula Identification in PDF Documents
Author :
Lin, Xiaoyan ; Gao, Liangcai ; Tang, Zhi ; Lin, Xiaofan ; Hu, Xuan
Author_Institution :
Inst. of Comput. Sci. & Technol., Peking Univ., Beijing, China
fYear :
2011
fDate :
18-21 Sept. 2011
Firstpage :
1419
Lastpage :
1423
Abstract :
Recognizing mathematical expressions in PDF documents is a new and important field in document analysis. It is quite different from extracting mathematical expressions in image-based documents. In this paper, we propose a novel method by combining rule-based and learning-based methods to detect both isolated and embedded mathematical expressions in PDF documents. Moreover, various features of formulas, including geometric layout, character and context content, are used to adapt to a wide range of formula types. Experimental results show satisfactory performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale Chinese e-Book production.
Keywords :
document image processing; electronic publishing; knowledge based systems; learning (artificial intelligence); software packages; PDF documents; commercial software package; context content; document analysis; geometric layout; image based document; large-scale Chinese e-book production; learning based method; mathematical expression; mathematical formula identification; rule based method; Character recognition; Context; Feature extraction; Layout; Portable document format; Support vector machines; Text analysis; PDF document; formula extraction; mathematical expression recognition;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
ISSN :
1520-5363
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2011.285
Filename :
6065544