Abstract :
This paper describes our efforts to apply supervised machine learning and natural language processing techniques, including Binomial Logistic Regression, Support Vector Machines, Neural Networks, Ensemble Techniques, and Latent Dirichlet Allocation (LDA), to the problem of detecting fraud in financial reporting documents available from the United States Securities and Exchange Commission (SEC) EDGAR database. Specifically, we apply LDA to a collection of Form 10-K financial reports to generate a document-topic frequency matrix, and then submit these data to a series of classification algorithms. We then use evaluation metrics, such as Precision, the Receiver Operating Characteristic (ROC) Curve, and the Area Under the Curve (AUC), to evaluate the performance of each algorithm. We conclude that these methods show promise and suggest applying the approach to a larger set of input documents.
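As an illustration of the evaluation metrics named in the abstract, the following minimal sketch computes Precision, the points of a Receiver Operating Characteristic curve, and the Area Under the Curve for a binary fraud classifier. It is not the implementation used in the paper: the function names, the 0.5 decision threshold, and the six labels and scores are hypothetical, chosen only to make the arithmetic easy to follow. The sketch assumes label 1 means "fraudulent" and that both classes are present.

```python
def precision(y_true, y_pred):
    """Fraction of documents flagged as fraudulent that really are fraudulent."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def roc_points(y_true, scores):
    """Return ROC points as (false positive rate, true positive rate) pairs.

    The threshold is swept from the highest score to the lowest; tied scores
    are grouped so that each distinct threshold yields exactly one point.
    Assumes y_true contains at least one positive and one negative label.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        is_last = i + 1 == len(ranked)
        if is_last or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve, using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data: 1 = fraudulent report, 0 = non-fraudulent report.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]       # classifier scores (illustrative)
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("precision:", precision(y_true, y_pred))
print("AUC:", auc(roc_points(y_true, scores)))
```

On this toy data, three of the four flagged documents are truly fraudulent, so Precision is 0.75, and eight of the nine (fraudulent, non-fraudulent) pairs are ranked correctly, so the AUC is 8/9. Precision depends on the chosen threshold, whereas the ROC curve and AUC summarize ranking quality across all thresholds, which is one reason the two are usually reported together.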
Keywords :
document handling; financial data processing; fraud; learning (artificial intelligence); matrix algebra; natural language processing; neural nets; pattern classification; regression analysis; security of data; support vector machines; EDGAR database; LDA; Securities and Exchange Commission; United States; area under the curve; binomial logistic regression; classification algorithms; document-topic frequency matrix; ensemble techniques; evaluation metrics; financial reporting documents; fraudulent financial reports detection; latent Dirichlet allocation; natural language processing techniques; neural networks; precision; receiver operating characteristic curve; supervised machine learning techniques; accuracy; correlation; logistics; ensemble; financial fraud detection; machine learning