DocumentCode :
259735
Title :
Iterative Hard Thresholding for Keyword Extraction from Large Text Corpora
Author :
Yadlowsky, Steve ; Nakkarin, Preetum ; Wang, Jingyan ; Sharma, Rishi ; El Ghaoui, Laurent
Author_Institution :
Electr. Eng. & Comput. Sci., Univ. of California, Berkeley, Berkeley, CA, USA
fYear :
2014
fDate :
3-6 Dec. 2014
Firstpage :
588
Lastpage :
593
Abstract :
To better understand and analyze text corpora, such as the news, it is often useful to extract keywords that are meaningfully associated with a given topic. A corpus of documents labeled by topic can be used to approach this as a learning problem. We consider this problem through the lens of statistical text analysis, using bag-of-words frequencies as features for a sparse linear model. We demonstrate, through numerical experiments, that iterative hard thresholding (IHT) is a practical and effective algorithm for keyword extraction from large text corpora. In fact, our implementation of IHT can quickly analyze more than 800,000 documents, returning keywords comparable to those from algorithms that solve a Lasso problem formulation, in significantly less computation time. Further, we generalize the analysis of the IHT algorithm to show that it is stable for rank-deficient matrices, as those arising from our bag-of-words model often are.
Keywords :
information retrieval; iterative methods; statistical analysis; text analysis; IHT algorithm; Lasso problem-formulation; bag-of-words frequencies; iterative hard thresholding; keyword extraction; rank deficient matrices; sparse linear model; statistical text analysis; text corpora;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Applications (ICMLA), 2014 13th International Conference on
Conference_Location :
Detroit, MI
Type :
conf
DOI :
10.1109/ICMLA.2014.101
Filename :
7033182