DocumentCode :
2226588
Title :
Towards modernised and Web-specific stoplists for Web document analysis
Author :
Sinka, Mark P. ; Corne, David W.
Author_Institution :
Reading Univ., UK
fYear :
2003
fDate :
13-17 Oct. 2003
Firstpage :
396
Lastpage :
402
Abstract :
Research areas such as text classification and document clustering underpin many issues in Web intelligence. A fundamental tool in document clustering is a list of 'stop' words (stoplist), which is used to identify frequent words that are unlikely to assist in classification and are hence removed during pre-processing. Current stoplists are outdated in light of fluctuations in word usage and lack 'Web-specific' stop words, which calls into question their applicability to Web-based tasks. We explore this by developing new word-entropy based stoplists: one derived from random Web pages, and one derived from the BankSearch dataset. We evaluate these against other stoplists using accuracies obtained from unsupervised clustering experiments. We find that existing stoplists perform well, but are sometimes outperformed by our new stoplists, especially on hard classification tasks.
Keywords :
Internet; document handling; information retrieval; pattern classification; pattern clustering; search engines; Web document analysis; Web intelligence; Web pages; Web-specific stoplists; document clustering; unsupervised clustering experiments; word-entropy based stoplists; Databases; Fluctuations; Indexes; Taxonomy; Text analysis; Text categorization; Uniform resource locators
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Web Intelligence, 2003. WI 2003. Proceedings. IEEE/WIC International Conference on
Print_ISBN :
0-7695-1932-6
Type :
conf
DOI :
10.1109/WI.2003.1241221
Filename :
1241221