Title :
Using Text Analysis to Understand the Structure and Dynamics of the World Wide Web as a Multi-Relational Graph
Author :
Sethu, Harish; Yates, Alexander
Author_Institution :
Dept. of ECE, Drexel Univ., Philadelphia, PA, USA
Abstract :
A representation of the World Wide Web as a directed graph, with vertices representing web pages and edges representing hypertext links, underpins the algorithms used by web search engines today. However, this representation involves a key oversimplification of the true complexity of the Web: an edge in the traditional Web graph represents only the existence of a hyperlink; information on the context (e.g., informational, adversarial, commercial, spam) behind the hyperlink is absent. In this work-in-progress paper, we describe an ongoing collaborative project between two teams, one specializing in network science and analysis and the other specializing in text analysis and machine learning, to address this oversimplification. Using techniques in natural language processing, text mining and machine learning to extract relevant features of hyperlinks and classify them into one of several types, this undertaking builds and analyzes a multi-relational web graph. A key aspect of this work is that the multi-relational graph emerges naturally from the data instead of being based on an imposed classification of the hyperlinks.
Keywords :
Internet; data mining; directed graphs; learning (artificial intelligence); natural language processing; search engines; text analysis; Web search engines; World Wide Web; directed graph; hypertext links; machine learning; multi-relational graph; text mining; Complexity theory; Feature extraction; Social network services; USA Councils; Web pages; Web graph; classification; clustering; graph sampling; network science; web search
Conference_Title :
Social Computing (SocialCom), 2010 IEEE Second International Conference on
Conference_Location :
Minneapolis, MN
Print_ISBN :
978-1-4244-8439-3
Electronic_ISBN :
978-0-7695-4211-9
DOI :
10.1109/SocialCom.2010.105