DocumentCode :
3434344
Title :
SemAC algorithm based on the tag semantic distance in network service information extraction
Author :
Wu, Xiaochun ; Xu, Qiuchen ; Zhang, Zhihui ; He, Gang
Author_Institution :
Sch. of Inf. & Commun. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear :
2010
fDate :
24-26 Sept. 2010
Firstpage :
393
Lastpage :
397
Abstract :
The paper analyzes methods of extracting information from network traffic, with an in-depth study of information extraction based on Tag markers. Tags, which contain specific core words, can be considered markers of different types of information. Due to the non-standardized nature of the Tag marker, a Tag often contains many interference characters, and traditional pattern matching algorithms may lead to mismatches or poor performance. After analyzing the characteristics of network service tags, we present a Tag word segmenting method and the concept of Tag semantic distance. By introducing the tag semantic distance concept, we improve the Aho-Corasick algorithm to identify the tags of interest when extracting network service information. The improved algorithm, SemAC, is based on tag word segmenting and semantic distance calculation by dual-way scanning on a words-map. Using this improved AC algorithm with pre-defined core words, we successfully identified the tags and extracted the required information from the packet streams of web sites. The time complexity is analyzed at the end of the paper.
Keywords :
computational complexity; pattern matching; telecommunication congestion control; Aho-Corasick algorithm; SemAC algorithm; Web sites; dual-way scanning; network service information extraction; network traffic; pattern matching algorithm; semantic distance calculation; tag marker; tag semantic distance concept; tag word segmenting method; time complexity; words-map; Algorithm design and analysis; Automata; Complexity theory; Data mining; Protocols; Semantics; information extraction; multiple pattern matching; semantic distance
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Network Infrastructure and Digital Content, 2010 2nd IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-6851-5
Type :
conf
DOI :
10.1109/ICNIDC.2010.5657798
Filename :
5657798