Title :
An automated change-detection algorithm for HTML documents based on semantic hierarchies
Author :
Lim, Seung-Jin ; Ng, Yiu-Kai
Author_Institution :
Dept. of Comput. Sci., Brigham Young Univ., Provo, UT, USA
Abstract :
The data at many Web sites is changing rapidly, and a significant amount of this data is presented in HTML documents that consist of markups and data contents. Although XML is becoming more popular for data exchange, the presentation of data contained in XML documents is given, by and large, in the HTML format using XSL(T). Since HTML was designed to “display” data from the human perspective, it is not trivial for a machine to detect (hierarchical) changes of data in an HTML document. In this paper, we propose a heuristic algorithm, called SCD (Semantic Change Detection), to detect semantic changes to the hierarchical data contents in any two HTML documents automatically. Semantic changes differ from syntactic changes since the latter refer to changes of data contents with respect to markup structures according to the HTML grammar. SCD requires neither pre-processing nor prior knowledge of the internal structure of the source documents. The time complexity of SCD is O[(|X|×|Y|)log(|X|×|Y|)], where |X| and |Y| are the numbers of unique branches in the syntactic hierarchies of any two given HTML documents, respectively.
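To make the quantities |X| and |Y| in the complexity bound concrete, the following is a minimal sketch and not the SCD algorithm itself. It assumes that a "unique branch" of the syntactic hierarchy can be approximated by a distinct root-to-text path of tag names; that reading, and the names BranchCollector, unique_branches, and comparison_bound, are hypothetical and not taken from the paper.

```python
from html.parser import HTMLParser
import math

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BranchCollector(HTMLParser):
    """Collects distinct tag paths that lead to non-empty text (assumed 'branches')."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, outermost first
        self.branches = set()  # distinct root-to-text tag paths

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy HTML: unwind to the nearest matching open tag, if any.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.branches.add(tuple(self.stack))


def unique_branches(html):
    """Return the set of distinct tag paths that carry text in the document."""
    collector = BranchCollector()
    collector.feed(html)
    collector.close()
    return collector.branches


def comparison_bound(x, y):
    """Evaluate (x*y) * log2(x*y), the growth term in the stated complexity."""
    n = x * y
    return n * math.log2(n) if n > 1 else 0.0


old_doc = "<html><body><h1>News</h1><ul><li>A</li><li>B</li></ul></body></html>"
new_doc = "<html><body><h1>News</h1><p>Intro</p><ul><li>A</li></ul></body></html>"
x = len(unique_branches(old_doc))
y = len(unique_branches(new_doc))
print(x, y, comparison_bound(x, y))
```

Under this assumed definition, the two `li` items in the first document share one branch, so repeated sibling structures do not inflate |X| or |Y|; only structurally distinct paths count toward the bound.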
Keywords :
computational complexity; hypermedia markup languages; information resources; HTML documents; HTML grammar; SCD algorithm; Web sites; XML documents; XSL(T); changing rapidly data; data contents; data display; data exchange; data presentation; heuristic algorithm; hierarchical data contents; markup structures; semantic change detection algorithm; semantic hierarchies; syntactic hierarchies; time complexity; unique branches; Change detection algorithms; Computer science; Displays; Eyes; HTML; Heuristic algorithms; Humans; Testing; XML;
Conference_Title :
Data Engineering, 2001. Proceedings. 17th International Conference on
Conference_Location :
Heidelberg
Print_ISBN :
0-7695-1001-9
DOI :
10.1109/ICDE.2001.914842