Abstract:
For the sake of national security, very large volumes of
data and information are generated and gathered daily.
Much of this data and information is written in different
languages, stored in different locations, and may be
seemingly unconnected. Achieving crosslingual semantic
interoperability is a major challenge in generating an overview of
this disparate data and information so that it can be analyzed,
shared, searched, and summarized. The recent
terrorist attacks and the tragic events of September 11,
2001, have prompted increased attention to national security
and criminal analysis. Many Asian countries and
cities, such as Japan, Taiwan, and Singapore, have been
advised that they may become the next targets of terrorist
attacks. Semantic interoperability has been a focus in
digital library research. Traditional information retrieval
(IR) approaches normally require a document to share
some common keywords with the query, yet users often
describe a concept with terms that differ from those in the
documents. Generating associations between related terms
in the two term spaces of users and documents is therefore
an important issue, and the problem can be viewed as the
creation of a thesaurus.
In addition, terrorists and criminals may
communicate through letters, e-mails, and faxes in languages
other than English. Translation ambiguity significantly
exacerbates the retrieval problem, which then expands
into one of crosslingual semantic interoperability. In
this paper, we focus on the English/Chinese crosslingual
semantic interoperability problem. However, the techniques
developed are not limited to English and Chinese
and can be applied to many other languages.
English and Chinese are popular languages in the Asian
region. Much information about national security or crime
is communicated in these languages. An efficient, automatically
generated thesaurus between these languages
is important for crosslingual information retrieval between
English and Chinese. To facilitate crosslingual
information retrieval, a corpus-based approach uses
term co-occurrence statistics in parallel or comparable
corpora to construct a statistical translation model that
crosses the language boundary. In this paper, we first present
a text-based approach to aligning English/Chinese Hong Kong
Police press release documents from the Web.
We also introduce an algorithmic approach to
generating a robust knowledge base through statistical
correlation analysis of the semantics (knowledge) embedded
in the bilingual press release corpus. The research
output consists of a thesaurus-like semantic
network knowledge base, which can aid in semantics-based
crosslingual information management and retrieval.
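
As a rough illustration of how term co-occurrence statistics over an aligned bilingual corpus can yield thesaurus-like associations, the following minimal sketch scores English/Chinese term pairs with the Dice coefficient. It is not the method developed in this paper: the Dice measure, the function name cooccurrence_thesaurus, the min_score threshold, and the toy term lists are all assumptions chosen for illustration, and the sketch presumes that documents are already aligned and that terms have already been extracted.

```python
from collections import Counter
from itertools import product


def cooccurrence_thesaurus(aligned_pairs, min_score=0.5):
    """Build a term-association table from aligned bilingual documents.

    aligned_pairs: iterable of (english_terms, chinese_terms), one tuple per
    aligned document pair (hypothetical input format; terms are pre-extracted).
    Returns {english_term: [(chinese_term, dice_score), ...]}, best link first.
    """
    en_df = Counter()    # number of pairs whose English side contains the term
    zh_df = Counter()    # number of pairs whose Chinese side contains the term
    pair_df = Counter()  # number of pairs in which both terms appear

    for en_terms, zh_terms in aligned_pairs:
        en_set, zh_set = set(en_terms), set(zh_terms)
        en_df.update(en_set)
        zh_df.update(zh_set)
        pair_df.update(product(en_set, zh_set))

    thesaurus = {}
    for (en, zh), both in pair_df.items():
        # Dice coefficient: an assumed association measure, used only as an example.
        score = 2 * both / (en_df[en] + zh_df[zh])
        if score >= min_score:
            thesaurus.setdefault(en, []).append((zh, score))
    for links in thesaurus.values():
        links.sort(key=lambda link: (-link[1], link[0]))
    return thesaurus


# Toy, made-up data standing in for aligned press releases.
toy_corpus = [
    (["police", "robbery", "arrest"], ["警方", "劫案", "拘捕"]),
    (["police", "arrest", "drugs"], ["警方", "拘捕", "毒品"]),
    (["police", "robbery"], ["警方", "劫案"]),
]
links = cooccurrence_thesaurus(toy_corpus)
```

On this toy data, the strongest link for each English term is the Chinese term that appears in exactly the same document pairs. A real corpus would need far more documents, and likely a more discriminating weighting scheme, before the weaker links become meaningful.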