Advances in consumer electronics have enabled individual users to record, share, and view videos on mobile devices. As the volume of video on the Internet grows rapidly, fast and accurate video search has attracted considerable research attention. A good similarity measure is a key component of any video retrieval system. Most existing solutions rely solely on either low-level visual features or the surrounding textual annotations; such approaches often suffer from low recall because they are highly susceptible to changes in viewpoint and illumination and to noisy tags. By leveraging geo-metadata, more reliable and precise search results can be obtained. However, two issues remain challenging: (1) how to combine the spatial relevance of videos with their visual similarity to produce a ranking that matches users' needs, and (2) how to design a compact video representation that supports efficient indexing for fast retrieval. In this study, we propose a novel video description that consists of (a) determining the geographic coverage of a video from the camera's field-of-view and a pre-constructed geo-codebook, and (b) fusing video spatial relevance and region-aware visual similarities into a robust video similarity measure. To better encode a video's geo-coverage, we construct the geo-codebook by semantically segmenting a map into a collection of coherent regions. To evaluate the proposed technique, we developed a video retrieval prototype. Experiments show that our method improves the mean average precision
compared with existing approaches.
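To make the two components concrete, the sketch below illustrates, under simplifying assumptions, how a video's geo-coverage could be encoded as a histogram over geo-codebook regions (approximating the camera's field-of-view as a sector of sample points) and how spatial relevance and visual similarity could be fused into a single ranking score. The sector model, the region membership tests, and the fusion weight `alpha` are illustrative assumptions, not the paper's actual formulation.

```python
# A minimal sketch, NOT the paper's implementation, of the two ideas above:
# (a) encoding a video's geo-coverage as a histogram over the regions of a
# pre-constructed geo-codebook, using a sector approximation of the camera's
# field-of-view, and (b) late fusion of spatial relevance with visual
# similarity. Region tests, the sector model, and `alpha` are assumptions.
import math

def fov_sample_points(lat, lon, heading_deg, fov_deg=60.0, radius=0.005, n=50):
    """Approximate the camera's field-of-view as sample points inside a
    circular sector centered on the camera location."""
    pts = [(lat, lon)]
    for i in range(n):
        ang = math.radians(heading_deg - fov_deg / 2.0 + fov_deg * i / (n - 1))
        for r in (radius / 2.0, radius):
            pts.append((lat + r * math.cos(ang), lon + r * math.sin(ang)))
    return pts

def geo_coverage(frames, codebook):
    """Encode a video (a list of (lat, lon, heading) samples) as a
    normalized histogram over geo-codebook regions."""
    hist = {rid: 0.0 for rid in codebook}
    total = 0
    for lat, lon, heading in frames:
        for p in fov_sample_points(lat, lon, heading):
            for rid, contains in codebook.items():
                if contains(p):
                    hist[rid] += 1
                    total += 1
    return {rid: c / total for rid, c in hist.items()} if total else hist

def spatial_relevance(h1, h2):
    """Histogram intersection between two geo-coverage descriptors."""
    return sum(min(h1[r], h2[r]) for r in h1)

def fused_similarity(spatial, visual, alpha=0.5):
    """Weighted late fusion; alpha trades spatial against visual cues."""
    return alpha * spatial + (1.0 - alpha) * visual

# Toy geo-codebook: two rectangular "coherent regions" on a flat map.
codebook = {
    "riverside": lambda p: 0.00 <= p[0] < 0.01 and 0.00 <= p[1] < 0.01,
    "old_town":  lambda p: 0.01 <= p[0] < 0.02 and 0.00 <= p[1] < 0.01,
}
query = geo_coverage([(0.005, 0.005, 90.0)], codebook)
cand = geo_coverage([(0.006, 0.004, 45.0)], codebook)
score = fused_similarity(spatial_relevance(query, cand), visual=0.7, alpha=0.6)
print(f"fused ranking score: {score:.3f}")
```

One appeal of such a region histogram is that it is a sparse weight vector over codebook regions, so it stays compact and could be served from an inverted index, which speaks to the efficient-indexing challenge noted above.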