t is that objects can
be recognized despite significant changes in viewpoint and
some amount of illumination variation, and, because multiple
local regions are used, despite partial occlusion, since some of
the regions will remain visible in such cases. Examples of
extracted regions and matches are shown in Figs. 2 and 5.
In this paper, we cast this approach as one of text
retrieval. In essence, this requires a visual analogy of a
word, and here we provide this by vector quantizing the
descriptor vectors.

[Footnote: Manuscript received June 10, 2007; revised November 25, 2007.
This work was supported in part by the Mathematical and Physical Sciences
Division, University of Oxford, and in part by EC Project Vibes. The
authors are with the Department of Engineering Science, University of
Oxford, OX1 3PJ Oxford, U.K. (e-mail: [email protected];
[email protected]). Digital Object Identifier: 10.1109/JPROC.2008.916343.
Proceedings of the IEEE, Vol. 96, No. 4, April 2008.]

The benefit of the text retrieval
approach is that matches are effectively precomputed so
that at run time frames and shots containing any
particular object can be retrieved with no delay. This
means that any object occurring in the video (and
conjunctions of objects) can be retrieved even though
there was no explicit interest in these objects when
descriptors were built for the video.
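The precomputation that the text-retrieval framing buys can be sketched with a toy inverted index mapping visual words to the frames that contain them; the visual-word ids and frame contents below are invented for illustration, standing in for the quantized descriptors the paper computes:

```python
from collections import defaultdict

# Hypothetical precomputed data: frame id -> visual-word ids found in
# that frame (in the paper these come from quantized region descriptors).
frame_words = {
    0: [12, 7, 7, 3],
    1: [7, 99, 3],
    2: [5, 12],
}

# Inverted index: visual word -> set of frames containing it.
# Built once, offline, so queries need no descriptor matching at run time.
inverted_index = defaultdict(set)
for frame, words in frame_words.items():
    for w in words:
        inverted_index[w].add(frame)

def retrieve(query_words):
    """Return frames containing every visual word in the query
    (a conjunction, as when retrieving co-occurring objects)."""
    result = None
    for w in query_words:
        postings = inverted_index.get(w, set())
        result = postings if result is None else result & postings
    return sorted(result or set())
```

With this index, `retrieve([7, 3])` intersects two posting lists and returns the matching frames immediately, which is why any object (or conjunction of objects) can be queried after the fact with no delay.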
Note that the goal of this research is to retrieve
instances of a specific object, e.g., a specific bag or a
building with a particular logo (Figs. 1 and 2). This is in
contrast to retrieval and recognition of "object/scene
categories" [8], [11], [13], [14], [35], [44], sometimes also
called "high-level features" or "concepts" [4], [47], such as
"bags," "buildings," or "cars," where the goal is to find
any bag, building, or car, irrespective of its shape, color,
appearance, or any particular markings/logos.
We describe the steps by which we are able to use text
retrieval methods for object retrieval in Section II. Then in
Section III, we evaluate the proposed approach on a ground
truth set of six object queries. Object retrieval results,
including searches from within the movie and specified by
external images, are shown on feature films: "Groundhog
Day" [Ramis, 1993], "Charade" [Donen, 1963], and "Pretty
Woman" [Marshall, 1990]. Finally, in Section IV we
discuss three challenges for the presented video retrieval
approach and review some recent work addressing them.
II. TEXT RETRIEVAL APPROACH
TO OBJECT MATCHING
This section outlines the steps in building an object
retrieval system by combining methods from computer
vision and text retrieval.
Each frame of the video is represented by a set of
overlapping (local) regions with each region represented by
a visual word computed from its appearance. Section II-A
describes the visual regions and descriptors used.
Section II-B then describes their vector quantization into
visual "words." Sections II-C and II-D then show how text
retrieval techniques are applied to this visual word
representation. We will use the film "Groundhog Day"
as our running example, though the same method is
applied to all the feature films used in this paper.
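The quantization step that turns a region's appearance into a visual word can be sketched as nearest-centroid assignment; the 2-D "descriptors" and three cluster centres below are toy stand-ins (real descriptor vectors are much higher dimensional, and the centroids would come from clustering a training set):

```python
import numpy as np

def assign_visual_words(descriptors, centroids):
    """Quantize each descriptor to the id of its nearest cluster
    centre; that id is the descriptor's 'visual word'."""
    # Pairwise squared Euclidean distances, shape (n_descriptors, n_words).
    d = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    # Each descriptor gets the index of its closest centre.
    return d.argmin(axis=1)

# Toy vocabulary of three visual words and three region descriptors.
centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
descs = np.array([[0.5, 0.2], [9.0, 1.0], [1.0, 9.5]])
words = assign_visual_words(descs, centroids)
```

Here each frame's set of regions reduces to a bag of word ids, which is exactly the representation the inverted-index machinery of text retrieval expects.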
A. Viewpoint Invariant Description