英语论文网

h value. Although each subvector
is non-overlapping, the concatenated fingerprints have
substantial overlap. Assuming that the duration of the average
song is 3 minutes, then the number of fingerprints per song is
3 60 256=5 = 9216 10; 000.
Given this 8192-dimensional representation, the focus of our
work has been to develop an efficient search algorithm. This
search can be characterized as a nearest neighbor search in a
very high dimensional space. Of course, this assumes that the
nearest fingerprint in the database to the query is the correct
match. However, under some distortions, this assumption may
not be valid. Interestingly, it is less important to correctly
match the query to its corresponding fingerprint as it is to
match the fingerprint to its corresponding song. Since a song is
composed of many fingerprints, incorrectly matching a query
to a fingerprint does not necessarily lead to a song recognition
error. This issue is discussed in more detail in the experimental
evaluation of Section III.
High dimensional nearest neighbor search is a very well
studied problem. Proposed solutions generally create a tree
structure, the leaf nodes representing the known datum (fingerprints)
and searching becomes a traversal of the tree. Specific
algorithms differ in how this tree is constructed and traversed.
Two related data structures, kd-trees and vantage point or
vp-trees, have been extensively studied. However, both data
structures succumb to the curse of dimensionality, that is, as
the dimension of the datum increases, an increasing percentage
of the tree must be searched in order to locate the nearest
neighbor to a query.
1
Recent work [2], [3], [4], [5] appears to acknowledge
the fact that a perfect search that guarantees to find the
nearest neighbor in a high dimensional space is not feasible.
However, the curse of dimensionality can be removed if the
search is approximate. For example, Yianilos [5] describes
an algorithm that, with probability, p, will find a neighbor
within a Euclidean distance r of the query when the datum
are uniformly distributed within an n-dimensional hypercube.
Unfortunately, this work has not been extended to the binary
case and Hamming rather than Euclidean distance.
In this paper, we develop an approximate search algorithm
for high dimensional binary vectors. Section II first describes
the algorithm. Section III then presents experimental results on
a database of 1000 songs and 12,217,111 fingerprints. Finally,
Section IV summarizes our results and discusses possible
avenues of future work.
II. ALGORITHM
In the following subsections, we describe an approximate
search algorithm for binary vectors in a high dimension space.
Given the set of known fingerprints, we first construct a 256-
ary tree. Each 8192-bit fingerprint is represented as 1024 8-bit
bytes. The value of each consecutive byte in the fingerprint
determines which of the 256 possible children to descend. A
path from the root node to a leaf defines a fingerprint.
As the depth of the tree increases, in is common to find
nodes with only a single child. This is because the number
of actual fingerprints is very much less than the total possible
number. For efficiency purposes, we compress suchsequences
of nodes with only one child into a single node that represents
multip