Na této stránce máte možnost získat podklady k přednášce, kterou přednesl
Prof. Arturas Mazeika (Free University of Bozen-Bolzano, Italy)
dne 08.03.2007:
Estimating the Selectivity of Approximate String Queries
Abstract
Approximate queries on string data are important, due to the
prevalence of such data in databases and various conventions and
errors in string data. We present the VSol estimator, a novel
technique for estimating the selectivity of approximate string
queries. The VSol estimator is based on inverse strings
and makes the performance of the selectivity estimator independent
of the number of strings. To get inverse strings we decompose all
database strings into overlapping substrings of length q (q-grams)
and then associate each q-gram with its inverse string: the IDs of
all strings that contain the q-gram. We use signatures to compress
inverse strings.
We study our technique analytically and experimentally. The
space complexity of our estimator only depends on the number of
neighborhoods in the database and the desired estimation error.
The time to estimate the selectivity is independent of the
number of database strings and linear wrt the length of the
query string. We give a detailed empirical performance evaluation
of our solution for synthetic and real world datasets. We show that
VSol is effective for large skewed databases of short strings.
7.12.2007