Searching for Primes
What do prime numbers have to do with BitFunnel?
It turns out we use them to test our matching engine. One of the challenges in bringing up a new search engine is figuring out how to test it. If you happen to have another working search engine that has ingested the same corpus, you’re in luck - just compare its output with that of your new search engine.
Well, that’s the theory, anyway. In practice this is difficult for a number of reasons:
The “oracle” search engine may not have indexed the same corpus as the search engine under test. As an example, it would be great to use a production Bing server as our oracle, but no one knows, at any given moment, exactly which documents are on a particular machine, and the set of documents is constantly changing.
The two search engines may model documents differently. For example, the Bing servers include tons of metadata and information about click streams that isn’t meaningful to anyone outside of Bing. We could model all of this information in a BitFunnel test, but it would involve a lot of code that was only useful for the test.
We’d like to make all of our tests available as open source, so the data required to run the tests needs to be publicly available and small enough to store on GitHub.
Down the road, we plan to configure an instance of Lucene as our oracle, but today, we need a really small, lightweight test that can be used for debugging and run before every commit.
A Synthetic Corpus
Our solution was to generate a synthetic corpus. We wanted something with the following characteristics:
- Corpus should be trivial to generate.
- Arbitrarily large corpuses can be constructed efficiently.
- Match verification algorithm should be trivial.
- Match verification should be fast.
- Can model phrases.
At this point, our goal is to test our query pipeline as it transforms the Term tree into various Row trees, which become CompileNode trees, which drop into an ICodeGenerator that yields native x64 code or byte code for our interpreter.
For these tests, we’re not concerned with the probabilistic nature of BitFunnel - we just want to know if the matcher is computing the right boolean expression over bits loaded from rows. We can easily eliminate all probabilistic behavior by configuring the TermTable to place each term in its own, private row.
Since these tests eliminate probabilistic behavior, there is no requirement that our synthetic corpus have statistics that model a real world corpus. We can use really wacky documents, as long as they support enough interesting test cases.
Using Prime Factorizations
The solution we settled on was to model each document as containing only those terms corresponding to the integers that make up the prime factorization of the document’s id.
As an example, document number 100 might look something like:
Title: 100
2 2 5 5
and document 770 might look something like
Title: 770
2 5 7 11
and document 1223, which corresponds to a prime number, would have only a single term
Title: 1223
1223
With this document structure, it is trivial to determine if a document contains a specific prime number term. Here’s the code:
bool Contains(size_t docId, size_t term)
{
    // A document contains a given prime term exactly when that prime divides its id.
    return (docId % term) == 0ull;
}
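For example, using the documents above (this check is only meaningful when term is itself prime, since documents only ever contain prime-valued terms; CheckContainsExamples is just an illustrative wrapper, not part of the test suite):

#include <cassert>

void CheckContainsExamples()
{
    assert(Contains(100, 5));      // 100 = 2 * 2 * 5 * 5
    assert(!Contains(770, 3));     // 770 = 2 * 5 * 7 * 11
    assert(!Contains(1223, 2));    // 1223 is prime
}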
Phrase matches are easy to detect, as well, if we model the documents as ordered sequences of prime factors. All we need to do is ask whether the sequence of terms that makes up the phrase is a subsequence of the integers that make up the document’s prime factorization.
Suppose, for example, that we’re looking for the phrase “2 5”. This is equivalent to asking whether each document’s prime factorization sequence contains the sequence [2,5].
Consider the documents above:
- Document 100 is a match because [2,5] is a subsequence of [2,2,5,5].
- Document 770 is also a match because [2,5] is a subsequence of [2,5,7,11].
- Document 1223, on the other hand, is not a match because [2,5] is not a subsequence of [1223].
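A minimal sketch of this subsequence check, written here in plain standard C++ rather than copied from the BitFunnel sources (Factorize and IsPhraseMatch are illustrative names):

#include <cstddef>
#include <vector>

// Returns the prime factors of docId in nondecreasing order,
// e.g. Factorize(100) yields [2, 2, 5, 5].
std::vector<size_t> Factorize(size_t docId)
{
    std::vector<size_t> factors;
    for (size_t p = 2; p * p <= docId; ++p)
    {
        while ((docId % p) == 0)
        {
            factors.push_back(p);
            docId /= p;
        }
    }
    if (docId > 1)
    {
        factors.push_back(docId);
    }
    return factors;
}

// Returns true if the phrase's terms appear, in order, within the
// document's ordered sequence of prime factors.
bool IsPhraseMatch(size_t docId, std::vector<size_t> const & phrase)
{
    auto const factors = Factorize(docId);
    size_t next = 0;    // Index of the next phrase term still to be found.
    for (size_t f : factors)
    {
        if (next < phrase.size() && f == phrase[next])
        {
            ++next;
        }
    }
    return next == phrase.size();
}

With this sketch, IsPhraseMatch(100, {2, 5}) and IsPhraseMatch(770, {2, 5}) return true, while IsPhraseMatch(1223, {2, 5}) returns false, matching the three cases above.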
Implementation Details
The implementation turned out to be surprisingly simple: just under 200 lines of code in PrimeFactorsDocument.cpp.
Our first step was to create mock documents. The function CreatePrimeFactorsDocument just creates an off-the-shelf IDocument and then fills it with the prime factors of its DocId using calls to IDocument::AddTerm(). Here’s the relevant fragment of code:
for (size_t i = 0; i < Primes::c_primesBelow10000.size(); ++i)
{
    size_t p = Primes::c_primesBelow10000[i];
    if (p > docId)
    {
        break;
    }
    else
    {
        // Add one term for each occurrence of p in docId's factorization.
        while ((docId % p) == 0)
        {
            auto const & term = Primes::c_primesBelow10000Text[i];
            document->AddTerm(term.c_str());
            docId /= p;
            sourceByteSize += (1 + term.size());
        }
    }
}
Next we configured a TermTable to assign a private row to each term corresponding to a prime number. We only included mappings for primes up to the largest DocId. This TermTable gives us the desired non-probabilistic behavior for Terms corresponding to primes not exceeding the largest DocId.
Terms corresponding to larger primes or composite numbers will be implicitly mapped. The implicit rows, however, will only contain zeros because none of the documents contain terms corresponding to large primes or composite numbers.
The consequence is that queries involving larger primes and composites will never show probabilistic behavior and therefore never yield false positives. A query for the term “6”, for example, will correctly match no documents.
Here’s an excerpt from the function CreatePrimeFactorsTermTable(), which creates an ITermTable and then provisions it with explicit rows for the terms “0”, “1”, and each of the primes smaller than 10,000:
for (size_t i = 0; i < Primes::c_primesBelow10000.size(); ++i)
{
    size_t p = Primes::c_primesBelow10000[i];
    if (p > maxDocId)
    {
        break;
    }
    else
    {
        // Provision a private explicit row for the term corresponding to p.
        auto text = Primes::c_primesBelow10000Text[i];
        termTable->OpenTerm();
        termTable->AddRowId(RowId(rank, explicitRowCount0++));
        termTable->CloseTerm(Term::ComputeRawHash(text.c_str()));
    }
}
Our last step was to create a mock index. The function CreatePrimeFactorsIndex() just creates an ISimpleIndex with the prime factors term table replacing the default. Then a simple for-loop fills the index:
for (DocId docId = 0; docId <= maxDocId; ++docId)
{
    auto document =
        Factories::CreatePrimeFactorsDocument(
            index->GetConfiguration(),
            docId,
            maxDocId,
            streamId);
    index->GetIngestor().Add(docId, *document);
}