Searching for Primes · BitFunnel

Searching for Primes

8f49fe3932

What do prime numbers have to do with BitFunnel?

It turns out we use them to test our matching engine. One of the challenges in bringing up a new search engine is figuring out how to test it. If you happen to have another working search engine that has ingested the same corpus, you’re in luck - just compare its output with that of your new search engine.

Well that’s the theory, anyway. In practice this is difficult for a number of reasons:

Down the road, we plan to configure an instance of Lucene as our oracle, but today, we need a really small, lightweight test that can be used for debugging and run before every commit.

A Synthetic Corpus

Our solution was to generate a synthetic corpus. We wanted something with the following characteristics:

At this point, our goal is to test our query pipeline as it transforms the Term tree into various Row trees which become CompileNode trees which drop into an ICodeGenerator which yields native x64 code or byte code for our interpreter.

For these tests, we’re not concerned with the probabilistic nature of BitFunnel - we just want to know if the matcher is computing the right boolean expression over bits loaded from rows. We can easily eliminate all probabilistic behavior by configuring the TermTable to place each term in its own, private row.

Since these tests eliminate probabilistic behavior, there is no requirement that our synthetic corpus have statistics that model a real world corpus. We can use really wacky documents, as long as they support enough interesting test cases.

Using Prime Factorizations

The solution we settled on was to model each document as containing only those terms corresponding to the integers that make up the prime factorization of the document’s id.

As an example, document number 100 might look something like:

Title: 100

2 2 5 5

and document 2310 might look something like

Title: 770

2 5 7 11

and document 1223, which corresponds to a prime number, would have a only single term

Title: 1223

1223

With this document structure, it is trivial to determine if a document contains a specific prime number term. Here’s the code:

bool Contains(size_t docId, size_t term)
{
    return (docId % term) == 0ull;
}

Phrase matches are easy to detect, as well, if we model the documents as ordered sequences of prime factors. All we need to do is ask whether the sequence of terms that makes up the phrase is a subsequence of the integers that make up the document’s prime factorization.

Suppose, for example, that we’re looking for the phrase “2 5”. This is equivalent to asking whether each document’s prime factorization sequence contains the sequence [2,5].

Consider the documents above.

Implementation Details

The implementation turned out to be surprisingly simple – just under 200 lines of code in PrimeFactorsDocument.cpp.

Our first step was to create mock documents. The function CreatePrimeFactorsDocument just creates an off-the-shelf IDocument and then fills it the prime factors of its DocId using calls to IDocument::AddTerm(). Here’s the relevant fragment of code:

for (size_t i = 0; i < Primes::c_primesBelow10000.size(); ++i)
{
    size_t p = Primes::c_primesBelow10000[i];
    if (p > docId)
    {
        break;
    }
    else
    {
        while ((docId % p) == 0)
        {
            auto const & term = Primes::c_primesBelow10000Text[i];
            document->AddTerm(term.c_str());
            docId /= p;
            sourceByteSize += (1 + term.size());
        }
    }
}

Next we configured a TermTable to assign a private row to each term corresponding to a prime number. We only included mappings for primes up to the largest DocId. This TermTable gives us the desired non-probabilistic behavior for Terms corresponding to primes not exceeding the largest DocId.

Terms corresponding to larger primes or composite numbers will be implicitly mapped. The implicit rows, however will only contain zeros because none of the documents contain terms corresponding to large primes or composite numbers.

The conqeuence is that queries involving larger primes and composites will never show probabilistic behavior and therefore never yield false positives.

Here’s an excerpt from the function CreatePrimeFactorsTermTable() which creates an ITermTable and then provisions it with explicit rows for the terms “0”, “1”, and each of the primes smaller than 10,000:

for (size_t i = 0; i < Primes::c_primesBelow10000.size(); ++i)
{
    size_t p = Primes::c_primesBelow10000[i];
    if (p > maxDocId)
    {
        break;
    }
    else
    {
        auto text = Primes::c_primesBelow10000Text[i];

        termTable->OpenTerm();
        termTable->AddRowId(RowId(rank, explicitRowCount0++));
        termTable->CloseTerm(Term::ComputeRawHash(text.c_str()));
    }
}

Our last step was to create a mock index. The function CreatePrimeFactorsIndex() just creates an ISimpleIndex with the prime factors term table replacing the default. Then a simple for-loop fills the index:

for (DocId docId = 0; docId <= maxDocId; ++docId)
{
    auto document =
        Factories::CreatePrimeFactorsDocument(
            index->GetConfiguration(),
            docId,
            maxDocId,
            streamId);
    index->GetIngestor().Add(docId, *document);
}
Michael Hopcroft
A 19 year veteran at Microsoft, Mike has worked on Office, Windows and Visual Studio. He has spent the past 6 years developing cloud scale infrastructure for the Bing search engine. He is a founding member of BitFunnel.