Searching for Primes
What do prime numbers have to do with BitFunnel?
It turns out we use them to test our matching engine. One of the challenges in bringing up a new search engine is figuring out how to test it. If you happen to have another working search engine that has ingested the same corpus, you’re in luck - just compare its output with that of your new search engine.
Well, that’s the theory, anyway. In practice this is difficult for a number of reasons:
The “oracle” search engine may not have indexed the same corpus as the search engine under test. As an example, it would be great to use a production Bing server as our oracle, but no one knows, at any given moment, exactly which documents are on a particular machine, and the set of documents is constantly changing.
The two search engines may model documents differently. For example, the Bing servers include tons of metadata and information about click streams that isn’t meaningful to anyone outside of Bing. We could model all of this information in a BitFunnel test, but it would involve a lot of code that was only useful for the test.
We’d like to make all of our tests available as open source, so the data required to run the tests needs to be publicly available and small enough to store on GitHub.
Down the road, we plan to configure an instance of Lucene as our oracle, but today, we need a really small, lightweight test that can be used for debugging and run before every commit.
A Synthetic Corpus
Our solution was to generate a synthetic corpus. We wanted something with the following characteristics:
- Corpus should be trivial to generate.
- Arbitrarily large corpuses can be constructed efficiently.
- Match verification algorithm should be trivial.
- Match verification should be fast.
- Can model phrases.
At this point, our goal is to test our query pipeline as it transforms the Term tree into various Row trees, which become CompileNode trees, which drop into an ICodeGenerator that yields native x64 code or byte code for our interpreter.
For these tests, we’re not concerned with the probabilistic nature of BitFunnel - we just want to know if the matcher is computing the right boolean expression over bits loaded from rows. We can easily eliminate all probabilistic behavior by configuring the TermTable to place each term in its own, private row.
Since these tests eliminate probabilistic behavior, there is no requirement that our synthetic corpus have statistics that model a real world corpus. We can use really wacky documents, as long as they support enough interesting test cases.
Using Prime Factorizations
The solution we settled on was to model each document as containing only those terms corresponding to the integers that make up the prime factorization of the document’s id.
As an example, document number 100 might look something like:
Title: 100
2 2 5 5
and document 770 might look something like
Title: 770
2 5 7 11
and document 1223, which corresponds to a prime number, would have only a single term
Title: 1223
1223
With this document structure, it is trivial to determine if a document contains a specific prime number term. Here’s the code:
bool Contains(size_t docId, size_t term)
{
    // A document contains a given prime term exactly when that prime divides its id.
    return (docId % term) == 0ull;
}
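For example, using the documents above (this check is only meaningful when term is itself prime, since documents only ever contain prime-valued terms; CheckContainsExamples is just an illustrative wrapper, not part of the test suite):

#include <cassert>

void CheckContainsExamples()
{
    assert(Contains(100, 5));      // 100 = 2 * 2 * 5 * 5
    assert(!Contains(770, 3));     // 770 = 2 * 5 * 7 * 11
    assert(!Contains(1223, 2));    // 1223 is prime
}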
Phrase matches are easy to detect, as well, if we model the documents as ordered sequences of prime factors. All we need to do is ask whether the sequence of terms that makes up the phrase is a subsequence of the integers that make up the document’s prime factorization.
Suppose, for example, that we’re looking for the phrase “2 5”. This is equivalent to asking whether each document’s prime factorization sequence contains the sequence [2,5].
Consider the documents above:
- Document 100 is a match because [2,5] is a subsequence of [2,2,5,5].
- Document 770 is also a match because [2,5] is a subsequence of [2,5,7,11].
- Document 1223, on the other hand, is not a match because [2,5] is not a subsequence of [1223].
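A minimal sketch of this subsequence check, written here in plain standard C++ rather than copied from the BitFunnel sources (Factorize and IsPhraseMatch are illustrative names):

#include <cstddef>
#include <vector>

// Returns the prime factors of docId in nondecreasing order,
// e.g. Factorize(100) yields [2, 2, 5, 5].
std::vector<size_t> Factorize(size_t docId)
{
    std::vector<size_t> factors;
    for (size_t p = 2; p * p <= docId; ++p)
    {
        while ((docId % p) == 0)
        {
            factors.push_back(p);
            docId /= p;
        }
    }
    if (docId > 1)
    {
        factors.push_back(docId);
    }
    return factors;
}

// Returns true if the phrase's terms appear, in order, within the
// document's ordered sequence of prime factors.
bool IsPhraseMatch(size_t docId, std::vector<size_t> const & phrase)
{
    auto const factors = Factorize(docId);
    size_t next = 0;    // Index of the next phrase term still to be found.
    for (size_t f : factors)
    {
        if (next < phrase.size() && f == phrase[next])
        {
            ++next;
        }
    }
    return next == phrase.size();
}

With this sketch, IsPhraseMatch(100, {2, 5}) and IsPhraseMatch(770, {2, 5}) return true, while IsPhraseMatch(1223, {2, 5}) returns false, matching the three cases above.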
Implementation Details
The implementation turned out to be surprisingly simple: just under 200 lines of code in PrimeFactorsDocument.cpp.
Our first step was to create mock documents. The function CreatePrimeFactorsDocument just creates an off-the-shelf IDocument and then fills it with the prime factors of its DocId using calls to IDocument::AddTerm(). Here’s the relevant fragment of code:
for (size_t i = 0; i < Primes::c_primesBelow10000.size(); ++i)
{
    size_t p = Primes::c_primesBelow10000[i];
    if (p > docId)
    {
        break;
    }
    else
    {
        // Add one term for each occurrence of p in docId's factorization.
        while ((docId % p) == 0)
        {
            auto const & term = Primes::c_primesBelow10000Text[i];
            document->AddTerm(term.c_str());
            docId /= p;
            sourceByteSize += (1 + term.size());
        }
    }
}
Next we configured a TermTable to assign a private row to each term corresponding to a prime number. We only included mappings for primes up to the largest DocId. This TermTable gives us the desired non-probabilistic behavior for Terms corresponding to primes not exceeding the largest DocId.
Terms corresponding to larger primes or composite numbers will be implicitly mapped. The implicit rows, however, will only contain zeros because none of the documents contain terms corresponding to large primes or composite numbers.
The consequence is that queries involving larger primes and composites will never show probabilistic behavior and therefore never yield false positives. A query for the term “6”, for example, will correctly match no documents.
Here’s an excerpt from the function CreatePrimeFactorsTermTable(), which creates an ITermTable and then provisions it with explicit rows for the terms “0”, “1”, and each of the primes smaller than 10,000:
for (size_t i = 0; i < Primes::c_primesBelow10000.size(); ++i)
{
    size_t p = Primes::c_primesBelow10000[i];
    if (p > maxDocId)
    {
        break;
    }
    else
    {
        // Provision a private explicit row for the term corresponding to p.
        auto text = Primes::c_primesBelow10000Text[i];
        termTable->OpenTerm();
        termTable->AddRowId(RowId(rank, explicitRowCount0++));
        termTable->CloseTerm(Term::ComputeRawHash(text.c_str()));
    }
}
Our last step was to create a mock index. The function CreatePrimeFactorsIndex() just creates an ISimpleIndex with the prime factors term table replacing the default. Then a simple for-loop fills the index:
for (DocId docId = 0; docId <= maxDocId; ++docId)
{
    auto document =
        Factories::CreatePrimeFactorsDocument(
            index->GetConfiguration(),
            docId,
            maxDocId,
            streamId);
    index->GetIngestor().Add(docId, *document);
}