<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>BitFunnel</title>
    <link>http://bitfunnel.org/index.xml</link>
    <description>Recent content on BitFunnel</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 27 Oct 2016 15:51:23 -0600</lastBuildDate>
    <atom:link href="http://bitfunnel.org/index.xml" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Debugging Bit Densities</title>
      <link>http://bitfunnel.org/debugging-bit-densities/</link>
      <pubDate>Thu, 27 Oct 2016 15:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/debugging-bit-densities/</guid>
      <description>

&lt;p&gt;Things are starting to get exciting in the Land of BitFunnel.
We&amp;rsquo;re now at the point where we can ingest a significant fraction of Wikipedia
and run millions of queries, all without crashing &amp;ndash; and we have a great set
of analysis tools.&lt;/p&gt;

&lt;p&gt;You might think that the end is in sight, but actually this is only the beginning.
We&amp;rsquo;re now in what I like to call a &amp;ldquo;target rich environment&amp;rdquo; for bugs.
Today there are so many bugs, data cleaning issues, and configuration errors
that the only reliable statement we can make about the system is that it has bugs.&lt;/p&gt;

&lt;p&gt;Having lots of bugs might sound scary or frustrating, but this is actually one of the
most interesting stages of the project &amp;ndash; when we bring everything together for the first time.
I&amp;rsquo;ll make an analogy to the &lt;a href=&#34;https://en.wikipedia.org/wiki/Boeing_787_Dreamliner&#34;&gt;Boeing 787 Dreamliner&lt;/a&gt;.&lt;/p&gt;


&lt;figure &gt;
    &lt;a href=&#34;https://commons.wikimedia.org/w/index.php?curid=8775802&#34;&gt;
        &lt;img src=&#34;http://bitfunnel.org/debugging-bit-densities/Boeing_787_first_flight.jpg&#34; /&gt;
    &lt;/a&gt;
    
&lt;/figure&gt;


&lt;p&gt;The components were designed, manufactured
and tested separately and then one day they were brought together and
assembled for the first time into an airliner.
There were some initial hitches getting everything to fit together, but before you knew it
they were doing high speed taxi tests and then preparing for the first flight.&lt;/p&gt;

&lt;p&gt;I spent the day bolting on the wings and spooling up the engines. My focus was primarily on
the bit densities in the Row Tables.&lt;/p&gt;

&lt;h2 id=&#34;methodology&#34;&gt;Methodology&lt;/h2&gt;

&lt;p&gt;My experiments were based on &lt;a href=&#34;https://bitfunnel.blob.core.windows.net/wiki-data/chunked/enwiki-20161020-chunked1.tar.gz&#34;&gt;enwiki-20161020-chunked1&lt;/a&gt;.
After downloading and unzipping this file, I made a manifest file listing all 236 chunks in the corpus.
This can be done on Windows with the &lt;code&gt;dir&lt;/code&gt; command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;d:\temp\wikipedia&amp;gt; dir /s/b/a-d enwiki-20161020-chunked1 &amp;gt; manifest.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;or on Linux with the &lt;code&gt;find&lt;/code&gt; command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;% find enwiki-20161020-chunked1 -type f &amp;gt; manifest.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I then generated corpus statistics, built the term table, and started an analysis in the repl:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;% mkdir /tmp/wikipedia/config
% BitFunnel statistics -text /tmp/wikipedia/manifest.txt /tmp/wikipedia/config
% BitFunnel termtable /tmp/wikipedia/config

% mkdir /tmp/wikipedia/analysis
% BitFunnel repl /tmp/wikipedia/config

Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.)

directory = &amp;quot;/tmp/wikipedia/config&amp;quot;
gram size = 1

Starting index ...
Index started successfully.

Type &amp;quot;help&amp;quot; to get started.

0: cache manifest /tmp/wikipedia/manifest.txt
Ingesting manifest &amp;quot;/tmp/wikipedia/manifest.txt&amp;quot;
Caching IDocuments for query verification.
Ingestion complete.

1: cd /tmp/wikipedia/analysis
output directory is now &amp;quot;/tmp/wikipedia/analysis&amp;quot;.

2: analyze
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This generated a number of files in the &lt;code&gt;analysis&lt;/code&gt; directory,
including &lt;code&gt;RowDensities-0.csv&lt;/code&gt;, which is easy to view in Excel.&lt;/p&gt;

&lt;h2 id=&#34;picking-low-hanging-fruit&#34;&gt;Picking Low Hanging Fruit&lt;/h2&gt;

&lt;p&gt;My interest today was the densities of the rank 3 rows,
where my &lt;a href=&#34;http://bitfunnel.org/row-table-analysis&#34;&gt;earlier investigations&lt;/a&gt;
suggested there might be a bug.&lt;/p&gt;

&lt;p&gt;Right off the bat, I found a really high density in a rank 3 row for the adhoc term &lt;code&gt;scriptura&lt;/code&gt;.
Cell Q47193 shows a density of 0.652111 in row 2353. This is way above the target density
of 0.1 and something that needs to be addressed.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/debugging-bit-densities/TermScriptura.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;So how did the row density get so high? My first line of inquiry was to examine the other terms
that share row 2353 with &lt;code&gt;scriptura&lt;/code&gt;. To do this, I modified &lt;code&gt;RowTableAnalyzer::AnalyzeRowsInOneShard()&lt;/code&gt;
to print out the term text and frequency whenever &lt;code&gt;row.GetIndex() == 2353 &amp;amp;&amp;amp; row.GetRank() == 3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Near the top of the function, I recorded the index of the row under observation, created a counter
for terms in the row, and created a &lt;code&gt;CsvTableFormatter&lt;/code&gt; to print out the terms. The latter was necessary to properly
escape terms that contain commas (e.g. &lt;code&gt;10,000&lt;/code&gt;).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;    const RowIndex specialRow = 2353;
    size_t specialTermCount = 0;
    CsvTsv::CsvTableFormatter specialFormatter(std::cout);
    std::cout &amp;lt;&amp;lt; &amp;quot;Special row is &amp;quot; &amp;lt;&amp;lt; specialRow &amp;lt;&amp;lt; std::endl;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the middle of the loop, I added code to print out information about each term in the special row:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;    if (row.GetRank() == 3 &amp;amp;&amp;amp; row.GetIndex() == specialRow)
    {
        std::cout
            &amp;lt;&amp;lt; specialTermCount++
            &amp;lt;&amp;lt; &amp;quot;,&amp;quot;;
        specialFormatter.WriteField(termToText.Lookup(term.GetRawHash()));
        std::cout
            &amp;lt;&amp;lt; &amp;quot;,&amp;quot;
            &amp;lt;&amp;lt; dfEntry.GetFrequency();
        specialFormatter.WriteRowEnd();
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here are the results. There were 752 terms in row 2353, but as you can see,
the sum of their frequencies is 0.102110806, a value that is pretty close to the
target density of 0.1.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/debugging-bit-densities/Row2353.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;At first glance the fact that the frequencies sum to 0.102110806
suggests that the bin packing algorithm in
&lt;code&gt;TermTableBuilder::RowAssigner&lt;/code&gt; is working correctly.
But wait &amp;ndash; these are term frequencies, which only correspond to densities
at rank 0. At higher ranks, the densities are magnified because each bit
at a higher rank corresponds to multiple rank 0 bits.
After a brief investigation, I found that we were, in fact,
undercounting bits added to adhoc rows.
In &lt;code&gt;RowAssigner::AssignAdhoc()&lt;/code&gt; we keep a running total of the density
contributed by adhoc terms:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;m_adhocTotal += frequency * count;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Later we&amp;rsquo;d use &lt;code&gt;m_adhocTotal&lt;/code&gt; to compute the number of rows needed to store
adhoc terms:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;double rowCount = ceil(m_adhocTotal / m_density);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The precise problem is that each bit in a rank 3 row corresponds to 8 bits at rank 0,
so the density in rank 3 will be significantly higher than the term&amp;rsquo;s
frequency at rank 0.
A &lt;a href=&#34;https://github.com/BitFunnel/BitFunnel/commit/ebf12d52efd163c005802b63cafa6bfae4f2da09&#34;&gt;small change&lt;/a&gt;
addressed this issue:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;double f = Term::FrequencyAtRank(frequency, m_rank);
m_adhocTotal += f * count;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This fix had a big impact, dropping &lt;code&gt;scriptura&lt;/code&gt; row densities to the 0.20 to 0.30 range.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/debugging-bit-densities/TermScripturaFixed.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;This was a nice improvement, but the densities were still 2-3x what they should be.
It turns out that there was another bug, this time in &lt;code&gt;Term::FrequencyAtRank()&lt;/code&gt;
(thanks to &lt;a href=&#34;https://twitter.com/danluu&#34;&gt;@danluu&lt;/a&gt; for spotting the error!)&lt;/p&gt;

&lt;p&gt;The original code computed&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;return 1.0 - pow(1.0 - frequency, rank + 1.0);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;a href=&#34;https://github.com/BitFunnel/BitFunnel/commit/41db6a9f81c2517d1a88adc8c27b8612fb586eb5&#34;&gt;corrected code&lt;/a&gt; is&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;size_t rowCount = (1ull &amp;lt;&amp;lt; rank);
return 1.0 - pow(1.0 - frequency, rowCount);
&lt;/code&gt;&lt;/pre&gt;
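&lt;p&gt;To sanity check the fix, here is a small standalone sketch (my own illustration, not code from the BitFunnel repository) that places the original and corrected formulas side by side:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;#include &amp;lt;cmath&amp;gt;

// Original (buggy) version: treats rank as an increment rather than
// as an exponent on the row length.
double FrequencyAtRankBuggy(double frequency, unsigned rank)
{
    return 1.0 - pow(1.0 - frequency, rank + 1.0);
}

// Corrected version: a bit in a rank-r row is the OR of 2^r rank 0
// columns, so it is set unless all 2^r columns are zero.
double FrequencyAtRank(double frequency, unsigned rank)
{
    size_t rowLength = 1ull &amp;lt;&amp;lt; rank;
    return 1.0 - pow(1.0 - frequency, static_cast&amp;lt;double&amp;gt;(rowLength));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For a term with frequency 0.05 at rank 3, the buggy version predicts a density of about 0.185 (1 - 0.95&lt;sup&gt;4&lt;/sup&gt;) while the corrected version predicts about 0.337 (1 - 0.95&lt;sup&gt;8&lt;/sup&gt;), which helps explain why the high rank rows were so underprovisioned.&lt;/p&gt;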

&lt;p&gt;With these two fixes, the &lt;code&gt;scriptura&lt;/code&gt; densities are all below 0.15:&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/debugging-bit-densities/TermScripturaFixed2.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;An examination of other
rows shows that the fix works for all adhoc rows. We still have a ways to go if we want
all shared rows to have a density at or below 0.10, but we&amp;rsquo;ve made good progress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One More Thing:&lt;/strong&gt; During the course of this investigation, I noticed that some of the terms in the
index contain punctuation (e.g. &lt;code&gt;39.4&lt;/code&gt; and &lt;code&gt;c&#39;s&lt;/code&gt;) and others are capitalized.
It turns out that our &lt;a href=&#34;https://github.com/BitFunnel/Workbench&#34;&gt;Workbench&lt;/a&gt;
tool which uses &lt;a href=&#34;https://lucene.apache.org/&#34;&gt;Lucene&amp;rsquo;s&lt;/a&gt; analyzer to tokenize
the Wikipedia page bodies was letting the page titles pass through without analysis.
We should have a fix for this shortly and then we will reprocess Wikipedia and
upload new chunk files.&lt;/p&gt;

&lt;h2 id=&#34;where-things-stand&#34;&gt;Where Things Stand&lt;/h2&gt;

&lt;p&gt;This is a classic case of little bugs hiding behind big bugs.
We&amp;rsquo;re also in a situation where we have bugs on multiple axes,
making it hard to reason about why we&amp;rsquo;re seeing certain behaviors.&lt;/p&gt;

&lt;p&gt;As an example, we know that query processing is slow.
We expected it to be slow because we&amp;rsquo;re using the byte code interpreter
instead of &lt;a href=&#34;https://github.com/BitFunnel/NativeJIT&#34;&gt;NativeJIT&lt;/a&gt;
and because we indulged ourselves in some potentially
expensive operations in the glue code (e.g. excess allocations and copies, lock
contention, and calls to qsort).&lt;/p&gt;

&lt;p&gt;The problem is that we don&amp;rsquo;t really know how fast query processing should be
with the current design choices.
It might be slow because of the reasons above, or it might be even slower
because our row densities are too high, causing excess row reads, or
our false positives are too high, causing too many results.&lt;/p&gt;

&lt;p&gt;Or we might have a performance bug, say running each query twice when
we thought we were running it once.&lt;/p&gt;

&lt;p&gt;There are so many possibilities that we just need to take one suspicious
piece of data at a time, come up with a single example, be it a bad row
or a bad query, and instrument it or trace it through.&lt;/p&gt;

&lt;p&gt;Over time we will pick enough of this low hanging fruit that we generally
trust the system and bugs will become more of an aberration.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Row Table Analysis</title>
      <link>http://bitfunnel.org/row-table-analysis/</link>
      <pubDate>Sun, 23 Oct 2016 15:51:23 -0700</pubDate>
      
      <guid>http://bitfunnel.org/row-table-analysis/</guid>
      <description>

&lt;p&gt;I spent the weekend implementing code to analyze bit densities in the
rows and columns of the row tables. This tool should help us determine
whether the row tables are configured correctly. A good row table should
have the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each column&amp;rsquo;s density is close to the system target density.&lt;/li&gt;
&lt;li&gt;Each row&amp;rsquo;s density is close to the system target density.&lt;/li&gt;
&lt;li&gt;Random pairs of terms are unlikely to share rows.&lt;/li&gt;
&lt;li&gt;All rows assigned to a single term should be distinct.&lt;/li&gt;
&lt;/ul&gt;
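&lt;p&gt;For reference, the density being measured here is simply the fraction of set bits. A minimal sketch of the computation for a single row (a hypothetical helper for illustration, not the actual analyzer code) might look like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;vector&amp;gt;

// Density of a row stored as 64-bit words: set bits / total columns.
double RowDensity(std::vector&amp;lt;uint64_t&amp;gt; const &amp;amp; row, size_t columnCount)
{
    size_t setBits = 0;
    for (uint64_t word : row)   // word is a copy, so the row is not modified
    {
        // Kernighan trick: each iteration clears the lowest set bit.
        while (word != 0)
        {
            word &amp;amp;= word - 1;
            ++setBits;
        }
    }
    return static_cast&amp;lt;double&amp;gt;(setBits) / columnCount;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The column analysis applies the same idea to the bits in a single document&#39;s column.&lt;/p&gt;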

&lt;p&gt;To test the tool, I did a quick row table analysis for two corpora.
The first was the collection of Shakespeare Sonnets in
&lt;a href=&#34;http://bitfunnel.org/alls-well-that-ends-well&#34;&gt;TheBard&lt;/a&gt;.
The second corpus consisted of the first 1805 documents from our
&lt;a href=&#34;http://bitfunnel.org/wikipedia-as-test-corpus-for-bitfunnel&#34;&gt;processed Wikipedia dump&lt;/a&gt;
(chunks &lt;code&gt;AA\\wiki_00&lt;/code&gt; to &lt;code&gt;AA\\wiki_09&lt;/code&gt;).
The remainder of this post describes my methodology and some early observations.&lt;/p&gt;

&lt;h2 id=&#34;methodology&#34;&gt;Methodology&lt;/h2&gt;

&lt;p&gt;I used the new &lt;code&gt;-script&lt;/code&gt; option to &lt;code&gt;BitFunnel repl&lt;/code&gt; to start
the repl and then execute commands from the file &lt;code&gt;ingest.txt&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;% BitFunnel repl -script ingest.txt /tmp/wikipedia/config
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;-script&lt;/code&gt; option is a huge time saver.
Here are the commands inside of &lt;code&gt;ingest.txt&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_00
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_01
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_02
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_03
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_04
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_05
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_06
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_07
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_08
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_09
cd /tmp/wikipedia/out
analyze
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first 10 lines ingest chunks &lt;code&gt;AA/wiki_00&lt;/code&gt; to &lt;code&gt;AA/wiki_09&lt;/code&gt; while caching
their &lt;code&gt;IDocuments&lt;/code&gt;. It is important to use the &lt;code&gt;cache&lt;/code&gt; command here because
the column density analysis is only performed for &lt;code&gt;IDocuments&lt;/code&gt; that are cached.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;cd&lt;/code&gt; command on the next line sets the output directory to &lt;code&gt;/tmp/wikipedia/out&lt;/code&gt;.
This is where the row table analyzer will put its output files.&lt;/p&gt;

&lt;p&gt;Finally, the &lt;code&gt;analyze&lt;/code&gt; command kicks off the row table analysis, which generates
the following files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;/tmp/wikipedia/out/ColumnDensities.csv&lt;/li&gt;
&lt;li&gt;/tmp/wikipedia/out/ColumnDensitySummary.txt&lt;/li&gt;
&lt;li&gt;/tmp/wikipedia/out/RowDensities-0.csv&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&#34;column-densities-in-wikipedia&#34;&gt;Column Densities in Wikipedia&lt;/h2&gt;

&lt;p&gt;The column densities are recorded as a table in a .csv file.
Here&amp;rsquo;s what the file looks like in Excel:&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/row-table-analysis/column-density-spreadsheet.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;The first column is the document id, which in the case of Wikipedia
is the &lt;code&gt;curid&lt;/code&gt; value. For example, cell &lt;code&gt;A2&lt;/code&gt; has an id of &lt;code&gt;1805&lt;/code&gt; which
corresponds to the
&lt;a href=&#34;https://en.wikipedia.org/wiki?curid=1805&#34;&gt;Wikipedia page on Antibiotics&lt;/a&gt;
(full url is &lt;code&gt;https://en.wikipedia.org/wiki?curid=1805&lt;/code&gt;).
Column &lt;code&gt;B&lt;/code&gt; contains the number of postings contributed by each document.
We can see from cell &lt;code&gt;B2&lt;/code&gt; that the page on Antibiotics contributes
&lt;code&gt;1364&lt;/code&gt; postings. Column &lt;code&gt;C&lt;/code&gt; has the shard the document is stored in.
For these runs, the index was configured with a single shard, so the column
is all zeros. Columns &lt;code&gt;D&lt;/code&gt; through &lt;code&gt;K&lt;/code&gt; have the bit densities of the
document&amp;rsquo;s column at ranks 0 to 7, respectively.
Note that this index was configured to use only ranks 0 and 3.&lt;/p&gt;

&lt;p&gt;We can learn a lot about the index by examining a scatter plot of column
density vs. document posting count. In the chart below, blue dots represent
rank 0 densities and orange dots represent rank 3 densities.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/row-table-analysis/column-density-vs-posting-count.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;Right off the bat, the structure of the graph suggests two learnings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Need for Sharding&lt;/strong&gt;
The first learning is that we will need to shard the index by posting count.
An examination of the blue dots shows that column density is directly
proportional to document posting count. In other words, short documents
have low column densities and long documents have high column densities.&lt;/p&gt;

&lt;p&gt;This index was sized to accommodate the average-sized document, which has
766 postings. If we follow the red line up from 766 postings, it hits
the blue dots near a density of 0.1 which was the target density used
in the index configuration.&lt;/p&gt;

&lt;p&gt;In this index, documents with fewer than 766 postings will consume more
memory than necessary, while documents with more than 766 postings will
contribute to an increased false positive rate.&lt;/p&gt;

&lt;p&gt;We knew from Bing that we would need to shard the index into groups of
documents with similar posting counts. This scatter plot just confirms
the need for sharding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Bug in Rank 3?&lt;/strong&gt;
The cloud of orange dots is problematic in that it doesn&amp;rsquo;t show the
expected linear structure and it includes many densities above the target
of 0.1.&lt;/p&gt;

&lt;p&gt;An examination of the TermTable builder code shows two design flaws that
could potentially lead to excess density in higher rank rows.&lt;/p&gt;

&lt;p&gt;The first problem is that each term is associated with a set of rows
that are either all adhoc or all explicit.
In some cases, a term has a low enough frequency to be adhoc at rank 0,
but high enough to require explicit row assignment at rank 3.
With the current algorithm, such a term&amp;rsquo;s rows will all be adhoc,
leading to the overfilling of some rows at higher ranks.&lt;/p&gt;

&lt;p&gt;This is a problem that exists in the current Bing codebase and explains
why they sometimes have rows with unexpectedly high density.&lt;/p&gt;

&lt;p&gt;The second problem is that the term treatment allows higher rank rows
when their use would precipitate densities above the target density.
Suppose, for example, that we have a term that appears in 5% of the corpus
and its term treatment calls for two shared rows, one at rank 0 and the
other at rank 3. The bin packing algorithm will ensure that the rank 0 row
is not overfilled. The problem comes in the rank 3 row, where the 5%
term frequency translates to roughly 33.7% of the bits being set
(1 - 0.95&lt;sup&gt;8&lt;/sup&gt;, since each rank 3 bit covers 8 rank 0 columns). In this case, the
bin packing algorithm will be forced to allocate a private rank 3 row,
when a better choice might have been to allocate a lower rank row that
could be shared with other terms.&lt;/p&gt;
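&lt;p&gt;To make this problem concrete, here is a simplified sketch of the greedy row assignment idea (an illustration of the concept, not the actual &lt;code&gt;TermTableBuilder&lt;/code&gt; code). It approximates a shared row&#39;s density as the sum of the densities of the terms assigned to it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;#include &amp;lt;cmath&amp;gt;
#include &amp;lt;vector&amp;gt;

// Expected density of a term at a given rank: each rank r bit covers
// 2^r rank 0 columns.
double FrequencyAtRank(double frequency, unsigned rank)
{
    return 1.0 - pow(1.0 - frequency, static_cast&amp;lt;double&amp;gt;(1ull &amp;lt;&amp;lt; rank));
}

// Returns the number of rows needed at the given rank, packing terms
// into a shared row until the target density would be exceeded.
size_t AssignRows(std::vector&amp;lt;double&amp;gt; const &amp;amp; frequencies,
                  unsigned rank,
                  double targetDensity)
{
    size_t rowCount = 0;
    double current = targetDensity;   // forces a new row for the first shared term
    for (double f : frequencies)
    {
        double d = FrequencyAtRank(f, rank);
        if (d &amp;gt;= targetDensity)
        {
            ++rowCount;               // private row: sharing would overfill it
            continue;
        }
        if (current + d &amp;gt; targetDensity)
        {
            ++rowCount;               // close the current row and open a new one
            current = 0.0;
        }
        current += d;
    }
    return rowCount;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With a target density of 0.1, the 5% term above shares a rank 0 row without trouble, but its rank 3 density alone exceeds the target, so this scheme gives it a private rank 3 row.&lt;/p&gt;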

&lt;p&gt;Until we fix these problems we won&amp;rsquo;t be able to see whether there are other
less impactful bugs lurking in the background.&lt;/p&gt;

&lt;h2 id=&#34;row-densities-in-shakespeare-sonnets&#34;&gt;Row Densities in Shakespeare Sonnets&lt;/h2&gt;

&lt;p&gt;The row density table seems to show the same issues with excess density
in higher rank rows. The row table analyzer outputs one row density file
for each shard. For this run, our index only has one shard,
so its data appears in &lt;code&gt;RowDensities-0.csv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is a .csv file with a ragged right edge.
Each line in the file corresponds to the term listed in column &lt;code&gt;A&lt;/code&gt;.
Column &lt;code&gt;B&lt;/code&gt; shows the term&amp;rsquo;s frequency in the corpus used during the
statistics collection phase (&lt;code&gt;BitFunnel statistics&lt;/code&gt;). In the excerpt
below, we can see that the term &lt;code&gt;Sonnet&lt;/code&gt; appears in 100% of the documents
while the term &lt;code&gt;but&lt;/code&gt; appears in 71.4%.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/row-table-analysis/row-densities-bard-top.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;Columns &lt;code&gt;C&lt;/code&gt; through &lt;code&gt;E&lt;/code&gt; give the rank, row index, and observed density
of the first row associated with each term. Columns &lt;code&gt;F&lt;/code&gt; through &lt;code&gt;H&lt;/code&gt;
correspond to the second row, and so on. In the excerpt above, we
can see that common terms correspond to a single private row. Since
the row is not shared with other terms, its density can be above the
target of 0.1.&lt;/p&gt;

&lt;p&gt;Note that in the excerpt above, the
term frequency in column &lt;code&gt;B&lt;/code&gt; is exactly equal to the fraction of bits
detected in column &lt;code&gt;E&lt;/code&gt;, even though the corpus statistics were gathered
from a slightly different corpus. These values aren&amp;rsquo;t required to be
equal, but it is not surprising that they are equal for the most common
terms.&lt;/p&gt;

&lt;p&gt;In the excerpt below, we can see that at some point, less common terms
begin to share rows. As soon as a term shares rows, it needs at least one additional
row to drive down the noise. The term &lt;code&gt;things&lt;/code&gt; is common enough to get
its own, private row, while &lt;code&gt;gentle&lt;/code&gt; is assigned to a pair of rows that
are shared with other terms.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/row-table-analysis/row-densities-bard-middle.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;The red highlight shows the bug rearing its ugly head in the rank 3
rows. As an example, the term &lt;code&gt;gentle&lt;/code&gt; has a density of 0.19 in a
rank 3 row. If row &lt;code&gt;1000&lt;/code&gt; is shared, its density will lead to an
unexpectedly high false positive rate in queries involving terms that
share rank 3 row number &lt;code&gt;1000&lt;/code&gt;. If, on the other hand, row &lt;code&gt;1000&lt;/code&gt; is
private, we may be wasting storage.&lt;/p&gt;

&lt;p&gt;This problem becomes less prominent as terms get rarer. In the excerpt
below, we see that &lt;code&gt;wrinkles&lt;/code&gt; gets two shared rows, while &lt;code&gt;seek&lt;/code&gt; gets
three. In this portion of the table, all of the densities are
below the target level of 0.1.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/row-table-analysis/row-densities-bard-bottom.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;It is a bit suspicious that the rank 3 rows for rare terms
seem to have densities that are below the target density. This might be a result of the
bug directing bits to the wrong rows, or it could mean that we&amp;rsquo;ve
somehow overprovisioned these rows and are wasting memory.&lt;/p&gt;

&lt;p&gt;It is great to have these analysis tools to get a handle on problems
in the row tables.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Wikipedia as test corpus for BitFunnel</title>
      <link>http://bitfunnel.org/wikipedia-as-test-corpus-for-bitfunnel/</link>
      <pubDate>Fri, 21 Oct 2016 17:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/wikipedia-as-test-corpus-for-bitfunnel/</guid>
      <description>

&lt;p&gt;Wikipedia is a great test corpus for search engines. It is &lt;a href=&#34;https://dumps.wikimedia.org/enwiki/&#34;&gt;free and easy to obtain&lt;/a&gt;, it carries a &lt;a href=&#34;https://en.wikipedia.org/wiki/Wikipedia:Copyrights&#34;&gt;license appropriate for research&lt;/a&gt;, and at ~59GB uncompressed, it is large, but not too large to fit on a reasonably-sized server. For those with extremely fast reflexes, even user data&lt;sup class=&#34;footnote-ref&#34; id=&#34;fnref:querylogs&#34;&gt;&lt;a rel=&#34;footnote&#34; href=&#34;#fn:querylogs&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; is sometimes available.&lt;/p&gt;

&lt;p&gt;Wikipedia is also probably more representative of common use cases of search: since it is edited by amateurs, it is a more pedestrian dataset than many other corpora. This likely makes it more relevant to many realistic applications of search, particularly those that contain mostly amateur-generated data (such as consumer web search and corporate document search).&lt;/p&gt;

&lt;p&gt;All of this makes Wikipedia a sensible baseline dataset for testing BitFunnel. To facilitate this, we have released a pre-processed version of the 2016-10-20 dump of Wikipedia, so that it is trivial to ingest into BitFunnel.&lt;/p&gt;

&lt;p&gt;In this post, we will look at (1) how you can obtain this pre-processed Wikipedia data, (2) how you can ingest it into a running BitFunnel instance, (3) how to get ahold of the intermediate processing files so that you can audit the chunk files, to make sure they&amp;rsquo;re correct, and (4) some simple statistics about the corpus.&lt;/p&gt;

&lt;h1 id=&#34;obtaining-the-corpus&#34;&gt;Obtaining the corpus&lt;/h1&gt;

&lt;p&gt;The 2016-10-20 dump of Wikipedia is divided into 27 compressed XML files. We have transformed each of these XML files into BitFunnel&amp;rsquo;s custom &lt;em&gt;chunk&lt;/em&gt; file format to make ingestion fast and painless. (See the &lt;a href=&#34;http://bitfunnel.org/corpus-file-format/&#34;&gt;blog post&lt;/a&gt; that introduces the chunk format).&lt;/p&gt;

&lt;p&gt;Each of these 27 dump files generates many chunk files. These &amp;ldquo;segments&amp;rdquo; of chunk files can be found at URLs following the pattern in the code block below; to download one of the 27 segments, simply replace the &lt;code&gt;${1}&lt;/code&gt; with the number of the chunk you&amp;rsquo;d like. (The dump numbers start at 1 and end at 27, inclusive.)&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;https://bitfunnel.blob.core.windows.net/wiki-data/chunked/enwiki-20161020-chunked${1}.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So, for example, if you want to obtain chunk 1, and you are running on Linux, you might run something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;wget https://bitfunnel.blob.core.windows.net/wiki-data/chunked/enwiki-20161020-chunked1.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Alternatively, you can paste that URL into a browser.&lt;/p&gt;

&lt;h1 id=&#34;ingesting-the-corpus&#34;&gt;Ingesting the corpus&lt;/h1&gt;

&lt;p&gt;There are a few ways to ingest chunk files. Probably the easiest is to use the REPL, which is what we will do in this section.&lt;/p&gt;

&lt;p&gt;We explain a bit of the background of how BitFunnel can be configured in the &lt;a href=&#34;http://bitfunnel.org/index-build-tools/&#34;&gt;index build tools&lt;/a&gt; post. Today, it is sufficient to download the &lt;a href=&#34;https://bitfunnel.blob.core.windows.net/wiki-data/config/enwiki-20161020-config.tar.gz&#34;&gt;gzip&amp;rsquo;d configuration files&lt;/a&gt; generated from the same Wikipedia dump.&lt;/p&gt;

&lt;p&gt;From there, you can run something like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;$ BitFunnel repl /path/to/unzipped/config/directory
Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.)

directory = &amp;quot;/path/to/unzipped/config/directory&amp;quot;
gram size = 1

Starting index ...
Blocksize: [... your size here ...]
Index started successfully.

Type &amp;quot;help&amp;quot; to get started.

0: cache chunk /path/to/chunk/file
1: verify one wings
[... results go here ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When we start the REPL, you can see that we run two commands. &lt;code&gt;cache chunk&lt;/code&gt; ingests a chunk file, and &lt;code&gt;verify one&lt;/code&gt; queries the data in the index and verifies the document matches are correct.&lt;/p&gt;

&lt;h1 id=&#34;auditing-the-data&#34;&gt;Auditing the data&lt;/h1&gt;

&lt;p&gt;In the &lt;a href=&#34;http://bitfunnel.org/corpus-file-format/&#34;&gt;post on BitFunnel&amp;rsquo;s corpus file format&lt;/a&gt;, we write about the steps needed to convert the Wikipedia dump files into BitFunnel chunk files. This process is also specified in the README of BitFunnel&amp;rsquo;s &lt;a href=&#34;https://github.com/bitfunnel/Workbench&#34;&gt;Workbench&lt;/a&gt; project.&lt;/p&gt;

&lt;p&gt;In brief, there are 3 steps: (1) download the Wikipedia XML dumps, (2) use &lt;a href=&#34;https://github.com/attardi/wikiextractor&#34;&gt;WikiExtractor&lt;/a&gt; to filter out the Wikipedia markup; and (3) convert the &amp;ldquo;extracted&amp;rdquo; text to chunk files using BitFunnel&amp;rsquo;s &lt;a href=&#34;https://github.com/bitfunnel/Workbench&#34;&gt;Workbench&lt;/a&gt; project.&lt;/p&gt;

&lt;p&gt;In order to allow people to audit the chunk files, we are also hosting both the raw XML dump files for the 2016-10-20 articles-only dump of Wikipedia, and the markup-filtered files we generated with WikiExtractor. (We provide the Wikipedia dump because eventually Wikipedia will stop offering it in its archive.)&lt;/p&gt;

&lt;p&gt;Wikipedia generated 27 dump files, so there are 27 extracted files and 27 chunk directories. So, just like we had a URL pattern of chunk segments, we have one for raw wikipedia dump files and extracted dump files, respectively:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;https://bitfunnel.blob.core.windows.net/wiki-data/raw/enwiki-20161020-pages-articles${1}.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;https://bitfunnel.blob.core.windows.net/wiki-data/extracted/enwiki-20161020-extracted${1}.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As with the chunk file, simply replace &lt;code&gt;${1}&lt;/code&gt; with a number from 1 to 27 (inclusive) to receive the corresponding dump file.&lt;/p&gt;

&lt;p&gt;From here, you can inspect the data, or use Workbench and WikiExtractor to generate your own data and compare.&lt;/p&gt;

&lt;h1 id=&#34;statistics&#34;&gt;Statistics&lt;/h1&gt;

&lt;p&gt;Finally, just as a point of reference, here are some statistics relating to the size of the corpus in various states of processing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Raw corpus:&lt;/strong&gt; total size is ~16.7GB when compressed with gzip, and ~59.2GB when uncompressed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://github.com/attardi/wikiextractor&#34;&gt;WikiExtractor&lt;/a&gt;&amp;rsquo;d corpus:&lt;/strong&gt; total size is ~4.8GB when compressed with gzip, and ~13.2 GB uncompressed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://github.com/bitfunnel/Workbench&#34;&gt;Chunked&lt;/a&gt; corpus:&lt;/strong&gt; total size is ~3.6GB when compressed with gzip, and ~10.2GB uncompressed.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;footnotes&#34;&gt;

&lt;hr /&gt;

&lt;ol&gt;
&lt;li id=&#34;fn:querylogs&#34;&gt;If you were paying attention for exactly one day in 2012, you could obtain &lt;a href=&#34;https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/&#34;&gt;user query logs&lt;/a&gt;. These are particularly useful because they allow you to construct realistic synthetic workloads. &lt;br/&gt;&lt;br/&gt;These days such a release is quite rare. Nearly every company that has released significant user data has been burned (see, for example, AOL&amp;rsquo;s &lt;a href=&#34;https://en.wikipedia.org/wiki/AOL_search_data_leak&#34;&gt;search query log scandal&lt;/a&gt; and Netflix&amp;rsquo;s &lt;a href=&#34;https://en.wikipedia.org/wiki/Netflix_Prize#Privacy_concerns&#34;&gt;recommender scandal&lt;/a&gt;), which makes other companies reluctant to take the plunge. &lt;br/&gt;&lt;br/&gt;But, as much as information retrieval researchers would like to use such data to improve search systems, there are also very good theoretical reasons (see, for example, this paper [&lt;a href=&#34;http://www.cse.psu.edu/~duk17/papers/nflprivacy.pdf&#34;&gt;PDF&lt;/a&gt;]) to believe that, at the very least, releasing such data is very difficult to do correctly. Difficult enough that it may never be worth the risk.
 &lt;a class=&#34;footnote-return&#34; href=&#34;#fnref:querylogs&#34;&gt;&lt;sup&gt;[return]&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>BitFunnel Glossary</title>
      <link>http://bitfunnel.org/glossary/</link>
      <pubDate>Thu, 13 Oct 2016 15:52:53 -0700</pubDate>
      
      <guid>http://bitfunnel.org/glossary/</guid>
      <description>

&lt;p&gt;To get a high level overview of the algorithm, &lt;a href=&#34;//bitfunnel.org/strangeloop/&#34;&gt;please see this talk transcript&lt;/a&gt;. This glossary is incomplete and needs a lot of work! While our plan is to fill out the whole thing, that will probably take a while. If there&amp;rsquo;s some particular term or concept that you&amp;rsquo;d like to see explained sooner, &lt;a href=&#34;https://github.com/bitfunnel/bitfunnel/issues&#34;&gt;please let us know&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&#34;top-level-concepts&#34;&gt;Top level concepts&lt;/h2&gt;

&lt;h3 id=&#34;termtable&#34;&gt;TermTable&lt;/h3&gt;

&lt;p&gt;A TermTable contains the mapping from a term to the rows associated with the term. A term can be a word (1-gram) or an N-word phrase (N-gram).&lt;/p&gt;

&lt;h3 id=&#34;index-ingestor&#34;&gt;Index / Ingestor&lt;/h3&gt;

&lt;p&gt;An Index contains, for one machine, an &amp;ldquo;index&amp;rdquo; of all documents that are indexed on that machine. An Index consists of multiple Shards, plus the configuration information necessary to ingest and look up documents, which means that it contains references to things like TermTables.&lt;/p&gt;

&lt;h3 id=&#34;shard&#34;&gt;Shard&lt;/h3&gt;

&lt;p&gt;A Shard contains descriptors for the documents contained within the shard. This means that it has the DocTableDescriptors and the RowTableDescriptors for a Shard. The DocTableDescriptor tells you what information is stored along with each document besides the bits in the index. The RowTableDescriptors tell you what bits in the index actually mean. For example, given a DocIndex and the associated row information, it can give you the bit-address of a specific row for a document.&lt;/p&gt;

&lt;p&gt;A Shard is also responsible for holding onto Slices.&lt;/p&gt;

&lt;h3 id=&#34;slice&#34;&gt;Slice&lt;/h3&gt;

&lt;p&gt;A Slice owns the memory associated with a set of documents. Very roughly speaking, it&amp;rsquo;s a buffer of some size, with meta-information on the buffer and a ref count.&lt;/p&gt;

&lt;h2 id=&#34;other-important-concepts&#34;&gt;Other important concepts&lt;/h2&gt;

&lt;h3 id=&#34;rank&#34;&gt;Rank&lt;/h3&gt;

&lt;p&gt;In our hierarchical bloom filters, a row is said to have rank &lt;em&gt;i&lt;/em&gt; if each bit represents 2**i documents. This means that, in a rank 0 row, each bit represents exactly one document, and in a rank 3 row, each bit represents 8 documents (in which case, the bit is set if any one of the 8 documents represented &amp;ldquo;wants&amp;rdquo; to set the bit).&lt;/p&gt;
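&lt;p&gt;The arithmetic can be sketched in a few lines (illustrative code, not from the BitFunnel repo):&lt;/p&gt;

```javascript
// Illustrative rank arithmetic. In a rank i row, each bit represents
// 2**i documents.
function docsPerBit(rank) {
  return 2 ** rank;
}

// The bit a document maps to in a rank i row: documents are grouped
// into consecutive blocks of 2**i, one bit per block.
function bitIndex(rank, docIndex) {
  return Math.floor(docIndex / docsPerBit(rank));
}

console.log(docsPerBit(0));  // 1: each bit is exactly one document
console.log(docsPerBit(3));  // 8: a rank 3 bit covers 8 documents
console.log(bitIndex(3, 7)); // 0: documents 0..7 share bit 0
console.log(bitIndex(3, 8)); // 1: document 8 starts the next block
```

&lt;p&gt;A rank 3 bit is set if any of the 8 documents in its block wants to set it, which is why higher-rank rows are denser.&lt;/p&gt;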

&lt;h3 id=&#34;fixedsizedblob&#34;&gt;FixedSizedBlob&lt;/h3&gt;

&lt;p&gt;Per-document storage for fixed-size chunks of data (i.e., data where the size of the blob is the same for every document). Because items are fixed-size, they can be stored in an array or an array-like structure.&lt;/p&gt;

&lt;h3 id=&#34;variablesizedblob&#34;&gt;VariableSizedBlob&lt;/h3&gt;

&lt;p&gt;Per-document storage for variable-size chunks of data (i.e., data where the size of the blob can be different for every document). Because items are variable-size, pointers to items are stored.&lt;/p&gt;

&lt;h3 id=&#34;doctable-doctabledescriptor&#34;&gt;DocTable / DocTableDescriptor&lt;/h3&gt;

&lt;p&gt;The DocTable is a collection of per-document data items for a Slice. An item in the DocTable consists of some number of FixedSizedBlobs as well as pointers to VariableSizedBlobs.&lt;/p&gt;

&lt;h3 id=&#34;rowtabledescriptor&#34;&gt;RowTableDescriptor&lt;/h3&gt;

&lt;p&gt;RowTableDescriptor exposes low-level operations on a Slice like GetBit, SetBit, and ClearBit. Given a pointer to a SliceBuffer, a RowIndex, and a DocIndex, the RowTableDescriptor lets you actually manipulate the information inside a Slice.&lt;/p&gt;

&lt;h3 id=&#34;documenthandle&#34;&gt;DocumentHandle&lt;/h3&gt;

&lt;h3 id=&#34;tokenmanager-tokentracker-token&#34;&gt;TokenManager / TokenTracker / Token&lt;/h3&gt;

&lt;p&gt;These are used to track the liveness of Slices.&lt;/p&gt;

&lt;p&gt;The top-level object is a TokenManager, which can hand out TokenTrackers and Tokens. Tokens are basically monotonically increasing serial numbers that can be outstanding or complete. Each TokenTracker tracks whether the TokenManager has any outstanding tokens issued before a cut-off serial number.&lt;/p&gt;
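&lt;p&gt;A toy model of the idea (our own sketch, not the actual BitFunnel API):&lt;/p&gt;

```javascript
// Toy model of TokenManager / TokenTracker; not the real BitFunnel API.
// Tokens are monotonically increasing serial numbers; a tracker asks
// whether any token issued before its cut-off is still outstanding.
class TokenManager {
  constructor() {
    this.nextSerial = 0;
    this.outstanding = new Set();
  }

  // Hand out a new token (represented here by just its serial number).
  issue() {
    const serial = this.nextSerial;
    this.nextSerial += 1;
    this.outstanding.add(serial);
    return serial;
  }

  // Mark a token complete.
  complete(serial) {
    this.outstanding.delete(serial);
  }

  // A "tracker" with cut-off c is done once no token with a serial
  // below c is still outstanding.
  trackerDone(cutoff) {
    for (const serial of this.outstanding) {
      if (cutoff > serial) {
        return false;
      }
    }
    return true;
  }
}

const manager = new TokenManager();
const a = manager.issue();                // serial 0
const b = manager.issue();                // serial 1
const cutoff = manager.nextSerial;        // track tokens issued before now
console.log(manager.trackerDone(cutoff)); // false: 0 and 1 outstanding
manager.complete(a);
manager.complete(b);
console.log(manager.trackerDone(cutoff)); // true: nothing outstanding
```

&lt;p&gt;This is the mechanism that lets the system know when every reader that might still touch a Slice has finished with it.&lt;/p&gt;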

&lt;h3 id=&#34;recycler&#34;&gt;Recycler&lt;/h3&gt;

&lt;p&gt;The Recycler is a rudimentary garbage collector for slices. When the Index is done with a Slice, the Slice gets Expired. Expiring a slice schedules it to be recycled by a Recycler. Recycling (i.e., destruction) occurs when all tokens related to the slice are expired, i.e., when all users (read threads) of the Slice are done with the Slice.&lt;/p&gt;

&lt;h3 id=&#34;termtreatment&#34;&gt;TermTreatment&lt;/h3&gt;

&lt;p&gt;A mapping from term characteristics (today, IDF and gram size) to RowConfiguration (i.e., the number of rows at each rank).&lt;/p&gt;
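&lt;p&gt;In other words, a TermTreatment is a function from term characteristics to a RowConfiguration. Here is a hypothetical sketch; the IDF cut-off and the row counts are invented for illustration, and only the shape of the mapping comes from this entry:&lt;/p&gt;

```javascript
// Hypothetical TermTreatment sketch. The IDF cut-off and row counts
// below are made-up values; only the shape (term characteristics in,
// rows-per-rank out) reflects the glossary entry.
function rowConfiguration(idf, gramSize) {
  if (gramSize > 1) {
    // Phrases (N-grams): an invented configuration.
    return { rank0: 2, rank3: 0 };
  }
  if (idf > 4.0) {
    // Rarer words: another invented configuration.
    return { rank0: 1, rank3: 0 };
  }
  // Common words: spread across ranks (again, invented numbers).
  return { rank0: 1, rank3: 2 };
}

console.log(rowConfiguration(6.0, 1).rank0); // 1
```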

&lt;h3 id=&#34;row&#34;&gt;Row&lt;/h3&gt;

&lt;h3 id=&#34;rowidsequence&#34;&gt;RowIdSequence&lt;/h3&gt;

&lt;h2 id=&#34;other-concepts&#34;&gt;Other concepts&lt;/h2&gt;

&lt;h3 id=&#34;allocator-iallocator&#34;&gt;Allocator / IAllocator&lt;/h3&gt;

&lt;h3 id=&#34;factories&#34;&gt;Factories&lt;/h3&gt;

&lt;h3 id=&#34;filesystem&#34;&gt;FileSystem&lt;/h3&gt;

&lt;h3 id=&#34;filemanager&#34;&gt;FileManager&lt;/h3&gt;

&lt;h3 id=&#34;sharddefinition&#34;&gt;ShardDefinition&lt;/h3&gt;

&lt;h3 id=&#34;streamconfiguration&#34;&gt;StreamConfiguration&lt;/h3&gt;

&lt;h3 id=&#34;interface-iinterface&#34;&gt;Interface / IInterface&lt;/h3&gt;

&lt;h3 id=&#34;configuration-iconfiguration&#34;&gt;Configuration / IConfiguration&lt;/h3&gt;

&lt;h3 id=&#34;document-idocument&#34;&gt;Document / IDocument&lt;/h3&gt;

&lt;h3 id=&#34;documentcache&#34;&gt;DocumentCache&lt;/h3&gt;

&lt;h3 id=&#34;documentdataschema&#34;&gt;DocumentDataSchema&lt;/h3&gt;

&lt;h3 id=&#34;documentfrequencytable&#34;&gt;DocumentFrequencyTable&lt;/h3&gt;

&lt;h3 id=&#34;factset&#34;&gt;FactSet&lt;/h3&gt;

&lt;h3 id=&#34;simpleindex&#34;&gt;SimpleIndex&lt;/h3&gt;

&lt;h3 id=&#34;slicebufferallocator&#34;&gt;SliceBufferAllocator&lt;/h3&gt;

&lt;h3 id=&#34;termtablebuilder&#34;&gt;TermTableBuilder&lt;/h3&gt;

&lt;h3 id=&#34;termtablecollection&#34;&gt;TermTableCollection&lt;/h3&gt;

&lt;h3 id=&#34;packedrowidsequence&#34;&gt;PackedRowIdSequence&lt;/h3&gt;

&lt;h3 id=&#34;abstractrow&#34;&gt;AbstractRow&lt;/h3&gt;

&lt;h3 id=&#34;querypipeline&#34;&gt;QueryPipeline&lt;/h3&gt;

&lt;h3 id=&#34;queryplanner&#34;&gt;QueryPlanner&lt;/h3&gt;

&lt;h3 id=&#34;rowmatchnode&#34;&gt;RowMatchNode&lt;/h3&gt;

&lt;h3 id=&#34;rowplan&#34;&gt;RowPlan&lt;/h3&gt;

&lt;h3 id=&#34;termmatchnode&#34;&gt;TermMatchNode&lt;/h3&gt;

&lt;h3 id=&#34;termmatchtreeevaulator&#34;&gt;TermMatchTreeEvaulator&lt;/h3&gt;

&lt;h3 id=&#34;termplan&#34;&gt;TermPlan&lt;/h3&gt;

&lt;h3 id=&#34;termplanconverter&#34;&gt;TermPlanConverter&lt;/h3&gt;

&lt;h3 id=&#34;chunkenumerator&#34;&gt;ChunkEnumerator&lt;/h3&gt;

&lt;h3 id=&#34;chunkingestor&#34;&gt;ChunkIngestor&lt;/h3&gt;

&lt;h3 id=&#34;chunkmanifestingestor&#34;&gt;ChunkManifestIngestor&lt;/h3&gt;

&lt;h3 id=&#34;chunkreader&#34;&gt;ChunkReader&lt;/h3&gt;

&lt;h3 id=&#34;chunktaskprocessor&#34;&gt;ChunkTaskProcessor&lt;/h3&gt;

&lt;h3 id=&#34;configuration&#34;&gt;Configuration&lt;/h3&gt;

&lt;h3 id=&#34;documentmap&#34;&gt;DocumentMap&lt;/h3&gt;

&lt;h3 id=&#34;termtotext&#34;&gt;TermToText&lt;/h3&gt;

&lt;h3 id=&#34;abstractrowenumerator&#34;&gt;AbstractRowEnumerator&lt;/h3&gt;

&lt;h3 id=&#34;bytecodeinterpreter&#34;&gt;ByteCodeInterpreter&lt;/h3&gt;

&lt;h3 id=&#34;compilenode&#34;&gt;CompileNode&lt;/h3&gt;

&lt;h3 id=&#34;matchtreerewriter&#34;&gt;MatchTreeRewriter&lt;/h3&gt;

&lt;h3 id=&#34;matchverifier&#34;&gt;MatchVerifier&lt;/h3&gt;

&lt;h3 id=&#34;planrows&#34;&gt;PlanRows&lt;/h3&gt;

&lt;h3 id=&#34;queryparser&#34;&gt;QueryParser&lt;/h3&gt;

&lt;h3 id=&#34;rankdowncompiler&#34;&gt;RankDownCompiler&lt;/h3&gt;

&lt;h3 id=&#34;rankzerocompiler&#34;&gt;RankZeroCompiler&lt;/h3&gt;

&lt;h3 id=&#34;registerallocator&#34;&gt;RegisterAllocator&lt;/h3&gt;

&lt;h3 id=&#34;simpleplanner&#34;&gt;SimplePlanner&lt;/h3&gt;

&lt;p&gt;A rudimentary query planner that only handles AND queries using the ByteCodeInterpreter. Takes a TermMatchNode tree, generates bytecode, and then runs the bytecode.&lt;/p&gt;

&lt;h2 id=&#34;todos&#34;&gt;TODOs:&lt;/h2&gt;

&lt;p&gt;This should be grouped into more than just &amp;ldquo;top level&amp;rdquo;, &amp;ldquo;important&amp;rdquo;, and &amp;ldquo;other&amp;rdquo;, but we&amp;rsquo;ve been saying we should do this for ages, so I&amp;rsquo;m putting this version out there just so there&amp;rsquo;s something.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How do we make onboarding to BitFunnel easier?</title>
      <link>http://bitfunnel.org/new-contributors/</link>
      <pubDate>Wed, 12 Oct 2016 15:02:02 -0700</pubDate>
      
      <guid>http://bitfunnel.org/new-contributors/</guid>
      <description>&lt;p&gt;I’ve been working on BitFunnel for roughly six months now. If I look at how I’ve used that time, my guess is that I’ve taken about a month of Mike’s time. If you look at &lt;a href=&#34;//bitfunnel.org/progress/&#34;&gt;the progress we’ve made&lt;/a&gt;, I think that’s a pretty good trade-off, but it doesn’t make for a scalable open source project.&lt;/p&gt;

&lt;p&gt;It makes sense to invest a month of time in a full-time employee since they’re likely to be around for at least a year or two, and even in the case of an extraordinarily bad fit, they’ll probably stick around for long enough that you’ll get your time-investment back. But if it takes a month of your time to onboard a new open source contributor, that’s a losing proposition.&lt;/p&gt;

&lt;p&gt;Mike and I often talk about &lt;a href=&#34;//bitfunnel.org/on-the-road-to-open-source/&#34;&gt;his experience trying to help some MSR folks use BitFunnel&lt;/a&gt; as an example of the kind of thing we’d like to fix. We’ve made a lot of progress on that front; we now tend to write up tutorials of how to run the system as it exists when we make progress on adding functionality. Those documents are often a week or two behind the current code, but that’s not bad.&lt;/p&gt;

&lt;p&gt;But if you look at the difficulty of not just running BitFunnel, but trying to actively contribute to it, that’s not so great. Things have improved a lot, but the onboarding process is still pretty rough. Mike has been pretty good about writing &lt;a href=&#34;http://bitfunnel.org/series/design/&#34;&gt;design notes&lt;/a&gt; for components, but I haven’t been keeping up, and even if I were keeping up with Mike’s production of design notes, we might have eight of those documents instead of four, in a system where we have roughly 250 classes and are creating new classes all the time. That isn’t quite a fair comparison, because a design note can discuss multiple classes, but it gives you a rough idea of how much of the system doesn’t have design docs.&lt;/p&gt;

&lt;p&gt;On my end, I can try to help by at least catching up to Mike’s production of design notes. I’ll also try to write a glossary, so that things that we haven’t deeply documented still have some explanation. I don’t think that’s enough, but I’m not sure what else to do.&lt;/p&gt;

&lt;p&gt;We try to keep a list of issues &lt;a href=&#34;https://github.com/BitFunnel/BitFunnel/labels/easy&#34;&gt;tagged easy&lt;/a&gt; and we’ve gotten some pull requests as a result. That’s great, but there’s a huge gap between being able to fix easy issues and being able to make major contributions. We’d like to make it easier to bridge the gap between fixing small issues and making large changes, but we’re not sure how to do that.&lt;/p&gt;

&lt;p&gt;If you have any suggestions for how we can improve the situation for new contributors, &lt;a href=&#34;https://github.com/bitfunnel/bitfunnel/issues&#34;&gt;please let us know&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Debugging an SEH Crash</title>
      <link>http://bitfunnel.org/debugging-an-seh-crash/</link>
      <pubDate>Mon, 10 Oct 2016 15:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/debugging-an-seh-crash/</guid>
      <description>&lt;p&gt;Here&amp;rsquo;s a video showing how I debugged a read access violation
that was caused by an earlier buffer overflow. This sort of problem
can sometimes be hard to track down, but in this case, a data breakpoint
made my job easier.&lt;/p&gt;

&lt;p&gt;The video discusses the BlockAllocator, Slice buffers and the Row Tables they contain.
If you&amp;rsquo;d like to try diagnosing the bug yourself, just checkout the
&lt;a href=&#34;https://github.com/BitFunnel/BitFunnel/commits/SEHBug&#34;&gt;SEHBug&lt;/a&gt; branch of the
&lt;a href=&#34;https://github.com/BitFunnel/BitFunnel&#34;&gt;BitFunnel&lt;/a&gt; repository.&lt;/p&gt;


&lt;div style=&#34;position: relative; padding-bottom: 56.25%; padding-top: 30px; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;//www.youtube.com/embed/DaIk2vJajpk&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%;&#34; allowfullscreen frameborder=&#34;0&#34;&gt;&lt;/iframe&gt;
 &lt;/div&gt;

</description>
    </item>
    
    <item>
      <title>When will BitFunnel be usable?</title>
      <link>http://bitfunnel.org/progress/</link>
      <pubDate>Tue, 04 Oct 2016 00:00:01 -0700</pubDate>
      
      <guid>http://bitfunnel.org/progress/</guid>
      <description>&lt;p&gt;How long should we expect this project to take? In theory, we should have a relatively easy time guessing how long this project will take because this project is a half-port-half-rewrite whose aim to produce an open source version that&amp;rsquo;s simpler than the internal version of the project, and we know how big the original project is.&lt;/p&gt;

&lt;p&gt;If we do &lt;code&gt;find . -name &amp;quot;*.h&amp;quot; -o -name &amp;quot;*.cpp&amp;quot; | grep -v NativeJIT | xargs wc&lt;/code&gt; on the original project to count all lines of code except NativeJIT, we get roughly 144k lines of code. I&amp;rsquo;m excluding NativeJIT because that was ported separately from the &lt;a href=&#34;https://github.com/bitfunnel/bitfunnel/&#34;&gt;BitFunnel&lt;/a&gt; repo, so our extrapolation should exclude that.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;re currently at about 53kLOC in the new BitFunnel repo. If we graph the progress, we can see that it&amp;rsquo;s been roughly linear since May.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://bitfunnel.org/progress/bitfunnel-progress.png&#34; width=&#34;1600&#34; height=&#34;1200&#34;&gt;&lt;/p&gt;

&lt;p&gt;The date is on the x-axis and lines of code are on the y-axis. It&amp;rsquo;s a bit surprising to me that the progress looks so linear. We&amp;rsquo;ve had periods where I&amp;rsquo;ve been busy with non-coding duties and Mike has done the vast majority of the coding, and we&amp;rsquo;ve had periods where Mike&amp;rsquo;s been busy with non-coding duties and I&amp;rsquo;ve been doing the vast majority of the coding. Despite the wildly varying coding workload we&amp;rsquo;ve taken on at times, when you average everything out, progress has been approximately linear.&lt;/p&gt;

&lt;p&gt;I don&amp;rsquo;t expect this to continue indefinitely &amp;ndash; once we get to the point where we have enough of a system stood up so that we can run experiments, progress as measured in lines of code should slow down. We should also see some slowdown when we do integration work and integration testing with whatever we&amp;rsquo;re going to integrate with, which will probably be a lot of work but not much code. On top of that, we&amp;rsquo;ll probably enter a slow period as the holiday season rolls around. Additionally, the lines of code in the new project are somewhat differently scaled than the lines of code in the old project because we&amp;rsquo;ve been adding a license at the top of most files. With all those disclaimers aside, if we guess that we&amp;rsquo;ll end up with somewhere between 1/2x and 1x as much code as the original project, we can make a crude estimate of how long it will take to &amp;ldquo;finish&amp;rdquo; BitFunnel:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://bitfunnel.org/progress/bitfunnel-extrapolation.png&#34; width=&#34;1600&#34; height=&#34;1200&#34;&gt;&lt;/p&gt;

&lt;p&gt;This is the same graph as before, but with a red horizontal line at the size of the old BitFunnel project and a green horizontal line at half the size of the old BitFunnel project. If we believe the linear estimate, we might be &amp;ldquo;done&amp;rdquo; anywhere between late this year and next July. If we take all of the caveats listed above into account, it&amp;rsquo;s likely that we won&amp;rsquo;t have something &amp;ldquo;complete&amp;rdquo; this calendar year. Beyond that, the error bars are so large that it&amp;rsquo;s hard to say much except that it&amp;rsquo;s plausible that we&amp;rsquo;ll have something &amp;ldquo;complete&amp;rdquo; by the end of the next calendar year.&lt;/p&gt;
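&lt;p&gt;For concreteness, here is the arithmetic behind that estimate. The current size (53 kLOC) and the 144 kLOC target come from this post; the monthly rate is an assumed round number, not a measured velocity:&lt;/p&gt;

```javascript
// Back-of-the-envelope schedule estimate. 53 kLOC and 144 kLOC come
// from this post; the 10 kLOC/month rate is an assumption for
// illustration, not a measured figure.
function monthsToTarget(currentKloc, targetKloc, klocPerMonth) {
  return (targetKloc - currentKloc) / klocPerMonth;
}

const current = 53;   // kLOC in the new repo today
const half = 144 / 2; // green line: half the original size
const full = 144;     // red line: the full original size

console.log(monthsToTarget(current, half, 10)); // 1.9
console.log(monthsToTarget(current, full, 10)); // 9.1
```

&lt;p&gt;With that assumed rate, the two crossings land in roughly the window described above: a couple of months to reach the half-size line and the better part of a year to reach the full-size line.&lt;/p&gt;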
</description>
    </item>
    
    <item>
      <title>All&#39;s Well That Ends Well</title>
      <link>http://bitfunnel.org/alls-well-that-ends-well/</link>
      <pubDate>Sat, 24 Sep 2016 15:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/alls-well-that-ends-well/</guid>
      <description>

&lt;p&gt;We&amp;rsquo;ve been having some stability problems of late.
In our rush to get some minimal version of the document ingestion pipeline
up and running, we created a number of tools for gathering corpus statistics
and configuring term tables and we built an
&lt;a href=&#34;http://bitfunnel.org/index-build-tools/&#34;&gt;interactive REPL console&lt;/a&gt; to help
our readers better understand the system. These tools are mostly system
integrations, and as such, are not covered by unit tests. In recent days we&amp;rsquo;ve
found these integrations to be broken more often than working.&lt;/p&gt;

&lt;p&gt;While it feels great to be pouring a lot of concrete, we&amp;rsquo;ve decided to pause
in order to shore up our foundations. One focus is to develop tests for the
integration code.&lt;/p&gt;

&lt;p&gt;The challenge with writing these tests stems from file system operations.
The &lt;code&gt;BitFunnel statistics&lt;/code&gt; command, for example, reads a number of configuration files
and &lt;a href=&#34;http://bitfunnel.org/corpus-file-format&#34;&gt;BitFunnel chunks&lt;/a&gt; from disk, and then after a bit of computation
writes out a bunch of intermediate files like histograms and document frequency tables.
As we continue to bring over new modules and functionality, the number of
configuration, data, and intermediate files will only grow.&lt;/p&gt;

&lt;p&gt;We could just write tests that rely on the filesystem, but
&lt;em&gt;as a general rule, I avoid writing tests that access the filesystem&lt;/em&gt;.
My rationale relates to system configuration, developer data safety,
and test brittleness.
Let&amp;rsquo;s consider each of these.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Configuration&lt;/strong&gt;
In order for a test to access a file, the system must be configured correctly.
This means that the test needs to know the path to the file and, in the case of
a read operation, the file itself must exist and have the right permissions.
In the case of a write operation, the path to the file must exist or be created and
the test needs some policy for handling the case where it wants to overwrite
an existing file (such as the partial file left over from the previous test run
which crashed).&lt;/p&gt;

&lt;p&gt;All of these problems are small. We could easily add a step to the
&lt;a href=&#34;https://github.com/BitFunnel/BitFunnel/blob/master/README.md&#34;&gt;README.md&lt;/a&gt;
instructing the developer to setup an environment variable with the path
to the test files. We could add a post-build step that creates the right
directories and copies the required data files. We could use an OS
specific temp directory generator. We could run tests in containers.
All of these would work - the problem is that they add up over time
to make the project hard to use.&lt;/p&gt;

&lt;p&gt;Our goal is ease of use - the ideal onboarding experience
is to clone the repo, install one or two tools
(like the C++ compiler and CMake) and then kick off a build that works 100%
of the time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Safety&lt;/strong&gt;
There&amp;rsquo;s another problem with tests that access the filesystem and I&amp;rsquo;ve been
bitten by this more than once. What happens is some well intentioned developer
writes a piece of code that &amp;ldquo;cleans&amp;rdquo; the test directory in preparation for the
next test run. Then through a combination of string handling bugs or poorly
chosen file names, real files, having nothing to do with the test, end up getting
clobbered. Or maybe the test itself overwrites that great American novel draft
you&amp;rsquo;ve been working on for years. These situations always lead to tears and usually
the well intentioned developer suggests it was your fault for storing important
files in whatever directory seemed like an obvious choice for test output.
Or it was your fault for not setting $TEMP before the test did an &lt;code&gt;rm -rf $TEMP/*&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test Brittleness&lt;/strong&gt;
Let&amp;rsquo;s face it - working with files is hard. It is not because the code is
complex - the problem is that files are outside of the process sandbox.
Anyone can mess with a file. Maybe a virus checker quarantined your file.
Maybe a previous test run was done with elevated permissions and now the
current test run can&amp;rsquo;t overwrite the old files. Maybe you didn&amp;rsquo;t escape
characters properly in the file name or you generated a temp path that was
too long. Maybe a zombie process is holding a write handle. Maybe it works
on the PC, but not the Mac.&lt;/p&gt;

&lt;p&gt;There are a million reasons why tests that interact with the filesystem become
brittle. We want developers to run tests early and run tests often. If the
tests are fast and 100% reliable, we will enter a virtuous cycle. If the tests
are flaky or slow, developers will stop running them and relying on them
and we end up in a vicious cycle.&lt;/p&gt;

&lt;h3 id=&#34;the-bard-comes-to-my-rescue&#34;&gt;The Bard Comes to My Rescue&lt;/h3&gt;

&lt;p&gt;Developing self-contained integration tests that don&amp;rsquo;t hit the filesystem
will take some time. My first challenge is to find a replacement for the
17k Wikipedia pages we&amp;rsquo;re using for today&amp;rsquo;s tests. My criteria for the test
corpus are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small enough to embed in a C++ source file.&lt;/li&gt;
&lt;li&gt;Large enough to support interesting scenarios.&lt;/li&gt;
&lt;li&gt;Permissive license compatible with the MIT License.&lt;/li&gt;
&lt;li&gt;Text makes some amount of sense to humans.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I finally came up with was
&lt;a href=&#34;https://en.wikipedia.org/wiki/William_Shakespeare&#34;&gt;Shakespeare&amp;rsquo;s&lt;/a&gt; Sonnets.
There are 154 of them,
they fit into about 172KB, they are plain text, and they are in the public
domain.&lt;/p&gt;

&lt;p&gt;My next step was to convert the Bard&amp;rsquo;s immortal words into C++ code.
Here&amp;rsquo;s his source code, from the 1609 quarto entitled
&lt;a href=&#34;https://en.wikipedia.org/wiki/Shakespeare%27s_sonnets&#34;&gt;&amp;ldquo;SHAKE-SPEARES SONNETS. Never before Imprinted.&amp;rdquo;&lt;/a&gt;&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/alls-well-that-ends-well/sonnet2.jpg&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;Fortunately &lt;a href=&#34;https://www.gutenberg.org/&#34;&gt;Project Gutenberg&lt;/a&gt;
did the heavy lifting, converting scans of the original text
to &lt;a href=&#34;http://www.gutenberg.org/cache/epub/1041/pg1041.txt&#34;&gt;ASCII&lt;/a&gt;
while updating the spelling:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When forty winters shall besiege thy brow,&lt;br /&gt;
And dig deep trenches in thy beauty&amp;rsquo;s field,&lt;br /&gt;
Thy youth&amp;rsquo;s proud livery so gazed on now,&lt;br /&gt;
Will be a tatter&amp;rsquo;d weed of small worth held:&lt;br /&gt;
Then being asked, where all thy beauty lies,&lt;br /&gt;
Where all the treasure of thy lusty days;&lt;br /&gt;
To say, within thine own deep sunken eyes,&lt;br /&gt;
Were an all-eating shame, and thriftless praise.&lt;br /&gt;
How much more praise deserv&amp;rsquo;d thy beauty&amp;rsquo;s use,&lt;br /&gt;
If thou couldst answer &amp;lsquo;This fair child of mine&lt;br /&gt;
Shall sum my count, and make my old excuse,&amp;rsquo;&lt;br /&gt;
Proving his beauty by succession thine!&lt;br /&gt;
This were to be new made when thou art old,&lt;br /&gt;
And see thy blood warm when thou feel&amp;rsquo;st it cold.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this point my job was to tokenize the text and then
write it out as C++ string literals in
&lt;a href=&#34;http://bitfunnel.org/corpus-file-format&#34;&gt;BitFunnel chunk format&lt;/a&gt;.
I could have used the
&lt;a href=&#34;https://github.com/BitFunnel/Workbench/blob/master/README.md&#34;&gt;Workbench Tool&lt;/a&gt;
or &lt;a href=&#34;https://lucene.apache.org/&#34;&gt;Lucene&lt;/a&gt;,
but these seemed like giant hammers for a really small nail.
In the end I wrote a little Node.js app to do the work.&lt;/p&gt;

&lt;p&gt;The process was mostly straightforward. I did
need to use some care with the single quotes. Sometimes a
single quote was used in a contraction like &amp;ldquo;feel&amp;rsquo;st&amp;rdquo; or
&amp;ldquo;tatter&amp;rsquo;d&amp;rdquo; or a possessive like &amp;ldquo;beauty&amp;rsquo;s&amp;rdquo; or &amp;ldquo;youth&amp;rsquo;s&amp;rdquo;.
In these cases, I wanted to keep the quote as part of the token.&lt;/p&gt;

&lt;p&gt;Other times the single quote was used to demarcate a phrase,
as in&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&amp;lsquo;This fair child of mine&lt;br /&gt;
Shall sum my count, and make my old excuse,&amp;rsquo;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These quotes should never be part of a token. My strategy was to
first replace all of the interesting quotes with a sentinel character.
I used the &lt;code&gt;&#39;#&#39;&lt;/code&gt; character since it didn&amp;rsquo;t appear elsewhere in the corpus.
Once these quotes were safely marked, I removed all of the remaining quotes
and other punctuation. Then I replaced each &lt;code&gt;&#39;#&#39;&lt;/code&gt; with a single quote.
Here&amp;rsquo;s the code I used to clean and tokenize each line.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-js&#34;&gt;function ProcessLine(input) {
    // Convert input to lower case.
    var a = input.toLowerCase();

    // Use hash to temporarily mark single quotes used in contractions.
    var b = a.replace(/(\w)&#39;(\w)/g, &amp;quot;$1#$2&amp;quot;);

    // Remove all punctuation, including remaining single quotes.
    var c = b.replace(/[,.!?:;&#39;]/g, &amp;quot;&amp;quot;);

    // Convert contraction markers back to single quotes.
    var d = c.replace(/#/g, &amp;quot;&#39;&amp;quot;);

    // Replace spaces with word-end markers.
    var e = d.replace(/[ ]/g, &amp;quot;\\0&amp;quot;) + &amp;quot;\\0&amp;quot;;

    return e;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Outputting the C++ code was mostly straightforward.
The only hitch involved concatenating octal escape codes with Arabic numerals
in the C string literals. Take a look at the sample output below. The third
line, &lt;code&gt;&amp;quot;01\0Sonnet\0&amp;quot; &amp;quot;2\0\0&amp;quot;&lt;/code&gt; had to be broken into two adjacent string
literals in order to keep the &lt;code&gt;&amp;quot;\0&amp;quot;&lt;/code&gt; after &lt;code&gt;&amp;quot;Sonnet&amp;quot;&lt;/code&gt; from concatenating with
the &lt;code&gt;&amp;quot;2&amp;quot;&lt;/code&gt; to form the octal literal &lt;code&gt;&amp;quot;\02&amp;quot;&lt;/code&gt;. Fortunately this situation only
appeared in the titles so it was easy to special case the treatment.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;char const * sonnet2 = 
    &amp;quot;0000000000000002\0&amp;quot;
    &amp;quot;02\0https://en.wikipedia.org/wiki/Sonnet_2\0\0&amp;quot;
    &amp;quot;01\0Sonnet\0&amp;quot; &amp;quot;2\0\0&amp;quot;
    &amp;quot;00\0&amp;quot;
    &amp;quot;when\0forty\0winters\0shall\0besiege\0thy\0brow\0&amp;quot;
    &amp;quot;and\0dig\0deep\0trenches\0in\0thy\0beauty&#39;s\0field\0&amp;quot;
    &amp;quot;thy\0youth&#39;s\0proud\0livery\0so\0gazed\0on\0now\0&amp;quot;
    &amp;quot;will\0be\0a\0tatter&#39;d\0weed\0of\0small\0worth\0held\0\0&amp;quot;
    &amp;quot;then\0being\0asked\0where\0all\0thy\0beauty\0lies\0&amp;quot;
    &amp;quot;where\0all\0the\0treasure\0of\0thy\0lusty\0days\0\0&amp;quot;
    &amp;quot;to\0say\0within\0thine\0own\0deep\0sunken\0eyes\0&amp;quot;
    &amp;quot;were\0an\0all-eating\0shame\0and\0thriftless\0praise\0&amp;quot;
    &amp;quot;how\0much\0more\0praise\0deserv&#39;d\0thy\0beauty&#39;s\0use\0&amp;quot;
    &amp;quot;if\0thou\0couldst\0answer\0this\0fair\0child\0of\0mine\0&amp;quot;
    &amp;quot;shall\0sum\0my\0count\0and\0make\0my\0old\0excuse\0&amp;quot;
    &amp;quot;proving\0his\0beauty\0by\0succession\0thine\0&amp;quot;
    &amp;quot;this\0were\0to\0be\0new\0made\0when\0thou\0art\0old\0&amp;quot;
    &amp;quot;and\0see\0thy\0blood\0warm\0when\0thou\0feel&#39;st\0it\0cold\0&amp;quot;
    &amp;quot;\0\0&amp;quot;;
&lt;/code&gt;&lt;/pre&gt;
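&lt;p&gt;Our own sketch of that special case (not the actual generator code): if the text that follows a &lt;code&gt;\0&lt;/code&gt; terminator starts with an octal digit, close the literal and start an adjacent one:&lt;/p&gt;

```javascript
// Sketch of the octal-escape special case; not the actual generator.
// In a C++ literal, "\0" followed immediately by an octal digit (0-7)
// would be parsed as a longer octal escape like "\02", so we close the
// literal after "\0" and start an adjacent one for the digit.
function appendAfterNul(next) {
  if (/^[0-7]/.test(next)) {
    return '\\0" "' + next; // e.g. produces: \0" "2
  }
  return "\\0" + next;
}

// "Sonnet" plus appendAfterNul("2") yields the split form seen above:
console.log('"01\\0Sonnet' + appendAfterNul("2") + '\\0\\0"');
// "01\0Sonnet\0" "2\0\0"
```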

&lt;h3 id=&#34;putting-it-all-together&#34;&gt;Putting it all together.&lt;/h3&gt;

&lt;p&gt;Translating a small corpus into C++ string literals was a good first step.
Making the end-to-end integration test required that I also virtualize all of the
filesystem interactions, but this is a tale for another post.&lt;/p&gt;

&lt;p&gt;One nice outcome of this work is that the build now generates an example that
automatically configures and runs an index with no requirement to download
corpus files.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s called &lt;code&gt;TheBard&lt;/code&gt;. It runs the corpus statistics gathering
stage on the sonnets, then builds a &lt;code&gt;TermTable&lt;/code&gt;, and then boots up an interactive
BitFunnel REPL console.&lt;/p&gt;

&lt;p&gt;There are only a few command-line arguments and they happen to be optional.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% TheBard -help
TheBard
A small end-to-end index configuration and ingestion example based on 
154 Shakespeare sonnets.

Usage:
./TheBard [-help]
          [-verbose]
          [-gramsize &amp;lt;integer&amp;gt;]

[-help]
    Display help for this program. (boolean, defaults to false)


[-verbose]
    Print information gathered during statistics and termtable stages. 
    (boolean, defaults to false)


[-gramsize &amp;lt;integer&amp;gt;]
    Set the maximum ngram size for phrases. (integer, defaults to 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here&amp;rsquo;s a sample session. In this case I didn&amp;rsquo;t supply the &lt;code&gt;-gramsize&lt;/code&gt; parameter
so we&amp;rsquo;ll be working with an index of unigrams.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% TheBard
Initializing RAM filesystem.
Gathering corpus statistics.
Building the TermTable.
Index is now configured.

Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.)

directory = &amp;quot;config&amp;quot;
gram size = 1

Starting index ...
Index started successfully.

Type &amp;quot;help&amp;quot; to get started.

0: help
Available commands:
  cache   Ingests documents into the index and also stores them in a cache
          for query verification purposes.
  delay   Prints a message after certain number of seconds
  help    Displays a list of available commands.
  load    Ingests documents into the index
  query   Process a single query or list of queries. (TODO)
  quit    waits for all current tasks to complete then exits.
  script  Runs commands from a file.(TODO)
  show    Shows information about various data structures. (TODO)
  status  Prints system status.
  verify  Verifies the results of a single query against the document cache.

Type &amp;quot;help &amp;lt;command&amp;gt;&amp;quot; for more information on a particular command.

1: cache manifest sonnets
Ingestion complete.

2: show rows blood
Term(&amp;quot;blood&amp;quot;)
                 d 000000000000000000000000000000000000000000000000000000
                 o 000000000111111111122222222223333333333444444444455555
                 c 123456789012345678901234567890123456789012345678901234
  RowId(3,  1111): 011000000010000001100000000000000000000000001000000000
  RowId(0,  1278): 010000000010000000100000000000000000000000000000000001

3: verify one blood
Processing query &amp;quot; blood&amp;quot;
  DocId(121)
  DocId(109)
  DocId(82)
  DocId(67)
  DocId(63)
  DocId(19)
  DocId(11)
  DocId(2)
8 match(es) out of 154 documents.

4: show rows shame
Term(&amp;quot;shame&amp;quot;)
                 d 000000000000000000000000000000000000000000000000000000
                 o 000000000111111111122222222223333333333444444444455555
                 c 123456789012345678901234567890123456789012345678901234
  RowId(3,  1058): 110000011100000000000000000000100111000000000000000000
  RowId(0,  1225): 110000011100010000000000000000000101000000000000000000

5: verify one shame
Processing query &amp;quot; shame&amp;quot;
  DocId(129)
  DocId(127)
  DocId(99)
  DocId(95)
  DocId(72)
  DocId(36)
  DocId(34)
  DocId(10)
  DocId(9)
  DocId(2)
10 match(es) out of 154 documents.

6: show rows tatter&#39;d
Term(&amp;quot;tatter&#39;d&amp;quot;)
                 d 000000000000000000000000000000000000000000000000000000
                 o 000000000111111111122222222223333333333444444444455555
                 c 123456789012345678901234567890123456789012345678901234
  RowId(3,  3349): 010000000000000000000000010000000000000000000010000000
  RowId(3,  3350): 010000000000000000000000010000000000000000000010000000
  RowId(3,  3351): 010000000000000000000000010000000000000000000010000000
  RowId(0,  1440): 010010010000000000010000010011011000000000000000000000

7: verify one tatter&#39;d
Processing query &amp;quot; tatter&#39;d&amp;quot;
  DocId(26)
  DocId(2)
2 match(es) out of 154 documents.

8: show rows love
Term(&amp;quot;love&amp;quot;)
                 d 000000000000000000000000000000000000000000000000000000
                 o 000000000111111111122222222223333333333444444444455555
                 c 123456789012345678901234567890123456789012345678901234
  RowId(0,  1019): 000000001100101000111110110010111111101101001110101000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At prompt 1, I enter &lt;code&gt;cache manifest sonnets&lt;/code&gt; which loads all 154 sonnets
into the index. I could have used &lt;code&gt;cache chunk sonnet0&lt;/code&gt; to load only the
first chunk of 11 sonnets. Note that I used the &lt;code&gt;cache&lt;/code&gt; command, instead
of the &lt;code&gt;load&lt;/code&gt; command. The difference is that the &lt;code&gt;cache&lt;/code&gt; command also
saves the IDocuments in a separate data structure that can be used to verify
queries processed by the BitFunnel engine.&lt;/p&gt;

&lt;p&gt;In prompts 2 and 3, I examine the rows associated with the word, &amp;ldquo;blood&amp;rdquo;
and then run a query verification to see which documents should
match. The &lt;code&gt;show rows&lt;/code&gt; command lists each of the RowIds associated with a term,
followed by the bits for the first 64 documents. The document ids are printed
vertically above each column of bits. In this example, we can see that
documents 2, 11, and 19 are likely to contain the word &amp;ldquo;blood&amp;rdquo; because
their columns contain only 1s. The &lt;code&gt;verify one&lt;/code&gt; command confirms these
columns are, in fact, matches, and not false positives.&lt;/p&gt;

&lt;p&gt;Prompts 4 and 5 repeat the experiment with the word, &amp;ldquo;shame&amp;rdquo;. This time we
see what appear to be matches in columns 1, 2, 8, 9, 10, 34, and 36.
The &lt;code&gt;verify one&lt;/code&gt; command shows that columns 1 and 8 are not actually
matches, but instead correspond to false
positives.&lt;/p&gt;

&lt;p&gt;Prompts 6 and 7 show &amp;ldquo;tatter&amp;rsquo;d&amp;rdquo;, a word that is considerably
rarer than &amp;ldquo;blood&amp;rdquo; and &amp;ldquo;shame&amp;rdquo;. Because &amp;ldquo;tatter&amp;rsquo;d&amp;rdquo; is rare, it requires
four rows to drive the noise down to acceptable levels.&lt;/p&gt;

&lt;p&gt;Contrast this with prompt 8, which looks at the word, &amp;ldquo;love&amp;rdquo;. Love appears in
so many documents that it must reside in its own, private row.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Searching for Primes</title>
      <link>http://bitfunnel.org/searching-for-primes/</link>
      <pubDate>Sat, 24 Sep 2016 15:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/searching-for-primes/</guid>
      <description>

&lt;p&gt;What do prime numbers have to do with BitFunnel?&lt;/p&gt;

&lt;p&gt;It turns out we use them to test our matching engine.
One of the challenges in bringing up a new search engine
is figuring out how to test it.
If you happen to have another working search engine
that has ingested the same corpus, you&amp;rsquo;re in luck - just compare its output
with that of your new search engine.&lt;/p&gt;

&lt;p&gt;Well, that&amp;rsquo;s the theory, anyway. In practice this is difficult for
a number of reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &amp;ldquo;oracle&amp;rdquo; search engine may not have indexed the same corpus as
the search engine under test.
As an example, it would be great to use a production Bing server as our
oracle, but no one knows, at any given moment, exactly which documents
are on a particular machine, and the set of documents is constantly changing.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;The two search engines may model documents differently. For example, the Bing
servers include tons of metadata and information about click streams
which isn&amp;rsquo;t meaningful to anyone outside of Bing. We could model all of this
information in a BitFunnel test, but it would involve a lot of code that was
only useful for the test.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;We&amp;rsquo;d like to make all of our tests available as open source,
so the data required to run the tests needs to be publicly available
and small enough to store on GitHub.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Down the road, we plan to configure an instance of Lucene as our oracle,
but today, we need a really small, lightweight test that can be used
for debugging and run before every commit.&lt;/p&gt;

&lt;h3 id=&#34;a-synthetic-corpus&#34;&gt;A Synthetic Corpus&lt;/h3&gt;

&lt;p&gt;Our solution was to generate a synthetic corpus. We wanted something with
the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Corpus should be trivial to generate.&lt;/li&gt;
&lt;li&gt;Arbitrarily large corpora can be constructed efficiently.&lt;/li&gt;
&lt;li&gt;Match verification algorithm should be trivial.&lt;/li&gt;
&lt;li&gt;Match verification should be fast.&lt;/li&gt;
&lt;li&gt;Can model phrases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, our goal is to test our query pipeline as it transforms
the Term tree into various Row trees, which become CompileNode trees,
which drop into an ICodeGenerator that yields native x64 code or
byte code for our interpreter.&lt;/p&gt;

&lt;p&gt;For these tests, we&amp;rsquo;re not concerned with the probabilistic nature of
BitFunnel - we just want to know if the matcher is computing the right
boolean expression over bits loaded from rows. We can easily eliminate
all probabilistic behavior by configuring the &lt;code&gt;TermTable&lt;/code&gt; to place
each term in its own, private row.&lt;/p&gt;

&lt;p&gt;Since these tests eliminate probabilistic behavior, there is no
requirement that our synthetic corpus have statistics that model a real
world corpus. We can use really wacky documents, as long as they
support enough interesting test cases.&lt;/p&gt;

&lt;h3 id=&#34;using-prime-factorizations&#34;&gt;Using Prime Factorizations&lt;/h3&gt;

&lt;p&gt;The solution we settled on was to model each document as containing
only those terms corresponding to the integers that make up the
prime factorization of the document&amp;rsquo;s id.&lt;/p&gt;

&lt;p&gt;As an example, document number 100 might look something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Title: 100&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;2 2 5 5&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and document 770 might look something like&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Title: 770&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;2 5 7 11&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and document 1223, which corresponds to a prime number, would have only a single term&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Title: 1223&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1223&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With this document structure, it is trivial to determine if a document
contains a specific prime number term. Here&amp;rsquo;s the code:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;bool Contains(size_t docId, size_t term)
{
    return (docId % term) == 0ull;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Phrase matches are easy to detect, as well, if we model the documents
as &lt;em&gt;ordered sequences&lt;/em&gt; of prime factors. All we need to do is ask whether
the sequence of terms that makes up the phrase is a subsequence of the
integers that make up the document&amp;rsquo;s prime factorization.&lt;/p&gt;

&lt;p&gt;Suppose, for example, that we&amp;rsquo;re looking for the phrase &amp;ldquo;2 5&amp;rdquo;. This is
equivalent to asking whether each document&amp;rsquo;s prime factorization sequence
contains the sequence &lt;code&gt;[2,5]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Consider the documents above.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document 100 is a match because &lt;code&gt;[2,5]&lt;/code&gt;
is a subsequence of &lt;code&gt;[2,2,5,5]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Document 770 is also a match because &lt;code&gt;[2,5]&lt;/code&gt;
is a subsequence of &lt;code&gt;[2,5,7,11]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Document 1223, on the other hand, is not a match because &lt;code&gt;[2,5]&lt;/code&gt; is
not a subsequence of &lt;code&gt;[1223]&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;implementation-details&#34;&gt;Implementation Details&lt;/h3&gt;

&lt;p&gt;The implementation turned out to be surprisingly simple &amp;ndash;
just under 200 lines of code in &lt;code&gt;PrimeFactorsDocument.cpp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Our first step was to create mock documents.
The function &lt;code&gt;CreatePrimeFactorsDocument&lt;/code&gt; just creates an off-the-shelf
&lt;code&gt;IDocument&lt;/code&gt; and then fills it with the prime factors of its &lt;code&gt;DocId&lt;/code&gt; using
calls to &lt;code&gt;IDocument::AddTerm()&lt;/code&gt;. Here&amp;rsquo;s the relevant fragment of code:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;for (size_t i = 0; i &amp;lt; Primes::c_primesBelow10000.size(); ++i)
{
    size_t p = Primes::c_primesBelow10000[i];
    if (p &amp;gt; docId)
    {
        break;
    }
    else
    {
        while ((docId % p) == 0)
        {
            auto const &amp;amp; term = Primes::c_primesBelow10000Text[i];
            document-&amp;gt;AddTerm(term.c_str());
            docId /= p;
            sourceByteSize += (1 + term.size());
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next we configured a TermTable to assign a private row to
each term corresponding to a prime number. We only included
mappings for primes up to the largest DocId. This TermTable
gives us the desired non-probabilistic behavior for Terms
corresponding to primes not exceeding the largest DocId.&lt;/p&gt;

&lt;p&gt;Terms corresponding to larger primes or composite numbers will be
implicitly mapped. The implicit rows, however, will contain only
zeros because none of the documents contain terms corresponding
to large primes or composite numbers.&lt;/p&gt;

&lt;p&gt;The consequence is that queries involving larger primes and composites
will never show probabilistic behavior and therefore never yield
false positives.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s an excerpt from the function &lt;code&gt;CreatePrimeFactorsTermTable()&lt;/code&gt;
which creates an &lt;code&gt;ITermTable&lt;/code&gt; and then provisions it with explicit
rows for the terms &amp;ldquo;0&amp;rdquo;, &amp;ldquo;1&amp;rdquo;, and each of the primes smaller than 10,000:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;for (size_t i = 0; i &amp;lt; Primes::c_primesBelow10000.size(); ++i)
{
    size_t p = Primes::c_primesBelow10000[i];
    if (p &amp;gt; maxDocId)
    {
        break;
    }
    else
    {
        auto text = Primes::c_primesBelow10000Text[i];

        termTable-&amp;gt;OpenTerm();
        termTable-&amp;gt;AddRowId(RowId(rank, explicitRowCount0++));
        termTable-&amp;gt;CloseTerm(Term::ComputeRawHash(text.c_str()));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Our last step was to create a mock index.
The function &lt;code&gt;CreatePrimeFactorsIndex()&lt;/code&gt; just creates an &lt;code&gt;ISimpleIndex&lt;/code&gt;
with the prime factors term table replacing the default. Then a simple
for-loop fills the index:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;for (DocId docId = 0; docId &amp;lt;= maxDocId; ++docId)
{
    auto document =
        Factories::CreatePrimeFactorsDocument(
            index-&amp;gt;GetConfiguration(),
            docId,
            maxDocId,
            streamId);
    index-&amp;gt;GetIngestor().Add(docId, *document);
}
&lt;/code&gt;&lt;/pre&gt;
</description>
    </item>
    
    <item>
      <title>Sample Data</title>
      <link>http://bitfunnel.org/sample-data/</link>
      <pubDate>Tue, 20 Sep 2016 15:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/sample-data/</guid>
      <description>

&lt;p&gt;I&amp;rsquo;ve been trying to make it really easy to get started with BitFunnel, but we
still have a ways to go. From the beginning we put a lot of effort into ensuring
our code would build and run on Linux, OSX, and Windows, and we set up CI on
&lt;a href=&#34;https://www.appveyor.com/&#34;&gt;Appveyor&lt;/a&gt; and &lt;a href=&#34;https://travis-ci.org/&#34;&gt;Travis&lt;/a&gt;
to help us quickly spot breaks on any OS. This has
kept the build in good shape, but it seems that the system is still hard to configure
and run, especially for those who don&amp;rsquo;t use it on a day-to-day basis.&lt;/p&gt;

&lt;p&gt;After some brainstorming, we decided it would be helpful to make a sample corpus
with all necessary configuration files available for download so that new users and
contributors could get the system up and running with just a few steps.&lt;/p&gt;

&lt;p&gt;The sample corpus consists of about 17k pages from the English version of Wikipedia.
This small slice of Wikipedia is manageable, yet large enough to demonstrate interesting
aspects of BitFunnel. Here are the download links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/enwiki-20160305-pages-articles1.xml-p000000010p000030302&#34;&gt;Wikipedia database dump file. (529MB)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz&#34;&gt;BitFunnel chunk and configuration files. (189MB)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first file is for reference and is not needed unless you want to reprocess the entire corpus
yourself from scratch.&lt;/p&gt;

&lt;p&gt;In most cases it suffices to download the second link which contains the files
necessary to run the BitFunnel index Read-Eval-Print-Loop (REPL). This download contains&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wikiextractor text output (265MB uncompressed)&lt;/li&gt;
&lt;li&gt;the corresponding BitFunnel chunk files (208MB uncompressed)&lt;/li&gt;
&lt;li&gt;corpus statistics and configuration files (51MB uncompressed)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;downloading-and-extracting-chunk-files&#34;&gt;Downloading and Extracting Chunk Files&lt;/h3&gt;

&lt;p&gt;You can download these files directly from your browser, or on Linux or OSX use the &lt;code&gt;wget&lt;/code&gt; and &lt;code&gt;tar&lt;/code&gt; commands.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% cd /tmp

% wget https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz
--2016-09-18 21:15:11--  https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz
Resolving bitfunnel.blob.core.windows.net... 13.93.168.88
Connecting to bitfunnel.blob.core.windows.net|13.93.168.88|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 198148563 (189M) [application/octet-stream]
Saving to: &#39;small-corpus.tar.gz&#39;

small-corpus.tar.gz          100%[==============================================&amp;gt;] 188.97M  1.67MB/s   in 1m 49s

2016-09-18 21:17:00 (1.74 MB/s) - &#39;small-corpus.tar.gz&#39; saved [198148563/198148563]

% tar -xvzf small-corpus.tar.gz
x chunks/
x chunks/AA/
x chunks/AA/wiki_00
x chunks/AA/wiki_01
x chunks/AA/wiki_02
...
x chunks/AC/wiki_56
x chunks/AC/wiki_57
x text/
x text/AA/
x text/AA/wiki_00
x text/AA/wiki_01
x text/AA/wiki_02
...
x text/AC/wiki_56
x text/AC/wiki_57

% ls -l wikipedia
total 0
drwxr-xr-x  5 michaelhopcroft  wheel  170 Jul 29 21:40 chunks
drwxr-xr-x  8 michaelhopcroft  wheel  272 Sep 18 16:09 config
drwxr-xr-x  5 michaelhopcroft  wheel  170 Jul 29 21:34 text
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;running-the-repl&#34;&gt;Running the REPL&lt;/h3&gt;

&lt;p&gt;Once the files have been downloaded and uncompressed, we&amp;rsquo;re ready to run the REPL.
The REPL is a subcommand of the BitFunnel executable which is located at &lt;code&gt;tools/BitFunnel/src&lt;/code&gt;
in the source tree. In the transcript below, I have set my path to point to the
BitFunnel executable. The only required parameter is the path to the config directory
that was created in the previous step.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% BitFunnel repl /tmp/wikipedia/config
Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.

directory = &amp;quot;/tmp/wikipedia/config&amp;quot;
gram size = 1

Starting index ...
Blocksize: 11005320
Index started successfully.

Type &amp;quot;help&amp;quot; to get started.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once the REPL console has started, we will load a single chunk file.
We use the &lt;code&gt;cache chunk&lt;/code&gt; command to ingest the documents from a single
chunk file. The &lt;code&gt;cache chunk&lt;/code&gt; command ingests documents like the
&lt;code&gt;load chunk&lt;/code&gt; command, but it also caches the IDocuments to assist in
verifying the correctness of the BitFunnel matching engine.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;0: cache chunk /tmp/wikipedia/chunks/AA/wiki_00
Ingesting chunk file &amp;quot;/tmp/wikipedia/chunks/AA/wiki_00&amp;quot;
Caching IDocuments for query verification.
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_00
Ingestion complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At this point, we&amp;rsquo;ve ingested &lt;code&gt;/tmp/wikipedia/chunks/AA/wiki_00&lt;/code&gt;
which contains the following 41 Wikipedia pages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;12: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=12&#34;&gt;Anarchism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;25: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=25&#34;&gt;Autism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;39: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=39&#34;&gt;Albedo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;128: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=128&#34;&gt;Talk:Atlas Shrugged&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;290: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=290&#34;&gt;A&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;295: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=295&#34;&gt;User:AnonymousCoward&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;303: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=303&#34;&gt;Alabama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;305: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=305&#34;&gt;Achilles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;307: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=307&#34;&gt;Abraham Lincoln&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;308: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=308&#34;&gt;Aristotle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;309: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=309&#34;&gt;An American in Paris&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;316: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=316&#34;&gt;Academy Award for Best Production Design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;324: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=324&#34;&gt;Academy Awards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;330: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=330&#34;&gt;Actrius&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;332: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=332&#34;&gt;Animalia (book)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;334: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=334&#34;&gt;International Atomic Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;336: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=336&#34;&gt;Altruism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;339: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=339&#34;&gt;Ayn Rand&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;340: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=340&#34;&gt;Alain Connes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;344: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=344&#34;&gt;Allan Dwan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;354: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=354&#34;&gt;Talk:Algeria&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;358: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=358&#34;&gt;Algeria&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;359: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=359&#34;&gt;List of Atlas Shrugged characters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;569: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=569&#34;&gt;Anthropology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;572: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=572&#34;&gt;Agricultural science&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;573: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=573&#34;&gt;Alchemy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;579: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=579&#34;&gt;Alien&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;580: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=580&#34;&gt;Astronomer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;582: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=582&#34;&gt;Talk:Altruism/Archive 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;586: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=586&#34;&gt;ASCII&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;590: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=590&#34;&gt;Austin (disambiguation)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;593: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=593&#34;&gt;Animation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;594: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=594&#34;&gt;Apollo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;595: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=595&#34;&gt;Andre Agassi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;597: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=597&#34;&gt;Austroasiatic languages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;599: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=599&#34;&gt;Afroasiatic languages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;600: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=600&#34;&gt;Andorra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;612: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=612&#34;&gt;Arithmetic mean&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;615: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=615&#34;&gt;American Football Conference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;620: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=620&#34;&gt;Animal Farm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;621: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=621&#34;&gt;Amphibian&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Handy tip: if you&amp;rsquo;d like to know which pages are in a chunk file,
run &lt;code&gt;grep&lt;/code&gt; on the corresponding wikiextractor file. For example, if you
are interested in knowing the contents of &lt;code&gt;/tmp/wikipedia/chunks/AA/wiki_00&lt;/code&gt;,
run &lt;code&gt;grep&lt;/code&gt; on &lt;code&gt;/tmp/wikipedia/text/AA/wiki_00&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;grep &amp;quot;&amp;lt;doc id=&amp;quot; /tmp/wikipedia/text/AA/wiki_00
&amp;lt;doc id=&amp;quot;12&amp;quot; url=&amp;quot;https://en.wikipedia.org/wiki?curid=12&amp;quot; title=&amp;quot;Anarchism&amp;quot;&amp;gt;
&amp;lt;doc id=&amp;quot;25&amp;quot; url=&amp;quot;https://en.wikipedia.org/wiki?curid=25&amp;quot; title=&amp;quot;Autism&amp;quot;&amp;gt;
&amp;lt;doc id=&amp;quot;39&amp;quot; url=&amp;quot;https://en.wikipedia.org/wiki?curid=39&amp;quot; title=&amp;quot;Albedo&amp;quot;&amp;gt;
...
&amp;lt;doc id=&amp;quot;620&amp;quot; url=&amp;quot;https://en.wikipedia.org/wiki?curid=620&amp;quot; title=&amp;quot;Animal Farm&amp;quot;&amp;gt;
&amp;lt;doc id=&amp;quot;621&amp;quot; url=&amp;quot;https://en.wikipedia.org/wiki?curid=621&amp;quot; title=&amp;quot;Amphibian&amp;quot;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s try running a query using the &lt;code&gt;verify one&lt;/code&gt; command to verify an
expression. Today this command runs a very slow verification query engine on the
IDocuments cached earlier by the &lt;code&gt;cache chunk&lt;/code&gt; command. In the future, &lt;code&gt;verify&lt;/code&gt;
will run the BitFunnel query engine and compare its output with the verification
query engine.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;1: verify one anarchy
Processing query &amp;quot; anarchy&amp;quot;
  DocId(307)
  DocId(12)
2 match(es) out of 41 documents.

2: verify one frog
Processing query &amp;quot; frog&amp;quot;
  DocId(621)
1 match(es) out of 41 documents.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As we can see, documents 12 and 307 contain the word, &amp;ldquo;anarchy&amp;rdquo; and document 621
contains the word, &amp;ldquo;frog&amp;rdquo;. Try running &lt;code&gt;verify one frog|anarchy&lt;/code&gt; and &lt;code&gt;verify one
frog anarchy&lt;/code&gt; (&lt;code&gt;AND&lt;/code&gt; is implicit if &lt;code&gt;OR&lt;/code&gt; isn&amp;rsquo;t specified). Did you get what you
expected?&lt;/p&gt;

&lt;p&gt;We don&amp;rsquo;t have the BitFunnel query pipeline ported yet, but you can examine the
rows associated with various terms using the &lt;code&gt;show rows&lt;/code&gt; command.  This command
lists each of the RowIds associated with a term, followed by the bits for the
first 64 documents. The document ids are printed vertically above each column of
bits.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;3: show rows anarchy
Term(&amp;quot;anarchy&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(3, 11507): 10000000110000001000010000000000000000000
  RowId(3, 11508): 10000000110000001000010000000000000000000
  RowId(3, 11509): 10000000110000001000010000000000000000000
  RowId(0,  5354): 11000001110000000000010001001000000000010

4: show rows frog
Term(&amp;quot;frog&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(3, 19624): 00000010000010000100000000000000000000001
  RowId(3, 19625): 00000010000010000100000000000000000000001
  RowId(3, 19626): 00000010000010000000001000000000000000001
  RowId(3, 19627): 00000010000010000000001000000000000000001
  RowId(0,  5465): 10000011100010000000001100001001011000001
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If we look at the output of &lt;code&gt;show rows anarchy&lt;/code&gt;, we see that the first
column, which corresponds to document 012, is completely filled with
1s, indicating a match. The second column, which corresponds to document
025, has some zeros, so it is not a match.&lt;/p&gt;

&lt;p&gt;There are also some false positives visible in the data. We know from running &lt;code&gt;verify one anarchy&lt;/code&gt;
that only documents 012 and 307 should match, but the query matrix above shows
all 1s in the columns for documents 308 and 358. Once we have finished porting the
document ingestion and query processing pipelines, we will turn our attention
to configuration changes that drive down the false positive rate.&lt;/p&gt;

&lt;p&gt;The goal of this post is to explain how to obtain and use the data files,
so the examples are minimal. To learn more about the BitFunnel repl, statistics builder,
and term table builder, see &lt;a href=&#34;http://bitfunnel.org/index-build-tools&#34;&gt;Index Build Tools&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>BitFunnel performance estimation</title>
      <link>http://bitfunnel.org/strangeloop/</link>
      <pubDate>Fri, 16 Sep 2016 10:10:54 -0700</pubDate>
      
      <guid>http://bitfunnel.org/strangeloop/</guid>
      <description>

&lt;style&gt;
.slide {border: 1px solid;}
&lt;/style&gt;

&lt;p&gt;&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-0.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Hi! I&amp;rsquo;m going to talk about two things today.&lt;/p&gt;

&lt;p&gt;First, I&amp;rsquo;m going to talk about one way to think about performance. That is, one way you can reason about performance.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-1.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Second, I&amp;rsquo;m going to talk about search. We&amp;rsquo;re going to look at search as a case study because, when talking about performance, it&amp;rsquo;s often useful to have something concrete to reason about. We could use any problem domain. However, I think that the algorithm we&amp;rsquo;re going to discuss today is particularly interesting because we use it in Bing, despite the fact that it&amp;rsquo;s in a class of algorithms that&amp;rsquo;s been considered obsolete for almost 20 years (at least as core search engine technology).&lt;/p&gt;

&lt;p&gt;&lt;small&gt;
&lt;em&gt;In case it&amp;rsquo;s not obvious, this is a pseudo-transcript of a talk given at StrangeLoop 2016. &lt;a href=&#34;https://www.youtube.com/watch?v=80LKF2qph6I&#34;&gt;See this link&lt;/a&gt; if you&amp;rsquo;d rather watch the video. I wrote this up before watching my talk, so the text probably doesn&amp;rsquo;t match the video exactly.&lt;/em&gt;
&lt;/small&gt;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-2.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
BTW, when I say performance, I don&amp;rsquo;t just mean speed (latency), or speed (throughput). We could also be talking about other aspects of performance like power. Although our example is going to be throughput oriented, the same style of reasoning works for other types of performance.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-3.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Why do we care about performance? One answer is that we usually don&amp;rsquo;t care because most applications are fast enough. That&amp;rsquo;s true! Most applications &lt;i&gt;are&lt;/i&gt; fast enough. Spending unnecessary time thinking about performance is often an error.&lt;/p&gt;

&lt;p&gt;However, when applications get larger, most applications become performance sensitive! This happens both because making a large application faster reduces its cost, and also because making a large application faster can increase its revenue. The second part isn&amp;rsquo;t intuitive to many people, but we&amp;rsquo;ll talk more about that later.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-4.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
How do we think about performance? It turns out that we can often reason about performance with simple arithmetic. For many applications, even applications that take years to build, it&amp;rsquo;s possible to estimate the performance before building the system with simple back-of-the-envelope calculations.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-5.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Here&amp;rsquo;s a popular tweet. It has 500 retweets! &amp;ldquo;Working code attracts people who want to code. Design documents attract people who want to talk.&amp;rdquo;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-6.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
I get it. Coding feels like real work. Meetings, writing docs, creating slide decks, and giving talks don&amp;rsquo;t feel like work.&lt;/p&gt;

&lt;p&gt;But when I look at outcomes, well, I often see two applications designed to do the same thing that were implemented with similar resources where one application is 10x or 100x faster than the other. And when I ask around and find out why, I almost inevitably find that the team that wrote the faster application spent a lot of time on design. I tend to work on applications that take a year or two to build, so let&amp;rsquo;s say we&amp;rsquo;re talking about something that took a year and a half. For a project of that duration, it&amp;rsquo;s not uncommon to spend months in the design phase before anyone writes any code that&amp;rsquo;s intended to be production code. And when I look at the slower application, the team that created the slower application usually had the idea that &amp;ldquo;meetings and whiteboarding aren&amp;rsquo;t real work&amp;rdquo; and jumped straight into coding.&lt;/p&gt;

&lt;p&gt;The problem is that if you have something that takes a year and a half to build, if you build it, measure the performance, and then decide to iterate, your iteration time is a year and a half, whereas on the whiteboard, it can be hours or days. Moreover, if you build a system without reasoning about what the performance should be, when you build the system and measure its performance, you&amp;rsquo;ll only know how fast it runs, not how fast it should run, so you won&amp;rsquo;t even know that you should iterate.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s common to hear advice like &amp;ldquo;don&amp;rsquo;t optimize early, just profile and then optimize the important parts after it works&amp;rdquo;. That&amp;rsquo;s fine advice for non-performance-critical systems, but it&amp;rsquo;s very bad advice for performance-critical systems, where you may find that you have to re-do the entire architecture to get as much performance out of the system as your machine can give you.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-7.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Before we talk about performance, let&amp;rsquo;s talk about scale. Because people often mean different things when they talk about scale, I&amp;rsquo;m going to be very concrete here.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-8.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Since we&amp;rsquo;re talking about search, let&amp;rsquo;s imagine a few representative corpus sizes we might want to search: ten thousand, ten million, and ten billion documents.&lt;/p&gt;

&lt;p&gt;And let&amp;rsquo;s assume that each document is 5kB. If we&amp;rsquo;re talking about the web, that&amp;rsquo;s a bit too small, and if we&amp;rsquo;re talking about email, that&amp;rsquo;s a bit too big, but you can scale this number to whatever corpus size you have.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-9.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
BTW, the specific problem we&amp;rsquo;re going to look at is: we have a corpus of documents that we want to be able to search, and we&amp;rsquo;re going to handle &lt;code&gt;AND&lt;/code&gt; queries.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-10.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
That is, queries of the form, I want &lt;em&gt;this&lt;/em&gt; word, and &lt;em&gt;this&lt;/em&gt; word, and &lt;em&gt;this&lt;/em&gt; word. For example, I want the words &lt;em&gt;large&lt;/em&gt; &lt;code&gt;AND&lt;/code&gt; &lt;em&gt;yellow&lt;/em&gt; &lt;code&gt;AND&lt;/code&gt; &lt;em&gt;dog&lt;/em&gt;. The systems we&amp;rsquo;ll look at today can handle &lt;code&gt;OR&lt;/code&gt;s and &lt;code&gt;NOT&lt;/code&gt;s, but those aren&amp;rsquo;t fundamentally different and talking about them will add complexity, so we&amp;rsquo;ll only look at &lt;code&gt;AND&lt;/code&gt; queries.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-11.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
First, let&amp;rsquo;s consider searching ten thousand documents at 5kB per doc.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-12.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If you want to get an idea of how big this is, you can think of this as email search (for one person) or forum search (for one forum) in a typical case.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-13.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
A &lt;code&gt;k&lt;/code&gt; times a &lt;code&gt;k&lt;/code&gt; is a million, and five times ten is fifty, so &lt;code&gt;5kB&lt;/code&gt; times ten thousand is &lt;code&gt;50MB&lt;/code&gt;.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-14.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&lt;code&gt;50MB&lt;/code&gt; is really small!
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-15.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Today, for $50, you can buy a phone off amazon that has &lt;code&gt;1GB&lt;/code&gt; of RAM. &lt;code&gt;50MB&lt;/code&gt; will easily fit in RAM, even on a low-end phone.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-16.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If our data set fits in RAM and we have &lt;code&gt;50MB&lt;/code&gt;, we can try the most naive thing possible and basically just grep through our data. If you want something more concrete, you can think of this as looping over all documents, and for each document, looping over all terms.&lt;/p&gt;

&lt;p&gt;Since we only need to handle &lt;code&gt;AND&lt;/code&gt; queries, we can keep track of all the terms we want, and if a document has all of the terms we want, we can add that to our list of matches.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
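&lt;p&gt;Here is a toy Python sketch of that naive approach (the documents and the whitespace tokenizer are made up for illustration):&lt;/p&gt;

```python
# Naive "grep" search: scan every document; a document matches an AND
# query when it contains every query term.
def and_query(documents, terms):
    terms = [t.lower() for t in terms]
    matches = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())   # crude whitespace tokenizer
        if all(t in words for t in terms):  # AND: every term must appear
            matches.append(doc_id)
    return matches

docs = {
    1: "a large yellow dog slept",
    2: "a small yellow cat",
    3: "the large dog was yellow",
}
print(and_query(docs, ["large", "yellow", "dog"]))  # → [1, 3]
```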
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-17.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Ok. So, for ten thousand documents, the most naive thing we can think of works. What about ten million documents?
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-18.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If you want to get a feel for how big ten million documents is, you can think of it as roughly wikipedia-sized. Today, English language wikipedia has about five million documents.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-19.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&lt;code&gt;5kB&lt;/code&gt; times ten million is &lt;code&gt;50GB&lt;/code&gt;. This is really close to wikipedia&amp;rsquo;s size &amp;ndash; today, wikipedia is a bit over &lt;code&gt;50GB&lt;/code&gt; (uncompressed articles in XML, no talk, no history).
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-20.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
We can&amp;rsquo;t fit that in RAM on a phone, and we&amp;rsquo;d need a pretty weird laptop to fit that in RAM on a laptop, but we can easily fit that in RAM on a low-budget server. Today, we can buy a $2000 server that has 128GB of RAM.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-21.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
What happens when we try to run our naive grep-like algorithm? Well, our cheap server can get &lt;code&gt;25GB/s&lt;/code&gt; of bandwidth&amp;hellip;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-22.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;hellip; and we have 50GB of data. That means that it takes two seconds to do one search query!&lt;/p&gt;

&lt;p&gt;And while we&amp;rsquo;re doing a query, we&amp;rsquo;re using all the bandwidth on the machine, so we can&amp;rsquo;t expect to do anything else on the machine while queries are running, including other queries. This implies that it takes two seconds to do a query, or that we get one-half a query per second, or &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;2&lt;/sub&gt; QPS.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
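&lt;p&gt;The arithmetic behind those numbers, written out (decimal units, same assumptions as the slides):&lt;/p&gt;

```python
# Back-of-the-envelope scan-time estimate from the talk.
doc_size = 5 * 10**3        # 5kB per document
num_docs = 10 * 10**6       # ten million documents
data = doc_size * num_docs  # 50GB of text to scan per query

bandwidth = 25 * 10**9      # 25GB/s of memory bandwidth

latency = data / bandwidth  # seconds per full scan
qps = 1 / latency           # queries per second
print(latency, qps)         # 2.0 seconds per query, 0.5 QPS
```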
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-23.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Is that ok? Is two seconds of latency ok? It depends.&lt;/p&gt;

&lt;p&gt;For many applications, that&amp;rsquo;s totally fine! I know a lot of devs who have an internal search tool (often over things like logs) that takes a second or two to return results. They&amp;rsquo;d like to get results back faster, but given the cost/benefit tradeoff, it&amp;rsquo;s not worth optimizing things more.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-24.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
How about &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;2&lt;/sub&gt; QPS? It depends.&lt;/p&gt;

&lt;p&gt;As with latency, a lot of devs I know have a search service that&amp;rsquo;s only used internally. If you have 10 or 20 devs typing in queries at keyboards, it&amp;rsquo;s pretty unlikely that they&amp;rsquo;ll exceed &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;2&lt;/sub&gt; QPS with manual queries, so there&amp;rsquo;s no point in creating a system that can handle more throughput.&lt;/p&gt;

&lt;p&gt;Our naive grep-like algorithm is totally fine for many search problems!
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-25.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
However, as services get larger, two seconds of latency can be a problem.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-26.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If we look at studies on latency and revenue, we can see a roughly linear relationship between latency and revenue over a pretty wide range of latencies.&lt;/p&gt;

&lt;p&gt;Amazon found that every 100ms of latency cost them more than 1% of revenue. Google once found that adding 500ms of latency, or half a second, cost them 20% of their users.&lt;/p&gt;

&lt;p&gt;This isn&amp;rsquo;t only true of large companies &amp;ndash; when Mobify looked at this, they also found that 100ms of latency cost them more than 1% of revenue. For them, 1% was only $300k or so. But even though I say &amp;ldquo;only&amp;rdquo;, that&amp;rsquo;s enough to pay a junior developer for a year. Latency can really matter!
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-27.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Here&amp;rsquo;s a query from some search engine. The result came back in a little over half a second. That includes the time it takes to register input on the local computer, figure out what to do with the input, send it across the internet, go into some set of servers somewhere, do some stuff, go back across the internet, come back into the local computer, do some more stuff, and then render the results.&lt;/p&gt;

&lt;p&gt;That&amp;rsquo;s a lot of stuff! If you do budgeting for a service like this and you want queries to have a half-second end-user round-trip budget, you&amp;rsquo;ll probably only leave tens of milliseconds to handle document matching on the machines that receive queries and tell you which documents matched the queries. Two seconds of latency is definitely not ok in that case.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-28.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Furthermore, for a service like Bing or Google, provisioning for &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;2&lt;/sub&gt; QPS is somewhat insufficient.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-29.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
What can we do? Maybe we can try using an index instead of grepping through all documents.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-30.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If we use an index, we can get widely varying performance characteristics. Asking what the performance is like if we &amp;ldquo;use an index&amp;rdquo; is like asking what the performance is like if we &amp;ldquo;use an algorithm&amp;rdquo;. It depends on the algorithm!&lt;/p&gt;

&lt;p&gt;Today, we&amp;rsquo;ll talk about how to get performance in the range of thousands to tens of thousands of queries per second, but first&amp;hellip;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-31.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;hellip; let&amp;rsquo;s finish our discussion about scale and talk about how to handle ten billion documents.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve said that we can, using some kind of index, serve ten million documents from one machine with performance that we find to be acceptable. So how about ten billion?
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-32.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
With ten billion documents at 5kB apiece, we&amp;rsquo;re looking at 50TB. While it&amp;rsquo;s possible to get a single machine with 50TB of RAM, this approach isn&amp;rsquo;t cost effective for most problems, so we&amp;rsquo;ll look at using multiple cheap commodity machines instead of one big machine.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-33.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Search is a relatively easy problem to scale horizontally; that is, it&amp;rsquo;s relatively easy to split a search index across multiple machines. One way to do this (and this isn&amp;rsquo;t the only possible way) is to put different documents on different machines. Queries then go to all machines, and the result is just the union of the results from all machines.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
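&lt;p&gt;A toy sketch of that document-sharded, scatter-gather scheme (the placement policy and the per-shard matcher here are made up for illustration):&lt;/p&gt;

```python
# Spread documents across shards; send the query to every shard and take
# the union of the per-shard results.
def build_shards(docs, num_shards):
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[doc_id % num_shards][doc_id] = text  # simple placement policy
    return shards

def shard_query(shard, terms):
    # Each shard answers the AND query over its own documents.
    return {d for d, text in shard.items()
            if all(t in text.split() for t in terms)}

def search(shards, terms):
    results = set()
    for shard in shards:  # a real system fans this out in parallel
        results |= shard_query(shard, terms)
    return results

docs = {0: "large yellow dog", 1: "yellow cat",
        2: "large dog", 3: "large yellow dog show"}
shards = build_shards(docs, 2)
print(sorted(search(shards, ["large", "yellow", "dog"])))  # → [0, 3]
```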
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-34.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Since we have ten billion documents, and we&amp;rsquo;re assuming that we can serve ten million documents on a machine, if we split up the index we&amp;rsquo;ll have a thousand machines.&lt;/p&gt;

&lt;p&gt;That&amp;rsquo;s ok, but if we have a cluster of a thousand machines and the cluster is in Redmond, and we have a customer in Europe, that could easily add 300ms of latency to the query. We&amp;rsquo;ve gone through all the effort of designing an index that can return a query in 10ms, and then we have customers that lose 300ms from having their queries go back and forth over the internet.&lt;/p&gt;

&lt;p&gt;Instead of having a single cluster, we can use multiple clusters all over the world to reduce that problem.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-35.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Say we use ten clusters. Then we have ten thousand machines.&lt;/p&gt;

&lt;p&gt;With ten thousand machines (or even with a thousand machines), we have another problem: given the failure rate of commodity hardware, with ten thousand machines, machines will be failing all the time. At any given time, in any given cluster, some machines will be down. If, for example, the machine that&amp;rsquo;s indexing cnn.com goes down and users who want to query that cluster can&amp;rsquo;t get results from CNN, that&amp;rsquo;s bad.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-36.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
In order to avoid the loss of sites from failures, we might triple the number of machines for redundancy, which puts us at thirty thousand machines.&lt;/p&gt;

&lt;p&gt;With thirty thousand machines, one problem we have is that we now have a distributed system. That&amp;rsquo;s a super interesting set of problems, but it&amp;rsquo;s beyond the scope of this talk.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-37.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Another problem is that we now have a service that costs a non-trivial amount of money to run. If a machine costs a thousand dollars per year (amortized cost, including the cost of building out datacenters, buying machines, and running the machines), that puts us at thirty-million dollars a year. By the way, a thousand dollars a year is considered to be a relatively low total amortized cost. Even if we can hit that low number, we&amp;rsquo;re still looking at thirty-million dollars a year.&lt;/p&gt;

&lt;p&gt;At thirty-million a year, if we can double the performance and halve the number of machines we need, that saves us fifteen-million a year. In fact, if we can even shave off one percent on the running time of a query, that would save three-hundred thousand dollars a year, saving enough money to pay a junior developer for an entire year.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
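&lt;p&gt;The cost arithmetic, written out:&lt;/p&gt;

```python
# Fleet cost estimate from the talk.
machines = 30_000
dollars_per_machine_year = 1_000  # amortized total cost per machine
fleet_cost = machines * dollars_per_machine_year  # $30M/year

halve_the_fleet = fleet_cost // 2    # double perf: save $15M/year
one_percent = fleet_cost // 100      # shave 1% off query time: ~$300k/year
print(fleet_cost, halve_the_fleet, one_percent)
```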
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-38.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Conventional wisdom often says that &amp;ldquo;machine time is cheaper than developer time, which means that you should use the most productive tools possible and not worry about performance&amp;rdquo;. That&amp;rsquo;s absolutely true for many applications. For example, that&amp;rsquo;s almost certainly true for any single-server rails app. But once you get to the point where you have thousands of machines per service, that logic is flipped on its head because machine time is more expensive than developer time.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-39.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Now that we&amp;rsquo;ve framed the discussion by talking about scale, let&amp;rsquo;s talk about search algorithms.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-40.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
The problem we&amp;rsquo;re looking at is: given a bunch of documents, how do we handle &lt;code&gt;AND&lt;/code&gt; queries?
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-41.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
The standard algorithm that people use for search indices is a posting list.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-42.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
A posting list is basically what a layperson would call an index.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s an index from the 1600s. If you look at the back of a book today, you&amp;rsquo;ll see the same thing: there&amp;rsquo;s a list of terms, and next to each term there&amp;rsquo;s a list of pages that term appears on.&lt;/p&gt;

&lt;p&gt;Computers don&amp;rsquo;t have pages in the same sense; if you want to imagine a simple version of a posting list, you can think of&amp;hellip;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-43.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;hellip;a hash map from terms to linked lists of document ids. That is, a hash map where the key is a term and the value is a list of document ids.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
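&lt;p&gt;A minimal posting-list sketch in Python (a set stands in for the linked list; the documents are made up for illustration):&lt;/p&gt;

```python
from collections import defaultdict

# Build the posting list: a map from term to the set of documents
# containing that term.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

# An AND query intersects the postings of its terms.
def and_query(index, terms):
    postings = [index[t] for t in terms]
    return sorted(set.intersection(*postings))

docs = {1: "large yellow dog", 2: "small yellow cat", 3: "large dog yellow"}
index = build_index(docs)
print(and_query(index, ["large", "yellow", "dog"]))  # → [1, 3]
```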
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-44.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
That&amp;rsquo;s one way to do it, and it&amp;rsquo;s standard. Another thing we could try to do is use Bloom filters.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-45.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
We do this in Bing in a system called BitFunnel. But before we can describe BitFunnel, we need to talk about how Bloom filters work.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-46.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
And before we talk about how Bloom filters work, let&amp;rsquo;s consider a more naive solution we might construct.&lt;/p&gt;

&lt;p&gt;One thing we might try would be to use something called an incidence matrix, that is, a 2d matrix where one dimension of the matrix is every single term we know about, and the other dimension is every single document we know about. Each entry in the matrix is a &lt;code&gt;1&lt;/code&gt; if the term is in the document, and it&amp;rsquo;s a &lt;code&gt;0&lt;/code&gt; if the term isn&amp;rsquo;t in the document.&lt;/p&gt;

&lt;p&gt;What will the performance of that be?
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
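&lt;p&gt;Here is what that incidence matrix looks like as a toy Python sketch (lists of 0/1 stand in for packed bit rows; the documents are made up for illustration):&lt;/p&gt;

```python
# One row of bits per term; one column per document. An AND query is a
# bitwise AND of the query terms' rows.
docs = ["large yellow dog", "small yellow cat",
        "large dog", "large yellow dog show"]

terms = sorted({t for d in docs for t in d.split()})
matrix = {t: [1 if t in d.split() else 0 for d in docs] for t in terms}

def and_query(matrix, num_docs, query_terms):
    result = [1] * num_docs
    for t in query_terms:
        result = [a & b for a, b in zip(result, matrix[t])]
    return [i for i, bit in enumerate(result) if bit]  # surviving columns

print(and_query(matrix, len(docs), ["large", "yellow", "dog"]))  # → [0, 3]
```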
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-47.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Well, first, how many terms are there? How many terms do you think are on the internet? And let&amp;rsquo;s say we shard the internet a zillion ways and serve tens of millions of documents per server. How many unique terms do we have per server?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;pause&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;someone shouts ten million&lt;/em&gt;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-48.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Turns out, when we do this, we can see tens of billions of terms per shard. This is often surprising to people. I&amp;rsquo;ve asked a lot of people this question, and people often guess that there are millions or billions of unique terms on the entire internet. But if you pick a random number under ten billion and search for it, you&amp;rsquo;re pretty likely to find it on the internet! So, there are probably more than ten billion terms on the internet!&lt;/p&gt;

&lt;p&gt;In fact, if you limit the search to just github, you can find a single document with about fifty-million primes! And if you look at the whole internet, you can find a site with all primes under one trillion, which is over thirty-billion primes! If that site lands in a single shard, that shard is going to have at least thirty-billion unique terms. Turns out, a lot of people put long mathematical sequences online.&lt;/p&gt;

&lt;p&gt;And in addition to numbers, there&amp;rsquo;s stuff that&amp;rsquo;s often designed to be unique, like catalog numbers, ID numbers, error codes, and GUIDs. Plus DNA! Really, DNA. Ok, DNA isn&amp;rsquo;t designed to be unique, but if you split it up into chunks of arbitrary numbers of characters, there&amp;rsquo;s a high probability that any N character chunk for N &amp;gt; 16 is unique.&lt;/p&gt;

&lt;p&gt;There&amp;rsquo;s a lot of this stuff! One question you might ask is, do you need to index that stuff? Does anyone really search for &lt;code&gt;GTGACCTTGGGCAAGTTACTTAACCTCTCTGTGCCTCAGTTTCCTCATCTGTAAAATGGGGATAATA&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;It turns out, that when you ask people to evaluate a search engine, many of them will try to imagine the weirdest queries they can think of, try those, and then choose the search engine that handles those queries better. It doesn&amp;rsquo;t matter that they never do those queries normally. Some real people actually evaluate search engines that way. As a result, we have to index all of this weird stuff if we want people to use our search engine.&lt;/p&gt;

&lt;p&gt;If we have tens of billions of terms, say we have thirty billion terms, how large is our incidence matrix? Even if we use a bit vector, one single document will take up thirty billion bits divided by eight, or 3.75GB. And that&amp;rsquo;s just one document!
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-49.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
How can we shrink that? Well, since most documents don&amp;rsquo;t contain most terms, we can hash terms down to a smaller space. Instead of reserving one slot for each unique term, we only need as many slots as we have terms in a document (times a constant factor which is necessary for bloom filter operation).
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-50.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
That&amp;rsquo;s basically what a bloom filter is! For the purposes of this talk, we can think of a bloom filter as a data structure that represents a set using a bit vector and a set of independent hash functions.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-51.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Here, we have the term &amp;ldquo;large&amp;rdquo; and we apply three independent hash functions, which hashes the term to locations five, seven, and twelve. Having three hash functions is arbitrary and we&amp;rsquo;ll talk about that tradeoff later.&lt;/p&gt;

&lt;p&gt;To insert &amp;ldquo;large&amp;rdquo; into the document, we&amp;rsquo;ll set bits five, seven, and twelve. To query for &amp;ldquo;large&amp;rdquo;, we&amp;rsquo;ll do the bitwise &lt;code&gt;AND&lt;/code&gt; of those locations. That is, we&amp;rsquo;ll check to see if all three locations are &lt;code&gt;1&lt;/code&gt;. If any location is a &lt;code&gt;0&lt;/code&gt;, the result will be &lt;code&gt;0&lt;/code&gt; (false) otherwise the result will be &lt;code&gt;1&lt;/code&gt; (true). For any term we&amp;rsquo;ve inserted, the query will be &lt;code&gt;1&lt;/code&gt; (true), because we&amp;rsquo;ve just set those bits.&lt;/p&gt;

&lt;p&gt;In this series of diagrams, any bit that&amp;rsquo;s colored is a &lt;code&gt;1&lt;/code&gt; and any bit that&amp;rsquo;s white is a &lt;code&gt;0&lt;/code&gt;. The red bits are associated with the term &amp;ldquo;large&amp;rdquo;.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-52.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
We can insert another term: &amp;ldquo;dog&amp;rdquo;. To do so, we&amp;rsquo;ll set those bits, one, seven, and ten. Seven was already set by &amp;ldquo;large&amp;rdquo; (red), but it&amp;rsquo;s fine to set it again with &amp;ldquo;dog&amp;rdquo;; all bits that are yellow are associated with the term &amp;ldquo;dog&amp;rdquo;. If we query for the term, as before, we&amp;rsquo;ll get a &lt;code&gt;1&lt;/code&gt; (true) because we&amp;rsquo;ve just set all the bits associated with the query.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-53.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
We can also try querying a term that we didn&amp;rsquo;t insert into the document. Let&amp;rsquo;s say we query for &amp;ldquo;cat&amp;rdquo;, which happens to hash to three, ten, and twelve.&lt;/p&gt;

&lt;p&gt;When we do the bitwise &lt;code&gt;AND&lt;/code&gt;, we first look at bit three. Since bit three is a zero, we already know that the result will be &lt;code&gt;0&lt;/code&gt; (false), so we don&amp;rsquo;t have to look at bits ten and twelve.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-54.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Let&amp;rsquo;s try querying another term, &amp;ldquo;box&amp;rdquo;, and let&amp;rsquo;s say that term hashes to one, five, and ten.&lt;/p&gt;

&lt;p&gt;Even if we don&amp;rsquo;t insert this term into the document, the query shows that the term is in the document because those bits were set by other terms. We have a false positive!
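&lt;p&gt;The whole insert/query mechanism fits in a few lines of code. Here&amp;rsquo;s a minimal sketch of a single-document bloom filter &amp;ndash; the bit positions come from Python&amp;rsquo;s salted &lt;code&gt;hash&lt;/code&gt;, not the hash functions in the slides:&lt;/p&gt;

```python
# Minimal single-document bloom filter: a bit vector plus three
# simulated independent hash functions.
NUM_BITS = 16
NUM_HASHES = 3
bits = [0] * NUM_BITS

def positions(term):
    # Salting the term with an index gives us three different hashes.
    return [hash((term, i)) % NUM_BITS for i in range(NUM_HASHES)]

def insert(term):
    for p in positions(term):
        bits[p] = 1

def query(term):
    # True only if every one of the term's bits is set; this is the
    # bitwise AND of the term's locations.
    return all(bits[p] for p in positions(term))

insert("large")
insert("dog")
assert query("large")  # inserted terms always match: no false negatives
# A term we never inserted usually returns False, but it can return
# True (a false positive) if other terms happened to set all its bits.
```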
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-55.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
How bad is this problem? Well, what&amp;rsquo;s the probability that any query will return a false positive?
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-56.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Let&amp;rsquo;s assume we have ten percent bit density. This is something we can control &amp;ndash; for example, if we have a bit vector of length 100, and we have ten terms, each of which is hashed to one location, we expect the bit density to be slightly less than 10%. It would be 10% if no terms hashed to the same location, but it&amp;rsquo;s possible that some terms might collide and hash to the same location.&lt;/p&gt;

&lt;p&gt;What&amp;rsquo;s the probability of a false positive if we hash to one location instead of three locations?&lt;/p&gt;

&lt;p&gt;If the term is actually in the document, then we&amp;rsquo;ll set the bit, and if we do a query, since the bit was set, we&amp;rsquo;ll definitely return true, so there&amp;rsquo;s no probability of a false negative.&lt;/p&gt;

&lt;p&gt;If the term isn&amp;rsquo;t in the document, then we haven&amp;rsquo;t set its associated bit because of this term &amp;ndash; so what&amp;rsquo;s the probability the bit is set anyway? Because our bit density is .1, or 10%, the probability is 10%.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-57.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
What if we hash to two locations instead of one location? Since we&amp;rsquo;re assuming we have uniform 10% bit density, we can multiply the probabilities: we get .1 * .1 = .01 = 1%.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-58.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
For three locations, the math is the same as before: .1 * .1 * .1 = .001 = 0.1%.&lt;/p&gt;

&lt;p&gt;As we hash to more locations, if we don&amp;rsquo;t increase the size of the bit vector, the bit density will go up. Same amount of space, set more bits, higher bit density. So we have to increase the number of bits, and we have to increase the number of bits linearly. As we increase the number of bits linearly, we get an exponential decrease in the probability of a false positive.
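&lt;p&gt;The arithmetic above is short enough to check directly. A small sketch, assuming the same uniform 10% bit density:&lt;/p&gt;

```python
# False-positive probability with uniform bit density, as a function
# of the number of hash locations per term. Each location is an
# independent 10% chance of having been set by other terms.
density = 0.10

fp = {k: round(density ** k, 6) for k in (1, 2, 3)}
print(fp)  # {1: 0.1, 2: 0.01, 3: 0.001}
```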
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-59.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
One intuition as to why bloom filters work is that we pay a linear cost and get an exponential benefit.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-60.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Ok. We&amp;rsquo;ve talked about how to use a bloom filter to represent one document. Since our index needs to represent multiple documents, we&amp;rsquo;ll use multiple bloom filters.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-61.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
In this diagram, each of the ten columns represents a document. That is, we have documents A through J.&lt;/p&gt;

&lt;p&gt;One thing we could do is have ten independent bloom filters. We know that we can have one bloom filter represent one document, so why not use ten bloom filters for ten documents?&lt;/p&gt;

&lt;p&gt;If we&amp;rsquo;re going to do that, we might as well maintain the same mapping from terms to rows; that is, use the same hash functions for each column, so that when we do a query, we can do the query in parallel.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-62.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
In the single-document example, when we did a query, we did the bitwise &lt;code&gt;AND&lt;/code&gt; of some bits. Now, to do a query, we&amp;rsquo;ll do the bitwise &lt;code&gt;AND&lt;/code&gt; of rows of bits.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-63.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Now we&amp;rsquo;re going to query for all documents that have both &amp;ldquo;large&amp;rdquo; &lt;code&gt;AND&lt;/code&gt; &amp;ldquo;dog&amp;rdquo;. As before, bits that are red are associated with the term &amp;ldquo;large&amp;rdquo; and bits that are yellow are associated with the &amp;ldquo;dog&amp;rdquo;. Additionally, bits that are grey are associated with other terms.&lt;/p&gt;

&lt;p&gt;After we do the bitwise &lt;code&gt;AND&lt;/code&gt; of all of the rows, the result will be a row vector with some bits set &amp;ndash; those bits will be the documents that have both the terms &amp;ldquo;large&amp;rdquo; &lt;code&gt;AND&lt;/code&gt; &amp;ldquo;dog&amp;rdquo;. We&amp;rsquo;re going to &lt;code&gt;AND&lt;/code&gt; together rows one, five, seven, ten, and twelve and then look at the result.
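&lt;p&gt;Here&amp;rsquo;s a sketch of that row-oriented query in Python. The row numbers match the slides, but the bit patterns are made up for illustration, and I&amp;rsquo;m using multiplication of 0/1 values to stand in for the bitwise &lt;code&gt;AND&lt;/code&gt;:&lt;/p&gt;

```python
# Ten documents A..J as columns; each row holds one bit per document.
doc_ids = list("ABCDEFGHIJ")

rows = {
    #     A  B  C  D  E  F  G  H  I  J
    1:   [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    5:   [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    7:   [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    10:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    12:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
}

def match(row_numbers):
    # AND the selected rows together; on 0/1 ints, a*b acts as AND.
    result = rows[row_numbers[0]][:]
    for r in row_numbers[1:]:
        result = [a * b for a, b in zip(result, rows[r])]
        if not any(result):
            break  # early exit: an AND can never set a bit back to 1
    return [d for d, bit in zip(doc_ids, result) if bit]

assert match([1, 5, 7, 10, 12]) == ["J"]  # only document J survives
```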
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-64.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
In this diagram, on the right, the part that&amp;rsquo;s highlighted is the fraction of the query that we&amp;rsquo;ve done so far. On the left, the part that&amp;rsquo;s highlighted is the result of the computation so far.&lt;/p&gt;

&lt;p&gt;When we start, we have row one.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-65.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
When we &lt;code&gt;AND&lt;/code&gt; rows one and five together, we can see that bit F is cleared to zero.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-66.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
After we &lt;code&gt;AND&lt;/code&gt; row seven into our result, nothing changes. Even though row seven has bit F set, an &lt;code&gt;AND&lt;/code&gt; of a one and a zero is a zero, so the result in column F is still zero.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-67.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
When we &lt;code&gt;AND&lt;/code&gt; row ten in, bit I is cleared.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-68.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
And then when we &lt;code&gt;AND&lt;/code&gt; in the last row, nothing changes. The result of the query is that bit J is set. In other words, the query concludes that document J contains both the terms &amp;ldquo;large&amp;rdquo; &lt;code&gt;AND&lt;/code&gt; &amp;ldquo;dog&amp;rdquo;, and no other document in this block contains both terms.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-69.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
In our previous example, we queried a block of documents where at least one document contained both of the terms we cared about. We can also query a block of documents where none of the documents contain both of the terms.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-70.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
As before, we want to take the bitwise &lt;code&gt;AND&lt;/code&gt; of rows one, five, seven, ten, and twelve.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-71.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
And as before, we&amp;rsquo;ll start with row one.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-72.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
After we &lt;code&gt;AND&lt;/code&gt; in row five, all of the bits are zero! When that happened in the &amp;ldquo;cat&amp;rdquo; example we did on a single document, we could stop, because we knew the document couldn&amp;rsquo;t possibly contain the term: an &lt;code&gt;AND&lt;/code&gt; can never set a bit. The same thing is true here, and we can stop and return that the result is all zeros.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-73.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
I said, earlier, that we&amp;rsquo;d try to estimate the performance of a system. How do we do that?&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ll want to have a cost model for operations and then figure out what operations we need to do. For us, we&amp;rsquo;re doing bitwise &lt;code&gt;AND&lt;/code&gt;s and reading data from memory. Reading data from memory is so much more expensive than a bitwise &lt;code&gt;AND&lt;/code&gt; that we can ignore the cost of the &lt;code&gt;AND&lt;/code&gt;s and only consider the cost of memory accesses. If we had any disk accesses, those would be even slower, but since we&amp;rsquo;re operating in memory, we&amp;rsquo;ll assume that a memory access is the most expensive thing that we do.&lt;/p&gt;

&lt;p&gt;One bit of background is that on the machines that we run on, we do memory accesses in 512-bit blocks. So far, we&amp;rsquo;ve talked about doing operations on blocks of ten documents, but on the actual machine we can think of doing operations on 512 document blocks.&lt;/p&gt;

&lt;p&gt;In that case, to get a performance estimate, we&amp;rsquo;ll need to know how many blocks we have, how many memory accesses (rows) we have per block, and how many memory accesses our machine can do per unit time.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-74.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
To figure out how many memory accesses per block we want, we could work through the math&amp;hellip;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-75.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;hellip;which is a series of probability calculations that will give us some number. I&amp;rsquo;m not going to do that here today, but it&amp;rsquo;s possible to do.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-76.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Another thing we can do is to run a simulation. Here&amp;rsquo;s the result of a simulation that was maybe thirty lines of code. This graph is a histogram of how many memory accesses we have to do per block, assuming we have 20% bit density, and a query that&amp;rsquo;s 14 rows.
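&lt;p&gt;A sketch of what such a simulation might look like. Note the assumptions: every bit is an independent coin flip at the target density, so this models only blocks where no document actually matches &amp;ndash; which is the common case, and the left mode of the histogram:&lt;/p&gt;

```python
import random
from collections import Counter

# How many of a query's rows do we read per block before the running
# AND goes to all zeros? Parameters follow the slide (20% density,
# 14 rows); the independent-random-bits model is an assumption.
BLOCK = 512
DENSITY = 0.2
NUM_ROWS = 14
TRIALS = 2000

def accesses_for_one_block(rng):
    result = [1] * BLOCK
    reads = 0
    for _ in range(NUM_ROWS):
        # Draw a random row at the target bit density.
        row = rng.choices((1, 0), weights=(DENSITY, 1 - DENSITY), k=BLOCK)
        reads += 1
        result = [a * b for a, b in zip(result, row)]
        if not any(result):
            break  # early termination: the result can never recover
    return reads

rng = random.Random(0)
histogram = Counter(accesses_for_one_block(rng) for _ in range(TRIALS))
```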
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-77.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If 14 rows sounds like a lot, well, we often do queries on 20 to 100 rows. That might sound weird, since we looked at an example where each term mapped to three rows. For one thing, terms can and sometimes do map to more than three rows. Additionally, we do query re-writing that makes queries more complicated (and hopefully better).
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-78.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
For example, let&amp;rsquo;s say we query for &amp;ldquo;large&amp;rdquo; &lt;code&gt;AND&lt;/code&gt; &amp;ldquo;yellow&amp;rdquo; &lt;code&gt;AND&lt;/code&gt; &amp;ldquo;dog&amp;rdquo;.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-79.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
 Maybe the user was actually searching for or trying to remember the name of some breed of large yellow dog, so we could re-write the query to be something like&lt;/p&gt;

&lt;p&gt;(large &lt;code&gt;AND&lt;/code&gt; yellow &lt;code&gt;AND&lt;/code&gt; dog) &lt;code&gt;OR&lt;/code&gt; (golden &lt;code&gt;AND&lt;/code&gt; retriever)&lt;/p&gt;

&lt;p&gt;as well as other breeds of dogs that can be large and yellow.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-80.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
But the user might also be searching for some particular large yellow dog, so we could re-write the query to something like&lt;/p&gt;

&lt;p&gt;(large &lt;code&gt;AND&lt;/code&gt; yellow &lt;code&gt;AND&lt;/code&gt; dog) &lt;code&gt;OR&lt;/code&gt; (golden &lt;code&gt;AND&lt;/code&gt; retriever) &lt;code&gt;OR&lt;/code&gt; (old &lt;code&gt;AND&lt;/code&gt; yeller)&lt;/p&gt;

&lt;p&gt;and in fact we might want to query for the phrase &amp;ldquo;old yeller&amp;rdquo; and not just the &lt;code&gt;AND&lt;/code&gt; of the terms, and so on and so forth.&lt;/p&gt;

&lt;p&gt;When you do this kind of thing, and add in personalization based on location and query history, simple seeming queries can end up being relatively complicated, which is how we can get queries of 100 rows.&lt;/p&gt;

&lt;p&gt;&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-81.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;&lt;/p&gt;

&lt;p&gt;Coming back to the histogram of the number of memory accesses per block, we can see that it&amp;rsquo;s bimodal.&lt;/p&gt;

&lt;p&gt;There&amp;rsquo;s the mode on the right, where we do 14 accesses. That mode corresponds to our first multi-document example, where at least one document in the block contained the terms. Because at least one document contained all of the terms, we don&amp;rsquo;t get all zeros in the result and do all 14 accesses.&lt;/p&gt;

&lt;p&gt;The mode on the left, which is smeared out from 3 on to the right, is associated with blocks like our second example, where no document contained all of the terms in the query. In that case we&amp;rsquo;ll get a result of all zeros at some point with very high probability, and we can terminate the query early.&lt;/p&gt;

&lt;p&gt;If we look at the average of the number of accesses we need for the left mode, it&amp;rsquo;s something like 4.6. On the right, it&amp;rsquo;s exactly 14. If we average these together, let&amp;rsquo;s say we get something like 5 accesses per block (just to get a nice, round number).
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-82.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Now we have what we need to do a first-order performance estimate!
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-83.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If we go back to our roughly wikipedia-sized example, we had ten million documents. Since we&amp;rsquo;re on a machine where memory accesses are 512 bits wide, that&amp;rsquo;s ten million divided by 512, or twenty-thousand blocks, with a bit of rounding.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-84.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
We said that we have roughly five memory accesses per block. If we have twenty-thousand blocks, that means that a query needs to do twenty-thousand times five memory accesses, or one hundred-thousand memory transfers.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-85.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
We said that we can get 25GB/s of bandwidth out of our cheap server. If we do 512-bit transfers, that&amp;rsquo;s three-hundred and ninety-million transfers per second.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-86.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If we divide a hundred thousand transfers per query into three-hundred and ninety-million transfers per second, we get thirty-nine hundred QPS (with rounding from previous calculations).
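&lt;p&gt;Putting the whole estimate in one place &amp;ndash; the only inputs are the numbers from the slides (corpus size, transfer width, measured bandwidth, and the rounded five accesses per block):&lt;/p&gt;

```python
# Back-of-envelope QPS estimate for a ten-million-document shard.
docs = 10_000_000
transfer_bits = 512
blocks = docs // transfer_bits             # about twenty thousand
accesses_per_block = 5                     # rounded average from the histogram

transfers_per_query = blocks * accesses_per_block  # about one hundred thousand

bandwidth_bytes = 25 * 10**9               # 25GB/s
transfer_bytes = transfer_bits // 8        # 64-byte transfers
transfers_per_second = bandwidth_bytes // transfer_bytes  # about 390 million

qps = transfers_per_second // transfers_per_query
print(qps)  # about 4000; the slide's rounder numbers give thirty-nine hundred
```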
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-87.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
When I do a calculation like this, if I&amp;rsquo;m just looking at the largest factors that affect performance, like we did here, I&amp;rsquo;m happy if we get within a factor of two.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-88.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If you adjust for a lot of smaller factors, it&amp;rsquo;s possible to get a more accurate estimate&amp;hellip;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-89.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;hellip;but in the interest of time, we&amp;rsquo;re not going to look at all the smaller factors that add or remove 5% or 10% in performance.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-90.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
However, there are a few major factors that affect performance a lot that I&amp;rsquo;ll briefly mention.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-91.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
One thing is that our machines don&amp;rsquo;t only do document matching. So far, we&amp;rsquo;ve discussed an algorithm that, given a set of documents and a query will return a subset of those documents. We haven&amp;rsquo;t done any ranking, meaning that queries will come back unordered.&lt;/p&gt;

&lt;p&gt;There are some domains where that&amp;rsquo;s fine, but in web search, we spend a significant fraction of CPU time ranking the documents that match the query.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-92.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Additionally, we also ingest new documents all the time. When news happens and people search for the news, they want to see it right away, so we can&amp;rsquo;t do batch updates.&lt;/p&gt;

&lt;p&gt;This is something BitFunnel can actually do faster than querying. If we think about how queries worked, they&amp;rsquo;re global, in the sense that each query looked at information for each document. But when we&amp;rsquo;re ingesting new documents, since each document is a column, that&amp;rsquo;s possible to do without having to touch everything in the index. In fact, since our data structure is, in some sense, just an array that we want to set some bits in, it&amp;rsquo;s pretty straightforward to ingest documents with multiple threads while allowing queries with multiple threads.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s possible to work through the math for this the way we did for querying, but again, in the interest of time, I&amp;rsquo;ll just mention that this is possible.&lt;/p&gt;

&lt;p&gt;Between ranking and ingestion, in the configuration we&amp;rsquo;re running today, that uses about half the machine, leaving half for matching, which reduces our performance by a factor of two.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-93.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
However, we also have an optimization that drastically increases performance: hierarchical bloom filters.&lt;/p&gt;

&lt;p&gt;In our example, we had one bloom filter per document, which meant that if we had a query that only matched a single document, we&amp;rsquo;d have to examine at least one bit per document. In fact, we said that we&amp;rsquo;d end up looking at about five bits per document. If we use hierarchical bloom filters, it&amp;rsquo;s possible to look at a logarithmic number of bits per document instead of a linear number of bits per document.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-94.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
The real production system we use has a number of not necessarily obvious changes in order to run at the speed that it does. Most of them aren&amp;rsquo;t required for the system to work correctly without using an unreasonable amount of memory, but one of them is.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-95.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If you take the algorithm I described today and try to use it, when you look at sixteen rows in a block of ten documents, you might see something like this.&lt;/p&gt;

&lt;p&gt;Notice that some columns (B and D) have most or all bits set, and some columns (A and C) have few or no bits set. This is because different documents have a different number of terms.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s say we sized the number of rows so that we can efficiently store tweets. Let&amp;rsquo;s say, hypothetically, that means we need fifty rows. And then a weird document with ten million terms comes along and it wants to hash into the rows, say, thirty million times. That&amp;rsquo;s going to set every bit in its column, which means that every query will return true. Many weird documents like this contain terms that are almost never queried, so the query should almost never return true, but our system will always return true!&lt;/p&gt;

&lt;p&gt;Say we size up the number of rows so that these weird ten million term documents are ok. Let&amp;rsquo;s say that means we need to have a hundred million rows. Ok, our queries will work fine, but we still have things like tweets that might want to set, say, sixteen bits. We said that we wanted to use bloom filters instead of arrays, hashing into a smaller array to save space, but now we have all of these really sparse columns that have something like sixteen out of a hundred million bits set.&lt;/p&gt;

&lt;p&gt;To get around this problem, we shard (split up the index) by the number of terms per document. Unlike many systems, which only run in a sharded configuration when they need to spill over onto another machine, we always run in a sharded configuration, even when we&amp;rsquo;re running on a single machine.&lt;/p&gt;

&lt;p&gt;Although there are other low level details that you&amp;rsquo;d want to know to run an efficient system, this is the only change that you absolutely have to take into account when compared to the algorithm I&amp;rsquo;ve described today.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-96.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Let&amp;rsquo;s sum up what we&amp;rsquo;ve looked at today.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-97.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Before we talk about the real conclusions, let&amp;rsquo;s discuss a few false impressions this talk could give.&lt;/p&gt;

&lt;p&gt;&amp;ldquo;Search is simple&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;You&amp;rsquo;ve seen me describe an algorithm that&amp;rsquo;s used in production for web search. The algorithm is simple enough that it could be described in a thirty-minute talk with no background. However, to run this algorithm at the speed we&amp;rsquo;ve estimated today, there&amp;rsquo;s a fair amount of low-level implementation work. For example, to reduce the (otherwise substantial) control flow overhead of querying and ranking, we compile both our queries and our query ranking.&lt;/p&gt;

&lt;p&gt;Additionally, even if this system were simple, this is less than 1% of the code in Bing. Search has a lot of moving parts and this is just one of them.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-98.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;ldquo;Bloom filters are better than posting lists&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;I went into some detail about bloom filters and didn&amp;rsquo;t talk about posting lists much, except to say that they&amp;rsquo;re standard. This might give the impression that bloom filters are categorically better than posting lists. That&amp;rsquo;s not true! The only reason I didn&amp;rsquo;t describe posting lists in detail and do a comparison is that state-of-the-art posting list implementations are tremendously complicated and I couldn&amp;rsquo;t describe them to a non-specialist audience in thirty minutes, let alone do the comparison.&lt;/p&gt;

&lt;p&gt;If you do the comparison, you&amp;rsquo;ll find that which one is better depends on your workload. For an argument that posting lists are superior to bloom filters, see Zobel et al., &amp;ldquo;Inverted files versus signature files for text indexing&amp;rdquo;.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-99.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;ldquo;You can easily reason about all performance&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;Today, we looked at how an algorithm worked and estimated the performance of a system that took years to build. This was relatively straightforward because we were trying to calculate the average throughput of a system, which is something that&amp;rsquo;s amenable to back-of-the-envelope math. Something else that&amp;rsquo;s possible, but slightly more difficult, is to estimate the latency of a query on an unloaded system.&lt;/p&gt;

&lt;p&gt;Something that&amp;rsquo;s substantially harder is estimating the latency on a system as load varies, and estimating the latency distribution.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-100.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Ok, now for an actual conclusion.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-101.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
You can often reason about performance&amp;hellip;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-102.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;hellip;and you can do so with simple arithmetic. Today, all we did was multiply and divide. Sometimes you might have to add, but you can often guess what the performance of a system should be with simple calculations.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-103.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Thanks to all of these people for help with this talk! Also, I seem to have forgotten to put Bill Barnes on the list, but he gave me some great feedback!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Post original talk: also, thanks to Laura Lindzey, Larry Marbuger, and someone&amp;rsquo;s name who I can&amp;rsquo;t remember for giving me great post-talk feedback that changed how I&amp;rsquo;m giving the next talk.&lt;/em&gt;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-104.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If you want to read more about the index we talked about today, BitFunnel, you can get more information at &lt;a href=&#34;http://bitfunnel.org&#34;&gt;bitfunnel.org&lt;/a&gt;. We also have some code up at &lt;a href=&#34;http://github.com/bitfunnel/bitfunnel&#34;&gt;github.com/bitfunnel/bitfunnel&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Oh, yeah, I&amp;rsquo;m told you have to introduce yourself at these things. I&amp;rsquo;m Dan Luu, and I have a blog at &lt;a href=&#34;https://danluu.com&#34;&gt;danluu.com&lt;/a&gt; where I blog about the kind of thing I talked about here today. That is, I often write about performance, algorithms and data structures, and tradeoffs between different techniques.&lt;/p&gt;

&lt;p&gt;Thanks for your time. Oh, also, I&amp;rsquo;m not going to take questions from the stage because I don&amp;rsquo;t know how people who aren&amp;rsquo;t particularly interested in the questions often feel obligated to stay for the question period. However, I really enjoy talking about this stuff and I&amp;rsquo;d be happy to take questions in the hallway or anytime later.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;&lt;/p&gt;

&lt;h4 id=&#34;some-comments-on-the-talk&#34;&gt;Some comments on the talk&lt;/h4&gt;

&lt;p&gt;Phew! I survived my first conference talk.&lt;/p&gt;

&lt;p&gt;Considering how early the talk was (10am, the first non-keynote slot), I was surprised that the room was packed and people were standing. Here&amp;rsquo;s a photo Jessica Kerr took (and annotated) while we were chatting, maybe five or ten minutes before the talk started, before the room really filled up:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://bitfunnel.org/strangeloop/packed-room.png&#34; alt=&#34;Packed room&#34; /&gt;&lt;/p&gt;

&lt;p&gt;During the conference, I got a lot of positive comments on the talk, which is great, but what I&amp;rsquo;d really love to hear about is where you were confused. If you felt lost at any point, you&amp;rsquo;d be doing me a favor by letting me know what you found to be confusing. Before I run this talk again, I&amp;rsquo;m probably going to flip the order of some slides in the Array/Bloom Filter/BitFunnel discussion, add another slide where I explicitly talk about bit density, and add diagrams for a HashMap (in the posting list section) and an Array (in the lead-up to bloom filters). There are probably more changes I could make to make things clearer, though!&lt;/p&gt;

&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/80LKF2qph6I&#34; frameborder=&#34;0&#34; allowfullscreen&gt;&lt;/iframe&gt;
</description>
    </item>
    
    <item>
      <title>A Small Query Language</title>
      <link>http://bitfunnel.org/a-small-query-language/</link>
      <pubDate>Sat, 10 Sep 2016 15:52:53 -0700</pubDate>
      
      <guid>http://bitfunnel.org/a-small-query-language/</guid>
      <description>

&lt;p&gt;A challenge in bringing BitFunnel to open source is
providing functionality that was previously supplied by portions of
Bing upstream of BitFunnel. BitFunnel was designed as a library
that takes, as input, a tree of &lt;code&gt;TermMatchNodes&lt;/code&gt; which represents a boolean
expression combining terms and phrases using logical operators like
&lt;code&gt;and&lt;/code&gt;, &lt;code&gt;or&lt;/code&gt;, and &lt;code&gt;not&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The Bing search pipeline does a ton of work on the query itself
before presenting a &lt;code&gt;TermMatchNode&lt;/code&gt; tree to BitFunnel. Examples include
word breaking, stemming, spelling corrections, and augmentation with
synonyms. The query also goes through a complex set of classifiers
that determine whether to route the query to special modules, such
as a baseball scoreboard or a weather forecast. The query is also
annotated with scoring instructions at this time.&lt;/p&gt;

&lt;p&gt;All of this processing is carried out on a tree data structure
generated upstream of BitFunnel. Although this tree has a textual
representation, there was never any need for BitFunnel to parse
the tree, so we never included a parser.&lt;/p&gt;

&lt;p&gt;Our open source project is a different story. Today, at a minimum,
we need some sort of query language and parser to test the code
as we stand it up. We also expect our users will want the option of
a complete, end-to-end system that includes a simple, intuitive,
human-authorable query language.&lt;/p&gt;

&lt;p&gt;Our goals for the query language were&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy and intuitive query authoring.&lt;/li&gt;
&lt;li&gt;Small grammar that is easy to learn.&lt;/li&gt;
&lt;li&gt;Simple to parse.&lt;/li&gt;
&lt;li&gt;Familiar to people who have used other search systems.&lt;/li&gt;
&lt;li&gt;Based on UTF-8 to allow queries in all languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since our plan was to use &lt;a href=&#34;https://lucene.apache.org/&#34;&gt;Lucene&lt;/a&gt;
as a testing oracle and performance
baseline, it made sense to consider some subset of the Lucene query language.
In the end, we chose to go with something more like the languages used by Bing
and Google.&lt;/p&gt;

&lt;p&gt;In making this decision, our main tradeoff was between complexity and familiarity
on the one hand and Lucene compatibility on the other. Lucene compatibility
would certainly make our lives as developers easier because we could feed identical
queries to BitFunnel and our Lucene reference. It would also make it easier for search
integrators to migrate between the two systems since they could just drop in
whichever engine best met their business needs.&lt;/p&gt;

&lt;p&gt;The reason we went with the Bing/Google approach centered on the complexity and
familiarity of the operator precedence. In the Lucene query language, logical
&lt;code&gt;or&lt;/code&gt; is implicit and has precedence over logical &lt;code&gt;and&lt;/code&gt;. For example, the query
&lt;code&gt;dogs cats mice&lt;/code&gt; matches documents that contain at least one of the terms
&amp;ldquo;dogs&amp;rdquo;, &amp;ldquo;cats&amp;rdquo;, and &amp;ldquo;mice&amp;rdquo;. In Bing and Google, the same query tends to find
those documents that contain all three terms (the exact semantics in Bing and
Google are less clear because they may alter the original query based on
complex systems that infer human intent). Our feeling was that users of internet
search engines would find the Bing/Google approach more familiar, but that
this would come at a cost of Lucene compatibility and it would be less familiar
for Lucene users.&lt;/p&gt;

&lt;p&gt;The deciding factor was the complexity of Lucene&amp;rsquo;s logical &lt;code&gt;or&lt;/code&gt; operator used
in conjunction with the &lt;code&gt;+&lt;/code&gt; operator. In Lucene, the &lt;code&gt;+&lt;/code&gt; operator converts a
portion of an &lt;code&gt;or&lt;/code&gt; expression into an &lt;code&gt;and&lt;/code&gt; expression. As an example, the
query &lt;code&gt;+dogs cats mice&lt;/code&gt; matches those documents that contain the term &amp;ldquo;dogs&amp;rdquo;
and at least one of &amp;ldquo;cats&amp;rdquo; and &amp;ldquo;mice&amp;rdquo;. In other words, the addition of a unary
&lt;code&gt;+&lt;/code&gt; operator converts the logical expression &lt;code&gt;dogs | cats | mice&lt;/code&gt; into the
expression &lt;code&gt;dogs &amp;amp; (cats | mice)&lt;/code&gt; which has a completely different structure.&lt;/p&gt;

&lt;p&gt;We felt that the &lt;code&gt;+&lt;/code&gt; operator&amp;rsquo;s ability
to convert an implicit &lt;code&gt;or&lt;/code&gt; into an &lt;code&gt;and&lt;/code&gt; and then distribute the &lt;code&gt;and&lt;/code&gt; over the
remaining &lt;code&gt;or&lt;/code&gt; expression introduced too much complexity and potential ambiguity
for what is essentially syntactic sugar.&lt;/p&gt;

&lt;p&gt;In any event, the decision was low stakes because it will be easy to add
a Lucene compatible parser in the future if we need one. Here&amp;rsquo;s what we came up with.&lt;/p&gt;

&lt;h3 id=&#34;query-language-overview&#34;&gt;Query Language Overview&lt;/h3&gt;

&lt;p&gt;Our query language is inspired by a subset of the Bing query language.
Today the functionality is limited to expressing boolean matching trees.
Once we&amp;rsquo;re ready to port the BitFunnel ranker code, we will extend the
language to include ranker annotations (e.g. boosting the weight of a
particular term).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AND.&lt;/strong&gt; The &lt;code&gt;and&lt;/code&gt; operator is implicit so the query &lt;code&gt;dogs cats mice&lt;/code&gt; matches those
documents that contain all of the words in the query. One can also explicitly
specify logical &lt;code&gt;and&lt;/code&gt; with the &lt;code&gt;&amp;amp;&lt;/code&gt; symbol, e.g. &lt;code&gt;dogs &amp;amp; cats &amp;amp; mice&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OR.&lt;/strong&gt; The &lt;code&gt;or&lt;/code&gt; operator, denoted by the &lt;code&gt;|&lt;/code&gt; symbol, is explicit. The query
&lt;code&gt;dogs | cats | mice&lt;/code&gt; matches those documents that contain at least one of the
three words in the query. Note that the &lt;code&gt;or&lt;/code&gt; operator has lower precedence
than the &lt;code&gt;and&lt;/code&gt; operator, so the query &lt;code&gt;dogs &amp;amp; cats | mice&lt;/code&gt; is equivalent to
the query &lt;code&gt;(dogs &amp;amp; cats) | mice&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOT.&lt;/strong&gt; The unary &lt;code&gt;not&lt;/code&gt; operator, denoted by the &lt;code&gt;-&lt;/code&gt; symbol, matches documents that
do not contain an expression. As an example, the query &lt;code&gt;dogs cats -mice&lt;/code&gt;
matches those documents that contain &amp;ldquo;dogs&amp;rdquo; and &amp;ldquo;cats&amp;rdquo;, but do not contain
&amp;ldquo;mice&amp;rdquo;. Note that the &lt;code&gt;not&lt;/code&gt; operator can be applied to arbitrary expressions
such as &lt;code&gt;dogs -(cats | mice)&lt;/code&gt;. The &lt;code&gt;not&lt;/code&gt; operator has higher precedence than
the &lt;code&gt;and&lt;/code&gt; and &lt;code&gt;or&lt;/code&gt; operators.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TERM.&lt;/strong&gt; A search term is any sequence of UTF-8 characters that does not include
whitespace or special characters such as &lt;code&gt;&amp;quot;()-:&amp;amp;|&lt;/code&gt;. Terms may include
upper and lower case characters, but they may be converted to lowercase
during the query planning process. Special characters may appear if they are
escaped with a backslash. As an example, &lt;code&gt;dog\(cat\)&lt;/code&gt; would create the term associated with
the string literal &amp;ldquo;dog(cat)&amp;rdquo; while
&lt;code&gt;dog(cat)&lt;/code&gt; would be equivalent to &lt;code&gt;dog &amp;amp; cat&lt;/code&gt;. Note that it is legal to include
an escaped space in a term, e.g. &lt;code&gt;dog\ cat&lt;/code&gt;. Keep in mind that
such a term is actually a unigram that happens to contain a space and not the
bigram phrase &lt;code&gt;&amp;quot;dog cat&amp;quot;&lt;/code&gt;. Phrases and unigrams are treated differently in the
index, so it is important to use the phrase syntax when the term is intended to
be a higher order ngram (i.e. bigram, trigram, etc.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PHRASE.&lt;/strong&gt; A phrase can be specified by enclosing a sequence of search terms in
double quotes, for example, &amp;quot;New York City&amp;quot;. Each term in the phrase
can include escaped characters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STREAM PREFIX.&lt;/strong&gt; A term match can be restricted to a particular stream by prefixing it with
the name of the stream and a colon. As an example, the query
&lt;code&gt;title:dogs body:cat mice&lt;/code&gt; would match those documents that have &amp;ldquo;dogs&amp;rdquo; in
the title, &amp;ldquo;cat&amp;rdquo; in the body, and &amp;ldquo;mice&amp;rdquo; in the default stream. Stream
names are defined by the application that hosts BitFunnel, which is also
responsible for designating the default stream.&lt;/p&gt;

&lt;h3 id=&#34;grammar&#34;&gt;Grammar&lt;/h3&gt;

&lt;p&gt;Here&amp;rsquo;s the grammar for the BitFunnel query language.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;OR:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;AND (&amp;lsquo;|&amp;rsquo; AND)*&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AND:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SIMPLE ([&amp;rsquo;&amp;amp;&amp;lsquo;] SIMPLE)*&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SIMPLE:&lt;/strong&gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;rsquo;-&amp;rsquo; SIMPLE&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;rsquo;(&amp;rsquo; OR &amp;lsquo;)&amp;rsquo;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;PREFIX&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PREFIX:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[STREAM] TEXT&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STREAM:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;~[SPECIAL | SPACE]+ &amp;lsquo;:&amp;rsquo;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TEXT:&lt;/strong&gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;TERM&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;quot; TERM [SPACE+ TERM]* &amp;quot;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TERM:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[~[SPECIAL | SPACE] | ESCAPE]+&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SPECIAL:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[&amp;rsquo;&amp;quot;&amp;rsquo; &amp;lsquo;(&amp;rsquo; &amp;lsquo;)&amp;rsquo; &amp;lsquo;-&amp;rsquo; &amp;lsquo;:&amp;rsquo; &amp;lsquo;&amp;amp;&amp;rsquo; &amp;lsquo;|&amp;rsquo;]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SPACE:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[&amp;rsquo;\n&amp;rsquo; &amp;lsquo;\r&amp;rsquo; &amp;lsquo;\t&amp;rsquo; &amp;lsquo;\v&amp;rsquo; &amp;lsquo;\f&amp;rsquo; &amp;lsquo; &amp;lsquo;]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ESCAPE:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lsquo;\&amp;rsquo; [SPECIAL | SPACE]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&#34;try-it-out&#34;&gt;Try it out&lt;/h3&gt;

&lt;p&gt;Feel free to experiment with the interactive query parser found in
the &lt;code&gt;examples\QueryParser&lt;/code&gt; directory. Just fire up the program with no
command line arguments and it will print out a brief help message and
then dump you into an interactive console where you can type in queries
and see their parse trees.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Welcome to the BitFunnel Query Parser Example.

This example is a Read-Eval-Print-Loop (REPL) that reads queries from
the console, parses them, and then prints out the resulting tree of
TermMatchNodes.

Enter a query after the % prompt and press return. To exit the demo
just enter a blank line. Here are some query ideas:
    Single terms
        dog
        title:cat
    Phrases
        &amp;quot;dogs are your best friend&amp;quot;
        anchors:&amp;quot;read this awesome page&amp;quot;
    Disjunctions
        dogs | cats
    Conjunctions
        dogs cats
        dogs &amp;amp; cats
    Negation
        -cats
    Grouping
        dogs (cats | fish)

% dog
Unigram(&amp;quot;dog&amp;quot;, 0)

% title:cat
Unigram(&amp;quot;cat&amp;quot;, 1)

% &amp;quot;dogs are your best friend&amp;quot;
Phrase {
  StreamId: 0,
  Grams: [
    &amp;quot;dogs&amp;quot;,
    &amp;quot;are&amp;quot;,
    &amp;quot;your&amp;quot;,
    &amp;quot;best&amp;quot;,
    &amp;quot;friend&amp;quot;
  ]
}

% anchors:&amp;quot;read this awesome page&amp;quot;
Phrase {
  StreamId: 2,
  Grams: [
    &amp;quot;read&amp;quot;,
    &amp;quot;this&amp;quot;,
    &amp;quot;awesome&amp;quot;,
    &amp;quot;page&amp;quot;
  ]
}

% dogs | cats
Or {
  Children: [
    Unigram(&amp;quot;cats&amp;quot;, 0),
    Unigram(&amp;quot;dogs&amp;quot;, 0)
  ]
}

% dogs cats
And {
  Children: [
    Unigram(&amp;quot;cats&amp;quot;, 0),
    Unigram(&amp;quot;dogs&amp;quot;, 0)
  ]
}

% dogs &amp;amp; cats
And {
  Children: [
    Unigram(&amp;quot;cats&amp;quot;, 0),
    Unigram(&amp;quot;dogs&amp;quot;, 0)
  ]
}

% -cats
Not {
  Child: Unigram(&amp;quot;cats&amp;quot;, 0)
}

% dogs (cats | fish)
And {
  Children: [
    Or {
      Children: [
        Unigram(&amp;quot;fish&amp;quot;, 0),
        Unigram(&amp;quot;cats&amp;quot;, 0)
      ]
    },
    Unigram(&amp;quot;dogs&amp;quot;, 0)
  ]
}

% dog\&amp;quot;cat
Unigram(&amp;quot;dog\&amp;quot;cat&amp;quot;, 0)

% dog\(cat
Unigram(&amp;quot;dog(cat&amp;quot;, 0)

% dog\ cat
Unigram(&amp;quot;dog cat&amp;quot;, 0)

%
bye
Press any key to continue . . .
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;current-limitations&#34;&gt;Current Limitations&lt;/h3&gt;

&lt;p&gt;The query parser is still very much a work-in-progress. Here are some known
limitations and notes about future directions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the query parser example, the parser has been &lt;a href=&#34;http://bitfunnel.org/stream-configuration&#34;&gt;configured&lt;/a&gt; with stream prefixes
&lt;code&gt;body&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;, and &lt;code&gt;anchors&lt;/code&gt;. These prefixes are associated with &lt;code&gt;Term::StreamId&lt;/code&gt; values of 0, 1, and 2, respectively.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;TERM&lt;/code&gt; production accepts all UTF-8 characters except for &amp;lsquo;\0&amp;rsquo; and
members of our &lt;code&gt;SPECIAL&lt;/code&gt; characters and &lt;code&gt;SPACE&lt;/code&gt; characters. One consequence
is that characters like Unicode directional quotations
(U+2018, U+2019, U+201C, and U+201D ) will be treated as part of the &lt;code&gt;TERM&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The query parser is configured to use an arena allocator with a fixed
amount of memory. Unexpectedly long queries may cause the allocator to throw.&lt;/li&gt;
&lt;li&gt;The query parser currently preserves letter case.&lt;/li&gt;
&lt;li&gt;Currently stream prefixes may contain escaped special characters. This will
likely be disallowed in the near future when we begin to store stream prefixes
in configuration files where special characters may cause problems.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Stream Configuration</title>
      <link>http://bitfunnel.org/stream-configuration/</link>
      <pubDate>Fri, 09 Sep 2016 17:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/stream-configuration/</guid>
      <description>

&lt;p&gt;BitFunnel models each document as a set of streams,
each of which consists of a sequence of terms corresponding to the
words and phrases that make up the document.&lt;/p&gt;

&lt;p&gt;Real world documents are usually organized with streams corresponding to
structural concepts, such as the title, the URL, the body, and perhaps even the
text of anchors on other pages that point to the document.&lt;/p&gt;

&lt;p&gt;We may want to organize the index using a different principle.
For example, we might index each document as a pair of streams,
one that contains all terms associated with the document and
another that contains only those terms that appear in streams
other than the document body. This organization is useful for
rewriting queries in order to return fewer results.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;StreamConfiguration&lt;/strong&gt; provides a mapping between the
streams in the document and the streams in the index.
Let&amp;rsquo;s look at a more detailed example.&lt;/p&gt;

&lt;p&gt;Consider a hypothetical document about dogs that resides at &lt;a href=&#34;http://bitfunnel.org/dogs&#34;&gt;http://bitfunnel.org/dogs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Dogs&lt;/strong&gt;&lt;br /&gt;
Dogs are your best friend.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Suppose another page refers to our document via the following anchor tags:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&amp;lt;a href=&amp;ldquo;dogs&amp;rdquo;&amp;gt;Check out this awesome page!&amp;lt;/a&amp;gt;&lt;br /&gt;
&amp;lt;a href=&amp;ldquo;dogs&amp;rdquo;&amp;gt;Who is your friend?&amp;lt;/a&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Such a document might be organized into streams as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;title: [dogs]&lt;br /&gt;
body: [dogs are your best friend]&lt;br /&gt;
url: [http bitfunnel org dogs]&lt;br /&gt;
anchors: [check out this awesome page] [who is your friend]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the index modelled documents this way we could search for
our document with queries like &lt;code&gt;dogs&lt;/code&gt;, &lt;code&gt;title:dogs&lt;/code&gt; or even
&lt;code&gt;anchors:&amp;quot;awesome page&amp;quot;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We could choose to index this document as two streams, one of which has all words
associated with the document and the other that contains words from streams other
than the body:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;document: [dogs] [dogs are your best friend http] [bitfunnel org dogs]&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; [check out this awesome page] [who is your friend]&lt;br /&gt;
nonbody: [dogs] [http bitfunnel org] [check out this awesome page]&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; [who is your friend]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With this organization, we could find the document with the query &lt;code&gt;nonbody:dogs&lt;/code&gt;
but not &lt;code&gt;nonbody:best&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&#34;configuring-streams&#34;&gt;Configuring Streams&lt;/h3&gt;

&lt;p&gt;The IDocument class uses the StreamConfiguration at ingestion time to
organize its terms for indexing. The QueryParser class uses the StreamConfiguration
to map from text stream names to &lt;code&gt;Term::StreamId&lt;/code&gt; values.&lt;/p&gt;

&lt;p&gt;Each IDocument is filled with streams of terms using a sequence of calls to the
OpenStream(), AddTerm(), and CloseStream() methods. The &lt;code&gt;Term::StreamId&lt;/code&gt; values
passed to OpenStream() are document streams. The document above might be initialized
with the following sequence of calls:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;OpenStream(0);  // Title stream
AddTerm(&amp;quot;dogs&amp;quot;);
CloseStream();

OpenStream(1);  // Body stream
AddTerm(&amp;quot;dogs&amp;quot;);
AddTerm(&amp;quot;are&amp;quot;);
AddTerm(&amp;quot;your&amp;quot;);
AddTerm(&amp;quot;best&amp;quot;);
AddTerm(&amp;quot;friend&amp;quot;);
CloseStream();

OpenStream(2);  // URL stream
AddTerm(&amp;quot;http&amp;quot;);
AddTerm(&amp;quot;bitfunnel&amp;quot;);
AddTerm(&amp;quot;org&amp;quot;);
AddTerm(&amp;quot;dogs&amp;quot;);
CloseStream();

OpenStream(3);  // Anchors stream
AddTerm(&amp;quot;check&amp;quot;);
AddTerm(&amp;quot;out&amp;quot;);
AddTerm(&amp;quot;this&amp;quot;);
AddTerm(&amp;quot;awesome&amp;quot;);
AddTerm(&amp;quot;page&amp;quot;);
CloseStream();

// Close and then reopen stream to
// keep phrases from the two anchors
// separate.

OpenStream(3);  // Anchors stream
AddTerm(&amp;quot;who&amp;quot;);
AddTerm(&amp;quot;is&amp;quot;);
AddTerm(&amp;quot;your&amp;quot;);
AddTerm(&amp;quot;friend&amp;quot;);
CloseStream();
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can ingest this document as &lt;code&gt;Document&lt;/code&gt; and &lt;code&gt;NonBody&lt;/code&gt; streams by writing
the following StreamConfiguration file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Document: 0,1,2,3
NonBody: 1,2,3
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first line defines an index stream called &amp;ldquo;Document&amp;rdquo; which contains terms
and phrases from document streams 0, 1, 2, and 3 which correspond to the document&amp;rsquo;s
Title, Body, URL, and Anchor streams. The second line defines an index stream
called &amp;ldquo;NonBody&amp;rdquo; which contains terms from the document&amp;rsquo;s Title, URL and Anchor
streams.&lt;/p&gt;

&lt;p&gt;This StreamConfiguration file will automatically configure the QueryParser
to recognize the &amp;ldquo;Document&amp;rdquo; and &amp;ldquo;NonBody&amp;rdquo; prefixes. Note that the first entry
in the StreamConfiguration file defines the default stream. When a query term does not
have a stream prefix, it uses the default stream. So in this case, the query
&lt;code&gt;dogs&lt;/code&gt; is equivalent to &lt;code&gt;Document:dogs&lt;/code&gt;.&lt;/p&gt;
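&lt;p&gt;The format is simple enough that a small parser can illustrate it. The sketch below is not the BitFunnel implementation (the function name is made up); it only shows how a line like &lt;code&gt;Document: 0,1,2,3&lt;/code&gt; maps an index stream name to its list of document stream ids.&lt;/p&gt;

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch, not the BitFunnel parser: read lines of the form
// "Name: id,id,id" into a map from index stream name to document stream ids.
std::map<std::string, std::vector<uint32_t>>
ParseStreamConfiguration(std::istream& input)
{
    std::map<std::string, std::vector<uint32_t>> config;
    std::string line;
    while (std::getline(input, line))
    {
        auto colon = line.find(':');
        if (colon == std::string::npos)
        {
            continue;   // Skip blank or malformed lines.
        }
        std::string name = line.substr(0, colon);
        std::vector<uint32_t> ids;
        std::stringstream rest(line.substr(colon + 1));
        std::string field;
        while (std::getline(rest, field, ','))
        {
            // std::stoul skips the leading whitespace after ": ".
            ids.push_back(static_cast<uint32_t>(std::stoul(field)));
        }
        config[name] = ids;
    }
    return config;
}
```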
</description>
    </item>
    
    <item>
      <title>Getting started with NativeJIT</title>
      <link>http://bitfunnel.org/getting-started-with-nativejit/</link>
      <pubDate>Thu, 01 Sep 2016 16:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/getting-started-with-nativejit/</guid>
      <description>

&lt;p&gt;&lt;a href=&#34;https://github.com/bitfunnel/nativejit/&#34;&gt;NativeJIT&lt;/a&gt; is a just-in-time compiler
that handles expressions involving C data structures. It was originally
developed in Bing, with the goal of being able to compile search query matching
and search query ranking code in a query-dependent way. The goal was to create a
compiler that can be used in systems with tens of thousands of queries per
second without having compilation take a significant fraction of the query time.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s look at a simple &amp;ldquo;Hello, World&amp;rdquo; and then look at what the API has to offer
us.&lt;/p&gt;

&lt;h2 id=&#34;hello-world&#34;&gt;Hello World&lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;re going to build a function that computes the area of a circle, given its
radius.  If we were to write such a function in C, it would look something like&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;const float PI = 3.14159;

float area(float radius)
{
  return radius * radius * PI;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Building this function in NativeJIT involves three steps: creating a &lt;code&gt;Function&lt;/code&gt;
object which defines the function prototype, building the expression tree which
defines the function body, and finally compiling the function into x64 machine
code.&lt;/p&gt;

&lt;h3 id=&#34;create-the-function-object&#34;&gt;Create the Function Object&lt;/h3&gt;

&lt;p&gt;The Function constructor takes one to five template parameters and exactly two
regular parameters. The template parameters define the function prototype,
while the regular parameters supply resources necessary to compile and run x64
code.&lt;/p&gt;

&lt;h4 id=&#34;template-parameters&#34;&gt;Template Parameters&lt;/h4&gt;

&lt;p&gt;The template parameters define the function prototype for the compiled code.
The first parameter defines the return value type. The remaining template
parameters correspond to the function&amp;rsquo;s parameter types.&lt;/p&gt;

&lt;p&gt;For our example, we&amp;rsquo;re defining a function that takes a single float parameter
for the radius and returns a float area, so our template parameters would be
&lt;code&gt;&amp;lt;float, float&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Function&amp;lt;float, float&amp;gt; expression(allocator, code);
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;allocator&#34;&gt;Allocator&lt;/h4&gt;

&lt;p&gt;The allocator provides the memory where expression nodes will reside. Any class
that implements the &lt;code&gt;IAllocator&lt;/code&gt; interface will do.&lt;/p&gt;

&lt;p&gt;A reasonable default is to use the arena allocator provided in NativeJIT&amp;rsquo;s
&lt;code&gt;Temporary&lt;/code&gt; directory. The arena allocator hands out blocks of memory from a
fixed size buffer. All of this memory can be recycled at once by calling the
allocator&amp;rsquo;s &lt;code&gt;Reset()&lt;/code&gt; method. The advantage of the arena allocation pattern is
that it allows you to quickly dispose of an expression tree after
compilation. The disadvantage is that it requires everything that uses the
allocated memory to be aware that it&amp;rsquo;s using an arena allocator.&lt;/p&gt;
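&lt;p&gt;As a rough sketch of the pattern (not NativeJIT&amp;rsquo;s actual &lt;code&gt;Allocator&lt;/code&gt;), an arena allocator bumps a cursor through a fixed buffer and recycles everything with a single reset:&lt;/p&gt;

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal arena allocator sketch, for illustration only: allocations
// advance a cursor through a fixed buffer; Reset() recycles the whole
// buffer at once. No per-allocation free, no destructor calls.
class Arena
{
public:
    explicit Arena(std::size_t bufferSize)
      : m_buffer(bufferSize), m_used(0)
    {
    }

    void* Allocate(std::size_t bytes)
    {
        // Round up to 8-byte alignment.
        bytes = (bytes + 7) & ~std::size_t(7);
        if (m_used + bytes > m_buffer.size())
        {
            return nullptr;   // Out of arena space.
        }
        void* p = m_buffer.data() + m_used;
        m_used += bytes;
        return p;
    }

    void Reset() { m_used = 0; }

    std::size_t BytesUsed() const { return m_used; }

private:
    std::vector<uint8_t> m_buffer;
    std::size_t m_used;
};
```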

&lt;p&gt;The constructor for allocator takes a single parameter which is
the buffer size in bytes.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Allocator allocator(8192);
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;functionbuffer&#34;&gt;FunctionBuffer&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;FunctionBuffer&lt;/code&gt; provides the executable memory where the compiled code will
reside.  In order to allow code execution, this memory must have &lt;a href=&#34;https://en.wikipedia.org/wiki/Executable_space_protection&#34;&gt;Executable
Space Protection&lt;/a&gt;
disabled.&lt;/p&gt;

&lt;p&gt;NativeJIT provides the &lt;code&gt;ExecutionBuffer&lt;/code&gt; class which is an &lt;code&gt;IAllocator&lt;/code&gt; that
allocates blocks of executable code. Classes starting with &lt;code&gt;I&lt;/code&gt; are interfaces,
so this is saying that &lt;code&gt;ExecutionBuffer&lt;/code&gt; satisfies the &lt;code&gt;IAllocator&lt;/code&gt;
interface. Its constructor takes a single parameter which specifies its buffer
size. Note that the buffer size will typically be rounded up to the operating
system virtual memory page size since the &lt;a href=&#34;https://en.wikipedia.org/wiki/NX_bit&#34;&gt;NX
bit&lt;/a&gt; is applied at the page level.&lt;/p&gt;
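&lt;p&gt;For a concrete feel for that rounding (a standalone sketch, not NativeJIT code), rounding a size up to a power-of-two page size looks like this, assuming a typical 4096-byte page:&lt;/p&gt;

```cpp
#include <cstddef>

// Round size up to the next multiple of pageSize, where pageSize is a
// power of two. This mirrors what effectively happens when the NX bit
// is applied at page granularity. Illustration only.
constexpr std::size_t RoundUpToPage(std::size_t size, std::size_t pageSize)
{
    return (size + pageSize - 1) & ~(pageSize - 1);
}
```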

&lt;p&gt;The &lt;code&gt;FunctionBuffer&lt;/code&gt; constructor takes an ExecutionBuffer and a buffer size.
The buffer size parameter might seem redundant given that
the &lt;code&gt;ExecutionBuffer&lt;/code&gt; takes a buffer size as well. The reason the
&lt;code&gt;FunctionBuffer&lt;/code&gt; constructor takes a buffer size is that multiple
&lt;code&gt;FunctionBuffer&lt;/code&gt;s can share a single &lt;code&gt;ExecutionBuffer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s the code fragment:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ExecutionBuffer codeAllocator(8192);
FunctionBuffer code(codeAllocator, 8192);
Function&amp;lt;float, float&amp;gt; expression(allocator, code);
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;build-the-expression-tree&#34;&gt;Build the Expression Tree&lt;/h3&gt;

&lt;p&gt;The next step is to build the expression tree which defines the function body.
In the expression tree, interior nodes are operators and the leaf nodes are either literals or function parameters.&lt;/p&gt;

&lt;p&gt;The tree is built from the bottom up.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Function&amp;lt;float, float&amp;gt; expression(allocator, code);

const float  PI = 3.14159265358979f;
auto &amp;amp; a = expression.Mul(expression.GetP1(), expression.GetP1());
auto &amp;amp; b = expression.Mul(a, expression.Immediate(PI));
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the code above, &lt;code&gt;expression.GetP1()&lt;/code&gt; is a leaf node corresponding to the first parameter.
Node &lt;code&gt;a&lt;/code&gt; is defined to be the product of the first parameter with itself.&lt;/p&gt;

&lt;p&gt;On the next line, &lt;code&gt;expression.Immediate(PI)&lt;/code&gt; is an immediate value leaf node whose value is equal to &lt;code&gt;PI&lt;/code&gt;.
Node &lt;code&gt;b&lt;/code&gt; is defined to be the product of node &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;PI&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://bitfunnel.org/nativejit/getting-started-with-nativejit/hello-world.png&#34; alt=&#34;expression tree&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Note that each of the node factory methods on &lt;code&gt;Function&lt;/code&gt; is templated by the types of its children.
This is an important safeguard that prevents the construction of a tree with type errors
(e.g. adding a double to a char).&lt;/p&gt;

&lt;h3 id=&#34;compile&#34;&gt;Compile&lt;/h3&gt;

&lt;p&gt;Once the tree is built, it&amp;rsquo;s time to generate &lt;code&gt;x64&lt;/code&gt; machine code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;auto computeArea = expression.Compile(b);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The compiler returns a pointer to the compiled function.
In this example, its type is &lt;code&gt;float (*)(float)&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&#34;run&#34;&gt;Run&lt;/h3&gt;

&lt;p&gt;You call this just like calling a C function!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;auto result = computeArea(radius);
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;putting-it-all-together&#34;&gt;Putting it all together&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;#include &amp;quot;NativeJIT/CodeGen/ExecutionBuffer.h&amp;quot;
#include &amp;quot;NativeJIT/CodeGen/FunctionBuffer.h&amp;quot;
#include &amp;quot;NativeJIT/Function.h&amp;quot;
#include &amp;quot;Temporary/Allocator.h&amp;quot;

#include &amp;lt;iostream&amp;gt;

using NativeJIT::Allocator;
using NativeJIT::ExecutionBuffer;
using NativeJIT::Function;
using NativeJIT::FunctionBuffer;

int main()
{
    ExecutionBuffer codeAllocator(8192);
    Allocator allocator(8192);
    FunctionBuffer code(codeAllocator, 8192);

    const float  PI = 3.14159265358979f;

    Function&amp;lt;float, float&amp;gt; expression(allocator, code);

    auto &amp;amp; a = expression.Mul(expression.GetP1(),
                              expression.GetP1());
    auto &amp;amp; b = expression.Mul(a, expression.Immediate(PI));
    auto function = expression.Compile(b);

    float p1 = 2.0;

    auto expected = PI * p1 * p1;
    auto observed = function(p1);

    std::cout &amp;lt;&amp;lt; expected &amp;lt;&amp;lt; &amp;quot; == &amp;quot; &amp;lt;&amp;lt; observed &amp;lt;&amp;lt; std::endl;

    return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;examining-the-x64-code&#34;&gt;Examining the x64 Code&lt;/h2&gt;

&lt;p&gt;If you&amp;rsquo;re interested in seeing the x64 code,
fire up the debugger and set a breakpoint
on a line after the code has been compiled,
for example the line &lt;code&gt;float p1 = 2.0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then get the value of &lt;code&gt;function&lt;/code&gt;, and switch into
disassembly view, starting at this address.
Here&amp;rsquo;s what you should see on Windows.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;  sub    rsp,8                         ; Standard function prologue.
  mov    qword ptr [rsp],rbp           ; Standard function prologue.
  lea    rbp,[rsp+8]                   ; Standard function prologue.
  mulss  xmm0,xmm0                     ; Multiply radius parameter by itself.
  mulss  xmm0,dword ptr [29E2A580000h] ; Multiply by PI.
  mov    rbp,qword ptr [rsp]           ; Standard function epilogue.
  add    rsp,8                         ; Standard function epilogue.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Linux output may look slightly different because of differences in the &lt;a href=&#34;https://en.wikipedia.org/wiki/Application_binary_interface&#34;&gt;ABI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;On Windows you can single step through the generated code in the debugger.
Because NativeJIT does not implement x64 stack unwinding on Linux and OSX,
you may have trouble single stepping through the generated code, but it will
run correctly.&lt;/p&gt;

&lt;h2 id=&#34;rules-of-the-road&#34;&gt;Rules of the Road&lt;/h2&gt;

&lt;p&gt;For the most part, NativeJIT assumes that the entire expression tree is free of side effects.
The only general purpose node that can cause a side effect is &lt;code&gt;CallNode&lt;/code&gt; which calls out to an external function.
The behavior of the generated code is undefined when calling out to external functions that cause side effects.&lt;/p&gt;

&lt;p&gt;In the current implementation, each node is evaluated exactly once.
This guarantee is an important optimization for common subexpressions,
which are nodes that have multiple parents.&lt;/p&gt;

&lt;p&gt;Common subexpressions often show up when traversing data structures.
In the following example&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;foo[i].bar.baz + foo[i].bar.wat
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;the expression &lt;code&gt;foo[i].bar&lt;/code&gt; is a fairly complicated subexpression
involving multiplication, addition, and a pointer dereference.
Since it is a common subexpression, this work will only be done once.&lt;/p&gt;
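&lt;p&gt;The equivalent hand-written C++ makes the saving concrete. In this sketch (the &lt;code&gt;Foo&lt;/code&gt; and &lt;code&gt;Bar&lt;/code&gt; types are hypothetical), the common subexpression &lt;code&gt;foo[i].bar&lt;/code&gt; is computed once and reused:&lt;/p&gt;

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical structures matching the expression foo[i].bar.baz.
struct Bar
{
    int64_t baz;
    int64_t wat;
};

struct Foo
{
    Bar bar;
};

// The indexing arithmetic and dereference for foo[i].bar happen once,
// analogous to how NativeJIT evaluates a node with multiple parents.
int64_t SumBazWat(const Foo* foo, std::size_t i)
{
    const Bar& bar = foo[i].bar;    // Common subexpression, done once.
    return bar.baz + bar.wat;
}
```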

&lt;p&gt;NativeJIT provides an experimental &lt;code&gt;ConditionalNode&lt;/code&gt; analogous to the
ternary conditional operator in C.
Today the generated code evaluates both the true branch and the false branch,
independent of the value of the conditional expression.
Down the road we intend to rework the code generator to
restrict execution to either the true or the false path.
Today the register allocator makes assumptions about register spills
and temporary allocations in both branches, and these assumptions must
be carried forward through all code that is executed after the first
conditional branch.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s important to take into account whether the code will run locally.
If you&amp;rsquo;re running JIT&amp;rsquo;d code on the machine that compiled it,
it is legal to use the address of a C symbol as a literal value.
If you are JITing on one machine and executing the code on another,
be aware that there is no guarantee that the symbol will be at the
same address on the other machine.&lt;/p&gt;

&lt;p&gt;This scenario typically comes up when attempting to call out to
external functions. The solution is to have the caller pass the
address in as a parameter of the compiled code, instead of relying
on a function address in an &lt;code&gt;ImmediateNode&lt;/code&gt;.&lt;/p&gt;
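&lt;p&gt;In plain C++ the pattern looks like the following sketch (the names here are hypothetical): the callee&amp;rsquo;s address arrives as an ordinary parameter, so nothing machine-specific is baked into the compiled code:&lt;/p&gt;

```cpp
#include <cstdint>

// The external function the generated code needs to call.
int32_t Double(int32_t x)
{
    return 2 * x;
}

typedef int32_t (*Callback)(int32_t);

// Shape of the compiled function: the callee's address is a parameter
// bound by the caller at call time, not an embedded immediate.
int32_t Invoke(Callback callee, int32_t value)
{
    return callee(value);
}
```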

&lt;p&gt;As mentioned earlier, NativeJIT only implements x64 stack
unwinding on Windows. Aside from the debugging impact mentioned
above, the other risk in omitting stack unwinding
is that an exception thrown from a C
function called from NativeJIT code may not be caught
properly on Linux and OSX.&lt;/p&gt;

&lt;p&gt;If you grep for &lt;code&gt;DESIGN NOTE&lt;/code&gt; in the code, you can find explanations of other quirks in NativeJIT.&lt;/p&gt;

&lt;h2 id=&#34;commonly-used-methods&#34;&gt;Commonly used methods&lt;/h2&gt;

&lt;h4 id=&#34;immediates&#34;&gt;Immediates&lt;/h4&gt;

&lt;p&gt;These are simple types (e.g., &lt;code&gt;char&lt;/code&gt; or &lt;code&gt;int&lt;/code&gt;) or pointers to anything. This means that we can have, for example, pointers to structs but we can&amp;rsquo;t have struct literals.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename T&amp;gt; ImmediateNode&amp;lt;T&amp;gt;&amp;amp; Immediate(T value);
&lt;/code&gt;&lt;/pre&gt;

&lt;h5 id=&#34;examples&#34;&gt;Examples&lt;/h5&gt;

&lt;pre&gt;&lt;code&gt;// Immediate.
Function&amp;lt;int64_t&amp;gt; exp1(allocator1, code1);

auto &amp;amp;imm1 = exp1.Immediate(1234ll);
auto fn1 = exp1.Compile(imm1);

assert(1234ll == fn1());
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;unary-operators&#34;&gt;Unary Operators&lt;/h4&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename TO, typename FROM&amp;gt; Node&amp;lt;TO&amp;gt;&amp;amp; Cast(Node&amp;lt;FROM&amp;gt;&amp;amp; value);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Pointer dereference; basically like &lt;code&gt;*&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename T&amp;gt; Node&amp;lt;T&amp;gt;&amp;amp; Deref(Node&amp;lt;T*&amp;gt;&amp;amp; pointer);
template &amp;lt;typename T&amp;gt; Node&amp;lt;T&amp;gt;&amp;amp; Deref(Node&amp;lt;T*&amp;gt;&amp;amp; pointer, int32_t index);
template &amp;lt;typename T&amp;gt; Node&amp;lt;T&amp;gt;&amp;amp; Deref(Node&amp;lt;T&amp;amp;&amp;gt;&amp;amp; reference);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Field dereference; basically like &lt;code&gt;-&amp;gt;&lt;/code&gt;. If you have &lt;code&gt;a&lt;/code&gt;, and apply the &lt;code&gt;b&lt;/code&gt; &lt;code&gt;FieldPointer&lt;/code&gt;, that&amp;rsquo;s equivalent to &lt;code&gt;a-&amp;gt;b&lt;/code&gt;. There&amp;rsquo;s no &lt;code&gt;.&lt;/code&gt; because we don&amp;rsquo;t have structs as value types.&lt;/p&gt;

&lt;p&gt;If you have a reference to an object, you have to convert the reference to a pointer to apply this method. Note that this has no runtime cost.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename OBJECT, typename FIELD, typename OBJECT1 = OBJECT&amp;gt;
Node&amp;lt;FIELD*&amp;gt;&amp;amp; FieldPointer(Node&amp;lt;OBJECT*&amp;gt;&amp;amp; object, FIELD OBJECT1::*field);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because we have some operations that can only be done on pointers (or only done on references), we have &lt;code&gt;AsPointer&lt;/code&gt; and &lt;code&gt;AsReference&lt;/code&gt; to convert between pointer and reference. This is free in terms of actual runtime cost:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename T&amp;gt; Node&amp;lt;T*&amp;gt;&amp;amp; AsPointer(Node&amp;lt;T&amp;amp;&amp;gt;&amp;amp; reference);
template &amp;lt;typename T&amp;gt; Node&amp;lt;T&amp;amp;&amp;gt;&amp;amp; AsReference(Node&amp;lt;T*&amp;gt;&amp;amp; pointer);
&lt;/code&gt;&lt;/pre&gt;

&lt;h5 id=&#34;examples-1&#34;&gt;Examples&lt;/h5&gt;

&lt;pre&gt;&lt;code&gt;// Cast.
Function&amp;lt;int64_t&amp;gt; exp1(allocator1, code1);

auto &amp;amp;cast1 = exp1.Cast&amp;lt;float&amp;gt;(exp1.Immediate(10));
auto fn1 = exp1.Compile(cast1);

assert(float(10) == fn1());


// Access member via -&amp;gt;.
class Foo
{
public:
    uint32_t m_a;
    uint64_t m_b;
};

Function&amp;lt;uint64_t, Foo*&amp;gt; expression(allocator2, code2);

auto &amp;amp; a = expression.GetP1();
auto &amp;amp; b = expression.FieldPointer(a, &amp;amp;Foo::m_b);
auto &amp;amp; c = expression.Deref(b);
auto fn2 = expression.Compile(c);

Foo foo;
foo.m_b = 1234ull;
Foo* p1 = &amp;amp;foo;

assert(p1-&amp;gt;m_b == fn2(p1));
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;binary-operators&#34;&gt;Binary Operators&lt;/h4&gt;

&lt;p&gt;Binary arithmetic ops take either two nodes, or a node and an immediate. Note that although the types are templated as &lt;code&gt;L&lt;/code&gt; and &lt;code&gt;R&lt;/code&gt;, &lt;code&gt;L&lt;/code&gt; and &lt;code&gt;R&lt;/code&gt; should generally be the same for binary ops that take two nodes &amp;ndash; conversions must be made explicit. For &lt;code&gt;Rol&lt;/code&gt;, &lt;code&gt;Shl&lt;/code&gt;, and &lt;code&gt;Shr&lt;/code&gt;, the immediate should be a &lt;code&gt;uint8_t&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Add(Node&amp;lt;L&amp;gt;&amp;amp; left, Node&amp;lt;R&amp;gt;&amp;amp; right);
template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; And(Node&amp;lt;L&amp;gt;&amp;amp; left, Node&amp;lt;R&amp;gt;&amp;amp; right);
template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Mul(Node&amp;lt;L&amp;gt;&amp;amp; left, Node&amp;lt;R&amp;gt;&amp;amp; right);
template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Or(Node&amp;lt;L&amp;gt;&amp;amp; left, Node&amp;lt;R&amp;gt;&amp;amp; right);
template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Sub(Node&amp;lt;L&amp;gt;&amp;amp; left, Node&amp;lt;R&amp;gt;&amp;amp; right);

template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Rol(Node&amp;lt;L&amp;gt;&amp;amp; left, R right);
template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Shl(Node&amp;lt;L&amp;gt;&amp;amp; left, R right);
template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Shr(Node&amp;lt;L&amp;gt;&amp;amp; left, R right);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Like &lt;code&gt;[]&lt;/code&gt;, i.e., takes a pointer and adds an offset:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename T, size_t SIZE, typename INDEX&amp;gt; Node&amp;lt;T*&amp;gt;&amp;amp;
Add(Node&amp;lt;T(*)[SIZE]&amp;gt;&amp;amp; array, Node&amp;lt;INDEX&amp;gt;&amp;amp; index);

template &amp;lt;typename T, typename INDEX&amp;gt; Node&amp;lt;T*&amp;gt;&amp;amp;
Add(Node&amp;lt;T*&amp;gt;&amp;amp; array, Node&amp;lt;INDEX&amp;gt;&amp;amp; index);
&lt;/code&gt;&lt;/pre&gt;

&lt;h5 id=&#34;examples-2&#34;&gt;Examples&lt;/h5&gt;

&lt;pre&gt;&lt;code&gt;// Array dereference with binary operation.

Function&amp;lt;uint64_t, uint64_t*&amp;gt; exp1(allocator1, code1);

auto &amp;amp; idx1 = exp1.Add(exp1.GetP1(),
                       exp1.Immediate&amp;lt;uint64_t&amp;gt;(1ull));
auto &amp;amp; idx2 = exp1.Add(exp1.GetP1(),
                       exp1.Immediate&amp;lt;uint64_t&amp;gt;(2ull));
auto &amp;amp; sum = exp1.Add(exp1.Deref(idx1), exp1.Deref(idx2));
auto fn1 = exp1.Compile(sum);

uint64_t array[10];
array[1] = 1;
array[2] = 128;

uint64_t * p1 = array;

assert(array[1] + array[2] == fn1(p1));
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;compare-conditional&#34;&gt;Compare &amp;amp; Conditional&lt;/h4&gt;

&lt;p&gt;Unlike other nodes, which return a generic &lt;code&gt;T&lt;/code&gt;, compare nodes return a flag,
which can then be passed to a conditional.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;JccType JCC, typename T&amp;gt;
FlagExpressionNode&amp;lt;JCC&amp;gt;&amp;amp; Compare(Node&amp;lt;T&amp;gt;&amp;amp; left, Node&amp;lt;T&amp;gt;&amp;amp; right);

template &amp;lt;JccType JCC, typename T&amp;gt;
Node&amp;lt;T&amp;gt;&amp;amp; Conditional(FlagExpressionNode&amp;lt;JCC&amp;gt;&amp;amp; condition,
                     Node&amp;lt;T&amp;gt;&amp;amp; trueValue,
                     Node&amp;lt;T&amp;gt;&amp;amp; falseValue);

template &amp;lt;typename CONDT, typename T&amp;gt;
Node&amp;lt;T&amp;gt;&amp;amp; IfNotZero(Node&amp;lt;CONDT&amp;gt;&amp;amp; conditionValue,
                   Node&amp;lt;T&amp;gt;&amp;amp; trueValue,
                   Node&amp;lt;T&amp;gt;&amp;amp; falseValue);

template &amp;lt;typename T&amp;gt;
Node&amp;lt;T&amp;gt;&amp;amp; If(Node&amp;lt;bool&amp;gt;&amp;amp; conditionValue,
            Node&amp;lt;T&amp;gt;&amp;amp; thenValue,
            Node&amp;lt;T&amp;gt;&amp;amp; elseValue);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The x86 conditional tests are available; a full list can be found &lt;a href=&#34;http://unixwiz.net/techtips/x86-jumps.html&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h5 id=&#34;example&#34;&gt;Example&lt;/h5&gt;

&lt;pre&gt;&lt;code&gt;// JA (jump if above), i.e. unsigned &amp;quot;&amp;gt;&amp;quot;
Function&amp;lt;uint64_t, uint64_t, uint64_t&amp;gt;
    expression(allocator1, code1);

uint64_t trueValue = 5;
uint64_t falseValue = 6;

auto &amp;amp; a =
  expression.Compare&amp;lt;JccType::JA&amp;gt;(expression.GetP1(), expression.GetP2());
auto &amp;amp; b =
  expression.Conditional(a,
                         expression.Immediate(trueValue),
                         expression.Immediate(falseValue));
auto function = expression.Compile(b);

uint64_t p1 = 3;
uint64_t p2 = 4;

auto expected = (p1 &amp;gt; p2) ? trueValue : falseValue;
auto observed = function(p1, p2);

assert(expected == observed);

p1 = 5;
p2 = 4;

expected = (p1 &amp;gt; p2) ? trueValue : falseValue;
observed = function(p1, p2);

assert(expected == observed);
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;call&#34;&gt;Call&lt;/h4&gt;

&lt;p&gt;Calls a C function.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename R&amp;gt;
Node&amp;lt;R&amp;gt;&amp;amp; Call(Node&amp;lt;R (*)()&amp;gt;&amp;amp; function);

template &amp;lt;typename R, typename P1&amp;gt;
Node&amp;lt;R&amp;gt;&amp;amp; Call(Node&amp;lt;R (*)(P1)&amp;gt;&amp;amp; function,
              Node&amp;lt;P1&amp;gt;&amp;amp; param1);

template &amp;lt;typename R, typename P1, typename P2&amp;gt;
Node&amp;lt;R&amp;gt;&amp;amp; Call(Node&amp;lt;R (*)(P1, P2)&amp;gt;&amp;amp; function,
              Node&amp;lt;P1&amp;gt;&amp;amp; param1,
              Node&amp;lt;P2&amp;gt;&amp;amp; param2);

template &amp;lt;typename R, typename P1, typename P2, typename P3&amp;gt;
Node&amp;lt;R&amp;gt;&amp;amp; Call(Node&amp;lt;R (*)(P1, P2, P3)&amp;gt;&amp;amp; function,
              Node&amp;lt;P1&amp;gt;&amp;amp; param1,
              Node&amp;lt;P2&amp;gt;&amp;amp; param2,
              Node&amp;lt;P3&amp;gt;&amp;amp; param3);

template &amp;lt;typename R, typename P1, typename P2, typename P3, typename P4&amp;gt;
Node&amp;lt;R&amp;gt;&amp;amp; Call(Node&amp;lt;R (*)(P1, P2, P3, P4)&amp;gt;&amp;amp; function,
              Node&amp;lt;P1&amp;gt;&amp;amp; param1,
              Node&amp;lt;P2&amp;gt;&amp;amp; param2,
              Node&amp;lt;P3&amp;gt;&amp;amp; param3,
              Node&amp;lt;P4&amp;gt;&amp;amp; param4);
&lt;/code&gt;&lt;/pre&gt;

&lt;h5 id=&#34;examples-3&#34;&gt;Examples&lt;/h5&gt;

&lt;pre&gt;&lt;code&gt;// Call SampleFunction.
int SampleFunction(int p1, int p2)
{
    return p1 + p2;
}

Function&amp;lt;int, int, int&amp;gt; exp1(allocator1, code1);

typedef int (*F)(int, int);

auto &amp;amp;imm1 = exp1.Immediate&amp;lt;F&amp;gt;(SampleFunction);
auto &amp;amp;call1 = exp1.Call(imm1, exp1.GetP1(), exp1.GetP2());
auto fn1 = exp1.Compile(call1);

assert(10+35 == fn1(10, 35));
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;rarely-used-methods&#34;&gt;Rarely used methods&lt;/h2&gt;

&lt;h4 id=&#34;unary-methods&#34;&gt;Unary methods&lt;/h4&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename FROM&amp;gt; Node&amp;lt;FROM const&amp;gt;&amp;amp; AddConstCast(Node&amp;lt;FROM&amp;gt;&amp;amp; value);

template &amp;lt;typename FROM&amp;gt; Node&amp;lt;FROM&amp;gt;&amp;amp;
  RemoveConstCast(Node&amp;lt;FROM const&amp;gt;&amp;amp; value);

template &amp;lt;typename FROM&amp;gt; Node&amp;lt;FROM&amp;amp;&amp;gt;&amp;amp;
  RemoveConstCast(Node&amp;lt;FROM const &amp;amp;&amp;gt;&amp;amp; value);

template &amp;lt;typename FROM&amp;gt; Node&amp;lt;FROM const *&amp;gt;&amp;amp;
  AddTargetConstCast(Node&amp;lt;FROM*&amp;gt;&amp;amp; value);

template &amp;lt;typename FROM&amp;gt; Node&amp;lt;FROM*&amp;gt;&amp;amp;
  RemoveTargetConstCast(Node&amp;lt;FROM const *&amp;gt;&amp;amp; value);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These sound really weird, but they are useful in some obscure situations.&lt;/p&gt;

&lt;h4 id=&#34;binary-methods&#34;&gt;Binary methods&lt;/h4&gt;

&lt;pre&gt;&lt;code&gt;Node&amp;lt;T&amp;gt;&amp;amp; Shld(Node&amp;lt;T&amp;gt;&amp;amp; shiftee, Node&amp;lt;T&amp;gt;&amp;amp; filler, uint8_t bitCount);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is used for packed types (i.e., bitfields that get packed into 64-bits) to extract a bitfield.&lt;/p&gt;
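&lt;p&gt;As a rough model of the underlying x64 &lt;code&gt;shld&lt;/code&gt; instruction (a sketch, not the NativeJIT implementation): the shiftee is shifted left, and the vacated low bits are filled from the high bits of the filler operand:&lt;/p&gt;

```cpp
#include <cstdint>

// Approximate semantics of x64 SHLD on 64-bit operands: shift shiftee
// left by bitCount and fill the vacated low bits with the high bits of
// filler. Assumes 0 < bitCount < 64. Illustration only.
uint64_t ShldModel(uint64_t shiftee, uint64_t filler, uint8_t bitCount)
{
    return (shiftee << bitCount) | (filler >> (64 - bitCount));
}
```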
</description>
    </item>
    
    <item>
      <title>Index Build Tools</title>
      <link>http://bitfunnel.org/index-build-tools/</link>
      <pubDate>Tue, 30 Aug 2016 15:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/index-build-tools/</guid>
      <description>

&lt;div&gt;
&lt;span style=&#34;background-color:lightgray;color:red;font-size:12&#34;&gt;
    &lt;b&gt;NOTE:&lt;/b&gt; This page was updated on 9/19/16 to reflect significant changes in the index build tools.
&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;After many months of hard work,
we kind of, sort of have a document ingestion pipeline that seems to work.
By this I mean we have a minimal set of configuration and ingestion tools
that we can compile and then run without crashing,
and these tools seem to ingest files mostly as expected.
We&amp;rsquo;re still going to need to do a lot of testing, tuning and evaluation,
but I thought it would be helpful to take this time to walk through the process
of bringing up an index from a set of chunk files extracted from
Wikipedia.&lt;/p&gt;

&lt;p&gt;The remainder of this post is a fairly long step-by-step description of the process
I used to configure and start a &lt;a href=&#34;https://github.com/bitfunnel/bitfunnel/&#34;&gt;BitFunnel&lt;/a&gt; index.
It&amp;rsquo;s pretty dry, but should be useful for people who want to play around with the system.&lt;/p&gt;

&lt;h2 id=&#34;obtaining-a-sample-corpus&#34;&gt;Obtaining a Sample Corpus&lt;/h2&gt;

&lt;p&gt;I decided to use a
&lt;a href=&#34;https://dumps.wikimedia.org/enwiki/20160305/enwiki-20160305-pages-articles1.xml-p000000010p000030302.bz2&#34;&gt;small portion of the English Wikipedia&lt;/a&gt;
as a basis for this walkthrough. This piece contains about 17k articles.
I processed it into a collection of &lt;a href=&#34;http://bitfunnel.org/corpus-file-format&#34;&gt;BitFunnel chunk files&lt;/a&gt; which
you can download from our &lt;a href=&#34;https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz&#34;&gt;Azure blob storage&lt;/a&gt;.
(see the post entitled &lt;a href=&#34;http://bitfunnel.org/sample-data&#34;&gt;Sample Data&lt;/a&gt; for instructions on downloading the pre-built chunk files).&lt;/p&gt;

&lt;p&gt;The chunk files were built from a Wikipedia dump using the process
outlined in the &lt;a href=&#34;https://github.com/BitFunnel/Workbench/blob/master/README.md&#34;&gt;WorkBench README&lt;/a&gt;.
If you would like to build your own chunks from scratch, download
&lt;a href=&#34;https://dumps.wikimedia.org/enwiki/20160305/enwiki-20160305-pages-articles1.xml-p000000010p000030302.bz2&#34;&gt;the dump&lt;/a&gt;
from the
&lt;a href=&#34;https://dumps.wikimedia.org/enwiki/20160305/&#34;&gt;Wikipedia dump page&lt;/a&gt; or grab an archived copy from our
&lt;a href=&#34;https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/enwiki-20160305-pages-articles1.xml-p000000010p000030302.bz2&#34;&gt;Azure blob storage&lt;/a&gt;.
Either way, the file must be decompressed before it can be used.&lt;/p&gt;

&lt;p&gt;The Wikipedia dump file is converted to chunks using a two-step process.
The first step uses an open source project called &lt;a href=&#34;https://github.com/attardi/wikiextractor&#34;&gt;wikiextractor&lt;/a&gt;
to filter out Wikipedia markup.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/sample-data/wikiextractor.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;The output of wikiextractor is a set of 1MB XML files, with names like wiki_00, wiki_01, wiki_02, etc.,
organized under directories AA, AB, AC, etc.&lt;/p&gt;

&lt;p&gt;The second step uses the Java-based &lt;a href=&#34;https://github.com/BitFunnel/Workbench&#34;&gt;Workbench project&lt;/a&gt; to
perform word-breaking, stemming, and stop-word elimination. The output of the Workbench stage
is a set of BitFunnel chunk files.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/sample-data/workbench.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;h2 id=&#34;gathering-corpus-statistics&#34;&gt;Gathering Corpus Statistics&lt;/h2&gt;

&lt;p&gt;Because BitFunnel is a probabilistic algorithm based on Bloom Filters,
its configuration depends on statistical properties of the corpus,
like the distributions of term frequencies and document lengths.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/sample-data/statistics.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;The &lt;code&gt;BitFunnel statistics&lt;/code&gt; command generates these statistics from a representative corpus.
Run &lt;code&gt;BitFunnel statistics -help&lt;/code&gt; to print out a help message
describing the command line arguments.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% BitFunnel statistics -help
StatisticsBuilder
Ingest documents and compute statistics about them.

Usage:
BitFunnel statistics &amp;lt;manifestFile&amp;gt;
                     &amp;lt;outDir&amp;gt;
                     [-help]
                     [-text]
                     [-gramsize &amp;lt;integer&amp;gt;]

&amp;lt;manifestFile&amp;gt;
    Path to a file containing the paths to the chunk
    files to be ingested. One chunk file per line.
    Paths are relative to working directory. (string)

&amp;lt;outDir&amp;gt;
    Path to the output directory where files will
    be written.  (string)

[-help]
    Display help for this program. (boolean, defaults to false)


[-text]
    Create mapping from Term::Hash to term text. (boolean, defaults to false)


[-gramsize &amp;lt;integer&amp;gt;]
    Set the maximum ngram size for phrases. (integer,
    defaults to 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first parameter is a manifest file that lists the paths to the chunk
files, one file per line. You can generate such a manifest with the Linux &lt;code&gt;find&lt;/code&gt;
command. Here&amp;rsquo;s an example that creates a manifest for all of the prebuilt chunks that were
downloaded to &lt;code&gt;/tmp/wikipedia/chunks&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% find /tmp/wikipedia/chunks -type f &amp;gt; /tmp/wikipedia/manifest.txt
&lt;/code&gt;&lt;/pre&gt;
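
&lt;p&gt;Note that a bare &lt;code&gt;find&lt;/code&gt; picks up every file under the directory;
the ingestion log below shows a stray &lt;code&gt;.DS_Store&lt;/code&gt; (macOS Finder metadata)
being ingested as though it were a chunk. One way to avoid that is to filter
on the chunk file naming pattern:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# Only the wiki_NN chunk files make it into the manifest.
find /tmp/wikipedia/chunks -type f -name &#39;wiki_*&#39; &amp;gt; /tmp/wikipedia/manifest.txt
&lt;/code&gt;&lt;/pre&gt;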

&lt;p&gt;The second parameter to &lt;code&gt;BitFunnel statistics&lt;/code&gt;
is the output directory. In this case I used &lt;code&gt;/tmp/wikipedia/config&lt;/code&gt;
(note that the prebuilt chunk tarball includes a &lt;code&gt;wikipedia/config&lt;/code&gt; directory
with the output of this walkthrough). Be sure to create the output directory
if it doesn&amp;rsquo;t already exist:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% mkdir /tmp/wikipedia/config
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When I ran &lt;code&gt;BitFunnel statistics&lt;/code&gt;, I omitted the -gramsize parameter
because I wanted statistics for a corpus of unigrams.
Had I included -gramsize, I could have generated statistics for
a corpus that included bigrams or trigrams or larger
&lt;a href=&#34;https://en.wikipedia.org/wiki/N-gram&#34;&gt;ngrams&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I included the -text parameter because I wanted the document frequency
table to be annotated with the text of each term. Had I not included
-text, the terms would be represented solely by their hash values.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s the console log:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% BitFunnel statistics /tmp/wikipedia/manifest.txt  /tmp/wikipedia/config
Blocksize: 3592
Loading chunk list file &#39;/tmp/wikipedia/manifest.txt&#39;
Temp dir: &#39;/tmp/wikipedia/config&#39;
Reading 259 files
Ingesting . . .
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/.DS_Store
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_00
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_01
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_02
...
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AC/wiki_56
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AC/wiki_57
Ingestion complete.
  Ingestion time = 11.5985
  Ingestion rate (bytes/s): 1.78954e+07
Shard count:1
Document count: 17618
Bytes/Document: 11781.1
Total bytes read: 207559890
Posting count: 12848420
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The statistics files were written to &lt;code&gt;/tmp/wikipedia/config&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ls -l /tmp/wikipedia/config
total 100384
-rw-r--r--  1 mhop  wheel    216796 Sep 18 16:07 CumulativeTermCounts-0.csv
-rw-r--r--  1 mhop  wheel  21516222 Sep 18 16:07 DocFreqTable-0.csv
-rw-r--r--  1 mhop  wheel     18265 Sep 18 16:07 DocumentLengthHistogram.csv
-rw-r--r--  1 mhop  wheel   8202248 Sep 18 16:07 IndexedIdfTable-0.bin
-rw-r--r--  1 mhop  wheel  13463535 Sep 18 16:07 TermToText.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;CumulativeTermCounts-0.csv&lt;/code&gt; tracks the number of unique terms encountered
as a function of the number of documents ingested. It is not currently used
but will be needed for accurate models of memory consumption.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DocumentLengthHistogram.csv&lt;/code&gt; is a histogram of documents organized by
the number of unique terms in each document.
It is not currently used, but will be needed to determine
how to organize documents into shards according to posting count.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DocFreqTable-0.csv&lt;/code&gt; lists the unique terms in the corpus in descending
frequency order. In other words, more common words appear before less common
words. Here&amp;rsquo;s what the file looks like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;% more /tmp/wikipedia/config/DocFreqTable-0.csv
hash,gramSize,streamId,frequency,text
3f0ffc72a21fd2be,1,1,0.840447,from
b3697479c07d98d5,1,1,0.808378,which
4d34895e97b1888c,1,1,0.795266,also
17e90965afd3104d,1,1,0.763764,have
d14e34f5833aecee,1,1,0.755875,one
5d4e8d01c132cf18,1,1,0.745147,other
...
8f2e873ae4281b44,1,1,5.67601e-05,arslān
d859a38c4ac69616,1,1,5.67601e-05,influxu
c307a841264a8d2d,1,1,5.67601e-05,köşk
e19c09157d44124,1,1,5.67601e-05,tharil
3885b5929dd1a8dd,1,1,5.67601e-05,www.routledge.com
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see, the term &amp;ldquo;from&amp;rdquo; is the most common,
appearing in about 84% of documents. Towards the end
of the file, &amp;ldquo;arslān&amp;rdquo; is one of the rarest, appearing
in 0.006% of documents, or once in the entire corpus of 17618 documents.&lt;/p&gt;
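
&lt;p&gt;The &lt;code&gt;frequency&lt;/code&gt; column is the fraction of documents containing the term,
so multiplying by the document count recovers absolute document counts. Here&amp;rsquo;s a quick
&lt;code&gt;awk&lt;/code&gt; sanity check using the numbers from the table above:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# frequency * (total documents) recovers the number of documents
# containing each term.
awk &#39;BEGIN { printf &amp;quot;%.0f %.0f\n&amp;quot;, 0.840447 * 17618, 5.67601e-05 * 17618 }&#39;
# prints: 14807 1
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So &amp;ldquo;from&amp;rdquo; appears in roughly 14807 of the 17618 documents, while
&amp;ldquo;arslān&amp;rdquo; appears in exactly one.&lt;/p&gt;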

&lt;p&gt;&lt;code&gt;IndexedIdfTable-0.bin&lt;/code&gt; is a binary file containing the IDF value for each term.
It is used for constructing Terms during document ingestion and query
formulation.&lt;/p&gt;

&lt;p&gt;Finally, &lt;code&gt;TermToText.bin&lt;/code&gt; is a binary file containing a mapping from a term&amp;rsquo;s
hash value to its text representation. It is used for debugging and diagnostics.&lt;/p&gt;

&lt;p&gt;These files will be used in the next stage where we build the &lt;code&gt;TermTable&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&#34;building-a-termtable&#34;&gt;Building a TermTable&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;TermTable&lt;/code&gt; is one of the most important data structures in BitFunnel.
It maps each term to the exact set of rows used to indicate the term&amp;rsquo;s
presence in a document.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/sample-data/termtable.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;I won&amp;rsquo;t get into how the TermTable builder works at this point, but let&amp;rsquo;s look
at how to run it. Type &lt;code&gt;BitFunnel termtable -help&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;BitFunnel termtable -help
TermTableBuilderTool
Generate a TermTable from a DocumentFrequencyTable.

Usage:
BitFunnel termtable &amp;lt;tempPath&amp;gt;
                    [-help]

&amp;lt;tempPath&amp;gt;
    Path to a tmp directory. Something like /tmp/ or c:\temp\,
    depending on platform.. (string)

[-help]
    Display help for this program. (boolean, defaults to false)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this case the help message could be a bit better.
All you need to know is that &lt;code&gt;BitFunnel termtable&lt;/code&gt; has a single argument
and this is the output directory from the &lt;code&gt;BitFunnel statistics&lt;/code&gt; stage.
The TermTable builder will read &lt;code&gt;DocFreqTable-0.csv&lt;/code&gt;, construct a very basic
&lt;code&gt;TermTable&lt;/code&gt;, and write it to &lt;code&gt;TermTable-0.bin&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For now, the algorithm creates a &lt;code&gt;TermTable&lt;/code&gt; for unigrams
that uses a combination of rank 0 and rank 3 rows. The algorithm
is naive and doesn&amp;rsquo;t handle higher order ngrams.
Down the road, we will improve the algorithm and add more options.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s the output from my run:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;BitFunnel termtable /tmp/wikipedia/config
Loading files for TermTable build.
Starting TermTable build.
Total build time: 0.439845 seconds.
===================================
RowAssigner for rank 0
  Terms
    Total: 512640
    Adhoc: 468884
    Explicit: 42357
    Private: 1399

  Rows
    Total: 5814
    Adhoc: 590
    Explicit: 3822
    Private: 1399

  Bytes per document: 778.75


  Densities in explicit shared rows
    Mean: 0.0999029
    Min: 0.0980813
    Max: 0.0999549
    Variance: 3.00201e-08

===================================
RowAssigner for rank 1
  No terms

===================================
RowAssigner for rank 2
  No terms

===================================
RowAssigner for rank 3
  Terms
    Total: 511026
    Adhoc: 416436
    Explicit: 90958
    Private: 3632

  Rows
    Total: 36107
    Adhoc: 1827
    Explicit: 30648
    Private: 3632

  Bytes per document: 564.172


  Densities in explicit shared rows
    Mean: 0.099836
    Min: 0.0124819
    Max: 0.0999997
    Variance: 5.68987e-07

===================================
RowAssigner for rank 4
  No terms

===================================
RowAssigner for rank 5
  No terms

===================================
RowAssigner for rank 6
  No terms

===================================
RowAssigner for rank 7
  No terms

Writing TermTable files.
Done.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;From the output above, we can see that this index will consume roughly 779 bytes
of rank 0 rows and 564 bytes of rank 3 rows per document ingested. This corpus
has 17618 documents, so the total memory consumption for rows should be about 22.6Mb.
This is a significant reduction from the 198Mb of chunk files, but at this
point we can draw no conclusions because we have no idea whether this naive
configuration has an acceptable false positive rate.&lt;/p&gt;
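
&lt;p&gt;That estimate is easy to reproduce with a little &lt;code&gt;awk&lt;/code&gt; arithmetic
(treating Mb as 2^20 bytes, the same convention as the 198Mb chunk-file figure),
which lands right around the figure above:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# (rank 0 bytes/doc + rank 3 bytes/doc) * document count, in Mb.
awk &#39;BEGIN { printf &amp;quot;%.1f Mb\n&amp;quot;, (778.75 + 564.172) * 17618 / (1024 * 1024) }&#39;
# prints: 22.6 Mb
&lt;/code&gt;&lt;/pre&gt;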

&lt;p&gt;If we look in &lt;code&gt;/tmp/wikipedia/config&lt;/code&gt; we will see the TermTable is stored in
a new binary file called &lt;code&gt;TermTable-0.bin&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;ls -l /tmp/wikipedia/config
total 100384
-rw-r--r--  1 mhop  wheel    216796 Sep 18 16:07 CumulativeTermCounts-0.csv
-rw-r--r--  1 mhop  wheel  21516222 Sep 18 16:07 DocFreqTable-0.csv
-rw-r--r--  1 mhop  wheel     18265 Sep 18 16:07 DocumentLengthHistogram.csv
-rw-r--r--  1 mhop  wheel   8202248 Sep 18 16:07 IndexedIdfTable-0.bin
-rw-r--r--  1 mhop  wheel   7970980 Sep 18 16:09 TermTable-0.bin
-rw-r--r--  1 mhop  wheel  13463535 Sep 18 16:07 TermToText.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We&amp;rsquo;ve now configured our system for a typical corpus with documents similar to
those in the chunk files listed in the original manifest. In the next step
we will ingest some files and look at the resulting row table values.&lt;/p&gt;

&lt;h2 id=&#34;ingesting-a-small-corpus&#34;&gt;Ingesting a Small Corpus&lt;/h2&gt;

&lt;p&gt;Now we&amp;rsquo;re getting to the fun part. &lt;code&gt;BitFunnel repl&lt;/code&gt; is a sample application that
provides an interactive Read-Eval-Print loop for ingesting documents, running
queries, and inspecting various data structures.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/sample-data/repl.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;Here&amp;rsquo;s the help message:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;BitFunnel repl -help
StatisticsBuilder
Ingest documents and compute statistics about them.

Usage:
BitFunnel repl &amp;lt;path&amp;gt;
               [-help]
               [-gramsize &amp;lt;integer&amp;gt;]
               [-threads &amp;lt;integer&amp;gt;]

&amp;lt;path&amp;gt;
    Path to a tmp directory. Something like /tmp/ or c:\temp\,
    depending on platform.. (string)

[-help]
    Display help for this program. (boolean, defaults to false)


[-gramsize &amp;lt;integer&amp;gt;]
    Set the maximum ngram size for phrases. (integer, defaults
    to 1)

[-threads &amp;lt;integer&amp;gt;]
    Set the thread count for ingestion and query processing.
    (integer, defaults to 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first parameter is the path to the directory with the configuration
files. In my case this is &lt;code&gt;/tmp/wikipedia/config&lt;/code&gt;. The gramsize should be the same
value used in the &lt;code&gt;BitFunnel statistics&lt;/code&gt; stage.&lt;/p&gt;

&lt;p&gt;When you start the application, it prints out a welcome message, explains
how to get help, and then prompts for input. The prompt is an integer followed
by a colon. Type &amp;ldquo;help&amp;rdquo; to get a list of commands. You can also get help on
a specific command:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;BitFunnel repl /tmp/wikipedia/config
Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.

directory = &amp;quot;/tmp/wikipedia/config&amp;quot;
gram size = 1

Starting index ...
Blocksize: 11005320
Index started successfully.

Type &amp;quot;help&amp;quot; to get started.

0: help
Available commands:
  cache   Ingests documents into the index and also stores them in a cache
for query verification purposes.
  delay   Prints a message after certain number of seconds
  help    Displays a list of available commands.
  load    Ingests documents into the index
  query   Process a single query or list of queries. (TODO)
  quit    waits for all current tasks to complete then exits.
  script  Runs commands from a file.(TODO)
  show    Shows information about various data structures. (TODO)
  status  Prints system status.
  verify  Verifies the results of a single query against the document cache.

Type &amp;quot;help &amp;lt;command&amp;gt;&amp;quot; for more information on a particular command.

1: help cache
cache (manifest | chunk) &amp;lt;path&amp;gt;
  Ingests a single chunk file or a list of chunk
  files specified by a manifest.
  Also caches IDocuments for query verification.

2: help show
show cache &amp;lt;term&amp;gt;
   | rows &amp;lt;term&amp;gt; [&amp;lt;docstart&amp;gt; &amp;lt;docend&amp;gt;]
   | term &amp;lt;term&amp;gt;
  Shows information about various data structures.  PARTIALLY IMPLEMENTED
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Right now the &lt;code&gt;cache&lt;/code&gt; command doesn&amp;rsquo;t support the manifest option
and the &lt;code&gt;show&lt;/code&gt; command only supports the &lt;code&gt;rows&lt;/code&gt; and &lt;code&gt;term&lt;/code&gt; options.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s ingest a single chunk file. We&amp;rsquo;ll use &lt;code&gt;/tmp/wikipedia/chunks/AA/wiki_00&lt;/code&gt;
which contains the following 41 documents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;12: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=12&#34;&gt;Anarchism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;25: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=25&#34;&gt;Autism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;39: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=39&#34;&gt;Albedo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;128: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=128&#34;&gt;Talk:Atlas Shrugged&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;290: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=290&#34;&gt;A&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;295: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=295&#34;&gt;User:AnonymousCoward&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;303: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=303&#34;&gt;Alabama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;305: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=305&#34;&gt;Achilles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;307: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=307&#34;&gt;Abraham Lincoln&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;308: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=308&#34;&gt;Aristotle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;309: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=309&#34;&gt;An American in Paris&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;316: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=316&#34;&gt;Academy Award for Best Production Design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;324: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=324&#34;&gt;Academy Awards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;330: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=330&#34;&gt;Actrius&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;332: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=332&#34;&gt;Animalia (book)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;334: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=334&#34;&gt;International Atomic Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;336: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=336&#34;&gt;Altruism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;339: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=339&#34;&gt;Ayn Rand&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;340: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=340&#34;&gt;Alain Connes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;344: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=344&#34;&gt;Allan Dwan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;354: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=354&#34;&gt;Talk:Algeria&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;358: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=358&#34;&gt;Algeria&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;359: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=359&#34;&gt;List of Atlas Shrugged characters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;569: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=569&#34;&gt;Anthropology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;572: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=572&#34;&gt;Agricultural science&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;573: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=573&#34;&gt;Alchemy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;579: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=579&#34;&gt;Alien&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;580: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=580&#34;&gt;Astronomer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;582: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=582&#34;&gt;Talk:Altruism/Archive 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;586: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=586&#34;&gt;ASCII&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;590: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=590&#34;&gt;Austin (disambiguation)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;593: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=593&#34;&gt;Animation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;594: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=594&#34;&gt;Apollo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;595: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=595&#34;&gt;Andre Agassi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;597: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=597&#34;&gt;Austroasiatic languages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;599: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=599&#34;&gt;Afroasiatic languages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;600: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=600&#34;&gt;Andorra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;612: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=612&#34;&gt;Arithmetic mean&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;615: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=615&#34;&gt;American Football Conference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;620: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=620&#34;&gt;Animal Farm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;621: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=621&#34;&gt;Amphibian&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here&amp;rsquo;s the chunk loading in the REPL console:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;3: cache chunk /tmp/wikipedia/chunks/AA/wiki_00
Ingesting chunk file &amp;quot;/tmp/wikipedia/chunks/AA/wiki_00&amp;quot;
Caching IDocuments for query verification.
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_00
Ingestion complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can now use the &lt;code&gt;show rows&lt;/code&gt; command to display the rows
associated with a particular term.
This command lists each of the &lt;code&gt;RowIds&lt;/code&gt; associated with a term,
followed by the bits for the first 64 documents.
The document ids are printed vertically
above each column of bits.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;4: show rows also
Term(&amp;quot;also&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(0, 1005 ): 11111111111011111111111101011111111111011
5: show rows some
Term(&amp;quot;some&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(0, 1011 ): 11111011111011001101110101001101111111011
6: show rows wings
Term(&amp;quot;wings&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(3, 6668 ): 00000001000000000000000000000000000000000
  RowId(3, 6669 ): 00000001000000000000000000000000000000000
  RowId(3, 6670 ): 00000001000000000000000000000000000000000
  RowId(0, 4498 ): 00000001000000000000110000000100000000000
7: show rows anarchy
Term(&amp;quot;anarchy&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(3, 11507): 10000000110000001000010000000000000000000
  RowId(3, 11508): 10000000110000001000010000000000000000000
  RowId(3, 11509): 10000000110000001000010000000000000000000
  RowId(0, 5354 ): 11000001110000000000010001001000000000010
8: show rows kingdom
Term(&amp;quot;kingdom&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(0, 1609 ): 11000010010000100000000000000001000010001
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At prompt 4, the command &lt;code&gt;show rows also&lt;/code&gt; returns a single &lt;code&gt;RowId(0, 1005)&lt;/code&gt;
corresponding to the term &amp;ldquo;also&amp;rdquo;. This is a rank 0 row in position 1005.
The fact that the word &amp;ldquo;also&amp;rdquo; is associated with a single row indicates
that the term is fairly common in the corpus.
This is consistent with the pattern of 1 bits which show that the term
appears in every document except 316, 572, 579, and 615.&lt;/p&gt;

&lt;p&gt;At prompt 5, we see that the word &amp;ldquo;some&amp;rdquo; is also common in the corpus.
It appears in every document except 295, 316, 332, 334, 359, 572, 579, 580, and 615.&lt;/p&gt;

&lt;p&gt;Both &amp;ldquo;also&amp;rdquo; and &amp;ldquo;some&amp;rdquo; appear so frequently that they are assigned private rows.&lt;/p&gt;

&lt;p&gt;The term &amp;ldquo;wings&amp;rdquo;, seen at prompt 6, is less common. It is actually rare enough to require
four row intersections to drive the noise to a tolerable level. If we look
at the intersection of the four rows, we see that only document 305 contains
the term. This is the only column that consists solely of 1s. All of the
other columns have some 0s.&lt;/p&gt;

&lt;p&gt;At prompt 7 we see that anarchy is also rare enough to require four rows,
but it seems to appear in documents 12, 307, 308, and 358. A quick search of the
actual web pages shows that &amp;ldquo;anarchy&amp;rdquo; appears in document 12 which is about
&lt;a href=&#34;https://en.wikipedia.org/wiki?curid=12&#34;&gt;Anarchism&lt;/a&gt; and document 307 which is
about &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=307&#34;&gt;Abraham Lincoln&lt;/a&gt;. Pages 308 and
358 do not actually contain the term, so we are seeing a case where BitFunnel
would report false positives.&lt;/p&gt;
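
&lt;p&gt;The matcher&amp;rsquo;s core operation is a bitwise AND across a term&amp;rsquo;s rows.
As a sketch (not BitFunnel code, just the row bits from prompt 7 fed through
&lt;code&gt;awk&lt;/code&gt;), we can reproduce the intersection:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# Bitwise AND of the four &amp;quot;anarchy&amp;quot; rows printed at prompt 7.
awk &#39;BEGIN {
  r[1] = &amp;quot;10000000110000001000010000000000000000000&amp;quot;
  r[2] = r[1]; r[3] = r[1]
  r[4] = &amp;quot;11000001110000000000010001001000000000010&amp;quot;
  for (i = 1; i &amp;lt;= length(r[1]); i++) {
    bit = 1
    for (j = 1; j &amp;lt;= 4; j++)
      if (substr(r[j], i, 1) == &amp;quot;0&amp;quot;) bit = 0
    out = out bit
  }
  print out
}&#39;
# prints: 10000000110000000000010000000000000000000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The surviving columns correspond to documents 12, 307, 308, and 358: the two
true matches plus the two false positives.&lt;/p&gt;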

&lt;p&gt;Now let&amp;rsquo;s look at the &lt;code&gt;verify one&lt;/code&gt; command. Today this command
runs a very slow verification query engine on the IDocuments cached earlier by the &lt;code&gt;cache chunk&lt;/code&gt;
command. In the future, &lt;code&gt;verify&lt;/code&gt; will run the BitFunnel query engine and compare
its output with the verification query engine.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;9: verify one wings
Processing query &amp;quot; wings&amp;quot;
  DocId(305)
1 match(es) out of 41 documents.
10: verify one anarchy
Processing query &amp;quot; anarchy&amp;quot;
  DocId(307)
  DocId(12)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Prompt 9 shows that only document 305 contains the term &amp;ldquo;wings&amp;rdquo;. Prompt 10 reports
that only documents 12 and 307 contain the term &amp;ldquo;anarchy&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;Note that &lt;code&gt;verify one&lt;/code&gt; accepts any &lt;a href=&#34;http://bitfunnel.org/a-small-query-language&#34;&gt;legal BitFunnel query&lt;/a&gt;.
Here&amp;rsquo;s an example:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;11: verify one -some (anarchy | kingdom)
Processing query &amp;quot; -some (anarchy | kingdom)&amp;quot;
  DocId(332)
1 match(es) out of 41 documents.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Well that&amp;rsquo;s enough for now. Hopefully this walkthrough will help you get
started with BitFunnel configuration.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>