<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>BitFunnel</title>
    <link>http://bitfunnel.org/index.xml</link>
    <description>Recent content on BitFunnel</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 27 Oct 2016 15:51:23 -0600</lastBuildDate>
    <atom:link href="http://bitfunnel.org/index.xml" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Debugging Bit Densities</title>
      <link>http://bitfunnel.org/debugging-bit-densities/</link>
      <pubDate>Thu, 27 Oct 2016 15:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/debugging-bit-densities/</guid>
      <description>

&lt;p&gt;Things are starting to get exciting in the Land of BitFunnel.
We&amp;rsquo;re now at the point where we can ingest a significant fraction of Wikipedia
and run millions of queries, all without crashing &amp;ndash; and we have a great set
of analysis tools.&lt;/p&gt;

&lt;p&gt;You might think that the end is in sight, but actually this is only the beginning.
We&amp;rsquo;re now in what I like to call a &amp;ldquo;target rich environment&amp;rdquo; for bugs.
Today there are so many bugs, data cleaning issues, and configuration errors
that the only reliable statement we can make about the system is that it has bugs.&lt;/p&gt;

&lt;p&gt;Having lots of bugs might sound scary or frustrating, but this is actually one of the
most interesting stages of the project &amp;ndash; when we bring everything together for the first time.
I&amp;rsquo;ll make an analogy to the &lt;a href=&#34;https://en.wikipedia.org/wiki/Boeing_787_Dreamliner&#34;&gt;Boeing 787 Dreamliner&lt;/a&gt;.&lt;/p&gt;


&lt;figure &gt;
    &lt;a href=&#34;https://commons.wikimedia.org/w/index.php?curid=8775802&#34;&gt;
        &lt;img src=&#34;http://bitfunnel.org/debugging-bit-densities/Boeing_787_first_flight.jpg&#34; /&gt;
    &lt;/a&gt;
    
&lt;/figure&gt;


&lt;p&gt;The components were designed, manufactured
and tested separately and then one day they were brought together and
assembled for the first time into an airliner.
There were some initial hitches getting everything to fit together, but before you knew it
they were doing high speed taxi tests and then preparing for the first flight.&lt;/p&gt;

&lt;p&gt;I spent the day bolting on the wings and spooling up the engines. My focus was primarily on
the bit densities in the Row Tables.&lt;/p&gt;

&lt;h2 id=&#34;methodology&#34;&gt;Methodology&lt;/h2&gt;

&lt;p&gt;My experiments were based on &lt;a href=&#34;https://bitfunnel.blob.core.windows.net/wiki-data/chunked/enwiki-20161020-chunked1.tar.gz&#34;&gt;enwiki-20161020-chunked1&lt;/a&gt;.
After downloading and unzipping this file, I made a manifest file listing all 236 chunks in the corpus.
This can be done on Windows with the &lt;code&gt;dir&lt;/code&gt; command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;d:\temp\wikipedia&amp;gt; dir /s/b/a-d enwiki-20161020-chunked1 &amp;gt; manifest.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;or on Linux with the &lt;code&gt;find&lt;/code&gt; command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;% find enwiki-20161020-chunked1 -type f &amp;gt; manifest.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I then generated corpus statistics, built the term table, and started an analysis in the repl:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;% mkdir /tmp/wikipedia/config
% BitFunnel statistics -text /tmp/wikipedia/manifest.txt /tmp/wikipedia/config
% BitFunnel termtable /tmp/wikipedia/config

% mkdir /tmp/wikipedia/analysis
% BitFunnel repl /tmp/wikipedia/config

Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.)

directory = &amp;quot;/tmp/wikipedia/config&amp;quot;
gram size = 1

Starting index ...
Index started successfully.

Type &amp;quot;help&amp;quot; to get started.

0: cache manifest /tmp/wikipedia/manifest.txt
Ingesting manifest &amp;quot;/tmp/wikipedia/manifest.txt&amp;quot;
Caching IDocuments for query verification.
Ingestion complete.

1: cd /tmp/wikipedia/analysis
output directory is now &amp;quot;/tmp/wikipedia/analysis&amp;quot;.

2: analyze
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This generated a number of files in the &lt;code&gt;analysis&lt;/code&gt; directory,
including &lt;code&gt;RowDensities-0.csv&lt;/code&gt;, which is easy to view in Excel.&lt;/p&gt;

&lt;h2 id=&#34;picking-low-hanging-fruit&#34;&gt;Picking Low Hanging Fruit&lt;/h2&gt;

&lt;p&gt;My interest today was the densities of the rank 3 rows,
where my &lt;a href=&#34;http://bitfunnel.org/row-table-analysis&#34;&gt;earlier investigations&lt;/a&gt;
suggested there might be a bug.&lt;/p&gt;

&lt;p&gt;Right off the bat, I found a really high density in a rank 3 row for the adhoc term &lt;code&gt;scriptura&lt;/code&gt;.
Cell Q47193 shows a density of 0.652111 in row 2353. This is way above the target density
of 0.1 and something that needs to be addressed.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/debugging-bit-densities/TermScriptura.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;So how did the row density get so high? My first line of inquiry was to examine the other terms
that share row 2353 with &lt;code&gt;scriptura&lt;/code&gt;. To do this, I modified &lt;code&gt;RowTableAnalyzer::AnalyzeRowsInOneShard()&lt;/code&gt;
to print out the term text and frequency whenever &lt;code&gt;row.GetIndex() == 2353 &amp;amp;&amp;amp; row.GetRank() == 3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Near the top of the function, I recorded the index of the row under observation, created a counter
for terms in the row, and created a &lt;code&gt;CsvTableFormatter&lt;/code&gt; to print out the terms. The latter was necessary to properly
escape terms that contain commas (e.g. &lt;code&gt;10,000&lt;/code&gt;).&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;    const RowIndex specialRow = 2353;
    size_t specialTermCount = 0;
    CsvTsv::CsvTableFormatter specialFormatter(std::cout);
    std::cout &amp;lt;&amp;lt; &amp;quot;Special row is &amp;quot; &amp;lt;&amp;lt; specialRow &amp;lt;&amp;lt; std::endl;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the middle of the loop, I added code to print out information about each term in the special row:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;    if (row.GetRank() == 3 &amp;amp;&amp;amp; row.GetIndex() == specialRow)
    {
        std::cout
            &amp;lt;&amp;lt; specialTermCount++
            &amp;lt;&amp;lt; &amp;quot;,&amp;quot;;
        specialFormatter.WriteField(termToText.Lookup(term.GetRawHash()));
        std::cout
            &amp;lt;&amp;lt; &amp;quot;,&amp;quot;
            &amp;lt;&amp;lt; dfEntry.GetFrequency();
        specialFormatter.WriteRowEnd();
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here are the results. There were 752 terms in row 2353, but as you can see,
the sum of their frequencies is 0.102110806, a value that is pretty close to the
target density of 0.1.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/debugging-bit-densities/Row2353.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;At first glance the fact that the frequencies sum to 0.102110806
suggests that the bin packing algorithm in
&lt;code&gt;TermTableBuilder::RowAssigner&lt;/code&gt; is working correctly.
But wait &amp;ndash; these are term frequencies, which only correspond to densities
at rank 0. At higher ranks, the densities are magnified because each bit
at a higher rank corresponds to multiple rank 0 bits.
After a brief investigation, I found that we were, in fact,
undercounting bits added to adhoc rows.
In &lt;code&gt;RowAssigner::AssignAdhoc()&lt;/code&gt; we keep a running total of the density
contributed by adhoc terms:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;m_adhocTotal += frequency * count;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Later we&amp;rsquo;d use &lt;code&gt;m_adhocTotal&lt;/code&gt; to compute the number of rows needed to store
adhoc terms:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;double rowCount = ceil(m_adhocTotal / m_density);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The precise problem is that each bit in a rank 3 row corresponds to 8 bits at rank 0,
so the density in rank 3 will be significantly higher than the term&amp;rsquo;s
frequency at rank 0.
A &lt;a href=&#34;https://github.com/BitFunnel/BitFunnel/commit/ebf12d52efd163c005802b63cafa6bfae4f2da09&#34;&gt;small change&lt;/a&gt;
addressed this issue:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;double f = Term::FrequencyAtRank(frequency, m_rank);
m_adhocTotal += f * count;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This fix had a big impact, dropping &lt;code&gt;scriptura&lt;/code&gt; row densities to the 0.20 to 0.30 range.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/debugging-bit-densities/TermScripturaFixed.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;This was a nice improvement, but the densities were still 2-3x what they should be.
It turns out that there was another bug, this time in &lt;code&gt;Term::FrequencyAtRank()&lt;/code&gt;
(thanks to &lt;a href=&#34;https://twitter.com/danluu&#34;&gt;@danluu&lt;/a&gt; for spotting the error!)&lt;/p&gt;

&lt;p&gt;The original code computed&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;return 1.0 - pow(1.0 - frequency, rank + 1.0);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;a href=&#34;https://github.com/BitFunnel/BitFunnel/commit/41db6a9f81c2517d1a88adc8c27b8612fb586eb5&#34;&gt;corrected code&lt;/a&gt; is&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;size_t rowCount = (1ull &amp;lt;&amp;lt; rank);
return 1.0 - pow(1.0 - frequency, rowCount);
&lt;/code&gt;&lt;/pre&gt;
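&lt;p&gt;To sanity check the fix, here is a small standalone sketch (my own illustration, not code from the BitFunnel repository) that places the original and corrected formulas side by side:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;#include &amp;lt;cmath&amp;gt;

// Original (buggy) version: treats rank as an increment rather than
// as an exponent on the row length.
double FrequencyAtRankBuggy(double frequency, unsigned rank)
{
    return 1.0 - pow(1.0 - frequency, rank + 1.0);
}

// Corrected version: a bit in a rank-r row is the OR of 2^r rank 0
// columns, so it is set unless all 2^r columns are zero.
double FrequencyAtRank(double frequency, unsigned rank)
{
    size_t rowLength = 1ull &amp;lt;&amp;lt; rank;
    return 1.0 - pow(1.0 - frequency, static_cast&amp;lt;double&amp;gt;(rowLength));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For a term with frequency 0.05 at rank 3, the buggy version predicts a density of about 0.185 (1 - 0.95&lt;sup&gt;4&lt;/sup&gt;) while the corrected version predicts about 0.337 (1 - 0.95&lt;sup&gt;8&lt;/sup&gt;), which helps explain why the high rank rows were so underprovisioned.&lt;/p&gt;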

&lt;p&gt;With these two fixes, the &lt;code&gt;scriptura&lt;/code&gt; densities are all below 0.15:&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/debugging-bit-densities/TermScripturaFixed2.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;An examination of other
rows shows that the fix works for all adhoc rows. We still have a ways to go if we want
all shared rows to have a density at or below 0.10, but we&amp;rsquo;ve made good progress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One More Thing:&lt;/strong&gt; During the course of this investigation, I noticed that some of the terms in the
index contain punctuation (e.g. &lt;code&gt;39.4&lt;/code&gt; and &lt;code&gt;c&#39;s&lt;/code&gt;) and others are capitalized.
It turns out that our &lt;a href=&#34;https://github.com/BitFunnel/Workbench&#34;&gt;Workbench&lt;/a&gt;
tool which uses &lt;a href=&#34;https://lucene.apache.org/&#34;&gt;Lucene&amp;rsquo;s&lt;/a&gt; analyzer to tokenize
the Wikipedia page bodies was letting the page titles pass through without analysis.
We should have a fix for this shortly and then we will reprocess Wikipedia and
upload new chunk files.&lt;/p&gt;

&lt;h2 id=&#34;where-things-stand&#34;&gt;Where Things Stand&lt;/h2&gt;

&lt;p&gt;This is a classic case of little bugs hiding behind big bugs.
We&amp;rsquo;re also in a situation where we have bugs on multiple axes,
making it hard to reason about why we&amp;rsquo;re seeing certain behaviors.&lt;/p&gt;

&lt;p&gt;As an example, we know that query processing is slow.
We expected it to be slow because we&amp;rsquo;re using the byte code interpreter
instead of &lt;a href=&#34;https://github.com/BitFunnel/NativeJIT&#34;&gt;NativeJIT&lt;/a&gt;
and because we indulged ourselves in some potentially
expensive operations in the glue code (e.g. excess allocations and copies, lock
contention, and calls to qsort).&lt;/p&gt;

&lt;p&gt;The problem is that we don&amp;rsquo;t really know how fast query processing should be
with the current design choices.
It might be slow because of the reasons above, or it might be even slower
because our row densities are too high, causing excess row reads, or
our false positives are too high, causing too many results.&lt;/p&gt;

&lt;p&gt;Or we might have a performance bug, say running each query twice when
we thought we were running it once.&lt;/p&gt;

&lt;p&gt;There are so many possibilities that we just need to take one suspicious
piece of data at a time, come up with a single example, be it a bad row
or a bad query, and instrument it or trace it through.&lt;/p&gt;

&lt;p&gt;Over time we will pick enough of this low hanging fruit that we generally
trust the system and bugs will become more of an aberration.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Row Table Analysis</title>
      <link>http://bitfunnel.org/row-table-analysis/</link>
      <pubDate>Sun, 23 Oct 2016 15:51:23 -0700</pubDate>
      
      <guid>http://bitfunnel.org/row-table-analysis/</guid>
      <description>

&lt;p&gt;I spent the weekend implementing code to analyze bit densities in the
rows and columns of the row tables. This tool should help us determine
whether the row tables are configured correctly. A good row table should
have the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each column&amp;rsquo;s density is close to the system target density.&lt;/li&gt;
&lt;li&gt;Each row&amp;rsquo;s density is close to the system target density.&lt;/li&gt;
&lt;li&gt;Random pairs of terms are unlikely to share rows.&lt;/li&gt;
&lt;li&gt;All rows assigned to a single term should be distinct.&lt;/li&gt;
&lt;/ul&gt;
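&lt;p&gt;For reference, the density being measured here is simply the fraction of set bits. A minimal sketch of the computation for a single row (a hypothetical helper for illustration, not the actual analyzer code) might look like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;vector&amp;gt;

// Density of a row stored as 64-bit words: set bits / total columns.
double RowDensity(std::vector&amp;lt;uint64_t&amp;gt; const &amp;amp; row, size_t columnCount)
{
    size_t setBits = 0;
    for (uint64_t word : row)   // word is a copy, so the row is not modified
    {
        // Kernighan trick: each iteration clears the lowest set bit.
        while (word != 0)
        {
            word &amp;amp;= word - 1;
            ++setBits;
        }
    }
    return static_cast&amp;lt;double&amp;gt;(setBits) / columnCount;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The column analysis applies the same idea to the bits in a single document&#39;s column.&lt;/p&gt;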

&lt;p&gt;To test the tool, I did a quick row table analysis for two corpora.
The first was the collection of Shakespeare Sonnets in
&lt;a href=&#34;http://bitfunnel.org/alls-well-that-ends-well&#34;&gt;TheBard&lt;/a&gt;.
The second corpus consisted of the first 1805 documents from our
&lt;a href=&#34;http://bitfunnel.org/wikipedia-as-test-corpus-for-bitfunnel&#34;&gt;processed Wikipedia dump&lt;/a&gt;
(chunks &lt;code&gt;AA\\wiki_00&lt;/code&gt; to &lt;code&gt;AA\\wiki_09&lt;/code&gt;).
The remainder of this post describes my methodology and some early observations.&lt;/p&gt;

&lt;h2 id=&#34;methodology&#34;&gt;Methodology&lt;/h2&gt;

&lt;p&gt;I used the new &lt;code&gt;-script&lt;/code&gt; option to &lt;code&gt;BitFunnel repl&lt;/code&gt; to start
the repl and then execute commands from the file &lt;code&gt;ingest.txt&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;% BitFunnel repl -script ingest.txt /tmp/wikipedia/config
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;-script&lt;/code&gt; option is a huge time saver.
Here are the commands inside of &lt;code&gt;ingest.txt&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_00
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_01
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_02
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_03
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_04
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_05
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_06
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_07
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_08
cache chunk /tmp/wikipedia/enwiki-20161020-chunked1/AA/wiki_09
cd /tmp/wikipedia/out
analyze
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first 10 lines ingest chunks &lt;code&gt;AA/wiki_00&lt;/code&gt; to &lt;code&gt;AA/wiki_09&lt;/code&gt; while caching
their &lt;code&gt;IDocuments&lt;/code&gt;. It is important to use the &lt;code&gt;cache&lt;/code&gt; command here because
the column density analysis is only performed for &lt;code&gt;IDocuments&lt;/code&gt; that are cached.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;cd&lt;/code&gt; command on the next line sets the output directory to &lt;code&gt;/tmp/wikipedia/out&lt;/code&gt;.
This is where the row table analyzer will put its output files.&lt;/p&gt;

&lt;p&gt;Finally, the &lt;code&gt;analyze&lt;/code&gt; command kicks off the row table analysis, which generates
the following files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;/tmp/wikipedia/out/ColumnDensities.csv&lt;/li&gt;
&lt;li&gt;/tmp/wikipedia/out/ColumnDensitySummary.txt&lt;/li&gt;
&lt;li&gt;/tmp/wikipedia/out/RowDensities-0.csv&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&#34;column-densities-in-wikipedia&#34;&gt;Column Densities in Wikipedia&lt;/h2&gt;

&lt;p&gt;The column densities are recorded as a table in a .csv file.
Here&amp;rsquo;s what the file looks like in Excel:&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/row-table-analysis/column-density-spreadsheet.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;The first column is the document id, which in the case of Wikipedia
is the &lt;code&gt;curid&lt;/code&gt; value. For example, cell &lt;code&gt;A2&lt;/code&gt; has an id of &lt;code&gt;1805&lt;/code&gt; which
corresponds to the
&lt;a href=&#34;https://en.wikipedia.org/wiki?curid=1805&#34;&gt;Wikipedia page on Antibiotics&lt;/a&gt;
(full url is &lt;code&gt;https://en.wikipedia.org/wiki?curid=1805&lt;/code&gt;).
Column &lt;code&gt;B&lt;/code&gt; contains the number of postings contributed by each document.
We can see from cell &lt;code&gt;B2&lt;/code&gt; that the page on Antibiotics contributes
&lt;code&gt;1364&lt;/code&gt; postings. Column &lt;code&gt;C&lt;/code&gt; has the shard the document is stored in.
For these runs, the index was configured with a single shard, so the column
is all zeros. Columns &lt;code&gt;D&lt;/code&gt; through &lt;code&gt;K&lt;/code&gt; have the bit densities of the
document&amp;rsquo;s column at ranks 0 to 7, respectively.
Note that this index was configured to use only ranks 0 and 3.&lt;/p&gt;

&lt;p&gt;We can learn a lot about the index by examining a scatter plot of column
density vs. document posting count. In the chart below, blue dots represent
rank 0 densities and orange dots represent rank 3 densities.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/row-table-analysis/column-density-vs-posting-count.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;Right off the bat, the structure of the graph suggests two learnings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Need for Sharding&lt;/strong&gt;
The first learning is that we will need to shard the index by posting count.
An examination of the blue dots shows that column density is directly
proportional to document posting count. In other words, short documents
have low column densities and long documents have high column densities.&lt;/p&gt;

&lt;p&gt;This index was sized to accommodate the average-sized document, which has
766 postings. If we follow the red line up from 766 postings, it hits
the blue dots near a density of 0.1 which was the target density used
in the index configuration.&lt;/p&gt;

&lt;p&gt;In this index, documents with fewer than 766 postings will consume more
memory than necessary, while documents with more than 766 postings will
contribute to an increased false positive rate.&lt;/p&gt;

&lt;p&gt;We knew from Bing that we would need to shard the index into groups of
documents with similar posting counts. This scatter plot just confirms
the need for sharding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Bug in Rank 3?&lt;/strong&gt;
The cloud of orange dots is problematic in that it doesn&amp;rsquo;t show the
expected linear structure and it includes many densities above the target
of 0.1.&lt;/p&gt;

&lt;p&gt;An examination of the TermTable builder code shows two design flaws that
could potentially lead to excess density in higher rank rows.&lt;/p&gt;

&lt;p&gt;The first problem is that each term is associated with a set of rows
that are either all adhoc or all explicit.
In some cases, a term has a low enough frequency to be adhoc at rank 0,
but high enough to require explicit row assignment at rank 3.
With the current algorithm, such a term&amp;rsquo;s rows will all be adhoc,
leading to the overfilling of some rows at higher ranks.&lt;/p&gt;

&lt;p&gt;This is a problem that exists in the current Bing codebase and explains
why they sometimes have rows with unexpectedly high density.&lt;/p&gt;

&lt;p&gt;The second problem is that the term treatment allows higher rank rows
when their use would precipitate densities above the target density.
Suppose, for example, that we have a term that appears in 5% of the corpus
and its term treatment calls for two shared rows, one at rank 0 and the
other at rank 3. The bin packing algorithm will ensure that the rank 0 row
is not overfilled. The problem comes in the rank 3 row, where the 5%
term frequency translates to roughly 33.7% of the bits being set
(1 - 0.95&lt;sup&gt;8&lt;/sup&gt;, since each rank 3 bit covers 8 rank 0 columns). In this case, the
bin packing algorithm will be forced to allocate a private rank 3 row,
when a better choice might have been to allocate a lower rank row that
could be shared with other terms.&lt;/p&gt;
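&lt;p&gt;To make this problem concrete, here is a simplified sketch of the greedy row assignment idea (an illustration of the concept, not the actual &lt;code&gt;TermTableBuilder&lt;/code&gt; code). It approximates a shared row&#39;s density as the sum of the densities of the terms assigned to it:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;#include &amp;lt;cmath&amp;gt;
#include &amp;lt;vector&amp;gt;

// Expected density of a term at a given rank: each rank r bit covers
// 2^r rank 0 columns.
double FrequencyAtRank(double frequency, unsigned rank)
{
    return 1.0 - pow(1.0 - frequency, static_cast&amp;lt;double&amp;gt;(1ull &amp;lt;&amp;lt; rank));
}

// Returns the number of rows needed at the given rank, packing terms
// into a shared row until the target density would be exceeded.
size_t AssignRows(std::vector&amp;lt;double&amp;gt; const &amp;amp; frequencies,
                  unsigned rank,
                  double targetDensity)
{
    size_t rowCount = 0;
    double current = targetDensity;   // forces a new row for the first shared term
    for (double f : frequencies)
    {
        double d = FrequencyAtRank(f, rank);
        if (d &amp;gt;= targetDensity)
        {
            ++rowCount;               // private row: sharing would overfill it
            continue;
        }
        if (current + d &amp;gt; targetDensity)
        {
            ++rowCount;               // close the current row and open a new one
            current = 0.0;
        }
        current += d;
    }
    return rowCount;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With a target density of 0.1, the 5% term above shares a rank 0 row without trouble, but its rank 3 density alone exceeds the target, so this scheme gives it a private rank 3 row.&lt;/p&gt;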

&lt;p&gt;Until we fix these problems we won&amp;rsquo;t be able to see whether there are other
less impactful bugs lurking in the background.&lt;/p&gt;

&lt;h2 id=&#34;row-densities-in-shakespeare-sonnets&#34;&gt;Row Densities in Shakespeare Sonnets&lt;/h2&gt;

&lt;p&gt;The row density table seems to show the same issues with excess density
in higher rank rows. The row table analyzer outputs one row density file
for each shard. For this run, our index only has one shard,
so its data appears in &lt;code&gt;RowDensities-0.csv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is a .csv file with a ragged right edge.
Each line in the file corresponds to the term listed in column &lt;code&gt;A&lt;/code&gt;.
Column &lt;code&gt;B&lt;/code&gt; shows the term&amp;rsquo;s frequency in the corpus used during the
statistics collection phase (&lt;code&gt;BitFunnel statistics&lt;/code&gt;). In the excerpt
below, we can see that the term &lt;code&gt;Sonnet&lt;/code&gt; appears in 100% of the documents
while the term &lt;code&gt;but&lt;/code&gt; appears in 71.4%.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/row-table-analysis/row-densities-bard-top.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;Columns &lt;code&gt;C&lt;/code&gt; through &lt;code&gt;E&lt;/code&gt; give the rank, row index, and observed density
of the first row associated with each term. Columns &lt;code&gt;F&lt;/code&gt; through &lt;code&gt;H&lt;/code&gt;
correspond to the second row, and so on. In the excerpt above, we
can see that common terms correspond to a single private row. Since
the row is not shared with other terms, its density can be above the
target of 0.1.&lt;/p&gt;

&lt;p&gt;Note that in the excerpt above, the
term frequency in column &lt;code&gt;B&lt;/code&gt; is exactly equal to the fraction of bits
detected in column &lt;code&gt;E&lt;/code&gt;, even though the corpus statistics were gathered
from a slightly different corpus. These values aren&amp;rsquo;t required to be
equal, but it is not surprising that they are equal for the most common
terms.&lt;/p&gt;

&lt;p&gt;In the excerpt below, we can see that at some point, less common terms
begin to share rows. As soon as a term shares rows, it needs at least one additional
row to drive down the noise. The term &lt;code&gt;things&lt;/code&gt; is common enough to get
its own, private row, while &lt;code&gt;gentle&lt;/code&gt; is assigned to a pair of rows that
are shared with other terms.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/row-table-analysis/row-densities-bard-middle.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;The red highlight shows the bug rearing its ugly head in the rank 3
rows. As an example, the term &lt;code&gt;gentle&lt;/code&gt; has a density of 0.19 in a
rank 3 row. If row &lt;code&gt;1000&lt;/code&gt; is shared, its density will lead to an
unexpectedly high false positive rate in queries involving terms that
share rank 3 row number &lt;code&gt;1000&lt;/code&gt;. If, on the other hand, row &lt;code&gt;1000&lt;/code&gt; is
private, we may be wasting storage.&lt;/p&gt;

&lt;p&gt;This problem becomes less prominent as terms get rarer. In the excerpt
below, we see that &lt;code&gt;wrinkles&lt;/code&gt; gets two shared rows, while &lt;code&gt;seek&lt;/code&gt; gets
three. In this portion of the table, all of the densities are
below the target level of 0.1.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/row-table-analysis/row-densities-bard-bottom.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;It is a bit suspicious that the rank 3 rows for rare terms
seem to have densities that are below the target density. This might be a result of the
bug directing bits to the wrong rows, or it could mean that we&amp;rsquo;ve
somehow overprovisioned these rows and are wasting memory.&lt;/p&gt;

&lt;p&gt;It is great to have these analysis tools to get a handle on problems
in the row tables.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Wikipedia as test corpus for BitFunnel</title>
      <link>http://bitfunnel.org/wikipedia-as-test-corpus-for-bitfunnel/</link>
      <pubDate>Fri, 21 Oct 2016 17:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/wikipedia-as-test-corpus-for-bitfunnel/</guid>
      <description>

&lt;p&gt;Wikipedia is a great test corpus for search engines. It is &lt;a href=&#34;https://dumps.wikimedia.org/enwiki/&#34;&gt;free and easy to obtain&lt;/a&gt;, it carries a &lt;a href=&#34;https://en.wikipedia.org/wiki/Wikipedia:Copyrights&#34;&gt;license appropriate for research&lt;/a&gt;, and at ~59GB uncompressed, it is large, but not too large to fit on a reasonably-sized server. For those with extremely fast reflexes, even user data&lt;sup class=&#34;footnote-ref&#34; id=&#34;fnref:querylogs&#34;&gt;&lt;a rel=&#34;footnote&#34; href=&#34;#fn:querylogs&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; is sometimes available.&lt;/p&gt;

&lt;p&gt;Wikipedia is also probably more representative of common use cases of search: since it is edited by amateurs, it is a more pedestrian dataset than many other corpora. This likely makes it more relevant to many realistic applications of search, particularly those that contain mostly amateur-generated data (such as consumer web search and corporate document search).&lt;/p&gt;

&lt;p&gt;All of this makes Wikipedia a sensible baseline dataset for testing BitFunnel. To facilitate this, we have released a pre-processed version of the 2016-10-20 dump of Wikipedia, so that it is trivial to ingest into BitFunnel.&lt;/p&gt;

&lt;p&gt;In this post, we will look at (1) how you can obtain this pre-processed Wikipedia data, (2) how you can ingest it into a running BitFunnel instance, (3) how to get ahold of the intermediate processing files so that you can audit the chunk files, to make sure they&amp;rsquo;re correct, and (4) some simple statistics about the corpus.&lt;/p&gt;

&lt;h1 id=&#34;obtaining-the-corpus&#34;&gt;Obtaining the corpus&lt;/h1&gt;

&lt;p&gt;The 2016-10-20 dump of Wikipedia is divided into 27 compressed XML files. We have transformed each of these XML files into BitFunnel&amp;rsquo;s custom &lt;em&gt;chunk&lt;/em&gt; file format to make ingestion fast and painless. (See the &lt;a href=&#34;http://bitfunnel.org/corpus-file-format/&#34;&gt;blog post&lt;/a&gt; that introduces the chunk format).&lt;/p&gt;

&lt;p&gt;Each of these 27 dump files generates many chunk files. These &amp;ldquo;segments&amp;rdquo; of chunk files can be found at URLs following the pattern in the code block below; to download one of the 27 segments, simply replace the &lt;code&gt;${1}&lt;/code&gt; with the number of the chunk you&amp;rsquo;d like. (The dump numbers start at 1 and end at 27, inclusive.)&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;https://bitfunnel.blob.core.windows.net/wiki-data/chunked/enwiki-20161020-chunked${1}.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So, for example, if you want to obtain chunk 1, and you are running on Linux, you might run something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;wget https://bitfunnel.blob.core.windows.net/wiki-data/chunked/enwiki-20161020-chunked1.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Alternatively, you can paste that URL into a browser.&lt;/p&gt;

&lt;h1 id=&#34;ingesting-the-corpus&#34;&gt;Ingesting the corpus&lt;/h1&gt;

&lt;p&gt;There are a few ways to ingest chunk files. Probably the easiest is to use the REPL, which is what we will do in this section.&lt;/p&gt;

&lt;p&gt;We explain a bit of the background of how BitFunnel can be configured in the &lt;a href=&#34;http://bitfunnel.org/index-build-tools/&#34;&gt;index build tools&lt;/a&gt; post. Today, it is sufficient to download the &lt;a href=&#34;https://bitfunnel.blob.core.windows.net/wiki-data/config/enwiki-20161020-config.tar.gz&#34;&gt;gzip&amp;rsquo;d configuration files&lt;/a&gt; generated from the same Wikipedia dump.&lt;/p&gt;

&lt;p&gt;From there, you can run something like:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;$ BitFunnel repl /path/to/unzipped/config/directory
Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.)

directory = &amp;quot;/path/to/unzipped/config/directory&amp;quot;
gram size = 1

Starting index ...
Blocksize: [... your size here ...]
Index started successfully.

Type &amp;quot;help&amp;quot; to get started.

0: cache chunk /path/to/chunk/file
1: verify one wings
[... results go here ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When we start the REPL, you can see that we run two commands. &lt;code&gt;cache chunk&lt;/code&gt; ingests a chunk file, and &lt;code&gt;verify one&lt;/code&gt; queries the data in the index and verifies the document matches are correct.&lt;/p&gt;

&lt;h1 id=&#34;auditing-the-data&#34;&gt;Auditing the data&lt;/h1&gt;

&lt;p&gt;In the &lt;a href=&#34;http://bitfunnel.org/corpus-file-format/&#34;&gt;post on BitFunnel&amp;rsquo;s corpus file format&lt;/a&gt;, we write about the steps needed to convert the Wikipedia dump files into BitFunnel chunk files. This process is also specified in the README of BitFunnel&amp;rsquo;s &lt;a href=&#34;https://github.com/bitfunnel/Workbench&#34;&gt;Workbench&lt;/a&gt; project.&lt;/p&gt;

&lt;p&gt;In brief, there are 3 steps: (1) download the Wikipedia XML dumps, (2) use &lt;a href=&#34;https://github.com/attardi/wikiextractor&#34;&gt;WikiExtractor&lt;/a&gt; to filter out the Wikipedia markup; and (3) convert the &amp;ldquo;extracted&amp;rdquo; text to chunk files using BitFunnel&amp;rsquo;s &lt;a href=&#34;https://github.com/bitfunnel/Workbench&#34;&gt;Workbench&lt;/a&gt; project.&lt;/p&gt;

&lt;p&gt;In order to allow people to audit the chunk files, we are also hosting both the raw XML dump files for the 2016-10-20 articles-only dump of Wikipedia, and the markup-filtered files we generated with WikiExtractor. (We provide the Wikipedia dump because eventually Wikipedia will stop offering it in its archive.)&lt;/p&gt;

&lt;p&gt;Wikipedia generated 27 dump files, so there are 27 extracted files and 27 chunk directories. So, just like we had a URL pattern of chunk segments, we have one for raw wikipedia dump files and extracted dump files, respectively:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;https://bitfunnel.blob.core.windows.net/wiki-data/raw/enwiki-20161020-pages-articles${1}.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-sh&#34;&gt;https://bitfunnel.blob.core.windows.net/wiki-data/extracted/enwiki-20161020-extracted${1}.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As with the chunk file, simply replace &lt;code&gt;${1}&lt;/code&gt; with a number from 1 to 27 (inclusive) to receive the corresponding dump file.&lt;/p&gt;

&lt;p&gt;From here, you can inspect the data, or use Workbench and WikiExtractor to generate your own data and compare.&lt;/p&gt;

&lt;h1 id=&#34;statistics&#34;&gt;Statistics&lt;/h1&gt;

&lt;p&gt;Finally, just as a point of reference, here are some statistics relating to the size of the corpus in various states of processing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Raw corpus:&lt;/strong&gt; total size is ~16.7GB when compressed with gzip, and ~59.2GB when uncompressed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://github.com/attardi/wikiextractor&#34;&gt;WikiExtractor&lt;/a&gt;&amp;rsquo;d corpus:&lt;/strong&gt; total size is ~4.8GB when compressed with gzip, and ~13.2 GB uncompressed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://github.com/bitfunnel/Workbench&#34;&gt;Chunked&lt;/a&gt; corpus:&lt;/strong&gt; total size is ~3.6GB when compressed with gzip, and ~10.2GB uncompressed.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;footnotes&#34;&gt;

&lt;hr /&gt;

&lt;ol&gt;
&lt;li id=&#34;fn:querylogs&#34;&gt;If you were paying attention for exactly one day in 2012, you could obtain &lt;a href=&#34;https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/&#34;&gt;user query logs&lt;/a&gt;. These are particularly useful because they allow you to construct realistic synthetic workloads. &lt;br/&gt;&lt;br/&gt;These days such a release is quite rare. Nearly every company that has released significant user data has been burned (see, for example, AOL&amp;rsquo;s &lt;a href=&#34;https://en.wikipedia.org/wiki/AOL_search_data_leak&#34;&gt;search query log scandal&lt;/a&gt; and Netflix&amp;rsquo;s &lt;a href=&#34;https://en.wikipedia.org/wiki/Netflix_Prize#Privacy_concerns&#34;&gt;recommender scandal&lt;/a&gt;), which makes other companies reluctant to take the plunge. &lt;br/&gt;&lt;br/&gt;But, as much as information retrieval researchers would like to use such data to improve search systems, there are also very good theoretical reasons (see, for example, this paper [&lt;a href=&#34;http://www.cse.psu.edu/~duk17/papers/nflprivacy.pdf&#34;&gt;PDF&lt;/a&gt;]) to believe that, at the very least, releasing such data is very difficult to do correctly. Difficult enough that it may never be worth the risk.
 &lt;a class=&#34;footnote-return&#34; href=&#34;#fnref:querylogs&#34;&gt;&lt;sup&gt;[return]&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>BitFunnel Glossary</title>
      <link>http://bitfunnel.org/glossary/</link>
      <pubDate>Thu, 13 Oct 2016 15:52:53 -0700</pubDate>
      
      <guid>http://bitfunnel.org/glossary/</guid>
      <description>

&lt;p&gt;To get a high level overview of the algorithm, &lt;a href=&#34;//bitfunnel.org/strangeloop/&#34;&gt;please see this talk transcript&lt;/a&gt;. This glossary is incomplete and needs a lot of work! While our plan is to fill out the whole thing, that will probably take a while. If there&amp;rsquo;s some particular term or concept that you&amp;rsquo;d like to see explained sooner, &lt;a href=&#34;https://github.com/bitfunnel/bitfunnel/issues&#34;&gt;please let us know&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&#34;top-level-concepts&#34;&gt;Top level concepts&lt;/h2&gt;

&lt;h3 id=&#34;termtable&#34;&gt;TermTable&lt;/h3&gt;

&lt;p&gt;A TermTable contains the mapping from a term to the rows associated with the term. A term can be a word (1-gram) or an N-word phrase (N-gram).&lt;/p&gt;

&lt;h3 id=&#34;index-ingestor&#34;&gt;Index / Ingestor&lt;/h3&gt;

&lt;p&gt;An Index contains, for one machine, an &amp;ldquo;index&amp;rdquo; of all documents that are indexed on that machine. An Index consists of multiple Shards, plus the configuration information necessary to ingest and look up documents, which means that it contains references to things like TermTables.&lt;/p&gt;

&lt;h3 id=&#34;shard&#34;&gt;Shard&lt;/h3&gt;

&lt;p&gt;A Shard contains descriptors for the documents contained within the shard. This means that it has the DocTableDescriptors and the RowTableDescriptors for a Shard. The DocTableDescriptor tells you what information is stored along with each document besides the bits in the index. The RowTableDescriptors tell you what bits in the index actually mean. For example, given a DocIndex and the associated row information, it can give you the bit-address of a specific row for a document.&lt;/p&gt;

&lt;p&gt;A Shard is also responsible for holding onto Slices.&lt;/p&gt;

&lt;h3 id=&#34;slice&#34;&gt;Slice&lt;/h3&gt;

&lt;p&gt;A Slice owns the memory associated with a set of documents. Very roughly speaking, it&amp;rsquo;s a buffer of some size, with meta-information on the buffer and a ref count.&lt;/p&gt;

&lt;h2 id=&#34;other-important-concepts&#34;&gt;Other important concepts&lt;/h2&gt;

&lt;h3 id=&#34;rank&#34;&gt;Rank&lt;/h3&gt;

&lt;p&gt;In our hierarchical bloom filters, a row is said to have rank &lt;em&gt;i&lt;/em&gt; if each bit represents 2**i documents. This means that, in a rank 0 row, each bit represents exactly one document, and in a rank 3 row, each bit represents 8 documents (in which case, the bit is set if any one of the 8 documents represented &amp;ldquo;wants&amp;rdquo; to set the bit).&lt;/p&gt;
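&lt;p&gt;The arithmetic can be sketched in a few lines (illustrative code, not from the BitFunnel repo):&lt;/p&gt;

```javascript
// Illustrative rank arithmetic. In a rank i row, each bit represents
// 2**i documents.
function docsPerBit(rank) {
  return 2 ** rank;
}

// The bit a document maps to in a rank i row: documents are grouped
// into consecutive blocks of 2**i, one bit per block.
function bitIndex(rank, docIndex) {
  return Math.floor(docIndex / docsPerBit(rank));
}

console.log(docsPerBit(0));  // 1: each bit is exactly one document
console.log(docsPerBit(3));  // 8: a rank 3 bit covers 8 documents
console.log(bitIndex(3, 7)); // 0: documents 0..7 share bit 0
console.log(bitIndex(3, 8)); // 1: document 8 starts the next block
```

&lt;p&gt;A rank 3 bit is set if any of the 8 documents in its block wants to set it, which is why higher-rank rows are denser.&lt;/p&gt;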

&lt;h3 id=&#34;fixedsizedblob&#34;&gt;FixedSizedBlob&lt;/h3&gt;

&lt;p&gt;Per-document storage for fixed-size chunks of data (i.e., data where the size of the blob is the same for every document). Because items are fixed-size, they can be stored in an array or an array-like structure.&lt;/p&gt;

&lt;h3 id=&#34;variablesizedblob&#34;&gt;VariableSizedBlob&lt;/h3&gt;

&lt;p&gt;Per-document storage for variable-size chunks of data (i.e., data where the size of the blob can be different for every document). Because items are variable-size, pointers to items are stored.&lt;/p&gt;

&lt;h3 id=&#34;doctable-doctabledescriptor&#34;&gt;DocTable / DocTableDescriptor&lt;/h3&gt;

&lt;p&gt;The DocTable is a collection of per-document data items for a Slice. An item in the DocTable consists of some number of FixedSizedBlobs as well as pointers to VariableSizedBlobs.&lt;/p&gt;

&lt;h3 id=&#34;rowtabledescriptor&#34;&gt;RowTableDescriptor&lt;/h3&gt;

&lt;p&gt;RowTableDescriptor exposes low-level operations on a Slice like GetBit, SetBit, and ClearBit. Given a pointer to a SliceBuffer, a RowIndex, and a DocIndex, the RowTableDescriptor lets you actually manipulate the information inside a Slice.&lt;/p&gt;

&lt;h3 id=&#34;documenthandle&#34;&gt;DocumentHandle&lt;/h3&gt;

&lt;h3 id=&#34;tokenmanager-tokentracker-token&#34;&gt;TokenManager / TokenTracker / Token&lt;/h3&gt;

&lt;p&gt;These are used to track the liveness of Slices.&lt;/p&gt;

&lt;p&gt;The top-level object is a TokenManager, which can hand out TokenTrackers and Tokens. Tokens are basically monotonically increasing serial numbers that can be outstanding or complete. Each TokenTracker tracks whether the TokenManager has any outstanding tokens issued before a cut-off serial number.&lt;/p&gt;
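&lt;p&gt;A toy model of the idea (our own sketch, not the actual BitFunnel API):&lt;/p&gt;

```javascript
// Toy model of TokenManager / TokenTracker; not the real BitFunnel API.
// Tokens are monotonically increasing serial numbers; a tracker asks
// whether any token issued before its cut-off is still outstanding.
class TokenManager {
  constructor() {
    this.nextSerial = 0;
    this.outstanding = new Set();
  }

  // Hand out a new token (represented here by just its serial number).
  issue() {
    const serial = this.nextSerial;
    this.nextSerial += 1;
    this.outstanding.add(serial);
    return serial;
  }

  // Mark a token complete.
  complete(serial) {
    this.outstanding.delete(serial);
  }

  // A "tracker" with cut-off c is done once no token with a serial
  // below c is still outstanding.
  trackerDone(cutoff) {
    for (const serial of this.outstanding) {
      if (cutoff > serial) {
        return false;
      }
    }
    return true;
  }
}

const manager = new TokenManager();
const a = manager.issue();                // serial 0
const b = manager.issue();                // serial 1
const cutoff = manager.nextSerial;        // track tokens issued before now
console.log(manager.trackerDone(cutoff)); // false: 0 and 1 outstanding
manager.complete(a);
manager.complete(b);
console.log(manager.trackerDone(cutoff)); // true: nothing outstanding
```

&lt;p&gt;This is the mechanism that lets the system know when every reader that might still touch a Slice has finished with it.&lt;/p&gt;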

&lt;h3 id=&#34;recycler&#34;&gt;Recycler&lt;/h3&gt;

&lt;p&gt;The Recycler is a rudimentary garbage collector for slices. When the Index is done with a Slice, the Slice gets Expired. Expiring a slice schedules it to be recycled by a Recycler. Recycling (i.e., destruction) occurs when all tokens related to the slice are expired, i.e., when all users (read threads) of the Slice are done with the Slice.&lt;/p&gt;

&lt;h3 id=&#34;termtreatment&#34;&gt;TermTreatment&lt;/h3&gt;

&lt;p&gt;A mapping from term characteristics (today, IDF and gram size) to RowConfiguration (i.e., the number of rows at each rank).&lt;/p&gt;
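&lt;p&gt;In other words, a TermTreatment is a function from term characteristics to a RowConfiguration. Here is a hypothetical sketch; the IDF cut-off and the row counts are invented for illustration, and only the shape of the mapping comes from this entry:&lt;/p&gt;

```javascript
// Hypothetical TermTreatment sketch. The IDF cut-off and row counts
// below are made-up values; only the shape (term characteristics in,
// rows-per-rank out) reflects the glossary entry.
function rowConfiguration(idf, gramSize) {
  if (gramSize > 1) {
    // Phrases (N-grams): an invented configuration.
    return { rank0: 2, rank3: 0 };
  }
  if (idf > 4.0) {
    // Rarer words: another invented configuration.
    return { rank0: 1, rank3: 0 };
  }
  // Common words: spread across ranks (again, invented numbers).
  return { rank0: 1, rank3: 2 };
}

console.log(rowConfiguration(6.0, 1).rank0); // 1
```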

&lt;h3 id=&#34;row&#34;&gt;Row&lt;/h3&gt;

&lt;h3 id=&#34;rowidsequence&#34;&gt;RowIdSequence&lt;/h3&gt;

&lt;h2 id=&#34;other-concepts&#34;&gt;Other concepts&lt;/h2&gt;

&lt;h3 id=&#34;allocator-iallocator&#34;&gt;Allocator / IAllocator&lt;/h3&gt;

&lt;h3 id=&#34;factories&#34;&gt;Factories&lt;/h3&gt;

&lt;h3 id=&#34;filesystem&#34;&gt;FileSystem&lt;/h3&gt;

&lt;h3 id=&#34;filemanager&#34;&gt;FileManager&lt;/h3&gt;

&lt;h3 id=&#34;sharddefinition&#34;&gt;ShardDefinition&lt;/h3&gt;

&lt;h3 id=&#34;streamconfiguration&#34;&gt;StreamConfiguration&lt;/h3&gt;

&lt;h3 id=&#34;interface-iinterface&#34;&gt;Interface / IInterface&lt;/h3&gt;

&lt;h3 id=&#34;configuration-iconfiguration&#34;&gt;Configuration / IConfiguration&lt;/h3&gt;

&lt;h3 id=&#34;document-idocument&#34;&gt;Document / IDocument&lt;/h3&gt;

&lt;h3 id=&#34;documentcache&#34;&gt;DocumentCache&lt;/h3&gt;

&lt;h3 id=&#34;documentdataschema&#34;&gt;DocumentDataSchema&lt;/h3&gt;

&lt;h3 id=&#34;documentfrequencytable&#34;&gt;DocumentFrequencyTable&lt;/h3&gt;

&lt;h3 id=&#34;factset&#34;&gt;FactSet&lt;/h3&gt;

&lt;h3 id=&#34;simpleindex&#34;&gt;SimpleIndex&lt;/h3&gt;

&lt;h3 id=&#34;slicebufferallocator&#34;&gt;SliceBufferAllocator&lt;/h3&gt;

&lt;h3 id=&#34;termtablebuilder&#34;&gt;TermTableBuilder&lt;/h3&gt;

&lt;h3 id=&#34;termtablecollection&#34;&gt;TermTableCollection&lt;/h3&gt;

&lt;h3 id=&#34;packedrowidsequence&#34;&gt;PackedRowIdSequence&lt;/h3&gt;

&lt;h3 id=&#34;abstractrow&#34;&gt;AbstractRow&lt;/h3&gt;

&lt;h3 id=&#34;querypipeline&#34;&gt;QueryPipeline&lt;/h3&gt;

&lt;h3 id=&#34;queryplanner&#34;&gt;QueryPlanner&lt;/h3&gt;

&lt;h3 id=&#34;rowmatchnode&#34;&gt;RowMatchNode&lt;/h3&gt;

&lt;h3 id=&#34;rowplan&#34;&gt;RowPlan&lt;/h3&gt;

&lt;h3 id=&#34;termmatchnode&#34;&gt;TermMatchNode&lt;/h3&gt;

&lt;h3 id=&#34;termmatchtreeevaulator&#34;&gt;TermMatchTreeEvaulator&lt;/h3&gt;

&lt;h3 id=&#34;termplan&#34;&gt;TermPlan&lt;/h3&gt;

&lt;h3 id=&#34;termplanconverter&#34;&gt;TermPlanConverter&lt;/h3&gt;

&lt;h3 id=&#34;chunkenumerator&#34;&gt;ChunkEnumerator&lt;/h3&gt;

&lt;h3 id=&#34;chunkingestor&#34;&gt;ChunkIngestor&lt;/h3&gt;

&lt;h3 id=&#34;chunkmanifestingestor&#34;&gt;ChunkManifestIngestor&lt;/h3&gt;

&lt;h3 id=&#34;chunkreader&#34;&gt;ChunkReader&lt;/h3&gt;

&lt;h3 id=&#34;chunktaskprocessor&#34;&gt;ChunkTaskProcessor&lt;/h3&gt;

&lt;h3 id=&#34;configuration&#34;&gt;Configuration&lt;/h3&gt;

&lt;h3 id=&#34;documentmap&#34;&gt;DocumentMap&lt;/h3&gt;

&lt;h3 id=&#34;termtotext&#34;&gt;TermToText&lt;/h3&gt;

&lt;h3 id=&#34;abstractrowenumerator&#34;&gt;AbstractRowEnumerator&lt;/h3&gt;

&lt;h3 id=&#34;bytecodeinterpreter&#34;&gt;ByteCodeInterpreter&lt;/h3&gt;

&lt;h3 id=&#34;compilenode&#34;&gt;CompileNode&lt;/h3&gt;

&lt;h3 id=&#34;matchtreerewriter&#34;&gt;MatchTreeRewriter&lt;/h3&gt;

&lt;h3 id=&#34;matchverifier&#34;&gt;MatchVerifier&lt;/h3&gt;

&lt;h3 id=&#34;planrows&#34;&gt;PlanRows&lt;/h3&gt;

&lt;h3 id=&#34;queryparser&#34;&gt;QueryParser&lt;/h3&gt;

&lt;h3 id=&#34;rankdowncompiler&#34;&gt;RankDownCompiler&lt;/h3&gt;

&lt;h3 id=&#34;rankzerocompiler&#34;&gt;RankZeroCompiler&lt;/h3&gt;

&lt;h3 id=&#34;registerallocator&#34;&gt;RegisterAllocator&lt;/h3&gt;

&lt;h3 id=&#34;simpleplanner&#34;&gt;SimplePlanner&lt;/h3&gt;

&lt;p&gt;A rudimentary query planner that only handles AND queries using the ByteCodeInterpreter. Takes a TermMatchNode tree, generates bytecode, and then runs the bytecode.&lt;/p&gt;

&lt;h2 id=&#34;todos&#34;&gt;TODOs:&lt;/h2&gt;

&lt;p&gt;This should be grouped into more than just &amp;ldquo;top level&amp;rdquo;, &amp;ldquo;important&amp;rdquo;, and &amp;ldquo;other&amp;rdquo;, but we&amp;rsquo;ve been saying we should do this for ages, so I&amp;rsquo;m putting this version out there just so there&amp;rsquo;s something.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How do we make onboarding to BitFunnel easier?</title>
      <link>http://bitfunnel.org/new-contributors/</link>
      <pubDate>Wed, 12 Oct 2016 15:02:02 -0700</pubDate>
      
      <guid>http://bitfunnel.org/new-contributors/</guid>
      <description>&lt;p&gt;I’ve been working on BitFunnel for roughly six months now. If I look at how I’ve used that time, my guess is that I’ve taken about a month of Mike’s time. If you look at &lt;a href=&#34;//bitfunnel.org/progress/&#34;&gt;the progress we’ve made&lt;/a&gt;, I think that’s a pretty good trade-off, but it doesn’t make for a scalable open source project.&lt;/p&gt;

&lt;p&gt;It makes sense to invest a month of time in a full-time employee since they’re likely to be around for at least a year or two, and even in the case of an extraordinarily bad fit, they’ll probably stick around for long enough that you’ll get your time-investment back. But if it takes a month of your time to onboard a new open source contributor, that’s a losing proposition.&lt;/p&gt;

&lt;p&gt;Mike and I often talk about &lt;a href=&#34;//bitfunnel.org/on-the-road-to-open-source/&#34;&gt;his experience trying to help some MSR folks use BitFunnel&lt;/a&gt; as an example of the kind of thing we’d like to fix. We’ve made a lot of progress on that front; we now tend to write up tutorials of how to run the system as it exists when we make progress on adding functionality. Those documents are often a week or two behind the current code, but that’s not bad.&lt;/p&gt;

&lt;p&gt;But if you look at the difficulty of not just running BitFunnel, but trying to actively contribute to it, that’s not so great. Things have improved a lot, but the onboarding process is still pretty rough. Mike has been pretty good about writing &lt;a href=&#34;http://bitfunnel.org/series/design/&#34;&gt;design notes&lt;/a&gt; for components, but I haven’t been keeping up, and even if I were keeping up with Mike’s production of design notes, we might have eight of those documents instead of four, in a system where we have roughly 250 classes and are creating new classes all the time. That isn’t quite a fair comparison, because a design note can discuss multiple classes, but it gives you a rough idea of how much of the system doesn’t have design docs.&lt;/p&gt;

&lt;p&gt;On my end, I can try to help by at least catching up to Mike’s production of design notes. I’ll also try to write a glossary, so that things that we haven’t deeply documented still have some explanation. I don’t think that’s enough, but I’m not sure what else to do.&lt;/p&gt;

&lt;p&gt;We try to keep a list of issues &lt;a href=&#34;https://github.com/BitFunnel/BitFunnel/labels/easy&#34;&gt;tagged easy&lt;/a&gt; and we’ve gotten some pull requests as a result. That’s great, but there’s a huge gap between being able to fix easy issues and being able to make major contributions. We’d like to make it easier to bridge the gap between fixing small issues and making large changes, but we’re not sure how to do that.&lt;/p&gt;

&lt;p&gt;If you have any suggestions for how we can improve the situation for new contributors, &lt;a href=&#34;https://github.com/bitfunnel/bitfunnel/issues&#34;&gt;please let us know&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Debugging an SEH Crash</title>
      <link>http://bitfunnel.org/debugging-an-seh-crash/</link>
      <pubDate>Mon, 10 Oct 2016 15:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/debugging-an-seh-crash/</guid>
      <description>&lt;p&gt;Here&amp;rsquo;s a video showing how I debugged a read access violation
that was caused by an earlier buffer overflow. This sort of problem
can sometimes be hard to track down, but in this case, a data breakpoint
made my job easier.&lt;/p&gt;

&lt;p&gt;The video discusses the BlockAllocator, Slice buffers and the Row Tables they contain.
If you&amp;rsquo;d like to try diagnosing the bug yourself, just checkout the
&lt;a href=&#34;https://github.com/BitFunnel/BitFunnel/commits/SEHBug&#34;&gt;SEHBug&lt;/a&gt; branch of the
&lt;a href=&#34;https://github.com/BitFunnel/BitFunnel&#34;&gt;BitFunnel&lt;/a&gt; repository.&lt;/p&gt;


&lt;div style=&#34;position: relative; padding-bottom: 56.25%; padding-top: 30px; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;//www.youtube.com/embed/DaIk2vJajpk&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%;&#34; allowfullscreen frameborder=&#34;0&#34;&gt;&lt;/iframe&gt;
 &lt;/div&gt;

</description>
    </item>
    
    <item>
      <title>When will BitFunnel be usable?</title>
      <link>http://bitfunnel.org/progress/</link>
      <pubDate>Tue, 04 Oct 2016 00:00:01 -0700</pubDate>
      
      <guid>http://bitfunnel.org/progress/</guid>
      <description>&lt;p&gt;How long should we expect this project to take? In theory, we should have a relatively easy time guessing how long this project will take because this project is a half-port-half-rewrite whose aim to produce an open source version that&amp;rsquo;s simpler than the internal version of the project, and we know how big the original project is.&lt;/p&gt;

&lt;p&gt;If we do &lt;code&gt;find . -name &amp;quot;*.h&amp;quot; -o -name &amp;quot;*.cpp&amp;quot; | grep -v NativeJIT | xargs wc&lt;/code&gt; on the original project to count all lines of code except NativeJIT, we get roughly 144k lines of code. I&amp;rsquo;m excluding NativeJIT because that was ported separately from the &lt;a href=&#34;https://github.com/bitfunnel/bitfunnel/&#34;&gt;BitFunnel&lt;/a&gt; repo, so our extrapolation should exclude that.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;re currently at about 53kLOC in the new BitFunnel repo. If we graph the progress, we can see that it&amp;rsquo;s been roughly linear since May.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://bitfunnel.org/progress/bitfunnel-progress.png&#34; width=&#34;1600&#34; height=&#34;1200&#34;&gt;&lt;/p&gt;

&lt;p&gt;The date is on the x-axis and lines of code are on the y-axis. It&amp;rsquo;s a bit surprising to me that the progress looks so linear. We&amp;rsquo;ve had periods where I&amp;rsquo;ve been busy with non-coding duties and Mike has done the vast majority of the coding, and we&amp;rsquo;ve had periods where Mike&amp;rsquo;s been busy with non-coding duties and I&amp;rsquo;ve been doing the vast majority of the coding. Despite the wildly varying coding workload we&amp;rsquo;ve taken on at times, when you average everything out, progress has been approximately linear.&lt;/p&gt;

&lt;p&gt;I don&amp;rsquo;t expect this to continue indefinitely &amp;ndash; once we get to the point where we have enough of a system stood up so that we can run experiments, progress as measured in lines of code should slow down. We should also see some slowdown when we do integration work and integration testing with whatever we&amp;rsquo;re going to integrate with, which will probably be a lot of work but not much code. On top of that, we&amp;rsquo;ll probably enter a slow period as the holiday season rolls around. Additionally, the lines of code in the new project are somewhat differently scaled than the lines of code in the old project because we&amp;rsquo;ve been adding a license at the top of most files. With all those disclaimers aside, if we guess that we&amp;rsquo;ll end up with somewhere between 1/2x and 1x as much code as the original project, we can make a crude estimate of how long it will take to &amp;ldquo;finish&amp;rdquo; BitFunnel:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://bitfunnel.org/progress/bitfunnel-extrapolation.png&#34; width=&#34;1600&#34; height=&#34;1200&#34;&gt;&lt;/p&gt;

&lt;p&gt;This is the same graph as before, but with a red horizontal line at the size of the old BitFunnel project and a green horizontal line at half the size of the old BitFunnel project. If we believe the linear estimate, we might be &amp;ldquo;done&amp;rdquo; anywhere between late this year and next July. If we take all of the caveats listed above into account, it&amp;rsquo;s likely that we won&amp;rsquo;t have something &amp;ldquo;complete&amp;rdquo; this calendar year. Beyond that, the error bars are so large that it&amp;rsquo;s hard to say much except that it&amp;rsquo;s plausible that we&amp;rsquo;ll have something &amp;ldquo;complete&amp;rdquo; by the end of the next calendar year.&lt;/p&gt;
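&lt;p&gt;For concreteness, here is the arithmetic behind that estimate. The current size (53 kLOC) and the 144 kLOC target come from this post; the monthly rate is an assumed round number, not a measured velocity:&lt;/p&gt;

```javascript
// Back-of-the-envelope schedule estimate. 53 kLOC and 144 kLOC come
// from this post; the 10 kLOC/month rate is an assumption for
// illustration, not a measured figure.
function monthsToTarget(currentKloc, targetKloc, klocPerMonth) {
  return (targetKloc - currentKloc) / klocPerMonth;
}

const current = 53;   // kLOC in the new repo today
const half = 144 / 2; // green line: half the original size
const full = 144;     // red line: the full original size

console.log(monthsToTarget(current, half, 10)); // 1.9
console.log(monthsToTarget(current, full, 10)); // 9.1
```

&lt;p&gt;With that assumed rate, the two crossings land in roughly the window described above: a couple of months to reach the half-size line and the better part of a year to reach the full-size line.&lt;/p&gt;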
</description>
    </item>
    
    <item>
      <title>All&#39;s Well That Ends Well</title>
      <link>http://bitfunnel.org/alls-well-that-ends-well/</link>
      <pubDate>Sat, 24 Sep 2016 15:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/alls-well-that-ends-well/</guid>
      <description>

&lt;p&gt;We&amp;rsquo;ve been having some stability problems of late.
In our rush to get some minimal version of the document ingestion pipeline
up and running, we created a number of tools for gathering corpus statistics
and configuring term tables and we built an
&lt;a href=&#34;http://bitfunnel.org/index-build-tools/&#34;&gt;interactive REPL console&lt;/a&gt; to help
our readers better understand the system. These tools are mostly system
integrations, and as such, are not covered by unit tests. In recent days we&amp;rsquo;ve
found these integrations to be broken more often than working.&lt;/p&gt;

&lt;p&gt;While it feels great to be pouring a lot of concrete, we&amp;rsquo;ve decided to pause
in order to shore up our foundations. One focus is to develop tests for the
integration code.&lt;/p&gt;

&lt;p&gt;The challenge with writing these tests stems from file system operations.
The &lt;code&gt;BitFunnel statistics&lt;/code&gt; command, for example, reads a number of configuration files
and &lt;a href=&#34;http://bitfunnel.org/corpus-file-format&#34;&gt;BitFunnel chunks&lt;/a&gt; from disk, and then after a bit of computation
writes out a bunch of intermediate files like histograms and document frequency tables.
As we continue to bring over new modules and functionality, the number of
configuration, data, and intermediate files will only grow.&lt;/p&gt;

&lt;p&gt;We could just write tests that rely on the filesystem, but
&lt;em&gt;as a general rule, I avoid writing tests that access the filesystem&lt;/em&gt;.
My rationale relates to system configuration, developer data safety,
and test brittleness.
Let&amp;rsquo;s consider each of these.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Configuration&lt;/strong&gt;
In order for a test to access a file, the system must be configured correctly.
This means that the test needs to know the path to the file and, in the case of
a read operation, the file itself must exist and have the right permissions.
In the case of a write operation, the path to the file must exist or be created and
the test needs some policy for handling the case where it wants to overwrite
an existing file (such as the partial file left over from the previous test run
which crashed).&lt;/p&gt;

&lt;p&gt;All of these problems are small. We could easily add a step to the
&lt;a href=&#34;https://github.com/BitFunnel/BitFunnel/blob/master/README.md&#34;&gt;README.md&lt;/a&gt;
instructing the developer to setup an environment variable with the path
to the test files. We could add a post-build step that creates the right
directories and copies the required data files. We could use an OS
specific temp directory generator. We could run tests in containers.
All of these would work - the problem is that they add up over time
to make the project hard to use.&lt;/p&gt;

&lt;p&gt;Our goal is ease of use - the ideal onboarding experience
is to clone the repo, install one or two tools
(like the C++ compiler and CMake) and then kick off a build that works 100%
of the time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Safety&lt;/strong&gt;
There&amp;rsquo;s another problem with tests that access the filesystem and I&amp;rsquo;ve been
bitten by this more than once. What happens is some well intentioned developer
writes a piece of code that &amp;ldquo;cleans&amp;rdquo; the test directory in preparation for the
next test run. Then through a combination of string handling bugs or poorly
chosen file names, real files, having nothing to do with the test, end up getting
clobbered. Or maybe the test itself overwrites that great American novel draft
you&amp;rsquo;ve been working on for years. These situations always lead to tears and usually
the well intentioned developer suggests it was your fault for storing important
files in whatever directory seemed like an obvious choice for test output.
Or it was your fault for not setting $TEMP before the test did an &lt;code&gt;rm -rf $TEMP/*&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test Brittleness&lt;/strong&gt;
Let&amp;rsquo;s face it - working with files is hard. It is not because the code is
complex - the problem is that files are outside of the process sandbox.
Anyone can mess with a file. Maybe a virus checker quarantined your file.
Maybe a previous test run was done with elevated permissions and now the
current test run can&amp;rsquo;t overwrite the old files. Maybe you didn&amp;rsquo;t escape
characters properly in the file name or you generated a temp path that was
too long. Maybe a zombie process is holding a write handle. Maybe it works
on the PC, but not the Mac.&lt;/p&gt;

&lt;p&gt;There are a million reasons why tests that interact with the filesystem become
brittle. We want developers to run tests early and run tests often. If the
tests are fast and 100% reliable, we will enter a virtuous cycle. If the tests
are flaky or slow, developers will stop running them and relying on them
and we end up in a vicious cycle.&lt;/p&gt;

&lt;h3 id=&#34;the-bard-comes-to-my-rescue&#34;&gt;The Bard Comes to My Rescue&lt;/h3&gt;

&lt;p&gt;Developing self-contained integration tests that don&amp;rsquo;t hit the filesystem
will take some time. My first challenge is to find a replacement for the
17k Wikipedia pages we&amp;rsquo;re using for today&amp;rsquo;s tests. My criteria for the test
corpus are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small enough to embed in a C++ source file.&lt;/li&gt;
&lt;li&gt;Large enough to support interesting scenarios.&lt;/li&gt;
&lt;li&gt;Permissive license compatible with the MIT License.&lt;/li&gt;
&lt;li&gt;Text makes some amount of sense to humans.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I finally came up with was
&lt;a href=&#34;https://en.wikipedia.org/wiki/William_Shakespeare&#34;&gt;Shakespeare&amp;rsquo;s&lt;/a&gt; Sonnets.
There are 154 of them,
they fit into about 172KB, they are plain text, and they are in the public
domain.&lt;/p&gt;

&lt;p&gt;My next step was to convert the Bard&amp;rsquo;s immortal words into C++ code.
Here&amp;rsquo;s his source code, from the 1609 quarto entitled
&lt;a href=&#34;https://en.wikipedia.org/wiki/Shakespeare%27s_sonnets&#34;&gt;&amp;ldquo;SHAKE-SPEARES SONNETS. Never before Imprinted.&amp;rdquo;&lt;/a&gt;&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/alls-well-that-ends-well/sonnet2.jpg&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;Fortunately &lt;a href=&#34;https://www.gutenberg.org/&#34;&gt;Project Gutenberg&lt;/a&gt;
did the heavy lifting, converting scans of the original text
to &lt;a href=&#34;http://www.gutenberg.org/cache/epub/1041/pg1041.txt&#34;&gt;ASCII&lt;/a&gt;
while updating the spelling:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When forty winters shall besiege thy brow,&lt;br /&gt;
And dig deep trenches in thy beauty&amp;rsquo;s field,&lt;br /&gt;
Thy youth&amp;rsquo;s proud livery so gazed on now,&lt;br /&gt;
Will be a tatter&amp;rsquo;d weed of small worth held:&lt;br /&gt;
Then being asked, where all thy beauty lies,&lt;br /&gt;
Where all the treasure of thy lusty days;&lt;br /&gt;
To say, within thine own deep sunken eyes,&lt;br /&gt;
Were an all-eating shame, and thriftless praise.&lt;br /&gt;
How much more praise deserv&amp;rsquo;d thy beauty&amp;rsquo;s use,&lt;br /&gt;
If thou couldst answer &amp;lsquo;This fair child of mine&lt;br /&gt;
Shall sum my count, and make my old excuse,&amp;rsquo;&lt;br /&gt;
Proving his beauty by succession thine!&lt;br /&gt;
This were to be new made when thou art old,&lt;br /&gt;
And see thy blood warm when thou feel&amp;rsquo;st it cold.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this point my job was to tokenize the text and then
write it out as C++ string literals in
&lt;a href=&#34;http://bitfunnel.org/corpus-file-format&#34;&gt;BitFunnel chunk format&lt;/a&gt;.
I could have used the
&lt;a href=&#34;https://github.com/BitFunnel/Workbench/blob/master/README.md&#34;&gt;Workbench Tool&lt;/a&gt;
or &lt;a href=&#34;https://lucene.apache.org/&#34;&gt;Lucene&lt;/a&gt;,
but these seemed like giant hammers for a really small nail.
In the end I wrote a little Node.js app to do the work.&lt;/p&gt;

&lt;p&gt;The process was mostly straightforward. I did
need to use some care with the single quotes. Sometimes a
single quote was used in a contraction like &amp;ldquo;feel&amp;rsquo;st&amp;rdquo; or
&amp;ldquo;tatter&amp;rsquo;d&amp;rdquo; or a possessive like &amp;ldquo;beauty&amp;rsquo;s&amp;rdquo; or &amp;ldquo;youth&amp;rsquo;s&amp;rdquo;.
In these cases, I wanted to keep the quote as part of the token.&lt;/p&gt;

&lt;p&gt;Other times the single quote was used to demarcate a phrase,
as in&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&amp;lsquo;This fair child of mine&lt;br /&gt;
Shall sum my count, and make my old excuse,&amp;rsquo;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These quotes should never be part of a token. My strategy was to
first replace all of the interesting quotes with a sentinel character.
I used the &lt;code&gt;&#39;#&#39;&lt;/code&gt; character since it didn&amp;rsquo;t appear elsewhere in the corpus.
Once these quotes were safely marked, I removed all of the remaining quotes
and other punctuation. Then I replaced each &lt;code&gt;&#39;#&#39;&lt;/code&gt; with a single quote.
Here&amp;rsquo;s the code I used to clean and tokenize each line.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-js&#34;&gt;function ProcessLine(input) {
    // Convert input to lower case.
    var a = input.toLowerCase();

    // Use hash to temporarily mark single quotes used in contractions.
    var b = a.replace(/(\w)&#39;(\w)/g, &amp;quot;$1#$2&amp;quot;);

    // Remove all punctuation, including remaining single quotes.
    var c = b.replace(/[,.!?:;&#39;]/g, &amp;quot;&amp;quot;);

    // Convert contraction markers back to single quotes.
    var d = c.replace(/#/g, &amp;quot;&#39;&amp;quot;);

    // Replace spaces with word-end markers.
    var e = d.replace(/[ ]/g, &amp;quot;\\0&amp;quot;) + &amp;quot;\\0&amp;quot;;

    return e;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Outputting the C++ code was mostly straightforward.
The only hitch involved concatenating octal escape codes with Arabic numerals
in the C string literals. Take a look at the sample output below. The third
line, &lt;code&gt;&amp;quot;01\0Sonnet\0&amp;quot; &amp;quot;2\0\0&amp;quot;&lt;/code&gt; had to be broken into two adjacent string
literals in order to keep the &lt;code&gt;&amp;quot;\0&amp;quot;&lt;/code&gt; after &lt;code&gt;&amp;quot;Sonnet&amp;quot;&lt;/code&gt; from concatenating with
the &lt;code&gt;&amp;quot;2&amp;quot;&lt;/code&gt; to form the octal literal &lt;code&gt;&amp;quot;\02&amp;quot;&lt;/code&gt;. Fortunately this situation only
appeared in the titles so it was easy to special case the treatment.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;char const * sonnet2 = 
    &amp;quot;0000000000000002\0&amp;quot;
    &amp;quot;02\0https://en.wikipedia.org/wiki/Sonnet_2\0\0&amp;quot;
    &amp;quot;01\0Sonnet\0&amp;quot; &amp;quot;2\0\0&amp;quot;
    &amp;quot;00\0&amp;quot;
    &amp;quot;when\0forty\0winters\0shall\0besiege\0thy\0brow\0&amp;quot;
    &amp;quot;and\0dig\0deep\0trenches\0in\0thy\0beauty&#39;s\0field\0&amp;quot;
    &amp;quot;thy\0youth&#39;s\0proud\0livery\0so\0gazed\0on\0now\0&amp;quot;
    &amp;quot;will\0be\0a\0tatter&#39;d\0weed\0of\0small\0worth\0held\0\0&amp;quot;
    &amp;quot;then\0being\0asked\0where\0all\0thy\0beauty\0lies\0&amp;quot;
    &amp;quot;where\0all\0the\0treasure\0of\0thy\0lusty\0days\0\0&amp;quot;
    &amp;quot;to\0say\0within\0thine\0own\0deep\0sunken\0eyes\0&amp;quot;
    &amp;quot;were\0an\0all-eating\0shame\0and\0thriftless\0praise\0&amp;quot;
    &amp;quot;how\0much\0more\0praise\0deserv&#39;d\0thy\0beauty&#39;s\0use\0&amp;quot;
    &amp;quot;if\0thou\0couldst\0answer\0this\0fair\0child\0of\0mine\0&amp;quot;
    &amp;quot;shall\0sum\0my\0count\0and\0make\0my\0old\0excuse\0&amp;quot;
    &amp;quot;proving\0his\0beauty\0by\0succession\0thine\0&amp;quot;
    &amp;quot;this\0were\0to\0be\0new\0made\0when\0thou\0art\0old\0&amp;quot;
    &amp;quot;and\0see\0thy\0blood\0warm\0when\0thou\0feel&#39;st\0it\0cold\0&amp;quot;
    &amp;quot;\0\0&amp;quot;;
&lt;/code&gt;&lt;/pre&gt;
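&lt;p&gt;Our own sketch of that special case (not the actual generator code): if the text that follows a &lt;code&gt;\0&lt;/code&gt; terminator starts with an octal digit, close the literal and start an adjacent one:&lt;/p&gt;

```javascript
// Sketch of the octal-escape special case; not the actual generator.
// In a C++ literal, "\0" followed immediately by an octal digit (0-7)
// would be parsed as a longer octal escape like "\02", so we close the
// literal after "\0" and start an adjacent one for the digit.
function appendAfterNul(next) {
  if (/^[0-7]/.test(next)) {
    return '\\0" "' + next; // e.g. produces: \0" "2
  }
  return "\\0" + next;
}

// "Sonnet" plus appendAfterNul("2") yields the split form seen above:
console.log('"01\\0Sonnet' + appendAfterNul("2") + '\\0\\0"');
// "01\0Sonnet\0" "2\0\0"
```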

&lt;h3 id=&#34;putting-it-all-together&#34;&gt;Putting it all together.&lt;/h3&gt;

&lt;p&gt;Translating a small corpus into C++ string literals was a good first step.
Making the end-to-end integration test required that I also virtualize all of the
filesystem interactions, but this is a tale for another post.&lt;/p&gt;

&lt;p&gt;One nice outcome of this work is that the build now generates an example that
automatically configures and runs an index with no requirement to download
corpus files.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s called &lt;code&gt;TheBard&lt;/code&gt;. It runs the corpus statistics gathering
stage on the sonnets, then builds a &lt;code&gt;TermTable&lt;/code&gt;, and then boots up an interactive
BitFunnel REPL console.&lt;/p&gt;

&lt;p&gt;There are only a few command-line arguments and they happen to be optional.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% TheBard -help
TheBard
A small end-to-end index configuration and ingestion example based on 
154 Shakespeare sonnets.

Usage:
./TheBard [-help]
          [-verbose]
          [-gramsize &amp;lt;integer&amp;gt;]

[-help]
    Display help for this program. (boolean, defaults to false)


[-verbose]
    Print information gathered during statistics and termtable stages. 
    (boolean, defaults to false)


[-gramsize &amp;lt;integer&amp;gt;]
    Set the maximum ngram size for phrases. (integer, defaults to 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here&amp;rsquo;s a sample session. In this case I didn&amp;rsquo;t supply the &lt;code&gt;-gramsize&lt;/code&gt; parameter
so we&amp;rsquo;ll be working with an index of unigrams.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% TheBard
Initializing RAM filesystem.
Gathering corpus statistics.
Building the TermTable.
Index is now configured.

Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.)

directory = &amp;quot;config&amp;quot;
gram size = 1

Starting index ...
Index started successfully.

Type &amp;quot;help&amp;quot; to get started.

0: help
Available commands:
  cache   Ingests documents into the index and also stores them in a cache
          for query verification purposes.
  delay   Prints a message after certain number of seconds
  help    Displays a list of available commands.
  load    Ingests documents into the index
  query   Process a single query or list of queries. (TODO)
  quit    waits for all current tasks to complete then exits.
  script  Runs commands from a file.(TODO)
  show    Shows information about various data structures. (TODO)
  status  Prints system status.
  verify  Verifies the results of a single query against the document cache.

Type &amp;quot;help &amp;lt;command&amp;gt;&amp;quot; for more information on a particular command.

1: cache manifest sonnets
Ingestion complete.

2: show rows blood
Term(&amp;quot;blood&amp;quot;)
                 d 000000000000000000000000000000000000000000000000000000
                 o 000000000111111111122222222223333333333444444444455555
                 c 123456789012345678901234567890123456789012345678901234
  RowId(3,  1111): 011000000010000001100000000000000000000000001000000000
  RowId(0,  1278): 010000000010000000100000000000000000000000000000000001

3: verify one blood
Processing query &amp;quot; blood&amp;quot;
  DocId(121)
  DocId(109)
  DocId(82)
  DocId(67)
  DocId(63)
  DocId(19)
  DocId(11)
  DocId(2)
8 match(es) out of 154 documents.

4: show rows shame
Term(&amp;quot;shame&amp;quot;)
                 d 000000000000000000000000000000000000000000000000000000
                 o 000000000111111111122222222223333333333444444444455555
                 c 123456789012345678901234567890123456789012345678901234
  RowId(3,  1058): 110000011100000000000000000000100111000000000000000000
  RowId(0,  1225): 110000011100010000000000000000000101000000000000000000

5: verify one shame
Processing query &amp;quot; shame&amp;quot;
  DocId(129)
  DocId(127)
  DocId(99)
  DocId(95)
  DocId(72)
  DocId(36)
  DocId(34)
  DocId(10)
  DocId(9)
  DocId(2)
10 match(es) out of 154 documents.

6: show rows tatter&#39;d
Term(&amp;quot;tatter&#39;d&amp;quot;)
                 d 000000000000000000000000000000000000000000000000000000
                 o 000000000111111111122222222223333333333444444444455555
                 c 123456789012345678901234567890123456789012345678901234
  RowId(3,  3349): 010000000000000000000000010000000000000000000010000000
  RowId(3,  3350): 010000000000000000000000010000000000000000000010000000
  RowId(3,  3351): 010000000000000000000000010000000000000000000010000000
  RowId(0,  1440): 010010010000000000010000010011011000000000000000000000

7: verify one tatter&#39;d
Processing query &amp;quot; tatter&#39;d&amp;quot;
  DocId(26)
  DocId(2)
2 match(es) out of 154 documents.

8: show rows love
Term(&amp;quot;love&amp;quot;)
                 d 000000000000000000000000000000000000000000000000000000
                 o 000000000111111111122222222223333333333444444444455555
                 c 123456789012345678901234567890123456789012345678901234
  RowId(0,  1019): 000000001100101000111110110010111111101101001110101000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At prompt 1, I enter &lt;code&gt;cache manifest sonnets&lt;/code&gt; which loads all 154 sonnets
into the index. I could have used &lt;code&gt;cache chunk sonnet0&lt;/code&gt; to load only the
first chunk of 11 sonnets. Note that I used the &lt;code&gt;cache&lt;/code&gt; command, instead
of the &lt;code&gt;load&lt;/code&gt; command. The difference is that the &lt;code&gt;cache&lt;/code&gt; command also
saves the IDocuments in a separate data structure that can be used to verify
queries processed by the BitFunnel engine.&lt;/p&gt;

&lt;p&gt;In prompts 2 and 3, I examine the rows associated with the word, &amp;ldquo;blood&amp;rdquo;
and then run a query verification to see which documents should
match. The &lt;code&gt;show rows&lt;/code&gt; command lists each of the RowIds associated with a term,
followed by the bits for the first 64 documents. The document ids are printed
vertically above each column of bits. In this example, we can see that
documents 2, 11, and 19 are likely to contain the word &amp;ldquo;blood&amp;rdquo; because
their columns contain only 1s. The &lt;code&gt;verify one&lt;/code&gt; command confirms these
columns are, in fact, matches, and not false positives.&lt;/p&gt;

&lt;p&gt;Prompts 4 and 5 repeat the experiment with the word, &amp;ldquo;shame&amp;rdquo;. This time we
see what appear to be matches in columns 1, 2, 8, 9, 10, 34, and 36.
The &lt;code&gt;verify one&lt;/code&gt; command shows that columns 1 and 8 are not actually
matches, but instead correspond to false
positives.&lt;/p&gt;

&lt;p&gt;Prompts 6 and 7 show &amp;ldquo;tatter&amp;rsquo;d&amp;rdquo;, a word that is considerably
rarer than &amp;ldquo;blood&amp;rdquo; and &amp;ldquo;shame&amp;rdquo;. Because &amp;ldquo;tatter&amp;rsquo;d&amp;rdquo; is rare, it requires
four rows to drive the noise down to acceptable levels.&lt;/p&gt;

&lt;p&gt;Contrast this with prompt 8, which looks at the word, &amp;ldquo;love&amp;rdquo;. Love appears in
so many documents that it must reside in its own, private row.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Searching for Primes</title>
      <link>http://bitfunnel.org/searching-for-primes/</link>
      <pubDate>Sat, 24 Sep 2016 15:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/searching-for-primes/</guid>
      <description>

&lt;p&gt;What do prime numbers have to do with BitFunnel?&lt;/p&gt;

&lt;p&gt;It turns out we use them to test our matching engine.
One of the challenges in bringing up a new search engine
is figuring out how to test it.
If you happen to have another working search engine
that has ingested the same corpus, you&amp;rsquo;re in luck - just compare its output
with that of your new search engine.&lt;/p&gt;

&lt;p&gt;Well, that&amp;rsquo;s the theory, anyway. In practice this is difficult for
a number of reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &amp;ldquo;oracle&amp;rdquo; search engine may not have indexed the same corpus as
the search engine under test.
As an example, it would be great to use a production Bing server as our
oracle, but no one knows, at any given moment, exactly which documents
are on a particular machine, and the set of documents is constantly changing.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;The two search engines may model documents differently. For example, the Bing
servers include tons of metadata and information about click streams
which isn&amp;rsquo;t meaningful to anyone outside of Bing. We could model all of this
information in a BitFunnel test, but it would involve a lot of code that was
only useful for the test.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;We&amp;rsquo;d like to make all of our tests available as open source,
so the data required to run the tests needs to be publicly available
and small enough to store on GitHub.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Down the road, we plan to configure an instance of Lucene as our oracle,
but today, we need a really small, lightweight test that can be used
for debugging and run before every commit.&lt;/p&gt;

&lt;h3 id=&#34;a-synthetic-corpus&#34;&gt;A Synthetic Corpus&lt;/h3&gt;

&lt;p&gt;Our solution was to generate a synthetic corpus. We wanted something with
the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Corpus should be trivial to generate.&lt;/li&gt;
&lt;li&gt;Arbitrarily large corpora can be constructed efficiently.&lt;/li&gt;
&lt;li&gt;Match verification algorithm should be trivial.&lt;/li&gt;
&lt;li&gt;Match verification should be fast.&lt;/li&gt;
&lt;li&gt;Can model phrases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, our goal is to test our query pipeline as it transforms
the Term tree into various Row trees, which become CompileNode trees,
which drop into an ICodeGenerator that yields native x64 code or
byte code for our interpreter.&lt;/p&gt;

&lt;p&gt;For these tests, we&amp;rsquo;re not concerned with the probabilistic nature of
BitFunnel - we just want to know if the matcher is computing the right
boolean expression over bits loaded from rows. We can easily eliminate
all probabilistic behavior by configuring the &lt;code&gt;TermTable&lt;/code&gt; to place
each term in its own, private row.&lt;/p&gt;

&lt;p&gt;Since these tests eliminate probabilistic behavior, there is no
requirement that our synthetic corpus have statistics that model a real
world corpus. We can use really wacky documents, as long as they
support enough interesting test cases.&lt;/p&gt;

&lt;h3 id=&#34;using-prime-factorizations&#34;&gt;Using Prime Factorizations&lt;/h3&gt;

&lt;p&gt;The solution we settled on was to model each document as containing
only those terms corresponding to the integers that make up the
prime factorization of the document&amp;rsquo;s id.&lt;/p&gt;

&lt;p&gt;As an example, document number 100 might look something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Title: 100&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;2 2 5 5&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and document 770 might look something like&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Title: 770&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;2 5 7 11&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and document 1223, which corresponds to a prime number, would have only a single term&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Title: 1223&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1223&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With this document structure, it is trivial to determine if a document
contains a specific prime number term. Here&amp;rsquo;s the code:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;bool Contains(size_t docId, size_t term)
{
    return (docId % term) == 0ull;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Phrase matches are easy to detect, as well, if we model the documents
as &lt;em&gt;ordered sequences&lt;/em&gt; of prime factors. All we need to do is ask whether
the sequence of terms that makes up the phrase is a subsequence of the
integers that make up the document&amp;rsquo;s prime factorization.&lt;/p&gt;

&lt;p&gt;Suppose, for example, that we&amp;rsquo;re looking for the phrase &amp;ldquo;2 5&amp;rdquo;. This is
equivalent to asking whether each document&amp;rsquo;s prime factorization sequence
contains the sequence &lt;code&gt;[2,5]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Consider the documents above.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document 100 is a match because &lt;code&gt;[2,5]&lt;/code&gt;
is a subsequence of &lt;code&gt;[2,2,5,5]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Document 770 is also a match because &lt;code&gt;[2,5]&lt;/code&gt;
is a subsequence of &lt;code&gt;[2,5,7,11]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Document 1223, on the other hand, is not a match because &lt;code&gt;[2,5]&lt;/code&gt; is
not a subsequence of &lt;code&gt;[1223]&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;implementation-details&#34;&gt;Implementation Details&lt;/h3&gt;

&lt;p&gt;The implementation turned out to be surprisingly simple &amp;ndash;
just under 200 lines of code in &lt;code&gt;PrimeFactorsDocument.cpp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Our first step was to create mock documents.
The function &lt;code&gt;CreatePrimeFactorsDocument&lt;/code&gt; just creates an off-the-shelf
&lt;code&gt;IDocument&lt;/code&gt; and then fills it with the prime factors of its &lt;code&gt;DocId&lt;/code&gt; using
calls to &lt;code&gt;IDocument::AddTerm()&lt;/code&gt;. Here&amp;rsquo;s the relevant fragment of code:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;for (size_t i = 0; i &amp;lt; Primes::c_primesBelow10000.size(); ++i)
{
    size_t p = Primes::c_primesBelow10000[i];
    if (p &amp;gt; docId)
    {
        break;
    }
    else
    {
        while ((docId % p) == 0)
        {
            auto const &amp;amp; term = Primes::c_primesBelow10000Text[i];
            document-&amp;gt;AddTerm(term.c_str());
            docId /= p;
            sourceByteSize += (1 + term.size());
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next we configured a TermTable to assign a private row to
each term corresponding to a prime number. We only included
mappings for primes up to the largest DocId. This TermTable
gives us the desired non-probabilistic behavior for Terms
corresponding to primes not exceeding the largest DocId.&lt;/p&gt;

&lt;p&gt;Terms corresponding to larger primes or composite numbers will be
implicitly mapped. The implicit rows, however, will contain only
zeros because none of the documents contain terms corresponding
to large primes or composite numbers.&lt;/p&gt;

&lt;p&gt;The consequence is that queries involving larger primes and composites
will never show probabilistic behavior and therefore never yield
false positives.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s an excerpt from the function &lt;code&gt;CreatePrimeFactorsTermTable()&lt;/code&gt;
which creates an &lt;code&gt;ITermTable&lt;/code&gt; and then provisions it with explicit
rows for the terms &amp;ldquo;0&amp;rdquo;, &amp;ldquo;1&amp;rdquo;, and each of the primes smaller than 10,000:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;for (size_t i = 0; i &amp;lt; Primes::c_primesBelow10000.size(); ++i)
{
    size_t p = Primes::c_primesBelow10000[i];
    if (p &amp;gt; maxDocId)
    {
        break;
    }
    else
    {
        auto text = Primes::c_primesBelow10000Text[i];

        termTable-&amp;gt;OpenTerm();
        termTable-&amp;gt;AddRowId(RowId(rank, explicitRowCount0++));
        termTable-&amp;gt;CloseTerm(Term::ComputeRawHash(text.c_str()));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Our last step was to create a mock index.
The function &lt;code&gt;CreatePrimeFactorsIndex()&lt;/code&gt; just creates an &lt;code&gt;ISimpleIndex&lt;/code&gt;
with the prime factors term table replacing the default. Then a simple
for-loop fills the index:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-c++&#34;&gt;for (DocId docId = 0; docId &amp;lt;= maxDocId; ++docId)
{
    auto document =
        Factories::CreatePrimeFactorsDocument(
            index-&amp;gt;GetConfiguration(),
            docId,
            maxDocId,
            streamId);
    index-&amp;gt;GetIngestor().Add(docId, *document);
}
&lt;/code&gt;&lt;/pre&gt;
</description>
    </item>
    
    <item>
      <title>Sample Data</title>
      <link>http://bitfunnel.org/sample-data/</link>
      <pubDate>Tue, 20 Sep 2016 15:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/sample-data/</guid>
      <description>

&lt;p&gt;I&amp;rsquo;ve been trying to make it really easy to get started with BitFunnel, but we
still have a ways to go. From the beginning we put a lot of effort into ensuring
our code would build and run on Linux, OSX, and Windows, and we set up CI on
&lt;a href=&#34;https://www.appveyor.com/&#34;&gt;Appveyor&lt;/a&gt; and &lt;a href=&#34;https://travis-ci.org/&#34;&gt;Travis&lt;/a&gt;
to help us quickly spot breaks on any OS. This has
kept the build in good shape, but it seems that the system is still hard to configure
and run, especially for those who don&amp;rsquo;t use it on a day-to-day basis.&lt;/p&gt;

&lt;p&gt;After some brainstorming, we decided it would be helpful to make a sample corpus
with all necessary configuration files available for download so that new users and
contributors could get the system up and running with just a few steps.&lt;/p&gt;

&lt;p&gt;The sample corpus consists of about 17k pages from the English version of Wikipedia.
This small slice of Wikipedia is manageable, yet large enough to demonstrate interesting
aspects of BitFunnel. Here are the download links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/enwiki-20160305-pages-articles1.xml-p000000010p000030302&#34;&gt;Wikipedia database dump file. (529MB)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz&#34;&gt;BitFunnel chunk and configuration files. (189MB)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first file is for reference and is not needed unless you want to reprocess the entire corpus
yourself from scratch.&lt;/p&gt;

&lt;p&gt;In most cases it suffices to download the second link which contains the files
necessary to run the BitFunnel index Read-Eval-Print-Loop (REPL). This download contains&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wikiextractor text output (265MB uncompressed)&lt;/li&gt;
&lt;li&gt;the corresponding BitFunnel chunk files (208MB uncompressed)&lt;/li&gt;
&lt;li&gt;corpus statistics and configuration files (51MB uncompressed)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;downloading-and-extracting-chunk-files&#34;&gt;Downloading and Extracting Chunk Files&lt;/h3&gt;

&lt;p&gt;You can download these files directly from your browser, or on Linux or OSX use the &lt;code&gt;wget&lt;/code&gt; and &lt;code&gt;tar&lt;/code&gt; commands.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% cd /tmp

% wget https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz
--2016-09-18 21:15:11--  https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz
Resolving bitfunnel.blob.core.windows.net... 13.93.168.88
Connecting to bitfunnel.blob.core.windows.net|13.93.168.88|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 198148563 (189M) [application/octet-stream]
Saving to: &#39;small-corpus.tar.gz&#39;

small-corpus.tar.gz          100%[==============================================&amp;gt;] 188.97M  1.67MB/s   in 1m 49s

2016-09-18 21:17:00 (1.74 MB/s) - &#39;small-corpus.tar.gz&#39; saved [198148563/198148563]

% tar -xvzf small-corpus.tar.gz
x chunks/
x chunks/AA/
x chunks/AA/wiki_00
x chunks/AA/wiki_01
x chunks/AA/wiki_02
...
x chunks/AC/wiki_56
x chunks/AC/wiki_57
x text/
x text/AA/
x text/AA/wiki_00
x text/AA/wiki_01
x text/AA/wiki_02
...
x text/AC/wiki_56
x text/AC/wiki_57

% ls -l wikipedia
total 0
drwxr-xr-x  5 michaelhopcroft  wheel  170 Jul 29 21:40 chunks
drwxr-xr-x  8 michaelhopcroft  wheel  272 Sep 18 16:09 config
drwxr-xr-x  5 michaelhopcroft  wheel  170 Jul 29 21:34 text
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;running-the-repl&#34;&gt;Running the REPL&lt;/h3&gt;

&lt;p&gt;Once the files have been downloaded and uncompressed, we&amp;rsquo;re ready to run the REPL.
The REPL is a subcommand of the BitFunnel executable which is located at &lt;code&gt;tools/BitFunnel/src&lt;/code&gt;
in the source tree. In the transcript below, I have set my path to point to the
BitFunnel executable. The only required parameter is the path to the config directory
that was created in the previous step.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% BitFunnel repl /tmp/wikipedia/config
Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.

directory = &amp;quot;/tmp/wikipedia/config&amp;quot;
gram size = 1

Starting index ...
Blocksize: 11005320
Index started successfully.

Type &amp;quot;help&amp;quot; to get started.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once the REPL console has started, we will load a single chunk file.
We use the &lt;code&gt;cache chunk&lt;/code&gt; command to ingest the documents from a single
chunk file. The &lt;code&gt;cache chunk&lt;/code&gt; command ingests documents like the
&lt;code&gt;load chunk&lt;/code&gt; command, but it also caches the IDocuments to assist in
verifying the correctness of the BitFunnel matching engine.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;0: cache chunk /tmp/wikipedia/chunks/AA/wiki_00
Ingesting chunk file &amp;quot;/tmp/wikipedia/chunks/AA/wiki_00&amp;quot;
Caching IDocuments for query verification.
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_00
Ingestion complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At this point, we&amp;rsquo;ve ingested &lt;code&gt;/tmp/wikipedia/chunks/AA/wiki_00&lt;/code&gt;
which contains the following 41 Wikipedia pages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;12: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=12&#34;&gt;Anarchism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;25: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=25&#34;&gt;Autism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;39: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=39&#34;&gt;Albedo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;128: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=128&#34;&gt;Talk:Atlas Shrugged&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;290: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=290&#34;&gt;A&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;295: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=295&#34;&gt;User:AnonymousCoward&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;303: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=303&#34;&gt;Alabama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;305: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=305&#34;&gt;Achilles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;307: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=307&#34;&gt;Abraham Lincoln&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;308: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=308&#34;&gt;Aristotle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;309: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=309&#34;&gt;An American in Paris&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;316: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=316&#34;&gt;Academy Award for Best Production Design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;324: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=324&#34;&gt;Academy Awards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;330: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=330&#34;&gt;Actrius&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;332: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=332&#34;&gt;Animalia (book)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;334: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=334&#34;&gt;International Atomic Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;336: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=336&#34;&gt;Altruism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;339: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=339&#34;&gt;Ayn Rand&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;340: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=340&#34;&gt;Alain Connes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;344: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=344&#34;&gt;Allan Dwan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;354: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=354&#34;&gt;Talk:Algeria&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;358: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=358&#34;&gt;Algeria&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;359: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=359&#34;&gt;List of Atlas Shrugged characters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;569: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=569&#34;&gt;Anthropology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;572: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=572&#34;&gt;Agricultural science&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;573: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=573&#34;&gt;Alchemy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;579: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=579&#34;&gt;Alien&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;580: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=580&#34;&gt;Astronomer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;582: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=582&#34;&gt;Talk:Altruism/Archive 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;586: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=586&#34;&gt;ASCII&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;590: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=590&#34;&gt;Austin (disambiguation)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;593: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=593&#34;&gt;Animation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;594: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=594&#34;&gt;Apollo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;595: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=595&#34;&gt;Andre Agassi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;597: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=597&#34;&gt;Austroasiatic languages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;599: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=599&#34;&gt;Afroasiatic languages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;600: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=600&#34;&gt;Andorra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;612: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=612&#34;&gt;Arithmetic mean&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;615: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=615&#34;&gt;American Football Conference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;620: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=620&#34;&gt;Animal Farm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;621: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=621&#34;&gt;Amphibian&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Handy tip: if you&amp;rsquo;d like to know which pages are in a chunk file,
run &lt;code&gt;grep&lt;/code&gt; on the corresponding wikiextractor file. For example, if you
are interested in knowing the contents of &lt;code&gt;/tmp/wikipedia/chunks/AA/wiki_00&lt;/code&gt;,
run &lt;code&gt;grep&lt;/code&gt; on &lt;code&gt;/tmp/wikipedia/text/AA/wiki_00&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;grep &amp;quot;&amp;lt;doc id=&amp;quot; /tmp/wikipedia/text/AA/wiki_00
&amp;lt;doc id=&amp;quot;12&amp;quot; url=&amp;quot;https://en.wikipedia.org/wiki?curid=12&amp;quot; title=&amp;quot;Anarchism&amp;quot;&amp;gt;
&amp;lt;doc id=&amp;quot;25&amp;quot; url=&amp;quot;https://en.wikipedia.org/wiki?curid=25&amp;quot; title=&amp;quot;Autism&amp;quot;&amp;gt;
&amp;lt;doc id=&amp;quot;39&amp;quot; url=&amp;quot;https://en.wikipedia.org/wiki?curid=39&amp;quot; title=&amp;quot;Albedo&amp;quot;&amp;gt;
...
&amp;lt;doc id=&amp;quot;620&amp;quot; url=&amp;quot;https://en.wikipedia.org/wiki?curid=620&amp;quot; title=&amp;quot;Animal Farm&amp;quot;&amp;gt;
&amp;lt;doc id=&amp;quot;621&amp;quot; url=&amp;quot;https://en.wikipedia.org/wiki?curid=621&amp;quot; title=&amp;quot;Amphibian&amp;quot;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let&amp;rsquo;s try running a query using the &lt;code&gt;verify one&lt;/code&gt; command to verify an
expression. Today this command runs a very slow verification query engine on the
IDocuments cached earlier by the &lt;code&gt;cache chunk&lt;/code&gt; command. In the future, &lt;code&gt;verify&lt;/code&gt;
will run the BitFunnel query engine and compare its output with the verification
query engine.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;1: verify one anarchy
Processing query &amp;quot; anarchy&amp;quot;
  DocId(307)
  DocId(12)
2 match(es) out of 41 documents.

2: verify one frog
Processing query &amp;quot; frog&amp;quot;
  DocId(621)
1 match(es) out of 41 documents.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As we can see, documents 12 and 307 contain the word, &amp;ldquo;anarchy&amp;rdquo; and document 621
contains the word, &amp;ldquo;frog&amp;rdquo;. Try running &lt;code&gt;verify one frog|anarchy&lt;/code&gt; and &lt;code&gt;verify one
frog anarchy&lt;/code&gt; (&lt;code&gt;AND&lt;/code&gt; is implicit if &lt;code&gt;OR&lt;/code&gt; isn&amp;rsquo;t specified). Did you get what you
expected?&lt;/p&gt;

&lt;p&gt;We don&amp;rsquo;t have the BitFunnel query pipeline ported yet, but you can examine the
rows associated with various terms using the &lt;code&gt;show rows&lt;/code&gt; command.  This command
lists each of the RowIds associated with a term, followed by the bits for the
first 64 documents. The document ids are printed vertically above each column of
bits.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;3: show rows anarchy
Term(&amp;quot;anarchy&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(3, 11507): 10000000110000001000010000000000000000000
  RowId(3, 11508): 10000000110000001000010000000000000000000
  RowId(3, 11509): 10000000110000001000010000000000000000000
  RowId(0,  5354): 11000001110000000000010001001000000000010

4: show rows frog
Term(&amp;quot;frog&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(3, 19624): 00000010000010000100000000000000000000001
  RowId(3, 19625): 00000010000010000100000000000000000000001
  RowId(3, 19626): 00000010000010000000001000000000000000001
  RowId(3, 19627): 00000010000010000000001000000000000000001
  RowId(0,  5465): 10000011100010000000001100001001011000001
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If we look at the output of &lt;code&gt;show rows anarchy&lt;/code&gt;, we see that the first
column, which corresponds to document 012, is completely filled with
1s, indicating a match. The second column, which corresponds to document
025, has some zeros, so it is not a match.&lt;/p&gt;

&lt;p&gt;There are also some false positives visible in the data. We know from running &lt;code&gt;verify one anarchy&lt;/code&gt;
that only documents 012 and 307 should match, but the query matrix above shows
all 1s in the columns for documents 308 and 358. Once we have finished porting the
document ingestion and query processing pipelines, we will turn our attention
to configuration changes that drive down the false positive rate.&lt;/p&gt;

&lt;p&gt;The goal of this post is to explain how to obtain and use the data files,
so the examples are minimal. To learn more about the BitFunnel repl, statistics builder,
and term table builder, see &lt;a href=&#34;http://bitfunnel.org/index-build-tools&#34;&gt;Index Build Tools&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>BitFunnel performance estimation</title>
      <link>http://bitfunnel.org/strangeloop/</link>
      <pubDate>Fri, 16 Sep 2016 10:10:54 -0700</pubDate>
      
      <guid>http://bitfunnel.org/strangeloop/</guid>
      <description>

&lt;style&gt;
.slide {border: 1px solid;}
&lt;/style&gt;

&lt;p&gt;&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-0.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Hi! I&amp;rsquo;m going to talk about two things today.&lt;/p&gt;

&lt;p&gt;First, I&amp;rsquo;m going to talk about one way to think about performance. That is, one way you can reason about performance.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-1.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Second, I&amp;rsquo;m going to talk about search. We&amp;rsquo;re going to look at search as a case study because, when talking about performance, it&amp;rsquo;s often useful to have something concrete to reason about. We could use any problem domain. However, I think that the algorithm we&amp;rsquo;re going to discuss today is particularly interesting because we use it in Bing, despite the fact that it&amp;rsquo;s in a class of algorithms that&amp;rsquo;s been considered obsolete for almost 20 years (at least as core search engine technology).&lt;/p&gt;

&lt;p&gt;&lt;small&gt;
&lt;em&gt;In case it&amp;rsquo;s not obvious, this is a pseudo-transcript of a talk given at StrangeLoop 2016. &lt;a href=&#34;https://www.youtube.com/watch?v=80LKF2qph6I&#34;&gt;See this link&lt;/a&gt; if you&amp;rsquo;d rather watch the video. I wrote this up before watching my talk, so the text probably doesn&amp;rsquo;t match the video exactly.&lt;/em&gt;
&lt;/small&gt;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-2.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
BTW, when I say performance, I don&amp;rsquo;t just mean speed (latency), or speed (throughput). We could also be talking about other aspects of performance like power. Although our example is going to be throughput oriented, the same style of reasoning works for other types of performance.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-3.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Why do we care about performance? One answer is that we usually don&amp;rsquo;t care because most applications are fast enough. That&amp;rsquo;s true! Most applications &lt;i&gt;are&lt;/i&gt; fast enough. Spending unnecessary time thinking about performance is often an error.&lt;/p&gt;

&lt;p&gt;However, when applications get larger, most applications become performance sensitive! This happens both because making a large application faster reduces its cost, and also because making a large application faster can increase its revenue. The second part isn&amp;rsquo;t intuitive to many people, but we&amp;rsquo;ll talk more about that later.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-4.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
How do we think about performance? It turns out that we can often reason about performance with simple arithmetic. For many applications, even applications that take years to build, it&amp;rsquo;s possible to estimate the performance before building the system with simple back-of-the-envelope calculations.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-5.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Here&amp;rsquo;s a popular tweet. It has 500 retweets! &amp;ldquo;Working code attracts people who want to code. Design documents attract people who want to talk.&amp;rdquo;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-6.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
I get it. Coding feels like real work. Meetings, writing docs, creating slide decks, and giving talks don&amp;rsquo;t feel like work.&lt;/p&gt;

&lt;p&gt;But when I look at outcomes, well, I often see two applications designed to do the same thing that were implemented with similar resources where one application is 10x or 100x faster than the other. And when I ask around and find out why, I almost inevitably find that the team that wrote the faster application spent a lot of time on design. I tend to work on applications that take a year or two to build, so let&amp;rsquo;s say we&amp;rsquo;re talking about something that took a year and a half. For a project of that duration, it&amp;rsquo;s not uncommon to spend months in the design phase before anyone writes any code that&amp;rsquo;s intended to be production code. And when I look at the slower application, the team that created the slower application usually had the idea that &amp;ldquo;meetings and whiteboarding aren&amp;rsquo;t real work&amp;rdquo; and jumped straight into coding.&lt;/p&gt;

&lt;p&gt;The problem is that if you have something that takes a year and a half to build, if you build it, measure the performance, and then decide to iterate, your iteration time is a year and a half, whereas on the whiteboard, it can be hours or days. Moreover, if you build a system without reasoning about what the performance should be, when you build the system and measure its performance, you&amp;rsquo;ll only know how fast it runs, not how fast it should run, so you won&amp;rsquo;t even know that you should iterate.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s common to hear advice like &amp;ldquo;don&amp;rsquo;t optimize early, just profile and then optimize the important parts after it works&amp;rdquo;. That&amp;rsquo;s fine advice for non-performance-critical systems, but it&amp;rsquo;s very bad advice for performance-critical systems, where you may find that you have to re-do the entire architecture to get as much performance out of the system as your machine can give you.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-7.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Before we talk about performance, let&amp;rsquo;s talk about scale. Because people often mean different things when they talk about scale, I&amp;rsquo;m going to be very concrete here.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-8.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Since we&amp;rsquo;re talking about search, let&amp;rsquo;s imagine a few representative corpus sizes we might want to search: ten thousand, ten million, and ten billion documents.&lt;/p&gt;

&lt;p&gt;And let&amp;rsquo;s assume that each document is 5kB. If we&amp;rsquo;re talking about the web, that&amp;rsquo;s a bit too small, and if we&amp;rsquo;re talking about email, that&amp;rsquo;s a bit too big, but you can scale this number to whatever corpus size you have.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-9.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
BTW, the specific problem we&amp;rsquo;re going to look at is: we have a corpus of documents that we want to be able to search, and we&amp;rsquo;re going to handle &lt;code&gt;AND&lt;/code&gt; queries.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-10.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
That is, queries of the form, I want &lt;em&gt;this&lt;/em&gt; word, and &lt;em&gt;this&lt;/em&gt; word, and &lt;em&gt;this&lt;/em&gt; word. For example, I want the words &lt;em&gt;large&lt;/em&gt; &lt;code&gt;AND&lt;/code&gt; &lt;em&gt;yellow&lt;/em&gt; &lt;code&gt;AND&lt;/code&gt; &lt;em&gt;dog&lt;/em&gt;. The systems we&amp;rsquo;ll look at today can handle &lt;code&gt;OR&lt;/code&gt;s and &lt;code&gt;NOT&lt;/code&gt;s, but those aren&amp;rsquo;t fundamentally different and talking about them will add complexity, so we&amp;rsquo;ll only look at &lt;code&gt;AND&lt;/code&gt; queries.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-11.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
First, let&amp;rsquo;s consider searching ten thousand documents at 5kB per doc.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-12.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If you want to get an idea of how big this is, you can think of this as email search (for one person) or forum search (for one forum) in a typical case.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-13.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
A &lt;code&gt;k&lt;/code&gt; times a &lt;code&gt;k&lt;/code&gt; is a million, and five times ten is fifty, so &lt;code&gt;5kB&lt;/code&gt; times ten thousand is &lt;code&gt;50MB&lt;/code&gt;.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-14.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&lt;code&gt;50MB&lt;/code&gt; is really small!
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-15.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Today, for $50, you can buy a phone off amazon that has &lt;code&gt;1GB&lt;/code&gt; of RAM. &lt;code&gt;50MB&lt;/code&gt; will easily fit in RAM, even on a low-end phone.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-16.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If our data set fits in RAM and we have &lt;code&gt;50MB&lt;/code&gt;, we can try the most naive thing possible and basically just grep through our data. If you want something more concrete, you can think of this as looping over all documents, and for each document, looping over all terms.&lt;/p&gt;

&lt;p&gt;Since we only need to handle &lt;code&gt;AND&lt;/code&gt; queries, we can keep track of all the terms we want, and if a document has all of the terms we want, we can add that to our list of matches.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
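&lt;p&gt;Here is a toy Python sketch of that naive approach (the documents and the whitespace tokenizer are made up for illustration):&lt;/p&gt;

```python
# Naive "grep" search: scan every document; a document matches an AND
# query when it contains every query term.
def and_query(documents, terms):
    terms = [t.lower() for t in terms]
    matches = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())   # crude whitespace tokenizer
        if all(t in words for t in terms):  # AND: every term must appear
            matches.append(doc_id)
    return matches

docs = {
    1: "a large yellow dog slept",
    2: "a small yellow cat",
    3: "the large dog was yellow",
}
print(and_query(docs, ["large", "yellow", "dog"]))  # → [1, 3]
```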
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-17.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Ok. So, for ten thousand documents, the most naive thing we can think of works. What about ten million documents?
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-18.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If you want to get a feel for how big ten million documents is, you can think of it as roughly wikipedia-sized. Today, English language wikipedia has about five million documents.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-19.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&lt;code&gt;5kB&lt;/code&gt; times ten million is &lt;code&gt;50GB&lt;/code&gt;. This is really close to wikipedia&amp;rsquo;s size &amp;ndash; today, wikipedia is a bit over &lt;code&gt;50GB&lt;/code&gt; (uncompressed articles in XML, no talk, no history).
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-20.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
We can&amp;rsquo;t fit that in RAM on a phone, and we&amp;rsquo;d need a pretty weird laptop to fit that in RAM on a laptop, but we can easily fit that in RAM on a low-budget server. Today, we can buy a $2000 server that has 128GB of RAM.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-21.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
What happens when we try to run our naive grep-like algorithm? Well, our cheap server can get &lt;code&gt;25GB/s&lt;/code&gt; of bandwidth&amp;hellip;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-22.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;hellip; and we have 50GB of data. That means that it takes two seconds to do one search query!&lt;/p&gt;

&lt;p&gt;And while we&amp;rsquo;re doing a query, we&amp;rsquo;re using all the bandwidth on the machine, so we can&amp;rsquo;t expect to do anything else on the machine while queries are running, including other queries. This implies that it takes two seconds to do a query, or that we get one-half a query per second, or &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;2&lt;/sub&gt; QPS.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
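&lt;p&gt;The arithmetic behind those numbers, written out (decimal units, same assumptions as the slides):&lt;/p&gt;

```python
# Back-of-the-envelope scan-time estimate from the talk.
doc_size = 5 * 10**3        # 5kB per document
num_docs = 10 * 10**6       # ten million documents
data = doc_size * num_docs  # 50GB of text to scan per query

bandwidth = 25 * 10**9      # 25GB/s of memory bandwidth

latency = data / bandwidth  # seconds per full scan
qps = 1 / latency           # queries per second
print(latency, qps)         # 2.0 seconds per query, 0.5 QPS
```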
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-23.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Is that ok? Is two seconds of latency ok? It depends.&lt;/p&gt;

&lt;p&gt;For many applications, that&amp;rsquo;s totally fine! I know a lot of devs who have an internal search tool (often over things like logs) that takes a second or two to return results. They&amp;rsquo;d like to get results back faster, but given the cost/benefit tradeoff, it&amp;rsquo;s not worth optimizing things more.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-24.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
How about &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;2&lt;/sub&gt; QPS? It depends.&lt;/p&gt;

&lt;p&gt;As with latency, a lot of devs I know have a search service that&amp;rsquo;s only used internally. If you have 10 or 20 devs typing in queries at keyboards, it&amp;rsquo;s pretty unlikely that they&amp;rsquo;ll exceed &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;2&lt;/sub&gt; QPS with manual queries, so there&amp;rsquo;s no point in creating a system that can handle more throughput.&lt;/p&gt;

&lt;p&gt;Our naive grep-like algorithm is totally fine for many search problems!
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-25.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
However, as services get larger, two seconds of latency can be a problem.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-26.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If we look at studies on latency and revenue, we can see a roughly linear relationship between latency and revenue over a pretty wide range of latencies.&lt;/p&gt;

&lt;p&gt;Amazon found that every 100ms of latency cost them more than 1% of revenue. Google once found that adding 500ms of latency, or half a second, cost them 20% of their users.&lt;/p&gt;

&lt;p&gt;This isn&amp;rsquo;t only true of large companies &amp;ndash; when Mobify looked at this, they also found that 100ms of latency cost them more than 1% of revenue. For them, 1% was only $300k or so. But even though I say &amp;ldquo;only&amp;rdquo;, that&amp;rsquo;s enough to pay a junior developer for a year. Latency can really matter!
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-27.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Here&amp;rsquo;s a query from some search engine. The result came back in a little over half a second. That includes the time it takes to register input on the local computer, figure out what to do with the input, send it across the internet, go into some set of servers somewhere, do some stuff, go back across the internet, come back into the local computer, do some more stuff, and then render the results.&lt;/p&gt;

&lt;p&gt;That&amp;rsquo;s a lot of stuff! If you do budgeting for a service like this and you want queries to have a half-second end-user round-trip budget, you&amp;rsquo;ll probably only leave tens of milliseconds to handle document matching on the machines that receive queries and tell you which documents matched the queries. Two seconds of latency is definitely not ok in that case.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-28.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Furthermore, for a service like Bing or Google, provisioning for &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;2&lt;/sub&gt; QPS is somewhat insufficient.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-29.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
What can we do? Maybe we can try using an index instead of grepping through all documents.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-30.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If we use an index, we can get widely varying performance characteristics. Asking what the performance is like if we &amp;ldquo;use an index&amp;rdquo; is like asking what the performance is like if we &amp;ldquo;use an algorithm&amp;rdquo;. It depends on the algorithm!&lt;/p&gt;

&lt;p&gt;Today, we&amp;rsquo;ll talk about how to get performance in the range of thousands to tens of thousands of queries per second, but first&amp;hellip;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-31.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;hellip; let&amp;rsquo;s finish our discussion about scale and talk about how to handle ten billion documents.&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ve said that we can, using some kind of index, serve ten million documents from one machine with performance that we find to be acceptable. So how about ten billion?
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-32.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
With ten billion documents at 5kB apiece, we&amp;rsquo;re looking at 50TB. While it&amp;rsquo;s possible to get a single machine with 50TB of RAM, this approach isn&amp;rsquo;t cost effective for most problems, so we&amp;rsquo;ll look at using multiple cheap commodity machines instead of one big machine.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-33.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Search is a relatively easy problem to scale horizontally; that is, it&amp;rsquo;s relatively easy to split a search index across multiple machines. One way to do this (and this isn&amp;rsquo;t the only possible way) is to put different documents on different machines. Queries then go to all machines, and the result is just the union of the results from all machines.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
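&lt;p&gt;A toy sketch of that document-sharded, scatter-gather scheme (the placement policy and the per-shard matcher here are made up for illustration):&lt;/p&gt;

```python
# Spread documents across shards; send the query to every shard and take
# the union of the per-shard results.
def build_shards(docs, num_shards):
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[doc_id % num_shards][doc_id] = text  # simple placement policy
    return shards

def shard_query(shard, terms):
    # Each shard answers the AND query over its own documents.
    return {d for d, text in shard.items()
            if all(t in text.split() for t in terms)}

def search(shards, terms):
    results = set()
    for shard in shards:  # a real system fans this out in parallel
        results |= shard_query(shard, terms)
    return results

docs = {0: "large yellow dog", 1: "yellow cat",
        2: "large dog", 3: "large yellow dog show"}
shards = build_shards(docs, 2)
print(sorted(search(shards, ["large", "yellow", "dog"])))  # → [0, 3]
```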
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-34.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Since we have ten billion documents, and we&amp;rsquo;re assuming that we can serve ten million documents on a machine, if we split up the index we&amp;rsquo;ll have a thousand machines.&lt;/p&gt;

&lt;p&gt;That&amp;rsquo;s ok, but if we have a cluster of a thousand machines and the cluster is in Redmond, and we have a customer in Europe, that could easily add 300ms of latency to the query. We&amp;rsquo;ve gone through all the effort of designing an index that can return a query in 10ms, and then we have customers that lose 300ms from having their queries go back and forth over the internet.&lt;/p&gt;

&lt;p&gt;Instead of having a single cluster, we can use multiple clusters all over the world to reduce that problem.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-35.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Say we use ten clusters. Then we have ten thousand machines.&lt;/p&gt;

&lt;p&gt;With ten thousand machines (or even with a thousand machines), we have another problem: given the failure rate of commodity hardware, with ten thousand machines, machines will be failing all the time. At any given time, in any given cluster, some machines will be down. If, for example, the machine that&amp;rsquo;s indexing cnn.com goes down and users who want to query that cluster can&amp;rsquo;t get results from CNN, that&amp;rsquo;s bad.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-36.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
In order to avoid the loss of sites from failures, we might triple the number of machines for redundancy, which puts us at thirty thousand machines.&lt;/p&gt;

&lt;p&gt;With thirty thousand machines, one problem we have is that we now have a distributed system. That&amp;rsquo;s a super interesting set of problems, but it&amp;rsquo;s beyond the scope of this talk.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-37.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Another problem is that we now have a service that costs a non-trivial amount of money to run. If a machine costs a thousand dollars per year (amortized cost, including the cost of building out datacenters, buying machines, and running the machines), that puts us at thirty-million dollars a year. By the way, a thousand dollars a year is considered to be a relatively low total amortized cost. Even if we can hit that low number, we&amp;rsquo;re still looking at thirty-million dollars a year.&lt;/p&gt;

&lt;p&gt;At thirty-million a year, if we can double the performance and halve the number of machines we need, that saves us fifteen-million a year. In fact, if we can even shave off one percent on the running time of a query, that would save three-hundred thousand dollars a year, saving enough money to pay a junior developer for an entire year.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
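&lt;p&gt;The cost arithmetic, written out:&lt;/p&gt;

```python
# Fleet cost estimate from the talk.
machines = 30_000
dollars_per_machine_year = 1_000  # amortized total cost per machine
fleet_cost = machines * dollars_per_machine_year  # $30M/year

halve_the_fleet = fleet_cost // 2    # double perf: save $15M/year
one_percent = fleet_cost // 100      # shave 1% off query time: ~$300k/year
print(fleet_cost, halve_the_fleet, one_percent)
```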
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-38.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Conventional wisdom often says that &amp;ldquo;machine time is cheaper than developer time, which means that you should use the most productive tools possible and not worry about performance&amp;rdquo;. That&amp;rsquo;s absolutely true for many applications. For example, that&amp;rsquo;s almost certainly true for any single-server rails app. But once you get to the point where you have thousands of machines per service, that logic is flipped on its head because machine time is more expensive than developer time.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-39.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Now that we&amp;rsquo;ve framed the discussion by talking about scale, let&amp;rsquo;s talk about search algorithms.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-40.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
The problem we&amp;rsquo;re looking at is: given a bunch of documents, how do we handle &lt;code&gt;AND&lt;/code&gt; queries?
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-41.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
The standard algorithm that people use for search indices is a posting list.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-42.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
A posting list is basically what a layperson would call an index.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s an index from the 1600s. If you look at the back of a book today, you&amp;rsquo;ll see the same thing: there&amp;rsquo;s a list of terms, and next to each term there&amp;rsquo;s a list of pages that term appears on.&lt;/p&gt;

&lt;p&gt;Computers don&amp;rsquo;t have pages in the same sense; if you want to imagine a simple version of a posting list, you can think of&amp;hellip;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-43.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;hellip;a hash map from terms to linked lists of document ids. That is, a hash map where the key is a term and the value is a list of document ids.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
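&lt;p&gt;A minimal posting-list sketch in Python (a set stands in for the linked list; the documents are made up for illustration):&lt;/p&gt;

```python
from collections import defaultdict

# Build the posting list: a map from term to the set of documents
# containing that term.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

# An AND query intersects the postings of its terms.
def and_query(index, terms):
    postings = [index[t] for t in terms]
    return sorted(set.intersection(*postings))

docs = {1: "large yellow dog", 2: "small yellow cat", 3: "large dog yellow"}
index = build_index(docs)
print(and_query(index, ["large", "yellow", "dog"]))  # → [1, 3]
```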
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-44.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
That&amp;rsquo;s one way to do it, and it&amp;rsquo;s standard. Another thing we could try to do is use Bloom filters.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-45.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
We do this in Bing in a system called BitFunnel. But before we can describe BitFunnel, we need to talk about how Bloom filters work.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-46.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
And before we talk about how Bloom filters work, let&amp;rsquo;s consider a more naive solution we might construct.&lt;/p&gt;

&lt;p&gt;One thing we might try would be to use something called an incidence matrix, that is, a 2d matrix where one dimension of the matrix is every single term we know about, and the other dimension is every single document we know about. Each entry in the matrix is a &lt;code&gt;1&lt;/code&gt; if the term is in the document, and it&amp;rsquo;s a &lt;code&gt;0&lt;/code&gt; if the term isn&amp;rsquo;t in the document.&lt;/p&gt;

&lt;p&gt;What will the performance of that be?
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
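&lt;p&gt;Here is what that incidence matrix looks like as a toy Python sketch (lists of 0/1 stand in for packed bit rows; the documents are made up for illustration):&lt;/p&gt;

```python
# One row of bits per term; one column per document. An AND query is a
# bitwise AND of the query terms' rows.
docs = ["large yellow dog", "small yellow cat",
        "large dog", "large yellow dog show"]

terms = sorted({t for d in docs for t in d.split()})
matrix = {t: [1 if t in d.split() else 0 for d in docs] for t in terms}

def and_query(matrix, num_docs, query_terms):
    result = [1] * num_docs
    for t in query_terms:
        result = [a & b for a, b in zip(result, matrix[t])]
    return [i for i, bit in enumerate(result) if bit]  # surviving columns

print(and_query(matrix, len(docs), ["large", "yellow", "dog"]))  # → [0, 3]
```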
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-47.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Well, first, how many terms are there? How many terms do you think are on the internet? And let&amp;rsquo;s say we shard the internet a zillion ways and serve tens of millions of documents per server. How many unique terms do we have per server?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;pause&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;someone shouts ten million&lt;/em&gt;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-48.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Turns out, when we do this, we can see tens of billions of terms per shard. This is often surprising to people. I&amp;rsquo;ve asked a lot of people this question, and people often guess that there are millions or billions of unique terms on the entire internet. But if you pick a random number under ten billion and search for it, you&amp;rsquo;re pretty likely to find it on the internet! So, there are probably more than ten billion terms on the internet!&lt;/p&gt;

&lt;p&gt;In fact, if you limit the search to just github, you can find a single document with about fifty-million primes! And if you look at the whole internet, you can find a site with all primes under one trillion, which is over thirty-billion primes! If that site lands in a single shard, that shard is going to have at least thirty-billion unique terms. Turns out, a lot of people put long mathematical sequences online.&lt;/p&gt;

&lt;p&gt;And in addition to numbers, there&amp;rsquo;s stuff that&amp;rsquo;s often designed to be unique, like catalog numbers, ID numbers, error codes, and GUIDs. Plus DNA! Really, DNA. Ok, DNA isn&amp;rsquo;t designed to be unique, but if you split it up into chunks of arbitrary numbers of characters, there&amp;rsquo;s a high probability that any N character chunk for N &amp;gt; 16 is unique.&lt;/p&gt;

&lt;p&gt;There&amp;rsquo;s a lot of this stuff! One question you might ask is, do you need to index that stuff? Does anyone really search for &lt;code&gt;GTGACCTTGGGCAAGTTACTTAACCTCTCTGTGCCTCAGTTTCCTCATCTGTAAAATGGGGATAATA&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;It turns out, that when you ask people to evaluate a search engine, many of them will try to imagine the weirdest queries they can think of, try those, and then choose the search engine that handles those queries better. It doesn&amp;rsquo;t matter that they never do those queries normally. Some real people actually evaluate search engines that way. As a result, we have to index all of this weird stuff if we want people to use our search engine.&lt;/p&gt;

&lt;p&gt;If we have tens of billions of terms, say we have thirty billion terms, how large is our incidence matrix? Even if we use a bit vector, one single document will take up thirty billion bits divided by eight, or 3.75GB. And that&amp;rsquo;s just one document!
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-49.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
How can we shrink that? Well, since most documents don&amp;rsquo;t contain most terms, we can hash terms down to a smaller space. Instead of reserving one slot for each unique term, we only need as many slots as we have terms in a document (times a constant factor which is necessary for bloom filter operation).
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-50.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
That&amp;rsquo;s basically what a bloom filter is! For the purposes of this talk, we can think of a bloom filter as a data structure that represents a set using a bit vector and a set of independent hash functions.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-51.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Here, we have the term &amp;ldquo;large&amp;rdquo; and we apply three independent hash functions, which hashes the term to locations five, seven, and twelve. Having three hash functions is arbitrary and we&amp;rsquo;ll talk about that tradeoff later.&lt;/p&gt;

&lt;p&gt;To insert &amp;ldquo;large&amp;rdquo; into the document, we&amp;rsquo;ll set bits five, seven, and twelve. To query for &amp;ldquo;large&amp;rdquo;, we&amp;rsquo;ll do the bitwise &lt;code&gt;AND&lt;/code&gt; of those locations. That is, we&amp;rsquo;ll check to see if all three locations are &lt;code&gt;1&lt;/code&gt;. If any location is a &lt;code&gt;0&lt;/code&gt;, the result will be &lt;code&gt;0&lt;/code&gt; (false) otherwise the result will be &lt;code&gt;1&lt;/code&gt; (true). For any term we&amp;rsquo;ve inserted, the query will be &lt;code&gt;1&lt;/code&gt; (true), because we&amp;rsquo;ve just set those bits.&lt;/p&gt;

&lt;p&gt;In this series of diagrams, any bit that&amp;rsquo;s colored is a &lt;code&gt;1&lt;/code&gt; and any bit that&amp;rsquo;s white is a &lt;code&gt;0&lt;/code&gt;. The red bits are associated with the term &amp;ldquo;large&amp;rdquo;.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-52.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
We can insert another term: &amp;ldquo;dog&amp;rdquo;. To do so, we&amp;rsquo;ll set those bits, one, seven, and ten. Seven was already set by &amp;ldquo;large&amp;rdquo; (red), but it&amp;rsquo;s fine to set it again with &amp;ldquo;dog&amp;rdquo;; all bits that are yellow are associated with the term &amp;ldquo;dog&amp;rdquo;. If we query for the term, as before, we&amp;rsquo;ll get a &lt;code&gt;1&lt;/code&gt; (true) because we&amp;rsquo;ve just set all the bits associated with the query.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-53.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
We can also try querying a term that we didn&amp;rsquo;t insert into the document. Let&amp;rsquo;s say we query for &amp;ldquo;cat&amp;rdquo;, which happens to hash to three, ten, and twelve.&lt;/p&gt;

&lt;p&gt;When we do the bitwise &lt;code&gt;AND&lt;/code&gt;, we first look at bit three. Since bit three is a zero, we already know that the result will be &lt;code&gt;0&lt;/code&gt; (false), so we don&amp;rsquo;t have to look at bits ten and twelve.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-54.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Let&amp;rsquo;s try querying another term, &amp;ldquo;box&amp;rdquo;, and let&amp;rsquo;s say that term hashes to one, five, and ten.&lt;/p&gt;

&lt;p&gt;Even if we don&amp;rsquo;t insert this term into the document, the query shows that the term is in the document because those bits were set by other terms. We have a false positive!
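&lt;p&gt;The whole insert/query mechanism fits in a few lines of code. Here&amp;rsquo;s a minimal sketch of a single-document bloom filter &amp;ndash; the bit positions come from Python&amp;rsquo;s salted &lt;code&gt;hash&lt;/code&gt;, not the hash functions in the slides:&lt;/p&gt;

```python
# Minimal single-document bloom filter: a bit vector plus three
# simulated independent hash functions.
NUM_BITS = 16
NUM_HASHES = 3
bits = [0] * NUM_BITS

def positions(term):
    # Salting the term with an index gives us three different hashes.
    return [hash((term, i)) % NUM_BITS for i in range(NUM_HASHES)]

def insert(term):
    for p in positions(term):
        bits[p] = 1

def query(term):
    # True only if every one of the term's bits is set; this is the
    # bitwise AND of the term's locations.
    return all(bits[p] for p in positions(term))

insert("large")
insert("dog")
assert query("large")  # inserted terms always match: no false negatives
# A term we never inserted usually returns False, but it can return
# True (a false positive) if other terms happened to set all its bits.
```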
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-55.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
How bad is this problem? Well, what&amp;rsquo;s the probability that any query will return a false positive?
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-56.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Let&amp;rsquo;s assume we have ten percent bit density. This is something we can control &amp;ndash; for example, if we have a bit vector of length 100, and we have ten terms, each of which is hashed to one location, we expect the bit density to be slightly less than 10%. It would be 10% if no terms hashed to the same location, but it&amp;rsquo;s possible that some terms might collide and hash to the same location.&lt;/p&gt;

&lt;p&gt;What&amp;rsquo;s the probability of a false positive if we hash to one location instead of three locations?&lt;/p&gt;

&lt;p&gt;If the term is actually in the document, then we&amp;rsquo;ll set the bit, and if we do a query, since the bit was set, we&amp;rsquo;ll definitely return true, so there&amp;rsquo;s no probability of a false negative.&lt;/p&gt;

&lt;p&gt;If the term isn&amp;rsquo;t in the document, then we haven&amp;rsquo;t set its associated bit because of this term &amp;ndash; so what&amp;rsquo;s the probability the bit is set anyway? Because our bit density is .1, or 10%, the probability is 10%.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-57.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
What if we hash to two locations instead of one location? Since we&amp;rsquo;re assuming we have uniform 10% bit density, we can multiply the probabilities: we get .1 * .1 = .01 = 1%.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-58.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
For three locations, the math is the same as before: .1 * .1 * .1 = .001 = 0.1%.&lt;/p&gt;

&lt;p&gt;As we hash to more locations, if we don&amp;rsquo;t increase the size of the bit vector, the bit density will go up. Same amount of space, set more bits, higher bit density. So we have to increase the number of bits, and we have to increase the number of bits linearly. As we increase the number of bits linearly, we get an exponential decrease in the probability of a false positive.
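&lt;p&gt;The arithmetic above is short enough to check directly. A small sketch, assuming the same uniform 10% bit density:&lt;/p&gt;

```python
# False-positive probability with uniform bit density, as a function
# of the number of hash locations per term. Each location is an
# independent 10% chance of having been set by other terms.
density = 0.10

fp = {k: round(density ** k, 6) for k in (1, 2, 3)}
print(fp)  # {1: 0.1, 2: 0.01, 3: 0.001}
```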
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-59.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
One intuition as to why bloom filters work is that we pay a linear cost and get an exponential benefit.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-60.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Ok. We&amp;rsquo;ve talked about how to use a bloom filter to represent one document. Since our index needs to represent multiple documents, we&amp;rsquo;ll use multiple bloom filters.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-61.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
In this diagram, each of the ten columns represents a document. That is, we have documents A through J.&lt;/p&gt;

&lt;p&gt;One thing we could do is have ten independent bloom filters. We know that we can have one bloom filter represent one document, so why not use ten bloom filters for ten documents?&lt;/p&gt;

&lt;p&gt;If we&amp;rsquo;re going to do that, we might as well maintain the same mapping from terms to rows; that is, use the same hash functions for each column, so that when we do a query, we can do the query in parallel.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-62.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
In the single-document example, when we did a query, we did the bitwise &lt;code&gt;AND&lt;/code&gt; of some bits. Now, to do a query, we&amp;rsquo;ll do the bitwise &lt;code&gt;AND&lt;/code&gt; of rows of bits.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-63.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Now we&amp;rsquo;re going to query for all documents that have both &amp;ldquo;large&amp;rdquo; &lt;code&gt;AND&lt;/code&gt; &amp;ldquo;dog&amp;rdquo;. As before, bits that are red are associated with the term &amp;ldquo;large&amp;rdquo; and bits that are yellow are associated with the &amp;ldquo;dog&amp;rdquo;. Additionally, bits that are grey are associated with other terms.&lt;/p&gt;

&lt;p&gt;After we do the bitwise &lt;code&gt;AND&lt;/code&gt; of all of the rows, the result will be a row vector with some bits set &amp;ndash; those bits will be the documents that have both the terms &amp;ldquo;large&amp;rdquo; &lt;code&gt;AND&lt;/code&gt; &amp;ldquo;dog&amp;rdquo;. We&amp;rsquo;re going to &lt;code&gt;AND&lt;/code&gt; together rows one, five, seven, ten, and twelve and then look at the result.
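&lt;p&gt;Here&amp;rsquo;s a sketch of that row-oriented query in Python. The row numbers match the slides, but the bit patterns are made up for illustration, and I&amp;rsquo;m using multiplication of 0/1 values to stand in for the bitwise &lt;code&gt;AND&lt;/code&gt;:&lt;/p&gt;

```python
# Ten documents A..J as columns; each row holds one bit per document.
doc_ids = list("ABCDEFGHIJ")

rows = {
    #     A  B  C  D  E  F  G  H  I  J
    1:   [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    5:   [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    7:   [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    10:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    12:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
}

def match(row_numbers):
    # AND the selected rows together; on 0/1 ints, a*b acts as AND.
    result = rows[row_numbers[0]][:]
    for r in row_numbers[1:]:
        result = [a * b for a, b in zip(result, rows[r])]
        if not any(result):
            break  # early exit: an AND can never set a bit back to 1
    return [d for d, bit in zip(doc_ids, result) if bit]

assert match([1, 5, 7, 10, 12]) == ["J"]  # only document J survives
```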
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-64.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
In this diagram, on the right, the part that&amp;rsquo;s highlighted is the fraction of the query that we&amp;rsquo;ve done so far. On the left, the part that&amp;rsquo;s highlighted is the result of the computation so far.&lt;/p&gt;

&lt;p&gt;When we start, we have row one.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-65.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
When we &lt;code&gt;AND&lt;/code&gt; rows one and five together, we can see that bit F is cleared to zero.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-66.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
After we &lt;code&gt;AND&lt;/code&gt; row seven into our result, nothing changes. Even though row seven has bit F set, an &lt;code&gt;AND&lt;/code&gt; of a one and a zero is a zero, so the result in column F is still zero.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-67.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
When we &lt;code&gt;AND&lt;/code&gt; row ten in, bit I is cleared.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-68.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
And then when we &lt;code&gt;AND&lt;/code&gt; in the last row, nothing changes. The result of the query is that bit J is set. In other words, the query concludes that document J contains both the terms &amp;ldquo;large&amp;rdquo; &lt;code&gt;AND&lt;/code&gt; &amp;ldquo;dog&amp;rdquo;, and no other document in this block contains both terms.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-69.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
In our previous example, we queried a block of documents where at least one document contained both of the terms we cared about. We can also query a block of documents where none of the documents contain both of the terms.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-70.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
As before, we want to take the bitwise &lt;code&gt;AND&lt;/code&gt; of rows one, five, seven, ten, and twelve.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-71.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
And as before, we&amp;rsquo;ll start with row one.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-72.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
After we &lt;code&gt;AND&lt;/code&gt; in row five, all of the bits are zero! When that happened in the &amp;ldquo;cat&amp;rdquo; example we did on a single document, we could stop, because we knew the document couldn&amp;rsquo;t possibly contain the term: an &lt;code&gt;AND&lt;/code&gt; can never set a bit. The same thing is true here, and we can stop and return that the result is all zeros.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-73.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
I said, earlier, that we&amp;rsquo;d try to estimate the performance of a system. How do we do that?&lt;/p&gt;

&lt;p&gt;We&amp;rsquo;ll want to have a cost model for operations and then figure out what operations we need to do. For us, we&amp;rsquo;re doing bitwise &lt;code&gt;AND&lt;/code&gt;s and reading data from memory. Reading data from memory is so much more expensive than a bitwise &lt;code&gt;AND&lt;/code&gt; that we can ignore the cost of the &lt;code&gt;AND&lt;/code&gt;s and only consider the cost of memory accesses. If we had any disk accesses, those would be even slower, but since we&amp;rsquo;re operating in memory, we&amp;rsquo;ll assume that a memory access is the most expensive thing that we do.&lt;/p&gt;

&lt;p&gt;One bit of background is that on the machines that we run on, we do memory accesses in 512-bit blocks. So far, we&amp;rsquo;ve talked about doing operations on blocks of ten documents, but on the actual machine we can think of doing operations on 512 document blocks.&lt;/p&gt;

&lt;p&gt;In that case, to get a performance estimate, we&amp;rsquo;ll need to know how many blocks we have, how many memory accesses (rows) we have per block, and how many memory accesses our machine can do per unit time.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-74.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
To figure out how many memory accesses per block we want, we could work through the math&amp;hellip;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-75.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;hellip;which is a series of probability calculations that will give us some number. I&amp;rsquo;m not going to do that here today, but it&amp;rsquo;s possible to do.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-76.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Another thing we can do is to run a simulation. Here&amp;rsquo;s the result of a simulation that was maybe thirty lines of code. This graph is a histogram of how many memory accesses we have to do per block, assuming we have 20% bit density, and a query that&amp;rsquo;s 14 rows.
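&lt;p&gt;A sketch of what such a simulation might look like. Note the assumptions: every bit is an independent coin flip at the target density, so this models only blocks where no document actually matches &amp;ndash; which is the common case, and the left mode of the histogram:&lt;/p&gt;

```python
import random
from collections import Counter

# How many of a query's rows do we read per block before the running
# AND goes to all zeros? Parameters follow the slide (20% density,
# 14 rows); the independent-random-bits model is an assumption.
BLOCK = 512
DENSITY = 0.2
NUM_ROWS = 14
TRIALS = 2000

def accesses_for_one_block(rng):
    result = [1] * BLOCK
    reads = 0
    for _ in range(NUM_ROWS):
        # Draw a random row at the target bit density.
        row = rng.choices((1, 0), weights=(DENSITY, 1 - DENSITY), k=BLOCK)
        reads += 1
        result = [a * b for a, b in zip(result, row)]
        if not any(result):
            break  # early termination: the result can never recover
    return reads

rng = random.Random(0)
histogram = Counter(accesses_for_one_block(rng) for _ in range(TRIALS))
```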
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-77.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If 14 rows sounds like a lot, well, we often do queries on 20 to 100 rows. That might sound weird, since we looked at an example where each term mapped to three rows. For one thing, terms can and sometimes do map to more than three rows. Additionally, we do query re-writing that makes queries more complicated (and hopefully better).
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-78.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
For example, let&amp;rsquo;s say we query for &amp;ldquo;large&amp;rdquo; &lt;code&gt;AND&lt;/code&gt; &amp;ldquo;yellow&amp;rdquo; &lt;code&gt;AND&lt;/code&gt; &amp;ldquo;dog&amp;rdquo;.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-79.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
 Maybe the user was actually searching for or trying to remember the name of some breed of large yellow dog, so we could re-write the query to be something like&lt;/p&gt;

&lt;p&gt;(large &lt;code&gt;AND&lt;/code&gt; yellow &lt;code&gt;AND&lt;/code&gt; dog) &lt;code&gt;OR&lt;/code&gt; (golden &lt;code&gt;AND&lt;/code&gt; retriever)&lt;/p&gt;

&lt;p&gt;as well as other breeds of dogs that can be large and yellow.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-80.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
But the user might also be searching for some particular large yellow dog, so we could re-write the query to something like&lt;/p&gt;

&lt;p&gt;(large &lt;code&gt;AND&lt;/code&gt; yellow &lt;code&gt;AND&lt;/code&gt; dog) &lt;code&gt;OR&lt;/code&gt; (golden &lt;code&gt;AND&lt;/code&gt; retriever) &lt;code&gt;OR&lt;/code&gt; (old &lt;code&gt;AND&lt;/code&gt; yeller)&lt;/p&gt;

&lt;p&gt;and in fact we might want to query for the phrase &amp;ldquo;old yeller&amp;rdquo; and not just the &lt;code&gt;AND&lt;/code&gt; of the terms, and so on and so forth.&lt;/p&gt;

&lt;p&gt;When you do this kind of thing, and add in personalization based on location and query history, simple seeming queries can end up being relatively complicated, which is how we can get queries of 100 rows.&lt;/p&gt;

&lt;p&gt;&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-81.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;&lt;/p&gt;

&lt;p&gt;Coming back to the histogram of the number of memory accesses per block, we can see that it&amp;rsquo;s bimodal.&lt;/p&gt;

&lt;p&gt;There&amp;rsquo;s the mode on the right, where we do 14 accesses. That mode corresponds to our first multi-document example, where at least one document in the block contained the terms. Because at least one document contained all of the terms, we don&amp;rsquo;t get all zeros in the result and do all 14 accesses.&lt;/p&gt;

&lt;p&gt;The mode on the left, which is smeared out from 3 on to the right, is associated with blocks like our second example, where no document contained all of the terms in the query. In that case we&amp;rsquo;ll get a result of all zeros at some point with very high probability, and we can terminate the query early.&lt;/p&gt;

&lt;p&gt;If we look at the average of the number of accesses we need for the left mode, it&amp;rsquo;s something like 4.6. On the right, it&amp;rsquo;s exactly 14. If we average these together, let&amp;rsquo;s say we get something like 5 accesses per block (just to get a nice, round number).
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-82.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Now we have what we need to do a first-order performance estimate!
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-83.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If we go back to our roughly wikipedia-sized example, we had ten million documents. Since we&amp;rsquo;re on a machine where memory accesses are 512 bits wide, that&amp;rsquo;s ten million divided by 512, or twenty-thousand blocks, with a bit of rounding.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-84.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
We said that we have roughly five memory accesses per block. If we have twenty-thousand blocks, that means that a query needs to do twenty-thousand times five memory accesses, or one hundred-thousand memory transfers.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-85.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
We said that we can get 25GB/s of bandwidth out of our cheap server. If we do 512-bit transfers, that&amp;rsquo;s three-hundred and ninety-million transfers per second.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-86.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If we divide a hundred thousand transfers per query into three-hundred and ninety-million transfers per second, we get thirty-nine hundred QPS (with rounding from previous calculations).
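&lt;p&gt;Putting the whole estimate in one place &amp;ndash; the only inputs are the numbers from the slides (corpus size, transfer width, measured bandwidth, and the rounded five accesses per block):&lt;/p&gt;

```python
# Back-of-envelope QPS estimate for a ten-million-document shard.
docs = 10_000_000
transfer_bits = 512
blocks = docs // transfer_bits             # about twenty thousand
accesses_per_block = 5                     # rounded average from the histogram

transfers_per_query = blocks * accesses_per_block  # about one hundred thousand

bandwidth_bytes = 25 * 10**9               # 25GB/s
transfer_bytes = transfer_bits // 8        # 64-byte transfers
transfers_per_second = bandwidth_bytes // transfer_bytes  # about 390 million

qps = transfers_per_second // transfers_per_query
print(qps)  # about 4000; the slide's rounder numbers give thirty-nine hundred
```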
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-87.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
When I do a calculation like this, if I&amp;rsquo;m just looking at the largest factors that affect performance, like we did here, I&amp;rsquo;m happy if we get within a factor of two.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-88.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If you adjust for a lot of smaller factors, it&amp;rsquo;s possible to get a more accurate estimate&amp;hellip;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-89.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;hellip;but in the interest of time, we&amp;rsquo;re not going to look at all the smaller factors that add or remove 5% or 10% in performance.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-90.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
However, there are a few major factors that affect performance a lot that I&amp;rsquo;ll briefly mention.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-91.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
One thing is that our machines don&amp;rsquo;t only do document matching. So far, we&amp;rsquo;ve discussed an algorithm that, given a set of documents and a query will return a subset of those documents. We haven&amp;rsquo;t done any ranking, meaning that queries will come back unordered.&lt;/p&gt;

&lt;p&gt;There are some domains where that&amp;rsquo;s fine, but in web search, we spend a significant fraction of CPU time ranking the documents that match the query.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-92.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Additionally, we also ingest new documents all the time. When news happens and people search for the news, they want to see it right away, so we can&amp;rsquo;t do batch updates.&lt;/p&gt;

&lt;p&gt;This is something BitFunnel can actually do faster than querying. If we think about how queries worked, they&amp;rsquo;re global, in the sense that each query looked at information for each document. But when we&amp;rsquo;re ingesting new documents, since each document is a column, that&amp;rsquo;s possible to do without having to touch everything in the index. In fact, since our data structure is, in some sense, just an array that we want to set some bits in, it&amp;rsquo;s pretty straightforward to ingest documents with multiple threads while allowing queries with multiple threads.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s possible to work through the math for this the way we did for querying, but again, in the interest of time, I&amp;rsquo;ll just mention that this is possible.&lt;/p&gt;

&lt;p&gt;Between ranking and ingestion, in the configuration we&amp;rsquo;re running today, that uses about half the machine, leaving half for matching, which reduces our performance by a factor of two.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-93.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
However, we also have an optimization that drastically increases performance: hierarchical bloom filters.&lt;/p&gt;

&lt;p&gt;In our example, we had one bloom filter per document, which meant that if we had a query that only matched a single document, we&amp;rsquo;d have to examine at least one bit per document. In fact, we said that we&amp;rsquo;d end up looking at about five bits per document. If we use hierarchical bloom filters, it&amp;rsquo;s possible to look at a logarithmic number of bits per document instead of a linear number of bits per document.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-94.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
The real production system we use has a number of not necessarily obvious changes in order to run at the speed that it does. Most of them aren&amp;rsquo;t required for the system to work correctly without using an unreasonable amount of memory, but one of them is.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-95.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If you take the algorithm I described today and try to use it, when you look at sixteen rows in a block of ten documents, you might see something like this.&lt;/p&gt;

&lt;p&gt;Notice that some columns (B and D) have most or all bits set, and some columns (A and C) have few or no bits set. This is because different documents have a different number of terms.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s say we sized the number of rows so that we can efficiently store tweets. Let&amp;rsquo;s say, hypothetically, that means we need fifty rows. And then a weird document with ten million terms comes along and it wants to hash into the rows, say, thirty million times. That&amp;rsquo;s going to set every bit in its column, which means that every query will return true. Many weird documents like this contain terms that are almost never queried, so the query should almost never return true, but our system will always return true!&lt;/p&gt;

&lt;p&gt;Say we size up the number of rows so that these weird ten million term documents are ok. Let&amp;rsquo;s say that means we need to have a hundred million rows. Ok, our queries will work fine, but we still have things like tweets that might want to set, say, sixteen bits. We said that we wanted to use bloom filters instead of arrays, hashing into a smaller array to save space, but now we have all of these really sparse columns that have something like sixteen out of a hundred million bits set.&lt;/p&gt;

&lt;p&gt;To get around this problem, we shard (split up the index) by the number of terms per document. Unlike many systems, which only run in a sharded configuration when they need to spill over onto another machine, we always run in a sharded configuration, even when we&amp;rsquo;re running on a single machine.&lt;/p&gt;

&lt;p&gt;Although there are other low level details that you&amp;rsquo;d want to know to run an efficient system, this is the only change that you absolutely have to take into account when compared to the algorithm I&amp;rsquo;ve described today.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-96.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Let&amp;rsquo;s sum up what we&amp;rsquo;ve looked at today.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-97.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Before we talk about the real conclusions, let&amp;rsquo;s discuss a few false impressions this talk could give.&lt;/p&gt;

&lt;p&gt;&amp;ldquo;Search is simple&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;You&amp;rsquo;ve seen me describe an algorithm that&amp;rsquo;s used in production for web search. The algorithm is simple enough that it could be described in a thirty-minute talk with no background. However, to run this algorithm at the speed we&amp;rsquo;ve estimated today, there&amp;rsquo;s a fair amount of low-level implementation work. For example, to reduce the (otherwise substantial) control flow overhead of querying and ranking, we compile both our queries and our query ranking.&lt;/p&gt;

&lt;p&gt;Additionally, even if this system were simple, this is less than 1% of the code in Bing. Search has a lot of moving parts and this is just one of them.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-98.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;ldquo;Bloom filters are better than posting lists&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;I went into some detail about bloom filters and didn&amp;rsquo;t talk about posting lists much, except to say that they&amp;rsquo;re standard. This might give the impression that bloom filters are categorically better than posting lists. That&amp;rsquo;s not true! The only reason I didn&amp;rsquo;t describe posting lists in detail and do a comparison is that state-of-the-art posting list implementations are tremendously complicated and I couldn&amp;rsquo;t describe them to a non-specialist audience in thirty minutes, let alone do the comparison.&lt;/p&gt;

&lt;p&gt;If you do the comparison, you&amp;rsquo;ll find that which one is better depends on your workload. For an argument that posting lists are superior to bloom filters, see Zobel et al., &amp;ldquo;Inverted files versus signature files for text indexing&amp;rdquo;.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-99.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;ldquo;You can easily reason about all performance&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;Today, we looked at how an algorithm worked and estimated the performance of a system that took years to build. This was relatively straightforward because we were trying to calculate the average throughput of a system, which is something that&amp;rsquo;s amenable to back-of-the-envelope math. Something else that&amp;rsquo;s possible, but slightly more difficult, is to estimate the latency of a query on an unloaded system.&lt;/p&gt;

&lt;p&gt;Something that&amp;rsquo;s substantially harder is estimating the latency on a system as load varies, and estimating the latency distribution.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-100.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Ok, now for an actual conclusion.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-101.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
You can often reason about performance&amp;hellip;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-102.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
&amp;hellip;and you can do so with simple arithmetic. Today, all we did was multiply and divide. Sometimes you might have to add, but you can often guess what the performance of a system should be with simple calculations.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-103.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
Thanks to all of these people for help with this talk! Also, I seem to have forgotten to put Bill Barnes on the list, but he gave me some great feedback!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Post original talk: also, thanks to Laura Lindzey, Larry Marbuger, and someone&amp;rsquo;s name who I can&amp;rsquo;t remember for giving me great post-talk feedback that changed how I&amp;rsquo;m giving the next talk.&lt;/em&gt;
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;
&lt;div class=&#34;slideplustext&#34;&gt;
&lt;div class=&#34;slide&#34;&gt;
&lt;img src=&#34;http://bitfunnel.org/strangeloop/strangeloop-104.png&#34; width=&#34;750&#34; height=&#34;422&#34;&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&#34;transcript&#34;&gt;
If you want to read more about the index we talked about today, BitFunnel, you can get more information at &lt;a href=&#34;http://bitfunnel.org&#34;&gt;bitfunnel.org&lt;/a&gt;. We also have some code up at &lt;a href=&#34;http://github.com/bitfunnel/bitfunnel&#34;&gt;github.com/bitfunnel/bitfunnel&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Oh, yeah, I&amp;rsquo;m told you have to introduce yourself at these things. I&amp;rsquo;m Dan Luu, and I have a blog at &lt;a href=&#34;https://danluu.com&#34;&gt;danluu.com&lt;/a&gt; where I blog about the kind of thing I talked about here today. That is, I often write about performance, algorithms and data structures, and tradeoffs between different techniques.&lt;/p&gt;

&lt;p&gt;Thanks for your time. Oh, also, I&amp;rsquo;m not going to take questions from the stage because I don&amp;rsquo;t know how people who aren&amp;rsquo;t particularly interested in the questions often feel obligated to stay for the question period. However, I really enjoy talking about this stuff and I&amp;rsquo;d be happy to take questions in the hallway or anytime later.
&lt;/div&gt;
&lt;p&gt;&lt;/div&gt;&lt;/p&gt;&lt;/p&gt;

&lt;h4 id=&#34;some-comments-on-the-talk&#34;&gt;Some comments on the talk&lt;/h4&gt;

&lt;p&gt;Phew! I survived my first conference talk.&lt;/p&gt;

&lt;p&gt;Considering how early the talk was (10am, the first non-keynote slot), I was surprised that the room was packed and people were standing. Here&amp;rsquo;s a photo Jessica Kerr took (and annotated) while we were chatting, maybe five or ten minutes before the talk started, before the room really filled up:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://bitfunnel.org/strangeloop/packed-room.png&#34; alt=&#34;Packed room&#34; /&gt;&lt;/p&gt;

&lt;p&gt;During the conference, I got a lot of positive comments on the talk, which is great, but what I&amp;rsquo;d really love to hear about is where you were confused. If you felt lost at any point, you&amp;rsquo;d be doing me a favor by letting me know what you found to be confusing. Before I run this talk again, I&amp;rsquo;m probably going to flip the order of some slides in the Array/Bloom Filter/BitFunnel discussion, add another slide where I explicitly talk about bit density, and add diagrams for a HashMap (in the posting list section) and an Array (in the lead-up to bloom filters). There are probably more changes I could make to make things clearer, though!&lt;/p&gt;

&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/80LKF2qph6I&#34; frameborder=&#34;0&#34; allowfullscreen&gt;&lt;/iframe&gt;
</description>
    </item>
    
    <item>
      <title>A Small Query Language</title>
      <link>http://bitfunnel.org/a-small-query-language/</link>
      <pubDate>Sat, 10 Sep 2016 15:52:53 -0700</pubDate>
      
      <guid>http://bitfunnel.org/a-small-query-language/</guid>
      <description>

&lt;p&gt;A challenge in bringing BitFunnel to open source is
providing functionality that was previously supplied by portions of
Bing upstream of BitFunnel. BitFunnel was designed as a library
that takes, as input, a tree of &lt;code&gt;TermMatchNodes&lt;/code&gt; which represents a boolean
expression combining terms and phrases using logical operators like
&lt;code&gt;and&lt;/code&gt;, &lt;code&gt;or&lt;/code&gt;, and &lt;code&gt;not&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The Bing search pipeline does a ton of work on the query itself
before presenting a &lt;code&gt;TermMatchNode&lt;/code&gt; tree to BitFunnel. Examples include
word breaking, stemming, spelling corrections, and augmentation with
synonyms. The query also goes through a complex set of classifiers
that determine whether to route the query to special modules, such
as a baseball scoreboard or a weather forecast. The query is also
annotated with scoring instructions at this time.&lt;/p&gt;

&lt;p&gt;All of this processing is carried out on a tree data structure
generated upstream of BitFunnel. Although this tree has a textual
representation, there was never any need for BitFunnel to parse
the tree, so we never included a parser.&lt;/p&gt;

&lt;p&gt;Our open source project is a different story. Today, at a minimum,
we need some sort of query language and parser to test the code
as we stand it up. We also expect our users will want the option of
a complete, end-to-end system that includes a simple, intuitive,
human-authorable query language.&lt;/p&gt;

&lt;p&gt;Our goals for the query language were&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy and intuitive query authoring.&lt;/li&gt;
&lt;li&gt;Small grammar that is easy to learn.&lt;/li&gt;
&lt;li&gt;Simple to parse.&lt;/li&gt;
&lt;li&gt;Familiar to people who have used other search systems.&lt;/li&gt;
&lt;li&gt;Based on UTF-8 to allow queries in all languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since our plan was to use &lt;a href=&#34;https://lucene.apache.org/&#34;&gt;Lucene&lt;/a&gt;
as a testing oracle and performance
baseline, it made sense to consider some subset of the Lucene query language.
In the end, we chose to go with something more like the languages used by Bing
and Google.&lt;/p&gt;

&lt;p&gt;In making this decision, our main tradeoff was between complexity and familiarity
on the one hand and Lucene compatibility on the other. Lucene compatibility
would certainly make our lives as developers easier because we could feed identical
queries to BitFunnel and our Lucene reference. It would also make it easier for search
integrators to migrate between the two systems since they could just drop in
whichever engine best met their business needs.&lt;/p&gt;

&lt;p&gt;The reason we went with the Bing/Google approach centered on the complexity and
familiarity of the operator precedence. In the Lucene query language, logical
&lt;code&gt;or&lt;/code&gt; is implicit and has precedence over logical &lt;code&gt;and&lt;/code&gt;. For example, the query
&lt;code&gt;dogs cats mice&lt;/code&gt; matches documents that contain at least one of the terms
&amp;ldquo;dogs&amp;rdquo;, &amp;ldquo;cats&amp;rdquo;, and &amp;ldquo;mice&amp;rdquo;. In Bing and Google, the same query tends to find
those documents that contain all three terms (the exact semantics in Bing and
Google are less clear because they may alter the original query based on
complex systems that infer human intent). Our feeling was that users of internet
search engines would find the Bing/Google approach more familiar, but that
this would come at a cost of Lucene compatibility and it would be less familiar
for Lucene users.&lt;/p&gt;

&lt;p&gt;The deciding factor was the complexity of Lucene&amp;rsquo;s logical &lt;code&gt;or&lt;/code&gt; operator used
in conjunction with the &lt;code&gt;+&lt;/code&gt; operator. In Lucene, the &lt;code&gt;+&lt;/code&gt; operator converts a
portion of an &lt;code&gt;or&lt;/code&gt; expression into an &lt;code&gt;and&lt;/code&gt; expression. As an example, the
query &lt;code&gt;+dogs cats mice&lt;/code&gt; matches those documents that contain the term &amp;ldquo;dogs&amp;rdquo;
and at least one of &amp;ldquo;cats&amp;rdquo; and &amp;ldquo;mice&amp;rdquo;. In other words, the addition of a unary
&lt;code&gt;+&lt;/code&gt; operator converts the logical expression &lt;code&gt;dogs | cats | mice&lt;/code&gt; into the
expression &lt;code&gt;dogs &amp;amp; (cats | mice)&lt;/code&gt; which has a completely different structure.&lt;/p&gt;

&lt;p&gt;We felt that the &lt;code&gt;+&lt;/code&gt; operator&amp;rsquo;s ability
to convert an implicit &lt;code&gt;or&lt;/code&gt; into an &lt;code&gt;and&lt;/code&gt; and then distribute the &lt;code&gt;and&lt;/code&gt; over the
remaining &lt;code&gt;or&lt;/code&gt; expression introduced too much complexity and potential ambiguity
for what is essentially syntactic sugar.&lt;/p&gt;

&lt;p&gt;In any event, the decision was low stakes because it will be easy to add
a Lucene compatible parser in the future if we need one. Here&amp;rsquo;s what we came up with.&lt;/p&gt;

&lt;h3 id=&#34;query-language-overview&#34;&gt;Query Language Overview&lt;/h3&gt;

&lt;p&gt;Our query language is inspired by a subset of the Bing query language.
Today the functionality is limited to expressing boolean matching trees.
Once we&amp;rsquo;re ready to port the BitFunnel ranker code, we will extend the
language to include ranker annotations (e.g. boosting the weight of a
particular term).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AND.&lt;/strong&gt; The &lt;code&gt;and&lt;/code&gt; operator is implicit so the query &lt;code&gt;dogs cats mice&lt;/code&gt; matches those
documents that contain all of the words in the query. One can also explicitly
specify logical &lt;code&gt;and&lt;/code&gt; with the &lt;code&gt;&amp;amp;&lt;/code&gt; symbol, e.g. &lt;code&gt;dogs &amp;amp; cats &amp;amp; mice&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OR.&lt;/strong&gt; The &lt;code&gt;or&lt;/code&gt; operator, denoted by the &lt;code&gt;|&lt;/code&gt; symbol, is explicit. The query
&lt;code&gt;dogs | cats | mice&lt;/code&gt; matches those documents that contain at least one of the
three words in the query. Note that the &lt;code&gt;or&lt;/code&gt; operator has lower precedence
than the &lt;code&gt;and&lt;/code&gt; operator, so the query &lt;code&gt;dogs &amp;amp; cats | mice&lt;/code&gt; is equivalent to
the query &lt;code&gt;(dogs &amp;amp; cats) | mice&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOT.&lt;/strong&gt; The unary &lt;code&gt;not&lt;/code&gt; operator, denoted by the &lt;code&gt;-&lt;/code&gt; symbol, matches documents that
do not contain an expression. As an example, the query &lt;code&gt;dogs cats -mice&lt;/code&gt;
matches those documents that contain &amp;ldquo;dogs&amp;rdquo; and &amp;ldquo;cats&amp;rdquo;, but do not contain
&amp;ldquo;mice&amp;rdquo;. Note that the &lt;code&gt;not&lt;/code&gt; operator can be applied to arbitrary expressions
such as &lt;code&gt;dogs -(cats | mice)&lt;/code&gt;. The &lt;code&gt;not&lt;/code&gt; operator has higher precedence than
the &lt;code&gt;and&lt;/code&gt; and &lt;code&gt;or&lt;/code&gt; operators.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TERM.&lt;/strong&gt; A search term is any sequence of UTF-8 characters that does not include
whitespace or special characters such as &lt;code&gt;&amp;quot;()-:&amp;amp;|&lt;/code&gt;. Terms may include
upper and lower case characters, but they may be converted to lowercase
during the query planning process. Special characters may appear if they are
escaped with a backslash. As an example, &lt;code&gt;dog\(cat\)&lt;/code&gt; would create the term associated with
the string literal &amp;ldquo;dog(cat)&amp;rdquo; while
&lt;code&gt;dog(cat)&lt;/code&gt; would be equivalent to &lt;code&gt;dog &amp;amp; cat&lt;/code&gt;. Note that it is legal to include
an escaped space in a term, e.g. &lt;code&gt;dog\ cat&lt;/code&gt;. Keep in mind that
such a term is actually a unigram that happens to contain a space and not the
bigram phrase &lt;code&gt;&amp;quot;dog cat&amp;quot;&lt;/code&gt;. Phrases and unigrams are treated differently in the
index, so it is important to use the phrase syntax when the term is intended to
be a higher order ngram (i.e. bigram, trigram, etc.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PHRASE.&lt;/strong&gt; A phrase can be specified by enclosing a sequence of search terms in
double quotes, for example, &amp;quot;New York City&amp;quot;. Each term in the phrase
can include escaped characters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STREAM PREFIX.&lt;/strong&gt; A term match can be restricted to a particular stream by prefixing it with
the name of the stream and a colon. As an example, the query
&lt;code&gt;title:dogs body:cat mice&lt;/code&gt; would match those documents that have &amp;ldquo;dogs&amp;rdquo; in
the title, &amp;ldquo;cat&amp;rdquo; in the body, and &amp;ldquo;mice&amp;rdquo; in the default stream. Stream
names are defined by the application that hosts BitFunnel, which is also
responsible for designating the default stream.&lt;/p&gt;

&lt;h3 id=&#34;grammar&#34;&gt;Grammar&lt;/h3&gt;

&lt;p&gt;Here&amp;rsquo;s the grammar for the BitFunnel query language.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;OR:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;AND (&amp;lsquo;|&amp;rsquo; AND)*&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AND:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SIMPLE ([&amp;rsquo;&amp;amp;&amp;lsquo;] SIMPLE)*&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SIMPLE:&lt;/strong&gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;rsquo;-&amp;rsquo; SIMPLE&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;rsquo;(&amp;rsquo; OR &amp;lsquo;)&amp;rsquo;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;PREFIX&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PREFIX:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[STREAM] TEXT&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STREAM:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;~[SPECIAL | SPACE]+ &amp;lsquo;:&amp;rsquo;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TEXT:&lt;/strong&gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;TERM&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;quot; TERM [SPACE+ TERM]* &amp;quot;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TERM:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[~[SPECIAL | SPACE] | ESCAPE]+&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SPECIAL:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[&amp;rsquo;&amp;quot;&amp;rsquo; &amp;lsquo;(&amp;rsquo; &amp;lsquo;)&amp;rsquo; &amp;lsquo;-&amp;rsquo; &amp;lsquo;:&amp;rsquo; &amp;lsquo;&amp;amp;&amp;rsquo; &amp;lsquo;|&amp;rsquo;]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SPACE:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[&amp;rsquo;\n&amp;rsquo; &amp;lsquo;\r&amp;rsquo; &amp;lsquo;\t&amp;rsquo; &amp;lsquo;\v&amp;rsquo; &amp;lsquo;\f&amp;rsquo; &amp;lsquo; &amp;lsquo;]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ESCAPE:&lt;/strong&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;lsquo;\&amp;rsquo; [SPECIAL | SPACE]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&#34;try-it-out&#34;&gt;Try it out&lt;/h3&gt;

&lt;p&gt;Feel free to experiment with the interactive query parser found in
the &lt;code&gt;examples\QueryParser&lt;/code&gt; directory. Just fire up the program with no
command line arguments and it will print out a brief help message and
then dump you into an interactive console where you can type in queries
and see their parse trees.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Welcome to the BitFunnel Query Parser Example.

This example is a Read-Eval-Print-Loop (REPL) that reads queries from
the console, parses them, and then prints out the resulting tree of
TermMatchNodes.

Enter a query after the % prompt and press return. To exit the demo
just enter a blank line. Here are some query ideas:
    Single terms
        dog
        title:cat
    Phrases
        &amp;quot;dogs are your best friend&amp;quot;
        anchors:&amp;quot;read this awesome page&amp;quot;
    Disjunctions
        dogs | cats
    Conjunctions
        dogs cats
        dogs &amp;amp; cats
    Negation
        -cats
    Grouping
        dogs (cats | fish)

% dog
Unigram(&amp;quot;dog&amp;quot;, 0)

% title:cat
Unigram(&amp;quot;cat&amp;quot;, 1)

% &amp;quot;dogs are your best friend&amp;quot;
Phrase {
  StreamId: 0,
  Grams: [
    &amp;quot;dogs&amp;quot;,
    &amp;quot;are&amp;quot;,
    &amp;quot;your&amp;quot;,
    &amp;quot;best&amp;quot;,
    &amp;quot;friend&amp;quot;
  ]
}

% anchors:&amp;quot;read this awesome page&amp;quot;
Phrase {
  StreamId: 2,
  Grams: [
    &amp;quot;read&amp;quot;,
    &amp;quot;this&amp;quot;,
    &amp;quot;awesome&amp;quot;,
    &amp;quot;page&amp;quot;
  ]
}

% dogs | cats
Or {
  Children: [
    Unigram(&amp;quot;cats&amp;quot;, 0),
    Unigram(&amp;quot;dogs&amp;quot;, 0)
  ]
}

% dogs cats
And {
  Children: [
    Unigram(&amp;quot;cats&amp;quot;, 0),
    Unigram(&amp;quot;dogs&amp;quot;, 0)
  ]
}

% dogs &amp;amp; cats
And {
  Children: [
    Unigram(&amp;quot;cats&amp;quot;, 0),
    Unigram(&amp;quot;dogs&amp;quot;, 0)
  ]
}

% -cats
Not {
  Child: Unigram(&amp;quot;cats&amp;quot;, 0)
}

% dogs (cats | fish)
And {
  Children: [
    Or {
      Children: [
        Unigram(&amp;quot;fish&amp;quot;, 0),
        Unigram(&amp;quot;cats&amp;quot;, 0)
      ]
    },
    Unigram(&amp;quot;dogs&amp;quot;, 0)
  ]
}

% dog\&amp;quot;cat
Unigram(&amp;quot;dog\&amp;quot;cat&amp;quot;, 0)

% dog\(cat
Unigram(&amp;quot;dog(cat&amp;quot;, 0)

% dog\ cat
Unigram(&amp;quot;dog cat&amp;quot;, 0)

%
bye
Press any key to continue . . .
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;current-limitations&#34;&gt;Current Limitations&lt;/h3&gt;

&lt;p&gt;The query parser is still very much a work-in-progress. Here are some known
limitations and notes about future directions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the query parser example, the parser has been &lt;a href=&#34;http://bitfunnel.org/stream-configuration&#34;&gt;configured&lt;/a&gt; with stream prefixes
&lt;code&gt;body&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;, and &lt;code&gt;anchors&lt;/code&gt;. These prefixes are associated with &lt;code&gt;Term::StreamId&lt;/code&gt; values of 0, 1, and 2, respectively.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;TERM&lt;/code&gt; production accepts all UTF-8 characters except for &amp;lsquo;\0&amp;rsquo; and
members of our &lt;code&gt;SPECIAL&lt;/code&gt; characters and &lt;code&gt;SPACE&lt;/code&gt; characters. One consequence
is that characters like Unicode directional quotations
(U+2018, U+2019, U+201C, and U+201D ) will be treated as part of the &lt;code&gt;TERM&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The query parser is configured to use an arena allocator with a fixed
amount of memory. Unexpectedly long queries may cause the allocator to throw.&lt;/li&gt;
&lt;li&gt;The query parser currently preserves letter case.&lt;/li&gt;
&lt;li&gt;Currently stream prefixes may contain escaped special characters. This will
likely be disallowed in the near future when we begin to store stream prefixes
in configuration files where special characters may cause problems.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Stream Configuration</title>
      <link>http://bitfunnel.org/stream-configuration/</link>
      <pubDate>Fri, 09 Sep 2016 17:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/stream-configuration/</guid>
      <description>

&lt;p&gt;BitFunnel models each document as a set of streams,
each of which consists of a sequence of terms corresponding to the
words and phrases that make up the document.&lt;/p&gt;

&lt;p&gt;Real world documents are usually organized with streams corresponding to
structural concepts, such as the title, the URL, the body, and perhaps even the
text of anchors on other pages that point to the document.&lt;/p&gt;

&lt;p&gt;We may want to organize the index using a different principle.
For example, we might index each document as a pair of streams,
one that contains all terms associated with the document and
another that contains only those terms that appear in streams
other than the document body. This organization is useful for
rewriting queries in order to return fewer results.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;StreamConfiguration&lt;/strong&gt; provides a mapping between the
streams in the document and the streams in the index.
Let&amp;rsquo;s look at a more detailed example.&lt;/p&gt;

&lt;p&gt;Consider a hypothetical document about dogs that resides at &lt;a href=&#34;http://bitfunnel.org/dogs&#34;&gt;http://bitfunnel.org/dogs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Dogs&lt;/strong&gt;&lt;br /&gt;
Dogs are your best friend.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Suppose another page refers to our document via the following anchor tags:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&amp;lt;a href=&amp;ldquo;dogs&amp;rdquo;&amp;gt;Check out this awesome page!&amp;lt;/a&amp;gt;&lt;br /&gt;
&amp;lt;a href=&amp;ldquo;dogs&amp;rdquo;&amp;gt;Who is your friend?&amp;lt;/a&amp;gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Such a document might be organized into streams as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;title: [dogs]&lt;br /&gt;
body: [dogs are your best friend]&lt;br /&gt;
url: [http bitfunnel org dogs]&lt;br /&gt;
anchors: [check out this awesome page] [who is your friend]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the index modelled documents this way we could search for
our document with queries like &lt;code&gt;dogs&lt;/code&gt;, &lt;code&gt;title:dogs&lt;/code&gt; or even
&lt;code&gt;anchors:&amp;quot;awesome page&amp;quot;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We could choose to index this document as two streams, one of which has all words
associated with the document and the other that contains words from streams other
than the body:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;document: [dogs] [dogs are your best friend http] [bitfunnel org dogs]&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; [check out this awesome page] [who is your friend]&lt;br /&gt;
nonbody: [dogs] [http bitfunnel org] [check out this awesome page]&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; [who is your friend]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With this organization, we could find the document with the query &lt;code&gt;nonbody:dogs&lt;/code&gt;
but not &lt;code&gt;nonbody:best&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&#34;configuring-streams&#34;&gt;Configuring Streams&lt;/h3&gt;

&lt;p&gt;The IDocument class uses the StreamConfiguration at ingestion time to
organize its terms for indexing. The QueryParser class uses the StreamConfiguration
to map from text stream names to &lt;code&gt;Term::StreamId&lt;/code&gt; values.&lt;/p&gt;

&lt;p&gt;Each IDocument is filled with streams of terms using a sequence of calls to the
OpenStream(), AddTerm(), and CloseStream() methods. The &lt;code&gt;Term::StreamId&lt;/code&gt; values
passed to OpenStream() are document streams. The document above might be initialized
with the following sequence of calls:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;OpenStream(0);  // Title stream
AddTerm(&amp;quot;dogs&amp;quot;);
CloseStream();

OpenStream(1);  // Body stream
AddTerm(&amp;quot;dogs&amp;quot;);
AddTerm(&amp;quot;are&amp;quot;);
AddTerm(&amp;quot;your&amp;quot;);
AddTerm(&amp;quot;best&amp;quot;);
AddTerm(&amp;quot;friend&amp;quot;);
CloseStream();

OpenStream(2);  // URL stream
AddTerm(&amp;quot;http&amp;quot;);
AddTerm(&amp;quot;bitfunnel&amp;quot;);
AddTerm(&amp;quot;org&amp;quot;);
AddTerm(&amp;quot;dogs&amp;quot;);
CloseStream();

OpenStream(3);  // Anchors stream
AddTerm(&amp;quot;check&amp;quot;);
AddTerm(&amp;quot;out&amp;quot;);
AddTerm(&amp;quot;this&amp;quot;);
AddTerm(&amp;quot;awesome&amp;quot;);
AddTerm(&amp;quot;page&amp;quot;);
CloseStream();

// Close and then reopen stream to
// keep phrases from the two anchors
// separate.

OpenStream(3);  // Anchors stream
AddTerm(&amp;quot;who&amp;quot;);
AddTerm(&amp;quot;is&amp;quot;);
AddTerm(&amp;quot;your&amp;quot;);
AddTerm(&amp;quot;friend&amp;quot;);
CloseStream();
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can ingest this document as &lt;code&gt;Document&lt;/code&gt; and &lt;code&gt;NonBody&lt;/code&gt; streams by writing
the following StreamConfiguration file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Document: 0,1,2,3
NonBody: 1,2,3
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first line defines an index stream called &amp;ldquo;Document&amp;rdquo; which contains terms
and phrases from document streams 0, 1, 2, and 3 which correspond to the document&amp;rsquo;s
Title, Body, URL, and Anchor streams. The second line defines an index stream
called &amp;ldquo;NonBody&amp;rdquo; which contains terms from the document&amp;rsquo;s Title, URL and Anchor
streams.&lt;/p&gt;

&lt;p&gt;This StreamConfiguration file will automatically configure the QueryParser
to recognize the &amp;ldquo;Document&amp;rdquo; and &amp;ldquo;NonBody&amp;rdquo; prefixes. Note that the first entry
in the StreamConfiguration file defines the default stream. When a query term does not
have a stream prefix, it uses the default stream. So in this case, the query
&lt;code&gt;dogs&lt;/code&gt; is equivalent to &lt;code&gt;Document:dogs&lt;/code&gt;.&lt;/p&gt;
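&lt;p&gt;The format is simple enough that a small parser can illustrate it. The sketch below is not the BitFunnel implementation (the function name is made up); it only shows how a line like &lt;code&gt;Document: 0,1,2,3&lt;/code&gt; maps an index stream name to its list of document stream ids.&lt;/p&gt;

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch, not the BitFunnel parser: read lines of the form
// "Name: id,id,id" into a map from index stream name to document stream ids.
std::map<std::string, std::vector<uint32_t>>
ParseStreamConfiguration(std::istream& input)
{
    std::map<std::string, std::vector<uint32_t>> config;
    std::string line;
    while (std::getline(input, line))
    {
        auto colon = line.find(':');
        if (colon == std::string::npos)
        {
            continue;   // Skip blank or malformed lines.
        }
        std::string name = line.substr(0, colon);
        std::vector<uint32_t> ids;
        std::stringstream rest(line.substr(colon + 1));
        std::string field;
        while (std::getline(rest, field, ','))
        {
            // std::stoul skips the leading whitespace after ": ".
            ids.push_back(static_cast<uint32_t>(std::stoul(field)));
        }
        config[name] = ids;
    }
    return config;
}
```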
</description>
    </item>
    
    <item>
      <title>Getting started with NativeJIT</title>
      <link>http://bitfunnel.org/getting-started-with-nativejit/</link>
      <pubDate>Thu, 01 Sep 2016 16:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/getting-started-with-nativejit/</guid>
      <description>

&lt;p&gt;&lt;a href=&#34;https://github.com/bitfunnel/nativejit/&#34;&gt;NativeJIT&lt;/a&gt; is a just-in-time compiler
that handles expressions involving C data structures. It was originally
developed in Bing, with the goal of being able to compile search query matching
and search query ranking code in a query-dependent way. The goal was to create a
compiler that can be used in systems with tens of thousands of queries per
second without having compilation take a significant fraction of the query time.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s look at a simple &amp;ldquo;Hello, World&amp;rdquo; and then look at what the API has to offer
us.&lt;/p&gt;

&lt;h2 id=&#34;hello-world&#34;&gt;Hello World&lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;re going to build a function that computes the area of a circle, given its
radius.  If we were to write such a function in C, it would look something like&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;const float PI = 3.14159;

float area(float radius)
{
  return radius * radius * PI;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Building this function in NativeJIT involves three steps: creating a &lt;code&gt;Function&lt;/code&gt;
object which defines the function prototype, building the expression tree which
defines the function body, and finally compiling the function into x64 machine
code.&lt;/p&gt;

&lt;h3 id=&#34;create-the-function-object&#34;&gt;Create the Function Object&lt;/h3&gt;

&lt;p&gt;The Function constructor takes one to five template parameters and exactly two
regular parameters. The template parameters define the function prototype,
while the regular parameters supply resources necessary to compile and run x64
code.&lt;/p&gt;

&lt;h4 id=&#34;template-parameters&#34;&gt;Template Parameters&lt;/h4&gt;

&lt;p&gt;The template parameters define the function prototype for the compiled code.
The first parameter defines the return value type. The remaining template
parameters correspond to the function&amp;rsquo;s parameter types.&lt;/p&gt;

&lt;p&gt;For our example, we&amp;rsquo;re defining a function that takes a single float parameter
for the radius and returns a float area, so our template parameters would be
&lt;code&gt;&amp;lt;float, float&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Function&amp;lt;float, float&amp;gt; expression(allocator, code);
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;allocator&#34;&gt;Allocator&lt;/h4&gt;

&lt;p&gt;The allocator provides the memory where expression nodes will reside. Any class
that implements the &lt;code&gt;IAllocator&lt;/code&gt; interface will do.&lt;/p&gt;

&lt;p&gt;A reasonable default is to use the arena allocator provided in NativeJIT&amp;rsquo;s
&lt;code&gt;Temporary&lt;/code&gt; directory. The arena allocator hands out blocks of memory from a
fixed size buffer. All of this memory can be recycled at once by calling the
allocator&amp;rsquo;s &lt;code&gt;Reset()&lt;/code&gt; method. The advantage of the arena allocation pattern is
that it allows you to quickly dispose of an expression tree after
compilation. The disadvantage is that it requires everything that uses the
allocated memory to be aware that it&amp;rsquo;s using an arena allocator.&lt;/p&gt;
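&lt;p&gt;As a rough sketch of the pattern (not NativeJIT&amp;rsquo;s actual &lt;code&gt;Allocator&lt;/code&gt;), an arena allocator bumps a cursor through a fixed buffer and recycles everything with a single reset:&lt;/p&gt;

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal arena allocator sketch, for illustration only: allocations
// advance a cursor through a fixed buffer; Reset() recycles the whole
// buffer at once. No per-allocation free, no destructor calls.
class Arena
{
public:
    explicit Arena(std::size_t bufferSize)
      : m_buffer(bufferSize), m_used(0)
    {
    }

    void* Allocate(std::size_t bytes)
    {
        // Round up to 8-byte alignment.
        bytes = (bytes + 7) & ~std::size_t(7);
        if (m_used + bytes > m_buffer.size())
        {
            return nullptr;   // Out of arena space.
        }
        void* p = m_buffer.data() + m_used;
        m_used += bytes;
        return p;
    }

    void Reset() { m_used = 0; }

    std::size_t BytesUsed() const { return m_used; }

private:
    std::vector<uint8_t> m_buffer;
    std::size_t m_used;
};
```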

&lt;p&gt;The constructor for allocator takes a single parameter which is
the buffer size in bytes.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Allocator allocator(8192);
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;functionbuffer&#34;&gt;FunctionBuffer&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;FunctionBuffer&lt;/code&gt; provides the executable memory where the compiled code will
reside.  In order to allow code execution, this memory must have &lt;a href=&#34;https://en.wikipedia.org/wiki/Executable_space_protection&#34;&gt;Executable
Space Protection&lt;/a&gt;
disabled.&lt;/p&gt;

&lt;p&gt;NativeJIT provides the &lt;code&gt;ExecutionBuffer&lt;/code&gt; class which is an &lt;code&gt;IAllocator&lt;/code&gt; that
allocates blocks of executable code. Classes starting with &lt;code&gt;I&lt;/code&gt; are interfaces,
so this is saying that &lt;code&gt;ExecutionBuffer&lt;/code&gt; satisfies the &lt;code&gt;IAllocator&lt;/code&gt;
interface. Its constructor takes a single parameter which specifies its buffer
size. Note that the buffer size will typically be rounded up to the operating
system virtual memory page size since the &lt;a href=&#34;https://en.wikipedia.org/wiki/NX_bit&#34;&gt;NX
bit&lt;/a&gt; is applied at the page level.&lt;/p&gt;
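&lt;p&gt;For a concrete feel for that rounding (a standalone sketch, not NativeJIT code), rounding a size up to a power-of-two page size looks like this, assuming a typical 4096-byte page:&lt;/p&gt;

```cpp
#include <cstddef>

// Round size up to the next multiple of pageSize, where pageSize is a
// power of two. This mirrors what effectively happens when the NX bit
// is applied at page granularity. Illustration only.
constexpr std::size_t RoundUpToPage(std::size_t size, std::size_t pageSize)
{
    return (size + pageSize - 1) & ~(pageSize - 1);
}
```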

&lt;p&gt;The &lt;code&gt;FunctionBuffer&lt;/code&gt; constructor takes an ExecutionBuffer and a buffer size.
The buffer size parameter might seem redundant given that
the &lt;code&gt;ExecutionBuffer&lt;/code&gt; takes a buffer size as well. The reason the
&lt;code&gt;FunctionBuffer&lt;/code&gt; constructor takes a buffer size is that multiple
&lt;code&gt;FunctionBuffer&lt;/code&gt;s can share a single &lt;code&gt;ExecutionBuffer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s the code fragment:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ExecutionBuffer codeAllocator(8192);
FunctionBuffer code(codeAllocator, 8192);
Function&amp;lt;float, float&amp;gt; expression(allocator, code);
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;build-the-expression-tree&#34;&gt;Build the Expression Tree&lt;/h3&gt;

&lt;p&gt;The next step is to build the expression tree which defines the function body.
In the expression tree, interior nodes are operators and the leaf nodes are either literals or function parameters.&lt;/p&gt;

&lt;p&gt;The tree is built from the bottom up.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Function&amp;lt;float, float&amp;gt; expression(allocator, code);

const float  PI = 3.14159265358979f;
auto &amp;amp; a = expression.Mul(expression.GetP1(), expression.GetP1());
auto &amp;amp; b = expression.Mul(a, expression.Immediate(PI));
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the code above, &lt;code&gt;expression.GetP1()&lt;/code&gt; is a leaf node corresponding to the first parameter.
Node &lt;code&gt;a&lt;/code&gt; is defined to be the product of the first parameter with itself.&lt;/p&gt;

&lt;p&gt;On the next line, &lt;code&gt;expression.Immediate(PI)&lt;/code&gt; is an immediate value leaf node whose value is equal to &lt;code&gt;PI&lt;/code&gt;.
Node &lt;code&gt;b&lt;/code&gt; is defined to be the product of node &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;PI&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://bitfunnel.org/nativejit/getting-started-with-nativejit/hello-world.png&#34; alt=&#34;expression tree&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Note that each of the node factory methods on &lt;code&gt;Function&lt;/code&gt; is templated by the types of its children.
This is an important safeguard that prevents the construction of a tree with type errors
(e.g. adding a double to a char).&lt;/p&gt;

&lt;h3 id=&#34;compile&#34;&gt;Compile&lt;/h3&gt;

&lt;p&gt;Once the tree is built, it&amp;rsquo;s time to generate &lt;code&gt;x64&lt;/code&gt; machine code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;auto computeArea = expression.Compile(b);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The compiler returns a pointer to the compiled function.
In this example, its type is &lt;code&gt;float (*)(float)&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&#34;run&#34;&gt;Run&lt;/h3&gt;

&lt;p&gt;You call this just like calling a C function!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;auto result = computeArea(radius);
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id=&#34;putting-it-all-together&#34;&gt;Putting it all together&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;#include &amp;quot;NativeJIT/CodeGen/ExecutionBuffer.h&amp;quot;
#include &amp;quot;NativeJIT/CodeGen/FunctionBuffer.h&amp;quot;
#include &amp;quot;NativeJIT/Function.h&amp;quot;
#include &amp;quot;Temporary/Allocator.h&amp;quot;

#include &amp;lt;iostream&amp;gt;

using NativeJIT::Allocator;
using NativeJIT::ExecutionBuffer;
using NativeJIT::Function;
using NativeJIT::FunctionBuffer;

int main()
{
    ExecutionBuffer codeAllocator(8192);
    Allocator allocator(8192);
    FunctionBuffer code(codeAllocator, 8192);

    const float  PI = 3.14159265358979f;

    Function&amp;lt;float, float&amp;gt; expression(allocator, code);

    auto &amp;amp; a = expression.Mul(expression.GetP1(),
                              expression.GetP1());
    auto &amp;amp; b = expression.Mul(a, expression.Immediate(PI));
    auto function = expression.Compile(b);

    float p1 = 2.0;

    auto expected = PI * p1 * p1;
    auto observed = function(p1);

    std::cout &amp;lt;&amp;lt; expected &amp;lt;&amp;lt; &amp;quot; == &amp;quot; &amp;lt;&amp;lt; observed &amp;lt;&amp;lt; std::endl;

    return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;examining-the-x64-code&#34;&gt;Examining the x64 Code&lt;/h2&gt;

&lt;p&gt;If you&amp;rsquo;re interested in seeing the x64 code,
fire up the debugger and set a breakpoint
on a line after the code has been compiled,
for example the line &lt;code&gt;float p1 = 2.0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then get the value of &lt;code&gt;function&lt;/code&gt;, and switch into
disassembly view, starting at this address.
Here&amp;rsquo;s what you should see on Windows.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;  sub    rsp,8                         ; Standard function prologue.
  mov    qword ptr [rsp],rbp           ; Standard function prologue.
  lea    rbp,[rsp+8]                   ; Standard function prologue.
  mulss  xmm0,xmm0                     ; Multiply radius parameter by itself.
  mulss  xmm0,dword ptr [29E2A580000h] ; Multiply by PI.
  mov    rbp,qword ptr [rsp]           ; Standard function epilogue.
  add    rsp,8                         ; Standard function epilogue.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Linux output may look slightly different because of differences in the &lt;a href=&#34;https://en.wikipedia.org/wiki/Application_binary_interface&#34;&gt;ABI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;On Windows you can single step through the generated code in the debugger.
Because NativeJIT does not implement x64 stack unwinding on Linux and OSX,
you may have trouble single stepping through the generated code, but it will
run correctly.&lt;/p&gt;

&lt;h2 id=&#34;rules-of-the-road&#34;&gt;Rules of the Road&lt;/h2&gt;

&lt;p&gt;For the most part, NativeJIT assumes that the entire expression tree is free of side effects.
The only general purpose node that can cause a side effect is &lt;code&gt;CallNode&lt;/code&gt; which calls out to an external function.
The behavior of the generated code is undefined when calling out to external functions that cause side effects.&lt;/p&gt;

&lt;p&gt;In the current implementation, each node is evaluated exactly once.
This guarantee is an important optimization for common subexpressions,
which are nodes that have multiple parents.&lt;/p&gt;

&lt;p&gt;Common subexpressions often show up when traversing data structures.
In the following example&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;foo[i].bar.baz + foo[i].bar.wat
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;the expression &lt;code&gt;foo[i].bar&lt;/code&gt; is a fairly complicated subexpression
involving multiplication, addition, and a pointer dereference.
Since it is a common subexpression, this work will only be done once.&lt;/p&gt;
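&lt;p&gt;The equivalent hand-written C++ makes the saving concrete. In this sketch (the &lt;code&gt;Foo&lt;/code&gt; and &lt;code&gt;Bar&lt;/code&gt; types are hypothetical), the common subexpression &lt;code&gt;foo[i].bar&lt;/code&gt; is computed once and reused:&lt;/p&gt;

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical structures matching the expression foo[i].bar.baz.
struct Bar
{
    int64_t baz;
    int64_t wat;
};

struct Foo
{
    Bar bar;
};

// The indexing arithmetic and dereference for foo[i].bar happen once,
// analogous to how NativeJIT evaluates a node with multiple parents.
int64_t SumBazWat(const Foo* foo, std::size_t i)
{
    const Bar& bar = foo[i].bar;    // Common subexpression, done once.
    return bar.baz + bar.wat;
}
```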

&lt;p&gt;NativeJIT provides an experimental &lt;code&gt;ConditionalNode&lt;/code&gt; analogous to the
ternary conditional operator in C.
Today the generated code evaluates both the true branch and the false branch,
independent of the value of the conditional expression.
Down the road we intend to rework the code generator to
restrict execution to either the true or the false path.
Today the register allocator makes assumptions about register spills
and temporary allocations in both branches, and these assumptions must
be carried forward through all code that is executed after the first
conditional branch.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s important to take into account whether the code will run locally.
If you&amp;rsquo;re running JIT&amp;rsquo;d code on the machine that compiled it,
it is legal to use the address of a C symbol as a literal value.
If you are JITing on one machine and executing the code on another,
be aware that there is no guarantee that the symbol will be at the
same address on the other machine.&lt;/p&gt;

&lt;p&gt;This scenario typically comes up when attempting to call out to
external functions. The solution is to have the caller pass the
address in as a parameter of the compiled code, instead of relying
on a function address in an &lt;code&gt;ImmediateNode&lt;/code&gt;.&lt;/p&gt;
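&lt;p&gt;In plain C++ the pattern looks like the following sketch (the names here are hypothetical): the callee&amp;rsquo;s address arrives as an ordinary parameter, so nothing machine-specific is baked into the compiled code:&lt;/p&gt;

```cpp
#include <cstdint>

// The external function the generated code needs to call.
int32_t Double(int32_t x)
{
    return 2 * x;
}

typedef int32_t (*Callback)(int32_t);

// Shape of the compiled function: the callee's address is a parameter
// bound by the caller at call time, not an embedded immediate.
int32_t Invoke(Callback callee, int32_t value)
{
    return callee(value);
}
```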

&lt;p&gt;As mentioned earlier, NativeJIT only implements x64 stack
unwinding on Windows. Aside from the debugging impact mentioned
above, the other risk in omitting stack unwinding
is that an exception thrown from a C
function called from NativeJIT code may not be caught
properly on Linux and OSX.&lt;/p&gt;

&lt;p&gt;If you grep for &lt;code&gt;DESIGN NOTE&lt;/code&gt; in the code, you can find explanations of other quirks in NativeJIT.&lt;/p&gt;

&lt;h2 id=&#34;commonly-used-methods&#34;&gt;Commonly used methods&lt;/h2&gt;

&lt;h4 id=&#34;immediates&#34;&gt;Immediates&lt;/h4&gt;

&lt;p&gt;These are simple types (e.g., &lt;code&gt;char&lt;/code&gt; or &lt;code&gt;int&lt;/code&gt;) or pointers to anything. This means that we can have, for example, pointers to structs but we can&amp;rsquo;t have struct literals.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename T&amp;gt; ImmediateNode&amp;lt;T&amp;gt;&amp;amp; Immediate(T value);
&lt;/code&gt;&lt;/pre&gt;

&lt;h5 id=&#34;examples&#34;&gt;Examples&lt;/h5&gt;

&lt;pre&gt;&lt;code&gt;// Immediate.
Function&amp;lt;int64_t&amp;gt; exp1(allocator1, code1);

auto &amp;amp;imm1 = exp1.Immediate(1234ll);
auto fn1 = exp1.Compile(imm1);

assert(1234ll == fn1());
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;unary-operators&#34;&gt;Unary Operators&lt;/h4&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename TO, typename FROM&amp;gt; Node&amp;lt;TO&amp;gt;&amp;amp; Cast(Node&amp;lt;FROM&amp;gt;&amp;amp; value);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Pointer dereference; basically like &lt;code&gt;*&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename T&amp;gt; Node&amp;lt;T&amp;gt;&amp;amp; Deref(Node&amp;lt;T*&amp;gt;&amp;amp; pointer);
template &amp;lt;typename T&amp;gt; Node&amp;lt;T&amp;gt;&amp;amp; Deref(Node&amp;lt;T*&amp;gt;&amp;amp; pointer, int32_t index);
template &amp;lt;typename T&amp;gt; Node&amp;lt;T&amp;gt;&amp;amp; Deref(Node&amp;lt;T&amp;amp;&amp;gt;&amp;amp; reference);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Field dereference; basically like &lt;code&gt;-&amp;gt;&lt;/code&gt;. If you have &lt;code&gt;a&lt;/code&gt;, and apply the &lt;code&gt;b&lt;/code&gt; &lt;code&gt;FieldPointer&lt;/code&gt;, that&amp;rsquo;s equivalent to &lt;code&gt;a-&amp;gt;b&lt;/code&gt;. There&amp;rsquo;s no &lt;code&gt;.&lt;/code&gt; because we don&amp;rsquo;t have structs as value types.&lt;/p&gt;

&lt;p&gt;If you have a reference to an object, you have to convert the reference to a pointer to apply this method. Note that this has no runtime cost.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename OBJECT, typename FIELD, typename OBJECT1 = OBJECT&amp;gt;
Node&amp;lt;FIELD*&amp;gt;&amp;amp; FieldPointer(Node&amp;lt;OBJECT*&amp;gt;&amp;amp; object, FIELD OBJECT1::*field);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because we have some operations that can only be done on pointers (or only done on references), we have &lt;code&gt;AsPointer&lt;/code&gt; and &lt;code&gt;AsReference&lt;/code&gt; to convert between pointer and reference. This is free in terms of actual runtime cost:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename T&amp;gt; Node&amp;lt;T*&amp;gt;&amp;amp; AsPointer(Node&amp;lt;T&amp;amp;&amp;gt;&amp;amp; reference);
template &amp;lt;typename T&amp;gt; Node&amp;lt;T&amp;amp;&amp;gt;&amp;amp; AsReference(Node&amp;lt;T*&amp;gt;&amp;amp; pointer);
&lt;/code&gt;&lt;/pre&gt;

&lt;h5 id=&#34;examples-1&#34;&gt;Examples&lt;/h5&gt;

&lt;pre&gt;&lt;code&gt;// Cast.
Function&amp;lt;int64_t&amp;gt; exp1(allocator1, code1);

auto &amp;amp;cast1 = exp1.Cast&amp;lt;float&amp;gt;(exp1.Immediate(10));
auto fn1 = exp1.Compile(cast1);

assert(float(10) == fn1());


// Access member via -&amp;gt;.
class Foo
{
public:
    uint32_t m_a;
    uint64_t m_b;
};

Function&amp;lt;uint64_t, Foo*&amp;gt; expression(allocator2, code2);

auto &amp;amp; a = expression.GetP1();
auto &amp;amp; b = expression.FieldPointer(a, &amp;amp;Foo::m_b);
auto &amp;amp; c = expression.Deref(b);
auto fn2 = expression.Compile(c);

Foo foo;
foo.m_b = 1234ull;
Foo* p1 = &amp;amp;foo;

assert(p1-&amp;gt;m_b == fn2(p1));
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;binary-operators&#34;&gt;Binary Operators&lt;/h4&gt;

&lt;p&gt;Binary arithmetic ops take either two nodes, or a node and an immediate. Note that although the types are templated as &lt;code&gt;L&lt;/code&gt; and &lt;code&gt;R&lt;/code&gt;, &lt;code&gt;L&lt;/code&gt; and &lt;code&gt;R&lt;/code&gt; should generally be the same for binary ops that take two nodes &amp;ndash; conversions must be made explicit. For &lt;code&gt;Rol&lt;/code&gt;, &lt;code&gt;Shl&lt;/code&gt;, and &lt;code&gt;Shr&lt;/code&gt;, the immediate should be a &lt;code&gt;uint8_t&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Add(Node&amp;lt;L&amp;gt;&amp;amp; left, Node&amp;lt;R&amp;gt;&amp;amp; right);
template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; And(Node&amp;lt;L&amp;gt;&amp;amp; left, Node&amp;lt;R&amp;gt;&amp;amp; right);
template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Mul(Node&amp;lt;L&amp;gt;&amp;amp; left, Node&amp;lt;R&amp;gt;&amp;amp; right);
template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Or(Node&amp;lt;L&amp;gt;&amp;amp; left, Node&amp;lt;R&amp;gt;&amp;amp; right);
template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Sub(Node&amp;lt;L&amp;gt;&amp;amp; left, Node&amp;lt;R&amp;gt;&amp;amp; right);

template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Rol(Node&amp;lt;L&amp;gt;&amp;amp; left, R right);
template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Shl(Node&amp;lt;L&amp;gt;&amp;amp; left, R right);
template &amp;lt;typename L, typename R&amp;gt; Node&amp;lt;L&amp;gt;&amp;amp; Shr(Node&amp;lt;L&amp;gt;&amp;amp; left, R right);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Like &lt;code&gt;[]&lt;/code&gt;, i.e., takes a pointer and adds an offset:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename T, size_t SIZE, typename INDEX&amp;gt; Node&amp;lt;T*&amp;gt;&amp;amp;
Add(Node&amp;lt;T(*)[SIZE]&amp;gt;&amp;amp; array, Node&amp;lt;INDEX&amp;gt;&amp;amp; index);

template &amp;lt;typename T, typename INDEX&amp;gt; Node&amp;lt;T*&amp;gt;&amp;amp;
Add(Node&amp;lt;T*&amp;gt;&amp;amp; array, Node&amp;lt;INDEX&amp;gt;&amp;amp; index);
&lt;/code&gt;&lt;/pre&gt;

&lt;h5 id=&#34;examples-2&#34;&gt;Examples&lt;/h5&gt;

&lt;pre&gt;&lt;code&gt;// Array dereference with binary operation.

Function&amp;lt;uint64_t, uint64_t*&amp;gt; exp1(allocator1, code1);

auto &amp;amp; idx1 = exp1.Add(exp1.GetP1(),
                       exp1.Immediate&amp;lt;uint64_t&amp;gt;(1ull));
auto &amp;amp; idx2 = exp1.Add(exp1.GetP1(),
                       exp1.Immediate&amp;lt;uint64_t&amp;gt;(2ull));
auto &amp;amp; sum = exp1.Add(exp1.Deref(idx1), exp1.Deref(idx2));
auto fn1 = exp1.Compile(sum);

uint64_t array[10];
array[1] = 1;
array[2] = 128;

uint64_t * p1 = array;

assert(array[1] + array[2] == fn1(p1));
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;compare-conditional&#34;&gt;Compare &amp;amp; Conditional&lt;/h4&gt;

&lt;p&gt;Unlike other nodes, which return a generic &lt;code&gt;T&lt;/code&gt;, compare nodes return a flag,
which can then be passed to a conditional.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;JccType JCC, typename T&amp;gt;
FlagExpressionNode&amp;lt;JCC&amp;gt;&amp;amp; Compare(Node&amp;lt;T&amp;gt;&amp;amp; left, Node&amp;lt;T&amp;gt;&amp;amp; right);

template &amp;lt;JccType JCC, typename T&amp;gt;
Node&amp;lt;T&amp;gt;&amp;amp; Conditional(FlagExpressionNode&amp;lt;JCC&amp;gt;&amp;amp; condition,
                     Node&amp;lt;T&amp;gt;&amp;amp; trueValue,
                     Node&amp;lt;T&amp;gt;&amp;amp; falseValue);

template &amp;lt;typename CONDT, typename T&amp;gt;
Node&amp;lt;T&amp;gt;&amp;amp; IfNotZero(Node&amp;lt;CONDT&amp;gt;&amp;amp; conditionValue,
                   Node&amp;lt;T&amp;gt;&amp;amp; trueValue,
                   Node&amp;lt;T&amp;gt;&amp;amp; falseValue);

template &amp;lt;typename T&amp;gt;
Node&amp;lt;T&amp;gt;&amp;amp; If(Node&amp;lt;bool&amp;gt;&amp;amp; conditionValue,
            Node&amp;lt;T&amp;gt;&amp;amp; thenValue,
            Node&amp;lt;T&amp;gt;&amp;amp; elseValue);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The x86 conditional tests are available; a full list can be found &lt;a href=&#34;http://unixwiz.net/techtips/x86-jumps.html&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h5 id=&#34;example&#34;&gt;Example&lt;/h5&gt;

&lt;pre&gt;&lt;code&gt;// JA (jump if above), i.e. unsigned &amp;quot;&amp;gt;&amp;quot;
Function&amp;lt;uint64_t, uint64_t, uint64_t&amp;gt;
    expression(allocator1, code1);

uint64_t trueValue = 5;
uint64_t falseValue = 6;

auto &amp;amp; a =
  expression.Compare&amp;lt;JccType::JA&amp;gt;(expression.GetP1(), expression.GetP2());
auto &amp;amp; b =
  expression.Conditional(a,
                         expression.Immediate(trueValue),
                         expression.Immediate(falseValue));
auto function = expression.Compile(b);

uint64_t p1 = 3;
uint64_t p2 = 4;

auto expected = (p1 &amp;gt; p2) ? trueValue : falseValue;
auto observed = function(p1, p2);

assert(expected == observed);

p1 = 5;
p2 = 4;

expected = (p1 &amp;gt; p2) ? trueValue : falseValue;
observed = function(p1, p2);

assert(expected == observed);
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;call&#34;&gt;Call&lt;/h4&gt;

&lt;p&gt;Calls a C function.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename R&amp;gt;
Node&amp;lt;R&amp;gt;&amp;amp; Call(Node&amp;lt;R (*)()&amp;gt;&amp;amp; function);

template &amp;lt;typename R, typename P1&amp;gt;
Node&amp;lt;R&amp;gt;&amp;amp; Call(Node&amp;lt;R (*)(P1)&amp;gt;&amp;amp; function,
              Node&amp;lt;P1&amp;gt;&amp;amp; param1);

template &amp;lt;typename R, typename P1, typename P2&amp;gt;
Node&amp;lt;R&amp;gt;&amp;amp; Call(Node&amp;lt;R (*)(P1, P2)&amp;gt;&amp;amp; function,
              Node&amp;lt;P1&amp;gt;&amp;amp; param1,
              Node&amp;lt;P2&amp;gt;&amp;amp; param2);

template &amp;lt;typename R, typename P1, typename P2, typename P3&amp;gt;
Node&amp;lt;R&amp;gt;&amp;amp; Call(Node&amp;lt;R (*)(P1, P2, P3)&amp;gt;&amp;amp; function,
              Node&amp;lt;P1&amp;gt;&amp;amp; param1,
              Node&amp;lt;P2&amp;gt;&amp;amp; param2,
              Node&amp;lt;P3&amp;gt;&amp;amp; param3);

template &amp;lt;typename R, typename P1, typename P2, typename P3, typename P4&amp;gt;
Node&amp;lt;R&amp;gt;&amp;amp; Call(Node&amp;lt;R (*)(P1, P2, P3, P4)&amp;gt;&amp;amp; function,
              Node&amp;lt;P1&amp;gt;&amp;amp; param1,
              Node&amp;lt;P2&amp;gt;&amp;amp; param2,
              Node&amp;lt;P3&amp;gt;&amp;amp; param3,
              Node&amp;lt;P4&amp;gt;&amp;amp; param4);
&lt;/code&gt;&lt;/pre&gt;

&lt;h5 id=&#34;examples-3&#34;&gt;Examples&lt;/h5&gt;

&lt;pre&gt;&lt;code&gt;// Call SampleFunction.
int SampleFunction(int p1, int p2)
{
    return p1 + p2;
}

Function&amp;lt;int, int, int&amp;gt; exp1(allocator1, code1);

typedef int (*F)(int, int);

auto &amp;amp;imm1 = exp1.Immediate&amp;lt;F&amp;gt;(SampleFunction);
auto &amp;amp;call1 = exp1.Call(imm1, exp1.GetP1(), exp1.GetP2());
auto fn1 = exp1.Compile(call1);

assert(10+35 == fn1(10, 35));
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&#34;rarely-used-methods&#34;&gt;Rarely used methods&lt;/h2&gt;

&lt;h4 id=&#34;unary-methods&#34;&gt;Unary methods&lt;/h4&gt;

&lt;pre&gt;&lt;code&gt;template &amp;lt;typename FROM&amp;gt; Node&amp;lt;FROM const&amp;gt;&amp;amp; AddConstCast(Node&amp;lt;FROM&amp;gt;&amp;amp; value);

template &amp;lt;typename FROM&amp;gt; Node&amp;lt;FROM&amp;gt;&amp;amp;
  RemoveConstCast(Node&amp;lt;FROM const&amp;gt;&amp;amp; value);

template &amp;lt;typename FROM&amp;gt; Node&amp;lt;FROM&amp;amp;&amp;gt;&amp;amp;
  RemoveConstCast(Node&amp;lt;FROM const &amp;amp;&amp;gt;&amp;amp; value);

template &amp;lt;typename FROM&amp;gt; Node&amp;lt;FROM const *&amp;gt;&amp;amp;
  AddTargetConstCast(Node&amp;lt;FROM*&amp;gt;&amp;amp; value);

template &amp;lt;typename FROM&amp;gt; Node&amp;lt;FROM*&amp;gt;&amp;amp;
  RemoveTargetConstCast(Node&amp;lt;FROM const *&amp;gt;&amp;amp; value);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These sound really weird, but they are useful in some obscure situations.&lt;/p&gt;

&lt;h4 id=&#34;binary-methods&#34;&gt;Binary methods&lt;/h4&gt;

&lt;pre&gt;&lt;code&gt;Node&amp;lt;T&amp;gt;&amp;amp; Shld(Node&amp;lt;T&amp;gt;&amp;amp; shiftee, Node&amp;lt;T&amp;gt;&amp;amp; filler, uint8_t bitCount);
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is used for packed types (i.e., bitfields that get packed into 64-bits) to extract a bitfield.&lt;/p&gt;
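&lt;p&gt;As a rough model of the underlying x64 &lt;code&gt;shld&lt;/code&gt; instruction (a sketch, not the NativeJIT implementation): the shiftee is shifted left, and the vacated low bits are filled from the high bits of the filler operand:&lt;/p&gt;

```cpp
#include <cstdint>

// Approximate semantics of x64 SHLD on 64-bit operands: shift shiftee
// left by bitCount and fill the vacated low bits with the high bits of
// filler. Assumes 0 < bitCount < 64. Illustration only.
uint64_t ShldModel(uint64_t shiftee, uint64_t filler, uint8_t bitCount)
{
    return (shiftee << bitCount) | (filler >> (64 - bitCount));
}
```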
</description>
    </item>
    
    <item>
      <title>Index Build Tools</title>
      <link>http://bitfunnel.org/index-build-tools/</link>
      <pubDate>Tue, 30 Aug 2016 15:51:23 -0600</pubDate>
      
      <guid>http://bitfunnel.org/index-build-tools/</guid>
      <description>

&lt;div&gt;
&lt;span style=&#34;background-color:lightgray;color:red;font-size:12&#34;&gt;
    &lt;b&gt;NOTE:&lt;/b&gt; This page was updated on 9/19/16 to reflect significant changes in the index build tools.
&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;After many months of hard work,
we kind of, sort of have a document ingestion pipeline that seems to work.
By this I mean we have a minimal set of configuration and ingestion tools
that we can compile and then run without crashing,
and these tools seem to ingest files mostly as expected.
We&amp;rsquo;re still going to need to do a lot of testing, tuning and evaluation,
but I thought it would be helpful to take this time to walk through the process
of bringing up an index from a set of chunk files extracted from
Wikipedia.&lt;/p&gt;

&lt;p&gt;The remainder of this post is a fairly long step-by-step description of the process
I used to configure and start a &lt;a href=&#34;https://github.com/bitfunnel/bitfunnel/&#34;&gt;BitFunnel&lt;/a&gt; index.
It&amp;rsquo;s pretty dry, but should be useful for people who want to play around with the system.&lt;/p&gt;

&lt;h2 id=&#34;obtaining-a-sample-corpus&#34;&gt;Obtaining a Sample Corpus&lt;/h2&gt;

&lt;p&gt;I decided to use a
&lt;a href=&#34;https://dumps.wikimedia.org/enwiki/20160305/enwiki-20160305-pages-articles1.xml-p000000010p000030302.bz2&#34;&gt;small portion of the English Wikipedia&lt;/a&gt;
as a basis for this walkthrough. This piece contains about 17k articles.
I processed it into a collection of &lt;a href=&#34;http://bitfunnel.org/corpus-file-format&#34;&gt;BitFunnel chunk files&lt;/a&gt; which
you can download from our &lt;a href=&#34;https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz&#34;&gt;Azure blob storage&lt;/a&gt;.
(see the post entitled &lt;a href=&#34;http://bitfunnel.org/sample-data&#34;&gt;Sample Data&lt;/a&gt; for instructions on downloading the pre-built chunk files).&lt;/p&gt;

&lt;p&gt;The chunk files were built from a Wikipedia dump using the process
outlined in the &lt;a href=&#34;https://github.com/BitFunnel/Workbench/blob/master/README.md&#34;&gt;WorkBench README&lt;/a&gt;.
If you would like to build your own chunks from scratch, download
&lt;a href=&#34;https://dumps.wikimedia.org/enwiki/20160305/enwiki-20160305-pages-articles1.xml-p000000010p000030302.bz2&#34;&gt;the dump&lt;/a&gt;
from the
&lt;a href=&#34;https://dumps.wikimedia.org/enwiki/20160305/&#34;&gt;Wikipedia dump page&lt;/a&gt; or grab an archived copy from our
&lt;a href=&#34;https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/enwiki-20160305-pages-articles1.xml-p000000010p000030302.bz2&#34;&gt;Azure blob storage&lt;/a&gt;.
Either way, the file must be decompressed before it can be used.&lt;/p&gt;

&lt;p&gt;The Wikipedia dump file is converted to chunks using a two-step process.
The first step uses an open source project called &lt;a href=&#34;https://github.com/attardi/wikiextractor&#34;&gt;wikiextractor&lt;/a&gt;
to filter out Wikipedia markup.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/sample-data/wikiextractor.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;The output of wikiextractor is a set of 1MB XML files, with names like wiki_00, wiki_01, wiki_02, etc.,
organized under directories AA, AB, AC, etc.&lt;/p&gt;

&lt;p&gt;The second step uses the Java-based &lt;a href=&#34;https://github.com/BitFunnel/Workbench&#34;&gt;Workbench project&lt;/a&gt; to
perform word-breaking, stemming, and stop-word elimination. The output of the Workbench stage
is a set of BitFunnel chunk files.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/sample-data/workbench.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;h2 id=&#34;gathering-corpus-statistics&#34;&gt;Gathering Corpus Statistics&lt;/h2&gt;

&lt;p&gt;Because BitFunnel is a probabilistic algorithm based on Bloom Filters,
its configuration depends on statistical properties of the corpus,
like the distributions of term frequencies and document lengths.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/sample-data/statistics.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;The &lt;code&gt;BitFunnel statistics&lt;/code&gt; command generates these statistics from a representative corpus.
Run &lt;code&gt;BitFunnel statistics -help&lt;/code&gt; to print out a help message
describing the command line arguments.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% BitFunnel statistics -help
StatisticsBuilder
Ingest documents and compute statistics about them.

Usage:
BitFunnel statistics &amp;lt;manifestFile&amp;gt;
                     &amp;lt;outDir&amp;gt;
                     [-help]
                     [-text]
                     [-gramsize &amp;lt;integer&amp;gt;]

&amp;lt;manifestFile&amp;gt;
    Path to a file containing the paths to the chunk
    files to be ingested. One chunk file per line.
    Paths are relative to working directory. (string)

&amp;lt;outDir&amp;gt;
    Path to the output directory where files will
    be written.  (string)

[-help]
    Display help for this program. (boolean, defaults to false)


[-text]
    Create mapping from Term::Hash to term text. (boolean, defaults to false)


[-gramsize &amp;lt;integer&amp;gt;]
    Set the maximum ngram size for phrases. (integer,
    defaults to 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first parameter is a manifest file that lists the paths to the chunk
files, one file per line. You can generate such a manifest with the Linux &lt;code&gt;find&lt;/code&gt;
command. Here&amp;rsquo;s an example that creates a manifest for all of the prebuilt chunks that were
downloaded to &lt;code&gt;/tmp/wikipedia/chunks&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% find /tmp/wikipedia/chunks -type f &amp;gt; /tmp/wikipedia/manifest.txt
&lt;/code&gt;&lt;/pre&gt;
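
&lt;p&gt;Note that a bare &lt;code&gt;find&lt;/code&gt; picks up every file under the directory;
the ingestion log below shows a stray &lt;code&gt;.DS_Store&lt;/code&gt; (macOS Finder metadata)
being ingested as though it were a chunk. One way to avoid that is to filter
on the chunk file naming pattern:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# Only the wiki_NN chunk files make it into the manifest.
find /tmp/wikipedia/chunks -type f -name &#39;wiki_*&#39; &amp;gt; /tmp/wikipedia/manifest.txt
&lt;/code&gt;&lt;/pre&gt;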

&lt;p&gt;The second parameter to &lt;code&gt;BitFunnel statistics&lt;/code&gt;
is the output directory. In this case I used &lt;code&gt;/tmp/wikipedia/config&lt;/code&gt;
(note that the prebuilt chunk tarball includes a &lt;code&gt;wikipedia/config&lt;/code&gt; directory
with the output of this walkthrough). Be sure to create the output directory
if it doesn&amp;rsquo;t already exist:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% mkdir /tmp/wikipedia/config
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When I ran &lt;code&gt;BitFunnel statistics&lt;/code&gt;, I omitted the -gramsize parameter
because I wanted statistics for a corpus of unigrams.
Had I included -gramsize, I could have generated statistics for
a corpus that included bigrams or trigrams or larger
&lt;a href=&#34;https://en.wikipedia.org/wiki/N-gram&#34;&gt;ngrams&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I included the -text parameter because I wanted the document frequency
table to be annotated with the text of each term. Had I not included
-text, the terms would be represented solely by their hash values.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s the console log:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;% BitFunnel statistics /tmp/wikipedia/manifest.txt  /tmp/wikipedia/config
Blocksize: 3592
Loading chunk list file &#39;/tmp/wikipedia/manifest.txt&#39;
Temp dir: &#39;/tmp/wikipedia/config&#39;
Reading 259 files
Ingesting . . .
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/.DS_Store
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_00
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_01
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_02
...
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AC/wiki_56
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AC/wiki_57
Ingestion complete.
  Ingestion time = 11.5985
  Ingestion rate (bytes/s): 1.78954e+07
Shard count:1
Document count: 17618
Bytes/Document: 11781.1
Total bytes read: 207559890
Posting count: 12848420
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The statistics files were written to &lt;code&gt;/tmp/wikipedia/config&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ls -l /tmp/wikipedia/config
total 100384
-rw-r--r--  1 mhop  wheel    216796 Sep 18 16:07 CumulativeTermCounts-0.csv
-rw-r--r--  1 mhop  wheel  21516222 Sep 18 16:07 DocFreqTable-0.csv
-rw-r--r--  1 mhop  wheel     18265 Sep 18 16:07 DocumentLengthHistogram.csv
-rw-r--r--  1 mhop  wheel   8202248 Sep 18 16:07 IndexedIdfTable-0.bin
-rw-r--r--  1 mhop  wheel  13463535 Sep 18 16:07 TermToText.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;CumulativeTermCounts-0.csv&lt;/code&gt; tracks the number of unique terms encountered
as a function of the number of documents ingested. It is not currently used
but will be needed for accurate models of memory consumption.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DocumentLengthHistogram.csv&lt;/code&gt; is a histogram of documents organized by
the number of unique terms in each document.
It is not currently used, but will be needed to determine
how to organize documents into shards according to posting count.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DocFreqTable-0.csv&lt;/code&gt; lists the unique terms in the corpus in descending
frequency order. In other words, more common words appear before less common
words. Here&amp;rsquo;s what the file looks like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;% more /tmp/wikipedia/config/DocFreqTable-0.csv
hash,gramSize,streamId,frequency,text
3f0ffc72a21fd2be,1,1,0.840447,from
b3697479c07d98d5,1,1,0.808378,which
4d34895e97b1888c,1,1,0.795266,also
17e90965afd3104d,1,1,0.763764,have
d14e34f5833aecee,1,1,0.755875,one
5d4e8d01c132cf18,1,1,0.745147,other
...
8f2e873ae4281b44,1,1,5.67601e-05,arslān
d859a38c4ac69616,1,1,5.67601e-05,influxu
c307a841264a8d2d,1,1,5.67601e-05,köşk
e19c09157d44124,1,1,5.67601e-05,tharil
3885b5929dd1a8dd,1,1,5.67601e-05,www.routledge.com
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see, the term &amp;ldquo;from&amp;rdquo; is the most common,
appearing in about 84% of documents. Towards the end
of the file, &amp;ldquo;arslān&amp;rdquo; is one of the rarest, appearing
in 0.006% of documents, or once in the entire corpus of 17618 documents.&lt;/p&gt;
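
&lt;p&gt;The &lt;code&gt;frequency&lt;/code&gt; column is the fraction of documents containing the term,
so multiplying by the document count recovers absolute document counts. Here&amp;rsquo;s a quick
&lt;code&gt;awk&lt;/code&gt; sanity check using the numbers from the table above:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# frequency * (total documents) recovers the number of documents
# containing each term.
awk &#39;BEGIN { printf &amp;quot;%.0f %.0f\n&amp;quot;, 0.840447 * 17618, 5.67601e-05 * 17618 }&#39;
# prints: 14807 1
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So &amp;ldquo;from&amp;rdquo; appears in roughly 14807 of the 17618 documents, while
&amp;ldquo;arslān&amp;rdquo; appears in exactly one.&lt;/p&gt;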

&lt;p&gt;&lt;code&gt;IndexedIdfTable-0.bin&lt;/code&gt; is a binary file containing the IDF value for each term.
It is used for constructing Terms during document ingestion and query
formulation.&lt;/p&gt;

&lt;p&gt;Finally, &lt;code&gt;TermToText.bin&lt;/code&gt; is a binary file containing a mapping from a term&amp;rsquo;s
hash value to its text representation. It is used for debugging and diagnostics.&lt;/p&gt;

&lt;p&gt;These files will be used in the next stage where we build the &lt;code&gt;TermTable&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&#34;building-a-termtable&#34;&gt;Building a TermTable&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;TermTable&lt;/code&gt; is one of the most important data structures in BitFunnel.
It maps each term to the exact set of rows used to indicate the term&amp;rsquo;s
presence in a document.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/sample-data/termtable.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;I won&amp;rsquo;t get into how the TermTable builder works at this point, but let&amp;rsquo;s look
at how to run it. Type &lt;code&gt;BitFunnel termtable -help&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;BitFunnel termtable -help
TermTableBuilderTool
Generate a TermTable from a DocumentFrequencyTable.

Usage:
BitFunnel termtable &amp;lt;tempPath&amp;gt;
                    [-help]

&amp;lt;tempPath&amp;gt;
    Path to a tmp directory. Something like /tmp/ or c:\temp\,
    depending on platform.. (string)

[-help]
    Display help for this program. (boolean, defaults to false)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this case the help message could be a bit better.
All you need to know is that &lt;code&gt;BitFunnel termtable&lt;/code&gt; has a single argument
and this is the output directory from the &lt;code&gt;BitFunnel statistics&lt;/code&gt; stage.
The TermTable builder will read &lt;code&gt;DocFreqTable-0.csv&lt;/code&gt;, construct a very basic
&lt;code&gt;TermTable&lt;/code&gt;, and write it to &lt;code&gt;TermTable-0.bin&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For now, the algorithm creates a &lt;code&gt;TermTable&lt;/code&gt; for unigrams
that uses a combination of rank 0 and rank 3 rows. The algorithm
is naive and doesn&amp;rsquo;t handle higher order ngrams.
Down the road, we will improve the algorithm and add more options.&lt;/p&gt;

&lt;p&gt;Here&amp;rsquo;s the output from my run:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;BitFunnel termtable /tmp/wikipedia/config
Loading files for TermTable build.
Starting TermTable build.
Total build time: 0.439845 seconds.
===================================
RowAssigner for rank 0
  Terms
    Total: 512640
    Adhoc: 468884
    Explicit: 42357
    Private: 1399

  Rows
    Total: 5814
    Adhoc: 590
    Explicit: 3822
    Private: 1399

  Bytes per document: 778.75


  Densities in explicit shared rows
    Mean: 0.0999029
    Min: 0.0980813
    Max: 0.0999549
    Variance: 3.00201e-08

===================================
RowAssigner for rank 1
  No terms

===================================
RowAssigner for rank 2
  No terms

===================================
RowAssigner for rank 3
  Terms
    Total: 511026
    Adhoc: 416436
    Explicit: 90958
    Private: 3632

  Rows
    Total: 36107
    Adhoc: 1827
    Explicit: 30648
    Private: 3632

  Bytes per document: 564.172


  Densities in explicit shared rows
    Mean: 0.099836
    Min: 0.0124819
    Max: 0.0999997
    Variance: 5.68987e-07

===================================
RowAssigner for rank 4
  No terms

===================================
RowAssigner for rank 5
  No terms

===================================
RowAssigner for rank 6
  No terms

===================================
RowAssigner for rank 7
  No terms

Writing TermTable files.
Done.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;From the output above, we can see that this index will consume roughly 779 bytes
of rank 0 rows and 564 bytes of rank 3 rows per document ingested. This corpus
has 17618 documents, so the total memory consumption for rows should be about 22.6Mb.
This is a significant reduction from the 198Mb of chunk files, but at this
point we can draw no conclusions because we have no idea whether this naive
configuration has an acceptable false positive rate.&lt;/p&gt;
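
&lt;p&gt;That estimate is easy to reproduce with a little &lt;code&gt;awk&lt;/code&gt; arithmetic
(treating Mb as 2^20 bytes, the same convention as the 198Mb chunk-file figure),
which lands right around the figure above:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# (rank 0 bytes/doc + rank 3 bytes/doc) * document count, in Mb.
awk &#39;BEGIN { printf &amp;quot;%.1f Mb\n&amp;quot;, (778.75 + 564.172) * 17618 / (1024 * 1024) }&#39;
# prints: 22.6 Mb
&lt;/code&gt;&lt;/pre&gt;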

&lt;p&gt;If we look in &lt;code&gt;/tmp/wikipedia/config&lt;/code&gt; we will see the TermTable is stored in
a new binary file called &lt;code&gt;TermTable-0.bin&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;ls -l /tmp/wikipedia/config
total 100384
-rw-r--r--  1 mhop  wheel    216796 Sep 18 16:07 CumulativeTermCounts-0.csv
-rw-r--r--  1 mhop  wheel  21516222 Sep 18 16:07 DocFreqTable-0.csv
-rw-r--r--  1 mhop  wheel     18265 Sep 18 16:07 DocumentLengthHistogram.csv
-rw-r--r--  1 mhop  wheel   8202248 Sep 18 16:07 IndexedIdfTable-0.bin
-rw-r--r--  1 mhop  wheel   7970980 Sep 18 16:09 TermTable-0.bin
-rw-r--r--  1 mhop  wheel  13463535 Sep 18 16:07 TermToText.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We&amp;rsquo;ve now configured our system for a typical corpus with documents similar to
those in the chunk files listed in the original manifest. In the next step
we will ingest some files and look at the resulting row table values.&lt;/p&gt;

&lt;h2 id=&#34;ingesting-a-small-corpus&#34;&gt;Ingesting a Small Corpus&lt;/h2&gt;

&lt;p&gt;Now we&amp;rsquo;re getting to the fun part. &lt;code&gt;BitFunnel repl&lt;/code&gt; is a sample application that
provides an interactive Read-Eval-Print loop for ingesting documents, running
queries, and inspecting various data structures.&lt;/p&gt;


&lt;figure &gt;
    
        &lt;img src=&#34;http://bitfunnel.org/sample-data/repl.png&#34; /&gt;
    
    
&lt;/figure&gt;


&lt;p&gt;Here&amp;rsquo;s the help message:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;BitFunnel repl -help
StatisticsBuilder
Ingest documents and compute statistics about them.

Usage:
BitFunnel repl &amp;lt;path&amp;gt;
               [-help]
               [-gramsize &amp;lt;integer&amp;gt;]
               [-threads &amp;lt;integer&amp;gt;]

&amp;lt;path&amp;gt;
    Path to a tmp directory. Something like /tmp/ or c:\temp\,
    depending on platform.. (string)

[-help]
    Display help for this program. (boolean, defaults to false)


[-gramsize &amp;lt;integer&amp;gt;]
    Set the maximum ngram size for phrases. (integer, defaults
    to 1)

[-threads &amp;lt;integer&amp;gt;]
    Set the thread count for ingestion and query processing.
    (integer, defaults to 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first parameter is the path to the directory with the configuration
files. In my case this is &lt;code&gt;/tmp/wikipedia/config&lt;/code&gt;. The gramsize should be the same
value used in the &lt;code&gt;BitFunnel statistics&lt;/code&gt; stage.&lt;/p&gt;

&lt;p&gt;When you start the application, it prints out a welcome message, explains
how to get help, and then prompts for input. The prompt is an integer followed
by a colon. Type &amp;ldquo;help&amp;rdquo; to get a list of commands. You can also get help on
a specific command:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;BitFunnel repl /tmp/wikipedia/config
Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.

directory = &amp;quot;/tmp/wikipedia/config&amp;quot;
gram size = 1

Starting index ...
Blocksize: 11005320
Index started successfully.

Type &amp;quot;help&amp;quot; to get started.

0: help
Available commands:
  cache   Ingests documents into the index and also stores them in a cache
for query verification purposes.
  delay   Prints a message after certain number of seconds
  help    Displays a list of available commands.
  load    Ingests documents into the index
  query   Process a single query or list of queries. (TODO)
  quit    waits for all current tasks to complete then exits.
  script  Runs commands from a file.(TODO)
  show    Shows information about various data structures. (TODO)
  status  Prints system status.
  verify  Verifies the results of a single query against the document cache.

Type &amp;quot;help &amp;lt;command&amp;gt;&amp;quot; for more information on a particular command.

1: help cache
cache (manifest | chunk) &amp;lt;path&amp;gt;
  Ingests a single chunk file or a list of chunk
  files specified by a manifest.
  Also caches IDocuments for query verification.

2: help show
show cache &amp;lt;term&amp;gt;
   | rows &amp;lt;term&amp;gt; [&amp;lt;docstart&amp;gt; &amp;lt;docend&amp;gt;]
   | term &amp;lt;term&amp;gt;
  Shows information about various data structures.  PARTIALLY IMPLEMENTED
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Right now the &lt;code&gt;cache&lt;/code&gt; command doesn&amp;rsquo;t support the manifest option
and the &lt;code&gt;show&lt;/code&gt; command only supports the &lt;code&gt;rows&lt;/code&gt; and &lt;code&gt;term&lt;/code&gt; options.&lt;/p&gt;

&lt;p&gt;Let&amp;rsquo;s ingest a single chunk file. We&amp;rsquo;ll use &lt;code&gt;/tmp/wikipedia/chunks/AA/wiki_00&lt;/code&gt;
which contains the following 41 documents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;12: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=12&#34;&gt;Anarchism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;25: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=25&#34;&gt;Autism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;39: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=39&#34;&gt;Albedo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;128: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=128&#34;&gt;Talk:Atlas Shrugged&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;290: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=290&#34;&gt;A&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;295: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=295&#34;&gt;User:AnonymousCoward&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;303: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=303&#34;&gt;Alabama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;305: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=305&#34;&gt;Achilles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;307: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=307&#34;&gt;Abraham Lincoln&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;308: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=308&#34;&gt;Aristotle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;309: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=309&#34;&gt;An American in Paris&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;316: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=316&#34;&gt;Academy Award for Best Production Design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;324: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=324&#34;&gt;Academy Awards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;330: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=330&#34;&gt;Actrius&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;332: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=332&#34;&gt;Animalia (book)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;334: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=334&#34;&gt;International Atomic Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;336: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=336&#34;&gt;Altruism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;339: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=339&#34;&gt;Ayn Rand&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;340: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=340&#34;&gt;Alain Connes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;344: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=344&#34;&gt;Allan Dwan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;354: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=354&#34;&gt;Talk:Algeria&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;358: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=358&#34;&gt;Algeria&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;359: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=359&#34;&gt;List of Atlas Shrugged characters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;569: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=569&#34;&gt;Anthropology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;572: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=572&#34;&gt;Agricultural science&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;573: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=573&#34;&gt;Alchemy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;579: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=579&#34;&gt;Alien&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;580: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=580&#34;&gt;Astronomer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;582: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=582&#34;&gt;Talk:Altruism/Archive 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;586: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=586&#34;&gt;ASCII&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;590: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=590&#34;&gt;Austin (disambiguation)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;593: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=593&#34;&gt;Animation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;594: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=594&#34;&gt;Apollo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;595: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=595&#34;&gt;Andre Agassi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;597: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=597&#34;&gt;Austroasiatic languages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;599: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=599&#34;&gt;Afroasiatic languages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;600: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=600&#34;&gt;Andorra&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;612: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=612&#34;&gt;Arithmetic mean&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;615: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=615&#34;&gt;American Football Conference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;620: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=620&#34;&gt;Animal Farm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;621: &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=621&#34;&gt;Amphibian&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here&amp;rsquo;s the chunk loading in the REPL console:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;3: cache chunk /tmp/wikipedia/chunks/AA/wiki_00
Ingesting chunk file &amp;quot;/tmp/wikipedia/chunks/AA/wiki_00&amp;quot;
Caching IDocuments for query verification.
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_00
Ingestion complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can now use the &lt;code&gt;show rows&lt;/code&gt; command to display the rows
associated with a particular term.
This command lists each of the &lt;code&gt;RowIds&lt;/code&gt; associated with a term,
followed by the bits for the first 64 documents.
The document ids are printed vertically
above each column of bits.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;4: show rows also
Term(&amp;quot;also&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(0, 1005 ): 11111111111011111111111101011111111111011
5: show rows some
Term(&amp;quot;some&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(0, 1011 ): 11111011111011001101110101001101111111011
6: show rows wings
Term(&amp;quot;wings&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(3, 6668 ): 00000001000000000000000000000000000000000
  RowId(3, 6669 ): 00000001000000000000000000000000000000000
  RowId(3, 6670 ): 00000001000000000000000000000000000000000
  RowId(0, 4498 ): 00000001000000000000110000000100000000000
7: show rows anarchy
Term(&amp;quot;anarchy&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(3, 11507): 10000000110000001000010000000000000000000
  RowId(3, 11508): 10000000110000001000010000000000000000000
  RowId(3, 11509): 10000000110000001000010000000000000000000
  RowId(0, 5354 ): 11000001110000000000010001001000000000010
8: show rows kingdom
Term(&amp;quot;kingdom&amp;quot;)
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(0, 1609 ): 11000010010000100000000000000001000010001
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At prompt 4, the command &lt;code&gt;show rows also&lt;/code&gt; returns a single &lt;code&gt;RowId(0, 1005)&lt;/code&gt;
corresponding to the term &amp;ldquo;also&amp;rdquo;. This is a rank 0 row in position 1005.
The fact that the word &amp;ldquo;also&amp;rdquo; is associated with a single row indicates
that the term is fairly common in the corpus.
This is consistent with the pattern of 1 bits which show that the term
appears in every document except 316, 572, 579, and 615.&lt;/p&gt;

&lt;p&gt;At prompt 5, we see that the word &amp;ldquo;some&amp;rdquo; is also common in the corpus.
It appears in every document except 295, 316, 332, 334, 359, 572, 579, 580, and 615.&lt;/p&gt;

&lt;p&gt;Both &amp;ldquo;also&amp;rdquo; and &amp;ldquo;some&amp;rdquo; appear so frequently that they are assigned private rows.&lt;/p&gt;

&lt;p&gt;The term &amp;ldquo;wings&amp;rdquo;, seen at prompt 6, is less common. It is actually rare enough to require
four row intersections to drive the noise to a tolerable level. If we look
at the intersection of the four rows, we see that only document 305 contains
the term. This is the only column that consists solely of 1s. All of the
other columns have some 0s.&lt;/p&gt;

&lt;p&gt;At prompt 7 we see that anarchy is also rare enough to require four rows,
but it seems to appear in documents 12, 307, 308, and 358. A quick search of the
actual web pages shows that &amp;ldquo;anarchy&amp;rdquo; appears in document 12 which is about
&lt;a href=&#34;https://en.wikipedia.org/wiki?curid=12&#34;&gt;Anarchism&lt;/a&gt; and document 307 which is
about &lt;a href=&#34;https://en.wikipedia.org/wiki?curid=307&#34;&gt;Abraham Lincoln&lt;/a&gt;. Pages 308 and
358 do not actually contain the term, so we are seeing a case where BitFunnel
would report false positives.&lt;/p&gt;
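
&lt;p&gt;The matcher&amp;rsquo;s core operation is a bitwise AND across a term&amp;rsquo;s rows.
As a sketch (not BitFunnel code, just the row bits from prompt 7 fed through
&lt;code&gt;awk&lt;/code&gt;), we can reproduce the intersection:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# Bitwise AND of the four &amp;quot;anarchy&amp;quot; rows printed at prompt 7.
awk &#39;BEGIN {
  r[1] = &amp;quot;10000000110000001000010000000000000000000&amp;quot;
  r[2] = r[1]; r[3] = r[1]
  r[4] = &amp;quot;11000001110000000000010001001000000000010&amp;quot;
  for (i = 1; i &amp;lt;= length(r[1]); i++) {
    bit = 1
    for (j = 1; j &amp;lt;= 4; j++)
      if (substr(r[j], i, 1) == &amp;quot;0&amp;quot;) bit = 0
    out = out bit
  }
  print out
}&#39;
# prints: 10000000110000000000010000000000000000000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The surviving columns correspond to documents 12, 307, 308, and 358: the two
true matches plus the two false positives.&lt;/p&gt;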

&lt;p&gt;Now let&amp;rsquo;s look at the &lt;code&gt;verify one&lt;/code&gt; command. Today this command
runs a very slow verification query engine on the IDocuments cached earlier by the &lt;code&gt;cache chunk&lt;/code&gt;
command. In the future, &lt;code&gt;verify&lt;/code&gt; will run the BitFunnel query engine and compare
its output with the verification query engine.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;9: verify one wings
Processing query &amp;quot; wings&amp;quot;
  DocId(305)
1 match(es) out of 41 documents.
10: verify one anarchy
Processing query &amp;quot; anarchy&amp;quot;
  DocId(307)
  DocId(12)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Prompt 9 shows that only document 305 contains the term &amp;ldquo;wings&amp;rdquo;. Prompt 10 reports
that only documents 12 and 307 contain the term &amp;ldquo;anarchy&amp;rdquo;.&lt;/p&gt;

&lt;p&gt;Note that &lt;code&gt;verify one&lt;/code&gt; accepts any &lt;a href=&#34;http://bitfunnel.org/a-small-query-language&#34;&gt;legal BitFunnel query&lt;/a&gt;.
Here&amp;rsquo;s an example:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;11: verify one -some (anarchy | kingdom)
Processing query &amp;quot; -some (anarchy | kingdom)&amp;quot;
  DocId(332)
1 match(es) out of 41 documents.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Well that&amp;rsquo;s enough for now. Hopefully this walkthrough will help you get
started with BitFunnel configuration.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>