Index Build Tools

This post is part of a series on open sourcing portions of the Bing search engine.

After many months of hard work, we kind of, sort of have a document ingestion pipeline that seems to work. By this I mean we have a minimal set of configuration and ingestion tools that we can compile and then run without crashing, and these tools seem to ingest files mostly as expected. We’re still going to need to do a lot of testing, tuning and evaluation, but I thought it would be helpful to take this time to walk through the process of bringing up an index from a set of chunk files extracted from Wikipedia.

We break the process down into three phases, each of which has a dedicated tool. The remainder of this post is a fairly long step-by-step description of the process of using the three tools to configure and start a BitFunnel index. It’s pretty dry, but should be useful for people who want to play around with the system.

Gathering Corpus Statistics

Because BitFunnel is a probabilistic algorithm based on Bloom Filters, its configuration depends on statistical properties of the corpus, like the distributions of term frequencies and document lengths.
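
To make that dependence concrete, here is a rough back-of-the-envelope model (not BitFunnel’s actual row assignment algorithm) of the kind of trade-off the statistics feed into: terms share rows that are filled to some target bit density, and a term needs enough rows that the chance of an accidental all-ones match (“noise”) stays below a target. The 10% density and 0.1% noise figures below are assumed values chosen only for illustration, and the two term frequencies come from the document frequency table generated later in this post.

#include <cmath>
#include <cstdio>

// Hypothetical back-of-the-envelope model, not BitFunnel's actual algorithm.
// Terms more common than the target row density get a private row; everything
// else shares rows, and enough shared rows are used that the chance of an
// accidental all-ones match (roughly density^k for k rows) drops below the
// noise target.
int RowsNeeded(double termFrequency, double density, double targetNoise)
{
    if (termFrequency >= density)
    {
        return 1;    // Common term: a private row, so no sharing and no noise.
    }
    return static_cast<int>(std::ceil(std::log(targetNoise) / std::log(density)));
}

int main()
{
    // Frequencies taken from the document frequency table generated later in
    // this post: "from" appears in ~84% of documents, "cremisan" in ~0.014%.
    std::printf("from: %d row(s)\n", RowsNeeded(0.843259, 0.1, 0.001));
    std::printf("cremisan: %d row(s)\n", RowsNeeded(0.000144196, 0.1, 0.001));
    return 0;
}

The model BitFunnel actually uses is more sophisticated, but the point stands: how many rows a term needs depends on how often it appears in the corpus, and measuring that is exactly what StatisticsBuilder is for.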

The StatisticsBuilder tool generates these statistics from a representative corpus. If you run StatisticsBuilder with no arguments, it will print out a help message describing its command line arguments.

C:\temp>StatisticBuilder
Expected <chunkListFileName>.
Use -help for more information.
StatisticsBuilder
Ingest documents and compute statistics about them.

Usage:
StatisticBuilder.exe <chunkListFileName>
                     <tempPath>
                     [-help]
                     [-statistics]
                     [-text]
                     [-gramsize <integer>]

<chunkListFileName>
    Path to a file containing the paths
    to the chunk files to be ingested. One
    chunk file per line. Paths are relative
    to working directory. (string)

<tempPath>
    Path to a tmp directory. Something like
    /tmp/ or c:\temp\, depending on platform..
    (string)

[-help]
    Display help for this program.
    (boolean, defaults to false)


[-statistics]
    Generate index statistics such as document
    frequency table, document length histogram,
    and cumulative term counts.
    (boolean, defaults to false)


[-text]
    Create mapping from Term::Hash to term text.
    (boolean, defaults to false)


[-gramsize <integer>]
    Set the maximum ngram size for phrases.
    (integer, defaults to 1)

The input to StatisticsBuilder is a manifest file listing the chunk files that should be used as a basis for generating corpus statistics. I obtained a set of chunk files by converting a Wikipedia Database Dump file using the process outlined in the Workbench README.md. By default Workbench outputs 1MB chunk files, organized in subdirectories containing 100 files each. When I ran Workbench on a small Wikipedia dump, it generated about 300 chunk files, spread across three directories:

C:\temp>dir Wikipedia
 Directory of c:\temp\Wikipedia

07/29/2016  09:57 PM    <DIR>          .
07/29/2016  09:57 PM    <DIR>          ..
07/29/2016  09:57 PM    <DIR>          AA
07/29/2016  09:57 PM    <DIR>          AB
07/29/2016  09:57 PM    <DIR>          AC
               0 File(s)
               5 Dir(s)

C:\temp>dir /w c:\temp\Wikipedia\AA
 Directory of c:\temp\Wikipedia\AA

[.]       [..]      wiki_00   wiki_01   wiki_02   wiki_03
wiki_04   wiki_05   wiki_06   wiki_07   wiki_08   wiki_09
wiki_10   wiki_11   wiki_12   wiki_13   wiki_14   wiki_15
wiki_16   wiki_17   wiki_18   wiki_19   wiki_20   wiki_21
wiki_22   wiki_23   wiki_24   wiki_25   wiki_26   wiki_27
wiki_28   wiki_29   wiki_30   wiki_31   wiki_32   wiki_33
wiki_34   wiki_35   wiki_36   wiki_37   wiki_38   wiki_39
wiki_40   wiki_41   wiki_42   wiki_43   wiki_44   wiki_45
wiki_46   wiki_47   wiki_48   wiki_49   wiki_50   wiki_51
wiki_52   wiki_53   wiki_54   wiki_55   wiki_56   wiki_57
wiki_58   wiki_59   wiki_60   wiki_61   wiki_62   wiki_63
wiki_64   wiki_65   wiki_66   wiki_67   wiki_68   wiki_69
wiki_70   wiki_71   wiki_72   wiki_73   wiki_74   wiki_75
wiki_76   wiki_77   wiki_78   wiki_79   wiki_80   wiki_81
wiki_82   wiki_83   wiki_84   wiki_85   wiki_86   wiki_87
wiki_88   wiki_89   wiki_90   wiki_91   wiki_92   wiki_93
wiki_94   wiki_95   wiki_96   wiki_97   wiki_98   wiki_99
             100 File(s)
               2 Dir(s)

C:\temp>

I created a manifest file, listing each of the chunk files:

C:\temp>type c:\temp\Wikipedia\manifest100.txt

c:\git\Wikipedia\out\AA\wiki_00
c:\git\Wikipedia\out\AA\wiki_01
c:\git\Wikipedia\out\AA\wiki_02
...
c:\git\Wikipedia\out\AA\wiki_97
c:\git\Wikipedia\out\AA\wiki_98
c:\git\Wikipedia\out\AA\wiki_99

On Linux, you can build this list with find. If you give find an absolute path, it will print absolute paths, so one way to generate the manifest is to go to the output directory and run

find `pwd` -type f
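
Redirecting that listing into a file gives you the manifest directly; on Windows, a recursive bare-format directory listing does the same job. Both commands below are just standard shell idioms, shown with the paths from my setup:

find `pwd` -type f | sort > manifest100.txt

dir /s /b /a-d c:\git\Wikipedia\out\AA > c:\temp\Wikipedia\manifest100.txt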

When I ran StatisticsBuilder, I omitted the -gramsize parameter because I wanted statistics for a corpus of unigrams. Had I included -gramsize, I could have generated statistics for a corpus that included bigrams, trigrams, or larger ngrams. I included the -text parameter because I wanted the document frequency table to be annotated with the text of each term. Had I not included -text, the terms would be represented solely by their hash values. The -statistics parameter indicates that the statistics files should be generated.

C:\temp>StatisticBuilder c:\temp\Wikipedia\Manifest100.txt ^
                         c:\temp\output -statistics -text

Blocksize: 3592
Loading chunk list file 'c:\temp\Wikipedia\Manifest100.txt'
Temp dir: 'c:\temp\output'
Reading 100 files
Ingesting . . .
ChunkTaskProcessor::ProcessTask: filePath:c:\temp\Wikipedia\AA\wiki_00
ChunkTaskProcessor::ProcessTask: filePath:c:\temp\Wikipedia\AA\wiki_01
ChunkTaskProcessor::ProcessTask: filePath:c:\temp\Wikipedia\AA\wiki_02
...
ChunkTaskProcessor::ProcessTask: filePath:c:\temp\Wikipedia\AA\wiki_97
ChunkTaskProcessor::ProcessTask: filePath:c:\temp\Wikipedia\AA\wiki_98
ChunkTaskProcessor::ProcessTask: filePath:c:\temp\Wikipedia\AA\wiki_99
Ingestion complete.
  Ingestion time = 6.8961
  Ingestion rate (bytes/s): 1.16768e+07
Shard count:1
Document count: 6935
Bytes/Document: 11611.3
Total bytes read: 80524672
Posting count: 4978789

The statistics files were written to c:\temp\output:

C:\temp>dir c:\temp\output

 Directory of c:\temp\output

08/23/2016  06:56 PM    <DIR>          .
08/23/2016  06:56 PM    <DIR>          ..
08/30/2016  05:57 PM            80,982 CumulativeTermCounts-0.csv
08/30/2016  05:57 PM        11,952,730 DocFreqTable-0.csv
08/30/2016  05:57 PM            13,289 DocumentLengthHistogram.csv
08/30/2016  05:57 PM         4,570,472 IndexedIdfTable-0.bin
08/30/2016  05:57 PM         7,442,262 TermToText.bin
               5 File(s)
               2 Dir(s)

CumulativeTermCounts-0.csv tracks the number of unique terms encountered as a function of the number of documents ingested. It is not currently used but will be needed for accurate models of memory consumption.

DocumentLengthHistogram.csv is a histogram of documents organized by the number of unique terms in each document. It is not currently used, but will be needed to determine how to organize documents into shards according to posting count.

DocFreqTable-0.csv lists the unique terms in the corpus in descending frequency order. In other words, more common words appear before less common words. Here’s what the file looks like:

C:\temp>type c:\temp\output\DocFreqTable-0.csv

hash,gramSize,streamId,frequency,text
3f0ffc72a21fd2be,1,1,0.843259,from
b3697479c07d98d5,1,1,0.810382,which
4d34895e97b1888c,1,1,0.791204,also
17e90965afd3104d,1,1,0.769719,have
d14e34f5833aecee,1,1,0.753857,one
5d4e8d01c132cf18,1,1,0.742754,other
...
756abe13d678a849,1,1,0.000144196,arsenalistas
6a3184998008352b,1,1,0.000144196,29ubt
10e864b51171bcc5,1,1,0.000144196,baklawa
d146186a27cdee5a,1,1,0.000144196,openshaw
fa8d1f523a31ee5a,1,1,0.000144196,cremisan

As you can see, the term “from” is the most common, appearing in about 84% of documents. Towards the end of the file, “cremisan” is one of the rarest, appearing in 0.014% of documents, or once in the entire corpus of 6935 documents.

IndexedIdfTable-0.bin is a binary file containing the IDF value for each term. It is used for constructing Terms during document ingestion and query formulation.
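
For a sense of scale, under the conventional definition idf = log10(1/frequency) (BitFunnel’s stored values may be computed slightly differently), the ubiquitous term “from” (frequency 0.843) has an IDF of roughly 0.07, while the one-off term “cremisan” (frequency 0.000144) has an IDF of roughly 3.84, so rare terms carry far more weight.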

Finally, TermToText.bin is a binary file containing a mapping from a term’s hash value to its text representation. It is used for debugging and diagnostics.

These files will be used in the next stage where we build the TermTable.

Building a TermTable

The TermTable is one of the most important data structures in BitFunnel. It maps each term to the exact set of rows used to indicate the term’s presence in a document.

I won’t get into how the TermTableBuilder works at this point, but let’s look at how to run it. As with StatisticsBuilder, if you run TermTableBuilder with no parameters, it will print out a help message:

C:\temp>TermTableBuilder
Expected <tempPath>.
Use -help for more information.
TermTableBuilder
Generate a TermTable from a DocumentFrequencyTable.

Usage:
TermTableBuilder.exe <tempPath>
                     [-help]

<tempPath>
    Path to a tmp directory. Something like /tmp/ or c:\temp\,
    depending on platform.. (string)

[-help]
    Display help for this program. (boolean, defaults to false)

In this case the help message could be a bit better. All you really need to know is that TermTableBuilder takes a single argument: the output directory from the StatisticsBuilder stage. TermTableBuilder will read DocFreqTable-0.csv, construct a very basic TermTable, and write it to TermTable-0.bin.

For now, TermTableBuilder creates a TermTable for unigrams that uses a combination of rank 0 and rank 3 rows. The algorithm is naive and doesn’t handle higher order ngrams. Down the road, we will improve the algorithm and add more options.
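
To give a feel for what the finished table provides, here is a minimal sketch in the spirit of the real data structure, but not BitFunnel’s actual classes: explicit and private terms are listed individually, while rare “adhoc” terms fall back to rows derived from the term’s hash. The adhoc recipe below is hypothetical, and the row counts it uses (1233 adhoc rank 3 rows and 550 adhoc rank 0 rows) are simply the figures from the build output shown below.

#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// A minimal sketch (not BitFunnel's actual classes) of what a TermTable
// provides: given a term's hash, return the set of rows whose bits must all
// be set for a document to be considered a match for that term.
struct RowId
{
    unsigned rank;   // 0 or 3 in the table built above.
    unsigned index;  // Position of the row within its rank.
};

class TermTableSketch
{
public:
    // Explicit and private terms are enumerated individually at build time.
    void AddExplicit(uint64_t termHash, std::vector<RowId> rows)
    {
        m_explicit[termHash] = std::move(rows);
    }

    // Terms too rare to enumerate ("adhoc") fall back to rows derived from
    // the hash itself, so the table stays a fixed size.
    std::vector<RowId> GetRows(uint64_t termHash) const
    {
        auto it = m_explicit.find(termHash);
        if (it != m_explicit.end())
        {
            return it->second;
        }
        return AdhocRows(termHash);
    }

private:
    static std::vector<RowId> AdhocRows(uint64_t hash)
    {
        // Hypothetical recipe: derive three rank 3 rows and one rank 0 row
        // from the hash, mirroring the pattern seen in "show rows" output.
        return { { 3, static_cast<unsigned>(hash % 1233) },
                 { 3, static_cast<unsigned>((hash >> 16) % 1233) },
                 { 3, static_cast<unsigned>((hash >> 32) % 1233) },
                 { 0, static_cast<unsigned>((hash >> 48) % 550) } };
    }

    std::unordered_map<uint64_t, std::vector<RowId>> m_explicit;
};

int main()
{
    TermTableSketch table;

    // Suppose the builder assigned "also" (hash 4d34895e97b1888c in the
    // document frequency table) a single private rank 0 row.
    table.AddExplicit(0x4d34895e97b1888cULL, { { 0, 1005 } });

    for (RowId row : table.GetRows(0x4d34895e97b1888cULL))
    {
        std::printf("RowId(%u, %u)\n", row.rank, row.index);
    }
    return 0;
}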

Here’s the output from my run:

C:\temp>TermTableBuilder c:\temp\output
Loading files for TermTable build.
Starting TermTable build.
Total build time: 0.469099 seconds.
===================================
RowAssigner for rank 0
  Terms
    Total: 285654
    Adhoc: 240077
    Explicit: 44212
    Private: 1365

  Rows
    Total: 5731
    Adhoc: 550
    Explicit: 3813
    Private: 1365

  Bytes per document: 773.375


  Densities in explicit shared rows
    Mean: 0.099865
    Min: 0.0131218
    Max: 0.0999282
    Variance: 1.99592e-06

===================================
RowAssigner for rank 1
  No terms

===================================
RowAssigner for rank 2
  No terms

===================================
RowAssigner for rank 3
  Terms
    Total: 284073
    Adhoc: 170966
    Explicit: 109490
    Private: 3617

  Rows
    Total: 36749
    Adhoc: 1233
    Explicit: 31899
    Private: 3617

  Bytes per document: 574.203


  Densities in explicit shared rows
    Mean: 0.0996313
    Min: 0.00115307
    Max: 0.0999989
    Variance: 6.75945e-07

===================================
RowAssigner for rank 4
  No terms

===================================
RowAssigner for rank 5
  No terms

===================================
RowAssigner for rank 6
  No terms

===================================
RowAssigner for rank 7
  No terms

Writing TermTable files.
Computing regression
Done.

If we look in c:\temp\output, we will see a new binary file called TermTable-0.bin:

C:\temp>dir c:\temp\output

 Directory of c:\temp\output

08/23/2016  06:56 PM    <DIR>          .
08/23/2016  06:56 PM    <DIR>          ..
08/30/2016  05:57 PM            80,982 CumulativeTermCounts-0.csv
08/30/2016  05:57 PM        11,952,730 DocFreqTable-0.csv
08/30/2016  05:57 PM            13,289 DocumentLengthHistogram.csv
08/30/2016  05:57 PM         4,570,472 IndexedIdfTable-0.bin
08/30/2016  05:58 PM         5,630,804 TermTable-0.bin
08/30/2016  05:57 PM         7,442,262 TermToText.bin
               6 File(s)
               2 Dir(s)

We’ve now configured our system for a typical corpus with documents similar to those in the chunk files listed in the original manifest. In the next step we will ingest some files and look at the resulting row table values.

Ingesting a Small Corpus

Now we’re getting to the fun part. IngestAndQuery is a sample application that provides an interactive Read-Eval-Print loop for ingesting documents, running queries, and inspecting various data structures.

Here’s the help message:

C:\temp>IngestAndQuery.exe
Expected <path>.
Use -help for more information.

Ingest documents and compute statistics about them.

Usage:
IngestAndQuery.exe <path>
                   [-help]
                   [-gramsize <integer>]
                   [-threads <integer>]

<path>
    Path to a tmp directory.
    Something like /tmp/ or c:\temp\,
    depending on platform.. (string)

[-help]
    Display help for this program.
    (boolean, defaults to false)


[-gramsize <integer>]
    Set the maximum ngram size for phrases.
    (integer, defaults to 1)

[-threads <integer>]
    Set the thread count for ingestion
    and query processing.
    (integer, defaults to 1)

The first parameter is the path to the directory with the configuration files. In my case this is c:\temp\output. The gramsize should be the same value used in the StatisticsBuilder stage.

When you start the application, it prints out a welcome message, explains how to get help, and then prompts for input. The prompt is an integer followed by a colon. Type “help” to get a list of commands. You can also get help on a specific command:

C:\temp>IngestAndQuery.exe c:\temp\output
Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.

directory = "c:\temp\output"
gram size = 1

Starting index ...
Blocksize: 11005320
Index started successfully.

Type "help" to get started.

0: help
Available commands:
  cache   Ingests documents into the index and also stores them in a cache
  delay   Prints a message after certain number of seconds
  help    Displays a list of available commands.
  load    Ingests documents into the index
  query   Process a single query or list of queries. (TODO)
  quit    waits for all current tasks to complete then exits.
  script  Runs commands from a file.(TODO)
  show    Shows information about various data structures. (TODO)
  status  Prints system status.

Type "help <command>" for more information on a particular command.

1: help load
load (manifest | chunk) <path>
  Ingests a single chunk file or a list of chunk
    files specified by a manifest.

2: help show
show cache <term>
  | rows <term> [<docstart> <docend>]
  | term <term>
    Shows information about various data structures.  PARTIALLY IMPLEMENTED

Right now the ingest command doesn’t support the manifest option, and the show command only supports the rows and term options.

In the following run, I ingest a single chunk file and then examine rows for various terms:

0: ingest chunk c:\temp\Wikipedia\AA\wiki_00
Ingesting chunk file "c:\temp\Wikipedia\AA\wiki_00"
ChunkTaskProcessor::ProcessTask: filePath:c:\temp\Wikipedia\AA\wiki_00
Ingestion complete.
4: show rows also
Term("also")
  RowId(0, 1005): 1
5: show rows some
Term("some")
  RowId(0, 1011): 1
6: show rows wings
Term("wings")
  RowId(3, 6683): 0
  RowId(3, 6684): 0
  RowId(3, 6685): 0
  RowId(0, 4555): 0
7: show rows anarchy
Term("anarchy")
  RowId(3, 18536): 1
  RowId(3, 18537): 1
  RowId(3, 18538): 1
  RowId(0, 5356): 1
8: show rows kingdom
Term("kingdom")
  RowId(0, 1609): 1

At prompt 0, I requested that the system ingest a single chunk file. Once the ingestion finishes, I use the “show rows” command at prompts 4 through 8 to examine the rows associated with various terms.

Today, the “show rows” command displays only the first bit of each row associated with the term. In this case, the first bit happens to correspond to the Wikipedia article on Anarchy.

As we can see, the article contains the word “also”. We know this because the row associated with “also” has a 1 in its first column. The fact that the word is associated with only a single row indicates that it is fairly common in the corpus. A search of the Wikipedia page shows that “also” appears 14 times.

The word “some” is also common in the corpus, and a search of the page turns up 9 occurrences.

The term “wings”, on the other hand, doesn’t appear in the document. We can see this because its rows have a 0 in the first column; in fact, every one of its rows starts with 0, although a single 0 in any row would have been enough to rule the document out. When every row holds a 1, the term is most likely present, but there remains a small chance of a false positive. Note also that “wings” is less common in the corpus, so it needs more rows (the four shown above) to drive that noise down to an acceptable level.

Although the term “anarchy” is common in the document, appearing 51 times, it is actually quite rare in the corpus, requiring the four rows shown above.
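
To make the rule explicit in code, here is a small sketch of the membership test the “show rows” output illustrates; it is not BitFunnel’s actual query pipeline, and the row values in main are just the bits printed in the session above.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// A small sketch of the membership test, with each row modeled as one byte
// per document column. A document can contain a term only if every row
// assigned to the term has a 1 in the document's column.
bool MightContainTerm(const std::vector<std::vector<uint8_t>>& rowsForTerm,
                      std::size_t documentColumn)
{
    for (const auto& row : rowsForTerm)
    {
        if (row[documentColumn] == 0)
        {
            return false;   // A single 0 proves the term is absent.
        }
    }
    return true;            // All 1s: present, or occasionally a false positive.
}

int main()
{
    // Column 0 models the article on Anarchy; "wings" has a 0 in every row,
    // "anarchy" has a 1 in every row (values from the session above).
    std::vector<std::vector<uint8_t>> wings   = { { 0 }, { 0 }, { 0 }, { 0 } };
    std::vector<std::vector<uint8_t>> anarchy = { { 1 }, { 1 }, { 1 }, { 1 } };

    std::printf("wings: %d, anarchy: %d\n",
                MightContainTerm(wings, 0), MightContainTerm(anarchy, 0));
    return 0;
}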

Well that’s enough for now. Hopefully this walkthrough will help you get started with BitFunnel configuration.

Michael Hopcroft
An 18-year veteran at Microsoft, Mike has worked on Office, Windows, and Visual Studio. He has spent the past 6 years developing cloud-scale infrastructure for the Bing search engine. He is a founding member of BitFunnel.