Sample Data

Tue, Sep 20, 2016

I’ve been trying to make it really easy to get started with BitFunnel, but we still have a ways to go. From the beginning we put a lot of effort into ensuring our code would build and run on Linux, OSX, and Windows, and we set up CI on Appveyor and Travis to help us quickly spot breaks on any OS. This has kept the build in good shape, but it seems that the system is still hard to configure and run, especially for those who don’t use it on a day-to-day basis.

After some brainstorming, we decided it would be helpful to make a sample corpus with all necessary configuration files available for download so that new users and contributors could get the system up and running with just a few steps.

The sample corpus consists of about 17k pages from the English version of Wikipedia. This small slice of Wikipedia is manageable, yet large enough to demonstrate interesting aspects of BitFunnel. Here are the download links:

The first file is for reference and is not needed unless you want to reprocess the entire corpus yourself from scratch.

In most cases it suffices to download the second file, which contains everything needed to run the BitFunnel index Read-Eval-Print-Loop (REPL). This download contains:

wikiextractor text output (265MB uncompressed)
the corresponding BitFunnel chunk files (208MB uncompressed)
corpus statistics and configuration files (51MB uncompressed)

Downloading and Extracting Chunk Files

You can download these files directly from your browser, or on Linux or OSX use the wget and tar commands.

% cd /tmp

% wget https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz
--2016-09-18 21:15:11--  https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz
Resolving bitfunnel.blob.core.windows.net... 13.93.168.88
Connecting to bitfunnel.blob.core.windows.net|13.93.168.88|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 198148563 (189M) [application/octet-stream]
Saving to: 'small-corpus.tar.gz'

small-corpus.tar.gz          100%[==============================================>] 188.97M  1.67MB/s   in 1m 49s

2016-09-18 21:17:00 (1.74 MB/s) - 'small-corpus.tar.gz' saved [198148563/198148563]

% tar -xvzf small-corpus.tar.gz
x chunks/
x chunks/AA/
x chunks/AA/wiki_00
x chunks/AA/wiki_01
x chunks/AA/wiki_02
...
x chunks/AC/wiki_56
x chunks/AC/wiki_57
x text/
x text/AA/
x text/AA/wiki_00
x text/AA/wiki_01
x text/AA/wiki_02
...
x text/AC/wiki_56
x text/AC/wiki_57

% ls -l wikipedia
total 0
drwxr-xr-x  5 michaelhopcroft  wheel  170 Jul 29 21:40 chunks
drwxr-xr-x  8 michaelhopcroft  wheel  272 Sep 18 16:09 config
drwxr-xr-x  5 michaelhopcroft  wheel  170 Jul 29 21:34 text
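If wget and tar aren't available (on Windows, for example), the same two steps can be sketched with Python's standard library. This is only a sketch: the helper names download and extract are made up for illustration, and it assumes the URL from the wget transcript above is still live.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL copied from the wget transcript above (assumed to still be reachable).
CORPUS_URL = (
    "https://bitfunnel.blob.core.windows.net"
    "/sample-data/Wikipedia/small-corpus.tar.gz"
)


def download(url: str, dest_dir: Path) -> Path:
    """Download url into dest_dir and return the path of the saved file."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, archive)
    return archive


def extract(archive: Path, dest_dir: Path) -> list:
    """Unpack a .tar.gz archive into dest_dir and return its member names."""
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names
```

Calling extract(download(CORPUS_URL, Path("/tmp")), Path("/tmp")) would mirror the wget and tar commands in the transcript.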

Running the REPL

Once the files have been downloaded and uncompressed, we’re ready to run the REPL. The REPL is a subcommand of the BitFunnel executable, which is located at tools\BitFunnel\src in the source tree. In the transcript below, I have set my path to point to the BitFunnel executable. The only required parameter is the path to the config directory that was created in the previous step.

% BitFunnel repl /tmp/wikipedia/config
Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.

directory = "/tmp/wikipedia/config"
gram size = 1

Starting index ...
Blocksize: 11005320
Index started successfully.

Type "help" to get started.

Once the REPL console has started, we use the cache chunk command to ingest the documents from a single chunk file. Like the load chunk command, cache chunk ingests documents, but it also caches the IDocuments to assist in verifying the correctness of the BitFunnel matching engine.

0: cache chunk /tmp/wikipedia/chunks/AA/wiki_00
Ingesting chunk file "/tmp/wikipedia/chunks/AA/wiki_00"
Caching IDocuments for query verification.
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_00
Ingestion complete.

At this point, we’ve ingested /tmp/wikipedia/chunks/AA/wiki_00, which contains the following 41 Wikipedia pages:

12: Anarchism
25: Autism
39: Albedo
128: Talk:Atlas Shrugged
290: A
295: User:AnonymousCoward
303: Alabama
305: Achilles
307: Abraham Lincoln
308: Aristotle
309: An American in Paris
316: Academy Award for Best Production Design
324: Academy Awards
330: Actrius
332: Animalia (book)
334: International Atomic Time
336: Altruism
339: Ayn Rand
340: Alain Connes
344: Allan Dwan
354: Talk:Algeria
358: Algeria
359: List of Atlas Shrugged characters
569: Anthropology
572: Agricultural science
573: Alchemy
579: Alien
580: Astronomer
582: Talk:Altruism/Archive 1
586: ASCII
590: Austin (disambiguation)
593: Animation
594: Apollo
595: Andre Agassi
597: Austroasiatic languages
599: Afroasiatic languages
600: Andorra
612: Arithmetic mean
615: American Football Conference
620: Animal Farm
621: Amphibian

Handy tip: if you’d like to know which pages are in a chunk file, run grep on the corresponding wikiextractor file. For example, to see the contents of /tmp/wikipedia/chunks/AA/wiki_00, run grep on /tmp/wikipedia/text/AA/wiki_00:

% grep "<doc id=" /tmp/wikipedia/text/AA/wiki_00
<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
<doc id="39" url="https://en.wikipedia.org/wiki?curid=39" title="Albedo">
...
<doc id="620" url="https://en.wikipedia.org/wiki?curid=620" title="Animal Farm">
<doc id="621" url="https://en.wikipedia.org/wiki?curid=621" title="Amphibian">
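If you'd rather have the ids and titles in the same "id: title" form as the page list above, the header lines can be parsed with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the pattern assumes every header has the id, url, title attribute order shown in the grep output.

```python
import re
from pathlib import Path

# Matches header lines of the form shown in the grep output above, e.g.
# <doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
DOC_HEADER = re.compile(r'<doc id="(\d+)" url="[^"]*" title="([^"]*)">')


def list_pages(text):
    """Return (doc_id, title) pairs for every <doc ...> header in text."""
    return [(int(m.group(1)), m.group(2)) for m in DOC_HEADER.finditer(text)]


def list_pages_in_file(path):
    """Read a wikiextractor file and list the pages it contains."""
    return list_pages(Path(path).read_text(encoding="utf-8"))


# Two header lines copied from the grep output above.
sample = (
    '<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">\n'
    '<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">\n'
)
for doc_id, title in list_pages(sample):
    print(f"{doc_id}: {title}")
# prints:
# 12: Anarchism
# 25: Autism
```

For the real file, list_pages_in_file("/tmp/wikipedia/text/AA/wiki_00") would take the place of the sample string.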

Let’s try running a query using the verify one command to verify an expression. Today this command runs a very slow verification query engine on the IDocuments cached earlier by the cache chunk command. In the future, verify will run the BitFunnel query engine and compare its output with the verification query engine.

1: verify one anarchy
Processing query " anarchy"
  DocId(307)
  DocId(12)
2 match(es) out of 41 documents.

2: verify one frog
Processing query " frog"
  DocId(621)
1 match(es) out of 41 documents.

As we can see, documents 12 and 307 contain the word “anarchy” and document 621 contains the word “frog”. Try running verify one frog|anarchy and verify one frog anarchy (AND is implicit if OR isn’t specified). Did you get what you expected?
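To check your expectations, it may help to think of OR and AND as set union and intersection over the matches for each term. The sketch below uses only the two results from the transcripts above. It isn't the verifier's implementation, and match_or and match_and are hypothetical helper names.

```python
# Matches reported by the two `verify one` transcripts above.
matches = {
    "anarchy": {12, 307},
    "frog": {621},
}


def match_or(first, *rest):
    """Documents containing at least one of the terms (set union)."""
    return matches[first].union(*(matches[t] for t in rest))


def match_and(first, *rest):
    """Documents containing every one of the terms (set intersection)."""
    return matches[first].intersection(*(matches[t] for t in rest))


print(sorted(match_or("frog", "anarchy")))   # [12, 307, 621]
print(sorted(match_and("frog", "anarchy")))  # []
```

Under these set semantics, the OR query should return three documents and the AND query none, since no page among the 41 contains both words.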

We don’t have the BitFunnel query pipeline ported yet, but you can examine the rows associated with various terms using the show rows command. This command lists each of the RowIds associated with a term, followed by the bits for the first 64 documents. The document ids are printed vertically above each column of bits.

3: show rows anarchy
Term("anarchy")
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(3, 11507): 10000000110000001000010000000000000000000
  RowId(3, 11508): 10000000110000001000010000000000000000000
  RowId(3, 11509): 10000000110000001000010000000000000000000
  RowId(0,  5354): 11000001110000000000010001001000000000010

4: show rows frog
Term("frog")
                 d 00012233333333333333333555555555555566666
                 o 12329900000123333344555677788899999901122
                 c 25980535789640246904489923902603457902501
  RowId(3, 19624): 00000010000010000100000000000000000000001
  RowId(3, 19625): 00000010000010000100000000000000000000001
  RowId(3, 19626): 00000010000010000000001000000000000000001
  RowId(3, 19627): 00000010000010000000001000000000000000001
  RowId(0,  5465): 10000011100010000000001100001001011000001

If we look at the output of show rows anarchy, we see that the first column, which corresponds to document 012, is completely filled with 1s, indicating a match. The second column, which corresponds to document 025, has some zeros, so it is not a match.

There are also some false positives visible in the data. We know from running verify one anarchy that only documents 012 and 307 should match, but the query matrix above shows all 1s in the columns for documents 308 and 358. Once we have finished porting the document ingestion and query processing pipelines, we will turn our attention to configuration changes that drive down the false positive rate.
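Reading columns by eye gets tedious, so here is a short Python sketch that does the same thing to the show rows anarchy output. It assumes, as described above, that a document is a candidate match when its column is 1 in every row for the term. The header and row strings are copied from the transcript, and the function names are made up for illustration.

```python
# Vertical document-id header and rows from the `show rows anarchy` transcript.
HEADER = [
    "00012233333333333333333555555555555566666",
    "12329900000123333344555677788899999901122",
    "25980535789640246904489923902603457902501",
]
ANARCHY_ROWS = [
    "10000000110000001000010000000000000000000",  # RowId(3, 11507)
    "10000000110000001000010000000000000000000",  # RowId(3, 11508)
    "10000000110000001000010000000000000000000",  # RowId(3, 11509)
    "11000001110000000000010001001000000000010",  # RowId(0,  5354)
]


def column_doc_ids(header):
    """Read each column of the header top to bottom as a document id."""
    return [int("".join(digits)) for digits in zip(*header)]


def candidate_matches(header, rows):
    """Return the ids of documents whose column is 1 in every row."""
    doc_ids = column_doc_ids(header)
    return [
        doc_id
        for doc_id, bits in zip(doc_ids, zip(*rows))
        if all(bit == "1" for bit in bits)
    ]


candidates = candidate_matches(HEADER, ANARCHY_ROWS)
verified = {12, 307}  # from `verify one anarchy`
false_positives = [d for d in candidates if d not in verified]

print(candidates)       # [12, 307, 308, 358]
print(false_positives)  # [308, 358]
```

Swapping in the rows from show rows frog works the same way.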

The goal of this post is to explain how to obtain and use the data files, so the examples are minimal. To learn more about the BitFunnel repl, statistics builder, and term table builder, see Index Build Tools.