I’ve been trying to make it really easy to get started with BitFunnel, but we still have a ways to go. From the beginning we put a lot of effort into ensuring our code would build and run on Linux, OSX, and Windows, and we set up CI on AppVeyor and Travis to help us quickly spot breaks on any OS. This has kept the build in good shape, but the system is still hard to configure and run, especially for those who don’t use it on a day-to-day basis.
After some brainstorming, we decided it would be helpful to make a sample corpus with all necessary configuration files available for download so that new users and contributors could get the system up and running with just a few steps.
The sample corpus consists of about 17k pages from the English version of Wikipedia. This small slice of Wikipedia is manageable, yet large enough to demonstrate interesting aspects of BitFunnel. Here are the download links:
The first file is for reference and is not needed unless you want to reprocess the entire corpus yourself from scratch.
In most cases it suffices to download the second file, which contains everything necessary to run the BitFunnel index Read-Eval-Print-Loop (REPL). This download contains:
- wikiextractor text output (265MB uncompressed)
- the corresponding BitFunnel chunk files (208MB uncompressed)
- corpus statistics and configuration files (51MB uncompressed)
Downloading and Extracting Chunk Files
You can download these files directly from your browser, or on Linux or OSX use `wget` from the command line:
```
% cd /tmp
% wget https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz
--2016-09-18 21:15:11--  https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz
Resolving bitfunnel.blob.core.windows.net... 188.8.131.52
Connecting to bitfunnel.blob.core.windows.net|184.108.40.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 198148563 (189M) [application/octet-stream]
Saving to: 'small-corpus.tar.gz'

small-corpus.tar.gz 100%[==============================================>] 188.97M  1.67MB/s  in 1m 49s

2016-09-18 21:17:00 (1.74 MB/s) - 'small-corpus.tar.gz' saved [198148563/198148563]

% tar -xvzf small-corpus.tar.gz
x chunks/
x chunks/AA/
x chunks/AA/wiki_00
x chunks/AA/wiki_01
x chunks/AA/wiki_02
...
x chunks/AC/wiki_56
x chunks/AC/wiki_57
x text/
x text/AA/
x text/AA/wiki_00
x text/AA/wiki_01
x text/AA/wiki_02
...
x text/AC/wiki_56
x text/AC/wiki_57

% ls -l wikipedia
total 0
drwxr-xr-x  5 michaelhopcroft  wheel  170 Jul 29 21:40 chunks
drwxr-xr-x  8 michaelhopcroft  wheel  272 Sep 18 16:09 config
drwxr-xr-x  5 michaelhopcroft  wheel  170 Jul 29 21:34 text
```
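If you prefer to script the extraction step, it can be mirrored in a few lines of Python. This is a hypothetical convenience helper, not part of BitFunnel; it just wraps the standard library’s `tarfile` module:

```python
import os
import tarfile

# Hypothetical helper (not part of BitFunnel): unpack the sample-corpus
# archive into a destination directory and return the top-level entries.
def extract_corpus(archive_path, dest):
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(os.listdir(dest))
```

After fetching `small-corpus.tar.gz` with `wget` or a browser, a call like `extract_corpus("small-corpus.tar.gz", "/tmp/wikipedia")` unpacks the archive in one step.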
Running the REPL
Once the files have been downloaded and uncompressed, we’re ready to run the REPL.
The REPL is a subcommand of the `BitFunnel` executable, which is located in the source tree. In the transcript below, I have set my path to point to the `BitFunnel` executable. The only required parameter is the path to the config directory that was created in the previous step.
```
% BitFunnel repl /tmp/wikipedia/config
Welcome to BitFunnel!
Starting 1 thread (plus one extra thread for the Recycler.
directory = "/tmp/wikipedia/config"
gram size = 1

Starting index ...
Blocksize: 11005320
Index started successfully.

Type "help" to get started.
```
Once the REPL console has started, we use the `cache chunk` command to ingest the documents from a single chunk file. `cache chunk` ingests documents like the `load chunk` command, but it also caches the IDocuments to assist in verifying the correctness of the BitFunnel matching engine.
```
0: cache chunk /tmp/wikipedia/chunks/AA/wiki_00
Ingesting chunk file "/tmp/wikipedia/chunks/AA/wiki_00"
Caching IDocuments for query verification.
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_00
Ingestion complete.
```
At this point, we’ve ingested `/tmp/wikipedia/chunks/AA/wiki_00`, which contains the following 41 Wikipedia pages:
- 12: Anarchism
- 25: Autism
- 39: Albedo
- 128: Talk:Atlas Shrugged
- 290: A
- 295: User:AnonymousCoward
- 303: Alabama
- 305: Achilles
- 307: Abraham Lincoln
- 308: Aristotle
- 309: An American in Paris
- 316: Academy Award for Best Production Design
- 324: Academy Awards
- 330: Actrius
- 332: Animalia (book)
- 334: International Atomic Time
- 336: Altruism
- 339: Ayn Rand
- 340: Alain Connes
- 344: Allan Dwan
- 354: Talk:Algeria
- 358: Algeria
- 359: List of Atlas Shrugged characters
- 569: Anthropology
- 572: Agricultural science
- 573: Alchemy
- 579: Alien
- 580: Astronomer
- 582: Talk:Altruism/Archive 1
- 586: ASCII
- 590: Austin (disambiguation)
- 593: Animation
- 594: Apollo
- 595: Andre Agassi
- 597: Austroasiatic languages
- 599: Afroasiatic languages
- 600: Andorra
- 612: Arithmetic mean
- 615: American Football Conference
- 620: Animal Farm
- 621: Amphibian
Handy tip: if you’d like to know which pages are in a chunk file, grep the corresponding wikiextractor file. For example, to see the contents of `chunks/AA/wiki_00`, run grep on `text/AA/wiki_00`:
```
% grep "<doc id=" /tmp/wikipedia/text/AA/wiki_00
<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
<doc id="39" url="https://en.wikipedia.org/wiki?curid=39" title="Albedo">
...
<doc id="620" url="https://en.wikipedia.org/wiki?curid=620" title="Animal Farm">
<doc id="621" url="https://en.wikipedia.org/wiki?curid=621" title="Amphibian">
```
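The same listing can be done in a few lines of Python. This is a hypothetical snippet for illustration, not part of the BitFunnel tooling; it assumes the `<doc id="..." ... title="...">` header format wikiextractor produces, as shown in the transcript above:

```python
import re

# Hypothetical helper (not part of BitFunnel): list the page ids and
# titles in a wikiextractor output file by scanning its <doc ...> headers.
DOC_HEADER = re.compile(r'<doc id="(\d+)"[^>]*title="([^"]*)"')

def list_pages(path):
    pages = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = DOC_HEADER.search(line)
            if m:
                pages.append((int(m.group(1)), m.group(2)))
    return pages
```

For example, `list_pages("/tmp/wikipedia/text/AA/wiki_00")` should return the same 41 `(id, title)` pairs shown in the list above.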
Let’s try running a query using the `verify one` command to verify an expression. Today this command runs a very slow verification query engine on the IDocuments cached earlier by the `cache chunk` command. In the future, `verify one` will run the BitFunnel query engine and compare its output with the verification results.
```
1: verify one anarchy
Processing query " anarchy"
DocId(307)
DocId(12)

2 match(es) out of 41 documents.

2: verify one frog
Processing query " frog"
DocId(621)

1 match(es) out of 41 documents.
```
As we can see, documents 12 and 307 contain the word “anarchy” and document 621 contains the word “frog”. Try running `verify one frog|anarchy` and `verify one frog anarchy` (AND is implicit if OR isn’t specified). Did you get what you expected?
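To make the AND/OR semantics concrete, here is a toy sketch in Python. This is not BitFunnel code; it simply models each document as a set of terms, ANDs space-separated tokens, and ORs alternatives joined by `|`:

```python
# Toy model of the REPL's query semantics (an illustration, not BitFunnel
# code): terms separated by spaces are ANDed; "|" joins OR alternatives.
def matches(query, doc_terms):
    # Every space-separated token must be satisfied; a token like "a|b"
    # is satisfied if any of its alternatives is present in the document.
    for token in query.split():
        if not any(alt in doc_terms for alt in token.split("|")):
            return False
    return True

# Invented term sets, loosely based on the example documents above.
docs = {
    12:  {"anarchy", "anarchism"},
    307: {"anarchy", "lincoln"},
    621: {"frog", "amphibian"},
}

print([d for d, t in docs.items() if matches("frog|anarchy", t)])  # → [12, 307, 621]
print([d for d, t in docs.items() if matches("frog anarchy", t)])  # → []
```

The OR query matches every document containing either word, while the implicit-AND query matches none, since no document here contains both “frog” and “anarchy”.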
We don’t have the BitFunnel query pipeline ported yet, but you can examine the rows associated with various terms using the `show rows` command. This command lists each of the RowIds associated with a term, followed by the bits for the first 64 documents. The document ids are printed vertically above each column of bits.
```
3: show rows anarchy
Term("anarchy")
d 00012233333333333333333555555555555566666
o 12329900000123333344555677788899999901122
c 25980535789640246904489923902603457902501
RowId(3, 11507): 10000000110000001000010000000000000000000
RowId(3, 11508): 10000000110000001000010000000000000000000
RowId(3, 11509): 10000000110000001000010000000000000000000
RowId(0, 5354): 11000001110000000000010001001000000000010

4: show rows frog
Term("frog")
d 00012233333333333333333555555555555566666
o 12329900000123333344555677788899999901122
c 25980535789640246904489923902603457902501
RowId(3, 19624): 00000010000010000100000000000000000000001
RowId(3, 19625): 00000010000010000100000000000000000000001
RowId(3, 19626): 00000010000010000000001000000000000000001
RowId(3, 19627): 00000010000010000000001000000000000000001
RowId(0, 5465): 10000011100010000000001100001001011000001
```
If we look at the output of `show rows anarchy`, we see that the first column, which corresponds to document 012, is completely filled with 1s, indicating a match. The second column, which corresponds to document 025, has some zeros, so it is not a match.
There are also some false positives visible in the data. We know from running `verify one anarchy` that only documents 012 and 307 should match, but the query matrix above shows all 1s in the columns for documents 308 and 358. Once we have finished porting the document ingestion and query processing pipelines, we will turn our attention to configuration changes that drive down the false positive rate.
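The false-positive mechanism can be illustrated with a toy model. This is only a sketch of the idea, not the real index layout: each term maps to a handful of rows of document bits, and a document’s column counts as a match when every one of those rows holds a 1. Because rows are shared among many terms, a column can be all 1s even though the document never contained the term:

```python
# Toy model of bit-sliced row matching (an illustration, not the real
# BitFunnel index). A document column matches a term when all of the
# term's rows have a 1 in that column. Since rows are shared across
# terms, a column can be all 1s without the term actually appearing in
# that document -- a false positive.
def match_columns(rows):
    width = len(rows[0])
    return [col for col in range(width)
            if all(row[col] == "1" for row in rows)]

# Invented rows for one term; columns are documents A, B, C, D.
rows_for_term = [
    "1011",  # this row is shared with other terms
    "1001",
    "1011",
]
print(match_columns(rows_for_term))  # → [0, 3]
```

Here only document A truly contains the term, but document D’s column is all 1s because other terms set those shared bits, so it matches too. This is exactly why `verify one` is useful: it checks the matcher’s candidates against the cached IDocuments.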
The goal of this post is to explain how to obtain and use the data files, so the examples here are minimal. To learn more about the BitFunnel REPL, the statistics builder, and the term table builder, see Index Build Tools.