Sample Data
I’ve been trying to make it really easy to get started with BitFunnel, but we still have a ways to go. From the beginning we put a lot of effort into ensuring our code would build and run on Linux, OSX, and Windows, and we set up CI on Appveyor and Travis to help us quickly spot breaks on any OS. This has kept the build in good shape, but it seems that the system is still hard to configure and run, especially for those who don’t use it on a day-to-day basis.
After some brainstorming, we decided it would be helpful to make a sample corpus with all necessary configuration files available for download so that new users and contributors could get the system up and running with just a few steps.
The sample corpus consists of about 17k pages from the English version of Wikipedia. This small slice of Wikipedia is manageable, yet large enough to demonstrate interesting aspects of BitFunnel. Here are the download links:
The first file is for reference and is not needed unless you want to reprocess the entire corpus yourself from scratch.
In most cases it suffices to download the second link, which contains the files necessary to run the BitFunnel index Read-Eval-Print-Loop (REPL). This download contains:
- wikiextractor text output (265MB uncompressed)
- the corresponding BitFunnel chunk files (208MB uncompressed)
- corpus statistics and configuration files (51MB uncompressed)
Downloading and Extracting Chunk Files
You can download these files directly from your browser, or on Linux or OSX use the `wget` and `tar` commands.
% cd /tmp
% wget https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz
--2016-09-18 21:15:11-- https://bitfunnel.blob.core.windows.net/sample-data/Wikipedia/small-corpus.tar.gz
Resolving bitfunnel.blob.core.windows.net... 13.93.168.88
Connecting to bitfunnel.blob.core.windows.net|13.93.168.88|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 198148563 (189M) [application/octet-stream]
Saving to: 'small-corpus.tar.gz'
small-corpus.tar.gz 100%[==============================================>] 188.97M 1.67MB/s in 1m 49s
2016-09-18 21:17:00 (1.74 MB/s) - 'small-corpus.tar.gz' saved [198148563/198148563]
% tar -xvzf small-corpus.tar.gz
x chunks/
x chunks/AA/
x chunks/AA/wiki_00
x chunks/AA/wiki_01
x chunks/AA/wiki_02
...
x chunks/AC/wiki_56
x chunks/AC/wiki_57
x text/
x text/AA/
x text/AA/wiki_00
x text/AA/wiki_01
x text/AA/wiki_02
...
x text/AC/wiki_56
x text/AC/wiki_57
% ls -l wikipedia
total 0
drwxr-xr-x 5 michaelhopcroft wheel 170 Jul 29 21:40 chunks
drwxr-xr-x 8 michaelhopcroft wheel 272 Sep 18 16:09 config
drwxr-xr-x 5 michaelhopcroft wheel 170 Jul 29 21:34 text
Running the REPL
Once the files have been downloaded and uncompressed, we’re ready to run the REPL.
The REPL is a subcommand of the BitFunnel executable, which is located at `tools/BitFunnel/src` in the source tree. In the transcript below, I have set my path to point to the BitFunnel executable. The only required parameter is the path to the config directory that was created in the previous step.
% BitFunnel repl /tmp/wikipedia/config
Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.
directory = "/tmp/wikipedia/config"
gram size = 1
Starting index ...
Blocksize: 11005320
Index started successfully.
Type "help" to get started.
Once the REPL console has started, we will use the `cache chunk` command to ingest the documents from a single chunk file. `cache chunk` ingests documents like the `load chunk` command, but it also caches the IDocuments to assist in verifying the correctness of the BitFunnel matching engine.
0: cache chunk /tmp/wikipedia/chunks/AA/wiki_00
Ingesting chunk file "/tmp/wikipedia/chunks/AA/wiki_00"
Caching IDocuments for query verification.
ChunkManifestIngestor::IngestChunk: filePath = /tmp/wikipedia/chunks/AA/wiki_00
Ingestion complete.
At this point, we’ve ingested `/tmp/wikipedia/chunks/AA/wiki_00`, which contains the following 41 Wikipedia pages:
- 12: Anarchism
- 25: Autism
- 39: Albedo
- 128: Talk:Atlas Shrugged
- 290: A
- 295: User:AnonymousCoward
- 303: Alabama
- 305: Achilles
- 307: Abraham Lincoln
- 308: Aristotle
- 309: An American in Paris
- 316: Academy Award for Best Production Design
- 324: Academy Awards
- 330: Actrius
- 332: Animalia (book)
- 334: International Atomic Time
- 336: Altruism
- 339: Ayn Rand
- 340: Alain Connes
- 344: Allan Dwan
- 354: Talk:Algeria
- 358: Algeria
- 359: List of Atlas Shrugged characters
- 569: Anthropology
- 572: Agricultural science
- 573: Alchemy
- 579: Alien
- 580: Astronomer
- 582: Talk:Altruism/Archive 1
- 586: ASCII
- 590: Austin (disambiguation)
- 593: Animation
- 594: Apollo
- 595: Andre Agassi
- 597: Austroasiatic languages
- 599: Afroasiatic languages
- 600: Andorra
- 612: Arithmetic mean
- 615: American Football Conference
- 620: Animal Farm
- 621: Amphibian
Handy tip: if you’d like to know which pages are in a chunk file, run `grep` on the corresponding wikiextractor file. For example, if you are interested in knowing the contents of `/tmp/wikipedia/chunks/AA/wiki_00`, run grep on `/tmp/wikipedia/text/AA/wiki_00`:
grep "<doc id=" /tmp/wikipedia/text/AA/wiki_00
<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
<doc id="39" url="https://en.wikipedia.org/wiki?curid=39" title="Albedo">
...
<doc id="620" url="https://en.wikipedia.org/wiki?curid=620" title="Animal Farm">
<doc id="621" url="https://en.wikipedia.org/wiki?curid=621" title="Amphibian">
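As a variation on this tip, a short `sed` one-liner can turn those `<doc ...>` header lines into a clean "id: title" listing. The sketch below runs on two sample lines embedded in a here-doc so it is self-contained; in practice you would feed it a real file such as `/tmp/wikipedia/text/AA/wiki_00` instead.

```shell
# Turn wikiextractor <doc ...> header lines into "id: title" pairs.
# The here-doc stands in for a real file like /tmp/wikipedia/text/AA/wiki_00.
sed -n 's/.*<doc id="\([0-9]*\)".*title="\([^"]*\)".*/\1: \2/p' <<'EOF'
<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
EOF
```

This prints `12: Anarchism` and `25: Autism`, matching the listing format shown earlier.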
Let’s try running a query using the `verify one` command to verify an expression. Today this command runs a very slow verification query engine on the IDocuments cached earlier by the `cache chunk` command. In the future, `verify` will run the BitFunnel query engine and compare its output with the verification query engine.
1: verify one anarchy
Processing query " anarchy"
DocId(307)
DocId(12)
2 match(es) out of 41 documents.
2: verify one frog
Processing query " frog"
DocId(621)
1 match(es) out of 41 documents.
As we can see, documents 12 and 307 contain the word “anarchy” and document 621 contains the word “frog”. Try running `verify one frog|anarchy` and `verify one frog anarchy` (AND is implicit if OR isn’t specified). Did you get what you expected?
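For reference, the expected answers can be worked out from the matches we just saw: OR takes the union of each term’s matching documents, while the implicit AND takes the intersection. A quick Python sketch, with the per-term match sets copied from the transcripts above:

```python
# Per-term matches from the "verify one" runs above.
anarchy = {12, 307}
frog = {621}

# "frog|anarchy" ORs the terms: union of the match sets.
print(sorted(frog | anarchy))   # [12, 307, 621]

# "frog anarchy" ANDs the terms: intersection of the match sets.
print(sorted(frog & anarchy))   # [] -- no document contains both words
```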
We don’t have the BitFunnel query pipeline ported yet, but you can examine the rows associated with various terms using the `show rows` command. This command lists each of the RowIds associated with a term, followed by the bits for the first 64 documents. The document ids are printed vertically above each column of bits.
3: show rows anarchy
Term("anarchy")
d 00012233333333333333333555555555555566666
o 12329900000123333344555677788899999901122
c 25980535789640246904489923902603457902501
RowId(3, 11507): 10000000110000001000010000000000000000000
RowId(3, 11508): 10000000110000001000010000000000000000000
RowId(3, 11509): 10000000110000001000010000000000000000000
RowId(0, 5354): 11000001110000000000010001001000000000010
4: show rows frog
Term("frog")
d 00012233333333333333333555555555555566666
o 12329900000123333344555677788899999901122
c 25980535789640246904489923902603457902501
RowId(3, 19624): 00000010000010000100000000000000000000001
RowId(3, 19625): 00000010000010000100000000000000000000001
RowId(3, 19626): 00000010000010000000001000000000000000001
RowId(3, 19627): 00000010000010000000001000000000000000001
RowId(0, 5465): 10000011100010000000001100001001011000001
If we look at the output of `show rows anarchy`, we see that the first column, which corresponds to document 012, is completely filled with 1s, indicating a match. The second column, which corresponds to document 025, has some zeros, so it is not a match.
There are also some false positives visible in the data. We know from running `verify one anarchy` that only documents 012 and 307 should match, but the query matrix above shows all 1s in the columns for documents 308 and 358. Once we have finished porting the document ingestion and query processing pipelines, we will turn our attention to configuration changes that drive down the false positive rate.
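This false-positive reasoning can be reproduced mechanically: read the vertical document-id header, AND together the bits of every row for “anarchy”, and compare the surviving columns against the verified matches. A sketch, with the header and row bits copied verbatim from the `show rows anarchy` output:

```python
# Vertical document-id header from "show rows anarchy" (read top to bottom).
header = [
    "00012233333333333333333555555555555566666",
    "12329900000123333344555677788899999901122",
    "25980535789640246904489923902603457902501",
]
# One bit string per RowId, in the order printed above.
rows = [
    "10000000110000001000010000000000000000000",
    "10000000110000001000010000000000000000000",
    "10000000110000001000010000000000000000000",
    "11000001110000000000010001001000000000010",
]

# Column k's document id is formed by reading column k downward.
doc_ids = [int("".join(h[k] for h in header)) for k in range(len(header[0]))]

# A document matches only if every row has a 1 in its column.
matches = {doc_ids[k] for k in range(len(rows[0]))
           if all(r[k] == "1" for r in rows)}

verified = {12, 307}                 # from "verify one anarchy"
print(sorted(matches))               # [12, 307, 308, 358]
print(sorted(matches - verified))    # false positives: [308, 358]
```

Running this confirms the observation above: the AND of the four rows keeps documents 12, 307, 308, and 358, so 308 and 358 are false positives.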
The goal of this post is to explain how to obtain and use the data files, so the examples are minimal. To learn more about the BitFunnel REPL, statistics builder, and term table builder, see Index Build Tools.