Wikipedia as test corpus for BitFunnel

Wikipedia is a great test corpus for search engines. It is free and easy to obtain, it carries a license appropriate for research, and at ~59GB uncompressed, it is large, but not too large to fit on a reasonably sized server. For those with extremely fast reflexes, even user data¹ is sometimes available.

Wikipedia is also probably more representative of common search use cases than many other corpora: because it is edited by amateurs, it is a more pedestrian dataset. This likely makes it more relevant to many realistic applications of search, particularly those that consist mostly of amateur-generated data (such as consumer web search and corporate document search).

All of this makes Wikipedia a sensible baseline dataset for testing BitFunnel. To facilitate this, we have released a pre-processed version of the 2016-10-20 dump of Wikipedia, so that it is trivial to ingest into BitFunnel.

In this post, we will look at (1) how to obtain this pre-processed Wikipedia data, (2) how to ingest it into a running BitFunnel instance, (3) how to get hold of the intermediate processing files so that you can audit the chunk files and make sure they’re correct, and (4) some simple statistics about the corpus.

Obtaining the corpus

The 2016-10-20 dump of Wikipedia is divided into 27 compressed XML files. We have transformed each of these XML files into BitFunnel’s custom chunk file format to make ingestion fast and painless. (See the blog post that introduces the chunk format.)

Each of these 27 dump files generates many chunk files. These “segments” of chunk files can be found at URLs following the pattern in the code block below; to download one of the 27 segments, simply replace ${1} with the number of the segment you’d like. (The segment numbers start at 1 and end at 27, inclusive.)

https://bitfunnel.blob.core.windows.net/wiki-data/chunked/enwiki-20161020-chunked${1}.tar.gz

So, for example, if you want to obtain segment 1, and you are running on Linux, you might run something like this:

wget https://bitfunnel.blob.core.windows.net/wiki-data/chunked/enwiki-20161020-chunked1.tar.gz

Alternatively, you can paste that URL into a browser.
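
To fetch all 27 segments rather than one, the URL pattern can be wrapped in a small loop. What follows is a minimal sketch, assuming a POSIX shell with wget installed; the helper names chunk_url and download_all_chunks are made up for illustration and are not part of BitFunnel.

```shell
# Base URL taken from the pattern given in this post.
BASE_URL="https://bitfunnel.blob.core.windows.net/wiki-data/chunked"

# chunk_url N: print the URL of chunked segment N (1 through 27).
# (Hypothetical helper, for illustration only.)
chunk_url() {
    printf '%s/enwiki-20161020-chunked%s.tar.gz\n' "$BASE_URL" "$1"
}

# download_all_chunks: fetch every segment into the current directory,
# stopping at the first failed download.
download_all_chunks() {
    i=1
    while [ "$i" -le 27 ]; do
        wget "$(chunk_url "$i")" || return 1
        i=$((i + 1))
    done
}

# Print one URL as a sanity check. Call download_all_chunks to fetch everything.
chunk_url 1
```

Downloading is left as an explicit call because the full chunked corpus is about 3.6GB compressed.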

Ingesting the corpus

There are a few ways to ingest chunk files. Probably the easiest is to use the REPL, which is what we will do in this section.

We explain a bit of the background of how BitFunnel can be configured in the index build tools post. For now, it is sufficient to download the gzip’d configuration files generated from the same Wikipedia dump.

From there, you can run something like:

$ BitFunnel repl /path/to/unzipped/config/directory
Welcome to BitFunnel!
Starting 1 thread
(plus one extra thread for the Recycler.

directory = "/path/to/unzipped/config/directory"
gram size = 1

Starting index ...
Blocksize: [... your size here ...]
Index started successfully.

Type "help" to get started.

0: cache chunk /path/to/chunk/file
1: verify one wings
[... results go here ...]

Once the REPL has started, we run two commands. cache chunk ingests a chunk file, and verify one queries the data in the index and verifies that the document matches are correct.
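
A downloaded segment contains many chunk files, so typing one cache chunk command per file gets tedious. The sketch below prints a cache chunk line for every regular file under a directory, ready to paste into the REPL. It assumes that every file in the unpacked segment is a chunk file, which this post does not confirm; print_cache_commands is a hypothetical helper, not a BitFunnel tool.

```shell
# print_cache_commands DIR: print one "cache chunk <path>" line for each
# regular file under DIR, in sorted order.
# Assumption: every file under DIR is a chunk file.
# (Hypothetical helper, for illustration only.)
print_cache_commands() {
    find "$1" -type f | sort | while IFS= read -r path; do
        printf 'cache chunk %s\n' "$path"
    done
}

# Example usage, after downloading segment 1:
#   mkdir -p chunks
#   tar -xzf enwiki-20161020-chunked1.tar.gz -C chunks
#   print_cache_commands "$PWD/chunks"
```

Passing an absolute directory, as in the usage comment, yields absolute paths, so the commands work regardless of where the REPL was started.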

Auditing the data

In the post on BitFunnel’s corpus file format, we write about the steps needed to convert the Wikipedia dump files into BitFunnel chunk files. This process is also specified in the README of BitFunnel’s Workbench project.

In brief, there are 3 steps: (1) download the Wikipedia XML dumps, (2) use WikiExtractor to filter out the Wikipedia markup, and (3) convert the “extracted” text to chunk files using BitFunnel’s Workbench project.

In order to allow people to audit the chunk files, we are also hosting both the raw XML dump files for the 2016-10-20 articles-only dump of Wikipedia, and the markup-filtered files we generated with WikiExtractor. (We provide the Wikipedia dump because eventually Wikipedia will stop offering it in its archive.)

Wikipedia generated 27 dump files, so there are 27 extracted files and 27 chunk directories. Just as we had a URL pattern for the chunk segments, we have one for the raw Wikipedia dump files and one for the extracted dump files, respectively:

https://bitfunnel.blob.core.windows.net/wiki-data/raw/enwiki-20161020-pages-articles${1}.tar.gz

https://bitfunnel.blob.core.windows.net/wiki-data/extracted/enwiki-20161020-extracted${1}.tar.gz

As with the chunk segments, simply replace ${1} with a number from 1 to 27 (inclusive) to receive the corresponding dump file.
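
Since the three URL patterns differ only in the directory and the file-name infix, a single helper can cover all of them. This is a sketch under that reading of the patterns; segment_url and its stage names (raw, extracted, chunked) are hypothetical and chosen here to mirror the URL paths.

```shell
BASE_URL="https://bitfunnel.blob.core.windows.net/wiki-data"

# segment_url STAGE N: print the URL of file N (1 through 27) at the given
# processing stage: raw, extracted, or chunked.
# (Hypothetical helper, for illustration only.)
segment_url() {
    case "$1" in
        raw)       name="pages-articles" ;;
        extracted) name="extracted" ;;
        chunked)   name="chunked" ;;
        *) echo "unknown stage: $1" >&2; return 1 ;;
    esac
    printf '%s/%s/enwiki-20161020-%s%s.tar.gz\n' "$BASE_URL" "$1" "$name" "$2"
}

# Print the URLs of all three stages for dump file 5.
for stage in raw extracted chunked; do
    segment_url "$stage" 5
done
```

Fetching the same number at all three stages gives you every intermediate form of one dump file, which is the natural unit for an audit.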

From here, you can inspect the data, or use Workbench and WikiExtractor to generate your own data and compare.
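
One simple way to compare is a recursive diff between a directory you generated and the corresponding unpacked download. This is only a sketch: it assumes both trees have the same layout, and it treats any byte-level difference as a mismatch, which may be too strict if your tool versions differ from the ones we used. The helper name compare_dirs is hypothetical.

```shell
# compare_dirs A B: report whether two directory trees have identical
# contents. diff -rq lists each differing or missing file.
# (Hypothetical helper, for illustration only.)
compare_dirs() {
    if diff -rq "$1" "$2"; then
        echo "identical"
    else
        echo "differences found"
        return 1
    fi
}

# Example usage (both paths are placeholders):
#   compare_dirs /path/to/my/chunks /path/to/downloaded/chunks
```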

Statistics

Finally, just as a point of reference, here are some statistics relating to the size of the corpus in various states of processing:

Raw corpus: total size is ~16.7GB when compressed with gzip, and ~59.2GB when uncompressed.
WikiExtractor’d corpus: total size is ~4.8GB when compressed with gzip, and ~13.2GB when uncompressed.
Chunked corpus: total size is ~3.6GB when compressed with gzip, and ~10.2GB when uncompressed.
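
For a rough sense of how well each stage compresses, the ratio of uncompressed to compressed size can be computed directly from these figures. The sketch below uses awk for the floating-point division; the inputs are the approximate sizes listed above, so the results are approximate too. The helper name ratio is hypothetical.

```shell
# ratio UNCOMPRESSED COMPRESSED: print the compression ratio to two decimals.
# LC_ALL=C keeps the decimal point a "." regardless of the user's locale.
# (Hypothetical helper, for illustration only.)
ratio() {
    LC_ALL=C awk -v u="$1" -v c="$2" 'BEGIN { printf "%.2f\n", u / c }'
}

ratio 59.2 16.7   # raw corpus:             prints 3.54
ratio 13.2 4.8    # WikiExtractor'd corpus: prints 2.75
ratio 10.2 3.6    # chunked corpus:         prints 2.83
```

By these numbers the raw XML compresses somewhat better than the two processed forms.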

¹ If you were paying attention for exactly one day in 2012, you could obtain user query logs. These are particularly useful because they allow you to construct realistic synthetic workloads.

Lately this is quite a rare affordance. Probably every company to release significant user data has been burned (see, for example, AOL’s search query log scandal and Netflix’s recommender scandal), which makes other companies reluctant to take the plunge.

But, as much as information retrieval researchers would like to use such data to improve search systems, there are also very good theoretical reasons (see this paper, for example: [PDF]) to believe that releasing it correctly is, at the very least, very difficult. Difficult enough that it may never be worth the risk.