Corpus File Format
One of the challenges in making BitFunnel relevant to the open source community is removing Bing-specific functionality that has deep dependencies on the internals of the rest of Bing’s web crawling and index serving infrastructure.
As I mentioned in my first post, we plan to start with an empty repository and bring over BitFunnel modules one by one. We are essentially bootstrapping the BitFunnel project, and this process will require a new test corpus, a set of performance benchmarks, and some system to help verify correctness.
We’d love to use a corpus from Bing, but this approach has some problems, the first of which is the document file format.
Today Bing uses a file format called IdfDocument to represent all of the information we know about a document. It comes as no surprise that an IdfDocument file contains broadly applicable information like the text from the document body and title streams. The problem is that it also contains a ton of other information, some of which may be interesting, but most of which is only applicable to the idiosyncrasies of the Bing crawling, indexing, and relevance algorithms. Some examples of this other data include information about URLs users have clicked to reach this document, various static scores relating to Bing’s notion of document importance, fields indicating whether the document is likely to be porn or spam, and tags for geographical region and language.
Removing the Bing idiosyncrasies from IdfDocument seems hard, and even if we did this work, there’s still the issue of the test corpus. We would like to make the test corpus publicly available under a permissive license so that anyone can reproduce our results and build upon our learnings.
Since the Bing web crawl corpus is unavailable to us for this purpose, our plan is to start with an English language Wikipedia dump and then process it with Lucene to perform the word breaking and stemming that had previously been handled by portions of the Bing ingestion pipeline upstream of BitFunnel.
This processed Wikipedia dump will serve as a baseline for metrics tracking index size, document ingestion rate, and query processing rate. Since there is really no way to do an apples-to-apples comparison with Bing, we will use Lucene as our oracle and comparison system. This will allow us to spot correctness issues early on and characterize BitFunnel performance in a context that is more meaningful to the open source community. It is our intention that our experiments will be well-documented and committed to the repository so that others can reproduce our results down the road.
Corpus File Format
We decided to invent our own file format to use as input to both the Lucene-based reference index and the new BitFunnel index. Here are our top two goals:
The format should be trivial to process. Both the reference index, based on Lucene, and the BitFunnel index will need to process this format, so we will have to write ingestion code in both Java and C++. Using a trivial format will simplify the task of developing the ingestion code and make it easier to verify that both indexes have the same semantics.
The format should allow very low overhead processing. Low and predictable overhead allows us to more easily separate the cost of the file format from the cost of document ingestion in performance tests. Low overhead also increases our agility in the edit-compile-debug-measure cycle.
Grammar
The corpus file models a sequence of documents, each of which is composed of a sixteen-digit hexadecimal document id followed by a sequence of streams, each of which consists of a two-digit hexadecimal stream id followed by an ordered sequence of terms. Documents, streams, and terms are separated by the ‘\0’ sentinel character.
Corpus: Document* End
Document: DocumentId End Stream* End
DocumentId: Hex2 Hex2 Hex2 Hex2 Hex2 Hex2 Hex2 Hex2
Stream: StreamId End (Term End)* End
StreamId: Hex2
Hex2: Hex Hex
Hex: [0123456789abcdef]
Term: MBCS*
MBCS: (any UTF-8 character sequence that doesn’t contain End).
End: (the byte ‘\0’)
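To give a sense of how little machinery the grammar requires, here is a minimal C++ reading sketch. The names (ParseCorpus, Document, Stream) and the overall shape are illustrative only, not the actual BitFunnel or reference-index ingestion code; the sketch assumes the entire corpus is already in memory as a single buffer and simply walks the ‘\0’ separators.

#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

struct Stream
{
    uint8_t id;                          // StreamId parsed from two hex digits.
    std::vector<char const *> terms;     // Pointers into the corpus buffer (no copies).
};

struct Document
{
    uint64_t id;                         // DocumentId parsed from sixteen hex digits.
    std::vector<Stream> streams;
};

// Walks the buffer according to the grammar above. The buffer must outlive the
// returned Documents because every term points directly into it.
std::vector<Document> ParseCorpus(char const * p)
{
    std::vector<Document> documents;
    while (*p != '\0')                                   // Corpus: Document* End
    {
        Document document;
        document.id = std::strtoull(p, nullptr, 16);     // DocumentId
        p += std::strlen(p) + 1;                         // Skip the id and its End.
        while (*p != '\0')                               // Document: ... Stream* End
        {
            Stream stream;
            stream.id = static_cast<uint8_t>(std::strtoul(p, nullptr, 16));
            p += std::strlen(p) + 1;                     // Skip the StreamId and its End.
            while (*p != '\0')                           // Stream: ... (Term End)* End
            {
                stream.terms.push_back(p);               // Term is already zero-terminated.
                p += std::strlen(p) + 1;
            }
            ++p;                                         // Consume the stream's End.
            document.streams.push_back(std::move(stream));
        }
        ++p;                                             // Consume the document's End.
        documents.push_back(std::move(document));
    }
    return documents;
}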
Example
Consider a corpus with the following two documents, each of which has a title and some body text.
// document id = 18
Dogs
Dogs are man’s best friend.

// document id = 200
Cat Facts
The internet is made of cats.
Each document can be modeled as a pair of streams, one for the title and one for the body. Assume that the ids for the title and body streams are 0 and 1, respectively. In this case the corpus could be represented as the following C string:
char const * corpus =
// First document
"0000000000000012\0"
"00\0Dogs\0\0"
"01\0Dogs\0are\0man's\0best\0friend.\0\0"
"\0"
// Second document
"00000000000000c8\0"
"00\0Cat\0Facts\0\0"
"01\0The\0internet\0is\0made\0of\0cats.\0\0"
"\0"
// End of corpus
"\0";
Here’s a hex dump of the same corpus.
0x00000000 30 30 30 30 30 30 30 30 30 30 30 30 000000000000
0x0000000C 30 30 31 32 00 30 30 00 44 6f 67 73 0012.00.Dogs
0x00000018 00 00 30 31 00 44 6f 67 73 00 61 72 ..01.Dogs.ar
0x00000024 65 00 6d 61 6e 27 73 00 62 65 73 74 e.man's.best
0x00000030 00 66 72 69 65 6e 64 2e 00 00 00 30 .friend....0
0x0000003C 30 30 30 30 30 30 30 30 30 30 30 30 000000000000
0x00000048 30 63 38 00 30 30 00 43 61 74 00 46 0c8.00.Cat.F
0x00000054 61 63 74 73 00 00 30 31 00 54 68 65 acts..01.The
0x00000060 00 69 6e 74 65 72 6e 65 74 00 69 73 .internet.is
0x0000006C 00 6d 61 64 65 00 6f 66 00 63 61 74 .made.of.cat
0x00000078 73 2e 00 00 00 00 00 00 00 00 00 00 s...........
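Continuing the reading sketch from the Grammar section (again with illustrative names only), a parser walking this buffer would recover the two documents like so:

#include <cstdio>

int main()
{
    // `corpus` is the C string from the listing above.
    std::vector<Document> documents = ParseCorpus(corpus);
    for (Document const & document : documents)
    {
        std::printf("Document %016llx\n",
                    static_cast<unsigned long long>(document.id));
        for (Stream const & stream : document.streams)
        {
            std::printf("  Stream %02x:", static_cast<unsigned>(stream.id));
            for (char const * term : stream.terms)
            {
                std::printf(" [%s]", term);
            }
            std::printf("\n");
        }
    }
    return 0;
}

This would print the document ids 0000000000000012 and 00000000000000c8, each followed by its title (stream 00) and body (stream 01) terms.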
Characteristics
This format has some pros and cons. Pros include
- Very easy to develop a parser.
- No overhead from complex library dependencies.
- Parsing process is lightweight and efficient.
- A human can understand the format in a debugger dump.
- Capable of representing arbitrary streams.
- All streams are optional.
- Captures text ordering.
- Supports any term that can be encoded as a zero-terminated UTF-8 string.
- C and C++ parsers can access zero-terminated strings directly, without making a copy.
Some negatives are
- Non-standard format.
- Cannot easily be authored, modified, or inspected with a text editor (see the generator sketch after this list).
- No compression.
- The document count and document, stream, and string lengths aren’t known up front. Streams must be scanned in their entirety to determine their lengths.
- Does not enforce any schema ensuring documents are well-formed.
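Because the format can’t conveniently be hand-authored, a small generator is useful for building test corpora programmatically. Here is a minimal sketch under the same caveats as before: AppendDocument is a hypothetical helper, not part of BitFunnel, and it simply emits the separators and End markers the grammar calls for.

#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Appends one document to `out` in the corpus format: a sixteen-digit hex id,
// then each stream as a two-digit hex id followed by its terms, with '\0'
// separators exactly as in the grammar above.
void AppendDocument(
    std::string & out,
    uint64_t documentId,
    std::vector<std::pair<uint8_t, std::vector<std::string>>> const & streams)
{
    char buffer[17];
    std::snprintf(buffer, sizeof(buffer), "%016llx",
                  static_cast<unsigned long long>(documentId));
    out.append(buffer, 16);
    out.push_back('\0');                                 // DocumentId End
    for (auto const & stream : streams)
    {
        std::snprintf(buffer, sizeof(buffer), "%02x",
                      static_cast<unsigned>(stream.first));
        out.append(buffer, 2);
        out.push_back('\0');                             // StreamId End
        for (std::string const & term : stream.second)
        {
            out.append(term);
            out.push_back('\0');                         // Term End
        }
        out.push_back('\0');                             // Stream End
    }
    out.push_back('\0');                                 // Document End
}

A complete corpus is then just the concatenated documents followed by one final '\0'; for example, appending the two documents from the Example section and then calling out.push_back('\0') reproduces the example corpus above.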