Corpus File Format
One of the challenges in making BitFunnel relevant to the open source community is removing Bing-specific functionality that has deep dependencies on the internals of the rest of Bing’s web crawling and index serving infrastructure.
As I mentioned in my first post, we plan to start with an empty repository and bring over BitFunnel modules one by one. We are essentially bootstrapping the BitFunnel project, and this process will require a new test corpus, a set of performance benchmarks, and some system to help verify correctness.
We’d love to use a corpus from Bing, but this approach has some problems, the first of which is the document file format.
Today Bing uses a file format called IdfDocument to represent all of the information we know about a document. It comes as no surprise that an IdfDocument file contains broadly applicable information like the text from the document body and title streams. The problem is that it also contains a ton of other information, some of which may be interesting, but most of which is only applicable to the idiosyncrasies of the Bing crawling, indexing, and relevance algorithms. Some examples of this other data include information about URLs users have clicked to reach this document, various static scores relating to Bing’s notion of document importance, fields indicating whether the document is likely to be porn or spam, and tags for geographical region and language.
Removing the Bing idiosyncrasies from IdfDocument seems hard, and even if we did this work, there’s still the issue of the test corpus. We would like to make the test corpus publicly available under a permissive license so that anyone can reproduce our results and build upon our learnings.
Since the Bing web crawl corpus is unavailable to us for this purpose, our plan is to start with an English language Wikipedia dump and then process it with Lucene to perform the word breaking and stemming that had previously been handled by portions of the Bing ingestion pipeline upstream of BitFunnel.
This processed Wikipedia dump will serve as a baseline for metrics tracking index size, document ingestion rate, and query processing rate. Since there is really no way to do an apples-to-apples comparison with Bing, we will use Lucene as our oracle and comparison system. This will allow us to spot correctness issues early on and characterize BitFunnel performance in a context that is more meaningful to the open source community. It is our intention that our experiments will be well-documented and committed to the repository so that others can reproduce our results down the road.
Corpus File Format
We decided to invent our own file format to use as input to both the Lucene-based reference index and the new BitFunnel index. Here are our top two goals:
The format should be trivial to process. Both the reference index, based on Lucene, and the BitFunnel index will need to process this format, so we will have to write ingestion code in both Java and C++. Using a trivial format will simplify the task of developing the ingestion code and make it easier to verify that both indexes have the same semantics.
The format should allow very low overhead processing. Low and predictable overhead allows us to more easily separate the cost of the file format from the cost of document ingestion in performance tests. Low overhead also increases our agility in the edit-compile-debug-measure cycle.
Grammar
The corpus file models a sequence of documents, each of which is composed of a sixteen-digit hexadecimal document id followed by a sequence of streams, each of which consists of a two-digit hexadecimal stream id followed by an ordered sequence of terms. Documents, streams, and terms are separated by the ‘\0’ sentinel character.
Corpus: Document* End
Document: DocumentId End Stream* End
DocumentId: Hex2 Hex2 Hex2 Hex2 Hex2 Hex2 Hex2 Hex2
Stream: StreamId End (Term End)* End
StreamId: Hex2
Hex2: Hex Hex
Hex: [0123456789abcdef]
Term: MBCS*
MBCS: (any UTF-8 character sequence that doesn’t contain End).
End: (the byte ‘\0’)
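To give a sense of how little machinery the grammar requires, here is a minimal C++ reading sketch. The names (ParseCorpus, Document, Stream) and the overall shape are illustrative only, not the actual BitFunnel or reference-index ingestion code; the sketch assumes the entire corpus is already in memory as a single buffer and simply walks the ‘\0’ separators.

#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

struct Stream
{
    uint8_t id;                          // StreamId parsed from two hex digits.
    std::vector<char const *> terms;     // Pointers into the corpus buffer (no copies).
};

struct Document
{
    uint64_t id;                         // DocumentId parsed from sixteen hex digits.
    std::vector<Stream> streams;
};

// Walks the buffer according to the grammar above. The buffer must outlive the
// returned Documents because every term points directly into it.
std::vector<Document> ParseCorpus(char const * p)
{
    std::vector<Document> documents;
    while (*p != '\0')                                   // Corpus: Document* End
    {
        Document document;
        document.id = std::strtoull(p, nullptr, 16);     // DocumentId
        p += std::strlen(p) + 1;                         // Skip the id and its End.
        while (*p != '\0')                               // Document: ... Stream* End
        {
            Stream stream;
            stream.id = static_cast<uint8_t>(std::strtoul(p, nullptr, 16));
            p += std::strlen(p) + 1;                     // Skip the StreamId and its End.
            while (*p != '\0')                           // Stream: ... (Term End)* End
            {
                stream.terms.push_back(p);               // Term is already zero-terminated.
                p += std::strlen(p) + 1;
            }
            ++p;                                         // Consume the stream's End.
            document.streams.push_back(std::move(stream));
        }
        ++p;                                             // Consume the document's End.
        documents.push_back(std::move(document));
    }
    return documents;
}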
Example
Consider a corpus with the following two documents, each of which has a title and some body text.
// document id = 18
Dogs
Dogs are man’s best friend.

// document id = 200
Cat Facts
The internet is made of cats.
Each document can be modeled as a pair of streams, one for the title and one for the body. Assume that the ids for the title and body streams are 0 and 1, respectively. In this case the corpus could be represented as the following C string:
char const * corpus =
// First document
"0000000000000012\0"
"00\0Dogs\0\0"
"01\0Dogs\0are\0man's\0best\0friend.\0\0"
"\0"
// Second document
"00000000000000c8\0"
"00\0Cat\0Facts\0\0"
"01\0The\0internet\0is\0made\0of\0cats.\0\0"
"\0"
// End of corpus
"\0";
Here’s a hex dump of the same corpus.
0x00000000 30 30 30 30 30 30 30 30 30 30 30 30 000000000000
0x0000000C 30 30 31 32 00 30 30 00 44 6f 67 73 0012.00.Dogs
0x00000018 00 00 30 31 00 44 6f 67 73 00 61 72 ..01.Dogs.ar
0x00000024 65 00 6d 61 6e 27 73 00 62 65 73 74 e.man's.best
0x00000030 00 66 72 69 65 6e 64 2e 00 00 00 30 .friend....0
0x0000003C 30 30 30 30 30 30 30 30 30 30 30 30 000000000000
0x00000048 30 63 38 00 30 30 00 43 61 74 00 46 0c8.00.Cat.F
0x00000054 61 63 74 73 00 00 30 31 00 54 68 65 acts..01.The
0x00000060 00 69 6e 74 65 72 6e 65 74 00 69 73 .internet.is
0x0000006C 00 6d 61 64 65 00 6f 66 00 63 61 74 .made.of.cat
0x00000078 73 2e 00 00 00 00 00 00 00 00 00 00 s...........
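Continuing the reading sketch from the Grammar section (again with illustrative names only), a parser walking this buffer would recover the two documents like so:

#include <cstdio>

int main()
{
    // `corpus` is the C string from the listing above.
    std::vector<Document> documents = ParseCorpus(corpus);
    for (Document const & document : documents)
    {
        std::printf("Document %016llx\n",
                    static_cast<unsigned long long>(document.id));
        for (Stream const & stream : document.streams)
        {
            std::printf("  Stream %02x:", static_cast<unsigned>(stream.id));
            for (char const * term : stream.terms)
            {
                std::printf(" [%s]", term);
            }
            std::printf("\n");
        }
    }
    return 0;
}

This would print the document ids 0000000000000012 and 00000000000000c8, each followed by its title (stream 00) and body (stream 01) terms.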
Characteristics
This format has some pros and cons. Pros include
- Very easy to develop a parser.
- No overhead from complex library dependencies.
- Parsing process is lightweight and efficient.
- A human can understand the format in a debugger dump.
- Capable of representing arbitrary streams.
- All streams are optional.
- Captures text ordering.
- Supports any term that can be encoded as a zero-terminated UTF-8 string.
- C and C++ parsers can access zero-terminated strings directly, without making a copy.
Some negatives are
- Non-standard format.
- Cannot easily be authored, modified, or inspected with a text editor (see the generator sketch after this list).
- No compression.
- The document count and document, stream, and string lengths aren’t known up front. Streams must be scanned in their entirety to determine their lengths.
- Does not enforce any schema ensuring documents are well-formed.
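Because the format can’t conveniently be hand-authored, a small generator is useful for building test corpora programmatically. Here is a minimal sketch under the same caveats as before: AppendDocument is a hypothetical helper, not part of BitFunnel, and it simply emits the separators and End markers the grammar calls for.

#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Appends one document to `out` in the corpus format: a sixteen-digit hex id,
// then each stream as a two-digit hex id followed by its terms, with '\0'
// separators exactly as in the grammar above.
void AppendDocument(
    std::string & out,
    uint64_t documentId,
    std::vector<std::pair<uint8_t, std::vector<std::string>>> const & streams)
{
    char buffer[17];
    std::snprintf(buffer, sizeof(buffer), "%016llx",
                  static_cast<unsigned long long>(documentId));
    out.append(buffer, 16);
    out.push_back('\0');                                 // DocumentId End
    for (auto const & stream : streams)
    {
        std::snprintf(buffer, sizeof(buffer), "%02x",
                      static_cast<unsigned>(stream.first));
        out.append(buffer, 2);
        out.push_back('\0');                             // StreamId End
        for (std::string const & term : stream.second)
        {
            out.append(term);
            out.push_back('\0');                         // Term End
        }
        out.push_back('\0');                             // Stream End
    }
    out.push_back('\0');                                 // Document End
}

A complete corpus is then just the concatenated documents followed by one final '\0'; for example, appending the two documents from the Example section and then calling out.push_back('\0') reproduces the example corpus above.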