BitFunnel Engineering Diary

We're open sourcing BitFunnel, a library for high-performance indexing, retrieval, and ranking of documents. Today the code runs at massive scale inside Bing's data centers, but our dream is to make the code available and relevant to anyone, anywhere, who values search. As we release each module, we will document our key design decisions here on this blog.

Stream Configuration

BitFunnel models each document as a set of streams, each of which consists of a sequence of terms corresponding to the words and phrases that make up the document. Real-world documents are usually organized with streams corresponding to structural concepts, such as the title, the URL, the body, and perhaps even the text of anchors on other pages that point to the document. We may, however, want to organize the index using a different principle. (read more...)
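
To make the stream model concrete, here is a minimal sketch in Python. It is not BitFunnel's actual API: the Document class, the regroup helper, the "metadata" stream name, and the whitespace tokenizer are all hypothetical and chosen only for illustration. It shows a document held as named streams of terms, and one way the index could group those streams differently from the document's own structure.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical model: a document is a set of named streams of terms."""
    streams: dict[str, list[str]] = field(default_factory=dict)

    def add_terms(self, stream: str, text: str) -> None:
        # Naive lowercase/whitespace split stands in for real term extraction.
        self.streams.setdefault(stream, []).extend(text.lower().split())


def regroup(doc: Document, mapping: dict[str, str]) -> dict[str, list[str]]:
    """Map document streams onto index streams (an assumed scheme).

    Streams missing from `mapping` keep their own name.
    """
    index_streams: dict[str, list[str]] = {}
    for doc_stream, terms in doc.streams.items():
        target = mapping.get(doc_stream, doc_stream)
        index_streams.setdefault(target, []).extend(terms)
    return index_streams


doc = Document()
doc.add_terms("title", "BitFunnel Engineering Diary")
doc.add_terms("url", "example.com/diary")
doc.add_terms("body", "We are open sourcing BitFunnel")

# Fold title and URL into one index stream; leave the body on its own.
index_streams = regroup(doc, {"title": "metadata", "url": "metadata"})
```

In this sketch the document keeps three streams while the index sees only two, which is the kind of decoupling between document structure and index organization that the post goes on to discuss.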

Getting started with NativeJIT

NativeJIT is a just-in-time compiler that handles expressions involving C data structures. It was originally developed at Bing, with the goal of being able to compile search query matching and search query ranking code in a query-dependent way. The goal was to create a compiler that can be used in systems with tens of thousands of queries per second without having compilation take a significant fraction of the query time. Let’s look at a simple “Hello, World” and then look at what the API has to offer us. (read more...)

Index Build Tools

NOTE: This page was updated on 9/19/16 to reflect significant changes in the index build tools. After many months of hard work, we kind of, sort of have a document ingestion pipeline that seems to work. By this I mean we have a minimal set of configuration and ingestion tools that we can compile and then run without crashing, and these tools seem to ingest files mostly as expected. (read more...)

Corpus File Format

One of the challenges in making BitFunnel relevant to the open source community is removing Bing-specific functionality that has deep dependencies on the internals of the rest of the Bing web crawling and index serving infrastructure. As I mentioned in my first post, we plan to start with an empty repository and bring over BitFunnel modules one by one. We are essentially bootstrapping the BitFunnel project, and this process will require a new test corpus, a set of performance benchmarks, and some system to help verify correctness. (read more...)