BitFunnel Blog

We're open sourcing BitFunnel, a key component of the Bing search engine. The BitFunnel library provides high performance indexing, retrieval, and ranking of documents. Today the code runs at massive scale inside of Bing's data centers, but our dream is to make the code available and relevant to anyone, anywhere who values search. As we release each module, we will document our key design decisions here on this blog.

A Small Query Language

A challenge in bringing BitFunnel to open source is providing functionality that was previously supplied by portions of Bing upstream of BitFunnel. BitFunnel was designed as a library that takes, as input, a tree of TermMatchNodes, which represents a boolean expression combining terms and phrases using logical operators like and, or, and not. The Bing search pipeline does a ton of work on the query itself before presenting a TermMatchNode tree to BitFunnel. (read more...)
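To make the idea concrete, here is a minimal sketch of what a boolean match tree in the spirit of a TermMatchNode could look like. The names (MatchNode, Term, And, Not) and the document-as-term-set model are illustrative assumptions, not the actual BitFunnel API:

```cpp
#include <memory>
#include <set>
#include <string>
#include <vector>

// Illustrative sketch, not the BitFunnel API: a node in a boolean
// expression tree over terms. A document is modeled as a set of terms.
struct MatchNode {
    virtual ~MatchNode() = default;
    // Returns true if the document satisfies this node's expression.
    virtual bool Matches(const std::set<std::string>& doc) const = 0;
};

// A leaf: matches documents containing a single term.
struct Term : MatchNode {
    std::string term;
    explicit Term(std::string t) : term(std::move(t)) {}
    bool Matches(const std::set<std::string>& doc) const override {
        return doc.count(term) != 0;
    }
};

// Logical AND over any number of child expressions.
struct And : MatchNode {
    std::vector<std::unique_ptr<MatchNode>> children;
    bool Matches(const std::set<std::string>& doc) const override {
        for (const auto& c : children) {
            if (!c->Matches(doc)) return false;
        }
        return true;
    }
};

// Logical NOT of a single child expression.
struct Not : MatchNode {
    std::unique_ptr<MatchNode> child;
    explicit Not(std::unique_ptr<MatchNode> c) : child(std::move(c)) {}
    bool Matches(const std::set<std::string>& doc) const override {
        return !child->Matches(doc);
    }
};
```

A query like "seattle AND NOT rain" would then be an And node with a Term child and a Not(Term) child, evaluated against each candidate document.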

Stream Configuration

BitFunnel models each document as a set of streams, each of which consists of a sequence of terms corresponding to the words and phrases that make up the document. Real world documents are usually organized with streams corresponding to structural concepts, such as the title, the URL, the body, and perhaps even the text of anchors on other pages that point to the document. We may want to organize the index using a different principle. (read more...)
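As a rough sketch of this model, a document could be represented as a map from stream names to term sequences, and a stream configuration could then combine the source streams under a different principle before indexing. The types and the MergeStreams helper below are hypothetical illustrations, not BitFunnel code:

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative sketch, not the BitFunnel API: each document is a set of
// named streams, and each stream is a sequence of terms.
using Stream = std::vector<std::string>;

struct Document {
    std::map<std::string, Stream> streams;
};

// Reorganize streams under a different principle: concatenate several
// source streams (e.g. "title" and "anchor") into one indexed stream.
Stream MergeStreams(const Document& doc,
                    const std::vector<std::string>& names) {
    Stream merged;
    for (const auto& name : names) {
        auto it = doc.streams.find(name);
        if (it != doc.streams.end()) {
            merged.insert(merged.end(),
                          it->second.begin(), it->second.end());
        }
    }
    return merged;
}
```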

Getting started with NativeJIT

NativeJIT is a just-in-time compiler that handles expressions involving C data structures. It was originally developed in Bing, with the goal of being able to compile search query matching and search query ranking code in a query-dependent way. The goal was to create a compiler that can be used in systems with tens of thousands of queries per second without having compilation take a significant fraction of the query time. Let’s look at a simple “Hello, World” and then look at what the API has to offer us. (read more...)

Index Build Tools

After many months of hard work, we kind of, sort of have a document ingestion pipeline that seems to work. By this I mean we have a minimal set of configuration and ingestion tools that we can compile and then run without crashing, and these tools seem to ingest files mostly as expected. We’re still going to need to do a lot of testing, tuning, and evaluation, but I thought it would be helpful to take this time to walk through the process of bringing up an index from a set of chunk files extracted from Wikipedia. (read more...)