BitFunnel Engineering Diary

We're open sourcing BitFunnel, a library for high performance indexing, retrieval, and ranking of documents. Today the code runs at massive scale inside of Bing's data centers, but our dream is to make the code available and relevant to anyone, anywhere who values search. As we release each module, we will document our key design decisions here on this blog.

How do we make onboarding to BitFunnel easier?

I’ve been working on BitFunnel for roughly six months now. If I look at how I’ve used that time, my guess is that I’ve taken about a month of Mike’s time. If you look at the progress we’ve made, I think that’s a pretty good trade-off, but it doesn’t make for a scalable open source project. It makes sense to invest a month of time in a full-time employee since they’re likely to be around for at least a year or two, and even in the case of an extraordinarily bad fit, they’ll probably stick around for long enough that you’ll get your time-investment back. (read more...)

Debugging an SEH Crash

Here’s a video showing how I debugged a read access violation that was caused by an earlier buffer overflow. This sort of problem can sometimes be hard to track down, but in this case, a data breakpoint made my job easier. The video discusses the BlockAllocator, Slice buffers, and the Row Tables they contain. If you’d like to try diagnosing the bug yourself, just check out the SEHBug branch of the BitFunnel repository. (read more...)

When will BitFunnel be usable?

How long should we expect this project to take? In theory, this should be relatively easy to guess, because the project is a half-port, half-rewrite whose aim is to produce an open source version that’s simpler than the internal version, and we know how big the original project is. If we run find . -name "*.h" -o -name "*.cpp" | grep -v NativeJIT | xargs wc on the original project to count all lines of code except NativeJIT, we get roughly 144k lines of code. (read more...)

All's Well That Ends Well

We’ve been having some stability problems of late. In our rush to get some minimal version of the document ingestion pipeline up and running, we created a number of tools for gathering corpus statistics and configuring term tables and we built an interactive REPL console to help our readers better understand the system. These tools are mostly system integrations, and as such, are not covered by unit tests. In recent days we’ve found these integrations to be broken more often than working. (read more...)