BitFunnel Engineering Diary
We're open sourcing BitFunnel, a library for high-performance indexing, retrieval, and ranking of documents.
Today the code runs at massive scale inside Bing's data centers, but our dream is
to make it available and relevant to anyone, anywhere, who values search.
As we release each module, we will document our key design decisions here on this blog.
Wed, Oct 12, 2016
I’ve been working on BitFunnel for roughly six months now. Looking at how I’ve used that time, my guess is that I’ve taken about a month of Mike’s time. Given the progress we’ve made, I think that’s a good trade-off, but it doesn’t make for a scalable open source project.
It makes sense to invest a month of time in a full-time employee since they’re likely to be around for at least a year or two, and even in the case of an extraordinarily bad fit, they’ll probably stick around for long enough that you’ll get your time-investment back.
Mon, Oct 10, 2016
Here’s a video showing how I debugged a read access violation that was caused by an earlier buffer overflow. This sort of problem can sometimes be hard to track down, but in this case, a data breakpoint made my job easier.
The video discusses the BlockAllocator, Slice buffers, and the Row Tables they contain. If you’d like to try diagnosing the bug yourself, just check out the SEHBug branch of the BitFunnel repository.
Tue, Oct 4, 2016
How long should we expect this project to take? In theory, estimating should be relatively easy: the project is a half-port, half-rewrite whose aim is to produce an open source version that’s simpler than the internal version, and we know how big the original project is.
If we run `find . -name "*.h" -o -name "*.cpp" | grep -v NativeJIT | xargs wc` on the original project to count all lines of code except NativeJIT, we get roughly 144k lines of code.
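The same counting approach can be tried on any tree. Here is a minimal sketch on a throwaway directory, using an invented `third_party/` subdirectory in place of NativeJIT as the excluded path, and passing `-l` so `wc` reports line counts only:

```shell
# Build a tiny sample tree: two source files to count, one to exclude.
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/third_party"
printf 'int main() {\n    return 0;\n}\n' > "$demo/src/main.cpp"
printf '#pragma once\n' > "$demo/src/main.h"
printf 'int x;\n' > "$demo/third_party/dep.cpp"

cd "$demo"
# Same shape as the command above: find the .h and .cpp files,
# drop the excluded directory, and hand the rest to wc.
find . \( -name "*.h" -o -name "*.cpp" \) | grep -v third_party | xargs wc -l
```

One caveat with this style of estimate: raw line counts include headers, comments, and blank lines, so the 144k figure is an upper bound on the code that actually needs porting.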
Sat, Sep 24, 2016
We’ve been having stability problems of late. In our rush to get a minimal version of the document ingestion pipeline up and running, we created a number of tools for gathering corpus statistics and configuring term tables, and we built an interactive REPL console to help our readers better understand the system. These tools are mostly system integrations and, as such, are not covered by unit tests. In recent days we’ve found them broken more often than working.