On the Road to Open Source · BitFunnel

On the Road to Open Source

Today we are kicking off an effort to open source BitFunnel, a library for high-performance full-text search over a chunk of the internet, spread across thousands of machines. It is based on a probabilistic algorithm that identifies and ranks documents according to queries involving keywords, phrases, and mathematical expressions.

BitFunnel is the most significant accomplishment in my 35 years building software. I am proud of the system we created and want to share it with the world. For me, BitFunnel is a true joy and I can’t get enough of it. I love the elegance of the underlying math and its translation into a tight inner loop. I love the algorithm’s contrarian performance profile and its quirky probabilistic nature, but what I love most of all is the code.

We were lucky to be able to build the team and the codebase from scratch and to do it our way. We had no dependencies and, as such, were masters of our destiny. We strove to make the code modular, to adopt a test-forward approach, to deeply review every change, and to continuously knead our lump of clay, always refactoring, always striving for simplicity, and never stopping until there was nothing left to remove.

So did we succeed? On the one hand, the project was a success in that our small team started with a blank sheet of paper and an unorthodox approach and created a search engine that has handled live traffic continuously for years on thousands of machines deep inside of Bing. It is not a toy. As I write this, we have never had a production outage, and the codebase continues to adapt to a growing pool of internal collaborators, partners, and customers.

On the flip side, we created a system that is super-complex with too much coupling and not enough documentation. BitFunnel started out small and elegant and then grew more complex over time. We were pragmatic and made tradeoffs that were necessary to form partnerships and meet schedules and get into production. And this was good for Bing, but over time, BitFunnel became less and less relevant outside of Bing.

The real epiphany came one day when a researcher from Microsoft Research (MSR) asked if he could try out BitFunnel. “Of course,” I said. “We designed it up front to be modular and free of dependencies. You should have no trouble using it.” Famous last words. I then sat with him for a week trying to boot a copy of the system in his office and failed utterly. What an embarrassment! I had no idea how many couplings and dependencies and hidden assumptions had made their way into the codebase over the years, despite a culture of constant refactoring and detailed design reviews.

Now we have an opportunity to revisit the decisions that introduced unnecessary complexity. BitFunnel is but a small cog in the massive machine that is Bing and today there is no way to boot up the system outside of our data centers. Our biggest challenge in making BitFunnel relevant to the open source community will be removing Bing-specific functionality that has deep dependencies on the internals of the rest of the Bing web crawling and index serving infrastructure.

Our other challenge will be removing barnacles: those bits of code and idioms whose original rationale has been lost to the mists of time but that can’t be removed without pulling a thread that may unravel a large amount of code.

We plan to remove the barnacles while reusing planks from the old ship to build a new, smaller vessel with a more streamlined design. Our approach is to start with an empty repository and bring over modules one by one, keeping only the functionality that is useful and can be justified by a fairly rigorous set of scenario-focused tests.

A few years back, I was thrilled to see BitFunnel go live on the internet. The idea worked, the math had been proven, the code was written. We watched with bated breath as the first traffic hit our servers and were excited to see the stream of queries grow from a trickle to a thundering Niagara Falls.

Today that excitement continues as we carry out a copy/paste rewrite of the BitFunnel codebase. We will continue to bring over modules one by one, porting them to Linux and OS X, while writing about our journey on these pages.

Won’t you join us? It’s going to be fun. I think this is the beginning of a beautiful friendship.

-Michael Hopcroft

Michael Hopcroft
A 19-year veteran at Microsoft, Mike has worked on Office, Windows and Visual Studio. He has spent the past 6 years developing cloud-scale infrastructure for the Bing search engine. He is a founding member of BitFunnel.