BitFunnel Engineering Diary
We're open sourcing BitFunnel, a library for high performance indexing, retrieval, and ranking of documents.
Today the code runs at massive scale inside of Bing's data centers, but our dream is
to make the code available and relevant to anyone, anywhere who values search.
As we release each module, we will document our key design decisions here on this blog.
Thu, Oct 27, 2016
Things are starting to get exciting in the Land of BitFunnel. We’re now at the point where we can ingest a significant fraction of Wikipedia and run millions of queries, all without crashing – and we have a great set of analysis tools.
You might think that the end is in sight, but actually this is only the beginning. We’re now in what I like to call a “target rich environment” for bugs.
Sun, Oct 23, 2016
I spent the weekend implementing code to analyze bit densities in the rows and columns of the row tables. This tool should help us determine whether the row tables are configured correctly. A good row table should have the following characteristics:
Each column’s density is close to the system target density.
Each row’s density is close to the system target density.
Random pairs of terms are unlikely to share rows.
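To make these three checks concrete, here is a minimal Python sketch run on toy data. It is not the BitFunnel analysis tool: the function names, the tiny bit matrix, the term-to-row assignments, and the target density of 0.5 are all made up for illustration, and the sketch assumes each row is a list of 0/1 bits with one column per document.

```python
from itertools import combinations


def row_densities(table):
    """Fraction of set bits in each row."""
    return [sum(row) / len(row) for row in table]


def column_densities(table):
    """Fraction of set bits in each column (assumed: one column per document)."""
    row_count = len(table)
    return [sum(column) / row_count for column in zip(*table)]


def max_deviation(densities, target):
    """Largest absolute distance between any density and the target."""
    return max(abs(d - target) for d in densities)


def shared_row_fraction(term_rows):
    """Fraction of term pairs that have at least one row in common."""
    pairs = list(combinations(term_rows, 2))
    if not pairs:
        return 0.0
    sharing = sum(
        1 for a, b in pairs if set(term_rows[a]) & set(term_rows[b])
    )
    return sharing / len(pairs)


# Hypothetical toy row table: 4 rows by 4 columns.
table = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]

# Hypothetical assignment of terms to row indices.
term_rows = {"dog": [0, 1], "cat": [1, 2], "fish": [3]}

TARGET_DENSITY = 0.5  # assumed value, chosen only for this example

print("row deviation:", max_deviation(row_densities(table), TARGET_DENSITY))
print("column deviation:", max_deviation(column_densities(table), TARGET_DENSITY))
print("pairs sharing a row:", shared_row_fraction(term_rows))
```

On a real index the same three numbers would be computed over far larger tables, and the pair check would more likely sample random pairs than enumerate all of them.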
Fri, Oct 21, 2016
Wikipedia is a great test corpus for search engines. It is free and easy to obtain, it carries a license appropriate for research, and at ~59GB uncompressed, it is large, but not too large to fit on a reasonably sized server. For those with extremely fast reflexes, even user data is sometimes available.
Wikipedia is also probably more representative of common use cases of search: since it is edited by amateurs, it is a more pedestrian dataset than many other corpora.
Thu, Oct 13, 2016
To get a high-level overview of the algorithm, please see this talk transcript. This glossary is incomplete and needs a lot of work! While our plan is to fill out the whole thing, that will probably take a while. If there’s some particular term or concept that you’d like to see explained sooner, please let us know.
Top level concepts
TermTable
A TermTable contains the mapping from a term to the rows associated with the term.
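As a rough mental model only, the mapping can be pictured as a dictionary from a term to a tuple of row identifiers. The Python sketch below is a toy, not BitFunnel's TermTable: the class name `ToyTermTable`, its methods, the integer row ids, and the choice to return an empty tuple for unknown terms are all assumptions made for illustration.

```python
class ToyTermTable:
    """Toy stand-in for a term-to-rows mapping (hypothetical API)."""

    def __init__(self):
        # term text -> tuple of row ids (row ids are plain ints here)
        self._rows_by_term = {}

    def add_term(self, term, rows):
        """Record the rows associated with a term."""
        self._rows_by_term[term] = tuple(rows)

    def rows_for(self, term):
        """Return the rows for a term, or an empty tuple if unknown.

        How unknown terms are handled is an assumption of this sketch.
        """
        return self._rows_by_term.get(term, ())


term_table = ToyTermTable()
term_table.add_term("wikipedia", [3, 17, 42])
term_table.add_term("search", [17, 88])

print(term_table.rows_for("wikipedia"))
print(term_table.rows_for("missing"))
```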