BitFunnel Glossary · BitFunnel

BitFunnel Glossary

To get a high level overview of the algorithm, please see this talk transcript. This glossary is incomplete and needs a lot of work! While our plan is to fill out the whole thing, that will probably take a while. If there’s some particular term or concept that you’d like to see explained sooner, please let us know.

Top level concepts

TermTable

A TermTable contains the mapping from a term to the rows associated with the term. A term can be a word (1-gram) or an N-word phrase (N-gram).

Index / Ingestor

An Index contains, for one machine, an “index” of all documents that are indexed on that machine. An Index consists of multiple Shards, plus the configuration information necessary to ingest and look up documents, which means that it contains references to things like TermTables.

Shard

A Shard contains descriptors for the documents contained within the shard. This means that it has the DocTableDescriptors and the RowTableDescriptors for a Shard. The DocTableDescriptor tells you what information is stored along with each document besides the bits in the index. The RowTableDescriptors tell you what bits in the index actually mean. For example, given a DocIndex and the associated row information, it can give you the bit-address of a specific row for a document.

A Shard is also responsible for holding onto Slices.

Slice

A Slice owns the memory associated with a set of documents. Very roughly speaking, it’s a buffer of some size, with meta-information on the buffer and a ref count.

Other important concepts

Rank

In our hierarchical bloom filters, a row is said to have rank i if each bit represents 2**i documents. This means that, in a rank 0 row, each bit represents exactly one document, and in a rank 3 row, each bit represents 8 documents (in which case, the bit is set if any one of the 8 documents represented “wants” to set the bit).

FixedSizedBlob

Per document storage for a fixed-size chunks of data (i.e., data where the size of the blob is the same for every document). Because items are fixed-size, they can be stored in an array or an array-like structure.

VariableSizedBlob

Per document storage for variable-size chunks of data (i.e., data where the size of the blob can be different for every document). Because items are variable-size, pointers to items are stored.

DocTable / DocTableDescriptor

The DocTable is a collection of per-document data items for a Slice. An item in the DocTable consists of some number of FixedSizedBlobs as well as pointers to VariableSizedBlobs.

RowTableDescriptor

RowTableDescriptor exposes low-level operations on a Slice like GetBit, SetBit, and ClearBit. Given pointer to a SliceBuffer, a RowIndex and a DocIndex, the RowTableDescriptor lets you actually manipulate the information inside a Slice.

DocumentHandle

TokenManager / TokenTracker / Token

These are used to track the liveness of Slices.

The top-level object is a TokenManager, which can hand out TokenTrackers and Tokens. Tokens are basically monotonically increasing serial numbers that can be oustanding or complete. Each TokenTracker tracks if the TokenManager has any oustanding tokens issued before a cut-off serial number.

Recycler

The Recycler is a rudimentary garbage collector for slices. When the Index is done with a Slice, the Slice gets Expired. Expiring a slice schedules it to be recycled by a Recycler. Recycling (i.e., destruction) occurs when all tokens related to the slice are expired, i.e., when all users (read threads) of the Slice are done with the Slice.

TermTreatment

A mapping from term characteristics (today, IDF and gram size) to RowConfiguration (i.e., the number of rows at each rank).

Row

RowIdSequence

Other concepts

Allocator / IAllocator

Factories

FileSystem

FileManager

ShardDefinition

StreamConfiguration

Interface / IInterface

Configuration / IConfiguration

Document / IDocument

DocumentCache

DocumentDataSchema

DocumentFrequencyTable

FactSet

SimpleIndex

SliceBufferAllocator

TermTableBuilder

TermTableCollection

PackedRowIdSequence

AbstractRow

QueryPipeline

QueryPlanner

RowMatchNode

RowPlan

TermMatchNode

TermMatchTreeEvaulator

TermPlan

TermPlanConverter

ChunkEnumerator

ChunkIngestor

ChunkManifestIngestor

ChunkReader

ChunkTaskProcessor

Configuration

DocumentMap

TermToText

AbstractRowEnumerator

ByteCodeInterpreter

CompileNode

MatchTreeRewriter

MatchVerifier

PlanRows

QueryParser

RankDownCompiler

RankZeroCompiler

RegisterAllocator

SimplePlanner

A rudimentary query planner that only handles AND queries using the ByteCodeInterpreter. Takes a TermMatchNode tree, generates bytecode, and then runs the bytecode.

TODOs:

This should be grouped into more than just “top level” “important”, and “other”, but we’ve been saying we should do this for ages, so I’m putting this version out there just so there’s something.

Dan Luu
Prior to working on BitFunnel, Dan worked on network virtualization hardware at Microsoft (SmartNIC), deep learning hardware at Google (TPU), and x86/ARM processors at Centaur.