To get a high level overview of the algorithm, please see this talk transcript. This glossary is incomplete and needs a lot of work! While our plan is to fill out the whole thing, that will probably take a while. If there’s some particular term or concept that you’d like to see explained sooner, please let us know.
Top level concepts
A TermTable contains the mapping from a term to the rows associated with the term. A term can be a word (1-gram) or an N-word phrase (N-gram).
Index / Ingestor
An Index contains, for one machine, an “index” of all documents that are indexed on that machine. An Index consists of multiple Shards, plus the configuration information necessary to ingest and look up documents, which means that it contains references to things like TermTables.
A Shard contains descriptors for the documents contained within the shard. This means that it has the DocTableDescriptors and the RowTableDescriptors for a Shard. The DocTableDescriptor tells you what information is stored along with each document besides the bits in the index. The RowTableDescriptors tell you what bits in the index actually mean. For example, given a DocIndex and the associated row information, it can give you the bit-address of a specific row for a document.
A Shard is also responsible for holding onto Slices.
A Slice owns the memory associated with a set of documents. Very roughly speaking, it’s a buffer of some size, with meta-information on the buffer and a ref count.
Other important concepts
In our hierarchical bloom filters, a row is said to have rank i if each bit represents 2**i documents. This means that, in a rank 0 row, each bit represents exactly one document, and in a rank 3 row, each bit represents 8 documents (in which case, the bit is set if any one of the 8 documents represented “wants” to set the bit).
Per document storage for a fixed-size chunks of data (i.e., data where the size of the blob is the same for every document). Because items are fixed-size, they can be stored in an array or an array-like structure.
Per document storage for variable-size chunks of data (i.e., data where the size of the blob can be different for every document). Because items are variable-size, pointers to items are stored.
DocTable / DocTableDescriptor
The DocTable is a collection of per-document data items for a Slice. An item in the DocTable consists of some number of FixedSizedBlobs as well as pointers to VariableSizedBlobs.
RowTableDescriptor exposes low-level operations on a Slice like GetBit, SetBit, and ClearBit. Given pointer to a SliceBuffer, a RowIndex and a DocIndex, the RowTableDescriptor lets you actually manipulate the information inside a Slice.
TokenManager / TokenTracker / Token
These are used to track the liveness of Slices.
The top-level object is a TokenManager, which can hand out TokenTrackers and Tokens. Tokens are basically monotonically increasing serial numbers that can be oustanding or complete. Each TokenTracker tracks if the TokenManager has any oustanding tokens issued before a cut-off serial number.
The Recycler is a rudimentary garbage collector for slices. When the Index is done with a Slice, the Slice gets Expired. Expiring a slice schedules it to be recycled by a Recycler. Recycling (i.e., destruction) occurs when all tokens related to the slice are expired, i.e., when all users (read threads) of the Slice are done with the Slice.
A mapping from term characteristics (today, IDF and gram size) to RowConfiguration (i.e., the number of rows at each rank).
Allocator / IAllocator
Interface / IInterface
Configuration / IConfiguration
Document / IDocument
A rudimentary query planner that only handles AND queries using the ByteCodeInterpreter. Takes a TermMatchNode tree, generates bytecode, and then runs the bytecode.
This should be grouped into more than just “top level” “important”, and “other”, but we’ve been saying we should do this for ages, so I’m putting this version out there just so there’s something.