Stream Configuration · BitFunnel

# Stream Configuration

BitFunnel models each document as a set of streams, each of which consists of a sequence of terms corresponding to the words and phrases that make up the document.

Real-world documents are usually organized with streams corresponding to structural concepts, such as the title, the URL, the body, and perhaps even the text of anchors on other pages that point to the document.

We may want to organize the index using a different principle. For example, we might index each document as a pair of streams, one that contains all terms associated with the document and another that contains only those terms that appear in streams other than the document body. This organization is useful for rewriting queries in order to return fewer results.

The StreamConfiguration provides a mapping between the streams in the document and the streams in the index. Let’s look at a more detailed example.

Consider a hypothetical document about dogs that resides at http://bitfunnel.org/dogs:

Dogs

Suppose another page refers to our document via the following anchor tags:

<a href="dogs">Check out this awesome page!</a>
<a href="dogs">Who is your friend?</a>

Such a document might be organized into streams as follows:

title: [dogs]
body: [dogs are your best friend]
url: [http bitfunnel org dogs]
anchors: [check out this awesome page] [who is your friend]

If the index modelled documents this way, we could search for our document with queries like dogs, title:dogs, or even anchors:"awesome page".

We could choose to index this document as two streams instead, one containing all words associated with the document and the other containing only words from streams other than the body:

document: [dogs] [dogs are your best friend] [http bitfunnel org dogs]
          [check out this awesome page] [who is your friend]
nonbody: [dogs] [http bitfunnel org dogs] [check out this awesome page]
         [who is your friend]

With this organization, we could find the document with the query nonbody:dogs but not nonbody:best.
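
The mapping from document streams to index streams can be sketched in a few lines of C++. This is only an illustration, not BitFunnel's implementation: the names `Phrase`, `Stream`, `DocumentStreams`, `BuildIndexStream`, `Contains`, and `MakeDogsDocument` are hypothetical, and each stream is modelled as a plain list of phrases.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration; these are not BitFunnel classes.
using Phrase = std::vector<std::string>;
using Stream = std::vector<Phrase>;
using DocumentStreams = std::map<std::string, Stream>;

// Concatenates the phrases of the named document streams into one index stream.
Stream BuildIndexStream(DocumentStreams const & document,
                        std::vector<std::string> const & sources)
{
    Stream result;
    for (auto const & name : sources)
    {
        auto it = document.find(name);
        if (it != document.end())
        {
            result.insert(result.end(), it->second.begin(), it->second.end());
        }
    }
    return result;
}

// Returns true if any phrase in the stream contains the term.
bool Contains(Stream const & stream, std::string const & term)
{
    for (auto const & phrase : stream)
    {
        for (auto const & word : phrase)
        {
            if (word == term)
            {
                return true;
            }
        }
    }
    return false;
}

// The dogs document from the example, organized by document stream.
DocumentStreams MakeDogsDocument()
{
    DocumentStreams doc;
    doc["title"] = Stream{Phrase{"dogs"}};
    doc["body"] = Stream{Phrase{"dogs", "are", "your", "best", "friend"}};
    doc["url"] = Stream{Phrase{"http", "bitfunnel", "org", "dogs"}};
    doc["anchors"] = Stream{Phrase{"check", "out", "this", "awesome", "page"},
                            Phrase{"who", "is", "your", "friend"}};
    return doc;
}
```

With these definitions, `BuildIndexStream(MakeDogsDocument(), {"title", "url", "anchors"})` produces a nonbody stream for which `Contains(stream, "dogs")` is true and `Contains(stream, "best")` is false, because "best" appears only in the body.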

### Configuring Streams

The IDocument class uses the StreamConfiguration at ingestion time to organize its terms for indexing. The QueryParser class uses the StreamConfiguration to map from text stream names to Term::StreamId values.

Each IDocument is filled with streams of terms using a sequence of calls to the OpenStream(), AddTerm(), and CloseStream() methods. The Term::StreamId values passed to OpenStream() identify document streams, not index streams. The document above might be initialized with the following sequence of calls:

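OpenStream(0);  // Title stream
AddTerm("dogs");
CloseStream();

OpenStream(1);  // Body stream
AddTerm("dogs");
AddTerm("are");
AddTerm("your");
AddTerm("best");
AddTerm("friend");
CloseStream();

OpenStream(2);  // URL stream
AddTerm("http");
AddTerm("bitfunnel");
AddTerm("org");
AddTerm("dogs");
CloseStream();

OpenStream(3);  // Anchors stream
AddTerm("check");
AddTerm("out");
AddTerm("this");
AddTerm("awesome");
AddTerm("page");
CloseStream();

// Close and then reopen the stream to keep
// phrases from the two anchors separate.
OpenStream(3);  // Anchors stream
AddTerm("who");
AddTerm("is");
AddTerm("your");
AddTerm("friend");
CloseStream();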
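
To see why the anchors stream is closed and reopened, consider a small stand-in for a document that records the calls it receives. `RecordingDocument` and `StreamId` are hypothetical names for this sketch. It only mimics the OpenStream(), AddTerm(), and CloseStream() call pattern, it assumes stream ids are small integers, and it does not implement BitFunnel's IDocument interface.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using StreamId = std::size_t;  // Stand-in for Term::StreamId (assumed integral).

// Hypothetical recorder: each OpenStream()/CloseStream() pair yields one phrase.
class RecordingDocument
{
public:
    // Begins a new phrase in the given document stream.
    void OpenStream(StreamId id)
    {
        if (m_isOpen) throw std::logic_error("a stream is already open");
        m_isOpen = true;
        m_current = id;
        m_streams[id].emplace_back();
    }

    // Appends a term to the phrase started by the last OpenStream().
    void AddTerm(char const * term)
    {
        if (!m_isOpen) throw std::logic_error("no stream is open");
        m_streams[m_current].back().push_back(term);
    }

    // Ends the current phrase.
    void CloseStream()
    {
        if (!m_isOpen) throw std::logic_error("no stream is open");
        m_isOpen = false;
    }

    // Number of separate phrases recorded for a stream.
    std::size_t PhraseCount(StreamId id) const
    {
        auto it = m_streams.find(id);
        return it == m_streams.end() ? 0 : it->second.size();
    }

private:
    bool m_isOpen = false;
    StreamId m_current = 0;
    std::map<StreamId, std::vector<std::vector<std::string>>> m_streams;
};
```

In this model, opening stream 3 twice records two phrases, so a phrase query could not match words that straddle the boundary between the two anchors. Adding all nine anchor terms inside a single OpenStream(3)/CloseStream() pair would record one long phrase instead.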


We can ingest this document as Document and NonBody streams by writing the following StreamConfiguration file:

Document: 0,1,2,3
NonBody: 0,2,3


The first line defines an index stream called “Document” which contains terms and phrases from document streams 0, 1, 2, and 3, which correspond to the document’s Title, Body, URL, and Anchors streams. The second line defines an index stream called “NonBody” which contains terms from the document’s Title, URL, and Anchors streams, that is, every document stream except the Body stream.
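
As a rough illustration of the file format, the sketch below parses lines of the form `Name: id,id,...` into an ordered list of index streams. The `IndexStream` struct and the `ParseStreamConfiguration` function are hypothetical. BitFunnel's own parser may handle whitespace, validation, and errors differently.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical representation of one line of the configuration file.
struct IndexStream
{
    std::string name;                       // Index stream name, e.g. "Document".
    std::vector<unsigned> documentStreams;  // Ids of the document streams feeding it.
};

// Parses lines of the form "Name: 0,1,2". Blank lines are skipped.
std::vector<IndexStream> ParseStreamConfiguration(std::string const & text)
{
    std::vector<IndexStream> result;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line))
    {
        if (line.find_first_not_of(" \t\r") == std::string::npos)
        {
            continue;
        }

        std::size_t colon = line.find(':');
        if (colon == std::string::npos)
        {
            throw std::runtime_error("expected ':' in line: " + line);
        }

        IndexStream stream;
        stream.name = line.substr(0, colon);

        std::istringstream ids(line.substr(colon + 1));
        std::string id;
        while (std::getline(ids, id, ','))
        {
            // std::stoul skips leading whitespace and throws on non-numeric input.
            stream.documentStreams.push_back(
                static_cast<unsigned>(std::stoul(id)));
        }
        result.push_back(stream);
    }
    return result;
}
```

The result keeps the index streams in file order, which matters because the position of an entry in the file carries meaning.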

This StreamConfiguration file will automatically configure the QueryParser to recognize the “Document” and “NonBody” prefixes. Note that the first entry in the StreamConfiguration file is the default stream. When a query term does not have a stream prefix, it is matched against the default stream. So in this case, the query dogs is equivalent to Document:dogs.
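
The default-stream rule can be sketched as follows. `ResolveStream` is a hypothetical helper, not part of BitFunnel's QueryParser. It assumes a term is written as `prefix:text`, it ignores quoting and escaping, and it treats an unrecognized prefix as part of the term, where a real parser would more likely report an error.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Splits a query term into (index stream name, term text).
// streamNames lists the configured index streams in file order, so
// streamNames.front() is the default stream. Assumes streamNames is non-empty.
std::pair<std::string, std::string> ResolveStream(
    std::string const & queryTerm,
    std::vector<std::string> const & streamNames)
{
    std::size_t colon = queryTerm.find(':');
    if (colon != std::string::npos)
    {
        std::string prefix = queryTerm.substr(0, colon);
        for (auto const & name : streamNames)
        {
            if (name == prefix)
            {
                return {name, queryTerm.substr(colon + 1)};
            }
        }
    }

    // No recognized prefix: fall back to the first (default) stream.
    return {streamNames.front(), queryTerm};
}
```

Given the stream names "Document" and "NonBody" in that order, `ResolveStream("dogs", names)` and `ResolveStream("Document:dogs", names)` both resolve to the Document stream with the term dogs, while `ResolveStream("NonBody:dogs", names)` resolves to the NonBody stream.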

Michael Hopcroft
A 19 year veteran at Microsoft, Mike has worked on Office, Windows and Visual Studio. He has spent the past 6 years developing cloud scale infrastructure for the Bing search engine. He is a founding member of BitFunnel.