Stream Configuration
BitFunnel models each document as a set of streams, each of which consists of a sequence of terms corresponding to the words and phrases that make up the document.
Real world documents are usually organized with streams corresponding to structural concepts, such as the title, the URL, the body, and perhaps even the text of anchors on other pages that point to the document.
We may want to organize the index using a different principle. For example, we might index each document as a pair of streams, one that contains all terms associated with the document and another that contains only those terms that appear in streams other than the document body. This organization is useful for rewriting queries in order to return fewer results.
The StreamConfiguration provides a mapping between the streams in the document and the streams in the index. Let’s look at a more detailed example.
Consider a hypothetical document about dogs that resides at http://bitfunnel.org/dogs:
Dogs
Dogs are your best friend.
Suppose another page refers to our document via the following anchor tags:
<a href="dogs">Check out this awesome page!</a>
<a href="dogs">Who is your friend?</a>
Such a document might be organized into streams as follows:
title: [dogs]
body: [dogs are your best friend]
url: [http bitfunnel org dogs]
anchors: [check out this awesome page] [who is your friend]
If the index modelled documents this way, we could search for our document with queries like dogs, title:dogs, or even anchors:"awesome page".
We could choose to index this document as two streams, one of which has all words associated with the document and the other of which contains only words from streams other than the body:
document: [dogs] [dogs are your best friend] [http bitfunnel org dogs]
[check out this awesome page] [who is your friend]
nonbody: [dogs] [http bitfunnel org dogs] [check out this awesome page]
[who is your friend]
With this organization, we could find the document with the query nonbody:dogs but not nonbody:best, because the word "best" appears only in the document body.
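The regrouping above can be sketched in a few lines of C++. This is a minimal illustration, not BitFunnel's implementation: the `DocumentStreams` alias and the `MakeDogsDocument`, `BuildIndexStream`, and `Contains` helpers are hypothetical names introduced here, and the sketch flattens each stream into a list of words, ignoring phrase boundaries.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical model for illustration: stream name -> flat list of terms.
using DocumentStreams = std::map<std::string, std::vector<std::string>>;

// Builds the example document about dogs, one entry per document stream.
DocumentStreams MakeDogsDocument()
{
    return {
        {"title", {"dogs"}},
        {"body", {"dogs", "are", "your", "best", "friend"}},
        {"url", {"http", "bitfunnel", "org", "dogs"}},
        {"anchors", {"check", "out", "this", "awesome", "page",
                     "who", "is", "your", "friend"}},
    };
}

// Concatenates the terms of the named document streams into one index stream.
// Names that are not present in the document are skipped.
std::vector<std::string> BuildIndexStream(
    const DocumentStreams& doc,
    const std::vector<std::string>& sources)
{
    std::vector<std::string> terms;
    for (const auto& name : sources)
    {
        const auto it = doc.find(name);
        if (it != doc.end())
        {
            terms.insert(terms.end(), it->second.begin(), it->second.end());
        }
    }
    return terms;
}

// Returns true if the index stream contains the given term.
bool Contains(const std::vector<std::string>& terms, const std::string& term)
{
    return std::find(terms.begin(), terms.end(), term) != terms.end();
}
```

Building a nonbody stream from the title, url, and anchors streams yields a term list that contains "dogs" but not "best", which mirrors the query behavior described above.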
Configuring Streams
The IDocument class uses the StreamConfiguration at ingestion time to organize its terms for indexing. The QueryParser class uses the StreamConfiguration to map from text stream names to Term::StreamId values.
Each IDocument is filled with streams of terms using a sequence of calls to the OpenStream(), AddTerm(), and CloseStream() methods. The Term::StreamId values passed to OpenStream() are document streams. The document above might be initialized with the following sequence of calls:
OpenStream(0); // Title stream
AddTerm("dogs");
CloseStream();
OpenStream(1); // Body stream
AddTerm("dogs");
AddTerm("are");
AddTerm("your");
AddTerm("best");
AddTerm("friend");
CloseStream();
OpenStream(2); // URL stream
AddTerm("http");
AddTerm("bitfunnel");
AddTerm("org");
AddTerm("dogs");
CloseStream();
OpenStream(3); // Anchors stream
AddTerm("check");
AddTerm("out");
AddTerm("this");
AddTerm("awesome");
AddTerm("page");
CloseStream();
// Close and then reopen stream to
// keep phrases from the two anchors
// separate.
OpenStream(3); // Anchors stream
AddTerm("who");
AddTerm("is");
AddTerm("your");
AddTerm("friend");
CloseStream();
We can ingest this document as Document and NonBody streams by writing the following StreamConfiguration file:
Document: 0,1,2,3
NonBody: 0,2,3
The first line defines an index stream called “Document” which contains terms and phrases from document streams 0, 1, 2, and 3, which correspond to the document’s Title, Body, URL, and Anchor streams. The second line defines an index stream called “NonBody” which contains terms from the document’s Title, URL, and Anchor streams, leaving out the Body stream (stream 1).
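To make the file format concrete, here is a rough sketch of how lines of the form `Name: id,id,...` could be read. It is not BitFunnel's parser: `IndexStreamEntry` and `ParseStreamConfiguration` are hypothetical names, and the sketch assumes every id is a well-formed non-negative integer.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one line of the configuration file.
struct IndexStreamEntry
{
    std::string name;                  // Index stream name, e.g. "Document".
    std::vector<int> documentStreams;  // Document stream ids that feed it.
};

// Parses text with one "Name: id,id,..." entry per line.
// Lines without a colon are skipped. Assumes each id is a valid integer;
// std::stoi throws on anything else.
std::vector<IndexStreamEntry> ParseStreamConfiguration(const std::string& text)
{
    std::vector<IndexStreamEntry> entries;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line))
    {
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos)
        {
            continue;
        }

        IndexStreamEntry entry;
        entry.name = line.substr(0, colon);

        // Split the remainder on commas; std::stoi skips leading spaces.
        std::istringstream ids(line.substr(colon + 1));
        std::string id;
        while (std::getline(ids, id, ','))
        {
            entry.documentStreams.push_back(std::stoi(id));
        }
        entries.push_back(entry);
    }
    return entries;
}
```

Feeding it a "Document" line listing streams 0 through 3 and a "NonBody" line listing the Title, URL, and Anchor streams (0, 2, and 3) gives two entries in file order, so the first entry can later be treated as the default stream.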
This StreamConfiguration file will automatically configure the QueryParser to recognize the “Document” and “NonBody” prefixes. Note that the first entry in the StreamConfiguration file is the default stream. When a query term does not have a stream prefix, it uses the default stream. So in this case, the query dogs is equivalent to Document:dogs.
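The default-stream rule can be illustrated with a small helper. This is only a sketch under stated assumptions, not the QueryParser's actual logic: `QualifyTerm` is a hypothetical function, prefixes are matched exactly (case-sensitively), and a term with an unrecognized prefix is treated as having no prefix.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the query term with an explicit stream prefix.
// `streamNames` holds the index stream names in configuration-file order, so
// streamNames[0] is the default stream. Assumes streamNames is non-empty.
std::string QualifyTerm(const std::string& queryTerm,
                        const std::vector<std::string>& streamNames)
{
    const std::size_t colon = queryTerm.find(':');
    if (colon != std::string::npos)
    {
        const std::string prefix = queryTerm.substr(0, colon);
        for (const auto& name : streamNames)
        {
            if (name == prefix)
            {
                // The term already names a configured stream.
                return queryTerm;
            }
        }
    }

    // No recognized prefix: fall back to the first (default) stream.
    return streamNames.front() + ":" + queryTerm;
}
```

With the stream names "Document" and "NonBody" in that order, the bare term dogs becomes Document:dogs, while NonBody:dogs is returned unchanged.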