A Small Query Language

NOTE: this site uses MathJax to render mathematical equations. Enabling JavaScript in your browser will allow MathJax to display these equations on the page.

74dc10d348

A challenge in bringing BitFunnel to open source is providing functionality that was previously supplied by portions of Bing upstream of BitFunnel. BitFunnel was designed as a library that takes, as input, a tree of TermMatchNodes which represents a boolean expression combining terms and phrases using logical operators like and, or, and not.

The Bing search pipeline does a ton of work on the query itself before presenting a TermMatchNode tree to BitFunnel. Examples include word breaking, stemming, spelling corrections, and augmentation with synonyms. The query also goes through a complex set of classifiers that determine whether to route the query to special modules, such as a baseball scoreboard or a weather forcast. The query is also annotated with scoring instructions at this time.

All of this processing is carried out on a tree data structure generated upstream of BitFunnel. Although this tree has a textual representation, there was never any need for BitFunnel to parse the tree, so we never included a parser.

Our open source project is a different story. Today, at a minimum, we need some sort of query language and parser to test the code as we stand it up. We also expect our users will want the option of a complete, end-to-end system that includes a simple, intuitive, human-authorable query language.

Our goals for the query language were

Easy and intuitive query authoring.
Small grammar that is easy to learn.
Simple to parse.
Familiar to people who have used other search systems.
Based on UTF-8 to allow queries in all languages.

Since our plan was to use Lucene as a testing oracle and performance baseline, it made sense to consider some subset of the Lucene query language. In the end, we chose to go with something more like the languages used by Bing and Google.

In making this decision, our main tradeoff was between complexity and familiarity on the one hand and Lucene compatability on the other. Lucene compatability would certainly make our lives as developers easier because we could feed identical queries to BitFunnel and our Lucene reference. It would also make it easier for search integrators to migrate between the two systems since they could just drop in whichever engine best met their business needs.

The reason we went with Bing/Google approach centered around complexity and familiarity of the operator precidence. In the Lucene query language, logical or is implicit and has precidence over logical and. For example, the query dogs cats mice matches documents that contain at least one of the terms “dogs”, “cats”, and “mice”. In Bing and Google, the same query tends to find those documents that contain all three terms (the exact semantics in Bing and Google is less clear because they may alter the original query based on complex systems that infer human intent). Our feeling was that users of internet search engines would find the Bing/Google approach more familiar, but that this would come at a cost of Lucene compatability and it would be less familiar for Lucene users.

The deciding factor was the complexity of Lucene’s logical or operator used in conjunction with the + operator. In Lucene, the + operator converts a portion of an or expression into an and expression. As an example, the query +dogs cats mice matches those documents that contain the term “dogs” and at least one of “cats” and “mice”. In other words, the addition of a unary + operator converts the logical expression dogs | cats | mice into the expression dogs & (cats | mice) which has a completely different structure.

We felt that the + operator’s ability convert an implicit or into an and and then distribute the and over the remaining or expression introduced too much complexity and potential ambiguity for what is essentially syntactic sugar.

In any event, the decision was low stakes because it will be easy to add a Lucene compatible parser in the future if we need one. Here’e what we came up with.

Query Language Overview

Our query language is inspired by a subset of the Bing query language. Today the functionality is limited to expressing boolean matching trees. Once we’re ready to port the BitFunnel ranker code, we will extend the language to include ranker annotations (e.g. boosting the weight of a particular term).

AND. The and operator is implicit so the query dogs cats mice matches those documents that contain all of the words in the query. One can also explicity specify logial and with the & symbol, e.g. dogs & cats & mice.

OR. The or operator, denoted by the | symbol, is explicit. The query dogs | cats | mice matches those documents that contain at least one of the three words in the query. Note that the or operator has lower precedence than the and operator, so the query dogs & cats | mice is equivalent to the query (dogs & cats) | mice.

NOT. The unary not operator, denoted by the - symbol matches documents that do not contain an expression. As an example, the query dogs cats -mice matches those documents tht contain “dogs” and “cats”, but do not contain “mice”. Note that the not operator can be applied to arbirary expressions such as dogs -(cats | mice). The not operator has higher precidence than the and and or operators.

TERM. A search term is any sequence of UTF-8 characters that does not include whitespace or special characters such as "()-:&|. Terms may include upper and lower case characters, but they may be converted to lowercase during the query planning process. Special characters may appear if they are escaped with a backslash. As an example, dog\(cat\) would create the term associated with the string literal “dog(cat)” while dog(cat) would be equivalent to dog & cat. Note that it is legal to include an escaped space in a term, e.g. dog\ cat. Keep in mind that such a term is actually a unigram that happens to contain a space and not the bigram phrase "dog cat". Phrases and unigrams are treated differently in the index, so it is important to use the phrase syntax when the term is intended to be a higher order ngram (i.e. bigram, trigram, etc.)

PHRASE. A phrase can be specified by enclosing a sequence of search terms in double quotes, for example, "New York City". Each term in the phrase can include escaped characters.

STREAM PREFIX. A term match can be restricted to a particular stream by prefixing it with the name of the stream and a colon. As an example, the query title:dogs body:cat mice would match those documents that have “dog” in the title, “cat” in the body, and “mice” in the default stream. Stream names are defined by the application that hosts BitFunnel, which is also responsible for designating the default stream.

Grammar

Here’s the grammar for the BitFunnel query language.

OR:       AND (‘|’ AND)*

AND:       SIMPLE ([’&‘] SIMPLE)*

SIMPLE:
      ’-’ SIMPLE
      ’(’ OR ‘)’
      PREFIX

PREFIX:       [STREAM] TEXT

STREAM:       ~[SPECIAL | SPACE]+ ‘:’

TEXT:
      TERM
      " TERM [SPACE+ TERM]* "

TERM:       [~[SPECIAL | SPACE] | ESCAPE]+

SPECIAL:       [’"’ ‘(’ ‘)’ ‘-’ ‘:’ ‘&’ ‘|’]

SPACE:       [’\n’ ‘\r’ ‘\t’ ‘\v’ ‘\f’ ‘ ‘]

ESCAPE:       ‘\’ [SPECIAL | SPACE]

Try it out

Feel free to experiment with the interactive query parser found in the examples\QueryParser directory. Just fire up the program with no command line arguments and it will print out a brief help message and then dump you into an interacive console where you can type in queries and see their parse trees.

Welcome to the BitFunnel Query Parser Example.

This example is a Read-Eval-Print-Loop (REPL) that reads queries from
the console, parses them, and then prints out the resulting tree of
TermMatchNodes.

Enter a query after the % prompt and press return. To exit the demo
just enter a blank line. Here are some query ideas:
    Single terms
        dog
        title:cat
    Phrases
        "dogs are your best friend"
        anchors:"read this awesome page"
    Disjunctions
        dogs | cats
    Conjunctions
        dogs cats
        dogs & cats
    Negation
        -cats
    Grouping
        dogs (cats | fish)

% dog
Unigram("dog", 0)

% title:cat
Unigram("cat", 1)

% "dogs are your best friend"
Phrase {
  StreamId: 0,
  Grams: [
    "dogs",
    "are",
    "your",
    "best",
    "friend"
  ]
}

% anchors:"read this awesome page"
Phrase {
  StreamId: 2,
  Grams: [
    "read",
    "this",
    "awesome",
    "page"
  ]
}

% dogs | cats
Or {
  Children: [
    Unigram("cats", 0),
    Unigram("dogs", 0)
  ]
}

% dogs cats
And {
  Children: [
    Unigram("cats", 0),
    Unigram("dogs", 0)
  ]
}

% dogs & cats
And {
  Children: [
    Unigram("cats", 0),
    Unigram("dogs", 0)
  ]
}

% -cats
Not {
  Child: Unigram("cats", 0)
}

% dogs (cats | fish)
And {
  Children: [
    Or {
      Children: [
        Unigram("fish", 0),
        Unigram("cats", 0)
      ]
    },
    Unigram("dogs", 0)
  ]
}

% dog\"cat
Unigram("dog\"cat", 0)

% dog\(cat
Unigram("dog(cat", 0)

% dog\ cat
Unigram("dog cat", 0)

%
bye
Press any key to continue . . .

Current Limitations

The query parser is still very much a work-in-progress. Here are some known limitations and notes about future directions.

In the query parser example, the parser has been configured with stream prefixes body, title, and anchor. These prefixes are associated with Term:StreamId values of 0, 1, and 2, respectively.
The TERM production accepts all UTF-8 characters expect for ‘\0’ and members of our SPECIAL characters and SPACE characters. One consequence is that characters like Unicode directional quotations (U+2018, U+2019, U+201C, and U+201D ) will be treated as part of the TERM.
The query parser is configured to use an arena allocator with a fixed amount of memory. Unexpectedly long queries may cause the allocator to throw.
The query parser currently preserves letter case.
Currently stream prefixes may contain escaped special characters. This will likely be disallowed in the near future when we begin to store stream prefixes in configuration files where special characters may cause problems.