A Small Query Language
74dc10d348A challenge in bringing BitFunnel to open source is
providing functionality that was previously supplied by portions of
Bing upstream of BitFunnel. BitFunnel was designed as a library
that takes, as input, a tree of TermMatchNodes
which represents a boolean
expression combining terms and phrases using logical operators like
and
, or
, and not
.
The Bing search pipeline does a ton of work on the query itself
before presenting a TermMatchNode
tree to BitFunnel. Examples include
word breaking, stemming, spelling corrections, and augmentation with
synonyms. The query also goes through a complex set of classifiers
that determine whether to route the query to special modules, such
as a baseball scoreboard or a weather forcast. The query is also
annotated with scoring instructions at this time.
All of this processing is carried out on a tree data structure generated upstream of BitFunnel. Although this tree has a textual representation, there was never any need for BitFunnel to parse the tree, so we never included a parser.
Our open source project is a different story. Today, at a minimum, we need some sort of query language and parser to test the code as we stand it up. We also expect our users will want the option of a complete, end-to-end system that includes a simple, intuitive, human-authorable query language.
Our goals for the query language were
- Easy and intuitive query authoring.
- Small grammar that is easy to learn.
- Simple to parse.
- Familiar to people who have used other search systems.
- Based on UTF-8 to allow queries in all languages.
Since our plan was to use Lucene as a testing oracle and performance baseline, it made sense to consider some subset of the Lucene query language. In the end, we chose to go with something more like the languages used by Bing and Google.
In making this decision, our main tradeoff was between complexity and familiarity on the one hand and Lucene compatability on the other. Lucene compatability would certainly make our lives as developers easier because we could feed identical queries to BitFunnel and our Lucene reference. It would also make it easier for search integrators to migrate between the two systems since they could just drop in whichever engine best met their business needs.
The reason we went with Bing/Google approach centered around complexity and
familiarity of the operator precidence. In the Lucene query language, logical
or
is implicit and has precidence over logical and
. For example, the query
dogs cats mice
matches documents that contain at least one of the terms
“dogs”, “cats”, and “mice”. In Bing and Google, the same query tends to find
those documents that contain all three terms (the exact semantics in Bing and
Google is less clear because they may alter the original query based on
complex systems that infer human intent). Our feeling was that users of internet
search engines would find the Bing/Google approach more familiar, but that
this would come at a cost of Lucene compatability and it would be less familiar
for Lucene users.
The deciding factor was the complexity of Lucene’s logical or
operator used
in conjunction with the +
operator. In Lucene, the +
operator converts a
portion of an or
expression into an and
expression. As an example, the
query +dogs cats mice
matches those documents that contain the term “dogs”
and at least one of “cats” and “mice”. In other words, the addition of a unary
+
operator converts the logical expression dogs | cats | mice
into the
expression dogs & (cats | mice)
which has a completely different structure.
We felt that the +
operator’s ability
convert an implicit or
into an and
and then distribute the and
over the
remaining or
expression introduced too much complexity and potential ambiguity
for what is essentially syntactic sugar.
In any event, the decision was low stakes because it will be easy to add a Lucene compatible parser in the future if we need one. Here’e what we came up with.
Query Language Overview
Our query language is inspired by a subset of the Bing query language. Today the functionality is limited to expressing boolean matching trees. Once we’re ready to port the BitFunnel ranker code, we will extend the language to include ranker annotations (e.g. boosting the weight of a particular term).
AND. The and
operator is implicit so the query dogs cats mice
matches those
documents that contain all of the words in the query. One can also explicity
specify logial and
with the &
symbol, e.g. dogs & cats & mice
.
OR. The or
operator, denoted by the |
symbol, is explicit. The query
dogs | cats | mice
matches those documents that contain at least one of the
three words in the query. Note that the or
operator has lower precedence
than the and
operator, so the query dogs & cats | mice
is equivalent to
the query (dogs & cats) | mice
.
NOT. The unary not
operator, denoted by the -
symbol matches documents that
do not contain an expression. As an example, the query dogs cats -mice
matches those documents tht contain “dogs” and “cats”, but do not contain
“mice”. Note that the not
operator can be applied to arbirary expressions
such as dogs -(cats | mice)
. The not
operator has higher precidence than
the and
and or
operators.
TERM. A search term is any sequence of UTF-8 characters that does not include
whitespace or special characters such as "()-:&|
. Terms may include
upper and lower case characters, but they may be converted to lowercase
during the query planning process. Special characters may appear if they are
escaped with a backslash. As an example, dog\(cat\)
would create the term associated with
the string literal “dog(cat)” while
dog(cat)
would be equivalent to dog & cat
. Note that it is legal to include
an escaped space in a term, e.g. dog\ cat
. Keep in mind that
such a term is actually a unigram that happens to contain a space and not the
bigram phrase "dog cat"
. Phrases and unigrams are treated differently in the
index, so it is important to use the phrase syntax when the term is intended to
be a higher order ngram (i.e. bigram, trigram, etc.)
PHRASE. A phrase can be specified by enclosing a sequence of search terms in double quotes, for example, "New York City". Each term in the phrase can include escaped characters.
STREAM PREFIX. A term match can be restricted to a particular stream by prefixing it with
the name of the stream and a colon. As an example, the query
title:dogs body:cat mice
would match those documents that have “dog” in
the title, “cat” in the body, and “mice” in the default stream. Stream
names are defined by the application that hosts BitFunnel, which is also
responsible for designating the default stream.
Grammar
Here’s the grammar for the BitFunnel query language.
OR: AND (‘|’ AND)*
AND: SIMPLE ([’&‘] SIMPLE)*
SIMPLE:
’-’ SIMPLE
’(’ OR ‘)’
PREFIXPREFIX: [STREAM] TEXT
STREAM: ~[SPECIAL | SPACE]+ ‘:’
TEXT:
TERM
" TERM [SPACE+ TERM]* "TERM: [~[SPECIAL | SPACE] | ESCAPE]+
SPECIAL: [’"’ ‘(’ ‘)’ ‘-’ ‘:’ ‘&’ ‘|’]
SPACE: [’\n’ ‘\r’ ‘\t’ ‘\v’ ‘\f’ ‘ ‘]
ESCAPE: ‘\’ [SPECIAL | SPACE]
Try it out
Feel free to experiment with the interactive query parser found in
the examples\QueryParser
directory. Just fire up the program with no
command line arguments and it will print out a brief help message and
then dump you into an interacive console where you can type in queries
and see their parse trees.
Welcome to the BitFunnel Query Parser Example.
This example is a Read-Eval-Print-Loop (REPL) that reads queries from
the console, parses them, and then prints out the resulting tree of
TermMatchNodes.
Enter a query after the % prompt and press return. To exit the demo
just enter a blank line. Here are some query ideas:
Single terms
dog
title:cat
Phrases
"dogs are your best friend"
anchors:"read this awesome page"
Disjunctions
dogs | cats
Conjunctions
dogs cats
dogs & cats
Negation
-cats
Grouping
dogs (cats | fish)
% dog
Unigram("dog", 0)
% title:cat
Unigram("cat", 1)
% "dogs are your best friend"
Phrase {
StreamId: 0,
Grams: [
"dogs",
"are",
"your",
"best",
"friend"
]
}
% anchors:"read this awesome page"
Phrase {
StreamId: 2,
Grams: [
"read",
"this",
"awesome",
"page"
]
}
% dogs | cats
Or {
Children: [
Unigram("cats", 0),
Unigram("dogs", 0)
]
}
% dogs cats
And {
Children: [
Unigram("cats", 0),
Unigram("dogs", 0)
]
}
% dogs & cats
And {
Children: [
Unigram("cats", 0),
Unigram("dogs", 0)
]
}
% -cats
Not {
Child: Unigram("cats", 0)
}
% dogs (cats | fish)
And {
Children: [
Or {
Children: [
Unigram("fish", 0),
Unigram("cats", 0)
]
},
Unigram("dogs", 0)
]
}
% dog\"cat
Unigram("dog\"cat", 0)
% dog\(cat
Unigram("dog(cat", 0)
% dog\ cat
Unigram("dog cat", 0)
%
bye
Press any key to continue . . .
Current Limitations
The query parser is still very much a work-in-progress. Here are some known limitations and notes about future directions.
- In the query parser example, the parser has been configured with stream prefixes
body
,title
, andanchor
. These prefixes are associated withTerm:StreamId
values of 0, 1, and 2, respectively. - The
TERM
production accepts all UTF-8 characters expect for ‘\0’ and members of ourSPECIAL
characters andSPACE
characters. One consequence is that characters like Unicode directional quotations (U+2018, U+2019, U+201C, and U+201D ) will be treated as part of theTERM
. - The query parser is configured to use an arena allocator with a fixed amount of memory. Unexpectedly long queries may cause the allocator to throw.
- The query parser currently preserves letter case.
- Currently stream prefixes may contain escaped special characters. This will likely be disallowed in the near future when we begin to store stream prefixes in configuration files where special characters may cause problems.