Ben Isherwood

HCI: Querying Indexes

Blog Post created by Ben Isherwood on Apr 21, 2017


 

I'm sure that nearly everyone has queried a search engine before, typically by specifying one or more keywords of interest. But what happens when you begin to add other metadata criteria to your searches? Anyone who has fine-tuned a product search on Amazon.com (e.g. by size or color) knows that adding metadata values to your query can both improve relevancy and enable advanced data exploration.

 

As a quick example, try searching Google for the keyword "HCI". Didn't find the result you wanted? Try searching instead with the input "HCI site:hds.com". Much more relevant results!

 

With HCI, you are always building an index of searchable metadata fields. You have full control over which terms end up in which fields for each indexed document, allowing you to provide the exact search experience your users want.

 

Let's dive into the details of how queries work using the Lucene query language:

 

Basic Queries

The simplest form of query in HCI is known as a "basic" or "term" query. These are the types of queries you typically run in search engines such as Google. Each one consists of a field name and the value you are searching for in that field.

 

The query language allows you to specify the field(s) and value(s) you want to match, like:

     <field>:<value>

 

You are always searching for a value within a specific indexed field. When you enter plain keywords without a field prefix, the system searches its configured default field. In HCI, that default field is "HCI_text", which contains all indexed keywords from the document.

 

So, for the following keyword query:

     dog

 

You are actually querying for the following:

     HCI_text:dog

 

This is how your keywords are compared to indexed documents: each clause names a field, and a document matches when that field contains the value.
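To make the default-field rule concrete, here is a minimal Python sketch of the rewriting described above. It is only an illustration, not how HCI implements it: the `qualify` function and its regular expression are my own hypothetical helpers, and they handle just simple clauses (bare terms, quoted phrases, and `field:value` pairs).

```python
import re

DEFAULT_FIELD = "HCI_text"  # the default field named in the text above

# One clause: an optional +/- operator, then either an (optionally
# field-prefixed) quoted phrase or a run of non-space characters.
CLAUSE = re.compile(r'([+-]?)((?:\w+:)?"[^"]*"|\S+)')

def qualify(query, default_field=DEFAULT_FIELD):
    """Prefix every clause that has no explicit field with the default field."""
    out = []
    for op, body in CLAUSE.findall(query):
        has_field = not body.startswith('"') and ":" in body
        out.append(op + (body if has_field else f"{default_field}:{body}"))
    return " ".join(out)

print(qualify("dog"))
print(qualify('+dog author:"Stephen King"'))
```

The first call shows the bare keyword picking up the default field, while the second leaves the explicitly fielded clause alone.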

 

Wildcards

 

A query for "all" results in the index can be specified using wildcards (all fields and all values):

    *:*

 

A query for all results where the field "author" is defined in the document would be as follows:

    author:*

 

A query for all results where the author starts with "Stephen" would then be:

   author:Stephen*

 

Note that a wildcard may appear only in the middle or at the end of a term, not at the start (e.g. "author:*tephen" is not valid, but "author:Steph*King" is valid).

 

You can also wildcard a specific single character using the "?" syntax:

    author:S?ephen*
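If it helps to see the wildcard semantics spelled out, the following Python sketch translates a wildcard pattern into a regular expression and tests it against single terms. The function names are hypothetical, and the sketch ignores everything the index does before matching (tokenization, lowercasing, and so on), so treat it as an approximation of the matching rule only.

```python
import re

def wildcard_to_regex(pattern):
    """Translate '*' (any run of characters) and '?' (exactly one) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # every other character matches literally
    return re.compile("".join(parts))

def matches(pattern, term):
    """True when the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None

for term in ["Stephen", "Stephenson", "Steven"]:
    print(term, matches("Stephen*", term), matches("S?ephen*", term))
```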

 

Phrase Query

Phrase queries are used to match multiple terms that should be found next to one another in sequence:

      author:"Stephen King"

 

This query matches authors containing exactly the terms "Stephen King", but not "Stephen R. King", "King, Stephen", or "Stephen Kingston".

 

Sloppy Phrase Query

If you would like your results to also match "Stephen J. King" or "King, Stephen", you can use a "sloppy phrase query". A phrase query becomes sloppy when you specify, after the "~" character, the number of term position moves allowed for a match:

     author:"Stephen King"~1

This sloppy query would match "Stephen J. King", because one word hop is enough to generate a phrase match. It would not match "Stephen 'the author' King", because that requires 2 word hops; it matches if you use ~2 instead. Reversed terms such as "King, Stephen" also need ~2, because swapping two adjacent terms counts as two moves (the first move places both terms on the same position, the second moves one past the other).

 

This count of term moves required to make a match is called the slop factor.
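One way to reason about the slop factor is to compute it for a few token lists. The Python sketch below uses a simplified model of Lucene's sloppy matching: shift each matched position by the term's offset in the phrase, then take the spread between the largest and smallest shifted value. The `min_slop` helper is hypothetical, the documents are assumed to be already tokenized and lowercased, and edge cases such as repeated phrase terms are handled only roughly. In this model, reversing two adjacent terms costs 2.

```python
from itertools import product

def min_slop(doc_terms, phrase_terms):
    """Smallest slop at which the phrase would match, or None if it cannot."""
    # Positions at which each phrase term occurs in the document.
    positions = [
        [i for i, t in enumerate(doc_terms) if t == term]
        for term in phrase_terms
    ]
    if any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # two phrase terms cannot share one document position
        # An exact in-order match makes all shifted values equal (spread 0).
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        spread = max(shifted) - min(shifted)
        best = spread if best is None else min(best, spread)
    return best

phrase = ["stephen", "king"]
print(min_slop(["stephen", "king"], phrase))
print(min_slop(["stephen", "j", "king"], phrase))
print(min_slop(["king", "stephen"], phrase))
print(min_slop(["stephen", "the", "author", "king"], phrase))
```

A phrase query with `~N` matches a document in this model whenever `min_slop` returns a value less than or equal to N.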

 

Sloppy phrase matches based on slop values for phrase query "red-faced politician":

 

Blog18_2.png

 

 

Boolean Query

A boolean query is any query containing multiple keywords or clauses, like:

     HCI_text:foo  +Content_Type:"text/html"

 

A clause may be OPTIONAL (relevancy ranked), REQUIRED ( + ), or PROHIBITED ( - ):

  • OPTIONAL : The default (no operator specified). The clause does not have to match, but documents that do match it rank higher in the relevancy-ordered results.
  • REQUIRED  ( + ) : The clause must match; only documents that match it are returned.
  • PROHIBITED ( - ) : The clause must NOT match; any document that matches it is excluded from the results.

 

Search for author "Stephen King", MUST contain keyword "Shawshank" and MUST NOT have category "movie":

      +author:"Stephen King" +Shawshank -category:movie

 

Search for only results mentioning keyword "dog" and boost documents with keywords "german" and "shepherd":

     +dog german shepherd
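The three clause types can be modeled in a few lines of Python. This is a sketch of the matching rules only: the `evaluate` function is hypothetical, documents are reduced to plain lists of terms, and the "score" is just a count of optional clauses that matched, which is far cruder than real relevancy ranking.

```python
def evaluate(doc_terms, required=(), optional=(), prohibited=()):
    """Return (matches, optional_hits) for one document under +/- clause rules."""
    terms = set(doc_terms)
    if any(t in terms for t in prohibited):
        return False, 0  # a prohibited clause matched
    if any(t not in terms for t in required):
        return False, 0  # a required clause is missing
    hits = sum(1 for t in optional if t in terms)
    # With no required clause, at least one optional clause has to match.
    if not required and hits == 0:
        return False, 0
    return True, hits

# The query "+dog german shepherd" from above:
query = dict(required=["dog"], optional=["german", "shepherd"])
print(evaluate(["dog", "german", "shepherd"], **query))
print(evaluate(["dog", "poodle"], **query))
print(evaluate(["german", "shepherd"], **query))
```

Both documents containing "dog" match, but the one that also mentions "german" and "shepherd" collects more optional hits and so would rank higher.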

 

AND, OR, and NOT

Note that you may use "AND", "OR", and "NOT" between clauses. You may also use their symbol equivalents (&&, ||, and !). Care must be taken when using this syntax, as it does not apply traditional boolean logic to each clause:

  • AND becomes a "+" operator, denoting REQUIRED clauses
  • OR becomes no operator, denoting OPTIONAL clauses
  • NOT becomes a "-" operator, denoting PROHIBITED clauses

 

So this query:

     (foo AND bar) OR dog NOT bear

 

Is interpreted as the following:

    (+foo +bar) dog -bear

 

Example: None of these queries will produce equivalent results

     banana AND apple OR orange

     banana AND (apple OR orange)

     (banana AND apple) OR orange

 

Why? Because they evaluate to 3 different queries:

    +banana +apple orange

    +banana +(apple orange)

    (+banana +apple) orange

 

Therefore it is strongly recommended to avoid the AND, OR, and NOT operators in general. Just use "+" and "-" around clauses where they are needed!

 

Range Query

A range query lets you search for documents with values between 2 boundaries. Range queries work with numeric fields, date fields, and even string fields.

 

Find all documents with an age field whose values are between 18 and 30:

    age:[18 TO 30]

 

Find all documents with ages 65 and older (age >= 65):

     age:[65 TO *]

 

For string/text fields, you can also find all words found alphabetically between apple and banana:

    name:[apple TO banana]

 

Square brackets make a boundary "inclusive", while curly brackets make it "exclusive". To find all documents with ages older than 65 (age > 65):

    age:{65 TO *]

 

Find all documents with ages 18, 19, or 20, as well as ages 25 or 26:

    age:[18 TO 20] age:[25 TO 26]
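Since the bracket types are easy to mix up, here is a small Python sketch that checks a number against a range expression, treating square brackets as inclusive and curly brackets as exclusive, with "*" as an open end. The `in_range` helper is hypothetical and covers numeric values only.

```python
import re

# Opening bracket, lower bound, the TO keyword, upper bound, closing bracket.
RANGE = re.compile(r'^([\[{])\s*(\S+)\s+TO\s+(\S+)\s*([\]}])$')

def in_range(expr, value):
    """Check a number against a range such as '[18 TO 30]' or '{65 TO *]'."""
    m = RANGE.match(expr.strip())
    if m is None:
        raise ValueError(f"not a range expression: {expr!r}")
    open_b, low, high, close_b = m.groups()
    if low != "*":
        bound = float(low)
        # '{' excludes the lower bound itself, '[' includes it.
        if value < bound or (open_b == "{" and value == bound):
            return False
    if high != "*":
        bound = float(high)
        # '}' excludes the upper bound itself, ']' includes it.
        if value > bound or (close_b == "}" and value == bound):
            return False
    return True

print(in_range("[65 TO *]", 65))
print(in_range("{65 TO *]", 65))
```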

 

Sub Query Grouping


 

You can use parentheses to group clauses to form sub queries. This can be very useful if you want to control the boolean logic for a query.

 

To search for documents that contain "author" and either "stephen" or "king", use the query:

    (stephen OR king) AND author

 

Or, as we previously learned, the preferred form avoiding the boolean logic keywords:

     +(stephen king) +author

 

This eliminates any confusion: author must exist, and at least one of the terms stephen or king must exist.

 

Field Grouping

You can use parentheses to group multiple clauses to a single field.

To search for a title that contains both the word "return" and the phrase "pink panther" use the query:

    title:(+return +"pink panther")
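Field grouping is shorthand for repeating the field name on every clause inside the parentheses. The Python sketch below performs that expansion for simple groups; `distribute_field` is a hypothetical helper that only understands bare terms and quoted phrases with an optional +/- prefix.

```python
import re

# One clause: an optional +/- operator, then a quoted phrase or a bare term.
CLAUSE = re.compile(r'([+-]?)("[^"]*"|\S+)')

def distribute_field(field, grouped):
    """Rewrite the inside of field:( ... ) as one field-qualified clause per term."""
    return " ".join(f"{op}{field}:{body}" for op, body in CLAUSE.findall(grouped))

print(distribute_field("title", '+return +"pink panther"'))
```

The printed query is the long-hand equivalent of the grouped form shown above.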

 

Fuzzy Query

You can perform fuzzy searches based on the Levenshtein distance (edit distance) algorithm. To do a fuzzy search, add the tilde symbol, "~", to the end of a single-word term. For example, to search for a term similar in spelling to "roam", use the fuzzy search:

     roam~

 

This search will find terms like foam and roams. You can also specify the required similarity as a value between 0 and 1; the closer the value is to 1, the more similar a term must be in order to match. For example:

    roam~0.8

 

The default similarity that is used if the parameter is not given is 0.5.
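For readers who have not met the Levenshtein distance before, it is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Here is a standard dynamic-programming version in Python; it shows the distance only and does not attempt to reproduce how the search engine converts that distance into a similarity value.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the first i-1 chars of a and first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]

for candidate in ["foam", "roams", "rome"]:
    print(candidate, edit_distance("roam", candidate))
```

Both "foam" and "roams" are a single edit away from "roam", which is why the fuzzy search above finds them.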

 

Boosted Query

HCI calculates the relevance level of matching documents based on the terms found.

To boost a term, add the caret symbol, "^", with a boost factor (a number) to the end of the term you are searching for. The higher the boost factor, the more weight the term carries in the relevance score.

Boosting lets you control the relevance of a document by boosting the terms it contains. For example, suppose you are searching for:

    stephen king

 

If you want the term "king" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type:

    stephen king^4

 

This will make documents containing the term "king" appear more relevant. You can also boost phrase terms (e.g. "stephen king"), as in this example:

    "stephen king"^4 "author"

 

By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).
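To get a feel for what a boost does to ranking, here is a deliberately simplified toy model in Python. It is not the scoring formula HCI or Lucene uses: the hypothetical `toy_score` function just multiplies each query term's occurrence count by its boost and adds the results, which is enough to show a boosted term pulling its documents upward.

```python
def toy_score(doc_terms, boosts):
    """Sum over query terms of (occurrences in the document) * (boost)."""
    return sum(doc_terms.count(term) * boost for term, boost in boosts.items())

docs = {
    "a": ["stephen", "hawking"],
    "b": ["king", "lear"],
}

plain = {"stephen": 1.0, "king": 1.0}    # stephen king
boosted = {"stephen": 1.0, "king": 4.0}  # stephen king^4

for name, terms in docs.items():
    print(name, toy_score(terms, plain), toy_score(terms, boosted))
```

Without the boost, both documents tie; with `king^4`, document "b" scores four times higher than document "a".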

HCI also provides an index query setting to do this automatically for certain field values across all queries against a specific index.

 

Constant Score Query

A constant score query is like a boosted query, but it produces the same score for every document that matches the query. The score produced is equal to the query boost. The ^= operator is used to turn any query clause into a constant score query. This is desirable when you only care about matches for a particular clause and don't want other relevancy factors to influence the score, such as term frequency (the number of times the term appears in the field) or inverse document frequency (a measure across the whole index of how rare a term is in a field).

 

Example 1:

     (description:blue OR color:blue)^=1.0 text:shoes

 

Example 2:

     (inStock:true text:solr)^=100 native code faceting

 

Proximity Query

A proximity search looks for terms that are within a specific distance from one another.

To perform a proximity search, add the tilde character ~ and a numeric value to the end of a search phrase. For example, to search for "stephen" and "king" within 10 words of each other in a document, use the search:

     "stephen king"~10

 

The distance referred to here is the number of term movements needed to match the specified phrase. In the example above, if "stephen" and "king" were 10 spaces apart in a field, but "king" appeared before "stephen", more than 10 term movements would be required to move the terms together and position "king" to the right of "stephen" with a space in between.

 

Filter Query

A filter query retrieves a set of documents matching a query from the filter cache. Since scores are not cached, all documents that match the filter produce the same score (0 by default). Cached filters will be extremely fast when they are used again in another query.

 

Filter Query Example:

     description:HDTV OR filter(+promotion:tv +promotion_date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY])

 

The power of the filter() syntax is that it may be used anywhere within the query syntax.
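The caching idea behind filter queries can be sketched in a few lines of Python. This is a conceptual illustration only, with a hypothetical `FilterCache` class and an in-memory dictionary standing in for the index; the real filter cache lives inside the search engine and is managed for you.

```python
class FilterCache:
    """Remember which document ids matched each filter so repeats are cheap."""

    def __init__(self, index):
        self.index = index     # doc id -> dict of field values
        self.cache = {}        # filter key -> frozenset of matching doc ids
        self.evaluations = 0   # how many times a filter was actually computed

    def matching_ids(self, key, predicate):
        if key not in self.cache:
            self.evaluations += 1
            self.cache[key] = frozenset(
                doc_id for doc_id, fields in self.index.items() if predicate(fields)
            )
        return self.cache[key]

index = {
    1: {"promotion": "tv", "description": "HDTV"},
    2: {"promotion": "audio", "description": "speakers"},
    3: {"promotion": "tv", "description": "projector"},
}
cache = FilterCache(index)

def is_tv_promotion(fields):
    return fields["promotion"] == "tv"

first = cache.matching_ids("+promotion:tv", is_tv_promotion)
second = cache.matching_ids("+promotion:tv", is_tv_promotion)  # served from the cache
print(sorted(first), cache.evaluations)
```

Note that the cache stores only the set of matching ids, not scores, which mirrors why every document matched by a filter gets the same score.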

 

Summary

You've now learned some powerful query syntax, congratulations!

 

We've only scratched the surface with what's possible using HCI search, so stay tuned for more advanced query topics in upcoming blogs.

 

Here are some additional introductions to the Lucene query language that I've found helpful:

http://yonik.com/solr/query-syntax/

https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

http://www.lucenetutorial.com/lucene-query-syntax.html

 

Thanks for reading,

-Ben
