
Hitachi Content Intelligence


I wanted to learn how to create custom stages, so I threw this together: the code for a simple plugin that replaces all matches of a regex in a field with a string of your choice.

 

It can serve as a template for other plugins that replace the content of a field with a user-specified value.

 

The jar is also provided, in case you need to use it.
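At its core, a stage like this is just standard java.util.regex replacement applied to each value of the configured field. Here is a minimal, self-contained sketch of that logic (the SDK StagePlugin plumbing and configuration handling are omitted, and the field values and pattern below are made up for illustration):

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RegexReplaceSketch {

    // Replace every match of the regex in each field value with the replacement string.
    static List<String> replaceAll(List<String> fieldValues, String regex, String replacement) {
        Pattern pattern = Pattern.compile(regex);
        return fieldValues.stream()
                .map(value -> pattern.matcher(value).replaceAll(replacement))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Example: mask digits in a hypothetical "phone" field.
        System.out.println(replaceAll(List.of("call 555-1234"), "\\d", "#"));  // [call ###-####]
    }
}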

Are you looking for a deeper overview of Hitachi Content Intelligence (and a peek under the covers)? Are you using the system and want to better understand how to optimize for specific use cases? Look no further - this is the blog for you.

 

Content Intelligence is a software solution composed of three distinct bundled layers:

 

First, an embedded services platform leverages the portability of modern container technology, enabling flexible and consistent deployment of the complete solution in physical, virtual, and cloud environments. From there, it adds the ability to cluster, scale, monitor, update, triage, and manage the solution via REST API, CLI, and UI. Controls are provided for scaling, configuring, load balancing, and even repairing specific services. Plugin and service frameworks support easily extending and evolving system capabilities to meet custom use cases using a provided SDK.

 

Next, an advanced content processing engine allows for connecting to data sources of interest and processing the information by categorizing, tagging, and augmenting metadata representing each item found. Deep analysis against raw data streams produces both raw text (enabling keyword search) and additional metadata. Optimized for large scale parallel processing, the included workflow engine can blend structured and unstructured information into a normalized form that is perfect for aggregating data for reports, triggering notifications, and/or building search engine indexes.

 

Finally, a text and metadata search component delivers comprehensive search engine indexing and index management capabilities. Tools are provided for designing, building, tuning, and optimizing search engine indexes. The system allows for scaling and monitoring locally managed indexes and/or registering external indexes to participate in globally federated queries. A full-featured customizable search application is provided - supporting secure access to query results that may be automatically tuned to the needs of specific user groups and use cases.

 

arch.png

 

For a deep dive into the architecture, feature set, and best practices of Content Intelligence, see the attached whitepaper below.

 

-Ben

One of the simplest ways to further optimize a search engine index is to register stopwords.

 

Stopwords are terms that are typically irrelevant in searches, like "a", "and", and "the". Removing these terms while indexing can significantly reduce index size without adversely impacting user query results.

 

Stopwords can affect the index in three ways: relevance, performance, and resource utilization.

 

  • From a relevance perspective, these high-frequency terms tend to throw off the scoring algorithm, and you won't get the best possible matching results if you leave them in. At the same time, if you remove them, you can return bad results when the stopword is actually important. Choose stopwords wisely!

 

  • From a performance perspective, if you don’t specify stopwords, some queries (especially phrase queries) can be very slow by comparison, because more terms are compared to each indexed document.

 

  • From a resource utilization perspective, if you don’t specify stopwords, the index is much larger than if you remove them. Larger indexes require more memory and disk resources.

 

Because they are effectively filtered from the index, stopwords are not considered when matching query terms with index terms. For example, with stopwords {do, me, a, this}, a query for “do me a favor” would match a document containing the phrase “this favor”, leaving "favor" as the only term that actually affects matching.

 

This is typically the desired behavior: the same processing performed at index time is also performed at query time, “normalizing” the user input before it is matched against the index. The best matches get the highest relevancy score and appear higher in query results.

 

However, if literal exact phrases that include these terms are important, fewer stopwords can be better. For example, removing “do” from the stopword list in the example above would cause the phrase query “do me a favor” to NOT match “this favor”, but the query would still match a document containing “do this favor”.

 

The HCI index stopwords file (see "Index > Advanced > stopwords.txt") is used by the HCI_text and HCI_snippet fields. This file is empty by default for newly created indexes in 1.1.X releases, but will be populated with defaults in future releases.  It is highly recommended that you add relevant stopwords to this file prior to indexing!

 

A conservative example English stopword list that can satisfy the majority of use cases would be the following:

 

a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with

 

Example stopword files in other languages are also available in the product. See the "stopwords_<language>.txt" files under the "Index > Advanced > lang" folder in the Admin application. The list above comes from the default English stopwords_en.txt file, which is taken from Lucene's StopAnalyzer.
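To see the effect concretely, here is a small, self-contained Java sketch (not HCI or Solr code, just an illustration) that applies a stopword set to both the indexed text and the query, reproducing the "do me a favor" matching "this favor" behavior described above:

import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordDemo {

    static final Set<String> STOPWORDS = Set.of("do", "me", "a", "this");

    // Lowercase, split on whitespace, and drop stopwords - a crude stand-in for index/query analysis.
    static List<String> analyze(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .filter(term -> !STOPWORDS.contains(term))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("do me a favor"));  // [favor]
        System.out.println(analyze("this favor"));     // [favor] - the same surviving term, so the query matches
    }
}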

 

Happy optimizing!

 

Thanks,

-Ben

One of the design goals of HCI was extensibility.

 

The team wanted to make sure that HCI could connect to any system that you would like to use today or in the future. We also wanted the system to easily adjust to changing technology trends and evolve its capabilities and use cases over time. As a result, HCI is a highly pluggable system. This allows us to upgrade and evolve technologies simply by building new plugins - without requiring updates to HCI itself. It also means that we can continue to provide seamless compatibility with existing systems as we prototype with bleeding edge advancements in content processing and analysis and take advantage of new data stores.


The HCI Plugin SDK allows all end users to extend the capabilities of their HCI system. With the plugin SDK, one can:

  • Build connections to various systems to get data into HCI
  • Support customized processing on the data

 

The plugin SDK is available as a separate download from HCI, and is available on the downloads page. It includes:

  • Multiple levels of documentation, including full interface javadoc and useful README files
  • Helpful utilities to use in plugin logic
  • Example connector and stage plugins
  • A full-featured, offline plugin test harness

 

Looking for simple steps to build an HCI plugin?   Here's the process in a nutshell:

 

Step 1. Read the documentation

 

To make developers' lives easier, the plugin SDK contains lots of inline help:

  • See the top level README.txt file for an overview on HCI and plugins
  • See the examples/EXAMPLES.txt file for instructions on building the example code
  • See the plugin-test/TEST_AND_DEBUG.txt file for plugin test harness test and debugging instructions

 

HTML javadoc is also provided for all plugin interfaces (ConnectorPlugin, StagePlugin) and utility classes in the doc/javadoc folder.  Click on the index.html file in this directory to open it in your web browser.

 

b1.png

 

Step 2. Build the example code

 

Example HCI connector and stage plugins are immediately available for customization. Hacking on these examples is probably the best way to learn the plugin technologies.

 

The following instructions come straight from the HCI EXAMPLES.txt file.

 

To build the example plugins:

1. Unpack the HCI SDK package.  

2. Navigate to the examples directory:
   cd HCI-Plugin-SDK/examples 

3. Create the classes directory:
   mkdir classes 

4. Compile the java files for your plugin:
    javac -cp ../lib/plugin-sdk-<build-number>.jar -d classes/ \
         src/com/hds/ensemble/sdk/examples/connect/*.java \
         src/com/hds/ensemble/sdk/examples/stage/*.java 

5. Copy the plugin resource file:
    cp -R src/META-INF/ classes/ 

6. Create the final jar:
    cd classes && \
    jar -cf ../HCI-example-plugins.jar * && \
    cd ..

This process generates a new plugin jar file named "HCI-example-plugins.jar" that you can test in the plugin-test harness and upload directly to the HCI system to use right away!


Note: The plugin jar file must be a "fat" jar that contains all dependency libraries your plugin uses. HCI provides plugin classloader isolation, allowing different plugins to use different versions of software within the same system. To make this work, every library your plugin uses must be packaged inside the plugin jar file. Make sure that you do NOT include the HCI Plugin SDK library within your plugin jar. The SDK jar should be on the compilation classpath per the example above (a "provided" jar), but should NOT end up within the final plugin jar. This allows the HCI SDK in the product to evolve without breaking your plugin.

 

Step 3. Validate, test, and debug plugin code

 

The plugin SDK comes with a full featured test harness: "plugin-test".  You can use this tool to validate, test, and debug custom HCI plugins.

 

The plugin test harness can operate in one of three modes:

  • Validation mode
  • Test mode
  • Debug mode

 

In validation mode, the harness ensures that the plugin manifest file exists, is well-formed, and that the plugin interfaces have all been correctly implemented. If your plugin validates successfully, then you have correctly edited the manifest file and implemented the required interfaces.


In test mode, custom configuration for each plugin is specified in the plugin-test harness configuration file. The test harness will then utilize each configuration to exercise additional functionality, check for errors, and make recommendations. Use this mode to exercise your plugin logic, and identify any issues.


Debug mode allows for configured plugins to be further utilized in a debuggable environment, allowing plugin logic itself to be checked for errors as methods are executed by the test harness. Executing in debug mode will prompt the user to connect any Java debugger or IDE to the specified port (5903 by default). This allows the user to set breakpoints in their plugin code and debug the logic itself while the tests execute.

 

Plugin Interface Validation

 

To validate a plugin:

  1. Go to the plugin-test directory: 
      cd HCI-Plugin-SDK/plugin-test 

  2. Run the test harness in validation mode:
     ./plugin-test -j <path-to-your-plugin-bundle-name>.jar 

The test harness validates the implementation of your plugin and:

  • Indicates where errors exist
  • Makes recommendations to resolve any issues it identifies
  • Makes best practice recommendations

 

plugin-test output in validation mode:

b3.jpg

 

Once your plugin validates successfully, you're ready to test it out!

 

Plugin Instance Testing

 

Once your plugin has been validated, you can move on to deep testing and analysis of the plugin.

 



To test your plugin, you will first need to configure the plugin within the test harness. This involves:

  • Defining a PluginConfig to use when testing your connector or stage plugin
  • Defining the fields and streams on an inputDocument to use when testing your (stage) plugin


Fortunately, the plugin-test tool makes life easy with an "autoconfigure" option, which allows you to generate a plugin-test configuration for any or all plugins found within the specified plugin jar file.


To generate a configuration file for all plugins in a bundle:

 ./plugin-test -j <path-to-your-plugin-bundle-name>.jar -a [output-config-file] 

To generate a configuration file for only one plugin:

 ./plugin-test -j <path-to-your-plugin-bundle-name>.jar -plugin <plugin name> \
      -a [output-config-file]

This automatically generated configuration file may then be saved and edited in order to fine tune the plugin configuration to be used while running the plugin test harness. The auto-generated config will reflect the plugin-defined default values for all properties.


Custom configuration values for each plugin may be applied. If a plugin requires user input for config properties in the default configuration, these values must be specified in the plugin-test config file before testing the plugin. This is accomplished by adding the proper "value" entries for all default config properties in the plugin configuration.

 

Example PluginConfig Property:

{

  "name": "com.hds.hci.plugins.myplugin.property1",

  "type": "TEXT",

  "userVisibleName": "Field Name",

  "userVisibleDescription": "The name of the field to process",

  "options": [],

  "required": true,

  "value": "" // <--- Add non-empty value here for all required fields

}

 

For stage plugin testing, the inputDocument fields and streams may also be customized in the automatically generated plugin-test configuration file.

 

Example Input Document declaration:

"inputDocument": {

       "fields": {

            // ---> ADD OR MODIFY FIELDS HERE <----

            "HCI_id": [

                "https://testDocument.example.com/testDocument1"

            ],

            "HCI_doc_version": [

                "1"

            ],

            "HCI_displayName": [

                "Test Document"

            ],

            "HCI_URI": [

                "https://testDocument.example.com/testDocument1"

            ]

        },

       "streams": {

          // ---> ADD OR MODIFY STREAMS HERE <----

           "HCI_content": {

               "HCI_local-path": "/tmp/tmp1904411799648787907.tmp"

               // Instructions: To present a stream, use its file path as the value of HCI_local-path above

        }

    }

}

 

Auto-configure and test execution example:

b4.jpg

 

Plugin Instance Debugging

 

As described above, debug mode runs the same tests as test mode, but in a debuggable environment. The test harness prompts you to connect a Java debugger or IDE to the specified port (5903 by default), so you can set breakpoints in your plugin code and step through the logic while the tests execute.

 

The debug mode workflow differs from test mode only as follows:

  • The user sets Java IDE/debugger breakpoints in the methods of their plugin source code
  • The user starts the plugin-test tool using the "-d" option
  • The plugin-test tool prints the debugging port to connect to (5903 by default) and waits for a connection
  • The user attaches their Java IDE/debugger to the specified port (5903 by default)
  • Tests begin to execute and breakpoints will be hit

 

To run a test in debug mode:

 ./plugin-test -j <path-to-your-plugin-bundle-name>.jar -a [output-config-file] -d 

To run a test for a specific plugin in debug mode:

 ./plugin-test -j <path-to-your-plugin-bundle-name>.jar -plugin <plugin name> \
     -a [output-config-file] -d

 

plugin-test waiting for the user to connect the Java IDE to "Remote at port 5903":

b5.jpg

 

After the user connects to the process using the Java IDE (e.g. IntelliJ/Eclipse), tests will begin executing and breakpoints within the plugin source code can be stepped through manually to debug the custom plugin:

b6.jpg


Note that you can build, test, and debug an HCI plugin without ever touching an HCI system!

 

Step 4. Upload to HCI

 

Once your plugin bundle JAR file is ready and tested, you are ready to use it in production.

 

Simply upload your new plugin(s) to the UI by dragging and dropping the jar into the upload pane from any file browser, or click to browse for the file locally.


  b1.jpg

 

b2.jpg


After upload, HCI automatically makes your plugin code available to all system instances, and you can begin using your custom "Data Connections" and processing pipeline "Stages" immediately. Because the HCI UI is dynamically generated based on the plugins available, your plugin will immediately appear in the existing drop-down options.


Hopefully this demonstrates how anyone can easily build, package, and test custom HCI plugins.

 

Thanks for reading!

-Ben

Hitachi Content Intelligence provides a REST API, CLI, and UI in the admin application for the custom management of both authentication and authorization.

 

Let's dive right into how this works...

 

Identity Providers

 

First, administrators register Identity Providers with the HCI system by selecting and configuring any of the available plugin implementations.

 

Currently, “Active Directory”, “LDAP compatible”, “OpenLDAP”, and “389 Directory Server” plugins are available:

plugins.png

 

In order to integrate with other identity providers such as Keystone, IAM, Google, or Facebook, an "Identity Provider" plugin could be produced for each. Each plugin requires different configuration settings, which the UI displays dynamically.

 

Adding an “Active Directory” Identity Provider  (Admin UI > System Configuration > Security > Identity Providers):

2.jpg

 

Listing configured Identity Providers:

1.jpg

 

Groups

 

Next, you may use these Identity Providers to discover and map Groups into HCI.

 

Registering a Group from the “Active Directory” Identity Provider:

3.jpg

 

Listing all registered groups:

4.jpg

 

Roles

 

Groups may be assigned one or more Roles. Roles are groups of one or more Permissions. Each HCI service can register a custom set of permissions that can be enforced by the system. Administrators combine permissions into custom roles they would like to use for a specific application, and assign one or more of those roles to the registered groups.

 

Creating the “Admin” role:

5.jpg

 

Editing permissions for the “Admin” role:

6.jpg

 

Group assigned the “Admin” Role, effectively giving all permissions to that group:

7.jpg

 

Only groups that have been configured with specific roles will have the permissions required to access the corresponding set of services/APIs within the system. When permissions are granted to a user, the corresponding areas of the UI become available and REST API requests are allowed. When permissions are disabled for a user, the admin UI dynamically removes those sections of the UI to prevent them from being accessed, and any REST API or CLI requests for those services fail with an error.

 

Summary

 

When logging into any HCI application, users may choose the security realm to utilize, which will use the corresponding identity provider for authentication. Each security realm name associated with each Identity Provider is chosen by the administrator:

 

8.jpg

 

The application will then:

  • Authenticate the user against the selected Identity Provider
  • If successful, determine what group(s) that user belongs to
  • Given group membership, determine which roles and permissions have been assigned to that user
  • Enforce that the user can access the services they have been granted permission to access, and cannot access services for which they have not been granted access.

 

I hope this demonstrates HCI's ability to easily integrate with existing customer directory services and manage fully customizable roles for any given organization and any particular application.

 

Thanks for reading!

-Ben

As promised, I've posted the two example workflows that I used in the Top Gun HCI Workshop on May 24th.

The first example bundle can be found here: Pictures with GPS Bundle.bundle.

The second example bundle can be found here: DICOM Bundle.bundle.

 

Note: These bundles include data connections that point to HCP namespaces that you DON'T have access to. So I copied the namespaces to two AWS S3 buckets:

https://s3.amazonaws.com/topgun-dicom/

https://s3.amazonaws.com/topgun-pictures/

As an exercise, see if you can replace the HCP data connection with an AWS S3 connection. These buckets are in US East (in case you were wondering) and are read only to Everyone.

 

You will have to wait to use the DICOM bundle because the DICOM plugin is not commercially available yet (I got a sneak preview from Engineering). Unless, of course, you choose to write your own DICOM plugin...

 

Happy to get feedback and to see how folks might improve on these workflows. Let's start a groundswell of sharing!

 

Good luck!

 

Jonathan

Ben Isherwood

HCI: Querying Indexes

Posted by Ben Isherwood Apr 21, 2017

Blog18_1.png

 

I'm sure that nearly everyone has queried a search engine before - typically by specifying one or more keywords of interest.  But what happens when you begin to add other metadata criteria to your searches? Anyone who has fine tuned a product search on Amazon.com (e.g. by size or color) knows that adding metadata values to your query can both improve relevancy and enable advanced data exploration.

 

As a quick example, try searching Google for keyword "HCI" ...  Didn't find the result you wanted?  Try searching instead with the input "HCI site:hds.com". Much more relevant results!

 

With HCI, you are always building an index of searchable metadata fields. You have full control over which terms end up in which fields for each indexed document, allowing you to provide the exact search experience your users want.

 

Let's dive into the details of how queries work using the Lucene query language:

 

Basic Queries

The simplest form of query in HCI is known as a "basic" or "term" query. These are the types of queries you might typically run in search engines such as Google. They consist of one or more field names and the values you are searching for.

 

The query language allows you to specify the field(s) and value(s) you want to match, like:

     <field>:<value>

 

You are always searching for a value within a specified indexed field name. When you specify plain keywords without a field prefix, the system uses the configured default field. In HCI, the default field searched (when no field name is specified) is "HCI_text", which contains all indexed keywords from the document.

 

So, for the following keyword query:

     dog

 

You are actually querying for the following:

     HCI_text:dog

 

This is how your keywords are compared to indexed documents, by matching the values of the fields in each.

 

Wildcards

 

A query for "all" results in the index can be specified using wildcards (all fields and all values):

    *:*

 

A query for all results where the field "author" is defined in the document would be as follows:

    author:*

 

A query for all results where the author starts with "Stephen" would then be:

   author:Stephen*

 

Note that wildcards may appear only in the middle or at the end of a query clause (e.g., "author:*tephen" is not valid, but "author:Steph*King" is).

 

You can also wildcard a specific single character using the "?" syntax:

    author:S?ephen*

 

Phrase Query

Phrase queries are used to match multiple terms that should be found next to one another in sequence:

      author:"Stephen King"

 

This query matches authors containing exactly the terms "Stephen King", but not "Stephen R. King", "King, Stephen", or "Stephen Kingston".

 

Sloppy Phrase Query

If you would like your results to match "Stephen J. King" or "King, Stephen", you can use a "sloppy phrase query". A phrase query can be made sloppy by specifying, after the "~" character, the number of term position edits (extra inline keyword moves) allowed for a match:

author:"Stephen King"~1 

This sloppy query would match "Stephen J. King", because one word hop is enough to produce a phrase match. It also matches "King, Stephen", because one hop (or edit) switches the two values. However, it would not match "Stephen 'the author' King", because that would require 2 word hops to match the phrase. It would match if you used ~2 instead.

 

This count of term edits (moves) required to make a match is called the slop factor.

 

Sloppy phrase matches based on slop values for phrase query "red-faced politician":

 

Blog18_2.png

 

 

Boolean Query

A boolean query is any query containing multiple keywords or clauses, like:

     HCI_text:foo  +Content_Type:”text/html”

 

A clause may be OPTIONAL (relevancy ranked), REQUIRED ( + ), or PROHIBITED ( - ):

  • OPTIONAL : The default (no operator specified). Results will be returned in relevancy ranked order where matches get higher boosts.
  • REQUIRED  ( + ) : When specified, this clause is required to match, and only exact document matches are returned.
  • PROHIBITED ( - ) : When specified, this clause is required NOT to match, and any exact document matches are not returned.

 

Search for author "Stephen King", MUST contain keyword "Shawshank" and MUST NOT have category "movie":

      +author:"Stephen King" +Shawshank -category:movie

 

Search for only results mentioning keyword "dog" and boost documents with keywords "german" and "shepherd":

     +dog german shepherd

 

AND, OR, and NOT

Note that you may use "AND", "OR", and "NOT" between clauses. You may also use their symbol equivalents (&&, ||, and !). Care must be taken when using this syntax, as it does not apply traditional boolean logic to each term clause:

  • AND becomes a "+" operator, denoting REQUIRED clauses
  • OR becomes no operator, denoting OPTIONAL clauses
  • NOT becomes a "-" operator, denoting PROHIBITED clauses

 

So this query:

     (foo AND bar) OR dog NOT bear

 

Is interpreted as the following:

    +foo +bar dog -bear

 

Example: None of these queries will produce equivalent results

     banana AND apple OR orange

     banana AND (apple OR orange)

     (banana AND apple) OR orange

 

Why? Because they evaluate to 3 different queries:

    +banana apple orange

    +banana +apple +orange

    +banana +apple orange

 

Therefore it is strongly recommended to avoid the AND, OR, and NOT operators in general. Just use "+" and "-" around clauses where they are needed!

 

Range Query

A range query lets you search for documents with values between 2 boundaries. Range queries work with numeric fields, date fields, and even string fields.

 

Find all documents with an age field whose values are between 18 and 30:

    age:[18 TO 30]

 

Find all documents with ages 65 and older (age >= 65):

     age:[65 TO *]

 

For string/text fields, you can also find all words found alphabetically between apple and banana:

    name:[apple TO banana]

 

Square brackets indicate "inclusive" bounds, while curly brackets indicate "exclusive" bounds. To find all documents with ages strictly greater than 65 (age > 65):

    age:{65 TO *]

 

Bounds can be mixed. The following finds all documents with an age strictly between 18 and 20 (i.e., 19), or with an age of 24 or 25 (24 included, 26 excluded):

    age:{18 TO 20} age:[24 TO 26}

 

Sub Query Grouping

Blog19_1.png

 

You can use parentheses to group clauses to form sub queries. This can be very useful if you want to control the boolean logic for a query.

 

To search for either "stephen" or "king", and require "author", use the query:

    (stephen OR king) AND author

 

Or, as we previously learned, the preferred form avoiding the boolean logic keywords:

     (stephen king) +author

 

This eliminates any confusion and makes sure that "author" must exist while either term "stephen" or "king" may exist.

 

Field Grouping

You can use parentheses to group multiple clauses to a single field.

To search for a title that contains both the word "return" and the phrase "pink panther" use the query:

    title:(+return +"pink panther")

 

Fuzzy Query

You can perform fuzzy searches based on the Levenshtein distance (edit distance) algorithm. To do a fuzzy search, use the tilde ("~") symbol at the end of a single-word term. For example, to search for a term similar in spelling to "roam", use the fuzzy search:

     roam~

 

This search will find terms like "foam" and "roams". You can also specify the required similarity. The value is between 0 and 1; the closer the value is to 1, the more similar a term must be in order to match. For example:

    roam~0.8

 

The default similarity that is used if the parameter is not given is 0.5.

 

Boosted Query

HCI calculates the relevance level of matching documents based on the terms found.

To boost a term use the caret, "^", symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.

Boosting allows you to control the relevance of a document by boosting one of its terms. For example, suppose you are searching for:

    stephen king

 

If you want the term "king" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type:

    stephen king^4

 

This will make documents containing the term "king" appear more relevant. You can also boost phrase terms (e.g. "stephen king"), as in the example:

    "stephen king"^4 "author"

 

By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).

HCI also provides an index query setting to do this automatically for certain field values across all queries against a specific index.

 

Constant Score Query

A constant score query is like a boosted query, but it produces the same score for every document that matches the query. The score produced is equal to the query boost. The ^= operator is used to turn any query clause into a constant score query.  This is desirable when you only care about matches for a particular clause and don't want other relevancy factors such as term frequency (the number of times the term appears in the field) or inverse document frequency (a measure across the whole index for how rare a term is in a field).

 

Example 1:

     (description:blue OR color:blue)^=1.0 text:shoes

 

Example 2:

     (inStock:true text:solr)^=100 native code faceting

 

Proximity Query

A proximity search looks for terms that are within a specific distance from one another.

To perform a proximity search, add the tilde character ~ and a numeric value to the end of a search phrase. For example, to search for a "stephen" and "king" within 10 words of each other in a document, use the search:

     "stephen king"~10

 

The distance referred to here is the number of term moves needed to match the specified phrase. In the example above, if "stephen" and "king" were 10 positions apart in a field but appeared in the reverse order, more than 10 term moves would be required to bring them together in the order "stephen king", so the query would not match.

 

Filter Query

A filter query retrieves a set of documents matching a query from the filter cache. Since scores are not cached, all documents that match the filter produce the same score (0 by default). Cached filters will be extremely fast when they are used again in another query.

 

Filter Query Example:

     description:HDTV OR filter(+promotion:tv +promotion_date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY])

 

The power of the filter() syntax is that it may be used anywhere within the query syntax.

 

Summary

You've now learned some powerful query syntax, congratulations!

 

We've only scratched the surface with what's possible using HCI search, so stay tuned for more advanced query topics in upcoming blogs.

 

Here are some additional introductions to the Lucene query language that I've found helpful:

http://yonik.com/solr/query-syntax/

https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

http://www.lucenetutorial.com/lucene-query-syntax.html

 

Thanks for reading,

-Ben

Many people that we talk to about Hitachi Content Intelligence are curious about why we focus so intently on building out tools for content processing and data analysis. Isn't this a search tool?

It turns out that great search experiences are directly driven by the quality of the processing performed on the content. This makes processing, normalizing, and categorizing content the most important aspect of any (great) search technology. It also turns out that most search tools are surprisingly deficient in these areas! This has been a great opportunity for the HCI team to work on filling these gaps.

 

A while ago, I was asked whether HCI could be used to process and search HDI data that was stored (obfuscated) in HCP namespaces. Here's what I did to find out the answer...

 

Step 1: Add an HCP data source

I started out testing HCI against a namespace in HCP that was backing an HDI file system.

The goal was to see how search would work with HDI data and to determine how difficult this task would be.

 

First, we connected HCI to the HCP namespace containing the HDI data:

Blog1_1.jpg

Configuring a data connection is all that's needed to begin processing data in HCI.

 

Step 2. Auto-generate a Content Class

HDI file paths in the HCP namespace are obfuscated, so it's impossible to search the contents of these namespaces directly.

However, because HDI stores both the full content and HDI custom metadata in HCP, we can easily take advantage of this.

Using example custom metadata from one of the files in the namespace, HCI was used to auto-generate an "HDI custom metadata" content class. This content class could be used to pull the metadata from the XML file into the pipeline engine for further processing.


HDI auto-generated content class:

Blog1_2.jpg

 

Step 3: Create and test an HDI processing pipeline

This step required some effort... and resulted in the addition of some new built-in plugins.

After cloning the default pipeline, I added a content class extraction stage to the pipeline for reading the "default" custom metadata annotation: “HCP_customMetadata_default”. This enabled me to pull the XML into the system, extract all of the fields, and present them to the processing pipeline.

After browsing for a file on the data source and running a pipeline "test" operation against it, I quickly found that the file paths found in the "default" annotation were URL encoded - making a search against these fields difficult. I built a URL Encoder/Decoder stage to decode them, uploaded the plugin, and started using it in the pipeline immediately. Now these fields were clearly visible!
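The decoding itself is nothing exotic - it is essentially standard URL (percent-encoding) decoding applied to each configured field value. A minimal sketch using java.net.URLDecoder (illustrative only, with a made-up path; this is not the stage source):

import java.net.URLDecoder;

public class UrlDecodeSketch {
    public static void main(String[] args) throws Exception {
        String encoded = "/fs01/My%20Documents/report%202017.pdf";   // hypothetical HDI path value
        // Decode percent-encoded characters so the path becomes readable and searchable.
        String decoded = URLDecoder.decode(encoded, "UTF-8");
        System.out.println(decoded);  // /fs01/My Documents/report 2017.pdf
    }
}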

URL decode any encoded HCI metadata fields:

Blog1_3.jpg


I noticed that there were a lot of UNIX timestamps in the metadata values on these fields. The date conversion stage didn't (yet) have support to normalize UNIX timestamps into standard date fields. Adding support in the stage for these resolved that issue.
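Conceptually, the conversion just interprets the numeric value as seconds since the epoch and renders it as an ISO-8601 date string that the index can treat as a date field. A quick sketch with java.time (illustrative, not the stage implementation):

import java.time.Instant;

public class TimestampSketch {
    public static void main(String[] args) {
        long unixSeconds = 1492790400L;                              // example UNIX timestamp (seconds since epoch)
        String isoDate = Instant.ofEpochSecond(unixSeconds).toString();
        System.out.println(isoDate);                                 // 2017-04-21T16:00:00Z
    }
}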

 

Normalizing HCI metadata date fields:

Blog1_4.jpg

 

In some documents, the "HDI_file_path" metadata field wasn’t available, so I also configured the pipeline to use the "HCI_DisplayName" in place of the HDI_file_path metadata for those specific documents.

 

Step 4: Click to build an optimized index

After running the workflow, HCI automatically discovered all sorts of metadata from the files in the namespace.

Using the "Workflow > Task > Discoveries" UI, I created and configured an index from the field recommendations with a single click!


Auto-generated index schema:

Blog1_5.jpg

 

Step 5: Customize the search experience

HCI allows you to quickly customize the search experience for specific sets of users.

I customized a results display configuration in the index query setting so that:

  • The value of the HDI file path field is used as the title of each document
  • The link takes you directly to the object in the HCP system
  • "Snippet" text below each result contains raw extracted text from the document
  • An expandable "HDI Metadata" panel displays document metadata with each result


Customizing search result display:

Blog1_6.jpg


Built-in default HCI autocomplete support uses phrases within the content itself:

Blog1_7.jpg


In the index "Public" query setting, I enabled support for range queries on time fields, file names, etc.:

Blog1_8.jpg


Detailed document metadata can also be exposed within each search result:

Blog1_9.jpg


Minutes later – full-featured HDI file system search, just by crawling the HCP namespace! And we implemented two new features in the process: a new "URL Encoder/Decoder" stage and UNIX timestamp support in the "Data Conversion" stage.

 

Hopefully you've learned how HCI content processing technologies can accelerate the development of full featured search and categorization.

 

Thanks for reading!

-Ben

Ben Isherwood

HCI Plugins: Geocoding

Posted by Ben Isherwood Apr 14, 2017

Many of us are aware of the photo image geotagging capabilities of our smart phones. It's how social networks can report to the masses where we were when we posted our latest vacation slideshows. This process simply identifies your location at the time the photo was taken by leveraging the global positioning systems found in each device. The metadata coordinates of your position are attached to the document for later use.


So how can our systems take advantage of this data for search and analysis? Rather than manually sifting through millions of documents, it's useful to identify all documents related to a specific building, city, state, or country and produce the results on demand. These results may then be further categorized by other metadata found in each document to help you find that exact vacation image you were looking for.


To support the use of geotagged information, HCI provides the Geocoding stage.


Geocoding is really "reverse geotagging": you take the latitude and longitude coordinates attached to Documents as metadata and convert them into the corresponding City, State, Country, or even Timezone values that make sense for your use case.


The stage supports input latitude and longitude fields in the following format:

geo_lat : 42.482119 
geo_long : -71.186761

Note that these fields are automatically extracted by the "Text and Metadata Extraction" stage, so any geotagged metadata in each Document will typically be immediately available for further processing and analysis.

 

Geocoding.png


So, what can you do with this geotagged metadata?


The HCI Geocoding stage uses public location data collected and made available from the GeoNames project to map the latitude and longitude coordinates to the nearest local position on earth.
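Conceptually, this kind of reverse geocoding is a nearest-neighbor lookup against a table of known places. The toy sketch below illustrates the idea with a hard-coded place list (coordinates are approximate) and the haversine distance formula; the actual stage uses the full GeoNames dataset and is not shown here:

import java.util.Comparator;
import java.util.List;

public class NearestPlaceSketch {

    record Place(String city, String state, double lat, double lon) {}

    static final List<Place> PLACES = List.of(
            new Place("Burlington", "MA", 42.5048, -71.1956),
            new Place("Boston", "MA", 42.3601, -71.0589),
            new Place("Waltham", "MA", 42.3765, -71.2356));

    // Great-circle distance in kilometers between two lat/long points (haversine formula).
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 6371.0 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    public static void main(String[] args) {
        double geoLat = 42.482119, geoLong = -71.186761;    // the input fields from the example above
        Place nearest = PLACES.stream()
                .min(Comparator.comparingDouble(p -> distanceKm(geoLat, geoLong, p.lat(), p.lon())))
                .orElseThrow();
        System.out.println(nearest.city() + ", " + nearest.state());  // Burlington, MA
    }
}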


The stage supports the following output configuration values (any combination may be specified):

  • cityField - The name of the output field that should contain the city (defaults to "city", optional)
  • stateProvinceField - The name of the output field that should contain the state/province (defaults to "state", optional)
  • countryField - The name of the output field that should contain the country (defaults to "country", optional)
  • timeZoneField - The name of the output field that should contain the time zone (defaults to "timeZone", optional)

 

Geocoding-config.png

 

HCI pipelines can be configured to automatically extract these additional information fields from Documents given only the geotagged input fields that the camera added to each image. The resulting Documents contain even more metadata that may be utilized by the pipeline or in query requests for further faceting and categorization.

 

Geocoding-result.png

 

Now that we have the metadata, we can index it and leverage it for queries. Instead of just random keywords, we can then specifically target keywords found in documents matching "City:Burlington" and "State:MA" in our queries.


Now you'll know how to easily locate that ONE great St. Lucia skyline photo from an image repository of billions. And you may even run into hundreds just like it in the process!


Thanks for reading,

-Ben

An important part of HCI (Hitachi Content Intelligence) setup and configuration is the user security settings. HCI comes with only one default user, but an HCI deployment will have multiple end users and different sets of users, all of whom need to be made available through identity providers. We think of users in groups by function. The most common groups are HCI System Admins, Developers, and End Users. Each group is given specific security rights for its particular function. The details and exact settings need to be determined with the client; below is our typical setup and how we configure security.

 

HCI System Admin

The HCI system admin is typically responsible for day-to-day activities in the system, including troubleshooting, receiving system alerts, balancing services across nodes, and granting/applying security to users. Depending on the client's security concerns, the admins may or may not be allowed to actually see the data. In our example they have full rights to everything in the system (I am not referring to the default account here).

 

Developer role

 

The developer role is intended for the person who will create the workflows. This includes the data connections, pipelines, and indexes. Developers should also be able to import and export plugins and workflows. Certain sites may have different rules for test versus production environments, but in our experience developers have access to all data manipulation activities.

 

End Users 

 

End users are the people who will be using the processed data in the Search GUI. We typically have several groups of end users, as certain users are allowed access to certain sets and types of data. In our lab we have two end-user groups, so we can validate that one group can see the processed data and one group cannot. The data in our lab system represents documents containing social security numbers. This information is highly sensitive and has legal ramifications if the wrong people access or use it incorrectly.

 

The first step in setting up HCI security is to link the HCI system to an identity provider. We have mostly been working with LDAP, but several types are supported, including 389 Directory Server, Active Directory (LDAP), LDAP Compatible, and OpenLDAP.

 

Go to

 

System Configuration→ Security→ Identity Provider

 

It will ask for your AD information

 

Type= 389 Directory Server, Active Directory (LDAP), LDAP Compatible, or OpenLDAP

 

Identity provider host name= example “dc.hcpdemo.com”

 

Transport security= None, TLS Security or SSL

 

Identity provider host port= this is the Network Port # to communicate with the Identity provider 

 

User name & Password=A user that has admin rights to pull back the AD groups.

 

Domain= Example “HCP.demo”

 

Search Base DN= Example “dc=hcpdemo, dc=com”

 

Default Domain name= Example “hcp.demo.com” (this field is optional, but should be used; if it is left blank, users will have to include the domain when logging into HCI, e.g. tmyers@hcp.demo.com instead of just tmyers).

 

Once that is completed, hit Test, and then Update if everything is correct.

 

Once that is done we will need to go to

 

System Configuration →Security →Groups

 

At this point you can use Discover Groups and see all the AD groups in the system.

 

System configuration →Security →Roles

 

Roles do not come preconfigured in HCI. At the beginning I mentioned an HCI System Admin, Developer, and End User role. When you create a role, you assign permissions to it. HCI has very granular security permissions; if you look at Content Classes, for example, you will see the permission is subdivided into Create, Delete, Update, and Read.

 

For the HCI admin role we typically grant all permissions, with or without the ability to query, depending on whether the admins are allowed to see the data. Some sites also create a separate admin role that can create or edit security.

 

For our developer role we typically choose the features around data manipulation, i.e. workflows, data connections, pipelines, and indexes.

 

For end users we start by granting the group rights to search/query the data. We can also tie the index to individual roles as one way of limiting access to view the data. This is a two-step process, described below.

 

When you are finished creating roles and assigning permissions to them, you will need to link each role to a group. This is done in System Configuration → Security → Groups.

 

Enabling data security is a two-part process.

 

 

First enable security on the index

 

Workflows  →Indexes →  Example Index  →Query Settings

 

Create a query setting; under Access Control, set "Enforce document security" to Yes. Once this is saved, enable this query setting and disable the Public one. After security is applied to the index, we need to grant our end-user groups/roles access to that index.

 

Link Index to the roles

Go back to System Configuration → Security → Groups.

Choose your group and edit it; you will see an option for indexes. Assign the index to the appropriate roles. At this point you should be able to log into the search screen and see the data assigned or restricted as appropriate.

HCI pipelines provide an easy-to-use mechanism for analyzing, normalizing, and transforming data.


But how do I know what additional stages I need to add to my pipeline to process a set of data?


HCI introduces a "pipeline test" tool. This tool allows you to browse a data source, select any file, and process that file using the pipeline. The pipeline test UI shows you the values of document metadata before entering a processing stage and compares them with the metadata that exists after the stage. This lets you see not only exactly how each stage processes a document in the pipeline, but also which metadata values are already there.

 

Workflow pipeline test tool, displaying the new metadata added by the MIME Type Detection stage:

Blog3_1.jpg

 

There are many useful built-in stages to utilize when testing a pipeline. Let's take a look at a few now...

 

Snippet Extraction

 

First, let's talk about the Snippet Extraction stage.

One convenient aspect of the snippet extraction stage is that it supports pulling raw content from a data stream. You can configure it to pull raw data into a metadata field of your choice, letting you visualize the content of that stream.


For example, this stage has helped to resolve issues in which content class extraction was failing.

The content class extraction stage in the pipeline below was expecting to read raw XML data from the HCI_content stream, but it wasn't working. The stage was configured correctly with the correct source stream name, so why didn't it work?


Why is the content class stage not extracting metadata? There are no changes!

Blog3_2.jpg

 

To figure this out, we first added a Snippet Extraction stage to the pipeline before the Content Class Extraction stage, and configured it to read data from the stream and store it in a "$TestContents" field.


Note: the "$" prefix can be used to name fields that should never be indexed, but that may be used for debugging or stage to stage communication.


After running the pipeline test, voila! The content dump indicates that the stream attempting to be processed was not XML, but raw text! Looks like we configured the stage to process the raw "HCI_content" stream instead of the custom metadata stream we should have used: "HCP_customMetadata_default"...

 

Indicates that stream contains text, instead of the expected XML:

Blog3_3.jpg

 

Fixing the content class stage to point to the correct XML content stream name ("HCP_customMetadata_default") resolved the issue, allowing the extraction to work as expected:

Blog3_4.jpg

 

Snippet extraction may be used whenever you need to gain insight into what content is actually inside the data streams you are processing.

 

 

Reject Documents

 

Let's look at yet another stage that is extremely useful when building and testing pipelines: "Reject Documents".


The Reject Documents stage allows you to cause any Document to immediately fail processing with a custom error message of your choosing.


Adding Reject Documents to the pipeline:

Blog12_1.png

 

Document failures generated as a result of the Reject Documents stage look like any other document failure, and are reported exactly the same way. The workflow will halt all further processing of these rejected documents, but will list them for users to investigate further. This stage can therefore serve a similar purpose to the assert statements found in many programming languages.


The stage allows pipeline creators to specify "required" criteria for a Document to be further processed, allowing for validation of specific document conditions.


Configure a "Reject Document" stage by specifying a custom message:

Blog12_2.png

 

Consider a scenario in which you (the pipeline designer) expect all Documents to have a stream named "HCI_text" at a specific point in the pipeline. Since all further processing depends on this condition, you can introduce a "Reject" stage to enforce this.

Blog12_3.png

 

If any Document enters the pipeline at this position WITHOUT a stream named "HCI_text", that Document will fail processing and result in a document failure.


This behavior can be invaluable in identifying which documents in your data set do not conform to specific processing criteria. You can use this information to either process those failing documents further, or update the pipeline to handle them in special ways. In this specific case, you can add an additional "Text and Metadata Extraction" stage to generate the expected "HCI_text" stream in the case that it does not already exist.


This stage is also useful for "pausing" the expansion of documents such as CSV, log, PST, ZIP, and TAR files. If you'd like to halt processing to see how a pipeline is handling these sub-documents, you can add a "Reject Documents" stage at any point in the pipeline, optionally conditioned on a specific file. Using the pipeline test stage diff tools, you can then determine how these expanded Documents were handled by the preceding pipeline stages and adjust pipeline logic and stages accordingly.

 

 

Conclusion


As always, use pipeline testing with example documents from your data set for analysis. Even just a few minutes testing example documents can lead to vast improvements in index query performance, relevancy, and accuracy.

 

More on additional stages that can be used to debug pipelines in future blogs.

Thanks for reading!

-Ben

Traditional Databases

In the enterprise, systems requiring fast lookup of many results matching a query (especially at large scale) have typically required a database. These database systems model data as rows of table data, in which one or more fields form a primary key that uniquely identifies each row.

 

Consider the following set of keyword-containing Documents:

Doc1: The quick brown fox jumped over the lazy dog 
Doc2: Quick brown fox leap over lazy dogs in summer
Doc3: The fox quickly jumped over the bridge


A traditional database would typically store the information about each document based on the unique ID of each document:

id       keywords 
----------------------------------------------------------
Doc1   | quick brown fox jump over lazy dog
Doc2   | quick brown fox leap over lazy dog summer
Doc3   | fox quick jump over bridge
----------------------------------------------------------

 

Performance is gained by querying against the primary key directly, or by building efficient "indexes" for traversing these database tables by other criteria. The goal for efficient querying is to limit the database traversal to only the possible subset of rows in the table that could actually satisfy your query.

 

Search engines, on the other hand, are optimized for efficient, ranked, keyword-based queries across billions of documents in milliseconds. Because queries against these indexes typically cannot be predicted ahead of time, purpose-built indexes cannot be defined in advance. These search engines instead make heavy use of the inverted index.

 

Inverted Index

Instead of traditional forward indexes, search engines use an inverted index:

Term      Doc1     Doc2    Doc3 
---------------------------------
bridge  |       |       |   X   |
brown   |   X   |   X   |       |
dog     |   X   |   X   |       |
fox     |   X   |   X   |   X   |
jump    |   X   |       |   X   |
lazy    |   X   |   X   |       |
leap    |       |   X   |       |
over    |   X   |   X   |   X   |
quick   |   X   |   X   |   X   |
summer  |       |   X   |       |
---------------------------------

With the inverted index, queries can be resolved by jumping to the searched word (via random access) to deliver query results quickly. The resulting data structure is very similar to a hash map, providing the same performance benefits.


If the user queries for keywords "lazy" and "summer", the inverted index returns "Doc1, Doc2" for the first term and "Doc2" for the second. Because Doc2 appears in more of these result lists, it gains a higher relevancy ranking. The query results are then returned as "Doc2" followed by "Doc1".
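Here is a minimal Java sketch of this structure, building the inverted index for the three documents above and ranking a "lazy summer" query by how many query terms each document matches (a toy illustration of the idea, not the search engine's actual scoring):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class InvertedIndexSketch {

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
                "Doc1", "quick brown fox jump over lazy dog",
                "Doc2", "quick brown fox leap over lazy dog summer",
                "Doc3", "fox quick jump over bridge");

        // Build the inverted index: term -> set of documents containing that term.
        Map<String, Set<String>> index = new HashMap<>();
        docs.forEach((id, keywords) -> {
            for (String term : keywords.split("\\s+")) {
                index.computeIfAbsent(term, t -> new LinkedHashSet<>()).add(id);
            }
        });

        // Query for "lazy summer": count how many query terms each document matches.
        Map<String, Integer> hits = new HashMap<>();
        for (String term : List.of("lazy", "summer")) {
            for (String id : index.getOrDefault(term, Set.of())) {
                hits.merge(id, 1, Integer::sum);
            }
        }

        // Rank by match count: Doc2 matches both terms, Doc1 matches only one.
        List<String> ranked = new ArrayList<>(hits.keySet());
        ranked.sort((a, b) -> hits.get(b) - hits.get(a));   // higher match count first
        System.out.println(ranked);                          // [Doc2, Doc1]
    }
}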

 

Search Index vs. Database

If your use case allows a person to type words into a search box to initiate any sort of processing, you want a search engine rather than a database.

 

Databases and Search Indexes have complementary strengths and weaknesses. SQL supports very simple wildcard-based text search with some simple normalization like matching upper case to lower case. The problem is that these are full table scans. In search engines, all searchable words are stored in the inverted index, which searches orders of magnitude faster.


From a database perspective, a search index can be thought of as one DB table with very fast lookups and interesting enhancements for text search. This index is relatively expensive in space and creation time. The search engine typically wraps this index with a full-featured front end, providing these additions:

  • Schema design and text processing features
  • Clean deployment as a web service for indexing and searching
  • Convenient scalability across multiple servers
  • Learning curve & adoption improvement of ~2 orders of magnitude

 

 

Index Processing

Search engines generally perform document processing on full-content keyword streams as they are ingested into the index. These analyzers generally provide three forms of index-driven document processing:

  • Stopword removal - Stop words (terms that do not improve relevance for document discovery, or that appear frequently) are removed from the data set at both index and query time in order to maximize query matches and improve efficiency and relevancy of the index data
  • Stemming/Lemmatization - Words may be stored in the search engine index in their normalized form to maximize query discovery. For example, "walk" is the base form for word "walking", "walked", and "walker" and can be stored instead of those values. On query time, the same process is executed (e.g. "walker" is converted to "walk") and will match a larger number of similarities in the index. Note that this approach is in direct conflict with "exact phrase" query, so steps need to be taken to support one or both types of query.
  • Synonym Matching - When querying for "used car" you may want to hit documents mentioning “passenger vehicle” or “automobile.” If the indexing system associates these alternate terms as synonyms of “car,” the relevant documents are also attached to the index term “car.”
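The sketch below strings these three steps together in plain Java (the suffix stripping and single-entry synonym table are deliberately crude stand-ins for real stemmers and synonym dictionaries, purely to illustrate the flow):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AnalysisChainSketch {

    static final Set<String> STOPWORDS = Set.of("a", "for", "the");
    static final Map<String, String> SYNONYMS = Map.of("automobile", "car");

    // Normalize a keyword stream: lowercase, drop stopwords, crudely stem, then map synonyms.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase().split("\\s+")) {
            if (STOPWORDS.contains(raw)) {
                continue;                                      // stopword removal
            }
            String term = raw.replaceAll("(ing|er)$", "");     // toy stemming: walking/walker -> walk
            terms.add(SYNONYMS.getOrDefault(term, term));      // synonym matching: automobile -> car
        }
        return terms;
    }

    public static void main(String[] args) {
        // The same processing is applied at index time and again at query time.
        System.out.println(analyze("walking the automobile for a walker"));  // [walk, car, walk]
    }
}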


Documents are ingested into the index with stopwords removed & synonyms added:

 

1.jpg


At query time, the same text processing is performed on the input, and rank/relevance is determined accordingly:

 

2.jpg

 

Index Querying

Before populating an inverted index, support for specific types of queries is taken into account by defining index fields of certain types. Each field type can be configured to support a different type of query against that information field.


This process looks like the following:

 

3.gif

 

Here is a demonstration of results for specific query types against the inverted index:

 

Query Type | Description                                                      | Query and Results
-----------|------------------------------------------------------------------|------------------
Simple     | Keyword query                                                    | t1.jpg
Prefix     | Match keywords by prefix                                         | t2.jpg
Boolean    | Match terms that must (AND), must not (NOT), or may (OR) appear  | t3.jpg
Phrase     | Identifies a set of tokens at specific positions in the index    | t4.jpg
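For reference, here is a hedged sketch of how these query types map onto the Lucene/Solr query syntax that underlies the index service. The endpoint URL, collection name, and field name are assumptions made purely for illustration; in practice HCI fronts queries with its own API and search application.

    import requests

    SOLR_SELECT = "http://localhost:8983/solr/myindex/select"  # hypothetical endpoint/collection

    queries = {
        "simple":  "content:summer",               # keyword query
        "prefix":  "content:sum*",                 # match keywords by prefix
        "boolean": "content:(lazy AND summer)",    # terms that must/must not/may appear
        "phrase":  'content:"lazy summer"',        # tokens at adjacent positions
    }

    for name, q in queries.items():
        resp = requests.get(SOLR_SELECT, params={"q": q, "wt": "json"})
        docs = resp.json()["response"]["docs"]
        print(name, [d.get("id") for d in docs])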

 

Conclusion

As the number of documents within a system grows larger and larger, search engine indexes begin to provide benefits over traditional database systems.

 

In a world of Big Data, the ability to discover a specific set of documents for further processing becomes increasingly important. Running a discovery query against a set of inverted indexes can help identify a smaller subset of content for processing, effectively replacing the need to build, optimize, and refine the large, purpose-specific indexes traditionally associated with a database.

 

Thanks for reading,

-Ben

Today's search-based applications quickly push beyond the capabilities of a single machine when they must satisfy both large index sizes and high query volumes. A well-configured, distributed multi-instance index service can provide sub-second search response times across *billions* of indexed documents.


The standard procedure for scaling HCI indexes is:

  • First, maximize single instance index performance by configuration and schema tuning.
  • Next, address availability concerns and absorb high query volume by replicating the index to multiple machines.
  • If the index becomes too large for a single machine, split the index across multiple machines (i.e.,  "shard" the index).
  • Finally, for high query volume and large index size,  replicate each index shard to separate server instances.

 

Summary of index scaling options:


1.jpg
Let's look at how each of these tasks is accomplished using the HCI index service.

 

 

Maximizing Index Performance

 

HCI aims to provide great out-of-the-box performance for typical index and query use cases, but proper tuning for your specific environment can bring significant performance improvements.


There are two major aspects of index performance: indexing and querying. Both can be improved by a properly configured index schema, so it's important to set up your index from the start with performance in mind.

 

Tips:

  • Be sure to choose the proper index and schema configuration up front to avoid completely re-indexing your content, which can be time consuming for large indexes.
  • "Schemaless" indexing can be great for starting out to see what fields can be generated, but ideally your production index should contain no unused or duplicate fields.
  • Use the HCI "Basic" index schema template, and add optimized index fields automatically using the pipeline test and workflow discoveries tools. Your index will be substantially smaller, and indexing and querying will be more efficient.
  • Use the correct field types AND field attributes for the fields you need. Use "omitNorms" whenever you can, and don't mark fields as "stored" unless you need them returned in results.
  • Minimize fields with full content tokenization (text_*) and limit the number of "dynamic" and "copy" fields to the minimum required. Avoid dynamic fields completely for a fully predictable schema.
  • Configure stopwords.txt in the "Index > Advanced" UI to eliminate overly common terms like "a", "the", or "is" from index and query processing - these are just noise and overhead that can easily be avoided. Example configurations are provided for each language in the "lang/stopwords_*.txt" files (a sample is sketched after this list).
  • Keep the Index service's memory "Heap size" below a maximum of 31.5 GB (the default is 1.8 GB). Using a setting of 32 GB or greater will impact JVM optimizations and can result in out-of-memory errors and an increased risk of GC pauses. Therefore, scaling out with more instances is preferred to scaling a single instance beyond these levels.
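As referenced in the stopwords tip above, a stopword file is simply a plain list of terms, one per line. A minimal English sample (illustrative only; the bundled lang/stopwords_*.txt files are more complete) looks like this:

    a
    an
    and
    are
    is
    of
    the
    to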

 

When in doubt, run a workflow (or pipeline/workflow test) and use the "Add Fields to Index" feature on each discovered fields page to automatically configure your indexes for production use.

 

2.png


Tuning HCI index fields and their impact by use case (only select what you need):

 

3.jpg

 

 

Availability and High Query Volume

 

Eventually, your index will grow to a point where a single machine can’t keep up with a given query load. The proper way to handle this situation is to replicate the index to other servers. HCI then automatically load balances query requests across each index service instance, each of which contains a "copy" of the index. Copies are updated over time as the "master" version of the index changes.

 

4.jpg

 

Index replication is accomplished in HCI by editing the "Index Protection Level" of the index from 1 copy (the default) to multiple copies. This count determines the number of index copies to store across all HCI index service instances. HCI automatically balances these copies across the index services running on each HCI instance. You may modify and apply this setting (increasing or decreasing it dynamically) at any time. Additional index copies improve the availability of each search index and distribute the query load, allowing for more concurrent queries without performance degradation.

 

5.jpg

 

Scale by Sharding

 

Eventually, some indexes get so large that a single machine cannot contain them; you will likely run into this in the many-millions-of-documents range. The general solution is to break the index into pieces that are kept on multiple servers.


A single search is then issued to each server, the results are pulled back (typically in parallel), and they are combined into a single result set for the user.
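A rough Python sketch of this idea, with hash-based routing of documents to shards and a scatter-gather query. The shard count, routing rule, and data are illustrative only and are not HCI internals:

    NUM_SHARDS = 3
    shards = [dict() for _ in range(NUM_SHARDS)]  # each shard holds its own slice of the index

    def shard_for(doc_id):
        # Route each document to a shard by hashing its ID
        return hash(doc_id) % NUM_SHARDS

    def index_doc(doc_id, terms):
        shard = shards[shard_for(doc_id)]
        for term in terms:
            shard.setdefault(term, set()).add(doc_id)

    def search(term):
        # Scatter the query to every shard, then gather and merge the results
        results = set()
        for shard in shards:
            results |= shard.get(term, set())
        return sorted(results)

    index_doc("Doc1", ["lazy", "dog"])
    index_doc("Doc2", ["lazy", "summer"])
    print(search("lazy"))  # ['Doc1', 'Doc2']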


  6.jpg

Building a distributed index starts by running the HCI "index" service on multiple HCI instances. You then define the shard count for each individual index you create.


  7.jpg


Index shard count is configured when the index is initially created. If you expect your index to grow very large, increase the default shard count from 1 to a number large enough to balance across the number of index service instances you expect to use for this index. "Over-sharding" up front is a great strategy when you know your index will grow very large. Keep in mind that you will need enough HCI instances running the Index service to satisfy the target shard count; for example, a shard count of 3 would likely require 3 HCI instances running the Index service for peak performance. Keeping multiple shards on a single instance works fine and allows you to grow by balancing shards onto additional instances in the future. However, there is a performance trade-off to hosting multiple shards on a single instance that you can avoid by balancing across more instances up front.


HCI also supports the automatic balancing of shards across all available instances of the index service. Additional instances of the index service may be run on new HCI instances, and a service configuration task may be executed to balance the shards across those instances.

 

8.jpg

 

Availability and Scale

 

When your index is too large for a single machine and your query volume is more than single shards can keep up with, it's time to replicate each shard in your distributed search setup. Using multiple index shards for scale-out, plus replication for availability and high query load support, gives you the best of both worlds. In the image below, there is a "master" index service for each shard and 1-n "slave" shards replicated from each master to other index services in the HCI system. This allows the master to handle updates and optimizations without adversely affecting query handling performance. Query requests are automatically load balanced across the shard slaves, providing both increased query handling capacity and seamless failover to backup copies if a server goes down.

 

 

9.jpg

 

10.jpg

 

Summary

 

There are a number of options and configurations available to scale HCI indexes for improved availability, to satisfy query load requirements, and/or to grow to store billions of indexed documents. As always, the optimal configuration will depend on your specific use case and requirements for your applications. HCI provides the tools to help make index management at large scale as simple and flexible as possible.

 

Thanks for reading,

-Ben

The HCI characterization team has measured the email processing performance of HCI v1.1 and produced a performance white paper documenting the results. The primary objectives of this white paper are to:

  • Summarize HCI performance when processing the Enron data set in terms of objects/sec for various HCI configurations.
  • Demonstrate that the performance scales as CPU cores, memory, and number of HCI instances are scaled.
  • Document the methodology used to modify the workflow settings for performance optimizations.
  • Provide Index and Workflow-Agent service settings and recommendations for an eight-instance configuration.
  • Recommend when to scale up versus scale out an HCI configuration based on performance needs.
  • Compare performance of the regular HCP and HCP MQE data connections.

 

Some of the key findings from the performance results:

  • Performance generally scales linearly, whether instances are scaled up from minimum hardware to recommended hardware or HCI is scaled out with more instances.
  • Scaling out with more instances results in better performance than scaling up the hardware configuration.
  • HCP MQE data connection performance is about 60% better than the regular HCP data connection.
  • Removing unwanted pipeline stages and optimizing the required ones can result in significant performance gains.
  • Modifying the workflow performance task settings can change the memory requirements for the workflow, which should be adjusted accordingly.
  • Ideally, for optimal index performance, an index should have one shard for each HCI instance running the Index service.
  • The Workflow-Agent service should be scaled up to run on all available instances for optimal workflow performance.

 

For more details, please look at the attached white paper.

 

If you have any questions regarding the content in the white paper, please feel free to ask in the comments section below.

 

Thanks,

Nitesh

Empowering users with powerful intelligent search tools can change and improve the way that everyone works.

 

Hitachi Content Intelligence provides a great toolkit for building and optimizing such solutions - but no one starts out as an expert in content processing and search engine index schema management. Don't worry, we're here to help!

 

Here is a checklist of HCI best practices to help you hit the ground running:

 

1.  Connect your source repositories

 

    • Configure your data source(s) in the Workflows > Data Connections UI

    • Run the “test” operation on each to ensure connectivity

    • Resolve any certificate issues up front, by adding them automatically to the trust store when prompted

 

2. Build and test pipeline(s) with example document(s) loaded directly from the data source(s)

 

    • Begin with the default pipeline in the "Workflows > Pipelines" UI  (or clone it!)
    • In the pipeline test UI, click "Select Document" to browse your data connections for documents to test
    • Verify changes made to documents as they are processed by each stage in the pipeline (click "View Results" on each stage to see the changes each stage performed).
    • View the metadata fields available on each document by clicking the "Discovered Fields" tab.
    • Evolve the pipeline to customize desired processing behaviors
    • (Optional)  Define and use Content Classes
      • Extract field data using XML, JSON, and PATTERN (regular expression) matching
      • Blend this data with other metadata fields to improve search quality
      • Re-use these definitions across systems, and enable testing before going into production
    • Leverage the community
      • Upload any custom plugins/packages you want to use via "System Configuration > Plugins" or "Packages".
      • Import workflow bundles containing pipelines, workflows, indexes, and content classes at "Workflows > Import/Export", or export your own workflow bundles to share!
      • Use the context-specific online help by pressing "(?) > Help", or access the HCI community any time using the UI link at "(?) > Community"

 

3. Run (and test) a workflow

 

    • Configure a data source with a representative subset of your data for exploration and issue identification
    • Add the default pipeline to see what processing is possible, or use your own pipelines for custom processing
    • (Optional)  Create and add an index as a workflow output
      • Using a "schemaless" index can help to identify fields and any document failures that could occur for indexing
      • Test end user queries in the index “query” tab and the search console
    • Run the workflow (with or without outputs) to generate a workflow task report
      • Explore the field discoveries, their counts and values
        • This information can be used to build an optimized production index schema with a click!
        • Pipelines should combine fields containing the same information together into a single field (use the “Mapping” stage)
        • Pipelines should drop unnecessary fields and documents to minimize index size & maximize performance (use the “Filter” and “Drop” stages)
        • Normalize date fields with a "Date Conversion" stage, adding the field name to the stage configuration
    • Document failures can be identified here and addressed early through data cleansing or special processing
    • Consider “exploding” any container documents found such as PST, ZIP, email, etc. into individual files on a data source for direct query and linking.
      • Otherwise, search results containing hits of files within a ZIP would simply link back to the ZIP file (not to the individual files contained inside the ZIP) for download.

 

4. Optimize Pipeline Performance

 

    • Check the "Workflow > Task > Performance" report to identify any expensive processing stages, and take steps to reduce this time:
      • Remove any unnecessary stages from the pipeline
      • Reduce stage processing time by introducing additional conditional logic to avoid the processing of any expensive files that are not producing meaningful value
    • Add known file extensions found in your data set to the MIME type detection stage configuration to avoid expensive “deep” content detection and improve performance

 

5.  Start with a new “Basic” index for production

 

    • The default "Schemaless" index template ensures that all fields discovered in the pipeline will get added to the index automatically.
      • This is great for getting started, but can result in bloated, inefficient indexes in production.
      • Dynamic field type "guessing" can sometimes guess wrong, which may result in document indexing failures
    • To avoid these issues completely, create a new index with the “Basic” template in "Workflows > Index Collections > Create Index"
      • This template defines only minimal required fields and eliminates dynamic fields for a controlled, predictable schema
    • Use the "Workflow > Task > Discoveries” tab to select and add specific fields (with optimized type recommendations) to your new production index with a single click. This ensures that you only index the fields you need, and keeps your index size as small as possible.
      • You can also use the pipeline/workflow test “Discovered Fields” UI to do the same

 

6.  Fine tune your index schema

 

    • Locate your index in "Workflows > Index Collections".
    • Remove any unnecessary fields from the schema that will not be used to query against
    • Pay close attention to field attributes marked “HIGH” impact and evaluate if they are necessary
    • Eliminate any variable indexing configurations (e.g. dynamic fields) if not required
    • Backup all your system configurations (pipelines, data sources, index schemas, etc.) to a “package” for safekeeping.

 

7.  Verify your enhanced search experience

 

    • Use the "Open Search App" button in "Workfows > Index Collections" to access the search console.
    • Ensure that queries match the desired behavior and performance characteristics
    • Customize index “query settings” for proper visibility to specific field information in results
    • Leverage the index “query” tab (in addition to the search console) to test the behavior of each query setting
    • Tweak faceting, refinement, field customization, and results display to match your use cases

 

8. Size your production system

 

    • Follow the documented HCI recommendations for specific document count targets and performance goals
    • Your mileage will vary!
      • The only way to be sure that your system can handle your use case is to try it out.
      • Check your test environments for expected index size and index service memory utilization with the given counts.
    • Compare your results to the recommendations and extrapolate accordingly.
      • It can be helpful to identify your largest sized container document
        • Some stages may require all data be loaded into memory for processing (PST Expansion, Email Expansion, MBOX Expansion).
        • Ensure that your configured workflow task “Executor Heap Memory” is sufficient to hold these documents (the default is 1 GB).
          • Check the instance Monitoring page to ensure that you have this memory available, and not allocated to other services.
        • If you run into workflow task “crawler” out-of-memory failures, increase the “Driver Heap Memory” (the default is 1 GB).
    • Decide on your availability needs
      • If you want your index to grow very large...
        • Index Shards
          • HCI allows you to break your index into smaller segments (called "shards") which can be dynamically distributed across the instances in your HCI cluster. This allows your system to grow very large, allowing for balancing shards to new instances if you run out of space or want to improve performance.
          • You set the index shard count when creating your index.
          • At least one shard per index service instance is ideal.
          • Increase shard count for an index to allow your index to grow very large (shards can be balanced to other Index service instances)
            • If your system will ever grow, you will want to over-shard so that extra shards may be seamlessly balanced to other instances in the future. For example, if your index instance count will double over time, double your initial shard count.
      • If you want your index to survive failures...
        • Index Replicas
          • HCI can create backup copies of your index shards, called "replicas", and store them on separate instances - allowing an index to survive a node outage and continue to support queries.
          • To create replicas, increase the “Index Protection Level” in "System Config > Services > Manage Services > Index > Configure" from the default of 1 to 2 in order for an index to survive single node outages. Increase further to create additional copies.
          • Increasing IPL automatically creates replica copies of each of your index shards on other instances, allowing the system to survive a single instance failure at the expense of additional resource utilization. Decreasing IPL will automatically delete replica copies.
          • Each new index copy will increase the resource requirements accordingly, and may require increasing index service instance counts.
      • The default service configuration recommendation for 4+ instance systems is typically 2 or 3 redundant service copies. This allows services to survive a node outage and continue working normally.
        • If you don’t need redundancy, consider scaling some services down to a single instance to free up resources (at the expense of HA)

 

9.       Optimize service distribution

 

    • As a general rule of thumb, you should never allocate more than 80% of system resources on each instance (a worked sizing example follows this list)
      • ~20% of all system resources should be reserved for the operating system
      • ~1 GB of physical RAM should be reserved for the low level system services on WORKER instances
      • ~2 GB of physical RAM should be reserved for the low level system services on MASTER instances
    • Ensure that you have swap space enabled, and have enough to meet your needs (~5-10 GB partition is typical)
    • When possible, run each individual service you want to optimize by itself on dedicated instance(s)
    • The index service works best without any other services running on those instances (including workflow agent)
    • Maximize the RAM allocated to the index service – Solr requires as much RAM as you can spare in order to build & query a large index
      • Be careful that the total memory allocated across services does not meet or exceed the instance's physical RAM!
      • Limit the scaling of any JVM based service to 31.5 GB of RAM or less, to take advantage of JVM optimizations. Beyond this point, it likely makes sense to add instances rather than scaling up a service.
    • Check the Monitoring page
      • Provides detailed load and container metrics rolled up at the cluster, instance, and service levels
      • Can identify if a single instance is constantly busy - move services to instances that are not busy
      • Can identify if a service is utilizing all of its allocated resources - increase those allocations, or move the service to an instance with free resources
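Here is the worked sizing example referenced above; the 64 GB figure is hypothetical and chosen purely for illustration:

    64 GB physical RAM on a WORKER instance
    - ~13 GB   reserved (~20% for the operating system)
    -  ~1 GB   reserved for low-level system services
    = ~50 GB   available to allocate to HCI services
               (enough for one Index service heap at the 31.5 GB ceiling plus other lighter
                services, but not for two such heaps on the same instance)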

 

10.   Test your production indexing

 

    • Check your indexing rate, and measure it against your target rate
    • Revisit stage performance and index schema fields to make further pipeline & index optimizations
    • Normalize any additional field data which did not match expectations
    • Check for any unexpected document failures, and take steps to resolve
    • Check the Monitoring page to help identify instances where heavy processing is occurring, and balance service load accordingly
    • Verify your end user search experience!

 

 

That's it!  You're now an HCI search expert.

 

If you have any other questions, feel free to ask them in the HCI Community!

 

Thanks,

-Ben