
Hitachi Content Intelligence


Do you want more insight into the state of your workflows? Do the workflow metrics not update as frequently as you'd like? Are you interested in finding out the speed of your data connector? Confused about when to use which pipeline execution mode? Would HCI deployed in a virtual environment give the same performance as a physical deployment?


The attached white paper addresses these questions and highlights the improvements and optimizations introduced in version 1.2 to boost the overall performance of Hitachi Content Intelligence.


The white paper:

  • Highlights workflow performance improvements and optimizations introduced in version 1.2 and compares them to previous releases.
  • Summarizes the improved performance results of list-based data connectors.
  • Provides the methodology used to determine pure crawler performance for different data connections.
  • Demonstrates that the reporting of document failures to the metrics service has been drastically improved.
  • Recommends when to use Preprocessing execution mode over Workflow-Agent.
  • Compares the performance of a physical HCI deployment with that of a similarly configured HCI deployed in a virtual environment.



Questions/Feedback? Please use the comments section below.



Before Updating to 1.2.1, please view the following question/answer addressing a known issue if you have updated from a previous version to 1.2.0 and more than one week has elapsed:


Updating from 1.2.0: Failed to initialize UpdateManager




Jon Chinitz

Making an HCI OVF Bigger

Posted by Jon Chinitz Oct 21, 2017

Some of you have asked about increasing the size of the OVF that ships with Hitachi Content Intelligence. The default disk volume today is 50GB. The following quick sheet of instructions will show you how to increase the disk volume.


Step 1: shut down the node.

Step 2: using the vSphere console (or any other method you are comfortable with), navigate to the node's settings and change the size of "Hard Disk 1" (I chose to increase it from 50GB to 100GB):



Step 3: save the edits and restart the node.

Step 4: ssh into the node and display the mounted filesystems. The filesystem we are after is the root filesystem, mounted from /dev/sda3:



Step 5: run the fdisk command specifying the disk device /dev/sda:



Step 6: while in fdisk, (d)elete partition 3, then create a (n)ew partition 3 with the default starting sector and size offered by fdisk. The starting sector is the same one /dev/sda3 had; only the size changes, now extending to the last sector of the disk. The last thing to do is (w)rite the new partition table back to the disk (you can safely ignore the error).

Step 7: reboot the node.

Step 8: ssh back into the node and run xfs_growfs. Be sure to specify the partition /dev/sda3:
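
For reference, the command sequence looks roughly like the following. This is only a sketch: device and partition names, prompts, and sizes will vary with your deployment, and it assumes the root filesystem is XFS on /dev/sda3 as described in the steps above.

     df -h /                  # confirm the root filesystem is on /dev/sda3
     fdisk /dev/sda           # interactively: d, 3 to delete partition 3; n, 3 and accept the
                              # default start/end sectors to recreate it; w to write the table
     reboot
     # after the node comes back up:
     xfs_growfs /dev/sda3     # grow the XFS root filesystem into the new space
     df -h /                  # the root filesystem now shows the larger size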



As you can see the root filesystem has been resized to occupy the new space.



This stage extends the HDI indexing capabilities of Content Intelligence.

It transforms the ACL entries found in the HCP custom metadata left by HDI. Whenever possible, the plugin attempts to create a separate metadata field for each ACL entry and transform the permissions and user/group IDs to readable formats, as seen in the following example:



As an optional step, the plugin can automatically map user/group SIDs to their respective Active Directory names, by providing the parameters shown in the following example:



Alternatively, you can search for a specific user/group by first obtaining its SID in Active Directory and then using that SID for the query.


The plugin cannot transform HDI RIDs in its current version.

When analyzing a collection of data of varying types, the first challenge you'll encounter is how to ensure that your processing tasks can all speak the same consistent language and provide common capabilities.


Can your system easily determine the difference between an email, image, PDF, or call recording? If so, how? Can the system make additional processing decisions automatically based only on the data provided?


Typically, these tasks are performed simply through the generation and evaluation of metadata. The MIME type of a file, for example, can help to determine how it should be processed. Geo-location metadata can help to identify where content was generated. Date/time metadata can identify when a document was created or last accessed. But can the system easily understand how to interact with, augment, and/or repair that metadata? Can it automatically determine the data types of those metadata values to assist in parsing and database entry? How does the system access the raw content streams for further analysis given only metadata? Is the data associated with any additional data streams? How are those accessed? Is the answer different each time?


In Content Intelligence, data is represented from any data source in the form of a data structure called a Document.


A Document is a representation of a specific piece of data and/or its associated metadata. Any data can be represented in this form, providing a normalization mechanism for advanced content processing and metadata analysis.


Documents are made up of any number of fields and/or streams.


A field is any individual metadata key/value pair associated with the data.


For example, a medical image can become a Document that contains field/value pairs such as "Doctor = John Smith" and "Location = City Hospital". These fields serve as the metadata for your files and can be used for general processing and to construct a searchable index. Fields may be (optionally) strongly typed, though all fields can still be evaluated in their native string form. Fields can also have a single value, or multiple values associated with them.


A stream is a pointer to a sequence of raw data bytes that live in another location, outside of the document itself, but that can be accessed on demand.


Streams typically point to larger data files that would be prohibitively expensive to load into memory as Document fields, such as the full text content of a large PDF file. Rather than spending system resources passing this large amount of data through a pipeline, Content Intelligence uses these streams to access data and read it from where it lives on-demand. This is accomplished through the evaluation of stream metadata that is evaluated by the connector to determine which data to pull into the system for streamed processing. These data streams are typically analyzed within the system without requiring the full contents of the stream to be loaded into memory.


Here's a visual example of a Document in Content Intelligence representing a PDF file:



Notice that this Document has a number of metadata fields defined, such as Content_Type and HCI_filename. Processing stages may add, remove, and change these metadata fields to build a self-describing entity. Tagging additional fields onto Documents can direct other processing stages in how they should process this Document to extract additional value.


This Document also has a "streams" section, where it defines two named streams. First, there's the HCI_content stream, which contains the raw bytes of the PDF file. Second (because the file is stored on HCP), we see an additional custom metadata annotation stream named .metapairs, containing additional XML-formatted metadata associated with this Document.


At any time during processing, each individual data stream associated with this Document can be read by name from the original content source. When directed, the system streams the data from the original data connection back to the processing stage that requested it. This allows for tasks such as reading that XML annotation, parsing the information contained, and adding that information to the Document as additional fields for further processing. Like fields, streams can also be added to or removed from the Document on demand, so that other processing stages can easily consume them.


Creating and Updating Documents


Content Intelligence data connectors and processing stages both enable flexible interactions with Documents. See a previous blog post on writing custom plugins for more details.


For example, a custom file system connector may perform a directory listing to identify metadata and create a Document for each file it finds. Each Document would contain fields representing the metadata for that file. Creating a Document is accomplished through the use of the DocumentBuilder, obtained from the PluginCallback object:


     Document emptyDocument = callback.documentBuilder().build();


Adding fields to this existing document is accomplished using the builder as follows:


     DocumentBuilder documentBuilder = callback.documentBuilder().copy(emptyDocument);

     documentBuilder.addMetadata("HCI_id",  StringDocumentFieldValue.builder().setString("/file.pdf").build());

     documentBuilder.addMetadata("HCI_URI",  StringDocumentFieldValue.builder().setString("file://file.pdf").build());

     documentBuilder.addMetadata("category",  StringDocumentFieldValue.builder().setString("Business").build());

     Document myFileDocument =;


This Document now includes required fields "HCI_id", containing the unique identifier of the file on that data source, and "HCI_URI", which has a single value "file://file.pdf" defining how to remotely access it. It also contains a custom field: "category = Business". You can do this with any information you obtain about this Document, effectively building a list of metadata associated with it that can be accessed by other parts of the system easily.


Now, let's allow callers to access the raw data stream from this Document by attaching a stream named "HCI_Content". Because we're only adding a pointer to the file (not actual stream contents), we use the setStreamMetadata method:


     DocumentBuilder documentBuilder = callback.documentBuilder().copy(myFileDocument);

     documentBuilder.setStreamMetadata("HCI_Content",  Collections.emptyMap());

     Document myFileDocumentWithStream =;


Notice that we haven't set any stream metadata for this new stream, and the provided Map is empty. This is because connectors already use the standard "HCI_Content" stream name to represent the raw data for this file. This directs the system to use the HCI_URI field to read the file (e.g. from the local filesystem) and present the stream contents to the caller.


If you have an InputStream, you can also write streams to system managed temp space using setStream:


     DocumentBuilder documentBuilder = callback.documentBuilder().copy(myFileDocument);

     documentBuilder.setStream("xmlAttachment",  Collections.emptyMap(), inputStream);

     Document myFileDocumentWithStream =;


When writing this inputStream to HCI, the system will attach additional stream metadata containing the local temp file path the data was written to. Stream metadata can be used, for example, to store any additional details required to tell the connector how it should read this data when asked. This metadata can tell the system to load the file from the temp directory where it was stored. All temporary streams are deleted automatically when workflows complete.
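
For example, a connector that needs extra hints to re-open a stream later can attach them as stream metadata. The following is only a sketch, assuming stream metadata is a simple map of string keys and values (as suggested by the Collections.emptyMap() usage above); the key names here are hypothetical:

     // Hypothetical stream metadata a connector might store to re-read the data later
     Map<String, String> streamMetadata = new HashMap<>();   // java.util.Map / java.util.HashMap

     streamMetadata.put("containerName", "archive-2017");

     streamMetadata.put("objectKey", "reports/file.pdf");

     documentBuilder.setStreamMetadata("xmlAttachment", streamMetadata);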


Working with Documents


Callers from other processing stages can read  fields and streams from the provided Document as follows:


     // Reading fields

    String category = document.getMetadataValue("category").toString();  

     // Reading streams

     try (InputStream inputStream = callback.openNamedStream(document, streamName)) {

           // Use the inputStream for processing

           // Add additional metadata fields to the Document based on the contents found

     }



Processing Example


Consider a virus detection stage, tasked with reading the content stream of each individual Document, and adding a metadata field to indicate "PASS" or "FAIL". This stage would follow the procedures above to first analyze the contents, and again to add additional metadata to the Document for evaluation by other stages.


     private Document scanForVirus(Document document) {

          DocumentBuilder documentBuilder = callback.documentBuilder().copy(document);

          // Analyze the content stream from this document

          try (InputStream inputStream = callback.openNamedStream(document, "HCI_Content")) {

               // Determine if there's a virus!

               boolean foundVirus = readStreamContentsAndCheckForVirus(inputStream);

               // Add the result to the document

               documentBuilder.addMetadata("VirusFound",  BooleanDocumentFieldValue.builder().setBoolean(foundVirus).build());

          }

          // Return the updated Document for evaluation by later stages

          return;

     }





A subsequent stage could then check the value of VirusFound on incoming Documents, and take steps to quarantine any files on the data sources where the virus was detected.
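
As a rough sketch using only the Document APIs shown above (the "QuarantineAction" field name and value are made up for illustration), such a follow-on stage might look like this:

     private Document flagForQuarantine(Document document) {

          // Read the field added by the virus detection stage
          String virusFound = document.getMetadataValue("VirusFound").toString();

          if (Boolean.parseBoolean(virusFound)) {

               // Tag the Document so a downstream action can quarantine the source file
               DocumentBuilder documentBuilder = callback.documentBuilder().copy(document);

               documentBuilder.addMetadata("QuarantineAction",  StringDocumentFieldValue.builder().setString("QUARANTINE").build());

               return;
          }

          return document;
     }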


This work can be performed without directly interacting with the data sources themselves - just by interacting with the Document representations in the Content Intelligence environment. This eliminates much of the complexity of dealing directly with client SDKs, connection pools, and retry logic, simplifying the development of new processing solutions.


Standardizing on field and stream names (such as HCI_URI and HCI_content) can reduce the custom configuration required on each processing stage by leveraging built-in, out-of-the-box defaults. This can help to eliminate many common configuration mistakes, such as typos in field names, while promoting the re-use of stages.


I hope this demonstrates the flexibility and convenience provided by standardizing on a useful data structure such as the Content Intelligence Document. Whether the data is a tweet, a database row, or an office document, the data can be represented, accessed, analyzed, and augmented in the same consistent way. By using a standard, normalized mechanism for accessing and consuming information, you can quickly generate reusable code that can help to quickly satisfy a number of use cases. Even those you haven't thought of yet...


Thanks for reading!


Are you looking for a deeper overview of Hitachi Content Intelligence (and a peek under the covers)? Are you using the system and want to better understand how to optimize for specific use cases? Look no further - this is the blog for you.


Content Intelligence is a software solution composed of three distinct bundled layers:


First, an embedded services platform leverages the portability of modern container technology, enabling flexible and consistent deployment of the complete solution in physical, virtual, and cloud environments. From there, it adds the ability to cluster, scale, monitor, update, triage, and manage the solution via REST API, CLI, and UI. Controls are provided for scaling, configuring, load balancing, and even repairing specific services. Plugin and service frameworks support easily extending and evolving system capabilities to meet custom use cases using a provided SDK.


Next, an advanced content processing engine allows for connecting to data sources of interest and processing the information by categorizing, tagging, and augmenting metadata representing each item found. Deep analysis against raw data streams produces both raw text (enabling keyword search) and additional metadata. Optimized for large scale parallel processing, the included workflow engine can blend structured and unstructured information into a normalized form that is perfect for aggregating data for reports, triggering notifications, and/or building search engine indexes.


Finally, a text and metadata search component delivers comprehensive search engine indexing and index management capabilities. Tools are provided for designing, building, tuning, and optimizing search engine indexes. The system allows for scaling and monitoring locally managed indexes and/or registering external indexes to participate in globally federated queries. A full-featured customizable search application is provided - supporting secure access to query results that may be automatically tuned to the needs of specific user groups and use cases.




For a deep dive into the architecture, feature set, and best practices of Content Intelligence, see the attached whitepaper below.



One of the simplest ways to further optimize a search engine index is to register stopwords.


Stopwords are terms that are typically irrelevant in searches, like "a", "and", and "the". Removing these terms while indexing can significantly reduce index size without adversely impacting user query results.


Stopwords can affect the index in three ways: relevance, performance, and resource utilization.


  • From a relevance perspective, these high-frequency terms tend to throw off the scoring algorithm, and you won't get the best possible matching results if you leave them in. At the same time, if you remove them, you can return bad results when the stopword is actually important. Choose stopwords wisely!


  • From a performance perspective, if you don’t specify stopwords, some queries (especially phrase queries) can be very slow by comparison, because more terms are compared to each indexed document.


  • From a resource utilization perspective, if you don’t specify stopwords, the index is much larger than if you remove them. Larger indexes require more memory and disk resources.


Because they are effectively filtered from the index, stopwords are not considered when matching query terms with index terms. For example, when using stopwords {do, me, a, this}, a query for “do me a favor” would match a document containing the phrase “this favor”, making "favor" the most important search term impacting matches.


This is typically the desired behavior: the same processing performed at index time is performed at query time to "normalize" the user input and associate it with matches. The best matches get the highest relevancy score and appear higher in query results.


However, if literal exact phrases with these terms included are important, fewer stopwords can be better. For example, removing "do" as a stopword in the example above would cause the phrase query "do me a favor" to NOT match "this favor", but the query would still match a document containing "do this favor".


The HCI index stopwords file (see "Index > Advanced > stopwords.txt") is used by the HCI_text and HCI_snippet fields. This file is empty by default for newly created indexes in 1.1.X releases, but will be populated with defaults in future releases.  It is highly recommended that you add relevant stopwords to this file prior to indexing!


A conservative example English stopword list that can satisfy the majority of use cases would be the following:


a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with

Example stopwords files in different languages are also available in the product. See the "stopword_<country/language>.txt" files under the "Index > Advanced > lang" folder in the Admin application. The above list comes from the default English stopwords_en.txt file, taken from Lucene's StopAnalyzer.


Happy optimizing!




One of the design goals of HCI was extensibility.


The team wanted to make sure that HCI could connect to any system that you would like to use today or in the future. We also wanted the system to easily adjust to changing technology trends and evolve its capabilities and use cases over time. As a result, HCI is a highly pluggable system. This allows us to upgrade and evolve technologies simply by building new plugins - without requiring updates to HCI itself. It also means that we can continue to provide seamless compatibility with existing systems as we prototype with bleeding-edge advancements in content processing/analysis and take advantage of new data stores.

The HCI Plugin SDK allows all end users to extend the capabilities of their HCI system. With the plugin SDK, one can:

  • Build connections to various systems to get data into HCI
  • Support customized processing on the data


The plugin SDK is a separate download from HCI, available on the downloads page. It includes:

  • Multiple levels of documentation, including full interface javadoc and useful README files
  • Helpful utilities to use in plugin logic
  • Example connector and stage plugins
  • A full-featured, offline plugin test harness


Looking for simple steps to build an HCI plugin?   Here's the process in a nutshell:


Step 1. Read the documentation


To make developers' lives easier, the plugin SDK contains lots of inline help:

  • See the top level README.txt file for an overview on HCI and plugins
  • See the examples/EXAMPLES.txt file for instructions on building the example code
  • See the plugin-test/TEST_AND_DEBUG.txt file for plugin test harness test and debugging instructions


HTML javadoc is also provided for all plugin interfaces (ConnectorPlugin, StagePlugin) and utility classes in the doc/javadoc folder.  Click on the index.html file in this directory to open it in your web browser.




Step 2. Build the example code


Example HCI connector and stage plugins are immediately available for customization. Hacking on these examples is probably the best way to learn the plugin technologies.


The following instructions come straight from the HCI EXAMPLES.txt file.


To build the example plugins:

1. Unpack the HCI SDK package.  

2. Navigate to the examples directory:
   cd HCI-Plugin-SDK/examples 

3. Create the classes directory:
   mkdir classes 

4. Compile the java files for your plugin:
    javac -cp ../lib/plugin-sdk-<build-number>.jar -d classes/ \
         src/com/hds/ensemble/sdk/examples/connect/*.java \

5. Copy the plugin resource file:
    cp -R src/META-INF/ classes/ 

6. Create the final jar:
    cd classes && \
    jar -cf ../HCI-example-plugins.jar * && \
    cd ..

This process generates a new plugin jar file named "HCI-example-plugins.jar" that you can test in the plugin-test harness and upload directly to the HCI system to use right away!

Note: The plugin jar file must be a "fat" jar and contain all dependency libraries for everything it uses inside it. HCI provides plugin classloader isolation, allowing different plugins to use different versions of software within the same system. To make this work, you need to include all libraries inside the plugin jar file that your plugin will use. Make sure that you do NOT include the HCI plugin SDK library within your plugin jar. The HCI Plugin SDK jar should be on the compilation classpath per the example above (a "provided jar"), but NOT end up within the final plugin jar. This allows the HCI SDK in the product to evolve without breaking your plugin.
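
As a quick sanity check on the finished bundle (a hypothetical example; substitute your own jar name), list its contents and confirm that your classes, META-INF resources, and bundled dependencies are present while the SDK jar itself is not:

     jar -tf HCI-example-plugins.jar | less             # plugin classes, META-INF, dependencies
     jar -tf HCI-example-plugins.jar | grep plugin-sdk  # should print nothing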


Step 3. Validate, test, and debug plugin code


The plugin SDK comes with a full featured test harness: "plugin-test".  You can use this tool to validate, test, and debug custom HCI plugins.


The plugin test harness can operate in one of three modes:

  • Validation mode
  • Test mode
  • Debug mode


In validation mode, the harness ensures that the plugin manifest file exists, is well-formed, and that the plugin interfaces have all been correctly implemented. If your plugin validates successfully, then you have correctly edited the manifest file and implemented the required interfaces.

In test mode, custom configuration for each plugin is specified in the plugin-test harness configuration file. The test harness will then utilize each configuration to exercise additional functionality, check for errors, and make recommendations. Use this mode to exercise your plugin logic, and identify any issues.

Debug mode allows for configured plugins to be further utilized in a debuggable environment, allowing plugin logic itself to be checked for errors as methods are executed by the test harness. Executing in debug mode will prompt the user to connect any Java debugger or IDE to the specified port (5903 by default). This allows the user to set breakpoints in their plugin code and debug the logic itself while the tests execute.


Plugin Interface Validation


To validate a plugin:

  1. Go to the plugin-test directory: 
      cd HCI-Plugin-SDK/plugin-test 

  2. Run the test harness in validation mode:
     ./plugin-test -j <path-to-your-plugin-bundle-name>.jar 

The test harness validates the implementation of your plugin and:

  • Indicates where errors exist
  • Makes recommendations to resolve any issues it identifies
  • Makes best practice recommendations


plugin-test output in validation mode:



Once your plugin validates successfully, you're ready to test it out!


Plugin Instance Testing


Once your plugin has been validated, you can move on to deep testing and analysis of the plugin.



To test your plugin, you will first need to configure the plugin within the test harness. This involves:

  • Defining a PluginConfig to use when testing your connector or stage plugin
  • Defining the fields and streams on an inputDocument to use when testing your (stage) plugin

Fortunately, the plugin-test tool makes life easy with an "autoconfigure" option, which allows you to generate a plugin-test configuration for any or all plugins found within the specified plugin jar file.

To generate a configuration file for all plugins in a bundle:

 ./plugin-test -j <path-to-your-plugin-bundle-name>.jar -a [output-config-file] 

To generate a configuration file for only one plugin:

 ./plugin-test -j <path-to-your-plugin-bundle-name>.jar -plugin <plugin name> \
      -a [output-config-file]

This automatically generated configuration file may then be saved and edited in order to fine tune the plugin configuration to be used while running the plugin test harness. The auto-generated config will reflect the plugin-defined default values for all properties.

Custom configuration values for each plugin may be applied. If a plugin requires user input for config properties in the default configuration, these values must be specified in the plugin-test config file before testing the plugin. This is accomplished by adding the proper "value" entries for all default config properties in the plugin configuration.


Example PluginConfig Property:


  "name": "com.hds.hci.plugins.myplugin.property1",

  "type": "TEXT",

  "userVisibleName": "Field Name",

  "userVisibleDescription": "The name of the field to process",

  "options": [],

  "required": true,

  "value": "" // <--- Add non-empty value here for all required fields



For stage plugin testing, the inputDocument fields and streams may also be customized in the automatically generated plugin-test configuration file.


Example Input Document declaration:

"inputDocument": {

       "fields": {

            // ---> ADD OR MODIFY FIELDS HERE <----

            "HCI_id": [



            "HCI_doc_version": [



            "HCI_displayName": [

                "Test Document"


            "HCI_URI": [




       "streams": {

          // ---> ADD OR MODIFY STREAMS HERE <----

           "HCI_content": {

               "HCI_local-path": "/tmp/tmp1904411799648787907.tmp"

               // Instructions: To present a stream, use it's file path as the value of HCI_local-path above





Auto-configure and test execution example:



Plugin Instance Debugging


Debug mode runs the same tests as test mode, but in a debuggable environment, allowing plugin logic itself to be checked for errors as methods are executed by the test harness. Executing in debug mode prompts the user to connect any Java debugger or IDE to the specified port (5903 by default), so breakpoints can be set in the plugin code and the logic debugged while the tests execute.


The only difference between test mode and debug mode is:

  • The user sets Java IDE/debugger breakpoints in the methods of their plugin source code
  • The user starts the plugin-test tool using the "-d" option
  • The plugin-test tool prints the debugging port to connect to (5903 by default) and waits for a connection
  • The user attaches their Java IDE/debugger to the specified port (5903 by default)
  • Tests begin to execute and breakpoints will be hit


To run a test in debug mode:

 ./plugin-test -j <path-to-your-plugin-bundle-name>.jar -a [output-config-file] -d 

To run a test for a specific plugin in debug mode:

 ./plugin-test -j <path-to-your-plugin-bundle-name>.jar -plugin <plugin name> \
     -a [output-config-file] -d


plugin-test waiting for the user to connect the Java IDE to "Remote at port 5903":



After the user connects to the process using the Java IDE (e.g. IntelliJ/Eclipse), tests will begin executing and breakpoints within the plugin source code can be stepped through manually to debug the custom plugin:


Note that you can build, test, and debug an HCI plugin without ever touching an HCI system!


Step 4. Upload to HCI


Once your plugin bundle JAR file is ready and tested, you are ready to use it in production.


Simply upload your new plugin(s) to the UI by dragging and dropping them into the upload pane from any file browser, or just click to browse for the file locally.




After upload, HCI automatically makes your plugin code available to all system instances, and you may begin using your custom "Data Connections" and processing pipeline "Stages" immediately. Because the HCI UI is dynamically generated based on the plugins available, you will immediately see your plugin appear in the existing drop-down options.

Hopefully this demonstrates how anyone can easily build, package, and test custom HCI plugins.


Thanks for reading!


Hitachi Content Intelligence provides a REST API, CLI, and UI in the admin application for the custom management of both authentication and authorization.


Let's dive right into how this works...


Identity Providers


First, administrators register Identity Providers with the HCI system by selecting and configuring any of the available plugin implementations.


Currently, “Active Directory”, “LDAP compatible”, “OpenLDAP”, and “389 Directory Server” plugins are available:



In order to integrate with other identity providers such as Keystone, IAM, Google, or Facebook, an "Identity Provider" plugin could be produced for each. Each plugin requires different configuration settings, which the UI displays dynamically.


Adding an “Active Directory” Identity Provider  (Admin UI > System Configuration > Security > Identity Providers):



Listing configured Identity Providers:





Next, you may use these Identity Providers to discover and map Groups into HCI.


Registering a Group from the “Active Directory” Identity Provider:



Listing all registered groups:





Groups may be assigned one or more Roles. Roles are groups of one or more Permissions. Each HCI service can register a custom set of permissions that can be enforced by the system. Administrators combine permissions into custom roles they would like to use for a specific application, and assign one or more of those roles to the registered groups.


Creating the “Admin” role:



Editing permissions for the “Admin” role:



Group assigned the “Admin” Role, effectively giving all permissions to that group:



Only groups that have been configured with specific roles will have the permissions required to access the corresponding set of services/APIs within the system. When permissions are granted to a user, the corresponding areas of the UI become available and REST API requests are allowed. When permissions are disabled for a user, the admin UI dynamically removes those sections of the UI to prevent them from being accessed, and any REST API or CLI requests for those services fail with an error.




When logging into any HCI application, users may choose the security realm to utilize, which will use the corresponding identity provider for authentication. The security realm name associated with each Identity Provider is chosen by the administrator:




The application will then:

  • Authenticate the user against the selected Identity Provider
  • If successful, determine what group(s) that user belongs to
  • Given group membership, determine which roles and permissions have been assigned to that user
  • Enforce that the user can access the services they have been granted permission to access, and cannot access services for which they have not been granted access.


I hope this demonstrates the ability of HCI to easily integrate with existing customer directory services and manage fully customizable roles for any given organization and any particular application.


Thanks for reading!


As promised I posted the two example workflows that I used in the Top Gun HCI Workshop on May 24th.

The first example bundle can be found here: Pictures with GPS Bundle.bundle.

The second example bundle can be found here: DICOM Bundle.bundle.


Note: These bundles include data connections that point to HCP namespaces that you DON'T have access to. So I copied the namespaces to two AWS S3 buckets:

As an exercise, see if you can replace the HCP data connection with an AWS S3 connection. These buckets are in US East (in case you were wondering) and are read only to Everyone.


You will have to wait to use the DICOM bundle because the DICOM plugin is not commercially available yet (I got a sneak preview from Engineering). Unless, of course, you choose to write your own DICOM plugin...


Happy to get feedback and to see how folks might improve on these workflows. Let's start a groundswell of sharing!


Good luck!



Ben Isherwood

HCI: Querying Indexes

Posted by Ben Isherwood Apr 21, 2017



I'm sure that nearly everyone has queried a search engine before - typically by specifying one or more keywords of interest. But what happens when you begin to add other metadata criteria to your searches? Anyone who has fine-tuned a product search (e.g. by size or color) knows that adding metadata values to your query can both improve relevancy and enable advanced data exploration.


As a quick example, try searching Google for the keyword "HCI"... Didn't find the result you wanted? Now try adding more specific terms to the query, such as the product or vendor name. Much more relevant results!


With HCI, you are always building an index of searchable metadata fields. You have full control over which terms end up in which fields for each indexed document, allowing you to provide the exact search experience your users want.


Let's dive into the details of how queries work using the Lucene query language:


Basic Queries

The simplest form of query in HCI is known as a "basic" or "term" query. These are the types of queries you may typically run in search engines such as Google. These queries consist of one or more field names and the values you will be searching for:


The query language allows you to specify the field(s) and value(s) you want to match, like:

     author:king

You are always searching for a value within a specified indexed field name. When specifying plain keywords without a field prefix, the system will use the system configured default field.  In HCI, the default field searched (without specifying a field name) is "HCI_text", which contains all indexed keywords from the document.


So, for the following keyword query:

     king

You are actually querying for the following:

     HCI_text:king

This is how your keywords are compared to indexed documents, by matching the values of the fields in each.




A query for "all" results in the index can be specified using wildcards (all fields and all values):

     *:*

A query for all results where the field "author" is defined in the document would be as follows:

     author:*

A query for all results where the author starts with "Stephen" would then be:

     author:Stephen*

Note that wildcards must be at the end or middle of each query clause only (e.g. "author:*tephen" is not valid, but "author:Steph*King" is valid.).


You can also wildcard a specific single character using the "?" syntax:

     author:St?phen

Phrase Query

Phrase queries are used to match multiple terms that should be found next to one another in sequence:

      author:"Stephen King"


This query matches authors containing exactly the terms "Stephen King", but not "Stephen R. King", "King, Stephen", or "Stephen Kingston".


Sloppy Phrase Query

If you would like your results to match "Stephen J. King" or "King, Stephen", you can use a "sloppy phrase query". A phrase query can be made sloppy by specifying, after the "~" character, the number of term position edits (extra inline keyword moves) allowed to match:

author:"Stephen King"~1 

This sloppy query would match "Stephen J. King", because one word hop is enough to generate a phrase match. However, it would not match "King, Stephen", because swapping the order of two terms requires two moves, and it would not match "Stephen 'the author' King", because that would also require 2 word hops to match your phrase. Both would match if you used ~2 instead.


This count of term edit/removals required to make a match is called the slop factor.


Sloppy phrase matches based on slop values for phrase query "red-faced politician":





Boolean Query

A boolean query is any query containing multiple keywords or clauses, like:

     HCI_text:foo  +Content_Type:"text/html"


A clause may be OPTIONAL (relevancy ranked), REQUIRED ( + ), or PROHIBITED ( - ):

  • OPTIONAL : The default (no operator specified). Results will be returned in relevancy ranked order where matches get higher boosts.
  • REQUIRED  ( + ) : When specified, this clause is required to match, and only exact document matches are returned.
  • PROHIBITED ( - ) : When specified, this clause is required NOT to match, and any exact document matches are not returned.


Search for author "Stephen King", MUST contain keyword "Shawshank" and MUST NOT have category "movie":

      +author:"Stephen King" +Shawshank -category:movie


Search for only results mentioning keyword "dog" and boost documents with keywords "german" and "shepherd":

     +dog german shepherd


AND, OR, and NOT

Note that you may use "AND", "OR", and "NOT" between phrases. You may also use their symbol equivalents (&, |, and !). Care must be taken when using this syntax, as this does not apply traditional boolean logic to each term clause:

  • AND becomes a "+" operator, denoting REQUIRED clauses
  • OR becomes no operator, denoting OPTIONAL clauses
  • NOT becomes a "-" operator, denoting PROHIBITED clauses


So this query:

     (foo AND bar) OR dog NOT bear


Is interpreted as the following:

    +foo +bar dog -bear


Example: None of these queries will produce equivalent results

     banana AND apple OR orange

     banana AND (apple OR orange)

     (banana AND apple) OR orange


Why? Because they evaluate to 3 different queries:

    +banana apple orange

    +banana +apple +orange

    +banana +apple orange


Therefore it is strongly recommended to avoid the AND, OR, and NOT operators in general. Just use "+" and "-" around clauses where they are needed!


Range Query

A range query lets you search for documents with values between 2 boundaries. Range queries work with numeric fields, date fields, and even string fields.


Find all documents with an age field whose values are between 18 and 30:

    age:[18 TO 30]


Find all documents with ages 65 and older (age >= 65):

     age:[65 TO *]


For string/text fields, you can also find all words found alphabetically between apple and banana:

    name:[apple TO banana]
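
Date fields can be queried the same way. The field name below is hypothetical, and NOW is the same date shorthand used in the filter query example later in this post:

    modified_date:[2016-01-01T00:00:00Z TO NOW]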


You may use curly brackets to indicate "exclusive" matches vs. the "inclusive" square brackets. To find all documents with ages older than 65 (age > 65):

    age:{65 TO *]


Find all documents with specific ages 18, 19, or 20, and also ages 25 or 26:

    age:[18 TO 20] age:{24 TO 26]


Sub Query Grouping



You can use parentheses to group clauses to form sub queries. This can be very useful if you want to control the boolean logic for a query.


To search for either "stephen " or "king" and "author" use the query:

    (stephen OR king) AND author


Or, as we previously learned, the preferred form avoiding the boolean logic keywords:

     (stephen king) +author


This eliminates any confusion and makes sure that "author" must exist and that either the term "stephen" or "king" may exist.


Field Grouping

You can use parentheses to group multiple clauses to a single field.

To search for a title that contains both the word "return" and the phrase "pink panther" use the query:

    title:(+return +"pink panther")


Fuzzy Query

You can perform fuzzy searches based on the Levenshtein distance, or edit distance, algorithm. To do a fuzzy search, use the tilde ("~") symbol at the end of a single word term. For example, to search for a term similar in spelling to "roam", use the fuzzy search:

     roam~

This search will find terms like "foam" and "roams". You can also specify the required similarity. The value is between 0 and 1; with a value closer to 1, only terms with a higher similarity will be matched. For example:

     roam~0.8

The default similarity that is used if the parameter is not given is 0.5.


Boosted Query

HCI calculates the relevance level of matching documents based on the terms found.

To boost a term use the caret, "^", symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.

Boosting allows you to control the relevance of a document by boosting its term. For example, you are searching for:

    stephen king


If you want the term "king" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type:

    stephen king^4


This will make documents with the terms "king" appear more relevant. You can also boost Phrase Terms (e.g. "stephen king") as in the example:

    "stephen king"^4 "author"


By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).

HCI also provides an index query setting to do this automatically for certain field values across all queries against a specific index.


Constant Score Query

A constant score query is like a boosted query, but it produces the same score for every document that matches the query. The score produced is equal to the query boost. The ^= operator is used to turn any query clause into a constant score query.  This is desirable when you only care about matches for a particular clause and don't want other relevancy factors such as term frequency (the number of times the term appears in the field) or inverse document frequency (a measure across the whole index for how rare a term is in a field).


Example 1:

     (description:blue OR color:blue)^=1.0 text:shoes


Example 2:

     (inStock:true text:solr)^=100 native code faceting


Proximity Query

A proximity search looks for terms that are within a specific distance from one another.

To perform a proximity search, add the tilde character ~ and a numeric value to the end of a search phrase. For example, to search for a "stephen" and "king" within 10 words of each other in a document, use the search:

     "stephen king"~10


The distance referred to here is the number of term movements needed to match the specified phrase. In the example above, if "stephen" and "king" were 10 positions apart in a field but appeared in reverse order ("king" before "stephen"), more than 10 term movements would be required to move the terms together with "stephen" immediately to the left of "king", so the phrase would not match.


Filter Query

A filter query retrieves a set of documents matching a query from the filter cache. Since scores are not cached, all documents that match the filter produce the same score (0 by default). Cached filters will be extremely fast when they are used again in another query.


Filter Query Example:

     description:HDTV OR filter(+promotion:tv +promotion_date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY])


The power of the filter() syntax is that it may be used anywhere within the query syntax.



You've now learned some powerful query syntax, congratulations!


We've only scratched the surface with what's possible using HCI search, so stay tuned for more advanced query topics in upcoming blogs.


Here are some additional introductions to the Lucene query language that I've found helpful:


Thanks for reading,


Many people that we talk to about Hitachi Content Intelligence are curious about why we focus so intently on building out tools for content processing and data analysis. Isn't this a search tool?

It turns out that great search experiences are directly driven by the quality of the processing performed on the content. This makes processing, normalizing, and categorizing content the most important aspect of any (great) search technology. It also turns out that most search tools are surprisingly deficient in these areas! This has been a great opportunity for the HCI team to work on filling these gaps.


A while ago, I was asked whether or not HCI could be used to process and search HDI data that was stored (obfuscated) in HCP namespaces.  Here's what I did to find out the answer...


Step 1: Add an HCP data source

I started out testing HCI against a namespace in HCP that was backing an HDI file system.

The goal was to see how search would work with HDI data and to determine how difficult this task would be.


First, we connected HCI to the HCP namespace containing the HDI data:


Configuring a data connection is all that's needed to begin processing data in HCI.


Step 2. Auto-generate a Content Class

HDI file paths in the HCP namespace are obfuscated, so it's impossible to search the contents of these namespaces directly.

However, because HDI stores both the full content and HDI custom metadata in HCP, we can easily take advantage of this.

Using example custom metadata from one of the files in the namespace, HCI was used to auto-generate an "HDI custom metadata" content class. This content class could be used to pull the metadata from the XML file into the pipeline engine for further processing.

HDI auto-generated content class:



Step 3: Create and test an HDI processing pipeline

This step required some effort... and resulted in the addition of some new built-in plugins.

After cloning the default pipeline, I added a content class extraction stage to the pipeline for reading the "default" custom metadata annotation: “HCP_customMetadata_default”. This enabled me to pull the XML into the system, extract all of the fields, and present them to the processing pipeline.

After browsing for a file on the data source and running a pipeline "test" operation against it, I quickly found that the file paths found in the "default" annotation were URL encoded - making a search against these fields difficult. I built a URL Encoder/Decoder stage to decode them, uploaded the plugin, and started using it in the pipeline immediately. Now these fields were clearly visible!

URL decode any encoded HCI metadata fields:


I noticed that there were a lot of UNIX timestamps in the metadata values on these fields. The date conversion stage didn't (yet) have support to normalize UNIX timestamps into standard date fields. Adding support in the stage for these resolved that issue.


Normalizing HCI metadata date fields:



In some documents, the "HDI_file_path" metadata field wasn’t available, so I also configured the pipeline to use the "HCI_DisplayName" in place of the HDI_file_path metadata for those specific documents.


Step 4: Click to build an optimized index

After running the workflow, HCI automatically discovered all sorts of metadata from the files in the namespace.

Using the "Workflow > Task > Discoveries" UI, I created and configured an index from the field recommendations with a single click!

Auto-generated index schema:



Step 5: Customize the search experience

HCI allows you to quickly customize the search experience for specific sets of users.

I customized a results display configuration in the index query settings so that:

  • The value of the HDI file path field is used as the title of each document
  • The link takes you directly to the object in the HCP system
  • "Snippet" text below each result contains raw extracted text from each document
  • An expandable document metadata panel named "HDI Metadata" appears with each result

Customizing search result display:


Built-in default HCI autocomplete support uses phrases within the content itself:


In the index "Public" query setting, enabled support for range queries on time fields, file names, etc:


You can also expose detailed document metadata within each search result:


Minutes later – full featured HDI file system search, just by crawling the HCP namespace!  And we had implemented 2 new features in the process: a new "URL Encoder/Decoder" stage and UNIX timestamp support in the "Data Conversion" stage.


Hopefully you've learned how HCI content processing technologies can accelerate the development of full featured search and categorization.


Thanks for reading!


Ben Isherwood

HCI Plugins: Geocoding

Posted by Ben Isherwood Apr 14, 2017

Many of us are aware of the photo image geotagging capabilities of our smart phones. It's how social networks can report to the masses where we were when we posted our latest vacation slideshows. This process simply identifies your location at the time the photo was taken by leveraging the global positioning systems found in each device. The metadata coordinates of your position are attached to the document for later use.

So how can our systems take advantage of this data for search and analysis? Rather than manually sifting through millions of documents, it's useful to identify all documents related to a specific building, city, state, or country and produce the results on demand. These results may then be further categorized by other metadata found in each document to help you find that exact vacation image you were looking for.

To support the use of geotagged information, HCI provides the Geocoding stage.

Geocoding is really "reverse geotagging": you take the latitude and longitude coordinates attached to Documents as metadata and convert them into the corresponding city, state, country, or even timezone values that make sense for your use case.

The stage supports input latitude and longitude fields in the following format:

geo_lat : 42.482119 
geo_long : -71.186761

Note that these fields are automatically extracted by the "Text and Metadata Extraction" stage, so any geotagged metadata in each Document will typically be immediately available for further processing and analysis.



So, what can you do with this geotagged metadata?

The HCI Geocoding stage uses public location data collected and made available from the GeoNames project to map the latitude and longitude coordinates to the nearest local position on earth.

The stage supports the following output configuration values (any combination may be specified):

  • cityField - The name of the output field that should contain the city (defaults to "city", optional)
  • stateProvinceField - The name of the output field that should contain the state/province (defaults to "state", optional)
  • countryField - The name of the output field that should contain the country (defaults to "country", optional)
  • timeZoneField - The name of the output field that should contain the time zone (defaults to "timeZone", optional)
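
For illustration, a Document carrying the coordinates shown earlier might end up with additional fields like the following when the default output field names are used (the values are illustrative, chosen to match the "City:Burlington" / "State:MA" query example below):

     city : Burlington
     state : MA
     country : United States
     timeZone : America/New_York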




HCI pipelines can be configured to automatically extract these additional information fields from Documents given only the geotagged input fields that the camera added to each image. The resulting Documents contain even more metadata that may be utilized by the pipeline or in query requests for further faceting and categorization.




Now that we have the metadata, we can index it and leverage it for queries. Instead of just random keywords, we can then specifically target keywords found in documents matching "City:Burlington" and "State:MA" in our queries.

Now you'll know how to easily locate that ONE great St. Lucia skyline photo from an image repository of billions. And you may even run into hundreds just like it in the process!

Thanks for reading,


An important part of the HCI (Hitachi Content Intelligence) setup and configuration is the user security settings. HCI comes with only one default user, but a real deployment will have multiple end users and different sets of users, and they all need to be made available through identity providers. We think of the users in groups by function. The most common groups are HCI System Admins, Developers, and End Users. Each group is given specific security rights for its particular function. The details and exact settings need to be determined with the client; below is our typical setup and how we configure security.


HCI System Admin

The HCI system admin is typically responsible for the day-to-day activities in the system, including system troubleshooting, receiving system alerts, balancing services on nodes, and granting/applying security to users. Depending on the client's security concerns, the admins may or may not be allowed to actually see/view the data. In our example they will have full rights to everything in the system (I am not referring to the default account in this example).


Developer role


The developer role is intended for the person that will create the workflows. This includes the data connections, pipelines, and indexes. They should also be able to import and export plugins and workflows. Certain sites may have different rules for a test versus production environment, but in our experience developers will have access to all data manipulation activities.


End Users 


End users are the people who will be using the processed data in the Search GUI. We typically have several groups of end users, as certain users are allowed access to certain sets and types of data. In our lab we have two end user groups, so we can validate that one group can see the processed data and one group cannot. The data in our lab system represents documents containing a person's social security number (an American personal identification number). This information is highly sensitive and has legal ramifications if the wrong people access or use it incorrectly.


The first step in setting up HCI security is to link the HCI system to an identity provider. We have mostly been working with LDAP, but several types are supported, including 389 Directory Server, Active Directory (LDAP), LDAP Compatible, and OpenLDAP.


Go to


System Configuration→ Security→ Identity Provider


It will ask for your AD information


Type= 389 Directory Server, Active Directory (LDAP), LDAP Compatible, or OpenLDAP


Identity provider host name= example “”


Transport security= None, TLS Security or SSL


Identity provider host port= this is the Network Port # to communicate with the Identity provider 


User name & Password=A user that has admin rights to pull back the AD groups.


Domain= Example “HCP.demo”


Search Base DN=  Example “  dc=hcpdemo, dc=com”


Default Domain name= optional, but recommended. If it is left blank, users must include the domain when logging in to HCI (for example, tmyers@<domain> rather than just tmyers).


Once that is completed, hit Test, and then Update if everything is correct.


Once that is done we will need to go to


System Configuration →Security →Groups


At this point you should Discover groups and be able to see all the AD groups in the system. 


System configuration →Security →Roles


Roles do not come preconfigured in HCI. At the beginning I mentioned HCI System Admin, Developer, and End User roles. When you create a role, you will have to go into the permissions, where permissions are assigned to the role. HCI has very granular security permissions: if you look at Content Classes, for example, you will see the permission is subdivided into Create, Delete, Update, and Read.


For the HCI admin role we typically grant all permissions, with or without the ability to query depending on whether the admins are allowed to see the data. Some sites also create a separate admin that can create or edit security.


For our Developer role we typically choose the features around data manipulation, i.e. workflows, data connections, pipelines, and indexes.


For end users we start by granting the group rights to search/query the data. We can also tie the index to individual roles as one way of limiting access to view the data. This is a two-step process described later on.


When you are finished creating roles and assigning permissions to those roles, you will need to link a role to the group. This is done in System Configuration → Security → Groups.


Enabling data security is a two-part process.



First enable security on the index


Workflows  →Indexes →  Example Index  →Query Settings


Create a query setting; you will then be able to enforce security on this index by changing "Enforce document security" to Yes under Access Control. Once this is saved, you must enable this query setting and disable the Public one. Once security has been applied to the index, we need to grant our end user groups/roles access to that index.


Link Index to the roles

Go back to System Configuration → Security → Groups.

Choose your group and edit it; you will see an option for indexes. Assign the index to the appropriate roles. At this point you should be able to log into the search screen and see the data assigned or restricted as configured.

HCI pipelines provide an easy-to-use mechanism for analyzing, normalizing, and transforming data.

But how do I know what additional stages I need to add to my pipeline to process a set of data?

HCI introduces a "pipeline test" tool. This tool allows you to browse a data source, select any file, and process that file using the pipeline. The pipeline test UI shows you the values of document metadata before the document enters a processing stage and can compare them with the metadata that exists after the stage runs. This not only lets you see exactly how each stage processes a document in the pipeline, but also provides visibility into the metadata values that are already there.


Workflow pipeline test tool, displaying the new metadata added by the MIME Type Detection stage:



There are many useful built-in stages to utilize when testing a pipeline. Let's take a look at a few now...


Snippet Extraction


First, let's talk about the Snippet Extraction stage.

One convenient aspect of the snippet extraction stage is that it supports pulling raw content from a data stream. This allows you to configure this stage to pull raw data into a metadata field of choice, and let you visualize the content of that raw stream.

For example, this stage has helped to resolve issues in which content class extraction was failing.

The content class extraction stage in the pipeline below was expecting to read raw XML data from the HCI_content stream, but it wasn't working. The stage was configured correctly with the correct source stream name, so why didn't it work?

Why is the content class stage not extracting metadata? There are no changes!



To figure this out, we first added a Snippet Extraction stage to the pipeline before the Content Class Extraction stage, and configured it to read data from the stream and store it in a "$TestContents" field.

Note: the "$" prefix can be used to name fields that should never be indexed, but that may be used for debugging or stage to stage communication.

After running the pipeline test, voila! The content dump indicates that the stream being processed was not XML, but raw text! It looks like we configured the stage to process the raw "HCI_content" stream instead of the custom metadata stream we should have used: "HCP_customMetadata_default"...


Indicates that stream contains text, instead of the expected XML:



Fixing the content class stage to point to the correct XML content stream name ("HCP_customMetadata_default") resolved the issue, allowing the extraction to work as expected:



Snippet extraction may be used whenever you need to gain insight into what content is actually inside the data streams you are processing.



Reject Documents


Let's look at yet another stage that is extremely useful when building and testing pipelines: "Reject Documents".

The Reject Documents stage allows you to cause any Document to immediately fail processing with a custom error message of your choosing.

Adding Reject Documents to the pipeline:



Document failures generated as a result of the Reject Documents stage look like any other document failure, and are reported exactly the same way. The workflow will halt all further processing of these rejected documents, but list them for users for further investigation. This stage can therefore serve a similar purpose as assert statements found in many programming languages.

The stage allows pipeline creators to specify "required" criteria for a Document to be further processed, allowing for validation of specific document conditions.

Configure a "Reject Document" stage by specifying a custom message:



Consider a scenario in which you (the pipeline designer) expect all Documents to have a stream named "HCI_text" at a specific point in the pipeline. Since all further processing depends on this condition, you can introduce a "Reject" stage to enforce this.



If any Document enters the pipeline at this position WITHOUT a stream named "HCI_text", that Document will fail processing and result in a document failure.

This behavior can be invaluable in identifying which documents in your data set do not conform to specific processing criteria. You can use this information to either process those failing documents further, or update the pipeline to handle them in special ways. In this specific case, you can add an additional "Text and Metadata Extraction" stage to generate the expected "HCI_text" stream in the case that it does not already exist.

This stage is also useful for "pausing" the expansion of documents such as CSV, log, PST, ZIP, and TAR files. If you'd like to halt processing to see how a pipeline is handling these sub-documents, you can add a "Reject Documents" stage at any point in the pipeline, optionally made conditional on a specific file. Using the pipeline test stage diff tools, you can then determine how these expanded Documents were handled by the preceding pipeline stages and adjust pipeline logic and stages accordingly.




As always, use pipeline testing with example documents from your data set for analysis. Even just a few minutes testing example documents can lead to vast improvements in index query performance, relevancy, and accuracy.


More on additional stages that can be used to debug pipelines in future blogs.

Thanks for reading!