
Hitachi Content Intelligence


Hitachi Content Intelligence delivers a flexible and robust solution framework to provide comprehensive discovery and quick exploration of critical business data and storage operations.

 

Make smarter decisions with better data and deliver the best information to the right people at the right time.

  • Connect to all of your data for real-time access regardless of its location or format - including on-premises, off-premises, or in the cloud
  • Combine multiple data sources into a single, centralized, and unified search experience
  • Data in context is everything – put data into meaningful form that can be easily consumed
  • Deliver relevant and insightful business information to the right users - wherever they are, whenever they need it

 

Designed for performance and scalable to meet your needs.

  • Flexible deployment options enable physical, virtual, or hosted instances
  • Dynamically scale performance up to 10,000+ nodes
  • Adopt new data formats, and create custom data connections and processing stages for business integrations and custom applications with a fully-featured software development kit

 

Connect Understand Act.png

 

What’s new in Hitachi Content Intelligence v1.3

 

  • Hitachi Content Monitor
  • Simplified navigation of Hitachi Content Intelligence consoles
  • External storage support for Docker Service Containers
  • Increased flexibility with new Workflow Jobs
  • Enhanced data processing actions
  • New and improved data connectors
  • Overall improvements to performance and functionality

 

Hitachi Content Monitor provides enhanced storage monitoring for Hitachi Content Platform.

  • Centrally monitor HCP G Series and HCP VM storage performance at scale, in near real-time, and for specific time periods
  • Analyze trends to improve capacity planning of resources - such as storage, compute, and networking
  • Customize monitoring of performance metrics that are relevant to business needs
  • Create detailed analytics and graphical visualizations that are easy to understand

 

HCP Storage and Objects - for blog.png

 

Hitachi Content Platform (HCP) is a massively scalable, multi-tiered, multi-tenant, hybrid cloud solution that spans small, mid-sized, and enterprise organizations.  While HCP already provides monitoring capabilities, Hitachi Content Monitor (Content Monitor) is a tightly-integrated, cost-effective add-on that delivers enhanced monitoring and performance visualizations of HCP G Series and HCP VM storage nodes.

 

Content Monitor’s tight-integration with HCP enables comprehensive insights into HCP performance to enable proactive capacity planning and more timely troubleshooting.  Customizable and pre-built dashboards provide a convenient view of critical HCP events and performance violations.  Receive e-mail and syslog notifications when defined thresholds are exceeded.  Aggregate and visualize multiple HCP performance metrics into a single view, and correlate events with each other to enable deeper insights into HCP behavior.

 

Content Monitor is quick to install, easy to configure, and simple to use.

 

HCP Application Load.png

With Content Monitor, a feature of the Hitachi Content Intelligence (Content Intelligence) product, you can monitor multiple HCP clusters in near real-time from a single management console for information on capacity, I/O, utilization, throughput, latency, and more.

 

 

Simplified navigation

  • Easily and seamlessly navigate, and automatically authenticate, between Content Intelligence apps (Admin, Search, Monitor) with enhanced toolbar actions 
  • No more need for numerous web browser tabs

 

External storage support for Docker Service Containers

  • Use external storage with Content Intelligence for more robust data storage features and improved sharing of remote volumes across multiple containers

 

Increased flexibility with new Workflow Jobs

  • Each Content Intelligence workflow job can now be individually monitored and configured to run on all Content Intelligence instances, a specific subset of instances, or to float across instances to dynamically run wherever resources are available

 

Enhanced data processing actions

  • Conditionally index processed documents to existing Content Intelligence, Elasticsearch, or Apache Solr indexes
  • New Aggregation calculations for 'Standard Deviation' and 'Variance' of values in fields of data

 

New and improved Content Intelligence data connectors

  • New connector for performance monitoring of HCP systems
  • New connector for processing HCP syslog events on Apache Kafka queues, and improvements to existing Kafka queue connectors

 

For more information, join the Hitachi Content Intelligence Community.

 

Also, check out the following resources:

 

Thanks for reading!

 


Michael Pacheco

Senior Solutions Marketing Manager, Hitachi Vantara

 

Follow me on Twitter:  @TechMikePacheco

Problem Overview

 

As technology, businesses, and governments continue to become more dispersed across the planet, the location of activity becomes almost as important as the data itself. Increasingly, these organizations find that the location of activity provides value in finding patterns for predictive analytics, or simply in understanding the current and past location of assets. Positioning on the globe is expressed as a geospatial location.

 

Geospatial location can be expressed in many different forms.  Over the years, organizations have created location specifications that focus on a specific region, and others that can specify any position around the globe on land, sea, and air.  This article will not describe all the possible ways to specify location, but instead focuses on the most common mechanism used by readily available modern mapping software.  For instance, Google Maps and Google Earth are likely the most popular.  These mapping solutions utilize a geodetic system that references locations relative to sea level, called WGS 84 (World Geodetic System).

 

There are many articles on the internet that describe these and other location specifications in great detail.  A good starting point is the following URLs:

 

https://en.wikipedia.org/wiki/Geodetic_datum

https://en.wikipedia.org/wiki/World_Geodetic_System

http://www.earthpoint.us/Convert.aspx

 

This article will focus on the WGS84 system expressing latitude and longitude.  The WGS84 system has multiple ways of specifying latitude and longitude, but this article will focus on the decimal number specification. For example, the position for the Empire State Building, New York City, New York, USA is:

 

Latitude: 40.748393

Longitude: -73.985413

 

Once location information about data is available, the key is to be able to utilize it for relevant content discovery.  This article uses pictures taken with a camera that records the coordinates of the location where each picture was taken. The HCI platform can enable this kind of discovery with the appropriate setup.

 

This article will discuss the following topics to perform geospatial search.

  • Geospatial Data Preparation
  • Solr Index Preparation
  • Workflow Construction
  • Performing Geospatial Search

 

Geospatial Data Preparation

 

The first part of this effort is the construction of the data discovery and extraction phase that prepares data for indexing.   The images are preloaded into an HCP namespace to be utilized by HCI.  An HCI data connection of type HCP MQE, named “Image Data Lake”, is used to access the images.  There is nothing special about the data connection, so it will not be detailed in this article.

For processing the images in preparation for indexing, an HCI pipeline was constructed with 3 main parts:

  1. Data Extraction of geospatial information from images
  2. Data Enrichment and Preparation of geospatial information.
  3. Date/Time preparations

 

Data Extraction

 

The data extraction portion simply makes sure that the document is a JPEG file, and then performs the generic Text/Metadata Extraction to get any geospatial coordinates.  This stage places the coordinates in the geo_lat and geo_long document fields.

 

Screen Shot 2018-04-11 at 3.06.46 PM.png

 

Data Enrichment

 

The high-level goal of this part is to prepare geospatial information for indexing.  If the previous part generated geo_lat and geo_long fields, processing will proceed for enrichment.

 

The first 3 stages utilize the Geocoding stage to generate human-readable city, state, and country information.  The results are document fields named loc_city, loc_state, loc_country, and loc_display (a combination of the other 3 fields).

 

The last two stages set up a GPS_Coordinates field in the form <lat>,<long>. This format is required by Solr for the location data type that will exist in the index once we create it later in this article. The Tagging stage sets GPS_Coordinates to the string “LAT,LON”, which is used as a template for the next stage.  The Replace stage then replaces LAT with the contents of the geo_lat field and LON with the contents of the geo_long field, producing a document field like:

 

GPS_Coordinates: 40.748393,-73.985413
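
Expressed in plain Java purely as an illustration (in HCI this is configured in the Tagging and Replace stages, not written as code), the combination amounts to:

     String template = "LAT,LON";                                   // set by the Tagging stage
     String gpsCoordinates = template.replace("LAT", "40.748393")   // contents of geo_lat
                                     .replace("LON", "-73.985413"); // contents of geo_long
     // gpsCoordinates is now "40.748393,-73.985413"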

 

The pipeline stages for this processing are the following:

 

Screen Shot 2018-04-11 at 3.07.39 PM.png

 

Date/Time Preparations

 

In general, dates and times can be specified in a nearly infinite number of ways. Although HCI has a built-in Date Conversion stage, a little bit of processing is still required to prepare the values so that HCI can index the dates.

 

For instance, the GPS date and time information is returned by the Text/Metadata Extraction stage as two different fields. The approach is to reconstruct a GPS_DateTime field from these two fields in a form that the Date Conversion stage can understand without additional definitions in that stage.  The sample fields generated by Text/Metadata Extraction are:

 

GPS_Date_Stamp: 2018:01:27

GPS_Time_Stamp: 17:11:25.000 UTC

 

The goal is to put it into the following form that is understood by the Date Conversion stage:

 

2018-01-27T17:11:25.000+0000
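
Purely as an illustration of the string manipulation involved (the actual work is done by the pipeline stages), the reconstruction amounts to:

     String dateStamp = "2018:01:27";            // GPS_Date_Stamp
     String timeStamp = "17:11:25.000 UTC";      // GPS_Time_Stamp
     String gpsDateTime = dateStamp.replace(':', '-') + "T" + timeStamp.replace(" UTC", "+0000");
     // gpsDateTime is now "2018-01-27T17:11:25.000+0000"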

 

What precisely each stage does is again beyond the scope of this article.

 

The pipeline stages for this processing are the following:

 

Screen Shot 2018-04-11 at 3.08.24 PM.png

 

Solr Index Preparation

 

Solr contains built-in support for indexing geospatial coordinates based on the WGS84 coding system. For indexing, the field must be of a specific type and be formatted for that type. The formatting was already performed previously; essentially, the field must be of the form:

 

<lat>,<long>

 

To prepare HCI for indexing and performing geospatial search, it is required to:

  1. Patch the HCI installation,
  2. Construct an appropriate index schema, and
  3. Define a Query Result configuration

 

HCI Installation Patch

 

When an HCI index is constructed, a base configuration included in the HCI installation is used to help simplify the index creation process. One part of the base configuration is the managed-schema. This is essentially a configuration mechanism that maps internal Solr data types to simpler names along with default data type attributes.

 

For the purposes of geospatial data, there is a data type called location.  However, there is a problem with HCI's definition of this type: it uses a deprecated and inadequate Solr class.   The current implementation of HCI (1.2) allows for changing the managed-schema of an index once it is created; however, the problem with this approach is that if the index is exported and then imported, any changes to the managed-schema on that index will be lost.

 

For most deployments, it may not be necessary to export and then re-import an index; however, during workflow development it is typical to want to completely clear out the index periodically. Thus, the recommendation is to patch the HCI installation until HCI is updated with a more appropriate definition for this data type.

 

The patch procedure is to manually edit 3 files on each node of the HCI installation.  Assuming the HCI installation is rooted at /opt/hci and the HCI version is 1.2.1.139, the managed-schema files are rooted at:

 

/opt/hci/1.2.1.139/data/com.hds.ensemble.plugins.service.adminApp/solr-configs

 

In this folder, the following are the relative paths to the 3 files that need to be updated:

 

basic/managed-schema

default/managed-schema

schemaless/managed-schema

 

 

The changes to basic/managed-schema and default/managed-schema are identical: change the definition of the location field type and add a locations field type for multi-valued fields.  The following are the old and new lines.

 

OLD LINE:

 

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

 

NEW LINES:

 

<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>

<fieldType name="locations" class="solr.LatLonPointSpatialField" docValues="true" multiValued="true"/>

 

The changes to schemaless/managed-schema are a bit more complex and require 3 edits.

 

Change 1:  Delete the following lines.

 

<!-- Type used to index the lat and lon components for the "location" FieldType -->

<dynamicField name="*_coordinate"  type="tdouble" indexed="true"  stored="false" />

 

 

Change 2: Change the following lines.

    OLD:

 

<dynamicField name="*_p"  type="location" indexed="true" stored="true"/>

 

    NEW:

 

<dynamicField name="*_p"  type="location" docValues="true" stored="true"/>

<dynamicField name="*_ps"  type="location" docValues="true" multiValued="true" stored="true"/>

 

 

Change 3: Change the following lines.

   OLD:

 

<!-- A specialized field for geospatial search. If indexed, this fieldType must not be multivalued. -->

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

 

 

   NEW:

 

<!-- A specialized field for geospatial search. -->

<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>

<fieldType name="locations" class="solr.LatLonPointSpatialField" docValues="true" multiValued="true"/>

 

Once all the changes have been made to the files, then the HCI software needs to be restarted. On a CentOS system, the command executed on all nodes would be:

 

systemctl restart HCI

 

Wait for HCI to completely restart and for all services to be running.  This can be monitored in the Admin GUI under Monitoring -> Dashboard -> Services.

 

NOTES:

  • If nodes are added to the HCI installation, those new nodes must also be patched, otherwise, unexpected index definitions may occur if the AdminApp service is run on those nodes.

 

  • If this configuration is desired before HCI is installed, the procedure can be performed on the installation distribution and the distribution repackaged.  All HCI installations that use the patched distribution media will then contain the appropriate changes, and the changes will survive node additions.

 

Solr Index Schema Definition

 

Once the new managed-schema definition has been updated in the installation, the next step is to create a Solr index schema to accept the geospatial data collected.  To keep things simple, the HCI index created here is of the Basic type.  The Basic initial schema creates a base set of HCI fields for indexing.

 

Screen Shot 2018-04-11 at 3.34.36 PM.png

 

Along with the basic fields, the fields that will contain the location information generated by the Geocoding built-in stage must be created as simple strings:

 

Screen Shot 2018-04-11 at 3.09.44 PM.png

 

To hold the geospatial information, the following index fields need to be created:

 

Screen Shot 2018-04-11 at 3.10.27 PM.png

 

The GPS_DateTime field is a simple date index field.

The GPS_Coordinates field is a location field type and will contain the definition configured in the managed-schema definition.

 

To verify that the new definition of the location field type is active in HCI, open the schema view of the index just created, then (1) click the Advanced link, (2) click the managed-schema configuration file, and (3) observe the definition of the location field type, as shown below.

 

Screen Shot 2018-04-11 at 3.33.21 PM.png

 

 

HCI Query Result Definition

 

The next step is to modify the image index Query Settings to specify how content should be returned by queries. At a minimum, it is necessary to add the GPS_* and loc_* index fields.  The simplest approach is to add all of the fields that exist in the index schema definition to the query settings, as only a few were added.  This is accomplished on the index Query Settings page under Fields by selecting the Action “Add All Fields”.  See the picture below for guidance.

 

Screen Shot 2018-04-11 at 3.32.33 PM.png

 

Optionally, these fields can also be added to the Query Settings Results; this is left as an exercise for the reader.

Workflow Construction


At this point, there is a Data Connection named “Image Data Lake” pointing at an HCP namespace, a pipeline named “Geospatial Image Indexing” that processes the images, and an index named “ImageIdx” that can receive the fields.  The last step is to construct a workflow that can be run to generate the index. Create the workflow by executing the wizard and adding the data connection, pipeline, and output index.  The result when viewing the workflow should look like the following:

 

Screen Shot 2018-04-11 at 3.05.11 PM.png

 

Run the workflow to generate the index.

Performing Geospatial Search

 

Now comes the fun part: performing geospatial queries. As previously mentioned, Solr has a powerful set of capabilities around geospatial points.  The simplest is the range search, where two points are provided and all content within the box they form is returned.  There are also more advanced search capabilities that utilize Solr functions: finding all points within a given distance of a point, a bounding box sized by the distance from a point, and boosting and sorting based on the distance from a point.

 

For a fuller description see the following URL:

 

https://lucene.apache.org/solr/guide/6_6/spatial-search.html

 

WORD OF CAUTION:  There were problems using some forms of the examples at this link, specifically around the usage of Solr query filters.  It may have been user error, HCI behavior, or errors (or deprecated specifications) in the examples.  Regardless, the following examples are the forms that worked with HCI.

 

The simplest form of query finds all images that reside within a rectangular box constructed from two points. This is also called a range search on geospatial data. The range search specifies the lower-left corner and the upper-right corner of the box.  An example criterion that can be used in the advanced query in the Search Console is:

 

+GPS_Coordinates:[42.377,-71.526 TO 42.378,-71.524]

 

This finds all images within the rectangle with a lower-left point of 42.377,-71.526 and an upper-right point of 42.378,-71.524.  Below is the example run in the HCI Search Console.

 

Screen Shot 2018-04-11 at 3.14.34 PM.png

 

The next example utilizes a built-in Solr geospatial query function. In order to specify additional query parameters via the HCI Search Console, it is necessary to enable this functionality. This is accomplished in the index Query Results main settings, as shown below.

 

Screen Shot 2018-04-11 at 3.31.41 PM.png

 

One such function filters content within a circle defined by a specified point and a distance from that point. This function is called geofilt. The image below illustrates what this represents.

 

GeoCircle.png

 

Below is an example in the HCI Search Console using the geofilt filter for finding all pictures that are less than 1 kilometer from the point 42.39,-71.6.

 

Screen Shot 2018-04-11 at 3.25.01 PM.png
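
The filter shown in that screenshot is a standard Solr filter query (fq) entered in the Advanced Parameters field. Assuming the GPS_Coordinates field used throughout this article, it takes a form similar to:

{!geofilt sfield=GPS_Coordinates pt=42.39,-71.6 d=1}

where sfield is the location field, pt is the center point, and d is the distance in kilometers.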

 

The last example utilizes the built-in Solr geospatial filter bbox.  This filter constructs a box centered on a given point, where the middle of each edge of the box is the specified distance from that point.

 

bbox.png

To perform this type of query, change the geofilt filter to bbox in the Advanced Parameters field as shown below.

 

Screen Shot 2018-04-11 at 3.26.51 PM.png
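
Again assuming the GPS_Coordinates field, the bbox form of the same filter would look similar to:

{!bbox sfield=GPS_Coordinates pt=42.39,-71.6 d=1}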

I hope you enjoyed this article and gained valuable knowledge on how to utilize HCI to perform geospatial search for content.

Here's a quick start guide to help you integrate with the Content Intelligence REST APIs!

 

All of the features available in the Admin and Search UI applications are also available via both the CLI and REST API.

 

Content Intelligence provides 2 distinct REST APIs:

  • Admin - Used to configure and manage a system
  • Search - Used to query across search engine indexes

 

 

REST UI

 

Both the Admin and Search applications in Content Intelligence provide a Swagger REST API UI tool for their respective APIs.

 

The Admin REST API UI can be found on any deployed Content Intelligence system at the following URL, on the default admin port:

https://<host-ip>:8000/doc/api/

 

The search REST API UI can be found here, on the default search port:

https://<host-ip>:8888/doc/api/

 

searchRESTUI.jpg

 

This REST UI tool documents and demonstrates the entire REST API, allowing you to exercise real requests/responses, list and manage system state, display curl examples, etc. This tool has proved invaluable in accelerating product integrations with Content Intelligence, while demonstrating its capabilities.

 

Expand an API to see its request/response model objects:

RESTUI1.PNG

 

 

Click "Try it out!" to run a request against the live service to see it's behavior (and get a curl example):

RESTUI2.PNG

 

 

Authentication

 

How does Content Intelligence handle authentication?

 

We use the OAuth framework to generate access tokens. The process works as follows:

 

1. Request an access token

 

Once you have a user account, you need to request an authentication token from the system. To do this, you send an HTTP POST request to the /auth/oauth endpoint on the application you're using.

 

Here's an example using the cURL command-line tool:

curl -ik -X POST https://<system-hostname>:8000/auth/oauth/ \

-d grant_type=password \

-d username=<your-username> \

-d password=<your-password> \

-d scope=* \

-d client_secret=hci-client \

-d client_id=hci-client \

-d realm=<security-realm-name-for-an-identity-provider-added-to-HCI>

 

In response to this request, you receive a JSON response body containing an access_token field. The value for this field is your token. For example:

{ "access_token" : "eyJr287bjle..." }

 

2.  Submit your access token with each REST API call

 

You need to specify your access token as part of all REST API requests that you make. You do this by submitting an Authorization header along with your request. Here's an example that uses cURL to list the instances in the system:

curl -X GET --header "Accept:application/json" https://<system-hostname>:<admin-app-port>/api/admin/instances --header "Authorization: Bearer <your-access-token-here>"

 

Notes:

 

• This same mechanism works with local admin users and remote directory servers (e.g. "identity providers"). To get a list of security realms available in the system, send an HTTP GET request to the /setup endpoint. This can be used to let users select the type of authentication credentials they will be providing. For example, to do this with cURL to get the list of realms:

curl -X GET --header 'Accept: application/json' 'https://<hostname>:<admin-app-port>/api/admin/setup'

• To get an access token for the local admin user account, you can omit the realm option for the request, or specify a realm value of "Local".

• If a token expires (resulting in a 401 Unauthorized error), you may need to generate a new one the same way as before. This expiration duration is configurable in the Admin App.

 

 

Workflow Admin

 

List all instances in the system:

curl -X GET --header 'Accept: application/json' 'https://cluster110f-vm3:8000/api/admin/instances'

 

List all workflows in the system:

curl -X GET --header 'Accept: application/json' 'https://cluster110f-vm3:8000/api/admin/workflows'

 

Run a specific workflow:

curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' 'https://cluster110f-vm3:8000/api/admin/workflows/1f7a6156-4a64-4ac0-b2e8-d73f691dea73/task'

 

Simple Search Queries

 

Querying Content Intelligence search indexes generally involves:

  • List the indexes in the system.
    • Index name is a required input for querying indexes (federated or otherwise).
  • Submit a query request, and obtain a response.
    • There are a number of queries users can perform, the most basic of which is a simple query string. When there are many results, the API also supports paging and limits on responses.

 

List all indexes in the system:

curl -X GET --header 'Accept: application/json' 'https://cluster110f-vm3:8000/api/admin/indexes'

 

Submit a simple query request:

curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{

  "indexName": "Enron",

  "queryString": "*:*",

  "offset": 0,

  "itemsToReturn": 1

}' 'https://cluster110f-vm3:8888/api/search/query'

 

Query results returned:

{

  "indexName": "Enron",

  "results": [

    {

      "metadata": {

        "HCI_snippet": [

          "Rhonda,\n\nYou need to check with Genia as I have never handled the physical power agreement matters.\n\nSusan \n\n -----Original Message-----\nFrom: \tDenton, Rhonda L.  \nSent:\tTuesday, January 15, 2002 2:17 PM\nTo:\tBailey, Susan\nCc:\tHansen, Leslie\nSubject:\tSouthern Company Netting\n\nHere's Southern.  I never received a copy of the Virginia Electric Master Netting.  We do have netting within the EEI.\n << File: 96096123.pdf >> \n\n"

        ],

        "Content_Type": [

          "message/rfc822"

        ],

        "HCI_dataSourceUuid": [

          "f1c05be1-5947-41e1-a9f1-03a98f0fa036"

        ],

        "HCI_id": [

          "https://ns1.ten1.cluster27d.lab.archivas.com/rest/enron/maildir/bailey-s/deleted_items/25."

        ],

        "HCI_doc_version": [

          "2015-07-02T09:06:02-0400"

        ],

        "HCI_displayName": [

          "RE: Southern Company Netting"

        ],

        "HCI_URI": [

          "https://ns1.ten1.cluster27d.lab.archivas.com/rest/enron/maildir/bailey-s/deleted_items/25."

        ],

        "HCI_dataSourceName": [

          "HCP Enron"

        ]

      },

      "relevance": 1

      "id": "https://ns1.ten1.cluster27d.lab.archivas.com/rest/enron/maildir/bailey-s/deleted_items/25.",

      "title": "RE: Southern Company Netting",

      "link": "https://ns1.ten1.cluster27d.lab.archivas.com/rest/enron/maildir/bailey-s/deleted_items/25."

    }

  ],

  "facets": [],

  "hitCount": 478

}

 

The latest release, HCI 1.2, introduces new connector plugins that connect to external databases such as MySQL and PostgreSQL. With HCI 1.2, we also have the ability to create custom JDBC connectors for different databases using a base template, and the whole development process is simplified by the HCI plugin SDK.

 

Before developing the JDBC connector, make sure the following dependencies are available:

 

1) The latest HCI Plugin-sdk.jar

2) jdbc-1.2.0.jar -- this jar is located in the HCI installation directory.

    Example : /opt/hci/plugins/jdbc-1.2.0.jar

3) Database Specific driver jar.

    Example: ojdbc7.jar, sqljdbc.jar

4) Java JDK 1.8 or higher

5) Any build tool, such as Gradle or Maven, to package the plugin jar with the required dependencies.

 

 

Create a new project and extend the BaseJdbcConnectorPlugin Class present in the jdbc-1.2.0.jar.

Add all the unimplemented methods and the methods to override in your custom connector plugin.

The main methods to override are:

1) getJDBCDriver

2) getBatchSizeLimitPredicate

The "getJDBCDriver" method should provide the database specific driver. For example , to connect to an Oracle database use the "oracle.jdbc.driver.OracleDriver" driver.

 

The "getBatchSizeLimitPredicate" method should return the predicate for providing the default batch size to fetch when we test the data connection.

 

Example:

"OFFSET 0 ROWS FETCH NEXT " + batchSize + " ROWS ONLY";

In the above example the "bathSize" is determined by the Query Batch Size property value entered by the user during the data connection creation. This is required in order to prevent the plugin from fetching all the rows in the table. This is used as a safeguard against retrieving millions of rows from a database table, especially during a test.

 

 

Structure of the plugin:

 

 

The whole process of connecting to a database and executing SQL statements is abstracted by BaseJdbcConnectorPlugin, which simplifies development: you don't have to worry about managing database connections.

 

Use any build tool, such as Gradle or Maven, to include/exclude dependencies while packaging the plugin. Make sure the plugin.json manifest file is present in the META-INF directory of the plugin.

 

Sample Plugin.json

 

I was asked today about the role of connector plugins in Content Intelligence, so I thought I'd pass along the details.

 

Content Intelligence connector plugins can (today) operate in one of two modes:

  • List-based (CRAWL_LIST)
  • Change-based  (GET_CHANGES)

 

CRAWL_LIST mode

 

In this mode, users specify a “starting” Document by configuring the data source. The plugin's role is to perform a container listing from starting points that are requested by the HCI Crawler. HCI provides all of the bookkeeping, and decides which starting points to ask for.

 

Example:  If a content structure is as follows:

/folder1

/folder1/file1

/folder1/subfolder1

/folder2

/folder2/file2

/folder2/subfolder1

/folder2/subfolder1/file3

/file4

 

First, a user configures a starting point of “/” in the data source configuration.

 

When the workflow is executed, HCI will call the “root()” method on the plugin, which should return a Document for the “/” starting point. This is typically a "container" Document, which works like a directory.

 

HCI will call “list()” with that starting document of “/”, which should return all Documents under "/":

  • /folder1
  • /folder2
  • /file4

HCI will then call “list()” with a starting document of “/folder1”, which should return the following Documents:

  • /folder1/file1
  • /folder1/subfolder1

 

The process continues until all objects are crawled. HCI keeps track of the documents that have already been visited, and will not crawl the same object again unless directed later.

 

In continuous mode, the entire process automatically starts again from the root container. In this case, HCI sends only the Documents that changed since the previous pass to the processing pipelines.
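
As a rough illustration of the calls described above, a list-based file system connector's root() and list() methods might look like the sketch below. It reuses the DocumentBuilder calls shown elsewhere in these posts (addMetadata, getMetadataValue, setIsContainer), but the method signatures, the use of java.io.File, and the omitted imports and error handling are illustrative assumptions, not the exact plugin SDK API.

     // Sketch only: signatures and helpers are illustrative assumptions, not the exact SDK API.
     public Document root() {
          DocumentBuilder builder = callback.documentBuilder();
          builder.addMetadata("HCI_id", StringDocumentFieldValue.builder().setString("/").build());
          builder.addMetadata("HCI_URI", StringDocumentFieldValue.builder().setString("file:///").build());
          builder.setIsContainer(true);                       // "/" behaves like a directory
          return builder.build();
     }

     public Iterator<Document> list(Document container) {
          String path = container.getMetadataValue("HCI_id").toString();
          List<Document> children = new ArrayList<>();
          for (File child : new File(path).listFiles()) {     // for "/", yields /folder1, /folder2, /file4
               DocumentBuilder builder = callback.documentBuilder();
               builder.addMetadata("HCI_id", StringDocumentFieldValue.builder().setString(child.getPath()).build());
               builder.addMetadata("HCI_URI", StringDocumentFieldValue.builder().setString(child.toURI().toString()).build());
               builder.setIsContainer(child.isDirectory());   // containers are listed on a later pass
               children.add(builder.build());
          }
          return children.iterator();
     }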

 

 

GET_CHANGES mode

 

In this change-based mode, the connector plugin can collect and return Documents in any order or frequency it would like.

 

HCI calls the “getChanges()” method on plugins in this mode to return Documents. The plugin can return a plugin-defined "token" with each response. The token is opaque and is only interpreted by the plugin. HCI stores this token, and will provide the token returned in the last getChanges() call to the next call. Plugins decide what to do (if anything) with the provided token. For example, if the getChanges() call executes a time-based query to return Documents, the token can include a timestamp of the last discovered Document. On the next getChanges() call, HCI will provide this token to the plugin, which can use it to build the next query.

 

It’s completely up to the plugin to determine what to return in getChanges(), such as a batch of Documents or a single Document. This method can return no changes until the connector discovers a new Document to return.

 

 

The Role of Connector Plugins

 

For details, I've included Alan Bryant's excellent overview of the role of connector plugins here:

 

"A quick overview:

 

There are currently two modes that ConnectorPlugins can use, either list-based or change-based. Since you are working with a filesystem, you probably want list-based. You should implement getMode() to return ConnectorMode.CRAWL_LIST. You can also then implement getChanges() to throw some exception since you won't be needing it.

 

The starting point is the getDefaultConfig() method. This should define any configuration that the user should specify. In this case you should have them specify the starting path that the connector should provide access to.

 

Once the user has specified the config, build() will be called. You should construct a new instance of your plugin here with the provided config and the callback. See the examples.

 

startSession() will then be called. You should put any state associated with your plugin on this session... anything that's expensive to create or clean up. There is no guarantee that your plugin instance will be kept around between calls. The session will be cached where possible.

 

To actually crawl the datasource, we start with the root() method. This should return a Document representing the root of your datasource. Generally this should be a container (see DocumentBuilder.setIsContainer).

 

After that, the Crawler will call list() for a particular container. List should return an Iterator of Documents. Each Document generally represents something that has content to be processed (see DocumentBuilder.setHasContent) or that is a container of other Documents, like root(). These containers generally correspond to real-world objects, like Directories, but can really be any grouping you want.

 

If you are returning large numbers of Documents, look at StreamingDocumentIterator so you don't cause OutOfMemory issues by holding all the Documents in memory at once.

 

As Documents are discovered, they will be processed in the workflow. During this time, stages may ask for streams from the Document. This is implemented by calling openNamedStream(). openNamedStream should use metadata that was added in list() to be able to open the streams. So, list() just adds stream metadata (see DocumentBuilder.setStreamMetadata) and it's used later when we call openNamedStream.

 

Other things you should do:

  • get() should operate like openNamedStream being passed the StandardFields.CONTENT.
  • getMetadata returns an up-to-date version of a Document based on just a URI. It is very important that this returns the same types of metadata, in the same format, as list(). getMetadata() is used for pipeline and workflow tests. If the data is different, then the test will be useless.
  • test() should be implemented to ensure that the config is correct... basically, make sure the configured directory exists. In plugins that do network access, this can trigger SSL certificate processes."

 

 

Thanks,

-Ben

During some data assessments on CIFS filesystems I have observed some strange behavior in how HCI deals with system metadata timestamps. System metadata timestamps are the usual timestamps we all know from CIFS and NFS when looking in Windows Explorer or doing an ls -la. These are NOT the additional timestamps that can be stored with the object itself and that are created and maintained by the creating application, such as PDF writers, MS Office apps, and DICOM applications.

Here is some more information, in addition to what has already been discussed in other threads, to hopefully provide a better understanding from a single point of view.

 

I hope this gives more insight, because it is very important to understand this behavior when it comes to file age profiling.

 

 

 

Also important to know BEFORE starting such an assessment is the following:

Jon Chinitz

Installing HCI in EC2

Posted by Jon Chinitz, Feb 7, 2018

Ben asked me to put this note together. I thought I would verify the steps by actually doing them on a fresh instance that I just created. So here it goes:

 

Create the EC2 instance in AWS:

  1. Pick a base image (I like Ubuntu)2018-01-18_1556.png
  2. Decide which machine type you are going to run this on. Remember that you need a minimum of 4 cores and 16GB of RAM. I chose a t2.xlarge.2018-01-18_1557.png
  3. Configure the instance details. I create my instances in us-east-1b because that is where I have a VPC configured that connects the Waltham Lab to AWS (this way I can run HCI in EC2 and connect to HCP and other resources in the lab).2018-01-18_1558.png
  4. I make sure that I have enough disk for my instance (the default is way too small)2018-01-18_1559.png
  5. Next I associate a Security Group with the instance. This is *very important* so that you don't leave your instances exposed to the Internet (where they can get hijacked to do Bitcoin mining amongst other things...). The requirements are that you need to be able to connect to port 8000 and 8888 from outside the instance (and of course port 22 for ssh). You will see that my SG allows all addresses on the VPC subnet access to my instance. 2018-01-18_1601.png
  6. Final step is to associate an ssh key with the instance and launch it2018-01-18_1602.png
  7. Check out the instance in the EC2 console. Specifically you are going to need to capture its IP address. Note: I will be using the PRIVATE IP to talk to the instance. that's because I have the VPC that allows me to do so. You will likely use the PUBLIC IP to communicate with the instance.2018-01-18_1604.png
  8. ssh into the instance. Login with the user 'ubuntu'.2018-01-18_1606.png2018-01-18_1607.png
  9. Make sure the instance has all the up to date patches.
  10. Install Docker on your instance if it is not already there. For ubuntu you do 'sudo apt-get install docker.io'
  11. Don't forget to add the ubuntu user to the Docker group ('sudo usermod -aG docker ubuntu').
  12. Test that docker installed properly by running 'sudo docker run hello-world'.
  13. Next scp the HCI tarball onto the EC2 instance. Untar it into /opt/hci ('mkdir -p /opt/hci' if it is not already there).
  14. Install the HCI bits: 'cd /opt/hci; sudo 1.2.0.79/bin/install'
  15. run 'sudo sysctl -w vm.max_map_count=262144' -- if you forget HCI setup will remind you
  16. Run setup: 'sudo bin/setup -i <INTERNAL IP>'. Note: this is the INTERNAL IP address used by the instance. If you are installing an instance that is part of a cluster follow the instructions in the Installation Guide
  17. Run 'sudo bin/run' OR use systemctl to enable and start HCI.service (see Installation Guide)
  18. Wait for HCI to come up (either check systemctl status OR docker ps)
  19. Once HCI is up then you use a browser to complete the install. Note: In my example I am using the Private IP but in yours you are probably going to be using the PUBLIC IP port 8000:2018-01-18_1647.png2018-01-18_1648.png
  20. At this point you are ready to work with HCI2018-01-18_1651.png

Enjoy!

There has been a lot of talk recently about how to use the HCP connector, specifically how/when to use the actions Output File, Write Annotation and Write File. Following is an example of how I used these actions in a series of simple workflows.

 

I started out with a namespace (Cars) that I knew had files in it with custom metadata (CM) already attached. I actually used this namespace years ago to do some HCP MQE demos with. What I didn’t realize was that it also had some other files that didn’t have CM.

 

Step 1: Make a copy of the Cars namespace to do my work on. I accomplished this with a simple workflow that I keep around (Figure 1) .

 

Namespace_Copy_Workflow.png

Figure 1

The output section of the workflow uses the Output File action because I want to copy the entire object along with its CM (Figure 2).

 

outputFile_Example.png

Figure 2

 

Step 2: Get rid of the files that didn’t have CM. I wanted to preserve the files so the workflow copies them to another namespace (‘Split’) and then deletes them from the source. I accomplished this with a simple pipeline. The pipeline has an Output File action for the copy and a Delete action. Both are tucked inside an IF block that checks for the presence of the HCP_customMetadata field. The field is created by the HCP Connector and is set to true when the file has one or more metadata annotations (Figure 3).

 

Copy_if_no_CM_and_Delete_from_Source.png

Figure 3

 

The workflow for the copy is in Figure 4. Note that it has no Output section since the actions are performed in the processing pipeline.

 

Copy_and_Delete_Workflow.png

Figure 4

 

Step 3: To demonstrate adding a new annotation I first needed to create the annotation. The easiest way to do this is a simple pipeline that has two stages. The first creates a couple of fields using a Tagging stage (Figure 5).

 

writeAnnotation-2-tagging-stage.png

Figure 5

 

The second creates an XML stream using the fields in an XML Formatter stage. The stream will be used as the annotation. The CM is written to the object in the Output section of the workflow using the Write Annotation action (Figure 6). Note the fields and streams in the Write Annotation configuration.

Slide1.JPG

Figure 6

 

The workflow that pulls all this together is in Figure 7.

Add_CM_Workflow.png

Figure 7

Do you want more insight into the state of your workflows? Do your workflow metrics not update as frequently as you would like? Are you interested in finding out the speed of your data connector? Confused about when to use which pipeline execution mode? Would HCI deployed in a virtual environment give the same performance as a physical deployment?

 

The attached white paper addresses these questions and highlights the improvements and optimizations introduced in version 1.2 to improve the overall performance of Hitachi Content Intelligence.

 

The content of the white paper:

  • Highlights workflow performance improvements and optimizations introduced in version 1.2 and compares them to previous releases.
  • Summarizes the improved performance results of list-based data connectors.
  • Describes the methodology used to determine pure crawler performance for different data connections.
  • Demonstrates that the reporting of document failures to the metrics service has been drastically improved.
  • Recommends when to use the Preprocessing execution mode over Workflow-Agent.
  • Compares the performance of a physical HCI deployment with that of a similarly configured HCI deployed in a virtual environment.

 

 

Questions/Feedback? Please use the comments section below.

 

-Nitesh

Before Updating to 1.2.1, please view the following question/answer addressing a known issue if you have updated from a previous version to 1.2.0 and more than one week has elapsed:

 

Updating from 1.2.0: Failed to initialize UpdateManager

 

Thanks,

-Jared

Jon Chinitz

Making an HCI OVF Bigger

Posted by Jon Chinitz, Oct 21, 2017

Some of you have asked about increasing the size of the OVF that ships with Hitachi Content Intelligence. The default disk volume today is 50GB. The following quick sheet of instructions will show you how to increase the disk volume.

 

Step 1: shutdown the node

Step 2: using the vSphere console (or any other method you feel comfortable with) navigate to the node's settings and change the size of "Hard Disk 1" (I chose to increase it from 50GB to 100GB):

 

Change_VMDK_Size.png

Step 3: save the edits and restart the node.

Step 4: ssh into the node and display the mounted filesystems. The filesystem we are after is the root filesystem on /dev/sda3:

 

df.png

Step 5: run the fdisk command specifying the disk device /dev/sda:

 

fdisk.png

Step 6: while in fdisk you are going to (d)elete partition 3, then create a (n)ew partition 3 with the default starting sector and size offered up by fdisk. The starting sector is the same one /dev/sda3 had; only the size changes, to the number of sectors remaining to the end of the disk. The last thing to do is to (w)rite the new partition table back to the disk (you can safely ignore the error).

Step 7: reboot the node.

Step 8: ssh back into the node and run xfs_growfs. Be sure to specify the partition /dev/sda3:

 

xfs_growfs.png

As you can see the root filesystem has been resized to occupy the new space.

Hi,

 

This stage is an attempt to extend the HDI indexing functionality of Content Intelligence.

It transforms the ACL entries found in the HCP custom metadata left by HDI. Whenever possible, the plugin attempts to create a separate metadata field for each ACL entry and transform the permissions and user/group IDs to readable formats, as seen in the following example:

 

ResultOverview.png

As an optional step, the plugin can automatically map user/group SIDs to their respective Active Directory names, by providing the parameters shown in the following example:

 

ConfigOverview.png

Alternatively, you can search for a specific user/group by first obtaining its SID in Active Directory and then using that SID for the query.

 

The plugin cannot transform HDI RIDs in its current version.

When analyzing a collection of data of varying types, the first challenge you'll encounter is how to ensure that your processing tasks can all speak the same consistent language and provide common capabilities.

 

Can your system easily determine the difference between an email, image, PDF, or call recording? If so, how? Can the system make additional processing decisions automatically based only on the data provided?

 

Typically, these tasks are performed simply through the generation and evaluation of metadata. The MIME type of a file, for example, can help to determine how it should be processed. Geo-location metadata can help to identify where content was generated. Date/time metadata can identify when a document was created or last accessed. But can the system easily understand how to interact with, augment, and/or repair that metadata? Can it automatically determine the data types of those metadata values to assist in parsing and database entry? How does the system access the raw content streams for further analysis given only metadata? Is the data associated with any additional data streams? How are those accessed? Is the answer different each time?

 

In Content Intelligence, data is represented from any data source in the form of a data structure called a Document.

 

A Document is a representation of a specific piece of data and/or its associated metadata. Any data can be represented in this form, providing a normalization mechanism for advanced content processing and metadata analysis.

 

Documents are made up of any number of fields and/or streams.

 

A field is any individual metadata key/value pair associated with the data.

 

For example, a medical image can become a Document that contains field/value pairs such as "Doctor = John Smith" and "Location = City Hospital". These fields serve as the metadata for your files and can be used for general processing and to construct a searchable index. Fields may be (optionally) strongly typed, though all fields can still be evaluated in their native string form. Fields can also have a single value, or multiple values associated with them.

 

A stream is a pointer to a sequence of raw data bytes that live in another location, outside of the document itself, but that can be accessed on demand.

 

Streams typically point to larger data files that would be prohibitively expensive to load into memory as Document fields, such as the full text content of a large PDF file. Rather than spending system resources passing this large amount of data through a pipeline, Content Intelligence uses these streams to access data and read it from where it lives, on demand. This is accomplished through stream metadata, which is evaluated by the connector to determine which data to pull into the system for streamed processing. These data streams are typically analyzed within the system without requiring the full contents of the stream to be loaded into memory.

 

Here's a visual example of a Document in Content Intelligence representing a PDF file:

 

documentFieldStreams.png

Notice that this Document has a number of metadata fields defined, such as Content_Type, and HCI_filename. Processing stages may add, remove, and change these metadata fields to build a self describing entity. Tagging additional fields to Documents can direct other processing stages in how they should process this Document to extract additional value. 

 

This Document also has a "streams" section, where it defines 2 named streams. First, there's the HCI_content stream, which contains the raw bytes of the PDF file. Second (having been stored on HCP), we see an additional custom metadata annotation stream named .metapairs,  containing additional XML formatted metadata associated with this Document.

 

At any time during processing, each individual data stream associated with this Document can be read by name from the original content source. When directed, the system streams the data from the original data connection back to the processing stage that requested it. This allows for tasks such as reading that XML annotation, parsing the information contained, and adding that information to the Document as additional fields for further processing. Like fields, streams can also be added/removed from the Document on demand, so that other processing stages can easily consume it.

 

Creating and Updating Documents

 

Content Intelligence data connectors and processing stages both enable flexible interactions with Documents. See a previous blog on writing custom plugins for more details.

 

For example, a custom file system connector may perform a directory listing to identify metadata and create a Document for each file it finds. Each Document would contain fields representing that file's metadata. Creating a Document is accomplished through the use of the DocumentBuilder, obtained from the PluginCallback object:

 

     Document emptyDocument = callback.documentBuilder().build();

 

Adding fields to this existing document is accomplished using the builder as follows:

 

     DocumentBuilder documentBuilder = callback.documentBuilder().copy(emptyDocument);

     documentBuilder.addMetadata("HCI_id",  StringDocumentFieldValue.builder().setString("/file.pdf").build());

     documentBuilder.addMetadata("HCI_URI",  StringDocumentFieldValue.builder().setString("file://file.pdf").build());

     documentBuilder.addMetadata("category",  StringDocumentFieldValue.builder().setString("Business").build());

     Document myFileDocument = documentBuilder.build();

 

This Document now includes required fields "HCI_id", containing the unique identifier of the file on that data source, and "HCI_URI", which has a single value "file://file.pdf" defining how to remotely access it. It also contains a custom field: "category = Business". You can do this with any information you obtain about this Document, effectively building a list of metadata associated with it that can be accessed by other parts of the system easily.

 

Now, let's allow callers to access the raw data stream from this Document by attaching a stream named "HCI_Content". Because we're only adding a pointer to the file (not actual stream contents), we use the setStreamMetadata method:

 

     DocumentBuilder documentBuilder = callback.documentBuilder().copy(myFileDocument);

     documentBuilder.setStreamMetadata("HCI_Content",  Collections.emptyMap());

     Document myFileDocumentWithStream = documentBuilder.build();

 

Notice that we haven't set any stream metadata for this new stream, and the provided Map is empty. This is because connectors already use the standard "HCI_Content" stream name to represent the raw data for this file. This directs the system to use the HCI_URI field to read the file (e.g. from the local filesystem) and present the stream contents to the caller.

 

If you have an InputStream, you can also write streams to system managed temp space using setStream:

 

     DocumentBuilder documentBuilder = callback.documentBuilder().copy(myFileDocument);

     documentBuilder.setStream("xmlAttachment",  Collections.emptyMap(), inputStream);

     Document myFileDocumentWithStream = documentBuilder.build();

 

When writing this inputStream to HCI, the system will attach additional stream metadata containing the local temp file path this file was written to. Stream metadata can be used, for example, to store any additional details required to tell the connector how it should read this data when asked. This metadata can tell the system to load the file from the temp directory where it was stored. All temporary streams are deleted automatically when workflows complete.

 

Working with Documents

 

Callers from other processing stages can read  fields and streams from the provided Document as follows:

 

     // Reading fields

    String category = document.getMetadataValue("category").toString();  

     // Reading streams

     try (InputStream inputStream = callback.openNamedStream(document, streamName)) {

           // Use the inputStream for processing

           // Add additional metadata fields to the Document based on the contents found

     }

 

Processing Example

 

Consider a virus detection stage, tasked with reading the content stream of each individual Document, and adding a metadata field to indicate "PASS" or "FAIL". This stage would follow the procedures above to first analyze the contents, and again to add additional metadata to the Document for evaluation by other stages.

 

     private Document scanForVirus(Document document) {    

          DocumentBuilder documentBuilder = callback.documentBuilder().copy(document);

          // Analyze the content stream from this document

          try (InputStream inputStream = callback.openNamedStream(document, "HCI_Content")) {

               // Determine if there's a virus!

               boolean foundVirus = readStreamContentsAndCheckForVirus(inputStream);   

               // Add the result to the document

               documentBuilder.addMetadata("VirusFound",  BooleanDocumentFieldValue.builder().setBoolean(foundVirus).build());

          }

          return documentBuilder.build();

     }

 

A subsequent stage could then check the value of VirusFound on incoming Documents, and take steps to quarantine any files on the data sources where the virus was detected.

 

This work can be performed without directly interacting with the data sources themselves - just by interacting with the Document representations in the Content Intelligence environment. This eliminates much of the complexity of dealing directly with client SDKs, connection pools, and retry logic, simplifying the development of new processing solutions.

 

Standardizing on field and stream names (such as HCI_URI and HCI_content) can reduce the custom configuration required on each processing stage by leveraging built-in, out-of-the-box defaults. This can help to eliminate many common configuration mistakes, such as typos in field names, while promoting the re-use of stages.

 

I hope this demonstrates the flexibility and convenience provided by standardizing on a useful data structure such as the Content Intelligence Document. Whether the data is a tweet, a database row, or an office document, the data can be represented, accessed, analyzed, and augmented in the same consistent way. By using a standard, normalized mechanism for accessing and consuming information, you can quickly generate reusable code that can help to quickly satisfy a number of use cases. Even those you haven't thought of yet...

 

Thanks for reading!

-Ben

Are you looking for a deeper overview of Hitachi Content Intelligence (and a peek under the covers)? Are you using the system and want to better understand how to optimize for specific use cases? Look no further - this is the blog for you.

 

Content Intelligence is a software solution comprised of 3 distinct bundled layers:

 

First, an embedded services platform leverages the portability of modern container technology, enabling flexible and consistent deployment of the complete solution in physical, virtual, and cloud environments. From there, it adds the ability to cluster, scale, monitor, update, triage, and manage the solution via REST API, CLI, and UI. Controls are provided for scaling, configuring, load balancing, and even repairing specific services. Plugin and service frameworks support easily extending and evolving system capabilities to meet custom use cases using a provided SDK.

 

Next, an advanced content processing engine allows for connecting to data sources of interest and processing the information by categorizing, tagging, and augmenting metadata representing each item found. Deep analysis against raw data streams produces both raw text (enabling keyword search) and additional metadata. Optimized for large scale parallel processing, the included workflow engine can blend structured and unstructured information into a normalized form that is perfect for aggregating data for reports, triggering notifications, and/or building search engine indexes.

 

Finally, a text and metadata search component delivers comprehensive search engine indexing and index management capabilities. Tools are provided for designing, building, tuning, and optimizing search engine indexes. The system allows for scaling and monitoring locally managed indexes and/or registering external indexes to participate in globally federated queries. A full-featured customizable search application is provided - supporting secure access to query results that may be automatically tuned to the needs of specific user groups and use cases.

 

arch.png

 

For a deep dive into the architecture, feature set, and best practices of Content Intelligence, see the attached whitepaper below.

 

-Ben

One of the simplest ways to further optimize a search engine index is to register stopwords.

 

Stopwords are terms that are typically irrelevant in searches, like "a", "and", and "the". Removing these terms while indexing can significantly reduce index size without adversely impacting user query results.

 

Stopwords can affect the index in three ways: relevance, performance, and resource utilization.

 

  • From a relevance perspective, these high-frequency terms tend to throw off the scoring algorithm, and you won't get the best possible matching results if you leave them in. At the same time, if you remove them, you can return bad results when the stopword is actually important. Choose stopwords wisely!

 

  • From a performance perspective, if you don’t specify stopwords, some queries (especially phrase queries) can be very slow by comparison, because more terms are compared to each indexed document.

 

  • From a resource utilization perspective, if you don’t specify stopwords, the index is much larger than if you remove them. Larger indexes require more memory and disk resources.

 

Because they are effectively filtered from the index, stopwords are not considered when matching query terms with index terms. For example, when using stopwords {do, me, a, this}, a query for “do me a favor” would match a document containing the phrase “this favor”, making "favor" the most important search term impacting matches.

 

This is typically the desired behavior: the same processing performed at index time is also performed at query time to “normalize” the user input and associate it with matches. The best matches get the highest relevancy score and appear higher in query results.

 

However, if literal exact phrases that include these terms are important, fewer stopwords can be better. For example, removing “do” as a stopword in the example above would cause the phrase query “do me a favor” to NOT match “this favor”, but the query would still match a document containing “do this favor”.

 

The HCI index stopwords file (see "Index > Advanced > stopwords.txt") is used by the HCI_text and HCI_snippet fields. This file is empty by default for newly created indexes in 1.1.X releases, but will be populated with defaults in future releases.  It is highly recommended that you add relevant stopwords to this file prior to indexing!

 

A conservative example English stopword list that can satisfy the majority of use cases would be the following:

 

a

an

and

are

as

at

be

but

by

for

if

in

into

is

it

no

not

of

on

or

such

that

the

their

then

there

these

they

this

to

was

will

with

 

Example stopwords files in different languages are also available in the product as examples. See the "stopword_<country/language>.txt" files under the "Index > Advanced > lang" folder in the Admin application.  The above list comes from the default English stopwords_en.txt  file, taken from Lucene's StopAnalyzer.

 

Happy optimizing!

 

Thanks,

-Ben