Scott Baker

GDPR starts with awareness: addressing Article 4(1) with Hitachi Content Intelligence

Blog Post created by Scott Baker on Dec 13, 2016

Introduction

 

Article 4(1) of the General Data Protection Regulation (GDPR) defines personal data as “any information relating to an identified or identifiable natural person,” and specifically acknowledges that this includes both ‘direct’ and ‘indirect’ identification. For example, if you know me by name, that’s direct identification; if you describe me as “the Sr. Director of the Emerging Business Portfolio at Hitachi Data Systems,” that’s indirect identification.

 

The same GDPR article expands this definition by adding that identification can also be by reference to “an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity.”  Wow, that’s vague. Does that mean that an IP address, a website cookie string or the geographical metadata of a picture matches this classification of an “identification number”?  The vagueness is intentional. EU regulators were minded to treat the definition of “personal data” as broadly as possible, based on the content, purpose and result of the data.  And, yes, I have to imagine that the examples I just provided will generally be considered personal data, even if the organization does not see them that way.  The take-away here is that GDPR is meant to be descriptive, not prescriptive.

In this blog post, I want to focus on a specific category of personal data, as defined in Article 4(1), that is commonly referred to as “special” or “sensitive personal data.”  The concepts presented in this post can be easily expanded to address broader GDPR definitions and include additional data as described and used within organizational assets.  The “special” data that I am referring to is personal data that is afforded extra protection and covers data elements such as a National Insurance Number, Personal ID Number, Credit Card Number, etc.  Providing that extra protection means that an organization must first be able to scan each file using specific patterns and data profiling.

As shown in Figure 1, this is just one area where Hitachi Content Intelligence excels: enabling an organization to readily identify, locate, categorize and reference files that contain PII.  This is the first step on the journey to GDPR compliance: awareness.

 

 


Figure 1: Overview of a compliance architecture from Hitachi

Finding Personally Identifiable Information with Hitachi Content Intelligence

 

One of the more powerful features of Content Intelligence is the Content Class. A Content Class is essentially a query expression that defines how to find and extract information from within the contents of the file being processed or within the file’s metadata. A Content Class can express its pattern as an XPath in an XML file, a JSONPath in a JavaScript Object Notation (JSON) document, or a regular expression.  Clearly we are jumping into this topic pretty deep and pretty quick.  If you’re not willing to wait for the explanation, jump to the end of the post for the final video.  Otherwise, stay with me and I will explain how easily Content Intelligence can use this capability to address Article 4(1) of GDPR, or any other regulatory effort where data profiling and pattern matching are necessary.

Content Classes are used as customizable data processing stages that profile and match data based on the query expression created by the organization.  Of the three types mentioned previously, this example deals with regular expressions.  A regular expression is a sequence of characters that describes a pattern. It can be used to find text within a larger chunk of text, validate that a string complies with the conditions of the pattern, and extract the subset of text that matches the expressed rule.  Now this post would get wildly out of control if I were to try to explain regular expressions in detail. Instead, I would suggest YouTube as a way to get started with regular expressions if necessary.  However, I will cover one specific regular expression to provide some insight into how they are used by Content Intelligence.
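To make those three uses concrete (find, validate, extract), here is a minimal sketch in Python. It is not how Content Intelligence is configured; it simply runs the bank sort code expression from the table later in this post through Python's standard `re` module. The sample sentence and variable names are my own assumptions for illustration.

```python
import re

# Sort code expression copied from the PII_UK_BANK_SORT_CODE row of the table in this post.
SORT_CODE = re.compile(r"\b[0-9]{2}[-][0-9]{2}[-][0-9]{2}\b")

# Hypothetical sample text, invented for this illustration.
text = "Payment sent from account 12345678, sort code 12-34-56, on Tuesday."

# 1. Find: does the pattern occur anywhere in the larger text?
found = SORT_CODE.search(text) is not None

# 2. Validate: does an entire string comply with the pattern?
is_valid = SORT_CODE.fullmatch("12-34-56") is not None
is_rejected = SORT_CODE.fullmatch("12-34-5") is None

# 3. Extract: pull out every substring that matches.
extracted = SORT_CODE.findall(text)

print(found, is_valid, is_rejected, extracted)
```

The same expression serves all three purposes; only the way it is applied changes.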

 

Content Classes: Breaking Down Regular Expressions

 

In the associated demo, one form of PII is the license plate number of an automobile that I own (an indirect identifier).  When I was in the US Army, I spent 3.5 years in Yorkshire, UK (more on that some other time), and had the license plate “YG01 SMR” assigned to my car. Finding that pattern within a file can be achieved with a regular expression, as shown in Figures 2 and 3 below:


Figure 2: Pattern matching with regular expressions

 


Figure 3: Highlighting a matched pattern

To achieve this match, the pattern shown in Figure 4 was defined within a Content Class:


Figure 4: The regular expression to find a pattern matching a UK National Insurance ID Number

The expression in Figure 4 is broken down into its individual elements as follows:

    • \b  asserts the position of the search to be on a word boundary (e.g. the beginning of the string being evaluated)
    • [A-CEGHJ-PR-TW-Z]{1}  matches the first character exactly one time unless it is the "D, F, I, Q, U, or V" character
    • [A-CEGHJ-NPR-TW-Z]{1}  matches the second character exactly one time unless it is the "D, F, I, O, Q, U, or V" character
    • [0-9]{6}  matches the next six characters as numerical digits ranging from 0 to 9
    • [A-DFM]{0,1}  optionally matches one of the letters "A" through "D", "F", or "M" in the last position of the string
    • \b  asserts the position of a word boundary (e.g. the end of the string being evaluated)
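If it helps to see that breakdown as something you can run, the sketch below writes the PII_UK_NATIONAL_INS_CODE expression (as listed in this post's table) in Python's verbose regex mode, with one comment per element. The function name `contains_ni_number` and the sample values are hypothetical, and this uses Python's `re` module rather than Content Intelligence itself.

```python
import re

# The National Insurance expression from this post's table, one element per line.
NI_NUMBER = re.compile(
    r"""
    \b                     # word boundary before the first character
    [A-CEGHJ-PR-TW-Z]{1}   # first letter: anything except D, F, I, Q, U or V
    [A-CEGHJ-NPR-TW-Z]{1}  # second letter: same exclusions, and also no O
    [0-9]{6}               # exactly six digits
    [A-DFM]{0,1}           # optional suffix letter: A to D, F or M
    \b                     # word boundary after the last character
    """,
    re.VERBOSE,
)


def contains_ni_number(text):
    """Return True if the text contains something shaped like an NI number."""
    return NI_NUMBER.search(text) is not None


# Made-up sample values, not real identifiers.
print(contains_ni_number("Employee NI: AB123456C"))  # well-formed
print(contains_ni_number("DA123456A"))               # 'D' is not allowed first
print(contains_ni_number("AO123456A"))               # 'O' is not allowed second
```

Verbose mode ignores whitespace and comments inside the pattern, so the expression behaves the same as the single-line version.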

 

You can learn more about regular expressions from a number of sites on the internet.  One of my favorites is RegEx101, as it includes a built-in testing environment that also describes how the regular expression matches a given string (if at all).

 

For this post, the table below contains the full complement of the regular expressions created for this GDPR Content Class.

 

METADATA Field Name         Regular Expression

PII_UK_PHONE                (((\+44\s?\d{4}|\(?0\d{4}\)?)\s?\d{3}\s?\d{3})|((\+44\s?\d{3}|\(?0\d{3}\)?)\s?\d{3}\s?\d{4})|((\+44\s?\d{2}|\(?0\d{2}\)?)\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?
PII_EMAIL                   \b[\w-\.]+@([\w-]+\.)+[\w-]{2,4}\b
PII_ID_NUMBER               \b(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]?)(?!00)\d\d\3(?!0000)\d{4}\b
PII_ADDRESS                 \b\d+\s[A-z]+\s[A-z]+\b
PII_UK_POSTAL_CODE          \b([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)\b
PII_UK_NATIONAL_INS_CODE    \b[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-DFM]{0,1}\b
PII_UK_BANK_SORT_CODE       \b[0-9]{2}[-][0-9]{2}[-][0-9]{2}\b
PII_CREDIT_CARD             \b((4\d{3})|(5[1-5]\d{2}))(-?|\040?)(\d{4}(-?|\040?)){3}|^(3[4,7]\d{2})(-?|\040?)\d{6}(-?|\040?)\d{5}\b
PII_NAME                    \A.*
PII_UK_LICENSE_PLATE        \b([A-Z]{3}\s?(\d{3}|\d{2}|\d{1})\s?[A-Z])|([A-Z]\s?(\d{3}|\d{2}|\d{1})\s?[A-Z]{3})|(([A-HK-PRSVWY][A-HJ-PR-Y])\s?([0][2-9]|[1-9][0-9])\s?[A-HJ-PR-Z]{3})\b
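To give a feel for how a set of named expressions like this can profile a piece of text, here is a rough Python sketch. It is only an approximation of the idea: the `profile_text` function and the sample sentence are my own inventions, not the Content Intelligence API. I have used just three of the expressions, since some of the others may need small adjustments for Python's `re` engine.

```python
import re

# A subset of the table above, copied as listed. Keys are the metadata field names.
PII_PATTERNS = {
    "PII_UK_POSTAL_CODE": r"\b([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)\b",
    "PII_UK_NATIONAL_INS_CODE": r"\b[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-DFM]{0,1}\b",
    "PII_UK_BANK_SORT_CODE": r"\b[0-9]{2}[-][0-9]{2}[-][0-9]{2}\b",
}


def profile_text(text):
    """Return {field name: [matched strings]} for every pattern that matches."""
    found = {}
    for field, pattern in PII_PATTERNS.items():
        # group(0) is the whole match, regardless of any groups inside the pattern.
        matches = [m.group(0) for m in re.finditer(pattern, text)]
        if matches:
            found[field] = matches
    return found


# Hypothetical document content with made-up values.
sample = "Sort code 12-34-56, NI number AB123456C, postcode YO1 7HH."
result = profile_text(sample)
for field, values in result.items():
    print(field, values)
```

Each field name maps to the values it found, which is the kind of metadata that makes a file searchable by the PII it contains.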

 

Enough Already.... On With The Demonstrations!

 

Content Intelligence provides users with the ability to test their work as they design Workflows, Pipelines, Content Classes, etc.  Testing during the design process ensures the final results match the intent, using a small subset of data rather than executing those same tasks against a large repository.   Consider that processing a large number of repositories and files against a complex data model can consume a great deal of time and resources.  Coming to the end of that effort and not having anything to show for it can be incredibly frustrating.
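Outside of the product, the same "test small before you run big" habit can be sketched in a few lines of Python. The `check_pattern` helper and the sample strings below are hypothetical; the idea is simply to keep a short list of strings that should and should not match, and to check an expression against them before pointing it at a full repository.

```python
import re


def check_pattern(pattern, should_match, should_not_match):
    """Return a list of failure messages; an empty list means every sample behaved as expected."""
    compiled = re.compile(pattern)
    failures = []
    for sample in should_match:
        if compiled.search(sample) is None:
            failures.append(f"expected a match: {sample!r}")
    for sample in should_not_match:
        if compiled.search(sample) is not None:
            failures.append(f"unexpected match: {sample!r}")
    return failures


# Sort code expression from the table in this post; the samples are made up.
SORT_CODE = r"\b[0-9]{2}[-][0-9]{2}[-][0-9]{2}\b"

failures = check_pattern(
    SORT_CODE,
    should_match=["Sort code: 40-12-34", "40-12-34"],
    should_not_match=["401234", "Date 2016-12-13"],
)
print(failures or "all samples behaved as expected")
```

The negative samples matter as much as the positive ones: a date such as 2016-12-13 looks similar to a sort code, and it is worth confirming that it is not flagged.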

 

For this post, the videos (best viewed in full screen mode) marked as Demo 1 and Demo 2 walk you through the testing process for the Content Class, and its parent Pipeline, to support the GDPR requirements for Article 4(1).


Demonstration 1: Pattern matching and data profiling with Content Classes

 

Demonstration 2: Including the Content Class in a Workflow Pipeline

Testing the logic of a Workflow, Pipeline or Content Class is an ideal way to garner stakeholder support.  In real time, users can be presented with the results of a test for consideration and input.  Together, the content manager and end-user refine how the data is mapped and enriched to ensure it is of the highest quality and relevance.

Following the tests, executing the Workflow can seem a bit anti-climactic.  It essentially performs the same actions shown in the previous two demonstrations, with the addition of committing the results to an output location.  Take a look at Demo 3 below, which covers the Workflow execution in detail.

 

Demonstration 3: Executing the Workflow

After building and centralizing the document index with the Workflow’s execution, the content manager must now consider how the results are presented to the end-user.  In these last two video demonstrations, learn how to further heighten the quality and relevance of the newly discovered data by tailoring its representation to the intended audience.  Demonstration 4 provides an overview of how the results are customized, and Demonstration 5 walks through the Hitachi Content Search end-user application to navigate files containing PII.




Demonstration 4: Customizing the results set for the end-user

 

Demonstration 5: Working with the resulting index using the Search App (powered by Hitachi Content Intelligence)

 

Summary

That was quite a great deal to cover in a single blog post – it is possible that a series of four may have been easier to consume.  Regardless, the growing concern over the implications of GDPR caused me to err on the side of detail to ensure you received as much a “How-To” as a “Why It’s Relevant” kind of post.

Hitachi Content Intelligence, combined with the broader Hitachi Content Portfolio, provides several strong benefits for organizations concerned with PII protection. There are too many to list in this post, but four specific ones are outlined below.  Keep in mind that this is not an end-to-end solution. Rather, it’s the first step along a broader GDPR journey, one that begins with awareness of where PII exists, and to what degree, within any organizational asset.

  1. Contextual Analysis: using real-time data pattern matching and profiling to isolate direct, indirect and customized occurrences of PII.
  2. Adaptable Controls: the results can be tailored to match how an end-user thinks and works with data.  This flexibility continually refines the quality of data through end-user involvement and drives greater adoption through ease of use.
  3. Feedback: being able to test the Content Class, Pipeline and Workflow makes it possible to quickly isolate errors before the end-user accesses the result set.  A bad end-user experience can erode trust in the solution and the data, and a lack of trust can quickly result in a lack of system and data use (among other things).
  4. Managed Access: granular policies (either locally defined or relayed from the organization's security services) protect the sensitivity of PII by only allowing those with authority to access the result set or see specific parts of PII data.  For example, a user could be authorized to access and explore a result set but not see any PII data; Content Intelligence provides a means to redact that sensitive data selectively.
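As a rough illustration of what selective redaction means in item 4, the Python sketch below masks matched values unless the caller is authorized to see that field. This is not the Content Intelligence mechanism, which is driven by policies; the `redact` function, its `authorized_fields` parameter and the sample text are assumptions made purely for the example.

```python
import re

# Two of the expressions from the table in this post, keyed by metadata field name.
SENSITIVE = {
    "PII_UK_NATIONAL_INS_CODE": re.compile(
        r"\b[A-CEGHJ-PR-TW-Z]{1}[A-CEGHJ-NPR-TW-Z]{1}[0-9]{6}[A-DFM]{0,1}\b"
    ),
    "PII_UK_BANK_SORT_CODE": re.compile(r"\b[0-9]{2}[-][0-9]{2}[-][0-9]{2}\b"),
}


def redact(text, authorized_fields=frozenset()):
    """Mask every match of a sensitive pattern unless its field is authorized."""
    for field, pattern in SENSITIVE.items():
        if field not in authorized_fields:
            text = pattern.sub(f"[{field} REDACTED]", text)
    return text


# Made-up sample values.
record = "NI AB123456C, sort code 12-34-56"
print(redact(record))
print(redact(record, authorized_fields={"PII_UK_BANK_SORT_CODE"}))
```

The first call hides both values; the second leaves the sort code visible because that field was explicitly authorized.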

I'd be happy to receive feedback on this topic and on any other videos and tutorials that would be helpful. Just let me know.  Please use the comments section to share your feedback and recommendations on how, together, we can make Hitachi Content Intelligence a powerful addition to your GDPR initiatives.

 

Cheers!

Scott Baker

 

NOTES:

  1. My thanks to Duncan Brown of IDC who pointed out that I was referencing the wrong GDPR Article in this post.  I've modified the post on 12/14/2016 to reflect the correct Article.
  2. My thanks to Jon Chinitz who suggested that the last video be bisected to make it easier to consume.
