Community Data Validation

Document created by Pedro Goncalves on Aug 10, 2017. Last modified by Diogo Belem on Nov 9, 2017.


 

CDV brings the ability to create validation tests on the Pentaho Business Analytics server, in order to verify both the integrity of the server itself and the data being used by the server.

 

Credits: Webdetails Team, Lead - Pedro Alves

 

Version: 16.01.22

License: MPLv2

Available in: Marketplace

 


 

Motivation

Very often, we want several questions about the data to be answered before anyone even notices there is a problem:

  • Do we have up-to-date data?
  • Can we trust the data?
  • How long did the queries take to run?
  • Do we have wrong data? (e.g. duplicated users in the community)
  • Can we validate big chunks of reprocessed data?
  • Did the data format change?
  • Is the data stale? (e.g. the number of Twitter followers not updating)
  • Do we have peaks or valleys in the data? (due to double processing or no processing)
  • We need a way to handle known effects. (e.g. the Christmas dip)
  • We need to correlate independent data sources.
  • Do we have a big number of 'unknowns'? (tk=1 in the DW)
  • Do we have clearly wrong rows in the resultset? (e.g. a stray line that should not be there)

So we decided to build CDV: a data validator that periodically runs a set of tests to answer the questions above.

You can see the original RFC for CDV. Note that not every feature has been implemented.

 


 

User Interface

 

Main Features

See existing validations:

  • Fire a specific validation
  • Get the URL of a specific validation or of all validations

 

CDA Query error dashboard:

  • Query and parameters
  • Errors
  • Incidents

 

Create / Edit validation:

  • Define query name
  • Define queries and parameters
  • Define validation functions
  • Choose log alerts (when to raise error / severe / warn / ok)
  • Define error messages
  • Choose duration thresholds
  • Define cron schedules

 

Validation status dashboard

 

Duration Dashboard to identify slow points in the system:

  • Query and parameters
  • Duration
  • Incidents

 


 

Creating New Validations

 

Creating Validations

All the validations created will be stored in solution/cdv/tests/.

The files will be named Name.cdv and will internally contain a JSON-like test definition with the following structure.

 

cdv.registerTest({
  id: 1,                        // unique test id
  type: "query",
  name: 'test',
  group: "",                    // group used to organize tests in the UI
  path: 'cdv/test/test.cdv',
  createdBy: '',
  createdAt: 1339430893246,     // creation timestamp (epoch milliseconds)
  queries: [{                   // CDA queries whose resultsets feed the validations
    cdaFile: "",
    dataAccessId: "",
    parameters: {}
  }],
  validations: [{               // validation functions applied to the resultsets
    validationName: "",
    validationType: "",
    validationFunction: function(rs, conf) { }
  }],
  executionTimeValidation: {    // thresholds for query duration alerts
    expected: 100,
    warnPercentage: 0.30,
    errorPercentage: 0.70,
    errorOnLow: true
  },
  cron: "0 0 10 * ? *"          // schedule on which the test runs automatically
});

 

Each test must have a unique ID.

You can use different datasources by selecting different CDA files, and set up the respective tests:

 

queries: [{
  cdaFile: "/solution/cda/test.cda",
  dataAccessId: "1",
  parameters: [...]
},
{
  cdaFile: "/solution/cda/test2.cda",
  dataAccessId: "2",
  parameters: [...]
}],
validations: [{
  validationType: "custom",
  validationFunction: function(rs, conf) {
    // Check each query result for data; with conf.testAll set, every query must
    // have data, otherwise a single query with data is enough to raise the alarm
    var exists = rs.map(function(r) {
      return r.length > 0;
    }).reduce(function(prev, curr) {
      return conf.testAll ? (curr && prev) : (curr || prev);
    });
    return exists ? Alarm.ERROR : Alarm.OK;
  }
}]

 

We'll be using the Steel-Wheels Sample Data solution to create some examples of the validations that can be done.

As an example, the following MDX query returns the Quantity and Sales measures for a specific year, broken down by quarter and month.

 

select NON EMPTY {[Measures].[Quantity], [Measures].[Sales]} ON COLUMNS,
  {Descendants(
    [Time].[${yearParameter}],
    [Time].[${yearParameter}], AFTER
  )} ON ROWS
from [SteelWheelsSales]

 

 

The resultset returned by this query looks like this, with each quarter row followed by its month rows:

["QTR1", 4561, 445094.69],
["Jan", 1357, 129753.6],
["Feb", 1449, 140836.19000000006],
["Mar", 1755, 174504.89999999997],

["QTR2", 5695, 564842.02],
["Apr", 1993, 201609.55],
["May", 2017, 192673.11000000002], 
["Jun", 1685, 170559.36000000004],  

["QTR3", 6629, 687268.8699999998],
["Jul", 2145, 225486.21000000002], 
["Aug", 1974, 197809.30000000002], 
["Sep", 2510, 263973.36],  

["QTR4", 19554, 1980178.4199999995], 
["Oct", 5731, 589963.9], 
["Nov", 10862, 1086720.4000000001], 
["Dec", 2961, 303494.11999999994]

 

We can test whether the resultset has data for all months and quarters as expected, and test the variations between months and quarters to detect peaks or valleys in the data caused by double processing or no processing. For this, we can run the test below, using the CDA file with the MDX query shown above.

 

cdv.registerTest({
  id: 99999,
  type: "query",
  name: 'Existence of data',
  group: "Steel-Wheels",
  path: 'cdv/tests/steelwheels-existence.cdv',
  createdBy: 'Webdetails',
  createdAt: 1339430893246,
  queries: [{
    cdaFile: "/plugin-samples/cdv/steelwheels-tests.cda",
    dataAccessId: "monthlyQuery",
    parameters: {
      yearParameter: "2003"
    }
  }],
  validations: [{
    validationName: "Steel-Wheels Data Validation",
    validationType: "custom",
    validationFunction: function(rs, conf) {
      var success = true,
          dif1 = [], dif2 = []; // placeholders for quarter / month variations (variation checks not shown in this excerpt)

      // Test existence of data: 4 quarters + 12 months = 16 expected rows
      var rowCount = rs[0].resultset.length;

      if ( rowCount < 16 ) {
        return { type: "ERROR", description: "Missing data in Steels-Wheels!" };
      }

      return success ? "OK" : { type: "ERROR", description: "Missing data in Steels-Wheels!" };
    }
  }],
  executionTimeValidation: {
    expected: 100,
    warnPercentage: 0.30,
    errorPercentage: 0.70,
    errorOnLow: false
  },
  cron: "0 0 10 * ? *"
});

 

Since the resultset is supposed to contain all quarters and months of a year, we expect it to have 16 rows, so we can use that count to check whether we have all the data for the year we want. As an alternative, we can also check whether the resultset actually contains values, as in the sketch below.
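
A minimal sketch of that alternative, assuming the same rs[0].resultset layout shown above (columns 1 and 2 holding Quantity and Sales) and the same return conventions used by the sample test:

validationFunction: function(rs, conf) {
  var rows = rs[0].resultset;
  // Flag the test if any row is missing its Quantity or Sales value
  for (var i = 0; i < rows.length; i++) {
    if (rows[i][1] == null || rows[i][2] == null) {
      return { type: "ERROR", description: "Empty values in the Steel-Wheels resultset!" };
    }
  }
  return "OK";
}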

If the test fails, the message defined in the return statement is shown:

 

Steel-Wheels: Existence of data and variation - 1 ERROR [Missing data in Steels-Wheels!]

 

If the previous test passes, the sample then runs the variation tests. It uses a trigger of less than 10% variation per quarter and more than 200% variation per month, but you're free to change these thresholds and watch the results. A sketch of such a check follows.
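
The variation checks themselves are not reproduced in the sample above. A rough sketch of what they could look like follows, assuming the resultset layout shown earlier (quarter rows named "QTR1".."QTR4" followed by their month rows, with Sales in column 2); the 10% and 200% triggers and the returned messages are illustrative, not the exact code shipped with the sample:

validationFunction: function(rs, conf) {
  var rows = rs[0].resultset,
      quarters = [], months = [];

  // Split Sales values into quarter and month series
  rows.forEach(function(r) {
    (r[0].indexOf("QTR") === 0 ? quarters : months).push(r[2]);
  });

  // Relative variation between two consecutive values
  var variation = function(prev, curr) { return (curr - prev) / prev; };

  for (var q = 1; q < quarters.length; q++) {
    // Alert when a quarter varies by less than 10% against the previous one (possible valley / no processing)
    if (variation(quarters[q - 1], quarters[q]) < 0.10) {
      return { type: "WARN", description: "Quarter-on-quarter variation below 10%" };
    }
  }
  for (var m = 1; m < months.length; m++) {
    // Alert when a month grows by more than 200% against the previous one (possible peak / double processing)
    if (variation(months[m - 1], months[m]) > 2.0) {
      return { type: "ERROR", description: "Month-on-month variation above 200%" };
    }
  }
  return "OK";
}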

At the bottom, in executionTimeValidation, you can set the expected time for the query to run, and the warnPercentage and errorPercentage margins that trigger alerts when a query takes too long to run. If the query runs too fast and errorOnLow is set to true, you will also receive an alert.

Finally, you can schedule the test to run automatically through the cron property, using standard cron scheduling expressions.

If we use another CDA file, or another dataAccessId that returns a new resultset, it can be accessed with rs[1].resultset, where the index 1 follows the order in which the queries are declared in the queries section, from top to bottom, as in the sketch below.
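
For instance, a test with two queries (reusing the CDA file and dataAccessId from the sample above; the second yearParameter value is only illustrative) could compare both resultsets inside a single validation function, with rs[0] matching the first query and rs[1] the second:

queries: [{
  cdaFile: "/plugin-samples/cdv/steelwheels-tests.cda",
  dataAccessId: "monthlyQuery",
  parameters: { yearParameter: "2003" }
}, {
  cdaFile: "/plugin-samples/cdv/steelwheels-tests.cda",
  dataAccessId: "monthlyQuery",
  parameters: { yearParameter: "2004" }
}],
validations: [{
  validationName: "Compare two years",
  validationType: "custom",
  validationFunction: function(rs, conf) {
    // rs[0] -> first query (2003), rs[1] -> second query (2004)
    var rows2003 = rs[0].resultset,
        rows2004 = rs[1].resultset;
    return (rows2003.length === rows2004.length)
      ? "OK"
      : { type: "WARN", description: "2003 and 2004 returned a different number of rows" };
  }
}]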

The tests are sorted by groups, defined when each test is created. Within each group, each line corresponds to a test, showing the test name, the path to the CDA file used as datasource, the validation name, the expected duration of the query, the status of the test, and an Options button.

If the test returned a WARN the last time it ran, the font color changes to orange; in case of an ERROR, it changes to red.

 

Invocation and Scheduling

There are 2 ways to call the validations:

  • By URL Request;
  • Scheduled Calls.

The URL will be based on the ID / query name (tbd). The scheduled calls are cron based.

 


 

Alerts

In the Alerts area, you can see the tests that have run, sorted by time of run, and filter them by test status.

 


 

CDA Errors

The CDA Errors area lists every error found when running CDA queries.

 


 

Slow Queries

The Slow Queries area lists the queries that took longer to execute than the expected time set in the test's CDV file.

 


 

Notifications

External Interfaces Supported

  • E-mail;
  • HTTP;
  • Nagios integration;
  • Server up check.

The last one is a very specific check: all the other integrations will fail if the server suddenly hangs, and we must be notified of that. Through the HTTP and Nagios integrations, we are able to get reports not only on individual tests, but also on test groups. This does not rerun the tests; it only reports the last status of a test.

In the HTTP case, we can pass a flag to force a test to be rerun.

For Nagios, we can have an export of test rules.

 
