Community Data Access

Document created by Pedro Goncalves on Aug 10, 2017Last modified by Diogo Belem on Nov 9, 2017
Version 11Show Document
  • View in full screen mode


 

CDA is a Pentaho plugin designed for accessing data with great flexibility. Born for overcoming some cons of the older implementation, CDA allows you to access and deliver data in different formats from the many Pentaho datasources.

 

Credits: Webdetails Team, Lead - Pedro Alves; Thomas Morgner; Daniel Einspanjer;

Will Gorman; Sergio Ramazzina; Aaron Phillips; Pentaho Community

 

Version: 16.01.22

License: MPLv2

CommunityCDA.gif

 

 

Overview

Community Data Access (CDA) was designed to allow great flexibility for datasources.

Most tools available can perform queries over a single datasource and there's no easy way to join data coming from different databases or in different languages (joining data from a SQL query with data from a MDX query). These tasks usually require an independent ETL job to synchronize different data sources and build a new database. CDA was developed as an abstraction tool between database connections and CDF (Community Dashboard Framework).

It allows data to be retrieved from multiple data sources and combined in a single output that can easily be passed on to dashboard components. It serves three main purposes:

  • Allows you to join data from different sources just by editing an XML file;
  • Cache queries providing a great boost in performance;
  • Avoids SQL injection problems within CDF;
  • Allows an easier way to export data from queries in different formats (csv, xls, etc.) through the Pentaho User Console;
  • Sort and paginate data server-side.

 


 

Concepts

CDA uses two different components: the connection (a database or Pentaho datasource to use) and the dataAccess (a query over that connection).

Connections and queries are defined in a XML file (the CDA file) lying in the solution repository. Data access is done by a call to a specific dataAccess ID in the CDA file.

Each data access may have parameters and calculated columns added to the results. It's also possible to have compoundDataAccess elements, which represent joins or unions on different queries.

CDA features an API to fetch query results, an editor to modify CDA files and a previewer to visualize the results of a query in table form.

Finally, export: query results can be returned from the API in various formats. Currently supported formats are JSON, XML, CSV, XLS and HTML visual mode.

 


 

Files

A CDA file is a XML file which defines the database connections and the queries over them. All its content must be enclosed in a CDADescriptor block.

 

<?xml version="1.0" encoding="utf-8"?>
  <CDADescriptor>
    (...)
  </CDADescriptor>

 

Inside the main block we define datasources (the database connections), data accesses (the queries themselves) and compound data accesses (joins and unions of queries).

The CDA file can also be written from inside the CDE, just by defining all the datasources, data accesses and compound data accesses, which is simpler than writing all the code.

 


 

Connections

Connections are defined inside a Datasources block. Each connection is defined by a Connection block, which contains all the necessary information. For each connection, an (unique) ID and type must be provided.

 

Supported Connection Types

  • metadata.metadata
  • sql.jdbc
  • sql.jndi
  • mondrian.jdbc
  • mondrian.jndi
  • olap4j.jdbc
  • olap4j.jndi
  • scripting.scripting
  • kettle.TransFromFile
  • xpath.xPath

olap4j, scripting, xpath and kettle connections are available only on CDA for Pentaho 3.6 onwards.

A connection defines the datasource and the way the connection is established and each connection type requires a different set of properties to be defined.

 

metadata.metadata

  • ID: the identifier of the connection.
  • XmiFile: the file containing the metadata definitions, exported from the Metadata repository.
  • DomainId: the Metadata domain as defined on the Pentaho metadata layer.

 

sql.jdbc

  • ID: the identifier of the connection.
  • Driver: the Java classname of the driver (e.g. com.mysql.jdbc.Driver).
  • URL: the URL to connect to (e.g. jdbc:mysql://localhost:3306/database).
  • User Name: the username to use
  • Password: the user's password

 

sql.jndi

  • ID: the identifier of the connection.
  • Jndi: the connection's parameters as defined in the context.xml file.

 

mondrian.jdbc

  • ID: the identifier of the connection.
  • Mondrian schema: mondrian schema file.
  • Driver: the Java classname of the driver (e.g. com.mysql.jdbc.Driver).
  • URL: the URL to connect to (e.g. jdbc:mysql://localhost:3306/database).
  • User Name: the username to use.
  • Password: the user's password.
  • Banded Mode: compact or classic.

 

mondrian.jndi

  • ID: the identifier of the connection.
  • Mondrian schema: mondrian schema file
  • Jndi: the connection's parameters as defined in the context.xml file.
  • Banded Mode: compact or classic. There is also the possibility of a denormalized mondrian.jdbc and a denormalized mondrian.jndi MDX queries that have the same options as the standard (normalized) queries. The differences between denormalized and standard (normalized) queries are explained on data accesses. The differences between compact and classic on Banded Mode are also explained on data accesses.

 

olap4j.jdbc

  • ID: the identifier of the connection.
  • Driver: the olap4j driver to use (typically mondrian.olap4j.MondrianOlap4jDriver ).
  • Url: the connection url, typically jdbc:mondrian:.
  • User name for the underlying database: the username to use.
  • Password for the underlying database : the user's password.
  • Driver for the underlying database: the Java classname of driver (e.g. org.hsqldb.jdbcDriver).
  • URL pointing at the underlying database: the database url (e.g. jdbc:hsqldb:hsql://localhost:9001/sampledata).
  • Catalog to apply to the underlying database: path to the the Mondrian schema file (e.g. solution:/steel-wheels/analysis/steelwheels.mondrian.xml).

 

olap4j.jndi

  • ID: the identifier of the connection.
  • Jndi: the connection's parameters as defined in the context.xml file. There is also the possibility of a denormalized olap4j.jdbc and a denormalized olap4j.jndi MDX queries that have the same options as the standard (normalized) queries.

 

scripting.scripting

  • ID: the identifier of the connection.
  • Language: the scripting language used; only beanshell is currently supported.
  • InitScript: the initialization script to be run before the query.

 

kettle.TransFromFile

  • ID: the identifier of the connection.
  • Kettle Transformation File: the kettle transformation file to use.
  • Variable: the value of a variable to be used in the transformation; multiple variable properties may be defined, one for each variable to use.

 

xpath.xpath

  • ID: the identifier of the connection.
  • DataFile: a XML file with the data to query.

 

Examples

You can find examples of all these types of connection in your solution repository, under public/plugin-samples/cda/cdafiles.

 


 

Data Accesses

 

Simple Data Accesses

A simple data access is a query over a connection. It's defined by a DataAccess block. CDA supports the following types of data accesses:

  • SQL;
  • MDX;
  • Metadata;
  • Kettle;
  • etc.;
  • Compositions of the above.

 

Inside a DataAccess block the query is defined inside a Query block, containing the query code.The following attributes must be defined:

  • Access Level: public or private; public data accesses are available from outside calls, while private data accesses can only be called from other data accesses (e.g. compound data accesses);
  • Parameters: lists the parameter's name, default value (the default value if the parameter value isn't specified when the data access is called) and type (String, Integer, Numeric, Date, StringArray, IntegerArray, NumericArray and DateArray);
  • Output: which columns to output and by which order;
  • Output Mode: the column's output mode, include or exclude the columns set above;
  • Columns: names of the columns;
  • Calculated Columns: the columns to be calculated by a given formula;
  • Query: the query itself;
  • Cache: the possibility to cache the results, either true or false (optional, defaults to true);
  • Cache Duration: time to keep the results in cache, in seconds (optional, defaults to 3600).

 

Reference information about the different query types

 

Banded Mode

When setting a Mondrian connection there is the possibility to choose between compact or classic Banded Mode. This option can make a lot of difference in the output of the results.

 

select NON EMPTY {[Measures].[Sales]} ON COLUMNS,
NON EMPTY ({[Markets].[EMEA].Children}) ON ROWS
from [SteelWheelsSales]
WHERE [Time].[2005]

 

This query will show the Sales from a given Market at a given Time.

In compact mode this query will output one column for Sales and one column for the selected Market, as shown below:

 

MarketsSales
Austria68250.26
Belgium25040.11
Denmark26012.87
Finland126851.71
France242956.40
Italy41509.94
Spain326798.17
Sweden31606.72
UK40802.81

 

However, if you choose the classic mode, the same query returns the full mondrian hierarchy:

 

[Markets].[(All)][Markets].[Territory][Markets].[Country][Measures].[Sales]
All MarketsEMEAAustria68250.26
All MarketsEMEABelgium25040.11
All MarketsEMEADenmark26012.87
All MarketsEMEAFinland126851.71
All MarketsEMEAFrance242956.40
All MarketsEMEAItaly41509.94
All MarketsEMEASpain326798.17
All MarketsEMEASweden31606.72
All MarketsEMEAUK40802.81

 

 

Normalized vs. Denormalized MDX

CDA allows two different types of MDX queries: standard (normalized) or denormalized. It's not within the scope of this document to elaborate over the details of data normalization/denormalization, but the following example can shed some light over the differences between the two:

 

MDX QUERY SAMPLE

select {[Measures].[Sales], [Measures].[Quantity]} ON COLUMNS,
NON EMPTY [Time].Children ON ROWS
from [SteelWheelsSales]
where ([Order Status].[${status}])

 

This query will generate one column for each measure in normalized form and have only one row for each child of the Time dimension. In denormalized form there will be only two columns added for measures, one indicating the name of the measure and another for its value, regardless of the number of measures queried.

The resulting dataset will have one row for each combination of measure and child of the Time dimension.

In a very loose way one can say that normalized minimizes row redundancy at the expense of the number of columns whereas denormalized form limits the number of columns at the expense of row redundancy.

So, in normalized form, the result of the above query is:

 

[Time].[(All)][Time].[Years][Measures].[Sales][Measures].[Quantity]
All Years20044750205.8947151
All Years20051513074.4614607
All Years20033573701.2535313

 

In denormalized form, the same query returns:

 

[Time].[(All)][Time].[Years][Measures].[MeasuresLevel]Measure
All Years2003Quantity35313
All Years2004Quantity47151
All Years2005Quantity14607
All Years2005Sales1513074.46
All Years2004Sales4750205.89
All Years2003Sales3573701.25

 

 

Parameters

Parameters are variables that can be passed on to the query, for example to create dynamic queries depending on user input.

 

select o.YEAR_ID, o.STATUS, sum(o.TOTALPRICE) as price from orderfact o
where o.STATUS = ${status} and o.ORDERDATE > ${orderDate}
group by o.YEAR_ID, o.STATUS

 

This query uses two parameters: status and orderDate. Parameters are always passed to the query as ${parameterName}.

The definition of parameters is done inside a Parameters block, where the various parameters are defined. For each parameter the type and default attributes must be specified. A third attribute, pattern must be specified for parameters of the date type.

  • type: String, Integer, Numeric (double), Date, StringArray, IntegerArray, NumericArray and DateArray;
  • default: the default value if the parameter value isn't specified when the data access is called;
  • pattern: date format. All formats accepted by the SimpleDateFormat Java class are valid.

 

<Parameters>
  <Parameter name="status" type="String" default="Shipped"/>
  <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
</Parameters>

 

Columns

The Columns block supports renaming of columns and defining of new columns, using formulas.

Each column may be renamed using a Column block, by specifying its name:

 

<Column idx="1">
  <Name>Year</Name>
</Column>

 

Calculated columns are added by a CalculatedColumn block, where the new column's name and the formula to use are defined.

Each calculated column requires two properties: Name (the name that will be output by CDA), and Formula (the column's definition itself).

Formulas are written in Open Formula format.

 

<CalculatedColumn>
  <Name>PriceInM</Name>
  <Formula>=[price]/1000000</Formula>
</CalculatedColumn>

 

Output

The Output tag define which columns to output and by which order. If not specified, all columns from the query will be returned in the same order as defined on the query.

Output requires only one attribute, the indexes to be returned. The order of the output columns is defined by the order in which the indexes are specified.

 

<Output indexes="1,0,3"/>

 

Compound Data Accesses

Compound data accesses combine the results of two queries. They're defined by a CompoundDataAccess block.

There are two types of compound data accesses supported: union and join.

 

Union

Unions take the results of two queries with the same number of columns and return the rows from both data accesses on top of each other.

A union compound data access requires only these attributes:

  • Name: the name of the compound query;
  • Top: the id of the query that will stay on top;
  • Bottom:the id of the query that will stay on the bottom;
  • Parameters: lists the parameter's name, default value (the default value if the parameter value isn't specified when the data access is called) and type (String, Integer, Numeric, Date, StringArray, IntegerArray, NumericArray and DateArray) that are passed on to the compound query.

If the columns from both queries have different names, it will prevail the names of the query that is on top.

 

<CompoundDataAccess id="3" type="union">
  <Top id="2"/>
  <Bottom id="1"/>
  <Parameters>
    <Parameter name="year" type="Numeric" default="2004"/>
  </Parameters>
</CompoundDataAccess>

 

Join

Joins take the results of two queries and join them by a specified set of keys. The result has the columns of both queries. The join performed is full outer.

Join compound data accesses require that both the Left and Right data accesses be specified (by id) as well as the keys (column ids on the source queries) by which to join.

A join compound data access requires the following attributes:

  • Name: the name of the compound query;
  • Left:  the first query;
  • Left Keys: the id or ids of the columns from the first query that are common to the second query;
  • Right: the second query;
  • Right Keys:  the id or ids of the columns from the second query that are common to the first query;
  • Parameters: lists the parameter's name, default value (the default value if the parameter value isn't specified when the data access is called) and type (String, Integer, Numeric, Date, StringArray, IntegerArray, NumericArray and DateArray) that are passed on to the compound query;
  • Output Options: the ids of the columns to output from both queries by order, starting with the columns from the left query and them the ones from the right query;
  • Output Mode: the column's output mode, include or exclude the columns set above.

 

<CompoundDataAccess id="3" type="join">
  <Left id="1" keys="0,1"/>
  <Right id="2" keys="0,1"/>
  <Parameters>
    <Parameter name="status" type="String" default="Shipped"/>
    <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
  </Parameters>
  <Output indexes="0,1,2,5,6"/>
</CompoundDataAccess>

 

The columns that are common in both queries must be of the same type.

 


 

Web API Reference

CDA is called by HTTP requests to $BASE_URL/pentaho/plugin/cda/api/.

 

Methods Supported

  • doQuery;
  • listQueries;
  • getCdaList;
  • listParameters;
  • clearCache;
  • listDataAccessTypes.

 

CDA File

Some of the methods listed above require the CDA file to be specified in the URL. This can be done by providing the full path from the solution repository.

 

path=public/plugin-samples/cda/cdafiles/compoundJoin.cda

 

Output

All methods listed above (except doQuery and clearCache) output the results as a JSON formatted object, with two key values: resultset with the data itself and metadata with info about column indexes, names and data types.

 

{
  "resultset": [
    ["3", "", "union"],
    ["2", "Sql Query on SampleData", "sql"]
  ],
  "metadata": [{
    "colIndex": 0,
    "colType": "String",
    "colName": "id
  },
  { 
    "colIndex": 1,
    "colType": "String",
    "colName": "name"
  },
  {
    "colIndex": 2,
    "colType": "String",
    "colName": "type"
  }]
}

 

clearCache returns the string "Cache cleared".

doQuery returns the query results in one of four formats: JSON, XML, XLS or CSV. The default is JSON.

 

Method Reference

 

doQuery

doQuery establishes the necessary connections and performs the query specified by the dataAccessId parameter.

You may specify the values of the different parameters used by the query (if not specified default values defined on the CDA file will be used). Parameters are passed to the URL as paramParameter, where Parameter is the parameter's name.

Check this example: if the data access with id=2 of the file $SOLUTION_REPOSITORY/public/plugin-samples/cda/cdafiles/compoundJoin.cda has a parameter called status and you want the query to use Shipped as the parameter the URL to use is:

 

$BASE_URL/pentaho/plugin/cda/api/doQuery?path=public/plugin-samples/cda/cdafiles/compoundJoin.cda&dataAccessId=2&paramstatus=Shipped 

 

doQuery has an optional argument, outputType, where the format of the returned data is specified as either json, xml, csv, xls or html. The default value is json.

 

listQueries

This method returns a JSON formatted list of all public queries in the specified CDA file.

 

getCdaList

Returns a JSON formatted list of all CDA files in the solution repository. Doesn't take any arguments.

 

listDataAccessTypes

Returns a JSON formatted list with all supported data access types and connections and their definitions.

 

listParameters

Returns a list of parameters used by a data access. Requires both the path and dataAccessId to be specified.

 


 

Previewer

The CDA previewer is an application that allows CDA queries to be returned as HTML and displayed on table form.

Parameters can be changed using input boxes and a dropdown menu allows changing the dataAccessId. The previewer is called from the URL.

 

$BASE_URL/pentaho/plugin/cda/api/previewQuery?path=public/plugin-samples/cda/cdafiles/compoundJoin.cda

 

The desired file location is specified in the path argument.

 


 

Editor

An embedded editor allows editing CDA files under the scope of the Pentaho server, avoiding the need to use external editors and manually refreshing the solution repository.

This only works on some browsers, like Firefox or Chrome. The editor is called from the URL. Check the example below.

 

$BASE_URL/pentaho/plugin/cda/api/editFile?path=public/plugin-samples/cda/cdafiles/compoundJoin.cda

 

The desired file location is specified in the path argument.

The editor does not allow creating new CDA files. You must manually create a blank CDA file on the solution repository and refresh the solution repository to be able to edit the file.

 


 

Samples

See the solution repository of your Pentaho installation for sample CDA files.

 

$SOLUTION_REPOSITORY/public/plugin-samples/cda

 

 

Attachments

    Outcomes