Introduction

In this article, I will describe how the ODI Generator tool is used in a large datawarehouse project in one of the most important Austrian banks. The department which is using the tool is responsible for the collection of data originated from hundreds of different sources across the company. These data are provided via text and binary files or collected from relational databases. Every day, we are loading thousands of files and getting data from tens of source databases. All this big amount of data is loaded into an Oracle database.

Data is subject to an ELT process consisting of several distinct steps (load to state, validation, integration). Most of the jobs which are part of the process are implemented as ODI scenarios build on top of ODI packages which act as a wrapper of ODI mappings where the business rules are implemented. As the whole process is pretty specific, we are using custom knowledge modules developed in-house.

Besides the primary Oracle database, we are also running a big data environment. Data loaded in Oracle primary database is copied every day to Hadoop. Oozie jobs used for this purpose have been also generated using ODI Generator.

We have to maintain several environments (DEV, FAT, UAT, PROD) for both Oracle and big-data. We are continuously adding new data sources to our data lake. On top of this, we also have to, quite frequently, make changes to existing data sources import jobs as a result of changes done in the source systems.

Without a tool that can generate ODI artifacts (and other types of artifacts like Oozie workflows, SQL code) it would be difficult to successfully manage in our daily activity. 

Source definition file

Each distinct data source (i.e. data file or table in a source relational database) is described using a so-called source stream definition file. This is nothing else than a plain text file containing an annotated CREATE TABLE statement. CREATE TABLE statement provides information about the structure of the source file (indirectly) or describes the structure of the source database table from where we get the data.

The annotations are used to provide any type of additional information which does not "fit" the CREATE TABLE statement. Such additional information can be:

  • information about the source file like the number of header lines, record separator character(s). field separator characters(s), fixed record length (when is the case), and so on.
  • information about numeric and date value formats
  • information about the tables where the data is loaded (i.e. partitioning, size factor)

Such source stream definition file is shown in the following code snippet:

 

As you can see it contains information about the structure of the source data (we extrapolate the structure of the source file from it) and a series of annotations that the ODI Generator is using to generate ELT artifacts. As mentioned based on these simple files, with the help of templates, we are able to generate the following types of artifacts:

  • ODI datastores and ODI mappings
  • ODI packages and scenarios
  • DDL
  • Oozie workflows

The source stream definitions files are organized hierarchically (i.e. based on the source system which provides the data they are describing). We are storing them in a Git repository in order to share them between the team members and to track the changes.

Templates

If you were reading the tutorials available on this site you might have seen that ODI Generator is using SQL-like statements for creating ODI artifacts. These statements can be written manually but of course, this would be too much burden for even small projects. We create these SQL-like statements using ODI Generator itself. ODI Generator includes a code generation engine as well. Using as input different types of configuration information, ODI Generator can generate any code is required. This is done using an industry-standard template engine.

Statements like the annotated CREATE TABLE shown above are one of the inputs that the ODI Generator understands. There are also other types of configuration information. These can be used for creating simple or complex ODI mappings. 

To not make things complex, I will not provide here too many details about how the template engine. I will just show below what is the result of applying two different templates on the source stream definition file shown above.

The following code snippet contains the result of applying the "ODI" dedicated template. The code can then be executed in ODI Generator to create the actual ODI objects.

 

As mentioned before we do not generate only ODI objects. The following snippet code shows a Groovy program created by applying the Oozie dedicated template. Executing the code below in a Groovy shell we obtain an Oozie workflow which we use to load data from Oracle to Hadoop. 

Besides these two examples, we have a small number of other templates that we use to create the DDL statement for our database objects, importing configuration information in a database, ad-hoc bulk operations, etc.

How it works

As we can fully generate our ELT, the developers spend most of the time analyzing the source data (i.e. file structures, data access, data capture mechanism, etc.). Once we gathered the information necessary to create the source stream definition file shown above the rest is a matter of minutes. We create the definition file and use the ODI Generator (most of the time the CLI version) to create the necessary ODI, database, and big-data artifacts.

Conclusion

ODI Generator tool greatly improved our capacity to deliver in time and to cope with the large number of business requests we receive and it proved to be useful for scenarios that are not strictly related to ODI.

There are other features that are not shown in this article. I will describe them in a future post.

Please contact me if you are interested in evaluating/using ODI Generator or knowing more about how it can help you in data projects.