10 Years of Pentaho Community Meeting

200 people from almost 40 countries came to Mainz, Germany, to join the 10th Pentaho Community Meeting.

24 speakers from all over the world presented their latest developments and use cases and shared them with the community.

Alain Debecker, Senior Business Intelligence Consultant at TEAM PARTNERS, presented his research on data manipulation, including how to monitor data loading processes.

Extract from IT-Novum Article

At the beginning of his talk, Alain Debecker asked the audience the following questions:

What to do when your ETL moves into professional production?
When you coordinate a bunch of automatic data loads running every now and again on various crontab or Carte servers?
When you need to monitor a migration done by remote developers?
When you are responsible for the data being there, on time and correct?
Why, in those cases, not use the logging system that ships with your PDI?

The PDI logging system records the details of everything that happens during execution and stores them in a database table.
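
For orientation, here is a rough sketch of what such a logging table looks like. PDI lets you choose the table name and generates the actual DDL for you from the transformation settings dialog; the name TRANS_LOG below is an assumption for illustration, and the columns follow the PDI default field names:

    -- Sketch of a PDI transformation log table. TRANS_LOG is a
    -- hypothetical name; the columns follow the PDI defaults, and the
    -- exact layout depends on which log fields you enable.
    CREATE TABLE TRANS_LOG (
        ID_BATCH      INTEGER,        -- one batch id per run
        TRANSNAME     VARCHAR(255),   -- transformation name
        STATUS        VARCHAR(15),    -- 'start', 'running', 'end', ...
        LINES_WRITTEN BIGINT,         -- records written during the run
        LINES_UPDATED BIGINT,         -- records updated during the run
        ERRORS        BIGINT,         -- number of errors
        STARTDATE     TIMESTAMP,      -- start of the run
        ENDDATE       TIMESTAMP,      -- end of the run
        LOGDATE       TIMESTAMP,      -- last update of this log row
        LOG_FIELD     CLOB            -- full text of the execution log
    );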

As it sits in a database, it is easily displayed in a dynamic report or on a web page so that you can follow the load in real time. At the same time, you will see the load of all ETLs running and recording on the same logging database, whether launched by an automatic scheduler or by other developers. It is also quite easy to go back in history and check if and when something went wrong. Additionally, you can program and schedule a morning mail to yourself so that you know whether your data server is up and running with the correct data, even before you reach the office. Enabling PDI logging is a seven-click operation explained here, with best practices here.
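
To give an idea of what such a report or page runs under the hood, here is a minimal sketch of the query, assuming the TRANS_LOG table sketched above:

    -- Recent runs, newest first: the essence of a "follow the load" page.
    SELECT TRANSNAME, STATUS, STARTDATE, ENDDATE,
           LINES_WRITTEN, LINES_UPDATED, ERRORS
    FROM TRANS_LOG
    ORDER BY STARTDATE DESC;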

All you need in order to monitor the automated ETL loads is the logging of the transformations. As a matter of fact, during consultancies for customers with heavily automated ETL systems, we discovered that in the end what you are after is how much data is loaded per table, and this data is always loaded by a transformation. Logging at job level is useful during coding, synchronization and optimization, but it does not give you the helicopter view you need to quickly review the nightly ETL. And if you have some authority over the developers, try to simplify your life with a simple convention: as far as possible, each transformation loads only one table and has the same name as this table.
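
With that convention in place, the helicopter view reduces to one aggregate row per table; a sketch, again against the assumed TRANS_LOG table (date arithmetic varies by database):

    -- Rows loaded per table last night (transformation name = table name).
    SELECT TRANSNAME AS table_name,
           SUM(LINES_WRITTEN + LINES_UPDATED) AS rows_loaded,
           SUM(ERRORS) AS errors
    FROM TRANS_LOG
    WHERE STARTDATE >= CURRENT_DATE - 1   -- dialect-dependent
    GROUP BY TRANSNAME
    ORDER BY errors DESC, rows_loaded DESC;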

On GitHub, you’ll find a JSP that continuously displays the logging table on a web page. Drop loggings.jsp into a tomcat/webapps/etl folder, then type yourhost:8080/etl/loggings.jsp in your browser. We made sure to use the PDI defaults and no extra jar libraries, so it should work as soon as config.properties points to your database connection.

The information displayed was selected after much trial and error. It shows, for each transformation, the date of the run, the state (running, finished, …), the duration, the number of records written or updated, and the number of errors. The transformation name is a link to the recent load history, so you can immediately see whether the last number of output records is suspicious and whether the load time is increasing. Selecting the date of a run sends you to the detailed log, which you can follow in real time.
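
The history drill-down behind the transformation name link boils down to a query like this sketch ('customer' is a hypothetical transformation name; the duration is derived from the two timestamps):

    -- Recent history of one transformation: spot a suspicious output
    -- count or a creeping load time at a glance.
    SELECT ID_BATCH, STARTDATE,
           ENDDATE - STARTDATE AS duration,   -- or your dialect's timestamp diff
           LINES_WRITTEN, LINES_UPDATED, ERRORS
    FROM TRANS_LOG
    WHERE TRANSNAME = 'customer'              -- hypothetical name
    ORDER BY STARTDATE DESC;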

You’ll also find a MorningMail.ktr which sends you a small report of the loads of the previous 24 hours and does a little cleanup of the historical data. Schedule it on your crontab server for a time that is convenient to you, so you can, for example, read it on your smartphone on the way to the office.
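
The cleanup part amounts to pruning old rows from the log table. A minimal sketch, assuming the same TRANS_LOG table and an illustrative 90-day retention (the actual transformation may keep a different window):

    -- Keep the logging table small: drop runs older than the retention window.
    DELETE FROM TRANS_LOG
    WHERE LOGDATE < CURRENT_DATE - 90;   -- dialect-dependent date arithmetic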

The graph.jsp is a work-in-progress piece of code that will eventually replace the history log with a more graphical view. Watch for updates and, in the meantime, send your feedback to Alain. So much can be done, and it is so much more productive to be driven by the users’ needs in that respect.

The next step is to enable stopping and restarting your ETL from the logging display page. We are blocked by two requests still open in the Hitachi Vantara Pentaho JIRA:

PDI-16549, to record the transformation path instead of the transformation name. While the name is enough to see which transformation did not perform its job correctly, and when, you definitely need to get your hands on the actual ktr to restart it, hence the need for the full path.
PDI-16550, to record the parameters of the transformation at the time of the run, so that you can restart it under the same conditions. As the parameter values are sometimes given by an orchestrating job, guessing them the next day is not always an easy task.
In conclusion: set up your PDI logging, install the ETL-pilot on your server, watch in real time what you and your colleagues are loading, and enjoy its MIT license to make it suit your own needs. And when you have a few months of data, tell Alain what you think about the graphs, and do not forget to help the community by voting for PDI-16549 and PDI-16550 as massively as possible.