Tuesday, February 04, 2014

Parallelization monster framework for Pentaho Kettle

Our team always ends up ROFLing when we try to find names for strange-looking ETL process diagrams. This monster has no name yet:


This is a parallelization framework for Pentaho Kettle 4.x. As you probably know, the upcoming version of Kettle (5.0) has a native ability to launch job entries in parallel, but we haven't got there yet.

To run a job in parallel, you call this abstract monster job and provide it with 3 parameters:

  • Path to your job (which is supposed to run in parallel).
  • Number of threads (concurrency level).
  • Optional flag that says whether to wait for completion of all jobs or not.
As for the number of threads, the framework as shown supports up to 8, but it can easily be extended.

How this stuff works: the "Thread #N" transformations are executed in parallel, and each one receives a copy of all the rows. Each transformation then filters the rows according to the given number of threads, so that only the relevant portion of the rows is passed on to its job (Job - Thread #N). For example, if the original row set was:

           ["Apple", "Banana", "Orange", "Lemon", "Cucumber"]

and the concurrency level was 2, then the first job (Job - Thread #1) would get ["Apple", "Banana", "Orange"] and the second job would get the rest: ["Lemon", "Cucumber"]. All the other jobs would get an empty row set.
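
To make the split concrete, here is a rough Python sketch of the filtering idea. It is not the actual Kettle transformation: the function name rows_for_thread and the ceiling-based chunk size are my assumptions, chosen only because they reproduce the fruit example above.

```python
import math

MAX_THREADS = 8  # the framework described here has 8 thread slots


def rows_for_thread(rows, thread_no, concurrency):
    """Return the rows that thread slot `thread_no` (1-based) keeps.

    Assumption: rows are split into contiguous chunks of
    ceil(len(rows) / concurrency) rows each.
    """
    if not 1 <= concurrency <= MAX_THREADS:
        raise ValueError("concurrency must be between 1 and MAX_THREADS")
    if thread_no > concurrency or not rows:
        return []  # slots beyond the concurrency level get nothing
    chunk = math.ceil(len(rows) / concurrency)
    start = (thread_no - 1) * chunk
    return rows[start:start + chunk]


rows = ["Apple", "Banana", "Orange", "Lemon", "Cucumber"]
# Every slot sees the full row set and keeps only its own portion.
partitions = [rows_for_thread(rows, n, 2) for n in range(1, MAX_THREADS + 1)]
```

With a concurrency level of 2, only the first two entries of partitions are non-empty; slots 3 to 8 end up with empty lists.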

Finally, there's a flag that tells the framework whether to wait until all jobs have completed.
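
The effect of that flag can be pictured with plain Python threads. This is only an analogy, not how Kettle does it internally: launch, run_job and wait_for_completion are made-up names, and run_job merely stands in for the job you want to parallelize.

```python
import threading


def launch(job, partitions, wait_for_completion=True):
    """Start one thread per partition; optionally block until all finish."""
    threads = [threading.Thread(target=job, args=(part,)) for part in partitions]
    for t in threads:
        t.start()
    if wait_for_completion:
        for t in threads:
            t.join()  # return only after every job is done
    return threads


processed = []
lock = threading.Lock()


def run_job(part):
    # Hypothetical stand-in for the real job: just record the rows it received.
    with lock:
        processed.extend(part)


threads = launch(run_job, [["Apple", "Banana", "Orange"], ["Lemon", "Cucumber"], []])
```

With the flag switched off, launch returns right away and the jobs keep running in the background.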

I hope someone will find the attached transformations useful. And if not, at least help me find a name for the ETL diagram. Fish, maybe? :)