Wednesday, October 28, 2015

Parallelization or parallel execution in Talend

Parallel execution can be achieved in Talend in several ways. Let’s look at some of the methods and the important considerations.
What is parallelization in Talend?
In parallelization, a Talend Job partitions a data flow into multiple threads and executes them simultaneously to improve performance.
This may not be fully clear at this stage, but the examples below should help.
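To make the idea of partitioning a data flow concrete, here is a minimal Java sketch. It is not code generated by Talend; the class and method names (PartitionSketch, partition, parallelSum) are made up for illustration, and it assumes the rows are simple integers and that the number of partitions is at least 1. Each chunk is processed on its own thread and the partial results are merged at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; this is not what Talend generates.
class PartitionSketch {

    // Splits the rows into at most `partitions` contiguous chunks of near-equal size.
    static List<List<Integer>> partition(List<Integer> rows, int partitions) {
        List<List<Integer>> chunks = new ArrayList<>();
        int chunkSize = (rows.size() + partitions - 1) / partitions; // ceiling division
        for (int start = 0; start < rows.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, rows.size());
            chunks.add(new ArrayList<>(rows.subList(start, end)));
        }
        return chunks;
    }

    // Sums the rows by handing each chunk to its own thread, then merging the partial sums.
    static long parallelSum(List<Integer> rows, int partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Integer> chunk : partition(rows, partitions)) {
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int value : chunk) {
                        sum += value;
                    }
                    return sum;
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // blocks until that chunk is done
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            rows.add(i);
        }
        System.out.println(parallelSum(rows, 4)); // prints 5050
    }
}
```

The result is the same as a single-threaded sum; only the work is spread over several threads.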
As far as I know, parallelization can be set up using three main methods:
  1. Enabling Multi-threaded Execution
  2. Using the tParallelize component
  3. Using the parallel execution for Execution Plan in Talend Administration Center
Enabling Multi-threaded Execution
The Multi thread execution feature allows you to run multiple Subjobs that are active in the workspace in parallel. When the Subjobs do not have any dependencies between them, you might want to launch them at the same time. For example, consider a Job that contains several Subjobs with no dependencies between them. When you run it, you will notice that only the first Subjob starts, and the others start one after another as each preceding Subjob completes. At the bottom, under the Job tab, you can see that the “Multi thread execution” feature is not enabled (highlighted in a red box).
Select the Multi thread execution check box to enable the parallel execution.
When the Use project settings check box is selected, the Multi thread execution check box may be greyed out and unavailable. In that case, clear the Use project settings check box to activate the Multi thread execution check box. Once Multi thread execution is enabled, you will notice that all the Subjobs run in parallel.
This feature works best when the number of threads (in general, one Subjob counts as one thread) does not exceed the number of processors of the machine you use for parallel execution. Otherwise, some of the Subjobs have to wait until a processor is freed up.
Also note that running more parallel threads than you have CPUs brings no further gain: the extra threads wait for a processor and only add overhead.
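The processor limit can be sketched in plain Java as follows. This is only an illustration of the principle, not how Talend Studio implements Multi thread execution; the names ThreadCapSketch, workerCount and runAll are hypothetical. The pool never has more worker threads than the machine has processors, so any additional Subjobs queue up until a thread is free.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not Talend's implementation.
class ThreadCapSketch {

    // Caps the number of worker threads at the number of available processors.
    static int workerCount(int subjobCount) {
        int cpus = Runtime.getRuntime().availableProcessors();
        return Math.max(1, Math.min(subjobCount, cpus));
    }

    // Runs every subjob on a capped pool and returns how many completed.
    static int runAll(List<Runnable> subjobs) {
        ExecutorService pool = Executors.newFixedThreadPool(workerCount(subjobs.size()));
        AtomicInteger completed = new AtomicInteger();
        for (Runnable subjob : subjobs) {
            // Subjobs beyond the pool size wait in the queue until a thread is free.
            pool.execute(() -> {
                subjob.run();
                completed.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        List<Runnable> subjobs = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            final int id = i;
            subjobs.add(() -> System.out.println("Subjob " + id + " on " + Thread.currentThread().getName()));
        }
        System.out.println("Completed: " + runAll(subjobs));
    }
}
```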
tParallelize component
tParallelize helps you manage complex Job systems. It executes several Subjobs simultaneously and synchronizes the execution of a Subjob with the other Subjobs within the main Job. This component can be used as either a start or a middle component in a main Job built of numerous Subjobs. It can be connected to preceding or following components using Parallelize or Synchronize links.
Let’s see the difference between the Parallelize and Synchronize links of the tParallelize component.
Subjobs linked with Parallelize run in parallel, regardless of which ones finish first.
A Subjob linked with Synchronize starts to run only when all the Parallelize-linked Subjobs have finished.
In this example, Job1, Job2 and Job3 run in parallel, and Job4 runs only when Job1, Job2 and Job3 have ended. So tParallelize is the best component when you need some Subjobs to run in parallel and another Subjob to start only after all the parallelized Subjobs have finished. The tParallelize component also makes your Job design more flexible.
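The Parallelize and Synchronize behaviour described above can be mimicked in plain Java with CompletableFuture. This is a rough analogy under my own assumptions, not the code Talend generates for tParallelize; the class name ParallelizeSketch is hypothetical, and each job simply records its name when it finishes.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical analogy only; not the code generated for tParallelize.
class ParallelizeSketch {

    // Runs Job1..Job3 in parallel, then Job4 once all three have ended.
    // Returns the job names in the order in which they finished.
    static List<String> run() {
        List<String> finished = new CopyOnWriteArrayList<>();

        // "Parallelize" links: the three jobs start independently of each other.
        CompletableFuture<Void> job1 = CompletableFuture.runAsync(() -> finished.add("Job1"));
        CompletableFuture<Void> job2 = CompletableFuture.runAsync(() -> finished.add("Job2"));
        CompletableFuture<Void> job3 = CompletableFuture.runAsync(() -> finished.add("Job3"));

        // "Synchronize" link: Job4 starts only after Job1, Job2 and Job3 have all ended.
        CompletableFuture.allOf(job1, job2, job3)
                .thenRun(() -> finished.add("Job4"))
                .join();

        return finished;
    }

    public static void main(String[] args) {
        // Job1..Job3 may appear in any order; Job4 is always last.
        System.out.println(run());
    }
}
```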
Using the parallel execution for Execution Plan in Talend Administration Center
This feature does the same thing as the tParallelize component, but at the deployment level. It is only available to those who deploy Jobs via TAC (Talend Administration Center) in the Enterprise Edition.
In the Execution Plan list, select the plan to which you want to add tasks. Click “Root: please configure this node” in the planned task tree view panel on the right. The Edit planned task panel opens.
To define multiple tasks for parallel execution at the root node, select the Use parallel execution box. The configuration options for parallel execution appear. The example below shows how to set up Jobs to run in parallel.

https://help.talend.com/download/attachments/9309045/execution_plan_add_task_parallel_execution.png?version=1&modificationDate=1355436560000&api=v2
https://help.talend.com/download/attachments/9309045/execution_plan_task_added.png?version=1&modificationDate=1355436562000&api=v2
Cheers! Uma

7 comments:

  1. thanks for sharing nice article... Can you please share the Project zip files, if possible..

    Thanks,
    Viraj

  2. Very clear explanation.Great
    Thanks for sharing

  3. (Maharashtra State Board of Secondary and Higher Secondary Education) is one of the most famous boards in India which provides affiliation to many state schools within the state at
    mahresult.nic.in 2018

  4. I am using Talend Open Studio for Big Data version 7.1, but i do not have the tParallelize component. Please let me know whether this component is subscription based and how i can have access to it?

  5. Thankyou so much Uma for clearly defining the Parallelization in Talend
