Wednesday, October 28, 2015

Parallelization or parallel execution in Talend

Parallel execution can be achieved in Talend in several ways. Let’s look at some of the methods and the important considerations for each.
What is parallelization in Talend?
In parallelization, a Talend Job partitions a data flow into multiple threads and executes them simultaneously to improve performance.
This may not be entirely clear at this stage, but the examples below should help.
To my knowledge, parallelization can be set up using three main methods.
  1. Enabling Multi-threaded Execution
  2. Using the tParallelize component
  3. Using the parallel execution for Execution Plan in Talend Administration Center
Enabling Multi-threaded Execution
The Multi thread execution feature allows you to run multiple Subjobs that are active in the workspace in parallel. When the Subjobs do not have any dependencies between them, you might want to launch them at the same time. For example, the example below shows several Subjobs within a Job with no dependencies between them. When you run this Job, only the first Subjob starts; the others start one after another as each preceding Subjob completes. On the Job tab at the bottom, you can see that the “Multi thread execution” feature is not enabled (highlighted in the red box).
Select the Multi thread execution check box to enable the parallel execution.
When the Use project settings check box is selected, the Multi thread execution check box may be greyed out and unavailable. In this situation, clear the Use project settings check box to activate the Multi thread execution check box. Once you enable Multi thread execution, you will notice that all three Subjobs run in parallel.
This feature works best when the number of threads (in general, one Subjob counts as one thread) does not exceed the number of processors of the machine you use for parallel execution. Otherwise, some of the Subjobs have to wait until a processor is freed up.
In other words, running more parallel Subjobs than you have CPUs brings no extra benefit: the additional Subjobs wait for a processor and only add overhead.
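The idea of running independent Subjobs at the same time, while keeping the thread count within the number of processors, can be sketched outside Talend. The following is a minimal Python illustration, not code generated by Talend; the `run_subjob` function and the Subjob names are hypothetical stand-ins.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def run_subjob(name):
    """Hypothetical stand-in for a Talend Subjob: wait briefly, then report."""
    time.sleep(0.1)
    return f"{name} done"


# Three independent Subjobs with no dependencies between them (assumed names).
subjobs = ["Subjob1", "Subjob2", "Subjob3"]

# Keep the number of threads within the number of processors on this machine.
max_workers = min(len(subjobs), os.cpu_count() or 1)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=max_workers) as pool:
    # map() returns results in the order the Subjobs were submitted.
    results = list(pool.map(run_subjob, subjobs))
elapsed = time.perf_counter() - start

print(results)
print(f"{max_workers} worker(s), finished in {elapsed:.2f}s")
```

With `max_workers` set to 1 the Subjobs run one after another, which corresponds to the behaviour seen before the Multi thread execution check box is selected.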
tParallelize component
tParallelize helps you to manage complex Job systems. It executes several subjobs simultaneously and synchronizes the execution of a subjob with other subjobs within the main Job. This component can be used as either a start or a middle component in a main Job built of numerous subjobs. It can be connected to preceding or following components using Parallelize or Synchronize links.
So what is the difference between Parallelize and Synchronize in the tParallelize component?
Subjobs linked with Parallelize run in parallel, regardless of which one finishes first.
A subjob linked with Synchronize starts to run only when all the parallelized subjobs have finished.
In this example, Job1, Job2 and Job3 run in parallel, and Job4 runs only when Job1, Job2 and Job3 have ended. So tParallelize is the best component when some subjobs must run in parallel and another subjob must start only after all of the parallelized subjobs have finished. The tParallelize component also makes your job design more flexible.
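The Parallelize and Synchronize behaviour described above can be sketched in a few lines of Python. This is only an analogy under stated assumptions, not how Talend implements tParallelize; `run_job` and the job names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

log = []                      # order in which the jobs were started
log_lock = threading.Lock()   # protects the shared log across threads


def run_job(name):
    """Hypothetical stand-in for a subjob: record that it ran."""
    with log_lock:
        log.append(name)
    return name


# "Parallelize" links: Job1, Job2 and Job3 start together, in no fixed order.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(run_job, n) for n in ("Job1", "Job2", "Job3")]
    wait(futures)             # block until every parallel job has finished
    for f in futures:
        f.result()            # re-raise any error from a parallel job

# "Synchronize" link: Job4 runs only after all parallel jobs have ended.
run_job("Job4")

print(log)
```

The first three entries in `log` can appear in any order, but `Job4` is always last.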
Using the parallel execution for Execution Plan in Talend Administration Center
This feature does the same thing as the tParallelize component, but at the deployment level. It is available only to those who deploy jobs via TAC (Talend Administration Center) in the Enterprise Edition.
In the Execution Plan list, select the plan to which you want to add tasks. Click “Root: please configure this node” in the planned task tree view panel on the right. The Edit planned task panel opens.
To define multiple tasks for parallel execution at the root node, select the Use parallel execution box. The configuration options for parallel execution appear. The example below shows how to set up jobs to run in parallel.

https://help.talend.com/download/attachments/9309045/execution_plan_add_task_parallel_execution.png?version=1&modificationDate=1355436560000&api=v2
https://help.talend.com/download/attachments/9309045/execution_plan_task_added.png?version=1&modificationDate=1355436562000&api=v2
Cheers! Uma

Thursday, October 22, 2015

Interesting Agile article from .NET magazine “simple and sweet”

I recently read an article about Agile in “net” magazine, written by Paul Woods. It explains the agile process very clearly and simply. I thought I would share some of the important points, and the whole article, in the form of screenshots.
Some Important points
Digital scrum board
Expert tips

You can read the full article in the following pages:

Cheers! Uma

Monday, October 12, 2015

Understand XPath syntax and Define Loop XPath Query and mapping XPath Query for tFileInputXML

These days, XML is one of the most commonly used source formats for data integration. To handle XML source files, an in-depth understanding of XPath is very important.
This example illustrates how to extract data from an XML source file and how to define the XPath for each required field. From this file, the monthly transaction data needs to be extracted: Month, Value and Branch.
The first step is to define the Loop XPath query. Month, Value and Branch are the repeating (loop) data, so the loop XPath points at the element that contains them:
Loop XPath query:
"/Account/LEVL1/LEVEL2/LEVEL3/LEVEL4/LEVEL5/MonthlyTransaction"


Source File:
Expected Output:
XPATH for mapping as per output requirement:
* You cannot loop over a second repeating element in the same tFileInputXML; use another tFileInputXML with its own Loop XPath query. You can then use tMap to join them and bring them into the same flow.

* Attributes of elements inside the loop can be read with a path such as "Branch/@id"

XPath syntax elements for XML file:
XPath    Description
/        The root object/element
.        The current object/element
/        Child operator
..       Parent operator
//       Recursive descent
*        Wildcard. All objects/elements regardless of their names
@        Attribute access
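The loop query and the relative mapping paths can be tried outside Talend with a small script. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, so the attribute is read with `.get("id")` rather than the "Branch/@id" path used in tFileInputXML. The XML document is a shortened, hypothetical sample (the original source file is not reproduced here), with fewer nesting levels than the loop query shown earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; the real file has more nesting levels.
SAMPLE_XML = """
<Account>
  <LEVEL1>
    <MonthlyTransaction>
      <Month>Jan</Month>
      <Value>100</Value>
      <Branch id="B01">North</Branch>
    </MonthlyTransaction>
    <MonthlyTransaction>
      <Month>Feb</Month>
      <Value>250</Value>
      <Branch id="B02">South</Branch>
    </MonthlyTransaction>
  </LEVEL1>
</Account>
"""

root = ET.fromstring(SAMPLE_XML)  # root is the <Account> element

rows = []
# Loop query: one output row per <MonthlyTransaction> element.
for node in root.findall("./LEVEL1/MonthlyTransaction"):
    rows.append({
        # Mapping queries are relative to the current loop element.
        "Month": node.findtext("Month"),
        "Value": node.findtext("Value"),
        "Branch": node.findtext("Branch"),
        # Equivalent of "Branch/@id": read the id attribute of <Branch>.
        "BranchId": node.find("Branch").get("id"),
    })

for row in rows:
    print(row)
```

Each dictionary corresponds to one output row, in the same way that each match of the Loop XPath query produces one row in the Talend flow.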


Cheers! Uma

Friday, October 9, 2015

What are the differences between Generating and Deploying via Job Conductor and Execution Plan

In Talend Administration Center (TAC), you can run or schedule a job via Job Conductor or via Execution Plan. The two options offer different features, so you should choose between them based on your situation.
Job Conductor
In Job Conductor, each job needs to be run or scheduled individually, or you have to design a master Job to run or schedule many jobs in the same execution, because you cannot define a task flow in Job Conductor. Every job also needs to be generated and deployed separately.
Execution Plan
https://help.talend.com/images/54/bk-tac-ug-542/page-execution_plan.png
A task execution plan outlines the dependencies among the different tasks that form an execution plan, something we cannot see in the task list on the Job Conductor page. These dependencies are defined using a hierarchical view of main and child tasks, where each task in the hierarchy can have a subordinate task. From this page, you can define a task execution plan and then add different tasks to it in a specific order, depending on the two conditions OnOk and OnError, or simply by using After. The tasks are then executed in the specified order.
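The way a plan follows OnOk, OnError and After links can be sketched as a small recursive function. This is a simplified Python illustration of the ordering only, assuming that After runs whatever the outcome of the parent task; it is not TAC's implementation, and the task names, dictionary keys and `run_plan` function are all hypothetical.

```python
def run_plan(task, executed):
    """Run a task, then its OnOk or OnError child, then its After child."""
    executed.append(task["name"])
    try:
        task["action"]()
        branch = "on_ok"        # the task succeeded
    except Exception:
        branch = "on_error"     # the task failed
    # Follow the branch that matches the outcome, then the unconditional one.
    for key in (branch, "after"):
        child = task.get(key)
        if child:
            run_plan(child, executed)


def ok():
    """Hypothetical task body that succeeds."""


def fail():
    """Hypothetical task body that fails."""
    raise RuntimeError("load failed")


# Hypothetical plan: extract -> (OnOk) load -> (OnError) notify, then cleanup.
plan = {
    "name": "extract", "action": ok,
    "on_ok": {
        "name": "load", "action": fail,
        "on_error": {"name": "notify", "action": ok},
    },
    "after": {"name": "cleanup", "action": ok},
}

executed = []
run_plan(plan, executed)
print(executed)  # ['extract', 'load', 'notify', 'cleanup']
```

Here `load` fails, so its OnError child `notify` runs, and `cleanup` runs last because it is attached with After to the root task.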
When you generate and deploy an Execution Plan task, it generates and deploys all the jobs under that Execution Plan.
Sometimes the same standard job is used in many Execution Plans. In that case you can define different parameter/context values for the same job, which allows you to run the same job with different parameters.
https://help.talend.com/images/54/bk-tac-ug-542/execution_plan-task_context.png
I always recommend generating and deploying via Job Conductor unless you are sure about the status of every job under the particular Execution Plan, because when you generate and deploy an Execution Plan, all the related jobs are affected too, including any that might still be under development. For an individual job, generating and deploying via Job Conductor or via Execution Plan has the same effect.
Cheers! Uma

Friday, October 2, 2015

Talend Enterprise Data Integration Certification Exam Questions Part 1

The exam has around 110 questions, which you have to answer within 150 minutes.

  1. From which tab in a component view would you specify the component label?
  2. What is the best practice for arranging components on the design workspace?
  3. How do you place a component in a job?
  4. From which view in Talend Open Studio would you read the comments attached to a component?
  5. How do you run a job in Talend Studio?
  6. In which user interface element do you place the components of your job?
  7. In which user interface element do you find Business Models, Job Designs, and Metadata?
  8. What is indicated by an asterisk next to the job name in the design workspace?
  9. When you first start Talend Open Studio, what are the advantages of creating a Talend account?
  10. From which view in Talend Open Studio would you clear the statistics from the design workspace?
  11. How do you create a row between two components?
  12. How do you prevent a subjob from running without deleting it?
  13. How do you see a configuration error message for a component?
  14. What are two ways to add text information to a job?
Cheers!

Table Row Count for particular schema

Query to get the row count of every table in a particular schema (SQL Server)

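-- Row counts come from sys.partitions metadata, so they are approximate.
SELECT
    t.name AS TableName,
    SUM(p.rows) AS RowCounts          -- sum over all partitions of the table
FROM
    sys.tables t
INNER JOIN
    sys.schemas s ON t.schema_id = s.schema_id
INNER JOIN
    sys.indexes i ON t.object_id = i.object_id
INNER JOIN
    sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
WHERE
    t.is_ms_shipped = 0
    AND s.name = 'Staging'
    AND i.index_id IN (0, 1)          -- heap or clustered index only, so rows are not counted once per index
GROUP BY
    t.name, s.name
ORDER BY
    s.name, t.name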