::: Uma's Blog :::: November 2015

Sunday, November 29, 2015

Basic information about Hadoop system and it's function

This post contains basic information about Hadoop and it's function. This content and images are collection of many already available posts. You can simply understand Hadoop terms by comparing Hadoop and Google terms. Because Hadoop is inspired by the Google techniques.

What is HDFS and why it is special?

The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. The HDFS is the primary storage system used by Hadoop application. In initially Yahoo was enhanced the GFS and released as HDFS.

Traditional file system block size is 4KB, but HDFS file block size is 64MB default, this can be increased up to 128MB. There are many other advantages HDFS file system over traditional file system. Note that this HDFS file system only useful when we handle with large size file.

First advantage is, Free space can be used to store other data, for example if you want to store 2KB file, it store in a block which is 4KB, inside this blog 2KB space is free, however this space cannot be used by other purposes by traditional file system, but HDFS file system is support to use this free space within a blog. For example if 30 MB File stored in a blog then excess 34MB Space will be used for other data storing.

Second important advantage is, less space for metadata when storing large file. Big blog size 64Mb reduce overhead for Name node, metadata less, less storage for name node. In traditional file system data split into so many 4KB chunks and need more space to store metadata of these chunks. More space is required to store metadata as well as file retrieval also will take more time. For example, if you need to store the 1GB File, only just 16 block are required (1024/64) in HDFS system for the block size 64MB, but in traditional file system 262144 (1024*1024/4) blocks are required.

Third advantage is provides reliability through replication. Each block is replicated across several Data Notes. By default HDFS set for 3 replication. If you store 64MB file, this will be store in three Nodes same file in a Block in the cluster system. If one data node down, no worries replicate data there, but if name node down no way. This is called single point of fail over.

http://web.cs.ucla.edu/~alitteneker/CS239/images/image01.png

How HDFS system functioning

HDFS has five main services

Name node
Secondary Name mode
Job Tracker
Data node
Task Tracker

First 3 are master service and last 2 are slave services. In this system Name note talk to Data note and Job tracker talk to Task tracker.

Actual data store in Data notes and Name note used to store Meta data

Secondary Name node is used in failure of main Name node

Let’s see with the example, how this system is functioning. In 200MB file needs to store in 64KB block size, in this file is chunked as 4 small file, which would fit to HDFS block size. Assume that those 4 files are a.txt (64MB), b.txt (64MB), c.txt (64MB) and d.txt (8MB).

Client communicate with Name Note

Name not will give responds to client which location data needs to be stored

For example if 1,3,5,7 then a.txt will be stored under Data node 1

What will happened if Node 1 is going to down where a.txt saved. To make the availability by default it gives 3 replication by default, a.txt file will be copied into another 2 notes, for example Node2 and Node4.

Acknowledgement will be given after each saved on noted (dot dot line)

How this name notes know which notes this a.txt file saved

All the slave notes give proper blog reports to Name node every sort period of time, to say that some client store data on it and still alive and processing properly (heart beat)

Blog report will update on Name node

Some other important points are:

200 MB file will be store in 600MB space because of REPLICATION

Based on slave notes heard beat Name note will update the meta data. For example if note 2 is down then it remote note 2 for a.txt, and name note will choose another not to store a.txt

When note come back alive, but that data note don’t have any more data, it will start as fresh.

Map reduce:

Let say you have written a 10 KB program, you need to bring 200MB data to client so send the 10KB program to HDFS. Here on ward job tracker will handle.

No communication between job tracker and data notes

Job tracker assign task to Task tracker

Task tracker will be chosen based on nearest one.

For example 10 KB program will be assign/send to task tracker 1 in node1- This process called map

200MB file = a.txt, b.txt, c.txt and d.txt

Job taker send 10 KB program to node1 for a.txt, note3 for b.txt , note 5 for c.txt and node 7 for d.txt. THIS IS CALLED MAP

Input file = 200 MB (this will split into a, b,c and d)

Input splits = a.txt, b.txt, c.txt and d.txt

No of file splits = no of map process

Any case any of the task tracker not able to process, then job tracker will assign to another task tracker

All these task trackers are slave service for job tracker, so task trackers gives heard beat back to job tracker every 3 minutes.

If particular task tracker is busier, job tracker will decide to change the task tracker

Job tracker can monitor all the task trackers

If job tracker down all the process data will be lost for that Name node and Job tracker node we use high reliable hardware.

Task tracker find the information about this files and output will be store, for example output files 4KB,1 KB,4KB,3KB. One the information find each and every Name node separately, the output file will be used by reducer.

Reducer can be any node, if node 8 process reduce then it will put the final output in node 8 and update to node 8.

Instead of copy whole data and process, program is sent each node and find the output separately and finally combine the output.

Cheers!
Uma

Friday, November 27, 2015

Why TempDB is growing while SSIS job is running

One of my client raised an issue that their production tempdb is growing when SSIS package runs. This put me into confuse because generally SSIS not using tempdbs of SQL instances at all. Tempdb holds the intermediate data when SQL Server is doing tasks like:

DBCC CHECKDB or DBCC CHECKTABLE;
Common Table Expressions (CTE);
ORDER BY, GROUP BY, UNION Queries;
Index rebuild or creation.

Only one task was matched to my scenario that Index rebuild and creation. Yes finally found that the reason was not SSIS and the reason was destination tables have so many indexes.

The following answers/post were helped me to find the reason.

https://ask.sqlservercentral.com/questions/46812/does-ssis-use-tempdb.html

I am trying to understand the use of tempdb by SSIS to see if we would benefit from having a server dedicated solely to running SSIS packages. Are there situations were SSIS would need to use a tempdb apart from the tempdb located on servers defined by package connections. That is to say, if my SSIS machine is Server A, and it is performing an ETL from Server B to Server A, are there situations where the SSIS package would need to use the tempdb on Server A?

As you wrote if you have Server A - SSIS and then Server B and Server C - SQL Instances and moving data from B to C using SSIS, then SSIS will not use a tempdb on the Server A if there is also a SQL instance installed.

Generally SSIS does not utilize a tempdbs of SQL instances at all. Only if SSIS executes some SQL statements against the SQL instance then depending on the nature of the statement the tempdb can be utilized to accomplish that statement.

On the other side even SSIS does not utilize temp db directly, it can utilize a space on the Server A if it runs out of memory. SSIS uses a temp files for it's internal purposes. If have a Data Flow task in the package and take a look on the properties of the Data Flow Task, you will find there a BufferTempStoragePath and BLOBTempStoragePath, where you can specify the path where the temp files will be stored in case of hi memory pressure. The same relates to the Cached Lookups or Cache Transformation. Whenever the amount of data doesn't fit the process memory, it will be stored into temp files.

When you do not specify those properties, then a default temp location of an account under which the SSIS package is being executed will be used. Therefore it is important to set the temp location correctly for the accounts in case you have say a small system drive and big data drive as the default temp location for accounts is on the system drive.

https://social.msdn.microsoft.com/forums/sqlserver/en-US/f149300e-d357-40da-bb11-a1c0f1e82f6b/ssis-package-causes-tempdb-to-run-out-of-space
An OLE DB record is available. Source: "Microsoft SQL Native Client" Hresult: 0x80004005 Description: "Could not allocate space for object 'dbo.SORT temporary run storage: '...' in database 'tempdb' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup."

The destination is sql 2000. The package does not create any large temp tables.

apart from setting MaxInsertCommitSize property on the destination in Data Flow, is there anything else I could do? Where do I find this MaxInsertCommitSize option?

There may be several reasons for running out of space in TEMP db. One could be that the table you are loading has indexes; then the DB engine will use the TEMP DB to perform sort operations.

I would use the equivalent of a SQL Profiler to monitor the activity in the target DB and find out what exactly is happening. Also, what are the space allocation setting in the TEMP Db?The bottom line is that this issue looks more ralated to the destiantion DB engine than to SSIS

Cheers!
Uma

Thursday, November 26, 2015

What the difference between tMap Catch Lookup Inner Join Reject and Catch Output Reject

In tMap there are two type reject links options available.

Catch Lookup Inner Join Reject: This option, once activated, allows you to catch the records rejected by the inner join operation performed on the input flows.
Catch Output Reject: This option, once activated, allows you to catch the records rejected by a filter you have defined in the appropriate area.

The following example will make you more clearly about the differences. Create a job with main and lookup flow as shown below.

In the tMap editor:

Click the white arrow on each table to display the filter fields.

In the customer table, type !, press Ctrl+Space, click row1.CustomerAddress in the autocompletion list and type the Java syntax.equals("unknown").

In the state table, type state.LabelState !=null && !state.LabelState.isEmpty() to avoid null and empty fields.

In the out1 table, type the Java syntax"Ohio".equals() and press Ctrl+Space and, between the brackets, click state.LabelState in the autocompletion list.

Click OK

Run the job, you will see the output as shown below, only “Ohio” State data only will be flow over out1.

Now let see how to catch the rejected records by the filter.

In the tMap editor:

Click (+) on the toolbar above the out1 table, to add a new output table. In the Add a new output table wizard, type the name you want to give to the table (rejectFilterOut1). Click OK.

In the rejectFilterOut1, click the wrench button and change the Catch output reject option to true to collect all the data rejected by the out1 output filter.

Run the job, you will see all the records which are rejected by filter flow through rejectFilterOut1.

You can see that the Job is working, but still few more records are missing. These were rejected by Inner Join not the output filter. We want to put them in a different output.

Create new output and For the Catch lookup inner join reject property, select the true value.

If you run the job, you will see all the records are passing.

Cheers!

Uma

Sunday, November 15, 2015

What is the best component to use for Foreach loop and While loop in Talend - tFlowToIterate and tIterateToFlow

Based on experience, I would recommend tFlowToIterate and tIterateToFlow are the best component to use for Foreach and While loop.

tFlowToIterate

This component is used to read data line by line from the input flow and store the data entries in iterative global variables.

In this above Job, set of records come from MSSQL database via query, you need to do three things for each and every row. You can use “row13.column1”,”row13.column2”,etc in tHttpRequest_1, tJava_2 and tHttpRequest_2.

Also another advantage of using this component, we can use many link to iterate these iterates are run in order. In here iterete2 (order:1) run first, because (order:1) then order:2, Order:3, etc. This feature is help for many ETL process.

tIterateToFlow

Allows you to transform non processable data into a processable flow.

https://help.talend.com/download/attachments/8123220/Use_Case_tIterateToFlow1.png?version=1&modificationDate=1355333601000&api=v2

Mapping table:

https://help.talend.com/download/attachments/8123220/Use_Case_tIterateToFlow4.png?version=1&modificationDate=1355333604000&api=v2

Output will be:

Cheers!
Uma

JSON Path Expression for tExtractJSONFields and tFileInputJSON Talend components

I see, one of the main reason why Talend Open Studio (TOS) become a famous ETL tool because it support JSON and XML data handling in very convenient way whereas SSIS is not support or need custom codes to handle it. In this blog, I am going to discuss about the JSON filed extract components of TOS. There are 2 components available in Talend Open Studio, to extract the data from JSON format.

tFileInputJSON (only for JOSN File)

https://help.talend.com/images/54/bk-components-rg-542/tFileInputJSON_icon32_white.png

tFileInputJSON extracts JSON data from a file according to the JSON Path query, then transferring the data to a file, a database table, etc.

tExtractJSONFields (a file, a database table, etc)

https://help.talend.com/images/54/bk-components-rg-542/tExtractJSONFields_icon32_white.png

tExtractJSONFields extracts the data from JSON fields stored in a file, a database table, etc., based on the XPath query.

The following example shows that how to use this components, mainly define the JSON Path.

In this example, the JSON file is as follows:

tFileInputJSON - Extracting JSON data from a file

https://help.talend.com/images/54/bk-components-rg-542/use_case_tfileinputjson_2.png

In the Mapping table, the schema automatically appears in the Column part.

https://help.talend.com/images/54/bk-components-rg-542/use_case_tfileinputjson_4.png

In the JSONPath query column, enter the following queries:

For the columns type and name, enter the JSONPath queries "$.movieCollection[*].type" and "$.movieCollection[*].name"respectively. They correspond to the first nodes of the JSON data.

Here, "$.movieCollection[*]" stands for the root node relative to the nodes type and name, namely movieCollection.

For the columns release, rating and starring, enter the JSONPath queries "$..release", "$..rating" and "$..starring"respectively.

Here, ".." stands for the recursive decent of the details node, namely release, rating and starring.

Output will be as shown below:

tFileInputJSON - Extracting JSON data from a file using XPath

https://help.talend.com/images/54/bk-components-rg-542/use_case_tfileinputjson_7.png

In the Loop JSONPath query field, enter "/movieCollection/details".

In the Mapping table, the schema automatically appears in the Column part.

https://help.talend.com/images/54/bk-components-rg-542/use_case_tfileinputjson_8.png

In the XPath query column, enter the following queries:

For the columns type and name, enter the XPath queries "../type" and "../name" respectively. They correspond to the first nodes of the JSON data.
For the columns release, rating and starring, enter the XPath queries "release", "rating" and "starring" respectively.

You will get same output as above example

tExtractJSONFields

Node within the JSON field, on which the loop is based.

Mapping: Column: schema defined to hold the data extracted from the JSON field.

XPath Query: XPath Query to specify the node within the JSON field.

Get nodes: select this check box to extract the JSON data of all the nodes specified in the XPath query list or select the check box next to a specific node to extract its JSON data only.

Is Array: select this check box when the JSON field to be extracted is an array instead of an object.

XPath Query: XPath Query to specify the node within the JSON field. In the above JSON file, next to the friends column, retrieving the entire friends node from the source file.

JSONPath query = "$.user.friends[*]"

Mapping Column: schema defined to hold the data extracted from the JSON field.

Add columns that you need to retrieve on the right-side

JSON field - List of the JSON fields to be extracted, in this example it is “friends”

XPath Query: XPath Query to specify the node within the JSON field.

JOSN Path Query:

To work with this components you need to understand the JSON Path, you can find very useful article in the following link: Here I have copied part of that article. http://goessner.net/articles/JsonPath/

# JSONPath - XPath for JSON

A frequently emphasized advantage of XML is the availability of plenty tools to analyse, transform and selectively extract data out of XML documents. XPath is one of these powerful tools.

It's time to wonder, if there is a need for something like XPath4JSON and what are the problems it can solve.

Data may be interactively found and extracted out of JSON structures on the client without special scripting.
JSON data requested by the client can be reduced to the relevant parts on the server, such minimizing the bandwidth usage of the server response.

If we agree, that a tool for picking parts out of a JSON structure at hand does make sense, some questions come up. How should it do its job? How do JSONPath expressions look like?

Due to the fact, that JSON is a natural representation of data for the C family of programming languages, the chances are high, that the particular language has native syntax elements to access a JSON structure.

The following XPath expression

/store/book[1]/title

would look like

x.store.book[0].title

x['store']['book'][0]['title']

in Javascript, Python and PHP with a variable x holding the JSON structure. Here we observe, that the particular language usually has a fundamental XPath feature already built in.

The JSONPath tool in question should …

be naturally based on those language characteristics.
cover only essential parts of XPath 1.0.
be lightweight in code size and memory consumption.
be runtime efficient.

|2007-08-17| e2# JSONPath expressions

JSONPath expressions always refer to a JSON structure in the same way as XPath expression are used in combination with an XML document. Since a JSON structure is usually anonymous and doesn't necessarily have a "root member object" JSONPath assumes the abstract name $ assigned to the outer level object.

JSONPath expressions can use the dot–notation

$.store.book[0].title

or the bracket–notation

$['store']['book'][0]['title']

for input pathes. Internal or output pathes will always be converted to the more general bracket–notation.

JSONPath allows the wildcard symbol * for member names and array indices. It borrows the descendant operator '..' from E4X and the array slice syntax proposal[start:end:step] from ECMASCRIPT 4.

Expressions of the underlying scripting language (<expr>) can be used as an alternative to explicit names or indices as in

$.store.book[(@.length-1)].title

using the symbol '@' for the current object. Filter expressions are supported via the syntax ?(<boolean expr>) as in

$.store.book[?(@.price < 10)].title

Here is a complete overview and a side by side comparison of the JSONPath syntax elements with its XPath counterparts.

XPath	JSONPath	Description
/	$	the root object/element
.	@	the current object/element
/	. or []	child operator
..	n/a	parent operator
//	..	recursive descent. JSONPath borrows this syntax from E4X.
*	*	wildcard. All objects/elements regardless their names.
@	n/a	attribute access. JSON structures don't have attributes.
[]	[]	subscript operator. XPath uses it to iterate over element collections and for predicates. In Javascript and JSON it is the native array operator.
\|	[,]	Union operator in XPath results in a combination of node sets. JSONPath allows alternate names or array indices as a set.
n/a	[start:end:step]	array slice operator borrowed from ES4.
[]	?()	applies a filter (script) expression.
n/a	()	script expression, using the underlying script engine.
()	n/a	grouping in Xpath

XPath has a lot more to offer (Location pathes in not abbreviated syntax, operators and functions) than listed here. Moreover there is a remarkable difference how the subscript operator works in Xpath and JSONPath.

Square brackets in XPath expressions always operate on the node set resulting from the previous path fragment. Indices always start by 1.
With JSONPath square brackets operate on the object or array addressed by the previous path fragment. Indices always start by 0.

|2007-08-18| e3# JSONPath examples

Let's practice JSONPath expressions by some more examples. We start with a simple JSON structure built after an XML example representing a bookstore (original XML file).

{ "store": {

"book": [

{ "category": "reference",

"author": "Nigel Rees",

"title": "Sayings of the Century",

"price": 8.95

{ "category": "fiction",

"author": "Evelyn Waugh",

"title": "Sword of Honour",

"price": 12.99

{ "category": "fiction",

"author": "Herman Melville",

"title": "Moby Dick",

"isbn": "0-553-21311-3",

"price": 8.99

{ "category": "fiction",

"author": "J. R. R. Tolkien",

"title": "The Lord of the Rings",

"isbn": "0-395-19395-8",

"price": 22.99

}

"bicycle": {

"color": "red",

"price": 19.95

}

XPath	JSONPath	Result
/store/book/author	$.store.book[*].author	the authors of all books in the store
//author	$..author	all authors
/store/*	$.store.*	all things in store, which are some books and a red bicycle.
/store//price	$.store..price	the price of everything in the store.
//book[3]	$..book[2]	the third book
//book[last()]	$..book[(@.length-1)] $..book[-1:]	the last book in order.
//book[position()<3]	$..book[0,1] $..book[:2]	the first two books
//book[isbn]	$..book[?(@.isbn)]	filter all books with isbn number
//book[price<10]	$..book[?(@.price<10)]	filter all books cheapier than 10
//*	$..*	all Elements in XML document. All members of JSON structure.

Cheers!

Uma