Thursday, March 24, 2016

SSIS Master Package Analysis based on Performance Counter and Wait Stats

In order to achieve effective, goal-oriented performance tuning, it is mandatory to know your system’s bottlenecks. In most cases the root causes of those bottlenecks lie deep in the system’s architecture and are not visible at first sight. For example, the following image shows a complex master package design.
Performance Monitor
Performance Monitor allows you to analyze all performance counters in a single window. Depending on the number of counters this can be very confusing.
SSIS Package Instances: the total number of SSIS package instances running simultaneously.
The above counter illustrates that we can go up to a maximum of 26 packages executing in parallel (number of logical processors + 2 = 24 + 2 = 26).
This information will help reorganize the sub-packages into the proper order in the master packages and refine the parallel-processing design.
Buffers Spooled: use the “Buffers spooled” counter to determine whether data buffers are being written to disk temporarily while a package is running. This swapping reduces performance and indicates that the computer has insufficient memory.
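These counters can also be captured from the command line with typeperf. A minimal sketch, assuming an SSIS 2012 installation (the “11.0” in the counter object names is version-specific, so adjust it to your instance):

REM Sample the SSIS counters every 5 seconds (Ctrl+C to stop)
typeperf "\SQLServer:SSIS Service 11.0\SSIS Package Instances" "\SQLServer:SSIS Pipeline 11.0\Buffers spooled" -si 5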
Total SQL Server memory vs. SSIS memory utilization (SSIS packages only):
From this output we can conclude that there are no memory bottlenecks.
SQL Server I/O Bottlenecks:
Disk sec/Transfer: the time taken to perform an I/O operation.
The ideal value for Disk sec/Transfer is 0.005-0.015 sec. If you consistently notice this counter beyond 0.015, there is a serious I/O bottleneck. Look at Disk Bytes/sec immediately: if it is below 150 MB/s for a SAN disk or below 50 MB/s for a single disk, the problem is with the I/O subsystem; engage the hardware vendor.
From this output we can conclude that there is a serious bottleneck with I/O operations on the server. However, this is not applicable to the production or other environments, because on this server all file groups point to a single disk.

% Processor Time (Total): indicates how much time the processor actually spends on productive threads. Seek to understand how much CPU is being used by Integration Services and how much CPU is being used overall by SQL Server while Integration Services is running.
This outcome will help us reorganize sub-packages/tasks within packages and refine the parallel-processing design.
Network: SSIS moves data as fast as your network is able to handle it. Because of this, it is important to understand your network topology and ensure that the path between your source and target has both low latency and high throughput.
From this outcome we can conclude that there is no network bottleneck on the server.
 
Wait Stats
When SQL Server is executing a task and for any reason has to wait for resources, the wait is recorded by SQL Server along with the reason for the delay.
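The accumulated waits can be inspected through the sys.dm_os_wait_stats DMV. A minimal sketch (in practice you would also filter out benign system waits):

-- Top 10 wait types by total wait time since the stats were last cleared
SELECT TOP 10
      wait_type,
      waiting_tasks_count,
      wait_time_ms,
      signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;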
CXPACKET: when a parallel operation is created for a SQL query, multiple threads are used for a single query, and each thread usually deals with a different set of rows. When, for whatever reason, one or more of the threads lag behind, the CXPACKET wait type is recorded.
Depending on the server workload type, there are several ways to reduce or avoid this wait type. Each of them involves the “Maximum Degree of Parallelism” and “Cost Threshold for Parallelism” settings.
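Both settings can be changed with sp_configure. A minimal sketch; the values 8 and 50 below are illustrative assumptions, not recommendations:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Cap the number of schedulers a single query may use (illustrative value)
EXEC sp_configure 'max degree of parallelism', 8;
-- Only consider a parallel plan above this estimated cost (illustrative value)
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;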
IO_COMPLETION: The IO_COMPLETION wait type occurs while SQL Server is waiting for I/O operations to complete. This wait type generally represents non-data page I/Os.
This outcome clearly illustrates that parallel execution should be handled in an optimized way; otherwise unnecessary waits happen in SQL Server. This information will help reorganize the sub-packages and tasks.
Cheers!
Uma

Install Apache Spark on Windows environment

In this post I will walk through the process of installing Apache Spark on a Windows environment.

Spark can be installed in two ways:
  1. Building Spark from Source Code
  2. Use Prebuilt Spark package

First make sure you have installed Java and set the required environment variables JAVA_HOME and Path; you can confirm using the java -version command. If not, install and configure Java.


If you are downloading the pre-built version, the prerequisites are:
  1. Java Development Kit
  2. Python (only required if you are using Python instead of Scala or Java)

Download the package pre-built for Hadoop 2.6 and later.

Extract the file; here I place all the Spark-related files into C:\Learning\
Set SPARK_HOME and add %SPARK_HOME%\bin to PATH in the environment variables.
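A minimal sketch from a command prompt, using the install path above (setx changes take effect only in newly opened consoles):

REM Persist the variables for the current user
setx SPARK_HOME "C:\Learning\spark-1.6.0"
REM Append the Spark bin folder to the user PATH
setx PATH "%PATH%;C:\Learning\spark-1.6.0\bin"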
When you run Spark, you may see the following exception:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. The reason is that Spark expects the HADOOP_HOME environment variable to point to a Hadoop binary distribution.


Download the Hadoop common binary (which contains winutils.exe) and extract the downloaded zip file to C:\Learning\


Then set HADOOP_HOME to the extracted folder, for example (the folder name below is an assumption; use whatever you extracted):
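REM Point HADOOP_HOME at the extracted Hadoop binaries (hypothetical folder name)
setx HADOOP_HOME "C:\Learning\hadoop-common-2.2.0"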

Spark uses Apache log4j for logging. To configure log4j, go to C:\Learning\spark-1.6.0\conf; you will find a template file called ‘log4j.properties.template’.


Rename the file to remove the ‘.template’ extension, open it in a text editor, and find the property called ‘log4j.rootCategory’; you can set it to the level you want.
In my case I changed it to ‘ERROR’ instead of the default ‘INFO’.
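The resulting line in log4j.properties looks like this (the console appender name comes from the Spark 1.6 template):

# Only log errors to the console; the template default is INFO
log4j.rootCategory=ERROR, console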


Spark has two shells; both live in the ‘C:\Learning\spark-1.6.0\bin’ directory:
  1. Scala shell (C:\Learning\spark-1.6.0\bin\spark-shell.cmd)
  2. Python shell (C:\Learning\spark-1.6.0\bin\pyspark.cmd)

If you start spark-shell, you will see the Scala shell.

If you start pyspark, you will see the Python shell.

You can also check via the web UI (by default at http://localhost:4040).
Now let’s run a word-count example. I create a small text file and save it; below I assume it was saved as C:\Learning\words.txt.
Let’s run using spark-shell
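A minimal sketch in the Scala shell, assuming the text file above was saved as C:/Learning/words.txt:

// Read the file, split each line into words, and count each word
val textFile = sc.textFile("C:/Learning/words.txt")
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Print the (word, count) pairs on the driver
counts.collect().foreach(println)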


Let’s run using pyspark
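The equivalent sketch in the Python shell, with the same assumed file path:

# Read the file, split each line into words, and count each word
text_file = sc.textFile("C:/Learning/words.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
# Print the (word, count) pairs on the driver
print(counts.collect())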


In addition to this, if you want to get the Spark source code and build it yourself, you need to install and configure Scala, and you also need a tool that can build the source code, such as SBT.

The following steps show how to install Scala:

Download Scala from www.scala-lang.org/download
Run the downloaded MSI file and install it
Set SCALA_HOME and add %SCALA_HOME%\bin to the PATH variable in environment variables
You can test using the scala command in a command prompt

Cheers!
Uma

How to install Spark on Linux

In this post I will walk through the process of installing Apache Spark on a Linux environment.
Download and extract Spark:

uma@uma-VirtualBox:~/Downloads$ sudo wget http://www.eu.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
uma@uma-VirtualBox:~/Downloads$ sudo tar -xvf spark-1.6.0-bin-hadoop2.6.tgz -C /home/uma/work/
uma@uma-VirtualBox:~/work$ sudo mv spark-1.6.0-bin-hadoop2.6 spark-1.6.0

Start the shell with bin/spark-shell; by default it prints a lot of INFO log output. To reduce it, copy the log4j template and set log4j.rootCategory as in the Windows post:

uma@uma-VirtualBox:~/work/spark-1.6.0/conf$ sudo cp log4j.properties.template log4j.properties
uma@uma-VirtualBox:~/work/spark-1.6.0/conf$ sudo gedit log4j.properties

Make the Spark directory writable for your user, then start the shell again:

uma@uma-VirtualBox:~$ cd work
uma@uma-VirtualBox:~/work$ sudo chmod -R 777 spark-1.6.0/
uma@uma-VirtualBox:~/work/spark-1.6.0$ bin/spark-shell
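Optionally, add Spark to your shell profile so the commands are on the PATH; a minimal sketch assuming the layout above:

# Append to ~/.bashrc, then reload with: source ~/.bashrc
export SPARK_HOME=/home/uma/work/spark-1.6.0
export PATH=$SPARK_HOME/bin:$PATH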

Cheers!
Uma

What is the difference between Local (Standalone) Mode, Pseudo-Distributed Mode and Fully-Distributed Mode in a Hadoop installation

A Hadoop cluster can run in one of three supported modes:
  • Local (Standalone) Mode
  • Pseudo-Distributed Mode
  • Fully-Distributed Mode

Standalone Operation / Single node setup
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
 $ mkdir input
 $ cp etc/hadoop/*.xml input
 $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
 $ cat output/*

Pseudo-Distributed Operation
Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
For example, the HDFS configuration defines separate NameNode and DataNode daemons, but on the same machine.
If you start HDFS and YARN, you can use the jps command to check how many separate daemons are running in Hadoop. Note that all the daemons run on the same machine.
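A sketch of that check, run from the Hadoop install directory (the daemon names are the typical Hadoop 2.x set; your list may vary):

 $ sbin/start-dfs.sh
 $ sbin/start-yarn.sh
 $ jps
 # typically lists NameNode, DataNode, SecondaryNameNode,
 # ResourceManager, NodeManager, and Jps itself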
Fully-Distributed Operation
Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster or installing it via a packaging system appropriate for your operating system. It is important to divide the hardware up by function.
Typically one machine in the cluster is designated as the NameNode and another machine as the ResourceManager, exclusively. These are the masters. Other services (such as the Web App Proxy Server and the MapReduce Job History Server) are usually run either on dedicated hardware or on shared infrastructure, depending upon the load. The rest of the machines in the cluster act as both DataNode and NodeManager.
More details: https://hadoop.apache.org

Cheers!
Uma

Tuesday, March 22, 2016

Generate JSON format from table and export as JSON File from MSSQL

I have gone through many posts on constructing JSON strings in SQL Server; most of them do not work as expected. Finally I found the following code, which gives perfect output and is more convenient to customize.

CREATE PROCEDURE [dbo].[GetJSON] (
@ParameterSQL AS VARCHAR(MAX)
)
AS
BEGIN

DECLARE @SQL NVARCHAR(MAX)
DECLARE @XMLString VARCHAR(MAX)
DECLARE @XML XML
DECLARE @Paramlist NVARCHAR(1000)
SET @Paramlist = N'@XML XML OUTPUT'
SET @SQL = 'WITH PrepareTable (XMLString) '
SET @SQL = @SQL + 'AS ( '
SET @SQL = @SQL + @ParameterSQL+ ' FOR XML RAW, TYPE, ELEMENTS '
SET @SQL = @SQL + ') '
SET @SQL = @SQL + 'SELECT @XML = XMLString FROM PrepareTable '
EXEC sp_executesql @SQL, @Paramlist, @XML=@XML OUTPUT
SET @XMLString = CAST(@XML AS VARCHAR(MAX))

DECLARE @JSON VARCHAR(MAX)
DECLARE @Row VARCHAR(MAX)
DECLARE @RowStart INT
DECLARE @RowEnd INT
DECLARE @FieldStart INT
DECLARE @FieldEnd INT
DECLARE @KEY VARCHAR(MAX)
DECLARE @Value VARCHAR(MAX)

DECLARE @StartRoot VARCHAR(100); SET @StartRoot = '<row>'
DECLARE @EndRoot VARCHAR(100); SET @EndRoot = '</row>'
DECLARE @StartField VARCHAR(100); SET @StartField = '<'
DECLARE @EndField VARCHAR(100); SET @EndField = '>'

SET @RowStart = CharIndex(@StartRoot, @XMLString, 0)
SET @JSON = ''
WHILE @RowStart > 0
BEGIN
      SET @RowStart = @RowStart+Len(@StartRoot)
      SET @RowEnd = CharIndex(@EndRoot, @XMLString, @RowStart)
      SET @Row = SubString(@XMLString, @RowStart, @RowEnd-@RowStart)
      SET @JSON = @JSON+'{'

      -- for each row
      SET @FieldStart = CharIndex(@StartField, @Row, 0)
      WHILE @FieldStart > 0
      BEGIN
            -- parse node key
            SET @FieldStart = @FieldStart+Len(@StartField)
            SET @FieldEnd = CharIndex(@EndField, @Row, @FieldStart)
            SET @KEY = SubString(@Row, @FieldStart, @FieldEnd-@FieldStart)
            SET @JSON = @JSON+'"'+@KEY+'":'

            -- parse node value
            SET @FieldStart = @FieldEnd+1
            SET @FieldEnd = CharIndex('</', @Row, @FieldStart)
            SET @Value = SubString(@Row, @FieldStart, @FieldEnd-@FieldStart)
            SET @JSON = @JSON+'"'+@Value+'",'

            SET @FieldStart = @FieldStart+Len(@StartField)
            SET @FieldEnd = CharIndex(@EndField, @Row, @FieldStart)
            SET @FieldStart = CharIndex(@StartField, @Row, @FieldEnd)
      END  
      -- trim the trailing comma after the last field
      IF LEN(@JSON) > 0 SET @JSON = SUBSTRING(@JSON, 0, LEN(@JSON))
      SET @JSON = @JSON+'},'
      --/ for each row

      SET @RowStart = CharIndex(@StartRoot, @XMLString, @RowEnd)
END
-- trim the trailing comma after the last row
IF LEN(@JSON) > 0 SET @JSON = SUBSTRING(@JSON, 0, LEN(@JSON))
SET @JSON = '[' + @JSON + ']'
SELECT @JSON

END

To run:
EXEC GetJSON 'SELECT * FROM Production.ProductCategory'

The output will be as below
[
{"ProductCategoryID":"1","Name":"Bikes","rowguid":"CFBDA25C-DF71-47A7-B81B-64EE161AA37C","ModifiedDate":"2002-06-01T00:00:00"},
{"ProductCategoryID":"2","Name":"Components","rowguid":"C657828D-D808-4ABA-91A3-AF2CE02300E9","ModifiedDate":"2002-06-01T00:00:00"},
{"ProductCategoryID":"3","Name":"Clothing","rowguid":"10A7C342-CA82-48D4-8A38-46A2EB089B74","ModifiedDate":"2002-06-01T00:00:00"},
{"ProductCategoryID":"4","Name":"Accessories","rowguid":"2BE3BE36-D9A2-4EEE-B593-ED895D97C2A6","ModifiedDate":"2002-06-01T00:00:00"}
]

If the SELECT query is complex, insert the query result into a temp table first, and then pass the temp-table query to the JSON stored procedure.
For example: EXEC GetJSON 'SELECT * FROM #Result'
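A minimal sketch of that pattern (the source query is illustrative; any complex SELECT works the same way):

-- Materialize the complex query into a session temp table
SELECT pc.ProductCategoryID, pc.Name
INTO #Result
FROM Production.ProductCategory pc;

-- The temp table is visible to the dynamic SQL inside the procedure
EXEC GetJSON 'SELECT * FROM #Result';

DROP TABLE #Result;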

You can export the output as a JSON file using the following steps.

Modify the SP with an output parameter:

CREATE PROCEDURE [dbo].[GetJSON] (
@ParameterSQL AS VARCHAR(MAX),
@JsonString AS VARCHAR(MAX) OUTPUT
)
AS
BEGIN

DECLARE @SQL NVARCHAR(MAX)
DECLARE @XMLString VARCHAR(MAX)
DECLARE @XML XML
DECLARE @Paramlist NVARCHAR(1000)
SET @Paramlist = N'@XML XML OUTPUT'
SET @SQL = 'WITH PrepareTable (XMLString) '
SET @SQL = @SQL + 'AS ( '
SET @SQL = @SQL + @ParameterSQL+ ' FOR XML RAW, TYPE, ELEMENTS '
SET @SQL = @SQL + ') '
SET @SQL = @SQL + 'SELECT @XML = XMLString FROM PrepareTable '
EXEC sp_executesql @SQL, @Paramlist, @XML=@XML OUTPUT
SET @XMLString = CAST(@XML AS VARCHAR(MAX))

DECLARE @JSON VARCHAR(MAX)
DECLARE @Row VARCHAR(MAX)
DECLARE @RowStart INT
DECLARE @RowEnd INT
DECLARE @FieldStart INT
DECLARE @FieldEnd INT
DECLARE @KEY VARCHAR(MAX)
DECLARE @Value VARCHAR(MAX)

DECLARE @StartRoot VARCHAR(100); SET @StartRoot = '<row>'
DECLARE @EndRoot VARCHAR(100); SET @EndRoot = '</row>'
DECLARE @StartField VARCHAR(100); SET @StartField = '<'
DECLARE @EndField VARCHAR(100); SET @EndField = '>'

SET @RowStart = CharIndex(@StartRoot, @XMLString, 0)
SET @JSON = ''
WHILE @RowStart > 0
BEGIN
      SET @RowStart = @RowStart+Len(@StartRoot)
      SET @RowEnd = CharIndex(@EndRoot, @XMLString, @RowStart)
      SET @Row = SubString(@XMLString, @RowStart, @RowEnd-@RowStart)
      SET @JSON = @JSON+'{'

      -- for each row
      SET @FieldStart = CharIndex(@StartField, @Row, 0)
      WHILE @FieldStart > 0
      BEGIN
            -- parse node key
            SET @FieldStart = @FieldStart+Len(@StartField)
            SET @FieldEnd = CharIndex(@EndField, @Row, @FieldStart)
            SET @KEY = SubString(@Row, @FieldStart, @FieldEnd-@FieldStart)
            SET @JSON = @JSON+'"'+@KEY+'":'

            -- parse node value
            SET @FieldStart = @FieldEnd+1
            SET @FieldEnd = CharIndex('</', @Row, @FieldStart)
            SET @Value = SubString(@Row, @FieldStart, @FieldEnd-@FieldStart)
            SET @JSON = @JSON+'"'+@Value+'",'

            SET @FieldStart = @FieldStart+Len(@StartField)
            SET @FieldEnd = CharIndex(@EndField, @Row, @FieldStart)
            SET @FieldStart = CharIndex(@StartField, @Row, @FieldEnd)
      END  
      -- trim the trailing comma after the last field
      IF LEN(@JSON) > 0 SET @JSON = SUBSTRING(@JSON, 0, LEN(@JSON))
      SET @JSON = @JSON+'},'
      --/ for each row

      SET @RowStart = CharIndex(@StartRoot, @XMLString, @RowEnd)
END
-- trim the trailing comma after the last row
IF LEN(@JSON) > 0 SET @JSON = SUBSTRING(@JSON, 0, LEN(@JSON))
SET @JSON = '[' + @JSON + ']'
SET @JsonString = @JSON

END


To export as file

DECLARE @Jsonoutput VARCHAR(MAX)
DECLARE @Xpcommand VARCHAR(8000)
EXEC dbo.GetJSON 'SELECT * FROM #Result', @Jsonoutput OUT
SET @Xpcommand = 'echo ' + @Jsonoutput + ' >C:\output\outputfile.json'
EXEC master..xp_cmdshell @Xpcommand
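Note that xp_cmdshell is disabled by default; a minimal sketch for enabling it (requires sysadmin rights and has security implications):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Allow the echo command above to be run from T-SQL
EXEC sp_configure 'xp_cmdshell', 1;
RECONFIGURE;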

Cheers!
Uma