November 2013 – Canburak Tümer's Blog

The day has come, after playing with hadoop distributions around a year and two trainings; I feel ready to write an introduction post about Big Data, Hadoop and ecosystem projects.

1. What is Big Data?

Big Data is not Hadoop, Hadoop is just an implementation of Big Data concept. Big Data is a young concept in data analysis, ETL, data warehousing and data discovery fields; or data science for short. Every year, every day, every minute we create data, and every time we create more than we’ve created before. And we, data workers, process this data to get valuable pieces to our company, client, industry to increase income.

Here, Big Data comes in our life. Big Data has 5 V’s (in some sources you may find only 3 V’s) which are : Volume, Velocity, Variety, Verification (or Validity) and last but the most important to me Value.

Volume : Data we need to process and analyze every day is getting bigger day by day. So we need a new approach to data processing, this is where big data comes in.

Velocity : With the use of mobile devices, social media and internet we can create more data in some time, before we could do. For example, before social media if we generate just X MB data on the internet in a day, now we generate more than X GB of data in a day, may be in a couple hours. So we need to catch and process data really fast to catch-up with its speed. (There is also CEP or Fast Data concepts you may interest in this specific topic.)

Variety : I will blame social media again but with the increase of social media usage and internet usage now we generate unstructured and various types of data, like shares, likes, status updates, retweets, vines, videos, texts, gifs, other images. And to create value to our company we have to process this data. They are also coming from variable sources, network systems, internet, forms on corporate website etc.

Verification : Of course, we need the right data to get right results.

Value : This is it! It’s the reason why we process so much data in so short time. We need to get valuable pieces of data, we try to extract VALUE from the data. It is like searching for diamond in a mine.

2. What is Hadoop?

Hadoop is an open source project which implements Big data. It’s a distributed system to store and process data with commodity hardware. It does not require big powerful servers, instead you can create a cluster with desktop computers of quad-core processors with 2GB of RAM. (less than most of the modern laptops, almost all smartphones have 1 or 2 GB of RAM nowadays.)

Hadoop first started with Google’s white paper publishing on its Big Table and MapReduce architecture. Some cool guys tried to implement and develop these features in open source manner. Then Yahoo, Facebook, Google and Apache Foundation supported them. Now Hadoop is an open source Apache project.

3. Hadoop Distributions

You can install and use Hadoop through Linux’s repositories. But there is also start-ups who bundled Hadoop with some other open-source ecosystem projects for Hadoop and with their own tools as well.

Cloudera is one of these start-ups who has own bundle and it provides a VM also for quick start.

Hortonworks is the one other start-up, who has own bundle and own VM to getting started.

Also there is bigger solutions to Hadoop like Oracle’s Big Data Appliance, Teradata’s solution, IBM and HP also have their own enterprise solutions.

4. Hadoop Ecosystem Projects

Pig : Pig is one of the data analytic tools we can use with Hadoop. It has its own language to code which is a scripting language and called Pig Latin. It is really close to English so you can code like you are writing an English essay.

Hive : Hive is another way to run data analytics. It has a SQL like language, so it is usually preferred by developers who already have SQL knowledge.

Impala : Impala is the rival of Hive, it also has a SQL like language and it is much more faster than Hive, because it does not convert code to MapReduce, instead it runs on HDFS directly. (I will tell more about MapReduce and HDFS next time.)

Oozie : Oozie is a scheduling and job management tool. Where you can define flows as XML files, and it runs jobs as defined in this XML.

Sqoop : Sqoop is a tool to load data to Hadoop from a RDBMS or vice versa. It crates MapReduce jobs to load data and runs them automatically.

Flume : Flume is a listener basically. User defines an input channel and Flume polls it repeatedly, for example user defines a log file as a channel and Flume polls in every five minutes to get latest logs from this file.

Ambari : Ambari is administration (provisioning, managing and monitoring) console for the Hadoop cluster.

That’s all for introduction post, soon I will be writing about Hadoop Internals: HDFS and MapReduce. Please do not hesitate to leave comments or ask question in comments section. And hopefully in february I will be building a mini Hadoop cluster at home which will be topic for another blog post.

Thanks for reading.

While using ODI to implement your ETLs, you may need to have loops. Let’s look at the examples, where I will implement loops that will iterate n times (for loops) and loops that will iterate while they ensure the condition. (while loops)

For Loop

In programming we implement for loop as follows,

for (i = 0; i < 10; i++){
//statements
}

This is a simple loop which iterates ten times, if we parse the part in the parenthesis we can see in the first part we assign a value to a variable, second part we define the condition and the last part is change of variable value per iteration.

In ODI 11g we can implement this as follows:

1- Create a variable
I created a variable called V_FOR_LOOP which is numeric and does not have a refreshing code.

2- Create a package
I create a package and name it as P_FOR_LOOP, I will put a screenshot of package’s final status when we complete all steps.

3- Set variable
Set a value to our variable V_FOR_LOOP as an initialization value. I will set it as 0. Also name the step as set initial.

4- Evaluate variable
Evaluate V_FOR_LOOP against iteration condition. I will use “less than 5” as iteration value. You can choose between the options as you wish or your requirement. Name step as Evaluate Value.

5- Place your statements
Now it is time to place your statements which will iterate. I will only put one interface.

6- Increment your variable
Increment your variable one step using SET VARIABLE object’s Increment option, I will increment by one and name this step as Increment.

7- Connect your Increment step to Evaluate Value step
Until this step every object was connected to its following object with an OK line, now connect Increment to Evaluate Value with an OK line. Now it will go back to evaluation and iterate until the evaluation is false.

Here is how our package looks in final form:

And the operator screen when we run the package:

As seen above steps numbered 1,2,3 repeats 5 times, then Evaluate Value runs one more time, decides that V_FOR_LOOP < 5 is not true enough and package finishes its run.

While Loop

In programming we can implement while loop as follows:

while (flag == true){
//statements
}

So this will iterate unknown times until its condition becomes incorrect. Confession time : I have to admit that I have never felt need of using while loop in ETL/ODI but you may need.

Before implementing this step-by-step, I created a table includes two columns c1 and flag, where I will use flag as my condition. My data is as follows :

C1 F
— –
1 T
2 T
3 T
4 T
5 F
6 T
7 T
8 T
9 T
10 F

Now let’s implement while loop:

1- Create a variable to hold flag value
I create a variable called V_WHILE_LOOP which is alphanumeric and refreshing by : select flag from variable.test where c1 = #V_FOR_LOOP
I will use my V_FOR_LOOP to select flag values, in this sample case. Your case will contain different logic than this sample for sure.

2- Create a package
I create a package named P_WHILE_LOOP.

3- Set Variable (in my case)
Since I am refreshing my flag depending on V_FOR_LOOP, I set this as first step.

4- Refresh Flag
Refresh your flag variable.

5- Check Flag
Evaluate flag variable.

6- Statements
Place your statements, I will put my sample interface and also I will increment V_FOR_LOOP as I will need this to reach an invalid flag.

7- Set your connections
Until the end of your statements every step will be connected by an OK, when you reach the end connect it to Refresh Flag step, so you will refresh, check and start your statements again and again until flag is false.

Here is a view of package :

And the view from operator:

You can see it hits the end when we refresh flag for the 5th time since it will return F as flag value, which is not suitable to our condition.

So here we are at the end of the post, now with the knowledge of “How to implement loops in ODI 11g”

Thank you for your patience to read, and if you have any questions or comments please drop a comment and I will read (and reply if it’s a question) it for sure.

Month: November 2013

Hadoop: Introduction to Big Data, Hadoop and Hadoop Ecosystem

ODI 11g: Implementing Loops