Hadoop: Introduction to Big Data, Hadoop and Hadoop Ecosystem

The day has come. After playing with Hadoop distributions for about a year and attending two trainings, I feel ready to write an introductory post about Big Data, Hadoop and its ecosystem projects.

1. What is Big Data?

Big Data is not Hadoop; Hadoop is just one implementation of the Big Data concept. Big Data is a young concept in the fields of data analysis, ETL, data warehousing and data discovery, or data science for short. Every year, every day, every minute we create data, and each time we create more than we did before. And we, the data workers, process this data to extract the valuable pieces that help our company, client or industry increase its income.

This is where Big Data comes into our lives. Big Data has 5 V’s (some sources list only 3 V’s): Volume, Velocity, Variety, Verification (often called Veracity or Validity) and, last but most important to me, Value.

5V for Big Data

Volume : The data we need to process and analyze grows every day. So we need a new approach to data processing, and this is where Big Data comes in.

Velocity : With mobile devices, social media and the internet, we create far more data in a given period than we could before. For example, if we generated just X MB of data on the internet per day before social media, we now generate more than X GB per day, maybe in a couple of hours. So we need to capture and process data really fast to keep up with its speed. (If this topic interests you, also look at the CEP and Fast Data concepts.)

Variety : I will blame social media again: with the growth of social media and internet usage, we now generate unstructured data of many types, such as shares, likes, status updates, retweets, vines, videos, texts, GIFs and other images. To create value for our company we have to process this data. It also comes from a variety of sources: network systems, the internet, forms on the corporate website and so on.

Verification : Of course, we need the right data to get the right results.

Value : This is it! It’s the reason why we process so much data in so short a time. We try to extract VALUE, the valuable pieces, from the data. It is like searching for diamonds in a mine.

2. What is Hadoop?

Hadoop is an open source project that implements the Big Data concept. It’s a distributed system that stores and processes data on commodity hardware. It does not require big, powerful servers; instead you can build a cluster from desktop computers with quad-core processors and 2GB of RAM. (That is less than most modern laptops; almost all smartphones have 1 or 2 GB of RAM nowadays.)

Apache Hadoop Elephant

Hadoop started with the papers Google published on its Google File System and MapReduce architecture. (Bigtable, another Google paper, later inspired HBase.) Some cool guys, Doug Cutting and Mike Cafarella, set out to implement and develop these ideas as open source. Then Yahoo, Facebook and the Apache Software Foundation backed the work. Now Hadoop is an open source Apache project.

3. Hadoop Distributions

You can install and use Hadoop through Linux repositories. But there are also start-ups that bundle Hadoop with other open source ecosystem projects and with their own tools.

Cloudera is one of these start-ups; it has its own bundle and also provides a VM for a quick start.

Hortonworks is another; it also has its own bundle and its own VM to get started with.

There are also bigger Hadoop solutions such as Oracle’s Big Data Appliance and Teradata’s solution; IBM and HP have their own enterprise solutions as well.

4. Hadoop Ecosystem Projects

Pig : Pig is one of the data analytics tools we can use with Hadoop. It has its own scripting language, called Pig Latin. It reads a lot like English, so writing a script can feel like writing a short English essay.

Hive : Hive is another way to run data analytics. It has a SQL-like language, so it is usually preferred by developers who already know SQL.

Impala : Impala is Hive’s rival. It also has a SQL-like language and is much faster than Hive, because it does not convert queries to MapReduce jobs; instead it works on the data in HDFS directly. (I will tell more about MapReduce and HDFS next time.)

Oozie : Oozie is a scheduling and job management tool. You define flows as XML files, and it runs the jobs as defined in that XML.

Sqoop : Sqoop is a tool to load data from an RDBMS into Hadoop or vice versa. It creates MapReduce jobs to load the data and runs them automatically.

Flume : Flume is basically a listener. The user defines an input (for example a log file) and Flume polls it repeatedly, say every five minutes, to pick up the latest log entries.

Ambari : Ambari is the administration console (provisioning, managing and monitoring) for the Hadoop cluster.
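
Several of the tools above (Hive, Sqoop) end up running MapReduce jobs. To give a rough idea of what such a job does, here is a minimal in-memory sketch in Python of the map, shuffle and reduce phases, using word count as the example. It is only an illustration: the function names are made up for this sketch, it runs on a single machine, and nothing in it calls Hadoop itself.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases one after another on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "hadoop stores big data"]))
```

In a real cluster the map and reduce functions run in parallel on many machines and the shuffle moves data over the network; the sketch only shows the order of the phases.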

 

That’s all for this introductory post. Soon I will write about Hadoop internals: HDFS and MapReduce. Please do not hesitate to leave comments or ask questions in the comments section. Hopefully, in February I will build a mini Hadoop cluster at home, which will be the topic of another blog post.

Thanks for reading.

ODI 11g: Implementing Loops

While using ODI to implement your ETLs, you may need loops. Let’s look at two examples: loops that iterate n times (for loops) and loops that iterate as long as a condition holds (while loops).

For Loop

In programming we implement a for loop as follows:

for (int i = 0; i < 10; i++) {
    // statements
}

This is a simple loop which iterates ten times. If we look at the part in the parentheses, the first part assigns an initial value to a variable, the second part defines the condition, and the last part changes the variable’s value on each iteration.

In ODI 11g we can implement this as follows:

1- Create a variable
I created a variable called V_FOR_LOOP, which is numeric and has no refresh query.

2- Create a package
I created a package and named it P_FOR_LOOP. I will show a screenshot of the package’s final state once we complete all the steps.

3- Set variable
Set our variable V_FOR_LOOP to an initial value; I will set it to 0. Name this step set initial.

4- Evaluate variable
Evaluate V_FOR_LOOP against the iteration condition. I will use “less than 5” as the condition. You can choose any of the operators, depending on your requirements. Name this step Evaluate Value.

5- Place your statements
Now it is time to place the statements that will be iterated. I will put only one interface.

6- Increment your variable
Increment your variable using the SET VARIABLE object’s Increment option. I will increment by one and name this step Increment.

7- Connect your Increment step to Evaluate Value step
Up to this step, every object was connected to the following object with an OK line. Now connect Increment back to Evaluate Value with an OK line. The package will go back to the evaluation and iterate until the evaluation is false.

Here is how our package looks in its final form:

For Loop Package

And the operator screen when we run the package:

For Loop Operator View

As seen above, the steps numbered 1, 2 and 3 repeat 5 times; then Evaluate Value runs one more time, finds that V_FOR_LOOP < 5 is no longer true, and the package finishes its run.
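
To double-check the step counts, the package flow can be simulated in a few lines of Python. This is only a sketch of the control flow described above, not ODI code; the step name "Interface" is a placeholder I chose for the single interface inside the loop.

```python
def run_for_loop_package(limit=5):
    """Simulate the P_FOR_LOOP package and return the executed step names in order."""
    log = []
    v_for_loop = 0                    # step "set initial": V_FOR_LOOP = 0
    log.append("set initial")
    while True:
        log.append("Evaluate Value")  # evaluate V_FOR_LOOP < limit
        if not v_for_loop < limit:
            break                     # evaluation is false: the package ends
        log.append("Interface")       # the iterated statements (placeholder name)
        v_for_loop += 1               # step "Increment": add one to the variable
        log.append("Increment")
    return log


if __name__ == "__main__":
    steps = run_for_loop_package()
    print(steps.count("Interface"), steps.count("Evaluate Value"))
```

With the limit of 5 used in this post, the interface runs 5 times and Evaluate Value runs 6 times, the last one being the evaluation that ends the package.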

While Loop

In programming we can implement a while loop as follows:

while (flag == true) {
    // statements
}

So this will iterate an unknown number of times, until its condition becomes false. Confession time: I have to admit that I have never felt the need for a while loop in ETL/ODI, but you may need one.

Before implementing this step by step, I created a table with two columns, c1 and flag, where flag will serve as my condition. My data is as follows:

C1  F
--  -
1   T
2   T
3   T
4   T
5   F
6   T
7   T
8   T
9   T
10  F

Now let’s implement the while loop:

1- Create a variable to hold flag value
I created a variable called V_WHILE_LOOP, which is alphanumeric and is refreshed by: select flag from variable.test where c1 = #V_FOR_LOOP
In this sample case I will use my V_FOR_LOOP variable to select flag values. Your case will surely contain different logic.

2- Create a package
I created a package named P_WHILE_LOOP.

3- Set Variable (in my case)
Since I refresh my flag depending on V_FOR_LOOP, I set this variable as the first step.

4- Refresh Flag
Refresh your flag variable.

5- Check Flag
Evaluate the flag variable.

6- Statements
Place your statements. I will put my sample interface and also increment V_FOR_LOOP, as I need this to reach an invalid flag.

7- Set your connections
Up to the end of your statements, every step is connected to the next by an OK line. When you reach the end, connect the last step to the Refresh Flag step, so the package will refresh, check and run your statements again and again until the flag is false.

Here is a view of the package:

While Loop Package

And the view from the operator:

While Loop Operator

You can see it reaches the end when we refresh the flag for the 5th time, since that refresh returns F as the flag value, which does not satisfy our condition.
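
The same flow can be traced with a small Python sketch. It is a simulation of the package logic, not ODI code: the dictionary stands in for the sample table, refresh_flag stands in for the refresh query, and the start value of 1 for V_FOR_LOOP is my assumption, since the Set Variable step above does not state a value.

```python
# Sample rows from the table above: c1 -> flag
TEST_TABLE = {1: "T", 2: "T", 3: "T", 4: "T", 5: "F",
              6: "T", 7: "T", 8: "T", 9: "T", 10: "F"}


def refresh_flag(c1):
    """Stand-in for: select flag from variable.test where c1 = #V_FOR_LOOP."""
    return TEST_TABLE[c1]


def run_while_loop_package(start=1):
    """Return (number of refreshes, number of times the statements ran)."""
    v_for_loop = start            # "Set Variable" step (start value is assumed)
    refreshes = 0
    iterations = 0
    while True:
        v_while_loop = refresh_flag(v_for_loop)  # "Refresh Flag" step
        refreshes += 1
        if v_while_loop != "T":                  # "Check Flag" step
            break                                # flag is F: the package ends
        iterations += 1                          # the statements (sample interface)
        v_for_loop += 1                          # move on to the next row
    return refreshes, iterations


if __name__ == "__main__":
    print(run_while_loop_package())
```

Starting from row 1, the flag is refreshed five times and the statements run four times, which matches the operator view: the fifth refresh reads the row with c1 = 5 and gets F.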

So here we are at the end of the post, now knowing how to implement loops in ODI 11g.

Thank you for your patience in reading. If you have any questions or comments, please drop a comment and I will read it (and reply if it’s a question) for sure.

ODI 12c: First Look and Repository Creation

Hello,

After a long pause on the blog, here I am again. Oracle Data Integrator 12c is finally available for everyone to download. In this post I will share my first impressions and explain how to create both the master and work repositories. It is actually pretty simple and almost the same as in 11g, which I covered in an earlier post.

So, first impressions: when you download ODI 12c from the download page, you will get odi_121200.jar (the numbers may differ over time since they are the version number) and some OPatch patches bundled with it. Honestly, a 1.8GB jar file bothers me a bit. I’d rather have an exe for Windows.

Anyway, I also had some problems running this jar. First I tried it on my VM, which runs 32-bit Windows 7, and got an error stating that it could not reach the jar file. So I moved to my physical machine, which runs 64-bit Windows 7; there OUI could not recognize the platform and exited every time, until I downloaded and installed Java 1.7. After solving the Java problem, I went back to my VM to solve the other one, and it turned out that my path was the problem: my user name is Canburak Tümer, and the space in it prevented the installer from reaching the file. I created another user without a space in its name, and that user could run the installer.

Finally I could see the installer UI. It was a pretty straightforward installation; I just selected enterprise and went on. After installation I ran ODI Studio. It has a really clean and elegant splash screen, and it asks whether to migrate user settings from old installations. After the splash screen, the ODI workbench loads:

ODI 12c Start Screen

Creating Master Repository

As I mentioned before, repository creation is almost the same as in ODI 11g. We start by clicking File > New, and we see the screen below:

Master - 1

Select “Master Repository Creation Wizard” in the ODI tab and click “OK”.

Master - 2

You will see the screen above, where we enter the database information, the schema in which we will create the repository, and a DBA user to run some of the creation scripts.

Master - 3

Define and confirm the password for the SUPERVISOR user. DO NOT FORGET THIS PASSWORD UNLESS YOU HAVE ANOTHER USER WITH SUPERVISOR PRIVILEGES. For this reason many ODI developers/admins set this password to “SUNOPSIS” as an old habit. I prefer “SUPERVISOR” on my VM and personal development environment.

Master - 4

Select the password storage option you prefer, then click Finish. ODI will now run the scripts that create the master repository. It took around 4 minutes on my VM; it will probably take 2-3 minutes on a physical machine. Now it’s time to create a connection to the master repository.

Connection

Click “Connect to Repository”, then click the green plus in the pop-up window and fill in the required information in the form. Use SUPERVISOR as the ODI user, and the DB user with which you created the master repository. Make sure you have selected the “Master repository only” radio button. Then click “OK”.

Wallet

ODI 12c will ask whether you want to keep passwords in a secure wallet protected by a master password. I do not know enough about this wallet yet, but I will learn and write another post about it. For now I prefer the less secure way, without the wallet. We now have a master repository and a connection to it, so it’s time to create the work repository.

Creating Work Repository

To create the work repository, connect to your master repository, then go to the Topology tab and expand the Repositories menu.

Work - 1

Right-click “Work Repositories” and choose “New Work Repository” from the menu.

Work - 2

Enter the connection information of the schema in which you want to create the work repository. (I had a problem with this step: I wanted to use the odiw_c user, but ODI 12c keeps converting the form input to upper case, so it gives an invalid credentials error.)

Work - 3

In the final step, enter the repository name and select the repository type.

Work - 4

You can also define a password for the repository, which is separate from the ODI user password and the DB user password. This password only secures the repository connection. When you click “Finish”, ODI runs the scripts to create the work repository and asks whether you want to create a connection to it. It will create the connection without ODI user information, so you will need to edit the connection and enter the ODI user information.

After all these steps we have installed ODI 12c and set up both the master and work repositories for our environment. The final view looks like this:

We are ready to develop.

Now it’s time to create our topology connections, models and projects; import or reverse engineer data sources; and develop mappings (the new name for interfaces), packages and more.

Welcome to ODI 12c. Keep following my blog for further posts, and please do not hesitate to contact me through the comment form below.

[Quote] Confessions of A Job Hopper

“A job, if you’re lucky enough to have one, is not a prison. If you’re bored, feeling underpaid, underappreciated, want to live in another part of the country or world or you’re just too ambitious for your own good it’s okay to change jobs*. (*Just make sure you have the new one before you leave the old one! And never, ever burn bridges.)”

Source : Here

After a series of workshops

Introduction to SQL Session

We had a series of workshops about SQL and PL/SQL at Istanbul Hackerspace. I ran the sessions myself; there were three of them, each focusing on a different topic.

Our road map was:

  •  Introduction to SQL
  •  Introduction to PL/SQL
  •  Introduction to PL/SQL tuning & Oracle catalog tables.

You can find the material I’ve prepared for these workshops at http://www.canburaktumer.com/istanbulhs. For me, these sessions were useful and successful. I did not break the Istanbul Hackerspace tradition and had a decreasing number of participants: six on the first day, three on the second, and on the last day it was only me.

By the way, I would also like to introduce the hackerspace concept and Istanbul HS. Hackerspaces are worldwide “free project ateliers”. They basically produce projects with electronics and software. You can see a full list of hackerspaces on hackerspaces.org. There are two hackerspaces in Istanbul, one on the Anatolian side and one on the European side. I am a volunteer on the Anatolian side; we are running an Android application project now, and we hold workshops. You can find more info, in Turkish, on istanbulhs.org.

That’s all for today, keep following because ODI posts will continue to come.

Project Management

Maia

There is a good project management article on the internet. It treats delivering a baby as a software project and shares what each role thinks about the project. I believe it is really close to the real-life projects we have in enterprise companies, where there is an obvious line, and a knowledge gap, between project roles. Here is that article.

Have fun!

  • Project Manager is a person who thinks nine women can deliver a baby in one month.
  • Developer is a person who thinks it will take eighteen months to deliver a baby.
  • The Onsite Coordinator is one who thinks a single woman can deliver nine babies in one month.
  • The Client is the one who doesn’t know why he wants a baby.
  • Marketing Manager is a person who thinks he can deliver a baby even if no man and woman are available.
  • The Resource Optimization Team thinks they don’t need a man or woman; they’ll produce a child with zero resources.
  • The Documentation Team doesn’t care whether the child is delivered, they’ll just document 9 months.
  • The User Interface Team will design a baby with three arms and one leg and ask if it can be done.
  • The Quality Auditor is the person who is never happy with the process to produce a baby.
  • Tester is a person who always tells his wife that this is not the right baby.

Test your SQLs online

Nowadays I am trying to answer questions about SQL, PL/SQL and Oracle on StackOverflow. You can see my profile card below:

 profile for Canburak Tümer at Stack Overflow, Q&A for professional and enthusiast programmers

Recently, while navigating through as many questions as possible, I keep seeing links to a new web site in answers. The site is called SQLFiddle, and there you can create an online database and try out your code.

Screenshot from web site

As you can see above, it is pretty simple and really easy to use. There are different products to choose from, and different versions for some of them. At the time of writing, the exact list was as follows:

  • MySQL 5.5.30
  • MySQL 5.6.6 m9
  • MySQL 5.1.61
  • Oracle 11gR2
  • PostgreSQL 9.1.8
  • PostgreSQL 9.2.1
  • PostgreSQL 8.4.12
  • PostgreSQL 8.3.20
  • SQLite (WebSQL)
  • SQLite (SQL.js)
  • MS SQL Server 2008
  • MS SQL Server 2012

Jake Feasel, the creator of SQLFiddle, explains why he built the site:

I found JS Fiddle to be a great tool for answering javascript / jQuery questions, but I also found that there was nothing available that offered similar functionality for the SQL questions. So, that was my inspiration to build this site. Basically, I built this site as a tool for developers like me to be more effective in assisting other developers.

How to use

So let’s talk about how to use SQLFiddle.

First of all, select the product you want to use from the combo box in the top-left corner, next to the site logo. Then create your schema: in the left editor you can create tables and insert data into them. When you are done, click the Build Schema button to run your code. Now you have your tables and data. You can write your queries in the right editor and run them. Results appear below the editors.
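
If you want to try the same two-step workflow offline, here is a minimal sketch using Python’s built-in sqlite3 module. It only mimics the idea (a schema script first, then a query against it) and has nothing to do with how SQLFiddle itself is built; the table, its rows and the run_fiddle function are made up for this example.

```python
import sqlite3

# "Left editor": the schema script, i.e. what Build Schema would run.
SCHEMA = """
CREATE TABLE test (c1 INTEGER PRIMARY KEY, flag TEXT);
INSERT INTO test (c1, flag) VALUES (1, 'T'), (2, 'T'), (3, 'F');
"""

# "Right editor": the query to run against the built schema.
QUERY = "SELECT c1 FROM test WHERE flag = 'T' ORDER BY c1"


def run_fiddle(schema, query):
    """Build the schema in a throwaway in-memory database and run one query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)            # create tables and insert data
        return conn.execute(query).fetchall() # rows come back as a list of tuples
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_fiddle(SCHEMA, QUERY))
```

Every call starts from an empty in-memory database, so you can change the schema or the query and re-run without cleaning anything up.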

Limitation 

There is a limitation with MySQL, which Jake Feasel explains on the About page: in the right editor you can only use SELECT if you are using MySQL. With other products, feel free to use all DML operations.

Login

There are around 12 different login options, such as G+ and OpenID. When you log in to the site, you can see your fiddle history and your favorite fiddles. There are no other advantages to logging in yet.

That’s all for today. Do not hesitate to comment.

P.S. I am preparing a new ODI post, but I am really busy with work these days. So follow my blog for updates.

Online Tools for Productivity

Hello all,

Today I am going to write a non-technical post about the online tools I use for productivity in my college and work life, which make my life simpler and easier. All the tools I mention below have free plans or demo versions, as well as premium or paid services.

Mailing

For mailing, my first choice was GMail, which is developed and run by Google. Having a 2GB inbox was a huge thing when I first heard of GMail, around eight years ago. Back then there was only one way to sign up for GMail: getting an invitation. Since then it has become a public service, and inbox space has grown to 10GB and is still growing.

GMail has since gained more features and has been integrated with new services like Calendar, Docs, etc.

Outlook.com is Microsoft’s new e-mail service, and I am pretty happy with this account and its web mail interface. It is easy to use, has a clean interface and nice features. According to rumors on the web, Microsoft is going to move all Hotmail accounts to Outlook by the end of June 2013. It is also nice that there are now lots of options for the e-mail address you can get. For example, your_name.surname@gmail.com may be taken on GMail, but it is quite possible that your_name.surname@outlook.com is still available. And since Microsoft Office Outlook is well-known software, these e-mail addresses seem more professional.

Meetings And Collaboration

For scheduling meetings I use Doodle.com. You create a poll with candidate times and send invitations to the attendees, who then mark the times they are available. That way you can find a time that suits everyone for your meeting or other event.

Scriblink is an online whiteboard, where all attendees can add things to or remove things from the board. Attendees can also chat in the chat pane.

Google Docs is like an online office suite with spreadsheets, presentations and a word processor. You can view, edit and share documents via Google Docs. It recognizes and opens the common file formats we use in daily life.

Skype is the software I use most for video calls. It offers chat, voice calls and video calls, with conference support. It is easy to use and can also call phones.

Dropbox is a well-known file sharing tool. I use it for sharing files within our project group and also for publishing files on the internet for public download.

Facebook Groups: we use Facebook Groups in really small project groups for innovative discussions and for sharing videos, photos and ideas within the group. As everybody uses Facebook on a daily basis nowadays, it is an easy and secure way to keep the whole group updated. You can also share documents and files, which is a great plus. And the most useful feature is the ability to make your group hidden and closed.

Other Utilities

Evernote.com: This is the utility I use most. I use it on my Android phone and my iPod to write notes on the go, and I use its web interface to check and edit my notes. I keep all kinds of information in Evernote, like new blog posts, project topics, my home’s utility bills, etc.

Creately.com: This is the website I use to draw. Drawing ER diagrams, workflows, Gantt charts etc. is very easy with Creately. So it is useful and lightweight.

TeamViewer: Remote control is a necessity in an IT environment, and with TeamViewer it is easy to use my home computer from work or from my mobile device. Sometimes you need something from your personal computer, or you need a connection without proxy servers; in those cases TeamViewer works for me.

Google Calendar: I am an Android user, so my phone is integrated with my Google account, which leads me to use Google Calendar for my appointments. It is nice to have a synchronized agenda in my pocket all the time.

Hootsuite.com: This is a social media center. On the free plan you can add up to five accounts and control them all through an online dashboard. It is nice to have all your social media on one screen, to control your whole virtual presence, and to develop your personal brand and image.

That’s all. I use these tools to boost my productivity and collaboration. Try them and leave me a comment, or share the online tools you like most.