Big Data – Canburak Tümer's Blog

So yesterday we’ve completed a challenge. After uninterrupted coding for around 19 hours, I got some sleep.

First thing is first, I have to admit that I did not get any prize. I couldn’t make it to top three in this hackathon. I completed only 7 of the 9 challenges. If I may give you some statistics,

149 Developer signed up for code challenge on Eventbrite
87 of the came to the hackathon
After midnight we were around 20 left
Only 16 of developers submitted at least one solution (I submitted 7 so I think myself as a good developer)

Now I want to talk about challenges:

There were 9 challenges everyone require to process a little more than 100GB of data, 1TB in total. Except for the second challenge every challenge had a Twitter JSON data, second had an fixed length file. Every challenge had a 150MB sample data to try on.

Challenges were:

1- Counting the tweets grouped by country codes. And output should be ordered by country codes.
2- Every record was an 16 byte array, 8 byte for key and 8 byte for value. Challenge is to sum values grouped by keys. Output should be ordered by keys. But the real challenge was to read this type of file. Developers had to implement their own InputFormat and RecordReader.
3- It was about counting the unique users.
4- Counting tweets from giving language. Parameter should be given as input.
5- Finding the MAX and MIN user id.
6- Finding the person with the max popularity score, which is calculated as if A mentions B and A has X tweets and Y follower B gets X*Y points.
7- Finding the user with MAX and MIN score, which is calculated as if A tweets it gets -1 score and if A gets mentioned it gets +1 score.
8 – Counting all the tweets.
9 – Finding the person which has a maximum number of first and second degree connections, based on mentions.

I could not submit second and nineth challenges. And as I look the challenges now I saw that I misunderstood third one (blame the sleepless night).

The day has come, after playing with hadoop distributions around a year and two trainings; I feel ready to write an introduction post about Big Data, Hadoop and ecosystem projects.

1. What is Big Data?

Big Data is not Hadoop, Hadoop is just an implementation of Big Data concept. Big Data is a young concept in data analysis, ETL, data warehousing and data discovery fields; or data science for short. Every year, every day, every minute we create data, and every time we create more than we’ve created before. And we, data workers, process this data to get valuable pieces to our company, client, industry to increase income.

Here, Big Data comes in our life. Big Data has 5 V’s (in some sources you may find only 3 V’s) which are : Volume, Velocity, Variety, Verification (or Validity) and last but the most important to me Value.

Volume : Data we need to process and analyze every day is getting bigger day by day. So we need a new approach to data processing, this is where big data comes in.

Velocity : With the use of mobile devices, social media and internet we can create more data in some time, before we could do. For example, before social media if we generate just X MB data on the internet in a day, now we generate more than X GB of data in a day, may be in a couple hours. So we need to catch and process data really fast to catch-up with its speed. (There is also CEP or Fast Data concepts you may interest in this specific topic.)

Variety : I will blame social media again but with the increase of social media usage and internet usage now we generate unstructured and various types of data, like shares, likes, status updates, retweets, vines, videos, texts, gifs, other images. And to create value to our company we have to process this data. They are also coming from variable sources, network systems, internet, forms on corporate website etc.

Verification : Of course, we need the right data to get right results.

Value : This is it! It’s the reason why we process so much data in so short time. We need to get valuable pieces of data, we try to extract VALUE from the data. It is like searching for diamond in a mine.

2. What is Hadoop?

Hadoop is an open source project which implements Big data. It’s a distributed system to store and process data with commodity hardware. It does not require big powerful servers, instead you can create a cluster with desktop computers of quad-core processors with 2GB of RAM. (less than most of the modern laptops, almost all smartphones have 1 or 2 GB of RAM nowadays.)

Hadoop first started with Google’s white paper publishing on its Big Table and MapReduce architecture. Some cool guys tried to implement and develop these features in open source manner. Then Yahoo, Facebook, Google and Apache Foundation supported them. Now Hadoop is an open source Apache project.

3. Hadoop Distributions

You can install and use Hadoop through Linux’s repositories. But there is also start-ups who bundled Hadoop with some other open-source ecosystem projects for Hadoop and with their own tools as well.

Cloudera is one of these start-ups who has own bundle and it provides a VM also for quick start.

Hortonworks is the one other start-up, who has own bundle and own VM to getting started.

Also there is bigger solutions to Hadoop like Oracle’s Big Data Appliance, Teradata’s solution, IBM and HP also have their own enterprise solutions.

4. Hadoop Ecosystem Projects

Pig : Pig is one of the data analytic tools we can use with Hadoop. It has its own language to code which is a scripting language and called Pig Latin. It is really close to English so you can code like you are writing an English essay.

Hive : Hive is another way to run data analytics. It has a SQL like language, so it is usually preferred by developers who already have SQL knowledge.

Impala : Impala is the rival of Hive, it also has a SQL like language and it is much more faster than Hive, because it does not convert code to MapReduce, instead it runs on HDFS directly. (I will tell more about MapReduce and HDFS next time.)

Oozie : Oozie is a scheduling and job management tool. Where you can define flows as XML files, and it runs jobs as defined in this XML.

Sqoop : Sqoop is a tool to load data to Hadoop from a RDBMS or vice versa. It crates MapReduce jobs to load data and runs them automatically.

Flume : Flume is a listener basically. User defines an input channel and Flume polls it repeatedly, for example user defines a log file as a channel and Flume polls in every five minutes to get latest logs from this file.

Ambari : Ambari is administration (provisioning, managing and monitoring) console for the Hadoop cluster.

That’s all for introduction post, soon I will be writing about Hadoop Internals: HDFS and MapReduce. Please do not hesitate to leave comments or ask question in comments section. And hopefully in february I will be building a mini Hadoop cluster at home which will be topic for another blog post.

Thanks for reading.

Category: Big Data

Hadoop: After the big data hackathon

Hadoop: T2 Hackathon at Ankara

Hadoop: Introduction to Big Data, Hadoop and Hadoop Ecosystem