
What is big data? And what role does Hadoop have to play?

In recent times we have been hearing and reading (in many blogs) quite often about cloud computing, big data and HPC, i.e. high-performance computing. Undoubtedly, these are the buzzwords of this decade. If you have been in the industry for some time, you will recall that the last decade was the Web 2.0 decade.
So what is big data? Is it a new framework, a new data modelling technique, or part of the NoSQL movement?
I have often seen people relate big data to one of these, or simply be confused about it. The same goes for Hadoop, which is frequently mistaken for a NoSQL database framework.
Big data is nothing but a huge amount of data that requires mining. Per Wikipedia, big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set. So big data is a looming problem for organizations that deal in high volumes of data and require storage, search, sharing, analytics and visualization, such as Google Search, Facebook, Amazon or Twitter. And with time this data will only grow, and the analytics or mining will become an even bigger problem.

Now the question could be: how does Hadoop come into the picture? And are NoSQL databases part of the problem?
Big data can be any kind of data, whether it sits in an RDBMS or a NoSQL store, or comes from software logs and sensors. Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Technologies being applied to big data include massively parallel processing (MPP) databases, data-mining grids, etc.
However, consider the Google or Facebook example. Google has its own NoSQL database, Bigtable, which is essentially a big hashmap of key-value pairs, with the data often kept in JSON-like object form. Similarly, Facebook has used Cassandra for storing its users' data. The advantage of these kinds of databases is that data retrieval is considerably faster than in a traditional RDBMS, so throughput stays high even with hundreds of millions of records.
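To make the "big hashmap" idea concrete, here is a toy sketch in plain Java (not the actual Bigtable or Cassandra API): values are stored as JSON-like documents and fetched directly by key, with no joins or query planning involved.

import java.util.HashMap;
import java.util.Map;

// Toy illustration of a key-value store as an in-memory hashmap.
// Real systems such as Bigtable or Cassandra spread these key-value pairs
// across many machines, but the access pattern is the same: look up by key.
public class KeyValueSketch {
    public static void main(String[] args) {
        Map<String, String> userStore = new HashMap<String, String>();

        // Values kept as JSON-like documents rather than normalized rows.
        userStore.put("user:1001", "{\"name\":\"Alice\",\"city\":\"Pune\"}");
        userStore.put("user:1002", "{\"name\":\"Bob\",\"city\":\"Delhi\"}");

        // Retrieval is a direct key lookup -- no joins and no SQL parsing,
        // which is why throughput stays high even at very large scale.
        System.out.println(userStore.get("user:1001"));
    }
}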
So if it is that fast, where is the problem?
The problem is that this data is unstructured, unlike in an RDBMS where you just need to write the correct SQL to retrieve the desired data (possibly with the help of grid or parallel computing). Hence searching, analyzing or visualizing this kind of data is a much bigger problem.
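As a rough illustration of the gap, assume web-server access logs in a made-up space-separated format. With structured data a single declarative query such as SELECT COUNT(*) FROM visits WHERE page = '/home' would do the job; with raw log lines the parsing and filtering have to be hand-written, and then somehow parallelized across machines.

import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: counting visits to a page from unstructured log lines.
// The format "<ip> <timestamp> <page>" is an assumption made for illustration.
public class LogScanSketch {
    static long countVisits(List<String> rawLogLines, String page) {
        long count = 0;
        for (String line : rawLogLines) {
            String[] fields = line.split(" ");
            if (fields.length >= 3 && fields[2].equals(page)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList(
            "10.0.0.1 2011-07-01T10:00:00 /home",
            "10.0.0.2 2011-07-01T10:00:05 /about",
            "10.0.0.1 2011-07-01T10:00:09 /home");
        System.out.println(countVisits(logs, "/home")); // prints 2
    }
}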
Here MapReduce comes to the rescue. MapReduce is a programming model and implementation developed (and patented) by Google for processing massive-scale, distributed data sets, and Hadoop is an Apache project inspired by Google's MapReduce. The model takes its name from the map and reduce functions commonly used in functional programming.
Per Wikipedia, MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or as a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured).
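A minimal, single-machine sketch of the two phases (plain Java, no Hadoop involved): the map step turns each input line into (word, 1) pairs, and the reduce step sums the values that share a key. The input strings are just made-up sample data.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-machine sketch of the MapReduce idea, applied to word counting.
public class MapReduceIdea {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "big data is big",
            "hadoop processes big data");

        // "Map" phase: each line is turned into (word, 1) pairs.
        List<SimpleEntry<String, Integer>> pairs = new ArrayList<SimpleEntry<String, Integer>>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                pairs.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }

        // "Reduce" phase: values sharing the same key are summed.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (SimpleEntry<String, Integer> pair : pairs) {
            Integer current = counts.get(pair.getKey());
            counts.put(pair.getKey(), (current == null ? 0 : current) + pair.getValue());
        }

        System.out.println(counts); // e.g. {big=3, data=2, hadoop=1, is=1, processes=1}
    }
}

In a real MapReduce run, the pairs produced by the map phase are partitioned by key and shipped to different machines so that many reducers can work in parallel; the code above only shows the logical shape of the computation.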

How does MapReduce work?
For a more theoretical view, you can look at the figure below.


Hence Hadoop is basically a framework for enabling powerful parallel processing of huge data sets. The figure below shows how MapReduce can be utilized for analytics purposes.
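To make this concrete in code, here is a minimal sketch of the canonical word-count job written against Hadoop's org.apache.hadoop.mapreduce API (exact class names and job setup vary slightly between Hadoop versions; the input and output paths are passed as command-line arguments).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for every line of input, emit (word, 1) for each word in the line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: all counts for the same word arrive together; sum them up.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory (e.g. on HDFS)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged into a jar and submitted to the cluster with the hadoop jar command, with HDFS directories supplied as the input and output paths; Hadoop then takes care of splitting the input, scheduling the map and reduce tasks across nodes, and re-running any tasks that fail.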



