In recent times we have been hearing and reading (in many blogs) quite often about cloud computing, big data and HPC, i.e. high-performance computing. Undoubtedly, these are the buzzwords of this decade. If you have been in the industry for some time, you can recall that the last decade was the Web 2.0 decade.
So what is big data? Is it a new framework, a new data modelling technique, or part of the NoSQL movement?
I have often seen people relate big data to one of these, or simply be confused about it. The same goes for Hadoop, which is often thought of as a NoSQL database framework.
Big data is nothing but a huge amount of data which requires mining. Per Wikipedia, big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set. So big data is a looming problem for organizations which deal in high volumes of data and need storage, search, sharing, analytics and visualization, such as Google Search, Facebook, Amazon or Twitter. And with time this data will only grow, and the analytics or mining will become an even bigger problem.
Now the question could be: how does Hadoop come into the picture, and are NoSQL databases part of the problem?
Big data can be any kind of data: rows in an RDBMS, records in a NoSQL store, software logs, sensor readings and so on. Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Technologies being applied to big data include massively parallel processing (MPP) databases, data-mining grids, etc.
However, take Google or Facebook as an example. Google has its own NoSQL database, Bigtable, which is essentially a big hashmap (key-value pairs), or data stored in the form of JSON-like objects. Similarly, Facebook uses Cassandra for storing its users' data. The advantage of using these kinds of databases is that data retrieval is considerably faster than in a traditional RDBMS, hence the throughput stays high even with hundreds of millions of records. A minimal sketch of this key-value idea follows.
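To make the "big hashmap" idea concrete, here is a minimal, purely illustrative Java sketch that models a user store as a key-value map. The UserProfile class, the field names and the in-memory HashMap are assumptions made for this example only; real stores such as Bigtable or Cassandra partition and replicate these key-value pairs across many machines.

import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for a key-value store: look up a whole record by its key.
// Real systems (Bigtable, Cassandra) spread these pairs across a cluster.
public class KeyValueSketch {

    // Hypothetical value object; the field names are made up for this example.
    static class UserProfile {
        final String name;
        final String email;
        UserProfile(String name, String email) {
            this.name = name;
            this.email = email;
        }
    }

    public static void main(String[] args) {
        // Key = user id, value = the user's profile record.
        Map<String, UserProfile> userStore = new HashMap<>();
        userStore.put("user:1001", new UserProfile("Alice", "alice@example.com"));
        userStore.put("user:1002", new UserProfile("Bob", "bob@example.com"));

        // Retrieval is a direct lookup by key -- no joins, no query planner.
        UserProfile p = userStore.get("user:1001");
        System.out.println(p.name + " <" + p.email + ">");
    }
}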
So if it is that fast, then where is the problem?
The problem is that this data is unstructured, unlike an RDBMS where you just need to write the correct SQL to retrieve the desired data (possibly using grid or parallel computing). Hence searching, analyzing or visualizing this kind of data is a much bigger problem.
Here MapReduce comes to the rescue. MapReduce is a programming model and implementation developed (and patented) by Google for processing massive, distributed data sets. Hadoop is an Apache initiative inspired by Google's MapReduce. The framework takes its name from the map and reduce functions commonly used in functional programming.
Per Wikipedia, MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or as a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured). A small functional-style illustration follows.
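As a rough analogy to those functional-programming roots, here is a tiny Java sketch (using the standard Streams API) where map transforms each element and reduce folds the results into one value. This is only a single-machine illustration of the concept; MapReduce applies the same idea to data spread over thousands of nodes.

import java.util.List;

// Single-machine analogy for the functional map and reduce steps.
public class MapReduceAnalogy {
    public static void main(String[] args) {
        List<Integer> numbers = List.of(1, 2, 3, 4, 5);

        int sumOfSquares = numbers.stream()
                .map(n -> n * n)          // "map": transform each element independently
                .reduce(0, Integer::sum); // "reduce": combine the mapped values into one result

        System.out.println(sumOfSquares); // prints 55
    }
}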
How does MapReduce work?
In short, the map step reads the input and emits intermediate key-value pairs, the framework groups those pairs by key, and the reduce step aggregates the values for each key. For a more theoretical view, you can look at the figure below; a word-count sketch follows it.
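To make this concrete, here is a sketch of the canonical word-count job written against Hadoop's org.apache.hadoop.mapreduce API (assuming a Hadoop 2.x or later release where Job.getInstance is available); input and output paths are passed on the command line. The mapper emits (word, 1) pairs and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: for every word in a line of input, emit the pair (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce step: the framework has grouped all counts for a word; sum them up.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}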
Hence Hadoop is basically a framework for enabling powerful parallel processing of huge amounts of data. The figure below shows how MapReduce can be utilized for analytics purposes.
References
[1] Hadoop Wiki: http://wiki.apache.org/hadoop/PoweredBy
[2] Apache Hadoop: http://hadoop.apache.org/
[3] Hadoop tutorial: http://www.ibm.com/developerworks/java/library/j-javadev2-15/index.html
[4] Blog: http://cleanclouds.wordpress.com/2011/03/28/big-data-with-hadoop-cloud