HadooP HDFS-Architecture History-Complete Overview



In the last article, we learn Big data. If you have not yet checked it out, I would highly recommend you to read it so that you have a basic overview of all the Big data concepts and characteristics. In this article, we will discuss four important features of what is Hadoop exactly?


It will good to cover my previous articles before this. Its help to learn more about Hadoop and Big data. In which I explained. Click here MongoDB and Hadoop.

In this section, we will learn

  1. Platform History
  2. What is Hadoop Exactly?
  3. How Hadoop works
  4. How Hadoop works
  5. The architecture of Hadoop
  6. Platform strength
  7. Platform weakness
  8. Future of Hadoop

Platform History

Hadoop was developed by Doug Cutting in Mike Cafarella. Almost 20 years ago he found two issue related the web search engine number one how can reliably to store all data and number two how to built a massive lookup index. In 2005 the Apache Hadoop came it is 100 % open source provides the new way of storing and processing data.

Doug, who was working at Yahoo at the time and is now Chief Architect of Cloudera. His son was 2 years old at the time when the project lunch he decides the project name his son’s toy elephant.

its developed by Apache Nutch web search engine project. But currently, HDFS is part of Apache Hadoop subproject.

What is Hadoop Exactly?

The first thing that we analyze is the data rate continuously increase. The data grow is huge as compared last year. All the data have value.

The data fit firm a single computer is challenging. The different distribution technique has used and saved data at different nodes also the distribution of data for faster computation. The Apache is open source software that is scalable reliable and distributed computing.

The Apache Hadoop is a framework using the single programming model it processes a large amount of data in the distribution environment. it’s work thousand of independent computer and Petabytes of data.

How Hadoop works

Hadoop consists of the distributed file system (HDFS) that is designed to run on commodity hardware also provides fault tolerance. The application that has a large amount of data is suitable for HDFS for high throughput access to the application.

Hadoop architecture use with two major layers HDFS and Map Reduce. Hadoop Distributed File System Stores files in blocks across many nodes in a cluster. The Name node which runs on a single node also called master node.

The Data node which runs all node in a cluster. The names node control all the creation delegation and replication on data node. A large amount of data split into different node like node 1 node 2 and node 3 and managed by the Hadoop cluster.

Alt tag What is Hadoop Exactly

How Map-Reduce works

Let me explain how its work. All the data store in node 1 node 2 or maybe node3. The data is stored in the form of the key-value pair like JSON format.

Alt tag what is Hadoop exactly

First Step: Map

In Haddop all the data is map or store in node 1 and node 2.

Second Step: Shuffle

The Haddop shuffle the same data in <node 1>  like balance “$2001” “age”: 26, “Eye”: “blue”, “name”: “programmehelper”, etc are stored in node one.

While remaining “balance”: “$2002”, “age”: 36, “eye”: “Red”, “name”: “programmerhelper2”, “gender”: “male” are store in <node 2>.

Third Step: Reduce

The final work is done in Hadoop. The distribution data in different node faster came on the final node.

How different from the conventional database

The new requirement is arising in the area of the database some new database have come with the highly excellent feature.

The database is used to handling a large amount of data. The relational database has a lot of qualities like performance scalability flexibility robustness and compatibility but lack.

In recent days it is difficult to handle both the size of data and concurrent actions on data with the standard row column RDBMS.

Comparison SQL Hadoop
Structure Unstructured Semi-structured data
Schema on write Schema on write
Record key-value -pair
Normalized Denormalized
SQL Map-Reduce
Scale-Out Scale-Up
Processing  limited Processing with a large amount of data
Relation database data
is saved on one server
Data in Hadoop distribution
on different nodes. processing
data is distributed
Relation database data
can be read much time
In Hadoop data is write once and read many. when you write data you cannot update.you just delete
the data
Relation database support
SQL database
Hadoop is supported many NoSQL
and SQL database
SAP, Oracle, SQL server
Hadoop is the very rich tool or platform which is open source.
OLTP Analytical
License Free
Relation database is run on very low commodity hardware. it runs on many operating systems. Hadoop is run very low commodity
hardware.it runs on many operating systems.


Architecture of Hadoop


Platform strength

  • In Hadoop can work with parallel databases along with high fault tolerance and ability to run heterogeneous environments.
  • Hadoop minimized the data processing time, especially in complex query processing.
  • Hadoop is a scalable fault tolerance and flexible.
  • Scalable: Unlike the relational database that is not scalable and not processes a large amount of data. The Hadoop is highly scalable because it stores a large amount of data in thousand of the server. The Hadoop is used to manage terabytes of data in thousand of the server.
  • Fault Tolerance: The data is sent to several of the node form servers. The data is sent completely replicated to all of the nodes. In case of one copy of data fail it is available from another node.
  • Flexible: It is flexible in such a way to process both types of data (Structure and unstructured)

Platform weakness

  • Installing and configuring database is the difficult task. It mandatory to use DBMS.
  • The programming model is not easy because it changes the interface to SQL.
  • For the smaller dataset, Hadoop is not efficient. Due to high capacity design and distributed file system its lack to work small file. So the Hadoop is not the best choice for smaller datasets.
  • Hadoop has used the properties of the transaction (ACID). The  DBMS engines which will affect the dynamic scheduling and fault tolerance of Hadoop.
  • Stability issue because of its open source.
  • In Hadoop developer always write a code for each operation which very difficult task.
  • Security is managed in a Hadoop is the challenge.  The application contains the massive amount of data. The data store in a network is a major issue it does not provide any encryption.
  • The Hadoop Map-reduce function slows the performance because of its support of multiple formats.

Future of Hadoop

Every technology comes into the ground with new features. The Hadoop come in a market with the excellent feature. It characterized the organization which is the huge amount of data.

The traditional relational database SQL Server Excel many more is not managing a large amount of data. These traditional approaches would more costly and organization feel hesitate to use it.

Thus Hadoop comes with exciting feature Hadoop HDFS and Map-reduce. The data rate continuously increasing most of the company are prefer Hadoop for the future.

A lot of reason behind this data processing speed fast and identified the trends and predictions in previous data.  The company data increased in seconds like twitter facebook terabytes to petabytes they need Hadoop.


I wish I could tell you that a great site. You just understand the key element above post-What is Hadoop exactly?. More, detail about MongoDB and Hadoop You also read my previous lecture. I hope you will understand this lecture. Thank you for reading this lecture. Hope you got the idea. please share it. If you find any mistake or confusion please Commit on reply section.