Overview of Hadoop Cluster
A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.
Hadoop Cluster Operational Process: Divide-and-conquer strategies can be quite effective for workloads that deal with massive amounts of data: a single large workload can be divided, or mapped, into smaller sub-workloads, and the results from the sub-workloads can be merged, condensed, and reduced to obtain the final result. The idea behind Hadoop was to exploit this property and assign the smaller sub-workloads to a large cluster of inexpensive nodes built with general-purpose hardware, rather than use expensive, fault-tolerant hardware. Handling massive amounts of data also requires storing it, so Hadoop includes a distributed cluster file system that scales to hold these volumes. The cluster is built so that the entire infrastructure is resilient and fault tolerant even though individual components can fail, which keeps the system-wide failure rate low despite the higher failure rate of the individual commodity components.
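The map-and-reduce idea described above can be sketched in a few lines of plain Python. This is only an illustration of the divide-and-conquer pattern on a single machine, not Hadoop's actual API: the function names (split_workload, map_chunk, reduce_counts) and the sample sentences are invented for this example.

```python
from collections import Counter
from functools import reduce


def split_workload(lines, num_chunks):
    """Divide the input into num_chunks smaller sub-workloads."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_chunk(chunk):
    """Map step: count the words of one sub-workload on its own."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the partial counts into one final result."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical sample input; a real job would read from the cluster file system.
lines = [
    "big data needs big clusters",
    "clusters store data",
    "data is divided and results are merged",
]

chunks = split_workload(lines, 2)
# In a real cluster each chunk would be processed on a different node.
partials = [map_chunk(chunk) for chunk in chunks]
word_counts = reduce_counts(partials)

print(word_counts["data"], word_counts["big"], word_counts["clusters"])
```

Because each call to map_chunk depends only on its own chunk, the sub-workloads could run on separate machines, and a failed one could simply be rerun elsewhere without affecting the others.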
There are two main types of Hadoop cluster.
Single Node Cluster: By default, Hadoop is configured to run in a non-distributed or standalone mode, as a single Java process. Everything runs in a single JVM instance. This single node cluster is mainly used for development and testing.
Multi-Node Cluster: Here the services (daemons) run in separate JVM instances on different hosts/machines. The cluster follows a master-slave architecture, i.e. at least one machine acts as the master, on which the Name Node daemon runs, and the remaining hosts/machines act as slaves that run the other Hadoop daemons, such as the Data Node. The diagram below is a physical model of a Hadoop multi-node cluster.
The following components are involved in a Hadoop multi-node cluster:
Switch: A network switch (also called a switching hub, bridging hub, or, officially, a MAC bridge) is a networking device that connects devices on a computer network, using packet switching to receive, process, and forward data to the destination device.
Core Switch: A network switch configured to handle communication between racks.
Rack Switch: A network switch configured to handle communication within a rack.
Name Node: The Name Node is the centrepiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. Client applications talk to the Name Node whenever they wish to locate a file, or when they want to add, copy, move, or delete a file. The Name Node responds to successful requests by returning a list of the relevant Data Node servers where the data lives.
Data Node: A Data Node stores data in the Hadoop file system. A functional file system has more than one Data Node, with data replicated across them. On start-up, a Data Node connects to the Name Node, retrying until that service comes up. It then responds to requests from the Name Node for file system operations.
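Secondary Name Node: The term "secondary name node" is somewhat misleading, because it is not a standby for the Name Node. It periodically pulls the Name Node's metadata (the file system image and the edit log), merges them into a checkpoint, and keeps a copy. It cannot take over automatically if the primary Name Node fails, but its most recent checkpoint can be used to help recover the metadata.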
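The division of labour between the Name Node and the Data Nodes can be illustrated with a toy model. The sketch below is not HDFS code: the class ToyNameNode, its methods, the path /logs/app.log, and the block and node names are all hypothetical, chosen only to show that the Name Node answers "where is the data?" without holding the data itself.

```python
class ToyNameNode:
    """Toy metadata service: knows which blocks form a file and where they live."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of Data Node names

    def add_file(self, path, placements):
        """Register a file; placements is [(block_id, [datanode, ...]), ...]."""
        self.file_blocks[path] = [block_id for block_id, _ in placements]
        for block_id, nodes in placements:
            self.block_locations[block_id] = list(nodes)

    def locate(self, path):
        """Return [(block_id, [datanodes])] for a file; never the file contents."""
        if path not in self.file_blocks:
            raise FileNotFoundError(path)
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


# Hypothetical file split into two blocks, each replicated on three Data Nodes.
name_node = ToyNameNode()
name_node.add_file(
    "/logs/app.log",
    [
        ("blk_1", ["datanode-a", "datanode-b", "datanode-c"]),
        ("blk_2", ["datanode-b", "datanode-c", "datanode-d"]),
    ],
)

for block_id, nodes in name_node.locate("/logs/app.log"):
    # A client would now contact one of these Data Nodes to read the block.
    print(block_id, "->", nodes)
```

A client that receives this answer reads each block directly from one of the listed Data Nodes, so the bulk data never passes through the Name Node.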
As we can see, Hadoop's scalability and fault tolerance are built into its architecture. When data is written to one node, it is replicated to other nodes, so the loss of a single node does not mean the loss of data. This is a significant advantage over traditional relational database systems when handling data at this scale.
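A rough calculation shows why replication helps. The sketch below assumes, for simplicity, that nodes fail independently of one another and that a block becomes unavailable only when every node holding a replica is down. The 5% per-node failure probability is a made-up figure for illustration; three replicas is used because that is the HDFS default replication factor.

```python
def block_unavailable_probability(node_failure_prob, replicas):
    """Chance that every replica of a block is down, assuming independent failures."""
    return node_failure_prob ** replicas


# Hypothetical figure: each node is down 5% of the time.
node_failure_prob = 0.05

for replicas in (1, 2, 3):
    p = block_unavailable_probability(node_failure_prob, replicas)
    print(f"{replicas} replica(s): block unavailable with probability {p:.6f}")
```

Under these assumptions, each extra replica multiplies the chance of losing access to a block by the per-node failure probability. Real failures are not fully independent (a rack switch or power fault can take out several nodes at once), so the real benefit is smaller than this idealised figure suggests.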