
What is Apache Sqoop?
Sqoop provides a mechanism to connect external systems such as Enterprise Data Warehouses (EDWs, e.g. Amazon Redshift) and Relational Database Management Systems such as MySQL, Oracle, and MS SQL Server to Hadoop, and to transfer data between Hadoop and those systems. It efficiently transfers bulk data between Hadoop and databases. Sqoop helps us bring structured data from an RDBMS into HDFS, where other Hadoop ecosystem components such as MapReduce can then process it.
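As a rough illustration of a Sqoop import, the sketch below launches one programmatically through Sqoop 1.x's org.apache.sqoop.Sqoop.runTool; the JDBC URL, table, credentials, and target directory are placeholder assumptions, and the same arguments could equally be passed to the sqoop import command line.

```java
import org.apache.sqoop.Sqoop;

// Minimal sketch: import a MySQL table into HDFS with Sqoop 1.x.
// The connection string, table, user, and target directory below are
// illustrative placeholders, not values from the article.
public class SqoopImportExample {
    public static void main(String[] args) {
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/sales",   // source RDBMS
            "--username", "etl_user",
            "--password", "secret",
            "--table", "orders",                             // table to pull
            "--target-dir", "/user/hadoop/orders",           // HDFS destination
            "--num-mappers", "4"                             // parallel map tasks
        };
        // runTool launches the import as a MapReduce job and returns its exit code.
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}
```

Each mapper pulls a slice of the source table in parallel, which is what makes the bulk transfer efficient.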

Installing Cloudera Manager
Installing Cloudera Manager 5.4.1 in VirtualBox/Linux/CentOS: a step-by-step guide to a clean installation of Cloudera Manager in VirtualBox. Step 1 - OS Installation: This is a guide to installing Cloudera Manager 5.4.1 in Oracle VirtualBox. The first and foremost step is to install an OS in VirtualBox; in this tutorial, we use CentOS 6.6. Cloudera Manager supports the following operating systems (source: the Cloudera website): RHEL-compatible distributions such as Red Hat Enterprise Linux and CentOS, among others.

What is MapReduce in Hadoop ?
Explaining MapReduce with an example... MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. In this article, we will see how MapReduce works in the Hadoop ecosystem with an example. Let’s assume we have a large data set of temperatures recorded at different cities on a particular day at specific intervals, as below; the goal is to figure out the highest temperature recorded for each city.
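A minimal sketch of such a job in Java MapReduce is shown below; the input layout (one "city,temperature" pair per line), class names, and paths are assumptions made for illustration rather than code from the article.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: find the highest temperature per city from lines like "Delhi,41".
public class MaxTemperature {

    public static class TempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is assumed to be "<city>,<temperature>".
            String[] parts = value.toString().split(",");
            if (parts.length == 2) {
                context.write(new Text(parts[0].trim()),
                              new IntWritable(Integer.parseInt(parts[1].trim())));
            }
        }
    }

    public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text city, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            // The reducer sees all temperatures for one city and keeps the maximum.
            int max = Integer.MIN_VALUE;
            for (IntWritable t : temps) {
                max = Math.max(max, t.get());
            }
            context.write(city, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature per city");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(TempMapper.class);
        job.setCombinerClass(MaxReducer.class);   // max is associative, so the reducer doubles as a combiner
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The map phase emits a (city, temperature) pair per record, and the reduce phase sees all temperatures for one city and keeps the largest, which is exactly the divide-and-combine pattern described above.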

HDFS Architecture
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients, and multiple DataNodes that act as slaves. NameNode: the master node that controls the whole cluster and the division of files into blocks. The typical block size is 64 MB or 128 MB, but it can be configured through parameters in <HADOOP_INSTALL>/conf/hdfs-site.xml, as can the number of copies (replicas) of each block that are kept.
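As a small sketch of where those settings surface, the snippet below reads the default block size and replication factor through the Hadoop FileSystem API; the property names mentioned in the comments (dfs.blocksize, or dfs.block.size in older releases, and dfs.replication) are the usual hdfs-site.xml parameters, and the path used is only a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: query the configured HDFS block size and replication factor.
// In hdfs-site.xml these correspond to dfs.blocksize (dfs.block.size in
// older releases) and dfs.replication.
public class HdfsDefaults {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up *-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        Path anyPath = new Path("/");                  // placeholder path for the queries below
        long blockSize = fs.getDefaultBlockSize(anyPath);      // e.g. 64 MB or 128 MB in bytes
        short replication = fs.getDefaultReplication(anyPath); // e.g. 3 copies of each block

        System.out.println("Default block size (bytes): " + blockSize);
        System.out.println("Default replication factor: " + replication);
    }
}
```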

Hadoop Cluster
Overview of a Hadoop Cluster: A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. Hadoop Cluster Operational Process: divide-and-conquer strategies can be quite effective for several kinds of workloads that deal with massive amounts of data: a single large workload can be divided, or mapped, into smaller sub-workloads, and the results from the sub-workloads can then be combined, or reduced, into a final result.


Why "Curiosity" is something that hiring managers for Big Data are looking for
Big Data projects are complex and require innovative solutions. Traditional data warehousing, processing, and ETL approaches, in and of themselves, are not effective answers to the increasing volume and complexity of the data being generated. The increasing complexity is due to the newer kinds of data being collected, such as data from sensors, devices, instrumentation, and social media, many of which did not exist a few years ago. The same goes for the volume, with ever-increasing amounts of data being generated.