What is MapReduce in Hadoop?

Explaining MapReduce with an example...


MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. In this article, we will see how MapReduce works in the Hadoop ecosystem with an example. Let’s assume we have a large data set of temperature readings recorded in different cities on a particular day at specific intervals, as below; the goal is to find the highest and lowest temperature recorded in each city for that day.
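For illustration, assume the recorded data looks like the following; the cities, times, and readings here are hypothetical stand-ins, since the exact figures are not essential to the mechanics:

    City,   Date,       Time,  Temperature (°C)
    Delhi,  2014-05-01, 06:00, 25
    Delhi,  2014-05-01, 12:00, 31
    Delhi,  2014-05-01, 18:00, 22
    Mumbai, 2014-05-01, 06:00, 27
    Mumbai, 2014-05-01, 12:00, 33
    Mumbai, 2014-05-01, 18:00, 29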


The following three stages play a vital role in the MapReduce model:

  • Mapper

  • Combiner

  • Reducer


In the first step of a MapReduce job, the large data set is split into chunks of smaller datasets (sub-datasets), and the Mapper processes each sub-dataset in parallel to produce the required intermediate output. The output of the split is the set of smaller datasets, whereas the output of the mapping is key-value pairs: for every record in a sub-dataset, the Mapper emits one or more associated key-value pairs.
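A minimal sketch of such a Mapper in Hadoop’s Java API, assuming the hypothetical CSV layout above (City,Date,Time,Temperature); the class name TemperatureMapper is made up for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Reads one line of the input split at a time and emits (city, temperature).
    public class TemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            String city = fields[0].trim();
            int temperature = Integer.parseInt(fields[3].trim());
            context.write(new Text(city), new IntWritable(temperature));
        }
    }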

So, in this case, the Mapper will produce output like the following:
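With the hypothetical sample above, the emitted key-value pairs would be:

    (Delhi, 25), (Delhi, 31), (Delhi, 22),
    (Mumbai, 27), (Mumbai, 33), (Mumbai, 29)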

Between the map and reduce phases, the framework shuffles the key-value pairs formed by the Mappers: the sub-datasets’ key-value pairs are merged and sorted by key, and all values belonging to the same key are grouped together under that key. This produces output of the form {Key = List<Values>}. A Combiner can optionally run on each Mapper’s local output before the shuffle; it acts as a mini-reducer that pre-aggregates the local key-value pairs, as sketched below.
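A sketch of such a Combiner, again using hypothetical names; it emits both the local maximum and the local minimum for each city, which is safe because the global max/min over these partial results equals the max/min over the raw values:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Pre-aggregates each Mapper's local output before the shuffle.
    public class TemperatureCombiner
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text city, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            int min = Integer.MAX_VALUE;
            for (IntWritable t : temps) {
                max = Math.max(max, t.get());
                min = Math.min(min, t.get());
            }
            // Emitting both local extremes keeps the final Reducer's result correct.
            context.write(city, new IntWritable(max));
            context.write(city, new IntWritable(min));
        }
    }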

The Reducer performs the final step in the MapReduce paradigm: it takes the grouped, key-sorted pairs and computes the final output required, in this case the per-city temperature extremes.
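A corresponding Reducer sketch for the temperature example (class name hypothetical); it receives each city’s grouped values and writes one record with that city’s extremes:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Receives (city, [temperatures]) and writes the final max/min per city.
    public class TemperatureReducer
            extends Reducer<Text, IntWritable, Text, Text> {

        @Override
        protected void reduce(Text city, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            int min = Integer.MAX_VALUE;
            for (IntWritable t : temps) {
                max = Math.max(max, t.get());
                min = Math.min(min, t.get());
            }
            context.write(city, new Text("max=" + max + ", min=" + min));
        }
    }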


In our case, the highest and lowest temperatures are read off from the grouped, sorted values as follows:
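For the hypothetical sample data, the final output would be:

    Delhi    max=31, min=22
    Mumbai   max=33, min=27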

The output of the Reducer is not re-sorted by the framework; records are written in the order the Reducer emits them, which is already key order within each reducer.
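Finally, a driver class wires the three stages into a single Hadoop job. This is a minimal sketch, assuming the hypothetical classes above and input/output paths passed on the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TemperatureDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "city temperature max/min");
            job.setJarByClass(TemperatureDriver.class);
            job.setMapperClass(TemperatureMapper.class);
            job.setCombinerClass(TemperatureCombiner.class);
            job.setReducerClass(TemperatureReducer.class);
            // The map/combine output types (Text, IntWritable) differ from the
            // final reduce output types (Text, Text), so both must be declared.
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }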


Some of the advantages of the MapReduce framework include its cost-effectiveness, flexibility, and scalability, which stem from its inherently parallel processing architecture. This scalability enables businesses to run MapReduce jobs across a large number of nodes and over huge volumes of data.
