MongoDB Map and Reduce
What is MapReduce?
Map-reduce is a data processing paradigm for condensing large volumes of data into useful aggregated results. For map-reduce operations, MongoDB provides
the mapReduce database command. In very simple terms, the mapReduce command takes two primary inputs: a mapper function and a reducer function.
Mapper function: MongoDB applies the map function to each input document. A document in MongoDB is analogous to a row in a SQL database. The purpose of the map phase is to emit key-value pairs, and a single document can emit any number of them. When several pairs share the same key, their values are grouped under that key. For example, A => 9 and A => 10 are grouped as A => [9, 10].
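The grouping step can be sketched in plain JavaScript, outside MongoDB. The pairs and the helper name groupByKey are made up for illustration; MongoDB performs this grouping internally.

```javascript
// Hypothetical pairs that a map phase might emit; the key "A" appears twice.
const emitted = [["A", 9], ["B", 4], ["A", 10]];

// Collect the values that share a key into one array per key.
function groupByKey(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

const grouped = groupByKey(emitted);
console.log(grouped.get("A")); // [ 9, 10 ]
console.log(grouped.get("B")); // [ 4 ]
```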
If the above doesn't make much sense yet, don't worry; we will see all of this in action in the next part of the article.
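Reducer function: The map phase is followed by the reduce phase. MongoDB calls the reduce function for each key that has more than one value, passing it the key and the array of values emitted for that key. The reduce function processes those values and returns a single result per key.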
From an implementation perspective, most Map/Reduce frameworks operate on tuples. The map implementation accepts some set of data and transforms it into
another set of data, typically tuples (key/value pairs). Consequently, the reduce implementation accepts the output from a map implementation as its input
and combines (reduces) those tuples into a smaller (aggregated) set of tuples, which eventually becomes a final result.
A Mapper starts by reading a collection of data and keeping only the fields we wish to process from each document. Values that share a key are grouped into one array. Each key, together with its array of values, is then fed into a Reducer, which processes the values.
Ex: Let's say that we have the following data
And we want to total the price of all the items with the same name. We will run this data through a Mapper and then a Reducer to achieve the result.
When we ask a Mapper to process the above data without any conditions, it will generate the following result
That is, it has grouped together all the values that share the same key, in our case the name. These results are then sent to the Reducer.
Now, in the reducer, we get the first row from the above table. We iterate through all of its values and add them up, which gives the sum for the first row. Next, the reducer receives the second row and does the same thing, and so on until all the rows are processed.
The final output would be
So now you can understand why a Mapper is called a Mapper (because it creates a map of the data) and why a Reducer is called a Reducer (because it reduces the data that the mapper has generated to a simpler form).
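The walkthrough above can be simulated in plain JavaScript. This is only a sketch: the items below are invented sample data (not the article's original table), and the small harness stands in for MongoDB, which normally supplies emit and binds this to the current document.

```javascript
// Hypothetical documents; only the field names `name` and `price` come from the text.
const items = [
  { name: "pen", price: 2 },
  { name: "book", price: 12 },
  { name: "pen", price: 3 },
  { name: "book", price: 8 },
  { name: "lamp", price: 20 },
];

// Stand-in for MongoDB's built-in emit(): records each key-value pair.
const emitted = [];
function emit(key, value) {
  emitted.push([key, value]);
}

// Map function in the MongoDB style: `this` is the current document.
function mapFn() {
  emit(this.name, this.price);
}

// Reduce function: receives one key and the array of values emitted for it.
function reduceFn(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Map phase: run mapFn once per document.
for (const doc of items) mapFn.call(doc);

// Group the emitted values by key.
const grouped = new Map();
for (const [key, value] of emitted) {
  if (!grouped.has(key)) grouped.set(key, []);
  grouped.get(key).push(value);
}

// Reduce phase. Like MongoDB, skip reduce for keys that have a single value.
const totals = {};
for (const [key, values] of grouped) {
  totals[key] = values.length > 1 ? reduceFn(key, values) : values[0];
}

console.log(totals); // { pen: 5, book: 20, lamp: 20 }
```

In a real deployment only mapFn and reduceFn would be written by hand and passed to mapReduce; the grouping and the loops are done by the server.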
The db.collection.mapReduce() method performs map-reduce style data aggregation. In addition to the map and reduce functions, it accepts the following options:
out specifies the location of the result of the map-reduce operation.
query specifies the selection criteria using query operators for determining the documents input to the map function.
sort sorts the input documents. This option is useful for optimization. The sort key must be in an existing index for this collection.
limit specifies the maximum number of documents passed into the map function.
finalize follows the reduce method and modifies the output.
scope specifies global variables that are accessible in the map, reduce, and finalize functions.
jsMode specifies whether to convert intermediate data into BSON format between the execution of the map and reduce functions.
verbose specifies whether to include timing information in the result.
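As a rough illustration, the options above could be combined into a single object like the one below. The collection name items, the field names, and every value are assumptions chosen for the example; only the option names themselves come from the list above.

```javascript
// Hypothetical options for a collection named `items` with `name` and `price` fields.
const options = {
  out: { inline: 1 },            // return the results inline instead of writing a collection
  query: { price: { $gt: 0 } },  // only documents with a positive price reach the map function
  sort: { name: 1 },             // assumes an index on `name` exists
  limit: 1000,                   // at most 1000 documents are passed to map
  finalize: function (key, reducedValue) {
    // Runs after reduce, once per key; here it rounds the total to two decimals.
    return Math.round(reducedValue * 100) / 100;
  },
  scope: { currency: "USD" },    // would be visible as a global variable named `currency`
  jsMode: false,                 // convert intermediate data to BSON between map and reduce
  verbose: true,                 // include timing information in the result
};

// In the mongo shell the call would look like this,
// with mapFn and reduceFn defined elsewhere:
// db.items.mapReduce(mapFn, reduceFn, options);
```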
When dealing with millions or billions of records, the benefit of map-reduce is that both the map and the reduce functions can be distributed, which means the same code can be executed in parallel by many CPUs and servers. Furthermore, complex aggregations can be cleaner and easier to write as mapReduce operations in MongoDB.
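One reason the work can be split up is that a reduce function such as a sum gives the same answer whether it sees all values at once or combines partial results. The sketch below imitates two servers with two arrays. The names shardA and shardB and their values are invented, and it does not show how MongoDB actually schedules the work.

```javascript
// A reduce function whose output has the same shape as its input values,
// so its results can be fed back into it.
function reduceFn(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Hypothetical values emitted for the key "pen" on two different servers.
const shardA = [2, 3];
const shardB = [7, 1, 4];

// Each server reduces its own values independently.
const partialA = reduceFn("pen", shardA); // 5
const partialB = reduceFn("pen", shardB); // 12

// The partial results are then reduced again into the final value.
const combined = reduceFn("pen", [partialA, partialB]); // 17

// Same answer as reducing everything in one place.
const allAtOnce = reduceFn("pen", [...shardA, ...shardB]); // 17
console.log(combined === allAtOnce); // true
```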