MongoDB Map and Reduce


What is MapReduce?

Map-reduce is a data processing paradigm for condensing large volumes of data into useful aggregated results. For map-reduce operations, MongoDB provides the mapReduce database command. In very simple terms, the mapReduce command takes two primary inputs: the mapper function and the reducer function.

Mapper function: MongoDB applies this phase to each document. A document in MongoDB is analogous to a row in a SQL database. The purpose of the map phase is to emit key-value pairs, which means that for every document the map phase can emit numerous key-value pairs. When multiple key-value pairs share the same key, their values get merged into an array. For example, A => 9 and A => 10 will get merged to A => [9, 10].
If the above text doesn't make much sense to you, no worries, we will see all of this in action in the next part of the article.
Reducer function: The map phase is followed by the reduce phase. The reduce phase is called once for each key, with all the values emitted for that key, and outputs a single result per key.
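
As a rough sketch, a mapper and a reducer might look like the following in mongo shell JavaScript (the field names item and price here are only illustrative, not something MongoDB requires):

// Mapper: called once per document; emits one or more key-value pairs.
var mapFunction = function () {
    emit(this.item, this.price);      // key = item name, value = price
};

// Reducer: called once per key with the array of values emitted for that key.
var reduceFunction = function (key, values) {
    return Array.sum(values);         // combine the values into a single result
};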


Map-Reduce Behavior

From an implementation perspective, most Map/Reduce frameworks operate on tuples. The map implementation accepts some set of data and transforms it into another set of data, typically tuples (key/value pairs). The reduce implementation then accepts the output of the map implementation as its input and combines (reduces) those tuples into a smaller (aggregated) set of tuples, which eventually becomes the final result.
A Mapper starts by reading a collection of data, building a map containing only the fields we wish to process, and grouping the values into one array per key. Each of these key-value pairs is then fed into a Reducer, which processes the values.
For example, let's say that we have the following data:


{ item: "drink", price: 9 },
{ item: "drink", price: 12 },
{ item: "tea", price: 8 },
{ item: "oil", price: 3 },
{ item: "oil", price: 5 }


We want to total the prices of all the items with the same name. We will run this data through a Mapper and then a Reducer to achieve this result.
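
To try this yourself, you could load these documents into a sample collection; the collection name items below is only an assumption made for this example:

// Load the sample documents into a collection named "items" (illustrative name)
db.items.insertMany([
    { item: "drink", price: 9 },
    { item: "drink", price: 12 },
    { item: "tea", price: 8 },
    { item: "oil", price: 3 },
    { item: "oil", price: 5 }
])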

When we ask a Mapper to process the above data without any conditions, it will generate the following result:

Key     Values
drink   [9, 12]
tea     [8]
oil     [3, 5]

That is, it has grouped together all the values that share the same key, in our case the item name. These results are then sent to the Reducer.
In the reducer, we take the first row from the above table, iterate through all of its values and add them up; this gives the sum for the first row. The reducer then receives the second row and does the same thing, until all the rows have been processed.
The final output would be:

Name    Total
drink   21
tea     8
oil     8
So now you can understand why a Mapper is called a Mapper (because it creates a map of the data) and why a Reducer is called a Reducer (because it reduces the data that the Mapper has generated to a more simplified form).
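
Putting the whole walkthrough together, a complete call could look like the following rough sketch, which assumes the sample items collection shown earlier and writes the results to a collection named item_totals (an illustrative name):

db.items.mapReduce(
    function () {
        emit(this.item, this.price);          // map: key = item name, value = price
    },
    function (key, values) {
        return Array.sum(values);             // reduce: sum all prices emitted for a key
    },
    { out: "item_totals" }                    // store the results in the "item_totals" collection
)

// Read back the aggregated results (document order may vary)
db.item_totals.find()
// { "_id" : "drink", "value" : 21 }
// { "_id" : "oil", "value" : 8 }
// { "_id" : "tea", "value" : 8 }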
The db.collection.mapReduce() method is used to perform map-reduce style data aggregation.


db.collection.mapReduce(
    <map>,
    <reduce>,
    {
        out: <collection>,
        query: <document>,
        sort: <document>,
        limit: <number>,
        finalize: <function>,
        scope: <document>,
        jsMode: <boolean>,
        verbose: <boolean>
    }
)

Here,

out specifies the location of the result of the map-reduce operation.

query specifies the selection criteria, using query operators, for determining the documents input to the map function.

sort sorts the input documents. This option is useful for optimization; the sort key must be in an existing index for this collection.

limit specifies the maximum number of documents to input into the map function.

finalize follows the reduce phase and further modifies the output.

scope specifies global variables that are accessible in the map, reduce and finalize functions.

jsMode specifies whether to convert intermediate data into BSON format between the execution of the map and reduce functions.

verbose specifies whether to include the timing information in the result information.
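
As an illustration of how a few of these options fit together, the sketch below reuses the sample items collection from above and filters the input with query before mapping; the option values and the output collection name are only examples:

db.items.mapReduce(
    function () { emit(this.item, this.price); },              // map
    function (key, values) { return Array.sum(values); },      // reduce
    {
        query: { price: { $gt: 4 } },        // only documents with price > 4 are passed to the map function
        out: "item_totals_filtered",         // write the results to this collection (illustrative name)
        finalize: function (key, reducedValue) {
            return { total: reducedValue };  // reshape each reduced value after the reduce phase
        },
        verbose: true                        // include timing information in the result
    }
)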

When dealing with millions or billions of records, the benefit of using Map-Reduce is that both functions can be distributed, which means the code can be executed by multiple CPUs across thousands of servers. Furthermore, complex queries are cleaner and easier to write when using MapReduce operations in MongoDB.

