Java Login


Hello guys, is for Java and J2EE developers, all examples are simple and easy to understand 

It is developed and maintained by Vaibhav Sharma. The views expressed on this website are his own and do not necessarily reflect the views of his former, current or future employers. I am professional Web development. I work for an IT company as Senior Consultant. Primary I write about spring, hibernate and web-services. I am trying to present here new technologies.

     << Previous
Next >>     

MongoDB Sharding

Sharding is the process of horizontal scaling where data set is divided among multiple servers or shards. Each shard would be an independent set, but all the shards together would form a complete logical collection defined for the application. When data size increases, a single machine may not be sufficient to store the data for read and write operations. With Sharding you add more machines to support data growth and the demands of read and write operations. Sharding decreases the amount of data that each shard has to handle, thus increasing speed and performance. In MongoDB, sharding involves three major components:

  • Shards - Shards are used to store data. They provide high availability and data consistency. Each shards contains separate replica set in production environment.
  • Configuration servers - Config servers store the cluster's metadata. This data contains a mapping of the cluster's data set to the shards.
  • Query routers - Query routers are basically mongo instances, interface with client applications and direct operations to the appropriate shard. The query router processes and targets the operations to shards and then returns results to the clients. A sharded cluster can contain more than one query router to divide the client request load. A client sends requests to one query router. Generally, a sharded cluster have many query routers.

Difference between Replication and Sharding on MongoDB

A Replica-Set means that you have multiple instances of MongoDB which each mirror all the data of each other. A replica-set consists of one Master (also called "Primary") and one or more Slaves (aka Secondary). Read-operations can be served by any slave, so you can increase read-performance by adding more slaves to the replica-set (provided that your client application is capable to actually use different set-members). But write-operations always take place on the master of the replica-set and are then propagated to the slaves, so writes won't get faster when you add more slaves.

A Sharded Cluster means that each shard of the cluster (which can also be a replica-set) takes care of a part of the data. Each request, both reads and writes, is served by the cluster where the data resides. This means that both read- and write performance can be increased by adding more shards to a cluster. Which document resides on which shard is determined by the shard key of each collection. It should be chosen in a way that the data can be evenly distributed on all clusters and so that it is clear for the most common queries where the shard-key resides (example: when you frequently query by user_name, your shard-key should include the field user_name so each query can be delegated to only the one shard which has that document).

Why Sharding?

  • Local storage is not sufficient for data store
  • Vertical scaling is too costly
  • Single replica set has limitation of 12 nodes only
  • In replication, all writes go to master node
  • Memory can't be large enough when active data set is big

Shard Keys

One of the fundamental steps in sharding a collection is to select a shard key. There are some important points to be considered while selecting the shard key.

  • A good shard key has
    • sufficient cardinality; low cardinality can be for boolean fields
    • distributed writes
    • targeted reads
  • Monotonically increasing keys should not be used as they create 'hot spots' in a period of time. For example, timestamps or _id fields. (A hot spot is simply an active portion of the table which is accessed more frequently than the rest of the table)
  • Shard key has to be included in every query.
    • In case shard key cannot be included, then scatter-gather has to be done, meaning the whole collection has to be sent and retrieved.
  • A good shard key enhances performance and scalability.

Consider an application that maintains details of all employees and projects in an IT firm. We would have employee and project collections. Each employee shall have records for current projects and past projects, with different roles in each and the data may exceed 16MB. Hence, we need to maintain different collections for employee_projects. A single employee_projects collection would have an employee_id, particular project with roles and responsibilities in that project. One employee may have dozens or even hundreds of documents in the employee_projects collection.
Consider sharding the employee_projects collection. What would be the best shard key for the employee_projects collection, provided we are willing to run scatter-gather operations to do research and run studies on various projects? (Think mostly about the operational aspects of such a system.)

Options for shard key include:

  • _id
  • employee_id
  • project_name
  • role
  • project_allocation_date

The most favourable option is employee_id, as we can break up the employee_projects collection to n number of shards and still perform actions smoothly based on employee_id.

Sharding Types

Sharding types are categorized as:

  • Range
  • Tag-aware
  • Hash

To start with, we use the range shard, where a range of documents based on shard key will be distributed to different shards. For example: 0-10000, 10000-20000, 20000-30000, and so on. This provides read/write scalability.

Tag-aware sharding includes shard tags like winter, spring, summer, or fall, with each having start and end dates defined.

Hash-sharding uses hashed index of a single field as shard key like, 0000-4444, 4445-8000, 8001-aaaa, aaab-ffff.

Lets create all of the above components in a single instance i.e on your localhost.

Creating Mongodb Config Server

$ mkdir dataconfigdb
$ mongod --configsvr --port 27010

  • First creates a data directory to store the cluster metadata
  • Second launches the config server deamon on port 27010. The default port 27019, but I have overriden by using the --port command line option.

Setting up Query Routers (mongos instances)
This is the routing service for the sharding cluster where by it takes queries from the application and gets the data from the specific shards. Query routers can be setup by using the mongos command as shown below:

$ mongos -configdb localhost:27010 --port 27011

2015-02-01T18:51:35.606+0300 warning: running with 1 config server should be done only for testing purposes and is not recommended for production

It is recommended to run with 3 configdb server for production so as to avoid a single point of failure. But for our testing, 1 configdb server should be fine.
--configdb command line option is used to let the Query router know about the config servers we have setup. It takes a comma separated : values like -configdb host1:port1,host2:port2. In our case we have only 1 config server.

Running mongodb shards Now we need to run the actual mongodb instances which store the shared data. We will created 3 sharded instances of mongodb and run all of these on localhost on different ports and provide each mongodb instance its own --dbpath as shown below:

Mongodb Shard - 1

$ mongod --port 27012 --dbpath datadb

Mongodb Shard - 2

$ mongod --port 27013 --dbpath datadb2

Mongodb Shard - 3

$ mongod --port 27014 --dbpath datadb3

Now we have three shards of mongodb running on localhost. For the database I will be using the students database having collection grades. The structure of the documents in grades is given below:

"_id" : ObjectId("50906d7fa3c412bb040eb577"),
"student_id" : 0,
"type" : "exam",
"score" : 54.6535436362647

You can choose any database of your choice.

Registering the shards with mongos
Now that we have created our two mongodb shards running at localhost:27012 and localhost:27013 respectively, we will go ahead and register these shards with our mongos query router, also define which database we need to shard and then enable sharding on the collection we are interested by providing the shard key. All these have to be carried out by connecting to the mongos query router as shown in the below commands:

$ mongo --port 27011 --host localhost
mongos> sh.addShard("localhost:27012")
{ "shardAdded" : "shard0000", "ok" : 1 }

mongos> sh.addShard("localhost:27013")
{ "shardAdded" : "shard0001", "ok" : 1 }

mongos> sh.addShard("localhost:27014")
{ "shardAdded" : "shard0001", "ok" : 1 }

mongos> sh.enableSharding("students")
{ "ok" : 1 }

mongos> sh.shardCollection("students.grades", {"student_id" : 1})

{ "collectionsharded" : "students.grades", "ok" : 1 }


In the sh.shardCollection we specify the collection and the field from the collection which is to be used as a shard key.

Adding data to the mongodb sharded cluster

Lets connect to mongos and run some code to populate data to the grades collection in students database.

for ( i = 1; i < 10000; i++ ) {
db.grades.insert({student_id: i, type: "exam", score : Math.random() * 100 });
db.grades.insert({student_id: i, type: "quiz", score : Math.random() * 100 });
db.grades.insert({student_id: i, type: "homework", score : Math.random() * 100 });

WriteResult({ "nInserted" : 1 })

After inserting the data we would notice some activity in the mongos daemon stating that it is moving some chunks for specific shard and so on i.e the balancer will be in action trying to balance the data across the shards. The output will be something like:

2015-02-02T18:26:26.770+0300 [Balancer] moving chunk ns: students.grades moving
( ns: students.grades, shard: shard0000:localhost:27012, lastmod: 1|1||000000000000000000000000, min: { student_id: MinKey }, max: { student_id: 200.0 })
shard0000:localhost:27012 -> shard0001:localhost:27013

2015-02-02T18:31:12.314+0300 [Balancer] moving chunk ns: students.grades moving
( ns: students.grades, shard: shard0000:localhost:27012, lastmod: 2|2||000000000000000000000000, min: { student_id: 200.0 }, max: { student_id: 2096.0 })
shard0000:localhost:27012 -> shard0002:localhost:27014

Lets look at the status of the shards by connecting to the mongos. It can be achieved by using the sh.status() command.

$ mongo --port 27011 --host localhost
mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 4,
"minCompatibleVersion" : 4,
"currentVersion" : 5,
"clusterId" : ObjectId("54cf95d9d9309193f5fa0780")
{ "_id" : "shard0000", "host" : "localhost:27012" }
{ "_id" : "shard0001", "host" : "localhost:27013" }
{ "_id" : "shard0002", "host" : "localhost:27014" }
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "blog", "partitioned" : false, "primary" : "shard0000" }
{ "_id" : "course", "partitioned" : false, "primary" : "shard0000" }
{ "_id" : "m101", "partitioned" : false, "primary" : "shard0000" }
{ "_id" : "school", "partitioned" : false, "primary" : "shard0000" }
{ "_id" : "students", "partitioned" : true, "primary" : "shard0000" }
shard key: { "student_id" : 1 }
shard0001 1
shard0002 1
shard0000 1
{ "student_id" : { "$minKey" : 1 } } -->> { "student_id" : 200 } on : shard0001 Timestamp(2, 0)

{ "student_id" : 200 } -->> { "student_id" : 2096 } on : shard0002 Timestamp(3, 0)

{ "student_id" : 2096 } -->> { "student_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(3, 1)

{ "_id" : "task-db", "partitioned" : false, "primary" : "shard0000" }

{ "_id" : "test", "partitioned" : false, "primary" : "shard0000" }

The above output shows that the database students is sharded and the sharded collection is the grades collection. It also shows the different shards available and the range of shard keys distributed across different shards. So on shard0001 we have student_id from minimum to 200, then on shard0002 we have student_id from 200 upto 2096 and the rest in shard0000.

We can also connect to individual shards and query to find out the max and minimum student ids available.

On shard0000

$ mongo --host localhost --port 27012
MongoDB shell version: 2.6.7
connecting to: localhost:27012/test
> use students
switched to db students

> db.grades.find().sort({student_id : 1}).limit(1)
{ "_id" : ObjectId("54cf97295a23cc67efa848c8"), "student_id" : 2096, "type" : "exam", "score" : 6.7372970981523395 }

> db.grades.find().sort({student_id : -1}).limit(1)
{ "_id" : ObjectId("54cf973b5a23cc67efa8a567"), "student_id" : 9999, "type" : "homework", "score" : 60.64519872888923 }

On shard0001

C:UsersMohamed>mongo --host localhost --port 27013
MongoDB shell version: 2.6.7
connecting to: localhost:27013/test
> use students
switched to db students

> db.grades.find().sort({student_id:1}).limit(1).pretty()
"_id" : ObjectId("54cf97d05a23cc67efa8a568"),
"student_id" : 1,
"type" : "exam",
"score" : 5.511052813380957

> db.grades.find().sort({student_id:-1}).limit(1).pretty()
"_id" : ObjectId("54cf97d15a23cc67efa8a7bc"),
"student_id" : 199,
"type" : "homework",
"score" : 51.78457708097994

On shard0002

$ mongo --host localhost --port 27014
MongoDB shell version: 2.6.7
connecting to: localhost:27014/test
> use students
switched to db students

> db.grades.find().sort({student_id:1}).limit(1).pretty()
"_id" : ObjectId("54cf971f5a23cc67efa83292"),
"student_id" : 200,
"type" : "homework",
"score" : 79.56434232182801

> db.grades.find().sort({student_id:-1}).limit(1).pretty()
"_id" : ObjectId("54cf97295a23cc67efa848c7"),
"student_id" : 2095,
"type" : "homework",
"score" : 62.75710032787174

Lets execute the same set of queries on the mongos query router and see that the results this time will include data from all the shards and not just individual shard.

$ mongo --port 27011 --host localhost
MongoDB shell version: 2.6.7
connecting to: localhost:27011/test
mongos> use students
switched to db students

mongos> db.grades.find().sort({student_id:-1}).limit(1).pretty()
"_id" : ObjectId("54cf973b5a23cc67efa8a567"),
"student_id" : 9999,
"type" : "homework",
"score" : 60.64519872888923

mongos> db.grades.find().sort({student_id:1}).limit(1).pretty()
"_id" : ObjectId("54cf97d05a23cc67efa8a568"),
"student_id" : 1,
"type" : "exam",
"score" : 5.511052813380957

So this brings to the end of setting up sharded mongodb cluster on localhost. Hope it was informative and useful!

     << Previous
Next >>