How
To Create a Sharded Cluster in MongoDB
Introduction
MongoDB is a
NoSQL document database system that scales well horizontally and implements
data storage through a key-value system. A popular choice for web applications
and websites, MongoDB is easy to implement and access programmatically.
MongoDB achieves scaling through a
technique known as "sharding". Sharding is the process of writing
data across different servers to distribute the read and write load and data
storage requirements.
MongoDB
Sharding Topology
Sharding is implemented through
three separate components. Each part performs a specific function:
·
Config Server: Each production
sharding implementation must contain exactly three configuration servers. This
is to ensure redundancy and high availability.
Config servers are used to store
the metadata that links requested data with the shard that contains it. It
organizes the data so that information can be retrieved reliably and
consistently.
·
Query Routers: The query routers are
the machines that your application actually connects to. These machines are
responsible for communicating to the config servers to figure out where the
requested data is stored. It then accesses and returns the data from the appropriate
shard(s).
Each query router runs the
"mongos" command.
·
Shard Servers: Shards are responsible
for the actual data storage operations. In production environments, a single
shard is usually composed of a replica set instead of a single machine. This is
to ensure that data will still be accessible in the event that a primary shard
server goes offline.
Implementing replicating sets is
outside of the scope of this tutorial, so we will configure our shards to be
single machines instead of replica sets. You can easily modify this if you
would like to configure
replica sets for your own configuration.
Initial
Set Up
If you were paying attention
above, you probably noticed that this configuration requires quite a few
machines. In this tutorial, we will configure an example sharding cluster that
contains:
·
3
Config Servers (Required in production environments)
·
2
Query Routers (Minimum of 1 necessary)
·
4
Shard Servers (Minimum of 2 necessary)
This means that you will need nine
VPS instances to follow along exactly. In reality, some of these functions can
overlap (for instance, you can run a query router on the same VPS you use as a
config server) and you only need one query router and a minimum of 2 shard
servers.
We will go above this minimum in
order to demonstrate adding multiple components of each type. We will also
treat all of these components as discrete machines for clarity and simplicity.
For the
purposes of this tutorial, we will refer to the components as being accessible
at these subdomain:
·
Config
Servers
o
config0.example.com
o
config1.example.com
o
config2.example.com
·
Query
Routers
o
query0.example.com
o
query1.example.com
·
Shard
Servers
o
shard0.example.com
o
shard1.example.com
o
shard2.example.com
o
shard3.example.com
If you do
not set up subdomains, you can still follow along, but your configuration will
not be as robust. If you wish to go this route, simply substitute the subdomain
specifications with your droplet's IP address.
Initialize
the Config Servers
The first components that must be
set up are the configuration servers. These must be online and operational
before the query routers or shards can be configured.
Log into your first configuration
server as root.
The first thing we need to do is
create a data directory, which is where the configuration server will store the
metadata that associates location and content:
mkdir /mongo-metadata
Now, we simply have to start up
the configuration server with the appropriate parameters. The service that
provides the configuration server is called
mongod.
The default port number for this component is 27019.
We can start the configuration
server with the following command:
mongod --configsvr --dbpath /mongo-metadata --port 27019
The server will start outputting
information and will begin listening for connections from other components.
Repeat this process exactly on the
other two configuration servers. The port number should be the same across all
three servers.
Configure
Query Router Instances
At this point, you should have all
three of your configuration servers running and listening for connections. They
must be operational before continuing.
Log into your first query router
as root.
The first thing we need to do is
stop the
mongodb process on this instance if it is already running. The query
routers use data locks that conflict with the main MongoDB process:service mongodb stop
Next, we need to start the query
router service with a specific configuration string. The configuration string
must be exactly the same for every query router you configure (including the
order of arguments). It is composed of the address of each configuration server
and the port number it is operating on, separated by a comma.
They query router service is
called
mongos. The default port number for this process is 27017 (but the port number in the configuration refers to the
configuration server port number, which is 27019 by
default).
The end result is that the query
router service is started with a string like this:
mongos
--configdb config0.example.com:27019,config1.example.com:27019,config2.example.com:27019
Your first query router should
begin to connect to the three configuration servers. Repeat these steps on the
other query router. Remember that the
mongodb
service must be stopped prior to typing in the command.
Also, keep in mind that the exact
same command must be used to start each query router. Failure to do so will
result in an error.
Add
Shards to the Cluster
Now that we have our configuration
servers and query routers configured, we can begin adding the actual shard
servers to our cluster. These shards will each hold a portion of the total
data.
Log into one of your shard servers
as root.
As we mentioned in the beginning,
in this guide we will only be using single machine shards instead of replica
sets. This is for the sake of brevity and simplicity of demonstration. In
production environments, a replica set is very highly recommended in order to
ensure the integrity and availability of the data. To configure
replica sets in MongoDB, follow this guide.
To actually add the shards to the
cluster, we will need to go through the query routers, which are now configured
to act as our interface with the cluster. We can do this by connecting to any
of the query routers like this:
mongo --host query0.example.com --port 27017
This will connect to the
appropriate query router and open a mongo prompt. We will add all of our shard
servers from this prompt.
To add our first shard, type:
sh.addShard( "shard0.example.com:27017" )
You can then add your remaining
shard droplets in this same interface. You do not need to log into each shard
server individually.
sh.addShard( "shard1.example.com:27017" )
sh.addShard( "shard2.example.com:27017" )
sh.addShard( "shard3.example.com:27017" )
If you are configuring a
production cluster, complete with replication sets, you have to instead specify
the replication set name and a replication set member to establish each set as
a distinct shard. The syntax would look something like this:
sh.addShard( "rep_set_name/rep_set_member:27017" )
How
to Enable Sharding for a Database Collection
MongoDB organizes information into
databases. Inside each database, data is further compartmentalized through
"collections". A collection is akin to a table in traditional
relational database models.
In this section, we will be
operating using the querying routers again. If you are not still connected to
the query router, you can access it again using the same mongo command you used
in the last section:
mongo --host config0.example.com --port 27017
Enable Sharding on the Database
Level
We will enable sharding first on
the database level. To do this, we will create a test database called
(appropriately)
test_db.
To create this database, we simply
need to change to it. It will be marked as our current database and created
dynamically when we first enter data into it:
use test_db
We can check that we are currently
using the database we just created by typing:
db
test_db
We can see all of the available
databases by typing:
show dbs
You may notice that the database that
we just created does not show up. This is because it holds no data so it is not
quite real yet.
We can enable sharding on this
database by issuing this command:
sh.enableSharding("test_db")
Again, if we enter the
show dbs command, we will not see our new database. However, if we switch to
the config database which is generated automatically, and issue a find() command, our new database will be returned:use config
db.databases.find()
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "test_db", "partitioned" : true, "primary" : "shard0003" }
Your database will show up with
the
show
dbs command when MongoDB has added some data to
the new database.
Enable Sharding on the
Collections Level
Now that our database is marked as
being available for sharding, we can enable sharding on a specific collection.
At this point, we need to decide
on a sharding strategy. Sharding works by organizing data into different
categories based on a specific field designated as the
shard key in the documents it is storing. It puts all of the documents that
have a matching shard key on the same shard.
For instance, if your database is
storing employees at a company and your shard key is based on favorite color,
MongoDB will put all of the employees with
blue in
the favorite color field on a single shard. This can lead to disproportional
storage if everybody likes a few colors.
A better choice for a shard key
would be something that's guaranteed to be more evenly distributed. For
instance, in a large company, a birthday (month and day) field would probably
be fairly evenly distributed.
In cases where you're unsure about
how things will be distributed, or there is no appropriate field, you can
create a "hashed" shard key based on an existing field. This is what
we will be doing for our data.
We can create a collection called
test_collection and hash its "id" field. Make sure we're using our
testdb database and then issue the command:use test_db
db.test_collection.ensureIndex( { _id : "hashed" } )
We can then shard the collection
by issuing this command:
sh.shardCollection("test_db.test_collection", { "_id": "hashed" } )
This will shard the collection
across all of the available shards.
Insert Test Data into the
Collection
We can see our sharding in action
by using a loop to create some objects. This loop comes
directly from the MongoDB website for generating test data.
We can insert data into the
collection using a simple loop like this:
use test_db
for (var i = 1; i <= 500; i++) db.test_collection.insert( { x : i } )
This will create 500 simple
documents ( only an ID field and an "x" field containing a number)
and distribute them among the different shards. You can see the results by
typing:
db.test_collection.find()
{ "_id" : ObjectId("529d082c488a806798cc30d3"), "x" : 6 }
{ "_id" : ObjectId("529d082c488a806798cc30d0"), "x" : 3 }
{ "_id" : ObjectId("529d082c488a806798cc30d2"), "x" : 5 }
{ "_id" : ObjectId("529d082c488a806798cc30ce"), "x" : 1 }
{ "_id" : ObjectId("529d082c488a806798cc30d6"), "x" : 9 }
{ "_id" : ObjectId("529d082c488a806798cc30d1"), "x" : 4 }
{ "_id" : ObjectId("529d082c488a806798cc30d8"), "x" : 11 }
. . .
To get more values, type:
it
{ "_id" : ObjectId("529d082c488a806798cc30cf"), "x" : 2 }
{ "_id" : ObjectId("529d082c488a806798cc30dd"), "x" : 16 }
{ "_id" : ObjectId("529d082c488a806798cc30d4"), "x" : 7 }
{ "_id" : ObjectId("529d082c488a806798cc30da"), "x" : 13 }
{ "_id" : ObjectId("529d082c488a806798cc30d5"), "x" : 8 }
{ "_id" : ObjectId("529d082c488a806798cc30de"), "x" : 17 }
{ "_id" : ObjectId("529d082c488a806798cc30db"), "x" : 14 }
{ "_id" : ObjectId("529d082c488a806798cc30e1"), "x" : 20 }
. . .
To get information about the
specific shards, you can type:
sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 3,
"minCompatibleVersion" : 3,
"currentVersion" : 4,
"clusterId" : ObjectId("529cae0691365bef9308cd75")
}
shards:
{ "_id" : "shard0000", "host" : "162.243.243.156:27017" }
{ "_id" : "shard0001", "host" : "162.243.243.155:27017" }
. . .
This will provide information
about the chunks that MongoDB distributed between the shards.
Conclusion
By the end of this guide, you should
be able to implement your own MongoDB sharding configuration. The specific
configuration of your servers and the shard key that you choose for each
collection will have a big impact on the performance of your cluster.
Choose the field or fields that have
the best distribution properties and most closely represent the logical
groupings that will be reflected in your database queries. If MongoDB only has
to go to a single shard to retrieve your data, it will return faster.