What is MongoDB Pre-Splitting?


In the posts Two steps to shard a MongoDB collection and What is a MongoDB chunk? we explained how to shard a collection, what a chunk is, which chunks are moved when the system is unbalanced, and which shard they are moved to.

Today we are going to use the pre-splitting technique, which lets us create the chunks and distribute them evenly across the shards before we start inserting data. It also lets us decide the range of each chunk.

When is pre-splitting used?

We use pre-splitting when we need to load a large amount of data into a collection and want to spare the database the work of splitting and moving chunks (balancing rounds). Each document is inserted directly into the shard that owns the chunk covering its shard-key value.

In the next example, where we did not use pre-splitting, we can see that MongoDB has distributed the chunks of the namespace “school.students” across the three shards (s0, s1 and s2) of the cluster:

Without our manual intervention, it is MongoDB that performs the split when a chunk exceeds its maximum size (64 MB by default), and that decides which chunks must be moved and which shard will be their new home.

The collection we are going to use is the same as before, “school.students”. First of all, we drop it (along with its chunks) to start from scratch. The “drop” command removes both the data and the metadata (chunks), whereas the “remove” command would remove only the data.
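A minimal sketch of this step in the mongo shell, assuming we are connected to the mongos router of the cluster described in the post:

```javascript
// Connected to mongos; switch to the "school" database.
use school

// drop() removes both the documents and the sharding metadata (chunks).
db.students.drop()

// By contrast, remove() would delete the documents but keep the chunks:
// db.students.remove({})
```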

Now we shard the collection:
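This can be done from the mongo shell as sketched below; the database and shard-key names follow the post's example, and enableSharding is only needed if the database is not already sharded:

```javascript
// Enable sharding on the database (no-op if already enabled).
sh.enableSharding("school")

// Shard the collection on student_id (ascending range-based key).
sh.shardCollection("school.students", { student_id: 1 })

// Inspect the cluster: at this point the collection has a single chunk
// covering the whole key range, from MinKey to MaxKey.
sh.status()
```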

The “sh.status()” command tells us that MongoDB has created a single chunk. It covers the whole range of the shard key (student_id). We can also see that this chunk belongs to shard s0, which is where the “school” database resides.

Choosing ranges and creating chunks

The next step is to create our chunks by choosing their ranges (in this case each one covers 10,000 values, up to a maximum of 100,000):
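A sketch of the split commands, assuming the ranges described above; sh.splitAt() splits the chunk that contains the given key exactly at that point:

```javascript
// Create split points every 10,000 student_id values, up to 100,000.
// Each call splits the existing chunk containing that key at that key,
// leaving chunks [MinKey,10000), [10000,20000), ..., [90000,MaxKey].
for (var i = 10000; i < 100000; i += 10000) {
  sh.splitAt("school.students", { student_id: i });
}
```

Because the collection is still empty, these splits are metadata-only operations and complete almost instantly.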

This is the result of the balancing round:
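The balancer will spread the empty chunks across the shards on its own, but they can also be moved explicitly. A hedged sketch, using the shard names (s0, s1, s2) from the post's cluster:

```javascript
// Optionally move chunks by hand instead of waiting for the balancer.
// Each call names the namespace, a key inside the chunk to move,
// and the destination shard.
sh.moveChunk("school.students", { student_id: 10000 }, "s1");
sh.moveChunk("school.students", { student_id: 20000 }, "s2");
// ...and so on for the remaining ranges.

// Verify the final distribution of chunks across s0, s1 and s2.
sh.status()
```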

We have reached our goal: we created the chunks, chose their ranges, and had them distributed across the shards of the cluster before loading any data.

From now on, each insert will go directly to the correct shard according to its shard-key value.
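For example, with the ranges defined above (the name field is just an illustrative attribute, not from the post):

```javascript
// mongos routes the insert straight to the shard that owns the chunk
// covering this shard-key value. student_id 15000 falls in the chunk
// [10000, 20000), so no split or migration is triggered on insert.
db.students.insert({ student_id: 15000, name: "Alice" })
```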

In a future post we will choose the shard in which we want to store each piece of our data.

As usual, your comments are welcome. I hope I have shared a little useful knowledge with all of you.
