Removing Duplicates from MongoDB

IMPORTANT:  The dropDups option was removed starting with MongoDB 3.x, so this solution is only valid for MongoDB versions 2.x and before.  There is no direct replacement for the dropDups option. The answers to the question posed at http://stackoverflow.com/questions/30187688/mongo-3-duplicates-on-unique-index-dropdups offer some possible alternative ways to remove duplicates in Mongo 3.x.

Duplicate records can be removed from a MongoDB collection by creating a unique index on the collection and specifying the dropDups option.

Assuming the collection includes a field named record_id that uniquely identifies a record in the collection, the command to use to create a unique index and drop duplicates is:

db.collection.ensureIndex( { record_id:1 }, { unique:true, dropDups:true } )

Here is the trace of a session that shows the contents of a collection before and after creating a unique index with dropDups.  Notice that duplicate records are no longer present after the index is created.

> db.pages.find()
{ “_id” : ObjectId(“52829c886602e2c8428d1d8c”), “leaf_num” : “1”, “scan_id” : “smithsoniancont251985smit”, “height” : 3464, “width” : 2548 }
{ “_id” : ObjectId(“52829c886602e2c8428d1d8d”), “leaf_num” : “1”, “scan_id” : “smithsoniancont251985smit”, “height” : 3464, “width” : 2548 }
{ “_id” : ObjectId(“52829c886602e2c8428d1d8e”), “leaf_num” : “2”, “scan_id” : “smithsoniancont251985smit”, “height” : 3587, “width” : 2503 }
{ “_id” : ObjectId(“52829c886602e2c8428d1d8f”), “leaf_num” : “2”, “scan_id” : “smithsoniancont251985smit”, “height” : 3587, “width” : 2503 }
>
> db.pages.ensureIndex( { scan_id:1, leaf_num:1 }, { unique:true, dropDups:true } )
>
> db.pages.find()
{ “_id” : ObjectId(“52829c886602e2c8428d1d8c”), “leaf_num” : “1”, “scan_id” : “smithsoniancont251985smit”, “height” : 3464, “width” : 2548 }
{ “_id” : ObjectId(“52829c886602e2c8428d1d8e”), “leaf_num” : “2”, “scan_id” : “smithsoniancont251985smit”, “height” : 3587, “width” : 2503 }
>

More information about creating MongoDB indexes, including the use of the dropDups option, can be found here.

Advertisements

5 Responses to Removing Duplicates from MongoDB

  1. Liam Bowen says:

    The dropDups option was removed in 3.x, so this solution no longer works.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: