This document provides a number of practical examples that display the capabilities of the aggregation framework.
The Aggregations using the Zip Codes Data Set examples uses a publicly available data set of all zipcodes and populations in the United States. These data are available at: zips.json.
- MongoDB, version 2.6.4 or later.
Use the following command to load zips.json data set into your mongod instance:
mongoimport --drop -d test -c zipcodes zips.jsonEach document in this collection has the following form:
{
"_id" : "35004",
"city" : "Acmar",
"state" : "AL",
"pop" : 6055,
"loc" : [-86.51557, 33.584132]
}In these documents:
- The
_idfield holds the zipcode as a string. - The
cityfield holds the city name. - The
statefield holds the two letter state abbreviation. - The
popfield holds the population. - The
locfield holds the location as a[latitude, longitude]array.
To get all states with a population greater than 10 million, use the following aggregation pipeline:
coll = db.zipcodes
coll.aggregate(
{ $group: {_id: "$state", totalPop: {$sum: "$pop"}} },
{ $match: {totalPop: {$gte: 10000000}} }
)The result:
{
"result": [
{ "_id": "PA", "totalPop": 11881643 },
{ "_id": "OH", "totalPop": 10847115 },
{ "_id": "NY", "totalPop": 17990455 },
{ "_id": "TX", "totalPop": 16986510 },
{ "_id": "FL", "totalPop": 12937926 },
{ "_id": "IL", "totalPop": 11430602 },
{ "_id": "CA", "totalPop": 29760021 }
],
"ok": 1
}The above aggregation pipeline is build from two pipeline operators:
$group and $match.
The $group pipeline operator requires _id field where we specify
grouping; remaining fields specify how to generate composite value and
must use one of
the group aggregation functions:
$addToSet, $first, $last, $max, $min, $avg, $push, $sum.
The $match pipeline operator syntax is the same as
the read operation
query syntax.
The $group process reads all documents and for each state it
creates a separate document, for example:
{
"_id": "CA",
"totalPop": 29760021
}The totalPop field uses the $sum aggregation function
to sum the values of all pop fields in the source documents.
Documents created by $group are piped to the $match pipeline
operator. It returns the documents with the value of totalPop field
greater than or equal to 10 million.
To get the first three states with the greatest average population per city, use the following aggregation:
coll.aggregate(
{ $group: {_id: {state: "$state", city: "$city"}, pop: {$sum: "$pop"}} },
{ $group: {_id: "$_id.state", avgCityPop: {$avg: "$pop"} } },
{ $sort: {avgCityPop: -1} },
{ $limit: 3 }
)This aggregate pipeline produces:
{
"result": [
{ "_id": "DC", "avgCityPop": 303450 },
{ "_id": "CA", "avgCityPop": 27581.113067655235 },
{ "_id": "FL", "avgCityPop": 26676.136082474226 }
],
"ok": 1
}The above aggregation pipeline is build from three pipeline operators:
$group, $sort and $limit.
The first $group operator creates the following documents:
{
"_id": {
"state": "AZ",
"city": "GOODYEAR"
},
"pop": 5819
}Note, that the $group operator can’t use nested documents
except the _id field.
The second $group uses these documents to create the following
documents:
{
"_id": "WY",
"avgCityPop": 3359.911111111111
}These documents are sorted by the avgCityPop field in descending order.
Finally, the $limit pipeline operator returns the first 3 documents
from the sorted set.
To get the smallest and largest cities by population for each state, use the following aggregate pipeline:
coll.aggregate(
{ $group: {_id: {state: "$state", city: "$city"}, pop: {$sum: "$pop"}} },
{ $sort: {pop: 1} },
{
$group: {
_id: "$_id.state",
smallestCity: {$first: "$_id.city"},
smallestPop: {$first: "$pop"},
biggestCity: {$last: "$_id.city"},
biggestPop: {$last: "$pop"}
}
}
)The first $group operator creates a new document for every
combination of the state and city fields from the source
documents. Each document created at this stage has the field pop
which is set to the value computed by the $sum operator.
It sums the values of the pop field in the grouped documents.
The sample document created at this stage looks like:
{
"_id": {
"state": "AZ",
"city": "GOODYEAR"
},
"pop": 5819
}Note: To preserve the values of the state and city fields
for later use in the pipeline we specify the value of _id
as a nested document which contains both values.
The second $group operator groups the documents by the value of
_id.state.
The sorting order is preserved within grouped documents.
So, $first operators return name of the city with
the smallest population and the city population.
The $last operators return the city name with the
biggest population and the city population.
The sample document created at this stage looks like:
{
"_id": "CA",
"smallestCity": "OREGON HOUSE",
"smallestPop": 0,
"biggestCity": "LOS ANGELES",
"biggestPop": 2102295
}To run the examples below you need this data set: name_days.json.
Use mongoimport to import this data set into MongoDB:
mongoimport --drop --db test --collection cal name_days.jsonor use the cal collection from here:
mongo --username student --password sesja2013 153.19.1.202/testThe collection cal should contain 364 documents in the following format:
{
"_id": ObjectId("51643484c20a89f0145ac8e8"),
"names": [
"Mieszka", "Mieczysława", "Marii" ],
"date": {
"day": 1, "month": 1
}
}The following aggregation pipeline computes this:
coll = db.cal // switch collection
coll.aggregate(
{ $project: {names: 1, _id: 0} },
{ $unwind: "$names" },
{ $group: {_id: "$names", count: {$sum: 1}} },
{ $sort: {count: -1} },
{ $limit: 6 }
)The sample document created by the $project pipeline operator
looks like:
{ "names": [ "Sylwestra", "Melanii", "Mariusza" ] }The $unwind operator creates one document for every member of
names array. For example, the above document is unwinded into three
documents:
{ "names": "Sylwestra" }
{ "names": "Melanii" }
{ "names": "Mariusza" }These documents are grouped by the names field and the documents
in each group are counted by the $sum operator.
The sample document created at this stage looks like:
{ "_id": "Julii", "count": 3 }Finally, the $sort operator sorts these documents by the
count field in descending order, and the $limit operator
outputs the first 6 documents:
{
"result": [
{ "_id": "Jana", "count": 21 },
{ "_id": "Marii", "count": 16 },
{ "_id": "Grzegorza", "count": 9 },
{ "_id": "Piotra", "count": 9 },
{ "_id": "Feliksa", "count": 8 },
{ "_id": "Leona", "count": 8 }
],
"ok": 1
}We want to pivot the name_days.json data set. Precisely, we want to convert documents from this format:
{
"date": { "day": 1, "month": 1 }
"names": [ "Mieszka", "Mieczysława", "Marii" ],
}into this format:
{
"name": "Mateusza",
"dates": [{"day": 13, "month": 11}, {"day": 21, "month": 9}]
}The following aggregation pipeline does the trick:
coll.aggregate(
{ $project: {_id: 0, date: 1, names: 1} },
{ $unwind: "$names" },
{ $group: {_id: "$names", dates: {$addToSet: "$date"}} },
{ $project: {name: "$_id", dates: 1, _id: 0} },
{ $sort: {name: 1} }
)The sample document created by the unwinding stage looks like:
{
"names": "Eugeniusza",
"date": { "day": 30, "month": 12 }
}The $group pipeline operator groups these documents by the names
field. The $addToSet operator returns an array of all unique values
of the date field found in the set of grouped documents.
The sample document created at this stage of the pipeline looks like:
{
"_id": "Maksymiliana",
"dates": [{"day": 12, "month": 10}, {"day": 14, "month": 8}]
}In the last two stages we sort and reshape these documents to the requested format:
{
"name": "Zuzanny",
"dates": [{"day": 11, "month": 8}, { "day": 24, "month": 5}]
}1. For the zipcodes collection, the aggregation below computes
248_706_415. What does this number mean?
coll.aggregate({ $group: {_id: 0, sum: {$sum: "$pop"}} })
// {
// "result": [
// {
// "_id": 0,
// "sum": 248706415
// }
// ],
// "ok": 1
// }2. How many different names are in the cal collection?