...
The problem is that $limit doesn't reduce the number of documents loaded into memory for examination in the aggregation pipeline. For example, consider this profiler entry:

> db.system.profile.find().limit(1).sort( { ts : -1 } ).pretty()
{
    "op" : "command",
    "ns" : "datasetTreeDB.datasetMappingAnnotation",
    "command" : {
        "aggregate" : "datasetMappingAnnotation",
        "pipeline" : [
            { "$match" : { "rootNode" : "markingtool", "imageCreatedDate" : { "$gt" : ISODate("2016-03-27T00:00:00Z") }, "imageMeta__league" : "Major League Baseball" } },
            { "$limit" : 100000 },
            { "$sort" : { "randomIndex" : 1 } },
            { "$limit" : 1 },
            { "$project" : { "_id" : 0, "fileName" : 1, "rootNode" : 1 } },
            { "$count" : "count" }
        ],
        "allowDiskUse" : true,
        "cursor" : { }
    },
    "keysExamined" : 100000,
    "docsExamined" : 100000,
    "hasSortStage" : true,
    "cursorExhausted" : true,
    "numYield" : 816,
    "locks" : {
        "Global" : { "acquireCount" : { "r" : NumberLong(1654) } },
        "Database" : { "acquireCount" : { "r" : NumberLong(827) } },
        "Collection" : { "acquireCount" : { "r" : NumberLong(826) } }
    },
    "nreturned" : 1,
    "responseLength" : 130,
    "protocol" : "op_command",
    "millis" : 1346,
    "planSummary" : "IXSCAN { rootNode: 1, imageMeta__league: 1, imageCreatedDate: 1, randomIndex: 1, imageId: 1, secondLevelNode: 1 }",
    "ts" : ISODate("2018-06-28T17:05:57.702Z"),
    "client" : "127.0.0.1",
    "appName" : "MongoDB Shell",
    "allUsers" : [ ],
    "user" : ""
}

I want to get only one result from the aggregation and include in that result a field that does not exist in the index. The expected behaviour is that it scans 100000 keys, sorts in memory using values taken from the index (which makes sense, since after the range match on "imageCreatedDate" the index order can no longer be used for sorting), and then fetches and tests only the first document. However, it fetched all 100000 documents ("docsExamined" : 100000), which completely kills the performance of this operation.

If I do not project the field that is missing from the index, I can see that it does not fetch any documents at all and returns the result from the index alone:

> db.system.profile.find().limit(1).sort( { ts : -1 } ).pretty()
{
    "op" : "command",
    "ns" : "datasetTreeDB.datasetMappingAnnotation",
    "command" : {
        "aggregate" : "datasetMappingAnnotation",
        "pipeline" : [
            { "$match" : { "rootNode" : "markingtool", "imageCreatedDate" : { "$gt" : ISODate("2016-03-27T00:00:00Z") }, "imageMeta__league" : "Major League Baseball" } },
            { "$limit" : 100000 },
            { "$sort" : { "randomIndex" : 1 } },
            { "$limit" : 1 },
            { "$project" : { "_id" : 0, "rootNode" : 1 } },
            { "$count" : "count" }
        ],
        "allowDiskUse" : true,
        "cursor" : { }
    },
    "keysExamined" : 100000,
    "docsExamined" : 0,
    "hasSortStage" : true,
    "cursorExhausted" : true,
    "numYield" : 785,
    "locks" : {
        "Global" : { "acquireCount" : { "r" : NumberLong(1586) } },
        "Database" : { "acquireCount" : { "r" : NumberLong(793) } },
        "Collection" : { "acquireCount" : { "r" : NumberLong(792) } }
    },
    "nreturned" : 1,
    "responseLength" : 130,
    "protocol" : "op_command",
    "millis" : 182,
    "planSummary" : "IXSCAN { rootNode: 1, imageMeta__league: 1, imageCreatedDate: 1, randomIndex: 1, imageId: 1, secondLevelNode: 1 }",
    "ts" : ISODate("2018-06-28T17:16:10.787Z"),
    "client" : "127.0.0.1",
    "appName" : "MongoDB Shell",
    "allUsers" : [ ],
    "user" : ""
}

These are synthetic examples; in the real environment I have ~100,000,000 items in the collection, and my goal is to use the index at the "$match" step, then "$sort" (as in this example), and then return only the first 100,000 items ($limit of 100,000). However, if the "$match" step returns 5,000,000 items, it fetches and examines all 5,000,000 documents and only then returns the top 100,000. Is there a way to fix this behaviour? Thank you!
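For context, here is a minimal sketch of the kind of setup that reproduces the behaviour described above. It is a hedged reconstruction, not the attached simple_example.js itself: the collection name testCollection and the field names a, b, c, d are taken from the simplified example discussed in the comments below, and the inserted data is purely illustrative.

// Reproduction sketch (mongo shell). Assumptions: collection testCollection,
// fields a/b/c/d as in the simplified example discussed in the comments.
db.testCollection.drop();
for (var i = 0; i < 100000; i++) {
    db.testCollection.insert({ a: 1, b: i, c: Math.random(), d: "payload-" + i });
}
db.testCollection.createIndex({ a: 1, b: 1, c: 1 });

// Projecting only indexed fields: the plan can be covered by the index,
// so the profiler reports docsExamined: 0 despite the in-memory sort.
db.testCollection.aggregate([
    { "$match": { "a": 1, "b": { "$gt": 1 } } },
    { "$sort": { "c": 1 } },
    { "$limit": 1 },
    { "$project": { "_id": 0, "a": 1 } }
]);

// Projecting the non-indexed field d: every matching document is fetched
// before the sort, even though only one document is ultimately returned.
db.testCollection.aggregate([
    { "$match": { "a": 1, "b": { "$gt": 1 } } },
    { "$sort": { "c": 1 } },
    { "$limit": 1 },
    { "$project": { "_id": 0, "a": 1, "d": 1 } }
]);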
nick.brewer commented on Mon, 20 Aug 2018 22:27:09 +0000:
ASemen The query subsystem returning the cursor to aggregation does not know that aggregation is going to do $sort and $limit:1 on it. With an appropriate index, the entire query (match, sort and limit) is pushed down to the query subsystem - as such, my suggestion here would still be to use a different index.

From your responses, I do not see anything to indicate a bug in the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. -Nick

asemen commented on Tue, 3 Jul 2018 22:41:57 +0000:
Hi Nick I understand that this index is not optimized for sorting, as it has a "non-equality" condition (b) before the sort field (c). But the problem is that in my real database I have a lot of entities in this collection and "b" is something like "created date", so I decided it would be more important to have an efficient search on "b" and then sort the data by "c" using an in-memory sort.

I realize that it can't simply use the sort order of the "c" index and has to sort in memory, but how does it work when I don't project the "d" field? I can see that in that case it can sort using index data alone ("docsExamined" : 0 in the profiler). Why does it have to fetch all entities when a non-indexed field is projected? Why not do everything as in the first flow (without "d" projected) and then fetch only the N requested documents? Thank you, Alexey

nick.brewer commented on Tue, 3 Jul 2018 18:48:41 +0000:
Hi ASemen Going off of your example: when you match documents using a range, such as:

"$match": { "a": 1, "b": { "$gt": 1 } }

the information that is returned cannot be meaningfully sorted via:

"$sort": { "c": 1 }

using the compound index you've created:

db.testCollection.ensureIndex({"a": 1, "b": 1, "c": 1})

So when you go to project the unindexed field, it looks at all of the documents that meet the criteria of your initial $match. You could rearrange the order of your compound index so that the sort is processed before the range:

db.testCollection.ensureIndex({"a": 1, "c": 1, "b": 1})

Regards, Nick

asemen commented on Mon, 2 Jul 2018 23:41:46 +0000:
Hi Nick Thank you for your help! Unfortunately removing the "$limit" : 100000 doesn't help. In fact I intentionally added this stage to reduce the amount of data to be fetched so I could test faster - the requests in my examples are real aggregation pipelines that I use in my application, and I have around 100,000,000 entries in my collection.

But I guess we can make things easier - I created a small test where everything is super simple and it perfectly reproduces this issue - please check the attached "simple_example.js" file. Please let me know if any clarification is required. Thank you, Alexey

nick.brewer commented on Mon, 2 Jul 2018 14:20:24 +0000:
Hi ASemen I believe you could fix this behavior by removing the "$limit" : 100000 stage and keeping only the "$limit" : 1 step in your pipeline. However, it's difficult to get a sense of what you're trying to accomplish from the sanitized outputs included here - could you include an example of an actual aggregation pipeline you're attempting to use? You can use our secure upload portal if you'd prefer; information provided there is only available to MongoDB employees, and is automatically removed after a period of time. Thanks, Nick

asemen commented on Thu, 28 Jun 2018 21:00:48 +0000:
Hi Nick, Yes, sure. I've attached 2 json files.
As I can see, in the "slow" one it tries to FETCH data after the IXSCAN, while the "fast" one performs a PROJECTION instead. I guess that makes sense, but I'm wondering why it tries to FETCH all documents in the "slow" one... Thank you, Alexey

nick.brewer commented on Thu, 28 Jun 2018 19:55:51 +0000:
Hi ASemen, Could you please attach the .explain() output for both queries to this ticket? Thanks, Nick
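To illustrate Nick's suggested fix, here is a hedged sketch under the same assumed testCollection/a/b/c/d schema as above (the exact index and pipeline are illustrations of the advice, not code from the ticket). With the sort field (c) placed before the range field (b) in the compound index, the match, sort, and limit can all be pushed down to the query subsystem, so only the limited number of documents is fetched.

// Suggested index: sort field before the range field.
db.testCollection.createIndex({ a: 1, c: 1, b: 1 });

// With this index the scan proceeds in c-order for the a-equality prefix,
// filtering each key on the b range, and stops after the first match,
// so there is no blocking in-memory SORT and no bulk FETCH.
db.testCollection.aggregate([
    { "$match": { "a": 1, "b": { "$gt": 1 } } },
    { "$sort": { "c": 1 } },
    { "$limit": 1 },
    { "$project": { "_id": 0, "a": 1, "d": 1 } }
]);

// The equivalent find() can be checked with explain(); a pushed-down plan
// shows no SORT stage in the winning plan.
db.testCollection.find({ a: 1, b: { $gt: 1 } }).sort({ c: 1 }).limit(1).explain("executionStats");

The trade-off Alexey raises still applies: this ordering favours the sort, so range queries on b alone become less selective against the index, which is exactly the tension the thread leaves unresolved.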