...
The error originally reported in SERVER-20385 is marked as fixed in 3.1.9, but we are still able to reproduce it against version 3.4.4.
keith.bostic commented on Tue, 11 Jun 2019 18:52:56 +0000:
manan@indeed.com, I've been looking into this ticket, and I wanted to ask you for any details you can recall. When you noticed you were having "this exact same issue" on 3.4/3.6, can you confirm there were a large number of deleted records in the underlying collection? Or were you just seeing the same error message as listed in the ticket, without knowing what the data in the underlying collection looked like? Thank you!

manan@indeed.com commented on Wed, 1 May 2019 15:07:58 +0000:
So we upgraded our MongoDB version in production to 3.6.9 and encountered this error there as well. The improvement is that the mongos was now able to return some 100K+ documents for the query before tripping on the error; the query asked for a sample size of 200K. In 3.4.14 the query would fail without returning a single document.

2019-05-01T10:01:19.556-0500 E QUERY [thread1] Error: getMore command failed: { "ok" : 0, "errmsg" : "$sample stage could not find a non-duplicate document after 100 while using a random cursor. This is likely a sporadic failure, please try again.", "code" : 28799, "codeName" : "Location28799" } :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
DBCommandCursor.prototype._runGetMoreCommand@src/mongo/shell/query.js:763:1
DBCommandCursor.prototype._hasNextUsingCommands@src/mongo/shell/query.js:791:9
DBCommandCursor.prototype.hasNext@src/mongo/shell/query.js:799:1
DBQuery.prototype.forEach@src/mongo/shell/query.js:500:12
@(shell):1:1

Worth noting that the {allowDiskUse: true} option has no effect on the query: whether I specify it or not, the results don't vary.

manan@indeed.com commented on Tue, 30 Apr 2019 19:52:30 +0000:
Hi, we are having this exact same issue on version 3.4.14. Is this being fixed in 3.6 and higher versions? What do you recommend, @Team MongoDB?
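For reference, the failure mode manan@indeed.com describes can be exercised end-to-end from a driver with a sketch along the following lines (pymongo; the database and collection names are illustrative, and this is not code from the ticket). allowDiskUse is passed straight through, matching the report that it makes no difference, and the only mitigation is the retry that the error message itself suggests:

from pymongo import MongoClient
from pymongo.errors import OperationFailure

coll = MongoClient().db.events  # illustrative collection name


def sample_with_retry(size, attempts=5):
    # $sample can fail sporadically with code 28799 when the random cursor
    # keeps landing on duplicate or deleted records; retrying is the only
    # workaround until the underlying WiredTiger behaviour is changed.
    for _ in range(attempts):
        try:
            return list(coll.aggregate([{"$sample": {"size": size}}],
                                       allowDiskUse=True))
        except OperationFailure as exc:
            if exc.code != 28799:
                raise
    raise OperationFailure("$sample kept failing with code 28799")


docs = sample_with_retry(200000)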
keith.bostic commented on Thu, 3 Aug 2017 19:18:53 +0000:
charlie.swanson:
"It was unclear why we used two different techniques."
The original reason for the two techniques was that older versions of WiredTiger could create relatively unbalanced trees, which would cause "random" records to be selected entirely from a small part of the whole tree. Stepping a cursor through the tree fixed that problem. Current versions of WiredTiger are far less likely to create unbalanced trees; we haven't seen that problem in a long time.
"I presume it does something like 'include if rand() < ...'"
The algorithm is a little more clever in that it skips entire pages of records: that can often be done without actually reading the page from disk, so we can usually avoid bringing all of the tree's leaf pages into memory. Regardless, it's a slower technique. The problem with SERVER-29446 is that it's a tiny tree with a large number of deleted records. I think we should fix that specific case by changing how WiredTiger handles leaf pages with large numbers of deleted items.

charlie.swanson commented on Thu, 3 Aug 2017 18:45:41 +0000:
While investigating, keith.bostic informed me that there are actually two ways we use WiredTiger cursors to get random samples of data: the $sample stage configures a cursor passing the "next_random" config option, while elsewhere we pass the "next_random_sample_size=" config option when building "oplog stones". It was unclear why we used two different techniques, but I talked to max.hirschhorn and got some insight.

I believe the difference between the two techniques can be roughly summarized as follows:
- "next_random" will repeatedly select a random document, independent of the previous cursor position.
- "next_random_sample_size=" will use a technique that scans the entire collection or index structure, deciding whether to include a record as it reaches it. I presume it does something like "include if rand() < ...".

I don't know if this was a very intentional decision, but it does seem to make more sense to use "next_random" for the $sample stage. If the "next_random_sample_size=" version does behave roughly how I described, then we could expect to scan roughly all the records in the average case, even if we were only asking for a very small number of documents. However, from the perspective of the oplog stones, it is much more important to get a good distribution than it is to be quick, and it's looking for enough samples that it probably makes sense to look at everything to generate a more representative sample. keith.bostic, hopefully this makes sense to you? I realize that it's not really justified right now, since in many cases the small sample size is basically ignored at this level, and we will request 4MB of documents at a minimum. However, should we fix SERVER-23408, I would expect pretty good performance for a sample of size 10.
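To make the two configurations concrete, here is a rough, standalone sketch using the WiredTiger Python bindings. It is not MongoDB code; the table name, record count, and sample size are made up, and pairing next_random_sample_size with next_random is my reading of the WiredTiger documentation rather than anything stated in this ticket:

import os
import wiredtiger

os.makedirs('WT_HOME', exist_ok=True)
conn = wiredtiger.wiredtiger_open('WT_HOME', 'create')
session = conn.open_session()
session.create('table:example', 'key_format=S,value_format=S')

# Populate a few records so the random cursors have something to land on.
cursor = session.open_cursor('table:example')
for i in range(1000):
    cursor.set_key('key%04d' % i)
    cursor.set_value('value%04d' % i)
    cursor.insert()
cursor.close()

# Technique 1 ($sample stage): each next() jumps to an independent random record.
rand = session.open_cursor('table:example', None, 'next_random=true')
if rand.next() == 0:
    print('random record:', rand.get_key())
rand.close()

# Technique 2 (oplog stones): ask for a sample of roughly N records spread
# across the whole table, which implies visiting most of the tree.
sampled = session.open_cursor('table:example', None,
                              'next_random=true,next_random_sample_size=10')
for _ in range(10):
    if sampled.next() != 0:
        break
    print('sampled record:', sampled.get_key())
sampled.close()

conn.close()

The first cursor repositions randomly on every next() call, while the second spreads its retrievals across the table, which is the trade-off (speed versus distribution quality) discussed above.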
alexander.gorrod commented on Wed, 2 Aug 2017 02:38:55 +0000:
Thanks for the explanation, keith.bostic; it describes a case we should do better with for random cursor positioning.

keith.bostic commented on Mon, 31 Jul 2017 22:14:43 +0000:
alexander.gorrod, I was part-way through figuring this one out, and charlie.swanson and I finished that investigation today. In short, the first run of the script loads 10K records into a collection, then the second script run deletes all the records from the first run and loads a different set of 10K records (that is, the RecordId is monotonically incrementing across the two inserts). What we end up with is two disk-backed pages in memory, each with some number of disk-backed deleted records, one of which also has an insert list with 10K entries, none of which are deleted. The random-cursor implementation always prefers disk-backed records to records in an insert list (probably because most tables are likely to have more records in the disk-backed section than in the insert list); in other words, if a page has any disk-backed records, we select from them. In this case, all of the disk-backed records have been deleted, so we randomly select a deleted record, realize it's deleted and we can't return it, then do a cursor-next to the next visible record. That consistently returns the first record from the insert list, because it's the next record in the tree that's visible to us, regardless of which disk-backed record we started with. Absent detecting large sets of deleted records in the WiredTiger btree random-cursor code, maybe we could configure all MongoDB sampling to use next_random_sample_size, which should at least make this better? Although I'd have to review that code to be sure; it might fail as well in some cases.

jesse commented on Mon, 31 Jul 2017 19:05:14 +0000:
I don't remember anymore why I hit this. I may have been playing with $sample in its early days for the sake of answering a community question, and I was surprised that in my simple case it failed so frequently.

charlie.swanson commented on Tue, 11 Jul 2017 16:35:49 +0000:
keith.bostic, I finally took an initial pass, and I am going to send this over to you to look at. My initial investigation suggests we're hitting some sort of edge case within a WiredTiger btree where lots of entries are deleted, causing us to seek to the same record over and over again. My evidence for this so far:
- If you change the reproduction script on line 11 to drop the collection instead of doing a delete_many({}), it no longer reproduces.
- After having some fun following the cursor in gdb, I got down into bt_curnext.c and saw a cbt->page_deleted_count of 395. I didn't catch it at the beginning to see where we started within that page, but that seemed high to me.

Let me know if there's anything I can do to aid the investigation. I got frustrated by SERVER-23408 during this, so I found it useful to do the following:

// Disable batching within DocumentSourceCursor:
db.adminCommand({setParameter: 1, internalDocumentSourceCursorBatchSizeBytes: 1});
// Log when we see a duplicate document in the sample stage:
db.setLogLevel(1);
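Since the reproduction below drives the server from pymongo rather than the shell, roughly equivalent settings can be applied from Python. This is my translation of the two shell helpers above (the logComponentVerbosity document is what I believe db.setLogLevel(1) expands to), not something taken from the ticket:

from pymongo import MongoClient

admin = MongoClient().admin

# Shrink DocumentSourceCursor batches so documents are fetched one at a time
# (same effect as the db.adminCommand() call above):
admin.command({'setParameter': 1,
               'internalDocumentSourceCursorBatchSizeBytes': 1})

# Raise the default log verbosity so duplicate-document retries in the
# $sample stage show up in the server log (what db.setLogLevel(1) does,
# as far as I know):
admin.command({'setParameter': 1,
               'logComponentVerbosity': {'verbosity': 1}})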
charlie.swanson commented on Mon, 5 Jun 2017 15:41:34 +0000:
I can reproduce this locally on Linux as well, and against my development version of the server, a bit ahead of 3.5.7.

Taken from Jesse's comment on SERVER-20385. With MongoDB 3.4.4 on Mac OS X, I can reproduce this. First do "python -m pip install pymongo pytz", then:

from datetime import datetime, timedelta

import pytz
from bson import ObjectId
from pymongo import MongoClient
from pymongo.errors import OperationFailure

CHUNKS = 20

collection = MongoClient().db.test
collection.delete_many({})

start = datetime(2000, 1, 1, tzinfo=pytz.UTC)
for hour in range(10000):
    collection.insert(
        {'_id': ObjectId.from_datetime(start + timedelta(hours=hour)),
         'x': 1})

for _ in range(10):
    try:
        docs = list(collection.aggregate([{
            "$sample": {"size": CHUNKS}
        }, {
            "$sort": {"_id": 1}
        }]))
    except OperationFailure as exc:
        if exc.code == 28799:
            # Work around https://jira.mongodb.org/browse/SERVER-20385
            print("retry")
            continue
        raise
    for d in docs:
        print(d['_id'].generation_time)
    break
else:
    raise OperationFailure("$sample failed")

As often as not, the sample fails ten times in a row with error code 28799 and the message: "$sample stage could not find a non-duplicate document after 100 while using a random cursor. This is likely a sporadic failure, please try again."
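As noted in charlie.swanson's comment above, changing line 11 of this script to drop the collection instead of calling delete_many({}) makes the failure stop reproducing. A sketch of that variant of the setup follows (the sampling loop is unchanged; insert_one is used here only because Collection.insert is deprecated in newer pymongo):

from datetime import datetime, timedelta

import pytz
from bson import ObjectId
from pymongo import MongoClient

collection = MongoClient().db.test
# Drop instead of delete_many({}): presumably this discards the old btree
# along with its deleted-but-still-present records, so the random cursor
# no longer keeps landing on the same first visible record.
collection.drop()

start = datetime(2000, 1, 1, tzinfo=pytz.UTC)
for hour in range(10000):
    collection.insert_one(
        {'_id': ObjectId.from_datetime(start + timedelta(hours=hour)),
         'x': 1})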