...
Running 3.1.8-pre (d03334dfa87386feef4b8331f0e183d80495808c):

```
> db.fanclub.aggregate([{$sample: {size: 120}}])
assert: command failed: {
	"ok" : 0,
	"errmsg" : "$sample stage could not find a non-duplicate document after 100 while using a random cursor. This is likely a sporadic failure, please try again.",
	"code" : 28799
} : aggregate failed
_getErrorWithCode@src/mongo/shell/utils.js:23:13
doassert@src/mongo/shell/assert.js:13:14
assert.commandWorked@src/mongo/shell/assert.js:259:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1211:5
@(shell):1:1
```

It is indeed sporadic in my testing. Should the client ever see this message?

I am able to reproduce this with --storageEngine=wiredTiger on a somewhat old set of files:

```
$ less tmp/WiredTiger
WiredTiger
WiredTiger 2.5.1: (December 24, 2014)
```

However, when I export/import that database into a new --dbpath, I am unable to repro:

```
$ less tmp2/WiredTiger
WiredTiger
WiredTiger 2.6.2: (June 4, 2015)
```
jesse commented on Tue, 30 May 2017 15:45:44 +0000:

With MongoDB 3.4.4 on Mac OS X, I can reproduce this. First do "python -m pip install pymongo pytz", then:

```python
from datetime import datetime, timedelta

import pytz
from bson import ObjectId
from pymongo import MongoClient
from pymongo.errors import OperationFailure

CHUNKS = 20

collection = MongoClient().db.test
collection.delete_many({})
start = datetime(2000, 1, 1, tzinfo=pytz.UTC)
for hour in range(10000):
    collection.insert(
        {'_id': ObjectId.from_datetime(start + timedelta(hours=hour)),
         'x': 1})

for _ in range(10):
    try:
        docs = list(collection.aggregate([{
            "$sample": {"size": CHUNKS}
        }, {
            "$sort": {"_id": 1}
        }]))
    except OperationFailure as exc:
        if exc.code == 28799:
            # Work around https://jira.mongodb.org/browse/SERVER-20385
            print("retry")
            continue
        raise
    for d in docs:
        print(d['_id'].generation_time)
    break
else:
    raise OperationFailure("$sample failed")
```

As often as not, the sample fails ten times in a row with error code 28799 and the message: "$sample stage could not find a non-duplicate document after 100 while using a random cursor. This is likely a sporadic failure, please try again."

marmor commented on Wed, 5 Apr 2017 07:53:33 +0000:

I'm able to reproduce this issue on 3.2.12. The collection contains 1.1B documents, and trying to get a $sample of 1M keeps returning this error msg (3/3 tries). The sample size is less than 1% of the collection size, so I don't think it should be hard to get 1M unique documents, statistically speaking. The sample works ok for 1000.

matt.kangas@10gen.com commented on Fri, 18 Sep 2015 21:53:56 +0000:

Confirmed fixed per the repro above. Thanks!
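marmor's statistical point can be checked with back-of-the-envelope arithmetic. The sketch below uses the figures from the comment (1.1B documents, a 1M sample) and assumes the random cursor draws uniformly with replacement; under that assumption, 100 consecutive duplicate draws should be essentially impossible, so repeated 28799 failures point at non-uniform cursor behavior rather than bad luck.

```python
# Hypothetical model of $sample's duplicate check, using marmor's numbers:
# each draw collides with the <= k documents already returned with
# probability at most k/N, so 100 collisions in a row should occur with
# probability at most (k/N)**100 under uniform sampling.
N = 1_100_000_000   # collection size from the comment
k = 1_000_000       # requested sample size
p_dup = k / N                 # worst-case chance a single draw is a duplicate
p_fail = p_dup ** 100         # chance of 100 consecutive duplicates
print(f"single-draw duplicate probability: {p_dup:.2e}")
print(f"100-consecutive-duplicate probability: {p_fail:.2e}")
```

Even with the worst-case bound, the failure probability is far below anything observable, which is consistent with the root cause later identified in WiredTiger's random cursor.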
xgen-internal-githook commented on Wed, 16 Sep 2015 01:31:41 +0000:

Author: Michael Cahill (michael.cahill@mongodb.com)
Message: Merge pull request #2194 from wiredtiger/server-20385

SERVER-20385: WT_CURSOR.next(random) more random
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/7505a02a52bc140acd0fcd81985c0e0ad2a78f7d

xgen-internal-githook commented on Wed, 16 Sep 2015 01:31:39 +0000:

Author: Keith Bostic (keith@wiredtiger.com)
Message: SERVER-20385: the original use case of WT_CURSOR.next(random) was to return a point in the tree for splitting the tree, and for that reason, once we found a random page, we always returned the first key on that page in order to make the split easy.

In MongoDB: first, $sample de-duplicates the keys WiredTiger returns, that is, it ignores keys it's already returned; second, $sample allows you to set the sample size. If you specify a sample size greater than the number of leaf pages in the table, the de-duplication code catches us, because we can't return more unique keys than the number of leaf pages in the table.

Remove the code that returns the first key of the page; always return as random a key as we can.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/ba9fcca4b317965b590ce4e67442f1a68a218bbe

keith.bostic commented on Tue, 15 Sep 2015 21:33:57 +0000:

This is a WiredTiger problem; I've pushed a branch for review & merge. Apologies all around!

geert.bosch commented on Tue, 15 Sep 2015 20:33:01 +0000:

I checked the dataset, and it seems the document count etc. is valid.

dan@10gen.com commented on Mon, 14 Sep 2015 16:02:31 +0000:

charlie.swanson, I believe the issue that you were asking about was WT-2032, resolved in 3.1.7. I'm wondering if there is something peculiar with how the data is laid out. Would need keith.bostic to take a look at the data files.
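Keith's commit message can be illustrated with a small model (this is a sketch, not WiredTiger code; the page and key counts are made up): if the random cursor picks a random leaf page but always returns that page's first key, a de-duplicating caller like $sample can never collect more unique keys than there are leaf pages, no matter how many draws it makes.

```python
import random

random.seed(42)
PAGES = 50            # hypothetical number of leaf pages
KEYS_PER_PAGE = 200   # hypothetical keys per page
pages = [[p * KEYS_PER_PAGE + i for i in range(KEYS_PER_PAGE)]
         for p in range(PAGES)]

def next_random_old():
    # Pre-fix behavior: random page, but always that page's FIRST key.
    return random.choice(pages)[0]

def next_random_fixed():
    # Post-fix behavior: random page, then a random key on that page.
    return random.choice(random.choice(pages))

old = {next_random_old() for _ in range(100_000)}
fixed = {next_random_fixed() for _ in range(100_000)}
print(len(old), len(fixed))   # old is capped at PAGES; fixed is not
```

This is why a $sample size larger than the leaf-page count could never succeed before the fix: the de-duplication loop exhausts the reachable keys and hits the 100-duplicate limit.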
matt.kangas@10gen.com commented on Mon, 14 Sep 2015 15:55:48 +0000:

charlie.swanson, there are 10k documents in the collection being sampled (see attached tarball). Zero writes were taking place at that time; the database was otherwise entirely idle.

charlie.swanson commented on Mon, 14 Sep 2015 15:40:20 +0000:

So I now realize that log message is missing a word; it should be "after 100 attempts". How many documents are in the collection being sampled? Were there any writes taking place at the time?

This error message indicates that the document returned from WiredTiger's random cursor was identical (in terms of _id) 100 times in a row. There is not a graceful way to recover from this, so we decided to just propagate this up to the user and have them try again.

geert.bosch, I remember you encountered a similar problem when hooking up the random cursor, where WiredTiger always returned the same document. Do you remember what version that was fixed in?
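The de-duplication loop charlie.swanson describes can be sketched as follows (a model of the behavior, not the actual server code; names and the error text are illustrative): keep drawing from the random cursor, skip _ids already returned, and fail with code 28799 after 100 consecutive duplicates.

```python
import random

MAX_CONSECUTIVE_DUPES = 100  # the limit mentioned in the error message

def sample(draw_random_id, size):
    """Model of $sample's de-dup loop over a random cursor."""
    seen, out = set(), []
    dupes = 0
    while len(out) < size:
        _id = draw_random_id()
        if _id in seen:
            dupes += 1
            if dupes >= MAX_CONSECUTIVE_DUPES:
                raise RuntimeError(
                    "code 28799: could not find a non-duplicate "
                    "document after 100 attempts")
            continue
        dupes = 0
        seen.add(_id)
        out.append(_id)
    return out

rng = random.Random(7)
# A healthy cursor over 10k ids (matt.kangas's collection size) succeeds:
print(len(sample(lambda: rng.randrange(10_000), 120)))
# A degenerate cursor that can only reach 3 distinct ids inevitably trips
# the 100-duplicate limit when asked for a sample of 120:
try:
    sample(lambda: rng.randrange(3), 120)
except RuntimeError as exc:
    print(exc)
```

With 10k documents and no writes, the healthy path should always win, which is why repeated failures pointed at the cursor itself.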
Repro'd on Ubuntu 15.04 with a local build of mongod from source. Extract the tarball to tmp3, then:

```
$ ./mongod --dbpath tmp3 --port 27009
$ ./mongo --port 27009
> use mongodb
> db.fanclub.aggregate([{$sample: {size: 120}}])
```

Try the .aggregate query a few times (n