...
BugZero found this defect 2728 days ago.
Our database is divided into 4 shards, each having one primary, one secondary, and one arbiter. Primaries are r4.2xlarge servers on AWS EC2, and secondaries are r4.xlarge. Our workload is intensive in both reads and writes, but these servers usually handle the load without a problem. However, during their regular work, the primaries of 3 of the 4 shards suddenly crashed within a very short time of each other. We don't know what could have caused this. Attached are the logs of the segfaults from the primary servers. The one from shard1 seems different from the other two.
xgen-internal-githook commented on Tue, 30 May 2017 15:27:10 +0000:
Author: {u'username': u'samantharitter', u'name': u'samantharitter', u'email': u'samantha.ritter@10gen.com'}
Message: SERVER-29152 Do not cache logging ostream in threadlocal when in other thread-specific contexts
Branch: master
https://github.com/mongodb/mongo/commit/fad590916a30ff34dc8c3b37afcfffa2c4e5c8bc

samantha.ritter@10gen.com commented on Fri, 26 May 2017 19:19:32 +0000:
It appears our hook did not catch the 3.2 commit, it's here:
Author: samantharitter
Message: SERVER-29152 Do not cache logging ostream in threadlocal when in other thread-specific contexts
Branch: v3.2
https://github.com/mongodb/mongo/commit/85aa900eae81fdc07d00aa4b1fb782f7ca5b4664

xgen-internal-githook commented on Fri, 26 May 2017 14:33:08 +0000:
Author: {u'username': u'samantharitter', u'name': u'samantharitter', u'email': u'samantha.ritter@10gen.com'}
Message: SERVER-29152 Do not cache logging ostream in threadlocal when in other thread-specific contexts
Branch: v3.4
https://github.com/mongodb/mongo/commit/36a4a00321bb531190bcd00f523bce95a81b5ab2

samantha.ritter@10gen.com commented on Mon, 22 May 2017 21:15:53 +0000:
Hi Meni,
I wanted to update you on the status of this bug. New logging code that was added by SERVER-28760 tries to log while a thread is exiting, in which case the logging subsystem may already be destroyed. The order in which these objects are destroyed seems quasi-random, depending on the build or on the system's memory allocation. This influences whether these objects are destroyed peacefully or whether they are destroyed in a bad order that leads to a crash. We are investigating exactly what determines the ordering of the destruction of these objects. We are still working to reproduce the crash on our end as we investigate what the best fix will be. Thank you for your patience.
As to what actual event may have triggered the thread to exit here in your case, can you provide complete log files from these crashes? The stack traces you've linked have been very helpful, and it would also help us to see what the system was doing up until things went south.
Thank you,
Samantha

meni commented on Sat, 13 May 2017 07:02:17 +0000:
We're using the mongodb-org-server packages for ubuntu from the official mongodb repositories. As far as we know these don't add any log rotation settings, and we haven't implemented any ourselves, and never noticed the log file being rotated.

samantha.ritter@10gen.com commented on Fri, 12 May 2017 18:31:41 +0000:
Hi there,
Thanks for opening this ticket, I'm sorry you experienced these crashes. I'm looking into what might have happened on these servers. Given the stack traces, it's possible that we have a bug in our logging subsystem. Are you running with rotating log files? If so, is there any chance that these servers' log files were being rotated around the time the crash occurred?
Thanks,
Samantha
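
The destruction-order hazard described in the 22 May comment (logging from a thread that is exiting, after its thread-local logging state has already been destroyed) can be illustrated with a small C++ sketch. This is not MongoDB's logging code and not the SERVER-29152 patch; every name here (`sketch`, `logLine`, `CachedStream`, `ExitLogger`, `runDemo`) is hypothetical. It assumes standard `thread_local` semantics, where a thread's objects are destroyed in reverse order of construction, and it forces the bad ordering deliberately by constructing the exit-time logger before the cached stream. The guard shown is one possible way to avoid touching a destroyed cache, in the spirit of the commit message "Do not cache logging ostream in threadlocal when in other thread-specific contexts".

```cpp
#include <cassert>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical names; not MongoDB's actual logging code

std::mutex sinkMutex;
std::vector<std::string> sink;  // stands in for the real log destination

void emit(const std::string& tag, const std::string& msg) {
    std::lock_guard<std::mutex> lk(sinkMutex);
    sink.push_back("[" + tag + "] " + msg);
}

// Trivially destructible per-thread flag: it can still be read while the
// thread's other thread_local objects are being destroyed.
enum class CacheState { kNotCreated, kAlive, kDestroyed };
thread_local CacheState cacheState = CacheState::kNotCreated;

// Per-thread cached stream, the kind of object that is unsafe to use once
// its destructor has run.
struct CachedStream {
    std::ostringstream os;
    CachedStream() { cacheState = CacheState::kAlive; }
    ~CachedStream() { cacheState = CacheState::kDestroyed; }
};

CachedStream& cachedStream() {
    assert(cacheState != CacheState::kDestroyed);  // callers must check first
    thread_local CachedStream cs;  // constructed on first use in each thread
    return cs;
}

void logLine(const std::string& msg) {
    if (cacheState == CacheState::kDestroyed) {
        // The thread is exiting and the cache is gone: use a stack-local
        // stream instead of touching the destroyed object.
        std::ostringstream os;
        os << msg;
        emit("fallback", os.str());
        return;
    }
    std::ostringstream& os = cachedStream().os;
    os.str("");  // reset the buffer left over from the previous message
    os.clear();
    os << msg;
    emit("cached", os.str());
}

// Logs from its destructor, i.e. while the owning thread is exiting.
struct ExitLogger {
    std::string name;
    explicit ExitLogger(std::string n) : name(std::move(n)) {}
    ~ExitLogger() { logLine(name + " exiting"); }
};

std::vector<std::string> runDemo() {
    {
        std::lock_guard<std::mutex> lk(sinkMutex);
        sink.clear();
    }
    std::thread worker([] {
        // Constructed first, so destroyed last: its destructor runs after
        // the cached stream has already been destroyed.
        thread_local ExitLogger exitLogger("worker");
        logLine("working");  // constructs the cached stream second
    });
    worker.join();
    std::lock_guard<std::mutex> lk(sinkMutex);
    return sink;
}

}  // namespace sketch
```

Under these assumptions, `sketch::runDemo()` should return two lines: the first tagged `cached`, and the second, written from the destructor during thread exit, tagged `fallback`. Without the `kDestroyed` check, the second call would write into an already-destroyed stream, which is undefined behavior and could plausibly surface as a segfault like the ones reported here.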