...
We ran into an issue where mongos suddenly created a large number of connections to the mongods and ran out of file descriptors. This eventually consumed all descriptors on the mongods as well, which then crashed (https://jira.mongodb.org/browse/SERVER-28933). It happened during a period of intermittent network congestion. We saw errors like these when the leaks occurred:

    2017-04-24T21:48:41.865+0000 I NETWORK [ReplicaSetMonitorWatcher] Socket recv() timeout 10.123.0.1:27017
    2017-04-24T21:48:41.865+0000 I NETWORK [ReplicaSetMonitorWatcher] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.123.0.1:27017]
    2017-04-24T21:48:41.865+0000 I NETWORK [ReplicaSetMonitorWatcher] Detected bad connection created at 1493026624910579 microSec, clearing pool for mongo5:27017 of 0 connections

The "0 connections" there is particularly interesting. Eventually the leaking led to running out of descriptors:

    2017-04-24T21:48:42.487+0000 I NETWORK [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
    2017-04-24T21:48:42.487+0000 E NETWORK [mongosMain] Out of file descriptors. Waiting one second before trying to accept more connections.

The logs indicated about 500 open client connections, but inspecting with lsof revealed fewer than 10. lsof also showed 6 connections to the config servers, 23 connections to the primary, a total of 474 connections to the two secondaries, and 478 "sock" descriptors with "can't identify protocol":

    mongos 5495 mongodb 1023u sock 0,7 0t0 84718243 can't identify protocol

Normally there are none of those, and only a few connections to the secondaries. This is currently a somewhat special setup: a single shard (a replica set of 3 nodes) with about 200 mongoses connecting to it. The leaking occurred on all of the mongos instances (all were also affected by the network congestion). This is what connPoolStats showed around that time (although I understand it is not entirely reliable):

    MongoDB shell version: 3.2.12
    connecting to: test
    mongos> db.runCommand( { "connPoolStats" : 1 } )
    {
        "numClientConnections" : 6,
        "numAScopedConnections" : 0,
        "totalInUse" : 478,
        "totalAvailable" : 24,
        "totalCreated" : 5000,
        "hosts" : {
            "mongocdata3:27019" : { "inUse" : 0, "available" : 1, "created" : 1 },
            "mongocdata2:27019" : { "inUse" : 0, "available" : 1, "created" : 1 },
            "mongocdata1:27019" : { "inUse" : 0, "available" : 1, "created" : 1 },
            "mongo7:27017" : { "inUse" : 0, "available" : 1, "created" : 7 },
            "mongo6:27017" : { "inUse" : 0, "available" : 1, "created" : 4 },
            "mongo5:27017" : { "inUse" : 0, "available" : 1, "created" : 18 }
        },
        "replicaSets" : {
            "datacfg" : {
                "hosts" : [
                    { "addr" : "mongocdata1:27019", "ok" : true, "ismaster" : false, "hidden" : false, "secondary" : true, "pingTimeMillis" : 0 },
                    { "addr" : "mongocdata2:27019", "ok" : true, "ismaster" : true, "hidden" : false, "secondary" : false, "pingTimeMillis" : 0 },
                    { "addr" : "mongocdata3:27019", "ok" : true, "ismaster" : false, "hidden" : false, "secondary" : true, "pingTimeMillis" : 0 }
                ]
            },
            "rs1" : {
                "hosts" : [
                    { "addr" : "mongo5:27017", "ok" : true, "ismaster" : true, "hidden" : false, "secondary" : false, "pingTimeMillis" : 0 },
                    { "addr" : "mongo6:27017", "ok" : true, "ismaster" : false, "hidden" : false, "secondary" : true, "pingTimeMillis" : 9 },
                    { "addr" : "mongo7:27017", "ok" : true, "ismaster" : false, "hidden" : false, "secondary" : true, "pingTimeMillis" : 4 }
                ]
            }
        },
        "ok" : 1
    }
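For context, here is a minimal sketch (not from the original report) of how the lsof tally above can be reproduced on a mongos host. The PID lookup and the assumption of a single mongos process are placeholders:

    # Hypothetical PID lookup; replace with the actual mongos PID if several run.
    MONGOS_PID=$(pgrep -o mongos)

    # Count leaked descriptors, which lsof reports as "can't identify protocol".
    lsof -p "$MONGOS_PID" 2>/dev/null | grep -c "can't identify protocol"

    # Tally established TCP connections per remote endpoint (mongod / config server).
    # -a ANDs the -p and -i selections; -nP skips name and port resolution.
    lsof -a -nP -p "$MONGOS_PID" -i TCP 2>/dev/null \
        | awk '{print $9}' | grep -- '->' | cut -d'>' -f2 | sort | uniq -c | sort -rn

A growing count of "can't identify protocol" sockets alongside a flat client connection count matches the signature described above.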
thomas.schubert commented on Fri, 9 Jun 2017 21:24:39 +0000:
Hi devastor,
Thank you for the update. Since this is no longer an issue for you, I'm resolving this ticket. Please let us know if you encounter similar behavior in the future, and we will continue to investigate.
Kind regards,
Thomas

devastor commented on Tue, 30 May 2017 09:49:37 +0000:
Hi Thomas,
Unfortunately, we have not seen it after upgrading. Also, our setup has changed a bit so we don't have that many mongoses any more, which might affect it too.

thomas.schubert commented on Wed, 24 May 2017 08:41:58 +0000:
Hi devastor,
Have you encountered this issue again since upgrading?
Thanks,
Thomas

devastor commented on Wed, 3 May 2017 15:47:40 +0000:
Thanks Thomas,
Just to update, we are now running 3.2.13 and periodically logging the output of connPoolStats, so let's see what happens.

thomas.schubert commented on Thu, 27 Apr 2017 18:44:24 +0000:
Hi devastor,
Thank you for reporting this behavior. We've completed a large amount of work in MongoDB 3.2.13 to improve diagnostics to investigate issues such as this. The release candidate, MongoDB 3.2.13-rc0, is available now, and the general availability release is scheduled for early next week. In particular, MongoDB 3.2.13 includes:
- Significant improvements to connPoolStats (including SERVER-25218)
- Log time to establish egress connections through NetworkInterfaceASIO (SERVER-28259)
- Add logging to track egress connection pool size (SERVER-28652)
- Add egress logging to legacy write path (SERVER-28760)
Would you please upgrade and periodically collect the output of connPoolStats? Please feel free to increase the delay (in seconds) to a larger number that would still show the issue emerge according to the timelines you've observed previously.

    delay=1
    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus().localTime)); print(JSON.stringify(db.runCommand( { \"connPoolStats\" : 1 } ))); sleep(1000*${delay:?})}" >connPoolStats.log &

If you encounter this issue again, would you please upload the following diagnostic information:
- The connPoolStats.log file
- Complete log files of an affected mongos
- Complete log files of an affected mongod
- An archive of the diagnostic.data directory of the affected mongod
I've created a secure upload portal for you to provide these files. Files uploaded to this portal are only visible to MongoDB employees and are routinely deleted after some time.
Thank you for your help,
Thomas
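As a small follow-up to the logging loop in the comment above, the sketch below (not part of the ticket) summarizes pool growth from the resulting connPoolStats.log. It assumes one JSON document per line, as produced by JSON.stringify, and the field names shown in the connPoolStats output earlier:

    # Print totalCreated per sample and the increase since the previous sample.
    grep -o '"totalCreated":[0-9]*' connPoolStats.log | cut -d: -f2 | \
        awk 'NR==1 {prev=$1} {printf "sample %d: totalCreated=%d (+%d)\n", NR, $1, $1-prev; prev=$1}'

    # A steadily growing totalCreated while totalAvailable stays flat points at
    # connections that are never returned to the pool rather than legitimate load.
    grep -o '"totalAvailable":[0-9]*' connPoolStats.log | cut -d: -f2 | tail -n 5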
It is unknown what the exact trigger is, but it has happened multiple times now during high-traffic peaks that caused congestion in the internal network.
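Since the exact trigger is unknown, one possible mitigation (an assumption on my part, not something suggested in the ticket) is to alert before the descriptor limit is reached so the leak is caught before accept() fails with errno:24. A minimal watchdog sketch, assuming a single mongos process, a numeric soft limit, and an arbitrary 80% threshold:

    #!/bin/sh
    # Hypothetical watchdog: warn when mongos descriptor usage crosses a threshold.
    PID=$(pgrep -o mongos) || exit 0
    USED=$(ls /proc/"$PID"/fd 2>/dev/null | wc -l)
    LIMIT=$(awk '/Max open files/ {print $4}' /proc/"$PID"/limits)   # soft limit
    THRESHOLD=$(( LIMIT * 80 / 100 ))
    if [ "$USED" -ge "$THRESHOLD" ]; then
        logger -t mongos-fd-watch "mongos pid $PID using $USED of $LIMIT file descriptors"
    fi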