...
We ran into an issue where mongos suddenly created a large number of connections to the mongods and ran out of file descriptors. This eventually consumed all descriptors on the mongods as well, which then crashed (https://jira.mongodb.org/browse/SERVER-28933). It happened during a period of intermittent network congestion. We saw errors like these when the leaks occurred:

    2017-04-24T21:48:41.865+0000 I NETWORK [ReplicaSetMonitorWatcher] Socket recv() timeout 10.123.0.1:27017
    2017-04-24T21:48:41.865+0000 I NETWORK [ReplicaSetMonitorWatcher] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.123.0.1:27017]
    2017-04-24T21:48:41.865+0000 I NETWORK [ReplicaSetMonitorWatcher] Detected bad connection created at 1493026624910579 microSec, clearing pool for mongo5:27017 of 0 connections

The "0 connections" there is particularly interesting. Eventually the leaking led to running out of descriptors:

    2017-04-24T21:48:42.487+0000 I NETWORK [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
    2017-04-24T21:48:42.487+0000 E NETWORK [mongosMain] Out of file descriptors. Waiting one second before trying to accept more connections.

The logs indicated about 500 open client connections, but inspecting with lsof revealed fewer than 10. lsof also showed 6 connections to the config servers, 23 connections to the primary, a total of 474 connections to the two secondaries, and 478 "sock" descriptors with "can't identify protocol":

    mongos 5495 mongodb 1023u sock 0,7 0t0 84718243 can't identify protocol

Normally there are none of those, and only a few connections to the secondaries. This is currently a somewhat special setup: a single shard (a replica set of 3 nodes) with about 200 mongoses connecting to it. The leaking occurred on all of the mongos instances (all were also affected by the network congestion). This is what connPoolStats showed around that time (although I understand it is not entirely reliable):

    MongoDB shell version: 3.2.12
    connecting to: test
    mongos> db.runCommand( { "connPoolStats" : 1 } )
    {
        "numClientConnections" : 6,
        "numAScopedConnections" : 0,
        "totalInUse" : 478,
        "totalAvailable" : 24,
        "totalCreated" : 5000,
        "hosts" : {
            "mongocdata3:27019" : { "inUse" : 0, "available" : 1, "created" : 1 },
            "mongocdata2:27019" : { "inUse" : 0, "available" : 1, "created" : 1 },
            "mongocdata1:27019" : { "inUse" : 0, "available" : 1, "created" : 1 },
            "mongo7:27017" : { "inUse" : 0, "available" : 1, "created" : 7 },
            "mongo6:27017" : { "inUse" : 0, "available" : 1, "created" : 4 },
            "mongo5:27017" : { "inUse" : 0, "available" : 1, "created" : 18 }
        },
        "replicaSets" : {
            "datacfg" : {
                "hosts" : [
                    { "addr" : "mongocdata1:27019", "ok" : true, "ismaster" : false, "hidden" : false, "secondary" : true, "pingTimeMillis" : 0 },
                    { "addr" : "mongocdata2:27019", "ok" : true, "ismaster" : true, "hidden" : false, "secondary" : false, "pingTimeMillis" : 0 },
                    { "addr" : "mongocdata3:27019", "ok" : true, "ismaster" : false, "hidden" : false, "secondary" : true, "pingTimeMillis" : 0 }
                ]
            },
            "rs1" : {
                "hosts" : [
                    { "addr" : "mongo5:27017", "ok" : true, "ismaster" : true, "hidden" : false, "secondary" : false, "pingTimeMillis" : 0 },
                    { "addr" : "mongo6:27017", "ok" : true, "ismaster" : false, "hidden" : false, "secondary" : true, "pingTimeMillis" : 9 },
                    { "addr" : "mongo7:27017", "ok" : true, "ismaster" : false, "hidden" : false, "secondary" : true, "pingTimeMillis" : 4 }
                ]
            }
        },
        "ok" : 1
    }
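For context, here is a minimal sketch (not from the original report) of how the lsof tally above can be reproduced on a mongos host. The PID lookup and the assumption of a single mongos process are placeholders:

    # Hypothetical PID lookup; replace with the actual mongos PID if several run.
    MONGOS_PID=$(pgrep -o mongos)

    # Count leaked descriptors, which lsof reports as "can't identify protocol".
    lsof -p "$MONGOS_PID" 2>/dev/null | grep -c "can't identify protocol"

    # Tally established TCP connections per remote endpoint (mongod / config server).
    # -a ANDs the -p and -i selections; -nP skips name and port resolution.
    lsof -a -nP -p "$MONGOS_PID" -i TCP 2>/dev/null \
        | awk '{print $9}' | grep -- '->' | cut -d'>' -f2 | sort | uniq -c | sort -rn

A growing count of "can't identify protocol" sockets alongside a flat client connection count matches the signature described above.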
thomas.schubert commented on Fri, 9 Jun 2017 21:24:39 +0000:
Hi devastor,
Thank you for the update. Since this is no longer an issue for you, I'm resolving this ticket. Please let us know if you encounter similar behavior in the future, and we will continue to investigate.
Kind regards,
Thomas

devastor commented on Tue, 30 May 2017 09:49:37 +0000:
Hi Thomas,
Unfortunately, we have not seen it after upgrading. Also, our setup has changed a bit so we don't have that many mongoses any more, which might affect it too.

thomas.schubert commented on Wed, 24 May 2017 08:41:58 +0000:
Hi devastor,
Have you encountered this issue again since upgrading?
Thanks,
Thomas

devastor commented on Wed, 3 May 2017 15:47:40 +0000:
Thanks Thomas,
Just to update, we are now running 3.2.13 and periodically logging the output of connPoolStats, so let's see what happens.

thomas.schubert commented on Thu, 27 Apr 2017 18:44:24 +0000:
Hi devastor,
Thank you for reporting this behavior. We've completed a large amount of work in MongoDB 3.2.13 to improve diagnostics to investigate issues such as this. The release candidate, MongoDB 3.2.13-rc0, is available now, and the general availability release is scheduled for early next week. In particular, MongoDB 3.2.13 includes:
- Significant improvements to connPoolStats (including SERVER-25218)
- Log time to establish egress connections through NetworkInterfaceASIO (SERVER-28259)
- Add logging to track egress connection pool size (SERVER-28652)
- Add egress logging to legacy write path (SERVER-28760)
Would you please upgrade and periodically collect the output of connPoolStats? Please feel free to increase the delay (in seconds) to a larger number that would still show the issue emerge according to the timelines you've observed previously.

    delay=1
    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus().localTime)); print(JSON.stringify(db.runCommand( { \"connPoolStats\" : 1 } ))); sleep(1000*${delay:?})}" >connPoolStats.log &

If you encounter this issue again, would you please upload the following diagnostic information:
- The connPoolStats.log file
- Complete log files of an affected mongos
- Complete log files of an affected mongod
- An archive of the diagnostic.data directory of the affected mongod
I've created a secure upload portal for you to provide these files. Files uploaded to this portal are only visible to MongoDB employees and are routinely deleted after some time.
Thank you for your help,
Thomas
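As a small follow-up to the logging loop in the comment above, the sketch below (not part of the ticket) summarizes pool growth from the resulting connPoolStats.log. It assumes one JSON document per line, as produced by JSON.stringify, and the field names shown in the connPoolStats output earlier:

    # Print totalCreated per sample and the increase since the previous sample.
    grep -o '"totalCreated":[0-9]*' connPoolStats.log | cut -d: -f2 | \
        awk 'NR==1 {prev=$1} {printf "sample %d: totalCreated=%d (+%d)\n", NR, $1, $1-prev; prev=$1}'

    # A steadily growing totalCreated while totalAvailable stays flat points at
    # connections that are never returned to the pool rather than legitimate load.
    grep -o '"totalAvailable":[0-9]*' connPoolStats.log | cut -d: -f2 | tail -n 5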
It is unknown what the exact trigger is, but it has happened multiple times now during high-traffic peaks that caused congestion in the internal network.
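Since the exact trigger is unknown, one possible mitigation (an assumption on my part, not something suggested in the ticket) is to alert before the descriptor limit is reached so the leak is caught before accept() fails with errno:24. A minimal watchdog sketch, assuming a single mongos process, a numeric soft limit, and an arbitrary 80% threshold:

    #!/bin/sh
    # Hypothetical watchdog: warn when mongos descriptor usage crosses a threshold.
    PID=$(pgrep -o mongos) || exit 0
    USED=$(ls /proc/"$PID"/fd 2>/dev/null | wc -l)
    LIMIT=$(awk '/Max open files/ {print $4}' /proc/"$PID"/limits)   # soft limit
    THRESHOLD=$(( LIMIT * 80 / 100 ))
    if [ "$USED" -ge "$THRESHOLD" ]; then
        logger -t mongos-fd-watch "mongos pid $PID using $USED of $LIMIT file descriptors"
    fi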