...
If you happen to have a node in your config that is down, the system logs on all the other nodes each fill with 12 log messages every half second (24 messages per second). This seems excessive. Example of all the messages logged within 3 ms:

2018-10-01T13:45:16.770-0400 I ASIO [Replication] Connecting to localhost:27019
2018-10-01T13:45:16.770-0400 I ASIO [Replication] Failed to connect to localhost:27019 - HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.770-0400 I ASIO [Replication] Dropping all pooled connections to localhost:27019 due to HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.770-0400 I REPL_HB [replexec-3] Error in heartbeat (requestId: 225) to localhost:27019, response status: HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I ASIO [Replication] Connecting to localhost:27019
2018-10-01T13:45:16.771-0400 I ASIO [Replication] Failed to connect to localhost:27019 - HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I ASIO [Replication] Dropping all pooled connections to localhost:27019 due to HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I REPL_HB [replexec-3] Error in heartbeat (requestId: 226) to localhost:27019, response status: HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I ASIO [Replication] Connecting to localhost:27019
2018-10-01T13:45:16.771-0400 I ASIO [Replication] Failed to connect to localhost:27019 - HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I ASIO [Replication] Dropping all pooled connections to localhost:27019 due to HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.772-0400 I REPL_HB [replexec-3] Error in heartbeat (requestId: 227) to localhost:27019, response status: HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused

I wonder if the volume of ASIO messages could be reduced in this situation.
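For illustration only, here is a minimal sketch of the timed log-suppression technique that the fix for this ticket applies (the LogSeverityLimiter commit noted in the comments below): repeated occurrences of the same message are dropped unless a minimum interval has elapsed since it was last emitted. The names TimedLogLimiter, shouldLog, and the key string are invented for this sketch and are not MongoDB's actual API.

    // Minimal sketch (not the MongoDB LogSeverityLimiter implementation): drop
    // repeats of the same message unless a minimum interval has passed since
    // the message was last emitted. TimedLogLimiter and shouldLog are invented names.
    #include <chrono>
    #include <iostream>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    class TimedLogLimiter {
    public:
        explicit TimedLogLimiter(std::chrono::milliseconds minInterval)
            : _minInterval(minInterval) {}

        // Returns true if the message identified by `key` should be logged now.
        bool shouldLog(const std::string& key) {
            const auto now = std::chrono::steady_clock::now();
            std::lock_guard<std::mutex> lk(_mutex);
            auto it = _lastLogged.find(key);
            if (it != _lastLogged.end() && now - it->second < _minInterval) {
                return false;  // Same message fired too recently; suppress it.
            }
            _lastLogged[key] = now;
            return true;
        }

    private:
        const std::chrono::milliseconds _minInterval;
        std::mutex _mutex;
        std::unordered_map<std::string, std::chrono::steady_clock::time_point> _lastLogged;
    };

    int main() {
        // Allow a given failure message at most once per second.
        TimedLogLimiter limiter(std::chrono::seconds(1));

        // Simulate the burst above: many identical connection failures within a few ms.
        for (int i = 0; i < 12; ++i) {
            if (limiter.shouldLog("connect-failed:localhost:27019")) {
                std::cout << "Failed to connect to localhost:27019\n";  // printed once for this burst
            }
        }
        return 0;
    }

With a one-second limit like this, the burst in the description would produce one "Failed to connect" line per second per downed host instead of several per millisecond.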
xgen-internal-githook commented on Wed, 20 Feb 2019 16:34:32 +0000:
Author: {'name': 'Ben Caimano', 'email': 'ben.caimano@10gen.com'}
Message: SERVER-37412 Added LogSeverityLimiter for timed logging
Branch: master
https://github.com/mongodb/mongo/commit/6cdb28ab8df5dff06be82b4c46971fe5298c6f46

xgen-internal-githook commented on Wed, 20 Feb 2019 16:34:25 +0000:
Author: {'name': 'Ben Caimano', 'email': 'ben.caimano@10gen.com'}
Message: SERVER-37412 Decrease replication heartbeat log verbosity
Branch: master
https://github.com/mongodb/mongo/commit/601ed1b88afe54f79e39c298cd2c578795bfc17b

ben.caimano commented on Tue, 15 Jan 2019 20:17:50 +0000:
Mildly stalled in code review. My intention is to generate the widget mira.carey@mongodb.com requested in the current sprint.

greg.mckeon commented on Tue, 15 Jan 2019 20:16:59 +0000:
ben.caimano, is this active?

ben.caimano commented on Wed, 14 Nov 2018 22:25:20 +0000:
> Can you help me understand what that special case is doing?
If we get an error code back from a network call, we dump all connections to that URI in that pool. Since we allow the user to set the time limit, this particular case isn't necessarily indicative of a failure from the remote host. Thus we skip dumping the pool and just replace our single connection object.
> Since the default heartbeat interval is 2 seconds, we were concerned that we may send heartbeats too frequently to nodes that are down if the network interface is automatically retrying them. Do you know if that's the case, or how I would find that out?
So the heartbeat stuff is a bit spaghetti. You can definitely find out when it is sending out heartbeats just by upping the verbosity log level to 2 (sends here, receives here). One way or another, your command response will start here. A lot of the nitty-gritty of the actual executor will flood the log at verbosity level 3, if that helps. That said, since the pool drop is immediately followed by the heartbeat error, I believe that you're always ending up back here, where you schedule again. I don't believe that the network interface should be retrying them. Still, sometimes the networking layer is surprisingly byzantine; perhaps I've missed an edge case. The millisecond delay worries me a bit; could it be the natural delay of an attempt to immediately reschedule the heartbeat?

tess.avitabile commented on Wed, 14 Nov 2018 19:45:41 +0000:
ben.caimano, can you help me understand what that special case is doing? Since the default heartbeat interval is 2 seconds, we were concerned that we may send heartbeats too frequently to nodes that are down if the network interface is automatically retrying them. Do you know if that's the case, or how I would find that out? Heartbeats are scheduled using the ThreadPoolTaskExecutor.

ben.caimano commented on Fri, 19 Oct 2018 20:47:52 +0000:
tess.avitabile, I think I may have related code from the planned maintenance work. HostUnreachable in the ConnPool is usually either very temporary or very permanent. It probably needs a separate case like here.

craig.homa commented on Mon, 1 Oct 2018 18:12:20 +0000:
Please investigate whether the excessive logging is due to sending heartbeats too frequently.
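To make the scheduling hypothesis in the comments above concrete: the excerpt in the description shows three heartbeat attempts (four log lines each) within about 3 ms, which is what you would expect if a heartbeat that fails with HostUnreachable is rescheduled immediately rather than after the normal 2-second interval. The sketch below is hypothetical logic, not MongoDB's ThreadPoolTaskExecutor or heartbeat code; tryHeartbeat and the retry budget of 3 are assumptions made only to mirror the excerpt.

    // Hypothetical sketch (not MongoDB source): shows how retrying a failed
    // heartbeat immediately, instead of waiting out the normal interval, packs
    // several attempts, and their four log lines each, into a few milliseconds.
    #include <chrono>
    #include <iostream>
    #include <thread>

    // Stand-in for the real heartbeat request; a refused connection on
    // localhost fails in well under a millisecond.
    bool tryHeartbeat() {
        return false;
    }

    int main() {
        using namespace std::chrono;
        const auto heartbeatInterval = seconds(2);  // default replica set heartbeat interval
        const auto start = steady_clock::now();

        // Assumed retry budget of 3, matching the three attempts visible in
        // the log excerpt above.
        for (int attempt = 1; attempt <= 3; ++attempt) {
            const auto elapsedMs =
                duration_cast<milliseconds>(steady_clock::now() - start).count();
            std::cout << "+" << elapsedMs << "ms attempt " << attempt << ": ";
            if (!tryHeartbeat()) {
                // Immediate reschedule: no sleep, so the next attempt (and its
                // Connecting / Failed / Dropping / heartbeat-error lines)
                // follows almost instantly.
                std::cout << "HostUnreachable, rescheduling immediately\n";
                continue;
            }
            std::cout << "ok, next heartbeat in "
                      << duration_cast<seconds>(heartbeatInterval).count() << "s\n";
            std::this_thread::sleep_for(heartbeatInterval);
        }
        return 0;
    }

Running this prints three failure lines within a fraction of a millisecond, mirroring the density of the ASIO/REPL_HB output above; if the failure path instead waited out the heartbeat interval (or applied a backoff), the same information would appear at most once per interval.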