...
If you happen to have a node in your config that is down, the system logs on all the other nodes each fill with 12 log messages every half second (24 messages per second). This seems excessive. Example of all the messages logged within 3 ms:

2018-10-01T13:45:16.770-0400 I ASIO [Replication] Connecting to localhost:27019
2018-10-01T13:45:16.770-0400 I ASIO [Replication] Failed to connect to localhost:27019 - HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.770-0400 I ASIO [Replication] Dropping all pooled connections to localhost:27019 due to HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.770-0400 I REPL_HB [replexec-3] Error in heartbeat (requestId: 225) to localhost:27019, response status: HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I ASIO [Replication] Connecting to localhost:27019
2018-10-01T13:45:16.771-0400 I ASIO [Replication] Failed to connect to localhost:27019 - HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I ASIO [Replication] Dropping all pooled connections to localhost:27019 due to HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I REPL_HB [replexec-3] Error in heartbeat (requestId: 226) to localhost:27019, response status: HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I ASIO [Replication] Connecting to localhost:27019
2018-10-01T13:45:16.771-0400 I ASIO [Replication] Failed to connect to localhost:27019 - HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.771-0400 I ASIO [Replication] Dropping all pooled connections to localhost:27019 due to HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused
2018-10-01T13:45:16.772-0400 I REPL_HB [replexec-3] Error in heartbeat (requestId: 227) to localhost:27019, response status: HostUnreachable: Error connecting to localhost:27019 (127.0.0.1:27019) :: caused by :: Connection refused

I wonder if the volume of ASIO messages could be reduced in this situation.
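For illustration only, here is a minimal sketch of the timed log-suppression technique that the fix for this ticket applies (the LogSeverityLimiter commit noted in the comments below): repeated occurrences of the same message are dropped unless a minimum interval has elapsed since it was last emitted. The names TimedLogLimiter, shouldLog, and the key string are invented for this sketch and are not MongoDB's actual API.

    // Minimal sketch (not the MongoDB LogSeverityLimiter implementation): drop
    // repeats of the same message unless a minimum interval has passed since
    // the message was last emitted. TimedLogLimiter and shouldLog are invented names.
    #include <chrono>
    #include <iostream>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    class TimedLogLimiter {
    public:
        explicit TimedLogLimiter(std::chrono::milliseconds minInterval)
            : _minInterval(minInterval) {}

        // Returns true if the message identified by `key` should be logged now.
        bool shouldLog(const std::string& key) {
            const auto now = std::chrono::steady_clock::now();
            std::lock_guard<std::mutex> lk(_mutex);
            auto it = _lastLogged.find(key);
            if (it != _lastLogged.end() && now - it->second < _minInterval) {
                return false;  // Same message fired too recently; suppress it.
            }
            _lastLogged[key] = now;
            return true;
        }

    private:
        const std::chrono::milliseconds _minInterval;
        std::mutex _mutex;
        std::unordered_map<std::string, std::chrono::steady_clock::time_point> _lastLogged;
    };

    int main() {
        // Allow a given failure message at most once per second.
        TimedLogLimiter limiter(std::chrono::seconds(1));

        // Simulate the burst above: many identical connection failures within a few ms.
        for (int i = 0; i < 12; ++i) {
            if (limiter.shouldLog("connect-failed:localhost:27019")) {
                std::cout << "Failed to connect to localhost:27019\n";  // printed once for this burst
            }
        }
        return 0;
    }

With a one-second limit like this, the burst in the description would produce one "Failed to connect" line per second per downed host instead of several per millisecond.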
xgen-internal-githook commented on Wed, 20 Feb 2019 16:34:32 +0000:
Author: {'name': 'Ben Caimano', 'email': 'ben.caimano@10gen.com'}
Message: SERVER-37412 Added LogSeverityLimiter for timed logging
Branch: master
https://github.com/mongodb/mongo/commit/6cdb28ab8df5dff06be82b4c46971fe5298c6f46

xgen-internal-githook commented on Wed, 20 Feb 2019 16:34:25 +0000:
Author: {'name': 'Ben Caimano', 'email': 'ben.caimano@10gen.com'}
Message: SERVER-37412 Decrease replication heartbeat log verbosity
Branch: master
https://github.com/mongodb/mongo/commit/601ed1b88afe54f79e39c298cd2c578795bfc17b

ben.caimano commented on Tue, 15 Jan 2019 20:17:50 +0000:
Mildly stalled in code review. My intention is to generate the widget mira.carey@mongodb.com requested in the current sprint.

greg.mckeon commented on Tue, 15 Jan 2019 20:16:59 +0000:
ben.caimano, is this active?

ben.caimano commented on Wed, 14 Nov 2018 22:25:20 +0000:
> Can you help me understand what that special case is doing?
If we get an error code back from a network call, we dump all connections to that URI in that pool. Since we allow the user to set the time limit, this particular case isn't necessarily indicative of a failure from the remote host. Thus we skip dumping the pool and just replace our single connection object.
> Since the default heartbeat interval is 2 seconds, we were concerned that we may send heartbeats too frequently to nodes that are down if the network interface is automatically retrying them. Do you know if that's the case, or how I would find that out?
So the heartbeat stuff is a bit spaghetti. You can definitely find out when it is sending out heartbeats just by upping the verbosity log level to 2 (sends here, receives here). One way or another, your command response will start here. A lot of the nitty-gritty of the actual executor will flood the log at verbosity level 3, if that helps. That said, since the pool drop is immediately followed by the heartbeat error, I believe that you're always ending up back here, where you schedule again. I don't believe that the network interface should be retrying them. Still, sometimes the networking layer is surprisingly byzantine; perhaps I've missed an edge case. The millisecond delay worries me a bit; could it be the natural delay of an attempt to immediately reschedule the heartbeat?

tess.avitabile commented on Wed, 14 Nov 2018 19:45:41 +0000:
ben.caimano, can you help me understand what that special case is doing? Since the default heartbeat interval is 2 seconds, we were concerned that we may send heartbeats too frequently to nodes that are down if the network interface is automatically retrying them. Do you know if that's the case, or how I would find that out? Heartbeats are scheduled using the ThreadPoolTaskExecutor.

ben.caimano commented on Fri, 19 Oct 2018 20:47:52 +0000:
tess.avitabile, I think I may have related code from the planned maintenance work. HostUnreachable in the ConnPool is usually either very temporary or very permanent. It probably needs a separate case like here.

craig.homa commented on Mon, 1 Oct 2018 18:12:20 +0000:
Please investigate whether the excessive logging is due to sending heartbeats too frequently.
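To make the scheduling hypothesis in the comments above concrete: the excerpt in the description shows three heartbeat attempts (four log lines each) within about 3 ms, which is what you would expect if a heartbeat that fails with HostUnreachable is rescheduled immediately rather than after the normal 2-second interval. The sketch below is hypothetical logic, not MongoDB's ThreadPoolTaskExecutor or heartbeat code; tryHeartbeat and the retry budget of 3 are assumptions made only to mirror the excerpt.

    // Hypothetical sketch (not MongoDB source): shows how retrying a failed
    // heartbeat immediately, instead of waiting out the normal interval, packs
    // several attempts, and their four log lines each, into a few milliseconds.
    #include <chrono>
    #include <iostream>
    #include <thread>

    // Stand-in for the real heartbeat request; a refused connection on
    // localhost fails in well under a millisecond.
    bool tryHeartbeat() {
        return false;
    }

    int main() {
        using namespace std::chrono;
        const auto heartbeatInterval = seconds(2);  // default replica set heartbeat interval
        const auto start = steady_clock::now();

        // Assumed retry budget of 3, matching the three attempts visible in
        // the log excerpt above.
        for (int attempt = 1; attempt <= 3; ++attempt) {
            const auto elapsedMs =
                duration_cast<milliseconds>(steady_clock::now() - start).count();
            std::cout << "+" << elapsedMs << "ms attempt " << attempt << ": ";
            if (!tryHeartbeat()) {
                // Immediate reschedule: no sleep, so the next attempt (and its
                // Connecting / Failed / Dropping / heartbeat-error lines)
                // follows almost instantly.
                std::cout << "HostUnreachable, rescheduling immediately\n";
                continue;
            }
            std::cout << "ok, next heartbeat in "
                      << duration_cast<seconds>(heartbeatInterval).count() << "s\n";
            std::this_thread::sleep_for(heartbeatInterval);
        }
        return 0;
    }

Running this prints three failure lines within a fraction of a millisecond, mirroring the density of the ASIO/REPL_HB output above; if the failure path instead waited out the heartbeat interval (or applied a backoff), the same information would appear at most once per interval.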