...
Hi guys,

We recently had to redo the sync of some of our replica sets and found a reproducible way to make the initial sync fail. We have several collections to sync; the initial sync consistently failed right after the index build of our largest collection, when it tried to fetch the indexes for the next collection. In the end we had to fall back to rsync. Between normal-sized collections the sync worked, but not for this big one, and a later retry showed the same behaviour.

As the very narrow timestamps show, the error must lie in the network component. Our guess is that the timestamp of the previous call is older than the threshold set for this operation, and instead of resetting the timeout when the process actually makes the request, it reuses the now very old timestamp and therefore times out.

Here is a part of the log:

## rs0
2016-08-27T18:49:09.004+0000 I - [rsSync] Index: (2/3) BTree Bottom Up Progress: 619110400/631007639 98%
2016-08-27T18:49:16.218+0000 I INDEX [rsSync] done building bottom layer, going to commit
2016-08-27T18:49:17.341+0000 I INDEX [rsSync] build index done. scanned 631007639 total records. 16912 secs
2016-08-27T18:49:17.342+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.24:27017
2016-08-27T18:49:17.349+0000 E REPL [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.24:27017]
2016-08-27T18:49:17.349+0000 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
2016-08-27T18:49:22.349+0000 I REPL [rsSync] initial sync pending
2016-08-27T18:49:22.350+0000 I REPL [ReplicationExecutor] syncing from: MONGO-RS1-1:27017
2016-08-27T18:49:22.421+0000 I REPL [rsSync] initial sync drop all databases

## rs1, same (shorter excerpt)
2016-08-27T18:31:09.869+0000 I INDEX [rsSync] done building bottom layer, going to commit
2016-08-27T18:31:10.819+0000 I INDEX [rsSync] build index done. scanned 612324609 total records. 16342 secs
2016-08-27T18:31:10.819+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.43:27017
2016-08-27T18:31:10.830+0000 E REPL [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.43:27017]
2016-08-27T18:31:10.830+0000 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining

## rs2, same (shorter excerpt)
2016-08-27T17:18:19.030+0000 I INDEX [rsSync] done building bottom layer, going to commit
2016-08-27T17:18:20.245+0000 I INDEX [rsSync] build index done. scanned 523205629 total records. 40686 secs
2016-08-27T17:18:20.246+0000 I STORAGE [rsSync] copying indexes for: { name: "reports", options: {} }
2016-08-27T17:18:20.246+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.68:27017
2016-08-27T17:18:20.260+0000 E REPL [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.68:27017]
2016-08-27T17:18:20.260+0000 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
mrigal commented on Fri, 5 May 2017 08:34:03 +0000:
Hi @jason.carey, thanks for having pursued your investigation! We were a bit astonished that the issue had not been tackled at all before, but better late than never.
Regards, Matthieu

jason.carey commented on Thu, 4 May 2017 20:14:42 +0000:
MRigal, I believe that the behavior you saw was fixed in SERVER-28710. It applies a fix somewhat in line with my previous analysis, although the more pressing problem was probably a failure to handle EINTR correctly, rather than EWOULDBLOCK. It is slated for release with 3.2.14. Alternatively, if an upgrade is an option, the bug fixed by SERVER-28710 does not affect the current stable release.
Regards, Jason

jason.carey commented on Thu, 23 Mar 2017 18:06:39 +0000:
After some staring, I think this may actually be a bug in the networking code (specific, at this point, to 3.2 and earlier). While I haven't dug into the actual repl side of this, the log lines in question don't make a lot of sense. All of our log lines for timeouts should say:

Socket say send() remote timeout errno:110 Connection timed out 10.0.0.43:27017

instead of

Socket say send() errno:110 Connection timed out 10.0.0.43:27017

and those functions should throw SocketException::SEND_TIMEOUT instead of SEND_ERROR. From what I can tell, the cause is likely the sendmsg path in our Socket type. Its error handling for timeouts differs between the vectorized case and the non-vectorized case. Specifically:

// vectorized path
if (errno != EAGAIN || _timeout == 0)

instead of

// single buffer path
if ((mongo_errno == EAGAIN || mongo_errno == EWOULDBLOCK) && _timeout != 0)

Note that the second case specifically checks for EWOULDBLOCK, which is a valid return from a timeout generated by the SO_SNDTIMEO socket option. This doesn't matter in 3.4 and later, as that code path is no longer exercised (messages always use the single-buffer version). So it's possible that most users see EAGAIN during initial sync timeouts for vectorized messages, or that most users aren't generating vectorized messages during initial syncs (I haven't dug into which is more likely). But in either case, we have confirmation that there is a plausible code path that returns timeouts as socket errors up into the initial sync code, which I assume then fails outright where a different exception would have been retried.

thomas.schubert commented on Fri, 23 Sep 2016 06:14:36 +0000:
Hi MRigal, thank you for reporting this issue. We're investigating and will update this ticket when we know more.
Kind regards, Thomas

mrigal commented on Wed, 21 Sep 2016 14:59:29 +0000:
Additional comment: we also retried after modifying the TCP keepalive parameters, in both directions, without more success... https://docs.mongodb.com/manual/faq/diagnostics/#does-tcp-keepalive-time-affect-mongodb-deployments

mrigal commented on Wed, 21 Sep 2016 14:31:48 +0000:
This could (should?) be done prior to https://jira.mongodb.org/browse/SERVER-8076
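To make the difference Jason describes above concrete, here is a minimal, hypothetical C++ sketch (not the actual MongoDB Socket code; the names setSendTimeout and classifySendError are illustrative only) showing why a send timeout armed via SO_SNDTIMEO has to be tested against both EAGAIN and EWOULDBLOCK before it is reported as a hard error:

#include <cerrno>
#include <sys/socket.h>
#include <sys/time.h>

enum class SendResult { kTimeout, kHardError };

// Arm a send timeout on an already-connected socket descriptor, roughly
// analogous to a Socket type that has a non-zero _timeout configured.
bool setSendTimeout(int fd, int timeoutSecs) {
    timeval tv{};
    tv.tv_sec = timeoutSecs;
    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) == 0;
}

// Classify the errno left behind by a failed send()/sendmsg() call.
// timeoutSecs != 0 means a send timeout is armed on the socket.
SendResult classifySendError(int err, int timeoutSecs) {
    // POSIX allows a timed-out send to fail with either EAGAIN or
    // EWOULDBLOCK; on platforms where the two values differ, checking
    // EAGAIN alone (as in the vectorized path quoted above) would report
    // a timeout as a hard error rather than a retryable timeout.
    if ((err == EAGAIN || err == EWOULDBLOCK) && timeoutSecs != 0)
        return SendResult::kTimeout;
    return SendResult::kHardError;
}

Under this reading, the single-buffer path maps the condition to a SEND_TIMEOUT exception that initial sync can retry, while the vectorized path can surface the same condition as SEND_ERROR, matching the log lines in the description.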
Create a collection that is big enough (several hundred million entries) with enough indexes to keep the system busy for more than 3-4 hours, then run an initial sync; it will fail.