...
Hi guys,

We recently had to redo the sync of some of our replica sets and found a reproducible way to make the initial sync fail. We have several collections to sync; the initial sync consistently failed right after the index build of our largest collection, when it tried to fetch the indexes for the next collection. In the end we had to fall back to rsync. Between normal-sized collections the sync worked, but not for this big one, and a later retry showed the same behaviour.

As the very narrow timestamps show, the error must lie in the network component. Our guess is that the timestamp of the previous call is older than the threshold set for this operation, and instead of resetting the timeout when the process actually makes the request, it reuses the now very old timestamp and therefore times out.

Here is a part of the log:

## rs0
2016-08-27T18:49:09.004+0000 I - [rsSync] Index: (2/3) BTree Bottom Up Progress: 619110400/631007639 98%
2016-08-27T18:49:16.218+0000 I INDEX [rsSync] done building bottom layer, going to commit
2016-08-27T18:49:17.341+0000 I INDEX [rsSync] build index done. scanned 631007639 total records. 16912 secs
2016-08-27T18:49:17.342+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.24:27017
2016-08-27T18:49:17.349+0000 E REPL [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.24:27017]
2016-08-27T18:49:17.349+0000 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
2016-08-27T18:49:22.349+0000 I REPL [rsSync] initial sync pending
2016-08-27T18:49:22.350+0000 I REPL [ReplicationExecutor] syncing from: MONGO-RS1-1:27017
2016-08-27T18:49:22.421+0000 I REPL [rsSync] initial sync drop all databases

## rs1, same (shorter excerpt)
2016-08-27T18:31:09.869+0000 I INDEX [rsSync] done building bottom layer, going to commit
2016-08-27T18:31:10.819+0000 I INDEX [rsSync] build index done. scanned 612324609 total records. 16342 secs
2016-08-27T18:31:10.819+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.43:27017
2016-08-27T18:31:10.830+0000 E REPL [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.43:27017]
2016-08-27T18:31:10.830+0000 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining

## rs2, same (shorter excerpt)
2016-08-27T17:18:19.030+0000 I INDEX [rsSync] done building bottom layer, going to commit
2016-08-27T17:18:20.245+0000 I INDEX [rsSync] build index done. scanned 523205629 total records. 40686 secs
2016-08-27T17:18:20.246+0000 I STORAGE [rsSync] copying indexes for: { name: "reports", options: {} }
2016-08-27T17:18:20.246+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.68:27017
2016-08-27T17:18:20.260+0000 E REPL [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.68:27017]
2016-08-27T17:18:20.260+0000 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
mrigal commented on Fri, 5 May 2017 08:34:03 +0000:
Hi @jason.carey, thanks for having pursued your investigation! We were a bit astonished that the issue had not been tackled at all before, but better late than never.
Regards, Matthieu

jason.carey commented on Thu, 4 May 2017 20:14:42 +0000:
MRigal, I believe that the behavior you saw was fixed in SERVER-28710. It applies a fix somewhat in line with my previous analysis, although the more pressing problem was probably a failure to handle EINTR correctly, rather than EWOULDBLOCK. It is slated for release with 3.2.14. Alternatively, if an upgrade is an option, the bug fixed by SERVER-28710 does not affect the current stable release.
Regards, Jason

jason.carey commented on Thu, 23 Mar 2017 18:06:39 +0000:
After some staring, I think this may actually be a bug in the networking code (specific, at this point, to 3.2 and earlier). While I haven't dug into the actual repl side of this, the log lines in question don't make a lot of sense. All of our log lines for timeouts should say:

Socket say send() remote timeout errno:110 Connection timed out 10.0.0.43:27017

instead of

Socket say send() errno:110 Connection timed out 10.0.0.43:27017

and those functions should throw SocketException::SEND_TIMEOUT instead of SEND_ERROR. From what I can tell, the cause is likely the sendmsg path in our Socket type. Its error handling for timeouts differs between the vectorized case and the non-vectorized case. Specifically:

// vectorized path
if (errno != EAGAIN || _timeout == 0)

instead of

// single buffer path
if ((mongo_errno == EAGAIN || mongo_errno == EWOULDBLOCK) && _timeout != 0)

Note that the second case specifically checks for EWOULDBLOCK, which is a valid return from a timeout generated by the SO_SNDTIMEO socket option. This doesn't matter in 3.4 and later, as that code path is no longer exercised (messages always use the single-buffer version). So it's possible that most users see EAGAIN during initial sync timeouts for vectorized messages, or that most users aren't generating vectorized messages during initial syncs (I haven't dug into which is more likely). But in either case, we have confirmation that there is a plausible code path that returns timeouts as socket errors up into the initial sync code, which I assume then fails outright where a different exception would have been retried.

thomas.schubert commented on Fri, 23 Sep 2016 06:14:36 +0000:
Hi MRigal, thank you for reporting this issue. We're investigating and will update this ticket when we know more.
Kind regards, Thomas

mrigal commented on Wed, 21 Sep 2016 14:59:29 +0000:
Additional comment: we also retried after modifying the TCP keepalive parameters, in both directions, without more success... https://docs.mongodb.com/manual/faq/diagnostics/#does-tcp-keepalive-time-affect-mongodb-deployments

mrigal commented on Wed, 21 Sep 2016 14:31:48 +0000:
This could (should?) be done prior to https://jira.mongodb.org/browse/SERVER-8076
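To make the difference Jason describes above concrete, here is a minimal, hypothetical C++ sketch (not the actual MongoDB Socket code; the names setSendTimeout and classifySendError are illustrative only) showing why a send timeout armed via SO_SNDTIMEO has to be tested against both EAGAIN and EWOULDBLOCK before it is reported as a hard error:

#include <cerrno>
#include <sys/socket.h>
#include <sys/time.h>

enum class SendResult { kTimeout, kHardError };

// Arm a send timeout on an already-connected socket descriptor, roughly
// analogous to a Socket type that has a non-zero _timeout configured.
bool setSendTimeout(int fd, int timeoutSecs) {
    timeval tv{};
    tv.tv_sec = timeoutSecs;
    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) == 0;
}

// Classify the errno left behind by a failed send()/sendmsg() call.
// timeoutSecs != 0 means a send timeout is armed on the socket.
SendResult classifySendError(int err, int timeoutSecs) {
    // POSIX allows a timed-out send to fail with either EAGAIN or
    // EWOULDBLOCK; on platforms where the two values differ, checking
    // EAGAIN alone (as in the vectorized path quoted above) would report
    // a timeout as a hard error rather than a retryable timeout.
    if ((err == EAGAIN || err == EWOULDBLOCK) && timeoutSecs != 0)
        return SendResult::kTimeout;
    return SendResult::kHardError;
}

Under this reading, the single-buffer path maps the condition to a SEND_TIMEOUT exception that initial sync can retry, while the vectorized path can surface the same condition as SEND_ERROR, matching the log lines in the description.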
Create a collection that is big enough (several hundred million entries) with enough indexes to keep the system busy for more than 3-4 hours, then run an initial sync; it will fail.