...
We have been running a single instance of Mongo for a number of years now with our application without any issues at all. We have now set up a replica set with 3 members: 1 primary, 1 secondary and an arbiter. After a couple of hours our application simply stops working with the database i.e. it hangs. Below is an extract from the database log file from the primary server. In this log the activity seems to stop at 14:09:15 with our application (running on 192.168.45.105). Address 192.168.45.106 is the secondary server. All we get after that is successful authentication and nothing else. If we restart our application it stops on the first database call. If Mongo is restarted on the primary and secondary servers and then our application restarted it works again for another couple of hours and the cycle repeats. 2018-11-09T14:09:14.973+0000 I COMMAND [conn254] command map.$cmd command: update \{ update: "items", writeConcern: { w: 1 }, ordered: true, $db: "map" } numYields:0 reslen:119 locks:\{ Global: { acquireCount: { r: 3, w: 3 } }, Database: \{ acquireCount: { w: 3 } }, Collection: \{ acquireCount: { w: 2 } }, oplog: \{ acquireCount: { w: 1 } } } protocol:op_query 199ms 2018-11-09T14:09:14.991+0000 I COMMAND [conn260] command map.data command: find \{ find: "data", filter: { _id: ObjectId('5afdb5c1238164595c201b71') }, limit: 1, singleBatch: true, $db: "map" } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:7697 locks:\{ Global: { acquireCount: { r: 4 } }, Database: \{ acquireCount: { r: 2 } }, Collection: \{ acquireCount: { r: 2 } } } protocol:op_query 116ms 2018-11-09T14:09:15.002+0000 I COMMAND [conn269] command map.data command: find \{ find: "data", filter: { FName: "aw pesticide\rainfall\geometry types\subcatchment\cat ditch\projected grid" }, sort: \{ _id: -1 }, skip: 0, limit: 1, noCursorTimeout: false, $db: "map" } planSummary: IXSCAN \{ FName: 1 } keysExamined:1 docsExamined:1 hasSortStage:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:2688 locks:\{ Global: { acquireCount: { r: 4 } }, Database: \{ acquireCount: { r: 2 } }, Collection: \{ acquireCount: { r: 2 } } } protocol:op_query 126ms 2018-11-09T14:09:24.936+0000 I NETWORK [listener] connection accepted from [192.168.45.106:49579|http://192.168.45.106:49579/] #328 (70 connections now open) 2018-11-09T14:09:24.937+0000 I NETWORK [conn328] received client metadata from [192.168.45.106:49579|http://192.168.45.106:49579/] conn: \{ driver: { name: "NetworkInterfaceASIO-RS", version: "3.6.2" }, os: \{ type: "Windows", name: "Microsoft Windows Server 2012 R2", architecture: "x86_64", version: "6.3 (build 9600)" } } 2018-11-09T14:09:24.943+0000 I ACCESS [conn328] Successfully authenticated as principal __system on local 2018-11-09T14:09:33.290+0000 I NETWORK [listener] connection accepted from [192.168.45.105:60568|http://192.168.45.105:60568/] #329 (71 connections now open) 2018-11-09T14:09:33.292+0000 I NETWORK [conn329] received client metadata from [192.168.45.105:60568|http://192.168.45.105:60568/] conn: \{ driver: { name: "mongo-csharp-driver", version: "2.4.4.0" }, os: \{ type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" } 2018-11-09T14:09:33.428+0000 I ACCESS [conn329] Successfully authenticated as principal menadmin on admin 2018-11-09T14:10:01.028+0000 I NETWORK [listener] connection accepted from [192.168.45.105:60570|http://192.168.45.105:60570/] #330 (72 connections now open) 2018-11-09T14:10:01.029+0000 I NETWORK [conn330] received client metadata from [192.168.45.105:60570|http://192.168.45.105:60570/] conn: \{ driver: { name: "mongo-csharp-driver", version: "2.4.4.0" }, os: \{ type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" } 2018-11-09T14:10:01.201+0000 I ACCESS [conn330] Successfully authenticated as principal menadmin on admin 2018-11-09T14:10:09.791+0000 I NETWORK [listener] connection accepted from [192.168.45.105:60571|http://192.168.45.105:60571/] #331 (73 connections now open) 2018-11-09T14:10:09.793+0000 I NETWORK [conn331] received client metadata from [192.168.45.105:60571|http://192.168.45.105:60571/] conn: \{ driver: { name: "mongo-csharp-driver", version: "2.4.4.0" }, os: \{ type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" } 2018-11-09T14:10:09.922+0000 I ACCESS [conn331] Successfully authenticated as principal menadmin on admin 2018-11-09T14:10:29.935+0000 I NETWORK [listener] connection accepted from [192.168.45.106:49580|http://192.168.45.106:49580/] #332 (74 connections now open) 2018-11-09T14:10:29.936+0000 I NETWORK [conn332] received client metadata from [192.168.45.106:49580|http://192.168.45.106:49580/] conn: \{ driver: { name: "NetworkInterfaceASIO-RS", version: "3.6.2" }, os: \{ type: "Windows", name: "Microsoft Windows Server 2012 R2", architecture: "x86_64", version: "6.3 (build 9600)" } } 2018-11-09T14:10:29.941+0000 I ACCESS [conn332] Successfully authenticated as principal __system on local 2018-11-09T14:11:34.935+0000 I NETWORK [listener] connection accepted from [192.168.45.106:49581|http://192.168.45.106:49581/] #333 (75 connections now open) 2018-11-09T14:11:34.936+0000 I NETWORK [conn333] received client metadata from [192.168.45.106:49581|http://192.168.45.106:49581/] conn: \{ driver: { name: "NetworkInterfaceASIO-RS", version: "3.6.2" }, os: \{ type: "Windows", name: "Microsoft Windows Server 2012 R2", architecture: "x86_64", version: "6.3 (build 9600)" } } 2018-11-09T14:11:34.942+0000 I ACCESS [conn333] Successfully authenticated as principal __system on local 2018-11-09T14:11:51.225+0000 I NETWORK [listener] connection accepted from [192.168.45.105:60577|http://192.168.45.105:60577/] #334 (76 connections now open) 2018-11-09T14:11:51.226+0000 I NETWORK [conn334] received client metadata from [192.168.45.105:60577|http://192.168.45.105:60577/] conn: \{ driver: { name: "mongo-csharp-driver", version: "2.4.4.0" }, os: \{ type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" } 2018-11-09T14:11:51.368+0000 I ACCESS [conn334] Successfully authenticated as principal menadmin on admin 2018-11-09T14:39:30.813+0000 I NETWORK [listener] connection accepted from [192.168.45.105:60671|http://192.168.45.105:60671/] #335 (77 connections now open) 2018-11-09T14:39:30.814+0000 I NETWORK [conn335] received client metadata from [192.168.45.105:60671|http://192.168.45.105:60671/] conn: \{ driver: { name: "mongo-csharp-driver", version: "2.4.4.0" }, os: \{ type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" } 2018-11-09T14:39:30.951+0000 I ACCESS [conn335] Successfully authenticated as principal menadmin on admin 2018-11-09T16:13:03.502+0000 I CONTROL [serviceShutdown] got SERVICE_CONTROL_STOP request from Windows Service Control Manager, will terminate after current cmd ends
sputink1969 commented on Mon, 12 Nov 2018 21:09:16 +0000: HI Danny, This is the latest log statement: 2018-11-12T20:51:10.124+0000 I - [repl writer worker 9] map.childdata collection clone progress: 576093/14165331 4% (documents copied) So at this rate it is going to take more than 70 hours to synchronise which is way too long - something has to be very wrong. I have attached the diagnostic data from the last couple of days. Any help would be appreciated. Thanks Ian metrics.zip daniel.hatcher commented on Mon, 12 Nov 2018 20:30:24 +0000: Hello Ian, If you wish, you can provide us with the contents of your diagnostic.data folder and we can take a look at some of the internal metrics. Thank you, Danny daniel.hatcher commented on Mon, 12 Nov 2018 20:22:46 +0000: Hello Ian, You can attempt to disable read concern "majority" in your cluster by setting --enableMajorityReadConcern to false. This may reduce cache pressure on the Primary until the Secondary is back online. You may also want to investigate increasing the hardware available to the machine and it may be resource starved. From your responses, I do not see anything to indicate a bug in the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources. Thank you, Danny sputink1969 commented on Mon, 12 Nov 2018 17:33:57 +0000: Just got another message in the logs: 2018-11-12T17:23:33.172+0000 I - [repl writer worker 10] map.childdata collection clone progress: 29975/14165331 0% (documents copied) sputink1969 commented on Mon, 12 Nov 2018 17:25:35 +0000: Hi Danny, I have just tried to set up everything again. Attached is the mongo.log file from the secondary server. You will see this line: 2018-11-12T17:02:27.196+0000 I - [repl writer worker 10] map.childdata collection clone progress: 14064/14165331 0% (documents copied) and there are no subsequent lines with more documents copied. Any ideas why this is the case? I think that this could be the cause of my issue. Thanks Ian mongo.log sputink1969 commented on Mon, 12 Nov 2018 16:28:12 +0000: Hi Danny, Let's take one step back. Basically mongodump of this collection takes two days which implies there is an issue? The initial sync is cloning the collection and is dog slow. So I guess that the first question is why is mongodump (which is doing a similar job to clone) taking so long? Working out this issue with solve the other one I think! Thanks Ian daniel.hatcher commented on Mon, 12 Nov 2018 16:07:55 +0000: Hello Ian, If you believe that MongoDB's initial sync is too slow, you can seed the node with data files from another node through something like rsync. However, the source process needs to be stopped in order to ensure that data files are not changed during the process. It will be hard to troubleshoot either the hanging application or the collection dump slowness until you have a healthy replica set. Thank you, Danny sputink1969 commented on Mon, 12 Nov 2018 15:51:51 +0000: Hi Danny, I understand that it needs to do an initial sync but the initial sync is taking a very long time i.e. many many hours. What is interesting is the mongodump takes over 2 days to backup this collection and the database is not that big (approx 300GB). Is there a way that I can speed up the initial sync? Please note that I was having this issue before and this is the first time that I have seen an initial sync. Thanks Ian daniel.hatcher commented on Mon, 12 Nov 2018 15:28:58 +0000: Hello Ian, By default, MongoDB will only log operations that take longer than 100ms. If you have operations running below that threshold, you would not see them in the log. However as you pointed out, this could also be evidence of a stall of some kind. It appears that your Secondary is currently in the midst of an initial sync. Until the sync is done, it will be unable to respond to read and write queries. If your application is using a write concern greater than 1, a read concern of "majority" or "linearizable", or a read preference of "secondary", the application will be waiting for the initial sync to finish before it can continue its operations. Ideally, you would replace the arbiter with another data node (when this second is finished cloning). That way, even if one data node goes down, you can still satisfy the requirements of any of the above items I mentioned. When the initial sync finishes on the Secondary, do you still experience issues with your application? Thank you, Danny sputink1969 commented on Mon, 12 Nov 2018 14:04:47 +0000: What is interesting is on the secondary node when it currently starts up it is cloning a collection which is terribly slow: 2018-11-12T13:29:55.326+0000 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.45.107:27057 2018-11-12T13:29:55.333+0000 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.45.107:27057, took 7ms (3 connections now open to 192.168.45.107:27057) 2018-11-12T13:30:55.333+0000 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.45.107:27057 because the pool meets constraints; 2 connections to that host remain open 2018-11-12T13:31:45.279+0000 I - [repl writer worker 10] map.childdata collection clone progress: 41966/13942644 0% (documents copied) 2018-11-12T13:36:02.449+0000 I - [repl writer worker 10] map.childdata collection clone progress: 55261/13942644 0% (documents copied) 2018-11-12T13:40:51.232+0000 I - [repl writer worker 10] map.childdata collection clone progress: 68668/13942644 0% (documents copied) 2018-11-12T13:44:38.874+0000 I NETWORK [listener] connection accepted from 192.168.45.105:54500 #4 (3 connections now open) 2018-11-12T13:44:38.995+0000 I NETWORK [conn4] received client metadata from 192.168.45.105:54500 conn: { driver: { name: "mongo-csharp-driver", version: "2.4.4.0" } , os: { type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" } 2018-11-12T13:44:39.609+0000 I ACCESS [conn4] Successfully authenticated as principal menadmin on admin 2018-11-12T13:45:15.704+0000 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.45.107:27057 2018-11-12T13:45:15.712+0000 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.45.107:27057, took 8ms (3 connections now open to 192.168.45.107:27057) 2018-11-12T13:46:15.712+0000 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.45.107:27057 because the pool meets constraints; 2 connections to that host remain open 2018-11-12T13:46:42.251+0000 I - [repl writer worker 10] map.childdata collection clone progress: 88246/13942644 0% (documents copied) 2018-11-12T13:48:54.430+0000 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.45.107:27057 2018-11-12T13:48:54.439+0000 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.45.107:27057, took 10ms (3 connections now open to 192.168.45.107:27057) 2018-11-12T13:50:07.260+0000 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.45.107:27057 because the pool meets constraints; 2 connections to that host remain open 2018-11-12T13:51:10.199+0000 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.45.107:27057 2018-11-12T13:51:10.205+0000 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.45.107:27057, took 6ms (3 connections now open to 192.168.45.107:27057) 2018-11-12T13:53:05.176+0000 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.45.107:27057 because the pool meets constraints; 2 connections to that host remain open 2018-11-12T13:53:57.616+0000 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.45.107:27057 2018-11-12T13:53:57.623+0000 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.45.107:27057, took 8ms (3 connections now open to 192.168.45.107:27057) We also noticed that Mongodump is extremely slow on this collection. Could it be related?