...
Primary instance of the replica set is down and unable to restart. Log from the restart attempt:

2021-06-04T11:27:54.148+0800 I CONTROL [main] ***** SERVER RESTARTED *****
2021-06-04T11:27:54.157+0800 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] MongoDB starting : pid=46354 port=27077 dbpath=/data3/pretreatment/dbdata 64-bit host=wqdcsrv414.cn.prod
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] db version v4.0.13
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] git version: bda366f0b0e432ca143bc41da54d8732bd8d03c0
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] allocator: tcmalloc
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] modules: none
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] build environment:
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten]     distmod: rhel70
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten]     distarch: x86_64
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten]     target_arch: x86_64
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] options: { config: "/etc/mongodpretreatment.cnf", net: { bindIp: "0.0.0.0", maxIncomingConnections: 10000, port: 27077, serviceExecutor: "adaptive", wireObjectCheck: true }, processManagement: { fork: true, pidFilePath: "/data3/pretreatment/pid/pretreatment.pid" }, replication: { enableMajorityReadConcern: false, oplogSizeMB: 819200, replSetName: "pretreatment" }, security: { authorization: "enabled", keyFile: "/data3/pretreatment/keyfile/securitypretreatment" }, storage: { dbPath: "/data3/pretreatment/dbdata", directoryPerDB: true, engine: "wiredTiger", journal: { commitIntervalMs: 50, enabled: true }, wiredTiger: { engineConfig: { cacheSizeGB: 60.0, directoryForIndexes: true } } }, systemLog: { destination: "file", logAppend: true, logRotate: "rename", path: "/logs/pretreatment/pretreatment.log", quiet: false } }
2021-06-04T11:27:54.180+0800 W STORAGE [initandlisten] Detected unclean shutdown - /data3/pretreatment/dbdata/mongod.lock is not empty.
2021-06-04T11:27:54.181+0800 W STORAGE [initandlisten] Recovering data from the last clean checkpoint.
2021-06-04T11:27:54.181+0800 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=61440M,cache_overflow=(file_max=0M),session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),
2021-06-04T11:27:54.839+0800 I STORAGE [initandlisten] WiredTiger message [1622777274:839167][46354:0x7f2051264b80], txn-recover: Main recovery loop: starting at 547399/94917248 to 547400/256
2021-06-04T11:27:54.839+0800 I STORAGE [initandlisten] WiredTiger message [1622777274:839519][46354:0x7f2051264b80], txn-recover: Recovering log 547399 through 547400
2021-06-04T11:27:54.860+0800 I STORAGE [initandlisten] WiredTiger message [1622777274:860444][46354:0x7f2051264b80], file:local/collection/8-4992545715697342319.wt, txn-recover: Recovering log 547400 through 547400
2021-06-04T11:27:54.933+0800 I STORAGE [initandlisten] WiredTiger message [1622777274:933146][46354:0x7f2051264b80], file:local/collection/8-4992545715697342319.wt, txn-recover: Set global recovery timestamp: 0
2021-06-04T11:31:48.976+0800 I STORAGE [initandlisten] Placing a marker at optime Jun 4 10:51:59:4315
2021-06-04T11:31:49.069+0800 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/data3/pretreatment/dbdata/diagnostic.data'
2021-06-04T11:31:49.074+0800 I REPL [initandlisten] Rollback ID is 3
2021-06-04T11:31:49.078+0800 I REPL [initandlisten] Recovering from an unstable checkpoint (top of oplog: { ts: Timestamp(1622775163, 2189), t: 35 }, appliedThrough: { ts: Timestamp(1622775163, 2189), t: 35 })
2021-06-04T11:31:49.078+0800 I REPL [initandlisten] Starting recovery oplog application at the appliedThrough: { ts: Timestamp(1622775163, 2189), t: 35 }, through the top of the oplog: { ts: Timestamp(1622775163, 2189), t: 35 }
2021-06-04T11:31:49.078+0800 I REPL [initandlisten] No oplog entries to apply for recovery. Start point is at the top of the oplog.
2021-06-04T11:31:49.115+0800 I CONTROL [LogicalSessionCacheRefresh] Sessions collection is not set up; waiting until next sessions refresh interval: Replication has not yet been configured
2021-06-04T11:31:49.115+0800 I EXECUTOR [initandlisten] No thread count configured for executor. Using number of cores / 2: 28
2021-06-04T11:31:49.115+0800 I EXECUTOR [worker-1] Started new database worker thread 1
2021-06-04T11:31:49.115+0800 I EXECUTOR [worker-2] Started new database worker thread 2
2021-06-04T11:31:49.115+0800 I CONTROL [LogicalSessionCacheReap] Failed to reap transaction table: NotYetInitialized: Replication has not yet been configured
...
2021-06-04T11:31:50.584+0800 I REPL [rsBackgroundSync] sync source candidate: 10.67.12.3:27077
2021-06-04T11:31:50.584+0800 I ASIO [RS] Connecting to 10.67.12.3:27077
2021-06-04T11:31:50.588+0800 I REPL [rsBackgroundSync] Changed sync source from empty to 10.67.12.3:27077
2021-06-04T11:31:50.589+0800 I ASIO [RS] Connecting to 10.67.12.3:27077
2021-06-04T11:31:50.693+0800 I NETWORK [listener] connection accepted from 10.67.12.2:34152 #26 (13 connections now open)
2021-06-04T11:31:50.694+0800 I NETWORK [conn26] received client metadata from 10.67.12.2:34152 conn26: { driver: { name: "NetworkInterfaceTL", version: "4.0.13" }, os: { type: "Linux", name: "Oracle Linux Server release 7.8", architecture: "x86_64", version: "Kernel 4.14.35-1902.301.1.el7uek.x86_64" } }
2021-06-04T11:31:50.701+0800 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1622775163, 2189), t: 35 }. source's GTE: { ts: Timestamp(1622775638, 1), t: 36 } hashes: (3943340773125100714/-6300946020427835533)
2021-06-04T11:31:50.702+0800 I REPL [rsBackgroundSync] Replication commit point: { ts: Timestamp(0, 0), t: -1 }
2021-06-04T11:31:50.702+0800 I REPL [rsBackgroundSync] Rollback using the 'rollbackViaRefetch' method.
2021-06-04T11:31:50.702+0800 I REPL [rsBackgroundSync] Canceling priority takeover callback
2021-06-04T11:31:50.702+0800 I REPL [rsBackgroundSync] transition to ROLLBACK from SECONDARY
2021-06-04T11:31:50.702+0800 I NETWORK [rsBackgroundSync] Skip closing connection for connection # 26
2021-06-04T11:31:50.702+0800 I NETWORK [rsBackgroundSync] Skip closing connection for connection # 18
2021-06-04T11:31:50.702+0800 I NETWORK [conn17] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 127.0.0.1:38140 (connection id: 17)
2021-06-04T11:31:50.702+0800 I NETWORK [conn16] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 127.0.0.1:38138 (connection id: 16)
2021-06-04T11:31:50.702+0800 I NETWORK [conn14] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.12.103:43794 (connection id: 14)
2021-06-04T11:31:50.702+0800 I NETWORK [rsBackgroundSync] Skip closing connection for connection # 12
2021-06-04T11:31:50.702+0800 I NETWORK [conn13] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.12.103:43792 (connection id: 13)
2021-06-04T11:31:50.702+0800 I NETWORK [conn11] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.12.103:43790 (connection id: 11)
2021-06-04T11:31:50.702+0800 I NETWORK [conn9] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.12.103:43788 (connection id: 9)
2021-06-04T11:31:50.702+0800 I NETWORK [conn7] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.12.103:43780 (connection id: 7)
...
2021-06-04T11:39:43.965+0800 I ROLLBACK [rsBackgroundSync] Rollback finished. The final minValid is: { ts: Timestamp(1622532457, 3), t: 22 }
2021-06-04T11:39:43.965+0800 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back.
2021-06-04T11:39:43.965+0800 F - [rsBackgroundSync] Fatal Assertion 40507 at src/mongo/db/repl/rs_rollback.cpp 1542
2021-06-04T11:39:43.965+0800 F - [rsBackgroundSync] ***aborting after fassert() failure
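The fassert above means rollbackViaRefetch found more data to undo than it is willing to roll back, so this node cannot rejoin the set on its own; as confirmed later in this thread, the recovery path is a full resync (initial sync) from an unaffected member: stop the failed mongod, empty its dbPath, and restart it with the same configuration. A minimal monitoring sketch from the mongo shell, run against any healthy member (only the replica set name here comes from this ticket; the check itself is standard):

// While initial sync runs, the wiped node should report STARTUP2,
// then transition to SECONDARY once it has caught up to the primary.
rs.status().members.forEach(function (m) {
    print(m.name + "  state=" + m.stateStr +
          "  optime=" + (m.optimeDate ? m.optimeDate : "n/a"));
});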
JIRAUSER1260348 commented on Thu, 10 Jun 2021 01:30:36 +0000: Yup, you are right, Edwin, I must have checked the wrong config file yesterday. Thanks again for your analysis and advice; we will follow your recommendations and see if the same issue occurs again. The ticket can be closed. Thanks, YS

JIRAUSER1257066 commented on Wed, 9 Jun 2021 16:37:51 +0000: Hi ys.mao@homecreditcfc.cn, You're correct that net.wireObjectCheck is set to true by default. Given that, you may leave this parameter as it is. As for net.serviceExecutor: adaptive, this setting does appear in the configuration file you posted above:

net:
  port: 27077
  maxIncomingConnections: 10000
  wireObjectCheck: true
  bindIp: 0.0.0.0
  serviceExecutor: adaptive    <-- this line here

Best, Edwin

JIRAUSER1260348 commented on Wed, 9 Jun 2021 02:49:52 +0000: Thanks, Edwin, for your detailed analysis and advice. The issue has not reproduced yet, and we plan to upgrade to 4.2 soon and to 4.4 later. I have a question regarding these recommendations:

  Remove net.serviceExecutor: adaptive. We believe this is possibly the culprit for the primary node failover. This setting was introduced to alleviate issues with large numbers of connections. It has since been deprecated in MongoDB v4.4.
  Remove net.wireObjectCheck.

Regarding net.serviceExecutor: adaptive and net.wireObjectCheck: we didn't set the first parameter in our configuration file, and for the second one the setting is the same as the default value. Config file:

net:
  port: 28010
  maxIncomingConnections: 5000
  wireObjectCheck: true
  bindIp: 0.0.0.0

Here is the explanation of wireObjectCheck from the MongoDB online documentation:

  net.wireObjectCheck
  Type: boolean
  Default: true
  When true, the mongod or mongos instance validates all requests from clients upon receipt to prevent clients from inserting malformed or invalid BSON into a MongoDB database. For objects with a high degree of sub-document nesting, net.wireObjectCheck can have a small impact on performance.

So, should we remove this parameter (net.wireObjectCheck) from the config file, or should we change the setting from true to false? And for the first one, net.serviceExecutor: adaptive, how can we remove it if we didn't configure it? Below is the full parameter list from the getParameter command.
pretreatment:SECONDARY> db.adminCommand( { getParameter : '*' } ) { "AsyncRequestsSenderUseBaton" : true, "KeysRotationIntervalSec" : 7776000, "ShardingTaskExecutorPoolHostTimeoutMS" : 300000, "ShardingTaskExecutorPoolMaxConnecting" : 2, "ShardingTaskExecutorPoolMaxSize" : -1, "ShardingTaskExecutorPoolMinSize" : 1, "ShardingTaskExecutorPoolRefreshRequirementMS" : 60000, "ShardingTaskExecutorPoolRefreshTimeoutMS" : 20000, "TransactionRecordMinimumLifetimeMinutes" : 30, "adaptiveServiceExecutorIdlePctThreshold" : 60, "adaptiveServiceExecutorMaxQueueLatencyMicros" : 500, "adaptiveServiceExecutorRecursionLimit" : 8, "adaptiveServiceExecutorReservedThreads" : 28, "adaptiveServiceExecutorRunTimeJitterMillis" : 500, "adaptiveServiceExecutorRunTimeMillis" : 5000, "adaptiveServiceExecutorStuckThreadTimeoutMillis" : 250, "allowRolesFromX509Certificates" : true, "allowSecondaryReadsDuringBatchApplication" : true, "authSchemaVersion" : 5, "authenticationMechanisms" : [ "MONGODB-X509", "SCRAM-SHA-1", "SCRAM-SHA-256" ], "bgSyncOplogFetcherBatchSize" : 13981010, "clientCursorMonitorFrequencySecs" : 4, "cloudFreeMonitoringEndpointURL" : "https://cloud.mongodb.com/freemonitoring/mongo", "clusterAuthMode" : "keyFile", "collectionClonerBatchSize" : -1, "connPoolMaxConnsPerHost" : 200, "connPoolMaxInUseConnsPerHost" : 2147483647, "connPoolMaxShardedConnsPerHost" : 200, "connPoolMaxShardedInUseConnsPerHost" : 2147483647, "createRollbackDataFiles" : true, "createTimestampSafeUniqueIndex" : false, "cursorTimeoutMillis" : NumberLong(600000), "debugCollectionUUIDs" : false, "diagnosticDataCollectionDirectorySizeMB" : 200, "diagnosticDataCollectionEnabled" : true, "diagnosticDataCollectionFileSizeMB" : 10, "diagnosticDataCollectionPeriodMillis" : 1000, "diagnosticDataCollectionSamplesPerChunk" : 300, "diagnosticDataCollectionSamplesPerInterimUpdate" : 10, "disableJavaScriptJIT" : true, "disableLogicalSessionCacheRefresh" : false, "disableNonSSLConnectionLogging" : false, "disabledSecureAllocatorDomains" : [ ], "enableElectionHandoff" : true, "enableInMemoryTransactions" : false, "enableLocalhostAuthBypass" : true, "enableTestCommands" : false, "failIndexKeyTooLong" : true, "featureCompatibilityVersion" : { "version" : "4.0" } , "forceRollbackViaRefetch" : false, "globalConnPoolIdleTimeoutMinutes" : 2147483647, "heapProfilingEnabled" : false, "heapProfilingSampleIntervalBytes" : NumberLong(262144), "honorSystemUmask" : false, "initialSyncOplogBuffer" : "collection", "initialSyncOplogBufferPeekCacheSize" : 10000, "initialSyncOplogFetcherBatchSize" : 13981010, "internalDocumentSourceCursorBatchSizeBytes" : 4194304, "internalDocumentSourceLookupCacheSizeBytes" : 104857600, "internalGeoNearQuery2DMaxCoveringCells" : 16, "internalGeoPredicateQuery2DMaxCoveringCells" : 16, "internalInsertMaxBatchSize" : 64, "internalLookupStageIntermediateDocumentMaxSizeBytes" : NumberLong(104857600), "internalProhibitShardOperationRetry" : false, "internalQueryAlwaysMergeOnPrimaryShard" : false, "internalQueryCacheEvictionRatio" : 10, "internalQueryCacheFeedbacksStored" : 20, "internalQueryCacheSize" : 5000, "internalQueryEnumerationMaxIntersectPerAnd" : 3, "internalQueryEnumerationMaxOrSolutions" : 10, "internalQueryExecMaxBlockingSortBytes" : 33554432, "internalQueryExecYieldIterations" : 128, "internalQueryExecYieldPeriodMS" : 10, "internalQueryFacetBufferSizeBytes" : 104857600, "internalQueryForceIntersectionPlans" : false, "internalQueryIgnoreUnknownJSONSchemaKeywords" : false, "internalQueryMaxScansToExplode" : 200, 
"internalQueryPlanEvaluationCollFraction" : 0.3, "internalQueryPlanEvaluationMaxResults" : 101, "internalQueryPlanEvaluationWorks" : 10000, "internalQueryPlanOrChildrenIndependently" : true, "internalQueryPlannerEnableHashIntersection" : false, "internalQueryPlannerEnableIndexIntersection" : true, "internalQueryPlannerGenerateCoveredWholeIndexScans" : false, "internalQueryPlannerMaxIndexedSolutions" : 64, "internalQueryProhibitBlockingMergeOnMongoS" : false, "internalQueryProhibitMergingOnMongoS" : false, "internalQueryS2GeoCoarsestLevel" : 0, "internalQueryS2GeoFinestLevel" : 23, "internalQueryS2GeoMaxCells" : 20, "internalValidateFeaturesAsMaster" : true, "javascriptProtection" : false, "journalCommitInterval" : 50, "jsHeapLimitMB" : 1100, "localLogicalSessionTimeoutMinutes" : 30, "logComponentVerbosity" : { "verbosity" : 0, "accessControl" : { "verbosity" : -1 } , "command" : { "verbosity" : -1 } , "control" : { "verbosity" : -1 } , "executor" : { "verbosity" : -1 } , "geo" : { "verbosity" : -1 } , "index" : { "verbosity" : -1 } , "network" : { "verbosity" : -1, "asio" : { "verbosity" : -1 } , "bridge" : { "verbosity" : -1 } , "connectionPool" : { "verbosity" : -1 } }, "query" : { "verbosity" : -1 } , "replication" : { "verbosity" : -1, "heartbeats" : { "verbosity" : -1 } , "rollback" : { "verbosity" : -1 } }, "sharding" : { "verbosity" : -1, "shardingCatalogRefresh" : { "verbosity" : -1 } }, "storage" : { "verbosity" : -1, "recovery" : { "verbosity" : -1 } , "journal" : { "verbosity" : -1 } }, "write" : { "verbosity" : -1 } , "ftdc" : { "verbosity" : -1 } , "tracking" : { "verbosity" : -1 } , "transaction" : { "verbosity" : -1 } }, "logLevel" : 0, "logicalSessionRefreshMillis" : 300000, "maxAcceptableLogicalClockDriftSecs" : NumberLong(31536000), "maxBSONDepth" : 200, "maxIndexBuildMemoryUsageMegabytes" : 500, "maxLogSizeKB" : 10, "maxNumInitialSyncCollectionClonerCursors" : 1, "maxSessions" : 1000000, "maxSyncSourceLagSecs" : 30, "maxTransactionLockRequestTimeoutMillis" : 5, "migrateCloneInsertionBatchDelayMS" : 0, "migrateCloneInsertionBatchSize" : 0, "newCollectionsUsePowerOf2Sizes" : true, "notablescan" : false, "numInitialSyncAttempts" : 10, "numInitialSyncCollectionCountAttempts" : 3, "numInitialSyncCollectionFindAttempts" : 3, "numInitialSyncConnectAttempts" : 10, "numInitialSyncListCollectionsAttempts" : 3, "numInitialSyncListDatabasesAttempts" : 3, "numInitialSyncListIndexesAttempts" : 3, "numInitialSyncOplogFindAttempts" : 3, "opensslCipherConfig" : "", "opensslDiffieHellmanParameters" : "", "oplogFetcherInitialSyncMaxFetcherRestarts" : 10, "oplogFetcherSteadyStateMaxFetcherRestarts" : 1, "oplogInitialFindMaxSeconds" : 60, "oplogRetriedFindMaxSeconds" : 2, "orphanCleanupDelaySecs" : 900, "periodicNoopIntervalSecs" : 10, "priorityTakeoverFreshnessWindowSeconds" : 2, "quiet" : false, "rangeDeleterBatchDelayMS" : 20, "rangeDeleterBatchSize" : 0, "recoverFromOplogAsStandalone" : false, "replBatchLimitBytes" : 104857600, "replBatchLimitOperations" : 5000, "replElectionTimeoutOffsetLimitFraction" : 0.15, "replIndexPrefetch" : "all", "replWriterThreadCount" : 16, "reportOpWriteConcernCountersInServerStatus" : false, "reservedServiceExecutorRecursionLimit" : 8, "rollbackRemoteOplogQueryBatchSize" : 2000, "rollbackTimeLimitSecs" : 86400, "saslHostName" : "wqdcsrv414.cn.prod", "saslServiceName" : "mongodb", "saslauthdPath" : "", "scramIterationCount" : 10000, "scramSHA256IterationCount" : 15000, "scriptingEngineInterruptIntervalMS" : 1000, "shardedConnPoolIdleTimeoutMinutes" : 
2147483647, "skipCorruptDocumentsWhenCloning" : false, "skipShardingConfigurationChecks" : false, "sslMode" : "disabled", "sslWithholdClientCertificate" : false, "startupAuthSchemaValidation" : true, "suppressNoTLSPeerCertificateWarning" : false, "syncdelay" : 60, "synchronousServiceExecutorRecursionLimit" : 8, "taskExecutorPoolSize" : 1, "tcmallocAggressiveMemoryDecommit" : 0, "tcmallocEnableMarkThreadTemporarilyIdle" : false, "tcmallocMaxTotalThreadCacheBytes" : NumberLong(1073741824), "testingSnapshotBehaviorInIsolation" : false, "traceExceptions" : false, "traceWriteConflictExceptions" : false, "transactionLifetimeLimitSeconds" : 60, "ttlMonitorEnabled" : true, "ttlMonitorSleepSecs" : 60, "waitForSecondaryBeforeNoopWriteMS" : 10, "waitForStepDownOnNonCommandShutdown" : true, "wiredTigerConcurrentReadTransactions" : 256, "wiredTigerConcurrentWriteTransactions" : 256, "wiredTigerCursorCacheSize" : -100, "wiredTigerEngineRuntimeConfig" : "", "wiredTigerMaxCacheOverflowSizeGB" : 0, "wiredTigerSessionCloseIdleTimeSecs" : 300, "writePeriodicNoops" : true, "ok" : 1, "operationTime" : Timestamp(1623206657, 8), "$clusterTime" : { "clusterTime" : Timestamp(1623206657, 8), "signature" : { "hash" : BinData(0,"RGhE9taFjsdy8lg9+J37lfYYcuU="), "keyId" : NumberLong("6924963697634854396") } } }

Best regards, YS

JIRAUSER1257066 commented on Tue, 8 Jun 2021 17:25:38 +0000: Hi ys.mao@homecreditcfc.cn, After looking into the diagnostic data you provided, we noticed that your cluster configuration diverges from the default configuration in a number of ways, which makes it difficult to understand the root cause. We noticed substantial cache pressure that may have contributed to the failover on the primary, but it remains unclear why this happened. We're interested in whether this issue is reproducible with your current settings, and whether it still reproduces after upgrading to MongoDB v4.4 (which has deprecated one of the configurations you use) or after returning the following configurations to their defaults:

  - Remove net.serviceExecutor: adaptive. We believe this is possibly the culprit for the primary node failover. This setting was introduced to alleviate issues with large numbers of connections. It has since been deprecated in MongoDB v4.4.
  - Remove net.wireObjectCheck.
  - Revert journal.commitIntervalMs to 100 from 50.
  - Revert the read/write tickets back to 128.

Best, Edwin

JIRAUSER1260348 commented on Tue, 8 Jun 2021 01:28:56 +0000: Thanks, Edwin, we did a full resync after the issue occurred. Any idea what the root cause might be? According to the log files, the primary node stopped serving as the primary, converted itself to a secondary automatically, tried to sync data from the newly elected primary for about 4-5 minutes, and then errored out and shut itself down automatically. There was a high workload during that time and the available write tickets dropped to zero. One more thing: we adjusted the read/write tickets from the default 128 to 512 on all nodes of this cluster a few days ago; each node has 56 CPU cores.

Regards, YS

JIRAUSER1257066 commented on Mon, 7 Jun 2021 19:48:25 +0000: Hi ys.mao@homecreditcfc.cn, Thank you for providing the FTDC files and the log lines that cover this issue. Based on the log message

  2021-06-04T10:56:54.909+0800 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back

the ideal resolution is to perform a clean resync from an unaffected node.

Best, Edwin
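On the ticket recommendation above: whether the 128 -> 512 override is still in effect, and whether write tickets really are exhausted under load, can be checked at runtime from the mongo shell. A minimal sketch using standard commands; the 128 values being set here are Edwin's suggested defaults, not values confirmed elsewhere in this ticket:

// Current WiredTiger ticket usage: "out" = tickets in use,
// "available" = totalTickets - out. Exhaustion shows as available: 0.
var ct = db.serverStatus().wiredTiger.concurrentTransactions;
printjson(ct);

// Revert the concurrency limits to the 128 default at runtime; to make
// this durable across restarts, the corresponding setParameter entries
// must also be removed from (or changed in) the startup configuration.
db.adminCommand({ setParameter: 1, wiredTigerConcurrentReadTransactions: 128 });
db.adminCommand({ setParameter: 1, wiredTigerConcurrentWriteTransactions: 128 });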
JIRAUSER1260348 commented on Fri, 4 Jun 2021 08:37:17 +0000: Attached the log file and 2 diagnostic data files.

JIRAUSER1260348 commented on Fri, 4 Jun 2021 07:29:20 +0000: Mongo configuration file:

cat /etc/mongodpretreatment.cnf
storage:
  dbPath: /data3/pretreatment/dbdata
  journal:
    enabled: true
    commitIntervalMs: 50
  directoryPerDB: true
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 60
      directoryForIndexes: true
systemLog:
  quiet: false
  path: /logs/pretreatment/pretreatment.log
  destination: file
  logAppend: true
  logRotate: rename
processManagement:
  fork: true
  pidFilePath: /data3/pretreatment/pid/pretreatment.pid
net:
  port: 27077
  maxIncomingConnections: 10000
  wireObjectCheck: true
  bindIp: 0.0.0.0
  serviceExecutor: adaptive
security:
  keyFile: /data3/pretreatment/keyfile/securitypretreatment
  authorization: enabled
replication:
  oplogSizeMB: 819200
  replSetName: pretreatment
  enableMajorityReadConcern: false
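A note that becomes relevant later in this thread, where a different config file was inspected by mistake: the options the running mongod actually parsed, including which config file it loaded, can be read back from the server itself rather than from disk. A minimal sketch using the standard getCmdLineOpts admin command:

// Returns both the raw argv and the parsed configuration; this is the
// same information as the "options:" line in the startup log above.
var opts = db.adminCommand({ getCmdLineOpts: 1 });
print("config file: " + opts.parsed.config);
printjson(opts.parsed.net); // serviceExecutor appears here if configured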
JIRAUSER1260348 commented on Fri, 4 Jun 2021 05:48:38 +0000: Error log during the primary instance's first failure:

...
2021-06-04T10:52:56.146+0800 I REPL [replexec-3062] Scheduling priority takeover at 2021-06-04T10:53:06.796+0800
2021-06-04T10:52:56.401+0800 I EXECUTOR [worker-2962] Thread was only executing tasks 59% over the last 5045ms. Exiting thread.
2021-06-04T10:52:56.433+0800 I EXECUTOR [worker-3013] Thread was only executing tasks 54% over the last 4561ms. Exiting thread.
2021-06-04T10:52:56.462+0800 I EXECUTOR [worker-3001] Thread was only executing tasks 53% over the last 4709ms. Exiting thread.
2021-06-04T10:52:56.473+0800 I EXECUTOR [worker-2949] Thread was only executing tasks 60% over the last 5250ms. Exiting thread.
2021-06-04T10:52:56.523+0800 I EXECUTOR [worker-2959] Thread was only executing tasks 58% over the last 5197ms. Exiting thread.
2021-06-04T10:52:56.525+0800 I EXECUTOR [worker-3019] Thread was only executing tasks 60% over the last 4592ms. Exiting thread.
...
2021-06-04T10:52:57.131+0800 I REPL [rsBackgroundSync] sync source candidate: 10.67.12.3:27077
2021-06-04T10:52:57.133+0800 I ASIO [RS] Connecting to 10.67.12.3:27077
2021-06-04T10:52:57.136+0800 I EXECUTOR [worker-2992] Thread was only executing tasks 56% over the last 5475ms. Exiting thread.
2021-06-04T10:52:57.141+0800 I EXECUTOR [worker-3014] Thread was only executing tasks 54% over the last 5259ms. Exiting thread.
2021-06-04T10:52:57.142+0800 I EXECUTOR [worker-3060] Thread was only executing tasks 49% over the last 4797ms. Exiting thread.
2021-06-04T10:52:57.144+0800 I REPL [rsBackgroundSync] Changed sync source from empty to 10.67.12.3:27077
2021-06-04T10:52:57.145+0800 I ASIO [RS] Connecting to 10.67.12.3:27077
2021-06-04T10:52:57.237+0800 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1622775163, 2189), t: 35 }. source's GTE: { ts: Timestamp(1622775638, 1), t: 36 } hashes: (3943340773125100714/-6300946020427835533)
2021-06-04T10:52:57.237+0800 I REPL [rsBackgroundSync] Replication commit point: { ts: Timestamp(1622775002, 3234), t: 35 }
2021-06-04T10:52:57.237+0800 I REPL [rsBackgroundSync] Rollback using the 'rollbackViaRefetch' method.
2021-06-04T10:52:57.237+0800 I REPL [rsBackgroundSync] Canceling priority takeover callback
2021-06-04T10:52:57.237+0800 I REPL [rsBackgroundSync] transition to ROLLBACK from SECONDARY
2021-06-04T10:52:57.237+0800 I NETWORK [conn54001871] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.231.177:52896 (connection id: 54001871)
2021-06-04T10:52:57.237+0800 I NETWORK [conn54001870] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.231.177:52894 (connection id: 54001870)
...
2021-06-04T10:56:34.137+0800 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for pretreatment/10.67.12.1:27077,10.67.12.2:27077,10.67.12.3:27077
2021-06-04T10:56:34.602+0800 I REPL [replexec-3062] Scheduling priority takeover at 2021-06-04T10:56:45.262+0800
2021-06-04T10:56:42.362+0800 I NETWORK [listener] connection accepted from 10.30.218.4:58270 #54007208 (5214 connections now open)
2021-06-04T10:56:42.363+0800 I NETWORK [conn54007208] received client metadata from 10.30.218.4:58270 conn54007208: { driver: { name: "nodejs", version: "3.2.7-2" }, os: { type: "Windows_NT", name: "win32", architecture: "x64", version: "10.0.19041" }, platform: "Node.js v10.11.0, LE, mongodb-core: 3.2.7-2", application: { name: "NoSQLBoosterV5.2.12_1439.782" } }
2021-06-04T10:56:42.403+0800 I ACCESS [conn54007208] Successfully authenticated as principal root on admin from client 10.30.218.4:58270
2021-06-04T10:56:42.414+0800 I NETWORK [worker-148] end connection 10.30.218.4:58270 (5213 connections now open)
2021-06-04T10:56:45.262+0800 I REPL [replexec-3062] Canceling priority takeover callback
2021-06-04T10:56:45.262+0800 I REPL [replexec-3062] Not starting an election for a priority takeover, since we are not electable due to: Not standing for election because member is not currently a secondary; member is not caught up enough to the most up-to-date member to call for priority takeover - must be within 2 seconds (mask 0x408)
2021-06-04T10:56:46.602+0800 I REPL [replexec-3062] Scheduling priority takeover at 2021-06-04T10:56:58.059+0800
2021-06-04T10:56:49.228+0800 I CONTROL [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true }
2021-06-04T10:56:52.441+0800 I NETWORK [listener] connection accepted from 10.30.218.4:57038 #54007209 (5214 connections now open)
2021-06-04T10:56:52.442+0800 I NETWORK [conn54007209] received client metadata from 10.30.218.4:57038 conn54007209: { driver: { name: "nodejs", version: "3.2.7-2" }, os: { type: "Windows_NT", name: "win32", architecture: "x64", version: "10.0.19041" }, platform: "Node.js v10.11.0, LE, mongodb-core: 3.2.7-2", application: { name: "NoSQLBoosterV5.2.12_1439.782" } }
2021-06-04T10:56:52.478+0800 I ACCESS [conn54007209] Successfully authenticated as principal root on admin from client 10.30.218.4:57038
2021-06-04T10:56:52.489+0800 I NETWORK [worker-2910] end connection 10.30.218.4:57038 (5213 connections now open)
2021-06-04T10:56:53.423+0800 I REPL [rsBackgroundSync] Incremented the rollback ID to 3
2021-06-04T10:56:54.909+0800 I ROLLBACK [rsBackgroundSync] Rollback finished. The final minValid is: { ts: Timestamp(1622532457, 3), t: 22 }
2021-06-04T10:56:54.909+0800 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back.
2021-06-04T10:56:54.909+0800 F - [rsBackgroundSync] Fatal Assertion 40507 at src/mongo/db/repl/rs_rollback.cpp 1542
2021-06-04T10:56:54.909+0800 F - [rsBackgroundSync] ***aborting after fassert() failure

JIRAUSER1260348 commented on Fri, 4 Jun 2021 05:47:47 +0000:

MongoDB shell version v4.0.13
connecting to: mongodb://127.0.0.1:27077/?authSource=admin&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("afb50b77-b459-42c8-8526-c64b046f5060") }
MongoDB server version: 4.0.13
pretreatment:PRIMARY> db.hostInfo().os
{
	"type" : "Linux",
	"name" : "Oracle Linux Server release 7.8",
	"version" : "Kernel 4.14.35-1902.301.1.el7uek.x86_64"
}
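One closing note on the upgrade plan (4.0 -> 4.2 -> 4.4) mentioned earlier in the thread: before swapping binaries, it is worth confirming the featureCompatibilityVersion, which the getParameter dump above shows as 4.0. A minimal sketch using the standard admin commands:

// Confirm the current feature compatibility version.
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 });

// Only after all members run 4.2 binaries and the set is stable:
// db.adminCommand({ setFeatureCompatibilityVersion: "4.2" });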