...
Suppose a transaction opens a cursor, then abandons it. The locks for that cursor will be held in intent mode, even between batches. Now suppose a drop for the database or collection comes in. That drop will "get in line" with a MODE_X lock. In order to be fair to such requests, that pending MODE_X acquisition will block all future intent acquisitions. This will prevent a killCursors from killing the cursor, since it will need to take a collection lock. It looks like it will also prevent the background transaction timeout job from killing it.
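The queueing behavior described above can be sketched with a toy FIFO-fair lock. This is a minimal illustration only, not the server's lock manager: the `FairLock` class, its methods, and the two-mode compatibility table are hypothetical names invented for this sketch, and it models only the rule that a waiting MODE_X request blocks later intent requests.

```python
from collections import deque

# Toy compatibility table (hypothetical, two modes only).
# Key is (mode already held, mode requested).
COMPATIBLE = {
    ("IX", "IX"): True,
    ("IX", "X"): False,
    ("X", "IX"): False,
    ("X", "X"): False,
}


class FairLock:
    """Toy FIFO-fair lock: a request is granted only if it is compatible
    with every current holder AND no earlier request is still waiting."""

    def __init__(self):
        self.granted = []       # list of (owner, mode)
        self.waiting = deque()  # FIFO queue of (owner, mode)

    def acquire(self, owner, mode):
        ok = all(COMPATIBLE[(held, mode)] for _, held in self.granted)
        if ok and not self.waiting:
            self.granted.append((owner, mode))
            return True
        # Either a conflict, or someone is already in line ahead of us.
        self.waiting.append((owner, mode))
        return False

    def release(self, owner):
        self.granted = [(o, m) for o, m in self.granted if o != owner]
        # Grant queued requests in order until one conflicts.
        while self.waiting:
            _, mode = self.waiting[0]
            if not all(COMPATIBLE[(held, mode)] for _, held in self.granted):
                break
            self.granted.append(self.waiting.popleft())


lock = FairLock()
txn_granted = lock.acquire("transaction", "IX")   # abandoned cursor's transaction
drop_granted = lock.acquire("drop", "X")          # queues behind the held IX
kill_granted = lock.acquire("killCursors", "IX")  # queues behind the pending X
```

In this model the `killCursors` request is compatible with the transaction's IX lock but still waits, because the drop's X request is ahead of it in the queue. Nothing makes progress until the transaction itself releases its lock.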
james.wahlin@10gen.com commented on Wed, 2 May 2018 15:53:22 +0000:
"One approach we could take to prevent killSessions from blocking is to kill stashed transaction resources prior to killing cursors. Then the transaction would stop holding locks, so the drop could proceed, so the cursor kill could proceed. That might be something we want to look into."
I agree that we should change the order of operations when killing sessions. We currently:
1. Kill operations
2. Kill cursors
3. Kill transactions
We should instead:
1. Kill transactions
2. Kill operations
3. Kill cursors
As killing transactions will kill associated cursors, step 3 would be there only to kill cursors that were opened as part of the session but outside of a transaction.

dianna.hohensee commented on Wed, 2 May 2018 15:28:34 +0000:
tess.avitabile, we bump it to 3 hours for testing, so that it doesn't cause random failures on slow machines using transactions. See this code. Though not all our testing is covered by that setting, SERVER-34595 will make the coverage complete.

tess.avitabile commented on Wed, 2 May 2018 15:18:06 +0000:
james.wahlin, why would the transactionLifetimeLimitSeconds be 10800 at the start of the repro? I thought the default was 60.

dianna.hohensee commented on Wed, 2 May 2018 15:14:14 +0000:
On a related note, over in SERVER-34732 I'm exploring what appears to be a deadlock where PeriodicRunnerASIO, which runs the periodic task to abort expired transactions, is waiting on an IS lock behind a drop cmd waiting for an X lock behind an inactive transaction with an IX lock.

tess.avitabile commented on Wed, 2 May 2018 14:45:48 +0000:
I'm glad to hear the transaction timeout job will successfully kill the transaction. It is unfortunate that killCursors will block. I would also expect killSessions to block, since the first thing it does is kill all cursors for the session, which requires collection locks.
I think it is expected behavior that a drop that is blocked behind a transaction will block other operations. The scope document for local snapshot reads explicitly says that catalog operations will block behind transactions. One approach we could take to prevent killSessions from blocking is to kill stashed transaction resources prior to killing cursors. Then the transaction would stop holding locks, so the drop could proceed, so the cursor kill could proceed. That might be something we want to look into. Even if killCursors did not block, killCursors is not sufficient to kill the transaction, since the transaction survives cursor kills and maintains its locks.

james.wahlin@10gen.com commented on Wed, 2 May 2018 13:33:18 +0000:
The transaction kill mechanism is not triggering here because the transactionLifetimeLimitSeconds used is 10800, or 3 hours. Session::_transactionExpireDate is compared to Date_t::now() to determine whether a transaction should be aborted. The value for this is set at transaction start time. The attached script waits to reduce transactionLifetimeLimitSeconds to 1 second until after the transaction has been started, so it is created with a 3-hour expiration. Moving the setParameter above the transaction start addresses this, allowing for transaction kill and MODE_X lock acquisition.

milkie commented on Tue, 1 May 2018 21:38:23 +0000:
Also, while attempting to kill the cursor with the killCursors command won’t work, won’t killing the Session work? Especially if an admin was trying to diagnose the problem, the currentOp output will show you the session to kill, not the cursor id to kill.

james.wahlin@10gen.com commented on Tue, 1 May 2018 21:34:15 +0000:
I am surprised that the transaction timeout mechanism is blocked. When a transaction times out we call Session::abortArbitraryTransactionIfExpired(), which will first release the transaction Lock and Recovery unit prior to attempting to kill associated cursors.
It would be interesting to see where the transaction kill thread is blocked.

david.storch commented on Tue, 1 May 2018 21:13:12 +0000:
spencer tess.avitabile, this feels like a candidate for a 4.0.0-rc0 fixVersion. Abandoning a cursor and then issuing killCursors on it isn't entirely unusual, and it seems like this could lead to a server that is in a "stuck" state. Please triage, and let me know if you'd like an assist from James or someone else on query. Nice work tracking this down charlie.swanson!

spencer commented on Tue, 1 May 2018 21:11:19 +0000:
Hmm, the fact that it blocks the transaction timeout is the most worrisome part. milkie james.wahlin
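The reordering of the session-kill steps proposed in the comments can be illustrated with a small model. This is a sketch under stated assumptions, not server code: `run_kill_session` and its flags are hypothetical, "kill operations" is assumed not to need the collection lock, and "kill transactions" is assumed to release the stashed locks so the pending drop can finish.

```python
def run_kill_session(order):
    """Toy model: return (completed steps, step that blocked or None)."""
    txn_holds_locks = True  # stashed transaction holds an intent lock
    drop_pending = True     # a drop waits for MODE_X behind the transaction
    completed = []
    for step in order:
        if step == "kill transactions":
            # Assumption: aborting the transaction releases its stashed
            # locks, which lets the pending drop acquire its lock and finish.
            txn_holds_locks = False
            drop_pending = False
        elif step == "kill cursors":
            # Needs a collection lock, which queues behind the pending drop.
            if drop_pending and txn_holds_locks:
                return completed, step
        # Assumption: "kill operations" never needs the collection lock.
        completed.append(step)
    return completed, None


CURRENT_ORDER = ["kill operations", "kill cursors", "kill transactions"]
PROPOSED_ORDER = ["kill transactions", "kill operations", "kill cursors"]

current_result = run_kill_session(CURRENT_ORDER)
proposed_result = run_kill_session(PROPOSED_ORDER)
```

Under these assumptions the current order stalls at "kill cursors" before the transaction is ever aborted, while the proposed order runs all three steps.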
Download the attached 'repro.js', then run:

    python buildscripts/resmoke.py --suites=replica_sets_jscore_passthrough repro.js

That will hang forever.
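Per the comments, the hang depends on when transactionLifetimeLimitSeconds is lowered relative to the transaction start, because the expiration date is fixed when the transaction begins. The sketch below models only that timing rule. The `ServerParams` and `Transaction` classes are hypothetical stand-ins for illustration and are not the server's Session implementation or the contents of repro.js.

```python
from datetime import datetime, timedelta


class ServerParams:
    """Hypothetical holder for the server parameter (test value: 3 hours)."""

    def __init__(self):
        self.transaction_lifetime_limit_seconds = 10800


class Transaction:
    """Hypothetical transaction whose expiry is computed once, at start."""

    def __init__(self, params, now):
        limit = timedelta(seconds=params.transaction_lifetime_limit_seconds)
        self.expire_date = now + limit

    def is_expired(self, now):
        return now >= self.expire_date


start = datetime(2018, 5, 1, 21, 0, 0)
later = start + timedelta(seconds=5)

# Order used by the repro, as described in the comments:
# start the transaction first, then lower the limit.
params_a = ServerParams()
txn_a = Transaction(params_a, start)
params_a.transaction_lifetime_limit_seconds = 1  # too late for txn_a
expired_a = txn_a.is_expired(later)

# Suggested order: lower the limit first, then start the transaction.
params_b = ServerParams()
params_b.transaction_lifetime_limit_seconds = 1
txn_b = Transaction(params_b, start)
expired_b = txn_b.is_expired(later)
```

In the first ordering the transaction keeps its 3-hour expiration, so the timeout job has nothing to abort. In the second it expires after one second.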