BugZero | MongoDB BugID 389796 - Latencies in Top exclude lock acquisition time

MongoDB - Defect ID: 389796

Latencies in Top exclude lock acquisition time

Last updated on June 14th, 2017

BugZero Risk Score
6.0 Medium

Overall: 6.0

Severity: 6.4

Community: 6.0

Lifecycle: 9.1

Bug Details

Priority: Major - P3
Status: Closed

Description

Info

Due to the introduction of the AutoStatsTracker for reporting latencies to Top in SERVER-22541 (in particular, see commit 584ca76de9e), these latencies now exclude time spent acquiring locks. The issue can be demonstrated by applying the following patch, which adds sleeps to simulate super long lock acquisition times:

    diff --git a/src/mongo/db/db_raii.cpp b/src/mongo/db/db_raii.cpp
    index 9098e3b..977d291 100644
    --- a/src/mongo/db/db_raii.cpp
    +++ b/src/mongo/db/db_raii.cpp
    @@ -107,6 +107,9 @@ AutoStatsTracker::~AutoStatsTracker() {
     AutoGetCollectionForRead::AutoGetCollectionForRead(OperationContext* opCtx,
                                                        const NamespaceString& nss,
                                                        AutoGetCollection::ViewMode viewMode) {
    +    // Simulate a 2 second lock acquisition.
    +    sleepmillis(2000);
    +
         _autoColl.emplace(opCtx, nss, MODE_IS, MODE_IS, viewMode);

         // Note: this can yield.
Then run the following:

    > db.c.drop()
    true
    > db.c.insert({_id: 1})
    WriteResult({ "nInserted" : 1 })
    > db.c.aggregate([{$collStats: {latencyStats: {}}}]).pretty()
    {
        "ns" : "test.c",
        "localTime" : ISODate("2017-06-02T21:17:22.267Z"),
        "latencyStats" : {
            "reads" : {
                "latency" : NumberLong(163),
                "ops" : NumberLong(2)
            },
            "writes" : {
                "latency" : NumberLong(143438),
                "ops" : NumberLong(1)
            },
            "commands" : {
                "latency" : NumberLong(0),
                "ops" : NumberLong(0)
            }
        }
    }
    > db.c.find()
    { "_id" : 1 }
    > db.c.aggregate([{$collStats: {latencyStats: {}}}]).pretty()
    {
        "ns" : "test.c",
        "localTime" : ISODate("2017-06-02T21:17:35.243Z"),
        "latencyStats" : {
            "reads" : {
                "latency" : NumberLong(401),
                "ops" : NumberLong(4)
            },
            "writes" : {
                "latency" : NumberLong(143438),
                "ops" : NumberLong(1)
            },
            "commands" : {
                "latency" : NumberLong(0),
                "ops" : NumberLong(0)
            }
        }
    }

You should observe that the find commands take a few seconds to return, but the cumulative read latency reported in $collStats only increases by tens or hundreds of microseconds.

Note that latency reported in slow query log lines is not affected by this bug. Re-using the example above, the log line for the find command reports a latency of 2000ms, which of course is dominated by the simulated 2 second lock acquisition:

    2017-06-02T17:18:01.340-0400 I COMMAND [conn1] command test.c appName: "MongoDB Shell" command: find { find: "c", filter: {}, $db: "test" } planSummary: COLLSCAN keysExamined:0 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:100 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_msg 2000ms
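To put the two measurements side by side, the arithmetic can be sketched as follows. This is only an illustration: the `ReadStats` struct and the helper functions are hypothetical names, and the numbers are copied from the two $collStats outputs and the log line above.

```cpp
#include <cstdint>

// Hypothetical holder for the "reads" entry of $collStats latencyStats.
// The struct is illustrative; it is not part of any MongoDB API.
struct ReadStats {
    std::int64_t latencyMicros;  // cumulative read latency, in microseconds
    std::int64_t ops;            // cumulative number of read operations
};

// Read latency added between two snapshots, in microseconds.
constexpr std::int64_t latencyDeltaMicros(const ReadStats& before, const ReadStats& after) {
    return after.latencyMicros - before.latencyMicros;
}

// Read operations recorded between two snapshots.
constexpr std::int64_t opsDelta(const ReadStats& before, const ReadStats& after) {
    return after.ops - before.ops;
}

// Values copied from the $collStats output before and after the find.
constexpr ReadStats kBeforeFind{163, 2};
constexpr ReadStats kAfterFind{401, 4};

// Latency the slow query log reported for the find (2000ms), in microseconds.
constexpr std::int64_t kLoggedFindMicros = 2000 * 1000;
```

With these values, Top attributes 238 microseconds to the reads recorded between the two snapshots, while the log line attributes 2,000,000 microseconds to the find alone.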

Top User Comments

david.storch commented on Wed, 14 Jun 2017 23:02:20 +0000:
I ended up fixing this as part of the change for SERVER-29304. The fix was to use the CurOp timer for Top, so that our various debug mechanisms that report latencies all use the same timer. The CurOp timer is now used by the slow query logs, the profiler, db.currentOp(), Top, and operation latency histograms. I'm resolving this ticket as a duplicate.

david.storch commented on Mon, 5 Jun 2017 14:18:13 +0000:
charlie.swanson, pretty sure, yeah. Unfortunately CurOp::ensureStarted() is used for reporting latency in the log lines, but we have a separate Timer instance for reporting latencies in Top. Prior to this patch, the Timer was constructed before the AutoGetCollectionForRead. I'm not sure if there is a good reason why we can't use the same timing code for both CurOp and Top, but I wasn't planning on changing it as part of fixing this issue.

charlie.swanson commented on Mon, 5 Jun 2017 13:48:14 +0000:
david.storch are you sure it was that commit? It looks like both before and after that commit we acquire the collection lock before calling CurOp::ensureStarted(), which as far as I can tell is responsible for the latency reporting?
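The effect of timer placement discussed in these comments can be illustrated with a small, deterministic sketch. It is not the server's code: `FakeClock`, `Stopwatch`, `Latencies`, and `simulateRead` are hypothetical names, and a manually advanced clock stands in for real lock waits. It only shows that a stopwatch started after the lock wait misses that wait, while one started at the beginning of the operation includes it.

```cpp
#include <cstdint>

// Hypothetical manually advanced clock, so the sketch is deterministic.
struct FakeClock {
    std::int64_t nowMicros = 0;
    void advance(std::int64_t micros) { nowMicros += micros; }
};

// Minimal stopwatch: remembers the clock reading at construction.
class Stopwatch {
public:
    explicit Stopwatch(const FakeClock& clock) : _clock(clock), _start(clock.nowMicros) {}
    std::int64_t elapsedMicros() const { return _clock.nowMicros - _start; }

private:
    const FakeClock& _clock;
    std::int64_t _start;
};

struct Latencies {
    std::int64_t lateTimerMicros;       // stopwatch started after the lock wait
    std::int64_t operationTimerMicros;  // stopwatch started when the operation began
};

// Simulates one read: wait for a lock, then do the actual work.
Latencies simulateRead(FakeClock& clock, std::int64_t lockWaitMicros, std::int64_t workMicros) {
    Stopwatch operationTimer(clock);  // covers the whole operation
    clock.advance(lockWaitMicros);    // simulated lock acquisition
    Stopwatch lateTimer(clock);       // starts only once the lock is held
    clock.advance(workMicros);        // simulated query execution
    return {lateTimer.elapsedMicros(), operationTimer.elapsedMicros()};
}
```

For a simulated 2 second lock wait followed by 120 microseconds of work, the late stopwatch reports 120 microseconds and the operation-level stopwatch reports 2,000,120. Reporting every latency from a single operation-level timer, as the fix describes for the CurOp timer, removes this kind of discrepancy between reporting mechanisms.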

Relevant Products

Affected versions: No known affected versions

Fixed versions: No known fixed versions
