...
BugZero found this defect 2759 days ago.
In mongod's slow log. I got so many lines like below: Mon Apr 10 11:05:01.426 I COMMAND [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after connections: 0, after extra_info: 1410, after globalLock: 1410, after locks: 1410, after network: 1410, after opcounters: 1410, after opcountersRepl: 1410, after repl: 1410, after storageEngine: 1410, after tcmalloc: 1410, after wiredTiger: 1410, at end: 1420 } there are so many lines like above. from the ftdc flag, I know it is generated from the diagonal thread, but this is not important. What is important is that time all grows after getting extra_info. Which I found is to run mallinfo() on linux after digging into the code. I doubt that mallinfo may be slow in some situations because all the slow logs I got is caused by the time-cost after extra_info.
wolf_kdy commented on Thu, 7 Dec 2017 05:24:05 +0000: We have found the problem. a high-loaded mongod with thounds of connections(or threads, both the same), huge amount of memory (64GB is enough). If you keep doing 'cat /proc/mongod_pid/smaps', you will see that mongod getting slow logs. On our machines, there is a monitor-agent that reads every process' memory stat by reading /proc/pid/smaps, every time agent does this, mongod hangs. Reasons more intrinsical is inside linux kernal. If someone meets this situation, you should strace the monitor-agent to see if mongod's smaps is being read. wolf_kdy commented on Sat, 20 May 2017 04:09:49 +0000: Perhaps more consideration should be taken. There are two doubts that are still unclear 1) why are all the serverStatus slowlog output all the same, time grows from extra_info which is collecting mallinfo in tcmalloc ? 2) Maunally run db.runCommand( {'serverStatus':1} ) can trigger slow serverStatus, followed by a huge amount of other slow queries. I think you should test the behavior of mongod with tcmalloc with high load, (eg, 5000 active threads and 200GB or more memory). in this situation, you will see tcmalloc's costing most of the cputime by perf top I strongly doubt that tcmalloc's mallinfo may block memory allocation and release, That may be the reason each serverStatus will be followed by large amounts of other slowquerys, some of which may be very simple key-value operation and should never be slow. mark.agarunov commented on Wed, 12 Apr 2017 18:45:18 +0000: Hello wolf_kdy, Thank you for providing the files. Looking over the diagnostic data and logs, it appears that as there is more load placed on the server, as a result serverStatus takes longer to return and times out. I don't see anything that would indicate a bug, after the load on the server is increased, the performance seems to be relatively consistent. If you are not seeing a performance degradation, this appears to be working as expected. Thanks, Mark wolf_kdy commented on Wed, 12 Apr 2017 03:23:18 +0000: @Mark Agarunov I've sent you an email which contains what you need mark.agarunov commented on Tue, 11 Apr 2017 17:21:34 +0000: Hello wolf_kdy, Thank you for the report. To better investigate the behavior you've described, I'd like to request some additional information. Could you please provide the following: The complete logs from mongod when this issue is present, An archive (tar or zip) of the $dbpath/diagnostic.data directory Thanks, Mark