sapling/hgext3rd/traceprofimpl.cpp

390 lines
9.2 KiB
C++
Raw Normal View History

traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
/* traceprofimpl.cpp - main implementation of traceprofile
*
* Copyright 2017 Facebook, Inc.
*
* This software may be used and distributed according to the terms of the
* GNU General Public License version 2 or any later version. */
#include "Python.h"
#include "frameobject.h"
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>
#include <sys/time.h>
#include <unistd.h>
typedef int lineno_t;
typedef uint64_t frameid_t;
typedef uint64_t rdtsc_t;
/* tracing ------------------------------------------------------------------ */
/* information about a raw Python frame */
struct FrameInfo {
traceprof: store raw Python code object Summary: Previously, the tracer extracts file name, function name from Python code objects and stores them as std::string. And frame the de-duplication logic hashes those strings, which is actually quite expensive. `hg id` takes 1.0 seconds without frame de-duplication, and 2.5 seconds with de-duplication. This diff assumes Python code object is unique (i.e. Python won't generate two code objects for a single code segment). That seems reasonable and `lsprof` seems to make a same assumption. With that assumption, just store `PyCodeObject` (contains file, method name and line number) instead of `std::string` (for file, method names) and line numbers for code identity and use the raw address of `PyCodeObject` to do frame de-duplication. That is helpful during profiling and de-duplication: - During profiling, no need to copy strings (convert to std::string), therefore less overhead - During frame de-duplication, no need to read long strings and hash them, therefore much faster Debug flag during compilation is added to make debugging easier as the code starts to mix of C++ and Python objects, which means mistakes could be made more easily. Test Plan: With this patch, `hg id` with frame de-duplication takes 1.0 seconds. There is no visible overhead doing de-duplication. Cheers! Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters Differential Revision: https://phabricator.intern.facebook.com/D5112636 Signature: t1:5112636:1495562232:dd0724198ebb7f2eb4e9fd3a83517fc9160fa01e
2017-05-23 21:51:10 +03:00
PyCodeObject *code; /* PyCodeObject */
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
frameid_t back;
traceprof: store raw Python code object Summary: Previously, the tracer extracts file name, function name from Python code objects and stores them as std::string. And frame the de-duplication logic hashes those strings, which is actually quite expensive. `hg id` takes 1.0 seconds without frame de-duplication, and 2.5 seconds with de-duplication. This diff assumes Python code object is unique (i.e. Python won't generate two code objects for a single code segment). That seems reasonable and `lsprof` seems to make a same assumption. With that assumption, just store `PyCodeObject` (contains file, method name and line number) instead of `std::string` (for file, method names) and line numbers for code identity and use the raw address of `PyCodeObject` to do frame de-duplication. That is helpful during profiling and de-duplication: - During profiling, no need to copy strings (convert to std::string), therefore less overhead - During frame de-duplication, no need to read long strings and hash them, therefore much faster Debug flag during compilation is added to make debugging easier as the code starts to mix of C++ and Python objects, which means mistakes could be made more easily. Test Plan: With this patch, `hg id` with frame de-duplication takes 1.0 seconds. There is no visible overhead doing de-duplication. Cheers! Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters Differential Revision: https://phabricator.intern.facebook.com/D5112636 Signature: t1:5112636:1495562232:dd0724198ebb7f2eb4e9fd3a83517fc9160fa01e
2017-05-23 21:51:10 +03:00
/* needed by older C++ map which has difficulty on zero-copy assignment */
FrameInfo() { code = NULL; back = 0; }
void assign(PyFrameObject *frame, frameid_t backfid) {
back = backfid;
code = frame->f_code;
Py_XINCREF(code);
}
lineno_t line() {
if (!code) return 0;
return code->co_firstlineno;
}
const char *file() {
if (!code) return NULL;
return PyString_AsString(code->co_filename);
}
const char *name() {
if (!code) return NULL;
return PyString_AsString(code->co_name);
}
~FrameInfo() {
Py_XDECREF(code);
code = NULL;
}
private:
/* forbid copy */
FrameInfo(const FrameInfo& rhs);
FrameInfo(const FrameInfo&& rhs);
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
};
/* samples */
struct Sample {
rdtsc_t time;
frameid_t frameid;
int op; /* PyTrace_{CALL,EXCEPTION,LINE,RETURN,C_CALL,C_EXCEPTION,C_RETURN} */
};
/* global state, collected frames and samples */
static std::unordered_map<frameid_t, FrameInfo> frames;
static std::list<Sample> samples;
/* for measuring wall time / rdtsc ratio */
static struct timeval t1, t2;
static rdtsc_t r1, r2;
static double rdtscratio; /* set by disable() */
/* fast way to get wall time */
inline static rdtsc_t rdtsc() {
unsigned long lo, hi;
asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
return (rdtsc_t)(lo | (hi << 32));
}
/* fast, but inaccurate hashing of a Python frame */
inline static uint64_t hashframe(PyFrameObject *frame) {
uint64_t v = ((uint64_t)frame ^ ((uint64_t)(frame->f_back) << 16) ^
((uint64_t)frame->f_code << 32));
/* f_code is usually immutable (lsprof use its address as keys) */
return v;
}
/* hash, and store a Python frame, return its ID */
static frameid_t hashandstoreframe(PyFrameObject *frame) {
if (!frame) return 0;
frameid_t frameid = (frameid_t)hashframe(frame);
if (frames.count(frameid) == 0) {
traceprof: store raw Python code object Summary: Previously, the tracer extracts file name, function name from Python code objects and stores them as std::string. And frame the de-duplication logic hashes those strings, which is actually quite expensive. `hg id` takes 1.0 seconds without frame de-duplication, and 2.5 seconds with de-duplication. This diff assumes Python code object is unique (i.e. Python won't generate two code objects for a single code segment). That seems reasonable and `lsprof` seems to make a same assumption. With that assumption, just store `PyCodeObject` (contains file, method name and line number) instead of `std::string` (for file, method names) and line numbers for code identity and use the raw address of `PyCodeObject` to do frame de-duplication. That is helpful during profiling and de-duplication: - During profiling, no need to copy strings (convert to std::string), therefore less overhead - During frame de-duplication, no need to read long strings and hash them, therefore much faster Debug flag during compilation is added to make debugging easier as the code starts to mix of C++ and Python objects, which means mistakes could be made more easily. Test Plan: With this patch, `hg id` with frame de-duplication takes 1.0 seconds. There is no visible overhead doing de-duplication. Cheers! Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters Differential Revision: https://phabricator.intern.facebook.com/D5112636 Signature: t1:5112636:1495562232:dd0724198ebb7f2eb4e9fd3a83517fc9160fa01e
2017-05-23 21:51:10 +03:00
frames[frameid].assign(frame,hashandstoreframe(frame->f_back));
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
}
return frameid;
}
/* record a sample */
inline static void recordframe(PyFrameObject *frame, int op) {
frameid_t fid = hashandstoreframe(frame);
samples.push_back({rdtsc(), fid, op});
}
static int tracefunc(PyObject *o, PyFrameObject *frame, int op, PyObject *a) {
(void)o;
(void)a;
recordframe(frame, op);
return 0;
}
static void enable() {
r1 = rdtsc();
gettimeofday(&t1, NULL);
PyEval_SetProfile((Py_tracefunc) tracefunc, NULL);
}
static void disable() {
PyEval_SetProfile(NULL, NULL);
r2 = rdtsc();
gettimeofday(&t2, NULL);
/* calculate rdtscratio */
double dt = 0; /* ms */
dt += (double)(t2.tv_sec - t1.tv_sec) * 1000.0; /* sec to ms */
dt += (double)(t2.tv_usec - t1.tv_usec) / 1000.0; /* us to ms */
rdtscratio = dt / (double)(r2 - r1);
}
/* reporting ---------------------------------------------------------------- */
struct FrameSummary {
rdtsc_t time;
unsigned int count;
};
static std::unordered_map<frameid_t, FrameSummary> summaries;
static std::unordered_map<frameid_t, std::list<frameid_t> > framechildren;
/* for dedup */
static std::unordered_map<frameid_t, uint64_t> fid2hash;
static std::unordered_map<uint64_t, frameid_t> hash2fid;
/* hash FrameInfo, do not be affected by frame addresses */
static uint64_t hashframeinfo(frameid_t fid) {
if (fid == 0) {
return 0;
}
if (!fid2hash.count(fid)) {
auto& fi = frames[fid];
traceprof: store raw Python code object Summary: Previously, the tracer extracts file name, function name from Python code objects and stores them as std::string. And frame the de-duplication logic hashes those strings, which is actually quite expensive. `hg id` takes 1.0 seconds without frame de-duplication, and 2.5 seconds with de-duplication. This diff assumes Python code object is unique (i.e. Python won't generate two code objects for a single code segment). That seems reasonable and `lsprof` seems to make a same assumption. With that assumption, just store `PyCodeObject` (contains file, method name and line number) instead of `std::string` (for file, method names) and line numbers for code identity and use the raw address of `PyCodeObject` to do frame de-duplication. That is helpful during profiling and de-duplication: - During profiling, no need to copy strings (convert to std::string), therefore less overhead - During frame de-duplication, no need to read long strings and hash them, therefore much faster Debug flag during compilation is added to make debugging easier as the code starts to mix of C++ and Python objects, which means mistakes could be made more easily. Test Plan: With this patch, `hg id` with frame de-duplication takes 1.0 seconds. There is no visible overhead doing de-duplication. Cheers! Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters Differential Revision: https://phabricator.intern.facebook.com/D5112636 Signature: t1:5112636:1495562232:dd0724198ebb7f2eb4e9fd3a83517fc9160fa01e
2017-05-23 21:51:10 +03:00
uint64_t v = (uint64_t)fi.code;
v ^= hashframeinfo(fi.back) << 1;
fid2hash[fid] = v;
}
return fid2hash[fid];
}
/* fill hash2fid */
static void buildframededup() {
for (auto& s : samples) {
frameid_t fid = s.frameid;
while (fid) {
uint64_t v = hashframeinfo(fid);
if (hash2fid.count(v) == 0) {
hash2fid[v] = fid;
fid = frames[fid].back;
} else {
break;
}
}
}
}
static frameid_t dedupfid(frameid_t fid) {
if (fid2hash.count(fid) == 0) {
return fid; /* no information available */
} else {
return hash2fid[fid2hash[fid]];
}
}
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
/* fill calltimes and summaries */
static void buildsummaries() {
std::unordered_map<frameid_t, std::list<Sample*>> calls;
for (auto& s : samples) {
frameid_t fid = dedupfid(s.frameid);
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
if (s.op == PyTrace_CALL) {
calls[fid].push_back(&s);
} else if (s.op == PyTrace_RETURN) {
auto& entries = calls[fid];
if (entries.empty())
continue;
/* frame was entered before */
Sample* prev = entries.back();
entries.pop_back();
auto &sum = summaries[fid];
sum.count += 1;
if (entries.empty())
sum.time += s.time - prev->time;
} /* s.op */
}
}
/* fill framechildren */
static void buildframetree() {
for (auto& s : samples) {
if (s.op != PyTrace_CALL && s.op != PyTrace_C_CALL) {
continue; /* only interested in call */
}
for (frameid_t fid = s.frameid; fid;) {
fid = dedupfid(fid);
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
auto& fi = frames[fid];
frameid_t pfid = dedupfid(fi.back);
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
auto& children = framechildren[pfid];
int existed = 0;
for (auto& c : children) {
if (c == fid) {
existed = 1;
break;
}
}
if (existed) {
break;
}
children.push_back(fid);
fid = pfid;
} /* for fid */
}
}
static std::string shortname(std::string path) {
size_t p = path.rfind('/');
if (p == std::string::npos)
return path;
/* special handling __init__.py, include its dirname */
if (p > 0 && path.substr(p + 1, p + 1 + 11) == "__init__.py") {
p = path.rfind('/', p - 1);
if (p == std::string::npos)
return path;
}
return path.substr(p + 1);
}
/* width needed to output time (in ms) */
static int timelen() {
static int n = 0;
if (n)
return n;
n = 1;
rdtsc_t maxframetime = 0;
for (auto& s : summaries) {
if (s.second.time > maxframetime)
maxframetime = s.second.time;
}
for (double t = (double)maxframetime * rdtscratio; t >= 10; t /= 10) {
n ++;
}
return n;
}
/* printf format to output time */
static const char *timefmt() {
static char fmt[20] = { 0 };
if (fmt[0] == 0) {
int n = timelen();
snprintf(fmt, sizeof fmt, "%%%d.0f", n);
}
return fmt;
}
static inline int fprintindent(FILE *fp, int indent) {
for (int i = 0; i < indent; ++i) fputc(' ', fp);
return (indent > 0 ? indent : 0);
}
/* config items */
static double timethreshold = 2;
static size_t countthreshold = 2;
static int dedup = 1;
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
static void settimethreshold(double ms) {
timethreshold = ms;
}
static void setcountthreshold(size_t count) {
countthreshold = count;
}
static void setdedup(int value) {
dedup = value;
}
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
static void fprintframetree(FILE *fp = stderr, frameid_t fid = 0,
int indent = 0, char ch = '|') {
auto& f = frames[fid];
auto& s = summaries[fid];
/* collect (> 2ms) child frames to print */
std::vector<frameid_t> cfids;
rdtsc_t ctotaltime = 0;
for (auto& cfid : framechildren[fid]) {
auto& cs = summaries[cfid];
if ((double)cs.time * rdtscratio >= timethreshold || cs.count == 0)
cfids.push_back(cfid);
ctotaltime += cs.time;
}
/* hot frame? frame time > 2 * sum(child frame time) and frame time > 30ms */
int hot = (s.time > ctotaltime * 2 && (double)s.time * rdtscratio > 30);
if (fid != 0) {
int ncol = 0;
/* hot symbol */
if (hot) {
ncol += fprintf(fp, "* ");
} else {
ncol += fprintf(fp, " ");
}
/* time in ms */
if (s.count > 0) {
ncol += fprintf(fp, timefmt(), (double)s.time * rdtscratio);
} else {
ncol += fprintindent(fp, timelen()); /* call not recorded */
}
/* symbol and indent */
ncol += fprintindent(fp, indent + 1);
ncol += fprintf(fp, "%c ", ch);
/* frame name */
traceprof: store raw Python code object Summary: Previously, the tracer extracts file name, function name from Python code objects and stores them as std::string. And frame the de-duplication logic hashes those strings, which is actually quite expensive. `hg id` takes 1.0 seconds without frame de-duplication, and 2.5 seconds with de-duplication. This diff assumes Python code object is unique (i.e. Python won't generate two code objects for a single code segment). That seems reasonable and `lsprof` seems to make a same assumption. With that assumption, just store `PyCodeObject` (contains file, method name and line number) instead of `std::string` (for file, method names) and line numbers for code identity and use the raw address of `PyCodeObject` to do frame de-duplication. That is helpful during profiling and de-duplication: - During profiling, no need to copy strings (convert to std::string), therefore less overhead - During frame de-duplication, no need to read long strings and hash them, therefore much faster Debug flag during compilation is added to make debugging easier as the code starts to mix of C++ and Python objects, which means mistakes could be made more easily. Test Plan: With this patch, `hg id` with frame de-duplication takes 1.0 seconds. There is no visible overhead doing de-duplication. Cheers! Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters Differential Revision: https://phabricator.intern.facebook.com/D5112636 Signature: t1:5112636:1495562232:dd0724198ebb7f2eb4e9fd3a83517fc9160fa01e
2017-05-23 21:51:10 +03:00
ncol += fprintf(fp, "%s ", f.name());
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
/* call count */
if (s.count >= countthreshold) {
ncol += fprintf(fp, "(%d times) ", s.count);
}
/* file path */
fprintindent(fp, 48 - ncol);
traceprof: store raw Python code object Summary: Previously, the tracer extracts file name, function name from Python code objects and stores them as std::string. And frame the de-duplication logic hashes those strings, which is actually quite expensive. `hg id` takes 1.0 seconds without frame de-duplication, and 2.5 seconds with de-duplication. This diff assumes Python code object is unique (i.e. Python won't generate two code objects for a single code segment). That seems reasonable and `lsprof` seems to make a same assumption. With that assumption, just store `PyCodeObject` (contains file, method name and line number) instead of `std::string` (for file, method names) and line numbers for code identity and use the raw address of `PyCodeObject` to do frame de-duplication. That is helpful during profiling and de-duplication: - During profiling, no need to copy strings (convert to std::string), therefore less overhead - During frame de-duplication, no need to read long strings and hash them, therefore much faster Debug flag during compilation is added to make debugging easier as the code starts to mix of C++ and Python objects, which means mistakes could be made more easily. Test Plan: With this patch, `hg id` with frame de-duplication takes 1.0 seconds. There is no visible overhead doing de-duplication. Cheers! Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters Differential Revision: https://phabricator.intern.facebook.com/D5112636 Signature: t1:5112636:1495562232:dd0724198ebb7f2eb4e9fd3a83517fc9160fa01e
2017-05-23 21:51:10 +03:00
std::string path = f.file();
ncol += fprintf(fp, "%s:%d", shortname(path).c_str(), f.line());
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
/* end of line */
fprintf(fp, "\n");
}
/* children */
indent += (ch == '\\');
if (cfids.size() > 1) {
indent += 1;
ch = '\\';
} else {
ch = '|';
}
for (auto& cfid : cfids) {
fprintframetree(fp, cfid, indent, ch);
}
}
static void clear() {
summaries.clear();
framechildren.clear();
fid2hash.clear();
hash2fid.clear();
samples.clear();
traceprof: store raw Python code object Summary: Previously, the tracer extracts file name, function name from Python code objects and stores them as std::string. And frame the de-duplication logic hashes those strings, which is actually quite expensive. `hg id` takes 1.0 seconds without frame de-duplication, and 2.5 seconds with de-duplication. This diff assumes Python code object is unique (i.e. Python won't generate two code objects for a single code segment). That seems reasonable and `lsprof` seems to make a same assumption. With that assumption, just store `PyCodeObject` (contains file, method name and line number) instead of `std::string` (for file, method names) and line numbers for code identity and use the raw address of `PyCodeObject` to do frame de-duplication. That is helpful during profiling and de-duplication: - During profiling, no need to copy strings (convert to std::string), therefore less overhead - During frame de-duplication, no need to read long strings and hash them, therefore much faster Debug flag during compilation is added to make debugging easier as the code starts to mix of C++ and Python objects, which means mistakes could be made more easily. Test Plan: With this patch, `hg id` with frame de-duplication takes 1.0 seconds. There is no visible overhead doing de-duplication. Cheers! Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters Differential Revision: https://phabricator.intern.facebook.com/D5112636 Signature: t1:5112636:1495562232:dd0724198ebb7f2eb4e9fd3a83517fc9160fa01e
2017-05-23 21:51:10 +03:00
frames.clear();
}
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
static void report(FILE *fp = stderr) {
if (dedup)
buildframededup();
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
buildsummaries();
buildframetree();
fprintframetree(fp, dedupfid(0));
traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: | <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 0 \ fib a.py:1 0 | fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` | <module> hg:10 | run dispatch.py:81 | dispatch dispatch.py:132 | _runcatch dispatch.py:202 | _callcatch dispatch.py:309 | callcatch scmutil.py:133 | _runcatchfunc dispatch.py:215 | _dispatch dispatch.py:744 2 \ repository hg.py:166 2 | _peerorrepo hg.py:151 2 | instance localrepo.py:2025 2 | __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 | _runcommand dispatch.py:926 863 | <lambda> dispatch.py:918 863 | check util.py:1059 863 | log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 | _logrevs cmdutil.py:2190 479 | revrange scmutil.py:429 479 | anyrevs localrepo.py:622 478 | mfunc revset.py:2275 478 | __init__ (4 times) smartset.py:1033 478 | __init__ (4 times) smartset.py:919 478 | __len__ (4 times) localrepo.py:587 478 | changelog repoview.py:309 477 | filterrevs (11 times) repoview.py:260 477 | computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 | getrevs obsolete.py:1114 455 | _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 | node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 | successors obsolete.py:672 225 \ __get__ util.py:798 224 | _all obsolete.py:662 6 \ tryread vfs.py:32 6 | read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 | _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 | cachehash repoview.py:98 3 | heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 | __contains__ (5226 times) ancestor.py:334 5 | parentrevs (4010 times) changelog.py:371 2 | parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 | _runpager ui.py:906 8 | __init__ subprocess.py:330 8 | _execute_child subprocess.py:880 5 | _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 | _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 | troubled context.py:225 12 \ unstable (2 times) context.py:207 12 | getrevs obsolete.py:1114 12 | _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 | __get__ (321 times) util.py:798 4 | _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 | obsolete (321 times) context.py:199 2 | getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 | getrevs obsolete.py:1114 319 | _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 | __get__ util.py:798 308 | precursors obsolete.py:678 * 308 | _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 | getrevs obsolete.py:1114 33 | _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 | <lambda> namespaces.py:43 9 | nodetags localrepo.py:741 8 | __get__ util.py:798 8 | _tagscache localrepo.py:646 8 | _findtags localrepo.py:685 7 | findglobaltags tags.py:170 7 | _readtagcache tags.py:370 6 | filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 | maybeprofile profiling.py:199 | __exit__ contextlib.py:21 | profile profiling.py:148 | __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
2017-05-23 01:49:45 +03:00
fprintf(fp, "Total time: %.0f ms\n", (double)(r2 - r1) * rdtscratio);
}