traceprof: new extension to perform accurate profiling

Summary:
Instead of sampling periodically like the default profiler `statprof`, trace
every function call so the report is accurate. The output is designed to be
similar to `statprof`.

Written in C++ and inline assembly to minimize overhead. Cython is used as a
glue layer.

Comparing to `lsprof`, the overhead is similarly small. The major difference is
this profiler cares about "frame object" and records `frame->f_back` (parent
frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records
"code object" instead and could have difficulty to deal with recursive calls.

The following code could demostrate the difference:

```
def fib(x):
    if x < 2:
        return x
    return fib(x - 1) + fib(x - 2)

# lsprof - recorded 1 code object for fib and no way to figure out callgraph
import _lsprof
p = _lsprof.Profiler()
p.enable(subcalls=True)
fib(5)
p.disable()
for s in p.getstats():
    print(s)

'''output:
_lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None)
_lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)])
'''

# traceprof (with timethreshold set to 0) - callgraph drawn correctly
import sys
from hgext3rd import traceprof
with traceprof.profile(None, sys.stderr):
    fib(5)

'''output:
    | <module>                                  a.py:1
  0  \ fib                                      a.py:1
  0    \ fib                                    a.py:1
  0      \ fib                                  a.py:1
  0        \ fib                                a.py:1
  0         | fib                               a.py:1
  0        \ fib                                a.py:1
  0      \ fib                                  a.py:1
  0       | fib                                 a.py:1
  0    \ fib                                    a.py:1
  0     | fib                                   a.py:1
     \ __exit__                                 contextlib.py:21
'''
```

There are some performance / correctness tradeoffs. The current implementation
takes `frame` address and several (but not theoretically complete) fields to
hash a frame.  That could in theory lead to inaccurate frame information being
recorded. But hashing all fields recursively incurs significant overhead. The
related logic is the `hashframe` function, which could be tweaked later.

This is an initial version. It probably does not cover all corner cases like
exception handling well. And it may duplicate a same frame multiple times
because of the way it hashes a frame. The duplicated frame issue could be
solved by adding post-profiling filtering though.

Here is a real-world example of `hg log -r .` on `hg-committed`:

```
      | <module>                                hg:10
      | run                                     dispatch.py:81
      | dispatch                                dispatch.py:132
      | _runcatch                               dispatch.py:202
      | _callcatch                              dispatch.py:309
      | callcatch                               scmutil.py:133
      | _runcatchfunc                           dispatch.py:215
      | _dispatch                               dispatch.py:744
    2  \ repository                             hg.py:166
    2   | _peerorrepo                           hg.py:151
    2   | instance                              localrepo.py:2025
    2   | __init__                              localrepo.py:275
  863  \ runcommand                             dispatch.py:656
  863   | _runcommand                           dispatch.py:926
  863   | <lambda>                              dispatch.py:918
  863   | check                                 util.py:1059
  863   | log                                   commands.py:3262
  480    \ getlogrevs                           cmdutil.py:2236
  479     | _logrevs                            cmdutil.py:2190
  479     | revrange                            scmutil.py:429
  479     | anyrevs                             localrepo.py:622
  478     | mfunc                               revset.py:2275
  478     | __init__ (4 times)                  smartset.py:1033
  478     | __init__ (4 times)                  smartset.py:919
  478     | __len__ (4 times)                   localrepo.py:587
  478     | changelog                           repoview.py:309
  477     | filterrevs (11 times)               repoview.py:260
  477     | computehidden                       repoview.py:171
  455      \ hideablerevs                       repoview.py:24
  455       | getrevs                           obsolete.py:1114
  455       | _computeobsoleteset               obsolete.py:1143
    3        \ __get__ (2 times)                localrepo.py:76
    7        \ node (5760 times)                changelog.py:359
    4         | node (8938 times)               revlog.py:482
    9        \ __get__ (5760 times)             localrepo.py:76
  428        \ __get__                          util.py:798
  428         | successors                      obsolete.py:672
  225          \ __get__                        util.py:798
  224           | _all                          obsolete.py:662
    6            \ tryread                      vfs.py:32
    6             | read                        vfs.py:60
  200            \ _readmarkers                 obsolete.py:442
* 199             | _fm1readmarkers             obsolete.py:430
   15            \ _checkinvalidmarkers         obsolete.py:523
* 203          \ _addsuccessors                 obsolete.py:504
    3      \ tryreadcache                       repoview.py:166
    3       | cachehash                         repoview.py:98
    3       | heads                             localrepo.py:1884
   17      \ <genexpr> (5225 times)             repoview.py:191
   15       | __contains__ (5226 times)         ancestor.py:334
    5       | parentrevs (4010 times)           changelog.py:371
    2       | parentrevs (4332 times)           revlog.py:479
    9    \ pager                                ui.py:843
    9     | _runpager                           ui.py:906
    8     | __init__                            subprocess.py:330
    8     | _execute_child                      subprocess.py:880
    5     | _eintr_retry_call                   subprocess.py:118
  374    \ show                                 cmdutil.py:1308
  374     | _show                               cmdutil.py:1316
  365      \ _changesetlabels                   cmdutil.py:1266
  365       | troubled                          context.py:225
   12        \ unstable (2 times)               context.py:207
   12         | getrevs                         obsolete.py:1114
   12         | _computeunstableset             obsolete.py:1154
    3          \ set (322 times)                localrepo.py:610
    3            \ revs (3 times)               localrepo.py:593
    8            \ __getitem__ (963 times)      localrepo.py:566
    4          \ parents (321 times)            context.py:246
    4           | __get__ (321 times)           util.py:798
    4           | _parents (321 times)          context.py:562
    3          \ <genexpr> (642 times)          obsolete.py:1164
    3           | obsolete (321 times)          context.py:199
    2           | getrevs (321 times)           obsolete.py:1114
  319        \ bumped (2 times)                 context.py:211
  319         | getrevs                         obsolete.py:1114
  319         | _computebumpedset               obsolete.py:1181
    4          \ set (322 times)                localrepo.py:610
  312          \ allprecursors (2021 times)     obsolete.py:850
  308           | __get__                       util.py:798
  308           | precursors                    obsolete.py:678
* 308           | _addprecursors                obsolete.py:509
   34        \ divergent (2 times)              context.py:218
   34         | getrevs                         obsolete.py:1114
   33         | _computedivergentset            obsolete.py:1204
    4          \ set (322 times)                localrepo.py:610
   24          \ successorssets (1368 times)    obsolete.py:899
    3            \ __get__ (1368 times)         localrepo.py:76
    4            \ __contains__ (200 times)     localrepo.py:575
    9      \ names (2 times)                    namespaces.py:184
    9       | <lambda>                          namespaces.py:43
    9       | nodetags                          localrepo.py:741
    8       | __get__                           util.py:798
    8       | _tagscache                        localrepo.py:646
    8       | _findtags                         localrepo.py:685
    7       | findglobaltags                    tags.py:170
    7       | _readtagcache                     tags.py:370
    6       | filteredhash                      scmutil.py:307
       \ __exit__                               contextlib.py:21
        | maybeprofile                          profiling.py:199
        | __exit__                              contextlib.py:21
        | profile                               profiling.py:148
        | __exit__                              contextlib.py:21
Total time: 867 ms
```

Some example conclusions from reading the above output include:

  1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow
  2. `changelog.node` and `changelog.parentrevs` are called thousands of times,
     but they are just a few milliseconds.
  3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly

While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is
helpful for 3. `statprof` is not accurate enough to be confident about 5ms,
`lsprof` will include other `vfs.read` calls so it's impossible to see just the
read for obsstore.


Test Plan:
With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run:

```
 HGPROF=traceprof hg id --config extensions.traceprof= --profile
```

Reviewers: #mercurial, rmcelroy

Reviewed By: rmcelroy

Subscribers: rmcelroy, mjpieters, jiangyuan

Differential Revision: https://phabricator.intern.facebook.com/D5104851

Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836
This commit is contained in:
Jun Wu 2017-05-22 15:49:45 -07:00
parent dcaee39e39
commit f1be725a71
4 changed files with 376 additions and 0 deletions

View File

@ -12,6 +12,7 @@ syntax: regexp
.idea
.testtimes*
^hgext3rd/.*\.c$
^hgext3rd/traceprof\.c.*$
^cdatapack/cdatapack_dump$
subinclude:cfastmanifest/.hgignore

59
hgext3rd/traceprof.pyx Normal file
View File

@ -0,0 +1,59 @@
# distutils: language = c++
# traceprof.pyx - C++ to Python bridge for the traceprof Mercurial extension
#
# Copyright 2017 Facebook, Inc.
#
# This software may be used and distributed according to the terms of the
# GNU General Public License version 2 or any later version.
"""accurate callgraph profiling
lsprof's high precision, plus statprof's intuitive output format.
Config::
[traceprof]
# whether to disable Python GC before profiling
disablegc = no
# minimal microseconds to show a function
timethreshold = 2000
# minimal call count to show "(N times)"
countthreshold = 2
"""
from libc.stdio cimport FILE
from cpython.object cimport PyObject
import contextlib
import gc
cdef extern from "traceprofimpl.cpp":
void enable()
void disable()
void report(FILE *)
void settimethreshold(double)
void setcountthreshold(size_t)
cdef extern from "Python.h":
FILE* PyFile_AsFile(PyObject *p)
@contextlib.contextmanager
def profile(ui, fp):
if ui is not None:
if ui.configbool('traceprof', 'disablegc'):
gc.disable() # slightly more predictable
microseconds = ui.configint('traceprof', 'timethreshold')
if microseconds is not None:
settimethreshold((<double>microseconds) / 1000.0)
count = ui.configint('traceprof', 'countthreshold')
if count is not None:
setcountthreshold(count)
enable()
try:
yield
finally:
disable()
report(PyFile_AsFile(<PyObject *>fp))

303
hgext3rd/traceprofimpl.cpp Normal file
View File

@ -0,0 +1,303 @@
/* traceprofimpl.cpp - main implementation of traceprofile
*
* Copyright 2017 Facebook, Inc.
*
* This software may be used and distributed according to the terms of the
* GNU General Public License version 2 or any later version. */
#include "Python.h"
#include "frameobject.h"
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>
#include <sys/time.h>
#include <unistd.h>
typedef int lineno_t;
typedef uint64_t frameid_t;
typedef uint64_t rdtsc_t;
/* tracing ------------------------------------------------------------------ */
/* information about a raw Python frame */
struct FrameInfo {
std::string file;
std::string name;
frameid_t back;
lineno_t line;
};
/* samples */
struct Sample {
rdtsc_t time;
frameid_t frameid;
int op; /* PyTrace_{CALL,EXCEPTION,LINE,RETURN,C_CALL,C_EXCEPTION,C_RETURN} */
};
/* global state, collected frames and samples */
static std::unordered_map<frameid_t, FrameInfo> frames;
static std::list<Sample> samples;
/* for measuring wall time / rdtsc ratio */
static struct timeval t1, t2;
static rdtsc_t r1, r2;
static double rdtscratio; /* set by disable() */
/* fast way to get wall time */
inline static rdtsc_t rdtsc() {
unsigned long lo, hi;
asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
return (rdtsc_t)(lo | (hi << 32));
}
/* fast, but inaccurate hashing of a Python frame */
inline static uint64_t hashframe(PyFrameObject *frame) {
uint64_t v = ((uint64_t)frame ^ ((uint64_t)(frame->f_back) << 16) ^
((uint64_t)frame->f_code << 32));
/* f_code is usually immutable (lsprof use its address as keys) */
return v;
}
/* hash, and store a Python frame, return its ID */
static frameid_t hashandstoreframe(PyFrameObject *frame) {
if (!frame) return 0;
frameid_t frameid = (frameid_t)hashframe(frame);
if (frames.count(frameid) == 0) {
FrameInfo fi;
fi.file = PyString_AsString(frame->f_code->co_filename);
fi.name = PyString_AsString(frame->f_code->co_name);
fi.back = hashandstoreframe(frame->f_back);
fi.line = frame->f_code->co_firstlineno;
frames[frameid] = fi;
}
return frameid;
}
/* record a sample */
inline static void recordframe(PyFrameObject *frame, int op) {
frameid_t fid = hashandstoreframe(frame);
samples.push_back({rdtsc(), fid, op});
}
static int tracefunc(PyObject *o, PyFrameObject *frame, int op, PyObject *a) {
(void)o;
(void)a;
recordframe(frame, op);
return 0;
}
static void enable() {
r1 = rdtsc();
gettimeofday(&t1, NULL);
PyEval_SetProfile((Py_tracefunc) tracefunc, NULL);
}
static void disable() {
PyEval_SetProfile(NULL, NULL);
r2 = rdtsc();
gettimeofday(&t2, NULL);
/* calculate rdtscratio */
double dt = 0; /* ms */
dt += (double)(t2.tv_sec - t1.tv_sec) * 1000.0; /* sec to ms */
dt += (double)(t2.tv_usec - t1.tv_usec) / 1000.0; /* us to ms */
rdtscratio = dt / (double)(r2 - r1);
}
/* reporting ---------------------------------------------------------------- */
struct FrameSummary {
rdtsc_t time;
unsigned int count;
};
static std::unordered_map<frameid_t, FrameSummary> summaries;
static std::unordered_map<frameid_t, std::list<frameid_t> > framechildren;
/* fill calltimes and summaries */
static void buildsummaries() {
std::unordered_map<frameid_t, std::list<Sample*>> calls;
for (auto& s : samples) {
frameid_t fid = s.frameid;
if (s.op == PyTrace_CALL) {
calls[fid].push_back(&s);
} else if (s.op == PyTrace_RETURN) {
auto& entries = calls[fid];
if (entries.empty())
continue;
/* frame was entered before */
Sample* prev = entries.back();
entries.pop_back();
auto &sum = summaries[fid];
sum.count += 1;
if (entries.empty())
sum.time += s.time - prev->time;
} /* s.op */
}
}
/* fill framechildren */
static void buildframetree() {
for (auto& s : samples) {
if (s.op != PyTrace_CALL && s.op != PyTrace_C_CALL) {
continue; /* only interested in call */
}
for (frameid_t fid = s.frameid; fid;) {
auto& fi = frames[fid];
frameid_t pfid = fi.back;
auto& children = framechildren[pfid];
int existed = 0;
for (auto& c : children) {
if (c == fid) {
existed = 1;
break;
}
}
if (existed) {
break;
}
children.push_back(fid);
fid = pfid;
} /* for fid */
}
}
static std::string shortname(std::string path) {
size_t p = path.rfind('/');
if (p == std::string::npos)
return path;
/* special handling __init__.py, include its dirname */
if (p > 0 && path.substr(p + 1, p + 1 + 11) == "__init__.py") {
p = path.rfind('/', p - 1);
if (p == std::string::npos)
return path;
}
return path.substr(p + 1);
}
/* width needed to output time (in ms) */
static int timelen() {
static int n = 0;
if (n)
return n;
n = 1;
rdtsc_t maxframetime = 0;
for (auto& s : summaries) {
if (s.second.time > maxframetime)
maxframetime = s.second.time;
}
for (double t = (double)maxframetime * rdtscratio; t >= 10; t /= 10) {
n ++;
}
return n;
}
/* printf format to output time */
static const char *timefmt() {
static char fmt[20] = { 0 };
if (fmt[0] == 0) {
int n = timelen();
snprintf(fmt, sizeof fmt, "%%%d.0f", n);
}
return fmt;
}
static inline int fprintindent(FILE *fp, int indent) {
for (int i = 0; i < indent; ++i) fputc(' ', fp);
return (indent > 0 ? indent : 0);
}
/* config items */
static double timethreshold = 2;
static size_t countthreshold = 2;
static void settimethreshold(double ms) {
timethreshold = ms;
}
static void setcountthreshold(size_t count) {
countthreshold = count;
}
static void fprintframetree(FILE *fp = stderr, frameid_t fid = 0,
int indent = 0, char ch = '|') {
auto& f = frames[fid];
auto& s = summaries[fid];
/* collect (> 2ms) child frames to print */
std::vector<frameid_t> cfids;
rdtsc_t ctotaltime = 0;
for (auto& cfid : framechildren[fid]) {
auto& cs = summaries[cfid];
if ((double)cs.time * rdtscratio >= timethreshold || cs.count == 0)
cfids.push_back(cfid);
ctotaltime += cs.time;
}
/* hot frame? frame time > 2 * sum(child frame time) and frame time > 30ms */
int hot = (s.time > ctotaltime * 2 && (double)s.time * rdtscratio > 30);
if (fid != 0) {
int ncol = 0;
/* hot symbol */
if (hot) {
ncol += fprintf(fp, "* ");
} else {
ncol += fprintf(fp, " ");
}
/* time in ms */
if (s.count > 0) {
ncol += fprintf(fp, timefmt(), (double)s.time * rdtscratio);
} else {
ncol += fprintindent(fp, timelen()); /* call not recorded */
}
/* symbol and indent */
ncol += fprintindent(fp, indent + 1);
ncol += fprintf(fp, "%c ", ch);
/* frame name */
ncol += fprintf(fp, "%s ", f.name.c_str());
/* call count */
if (s.count >= countthreshold) {
ncol += fprintf(fp, "(%d times) ", s.count);
}
/* file path */
fprintindent(fp, 48 - ncol);
std::string path = f.file;
ncol += fprintf(fp, "%s:%d", shortname(path).c_str(), f.line);
/* end of line */
fprintf(fp, "\n");
}
/* children */
indent += (ch == '\\');
if (cfids.size() > 1) {
indent += 1;
ch = '\\';
} else {
ch = '|';
}
for (auto& cfid : cfids) {
fprintframetree(fp, cfid, indent, ch);
}
}
static void report(FILE *fp = stderr) {
buildsummaries();
buildframetree();
fprintframetree(fp);
fprintf(fp, "Total time: %.0f ms\n", (double)(r2 - r1) * rdtscratio);
}

View File

@ -16,6 +16,7 @@ WEXTRA = "" if iswindows else "-Wextra"
WCONVERSION = "" if iswindows else "-Wconversion"
PEDANTIC = "" if iswindows else "-pedantic"
NOOPTIMIZATION = "/Od" if iswindows else "-O0"
OPTIMIZATION = "" if iswindows else "-O2"
PRODUCEDEBUGSYMBOLS = "/DEBUG:FULL" if iswindows else "-g"
# --component allows the caller to specify what components they want. We can't
@ -290,6 +291,15 @@ else:
]),
),
],
'traceprof': [
Extension('hgext3rd.traceprof',
sources=['hgext3rd/traceprof.pyx'],
include_dirs=['hgext3rd/'],
extra_compile_args=filter(None, [
OPTIMIZATION, STDCPP0X, WALL, WEXTRA, WCONVERSION, PEDANTIC,
]),
),
]
}
allnames = availablepackages + availableextmodules.keys() + availablepymodules
@ -324,6 +334,7 @@ else:
cythonmodules = [
'linelog',
'patchrmdir',
'traceprof',
]
for cythonmodule in cythonmodules:
if cythonmodule in components:
@ -409,6 +420,8 @@ class CleanExtCommand(Command):
for path in self._rglob('*.pyx'):
cpath = '%s.c' % path[:-4]
removepath(cpath)
cpppath = cpath + 'pp'
removepath(cpppath)
# print short summary
if removed: