sapling

facebook/sapling

Fork 0

mirror of https://github.com/facebook/sapling.git synced 2024-10-11 09:17:30 +03:00

Commit Graph

Author	SHA1	Message	Date
Jun Wu	3b9d8cf876	traceprof: store raw Python code object Summary: Previously, the tracer extracts file name, function name from Python code objects and stores them as std::string. And frame the de-duplication logic hashes those strings, which is actually quite expensive. `hg id` takes 1.0 seconds without frame de-duplication, and 2.5 seconds with de-duplication. This diff assumes Python code object is unique (i.e. Python won't generate two code objects for a single code segment). That seems reasonable and `lsprof` seems to make a same assumption. With that assumption, just store `PyCodeObject` (contains file, method name and line number) instead of `std::string` (for file, method names) and line numbers for code identity and use the raw address of `PyCodeObject` to do frame de-duplication. That is helpful during profiling and de-duplication: - During profiling, no need to copy strings (convert to std::string), therefore less overhead - During frame de-duplication, no need to read long strings and hash them, therefore much faster Debug flag during compilation is added to make debugging easier as the code starts to mix of C++ and Python objects, which means mistakes could be made more easily. Test Plan: With this patch, `hg id` with frame de-duplication takes 1.0 seconds. There is no visible overhead doing de-duplication. Cheers! Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters Differential Revision: https://phabricator.intern.facebook.com/D5112636 Signature: t1:5112636:1495562232:dd0724198ebb7f2eb4e9fd3a83517fc9160fa01e	2017-05-23 11:51:10 -07:00
Jun Wu	85be37794d	traceprof: implement frame deduplication Summary: Previously, traceprof hashes frames using their raw memory addresses, which causes a same frame (same traceback) being recorded multiple times because of memory address change. That's for low overhead because hashing all the way back to the top frame is slow. This patch implements slow frame hashing and de-duplication so the output could look tidier in some cases. The slow de-duplication runs after profiling ends so it won't affect profiling accuracy. Test Plan: Trace IPython, before: ``` 6211 \ run posix.py:50 6152 \ call_inputhook (10 times) inputhook.py:60 .... 6147 \ select_fds (10 times) select.py:194 6 \ select select.py:85 6 \| select select.py:179 501 \ select select.py:85 * 501 \| select select.py:179 3775 \ select select.py:85 * 3775 \| select select.py:179 * 627 \ select (2 times) select.py:85 736 \ select (3 times) select.py:85 * 736 \| select (6 times) select.py:179 501 \ select select.py:85 * 501 \| select (2 times) select.py:179 ``` After: ``` 4397 \ run posix.py:50 4350 \ call_inputhook (4 times) inputhook.py:60 4349 \| select_fds (4 times) select.py:194 4348 \| select (4 times) select.py:85 * 4348 \| select (8 times) select.py:179 ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: mjpieters Differential Revision: https://phabricator.intern.facebook.com/D5112561 Signature: t1:5112561:1495562165:bf86c3283156d3b67918706dc875e6d3e28eac18	2017-05-23 11:50:33 -07:00
Jun Wu	f1be725a71	traceprof: new extension to perform accurate profiling Summary: Instead of sampling periodically like the default profiler `statprof`, trace every function call so the report is accurate. The output is designed to be similar to `statprof`. Written in C++ and inline assembly to minimize overhead. Cython is used as a glue layer. Comparing to `lsprof`, the overhead is similarly small. The major difference is this profiler cares about "frame object" and records `frame->f_back` (parent frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records "code object" instead and could have difficulty to deal with recursive calls. The following code could demostrate the difference: ``` def fib(x): if x < 2: return x return fib(x - 1) + fib(x - 2) # lsprof - recorded 1 code object for fib and no way to figure out callgraph import _lsprof p = _lsprof.Profiler() p.enable(subcalls=True) fib(5) p.disable() for s in p.getstats(): print(s) '''output: _lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None) _lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)]) ''' # traceprof (with timethreshold set to 0) - callgraph drawn correctly import sys from hgext3rd import traceprof with traceprof.profile(None, sys.stderr): fib(5) '''output: \| <module> a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \| fib a.py:1 0 \ fib a.py:1 0 \ fib a.py:1 0 \| fib a.py:1 0 \ fib a.py:1 0 \| fib a.py:1 \ __exit__ contextlib.py:21 ''' ``` There are some performance / correctness tradeoffs. The current implementation takes `frame` address and several (but not theoretically complete) fields to hash a frame. That could in theory lead to inaccurate frame information being recorded. But hashing all fields recursively incurs significant overhead. The related logic is the `hashframe` function, which could be tweaked later. This is an initial version. It probably does not cover all corner cases like exception handling well. And it may duplicate a same frame multiple times because of the way it hashes a frame. The duplicated frame issue could be solved by adding post-profiling filtering though. Here is a real-world example of `hg log -r .` on `hg-committed`: ``` \| <module> hg:10 \| run dispatch.py:81 \| dispatch dispatch.py:132 \| _runcatch dispatch.py:202 \| _callcatch dispatch.py:309 \| callcatch scmutil.py:133 \| _runcatchfunc dispatch.py:215 \| _dispatch dispatch.py:744 2 \ repository hg.py:166 2 \| _peerorrepo hg.py:151 2 \| instance localrepo.py:2025 2 \| __init__ localrepo.py:275 863 \ runcommand dispatch.py:656 863 \| _runcommand dispatch.py:926 863 \| <lambda> dispatch.py:918 863 \| check util.py:1059 863 \| log commands.py:3262 480 \ getlogrevs cmdutil.py:2236 479 \| _logrevs cmdutil.py:2190 479 \| revrange scmutil.py:429 479 \| anyrevs localrepo.py:622 478 \| mfunc revset.py:2275 478 \| __init__ (4 times) smartset.py:1033 478 \| __init__ (4 times) smartset.py:919 478 \| __len__ (4 times) localrepo.py:587 478 \| changelog repoview.py:309 477 \| filterrevs (11 times) repoview.py:260 477 \| computehidden repoview.py:171 455 \ hideablerevs repoview.py:24 455 \| getrevs obsolete.py:1114 455 \| _computeobsoleteset obsolete.py:1143 3 \ __get__ (2 times) localrepo.py:76 7 \ node (5760 times) changelog.py:359 4 \| node (8938 times) revlog.py:482 9 \ __get__ (5760 times) localrepo.py:76 428 \ __get__ util.py:798 428 \| successors obsolete.py:672 225 \ __get__ util.py:798 224 \| _all obsolete.py:662 6 \ tryread vfs.py:32 6 \| read vfs.py:60 200 \ _readmarkers obsolete.py:442 * 199 \| _fm1readmarkers obsolete.py:430 15 \ _checkinvalidmarkers obsolete.py:523 * 203 \ _addsuccessors obsolete.py:504 3 \ tryreadcache repoview.py:166 3 \| cachehash repoview.py:98 3 \| heads localrepo.py:1884 17 \ <genexpr> (5225 times) repoview.py:191 15 \| __contains__ (5226 times) ancestor.py:334 5 \| parentrevs (4010 times) changelog.py:371 2 \| parentrevs (4332 times) revlog.py:479 9 \ pager ui.py:843 9 \| _runpager ui.py:906 8 \| __init__ subprocess.py:330 8 \| _execute_child subprocess.py:880 5 \| _eintr_retry_call subprocess.py:118 374 \ show cmdutil.py:1308 374 \| _show cmdutil.py:1316 365 \ _changesetlabels cmdutil.py:1266 365 \| troubled context.py:225 12 \ unstable (2 times) context.py:207 12 \| getrevs obsolete.py:1114 12 \| _computeunstableset obsolete.py:1154 3 \ set (322 times) localrepo.py:610 3 \ revs (3 times) localrepo.py:593 8 \ __getitem__ (963 times) localrepo.py:566 4 \ parents (321 times) context.py:246 4 \| __get__ (321 times) util.py:798 4 \| _parents (321 times) context.py:562 3 \ <genexpr> (642 times) obsolete.py:1164 3 \| obsolete (321 times) context.py:199 2 \| getrevs (321 times) obsolete.py:1114 319 \ bumped (2 times) context.py:211 319 \| getrevs obsolete.py:1114 319 \| _computebumpedset obsolete.py:1181 4 \ set (322 times) localrepo.py:610 312 \ allprecursors (2021 times) obsolete.py:850 308 \| __get__ util.py:798 308 \| precursors obsolete.py:678 * 308 \| _addprecursors obsolete.py:509 34 \ divergent (2 times) context.py:218 34 \| getrevs obsolete.py:1114 33 \| _computedivergentset obsolete.py:1204 4 \ set (322 times) localrepo.py:610 24 \ successorssets (1368 times) obsolete.py:899 3 \ __get__ (1368 times) localrepo.py:76 4 \ __contains__ (200 times) localrepo.py:575 9 \ names (2 times) namespaces.py:184 9 \| <lambda> namespaces.py:43 9 \| nodetags localrepo.py:741 8 \| __get__ util.py:798 8 \| _tagscache localrepo.py:646 8 \| _findtags localrepo.py:685 7 \| findglobaltags tags.py:170 7 \| _readtagcache tags.py:370 6 \| filteredhash scmutil.py:307 \ __exit__ contextlib.py:21 \| maybeprofile profiling.py:199 \| __exit__ contextlib.py:21 \| profile profiling.py:148 \| __exit__ contextlib.py:21 Total time: 867 ms ``` Some example conclusions from reading the above output include: 1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow 2. `changelog.node` and `changelog.parentrevs` are called thousands of times, but they are just a few milliseconds. 3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is helpful for 3. `statprof` is not accurate enough to be confident about 5ms, `lsprof` will include other `vfs.read` calls so it's impossible to see just the read for obsstore. Test Plan: With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run: ``` HGPROF=traceprof hg id --config extensions.traceprof= --profile ``` Reviewers: #mercurial, rmcelroy Reviewed By: rmcelroy Subscribers: rmcelroy, mjpieters, jiangyuan Differential Revision: https://phabricator.intern.facebook.com/D5104851 Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836	2017-05-22 15:49:45 -07:00

Author

SHA1

Message

Date

Jun Wu

3b9d8cf876

traceprof: store raw Python code object

Summary:
Previously, the tracer extracts file name, function name from Python code
objects and stores them as std::string. And frame the de-duplication logic
hashes those strings, which is actually quite expensive.

`hg id` takes 1.0 seconds without frame de-duplication, and 2.5 seconds with
de-duplication.

This diff assumes Python code object is unique (i.e. Python won't generate two
code objects for a single code segment). That seems reasonable and `lsprof`
seems to make a same assumption.

With that assumption, just store `PyCodeObject` (contains file, method name and
line number) instead of `std::string` (for file, method names) and line numbers
for code identity and use the raw address of `PyCodeObject` to do frame
de-duplication.

That is helpful during profiling and de-duplication:

  - During profiling, no need to copy strings (convert to std::string),
    therefore less overhead
  - During frame de-duplication, no need to read long strings and hash them,
    therefore much faster

Debug flag during compilation is added to make debugging easier as the code
starts to mix of C++ and Python objects, which means mistakes could be made
more easily.

Test Plan:
With this patch, `hg id` with frame de-duplication takes 1.0 seconds.
There is no visible overhead doing de-duplication. Cheers!

Reviewers: #mercurial, rmcelroy

Reviewed By: rmcelroy

Subscribers: rmcelroy, mjpieters

Differential Revision: https://phabricator.intern.facebook.com/D5112636

Signature: t1:5112636:1495562232:dd0724198ebb7f2eb4e9fd3a83517fc9160fa01e

2017-05-23 11:51:10 -07:00

Jun Wu

85be37794d

traceprof: implement frame deduplication

Summary:
Previously, traceprof hashes frames using their raw memory addresses, which
causes a same frame (same traceback) being recorded multiple times because
of memory address change. That's for low overhead because hashing all the
way back to the top frame is slow.

This patch implements slow frame hashing and de-duplication so the output
could look tidier in some cases. The slow de-duplication runs after
profiling ends so it won't affect profiling accuracy.

Test Plan:
Trace IPython, before:

```
  6211          \ run                           posix.py:50
  6152            \ call_inputhook (10 times)   inputhook.py:60
                    ....
  6147              \ select_fds (10 times)     select.py:194
     6                \ select                  select.py:85
     6                 | select                 select.py:179
   501                \ select                  select.py:85
*  501                 | select                 select.py:179
  3775                \ select                  select.py:85
* 3775                 | select                 select.py:179
*  627                \ select (2 times)        select.py:85
   736                \ select (3 times)        select.py:85
*  736                 | select (6 times)       select.py:179
   501                \ select                  select.py:85
*  501                 | select (2 times)       select.py:179
```

After:

```
  4397          \ run                           posix.py:50
  4350            \ call_inputhook (4 times)    inputhook.py:60
  4349             | select_fds (4 times)       select.py:194
  4348             | select (4 times)           select.py:85
* 4348             | select (8 times)           select.py:179
```

Reviewers: #mercurial, rmcelroy

Reviewed By: rmcelroy

Subscribers: mjpieters

Differential Revision: https://phabricator.intern.facebook.com/D5112561

Signature: t1:5112561:1495562165:bf86c3283156d3b67918706dc875e6d3e28eac18

2017-05-23 11:50:33 -07:00

Jun Wu

f1be725a71

traceprof: new extension to perform accurate profiling

Summary:
Instead of sampling periodically like the default profiler `statprof`, trace
every function call so the report is accurate. The output is designed to be
similar to `statprof`.

Written in C++ and inline assembly to minimize overhead. Cython is used as a
glue layer.

Comparing to `lsprof`, the overhead is similarly small. The major difference is
this profiler cares about "frame object" and records `frame->f_back` (parent
frame) so it could draw a statprof-like tree of the callgraph. `lsprof` records
"code object" instead and could have difficulty to deal with recursive calls.

The following code could demostrate the difference:

```
def fib(x):
    if x < 2:
        return x
    return fib(x - 1) + fib(x - 2)

# lsprof - recorded 1 code object for fib and no way to figure out callgraph
import _lsprof
p = _lsprof.Profiler()
p.enable(subcalls=True)
fib(5)
p.disable()
for s in p.getstats():
    print(s)

'''output:
_lsprof.profiler_entry(code="<method 'disable' of '_lsprof.Profiler' objects>", callcount=1, reccallcount=0, totaltime=0.0, inlinetime=0.0, calls=None)
_lsprof.profiler_entry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=15, reccallcount=14, totaltime=8e-06, inlinetime=8e-06, calls=[_lsprof.profiler_subentry(code=<code object fib at 0x7f27ca587530, file "a.py", line 1>, callcount=14, reccallcount=12, totaltime=6e-06, inlinetime=6e-06)])
'''

# traceprof (with timethreshold set to 0) - callgraph drawn correctly
import sys
from hgext3rd import traceprof
with traceprof.profile(None, sys.stderr):
    fib(5)

'''output:
    | <module>                                  a.py:1
  0  \ fib                                      a.py:1
  0    \ fib                                    a.py:1
  0      \ fib                                  a.py:1
  0        \ fib                                a.py:1
  0         | fib                               a.py:1
  0        \ fib                                a.py:1
  0      \ fib                                  a.py:1
  0       | fib                                 a.py:1
  0    \ fib                                    a.py:1
  0     | fib                                   a.py:1
     \ __exit__                                 contextlib.py:21
'''
```

There are some performance / correctness tradeoffs. The current implementation
takes `frame` address and several (but not theoretically complete) fields to
hash a frame.  That could in theory lead to inaccurate frame information being
recorded. But hashing all fields recursively incurs significant overhead. The
related logic is the `hashframe` function, which could be tweaked later.

This is an initial version. It probably does not cover all corner cases like
exception handling well. And it may duplicate a same frame multiple times
because of the way it hashes a frame. The duplicated frame issue could be
solved by adding post-profiling filtering though.

Here is a real-world example of `hg log -r .` on `hg-committed`:

```
      | <module>                                hg:10
      | run                                     dispatch.py:81
      | dispatch                                dispatch.py:132
      | _runcatch                               dispatch.py:202
      | _callcatch                              dispatch.py:309
      | callcatch                               scmutil.py:133
      | _runcatchfunc                           dispatch.py:215
      | _dispatch                               dispatch.py:744
    2  \ repository                             hg.py:166
    2   | _peerorrepo                           hg.py:151
    2   | instance                              localrepo.py:2025
    2   | __init__                              localrepo.py:275
  863  \ runcommand                             dispatch.py:656
  863   | _runcommand                           dispatch.py:926
  863   | <lambda>                              dispatch.py:918
  863   | check                                 util.py:1059
  863   | log                                   commands.py:3262
  480    \ getlogrevs                           cmdutil.py:2236
  479     | _logrevs                            cmdutil.py:2190
  479     | revrange                            scmutil.py:429
  479     | anyrevs                             localrepo.py:622
  478     | mfunc                               revset.py:2275
  478     | __init__ (4 times)                  smartset.py:1033
  478     | __init__ (4 times)                  smartset.py:919
  478     | __len__ (4 times)                   localrepo.py:587
  478     | changelog                           repoview.py:309
  477     | filterrevs (11 times)               repoview.py:260
  477     | computehidden                       repoview.py:171
  455      \ hideablerevs                       repoview.py:24
  455       | getrevs                           obsolete.py:1114
  455       | _computeobsoleteset               obsolete.py:1143
    3        \ __get__ (2 times)                localrepo.py:76
    7        \ node (5760 times)                changelog.py:359
    4         | node (8938 times)               revlog.py:482
    9        \ __get__ (5760 times)             localrepo.py:76
  428        \ __get__                          util.py:798
  428         | successors                      obsolete.py:672
  225          \ __get__                        util.py:798
  224           | _all                          obsolete.py:662
    6            \ tryread                      vfs.py:32
    6             | read                        vfs.py:60
  200            \ _readmarkers                 obsolete.py:442
* 199             | _fm1readmarkers             obsolete.py:430
   15            \ _checkinvalidmarkers         obsolete.py:523
* 203          \ _addsuccessors                 obsolete.py:504
    3      \ tryreadcache                       repoview.py:166
    3       | cachehash                         repoview.py:98
    3       | heads                             localrepo.py:1884
   17      \ <genexpr> (5225 times)             repoview.py:191
   15       | __contains__ (5226 times)         ancestor.py:334
    5       | parentrevs (4010 times)           changelog.py:371
    2       | parentrevs (4332 times)           revlog.py:479
    9    \ pager                                ui.py:843
    9     | _runpager                           ui.py:906
    8     | __init__                            subprocess.py:330
    8     | _execute_child                      subprocess.py:880
    5     | _eintr_retry_call                   subprocess.py:118
  374    \ show                                 cmdutil.py:1308
  374     | _show                               cmdutil.py:1316
  365      \ _changesetlabels                   cmdutil.py:1266
  365       | troubled                          context.py:225
   12        \ unstable (2 times)               context.py:207
   12         | getrevs                         obsolete.py:1114
   12         | _computeunstableset             obsolete.py:1154
    3          \ set (322 times)                localrepo.py:610
    3            \ revs (3 times)               localrepo.py:593
    8            \ __getitem__ (963 times)      localrepo.py:566
    4          \ parents (321 times)            context.py:246
    4           | __get__ (321 times)           util.py:798
    4           | _parents (321 times)          context.py:562
    3          \ <genexpr> (642 times)          obsolete.py:1164
    3           | obsolete (321 times)          context.py:199
    2           | getrevs (321 times)           obsolete.py:1114
  319        \ bumped (2 times)                 context.py:211
  319         | getrevs                         obsolete.py:1114
  319         | _computebumpedset               obsolete.py:1181
    4          \ set (322 times)                localrepo.py:610
  312          \ allprecursors (2021 times)     obsolete.py:850
  308           | __get__                       util.py:798
  308           | precursors                    obsolete.py:678
* 308           | _addprecursors                obsolete.py:509
   34        \ divergent (2 times)              context.py:218
   34         | getrevs                         obsolete.py:1114
   33         | _computedivergentset            obsolete.py:1204
    4          \ set (322 times)                localrepo.py:610
   24          \ successorssets (1368 times)    obsolete.py:899
    3            \ __get__ (1368 times)         localrepo.py:76
    4            \ __contains__ (200 times)     localrepo.py:575
    9      \ names (2 times)                    namespaces.py:184
    9       | <lambda>                          namespaces.py:43
    9       | nodetags                          localrepo.py:741
    8       | __get__                           util.py:798
    8       | _tagscache                        localrepo.py:646
    8       | _findtags                         localrepo.py:685
    7       | findglobaltags                    tags.py:170
    7       | _readtagcache                     tags.py:370
    6       | filteredhash                      scmutil.py:307
       \ __exit__                               contextlib.py:21
        | maybeprofile                          profiling.py:199
        | __exit__                              contextlib.py:21
        | profile                               profiling.py:148
        | __exit__                              contextlib.py:21
Total time: 867 ms
```

Some example conclusions from reading the above output include:

  1. `_fm1readmarkers`, `_addsuccessors`, `_addprecursors` are slow
  2. `changelog.node` and `changelog.parentrevs` are called thousands of times,
     but they are just a few milliseconds.
  3. Reading the `obsstore` file (`vfs.read`) takes 5ms exactly

While `statprof` may get 1 right, `lsprof` may get 2 right, neither of them is
helpful for 3. `statprof` is not accurate enough to be confident about 5ms,
`lsprof` will include other `vfs.read` calls so it's impossible to see just the
read for obsstore.


Test Plan:
With the upstream [patch](https://patchwork.mercurial-scm.org/patch/20822/), run:

```
 HGPROF=traceprof hg id --config extensions.traceprof= --profile
```

Reviewers: #mercurial, rmcelroy

Reviewed By: rmcelroy

Subscribers: rmcelroy, mjpieters, jiangyuan

Differential Revision: https://phabricator.intern.facebook.com/D5104851

Signature: t1:5104851:1495480386:666847ea08bb6a94658bc10d7f0e91c225d56836

2017-05-22 15:49:45 -07:00

3 Commits