# tasty-bench [![Hackage](http://img.shields.io/hackage/v/tasty-bench.svg)](https://hackage.haskell.org/package/tasty-bench) [![Stackage LTS](http://stackage.org/package/tasty-bench/badge/lts)](http://stackage.org/lts/package/tasty-bench) [![Stackage Nightly](http://stackage.org/package/tasty-bench/badge/nightly)](http://stackage.org/nightly/package/tasty-bench)

Featherlight benchmark framework (only one file!) for performance measurement
with an API mimicking [`criterion`](http://hackage.haskell.org/package/criterion)
and [`gauge`](http://hackage.haskell.org/package/gauge).
A prominent feature is built-in comparison against previous runs
and between benchmarks.

<!-- MarkdownTOC autolink="true" -->

- [How lightweight is it?](#how-lightweight-is-it)
- [How is it possible?](#how-is-it-possible)
- [How to switch?](#how-to-switch)
- [How to write a benchmark?](#how-to-write-a-benchmark)
- [How to read results?](#how-to-read-results)
- [Wall-clock time vs. CPU time](#wall-clock-time-vs-cpu-time)
- [Statistical model](#statistical-model)
- [Memory usage](#memory-usage)
- [Combining tests and benchmarks](#combining-tests-and-benchmarks)
- [Troubleshooting](#troubleshooting)
- [Isolating interfering benchmarks](#isolating-interfering-benchmarks)
- [Comparison against baseline](#comparison-against-baseline)
- [Comparison between benchmarks](#comparison-between-benchmarks)
- [Plotting results](#plotting-results)
- [Build flags](#build-flags)
- [Command-line options](#command-line-options)
- [Custom command-line options](#custom-command-line-options)

<!-- /MarkdownTOC -->

## How lightweight is it?

There is only one source file `Test.Tasty.Bench` and no non-boot dependencies
except [`tasty`](http://hackage.haskell.org/package/tasty).
So if you already depend on `tasty` for a test suite, there
is nothing else to install.

Compare this to `criterion` (10+ modules, 50+ dependencies) and `gauge` (40+ modules, depends on `basement` and `vector`). A build on a clean machine is up to 16x
faster than `criterion` and up to 4x faster than `gauge`. A build without dependencies
is up to 6x faster than `criterion` and up to 8x faster than `gauge`.

`tasty-bench` is a native Haskell library and works everywhere GHC
does, including WASM. We support a full range of architectures (`i386`, `amd64`, `armhf`,
`arm64`, `ppc64le`, `s390x`) and operating systems (Linux, Windows, macOS,
FreeBSD, OpenBSD, NetBSD), plus any GHC from 8.0 to 9.10
(earlier releases of `tasty-bench` support GHC as far back as 7.0).

## How is it possible?

Our benchmarks are literally regular `tasty` tests, so we can leverage all the existing
machinery for command-line options, resource management, structuring,
listing and filtering benchmarks, and running and reporting results. It also means
that `tasty-bench` can be used in conjunction with other `tasty` ingredients.

Unlike `criterion` and `gauge`, we use a very simple statistical model, described below.
This is arguably a questionable choice, but it works pretty well in practice.
Few developers are sufficiently well-versed in probability theory
to make sense of all the numbers generated by `criterion`.

## How to switch?

[Cabal mixins](https://cabal.readthedocs.io/en/3.4/cabal-package.html#pkg-field-mixins)
make it possible to get a taste of `tasty-bench` instead of `criterion` or `gauge`
without changing a single line of code:

```cabal
cabal-version: 2.0

benchmark foo
  ...
  build-depends:
    tasty-bench
  mixins:
    tasty-bench
      (Test.Tasty.Bench as Criterion, Test.Tasty.Bench as Criterion.Main,
       Test.Tasty.Bench as Gauge, Test.Tasty.Bench as Gauge.Main)
```

This works vice versa as well: if you use `tasty-bench` but at some point
need a more comprehensive statistical analysis,
it is easy to switch temporarily back to `criterion`.

## How to write a benchmark?

Benchmarks are declared in a separate section of a `.cabal` file:

```cabal
cabal-version: 2.0
name:          bench-fibo
version:       0.0
build-type:    Simple
synopsis:      Example of a benchmark

benchmark bench-fibo
  main-is:       BenchFibo.hs
  type:          exitcode-stdio-1.0
  build-depends: base, tasty-bench
  ghc-options:   "-with-rtsopts=-A32m"
  if impl(ghc >= 8.6)
    ghc-options: -fproc-alignment=64
```

And here is `BenchFibo.hs`:

```haskell
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "Fibonacci numbers"
    [ bench "fifth"     $ nf fibo  5
    , bench "tenth"     $ nf fibo 10
    , bench "twentieth" $ nf fibo 20
    ]
  ]
```

Since `tasty-bench` provides an API compatible with `criterion`,
one can refer to [its documentation](http://www.serpentine.com/criterion/tutorial.html#how-to-write-a-benchmark-suite) for more examples.

## How to read results?

Running the example above (`cabal bench` or `stack bench`)
results in the following output:

```
All
  Fibonacci numbers
    fifth:     OK (2.13s)
       63 ns ± 3.4 ns
    tenth:     OK (1.71s)
      809 ns ±  73 ns
    twentieth: OK (3.39s)
      104 μs ± 4.9 μs

All 3 tests passed (7.25s)
```

The output says that, for instance, the first benchmark was repeatedly
executed for 2.13 seconds (wall-clock time), its predicted mean CPU time was
63 nanoseconds, and the means of individual samples rarely diverge from it
by more than ±3.4 nanoseconds (twice the standard deviation). Take standard
deviation numbers with a grain of salt; there are lies, damned lies, and
statistics.

## Wall-clock time vs. CPU time

What time are we talking about?
Both `criterion` and `gauge` by default report wall-clock time, which is
affected by any other application running concurrently.
Ideally benchmarks are executed on a dedicated server without any other load,
but — let's face the truth — most developers run benchmarks
on a laptop with a hundred other services and a window manager, and
watch videos while waiting for benchmarks to finish. That's the cause
of the notorious "variance introduced by outliers: 88% (severely inflated)" warning.

To alleviate this issue `tasty-bench` by default measures CPU time via `getCPUTime`
instead of wall-clock time.
This does not provide perfect isolation from other processes (e. g.,
if the CPU cache is spoiled by others, populating data back from RAM
is your burden), but it is a bit more stable.

Caveat: this means that for multithreaded algorithms
`tasty-bench` reports the total CPU time elapsed across all cores, while
`criterion` and `gauge` report wall-clock time, which is the maximum across cores.
It also means that by default `tasty-bench` does not measure time spent outside the process,
e. g., in calls to other executables. To work around this limitation
use the `--time-mode` command-line option or set it locally via the `TimeMode` option.

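For example, here is a sketch of switching a subtree to wall-clock measurements in code (assuming a recent `tasty-bench` where `TimeMode` provides the `CpuTime` and `WallTime` constructors; benchmark bodies are illustrative):

```haskell
import Control.Concurrent (threadDelay)
import Data.List (foldl')
import Test.Tasty (localOption)
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ -- Measured in CPU time, the default.
    bench "pure work" $ nf (foldl' (+) 0) [1 .. 1000 :: Int]
    -- Measured in wall-clock time, relevant e. g. for subprocess-heavy actions.
  , localOption WallTime $
      bench "sleep" $ whnfIO (threadDelay 1000)
  ]
```
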
## Statistical model

Here is the procedure used by `tasty-bench` to measure execution time:

1. Set $n \leftarrow 1$.
2. Measure execution time $t_n$ of $n$ iterations
   and execution time $t_{2n}$ of $2n$ iterations.
3. Find $t$ which minimizes the deviation of $(nt, 2nt)$ from $(t_n, t_{2n})$,
   namely $t \leftarrow (t_n + 2t_{2n}) / 5n$.
4. If the deviation is small enough (see `--stdev` below)
   or time is running out soon (see `--timeout` below),
   return $t$ as the mean execution time.
5. Otherwise set $n \leftarrow 2n$ and jump back to Step 2.

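A condensed sketch of this loop (not `tasty-bench`'s actual implementation, which additionally handles timeouts, timer overhead and memory statistics; the deviation check here is deliberately crude):

```haskell
import Control.Monad (replicateM_)
import System.CPUTime (getCPUTime)

-- Time n iterations of an action, in picoseconds.
timeIters :: Int -> IO () -> IO Double
timeIters n act = do
  before <- getCPUTime
  replicateM_ n act
  after <- getCPUTime
  pure (fromIntegral (after - before))

-- Double the iteration count until the two-point fit looks precise enough;
-- returns the estimated time per iteration in picoseconds.
measureSketch :: Double -> IO () -> IO Double
measureSketch targetRelStdev act = go 1
  where
    go n = do
      tn  <- timeIters n act
      t2n <- timeIters (2 * n) act
      let t     = (tn + 2 * t2n) / (5 * fromIntegral n) -- Step 3
          sigma = abs (t2n - 2 * tn)                    -- rough deviation estimate
      if sigma < targetRelStdev * t2n                   -- Step 4, simplified
        then pure t
        else go (2 * n)                                 -- Step 5
```
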
This is roughly similar to the linear regression approach which `criterion` takes,
but we fit only the last two points. This allows us to simplify away all the heavyweight
statistical analysis. More importantly, earlier measurements,
which are presumably shorter and noisier, do not affect the overall result.
This is in contrast to `criterion`, which fits all measurements and
is biased to use more data points corresponding to shorter runs
(it employs an $n \leftarrow 1.05n$ progression).

The mean time and its deviation do not say much about the
distribution of individual timings. E. g., imagine a computation which
(according to a coarse system timer) takes either 0 ms or 1 ms with equal
probability. While one would be able to establish that its mean time is 0.5 ms
with a very small deviation, this does not imply that individual measurements
are anywhere near 0.5 ms. Even assuming infinite precision of the system
timer, the distribution of individual times is not known to be
[normal](https://en.wikipedia.org/wiki/Normal_distribution).

Obligatory disclaimer: statistics is a tricky matter and there is no
one-size-fits-all approach.
In the absence of a good theory,
simplistic approaches are as (un)sound as obscure ones.
Those who seek statistical soundness should rather collect raw data
and process it themselves using a proper statistical toolbox.
Data reported by `tasty-bench`
is of indicative and comparative significance only.

## Memory usage

Configuring the RTS to collect GC statistics
(e. g., via `cabal bench --benchmark-options '+RTS -T'`
or `stack bench --ba '+RTS -T'`) enables `tasty-bench` to estimate and report
memory usage:

```
All
  Fibonacci numbers
    fifth:     OK (2.13s)
       63 ns ± 3.4 ns, 223 B allocated, 0 B copied, 2.0 MB peak memory
    tenth:     OK (1.71s)
      809 ns ±  73 ns, 2.3 KB allocated, 0 B copied, 4.0 MB peak memory
    twentieth: OK (3.39s)
      104 μs ± 4.9 μs, 277 KB allocated, 59 B copied, 5.0 MB peak memory

All 3 tests passed (7.25s)
```

This data is reported as per [`GHC.Stats.RTSStats`](https://hackage.haskell.org/package/base/docs/GHC-Stats.html#t:RTSStats) fields:

* `allocated_bytes`

  Total size of data ever allocated since the start
  of the benchmark iteration. Even if data was immediately
  garbage collected and freed, it still counts.

* `copied_bytes`

  Total size of data ever copied by the GC (because it was alive and kicking)
  since the start of the benchmark iteration. Note that zero bytes often mean
  that the benchmark was too short to trigger GC at all.

* `max_mem_in_use_bytes`

  Peak size of live data since the very start of the process.
  This is a global metric: it grows cumulatively and does not say much
  about individual benchmarks, but rather characterizes the heap
  environment in which they are executed.

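If you want the suite to remind you about the `+RTS -T` requirement, a small sketch using the standard `GHC.Stats.getRTSStatsEnabled` from `base` could look like this (the hint message is illustrative):

```haskell
import Control.Monad (unless)
import GHC.Stats (getRTSStatsEnabled)
import System.IO (hPutStrLn, stderr)
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = do
  -- True only when the program was started with +RTS -T (or -with-rtsopts=-T).
  statsEnabled <- getRTSStatsEnabled
  unless statsEnabled $
    hPutStrLn stderr "Hint: pass +RTS -T to get memory usage figures"
  defaultMain [ bench "fibo 20" $ nf fibo 20 ]
```
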
## Combining tests and benchmarks

When optimizing an existing function, it is important to check that its
observable behavior remains unchanged. One can rebuild
both tests and benchmarks after each change, but it is more convenient
to run sanity checks within the benchmark suite itself. Since our benchmarks
are compatible with `tasty` tests, we can easily do so.

Imagine you come up with a faster function `myFibo` to generate Fibonacci numbers:

```haskell
import Test.Tasty.Bench
import Test.Tasty.QuickCheck -- from tasty-quickcheck package

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

myFibo :: Int -> Integer
myFibo n = if n < 3 then toInteger n else myFibo (n - 1) + myFibo (n - 2)

main :: IO ()
main = Test.Tasty.Bench.defaultMain -- not Test.Tasty.defaultMain
  [ bench "fibo 20" $ nf fibo 20
  , bench "myFibo 20" $ nf myFibo 20
  , testProperty "myFibo = fibo" $ \n -> fibo n === myFibo n
  ]
```

This outputs:

```
All
  fibo 20:       OK (3.02s)
    104 μs ± 4.9 μs
  myFibo 20:     OK (1.99s)
     71 μs ± 5.3 μs
  myFibo = fibo: FAIL
    *** Failed! Falsified (after 5 tests and 1 shrink):
    2
    1 /= 2
    Use --quickcheck-replay=927711 to reproduce.

1 out of 3 tests failed (5.03s)
```

We see that `myFibo` is indeed significantly faster than `fibo`,
but unfortunately it does not do the same thing. One should probably
look for another way to speed up generation of Fibonacci numbers.

## Troubleshooting

* If benchmarks take too long, set `--timeout` to limit the execution time
  of individual benchmarks, and `tasty-bench` will do its best to fit
  into the given time frame. Without `--timeout` benchmarks are rerun until
  the target precision set by `--stdev` is achieved, which in the noisy environment
  of a modern laptop with a GUI may take a long time.

  While `criterion` runs each benchmark for at least 5 seconds,
  `tasty-bench` is happy to conclude earlier if it does not compromise
  the quality of results. In our experiments `tasty-bench` suites
  tend to finish earlier, even if some individual benchmarks
  take longer than with `criterion`.

  A common source of noisiness is garbage collection. Setting a larger
  allocation area (_nursery_) is often a good idea, either via
  `cabal bench --benchmark-options '+RTS -A32m'` or `stack bench --ba '+RTS -A32m'`.
  Alternatively, bake it into the
  `cabal` file as `ghc-options: "-with-rtsopts=-A32m"`.

* Never compile benchmarks with `-fstatic-argument-transformation`, because it
  breaks a trick we use to force GHC into reevaluation of the same function application
  over and over again.

* If benchmark results look malformed like below, make sure that you are
  invoking `Test.Tasty.Bench.defaultMain` and not `Test.Tasty.defaultMain`
  (the difference is `consoleBenchReporter` vs. `consoleTestReporter`):

  ```
  All
    fibo 20: OK (1.46s)
      WithLoHi (Estimate {estMean = Measurement {measTime = 41529118775, measAllocs = 0, measCopied = 0, measMaxMem = 0}, estStdev = 1595055320}) (-Infinity) Infinity
  ```

* If benchmarks fail with an error message

  ```
  Unhandled resource. Probably a bug in the runner you're using.
  ```

  or

  ```
  Unexpected state of the resource (NotCreated) in getResource. Report as a tasty bug.
  ```

  this is likely caused by `env` or `envWithCleanup` affecting the benchmark structure.
  You can use `env` to read test data from `IO`, but not to read benchmark names
  or affect their hierarchy in any other way. This is a fundamental restriction of `tasty`,
  required to list and filter benchmarks without launching missiles.

  Strict pattern matching on the resource is also prohibited. For instance,
  if it is a tuple, the second argument of `env` should use a lazy pattern match
  `\ ~(a, b) -> ...`, as in the sketch below.

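  A minimal sketch of such usage (benchmark names and data are illustrative):

  ```haskell
  import Control.DeepSeq (force)
  import Control.Exception (evaluate)
  import Test.Tasty.Bench

  main :: IO ()
  main = defaultMain
    [ env mkLists $ \ ~(xs, ys) -> bgroup "sums" -- lazy match on the tuple
        [ bench "short" $ nf sum xs
        , bench "long"  $ nf sum ys
        ]
    ]
    where
      mkLists :: IO ([Int], [Int])
      mkLists = evaluate (force ([1 .. 1000], [1 .. 2000]))
  ```
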
* If benchmarks fail with `Test dependencies form a loop`
  or `Test dependencies have cycles`, this is likely
  because of `bcompare`, which compares a benchmark with itself.
  Locating a benchmark in a global environment may be tricky; please refer to
  [`tasty` documentation](https://github.com/UnkindPartition/tasty#patterns) for details
  and consider using `locateBenchmark`.

* When seeing

  ```
  This benchmark takes more than 100 seconds. Consider setting --timeout, if this is unexpected (or to silence this warning).
  ```

  do follow the advice: abort the benchmarks and pass `-t100` or similar. Unless you are
  benchmarking a very computationally expensive function, a single benchmark should
  stabilize after a couple of seconds. This warning is a sign that your environment
  is too noisy, in which case `tasty-bench` will keep trying with exponentially
  longer intervals, often unproductively.

* The following error can be thrown when benchmarks are built with
  `ghc-options: -threaded`:

  ```
  Benchmarks must not be run concurrently. Please pass -j1 and/or avoid +RTS -N.
  ```

  The underlying cause is that `tasty` runs tests concurrently, which is harmful
  to reliable performance measurements. Make sure to use `tasty-bench >= 0.3.4` and invoke
  `Test.Tasty.Bench.defaultMain` and not `Test.Tasty.defaultMain`. Note that
  `localOption (NumThreads 1)` quashes the warning, but does not eliminate the cause.

* If benchmarks using GHC 9.4.4+ segfault on Windows, check that you
  are not using the non-moving garbage collector `--nonmoving-gc`. This is likely caused
  by a [GHC issue](https://gitlab.haskell.org/ghc/ghc/-/issues/23003).
  Previous releases of `tasty-bench` recommended enabling `--nonmoving-gc`
  to stabilise benchmarks, but this is now discouraged.

* If you see

  ```
  <stdout>: commitBuffer: invalid argument (cannot encode character '\177')
  ```

  it means that your locale does not support UTF-8. `tasty-bench` makes an effort
  to force the locale to UTF-8, but sometimes, when benchmarks are part of
  a larger application, it is [impossible](https://gitlab.haskell.org/ghc/ghc/-/issues/23606)
  to do so. In such a case run `locale -a` to list available locales and set a UTF-8-capable
  one (e. g., `export LANG=C.UTF-8`) before starting benchmarks.

## Isolating interfering benchmarks

One difficulty of benchmarking in Haskell is that it is
hard to isolate benchmarks so that they do not interfere.
Changing the order of benchmarks or skipping some of them
has an effect on the heap's layout and thus affects garbage collection.
This issue is well attested in
[both](https://github.com/haskell/criterion/issues/166)
[`criterion`](https://github.com/haskell/criterion/issues/60)
and
[`gauge`](https://github.com/vincenthz/hs-gauge/issues/2).

Usually (but not always) skipping some benchmarks speeds up the remaining ones.
That's because once a benchmark has allocated heap which for some reason
was not promptly released afterwards (e. g., it forced a top-level thunk
in an underlying library), all further benchmarks are slowed down
by the garbage collector processing this additional amount of live data
over and over again.

There are several mitigation strategies. First of all, giving the garbage collector
more breathing space via `+RTS -A32m` (or more) is often good enough.

Further, avoid using top-level bindings to store large test data. Once such thunks
are forced, they remain allocated forever, which detrimentally affects subsequent
unrelated benchmarks. Treat them as external data, supplied via `env`: instead of

```haskell
largeData :: String
largeData = replicate 1000000 'a'

main :: IO ()
main = defaultMain
  [ bench "large" $ nf length largeData, ... ]
```

use

```haskell
import Control.DeepSeq (force)
import Control.Exception (evaluate)

main :: IO ()
main = defaultMain
  [ env (evaluate (force (replicate 1000000 'a'))) $ \largeData ->
    bench "large" $ nf length largeData, ... ]
```

Finally, as an ultimate measure to reduce interference between benchmarks,
one can run each of them in a separate process. We do not quite recommend
this approach, but if you are desperate, here is how:

```sh
cabal run -v0 all:benches -- -l | sed -e 's/[\"]/\\\\\\&/g' | while read -r name; do cabal run -v0 all:benches -- -p '$0 == "'"$name"'"'; done
```

This assumes that there is a single benchmark suite in the project
and that benchmark names do not contain newlines.

## Comparison against baseline

One can compare benchmark results against an earlier run automatically.

When using this feature, it is especially important to compile benchmarks with
`ghc-options: `[`-fproc-alignment`](https://downloads.haskell.org/ghc/latest/docs/users_guide/debugging.html#ghc-flag--fproc-alignment)`=64`, otherwise results could be skewed by
intermittent changes in cache-line alignment.

Firstly, run `tasty-bench` with the `--csv FILE` key
to dump results to `FILE` in CSV format
(it could be a good idea to set a smaller `--stdev`, if possible):

```
Name,Mean (ps),2*Stdev (ps)
All.Fibonacci numbers.fifth,48453,4060
All.Fibonacci numbers.tenth,637152,46744
All.Fibonacci numbers.twentieth,81369531,3342646
```

Now modify the implementation and rerun benchmarks
with the `--baseline FILE` key. This produces a report as follows:

```
All
  Fibonacci numbers
    fifth:     OK (0.44s)
       53 ns ± 2.7 ns, 8% more than baseline
    tenth:     OK (0.33s)
      641 ns ±  59 ns, same as baseline
    twentieth: OK (0.36s)
       77 μs ± 6.4 μs, 5% less than baseline

All 3 tests passed (1.50s)
```

You can also fail benchmarks which deviate too far from the baseline, using the
`--fail-if-slower` and `--fail-if-faster` options. For example, setting both of them
to 6 will fail the first benchmark above (because it is more than 6% slower),
but the last one still succeeds (even though it is measurably faster than baseline,
the deviation is less than 6%). Consider also using `--hide-successes` to show
only problematic benchmarks, or even the
[`tasty-rerun`](http://hackage.haskell.org/package/tasty-rerun) package
to focus on rerunning failing items only.

If you wish to compare two CSV reports non-interactively, here is a handy `awk` incantation:

```sh
awk 'BEGIN{FS=",";OFS=",";print "Name,Old,New,Ratio"}FNR==1{trueNF=NF;next}NF<trueNF{print "Benchmark names should not contain newlines";exit 1}FNR==NR{oldTime=$(NF-trueNF+2);NF-=trueNF-1;a[$0]=oldTime;next}{newTime=$(NF-trueNF+2);NF-=trueNF-1;if(a[$0]){print $0,a[$0],newTime,newTime/a[$0];gs+=log(newTime/a[$0]);gc++}}END{if(gc>0)print "Geometric mean,,",exp(gs/gc)}' old.csv new.csv
```

A larger shell snippet to compare two `git` commits can be found in `compare_benches.sh`.

Note that the columns in the CSV report differ from what `criterion` or `gauge`
would produce. If names do not contain commas, the missing columns can be faked this way:

```sh
awk 'BEGIN{FS=",";OFS=",";print "Name,Mean,MeanLB,MeanUB,Stddev,StddevLB,StddevUB"}NR==1{trueNF=NF;next}NF<trueNF{print $0;next}{mean=$(NF-trueNF+2);stddev=$(NF-trueNF+3);NF-=trueNF-1;print $0,mean/1e12,mean/1e12,mean/1e12,stddev/2e12,stddev/2e12,stddev/2e12}'
```

To fake `gauge` in `--csvraw` mode use

```sh
awk 'BEGIN{FS=",";OFS=",";print "name,iters,time,cycles,cpuTime,utime,stime,maxrss,minflt,majflt,nvcsw,nivcsw,allocated,numGcs,bytesCopied,mutatorWallSeconds,mutatorCpuSeconds,gcWallSeconds,gcCpuSeconds"}NR==1{trueNF=NF;next}NF<trueNF{print $0;next}{mean=$(NF-trueNF+2);fourth=$(NF-trueNF+4);fifth=$(NF-trueNF+5);sixth=$(NF-trueNF+6);NF-=trueNF-1;print $0,1,mean/1e12,0,mean/1e12,mean/1e12,0,sixth+0,0,0,0,0,fourth+0,0,fifth+0,0,0,0,0}'
```

## Comparison between benchmarks

You can also compare benchmarks to each other without any external tools,
all in the comfort of your terminal:

```haskell
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "Fibonacci numbers"
    [ bcompare "tenth" $ bench "fifth"     $ nf fibo  5
    , bench "tenth" $ nf fibo 10
    , bcompare "tenth" $ bench "twentieth" $ nf fibo 20
    ]
  ]
```

This produces a report comparing the mean times of `fifth` and `twentieth` to that of `tenth`:

```
All
  Fibonacci numbers
    fifth:     OK (16.56s)
      121 ns ± 2.6 ns, 0.08x
    tenth:     OK (6.84s)
      1.6 μs ±  31 ns
    twentieth: OK (6.96s)
      203 μs ± 4.1 μs, 128.36x
```

To locate a baseline benchmark in a larger suite, use `locateBenchmark`.

One can leverage comparisons between benchmarks to implement portable performance
tests, expressing properties like "this algorithm must be at least twice as fast
as that one" or "this operation should not be more than thrice slower than that one".
This can be achieved with `bcompareWithin`, which takes an acceptable interval
of performance as an argument, as sketched below.

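For instance, a sketch of a performance test asserting that `fifth` runs in under a tenth of the time of `tenth` (the interval bounds here are illustrative):

```haskell
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "Fibonacci numbers"
    [ -- Fail unless fifth's mean time is between 0x and 0.1x of tenth's.
      bcompareWithin 0 0.1 "tenth" $ bench "fifth" $ nf fibo 5
    , bench "tenth" $ nf fibo 10
    ]
  ]
```
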
## Plotting results

Users can dump results into CSV with `--csv FILE`
and plot them using `gnuplot` or other software. But for convenience
there is also a built-in quick-and-dirty SVG plotting feature,
which can be invoked by passing `--svg FILE`. Here is a sample of its output:

![Plotting](https://hackage.haskell.org/package/tasty-bench/src/example.svg)

## Build flags

Build flags are a brittle subject and users do not normally need to touch them.

* If you find yourself in an environment where `tasty` is not available and you
  have access to boot packages only, you can still use `tasty-bench`! Just copy
  `Test/Tasty/Bench.hs` to your project (think of it as a header-only C library).
  It will provide you with functions to build `Benchmarkable` values and run them manually
  via `measureCpuTime`. This mode of operation can also be configured
  by disabling the Cabal flag `tasty`.

## Command-line options

Use `--help` to list all command-line options.

* `-p`, `--pattern`

  This is a standard `tasty` option, which allows filtering benchmarks
  by a pattern or `awk` expression. Please refer to
  [`tasty` documentation](https://github.com/UnkindPartition/tasty#patterns)
  for details.

* `-t`, `--timeout`

  This is a standard `tasty` option, setting a timeout for individual benchmarks
  in seconds. Use it when benchmarks tend to take too long: `tasty-bench` will make
  an effort to report results (even if of subpar quality) before the timeout. Setting
  the timeout too tight (insufficient for at least three iterations)
  will result in a benchmark failure. It can be adjusted locally for a group
  of benchmarks, e. g., `localOption (mkTimeout 100000000)` for 100 seconds.

* `--stdev`

  Target relative standard deviation of measurements in percents (5% by default).
  Large values correspond to fast and loose benchmarks, small ones to long and precise.
  It can also be adjusted locally for a group of benchmarks,
  e. g., `localOption (RelStDev 0.02)` (see the combined example after this list).
  If benchmarking takes far too long, consider setting `--timeout`,
  which will interrupt benchmarks, potentially before reaching the target deviation.

* `--csv`

  File to write results to in CSV format.

* `--baseline`

  File to read baseline results from in CSV format (as produced by `--csv`).

* `--fail-if-slower`, `--fail-if-faster`

  Upper bounds of the acceptable slowdown / speedup in percents. If a benchmark is
  unacceptably slower / faster than the baseline (see `--baseline`),
  it will be reported as failed. Can be used in conjunction with
  the standard `tasty` option `--hide-successes` to show only problematic benchmarks.
  Both options can be adjusted locally for a group of benchmarks,
  e. g., `localOption (FailIfSlower 0.10)`.

* `--svg`

  File to plot results to in SVG format.

* `--time-mode`

  Whether to measure CPU time (`cpu`, the default) or wall-clock time (`wall`).

* `+RTS -T`

  Estimate and report memory usage.

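A sketch combining the local adjustments mentioned in the list above (the particular values are illustrative):

```haskell
import Test.Tasty (localOption, mkTimeout)
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ localOption (mkTimeout 100000000)  -- 100 seconds per benchmark
  $ localOption (RelStDev 0.02)        -- target 2% relative stdev
  $ localOption (FailIfSlower 0.10)    -- fail if >10% slower than baseline
  $ bgroup "tight"
    [ bench "fibo 20" $ nf fibo 20 ]
  ]
```
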
## Custom command-line options

As usual with `tasty`, it is easy to extend benchmarks with custom command-line options.
Here is an example:

```haskell
import Data.Proxy
import Test.Tasty.Bench
import Test.Tasty.Ingredients.Basic
import Test.Tasty.Options
import Test.Tasty.Runners

newtype RandomSeed = RandomSeed Int

instance IsOption RandomSeed where
  defaultValue = RandomSeed 42
  parseValue = fmap RandomSeed . safeRead
  optionName = pure "seed"
  optionHelp = pure "Random seed used in benchmarks"

main :: IO ()
main = do
  let customOpts  = [Option (Proxy :: Proxy RandomSeed)]
      ingredients = includingOptions customOpts : benchIngredients
  opts <- parseOptions ingredients benchmarks
  let RandomSeed seed = lookupOption opts
  defaultMainWithIngredients ingredients benchmarks

benchmarks :: Benchmark
benchmarks = bgroup "All" []
```

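The resulting executable now accepts the `--seed` flag (e. g., via `cabal bench --benchmark-options '--seed 123'`), and the parsed `seed` value, unused in the skeleton above, is where you would plug in deterministic generation of benchmark inputs.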