mirror of
https://github.com/Bodigrim/tasty-bench.git
synced 2024-10-26 17:12:59 +03:00
527 lines
18 KiB
Markdown
527 lines
18 KiB
Markdown
# tasty-bench [![Hackage](http://img.shields.io/hackage/v/tasty-bench.svg)](https://hackage.haskell.org/package/tasty-bench) [![Stackage Nightly](http://stackage.org/package/tasty-bench/badge/nightly)](http://stackage.org/nightly/package/tasty-bench)
|
||
|
||
Featherlight benchmark framework (only one file!) for performance measurement
|
||
with API mimicking [`criterion`](http://hackage.haskell.org/package/criterion)
|
||
and [`gauge`](http://hackage.haskell.org/package/gauge).
|
||
A prominent feature is built-in comparison against previous runs
|
||
and between benchmarks.
|
||
|
||
<!-- MarkdownTOC autolink="true" -->
|
||
|
||
- [How lightweight is it?](#how-lightweight-is-it)
|
||
- [How is it possible?](#how-is-it-possible)
|
||
- [How to switch?](#how-to-switch)
|
||
- [How to write a benchmark?](#how-to-write-a-benchmark)
|
||
- [How to read results?](#how-to-read-results)
|
||
- [Statistical model](#statistical-model)
|
||
- [Memory usage](#memory-usage)
|
||
- [Combining tests and benchmarks](#combining-tests-and-benchmarks)
|
||
- [Troubleshooting](#troubleshooting)
|
||
- [Isolating interfering benchmarks](#isolating-interfering-benchmarks)
|
||
- [Comparison against baseline](#comparison-against-baseline)
|
||
- [Comparison between benchmarks](#comparison-between-benchmarks)
|
||
- [Plotting results](#plotting-results)
|
||
- [Command-line options](#command-line-options)
|
||
|
||
<!-- /MarkdownTOC -->
|
||
|
||
## How lightweight is it?
|
||
|
||
There is only one source file `Test.Tasty.Bench` and no non-boot dependencies
|
||
except [`tasty`](http://hackage.haskell.org/package/tasty).
|
||
So if you already depend on `tasty` for a test suite, there
|
||
is nothing else to install.
|
||
|
||
Compare this to `criterion` (10+ modules, 50+ dependencies) and `gauge` (40+ modules, depends on `basement` and `vector`).
|
||
|
||
## How is it possible?
|
||
|
||
Our benchmarks are literally regular `tasty` tests, so we can leverage all existing
|
||
machinery for command-line options, resource management, structuring,
|
||
listing and filtering benchmarks, running and reporting results. It also means
|
||
that `tasty-bench` can be used in conjunction with other `tasty` ingredients.
|
||
|
||
Unlike `criterion` and `gauge` we use a very simple statistical model described below.
|
||
This is arguably a questionable choice, but it works pretty well in practice.
|
||
A rare developer is sufficiently well-versed in probability theory
|
||
to make sense and use of all numbers generated by `criterion`.
|
||
|
||
## How to switch?
|
||
|
||
[Cabal mixins](https://cabal.readthedocs.io/en/3.4/cabal-package.html#pkg-field-mixins)
|
||
allow to taste `tasty-bench` instead of `criterion` or `gauge`
|
||
without changing a single line of code:
|
||
|
||
```cabal
|
||
cabal-version: 2.0
|
||
|
||
benchmark foo
|
||
...
|
||
build-depends:
|
||
tasty-bench
|
||
mixins:
|
||
tasty-bench (Test.Tasty.Bench as Criterion, Test.Tasty.Bench as Criterion.Main)
|
||
```
|
||
|
||
This works vice versa as well: if you use `tasty-bench`, but at some point
|
||
need a more comprehensive statistical analysis,
|
||
it is easy to switch temporarily back to `criterion`.
|
||
|
||
## How to write a benchmark?
|
||
|
||
Benchmarks are declared in a separate section of `cabal` file:
|
||
|
||
```cabal
|
||
cabal-version: 2.0
|
||
name: bench-fibo
|
||
version: 0.0
|
||
build-type: Simple
|
||
synopsis: Example of a benchmark
|
||
|
||
benchmark bench-fibo
|
||
main-is: BenchFibo.hs
|
||
type: exitcode-stdio-1.0
|
||
build-depends: base, tasty-bench
|
||
if impl(ghc >= 8.10)
|
||
ghc-options: "-with-rtsopts=-A32m --nonmoving-gc"
|
||
else
|
||
ghc-options: "-with-rtsopts=-A32m"
|
||
```
|
||
|
||
And here is `BenchFibo.hs`:
|
||
|
||
```haskell
|
||
import Test.Tasty.Bench
|
||
|
||
fibo :: Int -> Integer
|
||
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)
|
||
|
||
main :: IO ()
|
||
main = defaultMain
|
||
[ bgroup "fibonacci numbers"
|
||
[ bench "fifth" $ nf fibo 5
|
||
, bench "tenth" $ nf fibo 10
|
||
, bench "twentieth" $ nf fibo 20
|
||
]
|
||
]
|
||
```
|
||
|
||
Since `tasty-bench` provides an API compatible with `criterion`,
|
||
one can refer to [its documentation](http://www.serpentine.com/criterion/tutorial.html#how-to-write-a-benchmark-suite) for more examples.
|
||
|
||
## How to read results?
|
||
|
||
Running the example above (`cabal bench` or `stack bench`)
|
||
results in the following output:
|
||
|
||
```
|
||
All
|
||
fibonacci numbers
|
||
fifth: OK (2.13s)
|
||
63 ns ± 3.4 ns
|
||
tenth: OK (1.71s)
|
||
809 ns ± 73 ns
|
||
twentieth: OK (3.39s)
|
||
104 μs ± 4.9 μs
|
||
|
||
All 3 tests passed (7.25s)
|
||
```
|
||
|
||
The output says that, for instance, the first benchmark
|
||
was repeatedly executed for 2.13 seconds (wall time),
|
||
its mean time was 63 nanoseconds and,
|
||
assuming ideal precision of a system clock,
|
||
execution time does not often diverge from the mean
|
||
further than ±3.4 nanoseconds
|
||
(double standard deviation, which for normal distributions
|
||
corresponds to [95%](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)
|
||
probability). Take standard deviation numbers
|
||
with a grain of salt; there are lies, damned lies, and statistics.
|
||
|
||
Note that this data is not directly comparable with `criterion` output:
|
||
|
||
```
|
||
benchmarking fibonacci numbers/fifth
|
||
time 62.78 ns (61.99 ns .. 63.41 ns)
|
||
0.999 R² (0.999 R² .. 1.000 R²)
|
||
mean 62.39 ns (61.93 ns .. 62.94 ns)
|
||
std dev 1.753 ns (1.427 ns .. 2.258 ns)
|
||
```
|
||
|
||
One might interpret the second line as saying that
|
||
95% of measurements fell into 61.99–63.41 ns interval, but this is wrong.
|
||
It states that the [OLS regression](https://en.wikipedia.org/wiki/Ordinary_least_squares)
|
||
of execution time (which is not exactly the mean time) is most probably
|
||
somewhere between 61.99 ns and 63.41 ns,
|
||
but does not say a thing about individual measurements.
|
||
To understand how far away a typical measurement deviates
|
||
you need to add/subtract double standard deviation yourself
|
||
(which gives 62.78 ns ± 3.506 ns, similar to `tasty-bench` above).
|
||
|
||
To add to the confusion, `gauge` in `--small` mode outputs
|
||
not the second line of `criterion` report as one might expect,
|
||
but a mean value from the penultimate line and a standard deviation:
|
||
|
||
```
|
||
fibonacci numbers/fifth mean 62.39 ns ( +- 1.753 ns )
|
||
```
|
||
|
||
The interval ±1.753 ns answers
|
||
for [68%](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)
|
||
of samples only, double it to estimate the behavior in 95% of cases.
|
||
|
||
## Statistical model
|
||
|
||
Here is a procedure used by `tasty-bench` to measure execution time:
|
||
|
||
1. Set _n_ ← 1.
|
||
2. Measure execution time _tₙ_ of _n_ iterations
|
||
and execution time _t₂ₙ_ of _2n_ iterations.
|
||
3. Find _t_ which minimizes deviation of (_nt_, _2nt_) from (_tₙ_, _t₂ₙ_).
|
||
4. If deviation is small enough (see `--stdev` below),
|
||
return _t_ as a mean execution time.
|
||
5. Otherwise set _n_ ← _2n_ and jump back to Step 2.
|
||
|
||
This is roughly similar to the linear regression approach which `criterion` takes,
|
||
but we fit only two last points. This allows us to simplify away all heavy-weight
|
||
statistical analysis. More importantly, earlier measurements,
|
||
which are presumably shorter and noisier, do not affect overall result.
|
||
This is in contrast to `criterion`, which fits all measurements and
|
||
is biased to use more data points corresponding to shorter runs
|
||
(it employs _n_ ← _1.05n_ progression).
|
||
|
||
An alert reader could object that we measure standard deviation
|
||
for samples with _n_ and _2n_ iterations, but report
|
||
it scaled to a single iteration.
|
||
Strictly speaking, this is justified only if we assume
|
||
that deviating factors are either roughly periodic
|
||
(e. g., coarseness of a system clock, garbage collection)
|
||
or are likely to affect several successive iterations in the same way
|
||
(e. g., slow down by another concurrent process).
|
||
|
||
Obligatory disclaimer: statistics is a tricky matter, there is no
|
||
one-size-fits-all approach.
|
||
In the absence of a good theory
|
||
simplistic approaches are as (un)sound as obscure ones.
|
||
Those who seek statistical soundness should rather collect raw data
|
||
and process it themselves using a proper statistical toolbox.
|
||
Data reported by `tasty-bench`
|
||
is only of indicative and comparative significance.
|
||
|
||
## Memory usage
|
||
|
||
Passing `+RTS -T` (via `cabal bench --benchmark-options '+RTS -T'`
|
||
or `stack bench --ba '+RTS -T'`) enables `tasty-bench` to estimate and report
|
||
memory usage such as allocated and copied bytes:
|
||
|
||
```
|
||
All
|
||
fibonacci numbers
|
||
fifth: OK (2.13s)
|
||
63 ns ± 3.4 ns, 223 B allocated, 0 B copied
|
||
tenth: OK (1.71s)
|
||
809 ns ± 73 ns, 2.3 KB allocated, 0 B copied
|
||
twentieth: OK (3.39s)
|
||
104 μs ± 4.9 μs, 277 KB allocated, 59 B copied
|
||
|
||
All 3 tests passed (7.25s)
|
||
```
|
||
|
||
## Combining tests and benchmarks
|
||
|
||
When optimizing an existing function, it is important to check that its
|
||
observable behavior remains unchanged. One can rebuild
|
||
both tests and benchmarks after each change, but it would be more convenient
|
||
to run sanity checks within benchmark itself. Since our benchmarks
|
||
are compatible with `tasty` tests, we can easily do so.
|
||
|
||
Imagine you come up with a faster function `myFibo` to generate Fibonacci numbers:
|
||
|
||
```haskell
|
||
import Test.Tasty.Bench
|
||
import Test.Tasty.QuickCheck -- from tasty-quickcheck package
|
||
|
||
fibo :: Int -> Integer
|
||
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)
|
||
|
||
myFibo :: Int -> Integer
|
||
myFibo n = if n < 3 then toInteger n else myFibo (n - 1) + myFibo (n - 2)
|
||
|
||
main :: IO ()
|
||
main = Test.Tasty.Bench.defaultMain -- not Test.Tasty.defaultMain
|
||
[ bench "fibo 20" $ nf fibo 20
|
||
, bench "myFibo 20" $ nf myFibo 20
|
||
, testProperty "myFibo = fibo" $ \n -> fibo n === myFibo n
|
||
]
|
||
```
|
||
|
||
This outputs:
|
||
|
||
```
|
||
All
|
||
fibo 20: OK (3.02s)
|
||
104 μs ± 4.9 μs
|
||
myFibo 20: OK (1.99s)
|
||
71 μs ± 5.3 μs
|
||
myFibo = fibo: FAIL
|
||
*** Failed! Falsified (after 5 tests and 1 shrink):
|
||
2
|
||
1 /= 2
|
||
Use --quickcheck-replay=927711 to reproduce.
|
||
|
||
1 out of 3 tests failed (5.03s)
|
||
```
|
||
|
||
We see that `myFibo` is indeed significantly faster than `fibo`,
|
||
but unfortunately does not do the same thing. One should probably
|
||
look for another way to speed up generation of Fibonacci numbers.
|
||
|
||
## Troubleshooting
|
||
|
||
* If benchmarks take too long, set `--timeout` to limit execution time
|
||
of individual benchmarks, and `tasty-bench` will do its best to fit
|
||
into a given time frame. Without `--timeout` we rerun benchmarks until
|
||
achieving a target precision set by `--stdev`, which in a noisy environment
|
||
of a modern laptop with GUI may take a lot of time.
|
||
|
||
While `criterion` runs each benchmark at least for 5 seconds,
|
||
`tasty-bench` is happy to conclude earlier, if it does not compromise
|
||
the quality of results. In our experiments `tasty-bench` suites
|
||
tend to finish earlier, even if some individual benchmarks
|
||
take longer than with `criterion`.
|
||
|
||
A common source of noisiness is garbage collection. Setting a larger
|
||
allocation area (_nursery_) is often a good idea, either via
|
||
`cabal bench --benchmark-options '+RTS -A32m'` or `stack bench --ba '+RTS -A32m'`.
|
||
Alternatively bake it into
|
||
`cabal` file as `ghc-options: "-with-rtsopts=-A32m"`.
|
||
|
||
For GHC >= 8.10 consider switching benchmarks to a non-moving garbage collector,
|
||
because it decreases GC pauses and corresponding noise: `+RTS --nonmoving-gc`.
|
||
|
||
* If benchmark results look malformed like below, make sure that you are
|
||
invoking `Test.Tasty.Bench.defaultMain` and not `Test.Tasty.defaultMain`
|
||
(the difference is `consoleBenchReporter` vs. `consoleTestReporter`):
|
||
|
||
```
|
||
All
|
||
fibo 20: OK (1.46s)
|
||
Response {respEstimate = Estimate {estMean = Measurement {measTime = 87496728, measAllocs = 0, measCopied = 0}, estStdev = 694487}, respIfSlower = FailIfSlower Infinity, respIfFaster = FailIfFaster Infinity}
|
||
```
|
||
|
||
* If benchmarks fail with an error message
|
||
|
||
```
|
||
Unhandled resource. Probably a bug in the runner you're using.
|
||
```
|
||
|
||
this is probably caused by `env` or `envWithCleanup` affecting benchmarks structure.
|
||
You can use `env` to read test data from `IO`, but not to read benchmark names
|
||
or affect their hierarchy in other way. This is a fundamental restriction of `tasty`
|
||
to list and filter benchmarks without launching missiles.
|
||
|
||
## Isolating interfering benchmarks
|
||
|
||
One difficulty of benchmarking in Haskell is that it is
|
||
hard to isolate benchmarks so that they do not interfere.
|
||
Changing the order of benchmarks or skipping some of them
|
||
has an effect on heap's layout and thus affects garbage collection.
|
||
This issue is well attested in
|
||
[`both`](https://github.com/haskell/criterion/issues/166)
|
||
[`criterion`](https://github.com/haskell/criterion/issues/60)
|
||
and
|
||
[`gauge`](https://github.com/vincenthz/hs-gauge/issues/2).
|
||
|
||
Usually (but not always) skipping some benchmarks speeds up remaining ones.
|
||
That's because once a benchmark allocated heap which for some reason
|
||
was not promptly released afterwards (e. g., it forced a top-level thunk
|
||
in an underlying library), all further benchmarks are slowed down
|
||
by garbage collector processing this additional amount of live data
|
||
over and over again.
|
||
|
||
There are several mitigation strategies. First of all, giving garbage collector
|
||
more breathing space by `+RTS -A32m` (or more) is often good enough.
|
||
|
||
Further, avoid using top-level bindings to store large test data. Once such thunks
|
||
are forced, they remain allocated forever, which affects detrimentally subsequent
|
||
unrelated benchmarks. Treat them as external data, supplied via `env`: instead of
|
||
|
||
```haskell
|
||
largeData :: String
|
||
largeData = replicate 1000000 'a'
|
||
|
||
main :: IO ()
|
||
main = defaultMain
|
||
[ bench "large" $ nf length largeData, ... ]
|
||
```
|
||
|
||
use
|
||
|
||
```haskell
|
||
import Control.DeepSeq (force)
|
||
import Control.Exception (evaluate)
|
||
|
||
main :: IO ()
|
||
main = defaultMain
|
||
[ env (evaluate (force (replicate 1000000 'a'))) $ \largeData ->
|
||
bench "large" $ nf length largeData, ... ]
|
||
```
|
||
|
||
Finally, as an ultimate measure to reduce interference between benchmarks,
|
||
one can run each of them in a separate process. We do not quite recommend
|
||
this approach, but if you are desperate, here is how.
|
||
|
||
Assuming that a benchmark is declared in `cabal` file as `benchmark my-bench` component,
|
||
let's first find its executable:
|
||
|
||
```bash
|
||
cabal build --enable-benchmarks my-bench
|
||
MYBENCH=`cabal list-bin my-bench`
|
||
```
|
||
|
||
Now list all benchmark names (hopefully, they do not contain newlines),
|
||
escape quotes and slashes, and run each of them separately:
|
||
|
||
```bash
|
||
$MYBENCH -l | sed -e 's/[\"]/\\\\\\&/g' | while read name; do $MYBENCH -p '$0 == "'$name'"'; done
|
||
```
|
||
|
||
## Comparison against baseline
|
||
|
||
One can compare benchmark results against an earlier baseline in an automatic way.
|
||
To use this feature, first run `tasty-bench` with `--csv FILE` key
|
||
to dump results to `FILE` in CSV format
|
||
(it could be a good idea to set smaller `--stdev`, if possible):
|
||
|
||
```
|
||
Name,Mean (ps),2*Stdev (ps)
|
||
All.fibonacci numbers.fifth,48453,4060
|
||
All.fibonacci numbers.tenth,637152,46744
|
||
All.fibonacci numbers.twentieth,81369531,3342646
|
||
```
|
||
|
||
Note that columns do not match CSV reports of `criterion` and `gauge`.
|
||
If desired, missing columns can be faked with
|
||
`awk 'BEGIN {FS=",";OFS=","}; {print $1,$2,$2,$2,$3/2,$3/2,$3/2}'` or similar.
|
||
|
||
Now modify implementation and rerun benchmarks
|
||
with `--baseline FILE` key. This produces a report as follows:
|
||
|
||
```
|
||
All
|
||
fibonacci numbers
|
||
fifth: OK (0.44s)
|
||
53 ns ± 2.7 ns, 8% slower than baseline
|
||
tenth: OK (0.33s)
|
||
641 ns ± 59 ns
|
||
twentieth: OK (0.36s)
|
||
77 μs ± 6.4 μs, 5% faster than baseline
|
||
|
||
All 3 tests passed (1.50s)
|
||
```
|
||
|
||
You can also fail benchmarks, which deviate too far from baseline, using
|
||
`--fail-if-slower` and `--fail-if-faster` options. For example, setting both of them
|
||
to 6 will fail the first benchmark above (because it is more than 6% slower),
|
||
but the last one still succeeds (even while it is measurably faster than baseline,
|
||
deviation is less than 6%). Consider also using `--hide-successes` to show
|
||
only problematic benchmarks, or even
|
||
[`tasty-rerun`](http://hackage.haskell.org/package/tasty-rerun) package
|
||
to focus on rerunning failing items only.
|
||
|
||
## Comparison between benchmarks
|
||
|
||
You can also compare benchmarks to each other without reaching to external tools,
|
||
all in the comfort of your terminal.
|
||
|
||
```haskell
|
||
import Test.Tasty.Bench
|
||
|
||
fibo :: Int -> Integer
|
||
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)
|
||
|
||
main :: IO ()
|
||
main = defaultMain
|
||
[ bgroup "fibonacci numbers"
|
||
[ bcompare "tenth" $ bench "fifth" $ nf fibo 5
|
||
, bench "tenth" $ nf fibo 10
|
||
, bcompare "tenth" $ bench "twentieth" $ nf fibo 20
|
||
]
|
||
]
|
||
```
|
||
|
||
This produces a report, comparing mean times of `fifth` and `twentieth` to `tenth`:
|
||
|
||
```
|
||
All
|
||
fibonacci numbers
|
||
fifth: OK (16.56s)
|
||
121 ns ± 2.6 ns, 0.08x
|
||
tenth: OK (6.84s)
|
||
1.6 μs ± 31 ns
|
||
twentieth: OK (6.96s)
|
||
203 μs ± 4.1 μs, 128.36x
|
||
```
|
||
|
||
Locating a baseline benchmark in larger suites could get tricky;
|
||
|
||
```haskell
|
||
bcompare "$NF == \"tenth\" && $(NF-1) == \"fibonacci numbers\""
|
||
```
|
||
|
||
is a more robust choice of
|
||
an [`awk` pattern](https://github.com/feuerbach/tasty#patterns) here.
|
||
|
||
## Plotting results
|
||
|
||
Users can dump results into CSV with `--csv FILE`
|
||
and plot them using `gnuplot` or other software. But for convenience
|
||
there is also a built-in quick-and-dirty SVG plotting feature,
|
||
which can be invoked by passing `--svg FILE`. Here is a sample of its output:
|
||
|
||
![Plotting](./example.svg)
|
||
|
||
## Command-line options
|
||
|
||
Use `--help` to list command-line options.
|
||
|
||
* `-p`, `--pattern`
|
||
|
||
This is a standard `tasty` option, which allows filtering benchmarks
|
||
by a pattern or `awk` expression. Please refer to
|
||
[`tasty` documentation](https://github.com/feuerbach/tasty#patterns)
|
||
for details.
|
||
|
||
* `-t`, `--timeout`
|
||
|
||
This is a standard `tasty` option, setting timeout for individual benchmarks
|
||
in seconds. Use it when benchmarks tend to take too long: `tasty-bench` will make
|
||
an effort to report results (even if of subpar quality) before timeout. Setting
|
||
timeout too tight (insufficient for at least three iterations)
|
||
will result in a benchmark failure.
|
||
|
||
* `--stdev`
|
||
|
||
Target relative standard deviation of measurements in percents (5% by default).
|
||
Large values correspond to fast and loose benchmarks, and small ones to long and precise.
|
||
If it takes far too long, consider setting `--timeout`,
|
||
which will interrupt benchmarks, potentially before reaching the target deviation.
|
||
|
||
* `--csv`
|
||
|
||
File to write results in CSV format.
|
||
|
||
* `--baseline`
|
||
|
||
File to read baseline results in CSV format (as produced by `--csv`).
|
||
|
||
* `--fail-if-slower`, `--fail-if-faster`
|
||
|
||
Upper bounds of acceptable slow down / speed up in percents. If a benchmark is unacceptably slower / faster than baseline (see `--baseline`),
|
||
it will be reported as failed. Can be used in conjunction with
|
||
a standard `tasty` option `--hide-successes` to show only problematic benchmarks.
|
||
|
||
* `--svg`
|
||
|
||
File to plot results in SVG format.
|