# tasty-bench [![Hackage](http://img.shields.io/hackage/v/tasty-bench.svg)](https://hackage.haskell.org/package/tasty-bench) [![Stackage Nightly](http://stackage.org/package/tasty-bench/badge/nightly)](http://stackage.org/nightly/package/tasty-bench)
Featherlight benchmark framework (only one file!) for performance measurement
with an API mimicking [`criterion`](http://hackage.haskell.org/package/criterion)
and [`gauge`](http://hackage.haskell.org/package/gauge).

A prominent feature is built-in comparison against previous runs
and between benchmarks.
<!-- MarkdownTOC autolink="true" -->
- [How lightweight is it?](#how-lightweight-is-it)
- [How is it possible?](#how-is-it-possible)
- [How to switch?](#how-to-switch)
- [How to write a benchmark?](#how-to-write-a-benchmark)
- [How to read results?](#how-to-read-results)
- [Statistical model](#statistical-model)
- [Memory usage](#memory-usage)
- [Combining tests and benchmarks](#combining-tests-and-benchmarks)
- [Troubleshooting](#troubleshooting)
- [Isolating interfering benchmarks](#isolating-interfering-benchmarks)
- [Comparison against baseline](#comparison-against-baseline)
- [Comparison between benchmarks](#comparison-between-benchmarks)
- [Plotting results](#plotting-results)
- [Command-line options](#command-line-options)
<!-- /MarkdownTOC -->
## How lightweight is it?
There is only one source file `Test.Tasty.Bench` and no non-boot dependencies
except [`tasty`](http://hackage.haskell.org/package/tasty).
So if you already depend on `tasty` for a test suite, there
is nothing else to install.
Compare this to `criterion` (10+ modules, 50+ dependencies) and `gauge` (40+ modules, depends on `basement` and `vector`).
## How is it possible?
Our benchmarks are literally regular `tasty` tests, so we can leverage all existing
machinery for command-line options, resource management, structuring,
listing and filtering benchmarks, running and reporting results. It also means
that `tasty-bench` can be used in conjunction with other `tasty` ingredients.
Unlike `criterion` and `gauge`, we use a very simple statistical model, described below.
This is arguably a questionable choice, but it works pretty well in practice:
few developers are sufficiently well-versed in probability theory
to make sense of, let alone use, all the numbers generated by `criterion`.
## How to switch?
[Cabal mixins](https://cabal.readthedocs.io/en/3.4/cabal-package.html#pkg-field-mixins)
allow you to taste `tasty-bench` instead of `criterion` or `gauge`
without changing a single line of code:
```cabal
cabal-version: 2.0

benchmark foo
  ...
  build-depends:
    tasty-bench
  mixins:
    tasty-bench (Test.Tasty.Bench as Criterion, Test.Tasty.Bench as Criterion.Main)
```
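
With the mixin above in place, an existing `criterion`-style benchmark module keeps compiling unchanged, because `Criterion.Main` is resolved to `Test.Tasty.Bench`. Here is a hypothetical example (not part of this package) that builds against either library:

```haskell
-- A hypothetical criterion-style benchmark; under the mixin above
-- Criterion.Main actually refers to Test.Tasty.Bench.
import Criterion.Main

main :: IO ()
main = defaultMain
  [ bgroup "reverse"
    [ bench "100"  $ nf (reverse :: [Int] -> [Int]) [1..100]
    , bench "1000" $ nf (reverse :: [Int] -> [Int]) [1..1000]
    ]
  ]
```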
This works vice versa as well: if you use `tasty-bench`, but at some point
need a more comprehensive statistical analysis,
it is easy to switch temporarily back to `criterion`.
## How to write a benchmark?
Benchmarks are declared in a separate section of the `cabal` file:
```cabal
cabal-version: 2.0
name:          bench-fibo
version:       0.0
build-type:    Simple
synopsis:      Example of a benchmark

benchmark bench-fibo
  main-is:       BenchFibo.hs
  type:          exitcode-stdio-1.0
  build-depends: base, tasty-bench
  if impl(ghc >= 8.10)
    ghc-options: "-with-rtsopts=-A32m --nonmoving-gc"
  else
    ghc-options: "-with-rtsopts=-A32m"
```
And here is `BenchFibo.hs`:
```haskell
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "fibonacci numbers"
    [ bench "fifth"     $ nf fibo  5
    , bench "tenth"     $ nf fibo 10
    , bench "twentieth" $ nf fibo 20
    ]
  ]
```
Since `tasty-bench` provides an API compatible with `criterion`,
one can refer to [its documentation](http://www.serpentine.com/criterion/tutorial.html#how-to-write-a-benchmark-suite) for more examples.
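
Besides `nf`, the other `Benchmarkable` constructors familiar from `criterion` (`whnf`, `nfIO`, `whnfIO`, and friends) are provided as well. A small, self-contained sketch with made-up benchmark names, unrelated to the Fibonacci example:

```haskell
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ bench "sum to normal form"           $ nf   sum [1 .. 1000 :: Int]
  , bench "sum to weak head normal form" $ whnf sum [1 .. 1000 :: Int]
  , bench "an IO action"                 $ nfIO (pure (sum [1 .. 1000 :: Int]))
  ]
```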
## How to read results?
Running the Fibonacci example above (`cabal bench` or `stack bench`)
results in the following output:
```
All
  fibonacci numbers
    fifth:     OK (2.13s)
       63 ns ± 3.4 ns
    tenth:     OK (1.71s)
      809 ns ±  73 ns
    twentieth: OK (3.39s)
      104 μs ± 4.9 μs

All 3 tests passed (7.25s)
```
The output says that, for instance, the first benchmark
was repeatedly executed for 2.13 seconds (wall time),
its mean time was 63 nanoseconds and,
assuming ideal precision of a system clock,
execution time does not often diverge from the mean
further than ±3.4 nanoseconds
(double standard deviation, which for normal distributions
corresponds to [95%](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)
probability). Take standard deviation numbers
with a grain of salt; there are lies, damned lies, and statistics.
Note that this data is not directly comparable with `criterion` output:
```
benchmarking fibonacci numbers/fifth
time                 62.78 ns   (61.99 ns .. 63.41 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 62.39 ns   (61.93 ns .. 62.94 ns)
std dev              1.753 ns   (1.427 ns .. 2.258 ns)
```
One might interpret the second line as saying that
95% of measurements fell into the 61.99..63.41 ns interval, but this is wrong.
It states that the [OLS regression](https://en.wikipedia.org/wiki/Ordinary_least_squares)
of execution time (which is not exactly the mean time) is most probably
somewhere between 61.99 ns and 63.41 ns,
but does not say a thing about individual measurements.
To understand how far away a typical measurement deviates
you need to add/subtract double standard deviation yourself
(which gives 62.78 ns ± 3.506 ns, similar to `tasty-bench` above).
To add to the confusion, `gauge` in `--small` mode outputs
not the second line of `criterion` report as one might expect,
but a mean value from the penultimate line and a standard deviation:
```
fibonacci numbers/fifth mean 62.39 ns ( +- 1.753 ns )
```
The interval ±1.753 ns covers only
[68%](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)
of samples; double it to estimate the behavior in 95% of cases.
## Statistical model
Here is the procedure used by `tasty-bench` to measure execution time (a sketch of Step 3 follows the list):
1. Set _n_ ← 1.
2. Measure execution time _tₙ_ of _n_ iterations
and execution time _t₂ₙ_ of _2n_ iterations.
3. Find _t_ which minimizes deviation of (_nt_, _2nt_) from (_tₙ_, _t₂ₙ_).
4. If deviation is small enough (see `--stdev` below),
return _t_ as a mean execution time.
5. Otherwise set _n_ ← _2n_ and jump back to Step 2.
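
Step 3 boils down to a two-point least-squares fit: minimizing (_nt_ − _tₙ_)² + (2_nt_ − _t₂ₙ_)² over _t_ gives _t_ = (_tₙ_ + 2_t₂ₙ_) / 5_n_. Here is a minimal sketch of Steps 3 and 4 (illustrative only, not the exact code inside `tasty-bench`):

```haskell
-- Illustrative sketch, assuming tN and t2N are wall-clock times (in seconds)
-- measured for n and 2n iterations. Minimizing (n*t - tN)^2 + (2*n*t - t2N)^2
-- over t yields the closed form below.
fitMeanTime :: Double -> Double -> Double -> Double
fitMeanTime n tN t2N = (tN + 2 * t2N) / (5 * n)

-- Relative deviation of the fit, compared against the --stdev target in Step 4.
relativeStdev :: Double -> Double -> Double -> Double
relativeStdev n tN t2N = sqrt ((tN - n * t) ^ 2 + (t2N - 2 * n * t) ^ 2) / (n * t)
  where
    t = fitMeanTime n tN t2N
```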
This is roughly similar to the linear regression approach which `criterion` takes,
but we fit only the last two points. This allows us to simplify away all heavy-weight
statistical analysis. More importantly, earlier measurements,
which are presumably shorter and noisier, do not affect the overall result.
This is in contrast to `criterion`, which fits all measurements and
is biased to use more data points corresponding to shorter runs
(it employs an _n_ ← _1.05n_ progression).
An alert reader could object that we measure standard deviation
for samples with _n_ and _2n_ iterations, but report
it scaled to a single iteration.
Strictly speaking, this is justified only if we assume
that deviating factors are either roughly periodic
(e. g., coarseness of a system clock, garbage collection)
or are likely to affect several successive iterations in the same way
(e. g., slow down by another concurrent process).
Obligatory disclaimer: statistics is a tricky matter; there is no
one-size-fits-all approach.
In the absence of a good theory,
simplistic approaches are as (un)sound as obscure ones.
Those who seek statistical soundness should rather collect raw data
and process it themselves using a proper statistical toolbox.
Data reported by `tasty-bench`
is only of indicative and comparative significance.
## Memory usage
Passing `+RTS -T` (via `cabal bench --benchmark-options '+RTS -T'`
or `stack bench --ba '+RTS -T'`) enables `tasty-bench` to estimate and report
memory usage such as allocated and copied bytes:
```
All
  fibonacci numbers
    fifth:     OK (2.13s)
       63 ns ± 3.4 ns, 223 B  allocated,  0 B  copied
    tenth:     OK (1.71s)
      809 ns ±  73 ns, 2.3 KB allocated,  0 B  copied
    twentieth: OK (3.39s)
      104 μs ± 4.9 μs, 277 KB allocated, 59 B  copied

All 3 tests passed (7.25s)
```
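
Memory figures are only available when RTS statistics are being collected, which is exactly what `+RTS -T` turns on. If in doubt whether the flag reached your executable, a quick standalone check (not part of the `tasty-bench` API) is `GHC.Stats.getRTSStatsEnabled`:

```haskell
import GHC.Stats (getRTSStatsEnabled)

-- Prints whether the program was started with RTS statistics enabled (+RTS -T).
main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  putStrLn $
    if enabled
      then "RTS statistics are on: allocation and copying figures will be reported."
      else "Run with +RTS -T to enable memory reporting."
```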
## Combining tests and benchmarks
When optimizing an existing function, it is important to check that its
observable behavior remains unchanged. One can rebuild
both tests and benchmarks after each change, but it would be more convenient
to run sanity checks within the benchmark suite itself. Since our benchmarks
are compatible with `tasty` tests, we can easily do so.
Imagine you come up with a faster function `myFibo` to generate Fibonacci numbers:
```haskell
import Test.Tasty.Bench
import Test.Tasty.QuickCheck -- from tasty-quickcheck package

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

myFibo :: Int -> Integer
myFibo n = if n < 3 then toInteger n else myFibo (n - 1) + myFibo (n - 2)

main :: IO ()
main = Test.Tasty.Bench.defaultMain -- not Test.Tasty.defaultMain
  [ bench "fibo 20"   $ nf fibo   20
  , bench "myFibo 20" $ nf myFibo 20
  , testProperty "myFibo = fibo" $ \n -> fibo n === myFibo n
  ]
```
This outputs:
```
All
  fibo 20:       OK (3.02s)
    104 μs ± 4.9 μs
  myFibo 20:     OK (1.99s)
     71 μs ± 5.3 μs
  myFibo = fibo: FAIL
    *** Failed! Falsified (after 5 tests and 1 shrink):
    2
    1 /= 2
    Use --quickcheck-replay=927711 to reproduce.

1 out of 3 tests failed (5.03s)
```
We see that `myFibo` is indeed significantly faster than `fibo`,
but unfortunately does not do the same thing. One should probably
look for another way to speed up generation of Fibonacci numbers.
## Troubleshooting
* If benchmarks take too long, set `--timeout` to limit the execution time
of individual benchmarks, and `tasty-bench` will do its best to fit
into the given time frame. Without `--timeout` we rerun benchmarks until
the target precision set by `--stdev` is achieved, which in the noisy environment
of a modern laptop with a GUI may take a lot of time.
While `criterion` runs each benchmark for at least 5 seconds,
`tasty-bench` is happy to conclude earlier, if it does not compromise
the quality of results. In our experiments `tasty-bench` suites
tend to finish earlier, even if some individual benchmarks
take longer than with `criterion`.
A common source of noise is garbage collection. Setting a larger
allocation area (_nursery_) is often a good idea, either via
`cabal bench --benchmark-options '+RTS -A32m'` or `stack bench --ba '+RTS -A32m'`.
Alternatively, bake it into the
`cabal` file as `ghc-options: "-with-rtsopts=-A32m"`.
For GHC >= 8.10 consider switching benchmarks to the non-moving garbage collector,
because it decreases GC pauses and the corresponding noise: `+RTS --nonmoving-gc`.
* If benchmark results look malformed like below, make sure that you are
invoking `Test.Tasty.Bench.defaultMain` and not `Test.Tasty.defaultMain`
(the difference is `consoleBenchReporter` vs. `consoleTestReporter`):
```
All
  fibo 20: OK (1.46s)
    Response {respEstimate = Estimate {estMean = Measurement {measTime = 87496728, measAllocs = 0, measCopied = 0}, estStdev = 694487}, respIfSlower = FailIfSlower Infinity, respIfFaster = FailIfFaster Infinity}
```
* If benchmarks fail with an error message
```
Unhandled resource. Probably a bug in the runner you're using.
```
this is probably caused by `env` or `envWithCleanup` affecting the benchmark structure.
You can use `env` to read test data from `IO`, but not to read benchmark names
or affect their hierarchy in any other way (see the sketch below). This is a fundamental
restriction of `tasty`, required to list and filter benchmarks without launching missiles.
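
For instance, the following sketch (with hypothetical file names) contrasts a valid use of `env` with an invalid one:

```haskell
import Test.Tasty.Bench

-- OK: the benchmark tree is static; env merely supplies the input data.
goodBench :: Benchmark
goodBench = env (readFile "data.txt") $ \contents ->
  bench "length of data.txt" $ nf length contents

-- Not OK: the benchmark name would depend on the resource, so tasty
-- cannot list or filter the tree without running the IO action first.
-- badBench = env (readFile "name.txt") $ \name -> bench name $ nf length name
```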
## Isolating interfering benchmarks
One difficulty of benchmarking in Haskell is that it is
hard to isolate benchmarks so that they do not interfere.
Changing the order of benchmarks or skipping some of them
has an effect on the heap's layout and thus affects garbage collection.
This issue is well attested in
[both](https://github.com/haskell/criterion/issues/166)
[`criterion`](https://github.com/haskell/criterion/issues/60)
and
[`gauge`](https://github.com/vincenthz/hs-gauge/issues/2).
Usually (but not always) skipping some benchmarks speeds up the remaining ones.
That's because once a benchmark allocates heap which for some reason
is not promptly released afterwards (e. g., it forced a top-level thunk
in an underlying library), all further benchmarks are slowed down
by the garbage collector processing this additional amount of live data
over and over again.
There are several mitigation strategies. First of all, giving the garbage collector
more breathing space with `+RTS -A32m` (or more) is often good enough.
Further, avoid using top-level bindings to store large test data. Once such thunks
are forced, they remain allocated forever, which detrimentally affects subsequent,
unrelated benchmarks. Treat them as external data, supplied via `env`: instead of
```haskell
largeData :: String
largeData = replicate 1000000 'a'

main :: IO ()
main = defaultMain
  [ bench "large" $ nf length largeData, ... ]
```
use
```haskell
import Control.DeepSeq (force)
import Control.Exception (evaluate)

main :: IO ()
main = defaultMain
  [ env (evaluate (force (replicate 1000000 'a'))) $ \largeData ->
    bench "large" $ nf length largeData, ... ]
```
Finally, as an ultimate measure to reduce interference between benchmarks,
one can run each of them in a separate process. We do not quite recommend
this approach, but if you are desperate, here is how.
Assuming that the benchmark is declared in the `cabal` file as a `benchmark my-bench` component,
let's first find its executable:
```bash
cabal build --enable-benchmarks my-bench
MYBENCH=`cabal list-bin my-bench`
```
Now list all benchmark names (hopefully, they do not contain newlines),
escape quotes and slashes, and run each of them separately:
```bash
$MYBENCH -l | sed -e 's/[\"]/\\\\\\&/g' | while read name; do $MYBENCH -p '$0 == "'$name'"'; done
```
## Comparison against baseline
One can compare benchmark results against an earlier baseline in an automatic way.
To use this feature, first run `tasty-bench` with the `--csv FILE` option
to dump results to `FILE` in CSV format
(it could be a good idea to set a smaller `--stdev`, if possible):
```
Name,Mean (ps),2*Stdev (ps)
All.fibonacci numbers.fifth,48453,4060
All.fibonacci numbers.tenth,637152,46744
All.fibonacci numbers.twentieth,81369531,3342646
```
Note that the columns do not match the CSV reports of `criterion` and `gauge`.
If desired, missing columns can be faked with
`awk 'BEGIN {FS=",";OFS=","}; {print $1,$2,$2,$2,$3/2,$3/2,$3/2}'` or similar.
Now modify the implementation and rerun the benchmarks
with the `--baseline FILE` option. This produces a report as follows:
```
All
  fibonacci numbers
    fifth:     OK (0.44s)
       53 ns ± 2.7 ns,  8% slower than baseline
    tenth:     OK (0.33s)
      641 ns ±  59 ns
    twentieth: OK (0.36s)
       77 μs ± 6.4 μs,  5% faster than baseline

All 3 tests passed (1.50s)
```
You can also fail benchmarks that deviate too far from the baseline, using
the `--fail-if-slower` and `--fail-if-faster` options. For example, setting both of them
to 6 will fail the first benchmark above (because it is more than 6% slower),
but the last one still succeeds (even though it is measurably faster than the baseline,
the deviation is less than 6%). Consider also using `--hide-successes` to show
only problematic benchmarks, or even the
[`tasty-rerun`](http://hackage.haskell.org/package/tasty-rerun) package
to focus on rerunning failing items only.
## Comparison between benchmarks
You can also compare benchmarks to each other without reaching for external tools,
all in the comfort of your terminal.
```haskell
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "fibonacci numbers"
    [ bcompare "tenth" $ bench "fifth"     $ nf fibo  5
    , bench "tenth" $ nf fibo 10
    , bcompare "tenth" $ bench "twentieth" $ nf fibo 20
    ]
  ]
```
This produces a report, comparing mean times of `fifth` and `twentieth` to `tenth`:
```
All
  fibonacci numbers
    fifth:     OK (16.56s)
      121 ns ± 2.6 ns, 0.08x
    tenth:     OK (6.84s)
      1.6 μs ±  31 ns
    twentieth: OK (6.96s)
      203 μs ± 4.1 μs, 128.36x
```
Locating a baseline benchmark in larger suites could get tricky;
```haskell
bcompare "$NF == \"tenth\" && $(NF-1) == \"fibonacci numbers\""
```
is a more robust choice of
an [`awk` pattern](https://github.com/feuerbach/tasty#patterns) here.
## Plotting results
Users can dump results into CSV with `--csv FILE`
and plot them using `gnuplot` or other software. But for convenience
there is also a built-in quick-and-dirty SVG plotting feature,
which can be invoked by passing `--svg FILE`. Here is a sample of its output:
![Plotting](./example.svg)
## Command-line options
Use `--help` to list command-line options.
* `-p`, `--pattern`
This is a standard `tasty` option, which allows filtering benchmarks
by a pattern or `awk` expression. Please refer to
[`tasty` documentation](https://github.com/feuerbach/tasty#patterns)
for details.
* `-t`, `--timeout`
This is a standard `tasty` option, which sets a timeout for individual benchmarks
in seconds. Use it when benchmarks tend to take too long: `tasty-bench` will make
an effort to report results (even if of subpar quality) before the timeout. Setting
the timeout too tight (insufficient for at least three iterations)
will result in a benchmark failure.
* `--stdev`
Target relative standard deviation of measurements in percent (5% by default).
Large values correspond to fast and loose benchmarks, and small ones to long and precise ones.
If it takes far too long, consider setting `--timeout`,
which will interrupt benchmarks, potentially before reaching the target deviation.
* `--csv`
File to write results in CSV format.
* `--baseline`
File to read baseline results in CSV format (as produced by `--csv`).
* `--fail-if-slower`, `--fail-if-faster`
Upper bounds of the acceptable slowdown / speedup in percent. If a benchmark is unacceptably slower / faster than the baseline (see `--baseline`),
it will be reported as failed. Can be used in conjunction with
the standard `tasty` option `--hide-successes` to show only problematic benchmarks.
* `--svg`
File to plot results in SVG format.