streamly/benchmark
Harendra Kumar 06e55120ce Rename Consumer to Refold
Consumer is a very generic name. One cannot intuitively distinguish
between Consumer and Fold. "Refold" gives a better idea that its a
slightly modified version of Fold with a starting value. Also, it fits
better as a dual to Unfold.
2021-10-22 01:34:02 +05:30
..
bench-report Add a --quick option 2021-10-12 18:57:14 +05:30
lib/Streamly/Benchmark Fix/enhance benchmarks for concatPairsWith 2021-09-28 16:59:29 +05:30
Streamly/Benchmark Rename Consumer to Refold 2021-10-22 01:34:02 +05:30
NanoBenchmarks.hs Rename Data.Array.Storable.Foreign to Data.Array.Foreign. 2021-02-01 21:31:06 +05:30
README.md Add a section for identifying/fixing perf issues 2021-06-22 23:06:26 +05:30
streamly-benchmarks.cabal Rename Transformation1 benchmark module 2021-10-12 20:46:49 +05:30

Benchmarking

Benchmark Drivers

Two benchmark drivers are supported:

  • tasty-bench (default)
  • gauge (enabled by --use-gauge build flag)

Build and run benchmarks directly

The benchmark executables are tasty-bench executables unless you have passed --use-gauge cabal flag when building in which case it is a gauge executable.

$ cabal run bench:Prelude.Serial  # run selected
$ cabal run bench:Prelude.Serial -- --help # help on arguments
$ cabal run bench:Prelude.Serial -- --stdev 100000 # specify arguments
$ cabal run bench:Prelude.Serial --flag fusion-plugin # with fusion-plugin

$ cabal build bench:Prelude.Serial # build selected
$ cabal build --enable-benchmarks streamly-benchmarks # build all
$ cabal build --enable-benchmarks all # build all, alternate method

$ cabal build --flag "-opt" ... # disable optimization, faster build

Building and Running Benchmarks with bench.sh

<streamly repo>/bin/bench.sh script is the top level driver for running benchmarks. It runs the requested benchmarks and then creates a report from the results using the bench-show package.

IMPORTANT NOTE: The first time you run this script it may take a long time because it has to build the bench-report executable which has a lot of dependencies. If you are using nix then use --use-nix flag for the first time so that the bench-report executable is built using nix. That can save a lot of time compiling it. However, once it is built it will be cached in the bin directory of the repo and used from there every time. You can also build it manually from the cabal file in benchmark/bench-report and install it in the bin directory.

bench.sh: Quick start

Useful commands:

$ bin/bench.sh --help
$ bin/bench.sh --quick # run all the benchmark suites
$ bin/bench.sh --benchmarks help # Show available benchmark suites
$ bin/bench.sh --benchmarks serial_grp # Run all serial benchmark suites
$ bin/bench.sh --benchmarks "Prelude.Serial Data.Parser" # run selected suites
$ bin/bench.sh --no-measure # don't run benchmarks just show previous results

# Run all O(1) space complexity benchmarks in `Prelude.Serial` suite
$ bin/bench.sh --benchmarks Prelude.Serial --prefix Prelude.Serial/o-1-space

# Run a specific benchmark in `Prelude.Serial` suite
$ bin/bench.sh --benchmarks Prelude.Serial --prefix Prelude.Serial/o-1-space.generation.unfoldr

Note: bench.sh enables fusion-plugin by default.

Comparing results with baseline

# Checkout baseline commit
$ bin/bench.sh --quick

# Checkout commit with new changes
$ bin/bench.sh --quick --append

# To add another result to comparisons just repeat the above command on
# desired commit

Comparing benchmark suites

First see the available benchmark suites:

$ bin/bench.sh --benchmarks help

You will see some benchmark suites end with _cmp, these are comparison groups. If you run a comparison group benchmark, comparison of all the benchmark suites in that group will be shown in the end. For example to compare all array benchmark suites:

$ bin/bench.sh --benchmarks array_cmp

Reporting without measuring

You can use the --no-measure option to report the already measured results in the benchmarks results file. A results file may collect an arbitrary number of results by running with --append multiple times. Each benchmark has its own results file, for example the Prelude.Serial benchmark has the results file at charts/Prelude.Serial/results.csv.

You can also manually edit the file to remove a set of results if you like or to append results from previously saved results or from some other results file. After editing you can run bench.sh with the --no-measure option to see the reports corresponding to the results.

Additional benchmark configuration

Stream size

You can specify the stream size (default is 100000) to be used for benchmarking:

$ cabal run bench:Prelude.Serial -- --stream-size 1000000

External input file

In the FileSystem.Handle benchmark you can specify the input file as an environment variable:

$ export Benchmark_FileSystem_Handle_InputFile=./gutenberg-500.txt
$ cabal run FileSystem.Handle -- FileSystem.Handle/o-1-space/reduce/read/S.splitOnSeq

The automatic tests do not test unicode input, this option is useful to specify a unicode text file manually.

Benchmarking notes

We run each benchmark in an isolated process to minimize interference of benchmarks and to be able to control the RTS memory restrictions per benchmark.

Gotchas

Gauge forces a GC before and after the measurement. However, we have observed that sometimes the GC stats may not be accurate when the number of iterations in the measurement is small (e.g. 1 iteration). In such cases usually the number of GCs and GC times would also be 0.

Diagnosing Performance Issues

Reproducible comparison

When comparing different compilers we need to make sure that we are using exactly the same versions of the libraries for apples to apples comparison. We have seen cases where a change in the "random" library caused allocations regressions in the new version of compiler because of the way in which the benchmark code was generated due to the change.

When it is required to reproduce benchmark results precisely across different systems, it is recommended that you create and use a cabal freeze file so that the versions of all libraries are pinned.

Identifying issues

There are two ways to find problematic code:

  1. Run performance benchmarks using bench.sh, select the benchmarks that are taking more than expected time.
  2. When making a new change, compare with the baseline and select benchmarks with the most regression reported by bench.sh.

Number of allocations are the most stable measure that do not vary from run to run. cpuTime and bytesCopied may vary. When comparing two runs for regression the first thing to look at is the difference in allocations. Also note that allocations may vary from run to run for concurrent benchmarks.

The next thing to look at is cpuTime. Please note that cpuTime may fluctuate quite a bit, you may want to run the relevant benchmarks without the --quick mode for confirming and make sure no other load is running on the system when measuring.

Usually the increase is cpuTime is proportional to the increase in allocations but sometimes it may increase independently because more cpu instructions are being executed. TBD - we should count the instructions instead.

Inspection Testing

Before you proceed make sure have to run the benchmarks with inspection flag on. It may catch any obvious issues or regressions.

$ cabal build --flag inspection --flag fusion-plugin --enable-benchmarks streamly-benchmarks

Compiling with diagnostics

  1. Comment out all other benchmarks in the given benchmark suite, and keep only the one you are examining.
  2. Edit the file and add the following line on top:
{-# OPTIONS_GHC
  -ddump-simpl
  -ddump-to-file
  -dsuppress-all
  -Wmissed-specialisations
  -Wall-missed-specialisations
  -fplugin-opt=Fusion.Plugin:verbose=2
  -fplugin-opt=Fusion.Plugin:dump-core
#-}
  1. Build the benchmark suite with fusion-plugin enabled:
$ cabal build bench:Prelude.Serial --flag fusion-plugin

See the .dump-simpl file in the cabal build directory. You can find it like this:

$ find dist-newstyle/ -name "*.dump-simpl"

Make sure you are looking into the right build dir (--build-dir may change dist-newstyle to something else), and check in the appropriate GHC version dir.

Compiling standalone example

Sometimes you may want to create a separate program from the benchmark code removing the benchmarking harness to simplify and isolate the code for better reasoning and simpler core.

Add the following GHC options at the top of your file, say, example.hs:

{-# OPTIONS_GHC
  -ddump-simpl
  -ddump-to-file
  -dsuppress-all
  -Wmissed-specialisations
  -Wall-missed-specialisations
  -fplugin Fusion.Plugin
  -fplugin-opt=Fusion.Plugin:verbose=2
  -fplugin-opt=Fusion.Plugin:dump-core
#-}

Do not include the optimization options in OPTIONS_GHC pragma, instead, specify them on the command line. This is to avoid optimization failing if you import another module which is not compiled with the same optimization options.

$ cabal build # build and write ghc environment file
$ ghc -O2 -fspec-constr-recursive=16 -fmax-worker-args=16 example.hs

To pinpoint where the optimization is going wrong you can examine the plugin generated core files for each optimization pass. The files are numbered for each optimization pass. You can compare successive files using side-by-side diff and see what the compiler is doing between each pass.

Diagnosing the Problem

Specialization Issues

Look for missed specialization messages. When you are comparing against a baseline, check if something that was specialized before is no longer specialized.

In the core you have to look for type class dictionaries e.g.

exc_r6DD = \ @s_a6ai -> try $fMonadCatchIO $fExceptionSomeException

Search for $f in the core.

Fusion Issues

Look for unfused function warnings emitted by fusion-plugin. You may want to take a look at the unfused constructors or functions that fusion-plugin is warning about. Beware that:

  • fusion-plugin emits warnings for unfused stuff in intermediate functions as well, those should be ignored.
  • the constructors may remain genuinely unfused unless the loop is closed. So you should look at the warnings in the file where the loop is closed and everything is supposed to be fused.

Also, look at the core for unfused constructors. At times you may need to look for the boxed primitive type constructors e.g. W8# or I#, these may not be eliminated, usually, due to strictness issues.

Often it is useful to diff and compare the core without the problem and the core with the problem especially in cases when the problem is due to GHC version changes, or smaller changes in the code.

Note, some operations are inherently fusion breaking, those cannot fuse, they are usually annotated so in their documentation.

Resolving the problem

Review the problematic code, see the optimization guide for common problems and how to solve those. If no obvious issues are found on review, then generate and examine the core.

You may want to add the Fuse annotation on some of those constructors to make the code fuse. Please note that unnecessary Fuse annotations may cause unnecessary inlining. Also, make sure that the constructor you are adding fuse annotation is not shared by any other code where you may not want inlining/fusion.