Benchmarking
Benchmark Drivers
Two benchmark drivers are supported:
- tasty-bench (default)
- gauge (enabled by the --use-gauge build flag)
Build and run benchmarks directly
The benchmark executables are tasty-bench executables, unless you passed the --use-gauge cabal flag when building, in which case they are gauge executables.
$ cabal run bench:Prelude.Serial # run selected
$ cabal run bench:Prelude.Serial -- --help # help on arguments
$ cabal run bench:Prelude.Serial -- --stdev 100000 # specify arguments
$ cabal run bench:Prelude.Serial --flag fusion-plugin # with fusion-plugin
$ cabal build bench:Prelude.Serial # build selected
$ cabal build --enable-benchmarks streamly-benchmarks # build all
$ cabal build --enable-benchmarks all # build all, alternate method
$ cabal build --flag "-opt" ... # disable optimization, faster build
Building and Running Benchmarks with bench.sh
The <streamly repo>/bin/bench.sh script is the top-level driver for running benchmarks. It runs the requested benchmarks and then creates a report from the results using the bench-show package.
IMPORTANT NOTE: The first time you run this script it may take a long time because it has to build the bench-report executable, which has a lot of dependencies. If you are using nix, pass the --use-nix flag on the first run so that the bench-report executable is built using nix; that can save a lot of compile time. Once built, the executable is cached in the bin directory of the repo and used from there on every subsequent run. You can also build it manually from the cabal file in benchmark/bench-report and install it in the bin directory.
bench.sh: Quick start
Useful commands:
$ bin/bench.sh --help
$ bin/bench.sh --quick # run all the benchmark suites
$ bin/bench.sh --benchmarks help # Show available benchmark suites
$ bin/bench.sh --benchmarks serial_grp # Run all serial benchmark suites
$ bin/bench.sh --benchmarks "Prelude.Serial Data.Parser" # run selected suites
$ bin/bench.sh --no-measure # don't run benchmarks just show previous results
# Run all O(1) space complexity benchmarks in `Prelude.Serial` suite
$ bin/bench.sh --benchmarks Prelude.Serial --prefix Prelude.Serial/o-1-space
# Run a specific benchmark in `Prelude.Serial` suite
$ bin/bench.sh --benchmarks Prelude.Serial --prefix Prelude.Serial/o-1-space.generation.unfoldr
Note: bench.sh enables fusion-plugin by default.
Comparing results with baseline
# Checkout baseline commit
$ bin/bench.sh --quick
# Checkout commit with new changes
$ bin/bench.sh --quick --append
# To add another result to comparisons just repeat the above command on
# desired commit
Comparing benchmark suites
First see the available benchmark suites:
$ bin/bench.sh --benchmarks help
You will see that some benchmark suite names end with _cmp; these are comparison groups. If you run a comparison group, a comparison of all the benchmark suites in that group is shown at the end. For example, to compare all array benchmark suites:
$ bin/bench.sh --benchmarks array_cmp
Reporting without measuring
You can use the --no-measure option to report the already measured results in the benchmark results file. A results file may collect an arbitrary number of results by running with --append multiple times. Each benchmark has its own results file; for example, the Prelude.Serial benchmark has its results file at charts/Prelude.Serial/results.csv.
You can also manually edit the file to remove a set of results, or to append results from previously saved results or from some other results file. After editing, run bench.sh with the --no-measure option to see the reports corresponding to the results.
Additional benchmark configuration
Stream size
You can specify the stream size (default is 100000) to be used for benchmarking:
$ cabal run bench:Prelude.Serial -- --stream-size 1000000
External input file
In the FileSystem.Handle benchmark you can specify the input file as an environment variable:
$ export Benchmark_FileSystem_Handle_InputFile=./gutenberg-500.txt
$ cabal run FileSystem.Handle -- FileSystem.Handle/o-1-space/reduce/read/S.splitOnSeq
The automatic tests do not test unicode input; this option is useful for specifying a unicode text file manually.
Benchmarking notes
We run each benchmark in an isolated process to minimize interference between benchmarks and to be able to control the RTS memory restrictions per benchmark.
Gotchas
Gauge forces a GC before and after the measurement. However, we have observed that sometimes the GC stats may not be accurate when the number of iterations in the measurement is small (e.g. 1 iteration). In such cases usually the number of GCs and GC times would also be 0.
Diagnosing Performance Issues
Reproducible comparison
When comparing different compilers we need to make sure that we are using exactly the same versions of the libraries for an apples-to-apples comparison. We have seen cases where a change in the "random" library caused allocation regressions with the new version of the compiler, because the change altered the way in which the benchmark code was generated.
When it is required to reproduce benchmark results precisely across different systems, it is recommended that you create and use a cabal freeze file (for example by running cabal freeze, which writes cabal.project.freeze) so that the versions of all libraries are pinned.
Identifying issues
There are two ways to find problematic code:
- Run performance benchmarks using bench.sh and select the benchmarks that are taking more time than expected.
- When making a new change, compare with the baseline and select the benchmarks with the most regression reported by bench.sh.
The number of allocations is the most stable measure; it does not vary from run to run, whereas cpuTime and bytesCopied may vary. When comparing two runs for regression, the first thing to look at is the difference in allocations. Note, however, that allocations may vary from run to run for concurrent benchmarks.
The next thing to look at is cpuTime. Please note that cpuTime may fluctuate quite a bit. To confirm a regression, run the relevant benchmarks without the --quick mode and make sure no other load is running on the system when measuring.
Usually the increase in cpuTime is proportional to the increase in allocations, but sometimes it may increase independently because more cpu instructions are being executed. TBD - we should count the instructions instead.
Inspection Testing
Before you proceed, make sure to run the benchmarks with the inspection flag on. It may catch obvious issues or regressions.
$ cabal build --flag inspection --flag fusion-plugin --enable-benchmarks streamly-benchmarks
Compiling with diagnostics
- Comment out all other benchmarks in the given benchmark suite, and keep only the one you are examining.
- Edit the file and add the following pragma at the top:
{-# OPTIONS_GHC
-ddump-simpl
-ddump-to-file
-dsuppress-all
-Wmissed-specialisations
-Wall-missed-specialisations
-fplugin-opt=Fusion.Plugin:verbose=2
-fplugin-opt=Fusion.Plugin:dump-core
#-}
- Build the benchmark suite with fusion-plugin enabled:
$ cabal build bench:Prelude.Serial --flag fusion-plugin
See the .dump-simpl file in the cabal build directory. You can find it like this:
$ find dist-newstyle/ -name "*.dump-simpl"
Make sure you are looking in the right build dir (--build-dir may change dist-newstyle to something else), and check in the appropriate GHC version dir.
Compiling standalone example
Sometimes you may want to create a separate program from the benchmark code removing the benchmarking harness to simplify and isolate the code for better reasoning and simpler core.
Add the following GHC options at the top of your file, say, example.hs:
{-# OPTIONS_GHC
-ddump-simpl
-ddump-to-file
-dsuppress-all
-Wmissed-specialisations
-Wall-missed-specialisations
-fplugin Fusion.Plugin
-fplugin-opt=Fusion.Plugin:verbose=2
-fplugin-opt=Fusion.Plugin:dump-core
#-}
Do not include the optimization options in the OPTIONS_GHC pragma; instead, specify them on the command line. This avoids optimization failing if you import another module which is not compiled with the same optimization options.
$ cabal build # build and write ghc environment file
$ ghc -O2 -fspec-constr-recursive=16 -fmax-worker-args=16 example.hs
To pinpoint where the optimization is going wrong, you can examine the plugin-generated core files for each optimization pass. The files are numbered, one per optimization pass. You can compare successive files using a side-by-side diff and see what the compiler is doing between each pass.
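The side-by-side comparison of two successive pass dumps can be sketched as follows. This is only an illustration: the file names pass-01.core and pass-02.core and their contents are hypothetical stand-ins written by the script itself (the real dump file names depend on the plugin), and the -y and --suppress-common-lines options assume GNU diffutils.

```shell
# Hypothetical stand-ins for two successive plugin-generated core dumps.
dir=$(mktemp -d)
printf '%s\n' 'f = \ x -> g (h x)' 'k = 1' > "$dir/pass-01.core"
printf '%s\n' 'f = \ x -> gh x'    'k = 1' > "$dir/pass-02.core"

# -y prints the two files side by side; --suppress-common-lines hides the
# lines that are identical in both. diff exits with status 1 when the
# files differ, hence the "|| true".
changes=$(diff -y --suppress-common-lines \
    "$dir/pass-01.core" "$dir/pass-02.core" || true)
printf '%s\n' "$changes"

rm -rf "$dir"
```

Lines marked with | in the middle column differ between the two passes; lines that are the same in both files are not shown.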
Diagnosing the Problem
Specialization Issues
Look for missed specialization messages. When you are comparing against a baseline, check if something that was specialized before is no longer specialized.
In the core, look for type class dictionaries, e.g.:
exc_r6DD = \ @s_a6ai -> try $fMonadCatchIO $fExceptionSomeException
Search for $f in the core.
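A minimal sketch of such a search, assuming a POSIX shell and a grep that supports -o. The script writes its own sample file: the first line is the example shown above, the second is a made-up line without any dictionary. On a real build you would point grep at the .dump-simpl file instead.

```shell
# Sample core: line 1 is the example from this document,
# line 2 is a hand-written line with no dictionary reference.
core=$(mktemp)
cat > "$core" <<'EOF'
exc_r6DD = \ @s_a6ai -> try $fMonadCatchIO $fExceptionSomeException
main1 = \ s -> exc_r6DD s
EOF

# List every distinct name that starts with "$f".
dicts=$(grep -o '\$f[A-Za-z0-9_]*' "$core" | sort -u)
printf '%s\n' "$dicts"

rm -f "$core"
```

Each name printed is a dictionary that is still referenced in the core. Comparing this list against the one from a baseline build is a quick way to spot a specialization that was lost.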
Fusion Issues
Look for unfused function warnings emitted by fusion-plugin. You may want to take a look at the unfused constructors or functions that fusion-plugin is warning about. Beware that:
- fusion-plugin emits warnings for unfused stuff in intermediate functions as well; those should be ignored.
- the constructors may remain genuinely unfused unless the loop is closed. So you should look at the warnings in the file where the loop is closed and everything is supposed to be fused.
Also, look at the core for unfused constructors. At times you may need to look for the boxed primitive type constructors, e.g. W8# or I#. When these are not eliminated, it is usually due to strictness issues.
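As a rough sketch, the boxed constructors named above can be located and counted with grep. The core lines below are hand-written stand-ins rather than real GHC output, and the pattern only covers I# and W8#; extend it for other boxed types as needed.

```shell
# Hand-written stand-in for a core dump that still boxes Int and Word8.
core=$(mktemp)
cat > "$core" <<'EOF'
main_go = \ x -> case x of { I# n -> main_go (I# (+# n 1#)) }
main_w = \ w -> case w of { W8# b -> W8# b }
EOF

# Show the lines (with line numbers) that mention a boxed constructor.
grep -n -E '(I|W8)#' "$core"

# Count individual occurrences, to compare against a baseline dump.
boxed_total=$(grep -o -E '(I|W8)#' "$core" | wc -l | tr -d ' ')
echo "boxed constructor occurrences: $boxed_total"

rm -f "$core"
```

A count that goes up relative to the baseline suggests that some values which used to be unboxed are now being boxed.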
Often it is useful to diff the core without the problem against the core with the problem, especially when the problem is due to GHC version changes or small changes in the code.
Note that some operations are inherently fusion breaking and cannot fuse; they are usually annotated as such in their documentation.
Resolving the problem
Review the problematic code and see the optimization guide for common problems and how to solve them. If no obvious issues are found on review, then generate and examine the core.
You may want to add the Fuse annotation on some of those constructors to make the code fuse. Please note that unnecessary Fuse annotations may cause unnecessary inlining. Also, make sure that the constructor to which you are adding the Fuse annotation is not shared by any other code where you may not want inlining/fusion.