This PR adds some automation around assigning release management. It
adds a file `release/rotation`; each week, the updated CI cron job
will:
- Open a PR for the new release (as it does today).
- Assign the first user in the file to that PR.
- Add the Standard-Change label to the PR.
- Start the build for that PR (as it does today).
- Open a new PR that rotates the `release/rotation` file, i.e. moves
the first line to the end of the file (see the sketch below).
This PR also adds mentions of the "release handler" (the first line of
`release/rotation`) to the various messages we send to Slack along the
release process.
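For reference, a minimal sketch of the rotation step, assuming
`release/rotation` holds one GitHub handle per line (the actual script
may differ):
```
# Move the first line of release/rotation to the end of the file.
{ tail -n +2 release/rotation; head -n 1 release/rotation; } > rotation.tmp
mv rotation.tmp release/rotation
```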
The initial state of the `release/rotation` file has been created by
listing all the volunteers (Language team, Application Runtime team, as
well as @SamirTalwar-DA and @stefanobaghino-da) and piping the file
through `shuf`. (Then I put myself at the top so I can hopefully iron
out the issues with the first attempt.)
CHANGELOG_BEGIN
CHANGELOG_END
This script is no longer relevant to our internal processes. The report
is now generated by the security team and validated by us, rather than
produced and validated by us.
CHANGELOG_BEGIN
CHANGELOG_END
Even though the Azure Pipelines bot account _clearly_ has write access
to our repo, as it can create the PR, it does not count as having write
access for the purposes of Azure deciding to run the build on the PRs it
opens. Not having the build run on the release PR would defeat the
whole point of having it, so this PR gives Azure a little nudge to
start the build after it opens the PR.
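As an illustration of the kind of nudge involved (the exact mechanism
in this PR may differ), one well-known option is to have the bot
comment `/azp run` on the freshly opened PR, which asks the Azure
Pipelines GitHub app to queue a build:
```
# Hypothetical helper: comment "/azp run" on the new PR via the GitHub
# API; PR_NUMBER and GITHUB_TOKEN are assumed to be set by CI.
curl -sfL \
  -H "Authorization: token $GITHUB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"body": "/azp run"}' \
  "https://api.github.com/repos/digital-asset/daml/issues/$PR_NUMBER/comments"
```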
CHANGELOG_BEGIN
CHANGELOG_END
Based on feedback from @nickchapman-da, this PR aims at making the
release process easier by:
- Automatically opening a release PR on Wednesday morning. The goal here
is that by the time we start working, there is a release already
built, so we save about an hour on waiting for that. This obviously
doesn't help with ad-hoc releases.
- On a release PR build, posting to Slack when the release is ready to
merge.
- On a release master build, posting to Slack when a release is ready to
be tested.
My hope is that this makes the release process less tedious. This is not
trying to address the actual release testing, but hopefully should
reduce the annoyance of having to constantly go and check if the release
is ready.
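The Slack notifications themselves need nothing more than the standard
incoming-webhook call; a minimal sketch, assuming the webhook URL is
provided as a pipeline secret:
```
# Post a message to the team channel via a Slack incoming webhook.
curl -sfL -XPOST \
  -H "Content-Type: application/json" \
  -d '{"text": "Release PR is ready to merge."}' \
  "$SLACK_WEBHOOK"
```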
CHANGELOG_BEGIN
CHANGELOG_END
simplify docs cron
This commit changes the "live state" so that all versions are present
on S3, most of them hidden the way snapshots currently are, and only
the "supported" versions, i.e. stable and >= 1.0.0, are displayed in
the drop-down.
The docs cron will now:
- Get list of versions from GitHub (as it does now)
- Get list of versions from S3 (as it does now: versions.json +
snapshots.json, though it assumes we'll have a follow-up PR to change
the latter to hidden.json)
- Compare; if the sets of versions are the same, stop there (see the
sketch below). Note: the "set of versions" here includes the notion of
which versions are shown, not just which ones exist. See the Versions
data type in the code.
- If there is a new hidden version, just build that, push it, and
change nothing else. There is no need to download any of the existing
versions or mess around with anything else, except for updating
`hidden.json`; otherwise we would redo this work on every run.
- If there is a new visible version:
- check if we have it locally (i.e. from the previous step: it's a
version we just added)
- figure out the old and new default versions, and then apply the diff
to the top-level directory. Basically download the two folders, list
files that exist in the old one and not in the new one, delete those
from S3, then push the new one to the top-level on S3.
- update versions.json & hidden.json (and for now snapshots.json)
This means that:
- we never mess with the existing versions; we don't need to download
them, we don't need to change them, we don't clean them up. Old links
keep working forever.
- The running time for the docs cron is roughly constant, in that it
should very rarely have to either build or upload (or download) more
than 2 versions per run, and if those instances happen they'd be
accidents (we made 3 actual releases in an hour), not build-up over
time.
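A sketch of the early-exit comparison, assuming `jq` is available,
`$BUCKET` names the docs bucket, and the JSON files are simple objects
keyed by version (the real code models visibility explicitly with the
Versions data type):
```
# Stop early if GitHub and S3 agree on the set of versions.
github_versions=$(git ls-remote --tags https://github.com/digital-asset/daml.git \
                  | awk -F/ '!/\^\{\}/ {print $NF}' | sort)
s3_versions=$( (aws s3 cp "s3://$BUCKET/versions.json" -;
                aws s3 cp "s3://$BUCKET/hidden.json" -) \
               | jq -r 'keys[]' | sort)
if [ "$github_versions" = "$s3_versions" ]; then
  echo "Nothing to do."
  exit 0
fi
```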
* Upgrade nixpkgs revision
* Remove unused minio
It used to be used as a gateway to push the Nix cache to GCS, but has
since been replaced by nix-store-gcs-proxy.
* Update Bazel on Windows
changelog_begin
changelog_end
* Fix hlint warnings
The nixpkgs update implied an hlint update which enabled new warnings.
* Fix "Error applying patch"
Since Bazel 2.2.0 the order of generating `WORKSPACE` and `BUILD` files
and applying patches has been reversed. This allows users to define
patches to these files that will not be immediately overwritten.
However, it also means that patches on another repository's original
`WORKSPACE` file will likely become invalid.
* a948eb7255
* https://github.com/bazelbuild/bazel/issues/10681
Hint: If you're generating a patch with `git` then you can use the
following command to exclude the `WORKSPACE` file.
```
git diff ':(exclude)WORKSPACE'
```
* Update rules_nixpkgs
* nixpkgs location expansion escaping
* Drop --noincompatible_windows_native_test_wrapper
* client_server_test using sh_inline_test
client_server_test used to produce an executable shell script in the
form of a text file output. However, since the removal of
`--noincompatible_windows_native_test_wrapper` this no longer works on
Windows since `.sh` files are not directly executable on Windows.
This change fixes the issue by producing the script file in a dedicated
rule and then wrapping it in a `sh_test` rule which also works on
Windows.
* daml_test using sh_inline_test
* daml_doc_test using sh_inline_test
* _daml_validate_test using sh_inline_test
* damlc_compile_test using sh_inline_test
* client_server_test find .exe on Windows
* Bump Windows cache for Bazel update
Remove `clean --expunge` after merge.
Co-authored-by: Andreas Herrmann <andreas.herrmann@tweag.io>
* Fix reference to return produced by ApplicativeDo
see https://github.com/digital-asset/ghc/pull/53 for details.
fixes #6820
changelog_begin
changelog_end
* bump to merged commit
changelog_begin
changelog_end
* switch to new ghc-lib
changelog_begin
changelog_end
* Move public code into daml-integration-api
CHANGELOG_BEGIN
[DAML Integration Kit]: Removed sandbox specific code from the API intended to be used by ledger integrations. Use the maven coordinates ``com.daml:participant-integration-api:VERSION`` instead of ``com.daml:ledger-api-server`` or ``com.daml:sandbox``.
CHANGELOG_END
This reverts commit 61e9df3eaf.
This interacts very badly with the fact that we check out old commits
for releases. While we could fix it for this particular issue, I don’t
think it buys us enough to make it worth doing, and it makes it easy
to introduce issues in the future if we modify lib.sh.
changelog_begin
changelog_end
This bumps the timeout of the compat tests on PRs to 360 minutes,
matching other jobs on a PR (we mainly hit this if ghc-lib is rebuilt),
and the timeout on the daily jobs to 720 minutes (we hit this if
_everything_ is rebuilt).
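The change itself is just the per-job timeout setting in the Azure
Pipelines config, roughly like this (job names here are illustrative):
```
- job: compatibility_pr
  timeoutInMinutes: 360
- job: compatibility_daily
  timeoutInMinutes: 720
```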
I am slightly worried about the timeout on the daily job. After having
taken a look at it, there are a few reasons why we ended up here:
1. We started including more tests, e.g., sandbox-classic. Not much we
can do here, those tests are useful.
2. We have a very large number of snapshots for 1.3.0. There are a few
reasons for this:
1. Timing: We branched off early for the 1.2.0 release so the first
snapshot for 1.3 was on June 3rd. For 1.4 it looks like the first
snapshot will be on July 15th so that’s roughly 2 extra
snapshots just due to timing.
2. Additional snapshots: We had one broken snapshot due to a broken
VSCode extension that we didn’t delete (probably not worth doing
at this point). We also had to backport to an old snapshot which
resulted in another extra snapshot. We also had one extra
snapshot which was supposed to be the RC but wasn’t since the
ANF revert needed to go in.
The only thing that is clearly useless is the one broken snapshot,
but that doesn’t change things that much. I see 2 orthogonal
options for improving this, assuming we agree that the current
runtime is worryingly high.
1. Prune snapshots more aggressively, e.g., only include the last 3
snapshots. That’s a pretty arbitrary decision but it would
enforce a hard limit.
2. Reduce test combinations. E.g., only test snapshots vs stable
releases but not snapshots vs snapshots.
3. We end up forcing a full build quite frequently. Here are just 2
examples of how we’ve done that so far.
1. Upgrade rules_haskell. Basically all tests are run by a Haskell
binary so this forces a full rebuild.
2. Change runfiles of `daml`.
I don’t think there is much we can do about 1 or 3, which leaves us
with 2. One not entirely unreasonable option is to just do nothing. We
did have periods where things went pretty smoothly for the most part
and each month we reset to a much smaller number of releases (we also
have to start throwing out old stable releases at some
point). Otherwise reducing the number of test combinations seems the
most promising option to me.
changelog_begin
changelog_end
This changed due to the revert of the ANF changes; the change is
harmless by the same reasoning that made bumping it harmless when we
introduced it.
changelog_begin
changelog_end
We saw a flake recently where PostgreSQL stopped accepting connections
during a CI run, leading the build to fail. This increases the number of
connections to 200 from the default of 100, hopefully mitigating issues
such as this one.
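For illustration, the kind of change involved, assuming the CI script
starts the cluster with `pg_ctl` (the actual invocation may differ):
```
# Raise the connection limit from the default 100 to 200 when starting
# the CI PostgreSQL instance.
pg_ctl start -D "$PGDATA" -o "-c max_connections=200"
```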
CHANGELOG_BEGIN
CHANGELOG_END
document shared memory segment issue
After discussion with @SamirTalwar-DA, we agreed that the CI script to
clear shared memory segments is a bit too dangerous to make easy to
run on developer machines. Still, developers may run into similar
issues if
they run lots of tests and/or do not reboot their laptop frequently.
On developer laptops, we usually spawn one PostgreSQL instance per
build/test that needs it (as opposed to CI where we create a single one
for the entire build; see `build.sh`), so leaked segments can actually
build up fairly quickly in some scenarios.
As an alternative, I have added a section to the README to cover what to
do if that issue happens.
CHANGELOG_BEGIN
CHANGELOG_END
For a while now we've had errors along the lines of
```
FATAL: could not create shared memory segment: No space left on device
DETAIL: Failed system call was shmget(key=5432001, size=56, 03600).
HINT: This error does *not* mean that you have run out of disk space.
It occurs either if all available shared memory IDs have been taken, in
which case you need to raise the SHMMNI parameter in your kernel, or
because the system's overall limit for shared memory has been reached.
The PostgreSQL documentation contains more information about
shared memory configuration.
child process exited with exit code 1
```
on macOS CI nodes, which we were not able to reproduce locally. Today I
managed to, sort of by accident, and that allowed me to dig a bit
further.
The root cause seems to be that PostgreSQL, as run by Bazel, does not
always seem to properly unlink the shared memory segment it uses to
communicate with itself. On my machine, running:
```
bazel test -t- --runs_per_test=100 //ledger/sandbox:conformance-test-wall-clock-postgresql
```
and eyeballing the results of
```
watch ipcs -mcopt
```
I would say about one in three runs leaks its memory segment. After much
googling and some head scratching trying to figure out the C APIs for
managing shared memory segments on macOS, I kind of stumbled on a
reference to `ipcrm` in a comment on some low-ranking StackOverflow
answer. It looks like it works very well on my machine, even if I
run it while a test (and therefore an instance of pg) is running. I
believe this is because the command does not actually remove the shared
memory segments, but simply marks them for removal once the last
process stops using them. (At least that's what the manpage describes.)
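For the record, the cleanup amounts to something like this (column
positions in the `ipcs` output are an assumption and may vary between
macOS versions):
```
# Mark every shared memory segment for removal; segments still in use
# only disappear once the last attached process exits.
for shmid in $(ipcs -m | awk '$1 == "m" {print $2}'); do
  ipcrm -m "$shmid"
done
```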
CHANGELOG_BEGIN
CHANGELOG_END
This script was supposed to remove old agents from the Azure Pipelines
UI. It may have been useful at some time (notably, when we used
ephemeral instances, they did not necessarily get to run their shutdown
script), but as it stands now, it's broken. The output from that step
ends in:
```
error: 2 derivations need to be built, but neither local builds ('--max-jobs') nor remote builds ('--builders') are enabled
```
after listing the nix packages it would build. Furthermore, it does not
seem to be useful as I have not seen any spurious entry in the agents
list on Azure since we switched to permanent nodes, on either the Linux
or Windows side (and this would only run on Linux, if it ran).
I'm also not convinced it ever ran, as I used to see a lot of spurious
machines on both Linux and Windows when we did use ephemeral instances.
CHANGELOG_BEGIN
CHANGELOG_END
Nix now requires `-L`; I’ve gone ahead and just normalized everything
to use `-sfL`, which we were already using in one place.
changelog_begin
changelog_end
fixes #3150
This PR introduces a patch to GHC to fix the performance of the
pattern match checker in the presence of multiple packages, which
is currently significantly (orders of magnitude) slower than having
everything in a single package. I also added a test case that hits
this. Here’s what you need to hit this issue:
1. A typeclass with a functional dependency. `HasField` is the obvious
candidate for this.
2. A lot of instances of this typeclass in a separate package (this is
the only part where the separate package matters).
3. A reasonably large ADT with a bunch of strict fields.
4. A pattern match in the context of some constraints of the
typeclass. The constraints can be completely unused.
In that case, you will get a significant slowdown that grows with the
number of instances, constructors and constraints (I didn’t verify
whether it’s linear, but it is significant, which is all that
matters).
Here’s why this happens:
1. The pattern match checker checks for strict fields if the type is
inhabited.
2. This calls `pmTopNormaliseType_maybe` to normalize a type (the details don’t
matter), which in turn calls into the typechecker. This function is
called very often (presumably linearly in the number of constructors,
but I didn’t verify).
3. The typechecker has some logic in `improveFromInstEnv` for
generating additional equations by unifying functional
dependencies `a -> b` with constraints in scope
and thereby deducing information about `b`.
4. In the pattern match checker the list of instances of the home
package is empty since the pattern match checker (apparently)
doesn’t actually care about those extra equations. However, the
list of instances in the EPS is not empty. This is the issue here:
By moving it to an external package we suddenly end up with
thousands of instances that we try to unify with the functional
dependencies every time we normalize which happens very often.
Proposed fix:
The solution is rather simple: Since the pattern match checker
apparently does not care about the instances of the home package, it
almost certainly doesn’t care about instances in general so we just
empty the instances of external packages explicitly.
Is the fix correct?
1. I verified that the GHC test suite passes with this patch which
gives me a reasonable level of confidence.
2. I verified that our own test suite passes.
3. The most dodgy part is actually emptying the instances, since the
whole EPS stuff is a mutable mess. What could in theory happen is that
the PM ends up loading an interface file that mutates this
again. However, afaiu it is impossible for the PM to need an
interface that the typechecker didn’t already need. I did do a bunch
of debugging and this is exactly what I observed in my experiments.
Alternative ideas and upstreaming:
The other option would be to not try and mess with the EPS but somehow
have a conditional flag somewhere in the typechecker env to disable
this logic in the pattern match checker. However, that sounds
significantly more complex so I don’t think it’s worth the effort.
GHC 8.10 has a new pattern match checker that has different
performance characteristics and seems to do much better here so there
is little reason to try and upstream this. I strongly want to avoid
upgrading DAML to 8.10 at this point (too much risk; let’s wait until
things calm down).
changelog_begin
- [DAML Compiler] Fix an issue where compilation slowed down
significantly when code was split up into several packages. See
https://github.com/digital-asset/daml/issues/3150
changelog_end
When I changed the quoting for the success case as part of #6267, I
forgot to update the error case, so now we don't get well-formed JSON
for errors.
CHANGELOG_BEGIN
CHANGELOG_END
We've seen a series of failures of the form
```
ERROR: D:/a/1/s/daml-assistant/integration-tests/BUILD.bazel:162:1: output 'daml-assistant/integration-tests/create-daml-app-tests.exe' was not created
ERROR: D:/a/1/s/daml-assistant/integration-tests/BUILD.bazel:162:1: not all outputs were created or valid
```
across multiple machines. We suspect cache poisoning as the cause. This
increments the cache URL to effectively clear the cache.
changelog_begin
changelog_end
Co-authored-by: Andreas Herrmann <andreas.herrmann@tweag.io>
We are seeing
ERROR: D:/a/2/s/compiler/scenario-service/protos/BUILD.bazel:67:1:
output
'compiler/scenario-service/protos/_obj/scenario_service_haskell_proto/ScenarioService.o'
was not created
again, so following our experiments, let’s reset the cache to see if it
fixes anything.
changelog_begin
changelog_end
Thinking about the upcoming release, I realized our current docs cron
has somehow lost the step of taking the release notes from the
triggering commit, probably in all the back-and-forth about which
release notes version to use to overwrite all the other ones.
This restores that, and adapts the algorithm for the new, multi-line
LATEST file format.
This _should_ work for all the current history, including releases made
on `release/*` branches and the unifying commit that turned the LATEST
file multiline (it adds more than one line so won't be matched as a
trigger commit).
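A sketch of the trigger-commit check this implies, assuming a commit
counts as a trigger exactly when it adds a single line to LATEST
(variable names illustrative):
```
# Count lines added to LATEST by this commit, excluding the +++ header.
added=$(git diff "$COMMIT~1..$COMMIT" -- LATEST | grep -c '^+[^+]')
if [ "$added" -eq 1 ]; then
  echo "$COMMIT is a release trigger commit."
fi
```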
CHANGELOG_BEGIN
CHANGELOG_END
For some reason, platform_suffix doesn’t seem to provide enough
isolation to fix the “undeclared inclusion” errors even though it does
fix the issues for me locally.
This PR tries to address the problem by switching from
`platform_suffix` to modifying the actual URL of the cache.
To avoid leaking stuff from the local cache, I’ve added a `clean
--expunge` for now. We should be able to remove this once nodes have
been reset tomorrow. It will slow down nodes but that is clearly
better than having everything fail.
changelog_begin
changelog_end
When bumping the cache url on Windows, I accidentally also changed the
URL we push to on Linux and macOS. This is obviously a bad idea, so
this PR fixes it.
changelog_begin
changelog_end
* Bump cache suffix
As discussed, we are going to bump this every time we feel like
resetting the cache might help. This is a temporary measure to get
some metrics on how often things break and if resetting the cache
helps.
changelog_begin
changelog_end
* Update configure-bazel as well
changelog_begin
changelog_end
Currently the report fails with `variables[Build.SourceBranchName]:
command not found`, which is obviously not what we want (it’s mixing
up the syntax of Azure’s yaml config and Bash). Looking at the
code in the tell-slack-failed.yml, this one does seem to work but I
haven’t tested this so :crossed-fingers:.
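For context, inside a `bash` step Azure substitutes pipeline variables
using the `$(...)` macro syntax before the script runs; the
`variables[...]` form only exists at the YAML level. A minimal sketch:
```
- bash: |
    # Substituted by Azure before bash ever sees the line:
    echo "Build failed on $(Build.SourceBranchName)"
```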
changelog_begin
changelog_end
automated ghc-lib build
This PR aims at automating the build of ghc-lib. The current process
still has a few manual steps; it needs to be updated because Bintray is
going away, so this seemed like a good opportunity to fully automate it.
This works like the "patch bazel on Windows" jobs: the filename will
contain a hash of the `ci/da-ghc-lib` folder, and the job will run only
if the corresponding filename does not yet exist on the GCS bucket. PRs
aiming at changing the ghc-lib version will need to run twice: once to
create the artifacts, and once to change the `stack-snapshot.yaml` file
to match.
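A sketch of the gating step, with the bucket, artifact name, and hash
derivation as assumptions (the real job derives the hash from the
`ci/da-ghc-lib` folder as described above):
```
# Skip the build if an artifact for the current ci/da-ghc-lib content
# already exists on the GCS bucket.
hash=$(git log -n1 --format=%h -- ci/da-ghc-lib)
if gsutil -q stat "gs://daml-ghc-lib/ghc-lib-$hash.tar.gz"; then
  echo "ghc-lib $hash already built; nothing to do."
  exit 0
fi
```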
CHANGELOG_BEGIN
CHANGELOG_END
* Include rules_haskell revision in platform suffix
Hopefully this makes CI a bit less of a dumpster fire. I’ve also
followed the comment and made the suffix actually 3 characters long
instead of 2 since that makes me worry less about collisions and
should hopefully still be short enough to not hit MAX_PATH.
changelog_begin
changelog_end
* Update ci/configure-bazel.sh
Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
Currently the message to Slack is always triggered by running the daily
checks. This means that it gets very noisy to:
1. Run the check on PRs affecting the check (like this one),
2. Rerun the check multiple times to ascertain that a given failure is
flaky.
With this PR, the message to Slack is replaced with a simple `echo` when
these checks are not run from the `master` branch, so whoever (manually)
triggered them can still get feedback on the result, but other people
don't get spurious `@here` mentions.
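A sketch of the guard, assuming the branch name and webhook URL are
available to the script (variable names illustrative):
```
if [ "$BRANCH" = "master" ]; then
  # Real daily run: page the channel.
  curl -sfL -XPOST -H "Content-Type: application/json" \
       -d '{"text": "<!here> Daily checks failed."}' \
       "$SLACK_WEBHOOK"
else
  # Manually triggered run: keep the feedback, skip the noise.
  echo "Would have posted to Slack: daily checks failed."
fi
```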
CHANGELOG_BEGIN
CHANGELOG_END
* Sort files when calculating CACHE_KEY
The order returned by `find` is unspecified and seems to have changed
for whatever reason in some cases. This changed the cache key which is
obviously not intended. It looks like the one we currently have in our
scoop manifest is the one that we get by sorting. Reversing the sort
produces the one CI currently calculates.
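In other words, the fix amounts to something along these lines (the
path and hashing tool are assumptions):
```
# Sort find's output so CACHE_KEY does not depend on filesystem
# traversal order.
CACHE_KEY=$(find ci -type f -print | sort | xargs md5sum | md5sum | cut -d' ' -f1)
```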
changelog_begin
changelog_end
* update manifest to match CI output
Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
In light of #6127, I kept wondering why rebuilding 1.1.1 would fail. The
problem addressed by #6127 is that we tried to rebuild it, which we
shouldn't, but the reason I noticed it is that the build failed, and
there is no good reason for the 1.1.1 docs to not build anymore. Looking
at the logs confused me even more as it failed with (elided):
```
docs/source/support/new-assistant.rst:
WARNING: document isn't included in any toctree
```
and that change happened _after_ 1.1.1. So I went back to the code, and
discovered I somehow had gotten confused as I changed the approach
mid-way through editing the file. If we're overwriting the
`release-notes.html` file post-build, which we are now doing (and is the
reason for ignoring it when checking checksums), then we should not be
touching the `release-notes.rst` file pre-build.
CHANGELOG_BEGIN
CHANGELOG_END