When I changed the quoting for the success case as part of #6267, I
forgot to update the error case, so now we don't get well-formed JSON
for errors.
CHANGELOG_BEGIN
CHANGELOG_END
We've seen a series of failures of the form
```
ERROR: D:/a/1/s/daml-assistant/integration-tests/BUILD.bazel:162:1: output 'daml-assistant/integration-tests/create-daml-app-tests.exe' was not created
ERROR: D:/a/1/s/daml-assistant/integration-tests/BUILD.bazel:162:1: not all outputs were created or valid
```
across multiple machines. We suspect cache poisoning as the cause. This
increments the cache URL to effectively clear the cache.
changelog_begin
changelog_end
Co-authored-by: Andreas Herrmann <andreas.herrmann@tweag.io>
We are seeing
ERROR: D:/a/2/s/compiler/scenario-service/protos/BUILD.bazel:67:1:
output
'compiler/scenario-service/protos/_obj/scenario_service_haskell_proto/ScenarioService.o'
was not created
again so following our experiments, let’s reset the cache to see if it
fixes anything.
changelog_begin
changelog_end
Thinking about the upcoming release, I realized our current docs cron
has somehow lost the step of taking the release notes from the
triggering commit, probably in all the back-and-forth about which
release notes version to use to overwrite all the other ones.
This restores that, and adapts the algorithm for the new, multi-line
LATEST file format.
This _should_ work for all the current history, including releases made
on `release/*` branches and the unifying commit that turned the LATEST
file multiline (it adds more than one line so won't be matched as a
trigger commit).
CHANGELOG_BEGIN
CHANGELOG_END
For some reason, platform_suffix doesn’t seem to provide enough
isolation to fix the “undeclared inclusion” errors even though it does
fix the issues for me locally.
This PR tries to address the problem by switching from
`platform_suffix` to modifying the actual URL of the cache.
To avoid leaking stuff from the local cache, I’ve added a clean
--expunge for now. We should be able to remove this once nodes have
been reset tomorrow. It will slow down nodes but that is clearly
better than having everything fail.
changelog_begin
changelog_end
When bumping the cache url on Windows, I accidentally also changed the
URL we push to on Linux and MacOS. This is obviously a bad idea so
this PR fixes it.
changelog_begin
changelog_end
* Bump cache suffix
As discussed, we are going to bump this every time we feel like
resetting the cache might help. This is a temporary measure to get
some metrics on how often things break and if resetting the cache
helps.
changelog_begin
changelog_end
* Update configure-bazel as well
changelog_begin
changelog_end
Currently the report fails with variables[Build.SourceBranchName]:
command not found which is obviously not what we want (it’s mixing up
the syntax in Azure’s yaml config and Bash). Looking at the
code in the tell-slack-failed.yml, this one does seem to work but I
haven’t tested this so :crossed-fingers:.
changelog_begin
changelog_end
automated ghc-lib build
This PR aims at automating the build of ghc-lib. The current process
still has a few manual steps; it needs to be updated because Bintray is
going away, so this seemed like a good opportunity to fully automate it.
This works like the "patch bazel on Windows" jobs: the filename will
contain a hash of the `ci/da-ghc-lib` folder, and the job will run only
if the corresponding filename does not yet exist on the GCS bucket. PRs
aiming at changing the ghc-lib version will need to run twice: once to
create the artifacts, and once to change the `stack-snapshot.yaml` file
to match.
CHANGELOG_BEGIN
CHANGELOG_END
* Include rules_haskell revision in platform suffix
Hopefully this makes CI a bit less of a dumpsterfire. I’ve also
followed the comment and made the suffix actually 3 characters long
instead of 2 since that makes me worry less about collisions and
should hopefully still be short enough to not hit MAX_PATH.
changelog_begin
changelog_end
* Update ci/configure-bazel.sh
Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
Currently the message to Slack is always triggered by running the daily
checks. This means that it gets very noisy to:
1. Run the check on PRs affecting the check (like this one),
2. Rerun the check multiple times to ascertain that a given failure is
flaky.
With this PR, the message to Slack is replaced with a simple `echo` when
these checks are not run from the `master` branch, so whoever (manually)
triggered them can still get feedback on the result, but other people
don't get spurious `@here` mentions.
CHANGELOG_BEGIN
CHANGELOG_END
* Sort files when calculating CACHE_KEY
The order returned by `find` is unspecified and seems to have changed
for whatever reason in some cases. This changed the cache key which is
obviously not intended. It looks like the one we currently have in our
scoop manifest is the one that we get by sorting. Reversing the sort
produces the one CI currently calculates.
changelog_begin
changelog_end
* update manifest to match CI output
Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
In light of #6127, I kept wondering why rebuilding 1.1.1 would fail. The
problem addressed by #6127 is that we tried to rebuild it, which we
shouldn't, but the reason I noticed it is because the build failed, and
there is no good reason for the 1.1.1 docs to not build anymore. Looking
at the logs confused me even more as it failed with (elided):
```
docs/source/support/new-assistant.rst:
WARNING: document isn't included in any toctree
```
and that change happened _after_ 1.1.1. So I went back to the code, and
discovered I somehow had gotten confused as I changed the approach
mid-way through editing the file. If we're overwriting the
`release-notes.html` file post-build, which we are now doing (and is the
reason for ignoring it when checking checksums), then we should not be
touching the `release-notes.rst` file pre-build.
CHANGELOG_BEGIN
CHANGELOG_END
The docs cron is supposed to ignore the release-notes.html page when
checking whether a docs folder is corrupted, because we manually
override it. However, that currently doesn't work, either because the
`sed` version we are using does not support changing the delimiters, or
because no version of `sed` does and I just imagined it.
CHANGELOG_BEGIN
CHANGELOG_END
Changed by #6123, relevant part of the diff is:
```
ledger.lookupGlobalContract(ParticipantView(committers.head),
effectiveAt, acoid) match {
- case LookupOk(_, result) =>
+ case LookupOk(_, result, _) =>
cachedContract = cachedContract + (step -> result)
```
which seems benign enough.
CHANGELOG_BEGIN
CHANGELOG_END
This should be merged after #6080. This PR adds a patch (and
consequently updates the `ci/cron/perf/compare.sh` script) to apply the
same logical change as #6080 on top of the baseline commit, so our
performance comparison remains "apples to apples".
I am well aware that managing patches is not going to be a great way
forward. The rate of changes on the benchmark seems to be slow enough
that this is good enough for now, but should we change the benchmark
more often and/or want to add new benchmarks, a better approach would be
to handle the changes at the Scala level. That is:
- Create a "rest of the world" (world = Speedy, its compiler, and all of
the associated types) interface that benchmarks would depend on,
rather than depend directly on the rest of the codebase.
- Create two implementations of that interface, one that compiles
against the current state of the world, and one that compiles against
the baseline.
- Change the script to load the relevant implementation, and then run
all the benchmarks as-is, with no match necessary.
CHANGELOG_BEGIN
CHANGELOG_END
Version is taken from the env var (or defaulted to 0.0.0) at build-time.
Since those two packages are not build by default by Bazel, we need to
add the env var to the Bash step where they do get explicitly built.
Fixes#6090.
CHANGELOG_BEGIN
- sandbox and http-api fatjars will now display correct version number.
CHANGELOG_END
* Include puppeteer tests in compat tests
This PR adds the puppeteer based tests to the compatibility
tests. This also means that they are now actually compatibility
tests. Before, we only tested the SDK side.
Apart from process management being a nightmare on Windows as usually,
there are two things that might stick out here:
1. I’ve replaced the `sh_binary` wrapper by a `cc_binary`. There is a
lengthy comment explaining why. I think at the moment, we could
actually get rid of the wraper completely and add JAVA to path in
the tests that need it but at least for now, I’d like to keep it
until we are sure that we don’t need to add more to it (and then
it’s also in the git history if we do need to resurrect it).
2. These tests are duplicated now similar to the `daml ledger *`
tests. The reasoning here is different. They depend on the SDK
tarball either way so performance wise there is no reason to keep
them. However, we reference the other file in the docs which means
we cannot change it freely. What we could do is to make this
sufficiently flexible to handle both the `daml start` case and
separate `daml sandbox`/`daml json-api` processes and then we can
reference it in the docs. There is still added complexity for
Windows but that’s necessary for users as well that want to run
this on Windows so that seems unavoidable. (I should probably also
remove my snarky comments 😇) I’d like to kee it duplicated
for this PR and then we can clean it up afterwards.
changelog_begin
changelog_end
* Bump timeouts
changelog_begin
changelog_end
CI currently errors with:
```
Subprocess:
git checkout efe6545c2c
-- docs/source/support/release-notes.rst
failed with exit code 127; output:
---
---
err:
---
Previous HEAD position was 2af134c... WIP: Draft version constraint
generation (#5472)
HEAD is now at efe6545... 1.2.0-snapshot.20200520.4224.0.2af134ca
(#6040)
/bin/sh: 2: --: not found
---
```
because the line
```
latest_release_notes_sha <- shell "git log -n1 --format=%H HEAD -- LATEST"
```
will assign a string that ends in a newline, and then when we try to
construct the shell command:
```
(shell_ $ "git checkout " <> latest_sha <> " -- docs/source/support/release-notes.rst")
```
we actually get two lines for Bash to execute.
CHANGELOG_BEGIN
CHANGELOG_END
Current version yields:
```
Subprocess:
git log -n1 --format=%H master -- LATEST
failed with exit code 128; output:
---
---
err:
---
fatal: bad revision 'master'
---
```
so apparently we can't trust a CI run on master to have a master branch
defined. `HEAD` should work, though.
CHANGELOG_BEGIN
CHANGELOG_END
trigger all releases from master
The 1.1.0 release went wrong and we had to trash it and release 1.1.1
instead. This is an attempt at identifying and correcting the root
cause behind that incident.
To understand the situation, we need to know how releases worked before
1.0. We had a one-line file called `LATEST` that specifies the git SHA and
version tag for the latest release. A change to that file triggered a
release with the specified release tag, built from the source tree of
the specified commit. The `LATEST` file looked something like:
```
f050da78c9 1.0.0-snapshot.20200411.3905.0.f050da78
```
To mark a release as stable, we would change it to look like this:
```
f050da78c9 1.0.0
```
i.e. simply drop the `-snapshot...` suffix. Even though the commit (and
thus the entire source tree we build from) is the same, we would need to
rebuild almost all of our release artifacts, as they embed the version
tag in various places and ways. That worked well as long as we could
assume we were doing trunk-based development, i.e. all releases would
always come from the same (`master`) branch.
When we released 1.0, and started work on 1.1, we had a few bug reports
for 1.0 that we decided should be resolved in a point release. We
decided that the best way to handle that would be to have a branch
starting on the release commit for 1.0, and then backport patches from
`master` to that branch. We adapted our build process to also watch the
`release/1.0.x` branch and, in particular, trigger a new release build if
the `LATEST` file in that branch changed. That worked well.
The plan going forward was to keep doing regular snapshot releases from
the `master` branch, and create support, point releases ("patch" releases
in semver) from dedicated branches.
On April 30, we made a snapshot release as an RC for 1.1.0, by changing
the `LATEST` file in the `master` branch. That release was built on commit
681c862d. On May 6, we decided to take a new snapshot as the RC for
1.1.0; we changed `LATEST` in `master` to designate 7e448d81 as the new
latest release.
On May 11, we noticed an issue that broke our builds. Without going into
details, an external artifact we depend on had changed in incompatible
ways. After fixing that on `master`, we reasoned that this would also
break the build of the final 1.1.0 release if we just tried to build
7e448d81 again. But as the target release date was May 13, we did not
want to take a new snapshot after that fix, as that would have included
one more week of work in the release, and given us no time to test it.
So we did what we did for the 1.0 branch, as it had worked well: we
created a branch that forked from `master` at commit 7e448d81 and called
it `release/1.1.x`, then cherry-picked the one fix to our build process to
work around the broken download. When the time came to make the final
1.1.0 build on May 13, we naturally picked the `LATEST` file from the
`release/1.1.x` branch and dropped the `-snapshot...` suffix. Importantly,
we did not need to update the target commit to include the "broken
download" fix as, in the meantime, the internet had fixed itself, and we
thus reasoned we should go for the exact code of the RC rather than
include an unnecessary, albeit seemingly harmless, change.
Everything went well with the release process. Tests went well too. Then
we got a report that an application that worked against the latest RC
broke with the final 1.1.0. The issue was that we had built the wrong
commit: by branching off at the point of the _target_ commit for the
latest snapshot, we did not have the change to the `LATEST` file that
designated that commit as the target. So the `LATEST` file in
`release/1.1.x` was still pointing to 681c862d.
I believe the root cause for this issue is the fact that we have
scattered our release process over multiple branches, meaning there is
no linear history of what was released and we are relying on people
being able to mentally manage multiple timelines. Therefore, I propose
to fix our release process so this should not happen again by
linearizing the release process, i.e. getting back to a situation where
all releases are made from a single branch, `master`.
Because we do want to be able to release _for_ multiple release branches
(to provide backports and bugfixes), we still need some way to
accommodate that. Having a single `LATEST` file in the same format as
before would not really work well: keeping track of interleaved release
streams on a single file would not really be easier than keeping track
of multiple branches.
My proposed solution is to instead have a multiline LATEST file, so that
all the release branch "tips" can be observed at the same time, and, as
long as we take care to only advance one release branch at a time, we
can easily keep track of each of them. This is what this PR does.
This required a few changes to our release process. Most notably:
- Obviously, as this is the main point of this PR, the build process has
once again been restricted to only trigger new releases from the
`master` branch.
- As our CI machinery cannot easily be made to produce multiple releases
from a single build, the `check_for_release` step will only recognize
a commit as a release trigger if it changes a single line in the
`LATEST` file. This restriction comes in addition to the existing one
that a release commit is only allowed to change either just the
`LATEST` file or both the `LATEST` and
`docs/source/support/release-notes.rst` files.
- The docs publication process has been changed to update _all_
published versions to display the _latest_ release notes page. This
means that the release notes page will always show you all published
versions, regardless of which version of the documentation you're
looking at. This also means that interleaving release notes correctly on
that page is a manual exercise.
- As per the intention of the new process, the `LATEST` file has been
updated to contained all existing post-1.0 stable releases. It should
also include all existing snapshot releases should we have more than one
at a time (say, should we discover an issue with 1.1.1 that required us
to work on a 1.1.2).
- The `release.sh` script has been dramatically simplified as I felt it
was trying to do too much and porting its existing functionality to a
multi-line `LATEST` file would be too hard.
CHANGELOG_BEGIN
CHANGELOG_END
Note: this is beta-level software. See documentation for the precise
guarantees this does and does not come with. (Documentation does not
exist at the time of opening this PR, but should exist by the time the
first version of this gets published.)
CHANGELOG_BEGIN
- We now publish Sandbox Next as an **ALPHA** standalone jar.
- We now publish the HTTP JSON API as a standalone jar.
CHANGELOG_END
* Include Bazel patch to cache exclusive tests on Windows
This includes the patch that we already use on Linux and MacOS to fix
caching of things marked exclusive. I’ve kept in the debugging output
that I added in the last patch for now. While our workaround seems to
be working, I’d like to wait a bit longer in case the issue reappears.
changelog_begin
changelog_end
* Actually bump manifest
changelog_begin
changelog_end
* Include sources directory in the Bazel cache key
This should hopefully fix the “undeclared inclusion” errors we have
been getting daily on CI
The details are in a comment but the short summary is that
the daily cron job is running in D:\a\1 whereas jobs on the same
machine afterwards run in D:\a\2. Because absolute paths leak in some
places, this fucks things up.
changelog_begin
changelog_end
* Add debugging output to the cache
* Add debugging output to inclusion errors
This adds some more debugging output to inclusion errors in Bazel
which should hopefully help us track it down. These are the only call sites
in the Bazel source that can produce them. My suspicion is that it’s
coming from HeaderDiscovery but I’m not entirely sure what is off.
We’ll almost certainly have to add more output once we know which of
those 3 cases we hit but let’s do it step by step.
changelog_begin
changelog_end
* Bump url and hash
changelog_begin
changelog_end
This is the first part of #5700
It adds tests that build create-daml-app using `daml build` and then
run the codegen and build the UI. Contrary to our main tests these
also run on Windows. This is actually reasonably simple by first
building the typescript libraries on Linux and then downloading them
on Windows.
There are two parts that are still missing from the tests in the main
workspace:
1. Building the extra feature. This should be fairly easy to add.
2. Running the pupeeter tests. At least MacOS and Linux should be
reasonably easy. I don’t know what horrors Windows will throw at
us. This step is what actually makes this a compatibility
test. Currently it doesn’t actually launch Sandbox and the JSON API.
Since this PR is already pretty large, I’d like to tackle those things
separately.
changelog_begin
changelog_end
patch Bazel on Windows (ci setup)
We have a weird, intermittent bug on Windows where Bazel gets into a
broken state. To investigate, we need to patch Bazel to add more debug
output than present in the official distribution. This PR adds the basic
infrastructure we need to download the Bazel source code, apply a patch,
compile it, and make that binary available to the rest of the build.
This is for Windows only as we already have the ability to do similar
things on Linux and macOS through Nix.
This PR does not contain any intresting patch to Bazel, just the minimum
that we can check we are actually using the patched version.
CHANGELOG_BEGIN
CHANGELOG_END
This is temporary. It looks like the macOS nodes are dead; @nycnewman is
looking into it, but in case he doesn't fix it in time, at least we
have a backup plan so we're not completely blocked on Monday.
CHANGELOG_BEGIN
CHANGELOG_END
add default machine capability
We semi-regularly need to do work that has the potential to disrupt a
machine's local cache, rendering it broken for other streams of work.
This can include upgrading nix, upgrading Bazel, debugging caching
issues, or anything related to Windows.
Right now we do not have any good solution for these situations. We can
either not do those streams of work, or we can proceed with them and
just accept that all other builds may get affected depending on which
machine they get assigned to. Debugging broken nodes is particularly
tricky as we do not have any way to force a build to run on a given
node.
This PR aims at providing a better alternative by (ab)using an Azure
Pipelines feature called
[capabilities](https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/agents?view=azure-devops&tabs=browser#capabilities).
The idea behind capabilities is that you assign a set of tags to a
machine, and then a job can express its
[demands](https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml),
i.e. specify a set of tags machines need to have in order to run it.
Support for this is fairly badly documented. We can gather from the
documentation that a job can specify two things about a capability
(through its `demands`): that a given tag exists, and that a given tag
has an exact specified value. In particular, a job cannot specify that a
capability should _not_ be present, meaning we cannot rely on, say,
adding a "broken" tag to broken machines.
Documentation on how to set capabilities for an agent is basically
nonexistent, but [looking at the
code](https://github.com/microsoft/azure-pipelines-agent/blob/master/src/Microsoft.VisualStudio.Services.Agent/Capabilities/UserCapabilitiesProvider.cs)
indicates that they can be set by using a simple `key=value`-formatted
text file, provided we can find the right place to put this file.
This PR adds this file to our Linux, macOS and Windows node init scripts
to define an `assignment` capability and adds a demand for a `default`
value on each job. From then on, when we hit a case where we want a PR
to run on a specific node, and to prevent other PRs from running on that
node, we can manually override the capability from the Azure UI and
update the demand in the relevant YAML file in the PR.
CHANGELOG_BEGIN
CHANGELOG_END
This PR separates the "last known valid perf test" commit from the
"baseline speedy implementation" commit. It is important for the perf
test to be meaningful that the changes between those two commits are
benign, say minor API adjustments, so that the perf measurement remains
meaningful.
This also adds a check on merging to master that tells Slack if the perf
test has changed and the `test_sha` file needs updating. The Slack
message is conditional on the current commit to avoid excessive noise.
CHANGELOG_BEGIN
CHANGELOG_END
bazel configuration does two things:
It modifies .bazelrc.local and it writes a temp file.
I’ve run `git clean` on a PR. This caused the temp file to be
removed. However `.bazelrc.local` stayed since it is in
`.gitignore`. This meant that the next time the formatting check ran
the `.bazelrc.local` pointed to the temp file but the temp file was no
longer there.
changelog_begin
changelog_end
As mentioned in the comment, this is required to get DNS requests to
work. This is more important than one might realize at first:
`daml start` tries to make an HTTP request to localhost:7500 to wait
for Navigator to start up. However, in our docker image, these
requests currently fail completely since they fail to resolve
`localhost`. This stops all following steps, in particular,
`init-script` and the JSON API from starting up.
changelog_begin
changelog_end
This PR adds a simple daily job that runs the performance test on a
chosen "baseline" commit and then runs the same benchmark on latest
master. This should allow us to track overall performance improvements.
CHANGELOG_BEGIN
CHANGELOG_END
* Apply plotform_suffix on all Windows pipelines
To distinguish action keys between the compatibility and the main
workspace and avoid the "undeclared input(s)" error. We also modify the
main workspace's action cache keys to avoid poisoned cache items.
CHANGELOG_BEGIN
CHANGELOG_END
* Avoid exceeding MAX_PATH on Windows
Co-authored-by: Andreas Herrmann <andreas.herrmann@tweag.io>
* Make compat tests work on windows
This required some changes to the daml_sdk rule since the read-only
installation by the assistant breaks Bazel completely. We could only
apply those changes on Windows but I think I prefer the consistency
across platforms here over trying to stay close to how the SDK is
installed on user machines given that the SDK installation is not
something we’ve had issues with.
I’ve excluded the postgresql tests for now. I don’t expect them to be
particularly hard to fix but I’ve already spent almost 2 days on this
and having some tests run on Windows seems like a clear improvement
over running no tests on Windows :)
changelog_begin
changelog_end
* Remove todo
changelog_begin
changelog_end
I believe the compatibility check is important enough, and should fail
rarely enough, that it is worth reporting even on success. This will
mean (once we support Windows) 3 messages a day, sent while presumably
nobody is working, so the disruption should be minimal.
The issue with reporting only on failures is that, if we don't
proactively check (which we do for the state of master for different
reasons, but would likely not keep doing for a job that doesn't block
PRs), we may get into a state where it is so broken that it doesn't even
report.
CHANGELOG_BEGIN
CHANGELOG_END
I am hoping this will make the Azure output a bit more chatty. At the
moment, you don’t get any output until the job has finished which is a
bit annoying (although not really an issue).
changelog_begin
changelog_end
This PR extends the existing Linux compatibility tests to run on macOS
too. Fixes#5692.
CHANGELOG_BEGIN
CHANGELOG_END
Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>