add default machine capability
We semi-regularly need to do work that has the potential to disrupt a
machine's local cache, rendering it broken for other streams of work.
This can include upgrading nix, upgrading Bazel, debugging caching
issues, or anything related to Windows.
Right now we do not have any good solution for these situations. We can
either not do those streams of work, or we can proceed with them and
just accept that all other builds may get affected depending on which
machine they get assigned to. Debugging broken nodes is particularly
tricky as we do not have any way to force a build to run on a given
node.
This PR aims at providing a better alternative by (ab)using an Azure
Pipelines feature called
[capabilities](https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/agents?view=azure-devops&tabs=browser#capabilities).
The idea behind capabilities is that you assign a set of tags to a
machine, and then a job can express its
[demands](https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml),
i.e. specify a set of tags machines need to have in order to run it.
Support for this is fairly badly documented. We can gather from the
documentation that a job can specify two things about a capability
(through its `demands`): that a given tag exists, and that a given tag
has an exact specified value. In particular, a job cannot specify that a
capability should _not_ be present, meaning we cannot rely on, say,
adding a "broken" tag to broken machines.
Documentation on how to set capabilities for an agent is basically
nonexistent, but [looking at the
code](https://github.com/microsoft/azure-pipelines-agent/blob/master/src/Microsoft.VisualStudio.Services.Agent/Capabilities/UserCapabilitiesProvider.cs)
indicates that they can be set by using a simple `key=value`-formatted
text file, provided we can find the right place to put this file.
This PR adds this file to our Linux, macOS and Windows node init scripts
to define an `assignment` capability and adds a demand for a `default`
value on each job. From then on, when we hit a case where we want a PR
to run on a specific node, and to prevent other PRs from running on that
node, we can manually override the capability from the Azure UI and
update the demand in the relevant YAML file in the PR.
CHANGELOG_BEGIN
CHANGELOG_END
With the current setup, we always push whatever version GitHub considers
as the latest, which is defined by date. This means that at the moment a
patch release could overwrite a less recent but higher-version release,
essentially downgrading the SDK to a previous, presumably less good user
experience.
This patches the upload process to choose the highest-numbered release
instead of the most recent one by date.
CHANGELOG_BEGIN
CHANGELOG_END
* Publish docker images for snapshot releases as well
We got a couple of requests for this and given the current release
process this seems very reasonable. Snapshot releases are clearly
marked as such so I think there this is fine.
We could think about differentiating between a stable release version
marked as a prerelease and not publish the docker image for that but
given our current release process where the stable version is based on
an already existing snapshot and therefore testing is very unlikely to
fail when we mark it as stable, that doesn’t seem worth the complexity
here.
changelog_begin
changelog_end
* Update azure-cron.yml
Co-Authored-By: Samir Talwar <samir.talwar@digitalasset.com>
Co-authored-by: Samir Talwar <samir.talwar@digitalasset.com>
One of the outputs of our brainstorming about how to make CI better was
that it is annoying to have to "babysit" pull requests. This PR attempts
to introduce a notification mechanism by which Azure will notify people
on Slack when a build finishes, so they know they need to go and rerun
or merge the corresponding PR.
This commit also changes the existing $Slack.URL variable to
$Slack.team-daml, to make more explicit where the Slack message is being
sent to (Slack works with one token per destination channel). Both
$Slack.URL and $Slack.team-daml are currently defined as the same token
in Azure.
CHANGELOG_BEGIN
CHANGELOG_END
Currently, when cron jobs fail, this is completely silent. After #4178,
in which we realized the VSCode extension had not been updated for about
a month, I believe it may be a good idea to get error messages sent to
Slack when a cron job fails.
We'll need to monitor for verbosity, but hopefully they will work most
of the time.
CHANGELOG_BEGIN
CHANGELOG_END
This has changed in 0a26591849 which
broke the build. We only build the latest release so changing this
should be safe and not break older releases.
changelog_begin
changelog_end
changelog_begin
- [DAML SDK] Docker images for this release and releases in the future
are built using the Dockerfile of the corresponding git tag and are
therefore stable. Previously, they were updated whenever the
Dockerfile changed.
changelog_end
This commit aims at mitigating two issues we have noticed with the
0.13.41 release:
1. The initial cron run for that release got interrupted at the 50
minutes mark, which happened to be right in the middle of the s3 upload.
This means it had already changed the versions.json file, but had not
finished updating the actual html files. Right now, the docs.daml.com
website shows version 0.13.41 in the drop-down, but actually displays
the content for 0.13.40. Additionally, trying to explicitly visit the
website for 0.13.41 (https://docs.daml.com/0.13.41) yields a 404. Note
that this also means the cron job did not reach the "tell HubSpot"
point, so 0.13.41 did not get announced.
2. As the script also did not reach the "clear cache" step, subsequent
runs have been rebuilding the documentation for no reason as the
sequence of steps was: check versions.json through HTTP, get cached one,
see it's not up-to-date, build docs, check versions.json through s3 API,
bypassing the cache, see it's up-to-date, stop.
To address those issues, this PR changes the cron to:
1. Increase the timeout to 2h instead of 50 minutes.
2. Always check the versions.json file through s3, rather than go
through the HTTP cache first.
These are not complete solutions but I'm not sure how to do better given
that s3 does not have atomic operations.
* webide: build webide image when sdk releases
* add scripts which check the latest version of sdk. If webide docker
image version does not exist or is older than the sdk version, it will
kick off a build of the webide docker image
* add job to azure cron
* webide: minor response to review
There is no simple way to configure GCS to serve the desired security
headers, so instead the script will keep updating the existing s3
bucket.
Consequent changes:
- Add aws cli tool to dev-env
- Remove docs bucket from Terraform
- Set reasonable timeout. We want this to run once per hour, so it
should never run for longer than that.
- Fix gcp deploy command to target the right directory. That somehow got
lost in the manual changes between the PR and what I was manually
running on CI.
- Keep the not-found page intact, so as to not break 404 responses.
- Retry up to ten times to install nix packages, for each version. Nix
package download can be a bit flaky, especially when we download large
collections of packages (e.g. texlive), and Bazel is very unforgiving
towards flaky nix downloads so instead we now preload all the packages
before invokin Bazel.