This script was supposed to remove old agents from the Azure Pipelines
UI. It may have been useful at some point (notably, when we used
ephemeral instances, they did not necessarily get to run their shutdown
script), but as it stands now, it's broken. The output from that step
ends in:
```
error: 2 derivations need to be built, but neither local builds ('--max-jobs') nor remote builds ('--builders') are enabled
```
after listing the nix packages it would build. Furthermore, it no
longer seems useful: I have not seen any spurious entries in the agents
list on Azure since we switched to permanent nodes, on either the Linux
or Windows side (and this would only run on Linux, if it ran at all).
I'm also not convinced it ever ran, as I used to see a lot of spurious
machines on both Linux and Windows when we did use ephemeral instances.
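For context, the quoted error is what nix emits when it is allowed
neither to build locally nor to dispatch to remote builders; a minimal
sketch of that failure mode (the file name is a placeholder):
```
# With local builds disabled (-j0) and no remote builders configured,
# nix can only substitute from binary caches, so any derivation that
# actually needs building dies with the error quoted above.
nix-build -j0 --option builders '' default.nix
```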
CHANGELOG_BEGIN
CHANGELOG_END
This is the second PR in the plan outlined in #6405. I have already
disabled the old nodes so no new job will get started there; I will,
however, wait until I've seen a few successful builds on the new nodes
before pulling the plug.
CHANGELOG_BEGIN
CHANGELOG_END
It looks like some nix update has broken our current Terraform setup.
The Google provider plugin has changed its reported version to 0.0.0;
poking at my local nix store seems to indicate we actually get 3.15, but
🤷.
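For the record, a rough sketch of that poking (the output and store
contents are from my machine and purely illustrative):
```
# Terraform reports the version of the provider it loaded...
terraform version   # lists the google provider as v0.0.0 here
# ...while the nix store seems to contain a 3.15 build of it.
ls /nix/store | grep terraform-provider-google
```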
This PR also reverts the infra part of #6400 so we get back to master ==
reality.
CHANGELOG_BEGIN
CHANGELOG_END
Nix now requires `-L`; I've gone ahead and normalized everything to
use `-sfL`, which we were already using in one place.
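For illustration, the normalized shape of these calls, using the nix
installer as the example (which is the download that now redirects):
```
# -s: silent, -f: fail on HTTP errors rather than saving the error page,
# -L: follow redirects, now required for the nix download.
curl -sfL https://nixos.org/nix/install | sh
```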
CHANGELOG_BEGIN
CHANGELOG_END
add default machine capability
We semi-regularly need to do work that has the potential to disrupt a
machine's local cache, rendering it broken for other streams of work.
This can include upgrading nix, upgrading Bazel, debugging caching
issues, or anything related to Windows.
Right now we do not have any good solution for these situations. We can
either not do those streams of work, or we can proceed with them and
just accept that all other builds may get affected depending on which
machine they get assigned to. Debugging broken nodes is particularly
tricky as we do not have any way to force a build to run on a given
node.
This PR aims at providing a better alternative by (ab)using an Azure
Pipelines feature called
[capabilities](https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/agents?view=azure-devops&tabs=browser#capabilities).
The idea behind capabilities is that you assign a set of tags to a
machine, and then a job can express its
[demands](https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml),
i.e. specify a set of tags machines need to have in order to run it.
Support for this is fairly badly documented. We can gather from the
documentation that a job can specify two things about a capability
(through its `demands`): that a given tag exists, and that a given tag
has an exact specified value. In particular, a job cannot specify that a
capability should _not_ be present, meaning we cannot rely on, say,
adding a "broken" tag to broken machines.
Documentation on how to set capabilities for an agent is basically
nonexistent, but [looking at the
code](https://github.com/microsoft/azure-pipelines-agent/blob/master/src/Microsoft.VisualStudio.Services.Agent/Capabilities/UserCapabilitiesProvider.cs)
indicates that they can be set by using a simple `key=value`-formatted
text file, provided we can find the right place to put this file.
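As a sketch, here is roughly what that looks like; the file name, its
location, and the $AGENT_HOME variable are assumptions gleaned from the
agent source, not documented behavior:
```
# Hypothetical: register a user capability by dropping a key=value file
# where the agent picks it up (path and file name inferred from the
# linked agent code).
cat > "$AGENT_HOME/.capabilities" <<EOF
assignment=default
EOF
# A job then targets such machines through its demands, e.g. in YAML:
#   demands:
#     - assignment -equals default
```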
This PR adds this file to our Linux, macOS and Windows node init scripts
to define an `assignment` capability and adds a demand for a `default`
value on each job. From then on, when we hit a case where we want a PR
to run on a specific node, and to prevent other PRs from running on that
node, we can manually override the capability from the Azure UI and
update the demand in the relevant YAML file in the PR.
CHANGELOG_BEGIN
CHANGELOG_END
There are two issues with the current setup:
- an iptables entry prevents connecting to the metadata server (see the
  sketch after this list), and
- machines are given insufficient permissions.
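For the first point, a minimal sketch of the kind of fix involved (the
actual rule in our setup may differ):
```
# The GCE metadata server lives at 169.254.169.254; insert an ACCEPT
# rule ahead of whatever entry currently drops that traffic.
iptables -I OUTPUT -d 169.254.169.254 -j ACCEPT
```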
It looks like the curl command is currently installing but not starting the service that is supposed to send logs to StackDriver. When connecting to the machines manually, a call to `restart` seems to fix it.
* remove the -O option from the curl command in order to pipe the script contents to bash (see the sketch below)
* follow redirects for the StackDriver install script
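A sketch of the resulting invocation; the script URL and the service
name are assumptions about the StackDriver agent setup:
```
# Pipe the install script straight to bash (no -O), following redirects,
# then restart the service, since installation alone appears to leave it
# stopped.
curl -sfL https://dl.google.com/cloudagents/install-logging-agent.sh | bash
service google-fluentd restart
```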
Co-Authored-By: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
* ci: always use the linux-pool
reduce the environment differences between external and internal
contributions
* infra: tweak the linux cache warmup script
Don't share the same bazel cache directory with the disk cache, which is
something else. Be more specific about the target. Clean up after
yourself.
* infra: bump the linux agent disk to 200GB
avoid running out of disk space
Warm up local caches by building dev-env and the current daml master.
This is allowed to fail, as we still want to have CI machines around
even when their caches are only warmed up halfway.
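A sketch of what this warm-up step can look like; the repository URL is
real, but the dev-env entry point and the Bazel target are assumptions:
```
# Populate the local nix and Bazel caches; '|| true' keeps the machine
# in the pool even when the warm-up only gets halfway.
git clone https://github.com/digital-asset/daml.git
cd daml
./dev-env/bin/dade-assist || true   # hypothetical dev-env bootstrap
bazel build //... || true
```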
Afterwards, we purge old agents that might still be around because they
didn't unregister themselves.
This depends on #402 being merged, as otherwise purge_old_agents.py
obviously can't be found.
* nix: add more providers to terraform
* docs: make tarballs more reproducible
* ci: use the linux-pool pool
* ci: tweak the nix installation
handle the case where the user is root and on ubuntu
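A sketch of the kind of special-casing meant here (the user name and
exact flow are illustrative, not the actual change):
```
# The standard nix installer refuses to run as root, so on Ubuntu images
# where the init script runs as root, install as a dedicated user.
if [ "$(id -u)" -eq 0 ]; then
  useradd -m -s /bin/bash vsts
  su vsts -c 'curl -sfL https://nixos.org/nix/install | sh'
else
  curl -sfL https://nixos.org/nix/install | sh
fi
```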
* infra: terraform fmt
* infra: add Azure Pipeline agents
* ci: only enable linux-pool for internal PRs