#12645 added a new pipeline, so we need to add a corresponding
entry.
As for #12645, the content of the files and the directory structure are
taken directly from a live CI node, as printed by the (now-named)
`workdirs` step.
CHANGELOG_BEGIN
CHANGELOG_END
We've been using an old version of Terraform for a long time now. The
main blocker used to be that there was no post-0.12 version of `secret`,
but that has now been resolved: there's a new fork, with new maintainers
(blessed by the original one and accepted by the Terraform registry)
[here].
I'll be upgrading one version at a time as 0.x versions are considered
major (and thus potentially breaking).
[here]: https://github.com/numtide/terraform-provider-secret
See https://github.com/digital-asset/daml/pull/12670 for details.
CHANGELOG_BEGIN
CHANGELOG_END
The Bazel cache on Windows includes absolute paths. The normal process
for Azure is to dynamically allocate new top-level folders for each new
build that runs on a given machine. The result of that is that we get
about a one in three chance to get caching for any single Windows build
(it's actually not _quite_ that because we don't run different builds an
equal number of times).
This PR is an attempt at pinning the folder-to-job mapping by mucking
around in [Azure internals], which may or may not have bad consequences
down the line, assuming it works at all.
[Azure internals]: https://github.com/microsoft/azure-pipelines-agent/blob/master/docs/jobdirectories.md
CHANGELOG_BEGIN
CHANGELOG_END
Our Windows CI nodes seem completely overwhelmed today, with typical
wait times above half an hour before jobs even start. This isn't fun, so
I'd like to double our capacity for a few hours.
CHANGELOG_BEGIN
CHANGELOG_END
When no service account is explicitly selected, GCP provides a default
one, which happens to have way more access rights than we're comfortable
with. I'm not quite sure how the total lack of a service account slipped
through here, but I've noticed today so I'm changing it.
CHANGELOG_BEGIN
CHANGELOG_END
As the title suggests. We already disable all communication between CI
nodes through network rules, but we currently get a lot of noise from
GCP logging violations of those rules from Windows trying to feel its
way out for file share buddies.
CHANGELOG_BEGIN
CHANGELOG_END
As usual, this branch will contain intermediate commits that may serve
as an audit log of sorts.
New year, new copyright, new expected unknown issues with various files
that won't be covered by the script and/or will be but shouldn't change.
I'll do the details on Jan 1, but would appreciate this being
preapproved so I can actually get it merged by then.
CHANGELOG_BEGIN
CHANGELOG_END
This bumps dotnet to the version required by the latest azuresigntool,
and pins azuresigntool for the future.
As usual for live CI upgrades, this will be rolled out using the
blue/green approach. I'll keep each deployed commit in this PR.
For future reference, this is PR [#10979].
[#10979]: https://github.com/digital-asset/daml/pull/10979
CHANGELOG_BEGIN
CHANGELOG_END
* ci/windows: disable spool
We're not expecting to print anything, and @RPS5' security newsletter
says this is a vector of attack.
CHANGELOG_BEGIN
CHANGELOG_END
* increase no-spool to 6
* Windows name truncation causing collisions
* update main group
* remove temp group
This morning we started with very restricted CI pools (2/6 for Windows
and 7/20 for Linux), apparently because the region we run in (us-east1)
has three zones, two of which were unable to allocate new nodes, and the
default policy is to distribute nodes evenly between zones.
I've manually changed the distribution policy. Unfortunately this option
is not yet available in our version of the GCP Terraform plugin.
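For illustration, the numbers above are consistent with a plain even split
across three zones where only one zone can allocate. This is a minimal
sketch, not how GCP actually implements its distribution policy: the zone
names, which zone stayed healthy, and how the remainder is assigned are all
assumptions.

```python
def spread_evenly(target_size, zones):
    """Split target_size across zones as evenly as possible.

    Assumption: any remainder goes to the first zones in the list.
    """
    base, extra = divmod(target_size, len(zones))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def running_nodes(target_size, zones, healthy):
    """Count nodes that actually start when only `healthy` zones can allocate."""
    per_zone = spread_evenly(target_size, zones)
    return sum(count for zone, count in per_zone.items() if zone in healthy)


# Illustrative zone names; the source only says us-east1 has three zones.
ZONES = ["us-east1-b", "us-east1-c", "us-east1-d"]
HEALTHY = {"us-east1-b"}  # assumption: a single zone could still allocate

windows = running_nodes(6, ZONES, HEALTHY)
linux = running_nodes(20, ZONES, HEALTHY)
print(f"Windows: {windows}/6, Linux: {linux}/20")
```

Under those assumptions the sketch gives 2/6 and 7/20, matching what we saw.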
CHANGELOG_BEGIN
CHANGELOG_END
* fixup terraform config
Two changes have happened recently that have invalidated the current
Terraform files:
1. The Terraform version has gone through a major, incompatible upgrade
(#8190); the required updates for this are reflected in the first
commit of this PR.
2. The certificate used to serve [Hoogle](https://hoogle.daml.com) was
about to expire, so Edward created a new one and updated the config
directly. The second commit in this PR updates the Terraform config
to match that new, already-in-prod setting.
Note: This PR applies cleanly, as there are no resulting changes in
Terraform's perception of the target state from 1, and the change from 2
has already been applied through other channels.
CHANGELOG_BEGIN
CHANGELOG_END
* update hoogle cert
switch CI nodes from n1-standard-8 to c2-*
A while back (#4520), I did a bunch of performance tests when trying to
size up the requirements for the hosted macOS nodes we needed to buy. As
part of that testing, it looked like `c2-standard-8` nodes were faster
(full build down from ~95 to ~75 minutes) and marginally cheaper
($0.4176 vs $0.4280) than the `n1-standard-8` we are currently using.
Then I got distracted, and I forgot to upgrade our existing machines.
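As a rough sanity check, the figures above can be turned into a cost per
full build. This is a sketch only: it assumes the quoted prices are per node
per hour (the numbers above do not state the unit) and that a build occupies
one node for its whole duration.

```python
def cost_per_build(minutes, hourly_rate):
    """Cost of one full build on a single node, assuming per-hour pricing."""
    return minutes / 60 * hourly_rate


# Figures quoted above; the per-hour unit is an assumption.
n1_cost = cost_per_build(95, 0.4280)  # n1-standard-8, ~95 minute build
c2_cost = cost_per_build(75, 0.4176)  # c2-standard-8, ~75 minute build

saving = 1 - c2_cost / n1_cost
print(f"n1-standard-8: ${n1_cost:.4f} per build")
print(f"c2-standard-8: ${c2_cost:.4f} per build")
print(f"relative saving: {saving:.1%}")
```

Under that assumption, a full build goes from roughly $0.68 to $0.52, i.e.
most of the saving comes from the shorter build time rather than the lower
price.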
CHANGELOG_BEGIN
CHANGELOG_END
Keeping CI working on Windows involves a constant fight against
MAX_PATH, which is a very short 260 characters. As the username appears
in some paths, sometimes multiple times, we can save a few precious
characters by having it shorter.
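To make the arithmetic concrete, here is a small sketch of how many
characters a shorter username buys back. The path template and both
usernames are hypothetical, chosen only to show a username appearing twice
in one path; they are not our actual paths or accounts.

```python
MAX_PATH = 260  # Windows limit mentioned above


def headroom(path):
    """Characters left before a path hits MAX_PATH."""
    return MAX_PATH - len(path)


def chars_saved(template, old_user, new_user):
    """Characters saved in one path by switching usernames."""
    return len(template.format(user=old_user)) - len(template.format(user=new_user))


# Hypothetical path in which the username appears twice.
TEMPLATE = r"C:\Users\{user}\AppData\Local\Temp\{user}-build\out"

saved = chars_saved(TEMPLATE, "buildagent", "u")
print(f"saved {saved} characters")
print(f"headroom with short name: {headroom(TEMPLATE.format(user='u'))}")
```

The saving scales with the number of times the username appears in the
path, which is why even a few characters matter.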
CHANGELOG_BEGIN
CHANGELOG_END
add default machine capability
We semi-regularly need to do work that has the potential to disrupt a
machine's local cache, rendering it broken for other streams of work.
This can include upgrading nix, upgrading Bazel, debugging caching
issues, or anything related to Windows.
Right now we do not have any good solution for these situations. We can
either not do those streams of work, or we can proceed with them and
just accept that all other builds may get affected depending on which
machine they get assigned to. Debugging broken nodes is particularly
tricky as we do not have any way to force a build to run on a given
node.
This PR aims at providing a better alternative by (ab)using an Azure
Pipelines feature called
[capabilities](https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/agents?view=azure-devops&tabs=browser#capabilities).
The idea behind capabilities is that you assign a set of tags to a
machine, and then a job can express its
[demands](https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml),
i.e. specify a set of tags machines need to have in order to run it.
Support for this is fairly badly documented. We can gather from the
documentation that a job can specify two things about a capability
(through its `demands`): that a given tag exists, and that a given tag
has an exact specified value. In particular, a job cannot specify that a
capability should _not_ be present, meaning we cannot rely on, say,
adding a "broken" tag to broken machines.
Documentation on how to set capabilities for an agent is basically
nonexistent, but [looking at the
code](https://github.com/microsoft/azure-pipelines-agent/blob/master/src/Microsoft.VisualStudio.Services.Agent/Capabilities/UserCapabilitiesProvider.cs)
indicates that they can be set by using a simple `key=value`-formatted
text file, provided we can find the right place to put this file.
This PR adds this file to our Linux, macOS and Windows node init scripts
to define an `assignment` capability and adds a demand for a `default`
value on each job. From then on, when we hit a case where we want a PR
to run on a specific node, and to prevent other PRs from running on that
node, we can manually override the capability from the Azure UI and
update the demand in the relevant YAML file in the PR.
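The matching described above can be sketched as follows. This is an
illustrative model, not the Azure agent's code: `parse_capabilities` and
`satisfies` are hypothetical helpers, the `debug-pr` value is made up, and
details such as case handling are ignored. It only models the two demand
forms mentioned above (tag exists, tag equals a value).

```python
def parse_capabilities(text):
    """Parse a simple key=value capabilities file into a dict."""
    capabilities = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition("=")
        capabilities[key.strip()] = value.strip()
    return capabilities


def satisfies(capabilities, demands):
    """Return True if a machine's capabilities meet every demand.

    A demand is either "name" (tag must exist) or
    "name -equals value" (tag must have exactly that value).
    """
    for demand in demands:
        name, sep, value = demand.partition(" -equals ")
        name = name.strip()
        if name not in capabilities:
            return False
        if sep and capabilities[name] != value.strip():
            return False
    return True


default_node = parse_capabilities("assignment=default\n")
pinned_node = parse_capabilities("assignment=debug-pr\n")  # manual override

normal_job = ["assignment -equals default"]
debug_job = ["assignment -equals debug-pr"]

print(satisfies(default_node, normal_job))  # regular jobs land on default nodes
print(satisfies(pinned_node, normal_job))   # and stay off the pinned node
print(satisfies(pinned_node, debug_job))    # while the debug PR targets it
```

Because demands can only require presence or equality, isolating a node
means changing its `assignment` value rather than tagging it as broken.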
CHANGELOG_BEGIN
CHANGELOG_END
It looks like the change in Windows agent names has caused an issue:
because Windows agents are not always properly cleaned up on shutdown,
i.e. they do not always have time to tell Azure they are going away, and
because GCP likes to reuse the same names for machines in a group, we've
been seeing errors like:
```
ERROR: The running command stopped because the preference variable
"ErrorActionPreference" or common parameter is set to Stop: Pool 11
already contains an agent with name VSTS-WIN-3QCX.
```
recently. Today, only 2 out of our 6 agents have managed to register
with Azure. This PR should fix that.
CHANGELOG_BEGIN
CHANGELOG_END
This is a small QoL improvement, mostly targeted at myself: have Windows
agents register with Azure using the name they display on the GCP
console, so I don't need to find a build and look at the "Agent
Diagnostics" step to figure out the correspondence between Azure and GCP.
CHANGELOG_BEGIN
CHANGELOG_END
This is an attempt to apply a potential fix discovered as part of the
investigation in #4370. The issue seems to be that Chocolatey is using a
protocol deemed not secure enough and disabled in recent Windows images
(our node creation script dynamically selects the latest "Windows 2016"
server image from GCP).
CHANGELOG_BEGIN
CHANGELOG_END