The daily restart is now working, so I think we can switch over. I'm
keeping the GCP configuration around for the time being in case we need
to roll back for some reason; if everything goes smoothly I'll remove
all of it in a month or so.
I've seen a few "disk full" errors recently on CI, and they don't quite
make sense because we do have this "reset cache" step that looks at
available disk space and cleans up if needed.
So I dug a little bit more and found this discrepancy: the real hard
drives are 400g, the virtual disks are 200g. So what may happen is we
still have plenty of space on the real drives (well, they're virtual,
but bear with me), while the virtual ones are full. And since the
clean-up step only looks at the free space on the real drives, it goes
ahead without cleaning up and then when we try to download stuff there's
no free space on the virtual drives.
This didn't use to be a problem because the size of the virtual drives
mapped to the size of the real ones. This PR aims at restoring that
equivalence.
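To illustrate the failure mode (this is a sketch, not the actual "reset cache" step), a check along these lines would look at every relevant filesystem rather than just one. The helper names and the idea of passing the mount points as arguments are assumptions made for the example.

```shell
# Hypothetical helper: free space, in whole GiB, on the filesystem holding a path.
# `df -Pk` prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
free_gib() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeeds (exit 0) when at least one of the given paths has less free
# space than the threshold, i.e. when a clean-up would be warranted.
needs_cleanup() {
  threshold_gib="$1"
  shift
  for path in "$@"; do
    if [ "$(free_gib "$path")" -lt "$threshold_gib" ]; then
      return 0
    fi
  done
  return 1
}
```

With this shape, something like `needs_cleanup 50 / /some/other/mount` (both paths are placeholders) succeeds as soon as either filesystem drops below 50 GiB, instead of trusting a single one.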
When we set this up years ago (#1566), it was a way to get a more recent
Docker version than was then available through the default Ubuntu 16.04
apt repository. Nowadays, this actually makes us lag behind, to the
point where the 2.3.1 image isn't building.
CHANGELOG_BEGIN
CHANGELOG_END
Running the `docker` command on today's Ubuntu images crashes the
kernel. (Which is super reassuring from a security pov.)
CHANGELOG_BEGIN
CHANGELOG_END
Process:
- `git ls-files -z | xargs -0 -n 100 sed -i --follow-symlinks 's/DAML/Daml/g'`
- `git add -p`
- `git restore -p`
- Check there is no unstaged change left.
To review:
- Check for false positives by carefully reviewing the diff in this PR.
- Check for false negatives with `git grep DAML`.
- Quicker check for false positives:
```
git grep DAML | grep -v migration | grep -v DAML_
```
Fixes #13190
Note: This is the "second half" of #13191, which failed to cover all the
remaining DAMLs because of:
```
$ git ls-files | grep "'"
compiler/damlc/tests/daml-test-files/MangledScenario'.daml
```
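For context, a sketch of the likely failure, assuming the first pass fed newline-delimited file names to `xargs`: without `-0`, `xargs` treats quotes in its input as special, so a lone `'` in a file name makes it fail, while the NUL-delimited form passes names through verbatim. The function names and the temporary directory are made up for the demonstration.

```shell
# Newline-delimited input: xargs interprets quotes, so a lone ' makes it fail.
list_newline_delimited() {
  (cd "$1" && ls | xargs -n 100 ls)
}

# NUL-delimited input: file names are passed through verbatim.
list_nul_delimited() {
  (cd "$1" && find . -type f -print0 | xargs -0 -n 100 ls)
}

# Throwaway directory with one quoted and one plain file name.
demo_dir="$(mktemp -d)"
touch "$demo_dir/MangledScenario'.daml" "$demo_dir/Plain.daml"

list_newline_delimited "$demo_dir" || echo "newline-delimited xargs failed"
list_nul_delimited "$demo_dir"
```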
CHANGELOG_BEGIN
CHANGELOG_END
The cluster shuts down about once every two weeks and takes a couple
hours to get back up. It's been off for a few days now and, as far
as I'm aware, nobody has noticed.
My personal assessment is that this is costing us more in maintenance
(not to mention running) costs than what we're getting out of it.
CHANGELOG_BEGIN
CHANGELOG_END
However that happened, we were stuck with Nix 2.3.15 (or 2.3.16 in some
cases) on our macOS nodes. This PR is a minor edit to the Nix
initialization commands to switch from 2.4 to "latest", but I will also
use it to record the changes I just made manually to the cluster.
The cluster is currently composed of two parts:
- 7 machines running Catalina (10.15.7).
- 1 machine running Monterey (12.2).
Unfortunately, they use different setups. The Catalina ones are described
by the state of the repo (in theory, though keeping them in sync is
manual); in order to update those, I have:
1. Taken one node off the CI pool (`builder1epjj7`).
2. On that node, run the following commands:
```
cd ~/daml/infra/macos/3-running-box
vagrant destroy -f
rm ~/images/*
vagrant box remove macinbox
vagrant box remove azure-ci-node
rm -r ~/.vagrant.d/boxes/macinbox-06032020.tar.gz
softwareupdate -d --fetch-full-installer --full-installer-version 10.15.7
cd ~/daml/infra/macos/1-create-box
sudo macinbox --box-format vmware_desktop --disk 250 --memory 32768 --cpu 10 --user-script user-script.sh
cd ../2-common-box
vagrant up
vagrant package --output ~/images/initialized-$(date +%Y%m%d).box
vagrant destroy -f
cd
./run-agent.sh
```
This leaves us with that node running an updated box. The new box is
in `~/images/initialized-$(date +%Y%m%d).box`.
3. Sent that file to all the other nodes with `scp`.
4. Rebooted all the nodes (after deactivating them & waiting for jobs to
finish).
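Step 3 can be sketched as a small loop. The node names are placeholders, the remote `images/` directory is assumed to mirror `~/images` on each node, and the function defaults to a dry run that only prints the commands.

```shell
# Hypothetical helper: copy a box file to each node given as an argument.
# With DRY_RUN unset or set to 1, only print the scp commands.
distribute_box() {
  box="$1"
  shift
  for node in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "scp $box $node:images/"
    else
      scp "$box" "$node:images/"
    fi
  done
}

# Placeholder node names; the real list is not part of this description.
distribute_box "$HOME/images/initialized-$(date +%Y%m%d).box" node-a node-b
```

Setting `DRY_RUN=0` would run the copies for real, once the printed commands look right.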
For the Monterey node, images (steps 1 and 2 in this repo) are currently
created by @nycnewman on another machine I don't have access to, so I
took a slightly different approach: I took the existing image, started
it from the `3-running-box` folder as usual, manually updated Nix there,
then repackaged that.
CHANGELOG_BEGIN
CHANGELOG_END
Goals:
- Reflect manual changes from #12996 in Terraform.
- Reflect manual changes from #12997 in Terraform.
- Update plugins to work with #12926.
- Keep running services working through the changes.
Details in commits.
CHANGELOG_BEGIN
CHANGELOG_END
Since #12645, we added a new pipeline, so we need to add a corresponding
entry.
As with #12645, the content of the files and the directory structure are
taken directly from a live CI node, as printed by the (now-named)
`workdirs` step.
CHANGELOG_BEGIN
CHANGELOG_END
We've been using an old version of Terraform for a long time now. The
main blocker used to be that there was no post-0.12 version of `secret`,
but that has now been resolved: there's a new fork, with new maintainers
(blessed by the original one and accepted by the Terraform registry)
[here].
I'll be upgrading one version at a time as 0.x versions are considered
major (and thus potentially breaking).
[here]: https://github.com/numtide/terraform-provider-secret
See https://github.com/digital-asset/daml/pull/12670 for details.
CHANGELOG_BEGIN
CHANGELOG_END
The Bazel cache on Windows includes absolute paths. The normal process
for Azure is to dynamically allocate new top-level folders for each new
build that runs on a given machine. The result of that is that we get
about a one in three chance to get caching for any single Windows build
(it's actually not _quite_ that because we don't run different builds an
equal number of times).
This PR is an attempt at pinning the folder to job mapping by mucking
around in [Azure internals], which may or may not have bad consequences
down the line, assuming it works at all.
[Azure internals]: https://github.com/microsoft/azure-pipelines-agent/blob/master/docs/jobdirectories.md
CHANGELOG_BEGIN
CHANGELOG_END
Our Windows CI nodes seem completely overwhelmed today, with typical
wait times above half an hour before jobs even start. This isn't fun, so
I'd like to double our capacity for a few hours.
CHANGELOG_BEGIN
CHANGELOG_END
When no service account is explicitly selected, GCP provides a default
one, which happens to have way more access rights than we're comfortable
with. I'm not quite sure how the total lack of a service account slipped
through here, but I noticed it today, so I'm changing it.
CHANGELOG_BEGIN
CHANGELOG_END
As the title suggests. We already disable all communication between CI
nodes through network rules, but we currently get a lot of noise from
GCP logging violations of those rules from Windows trying to feel its
way out for file share buddies.
CHANGELOG_BEGIN
CHANGELOG_END
As usual, this branch will contain intermediate commits that may serve
as an audit log of sorts.
Our Terraform configuration has been slightly broken by two recent
changes:
- The nixpkgs upgrade in #12280 means a new version of our GCP plugin
for Terraform, which as a breaking change added a required argument to
`google_project_iam_member`. The new version also results in a number
of smaller changes in the way Terraform handles default arguments, which
doesn't result in any changes to our configuration files or to the
behaviour of our deployed infrastructure, but does require re-syncing
the Terraform state (by running `terraform apply`, which would
essentially be a no-op if it were not for the next bullet point).
- The nix configuration changes in #12265 have changed the Linux CI
nodes configuration but have not been deployed yet.
This PR is an audit log of the steps taken to rectify those and bring us
back to a state where our deployed configuration and our recorded
Terraform state both agree with our current `main` branch tip.
CHANGELOG_BEGIN
CHANGELOG_END
* devenv: Factor out a function to get the Nix version.
* devenv: On newer versions of Nix, use `NIX_USER_CONF_FILES`.
This stacks, rather than overwrites.
* devenv: Append Nix caches, instead of overwriting them.
The "extra-" prefix tells Nix to append.
We also switch to non-deprecated configuration keys.
CHANGELOG_BEGIN
CHANGELOG_END
* devenv: Just require Nix v2.4 or newer.
* devenv: `NIX_USER_CONF_FILES` may not be set already.
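A rough sketch of the ideas in these commits, not the actual `devenv` code: the function names, the cache URL and the key are placeholders. `extra-substituters` and `extra-trusted-public-keys` are the appending forms of the current (non-deprecated) `substituters` and `trusted-public-keys` settings.

```shell
# True when version $1 (e.g. "2.3.15") is at least major.minor $2 (e.g. "2.4").
# Assumes both start with plain numeric "major.minor" components.
version_at_least() {
  have_major="${1%%.*}"; have_rest="${1#*.}"; have_minor="${have_rest%%.*}"
  want_major="${2%%.*}"; want_rest="${2#*.}"; want_minor="${want_rest%%.*}"
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Write a config fragment that appends a cache instead of replacing the defaults.
write_extra_cache_conf() {
  conf_file="$1"; cache_url="$2"; cache_key="$3"
  {
    echo "extra-substituters = $cache_url"
    echo "extra-trusted-public-keys = $cache_key"
  } > "$conf_file"
}

# Add a file to NIX_USER_CONF_FILES, which may be unset (hence the :- default).
add_nix_user_conf() {
  if [ -n "${NIX_USER_CONF_FILES:-}" ]; then
    NIX_USER_CONF_FILES="$NIX_USER_CONF_FILES:$1"
  else
    NIX_USER_CONF_FILES="$1"
  fi
  export NIX_USER_CONF_FILES
}
```

The `${NIX_USER_CONF_FILES:-}` expansion is what keeps the last function safe under `set -u` when the variable has never been set.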
New year, new copyright, new expected unknown issues with various files
that won't be covered by the script and/or will be but shouldn't change.
I'll do the details on Jan 1, but would appreciate this being
preapproved so I can actually get it merged by then.
CHANGELOG_BEGIN
CHANGELOG_END