However that happened, we were stuck with Nix 2.3.15 (or 2.3.16 in some
cases) on our macOS nodes. This PR is a minor edit to the Nix
initialization commands to switch from 2.4 to "latest", but I will also
use it to record the changes I just made manually to the cluster.
The cluster is currently composed of two parts:
- 7 machines running Catalina (10.15.7).
- 1 machine running Monterey (12.2).
Unfortunately, they use different setups. The Catalina ones are described
by the state of the repo (in theory, though keeping them in sync is
manual); in order to update those, I have:
1. Taken one node off the CI pool (`builder1epjj7`).
2. On that node, run the following commands:
```
cd ~/daml/infra/macos/3-running-box
vagrant destroy -f
rm ~/images/*
vagrant box remove macinbox
vagrant box remove azure-ci-node
rm -r ~/.vagrant.d/boxes/macinbox-06032020.tar.gz
softwareupdate -d --fetch-full-installer --full-installer-version 10.15.7
cd ~/daml/infra/macos/1-create-box
sudo macinbox --box-format vmware_desktop --disk 250 --memory 32768 --cpu 10 --user-script user-script.sh
cd ../2-common-box
vagrant up
vagrant package --output ~/images/initialized-$(date +%Y%m%d).box
vagrant destroy -f
cd
./run-agent.sh
```
This leaves us with that node running an updated box. The new box is
in `~/images/initialized-$(date +%Y%m%d).box`.
3. Send that file to all the other nodes with `scp`.
4. Reboot all the nodes (after deactivating & waiting for jobs to
finish).
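Step 3 could be scripted roughly as follows. This is only a sketch: the node names are hypothetical placeholders, it assumes passwordless SSH between the nodes and an existing `~/images` folder on each of them, and by default it prints the `scp` commands instead of running them.

```shell
# Sketch: copy a freshly packaged box to the other nodes.
# NODES holds hypothetical host names; replace them with the real ones.
NODES="${NODES:-node-a node-b}"
# DRY_RUN=1 (the default) only prints the scp commands.
DRY_RUN="${DRY_RUN:-1}"

distribute_box() {
  box="$1"
  for node in $NODES; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "scp $box $node:images/"
    else
      # Remote path is relative to the remote home directory.
      scp "$box" "$node:images/"
    fi
  done
}

# Example (dry run): distribute_box ~/images/initialized-20220101.box
```

Setting `DRY_RUN=0` makes the function perform the copies for real.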
For the Monterey node, images (steps 1 and 2 in this repo) are currently
created by @nycnewman on another machine I don't have access to, so I
took a slightly different approach: I took the existing image, started
it from the `3-running-box` folder as usual, manually updated Nix there,
then repackaged that.
CHANGELOG_BEGIN
CHANGELOG_END
New year, new copyright, new expected unknown issues with various files
that won't be covered by the script and/or will be but shouldn't change.
I'll do the details on Jan 1, but would appreciate this being
preapproved so I can actually get it merged by then.
CHANGELOG_BEGIN
CHANGELOG_END
As part of the 2.4 release, the Nix installer has been changed to take
care of the volume setup (which we don't want it to do here). Because
that requires root access, they've decided to make multi-user install
the default, and to disable single-user install.
We could do an in-depth review of the difference and adapt our setup to
use a multi-user setup (we do use the multi-user setup on Linux, so
there's precedent), but as an immediate fix, we can keep our single-user
setup _and_ get the latest Nix by using the 2.3.16 installer and then
upgrading from within Nix itself. This _should_ keep working at least
for a while, as Linux still defaults to single-user.
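A minimal sketch of that approach is below. The installer URL layout, the `--no-daemon` flag as the way to request a single-user install, and `nix upgrade-nix` as the upgrade step are assumptions for illustration; the commands actually used in the repo may differ.

```shell
# Sketch: single-user install from a pinned installer, then in-place upgrade.
# The version comes from the text above; everything else is an assumption.
PINNED_VERSION="2.3.16"

# Print the installer URL for a given Nix version (assumed URL layout).
nix_installer_url() {
  echo "https://releases.nixos.org/nix/nix-$1/install"
}

# Install in single-user mode, then upgrade from within Nix itself.
install_then_upgrade() {
  curl -sfL "$(nix_installer_url "$PINNED_VERSION")" | sh -s -- --no-daemon
  # Load the freshly installed profile into the current shell.
  . "$HOME/.nix-profile/etc/profile.d/nix.sh"
  nix upgrade-nix
}

# Example (not run here): install_then_upgrade
```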
CHANGELOG_BEGIN
CHANGELOG_END
macOS filesystems have been case-insensitive by default for years, and
in particular our laptops are, so if we want the cache to work as
expected, CI should be too.
Note: this does not apply to Nix, because the Nix partition is a
case-sensitive image per @nycnewman's script on laptops too.
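To check which behaviour a given volume actually has, something like the following could be used. It is a sketch, not part of this change: the function name is made up, and it works by creating a mixed-case probe file and looking it up under a lower-case name.

```shell
# Report whether the filesystem holding directory $1 is case-sensitive.
fs_case_mode() {
  dir="$1"
  probe="$dir/CaseProbe.$$"
  lower="$dir/caseprobe.$$"
  touch "$probe" || return 1
  # On a case-insensitive filesystem the lower-case name resolves
  # to the probe file we just created.
  if [ -e "$lower" ]; then
    echo "case-insensitive"
  else
    echo "case-sensitive"
  fi
  rm -f "$probe"
}

# Example: fs_case_mode "$HOME"
```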
CHANGELOG_BEGIN
CHANGELOG_END
We've recently seen a few cases where the macOS nodes ended up not
having the cache partition mounted. So far this has only happened on
semi-broken nodes (guest VM still up and running but host unable to
connect to it), so I haven't been able to actually poke at a broken
machine, but I believe this should allow a machine in such a state to
recover.
While we haven't observed a similar issue on Linux nodes (as far as I'm
aware), I have made similar changes there to keep both scripts in sync.
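The general shape of such a recovery step can be sketched as below. This is an illustration under assumptions, not the code in this PR: the function names are hypothetical, and the remount command is passed in by the caller rather than guessed here.

```shell
# Succeeds if mount point $1 appears in a `mount`-style listing on stdin.
mount_listed() {
  grep -qF -- " on $1 "
}

# Run the given remount command only when the mount point is missing.
# Usage: ensure_mounted MOUNT_POINT COMMAND [ARGS...]
ensure_mounted() {
  mount_point="$1"
  shift
  if mount | mount_listed "$mount_point"; then
    return 0
  fi
  "$@"
}
```

Because the check matches on ` on <path> ` with a trailing space, a mount at `/cache2` is not mistaken for one at `/cache`.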
CHANGELOG_BEGIN
CHANGELOG_END
This is adapting the same approach as #9137 to the macOS machines. The
setup is very similar, except macOS apparently doesn't require any kind
of `sudo` access in the process.
The main reason for the change here is that while `~/.bazel-cache` is
reasonably fast to clean, cleaning just that has finally caught up to us
with a recent cleanup step that proudly claimed:
```
before: 638Mi free
after: 1.2Gi free
```
So we do need to start cleaning the other one after all.
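For reference, a free-space check of the kind that produces those numbers could look like this. It is a sketch with hypothetical function names and thresholds, and it assumes the POSIX `df -Pk` output format, where the fourth column is the available space in KiB.

```shell
# Print free space in KiB for the filesystem containing path $1.
free_kib() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds when free space for path $1 is below $2 KiB.
needs_cleanup() {
  [ "$(free_kib "$1")" -lt "$2" ]
}

# Example: clean up when less than 10 GiB is free.
# if needs_cleanup "$HOME" $((10 * 1024 * 1024)); then echo "cleaning"; fi
```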
CHANGELOG_BEGIN
CHANGELOG_END
As we strive for more inclusiveness, we are becoming less comfortable
with historically-charged terms being used in our everyday work.
This is targeted for merge on Dec 26, _after_ the necessary
corresponding changes at both the GitHub and Azure Pipelines levels.
CHANGELOG_BEGIN
- DAML Connect development is now conducted from the `main` branch,
rather than the `master` one. If you had any dependency on the
digital-asset/daml repository, you will need to update this parameter.
CHANGELOG_END
This is far from perfect but removes the blatantly wrong sections of the
README.
Note: as a README change, this is not really a standard change, but
because the README is under the infra folder, this PR does need the tag
to pass CI.
CHANGELOG_BEGIN
CHANGELOG_END
This is the macOS part of #5912, which I have separated because our
macOS nodes have a different deployment process so it seemed easier to
track the deployment of the change separately.
CHANGELOG_BEGIN
CHANGELOG_END
multistep macos setup
This updates the macOS node setup instructions to avoid repeating
identical work and network traffic across all machines through
initialization by building a "daily" image with all the tools and code
we need.
CHANGELOG_BEGIN
CHANGELOG_END
* Fix 3-running-box to remount nix partition
* updated scripts to use multi-step process
* add copyright notices
Co-authored-by: nycnewman <edward@digitalasset.com>
See #6400; split out as separate PR so master == reality and we can
track when this is done. @nycnewman please merge this once the change
is deployed.
Note: it has to be deployed before the next restart; nodes will _not_ be
able to boot with the current configuration.
CHANGELOG_BEGIN
CHANGELOG_END
It looks like some nix update has broken our current Terraform setup.
The Google provider plugin has changed its reported version to 0.0.0;
poking at my local nix store seems to indicate we actually get 3.15, but
🤷.
This PR also reverts the infra part of #6400 so we get back to master ==
reality.
CHANGELOG_BEGIN
CHANGELOG_END
Nix now requires `-L`, so I've gone ahead and normalized everything to
use `-sfL`, which we were already using in one place.
changelog_begin
changelog_end
* Fix launchd killing VMware process at end of script execution
CHANGELOG_BEGIN
Fix issue with macOS Catalina launchd killing the VMware instance on rebuild (AbandonProcessGroup)
CHANGELOG_END
* Updates to support VMware variant of Hypervisor
* Update infra/macos/scripts/rebuild-crontask.sh
Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
* Update infra/macos/scripts/run-agent.sh
Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
We have seen the following error message crop up a couple of times
recently:
```
FATAL: could not create shared memory segment: No space left on device
DETAIL: Failed system call was shmget(key=5432001, size=56, 03600).
HINT: This error does *not* mean that you have run out of disk space.
It occurs either if all available shared memory IDs have been taken, in
which case you need to raise the SHMMNI parameter in your kernel, or
because the system's overall limit for shared memory has been reached.
The PostgreSQL documentation contains more information about shared
memory configuration.
child process exited with exit code 1
```
Based on [the PostgreSQL
documentation](https://www.postgresql.org/docs/12/kernel-resources.html),
this should fix it.
CHANGELOG_BEGIN
CHANGELOG_END
set up macOS nodes
This PR documents how to create and manage macOS CI nodes. Because macOS
is not supported by our current cloud providers, these instructions are
geared towards creating VMs on physical machines we would need to host
and manage ourselves, i.e. these notes are mostly targeted at Ed.
CHANGELOG_BEGIN
CHANGELOG_END