daml/infra/macos
Gary Verhaegen fe9d44ffe7
ci: bump Nix on macOS nodes (#13061)
However that happened, we were stuck with Nix 2.3.15 (or 2.3.16 in some
cases) on our macOS nodes. This PR is a minor edit to the Nix
initialization commands to switch from 2.4 to "latest", but I will also
use it to record the changes I just made manually to the cluster.

The cluster is currently composed of two parts:
- 7 machines running Catalina (10.15.7).
- 1 machine running Monterey (12.2).

Unfortunately they use different setups. The Catalina ones are described
by the state of the repo (in theory, though keeping them in sync is
manual); in order to update those, I have:

1. Taken one node off the CI pool (`builder1epjj7`).
2. On that node, run the following commands:
   ```
   cd ~/daml/infra/macos/3-running-box
   vagrant destroy -f
   rm ~/images/*
   vagrant box remove macinbox
   vagrant box remove azure-ci-node
   rm -r ~/.vagrant.d/boxes/macinbox-06032020.tar.gz
   softwareupdate -d --fetch-full-installer --full-installer-version 10.15.7
   cd ~/daml/infra/macos/1-create-box
   sudo macinbox --box-format vmware_desktop --disk 250 --memory 32768 --cpu 10 --user-script user-script.sh
   cd ../2-common-box
   vagrant up
   vagrant package --output ~/images/initialized-$(date +%Y%m%d).box
   vagrant destroy -f
   cd
   ./run-agent.sh
   ```
   This leaves us with that node running an updated box. The new box is
   in `~/images/initialized-$(date +%Y%m%d).box`.
3. Sent that file to all the other nodes with `scp`.
4. Rebooted all the nodes (after deactivating them & waiting for jobs to
   finish).
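
Step 3 could be scripted along the following lines. This is only a sketch: the helper names `box_path` and `distribute_box` are made up for illustration, the host names have to be supplied by the caller, and it assumes each remote host has an `images` folder in the login user's home directory, as on the node above.

```shell
#!/usr/bin/env bash
# Sketch: copy a freshly packaged box to the other macOS hosts.
# Function names are hypothetical; nothing here exists in the repo.

# Print the image path for a given date stamp (YYYYMMDD), matching the
# name given to `vagrant package --output` in step 2.
box_path() {
  local stamp="$1"
  printf '%s/images/initialized-%s.box\n' "$HOME" "$stamp"
}

# Copy the box to each host given after the box path.
# With DRY_RUN=1 (the default) the scp commands are printed, not run.
distribute_box() {
  local box="$1"
  shift
  local host
  for host in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "scp $box $host:images/"
    else
      scp "$box" "$host:images/"
    fi
  done
}
```

For example, `DRY_RUN=0 distribute_box "$(box_path "$(date +%Y%m%d)")" host-a host-b` would copy today's box to two placeholder hosts. Defaulting to a dry run makes it harder to push a multi-gigabyte image to the wrong machine by accident.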

For the Monterey node, images (steps 1 and 2 in this repo) are currently
created by @nycnewman on another machine I don't have access to, so I
took a slightly different approach: I took the existing image, started
it from the `3-running-box` folder as usual, manually updated Nix there,
then repackaged that.
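
Whichever way a box is updated, it can be worth checking the Nix version before repackaging. The sketch below is an illustration, not something that was run as part of this change: the function names are hypothetical, it assumes `nix --version` prints a line of the form `nix (Nix) 2.6.1`, and it compares only the major and minor components of plain numeric versions.

```shell
#!/usr/bin/env bash
# Sketch: check that the installed Nix is at least a given version.
# Function names are hypothetical.

# Extract the version number (the last field) from `nix --version`
# output, e.g. "nix (Nix) 2.6.1" gives "2.6.1".
nix_version_from() {
  printf '%s\n' "$1" | awk '{ print $NF }'
}

# Succeed if version $1 is at least version $2, comparing major.minor only.
# Pre-release suffixes in the first two components are not handled.
version_at_least() {
  local have_major have_rest have_minor want_major want_rest want_minor
  have_major="${1%%.*}"; have_rest="${1#*.}"; have_minor="${have_rest%%.*}"
  want_major="${2%%.*}"; want_rest="${2#*.}"; want_minor="${want_rest%%.*}"
  if [ "$have_major" -ne "$want_major" ]; then
    [ "$have_major" -gt "$want_major" ]
  else
    [ "$have_minor" -ge "$want_minor" ]
  fi
}
```

Inside the guest, `version_at_least "$(nix_version_from "$(nix --version)")" 2.4 || echo "Nix is still outdated"` would flag a box that is still on 2.3.x.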

CHANGELOG_BEGIN
CHANGELOG_END
2022-02-24 01:04:28 +00:00
1-create-box update copyright headers (#12240) 2022-01-03 16:36:51 +00:00
2-common-box ci: bump Nix on macOS nodes (#13061) 2022-02-24 01:04:28 +00:00
3-running-box update copyright headers (#12240) 2022-01-03 16:36:51 +00:00
scripts update copyright headers (#12240) 2022-01-03 16:36:51 +00:00
.ruby-version Macos boxes for ci (#5002) 2020-04-14 18:03:24 +02:00
README.md update macOS nodes README (#8243) 2020-12-10 16:48:12 +01:00

Introduction

While our Linux and Windows machines are using standard cloud infrastructure, and as such can be created entirely from Terraform scripts, our macOS nodes must be managed on a more physical level. This folder contains all the instructions needed to create a virtual macOS machine (running on a physical macOS host) that can be added to our Azure pool.

There are a few pieces to this puzzle:

  1. Instructions to create a base Vagrant box. This only needs to be done once per Apple-supplied macOS installer version; this is as close as we can get to an unmodified, vanilla, out-of-the-box macOS installation. (We do add the synthetic.conf file though.)
  2. Common tools Vagrant box. This is a Vagrant box created on top of the previous step that does all the initialization except for the installation of the Azure agent. This allows us to only do the common steps once. This may be rebuilt less frequently, say once a week.
  3. Azure runner. This starts from the previous box and downloads and runs the Azure agent. This is the one that should run on each machine and be reset every day.
  4. Additional considerations, discussed below.

Security considerations

The guest machine is created with a user, vagrant, that has passwordless sudo access and can be accessed with the default, well-known Vagrant SSH "private" key and a well-known default password. While this is useful for debugging, the SSH port of the guest machine MUST NOT be accessible from outside the host machine, and access to the host machine itself must be appropriately restricted.

My personal recommendation would be for the host machines to not be accessible from any network, and to instead be managed by physical access, if possible.

The init.sh script creates a vsts user with more restricted access to run the CI builds. NOTE: the VSTS agent is locked down, so upgrades from the Azure console will fail. The expectation is that the nodes are cycled daily and will pick up the latest Azure VSTS agent on rebuild.

Machine initialization

Machine initialization is done based on the local checkout of the repo on each macOS host. This means that changes to the macOS nodes init script need to be propagated manually at this time. PRs that change macOS nodes configs should include an audit trail of deploying those changes in their comments.

Wiping nodes on a cron

macOS nodes are wiped every day overnight (each at a different time, and only after checking that the machine is not processing any job, so as to minimize service interruption).
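
As an illustration of the staggering only, the sketch below prints one crontab line per host index. Everything specific in it is an assumption: the 01:00 start, the 30-minute spacing, the function name `wipe_cron_line` and the `reset-node.sh` script path are placeholders, not the schedule or script actually used on the nodes, and the check for in-flight jobs is left to that script.

```shell
#!/usr/bin/env bash
# Sketch: generate staggered crontab entries for the nightly wipe.
# The schedule and script path are hypothetical placeholders.

# Print a crontab line for the host with the given 0-based index,
# spacing hosts 30 minutes apart starting at 01:00.
wipe_cron_line() {
  local index="$1"
  local total=$(( 60 + index * 30 ))   # minutes after midnight
  printf '%d %d * * * %s\n' \
    $(( total % 60 )) $(( total / 60 % 24 )) "$HOME/reset-node.sh"
}
```

Calling `wipe_cron_line 3` on the fourth host yields an entry that fires at 02:30; the output can then be added to that host's crontab with `crontab -e`.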

Proxying the cache

It is likely (though not certain) that, at some point, we will want to reduce the amount of traffic generated between our macOS CI nodes and the GCP-hosted caches, both for performance and price reasons. Under VirtualBox, guest VMs by default use their host machines as the default gateway, so this should be feasible through standard HTTP proxying.

However, I have not yet spent much time investigating this.

Other virtualization techniques

While this folder suggests one known-to-work way to get CI nodes, there are alternative options. I have not spent much time exploring other virtualization options or other ways to create an initial "blank" macOS virtual hard drive, though I believe the provided init.sh script should work with most other approaches with minimal changes, if required.