Commit Graph

23 Commits

Author SHA1 Message Date
Gary Verhaegen
fe9d44ffe7
ci: bump Nix on macOS nodes (#13061)
However that happened, we were stuck with Nix 2.3.15 (or 2.3.16 in some
cases) on our macOS nodes. This PR is a minor edit to the Nix
initialization commands to switch from 2.4 to "latest", but I will also
use it to record the changes I just made manually to the cluster.

The cluster is currently composed of two parts:
- 7 machines running Catalina (10.15.7).
- 1 machine running Monterey (12.2).

Unfortunately they use different setups. The Catalina ones are described
by the state of the repo (in theory, though keeping them in sync is
manual); in order to update those, I have:

1. Taken one node off the CI pool (`builder1epjj7`).
2. On that node, run the following commands:
   ```
   cd ~/daml/infra/macos/3-running-box
   vagrant destroy -f
   rm ~/images/*
   vagrant box remove macinbox
   vagrant box remove azure-ci-node
   rm -r ~/.vagrant.d/boxes/macinbox-06032020.tar.gz
   softwareupdate -d --fetch-full-installer --full-installer-version 10.15.7
   cd ~/daml/infra/macos/1-create-box
   sudo macinbox --box-format vmware_desktop --disk 250 --memory 32768 --cpu 10 --user-script user-script.sh
   cd ../2-common-box
   vagrant up
   vagrant package --output ~/images/initialized-$(date +%Y%m%d).box
   vagrant destroy -f
   cd
   ./run-agent.sh
   ```
   This leaves us with that node running an updated box. The new box is
   in `~/images/initialized-$(date +%Y%m%d).box`.
3. Send that file to all the other nodes with `scp`.
4. Reboot all the nodes (after deactivating & waiting for jobs to
   finish).
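
Step 3 could look roughly like the sketch below. It only prints the `scp` commands (a dry run); the node names `node-a` and `node-b` and the remote `images/` directory are hypothetical placeholders, not the real cluster names.

```shell
#!/bin/sh
# Dry run: print one scp command per target node.
# Node names and the remote directory are hypothetical placeholders.
image="$HOME/images/initialized-$(date +%Y%m%d).box"

copy_commands() {
  # $1: image path; remaining arguments: node hostnames
  img="$1"; shift
  for node in "$@"; do
    echo "scp $img $node:images/"
  done
}

copy_commands "$image" node-a node-b
```

Dropping the `echo` would perform the copies for real, assuming SSH access to each node is already set up.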

For the Monterey node, images (steps 1 and 2 in this repo) are currently
created by @nycnewman on another machine I don't have access to, so I
took a slightly different approach: I took the existing image, started
it from the `3-running-box` folder as usual, manually updated Nix there,
then repackaged that.

CHANGELOG_BEGIN
CHANGELOG_END
2022-02-24 01:04:28 +00:00
Gary Verhaegen
d2e2c21684
update copyright headers (#12240)
New year, new copyright, new expected unknown issues with various files
that won't be covered by the script and/or will be but shouldn't change.

I'll do the details on Jan 1, but would appreciate this being
preapproved so I can actually get it merged by then.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-03 16:36:51 +00:00
Gary Verhaegen
de8d15fb1e
fix Nix install on macOS nodes (#11696)
As part of the 2.4 release, the Nix installer has been changed to take
care of the volume setup (which we don't want it to do here). Because
that requires root access, they've decided to make multi-user install
the default, and to disable single-user install.

We could do an in-depth review of the difference and adapt our setup to
use a multi-user setup (we do use the multi-user setup on Linux, so
there's precedent), but as an immediate fix, we can keep our single-user
setup _and_ get the latest Nix by using the 2.3.16 installer and then
upgrading from within Nix itself. This _should_ keep working at least
for a while, as Linux still defaults to single-user.
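
As a rough sketch of that approach, assuming the versioned installer is fetched from `releases.nixos.org` and the upgrade is done through `nix-env` (the exact commands used on the nodes are not shown here, so treat both as assumptions):

```shell
#!/bin/sh
installer_url() {
  # Assumed location of the versioned installer on the Nix release server.
  echo "https://releases.nixos.org/nix/nix-$1/install"
}

install_and_upgrade() {
  # Not called in this sketch: needs network access and modifies the machine.
  curl -sfL "$(installer_url 2.3.16)" | sh        # single-user install from the 2.3.16 installer
  . "$HOME/.nix-profile/etc/profile.d/nix.sh"     # load the freshly installed profile
  nix-channel --update                            # refresh the nixpkgs channel
  nix-env -iA nixpkgs.nix                         # upgrade Nix from within Nix
}

installer_url 2.3.16
```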

CHANGELOG_BEGIN
CHANGELOG_END
2021-11-24 18:53:13 +01:00
Gary Verhaegen
4093bbd58c
fix macOS Bazel cache (#10795)
macOS filesystems have been case-insensitive by default for years, and
in particular our laptops are, so if we want the cache to work as
expected, CI should be too.

Note: this does not apply to Nix, because the Nix partition is a
case-sensitive image per @nycnewman's script on laptops too.
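
For reference, a minimal way to probe whether a given directory sits on a case-sensitive filesystem. This is an illustrative sketch, not part of the change; the function name `is_case_sensitive` is made up.

```shell
#!/bin/sh
is_case_sensitive() {
  # $1: directory on the filesystem to probe.
  # Returns 0 if case-sensitive, 1 if case-insensitive, 2 if the probe failed.
  probe="$(mktemp -d "$1/case-probe.XXXXXX")" || return 2
  touch "$probe/lower"
  # On a case-insensitive filesystem, "LOWER" resolves to the same file.
  if [ -e "$probe/LOWER" ]; then result=1; else result=0; fi
  rm -rf "$probe"
  return "$result"
}

if is_case_sensitive "${TMPDIR:-/tmp}"; then
  echo "case-sensitive"
else
  echo "case-insensitive (or probe failed)"
fi
```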

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-07 13:31:57 +02:00
Gary Verhaegen
8a6cfacbff
more robust macOS cleanup (#9456)
We've recently seen a few cases where the macOS nodes ended up not
having the cache partition mounted. So far this has only happened on
semi-broken nodes (guest VM still up and running but host unable to
connect to it), so I haven't been able to actually poke at a broken
machine, but I believe this should allow a machine in such a state to
recover.

While we haven't observed a similar issue on Linux nodes (as far as I'm
aware), I have made similar changes there to keep both scripts in sync.

CHANGELOG_BEGIN
CHANGELOG_END
2021-04-21 12:10:47 +02:00
Gary Verhaegen
2745bc03a5
macos: move cache setup to step 2 (#9350)
The caches really need to be set up before we warm them up.

CHANGELOG_BEGIN
CHANGELOG_END
2021-04-07 21:42:15 +02:00
Gary Verhaegen
c97db24295
fix macOS cache cleaning (#9343)
The script needs to run once before the first build, otherwise the cache
folders get created on the main partition.

CHANGELOG_BEGIN
CHANGELOG_END
2021-04-07 18:46:44 +02:00
Gary Verhaegen
45c4ba2230
macos cache cleaning (#9245)
This is adapting the same approach as #9137 to the macOS machines. The
setup is very similar, except macOS apparently doesn't require any kind
of `sudo` access in the process.

The main reason for the change here is that while `~/.bazel-cache` is
reasonably fast to clean, cleaning just that has finally caught up to us
with a recent cleanup step that proudly claimed:

```
before: 638Mi free
after: 1.2Gi free
```

So we do need to start cleaning the other one after all.
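
A minimal sketch of the kind of free-space check that can gate a cleanup step. The function names and the 10 GiB threshold are hypothetical, not what the actual script uses.

```shell
#!/bin/sh
free_kib() {
  # df -Pk: POSIX output format in 1024-byte blocks; column 4 is "Available".
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

needs_cleanup() {
  # $1: path to check, $2: minimum free space in KiB
  [ "$(free_kib "$1")" -lt "$2" ]
}

# Hypothetical threshold: 10 GiB expressed in KiB.
if needs_cleanup "${HOME:-/}" 10485760; then
  echo "cleanup needed"
else
  echo "enough free space"
fi
```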

CHANGELOG_BEGIN
CHANGELOG_END
2021-03-30 02:46:05 +02:00
Edward Newman
a98e03981f
Increase nix partition to max of 60GB (#9259)
CHANGELOG_BEGIN
CHANGELOG_END
2021-03-30 01:06:58 +02:00
Gary Verhaegen
a925f0174c
update copyright notices for 2021 (#8257)
* update copyright notices for 2021

To be merged on 2021-01-01.

CHANGELOG_BEGIN
CHANGELOG_END

* patch-bazel-windows & da-ghc-lib
2021-01-01 19:49:51 +01:00
Gary Verhaegen
93f449d245
rename master to main (#8245)
As we strive for more inclusiveness, we are becoming less comfortable
with historically-charged terms being used in our everyday work.

This is targeted for merge on Dec 26, _after_ the necessary
corresponding changes at both the GitHub and Azure Pipelines levels.

CHANGELOG_BEGIN

- DAML Connect development is now conducted from the `main` branch,
  rather than the `master` one. If you had any dependency on the
  digital-asset/daml repository, you will need to update this parameter.

CHANGELOG_END
2020-12-27 14:19:07 +01:00
Gary Verhaegen
5c8ac44049
update macOS nodes README (#8243)
This is far from perfect but removes the blatantly wrong sections of the
README.

Note: as a README change, this is not really a standard change, but
because the README is under the infra folder, this PR does need the tag
to pass CI.

CHANGELOG_BEGIN
CHANGELOG_END
2020-12-10 16:48:12 +01:00
Gary Verhaegen
b23304c691
add default capability to macos (#5915)
This is the macOS part of #5912, which I have separated because our
macOS nodes have a different deployment process so it seemed easier to
track the deployment of the change separately.

CHANGELOG_BEGIN
CHANGELOG_END
2020-11-25 15:34:33 +01:00
Gary Verhaegen
5b2319e137
multistep macos setup (#5768)
This updates the macOS node setup instructions to avoid repeating
identical work and network traffic across all machines through
initialization by building a "daily" image with all the tools and code
we need.

CHANGELOG_BEGIN
CHANGELOG_END

* Fix 3-running-box to remount nix partition

* updated scripts to use multi-step process

* add copyright notices

Co-authored-by: nycnewman <edward@digitalasset.com>
2020-08-18 16:01:02 +02:00
Gary Verhaegen
72f428d8df
macos nodes: add nix redirect (#6406)
See #6400; split out as separate PR so master == reality and we can
track when this is done. @nycnewman please merge this once the change
is deployed.

Note: it has to be deployed before the next restart; nodes will _not_ be
able to boot with the current configuration.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-18 14:51:25 +02:00
Gary Verhaegen
fba57470a5
restore terraform to working state (#6402)
It looks like some nix update has broken our current Terraform setup.
The Google provider plugin has changed its reported version to 0.0.0;
poking at my local nix store seems to indicate we actually get 3.15, but
🤷.

This PR also reverts the infra part of #6400 so we get back to master ==
reality.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-18 12:15:27 +02:00
Moritz Kiefer
2c1d4cb805
Fix nix installation (#6400)
Nix now requires -L; I’ve gone ahead and just normalized everything to
use -sfL, which we were already using in one place.

changelog_begin
changelog_end
2020-06-18 10:34:08 +02:00
Edward Newman
9a073cebd9
Macos fix nix installer for build agent servers (#6133)
* Fix issue with xz dependency missing for Nix installer

CHANGELOG_BEGIN
- MacOS - fix Nix installer dependency for xz
CHANGELOG_END

* Additional changes for new Nix installer for Catalina dependencies
2020-05-28 14:04:01 +02:00
Edward Newman
be4f85d165
Fix launchd killing VMWare process at end of script execution (#6006)
* Fix launchd killing VMWare process at end of script execution

CHANGELOG_BEGIN
Fix issue with MacOS Catalina Launchd killing VMWare instance on rebuild (AbandonProcessGroup)
CHANGELOG_END
2020-05-18 10:54:15 -04:00
Edward Newman
0ec0cc335f
Updates to support VMWare variant of Hypervisor for MacOS Build Nodes (#5940)
* Updates to support VMWare variant of Hypervisor

* Update infra/macos/scripts/rebuild-crontask.sh

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>

* Update infra/macos/scripts/run-agent.sh

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
2020-05-12 09:36:40 -04:00
Gary Verhaegen
6aac32480a
hopefully fix memory issue with pg on macos CI (#5824)
We have seen the following error message crop up a couple times
recently:

```
FATAL:  could not create shared memory segment: No space left on device
DETAIL:  Failed system call was shmget(key=5432001, size=56, 03600).
HINT:  This error does *not* mean that you have run out of disk space.
It occurs either if all available shared memory IDs have been taken, in
which case you need to raise the SHMMNI parameter in your kernel, or
because the system's overall limit for shared memory has been reached.
    The PostgreSQL documentation contains more information about shared
memory configuration.
child process exited with exit code 1
```

Based on [the PostgreSQL
documentation](https://www.postgresql.org/docs/12/kernel-resources.html),
this should fix it.
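
The linked page describes the macOS `kern.sysv.*` shared memory parameters. As an illustration only (these are not necessarily the values applied here), the sketch below derives a consistent set of `/etc/sysctl.conf` lines, assuming that on macOS SHMALL is counted in 4 kB pages and SHMMAX must be a multiple of 4096. The function name `sysv_shm_conf` is made up.

```shell
#!/bin/sh
sysv_shm_conf() {
  # $1: SHMMAX in bytes (must be a multiple of 4096), $2: SHMMNI
  shmmax="$1"; shmmni="$2"
  if [ $((shmmax % 4096)) -ne 0 ]; then
    echo "shmmax must be a multiple of 4096" >&2
    return 1
  fi
  echo "kern.sysv.shmmax=$shmmax"
  echo "kern.sysv.shmmni=$shmmni"
  # SHMALL is assumed to be expressed in 4 kB pages.
  echo "kern.sysv.shmall=$((shmmax / 4096))"
}

# Illustrative values only.
sysv_shm_conf 4194304 32
```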

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-04 14:32:23 -04:00
Edward Newman
01c784659f
Minor changes to MacOS infra config (#5673)
2020-04-22 18:57:40 +02:00
Gary Verhaegen
b3c428e76f
Macos boxes for ci (#5002)
set up macOS nodes

This PR documents how to create and manage macOS CI nodes. Because macOS
is not supported by our current cloud providers, these instructions are
geared towards creating VMs on physical machines we would need to host
and manage ourselves, i.e. these notes are mostly targeted at Ed.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-14 18:03:24 +02:00