17 Commits

Author SHA1 Message Date
Gary Verhaegen
b149ffa8b8
Windows clean up (#16723)
* shut down GCP windows nodes
* shut down periodic-killer
* cycle Windows nodes in daily-reset
2023-04-20 17:06:26 +02:00
Gary Verhaegen
7fb303be3b
update CI node reset permissions (#16216) 2023-02-02 09:06:39 +01:00
Gary Verhaegen
151e12b81a
bump copyright (#16002)
This is the result of:

- Updating `./COPY` to say `2023`.
- Running `./dev-env/bin/dade-copyright-headers update .`
2023-01-04 18:21:15 +01:00
Gary Verhaegen
9f5a2f9778
Fix terraform (#12333)
Our Terraform configuration has been slightly broken by two recent
changes:

- The nixpkgs upgrade in #12280 means a new version of our GCP plugin
  for Terraform, which as a breaking change added a required argument to
  `google_project_iam_member`. The new version also results in a number
  of smaller changes in the way Terraform handles default arguments, which
  doesn't result in any changes to our configuration files or to the
  behaviour of our deployed infrastructure, but does require re-syncing
  the Terraform state (by running `terraform apply`, which would
  essentially be a no-op if it were not for the next bullet point).
- The nix configuration changes in #12265 have changed the Linux CI
  nodes configuration but have not been deployed yet.

This PR is an audit log of the steps taken to rectify those and bring us
back to a state where our deployed configuration and our recorded
Terraform state both agree with our current `main` branch tip.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-10 21:56:47 +00:00
Gary Verhaegen
d2e2c21684
update copyright headers (#12240)
New year, new copyright, new expected unknown issues with various files
that won't be covered by the script and/or will be but shouldn't change.

I'll do the details on Jan 1, but would appreciate this being
preapproved so I can actually get it merged by then.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-03 16:36:51 +00:00
Gary Verhaegen
349d812482
ci: increase hard drive space (not macOS) (#11983)
I've seen quite a few builds failing for lack of disk space recently,
sometimes as early as 2pm CET.

CHANGELOG_BEGIN
CHANGELOG_END
2021-12-06 19:41:11 +00:00
Gary Verhaegen
cfae2d88f5
update Terraform files to match reality (#8780)
* fixup terraform config

Two changes have happened recently that have invalidated the current
Terraform files:

1. The Terraform version has gone through a major, incompatible upgrade
   (#8190); the required updates for this are reflected in the first
   commit of this PR.
2. The certificate used to serve [Hoogle](https://hoogle.daml.com) was
   about to expire, so Edward created a new one and updated the config
   directly. The second commit in this PR updates the Terraform config
   to match that new, already-in-prod setting.

Note: This PR applies cleanly, as there are no resulting changes in
Terraform's perception of the target state from 1, and the change from 2
has already been applied through other channels.

CHANGELOG_BEGIN
CHANGELOG_END

* update hoogle cert
2021-02-08 17:25:04 +00:00
Gary Verhaegen
a925f0174c
update copyright notices for 2021 (#8257)
* update copyright notices for 2021

To be merged on 2021-01-01.

CHANGELOG_BEGIN
CHANGELOG_END

* patch-bazel-windows & da-ghc-lib
2021-01-01 19:49:51 +01:00
Gary Verhaegen
7c2ba6f996
infra: add prod label (#8140)
Requested by @nycnewman.

CHANGELOG_BEGIN
CHANGELOG_END
2020-12-03 01:55:43 +01:00
Gary Verhaegen
6ac61960e6
fix periodic-killer permissions (#7776)
I screwed up in #7771: `google_project_iam_binding` is defined as _the_
authoritative list of accounts for that role, not just a list of
accounts to add the role to. So in applying that rule yesterday, I
inadvertently stripped the periodic-killer machine of its role, and
therefore nothing got reset last night. The Terraform plan did not
mention this, unfortunately (though, arguably, consistently with the
semantics of the Terraform rules).
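
The difference between an authoritative binding and an additive membership can be sketched with a toy model. This is a minimal illustration only: the policy is modelled as a plain dictionary, and the role name, member names, and the helpers `apply_binding` and `apply_member` are all hypothetical, not part of the GCP or Terraform APIs. The additive behaviour loosely corresponds to what `google_project_iam_member` does; the commit message does not say which resource the fix actually used.

```python
# Simplified, hypothetical model of an IAM policy: role -> set of members.
# This is NOT the GCP or Terraform API; it only illustrates the semantics.
from typing import Dict, Set

Policy = Dict[str, Set[str]]


def apply_binding(policy: Policy, role: str, members: Set[str]) -> Policy:
    """Authoritative: the role's member list becomes exactly `members`."""
    updated = {r: set(m) for r, m in policy.items()}
    updated[role] = set(members)
    return updated


def apply_member(policy: Policy, role: str, member: str) -> Policy:
    """Additive: `member` is added; existing members are left untouched."""
    updated = {r: set(m) for r, m in policy.items()}
    updated.setdefault(role, set()).add(member)
    return updated


ROLE = "roles/example.nodeKiller"  # hypothetical role name
before: Policy = {ROLE: {"serviceAccount:periodic-killer"}}

# A binding that lists only the two humans silently drops the service account.
after_binding = apply_binding(before, ROLE, {"user:gary", "user:moritz"})

# Adding members one at a time keeps the service account in place.
after_member = apply_member(
    apply_member(before, ROLE, "user:gary"), ROLE, "user:moritz"
)
```

In this model, `after_binding[ROLE]` no longer contains the periodic-killer account, while `after_member[ROLE]` still does.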

This is the same intent as #7771, but this one actually works. (Or at
least does not fail in the same way.)

CHANGELOG_BEGIN
CHANGELOG_END
2020-10-22 12:22:07 +02:00
Gary Verhaegen
bdc2b5a9b1
allow Moritz to kill machines (#7771)
Also, explicitly allow myself, rather than rely on my admin status.

CHANGELOG_BEGIN
CHANGELOG_END
2020-10-21 18:40:54 +02:00
Gary Verhaegen
819210827e
fix permissions on periodic-killer (#5307)
Even though the command succeeds as far as deleting the machine goes, it
does log an error. That is probably why we recently had only one machine
deleted per night.

Something must have changed on the Google side recently to make this
additional permission required.

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-31 19:04:40 +02:00
Gary Verhaegen
38a5fea7a0
tweak periodic-killer (#5268)
1. Google says the instance is currently overutilized and suggests
   g1-small as a more appropriate size.
2. It occurred to me that the reason no error was logged might be that
   we lose them, so explicitly redirecting stderr too.

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-30 14:12:14 +02:00
Gary Verhaegen
7e960eb454
log periodic reboots (#5235)
It appears that most of our Windows machines have not been rebooted
since Tuesday 24. We detected this because one of them has run out of
disk space.

This is not good, but what's worse is I currently have no idea what
could be going wrong, and we are not logging anything at all in the
current setup, so even ssh'ing into the machine provides no insight.

This PR hopefully addresses that by:

1. Redirecting the outputs of the script to a file, and
2. `tail`ing that file from the startup script, so the logs will appear
   directly in the GCP web console. (This is what we currently do for
   the Azure agent logs on Linux.)

This PR also tells the script to not stop on the first failed machine
and keep trying.
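
The "log everything and keep going" behaviour described above might look roughly like the following. This is a sketch under stated assumptions, not the actual script (which is not shown in this log): `reset_machines`, `flaky_delete`, and the machine names are hypothetical, and the delete call is passed in as a plain function rather than calling any real GCP tooling.

```python
import io
from typing import Callable, Iterable, List, TextIO


def reset_machines(
    names: Iterable[str],
    delete: Callable[[str], None],
    log: TextIO,
) -> List[str]:
    """Try to delete every machine; log each outcome and never stop early."""
    failed: List[str] = []
    for name in names:
        try:
            delete(name)
            print(f"deleted {name}", file=log, flush=True)
        except Exception as exc:  # keep going, but record what went wrong
            print(f"failed to delete {name}: {exc}", file=log, flush=True)
            failed.append(name)
    return failed


def flaky_delete(name: str) -> None:
    """Hypothetical stand-in for the real delete call; fails for one machine."""
    if name == "ci-w-2":
        raise RuntimeError("simulated API error")


log = io.StringIO()  # in the real setup this would be a file that gets tailed
failed = reset_machines(["ci-w-1", "ci-w-2", "ci-w-3"], flaky_delete, log)
```

Here the failure on the second machine is written to the log, and the third machine is still processed.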

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-27 21:35:49 +01:00
Gary Verhaegen
1872c668a5
replace DAML Authors with DA in copyright headers (#5228)
Change requested by Manoj.

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-27 01:26:10 +01:00
Gary Verhaegen
0a251b3fa5
switch CI nodes to permanent (#4455)
CHANGELOG_BEGIN
CHANGELOG_END
2020-02-11 02:07:42 +01:00
Gary Verhaegen
1681922f90
ci: temp machines for scheduled killing experiment (#4386)
* ci: temp machines for scheduled killing experiment

Based on our discussions last week, I am exploring ways to move us to
permanent machines instead of preemptible ones. This should drastically
reduce the number of "cancelled" jobs.

The end goal is to have:

1. An instance group (per OS) that defines the actual CI nodes; this
would be pretty much the same as the existing ones, but with
`preemptible` set to false.
2. A separate machine that, on a cron (say at 4AM UTC), destroys all the
CI nodes.

The hope is that the group managers, which are set to maintain 10 nodes,
will then recreate the "missing" nodes using their normal starting
procedure.
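
The intended interaction between the killer and the group manager can be sketched as a small simulation. This is only a model of the idea, assuming the group manager simply tops the group back up to its target size; `reconcile`, `fresh_names`, and the node names are hypothetical and do not correspond to any GCP API.

```python
import itertools
from typing import Iterator, Set

TARGET_SIZE = 10  # the group managers are set to maintain 10 nodes


def fresh_names(prefix: str) -> Iterator[str]:
    """Yield an endless stream of unique, hypothetical node names."""
    for i in itertools.count(1):
        yield f"{prefix}-{i}"


def reconcile(nodes: Set[str], target: int, names: Iterator[str]) -> Set[str]:
    """Model of a group manager: create nodes until the target size is met."""
    result = set(nodes)
    while len(result) < target:
        result.add(next(names))
    return result


names = fresh_names("ci-node")

# Initial state: the manager brings the group up to its target size.
nodes = reconcile(set(), TARGET_SIZE, names)
old_nodes = set(nodes)

# The "killer" runs on its schedule and destroys every CI node.
nodes = set()

# The manager notices the missing nodes and recreates them from scratch.
nodes = reconcile(nodes, TARGET_SIZE, names)
```

After the cycle the group is back at full size, and every node is a freshly created one.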

However, there are a lot of unknowns I would like to explore, and I need
a playground for that. This is where this PR comes in. As it stands, it
creates one "killer" machine and a temporary group manager. I will use
these to experiment with the GCP API in various ways without interfering
with the real CI nodes.

This experimentation will likely require multiple `terraform apply` with
multiple different versions of the associated files, as well as
connecting to the machines and running various commands directly from
them. I will ensure all of that only affects the new machines created as
part of this PR, and therefore believe we do not need to go through a
separate round of approval for each change.

Once I have finished experimenting, I will create a new PR to clean up
the temporary resources created with this one and hopefully set up a
more permanent solution.

CHANGELOG_BEGIN
CHANGELOG_END

* add missing zone for killer instance

* add compute scope to killer

* authorize Terraform to shut down killer to update it

* change in plans: use a service account instead

* .

* add compute.instances.list permission

* add compute.instances.delete permission

* add cron script

* obligatory round of extra escaping

* fix PATH issue & crontab format

* smaller machine & less frequent reboots
2020-02-07 21:04:03 +01:00