Commit Graph

205 Commits

Author SHA1 Message Date
Gary Verhaegen
170d839ed0
Fix es (#12554)
It's down again. I wish I knew why it does that.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-24 18:33:59 +01:00
Gary Verhaegen
de2a8c0c04
ci: use service account for Windows nodes (#12489)
When no service account is explicitly selected, GCP provides a default
one, which happens to have way more access rights than we're comfortable
with. I'm not quite sure how the total lack of a service account slipped
through here, but I've noticed today so I'm changing it.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-19 19:58:17 +00:00
Gary Verhaegen
5716d99cd2
Disable printer sharing (#12408)
As the title suggests. We already disable all communication between CI
nodes through network rules, but we currently get a lot of noise from
GCP logging violations to those rules from Windows trying to feel its
way out for file share buddies.

CHANGELOG_BEGIN
CHANGELOG_END

As usual, this branch will contain intermediate commits that may serve
as an audit log of sorts.
2022-01-13 20:55:18 +00:00
Gary Verhaegen
e34ac20d23
offboarding Akshay (#12396)
😿

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-13 14:40:44 +01:00
Gary Verhaegen
6aa9409e6e
split-releases: gcs accounts for assembly & canton (#12373)
CHANGELOG_BEGIN
CHANGELOG_END
2022-01-13 14:19:08 +01:00
Gary Verhaegen
9f5a2f9778
Fix terraform (#12333)
Our Terraform configuration has been slightly broken by two recent
changes:

- The nixpkgs upgrade in #12280 means a new version of our GCP plugin
  for Terraform, which as a breaking change added a required argument to
  `google_project_iam_member`. The new version also results in a number
  of smaller changes in the way Terraform handles default arguments, which
  doesn't result in any changes to our configuration files or to the
  behaviour of our deployed infrastructure, but does require re-syncing
  the Terraform state (by running `terraform apply`, which would
  essentially be a no-op if it were not for the next bullet point).
- The nix configuration changes in #12265 have changed the Linux CI
  nodes configuration but have not been deployed yet.

This PR is an audit log of the steps taken to rectify those and bring us
back to a state where our deployed configuration and our recorded
Terraform state both agree with our current `main` branch tip.
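For illustration, the kind of change the new provider forces looks roughly like this (resource and project names are placeholders, and the newly required argument is assumed to be `project`):

```
# Hypothetical sketch: newer google providers require the project to be
# set explicitly on IAM member resources instead of being inherited.
resource "google_project_iam_member" "ci" {
  project = "example-project"  # newly required argument (assumed)
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:ci@example-project.iam.gserviceaccount.com"
}
```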

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-10 21:56:47 +00:00
Gary Verhaegen
648021a2e7
Fix es cluster (#12262)
audit log of actions taken to fix cluster post Winter break

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-05 16:29:48 +00:00
Samir Talwar
854b66ee2f
devenv: Use NIX_USER_CONF_FILES to set caches. (#12265)
* devenv: Factor out a function to get the Nix version.

* devenv: On newer versions of Nix, use `NIX_USER_CONF_FILES`.

This stacks, rather than overwrites.

* devenv: Append Nix caches, instead of overwriting them.

The "extra-" prefix tells Nix to append.

We also switch to non-deprecated configuration keys.
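As a sketch, a `nix.conf` fragment using the appending keys might look like this (cache URL and key are placeholders):

```
# "extra-" keys append to Nix's built-in defaults instead of replacing
# them (supported from Nix 2.4 onwards).
extra-substituters = https://nix-cache.example.com
extra-trusted-public-keys = nix-cache.example.com-1:PLACEHOLDERKEY=
```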

CHANGELOG_BEGIN
CHANGELOG_END

* devenv: Just require Nix v2.4 or newer.

* devenv: `NIX_USER_CONF_FILES` may not be set already.
2022-01-05 13:24:52 +01:00
Gary Verhaegen
93e616475e
Update ci nodes for copyright update (#12255)
Audit log for CI rotation.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-04 15:24:31 +00:00
Gary Verhaegen
d2e2c21684
update copyright headers (#12240)
New year, new copyright, new expected unknown issues with various files
that won't be covered by the script and/or will be but shouldn't change.

I'll do the details on Jan 1, but would appreciate this being
preapproved so I can actually get it merged by then.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-03 16:36:51 +00:00
Gary Verhaegen
0e30d468f9
expand CI cluster back (#12239)
To be done / merged on Jan 3.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-03 14:50:08 +00:00
Gary Verhaegen
a51f75d193
give a break to CI (#12238)
I can't think of a good reason to keep 30+ machines running over the
Winter break. I'll bump this back up on Jan 3.

CHANGELOG_BEGIN
CHANGELOG_END
2021-12-26 10:53:08 +01:00
Gary Verhaegen
c5708c96c5
es: mitigate log4j vuln (#12118)
CHANGELOG_BEGIN
CHANGELOG_END
2021-12-10 22:46:02 +00:00
Gary Verhaegen
b6b6ecd669
repair ES cluster (#12117)
The cluster shut down today. I'm still not sure why this happens, but
cycling the entire cluster seems to solve it so here goes.

CHANGELOG_BEGIN
CHANGELOG_END
2021-12-10 20:32:52 +01:00
Gary Verhaegen
349d812482
ci: increase hard drive space (not macOS) (#11983)
I've seen quite a few builds failing for lack of disk space recently,
sometimes as early as 2pm CET.

CHANGELOG_BEGIN
CHANGELOG_END
2021-12-06 19:41:11 +00:00
Gary Verhaegen
de8d15fb1e
fix Nix install on macOS nodes (#11696)
As part of the 2.4 release, the Nix installer has been changed to take
care of the volume setup (which we don't want it to do here). Because
that requires root access, they've decided to make multi-user install
the default, and to disable single-user install.

We could do an in-depth review of the difference and adapt our setup to
use a multi-user setup (we do use the multi-user setup on Linux, so
there's precedent), but as an immediate fix, we can keep our single-user
setup _and_ get the latest Nix by using the 2.3.16 installer and then
upgrading from within Nix itself. This _should_ keep working at least
for a while, as Linux still defaults to single-user.

CHANGELOG_BEGIN
CHANGELOG_END
2021-11-24 18:53:13 +01:00
Gary Verhaegen
5f52f00afb
increase linux cluster size (#11860)
CHANGELOG_BEGIN
CHANGELOG_END
2021-11-24 15:18:56 +00:00
Gary Verhaegen
ab520fbc51
Fix es (#11784)
CHANGELOG_BEGIN
CHANGELOG_END
2021-11-22 11:41:10 +01:00
Gary Verhaegen
fdde5353f4
fix hoogle by pinning nixpkgs (#11548)
Hoogle has been down for at least 24h according to user reports. What
seems to be happening is that our nixpkgs pinning is not taking effect,
and the nixpkgs version of Hoogle already includes the patch we are
trying to add. This confuses nix, which fails, and thus the boot
sequence is broken.

I've applied the minimal possible patch here (i.e. enforce the pin),
which gets things running again. I've already deployed this change.

We may want to look at bumping the nixpkgs snapshot.

CHANGELOG_BEGIN
CHANGELOG_END
2021-11-04 10:22:30 +00:00
Gary Verhaegen
ebe742098d
fix es ingest: file too large (#11415)
This fixes an issue where we try to upload a single JSON blob that is
bigger than the limit of 500MB. We could also raise the limit, I guess,
but that seems more error-prone.

CHANGELOG_BEGIN
CHANGELOG_END
2021-10-27 16:46:14 +02:00
Gary Verhaegen
5654d5cb48
fix es ingest for missing files (#11375)
If a job fails at build time, there are no test logs to process.
Currently this means the ingestion process is going to be stuck in an
endless loop of retrying that job and failing on the missing file.

This change should let us process jobs with no test files.

CHANGELOG_BEGIN
CHANGELOG_END
2021-10-25 14:40:33 +02:00
Gary Verhaegen
5e43f8c703
es: drop jobs-* indices (#10857)
We are currently ingesting Bazel events in two forms:

In the `events-*` indices, each Bazel event is recorded as a separate ES
object, with the corresponding job name as a field that can serve to
aggregate all of the events for a given job.

In the `jobs-*` indices, each job is ingested as a single (composite) ES
object, with the individual events as elements in a list-type field.

When I set up the cluster, I wasn't sure which one would be more useful,
so I included both. We now have a bit more usage experience and it turns
out the `events-*` form is the only one we use, so I think we should
stop ingesting everything twice and from now on create only the
`events-*` ones.

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-28 11:06:52 +02:00
Gary Verhaegen
fe9aeffeaf
Increase es disk size (#11019)
Disks are currently at 75% fullness, so seems like a good idea to bump a
bit. #10857 should reduce our needs, too, so this should last us a
while.

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-25 02:51:26 +02:00
Gary Verhaegen
28b8d9a1f7
bump dotnet (#10979)
This bumps dotnet to the version required by the latest azuresigntool,
and pins azuresigntool for the future.

As usual for live CI upgrades, this will be rolled out using the
blue/green approach. I'll keep each deployed commit in this PR.

For future reference, this is PR [#10979].

[#10979]: https://github.com/digital-asset/daml/pull/10979

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-22 16:39:40 +00:00
Gary Verhaegen
6f151e287e
save kibana exports (#10861)
As explained in #10853, we recently lost our ES cluster. While I'm not
planning on trusting Google's "rolling restart" feature ever again, we
can't exclude the possibility of future similar outages (without a
significant investment in the cluster, which I don't think we want to
do).

Losing the cluster is not a huge issue as we can always reingest the
data. Worst case we lose visibility for a few days. At least, as far as
the bazel logs are concerned.

Losing the Kibana data is a lot more annoying, as that is not derived
data and thus cannot be reingested. This PR aims to add a backup
mechanism for our Kibana configuration.

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-13 18:28:11 +00:00
Gary Verhaegen
8c9edd8522
es cluster tweaks (#10853)
On Sept 8 our ES cluster became unresponsive. I tried connecting to the
machines.

One machine had an ES Docker container that claimed to have started 7
weeks ago and stopped 5 weeks ago, while the machine's own uptime was 5
weeks. I assume GCP had decided to restart it for some reason. The init
script had failed on missing a TTY, hence the addition of the
`DEBIAN_FRONTEND` env var.

Two machines had a Docker container that had stopped on that day, resp.
6h and 2h before I started investigating. It wasn't immediately clear
what had caused the containers to stop.

On all three of these machines, I was able to manually restart the
containers and they were able to reform a cluster, though the state of
the cluster was red (missing shards).

The last two machines simply did not respond to SSH connection attempts.
Assuming it might help, I decided to try to restart the machines. As GCP
does not allow restarting individual machines when they're part of a
managed instance group, I tried clicking the "rolling restart" button
on the GCP console, which seemed like it would restart the machines. I
carefully selected "restart" (and not "replace"), started the process,
and watched GCP proceed to immediately replace all five machines, losing
all data in the process.

I then started a new cluster and used bigger (and more) machines to
reingest all of the data, and then fell back to the existing
configuration for the "steady" state. I'll try to keep a better eye on
the state of the cluster from now on. In particular, we should not have
a node down for 5 weeks without noticing.

I'll also try to find some time to look into backing up the Kibana
configuration, as that's the one thing we can't just reingest at the
moment.

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-13 11:12:02 +02:00
Gary Verhaegen
4093bbd58c
fix macOS Bazel cache (#10795)
macOS filesystems have been case-insensitive by default for years, and
in particular our laptops are, so if we want the cache to work as
expected, CI should be too.

Note: this does not apply to Nix, because the Nix partition is a
case-sensitive image per @nycnewman's script on laptops too.

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-07 13:31:57 +02:00
Andreas Herrmann
7b94b0674e
Map shortened scala test suite names to long names on Windows (#10628)
* Generate short to long name mapping in aspect

Maps shortened test names in da_scala_test_suite on Windows to their
long name on Linux and MacOS.

Names are shortened on Windows to avoid exceeding MAX_PATH.

* Script to generate scala test name mapping

* Generate scala-test-suite-name-map.json on Windows

changelog_begin
changelog_end

* Generate UTF-8 with Unix line endings

Otherwise the file will be formatted using UTF-16 with CRLF line
endings, which confuses `jq` on Linux.

* Apply Scala test name remapping before ES upload

* Pipe bazel output into intermediate file

Bazel writes the output of --experimental_show_artifacts to stderr
instead of stdout. In Powershell this means that these outputs are not
plain strings, but instead error objects. Simply redirecting these to
stdout and piping them into further processing will lead to
indeterministically missing items or indeterministically introduced
additional newlines which may break paths.

To work around this we extract the error message from error objects,
introduce appropriate newlines, and write the output to a temporary file
before further processing.

This solution is taken and adapted from
https://stackoverflow.com/a/48671797/841562

* Add copyright header

Co-authored-by: Andreas Herrmann <andreas.herrmann@tweag.io>
2021-08-24 17:03:45 +02:00
Gary Verhaegen
a6650c11e5
fix hoogle (#10555)
Hoogle was down because the machines got stuck in the `apt-get update`
stage.

CHANGELOG_BEGIN
CHANGELOG_END
2021-08-11 13:52:43 +00:00
Gary Verhaegen
449a72a86f
increase ES memory (#10318)
ES died (again) over the weekend, so I had to manually connect to each
node in order to restore it, and thus made another migration. This time
I opted to make a change, though. Lack of memory is a bit of a weak
hypothesis for the observed behaviour, but it's the only one I have at
the moment and, given how reliably ES has been crashing so far, it's
fairly easy to test.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-19 17:50:16 +02:00
Gary Verhaegen
a3b861eae8
refresh es cluster (#10300)
The cluster died yesterday. As part of recovery, I connected to the
machines and made manual changes. To ensure that we get back to a known,
documented setup I then proceeded to do a full blue -> green migration,
having not tainted any of the green machines with manual interventions.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-19 10:55:44 +02:00
Gary Verhaegen
2bcbd4e177
es: switch to persistent nodes (#10236)
A few small tweaks, but the most important change is giving up on
preemptible instances (for now at least), because I saw GCP kill 8 out
of the 10 nodes at exactly the same time and I can't really expect the
cluster to survive that.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-12 06:27:23 +00:00
Gary Verhaegen
2b67ebb5d4
tf: refactor appr var (#10232)
Two changes at the Terraform level, both with no impact on the actual
GCP state:

- There is no reason to make this value a `variable`: variables in
  Terraform are meant to be supplied at the CLI. `local` is the right
  abstraction here (i.e. set in the file directly).
- Using an unordered `for_each` set rather than a list so we don't have
  positional identity, meaning when adding someone at the top we don't
  need to destroy and recreate everyone else.
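A minimal sketch of the two points above (all names and roles are placeholders):

```
locals {
  # a `local`, set directly in the file, rather than a CLI-supplied variable
  appr_team = ["alice", "bob"]
}

resource "google_project_iam_member" "appr" {
  # for_each keys entries by value rather than position, so adding a
  # name at the top does not force recreating the others
  for_each = toset(local.appr_team)
  project  = "example-project"
  role     = "roles/example.approver"
  member   = "user:${each.value}@example.com"
}
```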

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-09 13:41:46 +02:00
Gary Verhaegen
202b7f7ae7
add akshay to appr team (#10229)
CHANGELOG_BEGIN
CHANGELOG_END
2021-07-09 10:55:16 +00:00
Gary Verhaegen
999577a1a7
tweak ES cluster (#10219)
This PR contains many small changes:

- A small refactoring whereby the "es-init" machine is now
  (syntactically) integrated with the two instance groups, to cut down a
  bit on repetition.
- The feeder machine is now preemptible, because I've seen it recover
  enough times that I'm confident this will not cause any issue.
- Indices are now sharded.
- Return values from ES are filtered, cutting down a bit on network
  usage and memory requirements to produce the responses.
- Bulk uploads for a single job are now done in parallel. This results
  in about a 2x speedup for ingestion.
- crontab was changed to every minute instead of every 5 minutes.
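The crontab change amounts to something like this (the script path is a placeholder):

```
# before: every 5 minutes
# */5 * * * * /home/feeder/ingest.sh
# after: every minute
* * * * * /home/feeder/ingest.sh
```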

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-08 19:20:35 +02:00
Gary Verhaegen
38734f02d7
es-feed: ignore invalid files (#10207)
We currently have about 1% (28 out of 2756) of our build logs that have
invalid JSON files. They are all about a `-profile` file being
incomplete, and since those files represent a single JSON object we
can't do smarter things like filtering invalid individual lines.

I haven't looked deeply into _why_ we create invalid files, but this
should let our ingestion process make some progress in the meantime.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-07 15:38:14 +00:00
Gary Verhaegen
cb1f4ec773
ci/windows: disable spool (#10200)
* ci/windows: disable spool

We're not expecting to print anything, and @RPS5's security newsletter
says this is a vector of attack.

CHANGELOG_BEGIN
CHANGELOG_END

* increase no-spool to 6

* Windows name truncation causing collisions

* update main group

* remove temp group
2021-07-07 12:44:33 +00:00
Gary Verhaegen
1d5ba4fa42
feed elasticsearch cluster (#10193)
This PR adds a machine that will, every 5 minutes, look at the GCS
bucket that stores Bazel metrics and push whatever it finds to
ElasticSearch.

A huge part of this commit is based on @aherrmann-da's work. You can
assume that all the good bits are his.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-06 19:46:14 +02:00
Gary Verhaegen
f7cf7c75b5
add kibana (#10152)
This PR adds a Kibana instance to each ES node, and duplicates the load
balancer mechanism to expose both raw ES and Kibana.

CHANGELOG_BEGIN
CHANGELOG_END
2021-06-30 14:08:03 +02:00
Gary Verhaegen
2dfe026cc2
add ES cluster (#10144)
This PR adds a basic ES cluster to our infrastructure, completely open
and unprotected but only accessible through VPN.

And, as of yet, through its IP address. I'm not sure whether it's worth
adding a DNS for it.

CHANGELOG_BEGIN
CHANGELOG_END
2021-06-29 17:50:45 +02:00
Gary Verhaegen
31a76a4a2a
allow CI pools to use any zone (#10069)
This morning we started with very restricted CI pools (2/6 for Windows
and 7/20 for Linux), apparently because the region we run in (us-east1)
has three zones, two of them were unable to allocate new nodes, and the
default policy is to distribute nodes evenly between zones.

I've manually changed the distribution policy. Unfortunately this option
is not yet available in our version of the GCP Terraform plugin.

CHANGELOG_BEGIN
CHANGELOG_END
2021-06-22 10:43:08 +02:00
Gary Verhaegen
646c956457
new windows signing (#9786)
CHANGELOG_BEGIN
CHANGELOG_END
2021-05-25 16:23:17 +02:00
Gary Verhaegen
45bca6e68b
test_windows_signing: install for u (#9776)
Turns out "`--global`" means "for this user".

CHANGELOG_BEGIN
CHANGELOG_END
2021-05-21 15:49:31 +02:00
Gary Verhaegen
4af6608185
fix signing machine (#9772)
Turns out PowerShell is not Bash. Who knew? 🤷

CHANGELOG_BEGIN
CHANGELOG_END
2021-05-21 12:36:56 +02:00
Gary Verhaegen
f5c5b634eb
prepare for EV Windows signing (#9758)
Setting up a non-disruptive way to test out EV signing of our Windows
artifacts.

CHANGELOG_BEGIN
CHANGELOG_END
2021-05-21 10:46:45 +02:00
Gary Verhaegen
96d72a6987
add Victor to app-runtime (#9556)
CHANGELOG_BEGIN
CHANGELOG_END
2021-05-04 13:35:28 +02:00
Gary Verhaegen
af4fffad0f
increase Linux CI size (#9533)
CHANGELOG_BEGIN
CHANGELOG_END
2021-04-29 18:17:10 +02:00
Gary Verhaegen
8a6cfacbff
more robust macOS cleanup (#9456)
We've recently seen a few cases where the macOS nodes ended up not
having the cache partition mounted. So far this has only happened on
semi-broken nodes (guest VM still up and running but host unable to
connect to it), so I haven't been able to actually poke at a broken
machine, but I believe this should allow a machine in such a state to
recover.

While we haven't observed a similar issue on Linux nodes (as far as I'm
aware), I have made similar changes there to keep both scripts in sync.

CHANGELOG_BEGIN
CHANGELOG_END
2021-04-21 12:10:47 +02:00
Moritz Kiefer
91b65e8004
Patch hoogle binary to include bugfix (#9366)
This includes https://github.com/ndmitchell/hoogle/pull/367.

As usual, I unfortunately cannot test this myself so please review
carefully.

Note that this will slightly increase compile times since we will now
build hoogle. However, we still only build hoogle rather than
everything which takes less than 2min on my very weak personal
laptop. We could integrate this with our nix cache but for now, I’m
not that worried about this.

changelog_begin
changelog_end

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
2021-04-09 14:17:03 +02:00
Gary Verhaegen
dfe26b93b0
fix hoogle (#9364)
After #9362, hoogle stopped returning any result. It was up and running,
but with an empty database.

Problem was two-fold:
1. In the switch to `nix`, I lost track of the fact that we were
   previously doing all the work in `~/hoogle` rather than `~/`, which
   broke the periodic refresh.
2. The initial setup has been broken for months; machines have been
   initializing with an empty DB and only getting data on the first
   refresh since #7370.

CHANGELOG_BEGIN
CHANGELOG_END
2021-04-08 18:45:47 +02:00
Gary Verhaegen
c220e05380
infra/hoogle: use nix (#9362)
This is a demonstration, commit-by-commit, of how to use the blue/green
deployment for Hoogle servers.

The actual change (using nix) is stolen from #9352.

CHANGELOG_BEGIN
CHANGELOG_END
2021-04-08 16:59:16 +02:00
Gary Verhaegen
948d4dd964
infra: hoogle blue/green tf (#9351)
Making it a bit easier to manage rollouts.

CHANGELOG_BEGIN
CHANGELOG_END
2021-04-08 10:07:11 +02:00
Gary Verhaegen
2745bc03a5
macos: move cache setup to step 2 (#9350)
The caches really need to be set up before we warm them up.

CHANGELOG_BEGIN
CHANGELOG_END
2021-04-07 21:42:15 +02:00
Gary Verhaegen
38c417e981
bump hoogle ubuntu (#9344)
16.04 is approaching EOL.

CHANGELOG_BEGIN
CHANGELOG_END
2021-04-07 20:11:36 +02:00
Gary Verhaegen
c97db24295
fix macOS cache cleaning (#9343)
The script needs to run once before the first build, otherwise the cache
folders get created on the main partition.

CHANGELOG_BEGIN
CHANGELOG_END
2021-04-07 18:46:44 +02:00
Gary Verhaegen
45c4ba2230
macos cache cleaning (#9245)
This is adapting the same approach as #9137 to the macOS machines. The
setup is very similar, except macOS apparently doesn't require any kind
of `sudo` access in the process.

The main reason for the change here is that while `~/.bazel-cache` is
reasonably fast to clean, cleaning just that has finally caught up to us
with a recent cleanup step that proudly claimed:

```
before: 638Mi free
after: 1.2Gi free
```

So we do need to start cleaning the other one after all.

CHANGELOG_BEGIN
CHANGELOG_END
2021-03-30 02:46:05 +02:00
Edward Newman
a98e03981f
Increase nix partition to max of 60Gb (#9259)

CHANGELOG_BEGIN
CHANGELOG_END
2021-03-30 01:06:58 +02:00
Gary Verhaegen
691edeacf2
ci: fix cache cleanup (#9137)
This is a continuation of #8595 and #8599. I somehow had missed that
`/etc/fstab` can be used to tell `mount` to let users mount some
filesystems with preset options.

This is using the full history of `mount` hardening so should be safe
enough. The option `user` in `/etc/fstab` automatically disables any kind
of `setuid` feature on the mounted filesystem, which is the main attack
vector I know of.
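For illustration, an `/etc/fstab` entry of this shape (device and mount point are placeholders); the `user` option both permits unprivileged mounting and implies `nosuid` and `nodev`:

```
# <device>    <mount point>        <type>  <options>     <dump> <pass>
/dev/sdb1     /home/vsts/.cache    ext4    noauto,user   0      2
```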

This works flawlessly on my local VM, so hopefully this time's the
charm. (It also happens to be my third PR specifically targeted on this
issue, so, who knows, it may even work.)

CHANGELOG_BEGIN
CHANGELOG_END
2021-03-16 17:51:38 +01:00
Gary Verhaegen
cfae2d88f5
update Terraform files to match reality (#8780)
* fixup terraform config

Two changes have happened recently that have invalidated the current
Terraform files:

1. The Terraform version has gone through a major, incompatible upgrade
   (#8190); the required updates for this are reflected in the first
   commit of this PR.
2. The certificate used to serve [Hoogle](https://hoogle.daml.com) was
   about to expire, so Edward created a new one and updated the config
   directly. The second commit in this PR updates the Terraform config
   to match that new, already-in-prod setting.

Note: This PR applies cleanly, as there are no resulting changes in
Terraform's perception of the target state from 1, and the change from 2
has already been applied through other channels.

CHANGELOG_BEGIN
CHANGELOG_END

* update hoogle cert
2021-02-08 17:25:04 +00:00
Gary Verhaegen
29197a96d7
add Stefano to appr (#8663)
CHANGELOG_BEGIN
CHANGELOG_END
2021-01-28 11:11:08 +00:00
Gary Verhaegen
ec5e419e3a
clean-up infra after Ubuntu 20.04 upgrade (#8653)
See #8617.

This has to be a separate PR so we could drain all the jobs still
expecting the linux-pool pool to exist.

CHANGELOG_BEGIN
CHANGELOG_END
2021-01-27 22:19:34 +01:00
Gary Verhaegen
fef712bf60
Upgrade linux nodes to 20.04 (#8617)
CHANGELOG_BEGIN

- Our Linux binaries are now built on Ubuntu 20.04 instead of 16.04. We
  do not expect any user-level impact, but please reach out if you
  do notice any issue that might be caused by this.

CHANGELOG_END
2021-01-27 17:38:34 +01:00
Bernhard Elsner
cda93db944
Daml case and logo (#8433)
* Replace many occurrences of DAML with Daml

* Update docs logo

* A few more CLI occurrences

CHANGELOG_BEGIN
- Change DAML capitalization and docs logo
CHANGELOG_END

* Fix some over-eager replacements

* A few more occurrences in md files

* Address comments in *.proto files

* Change case in comments and strings in .ts files

* Revert changes to frozen proto files

* Also revert LF 1.11

* Update get-daml.sh

* Update windows installer

* Include .py files

* Include comments in .daml files

* More instances in the assistant CLI

* some more help texts
2021-01-08 12:50:15 +00:00
Moritz Kiefer
9c2e2db34e
Include new Nix signing key in static nix config on CI nodes (#8407)
Our CI nodes install nix in multi-user mode. This means that changing
cache information is only available to certain trusted users for
security reasons. The CI user is not part of those so the cache info
from dev-env/etc/nix.conf is silently ignored.

We could consider not running in multi-user mode although from a
security pov this seems like a pretty sensible decision and our
signing keys change very rarely so for now, I would keep it.

changelog_begin
changelog_end
2021-01-06 13:24:34 +01:00
Gary Verhaegen
a925f0174c
update copyright notices for 2021 (#8257)
* update copyright notices for 2021

To be merged on 2021-01-01.

CHANGELOG_BEGIN
CHANGELOG_END

* patch-bazel-windows & da-ghc-lib
2021-01-01 19:49:51 +01:00
Gary Verhaegen
93f449d245
rename master to main (#8245)
As we strive for more inclusiveness, we are becoming less comfortable
with historically-charged terms being used in our everyday work.

This is targeted for merge on Dec 26, _after_ the necessary
corresponding changes at both the GitHub and Azure Pipelines levels.

CHANGELOG_BEGIN

- DAML Connect development is now conducted from the `main` branch,
  rather than the `master` one. If you had any dependency on the
  digital-asset/daml repository, you will need to update this parameter.

CHANGELOG_END
2020-12-27 14:19:07 +01:00
Gary Verhaegen
5c8ac44049
update macOS nodes README (#8243)
This is far from perfect but removes the blatantly wrong sections of the
README.

Note: as a README change, this is not really a standard change, but
because the README is under the infra folder, this PR does need the tag
to pass CI.

CHANGELOG_BEGIN
CHANGELOG_END
2020-12-10 16:48:12 +01:00
Gary Verhaegen
7c2ba6f996
infra: add prod label (#8140)
Requested by @nycnewman.

CHANGELOG_BEGIN
CHANGELOG_END
2020-12-03 01:55:43 +01:00
Gary Verhaegen
586e29adca
incident-118: investigate & fix (#8135)
incident-118: fruitless investigation; revert

This first commit just duplicates the existing configuration. Further
commits will make actual changes so they can be tracked by looking at
individual commits (rather than try to think up the diff by looking at
entirely new files).

CHANGELOG_BEGIN
CHANGELOG_END
2020-12-02 19:20:56 +01:00
Gary Verhaegen
841116bf1e
incident-118: linux machines unable to start (#8128)
Early investigation points to cloud logging install failing. Temporarily
disabling.

CHANGELOG_BEGIN
CHANGELOG_END
2020-12-02 10:11:11 +00:00
Gary Verhaegen
b23304c691
add default capability to macos (#5915)
This is the macOS part of #5912, which I have separated because our
macOS nodes have a different deployment process so it seemed easier to
track the deployment of the change separately.

CHANGELOG_BEGIN
CHANGELOG_END
2020-11-25 15:34:33 +01:00
Gary Verhaegen
e4638d9004
document how to kill nodes (#7782)
CHANGELOG_BEGIN
CHANGELOG_END
2020-10-22 15:44:48 +02:00
Gary Verhaegen
6ac61960e6
fix periodic-killer permissions (#7776)
I screwed up in #7771: `google_project_iam_binding` is defined as _the_
authoritative list of accounts for that role, not just a list of
accounts to add the role to. So in applying that rule yesterday, I
inadvertently stripped the periodic-killer machine of its role, and
therefore nothing got reset last night. The Terraform plan did not
mention this, unfortunately (though, arguably, consistently with the
semantics of the Terraform rules).
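The distinction, sketched with placeholder names:

```
# authoritative: this list *replaces* the full set of holders of the role
resource "google_project_iam_binding" "killers" {
  project = "example-project"
  role    = "roles/compute.instanceAdmin.v1"
  members = ["user:moritz@example.com", "user:gary@example.com"]
}

# additive: grants the role without touching any other holders
resource "google_project_iam_member" "periodic_killer" {
  project = "example-project"
  role    = "roles/compute.instanceAdmin.v1"
  member  = "serviceAccount:periodic-killer@example-project.iam.gserviceaccount.com"
}
```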

This is the same intent as #7771, but this one actually works. (Or at
least does not fail in the same way.)

CHANGELOG_BEGIN
CHANGELOG_END
2020-10-22 12:22:07 +02:00
Gary Verhaegen
bdc2b5a9b1
allow Moritz to kill machines (#7771)
Also, explicitly allow myself, rather than rely on my admin status.

CHANGELOG_BEGIN
CHANGELOG_END
2020-10-21 18:40:54 +02:00
Gary Verhaegen
6419ff2f34
remove leo (#7535)
Leo has left the team and so should not have access anymore.

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-30 18:16:57 +02:00
Gary Verhaegen
168345f4a8
let CI delete bazel cache items (#7514)
Recently we have been seeing lots of issues with the Bazel cache. It
does not seem like it would need to delete things, but the issues
cropped up about the same time we restricted the permissions, so it's
worth trying to revert that.

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-29 13:56:35 +02:00
Gary Verhaegen
2a38d03250
protect GCS bucket items (#7439)
Yesterday, a certificate expiration triggered the `patch_bazel_windows`
job to run when it shouldn't, and it overrode an artifact we depend on.
This was built from the same sources, but the build is not reproducible
so we ended up with a hash mismatch.

As far as I know, there is no good reason for CI to ever delete or
overwrite anything from our GCS buckets, so I'm removing its rights to
do so.

As an added safety measure, this PR also enables versioning on all
non-cache buckets (GCS does not support versioning on buckets with an
expiration policy).
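A sketch of the versioning part (bucket name and location are placeholders):

```
resource "google_storage_bucket" "data" {
  name     = "example-data-bucket"
  location = "us-east1"

  # keep previous object generations so an accidental overwrite or
  # delete remains recoverable
  versioning {
    enabled = true
  }
}
```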

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-18 15:59:23 +02:00
Gary Verhaegen
8ea85d1393
update certificates (#7432)
Our old wildcard certificate has expired. @nycnewman has already updated
our configuration to use new ones; this is just updating the tf files to
match.

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-17 17:36:35 +02:00
Gary Verhaegen
b9acc09a77
read access to data bucket for appr members (#7422)
We've been saving data there but not doing anything with it. Ideally
this data would be used by some sort of automated process, but in the
meantime (or while developing said processes), having at least some
people with read access can help.

This is a Standard Change requested by @cocreature.

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-16 18:25:23 +02:00
Gary Verhaegen
b4d211642c
fixup Terraform setup (#7373)
It looks like #6761 broke our Terraform setup by upgrading the nixpkgs
snapshot. That this has not been caught earlier is, I suppose, a
testament to how stable our infrastructure has become nowadays.

This is the same issue we had with the Google providers in #6402, i.e.
we are trying to pin the provider versions both at the nix level and at
the terraform level, with no way to force them to stay in sync.

I don't have a good proposal for such a way, and it seems rare and
innocuous enough to not warrant the investment to fix this at a more
fundamental level.

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-10 16:28:18 +02:00
Gary Verhaegen
4b13b18c8f
hoogle db as tarball (#7370)
We want to be able to support more than one package in our [Hoogle]
instance. In order to not have to list each file individually, we assume
the collection of Hoogle text files will be published as a tarball.

Note: we keep trying the existing file for now, because the deployment
of this change needs to be done in separate, asynchronous steps if we
want everything to keep working with no downtime:

1. We deploy the new version of the Hoogle configuration, which supports
   both the new and old file structure on the docs website (this PR).
2. After the next stable version (likely 1.6) is published, the docs
   site actually changes to the new format.
3. We can then clean up the Hoogle configuration.

Any other sequence will require turning off Hoogle and coordinating with
the docs update, which seems impractical.

[Hoogle]: https://hoogle.daml.com
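
The fetch step's new shape can be sketched as below. The real download would come from the docs site (URL illustrative); a locally built tarball stands in here so the sketch is self-contained and runnable:

```
# Instead of fetching individual Hoogle .txt files, unpack one tarball
# into the database directory. In production this would be roughly:
#   curl -sSfL https://docs.daml.com/hoogle_db.tar.gz | tar xz -C hoogle_db
# (URL illustrative). Simulated with a locally built tarball:
set -eu
rm -rf hoogle_src hoogle_db
mkdir -p hoogle_src hoogle_db
echo '@package daml-prim' > hoogle_src/daml-prim.txt
echo '@package daml-stdlib' > hoogle_src/daml-stdlib.txt
tar czf hoogle_db.tar.gz -C hoogle_src .
tar xzf hoogle_db.tar.gz -C hoogle_db
ls hoogle_db
```

The point of the tarball is exactly this: the consumer no longer needs to know the file list in advance, so packages can be added without touching the Hoogle configuration.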

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-10 15:39:09 +02:00
Gary Verhaegen
5b2319e137
multistep macos setup (#5768)

This updates the macOS node setup instructions to avoid repeating
identical work and network traffic across all machines through
initialization by building a "daily" image with all the tools and code
we need.

CHANGELOG_BEGIN
CHANGELOG_END

* Fix 3-running-box to remount nix partition

* updated scripts to use multi-step process

* add copyright notices

Co-authored-by: nycnewman <edward@digitalasset.com>
2020-08-18 16:01:02 +02:00
Gary Verhaegen
c8f31ca16a
switch CI nodes from n1-standard-8 to c2-* (#6514)

A while back (#4520), I did a bunch of performance tests when trying to
size up the requirements for the hosted macOS nodes we needed to buy. As
part of that testing, it looked like `c2-standard-8` nodes were faster
(full build down from ~95 to ~75 minutes) and marginally cheaper
($0.4176 vs $0.4280) than the `n1-standard-8` we are currently using.

Then I got distracted, and I forgot to upgrade our existing machines.
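
The change itself is a one-line edit to the instance template; a fragment for illustration, with all other fields unchanged:

```
resource "google_compute_instance_template" "vsts-agent-linux" {
  # ... other fields unchanged ...
  machine_type = "c2-standard-8" # previously "n1-standard-8"
}
```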

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-27 12:20:29 +02:00
Gary Verhaegen
2923048935
remove purge_old_agents (#6439)
This script was supposed to remove old agents from the Azure Pipelines
UI. It may have been useful at some time (notably, when we used
ephemeral instances, they did not necessarily get to run their shutdown
script), but as it stands now, it's broken. The output from that step
ends in:

```
error: 2 derivations need to be built, but neither local builds ('--max-jobs') nor remote builds ('--builders') are enabled
```

after listing the nix packages it would build. Furthermore, it does not
seem to be useful as I have not seen any spurious entry in the agents
list on Azure since we switched to permanent nodes, on either the Linux
or Windows side (and this would only run on Linux, if it ran).

I'm also not convinced it ever ran, as I used to see a lot of spurious
machines on both Linux and Windows when we did use ephemeral instances.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-20 17:37:24 +02:00
Gary Verhaegen
d839acdbce
increase nix cache retention time (#6437)
The nix cache is currently only 3.5GB, and GHC takes a long time to
build, so I think the convenience vs. cost tradeoff is in favour of
keeping things for a bit longer.
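
In the Terraform GCP provider, cache retention is an age-based lifecycle rule on the bucket. A sketch with placeholder name and an illustrative retention value, not the one deployed:

```
resource "google_storage_bucket" "cache" {
  name     = "example-nix-cache" # placeholder
  location = "US"

  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      age = 60 # days; illustrative value
    }
  }
}
```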

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-20 16:25:02 +02:00
Gary Verhaegen
aa86a64842
remove temp linux nodes (#6410)
This is the last step of the plan outlined in #6405. As of opening this
PR, "old" nodes are back up, "temp" nodes are disabled at the Azure
level, and there is no job running on either (🤔). In other
words, this can be deployed as soon as it gets a stamp.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-18 13:20:56 +00:00
Gary Verhaegen
72f428d8df
macos nodes: add nix redirect (#6406)
See #6400; split out as separate PR so master == reality and we can
track when this is done. @nycnewman please merge this once the change
is deployed.

Note: it has to be deployed before the next restart; nodes will _not_ be
able to boot with the current configuration.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-18 14:51:25 +02:00
Gary Verhaegen
d01715bf2f
add redirect to nix curl (linux) (#6407)
This is the second PR in the plan outlined in #6405. I have already
disabled the old nodes so no new job will get started there; I will,
however, wait until I've seen a few successful builds on the new nodes
before pulling the plug.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-18 14:08:21 +02:00
Gary Verhaegen
561c392b69
duplicate linux CI cluster (#6405)
This PR duplicates the linux CI cluster. This is the first in a
three-PR plan to implement #6400 safely while people are working.

I usually do cluster updates over the weekend because they require
shutting down the entire CI system for about two hours. This is
unfortunately not practical while people are working, and timezones make
it difficult for me to find a time where people are not working during
the week.

So instead the plan is as follows:

1. Create a duplicate of our CI cluster (this PR).
2. Wait for the new cluster to be operational (~90-120 minutes in my
   experience).
3. In the Azure Pipelines config screen, disable all the nodes of the
   "old" cluster, so all new jobs get assigned to the temp cluster. Wait
   for all jobs to finish on the old cluster.
4. Update the old cluster. Wait for it to be deployed. (Second PR.)
5. In Azure, disable temp nodes, wait for jobs to drain.
6. Delete temp nodes (third PR).

Reviewing this PR is best done by verifying you can reproduce the
following shell session:

```
$ diff vsts_agent_linux.tf vsts_agent_linux_temp.tf
4,7c4,5
< resource "secret_resource" "vsts-token" {}
<
< data "template_file" "vsts-agent-linux-startup" {
<   template = "${file("${path.module}/vsts_agent_linux_startup.sh")}"
---
> data "template_file" "vsts-agent-linux-startup-temp" {
>   template = "${file("${path.module}/vsts_agent_linux_startup_temp.sh")}"
16c14
< resource "google_compute_region_instance_group_manager" "vsts-agent-linux" {
---
> resource "google_compute_region_instance_group_manager" "vsts-agent-linux-temp" {
18,19c16,17
<   name               = "vsts-agent-linux"
<   base_instance_name = "vsts-agent-linux"
---
>   name               = "vsts-agent-linux-temp"
>   base_instance_name = "vsts-agent-linux-temp"
24,25c22,23
<     name              = "vsts-agent-linux"
<     instance_template = "${google_compute_instance_template.vsts-agent-linux.self_link}"
---
>     name              = "vsts-agent-linux-temp"
>     instance_template = "${google_compute_instance_template.vsts-agent-linux-temp.self_link}"
36,37c34,35
< resource "google_compute_instance_template" "vsts-agent-linux" {
<   name_prefix  = "vsts-agent-linux-"
---
> resource "google_compute_instance_template" "vsts-agent-linux-temp" {
>   name_prefix  = "vsts-agent-linux-temp-"
52c50
<     startup-script = "${data.template_file.vsts-agent-linux-startup.rendered}"
---
>     startup-script = "${data.template_file.vsts-agent-linux-startup-temp.rendered}"
$ diff vsts_agent_linux_startup.sh vsts_agent_linux_startup_temp.sh
149c149
< su --command "sh <(curl https://nixos.org/nix/install) --daemon" --login vsts
---
> su --command "sh <(curl -sSfL https://nixos.org/nix/install) --daemon" --login vsts
$
```

and reviewing that diff, rather than looking at the added files in their
entirety. The name changes are benign and needed for Terraform to
appropriately keep track of which node belongs to the old vs the temp
group. The only change that matters is the new group has the `-sSfL`
flag so they will actually boot up. (Hopefully.)

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-18 13:04:19 +02:00
Gary Verhaegen
fba57470a5
restore terraform to working state (#6402)
It looks like some nix update has broken our current Terraform setup.
The Google provider plugin has changed its reported version to 0.0.0;
poking at my local nix store seems to indicate we actually get 3.15, but
🤷.

This PR also reverts the infra part of #6400 so we get back to master ==
reality.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-18 12:15:27 +02:00
Moritz Kiefer
2c1d4cb805
Fix nix installation (#6400)
Nix now requires `-L`; I’ve gone ahead and normalized everything to
use `-sfL`, which we were already using in one place.

changelog_begin
changelog_end
2020-06-18 10:34:08 +02:00
Gary Verhaegen
b9fbba7fc5
shorten Windows CI username (#6190)
Keeping CI working on Windows involves a constant fight against
MAX_PATH, which is a very short 260 characters. As the username appears
in some paths, sometimes multiple times, we can save a few precious
characters by having it shorter.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-06 15:03:15 +02:00
Edward Newman
9a073cebd9
Macos fix nix installer for build agent servers (#6133)
* Fix issue with xz dependency missing for Nix installer

CHANGELOG_BEGIN
- MacOS - fix Nix installer dependency for xz
CHANGELOG_END

* additional changes for new Nix installer for Catalina dependencies
2020-05-28 14:04:01 +02:00
Edward Newman
be4f85d165
Fix launchd killing VMWare process at end of script execution (#6006)
* Fix launchd killing VMWare process at end of script execution

CHANGELOG_BEGIN
Fix issue with MacOS Catalina launchd killing VMWare instance on rebuild (AbandonProcessGroup)
CHANGELOG_END
2020-05-18 10:54:15 -04:00
Gary Verhaegen
bda565fa44
patching Bazel on Windows (infra bits, no patch yet) (#5918)
patch Bazel on Windows (ci setup)

We have a weird, intermittent bug on Windows where Bazel gets into a
broken state. To investigate, we need to patch Bazel to add more debug
output than present in the official distribution. This PR adds the basic
infrastructure we need to download the Bazel source code, apply a patch,
compile it, and make that binary available to the rest of the build.
This is for Windows only as we already have the ability to do similar
things on Linux and macOS through Nix.

This PR does not contain any interesting patch to Bazel, just the minimum
that we can check we are actually using the patched version.

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-12 23:16:04 +02:00
Edward Newman
0ec0cc335f
Updates to support VMWare variant of Hypervisor for MacOS Build Nodes (#5940)
* Updates to support VMWare variant of Hypervisor

* Update infra/macos/scripts/rebuild-crontask.sh

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>

* Update infra/macos/scripts/run-agent.sh

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
2020-05-12 09:36:40 -04:00
Gary Verhaegen
4a6ab84b69
add default machine capability (#5912)

We semi-regularly need to do work that has the potential to disrupt a
machine's local cache, rendering it broken for other streams of work.
This can include upgrading nix, upgrading Bazel, debugging caching
issues, or anything related to Windows.

Right now we do not have any good solution for these situations. We can
either not do those streams of work, or we can proceed with them and
just accept that all other builds may get affected depending on which
machine they get assigned to. Debugging broken nodes is particularly
tricky as we do not have any way to force a build to run on a given
node.

This PR aims at providing a better alternative by (ab)using an Azure
Pipelines feature called
[capabilities](https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/agents?view=azure-devops&tabs=browser#capabilities).
The idea behind capabilities is that you assign a set of tags to a
machine, and then a job can express its
[demands](https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml),
i.e. specify a set of tags machines need to have in order to run it.

Support for this is fairly badly documented. We can gather from the
documentation that a job can specify two things about a capability
(through its `demands`): that a given tag exists, and that a given tag
has an exact specified value. In particular, a job cannot specify that a
capability should _not_ be present, meaning we cannot rely on, say,
adding a "broken" tag to broken machines.

Documentation on how to set capabilities for an agent is basically
nonexistent, but [looking at the
code](https://github.com/microsoft/azure-pipelines-agent/blob/master/src/Microsoft.VisualStudio.Services.Agent/Capabilities/UserCapabilitiesProvider.cs)
indicates that they can be set by using a simple `key=value`-formatted
text file, provided we can find the right place to put this file.

This PR adds this file to our Linux, macOS and Windows node init scripts
to define an `assignment` capability and adds a demand for a `default`
value on each job. From then on, when we hit a case where we want a PR
to run on a specific node, and to prevent other PRs from running on that
node, we can manually override the capability from the Azure UI and
update the demand in the relevant YAML file in the PR.
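
The agent-side half can be sketched as below, assuming the `.capabilities` file sits in the agent's root directory (a local stand-in directory is used here so the sketch is runnable); the job side then declares `demands: assignment -equals default` in its YAML:

```
# The Azure Pipelines agent reads user capabilities from KEY=VALUE lines
# in a .capabilities file in its root directory. "agent_dir" stands in
# for the real agent install directory in this sketch.
set -eu
rm -rf agent_dir
mkdir agent_dir
printf 'assignment=default\n' > agent_dir/.capabilities
cat agent_dir/.capabilities
```

Overriding a node is then a matter of editing that one value (via the Azure UI or the file) and matching it from the PR's YAML.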

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-09 18:21:42 +02:00
Gary Verhaegen
6aac32480a
hopefully fix memory issue with pg on macos CI (#5824)
We have seen the following error message crop up a couple times
recently:

```
FATAL:  could not create shared memory segment: No space left on device
DETAIL:  Failed system call was shmget(key=5432001, size=56, 03600).
HINT:  This error does *not* mean that you have run out of disk space.
It occurs either if all available shared memory IDs have been taken, in
which case you need to raise the SHMMNI parameter in your kernel, or
because the system's overall limit for shared memory has been reached.
    The PostgreSQL documentation contains more information about shared
memory configuration.
child process exited with exit code 1
```

Based on [the PostgreSQL
documentation](https://www.postgresql.org/docs/12/kernel-resources.html),
this should fix it.
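
On macOS the relevant knobs are the `kern.sysv.shm*` sysctls. The values below are illustrative only, not the ones deployed; see the linked PostgreSQL docs for how to size them:

```
# /etc/sysctl.conf (illustrative values)
kern.sysv.shmmax=4194304
kern.sysv.shmmni=64
kern.sysv.shmall=1024
```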

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-04 14:32:23 -04:00
Edward Newman
01c784659f
Minor changes to MacOS infra config (#5673) 2020-04-22 18:57:40 +02:00
Gary Verhaegen
43def51fce
add puppeteer dependencies to Linux nodes (#5575)
See #5540 for context.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-17 01:32:25 +02:00