64 Commits

Author SHA1 Message Date
Gary Verhaegen
b9fbba7fc5
shorten Windows CI username (#6190)
Keeping CI working on Windows involves a constant fight against
MAX_PATH, which is a very short 260 characters. As the username appears
in some paths, sometimes multiple times, we can save a few precious
characters by having it shorter.
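
To make the arithmetic concrete, here is a minimal Python sketch. The path template and both usernames are hypothetical, chosen only for illustration; they are not the actual CI paths.

```python
MAX_PATH = 260  # classic Windows path length limit, in characters


def path_length(template: str, username: str) -> int:
    """Length of the path once every {user} placeholder is filled in."""
    return len(template.format(user=username))


def chars_saved(template: str, old: str, new: str) -> int:
    """Characters gained by switching from username `old` to `new`."""
    return path_length(template, old) - path_length(template, new)


# Hypothetical path in which the username appears twice.
TEMPLATE = r"C:\Users\{user}\AppData\Local\Temp\{user}\build\output"

saved = chars_saved(TEMPLATE, "longusername", "u")
```

Because the username appears twice in this made-up template, going from a 12-character name to a 1-character one saves 22 characters.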

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-06 15:03:15 +02:00
Edward Newman
9a073cebd9
Macos fix nix installer for build agent servers (#6133)
* Fix issue with xz dependency missing for Nix installer

CHANGELOG_BEGIN
- MacOS - fix Nix installer dependency for xz
CHANGELOG_END

* - additional changes for new Nix installer for Catalina dependencies
2020-05-28 14:04:01 +02:00
Edward Newman
be4f85d165
Fix launchd killing VMWare process at end of script execution (#6006)
* Fix launchd killing VMWare process at end of script execution

* Fix launchd killing VMWare process at end of script execution

CHANGELOG_BEGIN
Fix issue with MacOS Catalina Launchd killing VMWare instance on rebuild (AbandonProcessGroup)
CHANGELOG_END
2020-05-18 10:54:15 -04:00
Gary Verhaegen
bda565fa44
patching Bazel on Windows (infra bits, no patch yet) (#5918)
patch Bazel on Windows (ci setup)

We have a weird, intermittent bug on Windows where Bazel gets into a
broken state. To investigate, we need to patch Bazel to add more debug
output than present in the official distribution. This PR adds the basic
infrastructure we need to download the Bazel source code, apply a patch,
compile it, and make that binary available to the rest of the build.
This is for Windows only as we already have the ability to do similar
things on Linux and macOS through Nix.

This PR does not contain any interesting patch to Bazel, just the minimum
needed to check that we are actually using the patched version.

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-12 23:16:04 +02:00
Edward Newman
0ec0cc335f
Updates to support VMWare variant of Hypervisor for MacOS Build Nodes (#5940)
* Updates to support VMWare variant of Hypervisor

* Update infra/macos/scripts/rebuild-crontask.sh

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>

* Update infra/macos/scripts/run-agent.sh

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
2020-05-12 09:36:40 -04:00
Gary Verhaegen
4a6ab84b69
add default machine capability (#5912)
add default machine capability

We semi-regularly need to do work that has the potential to disrupt a
machine's local cache, rendering it broken for other streams of work.
This can include upgrading nix, upgrading Bazel, debugging caching
issues, or anything related to Windows.

Right now we do not have any good solution for these situations. We can
either not do those streams of work, or we can proceed with them and
just accept that all other builds may get affected depending on which
machine they get assigned to. Debugging broken nodes is particularly
tricky as we do not have any way to force a build to run on a given
node.

This PR aims at providing a better alternative by (ab)using an Azure
Pipelines feature called
[capabilities](https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/agents?view=azure-devops&tabs=browser#capabilities).
The idea behind capabilities is that you assign a set of tags to a
machine, and then a job can express its
[demands](https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml),
i.e. specify a set of tags machines need to have in order to run it.

Support for this is fairly badly documented. We can gather from the
documentation that a job can specify two things about a capability
(through its `demands`): that a given tag exists, and that a given tag
has an exact specified value. In particular, a job cannot specify that a
capability should _not_ be present, meaning we cannot rely on, say,
adding a "broken" tag to broken machines.

Documentation on how to set capabilities for an agent is basically
nonexistent, but [looking at the
code](https://github.com/microsoft/azure-pipelines-agent/blob/master/src/Microsoft.VisualStudio.Services.Agent/Capabilities/UserCapabilitiesProvider.cs)
indicates that they can be set by using a simple `key=value`-formatted
text file, provided we can find the right place to put this file.

This PR adds this file to our Linux, macOS and Windows node init scripts
to define an `assignment` capability and adds a demand for a `default`
value on each job. From then on, when we hit a case where we want a PR
to run on a specific node, and to prevent other PRs from running on that
node, we can manually override the capability from the Azure UI and
update the demand in the relevant YAML file in the PR.
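
The matching rules described above can be sketched in Python as follows. This is a simplified model for illustration, not the actual Azure Pipelines implementation; the `debug-pr` value is a hypothetical example of a manual override.

```python
from typing import Dict, Optional


def parse_capabilities(text: str) -> Dict[str, str]:
    """Parse a simple key=value text file, skipping lines without '='."""
    caps: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        caps[key.strip()] = value.strip()
    return caps


def satisfies(caps: Dict[str, str], demands: Dict[str, Optional[str]]) -> bool:
    """Check demands against capabilities.

    A demand value of None means "the capability must exist"; a string means
    "the capability must have exactly this value". There is deliberately no
    way to demand that a capability be absent.
    """
    for name, required in demands.items():
        if name not in caps:
            return False
        if required is not None and caps[name] != required:
            return False
    return True


default_node = parse_capabilities("assignment=default\n")
reserved_node = parse_capabilities("assignment=debug-pr\n")  # hypothetical override

normal_job = {"assignment": "default"}
debug_job = {"assignment": "debug-pr"}
```

In this model, `normal_job` matches only `default_node` and `debug_job` matches only `reserved_node`, which is the isolation the PR is after.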

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-09 18:21:42 +02:00
Gary Verhaegen
6aac32480a
hopefully fix memory issue with pg on macos CI (#5824)
We have seen the following error message crop up a couple of times
recently:

```
FATAL:  could not create shared memory segment: No space left on device
DETAIL:  Failed system call was shmget(key=5432001, size=56, 03600).
HINT:  This error does *not* mean that you have run out of disk space.
It occurs either if all available shared memory IDs have been taken, in
which case you need to raise the SHMMNI parameter in your kernel, or
because the system's overall limit for shared memory has been reached.
    The PostgreSQL documentation contains more information about shared
memory configuration.
child process exited with exit code 1
```

Based on [the PostgreSQL
documentation](https://www.postgresql.org/docs/12/kernel-resources.html),
this should fix it.

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-04 14:32:23 -04:00
Edward Newman
01c784659f
Minor changes to MacOS infra config (#5673)
2020-04-22 18:57:40 +02:00
Gary Verhaegen
43def51fce
add puppeteer dependencies to Linux nodes (#5575)
See #5540 for context.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-17 01:32:25 +02:00
Gary Verhaegen
b3c428e76f
Macos boxes for ci (#5002)
set up macOS nodes

This PR documents how to create and manage macOS CI nodes. Because macOS
is not supported by our current cloud providers, these instructions are
geared towards creating VMs on physical machines we would need to host
and manage ourselves, i.e. these notes are mostly targeted at Ed.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-14 18:03:24 +02:00
Gary Verhaegen
08a5a64325
replace Windows agents (#5527)
It looks like the change in Windows agent names has caused an issue:
because Windows agents are not always properly cleaned up on shutdown,
i.e. they do not always have time to tell Azure they are going away, and
because GCP likes to reuse the same names for machines in a group, we've
been seeing errors like:

```
ERROR: The running command stopped because the preference variable
"ErrorActionPreference" or common parameter is set to Stop: Pool 11
already contains an agent with name VSTS-WIN-3QCX.
```

recently. Today, only 2 out of our 6 agents have managed to register
with Azure. This PR should fix that.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-14 13:58:42 +02:00
Gary Verhaegen
66e7068b39
better Windows machine names (#5374)
This is a small QoL improvement, mostly targeted at myself: have Windows
agents register with Azure using the name they display on the GCP
console, so I don't need to find a build and look at the "Agent
Diagnostics" step to figure out the correspondence between Azure and GCP.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-07 01:33:36 +02:00
Gary Verhaegen
10fefbae00
remove temp Windows machine (#5445)
CHANGELOG_BEGIN
CHANGELOG_END
2020-04-06 16:24:51 +02:00
Gary Verhaegen
1bf208ebbf
remove temp linux machine (#5351)
@cocreature told me he's done with the Linux machine. He's still using
the Windows one, so not removing it is not an oversight.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-01 18:36:03 +02:00
Gary Verhaegen
ce5ad647a3
fix cocreature's temp machine (#5341)
Our Linux startup script never finishes, as it ends with `exec`'ing to
the Azure agent. Since I've removed that part, the EXIT handler,
supposed to only kick in when an issue prevents the script from
finishing, triggers on normal exit, and the machine shuts down, making
it hard to use.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-01 15:28:56 +02:00
Gary Verhaegen
5ddf7ef497
temp machines for cocreature (#5335)
CI has been behaving weirdly for the past three days, with build times
on Linux and Windows regularly taking over 40 minutes, macOS builds
occasionally running for almost three hours, and generally a lot of OOM
exceptions (mostly on Windows, but a bit on the other two too).

We currently have no idea what changed, and have been having trouble
reproducing locally. As far as I'm aware, there has been no change to
the CI infrastructure itself, so we suspect we broke something in our
code somehow.

@cocreature has requested access to Linux and Windows machines with
similar specs and set-up as the CI ones, but without credentials. This
PR attempts to provide that.

Once the machines are up I will manually add accounts for @cocreature.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-01 14:13:03 +02:00
Gary Verhaegen
819210827e
fix permissions on periodic-killer (#5307)
Even though the command succeeds as far as deleting the machine goes, it
does log an error. That is probably why we recently had only one machine
deleted per night.

Something must have changed on the Google side recently to make this
additional permission required.

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-31 19:04:40 +02:00
Gary Verhaegen
38a5fea7a0
tweak periodic-killer (#5268)
1. Google says the instance is currently overutilized and suggests
   g1-small as a more appropriate size.
2. It occurred to me that the reason no error was logged might be that
   we lose them, so explicitly redirecting stderr too.

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-30 14:12:14 +02:00
Gary Verhaegen
7e960eb454
log periodic reboots (#5235)
It appears that most of our Windows machines have not been rebooted
since Tuesday 24. We detected this because one of them has run out of
disk space.

This is not good, but what's worse is I currently have no idea what
could be going wrong, and we are not logging anything at all in the
current setup, so even ssh'ing into the machine provides no insight.

This PR hopefully addresses that by:

1. Redirecting the outputs of the script to a file, and
2. `tail`ing that file from the startup script, so the logs will appear
   directly in the GCP web console. (This is what we currently do for
   the Azure agent logs on Linux.)

This PR also tells the script to not stop on the first failed machine
and keep trying.
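
As an illustration of the "log everything and keep trying" behaviour, here is a minimal Python sketch. It is not the actual script: `reboot` is a hypothetical callable standing in for whatever command the real script runs, and the node names are made up.

```python
import io
import sys
from typing import Callable, Iterable, List, TextIO


def reboot_all(
    machines: Iterable[str],
    reboot: Callable[[str], None],
    log: TextIO = sys.stderr,
) -> List[str]:
    """Try to reboot every machine, logging each outcome.

    A failure on one machine is logged and recorded, but does not stop
    the loop from moving on to the next machine.
    """
    failed: List[str] = []
    for name in machines:
        try:
            reboot(name)
            print(f"rebooted {name}", file=log)
        except Exception as exc:  # keep going whatever went wrong
            print(f"failed to reboot {name}: {exc}", file=log)
            failed.append(name)
    return failed


def fake_reboot(name: str) -> None:
    """Hypothetical stand-in that fails for one machine."""
    if name == "node-2":
        raise RuntimeError("boom")


log = io.StringIO()  # stands in for the log file
failed = reboot_all(["node-1", "node-2", "node-3"], fake_reboot, log)
```

Here the failure on `node-2` is written to the log and `node-3` is still attempted.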

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-27 21:35:49 +01:00
Gary Verhaegen
1872c668a5
replace DAML Authors with DA in copyright headers (#5228)
Change requested by Manoj.

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-27 01:26:10 +01:00
Gary Verhaegen
7d665d6163
fix tf config for GCP default (#5158)
It looks like GCP doesn't like not having a "page suffix" set, so it
sets a default. Except somehow Terraform doesn't know it's a default
value, so when trying to plan without the (optional) website value set,
Terraform will always find that the deployed state has changed.

With this change, we set it to a value that doesn't exist and won't
work, but at least Terraform will see that the deployed state matches
the configured one.

Note: this PR is a bit special as far as "changes" go as there will be
nothing to apply: applying current master tries to get rid of this
website.main_page_suffix value, but it's back on the next run. With this
patch, `terraform plan` declares "nothing to apply", so this PR itself
won't (need to) be applied.
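
A toy Python model of this situation, under loud assumptions: the server-side default value and the placeholder suffix below are hypothetical, and `apply`/`plan` only mimic the idea of comparing configured and deployed state, not Terraform's real logic.

```python
from typing import Dict, Set

SERVER_DEFAULT = "server-default-suffix"  # hypothetical; the real default is not stated


def apply(configured: Dict[str, str]) -> Dict[str, str]:
    """Deploy the configuration; the server fills in a default if the field is unset."""
    deployed = dict(configured)
    deployed.setdefault("main_page_suffix", SERVER_DEFAULT)
    return deployed


def plan(configured: Dict[str, str], deployed: Dict[str, str]) -> Set[str]:
    """Return the fields whose configured and deployed values differ."""
    keys = set(configured) | set(deployed)
    return {k for k in keys if configured.get(k) != deployed.get(k)}


# Field left unset: the plan is never clean, however often we apply.
unset_config: Dict[str, str] = {}
drift_without_value = plan(unset_config, apply(unset_config))

# Field set explicitly (to a hypothetical placeholder): the plan is clean.
pinned_config = {"main_page_suffix": "does-not-exist"}
drift_with_value = plan(pinned_config, apply(pinned_config))
```

In this model `drift_without_value` always contains `main_page_suffix`, while `drift_with_value` is empty.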

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-24 13:33:59 +01:00
Gary Verhaegen
4095538acf
match terraform with reality (#5143)
Our current Terraform setup attempts to create three static files on our
GCS buckets. The issue is that these buckets are configured to
automatically delete files that are older than X days, and there is no
way to exclude specific files from that. Therefore, the created files
disappear after some time, and running `terraform plan` suddenly looks
like the infrastructure has changed.

Moreover, the added value of these three files seems questionable: two
of them provide `index.html` type of functionality for our two caches,
whereas the third is automatically created by `nix` when pushing to the
cache anyway (if it doesn't exist already).

This PR also reduces the cache eviction time for the nix cache to 60
days, as a full year seemed a bit long.

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-24 12:07:16 +01:00
Gary Verhaegen
2b951e7296
increase linux nodes to 10 (#4634)
We're still seeing cases where we are hampered by a lack of Linux nodes,
so increasing this again.

CHANGELOG_BEGIN
CHANGELOG_END
2020-02-20 17:02:41 +00:00
Gary Verhaegen
3e94f29a6a
increase linux pool (#4565)
We've had a number of jobs waiting for >10 minutes at the busiest times
of the day since we switched to 6 nodes, so increasing back a bit.

I don't have very good visibility through the Azure UI, but it looks
like all of the jobs queued (and not running) right now are very short
ones so hopefully 8 should be enough.

CHANGELOG_BEGIN
CHANGELOG_END
2020-02-18 13:33:32 +00:00
Gary Verhaegen
c8e6486c79
pin Terraform plugin versions (#4519)
We're currently depending on a floating "latest", which is often a bad
idea. Today my machine decided to upgrade the google plugin, which is now
specifying some new fields for the GCS objects, and therefore `terraform
plan` does not look clean anymore, even though there has been no change
to the terraform files (nor to the infrastructure).

This PR aims to make our Terraform setup more reproducible by pinning
Terraform plugin versions. It's also a way to track the application of
the "new" Terraform setup, as it is technically a standard change
(though hopefully a very safe one).

CHANGELOG_BEGIN
CHANGELOG_END
2020-02-14 13:52:27 +01:00
Gary Verhaegen
0a251b3fa5
switch CI nodes to permanent (#4455)
CHANGELOG_BEGIN
CHANGELOG_END
2020-02-11 02:07:42 +01:00
Gary Verhaegen
1681922f90
ci: temp machines for scheduled killing experiment (#4386)
* ci: temp machines for scheduled killing experiment

Based on our discussions last week, I am exploring ways to move us to
permanent machines instead of preemptible ones. This should drastically
reduce the number of "cancelled" jobs.

The end goal is to have:

1. An instance group (per OS) that defines the actual CI nodes; this
would be pretty much the same as the existing ones, but with
`preemptible` set to false.
2. A separate machine that, on a cron (say at 4AM UTC), destroys all the
CI nodes.

The hope is that the group managers, which are set to maintain 10 nodes,
will then recreate the "missing" nodes using their normal starting
procedure.
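
The hoped-for interaction can be modelled with a small Python sketch. This is a toy simulation, not the GCP API: `InstanceGroup`, `reconcile`, `nightly_kill` and the node names are all hypothetical, and whether real group managers behave this way is exactly what the experiment is meant to find out.

```python
from typing import List


class InstanceGroup:
    """Toy stand-in for a group manager that maintains a target number of nodes."""

    def __init__(self, target_size: int) -> None:
        self.target_size = target_size
        self.nodes: List[str] = []
        self._counter = 0
        self.reconcile()

    def reconcile(self) -> None:
        """Create fresh nodes until the group is back at its target size."""
        while len(self.nodes) < self.target_size:
            self._counter += 1
            self.nodes.append(f"ci-node-{self._counter}")


def nightly_kill(group: InstanceGroup) -> None:
    """What the separate 'killer' machine would do on its cron: destroy every node."""
    group.nodes.clear()


group = InstanceGroup(target_size=10)
before = list(group.nodes)
nightly_kill(group)
group.reconcile()  # the group manager notices the missing nodes
after = list(group.nodes)
```

After the kill and reconcile, the group is back to 10 nodes, none of which existed before.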

However, there are a lot of unknowns I would like to explore, and I need
a playground for that. This is where this PR comes in. As it stands, it
creates one "killer" machine and a temporary group manager. I will use
these to experiment with the GCP API in various ways without interfering
with the real CI nodes.

This experimentation will likely require multiple `terraform apply` with
multiple different versions of the associated files, as well as
connecting to the machines and running various commands directly from
them. I will ensure all of that only affects the new machines created as
part of this PR, and therefore believe we do not need to go through a
separate round of approval for each change.

Once I have finished experimenting, I will create a new PR to clean up
the temporary resources created with this one and hopefully set up a
more permanent solution.

CHANGELOG_BEGIN
CHANGELOG_END

* add missing zone for killer instance

* add compute scope to killer

* authorize Terraform to shutdown killer to update it

* change in plans: use a service account instead

* .

* add compute.instances.list permission

* add compute.instances.delete permission

* add cron script

* obligatory round of extra escaping

* fix PATH issue & crontab format

* smaller machine & less frequent reboots
2020-02-07 21:04:03 +01:00
Gary Verhaegen
852fc7cd1a
remove temp debug ci nodes (#4373)
Following the happy resolution of #4370 in #4371, we do not need the
temporary nodes anymore. This PR therefore removes them.

CHANGELOG_BEGIN
CHANGELOG_END
2020-02-04 15:54:03 +01:00
Gary Verhaegen
5606ab350c
fix Windows CI node startup script (#4371)
This is an attempt to apply a potential fix discovered as part of the
investigation in #4370. The issue seems to be that Chocolatey is using a
protocol deemed not secure enough and disabled in recent Windows images
(our node creation script dynamically selects the latest "Windows 2016"
server image from GCP).

CHANGELOG_BEGIN
CHANGELOG_END
2020-02-04 14:37:53 +01:00
Gary Verhaegen
48f39beda2
add Windows debug machine (#4370)
Today we don't have any Windows machine in the CI pool. The machine
template has not changed since 2019-11-21, yet as of today when the
machine starts GCP proudly declares

> GCEMetadataScripts: No startup scripts to run.

despite the script being defined as `sysprep-specialize-script-ps1`, as
per the
[documentation](https://cloud.google.com/compute/docs/startupscript).
Also, it used to work and we haven't changed anything.

I'm not quite sure what's going on and how to investigate, but I think
at the very least we can try to unblock the team by having a set of
machines we initialize manually. This PR is meant to do that.

This is the same changeset as a877491139
and 16da700532, except that it now
specifies 5 machines instead of just one.

CHANGELOG_BEGIN
CHANGELOG_END
2020-02-04 14:30:56 +01:00
Gary Verhaegen
6233f66ff6
remove debug Windows machine (#4267)
CHANGELOG_BEGIN
CHANGELOG_END
2020-01-29 18:07:53 +01:00
Gary Verhaegen
16da700532
temporary Windows machine for Andreas (#4165)
The recent changes to the way in which we build npm packages with Bazel
have caused a lot of issues on Windows. To debug those, Andreas has
requested a temporary machine.

This is pretty much an exact replica of #3294 (a87749113), with the same
plan:

1. I run terraform apply once this PR is merged.
2. I manually, through the GCP web console, set a dummy password for that
  machine's RDP connection and transmit that to @aherrmann-da through
  Slack.
3. @aherrmann-da debugs the issue.
4. I create a PR to roll back this one, then apply it once it's merged.

Note: I have verified that master applies cleanly prior to opening this
PR.

CHANGELOG_BEGIN
CHANGELOG_END
2020-01-22 19:10:01 +01:00
Gary Verhaegen
878429e3bf
update copyright notices to 2020 (#3939)
copyright update 2020

* update template
* run script: `dade-copyright-headers update .`
* update script
* manual adjustments
* exclude frozen proto files from further header checks (by adding NO_AUTO_COPYRIGHT files)
2020-01-02 21:21:13 +01:00
Gary Verhaegen
07074a4759
remove Windows debug machine (#3451)
2019-11-13 18:33:15 +01:00
Gary Verhaegen
d4c38a3763
add gcs bucket for ledger dumps (#3374)
2019-11-07 14:41:15 +00:00
Gary Verhaegen
62dcbd86b5
pin hoogle version to avoid surprises (#3322)
2019-11-05 18:14:29 +00:00
Gary Verhaegen
a877491139
temporary Windows CI instance for debugging (#3294)
Create a temporary CI machine that looks just like the real ones specifically for debugging.
2019-11-04 11:52:27 +01:00
Gary Verhaegen
13e6f581e3
fix hoogle; revert cache buckets ACL changes (#3062)
2019-09-27 15:42:31 +01:00
Gary Verhaegen
99ea93168d
update copyright notices (#2499)
2019-08-13 17:23:03 +01:00
Gary Verhaegen
bf5995f529
remove mentions of da-int servers (#2485)
2019-08-12 10:42:41 +01:00
Florian Klink
14ecfd7bae
infra: add acls for google_storage_objects created via tf (#2460)
This ensures objects in the google storage bucket created by terraform
have the proper publicRead acl.
2019-08-08 19:13:15 +02:00
Gary Verhaegen
36070476c3
collect historical download data (#2003)
2019-07-04 11:23:51 +00:00
Florian Klink
1cd5bb2492
infra: move index.html outside gcp_cdn_bucket module (#1716)
* infra: gcp_cdn_bucket: update comment

The cache retention can be configured, while the comment suggests it's
hardcoded.

* infra: don't create index.html inside gcp_cdn_bucket module

We might want to add a different index.html per bucket, so move that
code outside the module and into the bucket-specific terraform files.

Also add bucket-specific index.html files.
2019-07-02 11:14:21 +01:00
Gary Verhaegen
a1424d3446
add autohealing to hoogle cluster (#1906)
2019-06-27 05:46:01 +00:00
Gary Verhaegen
18aee24e0f
fix hoogle cron escaping (#1902)
2019-06-26 18:42:23 +00:00
Gary Verhaegen
31171ec6b6
terraform files for hoogle server (#1660)
2019-06-22 00:15:52 +00:00
Bolek@DigitalAsset
1a62841616
infra: add docker daemon to ci agent (#1566)
* installs docker and adds vsts user to docker group
2019-06-08 22:31:55 +00:00
Gary Verhaegen
4120ef2d1b
[linux/ci] fix logging agent (#1356)
There are two issues with the current setup:

- iptables entry prevents connecting to the metadata server, and
- machines are given insufficient permissions.
2019-05-30 15:36:57 +00:00
Gary Verhaegen
ac719e7927
[ci/linux] keep daml copy until it's actually not needed anymore (#1349)
The existing script is deleting the daml directory too early, leading to
the "shutdown agents" step failing.
2019-05-23 15:25:37 +00:00
Gary Verhaegen
c762d491ea
target s3 bucket with docs refresh script (#1287)
There is no simple way to configure GCS to serve the desired security
headers, so instead the script will keep updating the existing s3
bucket.

Consequent changes:

- Add aws cli tool to dev-env
- Remove docs bucket from Terraform
2019-05-21 22:26:07 +00:00