Commit Graph

93 Commits

Author SHA1 Message Date
Bernhard Elsner
cda93db944
Daml case and logo (#8433)
* Replace many occurrences of DAML with Daml

* Update docs logo

* A few more CLI occurrences

CHANGELOG_BEGIN
- Change DAML capitalization and docs logo
CHANGELOG_END

* Fix some over-eager replacements

* A few more occurrences in md files

* Address comments in *.proto files

* Change case in comments and strings in .ts files

* Revert changes to frozen proto files

* Also revert LF 1.11

* Update get-daml.sh

* Update windows installer

* Include .py files

* Include comments in .daml files

* More instances in the assistant CLI

* some more help texts
2021-01-08 12:50:15 +00:00
Moritz Kiefer
9c2e2db34e
Include new Nix signing key in static nix config on CI nodes (#8407)
Our CI nodes install nix in multi-user mode. This means that changing
cache information is only available to certain trusted users for
security reasons. The CI user is not part of those so the cache info
from dev-env/etc/nix.conf is silently ignored.

We could consider not running in multi-user mode although from a
security pov this seems like a pretty sensible decision and our
signing keys change very rarely so for now, I would keep it.

changelog_begin
changelog_end
2021-01-06 13:24:34 +01:00
Gary Verhaegen
a925f0174c
update copyright notices for 2021 (#8257)
* update copyright notices for 2021

To be merged on 2021-01-01.

CHANGELOG_BEGIN
CHANGELOG_END

* patch-bazel-windows & da-ghc-lib
2021-01-01 19:49:51 +01:00
Gary Verhaegen
93f449d245
rename master to main (#8245)
As we strive for more inclusiveness, we are becoming less comfortable
with historically-charged terms being used in our everyday work.

This is targeted for merge on Dec 26, _after_ the necessary
corresponding changes at both the GitHub and Azure Pipelines levels.

CHANGELOG_BEGIN

- DAML Connect development is now conducted from the `main` branch,
  rather than the `master` one. If you had any dependency on the
  digital-asset/daml repository, you will need to update this parameter.

CHANGELOG_END
2020-12-27 14:19:07 +01:00
Gary Verhaegen
5c8ac44049
update macOS nodes README (#8243)
This is far from perfect but removes the blatantly wrong sections of the
README.

Note: as a README change, this is not really a standard change, but
because the README is under the infra folder, this PR does need the tag
to pass CI.

CHANGELOG_BEGIN
CHANGELOG_END
2020-12-10 16:48:12 +01:00
Gary Verhaegen
7c2ba6f996
infra: add prod label (#8140)
Requested by @nycnewman.

CHANGELOG_BEGIN
CHANGELOG_END
2020-12-03 01:55:43 +01:00
Gary Verhaegen
586e29adca
incident-118: investigate & fix (#8135)
incident-118: fruitless investigation; revert

This first commit just duplicates the existing configuration. Further
commits will make actual changes so they can be tracked by looking at
individual commits (rather than try to think up the diff by looking at
entirely new files).

CHANGELOG_BEGIN
CHANGELOG_END
2020-12-02 19:20:56 +01:00
Gary Verhaegen
841116bf1e
incident-118: linux machines unable to start (#8128)
Early investigation points to cloud logging install failing. Temporarily
disabling.

CHANGELOG_BEGIN
CHANGELOG_END
2020-12-02 10:11:11 +00:00
Gary Verhaegen
b23304c691
add default capability to macos (#5915)
This is the macOS part of #5912, which I have separated because our
macOS nodes have a different deployment process so it seemed easier to
track the deployment of the change separately.

CHANGELOG_BEGIN
CHANGELOG_END
2020-11-25 15:34:33 +01:00
Gary Verhaegen
e4638d9004
document how to kill nodes (#7782)
CHANGELOG_BEGIN
CHANGELOG_END
2020-10-22 15:44:48 +02:00
Gary Verhaegen
6ac61960e6
fix periodic-killer permissions (#7776)
I screwed up in #7771: `google_project_iam_binding` is defined as _the_
authoritative list of accounts for that role, not just a list of
accounts to add the role to. So in applying that rule yesterday, I
inadvertently stripped the periodic-killer machine of its role, and
therefore nothing got reset last night. The Terraform plan did not
mention this, unfortunately (though, arguably, consistently with the
semantics of the Terraform rules).

This is the same intent as #7771, but this one actually works. (Or at
least does not fail in the same way.)
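
To illustrate the difference described above, here is a toy Python model. It is not Terraform code and does not use any Google API: `apply_binding` and `apply_member` are hypothetical helpers that mimic an authoritative grant (the `google_project_iam_binding` semantics) and an additive grant (the way `google_project_iam_member` behaves), and the role and account names are made up.

```python
# Toy model of role membership; not the Terraform or GCP API.

def apply_binding(policy, role, members):
    """Authoritative: afterwards the role has exactly `members`."""
    updated = dict(policy)
    updated[role] = set(members)
    return updated


def apply_member(policy, role, member):
    """Additive: `member` is added and existing members are kept."""
    updated = dict(policy)
    updated[role] = set(policy.get(role, set())) | {member}
    return updated


# Hypothetical starting state: only the periodic-killer machine has the role.
policy = {"machine-killer": {"serviceAccount:periodic-killer"}}

after_binding = apply_binding(policy, "machine-killer", ["user:moritz", "user:gary"])
after_member = apply_member(policy, "machine-killer", "user:moritz")

print("binding:", sorted(after_binding["machine-killer"]))
print("member: ", sorted(after_member["machine-killer"]))
```

In this model the authoritative call silently drops the service account, which matches the failure described above; the additive call keeps it.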

CHANGELOG_BEGIN
CHANGELOG_END
2020-10-22 12:22:07 +02:00
Gary Verhaegen
bdc2b5a9b1
allow Moritz to kill machines (#7771)
Also, explicitly allow myself, rather than rely on my admin status.

CHANGELOG_BEGIN
CHANGELOG_END
2020-10-21 18:40:54 +02:00
Gary Verhaegen
6419ff2f34
remove leo (#7535)
Leo has left the team and so should not have access anymore.

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-30 18:16:57 +02:00
Gary Verhaegen
168345f4a8
let CI delete bazel cache items (#7514)
Recently we have been seeing lots of issues with the Bazel cache. It
does not seem like it would need to delete things, but the issues
cropped up about the same time we restricted the permissions, so it's
worth trying to revert that.

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-29 13:56:35 +02:00
Gary Verhaegen
2a38d03250
protect GCS bucket items (#7439)
Yesterday, a certificate expiration triggered the `patch_bazel_windows`
job to run when it shouldn't, and it overrode an artifact we depend on.
This was built from the same sources, but the build is not reproducible
so we ended up with a hash mismatch.

As far as I know, there is no good reason for CI to ever delete or
overwrite anything from our GCS buckets, so I'm removing its rights to
do so.

As an added safety measure, this PR also enables versioning on all
non-cache buckets (GCS does not support versioning on buckets with an
expiration policy).

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-18 15:59:23 +02:00
Gary Verhaegen
8ea85d1393
update certificates (#7432)
Our old wildcard certificate has expired. @nycnewman has already updated
our configuration to use new ones; this is just updating the tf files to
match.

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-17 17:36:35 +02:00
Gary Verhaegen
b9acc09a77
read access to data bucket for appr members (#7422)
We've been saving data there but not doing anything with it. Ideally
this data would be used by some sort of automated process, but in the
meantime (or while developing said processes), having at least some
people with read access can help.

This is a Standard Change requested by @cocreature.

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-16 18:25:23 +02:00
Gary Verhaegen
b4d211642c
fixup Terraform setup (#7373)
It looks like #6761 broke our Terraform setup by upgrading the nixpkgs
snapshot. That this has not been caught earlier is, I suppose, a
testament to how stable our infrastructure has become nowadays.

This is the same issue we had with the Google providers in #6402, i.e.
we are trying to pin the provider versions both at the nix level and at
the terraform level, with no way to force them to stay in sync.

I don't have a good proposal for such a way, and it seems rare and
innocuous enough to not warrant the investment to fix this at a more
fundamental level.

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-10 16:28:18 +02:00
Gary Verhaegen
4b13b18c8f
hoogle db as tarball (#7370)
We want to be able to support more than one package in our [Hoogle]
instance. In order to not have to list each file individually, we assume
the collection of Hoogle text files will be published as a tarball.

Note: we keep trying the existing file for now, because the deployment
of this change needs to be done in separate, asynchronous steps if we
want everything to keep working with no downtime:

1. We deploy the new version of the Hoogle configuration, which supports
   both the new and old file structure on the docs website (this PR).
2. After the next stable version (likely 1.6) is published, the docs
   site actually changes to the new format.
3. We can then clean-up the Hoogle configuration.

Any other sequence will require turning off Hoogle and coordinating with
the docs update, which seems impractical.

[Hoogle]: https://hoogle.daml.com

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-10 15:39:09 +02:00
Gary Verhaegen
5b2319e137
multistep macos setup (#5768)
multistep macos setup

This updates the macOS node setup instructions to avoid repeating
identical work and network traffic across all machines through
initialization by building a "daily" image with all the tools and code
we need.

CHANGELOG_BEGIN
CHANGELOG_END

* Fix 3-running-box to remount nix partition

* updated scripts to use multi-step process

* add copyright notices

Co-authored-by: nycnewman <edward@digitalasset.com>
2020-08-18 16:01:02 +02:00
Gary Verhaegen
c8f31ca16a
switch CI nodes from n1-standard-8 to c2-* (#6514)
switch CI nodes from n1-standard-8 to c2-*

A while back (#4520), I did a bunch of performance tests when trying to
size up the requirements for the hosted macOS nodes we needed to buy. As
part of that testing, it looked like `c2-standard-8` nodes were faster
(full build down from ~95 to ~75 minutes) and marginally cheaper
($0.4176 vs $0.4280) than the `n1-standard-8` we are currently using.
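
As a rough worked example of what those numbers imply per build, the sketch below assumes the quoted prices are USD per machine-hour and that one machine runs one full build at a time; neither assumption is stated above, and the build times are the approximate figures quoted.

```python
# Assumption: prices are USD per machine-hour; build times are approximate.
N1_PRICE_PER_HOUR = 0.4280
C2_PRICE_PER_HOUR = 0.4176
N1_BUILD_MINUTES = 95
C2_BUILD_MINUTES = 75


def cost_per_build(price_per_hour, build_minutes):
    """Machine cost of one full build at the given hourly price."""
    return price_per_hour * build_minutes / 60


n1_cost = cost_per_build(N1_PRICE_PER_HOUR, N1_BUILD_MINUTES)
c2_cost = cost_per_build(C2_PRICE_PER_HOUR, C2_BUILD_MINUTES)

print(f"n1-standard-8: ${n1_cost:.3f} per full build")
print(f"c2-standard-8: ${c2_cost:.3f} per full build")
print(f"relative saving: {100 * (1 - c2_cost / n1_cost):.0f}%")
```

Under these assumptions most of the saving comes from the shorter build time rather than from the small difference in hourly price.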

Then I got distracted, and I forgot to upgrade our existing machines.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-27 12:20:29 +02:00
Gary Verhaegen
2923048935
remove purge_old_agents (#6439)
This script was supposed to remove old agents from the Azure Pipelines
UI. It may have been useful at some time (notably, when we used
ephemeral instances, they did not necessarily get to run their shutdown
script), but as it stands now, it's broken. The output from that step
ends in:

```
error: 2 derivations need to be built, but neither local builds ('--max-jobs') nor remote builds ('--builders') are enabled
```

after listing the nix packages it would build. Furthermore, it does not
seem to be useful as I have not seen any spurious entry in the agents
list on Azure since we switched to permanent nodes, on either the Linux
or Windows side (and this would only run on Linux, if it ran).

I'm also not convinced it ever ran, as I used to see a lot of spurious
machines on both Linux and Windows when we did use ephemeral instances.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-20 17:37:24 +02:00
Gary Verhaegen
d839acdbce
increase nix cache retention time (#6437)
The nix cache is currently only 3.5GB, and GHC takes a long time to
build, so I think the convenience vs. cost tradeoff is in favour of
keeping things for a bit longer.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-20 16:25:02 +02:00
Gary Verhaegen
aa86a64842
remove temp linux nodes (#6410)
This is the last step of the plan outlined in #6405. As of opening this
PR, "old" nodes are back up, "temp" nodes are disabled at the Azure
level, and there is no job running on either (🤔). In other
words, this can be deployed as soon as it gets a stamp.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-18 13:20:56 +00:00
Gary Verhaegen
72f428d8df
macos nodes: add nix redirect (#6406)
See #6400; split out as separate PR so master == reality and we can
track when this is done. @nycnewman please merge this once the change
is deployed.

Note: it has to be deployed before the next restart; nodes will _not_ be
able to boot with the current configuration.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-18 14:51:25 +02:00
Gary Verhaegen
d01715bf2f
add redirect to nix curl (linux) (#6407)
This is the second PR in the plan outlined in #6405. I have already
disabled the old nodes so no new job will get started there; I will,
however, wait until I've seen a few successful builds on the new nodes
before pulling the plug.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-18 14:08:21 +02:00
Gary Verhaegen
561c392b69
duplicate linux CI cluster (#6405)
This PR duplicates the linux CI cluster. This is the first in a
three-PR plan to implement #6400 safely while people are working.

I usually do cluster updates over the weekend because they require
shutting down the entire CI system for about two hours. This is
unfortunately not practical while people are working, and timezones make
it difficult for me to find a time where people are not working during
the week.

So instead the plan is as follows:

1. Create a duplicate of our CI cluster (this PR).
2. Wait for the new cluster to be operational (~90-120 minutes ime).
3. In the Azure Pipelines config screen, disable all the nodes of the
   "old" cluster, so all new jobs get assigned to the temp cluster. Wait
   for all jobs to finish on the old cluster.
4. Update the old cluster. Wait for it to be deployed. (Second PR.)
5. In Azure, disable temp nodes, wait for jobs to drain.
6. Delete temp nodes (third PR).

Reviewing this PR is best done by verifying you can reproduce the
following shell session:

```
$ diff vsts_agent_linux.tf vsts_agent_linux_temp.tf
4,7c4,5
< resource "secret_resource" "vsts-token" {}
<
< data "template_file" "vsts-agent-linux-startup" {
<   template = "${file("${path.module}/vsts_agent_linux_startup.sh")}"
---
> data "template_file" "vsts-agent-linux-startup-temp" {
>   template = "${file("${path.module}/vsts_agent_linux_startup_temp.sh")}"
16c14
< resource "google_compute_region_instance_group_manager" "vsts-agent-linux" {
---
> resource "google_compute_region_instance_group_manager" "vsts-agent-linux-temp" {
18,19c16,17
<   name               = "vsts-agent-linux"
<   base_instance_name = "vsts-agent-linux"
---
>   name               = "vsts-agent-linux-temp"
>   base_instance_name = "vsts-agent-linux-temp"
24,25c22,23
<     name              = "vsts-agent-linux"
<     instance_template = "${google_compute_instance_template.vsts-agent-linux.self_link}"
---
>     name              = "vsts-agent-linux-temp"
>     instance_template = "${google_compute_instance_template.vsts-agent-linux-temp.self_link}"
36,37c34,35
< resource "google_compute_instance_template" "vsts-agent-linux" {
<   name_prefix  = "vsts-agent-linux-"
---
> resource "google_compute_instance_template" "vsts-agent-linux-temp" {
>   name_prefix  = "vsts-agent-linux-temp-"
52c50
<     startup-script = "${data.template_file.vsts-agent-linux-startup.rendered}"
---
>     startup-script = "${data.template_file.vsts-agent-linux-startup-temp.rendered}"
$ diff vsts_agent_linux_startup.sh vsts_agent_linux_startup_temp.sh
149c149
< su --command "sh <(curl https://nixos.org/nix/install) --daemon" --login vsts
---
> su --command "sh <(curl -sSfL https://nixos.org/nix/install) --daemon" --login vsts
$
```

and reviewing that diff, rather than looking at the added files in their
entirety. The name changes are benign and needed for Terraform to
appropriately keep track of which node belongs to the old vs the temp
group. The only change that matters is the new group has the `-sSfL`
flag so they will actually boot up. (Hopefully.)

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-18 13:04:19 +02:00
Gary Verhaegen
fba57470a5
restore terraform to working state (#6402)
It looks like some nix update has broken our current Terraform setup.
The Google provider plugin has changed its reported version to 0.0.0;
poking at my local nix store seems to indicate we actually get 3.15, but
🤷.

This PR also reverts the infra part of #6400 so we get back to master ==
reality.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-18 12:15:27 +02:00
Moritz Kiefer
2c1d4cb805
Fix nix installation (#6400)
Nix now requires -L, I’ve gone ahead and just normalized everything to
use -sfL which we were already using in one place.

changelog_begin
changelog_end
2020-06-18 10:34:08 +02:00
Gary Verhaegen
b9fbba7fc5
shorten Windows CI username (#6190)
Keeping CI working on Windows involves a constant fight against
MAX_PATH, which is a very short 260 characters. As the username appears
in some paths, sometimes multiple times, we can save a few precious
characters by having it shorter.
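
A small sketch of the arithmetic, using a made-up path shape in which the username appears twice; the directory layout and both usernames below are hypothetical and not taken from the actual CI machines.

```python
MAX_PATH = 260  # limit quoted in the message above


def build_path(username):
    """Hypothetical Windows path in which the username appears twice."""
    parts = ["C:", "Users", username, "AppData", "Local", "Temp",
             f"{username}-cache", "out"]
    return "\\".join(parts)


long_name = "vsts-agent-user"  # hypothetical long username
short_name = "u"               # hypothetical short username

saved = len(build_path(long_name)) - len(build_path(short_name))
print(f"characters saved: {saved}")
print(f"headroom with short name: {MAX_PATH - len(build_path(short_name))}")
```

Each occurrence of the username in a path saves the full difference in length, so a path that contains it twice gains twice as many characters.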

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-06 15:03:15 +02:00
Edward Newman
9a073cebd9
Macos fix nix installer for build agent servers (#6133)
* Fix issue with xz dependency missing for Nix installer

CHANGELOG_BEGIN
- MacOS - fix Nix installer dependency for xz
CHANGELOG_END

* - additional changes for new Nix installer for Catalina dependencies
2020-05-28 14:04:01 +02:00
Edward Newman
be4f85d165
Fix launchd killing VMWare process at end of script execution (#6006)
* Fix launchd killing VMWare process at end of script execution

CHANGELOG_BEGIN
Fix issue with MacOS Catalina Launchd killing VMWare instance on rebuild (AbandonProcessGroup)
CHANGELOG_END
2020-05-18 10:54:15 -04:00
Gary Verhaegen
bda565fa44
patching Bazel on Windows (infra bits, no patch yet) (#5918)
patch Bazel on Windows (ci setup)

We have a weird, intermittent bug on Windows where Bazel gets into a
broken state. To investigate, we need to patch Bazel to add more debug
output than present in the official distribution. This PR adds the basic
infrastructure we need to download the Bazel source code, apply a patch,
compile it, and make that binary available to the rest of the build.
This is for Windows only as we already have the ability to do similar
things on Linux and macOS through Nix.

This PR does not contain any interesting patch to Bazel, just the minimum
that we can check we are actually using the patched version.

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-12 23:16:04 +02:00
Edward Newman
0ec0cc335f
Updates to support VMWare variant of Hypervisor for MacOS Build Nodes (#5940)
* Updates to support VMWare variant of Hypervisor

* Update infra/macos/scripts/rebuild-crontask.sh

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>

* Update infra/macos/scripts/run-agent.sh

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
2020-05-12 09:36:40 -04:00
Gary Verhaegen
4a6ab84b69
add default machine capability (#5912)
add default machine capability

We semi-regularly need to do work that has the potential to disrupt a
machine's local cache, rendering it broken for other streams of work.
This can include upgrading nix, upgrading Bazel, debugging caching
issues, or anything related to Windows.

Right now we do not have any good solution for these situations. We can
either not do those streams of work, or we can proceed with them and
just accept that all other builds may get affected depending on which
machine they get assigned to. Debugging broken nodes is particularly
tricky as we do not have any way to force a build to run on a given
node.

This PR aims at providing a better alternative by (ab)using an Azure
Pipelines feature called
[capabilities](https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/agents?view=azure-devops&tabs=browser#capabilities).
The idea behind capabilities is that you assign a set of tags to a
machine, and then a job can express its
[demands](https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml),
i.e. specify a set of tags machines need to have in order to run it.

Support for this is fairly badly documented. We can gather from the
documentation that a job can specify two things about a capability
(through its `demands`): that a given tag exists, and that a given tag
has an exact specified value. In particular, a job cannot specify that a
capability should _not_ be present, meaning we cannot rely on, say,
adding a "broken" tag to broken machines.

Documentation on how to set capabilities for an agent is basically
nonexistent, but [looking at the
code](https://github.com/microsoft/azure-pipelines-agent/blob/master/src/Microsoft.VisualStudio.Services.Agent/Capabilities/UserCapabilitiesProvider.cs)
indicates that they can be set by using a simple `key=value`-formatted
text file, provided we can find the right place to put this file.

This PR adds this file to our Linux, macOS and Windows node init scripts
to define an `assignment` capability and adds a demand for a `default`
value on each job. From then on, when we hit a case where we want a PR
to run on a specific node, and to prevent other PRs from running on that
node, we can manually override the capability from the Azure UI and
update the demand in the relevant YAML file in the PR.
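
The matching rule described above can be sketched as a toy Python model. It is not the Azure Pipelines implementation: `parse_capabilities` and `satisfies` are hypothetical helpers, the file contents are made up, and only the two demand forms mentioned above (tag exists, tag equals a value) are modeled.

```python
def parse_capabilities(text):
    """Parse a simple key=value text file into a dict of capabilities."""
    capabilities = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        capabilities[key.strip()] = value.strip()
    return capabilities


def satisfies(capabilities, demands):
    """Check demands given as (name, value) pairs.

    A value of None means "the tag must exist"; any other value means
    "the tag must have exactly this value". There is deliberately no way
    to demand that a tag be absent.
    """
    for name, required in demands:
        if name not in capabilities:
            return False
        if required is not None and capabilities[name] != required:
            return False
    return True


# Hypothetical nodes: one in the default pool, one reserved by hand.
default_node = parse_capabilities("assignment=default\n")
reserved_node = parse_capabilities("assignment=debug-node\n")

regular_job = [("assignment", "default")]
debug_job = [("assignment", "debug-node")]

print(satisfies(default_node, regular_job), satisfies(reserved_node, regular_job))
print(satisfies(default_node, debug_job), satisfies(reserved_node, debug_job))
```

Because every regular job demands the `default` value, overriding the value on one node both reserves it for the debugging PR and keeps all other jobs away from it.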

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-09 18:21:42 +02:00
Gary Verhaegen
6aac32480a
hopefully fix memory issue with pg on macos CI (#5824)
We have seen the following error message crop up a couple times
recently:

```
FATAL:  could not create shared memory segment: No space left on device
DETAIL:  Failed system call was shmget(key=5432001, size=56, 03600).
HINT:  This error does *not* mean that you have run out of disk space.
It occurs either if all available shared memory IDs have been taken, in
which case you need to raise the SHMMNI parameter in your kernel, or
because the system's overall limit for shared memory has been reached.
    The PostgreSQL documentation contains more information about shared
memory configuration.
child process exited with exit code 1
```

Based on [the PostgreSQL
documentation](https://www.postgresql.org/docs/12/kernel-resources.html),
this should fix it.

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-04 14:32:23 -04:00
Edward Newman
01c784659f
Minor changes to MacOS infra config (#5673)
2020-04-22 18:57:40 +02:00
Gary Verhaegen
43def51fce
add puppeteer dependencies to Linux nodes (#5575)
See #5540 for context.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-17 01:32:25 +02:00
Gary Verhaegen
b3c428e76f
Macos boxes for ci (#5002)
set up macOS nodes

This PR documents how to create and manage macOS CI nodes. Because macOS
is not supported by our current cloud providers, these instructions are
geared towards creating VMs on physical machines we would need to host
and manage ourselves, i.e. these notes are mostly targeted at Ed.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-14 18:03:24 +02:00
Gary Verhaegen
08a5a64325
replace Windows agents (#5527)
It looks like the change in Windows agent names has caused an issue:
because Windows agents are not always properly cleaned up on shutdown,
i.e. they do not always have time to tell Azure they are going away, and
because GCP likes to reuse the same names for machines in a group, we've
been seeing errors like:

```
ERROR: The running command stopped because the preference variable
"ErrorActionPreference" or common parameter is set to Stop: Pool 11
already contains an agent with name VSTS-WIN-3QCX.
```

recently. Today, only 2 out of our 6 agents have managed to register
with Azure. This PR should fix that.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-14 13:58:42 +02:00
Gary Verhaegen
66e7068b39
better Windows machine names (#5374)
This is a small QoL improvement, mostly targeted at myself: have Windows
agents register with Azure using the name they display on the GCP
console, so I don't need to find a build and look at the "Agent
Diagnostics" step to figure out the correspondence between Azure and GCP.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-07 01:33:36 +02:00
Gary Verhaegen
10fefbae00
remove temp Windows machine (#5445)
CHANGELOG_BEGIN
CHANGELOG_END
2020-04-06 16:24:51 +02:00
Gary Verhaegen
1bf208ebbf
remove temp linux machine (#5351)
@cocreature told me he's done with the Linux machine. He's still using
the Windows one, so not removing it is not an oversight.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-01 18:36:03 +02:00
Gary Verhaegen
ce5ad647a3
fix cocreature's temp machine (#5341)
Our Linux startup script never finishes, as it ends with `exec`'ing to
the Azure agent. Since I've removed that part, the EXIT handler,
supposed to only kick in when an issue prevents the script from
finishing, triggers on normal exit, and the machine shuts down, making
it hard to use.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-01 15:28:56 +02:00
Gary Verhaegen
5ddf7ef497
temp machines for cocreature (#5335)
CI has been behaving weirdly for the past three days, with build times
on Linux and Windows regularly taking over 40 minutes, macOS builds
occasionally running for almost three hours, and generally a lot of OOM
exceptions (mostly on Windows, but a bit on the other two too).

We currently have no idea what changed, and have been having trouble
reproducing locally. As far as I'm aware, there has been no change to
the CI infrastructure itself, so we suspect we broke something in our
code somehow.

@cocreature has requested access to Linux and Windows machines with
similar specs and set-up as the CI ones, but without credentials. This
PR attempts to provide that.

Once the machines are up I will manually add accounts for @cocreature.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-01 14:13:03 +02:00
Gary Verhaegen
819210827e
fix permissions on periodic-killer (#5307)
Even though the command succeeds as far as deleting the machine goes, it
does log an error. That is probably why we recently had only one machine
deleted per night.

Something must have changed on the Google side recently to make this
additional permission required.

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-31 19:04:40 +02:00
Gary Verhaegen
38a5fea7a0
tweak periodic-killer (#5268)
1. Google says the instance is currently overutilized and suggests
   g1-small as a more appropriate size.
2. It occurred to me that the reason no error was logged might be that
   we lose them, so explicitly redirecting stderr too.

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-30 14:12:14 +02:00
Gary Verhaegen
7e960eb454
log periodic reboots (#5235)
It appears that most of our Windows machines have not been rebooted
since Tuesday 24. We detected this because one of them has run out of
disk space.

This is not good, but what's worse is I currently have no idea what
could be going wrong, and we are not logging anything at all in the
current setup, so even ssh'ing into the machine provides no insight.

This PR hopefully addresses that by:

1. Redirecting the outputs of the script to a file, and
2. `tail`ing that file from the startup script, so the logs will appear
   directly in the GCP web console. (This is what we currently do for
   the Azure agent logs on Linux.)

This PR also tells the script to not stop on the first failed machine
and keep trying.
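
The "log everything and keep going" behaviour can be sketched as follows. This is an illustrative Python model only, not the actual script: `reset_all` and `fake_reset` are hypothetical names, and the machine names are made up.

```python
import logging


def reset_all(machines, reset_one, log):
    """Try to reset every machine; log failures instead of stopping."""
    failed = []
    for machine in machines:
        try:
            reset_one(machine)
            log.info("reset %s", machine)
        except Exception:
            # Record the error with its traceback and move on.
            log.exception("failed to reset %s", machine)
            failed.append(machine)
    return failed


def fake_reset(machine):
    """Stand-in for the real reset command; fails for one machine."""
    if machine == "win-2":
        raise RuntimeError("simulated failure")


logging.basicConfig(level=logging.INFO)
failed = reset_all(["win-1", "win-2", "win-3"], fake_reset,
                   logging.getLogger("periodic-killer"))
print("failed:", failed)
```

Returning the list of failed machines, rather than raising on the first error, means one bad machine no longer prevents the remaining ones from being processed.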

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-27 21:35:49 +01:00
Gary Verhaegen
1872c668a5
replace DAML Authors with DA in copyright headers (#5228)
Change requested by Manoj.

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-27 01:26:10 +01:00
Gary Verhaegen
7d665d6163
fix tf config for GCP default (#5158)
It looks like GCP doesn't like not having a "page suffix" set, so it
sets a default. Except somehow Terraform doesn't know it's a default
value, so when trying to plan without the (optional) website value set,
Terraform will always find that the deployed state has changed.

With this change, we set it to a value that doesn't exist and won't
work, but at least Terraform will see that the deployed state matches
the configured one.

Note: this PR is a bit special as far as "changes" go as there will be
nothing to apply: applying current master tries to get rid of this
website.main_page_suffix value, but it's back on the next run. With this
patch, `terraform plan` declares "nothing to apply", so this PR itself
won't (need to) be applied.

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-24 13:33:59 +01:00