35 Commits

Author SHA1 Message Date
Gary Verhaegen
c04fa81d6a
ci: bump Windows workdirs (#12918)
Since #12645, we have added a new pipeline, so we need to add a corresponding
entry.

As for #12645, the content of the files and the directory structure are
taken directly from a live CI node, as printed by the (now-named)
`workdirs` step.

CHANGELOG_BEGIN
CHANGELOG_END
2022-02-14 18:49:32 +00:00
Gary Verhaegen
f08dfa3264
Bump terraform (#12670)
We've been using an old version of Terraform for a long time now. The
main blocker used to be that there was no post-0.12 version of `secret`,
but that has now been resolved: there's a new fork, with new maintainers
(blessed by the original one and accepted by the Terraform registry)
[here].

I'll be upgrading one version at a time as 0.x versions are considered
major (and thus potentially breaking).

[here]: https://github.com/numtide/terraform-provider-secret

See https://github.com/digital-asset/daml/pull/12670 for details.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-31 15:46:59 +01:00
Gary Verhaegen
1fa7f61bb0
ci: pin workdirs on Windows (#12645)
The Bazel cache on Windows includes absolute paths. The normal process
for Azure is to dynamically allocate new top-level folders for each new
build that runs on a given machine. The result of that is that we get
about a one in three chance to get caching for any single Windows build
(it's actually not _quite_ that because we don't run different builds an
equal number of times).

This PR is an attempt at pinning the folder to job mapping by mucking
around in [Azure internals], which may or may not have bad consequences
down the line, assuming it works at all.

[Azure internals]: https://github.com/microsoft/azure-pipelines-agent/blob/master/docs/jobdirectories.md
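
The "one in three" figure can be illustrated with a small sketch. It assumes three pipelines (hypothetical names below) and that each machine hands out its numbered folders in an independent, uniformly random order, which is a simplification of what the message describes. The `cache_key` helper and the `D:\a\N` paths are likewise hypothetical and only show why a different folder means a different cache entry.

```python
from fractions import Fraction
from itertools import permutations


def match_probability(pipelines):
    """Chance that two machines gave the first pipeline the same folder number.

    Assumption: each machine assigns folder numbers to pipelines in an
    independent, uniformly random order.
    """
    orders = list(permutations(range(1, len(pipelines) + 1)))
    matches = sum(1 for a in orders for b in orders if a[0] == b[0])
    return Fraction(matches, len(orders) ** 2)


def cache_key(workdir, target):
    """Hypothetical cache key that embeds the absolute working directory."""
    return f"{workdir}|{target}"


pipelines = ["pr", "main", "daily"]  # hypothetical pipeline names
print(match_probability(pipelines))  # 1/3

# Same target, different folder: the keys differ, so the cache entry is missed.
print(cache_key(r"D:\a\1", "//some:target") == cache_key(r"D:\a\2", "//some:target"))  # False
```

Pinning the folder-to-job mapping corresponds to forcing every machine to use the same order, in which case the folders always match.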

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-31 10:20:12 +00:00
Gary Verhaegen
b1a917596c
ci: reduce Windows capacity (#12607)
Reverting #12599.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-26 18:17:56 +00:00
Gary Verhaegen
01219d6cdc
ci: temporarily increase Windows capacity (#12599)
Our Windows CI nodes seem completely overwhelmed today, with typical
wait times above half an hour before jobs even start. This isn't fun, so
I'd like to double our capacity for a few hours.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-26 15:03:28 +01:00
Gary Verhaegen
de2a8c0c04
ci: use service account for Windows nodes (#12489)
When no service account is explicitly selected, GCP provides a default
one, which happens to have way more access rights than we're comfortable
with. I'm not quite sure how the total lack of a service account slipped
through here, but I've noticed today so I'm changing it.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-19 19:58:17 +00:00
Gary Verhaegen
5716d99cd2
Disable printer sharing (#12408)
As the title suggests. We already disable all communication between CI
nodes through network rules, but we currently get a lot of noise from
GCP logging violations of those rules from Windows trying to feel its
way out for file share buddies.

CHANGELOG_BEGIN
CHANGELOG_END

As usual, this branch will contain intermediate commits that may serve
as an audit log of sorts.
2022-01-13 20:55:18 +00:00
Gary Verhaegen
d2e2c21684
update copyright headers (#12240)
New year, new copyright, new expected unknown issues with various files
that won't be covered by the script and/or will be but shouldn't change.

I'll do the details on Jan 1, but would appreciate this being
preapproved so I can actually get it merged by then.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-03 16:36:51 +00:00
Gary Verhaegen
0e30d468f9
expand CI cluster back (#12239)
To be done / merged on Jan 3.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-03 14:50:08 +00:00
Gary Verhaegen
a51f75d193
give a break to CI (#12238)
I can't think of a good reason to keep 30+ machines running over the
Winter break. I'll bump this back up on Jan 3.

CHANGELOG_BEGIN
CHANGELOG_END
2021-12-26 10:53:08 +01:00
Gary Verhaegen
349d812482
ci: increase hard drive space (not macOS) (#11983)
I've seen quite a few builds failing for lack of disk space recently,
sometimes as early as 2pm CET.

CHANGELOG_BEGIN
CHANGELOG_END
2021-12-06 19:41:11 +00:00
Gary Verhaegen
28b8d9a1f7
bump dotnet (#10979)
This bumps dotnet to the version required by the latest azuresigntool,
and pins azuresigntool for the future.

As usual for live CI upgrades, this will be rolled out using the
blue/green approach. I'll keep each deployed commit in this PR.

For future reference, this is PR [#10979].

[#10979]: https://github.com/digital-asset/daml/pull/10979

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-22 16:39:40 +00:00
Gary Verhaegen
cb1f4ec773
ci/windows: disable spool (#10200)
* ci/windows: disable spool

We're not expecting to print anything, and @RPS5' security newsletter
says this is a vector of attack.

CHANGELOG_BEGIN
CHANGELOG_END

* increase no-spool to 6

* Windows name truncation causing collisions

* update main group

* remove temp group
2021-07-07 12:44:33 +00:00
Gary Verhaegen
31a76a4a2a
allow CI pools to use any zone (#10069)
This morning we started with very restricted CI pools (2/6 for Windows
and 7/20 for Linux), apparently because the region we run in (us-east1)
has three zones, two of which were unable to allocate new nodes, and the
default policy is to distribute nodes evenly between zones.

I've manually changed the distribution policy. Unfortunately this option
is not yet available in our version of the GCP Terraform plugin.
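
The 2/6 and 7/20 figures are consistent with an even split across three zones when only one zone can allocate nodes. The sketch below is an illustration, not GCP's actual algorithm: the zone names are only examples, and it assumes the healthy zone was one of those that received the rounded-up share.

```python
def spread_evenly(target_size, zones):
    """Split target_size nodes across zones as evenly as possible.

    Earlier zones receive the remainder (an assumption for this sketch).
    """
    base, extra = divmod(target_size, len(zones))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def available_nodes(target_size, zones, healthy):
    """Count the nodes that land in zones able to allocate them."""
    plan = spread_evenly(target_size, zones)
    return sum(count for zone, count in plan.items() if zone in healthy)


zones = ["us-east1-b", "us-east1-c", "us-east1-d"]  # illustrative zone names
healthy = {"us-east1-b"}  # assumption: a single working zone

print(available_nodes(6, zones, healthy))   # 2 (Windows pool)
print(available_nodes(20, zones, healthy))  # 7 (Linux pool)
```

Allowing the pool to use any zone amounts to letting the healthy zone absorb the full target size.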

CHANGELOG_BEGIN
CHANGELOG_END
2021-06-22 10:43:08 +02:00
Gary Verhaegen
646c956457
new windows signing (#9786)
CHANGELOG_BEGIN
CHANGELOG_END
2021-05-25 16:23:17 +02:00
Gary Verhaegen
45bca6e68b
test_windows_signing: install for u (#9776)
Turns out "`--global`" means "for this user".

CHANGELOG_BEGIN
CHANGELOG_END
2021-05-21 15:49:31 +02:00
Gary Verhaegen
4af6608185
fix signing machine (#9772)
Turns out PowerShell is not Bash. Who knew? 🤷

CHANGELOG_BEGIN
CHANGELOG_END
2021-05-21 12:36:56 +02:00
Gary Verhaegen
f5c5b634eb
prepare for EV Windows signing (#9758)
Setting up a non-disruptive way to test out EV signing of our Windows
artifacts.

CHANGELOG_BEGIN
CHANGELOG_END
2021-05-21 10:46:45 +02:00
Gary Verhaegen
cfae2d88f5
update Terraform files to match reality (#8780)
* fixup terraform config

Two changes have happened recently that have invalidated the current
Terraform files:

1. The Terraform version has gone through a major, incompatible upgrade
   (#8190); the required updates for this are reflected in the first
   commit of this PR.
2. The certificate used to serve [Hoogle](https://hoogle.daml.com) was
   about to expire, so Edward created a new one and updated the config
   directly. The second commit in this PR updates the Terraform config
   to match that new, already-in-prod setting.

Note: This PR applies cleanly, as there are no resulting changes in
Terraform's perception of the target state from 1, and the change from 2
has already been applied through other channels.

CHANGELOG_BEGIN
CHANGELOG_END

* update hoogle cert
2021-02-08 17:25:04 +00:00
Gary Verhaegen
a925f0174c
update copyright notices for 2021 (#8257)
* update copyright notices for 2021

To be merged on 2021-01-01.

CHANGELOG_BEGIN
CHANGELOG_END

* patch-bazel-windows & da-ghc-lib
2021-01-01 19:49:51 +01:00
Gary Verhaegen
7c2ba6f996
infra: add prod label (#8140)
Requested by @nycnewman.

CHANGELOG_BEGIN
CHANGELOG_END
2020-12-03 01:55:43 +01:00
Gary Verhaegen
c8f31ca16a
switch CI nodes from n1-standard-8 to c2-* (#6514)
switch CI nodes from n1-standard-8 to c2-*

A while back (#4520), I did a bunch of performance tests when trying to
size up the requirements for the hosted macOS nodes we needed to buy. As
part of that testing, it looked like `c2-standard-8` nodes were faster
(full build down from ~95 to ~75 minutes) and marginally cheaper
($0.4176 vs $0.4280) than the `n1-standard-8` we are currently using.
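
As a rough worked example of what those numbers imply per build: the sketch assumes the quoted prices are per machine-hour (the message does not say) and that cost scales linearly with build time.

```python
def cost_per_build(hourly_price_usd, build_minutes):
    """Cost of one full build on one machine, assuming per-hour pricing."""
    return hourly_price_usd * build_minutes / 60


n1 = cost_per_build(0.4280, 95)  # n1-standard-8, ~95 minute build
c2 = cost_per_build(0.4176, 75)  # c2-standard-8, ~75 minute build
savings = 1 - c2 / n1

print(f"n1: ${n1:.3f}  c2: ${c2:.3f}  savings: {savings:.0%}")
# n1: $0.678  c2: $0.522  savings: 23%
```

Under these assumptions most of the saving comes from the shorter build, not from the slightly lower hourly price.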

Then I got distracted, and I forgot to upgrade our existing machines.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-27 12:20:29 +02:00
Gary Verhaegen
b9fbba7fc5
shorten Windows CI username (#6190)
Keeping CI working on Windows involves a constant fight against
MAX_PATH, which is a very short 260 characters. As the username appears
in some paths, sometimes multiple times, we can save a few precious
characters by having it shorter.
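
A minimal sketch of the effect, using a hypothetical path template and hypothetical usernames (neither is taken from the actual CI setup): every occurrence of the username in a path costs its full length against the 260-character budget.

```python
MAX_PATH = 260  # limit quoted in the commit message


def headroom(path_template, username):
    """Characters left before a path built from the template hits MAX_PATH."""
    path = path_template.format(user=username)
    return MAX_PATH - len(path)


# Hypothetical template in which the username appears twice.
template = r"C:\Users\{user}\_bazel_{user}\output_base"

long_name = "buildagentuser"  # hypothetical long username
short_name = "u"              # hypothetical short username

saved = headroom(template, short_name) - headroom(template, long_name)
print(saved)  # 26: two occurrences, 13 characters saved each
```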

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-06 15:03:15 +02:00
Gary Verhaegen
4a6ab84b69
add default machine capability (#5912)
add default machine capability

We semi-regularly need to do work that has the potential to disrupt a
machine's local cache, rendering it broken for other streams of work.
This can include upgrading nix, upgrading Bazel, debugging caching
issues, or anything related to Windows.

Right now we do not have any good solution for these situations. We can
either not do those streams of work, or we can proceed with them and
just accept that all other builds may get affected depending on which
machine they get assigned to. Debugging broken nodes is particularly
tricky as we do not have any way to force a build to run on a given
node.

This PR aims at providing a better alternative by (ab)using an Azure
Pipelines feature called
[capabilities](https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/agents?view=azure-devops&tabs=browser#capabilities).
The idea behind capabilities is that you assign a set of tags to a
machine, and then a job can express its
[demands](https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml),
i.e. specify a set of tags machines need to have in order to run it.

Support for this is fairly badly documented. We can gather from the
documentation that a job can specify two things about a capability
(through its `demands`): that a given tag exists, and that a given tag
has an exact specified value. In particular, a job cannot specify that a
capability should _not_ be present, meaning we cannot rely on, say,
adding a "broken" tag to broken machines.

Documentation on how to set capabilities for an agent is basically
nonexistent, but [looking at the
code](https://github.com/microsoft/azure-pipelines-agent/blob/master/src/Microsoft.VisualStudio.Services.Agent/Capabilities/UserCapabilitiesProvider.cs)
indicates that they can be set by using a simple `key=value`-formatted
text file, provided we can find the right place to put this file.

This PR adds this file to our Linux, macOS and Windows node init scripts
to define an `assignment` capability and adds a demand for a `default`
value on each job. From then on, when we hit a case where we want a PR
to run on a specific node, and to prevent other PRs from running on that
node, we can manually override the capability from the Azure UI and
update the demand in the relevant YAML file in the PR.
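
The matching logic described above can be sketched as follows. This is an illustrative model, not Azure's implementation: the parser only mirrors the `key=value` format mentioned in the message, the demand strings follow the `name` / `name -equals value` shape used in pipeline YAML, and the `debug-caching` value is hypothetical.

```python
def parse_capabilities(text):
    """Parse key=value lines into a dict, skipping blank or malformed lines."""
    capabilities = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        capabilities[key.strip()] = value.strip()
    return capabilities


def satisfies(capabilities, demands):
    """Check demands of the form 'name' (exists) or 'name -equals value'.

    There is deliberately no way to demand that a capability be absent,
    matching the limitation described in the commit message.
    """
    for demand in demands:
        name, sep, expected = demand.partition(" -equals ")
        name = name.strip()
        if name not in capabilities:
            return False
        if sep and capabilities[name] != expected.strip():
            return False
    return True


regular_node = parse_capabilities("assignment=default\n")
reserved_node = parse_capabilities("assignment=debug-caching\n")  # hypothetical override

print(satisfies(regular_node, ["assignment -equals default"]))         # True
print(satisfies(reserved_node, ["assignment -equals default"]))        # False
print(satisfies(reserved_node, ["assignment -equals debug-caching"]))  # True
```

With this scheme, overriding one node's `assignment` value both removes it from the default rotation and makes it the only match for a PR whose demand names the new value.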

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-09 18:21:42 +02:00
Gary Verhaegen
08a5a64325
replace Windows agents (#5527)
It looks like the change in Windows agent names has caused an issue:
because Windows agents are not always properly cleaned up on shutdown,
i.e. they do not always have time to tell Azure they are going away, and
because GCP likes to reuse the same names for machines in a group, we've
been seeing errors like:

```
ERROR: The running command stopped because the preference variable
"ErrorActionPreference" or common parameter is set to Stop: Pool 11
already contains an agent with name VSTS-WIN-3QCX.
```

recently. Today, only 2 out of our 6 agents have managed to register
with Azure. This PR should fix that.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-14 13:58:42 +02:00
Gary Verhaegen
66e7068b39
better Windows machine names (#5374)
This is a small QoL improvement, mostly targeted at myself: have Windows
agents register with Azure using the name they display on the GCP
console, so I don't need to find a build and look at the "Agent
Diagnostics" step to figure out the correspondence between Azure and GCP.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-07 01:33:36 +02:00
Gary Verhaegen
1872c668a5
replace DAML Authors with DA in copyright headers (#5228)
Change requested by Manoj.

CHANGELOG_BEGIN
CHANGELOG_END
2020-03-27 01:26:10 +01:00
Gary Verhaegen
0a251b3fa5
switch CI nodes to permanent (#4455)
CHANGELOG_BEGIN
CHANGELOG_END
2020-02-11 02:07:42 +01:00
Gary Verhaegen
5606ab350c
fix Windows CI node startup script (#4371)
This is an attempt to apply a potential fix discovered as part of the
investigation in #4370. The issue seems to be that Chocolatey is using a
protocol deemed not secure enough and disabled in recent Windows images
(our node creation script dynamically selects the latest "Windows 2016"
server image from GCP).

CHANGELOG_BEGIN
CHANGELOG_END
2020-02-04 14:37:53 +01:00
Gary Verhaegen
878429e3bf
update copyright notices to 2020 (#3939)
copyright update 2020

* update template
* run script: `dade-copyright-headers update .`
* update script
* manual adjustments
* exclude frozen proto files from further header checks (by adding NO_AUTO_COPYRIGHT files)
2020-01-02 21:21:13 +01:00
Gary Verhaegen
99ea93168d
update copyright notices (#2499)
2019-08-13 17:23:03 +01:00
Moritz Kiefer
1cfa27d616
Install the Windows SDK on CI nodes (#1272)
This provides signtool.exe which we need to sign our Windows installer.
2019-05-21 13:42:49 +02:00
Gary Verhaegen
e95575b033
install StackDriver on build machines (#905)
Requested by Security
2019-05-04 22:55:51 +00:00
Jonas Chevalier
769c04d3ba
infra: reduce differences with hosted (#698)
2019-04-25 20:49:38 +00:00
Jonas Chevalier
3b8ae1ff86
infra: add a VSTS windows agents (#368)
2019-04-18 11:20:57 +00:00