digital-asset/daml - daml - gitea: Gitea Service

mirror of https://github.com/digital-asset/daml.git synced 2024-09-20 09:17:43 +03:00

Author	SHA1	Message	Date
Gary Verhaegen	bdc2b5a9b1	allow Moritz to kill machines (#7771 ) Also, explicitly allow myself, rather than rely on my admin status. CHANGELOG_BEGIN CHANGELOG_END	2020-10-21 18:40:54 +02:00
Gary Verhaegen	6419ff2f34	remove leo (#7535 ) Leo has left the team and so should not have access anymore. CHANGELOG_BEGIN CHANGELOG_END	2020-09-30 18:16:57 +02:00
Gary Verhaegen	168345f4a8	let CI delete bazel cache items (#7514 ) Recently we have been seeing lots of issues with the Bazel cache. It does not seem like it would need to delete things, but the issues cropped up about the same time we restricted the permissions, so it's worth trying to revert that. CHANGELOG_BEGIN CHANGELOG_END	2020-09-29 13:56:35 +02:00
Gary Verhaegen	2a38d03250	protect GCS bucket items (#7439 ) Yesterday, a certificate expiration triggered the `patch_bazel_windows` job to run when it shouldn't, and it overrode an artifact we depend on. This was built from the same sources, but the build is not reproducible so we ended up with a hash mismatch. As far as I know, there is no good reason for CI to ever delete or overwrite anything from our GCS buckets, so I'm removing its rights to do so. As an added safety measure, this PR also enables versioning on all non-cache buckets (GCS does not support versioning on buckets with an expiration policy). CHANGELOG_BEGIN CHANGELOG_END	2020-09-18 15:59:23 +02:00
Gary Verhaegen	8ea85d1393	update certificates (#7432 ) Our old wildcard certificate has expired. @nycnewman has already updated our configuration to use new ones; this is just updating the tf files to match. CHANGELOG_BEGIN CHANGELOG_END	2020-09-17 17:36:35 +02:00
Gary Verhaegen	b9acc09a77	read access to data bucket for appr members (#7422 ) We've been saving data there but not doing anything with it. Ideally this data would be used by some sort of automated process, but in the meantime (or while developing said processes), having at least some people with read access can help. This is a Standard Change requested by @cocreature. CHANGELOG_BEGIN CHANGELOG_END	2020-09-16 18:25:23 +02:00
Gary Verhaegen	b4d211642c	fixup Terraform setup (#7373 ) It looks like #6761 broke our Terraform setup by upgrading the nixpkgs snapshot. That this has not been caught earlier is, I suppose, a testament to how stable our infrastructure has become nowadays. This is the same issue we had with the Google providers in #6402, i.e. we are trying to pin the provider versions both at the nix level and at the terraform level, with no way to force them to stay in sync. I don't have a good proposal for such a way, and it seems rare and innocuous enough to not warrant the investment to fix this at a more fundamental level. CHANGELOG_BEGIN CHANGELOG_END	2020-09-10 16:28:18 +02:00
Gary Verhaegen	4b13b18c8f	hoogle db as tarball (#7370 ) We want to be able to support more than one package in our [Hoogle] instance. In order to not have to list each file individually, we assume the collection of Hoogle text files will be published as a tarball. Note: we keep trying the existing file for now, because the deployment of this change needs to be done in separate, asynchronous steps if we want everything to keep working with no downtime: 1. We deploy the new version of the Hoogle configuration, which supports both the new and old file structure on the docs website (this PR). 2. After the next stable version (likely 1.6) is published, the docs site actually changes to the new format. 3. We can then clean up the Hoogle configuration. Any other sequence will require turning off Hoogle and coordinating with the docs update, which seems impractical. [Hoogle]: https://hoogle.daml.com CHANGELOG_BEGIN CHANGELOG_END	2020-09-10 15:39:09 +02:00
Gary Verhaegen	5b2319e137	multistep macos setup (#5768 ) multistep macos setup This updates the macOS node setup instructions to avoid repeating identical work and network traffic across all machines through initialization by building a "daily" image with all the tools and code we need. CHANGELOG_BEGIN CHANGELOG_END * Fix 3-running-box to remount nix partition * updated scripts to use multi-step process * add copyright notices Co-authored-by: nycnewman <edward@digitalasset.com>	2020-08-18 16:01:02 +02:00
Gary Verhaegen	c8f31ca16a	switch CI nodes from n1-standard-8 to c2-* (#6514 ) switch CI nodes from n1-standard-8 to c2-* A while back (#4520), I did a bunch of performance tests when trying to size up the requirements for the hosted macOS nodes we needed to buy. As part of that testing, it looked like `c2-standard-8` nodes were faster (full build down from ~95 to ~75 minutes) and marginally cheaper ($0.4176 vs $0.4280) than the `n1-standard-8` we are currently using. Then I got distracted, and I forgot to upgrade our existing machines. CHANGELOG_BEGIN CHANGELOG_END	2020-06-27 12:20:29 +02:00
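The comparison in the `c2-*` commit above can be worked through as a rough per-build cost. This is only a sketch: it assumes the two quoted prices are USD per node-hour (the commit message does not state the unit) and that a full build keeps one node busy for the quoted duration.

```python
# Assumption: prices are USD per node-hour; build times are the approximate
# figures quoted in the commit message (~95 and ~75 minutes).
N1_PRICE_PER_HOUR = 0.4280   # n1-standard-8
C2_PRICE_PER_HOUR = 0.4176   # c2-standard-8
N1_BUILD_MINUTES = 95
C2_BUILD_MINUTES = 75


def cost_per_build(price_per_hour: float, build_minutes: float) -> float:
    """Cost of keeping one node busy for a single full build."""
    return price_per_hour * build_minutes / 60


n1_cost = cost_per_build(N1_PRICE_PER_HOUR, N1_BUILD_MINUTES)
c2_cost = cost_per_build(C2_PRICE_PER_HOUR, C2_BUILD_MINUTES)
saving = 1 - c2_cost / n1_cost

print(f"n1-standard-8: ${n1_cost:.3f} per build")
print(f"c2-standard-8: ${c2_cost:.3f} per build")
print(f"relative saving per build: {saving:.0%}")
```

Under these assumptions most of the gain comes from the shorter build rather than from the small difference in hourly price.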
Gary Verhaegen	2923048935	remove purge_old_agents (#6439 ) This script was supposed to remove old agents from the Azure Pipelines UI. It may have been useful at some time (notably, when we used ephemeral instances, they did not necessarily get to run their shutdown script), but as it stands now, it's broken. The output from that step ends in: ``` error: 2 derivations need to be built, but neither local builds ('--max-jobs') nor remote builds ('--builders') are enabled ``` after listing the nix packages it would build. Furthermore, it does not seem to be useful as I have not seen any spurious entry in the agents list on Azure since we switched to permanent nodes, on either the Linux or Windows side (and this would only run on Linux, if it ran). I'm also not convinced it ever ran, as I used to see a lot of spurious machines on both Linux and Windows when we did use ephemeral instances. CHANGELOG_BEGIN CHANGELOG_END	2020-06-20 17:37:24 +02:00
Gary Verhaegen	d839acdbce	increase nix cache retention time (#6437 ) The nix cache is currently only 3.5GB, and GHC takes a long time to build, so I think the convenience vs. cost tradeoff is in favour of keeping things for a bit longer. CHANGELOG_BEGIN CHANGELOG_END	2020-06-20 16:25:02 +02:00
Gary Verhaegen	aa86a64842	remove temp linux nodes (#6410 ) This is the last step of the plan outlined in #6405. As of opening this PR, "old" nodes are back up, "temp" nodes are disabled at the Azure level, and there is no job running on either (🤔). In other words, this can be deployed as soon as it gets a stamp. CHANGELOG_BEGIN CHANGELOG_END	2020-06-18 13:20:56 +00:00
Gary Verhaegen	72f428d8df	macos nodes: add nix redirect (#6406 ) See #6400; split out as separate PR so master == reality and we can track when this is done. @nycnewman please merge this once the change is deployed. Note: it has to be deployed before the next restart; nodes will _not_ be able to boot with the current configuration. CHANGELOG_BEGIN CHANGELOG_END	2020-06-18 14:51:25 +02:00
Gary Verhaegen	d01715bf2f	add redirect to nix curl (linux) (#6407 ) This is the second PR in the plan outlined in #6405. I have already disabled the old nodes so no new job will get started there; I will, however, wait until I've seen a few successful builds on the new nodes before pulling the plug. CHANGELOG_BEGIN CHANGELOG_END	2020-06-18 14:08:21 +02:00
Gary Verhaegen	561c392b69	duplicate linux CI cluster (#6405 ) This PR duplicates the linux CI cluster. This is the first in a three-PR plan to implement #6400 safely while people are working. I usually do cluster updates over the weekend because they require shutting down the entire CI system for about two hours. This is unfortunately not practical while people are working, and timezones make it difficult for me to find a time where people are not working during the week. So instead the plan is as follows: 1. Create a duplicate of our CI cluster (this PR). 2. Wait for the new cluster to be operational (~90-120 minutes ime). 3. In the Azure Pipelines config screen, disable all the nodes of the "old" cluster, so all new jobs get assigned to the temp cluster. Wait for all jobs to finish on the old cluster. 4. Update the old cluster. Wait for it to be deployed. (Second PR.) 5. In Azure, disable temp nodes, wait for jobs to drain. 6. Delete temp nodes (third PR). Reviewing this PR is best done by verifying you can reproduce the following shell session: ``` $ diff vsts_agent_linux.tf vsts_agent_linux_temp.tf 4,7c4,5 < resource "secret_resource" "vsts-token" {} < < data "template_file" "vsts-agent-linux-startup" { < template = "${file("${path.module}/vsts_agent_linux_startup.sh")}" --- > data "template_file" "vsts-agent-linux-startup-temp" { > template = "${file("${path.module}/vsts_agent_linux_startup_temp.sh")}" 16c14 < resource "google_compute_region_instance_group_manager" "vsts-agent-linux" { --- > resource "google_compute_region_instance_group_manager" "vsts-agent-linux-temp" { 18,19c16,17 < name = "vsts-agent-linux" < base_instance_name = "vsts-agent-linux" --- > name = "vsts-agent-linux-temp" > base_instance_name = "vsts-agent-linux-temp" 24,25c22,23 < name = "vsts-agent-linux" < instance_template = "${google_compute_instance_template.vsts-agent-linux.self_link}" --- > name = "vsts-agent-linux-temp" > instance_template = "${google_compute_instance_template.vsts-agent-linux-temp.self_link}" 36,37c34,35 < resource "google_compute_instance_template" "vsts-agent-linux" { < name_prefix = "vsts-agent-linux-" --- > resource "google_compute_instance_template" "vsts-agent-linux-temp" { > name_prefix = "vsts-agent-linux-temp-" 52c50 < startup-script = "${data.template_file.vsts-agent-linux-startup.rendered}" --- > startup-script = "${data.template_file.vsts-agent-linux-startup-temp.rendered}" $ diff vsts_agent_linux_startup.sh vsts_agent_linux_startup_temp.sh 149c149 < su --command "sh <(curl https://nixos.org/nix/install) --daemon" --login vsts --- > su --command "sh <(curl -sSfL https://nixos.org/nix/install) --daemon" --login vsts $ ``` and reviewing that diff, rather than looking at the added files in their entirety. The name changes are benign and needed for Terraform to appropriately keep track of which node belongs to the old vs the temp group. The only change that matters is the new group has the `-sSfL` flag so they will actually boot up. (Hopefully.) CHANGELOG_BEGIN CHANGELOG_END	2020-06-18 13:04:19 +02:00
Gary Verhaegen	fba57470a5	restore terraform to working state (#6402 ) It looks like some nix update has broken our current Terraform setup. The Google provider plugin has changed its reported version to 0.0.0; poking at my local nix store seems to indicate we actually get 3.15, but 🤷. This PR also reverts the infra part of #6400 so we get back to master == reality. CHANGELOG_BEGIN CHANGELOG_END	2020-06-18 12:15:27 +02:00
Moritz Kiefer	2c1d4cb805	Fix nix installation (#6400 ) Nix now requires -L, I’ve gone ahead and just normalized everything to use -sfL which we were already using in one place. changelog_begin changelog_end	2020-06-18 10:34:08 +02:00
Gary Verhaegen	b9fbba7fc5	shorten Windows CI username (#6190 ) Keeping CI working on Windows involves a constant fight against MAX_PATH, which is a very short 260 characters. As the username appears in some paths, sometimes multiple times, we can save a few precious characters by having it shorter. CHANGELOG_BEGIN CHANGELOG_END	2020-06-06 15:03:15 +02:00
Edward Newman	9a073cebd9	Macos fix nix installer for build agent servers (#6133 ) * Fix issue with xz dependency missing for Nix installer CHANGELOG_BEGIN - MacOS - fix Nix installer dependency for xz CHANGELOG_END * - additional changes for new Nix installer for Catalina dependencies	2020-05-28 14:04:01 +02:00
Edward Newman	be4f85d165	Fix launchd killing VMWare process at end of script execution (#6006 ) * Fix launchd killing VMWare process at end of script execution * Fix launchd killing VMWare process at end of script execution CHANGELOG_BEGIN Fix issue with MacOS Catalina Launchd killing VMWare instance on rebuild (AbandonProcessGroup) CHANGELOG_END	2020-05-18 10:54:15 -04:00
Gary Verhaegen	bda565fa44	patching Bazel on Windows (infra bits, no patch yet) (#5918 ) patch Bazel on Windows (ci setup) We have a weird, intermittent bug on Windows where Bazel gets into a broken state. To investigate, we need to patch Bazel to add more debug output than present in the official distribution. This PR adds the basic infrastructure we need to download the Bazel source code, apply a patch, compile it, and make that binary available to the rest of the build. This is for Windows only as we already have the ability to do similar things on Linux and macOS through Nix. This PR does not contain any interesting patch to Bazel, just the minimum that we can check we are actually using the patched version. CHANGELOG_BEGIN CHANGELOG_END	2020-05-12 23:16:04 +02:00
Edward Newman	0ec0cc335f	Updates to support VMWare variant of Hypervisor for MacOS Build Nodes (#5940 ) * Updates to support VMWare variant of Hypervisor * Update infra/macos/scripts/rebuild-crontask.sh Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com> * Update infra/macos/scripts/run-agent.sh Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com> Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>	2020-05-12 09:36:40 -04:00
Gary Verhaegen	4a6ab84b69	add default machine capability (#5912 ) add default machine capability We semi-regularly need to do work that has the potential to disrupt a machine's local cache, rendering it broken for other streams of work. This can include upgrading nix, upgrading Bazel, debugging caching issues, or anything related to Windows. Right now we do not have any good solution for these situations. We can either not do those streams of work, or we can proceed with them and just accept that all other builds may get affected depending on which machine they get assigned to. Debugging broken nodes is particularly tricky as we do not have any way to force a build to run on a given node. This PR aims at providing a better alternative by (ab)using an Azure Pipelines feature called [capabilities](https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/agents?view=azure-devops&tabs=browser#capabilities). The idea behind capabilities is that you assign a set of tags to a machine, and then a job can express its [demands](https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml), i.e. specify a set of tags machines need to have in order to run it. Support for this is fairly badly documented. We can gather from the documentation that a job can specify two things about a capability (through its `demands`): that a given tag exists, and that a given tag has an exact specified value. In particular, a job cannot specify that a capability should _not_ be present, meaning we cannot rely on, say, adding a "broken" tag to broken machines. Documentation on how to set capabilities for an agent is basically nonexistent, but [looking at the code](https://github.com/microsoft/azure-pipelines-agent/blob/master/src/Microsoft.VisualStudio.Services.Agent/Capabilities/UserCapabilitiesProvider.cs) indicates that they can be set by using a simple `key=value`-formatted text file, provided we can find the right place to put this file. 
This PR adds this file to our Linux, macOS and Windows node init scripts to define an `assignment` capability and adds a demand for a `default` value on each job. From then on, when we hit a case where we want a PR to run on a specific node, and to prevent other PRs from running on that node, we can manually override the capability from the Azure UI and update the demand in the relevant YAML file in the PR. CHANGELOG_BEGIN CHANGELOG_END	2020-05-09 18:21:42 +02:00
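The demand semantics described in the capability commit above can be modelled in a few lines. This is a simplified sketch of the behaviour as the commit message describes it, not the Azure Pipelines implementation; the function names and the `debug-pr` value are hypothetical, and only the `assignment` key and its `default` value come from the message.

```python
from typing import Dict, Optional


def parse_capabilities(text: str) -> Dict[str, str]:
    """Parse a simple key=value text file, one capability per line."""
    capabilities: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        capabilities[key.strip()] = value.strip()
    return capabilities


def satisfies(capabilities: Dict[str, str],
              demands: Dict[str, Optional[str]]) -> bool:
    """A demand is either 'tag exists' (None) or 'tag equals this value'.

    There is no way to express 'tag must be absent', which is why a
    "broken" tag on broken machines would not help.
    """
    for key, wanted in demands.items():
        if key not in capabilities:
            return False
        if wanted is not None and capabilities[key] != wanted:
            return False
    return True


default_node = parse_capabilities("assignment=default\n")
reserved_node = parse_capabilities("assignment=debug-pr\n")  # hypothetical override

normal_job = {"assignment": "default"}
debug_job = {"assignment": "debug-pr"}

print(satisfies(default_node, normal_job))   # True
print(satisfies(reserved_node, normal_job))  # False
print(satisfies(reserved_node, debug_job))   # True
```

Because every ordinary job demands `assignment=default`, overriding the value on one node both reserves it for the PR that demands the new value and keeps all other jobs away from it.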
Gary Verhaegen	6aac32480a	hopefully fix memory issue with pg on macos CI (#5824 ) We have seen the following error message crop up a couple times recently: ``` FATAL: could not create shared memory segment: No space left on device DETAIL: Failed system call was shmget(key=5432001, size=56, 03600). HINT: This error does not mean that you have run out of disk space. It occurs either if all available shared memory IDs have been taken, in which case you need to raise the SHMMNI parameter in your kernel, or because the system's overall limit for shared memory has been reached. The PostgreSQL documentation contains more information about shared memory configuration. child process exited with exit code 1 ``` Based on [the PostgreSQL documentation](https://www.postgresql.org/docs/12/kernel-resources.html), this should fix it. CHANGELOG_BEGIN CHANGELOG_END	2020-05-04 14:32:23 -04:00
Edward Newman	01c784659f	Minor changes to MacOS infra config (#5673 )	2020-04-22 18:57:40 +02:00
Gary Verhaegen	43def51fce	add puppeteer dependencies to Linux nodes (#5575 ) See #5540 for context. CHANGELOG_BEGIN CHANGELOG_END	2020-04-17 01:32:25 +02:00
Gary Verhaegen	b3c428e76f	Macos boxes for ci (#5002 ) set up macOS nodes This PR documents how to create and manage macOS CI nodes. Because macOS is not supported by our current cloud providers, these instructions are geared towards creating VMs on physical machines we would need to host and manage ourselves, i.e. these notes are mostly targeted at Ed. CHANGELOG_BEGIN CHANGELOG_END	2020-04-14 18:03:24 +02:00
Gary Verhaegen	08a5a64325	replace Windows agents (#5527 ) It looks like the change in Windows agent names has caused an issue: because Windows agents are not always properly cleaned up on shutdown, i.e. they do not always have time to tell Azure they are going away, and because GCP likes to reuse the same names for machines in a group, we've been seeing errors like: ``` ERROR: The running command stopped because the preference variable "ErrorActionPreference" or common parameter is set to Stop: Pool 11 already contains an agent with name VSTS-WIN-3QCX. ``` recently. Today, only 2 out of our 6 agents have managed to register with Azure. This PR should fix that. CHANGELOG_BEGIN CHANGELOG_END	2020-04-14 13:58:42 +02:00
Gary Verhaegen	66e7068b39	better Windows machine names (#5374 ) This is a small QoL improvement, mostly targeted at myself: have Windows agents register with Azure using the name they display on the GCP console, so I don't need to find a build and look at the "Agent Diagnostics" step to figure out the correspondence between Azure and GCP. CHANGELOG_BEGIN CHANGELOG_END	2020-04-07 01:33:36 +02:00
Gary Verhaegen	10fefbae00	remove temp Windows machine (#5445 ) CHANGELOG_BEGIN CHANGELOG_END	2020-04-06 16:24:51 +02:00
Gary Verhaegen	1bf208ebbf	remove temp linux machine (#5351 ) @cocreature told me he's done with the Linux machine. He's still using the Windows one; not removing it is not an oversight. CHANGELOG_BEGIN CHANGELOG_END	2020-04-01 18:36:03 +02:00
Gary Verhaegen	ce5ad647a3	fix cocreature's temp machine (#5341 ) Our Linux startup script never finishes, as it ends with `exec`'ing to the Azure agent. Since I've removed that part, the EXIT handler, supposed to only kick in when an issue prevents the script from finishing, triggers on normal exit, and the machine shuts down, making it hard to use. CHANGELOG_BEGIN CHANGELOG_END	2020-04-01 15:28:56 +02:00
Gary Verhaegen	5ddf7ef497	temp machines for cocreature (#5335 ) CI has been behaving weirdly for the past three days, with build times on Linux and Windows regularly taking over 40 minutes, macOS builds occasionally running for almost three hours, and generally a lot of OOM exceptions (mostly on Windows, but a bit on the other two too). We currently have no idea what changed, and have been having trouble reproducing locally. As far as I'm aware, there has been no change to the CI infrastructure itself, so we suspect we broke something in our code somehow. @cocreature has requested access to Linux and Windows machines with similar specs and set-up as the CI ones, but without credentials. This PR attempts to provide that. Once the machines are up I will manually add accounts for @cocreature. CHANGELOG_BEGIN CHANGELOG_END	2020-04-01 14:13:03 +02:00
Gary Verhaegen	819210827e	fix permissions on periodic-killer (#5307 ) Even though the command succeeds as far as deleting the machine goes, it does log an error. That is probably why we recently had only one machine deleted per night. Something must have changed on the Google side recently to make this additional permission required. CHANGELOG_BEGIN CHANGELOG_END	2020-03-31 19:04:40 +02:00
Gary Verhaegen	38a5fea7a0	tweak periodic-killer (#5268 ) 1. Google says the instance is currently overutilized and suggests g1-small as a more appropriate size. 2. It occurred to me that the reason no error was logged might be that we lose them, so explicitly redirecting stderr too. CHANGELOG_BEGIN CHANGELOG_END	2020-03-30 14:12:14 +02:00
Gary Verhaegen	7e960eb454	log periodic reboots (#5235 ) It appears that most of our Windows machines have not been rebooted since Tuesday 24. We detected this because one of them has run out of disk space. This is not good, but what's worse is I currently have no idea what could be going wrong, and we are not logging anything at all in the current setup, so even ssh'ing into the machine provides no insight. This PR hopefully addresses that by: 1. Redirecting the outputs of the script to a file, and 2. `tail`ing that file from the startup script, so the logs will appear directly in the GCP web console. (This is what we currently do for the Azure agent logs on Linux.) This PR also tells the script to not stop on the first failed machine and keep trying. CHANGELOG_BEGIN CHANGELOG_END	2020-03-27 21:35:49 +01:00
Gary Verhaegen	1872c668a5	replace DAML Authors with DA in copyright headers (#5228 ) Change requested by Manoj. CHANGELOG_BEGIN CHANGELOG_END	2020-03-27 01:26:10 +01:00
Gary Verhaegen	7d665d6163	fix tf config for GCP default (#5158 ) It looks like GCP doesn't like not having a "page suffix" set, so it sets a default. Except somehow Terraform doesn't know it's a default value, so when trying to plan without the (optional) website value set, Terraform will always find that the deployed state has changed. With this change, we set it to a value that doesn't exist and won't work, but at least Terraform will see that the deployed state matches the configured one. Note: this PR is a bit special as far as "changes" go as there will be nothing to apply: applying current master tries to get rid of this website.main_page_suffix value, but it's back on the next run. With this patch, `terraform plan` declares "nothing to apply", so this PR itself won't (need to) be applied. CHANGELOG_BEGIN CHANGELOG_END	2020-03-24 13:33:59 +01:00
Gary Verhaegen	4095538acf	match terraform with reality (#5143 ) Our current Terraform setup attempts to create three static files on our GCS buckets. The issue is that these buckets are configured to automatically delete files that are older than X days, and there is no way to exclude specific files from that. Therefore, the created files disappear after some time, and running `terraform plan` suddenly looks like the infrastructure has changed. Moreover, the added value of these three files seems questionable: two of them provide `index.html` type of functionality for our two caches, whereas the third is automatically created by `nix` when pushing to the cache anyway (if it doesn't exist already). This PR also reduces the cache eviction time for the nix cache to 60 days, as a full year seemed a bit long. CHANGELOG_BEGIN CHANGELOG_END	2020-03-24 12:07:16 +01:00
Gary Verhaegen	2b951e7296	increase linux nodes to 10 (#4634 ) We're still seeing cases where we are hampered by a lack of Linux nodes, so increasing this again. CHANGELOG_BEGIN CHANGELOG_END	2020-02-20 17:02:41 +00:00
Gary Verhaegen	3e94f29a6a	increase linux pool (#4565 ) We've had a number of jobs waiting for >10 minutes at the busiest times of the day since we switched to 6 nodes, so increasing back a bit. I don't have very good visibility through the Azure UI, but it looks like all of the jobs queued (and not running) right now are very short ones so hopefully 8 should be enough. CHANGELOG_BEGIN CHANGELOG_END	2020-02-18 13:33:32 +00:00
Gary Verhaegen	c8e6486c79	pin Terraform plugin versions (#4519 ) We're currently depending on a floating "latest", which is often a bad idea. Today my machine decided to upgrade the google plugin, which is now specifying some new fields for the GCS objects, and therefore `terraform plan` does not look clean anymore, even though there has been no change to the terraform files (nor to the infrastructure). This PR aims to make our Terraform setup more reproducible by pinning Terraform plugin versions. It's also a way to track the application of the "new" Terraform setup, as it is technically a standard change (though hopefully a very safe one). CHANGELOG_BEGIN CHANGELOG_END	2020-02-14 13:52:27 +01:00
Gary Verhaegen	0a251b3fa5	switch CI nodes to permanent (#4455 ) CHANGELOG_BEGIN CHANGELOG_END	2020-02-11 02:07:42 +01:00
Gary Verhaegen	1681922f90	ci: temp machines for scheduled killing experiment (#4386 ) * ci: temp machines for scheduled killing experiment Based on our discussions last week, I am exploring ways to move us to permanent machines instead of preemptible ones. This should drastically reduce the number of "cancelled" jobs. The end goal is to have: 1. An instance group (per OS) that defines the actual CI nodes; this would be pretty much the same as the existing ones, but with `preemptible` set to false. 2. A separate machine that, on a cron (say at 4AM UTC), destroys all the CI nodes. The hope is that the group managers, which are set to maintain 10 nodes, will then recreate the "missing" nodes using their normal starting procedure. However, there are a lot of unknowns I would like to explore, and I need a playground for that. This is where this PR comes in. As it stands, it creates one "killer" machine and a temporary group manager. I will use these to experiment with the GCP API in various ways without interfering with the real CI nodes. This experimentation will likely require multiple `terraform apply` with multiple different versions of the associated files, as well as connecting to the machines and running various commands directly from them. I will ensure all of that only affects the new machines created as part of this PR, and therefore believe we do not need to go through a separate round of approval for each change. Once I have finished experimenting, I will create a new PR to clean up the temporary resources created with this one and hopefully set up a more permanent solution. CHANGELOG_BEGIN CHANGELOG_END * add missing zone for killer instance * add compute scope to killer * authorize Terraform to shutdown killer to update it * change in plans: use a service account instead * . 
* add compute.instances.list permission * add compute.instances.delete permission * add cron script * obligatory round of extra escaping * fix PATH issue & crontab format * smaller machine & less frequent reboots	2020-02-07 21:04:03 +01:00
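The mechanism the scheduled-killing experiment above hopes for (a cron job deletes every node, and the group manager recreates them up to its target size) can be illustrated with a toy model. Everything here is a hypothetical stand-in: `InstanceGroup`, `nightly_kill` and the `ci-node-N` names are invented for the sketch and do not use the GCP API; only the target of 10 nodes comes from the commit message.

```python
import itertools
from typing import List, Set

TARGET_SIZE = 10  # the commit message says the group managers maintain 10 nodes


class InstanceGroup:
    """Toy stand-in for a managed instance group (not the GCP API)."""

    def __init__(self, target_size: int) -> None:
        self.target_size = target_size
        self._counter = itertools.count(1)
        self.nodes: Set[str] = set()
        self.reconcile()

    def reconcile(self) -> None:
        """Create fresh nodes until the group is back at its target size."""
        while len(self.nodes) < self.target_size:
            self.nodes.add(f"ci-node-{next(self._counter)}")

    def delete(self, name: str) -> None:
        self.nodes.remove(name)


def nightly_kill(group: InstanceGroup) -> List[str]:
    """What the cron job on the 'killer' machine would do: delete every node."""
    killed = sorted(group.nodes)
    for name in killed:
        group.delete(name)
    return killed


group = InstanceGroup(TARGET_SIZE)
before = set(group.nodes)
nightly_kill(group)
group.reconcile()  # in the real setup the group manager would do this by itself
after = set(group.nodes)

print(len(after) == TARGET_SIZE)  # True
print(before.isdisjoint(after))   # True
```

The point of the design is that the killer needs no knowledge of how nodes are built; it only deletes, and the normal startup procedure does the rest.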
Gary Verhaegen	852fc7cd1a	remove temp debug ci nodes (#4373 ) Following the happy resolution of #4370 in #4371, we do not need the temporary nodes anymore. This PR therefore removes them. CHANGELOG_BEGIN CHANGELOG_END	2020-02-04 15:54:03 +01:00
Gary Verhaegen	5606ab350c	fix Windows CI node startup script (#4371 ) This is an attempt to apply a potential fix discovered as part of the investigation in #4370. The issue seems to be that Chocolatey is using a protocol deemed not secure enough and disabled in recent Windows images (our node creation script dynamically selects the latest "Windows 2016" server image from GCP). CHANGELOG_BEGIN CHANGELOG_END	2020-02-04 14:37:53 +01:00
Gary Verhaegen	48f39beda2	add Windows debug machine (#4370 ) Today we don't have any Windows machine in the CI pool. The machine template has not changed since 2019-11-21, yet as of today when the machine starts GCP proudly declares > GCEMetadataScripts: No startup scripts to run. despite the script being defined as `sysprep-specialize-script-ps1`, as per the [documentation](https://cloud.google.com/compute/docs/startupscript). Also, it used to work and we haven't changed anything. I'm not quite sure what's going on and how to investigate, but I think at the very least we can try to unblock the team by having a set of machines we initialize manually. This PR is meant to do that. This is the same changeset as `a877491139` and `16da700532`, except that it now specifies 5 machines instead of just one. CHANGELOG_BEGIN CHANGELOG_END	2020-02-04 14:30:56 +01:00
Gary Verhaegen	6233f66ff6	remove debug Windows machine (#4267 ) CHANGELOG_BEGIN CHANGELOG_END	2020-01-29 18:07:53 +01:00
Gary Verhaegen	16da700532	temporary Windows machines for Andreas (#4165 ) The recent changes to the way in which we build npm packages with Bazel have caused a lot of issues on Windows. To debug those, Andreas has requested a temporary machine. This is pretty much an exact replica of #3294 (`a87749113`), with the same plan: 1. I run terraform apply once this PR is merged. 2. I manually, through the GCP web console, set a dummy password for that machine's RDP connection and transmit that to @aherrmann-da through Slack. 3. @aherrmann-da debugs the issue. 4. I create a PR to roll back this one, then apply it once it's merged. Note: I have verified that master applies cleanly prior to opening this PR. CHANGELOG_BEGIN CHANGELOG_END	2020-01-22 19:10:01 +01:00
Gary Verhaegen	878429e3bf	update copyright notices to 2020 (#3939 ) copyright update 2020 * update template * run script: `dade-copyright-headers update .` * update script * manual adjustments * exclude frozen proto files from further header checks (by adding NO_AUTO_COPYRIGHT files)	2020-01-02 21:21:13 +01:00
Gary Verhaegen	07074a4759	remove Windows debug machine (#3451 )	2019-11-13 18:33:15 +01:00
Gary Verhaegen	d4c38a3763	add gcs bucket for ledger dumps (#3374 )	2019-11-07 14:41:15 +00:00
Gary Verhaegen	62dcbd86b5	pin hoogle version to avoid surprises (#3322 )	2019-11-05 18:14:29 +00:00
Gary Verhaegen	a877491139	temporary Windows CI instance for debugging (#3294 ) Create a temporary CI machine that looks just like the real ones specifically for debugging.	2019-11-04 11:52:27 +01:00
Gary Verhaegen	13e6f581e3	fix hoogle; revert cache buckets ACL changes (#3062 )	2019-09-27 15:42:31 +01:00
Gary Verhaegen	99ea93168d	update copyright notices (#2499 )	2019-08-13 17:23:03 +01:00
Gary Verhaegen	bf5995f529	remove mentions of da-int servers (#2485 )	2019-08-12 10:42:41 +01:00
Florian Klink	14ecfd7bae	infra: add acls for google_storage_objects create via tf (#2460 ) This ensures objects in the google storage bucket created by terraform have the proper publicRead acl.	2019-08-08 19:13:15 +02:00
Gary Verhaegen	36070476c3	collect historical download data (#2003 )	2019-07-04 11:23:51 +00:00
Florian Klink	1cd5bb2492	infra: move index.html outside gcp_cdn_bucket module (#1716 ) * infra: gcp_cdn_bucket: update comment The cache retention can be configured, while the comment suggests its hardcoded. * infra: don't create index.html inside gcp_cdn_bucket module We might want to add a different index.html per bucket, so move that code outside the module and into the bucket-specific terraform files. Also add bucket-specific index.html files.	2019-07-02 11:14:21 +01:00
Gary Verhaegen	a1424d3446	add autohealing to hoogle cluster (#1906 )	2019-06-27 05:46:01 +00:00
Gary Verhaegen	18aee24e0f	fix hoogle cron escaping (#1902 )	2019-06-26 18:42:23 +00:00
Gary Verhaegen	31171ec6b6	terraform files for hoogle server (#1660 )	2019-06-22 00:15:52 +00:00
Bolek@DigitalAsset	1a62841616	infra: add docker daemon to ci agent (#1566 ) * installs docker and adds vsts user to docker group	2019-06-08 22:31:55 +00:00
Gary Verhaegen	4120ef2d1b	[linux/ci] fix logging agent (#1356 ) There are two issues with the current setup: - iptables entry prevents connecting to the metadata server, and - machines are given insufficient permissions.	2019-05-30 15:36:57 +00:00
Gary Verhaegen	ac719e7927	[ci/linux] keep daml copy until it's actually not needed anymore (#1349 ) The existing script is deleting the daml directory too early, leading to the "shutdown agents" step failing.	2019-05-23 15:25:37 +00:00
Gary Verhaegen	c762d491ea	target s3 bucket with docs refresh script (#1287 ) There is no simple way to configure GCS to serve the desired security headers, so instead the script will keep updating the existing s3 bucket. Consequent changes: - Add aws cli tool to dev-env - Remove docs bucket from Terraform	2019-05-21 22:26:07 +00:00
Gary Verhaegen	be2457cc6a	[ci/linux] restart fluentd after installing (#1290 ) It looks like the curl command is currently installing but not starting the service that is supposed to send logs to StackDriver. When connecting to the machines manually, a call to `restart` seems to fix it.	2019-05-21 21:37:51 +00:00
Moritz Kiefer	1cfa27d616	Install the Windows SDK on CI nodes (#1272 ) This provides signtool.exe which we need to sign our Windows installer.	2019-05-21 13:42:49 +02:00
Brian Hansen	f9bb85a5a7	remove -O option from curl command in order to pipe script contents t… (#953 ) * remove -O option from curl command in order to pipe script contents to bash * follow redirects for stackdriver Co-Authored-By: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>	2019-05-15 18:33:01 +00:00
Gary Verhaegen	a244579470	set default page for docs (#1102 ) This mirrors the current behaviour of docs.daml.com.	2019-05-13 22:34:21 +00:00
Gary Verhaegen	5ab5ced2e3	add GCS bucket for docs (#1062 ) This is a first step towards improving our docs release process. The goal here is to get rid of the manual "publish docs" step. This is done as a periodic check because we only want to run this for "published" releases, i.e. the ones that are not marked as prerelease. Because the act of publishing a release is a manual step that Azure cannot trigger on, we instead opt for a periodic check. Not included in this piece of work: - Any change to the docs themselves; the goal here is to automate the current process as a first step. Future plans for the docs themselves include adding links to older versions of the docs. - A better way to detect docs are already up-to-date, and abort if so. - Including older versions of the docs. - Switching the DNS record from the current AWS S3 bucket to this new GCS bucket. That will be a manual step once we're happy with how the new bucket works.	2019-05-11 03:27:17 +00:00
Gary Verhaegen	e95575b033	install StackDriver on build machines (#905 ) Requested by Security	2019-05-04 22:55:51 +00:00
Florian Klink	56c322c982	infra: add some docs / comments (#796 ) * infra: document google_storage_bucket_iam_member resources * infra: document nix-cache-info file * infra: document who's maintaining the DA ext certificate * infra: README: mention azure pipeline agents * infra: README: IT -> DA IT	2019-05-01 15:54:09 +00:00
Jonas Chevalier	769c04d3ba	infra: reduce differences with hosted (#698 )	2019-04-25 20:49:38 +00:00
Jonas Chevalier	3b8ae1ff86	infra: add a VSTS windows agents (#368 )	2019-04-18 11:20:57 +00:00
Jonas Chevalier	16aba583ce	CI linux agent changes (#509 ) * ci: always use the linux-pool reduce the difference of environment between external and internal contributions * infra: tweak the linux cache warmup script Don't share the same bazel cache directory with the disk cache, which is something else. Be more specific about the target. Clean after yourself. * infra: bump the linux agent disk to 200GB avoid running out of disk space	2019-04-16 11:35:46 +02:00
Florian Klink	5f75e9d1a0	infra/vsts_agent_linux_startup.sh: warm up local caches, purge old agents (#438 ) Warm up local caches by building dev-env and current daml master This is allowed to fail, as we still want to have CI machines around, even when their caches are only warmed up halfway. Afterwards, we purge old agents that might still be around, that didn't unregister themselves This depends on #402 to be merged, as otherwise purge_old_agents.py can't be found obviously.	2019-04-12 16:47:36 +02:00
Jonas Chevalier	6f90fda6d1	infra: VSTS agent improvements (#369 ) * infra: replace the debian image by ubuntu 16.04 be closer to what the azure vmImage is using * infra: limit access to the PAT token	2019-04-11 17:11:14 +02:00
zimbatm	430a85649c	add more Azure Pipeline agents (#230 ) * nix: add the more providers to terraform * docs: make tarballs more reproducible * ci: use the linux-pool pool * ci: tweak the nix installation handle the case where the user is root and on ubuntu * infra: terraform fmt * infra: add Azure Pipeline agents * ci: only enable linux-pool for internal PRs	2019-04-09 18:59:37 +02:00
Digital Asset GmbH	05e691f558	open-sourcing daml	2019-04-04 09:33:38 +01:00

132 Commits