Commit Graph

173 Commits

Author SHA1 Message Date
Edward Newman
8e44f3b8bf
M1 build setup using Packer & Tart (#14635)
* M1 build setup using Packer

* Add change log

CHANGELOG_BEGIN
CHANGELOG_END

* Update infra/macos/m1-build/init-2.sh

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>

2022-08-17 07:57:22 -04:00
Gary Verhaegen
8ca7af3030
onboarding: add Chun Lok to release rotation (#14593)
And to our infrastructure's notion of "the ledger clients team"
(obviously abbreviated to "appr").

CHANGELOG_BEGIN
CHANGELOG_END
2022-08-03 12:05:28 +02:00
Gary Verhaegen
bf64e705e9
ci: unpin Docker (#14464)
When we set this up years ago (#1566), it was a way to get a more recent
Docker version than was then available through the default Ubuntu 16.04
apt repository. Nowadays, this actually makes us lag behind, to the
point where the 2.3.1 image isn't building.

CHANGELOG_BEGIN
CHANGELOG_END
2022-07-18 17:52:28 +00:00
Gary Verhaegen
9aec01853b
ci: unpin Ubuntu image (#14143)
Cleaning up after #14126

CHANGELOG_BEGIN
CHANGELOG_END
2022-06-22 13:56:05 +02:00
Gary Verhaegen
feb53f96c1
infra: tighten TLS security (#14239)
This tightens our TLS configuration a bit, mostly by dropping support
for SSL3, TLS1.0 and TLS1.1 on https://hoogle.daml.com,
https://bazel-cache.da-ext.net, https://nix-cache.da-ext.net and the
daml-binaries front (which I don't think we still use).
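
A quick way to spot-check the new policy from the outside (a sketch; assumes an `openssl` build that still ships the legacy protocols):

```
# Handshakes with legacy protocols should now be rejected on the tightened endpoints.
for host in hoogle.daml.com bazel-cache.da-ext.net nix-cache.da-ext.net; do
  echo | openssl s_client -connect "$host:443" -tls1_1 >/dev/null 2>&1 \
    && echo "$host: TLS 1.1 still accepted" \
    || echo "$host: TLS 1.1 rejected"
done
# TLS 1.2 should still work:
echo | openssl s_client -connect hoogle.daml.com:443 -tls1_2 >/dev/null 2>&1 && echo "TLS 1.2 OK"
```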

CHANGELOG_BEGIN
CHANGELOG_END
2022-06-21 14:37:24 +00:00
Gary Verhaegen
d81b4a7071
ci: pin Linux image to yesterday's because today's is broken (#14126)
Running the `docker` command on today's Ubuntu images crashes the
kernel. (Which is super reassuring from a security pov.)

CHANGELOG_BEGIN
CHANGELOG_END
2022-06-08 14:08:23 +00:00
Gary Verhaegen
eefe285f67
remove Stewart from release rotation (#14018)
CHANGELOG_BEGIN
CHANGELOG_END
2022-05-30 17:09:40 +00:00
Gary Verhaegen
dfa648f585
hunt down DAML better (#13195)
Process:

- `git ls-files -z | xargs -0 -n 100 sed -i --follow-symlinks 's/DAML/Daml/g'`
- `git add -p`
- `git restore -p`
- Check there is no unstaged change left.

To review:

- Check for false positives by carefully reviewing the diff in this PR.
- Check for false negatives with `git grep DAML`.
- Quicker check for false positives:

```
git grep DAML | grep -v migration | grep -v DAML_
```

Fixes #13190

Note: This is the "second half" of #13191, which failed to cover all the
remaining DAMLs because of:

```
$ git ls-files | grep "'"
compiler/damlc/tests/daml-test-files/MangledScenario'.daml
```

CHANGELOG_BEGIN
CHANGELOG_END
2022-03-08 17:04:58 +01:00
Gary Verhaegen
53557dd7de
shut down ElasticSearch (#13151)
The cluster shuts down about once every two weeks and takes a couple
hours to get back up. It's been off for a few days right now and as far
as I'm aware nobody noticed.

My personal assessment is that this is costing us more in maintenance
(not to mention running) costs than what we're getting out of it.

CHANGELOG_BEGIN
CHANGELOG_END
2022-03-04 17:14:15 +01:00
Gary Verhaegen
091a5ac752
appr: add Stewart (#13116)
CHANGELOG_BEGIN
CHANGELOG_END
2022-03-01 23:11:54 +00:00
Gary Verhaegen
fe9d44ffe7
ci: bump Nix on macOS nodes (#13061)
However that happened, we were stuck with Nix 2.3.15 (or 2.3.16 in some
cases) on our macOS nodes. This PR is a minor edit to the Nix
initialization commands to switch from 2.4 to "latest", but I will also
use it to record the changes I just did manually to the cluster.

The cluster is currently composed of two parts:
- 7 machines running Catalina (10.15.7).
- 1 machine running Monterey (12.2).

Unfortunately they use different setups. The Catalina ones are described
by the state of the repo (in theory, though keeping them in sync is
manual); in order to update those, I have:

1. Taken one node off the CI pool (`builder1epjj7`).
2. On that node, run the following commands:
   ```
   cd ~/daml/infra/macos/3-running-box
   vagrant destroy -f
   rm ~/images/*
   vagrant box remove macinbox
   vagrant box remove azure-ci-node
   rm -r ~/.vagrant.d/boxes/macinbox-06032020.tar.gz
   softwareupdate -d --fetch-full-installer --full-installer-version 10.15.7
   cd ~/daml/infra/macos/1-create-box
   sudo macinbox --box-format vmware_desktop --disk 250 --memory 32768 --cpu 10 --user-script user-script.sh
   cd ../2-common-box
   vagrant up
   vagrant package --output ~/images/initialized-$(date +%Y%m%d).box
   vagrant destroy -f
   cd
   ./run-agent.sh
   ```
   This leaves us with that node running an updated box. The new box is
   in `~/images/initialized-$(date)`.
3. Send that file to all the other nodes with `scp`.
4. Reboot all the nodes (after deactivating & waiting for jobs to
   finish).

For the Monterey node, images (steps 1 and 2 in this repo) are currently
created by @nycnewman on another machine I don't have access to, so I
took a slightly different approach: I took the existing image, started
it from the `3-running-box` folder as usual, manually updated Nix there,
then repackaged that.

CHANGELOG_BEGIN
CHANGELOG_END
2022-02-24 01:04:28 +00:00
Gary Verhaegen
583cad5fd6
Fix tf (#13028)
Goals:

- Reflect manual changes from #12996 in Terraform.
- Reflect manual changes from #12997 in Terraform.
- Update plugins to work with #12926.
- Keep running services working through the changes.

Details in commits.

CHANGELOG_BEGIN
CHANGELOG_END
2022-02-22 18:33:21 +00:00
Gary Verhaegen
c04fa81d6a
ci: bump Windows workdirs (#12918)
Since #12645, we added a new pipeline, so we need to add a corresponding
entry.

As for #12645, the content of the files and the directory structure is
taken directly from a live CI node, as printed by the (now-named)
`workdirs` step.

CHANGELOG_BEGIN
CHANGELOG_END
2022-02-14 18:49:32 +00:00
Gary Verhaegen
449a68cb33
Fix es (#12845)
A node seemed to have died so I connected to investigate and you know
the rest of this story.

CHANGELOG_BEGIN
CHANGELOG_END
2022-02-09 19:33:25 +00:00
Gary Verhaegen
f08dfa3264
Bump terraform (#12670)
We've been using an old version of Terraform for a long time now. The
main blocker used to be that there was no post-0.12 version of `secret`,
but that has now been resolved: there's a new fork, with new maintainers
(blessed by the original one and accepted by the Terraform registry)
[here].

I'll be upgrading one version at a time as 0.x versions are considered
major (and thus potentially breaking).

[here]: https://github.com/numtide/terraform-provider-secret

See https://github.com/digital-asset/daml/pull/12670 for details.
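
One hypothetical way to walk the upgrade forward one 0.x release at a time (a sketch; it assumes `tfenv` for switching binaries, which is not necessarily what was used here, and the version list is illustrative):

```
# Step through each 0.x release, re-initializing and reviewing the plan at every step.
for v in 0.13.7 0.14.11 0.15.5 1.0.11; do
  tfenv install "$v" && tfenv use "$v"
  terraform init -upgrade
  terraform plan -detailed-exitcode || echo "review changes before continuing past $v"
done
```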

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-31 15:46:59 +01:00
Gary Verhaegen
1fa7f61bb0
ci: pin workdirs on Windows (#12645)
The Bazel cache on Windows includes absolute paths. The normal process
for Azure is to dynamically allocate new top-level folders for each new
build that runs on a given machine. The result of that is that we get
about a one in three chance to get caching for any single Windows build
(it's actually not _quite_ that because we don't run different builds an
equal number of times).

This PR is an attempt at pinning the folder to job mapping by mucking
around in [Azure internals], which may or may not have bad consequences
down the line, assuming it works at all.

[Azure internals]: https://github.com/microsoft/azure-pipelines-agent/blob/master/docs/jobdirectories.md

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-31 10:20:12 +00:00
Gary Verhaegen
b1a917596c
ci: reduce Windows capacity (#12607)
Reverting #12599.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-26 18:17:56 +00:00
Gary Verhaegen
01219d6cdc
ci: temporarily increase Windows capacity (#12599)
Our Windows CI nodes seem completely overwhelmed today, with typical
wait times above half an hour before jobs even start. This isn't fun, so
I'd like to double our capacity for a few hours.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-26 15:03:28 +01:00
Gary Verhaegen
170d839ed0
Fix es (#12554)
It's down again. I wish I knew why it does that.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-24 18:33:59 +01:00
Gary Verhaegen
de2a8c0c04
ci: use service account for Windows nodes (#12489)
When no service account is explicitly selected, GCP provides a default
one, which happens to have way more access rights than we're comfortable
with. I'm not quite sure how the total lack of a service account slipped
through here, but I've noticed today so I'm changing it.
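
A quick audit command for this kind of check (a sketch; the name filter is a guess at the Windows node naming):

```
# List which service account each Windows CI node actually runs as; an empty value or the
# default "...-compute@developer.gserviceaccount.com" account means it slipped through.
gcloud compute instances list --filter='name~vsts-win' \
  --format='table(name, serviceAccounts[].email)'
```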

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-19 19:58:17 +00:00
Gary Verhaegen
5716d99cd2
Disable printer sharing (#12408)
As the title suggests. We already disable all communication between CI
nodes through network rules, but we currently get a lot of noise from
GCP logging violations to those rules from Windows trying to feel its
way out for file share buddies.

CHANGELOG_BEGIN
CHANGELOG_END

As usual, this branch will contain intermediate commits that may serve
as an audit log of sorts.
2022-01-13 20:55:18 +00:00
Gary Verhaegen
e34ac20d23
offboarding Akshay (#12396)
😿

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-13 14:40:44 +01:00
Gary Verhaegen
6aa9409e6e
split-releases: gcs accounts for assembly & canton (#12373)
CHANGELOG_BEGIN
CHANGELOG_END
2022-01-13 14:19:08 +01:00
Gary Verhaegen
9f5a2f9778
Fix terraform (#12333)
Our Terraform configuration has been slightly broken by two recent
changes:

- The nixpkgs upgrade in #12280 means a new version of our GCP plugin
  for Terraform, which as a breaking change added a required argument to
  `google_project_iam_member`. The new version also results in a number
  of smaller changes in the way Terraform handles default arguments, which
  doesn't result in any changes to our configuration files or to the
  behaviour of our deployed infrastructure, but does require re-syncing
  the Terraform state (by running `terraform apply`, which would
  essentially be a no-op if it were not for the next bullet point).
- The nix configuration changes in #12265 have changed the Linux CI
  nodes configuration but have not been deployed yet.

This PR is an audit log of the steps taken to rectify those and bring us
back to a state where our deployed configuration and our recorded
Terraform state both agree with our current `main` branch tip.
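
The re-sync itself is the standard init/plan/apply cycle (a minimal sketch; backend and credentials are assumed to already be configured):

```
# Pick up the new GCP provider version, review the (mostly cosmetic) diff, then re-sync the state.
terraform init -upgrade
terraform plan -out=tfplan
terraform apply tfplan
```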

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-10 21:56:47 +00:00
Gary Verhaegen
648021a2e7
Fix es cluster (#12262)
audit log of actions taken to fix cluster post Winter break

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-05 16:29:48 +00:00
Samir Talwar
854b66ee2f
devenv: Use NIX_USER_CONF_FILES to set caches. (#12265)
* devenv: Factor out a function to get the Nix version.

* devenv: On newer versions of Nix, use `NIX_USER_CONF_FILES`.

This stacks, rather than overwrites.

* devenv: Append Nix caches, instead of overwriting them.

The "extra-" prefix tells Nix to append.

We also switch to non-deprecated configuration keys (see the sketch at
the end of this message).

CHANGELOG_BEGIN
CHANGELOG_END

* devenv: Just require Nix v2.4 or newer.

* devenv: `NIX_USER_CONF_FILES` may not be set already.
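
A minimal sketch of the resulting mechanism (the file path and cache key are illustrative, not the actual devenv values):

```
# Write an extra conf file that appends to, rather than replaces, the user's own Nix settings.
mkdir -p "$HOME/.config/nix"
cat > "$HOME/.config/nix/daml-caches.conf" <<'EOF'
extra-substituters = https://nix-cache.da-ext.net
extra-trusted-public-keys = <cache-public-key>
EOF
# NIX_USER_CONF_FILES may not be set already, so only add the separator when it is.
export NIX_USER_CONF_FILES="${NIX_USER_CONF_FILES:+$NIX_USER_CONF_FILES:}$HOME/.config/nix/daml-caches.conf"
```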
2022-01-05 13:24:52 +01:00
Gary Verhaegen
93e616475e
Update ci nodes for copyright update (#12255)
Audit log for CI rotation.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-04 15:24:31 +00:00
Gary Verhaegen
d2e2c21684
update copyright headers (#12240)
New year, new copyright, new expected unknown issues with various files
that won't be covered by the script and/or will be but shouldn't change.

I'll do the details on Jan 1, but would appreciate this being
preapproved so I can actually get it merged by then.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-03 16:36:51 +00:00
Gary Verhaegen
0e30d468f9
expand CI cluster back (#12239)
To be done / merged on Jan 3.

CHANGELOG_BEGIN
CHANGELOG_END
2022-01-03 14:50:08 +00:00
Gary Verhaegen
a51f75d193
give a break to CI (#12238)
I can't think of a good reason to keep 30+ machines running over the
Winter break. I'll bump this back up on Jan 3.

CHANGELOG_BEGIN
CHANGELOG_END
2021-12-26 10:53:08 +01:00
Gary Verhaegen
c5708c96c5
es: mitigate log4j vuln (#12118)
CHANGELOG_BEGIN
CHANGELOG_END
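
The message doesn't record the exact mitigation; the widely used one at the time was the message-lookup kill switch, sketched here purely as an assumption:

```
# Hypothetical sketch (CVE-2021-44228): disable log4j2 message lookups on the Elasticsearch JVMs.
export ES_JAVA_OPTS="${ES_JAVA_OPTS:-} -Dlog4j2.formatMsgNoLookups=true"
```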
2021-12-10 22:46:02 +00:00
Gary Verhaegen
b6b6ecd669
repair ES cluster (#12117)
The cluster shut down today. I'm still not sure why this happens, but
cycling the entire cluster seems to solve it so here goes.

CHANGELOG_BEGIN
CHANGELOG_END
2021-12-10 20:32:52 +01:00
Gary Verhaegen
349d812482
ci: increase hard drive space (not macOS) (#11983)
I've seen quite a few builds failing for lack of disk space recently,
sometimes as early as 2pm CET.

CHANGELOG_BEGIN
CHANGELOG_END
2021-12-06 19:41:11 +00:00
Gary Verhaegen
de8d15fb1e
fix Nix install on macOS nodes (#11696)
As part of the 2.4 release, the Nix installer has been changed to take
care of the volume setup (which we don't want it to do here). Because
that requires root access, they've decided to make multi-user install
the default, and to disable single-user install.

We could do an in-depth review of the difference and adapt our setup to
use a multi-user setup (we do use the multi-user setup on Linux, so
there's precedent), but as an immediate fix, we can keep our single-user
setup _and_ get the latest Nix by using the 2.3.16 installer and then
upgrading from within Nix itself. This _should_ keep working at least
for a while, as Linux still defaults to single-user.
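
Roughly, the resulting install flow looks like this (a sketch; the real commands live in the infra init scripts):

```
# Single-user install from the last 2.3.x installer, then upgrade Nix from within Nix itself.
sh <(curl -L https://releases.nixos.org/nix/nix-2.3.16/install) --no-daemon
. "$HOME/.nix-profile/etc/profile.d/nix.sh"
nix-channel --update && nix-env -iA nixpkgs.nix
```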

CHANGELOG_BEGIN
CHANGELOG_END
2021-11-24 18:53:13 +01:00
Gary Verhaegen
5f52f00afb
increase linux cluster size (#11860)
CHANGELOG_BEGIN
CHANGELOG_END
2021-11-24 15:18:56 +00:00
Gary Verhaegen
ab520fbc51
Fix es (#11784)
CHANGELOG_BEGIN
CHANGELOG_END
2021-11-22 11:41:10 +01:00
Gary Verhaegen
fdde5353f4
fix hoogle by pinning nixpkgs (#11548)
Hoogle has been down for at least 24h according to user reports. What
seems to be happening is that our nixpkgs pinning is not taking effect,
and the nixpkgs version of Hoogle already includes the patch we are
trying to add. This confuses nix, which fails, and thus the boot
sequence is broken.

I've applied the minimal possible patch here (i.e. enforce the pin),
which gets things running again. I've already deployed this change.

We may want to look at bumping the nixpkgs snapshot.

CHANGELOG_BEGIN
CHANGELOG_END
2021-11-04 10:22:30 +00:00
Gary Verhaegen
ebe742098d
fix es ingest: file too large (#11415)
This fixes an issue where we try to upload a single JSON blob that is
bigger than the limit of 500MB. We could also raise the limit, I guess,
but that seems more error-prone.
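
The guard this implies looks roughly like the following (a sketch; the payload path, index name and endpoint are stand-ins for whatever the ingestion script actually uses):

```
# Skip any payload over the 500MB ingest limit instead of retrying it forever.
max_bytes=$((500 * 1024 * 1024))
size=$(stat -c%s "$payload" 2>/dev/null || stat -f%z "$payload")
if [ "$size" -gt "$max_bytes" ]; then
  echo "skipping $payload: $size bytes exceeds the ingest limit" >&2
else
  curl -X POST "$ES_HOST/events-$build_id/_doc" \
    -H 'Content-Type: application/json' --data-binary "@$payload"
fi
```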

CHANGELOG_BEGIN
CHANGELOG_END
2021-10-27 16:46:14 +02:00
Gary Verhaegen
5654d5cb48
fix es ingest for missing files (#11375)
If a job fails at build time, there are no test logs to process.
Currently this means the ingestion process is going to be stuck in an
endless loop of retrying that job and failing on the missing file.

This change should let us process jobs with no test files.
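
Something along these lines (a sketch; file and marker names are hypothetical):

```
# A build that failed before the test phase produced no logs; mark it processed
# instead of retrying it forever.
if [ ! -f "$job_dir/test-events.json" ]; then
  echo "no test logs for $job_dir, marking as processed" >&2
  touch "$job_dir/.ingested"
  exit 0
fi
```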

CHANGELOG_BEGIN
CHANGELOG_END
2021-10-25 14:40:33 +02:00
Gary Verhaegen
5e43f8c703
es: drop jobs-* indices (#10857)
We are currently ingesting Bazel events in two forms:

In the `events-*` indices, each Bazel event is recorded as a separate ES
object, with the corresponding job name as a field that can serve to
aggregate all of the events for a given job.

In the `jobs-*` indices, each job is ingested as a single (composite) ES
object, with the individual events as elements in a list-type field.

When I set up the cluster, I wasn't sure which one would be more useful,
so I included both. We now have a bit more usage experience and it turns
out the `events-*` form is the only one we use, so I think we should
stop ingesting everything twice and from now on create only the
`events-*` ones.
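
Dropping the redundant indices is a single call against the cluster (a sketch; `$ES_HOST` stands in for the cluster address):

```
# Remove the composite per-job documents; events-* remains the only ingested form.
curl -X DELETE "$ES_HOST/jobs-*"
```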

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-28 11:06:52 +02:00
Gary Verhaegen
fe9aeffeaf
Increase es disk size (#11019)
Disks are currently at 75% full, so it seems like a good idea to bump a
bit. #10857 should reduce our needs, too, so this should last us a
while.

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-25 02:51:26 +02:00
Gary Verhaegen
28b8d9a1f7
bump dotnet (#10979)
This bumps dotnet to the version required by the latest azuresigntool,
and pins azuresigntool for the future.

As usual for live CI upgrades, this will be rolled out using the
blue/green approach. I'll keep each deployed commit in this PR.

For future reference, this is PR [#10979].

[#10979]: https://github.com/digital-asset/daml/pull/10979

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-22 16:39:40 +00:00
Gary Verhaegen
6f151e287e
save kibana exports (#10861)
As explained in #10853, we recently lost our ES cluster. While I'm not
planning on trusting Google's "rolling restart" feature ever again, we
can't exclude the possibility of future similar outages (without a
significant investment in the cluster, which I don't think we want to
do).

Losing the cluster is not a huge issue as we can always reingest the
data. Worst case we lose visibility for a few days. At least, as far as
the bazel logs are concerned.

Losing the Kibana data is a lot more annoying, as that is not derived
data and thus cannot be reingested. This PR aims to add a backup
mechanism for our Kibana configuration.
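
The backup can be as simple as exporting Kibana's saved objects on a schedule (a sketch; the object types and the bucket name are assumptions):

```
# Export dashboards, visualizations, index patterns and searches as NDJSON, then copy to GCS.
curl -s -X POST "$KIBANA_HOST/api/saved_objects/_export" \
  -H 'kbn-xsrf: true' -H 'Content-Type: application/json' \
  -d '{"type": ["dashboard", "visualization", "index-pattern", "search"]}' \
  > "kibana-export-$(date +%Y%m%d).ndjson"
gsutil cp kibana-export-*.ndjson "gs://<backup-bucket>/kibana/"
```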

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-13 18:28:11 +00:00
Gary Verhaegen
8c9edd8522
es cluster tweaks (#10853)
On Sept 8 our ES cluster became unresponsive. I tried connecting to the
machines.

One machine had an ES Docker container that claimed to have started 7
weeks ago and stopped 5 weeks ago, while the machine's own uptime was 5
weeks. I assume GCP had decided to restart it for some reason. The init
script had failed on missing a TTY, hence the addition of the
`DEBIAN_FRONTEND` env var.

Two machines had a Docker container that had stopped on that day, resp.
6h and 2h before I started investigating. It wasn't immediately clear
what had caused the containers to stop.

On all three of these machines, I was able to manually restart the
containers and they were able to reform a cluster, though the state of
the cluster was red (missing shards).

The last two machines simply did not respond to SSH connection attempts.
Assuming it might help, I decided to try to restart the machines. As GCP
does not allow restarting individual machines when they're part of a
managed instance group, I tried clicking the "rolling restart" button
on the GCP console, which seemed like it would restart the machines. I
carefully selected "restart" (and not "replace"), started the process,
and watched GCP proceed to immediately replace all five machines, losing
all data in the process.

I then started a new cluster and used bigger (and more) machines to
reingest all of the data, and then fell back to the existing
configuration for the "steady" state. I'll try to keep a better eye on
the state of the cluster from now on. In particular, we should not have
a node down for 5 weeks without noticing.

I'll also try to find some time to look into backing up the Kibana
configuration, as that's the one thing we can't just reingest at the
moment.
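
For reference, the TTY fix mentioned above amounts to making apt non-interactive in the init script (a sketch; the package list is illustrative):

```
# Init scripts run without a TTY; make apt fully non-interactive so installs can't block on prompts.
export DEBIAN_FRONTEND=noninteractive
apt-get update -y && apt-get install -y docker.io
```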

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-13 11:12:02 +02:00
Gary Verhaegen
4093bbd58c
fix macOS Bazel cache (#10795)
macOS filesystems have been case-insensitive by default for years, and
in particular our laptops are, so if we want the cache to work as
expected, CI should be too.

Note: this does not apply to Nix, because the Nix partition is a
case-sensitive image per @nycnewman's script on laptops too.
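
A quick check for whether a given volume is case-sensitive (a sketch, handy when comparing CI nodes with laptops):

```
# On a case-insensitive filesystem the two names collide and only one entry remains.
tmp=$(mktemp -d)
touch "$tmp/case" "$tmp/CASE"
[ "$(ls "$tmp" | wc -l)" -eq 2 ] && echo "case-sensitive" || echo "case-insensitive"
rm -r "$tmp"
```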

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-07 13:31:57 +02:00
Andreas Herrmann
7b94b0674e
Map shortened scala test suite names to long names on Windows (#10628)
* Generate short to long name mapping in aspect

Maps shortened test names in da_scala_test_suite on Windows to their
long name on Linux and MacOS.

Names are shortened on Windows to avoid exceeding MAX_PATH.

* Script to generate scala test name mapping

* Generate scala-test-suite-name-map.json on Windows

changelog_begin
changelog_end

* Generate UTF-8 with Unix line endings

Otherwise the file will be formatted using UTF-16 with CRLF line
endings, which confuses `jq` on Linux.

* Apply Scala test name remapping before ES upload

* Pipe bazel output into intermediate file

Bazel writes the output of --experimental_show_artifacts to stderr
instead of stdout. In Powershell this means that these outputs are not
plain strings, but instead error objects. Simply redirecting these to
stdout and piping them into further processing will lead to
indeterministically missing items or indeterministically introduced
additional newlines which may break paths.

To work around this we extract the error message from error objects,
introduce appropriate newlines, and write the output to a temporary file
before further processing.

This solution is taken and adapted from
https://stackoverflow.com/a/48671797/841562

* Add copyright header

Co-authored-by: Andreas Herrmann <andreas.herrmann@tweag.io>
2021-08-24 17:03:45 +02:00
Gary Verhaegen
a6650c11e5
fix hoogle (#10555)
Hoogle was down because the machines got stuck in the `apt-get update`
stage.

CHANGELOG_BEGIN
CHANGELOG_END
2021-08-11 13:52:43 +00:00
Gary Verhaegen
449a72a86f
increase ES memory (#10318)
ES died (again) over the weekend, so I had to manually connect to each
node in order to restore it, and thus made another migration. This time
I opted to make a change, though. Lack of memory is a bit of a weak
hypothesis for the observed behaviour, but it's the only one I have at
the moment and, given how reliably ES has been crashing so far, it's
fairly easy to test.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-19 17:50:16 +02:00
Gary Verhaegen
a3b861eae8
refresh es cluster (#10300)
The cluster died yesterday. As part of recovery, I connected to the
machines and made manual changes. To ensure that we get back to a known,
documented setup I then proceeded to do a full blue -> green migration,
having not tainted any of the green machines with manual interventions.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-19 10:55:44 +02:00
Gary Verhaegen
2bcbd4e177
es: switch to persistent nodes (#10236)
A few small tweaks, but the most important change is giving up on
preemptible instances (for now at least), because I saw GCP kill 8 out
of the 10 nodes at exactly the same time and I can't really expect the
cluster to survive that.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-12 06:27:23 +00:00