digital-asset/daml - daml - gitea: Gitea Service

mirror of https://github.com/digital-asset/daml.git synced 2024-11-10 00:35:25 +03:00

Author	SHA1	Message	Date
Gary Verhaegen	5e43f8c703	es: drop jobs-* indices (#10857 ) We are currently ingesting Bazel events in two forms: In the `events-` indices, each Bazel event is recorded as a separate ES object, with the corresponding job name as a field that can serve to aggregate all of the events for a given job. In the `jobs-` indices, each job is ingested as a single (composite) ES object, with the individual events as elements in a list-type field. When I set up the cluster, I wasn't sure which one would be more useful, so I included both. We now have a bit more usage experience and it turns out the `events-` form is the only one we use, so I think we should stop ingesting everything twice and from now on create only the `events-` ones. CHANGELOG_BEGIN CHANGELOG_END	2021-09-28 11:06:52 +02:00
Gary Verhaegen	fe9aeffeaf	Increase es disk size (#11019 ) Disks are currently at 75% fullness, so seems like a good idea to bump a bit. #10857 should reduce our needs, too, so this should last us a while. CHANGELOG_BEGIN CHANGELOG_END	2021-09-25 02:51:26 +02:00
Gary Verhaegen	28b8d9a1f7	bump dotnet (#10979 ) This bumps dotnet to the version required by the latest azuresigntool, and pins azuresigntool for the future. As usual for live CI upgrades, this will be rolled out using the blue/green approach. I'll keep each deployed commit in this PR. For future reference, this is PR [#10979]. [#10979]: https://github.com/digital-asset/daml/pull/10979 CHANGELOG_BEGIN CHANGELOG_END	2021-09-22 16:39:40 +00:00
Gary Verhaegen	6f151e287e	save kibana exports (#10861 ) As explained in #10853, we recently lost our ES cluster. While I'm not planning on trusting Google's "rolling restart" feature ever again, we can't exclude the possibility of future similar outages (without a significant investment in the cluster, which I don't think we want to do). Losing the cluster is not a huge issue as we can always reingest the data. Worst case we lose visibility for a few days. At least, as far as the bazel logs are concerned. Losing the Kibana data is a lot more annoying, as that is not derived data and thus cannot be reingested. This PR aims to add a backup mechanism for our Kibana configuration. CHANGELOG_BEGIN CHANGELOG_END	2021-09-13 18:28:11 +00:00
Gary Verhaegen	8c9edd8522	es cluster tweaks (#10853 ) On Sept 8 our ES cluster became unresponsive. I tried connecting to the machines. One machine had an ES Docker container that claimed to have started 7 weeks ago and stopped 5 weeks ago, while the machine's own uptime was 5 weeks. I assume GCP had decided to restart it for some reason. The init script had failed on missing a TTY, hence the addition of the `DEBIAN_FRONTEND` env var. Two machines had a Docker container that had stopped on that day, resp. 6h and 2h before I started investigating. It wasn't immediately clear what had caused the containers to stop. On all three of these machines, I was able to manually restart the containers and they were able to reform a cluster, though the state of the cluster was red (missing shards). The last two machines simply did not respond to SSH connection attempts. Assuming it might help, I decided to try to restart the machines. As GCP does not allow restarting individual machines when they're part of a managed instance group, I tried clicking the "rolling restart" button on the GCP console, which seemed like it would restart the machines. I carefully selected "restart" (and not "replace"), started the process, and watched GCP proceed to immediately replace all five machines, losing all data in the process. I then started a new cluster and used bigger (and more) machines to reingest all of the data, and then fell back to the existing configuration for the "steady" state. I'll try to keep a better eye on the state of the cluster from now on. In particular, we should not have a node down for 5 weeks without noticing. I'll also try to find some time to look into backing up the Kibana configuration, as that's the one thing we can't just reingest at the moment. CHANGELOG_BEGIN CHANGELOG_END	2021-09-13 11:12:02 +02:00
Gary Verhaegen	4093bbd58c	fix macOS Bazel cache (#10795 ) macOS filesystems have been case-insensitive by default for years, and in particular our laptops are, so if we want the cache to work as expected, CI should be too. Note: this does not apply to Nix, because the Nix partition is a case-sensitive image per @nycnewman's script on laptops too. CHANGELOG_BEGIN CHANGELOG_END	2021-09-07 13:31:57 +02:00
Andreas Herrmann	7b94b0674e	Map shortened scala test suite names to long names on Windows (#10628 ) * Generate short to long name mapping in aspect Maps shortened test names in da_scala_test_suite on Windows to their long name on Linux and MacOS. Names are shortened on Windows to avoid exceeding MAX_PATH. * Script to generate scala test name mapping * Generate scala-test-suite-name-map.json on Windows changelog_begin changelog_end * Generate UTF-8 with Unix line endings Otherwise the file will be formatted using UTF-16 with CRLF line endings, which confuses `jq` on Linux. * Apply Scala test name remapping before ES upload * Pipe bazel output into intermediate file Bazel writes the output of --experimental_show_artifacts to stderr instead of stdout. In Powershell this means that these outputs are not plain strings, but instead error objects. Simply redirecting these to stdout and piping them into further processing will lead to indeterministically missing items or indeterministically introduced additional newlines which may break paths. To work around this we extract the error message from error objects, introduce appropriate newlines, and write the output to a temporary file before further processing. This solution is taken and adapted from https://stackoverflow.com/a/48671797/841562 * Add copyright header Co-authored-by: Andreas Herrmann <andreas.herrmann@tweag.io>	2021-08-24 17:03:45 +02:00
Gary Verhaegen	a6650c11e5	fix hoogle (#10555 ) Hoogle was down because the machines got stuck in the `apt-get update` stage. CHANGELOG_BEGIN CHANGELOG_END	2021-08-11 13:52:43 +00:00
Gary Verhaegen	449a72a86f	increase ES memory (#10318 ) ES died (again) over the weekend, so I had to manually connect to each node in order to restore it, and thus made another migration. This time I opted to make a change, though. Lack of memory is a bit of a weak hypothesis for the observed behaviour, but it's the only one I have at the moment and, given how reliably ES has been crashing so far, it's fairly easy to test. CHANGELOG_BEGIN CHANGELOG_END	2021-07-19 17:50:16 +02:00
Gary Verhaegen	a3b861eae8	refresh es cluster (#10300 ) The cluster died yesterday. As part of recovery, I connected to the machines and made manual changes. To ensure that we get back to a known, documented setup I then proceeded to do a full blue -> green migration, having not tainted any of the green machines with manual interventions. CHANGELOG_BEGIN CHANGELOG_END	2021-07-19 10:55:44 +02:00
Gary Verhaegen	2bcbd4e177	es: switch to persistent nodes (#10236 ) A few small tweaks, but the most important change is giving up on preemptible instances (for now at least), because I saw GCP kill 8 out of the 10 nodes at exactly the same time and I can't really expect the cluster to survive that. CHANGELOG_BEGIN CHANGELOG_END	2021-07-12 06:27:23 +00:00
Gary Verhaegen	2b67ebb5d4	tf: refactor appr var (#10232 ) Two changes at the Terraform level, both with no impact on the actual GCP state: - There is no reason to make this value a `variable`: variables in Terraform are meant to be supplied at the CLI. `local` is the right abstraction here (i.e. set in the file directly). - Using an unordered `for_each` set rather than a list so we don't have positional identity, meaning when adding someone at the top we don't need to destroy and recreate everyone else. CHANGELOG_BEGIN CHANGELOG_END	2021-07-09 13:41:46 +02:00
Gary Verhaegen	202b7f7ae7	add akshay to appr team (#10229 ) CHANGELOG_BEGIN CHANGELOG_END	2021-07-09 10:55:16 +00:00
Gary Verhaegen	999577a1a7	tweak ES cluster (#10219 ) This PR contains many small changes: - A small refactoring whereby the "es-init" machine is now (syntactically) integrated with the two instance groups, to cut down a bit on repetition. - The feeder machine is now preemptible, because I've seen it recover enough times that I'm confident this will not cause any issue. - Indices are now sharded. - Return values from ES are filtered, cutting down a bit on network usage and memory requirements to produce the responses. - Bulk uploads for a single job are now done in parallel. This results in about a 2x speedup for ingestion. - crontab was changed to every minute instead of every 5 minutes. CHANGELOG_BEGIN CHANGELOG_END	2021-07-08 19:20:35 +02:00
Gary Verhaegen	38734f02d7	es-feed: ignore invalid files (#10207 ) We currently have about 1% (28 out of 2756) of our build logs that have invalid JSON files. They are all about a `-profile` file being incomplete, and since those files represent a single JSON object we can't do smarter things like filtering invalid individual lines. I haven't looked deeply into _why_ we create invalid files, but this should let our ingestion process make some progress in the meantime. CHANGELOG_BEGIN CHANGELOG_END	2021-07-07 15:38:14 +00:00
Gary Verhaegen	cb1f4ec773	ci/windows: disable spool (#10200 ) * ci/windows: disable spool We're not expecting to print anything, and @RPS5' security newsletter says this is a vector of attack. CHANGELOG_BEGIN CHANGELOG_END * increase no-spool to 6 * Windows name truncation causing collisions * update main group * remove temp group	2021-07-07 12:44:33 +00:00
Gary Verhaegen	1d5ba4fa42	feed elasticsearch cluster (#10193 ) This PR adds a machine that will, every 5 minutes, look at the GCS bucket that stores Bazel metrics and push whatever it finds to ElasticSearch. A huge part of this commit is based on @aherrmann-da's work. You can assume that all the good bits are his. CHANGELOG_BEGIN CHANGELOG_END	2021-07-06 19:46:14 +02:00
Gary Verhaegen	f7cf7c75b5	add kibana (#10152 ) This PR adds a Kibana instance to each ES node, and duplicates the load balancer mechanism to expose both raw ES and Kibana. CHANGELOG_BEGIN CHANGELOG_END	2021-06-30 14:08:03 +02:00
Gary Verhaegen	2dfe026cc2	add ES cluster (#10144 ) This PR adds a basic ES cluster to our infrastructure, completely open and unprotected but only accessible through VPN. And, as of yet, through its IP address. I'm not sure whether it's worth adding a DNS for it. CHANGELOG_BEGIN CHANGELOG_END	2021-06-29 17:50:45 +02:00
Gary Verhaegen	31a76a4a2a	allow CI pools to use any zone (#10069 ) This morning we started with very restricted CI pools (2/6 for Windows and 7/20 for Linux), apparently because the region we run in (us-east1) has three zones, two of them were unable to allocate new nodes, and the default policy is to distribute nodes evenly between zones. I've manually changed the distribution policy. Unfortunately this option is not yet available in our version of the GCP Terraform plugin. CHANGELOG_BEGIN CHANGELOG_END	2021-06-22 10:43:08 +02:00
Gary Verhaegen	646c956457	new windows signing (#9786 ) CHANGELOG_BEGIN CHANGELOG_END	2021-05-25 16:23:17 +02:00
Gary Verhaegen	45bca6e68b	test_windows_signing: install for u (#9776 ) Turns out "`--global`" means "for this user". CHANGELOG_BEGIN CHANGELOG_END	2021-05-21 15:49:31 +02:00
Gary Verhaegen	4af6608185	fix signing machine (#9772 ) Turns out PowerShell is not Bash. Who knew? 🤷 CHANGELOG_BEGIN CHANGELOG_END	2021-05-21 12:36:56 +02:00
Gary Verhaegen	f5c5b634eb	prepare for EV Windows signing (#9758 ) Setting up a non-disruptive way to test out EV signing of our Windows artifacts. CHANGELOG_BEGIN CHANGELOG_END	2021-05-21 10:46:45 +02:00
Gary Verhaegen	96d72a6987	add Victor to app-runtime (#9556 ) CHANGELOG_BEGIN CHANGELOG_END	2021-05-04 13:35:28 +02:00
Gary Verhaegen	af4fffad0f	increase Linux CI size (#9533 ) CHANGELOG_BEGIN CHANGELOG_END	2021-04-29 18:17:10 +02:00
Gary Verhaegen	8a6cfacbff	more robust macOS cleanup (#9456 ) We've recently seen a few cases where the macOS nodes ended up not having the cache partition mounted. So far this has only happened on semi-broken nodes (guest VM still up and running but host unable to connect to it), so I haven't been able to actually poke at a broken machine, but I believe this should allow a machine in such a state to recover. While we haven't observed a similar issue on Linux nodes (as far as I'm aware), I have made similar changes there to keep both scripts in sync. CHANGELOG_BEGIN CHANGELOG_END	2021-04-21 12:10:47 +02:00
Moritz Kiefer	91b65e8004	Patch hoogle binary to include bugfix (#9366 ) This includes https://github.com/ndmitchell/hoogle/pull/367. As usual, I unfortunately cannot test this myself so please review carefully. Note that this will slightly increase compile times since we will now build hoogle. However, we still only build hoogle rather than everything which takes less than 2min on my very weak personal laptop. We could integrate this with our nix cache but for now, I’m not that worried about this. changelog_begin changelog_end Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>	2021-04-09 14:17:03 +02:00
Gary Verhaegen	dfe26b93b0	fix hoogle (#9364 ) After #9362, hoogle stopped returning any result. It was up and running, but with an empty database. Problem was two-fold: 1. In the switch to `nix`, I lost track of the fact that we were previously doing all the work in `~/hoogle` rather than `~/`, which broke the periodic refresh. 2. The initial setup has been broken for months; machines have been initializing with an empty DB and only getting data on the first refresh since #7370. CHANGELOG_BEGIN CHANGELOG_END	2021-04-08 18:45:47 +02:00
Gary Verhaegen	c220e05380	infra/hoogle: use nix (#9362 ) This is a demonstration, commit-by-commit, of how to use the blue/green deployment for Hoogle servers. The actual change (using nix) is stolen from #9352. CHANGELOG_BEGIN CHANGELOG_END	2021-04-08 16:59:16 +02:00
Gary Verhaegen	948d4dd964	infra: hoogle blue/green tf (#9351 ) Making it a bit easier to manage rollouts. CHANGELOG_BEGIN CHANGELOG_END	2021-04-08 10:07:11 +02:00
Gary Verhaegen	2745bc03a5	macos: move cache setup to step 2 (#9350 ) The caches really need to be set up before we warm them up. CHANGELOG_BEGIN CHANGELOG_END	2021-04-07 21:42:15 +02:00
Gary Verhaegen	38c417e981	bump hoogle ubuntu (#9344 ) 16.04 is approaching EOL. CHANGELOG_BEGIN CHANGELOG_END	2021-04-07 20:11:36 +02:00
Gary Verhaegen	c97db24295	fix macOS cache cleaning (#9343 ) The script needs to run once before the first build, otherwise the cache folders get created on the main partition. CHANGELOG_BEGIN CHANGELOG_END	2021-04-07 18:46:44 +02:00
Gary Verhaegen	45c4ba2230	macos cache cleaning (#9245 ) This is adapting the same approach as #9137 to the macOS machines. The setup is very similar, except macOS apparently doesn't require any kind of `sudo` access in the process. The main reason for the change here is that while `~/.bazel-cache` is reasonably fast to clean, cleaning just that has finally caught up to us with a recent cleanup step that proudly claimed: ``` before: 638Mi free after: 1.2Gi free ``` So we do need to start cleaning the other one after all. CHANGELOG_BEGIN CHANGELOG_END	2021-03-30 02:46:05 +02:00
Edward Newman	a98e03981f	Increase nix partition to max of 60Gb (#9259 ) Increase nix partition to max of 60Gb CHANGELOG_BEGIN CHANGELOG_END	2021-03-30 01:06:58 +02:00
Gary Verhaegen	691edeacf2	ci: fix cache cleanup (#9137 ) This is a continuation of #8595 and #8599. I somehow had missed that `/etc/fstab` can be used to tell `mount` to let users mount some filesystems with preset options. This is using the full history of `mount` hardening so should be safe enough. The option `user` in `/etc/fstab` automatically disables any kind of `setuid` feature on the mounted filesystem, which is the main attack vector I know of. This works flawlessly on my local VM, so hopefully this time's the charm. (It also happens to be my third PR specifically targeted on this issue, so, who knows, it may even work.) CHANGELOG_BEGIN CHANGELOG_END	2021-03-16 17:51:38 +01:00
Gary Verhaegen	cfae2d88f5	update Terraform files to match reality (#8780 ) * fixup terraform config Two changes have happened recently that have invalidated the current Terraform files: 1. The Terraform version has gone through a major, incompatible upgrade (#8190); the required updates for this are reflected in the first commit of this PR. 2. The certificate used to serve [Hoogle](https://hoogle.daml.com) was about to expire, so Edward created a new one and updated the config directly. The second commit in this PR updates the Terraform config to match that new, already-in-prod setting. Note: This PR applies cleanly, as there are no resulting changes in Terraform's perception of the target state from 1, and the change from 2 has already been applied through other channels. CHANGELOG_BEGIN CHANGELOG_END * update hoogle cert	2021-02-08 17:25:04 +00:00
Gary Verhaegen	29197a96d7	add Stefano to appr (#8663 ) CHANGELOG_BEGIN CHANGELOG_END	2021-01-28 11:11:08 +00:00
Gary Verhaegen	ec5e419e3a	clean-up infra after Ubuntu 20.04 upgrade (#8653 ) See #8617. This has to be a separate PR so we could drain all the jobs still expecting the linux-pool pool to exist. CHANGELOG_BEGIN CHANGELOG_END	2021-01-27 22:19:34 +01:00
Gary Verhaegen	fef712bf60	Upgrade linux nodes to 20.04 (#8617 ) CHANGELOG_BEGIN - Our Linux binaries are now built on Ubuntu 20.04 instead of 16.04. We do not expect any user-level impact, but please reach out if you do notice any issue that might be caused by this. CHANGELOG_END	2021-01-27 17:38:34 +01:00
Bernhard Elsner	cda93db944	Daml case and logo (#8433 ) * Replace many occurrences of DAML with Daml * Update docs logo * A few more CLI occurrences CHANGELOG_BEGIN - Change DAML capitalization and docs logo CHANGELOG_END * Fix some over-eager replacements * A few more occurrences in md files * Address comments in .proto files Change case in comments and strings in .ts files * Revert changes to frozen proto files * Also revert LF 1.11 * Update get-daml.sh * Update windows installer * Include .py files * Include comments in .daml files * More instances in the assistant CLI * some more help texts	2021-01-08 12:50:15 +00:00
Moritz Kiefer	9c2e2db34e	Include new Nix signing key in static nix config on CI nodes (#8407 ) Our CI nodes install nix in multi-user mode. This means that changing cache information is only available to certain trusted users for security reasons. The CI user is not part of those so the cache info from dev-env/etc/nix.conf is silently ignored. We could consider not running in multi-user mode although from a security pov this seems like a pretty sensible decision and our signing keys change very rarely so for now, I would keep it. changelog_begin changelog_end	2021-01-06 13:24:34 +01:00
Gary Verhaegen	a925f0174c	update copyright notices for 2021 (#8257 ) * update copyright notices for 2021 To be merged on 2021-01-01. CHANGELOG_BEGIN CHANGELOG_END * patch-bazel-windows & da-ghc-lib	2021-01-01 19:49:51 +01:00
Gary Verhaegen	93f449d245	rename master to main (#8245 ) As we strive for more inclusiveness, we are becoming less comfortable with historically-charged terms being used in our everyday work. This is targeted for merge on Dec 26, _after_ the necessary corresponding changes at both the GitHub and Azure Pipelines levels. CHANGELOG_BEGIN - DAML Connect development is now conducted from the `main` branch, rather than the `master` one. If you had any dependency on the digital-asset/daml repository, you will need to update this parameter. CHANGELOG_END	2020-12-27 14:19:07 +01:00
Gary Verhaegen	5c8ac44049	update macOS nodes README (#8243 ) This is far from perfect but removes the blatantly wrong sections of the README. Note: as a README change, this is not really a standard change, but because the README is under the infra folder, this PR does need the tag to pass CI. CHANGELOG_BEGIN CHANGELOG_END	2020-12-10 16:48:12 +01:00
Gary Verhaegen	7c2ba6f996	infra: add prod label (#8140 ) Requested by @nycnewman. CHANGELOG_BEGIN CHANGELOG_END	2020-12-03 01:55:43 +01:00
Gary Verhaegen	586e29adca	incident-118: investigate & fix (#8135 ) incident-118: fruitless investigation; revert This first commit just duplicates the existing configuration. Further commits will make actual changes so they can be tracked by looking at individual commits (rather than try to think up the diff by looking at entirely new files). CHANGELOG_BEGIN CHANGELOG_END	2020-12-02 19:20:56 +01:00
Gary Verhaegen	841116bf1e	incident-118: linux machines unable to start (#8128 ) Early investigation points to cloud logging install failing. Temporarily disabling. CHANGELOG_BEGIN CHANGELOG_END	2020-12-02 10:11:11 +00:00
Gary Verhaegen	b23304c691	add default capability to macos (#5915 ) This is the macOS part of #5912, which I have separated because our macOS nodes have a different deployment process so it seemed easier to track the deployment of the change separately. CHANGELOG_BEGIN CHANGELOG_END	2020-11-25 15:34:33 +01:00

134 Commits