Commit Graph

16 Commits

Author SHA1 Message Date
Gary Verhaegen
793253ca87
bump cleanup threshold (#10771)
I've witnessed a build ([link], though that will likely expire soon)
that failed with a "No space left on device" error after skipping the
cleanup step because the machine still had 68GB free.

[link]: https://dev.azure.com/digitalasset/daml/_build/results?buildId=87591&view=logs&j=870bb40c-6da0-5bff-67ed-547f10fa97f2&t=deecee86-545a-596e-8b0d-fb7d606fe9f2

With the machines only having 200GB of disk in total, cleaning up at 80GB
is probably going to start hampering the overall efficiency of the
cache. It may be time to think about increasing the disk size itself (or
finding ways to reduce the size requirements of our builds). One important
note, though: we can't actually increase the macOS disk size very much.

The failure happened on the `compatibility_linux` job.

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-03 18:22:18 +02:00
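
For illustration, a minimal sketch of the kind of free-space check this commit tunes, assuming a GNU/Linux agent; the actual script, paths, and variable names may differ:

```
#!/usr/bin/env bash
# Hypothetical threshold check: skip the cache cleanup while plenty of
# space is left, and clean once free space drops below the limit.
set -euo pipefail

CLEANUP_THRESHOLD_GB=80   # assumed new value, per the message above
free_gb=$(df --output=avail --block-size=G "$HOME" | tail -n 1 | tr -dc '0-9')

if [ "$free_gb" -ge "$CLEANUP_THRESHOLD_GB" ]; then
  echo "Still ${free_gb}GB free; skipping cleanup."
else
  echo "Only ${free_gb}GB free; cleaning caches."
  # rm -rf "$HOME/.cache/bazel"/*   # placeholder for the real cleanup steps
fi
```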
Gary Verhaegen
c3a3d60083
don't call Gary, he's on holiday (#10400)
I'm still not sure how or why this happens, but if we can detect it
"early" enough to fail and try to debug it, we can also just try to fix it 🤷

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-24 09:12:42 +02:00
Gary Verhaegen
a56cfead28
even earlier mount failure detection (#10371)
CHANGELOG_BEGIN
CHANGELOG_END
2021-07-22 12:36:41 +02:00
Gary Verhaegen
3c0010b38a
detect mount issue earlier (#10313)
When machine disks are full, we can't clean the Bazel cache if it
happens to not be a mount point. I don't quite understand yet why it's
not a mount point, but maybe I'll be able to investigate more if we catch
the issue early, rather than waiting for the disk to be full and the
clean-up to fail.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-19 15:45:50 +02:00
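
A minimal sketch of that kind of early check, assuming a Linux agent and a hypothetical cache path:

```
# Hypothetical early detection: fail the job as soon as the cache
# directory turns out not to be a mount point, instead of waiting for
# the disk to fill up and the cleanup to fail.
CACHE_DIR="$HOME/.cache/bazel"   # path is an assumption
if ! mountpoint -q "$CACHE_DIR"; then
  echo "error: $CACHE_DIR is not a mount point; failing early to investigate." >&2
  exit 1
fi
```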
Gary Verhaegen
8a6cfacbff
more robust macOS cleanup (#9456)
We've recently seen a few cases where the macOS nodes ended up not
having the cache partition mounted. So far this has only happened on
semi-broken nodes (guest VM still up and running but host unable to
connect to it), so I haven't been able to actually poke at a broken
machine, but I believe this should allow a machine in such a state to
recover.

While we haven't observed a similar issue on Linux nodes (as far as I'm
aware), I have made similar changes there to keep both scripts in sync.

CHANGELOG_BEGIN
CHANGELOG_END
2021-04-21 12:10:47 +02:00
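
A rough illustration of such a recovery path; the volume name and mount point are assumptions, not the actual setup:

```
# Hypothetical macOS recovery step: if the cache partition is missing,
# try to mount it again before running the cleanup.
CACHE_MOUNT="/Volumes/BazelCache"   # assumption
CACHE_VOLUME="BazelCache"           # assumption
if ! mount | grep -q " on ${CACHE_MOUNT} "; then
  echo "Cache volume not mounted; attempting to remount it."
  diskutil mount "$CACHE_VOLUME"
fi
```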
Gary Verhaegen
2662163a23
cleanup: kill more stuff (#9411)
In this case we had Java and Daml processes running.

May be worth moving this to the reset-caches scripts later on.

changelog_begin
changelog_end
2021-04-14 15:15:03 +02:00
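
The extra kill step might look roughly like this; the process patterns are assumptions based on the message above:

```
# Hypothetical sketch: stop stray Java and Daml processes that may still
# hold files open in the caches about to be removed.
for pattern in java daml; do
  pkill -9 -f "$pattern" || true   # pkill exits non-zero when nothing matches
done
```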
Gary Verhaegen
45c4ba2230
macos cache cleaning (#9245)
This is adapting the same approach as #9137 to the macOS machines. The
setup is very similar, except macOS apparently doesn't require any kind
of `sudo` access in the process.

The main reason for the change here is that, while `~/.bazel-cache` is
reasonably fast to clean, cleaning only that has finally caught up with
us, as evidenced by a recent cleanup step that proudly claimed:

```
before: 638Mi free
after: 1.2Gi free
```

So we do need to start cleaning the other one after all.

CHANGELOG_BEGIN
CHANGELOG_END
2021-03-30 02:46:05 +02:00
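
A sketch of what the macOS variant could look like, assuming a free-space threshold and a second, larger cache under Bazel's default output root (both assumptions); per the message above, no `sudo` is needed:

```
# Hypothetical macOS cleanup: clean both ~/.bazel-cache and the larger
# Bazel output root once free space drops below a threshold.
THRESHOLD_GB=50                                    # assumed value
free_gb=$(df -g "$HOME" | awk 'NR==2 {print $4}')  # BSD df, 1GB blocks
if [ "$free_gb" -lt "$THRESHOLD_GB" ]; then
  rm -rf "$HOME/.bazel-cache"/*
  rm -rf /private/var/tmp/_bazel_"$USER"/*         # Bazel's default output root on macOS
fi
```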
Gary Verhaegen
839c97cff9
fix dev-env in clean-up (#9244)
CHANGELOG_BEGIN
CHANGELOG_END
2021-03-25 15:48:58 +00:00
Gary Verhaegen
7c4b32aee9
use coreutils date on macos (#9228)
macOS uses BSD date by default which has slightly different options.

CHANGELOG_BEGIN
CHANGELOG_END
2021-03-24 13:35:02 +01:00
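
To illustrate the difference being worked around, here is the same "seven days ago" written for GNU date (coreutils) and for the BSD date that macOS ships by default; which binary you get depends on what is first on the PATH:

```
# GNU date (coreutils) understands relative expressions via -d:
date -d '7 days ago' +%Y-%m-%d
# BSD date (the macOS default) needs -v instead and does not accept -d:
date -v-7d +%Y-%m-%d
```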
Gary Verhaegen
691edeacf2
ci: fix cache cleanup (#9137)
This is a continuation of #8595 and #8599. I somehow had missed that
`/etc/fstab` can be used to tell `mount` to let users mount some
filesystems with preset options.

This is using the full history of `mount` hardening, so it should be safe
enough. The `user` option in `/etc/fstab` automatically disables any kind
of `setuid` feature on the mounted filesystem, which is the main attack
vector I know of.

This works flawlessly on my local VM, so hopefully this time's the
charm. (It also happens to be my third PR specifically targeted on this
issue, so, who knows, it may even work.)

CHANGELOG_BEGIN
CHANGELOG_END
2021-03-16 17:51:38 +01:00
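
For illustration, a hypothetical `/etc/fstab` line in this spirit (device and mount point are made up): the `user` option lets an unprivileged user mount the filesystem and implies `nosuid`, `nodev`, and `noexec`, which is the hardening referred to above.

```
# Hypothetical /etc/fstab entry (device and mount point are assumptions):
#   /dev/sdc1  /home/vsts/.cache  ext4  user,noauto,rw  0  0
# With such a line in place, the unprivileged CI user can simply run:
mount /home/vsts/.cache
```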
Gary Verhaegen
c556db48ed
ci/clean-up: remove poweroff (#9108)
It's not working and I can't make it work (see #9096), so I'd rather
just remove it.

CHANGELOG_BEGIN
CHANGELOG_END
2021-03-12 10:48:34 +01:00
Gary Verhaegen
df0086d26f
ci/linux: kill machines if they fail to clean up (#8835)
It does not seem like CI machines recover from a failed clean-up. This
is not the most elegant solution possible, but it's a cheap one that
should work.

Note: shutting down the machine in the middle of the build will not
provide an error message to Slack for main branch builds (because the
`tell_slack_failed` step would need to run on the same machine) but will
correctly report failure for PRs (that was the original purpose of the
`collect_build_data` step).

An alternative here would be to give a delay to the shutdown command,
and try to calibrate it so that it's long enough for this job to
correctly report its failure to both Azure and Slack, while making it
short enough that no other job gets assigned to the machine. I'm not
clear enough on how often Azure assigns jobs to try and bet on that.

CHANGELOG_BEGIN
CHANGELOG_END
2021-02-12 19:32:14 +01:00
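
The shape of that fallback could be as simple as the following sketch (the script name is an assumption):

```
# Hypothetical "cheap" recovery: if the cleanup fails, take the machine
# out of rotation by shutting it down rather than letting it pick up
# further jobs it would also fail.
if ! ./ci/cleanup.sh; then
  echo "Cleanup failed; shutting the machine down." >&2
  sudo shutdown -h now
fi
```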
Gary Verhaegen
95c2184dcc
bump disk cleanup threshold (#8807)
I've seen [a build] failing with "disk full" after starting with 41GB
free.

[a build]: https://dev.azure.com/digitalasset/daml/_build/results?buildId=69270&view=logs&j=870bb40c-6da0-5bff-67ed-547f10fa97f2&t=deecee86-545a-596e-8b0d-fb7d606fe9f2

CHANGELOG_BEGIN
CHANGELOG_END
2021-02-10 13:39:29 +00:00
Gary Verhaegen
ab5f62abac
ci/clean-up: s/lsof/pgrep (#8748)
CHANGELOG_BEGIN
CHANGELOG_END
2021-02-04 12:21:55 +01:00
Gary Verhaegen
5734730d50
tweak local cache cleanup (#8738)
For various reasons, my attempts at improving the cache cleanup process
have been delayed. There are, however, two simple, non-controversial
changes I can "backport" without having to wait for consensus on the
whole thing:

1. Increase the threshold. At least for the compat jobs, we have seen
   builds failing after starting with ~32GB free.
2. Kill dangling Bazel processes, which keep some files open and
   sometimes cause the clean-up process to crash.

CHANGELOG_BEGIN
CHANGELOG_END
2021-02-03 19:47:52 +01:00
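
A minimal sketch of the two backported changes listed above, with assumed values and paths:

```
# Hypothetical: higher free-space threshold, plus killing dangling Bazel
# processes that keep cache files open before removing the cache.
THRESHOLD_GB=40                 # assumed new threshold
free_gb=$(df --output=avail --block-size=G "$HOME" | tail -n 1 | tr -dc '0-9')
if [ "$free_gb" -lt "$THRESHOLD_GB" ]; then
  pkill -9 -f bazel || true     # dangling Bazel servers can crash the cleanup
  rm -rf "$HOME/.cache/bazel"/* # assumed cache location
fi
```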
Gary Verhaegen
532f996f12
ci: clean-up hard drive more often (#8582)
CHANGELOG_BEGIN
CHANGELOG_END
2021-01-20 17:22:53 +01:00