From df0086d26f36197735ba80b0bc5fe29ecc66f322 Mon Sep 17 00:00:00 2001
From: Gary Verhaegen <gary.verhaegen@digitalasset.com>
Date: Fri, 12 Feb 2021 19:32:14 +0100
Subject: [PATCH] ci/linux: kill machines if they fail to clean up (#8835)

It does not seem like CI machines recover from a failed clean-up. This
is not the most elegant solution possible, but it's a cheap one that
should work.

Note: shutting down the machine in the middle of the build will not
provide an error message to Slack for main branch builds (because the
`tell_slack_failed` step would need to run on the same machine) but will
correctly report failure for PRs (that was the original purpose of the
`collect_build_data` step).

An alternative here would be to give a delay to the shutdown command,
and try to calibrate it so that it's long enough for this job to
correctly report its failure to both Azure and Slack, while making it
short enough that no other job gets assigned to the machine. I don't
know enough about how often Azure assigns jobs to bet on that.
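
For illustration only, the delayed variant could look roughly like the
sketch below. It is not part of this patch. `on_failure`, `run_cleanup`
and `SHUTDOWN_DELAY_MIN` are hypothetical names, the one-minute default
is an uncalibrated guess, and the handler only prints the command it
would run so the sketch is safe to execute.

```shell
# Hypothetical handler: prints the delayed shutdown instead of running it.
# A real version would call `shutdown -h +N` (N in minutes) here.
on_failure() {
    echo "would run: shutdown -h +${SHUTDOWN_DELAY_MIN:-1}"
}

# Hypothetical wrapper: runs a command with the trap armed, and disarms
# the trap only if the command succeeds.
run_cleanup() {
    (
        # The subshell keeps the EXIT trap local to this one call.
        trap on_failure EXIT
        "$@" || exit 1
        trap - EXIT
    )
}
```

With this shape, `run_cleanup true` prints nothing, while a failing
command triggers the handler as the subshell exits.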

CHANGELOG_BEGIN
CHANGELOG_END
---
 ci/clean-up.yml | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/ci/clean-up.yml b/ci/clean-up.yml
index f374619d5a..28ecfd74b7 100644
--- a/ci/clean-up.yml
+++ b/ci/clean-up.yml
@@ -8,6 +8,12 @@ steps:
     # infra/macos/2-common-box/init.sh:echo "build:darwin --disk_cache=~/.bazel-cache" > ~/.bazelrc
     # infra/vsts_agent_linux_startup.sh:echo "build:linux --disk_cache=~/.bazel-cache" > ~/.bazelrc
 
+    # Linux machines don't seem to recover when this script fails, so shut them
+    # down on failure; the instance_group will then replace them.
+    if [ "$(uname -s)" == "Linux" ]; then
+        trap "shutdown -h now" EXIT
+    fi
+
     if [ $(df -m . | sed 1d | awk '{print $4}') -lt 50000 ]; then
         echo "Disk full, cleaning up..."
         disk_cache="$HOME/.bazel-cache"
@@ -24,4 +30,5 @@ steps:
         fi
     fi
     df -h .
+    trap - EXIT
   displayName: clean-up disk cache