From df0086d26f36197735ba80b0bc5fe29ecc66f322 Mon Sep 17 00:00:00 2001
From: Gary Verhaegen
Date: Fri, 12 Feb 2021 19:32:14 +0100
Subject: [PATCH] ci/linux: kill machines if they fail to clean up (#8835)

It does not seem like CI machines recover from a failed clean-up. This
is not the most elegant solution possible, but it's a cheap one that
should work.

Note: shutting down the machine in the middle of the build will not
provide an error message to Slack for main branch builds (because the
`tell_slack_failed` step would need to run on the same machine) but will
correctly report failure for PRs (that was the original purpose of the
`collect_build_data` step).

An alternative here would be to give a delay to the shutdown command,
and try to calibrate it so that it's long enough for this job to
correctly report its failure to both Azure and Slack, while making it
short enough that no other job gets assigned to the machine. I'm not
clear enough on how often Azure assigns jobs to try and bet on that.

CHANGELOG_BEGIN
CHANGELOG_END
---
 ci/clean-up.yml | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/ci/clean-up.yml b/ci/clean-up.yml
index f374619d5a..28ecfd74b7 100644
--- a/ci/clean-up.yml
+++ b/ci/clean-up.yml
@@ -8,6 +8,12 @@ steps:
     # infra/macos/2-common-box/init.sh:echo "build:darwin --disk_cache=~/.bazel-cache" > ~/.bazelrc
     # infra/vsts_agent_linux_startup.sh:echo "build:linux --disk_cache=~/.bazel-cache" > ~/.bazelrc
 
+    # Linux machines don't seem to recover when this script fails, and they get
+    # renewed by the instance_group
+    if [ "$(uname -s)" == "Linux" ]; then
+      trap "shutdown -h now" EXIT
+    fi
+
     if [ $(df -m . | sed 1d | awk '{print $4}') -lt 50000 ]; then
       echo "Disk full, cleaning up..."
       disk_cache="$HOME/.bazel-cache"
@@ -24,4 +30,5 @@ steps:
         fi
       fi
     df -h .
+    trap - EXIT
   displayName: clean-up disk cache
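
The patch arms an EXIT trap at the top of the clean-up script and clears it with `trap - EXIT` on the last line, so the trap only fires if the script exits early. A minimal sketch of that pattern in isolation follows. It assumes bash; `run_guarded` is a hypothetical helper that is not part of the repository, and an `echo` stands in for `shutdown -h now` so the sketch is safe to run.

```shell
#!/usr/bin/env bash
# Sketch only: run_guarded is a hypothetical helper, and the echo below
# stands in for the real `shutdown -h now` used in ci/clean-up.yml.

run_guarded() {
  # A subshell keeps the trap local: it fires when the subshell exits,
  # not when the calling shell does.
  (
    trap 'echo "clean-up failed: would shut down"' EXIT
    "$@" || exit 1   # a failing command exits with the trap still armed
    trap - EXIT      # reached only on success: disarm the trap
  )
}

# Success: the trap is cleared before exit, so nothing is printed.
run_guarded true

# Failure: the trap fires, and the exit status of the failing run is kept.
run_guarded false || echo "exit status: $?"
```

In the second call the trap message is printed first, then `exit status: 1`. The delayed alternative discussed in the commit message would presumably replace the immediate shutdown with a timed one such as `shutdown -h +1`; what delay would be safe is left open by the source.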