daml/infra/periodic_killer.tf

# Copyright (c) 2023 Digital Asset (Switzerland) GmbH and/or its affiliates. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# This file defines a machine meant to destroy/recreate all our CI nodes every
# night.
#
# Introduced in "ci: temp machines for scheduled killing experiment" (#4386,
# 2020-02-07) as a playground for moving CI from preemptible machines to
# permanent ones, which should drastically reduce the number of "cancelled"
# jobs. The end goal is:
#   1. An instance group (per OS) that defines the actual CI nodes; pretty
#      much the same as the existing ones, but with `preemptible` set to
#      false.
#   2. A separate machine that, on a cron (say at 4AM UTC), destroys all the
#      CI nodes; the group managers, which are set to maintain 10 nodes,
#      should then recreate the "missing" nodes using their normal starting
#      procedure.

resource "google_service_account" "periodic-killer" {
  account_id = "periodic-killer"
}

resource "google_project_iam_custom_role" "periodic-killer" {
  role_id = "killCiNodes"
  title   = "Permissions to list & kill CI nodes"
  permissions = [
    "compute.instances.delete",
    "compute.instances.list",
    "compute.zoneOperations.get",
    "compute.zones.list",
  ]
}

locals {
  accounts_that_can_kill_machines = [
    # should reference google_project_iam_custom_role.periodic-killer.id or
    # something, but for whatever reason that's not exposed.
    "serviceAccount:${google_service_account.periodic-killer.email}",
    "user:gary.verhaegen@digitalasset.com",
    "user:gerolf.seitz@digitalasset.com",
  ]
}

resource "google_project_iam_member" "periodic-killer" {
  count   = length(local.accounts_that_can_kill_machines)
  project = local.project
  role    = google_project_iam_custom_role.periodic-killer.id
  member  = local.accounts_that_can_kill_machines[count.index]
}
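
# Note: with `count`, removing an account from the middle of the list above
# shifts the remaining indices, so Terraform would destroy and recreate every
# binding that follows it. A `for_each` over a set keys each binding by its
# member string and avoids that; a minimal sketch, kept commented out so it
# does not clash with the resource above:
#
# resource "google_project_iam_member" "periodic-killer" {
#   for_each = toset(local.accounts_that_can_kill_machines)
#   project  = local.project
#   role     = google_project_iam_custom_role.periodic-killer.id
#   member   = each.value
# }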
resource "google_compute_instance" "periodic-killer" {
  # Currently set to 0 so that no killer machine exists; bump to 1 to create it.
  count = 0
  name         = "periodic-killer"
  machine_type = "g1-small"
  zone   = "us-east4-a"
  labels = local.machine-labels
  boot_disk {
    initialize_params {
      image = "ubuntu-1804-lts"
    }
  }

  network_interface {
    network = "default"
    // Ephemeral IP to get access to the Internet
    access_config {}
  }

  service_account {
    email  = google_service_account.periodic-killer.email
    scopes = ["cloud-platform"]
  }

  allow_stopping_for_update = true

  metadata_startup_script = <<STARTUP
set -euxo pipefail
apt-get update
apt-get install -y jq
echo "$(date -Is -u) boot" > /root/log
cat <<CRON > /root/periodic-kill.sh
#!/usr/bin/env bash
set -euo pipefail
echo "\$(date -Is -u) start"
MACHINES=\$(/snap/bin/gcloud compute instances list --format=json | jq -c '.[] | select(.name | startswith("ci-")) | [.name, .zone]')
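# The jq filter above turns the JSON instance list into one compact
# two-element array per line, [name, zone], for every instance whose name
# starts with "ci-" (hypothetical shape; the exact zone value depends on the
# API response). Iterating line by line below relies on each compact pair
# containing no whitespace.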
for m in \$MACHINES; do
  MACHINE_NAME=\$(echo \$m | jq -r '.[0]')
  MACHINE_ZONE=\$(echo \$m | jq -r '.[1]')
  # We do not want to abort the script on error here because failing to
  # delete one machine should not prevent trying to delete the others.
  /snap/bin/gcloud -q compute instances delete \$MACHINE_NAME --zone=\$MACHINE_ZONE || true
done
echo "\$(date -Is -u) end"
CRON
chmod +x /root/periodic-kill.sh
cat <<CRONTAB >> /etc/crontab
0 4 * * * root /root/periodic-kill.sh >> /root/log 2>&1
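# The entry above runs the kill script daily at 04:00 system time (UTC on
# stock GCE images), as root, appending all output to /root/log.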
CRONTAB
tail -f /root/log
STARTUP
}
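
# To sanity-check a running killer instance (assuming `count` has been bumped
# to 1 and the machine has booted), one can inspect the artifacts the startup
# script creates:
#
#   sudo cat /root/periodic-kill.sh
#   sudo grep periodic-kill /etc/crontab
#   sudo tail /root/log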