This should fix the flakiness of the test.
Forcefully killing the consul process can lead to
a broken `/var/lib/consul/node-id` file, which
will prevent consul from starting on that node again.
See https://github.com/hashicorp/consul/issues/3489.
So instead of crashing the whole node, which leads to
this corruption from time to time, we cut off its
networking instead, preventing any cluster
communication, and then stop consul cleanly.
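A minimal sketch of that sequence, assuming the NixOS test driver's Python
API and a hypothetical server machine called `server1` (the machine name and
the expected member count are illustrative, not the test's actual code):

# Cut the node off from the cluster network instead of crashing the VM,
# then stop Consul cleanly so /var/lib/consul/node-id stays intact.
server1.block()                            # unplug the virtual network cable
server1.systemctl("stop consul.service")   # clean shutdown, no corruption

# ... later: reconnect the node and bring Consul back.
server1.unblock()
server1.systemctl("start consul.service")
server1.wait_until_succeeds(
    "[ $(consul members | grep -o alive | wc -l) == 5 ]"
)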
Additionally, the test now sets `autopilot.min_quorum = 3`.
Technically, this would have been required to keep the test correct ever
since Consul's autopilot "Dead Server Cleanup" was enabled by default (I
believe that was in Consul 0.8). Practically, the issue only occurred in our
NixOS test with releases >= `1.7.0-beta2` (see #90613). The setting itself
has been available since Consul 1.6.2.
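For illustration, a sketch of how the effect of the setting could be checked
from inside the test, assuming a machine variable `m` and that this Consul
version's `consul operator autopilot get-config` output contains a
`MinQuorum = 3` line:

# Sketch: verify that the autopilot MinQuorum setting took effect
# (the exact get-config output format is an assumption).
m.wait_for_unit("consul.service")
m.succeed("consul operator autopilot get-config | grep -q 'MinQuorum = 3'")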
However, this setting was not documented clearly enough for anybody to
notice, and it was only brought to light by the upstream issue
https://github.com/hashicorp/consul/issues/8118 that I filed.
As explained there, the test could also have been made to pass by applying
the more correct rolling reboot procedure
-m.wait_until_succeeds("[ $(consul members | grep -o alive | wc -l) == 5 ]")
+m.wait_until_succeeds(
+    "[ $(consul operator raft list-peers | grep true | wc -l) == 3 ]"
+)
but we also intend to test that Consul can regain consensus even if
the quorum gets temporarily broken.
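For comparison, a sketch of that stricter rolling procedure, assuming three
server machines `server1`..`server3` and the test driver's Python API (the
test deliberately does not do this, so that losing and regaining quorum is
exercised too):

# Sketch of the stricter rolling restart: after each restart, wait until
# all 3 raft peers are voters again ("true" in the Voter column) before
# touching the next server, so quorum is never lost.
for server in [server1, server2, server3]:
    server.systemctl("restart consul.service")
    server.wait_until_succeeds(
        "[ $(consul operator raft list-peers | grep true | wc -l) == 3 ]"
    )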