Commit Graph

13 Commits

Author SHA1 Message Date
Gary Verhaegen
5e43f8c703
es: drop jobs-* indices (#10857)
We are currently ingesting Bazel events in two forms:

In the `events-*` indices, each Bazel event is recorded as a separate ES
object, with the corresponding job name as a field that can serve to
aggregate all of the events for a given job.

In the `jobs-*` indices, each job is ingested as a single (composite) ES
object, with the individual events as elements in a list-type field.

When I set up the cluster, I wasn't sure which one would be more useful,
so I included both. We now have a bit more usage experience and it turns
out the `events-*` form is the only one we use, so I think we should
stop ingesting everything twice and from now on create only the
`events-*` ones.
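
As a rough illustration of the difference, here is a hedged Python sketch of
the two document shapes; the field names are made up for the example and are
not the exact mapping used by the ingestion script.

```python
# Illustrative only: field names and values are assumptions, not the
# actual mapping used in the cluster.

# events-* form: one ES document per Bazel event, tagged with its job
# name so events can be aggregated per job.
event_doc = {
    "job": "gcb-main-1234",
    "event_type": "TestResult",
    "payload": {"label": "//foo:bar", "status": "PASSED"},
}

# jobs-* form: one composite ES document per job, with the individual
# events as elements of a list-type field.
job_doc = {
    "job": "gcb-main-1234",
    "events": [
        {"event_type": "TestResult",
         "payload": {"label": "//foo:bar", "status": "PASSED"}},
        # ... all other events for the same job
    ],
}
```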

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-28 11:06:52 +02:00
Gary Verhaegen
fe9aeffeaf
Increase es disk size (#11019)
Disks are currently at 75% full, so it seems like a good idea to bump them a
bit. #10857 should reduce our needs, too, so this should last us a
while.

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-25 02:51:26 +02:00
Gary Verhaegen
6f151e287e
save kibana exports (#10861)
As explained in #10853, we recently lost our ES cluster. While I'm not
planning on trusting Google's "rolling restart" feature ever again, we
can't exclude the possibility of future similar outages (without a
significant investment in the cluster, which I don't think we want to
do).

Losing the cluster is not a huge issue as we can always reingest the
data. Worst case we lose visibility for a few days. At least, as far as
the bazel logs are concerned.

Losing the Kibana data is a lot more annoying, as that is not derived
data and thus cannot be reingested. This PR aims to add a backup
mechanism for our Kibana configuration.
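
For context, the general shape of such a backup could look like the following
Python sketch, using Kibana's saved-objects export API; the host, the object
types, and the output handling here are assumptions, not necessarily what this
PR implements.

```python
# Hedged sketch of exporting Kibana saved objects for backup; the Kibana
# address, the object types, and where the export ends up are assumptions.
import requests

KIBANA = "http://10.0.0.10:5601"  # placeholder address, reachable over the VPN

resp = requests.post(
    f"{KIBANA}/api/saved_objects/_export",
    headers={"kbn-xsrf": "true", "Content-Type": "application/json"},
    json={"type": ["dashboard", "visualization", "index-pattern"],
          "includeReferencesDeep": True},
)
resp.raise_for_status()

# The export is NDJSON; write it somewhere durable (e.g. a GCS bucket)
# so it can be re-imported after losing the cluster.
with open("kibana-export.ndjson", "wb") as f:
    f.write(resp.content)
```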

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-13 18:28:11 +00:00
Gary Verhaegen
8c9edd8522
es cluster tweaks (#10853)
On Sept 8 our ES cluster became unresponsive. I tried connecting to the
machines.

One machine had an ES Docker container that claimed to have started 7
weeks ago and stopped 5 weeks ago, while the machine's own uptime was 5
weeks. I assume GCP had decided to restart it for some reason. The init
script had failed for lack of a TTY, hence the addition of the
`DEBIAN_FRONTEND` env var.

Two machines had a Docker container that had stopped that same day, 6h
and 2h respectively before I started investigating. It wasn't immediately clear
what had caused the containers to stop.

On all three of these machines, I was able to manually restart the
containers and they were able to reform a cluster, though the state of
the cluster was red (missing shards).

The last two machines simply did not respond to SSH connection attempts.
Assuming it might help, I decided to try to restart the machines. As GCP
does not allow restarting individual machines when they're part of a
managed instance group, I tried clicking the "rolling restart" button
on the GCP console, which seemed like it would restart the machines. I
carefully selected "restart" (and not "replace"), started the process,
and watched GCP proceed to immediately replace all five machines, losing
all data in the process.

I then started a new cluster and used bigger (and more) machines to
reingest all of the data, and then fell back to the existing
configuration for the "steady" state. I'll try to keep a better eye on
the state of the cluster from now on. In particular, we should not have
a node down for 5 weeks without noticing.

I'll also try to find some time to look into backing up the Kibana
configuration, as that's the one thing we can't just reingest at the
moment.

CHANGELOG_BEGIN
CHANGELOG_END
2021-09-13 11:12:02 +02:00
Andreas Herrmann
7b94b0674e
Map shortened scala test suite names to long names on Windows (#10628)
* Generate short to long name mapping in aspect

Maps shortened test names in da_scala_test_suite on Windows to their
long names on Linux and macOS.

Names are shortened on Windows to avoid exceeding MAX_PATH.

* Script to generate scala test name mapping

* Generate scala-test-suite-name-map.json on Windows

changelog_begin
changelog_end

* Generate UTF-8 with Unix line endings

Otherwise the file will be formatted using UTF-16 with CRLF line
endings, which confuses `jq` on Linux.

* Apply Scala test name remapping before ES upload (see the sketch after this list)

* Pipe bazel output into intermediate file

Bazel writes the output of --experimental_show_artifacts to stderr
instead of stdout. In PowerShell this means that these outputs are not
plain strings but error objects. Simply redirecting them to stdout and
piping them into further processing leads to non-deterministically
missing items or non-deterministically introduced extra newlines, which
may break paths.

To work around this we extract the error message from error objects,
introduce appropriate newlines, and write the output to a temporary file
before further processing.

This solution is taken and adapted from
https://stackoverflow.com/a/48671797/841562

* Add copyright header
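
As a rough idea of the remapping step mentioned above, a hedged Python sketch;
the mapping file name comes from this PR, but its exact contents and the
structure of the event records are assumptions.

```python
# Hedged sketch: rewrite shortened Windows test labels to their long
# Linux/macOS names before upload. The shape of the mapping file and of
# the event records is assumed for illustration.
import json

def remap_labels(events_path: str, mapping_path: str) -> list:
    with open(mapping_path) as f:
        # e.g. {"//foo:t_1": "//foo:some_very_long_test_suite_name"}
        short_to_long = json.load(f)
    remapped = []
    with open(events_path) as f:
        for line in f:
            event = json.loads(line)
            label = event.get("label")
            if label in short_to_long:
                event["label"] = short_to_long[label]
            remapped.append(event)
    return remapped

# Usage (hypothetical paths):
# events = remap_labels("bazel-events.ndjson", "scala-test-suite-name-map.json")
```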

Co-authored-by: Andreas Herrmann <andreas.herrmann@tweag.io>
2021-08-24 17:03:45 +02:00
Gary Verhaegen
449a72a86f
increase ES memory (#10318)
ES died (again) over the weekend, so I had to manually connect to each
node in order to restore it, and thus made another migration. This time
I opted to make a change, though. Lack of memory is a bit of a weak
hypothesis for the observed behaviour, but it's the only one I have at
the moment and, given how reliably ES has been crashing so far, it's
fairly easy to test.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-19 17:50:16 +02:00
Gary Verhaegen
a3b861eae8
refresh es cluster (#10300)
The cluster died yesterday. As part of recovery, I connected to the
machines and made manual changes. To ensure that we get back to a known,
documented setup I then proceeded to do a full blue -> green migration,
having not tainted any of the green machines with manual interventions.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-19 10:55:44 +02:00
Gary Verhaegen
2bcbd4e177
es: switch to persistent nodes (#10236)
A few small tweaks, but the most important change is giving up on
preemptible instances (for now at least), because I saw GCP kill 8 out
of the 10 nodes at exactly the same time and I can't really expect the
cluster to survive that.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-12 06:27:23 +00:00
Gary Verhaegen
999577a1a7
tweak ES cluster (#10219)
This PR contains many small changes:

- A small refactoring whereby the "es-init" machine is now
  (syntactically) integrated with the two instance groups, to cut down a
  bit on repetition.
- The feeder machine is now preemptible, because I've seen it recover
  enough times that I'm confident this will not cause any issue.
- Indices are now sharded.
- Return values from ES are filtered, cutting down a bit on network
  usage and memory requirements to produce the responses.
- Bulk uploads for a single job are now done in parallel. This results
  in about a 2x speedup for ingestion (see the sketch below).
- The crontab was changed to every minute instead of every 5 minutes.
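
As an illustration of the last two bullet points, a minimal Python sketch; this
is not the actual feeder code, and the index name, chunking, and degree of
parallelism are assumptions.

```python
# Hedged sketch: parallel bulk uploads with filtered ES responses.
# Index name, chunk shape, and worker count are made up for the example.
import json
from concurrent.futures import ThreadPoolExecutor

import requests

ES = "http://10.0.0.11:9200"  # placeholder cluster address

def bulk_upload(index: str, docs: list) -> None:
    """Upload one chunk of documents, asking ES to return only error info."""
    body = "".join(
        json.dumps({"index": {"_index": index}}) + "\n" + json.dumps(d) + "\n"
        for d in docs
    )
    resp = requests.post(
        f"{ES}/_bulk",
        # filter_path trims the response to the fields we actually look at,
        # reducing network usage and the memory needed to build the response.
        params={"filter_path": "errors,items.*.error"},
        headers={"Content-Type": "application/x-ndjson"},
        data=body,
    )
    resp.raise_for_status()

def upload_job(index: str, chunks: list) -> None:
    """Upload all chunks for a single job in parallel."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(lambda chunk: bulk_upload(index, chunk), chunks))
```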

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-08 19:20:35 +02:00
Gary Verhaegen
38734f02d7
es-feed: ignore invalid files (#10207)
We currently have about 1% (28 out of 2756) of our build logs that have
invalid JSON files. They are all about a `-profile` file being
incomplete, and since those files represent a single JSON object we
can't do smarter things like filtering invalid individual lines.

I haven't looked deeply into _why_ we create invalid files, but this
should let our ingestion process make some progress in the meantime.
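
Conceptually the guard is just a try/except around the JSON parse; a hedged
sketch in Python (paths and logging are made up, this is not the actual feeder
logic):

```python
# Hedged illustration: skip build-log files that fail to parse instead of
# letting one broken file block ingestion. Paths and messages are made up.
import json
import sys
from pathlib import Path

def load_valid_logs(directory: str):
    """Yield parsed JSON objects, skipping files with invalid JSON."""
    for path in sorted(Path(directory).glob("*.json")):
        try:
            yield json.loads(path.read_text())
        except json.JSONDecodeError as e:
            # e.g. an incomplete `-profile` file cut off mid-object
            print(f"skipping {path}: {e}", file=sys.stderr)
```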

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-07 15:38:14 +00:00
Gary Verhaegen
1d5ba4fa42
feed elasticsearch cluster (#10193)
This PR adds a machine that will, every 5 minutes, look at the GCS
bucket that stores Bazel metrics and push whatever it finds to
ElasticSearch.
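
In outline, the feeder's job looks like the following hedged Python sketch; the
bucket and index names, the client library, and the lack of any bookkeeping for
already-ingested objects are simplifications, not the actual implementation.

```python
# Hedged sketch of the feeder: list the GCS bucket holding Bazel metrics
# and push each log to ElasticSearch. Bucket and index names are made up,
# and tracking of already-ingested objects is omitted.
import json

import requests
from google.cloud import storage

BUCKET = "bazel-metrics-bucket"  # placeholder bucket name
ES = "http://10.0.0.11:9200"     # placeholder cluster address

def feed_once() -> None:
    client = storage.Client()
    for blob in client.list_blobs(BUCKET):
        doc = json.loads(blob.download_as_bytes())
        requests.post(f"{ES}/events/_doc", json=doc).raise_for_status()

if __name__ == "__main__":
    feed_once()  # intended to run from cron every 5 minutes
```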

A huge part of this commit is based on @aherrmann-da's work. You can
assume that all the good bits are his.

CHANGELOG_BEGIN
CHANGELOG_END
2021-07-06 19:46:14 +02:00
Gary Verhaegen
f7cf7c75b5
add kibana (#10152)
This PR adds a Kibana instance to each ES node, and duplicates the load
balancer mechanism to expose both raw ES and Kibana.

CHANGELOG_BEGIN
CHANGELOG_END
2021-06-30 14:08:03 +02:00
Gary Verhaegen
2dfe026cc2
add ES cluster (#10144)
This PR adds a basic ES cluster to our infrastructure, completely open
and unprotected but only accessible through VPN.

For now it is only reachable through its IP address; I'm not sure
whether it's worth adding a DNS entry for it.

CHANGELOG_BEGIN
CHANGELOG_END
2021-06-29 17:50:45 +02:00