daml/ci/README.md
Paul Brauner fd47e89ce2
Update ci/README.md (#19329)
Update ci/README with explanations gathered from @garyverhaegen-da on slack.
2024-06-24 16:07:31 +02:00

259 lines
13 KiB
Markdown

# CI
## Overview
We run our CI on Azure Pipelines. Azure Pipelines uses its own variant of YAML
for its ocnfiguration; it's worth getting familiar with their [YAML
documentation][YAML], as well as with their [expression syntax].
[YAML]: https://learn.microsoft.com/en-us/azure/devops/pipelines/yaml-schema/?view=azure-pipelines
[expression syntax]: https://learn.microsoft.com/en-us/azure/devops/pipelines/process/expressions?view=azure-devops
Azure Pipelines allows one to define any number of "pipelines", which are
definitions of what to do (think CircleCI workflow). Each pipeline has its own
entrypoint and conditions for running, as well as its own configuration within
Azure Pipelines (e.g. each has its own set of nvironment variables).
## Entrypoints
The entrypoints we have in this repo, as of 2024-05-28, are:
- `/azure-cron.yml` for the `digital-asset.daml-cron` pipeline. This is a
hourly cron. For some reason setting it up as a hourly cron did not work four
years ago so we have a slightly convoluted set up where a separate job within
Azure Pipelines (not visible in this repo's files) called `cron-workaround CI`
"maually" triggers this through the API every hour. This entrypoint is
responsible for generating the `daml-sdk` Docker image (when a new release is
detected), cleaning up potentially-broken files from the Bazel cache,
publishing the VSCode Extension (when a new release is detected), copying the
GitHub downloads stats to GCS to give us some historical perspective, and
checking that get.daml.com has not been changed.
- `/azure-pipelines.yml` is the `main` branch build (also `main-2.x`). This
runs only on commits from `main` or `main-2.x` and has access to release
secrets. The corresponding pipeline is `digital-asset.daml`.
- `/ci/prs.yml` is the entrypoint for PR builds. It is very similar to the main
branch build (in either case, most of the jobs are defined in
`/ci/build.yml`), but has access to fewer secrets. It also has a few specific
jobs such as `check_standard_change_label`.
- `/ci/daily-snapshot.yml` is, as the name suggests, the entrypoint for the
daily snapshots. Note that, like all releases, these are made from the latest
commit of the main branch. The corresponding pipeline is named `snapshot` and
can be manually triggered with sufficient Azure Pipelines permissions. This
should **not** be done for non-snapshot releases as we want an audit trail of
those in `LATEST`. Note that despite the name this does not run on a cron.
- `/ci/cron/daily-compat.yml` is the entrypoint for all of our "daily" cron
jobs, whether they are related to the compatibility tests or not. These
include the compatibility tests, but also the speedy performance tests (2.x
only), a daily BlackDuck scan (which optionally udpates the `NOTICES` file),
the daily code pull from canton, as well as triggering the `snapshot` pipeline
if needed.
- `/ci/cron/tuesday.yml` runs on Tuesdays to send a Slack message to
`#team-daml` announcing who will be responsible for the Wednesday release
testing.
- `/ci/cron/wednesday.yml` open the release testing rotation update PR.
## Special branches
We currently have a few special branches.
### `main`
The `main` branch is where most of the development happens. It currently
(2024-05-28) targets the 3.1.0 release, but that is changeable: when the code
freeze for 3.1.0 nears, the way we'll actually do the code freeze is by
creating a new branch called `release/3.1.x` from the then tip of `main`, and
then change `main` to target 3.2.0.
The `main` branch is also the only one that triggers releases. This means that
the `sdk/LATEST` file on the `main` branch is the source of truth for releases
in this repo, across all versions. When CI detects that a given build is a
release build, all of the build steps will first check out the target release
commit (first column of the `sdk/LATEST` file).
This has two important consequences:
- On the good side, it's very good for auditability: this one file on this one
branch is the only trigger for releases, so looking at the history of that
file tells you all you need to know about releases. (In most cases you'll have
all you need from the current state of the file.)
- On the bad side, Azure Pipelines loads the YML files _before_ we check out
the target release commit, which means that the YML files involved in making
a release need to be changed very carefully as they need to keep working with
older versions. This has not been an issue in practice so far, but certainly
something to keep in mind.
The `main` branch is the one that runs for most cron jobs, the one exception
being the `digital-asset.daml-daily-compat` pipeline which runs every day on
both the `main` and `main-2.x` branches.
Note however that the daily-compat `main-2.x` pipeline does not trigger any
release build. All snapshots (included `main-2.x` ones) are started by
[this job](https://github.com/digital-asset/daml/blob/18e4e155fb94ac153e08b9f7b141910d8e005e7a/ci/cron/daily-compat.yml#L314-L335)
in the `main` daily-compat pipeline. The job starts two builds of the `snapshot`
pipeline, with two different commits passed as a parameter. As these builds
are all run from `main` (even when building `main-2.x`), they can't be easily
told appart from the [snapshot pipeline page](https://dev.azure.com/digitalasset/daml/_build?definitionId=40). To tell which branch a given build is building: click on the job,
then consult the `check_for_release`'s `out` step. The `release_tag` variable then
makes it clear whether this is a 2.x or 3.x build.
### `main-2.x`
This temporary fork of `main` targets `2.9.0` and will likely move on to target
`2.10.0` when `2.9.0` gets its code freeze. It is similar to `main` in many
ways, but does not trigger releases.
### `release/*`
The `release/*` branches (e.g. `release/2.3.x`) represent the code base of past
minor releases and exist to allow us to do patch releases.
The process of making a stable release will generall involve creating the
`release/*` branch when we start to close in on the code freeze for that
release, with usually a couple more PRs that need to go in.
Therefore, the vast majority of releases are made using a target commit from a
`release/*` branch, while being triggered by a commit getting merged into the
`main` branch.
Release branches do not run CI on their own commits - instead, CI is run on PRs
targeting them, and we enforce linear merges.
## Working with Azure Pipelines
### Understanding what gets built
Azure Pipelines does not build your branch or your PR; instead, what it builds
is the result of merging your branch into its target (in most cases, `main` or
`main-2.x`). This has some nice properties (you don't need to explicitly
rebase/merge to be confident your PR builds against current head), but it can
also cause some subtle issues because **this is done per job**.
Meaning that, within a single build, two separate jobs may not be building the
same code. This is particularly problematic for the platform-independence test,
in rare cases where the various platform jobs don't start at the same time and
the changes on main in-between change the produced DAR file.
### When a build doesn't start
There are two situations where a build will not start for a PR:
- The PR does not match our security rule "has been opened by an account with
write access". This covers bot-opened PRs (bots can create branches directly
on the repo but don't count as having write access, because reasons) as well as
any PR "from a fork", regardless of who opens it (i.e. if you have write access
but choose to make a fork instead of pushing your branch directly to the repo,
CI won't start).
- Sometimes either GitHub or Azure Pipelines has a temporary network issue.
Builds are triggered by GitHub sending events to Azure Pipelines (PR opened,
new commit pushed, etc.); there is no polling. So if there's any issue with
that one notification, the build has been "missed" and won't be started.
Remediation depends on the situation. In most cases, if you have write access
to the repo you can trigger a PR build by adding a comment that reads `/azp
run` on the PR. The comment has to be just thoss 8 characters.
Alternatively, in the second case, operations like pushing a new commit or
closing and reopening the PR can trigger a new notification.
### Restarting a failed build
Azure Pipelines should trigger a build on every pull request (but not every
branch). If a build has failed and you believe the failure to be flaky, you can
re-run the build by navigating to the "Checks" tab of your PR, and clicking the
"Re-run failed checks" button in the top right.
**This will only work once the build is finished, whether successfully or
not.** The button does nothing if some jobs (from that build) are still
running. You can identify builds and jobs on the GitHub Checks page by their
name, which is of the form `[Pipeline] ([Job])`, e.g. `PRs
(compatibility_linux)` where `PRs` is the name of the pipeline and
`compatibility_linux` is the job. A build is an instance of running all the
jobs in a pipeline.
Note that the `Re-run all jobs` button reruns all the jobs, which means you
take a chance with the ones that have already succeeded. This is sometimes
necessary, but the only case I can think of is when the platform-independence
test fails because of a race condition.
A PR-triggered build gets canceled if a new commit is pushed to the
corresponding branch.
### Finding logs for a build
From the same Checks tab (or the equivalent for main branch commits), you can
click on the "View more details on Azure Pipelines" link to get access to the
running logs of a job.
Note that logs at this level are per step, not per job. You can look at logs
scrolling by for a running step, or download the entirety of the logs as a text
file with the "View raw log" button in the top right.
On the build page view (when no specific job or step is selected), you can see
the build artifacts. Most of these artifacts are additional logs, presumably
more detailed.
### Managing jobs in Azure Pipeleines
Only a few people have access to Azure Pipelines directly. Those people can
additionally use the Azure Pipelines UI to:
- Cancel a running build. Note that we cannot cancel individual jobs, and the
cancellation is a request - some stops react more quickly than others.
- Manually start a build from a pipeline on an arbitrary git commit - this is
easily abusable and the reason why not many people are given access.
Direct access to Azure Pipelines does not help with most routine tasks, e.g. it
does not allow one to restart a failed job while other jobs in the same build
are still running.
### Managing CI pools
Direct access to Azure Pipelines also allows one to manage the pools of CI
machines:
- See how many jobs are running and how many jobs are queued, which may
indicate a need for more machines. There is no auto-scaling, so scaling may
need to be done manually.
- Disable (and then re-enable) individual machines in a pool. A disabled
machine will finish any ongoing job but will not be assigned new jobs.
- Delete a machine from a pool. This removes it from Azure Pipelines, but does
not free up the corresponding resources on Azure. Prefer [Bracin] for machine
deletion.
- Add (or remove) "capabilities" to a machine, which is a set of flags that can
be used in "demands" in job configuration. By default, all jobs require the
`assignment` capability to be equal to `default` (this is an explicit demand in
our YAML files, not a statement about Azure Pipelines defaults), and all
machines start with the `assignment` capability equal to `default` (this is
explicitly set in our startup scripts in [daml-ci], not a statement about the
Azure Agent's defaults). Changing capabilities can allow fine-grained
selection of which PR runs on which machine, which is generally seen as a bad
thing we should not do, but is occasionally needed while working on the CI
infrastructure, for example to test out a new version of the base VM that
machines run from.
Scaling machines requires access to Azure (which is separate from Azure
Pipelines despite the naming similarity), or the feature to be added to
[Bracin].
[Bracin]: https://daml-ci.da-int.net
## Playbook
### When the split_release job fails
When the `split_release` job fails, that's basically unrecoverable (unless it
fails on the very first thing it tries to push) because that step is not idempotent.
In other words, the release is busted, you'll have to make a new one with a new version
string. This is the reason why we have that `.0.` in the version string.
### Retrying a snapshot that failed because of a flake
Find the commit on `main` from which the `snapshot` or `digital-asset.daml` pipeline was run
(depending on whether this was a daily snapshot triggered by `daily-compat` or a "manual"
snapshot triggered by a modification to the `LATEST` file). Click on the red cross indicating
that the CI failed on that commit. Re-run the relevant pipeline from the "check" page you land
on.
Sometimes all the pipeline jobs are green because GitHub got the notifications for a
successful 2.x after it got the notifications for a failed 3.x. In that case there is
unfortunately no way to restart the snpashot from github.