# CI

## Overview

We run our CI on Azure Pipelines. Azure Pipelines uses its own variant of YAML
for its configuration; it's worth getting familiar with their [YAML
documentation][YAML], as well as with their [expression syntax].

[YAML]: https://learn.microsoft.com/en-us/azure/devops/pipelines/yaml-schema/?view=azure-pipelines
[expression syntax]: https://learn.microsoft.com/en-us/azure/devops/pipelines/process/expressions?view=azure-devops

Azure Pipelines allows one to define any number of "pipelines", which are
definitions of what to do (think CircleCI workflow). Each pipeline has its own
entrypoint and conditions for running, as well as its own configuration within
Azure Pipelines (e.g. each has its own set of environment variables).

## Entrypoints

The entrypoints we have in this repo, as of 2024-05-28, are:

- `/azure-cron.yml` for the `digital-asset.daml-cron` pipeline. This is an
  hourly cron. For some reason setting it up as an hourly cron did not work four
  years ago, so we have a slightly convoluted setup where a separate job within
  Azure Pipelines (not visible in this repo's files) called `cron-workaround CI`
  "manually" triggers this through the API every hour. This entrypoint is
  responsible for generating the `daml-sdk` Docker image (when a new release is
  detected), cleaning up potentially-broken files from the Bazel cache,
  publishing the VSCode Extension (when a new release is detected), copying the
  GitHub downloads stats to GCS to give us some historical perspective, and
  checking that get.daml.com has not been changed.
- `/azure-pipelines.yml` is the `main` branch build (also `main-2.x`). This
  runs only on commits from `main` or `main-2.x` and has access to release
  secrets. The corresponding pipeline is `digital-asset.daml`.
- `/ci/prs.yml` is the entrypoint for PR builds. It is very similar to the main
  branch build (in either case, most of the jobs are defined in
  `/ci/build.yml`), but has access to fewer secrets. It also has a few specific
  jobs such as `check_standard_change_label`.
- `/ci/daily-snapshot.yml` is, as the name suggests, the entrypoint for the
  daily snapshots. Note that, like all releases, these are made from the latest
  commit of the main branch. The corresponding pipeline is named `snapshot` and
  can be manually triggered with sufficient Azure Pipelines permissions. This
  should **not** be done for non-snapshot releases as we want an audit trail of
  those in `LATEST`. Note that despite the name this does not run on a cron.
- `/ci/cron/daily-compat.yml` is the entrypoint for all of our "daily" cron
  jobs, whether they are related to the compatibility tests or not. These
  include the compatibility tests, but also the speedy performance tests (2.x
  only), a daily BlackDuck scan (which optionally updates the `NOTICES` file),
  the daily code pull from canton, as well as triggering the `snapshot` pipeline
  if needed.
- `/ci/cron/tuesday.yml` runs on Tuesdays to send a Slack message to
  `#team-daml` announcing who will be responsible for the Wednesday release
  testing.
- `/ci/cron/wednesday.yml` opens the release testing rotation update PR.

## Special branches

We currently have a few special branches.

### `main`

The `main` branch is where most of the development happens. It currently
(2024-05-28) targets the 3.1.0 release, but that is changeable: when the code
freeze for 3.1.0 nears, the way we'll actually do the code freeze is by
creating a new branch called `release/3.1.x` from the then tip of `main`, and
then change `main` to target 3.2.0.

The `main` branch is also the only one that triggers releases. This means that
the `sdk/LATEST` file on the `main` branch is the source of truth for releases
in this repo, across all versions. When CI detects that a given build is a
release build, all of the build steps will first check out the target release
commit (first column of the `sdk/LATEST` file).

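As a rough illustration of "the first column is the target commit", here is a
minimal Python sketch. It assumes each non-empty line of `sdk/LATEST` is
whitespace-separated, with the commit hash first and the version second; the
helper name `parse_latest` and the sample contents are made up for this example
and are not taken from the CI scripts.

```python
def parse_latest(text):
    """Return (commit, version) pairs from LATEST-style text.

    Assumption: each non-empty line is whitespace-separated, with the
    target commit in the first column and the version in the second.
    Any further columns are ignored.
    """
    releases = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        releases.append((fields[0], fields[1]))
    return releases


# Hypothetical file contents, for illustration only.
sample = """\
0123456789abcdef0123456789abcdef01234567 3.1.0-snapshot.example
89abcdef0123456789abcdef0123456789abcdef 2.9.0-snapshot.example
"""

for commit, version in parse_latest(sample):
    print(f"release {version} is built from commit {commit}")
```

The point of the sketch is only that the commit to check out comes from the
file contents, not from the commit that triggered the build.
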
This has two important consequences:
- On the good side, it's very good for auditability: this one file on this one
  branch is the only trigger for releases, so looking at the history of that
  file tells you all you need to know about releases. (In most cases you'll have
  all you need from the current state of the file.)
- On the bad side, Azure Pipelines loads the YML files _before_ we check out
  the target release commit, which means that the YML files involved in making
  a release need to be changed very carefully as they need to keep working with
  older versions. This has not been an issue in practice so far, but it is
  certainly something to keep in mind.

The `main` branch is the one that runs for most cron jobs, the one exception
being the `digital-asset.daml-daily-compat` pipeline which runs every day on
both the `main` and `main-2.x` branches.

Note however that the daily-compat `main-2.x` pipeline does not trigger any
release build. All snapshots (including `main-2.x` ones) are started by
[this job](https://github.com/digital-asset/daml/blob/18e4e155fb94ac153e08b9f7b141910d8e005e7a/ci/cron/daily-compat.yml#L314-L335)
in the `main` daily-compat pipeline. The job starts two builds of the `snapshot`
pipeline, with two different commits passed as a parameter. As these builds
are all run from `main` (even when building `main-2.x`), they can't be easily
told apart on the
[snapshot pipeline page](https://dev.azure.com/digitalasset/daml/_build?definitionId=40).
To tell which branch a given build is building, click on the job, then consult
the `out` step of `check_for_release`. The `release_tag` variable then makes it
clear whether this is a 2.x or 3.x build.

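Reading the release line off a `release_tag` value amounts to looking at its
major version. The Python sketch below shows the idea; the tag format (an
optional `v` followed by a dot-separated version) and the example tags are
assumptions for illustration, not something copied from a real build log.

```python
def release_line(release_tag):
    """Map a release tag to its release line, e.g. '2.x' or '3.x'.

    Assumption: the tag is an optional leading 'v' followed by a
    dot-separated version number whose first component is the major
    version.
    """
    major = release_tag.lstrip("v").split(".", 1)[0]
    if not major.isdigit():
        raise ValueError(f"unrecognised release tag: {release_tag!r}")
    return f"{major}.x"


# Hypothetical tags, for illustration only.
print(release_line("2.9.0-snapshot.example"))
print(release_line("v3.1.0-snapshot.example"))
```
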
### `main-2.x`

This temporary fork of `main` targets `2.9.0` and will likely move on to target
`2.10.0` when `2.9.0` gets its code freeze. It is similar to `main` in many
ways, but does not trigger releases.

### `release/*`

The `release/*` branches (e.g. `release/2.3.x`) represent the code base of past
minor releases and exist to allow us to do patch releases.

The process of making a stable release will generally involve creating the
`release/*` branch when we start to close in on the code freeze for that
release, with usually a couple more PRs that need to go in.

Therefore, the vast majority of releases are made using a target commit from a
`release/*` branch, while being triggered by a commit getting merged into the
`main` branch.

Release branches do not run CI on their own commits - instead, CI is run on PRs
targeting them, and we enforce linear merges.

## Working with Azure Pipelines

### Understanding what gets built

Azure Pipelines does not build your branch or your PR; instead, what it builds
is the result of merging your branch into its target (in most cases, `main` or
`main-2.x`). This has some nice properties (you don't need to explicitly
rebase/merge to be confident your PR builds against current head), but it can
also cause some subtle issues because **this is done per job**.

This means that, within a single build, two separate jobs may not be building
the same code. This is particularly problematic for the platform-independence
test, in rare cases where the various platform jobs don't start at the same
time and the changes on `main` in-between change the produced DAR file.

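The per-job merge can be pictured with a small Python model. This is only a
sketch of the behaviour described above, not how Azure Pipelines is
implemented: the timeline, the commit names, and the `merged_input` helper are
all hypothetical.

```python
import bisect

# Hypothetical timeline of the target branch: (time, commit), sorted by time.
main_history = [(0, "main@A"), (10, "main@B")]


def merged_input(pr_head, job_start, history):
    """Return what a job starting at `job_start` would build: the PR head
    merged into whatever the target branch head is at that moment."""
    times = [t for t, _ in history]
    idx = bisect.bisect_right(times, job_start) - 1
    if idx < 0:
        raise ValueError("job starts before the first known target commit")
    return (pr_head, history[idx][1])


# Two jobs of the same build that start at different times.
linux = merged_input("pr@X", 5, main_history)
windows = merged_input("pr@X", 12, main_history)

print(linux)
print(windows)
print("same code:", linux == windows)
```

Because a commit landed on the target branch between the two job start times,
the two jobs in this model merge against different heads, which is the
situation that can trip up the platform-independence test.
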
### When a build doesn't start

There are two situations where a build will not start for a PR:

- The PR does not match our security rule "has been opened by an account with
  write access". This covers bot-opened PRs (bots can create branches directly
  on the repo but don't count as having write access, because reasons) as well as
  any PR "from a fork", regardless of who opens it (i.e. if you have write access
  but choose to make a fork instead of pushing your branch directly to the repo,
  CI won't start).
- Sometimes either GitHub or Azure Pipelines has a temporary network issue.
  Builds are triggered by GitHub sending events to Azure Pipelines (PR opened,
  new commit pushed, etc.); there is no polling. So if there's any issue with
  that one notification, the build has been "missed" and won't be started.

Remediation depends on the situation. In most cases, if you have write access
to the repo you can trigger a PR build by adding a comment that reads `/azp
run` on the PR. The comment has to be just those 8 characters.

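To make the "just those 8 characters" rule concrete, the Python sketch below
models it as a strict string comparison. This models the rule as stated here,
not the actual matching logic on the Azure Pipelines side; in particular, how
surrounding whitespace is treated is an assumption (the sketch rejects it).

```python
TRIGGER = "/azp run"


def would_trigger_build(comment_body):
    """True only if the comment is exactly the trigger text.

    Assumption: no extra words, punctuation, or surrounding whitespace
    are tolerated.
    """
    return comment_body == TRIGGER


print(len(TRIGGER))
print(would_trigger_build("/azp run"))
print(would_trigger_build("/azp run please"))
```
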
Alternatively, in the second case, operations like pushing a new commit or
closing and reopening the PR can trigger a new notification.

### Restarting a failed build

Azure Pipelines should trigger a build on every pull request (but not every
branch). If a build has failed and you believe the failure to be flaky, you can
re-run the build by navigating to the "Checks" tab of your PR, and clicking the
"Re-run failed checks" button in the top right.

**This will only work once the build is finished, whether successfully or
not.** The button does nothing if some jobs (from that build) are still
running. You can identify builds and jobs on the GitHub Checks page by their
name, which is of the form `[Pipeline] ([Job])`, e.g. `PRs
(compatibility_linux)` where `PRs` is the name of the pipeline and
`compatibility_linux` is the job. A build is an instance of running all the
jobs in a pipeline.

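If you ever need to split such check names programmatically, the
`[Pipeline] ([Job])` shape is easy to parse. The Python sketch below is one way
to do it; the `split_check_name` helper and its regular expression are
illustrative and assume job names contain no parentheses.

```python
import re

# Pipeline name, one space, then the job name in parentheses at the end.
CHECK_NAME = re.compile(r"^(?P<pipeline>.+?) \((?P<job>[^()]+)\)$")


def split_check_name(name):
    """Split 'Pipeline (job)' into a (pipeline, job) tuple."""
    match = CHECK_NAME.match(name)
    if match is None:
        raise ValueError(f"not of the form 'Pipeline (job)': {name!r}")
    return match.group("pipeline"), match.group("job")


print(split_check_name("PRs (compatibility_linux)"))
```
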
Note that the `Re-run all jobs` button reruns all the jobs, which means you
take a chance with the ones that have already succeeded. This is sometimes
necessary, but the only case I can think of is when the platform-independence
test fails because of a race condition.

A PR-triggered build gets canceled if a new commit is pushed to the
corresponding branch.

### Finding logs for a build

From the same Checks tab (or the equivalent for main branch commits), you can
click on the "View more details on Azure Pipelines" link to get access to the
running logs of a job.

Note that logs at this level are per step, not per job. You can look at logs
scrolling by for a running step, or download the entirety of the logs as a text
file with the "View raw log" button in the top right.

On the build page view (when no specific job or step is selected), you can see
the build artifacts. Most of these artifacts are additional logs, presumably
more detailed.

### Managing jobs in Azure Pipelines

Only a few people have access to Azure Pipelines directly. Those people can
additionally use the Azure Pipelines UI to:

- Cancel a running build. Note that we cannot cancel individual jobs, and the
  cancellation is a request - some steps react more quickly than others.
- Manually start a build from a pipeline on an arbitrary git commit - this is
  easily abusable and the reason why not many people are given access.

Direct access to Azure Pipelines does not help with most routine tasks, e.g. it
does not allow one to restart a failed job while other jobs in the same build
are still running.

### Managing CI pools

Direct access to Azure Pipelines also allows one to manage the pools of CI
machines:

- See how many jobs are running and how many jobs are queued, which may
  indicate a need for more machines. There is no auto-scaling, so scaling may
  need to be done manually.
- Disable (and then re-enable) individual machines in a pool. A disabled
  machine will finish any ongoing job but will not be assigned new jobs.
- Delete a machine from a pool. This removes it from Azure Pipelines, but does
  not free up the corresponding resources on Azure. Prefer [Bracin] for machine
  deletion.
- Add (or remove) "capabilities" to a machine, which is a set of flags that can
  be used in "demands" in job configuration. By default, all jobs require the
  `assignment` capability to be equal to `default` (this is an explicit demand in
  our YAML files, not a statement about Azure Pipelines defaults), and all
  machines start with the `assignment` capability equal to `default` (this is
  explicitly set in our startup scripts in [daml-ci], not a statement about the
  Azure Agent's defaults). Changing capabilities can allow fine-grained
  selection of which PR runs on which machine, which is generally seen as a bad
  thing we should not do, but is occasionally needed while working on the CI
  infrastructure, for example to test out a new version of the base VM that
  machines run from.

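The interplay between demands and capabilities can be modelled in a few lines
of Python. This is a simplified sketch that only covers equality demands such
as `assignment` equal to `default`; the machine names and the `new-base-vm`
value are hypothetical, and real Azure Pipelines demand matching has more
features than shown here.

```python
def satisfies(demands, capabilities):
    """True if every demanded key is present with the demanded value."""
    return all(capabilities.get(key) == value for key, value in demands.items())


def eligible_machines(demands, pool):
    """Names of the machines in `pool` that a job with `demands` may run on."""
    return [name for name, caps in pool.items() if satisfies(demands, caps)]


# Hypothetical pool: two regular machines and one relabelled for testing.
pool = {
    "ci-linux-1": {"assignment": "default"},
    "ci-linux-2": {"assignment": "default"},
    "ci-linux-test": {"assignment": "new-base-vm"},
}

default_job = {"assignment": "default"}
experimental_job = {"assignment": "new-base-vm"}

print(eligible_machines(default_job, pool))
print(eligible_machines(experimental_job, pool))
```

In this model, changing the `assignment` capability on one machine takes it
out of the rotation for regular jobs and reserves it for jobs whose demand has
been changed to match, which is the mechanism used when testing a new base VM.
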
Scaling machines requires access to Azure (which is separate from Azure
Pipelines despite the naming similarity), or the feature to be added to
[Bracin].

[Bracin]: https://daml-ci.da-int.net

## Playbook

### When the split_release job fails

When the `split_release` job fails, that's basically unrecoverable (unless it
fails on the very first thing it tries to push) because that step is not
idempotent. In other words, the release is busted: you'll have to make a new
one with a new version string. This is the reason why we have that `.0.` in
the version string.

### Retrying a snapshot that failed because of a flake

Find the commit on `main` from which the `snapshot` or `digital-asset.daml`
pipeline was run (depending on whether this was a daily snapshot triggered by
`daily-compat` or a "manual" snapshot triggered by a modification to the
`LATEST` file). Click on the red cross indicating that the CI failed on that
commit. Re-run the relevant pipeline from the "check" page you land on.

Sometimes all the pipeline jobs are green because GitHub got the notifications
for a successful 2.x after it got the notifications for a failed 3.x. In that
case there is unfortunately no way to restart the snapshot from GitHub.