Context
=======
After multiple discussions about our current release schedule and
process, we've come to the conclusion that we need to be able to make a
distinction between technical snapshots and marketing releases. In other
words, we need to be able to create a bundle for early adopters to test
without making it an officially-supported version, and without
necessarily implying everyone should go through the trouble of
upgrading. The underlying goal is to have less frequent but more stable
"official" releases.
This PR is a proposal for a new release process designed under the
following constraints:
- Reuse as much as possible of the existing infrastructure, to minimize
both the effort involved and the risk of disruption.
- Have the ability to create "snapshot"/"nightly"/... releases that are
not meant for general public consumption, but can still be used by savvy
users without jumping through too many extra hoops (ideally just
swapping in a slightly-weirder version string).
- Have the ability to promote an existing snapshot release to "official"
release status, with as few changes as possible in-between, so we can be
confident that the official release is what we tested as a prerelease.
- Have as much as possible of the release pipeline shared between the
two types of releases, to avoid discovering non-transient problems while
trying to promote a snapshot to an official release.
- Triggering a release should still be done through a PR, so we can
keep the same approval process for SOC2 auditability.
The gist of this proposal is to replace the current `VERSION` file with
a `LATEST` file, which would have the following format:
```
ef5d32b7438e481de0235c5538aedab419682388 0.13.53-alpha.20200214.3025.ef5d32b7
```
This file would be maintained with a script to reduce manual labor in
producing the version string. Other than that, the process will be
largely the same, with releases triggered by changes to this `LATEST`
file and to the release notes files.
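For illustration, a minimal sketch of what such a script could look like
(the script name and the version-string helper are hypothetical; the
version format itself is discussed in the next section):

```
#!/usr/bin/env bash
# Hypothetical update-latest.sh: record a release candidate in LATEST.
set -euo pipefail
SHA=$(git rev-parse "${1:-HEAD}")
VERSION=$(./release/version-string.sh "$SHA")  # hypothetical helper
echo "$SHA $VERSION" > LATEST
```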
Version numbers
===============
Because one of the goals is to reduce the velocity of our published
version numbers, we need a different version scheme for our snapshot
releases. Fortunately, most version schemes have some support for that;
unfortunately, the SDK sits at the intersection of three different
version schemes that have made incompatible choices. Without going into
too much detail:
- Semantic versioning (which we chose as the version format for the SDK
version number) allows for "prerelease" version numbers as well as
"metadata"; an example of a complete version string would be
`1.2.3-nightly.201+server12.43`. The "main" part of the version string
always has to have 3 numbers separated by dots; the "prerelease"
(after the `-` but before the `+`) and the "metadata" (after the `+`)
parts are optional and, if present, must consist of one or more segments
separated by dots, where a segment can be either a number or an
alphanumeric string. In terms of ordering, metadata is irrelevant and
any version with a prerelease string is before the corresponding "main"
version string alone. Amongst prereleases, segments are compared in
order with purely numeric ones compared as numbers and mixed ones
compared lexicographically. So `1.2.3` is more recent than `1.2.3-1`,
which is itself less recent than `1.2.3-2` (see the example after this
list).
- Maven version strings are any number of segments separated by a `.`, a
`-`, or a transition between a number and a letter. Version strings
are compared element-wise, with numeric segments being compared as
numbers. Alphabetic segments are treated specially if they happen to be
one of a handful of magic words (such as "alpha", "beta" or "snapshot"
for example) which count as "qualifiers"; a version string with a
qualifier is "before" its prefix (`1.2.3` is before `1.2.3-alpha.3`,
which is the same as `1.2.3-alpha3` or `1.2.3-alpha-3`), and there is a
special ordering amongst qualifiers. Other alphabetic segments are
compared alphabetically and count as being "after" their prefix
(`1.2.3-really-final-this-time` counts as being released after `1.2.3`).
- GHC package numbers consist of any number of numeric segments
separated by `.`, plus an optional (though deprecated) alphanumeric
"version tag" separated by a `-`. I could not find any official
documentation on ordering for the version tag; numeric segments are
compared as numbers.
- npm uses semantic versioning, so that is covered already.
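Since these ordering rules are easy to get wrong, it helps to check them
against a reference implementation. As an example of the semver rules
described above (assuming the npm `semver` package, which ships a small
CLI, is available):

```
$ npx semver 1.2.3 1.2.3-1 1.2.3-2
1.2.3-1
1.2.3-2
1.2.3
```

The CLI prints its arguments in ascending order, confirming that both
prereleases sort before the corresponding "main" version.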
After much more investigation than I'd care to admit, I have come up
with the following compromise as the least-bad solution. First,
obviously, the version string for stable/marketing versions is going to
be "standard" semver, i.e. major.minor.patch, all numbers, which works,
and sorts as expected, for all three schemes. For snapshot releases, we
shall use the following (semver) format:
```
0.13.53-alpha.20200214.3025.ef5d32b7
```
where the components are, respectively:
- `0.13.53`: the expected version string of the next "stable" release.
- `alpha`: a marker that hopefully scares people enough.
- `20200214`: the date of the release commit, which _MUST_ be on
master.
- `3025`: the number of commits in master up to the release commit
(inclusive). Because we have a linear, append-only master branch, this
uniquely identifies the commit.
- `ef5d32b7`: the first 8 characters of the release commit sha. This is
not strictly speaking necessary, but makes it a lot more convenient to
identify the commit.
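As a sketch of how the script mentioned above could derive these
components (assuming it runs on the release commit and that
`NEXT_VERSION` holds the expected next stable version):

```
set -euo pipefail
SHA=$(git rev-parse HEAD)                    # release commit, on master
DATE=$(git show -s --format=%cd --date=format:%Y%m%d "$SHA")
COUNT=$(git rev-list --count "$SHA")         # commits up to and including it
SHORT=$(git rev-parse --short=8 "$SHA")
echo "${NEXT_VERSION}-alpha.${DATE}.${COUNT}.${SHORT}"
# e.g. 0.13.53-alpha.20200214.3025.ef5d32b7
```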
The main downsides of this format are:
1. It is not a valid format for GHC packages. We do not publish GHC
packages from the SDK (so far we have instead opted to release our
Haskell code as separate packages entirely), so this should not be an
issue. However, our SDK version currently leaks to `ghc-pkg` as the
version string for the stdlib (and prim) packages. This PR addresses
that by tweaking the compiler to remove the offending bits, so `ghc-pkg`
would see the above version number as `0.13.53.20200214.3025` (a sketch
of the transformation follows this list), which should be enough to
uniquely identify it. Note that, as far as I could find out, this number
would never be exposed to users.
2. It is rather long, which I think is good from a human perspective as
it makes the version look scarier. However, I have been told that it may
be long enough to cause issues on Windows by pushing us past the maximum
path length limitation of that "OS". I suggest we try it and see what
happens.
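As a rough illustration of the transformation mentioned in point 1 (a
`sed` sketch over the version string, not the actual compiler change):

```
echo "0.13.53-alpha.20200214.3025.ef5d32b7" \
  | sed -E 's/-alpha\.([0-9]+)\.([0-9]+)\.[0-9a-f]+$/.\1.\2/'
# prints 0.13.53.20200214.3025
```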
The upsides are:
- It clearly indicates it is an unstable release (`alpha`).
- It clearly indicates how old it is, by including the date.
- To humans, it is immediately obvious which version is "later" even if
they have the same date, allowing us to release same-day patches if
needed. (Note: that is, commits that were made on the same day; the
release date itself is irrelevant here.)
- It contains the git sha so the commit built for that release is
immediately obvious.
- It sorts correctly under all schemes (modulo the modification for
GHC).
Alternatives I considered:
- Pander to GHC: `0.13.53-alpha-20200214-3025-ef5d32b7`. This format
would be accepted by all schemes, but would not sort as expected under
semantic versioning (though Maven would be fine). I have no idea how it
would sort under GHC.
- Not having any non-numeric component, e.g. `0.13.53.20200214.3025`.
This is not valid semantic versioning and is therefore rejected by
npm.
- Not having detailed info: just go with `0.13.53-snapshot`. This is
what is generally done in the Java world, but we then lose track of what
version is actually in use and I'm concerned about bug reports. This
would also not let us publish to the main Maven repo (at least not more
than once), as artifacts there are supposed to be immutable.
- Not having a qualifier: `0.13.53-3025` would be acceptable to all three
version formats. However, it would not clearly indicate to humans that
it is not meant as a stable version, and would sort differently under
semantic versioning (which counts it as a prerelease, i.e. before
`0.13.53`) than under maven (which counts it as a patch, so after
`0.13.53`).
- Just counting releases: `0.13.53-alpha.1`, where we just count the
number of prereleases between `0.13.52` and the next stable release. This is
currently the fallback plan if Windows path length causes issues. It
would be less convenient to map releases to commits, but it could still
be done via querying the history of the `LATEST` file.
Release notes
=============
> Note: We have decided not to have release notes for snapshot releases.
Release notes are a bit tricky. Because we want the ability to make
snapshot releases, then later on promote them to stable releases, it
follows that we want to build commits from the past. However, if we
decide post-hoc that a commit is actually a good candidate for a
release, there is no way that commit can have the appropriate release
notes: it cannot know what version number it's getting, and, moreover,
we now track changes in commit messages. And I do not think anyone wants
to go back to the release notes file being a merge bottleneck.
But release notes need to be published to the releases blog upon
releasing a stable version, and the docs website needs to be updated and
include them.
The only sensible solution here is to pick up the release notes as of
the commit that triggers the release. As the docs cron runs
asynchronously, this means walking down the git history to find the
relevant commit.
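Concretely, that walk could look something like the following sketch
(the release-notes path is illustrative):

```
# Find the most recent commit that touched LATEST (i.e. the release
# trigger), then read the release notes as they were at that commit.
TRIGGER=$(git log -1 --format=%H -- LATEST)
git show "$TRIGGER:docs/release-notes.rst"
```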
> Note: We could probably do away with the asynchronicity at this point.
> It was originally included to cover for the possibility of a release
> failing. If we are releasing commits from the past after they have been
> tested, this should not be an issue anymore. If the docs generation were
> part of the synchronous release step, it would have direct access to the
> correct release notes without having to walk down the git history.
>
> However, I think it is more prudent to keep this change as a future step,
> after we're confident the new release scheme does indeed produce much more
> reliable "stable" releases.
New release process
===================
Just like releases are currently controlled mostly by detecting
changes to the `VERSION` file, the new process will be controlled by
detecting changes to the `LATEST` file. The format of that file will
include both the version string and the corresponding SHA.
Upon detecting a change to the `LATEST` file, CI will run the entire
release process, just like it does now with the `VERSION` file. The main
differences are:
1. Before running the release step, CI will check out the commit
specified in the `LATEST` file. This requires separating the release
step from the build step, which in my opinion is cleaner anyway.
2. The `//:VERSION` Bazel target is replaced by a repository rule
that gets the version to build from an environment variable, with a
default of `0.0.0` to remain consistent with the current `daml-head`
behaviour.
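Putting those two differences together, the release step could look
roughly like this (the environment variable name and the release entry
point are illustrative):

```
# Triggered by a change to LATEST.
read -r SHA VERSION < LATEST
git checkout "$SHA"
# The repository rule reads the version from the environment,
# defaulting to 0.0.0 when unset.
DAML_SDK_RELEASE_VERSION="$VERSION" ./ci/release.sh
```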
Some of the manual steps will need to be skipped for a snapshot release.
See amended `release/RELEASE.md` in this commit for details.
The main caveat of this approach is that the official release will be a
different binary from the corresponding snapshot. It will have been
built from the same source, but with a different version string. This is
somewhat mitigated by Bazel caching, meaning any build step that does
not depend on the version string should use the cache and produce
identical results. I do not think this can be avoided when our artifact
includes its own version number.
I must note, though, that while going through the changes required after
removing the `VERSION` file, I have been quite surprised at the sheer number of
things that actually depend on the SDK version number. I believe we should
look into reducing that over time.
CHANGELOG_BEGIN
CHANGELOG_END
We have seen a number of failures recently where the collect_build_data
and notify_user steps failed to fetch their branch commit from GitHub.
Trying to reproduce the failure locally doesn't work (i.e. fetching the same sha
succeeds), so we're currently assuming transient network errors.
This PR adds a retry mechanism as well as a Slack message to help me
keep tabs on the issue.
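The retry itself is a simple loop; a sketch (counts and delays are
illustrative, and fetching a bare sha assumes the server allows it):

```
fetch_with_retry() {
  local sha=$1
  for attempt in 1 2 3; do
    if git fetch origin "$sha"; then
      return 0
    fi
    echo "fetch of $sha failed (attempt $attempt), retrying..." >&2
    sleep 5
  done
  return 1
}
```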
CHANGELOG_BEGIN
CHANGELOG_END
* Update CI nix version
`--option http2 false` requires Nix 2.3.2 to take effect.
CHANGELOG_BEGIN
CHANGELOG_END
* Set option `http2 = false` in dev-env nix config
This is less likely to overlook an instance than manually adding
`--option http2 false` to each Nix invocation.
Setting `--option http2 false` also had no effect on the multi-user Nix
installation on the Linux CI machines due to
```
WARNING: option '--disk_cache' was expanded to from both option '--config linux' (source /nix/store/2xnfb2l39d2b4nxw5vwmqz5hjwhw0caw-daml-bazelrc) and option '--config linux' (source /nix/store/2xnfb2l39d2b4nxw5vwmqz5hjwhw0caw-daml-bazelrc)
```
Co-authored-by: Andreas Herrmann <andreash87@gmx.ch>
The $(Agent.JobStatus) variable unfortunately only tracks the status of
the current _job_, whereas what we really want here is the status of the
whole build. Azure does not have an easy way to do that, but fortunately
we already have logic to determine the current build status in
`collect_build_data`, so here we can just piggyback on that.
CHANGELOG_BEGIN
CHANGELOG_END
PR #4286 introduced new jobs that do not work well when run against the
master branch, rather than as part of a PR. This hopefully fixes that,
though it's hard to test for obvious reasons.
CHANGELOG_BEGIN
CHANGELOG_END
One of the outputs of our brainstorming about how to make CI better was
that it is annoying to have to "babysit" pull requests. This PR attempts
to introduce a notification mechanism by which Azure will notify people
on Slack when a build finishes, so they know they need to go and rerun
or merge the corresponding PR.
This commit also changes the existing $Slack.URL variable to
$Slack.team-daml, to make more explicit where the Slack message is being
sent to (Slack works with one token per destination channel). Both
$Slack.URL and $Slack.team-daml are currently defined as the same token
in Azure.
CHANGELOG_BEGIN
CHANGELOG_END
Azure builds on the merge commit provided by GitHub as
refs/pull/<pr-number>/merge. Either GitHub recently changed to clear out
such commits faster than they used to, or Azure recently changed to
cache the resulting commit sha rather than go through the indirection
again.
Either way, the end result is that, currently, if the other jobs take
"long enough", and `master` has changed in-between the build starting
and the `collect_build_data` step running, the latter will fail to
check out the commit it is looking for, and the build will irredeemably
fail. The only option is to re-run the entire build (`/azp run` or
rebase/push), which is sort of the entire opposite of the whole reason
for introducing `collect_build_data` in the first place.
This patch aims to address this by not relying on Azure to fetch the
daml repo in the `collect_build_data` job. This is definitely a hack,
but hopefully one that can alleviate the problem for now.
CHANGELOG_BEGIN
CHANGELOG_END
The main goal of this job is to fail when other, required jobs are
canceled. The reason for this is that the communication between Azure
and GitHub does not always work very well, particularly around canceled
jobs, so that when a job gets canceled GitHub does not always know about
it, and furthermore GitHub does not provide a "re-run" button for
canceled jobs.
Thus, this one provides a failed job that does display a "re-run"
button, which re-runs all the failed/canceled jobs in the current build.
Therefore, this only needs to detect canceled jobs, not failed ones
(because those will have their own "re-run" button).
Additionally, we recently changed the standard-change label check to not
run on master (as it really only makes sense on PRs), resulting in a
Skipped status instead of Succeeded, which made this job fail.
CHANGELOG_BEGIN
CHANGELOG_END
This commit:
- fixes an issue introduced by e2015e2ec whereby the data was never
saved to GCS: if the variable _is_ set, Azure substitutes the right-hand
side of the comparison as well as setting the environment variable, so
the two sides are still equal. The new approach skips the first
character of the env var and removes the dollar from the (intended)
literal expression, ensuring Azure does not substitute it (sketched
below).
- adds output to the failure case because we have recently seen all
commits on master fail that step, regardless of success of other jobs.
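A sketch of the fixed comparison (the variable name is illustrative):

```
# Azure textually substitutes $(FORK_POINT) in inline scripts before
# they run; when the variable is undefined, the literal text is left in
# place. Skipping the first character of the env var and writing the
# right-hand side without its dollar sign (so Azure cannot substitute
# it) makes the comparison match only the undefined case.
if [ "${FORK_POINT:1}" = "(FORK_POINT)" ]; then
  echo "FORK_POINT is not defined"
fi
```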
CHANGELOG_BEGIN
CHANGELOG_END
There was an issue where multiline commit messages were not encoded at
all, yielding invalid JSON. This should fix it. This commit message is
designed to test it before it gets to master.
$(exit 1)
`exit 1`
end quote: "
invalid json input
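As for the encoding itself (the lines above are the test payload), a
robust approach is to let a JSON tool do the quoting rather than
hand-escaping; a sketch using `jq`:

```
# Read the raw commit message and emit it as a single JSON string, with
# newlines, quotes, and other special characters properly escaped.
MESSAGE=$(git log -1 --format=%B | jq -sR .)
```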
* Rename hie-core to ghcide
The name `hie-core` has caused a lot of confusion as to how we relate
to haskell-ide-engine so changing it should hopefully help with that.
I also think that ghcide is still a good name once we hopefully
integrate with haskell-ide-engine more closely.
The name ghcide seems to have a reasonable amount of support on
Twitter https://twitter.com/ndm_haskell/status/1170681262987710464
which is of course the only good way to come up with names.
* Add a readme that points people to the new directory.
* Fix bogus replacements
* Use a proper link
* links are hard
At the moment, if we try to run the build of a release commit after it
has succeeded, the `git tag` step fails. We do not normally try to
rebuild old commits that had succeeded, but sometimes Azure gets
confused when we ask for rerunning specific jobs within a build, and
reruns the whole build.
This creates two (small, but annoying) problems:
1. It adds noise to the #team-daml channel as it notifies of the build
failure, and
2. It marks the commit as failed with a red cross on the github list of
commits, which obviously doesn't look great for release commits.
This fixes that. Note that if a release does fail after the `git tag`
step (e.g. some network error between Azure and GitHub), this **does
not** change the necessary steps to remediate, as that situation would
already be broken in the current setup. (Steps to remediate would
essentially boil down to deleting the tag on GitHub before rerunning so
the build can create it again.)
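One way to make the tagging step rerunnable is to guard it, along these
lines (the tag name and variables are illustrative):

```
if git rev-parse -q --verify "refs/tags/v${VERSION}" >/dev/null; then
  echo "tag v${VERSION} already exists, skipping"
else
  git tag "v${VERSION}" "${SHA}"
  git push origin "v${VERSION}"
fi
```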
This is in preparation for #2326 as well as for splitting hie-core
into a separate repo. Given that, it explicitly avoids using our
dev-env.
We do need to install a few system packages, so for now this uses the
hosted builders, which let us do that. Another option would be to just
add those packages to our own builders; I don’t really have a preference
either way. The builds take < 5 minutes, so I don’t expect issues from
using the hosted builders.
- change job names to match the names Azure uses, so GitHub doesn't get
confused. These were the names when the pipeline was originally set
up; since I changed them it looks like Azure and GitHub don't agree on
test names, resulting in some occurrences of GitHub forever expecting
more info about `linux` while `Linux` is completed.
- add `-i` to cURL invocations, so we can see the response headers. This
is an attempt to debug a situation I have seen three times now where
Azure says it executed the cURL step successfully, yet no message
appears in Slack.
- add test name to failure message sent to Slack.
* Fix termination of scenario service on Windows
The lack of a proper Windows IO manager resulted in us being unable to
kill the conduits reading the output of the scenario service, so `damlc
test` and `damlc ide` blocked forever. This PR fixes the problem by
shutting down the scenario service (by closing its stdin) before
killing the conduits.
* Use fail instead of error
* Add debugging output
* Remove debug output
* Bump timeout of perf test
* ci: always use the linux-pool
reduce the difference in environment between external and internal
contributions
* infra: tweak the linux cache warmup script
Don't share the same bazel cache directory with the disk cache, which is
something else. Be more specific about the target. Clean after yourself.
* infra: bump the linux agent disk to 200GB
avoid running out of disk space
Azure Pipelines has direct integration with GitHub, so we're just using
that. Releases on GitHub have to target a tag, so we also need to push
the tag as an intermediate step; we also need to include the platform
name in the artifact to avoid overwriting from different builds.
The two "GitHub release" steps depend on two Azure variables that are
not defined in the pipeline script. This may look like it should not
work, but in fact it does, because these variables are set by the
release script.
In Azure Pipelines, any build step can set variables for the next build
steps by outputting specially-formatted text to stdout. This text will
not appear in the build output displayed by Azure Pipelines, e.g.:
```
echo '##vso[task.setvariable variable=sauce]tomatoes'
```
would define the Azure variable `sauce` to have the string `tomatoes` as
its value for the next build steps.
See [0] for details.
[0]: https://docs.microsoft.com/en-us/azure/devops/pipelines/process/variables?view=azure-devops&tabs=yaml%2Cbatch#set-in-script
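A later build step can then consume the variable with the usual macro
syntax; for example:

```
# Azure replaces $(sauce) textually before the script runs, so this
# prints "the sauce is tomatoes" (bash never sees a command
# substitution here).
echo "the sauce is $(sauce)"
```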
* nix: add more providers to terraform
* docs: make tarballs more reproducible
* ci: use the linux-pool pool
* ci: tweak the nix installation
handle the case where the user is root and on ubuntu
* infra: terraform fmt
* infra: add Azure Pipeline agents
* ci: only enable linux-pool for internal PRs