Commit Graph

43 Commits

Author SHA1 Message Date
Gary Verhaegen
eb14dcf6e9
bump check_releases timeout (#8812)
It's become quite flaky at just 4h.

Yes, I do plan to actually fix it, but it may take a while.

CHANGELOG_BEGIN
CHANGELOG_END
2021-02-10 22:45:25 +00:00
Brian Healey
dd73af06fb
replace deprecated blackduck property (#8811)
CHANGELOG_BEGIN
CHANGELOG_END
2021-02-10 19:54:31 +00:00
Gary Verhaegen
29c2577fa2
fix perf_speedy report (escaping ") (#8655)
The report currently chokes on quotes in the commit message (see
550aa48fc5). Rather than trying to
correctly excape things in Bash, this PR delegates the quote handling to
jq, because having to deal with Bash embedded in YAML is hard enough.

CHANGELOG_BEGIN
CHANGELOG_END
2021-02-03 19:47:27 +01:00
Gary Verhaegen
fef712bf60
Upgrade linux nodes to 20.04 (#8617)
CHANGELOG_BEGIN

- Our Linux binaries are now built on Ubuntu 20.04 instead of 16.04. We
  do not expect any user-level impact, but please reach out if you
  do notice any issue that might be caused by this.

CHANGELOG_END
2021-01-27 17:38:34 +01:00
Gary Verhaegen
fc83f7088d
limit blackduck scans to main (#8647)
CHANGELOG_BEGIN
CHANGELOG_END
2021-01-27 17:29:27 +01:00
Gary Verhaegen
532f996f12
ci: clean-up hard drive more often (#8582)
CHANGELOG_BEGIN
CHANGELOG_END
2021-01-20 17:22:53 +01:00
Gary Verhaegen
ce96802a6d
only update NOTICES from main daily compats (#8473)
See #8468 for failure mode of current config.

CHANGELOG_BEGIN
CHANGELOG_END
2021-01-12 12:10:27 +00:00
Gary Verhaegen
a925f0174c
update copyright notices for 2021 (#8257)
* update copyright notices for 2021

To be merged on 2021-01-01.

CHANGELOG_BEGIN
CHANGELOG_END

* patch-bazel-windows & da-ghc-lib
2021-01-01 19:49:51 +01:00
Gary Verhaegen
93f449d245
rename master to main (#8245)
As we strive for more inclusiveness, we are becoming less comfortable
with historically-charged terms being used in our everyday work.

This is targeted for merge on Dec 26, _after_ the necessary
corresponding changes at both the GitHub and Azure Pipelines levels.

CHANGELOG_BEGIN

- DAML Connect development is now conducted from the `main` branch,
  rather than the `master` one. If you had any dependency on the
  digital-asset/daml repository, you will need to update this parameter.

CHANGELOG_END
2020-12-27 14:19:07 +01:00
Brian Healey
6a1e0a633b
Avoid haskell and jvm bazel blackduck scan stomping (#8247)
* Run bazel scan for all but haskell first, then haskell at end so they do not stomp on each other

CHANGELOG_BEGIN
CHANGELOG_END

Signed-off-by: Brian Healey <brian.healey@digitalasset.com>

* use common prefix between runs so results are aggregated, get branch name from running job rather than assuming master

* DO NOT MERGE: disable all jobs except for blackduck scan

* haskell before full scan

* parens instead of braces

* differentiate haskell and jvm bazel runs with unique code location

* unique code location prefix

* fix syntax

* unique code location to avoid clashing bazel runs

* Use master security-blackduck script helper

* reenable all jobs to make mergeable

* cleanup whitespace

* Use Build.SourceBranchName for branch

* Update ci/cron/daily-compat.yml

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>

* DO NOT MERGE: skip all jobs but blackduck, skip update notices file step

* reenable all jobs to make mergeable

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
2020-12-11 07:56:05 -05:00
Gary Verhaegen
029c655adc
blackduck: open PR on NOTICES file change (#8215)
CHANGELOG_BEGIN
CHANGELOG_END
2020-12-10 10:08:28 +01:00
Brian Healey
ca294eb14d
add blackduck scan to run on master (#6130) (#8161)
* add blackduck scan to run on master (#6130)

* add blackduck scan

* disable go scanning
exclude entire language-support/ts directory for node scanning
break to multiple lines to make command line params easier to parse

* Increase timeout for blackduck binary scan

* update blackduck scan config

* remove some exclusions, force python3

* exclude GO until path to go executable can be resolved

* added readme explanation of why we want this file

* fail in case of policy violation

* ensure haskell bazel scan completes before running second round scan for bazel jvm and node and other langs

* trigger notices file gen to ensure BOM complete

* remove trailing end of lines

* run with latest detect version and unique code location name changes to wrapper script

* Add blackduck to daily compat job

* DO NOT MERGE: condition false to disable other jobs for testing

* remove parameters not available to cronjob

* Revert changes to regular CI pipeline

CHANGELOG_BEGIN
CHANGELOG_END

Signed-off-by: Brian Healey <brian.healey@digitalasset.com>

* Do not get branch name from variable

* Upgrade com.fasterxml.jackson.core:jackson-databind to 2.12.0 to address security vulnerability

* Remove disabling of other jobs, set to branch to be used on prod runs

* Apply suggestions from code review

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>

* Address code review comments

* Updated NOTICES file

* Run bazel build, update NOTICES file

* Correct dade-assist

* do not have perms to pipe to dev/null

* Add md file explaining how to update NOTICES file

* Add instructions for running blackduck locally

* Add a link to full security-blackduck readme

Co-authored-by: Gary Verhaegen <gary.verhaegen@digitalasset.com>
2020-12-07 19:59:39 +00:00
Gary Verhaegen
485dd1b597
perf_json_http: compress results (#8123)
I wanted to suggest that on the PR but caught it after it was merged. So
I made a note of it, which promptly got lost. As the end of the year
approaches I've started trying to clean up a little.

CHANGELOG_BEGIN
CHANGELOG_END
2020-12-01 17:44:43 +01:00
Gary Verhaegen
3cef53135c
ci/cron: do not push artifacts to gcs bucket (#8067)
Having the cron push artifacts to GCP was really only meant to happen
once. I got distracted and worked on other things. This PR closes that
work loop such that the current state and expectations are:

- Every new release pushes to GCP as part of the release process.
- The cron only checks that the GCP backup exists and matches, but does
  not push if it doesn't.

The reason for this is we want the cron job to fail if there are
additional, unexpected files in a release, rather than automatically
commit those files for the long term.

CHANGELOG_BEGIN
CHANGELOG_END
2020-11-25 19:12:03 +01:00
Gary Verhaegen
b23304c691
add default capability to macos (#5915)
This is the macOS part of #5912, which I have separated because our
macOS nodes have a different deployment process so it seemed easier to
track the deployment of the change separately.

CHANGELOG_BEGIN
CHANGELOG_END
2020-11-25 15:34:33 +01:00
Gary Verhaegen
cd427dc2d2
ci/cron: upload artifact to daml-data if missing (#7616)
If we don't already have a copy of an artifact in our "disaster
recovery" storage box, put one.

Note: as implemented, this upload mechanism happens only if the release
was successfully verified signature-wise, so this should not result in
us saving broken artifacts. Also, CI does not have deletion or overwrite
access to this bucket, so overall this should be pretty safe.

CHANGELOG_BEGIN
CHANGELOG_END
2020-10-09 14:55:37 +02:00
Gary Verhaegen
8c1fbf6225
ci/bash_lib: generalize save_gcp_data (#7599)
This PR extends the existing `save_gcp_data` function to handle any
`gsutil` command. This is done to support existence checking using
`gsutil ls` for private artifacts in release checking (`ci/cron`).

CHANGELOG_BEGIN
CHANGELOG_END
2020-10-08 18:37:14 +02:00
Gary Verhaegen
4db8c3ada1
ci/cron: move Bash cron inside Haskell script (#7585)
Yes, this is how I write Haskell. I'm told it's an improvement over
Bash.

Jokes aside, plan is to chip away at the Bash script, starting with
replacing the outermost loop with a proper "get _all_ releases" call
from Haskell, but I like keeping things working in small steps, and even
long-term I have no desire to reimplement the gpg signature checking
code in Haskell.

I have tested that things still work (on my machine); the only
difference is that we now only get the full output all at once at the
end, rather than one signature at a time. I don't think anyone is
looking at the output in real-time, so this should not be a huge issue.

CHANGELOG_BEGIN
CHANGELOG_END
2020-10-06 18:45:29 +02:00
Gary Verhaegen
2973228f77
signature check: report to Slack (#7568)
If a tree falls in the forest and all that.

CHANGELOG_BEGIN
CHANGELOG_END
2020-10-05 17:31:11 +02:00
Gary Verhaegen
fda2eca084
periodically check signatures (#7543)
This is a first, very incomplete step in the spirit of small,
incremental PRs. Known missing features:

- Should check all versions, not just the 30 most recent ones.
- Should also download from GCP backup and compare.
- Should alert on Slack if anything is unexpected.
- Should handle versions prior to us starting to sign (and do what?).
- Should also check artifacts in Artifactory, not just GitHub Releases.
- Optionally should save to GCP if we don't have a backup already.

So at the moment it's just downloading the artifacts for the 30 most
recent releases and printing a message stating whether we have a
signature and whether it's valid.

CHANGELOG_BEGIN
CHANGELOG_END
2020-10-01 21:01:42 +02:00
Leonid Shlyapnikov
d13e7aa184
JSON API daily perf test cron job (#7406)
* Add http-json-perf daily cron job

changelog_begin
changelog_end

* commenting out other jobs so we can manually test the new one

* commenting out other jobs so we can manually test the new one

* Fix the shell script

* Fixing the gs bucket, `gs://http-json bucket does not exist`

* uncomment the other jobs

* timestamp from git log

* get rid of DAR copying

* comment out the other jobs, so we can test it

* uncomment the other jobs
2020-09-16 13:02:02 -04:00
Gary Verhaegen
b2d58a3304
save daily perf results (#7396)
It's a real shame I forgot to do this sooner, but better late than never
I suppose.

CHANGELOG_BEGIN
CHANGELOG_END
2020-09-14 18:38:31 +00:00
Moritz Kiefer
631ed3e891
Bump timeouts in compat tests (#6689)
This bumps the timeout of the compat tests on PRs to 360 minutes
matching other jobs on a PR (we mainly hit this if ghc-lib is rebuilt)
and the timeout on the daily jobs to 720 minutes (we hit this if
_everything_ is rebuilt).

I am slightly worried about the timeout on the daily job. After having
taken a look at it, there are a few reasons how we ended up here:

1. We started including more tests, e.g., sandbox-classic. Not much we
   can do here, those tests are useful.

2. We have a very large number of snapshots for 1.3.0. There are a few
   reasons for this:

   1. Timing: We branched off early for the 1.2.0 release so the first
      snapshot for 1.3 was on June 3th. For 1.4 it looks like the first
      snapshot will be on July 15th so that’s roughly 2 extra
      snapshots just due to timing.

   2. Additional snapshots: We had one broken snapshot due to a broken
      VSCode extension that we didn’t delete (probably not worth doing
      at this point). We also had to backport to an old snapshot which
      resulted in another extra snapshot. We also had one extra
      snapshot which was supposed to be the RC but wasn’t since the
      ANF revert needed to go in.

   The only thing that is clearly useless is the one broken snapshot
   but that doesn’t change things that much. I see 2 orthogonal
   options for improving this assuming we agree that the current
   runtime is worryingly high.

   1. Prune snapshots more aggressively, e.g., only include the last 3
      snapshots. That’s a pretty arbitrary decision but it would
      enforce a hard limit.

   2. Reduce test combinations. E.g., only test snapshots vs stable
      releases but not snapshots vs snapshots.

3. We end up forcing a full build quite frequently. Here are just 2
   examples of how we’ve done that so far.

   1. Upgrade rules_haskell. Basically all tests are run by a Haskell
      binary so this forces a full rebuild.

   2. Change runfiles of `daml`.

I don’t think there is much we can do about 1 or 3 which leaves us
with 2. One not entirely unreasonable option is to just do nothing. We
did have periods where things went pretty smoothly for the most part
and each month we reset to a much smaller number of releases (we also
have to start throwing out old stable releases at some
point). Otherwise reducing the number of test combinations seems the
most promising option to me.

changelog_begin
changelog_end
2020-07-10 12:34:53 +00:00
Gary Verhaegen
beb33f2ab1
add explanation for clearing shared segments (#6545)
As requested on #6530.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-30 13:21:32 +00:00
Gary Verhaegen
55776f92ba
clear shared memory segment on macOS (#6530)
For a while now we've had errors along the line of

```
FATAL:  could not create shared memory segment: No space left on device
DETAIL:  Failed system call was shmget(key=5432001, size=56, 03600).
HINT:  This error does *not* mean that you have run out of disk space.
It occurs either if all available shared memory IDs have been taken, in
which case you need to raise the SHMMNI parameter in your kernel, or
because the system's overall limit for shared memory has been reached.
        The PostgreSQL documentation contains more information about
shared memory configuration.
child process exited with exit code 1
```

on macOS CI nodes, which we were not able to reproduce locally. Today I
managed to, sort of by accident, and that allowed me to dig a bit
further.

The root cause seems to be that PostgreSQL, as run by Bazel, does not
always seem to properly unlink the shared memory segment it uses to
communicate with itself. On my machine, running:

```
bazel test -t- --runs_per_test=100 //ledger/sandbox:conformance-test-wall-clock-postgresql
```

and eyealling the results of

```
watch ipcs -mcopt
```

I would say about one in three runs leaks its memory segment. After much
googling and some head scratching trying to figure out the C APIs for
managing shared memory segments on macOS, I kind of stumbled on a
reference to `pcirm` in a comment to some low-ranking StackOverflow
answer. It looks like it's working very well on my machine, even if I
run it while a test (and therefore an instance of pg) is running. I
believe this is because the command does not actually remove the shared
memory segments, but simply marks them for removal once the last process
stops using it. (At least that's what the manpage describes.)

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-30 01:40:16 +02:00
Gary Verhaegen
664df64e13
fix daily perf Slack notification (#6267)
This PR fixes the Slack notification on daily perf runs. It also updates
the perf sha.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-09 06:45:58 +00:00
Gary Verhaegen
445f6467d9
daily run: warn on master only (#6177)
Currently the message to Slack is always triggered by running the daily
checks. This means that it gets very noisy to:

1. Run the check on PRs affecting the check (like this one),
2. Rerun the check multiple times to ascertain that a given failure is
   flaky.

With this PR, the message to Slack is replaced with a simple `echo` when
these checks are not run from the `master` branch, so whoever (manually)
triggered them can still get feedback on the result, but other people
don't get spurious `@here` mentions.

CHANGELOG_BEGIN
CHANGELOG_END
2020-06-03 16:36:05 +02:00
Moritz Kiefer
629ec732dd
Include puppeteer tests in compat tests (#6018)
* Include puppeteer tests in compat tests

This PR adds the puppeteer based tests to the compatibility
tests. This also means that they are now actually compatibility
tests. Before, we only tested the SDK side.

Apart from process management being a nightmare on Windows as usually,
there are two things that might stick out here:

1. I’ve replaced the `sh_binary` wrapper by a `cc_binary`. There is a
   lengthy comment explaining why. I think at the moment, we could
   actually get rid of the wraper completely and add JAVA to path in
   the tests that need it but at least for now, I’d like to keep it
   until we are sure that we don’t need to add more to it (and then
   it’s also in the git history if we do need to resurrect it).
2. These tests are duplicated now similar to the `daml ledger *`
   tests. The reasoning here is different. They depend on the SDK
   tarball either way so performance wise there is no reason to keep
   them. However, we reference the other file in the docs which means
   we cannot change it freely. What we could do is to make this
   sufficiently flexible to handle both the `daml start` case and
   separate `daml sandbox`/`daml json-api` processes and then we can
   reference it in the docs. There is still added complexity for
   Windows but that’s necessary for users as well that want to run
   this on Windows so that seems unavoidable. (I should probably also
   remove my snarky comments 😇) I’d like to kee it duplicated
   for this PR and then we can clean it up afterwards.

changelog_begin
changelog_end

* Bump timeouts

changelog_begin
changelog_end
2020-05-22 14:02:59 +02:00
Moritz Kiefer
4916a28682
Include create-daml-app tests in compatibility tests (#5945)
This is the first part of #5700

It adds tests that build create-daml-app using `daml build` and then
run the codegen and build the UI. Contrary to our main tests these
also run on Windows. This is actually reasonably simple by first
building the typescript libraries on Linux and then downloading them
on Windows.

There are two parts that are still missing from the tests in the main
workspace:

1. Building the extra feature. This should be fairly easy to add.
2. Running the pupeeter tests. At least MacOS and Linux should be
   reasonably easy. I don’t know what horrors Windows will throw at
   us. This step is what actually makes this a compatibility
   test. Currently it doesn’t actually launch Sandbox and the JSON API.

Since this PR is already pretty large, I’d like to tackle those things
separately.

changelog_begin
changelog_end
2020-05-13 10:39:51 +02:00
Gary Verhaegen
3899a59a11
switch back to hosted macOS nodes (#5935)
CHANGELOG_BEGIN
CHANGELOG_END
2020-05-11 22:59:33 +02:00
Gary Verhaegen
9b476416b8
switch back to Azure-provided macos nodes (#5920)
This is temporary. It looks like the macOS nodes are dead; @nycnewman is
looking into it, but in case he doesn't fix it in time, at least we
have a backup plan so we're not completely blocked on Monday.

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-11 09:28:40 +02:00
Gary Verhaegen
4a6ab84b69
add default machine capability (#5912)
add default machine capability

We semi-regularly need to do work that has the potential to disrupt a
machine's local cache, rendering it broken for other streams of work.
This can include upgrading nix, upgrading Bazel, debugging caching
issues, or anything related to Windows.

Right now we do not have any good solution for these situations. We can
either not do those streams of work, or we can proceed with them and
just accept that all other builds may get affected depending on which
machine they get assigned to. Debugging broken nodes is particularly
tricky as we do not have any way to force a build to run on a given
node.

This PR aims at providing a better alternative by (ab)using an Azure
Pipelines feature called
[capabilities](https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/agents?view=azure-devops&tabs=browser#capabilities).
The idea behind capabilities is that you assign a set of tags to a
machine, and then a job can express its
[demands](https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml),
i.e. specify a set of tags machines need to have in order to run it.

Support for this is fairly badly documented. We can gather from the
documentation that a job can specify two things about a capability
(through its `demands`): that a given tag exists, and that a given tag
has an exact specified value. In particular, a job cannot specify that a
capability should _not_ be present, meaning we cannot rely on, say,
adding a "broken" tag to broken machines.

Documentation on how to set capabilities for an agent is basically
nonexistent, but [looking at the
code](https://github.com/microsoft/azure-pipelines-agent/blob/master/src/Microsoft.VisualStudio.Services.Agent/Capabilities/UserCapabilitiesProvider.cs)
indicates that they can be set by using a simple `key=value`-formatted
text file, provided we can find the right place to put this file.

This PR adds this file to our Linux, macOS and Windows node init scripts
to define an `assignment` capability and adds a demand for a `default`
value on each job. From then on, when we hit a case where we want a PR
to run on a specific node, and to prevent other PRs from running on that
node, we can manually override the capability from the Azure UI and
update the demand in the relevant YAML file in the PR.

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-09 18:21:42 +02:00
Gary Verhaegen
11a2fc3c2d
more flexible perf test check (#5891)
This PR separates the "last known valid perf test" commit from the
"baseline speedy implementation" commit. It is important for the perf
test to be meaningful that the changes between those two commits are
benign, say minor API adjustments, so that the perf measurement remains
meaningful.

This also adds a check on merging to master that tells Slack if the perf
test has changed and the `test_sha` file needs updating. The Slack
message is conditional on the current commit to avoid excessive noise.

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-07 13:53:22 +02:00
Gary Verhaegen
204c8b0657
add daily perf report (#5843)
This PR adds a simple daily job that runs the performance test on a
chosen "baseline" commit and then runs the same benchmark on latest
master. This should allow us to track overall performance improvements.

CHANGELOG_BEGIN
CHANGELOG_END
2020-05-06 13:50:35 +02:00
Moritz Kiefer
b291e96ce1
Publish execution logs from Windows compatibility jobs (#5834)
Hopefully, this helps diagnose the Windows CI failures.

changelog_begin
changelog_end
2020-05-05 12:23:11 +02:00
Moritz Kiefer
49e19ebed1
Make compat tests work on windows (#5732)
* Make compat tests work on windows

This required some changes to the daml_sdk rule since the read-only
installation by the assistant breaks Bazel completely. We could only
apply those changes on Windows but I think I prefer the consistency
across platforms here over trying to stay close to how the SDK is
installed on user machines given that the SDK installation is not
something we’ve had issues with.

I’ve excluded the postgresql tests for now. I don’t expect them to be
particularly hard to fix but I’ve already spent almost 2 days on this
and having some tests run on Windows seems like a clear improvement
over running no tests on Windows :)

changelog_begin
changelog_end

* Remove todo

changelog_begin
changelog_end
2020-04-28 16:06:36 +02:00
Gary Verhaegen
cfae3df7fa
report compat status every day (#5744)
I believe the compatibility check is important enough, and should fail
rarely enough, that it is worth reporting even on success. This will
mean (once we support Windows) 3 messages a day, sent while presumably
nobody is working, so the disruption should be minimal.

The issue with reporting only on failures is that, if we don't
proactively check (which we do for the state of master for different
reasons, but would likely not keep doing for a job that doesn't block
PRs), we may get into a state where it is so broken that it doesn't even
report.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-28 12:30:45 +02:00
Gary Verhaegen
7ceda5678a
run compatibility tests on macos (#5723)
This PR extends the existing Linux compatibility tests to run on macOS
too. Fixes #5692.

CHANGELOG_BEGIN
CHANGELOG_END

Co-authored-by: Moritz Kiefer <moritz.kiefer@purelyfunctional.org>
2020-04-27 14:55:16 +02:00
Moritz Kiefer
0d1f21e4a2
Extend compatibility tests to test against HEAD (#5714)
fixes #5691

changelog_begin
changelog_end
2020-04-24 14:43:35 +02:00
Moritz Kiefer
f61aadc422
Add dade-assist step to compat cron job (#5712)
changelog_begin
changelog_end
2020-04-24 11:25:14 +02:00
Gary Verhaegen
ec1a9326ea
fix daily-compat.yml (#5686)
For some reason the Azure Pipelines folks thought it was a good idea to
search for templates starting with the current YAML file's path. Other
steps (e.g. the bash script here) start from the root of the repo,
though.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-23 13:38:57 +02:00
Moritz Kiefer
7d36402412
Initial boilerplate for cross-version compatibility testing (#5665)
This is a first step towards testing cross-version
compatibility. It doesn’t actuall do much yet but hopefully it should
be easier to parallelize once we have the initial boilerplate in place
so ideally I’d like to address most missing things and issues in
separate PRs.

changelog_begin
changelog_end
2020-04-23 12:58:11 +02:00
Gary Verhaegen
644d4c7512
add empty daily CI run (#5675)
This is meant to be filled once we have a better idea of what exactly we
want to test. See #5665 for current thinking about it.

CHANGELOG_BEGIN
CHANGELOG_END
2020-04-23 10:36:20 +02:00