daml/ci/cron/daily-compat.yml

# Copyright (c) 2020 Digital Asset (Switzerland) GmbH and/or its affiliates. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Do not run on PRs
pr: none
# Do not run on merge to master
trigger: none
# Do run on a schedule (daily)
#
# Note: machines are killed every day at 4AM UTC, so we need to either:
# - run sufficiently before that time that this doesn't get killed, or
# - run sufficiently after it that machines are initialized again.
#
# Targeting 6AM UTC seems to fit that.
schedules:
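# cron fields: minute hour day-of-month month day-of-week, so this fires every
# day at 06:00 UTC.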
- cron: "0 6 * * *"
  displayName: daily checks and reporting
  branches:
    include:
    - master
  always: true
jobs:
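# Every job below demands `assignment -equals default`: each CI node declares
# an `assignment` capability, so overriding that capability on a machine (from
# the Azure UI) takes it out of the default rotation, e.g. while debugging it
# (#5912).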
- job: compatibility_ts_libs
  timeoutInMinutes: 60
  pool:
    name: linux-pool
    demands: assignment -equals default
  steps:
  - checkout: self
  - template: ../compatibility_ts_libs.yml
  - template: ../daily_tell_slack.yml
- job: compatibility
  dependsOn: compatibility_ts_libs
  timeoutInMinutes: 360
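  # The matrix fans this job out into one run per entry, i.e. one on Linux and
  # one on macOS, with $(pool) set to the matching agent pool.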
  strategy:
    matrix:
      linux:
        pool: linux-pool
      macos:
        pool: macOS-pool
  pool:
    name: $(pool)
    ${{ if eq(variables['pool'], 'linux-pool') }}:
      demands: assignment -equals default
  steps:
  - checkout: self
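  # PostgreSQL, as run by Bazel, does not always unlink its shared memory
  # segment on macOS; leaked segments eventually exhaust the available IDs and
  # later runs fail with "could not create shared memory segment: No space
  # left on device". The loop below lists all segment IDs (dropping the ipcs
  # header and trailer lines) and marks each one for removal (#6530).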
  - ${{ if eq(variables['pool'], 'macos-pool') }}:
    - bash: |
        for shmid in $(ipcs -m | sed 1,3d | awk '{print $2}' | sed '$d'); do ipcrm -m $shmid; done
      name: clear_shm
  - template: ../compatibility.yml
  - template: ../daily_tell_slack.yml
- job: compatibility_windows
  dependsOn: compatibility_ts_libs
  timeoutInMinutes: 360
  pool:
    name: windows-pool
    demands: assignment -equals default
  steps:
  - checkout: self
  - template: ../compatibility-windows.yml
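  # Upload the Bazel logs even (especially) when the build failed.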
  - task: PublishBuildArtifacts@1
    condition: succeededOrFailed()
    inputs:
      pathtoPublish: '$(Build.StagingDirectory)'
      artifactName: 'Bazel Compatibility Logs'
  - template: ../daily_tell_slack.yml
- job: performance_report
  timeoutInMinutes: 120
  pool:
    name: "linux-pool"
    demands: assignment -equals default
  steps:
  - checkout: self
  - bash: ci/dev-env-install.sh
    displayName: 'Build/Install the Developer Environment'
  - bash: ci/configure-bazel.sh
    displayName: 'Configure Bazel for root workspace'
    env:
      IS_FORK: $(System.PullRequest.IsFork)
      # to upload to the bazel cache
      GOOGLE_APPLICATION_CREDENTIALS_CONTENT: $(GOOGLE_APPLICATION_CREDENTIALS_CONTENT)
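  # Compare scenario-interpreter performance against a pinned baseline commit,
  # but only when the perf test sources themselves are unchanged.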
  - bash: |
      set -euo pipefail
      eval "$(dev-env/bin/dade assist)"
      BASELINE="cebc26af88efef4a7c81c62b0c14353f829b755e"
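      # Expected state of the perf test sources; if they have diverged from
      # this, the baseline comparison is no longer meaningful.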
      TEST_SHA=$(cat ci/cron/perf/test_sha)
      OUT="$(Build.StagingDirectory)/perf-results.json"
      if git diff --exit-code $TEST_SHA -- daml-lf/scenario-interpreter/src/perf >&2; then
        # no changes, all good
        ci/cron/perf/compare.sh $BASELINE > "$OUT"
        cat "$OUT"
      else
        # the tests have changed, we need to figure out what to do with
        # the baseline.
        echo "Baseline no longer valid, needs manual correction." > "$OUT"
      fi
    displayName: measure perf
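  # Report to Slack; on success the message includes the pretty-printed
  # perf-results.json in a code block next to the commit link.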
  - template: ../daily_tell_slack.yml
    parameters:
      success-message: '$(cat $(Build.StagingDirectory)/perf-results.json | jq . | jq -sR ''"perf for ''"$COMMIT_LINK"'':```\(.)```"'')'