daml/ledger
Samir Talwar 5dd38d54e8 sandbox: PostgreSQL health checks. (#3655)
* ledger-api-test-tool: Increase the duration when watching health.

This should hopefully stop CI from flaking out.

* reference-v2/sandbox: Avoid unnecessary companion object constructors.

I like indirection… when it does something.

* ledger: Propagate empty health checks throughout the services.

* reference: Remove duplication from the ReferenceServer object.

* ledger-api-common: Actually query a "reporter" in the health service.

* ledger-api-common: Report health per-component when required.

* ledger-api-health: Use a Map to represent components for health checks.

* sandbox: Fix warnings in SqlLedgerSpec.

* ledger-api-common: Throw GrpcHealthService errors inside the Future.

* ledger: Implement health checks against the PostgreSQL connection.

Without proper testing, because I am not great at this.

* sandbox: Remove duplication and fix warnings in PostgresAround.

* sandbox: Test the SQL Ledger's health reporting on failure.

* sandbox: Don't report as unhealthy until 3 connections fail.

* ledger-api-health: Remove unused parts of the API.

Bit of premature design there.

* sandbox: Rename the "ledger" health check to "write".

* participant-state: Add the ReportsHealth trait to ReadService.

* ledger-api-common: `Future.fromTry(Try(…))` -> `Future(…)`.

* ledger-api-common: Make it clearer that StubReporter closes over health.

* ledger-api-common: Explain the HealthService watch tests with comments.

* sandbox: Clean up SqlLedger a bit.

* sandbox: Don't try and stop PostgreSQL twice in PostgresAround.

* bazel_tools: Windows rlocation lookups need to be with forward slashes.

* release: Fix case of "true".

* ledger-api-common: Make `GrpcHealthService::matchResponse` return a Try.

* ledger-api-common: Make `GrpcHealthServiceSpec` async.

* sandbox: Make a couple of DB classes final.

* sandbox: Avoid importing `X._` in PostgresAround.

* sandbox: Add clues to the SqlLedgerSpec's multiple assertions.

* sandbox: If PostgreSQL doesn't come back up, keep retrying.

* sandbox: Remove duplication in SqlLedgerSpec.

* sandbox: In SqlLedgerSpec, actually wait for the health to change.

* sandbox: In PostgresAround, make stopping PostgreSQL idempotent.

* sandbox: Simplify the SqlLedgerSpec to make it work on CI.

It's worth a shot.

* ledger-api-common: Simplify the GrpcHealthServiceSpec a little.

And add a changelog.

CHANGELOG_BEGIN

- [Ledger API Server] Add a health check endpoint conforming to the
  `GRPC Health Checking Protocol <https://github.com/grpc/grpc/blob/master/doc/health-checking.md>`_.
- [Ledger API Server] Add health checks for index database connectivity.
- [Participant State API] Add a mandatory ``currentHealth()`` method to
  ``IndexService``, ``ReadService`` and ``WriteService``.

CHANGELOG_END

* sandbox: Improve the Javadoc layout for DbDispatcher.

* sandbox: Capitalize constants in SqlExecutor.

* ledger-api-health: Convert HealthStatus to an abstract class.
2019-11-29 15:07:43 +00:00
api-server-damlonx/reference-v2 sandbox: PostgreSQL health checks. (#3655) 2019-11-29 15:07:43 +00:00
ledger-api-akka Ledger API: Add healthcheck endpoints. (#3573) 2019-11-22 14:02:05 +00:00
ledger-api-auth Implement support for RSA-signed JWT tokens (#3526) 2019-11-25 16:29:24 +01:00
ledger-api-auth-client Add authentication to Java identity client (#3630) 2019-11-26 18:51:09 +00:00
ledger-api-client Add authentication to Java identity client (#3630) 2019-11-26 18:51:09 +00:00
ledger-api-common sandbox: PostgreSQL health checks. (#3655) 2019-11-29 15:07:43 +00:00
ledger-api-domain Use TreeMap for storing transaction nodes (#3418) 2019-11-12 13:55:03 +01:00
ledger-api-health sandbox: PostgreSQL health checks. (#3655) 2019-11-29 15:07:43 +00:00
ledger-api-integration-tests Ledger configuration indexing changes (rebased to master) (#3553) 2019-11-27 17:41:23 +01:00
ledger-api-scala-logging Replace bazel-deps by rules_jvm_external (#3253) 2019-10-28 13:53:14 +01:00
ledger-api-test-tool Ledger api CommandService conformance test fixes (#3675) 2019-11-29 09:57:14 +00:00
ledger-api-test-tool-on-canton Ledger api test tool canton run config fix (#3602) 2019-11-25 11:13:27 +01:00
participant-state sandbox: PostgreSQL health checks. (#3655) 2019-11-29 15:07:43 +00:00
participant-state-index sandbox: PostgreSQL health checks. (#3655) 2019-11-29 15:07:43 +00:00
sandbox sandbox: PostgreSQL health checks. (#3655) 2019-11-29 15:07:43 +00:00
sandbox-perf ledger: Remove redundant PostgreSQL dependencies. (#3465) 2019-11-14 14:36:51 +01:00
scripts update copyright notices (#2499) 2019-08-13 17:23:03 +01:00
test-common Ledger API: Add healthcheck endpoints. (#3573) 2019-11-22 14:02:05 +00:00
README.md Replace MetricsManager with MetricRegistry (#3574) 2019-11-21 16:47:30 +01:00

ledger

Home of our reference ledger implementation (Sandbox) and various ledger related libraries.

Logging

Logging Configuration

The Sandbox and Ledger API Server use Logback for logging configuration.

Log Files

The Sandbox logs at INFO level to standard output and to the file sandbox.log in the current working directory.

Log levels

Like most Java libraries and frameworks, the Sandbox and Ledger API Server use INFO as the default logging level. This level is for minimal and important information (usually only startup and normal shutdown events). INFO level logging should not produce a growing volume of log output during normal operation.

WARN level should be used for transitions between healthy and unhealthy states, or in other near-error scenarios.

DEBUG level should be turned on only when investigating issues in the system, and usually that means we want the trail loggers. Normal loggers at DEBUG level can be useful sometimes (e.g. DAML interpretation).

Metrics

Sandbox and Ledger API Server provide a couple of useful metrics:

Sandbox and Ledger API Server

The Ledger API Server exposes basic metrics for all gRPC services and some additional ones.

| Metric Name | Description |
|---|---|
| LedgerApi.com.digitalasset.ledger.api.v1.$SERVICE.$METHOD | A meter that tracks the number of calls to the respective service and method. |
| CommandSubmission.failedCommandInterpretations | A meter that tracks the failed command interpretations. |
| CommandSubmission.submittedTransactions | A timer that tracks the commands submitted to the backing ledger. |

Indexer

| Metric Name | Description |
|---|---|
| JdbcIndexer.processedStateUpdates | A timer that tracks the duration of state update processing. |
| JdbcIndexer.lastReceivedRecordTime | A gauge that returns the last received record time in milliseconds since the epoch. |
| JdbcIndexer.lastReceivedOffset | A gauge that returns the last received offset from the ledger. |
| JdbcIndexer.currentRecordTimeLag | A gauge that returns the difference between the Indexer's wall-clock time and the last received record time in milliseconds. |
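The arithmetic behind the currentRecordTimeLag gauge can be sketched as follows. This is a minimal illustration using only the JDK. The class name `RecordTimeLag` and its methods are hypothetical and are not taken from the JdbcIndexer source; the only thing assumed from the text above is that the lag is the wall-clock time minus the last received record time, in milliseconds.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration; not the actual JdbcIndexer code.
final class RecordTimeLag {
    private final Clock clock;
    // Last received record time in milliseconds since the epoch.
    private final AtomicLong lastReceivedRecordTime = new AtomicLong();

    RecordTimeLag(Clock clock) {
        this.clock = clock;
    }

    // Called whenever a state update with the given record time is received.
    void onStateUpdate(Instant recordTime) {
        lastReceivedRecordTime.set(recordTime.toEpochMilli());
    }

    // Value a "last received record time" gauge would report.
    long lastReceivedRecordTimeMillis() {
        return lastReceivedRecordTime.get();
    }

    // Wall-clock time minus the last received record time, in milliseconds.
    long currentRecordTimeLagMillis() {
        return clock.millis() - lastReceivedRecordTime.get();
    }

    public static void main(String[] args) {
        // A fixed clock keeps the example deterministic.
        Clock fixed = Clock.fixed(Instant.parse("2019-11-29T15:07:43Z"), ZoneOffset.UTC);
        RecordTimeLag lag = new RecordTimeLag(fixed);
        lag.onStateUpdate(Instant.parse("2019-11-29T15:07:41.500Z"));
        System.out.println(lag.currentRecordTimeLagMillis()); // prints 1500
    }
}
```

Injecting the `Clock` is a design choice made for this sketch so that the lag can be checked in tests without depending on the real system time.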

Metrics Reporting

The Sandbox automatically makes all metrics available via JMX under the JMX domain com.digitalasset.platform.sandbox.

When building an Indexer or Ledger API Server, the implementer/ledger integrator is responsible for setting up a MetricRegistry and a metric reporting strategy that fits their needs.

gRPC and back-pressure

RPC

Standard RPC requests should return with the RESOURCE_EXHAUSTED status code to signal back-pressure. Envoy can be configured to retry on these errors. We have to be careful not to make any persistent changes before returning such an error, as the same original request can be retried on another service instance.
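One way to satisfy the "no persistent changes" requirement is to decide on rejection before the handler runs at all. The sketch below models this with plain JDK classes. `RpcStatus` and `AdmissionGate` are hypothetical names introduced for illustration, and the enum is only a stand-in for the gRPC status codes; this is not the Ledger API Server's implementation.

```java
import java.util.concurrent.Semaphore;

// Stand-in for the gRPC status codes; only the two needed for this sketch.
enum RpcStatus { OK, RESOURCE_EXHAUSTED }

// Hypothetical admission gate: bounds the number of in-flight requests.
final class AdmissionGate {
    private final Semaphore permits;

    AdmissionGate(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Runs the handler only if capacity is available. A rejected request never
    // reaches the handler, so it cannot leave persistent changes behind and is
    // safe to retry on another instance.
    RpcStatus handle(Runnable handler) {
        if (!permits.tryAcquire()) {
            return RpcStatus.RESOURCE_EXHAUSTED;
        }
        try {
            handler.run();
            return RpcStatus.OK;
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        AdmissionGate gate = new AdmissionGate(1);
        // The nested call arrives while the only permit is held, so it is rejected.
        RpcStatus outer = gate.handle(
            () -> System.out.println("nested: " + gate.handle(() -> {})));
        System.out.println("outer: " + outer);
    }
}
```

In a real grpc-java service the rejection would typically be reported with `responseObserver.onError(Status.RESOURCE_EXHAUSTED.asRuntimeException())` instead of returning an enum value.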

Streaming

gRPC's streaming protocol has built-in flow control, but it's not fully active by default. It controls the flow between the TCP/HTTP layer and the library, so it builds on top of TCP's own flow control. Inbound flow control is active by default, but the outbound side does not signal back-pressure out of the box.

AutoInboundFlowControl: The default behaviour for handling incoming items in a stream is to automatically signal demand after every onNext call. This is the correct thing to do if the handler logic is CPU bound and does not depend on other reactive downstream services. By default it's active on all inbound streams. To follow the demand of downstream services instead, disable it by calling disableAutoInboundFlowControl on CallStreamObserver and signal demand manually by calling request.
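The effect of manual demand signalling can be illustrated with a small model. The class `ManualDemandInbound` below is hypothetical and uses only the JDK. It is not the gRPC API, but its `request` method plays the role of the `request` call mentioned above: messages that arrive are held back until the handler asks for them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model of an inbound stream with automatic flow control disabled:
// nothing is delivered to the handler until demand has been signalled.
final class ManualDemandInbound<T> {
    private final Deque<T> buffered = new ArrayDeque<>();
    private Consumer<T> handler = item -> {};
    private long demand = 0;

    void setHandler(Consumer<T> handler) {
        this.handler = handler;
    }

    // Transport side: a message arrived from the wire.
    void arrive(T item) {
        buffered.add(item);
        deliver();
    }

    // Handler side: signal that n more messages can be processed.
    void request(int n) {
        demand += n;
        deliver();
    }

    int bufferedCount() {
        return buffered.size();
    }

    // Delivers buffered messages for as long as there is outstanding demand.
    private void deliver() {
        while (demand > 0 && !buffered.isEmpty()) {
            demand--;
            handler.accept(buffered.poll());
        }
    }

    public static void main(String[] args) {
        ManualDemandInbound<String> inbound = new ManualDemandInbound<>();
        List<String> handled = new ArrayList<>();
        inbound.setHandler(handled::add);
        inbound.arrive("a");
        inbound.arrive("b");
        inbound.arrive("c");
        inbound.request(2); // only now are "a" and "b" handed to the handler
        System.out.println(handled + ", still buffered: " + inbound.bufferedCount());
    }
}
```

A handler that depends on a downstream service would call `request` from the downstream's own demand callback, so that inbound traffic never outruns what the downstream can absorb.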

ServerCallStreamObserver: Casting an outbound StreamObserver manually to ServerCallStreamObserver gives us access to isReady and setOnReadyHandler. With these methods we can check whether there is available capacity in the channel, i.e. whether we are safe to push into it. This can be used to signal demand to our upstream flow. Note that gRPC buffers 32 KiB of data per channel and isReady will return false only when this buffer gets full.
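The usual pattern built on these two methods is a drain loop: push while isReady is true, stop when it turns false, and resume from the on-ready callback. The sketch below shows that loop against a hypothetical `ReadyAwareObserver` interface that mirrors only the members discussed here, so it runs without the gRPC library. `BoundedFakeObserver` is a test double whose buffer is counted in messages, not bytes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the parts of ServerCallStreamObserver used here.
interface ReadyAwareObserver<T> {
    boolean isReady();
    void setOnReadyHandler(Runnable handler);
    void onNext(T value);
    void onCompleted();
}

final class ReadyDrivenWriter {
    // Pushes items from the source only while the observer reports capacity,
    // and resumes from the on-ready callback once capacity returns.
    static <T> void stream(Iterator<T> source, ReadyAwareObserver<T> observer) {
        boolean[] completed = {false};
        Runnable drain = () -> {
            while (observer.isReady() && source.hasNext()) {
                observer.onNext(source.next());
            }
            if (!source.hasNext() && !completed[0]) {
                completed[0] = true;
                observer.onCompleted();
            }
        };
        observer.setOnReadyHandler(drain);
        drain.run();
    }
}

// Test double with a fixed-size outbound buffer, counted in messages.
final class BoundedFakeObserver<T> implements ReadyAwareObserver<T> {
    final List<T> sent = new ArrayList<>();
    boolean completed = false;
    private final int capacity;
    private int buffered = 0;
    private Runnable onReady = () -> {};

    BoundedFakeObserver(int capacity) {
        this.capacity = capacity;
    }

    @Override public boolean isReady() { return buffered < capacity; }
    @Override public void setOnReadyHandler(Runnable handler) { this.onReady = handler; }
    @Override public void onNext(T value) { sent.add(value); buffered++; }
    @Override public void onCompleted() { completed = true; }

    // Simulates the transport flushing the buffer, which makes the observer ready again.
    void flush() {
        buffered = 0;
        onReady.run();
    }
}
```

With a capacity of 2 and five items in the source, the writer sends two items, pauses, and sends the rest only as `flush()` frees capacity. The source iterator is pulled only when the observer is ready, which is how the demand propagates upstream.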