Document Sandbox metrics (#6662)

* Document Sandbox metrics

Closes #5573

changelog_begin
changelog_end

* Selected a list of relevant metrics to document

* Provided a description of selected namespaces and metrics

* Alphabetical ordering

* Fix nested list

* Visual fixes, removing cache metrics

* No need to mock Metrics

* Update docs/source/tools/sandbox.rst

Co-authored-by: Peter Garmaz <40190826+pgarmaz-da@users.noreply.github.com>

* Adapts to recent change in metrics CLI option format

* Update docs/source/tools/sandbox.rst

Co-authored-by: Gerolf Seitz <gerolf.seitz@daml.com>

* Address https://github.com/digital-asset/daml/pull/6662#discussion_r455163515

* Document the jvm metrics namespace

* Document HikariCP metrics

* Fix title underline

Co-authored-by: Peter Garmaz <40190826+pgarmaz-da@users.noreply.github.com>
Co-authored-by: Gerolf Seitz <gerolf.seitz@daml.com>
This commit is contained in:
Stefano Baghino 2020-07-16 11:31:49 +02:00 committed by GitHub
parent bdcfb6c114
commit a4b666a7ee
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
5 changed files with 441 additions and 36 deletions

View File

@ -226,3 +226,416 @@ Command-line reference
To start Sandbox, run: ``sandbox [options] <archive>...``.
To see all the available options, run ``daml sandbox --help``.
Metrics
*******
Enable and configure reporting
==============================
To enable metrics and configure reporting, you can use the two following CLI options:
- ``--metrics-reporter``: passing a legal value will enable reporting; the accepted values
are ``console``, ``csv:</path/to/metrics.csv>`` and ``graphite:<local_server_port>``.
- ``console``: prints captured metrics on the standard output
- ``csv://</path/to/metrics.csv>``: saves the captured metrics in CSV format at the specified location
- ``graphite://<server_host>[:<server_port>]``: sends captured metrics to a Graphite server. If the port
is omitted, the default value ``2003`` will be used.
- ``--metrics-reporting-interval``: metrics are pre-aggregated on the sandbox and sent to
the reporter, this option allows the user to set the interval. The formats accepted are based
on the ISO-8601 duration format ``PnDTnHnMn.nS`` with days considered to be exactly 24 hours.
The default interval is 10 seconds.
Types of metrics
================
This is a list of type of metrics with all data points recorded for each.
Use this as a reference when reading the list of metrics.
Gauge
-----
An individual instantaneous measurement.
Counter
-------
Number of occurrences of some event.
Meter
-----
A meter tracks the number of times a given event occurred. The following data
points are kept and reported by any meter.
- ``<metric.qualified.name>.count``: number of registered data points overall
- ``<metric.qualified.name>.m1_rate``: number of registered data points per minute
- ``<metric.qualified.name>.m5_rate``: number of registered data points every 5 minutes
- ``<metric.qualified.name>.m15_rate``: number of registered data points every 15 minutes
- ``<metric.qualified.name>.mean_rate``: mean number of registered data points
Histogram
---------
An histogram records aggregated statistics about collections of events.
The exact meaning of the number depends on the metric (e.g. timers
are histograms about the time necessary to complete an operation).
- ``<metric.qualified.name>.mean``: arithmetic mean
- ``<metric.qualified.name>.stddev``: standard deviation
- ``<metric.qualified.name>.p50``: median
- ``<metric.qualified.name>.p75``: 75th percentile
- ``<metric.qualified.name>.p95``: 95th percentile
- ``<metric.qualified.name>.p98``: 98th percentile
- ``<metric.qualified.name>.p99``: 99th percentile
- ``<metric.qualified.name>.p999``: 99.9th percentile
- ``<metric.qualified.name>.min``: lowest registered value overall
- ``<metric.qualified.name>.max``: highest registered value overall
Histograms only keep a small *reservoir* of statistically relevant data points
to ensure that metrics collection can be reasonably accurate without being
too taxing resource-wise.
Unless mentioned otherwise all histograms (including timers, mentioned below)
use exponentially decaying reservoirs (i.e. the data is roughly relevant for
the last five minutes of recording) to ensure that recent and possibly
operationally relevant changes are visible through the metrics reporter.
Note that ``min`` and ``max`` values are not affected by the reservoir sampling policy.
You can read more about reservoir sampling and possible associated policies
in the `Dropwizard Metrics library documentation <https://metrics.dropwizard.io/4.1.2/manual/core.html#man-core-histograms/>`__.
Timers
------
A timer records all metrics registered by a meter and by an histogram, where
the histogram records the time necessary to execute a given operation (unless
otherwise specified, the precision is nanoseconds and the unit of measurement
is milliseconds).
Database Metrics
----------------
A "database metric" is a collection of simpler metrics that keep track of
relevant numbers when interacting with a persistent relational store.
These metrics are:
- ``<metric.qualified.name>.wait`` (timer): time to acquire a connection to the database
- ``<metric.qualified.name>.exec`` (timer): time to run the query and read the result
- ``<metric.qualified.name>.query`` (timer): time to run the query
- ``<metric.qualified.name>.commit`` (timer): time to perform the commit
- ``<metric.qualified.name>.translation`` (timer): if relevant, time necessary to turn serialized DAML-LF values into in-memory objects
List of metrics
===============
The following is a non-exhaustive list of selected metrics
that can be particularly important to track. Note that not
all the following metrics are available unless you run the
sandbox with a PostgreSQL backend.
``daml.commands.deduplicated_commands``
---------------------------------------
A meter. Number of deduplicated commands.
``daml.commands.delayed_submissions``
-------------------------------------
A meter. Number of delayed submissions (submission who have been
evaluated to transaction with a ledger time farther in
the future than the expected latency).
``daml.commands.failed_command_interpretation``
-----------------------------------------------
A meter. Number of commands that have been deemed unacceptable
by the interpreter and thus rejected (e.g. double spends)
``daml.commands.submissions``
-----------------------------
A timer. Time to fully process a submission (validation,
deduplication and interpretation) before it's handed over
to the ledger to be finalized (either committed or rejected).
``daml.commands.valid_submissions``
-----------------------------------
A meter. Number of submission that pass validation and are
further sent to deduplication and interpretation.
``daml.commands.validation``
----------------------------
A timer. Time to validate submitted commands before they are
fed to the DAML interpreter.
``daml.execution.get_lf_package``
---------------------------------
A timer. Time spent by the engine fetching the packages of compiled
DAML code necessary for interpretation.
``daml.execution.lookup_active_contract_count_per_execution``
-------------------------------------------------------------
A histogram. Number of active contracts fetched for each processed transaction.
``daml.execution.lookup_active_contract_per_execution``
-------------------------------------------------------
A timer. Time to fetch all active contracts necessary to process each transaction.
``daml.execution.lookup_active_contract``
-----------------------------------------
A timer. Time to fetch each individual active contract during interpretation.
``daml.execution.lookup_contract_key_count_per_execution``
----------------------------------------------------------
A histogram. Number of contract keys looked up for each processed transaction.
``daml.execution.lookup_contract_key_per_execution``
----------------------------------------------------
A timer. Time to lookup all contract keys necessary to process each transaction.
``daml.execution.lookup_contract_key``
--------------------------------------
A timer. Time to lookup each individual contract key during interpretation.
``daml.execution.retry``
------------------------
A meter. Overall number of interpretation retries attempted due to
mismatching ledger effective time.
``daml.execution.total``
------------------------
A timer. Time spent interpreting a valid command into a transaction
ready to be submitted to the ledger for finalization.
``daml.index.db.connection.sandbox.pool``
-----------------------------------------
This namespace holds a number of interesting metrics about the
connection pool used to communicate with the persistent store
that underlies the index.
These metrics include:
- ``daml.index.db.connection.sandbox.pool.Wait`` (timer): time spent waiting to acquire a connection
- ``daml.index.db.connection.sandbox.pool.Usage`` (histogram): time spent using each acquired connection
- ``daml.index.db.connection.sandbox.pool.TotalConnections`` (gauge): number or total connections
- ``daml.index.db.connection.sandbox.pool.IdleConnections`` (gauge): number of idle connections
- ``daml.index.db.connection.sandbox.pool.ActiveConnections`` (gauge): number of active connections
- ``daml.index.db.connection.sandbox.pool.PendingConnections`` (gauge): number of threads waiting for a connection
``daml.index.db.deduplicate_command``
-------------------------------------
A timer. Time spent persisting deduplication information to ensure the
continued working of the deduplication mechanism across restarts.
``daml.index.db.get_active_contracts``
--------------------------------------
A database metric. Time spent retrieving a page of active contracts to be
served from the active contract service. The page size is
configurable, please look at the CLI reference.
``daml.index.db.get_completions``
---------------------------------
A database metric. Time spent retrieving a page of command completions to be
served from the command completion service. The page size is
configurable, please look at the CLI reference.
``daml.index.db.get_flat_transactions``
---------------------------------------
A database metric. Time spent retrieving a page of flat transactions to be
streamed from the transaction service. The page size is
configurable, please look at the CLI reference.
``daml.index.db.get_ledger_end``
--------------------------------
A database metric. Time spent retrieving the current ledger end. The count for this metric is expected to
be very high and always increasing as the indexed is queried for the latest updates.
``daml.index.db.get_ledger_id``
-------------------------------
A database metric. Time spent retrieving the ledger identifier.
``daml.index.db.get_transaction_trees``
---------------------------------------
A database metric. Time spent retrieving a page of flat transactions to be
streamed from the transaction service. The page size is
configurable, please look at the CLI reference.
``daml.index.db.load_all_parties``
----------------------------------
A database metric. Load the currently allocated parties so that
they are served via the party management service.
``daml.index.db.load_archive``
------------------------------
A database metric. Time spent loading a package of compiled DAML code
so that it's given to the DAML interpreter when
needed.
``daml.index.db.load_configuration_entries``
--------------------------------------------
A database metric. Time to load the current entries in the log of
configuration entries. Used to verify whether a configuration
has been ultimately set.
``daml.index.db.load_package_entries``
--------------------------------------
A database metric. Time to load the current entries in the log of
package uploads. Used to verify whether a package
has been ultimately uploaded.
``daml.index.db.load_packages``
-------------------------------
A database metric. Load the currently uploaded packages so that
they are served via the package management service.
``daml.index.db.load_parties``
------------------------------
A database metric. Load the currently allocated parties so that
they are served via the party service.
``daml.index.db.load_party_entries``
------------------------------------
A database metric. Time to load the current entries in the log of
party allocations. Used to verify whether a party
has been ultimately allocated.
``daml.index.db.lookup_active_contract``
----------------------------------------
A database metric. Time to fetch one contract on the index to be used by
the DAML interpreter to evaluate a command into a
transaction.
``daml.index.db.lookup_configuration``
--------------------------------------
A database metric. Time to fetch the configuration so that it's
served via the configuration management service.
``daml.index.db.lookup_contract_by_key``
----------------------------------------
A database metric. Time to lookup one contract key on the index to be used by
the DAML interpreter to evaluate a command into a
transaction.
``daml.index.db.lookup_flat_transaction_by_id``
-----------------------------------------------
A database metric. Time to lookup a single flat transaction by identifier
to be served by the transaction service.
``daml.index.db.lookup_maximum_ledger_time``
--------------------------------------------
A database metric. Time spent looking up the ledger effective time of a
transaction as the maximum ledger time of all active
contracts involved to ensure causal monotonicity.
``daml.index.db.lookup_transaction_tree_by_id``
-----------------------------------------------
A database metric. Time to lookup a single transaction tree by identifier
to be served by the transaction service.
``daml.index.db.remove_expired_deduplication_data``
---------------------------------------------------
A database metric. Time spent removing deduplication information after the expiration
of the deduplication window. Deduplication information is persisted to
ensure the continued working of the deduplication mechanism across restarts.
``daml.index.db.stop_deduplicating_command``
--------------------------------------------
A database metric. Time spent removing deduplication information after the failure of a
command. Deduplication information is persisted to ensure the continued
working of the deduplication mechanism across restarts.
``daml.index.db.store_configuration_entry``
-------------------------------------------
A database metric. Time spent persisting a change in the ledger configuration
provided through the configuration management service.
``daml.index.db.store_ledger_entry``
------------------------------------
A database metric. Time spent persisting a transaction that has been
successfully interpreted and is final.
``daml.index.db.store_package_entry``
-------------------------------------
A database metric. Time spent storing a DAML package uploaded through
the package management service.
``daml.index.db.store_party_entry``
-----------------------------------
A database metric. Time spent storing party information as part of the
party allocation endpoint provided by the party
management service.
``daml.index.db.store_rejection``
---------------------------------
A database metric. Time spent persisting the information that a given
command has been rejected.
``daml.lapi``
-------------
Every metrics under this namespace is a timer, one for each
service exposed by the Ledger API, in the format:
``daml.lapi.service_name.service_endpoint``
As in the following example:
``daml.lapi.command_service.submit_and_wait``
Single call services return the time to serve the request,
streaming services measure the time to return the first response.
``jvm``
-------
Under the ``jvm`` namespace there is a collection of metrics that
tracks important measurements about the JVM that the sandbox is
running on, including CPU usage, memory consumption and the
current state of threads.

View File

@ -6,7 +6,7 @@ package com.daml.metrics
import com.codahale.metrics.MetricRegistry.MetricSupplier
import com.codahale.metrics._
class Metrics(val registry: MetricRegistry) {
final class Metrics(val registry: MetricRegistry) {
private def gauge[T](name: MetricName, metricSupplier: MetricSupplier[Gauge[_]]): Gauge[T] = {
registry.remove(name)
@ -65,9 +65,6 @@ class Metrics(val registry: MetricRegistry) {
// Timer (and count) of how fast submissions have been processed.
val runTimer: Timer = registry.timer(Prefix :+ "run_timer")
// Number of exceptions seen.
val exceptions: Counter = registry.counter(Prefix :+ "exceptions")
// Counter to monitor how many at a time and when kvutils is processing a submission.
val processing: Counter = registry.counter(Prefix :+ "processing")
@ -433,7 +430,7 @@ class Metrics(val registry: MetricRegistry) {
object services {
private val Prefix: MetricName = daml.Prefix :+ "services"
object indexService {
object index {
private val Prefix: MetricName = services.Prefix :+ "index"
val listLfPackages: Timer = registry.timer(Prefix :+ "list_lf_packages")

View File

@ -40,28 +40,28 @@ import scala.concurrent.Future
final class TimedIndexService(delegate: IndexService, metrics: Metrics) extends IndexService {
override def listLfPackages(): Future[Map[PackageId, v2.PackageDetails]] =
Timed.future(metrics.daml.services.indexService.listLfPackages, delegate.listLfPackages())
Timed.future(metrics.daml.services.index.listLfPackages, delegate.listLfPackages())
override def getLfArchive(packageId: PackageId): Future[Option[DamlLf.Archive]] =
Timed.future(metrics.daml.services.indexService.getLfArchive, delegate.getLfArchive(packageId))
Timed.future(metrics.daml.services.index.getLfArchive, delegate.getLfArchive(packageId))
override def getLfPackage(packageId: PackageId): Future[Option[Ast.Package]] =
Timed.future(metrics.daml.services.indexService.getLfPackage, delegate.getLfPackage(packageId))
Timed.future(metrics.daml.services.index.getLfPackage, delegate.getLfPackage(packageId))
override def packageEntries(
startExclusive: LedgerOffset.Absolute
): Source[domain.PackageEntry, NotUsed] =
Timed.source(
metrics.daml.services.indexService.packageEntries,
metrics.daml.services.index.packageEntries,
delegate.packageEntries(startExclusive))
override def getLedgerConfiguration(): Source[v2.LedgerConfiguration, NotUsed] =
Timed.source(
metrics.daml.services.indexService.getLedgerConfiguration,
metrics.daml.services.index.getLedgerConfiguration,
delegate.getLedgerConfiguration())
override def currentLedgerEnd(): Future[LedgerOffset.Absolute] =
Timed.future(metrics.daml.services.indexService.currentLedgerEnd, delegate.currentLedgerEnd())
Timed.future(metrics.daml.services.index.currentLedgerEnd, delegate.currentLedgerEnd())
override def getCompletions(
begin: domain.LedgerOffset,
@ -69,7 +69,7 @@ final class TimedIndexService(delegate: IndexService, metrics: Metrics) extends
parties: Set[Party]
): Source[CompletionStreamResponse, NotUsed] =
Timed.source(
metrics.daml.services.indexService.getCompletions,
metrics.daml.services.index.getCompletions,
delegate.getCompletions(begin, applicationId, parties))
override def transactions(
@ -79,7 +79,7 @@ final class TimedIndexService(delegate: IndexService, metrics: Metrics) extends
verbose: Boolean
): Source[GetTransactionsResponse, NotUsed] =
Timed.source(
metrics.daml.services.indexService.transactions,
metrics.daml.services.index.transactions,
delegate.transactions(begin, endAt, filter, verbose))
override def transactionTrees(
@ -89,7 +89,7 @@ final class TimedIndexService(delegate: IndexService, metrics: Metrics) extends
verbose: Boolean
): Source[GetTransactionTreesResponse, NotUsed] =
Timed.source(
metrics.daml.services.indexService.transactionTrees,
metrics.daml.services.index.transactionTrees,
delegate.transactionTrees(begin, endAt, filter, verbose))
override def getTransactionById(
@ -97,7 +97,7 @@ final class TimedIndexService(delegate: IndexService, metrics: Metrics) extends
requestingParties: Set[Party]
): Future[Option[GetFlatTransactionResponse]] =
Timed.future(
metrics.daml.services.indexService.getTransactionById,
metrics.daml.services.index.getTransactionById,
delegate.getTransactionById(transactionId, requestingParties))
override def getTransactionTreeById(
@ -105,7 +105,7 @@ final class TimedIndexService(delegate: IndexService, metrics: Metrics) extends
requestingParties: Set[Party]
): Future[Option[GetTransactionResponse]] =
Timed.future(
metrics.daml.services.indexService.getTransactionTreeById,
metrics.daml.services.index.getTransactionTreeById,
delegate.getTransactionTreeById(transactionId, requestingParties))
override def getActiveContracts(
@ -113,7 +113,7 @@ final class TimedIndexService(delegate: IndexService, metrics: Metrics) extends
verbose: Boolean,
): Source[GetActiveContractsResponse, NotUsed] =
Timed.source(
metrics.daml.services.indexService.getActiveContracts,
metrics.daml.services.index.getActiveContracts,
delegate.getActiveContracts(filter, verbose))
override def lookupActiveContract(
@ -121,7 +121,7 @@ final class TimedIndexService(delegate: IndexService, metrics: Metrics) extends
contractId: Value.ContractId
): Future[Option[Value.ContractInst[Value.VersionedValue[Value.ContractId]]]] =
Timed.future(
metrics.daml.services.indexService.lookupActiveContract,
metrics.daml.services.index.lookupActiveContract,
delegate.lookupActiveContract(submitter, contractId))
override def lookupContractKey(
@ -129,45 +129,41 @@ final class TimedIndexService(delegate: IndexService, metrics: Metrics) extends
key: Node.GlobalKey
): Future[Option[Value.ContractId]] =
Timed.future(
metrics.daml.services.indexService.lookupContractKey,
metrics.daml.services.index.lookupContractKey,
delegate.lookupContractKey(submitter, key))
override def lookupMaximumLedgerTime(
ids: Set[Value.ContractId],
): Future[Option[Instant]] =
Timed.future(
metrics.daml.services.indexService.lookupMaximumLedgerTime,
metrics.daml.services.index.lookupMaximumLedgerTime,
delegate.lookupMaximumLedgerTime(ids))
override def getLedgerId(): Future[LedgerId] =
Timed.future(metrics.daml.services.indexService.getLedgerId, delegate.getLedgerId())
Timed.future(metrics.daml.services.index.getLedgerId, delegate.getLedgerId())
override def getParticipantId(): Future[ParticipantId] =
Timed.future(metrics.daml.services.indexService.getParticipantId, delegate.getParticipantId())
Timed.future(metrics.daml.services.index.getParticipantId, delegate.getParticipantId())
override def getParties(parties: Seq[Party]): Future[List[domain.PartyDetails]] =
Timed.future(metrics.daml.services.indexService.getParties, delegate.getParties(parties))
Timed.future(metrics.daml.services.index.getParties, delegate.getParties(parties))
override def listKnownParties(): Future[List[domain.PartyDetails]] =
Timed.future(metrics.daml.services.indexService.listKnownParties, delegate.listKnownParties())
Timed.future(metrics.daml.services.index.listKnownParties, delegate.listKnownParties())
override def partyEntries(
startExclusive: LedgerOffset.Absolute
): Source[domain.PartyEntry, NotUsed] =
Timed.source(
metrics.daml.services.indexService.partyEntries,
delegate.partyEntries(startExclusive))
Timed.source(metrics.daml.services.index.partyEntries, delegate.partyEntries(startExclusive))
override def lookupConfiguration(): Future[Option[(LedgerOffset.Absolute, Configuration)]] =
Timed.future(
metrics.daml.services.indexService.lookupConfiguration,
delegate.lookupConfiguration())
Timed.future(metrics.daml.services.index.lookupConfiguration, delegate.lookupConfiguration())
override def configurationEntries(
startExclusive: Option[LedgerOffset.Absolute]
): Source[(LedgerOffset.Absolute, ConfigurationEntry), NotUsed] =
Timed.source(
metrics.daml.services.indexService.configurationEntries,
metrics.daml.services.index.configurationEntries,
delegate.configurationEntries(startExclusive))
override def deduplicateCommand(
@ -177,7 +173,7 @@ final class TimedIndexService(delegate: IndexService, metrics: Metrics) extends
deduplicateUntil: Instant
): Future[v2.CommandDeduplicationResult] =
Timed.future(
metrics.daml.services.indexService.deduplicateCommand,
metrics.daml.services.index.deduplicateCommand,
delegate.deduplicateCommand(commandId, submitter, submittedAt, deduplicateUntil))
override def stopDeduplicatingCommand(
@ -185,7 +181,7 @@ final class TimedIndexService(delegate: IndexService, metrics: Metrics) extends
submitter: Ref.Party,
): Future[Unit] =
Timed.future(
metrics.daml.services.indexService.stopDeduplicateCommand,
metrics.daml.services.index.stopDeduplicateCommand,
delegate.stopDeduplicatingCommand(commandId, submitter))
override def currentHealth(): HealthStatus =

View File

@ -256,12 +256,10 @@ object Cli {
opt[MetricsReporter]("metrics-reporter")
.optional()
.action((reporter, config) => config.copy(metricsReporter = Some(reporter)))
.hidden()
opt[Duration]("metrics-reporting-interval")
.optional()
.action((interval, config) => config.copy(metricsReportingInterval = interval))
.hidden()
opt[Int]("max-commands-in-flight")
.optional()

View File

@ -3,6 +3,7 @@
package com.daml.platform.apiserver.execution
import com.codahale.metrics.MetricRegistry
import com.daml.ledger.api.domain.Commands
import com.daml.ledger.participant.state.index.v2.{ContractStore, IndexPackagesService}
import com.daml.lf.crypto.Hash
@ -42,7 +43,7 @@ class StoreBackedCommandExecutorSpec extends AsyncWordSpec with MockitoSugar wit
Ref.ParticipantId.assertFromString("anId"),
mock[IndexPackagesService],
mock[ContractStore],
mock[Metrics])
new Metrics(new MetricRegistry))
val mockDomainCommands = mock[Commands]
val mockLfCommands = mock[com.daml.lf.command.Commands]
when(mockLfCommands.ledgerEffectiveTime).thenReturn(Time.Timestamp.now())