sample FS trace events for logging

Summary:
The current fb303 counters only report aggregated latency while we want to track Eden performance under different version, os, channel, and configs. So I am setting up a new logging mechanism for this purpose.

This diff introduces the class `FsEventLogger` for sampling and logging. There are 3 configs introduced by this diff. The configs are reloaded every 30 minutes.
1. `telemetry:request-sampling-config-allowlist`
A list of config keys that we want to attach to scuba events.

2. `telemetry:request-samples-per-minute`
Max number of events logged to scuba per minute per mount.

3. `telemetry:request-sampling-group-denominators`
* Each type of operation has a "sampling group" (defaulted to 0, which is dropping all).
* We use this sampling group as index to look up its denominator in this config.
* The denominator is then used for sampling. e.g. `1/x` of the events are send to scuba, if we haven't reached the cap specified by #2.

Example workflow:
1. receive tracing event
2. look up denominator of the sampling group of the operation type
3. sample based on the denominator
4. check that we have not exceeded the logging cap per min
5. create sample and send to scribe

Reviewed By: xavierd

Differential Revision: D30288054

fbshipit-source-id: 8f2b95c11c718550a8162f4d1259a25628f499ff
This commit is contained in:
Zhengchao Liu 2021-08-19 14:20:37 -07:00 committed by Facebook GitHub Bot
parent 025929ce12
commit 432d674649
10 changed files with 281 additions and 17 deletions

View File

@ -529,6 +529,60 @@ class EdenConfig : private ConfigSettingManager {
std::nullopt,
this};
/**
* Controls sample denominator for each request sampling group.
* We assign request types into sampling groups based on their usage and
* set a sample denominator for each sampling group so that we have the
* flexibility of up/down-sampling different requests but also avoid having to
* set a sampling rate for each of the dozens of request types. For example,
* `mkdir` and `rmdir` can be assigned to a sampling group that have a high
* sampling rate while `getattr` and `getxattr` to another sampling group with
* low sampling rate as they happen very frequently.
*
* Sampling rates are calculated from sampling denominators. A denominator of
* 0 indicates dropping all requests in the group. Group 0's value is ignored
* as it's always considered as having denominator of 0. A positive
* denominator means that the requests in the group are sampled at 1/x (so
* denominator of 1 drops no events).
*
* We use sampling group as indexes into this vector to look
* up their denominators. Thus, the size of this vector should match the
* number of sampling groups defined by the enum `SamplingGroup`. If the
* vector has fewer elements than the number of sampling groups, look-ups will
* fail for the higher sampling groups and we will consider them having
* denominator of 0. For example, if the vector has size of 3, all requests of
* sampling group 4 will be dropped.
* Keeping this vector in ascending order is recommended but not required.
* e.g. {0, 10, 100, 1000, 10000}
*/
ConfigSetting<std::vector<uint32_t>> requestSamplingGroupDenominators{
"telemetry:request-sampling-group-denominators",
std::vector<uint32_t>{0, 0, 0, 0, 0},
this};
/**
* Controls the max number of requests per minute per mount that can be sent
* for logging.
* A request is first sampled based on its sampling group denominators. Then
* if we have not reached this cap, the request is sent for logging.
*/
ConfigSetting<uint32_t> requestSamplesPerMinute{
"telemetry:request-samples-per-minute",
0,
this};
/**
* Controls which configs we want to send with the request logging.
* The elements are full config keys, e.g. "hg:import-batch-size".
* Elements not valid or not present in the config map are silently ignored.
* This is only intended for facilitating A/B testing and should be empty if
* there is no active experiment.
*/
ConfigSetting<std::vector<std::string>> requestSamplingConfigAllowlist{
"telemetry:request-sampling-config-allowlist",
std::vector<std::string>{},
this};
// [experimental]
/**

View File

@ -16,10 +16,12 @@
#include <folly/logging/xlog.h>
#include <folly/system/ThreadName.h>
#include <signal.h>
#include <chrono>
#include <type_traits>
#include "eden/fs/fuse/DirList.h"
#include "eden/fs/fuse/FuseDispatcher.h"
#include "eden/fs/fuse/FuseRequestContext.h"
#include "eden/fs/telemetry/FsEventLogger.h"
#include "eden/fs/utils/Bug.h"
#include "eden/fs/utils/IDGen.h"
#include "eden/fs/utils/Synchronized.h"
@ -234,8 +236,14 @@ struct HandlerEntry {
Handler h,
FuseArgRenderer r,
ChannelThreadStats::StatPtr s,
AccessType at = AccessType::FsChannelOther)
: name{n}, handler{h}, argRenderer{r}, stat{s}, accessType{at} {}
AccessType at = AccessType::FsChannelOther,
SamplingGroup samplingGroup = SamplingGroup::DropAll)
: name{n},
handler{h},
argRenderer{r},
stat{s},
samplingGroup{samplingGroup},
accessType{at} {}
std::string getShortName() const {
if (name.startsWith("FUSE_")) {
@ -260,6 +268,7 @@ struct HandlerEntry {
Handler handler = nullptr;
FuseArgRenderer argRenderer = nullptr;
ChannelThreadStats::StatPtr stat = nullptr;
SamplingGroup samplingGroup = SamplingGroup::DropAll;
AccessType accessType = AccessType::FsChannelOther;
};
@ -351,13 +360,15 @@ constexpr auto kFuseHandlers = [] {
&FuseChannel::fuseRead,
&argrender::read,
&ChannelThreadStats::read,
Read};
Read,
SamplingGroup::Three};
handlers[FUSE_WRITE] = {
"FUSE_WRITE",
&FuseChannel::fuseWrite,
&argrender::write,
&ChannelThreadStats::write,
Write};
Write,
SamplingGroup::Four};
handlers[FUSE_STATFS] = {
"FUSE_STATFS",
&FuseChannel::fuseStatFs,
@ -415,7 +426,8 @@ constexpr auto kFuseHandlers = [] {
&FuseChannel::fuseReadDir,
&argrender::readdir,
&ChannelThreadStats::readdir,
Read};
Read,
SamplingGroup::Two};
handlers[FUSE_RELEASEDIR] = {
"FUSE_RELEASEDIR",
&FuseChannel::fuseReleaseDir,
@ -590,6 +602,11 @@ iovec make_iovec(const T& t) {
return iov;
}
SamplingGroup fuseOpcodeSamplingGroup(uint32_t opcode) {
auto* entry = lookupFuseHandlerEntry(opcode);
return entry ? entry->samplingGroup : SamplingGroup::DropAll;
}
} // namespace
StringPiece fuseOpcodeName(uint32_t opcode) {
@ -779,6 +796,7 @@ FuseChannel::FuseChannel(
std::unique_ptr<FuseDispatcher> dispatcher,
const folly::Logger* straceLogger,
std::shared_ptr<ProcessNameCache> processNameCache,
std::shared_ptr<FsEventLogger> fsEventLogger,
folly::Duration requestTimeout,
Notifications* notifications,
CaseSensitivity caseSensitive,
@ -804,20 +822,43 @@ FuseChannel::FuseChannel(
installSignalHandler();
traceSubscriptionHandles_.push_back(traceBus_->subscribeFunction(
"FuseChannel request tracking", [this](const FuseTraceEvent& event) {
"FuseChannel request tracking",
[this,
fsEventLogger = std::move(fsEventLogger)](const FuseTraceEvent& event) {
switch (event.getType()) {
case FuseTraceEvent::START: {
auto state = telemetryState_.wlock();
auto [iter, inserted] = state->requests.emplace(
event.getUnique(),
OutstandingRequest{event.getUnique(), event.getRequest()});
OutstandingRequest{
event.getUnique(),
event.getRequest(),
event.monotonicTime});
XCHECK(inserted) << "duplicate fuse start event";
break;
}
case FuseTraceEvent::FINISH: {
auto state = telemetryState_.wlock();
auto erased = state->requests.erase(event.getUnique());
XCHECK(erased) << "duplicate fuse finish event";
uint64_t durationNs = 0;
{
auto state = telemetryState_.wlock();
auto it = state->requests.find(event.getUnique());
XCHECK(it != state->requests.end())
<< "duplicate fuse finish event";
durationNs =
std::chrono::duration_cast<std::chrono::nanoseconds>(
event.monotonicTime - it->second.requestStartTime)
.count();
state->requests.erase(it);
}
if (fsEventLogger) {
auto opcode = event.getRequest().opcode;
fsEventLogger->log({
durationNs,
fuseOpcodeSamplingGroup(opcode),
fuseOpcodeName(opcode),
});
}
break;
}
}

View File

@ -41,6 +41,7 @@ struct Unit;
namespace facebook::eden {
class Notifications;
class FsEventLogger;
class FuseRequestContext;
using TraceDetailedArgumentsHandle = std::shared_ptr<void>;
@ -205,6 +206,7 @@ class FuseChannel {
struct OutstandingRequest {
uint64_t unique;
FuseTraceEvent::RequestHeader request;
std::chrono::steady_clock::time_point requestStartTime;
};
/**
@ -225,6 +227,7 @@ class FuseChannel {
std::unique_ptr<FuseDispatcher> dispatcher,
const folly::Logger* straceLogger,
std::shared_ptr<ProcessNameCache> processNameCache,
std::shared_ptr<FsEventLogger> fsEventLogger,
folly::Duration requestTimeout,
Notifications* FOLLY_NULLABLE notifications,
CaseSensitivity caseSensitive,

View File

@ -133,11 +133,12 @@ int main(int argc, char** argv) {
std::move(dispatcher),
&straceLogger,
std::make_shared<ProcessNameCache>(),
/*fsEventLogger=*/nullptr,
std::chrono::seconds(60),
nullptr,
/*notifications=*/nullptr,
CaseSensitivity::Sensitive,
true,
12 /* the default on Linux */));
/*requireUtf8Path=*/true,
/*maximumBackgroundRequests=*/12 /* the default on Linux */));
XLOG(INFO) << "Starting FUSE...";
auto completionFuture = channel->initialize().get();

View File

@ -76,11 +76,12 @@ class FuseChannelTest : public ::testing::Test {
std::move(testDispatcher),
&straceLogger,
std::make_shared<ProcessNameCache>(),
/*fsEventLogger=*/nullptr,
std::chrono::seconds(60),
nullptr,
/*notifications=*/nullptr,
CaseSensitivity::Sensitive,
true,
12));
/*requireUtf8Path=*/true,
/*maximumBackgroundRequests=*/12));
}
FuseChannel::StopFuture performInit(

View File

@ -1381,6 +1381,7 @@ std::unique_ptr<FuseChannel, FuseChannelDeleter> makeFuseChannel(
EdenDispatcherFactory::makeFuseDispatcher(mount),
&mount->getStraceLogger(),
mount->getServerState()->getProcessNameCache(),
mount->getServerState()->getFsEventLogger(),
std::chrono::duration_cast<folly::Duration>(
edenConfig->fuseRequestTimeout.getValue()),
mount->getServerState()->getNotifications(),

View File

@ -12,6 +12,7 @@
#include "eden/fs/config/EdenConfig.h"
#include "eden/fs/model/git/TopLevelIgnores.h"
#include "eden/fs/telemetry/FsEventLogger.h"
#include "eden/fs/utils/Clock.h"
#include "eden/fs/utils/FaultInjector.h"
#include "eden/fs/utils/UnboundedQueueExecutor.h"
@ -22,6 +23,14 @@ DEFINE_bool(
"Block mount attempts via the fault injection framework. "
"Requires --enable_fault_injection.");
namespace {
#if defined(EDEN_HAVE_HIVE_LOGGER)
constexpr auto kHasHiveLogger = true;
#else
constexpr auto kHasHiveLogger = false;
#endif
} // namespace
namespace facebook {
namespace eden {
@ -58,7 +67,11 @@ ServerState::ServerState(
systemIgnoreFileMonitor_{CachedParsedFileMonitor<GitIgnoreFileParser>{
edenConfig->systemIgnoreFile.getValue(),
kSystemIgnoreMinPollSeconds}},
notifications_(config_) {
notifications_(config_),
fsEventLogger_{
(kHasHiveLogger && edenConfig->requestSamplesPerMinute.getValue())
? std::make_shared<FsEventLogger>(config_, hiveLogger_)
: nullptr} {
// It would be nice if we eventually built a more generic mechanism for
// defining faults to be configured on start up. (e.g., loading this from the
// EdenConfig).

View File

@ -27,6 +27,7 @@ class Clock;
class EdenConfig;
class FaultInjector;
class IHiveLogger;
class FsEventLogger;
class ProcessNameCache;
class StructuredLogger;
class TopLevelIgnores;
@ -155,6 +156,15 @@ class ServerState {
return hiveLogger_.get();
}
/**
* Returns a pointer to the FsEventLogger for logging FS event samples, if the
* platform supports it. Otherwise, returns nullptr. The caller is responsible
* for null checking.
*/
std::shared_ptr<FsEventLogger> getFsEventLogger() const {
return fsEventLogger_;
}
FaultInjector& getFaultInjector() {
return *faultInjector_;
}
@ -182,6 +192,7 @@ class ServerState {
folly::Synchronized<CachedParsedFileMonitor<GitIgnoreFileParser>>
systemIgnoreFileMonitor_;
Notifications notifications_;
std::shared_ptr<FsEventLogger> fsEventLogger_;
};
} // namespace eden
} // namespace facebook

View File

@ -0,0 +1,81 @@
/*
* Copyright (c) Facebook, Inc. and its affiliates.
*
* This software may be used and distributed according to the terms of the
* GNU General Public License version 2.
*/
#include "eden/fs/telemetry/FsEventLogger.h"
#include <folly/Random.h>
#include "eden/fs/config/EdenConfig.h"
#include "eden/fs/config/ReloadableConfig.h"
namespace facebook::eden {
namespace {
constexpr auto kConfigsStringRefreshInterval = std::chrono::minutes(30);
std::string getConfigsString(std::shared_ptr<const EdenConfig>) {
// TODO: get config string
return {};
}
} // namespace
FsEventLogger::FsEventLogger(
ReloadableConfig& edenConfig,
std::shared_ptr<IHiveLogger> logger)
: edenConfig_{edenConfig},
logger_{std::move(logger)},
counterStartTime_{std::chrono::steady_clock::now()},
configsString_{getConfigsString(edenConfig_.getEdenConfig())},
configsStringUpdateTime_{std::chrono::steady_clock::now()} {}
void FsEventLogger::log(Event event) {
if (event.samplingGroup == SamplingGroup::DropAll) {
return;
}
auto config = edenConfig_.getEdenConfig(ConfigReloadBehavior::NoReload);
const auto& denominators =
config->requestSamplingGroupDenominators.getValue();
auto samplingGroup = folly::to_underlying(event.samplingGroup);
if (samplingGroup > denominators.size()) {
// sampling group does not exist
return;
}
if (auto sampleDenominator = denominators.at(samplingGroup);
sampleDenominator && 0 != folly::Random::rand32(sampleDenominator)) {
// failed sampling
return;
}
// Multiple threads could enter the branches at the same time
// resulting in samplesCount_ undercounting, but this should rarely happen
// given the sampling above.
auto now = std::chrono::steady_clock::now();
if ((now - counterStartTime_.load()) > std::chrono::minutes(1)) {
// reset counter for this minute
counterStartTime_.store(now);
samplesCount_.store(1);
} else if (
samplesCount_.load() < config->requestSamplesPerMinute.getValue()) {
// not throttled so bump counter
samplesCount_.fetch_add(1, std::memory_order_relaxed);
} else {
// throttled
return;
}
if ((now - configsStringUpdateTime_.load()) > kConfigsStringRefreshInterval) {
configsStringUpdateTime_.store(now);
*configsString_.wlock() = getConfigsString(edenConfig_.getEdenConfig());
}
// TODO: log
}
} // namespace facebook::eden

View File

@ -0,0 +1,58 @@
/*
* Copyright (c) Facebook, Inc. and its affiliates.
*
* This software may be used and distributed according to the terms of the
* GNU General Public License version 2.
*/
#pragma once
#include <atomic>
#include <chrono>
#include <vector>
#include <folly/Synchronized.h>
#include "folly/Range.h"
namespace facebook::eden {
// `telemetry:request-sampling-group-denominators` should be
// maintained in ascending order so that the higher the sampling group
// the higher the sampling rate.
enum class SamplingGroup : uint32_t {
DropAll = 0,
One = 1,
Two = 2,
Three = 3,
Four = 4,
Five = 5,
};
class ReloadableConfig;
class IHiveLogger;
class FsEventLogger {
public:
struct Event {
uint64_t durationNs;
SamplingGroup samplingGroup;
folly::StringPiece cause;
};
FsEventLogger(
ReloadableConfig& edenConfig,
std::shared_ptr<IHiveLogger> logger);
void log(Event event);
private:
ReloadableConfig& edenConfig_;
FOLLY_MAYBE_UNUSED std::shared_ptr<IHiveLogger> logger_;
std::atomic<uint32_t> samplesCount_{0};
std::atomic<std::chrono::steady_clock::time_point> counterStartTime_;
folly::Synchronized<std::string> configsString_;
std::atomic<std::chrono::steady_clock::time_point> configsStringUpdateTime_;
};
} // namespace facebook::eden