Conditional Distributed Tracing

A somewhat delayed post: I presented Conditional Distributed Tracing at ollycon in June. This is a rough transcript of the slides and ideas in blog format.

How Well Does Tracing Debug Microservices?

Distributed tracing in the Dapper model has a well-defined data model, and a "trace view" that attempts to show an entire trace on a single web page. This makes the trace easy to conceptualize, but ironically developers have trouble finding and correlating the spans inside a trace when resolving internal bugs.

Matt Klein starts off the discussion in a thread.

Cindy Sridharan references the Matt Klein thread and follows up in Distributed Tracing: We've Been Doing It Wrong:

"Being able to quickly and cheaply test hypotheses and refine one’s mental model accordingly is the cornerstone of debugging. Any tool that aims to assist in the process of debugging needs to be an interactive tool that helps to either whittle down the search space or in the case of a red herring, help the user backtrack and refocus on a different area of the system."

"Instead of seeing an entire trace, what I really want to be seeing is a portion of the trace where something interesting or unusual is happening."

And Abby Bangser discusses what she wants from an observability tool.

Debugging Microservices

Part of the issue comes from the definition of debugging. Software engineers (SWEs) debug code, and look for data that correlates to the underlying software; SWEs inherently think in terms of logging and debugging statements. Site Reliability Engineers (SREs) debug incidents that may appear across the entire environment, and care about the overall health of operations.

From Debugging Incidents in Google's Distributed Systems:

"SWEs are more likely to consult logs earlier in their debugging workflow, where they look for errors that could indicate where a failure occurred."

"SREs rely on a more generic approach to debugging: Because SREs are often on call for multiple services, they apply a general approach to debugging based on known characteristics of their system(s). They look for common failure patterns across service health metrics (for example, errors and latency for requests) to isolate where the issue is happening, and often dig into logs only if they're still uncertain about the best mitigation strategy."

Distributed tracing is mostly oriented toward SREs. Most spans in a trace are added around the edges of a service, typically through automatic instrumentation; SWEs rarely add manual custom spans inside the service.

When developers do add custom spans, they typically do so according to their own mental model, following a simple rule: When In Doubt, Add More Spans.

This is because to a developer, a span is semantically equivalent to printf debugging.

And developers love printfs. From On the Dichotomy of Debugging Behavior Among Programmers:

"[D]evelopers were well-informed about printf debugging and that it is a conscious choice if they employ it, often the beginning of a longer debugging process. Interviewees praised printf as a universal tool that one can always resort back to. […] While frequently used, developers are also aware of its shortcomings, saying that “you are half-way toward either telemetry or toward tracing”."

There's a long history of developers keeping log-based debug statements in production executables for diagnostic purposes. Developers have a method for debugging and resolving actual vs. expected behavior:

  • Create a pool of data from statements: state dumps, branches, and state transitions.
  • Fish around in that pool with various hooks and queries.
  • Keep the most useful statements around as diagnostic logging for later.

In short, exploratory debugging involves "drill-down" fine-grained detailed spans. This works with logs because debug logging is not enabled by default in production, and logging can be selectively enabled. But spans are printfs: there is no priority system, and no way to filter spans so that a user is not overwhelmed with debugging spans.
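The asymmetry can be sketched in a few lines of Java. Everything here (the LevelGate class, its Level enum) is illustrative rather than any real logging or tracing API: a log call is gated by a level check, while a span is recorded unconditionally once the code path runs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: level-gated logging vs. ungated span creation.
public class LevelGate {
    enum Level { DEBUG, INFO }

    static Level threshold = Level.INFO;          // typical production default
    static List<String> emitted = new ArrayList<>();

    // Log statements respect a priority: DEBUG statements stay in the
    // binary but are skipped unless the threshold is lowered.
    static void log(Level level, String msg) {
        if (level.ordinal() >= threshold.ordinal()) {
            emitted.add(msg);
        }
    }

    // Spans, by contrast, have no priority: once the code path runs,
    // the span is recorded.
    static void startSpan(String name) {
        emitted.add("span:" + name);
    }

    public static void main(String[] args) {
        log(Level.DEBUG, "state dump");   // suppressed in production
        startSpan("state-dump");          // always emitted
        System.out.println(emitted);      // prints [span:state-dump]
    }
}
```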

What we'd really like is a system that could intelligently filter and display a trace view so that spans could be "folded up" as necessary. We don't have systems that do that currently, but we can do the next best thing – we can fold up spans ourselves.

Conditional Span Builder

The solution here is to let the application determine which spans it constructs. We can do this by adding a scripting engine to the OpenTelemetry SDK, and creating spans based on dynamic script evaluation. If you've been following the blog, I've taken the code from Dynamic Logging with Conditions and plugged it into the tracing SDK.

When the SDK calls startSpan, the application runs the script evaluation against the inputs to the span:

boolean shouldStartSpan(String spanName, Context parent,
                        HashMap<AttributeKey<?>, Object> attributes) {
    return <by something here>;
}

and following that, we can render by debug attribute:

boolean byDebugAttribute(String spanName, HashMap<AttributeKey<?>, Object> attributes) {
    // println("byDebugAttribute: instance = $spanName, attributes = $attributes")
    if (attributes.containsKey(booleanKey("DEBUG"))) {
        println("byDebugAttribute: ignoring debug span $spanName")
        return false;
    } else {
        return true;
    }
}
or even by feature flag:

boolean byFeatureFlag(String spanName) {
    //println("byFeatureFlag: spanName = $spanName")
    Context context = Context.current();
    LDClient ldClient = context.get(Main.LD_CLIENT_KEY);
    LDUser user = context.get(Main.LD_USER_KEY);

    if (ldClient.boolVariation("testflag", user, false)) {
        return true;
    } else {
        println("byFeatureFlag: ignoring $spanName for $user")
        return false;
    }
}
This functionality itself can be changed on the fly. You can change the script, or you can disable script evaluation completely and revert to the out-of-the-box behavior of OpenTelemetry.
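One way such hot-swapping might be implemented, as a sketch: hold the current condition in an AtomicReference that the SDK hook reads on every call, where null means "scripting disabled, use default behavior". The ScriptSlot class and its names are assumptions for illustration, not SDK API.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

// Illustrative sketch: swap or disable the evaluation logic at runtime.
public class ScriptSlot {
    // null means "script evaluation disabled": fall back to default behavior.
    public static final AtomicReference<Predicate<String>> SCRIPT =
        new AtomicReference<>();

    public static boolean shouldStartSpan(String spanName) {
        Predicate<String> current = SCRIPT.get();
        return current == null || current.test(spanName);  // default: always start
    }

    public static void main(String[] args) {
        System.out.println(shouldStartSpan("anySpan"));     // prints true: no script installed
        SCRIPT.set(name -> !name.startsWith("debug-"));     // install a condition on the fly
        System.out.println(shouldStartSpan("debug-dump"));  // prints false
        SCRIPT.set(null);                                   // revert to out-of-the-box behavior
        System.out.println(shouldStartSpan("debug-dump"));  // prints true again
    }
}
```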

The Java Scripting API compiles the script to Java bytecode, so the condition runs as "native Java". You are not limited to Java, either: scripting is entirely possible in the Go and Rust SDK implementations, and LuaJIT, for example, has some great optimization features.
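As a sketch of the Java Scripting API (JSR 223) in this role, the snippet below compiles a condition once via Compilable and evaluates it against span names. The script text and class names are illustrative, and the engine lookup is environment-dependent (Nashorn, GraalJS, Groovy, ...), so this falls back to default behavior when no JavaScript engine is installed.

```java
import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.SimpleBindings;

// Illustrative sketch: compile a condition script once, reuse per span.
public class ScriptedCondition {
    public static boolean evaluate(String spanName) {
        try {
            ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
            if (!(engine instanceof Compilable)) {
                return true;  // no compilable engine installed: default behavior
            }
            // Compile once; a real SDK hook would cache the CompiledScript.
            CompiledScript script =
                ((Compilable) engine).compile("!spanName.startsWith('debug-')");
            SimpleBindings bindings = new SimpleBindings();
            bindings.put("spanName", spanName);
            return Boolean.TRUE.equals(script.eval(bindings));
        } catch (Exception e) {
            return true;  // a script failure should not suppress tracing
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("checkout"));  // prints true
    }
}
```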


I think this will eventually be built-in SDK behavior, and that a scripting language will become part of all telemetry operations. As systems become more complex and downtime becomes less acceptable, there will be an expectation that telemetry is dynamic and adjusts with the environment.

We're already seeing this with eBPF tracing tools and the Linux tracing systems; it's just a matter of time before it becomes a standard requirement in applications.

Trace Analytics

Adding application hooks does not negate the need for a more advanced tracing system, one that can provide richer interaction with traces than the trace view. I don't think this will be a fancier trace view, though. I think logs, traces, and metrics will all become sources for exploratory data analysis through a notebook interface like Observable or Kaggle, with additional visualizations such as heat maps, box plots, flame graphs, context calling trees, and Sankey diagrams all as options.

There are some great starting points from an analytics perspective: Dan Luu's summary of Rebecca Isaacs' work on tracing analytics, the work that Slack has done in causal graphs using SQL, and Daniela Miao's talk on Tracing is More Than Traces: The Insights in Trace Aggregates.