Somewhat delayed post, but I presented Conditional Distributed Tracing at o11ycon in June. This is a rough transcript of the slides and ideas in blog format.
How Well Does Tracing Debug Microservices?
Distributed tracing in the Dapper model has a well-defined structure, and a "trace view" that attempts to show an entire trace on a single web page. This makes a trace easy to conceptualize, but ironically developers have trouble finding relevant traces and correlating the spans inside them to resolve internal bugs.
Matt Klein starts off:
Sure, looking at a pretty trace page is fun, but how does a user find a trace that is relevant to their debugging needs? The honest answer is that they often can't or are confused by the UI and give up.— Matt Klein (@mattklein123) October 10, 2018
The reality for most companies is that once deployed, tracing systems get very little real use. Why?— Matt Klein (@mattklein123) October 10, 2018
Primarily because developers are used to logging and stats, but also because it is simply too hard to find relevant traces, especially when sampling, with existing tooling.— Matt Klein (@mattklein123) October 10, 2018
"Being able to quickly and cheaply test hypotheses and refine one’s mental model accordingly is the cornerstone of debugging. Any tool that aims to assist in the process of debugging needs to be an interactive tool that helps to either whittle down the search space or in the case of a red herring, help the user backtrack and refocus on a different area of the system."
"Instead of seeing an entire trace, what I really want to be seeing is a portion of the trace where something interesting or unusual is happening."
And Abby Bangser discusses what she wants from an observability tool:
I'll take a stab @mipsytipsy and you can correct me😉— Abby Bangser (@a_bangser) October 22, 2020
A key to observability is BOTH being able to get a high level view AND the detailed debugging. Therefore, if you need to use two different tools for those two different views, each is not independently an observability tool.
Part of the issue comes from the definition of debugging. Software engineers (SWEs) debug code, and look for data that correlates to the underlying software; they inherently think in terms of logging and debug statements. Site Reliability Engineers (SREs) debug incidents that may appear across the entire environment, and care about the overall health of operations.
"SWEs are more likely to consult logs earlier in their debugging workflow, where they look for errors that could indicate where a failure occurred."
"SREs rely on a more generic approach to debugging: Because SREs are often on call for multiple services, they apply a general approach to debugging based on known characteristics of their system(s). They look for common failure patterns across service health metrics (for example, errors and latency for requests) to isolate where the issue is happening, and often dig into logs only if they're still uncertain about the best mitigation strategy."
Distributed tracing is mostly oriented toward SREs. Most spans in a trace are added around the edges of a service, typically through automatic instrumentation. SWEs typically do not add manual custom spans inside the service.
When developers do add custom spans, they typically add them according to their mental model of the code, following a simple rule: When In Doubt, Add More Spans.
This is because to a developer, a span is semantically equivalent to printf debugging.
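To make the analogy concrete, here is a minimal sketch — the `DebugSpan` class is a stand-in of my own, not the OpenTelemetry API — showing that a span carries exactly the information a printf would: a label plus a state dump.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in span: a named region with key/value attributes,
// i.e. a structured printf attached to a trace instead of a log stream.
class DebugSpan {
    final String name;
    final Map<String, Object> attributes = new LinkedHashMap<>();

    DebugSpan(String name) {
        this.name = name;
    }

    DebugSpan setAttribute(String key, Object value) {
        attributes.put(key, value);
        return this;
    }

    // Rendering the span is semantically the same as the printf line
    // System.out.printf("checkout user=%s%n", userId) would have produced.
    String asPrintf() {
        return name + " " + attributes;
    }
}
```

The only difference is the destination: the printf goes to stdout, while the span's name and attributes ride along on the trace.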
And developers love printfs. From On the Dichotomy of Debugging Behavior Among Programmers:
"[D]evelopers were well-informed about printf debugging and that it is a conscious choice if they employ it, often the beginning of a longer debugging process. Interviewees praised printf as a universal tool that one can always resort back to. […] While frequently used, developers are also aware of its shortcomings, saying that “you are half-way toward either telemetry or toward tracing”
There's a long history of developers keeping log-based debug statements in production executables for diagnostic purposes. Developers have a method to debug and resolve actual vs. expected behavior:
- Create a pool of data from statements: state dumps, branches, and state transitions.
- Fish around in that pool with various hooks and queries.
- Keep the most useful statements around as diagnostic logging for later.
In short, exploratory debugging involves "drill-down" fine-grained detailed spans. This works with logs because debug logging is not enabled by default in production, and logging can be selectively enabled. But spans are printfs: there is no priority system, and no way to filter spans so that a user is not overwhelmed with debugging spans.
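The asymmetry shows up even in the JDK's own logging: a debug statement can stay in production code because a level check gates it at runtime, and the level can be flipped without a redeploy. A span has no equivalent gate. A minimal sketch using `java.util.logging` (the class and message here are illustrative):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class CheckoutService {
    private static final Logger logger =
        Logger.getLogger(CheckoutService.class.getName());

    void process(String orderId) {
        // Gated: this statement costs almost nothing unless FINE is
        // enabled at runtime, so it can live in production code forever.
        if (logger.isLoggable(Level.FINE)) {
            logger.fine("processing order " + orderId);
        }
        // A custom span here would have no such per-statement gate: once
        // instrumented, it is recorded subject only to trace-wide sampling.
    }
}
```

By default `FINE` is disabled, so the debug statement is inert until an operator selectively enables it — exactly the priority system that spans lack.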
12/ Now, when people think about tracing, they tend to think about this giant table “one trace (or row) at a time.”— Ben Sigelman (@el_bhs) April 12, 2021
Imagine restricting a logging system to display only one log-line at a time. This is just as bad… perhaps worse. And yet it passes for “tracing.” :-/ pic.twitter.com/fgEM6tUyK4
What we'd really like is a system that could intelligently filter and display a trace view so that spans could be "folded up" as necessary. We don't have systems that do that currently, but we can do the next best thing – we can fold up spans ourselves.
Conditional Span Builder
The solution here is to let the application determine what spans it constructs. We can do this by adding a scripting engine to the OpenTelemetry SDK, and creating spans based on dynamic script evaluation.
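As a sketch of the idea — the `ConditionalTracer` name and its shape are my own, not the OpenTelemetry SDK's — span creation is put behind a condition that plays the role of the compiled script and can be swapped at runtime without a redeploy:

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: a tracer wrapper that only builds "real" spans
// when a dynamically replaceable condition passes; otherwise the caller
// would fall back to a no-op span.
class ConditionalTracer {
    // Stands in for the compiled script. volatile so that an operator
    // can swap it from another thread while the service is running.
    private volatile Predicate<Map<String, String>> condition = attrs -> false;

    void setCondition(Predicate<Map<String, String>> condition) {
        this.condition = condition;
    }

    boolean shouldStartSpan(Map<String, String> attributes) {
        return condition.test(attributes);
    }
}
```

In the real implementation the predicate would be the evaluated script, and a false result would yield an invalid/no-op span rather than skipping instrumentation entirely.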
If you've been following the blog, I've taken the code from Dynamic Logging with Conditions and stuck it in the Tracing SDK.
Following that, we can render spans by debug attribute:
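For example, a condition could gate span creation on a `debug` attribute propagated with the request — the attribute name and span-name prefix here are illustrative assumptions, not fixed conventions:

```java
import java.util.Map;

class DebugAttributeCondition {
    // Build detailed spans only when the request carries debug=true
    // (e.g. propagated via baggage or a header) and the span belongs
    // to the subsystem under investigation.
    static boolean byDebugAttribute(String spanName, Map<String, String> attrs) {
        return "true".equals(attrs.get("debug"))
            && spanName.startsWith("checkout.");
    }
}
```

Because the condition is evaluated per request, a single user session can be traced in fine-grained detail while everyone else's traffic produces only the default edge spans.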
or even by feature flag:
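Similarly, the condition could consult a feature-flag store — here a hypothetical in-memory `FeatureFlags` stand-in; a real system would query a flag service or config backend:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FeatureFlags {
    private final Map<String, Boolean> flags = new ConcurrentHashMap<>();

    void set(String name, boolean enabled) {
        flags.put(name, enabled);
    }

    boolean isEnabled(String name) {
        return flags.getOrDefault(name, false);
    }

    // Condition used by the tracer: build detailed spans only while the
    // (hypothetical) "tracing-debug" flag is flipped on. Flipping it off
    // reverts to default behavior with no redeploy.
    boolean shouldTrace() {
        return isEnabled("tracing-debug");
    }
}
```

This is the "debug mode without a redeploy" that the tweet below describes: the flag turns fine-grained spans on for the duration of an investigation, then off again.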
Oh enable or disable tracing by feature flag is such a smart idea. Basically turning on debug mode but with spans and traces and not having to wait for a redeploy. Watching @will_sargent 's talk at @o11ycon . https://t.co/u0wGsT2Smd— Michael Sickles (@McSick90) June 9, 2021
This functionality itself can be changed on the fly. You can change the script, or you can disable script evaluation completely and revert to the out of the box behavior of OpenTelemetry.
The Java Scripting API compiles the script to Java bytecode, so it is handled as "native Java". You are not limited to Java either – scripting is entirely possible in the Go and Rust SDK implementations; LuaJIT, for example, has some great optimization features.
I think that eventually this will be built-in SDK behavior, and that scripting will become a part of all telemetry operations. As systems become more complex and downtime becomes less acceptable, there will be an expectation that telemetry is dynamic and adjusts with the environment.
Adding in application hooks does not negate the need for a more advanced tracing system – one that goes beyond the trace view – that can provide richer interaction with traces. I don't think this will be a fancier trace view, though – I think that logs, traces, and metrics will all be sources for exploratory data analysis through a notebook interface like Observable or Kaggle, with additional visualizations like heat maps, box plots, flame graphs, context calling trees, and Sankey diagrams all as options.
There are some great starting points from an analytics perspective: Dan Luu's summary of Rebecca Isaacs' work on tracing analytics, the work that Slack has done on causal graphs using SQL, and Daniela Miao's talk Tracing is More Than Traces: The Insights in Trace Aggregates.