Limitations of Process Tracing

Process tracing is a wonderful technical solution for debugging performance issues at scale and in production. However, it cannot answer customer or support questions like “Why is the order not processed?”

Story time

A couple of months ago, the almighty tracing tooling was added to your application to solve all the problems of observability. Data from auto instrumentation libraries like OpenTelemetry, ElasticAPM, etc. is flowing in. And in a couple of places in the code, some extra metadata is added to the spans to make the traces more informative.

Last week, a customer reported a bug, and you went digging for the specific trace. You found it, but it didn’t contain the information you needed. You added the missing information for next time and increased the granularity of the spans while you were at it.

Today, another bug is reported, and again the trace reveals which endpoints were called but is missing the actual data received as well as the result of the calculation.

You start to realize: The auto instrumentation only does process tracing, but not data tracing.

It can tell you which functions were called, but not what data was passed to them.
It can tell you how long a function took, but not what it did.
It can tell you how often a function was called, but not why.

Span events

Instead, record events when a business decision is made or a specific action is taken. Examples are technical events like “OrderEventReceived” from system A or functional events like “OrderProcessed”. These events can be enriched with context information like the order id, the customer id, the product id, etc.

In OpenTelemetry, this is called a span event. A span event is added to the current span and thereby becomes part of the trace. It’s similar to a normal log entry, except that it’s stored separately from the application logs.
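As a rough illustration, recording such an event could look like the sketch below. It assumes a span object that exposes an OpenTelemetry-style `add_event(name, attributes=...)` method. The helper name `record_order_processed` and the attribute keys are made up for this example and are not an official convention.

```python
from typing import Mapping, Optional, Protocol, Union

AttributeValue = Union[str, bool, int, float]


class SupportsAddEvent(Protocol):
    """Anything with an OpenTelemetry-style add_event method, e.g. a span."""

    def add_event(
        self, name: str, attributes: Optional[Mapping[str, AttributeValue]] = None
    ) -> None: ...


def record_order_processed(
    span: SupportsAddEvent, order_id: str, customer_id: str, product_id: str
) -> None:
    """Attach a functional "OrderProcessed" event to the given span."""
    span.add_event(
        "OrderProcessed",
        attributes={
            # Hypothetical attribute keys, chosen for illustration only.
            "order.id": order_id,
            "customer.id": customer_id,
            "product.id": product_id,
        },
    )
```

In an application that uses the OpenTelemetry Python API, the span passed in would typically be the active one, for example the result of `opentelemetry.trace.get_current_span()`.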

Ivan Burmistrov took it one step further and calls these events wide events.

Sampling

Another motivation to split process tracing and data tracing is the amount of data that’s generated. Process tracing is a high-volume, low-value data source. Sooner or later, you will have to sample the traces, that is, throw data away.

At that point, the data trace information is still relevant and should be kept. But since span events are part of the trace, they are sampled along with it. A simple workaround is to log these events as structured logs instead. Tagged with a specific attribute or log level, they can be picked out from the rest of the application logs.
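One way this workaround might look, using only Python's standard `logging` and `json` modules, is sketched here. The `JsonFormatter` class, the `log_business_event` helper, and the `log_type` and `context` field names are assumptions made for this example. Any attribute that your log pipeline can filter on would do.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "log_type": getattr(record, "log_type", "application"),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload, sort_keys=True)


def log_business_event(logger: logging.Logger, name: str, **context: str) -> None:
    """Log a data-tracing event, tagged so it can be filtered later."""
    logger.info(name, extra={"log_type": "business_event", "context": context})
```

With a handler that uses `JsonFormatter`, a call like `log_business_event(logger, "OrderProcessed", order_id="o-1")` produces an entry whose `log_type` is `business_event`, while ordinary log calls default to `application`. The log backend can then keep the former unsampled.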

Conclusion

To summarize: process tracing is a great tool for debugging performance issues, but answering customer or support questions requires data tracing.
