Rework SymbolizationComplete mechanism #301

christos68k · 2025-01-10T18:04:51Z

The SymbolizationComplete mechanism still reflects old semantics regarding trace transmission, processing and timestamping that no longer exist.

As the documentation string describes, it's meant to be in essence an indicator that all Traces until a specific KTime have been processed, allowing for dependent events in userspace (such as cleaning up process metadata) to execute without loss of information.

However, when we switched away from batched Trace processing and introduced high resolution timestamps for each individual trace (as opposed to timestamping the entire batch of Traces in userspace) we broke that contract as the ordering of events delivered to userspace and subsequently processed is no longer guaranteed to be sequential in observed KTime. This is due to how events are delivered in userspace: our perf event processing loop will drain per-CPU buffers without taking KTime into account. For example, for a CPU that has generated trace events on two cores, we'll first process all events for core 0 and then all events for core 1 introducing the possibility of SymbolizationComplete seeing backward jumps in time.

Additionally, we're calling SymbolizationComplete for every trace received which can result in thousands of calls per second depending on sampling frequency and core count. This is not a big issue for the common case where SymbolizationComplete exits immediately, but could result in corner cases with increased processing load.

Finally, the SymbolizationComplete mechanism is currently tied to interpreted traces (even though we needlessly call it for native traces too) but the problem it solves is more generic than that and it'd be useful to have a mechanism that lets userspace know it's safe to perform actions that depend on Traces up to a specific KTime having been processed. For example, a generic mechanism that provides this guarantee can be used to fix #278.

The text was updated successfully, but these errors were encountered:

christos68k · 2025-01-10T18:11:11Z

I'm looking into reworking the existing mechanism as such:

Generify and decouple SymbolizationComplete from traceHandler.HandleTrace. Instead, call it at fixed-reduced frequency (e.g. 2 times a second) from the tracer polling loop.
Instead of a current KTime, introduce artificial lag by keeping track of the previous, lowest-seen KTime in that loop. This is a simpler solution than batching and sorting trace events.

The artificial lag introduced based on the current polling interval of 250ms is expected to be less than 1s, making PID reuse even with a reduced PID space of 32768 values, highly unlikely.

christos68k added the bug Something isn't working label Jan 10, 2025

christos68k self-assigned this Jan 10, 2025

christos68k mentioned this issue Jan 14, 2025

Rework SymbolizationComplete #307

Merged

christos68k linked a pull request Jan 14, 2025 that will close this issue

Rework SymbolizationComplete #307

Merged

christos68k closed this as completed in #307 Jan 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework SymbolizationComplete mechanism #301

Rework SymbolizationComplete mechanism #301

christos68k commented Jan 10, 2025 •

edited

Loading

christos68k commented Jan 10, 2025 •

edited

Loading

Rework SymbolizationComplete mechanism #301

Rework SymbolizationComplete mechanism #301

Comments

christos68k commented Jan 10, 2025 • edited Loading

christos68k commented Jan 10, 2025 • edited Loading

christos68k commented Jan 10, 2025 •

edited

Loading

christos68k commented Jan 10, 2025 •

edited

Loading