Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework SymbolizationComplete mechanism #301

Closed
christos68k opened this issue Jan 10, 2025 · 1 comment · Fixed by #307
Closed

Rework SymbolizationComplete mechanism #301

christos68k opened this issue Jan 10, 2025 · 1 comment · Fixed by #307
Assignees
Labels
bug Something isn't working

Comments

@christos68k
Copy link
Member

christos68k commented Jan 10, 2025

The SymbolizationComplete mechanism still reflects old semantics regarding trace transmission, processing and timestamping that no longer exist.

As the documentation string describes, it's meant to be in essence an indicator that all Traces until a specific KTime have been processed, allowing for dependent events in userspace (such as cleaning up process metadata) to execute without loss of information.

However, when we switched away from batched Trace processing and introduced high resolution timestamps for each individual trace (as opposed to timestamping the entire batch of Traces in userspace) we broke that contract as the ordering of events delivered to userspace and subsequently processed is no longer guaranteed to be sequential in observed KTime. This is due to how events are delivered in userspace: our perf event processing loop will drain per-CPU buffers without taking KTime into account. For example, for a CPU that has generated trace events on two cores, we'll first process all events for core 0 and then all events for core 1 introducing the possibility of SymbolizationComplete seeing backward jumps in time.

Additionally, we're calling SymbolizationComplete for every trace received which can result in thousands of calls per second depending on sampling frequency and core count. This is not a big issue for the common case where SymbolizationComplete exits immediately, but could result in corner cases with increased processing load.

Finally, the SymbolizationComplete mechanism is currently tied to interpreted traces (even though we needlessly call it for native traces too) but the problem it solves is more generic than that and it'd be useful to have a mechanism that lets userspace know it's safe to perform actions that depend on Traces up to a specific KTime having been processed. For example, a generic mechanism that provides this guarantee can be used to fix #278.

@christos68k christos68k added the bug Something isn't working label Jan 10, 2025
@christos68k christos68k self-assigned this Jan 10, 2025
@christos68k
Copy link
Member Author

christos68k commented Jan 10, 2025

I'm looking into reworking the existing mechanism as such:

  1. Generify and decouple SymbolizationComplete from traceHandler.HandleTrace. Instead, call it at fixed-reduced frequency (e.g. 2 times a second) from the tracer polling loop.
  2. Instead of a current KTime, introduce artificial lag by keeping track of the previous, lowest-seen KTime in that loop. This is a simpler solution than batching and sorting trace events.

The artificial lag introduced based on the current polling interval of 250ms is expected to be less than 1s, making PID reuse even with a reduced PID space of 32768 values, highly unlikely.

@christos68k christos68k linked a pull request Jan 14, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant