Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework SymbolizationComplete #307

Merged
merged 8 commits into from
Jan 22, 2025
Merged

Rework SymbolizationComplete #307

merged 8 commits into from
Jan 22, 2025

Conversation

christos68k
Copy link
Member

@christos68k christos68k commented Jan 14, 2025

Summary

Updated SymbolizationComplete mechanism to reflect current semantics around trace processing and timestamping (no batching, in-kernel high resolution timestamps):

  • Don't call SymbolizationComplete per-Trace, instead call it after each iteration of the perf event batch-drain loop. This introduces a call frequency upper bound (currently: 4Hz).
  • Keep track of minimum KTime seen during trace event retrieval and report the minimum KTime belonging to the previous processing iteration with SymbolizationComplete.
  • startPollingPerfEventMonitor is now specialized to trace event processing, this also simplifies caller logic.

TODO:

  1. Generify SymbolizationComplete, fix Sending executable path for processes that have exited #278 Will open new PR for this.

Also see:

@christos68k christos68k requested review from a team as code owners January 14, 2025 22:27
@christos68k christos68k marked this pull request as draft January 14, 2025 22:32
@christos68k christos68k self-assigned this Jan 14, 2025
@christos68k christos68k linked an issue Jan 14, 2025 that may be closed by this pull request
@christos68k christos68k marked this pull request as ready for review January 17, 2025 21:26
@christos68k christos68k force-pushed the ck/symb-complete branch 4 times, most recently from 3541f2b to 9991fa8 Compare January 17, 2025 22:06
@@ -117,11 +117,8 @@ func newTraceHandler(rep reporter.TraceReporter, traceProcessor TraceProcessor,
}

func (m *traceHandler) HandleTrace(bpfTrace *host.Trace) {
defer m.traceProcessor.SymbolizationComplete(bpfTrace.KTime)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplifying, SymbolizationComplete is now called from tracer/events.go with an introduced upper bound to the calling frequency.

pollFrequency time.Duration, perCPUBufferSize int, triggerFunc func([]byte, int),
) func() (lost, noData, readError uint64) {
eventReader, err := perf.NewReader(perfEventMap, perCPUBufferSize)
func (t *Tracer) startTraceEventMonitor(ctx context.Context,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No point in having this be a generic function when:

  1. Nothing else is currently using it (other than receiving trace events)
  2. I'm introducing logic that's specialized to trace event handling

Instead, switching it to a Tracer method also simplifies the interface.

@christos68k
Copy link
Member Author

christos68k commented Jan 17, 2025

This is how the logic looks now (polling loop and SymbolizationComplete being called).

On a system with low CPU load

WARN[0017] Poll events:0 oldKTime:0 minKTime:0          
WARN[0017] Poll events:0 oldKTime:0 minKTime:0          
WARN[0017] Poll events:1 oldKTime:0 minKTime:849288200358321 
WARN[0017] Poll events:0 oldKTime:849288200358321 minKTime:0 
WARN[0017] SymbolizationComplete captureKT: 849288200358321 latency: 288 ms 
WARN[0018] Poll events:0 oldKTime:0 minKTime:0          
WARN[0018] Poll events:4 oldKTime:0 minKTime:849288799647475 
WARN[0018] Poll events:1 oldKTime:849288799647475 minKTime:849289049572241 
WARN[0018] SymbolizationComplete captureKT: 849288799647475 latency: 438 ms 
WARN[0018] Poll events:0 oldKTime:849289049572241 minKTime:0 
WARN[0018] SymbolizationComplete captureKT: 849289049572241 latency: 438 ms 

We see 4 iterations of the polling loop per second (as expected due to 250ms polling interval) and SymbolizationComplete being called only when needed. The 'artificial' latency in observed KTime remains below 500ms.

On a fully loaded system

WARN[0042] Poll events:30 oldKTime:849312250364163 minKTime:849312499384875 
WARN[0042] SymbolizationComplete captureKT: 849312250364163 latency: 496 ms 
WARN[0042] Poll events:38 oldKTime:849312499384875 minKTime:849312750439243 
WARN[0042] SymbolizationComplete captureKT: 849312499384875 latency: 500 ms 
WARN[0042] Poll events:49 oldKTime:849312750439243 minKTime:849313000197715 
WARN[0042] SymbolizationComplete captureKT: 849312750439243 latency: 501 ms 
WARN[0042] Poll events:46 oldKTime:849313000197715 minKTime:849313276705244 
WARN[0042] SymbolizationComplete captureKT: 849313000197715 latency: 506 ms 
WARN[0043] Poll events:54 oldKTime:849313276705244 minKTime:849313526712844 
WARN[0043] SymbolizationComplete captureKT: 849313276705244 latency: 483 ms 
WARN[0043] Poll events:19 oldKTime:849313526712844 minKTime:849313777505746 
WARN[0043] SymbolizationComplete captureKT: 849313526712844 latency: 468 ms 
WARN[0043] Poll events:59 oldKTime:849313777505746 minKTime:849314000068987 
WARN[0043] SymbolizationComplete captureKT: 849313777505746 latency: 480 ms 
WARN[0043] Poll events:46 oldKTime:849314000068987 minKTime:849314277230230 
WARN[0043] SymbolizationComplete captureKT: 849314000068987 latency: 506 ms 

Again we see 4 iterations of the polling loop per second, but this time SymbolizationComplete is being called at maximum frequency (4 times a second) as there are new trace events being received continuously. The artificial latency introduced is similar to the low CPU load case. Worst-case latency I've seen in a completely bogged down system is around 1000ms.

tracer/events.go Outdated
Comment on lines 202 to 209
kt := oldKTime
if minKTime > 0 && minKTime < kt {
// If current minKTime is smaller than oldKTime, use it
// instead of oldKTime (and set it to 0 to avoid a repeat).
kt = minKTime
minKTime = 0
}
t.TraceProcessor().SymbolizationComplete(kt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without a kt temp variable, the code is easier to read.

Suggested change
kt := oldKTime
if minKTime > 0 && minKTime < kt {
// If current minKTime is smaller than oldKTime, use it
// instead of oldKTime (and set it to 0 to avoid a repeat).
kt = minKTime
minKTime = 0
}
t.TraceProcessor().SymbolizationComplete(kt)
if oldKTime <= minKTime {
t.TraceProcessor().SymbolizationComplete(oldKTime)
} else {
// If minKTime is smaller than oldKTime, use it
// and reset it to avoid a repeat.
t.TraceProcessor().SymbolizationComplete(minKTime)
minKTime = 0
}

Copy link
Member Author

@christos68k christos68k Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to rework this a bit as the suggested logic was incorrect when minKTime == 0. See d12a80a.

tracer/events.go Outdated Show resolved Hide resolved
tracer/events.go Outdated Show resolved Hide resolved
tracer/events.go Outdated Show resolved Hide resolved
@christos68k christos68k force-pushed the ck/symb-complete branch 2 times, most recently from c7a47e4 to ad771d9 Compare January 21, 2025 23:15
if minKTime == 0 || trace.KTime < minKTime {
minKTime = trace.KTime
}
traceOutChan <- trace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you moved that code here it becomes obvious that we enforce a task/context switch for every trace. Possibly not for this PR, but we should improve this (e.g. with batch processing or a buffered channel).

@rockdaboot rockdaboot self-requested a review January 22, 2025 10:59
christos68k and others added 2 commits January 22, 2025 11:32
Update SymbolizationComplete mechanism to reflect current semantics
around trace processing and timestamping (no batching, in-kernel
high resolution timestamps)
Co-authored-by: Tim Rühsen <tim.ruhsen@elastic.co>
@christos68k christos68k merged commit 58af13c into main Jan 22, 2025
24 checks passed
@christos68k christos68k deleted the ck/symb-complete branch January 22, 2025 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rework SymbolizationComplete mechanism Sending executable path for processes that have exited
3 participants