Rework SymbolizationComplete #307

christos68k · 2025-01-14T22:27:45Z

Summary

Updated SymbolizationComplete mechanism to reflect current semantics around trace processing and timestamping (no batching, in-kernel high resolution timestamps):

- Don't call SymbolizationComplete per trace; instead, call it after each iteration of the perf event batch-drain loop. This introduces an upper bound on the call frequency (currently 4 Hz).
- Keep track of the minimum KTime seen during trace event retrieval, and report the minimum KTime belonging to the previous processing iteration with SymbolizationComplete.
- startPollingPerfEventMonitor is now specialized to trace event processing, which also simplifies caller logic.
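
The bookkeeping described in the list above can be sketched roughly as follows. This is a minimal, self-contained simulation and not the code in tracer/events.go: the `drainState` type, its methods, and `simulate` are hypothetical names, the KTime values are small made-up numbers, and the rule "report nothing while oldKTime is 0" is an assumption inferred from the log output later in this PR.

```go
package main

import "fmt"

// drainState carries two timestamps across polling iterations.
// Field names follow the log output in this PR; the type is hypothetical.
type drainState struct {
	oldKTime uint64 // minimum KTime of the previous iteration (0 = none)
	minKTime uint64 // minimum KTime of the current iteration (0 = none)
}

// observe records the KTime of one trace event read in the current iteration.
func (s *drainState) observe(ktime uint64) {
	if s.minKTime == 0 || ktime < s.minKTime {
		s.minKTime = ktime
	}
}

// endIteration runs once per polling iteration. It returns the KTime that
// would be passed to SymbolizationComplete and whether a call is made at all.
func (s *drainState) endIteration() (uint64, bool) {
	kt := s.oldKTime
	report := kt > 0 // assumption: skip the call when nothing is pending
	if report && s.minKTime > 0 && s.minKTime < kt {
		// The current minimum is older than the previous one: report it
		// now and clear it so it is not reported a second time.
		kt = s.minKTime
		s.minKTime = 0
	}
	s.oldKTime = s.minKTime
	s.minKTime = 0
	return kt, report
}

// simulate feeds one batch of KTimes per polling iteration and collects
// every value that would be reported.
func simulate(batches [][]uint64) []uint64 {
	var s drainState
	var reported []uint64
	for _, batch := range batches {
		for _, kt := range batch {
			s.observe(kt)
		}
		if kt, ok := s.endIteration(); ok {
			reported = append(reported, kt)
		}
	}
	return reported
}

func main() {
	// Illustrative values only: idle, one event, idle, a burst, one event, idle.
	fmt.Println(simulate([][]uint64{{}, {100}, {}, {250, 230}, {300}, {}}))
	// Output: [100 230 300]
}
```

Because at most one value is reported per iteration, the number of calls is bounded by the polling frequency, which is where the 4 Hz bound comes from with a 250 ms interval.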

TODO:

~~Generify SymbolizationComplete, fix Sending executable path for processes that have exited #278~~ Will open new PR for this.

Also see:

Rework SymbolizationComplete mechanism #301

christos68k · 2025-01-17T21:27:52Z

tracehandler/tracehandler.go

@@ -117,11 +117,8 @@ func newTraceHandler(rep reporter.TraceReporter, traceProcessor TraceProcessor,
 }

 func (m *traceHandler) HandleTrace(bpfTrace *host.Trace) {
-	defer m.traceProcessor.SymbolizationComplete(bpfTrace.KTime)


This is a simplification: SymbolizationComplete is now called from tracer/events.go, with an upper bound on the call frequency.

christos68k · 2025-01-17T21:28:49Z

tracer/events.go

-	pollFrequency time.Duration, perCPUBufferSize int, triggerFunc func([]byte, int),
-) func() (lost, noData, readError uint64) {
-	eventReader, err := perf.NewReader(perfEventMap, perCPUBufferSize)
+func (t *Tracer) startTraceEventMonitor(ctx context.Context,


No point in having this be a generic function when:

- Nothing else is currently using it (other than receiving trace events)
- I'm introducing logic that's specialized to trace event handling

Instead, switching it to a Tracer method also simplifies the interface.

christos68k · 2025-01-17T22:15:31Z

This is how the logic looks now (polling loop and SymbolizationComplete being called).

On a system with low CPU load

WARN[0017] Poll events:0 oldKTime:0 minKTime:0          
WARN[0017] Poll events:0 oldKTime:0 minKTime:0          
WARN[0017] Poll events:1 oldKTime:0 minKTime:849288200358321 
WARN[0017] Poll events:0 oldKTime:849288200358321 minKTime:0 
WARN[0017] SymbolizationComplete captureKT: 849288200358321 latency: 288 ms 
WARN[0018] Poll events:0 oldKTime:0 minKTime:0          
WARN[0018] Poll events:4 oldKTime:0 minKTime:849288799647475 
WARN[0018] Poll events:1 oldKTime:849288799647475 minKTime:849289049572241 
WARN[0018] SymbolizationComplete captureKT: 849288799647475 latency: 438 ms 
WARN[0018] Poll events:0 oldKTime:849289049572241 minKTime:0 
WARN[0018] SymbolizationComplete captureKT: 849289049572241 latency: 438 ms

We see 4 iterations of the polling loop per second (as expected with the 250 ms polling interval), and SymbolizationComplete is called only when needed. The 'artificial' latency in observed KTime remains below 500 ms.

On a fully loaded system

WARN[0042] Poll events:30 oldKTime:849312250364163 minKTime:849312499384875 
WARN[0042] SymbolizationComplete captureKT: 849312250364163 latency: 496 ms 
WARN[0042] Poll events:38 oldKTime:849312499384875 minKTime:849312750439243 
WARN[0042] SymbolizationComplete captureKT: 849312499384875 latency: 500 ms 
WARN[0042] Poll events:49 oldKTime:849312750439243 minKTime:849313000197715 
WARN[0042] SymbolizationComplete captureKT: 849312750439243 latency: 501 ms 
WARN[0042] Poll events:46 oldKTime:849313000197715 minKTime:849313276705244 
WARN[0042] SymbolizationComplete captureKT: 849313000197715 latency: 506 ms 
WARN[0043] Poll events:54 oldKTime:849313276705244 minKTime:849313526712844 
WARN[0043] SymbolizationComplete captureKT: 849313276705244 latency: 483 ms 
WARN[0043] Poll events:19 oldKTime:849313526712844 minKTime:849313777505746 
WARN[0043] SymbolizationComplete captureKT: 849313526712844 latency: 468 ms 
WARN[0043] Poll events:59 oldKTime:849313777505746 minKTime:849314000068987 
WARN[0043] SymbolizationComplete captureKT: 849313777505746 latency: 480 ms 
WARN[0043] Poll events:46 oldKTime:849314000068987 minKTime:849314277230230 
WARN[0043] SymbolizationComplete captureKT: 849314000068987 latency: 506 ms

Again we see 4 iterations of the polling loop per second, but this time SymbolizationComplete is being called at maximum frequency (4 times a second) as there are new trace events being received continuously. The artificial latency introduced is similar to the low CPU load case. Worst-case latency I've seen in a completely bogged down system is around 1000ms.

rockdaboot · 2025-01-20T10:48:11Z

tracer/events.go

+				kt := oldKTime
+				if minKTime > 0 && minKTime < kt {
+					// If current minKTime is smaller than oldKTime, use it
+					// instead of oldKTime (and set it to 0 to avoid a repeat).
+					kt = minKTime
+					minKTime = 0
+				}
+				t.TraceProcessor().SymbolizationComplete(kt)


Without a kt temp variable, the code is easier to read.

Suggested change

-				kt := oldKTime
-				if minKTime > 0 && minKTime < kt {
-					// If current minKTime is smaller than oldKTime, use it
-					// instead of oldKTime (and set it to 0 to avoid a repeat).
-					kt = minKTime
-					minKTime = 0
-				}
-				t.TraceProcessor().SymbolizationComplete(kt)
+				if oldKTime <= minKTime {
+					t.TraceProcessor().SymbolizationComplete(oldKTime)
+				} else {
+					// If minKTime is smaller than oldKTime, use it
+					// and reset it to avoid a repeat.
+					t.TraceProcessor().SymbolizationComplete(minKTime)
+					minKTime = 0
+				}

Had to rework this a bit, as the suggested logic was incorrect when minKTime == 0. See d12a80a.
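
To make the minKTime == 0 case concrete, here is a small sketch that compares only the value each variant would pass to SymbolizationComplete. The function names `originalPick` and `suggestedPick` are hypothetical, the inputs are made-up numbers, and this does not reproduce the fix in d12a80a.

```go
package main

import "fmt"

// originalPick mirrors the variant with the kt temporary from the diff.
func originalPick(oldKTime, minKTime uint64) uint64 {
	kt := oldKTime
	if minKTime > 0 && minKTime < kt {
		kt = minKTime
	}
	return kt
}

// suggestedPick mirrors the suggested change, which has no minKTime > 0 check.
func suggestedPick(oldKTime, minKTime uint64) uint64 {
	if oldKTime <= minKTime {
		return oldKTime
	}
	return minKTime
}

func main() {
	// No events in the current iteration (minKTime == 0):
	// the suggested variant would report 0 instead of oldKTime.
	fmt.Println(originalPick(100, 0), suggestedPick(100, 0)) // 100 0

	// With a smaller, non-zero minKTime both variants agree.
	fmt.Println(originalPick(100, 90), suggestedPick(100, 90)) // 90 90
}
```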

rockdaboot · 2025-01-22T08:53:43Z

tracer/events.go

+				if minKTime == 0 || trace.KTime < minKTime {
+					minKTime = trace.KTime
+				}
+				traceOutChan <- trace


Now that you moved that code here it becomes obvious that we enforce a task/context switch for every trace. Possibly not for this PR, but we should improve this (e.g. with batch processing or a buffered channel).
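
As a rough illustration of the buffered-channel option mentioned above (not a proposal for the actual implementation), the sketch below shows that sends on an unbuffered channel cannot complete without a receiver ready to take each value, while a buffered channel absorbs a burst. The `trace` type, `enqueueBurst`, and the capacity of 16 are assumptions made for this example.

```go
package main

import "fmt"

// trace is a hypothetical stand-in for host.Trace.
type trace struct{ KTime uint64 }

// enqueueBurst tries to send n traces while no receiver is running and
// returns how many sends completed without blocking.
func enqueueBurst(ch chan *trace, n int) int {
	sent := 0
	for i := 0; i < n; i++ {
		select {
		case ch <- &trace{KTime: uint64(i + 1)}:
			sent++
		default:
			// The send would block: an unbuffered channel needs a
			// receiver for every single value.
			return sent
		}
	}
	return sent
}

func main() {
	fmt.Println(enqueueBurst(make(chan *trace), 8))     // 0
	fmt.Println(enqueueBurst(make(chan *trace, 16), 8)) // 8
}
```

With a buffer, the producer can drain a whole perf event batch before the consumer goroutine has to run; how much this helps in practice would need to be measured.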

Update SymbolizationComplete mechanism to reflect current semantics around trace processing and timestamping (no batching, in-kernel high resolution timestamps)

Co-authored-by: Tim Rühsen <tim.ruhsen@elastic.co>

Co-authored-by: Florian Lehner <florian.lehner@elastic.co>

christos68k requested review from a team as code owners January 14, 2025 22:27

christos68k marked this pull request as draft January 14, 2025 22:32

christos68k self-assigned this Jan 14, 2025

christos68k linked an issue Jan 14, 2025 that may be closed by this pull request

Rework SymbolizationComplete mechanism #301

Closed

christos68k mentioned this pull request Jan 14, 2025

Add metrics for trace event perf event monitor #308

Closed

christos68k marked this pull request as ready for review January 17, 2025 21:26

christos68k force-pushed the ck/symb-complete branch 4 times, most recently from 3541f2b to 9991fa8 Compare January 17, 2025 22:06

christos68k commented Jan 17, 2025

christos68k force-pushed the ck/symb-complete branch from 9991fa8 to 870479d Compare January 17, 2025 22:23

rockdaboot reviewed Jan 20, 2025

florianl reviewed Jan 20, 2025

rockdaboot approved these changes Jan 21, 2025

christos68k commented Jan 21, 2025

christos68k force-pushed the ck/symb-complete branch 2 times, most recently from c7a47e4 to ad771d9 Compare January 21, 2025 23:15

christos68k requested a review from rockdaboot January 22, 2025 02:23

christos68k force-pushed the ck/symb-complete branch from 5dc9499 to 0009388 Compare January 22, 2025 02:38

florianl approved these changes Jan 22, 2025

rockdaboot reviewed Jan 22, 2025

rockdaboot self-requested a review January 22, 2025 10:59

rockdaboot approved these changes Jan 22, 2025

christos68k and others added 2 commits January 22, 2025 11:32

Rework SymbolizationComplete

78b90dd

Update SymbolizationComplete mechanism to reflect current semantics around trace processing and timestamping (no batching, in-kernel high resolution timestamps)

Update tracer/events.go

677fa8f

Co-authored-by: Tim Rühsen <tim.ruhsen@elastic.co>

christos68k and others added 6 commits January 22, 2025 11:32

Use debug instead of warn logging

2b4162d

Better scoping of variables

43e73db

Co-authored-by: Florian Lehner <florian.lehner@elastic.co>

Update explanation

b6efb76

Fix edge case where minKTime == 0

e79a2f6

Synchronize with HandleTrace

d53998b

Minor doc update

eded6cb

christos68k force-pushed the ck/symb-complete branch from 51f78ac to eded6cb Compare January 22, 2025 16:32

christos68k merged commit 58af13c into main Jan 22, 2025
24 checks passed

christos68k deleted the ck/symb-complete branch January 22, 2025 16:54

christos68k mentioned this pull request Jan 22, 2025

Delayed processing for ProcessManager.pidToProcessInfo #321

Merged
