Requirement: Support for additional metadata such as OpenTelemetry/TracingContext in WebSocket Messages #30

Open
RobWin opened this issue Dec 19, 2024 · 3 comments

Comments

@RobWin
Collaborator

RobWin commented Dec 19, 2024

Description:
In modern distributed systems, observability and tracing are crucial for debugging, performance monitoring, and understanding the flow of requests across multiple services. OpenTelemetry and W3C Trace Context provide a standardized approach for tracing and capturing metadata about requests, enabling better visibility into the system's behavior.

As Thing Clients and Thing Servers communicate over WebSockets in the WebThing protocol, it's important to provide support for injecting tracing context into individual WebSocket messages. This would allow consumers and servers to track the flow of messages across various components and gain better insights into performance and errors.

Proposal:
I propose adding support for additional metadata, such as OpenTelemetry/TracingContext, to WebSocket messages in the WebThing protocol. This would enable the transmission of trace context as metadata, allowing tools and observability systems to correlate events, measure latencies, and track the health of interactions between Thing Clients and Thing Servers.

Why Tracing Context is Needed:

  • Distributed Tracing: When a message is sent from a Thing Client to a Thing Server, and potentially forwarded to other services or systems, it is important to be able to trace that request through its lifecycle. OpenTelemetry provides a standardized way to carry trace context along with requests, ensuring full traceability of messages as they propagate through the system.
  • Monitoring & Performance: With tracing context, monitoring systems can track the latency of individual requests, identify bottlenecks, and gain insights into the performance of the Thing Server or Client. This can help in improving the system’s reliability and responsiveness.
  • Error Diagnosis: By correlating WebSocket messages with traces, it becomes easier to pinpoint the root cause of failures, whether they are network issues, server errors, or client misconfigurations.

Suggested Changes:

  1. Allow for the inclusion of tracing context as metadata fields in WebSocket messages. This metadata would carry trace context (such as trace ID, span ID, and other relevant details) that can be used by observability tools.
  2. This could be achieved by extending the message format to include a new field, like traceContext (or metadata), that holds the OpenTelemetry-compatible trace information.
  3. Provide a mechanism for consumers and servers to read and inject tracing context into WebSocket messages to maintain the trace chain as messages flow through the system.

Example:
When a Thing Client sends a message to a Thing Server, the message could include trace metadata:

{
  "thingId": "https://mythingserver.com/things/mylamp1",
  "messageType": "writeProperty",
  "property": "on",
  "data": true,
  "traceContext": {
    "traceId": "abc123",
    "spanId": "xyz456",
    "parentSpanId": "def789",
    "traceState": "key1=value1,key2=value2"
    ...
  }
}

This trace context would allow both the Thing Server and external observability systems to correlate the message with other services in the distributed system, providing a clearer picture of the request lifecycle.
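
As a rough illustration of the example above, here is a minimal Python sketch that attaches a traceContext member to a message. The member names (traceContext, traceId, spanId, parentSpanId) are taken from the proposal and are not part of any published specification; the helper names new_trace_context and with_trace_context are hypothetical. The ID lengths follow W3C Trace Context (16-byte trace ID, 8-byte span ID, hex-encoded).

```python
import json
import secrets


def new_trace_context(trace_id=None, parent_span_id=None):
    """Build a traceContext dict in the shape proposed above (hypothetical field names)."""
    context = {
        # W3C Trace Context uses a 16-byte trace ID and an 8-byte span ID, hex-encoded.
        "traceId": trace_id or secrets.token_hex(16),
        "spanId": secrets.token_hex(8),
    }
    if parent_span_id is not None:
        context["parentSpanId"] = parent_span_id
    return context


def with_trace_context(message, context):
    """Return a copy of the message with a traceContext member attached."""
    traced = dict(message)
    traced["traceContext"] = context
    return traced


message = {
    "thingId": "https://mythingserver.com/things/mylamp1",
    "messageType": "writeProperty",
    "property": "on",
    "data": True,
}
traced_message = with_trace_context(message, new_trace_context())
print(json.dumps(traced_message, indent=2))
```

The original message is left untouched, so a client that does not use tracing can keep sending it as before.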

Benefits:

  • Improved Observability: Integrating OpenTelemetry tracing context into WebSocket messages will provide better tracking and monitoring of interactions within the system.
  • End-to-End Tracing: By passing trace context along with the WebSocket messages, distributed tracing systems can correlate events and improve the overall observability of Thing Client and Thing Server interactions.
  • Error Handling and Debugging: Trace metadata will allow easier detection and resolution of issues, as developers can trace back to the specific message or request that caused an error.

Next Steps:

  • Define the format and rules for embedding tracing context within WebThing messages.
  • Update the WebThing protocol documentation to reflect these changes and explain how to use tracing context effectively.
@RobWin RobWin changed the title Requirement: Support for additonal metadat asuch as OpenTelemetry/TracingContext in WebSocket Messages Requirement: Support for additonal metadata such as OpenTelemetry/TracingContext in WebSocket Messages Dec 19, 2024
@benfrancis
Member

@RobWin wrote:

OpenTelemetry and W3C Trace Context provide a standardized approach for tracing and capturing metadata about requests, enabling better visibility into the system's behavior.

This is very interesting but seems like a big topic and I don't think I understand all the use cases.

Would any of these use cases be solved by the messageID & correlationID members already being discussed or is this something entirely different?

There aren't currently any requirements around observability/traceability in the Use Cases & Requirements document. That is intended to be a living document though so that doesn't mean we couldn't add new requirements if there's a lot of demand for these features. That would require writing some new use cases & requirements in the style used in that document, and sufficient support for adding them.

Similarly there aren't any requirements around Quality of Service, which I'm conscious is something people may care about that WebSockets don't provide on their own, and may be related to this and the messageID/correlationID discussion?

As far as referencing other specifications goes it's appealing that Trace Context is an existing W3C Recommendation that we could reference and I note that there is a precedent in the MQTT v3 serialisation of the trace context protocol for serialising the metadata in JSON. That would be easy for us to re-use in Web Thing Protocol messages, e.g.

{
  ...
  "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
  "tracestate": "congo=BleGNlZWRzIHRohbCBwbGVhc3VyZS4"
}

I'm not sure whether the OpenTelemetry specifications by the Cloud Native Computing Foundation would be referenceable in a normative W3C specification. Are OpenTelemetry and W3C Trace Context competing or overlapping specifications, or are they used together?

@RobWin
Collaborator Author

RobWin commented Jan 10, 2025

Are OpenTelemetry and W3C Trace Context competing or overlapping specifications, or are they used together?

OpenTelemetry and W3C Trace Context are complementary, not competing, specifications. OpenTelemetry is an open-source, vendor-neutral observability framework designed to collect, process, and export traces (as well as metrics and logs). It provides the tooling (e.g., SDKs and APIs) needed to instrument and monitor distributed systems. W3C Trace Context, on the other hand, is a standardized way to propagate trace information (e.g., traceparent and tracestate headers) across services. OpenTelemetry uses W3C Trace Context as its default propagator.

Use Case for Trace Context in W3C Web Thing Protocol:

In a smart home ecosystem, W3C Trace Context would enable end-to-end tracing of a user request, such as turning on a light via a mobile app. The mobile app generates a traceparent header containing a unique trace ID and span ID, which propagates through the cloud service, home gateway, and eventually to the smart device (e.g., the light bulb). Each component updates the trace context with new span IDs to represent its role in the request. This propagation ensures that the request’s journey, including latencies and failures at each step, can be traced seamlessly across the distributed system. Developers gain the visibility needed to debug issues, optimize performance, and monitor the system’s behavior as a whole.

Reading Trace Context:

  • When a service receives an incoming request, the propagator extracts the traceparent and tracestate headers from the request.
  • These headers are parsed to reconstruct the trace context (e.g., trace ID, parent span ID, sampling flags).
  • The reconstructed trace context is then used to create a new child span for the current service.

Writing Trace Context:

  • When a service makes an outgoing request, OpenTelemetry injects the current trace context into the request headers as traceparent and tracestate.
  • These headers allow downstream services to link their spans to the same trace.
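
The read and write steps above could be sketched for a component that relays a message, such as a home gateway. This is a simplified illustration rather than the OpenTelemetry API: the function names extract, start_child_span, and inject are hypothetical, and carrying traceparent and tracestate as top-level message members is an assumption based on the JSON example earlier in this thread.

```python
import secrets


def extract(message):
    """Read step: rebuild the trace context from an incoming message, or return None."""
    parts = str(message.get("traceparent", "")).split("-")
    if len(parts) != 4:
        return None  # missing or malformed; a real parser would validate each field
    _version, trace_id, parent_span_id, flags = parts
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "flags": flags,
        "tracestate": message.get("tracestate"),
    }


def start_child_span(context):
    """Create a child span: same trace ID, new span ID, the caller's span as parent."""
    return {**context, "span_id": secrets.token_hex(8)}


def inject(message, span):
    """Write step: copy the current span's context into an outgoing message."""
    outgoing = dict(message)
    outgoing["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-{span['flags']}"
    if span.get("tracestate"):
        outgoing["tracestate"] = span["tracestate"]
    return outgoing


incoming = {
    "messageType": "writeProperty",
    "property": "on",
    "data": True,
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
}
span = start_child_span(extract(incoming))
outgoing = inject({"messageType": "writeProperty", "property": "on", "data": True}, span)
print(outgoing["traceparent"])
```

The outgoing traceparent keeps the trace ID and flags of the incoming one but carries the relay's own span ID, which is what lets the downstream component attach its span as a child.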

Benefit:
Incorporating observability and traceability as a use case and requirement for the W3C Web Thing Protocol would improve monitoring, simplify debugging, and enhance the reliability and performance of the overall IoT solution.

If missing:
Without standardized traceability, debugging in production systems becomes more time-consuming, performance issues go unnoticed, and maintaining system reliability at scale becomes extremely challenging. In production systems, failures are inevitable. If a user reports an issue—such as a command not reaching the smart home device or a service being slow—traceability allows you to pinpoint the exact step in the process where things went wrong. Whether it’s a delay in the cloud service or a miscommunication between devices, having trace data can speed up debugging significantly.

Proposed Requirement for W3C Web Thing Protocol:
To improve observability, the W3C Web Thing Protocol should include optional support for W3C Trace Context headers. These could be incorporated either as explicit, optional fields in the protocol or as part of a flexible metadata structure that allows additional custom headers for systems.

@RobWin
Collaborator Author

RobWin commented Jan 10, 2025

Would any of these use cases be solved by the messageID & correlationID members already being discussed or is this something entirely different?

Summary:
A message ID or correlation ID identifies individual messages or request-reply interactions between two components but lacks the hierarchical context and end-to-end visibility needed to trace distributed interactions over multiple components. Unlike W3C Trace Context, message IDs don’t link parent and child operations, propagate across different protocols and services, or include metadata (e.g., sampling flags) essential for selective tracing and debugging. Therefore, while message IDs are useful for tracking individual messages, they cannot provide the comprehensive observability and traceability required in complex, distributed systems. W3C Trace Context fills this gap by enabling standardized propagation of trace metadata across all components.

Detailed explanation:

Message ID, W3C Trace ID, and W3C Span ID all serve to identify and track requests across distributed systems.

  • Message ID typically identifies a single message and is useful for tracking messages within that specific communication.
  • Trace ID is part of distributed tracing and links a series of related requests across multiple services or systems, helping to track an end-to-end journey of a request.
  • Span ID identifies a single operation or unit of work within a trace, allowing for the detailed breakdown of where time is spent during a trace.

Span ID and Correlation ID are also both used in distributed systems, but they serve different purposes.

Span IDs form a tree-like structure: each span can have a parent-child relationship with other spans. Correlation IDs, by contrast, usually represent a flat grouping of messages or requests/responses related to a single interaction.
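
To make the structural difference concrete, the following Python sketch contrasts the two. All records are invented for illustration: the span names, the short IDs, and the writePropertyResponse message type are assumptions, and correlationID refers to the member under discussion elsewhere rather than a finalized part of the protocol.

```python
from collections import defaultdict

# Hypothetical spans: each one names its parent, so together they form a tree.
spans = [
    {"span_id": "a1", "parent_span_id": None, "name": "mobile app: tap 'on'"},
    {"span_id": "b2", "parent_span_id": "a1", "name": "cloud service: route command"},
    {"span_id": "c3", "parent_span_id": "b2", "name": "home gateway: writeProperty"},
    {"span_id": "d4", "parent_span_id": "c3", "name": "lamp: apply state"},
    {"span_id": "e5", "parent_span_id": "b2", "name": "cloud service: audit log"},
]

# Hypothetical messages: a correlation ID only says which messages belong together.
messages = [
    {"correlationID": "req-1", "messageType": "writeProperty"},
    {"correlationID": "req-1", "messageType": "writePropertyResponse"},
    {"correlationID": "req-2", "messageType": "readProperty"},
]


def render_tree(spans):
    """Return the span names indented by their depth in the parent-child tree."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_span_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append("  " * depth + span["name"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines


def group_by_correlation(messages):
    """Return a flat mapping from correlation ID to the message types sharing it."""
    groups = defaultdict(list)
    for message in messages:
        groups[message["correlationID"]].append(message["messageType"])
    return dict(groups)


print("\n".join(render_tree(spans)))
print(group_by_correlation(messages))
```

The tree shows which hop caused which, and where the request branched; the correlation grouping only pairs a request with its response.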

TraceId and SpanId are serialized into the traceparent header; the companion tracestate header carries additional vendor-specific data.

traceparent Header:

The traceparent header is part of the W3C Trace Context standard and carries the trace context between services. It contains the TraceId, SpanId, and other metadata.

The format of traceparent is:

00-<trace-id>-<span-id>-<trace-flags>
  • TraceId (<trace-id>): A globally unique identifier for the entire trace.

  • SpanId (<span-id>): A unique identifier for a single operation or unit of work within the trace.

  • TraceFlags (<trace-flags>): A required 8-bit field, hex-encoded as two characters, that carries trace-related flags, such as whether the caller sampled (recorded) the trace.

Example traceparent Header:

traceparent: 00-4bf92f3577b34da6a4f1e1f3b98d5f47-00f067aa0ba902b7-01
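
A minimal Python sketch of parsing such a value might look like the following. It only accepts version 00, and the names TraceParent and parse_traceparent are hypothetical; the field lengths and the rule that all-zero IDs are invalid follow the W3C Trace Context Recommendation.

```python
import re
from typing import NamedTuple, Optional


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    trace_flags: int

    @property
    def sampled(self) -> bool:
        # The least significant bit of trace-flags is the "sampled" flag.
        return bool(self.trace_flags & 0x01)


# version (2 hex) - trace-id (32 hex) - span-id (16 hex) - trace-flags (2 hex)
_PATTERN = re.compile(r"(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Parse a version 00 traceparent value; return None if it is invalid."""
    match = _PATTERN.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero trace IDs and span IDs are invalid.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceParent(version, trace_id, span_id, int(flags, 16))


parsed = parse_traceparent("00-4bf92f3577b34da6a4f1e1f3b98d5f47-00f067aa0ba902b7-01")
print(parsed)
print(parsed.sampled)
```

Returning None for anything malformed lets a receiver fall back to starting a new trace instead of failing the message.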

tracestate Header:

The tracestate header is an optional field that provides additional vendor-specific or system-specific trace information. It contains metadata related to the trace, such as the specific trace context used by different tracing systems or services. It does not directly serialize TraceId or SpanId, but may contain other contextual information about the trace.

The format of tracestate is:

tracestate: <key1>=<value1>,<key2>=<value2>
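
The key/value list format could be handled with a sketch like the one below. The function names are hypothetical, and silently skipping malformed entries is a simplification rather than what the specification prescribes; the one rule taken from W3C Trace Context is that an added or modified key moves to the front of the list.

```python
from typing import List, Tuple

Entries = List[Tuple[str, str]]


def parse_tracestate(value: str) -> Entries:
    """Split a tracestate value into ordered (key, value) pairs."""
    entries = []
    for member in value.split(","):
        member = member.strip()
        if not member:
            continue  # skip empty list members
        key, separator, item = member.partition("=")
        if not separator or not key or not item:
            continue  # simplification: ignore malformed entries
        entries.append((key, item))
    return entries


def update_tracestate(entries: Entries, key: str, value: str) -> Entries:
    """Add or modify a key; the touched entry moves to the front of the list."""
    return [(key, value)] + [(k, v) for k, v in entries if k != key]


def format_tracestate(entries: Entries) -> str:
    """Serialize the pairs back into a tracestate value."""
    return ",".join(f"{key}={value}" for key, value in entries)


entries = parse_tracestate("key1=value1,key2=value2")
updated = update_tracestate(entries, "key2", "changed")
print(format_tracestate(updated))
```

Keeping the entries as an ordered list rather than a dict makes the ordering rule explicit, since order carries meaning in tracestate.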
