-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Please revisit decision to emit an event for every message #1621
Comments
Logging all messages as one event comes with the following limitations:
Squashing all messages into one reduces visual noise when using non-GenAI aware tools, but results in degraded experience for tools that do need to understand content of these messages (tools that render chat history, run evaluations, query these events in custom ways). If users use the tool that's not showing chat history nicely, they probably don't care about the chat history and can disable those events or send them to low-cost destination. The fact that chat history needs to be sent over and over again only shows how terribly inefficient existing GenAI solutions are and I don't think we should hide it. [Update] I'd expect any distributed application that does anything complicated to generate 10-100+ info logs for each trace even if it has nothing to do with AI. Enabling debug logs would add another order of magnitude. Existing tools don't show it nicely, that's true. |
That's the nature of the system; they do have different structure, and they are heterogenous.
For non-streaming, that's not the case: the completion is logically all created at a single moment in time and shows up as a single response payload. For streaming, sure, but the spec already doesn't reflect the nature of streaming, e.g. you could get two tokens at very different times, but there's no notion of that in the spec, which necessitates that a streaming response be composed back into manufactured messages.
I'm not following. The system needs to have a mapping of what it would emit at various verbosity levels. If it's emitting individual messages, it needs to emit them at that level. If it's emitting them altogether, it can simply query for the current verbosity level (which any respectible system does anyway before logging) and choose to compose only the pieces that are relevant at that verbosity level. Any options to configure this are relevant to both approaches.
I disagree. You log what you send/receive. In the case of chat completion, what you send is a list of messages. In the case of assistants, what you send is a set of messages, it just needn't be the complete set because it may be augmented with a further set that already exists in storage. The telemetry would include what you send and receive, in both cases.
I do not see how it degrades the experience. In both cases, the tooling needs to understand the format and the data.
I'm a user. I care about chat history. My tool is not showing it nicely.
How is it hiding it? In both cases, all the content is logged. It's just a question of whether it's logged in a way that reflects reality (they're physically and logically all part of the same prompt), or whether the telemetry is manufacturing boundaries that don't exist.
100%. All the more reason to not introduce artificial boundaries where they don't exist. If I have a sentence I want to log: "This is a test of the emergency broadcast system", can you imagine if we recommended that be logged as 9 distinct entries? logger.LogInformation("This ");
logger.LogInformation("is ");
logger.LogInformation("a ");
logger.LogInformation("test ");
logger.LogInformation("of ");
logger.LogInformation("the ");
logger.LogInformation("emergency ");
logger.LogInformation("broadcast ");
logger.LogInformation("system"); yet that's exactly what this design is suggested be done for the prompt to the LLM, splitting the prompt into pieces that are then logged individually, even though it's all just part of the same input and is not interpretable separately. |
Let me tackle things one-by-one
Different structure is a good signal that messages should be in separate events. The more complicated the structure is, the harder it is for the end users to parse it. Tooling suffers from it too, but I'm less worried about tools. Imagine I'm writing a query that would return all completions with logs | where name == "gen_ai.choice" | extend content = parse_json(body) | where content["finish_reason"] != "stop" I can spend 15-45 mins to write a query that'd do the same for events like one below. I've done it a few times in the past and it's not easy. The hint is to use mv-expand if I recall correctly {
"name": "gen_ai.content",
"body": {
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
],
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "\n\nHello there, how may I assist you today?"
},
"logprobs": null,
"finish_reason": "stop"
}
]
}
} I encourage you to try it yourself. You can say that it's hard to do in KQL and you might be right. Let's compare it to JQ: Now: .logs[]
| select(.name == "gen_ai.choice" and .body.content.finish_reason != "stop") One event: .logs[]
| select(.name == "gen_ai.content")
| .body.choices[]
| select(.finish_reason != "stop") both are relatively simple, yet I spent 1 min to write the first one, and a few more to write the second one |
This is not the case for agents/assistants where you do add messages one-by-one and agent manages your chat history separately from the completions. Or when you report individual events (too calls) as they occur during thread run - PTAL at the microsoft#3 Also what you're saying that if we log HTTP request and response bodies we should log them together since they occur at the same time and describe the same thing creating visual noise in my tool. |
We can fix the tool or change the telemetry (by making it harder to query and control verbosity), why should we change the telemetry and not the tool? |
But with the second, I get the full content. With the first, you then need to follow it up with yet another query you need to write to extract all the messages that are part of same request or completion. |
If only tool can group and visualize them under spanId... |
Sure it is. You log what you have in the largest unit you have them. |
I don't think we should have different telemetry formats for agents and chat completions just because existing tooling can't show verbose logs nicely. |
Why should the telemetry make it harder for the tool by exposing the data factored at a finer granularity than it actually is in reality? |
You're going to end up with different telemetry, regardless, eg if each turn for completions logs the same message multiple times and only once or zero times for assistants. |
I'm not saying that. The request and response are separate from each other and can be logged separately, just as an LLM prompt can be logged separately from the completion. I'm objecting to artificially splitting the request or artificially spitting the completion, just as I'd object to doing so for HTTP. I don't want a separate log entry for every HTTP header, for example. |
Why is the tool capable of that but not basic search? |
What do you mean by basic search? I can write a relatively simple query that would group logs by the spanid and traceid in KQL or JQ.
I'll end up with different telemetry, but the structure of individual telemetry items will be the same.
Would you put tool definition inside prompt event? I would not because I'd like people who don't want to log tool definitions to be able to disable that event using generic log filtering/config. Also I want to have the same event defining tool call for assistants and for chat completions. So splitting may look artificial, but it does not mean it's bad. If we summarize:
I'm happy to continue the discussion and try looking for options 3, 4, etc, but picking between 1 and 2, I don't see why 1 would be better. |
Having all the data in multiple log messages is going to make analysis tools considerably harder to write - they will need to store all the messages and then run queries to piece together the logs to build the bigger picture. If all the data is in one payload, its much easier to do batch operations or parallel processing as each log entry tells the complete story for that gen_ai call. |
would it mean grouping prompts, tool definitions, attachments, tool calls, tool results, and completions into one uber event that captures everything? The moment you separate prompts from completions into different records, you get a problem of bringing them together. I'd like to challenge assumptions that
|
Is the goal to record the underlying physical reality of the protocol, or is it to record an interpretation of what happened in a form that eases certain queries? I can see arguments both ways. Consumers who are comfortable with the wire format will likely bias towards seeing the "true" shape of the data, whereas consumers who have never seen the wire format but just want answers to high-level queries will likely prefer whatever makes their specific queries more intuitive.
TBH I don't think the pros/cons list is that clear cut. For example, when it comes to ease of querying, you could say:
Ultimately it's hard to argue from first principles that there's a single right answer here. It depends on scenarios. Maybe in the long run it will become clear that tools and consumption scenarios all settle one way or the other. If that's already settled, it's news to me. You could make the "first principles" argument that in the absence of definite evidence that people want to vary the structure from its original shape, it's a safer to preserve the original wire format since that carries some additional level of truth. I don't know how the decision was made to switch to event-per-message so maybe there were a large number of people agreeing that's best in their scenario, but that's not clear from this GitHub issue alone. |
The principles we use to instrument libraries and write semantic conventions that have an impact on this decision are:
There are many others. I encourage you to participate in semconv discussions. There are extensive discussions on the topic - check out #980 and linked issues + we have notes and recordings of calls where we discussed it - all available here https://docs.google.com/document/d/1EKIeDgBGXQPGehUigIRLwAUpRGa7-1kXB736EaYuJ2M/edit?tab=t.0#heading=h.ylazl6464n0c |
Thanks @lmolkova, that's interesting and helps a bit with understanding the thought process. To be clear when I referred to "record the underlying physical reality of the protocol" I meant representing the same conceptual information (e.g., having equivalent concepts and being at the same level of granularity), not the literal bytes.
I guess that is meant quite loosely since you're also implying you don't want to reflect the differences between alternate API patterns (e.g., streaming vs nonstreaming have very different API shapes, likewise there tend to be different APIs for calling a stateless chat endpoint vs a stateful assistant). |
@SteveSandersonMS we record multiple telemetry items for a single API call - logical span/duration metric/events + transport request spans, maybe connection establishment or dns lookup can be done in a scope of that call. I don't think we can have any strict rules on how to record details (and how many details to record) on a single telemetry item for an arbitrary call. I.e. we analyze use-cases and try to manually optimize telemetry items for the most consistency across important use-cases. The use-case as I see it here is "dear model, please generate some content based on this data".
Depending on the scenario, my telemetry will be slightly different. E.g. If I use agents I'd have If as a user I decide to use a different chat completion API while staying within the same use case (chat completion), I'd like my telemetry to change as little as possible (ideally nothing major breaks, I just get a slightly different set of attributes describing the details). For content, the unit that remains almost the same is the message itself. I can create it, pass it by value or reference. If the model reports it, I'd like to know how many tokens were in that message (and if it doesn't, it probably will at some point) and I'd like to have a standard place to get it from in my queries. You're right that it's applied loosely and how it's applied depends on the level of abstraction and what we consider to be the same use-case. It seems to be the core of this discussion. |
commenting here mainly to follow along. I'm one of the maintainers of a few LLM instrumentors (https://github.com/Arize-ai/openinference) as well as an OTEL backend (https://github.com/Arize-ai/phoenix) and have had success with messages being a part of the span attributes payload. It's been a pretty successful direction for us as it's allowed for some helpful debugging flows (evaluations, replay of the invocation, etc.) - I understand that this mainly product related an not related to semantic conventions per-se but just wanted to express that I do echo some of the concerns @stephentoub expressed above. Let me know if I can be of any help as we've gone through the implementation of messages as attributes end-to-end and might be able to provide some perspective. |
I've created an issue in the Python repo that might address the issue you are observing. |
The latest (1.28) spec requires emitting an event for every individual message. This is problematic in a variety of ways:
The text was updated successfully, but these errors were encountered: