OpenAI Custom Tracing for Workflows

One of the key problems with agentic workflows is the lack of visibility into what goes on inside them. The OpenAI Agents SDK provides an interface for logging tracing data, so you can observe what your agents and processes do at runtime: which operations run, how long they take, and how they relate to each other. In the previous blog I said I would discuss MCP servers in this one, but before we go there, we will make a stop and build our own custom agent workflow tracer.

The Agents SDK includes built-in tracing, which collects a comprehensive record of an agent’s execution. It captures LLM generations, tool calls, handoffs, triggered guardrails, and a multitude of other events. Tracing is enabled by default, and OpenAI logs all executions to the cloud, where you can review them from your account.

Let’s go over a few definitions first.

Key Concepts

Trace: a logical unit of work from start to finish (for example, “handle incoming user request” or “agent-run:plan+execute”). A trace contains one or more spans and has a unique trace id.
Span: a time-bounded operation within a trace. Spans are nested or causally related (parent/child). Each span has:
- span id (unique)
- parent span id (optional)
- name (human readable)
- start time, end time (or duration)
- attributes/metadata (key/value pairs)
- status (ok, error, cancelled)
TracingProcessor: a pluggable handler in the Agent SDK that receives tracing events (span start, span finish, attributes) so you can export or record them.

So a trace can be thought of as one run of a workflow, and spans are the steps within it. One unit of execution is bracketed by trace start and trace end events. Within each trace we get spans, which carry information about the individual steps.
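
To make the relationship concrete, here is a minimal sketch of one trace holding a few spans as plain Python dataclasses. The names TraceRecord and SpanRecord are my own illustration and are not SDK types; the only point is the parent/child linkage between spans.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpanRecord:
    # Hypothetical record type for illustration; not an SDK class.
    span_id: str
    name: str
    parent_id: Optional[str] = None  # None means the span sits at the top of the trace
    status: str = "ok"


@dataclass
class TraceRecord:
    # Hypothetical record type for illustration; not an SDK class.
    trace_id: str
    name: str
    spans: list = field(default_factory=list)

    def children_of(self, span_id: Optional[str]) -> list:
        """Return the spans whose parent is the given span id."""
        return [s for s in self.spans if s.parent_id == span_id]


trace = TraceRecord(trace_id="trace_1", name="handle incoming user request")
trace.spans.append(SpanRecord("span_1", "agent run"))
trace.spans.append(SpanRecord("span_2", "LLM generation", parent_id="span_1"))
trace.spans.append(SpanRecord("span_3", "tool call", parent_id="span_1"))

root_names = [s.name for s in trace.children_of(None)]
child_names = [s.name for s in trace.children_of("span_1")]
```

Asking for the children of None gives the top-level spans, and asking for the children of a span id gives the steps nested under it.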

Span Types

Based on the documentation, I could see the following span types. The definitions for each of them were generated by GPT.

Agent Span
- AgentSpanData represents an agent’s execution span in a trace, including its name, handoffs, tools, and output type.
Function Span
- FunctionSpanData refers to a data structure that captures information about the execution of function calls, including performance metrics and context for debugging or monitoring purposes. It helps developers understand how functions are being executed within the SDK.
Generation Span
- GenerationSpanData represents a traceable unit of work that records details of a model’s text generation step, including the input prompts, the generated output, the model used, its configuration, and usage metrics.
Turn Span
- TurnSpanData represents a single turn in an agent’s loop, capturing details about the agent’s actions during that turn. It is used for processing and monitoring traces and spans within the OpenAI Agents system.
Response Span
- ResponseSpanData refers to a structure that captures specific segments of a model’s output, allowing developers to analyze or manipulate parts of the response more effectively. This feature enhances the handling of generated text by providing detailed information about the response’s structure and content.
Handoff Span
- HandoffSpanData refers to a structure that captures information about the handoff process between agents, including details about the task being transferred and the context of the interaction. This data helps in tracking and managing the flow of tasks in multi-agent systems.
Guardrails Span
- GuardrailSpanData is a tracing data structure that records details about guardrail checks during agent execution, including input/output validation events and any tripwire triggers that block unsafe actions.
Transcription Span
- TranscriptionSpanData refers to the structured data that represents segments of transcribed audio, including details like the start and end times of each segment. This allows developers to manage and analyze audio transcriptions more effectively.
Speech Span
- SpeechSpanData refers to a data structure that captures specific segments of speech within audio input, often used for tasks like transcription or analysis. It helps in identifying and processing portions of audio more effectively.
Speech Group Span
- SpeechGroupSpanData in the OpenAI SDK refers to a data structure used to manage and track the processing of speech-related tasks, such as speech recognition and generation, within the SDK’s framework. It helps in organizing and optimizing the handling of audio inputs and outputs during interactions.
MCP List Tools Span
- MCPListToolsSpanData refers to structured tracing data generated when listing available tools from an MCP server, used for debugging and observability in agent workflows.
Custom Span
- Everything else

Building a Custom Tracer

We will build a processor that can create a JSON file at the end of execution. This should have sufficient information to trace the flow. It can easily be changed to stream events, log to a database, or whatever else is necessary, but for now we will limit ourselves to a simple flow.

Like I mentioned earlier, we will have to inherit from TracingProcessor. Most of the other classes we need reside in the agents.tracing package.

import datetime
import json

from agents import TracingProcessor
from agents.tracing import (AgentSpanData, FunctionSpanData, GenerationSpanData, ResponseSpanData, HandoffSpanData,
        CustomSpanData, GuardrailSpanData, TurnSpanData,
        TranscriptionSpanData, SpeechSpanData, SpeechGroupSpanData, MCPListToolsSpanData)

class LocalTracingProcessor(TracingProcessor):
  """
  A tracing processor that builds a JSON flow. Traces contain spans (one trace -> many spans).
  On trace end the trace is serialized to JSON (printed or can be saved).
  """
  def __init__(self):
    # store trace dicts keyed by trace_id
    self.active_traces = {}
    # store raw span objects keyed by span_id until they end
    self.active_spans = {}

With that initialization in place, we are ready to start collecting the logs into those dictionaries.

Capturing Trace

def on_trace_start(self, trace):
  """
  Create a trace dict and store it keyed by trace_id. Spans will be attached as they end.
  """
  tid = getattr(trace, "trace_id", None)
  if tid is None:
    # generate key if missing
    tid = id(trace)
  # always register the trace, whether the id came from the trace or was generated
  self.active_traces[tid] = self._make_trace_dict(trace)

def on_trace_end(self, trace):
  """
  Finalize the trace dict, serialize to JSON, and cleanup. Spans should have been attached by now.
  """
  tid = getattr(trace, "trace_id", None) or id(trace)
  tdict = self.active_traces.get(tid)
  if tdict is None:
    # create if missing
    tdict = self._make_trace_dict(trace)
    self.active_traces[tid] = tdict
  # stamp the end time on every trace, not only on the ones created just above
  tdict["ended_at"] = datetime.datetime.now(datetime.timezone.utc).isoformat()
  # serialize JSON flow for this trace
  print(json.dumps(tdict, indent=2, sort_keys=False, default=str), flush=True)
  # cleanup
  del self.active_traces[tid]

When the trace starts, we will just capture the trace ID and initialize a custom trace log object. We have created a private function _make_trace_dict to do this. Let’s see the implementation.

def _make_trace_dict(self, trace):
  """
  Create a dict for a trace with basic info. Spans will be attached later.
  """
  return {
    "trace_id": getattr(trace, "trace_id", None),
    "name": getattr(trace, "name", None),
    "workflow_name": getattr(trace, "workflow_name", None),
    "started_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "ended_at": None,
    "attributes": getattr(trace, "attributes", {}) or {},
    "spans": []
  }

That’s it. We have a trace object that initializes an empty span list ready to be filled up.
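
on_trace_end above prints the JSON to the console. If you would rather end up with a file per trace, a helper along these lines could be called in place of the print. This is only a sketch: write_trace_json is a hypothetical name of mine, and the file naming scheme (trace id plus .json) is an assumption.

```python
import json
import os
import tempfile


def write_trace_json(tdict, directory):
    """Write one trace dict to <directory>/<trace_id>.json and return the path."""
    # Fall back to a fixed name when the trace carries no id.
    trace_id = tdict.get("trace_id") or "unknown_trace"
    path = os.path.join(directory, f"{trace_id}.json")
    with open(path, "w", encoding="utf-8") as fh:
        # default=str keeps non-JSON values (datetimes, SDK objects) from raising.
        json.dump(tdict, fh, indent=2, default=str)
    return path


# Usage with a hand-made trace dict shaped like the one _make_trace_dict returns.
sample = {"trace_id": "trace_demo", "name": "demo", "spans": []}
out_dir = tempfile.mkdtemp()
saved_path = write_trace_json(sample, out_dir)
with open(saved_path, encoding="utf-8") as fh:
    loaded = json.load(fh)
```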

Capturing Spans

Again, we start by defining the primary functions that capture spans. Then we will look at each of the private functions.

def on_span_start(self, span):
  """
  Store a span until it ends.
  """
  # store raw span until it ends
  sid = getattr(span, "span_id", None) or id(span)
  self.active_spans[sid] = span
  
def on_span_end(self, span):
  """
  Build a span dict, attach to its trace, and cleanup stored span.
  """
  sid = getattr(span, "span_id", None) or id(span)
  raw = self.active_spans.get(sid)
  if raw is None:
    return
  # build span dict
  span_type, payload = self._span_type_and_payload(raw)
  parent_id = self._get_span_parent_id(raw)
  span_dict = {
    "span_id": getattr(raw, "span_id", None),
    "parent_id": parent_id,
    "name": payload.get("name", getattr(raw, "name", None)),
    "type": span_type,
    "started_at": getattr(raw, "started_at", None),
    "ended_at": getattr(raw, "ended_at", None),
    "payload": payload,
    "attributes": getattr(raw, "attributes", {}) or {}
  }
  # attach to its trace
  trace_id = self._get_span_trace_id(raw) or getattr(raw, "trace_id", None) or getattr(raw, "trace", None)
  if hasattr(trace_id, "trace_id"):
    trace_id = getattr(trace_id, "trace_id")
  # Can do additional validation for the presence of trace ID
  # If not present, or not active, design alternatives
  # For most scenarios, we will have trace ID, so add this span
  self.active_traces[trace_id]["spans"].append(span_dict)
  # cleanup stored span
  del self.active_spans[sid]
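
The comments in on_span_end leave open what to do when the trace ID is missing or the trace is no longer active. One possible alternative, sketched here as a standalone function, is to park such spans in a separate list instead of letting the dictionary lookup raise a KeyError. attach_span and orphan_spans are hypothetical names of mine, not part of the SDK.

```python
def attach_span(active_traces, orphan_spans, trace_id, span_dict):
    """Append span_dict to its trace, or park it if the trace is not active.

    Returns True when the span was attached to an active trace.
    """
    tdict = active_traces.get(trace_id)
    if tdict is None:
        # Trace id missing or trace already closed: keep the span for later inspection.
        orphan_spans.append(span_dict)
        return False
    tdict["spans"].append(span_dict)
    return True


active_traces = {"trace_1": {"trace_id": "trace_1", "spans": []}}
orphan_spans = []

attached = attach_span(active_traces, orphan_spans, "trace_1", {"span_id": "span_1"})
parked = attach_span(active_traces, orphan_spans, "trace_gone", {"span_id": "span_2"})
```

Inside the processor, orphan_spans would be one more attribute set in __init__, and shutdown could dump it along with any remaining traces.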

To get the trace ID from the span, we try the trace_id attribute first and then a few likely alternatives. Here is the minimal code for that:

def _get_span_trace_id(self, span):
  """
  Attempt to find the trace id for a span by checking common attributes and nested objects.
  """
  # Try common locations for trace id/pointer
  for attr in ("trace_id", "trace", "parent_trace_id", "traceId"):
    val = getattr(span, attr, None)
    if val:
      # if it's an object with trace_id
      if hasattr(val, "trace_id"):
        return getattr(val, "trace_id")
      return val
  # no trace id found on this span
  return None
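
on_span_end also calls _get_span_parent_id, which is not listed in this post. A minimal sketch of what it could look like follows, assuming the span exposes its parent through a parent_id attribute or through a parent object that carries span_id. The attribute names tried here are guesses in the same spirit as _get_span_trace_id. It is written as a plain function so it runs standalone; inside the processor it would be a method taking self.

```python
from types import SimpleNamespace


def get_span_parent_id(span):
    """Return the parent span id, or None for a top-level span."""
    # Assumed attribute names; adjust to whatever your SDK version exposes.
    for attr in ("parent_id", "parent_span_id", "parent"):
        val = getattr(span, attr, None)
        if val:
            # If we got a parent span object rather than an id, unwrap it.
            if hasattr(val, "span_id"):
                return getattr(val, "span_id")
            return val
    return None


# Stand-in objects for the demo; real spans come from the SDK.
child = SimpleNamespace(span_id="span_2", parent_id="span_1")
nested = SimpleNamespace(span_id="span_3", parent=SimpleNamespace(span_id="span_2"))
root = SimpleNamespace(span_id="span_1", parent_id=None)

parent_ids = [get_span_parent_id(s) for s in (child, nested, root)]
```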

Of course, the data is not complete without the payloads. Payloads in this case are the input and response texts. Along with that, we will try to get the name of the agent that triggered the call. For that we will use the following private function.

def _span_type_and_payload(self, span):
  """
  Determine span type and extract relevant payload based on known span data types.
  """
  sd = span.span_data
  payload = {}
  span_type = type(sd).__name__ if sd is not None else "Unknown"
  # extract useful fields per known span types
  if isinstance(sd, AgentSpanData):
    payload["name"] = getattr(sd, "name", None)
  elif isinstance(sd, FunctionSpanData):
    payload["name"] = getattr(sd, "name", None)
  elif isinstance(sd, TurnSpanData):
    payload["name"] = getattr(sd, "name", None)
  elif isinstance(sd, GenerationSpanData):
    payload["model"] = getattr(sd, "model", None)
    payload["input"] = getattr(sd, "input", None)
    payload["output"] = getattr(sd, "output", None)
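  elif isinstance(sd, ResponseSpanData):
    # the attribute name is lowercase
    payload["response"] = getattr(sd, "response", None)
  elif isinstance(sd, HandoffSpanData):
    # field names assumed; getattr keeps this safe if they differ
    payload["from_agent"] = getattr(sd, "from_agent", None)
    payload["to_agent"] = getattr(sd, "to_agent", None)
  ####
  # And a lot of other spans here with information that we want to capture
  ####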
  else:
    # generic fallback: try to convert attributes
    try:
      payload = sd.__dict__
    except Exception:
      payload = {}
  return span_type, payload
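
As more span types get their own branch, the elif chain grows quickly. One alternative is a lookup table from type name to the fields worth copying. The sketch below uses a stand-in class instead of the SDK import so that it runs on its own, and the field names in PAYLOAD_FIELDS are assumptions taken from the branches above; extract_payload and PAYLOAD_FIELDS are hypothetical names.

```python
from types import SimpleNamespace

# Fields to copy per span data type name. The field names are assumptions
# based on the branches above; verify them against your SDK version.
PAYLOAD_FIELDS = {
    "AgentSpanData": ("name",),
    "FunctionSpanData": ("name",),
    "GenerationSpanData": ("model", "input", "output"),
}


def extract_payload(span_data):
    """Return (type_name, payload) using the PAYLOAD_FIELDS lookup table."""
    if span_data is None:
        return "Unknown", {}
    type_name = type(span_data).__name__
    fields = PAYLOAD_FIELDS.get(type_name)
    if fields is None:
        # Unknown type: copy instance attributes if there are any.
        return type_name, dict(getattr(span_data, "__dict__", {}))
    return type_name, {f: getattr(span_data, f, None) for f in fields}


# Stand-in class for the demo; it only mimics the real class name.
class GenerationSpanData(SimpleNamespace):
    pass


demo = GenerationSpanData(model="some-model", input="hi", output="hello", usage=None)
type_name, payload = extract_payload(demo)
```

Matching on the class name rather than isinstance is a trade-off: it avoids importing every span data class, but it will not match subclasses.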

That’s it from the capturing perspective. The only thing that remains is to flush to the console. Again, like I said before, this is a minimalistic version, so we will write to the console and be done.

def shutdown(self):
  """
  Flush any remaining traces and spans.
  """
  # Optionally flush remaining traces as JSON
  for tid, tdict in list(self.active_traces.items()):
    print(json.dumps(tdict, indent=2, sort_keys=False, default=str), flush=True)
    del self.active_traces[tid]
  # drop spans that never ended, once all traces are flushed
  self.active_spans.clear()

def force_flush(self):
  # TracingProcessor also declares force_flush (assumed abstract); nothing is buffered here
  pass

The Testbed and a Visual Representation

At this point we have a JSON representation of the entire workflow. However, reading through the JSON, following each parent link, and visualizing the flow by hand is not easy. I took the easy way out and generated a basic HTML/JavaScript page to visualize it.
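
If you just want a quick look without a web page, the flat span list can be nested by parent_id and printed as an indented outline. This is a minimal sketch and not the visualizer I used; build_span_tree and outline are hypothetical helper names, and the sample spans are made up. Spans are appended as they end, so children usually appear before their parents, which is why the tree is built by first indexing every span and then linking them.

```python
def build_span_tree(spans):
    """Nest a flat list of span dicts into a tree using span_id / parent_id."""
    nodes = {s["span_id"]: {**s, "children": []} for s in spans}
    roots = []
    for node in nodes.values():
        parent = nodes.get(node.get("parent_id"))
        if parent is None:
            # No parent, or the parent is not in this trace: treat as a root.
            roots.append(node)
        else:
            parent["children"].append(node)
    return roots


def outline(nodes, depth=0):
    """Render the tree as indented 'type: name' lines."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + f"{node['type']}: {node['name']}")
        lines.extend(outline(node["children"], depth + 1))
    return lines


# Hand-made spans in the same shape on_span_end produces (ids are made up).
spans = [
    {"span_id": "s2", "parent_id": "s1", "type": "GenerationSpanData", "name": None},
    {"span_id": "s1", "parent_id": None, "type": "AgentSpanData", "name": "MainAgent"},
    {"span_id": "s3", "parent_id": "s1", "type": "FunctionSpanData", "name": "write_audio"},
]
text = "\n".join(outline(build_span_tree(spans)))
```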

I will not put the entire testbed here, but for completeness I will show how this trace processor is used. In our agentic workflow, we have to make sure this class is registered as the trace processor for the flow.

from agents import set_trace_processors

class TestTracer:
    def __init__(self):
      # register our processor with the SDK
      set_trace_processors([LocalTracingProcessor()])

That’s the only line needed to start using our class. set_trace_processors replaces the SDK’s default processors with ours, so all trace and span events are passed to this class.

To test, I created an orchestrator that reads a document and creates a podcast out of it. I will not go over the code, as it just uses two agents as tools and one Python function as a tool. Given below is the instruction for the orchestrator.

You are an orchestrating service that creates a podcast based on a given topic. You will always use the available tools to accomplish this task. You have access to the following tools:
        
        1) KeywordAgent: to extract keywords from the content. It takes a text as input and returns keywords in JSON format.
        2) PodcastAgent: to generate a podcast suggestion based on the topic and keywords. It returns a JSON with 'podcast_name', 'episode_title' and 'podcast_text' fields.
        3) write_audio: to write the podcast text to an audio file. It takes 'podcast_name', 'episode_title' and 'podcast_text' as input and creates an audio file named [podcast_name]_[episode_title].wav with the podcast text. It returns the filename in JSON format.
        
Your task is to read the content, extract keywords using KeywordAgent, generate a podcast suggestion using PodcastAgent, and write the podcast text to an audio file using write_audio. Just return the final response from the main agent, no explanations.

After running this, we can see the workflow primary trace as follows.

However, more interesting is the flow of spans. Let’s see that.

On the page you can click on each one of these links and see further details about the spans. You will be able to see the detailed input/ response and the time taken within each span. This will help in understanding how the agents worked in tandem. From the diagram above, we can see that MainAgent delegated the task of keyword extraction to KeywordAgent. After it got the keywords, PodcastAgent was called to generate the podcast.
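
The time taken within a span can be derived from the started_at and ended_at values stored in each span dict. Here is a minimal sketch, assuming both are ISO 8601 strings with an explicit UTC offset (the format datetime.isoformat() produces); span_duration_seconds is a hypothetical helper and the sample timestamps are made up.

```python
from datetime import datetime


def span_duration_seconds(span_dict):
    """Return ended_at - started_at in seconds, or None if either is missing."""
    start, end = span_dict.get("started_at"), span_dict.get("ended_at")
    if not start or not end:
        return None
    # Assumes ISO 8601 strings with an explicit offset such as +00:00.
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()


span = {
    "name": "KeywordAgent",
    "started_at": "2025-01-01T10:00:00.250000+00:00",
    "ended_at": "2025-01-01T10:00:02.750000+00:00",
}
elapsed = span_duration_seconds(span)
unfinished = span_duration_seconds({"started_at": span["started_at"], "ended_at": None})
```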

Every agent went through FunctionSpanData -> TaskSpanData -> AgentSpanData -> TurnSpanData -> GenerationSpanData. Looking at the flow, it starts to make more sense how these spans relate to each other.

Conclusion

That’s all I have for now. Understanding tracing helps us understand how the agents themselves work. I just wanted to take this short detour; the next blog will be on MCP and invoking it from OpenAI. Ciao for now!
