Background
Kohaku-Lab is a virtual open-source AI research lab led by KohakuBlueleaf (Yeh Shih-Ying, Taiwan). Its core project, KohakuTerrarium, is a multi-agent orchestration framework positioned similarly to LangGraph/CrewAI, but with a stronger emphasis on pluginization and runtime composability.
This article is preliminary research for contributing an OpenTelemetry observability plugin. It maps out the ecosystem architecture and plugin mechanisms, and compares observability practices across major agent frameworks.
Last updated: May 2026 (based on KohakuTerrarium v1.4.0, kt-biome v1.2.0)
1. Ecosystem Overview
The Kohaku-Lab organization has 6 repositories that form a dependency chain centered on KohakuTerrarium. The four covered in this article are:
| Project | Purpose | Language | Relationship to Core |
|---|---|---|---|
| KohakuTerrarium | Multi-agent orchestration framework | Python 3.10+ | The core itself |
| KohakuRAG | Layered RAG engine | Python | Depends on core framework |
| kt-biome | Official plugin pack | Python 3.12+ | Zero dependencies, runtime-mounted |
| KohakuVault | SQLite KV store | Python | Pip dependency of the core |
Key constraint: kt-biome declares zero runtime dependencies (no pip dependencies in pyproject.toml). It installs into the framework via kt install and directly inherits the host framework's capabilities at runtime.
2. KohakuTerrarium Core Architecture (v1.4.0)
2.1 Overall Layering
v1.4.0 introduced several important new modules:
| Module | Purpose | Description |
|---|---|---|
| studio/ | Programmatic API layer | Catalog / Identity / Sessions / Persistence / Editors — wraps the Terrarium engine |
| builtin_skills/ | Built-in skill system | 10 subagent skills + 28 tool skills, Markdown-defined prompt engineering |
| parsing/ | Structured parsing | Events / Format / Patterns / State Machine |
| prompt/ | Prompt engineering | Aggregator / FrameworkHints / SkillLoader / Template / ToolContributions |
| serving/ | Serving infrastructure | process_metrics + web serving |
| sandbox/ | Sandboxed execution | Config / Parse / Presets / Profile / Violations |
| user_command/ | User command protocol | The 7th module protocol (Slash Commands) |
2.2 Composition Algebra
The framework provides 4 composition primitives for assembling Creatures into complex workflows:
| Operator | Semantics | Example |
|---|---|---|
| `>>` | Sequential execution | `researcher >> writer` |
| `&` | Parallel execution | `analyst_a & analyst_b` |
| `\|` | Fan-out to multiple | `dispatcher \| [a, b, c]` |
| `*` | Repeat N times | `reviewer * 3` |
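The four operators in the table can be pictured as ordinary Python operator overloading. The following is a minimal sketch, not the framework's implementation: `Step` and `tag` are hypothetical stand-ins for a Creature, and the reading of `*` as "feed the output back in N times" is an assumption.

```python
import asyncio


class Step:
    """Hypothetical stand-in for a Creature: wraps an async callable."""

    def __init__(self, fn):
        self.fn = fn

    async def run(self, value):
        return await self.fn(value)

    def __rshift__(self, other):
        # a >> b: feed the output of a into b.
        async def seq(value):
            return await other.run(await self.run(value))
        return Step(seq)

    def __and__(self, other):
        # a & b: run both on the same input concurrently.
        async def par(value):
            return list(await asyncio.gather(self.run(value), other.run(value)))
        return Step(par)

    def __or__(self, targets):
        # a | [b, c]: send the output of a to every target.
        async def fan_out(value):
            out = await self.run(value)
            return list(await asyncio.gather(*(t.run(out) for t in targets)))
        return Step(fan_out)

    def __mul__(self, times):
        # a * n: apply a to its own output n times (assumed semantics).
        async def repeat(value):
            for _ in range(times):
                value = await self.run(value)
            return value
        return Step(repeat)


def tag(name):
    """Build a toy step that appends its name to the input string."""
    async def fn(value):
        return f"{value}>{name}"
    return Step(fn)


researcher, writer, reviewer = tag("researcher"), tag("writer"), tag("reviewer")
pipeline = (researcher >> writer) | [reviewer * 2, reviewer]
result = asyncio.run(pipeline.run("task"))
```

Here `result` is a two-element list: the first entry has passed through `reviewer` twice, the second once.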
2.3 Parallel Execution Model
A three-layer async parallel mechanism:
- `&` operator: `asyncio.gather(*branches)` runs all branches in parallel, with latency determined by the slowest branch
- SubAgentManager: `spawn()` creates an `asyncio.Task`, `wait_all()` waits for all to complete (`Promise.all` semantics)
- Terrarium Runtime: Each Creature gets its own `asyncio.Task`, and output wiring uses fire-and-forget dispatch
Execution model: a single-threaded asyncio event loop (cooperative concurrency), with no OS threads or multiprocessing needed. LLM calls are I/O-bound, so they overlap while awaiting responses and achieve true I/O concurrency.
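The `spawn()` / `wait_all()` pattern described above can be sketched with plain asyncio. `MiniSubAgentManager` and `fake_llm_call` are hypothetical names used only for illustration; the real SubAgentManager may differ in both signature and behavior.

```python
import asyncio


class MiniSubAgentManager:
    """Hypothetical manager: spawn() creates tasks, wait_all() gathers them."""

    def __init__(self):
        self._tasks = []

    def spawn(self, coro):
        # Must be called from inside a running event loop.
        task = asyncio.create_task(coro)
        self._tasks.append(task)
        return task

    async def wait_all(self):
        # Like Promise.all: resolves once every task has finished,
        # returning results in spawn order.
        results = await asyncio.gather(*self._tasks)
        self._tasks.clear()
        return list(results)


async def fake_llm_call(name, delay):
    await asyncio.sleep(delay)  # stands in for network I/O
    return name


async def main():
    loop = asyncio.get_running_loop()
    manager = MiniSubAgentManager()
    start = loop.time()
    manager.spawn(fake_llm_call("a", 0.05))
    manager.spawn(fake_llm_call("b", 0.10))
    results = await manager.wait_all()
    return results, loop.time() - start


results, elapsed = asyncio.run(main())
```

Because both sleeps overlap on one event loop, `elapsed` should land near the slower delay rather than the sum of the two.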
2.4 Plugin Hook Lifecycle
This is the primary integration point for the OTEL plugin. In v1.4.0, PluginHook expanded from 11 to 15 hooks:
New hooks:
- on_load / on_unload — Plugin lifecycle management
- pre_tool_dispatch — Pre-dispatch interception for tools (useful for routing control)
- on_compact_start / on_compact_end — Non-blocking context compression events (new Compact system in v1.4.0)
Existing reference implementations:
- event_logger.py — JSONL structured logging, mounts 8 hooks, records token usage
- cost_tracker.py — Token cost accumulation + budget threshold alerts
- otel_metrics.py — Added in v1.2.0, OpenTelemetry Metrics + Tracing (see below)
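To make the hook lifecycle concrete, here is a rough sketch of a plugin that reacts to the five new hooks. The method names mirror the hook names listed above, but the signatures, the `HookCounterPlugin` class, and the `fire` dispatcher are all assumptions made for illustration, not the framework's API.

```python
from collections import Counter


class HookCounterPlugin:
    """Hypothetical plugin: hook names come from the list above,
    signatures are assumed."""

    def __init__(self):
        self.counts = Counter()
        self.loaded = False

    def on_load(self):
        self.loaded = True

    def on_unload(self):
        self.loaded = False

    def pre_tool_dispatch(self, tool_name):
        self.counts[f"tool:{tool_name}"] += 1

    def on_compact_start(self):
        self.counts["compact_started"] += 1

    def on_compact_end(self):
        self.counts["compact_finished"] += 1


def fire(plugins, hook, *args):
    """Toy dispatcher: call `hook` on every plugin that defines it."""
    for plugin in plugins:
        handler = getattr(plugin, hook, None)
        if callable(handler):
            handler(*args)


plugin = HookCounterPlugin()
fire([plugin], "on_load")
fire([plugin], "pre_tool_dispatch", "bash")
fire([plugin], "pre_tool_dispatch", "bash")
fire([plugin], "on_compact_start")
fire([plugin], "on_compact_end")
fire([plugin], "hook_not_implemented")  # made-up name, silently skipped
```

A plugin only needs to define the hooks it cares about, which is how a logging plugin can mount 8 hooks while another mounts fewer.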
3. kt-biome Plugin Pack Structure (v1.2.0)
kt-biome is the framework's official extension pack and the target repository for the OTEL plugin (confirmed by maintainers on Issue #45).
v1.2.0 updates:
| Category | Change | New Items |
|---|---|---|
| Creatures | 9 → 10 | bounded_general (max_iterations: 50) |
| Plugins | 11 → 12 | otel_metrics, multimodal_guard, seamless_memory, termination_goal, pev_verifier, context_files, family_guidance |
| Tools | Unchanged | bash_docker (container execution), bash_ssh (remote execution) |
| I/O | 5 → 4 | telegram (input + output) |
| Triggers | Unchanged | cron (full cron expression support) |
| Skills | Updated | git-commit-flow, pdf-merge, todo-file |
Creature inheritance: general is the base (24 tools / 6 sub-agents); swe, researcher, etc. inherit and extend it. bounded_general adds an iteration cap.
Terrarium topology example (swe_team):
3.1 Existing OTEL Plugin: otel_metrics.py
kt-biome v1.2.0 already includes an OpenTelemetry plugin implementation (kt_biome/plugins/otel_metrics.py). Key design decisions:
- Metrics: 16 Counters + 7 Histograms, DELTA temporality
- Tracing: Optional, exported via OTLP HTTP
- Lazy loading: `try: from opentelemetry... except ImportError`, so there is no hard dependency
- Configuration: kohaku.yaml options (service_name, endpoint, export_interval)
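The lazy-loading decision can be sketched as follows. This is an illustration of the general pattern under stated assumptions, not the contents of otel_metrics.py: `NoOpCounter`, `make_counter`, and the meter name are hypothetical, while `get_meter`, `create_counter`, and `add` are calls from the OpenTelemetry Python metrics API.

```python
try:
    # Optional dependency: only used when the package is installed.
    from opentelemetry import metrics as otel_metrics
except ImportError:
    otel_metrics = None


class NoOpCounter:
    """Fallback used when the OpenTelemetry API is not installed."""

    def add(self, amount, attributes=None):
        pass


def make_counter(name, meter_name="kt_biome.otel_metrics"):
    """Return a real OTel counter when available, otherwise a no-op.

    The meter name is a placeholder chosen for this sketch.
    """
    if otel_metrics is None:
        return NoOpCounter()
    return otel_metrics.get_meter(meter_name).create_counter(name)


llm_calls = make_counter("kt.llm.calls")
llm_calls.add(1, {"model": "example-model"})  # safe in both cases
```

Call sites stay identical whether or not the SDK is present, which is what lets a zero-dependency package ship an OTEL plugin.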
| Metric Type | Example Names | Purpose |
|---|---|---|
| Counter | kt.llm.calls / kt.llm.tokens / kt.tool / kt.subagent / kt.compact | Call counts |
| Histogram | kt.llm.duration / kt.tool.duration / kt.agent.session.duration | Latency distribution |
This means the basic OTEL support from Issue #45 has already been implemented. Future contribution directions include: adopting OTel GenAI gen_ai.* semantic conventions, enhancing tracing, and integrating with observability platforms.
4. Agent Observability: Industry Consensus
The OTEL plugin design does not start from scratch: the major frameworks have already converged on a common approach. Here's a comparison of their observability practices.
4.1 Framework Architecture Comparison
| Framework | Agent Loop Model | Sub-agents | Observability Approach | Language |
|---|---|---|---|---|
| Claude Code | Async Generator state machine | Recursive query(), depth ≤ 1 | Native OTel (Traces + Metrics + Logs) | TypeScript |
| OpenAI Agents SDK | ReAct Loop | Handoff mechanism | Built-in TraceProvider, auto-collection | Python |
| AutoGen (Microsoft) | Message-passing loop | Multi-agent conversation | Native OTel, uses gen_ai.* directly | Python |
| CrewAI | Sequential/Hierarchical | Task delegation | EventBus Listener + OTel Exporter | Python |
| OpenCode (sst) | ReAct for loop | Tool-based, synchronous blocking | PubSub Broker + JSON debug logs | Go |
| KohakuTerrarium | Creature Controller Loop | Vertical + horizontal SubAgent | Plugin Hooks + Observer + otel_metrics | Python |
4.2 Representative Architecture Deep Dives
Claude Code: State Machine + Native OTel
Claude Code's core is an async generator function of roughly 1,700 lines in query.ts. Every interaction goes through the same code path, and each loop iteration reconstructs an immutable state.
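The shape of such a loop can be illustrated with a Python analogue. This is only a sketch of the pattern (an async generator that yields a freshly built immutable state each turn); `LoopState`, `fake_model`, and the stopping rule are invented for the example and do not reflect Claude Code's actual TypeScript source.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class LoopState:
    turn: int = 0
    messages: Tuple[str, ...] = ()
    done: bool = False


async def fake_model(state):
    # Hypothetical model: asks for tools twice, then answers.
    await asyncio.sleep(0)
    return "final answer" if state.turn >= 2 else f"tool_call_{state.turn}"


async def query(prompt):
    state = LoopState(messages=(prompt,))
    while not state.done:
        reply = await fake_model(state)
        # Build a fresh state rather than mutating the old one.
        state = replace(
            state,
            turn=state.turn + 1,
            messages=state.messages + (reply,),
            done=reply == "final answer",
        )
        yield state


async def collect(prompt):
    return [state async for state in query(prompt)]


states = asyncio.run(collect("hello"))
```

Each yielded state is a natural point to emit a telemetry event, since the previous state is never modified after the fact.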
Three aspects of its observability design are particularly noteworthy:
- Three-signal approach: Metrics (token/cost), Log Events (structured logs per prompt/tool/error), Traces (Beta, span hierarchy: interaction → llm_request → tool → hook)
- W3C Trace Context propagation: Child processes inherit trace context, enabling cross-process distributed tracing
- Hook system: 9 event types (PreToolUse / PostToolUse / SubagentStop, etc.), where hooks themselves are external processes (shell/HTTP/LLM)
Known limitation: Hook execution itself is not covered by OTel tracing (Issues #44983, #50287).
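The W3C Trace Context propagation mentioned above relies on a `traceparent` value of the form `version-traceid-spanid-flags`. The sketch below builds and derives such values in plain Python. Passing it to a child process through a `TRACEPARENT` environment variable is a common convention and an assumption here, not something the source confirms for Claude Code; the helper names are hypothetical.

```python
import os
import re
import secrets

# version (2 hex) - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace with random trace and span ids."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(parent):
    """Keep the trace id and flags, mint a new span id for the child."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    version, trace_id, _parent_span, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def child_env(parent, base_env=None):
    """Environment for a subprocess that should join the same trace."""
    env = dict(os.environ if base_env is None else base_env)
    env["TRACEPARENT"] = parent  # assumed variable name
    return env


parent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
child = child_traceparent(parent)
env = child_env(child, base_env={})
```

The child span shares the parent's trace id, which is what lets a backend stitch spans from separate processes into one trace.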
OpenCode: ReAct + PubSub
OpenCode (archived, migrated to charmbracelet/crush) uses a classic ReAct loop. Sub-agents exist as Tools with synchronous blocking calls and read-only access.
The four agent types (coder / summarizer / task / title) are flat string constants, not a hierarchical structure. Observability relies on a Go generics-based Broker[T] event bus + structured logging, with no OTel integration.
4.3 OpenTelemetry GenAI Semantic Conventions
OTel has released experimental GenAI semantic conventions (the gen_ai.* namespace), which are becoming the de facto standard:
| Span Type | Name | Purpose |
|---|---|---|
| Agent | invoke_agent | One span per agent invocation |
| LLM | chat | One span per model call |
| Tool | execute_tool | One span per tool execution |
Core attributes:
- `gen_ai.operation.name`: Operation type
- `gen_ai.request.model`: Model identifier
- `gen_ai.agent.name` / `gen_ai.agent.id`: Agent identifier
- `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens`: Token usage
Core metrics:
- `gen_ai.client.token.usage` (Histogram): Token consumption
- `gen_ai.client.operation.duration` (Histogram): Operation duration
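As a small illustration, the core attributes above can be assembled into a plain dictionary before being attached to a span. The attribute keys are the `gen_ai.*` names listed in this section; the `llm_span_attributes` helper and its parameters are hypothetical.

```python
def llm_span_attributes(model, input_tokens, output_tokens,
                        agent_name=None, agent_id=None):
    """Build gen_ai.* attributes for one model call (a "chat" span)."""
    attrs = {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    # Agent identity is optional: only set what the caller knows.
    if agent_name is not None:
        attrs["gen_ai.agent.name"] = agent_name
    if agent_id is not None:
        attrs["gen_ai.agent.id"] = agent_id
    return attrs


attrs = llm_span_attributes(
    "example-model", input_tokens=120, output_tokens=30, agent_name="researcher"
)
```

Keeping attribute construction in one function makes it easier to track the conventions while they remain experimental.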
4.4 Agent-as-Trace Pattern (Emerging Consensus)
Frameworks are converging on the same pattern:
A single agent invocation = one distributed Trace containing multiple nested Spans (Agent → LLM → Tool → LLM → ...). Token usage is recorded at the LLM Span level and aggregated at the Trace level.
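The pattern can be modeled without any tracing library. In this sketch the `Span` class is a hypothetical stand-in for a real span object; it shows token usage recorded on the LLM spans and summed up at the root.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    input_tokens: int = 0
    output_tokens: int = 0
    children: List["Span"] = field(default_factory=list)

    def child(self, name, **tokens):
        """Create a nested span and attach it to this one."""
        span = Span(name, **tokens)
        self.children.append(span)
        return span

    def total_tokens(self):
        """Own tokens plus everything recorded beneath this span."""
        own = self.input_tokens + self.output_tokens
        return own + sum(c.total_tokens() for c in self.children)


# One agent invocation: Agent -> LLM -> Tool -> LLM
root = Span("invoke_agent")
root.child("chat", input_tokens=120, output_tokens=30)
root.child("execute_tool")
root.child("chat", input_tokens=200, output_tokens=50)
trace_total = root.total_tokens()
```

Only the `chat` spans carry token counts here; the agent and tool spans contribute structure and timing.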
4.5 Observability Platform Comparison
| Platform | Protocol | Data Model | OTel Relationship |
|---|---|---|---|
| Langfuse | Native OTLP | Trace → Observation (7 types) | Direct integration |
| Phoenix (Arize) | OTLP + OpenInference | Span-based | Translation layer: GenAI → OpenInference |
| LangSmith | Custom + OTel export | @traceable decorator | One-way export, has context propagation gaps |
| W&B Weave | Custom + OTel export | @weave.op, auto-patches 30+ providers | Export-compatible |
The two competing semantic convention sets (OTel gen_ai.* vs OpenInference openinference.*) are converging — Phoenix has built a translation layer, and Langfuse accepts OTLP directly.
5. Key Takeaways
| Dimension | Conclusion |
|---|---|
| Basic OTEL support | Already implemented (otel_metrics.py) — Metrics + optional Tracing |
| Integration approach | Implement PluginBase + register 15 PluginHook callbacks |
| Reference implementations | otel_metrics.py (OTEL) and event_logger.py (JSONL logging) |
| Dependency constraint | Lazy-loading pattern — OTEL SDK is an optional dependency |
| Data sources | 15 PluginHooks — sufficient for building complete Traces + Metrics |
| Semantic standards | Should adopt OTel GenAI gen_ai.* semantic conventions (current implementation uses kt.* prefix) |
| Architecture pattern | Agent-as-Trace (industry consensus) |
Future contribution directions:
- Evaluate gaps between the existing otel_metrics.py implementation and OTel GenAI semantic conventions
- Consider migrating from the `kt.*` prefix to the standard `gen_ai.*` prefix
- Enhance Tracing capabilities: complete Span chains (Agent → LLM → Tool → SubAgent)
- Integrate with observability platforms (Langfuse / Phoenix / Jaeger, etc.)
- Validate end-to-end trace chain integrity on the swe_team terrarium
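A first pass at the gap evaluation could start from a simple name mapping. The pairing below is a proposal made for this sketch, not an established migration path: only two of the `kt.*` metrics named earlier have an obvious `gen_ai.*` counterpart among the core metrics, and `gap_report` is a hypothetical helper.

```python
# Proposed pairing (an assumption). Note that kt.llm.tokens is a Counter
# while gen_ai.client.token.usage is a Histogram, so the instrument type
# would change as well as the name.
KT_TO_GENAI = {
    "kt.llm.tokens": "gen_ai.client.token.usage",
    "kt.llm.duration": "gen_ai.client.operation.duration",
}


def gap_report(metric_names):
    """Split metric names into those with a mapping and those without."""
    mapped = {n: KT_TO_GENAI[n] for n in metric_names if n in KT_TO_GENAI}
    unmapped = [n for n in metric_names if n not in KT_TO_GENAI]
    return mapped, unmapped


current = [
    "kt.llm.calls", "kt.llm.tokens", "kt.tool", "kt.subagent", "kt.compact",
    "kt.llm.duration", "kt.tool.duration", "kt.agent.session.duration",
]
mapped, unmapped = gap_report(current)
```

The `unmapped` list is the interesting output: those metrics either stay under a framework-specific prefix or need a closer look at the conventions.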