perf(runtime): Hermes Phase 1-3 — prompt caching + parallel tools + smart retry

Phase 1: Anthropic prompt caching
- Add cache_control ephemeral on system prompt blocks
- Track cache_creation/cache_read tokens in CompletionResponse + StreamChunk

Phase 2A: Parallel tool execution
- Add ToolConcurrency enum (ReadOnly/Exclusive/Interactive)
- JoinSet + Semaphore(3) for bounded parallel tool calls
- 7 tools annotated with correct concurrency level
- AtomicU32 for lock-free failure tracking in ToolErrorMiddleware
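
The bounded-parallelism idea in Phase 2A can be illustrated with a minimal, std-only sketch. It is not the code from this commit: the commit uses tokio's JoinSet and Semaphore, while this stand-in uses scoped threads and a hand-rolled counting semaphore. The names `Semaphore`, `run_bounded`, `in_flight` and `peak` are hypothetical. It only shows that a permit limit of 3 caps the number of tasks in flight, and that AtomicU32 counters can be updated without a lock.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (hypothetical stand-in for tokio's Semaphore).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Return a permit and wake one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Run `jobs` dummy tasks with at most `limit` in flight.
/// Returns the peak number of tasks observed running at once.
fn run_bounded(jobs: usize, limit: usize) -> u32 {
    let sem = Semaphore::new(limit);
    let in_flight = AtomicU32::new(0);
    let peak = AtomicU32::new(0);

    thread::scope(|s| {
        for _ in 0..jobs {
            s.spawn(|| {
                sem.acquire();
                // Lock-free bookkeeping, in the spirit of the AtomicU32 counter.
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // stand-in for tool work
                in_flight.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            });
        }
    });

    peak.load(Ordering::SeqCst)
}
```

Calling `run_bounded(8, 3)` starts eight tasks but never reports a peak above 3, because each task counts itself only while it holds a permit.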

Phase 2B: Tool output pruning
- prune_tool_outputs() trims old ToolResult > 2000 chars to 500 chars
- Integrated into CompactionMiddleware before token estimation
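
A rough sketch of the pruning rule described in Phase 2B, under stated assumptions: the `Message` enum and the `keep_recent` parameter are hypothetical, since the commit message does not say how "old" is decided or what the real message type looks like. Only the 2000 and 500 character limits come from the commit message.

```rust
/// Hypothetical message type; the real one in the codebase will differ.
#[derive(Debug, Clone, PartialEq)]
enum Message {
    User(String),
    Assistant(String),
    ToolResult(String),
}

const PRUNE_THRESHOLD_CHARS: usize = 2000; // from the commit message
const PRUNE_KEEP_CHARS: usize = 500; // from the commit message

/// Trim oversized tool results in every message except the last `keep_recent`
/// (assumed definition of "old"). Returns how many messages were trimmed.
fn prune_tool_outputs(messages: &mut [Message], keep_recent: usize) -> usize {
    let cutoff = messages.len().saturating_sub(keep_recent);
    let mut pruned = 0;
    for msg in &mut messages[..cutoff] {
        if let Message::ToolResult(text) = msg {
            if text.chars().count() > PRUNE_THRESHOLD_CHARS {
                // Count and cut by chars, not bytes, so multi-byte UTF-8
                // text is never split in the middle of a character.
                let head: String = text.chars().take(PRUNE_KEEP_CHARS).collect();
                *text = head;
                pruned += 1;
            }
        }
    }
    pruned
}
```

Running this before token estimation, as the commit describes, means the estimate already reflects the trimmed history, so compaction may trigger less often.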

Phase 3: Error classification + smart retry
- LlmErrorKind + ClassifiedLlmError for structured error mapping
- RetryDriver decorator with jittered exponential backoff
- Kernel wraps all LLM calls with RetryDriver
- CONTEXT_OVERFLOW recovery triggers emergency compaction in loop_runner
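
The retry behaviour in Phase 3 can be sketched as follows. This is an illustration, not the RetryDriver from this commit. The set of retryable kinds, the `BASE` and `CAP` values, the "half fixed, half random" jitter formula, and the function names are all assumptions. The enum lists only a subset of the LlmErrorKind variants. Randomness and sleeping are passed in as closures so the sketch needs no external crates.

```rust
use std::time::Duration;

/// Subset of the LlmErrorKind variants shown in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LlmErrorKind {
    Auth,
    RateLimited,
    Overloaded,
    ServerError,
    Timeout,
    ContextOverflow,
    Unknown,
}

const BASE: Duration = Duration::from_millis(100); // assumed value
const CAP: Duration = Duration::from_secs(5); // assumed value

/// Assumed policy: only transient failures are retried as-is.
/// ContextOverflow is left to the caller, which may compact and retry.
fn is_retryable(kind: LlmErrorKind) -> bool {
    matches!(
        kind,
        LlmErrorKind::RateLimited
            | LlmErrorKind::Overloaded
            | LlmErrorKind::ServerError
            | LlmErrorKind::Timeout
    )
}

/// Exponential backoff, capped, with jitter in [0, 1] scaling the delay
/// between 50% and 100% of the exponential ceiling.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration, jitter: f64) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    let ceiling = base.saturating_mul(factor).min(cap);
    let jitter = if jitter.is_finite() { jitter.clamp(0.0, 1.0) } else { 0.0 };
    ceiling.mul_f64(0.5 + 0.5 * jitter)
}

/// Call `op` until it succeeds, fails with a non-retryable kind,
/// or `max_attempts` calls have been made.
fn retry_with_backoff<T>(
    max_attempts: u32,
    mut op: impl FnMut(u32) -> Result<T, LlmErrorKind>,
    mut sleep: impl FnMut(Duration),
    mut jitter: impl FnMut() -> f64,
) -> Result<T, LlmErrorKind> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(kind) if is_retryable(kind) && attempt + 1 < max_attempts => {
                sleep(backoff_delay(attempt, BASE, CAP, jitter()));
                attempt += 1;
            }
            Err(kind) => return Err(kind),
        }
    }
}
```

In a real driver, `sleep` would be an async timer and `jitter` a random number generator. Injecting them here keeps the delay schedule deterministic and easy to check.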
iven
2026-04-24 08:39:56 +08:00
parent 6d6673bf5b
commit 9060935401
25 changed files with 672 additions and 129 deletions


@@ -223,6 +223,33 @@ impl Serialize for ZclawError {
/// Result type alias for ZCLAW operations
pub type Result<T> = std::result::Result<T, ZclawError>;
/// Fine-grained classification of LLM call errors, used to guide retry and recovery strategy
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum LlmErrorKind {
Auth,
AuthPermanent,
BillingExhausted,
RateLimited,
Overloaded,
ServerError,
Timeout,
ContextOverflow,
ModelNotFound,
Unknown,
}
/// A classified LLM error, with recovery hints attached
#[derive(Debug, Clone)]
pub struct ClassifiedLlmError {
pub kind: LlmErrorKind,
pub retryable: bool,
pub should_compress: bool,
pub should_rotate_credential: bool,
pub retry_after: Option<std::time::Duration>,
pub message: String,
}
#[cfg(test)]
mod tests {
use super::*;