perf(runtime): Hermes Phase 1-3 — prompt caching + parallel tools + smart retry

Phase 1: Anthropic prompt caching
- Add cache_control ephemeral on system prompt blocks
- Track cache_creation/cache_read tokens in CompletionResponse + StreamChunk

Phase 2A: Parallel tool execution
- Add ToolConcurrency enum (ReadOnly/Exclusive/Interactive)
- JoinSet + Semaphore(3) for bounded parallel tool calls
- 7 tools annotated with correct concurrency level
- AtomicU32 for lock-free failure tracking in ToolErrorMiddleware

Phase 2B: Tool output pruning
- prune_tool_outputs() trims old ToolResult > 2000 chars to 500 chars
- Integrated into CompactionMiddleware before token estimation

Phase 3: Error classification + smart retry
- LlmErrorKind + ClassifiedLlmError for structured error mapping
- RetryDriver decorator with jittered exponential backoff
- Kernel wraps all LLM calls with RetryDriver
- CONTEXT_OVERFLOW recovery triggers emergency compaction in loop_runner
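
Illustrative sketch for Phase 3 (not the code in this commit): one way the error kinds could map to recovery hints, and how a jittered exponential backoff delay might be computed. The `classify` and `backoff_delay` functions, the choice of which kinds are retryable or rotate credentials, and the half-to-full jitter scheme are all assumptions made for illustration; only the `LlmErrorKind` and `ClassifiedLlmError` shapes come from the diff, with the serde derives omitted so the block stays dependency-free.

```rust
use std::time::Duration;

// Mirrors the enum added in error.rs (serde derives omitted here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LlmErrorKind {
    Auth,
    AuthPermanent,
    BillingExhausted,
    RateLimited,
    Overloaded,
    ServerError,
    Timeout,
    ContextOverflow,
    ModelNotFound,
    Unknown,
}

// Mirrors the struct added in error.rs.
#[derive(Debug, Clone)]
pub struct ClassifiedLlmError {
    pub kind: LlmErrorKind,
    pub retryable: bool,
    pub should_compress: bool,
    pub should_rotate_credential: bool,
    pub retry_after: Option<Duration>,
    pub message: String,
}

/// Hypothetical helper: derive recovery hints from an error kind.
/// The policy below is an assumption, not the mapping used by the commit.
pub fn classify(
    kind: LlmErrorKind,
    retry_after: Option<Duration>,
    message: &str,
) -> ClassifiedLlmError {
    use LlmErrorKind::*;
    ClassifiedLlmError {
        kind,
        // Assumed: only transient failures are worth retrying unchanged.
        retryable: matches!(kind, RateLimited | Overloaded | ServerError | Timeout),
        // Assumed: context overflow is handled by compacting, not by retrying as-is.
        should_compress: kind == ContextOverflow,
        // Assumed: a different credential may help for auth or billing failures.
        should_rotate_credential: matches!(kind, Auth | BillingExhausted),
        retry_after,
        message: message.to_string(),
    }
}

/// Hypothetical helper: exponential backoff with jitter.
/// `jitter` is a caller-supplied random fraction in [0, 1]; the delay lands
/// between half and all of the capped exponential value.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration, jitter: f64) -> Duration {
    // base * 2^attempt, saturating instead of overflowing, then capped.
    let exp = base.saturating_mul(2u32.saturating_pow(attempt)).min(cap);
    exp.mul_f64(0.5 + 0.5 * jitter.clamp(0.0, 1.0))
}
```

Under these assumptions, a retry wrapper would stop immediately when `retryable` is false, prefer a server-provided `retry_after` when present, and otherwise sleep for `backoff_delay(attempt, base, cap, random_fraction)` before the next attempt.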
2026-04-24 08:39:56 +08:00
parent 6d6673bf5b
commit 9060935401
25 changed files with 672 additions and 129 deletions
--- a/crates/zclaw-types/src/error.rs
+++ b/crates/zclaw-types/src/error.rs
@@ -223,6 +223,33 @@ impl Serialize for ZclawError {
 /// Result type alias for ZCLAW operations
 pub type Result<T> = std::result::Result<T, ZclawError>;

+/// Fine-grained classification of LLM call errors, used to guide retry and recovery strategy
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
+#[serde(rename_all = "snake_case")]
+pub enum LlmErrorKind {
+    Auth,
+    AuthPermanent,
+    BillingExhausted,
+    RateLimited,
+    Overloaded,
+    ServerError,
+    Timeout,
+    ContextOverflow,
+    ModelNotFound,
+    Unknown,
+}
+
+/// A classified LLM error, with recovery hints attached
+#[derive(Debug, Clone)]
+pub struct ClassifiedLlmError {
+    pub kind: LlmErrorKind,
+    pub retryable: bool,
+    pub should_compress: bool,
+    pub should_rotate_credential: bool,
+    pub retry_after: Option<std::time::Duration>,
+    pub message: String,
+}
+
 #[cfg(test)]
 mod tests {
    use super::*;