Root cause: start_consumer() was called in new() before any register() calls,
so the consumer's cloned HashMap was always empty. Workers like log_operation
and record_usage were never found, causing "Unknown worker" errors.
- Add WorkerDispatcher::start() method to be called after all register()s
- Update main.rs to call dispatcher.start() after 7 workers registered
- key_pool.rs: cast cooldown_until to timestamptz for comparison with NOW()
- key_pool.rs: cast request_count to bigint (INT4→INT8) for sqlx decoding
- service.rs: cast cooldown_until to timestamptz in quota sort query
- scheduler.rs: cast last_seen_at to timestamptz in device cleanup
- totp.rs: use DateTime<Utc> instead of rfc3339 string for updated_at
- P2-18: TOTP QR code local generation via qrcode lib (no external service)
- P2-21: Suspend foreign LLM providers (OpenAI/Anthropic/Gemini) for early stage
- P3-04: get_progress() now calculates actual percentage from completed/total steps
- P3-05: saveSaaSSession calls now have .catch() error logging
- P3-06: SaaS relay chatStream passes session_key/agent_id to backend
- P3-02: Whiteboard unification plan document created
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
P1-04: GenerationPipeline hardcoded model="default" causing classroom
generation 404. Added model field to GenerationPipeline struct, passed
from kernel config via with_driver(driver, model). Static scene
generation now receives model parameter.
P1-03: LLM API concurrent 500 DATABASE_ERROR. Added transient DB error
retry (PoolTimedOut/Io) in create_relay_task with 200ms backoff.
Recommend setting ZCLAW_DB_MIN_CONNECTIONS=10 for burst resilience.
- Add IF NOT EXISTS to accounts_template_assignment ALTER COLUMN
- Add IF NOT EXISTS to webhooks CREATE INDEX statements
- Add created_at/updated_at columns + ON CONFLICT DO NOTHING to industry templates
- Bump SCHEMA_VERSION 13→14 to force migration re-run on existing DB
- 16 down SQL files in migrations/down/ for each incremental migration
- db::run_down_migrations() executes rollback files in reverse order
- migrate_down CLI task: task=migrate_down timestamp=20260402
- Initial schema and seed data excluded (would be destructive)
- cache: insert-then-retain pattern avoids empty-window race during refresh
- relay: manage_task_status flag for proper failover state transitions
- relay: retry_task re-resolves model groups instead of blind provider reuse
- relay: filter empty-member groups from available models list
- relay: quota cache stale entry cleanup (TTL 5x expiry)
- error: from_sqlx_unique helper for 409 vs 500 distinction
- model_config: unique constraint handling, duplicate member check
- model_config: failover_strategy whitelist, model_id vs group name conflict check
- model_config: group-scoped member removal with group_id validation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sprint 2: 产品体验打磨 + 行业模板
- Create PipelineResultPreview component with tab-based output switching
- Connect workflow/hand messages to PresentationContainer in ChatArea
- Add auto-trigger first Hand after onboarding (industry-specific queries)
- Seed 3 industry agent templates (education, healthcare, design-shantou)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Model Groups provide logical model names that map to multiple physical
models across providers, with automatic failover when one provider's
key pool is exhausted.
Backend:
- New model_groups + model_group_members tables with FK constraints
- Full CRUD API (7 endpoints) with admin-only write permissions
- Cache layer: DashMap-backed CachedModelGroup with load_from_db
- Relay integration: ModelResolution enum for Direct/Group routing
- Cross-provider failover: sort_candidates_by_quota + OnceLock cache
- Relay failure path: record failure usage + relay_dequeue (fixes
queue counter leak that caused connection pool exhaustion)
- add_group_member: validate model_id exists before insert
Frontend:
- saas-relay-client: accept getModel() callback for dynamic model selection
- connectionStore: prefer conversationStore.currentModel over first available
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- saas test harness: align WorkerDispatcher::new and AppState::new
signatures with SpawnLimiter addition and init_db(&DatabaseConfig)
- growth sqlite: add CJK fallback (LIKE-based) when FTS5 unicode61
tokenizer fails on Chinese queries (unicode61 doesn't index CJK)
- ChatArea.tsx: remove duplicate useState(searchOpen) declaration on line 70
- scheduled_task/mod.rs: fix route from /api/scheduler/tasks to /api/v1/scheduler/tasks
(matches admin-v2 service baseURL pattern and all other modules)
- scheduled_task/handlers.rs: remove @reserved annotations (now has Admin V2 frontend)
- scheduled_task/handlers.rs: update doc comments with correct /api/v1/ paths
- Add welcomeMessage/quickCommands fields to Clone interface
- Persist template welcome/quick data via updateClone after creation
- FirstConversationPrompt: prefer template-provided welcome message
over dynamically generated one
- FirstConversationPrompt: render template quick_commands as chips
instead of hardcoded QUICK_ACTIONS when available
- Tighten assign/unassign template endpoint permissions from model:read
to relay:use (self-service operation for all authenticated users)
- Fix seed template tools to match actual runtime tool names
(file_read/file_write/shell_exec/web_fetch)
- Persist system_prompt/temperature/max_tokens via identity system
in agentStore.createFromTemplate()
- Fire-and-forget assignTemplate() in AgentOnboardingWizard
- Fix saas-relay-client unused variable warning
- Make pgvector extension optional in knowledge_base migration
- Increase StreamBridge timeout from 30s to 90s for thinking models
- create_item: wrap item + version INSERT in transaction for atomicity
- update_item handler: validate content length (100KB) before DB hit
- KnowledgeChunk: document missing embedding field, safe per explicit SELECT usage
Server side:
- POST /api/v1/billing/usage/increment endpoint with dimension whitelist
(hand_executions, pipeline_runs, relay_requests) and count validation (1-100)
- Returns updated usage quota after increment
Desktop side:
- New saas-billing.ts mixin with incrementUsageDimension() and
reportUsageFireAndForget() (non-blocking, safe for finally blocks)
- handStore.triggerHand: reports hand_executions after successful run
- PipelinesPanel.handleRunComplete: reports pipeline_runs on completion
- SaaSClient type declarations for new billing methods
Billing pipeline now covers all three dimensions:
relay_requests → relay handler (server-side, real-time)
hand_executions → handStore (client-side, fire-and-forget)
pipeline_runs → PipelinesPanel (client-side, fire-and-forget)
- Wire relay handler to increment_usage() for JSON responses (tokens + relay_requests)
- Wire relay handler to increment_dimension("relay_requests") for SSE streams
- Add increment_dimension() function for hand_executions/pipeline_runs dimensions
- Schedule AggregateUsageWorker hourly for reconciliation (run_on_start=true)
- Mount mock payment routes in dev mode (ZCLAW_SAAS_DEV=true)
Previously the quota middleware always allowed requests because usage
counters were never incremented. Now relay requests update billing_usage_quotas
in real-time, with the aggregator providing hourly reconciliation.
- payment.rs: create_payment, handle_payment_callback, query_payment_status
- Mock pay page for development mode with HTML confirm/cancel flow
- Payment callback handler with subscription auto-creation on success
- Alipay form-urlencoded and WeChat JSON callback parsing
- 7 new routes including callback and mock-pay endpoints
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Markdown-aware content splitting (512 token chunks with 64 overlap)
- CJK keyword extraction from chunk content with stop-word filtering
- Full refresh strategy (delete old chunks → re-insert on update)
- Phase 2 placeholder for vector embedding API integration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add execute_scheduled_task helper that fetches task info and dispatches
by target_type (agent/hand/workflow)
- Replace STUB warn+simple-UPDATE with full execution flow: dispatch task,
then update state with interval-aware next_run_at calculation
- Update next_run_at using interval_seconds for recurring tasks instead
of setting NULL
- Fix pre-existing cache.rs borrow-after-move bug (id.clone() in insert)
- Add trusted_proxies field to ServerConfig (Vec<String>, serde default)
- Default value is empty vector (no proxy trust until explicitly configured)
- Development config: trust localhost IPs (127.0.0.1, ::1)
- Production config: placeholder localhost IPs with comment to replace
Root cause: each relay request executes 13-17 serial DB queries, exhausting
the 50-connection pool under concurrency. When pool is exhausted, sqlx returns
PoolTimedOut which maps to 500 DATABASE_ERROR.
Fixes:
1. log_operation → dispatch_log_operation (async Worker dispatch, non-blocking)
2. record_usage → tokio::spawn (3 DB queries moved off critical path)
3. DB pool: max_connections 50→100 (env-configurable), acquire_timeout 5s→8s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. key_pool: merge 3 serial UPDATE queries into 2 (cumulative stats +
last_used_at combined into single UPDATE)
2. service: reduce SSE spawn sleep from 3s to 500ms and add 5s timeout
on DB operations to prevent connection hoarding
- Add migration to add last_used_at TIMESTAMPTZ column to provider_keys
- Update select_best_key() SQL to sort by last_used_at ASC NULLS FIRST
- Update record_key_usage() to set last_used_at = NOW() on each use
- Bump SCHEMA_VERSION to 10