feat(knowledge): Phase A knowledge base visibility isolation + structured data sources + distillation Worker

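- Add visibility (public/private) and account_id fields to knowledge_items
- Create structured_sources and structured_rows tables (row-level JSONB storage for Excel data)
- Structured data source CRUD API (5 routes: list/get/rows/delete/query)
- Safe queries: JSONB GIN index + visibility filtering + row-count limit
- Distillation Worker: reuses the Provider Key Pool to call the DeepSeek/Qwen APIs
- L0 quality filter: length and privacy checks
- create_item gains an is_admin parameter that controls the default visibility
- generate_embedding: extract_keywords_from_text made pub for reuse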
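
For reference, the doc comment removed in this commit describes the keyword strategy as taking 2-4 character runs of contiguous Chinese characters that occur more than once. The following is a minimal sketch of that idea, not the code in this commit: the name `extract_repeated_ngrams`, the `is_cjk` helper, the CJK range check, and the sorted, de-duplicated output are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Assumption: "Chinese character" means the CJK Unified Ideographs block.
fn is_cjk(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Hypothetical sketch: count every 2-4 character window inside each
/// contiguous run of Chinese characters, then append the windows that
/// occur more than once to `keywords` (skipping ones already present).
pub fn extract_repeated_ngrams(content: &str, keywords: &mut Vec<String>) {
    let chars: Vec<char> = content.chars().collect();
    let mut counts: HashMap<String, usize> = HashMap::new();
    let mut i = 0;
    while i < chars.len() {
        if !is_cjk(chars[i]) {
            i += 1;
            continue;
        }
        // Find the end of the current run of Chinese characters.
        let start = i;
        while i < chars.len() && is_cjk(chars[i]) {
            i += 1;
        }
        let run = &chars[start..i];
        for n in 2..=4 {
            if run.len() < n {
                break;
            }
            for window in run.windows(n) {
                let gram: String = window.iter().collect();
                *counts.entry(gram).or_insert(0) += 1;
            }
        }
    }
    // Keep only n-grams seen more than once; sort for deterministic output.
    let mut repeated: Vec<String> = counts
        .into_iter()
        .filter(|(_, count)| *count > 1)
        .map(|(gram, _)| gram)
        .collect();
    repeated.sort();
    for gram in repeated {
        if !keywords.contains(&gram) {
            keywords.push(gram);
        }
    }
}
```

Counting overlapping windows keeps the sketch short but also yields fragments of longer repeated phrases, so a real implementation would likely want an extra pruning step.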

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
iven
2026-04-12 18:36:05 +08:00
parent b8fb76375c
commit c3593d3438
10 changed files with 846 additions and 20 deletions


@@ -78,7 +78,7 @@ impl Worker for GenerateEmbeddingWorker {
let chunk_id = uuid::Uuid::new_v4().to_string();
let mut chunk_keywords = keywords.clone();
-extract_chunk_keywords(chunk, &mut chunk_keywords);
+extract_keywords_from_text(chunk, &mut chunk_keywords);
sqlx::query(
"INSERT INTO knowledge_chunks (id, item_id, chunk_index, content, keywords, created_at)
@@ -112,10 +112,8 @@ impl Worker for GenerateEmbeddingWorker {
}
}
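-/// Extract high-frequency Chinese phrases from the chunk content as supplementary keywords
-///
-/// Simple strategy: take runs of 2-4 contiguous Chinese characters that occur more than once
-fn extract_chunk_keywords(content: &str, keywords: &mut Vec<String>) {
+/// Extract high-frequency Chinese phrases from the chunk content as supplementary keywords (public, for reuse by the distill_knowledge worker)
+pub fn extract_keywords_from_text(content: &str, keywords: &mut Vec<String>) {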
let chars: Vec<char> = content.chars().collect();
let mut i = 0;