🔄 卡若AI 同步 2026-02-22 09:12 | 更新:总索引与入口、火种知识模型、运营中枢参考资料、运营中枢工作台 | 排除 >20MB: 8 个
This commit is contained in:
94
04_卡火(火)/火种_知识模型/本地代码库索引/SKILL.md
Normal file
94
04_卡火(火)/火种_知识模型/本地代码库索引/SKILL.md
Normal file
@@ -0,0 +1,94 @@
|
||||
---
|
||||
name: 本地代码库索引
|
||||
description: 使用 Ollama 本地 embedding 对卡若AI 代码库做索引与语义检索,不上传云端
|
||||
triggers: 本地索引、本地搜索、不上传云端、本地代码库、索引卡若AI
|
||||
owner: 火种
|
||||
group: 火
|
||||
version: "1.0"
|
||||
updated: "2026-02-22"
|
||||
---
|
||||
|
||||
# 本地代码库索引
|
||||
|
||||
> **管理员**:卡火(火)
|
||||
> **口头禅**:"让我想想..."
|
||||
> **职责**:在本地对卡若AI 代码库做 embedding 索引与语义检索,**不上传任何数据到云端**
|
||||
|
||||
---
|
||||
|
||||
## 一、能做什么
|
||||
|
||||
- **建索引**:扫描卡若AI 目录,用 `nomic-embed-text` 本地向量化,存入本地文件
|
||||
- **语义搜索**:根据自然语言问题,在本地检索最相关的代码/文档片段
|
||||
- **完全本地**:embedding 与索引全部在本机,无云端上传
|
||||
|
||||
---
|
||||
|
||||
## 二、执行步骤
|
||||
|
||||
### 2.1 前置条件
|
||||
|
||||
1. **Ollama 已安装并运行**:`ollama serve` 在后台
|
||||
2. **nomic-embed-text 已拉取**:`ollama pull nomic-embed-text`
|
||||
3. **检查**:`curl http://localhost:11434/api/tags` 能看到 `nomic-embed-text`
|
||||
|
||||
### 2.2 建索引(首次或更新)
|
||||
|
||||
```bash
|
||||
cd /Users/karuo/Documents/个人/卡若AI
|
||||
python3 04_卡火(火)/火种_知识模型/本地代码库索引/脚本/local_codebase_index.py index
|
||||
```
|
||||
|
||||
- 默认索引目录:`/Users/karuo/Documents/个人/卡若AI`(可配置)
|
||||
- 默认排除:`node_modules`、`.git`、`__pycache__`、`.venv` 等
|
||||
- 索引结果存入:`04_卡火(火)/火种_知识模型/本地代码库索引/index/local_index.json`
|
||||
|
||||
### 2.3 语义搜索
|
||||
|
||||
```bash
|
||||
python3 04_卡火(火)/火种_知识模型/本地代码库索引/脚本/local_codebase_index.py search "如何做语义搜索"
|
||||
```
|
||||
|
||||
或
|
||||
|
||||
```bash
|
||||
python3 04_卡火(火)/火种_知识模型/本地代码库索引/脚本/local_codebase_index.py search "本地模型embed怎么用" --top 5
|
||||
```
|
||||
|
||||
- 返回:文件路径、片段内容、相似度分数
|
||||
|
||||
### 2.4 在 Cursor 对话中使用
|
||||
|
||||
1. **关闭 Cursor 云索引**:Settings → Indexing & Docs → Pause Indexing
|
||||
2. **建好本地索引**(见 2.2)
|
||||
3. 对话时说:「用本地索引查 XXX」或「@本地索引 搜索 YYY」
|
||||
4. AI 会执行 `python3 .../local_codebase_index.py search "XXX"` 并基于结果回答
|
||||
|
||||
---
|
||||
|
||||
## 三、与 Cursor 的配合
|
||||
|
||||
| Cursor 操作 | 建议 |
|
||||
|:----------------------|:-----------------------------|
|
||||
| Codebase Indexing | **Pause** 或 **Delete** |
|
||||
| 本地索引 | 定期运行 `index` 更新 |
|
||||
| 对话检索 | 说「本地索引搜索 XXX」 |
|
||||
|
||||
详见:`运营中枢/参考资料/Cursor索引与本地索引方案.md`
|
||||
|
||||
---
|
||||
|
||||
## 四、相关文件
|
||||
|
||||
| 文件 | 说明 |
|
||||
|:-----|:-----|
|
||||
| `脚本/local_codebase_index.py` | 索引与检索主脚本 |
|
||||
| `index/local_index.json` | 本地索引数据(建索引后生成) |
|
||||
| `运营中枢/参考资料/Cursor索引与本地索引方案.md` | 方案说明 |
|
||||
|
||||
---
|
||||
|
||||
## 五、依赖
|
||||
|
||||
- 前置:`04_卡火(火)/火种_知识模型/本地模型`(Ollama + nomic-embed-text)
|
||||
- 外部:`ollama`、`requests`(与 local_llm_sdk 相同)
|
||||
200
04_卡火(火)/火种_知识模型/本地代码库索引/脚本/local_codebase_index.py
Normal file
200
04_卡火(火)/火种_知识模型/本地代码库索引/脚本/local_codebase_index.py
Normal file
@@ -0,0 +1,200 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
卡若AI 本地代码库索引
|
||||
|
||||
对卡若AI 目录做本地 embedding 索引,支持语义检索。不上传任何数据到云端。
|
||||
依赖:Ollama + nomic-embed-text,与 local_llm_sdk 相同。
|
||||
|
||||
用法:
|
||||
python local_codebase_index.py index # 建索引
|
||||
python local_codebase_index.py search "问题" # 语义搜索
|
||||
python local_codebase_index.py status # 查看索引状态
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import math
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
from typing import List, Dict, Any
|
||||
|
||||
# 项目根目录
|
||||
_REPO_ROOT = Path(__file__).resolve().parents[4]
|
||||
_SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
_INDEX_DIR = _SCRIPT_DIR.parent / "index"
|
||||
_INDEX_FILE = _INDEX_DIR / "local_index.json"
|
||||
|
||||
# 索引配置
|
||||
INDEX_ROOT = os.environ.get("KARUO_INDEX_ROOT", str(_REPO_ROOT))
|
||||
EXCLUDE_DIRS = {
|
||||
"node_modules", ".git", "__pycache__", ".venv", "venv",
|
||||
"dist", "build", ".next", ".cursor", ".github", ".gitea",
|
||||
"chroma_db", "大文件外置"
|
||||
}
|
||||
EXCLUDE_SUFFIXES = {".pyc", ".pyo", ".map", ".min.js", ".lock", ".log"}
|
||||
CHUNK_SIZE = 800 # 每块约 800 字符,便于 embedding
|
||||
CHUNK_OVERLAP = 80
|
||||
|
||||
# 纳入索引的后缀
|
||||
INCLUDE_SUFFIXES = {".md", ".py", ".js", ".ts", ".tsx", ".json", ".mdc", ".txt", ".sh"}
|
||||
|
||||
|
||||
def _add_local_llm():
|
||||
"""确保能导入 local_llm_sdk"""
|
||||
sdk_dir = _REPO_ROOT / "04_卡火(火)" / "火种_知识模型" / "本地模型" / "脚本"
|
||||
if str(sdk_dir) not in sys.path:
|
||||
sys.path.insert(0, str(sdk_dir))
|
||||
|
||||
|
||||
def _chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[str]:
|
||||
"""将长文本切成 overlapping 块"""
|
||||
chunks = []
|
||||
start = 0
|
||||
while start < len(text):
|
||||
end = min(start + size, len(text))
|
||||
chunk = text[start:end].strip()
|
||||
if chunk:
|
||||
chunks.append(chunk)
|
||||
start += size - overlap
|
||||
return chunks
|
||||
|
||||
|
||||
def _collect_files(root: str) -> List[Dict[str, str]]:
|
||||
"""收集要索引的文件,返回 [{path, content}]"""
|
||||
items = []
|
||||
root_path = Path(root)
|
||||
for fp in root_path.rglob("*"):
|
||||
if not fp.is_file():
|
||||
continue
|
||||
rel = fp.relative_to(root_path)
|
||||
parts = rel.parts
|
||||
if any(d in parts for d in EXCLUDE_DIRS):
|
||||
continue
|
||||
if fp.suffix.lower() in EXCLUDE_SUFFIXES:
|
||||
continue
|
||||
if fp.suffix.lower() not in INCLUDE_SUFFIXES:
|
||||
continue
|
||||
try:
|
||||
content = fp.read_text(encoding="utf-8", errors="ignore")
|
||||
except Exception:
|
||||
continue
|
||||
if len(content.strip()) < 20:
|
||||
continue
|
||||
items.append({"path": str(rel), "content": content})
|
||||
return items
|
||||
|
||||
|
||||
def _embed_via_ollama(text: str) -> List[float]:
|
||||
"""通过 Ollama 获取文本 embedding"""
|
||||
_add_local_llm()
|
||||
from local_llm_sdk import get_llm
|
||||
llm = get_llm()
|
||||
result = llm.embed(text[:8000], show_notice=False)
|
||||
if result.get("success") and result.get("embedding"):
|
||||
return result["embedding"]
|
||||
raise RuntimeError(f"Embed 失败: {result}")
|
||||
|
||||
|
||||
def cmd_index():
|
||||
"""建索引"""
|
||||
import time
|
||||
print(f"📁 索引根目录: {INDEX_ROOT}")
|
||||
print("📂 收集文件中...")
|
||||
files = _collect_files(INDEX_ROOT)
|
||||
print(f" 共 {len(files)} 个文件")
|
||||
if not files:
|
||||
print(" 无文件可索引")
|
||||
return
|
||||
_add_local_llm()
|
||||
from local_llm_sdk import get_llm
|
||||
llm = get_llm()
|
||||
records = []
|
||||
total = 0
|
||||
for i, f in enumerate(files):
|
||||
path, content = f["path"], f["content"]
|
||||
chunks = _chunk_text(content)
|
||||
for j, chunk in enumerate(chunks):
|
||||
if len(chunk) < 20:
|
||||
continue
|
||||
try:
|
||||
emb = llm.embed(chunk[:8000], show_notice=False)
|
||||
if emb.get("success") and emb.get("embedding"):
|
||||
records.append({
|
||||
"path": path,
|
||||
"chunk": chunk,
|
||||
"embedding": emb["embedding"]
|
||||
})
|
||||
total += 1
|
||||
except Exception as e:
|
||||
print(f" ⚠️ {path} 块 {j}: {e}")
|
||||
if (i + 1) % 20 == 0:
|
||||
print(f" 已处理 {i+1}/{len(files)} 文件, {total} 块")
|
||||
time.sleep(0.3)
|
||||
_INDEX_DIR.mkdir(parents=True, exist_ok=True)
|
||||
with open(_INDEX_FILE, "w", encoding="utf-8") as f:
|
||||
json.dump({"records": records, "root": INDEX_ROOT}, f, ensure_ascii=False, indent=0)
|
||||
print(f"✅ 索引完成: {len(records)} 块 → {_INDEX_FILE}")
|
||||
|
||||
|
||||
def cmd_search(query: str, top_k: int = 5):
|
||||
"""语义搜索"""
|
||||
if not _INDEX_FILE.exists():
|
||||
print("❌ 索引不存在,请先运行: python local_codebase_index.py index")
|
||||
return
|
||||
with open(_INDEX_FILE, "r", encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
records = data.get("records", [])
|
||||
if not records:
|
||||
print("❌ 索引为空")
|
||||
return
|
||||
query_emb = _embed_via_ollama(query)
|
||||
scores = []
|
||||
for r in records:
|
||||
v = r["embedding"]
|
||||
dot = sum(a * b for a, b in zip(query_emb, v))
|
||||
n1 = math.sqrt(sum(a * a for a in query_emb))
|
||||
n2 = math.sqrt(sum(b * b for b in v))
|
||||
score = dot / (n1 * n2) if n1 and n2 else 0
|
||||
scores.append((score, r))
|
||||
scores.sort(key=lambda x: -x[0])
|
||||
print(f"\n🔍 查询: {query}\n")
|
||||
for i, (score, r) in enumerate(scores[:top_k], 1):
|
||||
print(f"--- [{i}] {r['path']} (score={score:.3f}) ---")
|
||||
txt = r["chunk"][:400].replace("\n", " ")
|
||||
print(f"{txt}{'...' if len(r['chunk']) > 400 else ''}\n")
|
||||
|
||||
|
||||
def cmd_status():
|
||||
"""查看索引状态"""
|
||||
if not _INDEX_FILE.exists():
|
||||
print("❌ 索引未创建。运行: python local_codebase_index.py index")
|
||||
return
|
||||
with open(_INDEX_FILE, "r", encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
n = len(data.get("records", []))
|
||||
root = data.get("root", "?")
|
||||
print(f"📁 索引根: {root}")
|
||||
print(f"📊 索引块数: {n}")
|
||||
print(f"📄 索引文件: {_INDEX_FILE}")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="卡若AI 本地代码库索引")
|
||||
sub = parser.add_subparsers(dest="cmd", required=True)
|
||||
sub.add_parser("index")
|
||||
sp = sub.add_parser("search")
|
||||
sp.add_argument("query", help="搜索问题")
|
||||
sp.add_argument("--top", "-n", type=int, default=5, help="返回前 N 个结果")
|
||||
sub.add_parser("status")
|
||||
args = parser.parse_args()
|
||||
if args.cmd == "index":
|
||||
cmd_index()
|
||||
elif args.cmd == "search":
|
||||
cmd_search(args.query, top_k=args.top)
|
||||
elif args.cmd == "status":
|
||||
cmd_status()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
Reference in New Issue
Block a user