🔄 卡若AI sync 2026-02-22 14:59 | Updated: 金仓, 卡木, 运营中枢工作台 | Excluded >20MB: 8 files
@@ -166,7 +166,9 @@ bash scripts/存客宝_lytiao_Docker部署.sh

### 5. kr宝塔: "running blocked" (运行堵塞) + Node deep repair

- **Running blocked** (load 100%, CPU 98%): kill abnormal node processes, stop Node, repair site.db, check the logs, batch-start.
- **Load diagnosis**: `./scripts/.venv_tx/bin/python scripts/腾讯云_TAT_kr宝塔_负载诊断.py` (read-only; prints load, CPU, and Node processes)
- **Load analysis document**: `references/kr宝塔_负载100_原因分析与处理.md`
- **Running-blocked repair**: kill abnormal node processes, stop Node, repair site.db, batch-start.
- **TAT**: `./scripts/.venv_tx/bin/python scripts/腾讯云_TAT_kr宝塔_运行堵塞与Node深度修复.py`
- **Baota terminal** (recommended): upload `scripts/kr宝塔_运行堵塞与Node深度修复_宝塔终端执行.sh`, then run it with `bash`.

@@ -237,10 +237,19 @@ ssh -p 22022 -i "服务器管理/Steam/id_ed25519" root@43.139.27.93 "nginx -s r
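
---

## VII. Summary of this diagnosis (2026-02-20)

## VII. Detailed analysis of the 100% load

See the separate document: `references/kr宝塔_负载100_原因分析与处理.md`

**Key points**: The load came from Node projects stuck in a MODULE_NOT_FOUND crash loop; once the start commands in site.db are corrected, the projects can be batch-started. TAT diagnosis: `scripts/腾讯云_TAT_kr宝塔_负载诊断.py`.

---

## VIII. Summary of this diagnosis (2026-02-20)

- **Local machine → kr宝塔**: ping is normal and port 22022 is reachable.
- **SSH**: "Connection closed by remote host" occurred at one point; running the commands above from the Baota panel terminal is recommended.
- **Load**: the TAT diagnosis shows the load average has dropped to about 0.5 (15-minute value) and the Node process count is 0 (all stopped); the root cause was the Node crash loop.
- **Bandwidth**: peak public outbound bandwidth over the last 24 h was 5.1 Mbps, which saturates the 5 Mbps cap and is the main cause of the bandwidth-related slowness; a Tencent Cloud script and Baota terminal steps have been provided.

---
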
179
01_卡资(金)/金仓_存储备份/服务器管理/references/kr宝塔_负载100_原因分析与处理.md
Normal file
@@ -0,0 +1,179 @@
# kr宝塔 load 100%: cause analysis and handling

> Applies to: 43.139.27.93 (kr宝塔, 2 cores / 4 GB, 41 sites). Use this when the Baota home page shows "运行堵塞" (running blocked), load 100%, and CPU 98%.

---

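## I. Load causes (ordered by likelihood)

### 1. Node projects repeatedly crashing and restarting (most common)

**Symptom**: Several Node projects are misconfigured (for example MODULE_NOT_FOUND, or a start command `node /path` that treats a directory as the entry point). They fail on startup, Baota/pm2 restarts them automatically, and the result is an endless loop.

**Resource usage**: Each `npm start` / `node` process takes 30~50% CPU, so 2~3 of them are enough to saturate 2 cores.

**Check**:

```bash
ps aux | grep -E 'node|npm|pnpm' | grep -v grep  # look for a large number of node processes
# Baota Node projects → open the startup log and check for MODULE_NOT_FOUND
```
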
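**Fix**:

1. Stop all Node projects (Baota Node projects → batch stop)
2. Change the start command to `cd <project root> && (pnpm start || npm run start)`
3. Start the projects one at a time, confirming each runs without errors before starting the next
4. Scripts: `腾讯云_TAT_kr宝塔_运行堵塞与Node深度修复.py` or `kr宝塔_运行堵塞与Node深度修复_宝塔终端执行.sh`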
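
The `node /path` misconfiguration described above can be checked mechanically before a project is started. The following is a minimal sketch and not part of the repair scripts in this commit; `looks_like_directory_entry` is a hypothetical helper name. It is only a heuristic: Node can load a directory that contains an `index.js` or a `package.json` with a `main` field, so a hit means "inspect this command", not "this will crash".

```python
import os
import shlex
import tempfile


def looks_like_directory_entry(start_command: str) -> bool:
    """Return True when the command is exactly `node <path>` and <path> is a directory."""
    parts = shlex.split(start_command)
    if len(parts) != 2 or os.path.basename(parts[0]) != "node":
        return False
    return os.path.isdir(parts[1])


if __name__ == "__main__":
    # Demonstrate on a throwaway directory instead of a real project path.
    with tempfile.TemporaryDirectory() as project_dir:
        suspicious = f"node {shlex.quote(project_dir)}"
        corrected = f"cd {shlex.quote(project_dir)} && npm run start"
        print(looks_like_directory_entry(suspicious))
        print(looks_like_directory_entry(corrected))
```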

---

### 2. Many concurrent sites + the 2-core bottleneck

**Symptom**: 41 sites and 10+ Node applications on a 2-core CPU easily saturate it under high concurrency.

**Resource usage**: Nginx plus multiple next-server/node processes; per-core capacity is limited.

**Check**:

```bash
ps aux --sort=-%cpu | head -15                       # top processes by CPU
ss -ant state established | tail -n +2 | wc -l       # established connections (tail skips the header line)
```

**Fix**:

- Short term: stop or merge low-priority Node projects
- Long term: upgrade to 4 cores (Tencent Cloud console → Instances → Adjust configuration)
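
To see how much CPU the Node processes take together, the `ps aux` output can be totalled per command. This is a minimal sketch, assuming the standard 11-column `ps aux` layout; `cpu_by_command` is a hypothetical helper, and the sample rows are made up for illustration rather than captured from kr宝塔.

```python
import os
from collections import defaultdict

# Made-up sample in `ps aux` layout: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
SAMPLE = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
www 101 45.0 3.1 100 200 ? Rl 10:00 1:00 node /www/wwwroot/a/server.js
www 102 38.5 2.9 100 200 ? Rl 10:00 1:00 /usr/bin/node /www/wwwroot/b/server.js
root 1 0.1 0.2 100 200 ? Ss 09:00 0:01 nginx: master process
"""


def cpu_by_command(ps_output: str) -> dict:
    """Sum the %CPU column per executable name (first word of COMMAND, basename only)."""
    totals = defaultdict(float)
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # COMMAND is the 11th column and may contain spaces
        if len(fields) < 11:
            continue
        executable = os.path.basename(fields[10].split()[0])
        totals[executable] += float(fields[2])
    return dict(totals)


if __name__ == "__main__":
    for name, cpu in sorted(cpu_by_command(SAMPLE).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {cpu:.1f}%")
```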
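
---

### 3. Saturated bandwidth causing a request backlog

**Symptom**: The 5 Mbps outbound bandwidth is saturated, requests queue up, Nginx/upstream responses slow down, connections accumulate, and CPU usage rises while the backlog is processed.

**Check**:

```bash
# Tencent Cloud monitoring, or the script
./scripts/.venv_tx/bin/python scripts/kr宝塔_腾讯云带宽与CPU近24h.py
```

**Fix**: upgrade bandwidth, rate-limit in Nginx, and optimize high-traffic sites (CDN, compression, caching).

---

### 4. Abnormal processes / cryptomining / attacks

**Symptom**: An unknown process saturates the CPU, or a single IP holds an abnormally large number of connections.

**Check**:

```bash
ps aux --sort=-%cpu | head -20
# Connections per peer IP. Without a state filter ss keeps the State column,
# so the peer address is reliably column 5 (IPv4 only).
ss -ant | awk '$1=="ESTAB"{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -10
```

**Fix**: kill the abnormal process; limit connections from, or block, the abnormal IPs (Baota security/firewall).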
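
The per-IP count can also be done in a script, which makes it easier to handle IPv6 peers that the `cut -d:` pipeline splits incorrectly. This is a minimal sketch, assuming `ss -tn` style output where the State column is present and the peer address is the fifth field; `connections_per_ip` is a hypothetical helper and the sample uses documentation-range addresses, not real traffic.

```python
from collections import Counter

# Made-up sample in `ss -tn` layout: State Recv-Q Send-Q Local Peer
SAMPLE_SS = """\
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 10.0.0.5:443 203.0.113.7:51234
ESTAB 0 0 10.0.0.5:443 203.0.113.7:51240
ESTAB 0 0 10.0.0.5:443 198.51.100.9:40000
ESTAB 0 0 [::1]:3000 [::1]:50000
"""


def connections_per_ip(ss_output: str) -> Counter:
    """Count established connections per peer IP."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue  # skips the header row and any non-established socket
        # Split off the port at the last colon, then drop IPv6 brackets.
        ip = fields[4].rsplit(":", 1)[0].strip("[]")
        counts[ip] += 1
    return counts


if __name__ == "__main__":
    for ip, count in connections_per_ip(SAMPLE_SS).most_common(10):
        print(count, ip)
```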

---

### 5. Disk I/O saturated or disk full

**Symptom**: `df -h` shows 90%+ usage, or `iostat` shows high I/O wait.

**Check**:

```bash
df -h / /www
du -sh /www/wwwlogs /var/log /tmp 2>/dev/null
```

**Fix**: clean up logs (see the one-click cleanup section below) and expand the disk.
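
The 90% threshold can be checked from a script as well. This is a minimal sketch using the standard-library `shutil.disk_usage`; the helper names and the default threshold argument are assumptions for illustration. The percentage may differ slightly from the `Use%` that `df` prints, because `df` accounts for blocks reserved for root.

```python
import shutil


def usage_percent(path: str) -> float:
    """Return used space as a percentage of total space for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def is_nearly_full(path: str, threshold: float = 90.0) -> bool:
    """Return True when usage is at or above the threshold (90% by default)."""
    return usage_percent(path) >= threshold


if __name__ == "__main__":
    # "/" and "/www" are the mount points checked in the commands above;
    # "." is used here so the sketch runs anywhere.
    print(f"{usage_percent('.'):.1f}% used, nearly full: {is_nearly_full('.')}")
```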

---

## II. Diagnostic command quick reference (run in the Baota terminal)

```bash
# One-shot diagnosis
echo "=== Load ===" && uptime
echo "=== Memory ===" && free -m
echo "=== Disk ===" && df -h / /www
echo "=== Connections ===" && ss -ant state established | tail -n +2 | wc -l
echo "=== CPU TOP10 ===" && ps aux --sort=-%cpu | head -11
echo "=== Node processes ===" && ps aux | grep -E 'node|npm' | grep -v grep
```

---

## III. TAT remote diagnosis (no SSH required)

```bash
./scripts/.venv_tx/bin/python scripts/腾讯云_TAT_kr宝塔_负载诊断.py
```

Output: load, memory, disk, connection count, top processes, and Node process details.
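
The diagnosis script in this commit waits a fixed 90 s after dispatching the command before it reads the result. A polling loop is one possible alternative that returns as soon as output is available. The sketch below is generic and makes no TAT calls: `poll_until` and its `fetch` callback are hypothetical, and wiring `fetch` to the TAT query is left as an assumption.

```python
import time
from typing import Callable, Optional


def poll_until(
    fetch: Callable[[], Optional[str]],
    timeout_s: float = 90.0,
    interval_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Call fetch() until it returns a non-None result or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if clock() >= deadline:
            return None  # timed out without a result
        sleep(interval_s)


if __name__ == "__main__":
    # Stand-in for a remote query: returns nothing twice, then the output.
    answers = iter([None, None, "diagnostic output"])
    print(poll_until(lambda: next(answers), timeout_s=5, interval_s=0.1))
```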

---

## IV. Handling procedure (by priority)

| Priority | Action | Notes |
|----------|--------|-------|
| 1 | Stop all Node projects | Relieve pressure first and break the crash-restart cycle |
| 2 | Fix the Node start commands | Resolves MODULE_NOT_FOUND and prevents the loop |
| 3 | Kill high-CPU processes | `kill -9 PID` for abnormal node/npm processes |
| 4 | Clean up disk/logs | Frees space and reduces I/O |
| 5 | Start Node projects one by one | Confirm each runs cleanly before starting the next |
| 6 | Long term: upgrade the instance | 4 cores, more bandwidth |

---

## V. Root cause of the repeated Node crashes, and the fix

**Root cause**: The start command configured for the Baota Node project was `node /www/wwwroot/self/wanzhi/玩值大屏`. Node tries to load the **directory path** as a module, fails with MODULE_NOT_FOUND, and exits; Baota restarts it automatically, and the cycle repeats.

**Correct start command**:

```bash
cd /www/wwwroot/self/wanzhi/玩值大屏 && (pnpm start 2>/dev/null || npm run start)
```

**Fix**: Run `kr宝塔_运行堵塞与Node深度修复_宝塔终端执行.sh`; the script automatically corrects `project_script` for every Node project in site.db.
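
The kind of bulk correction described above can be sketched as follows. This is an illustration only and not the repair script from this commit: the `demo_projects` table, its columns, and both function names are hypothetical, and the real layout of Baota's site.db is not shown in this document and may differ. The sketch runs against an in-memory SQLite database so it cannot touch a real panel.

```python
import os
import shlex
import sqlite3


def corrected_command(old_command, is_dir=os.path.isdir):
    """Rewrite `node <directory>` into the cd-and-start form; leave anything else alone."""
    parts = shlex.split(old_command)
    if len(parts) == 2 and parts[0] == "node" and is_dir(parts[1]):
        return f"cd {shlex.quote(parts[1])} && (pnpm start 2>/dev/null || npm run start)"
    return old_command


def fix_all(conn, is_dir=os.path.isdir):
    """Update every row whose command needs rewriting; return the number of rows changed."""
    rows = conn.execute("SELECT id, project_script FROM demo_projects").fetchall()
    changed = 0
    for row_id, script in rows:
        new_script = corrected_command(script, is_dir)
        if new_script != script:
            conn.execute(
                "UPDATE demo_projects SET project_script = ? WHERE id = ?",
                (new_script, row_id),
            )
            changed += 1
    conn.commit()
    return changed


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # hypothetical stand-in, not Baota's site.db
    conn.execute("CREATE TABLE demo_projects (id INTEGER PRIMARY KEY, project_script TEXT)")
    conn.executemany(
        "INSERT INTO demo_projects (project_script) VALUES (?)",
        [("node /www/wwwroot/demo-a",), ("cd /www/wwwroot/demo-b && npm run start",)],
    )
    # Pretend every path is a directory, since the demo paths do not exist locally.
    print(fix_all(conn, is_dir=lambda path: True))
    for row in conn.execute("SELECT project_script FROM demo_projects ORDER BY id"):
        print(row[0])
```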

---

## VI. One-click cleanup (disk / logs)

Run in the Baota terminal:

```bash
# Clean site logs (delete those older than 7 days; truncate those over 50M)
find /www/wwwlogs -name '*.log' -mtime +7 -type f -delete
find /www/wwwlogs -name '*.log' -type f -size +50M -exec truncate -s 0 {} \;

# Clean /tmp
find /tmp -type f -mtime +7 -delete

# Clean system logs (older than 7 days)
find /var/log -name '*.log' -mtime +7 -type f -delete
```
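
Because `-delete` is irreversible, it can help to list the candidates first. This is a minimal dry-run sketch that only reports files and deletes nothing; `stale_logs` is a hypothetical helper. It uses a plain 7-day cutoff, which is slightly broader than `find -mtime +7` (that test rounds file age down to whole days, so it matches files at least 8 days old).

```python
import time
from pathlib import Path


def stale_logs(root, days=7, now=None):
    """Return *.log files under root whose modification time is older than the cutoff."""
    reference = time.time() if now is None else now
    cutoff = reference - days * 86400
    return sorted(
        path
        for path in Path(root).rglob("*.log")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    # /www/wwwlogs is the log directory named above; nothing is printed if it does not exist.
    for path in stale_logs("/www/wwwlogs"):
        print(path)
```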

---

## VII. Bandwidth (existing findings)

- **Last 24 h**: peak public outbound bandwidth was 5.1 Mbps, saturating the 5 Mbps cap
- **Recommendation**: upgrade bandwidth; rate-limit in Nginx (limit_conn, limit_rate)

---

## VIII. Results of this diagnosis (2026-02-20, TAT inv-i08m1g0e3t)

| Metric | Value | Conclusion |
|--------|-------|------------|
| Load | 0.00, 0.05, 0.52 | ✅ Recovered (the earlier 100% has cleared) |
| Memory | 2777/7578 MB used, 4537 available | ✅ Normal |
| Disk | 66G/79G, 87%, 11G available | ⚠️ On the high side; log cleanup recommended |
| Connections | 8 | ✅ Normal |
| CPU TOP | nginx 0.2%, systemd 0.1% | ✅ No high-CPU processes |
| Node processes | 0 | ⚠️ **All Node processes are stopped** |

**Root cause**: The earlier 100% load came mainly from the loop **Node project crashes with MODULE_NOT_FOUND → Baota restarts it → it crashes again**. With every Node process now stopped, the load has fallen back on its own.

**Next step**: Run `kr宝塔_运行堵塞与Node深度修复_宝塔终端执行.sh` to correct the start commands and batch-start the Node projects.
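
The three load values in the table are the 1-, 5-, and 15-minute averages that `uptime` prints, so 0.52 is the longest window and still reflects the tail of the incident. Below is a minimal sketch of reading them and dividing by the core count (2 cores, per the scope note at the top). The helper names are hypothetical, and the sample `uptime` line is invented apart from the three load values. As a rough rule of thumb, a per-core value near or above 1.0 suggests the CPUs are saturated.

```python
import re


def parse_load_averages(uptime_line: str):
    """Extract the 1-, 5-, and 15-minute load averages from an `uptime` line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load averages found")
    return tuple(float(value) for value in match.groups())


def per_core(load: float, cores: int) -> float:
    """Normalise a load average by the number of CPU cores."""
    return load / cores


if __name__ == "__main__":
    # Hypothetical uptime line built around the load values recorded in the table.
    line = " 14:59:01 up 10 days,  3:02,  1 user,  load average: 0.00, 0.05, 0.52"
    one, five, fifteen = parse_load_averages(line)
    print(per_core(fifteen, cores=2))
```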
117
01_卡资(金)/金仓_存储备份/服务器管理/scripts/腾讯云_TAT_kr宝塔_负载诊断.py
Normal file
@@ -0,0 +1,117 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Tencent Cloud TAT: kr宝塔 load diagnosis (read-only, kills no processes).
Output: uptime, memory, disk, connection counts, CPU/memory TOP20, node process details.
"""
import base64
import json
import os
import re
import sys
import time

KR_INSTANCE_ID = "ins-aw0tnqjo"
REGION = "ap-guangzhou"

DIAG_SCRIPT = r'''#!/bin/bash
echo "========== kr宝塔 load diagnosis =========="
echo ""
echo "[1] Load and uptime"
uptime
echo ""
echo "[2] Memory"
free -m
echo ""
echo "[3] Disk"
df -h / /www 2>/dev/null
echo ""
echo "[4] Connections (ESTABLISHED)"
# tail skips the header line that ss prints
echo "Total:" $(ss -ant state established 2>/dev/null | tail -n +2 | wc -l)
echo ""
echo "[5] Connections per local port TOP15"
# No state filter here, so the columns are: State Recv-Q Send-Q Local Peer
ss -ant 2>/dev/null | awk '$1=="ESTAB"{print $4}' | awk -F: '{print $NF}' | sort | uniq -c | sort -rn | head -15
echo ""
echo "[6] Connections per peer IP TOP10"
ss -ant 2>/dev/null | awk '$1=="ESTAB"{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -10
echo ""
echo "[7] CPU TOP20"
ps aux --sort=-%cpu 2>/dev/null | head -21
echo ""
echo "[8] Memory TOP20"
ps aux --sort=-%mem 2>/dev/null | head -21
echo ""
echo "[9] Node/npm/pnpm processes"
ps aux | grep -E 'node|npm|pnpm|next-server' | grep -v grep
echo ""
echo "[10] Node process counts"
# pgrep -c already prints 0 (and exits non-zero) when nothing matches
echo "node:" $(pgrep -c node 2>/dev/null)
echo "npm:" $(pgrep -c npm 2>/dev/null)
echo ""
echo "========== Diagnosis complete =========="
'''

def _read_creds():
    """Look for the credential index in up to 6 parent directories; fall back to env vars."""
    rel = os.path.join("运营中枢", "工作台", "00_账号与API索引.md")
    sid = skey = None
    d = os.path.dirname(os.path.abspath(__file__))
    for _ in range(6):
        path = os.path.join(d, rel)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as f:
                t = f.read()
            for line in t.splitlines():
                m = re.search(r"SecretId[^|]*\|\s*`([^`]+)`", line, re.I)
                if m and "AKID" in m.group(1):
                    sid = m.group(1).strip()
                m = re.search(r"SecretKey\s*\|\s*`([^`]+)`", line, re.I)
                if m:
                    skey = m.group(1).strip()
            break
        d = os.path.dirname(d)
    # Environment variables are the fallback, including when no index file is found
    return (sid or os.environ.get("TENCENTCLOUD_SECRET_ID"),
            skey or os.environ.get("TENCENTCLOUD_SECRET_KEY"))


def main():
    sid, skey = _read_creds()
    if not sid or not skey:
        print("❌ Tencent Cloud credentials are not configured")
        return 1
    try:
        from tencentcloud.common import credential
        from tencentcloud.tat.v20201028 import tat_client, models
    except ImportError:
        print("pip install tencentcloud-sdk-python-tat")
        return 1

    cred = credential.Credential(sid, skey)
    client = tat_client.TatClient(cred, REGION)
    req = models.RunCommandRequest()
    req.Content = base64.b64encode(DIAG_SCRIPT.encode("utf-8")).decode()
    req.InstanceIds = [KR_INSTANCE_ID]
    req.CommandType = "SHELL"
    req.Timeout = 60
    req.CommandName = "kr宝塔_负载诊断"
    resp = client.RunCommand(req)
    inv_id = resp.InvocationId
    print("✅ TAT command dispatched, InvocationId:", inv_id)
    print("  Waiting 90s for the diagnostic output...")
    time.sleep(90)

    try:
        req2 = models.DescribeInvocationTasksRequest()
        f = models.Filter()
        f.Name, f.Values = "invocation-id", [inv_id]
        req2.Filters = [f]
        r2 = client.DescribeInvocationTasks(req2)
        for t in (r2.InvocationTaskSet or []):
            print("\nStatus:", getattr(t, "TaskStatus", ""))
            tr = getattr(t, "TaskResult", None)
            if tr:
                # TaskResult may arrive as a JSON string or as an SDK object; handle both
                if isinstance(tr, str):
                    out = json.loads(tr).get("Output", "")
                else:
                    out = getattr(tr, "Output", "") or ""
                if out:
                    try:
                        out = base64.b64decode(out).decode("utf-8", errors="replace")
                    except Exception:
                        pass  # not valid base64: print the raw value instead
                    print(out)
    except Exception as e:
        print("Query failed:", e)
    return 0


if __name__ == "__main__":
    sys.exit(main())