jessey 984292c368 feat(m15): collect session-scoped system logs (kernel + coredumps) — 0.31.0

core/syslogs.py gathers, scoped to the diagnostic window:
- kernel-log slice (journalctl -k): Xid, OOM, MCE, PCIe AER, thermal, hung tasks
- crashed-process records (coredumpctl): exe, signal, when
Stored as syslogs.txt in the diagnostic dir, included in the Report bundle, and
fed to the AI on "Explain" alongside the game logs. Best-effort (degrades if the
tools are missing/denied); treats journalctl's "-- No entries --" as empty.
Tests + docs (M15/SPEC).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-22 14:10:30 +02:00

RigDoctor — Product Specification (DRAFT v0.2)

Living spec. The foundational decisions (name, language, platform/GPU priority, MVP scope, packaging, scope-of-action, GUI/tray) are now settled — see DECISIONS.md (D1–D11). Anything still marked [OPEN] is tracked there (D12–D15).

1. Vision

A single, modular toolkit that lets a Linux gamer monitor, diagnose, and understand the health of their machine — especially the hard-to-catch faults that happen under gaming load. The goal is to make otherwise near-impossible-to-investigate problems (random freezes, the screen suddenly going black mid-game, GPU "lost" events) tractable by capturing the right data automatically and explaining it in plain language. Users install only the modules relevant to their hardware via an interactive installer.

Motivating cases:

An RTX 3070 intermittently falls off the PCIe bus under heavy GPU/VRAM load (Xid 79 / Xid 154, NV_ERR_GPU_IS_LOST). The crash is OS-independent (also seen on Windows in Tarkov) and load-correlated, pointing at hardware (VRAM thermals / power transients / PCIe signal integrity).
A monitor going black mid-session (e.g. during Path of Exile) — is it the GPU dropping, a driver reset, a cable/DP link issue, or a power event? Manually impossible to tell after the fact.

In both cases the last sensor readings before the freeze are normally never captured. RigDoctor's crash-safe logger is designed to fix exactly that.

2. Goals / Non-goals

Goals

Catch and preserve the machine's state in the seconds before a hard freeze.
Make hard-to-investigate gaming faults debuggable: collect scattered signals, correlate them, and explain them.
Be GUI-first (D17): the desktop GUI is the primary interface, complemented by a system-tray / top-menu-bar applet for quick actions — backed by a full CLI that keeps complete functionality for headless / SSH / scripting use. (D10/D11/D17)
Be modular: a novice installs a one-click "monitor + capture + report" bundle; a power user installs everything including the GUI, tray, and diagnostics.
Low overhead; safe defaults; no telemetry/phone-home.

Non-goals (for now)

Not a benchmark-score / e-peen leaderboard tool.
Not a stress-test / load-generator — explicitly out of scope (D7). Users can run existing tools (gpu-burn, vkmark, stress-ng) alongside the logger if they want.
Not an overclocking utility.
Read-only by default, with a narrow consent-gated exception. RigDoctor diagnoses and suggests actions (with the exact command where possible). It does not apply changes itself, with one exception: a small set of runtime-reversible gaming tunables (M6: CPU governor, NVIDIA persistence, PCIe ASPM policy, swappiness, THP) can be applied from the GUI via a single pkexec prompt, with no reboot needed and an automatic revert on reboot (D22, realizing the D9 milestone). Risky or persistent fixes (GRUB cmdline, CPU mitigations) remain suggestion-only.

3. Target users & platforms

Users: Linux gamers from novice ("is my PC ok?" + alerts, via GUI/tray) to advanced (raw logs, log forensics, headless capture over SSH).
Distros: Ubuntu first (and Debian via apt). Arch (pacman) / Fedora (dnf) / openSUSE (zypper) best-effort later, behind the distro abstraction. (D3)
GPUs: NVIDIA first (seed hardware). AMD second, Intel third — behind the vendor abstraction. (D4)
Display: GUI and tray must work under both X11 and Wayland on Ubuntu/GNOME; all core functionality must also work fully headless (CLI, over SSH, no display).
Runtime: Python 3 + Qt (PySide6). Core/CLI/daemon are stdlib-only; GUI and tray add PySide6. (D2)

4. Functional requirements (by module)

Module IDs are stable. M7 (stress/repro) is dropped (D7). M10/M11 are the new GUI and tray modules.

M1 — Sensor core (foundation, always installed)

Unified sampling of: CPU temp/freq/load (per core); GPU temp (plus memory-junction temp if exposed), clocks, power, utilization, fan, VRAM, PCIe gen + width, and throttle reasons; RAM (DDR5 SPD) temps; NVMe/SSD temps; system load. Pluggable sources: nvidia-smi/NVML (first), amdgpu sysfs/rocm-smi (later), /sys/class/hwmon, lm-sensors. Stdlib-only.
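
As an illustration of one pluggable source, the sketch below reads temperatures from the hwmon sysfs tree using only the stdlib. The function name `read_hwmon_temps` and the key format are hypothetical, not part of the spec; the only fixed facts are the kernel's `hwmon*/name` and `temp*_input` files, the latter in millidegrees Celsius. An unreadable sensor yields `None` rather than an exception.

```python
from pathlib import Path
from typing import Optional


def read_hwmon_temps(root: str = "/sys/class/hwmon") -> dict[str, Optional[float]]:
    """Return {"<hwmonN>:<chip>/<sensor>": degrees Celsius, or None if unreadable}."""
    readings: dict[str, Optional[float]] = {}
    base = Path(root)
    if not base.is_dir():
        return readings  # no hwmon tree at all: report nothing instead of crashing
    for chip_dir in sorted(base.glob("hwmon*")):
        try:
            chip = (chip_dir / "name").read_text().strip()
        except OSError:
            chip = "unknown"
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            sensor = temp_file.name.removesuffix("_input")
            key = f"{chip_dir.name}:{chip}/{sensor}"
            try:
                # The kernel exposes temperatures in millidegrees Celsius.
                readings[key] = int(temp_file.read_text().strip()) / 1000.0
            except (OSError, ValueError):
                readings[key] = None  # the caller can render this as N/A
    return readings
```

A vendor source (for example one wrapping nvidia-smi) could expose the same dictionary shape, so that the monitor and the logger do not need to know where a reading came from.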

M2 — Live monitor (TUI)

HWMonitor-style terminal dashboard: current / session-min / session-max per sensor, grouped by subsystem, with throttle/critical highlighting. Refresh rate configurable. The terminal face of the live data (the GUI in M10 presents the same data graphically).
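
The current / session-min / session-max bookkeeping can be sketched as a small per-sensor accumulator. `SensorStats` and its `row` layout are illustrative names and formatting, not the project's actual classes.

```python
from dataclasses import dataclass
from typing import Optional


def _fmt(value: Optional[float]) -> str:
    return "N/A" if value is None else f"{value:.1f}"


@dataclass
class SensorStats:
    current: Optional[float] = None
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def update(self, value: float) -> None:
        """Record a new sample and widen the session min/max if needed."""
        self.current = value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    def row(self, name: str) -> str:
        """One dashboard line: name, current, session min, session max."""
        return (f"{name:<12}{_fmt(self.current):>8}"
                f"{_fmt(self.minimum):>8}{_fmt(self.maximum):>8}")
```

A sensor that has produced no sample yet renders as N/A in all three columns, so the dashboard degrades gracefully instead of failing.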

M3 — Crash-capture logger (daemon)

Headless background sampler that writes CSV/JSON and fsyncs every sample so the last readings survive a hard lock. Detects GPU "lost"/hang (query timeout) and writes a marker. Ring-buffer/rotation to bound disk use. Runs as a systemd --user service. Trigger model is user-selectable (D6): always-on, game-launch-triggered, or manual (CLI / tray button). Stdlib-only.
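
The "fsync every sample" requirement can be sketched as follows. `CrashSafeLog` and the column names are hypothetical; the point is that each row is flushed from Python's buffer and then forced to disk with `os.fsync` before the next sample is taken. Rotation is left out of this sketch.

```python
import csv
import os


class CrashSafeLog:
    """Append-only CSV log that forces every row to disk."""

    def __init__(self, path: str, fields: list[str]) -> None:
        is_new = not os.path.exists(path) or os.path.getsize(path) == 0
        self._fh = open(path, "a", newline="")
        # Missing columns are written as empty strings (restval default).
        self._writer = csv.DictWriter(self._fh, fieldnames=fields)
        if is_new:
            self._writer.writeheader()
            self._sync()

    def _sync(self) -> None:
        self._fh.flush()              # Python buffer -> OS page cache
        os.fsync(self._fh.fileno())   # OS page cache -> storage device

    def write_sample(self, sample: dict) -> None:
        self._writer.writerow(sample)
        self._sync()

    def close(self) -> None:
        self._fh.close()
```

A GPU-lost marker could be written through the same path, for example as a row that fills only a timestamp and an `event` column, so the marker gets the same durability as the samples before it.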

M4 — Health report (one-shot)

Scans journalctl for Xid, kernel panics, OOM-killer, MCE, PCIe AER, thermal events; checks SMART disk health; flags driver/library version mismatches; verifies GPU firmware; prints a prioritized findings list with plain-language explanations and suggested fixes (read-only per D9). Reuses M1 for a live snapshot.
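
A minimal sketch of the kernel-log scan is shown below. The category names and regular expressions are assumptions chosen to match commonly seen kernel messages; the real pattern set may differ. Lines that start with `-- ` are treated as journalctl's own markers (such as "-- No entries --") and skipped.

```python
import re

# Illustrative patterns only; a real scanner would likely carry more variants.
PATTERNS = {
    "gpu_xid": re.compile(r"NVRM: Xid \(.*?\): (\d+)"),
    "oom_kill": re.compile(r"Out of memory: Kill(?:ed)? process (\d+)"),
    "mce": re.compile(r"mce: \[Hardware Error\]"),
    "pcie_aer": re.compile(r"\bAER: "),
    "hung_task": re.compile(r"INFO: task .+ blocked for more than \d+ seconds"),
}


def classify_kernel_lines(lines: list[str]) -> list[tuple[str, str, str]]:
    """Return (category, detail, original line) for every recognised line."""
    findings = []
    for line in lines:
        if line.startswith("-- "):
            continue  # journalctl marker, not a kernel message
        for category, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                detail = match.group(1) if pattern.groups else ""
                findings.append((category, detail, line))
                break  # first matching category wins
    return findings
```

The detail field carries the Xid number or the killed PID where the pattern captures one, which is what a later explanation step would key on.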

M5 — System inventory

CPU/GPU/motherboard/BIOS/RAM/storage, kernel, driver versions, X11/Wayland + compositor, PCIe topology. Exportable (Markdown/JSON) to paste into forum/bug reports.
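
The two export formats could be produced from one flat dictionary, as in this sketch. The helper names and the two-column Markdown layout are assumptions; the spec only states that Markdown and JSON exports exist.

```python
import json


def inventory_to_json(inventory: dict[str, str]) -> str:
    return json.dumps(inventory, indent=2, sort_keys=True)


def inventory_to_markdown(inventory: dict[str, str]) -> str:
    """Render the inventory as a two-column Markdown table."""
    lines = ["| Component | Value |", "| --- | --- |"]
    for key, value in inventory.items():
        # Escape pipes so a value cannot break the table layout.
        safe = str(value).replace("|", "\\|")
        lines.append(f"| {key} | {safe} |")
    return "\n".join(lines)
```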

M6 — Gaming environment checks

Detects & evaluates: GPU power profile / persistence mode, CPU governor, Proton/Wine/Steam versions, GameMode, MangoHud, shader cache, swappiness, hugepages, CPU mitigations, PCIe ASPM. Flags settings that hurt stability/performance and suggests the fix command. Also includes Steam library/game detection (the D12 "pick a game" foundation) and, per D22, a one-click apply for the runtime-reversible tunables (governor, persistence, ASPM, swappiness, THP) plus one-click install of optional tools (GameMode/MangoHud/cpupower).
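
Two of these checks can be sketched as pure functions that return a finding plus the suggested command. The policy here (prefer the `performance` governor, flag swappiness above 10) is an assumption for illustration, not a threshold taken from the spec; the sysfs/procfs paths are the standard kernel locations.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

GOVERNOR_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


@dataclass
class Finding:
    setting: str
    value: str
    ok: bool
    suggestion: Optional[str] = None  # exact command to suggest, if any


def read_setting(path: str) -> Optional[str]:
    """Read a one-line kernel setting; None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def check_governor(value: str) -> Finding:
    if value == "performance":
        return Finding("cpu_governor", value, True)
    return Finding("cpu_governor", value, False,
                   "sudo cpupower frequency-set -g performance")


def check_swappiness(value: int, max_ok: int = 10) -> Finding:
    if value <= max_ok:
        return Finding("swappiness", str(value), True)
    return Finding("swappiness", str(value), False,
                   f"sudo sysctl vm.swappiness={max_ok}")
```

Keeping the evaluation separate from the file read means the same functions can serve the CLI report and the GUI, and the suggested command is available whether or not the one-click apply is used.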

M8 — Alerting

Threshold + event alerts (desktop notification / sound / log) on overheat, throttle, GPU-lost, SMART failure. Surfaces in the tray applet (M11) when installed. Optional.
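
A threshold alert might be sketched as below. The separate clear level (hysteresis) is a design assumption added here so that a reading hovering around the limit does not produce a stream of notifications; the spec itself only requires threshold and event alerts.

```python
from typing import Optional


class ThresholdAlert:
    """Fires once when a reading crosses trigger_at, clears below clear_below."""

    def __init__(self, sensor: str, trigger_at: float, clear_below: float) -> None:
        if clear_below > trigger_at:
            raise ValueError("clear_below must not exceed trigger_at")
        self.sensor = sensor
        self.trigger_at = trigger_at
        self.clear_below = clear_below
        self.active = False

    def update(self, value: float) -> Optional[str]:
        """Return a message on a state change, otherwise None."""
        if not self.active and value >= self.trigger_at:
            self.active = True
            return f"ALERT: {self.sensor} at {value:.1f} (limit {self.trigger_at:.1f})"
        if self.active and value < self.clear_below:
            self.active = False
            return f"CLEARED: {self.sensor} back to {value:.1f}"
        return None
```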

M10 — Desktop GUI (PySide6/Qt)

Full graphical front-end over the core engine: live dashboard (graphs/gauges), browse and visualize captured crash logs, run a health report and view findings, view system inventory, toggle the logger and its trigger mode. Mirrors CLI capability for non-terminal users. Optional module (pulls in PySide6).

M11 — Tray applet

A small always-available applet in the Linux top menu bar (system tray / StatusNotifierItem; on Ubuntu/GNOME via the AppIndicator extension). Optional module. Contents (D13):

At-a-glance live readouts (from M1) in the dropdown, refreshed periodically: CPU temp, GPU temp, memory used/total (e.g. "14 GB / 32 GB"); a status dot (normal / throttling / alert) alongside.
Run Diagnostic — the headline action; launches the guided diagnostic session below.
Supporting actions: Open dashboard (M10), Start/Stop recording, Snapshot now, Quit.
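
The memory readout in the dropdown could be derived from `/proc/meminfo`, as in this sketch. The function name is hypothetical, and "used" is taken as MemTotal minus MemAvailable, which is one reasonable definition among several. The values in that file are in KiB; the label follows the "GB" wording of the example above.

```python
def memory_readout(meminfo_text: str) -> str:
    """Format used/total memory from the text of /proc/meminfo."""
    fields: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])  # value in KiB
    try:
        total = fields["MemTotal"]
        available = fields["MemAvailable"]
    except KeyError:
        return "N/A"  # degrade instead of raising in the tray
    kib_per_gib = 1024 * 1024
    used = total - available
    return f"{used / kib_per_gib:.0f} GB / {total / kib_per_gib:.0f} GB"
```

In the applet this would be called with the contents of `/proc/meminfo` on each refresh tick.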

Guided diagnostic session (M3 + M4 workflow)

The "Run Diagnostic" flow available from the tray (M11), the GUI (M10), and the CLI:

Pick a game to focus on — chosen from detected/installed games (via the D12 game detection: Steam library / recently played / running process).
Collect — RigDoctor runs a focused crash-capture session (M3) scoped to that game: it logs while you play, bracketing the session via the D12 wrapper/watcher.
Scan & analyze — when the session ends (or after a crash + reboot), it runs the health report (M4) over the captured window + system logs to surface likely issues.
Present findings — a prioritized, plain-language list with suggested fixes (read-only, D9). This is the one-click expression of the seed use case; it orchestrates existing modules rather than adding a new one.
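
The four steps above can be sketched as a thin orchestrator over injected callables. Everything here is illustrative: `DiagnosticSession`, the callable signatures, and the integer severity are assumptions standing in for the real M3 and M4 interfaces.

```python
from dataclasses import dataclass
from typing import Callable

# (severity, plain-language text); higher severity is presented first.
Finding = tuple[int, str]


@dataclass
class DiagnosticSession:
    game: str                                        # step 1: the chosen game
    capture: Callable[[str], list[str]]              # step 2: stand-in for M3
    analyze: Callable[[list[str]], list[Finding]]    # step 3: stand-in for M4

    def run(self) -> list[str]:
        samples = self.capture(self.game)
        findings = self.analyze(samples)
        # Step 4: present findings, most severe first.
        ordered = sorted(findings, key=lambda f: f[0], reverse=True)
        return [text for _, text in ordered]
```

Because the session only wires existing pieces together, the tray, the GUI, and the CLI can all start the same flow by constructing it with the same callables.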

M9 — Installer (see ARCHITECTURE §5)

Interactive wizard: detect GPU vendor (NVIDIA-first) → present module menu grouped into bundles with descriptions and the exact packages each needs → resolve & install (apt first) → write config → optionally enable the systemd --user logger service and pick its trigger mode. Delivered with the user-local install (and the optional .deb) (D8). Module list/bundling is final per D14.
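
The first wizard step, GPU vendor detection, might look like the sketch below. It reads the PCI vendor ID that the kernel exposes for each DRM card; the vendor IDs are the standard PCI assignments, while the function names and the NVIDIA-first ordering list are written here for illustration.

```python
from pathlib import Path
from typing import Optional

PCI_VENDORS = {"0x10de": "nvidia", "0x1002": "amd", "0x8086": "intel"}
PRIORITY = ["nvidia", "amd", "intel"]  # mirrors the NVIDIA-first ordering


def detect_gpu_vendors(drm_root: str = "/sys/class/drm") -> set[str]:
    """Return the set of known GPU vendors found under the DRM sysfs tree."""
    found: set[str] = set()
    root = Path(drm_root)
    if not root.is_dir():
        return found
    for card in root.iterdir():
        if not card.name.startswith("card"):
            continue
        try:
            vendor_id = (card / "device" / "vendor").read_text().strip().lower()
        except OSError:
            continue  # connector entries such as card0-DP-1 have no vendor file
        if vendor_id in PCI_VENDORS:
            found.add(PCI_VENDORS[vendor_id])
    return found


def primary_vendor(found: set[str]) -> Optional[str]:
    """Pick the vendor whose modules the wizard should offer first."""
    for vendor in PRIORITY:
        if vendor in found:
            return vendor
    return None
```

On a hybrid laptop with an Intel iGPU and an NVIDIA dGPU, both vendors are detected and NVIDIA is offered first.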

Lets a user (A) grant a helper (B) a shared terminal over the relay: A shares a real PTY running their shell; B watches live and may type only if A allows it (otherwise read-only) — a deliberate, consent-gated exception to the read-only stance (D9). A reads along and can type too (e.g. a sudo password, which stays local and is never sent to B). Account-gated by the Gitea token; per-session share code. The shared terminal preserves colors/theming and can be viewed full-screen. (The earlier read-only stats view / bundle export were dropped — D23.)

M14 — AI assistant (D24)

Optional module that explains the collected diagnostics in plain language. Strictly opt-in and never automatic — the model is contacted only when the user presses "Explain with AI" (GUI) or runs rigdoctor ai explain; configuring it contacts nothing. The user explicitly chooses a provider (no default): Ollama (local, private, no key) or Claude (Anthropic Messages API, key in the keyring, with a consent prompt before sending data). Answers are grounded in the actual findings plus matched reference facts from a curated, exact-match knowledge base ("RAG-lite" — no embeddings/vector store, stdlib only); no fine-tuning. HTTP via stdlib urllib (no new core dependency); output is advisory (consistent with D9).
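
The "RAG-lite" grounding can be sketched as an exact dictionary lookup followed by prompt assembly. The knowledge-base entry, the key format, and the prompt wording are placeholders, and the sketch assumes findings arrive as normalized text such as "Xid 79". No model is contacted here.

```python
import re

# Placeholder knowledge base; the real curated entries are not shown in the spec.
KNOWLEDGE_BASE = {
    "xid 79": "Xid 79: the NVIDIA driver reports that the GPU has fallen off the bus.",
}

KEY_PATTERN = re.compile(r"\bxid\s+(\d+)\b", re.IGNORECASE)


def match_facts(findings: list[str]) -> list[str]:
    """Collect reference facts whose key appears exactly in a finding."""
    facts: list[str] = []
    for finding in findings:
        for number in KEY_PATTERN.findall(finding):
            fact = KNOWLEDGE_BASE.get(f"xid {number}")  # exact match, no fuzziness
            if fact and fact not in facts:
                facts.append(fact)
    return facts


def build_prompt(findings: list[str]) -> str:
    """Assemble the text that would be sent once the user asks for an explanation."""
    facts = match_facts(findings)
    parts = ["Explain these diagnostic findings in plain language.", "", "Findings:"]
    parts += [f"- {f}" for f in findings]
    parts += ["", "Reference facts:"]
    parts += [f"- {f}" for f in facts] or ["- (none matched)"]
    return "\n".join(parts)
```

Extracting the number first and then doing a dictionary lookup keeps "Xid 7" from matching the "Xid 79" entry, which a plain substring search would get wrong.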

M15 — Logging & report bundles (D25)

Opt-in (one logging_enabled toggle, default off). When on: the application logs to a rotating app.log, and each diagnostic is stored in its own directory (capture log, structured result, human-readable report, session-scoped game logs (Proton/Steam) and system logs (journalctl -k slice + coredumpctl crashed-process records), and a record of every AI interaction — the exact data sent, the model, and its reply). The collected logs are also fed to the AI on "Explain". System-log collection is best-effort (degrades if tools are missing/denied). A Report action zips one diagnostic's directory (plus the app log) into a shareable bundle saved under the reports folder (GUI button; CLI rigdoctor bundle). Everything stays local — a report only leaves the machine if the user shares the zip. Stdlib only (logging + zipfile).
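
The Report action could be sketched with `zipfile` alone. `make_bundle` and the archive layout (the diagnostic directory name as the top-level folder, the app log at the root) are assumptions for illustration; the output path is expected to live outside the diagnostic directory, for example under the reports folder.

```python
import zipfile
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def make_bundle(diag_dir: PathLike, out_zip: PathLike,
                app_log: Optional[PathLike] = None) -> Path:
    """Zip one diagnostic directory (plus the app log, if present)."""
    diag = Path(diag_dir)
    out = Path(out_zip)
    out.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(diag.rglob("*")):
            if path.is_file():
                # Keep paths relative so the archive does not leak the home dir.
                arcname = (Path(diag.name) / path.relative_to(diag)).as_posix()
                zf.write(path, arcname=arcname)
        if app_log is not None and Path(app_log).is_file():
            zf.write(app_log, arcname="app.log")
    return out
```

Nothing in this step touches the network, which matches the rule that a report leaves the machine only if the user shares the zip.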

5. Non-functional requirements

Zero hard deps for the core/CLI/daemon — Python stdlib + tools already present. Qt (PySide6) is required only by the GUI (M10) and tray (M11) modules, declared in the .deb and pulled in only when those modules are selected.
Crash-safe logging — flush + fsync per sample; bounded disk usage.
Low overhead — default ≤1 Hz sampling; negligible CPU/GPU cost. The always-on daemon is stdlib-only (no Qt loaded) so it stays tiny.
GUI-first, CLI-complete (D17) — the GUI is the primary interface, but every capability is also reachable from the CLI so RigDoctor runs fully headless (SSH/servers). Both front-ends sit over the same engine; neither is the only way to do something.
Privacy — local only; inventory export is opt-in and reviewable; no telemetry.
Portability — graceful degradation when a sensor/tool is unavailable (N/A, not crash).

6. Open questions

None tracked — all foundational decisions (D1–D15) are settled; see DECISIONS.md. Detail to flesh out during build: the tray's supporting-action set and per-module apt package names. Packaging/deps are Ubuntu/apt-only (D15) — no multi-distro mapping is maintained.
