# RigDoctor — Product Specification (DRAFT v0.2) > Living spec. The foundational decisions (name, language, platform/GPU priority, MVP scope, > packaging, scope-of-action, GUI/tray) are now settled — see `DECISIONS.md` (D1–D11). > Anything still marked **[OPEN]** is tracked there (D12–D15). ## 1. Vision A single, modular toolkit that lets a Linux gamer **monitor**, **diagnose**, and **understand the health** of their machine — especially the hard-to-catch faults that happen under gaming load. The goal is to make otherwise near-impossible-to-investigate problems (random freezes, the screen suddenly going black mid-game, GPU "lost" events) tractable by capturing the right data automatically and explaining it in plain language. Users install only the modules relevant to their hardware via an interactive installer. **Motivating cases:** - An RTX 3070 intermittently falls off the PCIe bus under heavy GPU/VRAM load (`Xid 79` / `Xid 154`, `NV_ERR_GPU_IS_LOST`). The crash is OS-independent (also seen on Windows in Tarkov) and load-correlated, pointing at hardware (VRAM thermals / power transients / PCIe signal integrity). - A monitor going black mid-session (e.g. during Path of Exile) — is it the GPU dropping, a driver reset, a cable/DP link issue, or a power event? Manually impossible to tell after the fact. In both cases the last sensor readings before the freeze are normally never captured. RigDoctor's crash-safe logger is designed to fix exactly that. ## 2. Goals / Non-goals **Goals** - Catch and preserve the machine's state in the seconds before a hard freeze. - Make hard-to-investigate gaming faults debuggable: collect scattered signals, correlate them, and explain them. - Be **GUI-first** (D17): the **desktop GUI** is the primary interface, complemented by a **system-tray / top-menu-bar applet** for quick actions — backed by a **full CLI** that keeps complete functionality for headless / SSH / scripting use. (D10/D11/D17) - Be modular: a novice installs a one-click "monitor + capture + report" bundle; a power user installs everything including the GUI, tray, and diagnostics. - Low overhead; safe defaults; no telemetry/phone-home. **Non-goals (for now)** - Not a benchmark-score / e-peen leaderboard tool. - **Not a stress-test / load-generator** — explicitly out of scope (D7). Users can run existing tools (gpu-burn, vkmark, stress-ng) alongside the logger if they want. - Not an overclocking utility. - **Read-only by default, with a narrow consent-gated exception.** RigDoctor diagnoses and *suggests* actions (with the exact command where possible). It does **not** apply changes itself — **except** a small set of **runtime-reversible** gaming tunables (M6: CPU governor, NVIDIA persistence, PCIe ASPM policy, swappiness, THP) that can be applied from the GUI via a single pkexec prompt, no reboot, revert on reboot (D22, realizing the D9 milestone). Risky/ persistent fixes (GRUB cmdline, CPU mitigations) remain suggestion-only. ## 3. Target users & platforms - **Users:** Linux gamers from novice ("is my PC ok?" + alerts, via GUI/tray) to advanced (raw logs, log forensics, headless capture over SSH). - **Distros:** **Ubuntu first** (and Debian via `apt`). Arch (`pacman`) / Fedora (`dnf`) / openSUSE (`zypper`) best-effort later, behind the distro abstraction. (D3) - **GPUs:** **NVIDIA first** (seed hardware). AMD second, Intel third — behind the vendor abstraction. (D4) - **Display:** GUI and tray must work under both X11 and Wayland on Ubuntu/GNOME; **all core functionality must also work fully headless** (CLI, over SSH, no display). - **Runtime:** Python 3 + Qt (PySide6). Core/CLI/daemon are stdlib-only; GUI and tray add PySide6. (D2) ## 4. Functional requirements (by module) > Module IDs are stable. **M7 (stress/repro) is dropped** (D7). M10/M11 are the new GUI and > tray modules. ### M1 — Sensor core (foundation, always installed) Unified sampling of: CPU temp/freq/load, per-core; GPU temp/(mem-junction if exposed)/ clocks/power/util/fan/VRAM/PCIe gen+width/throttle reasons; RAM (DDR5 SPD) temps; NVMe/SSD temps; system load. Pluggable sources: `nvidia-smi`/NVML (first), `amdgpu` sysfs/`rocm-smi` (later), `/sys/class/hwmon`, `lm-sensors`. Stdlib-only. ### M2 — Live monitor (TUI) HWMonitor-style terminal dashboard: current / session-min / session-max per sensor, grouped by subsystem, with throttle/critical highlighting. Refresh rate configurable. The terminal face of the live data (the GUI in M10 presents the same data graphically). ### M3 — Crash-capture logger (daemon) Headless background sampler that writes CSV/JSON and **`fsync`s every sample** so the last readings survive a hard lock. Detects GPU "lost"/hang (query timeout) and writes a marker. Ring-buffer/rotation to bound disk use. Runs as a `systemd --user` service. **Trigger model is user-selectable** (D6): always-on, game-launch-triggered, or manual (CLI / tray button). Stdlib-only. ### M4 — Health report (one-shot) Scans `journalctl` for Xid, kernel panics, OOM-killer, MCE, PCIe AER, thermal events; checks SMART disk health; flags driver/library version mismatches; verifies GPU firmware; prints a prioritized findings list with plain-language explanations and **suggested** fixes (read-only per D9). Reuses M1 for a live snapshot. ### M5 — System inventory CPU/GPU/motherboard/BIOS/RAM/storage, kernel, driver versions, X11/Wayland + compositor, PCIe topology. Exportable (Markdown/JSON) to paste into forum/bug reports. ### M6 — Gaming environment checks Detects & evaluates: GPU power profile / persistence mode, CPU governor, Proton/Wine/Steam versions, GameMode, MangoHud, shader cache, swappiness, hugepages, CPU mitigations, PCIe ASPM. Flags settings that hurt stability/performance and **suggests** the fix command. Also includes Steam library/game detection (the D12 "pick a game" foundation) and, per D22, a **one-click apply** for the runtime-reversible tunables (governor, persistence, ASPM, swappiness, THP) plus one-click install of optional tools (GameMode/MangoHud/cpupower). ### M8 — Alerting Threshold + event alerts (desktop notification / sound / log) on overheat, throttle, GPU-lost, SMART failure. Surfaces in the tray applet (M11) when installed. Optional. ### M10 — Desktop GUI (PySide6/Qt) Full graphical front-end over the core engine: live dashboard (graphs/gauges), browse and visualize captured crash logs, run a health report and view findings, view system inventory, toggle the logger and its trigger mode. Mirrors CLI capability for non-terminal users. Optional module (pulls in PySide6). ### M11 — System-tray / menu-bar applet (PySide6/Qt) A small always-available applet in the Linux top menu bar (system tray / StatusNotifierItem; on Ubuntu/GNOME via the AppIndicator extension). Optional module. Contents (D13): - **At-a-glance live readouts (from M1)** in the dropdown, refreshed periodically: **CPU temp, GPU temp, memory used/total** (e.g. "14 GB / 32 GB"); a status dot (normal / throttling / alert) alongside. - **Run Diagnostic** — the headline action; launches the *guided diagnostic session* below. - **Supporting actions:** Open dashboard (M10), Start/Stop recording, Snapshot now, Quit. ### Guided diagnostic session (M3 + M4 workflow) The "Run Diagnostic" flow available from the tray (M11), the GUI (M10), and the CLI: 1. **Pick a game to focus on** — chosen from detected/installed games (via the D12 game detection: Steam library / recently played / running process). 2. **Collect** — RigDoctor runs a focused crash-capture session (M3) scoped to that game: it logs while you play, bracketing the session via the D12 wrapper/watcher. 3. **Scan & analyze** — when the session ends (or after a crash + reboot), it runs the health report (M4) over the captured window + system logs to surface likely issues. 4. **Present findings** — a prioritized, plain-language list with suggested fixes (read-only, D9). This is the one-click expression of the seed use case; it orchestrates existing modules rather than adding a new one. ### M9 — Installer (see ARCHITECTURE §5) Interactive wizard: detect GPU vendor (NVIDIA-first) → present module menu grouped into bundles with descriptions and the exact packages each needs → resolve & install (apt first) → write config → optionally enable the `systemd --user` logger service and pick its trigger mode. Delivered with the user-local install (and the optional `.deb`) (D8). Module list/bundling is final per D14. ### M12 — Session sharing / remote assist (D16, scoped to terminal-only by D23) Lets a user (A) grant a helper (B) a **shared terminal** over the relay: A shares a real PTY running their shell; B watches live and may type **only if A allows it** (otherwise read-only) — a deliberate, consent-gated exception to the read-only stance (D9). A reads along and can type too (e.g. a sudo password, which stays local and is never sent to B). Account-gated by the Gitea token; per-session share code. The shared terminal preserves colors/theming and can be viewed full-screen. *(The earlier read-only stats view / bundle export were dropped — D23.)* ### M14 — AI assistant (D24) Optional module that explains the collected diagnostics in plain language. **Strictly opt-in and never automatic** — the model is contacted only when the user presses "Explain with AI" (GUI) or runs `rigdoctor ai explain`; configuring it contacts nothing. The user explicitly chooses a provider (no default): **Ollama** (local, private, no key) or **Claude** (Anthropic Messages API, key in the keyring, with a consent prompt before sending data). Answers are **grounded** in the actual findings plus matched reference facts from a curated, exact-match knowledge base ("RAG-lite" — no embeddings/vector store, stdlib only); no fine-tuning. HTTP via stdlib `urllib` (no new core dependency); output is advisory (consistent with D9). ## 5. Non-functional requirements - **Zero hard deps for the core/CLI/daemon** — Python stdlib + tools already present. **Qt (PySide6) is required only by the GUI (M10) and tray (M11) modules**, declared in the `.deb` and pulled in only when those modules are selected. - **Crash-safe logging** — flush + `fsync` per sample; bounded disk usage. - **Low overhead** — default ≤1 Hz sampling; negligible CPU/GPU cost. The always-on daemon is stdlib-only (no Qt loaded) so it stays tiny. - **GUI-first, CLI-complete** (D17) — the GUI is the primary interface, but every capability is *also* reachable from the CLI so RigDoctor runs fully headless (SSH/servers). Both front-ends sit over the same engine; neither is the only way to do something. - **Privacy** — local only; inventory export is opt-in and reviewable; no telemetry. - **Portability** — graceful degradation when a sensor/tool is unavailable (N/A, not crash). ## 6. Open questions None tracked — all foundational decisions (D1–D15) are settled; see `DECISIONS.md`. Detail to flesh out during build: the tray's supporting-action set and per-module apt package names. Packaging/deps are **Ubuntu/apt-only** (D15) — no multi-distro mapping is maintained.