Planning docs (SPEC, ARCHITECTURE, MODULES, ROADMAP, DECISIONS) with decisions D1-D15 settled: RigDoctor name, Python 3 + Qt/PySide6 stack (core/CLI/daemon stdlib-only), Ubuntu + NVIDIA first, .deb packaging, read-only + suggestions, GUI + tray modules, stress module dropped.

First code: the M1 sensor core (stdlib-only) and a CLI.
- core engine: Reading/Sample model, Sampler, hwmon reader
- self-probing sources (NVIDIA first): nvidia-smi GPU, coretemp/k10temp CPU, /proc/meminfo + DDR5 SPD memory, NVMe storage
- CLI: snapshot (text/JSON), monitor, sources; record/report stubbed
- stdlib unittest smoke tests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RigDoctor — Product Specification (DRAFT v0.2)
Living spec. All foundational decisions (name, language, platform/GPU priority, MVP scope, packaging, scope-of-action, GUI/tray, and the formerly open items D12–D15) are now settled; see DECISIONS.md (D1–D15).
1. Vision
A single, modular toolkit that lets a Linux gamer monitor, diagnose, and understand the health of their machine — especially the hard-to-catch faults that happen under gaming load. The goal is to make otherwise near-impossible-to-investigate problems (random freezes, the screen suddenly going black mid-game, GPU "lost" events) tractable by capturing the right data automatically and explaining it in plain language. Users install only the modules relevant to their hardware via an interactive installer.
Motivating cases:
- An RTX 3070 intermittently falls off the PCIe bus under heavy GPU/VRAM load (Xid 79 / Xid 154, NV_ERR_GPU_IS_LOST). The crash is OS-independent (also seen on Windows in Tarkov) and load-correlated, pointing at hardware (VRAM thermals / power transients / PCIe signal integrity).
- A monitor going black mid-session (e.g. during Path of Exile): is it the GPU dropping, a driver reset, a cable/DP link issue, or a power event? Manually impossible to tell after the fact.
In both cases the last sensor readings before the freeze are normally never captured. RigDoctor's crash-safe logger is designed to fix exactly that.
2. Goals / Non-goals
Goals
- Catch and preserve the machine's state in the seconds before a hard freeze.
- Make hard-to-investigate gaming faults debuggable: collect scattered signals, correlate them, and explain them.
- Offer three ways to run: full CLI / headless (works over SSH), a desktop GUI, and a system-tray / top-menu-bar applet with quick actions. (D10/D11)
- Be modular: a novice installs a one-click "monitor + capture + report" bundle; a power user installs everything including the GUI, tray, and diagnostics.
- Low overhead; safe defaults; no telemetry/phone-home.
Non-goals (for now)
- Not a benchmark-score / e-peen leaderboard tool.
- Not a stress-test / load-generator — explicitly out of scope (D7). Users can run existing tools (gpu-burn, vkmark, stress-ng) alongside the logger if they want.
- Not an overclocking utility.
- Not (yet) an auto-fixer. RigDoctor is read-only: it diagnoses and suggests actions (with the exact command where possible) but does not apply changes itself in this stage. Auto-apply is a deliberate later milestone behind explicit consent. (D9)
3. Target users & platforms
- Users: Linux gamers from novice ("is my PC ok?" + alerts, via GUI/tray) to advanced (raw logs, log forensics, headless capture over SSH).
- Distros: Ubuntu first (and Debian via apt). Arch (pacman) / Fedora (dnf) / openSUSE (zypper) best-effort later, behind the distro abstraction. (D3)
- GPUs: NVIDIA first (seed hardware). AMD second, Intel third, behind the vendor abstraction. (D4)
- Display: GUI and tray must work under both X11 and Wayland on Ubuntu/GNOME; all core functionality must also work fully headless (CLI, over SSH, no display).
- Runtime: Python 3 + Qt (PySide6). Core/CLI/daemon are stdlib-only; GUI and tray add PySide6. (D2)
4. Functional requirements (by module)
Module IDs are stable. M7 (stress/repro) is dropped (D7). M10/M11 are the new GUI and tray modules.
M1 — Sensor core (foundation, always installed)
Unified sampling of: CPU temp/freq/load, per-core; GPU temp/(mem-junction if exposed)/
clocks/power/util/fan/VRAM/PCIe gen+width/throttle reasons; RAM (DDR5 SPD) temps; NVMe/SSD
temps; system load. Pluggable sources: nvidia-smi/NVML (first), amdgpu sysfs/rocm-smi
(later), /sys/class/hwmon, lm-sensors. Stdlib-only.
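As an illustration of what a pluggable, stdlib-only source could look like, here is a minimal sketch of a hwmon temperature reader. The `Reading` fields and the `read_hwmon_temps` name are assumptions made for this example, not the project's actual model. The sysfs layout (`name`, `tempN_input` in millidegrees Celsius, optional `tempN_label`) follows the kernel's hwmon interface.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, Optional


@dataclass(frozen=True)
class Reading:
    """Hypothetical reading record; the real model may differ."""
    source: str              # hwmon chip name, e.g. "coretemp"
    sensor: str              # label if present, else the file stem ("temp1")
    value: Optional[float]   # None means N/A (present but unreadable)
    unit: str


def read_hwmon_temps(root: Path = Path("/sys/class/hwmon")) -> Iterator[Reading]:
    """Yield one Reading per tempN_input file under each hwmon chip."""
    if not root.is_dir():
        return  # no hwmon tree at all: yield nothing instead of raising
    for chip in sorted(root.iterdir()):
        try:
            chip_name = (chip / "name").read_text().strip()
        except OSError:
            chip_name = chip.name
        for temp_input in sorted(chip.glob("temp*_input")):
            stem = temp_input.name[: -len("_input")]
            try:
                label = (chip / f"{stem}_label").read_text().strip()
            except OSError:
                label = stem
            try:
                # hwmon reports temperatures in millidegrees Celsius
                value = int(temp_input.read_text().strip()) / 1000.0
            except (OSError, ValueError):
                value = None
            yield Reading(chip_name, label, value, "C")
```

Taking the root directory as a parameter keeps the reader testable against a fake sysfs tree, and returning `None` for an unreadable sensor lets callers print N/A rather than fail.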
M2 — Live monitor (TUI)
HWMonitor-style terminal dashboard: current / session-min / session-max per sensor, grouped by subsystem, with throttle/critical highlighting. Refresh rate configurable. The terminal face of the live data (the GUI in M10 presents the same data graphically).
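The current / session-min / session-max bookkeeping described above can be sketched as follows. The class names and the dotted sensor keys are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SensorStats:
    current: Optional[float] = None
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def update(self, value: Optional[float]) -> None:
        self.current = value
        if value is None:
            return  # an N/A sample shows as current but leaves min/max alone
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)


class SessionStats:
    """Tracks per-sensor statistics for one monitoring session."""

    def __init__(self) -> None:
        self._stats: Dict[str, SensorStats] = {}

    def update(self, sensor: str, value: Optional[float]) -> SensorStats:
        stats = self._stats.setdefault(sensor, SensorStats())
        stats.update(value)
        return stats
```

A dashboard loop would call `update()` once per sensor per refresh and render the returned `SensorStats`.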
M3 — Crash-capture logger (daemon)
Headless background sampler that writes CSV/JSON and fsyncs every sample so the last
readings survive a hard lock. Detects GPU "lost"/hang (query timeout) and writes a marker.
Ring-buffer/rotation to bound disk use. Runs as a systemd --user service. Trigger model
is user-selectable (D6): always-on, game-launch-triggered, or manual (CLI / tray button).
Stdlib-only.
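One way the "query timeout" detection could work is to bound every GPU query with a subprocess timeout and treat expiry as a possible GPU-lost event. The sketch below assumes the nvidia-smi source is queried via `--query-gpu`; the status strings, the function name, and the 2-second default are illustrative choices, not settled behaviour.

```python
import subprocess
from typing import Optional, Sequence, Tuple

# Assumed query; the fields the real sampler asks for may differ.
NVIDIA_SMI_QUERY = (
    "nvidia-smi",
    "--query-gpu=temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
)


def query_gpu(
    cmd: Sequence[str] = NVIDIA_SMI_QUERY, timeout_s: float = 2.0
) -> Tuple[str, Optional[str]]:
    """Run one GPU query and return (status, payload)."""
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The tool hung: the logger would write its marker row here.
        return ("GPU_QUERY_TIMEOUT", None)
    except FileNotFoundError:
        return ("UNAVAILABLE", None)  # tool not installed: degrade to N/A
    if proc.returncode != 0:
        return ("ERROR", proc.stderr.strip() or None)
    return ("OK", proc.stdout.strip())
```

A timeout alone does not prove the GPU is gone, so the marker is best read as "query hung at this timestamp" and correlated later with kernel log events.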
M4 — Health report (one-shot)
Scans journalctl for Xid, kernel panics, OOM-killer, MCE, PCIe AER, thermal events; checks
SMART disk health; flags driver/library version mismatches; verifies GPU firmware; prints a
prioritized findings list with plain-language explanations and suggested fixes (read-only
per D9). Reuses M1 for a live snapshot.
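A rough sketch of the log-scanning step follows. It works on any iterable of text lines, so it can be fed from a journal export or a saved file. The regular expressions are illustrative approximations of common kernel message shapes, not a complete or authoritative set, and the `Finding` record is hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple


class Finding(NamedTuple):
    kind: str     # "xid", "oom", "mce", "pcie_aer"
    detail: str   # captured number (Xid code, killed PID) or "" if none
    line: str     # the original log line, for display


# Illustrative patterns only; real kernel/driver messages vary by version.
_PATTERNS = [
    ("xid", re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+)")),
    ("oom", re.compile(r"Out of memory: Killed process (\d+)")),
    ("mce", re.compile(r"mce: \[Hardware Error\]")),
    ("pcie_aer", re.compile(r"AER: ")),
]


def scan_log(lines: Iterable[str]) -> List[Finding]:
    """Return one Finding per line that matches a known event pattern."""
    findings: List[Finding] = []
    for line in lines:
        for kind, pattern in _PATTERNS:
            match = pattern.search(line)
            if match:
                detail = match.group(1) if match.groups() else ""
                findings.append(Finding(kind, detail, line.rstrip("\n")))
                break  # first matching pattern wins for this line
    return findings
```

In practice the input might come from something like `journalctl -k -b -1 --no-pager` (kernel messages from the previous boot). Prioritization and the plain-language explanations would sit on top of this raw match list.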
M5 — System inventory
CPU/GPU/motherboard/BIOS/RAM/storage, kernel, driver versions, X11/Wayland + compositor, PCIe topology. Exportable (Markdown/JSON) to paste into forum/bug reports.
M6 — Gaming environment checks
Detects & evaluates: GPU power profile / persistence mode, CPU governor, Proton/Wine/Steam versions, GameMode, MangoHud, shader cache, swappiness, hugepages, CPU mitigations, PCIe ASPM. Flags settings that hurt stability/performance and suggests the fix command (read-only per D9).
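To make the "detect, evaluate, suggest a command" pattern concrete, here is a minimal sketch for a single check, the CPU governor. The `Suggestion` record is hypothetical, and the policy (suggest `performance` whenever another governor is active) is a placeholder, since the right answer depends on the CPU driver. The suggested command is only returned as text and never run, in line with the read-only rule (D9).

```python
from pathlib import Path
from typing import NamedTuple, Optional


class Suggestion(NamedTuple):
    check: str
    observed: str
    command: Optional[str]  # shown to the user, never executed


def check_cpu_governor(sys_root: Path = Path("/sys")) -> Suggestion:
    """Read cpu0's scaling governor and suggest a change if needed."""
    path = sys_root / "devices/system/cpu/cpu0/cpufreq/scaling_governor"
    try:
        governor = path.read_text().strip()
    except OSError:
        # No cpufreq support (or no permission): report N/A, do not crash.
        return Suggestion("cpu_governor", "N/A", None)
    if governor == "performance":
        return Suggestion("cpu_governor", governor, None)
    # Placeholder policy for illustration only.
    return Suggestion(
        "cpu_governor", governor, "sudo cpupower frequency-set -g performance"
    )
```

Other checks (swappiness, ASPM, persistence mode) could follow the same shape: read one value, compare it against a policy, return the observation plus an optional command string.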
M8 — Alerting
Threshold + event alerts (desktop notification / sound / log) on overheat, throttle, GPU-lost, SMART failure. Surfaces in the tray applet (M11) when installed. Optional.
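A threshold alert is most useful when it fires once on crossing the limit rather than on every sample above it. The sketch below shows that edge-triggered behaviour; the class name, sensor keys, and limits are assumptions for illustration, and delivery (notification, sound, log) is left out.

```python
from typing import Dict, Optional, Set


class ThresholdAlerter:
    """Fires once when a sensor crosses its limit, re-arms when it drops."""

    def __init__(self, limits: Dict[str, float]) -> None:
        self._limits = dict(limits)
        self._active: Set[str] = set()

    def check(self, sensor: str, value: Optional[float]) -> Optional[str]:
        limit = self._limits.get(sensor)
        if limit is None or value is None:
            return None  # no limit configured, or reading is N/A
        if value >= limit:
            if sensor in self._active:
                return None  # already alerted for this excursion
            self._active.add(sensor)
            return f"{sensor} reached {value:g} (limit {limit:g})"
        self._active.discard(sensor)  # back under the limit: re-arm
        return None
```

A real implementation would likely add hysteresis so that a value hovering around the limit does not alert repeatedly.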
M10 — Desktop GUI (PySide6/Qt)
Full graphical front-end over the core engine: live dashboard (graphs/gauges), browse and visualize captured crash logs, run a health report and view findings, view system inventory, toggle the logger and its trigger mode. Mirrors CLI capability for non-terminal users. Optional module (pulls in PySide6).
M11 — System-tray / menu-bar applet (PySide6/Qt)
A small always-available applet in the Linux top menu bar (system tray / StatusNotifierItem; on Ubuntu/GNOME via the AppIndicator extension). Optional module. Contents (D13):
- At-a-glance live readouts (from M1) in the dropdown, refreshed periodically: CPU temp, GPU temp, memory used/total (e.g. "14 GB / 32 GB"); a status dot (normal / throttling / alert) alongside.
- Run Diagnostic — the headline action; launches the guided diagnostic session below.
- Supporting actions: Open dashboard (M10), Start/Stop recording, Snapshot now, Quit.
Guided diagnostic session (M3 + M4 workflow)
The "Run Diagnostic" flow available from the tray (M11), the GUI (M10), and the CLI:
- Pick a game to focus on — chosen from detected/installed games (via the D12 game detection: Steam library / recently played / running process).
- Collect — RigDoctor runs a focused crash-capture session (M3) scoped to that game: it logs while you play, bracketing the session via the D12 wrapper/watcher.
- Scan & analyze — when the session ends (or after a crash + reboot), it runs the health report (M4) over the captured window + system logs to surface likely issues.
- Present findings — a prioritized, plain-language list with suggested fixes (read-only, D9). This is the one-click expression of the seed use case; it orchestrates existing modules rather than adding a new one.
M9 — Installer (see ARCHITECTURE §5)
Interactive wizard: detect GPU vendor (NVIDIA-first) → present module menu grouped into
bundles with descriptions and the exact packages each needs → resolve & install (apt first)
→ write config → optionally enable the systemd --user logger service and pick its trigger
mode. Delivered alongside the .deb (D8). Module list/bundling is final per D14.
5. Non-functional requirements
- Zero hard deps for the core/CLI/daemon: Python stdlib + tools already present. Qt (PySide6) is required only by the GUI (M10) and tray (M11) modules, declared in the .deb and pulled in only when those modules are selected.
- Crash-safe logging: flush + fsync per sample; bounded disk usage.
- Low overhead: default ≤1 Hz sampling; negligible CPU/GPU cost. The always-on daemon is stdlib-only (no Qt loaded) so it stays tiny.
- Headless-equivalent: every diagnostic capability is reachable from the CLI; the GUI and tray are conveniences over the same engine, never the only way to do something.
- Privacy: local only; inventory export is opt-in and reviewable; no telemetry.
- Portability: graceful degradation when a sensor/tool is unavailable (N/A, not crash).
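The crash-safe logging requirement above (flush + fsync per sample) can be sketched with the stdlib alone. The class name and column handling are assumptions for this example. Rotation for bounded disk use is omitted, as is syncing the parent directory, which a newly created file may also need for full durability.

```python
import csv
import os
import time
from typing import Mapping, Optional, Sequence


class CrashSafeCsvLog:
    """Appends one CSV row per sample and forces it to disk each time."""

    def __init__(self, path: str, fields: Sequence[str]) -> None:
        is_new = not os.path.exists(path) or os.path.getsize(path) == 0
        self._fh = open(path, "a", newline="")
        self._writer = csv.DictWriter(
            self._fh,
            fieldnames=["ts", *fields],
            restval="N/A",          # missing sensors are logged as N/A
            extrasaction="ignore",  # unknown keys are dropped, not fatal
        )
        if is_new:
            self._writer.writeheader()
            self._sync()

    def _sync(self) -> None:
        self._fh.flush()             # Python buffer -> kernel
        os.fsync(self._fh.fileno())  # kernel page cache -> storage device

    def write_sample(
        self, values: Mapping[str, object], ts: Optional[float] = None
    ) -> None:
        row = dict(values)
        row["ts"] = time.time() if ts is None else ts
        self._writer.writerow(row)
        self._sync()

    def close(self) -> None:
        self._fh.close()
```

At the default ≤1 Hz sampling rate, one fsync per sample is a small cost to pay for having the final pre-freeze row on disk.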
6. Open questions
None tracked — all foundational decisions (D1–D15) are settled; see DECISIONS.md. Detail
to flesh out during build: the tray's supporting-action set and per-module apt package names.
Packaging/deps are Ubuntu/apt-only (D15) — no multi-distro mapping is maintained.