rigdoctor/docs/SPEC.md
jessey ce5f830393
2026-05-21 17:16:41 +02:00

RigDoctor — Product Specification (DRAFT v0.2)

Living spec. The foundational decisions (name, language, platform/GPU priority, MVP scope, packaging, scope-of-action, GUI/tray) are now settled; see DECISIONS.md (D1–D11). Anything still marked [OPEN] is tracked there (D12–D15).

1. Vision

A single, modular toolkit that lets a Linux gamer monitor, diagnose, and understand the health of their machine — especially the hard-to-catch faults that happen under gaming load. The goal is to make otherwise near-impossible-to-investigate problems (random freezes, the screen suddenly going black mid-game, GPU "lost" events) tractable by capturing the right data automatically and explaining it in plain language. Users install only the modules relevant to their hardware via an interactive installer.

Motivating cases:

  • An RTX 3070 intermittently falls off the PCIe bus under heavy GPU/VRAM load (Xid 79 / Xid 154, NV_ERR_GPU_IS_LOST). The crash is OS-independent (also seen on Windows in Tarkov) and load-correlated, pointing at hardware (VRAM thermals / power transients / PCIe signal integrity).
  • A monitor going black mid-session (e.g. during Path of Exile): is it the GPU dropping, a driver reset, a cable/DP link issue, or a power event? This is practically impossible to determine by hand after the fact.

In both cases the last sensor readings before the freeze are normally never captured. RigDoctor's crash-safe logger is designed to fix exactly that.

2. Goals / Non-goals

Goals

  • Catch and preserve the machine's state in the seconds before a hard freeze.
  • Make hard-to-investigate gaming faults debuggable: collect scattered signals, correlate them, and explain them.
  • Be GUI-first (D17): the desktop GUI is the primary interface, complemented by a system-tray / top-menu-bar applet for quick actions — backed by a full CLI that keeps complete functionality for headless / SSH / scripting use. (D10/D11/D17)
  • Be modular: a novice installs a one-click "monitor + capture + report" bundle; a power user installs everything including the GUI, tray, and diagnostics.
  • Low overhead; safe defaults; no telemetry/phone-home.

Non-goals (for now)

  • Not a benchmark-score / e-peen leaderboard tool.
  • Not a stress-test / load-generator — explicitly out of scope (D7). Users can run existing tools (gpu-burn, vkmark, stress-ng) alongside the logger if they want.
  • Not an overclocking utility.
  • Not (yet) an auto-fixer. RigDoctor is read-only: it diagnoses and suggests actions (with the exact command where possible) but does not apply changes itself in this stage. Auto-apply is a deliberate later milestone behind explicit consent. (D9)

3. Target users & platforms

  • Users: Linux gamers from novice ("is my PC ok?" + alerts, via GUI/tray) to advanced (raw logs, log forensics, headless capture over SSH).
  • Distros: Ubuntu first (and Debian via apt). Arch (pacman) / Fedora (dnf) / openSUSE (zypper) best-effort later, behind the distro abstraction. (D3)
  • GPUs: NVIDIA first (seed hardware). AMD second, Intel third — behind the vendor abstraction. (D4)
  • Display: GUI and tray must work under both X11 and Wayland on Ubuntu/GNOME; all core functionality must also work fully headless (CLI, over SSH, no display).
  • Runtime: Python 3 + Qt (PySide6). Core/CLI/daemon are stdlib-only; GUI and tray add PySide6. (D2)

4. Functional requirements (by module)

Module IDs are stable. M7 (stress/repro) is dropped (D7). M10/M11 are the new GUI and tray modules.

M1 — Sensor core (foundation, always installed)

Unified sampling of: CPU temp/freq/load (per-core); GPU temp (plus memory-junction temp if exposed), clocks, power, utilization, fan, VRAM, PCIe gen + width, and throttle reasons; RAM (DDR5 SPD) temps; NVMe/SSD temps; system load. Pluggable sources: nvidia-smi/NVML (first), amdgpu sysfs/rocm-smi (later), /sys/class/hwmon, lm-sensors. Stdlib-only.
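As an illustration of what one pluggable, stdlib-only source could look like, the sketch below reads temperatures from the kernel's hwmon sysfs tree, where `temp*_input` files report millidegrees Celsius. The function names `read_millidegrees` and `sample_hwmon` and the `<chip>/<label>` key format are assumptions made for this example, not RigDoctor's actual API.

```python
from pathlib import Path
from typing import Dict, Optional


def read_millidegrees(path: Path) -> Optional[float]:
    """Return a temperature in degrees C, or None if the file is unreadable."""
    try:
        return int(path.read_text().strip()) / 1000.0
    except (OSError, ValueError):
        # Missing, unreadable, or non-numeric sensor: report N/A, never crash.
        return None


def sample_hwmon(root: Path = Path("/sys/class/hwmon")) -> Dict[str, Optional[float]]:
    """Collect every temp*_input under root, keyed as '<chip>/<label>' (hypothetical key format)."""
    readings: Dict[str, Optional[float]] = {}
    if not root.is_dir():
        return readings
    for chip_dir in sorted(root.iterdir()):
        try:
            chip = (chip_dir / "name").read_text().strip()
        except OSError:
            chip = chip_dir.name  # fall back to hwmonN
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            label_file = temp_file.with_name(temp_file.name.replace("_input", "_label"))
            try:
                label = label_file.read_text().strip()
            except OSError:
                label = temp_file.name[: -len("_input")]  # e.g. "temp1"
            readings[f"{chip}/{label}"] = read_millidegrees(temp_file)
    return readings


if __name__ == "__main__":
    for key, value in sample_hwmon().items():
        print(key, "N/A" if value is None else f"{value:.1f} C")
```

A vendor-specific source (NVML, amdgpu sysfs) could return the same kind of mapping, so the monitor and logger would not need to know where a reading came from.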

M2 — Live monitor (TUI)

HWMonitor-style terminal dashboard: current / session-min / session-max per sensor, grouped by subsystem, with throttle/critical highlighting. Refresh rate configurable. The terminal face of the live data (the GUI in M10 presents the same data graphically).
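The current / session-min / session-max bookkeeping described above can be sketched as a small per-sensor accumulator. `SensorStats` and `SessionTracker` are hypothetical names used only for this illustration; a missing reading (`None`) updates the current value but leaves the extremes untouched.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SensorStats:
    current: Optional[float] = None
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def update(self, value: Optional[float]) -> None:
        """Record a new reading; None means the sensor was unavailable this tick."""
        self.current = value
        if value is None:
            return
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)


class SessionTracker:
    """Keeps one SensorStats per sensor name for the lifetime of the session."""

    def __init__(self) -> None:
        self.stats: Dict[str, SensorStats] = {}

    def feed(self, sample: Dict[str, Optional[float]]) -> None:
        for name, value in sample.items():
            self.stats.setdefault(name, SensorStats()).update(value)


if __name__ == "__main__":
    tracker = SessionTracker()
    for sample in ({"gpu_temp": 60.0}, {"gpu_temp": 72.0}, {"gpu_temp": 65.0}):
        tracker.feed(sample)
    print(tracker.stats["gpu_temp"])
```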

M3 — Crash-capture logger (daemon)

Headless background sampler that writes CSV/JSON and fsyncs every sample so the last readings survive a hard lock. Detects GPU "lost"/hang (query timeout) and writes a marker. Ring-buffer/rotation to bound disk use. Runs as a systemd --user service. Trigger model is user-selectable (D6): always-on, game-launch-triggered, or manual (CLI / tray button). Stdlib-only.
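A minimal sketch of the flush + fsync-per-sample idea with size-based rotation, assuming one JSON object per line. The class name `CrashSafeLog`, its parameters, and the `.1`, `.2`, ... suffix scheme are illustrative assumptions rather than the actual RigDoctor writer.

```python
import json
import os
import time
from pathlib import Path
from typing import Any, Dict


class CrashSafeLog:
    """Append-only JSON-lines log that is fsynced after every record."""

    def __init__(self, path: Path, max_bytes: int = 1_000_000, keep: int = 3) -> None:
        self.path = Path(path)
        self.max_bytes = max_bytes
        self.keep = keep  # number of rotated files to retain
        self._fh = open(self.path, "ab")

    def write(self, record: Dict[str, Any]) -> None:
        line = json.dumps({"ts": time.time(), **record}, separators=(",", ":"))
        self._fh.write(line.encode("utf-8") + b"\n")
        self._fh.flush()              # push Python's buffer to the OS
        os.fsync(self._fh.fileno())   # ask the OS to push it to the disk
        if self._fh.tell() >= self.max_bytes:
            self._rotate()

    def _rotate(self) -> None:
        """Shift path.1 -> path.2 -> ..., dropping anything beyond `keep`."""
        self._fh.close()
        for i in range(self.keep - 1, 0, -1):
            src = self.path.with_name(f"{self.path.name}.{i}")
            if src.exists():
                os.replace(src, self.path.with_name(f"{self.path.name}.{i + 1}"))
        os.replace(self.path, self.path.with_name(f"{self.path.name}.1"))
        self._fh = open(self.path, "ab")

    def close(self) -> None:
        self._fh.close()
```

Because each record is a complete line that has been synced before the next sample is taken, a hard lock can cost at most the sample in flight. A GPU-lost marker would be written the same way, for example `log.write({"event": "gpu_lost"})`, where the event name is again only a placeholder.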

M4 — Health report (one-shot)

Scans journalctl for Xid, kernel panics, OOM-killer, MCE, PCIe AER, thermal events; checks SMART disk health; flags driver/library version mismatches; verifies GPU firmware; prints a prioritized findings list with plain-language explanations and suggested fixes (read-only per D9). Reuses M1 for a live snapshot.
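To illustrate the log-scanning step, the sketch below extracts NVIDIA Xid events from kernel log lines, such as those produced by `journalctl -k`. The sample lines, the `XidEvent` type, and the hint table are assumptions for this example; only the hint for Xid 79 is filled in, and a real report would cover the other event classes listed above.

```python
import re
from typing import Iterable, List, NamedTuple

# Matches lines of the form "NVRM: Xid (PCI:0000:01:00): 79, ..."
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+),")

# Plain-language hints; deliberately incomplete in this sketch.
XID_HINTS = {
    79: "GPU has fallen off the bus (check power, thermals, PCIe seating).",
}


class XidEvent(NamedTuple):
    pci: str
    code: int
    hint: str
    line: str


def scan_xid(lines: Iterable[str]) -> List[XidEvent]:
    """Return one XidEvent per matching kernel log line, in input order."""
    events: List[XidEvent] = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            code = int(match.group("code"))
            hint = XID_HINTS.get(code, "No hint in this sketch.")
            events.append(XidEvent(match.group("pci"), code, hint, line.rstrip()))
    return events


if __name__ == "__main__":
    # Illustrative lines, not captured from a real machine.
    sample = [
        "kernel: usb 1-2: new high-speed USB device number 5",
        "kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', "
        "name=<unknown>, GPU has fallen off the bus.",
    ]
    for event in scan_xid(sample):
        print(f"Xid {event.code} on {event.pci}: {event.hint}")
```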

M5 — System inventory

CPU/GPU/motherboard/BIOS/RAM/storage, kernel, driver versions, X11/Wayland + compositor, PCIe topology. Exportable (Markdown/JSON) to paste into forum/bug reports.

M6 — Gaming environment checks

Detects & evaluates: GPU power profile / persistence mode, CPU governor, Proton/Wine/Steam versions, GameMode, MangoHud, shader cache, swappiness, hugepages, CPU mitigations, PCIe ASPM. Flags settings that hurt stability/performance and suggests the fix command (read-only per D9).

M8 — Alerting

Threshold + event alerts (desktop notification / sound / log) on overheat, throttle, GPU-lost, SMART failure. Surfaces in the tray applet (M11) when installed. Optional.
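One way a threshold alert could avoid firing on every sample is to track state and use a lower "clear" level (hysteresis). The sketch below assumes that design; `ThresholdAlert`, its parameters, and the message strings are hypothetical, and how the message is delivered (notification, sound, log) is left out.

```python
from typing import Dict, Optional


class ThresholdAlert:
    """Fires once when a sensor crosses `limit`, clears once at or below `clear_below`."""

    def __init__(self, sensor: str, limit: float, clear_below: float) -> None:
        if clear_below > limit:
            raise ValueError("clear_below must not exceed limit")
        self.sensor = sensor
        self.limit = limit
        self.clear_below = clear_below
        self.active = False

    def check(self, sample: Dict[str, Optional[float]]) -> Optional[str]:
        """Return a message on a state change, otherwise None."""
        value = sample.get(self.sensor)
        if value is None:
            return None  # sensor unavailable: keep the current state
        if not self.active and value >= self.limit:
            self.active = True
            return f"ALERT {self.sensor}={value} (limit {self.limit})"
        if self.active and value <= self.clear_below:
            self.active = False
            return f"CLEARED {self.sensor}={value}"
        return None


if __name__ == "__main__":
    alert = ThresholdAlert("gpu_temp", limit=90.0, clear_below=85.0)
    for reading in (80.0, 91.0, 92.0, 88.0, 84.0):
        message = alert.check({"gpu_temp": reading})
        if message:
            print(message)
```

The gap between `limit` and `clear_below` keeps a reading that hovers around the limit from producing a stream of repeated notifications.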

M10 — Desktop GUI (PySide6/Qt)

Full graphical front-end over the core engine: live dashboard (graphs/gauges), browse and visualize captured crash logs, run a health report and view findings, view system inventory, toggle the logger and its trigger mode. Mirrors CLI capability for non-terminal users. Optional module (pulls in PySide6).

M11 — System-tray / menu-bar applet (PySide6/Qt)

A small always-available applet in the Linux top menu bar (system tray / StatusNotifierItem; on Ubuntu/GNOME via the AppIndicator extension). Optional module. Contents (D13):

  • At-a-glance live readouts (from M1) in the dropdown, refreshed periodically: CPU temp, GPU temp, memory used/total (e.g. "14 GB / 32 GB"); a status dot (normal / throttling / alert) alongside.
  • Run Diagnostic — the headline action; launches the guided diagnostic session below.
  • Supporting actions: Open dashboard (M10), Start/Stop recording, Snapshot now, Quit.

Guided diagnostic session (M3 + M4 workflow)

The "Run Diagnostic" flow available from the tray (M11), the GUI (M10), and the CLI:

  1. Pick a game to focus on — chosen from detected/installed games (via the D12 game detection: Steam library / recently played / running process).
  2. Collect — RigDoctor runs a focused crash-capture session (M3) scoped to that game: it logs while you play, bracketing the session via the D12 wrapper/watcher.
  3. Scan & analyze — when the session ends (or after a crash + reboot), it runs the health report (M4) over the captured window + system logs to surface likely issues.
  4. Present findings — a prioritized, plain-language list with suggested fixes (read-only, D9). This is the one-click expression of the seed use case; it orchestrates existing modules rather than adding a new one.

M9 — Installer (see ARCHITECTURE §5)

Interactive wizard: detect GPU vendor (NVIDIA-first) → present module menu grouped into bundles with descriptions and the exact packages each needs → resolve & install (apt first) → write config → optionally enable the systemd --user logger service and pick its trigger mode. Delivered with the user-local install (and the optional .deb) (D8). Module list/bundling is final per D14.

M12 — Session sharing / remote assist (D16)

Lets a user (A) grant a helper (B) inspection access, as an escalating, consent-driven ladder: (1) diagnostic bundle export (inventory + recent capture log + report, one-way); (2) live read-only view of the dashboard + logs over a user-chosen tunnel (Tailscale/cloudflared/SSH — no RigDoctor-hosted relay); (3) gated interactive terminal wrapping an existing tool (tmate/sshx), read-only by default, read-write only on explicit consent. Per-session consent, ephemeral revocable tokens, permission escalation (view ≠ shell), and a session audit log. Tier 3 is a deliberate, consent-gated exception to the read-only stance (D9). Built in Phase 6.

5. Non-functional requirements

  • Zero hard deps for the core/CLI/daemon: Python stdlib + tools already present. Qt (PySide6) is required only by the GUI (M10) and tray (M11) modules; it is declared as a dependency of those modules (including in the optional .deb, D8) and pulled in only when they are selected.
  • Crash-safe logging — flush + fsync per sample; bounded disk usage.
  • Low overhead — default ≤1 Hz sampling; negligible CPU/GPU cost. The always-on daemon is stdlib-only (no Qt loaded) so it stays tiny.
  • GUI-first, CLI-complete (D17) — the GUI is the primary interface, but every capability is also reachable from the CLI so RigDoctor runs fully headless (SSH/servers). Both front-ends sit over the same engine; neither is the only way to do something.
  • Privacy — local only; inventory export is opt-in and reviewable; no telemetry.
  • Portability — graceful degradation when a sensor/tool is unavailable (N/A, not crash).
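The graceful-degradation and query-timeout requirements can be sketched together: an external tool is called with a deadline, and a missing tool, a failing tool, and a hung tool are reported as distinct states instead of raising. `query_tool` and its status strings are assumptions for this example; the spec's M3 description treats a query timeout as a sign of a "lost" or hung GPU, which is why the timeout is kept separate here.

```python
import subprocess
from typing import List, Optional, Tuple


def query_tool(argv: List[str], timeout_s: float = 2.0) -> Tuple[str, Optional[str]]:
    """Run an external tool and return (status, stdout).

    status is one of "ok", "unavailable", "failed", "timeout" (hypothetical labels).
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except (FileNotFoundError, PermissionError):
        return ("unavailable", None)  # tool not installed: show N/A
    except subprocess.TimeoutExpired:
        return ("timeout", None)      # candidate "GPU lost / hang" signal
    except subprocess.CalledProcessError:
        return ("failed", None)       # tool ran but reported an error
    return ("ok", result.stdout.strip())


if __name__ == "__main__":
    status, output = query_tool(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"]
    )
    print(status, output if output is not None else "N/A")
```

A caller such as the logger could write a marker record when the status is "timeout" and simply display N/A for "unavailable".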

6. Open questions

None tracked: all foundational decisions (D1–D15) are settled; see DECISIONS.md. Detail to flesh out during build: the tray's supporting-action set and per-module apt package names. Packaging/deps are Ubuntu/apt-only (D15); no multi-distro mapping is maintained.