# RigDoctor — Product Specification (DRAFT v0.2)
> Living spec. The foundational decisions (name, language, platform/GPU priority, MVP scope,
> packaging, scope-of-action, GUI/tray) are now settled; see `DECISIONS.md` (D1–D11).
> Anything still marked **[OPEN]** is tracked there (D12–D15).
## 1. Vision
A single, modular toolkit that lets a Linux gamer **monitor**, **diagnose**, and
**understand the health** of their machine — especially the hard-to-catch faults that happen
under gaming load. The goal is to make otherwise near-impossible-to-investigate problems
(random freezes, the screen suddenly going black mid-game, GPU "lost" events) tractable by
capturing the right data automatically and explaining it in plain language. Users install
only the modules relevant to their hardware via an interactive installer.

**Motivating cases:**
- An RTX 3070 intermittently falls off the PCIe bus under heavy GPU/VRAM load
(`Xid 79` / `Xid 154`, `NV_ERR_GPU_IS_LOST`). The crash is OS-independent (also seen on
Windows in Tarkov) and load-correlated, pointing at hardware (VRAM thermals / power
transients / PCIe signal integrity).
- A monitor going black mid-session (e.g. during Path of Exile) — is it the GPU dropping,
a driver reset, a cable/DP link issue, or a power event? After the fact, this is practically
impossible to determine by hand.

In both cases the last sensor readings before the freeze are normally never captured.
RigDoctor's crash-safe logger is designed to fix exactly that.
## 2. Goals / Non-goals
**Goals**
- Catch and preserve the machine's state in the seconds before a hard freeze.
- Make hard-to-investigate gaming faults debuggable: collect scattered signals, correlate
them, and explain them.
- Be **GUI-first** (D17): the **desktop GUI** is the primary interface, complemented by a
**system-tray / top-menu-bar applet** for quick actions — backed by a **full CLI** that
keeps complete functionality for headless / SSH / scripting use. (D10/D11/D17)
- Be modular: a novice installs a one-click "monitor + capture + report" bundle; a power
user installs everything including the GUI, tray, and diagnostics.
- Low overhead; safe defaults; no telemetry/phone-home.

**Non-goals (for now)**
- Not a benchmark-score / e-peen leaderboard tool.
- **Not a stress-test / load-generator** — explicitly out of scope (D7). Users can run
existing tools (gpu-burn, vkmark, stress-ng) alongside the logger if they want.
- Not an overclocking utility.
- **Not (yet) an auto-fixer.** RigDoctor is **read-only**: it diagnoses and *suggests*
actions (with the exact command where possible) but does not apply changes itself in this
stage. Auto-apply is a deliberate later milestone behind explicit consent. (D9)
## 3. Target users & platforms
- **Users:** Linux gamers from novice ("is my PC ok?" + alerts, via GUI/tray) to advanced
(raw logs, log forensics, headless capture over SSH).
- **Distros:** **Ubuntu first** (and Debian via `apt`). Arch (`pacman`) / Fedora (`dnf`) /
openSUSE (`zypper`) best-effort later, behind the distro abstraction. (D3)
- **GPUs:** **NVIDIA first** (seed hardware). AMD second, Intel third — behind the vendor
abstraction. (D4)
- **Display:** GUI and tray must work under both X11 and Wayland on Ubuntu/GNOME; **all core
functionality must also work fully headless** (CLI, over SSH, no display).
- **Runtime:** Python 3 + Qt (PySide6). Core/CLI/daemon are stdlib-only; GUI and tray add
PySide6. (D2)
## 4. Functional requirements (by module)
> Module IDs are stable. **M7 (stress/repro) is dropped** (D7). M10/M11 are the new GUI and
> tray modules.
### M1 — Sensor core (foundation, always installed)
Unified sampling of: CPU temp/freq/load, per-core; GPU temp/(mem-junction if exposed)/
clocks/power/util/fan/VRAM/PCIe gen+width/throttle reasons; RAM (DDR5 SPD) temps; NVMe/SSD
temps; system load. Pluggable sources: `nvidia-smi`/NVML (first), `amdgpu` sysfs/`rocm-smi`
(later), `/sys/class/hwmon`, `lm-sensors`. Stdlib-only.
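
As an illustration of one pluggable source, the sketch below reads temperatures from the
kernel's hwmon sysfs tree using only the standard library. The function names
(`read_hwmon_temps`, `_read_text`) and the returned key format are hypothetical and not taken
from the RigDoctor codebase; the only assumed facts are the standard hwmon layout
(`hwmonN/name`, `tempM_input` in millidegrees Celsius, optional `tempM_label`).

```python
from pathlib import Path


def _read_text(path):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"<chip>/<label>": degrees_celsius} for every readable temp input.

    Hypothetical helper: unreadable or malformed sensors are skipped rather
    than raising, so a missing source degrades to "no reading".
    """
    readings = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        chip = _read_text(chip_dir / "name") or chip_dir.name
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            raw = _read_text(temp_file)
            if raw is None:
                continue  # sensor vanished or is unreadable
            try:
                millidegrees = int(raw)
            except ValueError:
                continue  # malformed value
            stem = temp_file.name[: -len("_input")]  # e.g. "temp1"
            label = _read_text(temp_file.with_name(stem + "_label")) or stem
            readings[f"{chip}/{label}"] = millidegrees / 1000.0
    return readings
```

Passing `root` as a parameter keeps the reader testable against a fake directory tree; other
sources (NVML, `lm-sensors`) could plausibly return the same mapping shape.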
### M2 — Live monitor (TUI)
HWMonitor-style terminal dashboard: current / session-min / session-max per sensor, grouped
by subsystem, with throttle/critical highlighting. Refresh rate configurable. The terminal
face of the live data (the GUI in M10 presents the same data graphically).
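
The current / session-min / session-max bookkeeping can be sketched as a small tracker that
either front-end could consume. `SessionStats` is a hypothetical name; treating a missing
reading as `None` (shown as N/A) without disturbing the earlier extremes is an assumption.

```python
class SessionStats:
    """Track current, session-min and session-max per sensor (illustrative)."""

    def __init__(self):
        self._rows = {}

    def update(self, sensor, value):
        row = self._rows.setdefault(
            sensor, {"current": None, "min": None, "max": None}
        )
        row["current"] = value
        if value is None:
            return  # sensor unavailable this tick: keep earlier min/max
        row["min"] = value if row["min"] is None else min(row["min"], value)
        row["max"] = value if row["max"] is None else max(row["max"], value)

    def row(self, sensor):
        """Return a copy of the row so callers cannot mutate tracker state."""
        return dict(self._rows[sensor])
```

Feeding it the readings 61, 74, `None`, 58 for one sensor leaves `current` at 58, `min` at 58
and `max` at 74.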
### M3 — Crash-capture logger (daemon)
Headless background sampler that writes CSV/JSON and **`fsync`s every sample** so the last
readings survive a hard lock. Detects GPU "lost"/hang (query timeout) and writes a marker.
Ring-buffer/rotation to bound disk use. Runs as a `systemd --user` service. **Trigger model
is user-selectable** (D6): always-on, game-launch-triggered, or manual (CLI / tray button).
Stdlib-only.
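
A minimal sketch of the per-sample `fsync` and size-based rotation described above, assuming
a JSON-lines file. The class name `CrashSafeLog`, its parameters, and the `.1`, `.2` backup
naming are hypothetical and not the project's actual writer.

```python
import json
import os
import time


class CrashSafeLog:
    """Append-only JSONL log, flushed and fsynced after every record."""

    def __init__(self, path, max_bytes=5_000_000, backups=3):
        self.path = str(path)
        self.max_bytes = max_bytes  # rotate before the active file exceeds this
        self.backups = backups      # rotated files to keep (assumed >= 1)
        self._fh = open(self.path, "ab")

    def write(self, record):
        line = (json.dumps(record, separators=(",", ":")) + "\n").encode("utf-8")
        size = self._fh.tell()
        if size > 0 and size + len(line) > self.max_bytes:
            self._rotate()
        self._fh.write(line)
        self._fh.flush()             # Python buffer -> kernel
        os.fsync(self._fh.fileno())  # kernel page cache -> storage device

    def marker(self, event):
        """Write an event record such as "gpu_lost" between samples."""
        self.write({"ts": time.time(), "event": event})

    def _rotate(self):
        self._fh.close()
        # Shift path.1 -> path.2, ...; the oldest backup is overwritten.
        for i in range(self.backups - 1, 0, -1):
            older = f"{self.path}.{i}"
            if os.path.exists(older):
                os.replace(older, f"{self.path}.{i + 1}")
        os.replace(self.path, f"{self.path}.1")
        self._fh = open(self.path, "ab")

    def close(self):
        self._fh.close()
```

One record per line means a hard lock can at worst truncate the final line, which a reader
can discard. Disk use is bounded at roughly `max_bytes * (backups + 1)`. This sketch does not
fsync the containing directory after a rotation, which a stricter implementation might add.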
### M4 — Health report (one-shot)
Scans `journalctl` for Xid, kernel panics, OOM-killer, MCE, PCIe AER, thermal events; checks
SMART disk health; flags driver/library version mismatches; verifies GPU firmware; prints a
prioritized findings list with plain-language explanations and **suggested** fixes (read-only
per D9). Reuses M1 for a live snapshot.
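
The Xid part of the scan might look like the following sketch. It works on an iterable of
kernel log lines (for example the output of `journalctl -k`, captured elsewhere) so that it
stays testable without a journal. The function name `summarize_xids`, the shape of the
findings, and the sample lines in the usage note are illustrative assumptions; the regular
expression assumes the usual `NVRM: Xid (PCI:...): <code>,` message layout.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:01:00): 79, pid=4321, ..."
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+),")


def summarize_xids(lines):
    """Count Xid events per (PCI device, code), most frequent first."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group("pci"), int(match.group("code")))] += 1
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        {"pci": pci, "xid": code, "count": count}
        for (pci, code), count in ordered
    ]
```

Given two lines reporting Xid 79 and one reporting Xid 31 on the same device, the first
finding is the Xid 79 entry with `count` 2. Mapping codes to plain-language explanations
would be a separate lookup table, best checked against NVIDIA's Xid documentation.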
### M5 — System inventory
CPU/GPU/motherboard/BIOS/RAM/storage, kernel, driver versions, X11/Wayland + compositor,
PCIe topology. Exportable (Markdown/JSON) to paste into forum/bug reports.
### M6 — Gaming environment checks
Detects & evaluates: GPU power profile / persistence mode, CPU governor, Proton/Wine/Steam
versions, GameMode, MangoHud, shader cache, swappiness, hugepages, CPU mitigations,
PCIe ASPM. Flags settings that hurt stability/performance and **suggests** the fix command
(read-only per D9).
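
One such check, sketched for the CPU governor: read the value from sysfs, compare it against
a preferred set, and return a suggestion instead of changing anything. The function name, the
result dictionary, the default `preferred` tuple, and the suggested `cpupower` command are
assumptions made for illustration; which governors count as acceptable is a policy choice
this spec does not define.

```python
from pathlib import Path


def check_cpu_governor(sys_root="/sys", preferred=("performance", "schedutil")):
    """Read-only check: report the active governors and suggest, never apply."""
    base = Path(sys_root) / "devices/system/cpu"
    governors = set()
    for gov_file in base.glob("cpu[0-9]*/cpufreq/scaling_governor"):
        try:
            governors.add(gov_file.read_text().strip())
        except OSError:
            continue  # core offline or file unreadable
    if not governors:
        return {"check": "cpu_governor", "status": "n/a",
                "detail": "cpufreq not exposed"}
    unexpected = sorted(g for g in governors if g not in preferred)
    if not unexpected:
        return {"check": "cpu_governor", "status": "ok",
                "detail": ", ".join(sorted(governors))}
    return {
        "check": "cpu_governor",
        "status": "warn",
        "detail": "governor(s) in use: " + ", ".join(unexpected),
        # Suggested only; the user decides whether to run it (D9).
        "suggestion": "sudo cpupower frequency-set -g performance",
    }
```

Returning `n/a` when cpufreq is absent (virtual machines, some containers) keeps the check
from failing on systems where the setting simply does not exist.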
### M8 — Alerting
Threshold + event alerts (desktop notification / sound / log) on overheat, throttle,
GPU-lost, SMART failure. Surfaces in the tray applet (M11) when installed. Optional.
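
A sketch of threshold alerting that fires on level changes rather than on every sample, so a
sensor sitting above its limit does not raise a notification once per second. `Rule`,
`Alerter`, and the example thresholds are hypothetical; edge-triggering is a design
assumption, not something this spec mandates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    sensor: str
    warn: float
    critical: float

    def level_for(self, value):
        if value >= self.critical:
            return "critical"
        if value >= self.warn:
            return "warn"
        return "ok"


class Alerter:
    """Emit (sensor, level, value) only when a sensor's level changes."""

    def __init__(self, rules):
        self.rules = list(rules)
        self._last = {}  # sensor -> last reported level

    def check(self, sample):
        events = []
        for rule in self.rules:
            value = sample.get(rule.sensor)
            if value is None:
                continue  # sensor unavailable: no alert, state unchanged
            level = rule.level_for(value)
            if level != self._last.get(rule.sensor, "ok"):
                events.append((rule.sensor, level, value))
            self._last[rule.sensor] = level
        return events
```

With a rule of warn at 83 and critical at 90, the readings 70, 85, 86, 92, 60 produce events
only at 85 (warn), 92 (critical) and 60 (back to ok).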
### M10 — Desktop GUI (PySide6/Qt)
Full graphical front-end over the core engine: live dashboard (graphs/gauges), browse and
visualize captured crash logs, run a health report and view findings, view system inventory,
toggle the logger and its trigger mode. Mirrors CLI capability for non-terminal users.
Optional module (pulls in PySide6).
### M11 — System-tray / menu-bar applet (PySide6/Qt)
A small always-available applet in the Linux top menu bar (system tray /
StatusNotifierItem; on Ubuntu/GNOME via the AppIndicator extension). Optional module.
Contents (D13):
- **At-a-glance live readouts (from M1)** in the dropdown, refreshed periodically:
**CPU temp, GPU temp, memory used/total** (e.g. "14 GB / 32 GB"); a status dot
(normal / throttling / alert) alongside.
- **Run Diagnostic** — the headline action; launches the *guided diagnostic session* below.
- **Supporting actions:** Open dashboard (M10), Start/Stop recording, Snapshot now, Quit.
### Guided diagnostic session (M3 + M4 workflow)
The "Run Diagnostic" flow available from the tray (M11), the GUI (M10), and the CLI:
1. **Pick a game to focus on** — chosen from detected/installed games (via the D12 game
detection: Steam library / recently played / running process).
2. **Collect** — RigDoctor runs a focused crash-capture session (M3) scoped to that game:
it logs while you play, bracketing the session via the D12 wrapper/watcher.
3. **Scan & analyze** — when the session ends (or after a crash + reboot), it runs the
health report (M4) over the captured window + system logs to surface likely issues.
4. **Present findings** — a prioritized, plain-language list with suggested fixes
(read-only, D9).

This is the one-click expression of the seed use case; it orchestrates existing modules
rather than adding a new one.
### M9 — Installer (see ARCHITECTURE §5)
Interactive wizard: detect GPU vendor (NVIDIA-first) → present module menu grouped into
bundles with descriptions and the exact packages each needs → resolve & install (apt first)
→ write config → optionally enable the `systemd --user` logger service and pick its trigger
mode. Delivered with the user-local install (and the optional `.deb`) (D8). Module
list/bundling is final per D14.
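
The vendor-detection step could be done without any external tool by reading PCI IDs from
sysfs, as in this sketch. The function name and the NVIDIA, AMD, Intel ordering of the result
(mirroring D4) are assumptions; the vendor IDs and the `0x03` display-controller class prefix
are standard PCI values.

```python
from pathlib import Path

PCI_VENDORS = {"0x10de": "nvidia", "0x1002": "amd", "0x8086": "intel"}
PRIORITY = ("nvidia", "amd", "intel")  # assumed ordering, following D4


def detect_gpu_vendors(sys_root="/sys"):
    """Return GPU vendors found on the PCI bus, highest priority first."""
    found = set()
    for dev in (Path(sys_root) / "bus/pci/devices").glob("*"):
        try:
            pci_class = (dev / "class").read_text().strip().lower()
            vendor = (dev / "vendor").read_text().strip().lower()
        except OSError:
            continue
        # Class 0x03xxxx is "display controller" (VGA, 3D controller, ...).
        if pci_class.startswith("0x03"):
            found.add(PCI_VENDORS.get(vendor, "unknown"))
    ordered = [v for v in PRIORITY if v in found]
    if "unknown" in found:
        ordered.append("unknown")
    return ordered
```

On a laptop with an Intel iGPU and an NVIDIA dGPU this returns `["nvidia", "intel"]`, which
the wizard could use to preselect the matching bundles.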
### M12 — Session sharing / remote assist (D16)
Lets a user (A) grant a helper (B) inspection access, as an escalating, consent-driven
ladder: (1) **diagnostic bundle export** (inventory + recent capture log + report, one-way);
(2) **live read-only view** of the dashboard + logs over a user-chosen tunnel
(Tailscale/cloudflared/SSH — no RigDoctor-hosted relay); (3) **gated interactive terminal**
wrapping an existing tool (tmate/sshx), read-only by default, read-write only on explicit
consent. Per-session consent, ephemeral revocable tokens, permission escalation (view ≠
shell), and a session audit log. Tier 3 is a deliberate, consent-gated exception to the
read-only stance (D9). Built in Phase 6.
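
The token and escalation rules can be sketched as a small in-memory grant table. Everything
here is hypothetical (`SessionGrants`, the tier names, the audit tuple format); in
particular, letting a higher tier imply the lower ones is one assumed reading of the
"escalating ladder", and a real design might require exact-tier matches instead.

```python
import secrets
import time

TIERS = ("bundle", "view", "shell")  # assumed names for tiers 1 to 3


class SessionGrants:
    """Ephemeral, revocable, tiered access tokens with an audit trail."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._grants = {}  # token -> (tier, expiry on the clock's scale)
        self.audit = []    # append-only list of (action, tier)

    def grant(self, tier, ttl_seconds):
        if tier not in TIERS:
            raise ValueError(f"unknown tier: {tier}")
        token = secrets.token_urlsafe(32)
        self._grants[token] = (tier, self._clock() + ttl_seconds)
        self.audit.append(("grant", tier))
        return token

    def revoke(self, token):
        entry = self._grants.pop(token, None)
        if entry is not None:
            self.audit.append(("revoke", entry[0]))

    def allows(self, token, tier):
        entry = self._grants.get(token)
        if entry is None:
            return False
        granted, expires = entry
        if self._clock() >= expires:
            del self._grants[token]  # expired tokens are dropped lazily
            self.audit.append(("expire", granted))
            return False
        # A view token never allows shell; a shell token also allows view.
        return TIERS.index(tier) <= TIERS.index(granted)
```

A token granted for `view` passes `allows(token, "view")` and `allows(token, "bundle")` but
not `allows(token, "shell")`, and stops working after its TTL or an explicit `revoke`.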
## 5. Non-functional requirements
- **Zero hard deps for the core/CLI/daemon** — Python stdlib + tools already present. **Qt
(PySide6) is required only by the GUI (M10) and tray (M11) modules**, declared in the
`.deb` and pulled in only when those modules are selected.
- **Crash-safe logging** — flush + `fsync` per sample; bounded disk usage.
- **Low overhead** — default ≤1 Hz sampling; negligible CPU/GPU cost. The always-on daemon
is stdlib-only (no Qt loaded) so it stays tiny.
- **GUI-first, CLI-complete** (D17) — the GUI is the primary interface, but every capability
is *also* reachable from the CLI so RigDoctor runs fully headless (SSH/servers). Both
front-ends sit over the same engine; neither is the only way to do something.
- **Privacy** — local only; inventory export is opt-in and reviewable; no telemetry.
- **Portability** — graceful degradation when a sensor/tool is unavailable (N/A, not crash).
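
The portability requirement and M3's query-timeout detection can share one code path: a
bounded query that classifies its outcome instead of raising. This is a sketch under stated
assumptions; `probe` and its status strings are hypothetical, and the command to run (for
example an `nvidia-smi` query) is left to the caller.

```python
import subprocess


def probe(cmd, timeout_s=2.0):
    """Run a sensor query and classify the outcome as (status, output).

    status is "ok", "error" (non-zero exit), "lost" (no answer within
    timeout_s) or "n/a" (tool missing or not executable).
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except (FileNotFoundError, PermissionError):
        return ("n/a", None)   # tool unavailable: degrade, do not crash
    except subprocess.TimeoutExpired:
        return ("lost", None)  # query hung: candidate GPU-lost marker
    if result.returncode != 0:
        return ("error", result.stderr.strip())
    return ("ok", result.stdout.strip())
```

A caller would show "N/A" for `n/a` and write a GPU-lost marker to the log on `lost`. One
caveat: `subprocess.run` kills the child on timeout and then waits for it, so a query stuck
in uninterruptible sleep inside the driver could still block; a production daemon may need a
separate watchdog for that case.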
## 6. Open questions
None tracked: all foundational decisions (D1–D15) are settled; see `DECISIONS.md`. Detail
to flesh out during build: the tray's supporting-action set and per-module apt package names.
Packaging/deps are **Ubuntu/apt-only** (D15) — no multi-distro mapping is maintained.