rigdoctor/docs/SPEC.md
jessey 2e545ff718 feat(share): terminal-only sharing, bigger + full-screen — 0.25.0
Scope M12 down to a single shared-terminal mode (D23, amends D16):
- Share page rewritten terminal-only: host shares their PTY/shell; guest watches
  and may type only if the host ticks "Allow the guest to type" (read-only
  otherwise — the D9 consent exception). Terminal is larger; either side can pop
  it full-screen (Esc to exit).
- Removed the read-only stats view + HTTP server (core/share.py) and the
  `rigdoctor share serve` CLI; deleted their tests.
- Docs: D23 added; SPEC/MODULES/ROADMAP updated (M12 → done, terminal-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:04:52 +02:00

# RigDoctor — Product Specification (DRAFT v0.2)
> Living spec. The foundational decisions (name, language, platform/GPU priority, MVP scope,
> packaging, scope-of-action, GUI/tray) are now settled; see `DECISIONS.md` (D1–D11).
> Anything still marked **[OPEN]** is tracked there (D12–D15).
## 1. Vision
A single, modular toolkit that lets a Linux gamer **monitor**, **diagnose**, and
**understand the health** of their machine — especially the hard-to-catch faults that happen
under gaming load. The goal is to make otherwise near-impossible-to-investigate problems
(random freezes, the screen suddenly going black mid-game, GPU "lost" events) tractable by
capturing the right data automatically and explaining it in plain language. Users install
only the modules relevant to their hardware via an interactive installer.

**Motivating cases:**
- An RTX 3070 intermittently falls off the PCIe bus under heavy GPU/VRAM load
(`Xid 79` / `Xid 154`, `NV_ERR_GPU_IS_LOST`). The crash is OS-independent (also seen on
Windows in Tarkov) and load-correlated, pointing at hardware (VRAM thermals / power
transients / PCIe signal integrity).
- A monitor going black mid-session (e.g. during Path of Exile) — is it the GPU dropping,
a driver reset, a cable/DP link issue, or a power event? Manually impossible to tell after
the fact.

In both cases the last sensor readings before the freeze are normally never captured.
RigDoctor's crash-safe logger is designed to fix exactly that.
## 2. Goals / Non-goals
**Goals**
- Catch and preserve the machine's state in the seconds before a hard freeze.
- Make hard-to-investigate gaming faults debuggable: collect scattered signals, correlate
them, and explain them.
- Be **GUI-first** (D17): the **desktop GUI** is the primary interface, complemented by a
**system-tray / top-menu-bar applet** for quick actions — backed by a **full CLI** that
keeps complete functionality for headless / SSH / scripting use. (D10/D11/D17)
- Be modular: a novice installs a one-click "monitor + capture + report" bundle; a power
user installs everything including the GUI, tray, and diagnostics.
- Low overhead; safe defaults; no telemetry/phone-home.

**Non-goals (for now)**
- Not a benchmark-score / e-peen leaderboard tool.
- **Not a stress-test / load-generator** — explicitly out of scope (D7). Users can run
existing tools (gpu-burn, vkmark, stress-ng) alongside the logger if they want.
- Not an overclocking utility.
- **Read-only by default, with a narrow consent-gated exception.** RigDoctor diagnoses and
*suggests* actions (with the exact command where possible). It does **not** apply changes
itself — **except** a small set of **runtime-reversible** gaming tunables (M6: CPU governor,
NVIDIA persistence, PCIe ASPM policy, swappiness, THP) that can be applied from the GUI via a
single pkexec prompt, no reboot, revert on reboot (D22, realizing the D9 milestone). Risky/
persistent fixes (GRUB cmdline, CPU mitigations) remain suggestion-only.
## 3. Target users & platforms
- **Users:** Linux gamers from novice ("is my PC ok?" + alerts, via GUI/tray) to advanced
(raw logs, log forensics, headless capture over SSH).
- **Distros:** **Ubuntu first** (and Debian via `apt`). Arch (`pacman`) / Fedora (`dnf`) /
openSUSE (`zypper`) best-effort later, behind the distro abstraction. (D3)
- **GPUs:** **NVIDIA first** (seed hardware). AMD second, Intel third — behind the vendor
abstraction. (D4)
- **Display:** GUI and tray must work under both X11 and Wayland on Ubuntu/GNOME; **all core
functionality must also work fully headless** (CLI, over SSH, no display).
- **Runtime:** Python 3 + Qt (PySide6). Core/CLI/daemon are stdlib-only; GUI and tray add
PySide6. (D2)
## 4. Functional requirements (by module)
> Module IDs are stable. **M7 (stress/repro) is dropped** (D7). M10/M11 are the new GUI and
> tray modules.
### M1 — Sensor core (foundation, always installed)
Unified sampling of: CPU temp/freq/load, per-core; GPU temp/(mem-junction if exposed)/
clocks/power/util/fan/VRAM/PCIe gen+width/throttle reasons; RAM (DDR5 SPD) temps; NVMe/SSD
temps; system load. Pluggable sources: `nvidia-smi`/NVML (first), `amdgpu` sysfs/`rocm-smi`
(later), `/sys/class/hwmon`, `lm-sensors`. Stdlib-only.
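
As an illustration of one pluggable source, the sketch below reads temperatures from `/sys/class/hwmon` using only the stdlib. The function name `read_hwmon_temps` and the `"<chip>/<label>"` key format are assumptions made for this example, not part of the spec; the millidegree scaling follows the kernel's hwmon sysfs convention.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"<chip>/<label>": degrees_celsius} for every readable temp input.

    Hypothetical helper: unreadable or malformed sensors are skipped so the
    caller can display "N/A" instead of crashing.
    """
    readings = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        try:
            chip = (chip_dir / "name").read_text().strip()
        except OSError:
            chip = chip_dir.name  # fall back to the directory name
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            label_file = temp_file.with_name(
                temp_file.name.replace("_input", "_label")
            )
            try:
                label = label_file.read_text().strip()
            except OSError:
                label = temp_file.name[: -len("_input")]  # e.g. "temp1"
            try:
                # hwmon reports temperatures in millidegrees Celsius.
                readings[f"{chip}/{label}"] = int(temp_file.read_text()) / 1000.0
            except (OSError, ValueError):
                continue
    return readings
```

Taking the root directory as a parameter keeps the reader testable against a fake sysfs tree; other sources (NVML, `rocm-smi`) would sit behind the same kind of function.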
### M2 — Live monitor (TUI)
HWMonitor-style terminal dashboard: current / session-min / session-max per sensor, grouped
by subsystem, with throttle/critical highlighting. Refresh rate configurable. The terminal
face of the live data (the GUI in M10 presents the same data graphically).
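
The current / session-min / session-max bookkeeping can be sketched as below. `SensorStats` and `update_session` are hypothetical names chosen for this example; the TUI rendering itself is out of scope here.

```python
import math
from dataclasses import dataclass


@dataclass
class SensorStats:
    """Current, session-minimum and session-maximum value for one sensor."""

    current: float = math.nan
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, value: float) -> None:
        self.current = value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)


def update_session(session: dict, readings: dict) -> dict:
    """Fold one sample ({sensor_name: value}) into the per-sensor stats."""
    for name, value in readings.items():
        session.setdefault(name, SensorStats()).update(value)
    return session
```

Because the stats live in a plain dictionary keyed by sensor name, the same structure could back both the terminal dashboard and the GUI view of the data.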
### M3 — Crash-capture logger (daemon)
Headless background sampler that writes CSV/JSON and **`fsync`s every sample** so the last
readings survive a hard lock. Detects GPU "lost"/hang (query timeout) and writes a marker.
Ring-buffer/rotation to bound disk use. Runs as a `systemd --user` service. **Trigger model
is user-selectable** (D6): always-on, game-launch-triggered, or manual (CLI / tray button).
Stdlib-only.
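
A minimal sketch of the two mechanisms described above: per-sample `fsync` and a query timeout that can be turned into a marker row. The class and function names, the CSV column layout, and the status strings are assumptions for illustration only.

```python
import csv
import os
import subprocess


class CrashSafeCsvLogger:
    """Append samples to a CSV file, forcing each row to disk before returning."""

    def __init__(self, path, fieldnames):
        is_new = not os.path.exists(path) or os.path.getsize(path) == 0
        self._fh = open(path, "a", newline="")
        self._writer = csv.DictWriter(self._fh, fieldnames=list(fieldnames))
        if is_new:
            self._writer.writeheader()
            self._sync()

    def _sync(self):
        self._fh.flush()             # Python buffer -> kernel
        os.fsync(self._fh.fileno())  # kernel page cache -> storage device

    def log(self, sample):
        self._writer.writerow(sample)  # missing columns are written as ""
        self._sync()

    def close(self):
        self._fh.close()


def run_query(cmd, timeout_s=2.0):
    """Run a sensor query command.

    Returns (status, stdout) where status is "ok", "error", "timeout"
    or "unavailable" (hypothetical status names).
    """
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""       # candidate GPU-lost / hang signal
    except OSError:
        return "unavailable", ""   # tool not installed or not executable
    return ("ok" if done.returncode == 0 else "error"), done.stdout.strip()
```

In a sampling loop, a `"timeout"` status would be logged as its own row (for example with a marker column) so the hang is visible in the file even if the machine locks up right after.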
### M4 — Health report (one-shot)
Scans `journalctl` for Xid, kernel panics, OOM-killer, MCE, PCIe AER, thermal events; checks
SMART disk health; flags driver/library version mismatches; verifies GPU firmware; prints a
prioritized findings list with plain-language explanations and **suggested** fixes (read-only
per D9). Reuses M1 for a live snapshot.
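
The Xid part of the scan can be sketched as a pure function over log lines. The regular expression assumes the usual `NVRM: Xid (PCI:...): <code>, ...` kernel message shape; `scan_xid_events`, `summarize_xid` and the one-entry hint table are hypothetical and would need to be extended from NVIDIA's Xid documentation.

```python
import re
from collections import Counter

# Assumed message shape: "NVRM: Xid (PCI:0000:01:00): 79, ..."
XID_RE = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")

# Illustrative hint table; a real one would cover many more codes.
XID_HINTS = {79: "GPU has fallen off the bus"}


def scan_xid_events(lines):
    """Return one dict per Xid message found in an iterable of log lines."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match is None:
            continue
        code = int(match.group("code"))
        events.append(
            {
                "device": match.group("device"),
                "code": code,
                "hint": XID_HINTS.get(code, "see NVIDIA Xid documentation"),
                "line": line.rstrip(),
            }
        )
    return events


def summarize_xid(events):
    """Return [(code, count), ...] with the most frequent code first."""
    return Counter(event["code"] for event in events).most_common()
```

The lines could come from something like `journalctl -k --no-pager`, read through `subprocess`; keeping the parser separate from the log source makes it easy to test against saved logs.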
### M5 — System inventory
CPU/GPU/motherboard/BIOS/RAM/storage, kernel, driver versions, X11/Wayland + compositor,
PCIe topology. Exportable (Markdown/JSON) to paste into forum/bug reports.
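
The export step might look like the sketch below, assuming the inventory has already been collected into a flat dictionary. The function names and the two-column table layout are illustrative choices, not decided by this spec.

```python
import json


def inventory_to_json(inventory):
    """Serialize the inventory dict as stable, human-readable JSON."""
    return json.dumps(inventory, indent=2, sort_keys=True)


def inventory_to_markdown(inventory):
    """Render the inventory dict as a two-column Markdown table."""
    lines = ["| Component | Value |", "| --- | --- |"]
    for key, value in inventory.items():
        cell = str(value).replace("|", "\\|")  # keep pipes from breaking the table
        lines.append(f"| {key} | {cell} |")
    return "\n".join(lines)
```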
### M6 — Gaming environment checks
Detects & evaluates: GPU power profile / persistence mode, CPU governor, Proton/Wine/Steam
versions, GameMode, MangoHud, shader cache, swappiness, hugepages, CPU mitigations,
PCIe ASPM. Flags settings that hurt stability/performance and **suggests** the fix command.
Also includes Steam library/game detection (the D12 "pick a game" foundation) and, per D22,
a **one-click apply** for the runtime-reversible tunables (governor, persistence, ASPM,
swappiness, THP) plus one-click install of optional tools (GameMode/MangoHud/cpupower).
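
A sketch of two of these checks (CPU governor and swappiness), returning findings with a suggested command rather than applying anything. The policy shown here (flag any governor other than `performance`, flag swappiness above 10) is an assumption made for the example, as are the function name and the finding fields.

```python
from pathlib import Path

MAX_SWAPPINESS = 10  # illustrative threshold, not a value fixed by this spec


def check_gaming_tunables(root="/"):
    """Return a list of findings; each carries a suggested (not applied) command."""
    root = Path(root)
    findings = []

    governors = set()
    pattern = "sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in root.glob(pattern):
        try:
            governors.add(gov_file.read_text().strip())
        except OSError:
            continue
    if governors and governors != {"performance"}:
        findings.append(
            {
                "setting": "cpu_governor",
                "current": ",".join(sorted(governors)),
                "suggest": "sudo cpupower frequency-set -g performance",
            }
        )

    try:
        swappiness = int((root / "proc/sys/vm/swappiness").read_text())
    except (OSError, ValueError):
        swappiness = None  # unavailable: report nothing rather than crash
    if swappiness is not None and swappiness > MAX_SWAPPINESS:
        findings.append(
            {
                "setting": "swappiness",
                "current": str(swappiness),
                "suggest": f"sudo sysctl vm.swappiness={MAX_SWAPPINESS}",
            }
        )
    return findings
```

Both suggested commands change runtime state only and are lost on reboot, which matches the runtime-reversible category described above.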
### M8 — Alerting
Threshold + event alerts (desktop notification / sound / log) on overheat, throttle,
GPU-lost, SMART failure. Surfaces in the tray applet (M11) when installed. Optional.
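
Threshold alerting can be sketched as an edge-triggered check, so a sensor that stays above its limit raises one alert instead of one per sample. Edge triggering, the class name `ThresholdAlerter`, and the message format are assumptions for this example.

```python
class ThresholdAlerter:
    """Emit an alert when a sensor newly reaches or exceeds its limit."""

    def __init__(self, limits):
        self._limits = dict(limits)  # {sensor_name: limit}
        self._active = set()         # sensors currently in the alert state

    def evaluate(self, readings):
        """Return alert messages for sensors that crossed their limit just now."""
        alerts = []
        for name, limit in self._limits.items():
            value = readings.get(name)
            if value is None:
                continue  # sensor unavailable: neither raise nor clear
            if value >= limit and name not in self._active:
                self._active.add(name)
                alerts.append(f"{name} at {value:g} (limit {limit:g})")
            elif value < limit:
                self._active.discard(name)
        return alerts
```

The returned messages could then be routed to a desktop notification, a sound, or the log, depending on what the user enabled.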
### M10 — Desktop GUI (PySide6/Qt)
Full graphical front-end over the core engine: live dashboard (graphs/gauges), browse and
visualize captured crash logs, run a health report and view findings, view system inventory,
toggle the logger and its trigger mode. Mirrors CLI capability for non-terminal users.
Optional module (pulls in PySide6).
### M11 — System-tray / menu-bar applet (PySide6/Qt)
A small always-available applet in the Linux top menu bar (system tray /
StatusNotifierItem; on Ubuntu/GNOME via the AppIndicator extension). Optional module.
Contents (D13):
- **At-a-glance live readouts (from M1)** in the dropdown, refreshed periodically:
**CPU temp, GPU temp, memory used/total** (e.g. "14 GB / 32 GB"); a status dot
(normal / throttling / alert) alongside.
- **Run Diagnostic** — the headline action; launches the *guided diagnostic session* below.
- **Supporting actions:** Open dashboard (M10), Start/Stop recording, Snapshot now, Quit.
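
The memory readout in the dropdown could be derived from `/proc/meminfo` roughly as follows. This sketch assumes "used" means `MemTotal - MemAvailable` and rounds to whole units; both choices, and the function names, are illustrative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value_in_kib}."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def memory_readout(text):
    """Return a string such as "14 GB / 32 GB", or "N/A" if fields are missing."""
    info = parse_meminfo(text)
    try:
        total_kib = info["MemTotal"]
        used_kib = total_kib - info["MemAvailable"]
    except KeyError:
        return "N/A"
    kib_per_unit = 1024 ** 2  # meminfo values are in KiB
    return f"{round(used_kib / kib_per_unit)} GB / {round(total_kib / kib_per_unit)} GB"
```

Usage would be along the lines of `memory_readout(open("/proc/meminfo").read())`, refreshed on the applet's timer.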
### Guided diagnostic session (M3 + M4 workflow)
The "Run Diagnostic" flow available from the tray (M11), the GUI (M10), and the CLI:
1. **Pick a game to focus on** — chosen from detected/installed games (via the D12 game
detection: Steam library / recently played / running process).
2. **Collect** — RigDoctor runs a focused crash-capture session (M3) scoped to that game:
it logs while you play, bracketing the session via the D12 wrapper/watcher.
3. **Scan & analyze** — when the session ends (or after a crash + reboot), it runs the
health report (M4) over the captured window + system logs to surface likely issues.
4. **Present findings** — a prioritized, plain-language list with suggested fixes
(read-only, D9).

This is the one-click expression of the seed use case; it orchestrates existing modules
rather than adding a new one.
### M9 — Installer (see ARCHITECTURE §5)
Interactive wizard: detect GPU vendor (NVIDIA-first) → present module menu grouped into
bundles with descriptions and the exact packages each needs → resolve & install (apt first)
→ write config → optionally enable the `systemd --user` logger service and pick its trigger
mode. Delivered with the user-local install (and the optional `.deb`) (D8). Module
list/bundling is final per D14.
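
The first wizard step, GPU vendor detection, can be sketched from PCI sysfs without any external tool. The function name and return format are assumptions; the vendor IDs are the standard PCI IDs for NVIDIA, AMD and Intel, and base class `0x03` is the PCI display-controller class.

```python
from pathlib import Path

PCI_VENDORS = {"0x10de": "nvidia", "0x1002": "amd", "0x8086": "intel"}
PRIORITY = ["nvidia", "amd", "intel"]  # NVIDIA first, per D4


def detect_gpu_vendors(root="/sys/bus/pci/devices"):
    """Return the GPU vendors present, ordered by support priority."""
    found = set()
    for dev in Path(root).glob("*"):
        try:
            pci_class = (dev / "class").read_text().strip().lower()
            vendor = (dev / "vendor").read_text().strip().lower()
        except OSError:
            continue  # unreadable device entry: skip it
        # Only display controllers count; this skips e.g. Intel audio devices.
        if pci_class.startswith("0x03") and vendor in PCI_VENDORS:
            found.add(PCI_VENDORS[vendor])
    return [name for name in PRIORITY if name in found]
```

On a laptop with hybrid graphics this would return more than one vendor, which the module menu could use to preselect bundles.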
### M12 — Session sharing / remote assist (D16, scoped to terminal-only by D23)
Lets a user (A) grant a helper (B) a **shared terminal** over the relay: A shares a real PTY
running their shell; B watches live and may type **only if A allows it** (otherwise read-only)
— a deliberate, consent-gated exception to the read-only stance (D9). A reads along and can
type too (e.g. a sudo password, which stays local and is never sent to B). Account-gated by the
Gitea token; per-session share code. The shared terminal preserves colors/theming and can be
viewed full-screen. *(The earlier read-only stats view / bundle export were dropped — D23.)*
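
The consent rule for guest keystrokes can be sketched independently of the relay and the PTY. `SharedTerminalGate` and its method names are hypothetical; on a real host, `write_to_pty` might be something like `lambda data: os.write(master_fd, data)` with `master_fd` obtained from `pty.openpty()`.

```python
class SharedTerminalGate:
    """Forward input to the host's PTY; guest input only while the host allows it."""

    def __init__(self, write_to_pty, allow_guest_input=False):
        self._write = write_to_pty                  # callable taking bytes
        self.allow_guest_input = allow_guest_input  # read-only by default

    def host_input(self, data: bytes) -> bool:
        """Host keystrokes are always forwarded."""
        self._write(data)
        return True

    def guest_input(self, data: bytes) -> bool:
        """Guest keystrokes are dropped unless the host has ticked the box."""
        if not self.allow_guest_input:
            return False
        self._write(data)
        return True
```

Keeping the check on the host side means a guest that ignores the read-only state still cannot inject input.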
## 5. Non-functional requirements
- **Zero hard deps for the core/CLI/daemon** — Python stdlib + tools already present. **Qt
(PySide6) is required only by the GUI (M10) and tray (M11) modules**, declared in the
`.deb` and pulled in only when those modules are selected.
- **Crash-safe logging** — flush + `fsync` per sample; bounded disk usage.
- **Low overhead** — default ≤1 Hz sampling; negligible CPU/GPU cost. The always-on daemon
is stdlib-only (no Qt loaded) so it stays tiny.
- **GUI-first, CLI-complete** (D17) — the GUI is the primary interface, but every capability
is *also* reachable from the CLI so RigDoctor runs fully headless (SSH/servers). Both
front-ends sit over the same engine; neither is the only way to do something.
- **Privacy** — local only; inventory export is opt-in and reviewable; no telemetry.
- **Portability** — graceful degradation when a sensor/tool is unavailable (N/A, not crash).
## 6. Open questions
None tracked: all foundational decisions (D1–D15) are settled; see `DECISIONS.md`. Detail
to flesh out during build: the tray's supporting-action set and per-module apt package names.
Packaging/deps are **Ubuntu/apt-only** (D15) — no multi-distro mapping is maintained.