Initial commit: docs, decisions, and M1 sensor core
Planning docs (SPEC, ARCHITECTURE, MODULES, ROADMAP, DECISIONS) with decisions D1-D15 settled: RigDoctor name, Python 3 + Qt/PySide6 stack (core/CLI/daemon stdlib-only), Ubuntu + NVIDIA first, .deb packaging, read-only + suggestions, GUI + tray modules, stress module dropped. First code: the M1 sensor core (stdlib-only) and a CLI. - core engine: Reading/Sample model, Sampler, hwmon reader - self-probing sources (NVIDIA first): nvidia-smi GPU, coretemp/k10temp CPU, /proc/meminfo + DDR5 SPD memory, NVMe storage - CLI: snapshot (text/JSON), monitor, sources; record/report stubbed - stdlib unittest smoke tests Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
+156
@@ -0,0 +1,156 @@
|
||||
# RigDoctor — Product Specification (DRAFT v0.2)
|
||||
|
||||
> Living spec. The foundational decisions (name, language, platform/GPU priority, MVP scope,
|
||||
> packaging, scope-of-action, GUI/tray) are now settled — see `DECISIONS.md` (D1–D11).
|
||||
> Anything still marked **[OPEN]** is tracked there (D12–D15).
|
||||
|
||||
## 1. Vision
|
||||
|
||||
A single, modular toolkit that lets a Linux gamer **monitor**, **diagnose**, and
|
||||
**understand the health** of their machine — especially the hard-to-catch faults that happen
|
||||
under gaming load. The goal is to make otherwise near-impossible-to-investigate problems
|
||||
(random freezes, the screen suddenly going black mid-game, GPU "lost" events) tractable by
|
||||
capturing the right data automatically and explaining it in plain language. Users install
|
||||
only the modules relevant to their hardware via an interactive installer.
|
||||
|
||||
**Motivating cases:**
|
||||
- An RTX 3070 intermittently falls off the PCIe bus under heavy GPU/VRAM load
|
||||
(`Xid 79` / `Xid 154`, `NV_ERR_GPU_IS_LOST`). The crash is OS-independent (also seen on
|
||||
Windows in Tarkov) and load-correlated, pointing at hardware (VRAM thermals / power
|
||||
transients / PCIe signal integrity).
|
||||
- A monitor going black mid-session (e.g. during Path of Exile) — is it the GPU dropping,
|
||||
a driver reset, a cable/DP link issue, or a power event? Manually impossible to tell after
|
||||
the fact.
|
||||
|
||||
In both cases the last sensor readings before the freeze are normally never captured.
|
||||
RigDoctor's crash-safe logger is designed to fix exactly that.
|
||||
|
||||
## 2. Goals / Non-goals
|
||||
|
||||
**Goals**
|
||||
- Catch and preserve the machine's state in the seconds before a hard freeze.
|
||||
- Make hard-to-investigate gaming faults debuggable: collect scattered signals, correlate
|
||||
them, and explain them.
|
||||
- Offer **three ways to run**: full **CLI / headless** (works over SSH), a **desktop GUI**,
|
||||
and a **system-tray / top-menu-bar applet** with quick actions. (D10/D11)
|
||||
- Be modular: a novice installs a one-click "monitor + capture + report" bundle; a power
|
||||
user installs everything including the GUI, tray, and diagnostics.
|
||||
- Low overhead; safe defaults; no telemetry/phone-home.
|
||||
|
||||
**Non-goals (for now)**
|
||||
- Not a benchmark-score / e-peen leaderboard tool.
|
||||
- **Not a stress-test / load-generator** — explicitly out of scope (D7). Users can run
|
||||
existing tools (gpu-burn, vkmark, stress-ng) alongside the logger if they want.
|
||||
- Not an overclocking utility.
|
||||
- **Not (yet) an auto-fixer.** RigDoctor is **read-only**: it diagnoses and *suggests*
|
||||
actions (with the exact command where possible) but does not apply changes itself in this
|
||||
stage. Auto-apply is a deliberate later milestone behind explicit consent. (D9)
|
||||
|
||||
## 3. Target users & platforms
|
||||
|
||||
- **Users:** Linux gamers from novice ("is my PC ok?" + alerts, via GUI/tray) to advanced
|
||||
(raw logs, log forensics, headless capture over SSH).
|
||||
- **Distros:** **Ubuntu first** (and Debian via `apt`). Arch (`pacman`) / Fedora (`dnf`) /
|
||||
openSUSE (`zypper`) best-effort later, behind the distro abstraction. (D3)
|
||||
- **GPUs:** **NVIDIA first** (seed hardware). AMD second, Intel third — behind the vendor
|
||||
abstraction. (D4)
|
||||
- **Display:** GUI and tray must work under both X11 and Wayland on Ubuntu/GNOME; **all core
|
||||
functionality must also work fully headless** (CLI, over SSH, no display).
|
||||
- **Runtime:** Python 3 + Qt (PySide6). Core/CLI/daemon are stdlib-only; GUI and tray add
|
||||
PySide6. (D2)
|
||||
|
||||
## 4. Functional requirements (by module)
|
||||
|
||||
> Module IDs are stable. **M7 (stress/repro) is dropped** (D7). M10/M11 are the new GUI and
|
||||
> tray modules.
|
||||
|
||||
### M1 — Sensor core (foundation, always installed)
|
||||
Unified sampling of: CPU temp/freq/load, per-core; GPU temp/(mem-junction if exposed)/
|
||||
clocks/power/util/fan/VRAM/PCIe gen+width/throttle reasons; RAM (DDR5 SPD) temps; NVMe/SSD
|
||||
temps; system load. Pluggable sources: `nvidia-smi`/NVML (first), `amdgpu` sysfs/`rocm-smi`
|
||||
(later), `/sys/class/hwmon`, `lm-sensors`. Stdlib-only.
|
||||
|
||||
### M2 — Live monitor (TUI)
|
||||
HWMonitor-style terminal dashboard: current / session-min / session-max per sensor, grouped
|
||||
by subsystem, with throttle/critical highlighting. Refresh rate configurable. The terminal
|
||||
face of the live data (the GUI in M10 presents the same data graphically).
|
||||
|
||||
### M3 — Crash-capture logger (daemon)
|
||||
Headless background sampler that writes CSV/JSON and **`fsync`s every sample** so the last
|
||||
readings survive a hard lock. Detects GPU "lost"/hang (query timeout) and writes a marker.
|
||||
Ring-buffer/rotation to bound disk use. Runs as a `systemd --user` service. **Trigger model
|
||||
is user-selectable** (D6): always-on, game-launch-triggered, or manual (CLI / tray button).
|
||||
Stdlib-only.
|
||||
|
||||
### M4 — Health report (one-shot)
|
||||
Scans `journalctl` for Xid, kernel panics, OOM-killer, MCE, PCIe AER, thermal events; checks
|
||||
SMART disk health; flags driver/library version mismatches; verifies GPU firmware; prints a
|
||||
prioritized findings list with plain-language explanations and **suggested** fixes (read-only
|
||||
per D9). Reuses M1 for a live snapshot.
|
||||
|
||||
### M5 — System inventory
|
||||
CPU/GPU/motherboard/BIOS/RAM/storage, kernel, driver versions, X11/Wayland + compositor,
|
||||
PCIe topology. Exportable (Markdown/JSON) to paste into forum/bug reports.
|
||||
|
||||
### M6 — Gaming environment checks
|
||||
Detects & evaluates: GPU power profile / persistence mode, CPU governor, Proton/Wine/Steam
|
||||
versions, GameMode, MangoHud, shader cache, swappiness, hugepages, CPU mitigations,
|
||||
PCIe ASPM. Flags settings that hurt stability/performance and **suggests** the fix command
|
||||
(read-only per D9).
|
||||
|
||||
### M8 — Alerting
|
||||
Threshold + event alerts (desktop notification / sound / log) on overheat, throttle,
|
||||
GPU-lost, SMART failure. Surfaces in the tray applet (M11) when installed. Optional.
|
||||
|
||||
### M10 — Desktop GUI (PySide6/Qt)
|
||||
Full graphical front-end over the core engine: live dashboard (graphs/gauges), browse and
|
||||
visualize captured crash logs, run a health report and view findings, view system inventory,
|
||||
toggle the logger and its trigger mode. Mirrors CLI capability for non-terminal users.
|
||||
Optional module (pulls in PySide6).
|
||||
|
||||
### M11 — System-tray / menu-bar applet (PySide6/Qt)
|
||||
A small always-available applet in the Linux top menu bar (system tray /
|
||||
StatusNotifierItem; on Ubuntu/GNOME via the AppIndicator extension). Optional module.
|
||||
Contents (D13):
|
||||
- **At-a-glance live readouts (from M1)** in the dropdown, refreshed periodically:
|
||||
**CPU temp, GPU temp, memory used/total** (e.g. "14 GB / 32 GB"); a status dot
|
||||
(normal / throttling / alert) alongside.
|
||||
- **Run Diagnostic** — the headline action; launches the *guided diagnostic session* below.
|
||||
- **Supporting actions:** Open dashboard (M10), Start/Stop recording, Snapshot now, Quit.
|
||||
|
||||
### Guided diagnostic session (M3 + M4 workflow)
|
||||
The "Run Diagnostic" flow available from the tray (M11), the GUI (M10), and the CLI:
|
||||
1. **Pick a game to focus on** — chosen from detected/installed games (via the D12 game
|
||||
detection: Steam library / recently played / running process).
|
||||
2. **Collect** — RigDoctor runs a focused crash-capture session (M3) scoped to that game:
|
||||
it logs while you play, bracketing the session via the D12 wrapper/watcher.
|
||||
3. **Scan & analyze** — when the session ends (or after a crash + reboot), it runs the
|
||||
health report (M4) over the captured window + system logs to surface likely issues.
|
||||
4. **Present findings** — a prioritized, plain-language list with suggested fixes
|
||||
(read-only, D9).
|
||||
This is the one-click expression of the seed use case; it orchestrates existing modules
|
||||
rather than adding a new one.
|
||||
|
||||
### M9 — Installer (see ARCHITECTURE §5)
|
||||
Interactive wizard: detect GPU vendor (NVIDIA-first) → present module menu grouped into
|
||||
bundles with descriptions and the exact packages each needs → resolve & install (apt first)
|
||||
→ write config → optionally enable the `systemd --user` logger service and pick its trigger
|
||||
mode. Delivered alongside the `.deb` (D8). Module list/bundling is final per D14.
|
||||
|
||||
## 5. Non-functional requirements
|
||||
- **Zero hard deps for the core/CLI/daemon** — Python stdlib + tools already present. **Qt
|
||||
(PySide6) is required only by the GUI (M10) and tray (M11) modules**, declared in the
|
||||
`.deb` and pulled in only when those modules are selected.
|
||||
- **Crash-safe logging** — flush + `fsync` per sample; bounded disk usage.
|
||||
- **Low overhead** — default ≤1 Hz sampling; negligible CPU/GPU cost. The always-on daemon
|
||||
is stdlib-only (no Qt loaded) so it stays tiny.
|
||||
- **Headless-equivalent** — every diagnostic capability is reachable from the CLI; the GUI
|
||||
and tray are conveniences over the same engine, never the only way to do something.
|
||||
- **Privacy** — local only; inventory export is opt-in and reviewable; no telemetry.
|
||||
- **Portability** — graceful degradation when a sensor/tool is unavailable (N/A, not crash).
|
||||
|
||||
## 6. Open questions
|
||||
None tracked — all foundational decisions (D1–D15) are settled; see `DECISIONS.md`. Detail
|
||||
to flesh out during build: the tray's supporting-action set and per-module apt package names.
|
||||
Packaging/deps are **Ubuntu/apt-only** (D15) — no multi-distro mapping is maintained.
|
||||
</content>
|
||||
Reference in New Issue
Block a user