rigdoctor/docs/MODULES.md
jessey 984292c368 feat(m15): collect session-scoped system logs (kernel + coredumps) — 0.31.0
core/syslogs.py gathers, scoped to the diagnostic window:
- kernel-log slice (journalctl -k): Xid, OOM, MCE, PCIe AER, thermal, hung tasks
- crashed-process records (coredumpctl): exe, signal, when
Stored as syslogs.txt in the diagnostic dir, included in the Report bundle, and
fed to the AI on "Explain" alongside the game logs. Best-effort (degrades if the
tools are missing/denied); treats journalctl's "-- No entries --" as empty.
Tests + docs (M15/SPEC).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 14:10:30 +02:00

RigDoctor — Module Catalog (DRAFT v0.2)

Status: not started · 🟦 designing · 🟨 in progress · done

Module set per D14, plus M12 (session sharing, D16), M13 (auto-update, D18), M14 (AI assistant, D24), and M15 (logging & reports, D25). M7 (stress/repro) was dropped (D7). M10/M11 are the GUI and tray modules (D10/D11). A GPU scope of "all (NVIDIA first)" means the NVIDIA path ships first and other vendors follow via the vendor abstraction (D4).

| ID | Module | Bundle | Key deps | GPU scope | Priority | Status |
|---|---|---|---|---|---|---|
| M1 | Sensor core | Essential | none (nvidia-smi, sysfs) | all (NVIDIA first) | P0 | |
| M3 | Crash-capture logger | Essential | none (opt: smartmontools) | all (NVIDIA first) | P0 | |
| M4 | Health report (log scan) | Essential | none (opt: smartmontools) | all (NVIDIA first) | P0 | |
| M2 | Live monitor (TUI) | Monitoring | none (stdlib curses) | all | P1 | |
| M8 | Alerting | Monitoring | libnotify (opt) | all | P2 | |
| M5 | System inventory | Diagnostics | none (opt: lm-sensors, dmidecode) | all | P1 | |
| M6 | Gaming env checks | Diagnostics | none | all | P2 | 🟨 |
| M10 | Desktop GUI | Desktop UI | python3-pyside6 | all | P2 | |
| M11 | Tray / menu-bar applet | Desktop UI | python3-pyside6 (+ AppIndicator on GNOME) | all | P2 | |
| M9 | Installer | (meta) | none | all | P1 | 🟨 |
| M12 | Session sharing (shared terminal) | Sharing | none (relay) | all | P3 | |
| M13 | Auto-update | (core) | none (stdlib; user-local file swap) | all | P3 | |
| M14 | AI assistant (explain diagnostics) | (optional) | none (stdlib urllib; Ollama or Claude) | all | P3 | |
| M15 | Logging & report bundles | (core) | none (stdlib logging + zip) | all | P3 | |
| M7 | Stress / repro | | | | | dropped (D7) |

Notes per module

  • M1 Sensor core — the foundation everything else samples from. Stdlib-only. Abstracts NVIDIA/AMD/Intel + hwmon behind one interface; ship the NVIDIA + hwmon path first.

  • M3 Crash-capture logger — the highest-value piece for the seed use case. fsync per sample; GPU-lost detection via query timeout; bounded rotation; systemd --user service with a user-selectable trigger mode (always-on / game-launch / manual — D6). Implemented (manual trigger): JSONL log with fsync-per-sample, size-based rotation (log_max_bytes/log_backups), GPU-lost/recovered event markers, atomic status file, and rigdoctor record run|start|stop|status|report. The foreground run is the systemd-ready entrypoint. The game-launch trigger is implemented via the D12 wrapper (rigdoctor wrap %command%, see M6 below); the systemd --user service unit + always-on trigger (D6) and the zero-config watcher (D12) are still pending. Also fully driven from the GUI's Recording/Logs page (M10) via shared core.reccontrol.

  • M4 Health report — turns scattered logs into a prioritized, plain-language findings list with suggested fixes (read-only, D9). Reuses M1 for a live snapshot. Also powers the guided diagnostic session (with M3): pick a game → focused capture → scan → findings (see SPEC §4). Implemented: journalctl scan (Xid/panic/OOM/MCE/AER/thermal/amdgpu), SMART, NVIDIA driver-mismatch, journald-persistence + live-temp checks; rigdoctor report (text/JSON) + GUI Health tab. GPU-firmware verification deferred.

  • M2 Live monitor — the terminal "HWMonitor for Linux" face. Implemented (tui.py): rigdoctor monitor is a stdlib curses dashboard — current / session-min / session-max per sensor, grouped by subsystem, with temperature & utilization color bands; q quits, r resets the min/max. Falls back to a plain redraw on a non-TTY (--plain forces it).

  • M5 / M6 Diagnostics — inventory export + gaming-env checks; M6 flags risky settings and suggests the fix command but does not apply it (D9).

    • M6 implemented (Steam detection first — the D12 "pick a game" foundation): discovers Steam installs + all library folders (libraryfolders.vdf, multi-drive) and the games in each (appmanifest_*.acf), filtering runtimes/Proton/redistributables — stdlib only. Libraries are opt-in (steam_libraries config); the GUI Games page lists them with per-library counts and rescans in the background on every launch, badging games installed since the last scan (cached in state/games.json). CLI: rigdoctor games / games libraries [--enable|--disable|--all].
    • Env-check engine implemented (core/gameenv.py): a read-only findings report (reusing the M4 Finding model) over PCIe ASPM, NVIDIA persistence mode, CPU governor (the three seed-case contributors to GPU bus-drop / Xid 79), GameMode, MangoHud, swappiness, shader cache, THP, CPU mitigations, and installed Proton versions — each with the suggested fix command. CLI rigdoctor gameenv; GUI Environment page.
    • Per D22, the GUI adds one-click apply for the runtime-reversible tunables (governor / NVIDIA persistence / PCIe ASPM / swappiness / THP — dropdown + Apply via a single pkexec prompt, core/fixes.py) and one-click install of optional tools (GameMode / MangoHud / cpupower, now in the M9 catalog). GRUB/mitigations stay suggestion-only.
    • Guided diagnostic (D12 "pick a game", core/diagnostic.py): a focused capture tagged with a game → window-scoped report (capture summary + M4 findings), in the CLI (rigdoctor diagnose start/status/finish) and GUI (per-game Run Diagnostic → recording banner → results dialog).
    • Auto-capture via the D12 wrapper (rigdoctor wrap %command%, core/wrap.py; GUI "Auto-capture…" helper).
    • Hard crashes are detected (capture left without a clean stop) and flagged on next launch with a crash-boot kernel-log analysis (pending_crash/analyze_crash + health.check_previous_boot).
    • Non-Steam launchers (Lutris SQLite + Heroic JSON, core/launchers.py) are detected and listed alongside Steam games; env checks also cover GPU PowerMizer (X), Wine and Steam-client versions.
    • Pending: the zero-config watcher (D12 fallback) — landing with M9's trigger-mode work.

  • M8 Alerting — threshold/event notifications; integrates with the tray applet (M11).

  • M10 Desktop GUI — PySide6 graphical front-end over the core engine. Optional; adds the Qt dependency. Dark-themed window with a grouped sidebar (Monitor / Diagnose / System / App) over: Dashboard (live history graphs + per-subsystem cards), Games (M6 detection + Run Diagnostic), Recordings (recorder controls + view/report any captured log + analyze a crash), System Health (M4 scan), Tuning (M6 gaming tunables + fixes), Inventory (M5), Settings (components/deps + alerts + account + uninstall), and Share (M12). A global recording badge shows on every page. GUI-first per D17.

  • M11 Tray applet: a QSystemTrayIcon menu-bar applet. Implemented (gui/tray.py, D13): the menu shows live M1 readouts (CPU temp, GPU temp, memory used/total) + a status line (Normal / Hot / GPU not responding), led by a Run Diagnostic submenu (per detected game → the guided session), plus Open dashboard / Start-Stop recording / Snapshot-copy / Quit. It shares the dashboard's sample stream (no extra sampling) and drives the existing MainWindow flows. With a tray present, closing the window hides to the tray (Quit exits); rigdoctor-gui --tray starts hidden for autostart. Optional; shares the Qt dependency with M10. Needs a tray host — on GNOME that means the AppIndicator extension; degrades to no-op if none is available.

  • M9 Installer — interactive wizard layered on the .deb (D8); apt-first dependency resolution; enables the logger service and trigger mode. Implemented (first cut): distro/package-manager/GPU detection (core/sysenv), an optional-component catalog (core/catalog), and dependency install via pkexec/sudo — rigdoctor install [--check] [-y] + GUI Setup tab. The user-local app install is install.sh (private venv + ~/.local/bin launchers + desktop entry, no root; handles the python3-venv prerequisite) plus a self-extracting .run (pure-Python self-extractor, packaging/make_run.py, built by CI). Pending: config/module selection + systemd --user service enable.

  • M12 Session sharing / remote assist (D16, scoped to terminal-only by D23) — a single mode: a host-consented shared terminal over the relay. The host shares a real PTY running their $SHELL (colors/theming preserved — fish etc.); the guest watches live and can type only if the host allows it (otherwise read-only) — a deliberate, consent-gated exception to D9. The host reads along and can type too (e.g. a sudo password, which stays local). Either side can pop the terminal full-screen. Account-gated by the Gitea token. The earlier read-only stats view and share serve (Tier 1/2) were removed.

  • M13 Auto-update (D18) — check + auth implemented: updates are gated to Gitea account holders via a Personal Access Token, stored encrypted in the OS keyring (secret-tool) with a 0600-file fallback (config.load_token/save_token/token_backend). core/updates queries the releases API with the token; CLI login/logout/update; GUI Setup "Update access" panel + sidebar states. The no-root self-update apply is implemented: rigdoctor update runs an authenticated pip install --upgrade "rigdoctor[gui] @ git+https://oauth2:<token>@…@<tag>" into the user-local venv (GUI "Update to v…" button + restart prompt; token scrubbed). Installed via the user-local install.sh / self-extracting .run (M9). Original plan: On launch, check the public Gitea releases API and self-update a user-local install with no root (download → verify checksum/signature → atomic symlink swap → restart, incl. the daemon). HTTPS-only, version-check-only (no telemetry), opt-out-able. Surfaced in the GUI; rigdoctor update in the CLI. (.deb users update via apt instead.)

  • M14 AI assistant (D24) — optional, strictly opt-in, never automatic: explains the collected diagnostics in plain language only when the user presses "Explain with AI" (core/ai.py, GUI button on the diagnostic dialog, rigdoctor ai explain). The user picks a provider explicitly (no default): Ollama (local, private, no key) or Claude (Anthropic Messages API, key in the keyring; consent prompt before sending). Answers are grounded — we pass the actual findings plus matched reference facts from a curated knowledge base (core/ai_knowledge.py, "RAG-lite": exact keyword/code match, no embeddings, stdlib only), which lifts a small local model and sharpens Claude. Stdlib urllib (no pip deps); output is advisory (D9). Configure in Settings → AI assistant.

  • M15 Logging & report bundles (D25) — opt-in via one logging_enabled toggle (default off): application logging to a rotating app.log (core/applog.py) and per-diagnostic storage (core/diagstore.py) — each diagnostic gets its own DATA_DIR/diagnostics/<id>/ (capture, result.json, report.txt, scoped game logs (core/gamelogs.py) and system logs (core/syslogs.py: journalctl -k slice + coredumpctl crashed-process records), and an ai/ record of every AI interaction: exact data sent, model, reply). "Report" zips one into DATA_DIR/reports/ (GUI button on the diagnostic dialog; CLI rigdoctor bundle). All logs are session-scoped and fed to the AI on "Explain". Stays local; shareable on demand.
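
The M3 note above names two mechanisms: fsync-per-sample JSONL logging and size-based rotation controlled by log_max_bytes/log_backups. The sketch below shows one way these could fit together. It is not the project's recorder: the SampleLog class, its method names, the default values and the ".1/.2" backup naming are assumptions made for illustration. Only the two config names come from the note.

```python
import json
import os
from pathlib import Path


class SampleLog:
    """Append-only JSONL log: one fsync per sample, size-based rotation.

    Hypothetical sketch; class and method names are not from the project.
    """

    def __init__(self, path, log_max_bytes=1_000_000, log_backups=3):
        self.path = Path(path)
        self.log_max_bytes = log_max_bytes
        self.log_backups = log_backups
        self._open("ab")

    def _open(self, mode):
        self._fh = open(self.path, mode)
        self._size = os.fstat(self._fh.fileno()).st_size

    def _backup(self, n):
        # Backup 0 is the live file; backup n is "<name>.n" (assumed naming).
        return self.path if n == 0 else self.path.with_name(f"{self.path.name}.{n}")

    def _rotate(self):
        self._fh.close()
        # Shift live -> .1 -> .2 ..., overwriting the oldest backup.
        for n in range(self.log_backups, 0, -1):
            src = self._backup(n - 1)
            if src.exists():
                os.replace(src, self._backup(n))
        # With zero backups nothing was moved, so truncate instead.
        self._open("wb" if self.log_backups == 0 else "ab")

    def write(self, sample):
        data = (json.dumps(sample, separators=(",", ":")) + "\n").encode("utf-8")
        # Rotate before a write that would push a non-empty file past the cap.
        if self._size and self._size + len(data) > self.log_max_bytes:
            self._rotate()
        self._fh.write(data)
        self._fh.flush()             # Python buffer -> kernel
        os.fsync(self._fh.fileno())  # kernel -> disk, before the next sample
        self._size += len(data)

    def close(self):
        self._fh.close()
```

Syncing after every line costs throughput, but it means the last sample before a hard freeze has already reached the disk, which is the point of a crash-capture log. Rotation keeps total disk use bounded at roughly log_max_bytes × (log_backups + 1).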

Bundles (final — D14)

  • Essential: M1 + M3 + M4 (the MVP, NVIDIA-only — D5)
  • Monitoring: M2 + M8
  • Diagnostics: M5 + M6
  • Desktop UI: M10 + M11 (adds PySide6)
  • Sharing: M12 (session sharing / remote assist — D16)
  • AI: M14 (optional AI explanations — D24)
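
The M14 note describes its grounding step as "RAG-lite": exact keyword/code match against a curated knowledge base, with no embeddings and stdlib only. A minimal sketch of that idea follows. The KNOWLEDGE entries, the function names and the prompt layout are all assumptions for illustration; the contents of core/ai_knowledge.py are not shown in this catalog.

```python
import re

# Illustrative entries only (assumed); not the project's knowledge base.
KNOWLEDGE = {
    "xid 79": "NVIDIA Xid 79: the GPU has fallen off the bus.",
    "aspm": "PCIe ASPM is a link power-saving feature; the env checks "
            "list it as a bus-drop contributor.",
    "out of memory": "The kernel OOM killer ends processes when memory runs out.",
}


def match_facts(findings, knowledge=KNOWLEDGE):
    """Return the facts whose key occurs as a whole token run in the findings."""
    text = "\n".join(findings).lower()
    facts = []
    for key, fact in knowledge.items():
        # Lookarounds keep "xid 79" from matching inside "xid 790".
        if re.search(rf"(?<!\w){re.escape(key)}(?!\w)", text):
            facts.append(fact)
    return facts


def build_prompt(findings, knowledge=KNOWLEDGE):
    """Assemble the text sent to the model: findings plus matched facts."""
    facts = match_facts(findings, knowledge)
    parts = ["Findings:"] + [f"- {f}" for f in findings]
    if facts:
        parts += ["", "Reference facts:"] + [f"- {f}" for f in facts]
    return "\n".join(parts)
```

Exact matching is crude next to embedding search, but it is deterministic, needs no extra dependencies, and only attaches a fact when the finding literally names the code or keyword.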

MVP candidate — confirmed (D5)

M1 + M3 + M4 (Essential), NVIDIA-only, CLI-first. Gives a working tool that captures the GPU crash and explains the logs — deliverable before the installer, GUI/tray, or multi-vendor work.
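
The "explains the logs" half of the MVP is M4's scan for Xid, panic, OOM, MCE, AER and thermal events. The sketch below shows how a rule-based pass over kernel-log lines might produce a prioritized findings list. The Finding fields, the severity scale and the regular expressions are assumptions chosen to resemble common kernel messages; they are not the project's actual rules.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    # Hypothetical fields; the project's real Finding model is not shown here.
    severity: int  # lower = more urgent (assumed scale)
    title: str
    line: str


# (severity, title, pattern) rules; the patterns are illustrative assumptions.
RULES = [
    (0, "GPU Xid error", re.compile(r"NVRM: Xid \(PCI:[^)]*\): (\d+)")),
    (0, "Kernel panic", re.compile(r"Kernel panic")),
    (1, "Out-of-memory kill", re.compile(r"Out of memory: Killed process")),
    (1, "Machine check (MCE)", re.compile(r"mce: \[Hardware Error\]")),
    (2, "PCIe AER error", re.compile(r"AER: ")),
    (2, "Thermal throttling", re.compile(r"temperature above threshold")),
]


def scan(lines):
    """Classify log lines with the first matching rule, most urgent first."""
    findings = []
    for line in lines:
        for severity, title, pattern in RULES:
            m = pattern.search(line)
            if m:
                # Append the captured code (e.g. the Xid number) when present.
                label = f"{title} {m.group(1)}" if m.groups() else title
                findings.append(Finding(severity, label, line.strip()))
                break
    return sorted(findings, key=lambda f: f.severity)
```

Feeding it the output of a kernel-log slice, one line per string, would yield the Xid entries ahead of lower-priority AER or thermal lines.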