feat(health): detect no-Xid GPU freezes (open-module VA-space faults) #46

Merged
jessey merged 4 commits from feat/gpu-vaspace-spt into main 2026-05-29 14:10:59 +00:00

4 Commits

Author SHA1 Message Date
jessey b65f36bb2d Merge branch 'main' into feat/gpu-vaspace-spt
tests / core (pull_request) Successful in 12s
tests / gui-smoke (pull_request) Successful in 29s
2026-05-29 14:10:01 +00:00
jessey 0f9cb4b684 chore(release): v0.42.0
tests / core (pull_request) Successful in 17s
tests / gui-smoke (pull_request) Successful in 29s
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 16:09:02 +02:00
jessey b9bfec961c feat(games): manually add games (e.g. SPT) with launch + own logs
Some titles never show up in a Steam/Lutris/Heroic scan — standalone mod
launchers like SPT (Single-Player Tarkov), itch.io downloads, hand-installed
executables. Add a user-authored custom-games list (core/customgames.py) shown
alongside the other sources in `rigdoctor games` and the GUI.

Each entry can carry a launch command and a log directory:
  - `rigdoctor games add "SPT" --command .../tarkov.sh` (logs/ auto-detected)
  - `rigdoctor games play "SPT"` launches it under the crash-capture wrapper
    (wrap.run gains an explicit game-name override, since there's no SteamAppId)
  - the diagnostic now feeds the game's own logs to the analysis: gamelogs
    .collect(game=...) tails the registered log dir (SPT's server/launcher logs)
    alongside the kernel log, freshness-scoped by mtime.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 16:07:25 +02:00
jessey b1bc961b79 feat(health): detect no-Xid GPU freezes (open-module VA-space faults)
The kernel-log scanner only caught Xid codes, OOM, panic, MCE, AER, thermal,
and amdgpu resets — so a hard freeze that logs NO Xid slipped through entirely.
Add detection for the NVIDIA open-kernel-module VA-space mapping fault
(gpu_vaspace.c / dmaAllocMapping / NVKMS GEM-allocation failures), which can
storm for minutes and end in a freeze without the GPU ever "falling off the
bus". Also flag when the open kernel module (nvidia-*-open) is loaded — the
context behind these faults — and add an AI-knowledge entry so the assistant
distinguishes it from the Xid 79 hardware drop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 16:07:14 +02:00