rigdoctor/tests
jessey edc2166011 feat(health): GPU stress monitor + per-drive SMART health/wear
Two diagnostics for the load-correlated GPU crashes and for storage wear.

GPU stress (`rigdoctor stress` + a System Health "Stress test…" dialog): drive a GPU
load and sample sensors at a high rate, then report per-metric min/avg/peak, time spent
above each temperature threshold, power draw vs. the power limit, throttling (decoded
from the NVML clocks-event bitmask), and any GPU fault (Xid / VA-space freeze /
query-timeout hang) seen in the window. Load source: an explicit --command, an
auto-detected loader, or monitor-only (you launch the game yourself). The analysis is a
pure, unit-tested function.
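
A minimal sketch of what such a pure analysis step could look like, assuming samples
carry a timestamp, a temperature, a power reading and the raw clocks-event mask. The
names `Sample`, `analyze`, `seconds_above` and `decode_clocks_events` are hypothetical
and not taken from rigdoctor; the bit values follow the NVML clocks-event reason
constants as I understand them. GPU fault detection (Xid etc.) is left out.

```python
from dataclasses import dataclass
from statistics import fmean

# NVML clocks-event (formerly "throttle") reason bits; names here are illustrative.
CLOCKS_EVENT_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


@dataclass(frozen=True)
class Sample:
    """One sensor reading (hypothetical shape, not rigdoctor's own type)."""
    t: float                # seconds since the start of the window
    temp_c: float
    power_w: float
    clocks_event_mask: int  # raw bitmask as reported by NVML


def decode_clocks_events(mask):
    """Return the names of all reason bits set in the mask."""
    return [name for bit, name in CLOCKS_EVENT_REASONS.items() if mask & bit]


def summarize(values):
    return {"min": min(values), "avg": fmean(values), "peak": max(values)}


def seconds_above(samples, threshold_c):
    """Sum the intervals whose starting sample is above the threshold."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if prev.temp_c > threshold_c:
            total += cur.t - prev.t
    return total


def analyze(samples, temp_thresholds_c, power_limit_w):
    """Pure function: no I/O, so it can be unit-tested with canned samples."""
    if not samples:
        raise ValueError("no samples collected")
    reasons = set()
    for s in samples:
        reasons.update(decode_clocks_events(s.clocks_event_mask))
    power = [s.power_w for s in samples]
    return {
        "temp_c": summarize([s.temp_c for s in samples]),
        "power_w": summarize(power),
        "seconds_above": {th: seconds_above(samples, th) for th in temp_thresholds_c},
        "peak_power_ratio": max(power) / power_limit_w,
        "samples_over_limit": sum(1 for p in power if p > power_limit_w),
        "throttle_reasons": sorted(reasons),
    }
```

Keeping the sampler and the load driver outside this function means the report logic
can be exercised with hand-written sample lists instead of real hardware.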

Drive health (core/drives.py): parse the full `smartctl --json` output per drive into
prioritized findings: SMART verdict, derived life-left % (NVMe percentage_used or SATA
wear-leveling), power-on hours, TBW, temperature, and failure predictors
(reallocated/pending/offline sectors, NVMe media errors, low spare). Replaces the old
pass/fail-only check_smart; runs through the same elevated path (collect-priv / sudo),
degrading to "needs root" notes when run unprivileged.
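
A rough sketch of how findings could be derived from one parsed `smartctl --json`
report, not the actual contents of core/drives.py. The function names, severity levels
and the 10% life-left cutoff are assumptions for illustration. The JSON keys follow
smartctl's output as I understand it, and SATA attribute 177 is only a wear indicator
on some vendors' drives.

```python
CRITICAL, WARNING, INFO = 0, 1, 2  # lower number sorts first (assumed scheme)

# SATA failure predictors: reallocated, pending, offline-uncorrectable sectors.
PREDICTOR_IDS = {5, 197, 198}
WEAR_LEVELING_ID = 177  # vendor-dependent; treated as "life left" here


def ata_table(report):
    return report.get("ata_smart_attributes", {}).get("table", [])


def life_left_percent(report):
    """NVMe: 100 - percentage_used. SATA: normalized wear-leveling value."""
    nvme = report.get("nvme_smart_health_information_log", {})
    if "percentage_used" in nvme:
        return max(0, 100 - nvme["percentage_used"])
    for attr in ata_table(report):
        if attr.get("id") == WEAR_LEVELING_ID:
            return attr.get("value")
    return None


def terabytes_written(report):
    """NVMe data units are 1000 * 512 bytes each; returns None if absent."""
    nvme = report.get("nvme_smart_health_information_log", {})
    units = nvme.get("data_units_written")
    return None if units is None else units * 512_000 / 1e12


def drive_findings(report):
    """Return (severity, message) pairs, most severe first."""
    findings = []
    passed = report.get("smart_status", {}).get("passed")
    if passed is False:
        findings.append((CRITICAL, "SMART verdict: FAILED"))
    elif passed is None:
        findings.append((INFO, "SMART verdict unavailable (needs root?)"))

    nvme = report.get("nvme_smart_health_information_log", {})
    if nvme.get("media_errors", 0) > 0:
        findings.append((CRITICAL, f"NVMe media errors: {nvme['media_errors']}"))
    spare = nvme.get("available_spare")
    spare_min = nvme.get("available_spare_threshold")
    if spare is not None and spare_min is not None and spare <= spare_min:
        findings.append((CRITICAL, f"spare capacity low: {spare}%"))

    for attr in ata_table(report):
        raw = attr.get("raw", {}).get("value", 0)
        if attr.get("id") in PREDICTOR_IDS and raw > 0:
            findings.append((WARNING, f"{attr.get('name')}: {raw}"))

    life = life_left_percent(report)
    if life is not None:
        severity = WARNING if life <= 10 else INFO
        findings.append((severity, f"estimated life left: {life}%"))
    hours = report.get("power_on_time", {}).get("hours")
    if hours is not None:
        findings.append((INFO, f"power-on hours: {hours}"))
    temp = report.get("temperature", {}).get("current")
    if temp is not None:
        findings.append((INFO, f"temperature: {temp} C"))
    return sorted(findings, key=lambda f: f[0])
```

Because every lookup tolerates missing keys, a report collected without root simply
yields fewer findings plus the "needs root?" note rather than raising.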

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 16:59:06 +02:00