NVIDIA 4060 Mobile GSP crash

Hardware

Laptop: Lenovo Legion Y7000P 2024
CPU: Intel(R) Core(TM) i7-14650HX (24) @ 5.20 GHz
GPU: NVIDIA GeForce RTX 4060 Max-Q / Mobile
DE: KDE Plasma 6.5.5
Kernel: linux-zen-6.18.5.zen1-1
GPU driver: nvidia-open-dkms 590.48.01-2

Symptoms

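While I was using the computer, the screen suddenly froze and the machine hung completely; the only way out was a forced reboot with the physical power button. I also tried switching to a TTY, but that did not help: the display did not respond at all.
From the logs I determined that the cause was a GSP crash.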

Core part of the relevant logs

1月 20 21:20:29 Archlinux kernel: CPU: 10 UID: 0 PID: 379 Comm: nv_queue Tainted: G           OE       6.18.5-zen1-1-zen #1 PREEMPT(full)  a154a9980921b27ce7dfb38846b5368b5e23c7ab
1月 20 21:20:29 Archlinux kernel: Hardware name: LENOVO 83DG/LNVNB161216, BIOS NMCN32WW 06/13/2025
1月 20 21:20:29 Archlinux kernel:  _kgspRpcRecvPoll+0xa62/0xc30 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  _issueRpcAndWait+0xd5/0x930 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  ? osGetCurrentThread+0x26/0x60 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  ? _rmGpuLockIsOwner+0x24/0x90 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  rpcRmApiControl_GSP+0x274/0x960 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  gpuLogOobXidMessage_KERNEL+0x12d/0x160 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  ? os_alloc_mem+0x111/0x120 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  nvErrorLog2+0x93/0xf0 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  nvErrorLog_va+0x42/0x50 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  _gpuRefreshRecoveryActionInLock+0x12d/0x170 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  ? _gpuRefreshRecoveryActionInLock+0xf0/0x170 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  rm_execute_work_item+0x14f/0x1c0 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  ? __pfx__main_loop+0x10/0x10 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  _main_loop+0x93/0x150 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel:  ? __pfx__main_loop+0x10/0x10 [nvidia 58b884b3e61283649a246751be72c9c062c1bc52]
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc               : 01685200
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: riscvCpuctl           : 00000080
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: riscvIrqmask          : 01049042
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: riscvIrqdest          : 01000040
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrStat      : 00000000
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrInfo      : badf5108
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: riscvPrivErrAddr      : 20000000001c811c
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: riscvHubErrStat       : 00000000
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: falconMailbox         : 0:000003f8 1:000003f8
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: falconIrqstat         : 00000010
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: falconIrqmode         : 0180fc24
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: fbifInstblk           : 00000000
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: fbifCtl               : 00000190
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: fbifThrottle          : 00000064
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: fbifAchkBlk           : 0:00000000 1:00000000
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: fbifAchkCtl           : 0:00000000 1:00000000
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: fbifCg1               : 0000000f
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 00 = 0xffffffff93003dfa
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 01 = 0xffffffff9300a024
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 02 = 0x0000000001685200
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 03 = 0x0000000001684d4c
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 04 = 0xffffffff9300a024
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 05 = 0x0000000001684d16
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 06 = 0x0000000001b1b594
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 07 = 0x0000000001684cee
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 08 = 0x00000000016835c4
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 09 = 0x0000000001683358
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 10 = 0x0000000001ade67e
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 11 = 0x00000000019da9d0
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 12 = 0xffffffff93003e3c
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 13 = 0x00000000019da9c6
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 14 = 0x0000000001ade656
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 15 = 0x0000000001ade5c4
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: TRACE: 16 = 0xffffffff93003dfa
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 _kgspLogXid119: ********************************************************************************
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 152459!
1月 20 21:20:29 Archlinux kernel: NVRM: GPU0 nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from pRmApi->Control(pRmApi, pGpu->hInternalClient, pGpu->hInternalSubdevice, NV2080_CTRL_CMD_INTERNAL_LOG_OOB_XID, &params, sizeof(params)) @ gpu.c:7268
1月 20 21:20:29 Archlinux kernel: NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
1月 20 21:21:14 Archlinux kernel: NVRM: Xid (PCI:0000:01:00): 119, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 152460 (0x2080a0d1 0x7e8).
1月 20 21:21:14 Archlinux kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
1月 20 21:21:14 Archlinux kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
1月 20 21:21:14 Archlinux kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
1月 20 21:21:14 Archlinux kernel: NVRM: kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
1月 20 21:21:14 Archlinux kernel: NVRM: kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc               : 01685200
1月 20 21:21:14 Archlinux kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 152460!
1月 20 21:21:59 Archlinux kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=371, name=nvidia-modeset/, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) sequence 152461 (0x20802801 0x4).
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 kgspDumpMailbox_TU102: GSP: MAILBOX(0) = 0x00000000
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 kgspDumpMailbox_TU102: GSP: MAILBOX(1) = 0x00000000
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 kgspDumpMailbox_TU102: GSP: MAILBOX(2) = 0x00000000
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 kgspDumpMailbox_TU102: GSP: MAILBOX(3) = 0x00000000
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpNondestructive_IMPL: PRI: riscvPc               : 01685200
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: Back to back GSP RPC timeout detected! GPU marked for reset @ kernel_gsp.c:2428
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpDestructive_IMPL: ICD: Core is booted.
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpDestructive_IMPL: ICD: RSTAT0 0x0000000000000001
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpDestructive_IMPL: ICD: RSTAT3 0x0000000000000000
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpDestructive_IMPL: ICD: RSTAT4 0x0000000000000000
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 kflcnCoreDumpDestructive_IMPL: ICD: [ERROR] ICD Halt command failed.
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 _issueRpcAndWait: rpcRecvPoll timedout for fn 76 sequence 152461!
1月 20 21:21:59 Archlinux kernel: NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_client.c:844
1月 20 21:21:59 Archlinux kernel: NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || (status == NV_ERR_GPU_IN_FULLCHIP_RESET) @ rs_server.c:259

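The rest of the log is basically the last two lines above repeating. The core error is Xid 119 (GSP RPC timeout). This is not the first time I have hit it, and it happens without any warning.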
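
To compare how often this happens across boots or kernels, the Xid lines can be pulled out of the kernel log with a few lines of Python. This is only a rough sketch: the regular expression is based on the Xid lines quoted above (other driver versions may format them differently), and the function names `parse_xid_events` and `count_by_code` are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the lines quoted in this post, e.g.
#   NVRM: Xid (PCI:0000:01:00): 119, Timeout after 45s of waiting ...
# Assumption: other driver versions may use a different layout.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+), (.*)")


def parse_xid_events(lines):
    """Return a list of (pci_address, xid_code, message) tuples."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            pci, code, message = match.groups()
            events.append((pci, int(code), message.strip()))
    return events


def count_by_code(events):
    """Count how many times each Xid code appears."""
    return Counter(code for _pci, code, _msg in events)
```

One way to use it is to save the previous boot's kernel log with `journalctl -k -b -1 > boot.log`, then call `parse_xid_events(open("boot.log"))` and pass the result to `count_by_code`. Running that against a few days of boots on each kernel gives a number to compare instead of an impression.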

Analysis from Gemini

Speculation on the underlying cause

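Based on the deadlock location 0x01685200 and Function 76, together with your Zen kernel (6.18.5-zen1) and Lenovo Legion hardware, the cause is most likely a conflict in one of the following two mechanisms: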

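  1. ACPI power-state conflict: the Legion laptop's BIOS, when handling GPU power states (especially when ramping clocks up from a low-power state, or when the screen turns off or on), clashes with the GSP firmware logic under Linux. The GSP tries to read a register, but that register is locked by the power policy, so the GSP hangs at the hardware level.
  2. Overly aggressive scheduling in the Zen kernel: the Zen kernel typically modifies the CPU scheduler and the interrupt-handling frequency. If the interval at which the CPU sends commands to the GSP (polling) does not match the timing the GSP firmware expects (a race condition), the GSP's internal state machine can get out of order and then deadlock.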

Help wanted

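Are there any experts here? I am a beginner and am looking for a fix.
For now I have switched to the regular linux kernel and will run it for a few days to see whether the problem comes from a conflict between zen's scheduling and the GSP.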
