Nvidia is a well-known technology company that designs and manufactures GPUs (graphics processing units). Its GPUs are used in workstations for many applications and are a key component of a PC's architecture. The Nvidia driver is the software installed on your PC that lets the system use the Nvidia GPU. A driver is essential on every system with an Nvidia GPU; without it, the system will not work the way it should. A driver is usually installed automatically, but not necessarily the newest version, so it has to be updated manually to get the latest release. While using the Nvidia driver, you may run into the error "Failed to initialize NVML: Driver/library version mismatch".
Though the error looks alarming, it is not something that can't be solved. We provide you with solutions that you can implement easily. First, let's figure out how the error shows up.
How the error pops up
The error appears when you run the command 'nvidia-smi' or start a GPU workload. It happens when the Nvidia driver on a node has been updated, but the kernel module that is still loaded belongs to the old driver version, so the loaded driver and the user-space libraries no longer match. You get an error like the following:
Warning Failed 91s (x4 over 2m12s) kubelet, ip-10-0-129-17.us-west-2.compute.internal Error: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch\\\\n\\\"\"": unknown
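If you run nvidia-smi directly on the affected node, the same mismatch usually shows up as a much shorter message; the exact wording can differ between driver versions, but it typically looks like this:

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch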
How To Solve the Error "Failed to initialize NVML: Driver/library version mismatch"
We have two straightforward solutions to help you get rid of the error. Have a look at them below.
Solution 1 – Drain and reboot the worker
The easiest and simplest way to resolve the error is to reboot the node, which ensures the drivers are initialized properly once they have been updated. If the drivers on a GPU worker node need to be updated, draining the node first is the right approach: drain the node, upgrade the driver, and then reboot the node before deploying fresh workloads.
This resolves the error.
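As a rough sketch, the overall sequence looks like this (the node name is a placeholder, and the driver upgrade step depends entirely on your OS and how the driver was installed):

# Drain the GPU worker so no workloads are left running on it
# (pods with emptyDir volumes may additionally need --delete-emptydir-data,
# or --delete-local-data on older kubectl versions)
kubectl drain <node-name> --ignore-daemonsets

# Upgrade the Nvidia driver on the node using your distribution's packages
# or the installer you originally used

# Reboot the node so the loaded kernel modules match the new libraries
sudo reboot

# Once the node is back up, allow scheduling on it again
kubectl uncordon <node-name>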
Solution 2 – Reload Nvidia kernel modules
This is the right approach when the worker node cannot be drained and rebooted, for example because it also runs non-GPU workloads that must keep running. It still requires stopping the GPU workloads on the node, but it avoids a full drain and reboot. If stopping the GPU workloads is not an option either, this solution is of no use to you.
Drain GPU workloads and stop the Nvidia driver
First, stop the GPU workloads running on the node. Then stop the Nvidia plugin by removing the GPU provider label from the node:
kubectl label node <node-name> konvoy.mesosphere.com/gpu-provider-
Once the label is removed, the pods that belong to the Nvidia Konvoy addon are removed from the node as well.
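If you want to confirm that these pods are really gone, you can search for Nvidia-related pods across the cluster; the exact namespace depends on how the addon was deployed, so a cluster-wide search is the simplest check:

# List any remaining Nvidia-related pods; ideally nothing is returned for this node
kubectl get pods --all-namespaces -o wide | grep -i nvidia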
Kubelet restart
Next, restart the kubelet, since it is usually the last process still using the Nvidia kernel modules; restarting it releases them. Use this command to do that:
sudo systemctl restart kubelet
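Before unloading the modules, you can optionally double-check that no process is still holding the GPU device files (this assumes lsof is installed on the node):

# Show processes that still have the Nvidia device files open; ideally this prints nothing
sudo lsof /dev/nvidia*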
Unload the Nvidia kernel modules
A kernel module cannot be removed while it is still in use, which is why it is important to terminate any running GPU workloads before you remove the modules. Then run the following commands in this order:
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
To verify that the modules are unloaded, use the command below:
lsmod | grep ^nvidia
It should return no output.
Relaunch the Nvidia addon pods
The last step is to re-apply the label so that the node accepts GPU workloads again and the Nvidia addon pods are relaunched:
kubectl label node <node-name> konvoy.mesosphere.com/gpu-provider=NVIDIA
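As a final check, you can confirm that the label is back on the node and that nvidia-smi no longer reports the mismatch (the second command has to be run on the worker node itself):

# Verify the node carries the GPU provider label again
kubectl get nodes --show-labels | grep gpu-provider

# On the worker node, confirm the driver and libraries now match
nvidia-smi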
Conclusion
In this post, we discussed the solutions to fix the error "Failed to initialize NVML: Driver/library version mismatch" in a simple way.
I hope you find it helpful!