Fix the Error “Failed to initialize NVML: Driver/library version mismatch”

Nvidia is a tech company best known for designing and manufacturing GPUs (graphics processing units). Its GPUs are used in workstations and servers for many applications. The Nvidia driver is the software that lets your system use an Nvidia GPU; it is essential on any system with an Nvidia GPU, and without it the hardware will not work the way it should. A basic driver may be installed automatically, but to get the latest version you usually need to install it manually. While using the Nvidia driver, you may run into the error “Failed to initialize NVML: Driver/library version mismatch”.

Though the error looks alarming, it is not something that can’t be solved. Below are solutions you can simply implement. First, let’s look at how the error shows up.

How the error pops up

The error appears when you run the command ‘nvidia-smi’ or start a GPU workload. It happens when the Nvidia driver packages on a node have been updated, but the kernel is still running the old driver modules. You get an error like the following:

Warning Failed 91s (x4 over 2m12s) kubelet, Error: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch\\\\n\\\"\"": unknown
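Before applying a fix, you can confirm the mismatch by comparing the kernel-side driver version with what the user-space tools report. The following is a minimal sketch; on a host without an Nvidia driver the checks simply print nothing.

```shell
# Compare the kernel-side driver version with the user-space library.
# A difference between the two is exactly the mismatch this error reports.

# Kernel side: this file exists only while the nvidia module is loaded
if [ -f /proc/driver/nvidia/version ]; then
  cat /proc/driver/nvidia/version
fi

# User-space side: nvidia-smi prints the NVML mismatch error while the
# versions disagree, and the usual GPU table once they match again
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi
fi
```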

How To Solve the Error “Failed to initialize NVML: Driver/library version mismatch”

The following solutions can help you get rid of the error. Have a look:

Solution 1 – Drain and reboot the worker

The easiest and simplest way to resolve the error is to reboot the node, which ensures the updated driver modules are loaded cleanly. When drivers need to be updated on a GPU worker node, draining the node first is the right approach: drain the node, upgrade the driver, then reboot the node before deploying fresh workloads.

In most cases, this resolves the error.
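The drain-upgrade-reboot cycle above can be sketched as follows. The node name is a placeholder for your environment, and the driver-upgrade step (shown commented out) depends on your distribution and chosen driver version.

```shell
# Sketch of the drain/upgrade/reboot cycle.
# "gpu-worker-1" is a placeholder node name; substitute your own.
NODE=gpu-worker-1

if command -v kubectl >/dev/null 2>&1; then
  # Evict workloads so nothing is using the GPU during the upgrade
  kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

  # ...upgrade the Nvidia driver on the node, then reboot it, e.g.:
  # ssh "$NODE" 'sudo apt-get install -y <nvidia-driver-package> && sudo reboot'

  # Once the node is back up, allow workloads to be scheduled again
  kubectl uncordon "$NODE"
fi
```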

Solution 2 – Reload Nvidia kernel modules

This is the ideal solution if you want to avoid a full reboot, though it still involves draining GPU workloads from the node. If you can’t drain the node for any reason, this solution is of no use to you. Only non-GPU workloads can keep running on the worker node while the modules are reloaded.

Drain GPU workloads and stop the Nvidia driver

To drain the GPU workload, stop all GPU workloads on the node in question, along with the Nvidia device plugin. In a labeled-node setup, you do this by removing the GPU label from the node. The trailing hyphen removes the label; the node name and label key are placeholders that depend on your cluster:

kubectl label node <node-name> <gpu-label-key>-

After the label is removed, the pods that belong to the Nvidia Konvoy addon will also be removed from the node.

Kubelet restart

Next, restart the kubelet, since it is typically the last service still using the Nvidia kernel modules; restarting it releases them. Use this command to do that:

sudo systemctl restart kubelet 

Unload the Nvidia kernel modules

A kernel module cannot be removed while it is in use, so it is important to terminate any running GPU workloads before you remove the modules. Then use the following commands, in this order:

sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia

To verify that the modules are unloaded, use the command below:

lsmod | grep ^nvidia 

The command should return no output.
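If rmmod fails with a “module is in use” message, some process still has a GPU device open. A quick way to find it is to inspect the Nvidia device files, as in this sketch (it does nothing on hosts with no such devices):

```shell
# If "sudo rmmod nvidia" fails with "Module nvidia is in use", list the
# processes that still hold open handles on the GPU device files, then
# stop those processes before retrying the rmmod commands.
if ls /dev/nvidia* >/dev/null 2>&1; then
  sudo lsof /dev/nvidia*
  # alternatively: sudo fuser -v /dev/nvidia*
fi
```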

Relaunch the Nvidia addon pods

The last step is to relaunch the Nvidia addon pods by re-adding the label, so that the node accepts GPU workloads again. The node name, label key, and value are placeholders that depend on your cluster:

kubectl label node <node-name> <gpu-label-key>=<value>


In this post, we discussed simple solutions to fix the error “Failed to initialize NVML: Driver/library version mismatch”.

I hope you find it helpful!
