Troubleshooting

This page lists common errors and ways to address them.

Run nvidia-smi
- This should display your GPU information.
- Otherwise, install the NVIDIA GPU drivers by following one of these guides:
  - https://docs.nvidia.com/cuda/cuda-installation-guide-linux
  - https://ubuntu.com/server/docs/nvidia-drivers-installation
Run python3 -c "import torch; print(torch.cuda.is_available())" or python -c "import torch; print(torch.cuda.is_available())"
- This should print True.
- Otherwise, reboot your server or make sure torch is installed.
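
The two checks above can be combined into one script. This is a minimal sketch, not part of the project: the function name `check_gpu_setup` and the report keys are hypothetical, and it only uses the Python standard library plus torch if it happens to be installed.

```python
import importlib.util
import shutil
import subprocess


def check_gpu_setup():
    """Run both checks and return a dict of booleans (hypothetical helper)."""
    report = {
        "nvidia_smi_found": False,  # nvidia-smi is on PATH
        "nvidia_smi_ok": False,     # nvidia-smi exited with status 0
        "torch_installed": False,   # torch can be located by the importer
        "cuda_available": False,    # torch.cuda.is_available() returned True
    }

    smi = shutil.which("nvidia-smi")
    if smi is not None:
        report["nvidia_smi_found"] = True
        try:
            result = subprocess.run([smi], capture_output=True, timeout=30)
            report["nvidia_smi_ok"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            pass  # treat a crash or hang as a failed check

    if importlib.util.find_spec("torch") is not None:
        report["torch_installed"] = True
        try:
            import torch
            report["cuda_available"] = bool(torch.cuda.is_available())
        except ImportError:
            pass  # torch is present but cannot be loaded

    return report
```

Printing the result of `check_gpu_setup()` shows which step fails: a False value for the nvidia-smi entries points to the driver installation guides, while False for the torch entries points to the torch installation or a pending reboot.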

On WSL, I get this error: hypermind.dht.protocol.ValidationError: local time must be within 3 seconds of others. What should I do?
The clocks on all nodes must be synchronized. Set the date from an NTP server:
```
sudo apt install ntpdate
sudo ntpdate pool.ntp.org
```
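
To see how far the local clock has drifted before or after syncing, the offset can be measured with a basic SNTP query. This is a rough sketch using only the standard library; the helper names are hypothetical, the 3-second tolerance is taken from the error message, and the exact comparison the server performs is an assumption.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def parse_transmit_time(packet: bytes) -> float:
    """Extract the server transmit timestamp (Unix seconds) from an NTP reply."""
    if len(packet) < 48:
        raise ValueError("NTP reply too short")
    # Bytes 40..47 hold the transmit timestamp: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset(server: str = "pool.ntp.org", timeout: float = 5.0) -> float:
    """Return the approximate (server time - local time) in seconds."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sent_at = time.time()
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
        received_at = time.time()
    # Compare the server time against the midpoint of the round trip.
    return parse_transmit_time(reply) - (sent_at + received_at) / 2


def within_tolerance(offset: float, tolerance: float = 3.0) -> bool:
    """True if the offset is within the tolerance named in the error message."""
    return abs(offset) <= tolerance
```

Calling `within_tolerance(clock_offset())` requires network access to the NTP server on UDP port 123. A False result suggests the clock still needs to be synced.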

The server starts loading blocks and then prints: Killed. What should I do?
This happens because Windows doesn't allocate much RAM to WSL by default, so the server gets OOM-killed.
To increase the memory limit, create a .wslconfig file in C:/Users/username (replace username with your Windows user name) with the following contents:
```
[wsl2]
memory=12GB
```
Then reboot WSL (run sudo reboot in the WSL console) and it should work fine.
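
To confirm that the new limit took effect after the reboot, you can read the total memory that WSL sees from /proc/meminfo. This is a small sketch with hypothetical helper names. The reported value is usually a little below the configured limit because the kernel reserves some memory.

```python
def mem_total_gib(meminfo_text: str) -> float:
    """Parse the MemTotal line of /proc/meminfo and return the value in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib / 1024**2
    raise ValueError("MemTotal not found")


def read_mem_total_gib(path: str = "/proc/meminfo") -> float:
    """Read the file (Linux and WSL only) and return total memory in GiB."""
    with open(path) as f:
        return mem_total_gib(f.read())
```

Running `print(read_mem_total_gib())` inside the WSL console should show a value close to the memory setting in .wslconfig.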

I get this error: torch.cuda.OutOfMemoryError: CUDA out of memory. What should I do?
If you use an Anaconda env, run this before starting the server:
```
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```
If you use Docker, add this argument after --rm in the Docker command:
```
-e "PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128"
```
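
If you launch the server from your own Python script instead, the same variable can be set from Python. The sketch below is an assumption-laden illustration, not the project's method: `configure_allocator` is a hypothetical helper, and it assumes the setting must be in place before the CUDA allocator is initialized, so the safe choice is to call it before `import torch`.

```python
import os
import sys


def configure_allocator(max_split_size_mb: int = 128) -> bool:
    """Set PYTORCH_CUDA_ALLOC_CONF unless the user has already set it.

    Returns False if torch was imported before this call, in which case
    the setting may be too late to take effect.
    """
    torch_already_imported = "torch" in sys.modules
    os.environ.setdefault(
        "PYTORCH_CUDA_ALLOC_CONF", f"max_split_size_mb:{max_split_size_mb}"
    )
    return not torch_already_imported
```

Using `setdefault` keeps any value already exported in the shell, so the script does not silently override the Anaconda or Docker settings shown above.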

The WSL clock tends to get out of sync. When it does, the server fails to launch with the error hypermind.dht.protocol.ValidationError: local time must be within 3 seconds of others.
To sync the WSL clock, run sudo ntpdate pool.ntp.org. More fixes are discussed on Stack Overflow.
