CUDA out of memory error

Hi,

I've run into a new problem with the model_zoo regression.py code.

I started tuning hyperparameters, and since I have 4 GPUs I made 4 different .py files and ran each of them on one GPU using `os.environ['CUDA_VISIBLE_DEVICES'] = 'gpu_id'` in the script. At first I didn't have any problem, but after 3 or 4 runs I get a weird error. This is my error:

RuntimeError: CUDA out of memory. Tried to allocate 938.50 MiB (GPU 0; 11.92 GiB total capacity; 11.23 GiB already allocated; 242.06 MiB free; 10.54 MiB cached)

So I changed the GPU id with `os.environ['CUDA_VISIBLE_DEVICES'] = '1'`, but my code doesn't pick it up and just uses GPU 0. I also rebooted the GPU, but that didn't help.

I’m wondering if you can help me with this problem.

Thanks,

By "3, 4 runs", do you mean running the program 3 or 4 times, or just 3 or 4 epochs? If you mean 3 or 4 times, it's probably just that you did not release the GPU memory before starting a new run; try killing the old processes first.
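As a quick sanity check from Python, you can also query how much device memory is actually free before you start training; memory held by a leftover process from an earlier run will show up there. A rough sketch, assuming a PyTorch version that has `torch.cuda.mem_get_info`:

```python
import torch

# Free/total device memory (wraps cudaMemGetInfo), so memory held by *other*
# processes is counted too, unlike torch.cuda.memory_allocated().
free, total = torch.cuda.mem_get_info(0)
print(f"GPU 0: {free / 1024**3:.2f} GiB free of {total / 1024**3:.2f} GiB")

# If most of the card is already used before your model is even built,
# a process from an earlier run is probably still alive.
```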

As for setting the GPU device, have you tried directly calling CUDA_VISIBLE_DEVICES=1 python ... in the terminal?
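If you'd rather keep setting it from inside the script, it generally only takes effect if the variable is set before anything initializes CUDA, so put it at the very top of the file, before importing torch. A minimal sketch (just a placeholder script, not your regression.py):

```python
import os

# Must be set before CUDA is initialized (in practice: before importing torch);
# otherwise the process keeps whatever devices were visible at initialization.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import torch

print(torch.cuda.device_count())            # 1: only one GPU is visible now
model = torch.nn.Linear(10, 1).to('cuda')   # lands on physical GPU 1
```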

Hi,
By 3, 4 I mean 3 or 4 times. How should I release the GPU memory before running a new program? If you mean killing the old processes first and then running it: yes, I killed them and then started a new trial.

I also tried `CUDA_VISIBLE_DEVICES=1 python ...` and I get the same error.

Thanks for your help.

Try nvidia-smi after killing the program and check that you have properly killed all the related processes.

As for specifying the GPU to use, are you sure your program is not using the correct GPU? For example, if you set CUDA_VISIBLE_DEVICES=1, then even if your program uses cuda:0, it is in fact using the first GPU visible to the program, which is physical GPU 1. Another potential source of confusion might be a call to torch.cuda.set_device somewhere in the code.
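You can verify which physical card the visible device actually is: run something like the snippet below with CUDA_VISIBLE_DEVICES=1 and compare the reported name and memory usage against nvidia-smi in another terminal (just a sketch, not from regression.py):

```python
import torch

# With CUDA_VISIBLE_DEVICES=1, only one device is visible to this process,
# and 'cuda:0' refers to physical GPU 1.
print(torch.cuda.device_count())        # 1
print(torch.cuda.current_device())      # 0 (index among *visible* devices)
print(torch.cuda.get_device_name(0))    # name of the physical GPU being used
```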