I kept running out of memory and couldn't finish fine-tuning or quantization jobs, even though hundreds of GB of video memory seemed to be free. I kept getting:

Failed to load checkpoint: Some modules are dispatched on the CPU or the disk

It turns out that on a unified-memory Grace Blackwell system you need to drop the OS page cache first. Otherwise the cache can consume too much of the unified memory, and the model ends up paged to disk instead of loaded onto the GPU.

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
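
If you want to see how much memory the page cache is holding before you drop it, something like the following might help. It is only a rough sketch, assuming a Linux system that exposes /proc/meminfo; the helper names parse_meminfo and page_cache_gib are made up for illustration and are not part of any library.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def page_cache_gib(fields: dict[str, int]) -> float:
    """Return the page cache size (the "Cached" field, in kB) as GiB."""
    return fields.get("Cached", 0) / (1024 * 1024)


# Only read the real file when it exists (i.e. on Linux).
if __name__ == "__main__" and Path("/proc/meminfo").exists():
    info = parse_meminfo(Path("/proc/meminfo").read_text())
    print(f"MemAvailable: {info.get('MemAvailable', 0) / (1024 * 1024):.1f} GiB")
    print(f"Page cache:   {page_cache_gib(info):.1f} GiB")
```

Running it before and after the drop_caches command should show whether the cache was what was eating the unified memory.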