Recently I’ve been playing around with AI-assisted image upscaling using ESRGAN and Real-ESRGAN, and I made the grave mistake of acting on my desire to train it on my own dataset. I’ve written some Python in the past, but this is written from the perspective of someone on their first venture into ML.
So, I was using a pretty high-end machine with an AMD RX 6900 XT GPU, ready to do some GPU-accelerated computing. I had booted into Windows 10 and followed a guide for training on my own dataset. The tooling was BasicSR, a toolkit built on top of the PyTorch framework, which provided everything I needed.
After installing a bunch of modules via the pip package manager, I ran the Python script that would begin training, and descended into what I guess you could call the "rabbit hole".
It failed at first. As a developer, I’m used to that. It would be more suspicious if everything just worked on the first try, right? CUDA was seemingly required. OK, I knew that CUDA was specific to NVIDIA, and I had already suspected something like this might happen. I thought maybe it would use OpenGL or something on my AMD GPU. Guess not. I figured that in 2021, this would be a solved problem. It wasn’t.
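Under the hood, the failure boils down to a check along these lines. This is a minimal sketch, not the actual call site inside PyTorch/BasicSR, and the try/except is only there so it degrades gracefully if PyTorch isn't installed at all:

```python
# Sketch of the check that fails on a non-NVIDIA GPU.
try:
    import torch
    has_cuda = torch.cuda.is_available()
except ImportError:
    has_cuda = None  # PyTorch isn't installed at all

if has_cuda:
    print("CUDA device found:", torch.cuda.get_device_name(0))
elif has_cuda is False:
    print("PyTorch is installed, but sees no CUDA-capable GPU")
else:
    print("PyTorch is not installed")
```

On the RX 6900 XT under Windows, this lands squarely in the "no CUDA-capable GPU" branch.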
OK, so I moved over to my 2014 MacBook Pro, since it has a dedicated NVIDIA GTX 750M GPU. That one supported CUDA, I knew that much. I also knew there was some bad blood between Apple and NVIDIA, but there are plenty of Macs from that era with older NVIDIA GPUs running macOS Big Sur with no issues, so I figured CUDA would still be working. I know, the 750M isn’t exactly new anymore, but I figured it would still be faster than training on the CPU, right? So I went and cloned the repo again, installed the deps again, prepared the dataset again, and that whole thing, this time on the Mac.
Ran the same command. The exact same error occurred.
Odd, I thought. I went to investigate and found that NVIDIA had stopped supporting CUDA on macOS: CUDA 10.2 was the last release to support the platform at all, and High Sierra was the last macOS version that could run it, just as it was for newer NVIDIA GPUs in general. Dammit, macOS was not an option, then.
At this point I decided to install Windows 10 on the Mac via Boot Camp, so that I could get CUDA working on my GTX 750M. And that’s exactly what I did. Installed the latest NVIDIA “game ready” drivers. Found an installer for the CUDA toolkit on NVIDIA’s website. Both installed fine, and after a reboot I cloned the repo once again, installed the...well, you know how it goes by now. Ran the Python script. The same error occurred. For the third time.
But this time I had double-checked, and CUDA was working: running nvcc in the terminal reported back with a version. I searched online. Hmm, OK, I needed to pick either CUDA 10.2 or 11.1 when installing PyTorch, and the choice had to match whichever version of the CUDA toolkit I had installed. Well, I had just grabbed an installer from NVIDIA’s website and assumed the latest version would do. nvcc told me I was running 11.4. I decided I wasn’t taking any chances anymore at this point, so I downgraded to CUDA 11.1 and reinstalled PyTorch with the installation flags for CUDA 11.1.
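The matching rule that tripped me up can be illustrated with a tiny helper. This is a hypothetical function of my own, just to make the rule concrete: a PyTorch build targets one specific CUDA major.minor version, and the toolkit on the machine needs to line up with it.

```python
def cuda_versions_match(toolkit_version: str, wheel_cuda_version: str) -> bool:
    """Hypothetical helper: compare CUDA major.minor of the installed
    toolkit against the version the PyTorch build was made for."""
    return toolkit_version.split(".")[:2] == wheel_cuda_version.split(".")[:2]

print(cuda_versions_match("11.4", "11.1"))  # what I had: False
print(cuda_versions_match("11.1", "11.1"))  # after downgrading: True
```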
Ran the Python script again.
It failed, of course.
It still wasn’t able to find any CUDA-capable GPUs. I tried installing version 10.2 of the CUDA toolkit, reinstalled PyTorch and so on. Tried again. It failed.
At this point I found out that apparently I was an idiot, because the NVIDIA GTX 750M sitting in my MacBook Pro had a CUDA compute capability of 3.0, whatever that means, which was well below PyTorch’s minimum supported level. The last version of PyTorch to support my GPU was 0.3.1, which was a million years old and wouldn’t work with BasicSR, which required at least version 1.7. There was nothing to be done. My trusty 750M supported the CUDA toolkit, but it wasn’t capable enough for PyTorch. My good old MacBook Pro was officially not good enough anymore, apparently.
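In pseudo-check form, the wall I hit looks like this. The minimum shown is an assumption on my part — PyTorch's actual floor has shifted between releases — but the point stands: 3.0 is below it.

```python
# Sketch of the compute-capability wall. The minimum is an assumed
# value for illustration; PyTorch's real floor varies by release.
MIN_COMPUTE_CAPABILITY = (3, 5)

def pytorch_supports(compute_capability):
    # Compute capability compares lexicographically as (major, minor).
    return compute_capability >= MIN_COMPUTE_CAPABILITY

print(pytorch_supports((3, 0)))  # GTX 750M: False
print(pytorch_supports((8, 6)))  # a modern NVIDIA card: True
```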
OK, so I went back to the PC, and back to the PyTorch website. Apparently PyTorch did in fact support an alternative to CUDA called ROCm. This, I found after a few seconds of reading, seemed to be AMD’s answer to CUDA, but it only worked on Linux. I was already dual-booting Windows 10 and macOS in a Hackintosh system (which was really why I went with an AMD GPU in the first place), and I had wanted to add Linux to the mix anyway, so I downloaded Ubuntu 21.04, prepared a bootable USB flash drive, and installed it. I went through a whole host of issues getting the EFI partition to play nice with OpenCore, the boot loader for my Hackintosh, but it ended up working just fine.
Now I had a triple-boot system! Fancy. OK, so I booted Ubuntu, went to the AMD website, and downloaded the amdgpu-pro Linux drivers. Hmm, they were made for Ubuntu 20.04, but I figured they might work on 21.04. Wrong, they didn’t. Apparently, AMD only supported LTS versions of Ubuntu. OK, back to Windows 10 to prepare yet another bootable USB flash drive, this time with Ubuntu 20.04.
While the flash drive was being prepared I sat there, staring into the monitor, thinking about my life decisions. What was I even doing? I used to play music. That's what my wife fell in love with. Not this guy sitting with a triple-boot Hackintosh and a MacBook next to it, screaming into his monitor all the time. When she asked what I was doing, what could I even respond? "I'm desperately trying to train a model on my own dataset because I want a model that is really good at upscaling normal maps"? Yeah, no, better not to say that.
OK, back to reality. I installed...wait, what was I installing? Oh right, the bootable USB flash drive with Ubuntu 20.04 was ready, and I had apparently booted from it while thinking these beautiful thoughts. I went through the EFI stuff yet again, and finally the AMD graphics drivers installed correctly.
Now I was finally ready to install ROCm! I followed an installation guide on AMD's website. Naturally, it didn’t work at all. I checked to see if I had missed something, and I certainly had: you couldn’t install both the rocm-dkms package from the installation guide and amdgpu-pro at the same time! You had to pick one or the other. Now, you might be screaming at your monitor right now, furiously wanting to explain to me why that is, but to me, this felt like the weirdest thing. But OK. I uninstalled amdgpu-pro and proceeded with installing ROCm per the guide. It actually worked! Each piece of progress was a reason to celebrate. Now all I had to do was run these two commands to see if it was working correctly…
...OK, it wasn't; there was some permission thing. I had to add my user to a user group called video. OK, that didn’t work either, because in Ubuntu 20.04 specifically, that group is called render instead of video, which was still the name in older releases and in every other distro. OK, OK, OK. Look, there was a time when I wanted to understand why all these things are the way they are, but I have kids now, okay? Two of them. The decline in my GitHub contribution chart is a daily reminder of how little energy I have left for these things nowadays. Also, I'd been trying to get this shit to work for days in a row, already physically and mentally tired before even typing the first console command.
Anyway. After adding my user to a few user groups, adding some ROCm-specific binaries to my PATH, and chowning a few files, everything did seem to be working correctly.
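The group-membership sanity check I kept re-running can be sketched like this. The group names are the ones from above — render on Ubuntu 20.04, video on older releases and other distros — and note this only sees supplementary groups, not the user's primary group:

```python
# Sketch: which of the ROCm-relevant groups is my user actually in?
# Only supplementary group memberships show up in gr_mem.
import getpass
import grp

def rocm_groups(user):
    return [g.gr_name for g in grp.getgrall()
            if g.gr_name in ("render", "video") and user in g.gr_mem]

print(rocm_groups(getpass.getuser()))
```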
So now, finally, I could go back to the PyTorch website and follow the instructions for installing a version of PyTorch with ROCm support. It installed correctly. I thought this might be it! I then went and cloned the Real-ESRGAN repo, installed the...for what felt like the hundredth time, and ran the Python command to commence training that had failed me so many times before.
No, just kidding. It failed.
Of course it failed.
I think I was laughing. The error? ROCm, an abbreviation of Radeon Open Compute with a mystical lowercase m at the end, AMD's own thing, didn’t actually support my AMD Radeon RX 6900 XT graphics card, or more generally, RDNA2. In fact, it didn’t even support RDNA1-based graphics cards like the RX 5700 XT.
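As a sketch, the support matrix I ran into looked roughly like this. The gfx codes are the LLVM target names for each GPU architecture, and the True/False values are my best-effort reconstruction of ROCm support at the time, not an official list:

```python
# Best-effort reconstruction of ROCm's support matrix circa 2021.
ROCM_SUPPORT = {
    "gfx906": True,    # Vega 20 (e.g. Radeon VII)
    "gfx908": True,    # CDNA (e.g. Instinct MI100)
    "gfx1010": False,  # RDNA1 (e.g. RX 5700 XT)
    "gfx1030": False,  # RDNA2 (e.g. RX 6900 XT)
}

def rocm_supported(arch):
    # Anything not on the list is assumed unsupported.
    return ROCM_SUPPORT.get(arch, False)

print(rocm_supported("gfx1030"))  # my RX 6900 XT: False
```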
The universe was laughing at me. I was a bad consumer who had made all the wrong hardware decisions. Silly me.
So, let’s recap. I attempted this from three different operating systems, one of which I installed in two separate versions. I attempted it on two different GPUs, from two different vendors, in two different computers. One was CUDA-compatible. The other was one of the most powerful graphics cards money could buy at the time of writing. I installed PyTorch, in a multitude of configurations, at least 10 times. I cloned the Real-ESRGAN repo, configured Python and pip, installed the deps, and prepared the dataset I-don’t-know-how-many times. It never worked.
So, let’s see. What are my options? As far as I can tell, training on the CPU. But after trying to monkey-patch the source code for PyTorch and BasicSR, I couldn’t even get that working!
I guess the only way forward is to pay money. But I already did! This GPU cost me a ton of money.
I’m left to conclude that, if I didn’t know any better, it would follow from this little study that it is, in fact, impossible to train (Real-)ESRGAN. With CUDA. With ROCm. On Windows, Linux, and macOS. On an NVIDIA graphics card, on an AMD graphics card, and even on an Intel CPU.