I am experimenting with GPU acceleration for my machine learning library.
I found that training these little robots is a hugely time-consuming ordeal.
Getting any kind of convergence on a complex model is orders of magnitude harder than on the very simple ones.
Training time goes from a few hours for a simple model to 48 and sometimes 78 hours, and then you find out that your agent fails to learn the controller.
At first I thought the time was not a big deal as long as the training converged to a solution, but that is not the case: you have to train and tweak those models dozens of times before you get a working one.
This is not sustainable.
Not only that, it also adds cost, since you have to run your CPU at maximum power for days.
Based on that, I think it is advantageous to have a GPU version that you can run as a trial balloon many times.
I'm using Vulkan for GPU acceleration.
Comparing my radix sort and matrix multiply implementations in CUDA and Vulkan, I consistently observed that CUDA is more than twice as fast as Vulkan.
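For reference, here is a minimal sketch of how the CUDA-side timings could be taken with CUDA events; the kernel name radix_sort_pass and the launch configuration are placeholders, not my actual implementation, and the Vulkan side (which would use vkCmdWriteTimestamp timestamp queries) is not shown.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder for one pass of the radix sort benchmark (hypothetical kernel).
    __global__ void radix_sort_pass() { /* real work omitted */ }

    int main() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        radix_sort_pass<<<1024, 256>>>();   // placeholder launch configuration
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }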
After extensive tuning and testing, I concluded that the main reason for this performance gap
is NVIDIA’s optimization strategy.
CUDA generates PTX (an intermediate representation), which NVIDIA appears to optimize very aggressively at the driver level.
The SPIR-V shaders used in Vulkan seem to be translated in a more direct manner, with significantly fewer optimizations applied.
I suspect that a major factor is memory bandwidth utilization.
Memory coalescing does not seem to be something the hardware does automatically; the wide load/store instructions have to be issued explicitly. But SPIR-V shaders have no way to express that, and neither, it seems, does PTX.
Yet after both are passed to the driver for hardware execution, the PTX version gets coalescing
(that is, it issues 128-bit or even 256-bit read and write transactions, whereas the SPIR-V version issues 32-bit ones),
so the SPIR-V code ends up issuing far more memory transactions than the equivalent PTX code.
Anyway, that's my speculation.
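To make the coalescing point concrete, here is a small CUDA sketch of the two access patterns I have in mind; the kernels are illustrative only, not taken from my benchmarks.

    // Adjacent threads touch adjacent floats, so the hardware can merge a warp's
    // loads and stores into a few wide transactions.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Adjacent threads touch elements `stride` floats apart, so most accesses land
    // in different cache lines and the warp issues many more memory transactions.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = i * stride;
        if (j < n) out[i] = in[j];
    }

Whether the first pattern actually becomes wide transactions in hardware is exactly the part that seems to differ between the PTX and SPIR-V paths.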
Essentially, it's like comparing a debug version of code to a release version and
expecting the debug version to be faster, something only possible if the debug version
uses a fundamentally more efficient algorithm.
This discrepancy becomes clear when compiling CUDA code with optimizations disabled.
Under those conditions, the performance of Vulkan and CUDA can be comparable, and in some cases it's not obvious which one is faster.
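If you want to try that comparison yourself, the sketch below shows the kind of build lines I mean; saxpy.cu and the kernel are stand-ins, and the flags are the stock nvcc ones (-G builds device code in debug mode with optimizations disabled, -Xptxas -O0 turns off only the PTX-to-SASS optimizer).

    // saxpy.cu - trivial stand-in kernel, used only to compare optimization levels.
    //
    //   nvcc -O3 saxpy.cu -o saxpy_opt                 (normal optimized build)
    //   nvcc -G  saxpy.cu -o saxpy_dbg                 (device debug build; optimizations off)
    //   nvcc -Xptxas -O0 saxpy.cu -o saxpy_ptxas_off   (only the ptxas optimizer disabled)
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }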
Unfortunately, developers have limited control over this, as it's up to NVIDIA to apply the same level of optimization to SPIR-V as they do to PTX.
That said, Vulkan's cross-platform support is a major strength, and given the increasing diversity of hardware, the performance tradeoff is often acceptable.
Even with this overhead, Vulkan still significantly outperforms CPU-based implementations,
and the performance loss only applies to NVIDIA hardware in the first place.