I am experimenting with GPU acceleration for my machine learning library.
I found that training these little robots is a hugely time-consuming ordeal.
Getting any kind of convergence on a complex model is orders of magnitude harder than on the very simple ones.
Training time goes from a few hours for a simple model to 48 and sometimes 78 hours, and then you find out that your agent fails to learn the controller.
At first I thought the time was not a big deal as long as the training converged to a solution, but that is not the case: you have to train and tweak those models dozens of times before you get a working one.
This is not sustainable.
Not only that, it also adds cost, since you have to run your CPU at maximum power for days.
Based on that, I think it is advantageous to have a GPU version that you can run as a trial balloon many times.
I'm using Vulkan for GPU acceleration.
Comparing my radix sort and matrix multiply implementations in CUDA and Vulkan, I consistently observed that CUDA is more than twice as fast as Vulkan.
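For reference, here is a minimal sketch of how the CUDA-side timings could be taken with CUDA events; the kernel name radix_sort_pass and the launch configuration are placeholders, not my actual implementation, and the Vulkan side (which would use vkCmdWriteTimestamp timestamp queries) is not shown.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder for one pass of the radix sort benchmark (hypothetical kernel).
    __global__ void radix_sort_pass() { /* real work omitted */ }

    int main() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        radix_sort_pass<<<1024, 256>>>();   // placeholder launch configuration
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }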
After extensive tuning and testing, I concluded that the main reason for this performance gap
is NVIDIA’s optimization strategy.
CUDA generates PTX (an intermediate representation), which NVIDIA appears to optimize very aggressively at the driver level.
The SPIR-V shaders used in Vulkan seem to be translated in a more direct manner, with significantly fewer optimizations applied.
I suspect that a major factor is memory bandwidth utilization.
Memory coalescing does not seem to be something the hardware does automatically; the wide load/store instructions have to be issued explicitly. But SPIR-V shaders have no way to express that, and neither, it seems, does PTX.
Yet after both are passed to the driver for hardware execution, the PTX version gets coalescing
(that is, it issues 128-bit or even 256-bit read and write transactions, whereas the SPIR-V version issues 32-bit ones),
so the SPIR-V code ends up issuing far more memory transactions than the equivalent PTX code.
Anyway, that's my speculation.
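To make the coalescing point concrete, here is a small CUDA sketch of the two access patterns I have in mind; the kernels are illustrative only, not taken from my benchmarks.

    // Adjacent threads touch adjacent floats, so the hardware can merge a warp's
    // loads and stores into a few wide transactions.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Adjacent threads touch elements `stride` floats apart, so most accesses land
    // in different cache lines and the warp issues many more memory transactions.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = i * stride;
        if (j < n) out[i] = in[j];
    }

Whether the first pattern actually becomes wide transactions in hardware is exactly the part that seems to differ between the PTX and SPIR-V paths.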
Essentially, it's like comparing a debug version of code to a release version and
expecting the debug version to be faster, something only possible if the debug version
uses a fundamentally more efficient algorithm.
This discrepancy becomes clear when compiling CUDA code with optimizations disabled.
Under those conditions, the performance of Vulkan and CUDA can be comparable, and in some cases it's not obvious which one is faster.
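If you want to try that comparison yourself, the sketch below shows the kind of build lines I mean; saxpy.cu and the kernel are stand-ins, and the flags are the stock nvcc ones (-G builds device code in debug mode with optimizations disabled, -Xptxas -O0 turns off only the PTX-to-SASS optimizer).

    // saxpy.cu - trivial stand-in kernel, used only to compare optimization levels.
    //
    //   nvcc -O3 saxpy.cu -o saxpy_opt                 (normal optimized build)
    //   nvcc -G  saxpy.cu -o saxpy_dbg                 (device debug build; optimizations off)
    //   nvcc -Xptxas -O0 saxpy.cu -o saxpy_ptxas_off   (only the ptxas optimizer disabled)
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }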
Unfortunately, developers have limited control over this, as it's up to NVIDIA to apply the same level of optimization to SPIR-V as they do to PTX.
That said, Vulkan's cross-platform support is a major strength, and given the increasing diversity of hardware, the performance tradeoff is often acceptable.
Even with this overhead, Vulkan still significantly outperforms CPU-based implementations,
and the performance loss only applies to NVIDIA hardware in the first place.