Cuda vs Vulkan compute

A place to discuss everything related to Newton Dynamics.


Re: Cuda vs Vulkan compute

Postby Julio Jerez » Mon Jun 16, 2025 1:59 pm

Well, I spent the weekend getting my hands dirty with the tiles and wrote a tile-based
matrix x matrix multiply OpenCL kernel.
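For reference, the general shape of such a kernel is the classic local-memory tile pattern below. This is only an illustrative sketch (the tile size, argument names, and row-major layout are my assumptions, not the actual kernel in the repository); it assumes the kernel is launched with a TILE x TILE work group:

Code:
    #define TILE 16

    // C = A * B, all matrices row major; each work group computes one TILE x TILE block of C
    __kernel void tiledMatrixMultiply(
            __global const float* matrixA,   // rowsA x colsA
            __global const float* matrixB,   // colsA x colsB
            __global float* matrixC,         // rowsA x colsB
            uint rowsA, uint colsA, uint colsB)
    {
        __local float tileA[TILE][TILE];
        __local float tileB[TILE][TILE];

        uint row = get_global_id(1);
        uint col = get_global_id(0);
        uint localRow = get_local_id(1);
        uint localCol = get_local_id(0);

        float acc = 0.0f;
        uint tileCount = (colsA + TILE - 1) / TILE;
        for (uint t = 0; t < tileCount; ++t)
        {
            // each work item loads one element of A and one of B into fast local memory
            uint aCol = t * TILE + localCol;
            uint bRow = t * TILE + localRow;
            tileA[localRow][localCol] = ((row < rowsA) && (aCol < colsA)) ? matrixA[row * colsA + aCol] : 0.0f;
            tileB[localRow][localCol] = ((bRow < colsA) && (col < colsB)) ? matrixB[bRow * colsB + col] : 0.0f;
            barrier(CLK_LOCAL_MEM_FENCE);

            // both tiles are now reused TILE times out of local memory
            for (uint k = 0; k < TILE; ++k)
            {
                acc += tileA[localRow][k] * tileB[k][localCol];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if ((row < rowsA) && (col < colsB))
        {
            matrixC[row * colsB + col] = acc;
        }
    }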

The previous one, which I thought was going to be better, used the matrix transpose,
and the profile showed this. It was really bad; the profile already said the app was GPU bound.
[Attachment: Untitled.png (profiler capture)]


The new one is quite good so far, and the profile now says the app is GPU bound.
That's good, because further GPU optimization can make it better, but for now this is good enough.
[Attachment: Untitled1.png (profiler capture)]


I can still add some more optimizations, like using an array of registers;
some people report doubling the throughput with those tricks.

It is at the point where the GPU version and the CPU version are breaking even.

    CPU optimized results
    training time 3.485306 (sec)

    opencl platform: AMD Accelerated Parallel Processing
    opencl platform: Intel(R) OpenCL
    opencl platform: Intel(R) OpenCL
    opencl device name: gfx1101
    opencl device version: OpenCL 2.0 AMD-APP (3652.0)
    opencl device local memory: 65536

    training time 3.514954 (sec)


But this is very good news, because that test is only the feed-forward update.
What is really slow is the back-propagation, because that part consumes a ferocious amount of memory bandwidth, which is where the GPU has the upper hand over CPUs.
At least that's the theory, and I have been known to be very wrong with these predictions.

Anyway, I can now remove all the transposes, clean up, and then enable the back-propagation to see where we are. :D

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Tue Jun 17, 2025 5:44 pm

Ahh, finally I got a clean advantage for the GPU.

This is the timing for the most optimized version I could muster so far:

    parallel optimized cpu:
    training time 82.004327 (sec)

This is the same training on the GPU:
    platforms found:
    opencl platform: AMD Accelerated Parallel Processing
    opencl platform: Intel(R) OpenCL
    opencl platform: Intel(R) OpenCL

    selecting:
    opencl device name: gfx1101
    opencl device version: OpenCL 2.0 AMD-APP (3652.0)
    opencl device compute units: 30
    opencl device local memory: 65536

    training time 28.532017 (sec)

A little under three times faster :mrgreen: :D :shock: :D

The good news is that I can optimize the GPU version some more, in a few different ways:
-First, by profiling the kernels. I already did that for the matrix multiply, but there are two other multiplications that still use the transpose method; replacing those two will further improve performance.

-The other is adding more GPU functionality; for example, I have not yet converted the loss functions.
-And finally, an easy one: preloading the data set to the GPU.

Anyway, this seems to be making some progress now.
If that speedup holds for training the robot, the training time should be cut to about 8 hours, which is still too much.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Thu Jun 19, 2025 11:35 am

Wow, I had to do some heavy-duty debugging to find a bug that was generating slightly different results from run to run when using the GPU, but not when using the OpenCL CPU or emulation devices.

That is the telltale sign of a missing or misplaced barrier,
and it is usually a really bad sign, because when the fix is more synchronization, it means a drastic cost in FLOPS throughput.

It turned out that was not the cause, and the fix made the system a little faster still.
The AMD GPU backend now takes about 23 seconds:

    Best model: epoch: 16 success rate:99.278328% training fail count:433 test fail count:219
    Best model: epoch: 17 success rate:99.311668% training fail count:413 test fail count:217
    epoch: 18 success rate:99.335007% training fail count:399 test fail count:223
    epoch: 19 success rate:99.351669% training fail count:389 test fail count:222

    platforms found:
    opencl platform: AMD Accelerated Parallel Processing
    opencl platform: Intel(R) OpenCL
    opencl platform: Intel(R) OpenCL
    opencl platform: Oclgrind

    selecting:
    opencl device name: gfx1101
    opencl device version: OpenCL 2.0 AMD-APP (3652.0)
    opencl device compute units: 30
    opencl device local memory: 65536

    training time 23.135165 (sec)

This is a big win in favor of the GPU,
and the best part is that it can only get faster.
This is the profile capture from the AMD GPU tool.

[Attachment: Untitled.png (profiler capture)]


It is clear that one kernel dominates the entire update, by a factor of more than 10.

And that kernel is this:

Code:
    __kernel void brainAccumulateGradients(
            __global const UniformBufferLayerArguments* parameters,
            __global float* gradientBuffer)
    {
        // one work item per gradient element; each work group handles one workGroupSize-wide slice
        uint itemId = get_local_id(0);
        uint groupId = get_group_id(0);
        uint workGroupSize = get_local_size(0);

        uint inputSize = parameters->m_inputSize;
        uint miniBatchSize = parameters->m_inputOutputSize;

        // walk down the rows of the gradient matrix, accumulating this element across the mini batch
        float sum = 0.0f;
        uint start = groupId * workGroupSize;
        for (uint j = 0; j < miniBatchSize; ++j)
        {
            uint base = start + j * inputSize;
            sum += gradientBuffer[base + itemId];
        }
        // average over the mini batch and write the result back into the first row
        float weightFactor = 1.0f / (float)miniBatchSize;
        barrier(CLK_LOCAL_MEM_FENCE);

        gradientBuffer[start + itemId] = sum * weightFactor;
    }


Basically, the code performs an element-wise addition of a long vector across the rows of a matrix.

This is a challenging operation to optimize because there doesn’t seem to be enough computation per memory access to hide latency. I suspect my mistake is in summing the matrix row by row.
While this might seem reasonable, especially since each row contains a substantial amount of data,
in practice, it may be causing cache thrashing.

What I think is happening is that the compute units are invalidating the cache with each row traversal. This kind of code can run fast on GPUs that don't have a unified L2 cache.
But if the GPU does have an L2 cache, and the vectors are large (over half a megabyte in this small demo), then failure to reuse cached data severely limits performance due to memory bandwidth constraints.

What I plan to try is breaking the operation into a series of kernel launches, each accumulating one row at a time. The idea is that the first pass will read all the data into cache. Even if the cache can't hold it all, each subsequent pass will work on a matrix half the size of the previous one. That means the likelihood of hitting cached data increases with each pass.

This is just a hypothesis, but if this approach even halves the compute time,
it could shift the performance balance strongly in favor of the GPU backend. :mrgreen: :D :shock:
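To make the idea concrete, one halving pass could look roughly like this (an illustration of the hypothesis only, with made-up names; the host would enqueue it with a global size of (rowCount / 2) * inputSize, halve rowCount between launches, and fold any odd leftover row separately):

Code:
    // adds the upper half of the remaining rows into the lower half, in place
    __kernel void foldUpperHalfRows(
            __global float* gradientBuffer,
            uint inputSize,
            uint rowCount)
    {
        uint id = get_global_id(0);          // one work item per element of the lower half
        uint halfRows = rowCount / 2;
        uint row = id / inputSize;
        uint column = id - row * inputSize;

        // every pass reads data the previous pass just touched, improving cache reuse
        gradientBuffer[row * inputSize + column] +=
            gradientBuffer[(row + halfRows) * inputSize + column];
    }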


edit: the naive version of that did not really work; in fact, it made it about 75% slower.
I believe that is because it dispatches too many kernels, and the driver overhead kills it.

I could try to reduce that, but for now I think that adding higher-level optimizations will yield better results.
The next step is to preload the database to device memory and see how that does.

This is actually important for training, because the replay buffer has to be updated incrementally, which means it might need my old sort function for regrouping the data in GPU memory.
Anyway, we are now in a good position to start training again.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Thu Jun 19, 2025 2:10 pm

For context, on my old system the GPU backend trainer is a blowout.

This is my best optimized parallel trainer on the CPU:

    Best model: epoch: 15 success rate:99.401665% training fail count:359 test fail count:208
    Best model: epoch: 16 success rate:99.453339% training fail count:328 test fail count:206
    epoch: 19 success rate:99.508331% training fail count:295 test fail count:220

    training time 207.753696 (sec)

    training data results:

This is the GPU: :D :D

    Best model: epoch: 17 success rate:99.339996% training fail count:396 test fail count:229
    Best model: epoch: 18 success rate:99.348328% training fail count:391 test fail count:222
    epoch: 19 success rate:99.361664% training fail count:383 test fail count:232

    platforms found:
    opencl platform: NVIDIA CUDA
    opencl platform: Intel(R) OpenCL
    opencl platform: Intel(R) FPGA Emulation Platform for OpenCL(TM)

    selecting:
    opencl device name: NVIDIA GeForce GTX 1660 SUPER
    opencl device version: OpenCL 3.0 CUDA
    opencl device compute units: 22
    opencl device local memory: 49152

    training time 27.064774 (sec)

A 7.5x speedup, yikes. :mrgreen: :D :shock: :twisted:

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Sun Jun 22, 2025 12:28 pm

Alright, this is it, the results are in.
These are the results of training the MNIST data set on CPU vs GPU, both optimized.
The GPU could be optimized some more, but that would mean making specialized shader optimizations
to capitalize on specific hardware, stuff like knowing the warp size or the shared-memory bank size, and I am not going for that. As long as the GPU is at least twice as fast as the optimized CPU, that's a win.
But in reality I am getting about a 4x+ speedup.

Here are the results for the small model (~350k parameters), training MNIST for 20 epochs:

    Best model: epoch: 18 success rate: 99.495003% training fail count:303 test fail count:205
    epoch: 19 success rate:99.528328% training fail count:283 test fail count:206

    results: multithreaded optimized cpu small model
    mnist database, model number of parameters 335114
    training time 74.433944 (sec)

    training data results:
    num_right: 59698 out of 60000
    num_wrong: 302 out of 60000
    success rate 99.496666%

    test data results:
    num_right: 9796 out of 10000
    num_wrong: 204 out of 10000
    success rate 97.959999%

This is the GPU:
    opencl device name: gfx1101
    opencl device version: OpenCL 2.0 AMD-APP (3652.0)
    opencl device compute units: 30
    opencl device local memory: 65536

    Best model: epoch: 18 success rate:99.065002% training fail count:561 test fail count:205
    epoch: 19 success rate:99.093338% training fail count:544 test fail count:206

    results: opencl gpu small model
    mnist database, model number of parameters 335114
    training time 15.896462 (sec)

    training data results:
    num_right: 59439 out of 60000
    num_wrong: 561 out of 60000
    success rate 99.065002%

    test data results:
    num_right: 9796 out of 10000
    num_wrong: 204 out of 10000
    success rate 97.959999%

As you can see, the model isn't large enough to capture the underlying structure of the data.
As a result, it only achieves just under 98% accuracy on the test set.
Of course, I could apply techniques like dropout regularization or extend the training time, but doing so increases the number of trial-and-error iterations, precisely the kind of inefficiency that adding a GPU backend is meant to address.

Note:
For context, once a model's training accuracy exceeds around 98% of the target, further progress becomes extremely slow.
This is because the "learning" in neural networks is driven by gradients,
which just means calculating the partial derivatives of a multivariable loss function and updating the weights in proportion to that vector of partial derivatives.
At that stage, most predictions are already correct, so the error between the predicted and true values is often zero. As a result, the algorithm ends up averaging thousands of zero gradients and only a few meaningful ones, making it much harder to continue improving the model.
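In equation form, this is just the standard mini-batch gradient-descent update, with learning rate \eta and mini-batch size m:

    w \leftarrow w - \eta \, \frac{1}{m} \sum_{i=1}^{m} \nabla_w L(x_i, y_i)

so when most of the per-sample gradients \nabla_w L(x_i, y_i) are zero, the averaged step shrinks toward nothing even though a few samples are still classified wrong.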

Anyway, a simple test to try is to increase the size of the hidden layers, using 512 neurons instead of 256.
The general consensus is that wider models are better at learning the underlying patterns in the data and tend to make more confident classifications, which usually translates into better performance on unseen data. :D :D

However, this change comes at a cost: it roughly quadruples the training time on both CPU and GPU. And since CPUs are significantly slower for this kind of workload, experimenting with architectural changes becomes impractical without GPU acceleration.

Here are the results of that test, with almost a million parameters: :o :shock: :shock:
    epoch: 16 success rate:99.561668% training fail count:263 test fail count:188
    Best model: epoch: 18 success rate:99.620003% training fail count:228 test fail count:166
    epoch: 19 success rate:99.570000% training fail count:258 test fail count:191

    results: multithreaded optimized cpu bigger model
    mnist database, model number of parameters 932362
    training time 195.515136 (sec)

    training data results:
    num_right: 59773 out of 60000
    num_wrong: 227 out of 60000
    success rate 99.621666%

    test data results:
    num_right: 9835 out of 10000
    num_wrong: 165 out of 10000
    success rate 98.349998%

and the gpu :mrgreen: :shock: :D 8) :mrgreen: :P :shock: :mrgreen:
    Best model: epoch: 15 success rate:99.404999% training fail count:357 test fail count:175
    epoch: 16 success rate:99.404999% training fail count:357 test fail count:182
    epoch: 18 success rate:99.436661% training fail count:338 test fail count:188
    epoch: 19 success rate:99.436661% training fail count:338 test fail count:180

    results: opencl gpu bigger model
    mnist database, model number of parameters 932362
    training time 36.439773 (sec)

    training data results:
    num_right: 59644 out of 60000
    num_wrong: 356 out of 60000
    success rate 99.406670%

    test data results:
    num_right: 9826 out of 10000
    num_wrong: 174 out of 10000
    success rate 98.260002%

It seems that, at least on this test, the consensus is true:
the model generalizes better on both training and test data, achieving a significant fraction over 98% accuracy in so few epochs.
The important part is that the test only takes 36 seconds on the GPU while the CPU takes 195.
That's a 5.4x speedup factor, which, according to my investigations, is better than the results people are getting in similar tests with TensorFlow and PyTorch, or at least it is in the ballpark.

Now it's back to training the robot one more time. :shock: 8) :shock: :mrgreen:

edit:
Ahh, and for completeness, the NVIDIA test gives these results:
    Best model: epoch: 18 success rate:99.430000% training fail count:342 test fail count:169
    epoch: 19 success rate:99.451668% training fail count:329 test fail count:172

    results:
    mnist database, model number of parameters 932362
    training time 59.745972 (sec)

    training data results:
    num_right: 59659 out of 60000
    num_wrong: 341 out of 60000
    success rate 99.431664%
    test data results:
    num_right: 9832 out of 10000
    num_wrong: 168 out of 10000
    success rate 98.320000%

That's not bad for a legacy NVIDIA GPU; it is quite competitive with the AMD 7800, actually.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Fri Jun 27, 2025 2:09 pm

Wow, the prognosis for GPU reinforcement learning is very promising.

I have converted the part of the class that interacts most with CPU memory.
At first the results were very mediocre, barely a fraction faster than the CPU.
But then I read about shared virtual memory (SVM) buffers.
It makes the code very verbose, but when set up correctly,
it avoids a lot of the overhead of setting up the DMA transfers.
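For anyone curious, the coarse-grained SVM path looks roughly like this (an illustrative sketch only, with made-up names and no error checking, not the actual code in the library):

Code:
    #define CL_TARGET_OPENCL_VERSION 200
    #include <CL/cl.h>

    void exampleSvmUpload(cl_context context, cl_command_queue queue, cl_kernel kernel, size_t count)
    {
        // one allocation visible to both host and device, no separate staging buffer
        float* data = (float*)clSVMAlloc(context, CL_MEM_READ_WRITE, count * sizeof(float), 0);

        // map before touching the memory on the host, unmap before the kernel uses it
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, count * sizeof(float), 0, NULL, NULL);
        for (size_t i = 0; i < count; ++i)
        {
            data[i] = 0.0f;
        }
        clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

        // the SVM pointer is passed straight to the kernel, no clCreateBuffer / clEnqueueWriteBuffer copy
        clSetKernelArgSVMPointer(kernel, 0, data);
        size_t globalSize = count;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
        clFinish(queue);

        clSVMFree(context, data);
    }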

These are now the first results of training.
CPU:
[Attachment: Untitled.png (CPU profile capture)]


GPU:
[Attachment: Untitled1.png (GPU profile capture)]


That's about a 10x factor, and the CPU timing is from the physics SDK;
it is as if the GPU does not feel the load at all.

When the entire class is complete, the CPU timing is over 100 ms per step.
A 10x speed factor would be quite good, but if it keeps this pace, it could be a lot better than that :mrgreen: :mrgreen: :mrgreen: :D :shock:

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Thu Jul 03, 2025 2:01 pm

Wow, I have to dive into some serious debugging.
For some reason, I never considered that, in a computer program, an array could exceed 2 billion elements.

I ran into a similar issue a while ago while working on sorting and fluid simulations in CUDA.
Back then, the problem wasn't the number of elements, it was the pointer address.
At the time, my GPU had less than 2 GB of VRAM, and I primarily worked in 32-bit mode.
Then I upgraded to an NVIDIA 1660 Super with 6 GB, and suddenly, I had to rewrite my utility vector class to use int64 instead of uint32, and update the entire codebase. That was a tedious process.

Now, while working with machine learning and training models,
I'm again running into vectors that do exceed 2 billion elements.
The tricky part is that smaller datasets like MNIST work fine since they fall under the 2 GB threshold.
But once I move on to training robots, the amount of offline data blows past that limit.
So, I’ve decided it’s time to fix this issue once and for all.
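Just to illustrate the failure mode with a toy example (nothing to do with the actual vector class, which uses its own types):

Code:
    #include <stdint.h>
    #include <stdio.h>

    int main()
    {
        int64_t count = 3000000000LL;              // 3 billion elements
        int32_t index32 = (int32_t)(count - 1);    // wraps around and goes negative
        int64_t index64 = count - 1;               // correct
        printf("32-bit index: %d   64-bit index: %lld\n", index32, (long long)index64);
        return 0;
    }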

That threw a real monkey wrench into refactoring the library.
But I am now almost done and ready to try again.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Wed Jul 09, 2025 12:27 pm

This isn’t new to me, I've always tried to structure data to make efficient use of cache lines whenever possible.
That said, code optimized for memory access patterns doesn’t always look clean or elegant,
so I’ve only applied these optimizations loosely in the past.

A classic example is matrix-vector or matrix-matrix multiplication. Typically, one matrix in the operation is cache-friendly, while the other isn't.
A common workaround is to transpose one of the matrices to improve memory locality.
However, this comes with trade-offs: it doubles memory usage, requires additional synchronization between the original and transposed matrices, and introduces overhead from the transpose operation itself. In practice, for small and medium size matrices, this often ends up being no better than just accepting the cache inefficiencies.

Advanced libraries like BLAS and LAPACK use tiling algorithms to address these issues more effectively, but those implementations are massive and primarily deliver significant performance gains only when working with large matrices like 2k by 2k or more, which is not the case for ML.

That sets the context.
For testing and debugging my neural network library, I’ve been using the MNIST dataset.
To keep CPU time manageable during matrix operations, I initially set the hidden layer width to 64 neurons. However, this led to poor model convergence.

Here are the results from a 20-epoch run:
    Training accuracy: 96.53%
    Test accuracy: 95.92%
Total time: ~1 minute

To improve accuracy, I had two main options:
-Increase the number of epochs
-Increase the width of the hidden layer

Both options increase runtime substantially: more epochs roughly linearly, and a wider hidden layer roughly quadratically. More epochs aren’t always effective because once the network hits its learning capacity, gradients become too small to be meaningful, and numerical noise starts to dominate, making further learning difficult.

The second option, widening the hidden layer is generally favored in modern ML practice.
It's widely accepted that wider networks often capture the underlying structure of the data better than deep, narrow ones. The downside is performance: doubling the width can quadruple runtime.
For instance, increasing the hidden layer to 512 neurons boosts accuracy significantly, but also increases training time substantially.

Here are the results for a 512-neuron hidden layer:
    Training accuracy: 99.47%
    Test accuracy: 98.29%
Total time: ~10 minutes

Ideally, you'd want both wide and deep networks to get the best of both worlds, but at that point, you’re looking at hours of training time.
With these premises I started to explore GPUs, first Vulkan and then OpenCL.
Vulkan is too low level for my tests, so I tried OpenCL.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Wed Jul 09, 2025 12:53 pm

When working with OpenCL, careful attention to memory usage is essential.

On a GPU, reading a single value from memory doesn’t just retrieve that value, it fetches an entire memory segment (a wavefront).
So, if you naively port traditional matrix operations to the GPU, the performance gains may be underwhelming, or even negative, once you account for the overhead of transferring data between the host and the device.

After rewriting the matrix algorithm using a tile-based approach, the performance improvements were significant. This led me to wonder: what if I applied the same tiled algorithm on the CPU?

After all, a CPU’s L1 cache is roughly the same size as a GPU’s shared local memory. Additionally, while a typical GPU memory fetch involves a 32 x 32-bit segment, the CPU typically fetches a 64 x 8-bit cache line. If every element in a cache line is used efficiently, you can, in theory, achieve up to a 16x gain from memory locality, and there is no host-to-device transfer cost at all.

So, I rolled up my sleeves, implemented the tiled version for the CPU, and refactored the library accordingly.
It wasn’t easy, but I eventually got a working prototype, and the results exceeded my expectations.
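The core of the CPU tiled loop has roughly this shape (a sketch under my own naming and a guessed tile size; the code in the library is organized differently and multithreaded):

Code:
    // C (rows x cols) += A (rows x inner) * B (inner x cols), all row major
    void tiledMatrixMultiply(const float* A, const float* B, float* C,
                             int rows, int inner, int cols)
    {
        const int tile = 64;   // sized so the working set of three tiles stays cache resident
        for (int i0 = 0; i0 < rows; i0 += tile)
        {
            for (int k0 = 0; k0 < inner; k0 += tile)
            {
                for (int j0 = 0; j0 < cols; j0 += tile)
                {
                    int iMax = (i0 + tile < rows) ? i0 + tile : rows;
                    int kMax = (k0 + tile < inner) ? k0 + tile : inner;
                    int jMax = (j0 + tile < cols) ? j0 + tile : cols;
                    for (int i = i0; i < iMax; ++i)
                    {
                        for (int k = k0; k < kMax; ++k)
                        {
                            float a = A[i * inner + k];
                            // the innermost loop streams through contiguous memory in both B and C
                            for (int j = j0; j < jMax; ++j)
                            {
                                C[i * cols + j] += a * B[k * cols + j];
                            }
                        }
                    }
                }
            }
        }
    }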

The speedup was dramatic: a model that previously required 10 to 15 minutes for 20 epochs on the CPU now trains in just 27 seconds.

    Results: Dataset: MNIST
    Model Parameters: 932,362
    Training Time: 27.7 seconds

    Training Accuracy:
    Correct: 59,681 / 60,000
    Accuracy: 99.47%

    Test Accuracy:
    Correct: 9,829 / 10,000
    Accuracy: 98.29%
Next, I plan to revisit the GPU version and integrate some of the optimization techniques used in the CPU implementation. I’m not expecting a 10x–100x improvement, more realistically, a 3x–5x speedup. But that would still be a valuable gain.
I found out that the proverbial 100x GPU speedup only happens when you compare a very poor CPU implementation to a highly efficient GPU version of the same routine,
stuff like comparing a GPU radix sort to the standard library quicksort, and shenanigans like that.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Fri Jul 11, 2025 1:32 pm

Alright, I've now completed the OpenCL versions of all the kernels, and the results are impressive.
Using the most optimized CPU implementation:
    Training time: 27.78 seconds
    Training accuracy: 99.41%
    Test accuracy: 98.18%
Now with GPU (OpenCL 2.0, AMD-APP 3652.0):
    Training time: 6.97 seconds
    Training accuracy: 99.38%
    Test accuracy: 98.04%

A factor of four, as I predicted. But bear in mind that both the CPU and GPU results are about 5 to 10 times faster than when I started, so overall this is better than a 10-fold gain. :mrgreen: :D
I am happy with these results.

With this GPU acceleration in place, I can finally revisit the reinforcement learning setup for training robots. I’m expecting a dramatic reduction in training time from several hours or even days down to just a few hours, possibly even minutes.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Tue Jul 15, 2025 11:14 pm

This is the final blow:

    opencl device name: gfx1101
    opencl device version: OpenCL 2.0 AMD-APP (3652.0)
    opencl device compute units: 30
    opencl device local memory: 65536

    results:
    mnist database, model number of parameters 932362
    training time 0.478264 (sec)

    training data results:
    num_right: 57069 out of 60000
    num_wrong: 2931 out of 60000
    success rate 95.114998%

    test data results:
    num_right: 9518 out of 10000
    num_wrong: 482 out of 10000
    success rate 95.180000%

Over 95% in just one epoch, in less than 0.5 seconds, and it does better on the test set than on the training set.
I have never seen a training setup that is not using a CNN achieve that performance at that speed.

I always have to keep checking the results each time I add GPU functionality, because GPU programming is prone to errors, and if you add too much code at once, it is hard to find what broke the code you already had, especially when new functionality changes existing code.

So far, I am very happy with these results. :mrgreen: :D :twisted: :shock: :mrgreen:

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Sat Jul 19, 2025 2:07 pm

Wow,
the one task I greatly underestimated is how tedious it is to get
all the support in the vector class going.

It is easy to see how in C++, doing scalar operations, you can write stuff like
x = x[i] > 0 ? 1 : -1
When translated to GPU buffers, you need to write a lot of predication.
I have gained new respect for the people who have written full
ML libraries that unify CPU and GPU operations.
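For example, that one-liner becomes something like this on a float4 buffer (illustrative names only):

Code:
    __kernel void signOfElements(__global const float4* input, __global float4* output)
    {
        uint id = get_global_id(0);
        float4 x = input[id];
        // isgreater returns a per-lane -1/0 mask; select picks 1.0f or -1.0f per lane, no branch
        int4 mask = isgreater(x, (float4)(0.0f));
        output[id] = select((float4)(-1.0f), (float4)(1.0f), mask);
    }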

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Mon Sep 01, 2025 3:28 pm

It feels like the graphics API dilemma is never going to be resolved—if anything, it keeps getting worse.

About 20 years ago, I chose OpenGL because it was the only truly cross-platform library. Back then, companies like Apple, Intel, and pretty much anyone outside of Windows were making grand promises that they would never abandon OpenGL.

Well, here we are today, and those same companies have abandoned it. Ironically, Microsoft is now one of the few still giving OpenGL proper support.

These days, it’s nearly impossible to build a true cross-platform demo unless you either write your own wrappers or rely on a game engine like Unreal or Unity.

I tried going down the Vulkan route, but after multiple attempts, I can honestly say I find it disappointing. To me, Vulkan feels like a solution in search of a problem: unnecessarily complex, poorly designed, and with shaky driver support.

I miss the old days of libraries like RenderWare, BRender, and RenderMorphics. Back then, you knew the scope of what you were getting. Today, with OpenGL or Vulkan, you can put in all the effort to build something, and you’re lucky if it even runs consistently across more than one PC with different GPUs. Forget about targeting mobile devices; it really *.

What I am doing now is making a very lightweight render library and moving all the rendering code there, so that I can support other platforms like Apple OS and mobile.

This will also let me implement more advanced rendering features, like reflections and maybe even ray tracing, at my own pace.

I am going to put on my graphics programmer hat and put some effort into that.

Re: Cuda vs Vulkan compute

Postby JoeJ » Tue Sep 02, 2025 6:07 am

    I tried going down the Vulkan route, but after multiple attempts, I can honestly say I find it disappointing. To me, Vulkan feels like a solution in search of a problem: unnecessarily complex, poorly designed, and with shaky driver support.


I was postponing work on a renderer for years because I was so afraid of Vulkan's complexity.
But then, after spending 3 months purely on shadow map research, it took me only another 3 months for the basic renderer I need.
Less than expected, so I thought this time I could afford a longer break working on ragdolls.
Since then I've been working for almost 3 months to make my ragdoll walk as fast as humans do, but still no success so far. ; )

I would say VK isn't as bad as we think, once we get used to it and just do the work.
It could be much easier now as well, as they have added 'easy paths' not requiring renderpasses, for example, afaik.

    Forget about targeting mobile devices; it really *.

But Vulkan is the only way to do mobile gfx at all, in case OpenGL is not enough?

    I am going to put on my graphics programmer hat and put some effort into that.

Why? For fun?
I mean, your rendering is fine to demonstrate a physics engine.
Low level gfx api is overkill, raytraced reflections and shadows would be pointless.
You could continue work on soft bodies or cloth instead, for example? 8)

Regarding Apple, it is a bit their own problem if they boycott any cross-platform API.
Obviously they don't want cross-platform software and games either. Well, they can have that. :?

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Tue Sep 02, 2025 8:03 am

The idea is to have all the rendering stuff encapsulated in a small library.

Then, after you accomplish that with one API,
it is easy to make different backends for the others.

I will start with OpenGL, and remove all of the direct calls from all the demos.

Then, after that is done,
writing a Vulkan, DirectX, or even a different version of the same API backend should be simpler.

Right now, all the demos are sprinkled with OpenGL calls, and it is hard to build them for mobile or for macOS.
That was not the case in the past, until Apple and Intel started to abandon OpenGL and to support Vulkan very badly.

My issue with Vulkan is that it is too low level, for no really clear advantage and with a lot of gratuitous complexity.

Also, the Vulkan API is not well designed.
They have stuff like: to create an object you need to pass an allocator pointer, then to destroy it you also have to pass an allocator pointer. That's just bad design.

No, in my opinion Vulkan does not have the support Khronos wants people to believe. Even AMD and NVIDIA hardware support is very mediocre.

And the issue with the physics is that once you go past rigid bodies, stuff like deformation requires high-performance computing. So it is inevitable that you have to use GPUs.

But I do not want to mix the core library with graphics API calls. I tried that with the fluid and CUDA, and it became a real mess very quickly.

I want that to be indirect, like linking to an API-agnostic render library, and letting that library do the low-level stuff.
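Something in the spirit of the sketch below is what I mean by indirect; the names are made up, but the demos would only ever see an interface like this while each backend fills in the function pointers:

Code:
    typedef struct ndRenderBackend
    {
        void (*beginFrame)(void* context);
        void (*drawMesh)(void* context, const void* mesh, const float* matrix);
        void (*endFrame)(void* context);
        void* context;
    } ndRenderBackend;

    // demo-side code: no OpenGL, Vulkan, or DirectX calls anywhere in sight
    static void renderScene(ndRenderBackend* backend, const void** meshes, const float** matrices, int count)
    {
        backend->beginFrame(backend->context);
        for (int i = 0; i < count; ++i)
        {
            backend->drawMesh(backend->context, meshes[i], matrices[i]);
        }
        backend->endFrame(backend->context);
    }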
