typedef struct
{
    uint m_inputSize;
    uint m_outputSize;
    uint m_parametersBatchSize;
    uint m_parametersStartOffset;
    uint m_inputOutputSize;
    uint m_inputOutputStartOffset;
    uint m_unused[4];
} UniformBufferObject;
)"""";
__kernel void brainCopyInput(__global const UniformBufferObject* parameters, __global float* inputOutputData, __global float* inputBuffer)
{
    // one work-group per batch item; each work item strides through the input in workGroupSize steps
    uint itemId = get_local_id(0);
    uint groupId = get_group_id(0);
    uint workGroupSize = get_local_size(0);

    uint inputSize = parameters->m_inputSize;
    uint inputOutputSize = parameters->m_inputOutputSize;
    uint inputOutputStartOffset = parameters->m_inputOutputStartOffset;
    uint srcBase = groupId * inputSize;
    uint dstBase = groupId * inputOutputSize + inputOutputStartOffset;

    uint workGroupSizeReminder = inputSize % workGroupSize;
    uint modWorkGroupSize = inputSize - workGroupSizeReminder;
    for (uint i = 0; i < modWorkGroupSize; i += workGroupSize)
    {
        float a = inputBuffer[srcBase + i + itemId];
        inputOutputData[dstBase + i + itemId] = a;
    }
    // tail: the remaining inputSize % workGroupSize elements
    if (itemId < workGroupSizeReminder)
    {
        float a = inputBuffer[srcBase + modWorkGroupSize + itemId];
        inputOutputData[dstBase + modWorkGroupSize + itemId] = a;
    }
}
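For context, the kernel copies one batch item per work-group, so the host is expected to launch batchCount work-groups of workGroupSize work items each. Below is a minimal host-side dispatch sketch; the function and variable names (dispatchBrainCopyInput, batchCount, the cl_mem buffers) are assumptions for illustration, not taken from the original post. The PTX that the NVIDIA OpenCL driver generates for this scalar kernel is shown next.

#include <CL/cl.h>

/* Hypothetical host-side dispatch for brainCopyInput: one work-group per batch item.
   All argument names here are assumed, not from the original post. */
static cl_int dispatchBrainCopyInput(
    cl_command_queue queue, cl_kernel kernel,
    cl_mem paramBuf, cl_mem inputOutputBuf, cl_mem inputBuf,
    size_t batchCount, size_t workGroupSize)
{
    size_t localSize = workGroupSize;                /* e.g. 256 work items per group */
    size_t globalSize = batchCount * workGroupSize;  /* groupId selects the batch item */

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &paramBuf);       /* UniformBufferObject */
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &inputOutputBuf); /* inputOutputData */
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &inputBuf);       /* inputBuffer */

    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
        &globalSize, &localSize, 0, NULL, NULL);
}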
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: UNKNOWN
// Unknown Toolkit Version
// Based on NVVM 7.0.1
//
.version 8.7
.target sm_89, texmode_independent
.address_size 64
// .globl brainCopyInput
// brainLayerLinear_$_cachedInput has been demoted
// brainLayerLinear_$_cachedOutput has been demoted
// brainLayerLinear_$_reductionBuffer has been demoted
// brainLayerSoftmaxActivation_$_tmpInputBuffer has been demoted
// brainLayerSoftmaxActivation_$_reductionBuffer has been demoted
.entry brainCopyInput(
.param .u64 .ptr .global .align 4 brainCopyInput_param_0,
.param .u64 .ptr .global .align 4 brainCopyInput_param_1,
.param .u64 .ptr .global .align 4 brainCopyInput_param_2
)
{
.reg .pred %p<4>;
.reg .f32 %f<3>;
.reg .b32 %r<25>;
.reg .b64 %rd<12>;
ld.param.u64 %rd3, [brainCopyInput_param_0];
ld.param.u64 %rd1, [brainCopyInput_param_1];
ld.param.u64 %rd2, [brainCopyInput_param_2];
mov.u32 %r1, %tid.x;
mov.u32 %r11, %ctaid.x;
mov.b32 %r12, %envreg0;
add.s32 %r13, %r12, %r11;
ld.global.u32 %r14, [%rd3];
mul.lo.s32 %r2, %r14, %r13;
ld.global.u32 %r15, [%rd3+16];
ld.global.u32 %r16, [%rd3+20];
mad.lo.s32 %r3, %r15, %r13, %r16;
mov.u32 %r4, %ntid.x;
rem.u32 %r5, %r14, %r4;
sub.s32 %r6, %r14, %r5;
setp.eq.s32 %p1, %r6, 0;
@%p1 bra $L__BB0_3;
add.s32 %r7, %r2, %r1;
add.s32 %r8, %r3, %r1;
mov.u32 %r24, 0;
$L__BB0_2:
add.s32 %r18, %r7, %r24;
mul.wide.u32 %rd4, %r18, 4;
add.s64 %rd5, %rd2, %rd4;
// this is at least four times slower than in CUDA; the code should issue a 256 or 128 bit copy (st.global.f128)
ld.global.f32 %f1, [%rd5];
add.s32 %r19, %r8, %r24;
mul.wide.u32 %rd6, %r19, 4;
add.s64 %rd7, %rd1, %rd6;
// this is at least four times slower than in CUDA; the code should issue a 256 or 128 bit copy (st.global.f128)
st.global.f32 [%rd7], %f1;
add.s32 %r24, %r24, %r4;
setp.lt.u32 %p2, %r24, %r6;
@%p2 bra $L__BB0_2;
$L__BB0_3:
setp.le.u32 %p3, %r5, %r1;
@%p3 bra $L__BB0_5;
add.s32 %r20, %r2, %r1;
add.s32 %r21, %r20, %r6;
mul.wide.u32 %rd8, %r21, 4;
add.s64 %rd9, %rd2, %rd8;
// this is at least four times slower than in CUDA; the code should issue a 256 or 128 bit copy (st.global.f128)
ld.global.f32 %f2, [%rd9];
add.s32 %r22, %r3, %r1;
add.s32 %r23, %r22, %r6;
mul.wide.u32 %rd10, %r23, 4;
add.s64 %rd11, %rd1, %rd10;
// this is at least four times slower than in CUDA; the code should issue a 256 or 128 bit copy (st.global.f128)
st.global.f32 [%rd11], %f2;
$L__BB0_5:
ret;
}
float8* ptr0 = (float8*)&inputBuffer[srcBase];
float8* ptr1 = (float8*)&inputOutputData[dstBase];
for (uint i = 0; i < modWorkGroupSize; i += workGroupSize)
{
    //float a = inputBuffer[srcBase + i + itemId];
    //inputOutputData[dstBase + i + itemId] = a;
    float8 a = ptr0[i + itemId];
    ptr1[i + itemId] = a;
}
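For reference, a complete vectorized variant of the kernel built around that float8 loop might look like the sketch below. This is an assumption about how the rest of the kernel would be adapted, not the kernel from the original post: the loop bounds are recomputed in float8 units, the tail is copied scalar by scalar, and it assumes srcBase and dstBase are multiples of 8 so the float8 accesses stay 32-byte aligned. The vectorized PTX the driver emits for this style of loop is shown below.

__kernel void brainCopyInput(__global const UniformBufferObject* parameters,
    __global float* inputOutputData, __global float* inputBuffer)
{
    uint itemId = get_local_id(0);
    uint groupId = get_group_id(0);
    uint workGroupSize = get_local_size(0);

    uint inputSize = parameters->m_inputSize;
    uint inputOutputSize = parameters->m_inputOutputSize;
    uint inputOutputStartOffset = parameters->m_inputOutputStartOffset;
    uint srcBase = groupId * inputSize;
    uint dstBase = groupId * inputOutputSize + inputOutputStartOffset;

    // work in float8 blocks: each work item moves 8 floats per iteration
    uint blockCount = inputSize / 8;
    uint blockRemainder = blockCount % workGroupSize;
    uint modBlockCount = blockCount - blockRemainder;

    __global float8* ptr0 = (__global float8*)&inputBuffer[srcBase];
    __global float8* ptr1 = (__global float8*)&inputOutputData[dstBase];
    for (uint i = 0; i < modBlockCount; i += workGroupSize)
    {
        ptr1[i + itemId] = ptr0[i + itemId];
    }
    if (itemId < blockRemainder)
    {
        ptr1[modBlockCount + itemId] = ptr0[modBlockCount + itemId];
    }

    // scalar tail for the last inputSize % 8 floats
    uint tailStart = blockCount * 8;
    if (itemId < (inputSize - tailStart))
    {
        inputOutputData[dstBase + tailStart + itemId] = inputBuffer[srcBase + tailStart + itemId];
    }
}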
$L__BB0_2:
add.s32 %r16, %r21, %r1;
mul.wide.u32 %rd8, %r16, 32;
add.s64 %rd9, %rd2, %rd8;
// NVIDIA unrolls the float8 into two 128 bit transactions,
// this code should be at least four times faster
ld.global.v4.f32 {%f1, %f2, %f3, %f4}, [%rd9];
ld.global.v4.f32 {%f9, %f10, %f11, %f12}, [%rd9+16];
add.s64 %rd10, %rd1, %rd8;
st.global.v4.f32 [%rd10+16], {%f9, %f10, %f11, %f12};
st.global.v4.f32 [%rd10], {%f1, %f2, %f3, %f4};
add.s32 %r21, %r21, %r4;
setp.lt.u32 %p2, %r21, %r6;
@%p2 bra $L__BB0_2;
$L__BB0_2:
add.s32 %r16, %r21, %r1;
mul.wide.u32 %rd8, %r16, 64;
add.s64 %rd9, %rd2, %rd8;
ld.global.v4.f32 {%f1, %f2, %f3, %f4}, [%rd9];
ld.global.v4.f32 {%f9, %f10, %f11, %f12}, [%rd9+16];
ld.global.v4.f32 {%f17, %f18, %f19, %f20}, [%rd9+32];
ld.global.v4.f32 {%f25, %f26, %f27, %f28}, [%rd9+48];
add.s64 %rd10, %rd1, %rd8;
st.global.v4.f32 [%rd10+48], {%f25, %f26, %f27, %f28};
st.global.v4.f32 [%rd10+32], {%f17, %f18, %f19, %f20};
st.global.v4.f32 [%rd10+16], {%f9, %f10, %f11, %f12};
st.global.v4.f32 [%rd10], {%f1, %f2, %f3, %f4};
add.s32 %r21, %r21, %r4;
setp.lt.u32 %p2, %r21, %r6;
@%p2 bra $L__BB0_2;
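The second listing strides by 64 bytes per index and issues four ld/st.global.v4.f32 pairs per iteration, which is what you would expect if the copy were widened further, for example to float16 (16 floats per work item). A minimal sketch of that loop, again an assumption rather than code from the post, with the loop bound assumed to be recomputed in float16 units inside the same kernel:

// Hypothetical float16 widening of the copy loop (matches the 64-byte stride above).
// Assumes modWorkGroupSize is now counted in float16 blocks and the bases are multiples of 16.
__global float16* src16 = (__global float16*)&inputBuffer[srcBase];
__global float16* dst16 = (__global float16*)&inputOutputData[dstBase];
for (uint i = 0; i < modWorkGroupSize; i += workGroupSize)
{
    // each work item moves 16 floats; the compiler emits four 128 bit load/store pairs
    dst16[i + itemId] = src16[i + itemId];
}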
Julio Jerez wrote: I find that the CPU runs faster in all cases.
The one thing I notice is that it seems OpenCL is better for GPU computing than Vulkan.
It's often less optimized than the PTX produced directly from CUDA code.