Cuda vs Vulkan compute

A place to discuss everything related to Newton Dynamics.


Re: Cuda vs Vulkan compute

Postby Julio Jerez » Mon Jun 16, 2025 1:59 pm

Well, I spent the weekend getting my hands dirty with the tiles and wrote a tile-based
matrix x matrix multiply OpenCL kernel.
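For reference, the general shape of such a kernel is the classic local-memory tile pattern below. This is only an illustrative sketch (the tile size, argument names, and row-major layout are my assumptions, not the actual kernel in the repository); it assumes the kernel is launched with a TILE x TILE work group:

Code:
    #define TILE 16

    // C = A * B, all matrices row major; each work group computes one TILE x TILE block of C
    __kernel void tiledMatrixMultiply(
            __global const float* matrixA,   // rowsA x colsA
            __global const float* matrixB,   // colsA x colsB
            __global float* matrixC,         // rowsA x colsB
            uint rowsA, uint colsA, uint colsB)
    {
        __local float tileA[TILE][TILE];
        __local float tileB[TILE][TILE];

        uint row = get_global_id(1);
        uint col = get_global_id(0);
        uint localRow = get_local_id(1);
        uint localCol = get_local_id(0);

        float acc = 0.0f;
        uint tileCount = (colsA + TILE - 1) / TILE;
        for (uint t = 0; t < tileCount; ++t)
        {
            // each work item loads one element of A and one of B into fast local memory
            uint aCol = t * TILE + localCol;
            uint bRow = t * TILE + localRow;
            tileA[localRow][localCol] = ((row < rowsA) && (aCol < colsA)) ? matrixA[row * colsA + aCol] : 0.0f;
            tileB[localRow][localCol] = ((bRow < colsA) && (col < colsB)) ? matrixB[bRow * colsB + col] : 0.0f;
            barrier(CLK_LOCAL_MEM_FENCE);

            // both tiles are now reused TILE times out of local memory
            for (uint k = 0; k < TILE; ++k)
            {
                acc += tileA[localRow][k] * tileB[k][localCol];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if ((row < rowsA) && (col < colsB))
        {
            matrixC[row * colsB + col] = acc;
        }
    }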

The previous one, which I thought was going to be better, used the matrix transpose,
and the profile showed this. It was really bad; the profile already said the app was GPU bound.
[Attachment: Untitled.png (profiler capture)]


The new one is quite good so far, and the profile now says the app is GPU bound.
That's good, because further GPU optimization can make it better, but for now this is good enough.
[Attachment: Untitled1.png (profiler capture)]


I can still add some more optimizations, like using an array of registers;
some people report doubling the throughput with those tricks.

It is at the point where the GPU version and the CPU version are breaking even.

    CPU optimized results
    training time 3.485306 (sec)

    opencl platform: AMD Accelerated Parallel Processing
    opencl platform: Intel(R) OpenCL
    opencl platform: Intel(R) OpenCL
    opencl device name: gfx1101
    opencl device version: OpenCL 2.0 AMD-APP (3652.0)
    opencl device local memory: 65536

    training time 3.514954 (sec)


But this is very good news, because that test is only the feed-forward update.
What is really slow is the back-propagation, because that part consumes a ferocious amount of memory bandwidth, which is where the GPU has the upper hand over CPUs.
At least that's the theory, and I have been known to be very wrong with these predictions.

Anyway, I can now remove all the transposes, clean up, and then enable the back-propagation to see where we are. :D

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Tue Jun 17, 2025 5:44 pm

Ahh, finally I got a clean advantage for the GPU.

This is the timing for the most optimized version I could muster so far:

    parallel optimized cpu:
    training time 82.004327 (sec)

This is the same training on the GPU:
    platforms found:
    opencl platform: AMD Accelerated Parallel Processing
    opencl platform: Intel(R) OpenCL
    opencl platform: Intel(R) OpenCL

    selecting:
    opencl device name: gfx1101
    opencl device version: OpenCL 2.0 AMD-APP (3652.0)
    opencl device compute units: 30
    opencl device local memory: 65536

    training time 28.532017 (sec)

A little under three times faster :mrgreen: :D :shock: :D

The good news is that I can optimize the GPU version some more, in a few different ways:
-First, by profiling the kernels. I already did that for the matrix multiply, but there are two other multiplications that still use the transpose method; replacing those two will further improve performance.

-The other is adding more GPU functionality; for example, I have not yet converted the loss functions.
-And finally, an easy one: preloading the data set to the GPU.

Anyway, this seems to be making some progress now.
If that speedup holds for training the robot, the training time should be cut to about 8 hours, which is still too much.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Thu Jun 19, 2025 11:35 am

Wow, I had to do some heavy-duty debugging to find a bug that was generating slightly different results from run to run when using the GPU, but not when using the OpenCL CPU or emulation devices.

That is the telltale sign of a missing or misplaced barrier,
and it is usually a really bad sign, because when the fix is more synchronization, it means a drastic cost in FLOPS throughput.

It turned out that was not the cause, and the fix made the system a little faster still.
The AMD GPU backend now takes about 23 seconds:

    Best model: epoch: 16 success rate:99.278328% training fail count:433 test fail count:219
    Best model: epoch: 17 success rate:99.311668% training fail count:413 test fail count:217
    epoch: 18 success rate:99.335007% training fail count:399 test fail count:223
    epoch: 19 success rate:99.351669% training fail count:389 test fail count:222

    platforms found:
    opencl platform: AMD Accelerated Parallel Processing
    opencl platform: Intel(R) OpenCL
    opencl platform: Intel(R) OpenCL
    opencl platform: Oclgrind

    selecting:
    opencl device name: gfx1101
    opencl device version: OpenCL 2.0 AMD-APP (3652.0)
    opencl device compute units: 30
    opencl device local memory: 65536

    training time 23.135165 (sec)

This is a big win in favor of the GPU,
and the best part is that it can only get faster.
This is the profile capture from the AMD GPU tool.

[Attachment: Untitled.png (profiler capture)]


It is clear that one kernel dominates the entire update, by a factor of more than 10.

And that kernel is this:

Code:
    __kernel void brainAccumulateGradients(
            __global const UniformBufferLayerArguments* parameters,
            __global float* gradientBuffer)
    {
        // one work item per gradient element; each work group handles one workGroupSize-wide slice
        uint itemId = get_local_id(0);
        uint groupId = get_group_id(0);
        uint workGroupSize = get_local_size(0);

        uint inputSize = parameters->m_inputSize;
        uint miniBatchSize = parameters->m_inputOutputSize;

        // walk down the rows of the gradient matrix, accumulating this element across the mini batch
        float sum = 0.0f;
        uint start = groupId * workGroupSize;
        for (uint j = 0; j < miniBatchSize; ++j)
        {
            uint base = start + j * inputSize;
            sum += gradientBuffer[base + itemId];
        }
        // average over the mini batch and write the result back into the first row
        float weightFactor = 1.0f / (float)miniBatchSize;
        barrier(CLK_LOCAL_MEM_FENCE);

        gradientBuffer[start + itemId] = sum * weightFactor;
    }


Basically, the code performs an element-wise addition of a long vector across the rows of a matrix.

This is a challenging operation to optimize because there doesn’t seem to be enough computation per memory access to hide latency. I suspect my mistake is in summing the matrix row by row.
While this might seem reasonable, especially since each row contains a substantial amount of data,
in practice, it may be causing cache thrashing.

What I think is happening is that the compute units are invalidating the cache with each row traversal. This kind of code can run fast on GPUs that don't have a unified L2 cache.
But if the GPU does have an L2 cache, and the vectors are large (over half a megabyte in this small demo), then failure to reuse cached data severely limits performance due to memory bandwidth constraints.

What I plan to try is breaking the operation into a series of kernel launches, each accumulating one row at a time. The idea is that the first pass will read all the data into cache. Even if the cache can't hold it all, each subsequent pass will work on a matrix half the size of the previous one. That means the likelihood of hitting cached data increases with each pass.

This is just a hypothesis, but if this approach even halves the compute time,
it could shift the performance balance strongly in favor of the GPU backend. :mrgreen: :D :shock:
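To make the idea concrete, one halving pass could look roughly like this (an illustration of the hypothesis only, with made-up names; the host would enqueue it with a global size of (rowCount / 2) * inputSize, halve rowCount between launches, and fold any odd leftover row separately):

Code:
    // adds the upper half of the remaining rows into the lower half, in place
    __kernel void foldUpperHalfRows(
            __global float* gradientBuffer,
            uint inputSize,
            uint rowCount)
    {
        uint id = get_global_id(0);          // one work item per element of the lower half
        uint halfRows = rowCount / 2;
        uint row = id / inputSize;
        uint column = id - row * inputSize;

        // every pass reads data the previous pass just touched, improving cache reuse
        gradientBuffer[row * inputSize + column] +=
            gradientBuffer[(row + halfRows) * inputSize + column];
    }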


edit: the naive version of that did not really work; in fact, it made it about 75% slower.
I believe that is because it dispatches too many kernels, and the driver overhead kills it.

I could try to reduce that, but for now I think that adding higher-level optimizations will yield better results.
The next step is to preload the database to device memory and see how that does.

This is actually important for training, because the replay buffer has to be updated incrementally, which means it might need my old sort function for regrouping the data in GPU memory.
Anyway, we are now in a good position to start training again.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Thu Jun 19, 2025 2:10 pm

For context, on my old system the GPU backend trainer is a blowout.

This is my best optimized parallel trainer on the CPU:

    Best model: epoch: 15 success rate:99.401665% training fail count:359 test fail count:208
    Best model: epoch: 16 success rate:99.453339% training fail count:328 test fail count:206
    epoch: 19 success rate:99.508331% training fail count:295 test fail count:220

    training time 207.753696 (sec)

    training data results:

This is the GPU: :D :D

    Best model: epoch: 17 success rate:99.339996% training fail count:396 test fail count:229
    Best model: epoch: 18 success rate:99.348328% training fail count:391 test fail count:222
    epoch: 19 success rate:99.361664% training fail count:383 test fail count:232

    platforms found:
    opencl platform: NVIDIA CUDA
    opencl platform: Intel(R) OpenCL
    opencl platform: Intel(R) FPGA Emulation Platform for OpenCL(TM)

    selecting:
    opencl device name: NVIDIA GeForce GTX 1660 SUPER
    opencl device version: OpenCL 3.0 CUDA
    opencl device compute units: 22
    opencl device local memory: 49152

    training time 27.064774 (sec)

A 7.5x speedup, yikes. :mrgreen: :D :shock: :twisted:

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Sun Jun 22, 2025 12:28 pm

Alright, this is it, the results are in.
These are the results of training the MNIST data set on CPU vs GPU, both optimized.
The GPU could be optimized some more, but that would mean making specialized shader optimizations
to capitalize on specific hardware, stuff like knowing the warp size or the shared-memory bank size, and I am not going for that. As long as the GPU is at least twice as fast as the optimized CPU, that's a win.
But in reality I am getting about a 4x+ speedup.

Here are the results for the small model (~350k parameters), training MNIST for 20 epochs:

    Best model: epoch: 18 success rate: 99.495003% training fail count:303 test fail count:205
    epoch: 19 success rate:99.528328% training fail count:283 test fail count:206

    results: multithreaded optimized cpu small model
    mnist database, model number of parameters 335114
    training time 74.433944 (sec)

    training data results:
    num_right: 59698 out of 60000
    num_wrong: 302 out of 60000
    success rate 99.496666%

    test data results:
    num_right: 9796 out of 10000
    num_wrong: 204 out of 10000
    success rate 97.959999%

This is the GPU:
    opencl device name: gfx1101
    opencl device version: OpenCL 2.0 AMD-APP (3652.0)
    opencl device compute units: 30
    opencl device local memory: 65536

    Best model: epoch: 18 success rate:99.065002% training fail count:561 test fail count:205
    epoch: 19 success rate:99.093338% training fail count:544 test fail count:206

    results: opencl gpu small model
    mnist database, model number of parameters 335114
    training time 15.896462 (sec)

    training data results:
    num_right: 59439 out of 60000
    num_wrong: 561 out of 60000
    success rate 99.065002%

    test data results:
    num_right: 9796 out of 10000
    num_wrong: 204 out of 10000
    success rate 97.959999%

As you can see, the model isn't large enough to capture the underlying structure of the data.
As a result, it only achieves just under 98% accuracy on the test set.
Of course, I could apply techniques like dropout regularization or extend the training time, but doing so increases the number of trial-and-error iterations, precisely the kind of inefficiency that adding a GPU backend is meant to address.

Note:
For context, once a model's training accuracy exceeds around 98% of the target, further progress becomes extremely slow.
This is because the "learning" in neural networks is driven by gradients,
which just means calculating the partial derivatives of a multivariable loss function and updating the weights in proportion to that vector of partial derivatives.
At that stage, most predictions are already correct, so the error between the predicted and true values is often zero. As a result, the algorithm ends up averaging thousands of zero gradients and only a few meaningful ones, making it much harder to continue improving the model.
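In equation form, this is just the standard mini-batch gradient-descent update, with learning rate \eta and mini-batch size m:

    w \leftarrow w - \eta \, \frac{1}{m} \sum_{i=1}^{m} \nabla_w L(x_i, y_i)

so when most of the per-sample gradients \nabla_w L(x_i, y_i) are zero, the averaged step shrinks toward nothing even though a few samples are still classified wrong.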

Anyway, a simple test to try is to increase the size of the hidden layers, using 512 neurons instead of 256.
The general consensus is that wider models are better at learning the underlying patterns in the data and tend to make more confident classifications, which usually translates into better performance on unseen data. :D :D

However, this change comes at a cost: it roughly quadruples the training time on both CPU and GPU. And since CPUs are significantly slower for this kind of workload, experimenting with architectural changes becomes impractical without GPU acceleration.

Here are the results of that test, with almost a million parameters: :o :shock: :shock:
    epoch: 16 success rate:99.561668% training fail count:263 test fail count:188
    Best model: epoch: 18 success rate:99.620003% training fail count:228 test fail count:166
    epoch: 19 success rate:99.570000% training fail count:258 test fail count:191

    results: multithreaded optimized cpu bigger model
    mnist database, model number of parameters 932362
    training time 195.515136 (sec)

    training data results:
    num_right: 59773 out of 60000
    num_wrong: 227 out of 60000
    success rate 99.621666%

    test data results:
    num_right: 9835 out of 10000
    num_wrong: 165 out of 10000
    success rate 98.349998%

and the gpu :mrgreen: :shock: :D 8) :mrgreen: :P :shock: :mrgreen:
    Best model: epoch: 15 success rate:99.404999% training fail count:357 test fail count:175
    epoch: 16 success rate:99.404999% training fail count:357 test fail count:182
    epoch: 18 success rate:99.436661% training fail count:338 test fail count:188
    epoch: 19 success rate:99.436661% training fail count:338 test fail count:180

    results: opencl gpu bigger model
    mnist database, model number of parameters 932362
    training time 36.439773 (sec)

    training data results:
    num_right: 59644 out of 60000
    num_wrong: 356 out of 60000
    success rate 99.406670%

    test data results:
    num_right: 9826 out of 10000
    num_wrong: 174 out of 10000
    success rate 98.260002%

It seems that, at least on this test, the consensus is true:
the model generalizes better on both training and test data, achieving a significant fraction over 98% accuracy in so few epochs.
The important part is that the test only takes 36 seconds on the GPU while the CPU takes 195.
That's a 5.4x speedup factor, which, according to my investigations, is better than the results people are getting in similar tests with TensorFlow and PyTorch, or at least it is in the ballpark.

Now it's back to training the robot one more time. :shock: 8) :shock: :mrgreen:

edit:
Ahh, and for completeness, the NVIDIA test gives these results:
    Best model: epoch: 18 success rate:99.430000% training fail count:342 test fail count:169
    epoch: 19 success rate:99.451668% training fail count:329 test fail count:172

    results:
    mnist database, model number of parameters 932362
    training time 59.745972 (sec)

    training data results:
    num_right: 59659 out of 60000
    num_wrong: 341 out of 60000
    success rate 99.431664%
    test data results:
    num_right: 9832 out of 10000
    num_wrong: 168 out of 10000
    success rate 98.320000%

That's not bad for a legacy NVIDIA GPU; it is quite competitive with the AMD 7800, actually.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Fri Jun 27, 2025 2:09 pm

Wow, the prognosis for GPU reinforcement learning is very promising.

I have converted the part of the class that interacts most with CPU memory.
At first the results were very mediocre, barely a fraction faster than the CPU.
But then I read about shared virtual memory (SVM) buffers.
It makes the code very verbose, but when set up correctly,
it avoids a lot of the overhead of setting up the DMA transfers.
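For anyone curious, the coarse-grained SVM path looks roughly like this (an illustrative sketch only, with made-up names and no error checking, not the actual code in the library):

Code:
    #define CL_TARGET_OPENCL_VERSION 200
    #include <CL/cl.h>

    void exampleSvmUpload(cl_context context, cl_command_queue queue, cl_kernel kernel, size_t count)
    {
        // one allocation visible to both host and device, no separate staging buffer
        float* data = (float*)clSVMAlloc(context, CL_MEM_READ_WRITE, count * sizeof(float), 0);

        // map before touching the memory on the host, unmap before the kernel uses it
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, count * sizeof(float), 0, NULL, NULL);
        for (size_t i = 0; i < count; ++i)
        {
            data[i] = 0.0f;
        }
        clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

        // the SVM pointer is passed straight to the kernel, no clCreateBuffer / clEnqueueWriteBuffer copy
        clSetKernelArgSVMPointer(kernel, 0, data);
        size_t globalSize = count;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
        clFinish(queue);

        clSVMFree(context, data);
    }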

These are now the first results of training.
CPU:
[Attachment: Untitled.png (CPU profile capture)]


GPU:
[Attachment: Untitled1.png (GPU profile capture)]


That's about a 10x factor, and the CPU timing is from the physics SDK;
it is as if the GPU does not feel the load at all.

When the entire class is complete, the CPU timing is over 100 ms per step.
A 10x speed factor would be quite good, but if it keeps this pace, it could be a lot better than that :mrgreen: :mrgreen: :mrgreen: :D :shock:

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Thu Jul 03, 2025 2:01 pm

Wow, I have to dive into some serious debugging.
For some reason, I never considered that, in a computer program, an array could exceed 2 billion elements.

I ran into a similar issue a while ago while working on sorting and fluid simulations in CUDA.
Back then, the problem wasn't the number of elements, it was the pointer address.
At the time, my GPU had less than 2 GB of VRAM, and I primarily worked in 32-bit mode.
Then I upgraded to an NVIDIA 1660 Super with 6 GB, and suddenly, I had to rewrite my utility vector class to use int64 instead of uint32, and update the entire codebase. That was a tedious process.

Now, while working with machine learning and training models,
I'm again running into vectors that do exceed 2 billion elements.
The tricky part is that smaller datasets like MNIST work fine since they fall under the 2 GB threshold.
But once I move on to training robots, the amount of offline data blows past that limit.
So, I’ve decided it’s time to fix this issue once and for all.
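Just to illustrate the failure mode with a toy example (nothing to do with the actual vector class, which uses its own types):

Code:
    #include <stdint.h>
    #include <stdio.h>

    int main()
    {
        int64_t count = 3000000000LL;              // 3 billion elements
        int32_t index32 = (int32_t)(count - 1);    // wraps around and goes negative
        int64_t index64 = count - 1;               // correct
        printf("32-bit index: %d   64-bit index: %lld\n", index32, (long long)index64);
        return 0;
    }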

That threw a real monkey wrench into refactoring the library.
But I am now almost done and ready to try again.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Wed Jul 09, 2025 12:27 pm

This isn’t new to me, I've always tried to structure data to make efficient use of cache lines whenever possible.
That said, code optimized for memory access patterns doesn’t always look clean or elegant,
so I’ve only applied these optimizations loosely in the past.

A classic example is matrix-vector or matrix-matrix multiplication. Typically, one matrix in the operation is cache-friendly, while the other isn't.
A common workaround is to transpose one of the matrices to improve memory locality.
However, this comes with trade-offs: it doubles memory usage, requires additional synchronization between the original and transposed matrices, and introduces overhead from the transpose operation itself. In practice, for small and medium size matrices, this often ends up being no better than just accepting the cache inefficiencies.

Advanced libraries like BLAS and LAPACK use tiling algorithms to address these issues more effectively, but those implementations are massive and primarily deliver significant performance gains only when working with large matrices like 2k by 2k or more, which is not the case for ML.

That sets the context.
For testing and debugging my neural network library, I’ve been using the MNIST dataset.
To keep CPU time manageable during matrix operations, I initially set the hidden layer width to 64 neurons. However, this led to poor model convergence.

Here are the results from a 20-epoch run:
    Training accuracy: 96.53%
    Test accuracy: 95.92%
Total time: ~1 minute

To improve accuracy, I had two main options:
-Increase the number of epochs
-Increase the width of the hidden layer

Both options increase runtime substantially: more epochs roughly linearly, and a wider hidden layer roughly quadratically. More epochs aren’t always effective because once the network hits its learning capacity, gradients become too small to be meaningful, and numerical noise starts to dominate, making further learning difficult.

The second option, widening the hidden layer is generally favored in modern ML practice.
It's widely accepted that wider networks often capture the underlying structure of the data better than deep, narrow ones. The downside is performance: doubling the width can quadruple runtime.
For instance, increasing the hidden layer to 512 neurons boosts accuracy significantly, but also increases training time substantially.

Here are the results for a 512-neuron hidden layer:
    Training accuracy: 99.47%
    Test accuracy: 98.29%
Total time: ~10 minutes

Ideally, you'd want both wide and deep networks to get the best of both worlds, but at that point, you’re looking at hours of training time.
With these premises I started to explore GPUs, first Vulkan and then OpenCL.
Vulkan is too low level for my tests, so I tried OpenCL.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Wed Jul 09, 2025 12:53 pm

When working with OpenCL, careful attention to memory usage is essential.

On a GPU, reading a single value from memory doesn’t just retrieve that value, it fetches an entire memory segment (a wavefront).
So, if you naively port traditional matrix operations to the GPU, the performance gains may be underwhelming, or even negative, once you account for the overhead of transferring data between the host and the device.

After rewriting the matrix algorithm using a tile-based approach, the performance improvements were significant. This led me to wonder: what if I applied the same tiled algorithm on the CPU?

After all, a CPU’s L1 cache is roughly the same size as a GPU’s shared local memory. Additionally, while a typical GPU memory fetch involves a 32 x 32-bit segment, the CPU typically fetches a 64 x 8-bit cache line. If every element in a cache line is used efficiently, you can, in theory, achieve up to a 16x gain from memory locality, and there is no host-to-device transfer cost at all.

So, I rolled up my sleeves, implemented the tiled version for the CPU, and refactored the library accordingly.
It wasn’t easy, but I eventually got a working prototype, and the results exceeded my expectations.
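The core of the CPU tiled loop has roughly this shape (a sketch under my own naming and a guessed tile size; the code in the library is organized differently and multithreaded):

Code:
    // C (rows x cols) += A (rows x inner) * B (inner x cols), all row major
    void tiledMatrixMultiply(const float* A, const float* B, float* C,
                             int rows, int inner, int cols)
    {
        const int tile = 64;   // sized so the working set of three tiles stays cache resident
        for (int i0 = 0; i0 < rows; i0 += tile)
        {
            for (int k0 = 0; k0 < inner; k0 += tile)
            {
                for (int j0 = 0; j0 < cols; j0 += tile)
                {
                    int iMax = (i0 + tile < rows) ? i0 + tile : rows;
                    int kMax = (k0 + tile < inner) ? k0 + tile : inner;
                    int jMax = (j0 + tile < cols) ? j0 + tile : cols;
                    for (int i = i0; i < iMax; ++i)
                    {
                        for (int k = k0; k < kMax; ++k)
                        {
                            float a = A[i * inner + k];
                            // the innermost loop streams through contiguous memory in both B and C
                            for (int j = j0; j < jMax; ++j)
                            {
                                C[i * cols + j] += a * B[k * cols + j];
                            }
                        }
                    }
                }
            }
        }
    }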

The speedup was dramatic: a model that previously required 10 to 15 minutes for 20 epochs on the CPU now trains in just 27 seconds.

    Results: Dataset: MNIST
    Model Parameters: 932,362
    Training Time: 27.7 seconds

    Training Accuracy:
    Correct: 59,681 / 60,000
    Accuracy: 99.47%

    Test Accuracy:
    Correct: 9,829 / 10,000
    Accuracy: 98.29%
Next, I plan to revisit the GPU version and integrate some of the optimization techniques used in the CPU implementation. I’m not expecting a 10x–100x improvement, more realistically, a 3x–5x speedup. But that would still be a valuable gain.
I found out that the proverbial 100x GPU speedup only happens when you compare a very poor CPU implementation to a highly efficient GPU version of the same routine,
stuff like comparing a GPU radix sort to the standard library quicksort, and shenanigans like that.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Fri Jul 11, 2025 1:32 pm

Alright, I've now completed the OpenCL versions of all the kernels, and the results are impressive.
Using the most optimized CPU implementation:
    Training time: 27.78 seconds
    Training accuracy: 99.41%
    Test accuracy: 98.18%
Now with GPU (OpenCL 2.0, AMD-APP 3652.0):
    Training time: 6.97 seconds
    Training accuracy: 99.38%
    Test accuracy: 98.04%

A factor of four, as I predicted. But bear in mind that both the CPU and GPU results are about 5 to 10 times faster than when I started, so overall this is better than a 10-fold gain. :mrgreen: :D
I am happy with these results.

With this GPU acceleration in place, I can finally revisit the reinforcement learning setup for training robots. I’m expecting a dramatic reduction in training time from several hours or even days down to just a few hours, possibly even minutes.

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Tue Jul 15, 2025 11:14 pm

This is the final blow:

    opencl device name: gfx1101
    opencl device version: OpenCL 2.0 AMD-APP (3652.0)
    opencl device compute units: 30
    opencl device local memory: 65536

    results:
    mnist database, model number of parameters 932362
    training time 0.478264 (sec)

    training data results:
    num_right: 57069 out of 60000
    num_wrong: 2931 out of 60000
    success rate 95.114998%

    test data results:
    num_right: 9518 out of 10000
    num_wrong: 482 out of 10000
    success rate 95.180000%

Over 95% in just one epoch, in less than 0.5 seconds, and it does better on the test set than on the training set.
I have never seen a training setup that is not using a CNN achieve that performance at that speed.

I always have to keep checking the results each time I add GPU functionality, because GPU programming is prone to errors, and if you add too much code at once, it is hard to find what broke the code you already had, especially when new functionality changes existing code.

So far, I am very happy with these results. :mrgreen: :D :twisted: :shock: :mrgreen:

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Sat Jul 19, 2025 2:07 pm

Wow,
the one task I greatly underestimated is how tedious it is to get
all the support in the vector class going.

It is easy to see how in C++, doing scalar operations, you can write stuff like
x = x[i] > 0 ? 1 : -1
When translated to GPU buffers, you need to write a lot of predication.
I have gained new respect for the people who have written full
ML libraries that unify CPU and GPU operations.
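For example, that one-liner becomes something like this on a float4 buffer (illustrative names only):

Code:
    __kernel void signOfElements(__global const float4* input, __global float4* output)
    {
        uint id = get_global_id(0);
        float4 x = input[id];
        // isgreater returns a per-lane -1/0 mask; select picks 1.0f or -1.0f per lane, no branch
        int4 mask = isgreater(x, (float4)(0.0f));
        output[id] = select((float4)(-1.0f), (float4)(1.0f), mask);
    }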

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Mon Sep 01, 2025 3:28 pm

It feels like the graphics API dilemma is never going to be resolved—if anything, it keeps getting worse.

About 20 years ago, I chose OpenGL because it was the only truly cross-platform library. Back then, companies like Apple, Intel, and pretty much anyone outside of Windows were making grand promises that they would never abandon OpenGL.

Well, here we are today, and those same companies have abandoned it. Ironically, Microsoft is now one of the few still giving OpenGL proper support.

These days, it’s nearly impossible to build a true cross-platform demo unless you either write your own wrappers or rely on a game engine like Unreal or Unity.

I tried going down the Vulkan route, but after multiple attempts, I can honestly say I find it disappointing. To me, Vulkan feels like a solution in search of a problem: unnecessarily complex, poorly designed, and with shaky driver support.

I miss the old days of libraries like RenderWare, BRender, and RenderMorphics. Back then, you knew the scope of what you were getting. Today, with OpenGL or Vulkan, you can put in all the effort to build something, and you’re lucky if it even runs consistently across more than one PC with different GPUs. Forget about targeting mobile devices; it really *.

What I am doing now is making a very lightweight render library and moving all the rendering code there, so that I can support other platforms like Apple OS and mobile.

This will also let me implement more advanced rendering features, like reflections and maybe even ray tracing, at my own pace.

I am going to put on my graphics programmer hat and put some effort into that.

Re: Cuda vs Vulkan compute

Postby JoeJ » Tue Sep 02, 2025 6:07 am

    I tried going down the Vulkan route, but after multiple attempts, I can honestly say I find it disappointing. To me, Vulkan feels like a solution in search of a problem: unnecessarily complex, poorly designed, and with shaky driver support.


I was postponing work on a renderer for years because I was so afraid of Vulkan's complexity.
But then, after spending 3 months purely on shadow map research, it took me only another 3 months for the basic renderer I need.
Less than expected, so I thought this time I could afford a longer break working on ragdolls.
Since then I've been working for almost 3 months to make my ragdoll walk as fast as humans do, but still no success so far. ; )

I would say VK isn't as bad as we think, once we get used to it and just do the work.
It could be much easier now as well, as they have added 'easy paths' not requiring renderpasses, for example, afaik.

    Forget about targeting mobile devices; it really *.

But Vulkan is the only way to do mobile gfx at all, in case OpenGL is not enough?

    I am going to put on my graphics programmer hat and put some effort into that.

Why? For fun?
I mean, your rendering is fine to demonstrate a physics engine.
Low level gfx api is overkill, raytraced reflections and shadows would be pointless.
You could continue work on soft bodies or cloth instead, for example? 8)

Regarding Apple, it is a bit their own problem if they boycott any cross-platform API.
Obviously they don't want cross-platform software and games either. Well, they can have that. :?

Re: Cuda vs Vulkan compute

Postby Julio Jerez » Tue Sep 02, 2025 8:03 am

The idea is to have all the rendering stuff encapsulated in a small library.

Then, after you accomplish that with one API,
it is easy to make different backends for the others.

I will start with OpenGL, and remove all of the direct calls from all the demos.

Then, after that is done,
writing a Vulkan, DirectX, or even a different version of the same API backend should be simpler.

Right now, all the demos are sprinkled with OpenGL calls, and it is hard to build them for mobile or for macOS.
That was not the case in the past, until Apple and Intel started to abandon OpenGL and to support Vulkan very badly.

My issue with Vulkan is that it is too low level, for no really clear advantage and with a lot of gratuitous complexity.

Also, the Vulkan API is not well designed.
They have stuff like: to create an object you need to pass an allocator pointer, then to destroy it you also have to pass an allocator pointer. That's just bad design.

No, in my opinion Vulkan does not have the support Khronos wants people to believe. Even AMD and NVIDIA hardware support is very mediocre.

And the issue with the physics is that once you go past rigid bodies, stuff like deformation requires high-performance computing. So it is inevitable that you have to use GPUs.

But I do not want to mix the core library with graphics API calls. I tried that with the fluid and CUDA, and it became a real mess very quickly.

I want that to be indirect, like linking to an API-agnostic render library, and letting that library do the low-level stuff.
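Something in the spirit of the sketch below is what I mean by indirect; the names are made up, but the demos would only ever see an interface like this while each backend fills in the function pointers:

Code:
    typedef struct ndRenderBackend
    {
        void (*beginFrame)(void* context);
        void (*drawMesh)(void* context, const void* mesh, const float* matrix);
        void (*endFrame)(void* context);
        void* context;
    } ndRenderBackend;

    // demo-side code: no OpenGL, Vulkan, or DirectX calls anywhere in sight
    static void renderScene(ndRenderBackend* backend, const void** meshes, const float** matrices, int count)
    {
        backend->beginFrame(backend->context);
        for (int i = 0; i < count; ++i)
        {
            backend->drawMesh(backend->context, meshes[i], matrices[i]);
        }
        backend->endFrame(backend->context);
    }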
