Cuda Solver

by **Julio Jerez** » Thu Apr 07, 2022 3:34 pm

Yours is about 1 ms faster than mine.

Those 3.x ms is all cpu time, the gpu it well under half ms, and I assume a system like your is even lower.

So far is going well. I am going continue going with the broadphase.

I hope to comple that part this week end. And start with the contact calculation.

I will be happy if after all is set we get that seven running in real time with all the bodies interactions with each others.
But let us see, it is still a long way to go.

by **Julio Jerez** » Thu Apr 07, 2022 5:38 pm

Finally some one other than I say what I have been saying fir years and years.

https://news.ycombinator.com/item?id=26150682

by **JoeJ** » Thu Apr 07, 2022 6:12 pm

I have no more NV GPU currently. Maybe the GTX 670 would still work, but won't try.

About those bubbles, i think i have something like 100 or more dispatches, and while i'm worried too, that's not too much to be a real problem. (Remember the issues about draw calls with high level APIs - afaik that's really only problem when approaching > 1000 calls.)
In my case, many of my dispatches do nothing. They are indirect, so receive work count from GPU, and often the count is zero.
That's my major point of critique still left with low level APIs.
The problem is, even if they do no work, memory barriers are still executed and cost time, probably e.g. because they flush cashes.

With AMDs Mantle API, you could conditionally skip over those empty dispatches and the barriers. Problem solved.
But neither DX12 nor VK can do something like that.
VK recently added 'conditional draws'. That's almost the same idea, and it also works with compute not just rendering, but you can not make barriers conditionally too, so it's useless to implement control flow on GPU.
Same for DX 12. Nothing about barriers.
Even the same for NV device generated command buffers extension for VK. No barriers.

It's beyond me why they all seemingly just forget about this. :roll:

by **Julio Jerez** » Fri Apr 08, 2022 2:19 pm

Umm now I got to the part that makes it really difficult.

Cuda say the support lambda and in a way cpp do.
But the way they do is by inventing thier own semantic and having the compiler translate it to cuda code.
For example the sematic to call a Kerner is
Kernernamr<<<some device param>>> (Kerner parameter)

One parameter is the lambda expression that implements the operation.

But that's not how it should be, because you can not use that to encapsulate something like a template class that implement a complex algorithm that requires more than one kernel call like sorting, where it will invoke more than one kernel call.

Basically the algorithm can not be turned into a library utility that can be reused.

It * but it seem the only way is to just copy the algorithm every time we have to use it.
Or maybe make a class that does the operation and we can pass some parameter to tweak it.

I see if I can try that because thier lambda can't be collected

by **Julio Jerez** » Fri Apr 08, 2022 3:05 pm

I know this is confusing so here is an example, say we make a template class to sort an array.
In cuda it will look like this

Code: Select all: template <class T, typename cuEvaluateKey> __global__ void CudaHistogram(typename cuEvaluateKey, T* const array, int* histogram, int size) { int index = threadIdx.x + blockDim.x * blockIdx.x; int key = cuEvaluateKey(&array[index]); } template<class T> class CudaCountingSort { public: CudaCountingSort(T* const src, T* const dst, int* histogram, int size) :m_src(src) ,m_dst(dst) ,m_histogram(histogram) ,m_size(size) { } void Sort() { auto EvaluateKey = [] __device__(T & dataElement, int* histogram) { return 0; }; //CudaHistogram << <blocksCount, D_THREADS_PER_BLOCK, 0, stream >> > (GenerateHash, info, bodiesGpu, scan, hashArray); CudaHistogram << <1, 256, 0, 0>> > (EvaluateKey, m_src, m_histogram, 100); } const T* m_src; T* m_dst; int* m_histogram; int m_size; };

that will work, but the problem is that the expression that evaluate the sort key is a function object implemented inside another functions.

In normal CPP the template class woudl be implement as

Code: Select all: template<class T, typename EvaluateKey> class CudaCountingSort { .... }

now the lambda is pass by the caller, and the call can be used to sort many kind of arrays.
the way it is in cuda, there is not way to do that because the way the pass the lambda is that their compiler get the address of the object at run time and pass it to the GPU, but that is not assessable to c++. so the only way is that we need to copy and past the class and edited for every new object.

I will still use like this because makes it cleaner, but they still have a long way to go to be CPP compliance.
In that sense sycl is probably closer to legitimate C++11
but anyway, it is still huge improvement vs opencl or any other api now.
Microsoft AMP had that right more than 10 years ago.

I wish I can post these questions over the nvidea forum, but I do not know if it is just to me, or that's how the operates, but they do not answer questions at all, even worse, after few days the questions are silently closed as if the question was answered.
https://forums.developer.nvidia.com/posted

by **JoeJ** » Fri Apr 08, 2022 6:15 pm

Hihi, maybe they have a blacklist of people they don't like. Especially that's people with initials JJ. :lol:

If so, you'd need another nickname to get answers. :mrgreen:

by **Bird** » Sat Apr 09, 2022 7:47 am

That's too bad. I'm a NVIDIA OptiX user and support on that forum is fantastic.... almost as good as it is here.

by **Julio Jerez** » Sat Apr 09, 2022 6:17 pm

I now added more to the sort.
Since this is a 3d volume.
The sorting g key has 6 digits.
This require 3 kernels calls per dimensions, that's 18 kernel invocations.
1 to get the histogram
1 to calculate the scan prefix
1 to sort the data.

In theory the sort is const time independent of the body count. But in practice is determined by memory bandwith
It is actually very very fast. But as you can see it quickly adds up.

But now I see that as more Kerner are added, for some reason more silence bubbles are inserted in the gpu time line. This is very concerning.

It is not a problem yet, but if this keeps happening it makes it hard to stimate the final results.

by **Julio Jerez** » Sat Apr 09, 2022 11:13 pm

I now have the full grid sorting class.
It hard to bring down a gpu to its knees.
At 32k elements in that sence it generat around 80k entries, which is fine for the defail array i am creating.

So for stress test I set the body count to 40 x 40 x 40
64k bodies.
It make 156k entries. So the 128k buffer is not big enought.

Hacking it and running the physic takes 12 ms, but the gpu is still older 1ms update.
I am really impressed.

I am now have to add some of the edge case handling.
Basically the kerners has to generate a status report, that the can use to check changes.
For example in this case, the buffer size is 128k entries.
But the general items is 168k entries.
So it has to write back the new value and clamp the passes items to the max alow.
The in the next frame the buffers has to be resized.

It is start to get complex, but to be honest is far, far simplex than I was anticipating.

by **Julio Jerez** » Sun Apr 10, 2022 1:57 pm

I now have the code that generate the grid hash every frame.
still a lot more work to come, to stress tested I try a 125k scene.

the scene run a 5 fps, but that no a concern since for these molton objects will have a option, like set as background Gravity, which will be use for stuff like
gravity will be fixed. so no callback
the material will also be fix.
and for GPU will have interop.

the physics takes about 22 ms, but 20 ms is just setting the transforms.
the GPU take 2 ms, so the GPU is not a bottom less pit of performance it start to show sign of the load.
but these is a very heavy load, I am very impressed.

the one concern that keep growing is that for some reason as kernel are executed, the driver keeps insertion those silence spaces. in the capture below, you can see that take 2 ms, but there are three silences of about 150 us

: Untitled.png (59.34 KiB) Viewed 4155 times

of course this is a stress test, my expectation are far more humble, if we can get
8 to 10 bodies in a middle range gpu taking a very small fraction of the GPU, I would consider a success. We cannot take over the GPU, for physics. but in graphics there is a lot of spare idle GPU time. anyway, I do no know what to make of those silence gaps, but the keep showing up.

I am now to the generation of new colliding pairs.

the phase does two this.
-Generate all pairs.
-Prune duplicates.
-Merge with existing pair and leave only the new pairs.
-Delete dead pairs
-copy that array to the cpu so that the engine generated contact joints.

and that will complete the broad phase.

by **Julio Jerez** » Tue Apr 12, 2022 12:40 pm

um this start to get more concerning, I optimized the sort, so that is only does the necessary passes.
for example the grid size is 8 units, that means in can one digit can cover an area of
8 * 256 = 2048 units.
the map use an integer, so that 4 digits, so it is sour the 3 must significant bits, does a lot of work for nothing.
for most scenes in digit is enough tow, is an extremely huge size, so we only need to do tow sorting passes. I did the optimization that the sort only applies the necessary passes.

that scene the larger grid in one dimension is 26. So it needs 3 passes, one fore each axis.
this should be typical for most scene.

I expect this to reduce the sort time. and is does, it literary cut it half, but the GPU time does no changes, instead the silence gaps because twice as long. this start to me a serious concern because as I add kernels, the idle time is doublings. no sure why not one has even noticed that.

: Untitled.png (33.92 KiB) Viewed 4057 times

by **Bird** » Tue Apr 12, 2022 2:13 pm

I know nothing about GPU profiling but I tried profiling with Nsight for 12 secs. I guess you're using a different profiling tool?

by **Julio Jerez** » Tue Apr 12, 2022 3:29 pm

I am using

NVIDIA Visual Profiler
Version: 11.6

(c) Copyright 2011-2022 NVIDIA Corporation. All rights reserved.
Visit http://developer.nvidia.com/cuda

This product includes software developed by the
Eclipse Foundation http://eclipse.org/
Apache Software Foundation http://www.apache.org/

In vs 2019 I only get the debugger, but not the profiles.
you image look very different, are you launching form VS 2019?

by **Bird** » Tue Apr 12, 2022 3:46 pm

I am using NVIDIA Nsight Systems 2022.2.1
https://developer.nvidia.com/nsight-systems

It says in the Profiling user guide that Visual Profiler is being deprecated and if I'm going to try to learn one, I want to learn one that will be around in the future.

https://docs.nvidia.com/cuda/profiler-u ... g-overview

Note that Visual Profiler and nvprof will be deprecated in a future CUDA release. The NVIDIA Volta platform is the last architecture on which these tools are fully supported. It is recommended to use next-generation tools NVIDIA Nsight Systems for GPU and CPU sampling and tracing and NVIDIA Nsight Compute for GPU kernel profiling.

by **Bird** » Tue Apr 12, 2022 3:52 pm

I launched Newton demo from the standalone Nsight gui app instead of from within vs 2019

Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Who is online