Cuda Solver

by **Julio Jerez** » Tue Apr 12, 2022 4:22 pm

the image you show print the time that takes soring the array of 65k,
about 0.7 ms.
than comes at about 90 m key per seconds. very, very low compared to the metric people claim.

my keys ares 128 bit so that 4 time the bandwidth because they measure key 32 bit keys
so even been generous and multiplying my result by 4, that's under 400 mkeys per secund.
I seems claims anywhere from as 1 to 16 giga key per second, even on much order GPUs which I tend not to believe.

When I run the nvidia sorting code demos which come in a library, the result is about 1 gkey per secund.

Sorting 1048576 32-bit unsigned int keys and values
radixSortThrust, Throughput = 1073.0961 MElements/s, Time = 0.00098 s, Size = 1048576 elements
Test passed

but the problem with those demos is that there are so misleading, sorting an array of 32 bit key only is useless. you have to at least add one extra word to the key as an index to the item, and that automatically cut the throughput by half.

I wish I can make the sort faster, because it is the workhorse of the engine. but I found some problems. the first pass I made, was about 5 time faster, but I discovered it had a bug only in GPU, so I had to add a part that serializes buckets of 256 elements in one thread.

basically, if you have a count of n element, n less or equal than 256. and you have to write them in the same order that there are found in memory, and that's no easy in a muticore.
I am doing in a loop, but I am hoping to find a better way.

one of the problems about the NVidia demos, is that they are really hard to find how the code is done, for almost any not trivial demo.
anyway, I will try to tune this a little more, because it seem that it can degrade very quickly.

by **Julio Jerez** » Tue Apr 12, 2022 4:36 pm

ah It just occur to me that, I can copy and past my algorithm to the nvidia demo, so that I can time it and see how far form the standard it is, the standard been their version.

if it is with in 50% I will consider good enough to go on, if not them I see what else I can do.
I do not want to use theirs because it comes with lot of baggage.
My version provide lots of flexibility because in one loop I can do stuff to prepare for another kernel,
I can also re use pointers, and so on.

while in their version you place stuff in a strong typed vector, and it will just sort the vector. I did not even see how to pass the lambda operator.

this is thier code.

Code: Select all: thrust::device_vector<T> d_keys; thrust::device_vector<unsigned int> d_values; for (unsigned int i = 0; i < numIterations; i++) { // reset data before sort d_keys = h_keys; if (!keysOnly) d_values = h_values; checkCudaErrors(cudaEventRecord(start_event, 0)); if (keysOnly) thrust::sort(d_keys.begin(), d_keys.end()); else thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_values.begin());

and their vector does not allow for an operator either

Code: Select all: template<typename T, typename Alloc = thrust::device_allocator<T> > class device_vector : public detail::vector_base<T,Alloc> {

that's just too naive. I even go a step more and say mediocre.

another misleading point is that they Only pass one buffer, the key array.
so their either figured out how to allocate memory in GPU, or they sort in place but radix sort is not in place.
there are lots of hidden things in those demos.

by **Julio Jerez** » Tue Apr 12, 2022 6:12 pm

so on that kernel that you show in the image that took 200+ us
it is this function
__global__ void CudaSortItems(Predicate EvaluateKey, const ndGpuInfo& info, const cuAabbGridHash* src, cuAabbGridHash* dst, int* histogram, int size, int digit)

one of the this that made expensive was that the sequential part, we are reading inside the loop.
but the key could all be read in parallel and cache to a array in local memory. them the inner loop only iterates over an array on ints in local memory. with that change the kernel is now 20 us.

: Untitled.png (36.26 KiB) Viewed 3271 times

it runs tree times and is the dominates the sort, we can now calculate the theoretical throughput of our. and that would be

thsort toral time is (0.000021 + 0.000012) * 3 = 0.0001 ms
64000 / 0.0001 ~= 640 millions keys per secund.

that's about two time faster than the nvidia version, if we consider that we are using 128 bit keys,
and they use 32 bit. now that's better.

if you can please take another capture and see if the function show similar results.
that operation has to be extremally efficient because it is how we will do logic.
I still say it should be about 5 to 10 time faster,
but using nvdia as the standard we are now in the ballpark, so we can keep goint.

by **Bird** » Tue Apr 12, 2022 6:48 pm

I had to take the capture in Release mode because the profile app kept crashing in Debug mode. I think maybe it might have hit an assert in your code.

by **Julio Jerez** » Tue Apr 12, 2022 7:51 pm

oh you were taking captured in debug. that explains the 200 us.
debug cuda is like visual studio.
anyway is 20 us, overall, so that seems about right for now.

the assert, yes there is a bug on the transform. I see it when eth boxes jerk.
but that code will be re written when I get there. I just leave a way to stress the code.
for now I will keep going form the top.

I find that as I made progress it gets harder and harder to debug code.
I will start implement stuff first in c++ them in cuda.

I do not know if this still exist, but I remember 15 years ago, cuda had an emulation mode, that made possible to emulate kernels, I looked around to see if they still have it, but as everything on nvidea forums and docs, it is always a miss bag. They say there is this option --device-emulation
but when I try, I get this error

nvcc fatal : Unknown option '--device-emulation'

I guess I will start writing the code twice, one in c and one in cpp.
debugging in the debugger is just too cumbersome.

for that I will have to go over all the support functions and add the prefix keyward __host__ so that I can call them from the cpp versions. something like

Code: Select all: inline __device__ __host__ cuVector(float val) { x = val; y = val; z = val; w = val; }

by **Bird** » Thu Apr 14, 2022 3:16 pm

I'm trying to hook up my project to the latest Newton version on github. I'm applying an rotation like in the sandbox demo but the body does not rotate when the engine is running. I have _D_NEWTON_CUDA set and the solver mode set to dWorld::ndCudaSolver. Is there anything else I need to do?

Code: Select all: ndBodyDynamic* const body = new ndBodyDynamic(); body->SetCollisionShape (shapeInst); body->SetMassMatrix (node->description().mass, shapeInst); node->setUserData (body); body->SetMatrix (startPose); // Newton takes ownership of the NewtonCallback object body->SetNotifyCallback (new NewtonCallbacks (node)); const ndVector omega (dGaussianRandom (2.0f), dGaussianRandom (4.0f), dGaussianRandom (3.0f), 0.0f); body->SetOmega (omega);

by **Julio Jerez** » Thu Apr 14, 2022 3:40 pm

Are you trying to get the cuda part goint.?

by **Julio Jerez** » Thu Apr 14, 2022 3:56 pm

On another note.

Last weekend I spend time adding more of the broadphase
And I ran into a problem that I knew was going g to be problematic with the way I was doing but I tried anyway.
Them a start getting random crashes all over the place.

The problem is that when generation colling pairs, the number of pair is not known.
So I decided to just generates the number of pairs the current buffer can hold and set a flag to indicate that the buffer has to be resized.

I did not count with the latency, and not matter how hard I try it alway end up overusing memory.
The only way to get around that is to read the data back if an overflow occur. But with async update is equivalent to just sync the update.

Such a simple problem is quite hard to solve.
So I am now using a different approach
That is each buffer will have a proxi in gpu.
Them as the kernel are executed. There is a structure that contain the status of the update,

If a kerne detect a buffer overflow, I will just set the flag to I dicate the rest of all kerners that they should just abort.

This will continue happening until the scene status is check and the offending buff is resized and the valid flag is reset.

At first I thought that will be hard to maintain, but afte a while is quite simple, since it almost move all the logic from the gup to the cpu. The Kerner so far are much simpler.

by **Bird** » Thu Apr 14, 2022 6:50 pm

Julio Jerez wrote:Are you trying to get the cuda part goint.?

Yes, I have just a rotating body like in the Sandbox cuda demo. It rotates correctly when using the ndSimdAvx2Solver solver mode but does not rotate if I switch to ndCudaSolver solver mode;

by **Julio Jerez** » Thu Apr 14, 2022 6:52 pm

hold on, until I check in the changes I am making.
them we can figure out what is wrong.

First I have to make stable. It isn't at the moment.

by **Julio Jerez** » Thu Apr 14, 2022 8:05 pm

ok. let us try now.
first try getting latest and see if it run stable.

by **Bird** » Thu Apr 14, 2022 8:16 pm

Seems stable here. Physics time is about 4.6 ms on my machine

by **Julio Jerez** » Thu Apr 14, 2022 8:27 pm

Ok.
Try get it in your app now.

by **Bird** » Thu Apr 14, 2022 9:22 pm

Still having the same problem. If I run with ndSimdAvx2Solver, it works as expected. But if I switch to ndCudaSolver then my body does not rotate like it should

by **Julio Jerez** » Thu Apr 14, 2022 9:35 pm

set a break point in function
void ndWorldSceneCuda::UpdateTransform()

see if it get to line UpdateTransformNotify(threadIndex, body);

that's the call to the transform update.

Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Who is online