Cuda Solver

by **Bird** » Sat Apr 16, 2022 2:38 pm

Yes, swap is being called. I added this

Code:

    void ndCudaContext::SwapBuffers()
    {
        printf("%s \n", "swapping");
        dSwap(m_sceneInfoCpu0, m_sceneInfoCpu1);
        m_transformBufferCpu0.Swap(m_transformBufferCpu1);
    }

And I get this

SetTransform buffer 0 : w(-0.417342 0.341314 -0.404124) r(0.000000 0.000000 0.000000 1.000000)
SetTransform body 0 : w(1.709886 0.698956 0.781648) r(0.000000 0.000000 0.000000 1.000000)
swapping
swapping
SetTransform buffer 0 : w(-0.417342 0.341314 -0.404124) r(0.000000 0.000000 0.000000 1.000000)
SetTransform body 0 : w(1.709886 0.698956 0.781648) r(0.000000 0.000000 0.000000 1.000000)
swapping
swapping
SetTransform buffer 0 : w(-0.417342 0.341314 -0.404124) r(0.165896 0.000000 0.000000 0.986143)
SetTransform body 0 : w(1.709886 0.698956 0.781648) r(0.165896 0.000000 0.000000 0.986143)
swapping
swapping

by **Julio Jerez** » Sat Apr 16, 2022 3:04 pm

SetTransform buffer 0 : w(-0.417342 0.341314 -0.404124) r(0.000000 0.000000 0.000000 1.000000)
SetTransform body 0 : w(1.709886 0.698956 0.781648) r(0.000000 0.000000 0.000000 1.000000)
swapping
swapping
SetTransform buffer 0 : w(-0.417342 0.341314 -0.404124) r(0.000000 0.000000 0.000000 1.000000)
SetTransform body 0 : w(1.709886 0.698956 0.781648) r(0.000000 0.000000 0.000000 1.000000)

Ah, that's the problem: your code calls sync twice. That's like a renderer calling swap buffers twice;
it will display the same frame every time. It should be:

SetTransform buffer 0 : w(-0.417342 0.341314 -0.404124) r(0.000000 0.000000 0.000000 1.000000)
SetTransform body 0 : w(1.709886 0.698956 0.781648) r(0.000000 0.000000 0.000000 1.000000)
swapping
SetTransform buffer 0 : w(-0.417342 0.341314 -0.404124) r(0.000000 0.000000 0.000000 1.000000)
SetTransform body 0 : w(1.709886 0.698956 0.781648) r(0.000000 0.000000 0.000000 1.000000)
swapping
...

But do not do anything yet. Let me fix that double-buffer-in-CPU thing. The app is entitled to call sync as many times as it wants, and that should not change the buffers; that's my mistake.
But after I get that fixed, you should check that it does not call sync twice in a row.
Even on the CPU, it serializes the frame.

My guess is that you are calling Sync at the end of the frame, and the frame update does as well, so the second call waits for the frame to complete.
I guess for an app like yours that is fine, because you do not care about concurrency.
That shows there is a big bug on the Newton side, and it demonstrates that the buffer should go to the GPU.

If I am correct, this bug should be reproducible in the sandbox if we set the sync update in the menu.
Anyway, I will fix that this afternoon. Do not do anything until I get it right.
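
The double-swap effect described above can be reproduced with a small model. This is only a sketch: `DoubleBuffer`, `Produce`, `SwapAlways`, `SwapIfPending` and `RunFrames` are hypothetical names made up for illustration, not Newton's classes, and a frame is reduced to a single integer. It shows why an even number of swaps per frame leaves the stale buffer in front, and one possible guard (an assumption, not necessarily the fix that was committed) that makes extra sync calls harmless.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a CPU-side double buffer; not Newton's actual class.
struct DoubleBuffer
{
    int m_front = 0;        // frame number visible to the application
    int m_back = 0;         // frame number last written by the solver
    bool m_pending = false; // true when the solver produced a frame not yet shown

    // The solver finished writing a new frame into the back buffer.
    void Produce(int frame)
    {
        m_back = frame;
        m_pending = true;
    }

    // Naive version: every call swaps, so a second call in the same
    // frame puts the stale buffer back in front.
    void SwapAlways()
    {
        std::swap(m_front, m_back);
    }

    // Guarded version: only the first call after Produce swaps,
    // so additional calls are no-ops.
    void SwapIfPending()
    {
        if (m_pending)
        {
            std::swap(m_front, m_back);
            m_pending = false;
        }
    }
};

// Runs 'frames' frames with 'swapsPerFrame' swap calls per frame and
// returns the frame number the application sees at the end.
int RunFrames(int frames, int swapsPerFrame, bool guarded)
{
    DoubleBuffer buffer;
    for (int frame = 1; frame <= frames; ++frame)
    {
        buffer.Produce(frame);
        for (int i = 0; i < swapsPerFrame; ++i)
        {
            if (guarded)
            {
                buffer.SwapIfPending();
            }
            else
            {
                buffer.SwapAlways();
            }
        }
    }
    return buffer.m_front;
}
```

In this model, one swap per frame shows the newest frame, two naive swaps per frame keep showing the initial frame, and the guarded version shows the newest frame regardless of how many times it is called.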

by **Julio Jerez** » Sat Apr 16, 2022 3:18 pm

Yes, I just verified it by setting play sync, and it does display, but very jerkily. In any case, it is a bug that needs fixing.
When running correctly, it looks like this:

SetTransform buffer 0 : w(0.000000 0.500000 -3.000000) r(-0.108195 0.000000 0.000000 -0.994130)
SetTransform body 0 : w(0.654416 -0.409420 -1.027692) r(-0.108195 0.000000 0.000000 -0.994130)
SwapBuffers
SetTransform buffer 0 : w(0.000000 0.500000 -3.000000) r(-0.190568 0.000000 0.000000 -0.981674)
SetTransform body 0 : w(0.654416 -0.409420 -1.027692) r(-0.190568 0.000000 0.000000 -0.981674)
SwapBuffers
SetTransform buffer 0 : w(0.000000 0.500000 -3.000000) r(-0.271618 0.000000 0.000000 -0.962405)
SetTransform body 0 : w(0.654416 -0.409420 -1.027692) r(-0.271618 0.000000 0.000000 -0.962405)
SwapBuffers
SetTransform buffer 0 : w(0.000000 0.500000 -3.000000) r(-0.350783 0.000000 0.000000 -0.936457)
SetTransform body 0 : w(0.654416 -0.409420 -1.027692) r(-0.350783 0.000000 0.000000 -0.936457)
SwapBuffers
SetTransform buffer 0 : w(0.000000 0.500000 -3.000000) r(-0.427514 0.000000 0.000000 -0.904009)
SetTransform body 0 : w(0.654416 -0.409420 -1.027692) r(-0.427514 0.000000 0.000000 -0.904009)
SwapBuffers
SetTransform buffer 0 : w(0.000000 0.500000 -3.000000) r(-0.501277 0.000000 0.000000 -0.865287)
SetTransform body 0 : w(0.654416 -0.409420 -1.027692) r(-0.501277 0.000000 0.000000 -0.865287)

So I will make it so that it is like that for all modes. I really, really hate GPU programming.

by **Bird** » Sat Apr 16, 2022 5:36 pm

I'm not calling Sync() AFAIK.

I'm just using the code from the sandbox demo for advancing the sim

Code:

    double NewtonWorld::advanceTime (ndFloat32 timestep)
    {
        m_timeAccumulator += timestep;

        // if the time step is more than max timestep par frame, throw away the extra steps.
        if (m_timeAccumulator > descreteStep * maxSteps)
        {
            ndFloat32 steps = ndFloor (m_timeAccumulator / descreteStep) - maxSteps;
            dAssert (steps >= 0.0f);
            m_timeAccumulator -= descreteStep * steps;
        }

        while (m_timeAccumulator > descreteStep)
        {
            Update (descreteStep);
            m_timeAccumulator -= descreteStep;
            deletePendingObjects();
        }

        return static_cast<double> (GetScene()->GetWorld()->GetUpdateTime());
    }

by **Julio Jerez** » Sat Apr 16, 2022 5:42 pm

Do not worry about it.
Let us try again.
Please sync and run it.

by **Bird** » Sat Apr 16, 2022 6:01 pm

Looking good. Thanks for all the help.

https://youtu.be/bLqgrJyz9D4

by **Julio Jerez** » Sat Apr 16, 2022 6:22 pm

Well, there is still a lot to do.
But it is a start.

Can you place many objects?

by **Julio Jerez** » Sat Apr 16, 2022 11:18 pm

Ah, very cool to see some objects spinning there.

I have now committed the part that calculates the AABBs, builds the grid cell map, and sorts the cells.
That was causing me some problems, but I think I have it right now.
When you get some time, try to sync and try it out so that we can keep the bugs in check.

It is far, far harder to debug on the GPU, because it does not crash right away; instead it trashes memory silently.

by **Bird** » Sun Apr 17, 2022 7:26 am

Latest version seems to be working fine.

Here's 40,000 turtles.

https://youtu.be/IbjANVkHa9U

by **Julio Jerez** » Sun Apr 17, 2022 7:51 am

Oh snap!!! :mrgreen:

Now we keep moving forward.
I was worried the later changes would make it slower, and they do, but I think it is within acceptable bounds.

One problem is that there are algorithms for which the size is not known ahead of time.
One is the grid count. It depends on the number of cells a body intersects.

But normal CUDA asks for the number of blocks in advance.
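
To make the sizing problem concrete, here is a rough CPU-side sketch, assuming a uniform grid. The `Aabb` struct and the `CountCells` and `ScanCounts` helpers are hypothetical and are not taken from the Newton source. Each body contributes one entry per cell its box overlaps, and the total number of entries only falls out of a prefix sum over all bodies, which is why a later pass cannot be sized before this one has run.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical axis-aligned box; Newton's real types differ.
struct Aabb
{
    float m_min[3];
    float m_max[3];
};

// Number of uniform grid cells of edge 'cellSize' overlapped by the box.
int CountCells(const Aabb& box, float cellSize)
{
    int count = 1;
    for (int axis = 0; axis < 3; ++axis)
    {
        int first = int(std::floor(box.m_min[axis] / cellSize));
        int last = int(std::floor(box.m_max[axis] / cellSize));
        count *= (last - first + 1);
    }
    return count;
}

// Exclusive prefix sum of the per-body counts: offsets[i] is where body i
// starts writing its cell entries. The return value is the total number of
// entries, which is only known once every count has been computed.
int ScanCounts(const std::vector<int>& counts, std::vector<int>& offsets)
{
    offsets.resize(counts.size());
    int total = 0;
    for (std::size_t i = 0; i < counts.size(); ++i)
    {
        offsets[i] = total;
        total += counts[i];
    }
    return total;
}
```

A small box inside one cell yields one entry, while a box spanning 2 x 1 x 3 cells yields six, so the output size depends on where the bodies happen to be in that frame.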

They have a feature called dynamic parallelism, which is simply the ability of a kernel to call other kernels.

A typical problem that can be solved with that feature is a recursive algorithm like quicksort.

The problem is that NVIDIA set it up so that you have to use their linker to preprocess the object code. It seems the linker needs to know the addresses of the child kernels.

Here is the problem. The only way to set that up is by making the project a DLL or an executable.

But Newton is set up so that the plugin is a static library.
And there is no way to invoke the CUDA linker on a non-CUDA project.

So my solution was to just use normal kernels and use an upper bound size. So the kernels now have more checks than necessary.
But I think I can live with that rather than make the SDK a bunch of DLLs.
In fact, most people using the engine use it as a static lib.
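
The upper-bound workaround can be sketched on the CPU as follows. This is an illustration under assumptions, not the plugin's code: `ProcessItem` stands in for a kernel body, `LaunchWithUpperBound` for a launch sized by the worst case, and the doubling is a placeholder for real work. Every "thread" past the actual count exits early, which is the extra check mentioned above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a kernel body: one call per "thread".
// 'actualCount' is only known after a previous pass has run, so every
// thread at or past that count returns without touching memory.
void ProcessItem(int threadIndex, int actualCount,
                 const std::vector<int>& input, std::vector<int>& output)
{
    if (threadIndex >= actualCount)
    {
        return; // the extra bounds check paid for by over-sized launches
    }
    output[threadIndex] = input[threadIndex] * 2; // placeholder work
}

// Launch sized by the worst case instead of the real item count.
// Returns how many threads were launched only to exit immediately.
int LaunchWithUpperBound(int upperBound, int actualCount,
                         const std::vector<int>& input, std::vector<int>& output)
{
    int idleThreads = 0;
    for (int thread = 0; thread < upperBound; ++thread)
    {
        if (thread >= actualCount)
        {
            ++idleThreads;
        }
        ProcessItem(thread, actualCount, input, output);
    }
    return idleThreads;
}
```

The trade-off is wasted threads in exchange for a launch size that is known in advance, so no kernel ever needs to start another kernel.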

Anyway, I will continue with this method, and then after we get a working solver we can investigate the dynamic linking, which I think is a cool feature.

by **Julio Jerez** » Sun Apr 17, 2022 2:20 pm

OK, I have now made the change that the doc talks about here:
https://developer.nvidia.com/blog/how-o ... s-cuda-cc/
I did not get the result they claim, but I am doing it anyway.
I do regret that the code becomes more complex, since we now have a double buffer on the CPU and a double buffer on the GPU.

Basically, the sequence now is as follows.
On the GPU there are two streams: the compute stream and the memory copy stream.

At the beginning of each frame there is a frame synchronize. (BTW, that was the missing part that was causing the weird behavior in the profile, where many frames were accumulated in a way that did not make sense. Now, with a device sync at the beginning of each frame, the device is more predictable.)
It does not fix the mysterious silent gaps, but it is better.
Anyway, the sequence on the GPU is now this:

frame begin:
-device sync
-memcpyAsync the odd frame from the GPU to the CPU host buffer, using the memcpy stream.
-execute all scene and solver kernels (scene, collision, solver, etc.) using the solver stream.
-copy the transforms on the GPU to the even frame, also using the solver stream.

in CPU:
copy the transforms from the CPU odd frame to the rigid bodies.

end frame:
swap GPU buffers,
increment the odd/even counter.
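
The sequence above can be modeled on the CPU to see what it implies for the application. This is a deliberately simplified sketch: `FramePipeline` and its members are hypothetical, a transform is reduced to one float, the two streams are run one after the other, and only a single host buffer is modeled even though the real code also double-buffers on the CPU. The copy reads the buffer finished on the previous frame while the solver writes the other one, so the bodies lag the simulation by one frame.

```cpp
#include <cassert>

// Hypothetical model of the odd/even scheme; not the plugin's real classes.
struct FramePipeline
{
    float m_gpuBuffer[2] = {0.0f, 0.0f}; // two "GPU" transform buffers
    float m_hostBuffer = 0.0f;           // "CPU" copy of the last finished frame
    float m_bodyTransform = 0.0f;        // what the rigid body ends up with
    float m_simState = 0.0f;             // state advanced by the solver kernels
    int m_frame = 0;                     // odd/even counter

    void Frame()
    {
        int readIndex = m_frame & 1;    // buffer completed on the previous frame
        int writeIndex = readIndex ^ 1; // buffer the solver fills on this frame

        // Frame begin: a device sync would go here.

        // Copy stream: previous result goes from the GPU to the host buffer.
        m_hostBuffer = m_gpuBuffer[readIndex];

        // Solver stream: run the kernels, then store the new transforms
        // in the other GPU buffer.
        m_simState += 1.0f;
        m_gpuBuffer[writeIndex] = m_simState;

        // CPU: copy the host buffer to the rigid bodies.
        m_bodyTransform = m_hostBuffer;

        // End frame: the buffers trade roles by advancing the counter.
        ++m_frame;
    }
};
```

Because the copy and the kernels touch different buffers within a frame, they are free to overlap on real hardware, at the cost of the bodies seeing the transforms one frame late.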

These are the results. Old system:

Attachment: Untitled.png (35.26 KiB)

As you can see, there is only one stream, and the memcpy and compute shaders are all executed sequentially.

Here is the newer method.

Attachment: suspension.png (41.84 KiB)

Now there are two streams, and according to the doc, this series of GPUs has two DMA channels that are independent of the compute units, so they can do simultaneously:
one memcpy from CPU to GPU, one memcpy from GPU to CPU, and compute kernels.
One memcpy can be GPU to GPU.

But the picture shows the memcpy in a separate stream, yet it is not concurrent.
So either I am missing something, or this GPU does not support it.

But I am leaving it in anyway, because this is how we can exploit the advanced features of the GPU.
I will debug it and commit it later; then people who are interested can check whether it works.

Now, before moving on, there is one last optimization I need to apply, and then we can move on to generating colliding pairs.

by **Julio Jerez** » Sun Apr 17, 2022 3:06 pm

Ah, it was my bug. I had a memcpy in the compute stream. It is tiny, but it caused the copy engine to block until that copy was done.
Now the frame looks like this: all memory copies are concurrent with computation, both on the CPU and on the GPU.

Attachment: Untitled.png (32.41 KiB)

Nice overlap, but as usual no good deed goes unpunished: the driver inserts that huge 200 us gap.
I still do not know why that is.

It is all committed now.

by **JoeJ** » Mon Apr 18, 2022 5:42 am

Julio Jerez wrote: the driver inserts that huge 200 us gap.
I still do not know why that is.

Could it be cache flushes?
The gaps seem big for that, but maybe it takes that long to update all VRAM.
The profiling tool I've used did not show me such things either.

Btw, all the things you mention seem the same as with compute shaders.
The only feature we would miss is the option to launch kernels from kernels, which so far you do not use.
So a port to compute would probably be pretty simple once you're done.

by **Julio Jerez** » Mon Apr 18, 2022 10:49 am

JoeJ wrote: The only feature we would miss is the option to launch kernels from kernels, which so far you do not use.
So a port to compute would probably be pretty simple once you're done.

Launching a kernel from a kernel is in fact a necessary feature.
There are kernels for which the number of items is not known in advance. Instead it is determined by the result of a previous kernel.
The problem is that NVIDIA set it up so that the only way it works is if you make the project a DLL or an EXE.

I spent several days investigating why I can't get it to work, until I finally gave up. As I said before, NVIDIA does not answer questions. It seems they want people to ask on GitHub, but that seems to be a mess out there.
Anyway, I will just use the vanilla functionality.

And yes, all of a sudden a port to DirectX 12 compute does not seem that complex once we have this plugin.
We can use it as the template to copy from.
OpenGL is out because, besides the fact that it is single-threaded, it has the really nasty feature that it places all its objects in thread-local memory, so it is impossible to do any synchronization from other threads. And Newton 4 is highly multithreaded.
There are some extensions that can make the context local, but the last time I tried, they were not supported by any driver vendor.
I will leave out OpenCL, since it is a dead end.
So that leaves us with DX12 and Vulkan. Maybe SYCL, if Intel gets their act together.

Isn't it amazing how spectacularly the Khronos Group's objective of making a universal standard has failed?
It achieved the opposite: a fragmentation of HPC. There are more standards than there are applications using them.

by **Bird** » Mon Apr 18, 2022 10:51 am

I was able to get 400,000 turtles spinning. Although, occasionally they would stop spinning and then start up again after a short amount of time.

Here's the profile from Nsight
