Cuda Solver

by **Julio Jerez** » Mon Mar 28, 2022 8:25 am

Is not fixed yet, but I know what it is.
It happen when changing solver.
One of the new chage is that now we have a scen and solver as plug into to the world.
Before we only destroy the solver and crate a new one.

The scen is more complex because it is a complex web of entagled nodes that are used to manipulate all the objects.

So a new scene has to be recreated from the old one.
It uses a copy constructor for that, but I made a mistake some where. Is not a big deal, I will fix it later.

Anyway yest the gpu scen is now around 2.1 to 2.2 ms
In my machine.
The good part is that of that the gpu is around 0.1 ms.
Running the three 7 kernels. And about .2 ms getting the data from the gpu. The is about 0.5 ms that seems to be a fix cost of the cuda drive for calling synchronization.
But it seem the is no way around that, not calling sync every update cause the drive to accumulate Kerner and memcopy until is has enough to saturate the gpu, the it issue all at once.
It seem the only way to force the driver to launch the kernels when using async streams is but calling syncronize.

That's is ok, I am satisfies with that, is running asyn and concurrent. And with the code that is still to come the synk cost will be nothing.

Another big surprised I also found is that in Cuda all memcopy are done by the cpu or at least by using the gpu memmove. Even asyn copy are like that. This is about one third of the theoretical bus speed.

But is you use what the call hist pinned memory.
That just a group of memory page that are looked but the of so that the do not change visual address. The copy used dma and reach that top speed.

So the mem copy when from 2.5 gbyte per second to 6.5 gb per secund. That's the big difference that you see now.

So now it will run faster or slower in different system depending on the kind of hardware.
Pci 2 or 3 run a different speed but only when using those dma hardware.

Anyway 1.5 ms fir 27k scene us not bad.
But we now have to convert all the scen and solver kerners.

There also a cool funtinality that I did not expected with streams.
I seem streams can be use for esterogeneus kerlnel launch.

Fir example, say we have a routine that calculate sphere sphere collision, one fir sphere box, and one fir box, box.

That would be three kernel launches, of a very comex kerne full of switches cases. But with streams all three routing can be run concurrently if the are map to a separate stream.

Anyway there is a lot to learn. But cuda offer lot of thing that are not possible with other languages like opencl.
I can see why cuda beat it consistently.

by **Julio Jerez** » Mon Mar 28, 2022 11:28 am

The stream thing is actually very cool.

It solves the next challenge I have which is the broadphase.

The broadphae in newton is actually very good, but it is too branchy and not friendly to gpu.
Fir example that scene of 27k boxes, in cpu takes from 18 to 20 ms, and about 12 are just the broadphase.

That's far better than a multi grid sweep and prune which is about 60 ms.
a GPU multitgrid sweep and prune takes about few hundred microsencd since it almost constant time with the core count. So is dominated by memory bandwidth.

The problem is that after run GPU sweep and prune, the next step is see if new overlapping pairs are bew contacts, or some old contact are still overlapping.
And that part we need it cpu for the high level interface.

But here is where stream before handy
What we can do is we run the bradphase in gpu and get the new pairs and the new daed pairs
Them we copy those buffer to the cpu in a dedicate stream,
That stream will ron concurrent with the rest of gpu and cpu.

The we make a background tread that take that data as a task, an each time there is a new back of contacts, it add and remove them to the bodies.
This way the high leve engine work as is it now.
Materials, callback an so on.

This I have not tested in gpu, but I have a good reason to think will work because in a typical scene very few new contacts are generate per frame,

So what this does is that when bew contact are generate we load them to the shadow representative in gpu memory and merge the arrays.

I have high hopes for this. And I was thinking how was I going g to go about.

Si now the next step is to add the multi grid sweep and prune, and see how that goes.

by **Julio Jerez** » Mon Mar 28, 2022 9:03 pm

ok found the crash bug, it was in the free list memory manager.

after making the chunk the beams where never cleared, because I always considered that if the free list was flushed, it was because the world was destroyed, or the scene was going to be generated from scratch. Now that we are making a scene using the current as the parameter, them the current is destroyed.
That operation flushes the freezes the list. but it never reset the beam counters, so the next allocation was searching for memory entry that where in the list but that where used by other entities.
so it was trashing memory.
That bug was always there so this is a very good fix.

by **Julio Jerez** » Tue Mar 29, 2022 11:11 am

A thing that I realized about the gou stuff is that.
In the engine there is a lot stuff that is designed to work in a sequential way.

One example the collision shapes and the collision instance.
The instan operate over a shape, but delegate too much functionality to the shape.
For example take calculation the aabb.

The aabb is part of the shape, so it make that call, the in return is applies any transformation.

This has works for many years, but is not good for gpus,

An equivalent method is to cache the object of the shape with the Instance and the operation will be the same. Another thing is the scale. We apply scale based on the type, uniform, no no scale and no uniform. For gpu that is a divergence branch.

this function:

Code: Select all: void ndShapeInstance::CalculateAabb(const ndMatrix& matrix, ndVector& p0, ndVector& p1) const { m_shape->CalculateAabb(matrix, p0, p1); switch (m_scaleType) { case m_unit: { p0 -= m_padding; p1 += m_padding; break; } case m_uniform: case m_nonUniform: { ndMatrix matrix1(matrix); matrix1[0] = matrix1[0].Scale(m_scale.m_x); matrix1[1] = matrix1[1].Scale(m_scale.m_y); matrix1[2] = matrix1[2].Scale(m_scale.m_z); matrix1 = matrix.Inverse() * matrix1; ndVector size0(ndVector::m_half * (p1 - p0)); ndVector origin(matrix1.TransformVector( ndVector::m_half * (p0 + p1))); ndVector size(matrix1.m_front.Abs().Scale(size0.m_x) + matrix1.m_up.Abs().Scale(size0.m_y) + matrix1.m_right.Abs().Scale(size0.m_z)); p0 = (origin - size - m_padding) & ndVector::m_triplexMask; p1 = (origin + size + m_padding) & ndVector::m_triplexMask; break; } case m_global: default: { //ndMatrix matrix1__(matrix); //matrix1__[0] = matrix1__[0].Scale(m_scale.m_x); //matrix1__[1] = matrix1__[1].Scale(m_scale.m_y); //matrix1__[2] = matrix1__[2].Scale(m_scale.m_z); //ndVector p0_; //ndVector p1_; //m_shape->CalculateAabb(m_aligmentMatrix * matrix1__, p0_, p1_); //p0_ -= m_padding; //p1_ += m_padding; // but some shape aabb can't take a non orthonormal scaled matrix // need to do a more conservative aabb, will be a little larger, ndMatrix matrix1(matrix); matrix1[0] = matrix1[0].Scale(m_scale.m_x); matrix1[1] = matrix1[1].Scale(m_scale.m_y); matrix1[2] = matrix1[2].Scale(m_scale.m_z); matrix1 = matrix.Inverse() * m_aligmentMatrix * matrix1; ndVector size0(ndVector::m_half * (p1 - p0)); ndVector origin(matrix1.TransformVector(ndVector::m_half * (p0 + p1))); ndVector size(matrix1.m_front.Abs().Scale(size0.m_x) + matrix1.m_up.Abs().Scale(size0.m_y) + matrix1.m_right.Abs().Scale(size0.m_z)); p0 = (origin - size - m_padding) & ndVector::m_triplexMask; p1 = (origin + size + m_padding) & ndVector::m_triplexMask; break; } } dAssert(p0.m_w == ndFloat32(0.0f)); dAssert(p1.m_w == ndFloat32(0.0f)); }

translate to this, which is far, far more expensive, but can be simply a lithe more if we cache the obb and do not have to assume that the shape will calculate it every call.
with today's capacity of floats even in CPUs, this is probably better.

Code: Select all: void ndShapeInstance::CalculateAabb(const ndMatrix& matrix, ndVector& p0, ndVector& p1) const { m_shape->CalculateAabb(matrix, p0, p1); ndMatrix matrix1(matrix); matrix1[0] = matrix1[0].Scale(m_scale.m_x); matrix1[1] = matrix1[1].Scale(m_scale.m_y); matrix1[2] = matrix1[2].Scale(m_scale.m_z); matrix1 = matrix.Inverse() * m_aligmentMatrix * matrix1; ndVector size0(ndVector::m_half * (p1 - p0)); ndVector origin(matrix1.TransformVector(ndVector::m_half * (p0 + p1))); ndVector size(matrix1.m_front.Abs().Scale(size0.m_x) + matrix1.m_up.Abs().Scale(size0.m_y) + matrix1.m_right.Abs().Scale(size0.m_z)); p0 = (origin - size - m_padding) & ndVector::m_triplexMask; p1 = (origin + size + m_padding) & ndVector::m_triplexMask; dAssert(p0.m_w == ndFloat32(0.0f)); dAssert(p1.m_w == ndFloat32(0.0f)); }

I will try to jut use non uniform scale for all shapes. And take it as a cost we mos assume.
We already done that in few places like the gyro. And it in theong run it is simpler.

by **Julio Jerez** » Tue Mar 29, 2022 12:08 pm

so it simply reduced to this. considerably more expensive, but is simpler and handle every case.
and does not depend on the shape type.

Code: Select all: void ndShapeInstance::CalculateAabb(const ndMatrix& matrix, ndVector& p0, ndVector& p1) const { ndMatrix matrix1; matrix1[0] = matrix[0].Scale(m_scale.m_x); matrix1[1] = matrix[1].Scale(m_scale.m_y); matrix1[2] = matrix[2].Scale(m_scale.m_z); matrix1[3] = matrix[3]; matrix1 = m_aligmentMatrix * matrix1; const ndVector size0(m_shape->GetObbSize()); const ndVector origin(matrix1.TransformVector(m_shape->GetObbOrigin())); const ndVector size(matrix1.m_front.Abs().Scale(size0.m_x) + matrix1.m_up.Abs().Scale(size0.m_y) + matrix1.m_right.Abs().Scale(size0.m_z)); p0 = (origin - size - m_padding) & ndVector::m_triplexMask; p1 = (origin + size + m_padding) & ndVector::m_triplexMask; dAssert(p0.m_w == ndFloat32(0.0f)); dAssert(p1.m_w == ndFloat32(0.0f)); }

this way the shadow proxy body in GPU can have its AABB updated each time the transform changes.
by simply make the obb size and aabb part of the structure.
remember we are going for a complete version of GPU, so each change will have to be retro converted.

by **JoeJ** » Tue Mar 29, 2022 6:11 pm

Does this mean non uniform scaling becomes completely free (on CPU)?

Oh, i just have another idea: A capsule shape, but both spherical caps can have a different radius?
Would allow for more accurate ragdoll limbs. Even better with nonuniform scale as well.
But not sure about the cost about the cone instead cylinder in the middle. Although, you already have cones. So probably easy to add?

by **Julio Jerez** » Tue Mar 29, 2022 8:16 pm

It means all scale are non uniform.
So a no scale is passing an indemnity matrix.
Uniform space is a matrix with diagonal set to the scale factor.
Non uniform is a matrix with diagonal set to the different scale factor.
Sheer scale is a symmetric pad matrix. But this only apply to compound child with non uniform scale.
I decided for 4.00 not to support skew matrices because ther is not really good use for that, is only a gimic.
And not is not free in cpu,
Before no scale was free fir all shapes, and non uniform was the most expensive in cpu.
Now it is a compromise the cost is in between, not free but not as e pensive as not uniform.

This cost is so small that it is better to use the general scale for all solvers.

I have not tested the support vertex yet, but I an guessed it will be a similar compromise.

JoeJ wrote:Oh, i just have another idea: A capsule shape, but both spherical caps can have a different radius?

That had been the case for capsules and cylinders since newton 2.xx
The have three parameters, two radius and the height.

There are more cool stuff that can be done with general scales, like perspective,
If you assume homogenous matrices, and the resul is the divide. That will make possible to make cone our of capsule and cillinderes by passing a one radio to zero.
But them that a funtinalty that so far not one had ever used.
So I decided is better to make a cone shape instead.

by **JoeJ** » Wed Mar 30, 2022 3:55 am

Ha, never noticed capsules already support this. Nice

If you assume homogenous matrices, and the resul is the divide. That will make possible to make cone our of capsule and cillinderes by passing a one radio to zero.
But them that a funtinalty that so far not one had ever used.
So I decided is better to make a cone shape instead.

I never understood how projection matrices can do a perspective divide. I assume the division is an extra step after matrix multiplication?
I would be interested in learning this. I started using skew matrices in my editor. As i compose my terrain from many boxes, skew is very useful to express some kind of flow.
'Perspective projection' would be useful to express something like peaks or ridges of mountains.
But i do not request such features from Newton. Agree non uniform scale is enough.

by **JoeJ** » Thu Mar 31, 2022 6:01 am

Look at that - there is now a non NV GPU which can run your Cuda stuff: https://videocardz.com/newz/chinas-moore-threads-mtt-s60-gpu-supports-directx-and-can-run-league-of-legends
:mrgreen:

I wonder if we ever get Chinas HW here. I would not wonder if there are some issues with patents

by **Julio Jerez** » Thu Mar 31, 2022 9:38 am

The more the merrier.
But seing all the nvidia references, my guess is that nvidia
Legal stuff is already working to sue them out of existence.
Nvidia does not take kindly to competitors.
They once sue ID for using a version of alpha blend with saturation they claim they invented and that was only a game.

The claim to invention of things that existed since man discover arithmetic from nvidea and all the tech companies is staggering. Pattenting is the cottage industry of the tech companies.

The difference is that Nvidia is the only one who smell his own farts, they think they really invented those things out of nothing because they manage to put in a paper.

by **Julio Jerez** » Tue Apr 05, 2022 1:03 pm

has anyone been able to debug a cuda kernel and see local variable in a watch window?
the dock seems to be quite dubious.

I read every thing about it, and in one side it seem to say that this use GDB debugger and visual studio code. in another set hint that is possible to use visual studio but an the doc and tutorial goad back to vs 2008, yes most tutorial hat show people check variables are all run in Linux.

I am adding the code for the broad phase, but at the point there are already many kernels that I do no really know if they are bug free. and it will be nice if It can be debug.
as is stand now the only way to debug this is by almost doing the thing twice, one in GPU and the get the data and doing in cpu to see if the results match, that is not only time consuming by double the possibility of generation even more bugs.

I have this kernel, that is close to the CPU one.

Code: Select all: auto UpdateAABB = [] __device__(cuBodyProxy& body) { __shared__ cuVector aabb[D_THREADS_PER_BLOCK][2]; // calculate shape global Matrix body.m_globalSphapeRotation = body.m_localRotation * body.m_rotation; cuMatrix3x3 matrix(body.m_globalSphapeRotation.GetMatrix3x3()); body.m_globalSphapePosition = matrix.RotateVector(body.m_localPosition) + body.m_posit; // calculate world aabb //ndMatrix scaleMatrix; //scaleMatrix[0] = matrix[0].Scale(m_scale.m_x); //scaleMatrix[1] = matrix[1].Scale(m_scale.m_y); //scaleMatrix[2] = matrix[2].Scale(m_scale.m_z); //scaleMatrix[3] = matrix[3]; //scaleMatrix = m_alignmentMatrix * scaleMatrix; matrix.m_front = matrix.m_front.Scale(body.m_scale.x); matrix.m_up = matrix.m_up.Scale(body.m_scale.y); matrix.m_right = matrix.m_right.Scale(body.m_scale.z); matrix = body.m_alignRotation.GetMatrix3x3() * matrix; //const ndVector size0(m_shape->GetObbSize()); //const ndVector size(scaleMatrix.m_front.Abs().Scale(size0.m_x) + scaleMatrix.m_up.Abs().Scale(size0.m_y) + scaleMatrix.m_right.Abs().Scale(size0.m_z)); //const ndVector origin(scaleMatrix.TransformVector(m_shape->GetObbOrigin())); const cuVector origin(matrix.RotateVector(body.m_obbOrigin) + body.m_globalSphapePosition); const cuVector size(matrix.m_front.Abs().Scale(body.m_obbSize.x) + matrix.m_up.Abs().Scale(body.m_obbSize.y) + matrix.m_right.Abs().Scale(body.m_obbSize.z)); //p0 = (origin - size - m_padding) & ndVector::m_triplexMask; //p1 = (origin + size + m_padding) & ndVector::m_triplexMask; const cuVector padding(1.0f / 16.0f); const cuVector minBox(origin - size - padding); const cuVector maxBox(origin + size + padding); body.m_minAabb = minBox; body.m_maxAabb = maxBox; // calculate bondin box for this tyhread block aabb[blockIdx.x][0] = minBox; aabb[blockIdx.x][1] = maxBox; __syncthreads(); for (int i = D_THREADS_PER_BLOCK / 2; i; i = i >> 1) { if (blockIdx.x < i) { aabb[blockIdx.x][0] = aabb[blockIdx.x][0].Min(aabb[blockIdx.x + i][0]); aabb[blockIdx.x][1] = aabb[blockIdx.x][1].Max(aabb[blockIdx.x + i][1]); } __syncthreads(); } if (blockIdx.x == 0) { body.m_xxx0 = aabb[0][0]; body.m_xxx1 = aabb[0][1]; } //int x0 = __float2int_rd(body.m_minAabb.x * D_CUDA_SCENE_INV_GRID_SIZE); //int y0 = __float2int_rd(body.m_minAabb.y * D_CUDA_SCENE_INV_GRID_SIZE); //int z0 = __float2int_rd(body.m_minAabb.z * D_CUDA_SCENE_INV_GRID_SIZE); //int x1 = __float2int_rd(body.m_maxAabb.x * D_CUDA_SCENE_INV_GRID_SIZE); //int y1 = __float2int_rd(body.m_maxAabb.y * D_CUDA_SCENE_INV_GRID_SIZE); //int z1 = __float2int_rd(body.m_maxAabb.z * D_CUDA_SCENE_INV_GRID_SIZE); };

it calculates the world transform, the aabb on in the scene. and then calculate the aabb of all the boxes on that thread block.
that's already lot of work, but it have to do even more. the thing is that there is no way to check the intermediate results.

on another note, I noticed that the nvidia compiler since to be quiet wastefull in Register allocation, that little shade is using about 60 registers, and cpu compile can do the same thing with 8 register and some spills but in 64 bit is can do it with 16 float register and 14 integer registers, no spill all the contrary has spared registers.

one thing is that I am using a vector class 4, I am not doing that vector 3 nonsense, and that make is use more register, but on the there had using vector 3 cause the complier to issue 32 but memory transactions to memory which is the worse offence.
Ther are lots of things that contradict in cuda, is more like a black art that it used to be for CPUs.

by **Bird** » Tue Apr 05, 2022 5:08 pm

Yes, I'm able to debug and see local variables. I'm using the latest version of Nsight vs2019 edition.
https://developer.nvidia.com/nsight-vis ... io-edition

Here's a screen grab and also a list of errors that pops up when I run the sandox box demo. The demo crashes intermittently

by **Julio Jerez** » Tue Apr 05, 2022 7:18 pm

Ah maybe it does not work well with vs 2017.

I will try 2019 see how tha goes.

I am not getting the crash, but I know something is wrong because I am getting very inconsisten behavior.
I think a shader is writing pass the memory bounds.

But I see if with vs 2019 I can see what is wrong.

by **Bird** » Tue Apr 05, 2022 7:30 pm

I hit this assert after running a few seconds in debug mode

by **Julio Jerez** » Tue Apr 05, 2022 7:46 pm

ah, I reinstalled Nsight, and it works with VS 2019.
I can set break point and I can see variable in the watch.
I see my first mistake.

I too had seen that bug, somehow a rotation quat is losing magnitud even when they are normalized.
I added a debug test to check for magnitud, but since I could not debug it I can no check what is wrong.

the assert is consistent after it happens, so I can now check why is happening.
I am actually surprised it got this far without debug and not blowing really bad.

I work on a graphics library by nvidia call nvn or something like that and is was notoriously hard giving ton of blew screens, but this cuda seem very tolerant to mistakes.

anyway now that I can step in debug I need to check oy what is wrong.
thanks for the tip.

Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Who is online