Cuda Solver


Re: Cuda Solver

Postby Julio Jerez » Tue Apr 05, 2022 7:53 pm

Just to verify whether it was me doing something wrong: I tested again with VS 2017, and no, the Nsight debugger does not work there.
So it seems the minimum version is VS 2019,

at least for debugging, and I guess that's fine.
Julio Jerez
Moderator
 
Posts: 12426
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Tue Apr 05, 2022 10:01 pm

Ah, it is really cool.
I found several bugs already.
There are more; it is not easy to debug, but it is better than going in blind.
If you try now, it should not crash, but it may still generate the rotation assert.
I reduced the scene to two blocks so that it is easier for me to debug.

At the end of the day, I still need some kind of CPU-side debugging, reading the data back and checking it out.
Julio Jerez
Moderator
 
Posts: 12426
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Bird » Wed Apr 06, 2022 8:05 am

You can also use printf for debugging:

Code: Select all
auto UpdateAABB = [] __device__ (cuBodyProxy& body, cuBoundingBox* bBox, int* scan)
{
   __shared__ cuBoundingBox aabb[D_THREADS_PER_BLOCK];

   // calculate shape global matrix
   body.m_globalSphapeRotation = body.m_localRotation * body.m_rotation;
   cuMatrix3x3 matrix(body.m_globalSphapeRotation.GetMatrix3x3());
   body.m_globalSphapePosition = matrix.RotateVector(body.m_localPosition) + body.m_posit;

   matrix.m_front = matrix.m_front.Scale(body.m_scale.x);

   uint2 testIndex = make_uint2(blockDim.x * blockIdx.x + threadIdx.x,
                                blockDim.y * blockIdx.y + threadIdx.y);

   if (testIndex.x == 216)
      printf("%f-%f-%f \n", matrix.m_front.x, matrix.m_front.y, matrix.m_front.z);
};
Bird
 
Posts: 636
Joined: Tue Nov 22, 2011 1:27 am

Re: Cuda Solver

Postby Julio Jerez » Wed Apr 06, 2022 10:43 am

Ah, that's good to know.
I will make a conditional trace based on printf.
That will probably be quicker than stepping in the debugger to track a particular thing.

The debugger seems good for catching glaring bugs, but it does not seem practical for tracking details.

I found two very big bugs.
The first was those memory errors generated by the context class resetting the device before deleting all the device memory.

The second was that more than one shader was reading and writing outside the memory bounds of a buffer.

All in all, CUDA is very cool.
The only part I found very annoying and tedious is that to invoke a kernel, the kernel has to be a global C function. It does not let you declare it inside a function or as a class member, not even a static one.

The second nuisance is that in order to pass a lambda, the kernel has to be a template; or at least I could not find a way to make it a non-template.

Making a template in this context is not really sensible, since it is unlikely that two instantiations will share the same code skeleton.

But those are not really too bad. The fact that you can pass lambdas is a huge step forward, so very good for them.

It fits nicely with how Newton 4 is architected, which is all based on a thread pool that passes lambdas as jobs.
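
For what it's worth, here is a minimal sketch of that pattern (my own example, not Newton code; ExecuteKernel and Launch are made-up names). It assumes nvcc is invoked with --extended-lambda, which is what lets a __device__ lambda cross the host/device boundary. The kernel has to be a free __global__ function, and it has to be a template because every lambda has its own distinct, unnamed type:

Code: Select all
// a free __global__ template kernel: the only way to accept an arbitrary
// lambda, since each lambda closure has a unique compiler-generated type
template <typename Predicate>
__global__ void ExecuteKernel(Predicate op, int count)
{
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   if (index < count)
   {
      op(index);
   }
}

void Launch(float* deviceData, int count)
{
   // the closure captures the device pointer by value;
   // the lambda body itself runs on the GPU
   auto scale = [=] __device__ (int i) { deviceData[i] *= 2.0f; };
   ExecuteKernel<<<(count + 255) / 256, 256>>>(scale, count);
}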
Julio Jerez
Moderator
 
Posts: 12426
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Wed Apr 06, 2022 2:53 pm

On another note: I posted a few questions on the Nvidia forum and never got an answer, and it seems that after 14 days they automatically close the topic; I can't even reopen my own post.
I have now posted this:
https://forums.developer.nvidia.com/t/why-the-same-kernel-runs-a-different-speed-when-invoke-more-than-once/210431

I see inconsistencies in kernel call times that are too great to be explained by latency. I thought it was the memory bugs, but I have those fixed and the slowdown persists.

It does not seem like much, but a frame will take a few dozen kernel calls, maybe hundreds, and if some take 250 us, that's not going to work. That kernel call takes longer than the memory copy, and I can't see why, especially when the same kernel, called earlier, takes 1/5 of the time.
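
For reference, one way to time a single kernel in isolation is with CUDA events recorded on the stream (a generic sketch, not Newton code; SomeKernel, grid, block, and stream are placeholders):

Code: Select all
cudaEvent_t start;
cudaEvent_t stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// both events are recorded on the same stream, so the elapsed time
// covers exactly the work queued between them
cudaEventRecord(start, stream);
SomeKernel<<<grid, block, 0, stream>>>(/* args */);
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);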
Julio Jerez
Moderator
 
Posts: 12426
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby JoeJ » Wed Apr 06, 2022 5:58 pm

Julio Jerez wrote:
It fits nicely with how Newton 4 is architected, which is all based on a thread pool that passes lambdas as jobs.

How do you do this? I wanted this too, but could not figure out how.
So I still write callbacks for my old-school job system, which is tedious.
JoeJ
 
Posts: 1489
Joined: Tue Dec 21, 2010 6:18 pm

Re: Cuda Solver

Postby Julio Jerez » Wed Apr 06, 2022 7:21 pm

Yes, that took my deepest knowledge of C++.
In the past I implemented it using classes that I sent to the thread pool; you can check it in 3.14 and the early releases of 4.00.

The problem with classes is that they require setting many parameters, but you cannot pass those in a constructor, because the only way to initialize an array of classes in C++ is through the default constructor. So the loop function was very ugly, reading the parameters before entering the loop.
That bothered me for a long time.

One nice way to fix that is by making a class with a function call operator().
It is difficult to find info about how that operator is used. In fact, until C++11 operator() did not really work portably, because it was up to the compiler maker how to implement it, so documentation is very sparse.
Basically, it is a function that lets you call a class instance with an arbitrary set of parameters.
From C++11 on the implementation is standardized, but it is still hard to use.
Here is an example: https://cplusplus.com/reference/functio ... ator_func/
You can see that using the operator, you can make an array of objects and pass parameters, so I was ready to use that.
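
For reference, this is the kind of function object that page describes (a made-up example, not Newton code): operator() makes an instance callable like a function, and since the parameters are plain data members, a default-constructed array of these can be configured after construction:

Code: Select all
class UpdateJob
{
   public:
   UpdateJob()
      :m_data(nullptr)
      ,m_count(0)
   {
   }

   // lets an instance be called like a function: job(threadIndex, threadCount)
   void operator() (int threadIndex, int threadCount) const
   {
      // each thread strides over its share of the work
      for (int i = threadIndex; i < m_count; i += threadCount)
      {
         m_data[i] += 1.0f;
      }
   }

   float* m_data;
   int m_count;
};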

However, it is still not quite that easy, because now I have the problem that each class will have a different set of parameters, or I have to pass a pointer to the parent object and dereference it.

Here is where lambdas come to the rescue. A lambda is in fact a function call operator(), but in addition it has an extra feature called the closure, introduced by the symbol [].

You can think of the closure as variables from the enclosing scope that get passed into the class.
For example, say you make a lambda:

auto UpdateAabb = [] (int a, int b) { return a + b; };

That will add a + b, nothing special.
But now let us say we need to read a value from the parent class, say c, which is a member of the caller class. Of course, you can always write another lambda with three parameters:

auto UpdateAabb = [] (int a, int b, int c) { return a + b + c; };

As you can see, this gets very tedious very quickly, especially if you want to place the objects in an array and then simply call them from the array. For example, say the first argument is a thread index and the second is a thread count:

auto UpdateAabb = [] (int threadIndex, int threadCount) { return c[threadIndex]; }; // error: c is not visible here

Now if you have a thread pool and you want to call each thread with the index of the thread, all you do is make an array of objects, which all have default constructors, pass that array to the thread pool, and the thread pool iterates over the array calling:

for each thread i:
   obj[i](threadIndex, threadCount);

That can be made into a system, but the question remains how to set the parameters.

And that is where the closure comes in.

You can make your lambda with a set of closure arguments that will all be passed as member variables of the generated class, something like:

auto UpdateAabb = [this] (int threadIndex, int threadCount) { return c[threadIndex]; };

Now you see that the lambda does not have to take this as an argument; instead, the generated object gets a member variable that is set to this.

You could do that by hand, but it would also be very hard, because every class would have to have the same type. It is better to make a template function that does the proper initialization of the array of objects. I did it in the file
C:\Development\newton-dynamics\newton-4.00\sdk\dCore\ndThreadPool.h
the function is

Code: Select all
namespace ndMakeObject
{
   template<typename Type> auto ndFunction(const Type & obj) -> decltype (::ndFunction<Type>(obj))
   {
      return ::ndFunction<Type>(obj);
   }
}


So what that does is take the lambda function, make one object on the stack, and place the closure arguments as member variables.

Then I call the function:
Code: Select all
template <typename Function>
void ndThreadPool::ParallelExecute(const Function& ndFunction)


with a reference to that object. That function makes an array of objects, the size of the number of threads, and initializes each one with the copy constructor.

If you open the function you will see that it uses alloca and then iterates over the array, calling the copy constructor on each class.

Then, for each class, it calls the function operator with the thread ID and thread count.

It takes a little tweaking and massaging, but after you get it going the high level is very clean.
Check it out, it could be a good exercise. This is about as advanced as it gets in C++, and it only works for C++11 and up. A condensed sketch of the whole pattern is below.
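
Code: Select all
// A condensed sketch of the pattern (my reconstruction for illustration,
// not the actual ndThreadPool code; it uses std::thread instead of a
// persistent worker pool)
#include <thread>
#include <vector>

template <typename Function>
void ParallelExecute(const Function& callback, int threadCount)
{
   std::vector<std::thread> workers;
   for (int i = 0; i < threadCount; ++i)
   {
      // copying 'callback' duplicates the closure (the captured
      // variables) into each worker's job object
      workers.emplace_back([callback, i, threadCount]()
      {
         callback(i, threadCount);
      });
   }

   // the caller blocks until every job completes
   for (std::thread& worker : workers)
   {
      worker.join();
   }
}

// usage: the closure carries the context, no parameter plumbing needed
// ParallelExecute([this](int threadIndex, int threadCount)
// {
//    for (int i = threadIndex; i < m_bodyCount; i += threadCount)
//    {
//       UpdateAabb(i);
//    }
// }, m_threadCount);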
Julio Jerez
Moderator
 
Posts: 12426
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby JoeJ » Thu Apr 07, 2022 4:57 am

Thanks. I assumed it should work somehow this way, but I really need to see your code to get the details ;)

Mostly I use lambdas only to have helper sub-functions declared inside a function.

But I also did some things like a tree traversal where you give it a std::function object.
The object then has the context of the calling code, and it also provides the traversal with the decision on whether to descend a branch or not.

Then I wanted something similar for multithreading, but did not know how to 'transfer' the context.
I guess that's what you solve by creating the array of objects.
Will see. Probably a good example to learn about modern C++. :D
JoeJ
 
Posts: 1489
Joined: Tue Dec 21, 2010 6:18 pm

Re: Cuda Solver

Postby Julio Jerez » Thu Apr 07, 2022 7:31 am

The old method is not much different from the new one.
The old method used a class declared in the function. The class was instantiated in an array on the stack, the size of the thread pool.

Then, in a loop, the pointer to each class in the array was the user data sent to the thread pool for execution.

But as I said, that method required the members of the class to be set after the array was initialized, which C++ makes impossible, since the only constructor allowed for an array is the default constructor.
Therefore the initialization was done in the virtual function of the class when it was called from the thread pool.

Lambda functions are not really functions. They are in fact unnamed classes with an operator(), identical to how it was done before.

A lambda is a plain function only when it has no closure arguments. When it does, each closure argument is one data member of the class, initialized to that argument. That is what makes it a completely new thing: by making the lambda with a closure, your class is created on the stack and properly initialized.

The second part is how to get the pointer to the lambda object created on the stack.

That I found in some C++ docs that explain how standard C++ does it for std::function, and also how the boost library does it.

Once you have that, the rest is the same: you call your thread pool with a pointer to the function object made by the lambda.

Then in your thread pool you make the array of lambdas, and you initialize each element from that pointer using the copy constructor.
It all has to be a template so that the array of objects can be defined. So you make the base class, whose type is the lambda's, a template with the function operator().

And that's all there is to it.

There are some tricks. For example, making a variable-size array of function objects on the stack makes C++ emit expensive runtime checking.
So it is better to use alloca, which allocates a variable-size array on the stack; then in a loop you iterate, calling operator new in place to initialize the classes.
These classes are the jobs passed to the thread pool.

Notice that this can only work for a thread pool that executes immediately, where the caller waits for all the threads to complete the job before returning. This is because the stack goes out of scope when the function returns.
So in a way it is more limited.

Somehow, when put all together, it is far more elegant, a little faster since all the closure initialization is done by the compiler, and also more robust since it is supported by all C++11-or-better compilers. A sketch of the alloca trick is below.
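
Code: Select all
// A sketch of just the alloca / placement-new trick described above
// (an illustration under those assumptions, not the actual Newton code)
#include <new>      // placement new
#include <malloc.h> // alloca on MSVC; most other toolchains use <alloca.h>

template <typename Function>
void DispatchJobs(const Function& callback, int threadCount)
{
   // one variable-size allocation on the stack: no heap traffic, and no
   // runtime checks like the ones a stack array of objects would trigger
   Function* const jobs = (Function*)alloca(sizeof(Function) * threadCount);
   for (int i = 0; i < threadCount; ++i)
   {
      // placement new runs the copy constructor in the raw stack storage,
      // duplicating the closure for each worker
      new (&jobs[i]) Function(callback);
   }

   // ... hand &jobs[i] to worker i and wait for completion here; the jobs
   // must finish before this function returns, because they live on its stack

   for (int i = 0; i < threadCount; ++i)
   {
      jobs[i].~Function(); // placement new needs an explicit destructor call
   }
}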
Julio Jerez
Moderator
 
Posts: 12426
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Thu Apr 07, 2022 7:46 am

On the CUDA kernels, I am finding something that seems very peculiar.
It seems that some kernels affect the performance of other kernels.

In the solver, the first kernel I added was the transform of all the bodies, so that I can have a visual cue.

It was very fast, about 50 microseconds, much less than getting the results from the GPU.

Now I have added about 7 more kernel invocations. They are called before the last one, which is the transforms, and they operate on the same array of proxy bodies.
But what I found is that the last kernel's execution takes the time of all the other kernels combined.

I posted this on the Nvidia forum when I had 6 kernel calls and the timing was 250 us.
Since then I added a new one that takes 20 us, and now the transform takes 270 us.

This is a really bad trend. At that rate it will become slower and slower; I expect about 70 to 100 kernel calls, since some will be in a loop. So if that trend continues we are talking tens of milliseconds.

I do not really know what I am doing wrong. I know the kernels are async, but that does not explain why the last one takes the combined time of the preceding ones.

But no one replies on the Nvidia forum. In fact, after a few days the forum just closes the questions silently, as if they were resolved, without letting the poster follow up.
Julio Jerez
Moderator
 
Posts: 12426
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby JoeJ » Thu Apr 07, 2022 9:32 am

Julio Jerez wrote:
Now I have added about 7 more kernel invocations. They are called before the last one, which is the transforms, and they operate on the same array of proxy bodies. But what I found is that the last kernel's execution takes the time of all the other kernels combined.

Sounds like you are seeing what I called 'automatic async compute' before.
If your kernels have no barriers to strictly order their execution, execution will overlap.
That's good, but you get misleading timings from the overlap.

To get exact timings, I have added temporary extra barriers to prevent the overlaps. Something like the sketch below.
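
In CUDA terms that could look like this (a sketch; the kernel and stream names are placeholders, and the synchronize calls are the temporary barriers, to be removed after profiling):

Code: Select all
KernelA<<<grid, block, 0, stream>>>(/* args */);
cudaStreamSynchronize(stream); // barrier: drain the stream before the next launch
KernelB<<<grid, block, 0, stream>>>(/* args */);
cudaStreamSynchronize(stream); // now each kernel's wall-clock time is isolated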
JoeJ
 
Posts: 1489
Joined: Tue Dec 21, 2010 6:18 pm

Re: Cuda Solver

Postby Julio Jerez » Thu Apr 07, 2022 11:53 am

They have two objects to control that: streams and graphs.
I am using a stream, because while graphs are very cool, they have some serious limitations.
With streams, all the kernels and memory operations you issue into one stream are executed in the order in which they are received by the driver. Using multiple streams is how you can split a device into sub-devices; it is possible, but at the mercy of the driver. Basically, if you have a huge GPU with several thousand cores, it is unlikely that a single dispatch will occupy the entire width of the GPU, so the driver may decide to split the GPU, say 32 compute units for graphics and the rest for other stuff. You can make, say, two or more streams and dispatch them, and it is up to the driver to either issue them overlapping, by assigning some compute units to each, or issue them sequentially; in all cases you have no control.

Our case does not need that level of complexity. I am simply using one stream and issuing all the calls there, the reason being that the default stream serializes all memory operations, stalling the pipeline, while all other streams have the ability to issue async memory operations. I think I have that set up right.
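
In outline, that setup looks like this (a sketch; the buffer and kernel names are placeholders). One caveat: cudaMemcpyAsync only actually overlaps when the host buffer is pinned, e.g. allocated with cudaMallocHost; otherwise the driver silently falls back to a synchronous copy:

Code: Select all
cudaStream_t stream;
cudaStreamCreate(&stream); // any non-default stream allows async memory ops

cudaMemcpyAsync(d_bodies, h_bodies, size, cudaMemcpyHostToDevice, stream);
UpdateBodies<<<grid, block, 0, stream>>>(d_bodies, count);
cudaMemcpyAsync(h_bodies, d_bodies, size, cudaMemcpyDeviceToHost, stream);

// everything above executes in issue order within the stream;
// wait once, at the end of the frame
cudaStreamSynchronize(stream);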

What I am seeing is very erratic behavior on the GPU side. For some reason, when executing kernels, it leaves huge bubbles of time where no activity happens. Sometimes the time is attributed to a shader, but sometimes it is just a gap in the timeline.

It is almost as if it has been preempted to do other work, maybe some rendering or internal housekeeping. The gap seems to be around 200 microseconds. Here is the latest picture:
[attachment: Untitled.png, Nsight timeline capture]


As you can see, there are some kernels, then a silent moment, then it resumes executing the next kernels.

There is a top timeline that represents the driver time on the CPU. There is no gap there, but it seems quite heavy; it looks like the driver takes a big overhead sending kernels to the GPU.

Anyway, so far this is not a problem, but if it repeats periodically several times per frame, it could become one.
Julio Jerez
Moderator
 
Posts: 12426
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Thu Apr 07, 2022 12:23 pm

In the driver timeline, this is how it looks:
[attachment: Untitled.png, driver timeline capture]

The first bar is the driver overhead, about 50 us. The last one is getting the transforms at the end of two sub-steps, and it is about 30 us. The ones in between are kernel submissions by the driver, and they are very small, around 3 to 5 us, sometimes jumping to 10 us.
It is all asynchronous. My worry is that the small bars add up: there will be from 100 to maybe 200 of them, and it starts to add up. For example, sorting an array needs several calls, about 12. And sorting is the workhorse of parallel programming, because it is how you organize your data by some coherent property; then you can operate on the sorted array knowing that all the elements sharing that property are adjacent in memory. On the CPU we use trees and hashes instead.
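
For comparison, here is what that kind of sort looks like at the library level, using Thrust, which ships with the CUDA toolkit (a sketch; the cell-id key and body-index value are made-up names):

Code: Select all
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

// after the sort, every element sharing a cell id is adjacent in memory
void SortBodiesByCellId(thrust::device_vector<unsigned>& cellIds,
                        thrust::device_vector<int>& bodyIndices,
                        cudaStream_t stream)
{
   thrust::sort_by_key(thrust::cuda::par.on(stream),
                       cellIds.begin(), cellIds.end(),
                       bodyIndices.begin());
}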

Anyway, let us keep going. I am now close to getting the broad phase.

Basically, it will determine all the pairs every frame. The pairs will be copied async to a double buffer while the CPU reads the previous array of pairs; then the CPU will iterate over that array, creating the new joints and deleting the dead ones.

Then, each frame, the array of new joints and the array of dead joints are loaded as proxies, and a couple of kernels will merge the in-memory array with the dead array and the new array, sorting them in such a way that live contacts come first and duplicated and dead contacts go to the other end; the count of live contacts is then the number of live joints.

This is the same logic as the CPU broad phase, except that since the CPU uses an exclusive red-black tree, checking whether a contact is a duplicate is just a matter of checking whether a joint with that ID already exists.

Given that the number of pairs per frame is small, this is efficient on both CPU and GPU, but the GPU has to do a lot more work with the sorting.

The one difference is that on the GPU, when two bodies come into contact, the GPU will generate a pair and write it to a memory buffer that is not consumed until the next frame, so contacts are at least one frame behind. But that should not be a problem if the AABBs have some padding. The penalty is that the GPU carries a lot more pairs, since the AABBs have to be bigger to make up for the latency.
Julio Jerez
Moderator
 
Posts: 12426
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Thu Apr 07, 2022 12:27 pm

Do you have an Nvidia GPU, Joe? I remember you said you have one, but I do not remember now.
If so, can you run the GPU solver?
Julio Jerez
Moderator
 
Posts: 12426
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Bird » Thu Apr 07, 2022 2:20 pm

I just updated to the latest GitHub version and physics time is about 3.7 ms on my NVIDIA GeForce RTX 2070 Super.
Bird
 
Posts: 636
Joined: Tue Nov 22, 2011 1:27 am
