Cuda Solver

by **Julio Jerez** » Sat May 21, 2022 3:07 pm

but the thing is that other that make a simple kerners, nothing else seems to work.

I uploaded to this link, if anyone can test it before I post the question over the nvidia site.
https://www.mediafire.com/file/8svscvsrhbjdxcq/dCudaTest.rar/file

they has no answer any of my question, but there are thousand of people using this, I have to assume this * actually works, but I read all the blogs, the docs and many questions from stack overflood and other sites from people with the same problem and all I see is people giving solutions that do not work and posting links to some one else who apparently solver until you find out that it did not.

by **Julio Jerez** » Sat May 21, 2022 5:39 pm

I posted over the nvidai forum, to see if I get a respond.
https://forums.developer.nvidia.com/t/what-using-dynamics-parallelism-i-cant-get-the-nvidia-linker-to-work/215214
but I do not keep my hope high, since I has not gotten a single reply from them.

by **Julio Jerez** » Sat May 21, 2022 7:47 pm

oh wow,
trying thing randomly, comparing the sample project with th eone I am generation, which ios no eassy since the do no use cmake.
I noticed that their project do not specify the cuda libraries, but mine does.
so when I open the linker properties and I noticed this

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\lib\Win32\cudadevrt.lib
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\lib\Win32\cudart_static.lib
cudadevrt.lib
cudart_static.lib
kernel32.lib
user32.lib
gdi32.lib
winspool.lib

the cuda libs are include twice so I when an commented out the inclusion of the libraries like this
#if (PTR_SIZE EQUAL

# #link_libraries ("$ENV{CUDA_PATH}/lib/x64/cudadevrt.lib")
# #link_libraries ("$ENV{CUDA_PATH}/lib/x64/cudart_static.lib")
# target_link_libraries(${projectName} PRIVATE "$ENV{CUDA_PATH}/lib/x64/cudadevrt.lib")
# target_link_libraries(${projectName} PRIVATE "$ENV{CUDA_PATH}/lib/x64/cudart_static.lib")
#else()
# #link_libraries ("$ENV{CUDA_PATH}/lib/Win32/cudadevrt.lib")
# #link_libraries ("$ENV{CUDA_PATH}/lib/Win32/cudart_static.lib")
# target_link_libraries(${projectName} PRIVATE "$ENV{CUDA_PATH}/lib/Win32/cudadevrt.lib")
# target_link_libraries(${projectName} PRIVATE "$ENV{CUDA_PATH}/lib/Win32/cudart_static.lib")
#endif()

and the solution now compiles.

it does not solves the origional project which is making the plugin in a static library sine what seen to add the library alone with some other linker setting is these intructions

Code: Select all: set(CMAKE_CUDA_STANDARD 14) CMAKE_POLICY(SET CMP0104 OLD) project(${projectName} LANGUAGES CXX CUDA) set(CMAKE_CUDA_COMPILER $ENV{CUDA_PATH}/bin/nvcc.exe) set(CMAKE_CXX_STANDARD 17) set(CMAKE_CUDA_STANDARD 14) set_target_properties(${projectName} PROPERTIES CUDA_SEPARABLE_COMPILATION ON CUDA_RESOLVE_DEVICE_SYMBOLS ON)

and some of those onle apply to dll or exe generation.
the important part is that we can continue develoment.

so the plan changes a litle because I have to sperate the code that is C++ and the code that is .cu and all that cuda will be in the DLL,

by **Bird** » Sat May 21, 2022 10:59 pm

Glad to hear it.

I managed to make a static library version working. But I can't get Dynamic Parallelism working. I always get a cudaLaunchKernel cudaErrorInvalidSource(300) error

I made a github project that show a static link. I am not a fan of cmake( to put it mildly). I used Premake instead

https://github.com/Hurleyworks/CudaDP

by **Julio Jerez** » Sun May 22, 2022 4:06 pm

Bird wrote:But I can't get Dynamic Parallelism working. I always get a cudaLaunchKernel cudaErrorInvalidSource(300) error

that's the whole purpose of the effort.
I still have not try yet, so I might too have the same errors, but that another issue.
so far we now pass the point where the engine build and link the genertated code by the compiler.

I still get these nasty warnings,

Code: Select all: 1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include\crt/host_runtime.h(256): warning C4505: '__cudaUnregisterBinaryUtil': unreferenced function with internal linkage has been removed 1>Done building project "ndSolverCuda.vcxproj". 1>ndCudaDevice.cu 1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\include\crt/host_runtime.h(256): warning C4505: '__cudaUnregisterBinaryUtil': unreferenced function with internal linkage has been removed 1>Done building project "ndSolverCuda.vcxproj". ... 1>C:/Users/julio/AppData/Local/Temp/tmpxft_0000a98c_00000000-7_ndSolverCuda_d.device-link.reg.c(2): warning C4100: 'prelinked_fatbinc': unreferenced formal parameter 1>C:/Users/julio/AppData/Local/Temp/tmpxft_0000a98c_00000000-7_ndSolverCuda_d.device-link.reg.c(3): warning C4100: 'prelinked_fatbinc': unreferenced formal parameter ...

I assume is because the compiler is generation them but the code is not calling them yet.

anyway, I think I have a good refactorization now.
the cuda context is now a dll, and containing all the cuda kernel and is made of .cu files and header.

all the cpp+ glue code is included in the ndNewton library.
so that class is what the engine see for scene and solver, and it is all c++
thet the cuda code is an interface in the dll, but does no implement any coda code,
all the code stuff is hidden in the implementation, the class looks like this

Code: Select all: class ndCudaContext { public: D_CUDA_API ndCudaContext(); D_CUDA_API ~ndCudaContext(); D_CUDA_API bool IsValid() const; D_CUDA_API const char* GetStringId() const; ndCudaDevice* m_device; ndCudaContextImplement* m_implement; };

so it is a 100% cuda application and the contact is just the c++ glue.
if an app does no use dll, we still can load the dll as a resurce using loadDll function.

this seems to works, I managed to initialice the device, but does not do anything so far and has assert almost every where.
It is just too much to cover in one weekend. but I am hopeful this will allow for two things.
1-use the dynamics paralelism.
2-use the standard code libraries in case we need it.

if anyone sync and try to build it just to make sure I did not make error on the cmake scripts.

now I will continue adding the functionality again.

one of the problem of making a strict cuda project is that I will no be able to share high level newton code. so I will have to rely on casting and proxy data structures like we did in 3.14

for example ndArray, can not be use so those will have to be in the ndNewton side and the call will have to pass the adress and size to the cuda context, because is no aware of the high level,
but I think that will be fine. and far better than no working.

by **Dave Gravel** » Sun May 22, 2022 4:29 pm

The cmake script working good for me with vs2019.
When I build the project I get some warnings in the cuda code.

warning C4505: '__cudaUnregisterBinaryUtil'
warning C4100: 'prelinked_fatbinc'
C:/Users/Dave/AppData/Local/Temp/tmpxft_00004e20_00000000-7_ndSolverCuda_d.device-link.reg.c(2)

All look to work good for me.

by **Bird** » Sun May 22, 2022 5:47 pm

I got the same nasty warnings you mentioned but it compiles okay

I can't even get Dynamic Parallelism working as part of a simple executable instead of a static library. I can't believe with all the hype about it I can't find a single working example using it on the internet.

by **Julio Jerez** » Sun May 22, 2022 6:31 pm

excellent.

on this

Bird wrote: I can't even get Dynamic Parallelism working as part of a simple executable instead of a static library. I can't believe with all the hype about it I can't find a single working example using it on the internet.

welcome to my disappointment.

I am doing the refactoring and hopping that when I try dynamic parallelism I will work.
I could just try and test it, but since I already made the decision of going this route.
when I get there I will test it and if is work, that's great.
if is does not them I see what I can do, but refactoring the engine is actually a good thing, because I think that all the code that is in newton Template could be cause some sort of problem with nvidia compiler.
know the cuda context is a standalone library that do no use newton struct at all.

I will just keep going

by **Julio Jerez** » Sun May 22, 2022 9:43 pm

ahh, good news for a change.
I have it working.
I set up this kernel

Code: Select all: __global__ void ndCudaScene(ndCudaSceneInfo& info) { unsigned threads = info.m_bodyArray.m_size - 1; unsigned bodyBlocksCount = (threads + D_THREADS_PER_BLOCK - 1) / D_THREADS_PER_BLOCK; printf("ndCudaScene\n"); CudaInitBodyArray << <bodyBlocksCount, D_THREADS_PER_BLOCK, 0 >> > (info); CudaMergeAabb << <1, D_THREADS_PER_BLOCK, 0 >> > (info); }

I added printf on each of the child kernels, and it printed the sequence below

ndCudaScene
InitBodyArray
CudaMergeAabb

ndCudaScene
InitBodyArray
CudaMergeAabb

now this makes a huge difference, because now after a kernrel call, we can querie the result for a subsequence calculation.

an example is as follows, We have an array of bodies, the first kernel calculate the aabbs, this is a fix size array,
the next kernel used the number of intersection to calculates how many grids each aabb intersect, and that's a variable size, before that value was unknown in cpu, so I had to pass a huge number of blocks, causing the gpu dispatch do a lot of extra work, now the number of interaction we just pass the exact block count.

this kind of thing makes for a lot of simplifications. not to mention that nvidia also claims that dynamics parallelism is be faster, but many people complain that it is actually slower.
I have to believe nvidia, but one thing that is suspicious is that child kernels are not showing up in the profiler.

Boy, I which nvidia answer some of these questions, their support is the absolute worse I have ever seen.

in any case, if you guys try, maybe anyone can see if the Kerners show in the profiler.
I just going to keep going.

edit:
I just checked it in now.

by **Bird** » Mon May 23, 2022 11:14 am

Excellent news. It seems to be working fine here. The only newton kernel I see in the NsightCompute app is ndCudaScene.

by **Julio Jerez** » Mon May 23, 2022 12:36 pm

yes that's what I see as well.
My guess is that nSight is not awared of those kernels calls because they are called via procedural generated glue cuda code produced by the compiler. so the kernel may have line number info that the profiler can used, but call generated at link time does not.
Is like when we use a compiled library in cpp, and the lib was not compile with debug info.
that is my speculations. Maybe there is one option for that, but if there is one I can't find it

but anyway, considering that we are getting it for free, let use not be soo picky.
It seems that's a small price to pay for all the flexibility we get from that functionality.

One cool thing, is that with this it is even possible to add debug sanity to the various places without having to issue synchronization, I just added sanity check to the prefix scan, and now the function looks like this

Code: Select all: __global__ void ndCudaHillisSteelePrefixScan(ndCudaSceneInfo& info, unsigned blockSize) { const unsigned threads = info.m_histogram.m_size; const unsigned prefixScanSuperBlockAlign = D_PREFIX_SCAN_PASSES * blockSize; const unsigned superBlocks = (threads + prefixScanSuperBlockAlign - 1) / prefixScanSuperBlockAlign; const unsigned histogramBlocks = D_PREFIX_SCAN_PASSES * superBlocks; ndCudaHillisSteelePaddBufferInternal << <D_PREFIX_SCAN_PASSES, blockSize, 0 >> > (info); for (int i = 0; i < (D_PREFIX_SCAN_PASSES_BITS - 1); i++) { ndCudaHillisSteelePrefixScanAddBlocksInternal << <histogramBlocks, blockSize, 0 >> > (info, i); } ndCudaHillisSteelePrefixScanAddBlocksFinalInternal << <histogramBlocks, blockSize, 0 >> > (info); ndCudaHillisSteeleAddSupeBlocksInternal << <D_PREFIX_SCAN_PASSES, blockSize, 0 >> > (info); #ifdef _DEBUG // issue debug code, here for sanity check. unsigned sanityBlocks = threads / blockSize; ndCudaHillisSteeleSanityCheck << <sanityBlocks, blockSize, 0 >> > (info); if (info.m_frameIsValid == 0) { printf("function: ndCudaHillisSteelePrefixScan failed\n"); } #endif }

as you can see, from the high level, the whole function is just a single call, and the dimension of the kernel is adjusted accordingly,
this was what was causing so much aggravation. but now the whole thing is a reusable function.

the next step is to see if we can pass lambda arguments, for example a sort routine will use different semantic to evaluate the sort key, here is a example in cpp

Code: Select all: template <class T, class dCompareKey> void ndSort(T* const array, ndInt32 elements, void* const context = nullptr)

and is usage example is

Code: Select all: class CompareNodes { public: ndInt32 Compare(const ndNodeBase* const elementA, const ndNodeBase* const elementB, void* const) const { ndFloat32 areaA = elementA->m_area; ndFloat32 areaB = elementB->m_area; if (areaA < areaB) { return 1; } if (areaA > areaB) { return -1; } return 0; } }; ndSort<ndNodeBase*, CompareNodes>(leafArray, leafNodesCount);

in theory this should be also possible in cuda, although last time I try I could not get it to work.
but I was discourage since I could get it in one call. but now we can try again.

by **Bird** » Mon May 23, 2022 1:01 pm

the next step is to see if we can pass lambda arguments, for example a sort routine will use different semantic to evaluate the sort key, here is a example in cpp.in theory this should be also possible in cuda, although last time I try I could not get it to work.
but I was discourage since I could get it in one call. but now we can try again

I'm not sure if it will help you or not but there is an compiler option for NVCC

4.2.3.19. --extended-lambda (-extended-lambda)
Allow __host__, __device__ annotations in lambda declarations.

https://developer.nvidia.com/blog/new-c ... es-cuda-8/

by **Julio Jerez** » Mon May 23, 2022 1:19 pm

oh yes, I read that, it is in the cmake script.

Set(CMAKE_CUDA_FLAGS_DEBUG "-Xcompiler=\"-MDd -W4 -wd4201 -wd4324 -Zi -Ob0 -Od /RTC1\" --extended-lambda")

no sure why nvidia make those rule that seems so contrived, but anyway it is thoer stuff.

more goodness I just try making a template function that take and argument, and it compile just fine

Code: Select all: template <typename SortKeyPredicate> __global__ void XXXXX(const ndCudaSceneInfo& info, SortKeyPredicate sortKey) { } template <typename SortKeyPredicate> __global__ void ndCudaCountingSort(const ndCudaSceneInfo& info, SortKeyPredicate sortKey) { //unsigned val = 0; //val = GetKey(val); XXXXX << <1, 1, 0 >> > (info, sortKey); }

the way is will be use is like this.

Code: Select all: ndCudaScene << <1, 1, 0, m_solverComputeStream >> > (*infoGpu); ndCudaCountingSort << <1, 1, 0, m_solverComputeStream >> > (*infoGpu, SortKey);

so to sort an array with a large key we just make one lambda function that evaluates the key, and we pass that as argument, and voila we now have a reusable sorting routine, for example

Code: Select all: auto SortKey_x = [] __device__(const unsigned& item) { return item & mask; }; auto SortKey_y = [] __device__(const unsigned& item) { return (item>>maskSize) & mask; }; auto SortKey_y = [] __device__(const unsigned& item) { return (item>>(2 * maskSize)) & mask; }; ndCudaCountingSort << <1, 1, 0, m_solverComputeStream >> > (*infoGpu, SortKey_x); ndCudaCountingSort << <1, 1, 0, m_solverComputeStream >> > (*infoGpu, SortKey_y); ndCudaCountingSort << <1, 1, 0, m_solverComputeStream >> > (*infoGpu, SortKey_z);

and that's how newton use it in the engine. and to me that far more practical and useful than what nvidia provide in their curb library whet they only sort vectors of keys.

by **JoshKlint** » Tue May 24, 2022 11:04 am

I am curious, will Newton on the GPU involve any sort of latency to retrieve the calculation results? What do you expect here?

by **Julio Jerez** » Tue May 24, 2022 2:39 pm

I start to get very sick and tire of guessing the GPU stuff that should be be that hard, I added the timing and for my surprise I am getting reading that makes no sense.

this code

Code: Select all: __global__ void ndCudaBeginFrame(ndCudaSceneInfo& info) { long long coreTicks = clock64(); info.m_timeSlice = coreTicks; printf("t0 = %lld ", coreTicks); } __global__ void ndCudaEndFrame(ndCudaSceneInfo& info, int frameCount) { long long coreTicks = clock64(); info.m_frameCount = frameCount; printf("t1 = %lld diff= %lld\n", coreTicks, coreTicks - info.m_timeSlice); info.m_timeSlice = coreTicks - info.m_timeSlice; }

generate these reading

and that does no make sence in any context.

Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Re: Cuda Solver

Who is online