Cuda Solver

by **Bird** » Wed May 18, 2022 3:52 pm

Works fine here.

gpu: NVIDIA GeForce RTX 2070 SUPER
warp size: 32
muliprocesors: 40
compute capability: 7.5
memory bus with: 256 bits
memory: (mbytes) 8191
The thread 0x36a0 has exited with code 0 (0x0).
The thread 0x4b1c has exited with code 0 (0x0).
The thread 0x2ad0 has exited with code 0 (0x0).
The thread 0x446c has exited with code 0 (0x0).
The thread 0x3b18 has exited with code 0 (0x0).
The thread 0x404c has exited with code 0 (0x0).
The thread 0x3368 has exited with code 0 (0x0).

I'm seeing a GLATTER error here:

GLATTER: in 'D:\ActiveWorks\OpenSource\latestNewton\newton-dynamics\newton-4.00\applications\ndSandbox\toolbox\ndShaderPrograms.cpp'(26):
GLATTER: OpenGL call produced GL_INVALID_OPERATION error.

by **Julio Jerez** » Wed May 18, 2022 4:30 pm

I have seen a lot of Glatter errors too.
I will see if I can change to some other GL loader.
I see a lot of people using glad.
I may switch to that over the weekend.

Glatter fails in 32-bit debug builds, and even the newer version fails too.
I updated it and, instead of fixing the error, it hit new ones.

On the GPU side, it seems my sort shader has a problem with compute capability 6.1 GPUs like the GeForce GTX 1060.

I will see if I can find it.
It seems to be some sync issue, because when I get the assert, if I execute the same shader again, it sorts the array.

The GTX 1060 is one of the most common GPUs, so it is better to get it fixed. Plus, we do not know whether this is a bug that simply has not shown up yet on other GPUs and will just take longer to appear.
These problems are really hard to find.

by **Julio Jerez** » Wed May 18, 2022 10:52 pm

If you can, when you get time, try to sync. We fixed a bug in Glatter, so maybe it no longer issues the warnings.

I tested the sort here on my system and it works fine:

gpu: NVIDIA GeForce GTX 1660 SUPER
warp size: 32
muliprocesors: 22
compute capability: 7.5
memory bus with: 192 bits
memory: (mbytes) 6143

The only difference I see is that the GTX 10xx cards are compute capability 6.1, and somehow something is failing at random.

The one thing I noticed is that compute 6.1 only supports 16 kernels per multiprocessor, and I am sending a lot more than that, several hundred, since the arrays are quite large. Compute 7.5 takes 128. I am not sure why there is such a huge difference, and I do not know if that is the problem.

I have one more thing to try, and that is to use emulated atomics. If that fails too, I will just set the minimum spec to compute 7.5.

Maybe later we will find what is wrong, but for now let us move on.
GeForce 10xx GPUs are still quite popular, so it would be nice if we supported them. For the life of me, I do not see what I am doing wrong.

by **Bird** » Thu May 19, 2022 6:52 am

The Glatter error messages are gone now.

I think you said you've already seen these, but it looks like there are some tools in this library that might help. CUB is already included in the CUDA Toolkit.
https://github.com/NVIDIA/cub

by **Julio Jerez** » Thu May 19, 2022 8:59 am

Yes. The problem I have is that most functions in the CUDA SDK libraries use dynamic parallelism, which requires a pass by the NVIDIA linker.

The NVIDIA linker can only be invoked for DLLs or EXEs, but the Newton extension solvers are static libraries.

So for me to use any of the NVIDIA libraries, I would have to make the Newton library a DLL, or make the plugins DLLs, or, in the case of static linking, the user would have to make their project a CUDA solution.

I spent a lot of time trying to get around that, and it seems there is no way around it. Dynamic parallelism in CUDA generates relocatable CUDA code that needs to be linked, and that can only be done if the solution is a CUDA Visual Studio project.

by **Julio Jerez** » Thu May 19, 2022 9:13 am

Also, the sort version that I wrote is in fact more flexible and, I believe, equally fast or a little faster than the one in the CUDA library.
But I only compared the results by running theirs in the CUDA sample as a control, then running mine and estimating the performance by extrapolating the results.

Theirs would be very hard to use, since what they do is sort a vector of keys and pass an indirect vector of items.

This is more general but quite memory bloated. In my case the key is embedded as part of the vector of items.
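To make the two layouts concrete, here is a minimal host-side C++ sketch. It is not the Newton or CUB code: the 32/32 bit split, and the names `PackItem`, `KeyOf`, `PayloadOf`, `SortEmbedded` and `SortIndirect`, are assumptions made up for illustration, and `std::stable_sort` stands in for the GPU radix sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Assumed layout: high 32 bits hold the sort key, low 32 bits hold the item payload.
inline std::uint64_t PackItem(std::uint32_t key, std::uint32_t payload)
{
    return (static_cast<std::uint64_t>(key) << 32) | payload;
}

inline std::uint32_t KeyOf(std::uint64_t item)
{
    return static_cast<std::uint32_t>(item >> 32);
}

inline std::uint32_t PayloadOf(std::uint64_t item)
{
    return static_cast<std::uint32_t>(item & 0xffffffffu);
}

// Embedded-key style: a single array is sorted in place, so the item
// travels with its key and no second array is needed.
inline void SortEmbedded(std::vector<std::uint64_t>& items)
{
    std::stable_sort(items.begin(), items.end(),
        [](std::uint64_t a, std::uint64_t b) { return KeyOf(a) < KeyOf(b); });
}

// Key/indirect style: the keys stay where they are and a separate index
// vector records the sorted order, which costs extra memory per item.
inline std::vector<std::uint32_t> SortIndirect(const std::vector<std::uint32_t>& keys)
{
    assert(keys.size() <= 0xffffffffu); // indices must fit in 32 bits
    std::vector<std::uint32_t> order(keys.size());
    std::iota(order.begin(), order.end(), 0u);
    std::stable_sort(order.begin(), order.end(),
        [&keys](std::uint32_t a, std::uint32_t b) { return keys[a] < keys[b]; });
    return order;
}
```

With the embedded layout the caller reads the payload straight out of the sorted array; with the indirect layout it has to gather the items through the returned index vector afterwards.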

When I run the test, what I see is performance similar to my version. But when I read the benchmarks published on many sites, I see ridiculous numbers: one dude claims 16 giga keys per second, while I only see around 500 million for 64-bit keys and 1000 million for 32-bit keys running the CUDA samples. And those are keys, not items.

I am getting around 350 million in mine, but my keys are 64 bit and contain the item as well.

I concluded that there might be some marginal gain from optimizing the code, but at the end of the day it is all dominated by the memory bandwidth of the GPU. So the code that moves less memory is the one that wins.
As long as the algorithm is O(n) or O(n log(n)), the GPU reduces the time to O(n / core count), but it does not reduce the memory complexity.
So it is quite possible that, in a middle range, an algorithm with higher time complexity but a lower memory footprint, like bitonic sort, can beat one with linear complexity but higher memory usage.
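The bandwidth argument can be put into a rough back-of-envelope model. The sketch below is only an assumption-laden estimate, not a measurement: it assumes the sort is purely bandwidth bound and that every pass reads each item once and writes it once. The function name `MaxItemsPerSecond` and all numbers in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on items sorted per second for a purely bandwidth-bound sort.
// Assumption: each pass reads every item once and writes it once, so the
// traffic per item is 2 * itemBytes * passes.
inline double MaxItemsPerSecond(double bandwidthBytesPerSecond,
                                std::uint32_t itemBytes,
                                std::uint32_t passes)
{
    assert(itemBytes > 0 && passes > 0);
    const double bytesPerItem = 2.0 * static_cast<double>(itemBytes) * static_cast<double>(passes);
    return bandwidthBytesPerSecond / bytesPerItem;
}
```

For example, with a made-up 256 GB/s of bandwidth and 8 passes, 8-byte items give a ceiling of 2 billion items per second and 4-byte items give 4 billion. Halving the bytes moved doubles the ceiling regardless of how clever the kernel is, which is the point made above.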

I think this is OK. What is not OK is that it fails at random on the 10xx series of GPUs, and I can't determine why.

by **Julio Jerez** » Thu May 19, 2022 9:25 am

This is a clue as to why my sort routine fails on the GTX 1060. They say this:

Each thread block uses cub::BlockRadixSort to collectively sort its own input segment. The class is specialized by the data type being sorted, by the number of threads per block, by the number of keys per thread, and implicitly by the targeted compilation architecture.

That suggests that their code has conditional compilation for specific GPUs, while mine is generic.
But that goes against everything they say about standard C/C++.
The only instruction that I am using is atomicAdd on a local array of data, which seems to be supported from capability 3 and up.

My last attempt to debug it is to see if that's the cause by writing a non-atomic version. If it still fails, I will just not support 10xx GPUs until I find a solution.
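For readers wondering what a non-atomic version of a counting step could look like, here is one common approach, sketched on the host in C++. It is an assumption about the general technique, not the author's code: each worker counts into its own private histogram and the private histograms are summed afterwards, so no two workers ever increment the same counter. The workers are simulated sequentially here, and the 8-bit digit size and all names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBuckets = 256; // one 8-bit radix digit (assumed)
using Histogram = std::array<std::uint32_t, kBuckets>;

// Reference: one shared histogram. On a GPU, with many threads writing
// concurrently, every increment here would have to be an atomicAdd.
inline Histogram SharedHistogram(const std::vector<std::uint8_t>& digits)
{
    Histogram histogram{};
    for (std::uint8_t digit : digits)
    {
        ++histogram[digit];
    }
    return histogram;
}

// Atomic-free variant: each worker counts a contiguous slice into its own
// private histogram, then the private histograms are summed per bucket.
// Workers run one after another here; on a GPU they would run in parallel.
inline Histogram PrivateHistograms(const std::vector<std::uint8_t>& digits, std::size_t workers)
{
    assert(workers > 0);
    std::vector<Histogram> partial(workers, Histogram{});
    const std::size_t chunk = (digits.size() + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w)
    {
        const std::size_t begin = std::min(w * chunk, digits.size());
        const std::size_t end = std::min(begin + chunk, digits.size());
        for (std::size_t i = begin; i < end; ++i)
        {
            ++partial[w][digits[i]];
        }
    }
    Histogram total{};
    for (const Histogram& p : partial)
    {
        for (std::size_t b = 0; b < kBuckets; ++b)
        {
            total[b] += p[b];
        }
    }
    return total;
}
```

The trade-off is extra memory (one histogram per worker) and an extra reduction step, in exchange for removing every contended increment. If a version along these lines still failed on the 10xx cards, that would point away from the atomics as the cause.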

by **Bird** » Thu May 19, 2022 10:10 am

If I understand this post correctly, it looks like they are doing what you want to do.

https://forums.developer.nvidia.com/t/c ... uda/192779

by **Julio Jerez** » Thu May 19, 2022 10:35 am

Oh yes that's the same problem.

I did follow the steps, but I did not do the "inherit from parent" step, so maybe that was the big problem I had. I will try again this Saturday.

Dynamic parallelism opens a whole new world of possibilities.

Without it, the granularity is so low that I have to code one kernel at a time, and I need to make calls just to set or test a variable for results.

The ability to call kernels from kernels simplifies the process a great deal. It also reduces the CPU overhead of kernel calls, which is already starting to be noticeable.

I really, really hope this works.

by **Julio Jerez** » Thu May 19, 2022 2:42 pm

Well, I tried what they say here:

For everyone’s benefit, Yuki pointed out (via email) that my executable was not linking to cudadevrt.lib.

Two approaches to solve this:

1 - Via Visual Studio: In the project properties for the executable, go to Linker > Input > Additional Dependencies > Edit… Make sure the “Inherit from parent or project defaults” box is checked.
2 - Via CMake. In the CMakeLists.txt file for the executable, we need to link to cudadevrt.lib. Simply add a target_link_libraries() command, like so:

cmake_minimum_required(VERSION 3.17 FATAL_ERROR)

project(CUDA_Dynamic_Parallelism)

add_executable(${PROJECT_NAME} main.cpp)
set_target_properties(${PROJECT_NAME} PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(${PROJECT_NAME} MyLib)
target_link_libraries(${PROJECT_NAME} ${CUDAToolkit_LIBRARY_DIR}/cudadevrt.lib)
Note that you might need to call the find_package() command higher up in the CMake hierarchy, rather than down in the CMakeLists file for the static library.

But it does not really work; in fact, I have it under an #ifdef in the CMake script.
Also, he missed one extra step, which is this:

Set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -rdc=true")

The problem, if you read his post, is that both methods imply setting options on an executable project:
"Via Visual Studio: In the project properties for the executable"
But the CUDA solver is a static library, and you cannot invoke the linker on static libraries.

So after doing what he recommends, when the build completes it still gets these link errors:

3>ndSolverCuda_d.lib(ndCudaContext.obj) : error LNK2019: unresolved external symbol __cudaRegisterLinkedBinary_ad068f47_16_ndCudaContext_cu_dbd915f8
3>ndSolverCuda_d.lib(cuPrefixScan.obj) : error LNK2019: unresolved external symbol __cudaRegisterLinkedBinary_97401e57_15_cuPrefixScan_cu_886547d9
3>ndSolverCuda_d.lib(cuSortBodyAabbCells.obj) : error LNK2019: unresolved external symbol __cudaRegisterLinkedBinary_eb48d415_22_cuSortBodyAabbCells_cu_f9bee96a

So I am back to square one.

by **Julio Jerez** » Thu May 19, 2022 2:50 pm

For now, the quick solution is this:

Code:
ndCudaContext* ndCudaContext::CreateContext()
{
	cudaDeviceProp prop;
	ndCudaContext* context = nullptr;
	cudaError_t cudaStatus = cudaGetDeviceProperties(&prop, 0);
	int campbility = prop.major * 100 + prop.minor;
	// go as far back as 5.2 Maxwell GeForce GTX 960 or better.
	//if ((cudaStatus == cudaSuccess) && (campbility >= 600))
	if ((cudaStatus == cudaSuccess) && (campbility >= 700))
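The capability gate in the snippet above can be sketched on its own, without the CUDA runtime, to show how the encoding sorts devices. The helper names `EncodeCapability` and `MeetsMinimumCapability` are hypothetical and are not part of Newton; only the `major * 100 + minor` encoding and the 600 and 700 thresholds come from the post.

```cpp
#include <cassert>

// Encodes a compute capability the same way as the snippet above,
// for example 7.5 becomes 705 and 6.1 becomes 601.
inline int EncodeCapability(int major, int minor)
{
    assert(major >= 0 && minor >= 0 && minor < 100);
    return major * 100 + minor;
}

// Hypothetical helper: true when the device meets the minimum encoded capability.
inline bool MeetsMinimumCapability(int major, int minor, int minimumEncoded)
{
    return EncodeCapability(major, minor) >= minimumEncoded;
}
```

With the threshold at 700, the two 7.5 cards reported earlier in the thread pass, while a 6.1 card such as the GTX 1060 is rejected; lowering the threshold back to 600 would admit it again.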

Over the weekend I will test the same code with CUB, making a standalone project so that maybe someone can test it.
But given that I have had zero answers from the NVIDIA dev forum, I do not know what else I can do.

by **Bird** » Thu May 19, 2022 9:13 pm

So after doing what he recommends, when the build completes it still gets these link errors:

3>ndSolverCuda_d.lib(ndCudaContext.obj) : error LNK2019: unresolved external symbol __cudaRegisterLinkedBinary_ad068f47_16_ndCudaContext_cu_dbd915f8

Maybe this will help
https://gist.github.com/gavinb/c993f71c ... 52a3f8ef30

by **Julio Jerez** » Fri May 20, 2022 9:42 pm

If you check the CMake script, you will see that I am already using the CMake command below:

Code:
set_target_properties(${PROJECT_NAME} PROPERTIES
	CUDA_SEPARABLE_COMPILATION ON
	CUDA_RESOLVE_DEVICE_SYMBOLS ON)

But I can't use them here, since they are meant for executable binaries.
I placed them on the parent app, and that does not work either, because that is not a CUDA project.

I do not think there is a way around that; the only way the CUDA linker can be invoked is if the project is an EXE or a DLL.

I think that's fine. I have already started the process of separating the C code from the GPU code.

Then the GPU code will all be encapsulated in a DLL that the engine can load at run time, the same way shaders are loaded. And since GPU support will be optional, it should be OK for people who do static linking.

I will try to elaborate more on this method this weekend. If it works, then I will be able to link the CUDA libraries and also use dynamic parallelism.

Without that functionality it is too crude and progress is too slow.

by **Julio Jerez** » Sat May 21, 2022 1:08 pm

OK, I have now tried all of the recommendations from many different sites, and this does not really work.

I have now made a very simple standalone CMake script that generates a DLL:

2>cuSortBodyAabbCells.obj
2>C:\Program Files\Microsoft Visual Studio\2022\Professional\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 11.7.targets(879,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin\nvcc.exe" -dlink -o ndSolverCuda_1.dir\Debug\ndSolverCuda_1.device-link.obj -Xcompiler "/EHsc /W4 /nologo /Od /FdndSolverCuda_1.dir\Debug\vc143.pdb /Zi /RTC1 /MDd /GR" -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin/crt" -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\lib\x64" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\lib\Win32\cudadevrt.lib" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\lib\Win32\cudart_static.lib" cudadevrt.lib cudart_static.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib -forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -gencode=arch=compute_52,code=sm_52 -G --machine 64 ndSolverCuda_1.dir\Debug\cuSortBodyAabbCells.obj" exited with code 1.
2>Done building project "ndSolverCuda_1.vcxproj" -- FAILED.
3>------ Build started: Project: ALL_BUILD, Configuration: Debug x64 ------
3>Building Custom Rule C:/tmp/dCudaTest/CMakeLists.txt
4>------ Skipped Build: Project: INSTALL, Configuration: Debug x64 ------
4>Project not selected to build for this solution configuration
========== Build: 2 succeeded, 1 failed, 0 up-to-date, 1 skipped ==========

Now it does call the NVIDIA linker, but that just leads to a new set of problems,
and this time the error does not even say what failed.

I read through the NVIDIA site and there are tons of people with the same problem, and so far I see that no one has gotten a solution.

I put the archive in a Google shared drive, but I do not know how to make it public.
If anyone knows a public file-sharing system where I can put it, so that maybe someone can tell me what to do to make it work, that would be great.

If I cannot get past this point, this seems again to be a death blow to this project, just like it was 12 years ago when NVIDIA refused to add atomics.

I am not spending more time on this, because it just seems to be an accumulation of error over error, with more time spent on workarounds than on making actual progress.

by **JoeJ** » Sat May 21, 2022 1:52 pm

Well, you could just make an alternative cuda project with a demo.
Interested people could then include your source directly into their own cuda project.
Not really an issue. I also did this all the time (including Newton sources instead of using a dll / lib).

The point is, for games we could not use cuda anyway, except if we want a game exclusively for NV owners.
So the remaining applications are tools, research, etc. And for those things there should be no problem.

Instead of a file sharing site you can upload a zip file to github for example.
