It is not fixed yet, but I know what it is.
It happens when changing the solver.
One of the new changes is that the scene and the solver are now plug-ins to the world.
Before, we only destroyed the solver and created a new one.
The scene is more complex because it is a web of entangled nodes that are used to manipulate all the objects.
So a new scene has to be recreated from the old one.
It uses a copy constructor for that, but I made a mistake somewhere. It is not a big deal, I will fix it later.
Anyway, yes, the GPU scene is now around 2.1 to 2.2 ms on my machine.
The good part is that, of that, the GPU is around 0.1 ms running the three 7 kernels, and about 0.2 ms goes to getting the data back from the GPU. There is about 0.5 ms that seems to be a fixed cost of the CUDA driver for calling synchronization.
But it seems there is no way around that. Not calling sync every update causes the driver to accumulate kernels and memcopies until it has enough to saturate the GPU, then it issues them all at once.
It seems the only way to force the driver to launch the kernels when using async streams is by calling synchronize.
That is OK, I am satisfied with that; it is running async and concurrent. And with the code that is still to come, the sync cost will be nothing.
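The per-update pattern described above could look roughly like this. It is only a minimal sketch: `scaleKernel` and `runUpdates` are made-up names standing in for the real scene kernels, and the only point is the order of the calls (queue kernel, queue readback, synchronize once per update).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for one of the scene kernels: scales every element.
__global__ void scaleKernel(float* data, int count, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] *= factor;
}

// Runs `updates` frames. Each frame queues a kernel and a readback on one
// stream, then synchronizes once. Returns true if the readback is correct.
bool runUpdates(int count, int updates)
{
    const size_t bytes = size_t(count) * sizeof(float);
    std::vector<float> host(count, 1.0f);
    float* device = nullptr;
    cudaStream_t stream;

    if (cudaStreamCreate(&stream) != cudaSuccess) return false;
    if (cudaMalloc(&device, bytes) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return false;
    }
    cudaMemcpyAsync(device, host.data(), bytes, cudaMemcpyHostToDevice, stream);

    const int threads = 256;
    const int blocks = (count + threads - 1) / threads;
    bool ok = true;
    float expected = 1.0f;
    for (int frame = 0; frame < updates && ok; ++frame) {
        // Both calls only queue work on the stream.
        scaleKernel<<<blocks, threads, 0, stream>>>(device, count, 2.0f);
        cudaMemcpyAsync(host.data(), device, bytes, cudaMemcpyDeviceToHost, stream);

        // Blocks the host until everything queued on this stream is done.
        // This is the once-per-update sync call discussed in the post.
        ok = (cudaStreamSynchronize(stream) == cudaSuccess);

        expected *= 2.0f;
        for (int i = 0; ok && i < count; ++i) ok = (host[i] == expected);
    }

    cudaFree(device);
    cudaStreamDestroy(stream);
    return ok;
}
```

Calling `runUpdates(1 << 16, 3)` from a `main` should return true on a machine with a CUDA device. The sketch does not measure the sync cost itself.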
Another big surprise I found is that in CUDA all memcopies are done by the CPU, or at least by using the GPU memmove. Even async copies are like that. This is about one third of the theoretical bus speed.
But if you use what they call host pinned memory, which is just a group of memory pages that are locked by the OS so that they do not change address, the copy uses DMA and reaches that top speed.
So the memcopy went from 2.5 GB per second to 6.5 GB per second. That is the big difference that you see now.
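To see the pinned versus pageable difference on a given machine, something like the sketch below can be used. The helper names `timedUploadGBs` and `compareUploads` are made up for illustration; the CUDA calls (`cudaMallocHost`, `cudaFreeHost`, `cudaMemcpy`, and the event timing functions) are the standard runtime API. The numbers it reports will depend on the hardware, so do not expect exactly the 2.5 and 6.5 GB per second quoted above.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

// Times one host-to-device copy with CUDA events.
// Returns the rate in GB per second, or a negative value on error.
float timedUploadGBs(void* deviceDst, const void* hostSrc, size_t bytes)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }
    cudaEventRecord(start, 0);
    cudaError_t err = cudaMemcpy(deviceDst, hostSrc, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (err != cudaSuccess || ms <= 0.0f) return -1.0f;
    return float((double(bytes) / 1.0e9) / (double(ms) / 1.0e3));
}

// Uploads the same amount of data from pageable memory (malloc) and from
// pinned memory (cudaMallocHost) and reports both rates.
bool compareUploads(size_t bytes, float* pageableGBs, float* pinnedGBs)
{
    void* device = nullptr;
    void* pinned = nullptr;
    void* pageable = std::malloc(bytes);

    bool ok = pageable != nullptr
           && cudaMalloc(&device, bytes) == cudaSuccess
           && cudaMallocHost(&pinned, bytes) == cudaSuccess;
    if (ok) {
        std::memset(pageable, 1, bytes);
        std::memset(pinned, 1, bytes);
        timedUploadGBs(device, pinned, bytes);  // warm-up, result ignored
        *pageableGBs = timedUploadGBs(device, pageable, bytes);
        *pinnedGBs = timedUploadGBs(device, pinned, bytes);
        ok = (*pageableGBs > 0.0f) && (*pinnedGBs > 0.0f);
    }

    if (pinned) cudaFreeHost(pinned);
    if (device) cudaFree(device);
    std::free(pageable);
    return ok;
}
```

For example, `compareUploads(64u << 20, &a, &b)` times a 64 MB upload each way. A single timed copy is noisy, so averaging several runs would give steadier numbers.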
So now it will run faster or slower on different systems depending on the kind of hardware.
PCIe 2 and 3 run at different speeds, but only when using that DMA hardware.
Anyway, 1.5 ms for a 27k scene is not bad.
But we now have to convert all the scene and solver kernels.
There is also a cool functionality that I did not expect with streams.
It seems streams can be used for heterogeneous kernel launches.
For example, say we have a routine that calculates sphere-sphere collision, one for sphere-box, and one for box-box.
That would be three kernel launches, or one very complex kernel full of switch cases. But with streams all three routines can run concurrently if they are mapped to separate streams.
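A minimal sketch of that idea, with each collision routine as its own small kernel on its own stream. Everything here is hypothetical: the `Sphere` and `Box` layouts, the kernel names and `runNarrowPhase` are invented for the example, the boxes are axis-aligned to keep it short, and managed memory is used only to cut down on copy code. Separate streams allow the kernels to overlap, but whether they really do depends on the hardware and on how much of the GPU each one needs.

```cuda
#include <cuda_runtime.h>

struct Sphere { float x, y, z, r; };
struct Box { float minx, miny, minz, maxx, maxy, maxz; };  // axis-aligned

__global__ void sphereSphere(const Sphere* a, const Sphere* b, int* hit, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float dx = a[i].x - b[i].x, dy = a[i].y - b[i].y, dz = a[i].z - b[i].z;
    float r = a[i].r + b[i].r;
    hit[i] = (dx * dx + dy * dy + dz * dz <= r * r) ? 1 : 0;
}

__global__ void sphereBox(const Sphere* s, const Box* b, int* hit, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Distance from the sphere center to the closest point on the box.
    float dx = s[i].x - fminf(fmaxf(s[i].x, b[i].minx), b[i].maxx);
    float dy = s[i].y - fminf(fmaxf(s[i].y, b[i].miny), b[i].maxy);
    float dz = s[i].z - fminf(fmaxf(s[i].z, b[i].minz), b[i].maxz);
    hit[i] = (dx * dx + dy * dy + dz * dz <= s[i].r * s[i].r) ? 1 : 0;
}

__global__ void boxBox(const Box* a, const Box* b, int* hit, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    hit[i] = (a[i].minx <= b[i].maxx && a[i].maxx >= b[i].minx &&
              a[i].miny <= b[i].maxy && a[i].maxy >= b[i].miny &&
              a[i].minz <= b[i].maxz && a[i].maxz >= b[i].minz) ? 1 : 0;
}

// Tests one pair per routine, each on its own stream.
// outHits receives: [0] sphere-sphere, [1] sphere-box, [2] box-box.
bool runNarrowPhase(int* outHits)
{
    Sphere* spheres = nullptr;
    Box* boxes = nullptr;
    int* hits = nullptr;
    cudaStream_t streams[3];
    int created = 0;

    bool ok = cudaMallocManaged(&spheres, 2 * sizeof(Sphere)) == cudaSuccess
           && cudaMallocManaged(&boxes, 2 * sizeof(Box)) == cudaSuccess
           && cudaMallocManaged(&hits, 3 * sizeof(int)) == cudaSuccess;
    while (ok && created < 3) {
        ok = (cudaStreamCreate(&streams[created]) == cudaSuccess);
        if (ok) ++created;
    }
    if (ok) {
        spheres[0] = Sphere{0.0f, 0.0f, 0.0f, 1.0f};
        spheres[1] = Sphere{1.5f, 0.0f, 0.0f, 1.0f};
        boxes[0] = Box{2.0f, -1.0f, -1.0f, 4.0f, 1.0f, 1.0f};
        boxes[1] = Box{3.0f, 0.0f, 0.0f, 5.0f, 2.0f, 2.0f};

        // Three different kernels, three streams: no switch cases needed.
        sphereSphere<<<1, 32, 0, streams[0]>>>(spheres, spheres + 1, hits + 0, 1);
        sphereBox<<<1, 32, 0, streams[1]>>>(spheres, boxes, hits + 1, 1);
        boxBox<<<1, 32, 0, streams[2]>>>(boxes, boxes + 1, hits + 2, 1);

        // Wait for all streams before the host reads the results.
        ok = (cudaDeviceSynchronize() == cudaSuccess)
          && (cudaGetLastError() == cudaSuccess);
        for (int i = 0; ok && i < 3; ++i) outHits[i] = hits[i];
    }
    for (int i = 0; i < created; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(hits);
    cudaFree(boxes);
    cudaFree(spheres);
    return ok;
}
```

With the test data above, the two spheres overlap, the first sphere misses the first box, and the two boxes overlap, so `outHits` should come back as 1, 0, 1.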
Anyway, there is a lot to learn. But CUDA offers a lot of things that are not possible with other languages like OpenCL.
I can see why CUDA beats it consistently.