JoeJ wrote:I guess it's because of synchronizing: Barrier until all 128 or 256 cores are done means a lot of unnecessary data swapping for only 8 virtual cores.
Unfortunately i can not reduce the workgroup size, so my project is worthless for CPU tests.
This is one of the mumbo-jumbo lingo reasons why I am abandoning OpenCL.
All that talk of work groups, how many cores, local memory, global memory, etc. should be their problem, not mine.
AMP hides all that from the user; they have STL-style vectors that, I imagine, do all of that on behalf of the user.
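For example, a minimal C++ AMP sketch of what I mean (written from memory as an illustration, not code from Newton): you wrap a std::vector in an array_view and launch a kernel, and the runtime handles the copies and the scheduling without you ever naming a work group or local memory.

#include <amp.h>
#include <vector>

// Add two vectors on the accelerator. No work groups, no local/global
// memory qualifiers: array_view wraps the std::vector and the runtime
// decides how to copy the data and schedule the work.
void AddVectors(std::vector<float>& a, const std::vector<float>& b)
{
    concurrency::array_view<float, 1> av((int) a.size(), a);
    concurrency::array_view<const float, 1> bv((int) b.size(), b);
    concurrency::parallel_for_each(av.extent,
        [=](concurrency::index<1> i) restrict(amp)
        {
            av[i] += bv[i];
        });
    av.synchronize();   // copy the result back to the host vector
}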
JoeJ wrote:What things do you want to accelerate?
I will accelerate the Newton solver and collision system.
Newton has a parallel solver that, by the end of Newton 1.5 and the beginning of 2.00, I converted to CUDA, and I got about a 5x performance gain on a GeForce 285 (24 cores).
I was able to do stacks of boxes 100 x 100 (5000 boxes),
but after begging Nvidia for some functionality I decided to abandon CUDA altogether.
The parallel solver is in Newton 3.00, but it is only a proof of concept, because there is a lot of brute force in the parallel solver. It is only good for massively parallel cores like a GPU or the new Intel Phi.
The solver is competitive with the solver that takes one island per core, but since it does a lot more calculations it is not practical.
To give you an example, say you have 1000 boxes resting on the floor.
For the single-threaded solver, that will be 1000 islands. Each core will resolve one box, but since the islands are small they will converge and the solver will terminate on an early-out exit for most islands.
If I send these same 1000 islands to the parallel solver, every island will do the number of iterations of the worst island, and that is a lot more calculation for the CPU.
For a GPU with 30 cores, the extra cores make up for the extra number of iterations.
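To make the difference concrete, here is a rough sketch of the two strategies (the Island type and RelaxOnce() sweep are made up for illustration, this is not the actual Newton 3.00 solver code):

// Hypothetical island type for illustration: one relaxation sweep
// returns the remaining constraint error.
struct Island {
    float error;
    float RelaxOnce() { error *= 0.5f; return error; }  // stand-in for a real sweep
};

// Per-island solver (one island per CPU core): each island iterates only
// until its own error drops below the tolerance, so small islands exit early.
void SolveIslandsPerCore(Island* islands, int count, int maxIterations, float tol)
{
    for (int i = 0; i < count; i++) {            // in practice one island per thread
        for (int iter = 0; iter < maxIterations; iter++) {
            if (islands[i].RelaxOnce() < tol) {
                break;                           // early-out: this island is done
            }
        }
    }
}

// Batched parallel solver (GPU style): all islands are stepped together,
// so every island pays for the iteration count of the worst island.
void SolveIslandsBatched(Island* islands, int count, int maxIterations, float tol)
{
    for (int iter = 0; iter < maxIterations; iter++) {
        float worstError = 0.0f;
        for (int i = 0; i < count; i++) {        // this inner loop is what runs wide on the GPU
            float error = islands[i].RelaxOnce();
            worstError = (error > worstError) ? error : worstError;
        }
        if (worstError < tol) {
            break;                               // exits only when the worst island converges
        }
    }
}

With 1000 resting boxes, the per-island loop exits after a sweep or two for almost every island, while the batched loop keeps touching all 1000 islands on every iteration until the worst one converges; on a CPU with a few cores that extra work is wasted, on a GPU with many cores it is hidden by the parallelism.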
Depending on how this goes, I will port the collision system as well, at least part of it.