Newton on GPU

A place to discuss everything related to Newton Dynamics.


Re: Newton on GPU

Postby JoeJ » Sun Sep 28, 2014 3:46 pm

On an i7-930 CPU with OpenCL, the same last test (Edit 2) takes 750 ms, so the GPU is a 100x speed-up.
My single-threaded C++ code takes 4000 ms; dividing by 4 cores would give 1000 ms.
My C++ stuff is not cache friendly: a bloated AoS struct with unused data.
And there are some differences in the algorithms, but for a coarse comparison it should suffice.
It seems I would not gain much using OpenCL on the CPU?
I'm surprised by that myself; if the CPU is the only interest, OpenCL seems not worth the trouble?

I see the single CPU at 6 GFLOPS
I see the single AMP CPU at 88 GFLOPS


Does this compare "standard C++ code" with the same code, just adding some macros, on the same hardware? That can't be true. "AMP CPU" means clustering multiple CPUs, yes?

Re: Newton on GPU

Postby Julio Jerez » Sun Sep 28, 2014 5:05 pm

The plain CPU number is just the code running as C++.

The second one, that says single AMP, is hard to know whether it is CPU or GPU, because Visual Studio Express does not support AMP debugging. I installed a script that is supposed to allow debugging, but it does not really help: I cannot inspect the GPU side, and I cannot set a breakpoint in any of the kernels.
My guess is that it is the GPU, even when I select the default accelerator, which is supposed to be the CPU. I think it only uses the CPU if the GPU present is not DirectX 11 compliant.

It will be hard to code anything without a debugger.

One thing is clear: AMP seems to be the easiest way to get the GPU into a project.

Re: Newton on GPU

Postby JoeJ » Sun Sep 28, 2014 5:25 pm

Oops, ignore what I wrote above. I forgot to switch to release mode.
The multithreaded OpenCL time is 675 ms.
My single-threaded C++ is 526 ms.
That means I beat OpenCL even using only one core. :roll:
I guess it's because of synchronizing: a barrier until all 128 or 256 work items are done means a lot of unnecessary data swapping for only 8 virtual cores.
Unfortunately I cannot reduce the workgroup size, so my project is worthless for CPU tests.

What things do you want to accelerate?

Re: Newton on GPU

Postby Julio Jerez » Sun Sep 28, 2014 6:03 pm

JoeJ wrote:I guess it's because of synchronizing: a barrier until all 128 or 256 work items are done means a lot of unnecessary data swapping for only 8 virtual cores.
Unfortunately I cannot reduce the workgroup size, so my project is worthless for CPU tests.

This is one of the mumbo-jumbo lingo reasons why I am abandoning OpenCL:
all that talk of work groups, how many cores, local memory, global memory, etc. should be their problem, not mine.
AMP hides all that from the user; they have an STL-style vector that, I imagine, does all of that on behalf of the user.
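Roughly, the AMP style looks like this (just a sketch with made-up buffer and function names, not Newton code); the runtime decides how to stage the data and schedule the threads:

Code:

#include <amp.h>
#include <vector>

using namespace concurrency;

// Minimal AMP-style kernel: array_view wraps the host vectors and the
// runtime handles staging the data on the accelerator; no work groups,
// local memory, or barriers are exposed to the caller.
void IntegrateVelocities(std::vector<float>& vel, const std::vector<float>& force,
                         float invMass, float dt)
{
    array_view<float, 1> v(static_cast<int>(vel.size()), vel);
    array_view<const float, 1> f(static_cast<int>(force.size()), force);

    parallel_for_each(v.extent, [=](index<1> i) restrict(amp)
    {
        v[i] += f[i] * invMass * dt;   // one thread per element
    });

    v.synchronize();   // copy the result back into the host vector
}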

JoeJ wrote:What things do you want to accelerate?

I will accelerate the Newton solver and collision system.
Newton has a parallel solver that, by the end of Newton 1.5 and at the beginning of 2.00, I converted to CUDA, and I got about a 5x performance gain on a GeForce 285 (24 cores).

I was able to do stacks of boxes, 100 x 100 (5000 boxes),
but after begging Nvidia for some functionality I decided to abandon CUDA altogether.

The parallel solver is in Newton 3.00, but it is only a proof of concept, because there is a lot of brute force in the parallel solver. It is only good for massively parallel cores like a GPU or the new Intel Phi.

The solver is competitive with the solver that takes one island per core, but since it does a lot more calculations it is not practical.
To give you an example, say you have 1000 boxes resting on the floor.
For the single solver that will be 1000 islands. Each core will resolve one box, but since the islands are small they will converge and the solver will terminate on an early-out exit for most islands.

If I send these same 1000 islands to the parallel solver, every island will do the number of iterations of the worst island, and that is a lot more calculation for the CPU.
For a GPU with 30 cores, the extra cores make up for the extra number of iterations.
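In plain C++ the trade-off is something like this (invented names and placeholder work, just to illustrate the iteration-count point):

Code:

#include <algorithm>

// Placeholder island and relaxation sweep; in Newton these would be the
// real constraint data and a Gauss-Seidel style pass returning a residual.
struct Island { int bodyCount; float residual; };

static float RelaxIsland(Island& isl)
{
    isl.residual *= 0.5f;          // pretend each sweep halves the error
    return isl.residual;
}

// Island-per-core solve: each island takes the early out as soon as it converges.
void SolveIslandsSerial(Island* islands, int count, int maxIter, float tol)
{
    for (int i = 0; i < count; ++i)
        for (int iter = 0; iter < maxIter; ++iter)
            if (RelaxIsland(islands[i]) < tol)
                break;             // small islands exit after a few sweeps
}

// Batched parallel-style solve: all islands step together, so every island
// pays the iteration count of the worst island in the batch. The extra work
// only pays off when there are enough cores to absorb it.
void SolveIslandsBatched(Island* islands, int count, int maxIter, float tol)
{
    for (int iter = 0; iter < maxIter; ++iter)
    {
        float worst = 0.0f;
        for (int i = 0; i < count; ++i)            // conceptually one GPU thread per island
            worst = std::max(worst, RelaxIsland(islands[i]));
        if (worst < tol)
            break;                 // only exits when the worst island has converged
    }
}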

Depending on how this goes, I will port the collision system as well, at least part of it.

Re: Newton on GPU

Postby JoeJ » Sun Sep 28, 2014 6:48 pm

Julio Jerez wrote:every island will do the number of iterations of the worst island


You could sort or bin the islands, say using bins of <8, <16, <32, ... bodies.
Then process all islands from one bin in parallel; the longest thread runtime should be closer to the shortest.
If this sounds practicable for you, think of using workgroup sizes of 8, 16, or 32 threads accordingly, roughly as in the sketch below.
The more threads, the more fast memory you can use. If you can solve the problem using that very limited memory (64k), you get great performance. If the problem is too large and has to be solved using slow memory, there is no great performance. It's often better to waste threads just to have enough fast memory. Balancing this seems the most important thing from what I've learned up to now.
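Something along these lines (a rough C++ AMP sketch with invented island data, just to show the binning idea; in OpenCL the tile size would be the workgroup size):

Code:

#include <amp.h>
#include <vector>

using namespace concurrency;

// Invented island descriptor: where its bodies start and how many it has.
struct IslandDesc { int firstBody; int bodyCount; };

// Solve one bin of similarly sized islands: one tile (work group) of
// TileSize threads per island, so runtimes inside a dispatch stay close.
template <int TileSize>
void SolveBin(const std::vector<IslandDesc>& bin, array_view<float, 1> bodyData)
{
    if (bin.empty())
        return;

    array_view<const IslandDesc, 1> islands(static_cast<int>(bin.size()), bin);
    extent<1> ext(static_cast<int>(bin.size()) * TileSize);

    parallel_for_each(ext.tile<TileSize>(),
        [=](tiled_index<TileSize> t) restrict(amp)
    {
        IslandDesc isl = islands[t.tile[0]];           // which island this tile works on
        int lane = t.local[0];                         // which body within the island
        if (lane < isl.bodyCount)
            bodyData[isl.firstBody + lane] *= 0.99f;   // placeholder per-body work
        t.barrier.wait();                              // the island finishes an iteration together
    });
}

// Bin the islands by body count, then dispatch each bin with a matching tile size.
// (Islands larger than 32 bodies would need more bins or a per-lane loop.)
void SolveAllIslands(const std::vector<IslandDesc>& islands, array_view<float, 1> bodyData)
{
    std::vector<IslandDesc> bin8, bin16, bin32;
    for (const IslandDesc& isl : islands)
    {
        if      (isl.bodyCount <= 8)  bin8.push_back(isl);
        else if (isl.bodyCount <= 16) bin16.push_back(isl);
        else                          bin32.push_back(isl);
    }
    SolveBin<8>(bin8, bodyData);      // short islands, small tiles
    SolveBin<16>(bin16, bodyData);
    SolveBin<32>(bin32, bodyData);    // longer islands, wider tiles
}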

This is why most of the mumbo jumbo is necessary to handle. Other things, like that stupid 3-dimensional workspace, are just confusing and totally unnecessary, and after learning 2 of those languages I don't know anymore which word means which thing... this takes me hours of AARRGH!

Re: Newton on GPU

Postby JoeJ » Mon Sep 29, 2014 5:06 am

Did the tests again on a 6-core i7-5820K system:

Single threaded C++: 400ms (/6 = 67ms)
AMD OpenCL driver: 85ms
Intel OpenCL driver: 58ms (Sorry Intel, at the end you have the lead)
GTX480 OpenCL: 8.5ms
GTX480 OpenGL Compute Shader: 20ms

On this CPU OpenCL performs well, don't ask me why.

Re: Newton on GPU

Postby Julio Jerez » Wed Oct 01, 2014 4:09 pm

I finally have the first kernel (shader) of the parallel solver going.

AMP seems very easy to get going. I see some cost for running the kernel; I hope it is not a linear cost.

AMP separates the programmer from the hardware 100%. You can see the assembly code of a shader in debug, but I have not seen it in release, so there is no way to know if it is doing AVX or SSE or GPU.

Also, the assembly language is not native code; it is some intermediate bytecode that I am guessing is compiled by the driver. I hope that's the case, because it seems very inefficient.

I will code the solver and some of the broadphase on AMP, but the main goal is to finish the cloth and destruction, because those can only be practical with massively parallel execution.

But it will be nice to have the physics engine with stable physics on the GPU, not the demos I have seen with 100 or a thousand falling boxes all jittering all over the place.
My goal is zero loss of quality; if anything, the quality should be better.

Re: Newton on GPU

Postby JoeJ » Thu Oct 02, 2014 2:37 am

Julio Jerez wrote:there is no way to know if it is doing AVX or SSE or GPU


You should be able to hear the fan of the GPU :)
Not being able to choose which processor should be used is not so good.
However, going the easy way first and developing the algorithms is a good idea.
(Even if my experience is different: I worked > 2 years on algorithms only on the CPU,
but when moving to the GPU everything changed dramatically.)
At a later point you can always decide to port to low level if performance problems arise.

Julio Jerez wrote:I will code the solver and some of the broadphase on AMP


For me tree traversal is a disappointment. A simple stackless tree per thread works best by far.
Packet traversal is elegant but slower. Breadth-first should be perfectly work- and data-coherent,
but even if I had enough memory for the stack it would be slower too.
Making the scene 2x larger, traversal takes 3x the time.
I'll seek a way to avoid per-sample traversal completely.
Collision traversal might do a little better, as it calculates only a small set of results.

For the solver however i'd expect a huge speed up.

Re: Newton on GPU

Postby Julio Jerez » Thu Oct 02, 2014 8:06 am

It does give a list of the devices when enumerating them, but the bytecode is the same for all,
so you do not know if it is using the special AVX instructions.
I am not too concerned; having the ability to debug is good enough for me, plus I believe this will also recognize the GPU on the CPU.

Yes, I guess tree traversal is not a good candidate for a GPU. I am not there yet.

Re: Newton on GPU

Postby pHySiQuE » Fri Oct 03, 2014 4:45 pm

You might consider just doing some specific features on the GPU, like fluid or dumb particles.

Re: Newton on GPU

Postby Julio Jerez » Mon Oct 06, 2014 7:39 am

Well, I am off to a bad start with my first kernels.

To test before I go on, I made a scene of 1800 stacked bodies.
This produces a memory buffer of about 9 megabytes of data that needs to be uploaded to the GPU each frame. The copy function stalls the GPU every 10 frames or so and is about 5 to 10 times slower than the CPU. DMA transfers on the PC are terribly slow.
I have to come up with a way to keep the data on the GPU, which makes everything extremely complex.

When I tried the first time I was using what they call array_view; these are the glue between GPU memory and CPU memory, but they were really slow, so I changed to use only arrays, which are buffers that reside on the device only, and then used the copy function, which issues DMA transfers to copy the data, but this is even slower.
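For reference, the two paths look roughly like this (a sketch with placeholder kernels, not the actual solver buffers):

Code:

#include <amp.h>
#include <vector>

using namespace concurrency;

// Path A: array_view is the glue between host and device memory; the
// runtime moves data lazily when a kernel touches it and on synchronize().
// Path B: array lives on the device only, and copy() issues explicit
// DMA transfers each frame.
void UploadPerFrame(std::vector<float>& hostData)
{
    const int n = static_cast<int>(hostData.size());

    // Path A: wrap host memory in an array_view.
    array_view<float, 1> view(n, hostData);
    parallel_for_each(view.extent, [=](index<1> i) restrict(amp)
    {
        view[i] *= 2.0f;
    });
    view.synchronize();                              // results back to the host

    // Path B: device-only array plus explicit copies.
    array<float, 1> deviceBuf(n);
    concurrency::copy(hostData.begin(), hostData.end(), deviceBuf);   // host -> device
    parallel_for_each(deviceBuf.extent, [&](index<1> i) restrict(amp)
    {
        deviceBuf[i] *= 2.0f;
    });
    concurrency::copy(deviceBuf, hostData.begin());                   // device -> host
}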

I read that AMP on Visual Studio 2012 does not support shared memory, but it does in 2013:
http://blogs.msdn.com/b/nativeconcurren ... -view.aspx

I guess I will have to revert to using array_view.
One thing is clear, and that is that general-purpose GPU programming remains gratuitously complicated.
The idea that you have to write code as if it were a graphics shader I find unacceptable.

I will wait until I get my copy of Visual Studio 2013 and see if it recognizes the native device with shared memory.

If it does, I will make it so that AMP is supported for VS 2013 and up.
Like I say, I am not interested in the GPU; I am interested in the CPU part.

Microsoft said that AMP would leverage the CPU using a native device called WARP, which uses AVX, or the GPU on the CPU if there is one, both of which use shared memory; but that is not true, at least not with Visual Studio 2012.
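One way to check what is actually available at runtime (a small sketch; the supports_cpu_shared_memory query needs the VS 2013 AMP runtime):

Code:

#include <amp.h>
#include <iostream>

using namespace concurrency;

// List the AMP accelerators and whether they expose CPU-shared memory,
// so the WARP / integrated-GPU situation can be checked at runtime.
void ListAccelerators()
{
    for (const accelerator& acc : accelerator::get_all())
    {
        std::wcout << acc.description
                   << L"  emulated=" << acc.is_emulated
                   << L"  shared_memory=" << acc.supports_cpu_shared_memory
                   << std::endl;
    }

    // The WARP software device can also be selected explicitly.
    accelerator warp(accelerator::direct3d_warp);
    std::wcout << L"WARP shared_memory=" << warp.supports_cpu_shared_memory << std::endl;
}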

Re: Newton on GPU

Postby manny » Mon Oct 06, 2014 10:24 am

Wow. That sounds really bad. Looking forward to your results with VS2013.
http://www.instaLOD.io - InstaLOD - State of the art 3D optimization

Re: Newton on GPU

Postby JoeJ » Mon Oct 06, 2014 10:31 am

Uploading and downloading 8 MB to AGP GPU takes 4.4ms for me with OpenCL.
Unfortunately I have no integrated GPU to test. I was very disappointed that the brand-new i7-5820K at work also does NOT have an integrated GPU! :evil:
The idea of shared GPU / CPU memory is what the PC platform really needs, but I'm afraid it will take years until that becomes mainstream.
There's no marketing, and gamers do not know about the need to have this. They think their platform is so much faster than consoles and do not care about the AGP bottleneck. And if time proves them wrong, the problem must be driver overhead or lazy porting :)

All in all, there remains only a small use case for accelerated physics: a Microsoft OS (>7!) and an integrated GPU. But offering it as an alternative might change the situation at least a bit ;)

Let me know how it works with 2013...

Re: Newton on GPU

Postby Julio Jerez » Mon Oct 06, 2014 10:54 am

JoeJ wrote:Uploading and downloading 8 MB to AGP GPU takes 4.4ms for me with OpenCL.

That's what it takes on this machine too, but every 10 to 20 milliseconds there is a stall of 100 milliseconds or so.

The thing is that a PC has about 6 GB per second of bandwidth, and the GPU's is much higher; I do not know why it does that kind of stuttering.

In any case, a 10-megabyte transfer is not even a huge amount of data; consoles do a lot more than that.
And I expect to have at least 5 to 10,000 bodies, which will produce around 100 MB of working data each frame.

Re: Newton on GPU

Postby JoeJ » Fri Oct 10, 2014 7:13 am

JoeJ wrote:For me tree traversal is a disappointment.


I was wrong about that; traversal, including reading all this huge random-access data, is very fast.
After I disabled writing out the results (20 MB), the kernel goes from 11 ms to 1.5 ms.

I found out that on an AMD GPU you really need to write to adjacent memory locations.
In a synthetic test the speed difference is a factor of 20 :shock:

I guess it's the same for AMD APUs, so keep in mind :)
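The synthetic test was basically the difference between these two write patterns (sketched here as C++ AMP kernels rather than my OpenCL ones):

Code:

#include <amp.h>

using namespace concurrency;

// Two synthetic write patterns: neighbouring threads writing neighbouring
// elements coalesce into few memory transactions; a strided pattern does not.
void WritePatterns(array_view<float, 1> out, int stride)
{
    const int n = out.extent[0];

    // Coalesced: thread i writes element i.
    parallel_for_each(out.extent, [=](index<1> i) restrict(amp)
    {
        out[i] = 1.0f;
    });

    // Scattered: thread i writes element (i * stride) % n, so adjacent
    // threads hit addresses far apart and the writes no longer coalesce.
    parallel_for_each(out.extent, [=](index<1> i) restrict(amp)
    {
        out[(i[0] * stride) % n] = 1.0f;
    });
}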
