Newton on GPU

by **JoeJ** » Wed Sep 24, 2014 6:36 am

There was a thread for this but it seems to be gone.

Julio, if I remember right, you had issues with OpenCL because of some AMD / Intel mismatch, and started looking at AMP for that reason.
I myself believed that it's not possible, for example, to use both an Intel CPU and an NV GPU at the same time.
Maybe you were wrong in the same way I was. In fact you can use all available hardware; you just need to install the OpenCL driver for each.

Here on an Intel/NV system I can even choose whether I want to use the AMD or the Intel driver for the CPU.
My log output:

Available Platform Vendor (0): NVIDIA Corporation
Device 0 GeForce GTX 480 Device ID is 515788048

Available Platform Vendor (1): Advanced Micro Devices, Inc.
Device 0 Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz Device ID is 166825688

Available Platform Vendor (2): Intel(R) Corporation
Device 0 Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz Device ID is 168043664

That's awesome, isn't it?
I took my project from an Intel/ATI machine and there was nothing I had to change (same libs, same headers).

Some interesting performance measurements I can share:
(I just ported a single compute shader to OpenCL. It does a tree traversal similar to a collision broadphase, which is not very GPU friendly.)

i7-5820K 6core Intel OpenCL Driver: 7.8 ms
i7-5820K 6core AMD OpenCL Driver: 2.4 ms
i7 930 (AMD Driver): 8.3 ms

R9-280X OpenCL: 4.2 ms
R9-280X OpenGL Compute Shader: 4.3 ms

GTX480: OpenCL: 1.59 ms
GTX480: OpenGL Compute Shader: 4.3 ms

NV seems to have removed their OpenCL throttle in the latest drivers to look good for the Maxwell launch.
That was my primary reason for not using OpenCL, and it's gone.

EDIT:
Some more numbers from another shader.
It does typical brute-force GPU work, all in local memory, with some synchronisation and no atomics:

i7 930 (AMD Driver): 830 ms
R9 280X OpenCL: 3.94 ms
GTX670 OpenCL: 10 ms

EDIT2:
With all optimizations re-enabled, I can now compare APIs and GPUs:

GTX670: OpenCL: 8.15 ms (3 / 6)
GTX670: OpenGL Compute Shader: 18.3 ms (3.3 / 15.6)

R9 280X OpenCL: 5.9 ms (4.2 / 1.6)
R9 280X OpenGL Compute Shader: 6.6 ms (4.4 / 2.8)

The two numbers in parentheses refer to the two shaders:
1. Tree traversal, no shared memory, no atomics, random memory access
2. Interreflection, all in shared memory, lots of atomics
Only the first number (total runtime of both) is accurate.

We can see that ATI and NV have different talents, but more importantly:
OpenCL is simply faster on both. The source code is the same for both APIs - just syntax changes.
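
To put a rough number on that, here is a minimal C++ sketch that divides the OpenGL compute shader total by the OpenCL total for each GPU, using only the EDIT2 totals above. The `Timing` struct and the `speedup` and `print_speedups` helpers are hypothetical names made up for this illustration; they are not part of any benchmark code.

```cpp
#include <cassert>
#include <cstdio>

// Total runtimes in ms, copied from the EDIT2 numbers above.
struct Timing {
    const char* gpu;
    double opencl_ms;
    double gl_compute_ms;
};

// How many times faster the OpenCL version is than the OpenGL compute shader.
double speedup(const Timing& t) {
    assert(t.opencl_ms > 0.0);
    return t.gl_compute_ms / t.opencl_ms;
}

// Print the ratio for both GPUs from the EDIT2 list.
void print_speedups() {
    const Timing timings[] = {
        {"GTX670", 8.15, 18.3},
        {"R9 280X", 5.9, 6.6},
    };
    for (const Timing& t : timings) {
        std::printf("%s: OpenCL is %.2fx faster\n", t.gpu, speedup(t));
    }
}
```

Calling `print_speedups()` gives roughly 2.25x for the GTX670 and 1.12x for the R9 280X, so the gap is large on NV and small on ATI.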

The bad result for the second compute shader on the GTX670 is strange.
It's from a computer with an old dual-core CPU.
I remember it being around 12 ms on the i7.

Looking here shows it's a fair comparison: http://gpuboss.com/gpus/Radeon-R9-280X-vs-GeForce-GTX-670

Send me a GTX980 / 290X for up-to-date results :lol:

by **Julio Jerez** » Wed Sep 24, 2014 10:41 am

I see you have results for those on the same Intel hardware.
You mean you can install the AMD and Intel OpenCL drivers on the same machine?

Also, you have AMD beating Intel on Intel hardware by almost threefold?
Do you have a comparison to a CPU version running on a thread pool?

My main reason to go to OpenCL is that as hardware evolves it becomes very tedious to use its resources.
For example, AVX registers and the new instructions are awesome, but using them gets you into a combinatorial explosion of code bloat.
So my main goal is OpenCL on the CPU, but I also want it to be general.

So if this is the case, maybe it is time to go back and take another look.

by **JoeJ** » Wed Sep 24, 2014 11:03 am

Julio Jerez wrote: You mean you can install the AMD and Intel OpenCL drivers on the same machine?

Yes, on that machine I have the Intel, NV and AMD drivers installed. opencl.dll is always the same, I believe, but there is an additional opencl-amd.dll, for example.

Julio Jerez wrote: Also, you have AMD beating Intel on Intel hardware by almost threefold?

Yes, it seems AMD puts more effort into OpenCL, while Intel just makes it work. Intel has additional drivers for Xeon Phi; maybe installing those too would help. Also, I forgot to check why the HD4000 Intel GPU has not been listed. Need to check tomorrow...

Julio Jerez wrote: Do you have a comparison to a CPU version running on a thread pool?

Not sure what you mean. My original CPU template algorithm is single-threaded only; I did not plan to keep it on the CPU.
Interestingly, however, my i7-930 with the AMD DLL performs better with a work-group size of 128 than with a size of 4 or 8.
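
One thing that certainly changes with the work-group size is how many groups the runtime has to schedule. Whether that is what makes 128 faster on the CPU driver is only a guess. A minimal C++ sketch of the arithmetic, where `work_group_count` and `padded_global_size` are hypothetical helper names and the 100000 work-items in the usage note are an assumed figure, not one from the shader above:

```cpp
#include <cassert>
#include <cstddef>

// Number of work-groups needed to cover `items` work-items
// when each group holds `local_size` work-items (rounded up).
std::size_t work_group_count(std::size_t items, std::size_t local_size) {
    assert(local_size > 0);
    return (items + local_size - 1) / local_size;
}

// Global size padded up to a multiple of the local size.
// OpenCL 1.x requires the global size to be evenly divisible by the local size.
std::size_t padded_global_size(std::size_t items, std::size_t local_size) {
    return work_group_count(items, local_size) * local_size;
}
```

For an assumed 100000 work-items, a size of 4 means 25000 groups, a size of 8 means 12500, and a size of 128 means only 782 (with the global size padded to 100096).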

Julio Jerez wrote: For example, AVX registers and the new instructions are awesome, but using them gets you into a combinatorial explosion of code bloat.

I just started, but OpenCL seems to be the solution for this... I'm very excited.

I've done a lot with compute shaders. OpenCL is much easier to use and less bloated.
Otherwise there is not much difference between the two; both are well suited to either general-purpose work or plain graphics.
I hope OpenCL turns out to be easier to debug. If so, I can stop writing everything on the CPU first and porting it to the GPU later.

EDIT:
I'll post more numbers when I've got more shaders running. You can't get a good picture from only one algo. The ATI GPU looks bad here, but overall it's nearly twice as fast as the GTX 480 or 670.

by **Julio Jerez** » Thu Sep 25, 2014 11:33 am

Have you tried AMP? I really like the AMP idea.

I was reading more about it, and it seems there will be some adoption from Intel and some other compiler makers. I would never expect Apple to embrace it, but to me that will be their loss.

To me, OpenCL, CUDA, and DirectCompute are abominations.
Of the three, OpenCL is probably the best, but like everything the Khronos Group does, they take a good idea and ruin it. For the past five years the Khronos Group has been the worst thing to happen to computer science. OpenCL is just too hard and too cumbersome to use.

I think I will try this again, but this time with AMP.
AMP seems straightforward and very similar to what OpenMP was for the CPU.
And more importantly, you get it all in one development environment.

by **JoeJ** » Thu Sep 25, 2014 3:17 pm

No, I have not tried AMP. It looks more feature-rich than I initially thought, and also easier to use than OpenCL. But I'm not sure you get the same fine-grained control over synchronisation and memory.
On the introductory page I just looked at, there is no example of how to use local memory. Without direct control over that it's probably impossible to get peak performance. That's primarily a GPU-only issue, but surely an important one. Using it for one thing or another quickly makes a performance difference of a factor of 10-50.

If it's possible to debug in Visual Studio, go for it...

Looking back, I again lost hours without success porting code from a compute shader to OpenCL.

It's just a black box, a lot of guessing and: "aaargh - I don't want to write a lot of debug code to output what's going on, let's try this and that first..."
I have not looked at AMD's CodeXL tool yet, but with the free VS Express there are not many options for GPU debugging in general.

Julio Jerez wrote: OpenCL is just too hard and too cumbersome to use.

Yep. That's true. It's exactly the same problem as with your vehicle. :twisted:

Usually, complexity can't be reduced below a level that is still too high.
Nothing to be done about it. :wink:

by **Julio Jerez** » Thu Sep 25, 2014 4:22 pm

JoeJ wrote: But I'm not sure you get the same fine-grained control over synchronisation and memory. On the introductory page I just looked at, there is no example of how to use local memory. Without direct control over that it's probably impossible to get peak performance. That's primarily a GPU-only issue, but surely an important one. Using it for one thing or another quickly makes a performance difference of a factor of 10-50.

That is exactly the kind of thing I want to avoid. I had tried CUDA and OpenCL, and I realized that after you read the introduction and start programming with it, you get consumed trying to understand the so-called "virtual processor abstraction" that they pulled out of their asses.
While that's neat, after a while you realize that all you accomplish is a lot of work for very little payoff.
I am now at a point where I want an abstraction layer that separates me from local memory, pages, threads, warps, and all the lingo that comes with each of those pseudo-languages.

Yes, I understand that OpenCL, DirectCompute, or CUDA would probably yield better performance.
But I will be OK if, for example, I get 10x performance using OpenCL and only 3x to 4x using AMP.
I'd take simplicity of use over peak performance.

I am so convinced that I will use AMP that I am now removing the OpenCL projects that I started about a year ago.

I would use OpenCL if there were no alternative, but a template-based implementation for HPC that is integrated into your IDE is too good to pass up.

by **JoeJ** » Thu Sep 25, 2014 5:38 pm

Agreed.
For me the situation is different, as I need to get as fast as possible now.
Right now my GI is barely fast enough for diffuse, and I need to fight for a few ms to add reflections.

I'm still not asking for Newton to run on the GPU (there will be no cycles left).
Also, I think character cloth and fluids are graphics-engine stuff.
I plan to use Newton cloth and soft bodies only for game mechanics, not for eye candy.

by **manny** » Thu Sep 25, 2014 5:45 pm

I have used AMP on a next-gen console. It is really cool and simple; unfortunately it's Microsoft only. Microsoft has a slate of multiprocessor tech that is cool. IIRC AMP is only one part of that slate.

Julio Jerez wrote: I am so convinced that I will use AMP that I am now removing the OpenCL projects that I started about a year ago.
I would use OpenCL if there were no alternative, but a template-based implementation for HPC that is integrated into your IDE is too good to pass up.

AMP might not be a wise choice for a cross-platform library. Mobile phones are incredibly important, and so is the PS4. Going AMP means Windows PC only (plus xbone), and that is not a wise choice for Newton Dynamics.
Maybe you should wait a little before doing that. I think I read somewhere that the OpenMP 4 spec contains "accelerator" support to offload work to the GPU.
OpenMP, like AMP, is really simple to use and embeds nicely into your existing C/C++ code.
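
For a rough idea of what that offloading looks like, here is a minimal sketch using the OpenMP 4.0 `target` construct. The `saxpy` function is a hypothetical example, not Newton code. Whether the loop really runs on a GPU depends entirely on the compiler and its offload support; without it (or without OpenMP enabled at all) the loop just runs on the host.

```cpp
#include <cassert>

// y = a * x + y over n elements.
// The pragma asks an OpenMP 4.0 compiler to run the loop on an attached
// device, copying x to the device and y both ways. If no device or no
// offload support is available, the loop executes on the host instead.
void saxpy(float a, const float* x, float* y, int n) {
    assert(n > 0);
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop body is ordinary C++, so the same source still builds with a compiler that knows nothing about OpenMP.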

JoeJ wrote: Also, I think character cloth and fluids are graphics-engine stuff.

In our current xbone game we use cloth and soft bodies for things like capes for our characters, and that is important. It's more than just eye candy, as that character (in this particular case a superhero) requires it visually.

by **Julio Jerez** » Thu Sep 25, 2014 10:21 pm

Manny, I think AMP is being adopted by many compiler makers; check this out:
http://www.hsafoundation.com/bringing-camp-beyond-windows-via-clang-llvm/

You are right, it may be the case that companies like Nvidia, Apple and Sony will never cave and adopt it, but I know for a fact that platforms like the PS4 use Clang/LLVM, so the only reason for them not to adopt it is politics. As a matter of fact, the CPU in the PS4 and the Xbox One is the kind of CPU on which AMP will shine the most.

I think that the more people start adopting AMP, the more it will force draconian companies like Apple and Sony to adopt the progress. I see no reason for OpenCL, CUDA, or DirectCompute.

by **manny** » Fri Sep 26, 2014 6:04 am

You are right, that looks cool and I wasn't aware of it. But that's why I said you should wait a little before committing to such a heavy decision. We're currently at a crossroads in this technology, and nobody can predict which way it will go, OpenMP or AMP. I think you are right that CUDA and OpenCL don't have a bright future (even though the OpenCL implementation on LLVM+OSX is kinda cool).
But I think OpenMP 4.0 with the GPU accelerators could be the way to go, as OpenMP is already supported by most compilers (MSVC, LLVM, GCC).

GCC 4.9 supports OpenMP 4.0 for C/C++
Support for the parts of the OpenMP 4.0 language that are not associated with the "target" constructs are contained in the "runtime" directory. Support for offloading computation via the "target" directive is in the separate "offload" directory. That builds a library that provides the interfaces for transferring code and data to an attached computational device. Initial support here is for the Intel® Xeon Phi™ coprocessor, but work is beginning to support other attached computing devices, and the design is intended to be general.
MSVC11 supports OpenMP 2.0
Intel C++ supports OpenMP 4.0

So basically nearly all relevant compilers already support OpenMP 4.0. Anyway, I think it's best to wait until one of the technologies emerges as a clear winner. Otherwise you might have to rip things out after another year.

EDIT: Just read the feature list of that LLVM x AMP functionality. It's quite interesting that it compiles to OpenCL, so that could actually work as a technology bridge - if it actually flows into the main LLVM repository.

by **Julio Jerez** » Fri Sep 26, 2014 8:23 am

I love OpenMP.
A few years ago, the last version of Newton 1.5 fully supported OpenMP, but I had to remove it because it was so buggy, and only Intel supported it at the time.
Microsoft came out with support for it later, with VS 2005, and to my surprise I was getting a net loss using OpenMP with Visual Studio. So I removed it for Newton 2.00.

by **manny** » Fri Sep 26, 2014 3:32 pm

Julio Jerez wrote: I love OpenMP

Me too, doing multicore development with #pragmas rocks. It has become a really mature product in the meantime; unfortunately MS still lags behind, implementing only OpenMP 2.0.
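
As a minimal sketch of what that looks like, here is a dot product parallelised with a single pragma. The `dot` function is a hypothetical example, not Newton code. It sticks to OpenMP 2.0 features (`parallel for` with a `reduction` clause and a signed loop index), so it should also be fine for a compiler that only implements 2.0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product with the loop split across cores by one OpenMP pragma.
// Built without OpenMP support, the pragma is ignored and the loop
// runs single-threaded, giving the same result up to rounding order.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += a[static_cast<std::size_t>(i)] * b[static_cast<std::size_t>(i)];
    }
    return sum;
}
```

Build with `-fopenmp` on GCC or Clang, or `/openmp` on MSVC, to actually get the threads.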

Probably the best thing that can be done right now is nothing at all: wait until a good cross-platform solution becomes available, or just stick with OpenCL.

by **Julio Jerez** » Sat Sep 27, 2014 1:56 pm

The only problem with Microsoft software is how horrendous it is.
They always talk a good game, with ideas and how they will dominate, but the moment you decide to try anything they do,
you immediately remember the things that made you abandon Microsoft in the first place.
Here is just one of their blog posts about AMP:
http://blogs.msdn.com/b/nativeconcurren ... c-amp.aspx

If you have tried to run C++ AMP program in Visual Studio 2012 on Windows 8 in debug configuration, you might have noticed that after a successful execution there are DXGI warnings reported in the output window. Long story short – there is nothing to worry about; these warnings may be safely ignored. If you are interested in the background, read on for more information.

Further down it goes on to say that they do not release the objects and that you will get all kinds of warnings and memory leaks, but not to worry because they will take care of that.

That was not the only site I found complaining about AMP's famous memory leaks.
This is the same BS that happened with OpenMP: it appears as if they were creating and destroying threads each time you call a kernel, making it several times slower than any other implementation.
It has been almost three years since they announced AMP. You would think it would be reliable by now, but as usual it takes Microsoft 5 to 10 years to clean up anything they do.
This does not inspire much confidence in me.

by **JoeJ** » Sun Sep 28, 2014 3:19 am

Added final results to first post. Focus on GPU, but still interesting.

by **Julio Jerez** » Sun Sep 28, 2014 2:08 pm

Do you have a reference number for the CPU?

The cool thing about AMP is that they claim to have an efficient CPU fallback that they call WARP.
When I run the nbody demo,

I see the single CPU at 6 gflops
I see the multicore CPU at 24 gflops
I see the single AMP CPU at 88 gflops
I see the tile AMP CPU at 150-199 gflops
I see the tile AMP GPU at 150-199 gflops

Now, the demo is very simple and 100% data parallel, so the test is very misleading, but if the
figures scale in the same proportion, it seems that AMP can get a 2 to 3 fold performance gain on my system.
My GPU is not very good, but I am not really counting on the GPU; I am more interested in the CPU results.
My guess is that they can get such a high yield because they can bypass all the overhead that the Windows threading system has to add to support multithreading.
