Opencl or Sycl

by **Julio Jerez** » Sun Jan 23, 2022 12:40 pm

One of the things that hampers OpenCL development is how verbose and low-level it is. It forces an app to rewrite every function, and when the code uses even moderately complex templates or C++11 lambdas, the job can be a huge undertaking.

About 8 years ago, I invested a lot of time in Microsoft AMP, which was template and lambda based, just to find out that by VS2012 it was deprecated and by 2013 dropped altogether.

I have now heard about this new SYCL, which seems to be a repackaging of what AMP was meant to be, but improved.

The thing is that I have not seen any support for it other than through the Intel compiler.

Has anyone used SYCL?

I converted all the multithreaded engine code in the SDK to be lambda based, with C++11 as the minimum.

The goal is for the user to make the transition to GPU with as few changes as possible, but it is as it always is: the support for these standards is so lacking.

It is almost as if they force you to use CUDA.

by **Julio Jerez** » Sun Jan 23, 2022 2:52 pm

One of the cool things about SYCL is that Intel supports it, and it integrates rather nicely with Visual Studio 2017.

I tried this about two years ago, when they said it integrated with VS 2015,
but it was a contrived process with many moronic steps that I could never get to work.
I took another look, and now they have an automatic integration that lets the user make a SYCL app just by creating a project: you select DPC++ and the type of project, and that's all. From the Intel page
https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-dpcpp-compiler/top.html

it says this:

Windows*
The compiler integrates into the following versions of Microsoft Visual Studio*:
Visual Studio 2019
Visual Studio 2017

And it is true, so I guess that we can just go for SYCL with a minimum of VS 2017.

I just made a hello world project, and this is what one of the SYCL kernels looks like:

Code: Select all:
//************************************
// Compute vector addition in DPC++ on device: sum of the data is returned in
// 3rd parameter "sum_parallel"
//************************************
void VectorAddInDPCPP(
    queue& q,
    const IntArray &addend_1,
    const IntArray &addend_2,
    IntArray &sum_parallel)
{
    // create the range object for the arrays managed by the buffer
    range<1> num_items{array_size};

    buffer<int, 1> addend_1_buf(addend_1.data(), num_items);
    buffer<int, 1> addend_2_buf(addend_2.data(), num_items);
    buffer<int, 1> sum_buf(sum_parallel.data(), num_items);

    auto TestKernel = [&](handler &h)
    {
        auto sum_accessor = sum_buf.get_access<dp_write>(h);
        auto addend_1_accessor = addend_1_buf.get_access<dp_read>(h);
        auto addend_2_accessor = addend_2_buf.get_access<dp_read>(h);

        h.parallel_for(num_items, [=](id<1> i)
        {
            sum_accessor[i] = addend_1_accessor[i] + addend_2_accessor[i];
        });
    };
    q.submit(TestKernel);
}

This is all C++ code, but guess what: this is how kernels look in Newton now.

Code: Select all:
void VectorAddInNewton(
    ndThreadPool* const threadPool,
    const ndArray<ndInt32>& addend_1,
    const ndArray<ndInt32>& addend_2,
    ndArray<ndInt32>& sum_parallel)
{
    auto TestKernel = ndMakeObject::ndFunction([&](ndInt32 threadIndex, ndInt32 threadCount)
    {
        D_TRACKTIME();
        const ndStartEnd startEnd(sum_parallel.GetCount(), threadIndex, threadCount);
        for (ndInt32 i = startEnd.m_start; i < startEnd.m_end; ++i)
        {
            sum_parallel[i] = addend_1[i] + addend_2[i];
        }
    });
    threadPool->ParallelExecute(TestKernel);
}

the similarity is remarkable.
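For readers without the Newton SDK at hand, here is a minimal standard C++ sketch of the same pattern: a lambda that receives (threadIndex, threadCount) and works on its own slice of the array. The names `SliceRange`, `SplitRange`, `ParallelExecute` and `VectorAdd` are hypothetical stand-ins, not Newton's API, and the way the range is split is only an assumption about what a helper like `ndStartEnd` might do.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: half-open slice [start, end) owned by one worker.
struct SliceRange {
    std::size_t start;
    std::size_t end;
};

// Splits `count` items as evenly as possible; the first (count % threadCount)
// workers each take one extra item. This is an assumed scheme, not Newton's code.
SliceRange SplitRange(std::size_t count, std::size_t threadIndex, std::size_t threadCount) {
    assert(threadCount > 0 && threadIndex < threadCount);
    const std::size_t base = count / threadCount;
    const std::size_t extra = count % threadCount;
    const std::size_t start = threadIndex * base + (threadIndex < extra ? threadIndex : extra);
    const std::size_t size = base + (threadIndex < extra ? 1 : 0);
    return {start, start + size};
}

// Runs kernel(threadIndex, threadCount) once per worker thread and waits for all of them.
template <typename Kernel>
void ParallelExecute(std::size_t threadCount, Kernel kernel) {
    std::vector<std::thread> workers;
    workers.reserve(threadCount);
    for (std::size_t i = 0; i < threadCount; ++i) {
        workers.emplace_back([&kernel, i, threadCount] { kernel(i, threadCount); });
    }
    for (auto& worker : workers) {
        worker.join();
    }
}

// Element-wise sum of two equally sized arrays, one slice per worker.
std::vector<int> VectorAdd(const std::vector<int>& a, const std::vector<int>& b,
                           std::size_t threadCount) {
    assert(a.size() == b.size());
    std::vector<int> sum(a.size());
    ParallelExecute(threadCount, [&](std::size_t threadIndex, std::size_t count) {
        const SliceRange r = SplitRange(sum.size(), threadIndex, count);
        for (std::size_t i = r.start; i < r.end; ++i) {
            sum[i] = a[i] + b[i];
        }
    });
    return sum;
}
```

The kernel body is the same one-line loop in both versions; only the code that hands out indices differs, which is why a SYCL `parallel_for` looks so close to it.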

There are a few hurdles: I can only get SYCL to work when selecting the host device. The GPU is recognized, but when I execute the kernel, it returns false.

Also, I have not found how to enumerate devices other than Intel's, but they say the new versions can select a CUDA back end, and it is OpenCL based, so if there is an OpenCL driver,
it should recognize not only Intel but also AMD and Nvidia GPUs.

Anyway, I will experiment with this because this could be a huge development for Newton.

by **Julio Jerez** » Sun Jan 23, 2022 3:22 pm

And it gets better and better. I added this bit of code:

Code: Select all:
auto platforms = sycl::platform::get_platforms();
for (auto &platform : platforms)
{
    std::cout << "Platform: " << platform.get_info<sycl::info::platform::name>() << std::endl;

    auto devices = platform.get_devices();
    for (auto &device : devices)
    {
        std::cout << " Device: " << device.get_info<sycl::info::device::name>() << std::endl;
    }
}

and it prints this:

Platform: Intel(R) FPGA Emulation Platform for OpenCL(TM)
Device: Intel(R) FPGA Emulation Device

Platform: Intel(R) OpenCL
Device: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz

Platform: AMD Accelerated Parallel Processing
Device: Ellesmere

Platform: Intel(R) OpenCL HD Graphics
Device: Intel(R) HD Graphics 630

Platform: Intel(R) Level-Zero
Device: Intel(R) Graphics [0x5912]

Platform: SYCL host platform
Device: SYCL host device

It is just that when you select a GPU device, they just pick the Intel 630,
but the AMD is there as well. Plus this can also run on the CPU, and with those huge multi-core CPUs, who knows.
I have been waiting for this for almost 15 years. This is starting to look pretty good.

by **Julio Jerez** » Sun Jan 23, 2022 4:04 pm

Well, I now have it creating a queue for a specific hardware vendor with this code:

Code: Select all:
queue q;
for (auto &platform : platforms)
{
    auto devices = platform.get_devices();
    for (auto &device : devices)
    {
        auto name = platform.get_info<sycl::info::platform::name>();
        int vendorName = name.find("AMD");
        //int vendorName = name.find("SYCL host platform");
        if (vendorName >= 0)
        {
            auto info = device.get_info<sycl::info::device::name>();
            std::cout << "Platform: " << name << std::endl;
            std::cout << " Device: " << info << std::endl;
            q = queue(device);
        }
    }
}

If I set it to select "SYCL host platform", it works fine,
but if I select "AMD",

it does recognize it and I get this:

Platform: AMD Accelerated Parallel Processing
Device: Ellesmere

But when it goes to execute the kernel, it gives me this error:

Device: Ellesmere
InvalidBuiltinSetName: Expects OpenCL.std. Actual is SPIRV.debug
[Src: ..\..\..\libSPIRV\SPIRVModule.cpp:594 SPIRVBuiltinSetNameMap::rfind(BuiltinSetName, &BuiltinSet) ]

Exception thrown at 0x00007FF9AAAB80B0 (amdocl64.dll) in DPCPPConsoleApplication1.exe: 0xC0000005: Access violation reading location 0x0000000000000008.

I do not know what that means, but so far the only device that seems to work is the host.
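A side note on the vendor test in the selection loop above: it stores the result of `std::string::find` in an `int` and checks `>= 0`, which relies on the "not found" value `std::string::npos` narrowing to a negative number. A small sketch of the same substring match with an explicit `npos` comparison follows; `FindPlatformByVendor` is a hypothetical helper that works on plain strings, so it needs no SYCL headers and does nothing about the SPIR-V error itself.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the index of the first platform name that
// contains `vendor`, or -1 when no platform matches.
int FindPlatformByVendor(const std::vector<std::string>& platformNames,
                         const std::string& vendor) {
    for (std::size_t i = 0; i < platformNames.size(); ++i) {
        // find() returns std::string::npos when the substring is absent.
        if (platformNames[i].find(vendor) != std::string::npos) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

In the SYCL loop the same comparison would be applied to the string returned by `platform.get_info<sycl::info::platform::name>()`.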

by **JoeJ** » Tue Jan 25, 2022 7:33 am

Afaik, the problem with AMD is that SYCL requires ROCm/HIP, but that only actually works on Linux.
I have some hope they can get it to work on Win11, as the given reason was something like Windows driver model restrictions. But idk if this changes for Win11.
Sadly there are rarely any official statements on the subject. I've picked up this info from forum discussions.
I'm confused as well, because I always assumed SYCL builds on top of OpenCL, so it should just work.

It's always the same. They want us to buy their HW, but they don't want us to use it.

by **Julio Jerez** » Tue Jan 25, 2022 7:52 am

It seems that the problem is that SYCL translates the kernel code to SPIR-V and passes that to the OpenCL driver to compile to a GPU binary.
But AMD drivers expect SPIR intermediate code.

Apparently AMD has some chip on their shoulder and would rather hurt themselves than do something that might somehow help Intel or Nvidia.

Meanwhile Nvidia is eating their lunch with CUDA.

CUDA is C++, and it just works. I had refused to use it, but the state of affairs that Intel, Microsoft, Apple, AMD, and the moronic Khronos Group have created is forcing everyone to just go Nvidia.

I went to buy a mid-range Nvidia card to start with CUDA, and believe it or not, it is not possible to buy one.

by **Julio Jerez** » Tue Jan 25, 2022 8:07 am

This is beyond the scope of what I am doing.

But it seems to me that anyone could download LLVM and modify Clang.

It would be relatively easy to take a cpp file and rename it to a different extension,
then pass that file to Clang and translate the input to a cpp file that is almost identical to the input, except that on the lambda kernel you add a keyword to let the Clang compiler know that it must translate that lambda to OpenGL compute shader language and add it as a static string to the file.

That will be generated cpp code, but it will be almost identical to the input; in fact it would work for everything.

And it can be made to generate both the cpp kernel and the OpenCL kernel.

All that is needed for that is a strong cpp parser, like Clang or GCC, which are open source.

My guess is that's what Microsoft did with AMP, and what Nvidia, Intel, ComputeCpp, and all the others do now, but they make it so hard to use.

Thus SYCL is not as easy as they make it seem. To start with, they impose compiler restrictions: you have to use their compiler, which automatically forces a DLL mode, ruling out many platforms.

After dabbling a little, I do not think I will go with SYCL; too many restrictions.

by **JoeJ** » Tue Jan 25, 2022 10:24 am

Julio Jerez wrote:But it seems to me that if anyone download llvm, and modifying Clang

I guess this would be interesting to have an efficient JIT compiled scripting language. But i don't want to dig into this either.

Regarding GPGPU, generating machine code is not the only problem.
We also need to control execution regarding queues, synchronization barriers, async compute, etc.
That's important, for example we want to run our GPGPU stuff async while rendering, so the GPU always has work to do while the gfx pipeline is stalled or underutilized. But we also want to prevent the cache thrashing that happens when running too many tasks at once.

That's the reason why i would not even want to use Cuda even if it were cross platform. Currently, only compute shaders allow us execution control at least at a somewhat fine level.
Thus, anything running on game client HW has to use compute shaders, even if a nightmare to maintain.
Situation would be different if using iGPU for physics and dGPU for gfx.
But that dream is over i guess. I'm happy enough if people in 3 years can afford just iGPU, and its performance reaches console levels.
dGPU looks totally dead to me at the moment and has no future, and PC platform feels outdated as a whole.
I really wonder who buys all those toasters currently, but i guess it's enthusiasts beside miners.
I doubt the average gamer is willing to spend 1000$ just for a teraflops 400W monster GPU.

At some point, 'low power gaming' might become the new marketing buzzword. We should lower expectations on increasing power.
Even mobile or completely new platforms could dethrone Win-PC easily at the moment.

by **Julio Jerez** » Tue Jan 25, 2022 11:51 am

Oh no, I do not mean anything so grandiose as compiling code.
I mean a translator from source code to source code.

The steps would be something like this:
- Make a core library that does the initialization of the environment, which could be OpenCL, but could also be DirectCompute, OpenGL compute shaders, Vulkan, or even CUDA.

- The library will have some template vector classes to write the code with.

- Then write a program using lambdas and adding a closure parameter, for example opencl.

- The tool will take the source, and anything outside the lambda is just copied to the output source file.
The code inside the lambda is converted to GLSL, DirectCompute, or a CUDA kernel, as a string of source embedded in the output file.

Then the rest is all handled by either the IDE or the makefile.

In Visual Studio, this could be a pre-build step.
Basically you rename the source file to something like
Filename.hpc

Then you add a file association rule.
Then every time the file changes, the build rule will regenerate
Filename.cpp

The reason I mention Clang is that it is the only open source compiler that is 100% cpp compliant
and can parse proper code.

Of course it is possible to do it with a less powerful parser, and we can add some restrictions.

One restriction would be that the library will have to be template based, since the tool will not be able to make function calls to functions that are compiled in other libraries.

But Clang has libc as part of the source, so the parser can find all the function calls from a lambda kernel and emit the equivalent in the GPU kernel.

by **JoshKlint** » Wed Jan 26, 2022 5:43 am

Why not Vulkan compute shaders?

by **Julio Jerez** » Sun Jan 30, 2022 2:48 pm

The problem is, I do not want to become an expert in the next flavor of shader programming language.
I leave that to the people who like that.
To me, all that shader nonsense should be part of the C++ language.
I always hated it when, to get access to some functionality in cpp, you have to write inline assembly, so I was very pleased when Microsoft dropped it altogether.

SYCL attracts me because it is an extension of C++, the same way Microsoft AMP was, until for some reason never explained they dropped it. I guess with MS you win some and you lose some, like mr Viny said.
I wish MS would adopt SYCL, but my guess is that, just as it took with OpenMP, it will take several generations of VS, maybe by 2025 or so.

I did not get any useful answers from the Intel forum, but I want to give a fair chance to this SYCL thing, so I ordered this book:

Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL

I will read it and then try to go by the book and see what that gets me.

by **JoeJ** » Sun Jan 30, 2022 3:05 pm

Julio Jerez wrote:To me, all that shader nonsense should be part of the C++ language.

Afaik, C++23 executors are planned to support running lambdas on GPU.
Not sure how practical and efficient this will be, but currently it's my last hope.
Hard to find any info, though.

by **Julio Jerez** » Sun Jan 30, 2022 4:32 pm

JoeJ wrote:Afaik, C++23 executors are planned to support running lambdas on GPU.

Yes, that would be so awesome and would totally end the proliferation of all the nonsense that has been floating around for 20 years.
I just hope that when they do it, they do not follow the advice of the Khronos Group; they may even sabotage it to the point of making it unusable.
We should have had that since C++11, maybe 14.

That's what SYCL really is, and what AMP was, but SYCL does not seem really well thought out. They try to cover everything and end up covering very little; so far I can't even get it to work on their own HD 630 GPU.
I am giving it the benefit of the doubt and assuming that it is me doing something wrong, hence the book.
But from what I can see, Microsoft AMP was far simpler and did the same thing.

by **JoshKlint** » Fri Feb 11, 2022 11:51 am

Vulkan takes a lot of work to set up the first time, but setting up an environment just for compute would be a lot simpler. It took me less than a day to get compute shaders working, having never used them before in any API.

by **Julio Jerez** » Fri Feb 11, 2022 1:30 pm

It is not about getting compute shaders working.
It is the low level of those APIs. It is the twenty-first century and those APIs still deal with C and shader languages.

They do not even deal with plain cpp classes.

Newton 4 now embraces full C++11 as the minimum, and the multithreaded code is based on very high-level lambdas. I put some effort into that on the promise that SYCL would be a solution.

Converting all that code back to C code and compute shaders, not to mention all the boilerplate code that is needed, is something I am no longer willing to do.

I wanted to give this SYCL a try, but if I can't get it to work, as much as I do not like Nvidia's underhanded tactics, I will hold my nose and get an Nvidia GPU card that supports CUDA.
I actually went to Best Buy and only found one card, a gf 730, that was CUDA ready, and when I got home I found it is so old that it does not even have HDMI connectors.
I will see if I can get a conversion cable.

It seems the SYCL solution is even more limited than what CUDA is offering.
I thought that by translating to OpenCL it would be cross-hardware wherever the driver supports it, but they went out of their way to make it incompatible. They are not translating to OpenCL; they are translating to an Intel intermediate OpenCL, so only Intel hardware is supported.
I asked these questions, and these are their answers.
https://community.intel.com/t5/Intel-on ... -p/1353852

Code: Select all: Thank you for your inquiry. We offer support for hardware platforms that the Intel® oneAPI product supports. These platforms include those that are part of the Intel® Core™ processor family or higher, the Intel® Xeon® processor family, the Intel® Xeon® Scalable processor family, and others which can be found here – Intel® oneAPI Base Toolkit System Requirements, Intel® oneAPI HPC Toolkit System Requirements, Intel® oneAPI IoT Toolkit System Requirements

So that makes CUDA not the worst candidate anymore.
