by Julio Jerez » Sun Nov 20, 2022 11:45 am
The threads in Newton are not sophisticated.
It has been a long time since I gave up on trying to make threads efficient.
The problem is that the synchronization objects are the same no matter who implements them.
And at the end of the day a sync object will execute an assembly instruction named syscall.
That is an operating system call into the kernel.
It is the only way to synchronize threads on x86 hardware.
That call can cost anywhere from a few thousand clock ticks to several million if a thread switches.
There is no way around that.
So after years of dealing with thread synchronization, my solution is to simply not use any synchronization objects.
Instead, use algorithmic solutions. In general that consists of taking your data, sorted by some criterion, in a flat array, dividing the array by the thread count, and having each thread run its own subset of jobs.
This actually works very well, but it has the problem that the sub-arrays are of equal size, so if the jobs are of variable complexity, the workload per thread is uneven.
There are a lot more subtleties than that, but that is the thrust of the method.
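As a minimal sketch of the static-partition idea (this is not Newton's actual code; the job type and function name here are hypothetical, with squaring an integer standing in for real work):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical example: the "jobs" are just integers, and "processing"
// a job means squaring it. In a real engine a job would be a body pair,
// a constraint row, etc.
static void runPartitioned(const std::vector<int>& jobs,
                           std::vector<long long>& results,
                           unsigned threadCount)
{
    results.assign(jobs.size(), 0);
    const size_t chunk = (jobs.size() + threadCount - 1) / threadCount;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        const size_t begin = t * chunk;
        const size_t end = std::min(jobs.size(), begin + chunk);
        if (begin >= end) break;
        // Each thread owns a disjoint slice of the flat array, so no
        // mutex or atomic is needed while the work is running.
        workers.emplace_back([&jobs, &results, begin, end] {
            for (size_t i = begin; i < end; ++i)
                results[i] = static_cast<long long>(jobs[i]) * jobs[i];
        });
    }
    for (auto& w : workers) w.join();  // the only synchronization point
}
```

The only place threads meet is the final join, which is why the slowest slice dictates when the whole batch finishes.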
Here is the big problem: even assuming the workload is even, each CPU core runs at a different speed, so the method has the effect that the thread pool is throttled by the slowest thread.
So I tried to fix that by using an atomic counter instead of fixed-size sub-arrays. The workload looks beautifully even, but overall it is slower.
I will post some profile traces so you can see what I mean.
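For comparison, here is a sketch of the atomic-counter variant (again with hypothetical names, not the committed code). Every thread pulls its next job index from one shared counter, so every job dispatch hammers the same cache line:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical example: same squaring "work" as before, but jobs are
// handed out one at a time through a shared atomic counter.
static void runAtomicCounter(const std::vector<int>& jobs,
                             std::vector<long long>& results,
                             unsigned threadCount)
{
    results.assign(jobs.size(), 0);
    std::atomic<size_t> next(0);  // shared dispatch point for all threads

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&jobs, &results, &next] {
            for (;;) {
                // Every thread contends on this one cache line; when the
                // jobs are many and short, this fetch_add dominates.
                const size_t i = next.fetch_add(1, std::memory_order_relaxed);
                if (i >= jobs.size()) break;
                results[i] = static_cast<long long>(jobs[i]) * jobs[i];
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

The load balancing is perfect, but the per-job cost of the contended fetch_add is what shows up in the traces.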
A way to explain it is by thinking of a multilane highway with a toll booth.
If the highway is not too busy, then not many vehicles pile up at the booth.
But if the highway is very busy, then you always see a pile of vehicles, sometimes several hundred meters long.
And the wider and busier the highway, the more severe the problem.
That is very much what happens in a physics engine. The tasks are many, short, and of variable size,
so any sync object becomes the bottleneck.
I put a lot of work into getting Newton 4 to be lock and atomic free, and it is the first time it can achieve almost linear scaling with thread count, core count, and SIMD instructions.
But since last week I had that idea of variable-size workloads, and forgot about the disastrous blocking effect of atomics, mutexes and all the other sync objects.
I will commit the change just to have it as a reference, but disabled under an ifdef.
Then I will just remove it, since it is not worth pursuing.
Here is a little story.
Back in 2004 and 2005
I registered to be a tester of CUDA with the 8800 GPU.
The biggest complaint I and almost everyone else had was that there were no atomic operations, so CUDA kernels were limited to very simple data-parallel operations. The people at NVIDIA thought there was no need for atomics, so they considered us idiots, and the animosity started to build up.
I concluded that it was not worth continuing, and I abandoned it.
Now, 15 years later, I am actually doing what they were saying: using sorting and parallel reduction to avoid atomics at all cost.
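A tiny sketch of that reduction idea, with names that are mine (not CUDA or Newton code): each thread accumulates into a private slot, and the slots are combined after the join, so the hot loop never touches a shared atomic.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical example: sum an array with per-thread partial sums.
// Each thread writes only its own slot in 'partial', so no atomic
// or lock is needed on the hot path.
static long long parallelSum(const std::vector<int>& data,
                             unsigned threadCount)
{
    std::vector<long long> partial(threadCount, 0);
    const size_t chunk = (data.size() + threadCount - 1) / threadCount;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        const size_t begin = t * chunk;
        const size_t end = std::min(data.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&data, &partial, t, begin, end] {
            long long s = 0;
            for (size_t i = begin; i < end; ++i) s += data[i];
            partial[t] = s;  // private slot, no contention
        });
    }
    for (auto& w : workers) w.join();

    // Final sequential reduction of one value per thread.
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```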