Alright, this is it: the results are in.
These are the results of training on the MNIST data set on CPU vs GPU, both optimized.
The GPU could be optimized some more, but that would mean writing specialized shader optimizations
to capitalize on specific hardware, stuff like knowing the warp size or the local memory bank size, and I am not going for that. As long as the GPU is at least twice as fast as the optimized CPU, that's a win.
But in reality I am getting about a 4x+ speedup.
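For reference, those hardware details can be queried at runtime with the standard OpenCL API instead of being hard-coded. Here is a minimal sketch, not my training code: the device and kernel handles are assumed to come from the usual clGetDeviceIDs / clCreateKernel setup.

    #include <stdio.h>
    #include <CL/cl.h>

    /* Print the properties that matter for shader-level tuning: compute units,
       local memory size, and the warp/wavefront size, which OpenCL exposes per
       kernel as the preferred work-group size multiple. */
    static void print_tuning_info(cl_device_id device, cl_kernel kernel)
    {
        char name[256];
        cl_uint cu;
        cl_ulong local_mem;
        size_t warp;

        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cu), &cu, NULL);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(warp), &warp, NULL);

        printf("opencl device name: %s\n", name);
        printf("opencl device compute units: %u\n", cu);
        printf("opencl device local memory: %llu\n", (unsigned long long)local_mem);
        printf("preferred work-group size multiple (warp/wavefront): %zu\n", warp);
    }

The local memory bank size itself is not part of the standard queries (vendors expose it through extensions, if at all), which is part of why that kind of tuning gets hardware-specific very quickly.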
Here are the results for the small model (~335k parameters), training on MNIST for 20 epochs.
Best model: epoch: 18 success rate: 99.495003% training fail count:303 test fail count:205
epoch: 19 success rate:99.528328% training fail count:283 test fail count:206
results: multithreaded optimized cpu small model
mnist database, model number of parameters 335114
training time 74.433944 (sec)
training data results:
num_right: 59698 out of 60000
num_wrong: 302 out of 60000
success rate 99.496666%
test data results:
num_right: 9796 out of 10000
num_wrong: 204 out of 10000
success rate 97.959999%
This is the GPU:
opencl device name: gfx1101
opencl device version: OpenCL 2.0 AMD-APP (3652.0)
opencl device compute units: 30
opencl device local memory: 65536
Best model: epoch: 18 success rate:99.065002% training fail count:561 test fail count:205
epoch: 19 success rate:99.093338% training fail count:544 test fail count:206
results: opencl gpu small model
mnist database, model number of parameters 335114
training time 15.896462 (sec)
training data results:
num_right: 59439 out of 60000
num_wrong: 561 out of 60000
success rate 99.065002%
test data results:
num_right: 9796 out of 10000
num_wrong: 204 out of 10000
success rate 97.959999%
As you can see, the model isn't large enough to capture the underlying structure of the data.
As a result, it only achieves just under 98% accuracy on the test set.
Of course, I could apply techniques like dropout regularization or extend the training time, but doing so increases the number of trial-and-error iterations, precisely the kind of inefficiency that adding a GPU backend is meant to address.
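(Just to illustrate what that dropout option would look like, here is a minimal sketch of inverted dropout on a hidden-layer activation vector; the names and the keep probability are made up for the example, and none of the models benchmarked below use it.)

    #include <stdlib.h>

    /* Inverted dropout, applied only during training: each neuron is kept with
       probability keep_prob and scaled so the expected activation is unchanged;
       at inference time the layer is left untouched. */
    void dropout_forward(float *act, int n, float keep_prob)
    {
        for (int i = 0; i < n; ++i) {
            if ((float)rand() / (float)RAND_MAX < keep_prob)
                act[i] /= keep_prob;
            else
                act[i] = 0.0f;
        }
    }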
Note:
For context, once a model's training accuracy exceeds around 98%, further progress becomes extremely slow.
This is because the "learning" in neural networks is driven by gradients,
which just means computing the partial derivatives of a multivariable loss function and updating the weights in proportion to that vector of partial derivatives (w_new = w_old - learning_rate * dL/dw).
At that stage, most predictions are already correct, so the error between the predicted and true values is practically zero for most samples. As a result, the algorithm ends up averaging thousands of near-zero gradients and only a few meaningful ones, making it much harder to continue improving the model.
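A rough sketch of what that averaging looks like for a softmax output with one-hot targets (hypothetical names, not my actual kernel):

    /* Average the output-layer error over a mini-batch. Once most samples are
       classified confidently, pred - target is nearly zero for almost every
       sample, so the averaged gradient that drives w -= learning_rate * grad
       shrinks toward zero and progress stalls. */
    void average_output_delta(const float *pred, const float *target,
                              float *avg_delta, int batch, int classes)
    {
        for (int c = 0; c < classes; ++c)
            avg_delta[c] = 0.0f;

        for (int b = 0; b < batch; ++b)
            for (int c = 0; c < classes; ++c)
                avg_delta[c] += pred[b * classes + c] - target[b * classes + c];

        for (int c = 0; c < classes; ++c)
            avg_delta[c] /= (float)batch;
    }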
Anyway, a simple test to try is to increase the size of the hidden layers, using 512 neurons instead of 256.
The general consensus is that wider models are better at learning the underlying patterns in the data and tend to make more confident classifications, which usually translates into better performance on unseen data.
However, this change comes at a cost: the hidden-to-hidden weight matrices quadruple, and in practice training takes roughly two and a half times longer on both CPU and GPU. And since CPUs are significantly slower for this kind of workload, experimenting with architectural changes becomes impractical without GPU acceleration.
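For reference, the parameter counts below work out like this, assuming three fully connected hidden layers of width H (weights plus biases, 784 inputs, 10 outputs):

    H = 256: (784*256 + 256) + 2*(256*256 + 256) + (256*10 + 10) = 200,960 + 131,584 + 2,570 = 335,114
    H = 512: (784*512 + 512) + 2*(512*512 + 512) + (512*10 + 10) = 401,920 + 525,312 + 5,130 = 932,362

Doubling the width quadruples the two hidden-to-hidden matrices but only doubles the big 784-input layer, so the total grows by about 2.8x in parameters, and a bit less than that in measured time, rather than 4x.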
Here are the results of that test, with almost a million parameters:
epoch: 16 success rate:99.561668% training fail count:263 test fail count:188
Best model: epoch: 18 success rate:99.620003% training fail count:228 test fail count:166
epoch: 19 success rate:99.570000% training fail count:258 test fail count:191
results: multithreaded optimized cpu bigger model
mnist database, model number of parameters 932362
training time 195.515136 (sec)
training data results:
num_right: 59773 out of 60000
num_wrong: 227 out of 60000
success rate 99.621666%
test data results:
num_right: 9835 out of 10000
num_wrong: 165 out of 10000
success rate 98.349998%
And the GPU:
Best model: epoch: 15 success rate:99.404999% training fail count:357 test fail count:175
epoch: 16 success rate:99.404999% training fail count:357 test fail count:182
epoch: 18 success rate:99.436661% training fail count:338 test fail count:188
epoch: 19 success rate:99.436661% training fail count:338 test fail count:180
results: opencl gpu bigger model
mnist database, model number of parameters 932362
training time 36.439773 (sec)
training data results:
num_right: 59644 out of 60000
num_wrong: 356 out of 60000
success rate 99.406670%
test data results:
num_right: 9826 out of 10000
num_wrong: 174 out of 10000
success rate 98.260002%
It seems that, at least in this test, the consensus holds:
the model generalizes better on both training and test data, reaching a solid fraction over 98% accuracy in so few epochs.
The important part is that the test only takes 36 seconds on the GPU, while the CPU takes 195.
That's a 5.4x speedup factor (195.5 s / 36.4 s), which, according to my investigations, is better than the results people report for similar tests with TensorFlow and PyTorch, or at least in the ballpark.
Now it's back to training the robot one more time.

edit:
Ahh, and for completeness, the NVIDIA test gives these results.
Best model: epoch: 18 success rate:99.430000% training fail count:342 test fail count:169
epoch: 19 success rate:99.451668% training fail count:329 test fail count:172
results:
mnist database, model number of parameters 932362
training time 59.745972 (sec)
training data results:
num_right: 59659 out of 60000
num_wrong: 341 out of 60000
success rate 99.431664%
test data results:
num_right: 9832 out of 10000
num_wrong: 168 out of 10000
success rate 98.320000%
That's not bad for a legacy NVIDIA GPU; it is actually quite competitive with the AMD 7800.