about 0.7 ms.
That comes to about 90 Mkeys per second, which is very, very low compared to the numbers people claim.
My keys are 128 bits, so that is 4 times the bandwidth, because they measure 32-bit keys.
So even being generous and multiplying my result by 4, that's under 400 Mkeys per second.
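To make that arithmetic explicit, here is a minimal sketch. The helper name equivalentKeyRate is made up for illustration, and it assumes the cost scales linearly with key width, which is the generous assumption used above.

```cpp
// Hypothetical helper: scale a measured key rate to the equivalent rate for a
// narrower reference key, ASSUMING cost grows linearly with key width.
constexpr double equivalentKeyRate(double keysPerSecond, int keyBits, int referenceBits)
{
    return keysPerSecond * static_cast<double>(keyBits) / referenceBits;
}

// 90 Mkeys/s with 128-bit keys, expressed in 32-bit-key terms, stays under 400 Mkeys/s.
static_assert(equivalentKeyRate(90e6, 128, 32) < 400e6,
              "scaled rate should stay under 400 Mkeys/s");
```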
I see claims of anywhere from 1 to 16 Gkeys per second, even on much older GPUs, which I tend not to believe.
When I run the NVIDIA sorting demos, which come in a library, the result is about 1 Gkey per second.
Sorting 1048576 32-bit unsigned int keys and values
radixSortThrust, Throughput = 1073.0961 MElements/s, Time = 0.00098 s, Size = 1048576 elements
Test passed
But the problem with those demos is that they are so misleading: sorting an array of 32-bit keys only is useless. You have to at least add one extra word to the key as an index to the item, and that automatically cuts the throughput in half.
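To illustrate the extra index word, here is a small CPU-side sketch. The names KeyIndex and sortWithIndex are made up for this example, and std::stable_sort is only a stand-in for the GPU radix sort. The point is that each element is now two 32-bit words instead of one, so every pass moves twice the bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record: the 32-bit key plus one extra word that points back
// to the original item.
struct KeyIndex {
    std::uint32_t key;
    std::uint32_t index;
};

// Twice the size of a bare key, so twice the memory traffic per element.
static_assert(sizeof(KeyIndex) == 2 * sizeof(std::uint32_t),
              "key + index is two words");

// Sort by key only, carrying the original position along as payload.
// std::stable_sort is a CPU stand-in here, not the GPU algorithm.
std::vector<KeyIndex> sortWithIndex(const std::vector<std::uint32_t>& keys)
{
    std::vector<KeyIndex> pairs(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        pairs[i] = KeyIndex{keys[i], static_cast<std::uint32_t>(i)};
    }
    std::stable_sort(pairs.begin(), pairs.end(),
                     [](const KeyIndex& a, const KeyIndex& b) { return a.key < b.key; });
    return pairs;
}
```

After the sort, pairs[i].index tells you which original item ended up in position i, so the items themselves can be gathered in sorted order.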
I wish I could make the sort faster, because it is the workhorse of the engine, but I found some problems. The first pass I made was about 5 times faster, but I discovered it had a bug that shows up only on the GPU, so I had to add a part that serializes buckets of 256 elements in one thread.
Basically, you have a count of n elements, with n less than or equal to 256, and you have to write them in the same order that they are found in memory, and that's not easy on a multicore.
I am doing it in a loop, but I am hoping to find a better way.
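One possible direction, sketched on the CPU in C++ and not taken from my actual code: give every element its own destination, computed as the start of its digit's bucket plus the number of earlier elements with the same digit. Each iteration of the outer loop in step 3 is then independent, so it could map to one thread per element, and the output order stays the same as the order in memory. The function name stableScatterBlock and the 8-bit digit are assumptions for this example. The inner count is quadratic in the block size, so a real GPU version would likely replace it with some form of parallel prefix sum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: stable scatter of one small block (n <= 256) by an
// 8-bit digit taken at bit offset `shift`.
std::vector<std::uint32_t> stableScatterBlock(const std::vector<std::uint32_t>& keys,
                                              unsigned shift)
{
    const std::size_t n = keys.size();
    assert(n <= 256);

    // 1. Histogram: how many elements fall in each digit bucket.
    std::uint32_t count[256] = {};
    for (std::uint32_t k : keys) {
        count[(k >> shift) & 0xFF]++;
    }

    // 2. Exclusive prefix sum: first output slot of each bucket.
    std::uint32_t base[256];
    std::uint32_t sum = 0;
    for (int d = 0; d < 256; ++d) {
        base[d] = sum;
        sum += count[d];
    }

    // 3. Each element finds its own slot. No iteration of the outer loop
    //    depends on another, so it could run as one thread per element.
    std::vector<std::uint32_t> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint32_t digit = (keys[i] >> shift) & 0xFF;
        std::uint32_t rank = 0;  // earlier elements with the same digit
        for (std::size_t j = 0; j < i; ++j) {
            rank += (((keys[j] >> shift) & 0xFF) == digit) ? 1u : 0u;
        }
        out[base[digit] + rank] = keys[i];
    }
    return out;
}
```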
One of the problems with the NVIDIA demos is that it is really hard to find out how the code works, for almost any non-trivial demo.
Anyway, I will try to tune this a little more, because it seems that it can degrade very quickly.