* • [2011-05-21 Sat] Performance and accuracy of the matrix multiplication routines :code:CUDA:HPC:
:PROPERTIES:
:ID:       d2c53a82114b23385dc76afa86df4501
:END:
CUBLAS on Nvidia Tesla versus MKL and ATLAS on Intel Nehalem

Philippe Estival, Luc Giraud

Scientific computation relies heavily on 64-bit
arithmetic. The evolution of Graphics Processing
Units into massively parallel vector units,
together with improvements in their
programmability, makes them stand as powerful
coprocessors for many classes of matrix
computation. But on these processors, which
inherit from architectures originally dedicated
to video processing, support for double precision
is still limited. One building block of dense
linear algebra, the GEneral Matrix Multiply
(GEMM) routine, has been considerably accelerated
on the GPU. This paper examines its speed in
detail, but first its accuracy.

https://hgpu.org/?p=7671
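The accuracy concern the abstract raises can be illustrated without a GPU. The sketch below (not from the paper; the naive triple-loop GEMM and the rounding trick are my own illustration) simulates a single-precision accumulator by rounding every partial sum through IEEE-754 binary32, then compares the result against a double-precision reference:

```python
import random
import struct

def to_f32(x):
    # Round a Python double to the nearest IEEE-754 single,
    # simulating a float32 accumulator on the GPU.
    return struct.unpack('f', struct.pack('f', x))[0]

def gemm(A, B, n, single=False):
    # Naive triple-loop GEMM: C = A * B.
    # With single=True, every partial sum is rounded to float32.
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += A[i][k] * B[k][j]
                if single:
                    acc = to_f32(acc)
            C[i][j] = acc
    return C

random.seed(0)
n = 32
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
B = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]

C64 = gemm(A, B, n)                 # double-precision reference
C32 = gemm(A, B, n, single=True)    # simulated float32 accumulation

# Largest entrywise deviation, normalized by the largest reference entry.
max_diff = max(abs(C32[i][j] - C64[i][j]) for i in range(n) for j in range(n))
max_ref = max(abs(C64[i][j]) for i in range(n) for j in range(n))
err = max_diff / max_ref
print(err)
```

The normalized error lands in the float32 range (around 1e-7 to 1e-5 for this problem size), several orders of magnitude worse than what full double-precision accumulation gives, which is the gap the paper's accuracy measurements are about.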

The paper was poorly rated; you should probably
read better scientific literature instead.