We would like to run ALF on accelerators. The simplest solution would be to use AMD's ACML:
This library provides a full LAPACK and BLAS implementation with a working Fortran interface that can automatically use external accelerators via OpenCL. Sadly, AMD ships only two profiles, for two of their GPUs from around 2014, and I could not get it to work satisfactorily. The next idea is to use clMAGMA from the MAGMA initiative:
The most recent version has a working Fortran interface (I suppose) and support for sparse vector operations, and it is maintained by a similar set of people as the reference LAPACK. But it essentially only supports CUDA. There is magmaMIC, which can utilize Xeon Phis, and there is clMAGMA, which uses OpenCL as its backend. Sadly, when the authors tried to add a Fortran interface to it, they found that there would be some work involved. So this is also out... There is ViennaCL:
This is C++ only, but it looks very powerful, especially for sparse operations. Since ALF spends most of its time in low-level BLAS3 routines (ZHEMMs in my branch on the Hubbard model), we can get away with just plugging in a library that emulates the BLAS interface. To my knowledge there is no library that provides a full Fortran interface. If we go on to write our own wrappers, there are two contenders. clBLAS: https://github.com/clMathLibraries/clBLAS This was part of AMD's ACML, is now open source, and seems to be lightly maintained. And TomTom recently released CLBlast:
It is very new and, coming from outside of HPC, puts a lot of effort into ensuring portability. It also has a Netlib-style BLAS interface that can almost be linked against directly from Fortran.
So for now I will try to see whether clBLAS works and whether I can offload the ZHEMM calls...
First experiences with clBLAS: Adding the ZHEMM call is now finished. This works and gives correct results. For now I could only test execution on a CPU (i7-2600). The multiplication is automatically parallelized but oversubscribes my CPU with ~8 threads. This would be OK, but the runtime is 5 times longer than plain single-threaded execution... Some numbers:
(Core i7-920, 8x8 lattice) master: 13s, clalf: 97s (up to 4 threads...), CLBlast: 97s (around 1.5 threads effectively used)
(Core i7-920, 12x12 lattice) master: 136s, clalf: 415s (up to 4 threads...), CLBlast: 171s (~2.5 threads)
(Core i7-920, 16x16 lattice) master: 776s (single thread), CLBlast: 545s (~4 threads used well)
(Core i7-2600, 20x20 lattice) master: 357s, clBLAS: 880s
For now I conclude that clBLAS is an AMD-GPU-only solution. Running CLBlast's built-in auto-tuner for my CPU didn't change the numbers much.
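To make the trend across lattice sizes easier to see, the timings listed above can be turned into slowdown factors relative to master. The sketch below only restates the numbers from these notes; the struct and function names are made up for illustration, and the variant labels are copied as written above.

```c
#include <assert.h>
#include <stdio.h>

/* One timing pair copied from the notes (wall-clock seconds). */
struct timing {
    const char *lattice;
    const char *variant;   /* label exactly as used in the notes */
    double master_s;
    double variant_s;
};

static const struct timing timings[] = {
    { "8x8",   "clalf",    13.0,  97.0 },
    { "8x8",   "CLBlast",  13.0,  97.0 },
    { "12x12", "clalf",   136.0, 415.0 },
    { "12x12", "CLBlast", 136.0, 171.0 },
    { "16x16", "CLBlast", 776.0, 545.0 },
    { "20x20", "clBLAS",  357.0, 880.0 },
};

/* Ratio > 1 means the OpenCL variant is slower than master. */
double slowdown(double master_s, double variant_s)
{
    assert(master_s > 0.0);
    return variant_s / master_s;
}

/* Print one line per measurement, e.g. to stdout. */
void print_slowdowns(FILE *out)
{
    size_t count = sizeof timings / sizeof timings[0];
    for (size_t i = 0; i < count; ++i)
        fprintf(out, "%-6s %-8s %.2fx\n",
                timings[i].lattice, timings[i].variant,
                slowdown(timings[i].master_s, timings[i].variant_s));
}
```

Calling `print_slowdowns(stdout)` gives factors of roughly 7.5x for both variants on the 8x8 lattice, about 3.1x (clalf) and 1.3x (CLBlast) on 12x12, about 0.7x for CLBlast on 16x16 (that is, faster than single-threaded master), and about 2.5x for clBLAS on 20x20. Note that the 20x20 run is on a different CPU than the others, so it is not directly comparable.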