ALF issues
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues (feed updated 2021-04-27)

Issue #181: Reduce memory consumption on a node by sharing data between MPI tasks
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/181 (updated 2021-04-27; author: Johannes Hofmann; assignee: Johannes Hofmann)

Some memory objects are independent of the rank of the MPI task within a tempering group that shares all parameters. Those can be shared between different MPI tasks on the same node, thus reducing the memory footprint of ALF.

Issue #177: FFT for application of Op_T matrices
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/177 (updated 2021-03-24; author: Florian Goth; assignee: Fakher F. Assaad)

Fakher wants to use an FFT to optimize for circulant matrices.

Issue #176: Block Diagonal Op_T
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/176 (updated 2020-11-30; author: Florian Goth; assignee: Fakher F. Assaad)

Fakher wants to automatically decompose block diagonal matrices.

Issue #175: Include MSCBDECOMP + higher order checkerboard
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/175 (updated 2021-01-06; author: Florian Goth; assignee: Florian Goth)

- [x] include MSCBDECOMP
- [x] make an object for it
- [x] Higher order checkerboard

Issue #174: More Use cases for the ContainerElementBase Object Hierarchy
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/174 (updated 2020-12-03; author: Florian Goth; assignee: Florian Goth)

Now with the ContainerElementBase Object Hierarchy in place and a dynamic container present that is able to take up
these objects in some "dynamic" way we can think about extensions.
## Approximate techniques
- [ ] Minimum Split Checkerboard + Higher Order Checkerboard #175
- [ ] Lanczos
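For context on the checkerboard items above: a checkerboard split approximates exp(Δτ T) by a product of exponentials of sparse bond groups, and a symmetric splitting is roughly what "higher order" buys. A minimal numpy sketch on a 4-site ring; the lattice, the even/odd bond grouping, and the helper `expm_sym` are illustrative stand-ins, not ALF's actual decomposition:

```python
import numpy as np

def expm_sym(M):
    # matrix exponential of a real symmetric matrix via eigendecomposition
    w, v = np.linalg.eigh(M)
    return (v * np.exp(w)) @ v.T

# 4-site periodic ring hopping matrix T (illustrative)
T = np.zeros((4, 4))
for i in range(4):
    T[i, (i + 1) % 4] = T[(i + 1) % 4, i] = 1.0

# checkerboard split into two groups of non-overlapping bonds:
# A couples (0,1) and (2,3); B couples (1,2) and (3,0)
A = np.zeros((4, 4))
B = np.zeros((4, 4))
A[0, 1] = A[1, 0] = A[2, 3] = A[3, 2] = 1.0
B[1, 2] = B[2, 1] = B[3, 0] = B[0, 3] = 1.0

dtau = 0.05
exact = expm_sym(dtau * T)
# plain splitting: error per step grows with the commutator [A, B]
cb1 = expm_sym(dtau * A) @ expm_sym(dtau * B)
# symmetric splitting: smaller error at the same cost per step
cb2 = expm_sym(dtau * A / 2) @ expm_sym(dtau * B) @ expm_sym(dtau * A / 2)

err1 = np.max(np.abs(cb1 - exact))
err2 = np.max(np.abs(cb2 - exact))
```

Within each group the bond terms commute, so each factor is cheap to apply; the symmetric product recovers an extra order in Δτ.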
## Exact techniques
- [ ] Block techniques #176
- [ ] FFT techniques
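On the FFT item: a circulant Op_T is diagonalized by the discrete Fourier transform, so exp(Δτ T) can be applied in O(N log N) instead of O(N²). A numpy sketch for a periodic 1D chain; the chain and step size are illustrative, not ALF code:

```python
import numpy as np

N = 8
# first column of the circulant hopping matrix of a periodic chain
c = np.zeros(N)
c[1] = c[-1] = 1.0
T = np.array([[c[(j - i) % N] for j in range(N)] for i in range(N)])

# eigenvalues of a circulant matrix are the DFT of its defining vector;
# for this symmetric stencil they are real (2*cos(2*pi*k/N))
lam = np.fft.fft(c).real

dtau = 0.1
v = np.random.default_rng(0).standard_normal(N)

# dense reference: exp(dtau*T) @ v via eigendecomposition, O(N^2) apply
w, U = np.linalg.eigh(T)
dense = (U * np.exp(dtau * w)) @ U.T @ v

# FFT application: transform, scale each mode by exp(dtau*lambda_k), transform back
fft_apply = np.fft.ifft(np.exp(dtau * lam) * np.fft.fft(v)).real
```

Both paths agree to machine precision; the FFT path never forms the dense exponential.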
## Use cases in other places
- [ ] Op_V is not that much different. Get rid of the various types?

Issue #167: Why is UDV%D a full complex number
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/167 (updated 2020-10-26; author: Florian Goth; assignee: Florian Goth)

@fassaad, @Hofmann:
UDV%D is still a full complex number. We calculate D from the return values of norm-related functions (Abs), which give positive real numbers. To my knowledge we never store anything complex in this array anyway, and hence we might want to optimize that away.
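A quick numpy stand-in for the Fortran arrays (array size made up) illustrating the saving: holding the same positive values as real instead of complex halves the storage and avoids needless complex arithmetic:

```python
import numpy as np

n = 1024
# positive real values (as returned by Abs) stored in a complex array, as UDV%D does now
d_complex = np.abs(np.random.default_rng(0).standard_normal(n)).astype(np.complex128)
# the same values stored as plain reals
d_real = d_complex.real.copy()

# the imaginary part is identically zero: nothing complex is ever stored
all_real = np.allclose(d_complex.imag, 0.0)
# complex128 costs exactly twice the memory of float64
ratio = d_complex.nbytes // d_real.nbytes
```

The same factor of two applies to every multiplication touching D.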
Issue #114: block diagonal interaction terms
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/114 (updated 2019-09-29; author: Johannes Hofmann; assignee: Emilie Huffman)

Some models have a block diagonal structure in the interaction terms, similar to the already exploited N_fl structure. However, the non-interacting part may not be block diagonal, such that N_fl cannot be used.
This branch aims to implement this structure!
I would propose to redefine the operator type to contain a list of the "dense" blocks of the operators, i.e., P(:) --> P(:,b), O(:,:) --> O(:,:,b), and so on. Here b is the block index and appears as the last index for memory efficiency, as with N_fl. The rest of the code should then be updated accordingly.
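The gain from such a layout can be illustrated outside of Fortran. A numpy sketch (block count and block size are made up; the block index is kept last, mirroring the proposed O(:,:,b)) showing that applying a block diagonal operator block by block reproduces the dense result while touching only the nonzero blocks:

```python
import numpy as np

rng = np.random.default_rng(1)
nb, bs = 3, 4                                 # number of blocks, block size (made up)
blocks = rng.standard_normal((bs, bs, nb))    # block index last, as in O(:,:,b)

# dense block diagonal matrix, only for reference
dense = np.zeros((nb * bs, nb * bs))
for b in range(nb):
    dense[b * bs:(b + 1) * bs, b * bs:(b + 1) * bs] = blocks[:, :, b]

v = rng.standard_normal(nb * bs)

# block-wise application: O(nb * bs^2) work instead of O((nb*bs)^2)
out = np.concatenate([blocks[:, :, b] @ v[b * bs:(b + 1) * bs] for b in range(nb)])
```

The block-wise result matches `dense @ v` exactly, while the storage drops from (nb·bs)² to nb·bs² entries.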
It might also be convenient to use the number of blocks as an optional argument when the operator is allocated, ideally in a way that existing models without that structure can stay as they are. This way, the more experienced user can make use of this option, whereas a new user doesn't have to care about it in the beginning.

Issue #113: Cholesky vs maxent
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/113 (updated 2020-10-19; author: Florian Goth; assignee: Florian Goth)

You have to touch xqmc1 and Vhelp.

Issue #103: Suggestion: increment N_auto in powers of two to speed up autocorrelation analysis
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/103 (updated 2019-05-21; author: Martin Hohenadler)

The documentation points out that the autocorrelation analysis is expensive. However, there is very little use in analyzing and including every bin length between 1 and 500 (in the example of Fig. 1).
It is common in the literature to increase N_auto by a factor of two instead.

Issue #94: automatically generate zslgemm and zslhemm with arbitrary N
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/94 (updated 2019-02-07; author: Jonas Schwab; assignee: Jonas Schwab)

Create Libraries/Modules/Mat_subroutines.f90 with a script during compilation, with arbitrarily high N.

Issue #91: Test the compact WY decomposition in real life
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/91 (updated 2018-10-01; author: Florian Goth; assignee: Florian Goth)

Make a branch from Johannes' branch and introduce xGEQRT into udv_state in the case of the logscale, where permutations are disabled anyway.

Issue #90: PLASMAAlf
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/90 (updated 2020-09-04; author: Florian Goth; assignee: Florian Goth)

Plasma,
https://bitbucket.org/icl/plasma
seems to have a superior QR decomposition that is highly parallelizable.
Let's see where this leads.

Issue #74: Optimization for hopping operators
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/74 (updated 2017-09-12; author: Johannes Hofmann)

It might be useful to specialize hopping operators, as many of us simulate Ising fields coupled to bond hopping, or hopping operators squared in Kondo-type couplings.

Issue #62: clALF
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/62 (updated 2017-06-22; author: Florian Goth; assignee: Florian Goth)

So we'd like to use ALF on certain accelerators. The simplest solution would be to use AMD's ACML:
http://developer.amd.com/tools-and-sdks/archive/compute/amd-core-math-library-acml/acml-downloads-resources/
This library provides a full LAPACK and BLAS implementation with a working Fortran interface that can automatically use
external accelerators via OpenCL. Sadly, AMD only ships profiles for two of their GPUs from around 2014, and I could not get it to work satisfactorily.
The next idea is to use clMAGMA from the MAGMA initiative:
http://icl.cs.utk.edu/magma/
The most recent version has a working Fortran interface (I suppose), support for sparse vector operations, and is maintained by a similar set of people as the reference LAPACK. But it essentially only supports CUDA.
There is magmaMIC that can utilize Xeon Phi's and there's clMAGMA that uses OpenCL as backend.
Sadly, when the authors tried to add a Fortran interface, they found out that there would be some work involved. So this is also out.
There is ViennaCL:
http://viennacl.sourceforge.net/
But it is C++ only, although it looks very powerful, especially for sparse operations.
Since ALF spends its time mostly in low-level BLAS3 routines (ZHEMMs in my branch on the Hubbard model),
we can get away with just trying to plug in a library that emulates the BLAS interface.
To my knowledge there is no library that provides a full Fortran interface.
If we go on to write our own wrappers there are two contenders:
clBLAS: https://github.com/clMathLibraries/clBLAS
This was a part of AMD's ACML, is now open sourced, and seems to be somewhat maintained.
TomTom recently released CLBlast:
https://arxiv.org/abs/1705.05249 , https://cnugteren.github.io/clblast/clblast.html
It is very new; coming from outside HPC, it puts a lot of effort into ensuring portability, and it also has a Netlib LAPACK interface that can almost be linked against Fortran directly.
So for now I will try to see whether clBLAS works and whether I can offload ZHEMM calls.
First experiences with clBLAS:
Adding the ZHEMM call is now finished. This works and gives correct results.
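Such a correctness check can be sketched outside of Fortran. A numpy stand-in (illustrative only; numpy plays both the reference and the offloaded library here) that mimics what ZHEMM computes, reading only one triangle of the Hermitian factor, and compares it against a plain matrix product:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 16
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (A + A.conj().T) / 2            # Hermitian input, as ZHEMM assumes
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# ZHEMM('L','L',...) only references the lower triangle of A;
# reconstruct the full product from that triangle alone
A_lower = np.tril(A)
hemm_like = (A_lower + np.tril(A, -1).conj().T) @ B

reference = A @ B                   # trusted full-matrix reference
```

Running the offloaded routine and the reference on the same random inputs and demanding agreement to machine precision is exactly the kind of check that caught nothing wrong here.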
For now I could only test execution on a CPU (i7-2600). The multiplication is automatically
parallelized but oversubscribes my CPU with ~8 threads. This would be OK, but the runtime is 5 times longer than plain single-thread execution...
Some numbers:

| Machine | Lattice | master | clBLAS | CLBlast |
|---|---|---|---|---|
| core-i7 920 | 8x8 | 13 s | 97 s (up to 4 threads) | 97 s (~1.5 threads effectively used) |
| core-i7 920 | 12x12 | 136 s | 415 s (up to 4 threads) | 171 s (~2.5 threads) |
| core-i7 920 | 16x16 | 776 s (single thread) | | 545 s (~4 threads used well) |
| core-i7 2600 | 20x20 | 357 s | 880 s | |
For now I conclude that clBLAS is an AMD-GPU-only solution.
The numbers didn't change much by using the built-in auto-tuner for my CPU for CLBlast.

Issue #22: Save Stack space
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/22 (updated 2019-09-25; author: Florian Goth; assignee: Jefferson Stafusa E. Portela)

Arrays that are declared with their size upon their declaration (i.e., they don't have the allocatable keyword) get allocated on the stack by default.
On systems that set the available stack space to unlimited (most supercomputing centers do that for compatibility reasons), this will not pose an issue.
To enhance compatibility with systems that impose stack-space limits, we should instead use allocatable arrays.