ALF issues — https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues

## Save Stack space
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/22 — Florian Goth, 2019-09-25

Arrays that are declared with their size at declaration (i.e., they do not carry the allocatable attribute) are by default allocated on the stack.
On systems that set the available stack space to unlimited (most supercomputing centers do that for compatibility reasons) this will not pose an issue.
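A minimal Fortran sketch of the two declaration styles (illustrative only; the subroutine names are not from the ALF code base):

```
! work_automatic uses an automatic array: storage comes from the stack,
! so a large n can overflow a limited stack and crash the program.
subroutine work_automatic(n)
  implicit none
  integer, intent(in) :: n
  real :: tmp(n, n)          ! automatic array, lives on the stack
  tmp = 0.0
end subroutine work_automatic

! work_allocatable requests the same storage from the heap, which is
! limited only by the available memory, not by `ulimit -s`.
subroutine work_allocatable(n)
  implicit none
  integer, intent(in) :: n
  real, allocatable :: tmp(:, :)
  allocate(tmp(n, n))        ! heap allocation
  tmp = 0.0
  deallocate(tmp)
end subroutine work_allocatable
```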
To enhance compatibility with systems that limit stack space, we should instead use allocatable arrays.
Assignee: Jefferson Stafusa E. Portela

## clALF
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/62 — Florian Goth, 2017-06-22

So we would like to use ALF on certain accelerators. The simplest solution would be to use AMD's ACML:
http://developer.amd.com/tools-and-sdks/archive/compute/amd-core-math-library-acml/acml-downloads-resources/
This library provides a full LAPACK and BLAS implementation with a working Fortran interface that can automatically use
external accelerators via OpenCL. Sadly, AMD only ships profiles for two of their GPUs from around 2014, and I could not get it to work satisfactorily.
The next idea is to use clMAGMA from the MAGMA initiative:
http://icl.cs.utk.edu/magma/
The most recent version has a working Fortran interface (I suppose), supports sparse vector operations, and is maintained by largely the same people as the reference LAPACK. But it essentially only supports CUDA.
There is magmaMIC that can utilize Xeon Phi's and there's clMAGMA that uses OpenCL as backend.
Sadly, when the authors tried to add a Fortran interface they found that considerable work would be involved. So this is also out...
There is ViennaCL:
http://viennacl.sourceforge.net/
But it is C++ only; still, it looks very powerful, especially for sparse operations.
Since ALF spends its time mostly in low-level BLAS3 routines (ZHEMMs in my branch on the Hubbard model),
we can get away with just plugging in a library that emulates the BLAS interface.
To my knowledge there is no such library that provides a full Fortran interface.
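The plug-in idea could look like the following hedged sketch: a thin Fortran wrapper with the standard BLAS ZHEMM signature that forwards either to the CPU BLAS or to a hypothetical offloading backend (`cl_zhemm_offload` is a made-up name, not an existing API):

```
! Thin wrapper with the reference BLAS ZHEMM signature. Call sites stay
! unchanged; only the wrapper decides where the work is executed.
subroutine my_zhemm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
  implicit none
  character,  intent(in)    :: side, uplo
  integer,    intent(in)    :: m, n, lda, ldb, ldc
  complex(8), intent(in)    :: alpha, beta
  complex(8), intent(in)    :: a(lda, *), b(ldb, *)
  complex(8), intent(inout) :: c(ldc, *)
  logical, parameter :: use_accelerator = .false.  ! compile-time switch for now

  if (use_accelerator) then
     ! hypothetical offload path, e.g. a hand-written binding to clBLAS/CLBlast
     ! call cl_zhemm_offload(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
  else
     call zhemm(side, uplo, m, n, alpha, a, lda, b, ldb, beta, c, ldc)
  end if
end subroutine my_zhemm
```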
If we go on to write our own wrappers there are two contenders:
clBLAS: https://github.com/clMathLibraries/clBLAS
This was part of AMD's ACML, is now open-sourced, and seems to be somewhat maintained.
And TomTom recently released CLBlast:
https://arxiv.org/abs/1705.05249 , https://cnugteren.github.io/clblast/clblast.html
It is very new; coming from outside of HPC, it puts a lot of effort into portability, and it also has a Netlib-style LAPACK interface that can almost be linked against from Fortran.
So for now I will try to see whether clBLAS works and whether I can offload ZHEMM calls...
First experiences with clBLAS:
Adding the ZHEMM call is now finished. This works and gives correct results.
For now I could only test execution on a CPU (i7-2600). The multiplication is automatically
parallelized but oversubscribes my CPU with ~8 threads. That would be OK, but the runtime is 5 times longer than plain single-thread execution...
Some numbers:

| Machine      | Lattice | master               | clALF (clBLAS)         | CLBlast                          |
|--------------|---------|----------------------|------------------------|----------------------------------|
| core-i7 920  | 8x8     | 13s                  | 97s (up to 4 threads)  | 97s (~1.5 threads effectively)   |
| core-i7 920  | 12x12   | 136s                 | 415s (up to 4 threads) | 171s (~2.5 threads)              |
| core-i7 920  | 16x16   | 776s (single thread) |                        | 545s (~4 threads, used well)     |
| core-i7 2600 | 20x20   | 357s                 | 880s                   |                                  |
For now I conclude that clBLAS is an AMD-GPU-only solution.
The numbers did not change much when using CLBlast's built-in auto-tuner for my CPU.
Assignee: Florian Goth

## Optimization for hopping operators
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/74 — Johannes Hofmann, 2017-09-12

It might be useful to specialize the hopping operator, as many of us simulate Ising fields coupled to bond hopping, or hopping operators squared in Kondo-type couplings.

## PLASMAAlf
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/90 — Florian Goth, 2020-09-04

PLASMA,
https://bitbucket.org/icl/plasma
seems to have a superior QR decomposition that is highly parallelizable.
Let's see where this leads.
Assignee: Florian Goth

## Test the compact WY decomposition in real life
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/91 — Florian Goth, 2018-10-01

Make a branch from Johannes' branch and introduce xGEQRT into udv_state in the logscale case, where permutations are disabled anyway.
Assignee: Florian Goth

## Automatically generate zslgemm and zslhemm with arbitrary N
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/94 — Jonas Schwab, 2019-02-07

Create Libraries/Modules/Mat_subroutines.f90 with a script during compilation, with arbitrarily high N.
Assignee: Jonas Schwab

## Suggestion: increment N_auto in powers of two to speed up autocorrelation analysis
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/103 — Martin Hohenadler, 2019-05-21

The documentation points out that the autocorrelation analysis is expensive. However, there is very little use in analyzing and including every bin length between 1 and 500 (in the example of Fig. 1).
It is common in the literature to increase N_auto by a factor of two instead.

## Cholesky vs maxent
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/113 — Florian Goth, 2020-10-19

You have to touch xqmc1 and Vhelp.
Assignee: Florian Goth

## Block diagonal interaction terms
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/114 — Johannes Hofmann, 2019-09-29

Some models have a block diagonal structure in the interaction terms, similar to the already exploited N_fl structure. However, the non-interacting part may not be block diagonal, so that N_fl cannot be used.
This branch aims to implement this structure!
I would propose to redefine the operator type to contain a list of the "dense" blocks of the operators, i.e., P(:) --> P(:,b), O(:,:) --> O(:,:,b), and so on. Here b is the index of the block; it should appear as the last index, for memory efficiency, as with n_fl. The rest of the code should then be updated accordingly.
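A hedged sketch of what the redefined type could look like (field names mirror the proposal above; the actual ALF operator type differs in detail):

```
! Hypothetical block-aware operator type; b runs over the dense blocks.
Type :: Operator_blocked
   Integer :: N_blocks                      ! number of dense blocks (1 reproduces the old layout)
   Integer,    allocatable :: P(:, :)       ! P(:, b): site indices the b-th block acts on
   Complex(8), allocatable :: O(:, :, :)    ! O(:, :, b): dense matrix of the b-th block
End Type Operator_blocked
```

Keeping b as the last index keeps each dense block contiguous in memory, matching the argument made above for n_fl.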
It might also be convenient to pass the number of blocks as an optional argument when the operator is allocated, ideally in a way that existing models without that structure can stay as they are. This way, the more experienced user can make use of this option, whereas a "new" user does not have to care about it in the beginning.
Assignee: Emilie Huffman

## Why is UDV%D a full complex number
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/167 — Florian Goth, 2020-10-26

@fassaad, @Hofmann:
UDV%D is still a full complex number. We calculate D from the return values of norm-related functions (Abs), which give positive real numbers. To my knowledge we never store anything complex in this array anyway, and hence we might want to optimize that away.
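A hedged sketch of the proposed change (the type and field names are illustrative; the real UDV type in ALF carries more members):

```
! Hypothetical sketch: since D only ever holds Abs(...) values, i.e.
! non-negative reals, it could be stored as Real(8) instead of Complex(8),
! halving its memory footprint and removing pointless complex arithmetic.
Type :: UDV_State_sketch
   Complex(8), allocatable :: U(:, :), V(:, :)
   Real(8),    allocatable :: D(:)     ! was Complex(8)
End Type UDV_State_sketch
```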
Assignee: Florian Goth

## More Use cases for the ContainerElementBase Object Hierarchy
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/174 — Florian Goth, 2020-12-03

Now with the ContainerElementBase Object Hierarchy in place, and a dynamic container present that is able to take up these objects in some "dynamic" way, we can think about extensions.
## Approximate techniques
- [ ] Minimum Split Checkerboard + Higher Order Checkerboard #175
- [ ] Lanczos
## Exact techniques
- [ ] Block techniques #176
- [ ] FFT techniques
## Use cases in other places
- [ ] Op_V is not that much different. Get rid of the various types?

Assignee: Florian Goth

## Include MSCBDECOMP + higher order checkerboard
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/175 — Florian Goth, 2021-01-06

- [x] include MSCBDECOMP
- [x] make an object for it
- [x] Higher order checkerboard

Assignee: Florian Goth

## Block Diagonal Op_T
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/176 — Florian Goth, 2020-11-30

Fakher wants to automatically decompose block diagonal matrices.
Assignee: Fakher F. Assaad

## FFT for application of Op_T matrices
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/177 — Florian Goth, 2021-03-24

Fakher wants to use an FFT to optimize for circulant matrices.
Assignee: Fakher F. Assaad

## Automatic computation of Hopping_Matrix_Type%Multiplicity
https://git.physik.uni-wuerzburg.de/ALF/ALF/-/issues/195 — Francesco Parisen Toldin, 2021-06-14

The field Hopping_Matrix_Type%Multiplicity contains the number of times a given orbital appears in the list of bonds. It could therefore be deduced from Hopping_Matrix_Type%List, which contains the bonds emanating from a unit cell.
It seems a good idea to have it computed automatically, instead of relying on the user to enter it, in particular if the user defines a new set of hoppings with checkerboard decomposition. This will ensure consistency and avoid bugs that are potentially difficult to detect.
The algorithm to compute it is basically easy (just a loop over Hopping_Matrix_Type%List).
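A hedged sketch of that loop, assuming `List(i, 1:2)` holds the two orbitals connected by bond `i` (the actual layout of Hopping_Matrix_Type%List may differ):

```
! Count how often each orbital appears among the bonds of a unit cell.
Multiplicity(:) = 0
Do i = 1, size(this%List, 1)
   Do j = 1, 2
      no = this%List(i, j)                      ! orbital touched by bond i
      Multiplicity(no) = Multiplicity(no) + 1   ! one more appearance
   End Do
End Do
```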
Concerning the design, I see some possibilities:
1. Define a procedure like this
```
Subroutine Set_Multiplicity(this)
Type(Hopping_Matrix_Type), intent(INOUT) :: this
```
Pros: simplicity.
Cons: again, we rely on the user to call it. And we have to properly document it.
2. Considering that, as far as I can see, Hopping_Matrix_Type%Multiplicity is only used in Predefined_Hoppings_set_OPT, we could remove altogether the field Hopping_Matrix_Type%Multiplicity and define a function that computes it, something like
```
Subroutine Compute_Multiplicity(hop_matrix, multiplicity)
Type(Hopping_Matrix_Type), intent(IN) :: hop_matrix
Integer, intent(OUT) :: multiplicity(:)
```
Pros: we completely remove the possibility of errors in Multiplicity. No need to document it.
Cons: at the moment Multiplicity is only used in one subroutine, presumably called only once. Should we use it in more routines, we may lose efficiency by recomputing it every time.
3. Let the object compute Multiplicity the first time it is needed, and cache the result. This would be achieved by giving Hopping_Matrix_Type a method like this:
```
Function Multiplicity(this, n)
  implicit none
  Class(Hopping_Matrix_Type), intent(INOUT) :: this
  Integer, intent(IN) :: n
  Integer :: Multiplicity
  If (.not. this%is_cache_valid) then
     call this%Set_Multiplicity()
     this%is_cache_valid = .true.
  end if
  Multiplicity = this%m_Multiplicity(n)
End Function Multiplicity
```
Here, is_cache_valid is a boolean that signals whether the multiplicity has already been computed, and m_Multiplicity is where the data are stored. Both should be, I think, private components.
Pros: This can be regarded as an "internal" method that we do not necessarily need to document. Good efficiency. We hide the actual implementation of Multiplicity without changing the syntax: that is, Multiplicity(n) still gives the multiplicity of the n-th orbital.
Cons: Perhaps slightly more complex logic.
I personally favor solution 3, but I am not entirely sure: does Fortran allow for this kind of C++-like design?
Assignee: Francesco Parisen Toldin