
BSE in supercells

Posted: Wed Jan 29, 2025 11:05 am
by kaynat_kalvi

Hi, I have a cubic system with 216 identical atoms, and I want to calculate its dielectric function using DFT+mBSE. I am keeping the k-point mesh small to reduce memory and computation time, and I am running on 2 nodes with 128 CPUs per node. Since BSE does not support KPAR, I want to optimize my calculation and would appreciate suggestions. Please guide me through this.


Re: BSE in supercells

Posted: Wed Jan 29, 2025 11:39 am
by alexey.tal

Dear kaynat_kalvi,

Thank you for your question.

There are a number of things one can do to optimize the performance of BSE calculations for large cells. We have also optimized our BSE algorithm in VASP 6.5.0. Do you have access to VASP 6.5?

Could you please provide the input files for your system?

Best wishes,
Alexey


Re: BSE in supercells

Posted: Tue Feb 04, 2025 1:53 pm
by alexey.tal

Below I will list my recommendations for optimizing BSE performance for large supercells in VASP 6.5.

There are two steps in the BSE calculation: setup of the Hamiltonian and the diagonalization.

Diagonalization:

  • The most efficient way to compute the spectrum is the Lanczos iterative algorithm (IBSE=3).
  • If VASP is compiled with the flag -Dsingle_prec_bse, the Hamiltonian is stored and solved in single precision, which usually provides sufficient accuracy but requires only half the memory and compute time.
  • It is possible to use OMEGAMAX in BSE to exclude transitions beyond a given energy range, thus reducing the rank of the Hamiltonian (see the INCAR sketch after this list).
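
As a minimal, illustrative INCAR sketch for the diagonalization step (the OMEGAMAX cutoff and the NBANDSO/NBANDSV band counts below are placeholders that must be adapted to your system):

    ALGO     = BSE  ! solve the Bethe-Salpeter equation
    IBSE     = 3    ! iterative Lanczos diagonalization
    OMEGAMAX = 10   ! hypothetical cutoff (eV): drop transitions above this energy
    NBANDSO  = 32   ! occupied bands in the BSE Hamiltonian (system-specific)
    NBANDSV  = 64   ! virtual bands in the BSE Hamiltonian (system-specific)

Note that single precision is a compile-time option: -Dsingle_prec_bse is added to the precompiler flags (typically CPP_OPTIONS in makefile.include) and VASP is rebuilt.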

Setting up the matrix:

  • The calculation of the BSE Hamiltonian scales up to NKPTS*(NKPTS+1)/2 CPU cores; beyond that, the computational load cannot be distributed evenly. However, for large supercells the number of k-points is usually quite small. For example, with a 2x2x2 k-mesh (NKPTS=8), the maximum number of cores over which the calculation can be distributed efficiently is 8*9/2 = 36. It is therefore quite important to use the flags NBSEBLOCKO and/or NBSEBLOCKV to divide the bands into groups that can be calculated in parallel (see the sketch after this list). For example, if the blocking factor divides the occupied bands into two blocks, the computation can be evenly distributed over NKPTS*2*(NKPTS*2+1)/2 = 136 CPU cores. We do not recommend blocking the bands if the number of CPU cores does not exceed NKPTS*(NKPTS+1)/2.
  • KPAR can and should be used with IBSE=1, 2, or 3. The best efficiency is achieved when KPAR equals the number of MPI ranks, i.e., a copy of all orbitals is stored on every MPI rank. However, this dramatically increases the memory requirements for large supercells, so KPAR often has to be limited to a small value or to 1.
  • It is often sufficient to use a lower accuracy for PRECFOCK, which can significantly reduce the compute time of the FFTs, the most demanding part of the calculation for large cells.
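
As a rough sketch of how the parallelization tags from this list could be combined for a 2x2x2 mesh (NKPTS=8); the values are illustrative and must be balanced against the available memory:

    IBSE       = 3    ! Lanczos; KPAR is supported with IBSE = 1, 2, and 3
    KPAR       = 2    ! kept small to limit the memory cost of duplicated orbitals
    NBSEBLOCKO = 2    ! split occupied bands into 2 blocks -> up to 136 cores at NKPTS=8
    PRECFOCK   = Fast ! lower FFT accuracy to speed up the Hamiltonian setup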

Furthermore, our BSE code is ported to Nvidia GPUs, so BSE calculations for large cells can be run very efficiently on GPUs. The Lanczos algorithm currently does not support GPU offloading, but the time-evolution BSE (IBSE=1) and the exact diagonalization algorithm (IBSE=2) can be run fully on GPUs. The time-evolution algorithm is somewhat slower than the Lanczos algorithm, but it nevertheless outperforms exact diagonalization for large BSE matrices.
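
As a hypothetical minimal setup for such a GPU run with the time-evolution algorithm (the launch command is system-dependent; typically one MPI rank per GPU is used):

    ALGO = BSE
    IBSE = 1   ! time-evolution BSE; runs fully on GPUs, unlike Lanczos (IBSE=3)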

Let me know if something is unclear or if you have further questions on BSE.