
GPU-friendly G-vector optimizations and de-duplication of form factor calculation #825

Open · azadoks opened this issue Feb 17, 2023 · 10 comments · Labels: gpu (Label for GPU-related issues)

@azadoks (Contributor) commented Feb 17, 2023

I remember from one of the weekly meetings that the IdDicts that are currently used for G-vector optimization in form factor calculations don't play well with GPU calculations.

Repeating the unique-G-norm finding and form-factor calculation all over the place also isn't the nicest.

I'm wondering if we could take a page out of QE's book and initialize a G_shells vector (i.e. the unique values of |G| or |G+k|) and a G-vector to G-shell mapping in PlaneWaveBasis and in each Kpoint.

With some intelligent sorting, this could likely be as performant as a hash-map on CPU, and it would also work on GPU.
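For illustration, here is a hedged sketch of how such a shell decomposition could be built with plain sorting; the name `build_g_shells` and the tolerance handling are hypothetical, not existing DFTK API:

```julia
using LinearAlgebra

# Hypothetical sketch: group G-vectors into shells of (nearly) equal |G|.
# Returns the unique norms and an index array mapping each G to its shell;
# the index array is plain integer data, so it can live on the GPU.
function build_g_shells(Gs::AbstractVector{<:AbstractVector{<:Real}}; atol=1e-8)
    Gnorms = norm.(Gs)
    order = sortperm(Gnorms)
    shells = eltype(Gnorms)[]
    shell_of_G = similar(order)
    for i in order
        if isempty(shells) || Gnorms[i] - last(shells) > atol
            push!(shells, Gnorms[i])  # start a new shell
        end
        shell_of_G[i] = length(shells)
    end
    return shells, shell_of_G
end
```

Form factors would then be evaluated once per entry of `shells` and gathered through `shell_of_G`, replacing the hash-map lookup with an array index.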

Along with this, I think we've discussed trying to centralize form-factor and structure-factor calculations somewhere to reduce duplication. If we move to G-shells, these functions would have to be re-worked anyway, so it would be a good opportunity to bring everything together.

For form factors, I'd suggest two methods: one for density-like objects that don't need spherical harmonics (i.e. l = 0, m = 0), and one for orbital/projector-like objects that do (arbitrary l and m). I use a marker struct in PseudoPotentialIO for the same purpose with the Bessel transform.
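To make the two-method idea concrete, a minimal sketch of dispatch via marker structs, mirroring the PseudoPotentialIO pattern; the struct names and the exact signatures of `eval_psp_density_valence_fourier`, `eval_psp_projector_fourier`, and `ylm_real` are assumptions for illustration:

```julia
# Hypothetical marker structs selecting between the two form-factor flavours.
struct DensityLike end    # l = 0, m = 0: no spherical harmonics needed
struct ProjectorLike end  # arbitrary l, m: angular part via real spherical harmonics

# Density-like quantities depend only on |q|.
form_factor(::DensityLike, psp, qnorm::Real) =
    eval_psp_density_valence_fourier(psp, qnorm)

# Projector/orbital-like quantities carry an angular part as well.
form_factor(::ProjectorLike, psp, i::Int, l::Int, m::Int, q) =
    im^l * ylm_real(l, m, q) * eval_psp_projector_fourier(psp, i, l, norm(q))
```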

Structure factors really only need to be calculated once for a given atomic geometry, so they could even be stored.

@antoine-levitt (Member) commented

I remember also there was the question of interpolating the form factors. Presumably it can be made fast enough that we don't need this type of optimization, which would maybe be simpler?

@azadoks (Contributor, Author) commented Feb 17, 2023

Interpolation would be another good option, yes.

Each pseudo quantity needs to be Fourier-transformed once per SCF: when initializing the model terms (beta projectors for the nonlocal term, core density for the NLCC correction in the exchange-correlation term) and when creating a guess density or guess wavefunction via superposition. Regardless of the approach, the goal is to reduce the number of evaluations of the `quantity_fourier` functions as much as possible, with the lowest possible overhead.

The minimum for exact evaluation would be one call per unique value of |G| for the density quantities and one call per unique value of |G+k| for the orbital/projector quantities (i.e. what's done with the IdDict).
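That per-unique-norm strategy amounts to memoizing the radial transform on |G|. A minimal sketch of the idea (the callable `eval_quantity_fourier` is a stand-in for whichever `quantity_fourier` function applies):

```julia
# Sketch of the current unique-|G| strategy: evaluate the radial Fourier
# transform once per distinct norm, then look it up for every repeat.
function form_factors_cached(eval_quantity_fourier, qnorms::AbstractVector{T}) where {T<:Real}
    cache = Dict{T,T}()
    map(qnorms) do qnorm
        get!(() -> eval_quantity_fourier(qnorm), cache, qnorm)
    end
end
```

The `Dict` lookup is exactly the part that does not port to GPU kernels, which is what motivates either the G-shell mapping or interpolation.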

If we can automatically find a set of points to use in building the interpolators which is both smaller than the above and which gives an accurate interpolation, then that would be the way to go.

Building the interpolator with the unique |G|/|G+k| would still give the exact solution, but then we're using it as a glorified lookup table, no?

Do you have ideas for how to select a reduced set of points while still ensuring the accuracy of the interpolator? I think that how we do this would determine which method is simpler.

@antoine-levitt (Member) commented

Just a plain old linear grid with spline interpolation? Then just plug in Interpolations.jl, which should be fast. The projector is very smooth in Fourier space (it is compactly supported in real space), so that should give good results. To be clear, I'm suggesting not computing the exact result here but accepting some error, controlled by the q-space grid sampling. I believe that's what ABINIT does.
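A minimal sketch of this linear-grid plus cubic-spline idea with Interpolations.jl; the function `f`, the name `spline_form_factor`, and the default grid size are placeholders:

```julia
using Interpolations

# Sketch: sample a smooth radial quantity f on a uniform q-grid and build a
# cubic spline. Accuracy is controlled by the number of sample points n_q.
function spline_form_factor(f, qmax::Real; n_q::Int=3001)
    qs = range(0, qmax; length=n_q)
    scale(interpolate(f.(qs), BSpline(Cubic(Line(OnGrid())))), qs)
end
```

Evaluating the returned interpolator is then a cheap, branch-free operation per |G|, independent of how expensive `f` itself is.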

@azadoks (Contributor, Author) commented Feb 20, 2023

Ah, I see, that sounds very reasonable.
For posterity: the ABINIT input variable is mqgrid, and it defaults to 3001 points for all reciprocal-space pseudopotential quantities.

I still feel that giving users the choice to use an exact method that's not too slow would be good. If the interpolation is easier to implement first though, we can always add this in later.

@azadoks (Contributor, Author) commented Feb 24, 2023

EDIT: see here for the code

Here's some sample code for how these functions could look.
I wrote these for specific cases, but generalizing them to two interfaces (one for densities, one for projectors/orbitals) shouldn't be too difficult, especially with the PseudoPotentialIO interface.
I think all the code should be GPU-friendly: no hash maps!

On CPU, I'm seeing a 2-3x speedup for both transforms.
The density transform has about 2-3x reduction in memory use with many orders of magnitude fewer allocations.
The orbital/projector transform decreases memory use by <10% but with a few orders of magnitude fewer allocations.

First, the structure factors are precomputed when the PlaneWaveBasis and KPoints are generated:

**Structure Factors**

```julia
function build_structure_factors(Gs::AbstractArray{Vec3{T}},
                                 positions::AbstractVector{Vec3{S}}) where {T<:Real,S<:Real}
    map(positions) do position
        map(Gs) do G
            cis2pi(-dot(G, position))  # exp(-2πi G·r)
        end
    end
end
```

The appropriate one to three lines are added in the `PlaneWaveBasis` constructor and in `build_kpoints`.

Then, two functions need to be written for each of the density quantities and the orbital/projector quantities: `build_interpolators` and `build_atomic_superposition` / `build_projection_vectors`. The density functions are fairly clean, but the orbital/projector functions could probably be simplified.

**Density**

```julia
"""
Build an on-grid cubic spline interpolator for `eval_psp_{some-density}_fourier`
using `n_qnorm_interpolate` points between 0 and `max|Gcart|`.
"""
function build_interpolators_valence_charge(basis::PlaneWaveBasis{T},
                                            n_qnorm_interpolate::Int=3001) where {T}
    qnorm_max = maximum(norm.(G_vectors_cart(basis)))
    qnorm_interpolate = range(0, qnorm_max, n_qnorm_interpolate)

    map(basis.model.atom_groups) do atom_group
        psp = basis.model.atoms[first(atom_group)].psp
        f̃ = eval_psp_density_valence_fourier.(psp, qnorm_interpolate)
        scale(interpolate(f̃, BSpline(Cubic(Line(OnGrid())))), qnorm_interpolate)
    end
end

function superposition_valence_charge(basis::PlaneWaveBasis{T}) where {T}
    itps = build_interpolators_valence_charge(basis)

    ρ = map(enumerate(G_vectors(basis))) do (iG, G)
        qnorm_cart = norm(recip_vector_red_to_cart(basis.model, G))
        ρ_G = sum(enumerate(basis.model.atom_groups); init=zero(Complex{T})) do (igroup, atom_group)
            form_factor = itps[igroup](qnorm_cart)  # qnorm_cart is already a norm
            sum(atom_group) do iatom
                structure_factor = basis.structure_factors[iatom][iG]
                form_factor * structure_factor
            end
        end
        ρ_G / sqrt(basis.model.unit_cell_volume)
    end
    enforce_real!(basis, ρ)
    irfft(basis, ρ)
end
```

**Projector / Orbital**

```julia
"""
Build on-grid cubic spline interpolators for each projector/orbital quantity using
`n_qnorm_interpolate` points between 0 and `max|G+k|` over all k-points.
"""
function build_interpolators_projectors(basis::PlaneWaveBasis{T},
                                        atom_groups::Vector{Vector{Int}},
                                        n_qnorm_interpolate::Int=3001) where {T}
    qnorm_max = maximum(maximum(norm.(Gplusk_vectors_cart(basis, kpt))) for kpt in basis.kpoints)
    qnorm_interpolate = range(0, qnorm_max, n_qnorm_interpolate)

    map(atom_groups) do atom_group
        psp = basis.model.atoms[first(atom_group)].psp
        map(0:psp.lmax) do l
            map(1:count_n_proj_radial(psp, l)) do iproj_l
                f̃ = eval_psp_projector_fourier.(psp, iproj_l, l, qnorm_interpolate)
                scale(interpolate(f̃, BSpline(Cubic(Line(OnGrid())))), qnorm_interpolate)
            end
        end
    end
end

function build_projection_vectors(basis::PlaneWaveBasis{T}, atom_groups::Vector{Vector{Int}}) where {T}
    itps = build_interpolators_projectors(basis, atom_groups, 6002)  # itps[group][l][n]
    psps = [basis.model.atoms[first(atom_group)].psp for atom_group in atom_groups]
    psp_positions = [basis.model.positions[atom_group] for atom_group in atom_groups]
    nproj = count_n_proj(psps, psp_positions)
    ngroup = length(atom_groups)
    lmax = maximum(psp.lmax for psp in psps)
    sqrt_Vuc = sqrt(basis.model.unit_cell_volume)

    proj_vectors = map(basis.kpoints) do kpt
        qs_cart = Gplusk_vectors_cart(basis, kpt)
        qnorms_cart = norm.(qs_cart)
        proj_vectors_k = zeros(Complex{T}, size(qs_cart, 1), nproj)

        angular = map(0:lmax) do l  # angular[l][m][q]
            map(-l:l) do m
                map(qs_cart) do q_cart
                    im^l * ylm_real(l, m, q_cart)
                end
            end
        end

        radial = map(1:ngroup) do igroup  # radial[group][l][n][q]
            psp = psps[igroup]
            map(0:psp.lmax) do l
                map(1:count_n_proj_radial(psp, l)) do n
                    map(itps[igroup][l+1][n], qnorms_cart)
                end
            end
        end

        iproj = 1
        for igroup in 1:ngroup
            psp = psps[igroup]
            for iatom in atom_groups[igroup]
                for l in 0:psp.lmax
                    il = l + 1
                    for im in 1:(2l + 1)
                        for n in 1:count_n_proj_radial(psp, l)
                            aff = angular[il][im]  # Form-factor angular part
                            rff = radial[igroup][il][n]  # Form-factor radial part
                            sf = kpt.structure_factors[iatom]  # Structure factor
                            proj_vectors_k[:,iproj] .= sf .* aff .* rff ./ sqrt_Vuc
                            iproj += 1
                        end
                    end
                end
            end
        end
        proj_vectors_k
    end
    return proj_vectors
end
```

Here are some benchmarks comparing the current density superposition and the proposed one (using 3001 q-points for interpolation). The system is:

```julia
a = 10.26;  # Silicon lattice constant in Bohr
lattice = a / 2 * [[0 1 1.];
                   [1 0 1.];
                   [1 1 0.]];
Si = ElementPsp(:Si, psp=load_psp(artifact"pd_nc_sr_lda_standard_0.4.1_upf", "Si.upf"));
atoms     = [Si, Si];
positions = [ones(3)/8, -ones(3)/8];

model = model_LDA(lattice, atoms, positions);
basis = PlaneWaveBasis(model; Ecut=90, kgrid=[4, 4, 4]);
```

For the density:

```
CURRENT
=======
BenchmarkTools.Trial: 28 samples with 1 evaluation.
 Range (min … max):  174.720 ms … 193.339 ms  ┊ GC (min … max): 0.00% … 4.74%
 Time  (median):     178.545 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   180.555 ms ±   4.936 ms  ┊ GC (mean ± σ):  1.39% ± 2.00%

    ▁   █                      ▁     ▁
  ▆▆█▆▆▆█▆▆▆▁▁▁▁▁▆▁▁▁▁▁▁▁▆▆▁▁▁▁█▆▆▆▁▆█▁▁▁▁▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  175 ms           Histogram: frequency by time          193 ms <

 Memory estimate: 52.83 MiB, allocs estimate: 1074285.

PROPOSED
========
BenchmarkTools.Trial: 86 samples with 1 evaluation.
 Range (min … max):  56.158 ms … 67.641 ms  ┊ GC (min … max): 0.00% … 12.14%
 Time  (median):     57.848 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   58.727 ms ±  2.432 ms  ┊ GC (mean ± σ):  1.38% ±  3.36%

        ██  ▃
  ▃▁▃▃▄▆██▆██▅▆▅▃▅▆▄▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▃▁▁▄▁▁▁▃▁▁▃▁▃▁▁▁▁▁▁▄▃▁▁▁▁▃ ▁
  56.2 ms         Histogram: frequency by time        66.6 ms <

 Memory estimate: 20.38 MiB, allocs estimate: 155.
```

The extrema of the absolute error on the computed quantity are (7.060324547225605e-15, 4.0259753170124313e-10).

And for the projectors/orbitals:

```
CURRENT
=======
BenchmarkTools.Trial: 38 samples with 1 evaluation.
 Range (min … max):  129.208 ms … 140.966 ms  ┊ GC (min … max): 0.00% … 4.57%
 Time  (median):     132.293 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   133.667 ms ±   3.660 ms  ┊ GC (mean ± σ):  2.13% ± 2.45%

  ▂ ▂     █                               ▂
  █▁█▁▁▅█▁█▅▅▁▁▅▅▅▅▁▅▁▁▁▁▅▁▁▁▁▁▅▁▅▁▅▁▁▅▁▅██▁▅▁▁▅▁▁▁▁▁▅▁▁▁▁▁▁▅▁█ ▁
  129 ms           Histogram: frequency by time          141 ms <

 Memory estimate: 85.05 MiB, allocs estimate: 189406.

PROPOSED
========
BenchmarkTools.Trial: 95 samples with 1 evaluation.
 Range (min … max):  48.554 ms … 61.851 ms  ┊ GC (min … max): 0.00% … 13.73%
 Time  (median):     52.618 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.946 ms ±  3.124 ms  ┊ GC (mean ± σ):  3.83% ±  4.18%

       █ ▄ ▃ ▁        ▁▁ ▁ ▄
  ▆▁▄▆▆█▆█▆█▇█▇▆▁▆▁▇▄▄██▄█▄█▄▇▄▆▁▄▇▄▄▄▁▁▆▁▄▁▁▄▁▆▄▁▄▁▁▁▁▁▁▁▁▄▆ ▁
  48.6 ms         Histogram: frequency by time        61.4 ms <

 Memory estimate: 76.14 MiB, allocs estimate: 3484.
```

The extrema of the absolute error on the computed quantity are (per k-point):

```
Real: (0.0, 5.94385651808693e-14)     Imag: (0.0, 6.04724603725515e-14)
Real: (0.0, 9.40262041837503e-10)     Imag: (0.0, 6.15765846298521e-10)
Real: (0.0, 3.06965659488400e-11)     Imag: (0.0, 1.77525377210996e-11)
Real: (0.0, 1.85421929546625e-11)     Imag: (0.0, 1.81109788624511e-11)
Real: (0.0, 5.27737619446591e-10)     Imag: (0.0, 9.18646822674340e-10)
Real: (0.0, 2.17376962046919e-09)     Imag: (0.0, 2.17376962046919e-09)
Real: (0.0, 5.31006835724494e-12)     Imag: (0.0, 5.31006835724490e-12)
Real: (0.0, 1.03028994710707e-09)     Imag: (0.0, 1.03028994710707e-09)
```

@antoine-levitt (Member) commented

The error is very encouraging, although I would have expected a larger speedup. Note, though, that the larger the system, the better the speedup. I'm assuming the bottleneck is the precomputation of the function on the interpolation points? If so, we might be able to gain by using a higher-degree interpolation. But this seems like a good compromise between accuracy, simplicity, and speed.

@azadoks (Contributor, Author) commented Feb 24, 2023

I think what's done now is already quite efficient, and as you said, the speedup improves with more atoms and elements, but also with the number of projectors.
d- and f-states should perform much better with this approach.
For ultrasoft pseudos, the number of augmentation functions is proportional to the square of the number of projectors, so the speedup should pay off even more with them in the future.

Most of the time is spent in precomputation for the interpolation. @louisponet gave me some pointers on speeding up the Bessel transforms, which is actually where most of the time goes, so I've been working on that in PseudoPotentialIO. I think the biggest performance gain w.r.t. the Bessel transform would come from removing the need for checks on the angular momentum in `fast_sphericalbesselj`, which could be lifted all the way up to the form-factor calculation here.

I'm not sure how much performance we can squeeze out of higher-degree interpolation, but I'd guess that it's probably max 2x, what do you think?

@antoine-levitt (Member) commented

> I'm not sure how much performance we can squeeze out of higher-degree interpolation, but I'd guess that it's probably max 2x, what do you think?

I don't know; since the function is so smooth, 3000 points is probably overkill and we could get by with much fewer. Also, https://en.wikipedia.org/wiki/Hankel_transform#Numerical_evaluation looks applicable and is probably the best way to do this, but if cubic splines are already good enough, it's probably not worth the bother...

@azadoks (Contributor, Author) commented Feb 27, 2023

I could implement numerical evaluation in PseudoPotentialIO for pseudos on a log grid; it looks pretty neat. This would mainly work for UPF v1 pseudos and HGH pseudos in UPF format, notably not those from PseudoDojo.

The two other methods mentioned on Wikipedia for uniform grids (asymptotic expansion of the Bessel functions and the projection-slice theorem) don't seem to fit our case very well at a glance.

One other trick I've seen: ABINIT uses a recursive algorithm to precompute the Bessel functions of order l = 0 to l = lmax for each pseudopotential, which gives an `rgrid × qgrid × (lmax+1)` lookup table.
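The recursion in question is the standard upward recurrence j_{l+1}(x) = (2l+1)/x · j_l(x) − j_{l−1}(x). A hedged sketch of filling one column of such a table (the function name is hypothetical; note that upward recurrence loses accuracy for x smaller than roughly l, where a real implementation would switch to downward recurrence or a series expansion):

```julia
# Sketch: spherical Bessel functions j_0(x)..j_lmax(x) via upward recurrence
#   j_{l+1}(x) = (2l+1)/x * j_l(x) - j_{l-1}(x)
# Caveat: numerically unstable for x ≲ l; illustration only.
function sphericalbessel_table(lmax::Int, x::Real)
    js = zeros(typeof(float(x)), lmax + 1)
    if x == 0
        js[1] = one(eltype(js))  # j_0(0) = 1, j_l(0) = 0 for l > 0
        return js
    end
    js[1] = sin(x) / x                                # j_0
    lmax >= 1 && (js[2] = sin(x) / x^2 - cos(x) / x)  # j_1
    for l in 1:(lmax - 1)
        js[l + 2] = (2l + 1) / x * js[l + 1] - js[l]
    end
    return js
end
```

Evaluating all orders at once this way is what lets the `rgrid × qgrid × (lmax+1)` table be filled in a single pass over the grid points.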

The last optimization I see helping is truncating the pseudo quantities on the radial grid where they decay to zero. This is recommended by ABINIT for numerical stability, but in my testing it's also another good way to cut out Bessel calls and significantly increase performance.
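A minimal sketch of that truncation, assuming a tabulated radial quantity (the name `truncate_radial` and the tolerance are placeholders):

```julia
# Sketch: drop the tail of a tabulated radial quantity once it has decayed
# below tol, shortening every subsequent Bessel-transform quadrature.
function truncate_radial(r::AbstractVector, f::AbstractVector; tol=1e-12)
    icut = findlast(x -> abs(x) > tol, f)
    icut === nothing && return r[1:0], f[1:0]  # quantity is numerically zero
    return r[1:icut], f[1:icut]
end
```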

At the end of the day, performance improvements in this area are aimed at setup time and mostly affect cases where Ecut is large, the system is large, there are many pseudopotentials, or you're running tens of thousands of relatively quick calculations (which is my case). It's great if it's fast, but I'm not sure we need to optimize it to death.

@antoine-levitt (Member) commented

Yeah, if it's fast enough there's not much point...

> I could implement numerical evaluation in PseudoPotentialIO for pseudos on a log grid; it looks pretty neat. This will mainly work for UPF v1 pseudos / HGH pseudos in UPF format, notably not PseudoDojo.

One could presumably interpolate from a linear to a log grid.

@mfherbst mfherbst added the gpu Label for GPU-related issues label Mar 31, 2023