Introduction

The simulation of quantum systems remains a persistent challenge, primarily due to the exponential growth of the Hilbert space, which makes it exceedingly difficult to parameterize the wave functions of large systems using exact methods. Since the seminal work of Carleo and Troyer1, the idea of using neural networks to simulate quantum systems1,2,3,4,5 has been applied successfully to a large number of systems, leveraging various neural network architectures. These include restricted Boltzmann machines6, convolutional neural networks (CNNs)7, group CNNs8, autoencoders9, as well as autoregressive architectures such as recurrent neural networks (RNNs)5,10,11,12,13,14, with neural network representations of both the amplitude and the phase distribution of the quantum state under consideration. These neural quantum states (NQS) exploit the innate ability of neural networks to efficiently represent probability distributions. When applied to quantum systems, this ability can reduce the number of parameters required to encode the system.

Despite their representational power, NQS have been shown to face challenges during the training process, for example when they are trained to minimize the energy, i.e. to represent ground states. This results from the intricate nature of the loss landscape, characterized by numerous saddle points and local minima that complicate the search for the global minimum15. One promising avenue to overcome this problem is the use of many uncorrelated samples during the training. This strategy is facilitated by autoregressive neural networks16,17, which allow one to sample directly from the wave function's amplitudes. Autoregressive networks have already been applied in the physics context18,19, e.g. for the variational simulation of spin systems11,12,13,14.

Many works have so far focused on NQS representations of spin systems at half-filling, revealing that NQS can be used to study a variety of phenomena relevant to state-of-the-art research, as shown e.g. for RNN representations on various lattice geometries, including frustrated spin systems11,20 and systems with topological order21. For all of these systems, the physics becomes even richer when mobile impurities, e.g. holes, are introduced into the system, yielding a competition between the magnetic background and the kinetic energy of the impurity. Simulating such systems holds particular relevance for understanding high-temperature superconductivity, where the superconducting dome arises upon doping the antiferromagnetic half-filled state with holes22. The search for NQS capable of representing such spinful fermionic systems is still in its early stages. In recent years, the first NQS have been developed that obey fermionic statistics, simulating molecules23,24,25,26,27, spinless fermions17 and spinful fermions28,29,30,31. Among these architectures are FermiNet23,24, Slater-Jastrow ansätze17,28 and variants of Jordan-Wigner transformations25,29,32.

Here, we use an autoregressive neural network architecture, supplemented with a Jordan-Wigner transformation, to simulate ground and excited states of the strong-interaction limit of the Fermi-Hubbard model, believed to capture essential features of high-temperature cuprate superconductors. Specifically, we use a tensorized 2D version of an RNN wave function11 with gated recurrent units33,34,35, which have been shown to successfully model spin systems10,11,20,21,36,37. In the remainder of this paper, we discuss the system under investigation, namely the bosonic and fermionic t − J model, followed by the presentation of a scheme for the calculation of dispersion relations from any NQS architecture, tested on quasiparticle dispersions of 1D and 2D t − J systems. We find that the RNN accurately captures the features of the considered low-energy states. Lastly, we discuss the performance of the RNN ansatz and identify its strengths and bottlenecks.

Results

We apply the RNN to simulate ground and excited states of the fermionic (bosonic) t − J model, both in one and two dimensions. In its generalized form, known as the fermionic (bosonic) t − XXZ model, with anisotropic superexchange interactions Jz and J±, the Hamiltonian under consideration reads:

$$\begin{aligned} \mathcal{H}_{t\mathrm{XXZ}} = &-t\,\sum_{\langle \boldsymbol{i},\boldsymbol{j}\rangle,\sigma} \mathcal{P}_{G}\left(\hat{c}_{\boldsymbol{i},\sigma}^{\dagger}\hat{c}_{\boldsymbol{j},\sigma}+\mathrm{h.c.}\right)\mathcal{P}_{G}\\ &+J_{z}\sum_{\langle \boldsymbol{i},\boldsymbol{j}\rangle}\left(\hat{S}_{\boldsymbol{i}}^{z}\hat{S}_{\boldsymbol{j}}^{z}-\frac{1}{4}\hat{n}_{\boldsymbol{i}}\hat{n}_{\boldsymbol{j}}\right)\\ &+J_{\pm}\sum_{\langle \boldsymbol{i},\boldsymbol{j}\rangle}\frac{1}{2}\left(\hat{S}_{\boldsymbol{i}}^{+}\hat{S}_{\boldsymbol{j}}^{-}+\hat{S}_{\boldsymbol{i}}^{-}\hat{S}_{\boldsymbol{j}}^{+}\right), \end{aligned}$$
(1)

with the fermionic (bosonic) creation and annihilation operators \(\hat{c}_{\boldsymbol{i},\sigma}^{\dagger}\) and \(\hat{c}_{\boldsymbol{i},\sigma}\) for particles at site i with spin σ, spin (pseudospin) operators \(\hat{\boldsymbol{S}}_{\boldsymbol{i}}={\sum}_{\sigma,\sigma^{\prime}}\hat{c}_{\boldsymbol{i},\sigma}^{\dagger}\frac{1}{2}\boldsymbol{\sigma}_{\sigma\sigma^{\prime}}\hat{c}_{\boldsymbol{i},\sigma^{\prime}}\) and density operators \(\hat{n}_{\boldsymbol{i}}\)38. Furthermore, \(\mathcal{P}_{G}\) projects out states with more than one particle per site. Note that for a single hole, this single-occupancy constraint leads to the same statistics for the bosonic and fermionic models. For Jz = J±, Eq. (1) reduces to the t − J model, and for J± = 0 to the t − Jz model. Despite its relevance for the study of unconventional superconductivity, whether there actually are superconducting phases in the t − J and Fermi-Hubbard models is still under debate39,40,41. While finding NQS representations of superconducting states, or – more broadly – pairing wave functions28,30,42,43,44, is still a topic of current research, here we focus on the capacity of the RNN ansatz to represent ground and excited states of the t − J model and not on their superconducting properties.
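
A minimal numerical sketch may help to make Eq. (1) concrete. The following snippet builds the bosonic t − XXZ Hamiltonian for a short illustrative chain (chain length, couplings and sector choice are ours) in the constrained three-state basis and diagonalizes it exactly; as noted above, for a single hole the bosonic and fermionic models have the same spectra, so no fermionic signs are required here.

```python
import numpy as np
from functools import reduce

L, t, Jz, Jpm = 6, 8.0, 4.0, 1.0      # illustrative short chain; couplings as in Fig. 3

# local basis per site: 0 = hole, 1 = spin up, 2 = spin down
def E(a, b):
    m = np.zeros((3, 3))
    m[a, b] = 1.0
    return m

a_up, a_dn = E(0, 1), E(0, 2)         # annihilate a particle, leaving a hole behind
Sz = np.diag([0.0, 0.5, -0.5])
Sp = E(1, 2)                          # S^+ flips a down spin to an up spin
n = np.diag([0.0, 1.0, 1.0])          # occupation number

def site(op, i):                      # embed a 3x3 operator at chain site i
    ops = [np.eye(3)] * L
    ops[i] = op
    return reduce(np.kron, ops)

H = np.zeros((3**L, 3**L))
for i in range(L - 1):                # nearest-neighbor bonds, open boundaries
    j = i + 1
    for a in (a_up, a_dn):            # Gutzwiller-projected hopping: particle <-> hole
        hop = site(a.T, i) @ site(a, j)
        H += -t * (hop + hop.T)
    H += Jz * (site(Sz, i) @ site(Sz, j) - 0.25 * site(n, i) @ site(n, j))
    H += 0.5 * Jpm * (site(Sp, i) @ site(Sp.T, j) + site(Sp.T, i) @ site(Sp, j))

# restrict to the one-hole, S_z = 1/2 sector before diagonalizing
states = (np.arange(3**L)[:, None] // 3**np.arange(L)[None, :]) % 3
mask = (np.sum(states == 0, axis=1) == 1) & (np.sum(states == 1, axis=1) == L // 2)
E0 = np.linalg.eigvalsh(H[np.ix_(mask, mask)])[0]   # one-hole ground state energy
```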

In the absence of doping (\(\hat{n}_{\boldsymbol{i}}=1\)), Eq. (1) reduces to the XXZ model or, in the case of Jz = J±, the Heisenberg model. Prior studies have already utilized RNNs to simulate these spin models20,45, with the possibility of rendering the model stoquastic by making use of the Marshall sign rule46. This is done by implementing the sign rule directly in the RNN architecture20, yielding a simplified optimization of the wave function's phase. In contrast, we do not implement any bias on the phase of the quantum state, in order to make our architecture applicable to any number of holes in the system.

When the ground state at \(\hat{n}_{\boldsymbol{i}}=1\) is doped with a single hole, the resulting mobile impurity gets dressed with a cloud of magnetic excitations. This yields the formation of a magnetic polaron, which has already been observed in ultracold atom experiments47. Its properties strongly depend on the spin background, see Fig. 1a and b. Upon further doping, the strong correlations in the model make the simulation of the Fermi-Hubbard or t − J models numerically challenging, despite impressive numerical advances in the past years39,48,49,50: Commonly used methods all come with their specific limitations; e.g. the density matrix renormalization group51,52 is limited by the area law of entanglement, making it challenging to apply this method in 2D or higher dimensions. Finally, the calculation of spectral functions or dispersion relations E(k)53, shown exemplarily in Fig. 1, is of great interest for many fields of physics, since they reveal the emergent physics of a system under investigation. In condensed matter physics, they are typically used to infer the dominating excitations in the ground state or higher energy states, e.g. upon doping the system. This information is contained in specific features of the spectra, e.g. the bandwidth of the quasiparticle dispersion E(k). However, calculating spectra or dispersions E(k) is in general computationally costly using conventional methods, e.g. density-matrix renormalization group (DMRG) simulations54,55: the former typically involves an in general expensive time evolution of the state, and the latter the evaluation of a global operator, the momentum k, which is typically very costly for matrix product states. Our RNN ansatz uses \(U(1)=U(1)_{\hat{N}}\times U(1)_{\hat{S}_{z}}\) symmetry, i.e. conserved total particle number and total magnetization10,11,20,25,45,56. Further details on the RNN architecture can be found in Methods.

Fig. 1: Results for the t − J and t − Jz square lattice with 10 × 4 sites, t/Jz = 3 and open boundaries in x, periodic boundaries in y direction.
figure 1

a Quasiparticle dispersion of a single hole for the t − J system obtained with the recurrent neural network (RNN) (blue markers), compared to the matrix product state (MPS) spectral function from ref. 53 with the spectral weight S indicated by the colormap and shown in the inset for k = (0.4π, 0.5π) (gray dashed lines). We average the energy over the last 100 training iterations, with the standard deviation denoted by the respective error bars in blue. b Dispersion of the t − Jz system obtained with the RNN, compared to the MPS spectral function and with the same error bars as in a. c Relative errors \(\Delta\epsilon=\frac{E_{\mathrm{RNN}}-E_{\mathrm{DMRG}}}{|E_{\mathrm{DMRG}}|}\) during the training, with hidden dimension dh = 300. The training was restarted after 20,000 steps (dashed lines), with the number of samples per iteration increased from 200 to 600 (t − J) or 1000 (t − Jz) in the last restart.

NQS dispersion relations

Here, we calculate the dispersion relations E(k) in t − XXZ systems of different dimensions and lattice geometries using NQS. The method that we propose is applicable to any NQS architecture, since the momentum is enforced via a momentum constraint in the cost function, forcing the system to a specific target momentum ktarget within the dispersion, see Fig. 2 and Methods. This is in contrast to previous works26,57,58, where the momentum is used in the definition of the wave function coefficients. Hence, the scheme only requires the ability to draw samples from the NQS and to calculate the respective probabilities, making the calculation of ENQS(kx, ky) computationally efficient, and it allows one to use a pretrained NQS ground state as a starting point for the momentum training. Furthermore, the scheme can be combined with spatial symmetries. This could help to improve the accuracy, e.g. when using an NQS with built-in translational invariance. Moreover, additional symmetries could be used to calculate e.g. m4 rotational resonances59 or to probe the competition between the s-, p- or d-wave ground state energies60. Here, we focus on the peak positions of the quasiparticle spectra. In principle, the spectral weights could also be accessed by calculating the overlap \(\langle\psi_{\boldsymbol{k}}^{1h}|\hat{c}_{\boldsymbol{k}}|\psi_{0}^{0h}\rangle\) of the momentum eigenstate \(\psi_{\boldsymbol{k}}^{1h}\) with the ground state \(\psi_{0}^{0h}\) upon removing a particle from the system. To also obtain the peak positions and spectral weights of higher energy states, usually the Green's function is computed. In the context of NQS, this has been done for a quantum Ising model using time-dependent variational Monte Carlo (t-VMC)61, for J1 − J2 and Heisenberg models using an extension of the stochastic reconfiguration algorithm62,63, and for the Hubbard model using dynamical variational Monte Carlo (VMC)64.

Fig. 2: Calculating dispersion relations from NQS.
figure 2

Adding the momentum constraint \(\mathcal{C}_{\boldsymbol{k}_{\mathrm{target}}}\) of Eq. (12) on top of the energy minimization \(\mathcal{C}\) of Eq. (15) in a changes the loss landscape as schematically shown in b and forces the neural quantum state into a higher energy state E(k) with the desired momentum ktarget.

t − XXZ model in 1D

Figure 3a shows the dispersion for a single hole in an antiferromagnetic t − XXZ chain with 20 sites and J± = 1, Jz = 4 and t = 8, obtained with a 1D RNN and exact diagonalization (ED). Note that in the case of a single hole, no holes can be exchanged and hence the bosonic and fermionic models coincide. The results for the ground state energy at kx = 0.5π, obtained during a training with 20,000 iterations, and the energies away from the ground state, shown in Fig. 3, are in relatively good agreement with the exact values from ED. However, at some values of kx ≠ 0.5π the RNN is trapped in local minima close to the ground state. Overall, the RNN succeeds in capturing physical properties like the bandwidth very accurately, revealing the underlying physical excitations:

Fig. 3: Results for the 1D t − XXZ system with 20 sites and J± = 1, Jz = 4 and t = 8.
figure 3

a Quasiparticle dispersion for a single hole obtained with the recurrent neural network (RNN, red markers), compared to exact energies from exact diagonalization (ED, light red lines) and the combined spinon and holon dispersions from Eq. (2) (gray). We average the RNN energy over the last 100 training iterations, each with 200 samples, with the standard deviation denoted by the error bars. We show the exact low-energy excited states as well. b Relative error \(\Delta\epsilon=\frac{E_{\mathrm{RNN}}-E_{\mathrm{ED}}}{|E_{\mathrm{ED}}|}\) during the ground state training. a and b are obtained using a 1D RNN architecture with a hidden dimension dh = 100.

For the system under consideration, the bandwidth and the shape of the dispersion in Fig. 3a are a result of spin-charge separation in 1D systems. Spin-charge separation denotes the fact that the motion of a hole in such an antiferromagnetic (AFM) spin chain with couplings J±, Jz ≪ t can be approximated by an almost free hole that is only weakly coupled to the spin chain. Hence, the dispersion in Fig. 3 can be approximated by two separate dispersions, i.e. holon and spinon dispersions. Here, the holon is the charge excitation, associated with the energy scale t, and the spinon is the spin excitation, associated with the energy scales J±, Jz. In ref. 59 it is shown that the combined dispersion is

$$E({k}_{x})=-2t\,\cos({k}_{h})+{J}_{\pm}\,\cos\left(2\Delta k\right)+{J}_{\pm}+{J}_{z},$$
(2)

where Δk = kx − kh, kh is the momentum of the holon and kx = kh + ks is the combined momentum of the holon and spinon. Eq. (2) is denoted by the gray line in Fig. 3. Again, the agreement with the RNN is relatively good.
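
The gray line can be reproduced directly from Eq. (2). A minimal sketch (assuming that the plotted curve corresponds to the lower edge of the resulting two-parton continuum) scans the holon momentum kh for each total momentum kx:

```python
import numpy as np

t, Jz, Jpm = 8.0, 4.0, 1.0              # parameters of Fig. 3

kx = np.linspace(0.0, np.pi, 101)       # total momentum k_x = k_h + k_s
kh = np.linspace(-np.pi, np.pi, 2001)   # holon momentum, scanned over

# Eq. (2) on a grid of (k_x, k_h) pairs, with Delta k = k_x - k_h
energy = (-2.0 * t * np.cos(kh)[None, :]
          + Jpm * np.cos(2.0 * (kx[:, None] - kh[None, :]))
          + Jpm + Jz)

E_lower = energy.min(axis=1)            # lower edge of the spinon-holon continuum
```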

t − J model on a square lattice

Due to the layered structure of high-Tc superconductors like cuprates22 or nickelates65,66, the physics of t − J systems upon doping is particularly interesting in 2D. In Figs. 1 and 4, the quasiparticle dispersions for a single hole on 10 × 4 and 4 × 4 t − J and t − Jz lattices are presented. In both cases, Figs. 1b and 4b show that the ground state convergence is better for the t − Jz model, with relative errors on the order of Δϵ ≈ 10−3 for both system sizes, yielding a good agreement with the reference energies from DMRG (10 × 4 system) and ED (4 × 4 system) for all considered energies E(kx, ky) away from the ground state. With a relative error of Δϵ ≈ 10−2, the error of the t − J ground states lies above that of the t − Jz systems, which is also reflected in the accuracy of the dispersion ERNN(kx, ky) in Figs. 1 and 4.

Fig. 4: Results for the t − J (blue) and t − Jz (red) square lattice with 4 × 4 sites, t/J = 3 and periodic boundaries.
figure 4

a Quasiparticle dispersion for a single hole obtained with the recurrent neural network (RNN, blue and red markers), compared to the exact energies from exact diagonalization (ED, blue and red lines). We average the energy over the last 100 training iterations, each with 200 samples, with the standard deviation denoted by the respective error bars shown in blue and red. We show the exact low-energy excited states as well. b Relative error Δϵ during the ground state training for t − J (light blue) and t − Jz (light red) square lattice ground states, with a hidden dimension dh = 100 and minimum-step stochastic reconfiguration (t − J) and dh = 70 and Adam (t − Jz). Thick lines are averages over 100 training iterations to guide the eye.

In contrast to the previous section, there is no spin-charge separation in the strict sense in two-dimensional systems. In the case t ≫ J± = Jz =: J that we consider here (t/J = 3), the mobile dopant can be described by fractionalized spinons and chargons that are confined by a string-like potential arising from the distortion of the spin background when the dopant moves through the system67,68. Based on this idea, Laughlin69 drew the analogy with the 1D Fermi-Hubbard or t − J systems and suggested that the dispersion in the respective 2D systems can be interpreted in terms of pointlike partons, spinons and chargons, that interact with each other. This parton picture explains that the quasiparticle dispersion for a single hole is dominated by the spinon, with a bandwidth on the order of J±, with corrections by the chargon on energy scales of t53. This mechanism also provides the explanation for the flat dispersion of the t − Jz model in contrast to the t − J model, as captured by the RNN, see Figs. 1 and 4. Despite the small deviations from the dispersions calculated with ED or DMRG, our RNN architecture succeeds in capturing the respective bandwidths of the t − Jz and t − J models very accurately, allowing us to gain valuable insights on the spinon and chargon physics from the RNN dispersions. Furthermore, the fact that node (π/2, π/2) and antinode (π, 0) are degenerate in the 4 × 4 system is correctly reproduced.

Lastly, we would like to mention that there is a small region of suppressed spectral weight near (π, π) in the DMRG results of the t − J system59. This suppression yields difficulties for our RNN scheme that are further discussed in Supplementary Note 3.

t − J model on a triangular lattice

On triangular lattices, the observed physical phenomena are distinctly different from the physics of bipartite lattices, due to frustration and the absence of particle-hole symmetry in non-bipartite lattices, including e.g. kinetic frustration70,71. In particular, the underlying constituents upon doping the triangular ladder are not known71, making the triangular lattice an intriguing system to study. Recent advancements have shown that these lattices can also be studied experimentally using optical triangular lattices72,73,74 and solid-state platforms based on Moiré heterostructures75,76,77.

Triangular spin systems have already been studied using RNNs20. Here, we consider a triangular t − J ladder with length Lx = 9, with the quasiparticle dispersion for a single hole and the learning curves with and without doping shown in Fig. 5.

Fig. 5: Results for the t − J model on a triangular lattice with 9 × 2 sites, t/J = 3 and periodic boundaries along x direction.
figure 5

a Quasiparticle dispersion for a single hole obtained with the recurrent neural network (RNN, blue markers), compared to the energies from exact diagonalization (ED, light blue lines). The large error for k = 0.998π appears together with a large violation of translational invariance ΔTransl., measured by the relative difference between the log-amplitudes \(\log|\psi(\sigma)|^{2}\) and \(\log|\psi(\hat{T}_{y}\sigma)|^{2}\) as defined in Eq. (3), shown in green. All values are averages over the last 100 training iterations, each with 200 samples, with the standard deviation denoted by the blue error bars. We show the exact low-energy excited states as well. b Relative error Δϵ during the ground state training without doping (orange) and with one hole (blue).

As suggested in ref. 20, we use variational annealing for the training on the triangular lattice, which was shown to improve the performance for frustrated systems like the triangular Heisenberg model20, see Methods. Figure 5 shows that this procedure yields relatively good results for the ground states, with errors of Δϵ ≈ 0.001 for both Nh = 0 and Nh = 1. For the dispersion shown in Fig. 5a, we consider the momentum k defined along the ladder, as shown in the inset figure. When enforcing k ≠ 0.444π away from the ground state, the exact energy gaps from ED to the first excited states strongly decrease, and the RNN gets trapped in these states in most cases, in particular for k > 0.444π. Furthermore, the error bars of the enforced momenta are much larger compared to the other lattice geometries studied in Figs. 1, 3 and 4, suggesting that the RNN states partly break the translational invariance and hence challenge the momentum optimization scheme. This is further supported by the relative difference between the wave function amplitudes \(\log|\psi_{\boldsymbol{\lambda}}(\sigma_{i})|^{2}=\log p_{\boldsymbol{\lambda}}(\sigma_{i})\) and those of the respective translated samples, \(\log|\psi_{\boldsymbol{\lambda}}(\hat{T}_{\mathbf{e}_{\mu}}\sigma_{i})|^{2}=\log p_{\boldsymbol{\lambda}}(\hat{T}_{\mathbf{e}_{\mu}}\sigma_{i})\),

$${\Delta}_{\mathrm{Transl.}}^{\mu}=\frac{1}{{N}_{s}}{\sum}_{i}\frac{{\left(\log {p}_{\boldsymbol{\lambda}}({\sigma}_{i})-\log {p}_{\boldsymbol{\lambda}}\left({\hat{T}}_{{\mathbf{e}}_{\mu}}{\sigma}_{i}\right)\right)}^{2}}{{\left(\log {p}_{\boldsymbol{\lambda}}({\sigma}_{i})+\log {p}_{\boldsymbol{\lambda}}\left({\hat{T}}_{{\mathbf{e}}_{\mu}}{\sigma}_{i}\right)\right)}^{2}},$$
(3)

which we take as a measure for the violation of translational invariance at the respective momenta. Figure 5 shows that the momenta with large errors coincide with a large \(\Delta_{\mathrm{Transl.}}^{y}\).
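
Since Eq. (3) only involves the log-probabilities of the samples and of their translated counterparts, it is cheap to evaluate alongside the training. A minimal sketch (function and argument names are ours):

```python
import numpy as np

def delta_transl(logp, logp_translated):
    """Translational-invariance violation of Eq. (3).

    logp:            log p_lambda(sigma_i) for N_s sampled configurations
    logp_translated: log p_lambda(T_{e_mu} sigma_i) for the translated samples
    """
    logp = np.asarray(logp)
    logp_translated = np.asarray(logp_translated)
    return np.mean((logp - logp_translated) ** 2 / (logp + logp_translated) ** 2)
```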

Performance of the RNN ansatz

The analysis of the quasiparticle dispersions indicates that our bosonic and fermionic RNN ansätze successfully learn the dominating physics in the considered regimes. In the following, we provide a more detailed discussion of the performance of the RNN ansatz for fermionic and bosonic spin systems upon hole doping, focusing on ground states on a 4 × 4 square lattice.

Figure 6 shows the relative error for the ground state energies of the t − J(z) model obtained with our RNN ansatz upon doping the half-filled system with Nh holes. Starting from Nh = 0, where the t − J model reduces to the Heisenberg model, our RNN reaches a relative ground state energy error of Δϵ ≈ 10−4 after 20,000 training steps compared to ED. Figure 6b shows that the respective phase and amplitude distributions are relatively simple in this case, with a low variance for the logarithmic amplitude and only two values for the phase, 0 and π. Note that compared to the literature on ground state representations of the Heisenberg model using RNNs11,45, the optimization problem in our setup is more challenging for the following reasons: (i) The RNN that we use has a local Hilbert space dimension of three states instead of two, allowing for all values of Nh in principle. (ii) Our RNN learns the sign structure without any bias, i.e. we do not implement the Marshall sign rule in the RNN, which would only work for Nh = 0. (iii) We do not include the knowledge of spatial symmetries yet, which can improve the performance as shown in Methods.

Fig. 6: Recurrent neural network (RNN) representation for ground states of the bosonic and fermionic t − J(z) model with t/J = 3, 0≤Nh≤12 for a 4 × 4 square lattice with open boundaries.
figure 6

a Relative energy error for bosons (blue) and fermions (orange) as well as the Hilbert space dimension for a fixed number of holes Nh and magnetization \(\hat{S}_{z}\) (gray). b Logarithmic amplitude \(\log(|\psi|^{2})\) and phase \(\mathrm{Im}\log(\psi)\) distributions from exact diagonalization (ED) for exemplary bosonic (blue) and fermionic (orange) hole numbers. c Relative error and Hilbert space dimension for the t − Jz model. We use a hidden dimension dh = 100. All values obtained from the RNN are averages over the last 100 training steps (each with 200 samples), and error bars denote the respective standard deviation.

Upon doping, the exact log-amplitude and phase distributions from ED can become more complicated than for the t − Jz model. For example, for Nh = 4, the variance of the exact amplitudes becomes very large, \(\sigma_{N_{h}=4}^{\mathrm{b}}(\log|\psi|^{2})=15.91\), see Fig. 6b. This yields larger ground state energy errors than for the t − Jz model, and the problem is further complicated when including the antisymmetry in the fermionic case. Again, we observe that for larger hole dopings, Nh ≥ 6 for bosons and Nh ≥ 10 for fermions, the distributions of phase and amplitude become less complicated than in the low to intermediate doping regime, yielding a higher accuracy of the RNN wave function, with errors Δϵ ≤ 10−4 for bosons and Δϵ ≤ 10−2 for fermions in the respective doping regimes.

Our results show that in the low doping regime of the t − J model, both fermionic and bosonic systems are difficult to learn, see Fig. 6. This suggests that not only the fermionic sign structure poses difficulties: First, the Hilbert space dimension in the finite doping regime, indicated in gray in Fig. 6a and c, becomes much larger than for spin systems, challenging both the RNN ansatz and its training. Second, the frustrated motion of holes in the AFM Heisenberg background can potentially cause problems. When these holes move through the system, the spin background is affected, giving rise to an effective J1 − J2 spin model with nearest and next-nearest neighbor spin exchange interactions, which is hence more difficult to learn78. For the t − Jz model, we observe that the relative errors are comparably low in the bosonic case, probably due to the lack of spin dynamics resulting from the absence of spin-flip terms. Furthermore, for all states with high \(\log|\psi|^{2}\) variance, there is a significant number of configurations σ with a large negative log-amplitude, i.e. \(|\psi(\sigma)|^{2}\approx 0\). This makes an accurate determination of expectation values extremely costly and can affect the training process. For example, in ref. 79 it was shown that this yields higher variances for the gradients determined by stochastic reconfiguration. Lastly, Fig. 6 shows that the performance decreases for fermions, in agreement with the fact that already on a mean-field level, fermionic Slater determinants are much more complicated than the bosonic states.

In the Methods section, we provide a detailed discussion of the challenges encountered during the training of our t − J RNN architecture, which yield the relatively large errors seen e.g. in Fig. 6. Besides the increased Hilbert space dimension and the small amplitudes for certain configurations discussed above, we discuss the learning plateau associated with a local minimum that is encountered for all considered optimization routines (including annealing20, minimum-step stochastic reconfiguration (minSR)80 and the recently proposed stochastic reconfiguration (SR) variant based on a linear algebra trick81), together with the fact that SR algorithms have problems with autoregressive architectures82; the complicated interplay between phase and amplitude optimization15; and the difficulty of implementing constraints on the symmetry sector under consideration, e.g. the particle number, magnetization and spatial symmetries, directly in the RNN architecture11,45. Many of these challenges are inherent to the simulation of both bosonic and fermionic systems. Our results indicate that the bottleneck for simulating fermionic spinful systems is the training and not only the expressivity of the ansatz, and they point the way to possible improvements concerning the ansatz and the training procedure.

Conclusions

To conclude, we present a neural network architecture, based on RNNs11, to simulate ground states of the fermionic and bosonic t − J model at finite hole doping. We show that, despite many challenges due to the increased complexity of the learning problem compared to spin systems, the RNN succeeds in capturing physical properties like the shape of the dispersion, indicating the dominating emergent excitations of the systems. In order to calculate the dispersion, we present a method that can be used with any NQS ansatz and any lattice geometry, and we map out quasiparticle dispersions using the RNN ansatz for several lattice geometries, including 1D and 2D systems. Moreover, the method enables an extremely efficient calculation of dispersion relations compared to conventional approaches like DMRG54, which usually require a time evolution of the state55. The dispersion scheme yields a good agreement with exact diagonalization and DMRG results, and it is expected to perform even better for a better ground state convergence. In principle, it can also be combined with a translationally symmetric NQS ansatz to improve the accuracy. Furthermore, the scheme could be combined with additional symmetries, e.g. rotational symmetries, enabling the calculation of m4 rotational spectra83.

Methods

The RNN ansatz and its bottlenecks

Given the relatively high errors on the ground state energies in some cases, we test potential bottlenecks of our approach in the following, namely: (i) difficulties in learning either the phase or the amplitude, by considering the partial learning problems separately; (ii) the optimization procedure; (iii) the optimization landscape; (iv) the expressivity of the RNN ansatz, compared to the complexity of the learning problem.

The partial learning problem

One potential bottleneck of our approach is the way the RNN wave function is split into amplitude and phase. In order to test if there are problems with the optimization of the phase or amplitude alone, we consider their learning problems separately as suggested e.g. in refs. 15,84.

  1. Phase training: We sample from the exact ground state distribution \(|\psi|^{2}\), calculated with ED, and optimize only the phase.

  2. Amplitude training: Given the correct phase distribution from ED, we optimize only the logarithmic amplitude to check if the ground-state probability amplitudes can be learned.

Figure 7 shows the results of the amplitude and phase trainings (dark and light blue), compared to the full training of both amplitude and phase (red). For all considered systems, the results of the partial trainings are closer to the exact ground state; e.g. for open boundaries and Nh = 1, the relative error decreases from Δϵ = 0.0147(37) to Δϵ = 0.0040(30) for the amplitude training and Δϵ = 0.0039(33) for the phase training. However, in all considered cases we observe the same problem as in the full training: the RNN gets stuck in a plateau that survives up to 20,000 training steps. Although the relative error of the plateau decreases when considering the partial learning problems, the improvement is surprisingly small given the amount of information added to the training. Furthermore, whether the amplitude or the phase training is more problematic remains unclear. Even for the phase training, for which the training samples are generated from the exact distribution \(|\psi|^{2}\) calculated with ED, the improvement is not significantly larger than for the amplitude training. This is in agreement with the results of Bukov et al.15.
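
As one possible realization of the phase training (an illustrative choice of ours, not necessarily the loss used here), the sampled infidelity below assumes that the model amplitudes are pinned to the exact ones, so that only the phase mismatch enters:

```python
import numpy as np

def phase_infidelity(phi_model, phi_exact):
    """1 - |<psi_exact|psi_model>| when both states share the exact amplitudes.

    phi_model, phi_exact: phases evaluated on configurations sigma_i drawn
    from the exact Born distribution |psi_exact(sigma)|^2 (e.g. from ED).
    """
    dphi = np.asarray(phi_model) - np.asarray(phi_exact)
    overlap = np.mean(np.exp(1j * dphi))   # <e^{i Delta phi}> under |psi_exact|^2
    return 1.0 - np.abs(overlap)
```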

Fig. 7: Partial training.
figure 7

Separate amplitude (dark blue) and phase (light blue) training for ground states of the t − J model on a 4 × 4 square lattice with t/J = 3, open boundaries and without holes Nh = 0 (a) and with a single hole Nh = 1 (b), compared to the full training in red. Thick lines and markers represent averages over 100 epochs (each with 200 samples) to guide the eye, with the respective standard deviation denoted by the error bars. We use a hidden dimension of dh = 70.

Comparison of optimizers

As a next test, we compare the results of different optimizers in Fig. 8a, namely stochastic gradient descent (SGD), adaptive methods like AdaBound85 and Adam86, and more advanced methods such as Adam+Annealing20 and the recently developed variant of stochastic reconfiguration (SR), minimum-step SR (minSR)80. We show the optimization results for the t − Jz model on the left and the t − J model on the right, both for Nh = 1.

Fig. 8: Testing different optimizers.
figure 8

a Optimization results for the t − Jz model (left) and the t − J model (right) on a 4 × 4 square lattice with t/Jz = 3, both for a single hole Nh = 1 and periodic boundaries, using stochastic gradient descent (SGD), AdaBound, Adam, Adam+Annealing and minimum-step stochastic reconfiguration (minSR), with 200 samples (1000 samples for minSR) in each variational Monte Carlo (VMC) step. All values are averages over the last 100 training steps; error bars denote the respective standard deviation. b Eigenvalues of the T-matrix (minSR algorithm80, solid lines) and of the \(X^{T}X\) matrix (stochastic reconfiguration variant of Rende et al.81, dotted lines) before the training, for the 4 × 4 t − J system with one hole and open boundaries and hidden dimensions dh = 30, 70, using 1000 samples.

Typically, Adam is used for the optimization of RNN wave functions11,20,45,87, adapting the learning rate in each VMC update. With 200 samples per optimization step, Adam yields relative errors on the order of Δϵ ≈ 10−3 for the t − Jz model and Δϵ ≈ 10−2 for the t − J model. AdaBound, which employs dynamic bounds on the learning rates, yielding a gradual transition from Adam to SGD during the training, gives similar results.

Another modification of the Adam training is the use of variational annealing, which was shown to improve the performance for frustrated systems20. The idea of annealing is to avoid getting stuck in local minima by including an artificial temperature T in the learning process. To this end, the variational free energy of the model,

$${F}_{\boldsymbol{\lambda}}=\langle {H}_{\boldsymbol{\lambda}}\rangle -T({n}_{\mathrm{step}})\cdot S$$
(4)

is minimized instead of the energy of Eq. (13). Here, the averaged Hamiltonian \(\langle H_{\boldsymbol{\lambda}}\rangle\) is given by \(\langle H_{\boldsymbol{\lambda}}\rangle={\sum}_{\sigma}|\psi_{\boldsymbol{\lambda}}(\sigma)|^{2}H_{\boldsymbol{\lambda}}(\sigma)\). Furthermore, S denotes the Shannon entropy

$$S=-{\sum}_{\sigma}|{\psi}_{\boldsymbol{\lambda}}(\sigma){|}^{2}\log\left[|{\psi}_{\boldsymbol{\lambda}}(\sigma){|}^{2}\right].$$
(5)

The minimization procedure that we use starts with a warmup phase at constant temperature T0, before decreasing the temperature linearly with the minimization step t, T(t) = T0(1 − (t − twarmup)/τ). Typically, we use τ = 5000 and stop the training after tfinal = 20,000 training iterations, but tests up to τ = 20,000 and tfinal = 40,000 did not yield any improvements. Figure 8a shows that for the square lattice, the use of annealing does not bring any advantage within the error bars.
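
A minimal sketch of the annealed objective, Eqs. (4) and (5), with the linear schedule described above (default values are illustrative):

```python
import numpy as np

def temperature(step, T0=1.0, t_warmup=1000, tau=5000):
    """Constant warmup at T0, then T(t) = T0 * (1 - (t - t_warmup) / tau)."""
    if step < t_warmup:
        return T0
    return max(T0 * (1.0 - (step - t_warmup) / tau), 0.0)

def free_energy(e_loc, log_psi2, step):
    """Sampled estimate of F = <H> - T(step) * S.

    e_loc:    local energies E^loc(sigma_i) for samples sigma_i ~ |psi|^2
    log_psi2: log |psi_lambda(sigma_i)|^2 for the same samples
    """
    energy = np.mean(np.real(e_loc))
    entropy = -np.mean(log_psi2)   # Monte Carlo estimate of the Shannon entropy, Eq. (5)
    return energy - temperature(step) * entropy
```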

Lastly, we apply two recently developed variants of stochastic reconfiguration (SR): minSR80 and the linear algebra trick by Rende et al.81. Both methods are introduced later in the text, see Eqs. (17) and (18). For a stable training, we ensure non-exploding gradients by adding an offset δ(t) to the diagonal of the T-matrix, with δ(t) decaying exponentially from 1 to 10−10. After determining the gradients using Eq. (17), we apply the Adam update rule, which we empirically find to perform better than the plain gradient descent update. Moreover, since it is crucial to use enough samples for a sufficiently good approximation of the gradients in SR, typically more samples are needed than for the other optimization routines. Here, we use 1000 samples in each minSR update and find that the errors of the one-hole t − J ground state improve below the values obtained with Adam, see Fig. 8a on the right. However, we show in Supplementary Note 2 that a comparison with Adam using the same number of samples does not conclusively determine which optimization routine is better, similar to the SR results in ref. 15.

The reason behind this can be understood by considering the spectrum of the T-matrix of the minSR algorithm: Similar to the results of ref. 82 for the S-matrix of the SR algorithm, Fig. 8b shows that the eigenvalues λi of T decrease extremely rapidly, in particular at the beginning of the training, indicating a very flat optimization landscape. This is a typical problem of autoregressive architectures82 and causes uncontrolled, large values of T−1 and consequently also of the gradients δλ, see Eq. (17). Furthermore, the shape of the spectrum does not have any feature indicating that it could be cut off at a specific eigenvalue, making a regularization very difficult. Hence, the diagonal offset δ(t) must be chosen relatively large, yielding parameter updates that are very similar to the plain Adam optimization as long as δ(t) is larger than many of the T-eigenvalues. The spectrum of the \(X^{T}X\) matrix of the SR variant by Rende et al.81, see Eq. (18), exhibits the same problem.

When comparing the results for different hidden dimensions, e.g. for minSR in Fig. 8a (right), one may suspect that a hidden dimension dh > 100 could in principle improve the results further. However, we will show below that for such a large number of parameters it is even possible to encode the wave function using exact methods, by restricting to a fixed number of holes and hence reducing the Hilbert space dimension to \(\ll 3^{N_{\mathrm{sites}}}\).

Symmetries

The RNN ansatz we use has a built-in \(U(1)=U(1)_{\hat{N}}\times U(1)_{\hat{S}_{z}}\) symmetry, i.e. a conserved total particle number and a total magnetization of \(\hat{S}_{z}=0\,(0.5)\) for even (odd) particle numbers11,25. This is done by calculating the current particle number Np(i) (magnetization Sz(i)) after the i-th RNN cell during the sampling process and, once Np(i) = Ntarget (Sz(i) = Sz,target), assigning a zero conditional probability to adding further particles (spins) on all subsequent sites j > i, see Supplementary Note 1C. For even (odd) particle numbers, we use Sz,target = 0 (Sz,target = 0.5). As a next test, we employ additional spatial symmetries: For a symmetry operation \(\mathcal{T}\) of the lattice symmetry group, we know that

$$|\psi(\sigma){|}^{2}=|\psi({\mathcal{T}}\sigma){|}^{2}$$
(6)

for the exact ground state. For rotational C4 symmetry of the square lattice, we employ this constraint (i) in the training, by implementing it in the cost function, or (ii) in the RNN ansatz as in ref. 11.

The constraint in the cost function used in (i) is implemented by rotating all samples drawn from \(|\psi_{\boldsymbol{\lambda}}|^{2}\) according to C4 in each VMC step, calculating \(p_{\boldsymbol{\lambda}}(\mathcal{T}_{i}\sigma)=|\psi_{\boldsymbol{\lambda}}(\mathcal{T}_{i}\sigma)|^{2}\) for all \(\{\mathcal{T}_{i}\}_{i}\) and adding the squared difference \(\gamma(t){\sum}_{\sigma}\left(|\psi_{\boldsymbol{\lambda}}(\sigma)|^{2}-|\psi_{\boldsymbol{\lambda}}(\mathcal{T}_{i}\sigma)|^{2}\right)^{2}\) with a prefactor \(\gamma(t)={\gamma}_{0}\log_{10}(1+9(t-t_{\mathrm{warmup}})/\tau)\) to the cost function. Typically, we use long ramp times on the order of τ = 5000 steps.

For (ii), we assign

$${p}_{\boldsymbol{\lambda}}(\sigma)=\frac{1}{|{\{{\mathcal{T}}_{i}\}}_{i}|}{\sum}_{{\mathcal{T}}={\mathbb{1}},{\{{\mathcal{T}}_{i}\}}_{i}}|{\psi}_{\boldsymbol{\lambda}}({\mathcal{T}}\sigma){|}^{2}$$
(7)

for all operations \(\mathcal{T}_{i}\) in the symmetry group, similar to ref. 11. Note that this symmetrization scheme preserves the autoregressive property of the RNN, as pointed out e.g. in ref. 88.
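
For a square lattice stored as a 2D array, the symmetrized probability of Eq. (7) reduces to an average over the four rotated copies of each configuration. A minimal sketch (the callable psi2_fn and the array layout are our illustrative assumptions):

```python
import numpy as np

def p_symmetrized(psi2_fn, sigma):
    """C4-symmetrized probability in the spirit of Eq. (7).

    psi2_fn: callable returning |psi_lambda(sigma)|^2 for one configuration
    sigma:   configuration as an L x L integer array (square lattice)
    """
    rotations = [np.rot90(sigma, k) for k in range(4)]  # identity + three C4 rotations
    return np.mean([psi2_fn(rot) for rot in rotations])
```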

The optimization results using (i) and (ii) are shown in Fig. 9 for the t − J and t − Jz model on a 4 × 4 square lattice. It can be seen that constraining the RNN wave function directly via (ii) is more successful than via the cost function (i): Using (ii), we obtain relative errors for the t − Jz model that are an order of magnitude lower than the results without spatial symmetries. This possibly results from the fact that the additional symmetry constraint in the cost function leads to barriers in the loss landscape in the regions where the symmetry is violated. Even when increasing the symmetry constraint gradually during the training, as described above, these barriers can prevent the optimization from getting close to the minimum.

Fig. 9: The effect of U(1) and spatial symmetries.
figure 9

Relative error for the t − J (dark blue) and t − Jz (light blue) models on a 4 × 4 square lattice with one hole, t/Jz = 3 and periodic boundaries, averaged over 100 training steps with 200 samples per step (error bars denote the respective standard deviation). We use a recurrent neural network (RNN) with built-in \(U(1)=U(1)_{\hat{N}}\times U(1)_{\hat{S}_{z}}\) symmetry, and with U(1) and C4 symmetry implemented via the cost function or the RNN ansatz, with a hidden dimension of dh = 70. For the t − Jz model, we provide the relative energy errors as numbers in light blue.

The t − J model results do not improve significantly for either symmetry implementation, (i) or (ii), with an error on the order of Δϵ ≈ 10−2 with and without spatial symmetries. Hence, we conclude that applying symmetries only helps to improve the accuracy if the ground state can already be learned sufficiently well, as for the t − Jz model.

For systems with sufficiently good convergence, rotational symmetries like s-, p- or d-wave symmetries could also be enforced to probe the competition between the ground state energies in the respective symmetry sectors60, which is highly relevant for the study of high-Tc superconductivity. In addition, low-energy excited states in these symmetry sectors could be calculated by making use of the dispersion scheme presented in this work, e.g. m4 rotational spectra89.

Complexity of the learning problem

Lastly, we consider the complexity of our learning problem and compare it to the expressivity of our RNN ansatz in terms of the number of parameters encoded in the RNN. In Fig. 10 on the left, we show the number of parameters used in the RNN ansatz for the 4 × 4 t − J square lattice for hidden dimensions 30 ≤ dh ≤ 100. The number of parameters encoded in the ansatz is slightly lower than the number of parameters that is actually used (gray circles on the left). This is due to the way we encode the U(1) symmetry in our approach, resulting in a small fraction of weights that are never updated, since the respective probabilities are set to zero to obey the U(1) symmetry, see Supplementary Note 1C. Furthermore, we show the dimension of the Hilbert space for the same system (with variable particle number), \(3^{16}\), in black. For the small system size considered in Fig. 10, this Hilbert space dimension is two orders of magnitude larger than the number of RNN parameters. For the 10 × 4 system in Fig. 1, however, our RNN representation has 13 orders of magnitude fewer parameters than the dimension \(3^{40}\) of the Hilbert space that is learned.

Fig. 10: Number of parameters for the exact wave function of a 4 × 4 system compared to the recurrent neural network (RNN) ansatz.
figure 10

a We compare the number of parameters of the exact wave function using \(U(1)_{\hat{N}}\times U(1)_{\hat{S}_{z}}\) symmetry for 0 ≤ Nh ≤ 4 holes (blue) to the Hilbert space dimension \(3^{16}\) that we want to learn with the RNN ansatz. The effective number of parameters of the RNN ansatz, i.e. the number of parameters that is kept when enforcing the U(1) symmetry, with hidden dimension 30 ≤ dh ≤ 100, is denoted by the gray markers. b Hilbert space dimension for a local dimension of 2 (Heisenberg model), 3 (t − J model) and 4 (Fermi-Hubbard model).

The Hilbert space dimension \(3^{N_{\mathrm{sites}}}\) considered so far in this section allows for three states per site (spin up, spin down and hole), i.e. for a variable number of holes in the system. For a fixed number of holes, the number of parameters needed to describe the exact state is given by the number D of combinations to distribute a fixed number of holes Nh and particles N↑ and N↓ on the Nsites sites, i.e.

$$D=\frac{{N}_{\mathrm{sites}}!}{{N}_{\downarrow}!\,({N}_{\mathrm{sites}}-{N}_{\downarrow})!}\cdot \frac{({N}_{\uparrow}+{N}_{h})!}{{N}_{\uparrow}!\,{N}_{h}!},$$
(8)

shown also in Fig. 6a and c (gray), with \(D\ll 3^{N_{\mathrm{sites}}}\), as shown by the blue lines in Fig. 10 for 1 ≤ Nh ≤ 4. In fact, for Nh = 1 our RNNs encode even more parameters than this exact parameterization when dh > 70. This reveals one main problem of our RNN ansatz, namely the way the U(1) symmetries are encoded: The Hilbert space dimension D at fixed particle number and magnetization is typically much smaller than the dimension accessible to the RNN, since setting the RNN conditionals to zero along the sampling path corresponds to a dimension \(3^{x}2^{y}1^{z}\) with x + y + z = Nsites and typically small y and z. For future studies, we envision an RNN ansatz for a fixed number of holes in the spirit of Eq. (8), reducing the dimension of the parameter space that needs to be learned and hence facilitating the learning problem.
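
The counting of Eq. (8) is easily checked numerically; e.g. for the 4 × 4 lattice with one hole in the Sz = 1/2 sector (our illustrative sector choice), the fixed-sector dimension is smaller than the full three-state Hilbert space by more than two orders of magnitude:

```python
from math import comb

def sector_dim(n_sites, n_holes, n_up, n_down):
    """Eq. (8): dimension of the sector with fixed (N_h, N_up, N_down)."""
    assert n_holes + n_up + n_down == n_sites
    return comb(n_sites, n_down) * comb(n_up + n_holes, n_holes)

n_sites = 16                             # 4 x 4 lattice
print(3**n_sites)                        # variable hole number: 43046721
print(sector_dim(n_sites, 1, 8, 7))      # one hole, S_z = 1/2: 102960
```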

Lastly, we would like to point out that the learning problem considered here is more complex than for the spin systems typically studied with this architecture11,36,37,45, as can be seen when comparing the Hilbert space dimensions for local dimension d = 2, as for spin systems, vs. d = 3, as for the t − J model, in Fig. 10 on the right. For larger systems, this difference increases; e.g. for the 10 × 4 system in Fig. 1, the Hilbert space dimension increases by seven orders of magnitude when going from a spin to a t − J system (with a flexible number of holes). This problem becomes even more pronounced if the Fermi-Hubbard model with local dimension d = 4 were considered.

NQS dispersion relations

Our dispersion scheme relies on an additional term in the cost function that penalizes momenta of the NQS away from the target momentum. The momentum kNQS of the NQS wave function is calculated from the translation operator \(\hat{T}_{\boldsymbol{R}}\), which translates a state ψ(r) by the respective vector R, i.e. \(\hat{T}_{\boldsymbol{R}}\psi(\boldsymbol{r})=\psi(\boldsymbol{r}-\boldsymbol{R})\). Furthermore, it can be written as90

$${\hat{T}}_{\boldsymbol{R}}=\exp\left(-i\boldsymbol{R}\cdot \hat{\boldsymbol{k}}\right),$$
(9)

with the momentum operator \(\hat{\boldsymbol{k}}\). To determine the expectation value kNQS = (kx, ky) using samples σ drawn from the NQS wave function, we calculate the expectation value of \(\hat{T}_{\boldsymbol{R}}\). For example, for a square lattice, this is done by translating all snapshots by R = ex and R = ey, with ∣eμ∣ = a for lattice spacing a and μ = x, y. Then, we calculate the respective translated states, \(\psi_{\boldsymbol{\lambda}}(\hat{T}_{\boldsymbol{e}_{\mu}}\sigma)\), to determine the expectation value

$$\left\langle {\psi}_{\boldsymbol{\lambda}}\right\vert {\hat{T}}_{{\boldsymbol{e}}_{\mu}}\left\vert {\psi}_{\boldsymbol{\lambda}}\right\rangle =\exp\left(-i{\boldsymbol{e}}_{\mu}\cdot {\boldsymbol{k}}_{\mathrm{NQS}}\right)\approx \frac{1}{{N}_{s}}{\sum}_{i}\frac{{\psi}_{\boldsymbol{\lambda}}({\hat{T}}_{{\boldsymbol{e}}_{\mu}}{\sigma}_{i})}{{\psi}_{\boldsymbol{\lambda}}({\sigma}_{i})},$$
(10)

with the first equality due to the translational invariance of the ground state of a square lattice, which we assume to be (approximately) present for our NQS ground states, see also Supplementary Note 3. Hence,

$${\boldsymbol{k}}_{\mathrm{NQS}}^{\mu}=\frac{i}{a}\log \left\langle {\psi}_{\boldsymbol{\lambda}}\right\vert {\hat{T}}_{{\boldsymbol{e}}_{\mu}}\left\vert {\psi}_{\boldsymbol{\lambda}}\right\rangle .$$
(11)

Using a sufficiently converged NQS ground state wave function as the initial state, we train using VMC with an additional term in the loss function,

$${\mathcal{C}}({\boldsymbol{k}}_{\mathrm{target}})=\gamma(t){\sum}_{\mu}{\left({\boldsymbol{k}}_{\mathrm{NQS}}^{\mu}-{\boldsymbol{k}}_{\mathrm{target}}^{\mu}\right)}^{2},$$
(12)

with the RNN momentum kNQS and the target momentum ktarget. We use a prefactor \(\gamma(t)={\gamma}_{0}\log_{10}(1+9(t-t_{\mathrm{warmup}})/\tau)\) that is turned on with typically τ = 100, …, 1000 and γ0 = 1, …, 10 and gradually lifts all areas in the loss landscape that correspond to an NQS wave function with momentum kNQS ≠ ktarget, forcing the NQS to a higher energy state at momentum kNQS = ktarget, see Fig. 2.

For ktarget far away from the ground state momentum, we observe empirically that the imaginary part of kNQS can become large, on the same order as the real part, in particular if the ground state accuracy was not sufficiently high. In these cases, the RNN ends up in states that are not eigenstates of the momentum operator. In order to prevent our RNN wave function from getting trapped in such states, we apply an additional constraint in the loss function, penalizing large imaginary parts of the momentum, \(\mathrm{Im}\,\boldsymbol{k}_{\mathrm{NQS}}\).
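
A minimal evaluation-only sketch of Eqs. (10)-(12), including the optional penalty on the imaginary part (names and the interface are our illustrative assumptions; during training, the same expression would be evaluated inside an automatic-differentiation framework so that gradients can flow):

```python
import numpy as np

def momentum_penalty(psi_fn, samples, translate_fn, k_target, a=1.0, gamma=1.0):
    """Momentum constraint of Eq. (12), built from Eqs. (10) and (11).

    psi_fn:       callable returning the complex amplitude psi_lambda(sigma)
    samples:      configurations sigma_i drawn from |psi_lambda|^2
    translate_fn: maps (sigma, mu) to sigma shifted by one lattice vector e_mu
    k_target:     target momentum, one entry per direction mu
    """
    penalty = 0.0
    for mu, kt in enumerate(k_target):
        # <T_{e_mu}> estimated as in Eq. (10)
        ratios = [psi_fn(translate_fn(s, mu)) / psi_fn(s) for s in samples]
        k_nqs = 1j / a * np.log(np.mean(ratios))   # Eq. (11), complex in general
        penalty += (k_nqs.real - kt) ** 2          # Eq. (12)
        penalty += k_nqs.imag ** 2                 # optional penalty on Im k_NQS (see text)
    return gamma * penalty
```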

Architecture and training

In the present paper we use a recurrent neural network (RNN)91 to represent a quantum state defined on a 2D lattice with Nsites = NxNy sites occupied by Np particles. RNNs and similar generative architectures combined with variational energy minimization have already been applied successfully to spin systems5,11,36,45. One of the advantages of these architectures is their autoregressive property, which allows extremely efficient, independent sampling from the RNN wave function19,92, which is important for the training procedure.

In order to represent fermionic wave functions, we start from the same approach as for bosonic spin systems and use an RNN architecture consisting of Nsites (tensorized) gated recurrent units (GRUs), each representing one site of the system. The information is passed from the first cell, corresponding to the first lattice site, to the last site in a recurrent fashion, see Supplementary Note 1A.

In order to find the ground state of the system under consideration, we use the variational Monte Carlo (VMC) minimization of the energy92,93. VMC has already been used in a wide range of machine learning applications (see e.g. refs. 6,18 for an overview). In VMC, the expectation value of the energy of the RNN trial wave function,

$$\langle {E}_{\boldsymbol{\lambda}}\rangle ={\sum}_{\sigma}|{\psi}_{\boldsymbol{\lambda}}(\sigma){|}^{2}\,{E}_{\boldsymbol{\lambda}}^{\mathrm{loc}}(\sigma)\approx \frac{1}{{N}_{s}}{\sum}_{i}{E}_{\boldsymbol{\lambda}}^{\mathrm{loc}}({\sigma}_{i}),$$
(13)

is minimized. Here, we have defined the local energy

$${E}_{\boldsymbol{\lambda}}^{\mathrm{loc}}(\sigma)=\frac{\langle \sigma|{\mathcal{H}}|{\psi}_{\boldsymbol{\lambda}}\rangle}{\langle \sigma|{\psi}_{\boldsymbol{\lambda}}\rangle}.$$
(14)

As shown e.g. in refs. 11,29 one can use the cost function

$${\mathcal{C}}=\frac{1}{{N}_{s}}{\sum}_{i}\underbrace{\left[{E}_{\boldsymbol{\lambda}}^{\mathrm{loc}}({\sigma}_{i})-\langle {E}_{\boldsymbol{\lambda}}^{\mathrm{loc}}\rangle\right]}_{=:-\sqrt{{N}_{s}}\,\bar{\epsilon}({\sigma}_{i})}$$
(15)

to minimize both the local energy and the variance of the local energy, making the training more stable. In Eq. (15), we have defined \(\bar{\epsilon}(\sigma_{i}):=-\frac{1}{\sqrt{N_{s}}}\left[E_{\boldsymbol{\lambda}}^{\mathrm{loc}}(\sigma_{i})-\langle E_{\boldsymbol{\lambda}}^{\mathrm{loc}}\rangle\right]\), where Ns denotes the number of samples.

One of the main difficulties of neural network quantum states is the optimization of Eq. (15), due to its typically rugged landscape with many local minima and saddle points15. If not stated differently, we use the Adam optimizer86 for the optimization of Eq. (15), following previous works on NQS using RNNs10,11,45. To improve the optimization, often stochastic reconfiguration (SR)94,95 is used. Here, we use two recently proposed SR variants, namely minimum-step stochastic reconfiguration (minSR) and the SR variant based on a linear algebra trick by Rende et al.81. In contrast to conventional SR, these variants enable the use of a large number of NQS parameters, see Supplementary Note 2. In minSR, the updates δλk of the neural network parameters λk are determined by the linear system

$${\bar{O}}_{{\sigma}_{i}k}\,\delta {\lambda}_{k}=\bar{\epsilon}({\sigma}_{i}),$$
(16)

where σi denote the sample configurations, k the parameter index, \(O_{\sigma_{i}k}=\frac{1}{\psi(\sigma_{i})}\frac{\partial \psi(\sigma_{i})}{\partial \lambda_{k}}\) and \(\bar{O}_{\sigma_{i}k}=(O_{\sigma_{i}k}-\langle O_{\sigma_{i}k}\rangle)/\sqrt{N_{s}}\), with summation over repeated indices implied. Eq. (16) is solved by

$$\delta {\lambda}_{k}={\bar{O}}_{k{\sigma}_{i}^{\prime}}^{\dagger}{({T}^{-1})}_{{\sigma}_{i}^{\prime}{\sigma}_{i}}\,\bar{\epsilon}({\sigma}_{i}),$$
(17)

with \(T=\bar{O}\bar{O}^{\dagger}\)80. In the version of Rende et al.,

$$\delta {\lambda}_{k}={X}_{k{\sigma}_{i}^{\prime}}{({X}^{T}X)}_{{\sigma}_{i}^{\prime}{\sigma}_{i}}^{-1}\,{f}_{{\sigma}_{i}},$$
(18)

with \(X=\mathrm{Concat}(\mathrm{Re}\,\bar{O},\mathrm{Im}\,\bar{O})\) and \(f_{\sigma_{i}}=\mathrm{Concat}(\mathrm{Re}\,\bar{\epsilon}(\sigma_{i}),-\mathrm{Im}\,\bar{\epsilon}(\sigma_{i}))\)81.
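
A minimal dense-linear-algebra sketch of the minSR update, Eqs. (16) and (17), using the definitions of \(\bar{O}\) and \(\bar{\epsilon}\) given above (the constant offset delta stands in for the schedule δ(t) discussed earlier; in practice, the returned update can then be passed to the Adam update rule as described above):

```python
import numpy as np

def minsr_update(O, e_loc, delta=1e-4):
    """Minimum-step SR update, Eqs. (16) and (17).

    O:     N_s x N_p matrix of log-derivatives O_{sigma_i, k}
    e_loc: local energies E^loc(sigma_i), length N_s
    delta: diagonal offset regularizing the N_s x N_s matrix T (see text)
    """
    Ns = O.shape[0]
    O_bar = (O - O.mean(axis=0)) / np.sqrt(Ns)
    eps_bar = -(e_loc - e_loc.mean()) / np.sqrt(Ns)      # as defined below Eq. (15)
    T = O_bar @ O_bar.conj().T + delta * np.eye(Ns)      # T = O_bar O_bar^dagger
    return O_bar.conj().T @ np.linalg.solve(T, eps_bar)  # Eq. (17), one entry per parameter
```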

Fermionic RNN Wave Functions

The architecture introduced above is per se bosonic. When considering fermionic systems, we need to take the antisymmetry of the wave function into account. This antisymmetry is included during the variational Monte Carlo steps when calculating the local energy introduced in Eq. (14), which we can expand as

$${E}^{\mathrm{loc}}({\sigma}_{i})={\sum}_{{\sigma}_{i}^{\prime}}\frac{\left\langle {\sigma}_{i}\right\vert {\mathcal{H}}\left\vert {\sigma}_{i}^{\prime}\right\rangle \langle {\sigma}_{i}^{\prime}|{\psi}_{\boldsymbol{\lambda}}\rangle}{\langle {\sigma}_{i}|{\psi}_{\boldsymbol{\lambda}}\rangle}.$$
(19)

In this sum, we multiply each term by a factor (−1)P if \(\sigma^{\prime}\) is connected to σ by P two-particle permutations, as suggested in ref. 29. In order to do so, we take the permutations along the sampling path into account. For the t − XXZ Hamiltonian under consideration, only the hopping term needs to be considered when calculating the antisymmetric signs. An example is shown in Fig. 11. This procedure is equivalent to the implementation of Jordan-Wigner strings as e.g. in ref. 25.
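
For the t − J model, where no site is doubly occupied, P reduces to the number of particles (of either spin) sitting between the two sites of the hop in the 1D ordering of the sampling path. A minimal sketch, assuming a single Jordan-Wigner ordering over both spin species:

```python
import numpy as np

def hop_sign(config, i, j):
    """Fermionic sign (-1)^P for a hop between sites i and j of the sampling path.

    config: occupations along the 1D-ordered path
            (0 = hole, 1 = spin up, 2 = spin down)
    """
    lo, hi = min(i, j), max(i, j)
    P = int(np.count_nonzero(np.asarray(config)[lo + 1:hi]))  # particles passed over
    return -1 if P % 2 else 1
```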

Fig. 11: Jordan-Wigner strings on the level of NQS expectation values.
figure 11

a A typical configuration σ for a 5 × 5 system with five holes and ten spin up (red) and ten spin down (blue) particles. Sites are labeled in a 1D manner, as denoted by the white numbers. An exemplary hopping process (arrow in a) to the nearest neighbor in the horizontal direction ends in the configuration \(\sigma^{\prime}\) (b) and effectively exchanges P particles, here P = 3. The respective sign of \(\sigma^{\prime}\) relative to σ is calculated using Eq. (19).