While masked diffusion models demonstrate better language modeling than other discrete diffusion formulations (Austin et al., 2021; Lou et al., 2023), we argue that they are less amenable to guidance: once a token is unmasked at some time \(t\), it remains fixed for all \(s < t\).
In contrast, with uniform noising, intermediate latents can be refined multiple times throughout the denoising process.
We therefore revisit categorical uniform noise discrete diffusion, where \(\boldsymbol{\pi} = \boldsymbol{u} = \frac{1}{N}\boldsymbol{1}\), with \(N\) the size of the vocabulary.
Our aim is that by analyzing this class of diffusion models more carefully, we can narrow the gap to absorbing-state diffusion and obtain performant models that are more easily steered by the guidance tools we developed above.
Uniform Noise Forward Process   We formulate uniform noise diffusion using the interpolating discrete diffusion framework (Sahoo et al., 2024).
Setting \(\boldsymbol{\pi} = \boldsymbol{u}\) means that, at each time step, the input \(\mathbf{x}\) transitions to a uniformly random state with a probability determined by the noise schedule.
Crucially, after \(\mathbf{x}\) changes once, it can do so again.
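To make this concrete, here is a minimal sketch of sampling from the uniform-noise forward marginal, assuming the interpolating form \(q(\mathbf{z}_t \mid \mathbf{x}) = \mathrm{Cat}(\mathbf{z}_t; \alpha_t \mathbf{x} + (1 - \alpha_t)\boldsymbol{u})\) from the framework above; the function and variable names are illustrative only, not taken from any released implementation.

```python
import torch
import torch.nn.functional as F

def sample_forward_uniform(x_ids, alpha_t, vocab_size):
    """Sample z_t ~ q(z_t | x) = Cat(z_t; alpha_t * x + (1 - alpha_t) * u).

    x_ids: (batch, length) integer token ids of the clean sequence x.
    alpha_t: scalar in [0, 1] given by the noise schedule.
    """
    x_onehot = F.one_hot(x_ids, vocab_size).float()         # (B, L, N)
    uniform = torch.full_like(x_onehot, 1.0 / vocab_size)   # u = (1/N) * 1
    probs = alpha_t * x_onehot + (1.0 - alpha_t) * uniform  # interpolating marginal
    return torch.distributions.Categorical(probs=probs).sample()  # (B, L) token ids
```

Unlike masking, a token corrupted at one step is free to change again at a later step, which is exactly the property the denoiser can exploit for refinement.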
Formally, when \(\boldsymbol{\pi} = \boldsymbol{u}\), the posterior from above becomes
$$q(\mathbf{z}_s \mid \mathbf{z}_t, \mathbf{x}) = \mathrm{Cat}\left(\mathbf{z}_s; \frac{N \alpha_t \mathbf{z}_t \odot \mathbf{x} + (\alpha_{t|s} - \alpha_t)\mathbf{z}_t + (\alpha_s - \alpha_t)\mathbf{x} + \frac{(\alpha_s - \alpha_t)(1 - \alpha_s)}{N \alpha_s}\boldsymbol{1}}{N \alpha_t\langle \mathbf{z}_t, \mathbf{x}\rangle + 1 - \alpha_t}\right),$$
where \(\alpha_{t|s} = \alpha_t / \alpha_s\).
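For readers who prefer code, the expression above translates directly into the following sketch (with \(\mathbf{z}_t\) and \(\mathbf{x}\) as one-hot vectors along the last dimension); again, the names are illustrative only.

```python
import torch

def uniform_posterior(z_t, x, alpha_s, alpha_t, vocab_size):
    """q(z_s | z_t, x) for the uniform-noise interpolating process.

    z_t, x: (..., N) tensors (one-hot for the true posterior).
    alpha_s, alpha_t: scalars with alpha_s >= alpha_t (since s < t).
    Returns (..., N) categorical probabilities over z_s.
    """
    N = vocab_size
    alpha_ts = alpha_t / alpha_s  # alpha_{t|s}
    numer = (
        N * alpha_t * z_t * x                                    # elementwise z_t * x term
        + (alpha_ts - alpha_t) * z_t
        + (alpha_s - alpha_t) * x
        + (alpha_s - alpha_t) * (1.0 - alpha_s) / (N * alpha_s)  # broadcast of the 1-vector term
    )
    denom = N * alpha_t * (z_t * x).sum(dim=-1, keepdim=True) + 1.0 - alpha_t
    return numer / denom
```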
Denoising Process   The optimal form for the reverse diffusion process \(p_\theta\) matches the posterior.
In fact, setting \(p_\theta\) to the posterior reduces the KL terms in the ELBO to zero.
However, setting \(p_\theta\) to exactly the posterior is not possible, because the posterior is a function of \(\mathbf{x}\), which is precisely what \(p_\theta\) must generate.
Therefore, we introduce a model \(\mathbf{x}_\theta\) that predicts the 'clean' data given a noisy latent \(\mathbf{z}_t\) at time \(t\).
We use \(\mathbf{x}_\theta\) to parameterize the denoising process as \(p_\theta(\mathbf{z}_s \mid \mathbf{z}_t) = q(\mathbf{z}_s \mid \mathbf{z}_t, \mathbf{x} = \mathbf{x}_\theta)\), yielding:
$$p_\theta(\mathbf{z}_s \mid \mathbf{z}_t) = \mathrm{Cat}\left(\mathbf{z}_s; \frac{N\alpha_t \mathbf{z}_t \odot \mathbf{x}_\theta + (\alpha_{t|s} - \alpha_t)\mathbf{z}_t + (\alpha_s - \alpha_t)\mathbf{x}_\theta + \frac{(\alpha_s - \alpha_t)(1 - \alpha_s)}{N\alpha_s}\boldsymbol{1}}{N \alpha_t\langle \mathbf{z}_t, \mathbf{x}_\theta\rangle + 1 - \alpha_t}\right).$$
Note that this choice minimizes \(\mathcal{L}_{diff}\) precisely when \(\mathbf{x}_\theta = \mathbf{x}\), as desired.
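Continuing the sketch above, a single reverse step just reuses `uniform_posterior` with the denoiser's prediction \(\mathbf{x}_\theta\) (a probability vector over the vocabulary) substituted for \(\mathbf{x}\); the helper below is a hypothetical illustration of that substitution.

```python
def reverse_step(z_t, x_theta_probs, alpha_s, alpha_t, vocab_size):
    """Sample z_s ~ p_theta(z_s | z_t) = q(z_s | z_t, x = x_theta).

    z_t: (..., N) one-hot latent at time t.
    x_theta_probs: (..., N) denoiser output, e.g. a softmax over the vocabulary.
    """
    probs = uniform_posterior(z_t, x_theta_probs, alpha_s, alpha_t, vocab_size)
    return torch.distributions.Categorical(probs=probs).sample()
```

A sampler would apply this step on a discretized time grid from \(t = 1\) down to \(t = 0\), re-invoking the denoiser at each step; because the posterior places mass off \(\mathbf{z}_t\), tokens decoded early can still be revised later, in contrast to the absorbing-state case.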
UDLM   To build towards Uniform Discrete Language Models (UDLM), we derive an improved variational objective by taking \(T\rightarrow\infty\) and analyzing each of the terms \(\mathcal{L}_{recons}, \mathcal{L}_{diff}, \mathcal{L}_{prior}\) introduced above.
This yields three improvements: (1) a simple and elegant closed-form expression for the variational bound that is easier to reason about; (2) an analytical reduction of \(\mathcal{L}_{recons}, \mathcal{L}_{prior}\) to zero, which tightens the ELBO; (3) a further tightening via the continuous-time extension of \(\mathcal{L}_{diff}\).
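As a concrete instance of point (2), consider the prior term under the standard boundary condition \(\alpha_1 = 0\) (an assumption on the noise schedule, not restated here): the forward marginal \(q(\mathbf{z}_1 \mid \mathbf{x}) = \mathrm{Cat}(\mathbf{z}_1; \alpha_1 \mathbf{x} + (1 - \alpha_1)\boldsymbol{u})\) collapses to the uniform prior, so
$$\mathcal{L}_{prior} = D_{\mathrm{KL}}\big(q(\mathbf{z}_1 \mid \mathbf{x}) \,\|\, \mathrm{Cat}(\mathbf{z}_1; \boldsymbol{u})\big) = D_{\mathrm{KL}}\big(\mathrm{Cat}(\mathbf{z}_1; \boldsymbol{u}) \,\|\, \mathrm{Cat}(\mathbf{z}_1; \boldsymbol{u})\big) = 0.$$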
Please refer to our manuscript for more details on this derivation.