Notes on diffusion models
Notes & Scribbles

Giung Nam, 2026-03-20

Continuous Diffusion

For continuous data xRK\bm{x} \in \mathbb{R}^{K}, the most basic choice for the forward process is a Gaussian distribution. In DDPM, the forward process allows us to sample any latent state zt\bm{z}_{t} directly from the data z0=x\bm{z}_{0} = \bm{x}, or transition between two intermediate steps s<ts<t, as follows:

q(ztz0)=N(zt;αtz0,(1αt)I),q(ztzs)=N(zt;αtszs,(1αts)I).\begin{align} q(\bm{z}_{t} \mid \bm{z}_{0}) &= \mathcal{N}\left( \bm{z}_{t} ; \sqrt{\alpha_{t}} \bm{z}_{0}, (1-\alpha_{t})\bm{I} \right), \\ q(\bm{z}_{t} \mid \bm{z}_{s}) &= \mathcal{N}\left( \bm{z}_{t} ; \sqrt{\alpha_{t \mid s}} \bm{z}_{s}, (1-\alpha_{t \mid s})\bm{I} \right). \end{align}

where αt[0,1]\alpha_{t} \in [0, 1] is a strictly decreasing function in tt with α01\alpha_{0} \approx 1 and α10\alpha_{1} \approx 0, and αts=αt/αs\alpha_{t \mid s} = \alpha_{t} / \alpha_{s}. The reverse posterior is then given by:

q(zszt,z0)=N(zs;αts(1αs)zt+αs(1αts)z01αt,(1αs)(1αts)1αtI).\begin{align}\textstyle q(\bm{z}_{s} \mid \bm{z}_{t}, \bm{z}_{0}) = \mathcal{N}\left( \bm{z}_{s} ; \frac{ \sqrt{\alpha_{t \mid s}}(1-\alpha_{s})\bm{z}_{t} + \sqrt{\alpha_{s}}(1-\alpha_{t \mid s})\bm{z}_{0} }{1-\alpha_{t}}, \frac{(1-\alpha_{s})(1-\alpha_{t \mid s})}{1-\alpha_{t}}\bm{I} \right). \end{align}

Denoising Diffusion Implicit Model (DDIM)

DDIM introduces a hyperparameter σs\sigma_{s} for the variance of the reverse posterior, generalizing the stochastic process while maintaining the same marginals q(ztz0)q(\bm{z}_{t} \mid \bm{z}_{0}). The reverse transition is defined by:

q(zszt,z0)=N(zs;αsz0+1αsσs2ztαtz01αt,σs2I).\begin{align}\textstyle q(\bm{z}_{s} \mid \bm{z}_{t}, \bm{z}_{0}) = \mathcal{N}\left( \bm{z}_{s} ; \sqrt{\alpha_{s}}\bm{z}_{0} + \sqrt{1 - \alpha_{s} - \sigma_{s}^{2}} \frac{ \bm{z}_{t} - \sqrt{\alpha_{t}}\bm{z}_{0} }{\sqrt{1 - \alpha_{t}}}, \sigma_{s}^{2}\bm{I} \right). \end{align}

While setting σs2=(1αs)(1αts)(1αt)\sigma_{s}^{2} = \frac{(1-\alpha_{s})(1-\alpha_{t \mid s})}{(1-\alpha_{t})} recovers the standard reverse posterior, DDIM generally defines a non-Markovian forward process; the forward transition now depends on z0\bm{z}_{0}:

q(ztzs,z0)=N(zt;αtz0+1αt1αsσs2zsαsz01αs,σs2(1αt)1αsI).\begin{align}\textstyle q(\bm{z}_{t} \mid \bm{z}_{s}, \bm{z}_{0}) = \mathcal{N}\left( \bm{z}_{t} ; \sqrt{\alpha_{t}}\bm{z}_{0} + \sqrt{1-\alpha_{t}}\sqrt{1-\alpha_{s}-\sigma_{s}^{2}}\frac{\bm{z}_{s} - \sqrt{\alpha_{s}}\bm{z}_{0}}{1-\alpha_{s}}, \frac{\sigma_{s}^{2}(1-\alpha_{t})}{1-\alpha_{s}} \bm{I} \right). \end{align}

Using a neural network xϕ(zt,t)z0\bm{x}_{\phi}(\bm{z}_{t}, t) \approx \bm{z}_{0}, parameterized by ϕ\phi, the approximated reverse process pϕ(zszt)p_{\phi}(\bm{z}_{s} \mid \bm{z}_{t}) can be written in the same form. The discrete-time and continuous-time diffusion losses are given by:

L(ϕ;z0,T)=Et{1T,2T,,1}Eztq(ztz0)[TDKL(q(zszt,z0)pϕ(zszt))]=Et{1T,2T,,1}Eztq(ztz0)[T2σs2(αsαt1αsσs21αt)2z0xϕ(zt,t)2],limTL(ϕ;z0,T)=EtU[0,1]Eztq(ztz0)[12σt2(αt2αt(1αt)(αtσt21αt))2z0xϕ(zt,t)2].\begin{align} \mathcal{L}(\phi ; \bm{z}_{0}, T) &= \textstyle \mathbb{E}_{t \sim \{\frac{1}{T},\frac{2}{T},\dots,1\}} \mathbb{E}_{\bm{z}_{t} \sim q(\bm{z}_{t} \mid \bm{z}_{0})} \left[ T D_{\text{KL}}\left( q(\bm{z}_{s} \mid \bm{z}_{t}, \bm{z}_{0}) \parallel p_{\phi}(\bm{z}_{s} \mid \bm{z}_{t}) \right) \right] \\ &= \textstyle \mathbb{E}_{t \sim \{\frac{1}{T},\frac{2}{T},\dots,1\}} \mathbb{E}_{\bm{z}_{t} \sim q(\bm{z}_{t} \mid \bm{z}_{0})} \left[ \frac{T}{2 \sigma_{s}^{2}} \left( \sqrt{\alpha_{s}} - \sqrt{\alpha_{t}}\sqrt{\frac{1 - \alpha_{s} - \sigma_{s}^{2}}{1-\alpha_{t}}} \right)^{2} \left\lVert \bm{z}_{0} - \bm{x}_{\phi}(\bm{z}_{t}, t) \right\rVert^{2} \right], \\ \lim_{T \to \infty} \mathcal{L}(\phi ; \bm{z}_{0}, T) &= \textstyle \mathbb{E}_{t \sim \mathcal{U}[0, 1]} \mathbb{E}_{\bm{z}_{t} \sim q(\bm{z}_{t} \mid \bm{z}_{0})} \left[ \frac{1}{2\sigma_{t}^{2}} \left( \frac{\alpha_{t}^{\prime}}{2 \sqrt{\alpha_{t}} (1-\alpha_{t})} \left( \sqrt{\alpha_{t}} \sigma_{t}^{2} - \frac{1}{\sqrt{\alpha_{t}}} \right) \right)^{2} \left\lVert \bm{z}_{0} - \bm{x}_{\phi}(\bm{z}_{t}, t) \right\rVert^{2} \right]. \end{align}

Continuous-time formulation (SDE, ODE)

DDIM injects noise on a discrete-time grid:

zt=αtz0+1αtσs2zsαsz01αs+σsε, where εN(0,I).\begin{align}\textstyle \bm{z}_{t} = \sqrt{\alpha_{t}} \bm{z}_{0} + \sqrt{1 - \alpha_{t} - \sigma_{s}^{2}} \frac{\bm{z}_{s} - \sqrt{\alpha_{s}}\bm{z}_{0}}{\sqrt{1 - \alpha_{s}}} + \sigma_{s}\bm{\varepsilon}, \text{ where } \bm{\varepsilon} \sim \mathcal{N}(\bm{0}, \bm{I}). \end{align}

Transitioning to continuous-time involves reducing the time interval Δt=ts\Delta t = t - s to an infinitesimal dt\mathrm{d}t, turning difference equations into differential equations. By introducing εs=zsαsz01αs\bm{\varepsilon}_{s} = \frac{\bm{z}_{s} - \sqrt{\alpha_{s}} \bm{z}_{0}}{\sqrt{1 - \alpha_{s}}}, βt=1αtsΔt\beta_t = \frac{1 - \alpha_{t \mid s}}{\Delta t}, and σs2=(1αs)(1αts)η2(1αt)\sigma_{s}^{2} = \frac{(1-\alpha_{s}) (1-\alpha_{t \mid s}) \eta^{2}}{(1-\alpha_{t})}, the forward step and its approximation are given by:

zt=1βtΔtzs+(1αtσs21βtΔt1αsσs2)εs+ηβtΔtε(112βtΔt)zs+(1η221αsβtΔt)εs+ηβtΔtε.\begin{align} \bm{z}_{t} &\textstyle = \sqrt{1 - \beta_{t} \Delta t} \bm{z}_{s} + \left( \sqrt{1 - \alpha_{t} - \sigma_{s}^{2}} - \sqrt{1 - \beta_{t} \Delta t} \sqrt{1 - \alpha_{s} - \sigma_{s}^{2}} \right) \bm{\varepsilon}_{s} + \eta \sqrt{\beta_{t} \Delta t} \bm{\varepsilon} \\ &\textstyle \approx \left( 1 - \frac{1}{2} \beta_{t} \Delta t \right) \bm{z}_{s} + \left( \frac{1 - \eta^{2}}{2 \sqrt{1 - \alpha_{s}}} \beta_{t} \Delta t \right) \bm{\varepsilon}_{s} + \eta \sqrt{\beta_{t} \Delta t} \bm{\varepsilon}. \end{align}

Using the score-matching identity, εs=1αsz=zslogq(zz0)\bm{\varepsilon}_{s} = -\sqrt{1 - \alpha_{s}} \nabla_{\bm{z} = \bm{z}_{s}} \log{q(\bm{z} \mid \bm{z}_{0})}, we obtain the increment:

ztztΔt12βtΔtztΔt1η22βtΔtz=ztΔtlogq(ztΔtz0)+ηβtΔtε,\begin{align}\textstyle \bm{z}_{t} - \bm{z}_{t - \Delta t} \approx -\frac{1}{2} \beta_{t} \Delta t \bm{z}_{t - \Delta t} -\frac{1 - \eta^{2}}{2} \beta_{t} \Delta t \nabla_{\bm{z} = \bm{z}_{t - \Delta t}} \log{q(\bm{z}_{t - \Delta t} \mid \bm{z}_{0})} +\eta \sqrt{\beta_{t} \Delta t} \bm{\varepsilon}, \end{align}

which corresponds to the continuous-time SDE evolving forward and reverse in time, respectively:

dz(t)=[f(z(t),t)1η22g2(t)zt=z(t)logq(ztz0)]dt+ηg(t)dw(t),dzˉ(t)=[f(zˉ(t),t)1+η22g2(t)zt=zˉ(t)logq(zt)]dt+ηg(t)dwˉ(t),\begin{align} \mathrm{d}\bm{z}(t) &\textstyle = \left[ \bm{f}(\bm{z}(t), t) - \frac{1 - \eta^{2}}{2}g^{2}(t) \nabla_{\bm{z}_{t} = \bm{z}(t)} \log{q(\bm{z}_{t} \mid \bm{z}_{0})} \right] \mathrm{d}t + \eta g(t) \mathrm{d}\bm{w}(t), \\ \mathrm{d}\bar{\bm{z}}(t) &\textstyle = \left[ \bm{f}(\bar{\bm{z}}(t), t) - \frac{1 + \eta^{2}}{2} g^{2}(t) \nabla_{\bm{z}_{t} = \bar{\bm{z}}(t)} \log{q(\bm{z}_{t})} \right] \mathrm{d}t + \eta g(t) \mathrm{d}\bar{\bm{w}}(t), \end{align}

where f(z(t),t)=12βtz(t)\bm{f}(\bm{z}(t), t) = -\frac{1}{2} \beta_{t} \bm{z}(t)is the drift coefficient, g(t)=βtg(t) = \sqrt{\beta_{t}} is the diffusion coefficient, and w(t)\bm{w}(t) and wˉ(t)=w(1t)w(1)\bar{\bm{w}}(t) = \bm{w}(1-t) - \bm{w}(1) are standard Wiener processes in forward and reverse time, respectively.

Remark. Here, η=0\eta = 0 corresponds to the deterministic DDIM setup where σs=0\sigma_{s} = 0; the forward and reverse SDEs collapse into a single PF-ODE, establishing a unique, invertible path between the noise and the data. On the other hand, η=1\eta=1 corresponds to the stochastic DDPM setup. It is worth noting that even in this stochastic regime, the reverse-time SDE does not inject arbitrary random noise; for g(t)0g(t) \neq 0, the diffusion term remains strictly coupled with the score-driven drift term.

Remark. Introducing a neural network for ε\bm{\varepsilon}-prediction, i.e., εϕ(zt,t)ε0\bm{\varepsilon}_{\phi}(\bm{z}_{t}, t) \approx \bm{\varepsilon}_{0}, is more algebraically concise than x\bm{x}-prediction in this formulation, as the ε\bm{\varepsilon}-prediction network directly yields the score function via the score-matching identity. Indeed, it is standard practice for diffusion models to adopt an ε\bm{\varepsilon}-prediction strategy.

Discrete Diffusion

For discrete data x{0,1}K\bm{x} \in \{0, 1\}^{K} represented as a one-hot vector, the most basic choice for the forward process is a Categorical distribution. The forward process allows us to sample any latent state zt\bm{z}_{t} directly from the data z0=x\bm{z}_{0} = \bm{x}, or transition between two intermediate steps s<ts<t, as follows:

q(ztz0)=Cat(zt;αtz0+(1αt)π),q(ztzs)=Cat(zt;αtszs+(1αts)π).\begin{align} q(\bm{z}_{t} \mid \bm{z}_{0}) &= \mathrm{Cat}\left( \bm{z}_{t} ; \alpha_{t} \bm{z}_{0} + (1-\alpha_{t}) \bm{\pi} \right), \\ q(\bm{z}_{t} \mid \bm{z}_{s}) &= \mathrm{Cat}\left( \bm{z}_{t} ; \alpha_{t \mid s} \bm{z}_{s} + (1-\alpha_{t \mid s}) \bm{\pi} \right). \end{align}

where αt[0,1]\alpha_{t} \in [0, 1] is a strictly decreasing function in tt with α01\alpha_{0} \approx 1 and α10\alpha_{1} \approx 0, and αts=αt/αs\alpha_{t \mid s} = \alpha_{t} / \alpha_{s}. The reverse posterior is then given by:

q(zszt,z0)=Cat(zs;[αtszt+(1αts)zt,π1][αsz0+(1αs)π]αtzt,z0+(1αt)zt,π).\begin{align}\textstyle q(\bm{z}_{s} \mid \bm{z}_{t}, \bm{z}_{0}) = \mathrm{Cat}\left( \bm{z}_{s} ; \frac{ \left[ \alpha_{t \mid s} \bm{z}_{t} + (1-\alpha_{t \mid s}) \langle \bm{z}_{t}, \bm{\pi} \rangle \bm{1} \right] \odot \left[ \alpha_{s} \bm{z}_{0} + (1-\alpha_{s}) \bm{\pi} \right] }{ \alpha_{t} \langle \bm{z}_{t}, \bm{z}_{0} \rangle + (1-\alpha_{t}) \langle \bm{z}_{t}, \bm{\pi} \rangle } \right). \end{align}

Masked Diffusion Model (MDM)

MDM defines the KthK^\text{th} category as a special [MASK] token, which serves as an absorbing state with a stationary distribution π=m=[0,,0,1]\bm{\pi} = \bm{m} = [0,\dots,0,1]^\top. In the forward process, tokens either remain themselves or transition to the mask. For s<ts < t, the reverse posterior is given by:

q(zszt,z0)={Cat(zs;z0)ztm,Cat(zs;(1αs)m+(αsαt)z01αt)zt=m.\begin{align} q(\bm{z}_{s} \mid \bm{z}_{t}, \bm{z}_{0}) = \begin{cases} \mathrm{Cat}\left( \bm{z}_{s} ; \bm{z}_{0} \right) & \bm{z}_{t} \neq \bm{m}, \\ \mathrm{Cat}\left( \bm{z}_{s} ; \frac{(1-\alpha_{s})\bm{m} + (\alpha_{s} - \alpha_{t})\bm{z}_{0}}{1-\alpha_{t}} \right) & \bm{z}_{t} = \bm{m}. \end{cases} \end{align}

Using a neural network xϕ(zt,t)z0\bm{x}_{\phi}(\bm{z}_{t}, t) \approx \bm{z}_{0}, parameterized by ϕ\phi and ensured to be xϕ,m=0\langle \bm{x}_{\phi}, \bm{m} \rangle = 0, the approximated reverse process pϕ(zszt)p_{\phi}(\bm{z}_{s} \mid \bm{z}_{t}) can be written in the same form. The discrete-time and continuous-time diffusion losses are given by:

L(ϕ;z0,T)=Et{1T,2T,1}Eztq(ztz0)[TDKL(q(zszt,z0)pϕ(zszt))]=Et{1T,2T,1}Eztq(ztz0)[T(αtαs)1αtlogxϕ(zt,t),z0],limTL(ϕ;z0,T)=EtU[0,1]Eztq(ztz0)[αt1αtlogxϕ(zt,t),z0].\begin{align} \mathcal{L}(\phi ; \bm{z}_{0}, T) &=\textstyle \mathbb{E}_{t \sim \{\frac{1}{T},\frac{2}{T}\dots,1\}} \mathbb{E}_{\bm{z}_{t} \sim q(\bm{z}_{t} \mid \bm{z}_{0})} \left[ T D_{\text{KL}}\left( q(\bm{z}_{s} \mid \bm{z}_{t}, \bm{z}_{0}) \parallel p_{\phi}(\bm{z}_{s} \mid \bm{z}_{t}) \right) \right] \\ &=\textstyle \mathbb{E}_{t \sim \{\frac{1}{T},\frac{2}{T}\dots,1\}} \mathbb{E}_{\bm{z}_{t} \sim q(\bm{z}_{t} \mid \bm{z}_{0})} \left[ \frac{T(\alpha_{t} - \alpha_{s})}{1 - \alpha_{t}} \log{\langle \bm{x}_{\phi}(\bm{z}_{t}, t), \bm{z}_{0} \rangle} \right], \\ \lim_{T \to \infty} \mathcal{L}(\phi ; \bm{z}_{0}, T) &=\textstyle \mathbb{E}_{t \sim \mathcal{U}[0, 1]} \mathbb{E}_{\bm{z}_{t} \sim q(\bm{z}_{t} \mid \bm{z}_{0})} \left[ \frac{\alpha_{t}^\prime}{1 - \alpha_{t}} \log{\langle \bm{x}_{\phi}(\bm{z}_{t}, t), \bm{z}_{0} \rangle} \right]. \end{align}

ReMasking Diffusion Model (ReMDM)

ReMDM introduces a hyperparameter σt\sigma_{t} to MDM, enabling transitions to the mask during the reverse process and generalizing the stochastic process while maintaining the same marginals q(ztz0)q(\bm{z}_{t} \mid \bm{z}_{0}). The reverse transition is defined by:

q(zszt,z0)={Cat(zs;σtm+(1σt)z0)ztm,Cat(zs;(1αsσtαt)m+(αsαt+σtαt)z01αt)zt=m.\begin{align} q(\bm{z}_{s} \mid \bm{z}_{t}, \bm{z}_{0}) = \begin{cases} \mathrm{Cat}\left( \bm{z}_{s} ; \sigma_{t} \bm{m} + (1 - \sigma_{t}) \bm{z}_{0} \right) & \bm{z}_{t} \neq \bm{m}, \\ \mathrm{Cat}\left( \bm{z}_{s} ; \frac{(1 - \alpha_{s} - \sigma_{t}\alpha_{t}) \bm{m} + (\alpha_{s} - \alpha_{t} + \sigma_{t}\alpha_{t}) \bm{z}_{0}}{1-\alpha_{t}} \right) & \bm{z}_{t} = \bm{m}. \end{cases} \end{align}

While setting σt=0\sigma_{t} = 0 recovers the reverse posterior of MDM, ReMDM generally defines a non-Markovian forward process; the forward transition now depends on z0\bm{z}_{0}:

q(ztzs,z0)={Cat(zt;(1σt)αtz0+(αsαt+σtαt)mαs)zsm,Cat(zt;σtαtz0+(1αsσtαt)m1αs)zs=m.\begin{align} q(\bm{z}_{t} \mid \bm{z}_{s}, \bm{z}_{0}) = \begin{cases} \mathrm{Cat}\left( \bm{z}_{t} ; \frac{ (1 - \sigma_{t}) \alpha_{t} \bm{z}_{0} + (\alpha_{s} - \alpha_{t} + \sigma_{t}\alpha_{t}) \bm{m} }{\alpha_{s}} \right) & \bm{z}_{s} \neq \bm{m}, \\ \mathrm{Cat}\left( \bm{z}_{t} ; \frac{ \sigma_{t} \alpha_{t} \bm{z}_{0} + (1 - \alpha_{s} - \sigma_{t} \alpha_{t})\bm{m} }{1-\alpha_{s}} \right) & \bm{z}_{s} = \bm{m}. \end{cases} \end{align}

The discrete-time diffusion loss is given by:

L(ϕ;z0,T)=Et{1T,2T,1}Eztq(ztz0)[TDKL(q(zszt,z0)pϕ(zszt))]=Et{1T,2T,1}Eztq(ztz0)[T(αtαsσtαt)1αtlogxϕ(zt,t),z0].\begin{align} \mathcal{L}(\phi ; \bm{z}_{0}, T) &=\textstyle \mathbb{E}_{t \sim \{\frac{1}{T},\frac{2}{T}\dots,1\}} \mathbb{E}_{\bm{z}_{t} \sim q(\bm{z}_{t} \mid \bm{z}_{0})} \left[ T D_{\text{KL}}\left( q(\bm{z}_{s} \mid \bm{z}_{t}, \bm{z}_{0}) \parallel p_{\phi}(\bm{z}_{s} \mid \bm{z}_{t}) \right) \right] \\ &=\textstyle \mathbb{E}_{t \sim \{\frac{1}{T},\frac{2}{T}\dots,1\}} \mathbb{E}_{\bm{z}_{t} \sim q(\bm{z}_{t} \mid \bm{z}_{0})} \left[ \frac{T(\alpha_{t} - \alpha_{s} - \sigma_{t}\alpha_{t})}{1 - \alpha_{t}} \log{\langle \bm{x}_{\phi}(\bm{z}_{t}, t), \bm{z}_{0} \rangle} \right]. \end{align}

Uniform Diffusion Model (UDM)

UDM defines the stationary distribution as a uniform across all categories, π=1/K\bm{\pi} = \bm{1} / K. In this regime, the forward process allows a token to transition to any other category (including itself), eventually erasing all signal as it converges to uniform noise. For s<ts < t, the reverse posterior is given by:

q(zszt,z0)=Cat(zs;[Kαtszt+(1αts)1][Kαsz0+(1αs)1]Kαtzt,z0+(1αt)).\begin{align}\textstyle q(\bm{z}_{s} \mid \bm{z}_{t}, \bm{z}_{0}) = \mathrm{Cat}\left( \bm{z}_{s} ; \frac{ \left[ K \alpha_{t \mid s} \bm{z}_{t} + (1 - \alpha_{t \mid s}) \bm{1} \right] \odot \left[ K \alpha_{s} \bm{z}_{0} + (1 - \alpha_{s}) \bm{1} \right] }{K \alpha_{t} \langle \bm{z}_{t}, \bm{z}_{0} \rangle + (1-\alpha_{t})} \right). \end{align}

Using a neural network xϕ(zt,t)z0\bm{x}_{\phi}(\bm{z}_{t}, t) \approx \bm{z}_{0}, parameterized by ϕ\phi, the approximated reverse process can be written in the same form. The continuous-time diffusion loss is given by:

limTL(ϕ;z0,T)=limTEt{1T,2T,1}Eztq(ztz0)[TDKL(q(zszt,z0)pϕ(zszt))]=EtU[0,1]Eztq(ztz0)[αtKαt[Kxˉ(i)Kxˉϕ(i)j s.t. zt(j)=0(xˉ(j)xˉ(i)logxˉϕ(i)xˉ(j)xˉϕ(j)xˉ(i))]],\begin{align} \lim_{T \to \infty} \mathcal{L}(\phi ; \bm{z}_{0}, T) &=\textstyle \lim_{T \to \infty} \mathbb{E}_{t \sim \{\frac{1}{T},\frac{2}{T}\dots,1\}} \mathbb{E}_{\bm{z}_{t} \sim q(\bm{z}_{t} \mid \bm{z}_{0})} \left[ T D_{\text{KL}}\left( q(\bm{z}_{s} \mid \bm{z}_{t}, \bm{z}_{0}) \parallel p_{\phi}(\bm{z}_{s} \mid \bm{z}_{t}) \right) \right] \\ &=\textstyle \mathbb{E}_{t \sim \mathcal{U}[0, 1]} \mathbb{E}_{\bm{z}_{t} \sim q(\bm{z}_{t} \mid \bm{z}_{0})} \left[ \frac{\alpha_{t}^{\prime}}{K \alpha_{t}} \left[ \frac{K}{\bar{\bm{x}}^{(i)}} - \frac{K}{\bar{\bm{x}}_{\phi}^{(i)}} - \sum_{j \text{ s.t. } \bm{z}_{t}^{(j)} = 0} \left( \frac{\bar{\bm{x}}^{(j)}}{\bar{\bm{x}}^{(i)}} \log{\frac{\bar{\bm{x}}_{\phi}^{(i)} \bar{\bm{x}}^{(j)}}{\bar{\bm{x}}_{\phi}^{(j)} \bar{\bm{x}}^{(i)}}} \right) \right] \right], \end{align}

where xˉ(i)=Kαtz0(i)+(1αt)\bar{\bm{x}}^{(i)} = K \alpha_{t} \bm{z}_{0}^{(i)} + (1 - \alpha_{t}), xˉϕ(i)=Kαtxϕ(i)(zt,t)+(1αt)\bar{\bm{x}}_{\phi}^{(i)} = K \alpha_{t} \bm{x}_{\phi}^{(i)}(\bm{z}_{t}, t) + (1 - \alpha_{t}), and i=arg maxj[K]zt(j)i = \argmax_{j \in [K]} \bm{z}_{t}^{(j)}.

Continuous-time formulation (CTMC)

Markovian discrete diffusion models randomly transit states on a discrete-time grid:

zt=(1bt)zs+btvt, where btBern(1αts) and vtCat(π).\begin{align} \bm{z}_{t} = (1 - b_{t}) \bm{z}_{s} + b_{t} \bm{v}_{t}, \text{ where } b_{t} \sim \mathrm{Bern}(1 - \alpha_{t \mid s}) \text{ and } \bm{v}_{t} \sim \mathrm{Cat}(\bm{\pi}). \end{align}

Transitioning to continuous-time involves reducing the time interval Δt=ts\Delta t = t - s to an infinitesimal dt\mathrm{d}t, turning difference equations into differential equations. By introducing βt=1αtsΔt\beta_{t} = \frac{1 - \alpha_{t \mid s}}{\Delta t}, the transition probability matrix is:

P(s,t)=(1βtΔt)I+βtΔtπ1.\begin{align} \bm{P}(s, t) = (1 - \beta_{t} \Delta t) \bm{I} + \beta_{t} \Delta t \bm{\pi} \bm{1}^\top. \end{align}

As the state remains unchanged over zero time, i.e., P(s,s)=I\bm{P}(s, s) = \bm{I}, we obtain the increment:

P(s,t)P(s,s)=βtΔt(π1I),\begin{align} \bm{P}(s, t) - \bm{P}(s, s) = \beta_{t} \Delta t \left( \bm{\pi}\bm{1}^\top - \bm{I} \right), \end{align}

which corresponds to the continuous-time CTMC evolving forward in time with the transition-rate matrix:

Q(t)=βt(π1I).\begin{align} \bm{Q}(t) = \beta_{t} (\bm{\pi}\bm{1}^\top - \bm{I}). \end{align}

The forward evolution of the marginal probability distribution q(t)=αtz0+(1αt)π\bm{q}(t) = \alpha_{t} \bm{z}_{0} + (1 - \alpha_{t}) \bm{\pi} is governed by the Kolmogorov forward equation:

q˙(t)=Q(t)q(t).\begin{align}\textstyle \dot{\bm{q}}(t) = \bm{Q}(t) \bm{q}(t). \end{align}

The relationship between the forward rate matrix Q(t)\bm{Q}(t) and the reverse rate matrix Qˉ(t)\bar{\bm{Q}}(t) is governed by the entry-wise probability balance:

Qˉij(t)=Qji(t)qj(t)qi(t).\begin{align} \bar{Q}_{ij}(t) = Q_{ji}(t) \frac{q_{j}(t)}{q_{i}(t)}. \end{align}

Reward-tilted Diffusion

The ultimate goal of a diffusion model is to generate samples following the data distribution q(z0)q(\bm{z}_{0}). However, in many applications, we seek samples under a specific reward function rr. To this end, we define the reward-tilted distribution by reweighting the original density with an exponential reward term:

qr(z0)q(z0)exp(βr(z0)).\begin{align}\textstyle q^{r}(\bm{z}_{0}) \propto q(\bm{z}_{0}) \exp{\left( \beta r(\bm{z}_{0}) \right)}. \end{align}

Local search via guidance

In continuous diffusion, steering toward high reward can be achieved by modifying the score function. By defining an intermediate reward-tilted density qr(zt)q(zt)exp(βrt(zt))q^{r}(\bm{z}_{t}) \propto q(\bm{z}_{t}) \exp{\left( \beta r_{t}(\bm{z}_{t}) \right)}, involving an intermediate reward function rtr_{t}, the corresponding score is:

ztlogqr(zt)=ztlogq(zt)+βztrt(zt),\begin{align}\textstyle \nabla_{\bm{z}_{t}} \log{q^{r}(\bm{z}_{t})} = \nabla_{\bm{z}_{t}} \log{q(\bm{z}_{t})} + \beta \nabla_{\bm{z}_{t}} r_{t}(\bm{z}_{t}), \end{align}

where the second term acts as a steering gradient. Since the reward is typically only defined at t=0t=0, one can define the intermediate reward rtr_{t} as the expected tilted reward conditioned on the current noisy state:

rt(zt):=1βlogEz0q(zt)[exp(βr(z0))].\begin{align}\textstyle r_{t}(\bm{z}_{t}) := \frac{1}{\beta} \log{\mathbb{E}_{\bm{z}_{0} \sim q(\cdot \mid \bm{z}_{t})} \left[ \exp{\left( \beta r(\bm{z}_{0}) \right)} \right]}. \end{align}

For discrete diffusion, where gradients are unavailable, we tilt the transition rates of the CTMC. The reverse evolution is governed by a modified rate matrix Qˉr(t)\bar{\bm{Q}}^{r}(t) via Doob's hh-transform:

Qˉijr(t)=Qˉij(t)h(j,t)h(i,t), where h(zt,t):=exp(βrt(zt)).\begin{align}\textstyle \bar{Q}^{r}_{ij}(t) = \bar{Q}_{ij}(t) \frac{h(j,t)}{h(i,t)}, \text{ where } h(\bm{z}_{t},t) := \exp{\left( \beta r_{t}(\bm{z}_{t}) \right)}. \end{align}

Global search via SMC

While local guidance provides point-wise steering, SMC methods offer a robust global search strategy. Using the reverse transition of the original diffusion process as the proposal distribution, SMC evolves a population of MM particles via:

(Sampling)m[M]:zsmp(zsztm),(Resampling)m[M]:zsmzsm, where mCat({Gˉ(zti,zsi)}i=1M).\begin{align}\textstyle \text{(Sampling)} &\quad \forall m \in [M] : \bm{z}_{s}^{m} \sim p(\bm{z}_{s} \mid \bm{z}_{t}^{m}), \\ \text{(Resampling)} &\quad \forall m \in [M] : \bm{z}_{s}^{m} \gets \bm{z}_{s}^{m^{\prime}}, \text{ where } m^\prime \sim \mathrm{Cat}\left( \left\{ \bar{G}(\bm{z}_{t}^{i}, \bm{z}_{s}^{i}) \right\}_{i=1}^{M} \right). \end{align}

Here, the unnormalized and normalized incremental weights are:

G(zt,zs)=h(zs,s)h(zt,t),Gˉ(zti,zsi)=G(zti,zsi)j=1MG(ztj,zsj).\begin{align}\textstyle G(\bm{z}_{t}, \bm{z}_{s}) = \frac{h(\bm{z}_{s}, s)}{h(\bm{z}_{t}, t)}, \quad \bar{G}(\bm{z}_{t}^{i}, \bm{z}_{s}^{i}) = \frac{G(\bm{z}_{t}^{i}, \bm{z}_{s}^{i})}{\sum_{j=1}^{M}G(\bm{z}_{t}^{j}, \bm{z}_{s}^{j}) }. \end{align}

Continuous Diffusion

For continuous data $\bm{x} \in \mathbb{R}^{K}$, the most basic choice for the forward process is a Gaussian distribution. In DDPM, the forward process allows us to sample any latent state $\bm{z}{t}$ directly from the data $\bm{z}{0} = \bm{x}$, or transition between two intermediate steps $s<t$, as follows: $$ \begin{align} q(\bm{z}{t} \mid \bm{z}{0}) &= \mathcal{N}\left( \bm{z}{t} ; \sqrt{\alpha{t}} \bm{z}{0}, (1-\alpha{t})\bm{I} \right), \ q(\bm{z}{t} \mid \bm{z}{s}) &= \mathcal{N}\left( \bm{z}{t} ; \sqrt{\alpha{t \mid s}} \bm{z}{s}, (1-\alpha{t \mid s})\bm{I} \right). \end{align} $$ where $\alpha_{t} \in [0, 1]$ is a strictly decreasing function in $t$ with $\alpha_{0} \approx 1$ and $\alpha_{1} \approx 0$, and $\alpha_{t \mid s} = \alpha_{t} / \alpha_{s}$. The reverse posterior is then given by: $$ \begin{align}\textstyle q(\bm{z}{s} \mid \bm{z}{t}, \bm{z}{0}) = \mathcal{N}\left( \bm{z}{s} ; \frac{ \sqrt{\alpha_{t \mid s}}(1-\alpha_{s})\bm{z}{t} + \sqrt{\alpha{s}}(1-\alpha_{t \mid s})\bm{z}{0} }{1-\alpha{t}}, \frac{(1-\alpha_{s})(1-\alpha_{t \mid s})}{1-\alpha_{t}}\bm{I} \right). \end{align} $$

Denoising Diffusion Implicit Model (DDIM)

DDIM introduces a hyperparameter $\sigma_{s}$ for the variance of the reverse posterior, generalizing the stochastic process while maintaining the same marginals $q(\bm{z}{t} \mid \bm{z}{0})$. The reverse transition is defined by: $$ \begin{align}\textstyle q(\bm{z}{s} \mid \bm{z}{t}, \bm{z}{0}) = \mathcal{N}\left( \bm{z}{s} ; \sqrt{\alpha_{s}}\bm{z}{0} + \sqrt{1 - \alpha{s} - \sigma_{s}^{2}} \frac{ \bm{z}{t} - \sqrt{\alpha{t}}\bm{z}{0} }{\sqrt{1 - \alpha{t}}}, \sigma_{s}^{2}\bm{I} \right). \end{align} $$ While setting $\sigma_{s}^{2} = \frac{(1-\alpha_{s})(1-\alpha_{t \mid s})}{(1-\alpha_{t})}$ recovers the standard reverse posterior, DDIM generally defines a non-Markovian forward process; the forward transition now depends on $\bm{z}{0}$: $$ \begin{align}\textstyle q(\bm{z}{t} \mid \bm{z}{s}, \bm{z}{0}) = \mathcal{N}\left( \bm{z}{t} ; \sqrt{\alpha{t}}\bm{z}{0} + \sqrt{1-\alpha{t}}\sqrt{1-\alpha_{s}-\sigma_{s}^{2}}\frac{\bm{z}{s} - \sqrt{\alpha{s}}\bm{z}{0}}{1-\alpha{s}}, \frac{\sigma_{s}^{2}(1-\alpha_{t})}{1-\alpha_{s}} \bm{I} \right). \end{align} $$ Using a neural network $\bm{x}{\phi}(\bm{z}{t}, t) \approx \bm{z}{0}$, parameterized by $\phi$, the approximated reverse process $p{\phi}(\bm{z}{s} \mid \bm{z}{t})$ can be written in the same form. The discrete-time and continuous-time diffusion losses are given by: $$ \begin{align} \mathcal{L}(\phi ; \bm{z}{0}, T) &= \textstyle \mathbb{E}{t \sim {\frac{1}{T},\frac{2}{T},\dots,1}} \mathbb{E}{\bm{z}{t} \sim q(\bm{z}{t} \mid \bm{z}{0})} \left[ T D_{\text{KL}}\left( q(\bm{z}{s} \mid \bm{z}{t}, \bm{z}{0}) \parallel p{\phi}(\bm{z}{s} \mid \bm{z}{t}) \right) \right] \ &= \textstyle \mathbb{E}{t \sim {\frac{1}{T},\frac{2}{T},\dots,1}} \mathbb{E}{\bm{z}{t} \sim q(\bm{z}{t} \mid \bm{z}{0})} \left[ \frac{T}{2 \sigma{s}^{2}} \left( \sqrt{\alpha_{s}} - \sqrt{\alpha_{t}}\sqrt{\frac{1 - \alpha_{s} - \sigma_{s}^{2}}{1-\alpha_{t}}} \right)^{2} \left\lVert \bm{z}{0} - \bm{x}{\phi}(\bm{z}{t}, t) \right\rVert^{2} \right], \ \lim{T \to \infty} \mathcal{L}(\phi ; \bm{z}{0}, T) &= \textstyle \mathbb{E}{t \sim \mathcal{U}[0, 1]} \mathbb{E}{\bm{z}{t} \sim q(\bm{z}{t} \mid \bm{z}{0})} \left[ \frac{1}{2\sigma_{t}^{2}} \left( \frac{\alpha_{t}^{\prime}}{2 \sqrt{\alpha_{t}} (1-\alpha_{t})} \left( \sqrt{\alpha_{t}} \sigma_{t}^{2} - \frac{1}{\sqrt{\alpha_{t}}} \right) \right)^{2} \left\lVert \bm{z}{0} - \bm{x}{\phi}(\bm{z}_{t}, t) \right\rVert^{2} \right]. \end{align} $$

Continuous-time formulation (SDE, ODE)

DDIM injects noise on a discrete-time grid: $$ \begin{align}\textstyle \bm{z}{t} = \sqrt{\alpha{t}} \bm{z}{0} + \sqrt{1 - \alpha{t} - \sigma_{s}^{2}} \frac{\bm{z}{s} - \sqrt{\alpha{s}}\bm{z}{0}}{\sqrt{1 - \alpha{s}}} + \sigma_{s}\bm{\varepsilon}, \text{ where } \bm{\varepsilon} \sim \mathcal{N}(\bm{0}, \bm{I}). \end{align} $$ Transitioning to continuous-time involves reducing the time interval $\Delta t = t - s$ to an infinitesimal $\mathrm{d}t$, turning difference equations into differential equations. By introducing $\bm{\varepsilon}{s} = \frac{\bm{z}{s} - \sqrt{\alpha_{s}} \bm{z}{0}}{\sqrt{1 - \alpha{s}}}$, $\beta_t = \frac{1 - \alpha_{t \mid s}}{\Delta t}$, and $\sigma_{s}^{2} = \frac{(1-\alpha_{s}) (1-\alpha_{t \mid s}) \eta^{2}}{(1-\alpha_{t})}$, the forward step and its approximation are given by: $$ \begin{align} \bm{z}{t} &\textstyle = \sqrt{1 - \beta{t} \Delta t} \bm{z}{s} + \left( \sqrt{1 - \alpha{t} - \sigma_{s}^{2}} - \sqrt{1 - \beta_{t} \Delta t} \sqrt{1 - \alpha_{s} - \sigma_{s}^{2}} \right) \bm{\varepsilon}{s} + \eta \sqrt{\beta{t} \Delta t} \bm{\varepsilon} \ &\textstyle \approx \left( 1 - \frac{1}{2} \beta_{t} \Delta t \right) \bm{z}{s} + \left( \frac{1 - \eta^{2}}{2 \sqrt{1 - \alpha{s}}} \beta_{t} \Delta t \right) \bm{\varepsilon}{s} + \eta \sqrt{\beta{t} \Delta t} \bm{\varepsilon}. \end{align} $$ Using the score-matching identity, $\bm{\varepsilon}{s} = -\sqrt{1 - \alpha{s}} \nabla_{\bm{z} = \bm{z}{s}} \log{q(\bm{z} \mid \bm{z}{0})}$, we obtain the increment: $$ \begin{align}\textstyle \bm{z}{t} - \bm{z}{t - \Delta t} \approx -\frac{1}{2} \beta_{t} \Delta t \bm{z}{t - \Delta t} -\frac{1 - \eta^{2}}{2} \beta{t} \Delta t \nabla_{\bm{z} = \bm{z}{t - \Delta t}} \log{q(\bm{z}{t - \Delta t} \mid \bm{z}{0})} +\eta \sqrt{\beta{t} \Delta t} \bm{\varepsilon}, \end{align} $$ which corresponds to the continuous-time SDE evolving forward and reverse in time, respectively: $$ \begin{align} \mathrm{d}\bm{z}(t) &\textstyle = \left[ \bm{f}(\bm{z}(t), t) - \frac{1 - \eta^{2}}{2}g^{2}(t) \nabla_{\bm{z}{t} = \bm{z}(t)} \log{q(\bm{z}{t} \mid \bm{z}{0})} \right] \mathrm{d}t + \eta g(t) \mathrm{d}\bm{w}(t), \ \mathrm{d}\bar{\bm{z}}(t) &\textstyle = \left[ \bm{f}(\bar{\bm{z}}(t), t) - \frac{1 + \eta^{2}}{2} g^{2}(t) \nabla{\bm{z}{t} = \bar{\bm{z}}(t)} \log{q(\bm{z}{t})} \right] \mathrm{d}t + \eta g(t) \mathrm{d}\bar{\bm{w}}(t), \end{align} $$ where $\bm{f}(\bm{z}(t), t) = -\frac{1}{2} \beta_{t} \bm{z}(t)$is the drift coefficient, $g(t) = \sqrt{\beta_{t}}$ is the diffusion coefficient, and $\bm{w}(t)$ and $\bar{\bm{w}}(t) = \bm{w}(1-t) - \bm{w}(1)$ are standard Wiener processes in forward and reverse time, respectively.

Remark. Here, $\eta = 0$ corresponds to the deterministic DDIM setup where $\sigma_{s} = 0$; the forward and reverse SDEs collapse into a single PF-ODE, establishing a unique, invertible path between the noise and the data. On the other hand, $\eta=1$ corresponds to the stochastic DDPM setup. It is worth noting that even in this stochastic regime, the reverse-time SDE does not inject arbitrary random noise; for $g(t) \neq 0$, the diffusion term remains strictly coupled with the score-driven drift term.

Remark. Introducing a neural network for $\bm{\varepsilon}$-prediction, i.e., $\bm{\varepsilon}{\phi}(\bm{z}{t}, t) \approx \bm{\varepsilon}_{0}$, is more algebraically concise than $\bm{x}$-prediction in this formulation, as the $\bm{\varepsilon}$-prediction network directly yields the score function via the score-matching identity. Indeed, it is standard practice for diffusion models to adopt an $\bm{\varepsilon}$-prediction strategy.

Discrete Diffusion

For discrete data $\bm{x} \in {0, 1}^{K}$ represented as a one-hot vector, the most basic choice for the forward process is a Categorical distribution. The forward process allows us to sample any latent state $\bm{z}{t}$ directly from the data $\bm{z}{0} = \bm{x}$, or transition between two intermediate steps $s<t$, as follows: $$ \begin{align} q(\bm{z}{t} \mid \bm{z}{0}) &= \mathrm{Cat}\left( \bm{z}{t} ; \alpha{t} \bm{z}{0} + (1-\alpha{t}) \bm{\pi} \right), \ q(\bm{z}{t} \mid \bm{z}{s}) &= \mathrm{Cat}\left( \bm{z}{t} ; \alpha{t \mid s} \bm{z}{s} + (1-\alpha{t \mid s}) \bm{\pi} \right). \end{align} $$ where $\alpha_{t} \in [0, 1]$ is a strictly decreasing function in $t$ with $\alpha_{0} \approx 1$ and $\alpha_{1} \approx 0$, and $\alpha_{t \mid s} = \alpha_{t} / \alpha_{s}$. The reverse posterior is then given by: $$ \begin{align}\textstyle q(\bm{z}{s} \mid \bm{z}{t}, \bm{z}{0}) = \mathrm{Cat}\left( \bm{z}{s} ; \frac{ \left[ \alpha_{t \mid s} \bm{z}{t} + (1-\alpha{t \mid s}) \langle \bm{z}{t}, \bm{\pi} \rangle \bm{1} \right] \odot \left[ \alpha{s} \bm{z}{0} + (1-\alpha{s}) \bm{\pi} \right] }{ \alpha_{t} \langle \bm{z}{t}, \bm{z}{0} \rangle + (1-\alpha_{t}) \langle \bm{z}_{t}, \bm{\pi} \rangle } \right). \end{align} $$

Masked Diffusion Model (MDM)

MDM defines the $K^\text{th}$ category as a special [MASK] token, which serves as an absorbing state with a stationary distribution $\bm{\pi} = \bm{m} = [0,\dots,0,1]^\top$. In the forward process, tokens either remain themselves or transition to the mask. For $s < t$, the reverse posterior is given by: $$ \begin{align} q(\bm{z}{s} \mid \bm{z}{t}, \bm{z}{0}) = \begin{cases} \mathrm{Cat}\left( \bm{z}{s} ; \bm{z}{0} \right) & \bm{z}{t} \neq \bm{m}, \ \mathrm{Cat}\left( \bm{z}{s} ; \frac{(1-\alpha{s})\bm{m} + (\alpha_{s} - \alpha_{t})\bm{z}{0}}{1-\alpha{t}} \right) & \bm{z}{t} = \bm{m}. \end{cases} \end{align} $$ Using a neural network $\bm{x}{\phi}(\bm{z}{t}, t) \approx \bm{z}{0}$, parameterized by $\phi$ and ensured to be $\langle \bm{x}{\phi}, \bm{m} \rangle = 0$, the approximated reverse process $p{\phi}(\bm{z}{s} \mid \bm{z}{t})$ can be written in the same form. The discrete-time and continuous-time diffusion losses are given by: $$ \begin{align} \mathcal{L}(\phi ; \bm{z}{0}, T) &=\textstyle \mathbb{E}{t \sim {\frac{1}{T},\frac{2}{T}\dots,1}} \mathbb{E}{\bm{z}{t} \sim q(\bm{z}{t} \mid \bm{z}{0})} \left[ T D_{\text{KL}}\left( q(\bm{z}{s} \mid \bm{z}{t}, \bm{z}{0}) \parallel p{\phi}(\bm{z}{s} \mid \bm{z}{t}) \right) \right] \ &=\textstyle \mathbb{E}{t \sim {\frac{1}{T},\frac{2}{T}\dots,1}} \mathbb{E}{\bm{z}{t} \sim q(\bm{z}{t} \mid \bm{z}{0})} \left[ \frac{T(\alpha{t} - \alpha_{s})}{1 - \alpha_{t}} \log{\langle \bm{x}{\phi}(\bm{z}{t}, t), \bm{z}{0} \rangle} \right], \ \lim{T \to \infty} \mathcal{L}(\phi ; \bm{z}{0}, T) &=\textstyle \mathbb{E}{t \sim \mathcal{U}[0, 1]} \mathbb{E}{\bm{z}{t} \sim q(\bm{z}{t} \mid \bm{z}{0})} \left[ \frac{\alpha_{t}^\prime}{1 - \alpha_{t}} \log{\langle \bm{x}{\phi}(\bm{z}{t}, t), \bm{z}_{0} \rangle} \right]. \end{align} $$

ReMasking Diffusion Model (ReMDM)

ReMDM introduces a hyperparameter $\sigma_{t}$ to MDM, enabling transitions to the mask during the reverse process and generalizing the stochastic process while maintaining the same marginals $q(\bm{z}{t} \mid \bm{z}{0})$. The reverse transition is defined by: $$ \begin{align} q(\bm{z}{s} \mid \bm{z}{t}, \bm{z}{0}) = \begin{cases} \mathrm{Cat}\left( \bm{z}{s} ; \sigma_{t} \bm{m} + (1 - \sigma_{t}) \bm{z}{0} \right) & \bm{z}{t} \neq \bm{m}, \ \mathrm{Cat}\left( \bm{z}{s} ; \frac{(1 - \alpha{s} - \sigma_{t}\alpha_{t}) \bm{m} + (\alpha_{s} - \alpha_{t} + \sigma_{t}\alpha_{t}) \bm{z}{0}}{1-\alpha{t}} \right) & \bm{z}{t} = \bm{m}. \end{cases} \end{align} $$ While setting $\sigma{t} = 0$ recovers the reverse posterior of MDM, ReMDM generally defines a non-Markovian forward process; the forward transition now depends on $\bm{z}{0}$: $$ \begin{align} q(\bm{z}{t} \mid \bm{z}{s}, \bm{z}{0}) = \begin{cases} \mathrm{Cat}\left( \bm{z}{t} ; \frac{ (1 - \sigma{t}) \alpha_{t} \bm{z}{0} + (\alpha{s} - \alpha_{t} + \sigma_{t}\alpha_{t}) \bm{m} }{\alpha_{s}} \right) & \bm{z}{s} \neq \bm{m}, \ \mathrm{Cat}\left( \bm{z}{t} ; \frac{ \sigma_{t} \alpha_{t} \bm{z}{0} + (1 - \alpha{s} - \sigma_{t} \alpha_{t})\bm{m} }{1-\alpha_{s}} \right) & \bm{z}{s} = \bm{m}. \end{cases} \end{align} $$ The discrete-time diffusion loss is given by: $$ \begin{align} \mathcal{L}(\phi ; \bm{z}{0}, T) &=\textstyle \mathbb{E}{t \sim {\frac{1}{T},\frac{2}{T}\dots,1}} \mathbb{E}{\bm{z}{t} \sim q(\bm{z}{t} \mid \bm{z}{0})} \left[ T D{\text{KL}}\left( q(\bm{z}{s} \mid \bm{z}{t}, \bm{z}{0}) \parallel p{\phi}(\bm{z}{s} \mid \bm{z}{t}) \right) \right] \ &=\textstyle \mathbb{E}{t \sim {\frac{1}{T},\frac{2}{T}\dots,1}} \mathbb{E}{\bm{z}{t} \sim q(\bm{z}{t} \mid \bm{z}{0})} \left[ \frac{T(\alpha{t} - \alpha_{s} - \sigma_{t}\alpha_{t})}{1 - \alpha_{t}} \log{\langle \bm{x}{\phi}(\bm{z}{t}, t), \bm{z}_{0} \rangle} \right]. \end{align} $$

Uniform Diffusion Model (UDM)

UDM defines the stationary distribution as a uniform across all categories, $\bm{\pi} = \bm{1} / K$. In this regime, the forward process allows a token to transition to any other category (including itself), eventually erasing all signal as it converges to uniform noise. For $s < t$, the reverse posterior is given by: $$ \begin{align}\textstyle q(\bm{z}{s} \mid \bm{z}{t}, \bm{z}{0}) = \mathrm{Cat}\left( \bm{z}{s} ; \frac{ \left[ K \alpha_{t \mid s} \bm{z}{t} + (1 - \alpha{t \mid s}) \bm{1} \right] \odot \left[ K \alpha_{s} \bm{z}{0} + (1 - \alpha{s}) \bm{1} \right] }{K \alpha_{t} \langle \bm{z}{t}, \bm{z}{0} \rangle + (1-\alpha_{t})} \right). \end{align} $$ Using a neural network $\bm{x}{\phi}(\bm{z}{t}, t) \approx \bm{z}{0}$, parameterized by $\phi$, the approximated reverse process can be written in the same form. The continuous-time diffusion loss is given by: $$ \begin{align} \lim{T \to \infty} \mathcal{L}(\phi ; \bm{z}{0}, T) &=\textstyle \lim{T \to \infty} \mathbb{E}{t \sim {\frac{1}{T},\frac{2}{T}\dots,1}} \mathbb{E}{\bm{z}{t} \sim q(\bm{z}{t} \mid \bm{z}{0})} \left[ T D{\text{KL}}\left( q(\bm{z}{s} \mid \bm{z}{t}, \bm{z}{0}) \parallel p{\phi}(\bm{z}{s} \mid \bm{z}{t}) \right) \right] \ &=\textstyle \mathbb{E}{t \sim \mathcal{U}[0, 1]} \mathbb{E}{\bm{z}{t} \sim q(\bm{z}{t} \mid \bm{z}{0})} \left[ \frac{\alpha{t}^{\prime}}{K \alpha_{t}} \left[ \frac{K}{\bar{\bm{x}}^{(i)}} - \frac{K}{\bar{\bm{x}}{\phi}^{(i)}} - \sum{j \text{ s.t. } \bm{z}{t}^{(j)} = 0} \left( \frac{\bar{\bm{x}}^{(j)}}{\bar{\bm{x}}^{(i)}} \log{\frac{\bar{\bm{x}}{\phi}^{(i)} \bar{\bm{x}}^{(j)}}{\bar{\bm{x}}{\phi}^{(j)} \bar{\bm{x}}^{(i)}}} \right) \right] \right], \end{align} $$ where $\bar{\bm{x}}^{(i)} = K \alpha{t} \bm{z}{0}^{(i)} + (1 - \alpha{t})$, $\bar{\bm{x}}{\phi}^{(i)} = K \alpha{t} \bm{x}{\phi}^{(i)}(\bm{z}{t}, t) + (1 - \alpha_{t})$, and $i = \argmax_{j \in [K]} \bm{z}_{t}^{(j)}$.

Continuous-time formulation (CTMC)

Markovian discrete diffusion models randomly transit states on a discrete-time grid: $$ \begin{align} \bm{z}{t} = (1 - b{t}) \bm{z}{s} + b{t} \bm{v}{t}, \text{ where } b{t} \sim \mathrm{Bern}(1 - \alpha_{t \mid s}) \text{ and } \bm{v}{t} \sim \mathrm{Cat}(\bm{\pi}). \end{align} $$ Transitioning to continuous-time involves reducing the time interval $\Delta t = t - s$ to an infinitesimal $\mathrm{d}t$, turning difference equations into differential equations. By introducing $\beta{t} = \frac{1 - \alpha_{t \mid s}}{\Delta t}$, the transition probability matrix is: $$ \begin{align} \bm{P}(s, t) = (1 - \beta_{t} \Delta t) \bm{I} + \beta_{t} \Delta t \bm{\pi} \bm{1}^\top. \end{align} $$ As the state remains unchanged over zero time, i.e., $\bm{P}(s, s) = \bm{I}$, we obtain the increment: $$ \begin{align} \bm{P}(s, t) - \bm{P}(s, s) = \beta_{t} \Delta t \left( \bm{\pi}\bm{1}^\top - \bm{I} \right), \end{align} $$ which corresponds to the continuous-time CTMC evolving forward in time with the transition-rate matrix: $$ \begin{align} \bm{Q}(t) = \beta_{t} (\bm{\pi}\bm{1}^\top - \bm{I}). \end{align} $$ The forward evolution of the marginal probability distribution $\bm{q}(t) = \alpha_{t} \bm{z}{0} + (1 - \alpha{t}) \bm{\pi}$ is governed by the Kolmogorov forward equation: $$ \begin{align}\textstyle \dot{\bm{q}}(t) = \bm{Q}(t) \bm{q}(t). \end{align} $$

The relationship between the forward rate matrix $\bm{Q}(t)$ and the reverse rate matrix $\bar{\bm{Q}}(t)$ is governed by the entry-wise probability balance: $$ \begin{align} \bar{Q}{ij}(t) = Q{ji}(t) \frac{q_{j}(t)}{q_{i}(t)}. \end{align} $$

Reward-tilted Diffusion

The ultimate goal of a diffusion model is to generate samples following the data distribution $q(\bm{z}{0})$. However, in many applications, we seek samples under a specific reward function $r$. To this end, we define the reward-tilted distribution by reweighting the original density with an exponential reward term: $$ \begin{align}\textstyle q^{r}(\bm{z}{0}) \propto q(\bm{z}{0}) \exp{\left( \beta r(\bm{z}{0}) \right)}. \end{align} $$

Local search via guidance

In continuous diffusion, steering toward high reward can be achieved by modifying the score function. By defining an intermediate reward-tilted density $q^{r}(\bm{z}{t}) \propto q(\bm{z}{t}) \exp{\left( \beta r_{t}(\bm{z}{t}) \right)}$, involving an intermediate reward function $r{t}$, the corresponding score is: $$ \begin{align}\textstyle \nabla_{\bm{z}{t}} \log{q^{r}(\bm{z}{t})} = \nabla_{\bm{z}{t}} \log{q(\bm{z}{t})} + \beta \nabla_{\bm{z}{t}} r{t}(\bm{z}{t}), \end{align} $$ where the second term acts as a steering gradient. Since the reward is typically only defined at $t=0$, one can define the intermediate reward $r{t}$ as the expected tilted reward conditioned on the current noisy state: $$ \begin{align}\textstyle r_{t}(\bm{z}{t}) := \frac{1}{\beta} \log{\mathbb{E}{\bm{z}{0} \sim q(\cdot \mid \bm{z}{t})} \left[ \exp{\left( \beta r(\bm{z}_{0}) \right)} \right]}. \end{align} $$

For discrete diffusion, where gradients are unavailable, we tilt the transition rates of the CTMC. The reverse evolution is governed by a modified rate matrix $\bar{\bm{Q}}^{r}(t)$ via Doob's $h$-transform: $$ \begin{align}\textstyle \bar{Q}^{r}{ij}(t) = \bar{Q}{ij}(t) \frac{h(j,t)}{h(i,t)}, \text{ where } h(\bm{z}{t},t) := \exp{\left( \beta r{t}(\bm{z}_{t}) \right)}. \end{align} $$

Global search via SMC

While local guidance provides point-wise steering, SMC methods offer a robust global search strategy. Using the reverse transition of the original diffusion process as the proposal distribution, SMC evolves a population of $M$ particles via: $$ \begin{align}\textstyle \text{(Sampling)} &\quad \forall m \in [M] : \bm{z}{s}^{m} \sim p(\bm{z}{s} \mid \bm{z}{t}^{m}), \ \text{(Resampling)} &\quad \forall m \in [M] : \bm{z}{s}^{m} \gets \bm{z}{s}^{m^{\prime}}, \text{ where } m^\prime \sim \mathrm{Cat}\left( \left{ \bar{G}(\bm{z}{t}^{i}, \bm{z}{s}^{i}) \right}{i=1}^{M} \right). \end{align} $$ Here, the unnormalized and normalized incremental weights are: $$ \begin{align}\textstyle G(\bm{z}{t}, \bm{z}{s}) = \frac{h(\bm{z}{s}, s)}{h(\bm{z}{t}, t)}, \quad \bar{G}(\bm{z}{t}^{i}, \bm{z}{s}^{i}) = \frac{G(\bm{z}{t}^{i}, \bm{z}{s}^{i})}{\sum_{j=1}^{M}G(\bm{z}{t}^{j}, \bm{z}{s}^{j}) }. \end{align} $$