RainDiff: End-to-End Precipitation Nowcasting via Token-wise Attention Diffusion

1Mohamed bin Zayed University of AI,
2Australian National University, 3Linköping University


Figure 1: A visualization from the SEVIR dataset shows that, at the longest forecast horizon, RainDiff avoids oversmoothed outputs and better preserves weather fronts compared to the state-of-the-art DiffCast, resulting in closer alignment with the ground truth.

Abstract

Precipitation nowcasting, predicting future radar echo sequences from current observations, is a critical yet challenging task due to the inherently chaotic and tightly coupled spatio-temporal dynamics of the atmosphere. While recent advances in diffusion-based models attempt to capture both large-scale motion and fine-grained stochastic variability, they often suffer from scalability issues: latent-space approaches require a separately trained autoencoder, adding complexity and limiting generalization, while pixel-space approaches are computationally intensive and often omit attention mechanisms, reducing their ability to model long-range spatio-temporal dependencies. To address these limitations, we propose Token-wise Attention, integrated into both the U-Net diffusion model and the spatio-temporal encoder, which dynamically captures multi-scale spatial interactions and temporal evolution. Unlike prior approaches, our method natively integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion, thereby eliminating the need for separate latent modules. Our extensive experiments and visual evaluations across diverse datasets demonstrate that the proposed method significantly outperforms state-of-the-art approaches, yielding superior local fidelity, generalization, and robustness in complex precipitation forecasting scenarios. Our code will be publicly released.

RainDiff Architecture

Figure 2: Overall architecture of our precipitation nowcasting framework RainDiff. Given an input sequence \(X_{0}\), a deterministic predictor \(\mathcal{F}_{\theta_1}\) outputs a coarse prediction \(\mu\). The concatenation of \(X_{0}\) and \(\mu\) is encoded by a cascaded spatio-temporal encoder \(\mathcal{F}_{\theta_3}\) to yield conditioning features \(h\), refined by Post-attention. A diffusion-based stochastic module \(\mathcal{F}_{\theta_2}\) equipped with Token-wise Attention at all resolutions in pixel space predicts residual segments \(\hat{r}\) autoregressively, where the denoising process is conditioned on \(h\) and the predicted segments. This design captures rich contextual relationships and inter-frame dependencies in the radar field while keeping computation efficient.
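The flow in Figure 2 can be summarized as a schematic inference loop: a coarse deterministic forecast \(\mu\), conditioning features \(h\), and autoregressive diffusion over residual segments. The function names, segment length, and toy callables below are illustrative placeholders under our reading of the figure, not the released implementation.

```python
import numpy as np

def raindiff_forecast(x0, predictor, encoder, diffusion, num_segments):
    """Schematic of Figure 2: deterministic coarse forecast plus
    autoregressive diffusion over residual segments, conditioned on h."""
    mu = predictor(x0)                      # coarse prediction (F_theta1)
    h = encoder(np.concatenate([x0, mu]))   # conditioning features (F_theta3)
    segments, prev = [], None
    for s in range(num_segments):
        # each residual segment is denoised conditioned on h and the
        # previously generated segment (autoregressive refinement)
        prev = diffusion(h, prev, s)
        segments.append(prev)
    return mu + np.concatenate(segments)    # final forecast = mu + r_hat

# Toy stand-ins with plausible shapes: 5 input frames -> 20 forecast frames.
T_in, T_out, H, W, seg_len = 5, 20, 8, 8, 5
x0 = np.zeros((T_in, H, W))
predictor = lambda x: np.zeros((T_out, H, W))
encoder = lambda x: x.mean()
diffusion = lambda h, prev, s: np.zeros((seg_len, H, W))
y = raindiff_forecast(x0, predictor, encoder, diffusion, T_out // seg_len)
```

The stand-in callables only fix the tensor shapes; the point is the control flow, in which the diffusion module refines residuals segment by segment rather than generating the whole sequence at once.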

Why Is Token-wise Attention Helpful?

We propose Token-wise Attention, integrated across all spatial resolutions in our network. This design enables accurate modeling of fine-scale structures while maintaining computational efficiency. Unlike conventional self-attention, our token-wise formulation avoids the quadratic complexity induced by the high dimensionality of radar data. Moreover, all operations are performed directly in pixel space, eliminating the need for an external latent autoencoder. Finally, drawing on empirical insights, we introduce Post-attention, which leverages token-wise attention to emphasize the informative conditional context crucial for the denoising process.
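The paper's exact formulation is not spelled out here, but one way a token-wise scheme can sidestep the quadratic cost of self-attention over high-resolution radar fields is the linear reordering \(Q\,(K^{\top}V)\) used by efficient-attention variants, sketched below. The function name, shapes, and normalization choices are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear_token_attention(Q, K, V):
    """Attention reordered as Q @ (K^T V): cost grows linearly with the
    number of tokens N instead of quadratically, since the (d, d_v)
    context summary is independent of N."""
    Qn = softmax(Q, axis=-1)   # normalize each query over the feature dim
    Kn = softmax(K, axis=0)    # normalize each key over the token dim
    context = Kn.T @ V         # (d, d_v) global context, no N x N matrix
    return Qn @ context        # (N, d_v) per-token readout of the context

# A 64x64 radar field flattened to N = 4096 tokens with d = 64 channels.
N, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, N, d))
out = linear_token_attention(Q, K, V)
```

At this resolution, standard self-attention would materialize a 4096 × 4096 score matrix per head; the reordering above never does, which is what makes attention at all pixel-space resolutions affordable.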

Results

Table 1: Quantitative comparison across four radar nowcasting datasets (Shanghai Radar, MeteoNet, SEVIR, CIKM). We evaluate deterministic baselines (PhyDNet, SimVP, EarthFarseer, AlphaPre) and the probabilistic method DiffCast against our RainDiff using CSI, pooled CSI at \(4{\times}4\) and \(16{\times}16\) (CSI-4 / CSI-16), HSS, LPIPS, and SSIM. Bold marks our results. Overall, RainDiff attains the best or tied-best performance on most metrics and datasets, indicating both stronger localization and better perceptual/structural quality.
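For reference, CSI and HSS are standard contingency-table scores, and the pooled CSI variants score forecasts after max-pooling so that near-miss placements still count. The sketch below uses the textbook definitions; the reflectivity thresholds and pooling protocol for each dataset follow the paper's evaluation setup, so the toy numbers here are only illustrative.

```python
import numpy as np

def csi_hss(pred, target, threshold):
    """Critical Success Index and Heidke Skill Score at a reflectivity
    threshold (standard contingency-table definitions)."""
    p, t = pred >= threshold, target >= threshold
    tp = np.sum(p & t)
    fp = np.sum(p & ~t)
    fn = np.sum(~p & t)
    tn = np.sum(~p & ~t)
    csi = tp / max(tp + fp + fn, 1)
    hss_den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    hss = 2 * (tp * tn - fn * fp) / max(hss_den, 1)
    return csi, hss

def pooled_csi(pred, target, threshold, pool):
    """CSI after max-pooling (e.g. 4x4 or 16x16), rewarding forecasts
    that are correct up to small spatial displacements."""
    H, W = pred.shape
    ph, pw = H // pool * pool, W // pool * pool
    def mp(x):
        return x[:ph, :pw].reshape(ph // pool, pool, pw // pool, pool).max(axis=(1, 3))
    return csi_hss(mp(pred), mp(target), threshold)[0]

target = np.zeros((4, 4)); target[0, 0] = 5.0
pred = np.zeros((4, 4)); pred[0, 1] = 5.0   # correct up to a 1-pixel shift
exact = csi_hss(pred, target, 1.0)[0]       # 0.0: no pixel-level overlap
pooled = pooled_csi(pred, target, 1.0, 2)   # 1.0: hit inside the 2x2 pool
```

The gap between `exact` and `pooled` in this toy case is exactly why Table 1 reports CSI-4 and CSI-16 alongside plain CSI: pooled scores separate displacement errors from outright misses.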

Frame-wise Evaluation

Figure 3: Frame-wise CSI and HSS for various methods on the Shanghai Radar dataset. As lead time increases, scores drop across all methods due to accumulating forecast uncertainty, yet our approach consistently outperforms the baselines at most timesteps—often by a larger margin at longer leads—demonstrating superior robustness as the forecast horizon extends.

Qualitative Results

Figure 4: Qualitative comparison with existing works on the Shanghai Radar dataset, with the reflectivity color scale shown at the top right. Deterministic models yield blurry outputs, while the stochastic model DiffCast, though sharper, introduces excessive and uncontrolled randomness at the boundaries of air masses. Integrating Token-wise Attention not only enables the generation of realistic, high-fidelity details but also regulates the model's stochastic behavior, leading to forecasts with improved structural accuracy and consistency, thereby mitigating the chaotic predictions seen in DiffCast.

BibTeX

@misc{nguyen2025raindiffendtoendprecipitationnowcasting,
      title={RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion}, 
      author={Thao Nguyen and Jiaqi Ma and Fahad Shahbaz Khan and Souhaib Ben Taieb and Salman Khan},
      year={2025},
      eprint={2510.14962},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.14962}, 
}