We propose Token-wise Attention, integrated across all spatial resolutions in our network. This design enables accurate modeling of fine-scale structures while maintaining computational efficiency. Unlike conventional self-attention, our token-wise formulation avoids the quadratic complexity induced by the high dimensionality of radar data. Moreover, all operations are performed directly in pixel space, eliminating the need for an external latent autoencoder. Finally, drawing on empirical insights, we introduce Post-attention, which leverages token-wise attention to emphasize the informative conditional context crucial for the denoising process.
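One way to avoid the quadratic cost of spatial self-attention, consistent with the description above, is to attend between channel tokens instead of pixel tokens, so the score matrix is \(C{\times}C\) rather than \(N{\times}N\) for \(N=H\cdot W\) pixels. The sketch below illustrates this idea only; the function name, weight shapes, and the channel-token formulation are assumptions, not the paper's exact operator.

```python
import numpy as np

def tokenwise_attention(x, wq, wk, wv):
    """Hypothetical sketch of attention over channel tokens.

    x: (N, C) flattened radar feature map, N = H*W pixels, C channels.
    Scores are computed between the C channel tokens, giving a C x C
    attention matrix -- cost grows linearly with the number of pixels N,
    unlike the N x N matrix of conventional spatial self-attention.
    """
    q, k, v = x @ wq, x @ wk, x @ wv              # projections, each (N, C)
    scores = (q.T @ k) / np.sqrt(q.shape[0])      # (C, C) channel-token affinities
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over channel tokens
    return v @ attn.T                             # mix value channels, output (N, C)
```

Because the softmax operates over a fixed number of channels, doubling the spatial resolution only doubles the work in the matrix products, which is what makes applying such attention at every resolution in pixel space affordable.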
Table 1: Quantitative comparison across four radar nowcasting datasets (Shanghai Radar, MeteoNet, SEVIR, CIKM). We evaluate deterministic baselines (PhyDNet, SimVP, EarthFarseer, AlphaPre) and probabilistic methods (DiffCast) against our RainDiff using CSI, pooled CSI at \(4{\times}4\) and \(16{\times}16\) (CSI-4 / CSI-16), HSS, LPIPS, and SSIM. Bold marks our results. Overall, RainDiff attains the best or tied-best performance on most metrics and datasets, indicating both stronger localization and better perceptual and structural quality.
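The categorical skill scores in Table 1 follow standard definitions, which can be computed from the hit/miss/false-alarm counts of a thresholded radar field. The snippet below uses the conventional CSI and HSS formulas; the threshold value and any pooling applied before binarization (as for CSI-4 / CSI-16) are assumptions here, not the paper's exact evaluation protocol.

```python
import numpy as np

def csi_hss(pred, obs, thresh):
    """Standard CSI and HSS on binarized fields (threshold is illustrative).

    CSI = TP / (TP + FP + FN)
    HSS = 2(TP*TN - FN*FP) / ((TP+FN)(FN+TN) + (TP+FP)(FP+TN))
    """
    p, o = pred >= thresh, obs >= thresh          # binarize prediction and observation
    tp = int(np.sum(p & o))                       # hits
    fp = int(np.sum(p & ~o))                      # false alarms
    fn = int(np.sum(~p & o))                      # misses
    tn = int(np.sum(~p & ~o))                     # correct rejections
    csi = tp / max(tp + fp + fn, 1)
    denom = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    hss = 2 * (tp * tn - fn * fp) / max(denom, 1)
    return csi, hss
```

For the pooled variants, a max-pooling of the fields over \(4{\times}4\) or \(16{\times}16\) windows before thresholding (a common choice in nowcasting benchmarks) rewards forecasts that place precipitation approximately, not pixel-perfectly.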
Figure 3: Frame-wise CSI and HSS for various methods on the Shanghai Radar dataset. As lead time increases, scores drop across all methods due to accumulating forecast uncertainty, yet our approach consistently outperforms the baselines at most timesteps, often by a larger margin at longer leads, demonstrating superior robustness over extended forecast horizons.
Figure 4: Qualitative comparison with existing works on the Shanghai Radar dataset; the reflectivity color scale is shown at the top right. Deterministic models yield blurry outputs, while the stochastic model DiffCast, though sharper, introduces excessive and uncontrolled randomness at the boundaries of air masses. Integrating Token-wise Attention not only enables the generation of realistic, high-fidelity details but also regulates the model's stochastic behavior, yielding forecasts with improved structural accuracy and consistency and mitigating the chaotic predictions seen in DiffCast.
@misc{nguyen2025raindiffendtoendprecipitationnowcasting,
title={RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion},
author={Thao Nguyen and Jiaqi Ma and Fahad Shahbaz Khan and Souhaib Ben Taieb and Salman Khan},
year={2025},
eprint={2510.14962},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.14962},
}