Impact Statement
Future precipitation in the Upper Indus Basin and, by extension, the water security of downstream communities remain difficult to predict. New approaches that quantify the probabilities of extreme precipitation events, which lead to floods or droughts, are needed to inform more sustainable policy and climate change adaptation. This paper presents one such method.
1. Introduction
The Indus is one of the longest rivers in Asia. From the Tibetan Plateau, it meanders through northern India and along the entire length of Pakistan before flowing into the Arabian Sea. Over its 3,000 km journey, shown in Figure 1, it sustains the livelihoods of more than 270 million people and supplies the world’s largest contiguous irrigation system (Shrestha et al., 2019). Of all the rivers originating in the Hindu-Kush Karakoram Himalaya (HKKH) region, the Indus depends the most on high-altitude water resources (Immerzeel et al., 2010; Lutz et al., 2014). Over 60% of its total flow is attributed to glacier and snow melt in an area known as the Upper Indus Basin (UIB) (Immerzeel and Bierkens, 2012; Lutz et al., 2016). The Indus is therefore one of the rivers most sensitive to climate change. However, there is considerable uncertainty around the future amount, seasonality, and composition of the Indus’ flow (Krishnan et al., 2019; Lutz et al., 2016; Orr et al., 2022; Sabin et al., 2020), including predictions of extremes such as floods and droughts. Coupled with increasing demand for freshwater, this uncertainty puts the region’s future water security at risk. This security is predominantly undermined by the river’s largest source of current and future hydrological uncertainty: precipitation (Immerzeel et al., 2020; Orr et al., 2022; Smolenaars et al., 2021).

Figure 1. Map of the Indus Basin showing elevation (TROPOMI, 2019), the Indus River (Kelso and Patterson, 2010; in blue), and the UIB (NASA JPL, 2013; watershed boundary shown with a dashed line). The inset shows the watershed’s location with respect to the Asian continent.
In the UIB, precipitation largely originates from two major atmospheric events: the Indian summer monsoon (ISM) and the western disturbances (WDs). The ISM brings precipitation from June to September through the displacement of humid air from the Indian Ocean, and in particular the Bay of Bengal (Bookhagen and Burbank, 2010). The strength of the ISM is primarily modulated by the El Niño-Southern Oscillation (Ding and Wang, 2005; Di Capua et al., 2020). WDs are extratropical storms that originate over the Mediterranean region. They travel as frontal systems, collecting moisture from the Mediterranean, Black, and Caspian seas until they intensify upon reaching South Asia (Midhuna et al., 2020). They are primarily modulated by the North Atlantic Oscillation and the El Niño-Southern Oscillation (Hunt and Zaz, 2023; Bharati et al., 2025) and are strongest in winter from December to April, peaking in March over the UIB (Filippi et al., 2014; Dahri et al., 2016; Bharati et al., 2025). Consequently, the eastern UIB receives up to 70% of its annual precipitation in the summer, whereas the western UIB receives 40–60% of its precipitation during winter (Dahri et al., 2016; Immerzeel et al., 2015). The complex orography of the UIB also strongly influences precipitation, resulting in large localized spatial–temporal variations, including extremes (Bookhagen and Burbank, 2006; Lutz et al., 2016).
Global climate models (GCMs) often struggle to accurately simulate synoptic and local precipitation patterns (Stephens et al., 2010; Rocheta et al., 2014). Their coarse spatial resolution and limited physics cannot resolve complex hydro-meteorological dynamics, including over the HKKH region (Palazzi et al., 2013; Panday et al., 2015). To overcome these issues, regional climate models (RCMs) are used to dynamically downscale GCM outputs. These models better represent precipitation variability, including extremes. However, RCMs assume that large-scale driving fields, such as temperature and geopotential, are reliably simulated (Chokkavarapu and Mandla, 2019). RCMs are also computationally expensive, lack error estimates, propagate considerable model-dependent uncertainty (Hawkins and Sutton, 2009), and are generally not optimized for mountainous regions, especially the HKKH (Cannon et al., 2017; Norris et al., 2017, 2019; Orr et al., 2017). For example, RCMs from the Coordinated Regional Climate Downscaling Experiment (CORDEX) for South Asia regularly report seasonal errors of over 100% for present-day precipitation when compared to gauge-based gridded precipitation (Sanjay et al., 2017). On the other hand, large-scale circulation variables, such as horizontal wind, temperature, and pressure fields, are typically better simulated by GCMs (Tapiador et al., 2019; Arias et al., 2023). Furthermore, these large-scale features have previously been used to statistically downscale monthly precipitation and seasonal precipitation forecasts over the Tian Shan and Pamir mountains in Central Asia using support vector regression (Umirbekov et al., 2022) and to downscale monthly cumulative and daily extreme precipitation over High Mountain Asia using neural networks (Gerlitz et al., 2015).
This paper explores the feasibility of using variables better represented by GCMs to predict future precipitation over the UIB. For this study, monthly fifth-generation ECMWF reanalysis (ERA5) precipitation data (Hersbach et al., 2020) are used to train a machine learning model for temporal prediction over a 15-year horizon. The following experiments serve as a proof-of-concept for applying this approach to predict future precipitation from GCMs. If ERA5 precipitation is assumed to be close to reality, this application can be considered a proof-of-concept for heterogeneous model output statistics downscaling and bias-correction (Maraun and Widmann, 2017). If ERA5 precipitation is assumed to be different from reality, this application can simply be considered an emulation of ERA5.
A large variety of methods are applicable to this task. Traditional methods, such as multiple linear regression or distribution shift methods, are more interpretable (Schoof and Pryor, 2001; Gudmundsson et al., 2012; Maraun and Widmann, 2017). However, scalable machine learning methods, such as convolutional neural networks, recurrent neural networks, or transformers (Wang et al., 2021; Zhong et al., 2024), and in particular probabilistic machine learning methods, such as generative adversarial networks (Harris et al., 2022; Kumar et al., 2023; Leinonen et al., 2021) or diffusion models (Ling et al., 2024; Srivastava et al., 2024; Mardani et al., 2025), have become strong choices due to their accuracy in modeling complex spatiotemporal distributions.
In this work, we instead explore Gaussian processes (GPs) (Rasmussen and Williams, 2005). This Bayesian machine learning method retains a high level of interpretability while providing a mechanism for incorporating domain knowledge into the model and returning uncertainty estimates through predictive probability distributions of future precipitation. These distributions can then be used to quantify the probability of extreme precipitation events. Different GP covariance functions are explored, including a non-stationary Gibbs kernel, building on work from Lalchand et al. (2022). The paper is structured as follows. The data are first presented in Section 2 and the methodology in Section 3. Model performance is presented in Section 4. Finally, the study’s limitations and further work are discussed in Section 5.
2. ERA5 precipitation and predictive variables
Previous research has found that ERA5 climate reanalysis, although generally exhibiting a wet bias over the HKKH region (Baudouin et al., 2020; Chen et al., 2021; Kumar et al., 2021), offers precipitation estimates closer to observations, in terms of amounts, seasonality, and variability from daily to multi-annual scales, than other reanalysis and RCM products over the UIB (Baudouin et al., 2020; Dahri et al., 2021; Khanal et al., 2023). In this work, monthly averages are chosen to learn the UIB’s climatology rather than daily averages, which capture its more stochastic weather patterns. Smoother monthly data also avoid consecutive days of zero precipitation, which would require the more involved modeling of a Bernoulli-gamma distribution (Martinez-Villalobos and Neelin, 2019; Girona-Mata et al., 2024).
Figure 2 shows ERA5 monthly precipitation data over the UIB from 1970 to 2004. The precipitation has an approximately log-normal distribution with a mean of $1.83\;\mathrm{mm}\;\mathrm{d}^{-1}$, a standard deviation of $1.84\;\mathrm{mm}\;\mathrm{d}^{-1}$, and values ranging from 0.00 to $23.40\;\mathrm{mm}\;\mathrm{d}^{-1}$. Soft $k$-means clustering is applied to the precipitation data using different time intervals and cluster numbers, excluding latitude and longitude. Three distinct precipitation regimes, shown in Figure 3, are identified and subsequently referred to as the Gilgit, Ngari, and Khyber regimes. Figure 2 shows that the Ngari regime is characterized by a clear periodic pattern, whereas the Gilgit and Khyber regimes have more stochastic time series.
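As an illustration of this clustering step, the sketch below shows a minimal soft $k$-means implementation in Python. The array layout, the stiffness parameter beta, and the way a 0.6 membership threshold would be applied are illustrative assumptions rather than the procedure used in the paper’s code.

```python
import numpy as np
from scipy.special import softmax

def soft_kmeans(X, k=3, beta=5.0, n_iter=100, seed=0):
    """Soft k-means over precipitation time series.

    X    : (n_cells, n_months) array of standardized precipitation series
           (latitude and longitude are deliberately excluded).
    beta : stiffness; larger values approach hard k-means.
    Returns the cluster centres and the (n_cells, k) membership weights.
    """
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared distance of every cell to every centre -> soft weights
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        weights = softmax(-beta * d2, axis=1)
        centres = (weights.T @ X) / weights.sum(axis=0)[:, None]
    return centres, weights

# Cells whose largest membership weight exceeds 0.6 would be assigned to a
# regime (cf. Figure 3); the remaining cells are left unlabelled.
```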

Figure 2. ERA5 monthly precipitation over the UIB between 1970 and 2004. The average annual and seasonal precipitation over the UIB (left) and time series for three locations in the UIB (right) are shown. The time series show the typical precipitation regimes in the basin (Figure 3) and are fitted with a linear regression model (dotted green lines) to identify trends over the studied time period.

Figure 3. UIB precipitation regimes. Clusters are generated by applying soft $k$-means to ERA5 monthly precipitation data from 1970 to 2004. The figure shows the clusters for a weighting threshold of 0.6, following the approach of Lalchand et al. (2022).
Potential variables for predicting future precipitation are compiled from existing literature. They are summarized in Table 1. Large-scale geopotential fields are first considered. The strength and variability of the ISM and WDs over the UIB have previously been linked to large-scale geopotential fields outside the UIB via empirical orthogonal function (EOF) analysis (Filippi et al., 2014; Mölg et al., 2014). EOF analysis, also known as geographically weighted principal component analysis, is the decomposition of a signal or dataset in terms of its orthogonal basis functions. EOFs reduce the dimensionality and noise of large-scale fields and are therefore commonly used to determine atmospheric patterns (Weare and Nasstrom, 1982). Here, monthly EOFs are generated from daily ERA5 geopotentials at 200 hPa, 500 hPa, and 850 hPa over the UIB. Spatially averaged EOFs over the Bay of Bengal (10–16°N, 83–93°E) and east of the Caspian Sea (35–40°N, 60–70°E) are also calculated to capture large-scale pressure systems entering the basin.
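To make the EOF step concrete, the following sketch computes leading EOFs and their principal component time series from a (time, latitude, longitude) geopotential array using a singular value decomposition. The latitude weighting, function name, and monthly averaging note are illustrative assumptions rather than the exact procedure used to build the predictors in Table 1.

```python
import numpy as np

def leading_eofs(field, lats, n_modes=2):
    """Leading EOFs of a (time, lat, lon) geopotential array via SVD.

    Returns the principal component time series (used here as predictors)
    and the corresponding spatial EOF patterns.
    """
    nt, nlat, nlon = field.shape
    anomaly = field - field.mean(axis=0, keepdims=True)
    # sqrt(cos(latitude)) weighting approximates equal-area contributions
    weights = np.sqrt(np.cos(np.deg2rad(lats)))[None, :, None]
    flat = (anomaly * weights).reshape(nt, nlat * nlon)
    u, s, vt = np.linalg.svd(flat, full_matrices=False)
    pcs = u[:, :n_modes] * s[:n_modes]                   # (time, n_modes)
    patterns = vt[:n_modes].reshape(n_modes, nlat, nlon)
    return pcs, patterns

# Daily PCs can then be averaged to monthly values, and the same routine
# applied to the Bay of Bengal and Caspian Sea boxes before spatial averaging.
```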
Table 1. Considered model features with dimensions and abbreviations. These monthly variables are, or are derived from, ERA5 (Hersbach et al., 2020), with the exception of the atmospheric indices, which are calculated by NOAA (2011) using historical observations

In addition to geopotential fields, monthly historical climate indices from NOAA (2011) are investigated to train the model. Climate indices are numerical measures used to describe specific patterns or anomalies in the Earth’s atmosphere and oceans. These indices are typically derived from observational data. The NAO, Niño 3.4, and Niño 4 indices, which represent the North Atlantic and El Niño-Southern oscillations, are considered. The NAO index is calculated by applying a rotated principal component analysis (RPCA) to the geopotential anomaly at 500 hPa (National Weather Service Climate Prediction Center, 2008). The Niño 3.4 and Niño 4 indices measure anomalous sea surface temperatures over overlapping areas in the tropical Pacific Ocean. Finally, ERA5 surface-level variables are studied. These include static fields, such as latitude, longitude, altitude, slope of sub-grid orography, and angle of sub-grid orography, as well as local meteorological variables, such as 2 m dew point temperature and total column water vapour. Although the latter two variables are not large-scale fields, they can be more faithfully modelled by GCMs. This is because they are more directly related to the thermodynamic properties of the atmosphere and less influenced by the complex microphysical processes that affect precipitation, especially over mountains (Houze, 2012; Tapiador et al., 2019). The monthly 2 m temperature averaged over 20–40°N, 65–85°E is also investigated as a proxy for seasonality, with an annual periodicity similar to that of precipitation.
3. Method
This section first introduces GPs and methods to address non-stationary and non-Gaussian data. We then present the experimental setup, including the validation protocol and the specific models used.
3.1. Gaussian processes
Consider the dataset $\{y_i\}_{i=1}^N$ representing $N$ monthly ERA5 precipitation values, where $y_i \in \mathbb{R}$. The values are given at inputs $\{\boldsymbol{x}_i\}_{i=1}^N \in \mathbb{R}^D$, where $D$ is the number of input dimensions (e.g., time, latitude, longitude, or elevation). We assume $y_i$ can be generated by a function $f$ at $\boldsymbol{x}_i$:

$$ y_i = f\left(\boldsymbol{x}_i\right) + \varepsilon_i, \tag{1} $$

where $\varepsilon_i$ is a noise term, assumed to be normally distributed with zero mean and standard deviation $\sigma_n$, $\varepsilon_i \sim \mathcal{N}\left(0, \sigma_n^2\right)$.
The function $f$ can be modeled with a GP. We refer the reader to Rasmussen and Williams (2005) for an introduction to GPs and follow their notation. A GP is a stochastic process for which any finite collection of its random variables is distributed according to a multivariate normal distribution. Similarly to a multivariate normal distribution, a GP is defined by a mean function $\mu\left(\boldsymbol{x}\right)$ and a covariance function or kernel $k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)$:

$$ f\left(\boldsymbol{x}\right) \sim \mathcal{GP}\left(\mu\left(\boldsymbol{x}\right),\, k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)\right), \tag{2} $$

where $\boldsymbol{x}$ and $\boldsymbol{x}^{\prime}$ are two arbitrary inputs. Practically, the mean function $\mu\left(\boldsymbol{x}\right)$ is used to model behaviour common to all data, such as linear trends. The covariance function $k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)$ captures the correlation between the outputs and encodes properties such as smoothness and periodicity. The mean and covariance functions can therefore be used to sample representative functions from the GP. The spread of these functions is interpreted as the uncertainty of the model.
The covariance function makes GPs well suited for highly correlated geophysical datasets. If the covariance function is stationary, the correlation of the outputs depends only on the distance between the inputs $\boldsymbol{x}$ and $\boldsymbol{x}^{\prime}$. The Gibbs kernel (Gibbs, 1998) introduces non-stationarity through input-dependent covariance lengthscales in each dimension. This flexibility should allow the function samples to adapt to the varying precipitation patterns across the UIB. Instead of a single point estimate of the lengthscale $\ell_d$ per dimension $d$, we can consider a lengthscale function $\ell_d\left(\boldsymbol{x}\right)$. The Gibbs kernel for multi-dimensional inputs is given by

$$ k_{\mathrm{Gibbs}}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) = \prod_{d=1}^{D} \sqrt{\frac{2\,\ell_d\left(\boldsymbol{x}\right)\ell_d\left(\boldsymbol{x}^{\prime}\right)}{\ell_d\left(\boldsymbol{x}\right)^2 + \ell_d\left(\boldsymbol{x}^{\prime}\right)^2}}\, \exp\left(-\frac{\left(x_d - x_d^{\prime}\right)^2}{\ell_d\left(\boldsymbol{x}\right)^2 + \ell_d\left(\boldsymbol{x}^{\prime}\right)^2}\right). \tag{3} $$
We model these lengthscale functions non-parametrically with latent GPs, building on work from Lalchand et al. (2022). In this paper, our aim is to design a covariance function using knowledge of precipitation and large-scale circulation patterns and to enhance it by learning non-stationary spatial relationships within the data.
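For concreteness, a minimal NumPy sketch of the Gibbs covariance in Equation 3 is given below. Here the lengthscale functions are passed as simple callables and a unit signal variance is assumed; in the model itself, the lengthscale functions are latent GPs learnt from the data.

```python
import numpy as np

def gibbs_kernel(X1, X2, lengthscale_fns, variance=1.0):
    """Gibbs covariance matrix between inputs X1 (n, D) and X2 (m, D).

    lengthscale_fns[d] maps an array of inputs to positive lengthscales
    for dimension d (plain callables here; latent GPs in the paper).
    """
    n, D = X1.shape
    K = np.full((n, X2.shape[0]), variance)
    for d in range(D):
        l1 = lengthscale_fns[d](X1)[:, None]          # (n, 1)
        l2 = lengthscale_fns[d](X2)[None, :]          # (1, m)
        sq_sum = l1 ** 2 + l2 ** 2
        prefactor = np.sqrt(2.0 * l1 * l2 / sq_sum)
        sq_diff = (X1[:, d][:, None] - X2[:, d][None, :]) ** 2
        K *= prefactor * np.exp(-sq_diff / sq_sum)
    return K
```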
The distribution of a GP is, by definition, Gaussian. Making the distribution of the target variable (i.e., precipitation) more normal can therefore help with inference. Warped GPs (Snelson et al., 2003) allow us to handle non-Gaussian distributions by transforming the GP inputs and outputs. For precipitation in the UIB, a Box–Cox function is used as the warping function:

$$ \psi\left(y_i\right) = \begin{cases} \dfrac{y_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[4pt] \ln\left(y_i\right), & \lambda = 0, \end{cases} \tag{4} $$

where $y_i$ is the $i$th observation, assumed to be positive, and $\lambda$ is the scaling factor found through maximum likelihood estimation. We note that the inverse transformation of the predicted mean yields the median of the resulting distribution.
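A minimal sketch of this warping step, assuming SciPy’s Box–Cox utilities, is shown below; the example values are placeholders and the GP fit itself is omitted.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Positive monthly precipitation values (mm/day); a small offset could be
# added beforehand if exact zeros occurred.
y = np.array([0.4, 1.2, 2.3, 0.9, 5.6, 11.2])
y_warped, lam = boxcox(y)        # lambda chosen by maximum likelihood

# ... a GP would be fitted to y_warped; use its predictive mean here ...
mu_warped = y_warped.mean()      # stand-in for a GP predictive mean

# Inverting the warped predictive mean recovers the median (not the mean)
# of the precipitation distribution.
median_precip = inv_boxcox(mu_warped, lam)
```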
3.2. Experimental setup
The design and validation of the following models adhere to the guidelines proposed by Tazi et al. (2023). To train and evaluate the GPs, ERA5 monthly precipitation data are divided chronologically. The training set represents 35 years between 1970 and 2004. The model is evaluated on the next 15 years of precipitation data between 2005 and 2019. Ideally, the model would be trained on all the data in the UIB. However, to overcome the high computational complexity of GPs for large datasets, the data are sampled. Unless otherwise stated, the training set contains 10,000 samples from the training period, while the validation and test sets each comprise 3,500 samples from the validation period. The sampling for each set is random, stratified by location, and without replacement. The training set represents 6% of the total available data over the training period. Performance is evaluated by computing the root mean square error (RMSE), the coefficient of determination $R^2$, and the mean log loss (MLL), as defined in Appendix A.
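The sketch below illustrates one way such a chronological, location-stratified split could be implemented with pandas. The column names and the exact handling of the validation/test split are assumptions rather than the authors’ procedure.

```python
import pandas as pd

def chronological_split(df, n_train=10_000, n_eval=7_000, seed=1):
    """Chronological train/evaluation split with location-stratified sampling.

    df needs a datetime 'time' column and a 'location' column identifying
    the grid cell; sample sizes are approximate because sampling uses a
    per-location fraction.
    """
    train_pool = df[df["time"] < "2005-01-01"]
    eval_pool = df[(df["time"] >= "2005-01-01") & (df["time"] < "2020-01-01")]

    def stratified(pool, n):
        frac = min(1.0, n / len(pool))
        return pool.groupby("location", group_keys=False).sample(
            frac=frac, replace=False, random_state=seed
        )

    train = stratified(train_pool, n_train)
    held_out = stratified(eval_pool, n_eval)
    # split the held-out samples evenly into validation and test sets
    val = held_out.sample(frac=0.5, random_state=seed)
    test = held_out.drop(val.index)
    return train, val, test
```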
For the first model investigated, the temporal distribution of the precipitation is captured by fitting a GP to the time series of each grid point within the UIB. This collection of GPs is spatially independent and is subsequently referred to as the single-location model. A constant mean function is used here and in the following models, as no obvious trends are shared between the data, as shown in Figure 2. For precipitation in the UIB, an approximately annual periodicity is expected. We therefore use an additive kernel structure with local periodicity defined as
$$ k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) = k_{\mathrm{Per}}\left(t, t^{\prime}\right)\, k_{\mathrm{Mat}32}\left(t, t^{\prime}\right) + k_{\mathrm{Mat}32}\left(\boldsymbol{x}_{\mathrm{other}}, \boldsymbol{x}_{\mathrm{other}}^{\prime}\right), \tag{5} $$
where $k_{\mathrm{Per}}$ is the periodic kernel and $k_{\mathrm{Mat}32}$ is the Matérn 3/2 kernel. We use a Matérn kernel rather than a smooth squared exponential kernel to better model discontinuities in time and space. The periodic and Matérn kernels follow their standard definitions as presented by Duvenaud (2014). The other variables, excluding time and spatial information such as elevation, are represented by $\boldsymbol{x}_{\mathrm{other}}$. The $\boldsymbol{x}_{\mathrm{other}}$ inputs are determined through greedy feature selection, as detailed in Appendix B, and could include local or large-scale features from Table 1. To establish a baseline, we first train the single-location model with all data at each location for the full training period (full model). A data-limited model is then trained using the previously described training set, approximately 34 points per location.
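A sketch of this kernel composition, assuming a GPyTorch implementation (the paper’s own code base may use a different library), is given below; the number and ordering of the non-temporal features are placeholders.

```python
import torch
import gpytorch

class SingleLocationGP(gpytorch.models.ExactGP):
    """Locally periodic kernel over time plus a Matern 3/2 kernel over the
    remaining (non-spatial) features, with a constant mean."""

    def __init__(self, train_x, train_y, likelihood, n_other):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        k_time = (
            gpytorch.kernels.PeriodicKernel(active_dims=[0])
            * gpytorch.kernels.MaternKernel(nu=1.5, active_dims=[0])
        )
        k_other = gpytorch.kernels.MaternKernel(
            nu=1.5, active_dims=list(range(1, 1 + n_other))
        )
        self.covar_module = gpytorch.kernels.ScaleKernel(k_time + k_other)

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

# Placeholder data: column 0 is time, the remaining columns are features.
train_x = torch.randn(420, 6)
train_y = torch.randn(420)          # warped precipitation
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = SingleLocationGP(train_x, train_y, likelihood, n_other=5)
```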
For the second model, we apply a GP to each cluster shown in Figure 3. The cluster model uses a similar kernel design:
$$ k_{\mathrm{STAT}}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) = k_{\mathrm{Per}}\left(t, t^{\prime}\right)\, k_{\mathrm{Mat}32}\left(t, t^{\prime}\right) + k_{\mathrm{Mat}32\hbox{-}\mathrm{ARD}}\left(\boldsymbol{x}_{\mathrm{lat},\mathrm{lon}}, \boldsymbol{x}_{\mathrm{lat},\mathrm{lon}}^{\prime}\right) + k_{\mathrm{Mat}32}\left(\boldsymbol{x}_{\mathrm{other\ spatial}}, \boldsymbol{x}_{\mathrm{other\ spatial}}^{\prime}\right), \tag{6} $$
where the subscript ‘STAT’ refers to the stationarity of the kernel, $k_{\mathrm{Mat}32\hbox{-}\mathrm{ARD}}$ is the Matérn 3/2 kernel with automatic relevance determination (MacKay, 1996), $\boldsymbol{x}_{\mathrm{lat},\mathrm{lon}}$ are the latitude and longitude inputs, and $\boldsymbol{x}_{\mathrm{other\ spatial}}$ are the other inputs, excluding time, latitude, and longitude, but including other spatial variables such as elevation or slope. These inputs are again chosen through greedy feature selection (Appendix B).
Finally, GPs are fit to the entire UIB. Two whole-basin models are applied. One model is defined using the stationary kernel from Equation 6. The other model uses a non-stationary kernel with the Gibbs function described in Equation 3 for the spatial inputs $\boldsymbol{x}_{\mathrm{lat},\mathrm{lon}}$. The non-stationary kernel is defined as
$$ k_{\mathrm{NSTAT}}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right) = k_{\mathrm{Per}}\left(t, t^{\prime}\right)\, k_{\mathrm{Mat}32}\left(t, t^{\prime}\right) + k_{\mathrm{Gibbs}}\left(\boldsymbol{x}_{\mathrm{lat},\mathrm{lon}}, \boldsymbol{x}_{\mathrm{lat},\mathrm{lon}}^{\prime}\right) + k_{\mathrm{Mat}32}\left(\boldsymbol{x}_{\mathrm{other\ spatial}}, \boldsymbol{x}_{\mathrm{other\ spatial}}^{\prime}\right). \tag{7} $$
4. Results
4.1. Single-location model
The single-location model is created by training a GP at each grid point. Following the results shown in Figure B1, the inputs from Table 1 for this model are time, total column water vapour, 2 m dew point temperature, UIB-averaged 2 m temperature, and the first EOF components of the geopotential at 200 and 500 hPa over the UIB. The outcome of the selection is consistent with features that have a high linear correlation with precipitation. Although large-scale circulation is known to influence precipitation, this link is not established for many of the single-location models during the feature selection process.
For the full single-location model, the GPs have an average predictive skill of $R^2 = 0.50$, $\mathrm{RMSE} = 0.71\;\mathrm{mm}\;\mathrm{d}^{-1}$, and $\mathrm{MLL} = 0.87$ for the test set, as shown in Table 2. The performance of the full model with respect to location, evaluated on all data between 2005 and 2019, is plotted in Figure 4. The southeast UIB presents the best $R^2$ scores. In these cases, the GPs faithfully predict most precipitation values but sometimes struggle to match the monsoon highs. Several GPs in the centre and west of the basin capture precipitation patterns with less accuracy. These regions roughly correspond to the boundaries between the clusters shown in Figure 3. In these more difficult cases, the GPs increase their uncertainty bounds accordingly, indicated by a higher MLL. The RMSE values scale with the average annual precipitation (Figure 2). Spatial differences in $R^2$ scores could suggest that one or more important variables are missing. The differences could also point to an inappropriate kernel design, such as constraining the precipitation to follow a periodic pattern. The data-limited model yields a predictive skill of $R^2 = -0.29$, $\mathrm{RMSE} = 1.15\;\mathrm{mm}\;\mathrm{d}^{-1}$, and $\mathrm{MLL} = 2.46$ over the 15-year horizon (Table 2). However, including information from neighbouring locations and tailoring the input features improves performance, as shown with the cluster model in the next section.
Table 2. Performance of the single-location (SLM), cluster (Khyber, Gilgit, Ngari), and whole-basin (WBM) models. The RMSE is given in $\mathrm{mm}\;\mathrm{d}^{-1}$


Figure 4. Test $R^2$ (left), RMSE (centre), and MLL (right) for the full single-location model across the UIB, evaluated on all precipitation data between 2005 and 2019.
4.2. Cluster model
Next, GPs are trained on the precipitation clusters. Input features are selected for each precipitation regime in Figure B2. For the cluster models, 2 m dew point temperature is favoured over total column water vapour for most of the basin in terms of local climatic variables. The cluster models are also able to better leverage relationships between precipitation and large-scale features. The predictive skill of the models is only minimally improved (and occasionally even worsened) when including orographic features such as slope, elevation, or the angle of the sub-grid orography, despite their well-established influence on precipitation (Houze, 2012; Dahri et al., 2016; Wolvin et al., 2024).
The features’ contributed variance is then compared in Table C1. The 2 m dew point temperature is a strong predictor for all precipitation regimes, in particular for the Ngari model. The total column water vapour and the Niño 3.4 index contribute strongly to the Khyber model, whereas orography is more important for the Gilgit model. Large-scale features generally contribute more to the Gilgit and Khyber regimes. These results are consistent with known precipitation drivers for the UIB. The Khyber cluster is located at a lower altitude but is more exposed to frontal systems driven by large-scale circulation patterns. On the other hand, the Ngari cluster is more sheltered and mainly receives precipitation from the ISM. The Gilgit cluster sits between the two and is characterized by the largest elevation gradients (Wolvin et al., 2024).
In this setting, the cluster model benefits from far less data than the full single-location model but is still able to infer information about the distribution of precipitation. Using an area-weighted average for this model, Table 2 shows $R^2 = 0.49$, $\mathrm{RMSE} = 0.96\;\mathrm{mm}\;\mathrm{d}^{-1}$, and $\mathrm{MLL} = 1.03$ for the test set. The uncertainty calibration for the cluster model therefore improves on the full single-location model. However, the GP distribution medians generally underestimate high precipitation values at both training and test time.
4.3. Whole-basin models
GPs with stationary and non-stationary kernels are then applied to the whole basin. Following the results from Figure B3, the stationary model is trained with time, latitude, longitude, 2 m dew point temperature, UIB-averaged 2 m temperature, the first EOF components of the geopotential at 500 and 850 hPa over the UIB, the averaged first EOF component of the geopotential at 200 hPa over the Bay of Bengal and the averaged first and second EOF components of the geopotential at 500 hPa east of the Caspian Sea. These results are consistent with the trends observed in the previous section, with the stationary whole-basin model selecting the most important features from the cluster models.
For the stationary model, the 2 m dew point temperature and the UIB-averaged 2 m temperature contribute the most to the prediction of precipitation (Table C1). The same features are also used to train the non-stationary GP. The learnt lengthscales for latitude and longitude are presented in Figure 5. The optimized lengthscale values match the annual distribution of precipitation shown in Figure 2, with, on average, longer lengthscales for longitude and shorter lengthscales for latitude.

Figure 5. Learnt non-stationary kernel lengthscales over the UIB for longitude, latitude and their magnitude.
The whole-basin model with a stationary kernel has a similar $R^2$ (0.49), a worse RMSE ($1.25\;\mathrm{mm}\;\mathrm{d}^{-1}$), but a better MLL (0.91) compared to the cluster model (Table 2). On the other hand, the non-stationary model struggles to generalize as well to new times. This more flexible model yields large predictive variances, and thus large MLL scores, including at training time. This behaviour suggests that this kernel design is not appropriate for this task and may be more suitable for low-dimensional interpolation tasks such as downscaling. More data or simpler lengthscale functions could also improve model accuracy and uncertainty calibration.
5. Discussion
In this paper, we have shown that GPs can generate uncertainty-aware predictions of precipitation with moderate skill. With further analysis, the stochastic nature of these models can help us better capture the climate change signal in the means, extremes, and seasonal cycles. By producing interpretable outputs similar to a perturbed-physics ensemble for a single RCM, GPs set themselves apart from other traditional downscaling and bias-correction methods.
However, this study suffers from three main limitations. First, the application of the cluster and whole-basin models is restricted by the scalability of GPs. For these models, only 6% of the available training data are used. Even so, this amount of data requires 60 GB of RAM to train the stationary whole-basin model, which is generally not possible without a high-memory GPU (e.g., an NVIDIA V100 or an A100 with 80 GB of memory) to load and optimize the covariance matrix. On a single Intel Xeon Gold 5317 CPU, training the model would take 62 hours. Therefore, methods to select representative data samples (Liu et al., 2004) and to scale GPs, such as products of experts, low-rank approximations, and inducing points (Liu et al., 2020; Tazi et al., 2023), will need to be implemented to keep future inference and hyperparameter learning tractable. Moving to even larger datasets (>$10^6$ data points) would require probabilistic deep learning, such as neural processes (Vaughan et al., 2022; Andersson et al., 2023), or hybrid deep learning–GP approaches (Putra et al., 2022; Sun et al., 2022).
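As an illustration of the inducing-point route mentioned above, the following sketch defines a variational sparse GP in GPyTorch; the number of inducing points, the kernel, and the placeholder data are assumptions rather than the configuration used in this study.

```python
import torch
import gpytorch

class SparseGP(gpytorch.models.ApproximateGP):
    """Variational GP with learnable inducing points, one way to keep
    training tractable for tens of thousands of samples."""

    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution,
            learn_inducing_locations=True,
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.MaternKernel(nu=1.5)
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

# Placeholder training inputs; 500 of them are reused as inducing points.
train_x = torch.randn(10_000, 6)
inducing = train_x[torch.randperm(train_x.size(0))[:500]]
model = SparseGP(inducing)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
elbo = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_x.size(0))
# A standard minibatch loop maximizing the ELBO would follow.
```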
Second, our knowledge of past precipitation in the UIB is incomplete. The limited in situ observations make capturing the true underlying distribution of precipitation and evaluating models challenging. Accessibility and harsh weather conditions mean that most precipitation gauges are placed in relatively low-altitude locations (Dahri et al., 2016). Alternatives, such as remote sensing, exist, but these tools alone cannot determine precipitation at a fine enough spatiotemporal scale to resolve its distribution accurately (Yin et al., 2008; Andermann et al., 2011; Meng et al., 2014; Shukla et al., 2019).
Third, most features, including large-scale circulation fields and indices, do not provide clear enough signals to improve the GP predictions. This behaviour is due to the features being heavily cross-correlated: only a few variables are needed for efficient inference. Principal component analysis could be a useful tool to improve feature selection, model skill, and the GP’s numerical stability during inference, but at the expense of losing model interpretability. There is also growing evidence that long-distance atmospheric interactions, including those studied in this paper, are prone to analytical artefacts and are not as strong or simple climate predictors as once thought (Runge et al., 2014; McGraw and Barnes, 2018).
The studied temporal resolution and timescales could also limit the establishment of correlations between precipitation and these large-scale variables, and thus model accuracy. Previous work has shown that correlations with atmospheric oscillations are highly seasonal and sometimes predictive up to a year in advance (Umirbekov et al., 2022). Building more seasonal structure into the covariance function could therefore improve the viability of using large-scale features. Furthermore, weekly or daily fluctuations in the ISM and WDs cannot be resolved on a monthly scale. Including these fluctuations could allow the model to capture intra-seasonal variability crucial to predicting extreme precipitation events in the UIB. Future work could also investigate meteorological conditions during these extreme events in more depth and test more climate variables, such as large-scale humidity at different pressure levels.
Going forward, the GP models could also be refined by explicitly incorporating physics linked to precipitation over the UIB. Multiple GPs could be trained to learn the relationships between different variables, e.g., temperature and elevation. The outputs of these models could then be used to predict precipitation. For this purpose, a setup similar to the GP auto-regressive (GPAR) framework by Requeima et al. (2019) could be considered. More faithful results could also be achieved by incorporating training data from other sources, such as remote sensing or in situ precipitation gauges, using multi-fidelity GPs (Tazi et al., 2024).
Additionally, a better scoring function could be used for the feature selection experiments. The MLL is calculated in the warped precipitation space and likely underestimates the loss for large precipitation values. This score also does not penalize model complexity, which could lead to overly complicated models with too many features. Future model outputs could also be assessed with respect to long-term variability to better reflect their sensitivity to climate change.
Finally, the eventual aim of this work is to generate local precipitation predictions from GCM outputs. To achieve this goal, a GP model could be re-trained on coarsened ERA5 input data and make predictions from GCM projections. This procedure should be straightforward, as the ensemble members of the Coupled Model Intercomparison Project (CMIP) report all the variables needed to derive the input features presented in this paper (Eyring et al., 2016; Hersbach et al., 2020). The results of this process could enable more accurate predictions of local precipitation distributions over the UIB.
6. Conclusion
This paper has demonstrated that it is possible to predict certain aspects of precipitation over complex orography using variables that are better represented by GCMs, including large-scale circulation fields. With only 6% of the available data for the training period, the whole-basin model with a stationary covariance function is able to predict monthly precipitation with an average $R^2$ of 0.49, an RMSE of $1.25\;\mathrm{mm}\;\mathrm{d}^{-1}$, and an MLL of 0.92. The 2 m dew point temperature and the UIB-averaged 2 m temperature were identified as the most predictive features for precipitation. Using a cluster basin model with more tailored input features produces a smaller RMSE ($0.97\;\mathrm{mm}\;\mathrm{d}^{-1}$) but a higher MLL (1.02). These results improve on the data-limited single-location model, but not the full single-location model ($R^2$ = 0.50, RMSE = $0.71\;\mathrm{mm}\;\mathrm{d}^{-1}$, MLL = 0.87). The stationary whole-basin model also improves on the non-stationary whole-basin model in this data-sparse setting. The non-stationary model could therefore be more suitable for better-sampled or interpolative tasks. In summary, these experiments highlight the trade-offs between learning spatial covariance, incorporating domain knowledge, and increasing computational complexity with GPs. More importantly, this work paves the way for more accurate precipitation predictions with uncertainty distributions and, in turn, more sustainable climate adaptation at a local level.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/eds.2025.10020.
Data availability statement
The code for this paper is available on GitHub and Zenodo (Tazi, 2024). ERA5 data are available through the Copernicus Data Store (Hersbach et al., 2023a, 2023b) and climate indices from the US National Oceanic and Atmospheric Administration (NOAA, 2011).
Acknowledgements
The authors thank Tony Phillips for providing the upper Indus Basin shapefiles.
Author contribution
Conceptualization: K.T.; S.H. Methodology: K.T.; R.E.T. Data curation: K.T. Data visualization: K.T. Original draft: K.T.; R.E.T.; A.O.; S.H. All authors approved the final submitted draft.
Funding statement
This work was supported by the Engineering and Physical Sciences Research Council through the AI for Environmental Risk (AI4ER) CDT [grant number: EP/S022961/1; award number: 2270379].
Competing interests
The authors declare none.
Ethical statement
The research meets all ethical guidelines, including adherence to the legal requirements of the studied countries.
A. Metric definitions
A.1. Root mean square error
The root mean square error (RMSE) represents the typical distance of the model from the data. For multiple input–output pairs $\left\{\boldsymbol{X}, \boldsymbol{Y}\right\}$, the RMSE is given by:

$$ \mathrm{RMSE} = \sqrt{\left\langle {\left(\boldsymbol{Y} - \boldsymbol{f}\right)}^2 \right\rangle}, $$

where $\boldsymbol{f}$ is the set of predicted values at $\boldsymbol{X}$. We use $\left\langle \cdot \right\rangle$ here and in the following definitions as a shorthand for the average over the $N$ data points. The RMSE is more sensitive to outliers and systematic errors than the mean absolute error or bias.
A.2. Coefficient of determination
The coefficient of determination, or $R^2$, represents the fraction of the data’s variance explained by the model. It is given by:

$$ R^2 = 1 - \frac{\mathrm{SS}_{\mathrm{res}}}{\mathrm{SS}_{\mathrm{tot}}}, $$

where $\mathrm{SS}_{\mathrm{res}}$ is the sum of the squared residuals and $\mathrm{SS}_{\mathrm{tot}}$ the total sum of squares. An $R^2$ of 1 indicates a perfect fit, while a negative $R^2$ means the model performs worse than predicting the mean.
A.3. Mean log loss
The mean log loss (MLL) can be interpreted as a measure of uncertainty calibration for probabilistic models. Using the predictive distribution at the test inputs $\boldsymbol{X}$, the probability of the target data $\boldsymbol{Y}$ given the model $\mathcal{M}$, $p\left(\boldsymbol{Y} \mid \mathcal{M}\right)$, can be calculated. The log loss is given by taking the negative logarithm of this probability (Rasmussen and Williams, 2005). Taking the mean over all inputs gives the MLL:

$$ \mathrm{MLL} = \left\langle \frac{1}{2}\ln\left(2\pi {\boldsymbol{\sigma}}^2\right) + \frac{{\left(\boldsymbol{Y} - \boldsymbol{\mu}\right)}^2}{2{\boldsymbol{\sigma}}^2} \right\rangle, $$

where $\boldsymbol{\mu}$ and ${\boldsymbol{\sigma}}^2$ are the predicted means and variances at the inputs $\boldsymbol{X}$ using the model $\mathcal{M}$. Smaller values imply more skill. The MLL is applied to the warped GP predictions.
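For reference, the three metrics can be computed with a few lines of NumPy, as sketched below; the MLL assumes Gaussian predictive distributions evaluated in the warped space.

```python
import numpy as np

def rmse(y, f):
    return np.sqrt(np.mean((y - f) ** 2))

def r_squared(y, f):
    ss_res = np.sum((y - f) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def mean_log_loss(y, mu, var):
    """Average negative log density of y under N(mu, var); smaller is better."""
    return np.mean(0.5 * np.log(2.0 * np.pi * var) + (y - mu) ** 2 / (2.0 * var))
```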
B. Feature selection experiments
The models’ input features are selected via the MLL on the validation set. Figures B1, B2, and B3 show the results of greedy feature selection experiments for the single-location, cluster, and stationary whole-basin models trained on the first 5000 points from the training set. Under the assumption that we would like the models to be consistent in time and space, the covariance functions are initialized with time for the single-location model and with time, latitude, and longitude for the cluster and whole-basin models.
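The greedy procedure can be summarized by the following sketch, where fit_and_score is a hypothetical helper that trains a GP with a given feature list and returns its validation MLL.

```python
def greedy_feature_selection(candidates, base_features, fit_and_score):
    """Greedy forward selection: try features one at a time (in rank order)
    and keep each one only if it lowers the validation MLL.

    fit_and_score(features) is assumed to train a GP with the given feature
    list and return its validation MLL.
    """
    selected = list(base_features)
    best_mll = fit_and_score(selected)
    for feature in candidates:
        trial_mll = fit_and_score(selected + [feature])
        if trial_mll < best_mll:          # negative change: keep the feature
            selected.append(feature)
            best_mll = trial_mll
    return selected, best_mll
```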

Figure B1. Model MLL as a function of selected input features for the single-location GPs. The plot illustrates the greedy feature selection process, where the best previous kernel design is used to test the next most likely feature. Negative changes imply the feature improves model predictive skill. Features associated with positive changes decrease model predictive skill and are not included in subsequent covariance functions.

Figure B2. As Figure B1 for the cluster GPs. Although EOF500C2 reduces the validation MLL of the Ngari GP, we exclude this feature in the final model as it sets the variance of previously selected features to be very low.

Figure B3. As Figure B1 for the stationary whole-basin GP.
Feature rank for the feature selection experiments is established through recursive feature elimination using a random forest ensemble. The recursive feature elimination shows total column water vapour and 2 m dew point temperature to be the most important when predicting precipitation. Of the climate indices, the NAO index is most closely linked to precipitation. The first components of the empirical orthogonal functions (EOFs) of the geopotentials over the UIB are the most predictive of the geopotential field variables. The other atmospheric fields are found to be less connected to precipitation than local input features such as slope or elevation. Time-lagged atmospheric indices and large-scale features were also investigated (not shown), but presented no additional correlation with precipitation.
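A sketch of this ranking step, assuming scikit-learn’s RFE wrapper around a random forest, is given below; the arrays and feature names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X_candidates = rng.normal(size=(500, 8))            # placeholder feature matrix
y = rng.normal(size=500)                            # placeholder precipitation
feature_names = [f"feature_{i}" for i in range(8)]  # placeholder names

rfe = RFE(
    estimator=RandomForestRegressor(n_estimators=200, random_state=0),
    n_features_to_select=1,  # eliminate down to one feature to rank them all
    step=1,
)
rfe.fit(X_candidates, y)
ranking = sorted(zip(rfe.ranking_, feature_names))  # rank 1 = most important
```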
C. Feature variance contribution
Table C1 details the contributed variance of the input features for the cluster and whole-basin models.
Table C1. Feature variance contribution for the cluster (Khyber, Gilgit, Ngari) and stationary whole-basin (UIB) models. The variances are normalized with respect to the training set distribution, with higher values representing larger contributions towards predicting precipitation. Empty entries correspond to features that were not chosen during the feature selection experiments
