A Spatiotemporal Weighted Regression Model (STWR v1.0) for 1 Analyzing Local Non-stationarity in Space and Time

: Local spatiotemporal non-stationarity occurs in various natural and socioeconomic processes. Many studies have 9 attempted to introduce time as a new dimension into the geographically weighted regression model (GWR), but the actual 10 results are sometimes not satisfied or even worse than the original GWR model. The core issue here is a mechanism for 11 weighting effects of both temporal variation and spatial variation. In many geographical and temporal weighted regression 12 models (GTWR), the concept of time distance has been inappropriately treated as time interval. Consequently, the combined 13 effect of temporal and spatial variation is often inaccurate in the resulting spatiotemporal kernel function. This limitation 14 restricts the configuration and performance of spatiotemporal weights in many existing GTWR models. To address this 15 issue, we propose a new spatiotemporal weighted regression (STWR) model and the calibration method for it. A highlight of 16 STWR is a new temporal kernel function, in which the method for temporal weighting is based on the degree of impact from 17 each observed point to a regression point. The degree of impact, in turn, is based on the rate of value variation of the nearby 18 observed point during the time interval. The updated spatiotemporal kernel function is based on a weighted combination of 19 the temporal kernel with a commonly used spatial kernel (Gaussian or bi-square) by specifying a linear function of spatial 20 bandwidth versus time. Three simulated datasets of spatiotemporal processes were used


Introduction
Time, space and attributes are three essential characteristics in geographic entities, and they are recorded to reflect the state and evolution of various real-world phenomena and processes.Because space and time frame all aspects of the discipline of geography (Goodchild, 2013), it is important to observe the spatiotemporal variations and explore appropriate analytical methods to study and reason the internal mechanisms and evolutionary laws.In recent years, new platforms and instruments have brought increasingly massive spatiotemporal data, such as the time-and geo-tagged sensor monitoring records and remote sensing images.Those big data create great opportunities for studying human and environmental dynamics from different perspectives, such as the patterns of human behavior (Chen et al., 2011), environmental risk assessment (Sun et al., 2015), and disease outbreaks (Takahashi et al., 2008).Nevertheless, although spatiotemporal modeling has been a long-term research focus in the field of geographical information science (GIScience) (Cressie, 1991;Cressie and Wikle, 2015), the models are not mature yet and challenges still exist (Fotheringham et al., 2015), which call for further work.
In this paper, the technological development and discussion focus on modeling local spatiotemporal variations within the framework of geographically weighted regression (GWR).GWR is a method for modeling spatially heterogeneous processes (Brunsdon et al., 1996(Brunsdon et al., , 1998;;Fotheringham et al., 2003).It has been applied in a variety of areas, such as climate science (Brown et al., 2012), geology (Atkinson et al., 2003), mineral exploration (Wang et al., 2015), transportation analysis (Cardozo et al., 2012), crime studies (Cahill and Mulligan, 2007;Wheeler and Waller, 2009), environmental science (Mennis and Jordan, 2005), and house price modeling (Fotheringham et al., 2015).GWR calibrates a separate regression model at each location through a data-borrowing scheme, in which distance-weights can be calculated by drawing on data from neighboring observations of each regression point (Fraser et al., 2012).This operation complies with Tobler's first law of geography -"Everything is related to everything else, but near things are more related than distant things" (Tobler, 1970).
Numerous studies have been devoted to incorporating the temporal dimension into spatial regression (Pace et al., 2000;Gelfand et al., 2004;Crespo et al., 2007;Cressie and Wikle, 2015).However, most of these studies assume that temporal effects are constant over space from a global perspective of modeling (Fotheringham et al., 2015).To address that issue, Crespo et al. (2007) extended GWR by developing spatiotemporal bandwidths that account for varying local spatial effects across time.Huang andWu (2010, 2014) proposed a geographical and temporal weighted regression model (GTWR) with a method of measuring the spatiotemporal 'closeness' and a parameter ratio  to deal with different measured units in time and space.Although the approach can address the issue to some extent, Fotheringham et al. (2015) pointed out that a sole measurement of integrated spatial and temporal distances can be misleading as location and time are usually measured at different scales, and he stated that the calculation of distance in three dimensions (time and two-dimensional space) remains a challenge.
A spatiotemporal kernel function, which consists of mixed spatial and time-decay bandwidths, was proposed by Fotheringham et al. (2015).Nevertheless, the stepwise strategy applied in this function for bandwidth optimization does not always seem reasonable.In practice, this function needs to first find and fix an optimized spatial bandwidth, then it will find the optimized temporal bandwidth.After that, the spatiotemporal weight will be calculated.This stepwise search process means that the function is not able to optimize both temporal and spatial bandwidths at the same time.However, a more reasonable thought is that the spatiotemporal bandwidth and its weight are simultaneously affected by both spatial and temporal effects of a process.There should be ways to further improve the spatiotemporal kernel function in Fotheringham et al. (2015).
The aim of this paper to develop a better methodology for the spatiotemporal kernel function.Following Tobler's first law, we propose an algorithm, the spatiotemporal weighted regression (STWR).In STWR, the velocity of value change is higher related if they were in near time and space.Therefore, STWR can borrow data not only from nearby locations, but also from nearby value variation through time.The latter is what we call as "time distance" in STWR.The time distance is not the concept of time interval, but the rate of value variation through time.It is a kind of value change that reflects the temporal effect of nearby points to the regression point.Accordingly, our local spatiotemporal regression analysis model can take advantage of the variation in data to identify temporal non-stationarity, which is an advantage comparing with GWR and GTWR.
Before giving more details about STWR, we can further clarify the meaning of a few concepts.A common issue in the existing GTWR models is that they use the concept of time interval, instead of the above-mentioned "time distance", to calculate temporal and spatiotemporal weights.A time interval is the period between two observed time stages.A time distance, in the context of STWR, is the rate of value variation between an observed point and a regression point through a time interval.We can think about the following scenario for a group of points.The values of some points do not change or change slightly from time A to time B, while a few other points may change greatly in that period.However, many GTWR models ignore the difference in the value changes of observed points during a period of time, and regard that all these points have the same temporal effect to their neighbor regression point.It is hard to believe that some unchanged observations constantly affect their nearby regression points during the observed time interval.Intuitively, different variations of the observed points have different temporal effects.For example, the faster the house price of a point change, the stronger the temporal effect is to the house price at its nearby point.Moreover, the rate of value changes at different observed points (time non-stationary) may also have spatial heterogeneity.The data values observed at different points are results of mixed spatiotemporal effects and some other unknown factors (including errors).Therefore, using only time interval in the calculation of temporal and spatiotemporal weights might interpret local spatiotemporal effect imprecisely.
There are other issues in the temporal kernel functions and the multiplication form of spatial and temporal kernels used by the existing GTWR models (Huang et al., 2010;Wu et al., 2014;Fotheringham et al., 2015).When calculating the spatiotemporal effect, these models generally use time intervals and the common kernel functions to calculate temporal weights, such as Gaussian kernel or bi-square kernel.However, an appropriate temporal kernel function should not be the same as the spatial kernel function, because space is in two or three dimensions while time is in one dimension and one direction.Each regression point can borrow observed points from any directions in space but only use points from the past rather than from the future.Moreover, the integrated spatiotemporal weights might be underestimated in these GTWR models by using a multiplication of the spatial and temporal weights.Because both the spatial weights and the temporal weights range from 0 to 1, and the multiplied weight value is never bigger than the smaller one before multiplying, which means that the composite spatiotemporal impacts are never greater than the single spatial impacts and the single temporal impacts.However, the real combined spatiotemporal impacts, may be higher than the single spatial impacts or the temporal impacts, or at least may be higher than the smaller ones.The multiplication formulation of spatiotemporal kernel in GTWR also makes the calculated weight decay faster.
The above-mentioned limitations and issues in GWR and GTWR are the driving force behind our development of STWR.The remainder of this article is organized as follows.Section 2 introduces the STWR model formulation, including temporal kernel and spatiotemporal kernel functions.Section 3 describes the methods for bandwidth selection and calibration when STWR is in operation.Section 4 presents results of applying GWR, GTWR and STWR to three sets of simulated data.
Section 5 presents experiment results with real-world precipitation hydrogen isotope data.In Section 6, we close the article with a summary of the key findings and a few thoughts for future research.

The strategy of time distance decay
Since GWR is the background of our work, it is helpful to first give a brief overview of the GWR framework.The basic formulation of GWR can be described in two equations below (Fotheringham et al., 2003).
In Equation 1,  ! is a response variable of regression point  at a location with the coordinates ( !,  ! ). !# is the  'ℎ independent variable, and  !denotes the error term for the  'ℎ observed point.A key difference between GWR and the traditional global regression method, such as Ordinary Least Squares (OLS), is that GWR allows the coefficient  # ( !,  ! ) vary spatially to identify spatial heterogeneity.Equation 2 represents the GWR calibration in a matrix form.( !,  ! ) is a diagonal weighting matrix specific to location , which is calibrated by a specified kernel function with a given bandwidth.
Every element  ! in the weighting matrix reflects the impact from another observed point to the regression point.A bigger  !value means a higher impact.
GWR has a strategy of spatial distance decay impact on a regression point (Brunsdon et al., 1998;Fotheringham et al., 2003).A similar "time distance decay" strategy was also discussed in several recent GTWR models (Crespo et al., 2007;Huang et al., 2010;Wu et al., 2014;Fotheringham et al., 2015).Yet, those models did not fully reflect the effect of time distance decay.Sample points are observed at different time stages, and those data points closer in time distance to a regression point have more impact on the regression point than those farther away.The time distance refers to the value variation rate between an observed point and a regression point during a certain time interval.For example, in Fig. 1, there are four time stages from old to new: T-s, T-q, T-p and T. Through a fitting and calibration process, the spatiotemporal bandwidth will be fitted, and the spatiotemporal effects (weights) from observed points to a regression points at time stage T will be calculated by a specific spatiotemporal kernel function.Then, in prediction, the value of a regression point at time stage T can be estimated.Thus, the observed points at time stage T only have spatial effect on the regression point (Fig. 1).
There is temporal effect from data points at time stages T-p and T-q (shown as stars, pentagons and triangles in the planes of T-p and T-q in Fig. 1), within a certain spatial bandwidth  ($ at each time stage, to the regression point.The time distance decay should reflect that different variations of the observed points have different temporal effects.However, as mentioned in the previous section, many existing GTWR models have applied a strategy of time interval decay instead of time distance decay.Consequently, they regard that all the observed points have the same temporal effect to their neighbor regression point.Compared to existing GTWR models, the time distance decay strategy of STWR considers the effect of different variations of observed points through time.For example, some data points may have higher impact on the regression point, though their spatial distance is farther than other points.Fig. 1 illustrates that the locations of some star-shape points are farther away from the regression point than some pentagon-shape points at time stage T-p, which denotes that there exist mixed impacts (spatial impact and temporal impact) on the regression point.The temporal impacts depend on the rate of value variation, which is the value difference between the observed point and the regression point divided by a time interval (e.g., [T-p, T] and [T-q, T-p] each is a time interval).If the observed time stage is too long ago or the rate of value variation is too small, and exceeds the limit of optimized temporal bandwidth for the regression point (as shown by observations at time stage T-s), the data points at this time stage may have no impact on the regression point.Even though some of those data points may have huge difference in value and are close to the regression point in space, they are not within the range of the optimized temporal bandwidth.Spatial bandwidths also vary along the time line, and usually the bandwidth gets larger when the observation time is closer to the time stage of the regression point (Fig. 1).

The spatiotemporal kernel function of STWR
We assume that a set of observed points interval  in a study area, where  represents the current time stage and  '%! ,  ∈ { 0,1,2, . . .,  } ( ' =  '%" ) denotes the number of observed points at each recorded time.As the idea described above, we can borrow neighbor points in space and their value variation during certain recent time intervals, so we can still use Equation 1 to generate local estimates.The weight matrix  in GWR usually depends on the spatial kernel (Fotheringham et al., 2015).In STWR, we need to consider the temporal effect, so the form of  is different from that in GWR.Correspondingly, we should have a spatiotemporal kernel, which can be understood as a temporal extension based on the spatial kernel.However, if we use a multiplication form to combine the temporal kernel and the spatial kernel (Huang et al., 2010;Wu et al., 2014;Fotheringham et al., 2015), we will face the problem of time and space interaction as mentioned above in the Introduction section.To address that issue, we design a weighted average form for the spatiotemporal kernel.
In Equation 3, is the weight at time  and at the observed location . , and  $ are the spatial and temporal kernel, respectively, and they both have a value range of 0 to 1.  is an adjustable parameter to scale the temporal and spatial effects, which can be optimized with the bandwidth selections.The role of parameter a is different from the scale parameter ) in GTWR (Huang et al., 2010).a is introduced here for adjusting the outputs of the spatial kernel  , and the temporal kernel  $ , which means measuring the relative strength of the spatial and temporal impacts on the regression point.
But the scale parameter  is used for adjusting the inconsistency of the time distance and space distance, which cannot adjust the relative strength of  , and  $ . ,!+ and  '!+ are the spatial (Euclidean) and temporal distance between the regression point  and an observed data point , respectively. ($ is the spatial bandwidth  ( at a certain time stage , and  $ denotes the temporal bandwidth.
The time distance, as mentioned above, is not the time interval but the rate of value variation between an observed point and a regression point through a time interval.Following the time distance decay strategy in STWR, we can further derive the temporal kernel  $ as shown below.
In Equation 4,  !(') −  +('%6) is the subtraction of the regression point 's observed value at  from the point 's observed value at  − , which denotes the value change during the time interval .The internal part of the exponential function is negative, in order to make the weight  !+)' ' range from 0 to 1.The faster the value change rate is, the bigger the weight is, which means that the time impact is larger.When the time interval  is out of the range (0,  $ ), the weight will be set to zero, which denotes that there is no impact because the observed variation is too far to affect the current moment.For example, if the price of a nearby house has changed a long time ago, it may have little or no impact on the present house price.But if the house price had a sharp change recently, it will have a big impact on the present house price.Therefore, the faster the rate of observed value changes and the shorter the time interval is, the greater the impact on the regression point will be.Compared with GTWR models, the advantage of STWR is that the temporal kernel function  $ can better leverage the variation data.
To calibrate the weight value  !+($ ' , we need a spatial kernel function.The most widely used kernel functions are bisquare and Gaussian (Fotheringham et al., 2003), which are given in Equations 5 and 6, respectively.

𝐵𝑖 − 𝑠𝑞𝑢𝑎𝑟𝑒: 𝑤
In Equations 5 and 6,  ( is the spatial bandwidth.Derived from  ( and  ($ ,  (' is the initial spatial bandwidth at the given time stage  of the regression point (i.e.,  is the initial time for searching observed points in the past).Many functions can be specified for the change of spatial bandwidth during the time intervals.Because in most cases it will have smooth change during a certain short time interval, we assume that the spatial bandwidth changes linearly along with time, as defined bellow.
In Equation 7,  denotes the slope of spatial bandwidth change in correspondence to , and  (' denotes the initial spatial bandwidth at . Importing Equations 4 to 7, the calibration of Equation 3 can be further derived into Equations 8 and 9, which are our spatiotemporal kernel functions in STWR.Equations 8 and 9 are based on the bi-square and Gaussian kernel, respectively.With the STWR spatiotemporal kernel, we only need to optimize the parameters  and  instead of the spatial bandwidth  ($ .However, we shall traverse all the observed points at the initial time stage  to find the optimized spatial bandwidth  (' .Moreover, we shall also traverse all the time stages to find the optimized temporal bandwidth  $ . (8) (9)

Bandwidth selection and parameter estimation
Some goodness-of-fit diagnostics (Loader, 1999) are widely used in general GWR-based models, such as the crossvalidation (CV) score (Cleveland, 1979;Bowman, 1984) and the Akaike Information Criterion (AIC) (Akaike, 1973;Akaike, 1998).For STWR, we use cross-validation (CV) as the default searching criteria and we also calculate the value of a corrected version of AIC (Hurvich et al., 1998), the AICc, which is defined bellow.
In Equation 10, n is the sample size,  k is the estimated standard deviation of the error term, and () denotes the trace of the hat matrix  (Hoaglin and Welsch, 1978).
Although there is no need to optimize spatial bandwidth  ($ of the past time stages in STWR, other parameters such as  and  need to be optimized.Also, we should give the  $ and initial  (' through trials.For more potential combinations of these parameters for different spatiotemporal processes, a more reasonable limit and optimization procedure is hence needed.

Calibration of STWR
Calibration of the STWR models can be conducted by using weighted least squares.The estimator for the coefficients at In Equation 11,  = ,! and  = ,! are observed independent and dependent variables of  )' respectively. = ,! $ is the transpose of  = ,! . )' ( !,  ! ) denotes the spatiotemporal weight matrix for observed points at different locations to the regression point ( !,  ! ) at different time stages during .For a better illustration, we show the weight matrix  )' during the time interval  in Fig. 2. The matrix  )' here is a bit different form the ( !,  ! ) in Equation 2. The records in the  'ℎ row of  )' are the diagonal elements in ( !,  ! ), and only no zero values are used to calibrate the coefficients  .# for each regression point.Thus, each row r of this hat matrix is shown below.
In Equation 12,  !' is the  'ℎ row of the matrix of independent variables at .  )' is the matrix of independent variables during a time interval , and  )' $ is its transpose.Although the  )' in Equation 12 is equal to the  = ,! in Equation 11 in the fitting and calibration of STWR, we distinguish  = ,! from  )' here.Because  = ,! is a specific matrix of independent variables of an observed point set  )' during , while  )' is a general matrix of independent variables of points during . = ,! is only used for fitting and calibration of STWR, while  )' can also be used for prediction in STWR.In other words, we can understand  = ,! as a subclass of  )' . !)' is the  'ℎ row of the weighted matrix  )' .

Reasonable searching range and procedure of optimization
In order to obtain the optimized  and  for STWR (Equations 8 and 9), the search range should be limited.Here we use the distance from each regression point  ! (') to its  'ℎ nearest neighbor as the initial spatial bandwidth  (' at .The optimization procedure is to traverse the set  ., and for each step we further traverse the set  *' to get the optimized  and  through trials.Some trials of  may lead to no solution to Equation 11, because there might be less than ( + 1) 'ℎ neighbors within the radius of  (' −  .from the regression point.Therefore, if it occurs at time stage  − , the spatial bandwidth  (' −  .needs to be extended to the distance from its ( + 1) 'ℎ nearest neighbor to the regression point, to guarantee the matrix in Equation 11 to be nonsingular.

Steps of using STWR for prediction
In this paper, STWR is used to predicate the current values of regression points with known coordinates.The prediction formulas of STWR are more complicated than GWR because the spatial distance is calculated directly from the regression point to each observed data point, while the time distance between the regression point and the data points observed in the past cannot be calculated directly.Therefore, we specify a few steps for the prediction in STWR.First, we need to have the optimized initial spatial bandwidth  (' , the optimized  and , the optimized number of time stages model used and the fitted weight matrix.Second, all data points within the limited distance of spatial bandwidth at the latest time stage should be found for the regression point.Third, all the temporal weights of these data points need to be retrieved from the established weight matrix (Fig. 2).Fourth, we use these retrieved weights to calculate (e.g., use mean value or inverse distance weighting value) the temporal weight on the regression point.Fifth, by combining with the calculated spatial weight and the optimized  and , we can calculate the spatiotemporal weight on the regression point.Then the value of the regression point can be calculated.

Simulation design
To verify the performance of STWR and compare with the results of GWR and GTWR, several groups of simulated data were used in this study to represent different types of heterogeneity in space and time.All the data and code used in the experiments are shared on GitHub.Web links are provided at the end of this manuscript.
For GTWR, we only compared with the results generated by algorithms in Huang et al. (2010) and Wu et al. (2014), because we did not find the software package of Fotheringham et al. (2015).The data generating process (DGP) and the spatial heterogeneity are introduced here.The basic DGP is a linear model shown in Equation 1 and the study area is a regular 25×25 lattice.We defined three initial surfaces to represent the spatial heterogeneity of parameters (Fig. 3), which were generated by Equations 13, 14 and 15, respectively (Fotheringham et al., 2017).Through Equation 1 In the above two equations,  @ ' denotes the mean of an independent variable  at time stage ,  ! @ '0)' ,  ∈ {1,2} denotes the mean of  at time stage  + , and  & and  / are two parameters for adjusting the rate of change.At each time stage during the simulations, the independent variables  & and  / are generated by a normal distribution with new means of  &  @ '0)' and  /  @ '0)' , respectively.

Results with simulated data
We compared the results of OLS, GWR, GTWR, and STWR.A total of 333 random sample points for five time stages ( " ,  & ,  / ,  D ,  E from old to new) were collected from the 25×25 lattice generated in the above-mentioned DGP.To simplify the calculation process, we set  of Equation 7to zero.Due to the limitation of paper length, in the comparison below the STWR results only include those generated by the spatiotemporal kernel in Equation 8.The objective is to compare the predicted results with the true value at the latest time stage.

Case study 1
The time interval of observations in case study 1 was one unit, such as one second or one day.The value change of  & and  / were generated by  & = 0.5 and  / = 0.1, and were affected by  &  with  = 0.5 and  = 1.This means that  & and  / only changed slightly over time.Table 1 presents the results of the global OLS, GWR, GTWR and STWR at the latest time stage, i.e., stage 5.It shows that the sum of squared errors (SSE) of prediction in STWR is much lower than the other models in at least one magnitude.In addition, the AICc scores (Equation 10) also shows that STWR outperforms GTWR and GWR.As shown in Table 1, the R2 (average R-squared of all regression points ) value increases from 13.8% in OLS to 94.2% in GWR, 94.9% in GTWR, and 99.3% in STWR.The estimated standard error, Sigma, reduces to 4.292 in STWR from 23.331 in GTWR.Also, Fig. 5 shows that both the prediction surface (Y_pred) and the prediction error surface (Pred_Error) of STWR are more accurate than those in GWR.Due to the limitation of the software package in Huang et al.
(2010) and Wu et al. (2014), we did not generate images for GTWR in Fig. 5, but the result can be seen from the Sigma value in Table 1.

Case study 2
The time interval of observations in case study 2 was 10 units.The value change of  & was generated by  & = 0.5 and affected by  D  with  = 0.5 and =2. / was generated by  / =2 and affected by  /  with  = 1 and  = 1, which denotes that  & and  / changed fast over time.Table 2 shows the results of the global OLS, GWR, GTWR and STWR at the time stage 5.The SSE value in STWR is much lower than other models, and STWR has the highest R2 value 0.995.The Sigma value of STWR is 13.299, which is the lowest and less than one-fifth of the Sigma in GWR and less than one-sixth of the Sigma in GTWR.Besides, the AICc scores show that STWR significantly outperforms GTWR and GWR.
STWR utilized data from the latest three time stages to calibrate the model.The initial spatial bandwidth  (' of STWR was three nearest neighbors, which was smaller than the one in GWR with 15 nearest neighbors.The optimized  of STWR was 0.08, which shows that the effect of used observed points to their local regression points was mainly determined by their spatial distance.In this case, the GWR outperforms GTWR, which may due to the higher ratio of value change.Compared with the y_true surface, the predict surface of STWR is much better than GWR (Fig. 6).For the same reason as mentioned in case study 1, we did not generate images for GTWR in Fig. 6.

Case study 3
The time interval of observations in case study 3 was 200 units.In both case studies 1 and 2, the coefficients in Equation 1were unchanged.In contrast, in case study 3, three surfaces of coefficients changed over time, which were generated by the trends  & ,  / , and  D , respectively.The variations of coefficients were assumed to be slow.The  and  in each trend were set to be 0.2 and 1, respectively.Both  & and  / were set to be 0.5.The dynamic process of the three surfaces of coefficients and the y_true surface at each time stage are shown in Fig. 7.The process in case study 3 is more complicated than a general process, but it may be closer to reality.The small difference among these models at  " may be caused by their different searching range of spatial bandwidth.
Starting from time stage  & , STWR and GTWR can borrow points from previous observations.At time stage  & , STWR outperforms both GWR and GTWR, and the advantage of STWR becomes more obvious in the later stages.
It may seem strange that GWR can outperform GTWR (Fig. 8), but that is reasonable for the process in case study 3.
The change of this process is faster; and the time interval of observations is bigger than the previous case studies.STWR is not only able to deal with time intervals, but also to make full use of the value variation of observed points for calibration.In contrast, GTWR only uses the time interval information and all the observed points to calibrate, which may cause problems when the observed values are significantly different in spatial distribution or the time intervals are long.GTWR makes use of points from previous time stages without considering their variation, but if the actual values are quite different from previous observations at the current time stage, all the point values for the calibration of GTWR will become smooth.Thus, GWR outperforms GTWR in this situation because GWR only uses the current data points for model calibration.
STWR is better for estimation than GWR and GTWR because its Sigma value is much smaller.As shown in Fig. 8b, the Sigma of STWR was half of GWR at time stage  & , and even less than a third of GWR at time stage  E .The results show that the advantage of STWR is obvious comparing with GWR and GTWR.At  E , STWR used data from all the past time stages to calibrate the model, and its optimized (initial) spatial bandwidth  (' was derived from four nearest neighbors, which was smaller than the one in GWR with 25 nearest neighbors.The optimized  of STWR was 0, which means that STWR only borrowed points from past time stages without considering their temporal weights to each regression point at  E .The predict surfaces at time stage  E is shown in Fig. 9.The Y_pred surface of STWR is much better than GWR, especially in the middle and bottom left parts of the surface.The Pred_Error of STWR is also much lower than GWR at almost every location.In this case, the  of STWR at each time stage was 0, 0.96, 0, 0.07, and 0, respectively.These values indicate that the temporal effects are different at each stage.They also show that the value of  can be adaptive to scale the temporal and spatial effects (see Equation 3).
As Fig. 10 shows, the optimized bandwidths are quite different among these models, and the bandwidths of GWR and GTWR are larger than the initial bandwidth of STWR at each time stage.The optimized bandwidth for each time stage refers to an optimized number of the nearest neighbors (see Section 3.3).As GTWR considers all the nearest neighbors from different time stages, the optimized numbers of the nearest neighbors (bandwidth) grow fast, and exceed the GWR model at time stage  / .However, the actual distance from the observed points to the regression points is not necessarily farther.The initial optimized numbers of the nearest neighbor of STWR are smaller than those in GWR and GTWR, which means that the initial spatial bandwidth is narrower than the bandwidth of GWR and GTWR.Nevertheless, due to the strategy of borrowing points from nearby neighbors of past observations, the total points for model calibration in STWR may still be more than GWR and GTWR.Therefore, the initial optimized numbers of the nearest neighbors in STWR are kept at a lower level, which means it is more localized than GWR in this sense.

Experiments with Real-world Data
To further test the performance of STWR, we used data of precipitation δ 2 H isotopes in Northeastern United States in another case study.We chose δ 2 H data in three days from October 29 to 31, 2012, which have enough spatiotemporal data for the test.Here in the comparison the STWR results only include those generated by the spatiotemporal kernel in Equation 8. Data and code used here are shared on Zenodo (See DOI and web links in the 'Code and data availability' section at the end of the main text of this article).
In the experiments, we collected a total of 782 measurements from 116 sites located in Northeastern United States during the three-day period, and prepared the data on a daily average.The daily precipitation, mean temperature, and elevation were used as explanatory variables.The model derived from Equation 1 is represented below.
In Equation 21,  denotes the daily total precipitation (rain + melted snow),  denotes daily mean temperature, and ℎℎ is the elevation value.After data preprocessing, there were 272 points for model calibration and 73 points values on October 31, 2012.For the first day, both GTWR and STWR took no information from the past.Therefore, we only show the results of SSE, R2 and the optimized initial neighbor (bandwidth) in the model comparisons for the second and third day (D2 and D3) in Tables 3. The SSE of STWR is the lowest at both days.GWR shows a slightly higher SSE than GTWR at D2 and D3.The R2 of STWR is the highest at both days among these models.GWR has lower R2 than GTWR at D2, and almost the same R2 as GTWR at D3.
Similar to the experiments on three simulation datasets, the result here shows that STWR outperforms GTWR and GWR.In the experiment, the number of optimized initial neighbors of STWR was smaller than that of GWR and GTWR.
The optimized  of STWR was 0 at both D2 and D3.The optimized temporal bandwidths of STWR (number of time stages model used) in both D2 and D3 were 2, which means that the STWR in this case only borrowed data points from the latest 2 time stages for D2 and D3.In the result (Table 3), an interesting part to see is that the numbers of optimized initial neighbors of STWR are smaller than the spatial bandwidths of GWR for D2 and D3.The reason is that STWR borrowed points from past time stages in the calculation, which led to narrower bandwidths to some extent.We adopted Leave-one-out cross-validation (LOOCV) at D3 for the comparison between STWR and GWR.The squared errors (SE) of prediction are shown in Fig. 11.The prediction results of STWR are better than GWR for most points.
The mean SE of STWR is smaller than GWR.Moreover, the SE of STWR shows a narrower regional trend, which indicates that STWR is more robust than GWR.In addition, the total SSE of GWR and STWR are 50216.510and 39724.995,respectively.Therefore, the result further validates that the quality of predication in STWR is better than GWR.In Fig. 12, the predicted δ 2 H surface at D3 is broadly similar between the GWR and STWR calibrations.The percentages of explanation of variance in GWR and STWR are similar, which are 68.8% and 76.3%, respectively.However, like the experiment results with simulated data (Fig. 10), STWR has narrower initial bandwidth, which generates more localization in the predicted δ 2 H surface than GWR.For instance, the lower (light yellow and blue parts) or higher (orange parts) predicted values of δ 2 H are more concentrated in the δ 2 H surface of STWR than that of GWR (Fig. 12).

Discussion and Conclusions
Spatiotemporal data analysis is important in many scientific studies.Due to the complexity of spatiotemporal models, spatiotemporal effect may not be fully taken into account when the temporal and spatial information is manipulated simultaneously.In particular, the models for the effect of spatial dynamics should not be simply adapted for modeling the effect of temporal dynamics.Although the GTWR model can borrow points from the near recent, without careful consideration of temporal effect, the performance of GTWR may even be worse than GWR.Increasingly, many scientific issues are not just about spatial non-stationary but involve many spatiotemporal processes.It is necessary to review the limitation of the current spatiotemporal models and make new extensions.The aim of the STWR model developed in this study is to advance the work and discussion in that direction.
Based on a concept similar to GWR, a recently proposed model, named geographically neural network weighted regression (GNNWR) (Du et al., 2020), utilizes both OLS and neural networks to evaluate spatial non-stationarity.It is characterized by a designed spatially weighted neural network (SWNN) which can represent the spatial non-stationary weight matrix in spatial processes.Additionally, a geographically and temporally neural network weighted regression (GTNNWR) model (Wu et al., 2020), which is a temporal extension of GNNWR, was also proposed by the same group for further modelling spatiotemporal non-stationary relationships.GTNNWR can generate a space-time distance by utilizing the so-called spatiotemporal proximity neural network (STPNN), which may address complex non-linear interactions between time and space.Although both STWR and GTNNWR have the potential to handle complex spatiotemporal non-stationarity in various natural and socioeconomic processes, their principles and interpretability are different: (1) The basic formulation of GNNWR is defined as Equation 22 (Du et al., 2020), which is different from Equation 1( Fotheringham et al., 2003).The  " ( !,  ! ) and  # ( !,  ! ) denote the geographical weight of the constant coefficient  " and coefficient  # , respectively.It assumed that the multiplication of  3 ( !,  ! ) and  3 is equal to  3 ( !,  ! ) (0 ≤  ≤ ).The combined  3 ( !,  ! ) is thought as the same as the coefficients of GWR.But in STWR and GWR, the weights and the estimated coefficients are separated.The weights mainly reflect the degree of the influences from the observed points to the regression point, while the coefficient values reflect the relationships between the independent variable and dependent variable.
(2) GTNNWR and GNNWR use the proposed ANNs based method (Equation 23) ( Du et al., 2020) to calculate the weighted matrix, which is quite different from the kernel functions used in GWR and STWR models.Although GTNNWR and GNNWR use the idea of pointwise regression, they do not consider how to "borrow points" from nearby neighbors and do not have the concept of bandwidth.Without spatial bandwidth, all observation points in the study area may have impacts on the regression point, which might violate the Tobler's first law of geography (Tobler, 1970).It may be difficult to understand the relationships between the influence weight and the spatial distances, especially when the study area and the data amounts are large.STWR has spatial bandwidths and follows the Tobler's first law of geography, which can help analyze the affected range of local regression points.
(3) The data points will be divided into training set (including validation set) and test set for the GTNNWR and GNNWR, which might require more data points.Thus, it may not be appropriate for analyzing fewer amounts of data points (data acquisitions of many geoscience processes are difficult and costly).STWR and GWR do not need to divide data points into the training set (including validation set) and test set, which requires less data points than GNNWR and GTNNWR.
(4) Although GTNNWR utilizing a method named spatiotemporal proximity neural network (STPNN) (Wu et al., 2020) to calculate the spatiotemporal distance, the obtained integrated spatiotemporal distance is lack of explanation, and it is also impossible to tell apart which parts of the calculated weight is affected by time or space.Besides, there is no concept of temporal bandwidth in GTNNWR.Therefore, it fails to inform us the earliest time (stage) since which the observed points start to exert impact on the determination of the regression point.But STWR has temporal bandwidth, and it can distinguish the strength of temporal weight and spatial weight.Therefore, we can analyze the characteristics of the local interaction of time and space according to the temporal bandwidth, spatial bandwidth, and the adjustment parameter α, etc.

Deleted: Thus
The temporal kernel and the spatiotemporal kernel functions are two important contributions of STWR.The temporal kernel in STWR applies an improved sigmoid form (see Equation 4), which is different from the methods for temporal effect analysis in previous GTWR models.The temporal weight generated by the STWR temporal kernel is limited as a value between 0 and 1.The spatial weight in STWR is also limited as a value between 0 and 1.The STWR spatiotemporal kernel function has a weight adjustment parameter  to scale the temporal and spatial weights (Equation 3).In practice,  can be obtained through optimization.This form of weighted average between temporal and spatial effects in the STWR spatiotemporal kernel is a big improvement comparing with the multiplication form in previous GTWR models.The advantage of the STWR spatiotemporal kernel has been proven in four case studies with both simulated and real-world datasets.
Though the performance of STWR is outstanding, the models can still be further extended.A big topic is about the time distance.In the current STWR, the time distance represents the rate of value variation between an observed point and a regression point through a time interval.Nevertheless, we can also use time distance to represent the rate of value variation at each observed point object through time.Note that, from an object-oriented perspective, here we differentiate the point objects from locations, although the point objects have geospatial coordinates as part of their attributes.Following that new definition of time distance, the  !(') −  + ( '%6 ) in the STWR temporal kernel (Equation 4) can be replaced by Δ +('%6) (value variation of an observed point object during ).A scenario of interest is that, the observed point objects in the past time stages (such as those shown in Fig. 1) may move to new locations, have no value for a few time stages, or even disappear, so the Δ +('%6) may not exist.We can use object-based methods to address issues caused by that scenario.For example, each point object can be assigned with a unique ID, and then the observed value of the point object at each time stage can be retrieved by using its ID.With this new definition of time distance, the temporal weight on a regression point object is determined by the rate of value variation of its nearby point objects.Several different scenarios for a regression point object at current time stage  are discussed here.
(1) The location of an observed point object  is fixed through time (e.g., a fixed sensor).If the value of  is observed at both time stages  and  − , then Δ +('%6) can be calculated directly.If the value of  is observed at  but not observed at  − , we can use interpolation to generate a value for  at  − .If the value of  is not observed at , but the variation in the past is observed, we can use prediction methods to generate a value for  at .
(2) The location of  is not fixed through time (i.e.,  moves).The moving point objects can still have temporal effects to the regression point, then the Δ +('%6) can be calculated.The spatial effect, however, depends on whether  moves out of the spatial bandwidth from the regression point or not.
(3)  disappears or appears at a certain time stage.If  does not appear until the current time stage , the Δ +('%6) can be set to be 0. If  appears in a past time stage (e.g.,  − ) but it disappears before or at , we can ignore the impact of  for the regression point object.
There are other possibilities for the further improvement of STWR.The first is about the optimization of  in the spatiotemporal kernel (Equations 8 and 9).The slope  indicates that the variation of the spatial bandwidth is in a linear form, but it may not be a perfect solution.In many situations, the change of the spatial bandwidth over time may not be linear.The second is about making predications for future time stages.In this paper, we only predict values for points at the current time stage .Extensions can be made in STWR to predict values for points in future time stages beyond .The third future work is about exploring multiple spatial and temporal bandwidths of models.Different variables may have different spatial and temporal bandwidths due to their unique characteristics.Correspondingly, we may need more bandwidths to capture the different non-stationarities of those independent variables, to better represent the spatiotemporal heterogeneity.
In short, the core contribution of STWR is the clarification of the 'time distance' concept and the new temporal kernel and spatiotemporal kernel functions based on this concept.Our experiments show that STWR outperforms GWR and GTWR in analyzing and interpreting local spatiotemporal non-stationarity.We hope STWR can bring fresh ideas and new capabilities for spatiotemporal data analysis in many disciplines.

Fig. 1 .
Fig. 1.Spatiotemporal impacts of observed points with different rates of value change on a regression point at time stage T. Temporal bandwidth is the length of time from the intersection point A of the spatiotemporal bandwidth and the time line to the regression point.Spatial bandwidth and spatiotemporal bandwidth are illustrated in the figure legend.

Fig. 5 .
Fig. 5. Comparing prediction results of STWR and GWR in case study 1.Images a1, b1, and c1 are the simulation surfaces of true Y, the predicted surface of Y by STWR, and the predicted surface of Y by GWR, respectively.Images a2, b2, and c2 are the surface of simulation error, the surface of prediction error of STWR, and the surface of prediction error of GWR, respectively.

Fig. 6 .
Fig. 6.Comparing prediction results of STWR and GWR in case study 2. Images a1, b1, and c1 are the simulation surfaces of true Y, predicted surface of Y by STWR, and predicted surface of Y by GWR, respectively.Images a2, b2, and c2 are the surface of simulation error, the surface of prediction error of STWR, and the surface of prediction error of GWR, respectively.

Fig. 7 .
Fig. 7. Dynamic process of three surfaces of coefficients and the y_true surface at five different time stages.

Fig. 8 .
Fig. 8. Comparing and evaluating the performance of GWR, GTWR and STWR at five time stages.(a) Comparing the R2 value of different models; (b) Comparing the Sigma value of different models.

Fig. 9 .Fig. 10 .
Fig. 9. Comparing prediction results of STWR and GWR in case study 3. Images a1, b1, and c1 are the simulation surfaces of true Y, the predicted surface of Y by STWR, and the predicted surface of Y by GWR, respectively.Images a2, b2, c2 are

Fig. 11 .
Fig. 11.LOOCV results of STWR and GWR.(a) Squared error of prediction at each point (leave out); (b) Box plot of the LOOCV results of GWR and STWR.

Deleted:
With increasing combined applications of deep 504 learning and neural network in geospatial non-stationary 505 processes.We first discuss the main differences between 506 STWR and the recently proposed geographic neural network 507 weighted regression (GNNWR) (Du et al., 2020) and 508 geographic and temporal neural network regression 509 (GTNNWR) (Wu et al., 2020).GNNWR is a new attempt to 510 combine the OLS and GWR with Artificial neural networks Formatted: Font: Not Bold Formatted: Font: Not Bold The range of  (' is within a finite set of discrete values, because the maximum number of nearest neighbor is limited to  '%! ,  ∈ {1,2, . . ., } for the regression point  ! (') ( '%! is the total number of observed points at  − ).We denote that value set for  (' as  *' = { #0& ,  #0/ , . . . *! }, in which the element  > ,  ∈ { + 1,  + 2, . . .,  ' } denotes the distance from  ! (') to the  'ℎ nearest neighbor, and  equals to the number of independent variables.Moreover, the searching range of the temporal bandwidth  $ is also limited to a finite discrete set  .= { & ,  / , . . . .}, in which the element  . is the time interval from  to  − .

Table 1 .
Results of case study 1 at time stage  E .

Table 2 .
Results of case study 2 at time stage  E .

Table 3 .
Results of model performance with real-world data.