Articles | Volume 17, issue 15
https://doi.org/10.5194/gmd-17-6007-2024
© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.
Random forests with spatial proxies for environmental modelling: opportunities and pitfalls
Carles Milà
Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain
Universitat Pompeu Fabra (UPF), Barcelona, Spain
Marvin Ludwig
Institute of Landscape Ecology, University of Münster, Münster, Germany
Edzer Pebesma
Institute for Geoinformatics, University of Münster, Münster, Germany
Cathryn Tonne
Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain
Universitat Pompeu Fabra (UPF), Barcelona, Spain
CIBER Epidemiología y Salud Pública (CIBERESP), Madrid, Spain
Hanna Meyer
Institute of Landscape Ecology, University of Münster, Münster, Germany
Related authors
Jan Linnenbrink, Carles Milà, Marvin Ludwig, and Hanna Meyer
Geosci. Model Dev., 17, 5897–5912, https://doi.org/10.5194/gmd-17-5897-2024, 2024
Short summary
Estimation of map accuracy based on cross-validation (CV) in spatial modelling is pervasive but controversial. Here, we build upon our previous work and propose a novel, prediction-oriented k-fold CV strategy for map accuracy estimation in which the distribution of geographical distances between prediction and training points is taken into account when constructing the CV folds. Our method produces more reliable estimates than other CV methods and can be used for large datasets.
Henning Teickner, Edzer Pebesma, and Klaus-Holger Knorr
Earth Syst. Dynam., 16, 891–914, https://doi.org/10.5194/esd-16-891-2025, 2025
Short summary
The Holocene Peatland Model (HPM) is a widely used peatland model to understand and predict long-term peatland dynamics. Here, we test whether the HPM can predict Sphagnum litterbag decomposition rates from oxic to anoxic conditions. Our results indicate that decomposition rates change more gradually from oxic to anoxic conditions and may be underestimated under anoxic conditions, possibly because the effect of water table fluctuations on decomposition rates is not considered.
Jakub Nowosad and Hanna Meyer
AGILE GIScience Ser., 6, 40, https://doi.org/10.5194/agile-giss-6-40-2025, 2025
Alexander Pilz, Torsten Frey, and Edzer Pebesma
AGILE GIScience Ser., 6, 43, https://doi.org/10.5194/agile-giss-6-43-2025, 2025
Henning Teickner, Edzer Pebesma, and Klaus-Holger Knorr
Biogeosciences, 22, 417–433, https://doi.org/10.5194/bg-22-417-2025, 2025
Short summary
Decomposition rates for Sphagnum mosses, the main peat-forming plants in northern peatlands, are often derived from litterbag experiments. Here, we estimate initial leaching losses from available Sphagnum litterbag experiments and analyze how decomposition rates are biased when initial leaching losses are ignored. Our analyses indicate that initial leaching losses range from 3 to 18 mass-% and that this may result in overestimated mass losses when extrapolated to several decades.
Fabian Lukas Schumacher, Christian Knoth, Marvin Ludwig, and Hanna Meyer
EGUsphere, https://doi.org/10.5194/egusphere-2024-2730, 2024
Short summary
Machine learning is increasingly used in environmental sciences for spatial predictions, but its effectiveness is challenged when models are applied beyond the areas they were trained on. We propose a Local Training Data Point Density (LPD) approach that considers how well a model's environment is represented by training data. This method provides a valuable tool for evaluating model applicability and uncertainties, crucial for broader scientific and practical applications.
M. Ludwig, J. Bahlmann, E. Pebesma, and H. Meyer
Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLIII-B3-2022, 135–141, https://doi.org/10.5194/isprs-archives-XLIII-B3-2022-135-2022, 2022
Short summary
Spatial proxies, such as coordinates and distances, are often used as predictors in random forest models for predictive mapping. In a simulation and two case studies, we investigated the conditions under which their use is appropriate. We found that spatial proxies are not always beneficial and should not be used as a default approach without careful consideration. We also provide insights into the reasons behind their suitability, how to detect them, and potential alternatives.
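As a rough illustration of what "coordinates and distances" used as predictors can look like, the Python sketch below builds a small table of spatial proxy features for a set of points. The function name `spatial_proxies`, the choice of reference locations (the four corners and the centre of the study area's bounding box), and the sample points are assumptions made for this example, not details taken from the study. The resulting columns would be appended to the other predictors before fitting a random forest with whichever implementation is in use.

```python
import math


def spatial_proxies(points, bbox):
    """Return [x, y, d_corner1..d_corner4, d_centre] for each (x, y) point.

    bbox is (xmin, ymin, xmax, ymax) of the study area (hypothetical input).
    """
    xmin, ymin, xmax, ymax = bbox
    # Illustrative reference locations: four corners plus the centre.
    refs = [
        (xmin, ymin),
        (xmin, ymax),
        (xmax, ymin),
        (xmax, ymax),
        ((xmin + xmax) / 2, (ymin + ymax) / 2),
    ]
    rows = []
    for x, y in points:
        # Euclidean distance from the point to each reference location.
        dists = [math.dist((x, y), ref) for ref in refs]
        rows.append([x, y, *dists])
    return rows


# Two made-up sample locations inside a 6 x 8 study area.
points = [(0.0, 0.0), (3.0, 4.0)]
features = spatial_proxies(points, bbox=(0.0, 0.0, 6.0, 8.0))
```

Each row of `features` holds seven values: the two coordinates followed by five distances. Whether adding such columns helps or harms a given model is the question the summary above addresses, so they are best treated as an option to evaluate rather than a default.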