This work is distributed under the Creative Commons Attribution 4.0 License.
Remote sensing-based high-resolution mapping of the forest canopy height: some models are useful, but might they be even more if combined?
Abstract. The development of high-resolution mapping models for forest attributes based on remote sensing data combined with machine or deep learning techniques has become a prominent topic in the field of forest observation and monitoring. This has resulted in an extensive availability of multiple sources of information, which can either lead to a potential confusion, or to the possibility to learn both about models and about forest attributes through the joint interpretation of multiple models. This article seeks to endorse the latter, by relying on the Bayesian model averaging (BMA) approach, which can be used to diagnose and interpret differences among predictions of different models. The predictions in our case are the forest canopy height estimations for the metropolitan France coming from five different models (Lang et al., 2023; Liu et al., 2023; Morin et al., 2022; Potapov et al., 2021; Schwartz et al., 2024). An independent reference dataset, containing four different definitions of the forest height (dominant, mean, maximum and Lorey’s), comes from the French National Forest Inventory (NFI), providing some 5 500 plots used in the study, distributed across the entire area of interest. In this contribution, we line up the evoked models with respect to their probabilities to be the ones generating measurements/estimations at the NFI plots. Stratifying the probabilities based on French sylvo-ecological regions reveals spatial variation in the respective model probabilities across the area of interest. Furthermore, we observe significant variability in these probabilities depending on the forest height definition used. This leads us to infer that the different models inadvertently exhibit dominant predictions for different types of canopy height. We also present the respective inter-model and intra-model variance estimations, allowing us to come to understand where the employed models have comparable weights but contrasted predictions. We show that the mountainous terrain has an important impact on the models’ spread. Moreover, we observe that the forest stand vertical structure, the dominant tree species and the type of forest ownership systematically appear to be statistically significant factors influencing the models’ divergence. Finally, we demonstrate that the derived mixture models exhibit higher R2 scores and lower RMSE values compared to individual models, although they may not necessarily exhibit lower biases.
Status: closed
CC1: 'Comment on gmd-2024-95', Chong Xu, 28 Jun 2024
This manuscript is based on remote sensing data and deep learning technology to carry out high-resolution mapping of forest canopy height, which has positive value for the application of remote sensing technology in forestry. Specific suggestions are as follows: (1) The title should be modified to be more precise and concise; (2) References should not be included in the abstract, and the abstract should be revised to be more direct and clear; (3) In the introduction, a summary of the deficiencies of previous work should be added to introduce the work of this manuscript and increase logical coherence; (4) The structure of the manuscript should be modified to include five sections: Introduction, Data and Methods, Results and Analysis, Discussion, and Conclusion; (5) Although there is a section for "Results and Discussion," the discussion content is very limited. It is recommended that the discussion section be separated into an independent section and that more content be added; (6) From Figure 4, the scatter plot does not show a good pattern, which may indicate that the analysis method in this paper is not ideal. Please analyze the problems in it; (7) Please add more discussion content. It is recommended that the authors analyze and compare the advantages and limitations of this work from multiple perspectives. In addition, it is recommended to add a future outlook. In summary, major revisions are recommended.
Citation: https://doi.org/10.5194/gmd-2024-95-CC1
RC1: 'Comment on gmd-2024-95', Anonymous Referee #1, 28 Jun 2024
This manuscript is based on remote sensing data and deep learning technology to carry out high-resolution mapping of forest canopy height, which has positive value for the application of remote sensing technology in forestry. Specific suggestions are as follows: (1) The title should be modified to be more precise and concise; (2) References should not be included in the abstract, and the abstract should be revised to be more direct and clear; (3) In the introduction, a summary of the deficiencies of previous work should be added to introduce the work of this manuscript and increase logical coherence; (4) The structure of the manuscript should be modified to include five sections: Introduction, Data and Methods, Results and Analysis, Discussion, and Conclusion; (5) Although there is a section for "Results and Discussion," the discussion content is very limited. It is recommended that the discussion section be separated into an independent section and that more content be added; (6) From Figure 4, the scatter plot does not show a good pattern, which may indicate that the analysis method in this paper is not ideal. Please analyze the problems in it; (7) Please add more discussion content. It is recommended that the authors analyze and compare the advantages and limitations of this work from multiple perspectives. In addition, it is recommended to add a future outlook. In summary, major revisions are recommended.
Citation: https://doi.org/10.5194/gmd-2024-95-RC1
RC2: 'Comment on gmd-2024-95', Anonymous Referee #2, 05 Jul 2024
Overall comments
The paper “Remote sensing-based high-resolution mapping of the forest canopy height: some models are useful, but might they be even more if combined?” by Besic et al. is an interesting study of various regional or global-level canopy height models, their performance across an entire country (France) and whether a mixture of these models contains additional information that could be exploited to improve predictions. The general approach to validate canopy height, based on a fully independent, field-based data set (French NFI data) convinces me, as it ensures good spatial coverage, but also allows the authors to extend the evaluation to forest structure metrics that cannot be properly assessed from remote sensing data only, such as Lorey’s height. They thus can assess the link between what is seen from above and what is seen from below, with important implications for biomass assessments.
Overall, the study is timely and innovative. As the authors note, there is an increasing number of (poorly validated/flawed) global canopy height models, which will likely be used for a variety of purposes, such as biomass assessments, disturbance monitoring or the validation/calibration of vegetation models. It is thus crucial to systematically assess these products, provide guidelines on their strengths and weaknesses and find ways to mitigate errors. I thus think that the study at hand would be of great interest and value to readers of Geoscientific Model Development.
I do, however, have a few major concerns that the authors need to address before I could recommend the article for publication. I will detail them in the next few paragraphs and then provide line-by-line comments below.
Model validation
The authors describe the mixture model derived via Bayesian Model Averaging (BMA) as providing a superior performance compared to each of the individual canopy height models, but the article currently does not provide enough evidence for this statement. As far as I understand, BMA is essentially fitting a higher-order mixture model to reference data, where, for every predicted site, the individual models get assigned weights based on performance at that site. The weights are model parameters, and I imagine that BMA is as prone to overfitting as any other modelling approach, particularly when done only in a “Bayesian-flavoured” way without fully propagating uncertainty. I also did not understand whether or how the site-specific weights are regularized (e.g., hierarchically nested within the overall weights?). I would therefore expect such a mixture to automatically perform better, because it allows picking models based on local predictive accuracy, but it is not clear to what extent this approach will be fitting noise and to what extent it will pick up on systematic differences between models.
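To make this concern concrete, below is a minimal sketch of the kind of EM routine I assume sits behind such a weighting scheme (a Gaussian mixture with a common error spread; the authors' actual formulation, priors and per-metric settings may of course differ, so this is only my paraphrase):

```python
import numpy as np

def bma_em(y, F, n_iter=200, tol=1e-6):
    """Sketch of EM for BMA weights: fit p(y_i) = sum_k w_k N(y_i; F[i, k], sigma^2).
    y : (n,) reference heights (e.g. NFI plots); F : (n, K) per-model predictions.
    Returns weights w (K,), common sigma, and responsibilities z (n, K)."""
    n, K = F.shape
    w = np.full(K, 1.0 / K)                      # start from equal weights
    sigma = np.std(y - F.mean(axis=1))           # crude initial error spread
    ll_old = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of model k for plot i (the latent z_ki)
        dens = np.exp(-0.5 * ((y[:, None] - F) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        num = w * dens
        z = num / num.sum(axis=1, keepdims=True)
        # M-step: update weights and the common error standard deviation
        w = z.mean(axis=0)
        sigma = np.sqrt((z * (y[:, None] - F) ** 2).sum() / n)
        ll = np.log(num.sum(axis=1)).sum()       # observed-data log-likelihood
        if ll - ll_old < tol:
            break
        ll_old = ll
    return w, sigma, z
```

Even in this stripped-down form, the weights (and sigma) are parameters fitted to the very reference data used for evaluation, which is exactly why an out-of-sample check is needed.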
To make this analysis convincing and assess how well the model captures systematic patterns, the authors need to employ a cross-validation approach, ideally a spatial cross validation approach (as in Ploton et al. 2020, Nat. Comms.). One suggestion would be to do a spatial leave-one-out cross validation where 1,000 NFI plots are selected at random. For each NFI plot, the authors would then remove the NFI plot itself and all other NFI plots within a specific radius around the NFI plot (I would suggest 100 km to account for a decent range of spatial autocorrelation), calculate the BMA weights from the remaining training NFI plots outside the 100 km radius, and then predict with the trained BMA mixture to the validation NFI plot. This would be repeated 1,000 times. The comparison between prediction and observed value would then allow the calculation of a RMSE/MBE that should be relatively robust to overfitting and spatial autocorrelation between training and validation data. If computational costs are low, this could, of course, also be done from all 5,475 NFI plots.
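A rough sketch of what I have in mind (assuming projected plot coordinates in metres and the bma_em routine sketched above; all names are placeholders):

```python
import numpy as np

def spatial_loo_bma(y, F, xy, n_folds=1000, buffer_m=100_000, seed=0):
    """Buffered spatial leave-one-out CV for the BMA mixture.
    y : (n,) observed heights; F : (n, K) per-model predictions;
    xy : (n, 2) projected plot coordinates in metres."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=min(n_folds, len(y)), replace=False)
    preds, obs = [], []
    for i in idx:
        dist = np.hypot(*(xy - xy[i]).T)        # distance of every plot to plot i
        train = dist > buffer_m                 # drop plot i and everything inside the buffer
        w, _, _ = bma_em(y[train], F[train])    # refit weights without the neighbourhood
        preds.append(F[i] @ w)                  # weight-averaged prediction at the held-out plot
        obs.append(y[i])
    preds, obs = np.asarray(preds), np.asarray(obs)
    rmse = float(np.sqrt(np.mean((preds - obs) ** 2)))
    mbe = float(np.mean(preds - obs))
    return rmse, mbe
```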
If computational costs are higher, a slightly less computationally intensive alternative would be to split the data set into spatial folds such as the 91 sylvo-ecological regions (I’ll call them SERs here), and perform a leave-one-SER out validation. So all NFI plots within one SER would be predicted from the BMA mixture trained on the other 90 SERs. Ideally, the authors would also account for spatial autocorrelation in this approach by removing the SERs directly adjacent to the validation SER, so the training data set would usually be 80-90 SERs, and the validation a single SER, buffered by its neighbouring SERs.
In both cases, the authors could assess the quality of predictions against both simple model averaging (SMA) and the individual models.
I generally would also suggest moving the performance evaluation (5.5) to the front of the Results section, i.e. to make it 5.1. It is super helpful for readers to understand the quality of the BMA as well as the quality of each individual model before delving into the details of the BMA weights. In addition, I would like to see a figure of paired M1-M5 canopy height values for all NFI plots, i.e. scatter plots of height predictions from M1 against M2, M1 against M3, etc., and a table with their respective correlations. It would also be great to see a supplementary figure that shows the overall height maps across France from model M1-M5 as well as the consensus product, and areas around a few sample NFI plots, with predictions from M1-M5 at small spatial scale (locations do not have to be exact, of course, just to get a visual impression).
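For the inter-model agreement, something as simple as the following would already be informative (a sketch with placeholder data; the real per-plot predictions of M1-M5 would be substituted):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Placeholder for the real (n_plots, 5) array of M1-M5 height predictions at the NFI plots.
F = rng.normal(20.0, 5.0, size=(5475, 5))

F_df = pd.DataFrame(F, columns=[f"M{k}" for k in range(1, 6)])
print(F_df.corr(method="pearson").round(2))                    # 5 x 5 inter-model correlation table
pd.plotting.scatter_matrix(F_df, alpha=0.2, figsize=(10, 10))  # paired M1-M5 scatter plots
```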
Generalizability/Limitations of the study
The authors make quite a few sweeping statements that I believe are not warranted by the study.
The most important limitation of the study is that it assesses the canopy height models only in France. I appreciate that this, by itself, is an enormous effort, and as stated above, I think it is super important, but the authors need to recognize the limitations of this approach much more clearly. Forests in France may be (with emphasis on “may”) representative of temperate forests in Europe, and include a few challenging areas in terms of topography (Pyrenees, Alps, Corsica), but their species composition and structural complexity is nowhere near representative of forest canopies globally. A statement such as “Thus, we pinpoint (high) mountainous regions as the primary challenge for ongoing model advancements” is not warranted. It may be the primary challenge in Europe, but even this is not 100% clear to me. I have not conducted a systematic study of canopy height products and urge the authors to correct me where I err, but when I informally compared 2-3 of the global height products to ALS-derived heights (mean and max) at tropical sites that I know well, the predictions were sometimes catastrophically bad. There was evidence of strong saturation of predicted heights around 30-40 m in some products, while other products were highly impacted by cloud cover and processing artefacts (tiling). If I had to guess the biggest challenges in developing global canopy height models, I would say they are (1) the accurate representation of tall forest canopies (lots of biomass), (2) the accurate representation of tropical forests (also lots of biomass), and (3) the accurate representation of mountainous areas, with a combination of the three (tall tropical forests on slopes) likely the gold standard to evaluate what models can or cannot predict. The study at hand can only really show (3).
These caveats also extend to assessments of global canopy height models elsewhere. The authors describe the existing model performances as “impressive” (51), and they also describe the 3 m resolution of one of the canopy height models as “impressive” (110), but they do not provide evidence for either claim. The fact that models can be produced at these impressive scales does not mean that they are good models. Surely, there will be important improvements over the next few years, and I am excited about these, but in the current state, all global models come with substantial flaws. In particular, the resolutions given by models often seem to be only “nominal”. I.e. most 10 m resolution canopy height models (or lower) are so highly smoothed/averaged towards the mean that I doubt that they can resolve structures below 50-100 m resolution accurately. So 50-100 m is probably a more accurate description of most models’ actual resolution. I am sympathetic that this is part of modelling and I do not expect these global efforts to provide excellent local predictions, but apart from the sheer amount of computing and data involved, “impressive” goes too far. Just as an example, in France, the article calculates the RMSEs of models as lying between 4 and 9 m, depending on metric and model (Figure 5), which probably corresponds to relative RMSEs of ca. 30-60%. This is massive uncertainty in probably one of the better-modelled systems.
Presentation/Writing style
Finally, the presentation/writing style of the article is a bit involved and the syntax structure often makes it difficult to read. Where possible I have tried to give suggestions to simplify sentences or adjust phrases (e.g., not using “the metropolitan France”, but “metropolitan France”, or maybe just France?), but I would urge the authors to go through the article again and to try to simplify some of the more complex sentence structures. The article may also profit from being proofread by a native speaker.
Detailed comments
Title: just “forest canopy height” instead of “the forest canopy height”? And “might they be even more so if combined?” [I think you need a “more so” in English]
2-3: The sentence is a bit long and sometimes repetitive (e.g., “can” and “potential” mean the same thing). Here’s a suggestion for a slight simplification: “This has resulted in the availability of multiple, sometimes conflicting, sources of information, which can lead to confusion, but also makes it possible to learn about forest attributes through the joint interpretation of multiple models.”
9: just “forest height” instead of “the forest height”?
10-11: The “In this contribution” sentence seems a bit difficult to understand. Do you mean: “In this contribution, we evaluate models with respect to their probabilities of correctly predicting measurements at NFI plots.”?
14: What is “dominant prediction”?
15: “allowing us to come to understand” could be shortened
16: “systematically appear to be statistically significant factors” is unclear to me, systematic and statistically significant are relatively clear statements, but then “appear” qualifies this again; reformulate?
42: I would remove “while having a limited lifespan”; essentially having a short lifespan explains why GEDI has low sampling density and few recurring shots that overlap with each other
43-46: Please rephrase, it’s unclear what “the particular acquisition configurations” refers to
47: what is a “forest stand factor”?
51: “showing often impressive performances in constructing links”; as outlined above, I fundamentally disagree with this statement. What would be impressive performance? Please provide evidence for this. Personally, I have not been that much impressed by the AI-inferred canopy height models I have seen. They usually do reasonably well in open systems and they seem to also predict some shorter-statured forests in the temperate zone well, particularly in areas where they were calibrated with high-quality ALS data or where topography is not too challenging and GEDI-based inferences are more reliable. But even in areas where they get into the right ballpark of mean canopy height, they seem to homogenize forest structure a lot, overstate their resolution (a target resolution of 1 m or 10 m is very different from a 1 m or 10 m resolution in ALS assessments, for example, and usually corresponds more to 100 m resolution). Most importantly, I have seen them repeatedly fail to provide acceptable predictions in tall and heterogeneous canopies or under cloud cover, i.e. the tropics. This is an introduction, so I don’t expect a full discussion of these issues here and models will undoubtedly improve in the future, but to call the existing models “impressive” is overstating their performance.
52: “not faultless”; this goes back to my statement above; I think this is a severe understatement. The models are currently quite faulty and error prone and this is why articles such as this one are so important!
52-60: I think some of these things could be stated much more succinctly. E.g.: There is a basic question of whether remote sensing data captures all necessary information (answer is: no). Second, some remote sensing data come with huge uncertainties and biases. Third, there are more general modelling issues.
74: This needs to be discussed later, but testing the models in France gives overly optimistic assessments of them, because a) models have likely had better training data in Europe than, for example, in the tropics (fewer clouds, better ALS data, topography is more easily visible in less dense forests), b) the range of forest types is heavily restricted.
75-76: I like the idea of using field data to verify remote sensing predictions and it does allow you to calculate quantities otherwise not available (Lorey’s height). But I was wondering: why did you not also use the IGN’s lidar data as a second, independent validation data set? This is not a major issue, but it would seem to make sense to pair the NFI plots with these data, no?
100: For me, all these resolutions are always a bit tricky to interpret. Most global models look much blurrier than an ALS-derived estimate at the same resolution. E.g., I recently compared the Lang et al. 2023 model to ALS-derived canopy height estimates in Italy, and the effective resolution of the global product seemed much lower than 10 m. Similarly, Meta’s 1 m canopy height model only nominally has a resolution of 1 m, but often looks like 100 m resolution. Could you include somewhere a statement that the resolutions that are given are “nominal resolutions”?
110: again I would very much argue against “impressive”. As stated above, I have strong doubts about claimed or “nominal” resolution vs. actual or “effective” resolution.
117: “covering metropolitan France”, no “the” needed
116: What’s “Linear Forest Regression”?
130-151: As stated above, I really like this approach, and it’s great that you evaluate different height types, especially Lorey’s height.
152-221: One overall question: how well does this approach/BMA deal with highly correlated input layers? E.g., if two of the tested models provide similar predictions across France, how do weights get assigned? Will the weights get assigned more or less randomly, and both come out as similar, or could it happen (as with highly correlated predictors) that one model artificially gets a much higher weight than the other model by accident? How could you test for this?
156: yes, I like this! But you should also mention somewhere that there is also the opposite problem. If a few models are particularly bad/biased/noisy, then this might get propagated in your analysis, no?
157: You need to explain a bit more what BMA is. I have worked with Bayesian models, but never done model averaging, and I don’t understand what deterministic vs. non-deterministic BMA would be (how do you optimize model parameters of already calibrated models?) and how you would even apply that in this context. Would you have to rerun the AI models?
173: From a practical point of view, I understand the weighting by models. But I don’t understand the theoretical underpinnings. Is it correct to call Pr(Mk | H) the posterior probability of model Mk and sum to 1 across the 5 tested models? I can easily imagine a situation where all 5 models that you test are actually highly improbable models, and that there are much better models out there that we have not found yet, so the existing probabilities should not really sum to 1? Of course, the relative probabilities do not change, and that’s what’s important for your aim, so this is mostly a conceptual/terminological question, but it would be interesting to get your take on this.
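For reference, the textbook identity I have in mind, which is what forces these posterior probabilities to sum to one over the candidate set (the implicit assumption being that the "true" model is among the K considered):

$$
p(h \mid H) \;=\; \sum_{k=1}^{K} \Pr(M_k \mid H)\, p(h \mid M_k, H),
\qquad \sum_{k=1}^{K} \Pr(M_k \mid H) \;=\; 1 .
$$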
175: How justified is this assumption? Do you have any evidence for this?
185-186: Yes, but are you fitting an extra model, and introducing new parameters (the weights), so creating a more complex model, no?
206: Could you explain why z_ki is treated as “missing data”?
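My reading of the standard formulation (to be confirmed by the authors): z_ki is a latent indicator of which model is treated as having generated observation i; since it is never observed, EM handles it as missing data and replaces it in the E-step by its conditional expectation, the responsibility

$$
\hat{z}_{ki} \;=\; \frac{w_k\, g\!\left(h_i \mid f_{ki}, \sigma^{2}\right)}
{\sum_{l=1}^{K} w_l\, g\!\left(h_i \mid f_{li}, \sigma^{2}\right)},
$$

where g denotes the assumed (e.g. Gaussian) conditional density, f_ki the prediction of model k at plot i, and w_k the current weight. If the paper uses a different convention, it should be spelled out.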
213-214: The authors need to explain this
222: As part of the Results/Discussion, I would definitely like to see one paragraph/figure where you just compare the predictions of the different models. I think all readers would like to know how much the 5 different models actually disagree/agree with each other in predicting canopy height. This could take the form of a simple correlation plot between predicted canopy heights for each pair of models. It would also be ok to put this figure into the supplementary. In addition, I would like to see a whole map of France, with height predicted from M1-M5, and a few sample locations across France (mountain/non-mountain, etc.) in the Supplementary, where the predictions of M1-M5 are shown. This would greatly help the readers understand the relative strengths and weaknesses of the models. Ideally, as pointed out earlier, this would be juxtaposed with CHMs derived from IGN’s ALS scans, but this is more of a bonus, and I won’t insist on that.
224: Since I don’t fully grasp the weighting approach, are the overall weights somehow tied to the local weights and provide some form of regularization (e.g., as in a hierarchical Bayesian model)?
232-257: Overall, I like this paragraph, and it is an interesting comparison!
235: “relatively significantly” is odd. Please rephrase. What do you mean? Also, when looking at the graphs, the deviations do not seem so large. With the exception of M5, all model weights generally seem to fall between 0.15 and 0.25, so relatively close to 0.2. None of the models is clearly worse than the others, none of the models is clearly better.
236: Simple Model Averaging (SMA) has not been properly introduced, and needs its own short section. Did you take the median or average?
Figure 2: Could you potentially just leave the bar “WITHOUT imputations” blank, i.e., no internal black lines? I think it would be visually easier to understand the difference, at first sight I did not recognize it.
249-254: I agree with the overall statement, and think it’s important to clearly define what canopy height models are actually predicting. But my impression is that this is slightly overstating the results. Yes, there are some differences, and M5 seems to do slightly better at predicting dominant and Lorey’s height, but we are still talking about a weight of 0.25-0.27, compared to other models’ weights of 0.15-0.2. For me, this is much less clear than what you describe. I further have doubts about biomass estimates relying on several height descriptors at the same time. Evidently, these different height estimates will covary between them, and using highly covarying/correlated predictors for biomass models is probably not going to improve model predictions massively.
255-258: I don’t understand – there is an impact of imputation, but it’s unclear to me whether this is an improvement or not. It may just be that the imputation routine shares some similarities in predicting to new data with the AI models, so is this a real effect? This may become clearer under spatial cross validation though!
260: “metropolitan France”, not “the metropolitan France”.
260-262: I don’t understand what is being said here? Why is the homogeneity of the 86 regions important for the averaging?
Figure 3: I know that this figure is already quite complex, but is there a way of visualizing or describing the average NFI-based metric per local region somewhere, i.e. add a 7th column where the average dominant height, average mean height, average maximum height and average Lorey’s height is mapped? Because from a biomass perspective, we clearly care most about model performance in areas with tall/complex canopies, and I suspect that this could shift the evaluation of model weights a bit.
268-270: Yes, that’s a great point! You could also mention Corsica here, where the Potapov model also seems to do better for these variables. I don’t understand the Landsat argument. You mean that coarser resolution approaches average out local topographic errors? Also, maybe rephrase “disturbing effects”.
288-290: I don’t fully understand the meaning of “within variance”. Could you make clearer what it means maybe already in the methods section, but also here. Does it mean how much model prediction quality/weights vary across sites/regions within the same model?
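My guess is that it refers to the usual mixture-variance split (my paraphrase, not necessarily the authors' exact definition):

$$
\operatorname{Var}(h \mid H)
\;=\; \underbrace{\sum_{k} w_k\, \sigma_k^{2}}_{\text{within-model}}
\;+\; \underbrace{\sum_{k} w_k \Bigl(\mu_k - \sum_{l} w_l \mu_l\Bigr)^{2}}_{\text{between-model}},
$$

with mu_k and sigma_k^2 the mean and variance of model k's predictive distribution at a given location: within-model variance would then be each model's own predictive spread, and between-model variance the scatter of the models' point predictions around the weighted consensus. If that is the intended meaning, please state it explicitly.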
291-296: This is important, because a lot of well-preserved forest area worldwide is usually located in mountainous terrain due to accessibility issues. It’s an issue if models are performing worst in these areas.
297-298: I don’t understand this point, or, if I understand it, I don’t agree. If within-model variance is larger than between-model variance, it probably just means that the models are all pretty similar in their predictions, no? That does not mean that the predictions are reliable per se. If we evaluate 5 models that are all using more or less the same input data and comparable extrapolation approaches, they could all be similarly bad at predicting something and would then have a very low between-model variance. For example, 4 out of 5 models here use GEDI shots as input, so I would already expect a lot of homogeneity in predictions just from that.
316-319: Yes and no! I fully agree with your assessment that mountainous regions are an important, and one of the primary challenges for model advancements. In Europe they may well be the “primary challenge”, but this needs to be qualified. This study evaluates models only in France and thus does not represent the variety of global forest types. I strongly suspect that the “primary challenge for ongoing model advancements” are actually tall canopies, in particular in the tropics, where cloud cover and forest structural complexity make predictions much more volatile. I would then see predictions in mountainous terrain as a strong “secondary challenge” globally. The absolute gold standard for model evaluation would be tall tropical forests with strong topographic gradients.
337-340: I don’t think it’s accurate to state that “classes” cause variations in the models. They are linked to these variations, or are predictors of them, but what causes the variations is more tricky to say. Most likely specific tree species occur in specific environments (e.g., high altitude) which are trickier to predict.
341-344: I was surprised by this. So this would indicate that low forest canopies are worse-predicted.
358: In my opinion, this paragraph should be the first of the results paragraphs, as it evaluates whether the mixture model is actually any good at modelling.
359: Broadly, the authors are calibrating a higher-order model that assigns weights to different underlying models. This approach is prone to the same modelling issues as any model, e.g., overfitting the data, and needs to be a) better described – I currently have no information as to how R2, MBE and RMSE have been calculated, and b) it needs to be done with a formal cross validation strategy. This should take the form of a spatial cross validation, as suggested in Ploton et al. 2020. One way could be that the authors select 1000 NFI plots, remove all other NFI plots within a 100 km radius, and then predict the left-out plot’s height structure from the mixture model calibrated on the remaining NFI plots. Another way could be that the authors divide France into spatial folds, e.g. the 96 sylvo-ecological regions, and predict each region from the other 95. Ideally, here I would also leave out all sylvo-ecological regions that are directly adjacent to the validation region to account for spatial autocorrelation. The resulting MBE/RMSE could then be compared against the original model performance, as well as SMA, for example.
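For completeness, the definitions I would assume for these scores (to be confirmed in the revised methods), with h_i the NFI reference heights, \hat{h}_i the (ideally cross-validated) predictions and n the number of plots:

$$
\mathrm{MBE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{h}_i - h_i\bigr), \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{h}_i - h_i\bigr)^{2}}, \qquad
R^{2} = 1 - \frac{\sum_i \bigl(h_i - \hat{h}_i\bigr)^{2}}{\sum_i \bigl(h_i - \bar{h}\bigr)^{2}} .
$$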
367: Is this actually possible? In my (very coarse) understanding, there is often a bias-variance tradeoff in modelling, so that, beyond a certain point, reducing one often comes at the expense of increasing the other.
375: Without proper spatial cross validation I would not believe this “super” model yet.
380: How accurate is it to actually call this Bayesian model averaging, if it’s only “Bayesian flavoured”?
386: cf. my objections above. Mountainous regions are a “real challenge”, and probably the most important one in France (and likely most of Europe), but that does not mean it’s true worldwide, and this needs to be acknowledged here!
391-393: “more so if combined”; epistemologically, I am not sure whether I fully agree with this statement. Yes, combining models can make them more useful, but you could also argue that the authors are essentially fitting a much more complex model that picks the best prediction at every location, so we are gaining predictive power in return for a lot of new parameters. But to trust the results, we need proper spatial cross validation.
Figure 5: I would like to see relative RMSE here as well.
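By relative RMSE I mean the RMSE normalised by the mean observed height, e.g.

$$
\mathrm{rRMSE} = \frac{\mathrm{RMSE}}{\bar{h}_{\mathrm{obs}}} \times 100\,\% ,
$$

which would make the error levels of the different height definitions directly comparable.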
Citation: https://doi.org/10.5194/gmd-2024-95-RC2
AC1: 'Comment on gmd-2024-95', Nikola Besic, 09 Oct 2024