<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing with OASIS Tables v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpub-oasis3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://docs.oasis-open.org/ns/oasis-exchange/table" xml:lang="en" dtd-version="3.0" article-type="research-article">
  <front>
    <journal-meta><journal-id journal-id-type="publisher">GMD</journal-id><journal-title-group>
    <journal-title>Geoscientific Model Development</journal-title>
    <abbrev-journal-title abbrev-type="publisher">GMD</abbrev-journal-title><abbrev-journal-title abbrev-type="nlm-ta">Geosci. Model Dev.</abbrev-journal-title>
  </journal-title-group><issn pub-type="epub">1991-9603</issn><publisher>
    <publisher-name>Copernicus Publications</publisher-name>
    <publisher-loc>Göttingen, Germany</publisher-loc>
  </publisher></journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5194/gmd-19-5765-2026</article-id><title-group><article-title>TOAR-classifier v2: a data-driven classification tool for  global air quality stations</article-title><alt-title>ML based station classification</alt-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="yes" rid="aff1 aff4">
          <name><surname>Mache</surname><given-names>Ramiyou Karim</given-names></name>
          <email>rkarimmache@gmail.com</email>
        <ext-link>https://orcid.org/0000-0002-0190-3311</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff2">
          <name><surname>Schröder</surname><given-names>Sabine</given-names></name>
          
        <ext-link>https://orcid.org/0000-0002-0309-8010</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1 aff4">
          <name><surname>Langguth</surname><given-names>Michael</given-names></name>
          
        <ext-link>https://orcid.org/0000-0003-3354-5333</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff2">
          <name><surname>Patnala</surname><given-names>Ankit</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff2 aff3">
          <name><surname>Schultz</surname><given-names>Martin G.</given-names></name>
          
        <ext-link>https://orcid.org/0000-0003-3455-774X</ext-link></contrib>
        <aff id="aff1"><label>1</label><institution>independent researcher</institution>
        </aff>
        <aff id="aff2"><label>2</label><institution>Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425 Jülich, Germany</institution>
        </aff>
        <aff id="aff3"><label>3</label><institution>Department of Mathematics and Computer Science, University of Cologne, Cologne, Germany</institution>
        </aff>
        <aff id="aff4"><label>a</label><institution>formerly at: Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425 Jülich, Germany</institution>
        </aff>
      </contrib-group>
      <author-notes><corresp id="corr1">Ramiyou Karim Mache (rkarimmache@gmail.com)</corresp></author-notes><pub-date><day>1</day><month>July</month><year>2026</year></pub-date>
      
      <volume>19</volume>
      <issue>12</issue>
      <fpage>5765</fpage><lpage>5779</lpage>
      <history>
        <date date-type="received"><day>24</day><month>March</month><year>2025</year></date>
           <date date-type="rev-request"><day>4</day><month>April</month><year>2025</year></date>
           <date date-type="rev-recd"><day>2</day><month>October</month><year>2025</year></date>
           <date date-type="accepted"><day>17</day><month>October</month><year>2025</year></date>
      </history>
      <permissions>
        <copyright-statement>Copyright: © 2026 Ramiyou Karim Mache et al.</copyright-statement>
        <copyright-year>2026</copyright-year>
      <license license-type="open-access"><license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p></license></permissions><self-uri xlink:href="https://gmd.copernicus.org/articles/19/5765/2026/gmd-19-5765-2026.html">This article is available from https://gmd.copernicus.org/articles/19/5765/2026/gmd-19-5765-2026.html</self-uri><self-uri xlink:href="https://gmd.copernicus.org/articles/19/5765/2026/gmd-19-5765-2026.pdf">The full text article is available as a PDF file from https://gmd.copernicus.org/articles/19/5765/2026/gmd-19-5765-2026.pdf</self-uri>
      <abstract><title>Abstract</title>

      <p id="d2e140">Accurate characterization of station locations is crucial for reliable air quality assessments such as the Tropospheric Ozone Assessment Report (TOAR). While urban and rural areas are relatively well-defined, the boundaries and identity of suburban areas remain ambiguous, overlapping with both urban and rural zones and varying due to cultural and social factors. This study investigates a machine learning approach to classify 24 348 stations in the unique global TOAR database as urban, suburban, or rural. We tested two different approaches: unsupervised <inline-formula><mml:math id="M1" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering with three clusters, and an ensemble of supervised learning classifiers including random forest, CatBoost, and LightGBM. We  integrate these classifiers into a robust voting model, leveraging their collective predictive power. To address the inherent ambiguity of suburban areas, we implement a grid-search adjusted threshold probability technique. Our models, trained on the TOAR station metadata, are evaluated on 1979 unseen data points. <inline-formula><mml:math id="M2" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering achieves 71.88 % and 87.67 % accuracy for urban and rural areas respectively, but only 15.84 % for suburban zones. The supervised classifiers surpass this performance, reaching over 84 % accuracy for urban and rural categories, and 66 %–72 % for suburban areas. The adjusted threshold technique significantly enhances overall model accuracy, particularly for suburban classification. The good separation of our model is confirmed through evaluation with NO<sub><italic>x</italic></sub> and PM<sub>2.5</sub> concentration measurements, which were not included in the training data. 
Furthermore, manual inspection of 30 randomly selected sites with Google Maps indicates that our method labels the station type more accurately than the labels that were reported by data providers and used in the model evaluation. The objective station classification proposed in this paper therefore provides a robust foundation for type-of-area-specific air quality assessments in TOAR and elsewhere.</p>
  </abstract>
    
<funding-group>
<award-group id="gs1">
<funding-source>European Research Council</funding-source>
<award-id>787576</award-id>
</award-group>
</funding-group>
</article-meta>
  </front>
<body>
      

<sec id="Ch1.S1" sec-type="intro">
  <label>1</label><title>Introduction</title>
      <p id="d2e184">Ozone in the troposphere plays a crucial role in human and environmental health <xref ref-type="bibr" rid="bib1.bibx32 bib1.bibx15" id="paren.1"/>. As a significant atmospheric pollutant and greenhouse gas, ozone profoundly impacts air quality and contributes to the dynamics of climate change  <xref ref-type="bibr" rid="bib1.bibx24 bib1.bibx28" id="paren.2"/>. Accurate ozone monitoring data is essential for shaping public health policies and ecological regulations. Modern data infrastructures can be used to provide atmospheric scientists with the necessary metrics to quantify ozone's impact on climate, human health, and vegetation <xref ref-type="bibr" rid="bib1.bibx13 bib1.bibx11 bib1.bibx26 bib1.bibx43 bib1.bibx10 bib1.bibx27 bib1.bibx36" id="paren.3"/>. In 2021, the International Global Atmospheric Chemistry project (IGAC) launched the second phase of the Tropospheric Ozone Assessment Report (TOAR-II) to undertake a comprehensive review of the global distribution and trends of tropospheric ozone. A key accomplishment of TOAR-II is the development of a new terabyte-scale relational database of surface ozone observations and related variables. This database includes hourly measurement data and enriched metadata from 1970 to 2023, collating information from over 20 000 measurement sites worldwide through collaboration among multiple data centers and individual researchers (<uri>https://igacproject.org/activities/TOAR/TOAR-II</uri>, last access: 24 June 2026). The new TOAR-II database replaces and extends the first TOAR database that has been described in <xref ref-type="bibr" rid="bib1.bibx37" id="text.4"/>. Ozone levels exhibit significant regional variations and distinct patterns across different pollution environments. 
For example, urban environments with large ozone precursor emissions can exhibit “zero ozone” (i.e., ozone at sub-nmol fractions) situations and very large variability, while concentrations in rural areas tend to be smoother <xref ref-type="bibr" rid="bib1.bibx46 bib1.bibx38" id="paren.5"/>. To accurately assess the ozone situation at individual locations and interpret ozone trends across the globe, it is therefore important to characterize measurement sites in a globally consistent and objective manner. While many measurement networks provide information about the station location or “type of station” and “type of station area”, this metadata information is inconsistent between regions and error-prone as it involves some subjective judgement <xref ref-type="bibr" rid="bib1.bibx42" id="paren.6"><named-content content-type="pre">cf.,</named-content></xref>.</p>
      <p id="d2e211">The first TOAR assessment <xref ref-type="bibr" rid="bib1.bibx37" id="paren.7"/> pioneered a globally uniform way to classify stations based on a set of Earth Observation (EO) datasets that were processed at the station locations. This method used manually selected threshold values of station altitude, population density, nighttime light intensity, NO<sub>2</sub> column density, and NO<sub><italic>x</italic></sub> emissions to characterize stations as urban or rural. While this approach provided a useful distinction between “clearly urban” and “clearly rural” sites, it fell short of classifying all sites: almost half of the stations remained unclassified (see Fig. <xref ref-type="fig" rid="F1"/>). Furthermore, the method was criticized for lacking an objective definition of the threshold values. In particular, the boundaries and characteristics of suburban areas remain ambiguous. Suburban zones, typically located at the periphery of cities and noted for lower density and residential land use, often overlap with both urban and rural regions. Their definition is shaped by cultural, social, and psychological factors, resulting in varied interpretations and a lack of universal consensus <xref ref-type="bibr" rid="bib1.bibx17 bib1.bibx2" id="paren.8"/>. This emphasizes the need for an objective and automated station classification method. In this study, we propose a new machine learning (ML) approach to develop a more advanced and unbiased classifier using similar objective metadata from the TOAR-II database. Our primary objective is to create a machine learning model that classifies stations in the TOAR database as urban, suburban, or rural. We implement and compare two methodologies: unsupervised learning using <inline-formula><mml:math id="M7" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering, and supervised learning with random forest, CatBoost, and LightGBM classifiers. 
In supervised learning, a subset of station characteristics is known and used to train the classifiers which are then used to predict the class of unlabeled  stations. The supervised models are evaluated individually and after applying a robust voting method. Furthermore, an adjusted threshold technique is applied to enhance the identification of suburban stations. The remainder of this paper is organized as follows: Sect. 2 describes the data and methods used, Sect. 3 presents the results and discussion, and a general conclusion wraps up the paper.</p>

      <fig id="F1" specific-use="star"><label>Figure 1</label><caption><p id="d2e250">Distribution of unlabeled, training, and test data used for the supervised ML models as described in the text.</p></caption>
        <graphic xlink:href="https://gmd.copernicus.org/articles/19/5765/2026/gmd-19-5765-2026-f01.png"/>

      </fig>

</sec>
<sec id="Ch1.S2">
  <label>2</label><title>Data and methods</title>
      <p id="d2e267">This section provides an overview of our data sources, the machine learning methods used, and the evaluation metrics. We begin by introducing the TOAR-II database and the station metadata that are used as inputs to the ML models. We then detail our data preparation process, including preprocessing, feature engineering, and feature selection, all critical steps in preparing data for ML models. Finally, we present a concise summary of the ML models employed in this study and the evaluation metrics used to assess their performance.</p>
<sec id="Ch1.S2.SS1">
  <label>2.1</label><title>TOAR-II database and station metadata</title>
      <p id="d2e277">Developed in the context of TOAR phase II, the TOAR-II database stands as one of the world's largest collections of near-surface ozone measurements and related information. The database can be accessed through web services which provide a comprehensive suite of ozone-related data products including standard statistics, health and vegetation impact metrics, and trend information (<uri>https://toar-data.fz-juelich.de/</uri>, last access: 24 June 2026). The TOAR-II database includes extensive information describing the locations of air quality measurement stations based on pollution-relevant properties. These properties are extracted from EO data and stored as station metadata in the database. This metadata offers contextual information about the measurement site, enabling station location characterization. Table <xref ref-type="table" rid="T1"/> below summarizes all metadata used in this study including references to the data origin.</p>

<table-wrap id="T1" specific-use="star"><label>Table 1</label><caption><p id="d2e288">Station metadata in the TOAR database used in this work.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="3">
     <oasis:colspec colnum="1" colname="col1" align="justify" colwidth="6.8cm"/>
     <oasis:colspec colnum="2" colname="col2" align="justify" colwidth="8.5cm"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">Variable name</oasis:entry>
         <oasis:entry colname="col2" align="left">Description</oasis:entry>
         <oasis:entry colname="col3">Type</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">lon, lat</oasis:entry>
         <oasis:entry colname="col2" align="left">Longitude and latitude, the geographical coordinates of the station. These coordinates are not used as predictors in the machine learning models</oasis:entry>
         <oasis:entry colname="col3">Numeric</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">area_code</oasis:entry>
         <oasis:entry colname="col2" align="left">Unique code of the station in TOAR database</oasis:entry>
         <oasis:entry colname="col3">String</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">altitude</oasis:entry>
         <oasis:entry colname="col2" align="left">altitude of the station location in meter (m)</oasis:entry>
         <oasis:entry colname="col3">Numeric</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">mean_topography_srtm_alt_90m_year1994   mean_topography_srtm_alt_1km_year1994</oasis:entry>
         <oasis:entry colname="col2" align="left">mean value within 90 m and 1 km of relative altitude of the year 1994.   Data source: NASA Shuttle Radar Topographic Mission (SRTM) <xref ref-type="bibr" rid="bib1.bibx18" id="paren.9"/></oasis:entry>
         <oasis:entry colname="col3">Numeric</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">max_topography_srtm_relative_alt_5km_year1994   min_topography_srtm_relative_alt_5km_year1994   stddev_topography_srtm_relative_alt_5km_year1994</oasis:entry>
         <oasis:entry colname="col2" align="left">maximum, minimum, and standard deviation of the  relative altitude within a radius of 5 km around the station in 1994.   Data source: NASA Shuttle Radar Topographic Mission (SRTM) <xref ref-type="bibr" rid="bib1.bibx18" id="paren.10"/></oasis:entry>
         <oasis:entry colname="col3">Numeric</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">climatic_zone_year2016</oasis:entry>
         <oasis:entry colname="col2" align="left">climate zone of the year 2016. Provides information about climatic conditions  at a location including whether it tends to be hot or cold, humid or dry,   or exhibits a tropical climate.   Data source:  University of East Anglia Climatic Research Unit <xref ref-type="bibr" rid="bib1.bibx16" id="paren.11"/></oasis:entry>
         <oasis:entry colname="col3">Category</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">mean_stable_nightlights_1km_year2013   mean_stable_nightlights_5km_year2013   max_stable_nightlights_25km_year2013  max_stable_nightlights_25km_year1992</oasis:entry>
         <oasis:entry colname="col2" align="left">average and maximum nighttime light value of the years 1992, and 2013  in 1, 5, and 25 km around the station location. The values in this data set  represent a  brightness index ranging from 0 to 63.  Data source: NOAA National Centers for Environmental Information (NCEI) <xref ref-type="bibr" rid="bib1.bibx21" id="paren.12"/></oasis:entry>
         <oasis:entry colname="col3">Numeric</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">mean_population_density_250m_year2015       mean_population_density_5km_year2015   max_population_density_25km_year2015  mean_population_density_250m_year1990  mean_population_density_5km_year1990   max_population_density_25km_year1990</oasis:entry>
         <oasis:entry colname="col2" align="left">Average and maximum population density of the years 1990, and 2015  in 250 m, 5 km, and 25 km radius around the station location.   Data source: The European Commission, Joint Research Centre, <xref ref-type="bibr" rid="bib1.bibx12" id="paren.13"/></oasis:entry>
         <oasis:entry colname="col3">Numeric</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">mean_nox_emissions_10km_year2015   mean_nox_emissions_10km_year2000</oasis:entry>
         <oasis:entry colname="col2" align="left">Average annual NO<sub><italic>x</italic></sub> emission of the years  2000 and 2015 in a 10 km radius around  the station location.  Data source: Copernicus Atmosphere Monitoring Service <xref ref-type="bibr" rid="bib1.bibx14" id="paren.14"/></oasis:entry>
         <oasis:entry colname="col3">Numeric</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">timezone</oasis:entry>
         <oasis:entry colname="col2" align="left">Geographical area, such as Africa, America, Europe, etc.</oasis:entry>
         <oasis:entry colname="col3">Category</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left">type_of_area (target)</oasis:entry>
         <oasis:entry colname="col2" align="left">Characterization of station location (urban, suburban, rural, or unknown) reported by the data providers of the TOAR database. This variable is not used in <inline-formula><mml:math id="M9" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering. The known part, i.e., stations labelled as urban, suburban, or rural, is used to train the supervised classifiers.</oasis:entry>
         <oasis:entry colname="col3">Category</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <p id="d2e517">Some metadata, such as station coordinates, are provided by many air quality agencies and scientific institutions that contribute to the TOAR database. The other metadata elements listed in Table <xref ref-type="table" rid="T1"/> stem from  EO datasets, which were downloaded from the respective provider sites. A special web service called Geospatial point extraction and aggregation service (GeoPEAS) has been developed to compute the aggregate information from the original gridded products. More information about the EO datasets used in GeoPEAS can be found on <uri>https://toar-data.fz-juelich.de/api/v2/#stationmeta</uri> (last access: 24 June 2026). After the metadata extraction with GeoPEAS, all metadata are available as lists of key-value pairs with the keys corresponding to the variable names in Table <xref ref-type="table" rid="T1"/>. For further processing, this data was collected into one table with the keys as data columns and the individual stations as rows.</p>
</sec>
<sec id="Ch1.S2.SS2">
  <label>2.2</label><title>Data preprocessing and feature selection</title>
      <p id="d2e535">The first step in our data preprocessing pipeline consists of cleaning the dataset. This involves removing all duplicate data points, replacing values of <inline-formula><mml:math id="M10" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">999.0</mml:mn></mml:mrow></mml:math></inline-formula> with NaN to denote missing values, and eliminating rows where all metadata information is missing. To ensure consistency, we also remove rows with physically implausible values, such as negative population density or negative maximum stable nighttime lights. In total, this step eliminates 211 of the 24 348 stations. Missing altitude values are filled with mean_topography_srtm_alt_90m_year1994. Other missing numerical values are estimated using the regression-based iterative imputer <xref ref-type="bibr" rid="bib1.bibx35" id="paren.15"/>, and missing categorical values are filled with the most frequent category. Categorical variables are encoded using OneHotEncoder from scikit-learn <xref ref-type="bibr" rid="bib1.bibx30" id="paren.16"/>.</p>
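      <p>The cleaning steps described above can be sketched in a few lines of Python. This is a minimal, library-free illustration and not the code used in this study: the function <monospace>clean</monospace>, the helper <monospace>is_missing</monospace>, and the choice of fields checked for negative values are assumptions made for the example. The regression-based imputation and the one-hot encoding are not shown.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

SENTINEL = -999.0  # missing-value code mentioned in the text
SRTM_ALT = "mean_topography_srtm_alt_90m_year1994"
# Assumption: fields that must not be negative (illustrative selection).
NON_NEGATIVE = (
    "mean_population_density_5km_year2015",
    "max_stable_nightlights_25km_year2013",
)


def is_missing(value):
    """True for None and NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean(stations):
    """Clean a list of station metadata dicts (hypothetical helper)."""
    seen, cleaned = set(), []
    for station in stations:
        # Drop exact duplicates.
        key = tuple(sorted(station.items()))
        if key in seen:
            continue
        seen.add(key)
        # Replace the sentinel value with NaN.
        row = {k: (math.nan if v == SENTINEL else v) for k, v in station.items()}
        # Drop rows in which every metadata field is missing.
        metadata = [v for k, v in row.items() if k != "area_code"]
        if all(is_missing(v) for v in metadata):
            continue
        # Drop rows with implausible negative values.
        if any(not is_missing(row.get(f)) and row[f] < 0 for f in NON_NEGATIVE):
            continue
        # Fill a missing altitude from the SRTM mean altitude.
        if is_missing(row.get("altitude")) and not is_missing(row.get(SRTM_ALT)):
            row["altitude"] = row[SRTM_ALT]
        cleaned.append(row)
    return cleaned
```
]]></preformat>
      <p>The order of the steps matters in this sketch: the sentinel is replaced before the negativity check, so that a missing population density is not mistaken for a negative one.</p>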
      <p id="d2e554">For <inline-formula><mml:math id="M11" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering, numerical features were scaled using standard scaling, and outliers were handled with IQR-based clipping within the range [Q1 <inline-formula><mml:math id="M12" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula> 1.5 <inline-formula><mml:math id="M13" display="inline"><mml:mo>⋅</mml:mo></mml:math></inline-formula> IQR, Q3 <inline-formula><mml:math id="M14" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula> 1.5 <inline-formula><mml:math id="M15" display="inline"><mml:mo>⋅</mml:mo></mml:math></inline-formula> IQR], where IQR (Interquartile Range) is the difference between the third quartile (Q3) and the first quartile (Q1). These preprocessing steps are essential to mitigate the algorithm's sensitivity to feature scale and outliers. For supervised learning algorithms, we applied a robust scaler <xref ref-type="bibr" rid="bib1.bibx30" id="paren.17"/> to the entire dataset, which proved more effective for this task compared to alternative scaling methods such as standard or min-max scaling. This scaling approach helps mitigate the impact of outliers and ensures consistent feature ranges. In the feature selection process, we prioritized variables containing the most recent available information, ensuring our model utilizes the most up-to-date data for classification. Notably, geographical coordinates (longitude-latitude) and station codes are excluded from the machine learning models to prevent overfitting to geolocation. Following preprocessing steps,  the final dataset comprises 24 125 stations. Of these, 13 225 are labeled as urban, suburban, or rural, while the remaining 10 900 are unlabeled. The distribution of the processed data is presented in Fig. <xref ref-type="fig" rid="F1"/>.</p>
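      <p>For a single numerical feature, the two scaling routes described above can be sketched as follows. This is an illustration using only the Python standard library, not the scikit-learn scalers used in this study. The function names are hypothetical, and the quartile convention (<monospace>method="inclusive"</monospace>) is an assumption.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3); the 'inclusive' convention is an assumption."""
    return statistics.quantiles(values, n=4, method="inclusive")


def iqr_clip(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (K-means route)."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def standard_scale(values):
    """Zero mean and unit variance (K-means route)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """Centre on the median and divide by the IQR (supervised route)."""
    q1, median, q3 = quartiles(values)
    return [(v - median) / (q3 - q1) for v in values]
```
]]></preformat>
      <p>Because the median and the IQR are hardly affected by a few extreme values, the robust variant keeps the bulk of the data on a comparable scale without clipping outliers.</p>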
      <p id="d2e598">For the <inline-formula><mml:math id="M16" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering analysis, we allocated 22 111 samples for model training and reserved 15 % of the labeled data (1979 samples) for testing. The supervised models were trained on a smaller subset because only 13 225 stations were explicitly classified as urban, suburban, or rural by the TOAR data providers. The remaining 10 900 stations lacked this specific classification, being reported as “unknown” or without a designated category. We used 85 % of the labeled data (11 211  samples) for training while the remaining 15 % is allocated for testing. The distribution of training and testing datasets is presented in Fig. <xref ref-type="fig" rid="F1"/>. As shown, the training dataset is imbalanced, with a higher number of urban stations compared to suburban and rural ones. However, the class imbalance is not severe. We address it by applying the Synthetic Minority Oversampling Technique (SMOTE) <xref ref-type="bibr" rid="bib1.bibx8" id="paren.18"/> to the training data before fitting the supervised classifiers. It is important to note that SMOTE was applied exclusively to the training set and not to the test data. The trained classifier is then used to predict the characteristics of unlabeled stations as illustrated in Fig. <xref ref-type="fig" rid="F2"/>.</p>
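      <p>The split-then-oversample order described above can be illustrated with a simplified sketch. The code below is not the SMOTE implementation used in this study. It only reproduces the core idea of interpolating between a minority sample and one of its nearest minority neighbours. The function names, the stratified form of the split, and the default parameters are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def stratified_split(labels, test_fraction=0.15, seed=0):
    """Return (train_idx, test_idx); stratification is an assumption here."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx += indices[:n_test]
        train_idx += indices[n_test:]
    return train_idx, test_idx


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by neighbour interpolation.

    minority: list of feature tuples from the TRAINING set only.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic
```
]]></preformat>
      <p>Synthetic points are generated from training samples only, so the test set contains nothing but observed stations.</p>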

      <fig id="F2" specific-use="star"><label>Figure 2</label><caption><p id="d2e618">Distribution of unlabeled, training, and test data used for the supervised ML models as described in the text.</p></caption>
          <graphic xlink:href="https://gmd.copernicus.org/articles/19/5765/2026/gmd-19-5765-2026-f02.png"/>

        </fig>

      <p id="d2e627">To ensure the reliability of our machine learning approach, we manually selected 35  stations with clear decision boundaries – that is, stations that are easy to classify as urban, suburban or rural, and excluded them from the training dataset. These stations were explicitly reported by data providers as urban, suburban, or rural, and we used Google Maps <xref ref-type="bibr" rid="bib1.bibx25" id="paren.19"/> to manually verify and label them. During this process, we observed discrepancies between the labels provided by the data providers and those derived from our manual Google Maps analysis. Our models will first be evaluated on these 35 stations, with accuracy calculated both against the labels reported by the data providers and against the manually verified labels from Google Maps (referred to as hand-labeled data).</p>
</sec>
<sec id="Ch1.S2.SS3">
  <label>2.3</label><title>Machine learning algorithms and evaluation methods</title>
      <p id="d2e641">This section is devoted to a concise overview of the machine learning techniques employed in this study. Additionally, we describe the evaluation metrics used to quantify the effectiveness and accuracy of our models.</p>
<sec id="Ch1.S2.SS3.SSS1">
  <label>2.3.1</label><title>Machine learning algorithms</title>
      <p id="d2e652"><list list-type="bullet">
              <list-item>

      <p id="d2e657"><italic>K-means clustering</italic> is a widely used unsupervised machine learning technique that partitions data into <inline-formula><mml:math id="M17" display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> distinct groups called clusters <xref ref-type="bibr" rid="bib1.bibx3 bib1.bibx40 bib1.bibx31" id="paren.20"/>. Each data point is assigned to the cluster with the nearest centroid. The algorithm seeks to minimize the within-cluster sum of squares (WCSS), i.e., the sum of squared distances between data points and their respective centroids. One key requirement of <inline-formula><mml:math id="M18" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means is that the number of clusters must be specified beforehand. To determine the appropriate number of clusters for <inline-formula><mml:math id="M19" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering, we employed the elbow method, a heuristic that plots the WCSS as a function of the number of clusters and identifies the “elbow” point on the curve, which corresponds to the optimal number of clusters <xref ref-type="bibr" rid="bib1.bibx20" id="paren.21"/>. In our <inline-formula><mml:math id="M20" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering application, we determine that three clusters are optimal (see Fig. <xref ref-type="fig" rid="F3"/>a), which aligns well with our objective of categorizing the stations into three groups: urban, suburban, and rural. However, visual inspection of correlation plots between individual metadata values (Fig. <xref ref-type="fig" rid="F2"/>) reveals rather fuzzy boundaries between clusters. 
This observation aligns with our expectations, given that urban, suburban, and rural locations vary significantly across countries in terms of industrial development, population density, and degree of urbanization <xref ref-type="bibr" rid="bib1.bibx45" id="paren.22"/>.</p>
              </list-item>
              <list-item>

      <p id="d2e707"><italic>Random Forest classifier</italic>  is a widely-used machine learning algorithm for classification tasks. As an ensemble learning method, it constructs multiple decision trees through bagging during training and outputs the class that is predicted by the majority vote of the individual trees <xref ref-type="bibr" rid="bib1.bibx5" id="paren.23"/>. Known for its robustness, it naturally resists overfitting through random feature selection and typically requires minimal tuning compared to other algorithms. In our implementation, we train the Random Forest classifier with 500 estimators, employing entropy as the optimization criterion. These specific hyperparameters were determined through a grid search. We utilize the RandomForestClassifier from the scikit-learn library <xref ref-type="bibr" rid="bib1.bibx29" id="paren.24"/>.</p>
              </list-item>
              <list-item>

      <p id="d2e721"><italic>The LightGBM (LGBM) classifier</italic> is a supervised machine learning algorithm that utilizes gradient boosting techniques and tree-based learning. It employs histogram-based algorithms and leaf-wise tree growth strategies, which contribute to accelerated training speeds and reduced memory consumption. LightGBM is particularly well-suited for handling large-scale datasets. Its lightweight architecture and optimized algorithm make it a popular choice for tasks requiring both speed and accuracy in prediction <xref ref-type="bibr" rid="bib1.bibx19" id="paren.25"/>.  In our implementation, we train LightGBM with 500 estimators using Python's open-source library “lightgbm” <xref ref-type="bibr" rid="bib1.bibx44" id="paren.26"/>.</p>
              </list-item>
              <list-item>

      <p id="d2e735"><italic>The CatBoost classifier</italic> is a machine learning algorithm that uses gradient boosting on decision trees and is specifically designed to handle categorical features <xref ref-type="bibr" rid="bib1.bibx34" id="paren.27"/>. CatBoost stands for “Categorical Boosting” and handles categorical variables automatically, without requiring manual preprocessing. It uses symmetric trees and ordered boosting to reduce overfitting, and often outperforms other methods on datasets with categorical data. This makes it an attractive option for datasets containing both numerical and categorical variables. In our implementation, we employ the open-source CatBoost library and configure the model with 500 estimators to balance performance and computational efficiency.</p>
              </list-item>
              <list-item>

      <p id="d2e746"><italic>The Voting Classifier</italic> is an ensemble meta-estimator that combines predictions from multiple base models to enhance overall accuracy and robustness. It works either by majority vote (“hard”) or by averaging predicted probabilities (“soft”), which balances the weaknesses of the individual classifiers. This approach often yields better generalization and reduced overfitting. Our implementation uses a soft voting strategy to aggregate the predictions of the Random Forest, LightGBM, and CatBoost classifiers via the scikit-learn library <xref ref-type="bibr" rid="bib1.bibx29" id="paren.28"/>.</p>
              </list-item>
            </list></p>
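      <p>The difference between hard and soft voting can be illustrated with a minimal, library-free sketch. It is not the implementation used in this study, which relies on the scikit-learn voting classifier; the function names and the per-model probabilities below are hypothetical and chosen only so that the two strategies disagree.</p>
<preformat><![CDATA[
```python
from collections import Counter

CLASSES = ("urban", "suburban", "rural")


def hard_vote(probas):
    """Majority vote over each base model's most probable class."""
    winners = [max(range(len(p)), key=p.__getitem__) for p in probas]
    # most_common(1) returns the class index with the most votes
    best_index = Counter(winners).most_common(1)[0][0]
    return CLASSES[best_index]


def soft_vote(probas):
    """Average the class probabilities of all models, then take the argmax."""
    n_models = len(probas)
    mean = [sum(p[i] for p in probas) / n_models for i in range(len(CLASSES))]
    best_index = max(range(len(mean)), key=mean.__getitem__)
    return CLASSES[best_index], mean


# Hypothetical class probabilities (urban, suburban, rural) from three
# base models for a single station; illustrative values only.
station_probas = [
    [0.45, 0.40, 0.15],  # model A narrowly prefers urban
    [0.48, 0.42, 0.10],  # model B narrowly prefers urban
    [0.05, 0.90, 0.05],  # model C is confident about suburban
]

hard_label = hard_vote(station_probas)
soft_label, mean_proba = soft_vote(station_probas)
print(hard_label, soft_label)
```
]]></preformat>
      <p>In this toy case, hard voting follows the two narrow urban votes, whereas soft voting gives weight to the one confident suburban prediction. Soft voting also yields averaged class probabilities, which any probability-based post-processing requires.</p>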
</sec>
<sec id="Ch1.S2.SS3.SSS2">
  <label>2.3.2</label><title>Leveraging Model Uncertainty to Enhance Suburban Classification Accuracy</title>
      <p id="d2e764">Considering the inherent subjectivity in defining suburban areas, we refined our prediction methodology as follows. For any given station, if the model's highest probability belongs to the urban or rural class but falls below a threshold, and the second-highest probability belongs to the suburban class, we interpret this as the model being uncertain between rural and suburban, or between urban and suburban. In such cases, we classify the station as suburban. We applied a grid search over the range [0.35, 0.85], maximizing macro-F1, to find the optimal threshold for each supervised algorithm. We obtained thresholds of 0.5214, 0.4357, 0.5459, and 0.4846 for Random Forest, CatBoost, LightGBM, and voting, respectively. This approach acknowledges the model's indecision and leverages it to better capture the nuanced nature of suburban environments.</p>
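      <p>The decision rule described in this section can be sketched in a few lines. This is an illustrative reading of the rule, not the code of the study: the function name and the example probabilities are hypothetical, and only the threshold value (0.5214, reported for Random Forest) is taken from the text.</p>
<preformat><![CDATA[
```python
CLASSES = ("urban", "suburban", "rural")


def classify_with_threshold(proba, threshold):
    """Apply the uncertainty rule to one station.

    proba: class probabilities ordered as CLASSES.
    A station whose top class is urban or rural is relabeled as suburban
    when the top probability is below the threshold and suburban is the
    second most probable class.
    """
    ranked = sorted(range(len(CLASSES)), key=lambda i: proba[i], reverse=True)
    top, second = ranked[0], ranked[1]
    if (
        CLASSES[top] in ("urban", "rural")
        and proba[top] < threshold
        and CLASSES[second] == "suburban"
    ):
        return "suburban"
    return CLASSES[top]


RF_THRESHOLD = 0.5214  # threshold reported in the text for Random Forest

# Hypothetical probability vectors (urban, suburban, rural).
examples = {
    "confident urban": [0.70, 0.25, 0.05],
    "uncertain urban/suburban": [0.50, 0.40, 0.10],
    "uncertain urban/rural": [0.48, 0.07, 0.45],
    "uncertain rural/suburban": [0.05, 0.45, 0.50],
}
labels = {name: classify_with_threshold(p, RF_THRESHOLD) for name, p in examples.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```
]]></preformat>
      <p>Note that a low top probability alone does not trigger the relabeling: in the third example the runner-up class is rural, so the station keeps its urban label.</p>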
</sec>
<sec id="Ch1.S2.SS3.SSS3">
  <label>2.3.3</label><title>Evaluation</title>
      <p id="d2e776">To evaluate our model's accuracy, we use a separate test dataset of 1979 samples, shown as red dots in Fig. <xref ref-type="fig" rid="F2"/>. The test samples were selected with stratification so that they reflect the distribution of the real data, and were intentionally excluded from both the training phase and the hyperparameter tuning process. This ensures that the evaluation metrics provide an unbiased assessment of the model's ability to generalize to new, unseen data, as it would have to in a real-world application. We employed the following evaluation metrics to measure the performance of our machine learning models on this test dataset.</p>
      <p id="d2e781"><list list-type="bullet">
              <list-item>

      <p id="d2e786"><italic>Accuracy</italic> measures the ability of the machine learning model to accurately predict the outcome for the given input data. It is measured as the proportion of correct predictions to the total number of predictions made by the model, and given by the following formula:

                    <disp-formula id="Ch1.Ex1"><mml:math id="M21" display="block"><mml:mrow><mml:mi mathvariant="normal">Accuracy</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:mi mathvariant="italic">#</mml:mi><mml:mi mathvariant="normal">Correct</mml:mi><mml:mspace linebreak="nobreak" width="0.25em"/><mml:mi mathvariant="normal">Predictions</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="italic">#</mml:mi><mml:mi mathvariant="normal">Total</mml:mi><mml:mspace linebreak="nobreak" width="0.25em"/><mml:mi mathvariant="normal">Predictions</mml:mi></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>×</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></disp-formula>

                  Here and in the following formulas, # stands for “Number of”.</p>
              </list-item>
              <list-item>

      <p id="d2e826"><italic>Per-class Precision:</italic> For a given class <inline-formula><mml:math id="M22" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula>, precision quantifies the fraction of correctly predicted instances of class <inline-formula><mml:math id="M23" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula> among all instances that the model predicted as class <inline-formula><mml:math id="M24" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula>:

                    <disp-formula id="Ch1.Ex2"><mml:math id="M25" display="block"><mml:mtable rowspacing="0.2ex" class="split" displaystyle="true" columnalign="right left"><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:mi mathvariant="normal">Precision</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo><mml:mo>=</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:mspace width="1em" linebreak="nobreak"/><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:mi mathvariant="italic">#</mml:mi><mml:mi mathvariant="normal">True</mml:mi><mml:mspace linebreak="nobreak" width="0.25em"/><mml:mi mathvariant="normal">Positives</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="italic">#</mml:mi><mml:mi mathvariant="normal">True</mml:mi><mml:mspace linebreak="nobreak" width="0.25em"/><mml:mi mathvariant="normal">Positives</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant="italic">#</mml:mi><mml:mi mathvariant="normal">False</mml:mi><mml:mspace linebreak="nobreak" width="0.25em"/><mml:mi mathvariant="normal">Positives</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>×</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>

                  Here, True Positives (TP) are the instances that actually belong to class <inline-formula><mml:math id="M26" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula> and were correctly predicted as such, while False Positives (FP) are the instances that do not belong to class <inline-formula><mml:math id="M27" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula> but were incorrectly predicted as instances of class <inline-formula><mml:math id="M28" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula>. Precision reflects how reliable predictions of class <inline-formula><mml:math id="M29" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula> are: high precision indicates that when the model predicts <inline-formula><mml:math id="M30" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula>, it is usually correct <xref ref-type="bibr" rid="bib1.bibx33 bib1.bibx6" id="paren.29"/>.</p>
              </list-item>
              <list-item>

      <p id="d2e969"><italic>Per-class Recall (Sensitivity, True Positive Rate):</italic> For a given class <inline-formula><mml:math id="M31" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula>, recall is defined as the proportion of true positive predictions for class <inline-formula><mml:math id="M32" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula> among all actual instances belonging to class <inline-formula><mml:math id="M33" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula>:

                    <disp-formula id="Ch1.Ex3"><mml:math id="M34" display="block"><mml:mtable rowspacing="0.2ex" class="split" displaystyle="true" columnalign="right left"><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:mi mathvariant="normal">Recall</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo><mml:mo>=</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:mspace linebreak="nobreak" width="1em"/><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:mi mathvariant="italic">#</mml:mi><mml:mi mathvariant="normal">True</mml:mi><mml:mspace linebreak="nobreak" width="0.25em"/><mml:mi mathvariant="normal">Positives</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="italic">#</mml:mi><mml:mi mathvariant="normal">True</mml:mi><mml:mspace width="0.25em" linebreak="nobreak"/><mml:mi mathvariant="normal">Positives</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo><mml:mspace linebreak="nobreak" width="0.25em"/><mml:mo>+</mml:mo><mml:mspace width="0.25em" linebreak="nobreak"/><mml:mi mathvariant="italic">#</mml:mi><mml:mi mathvariant="normal">False</mml:mi><mml:mspace linebreak="nobreak" width="0.25em"/><mml:mi mathvariant="normal">Negatives</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>×</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>

                  Here, False Negatives are the instances of class <inline-formula><mml:math id="M35" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula> that the model incorrectly assigned to another class. Recall characterizes the ability of the model to retrieve all relevant instances of class <inline-formula><mml:math id="M36" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula>; high recall indicates that the model rarely overlooks samples from this class <xref ref-type="bibr" rid="bib1.bibx33" id="paren.30"/>.</p>
              </list-item>
              <list-item>

      <p id="d2e1093"><italic>Per-class F1 Score:</italic> The F1 score for class <inline-formula><mml:math id="M37" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula> is defined as the harmonic mean of precision and recall:

                    <disp-formula id="Ch1.Ex4"><mml:math id="M38" display="block"><mml:mrow><mml:mi mathvariant="normal">F</mml:mi><mml:mn mathvariant="normal">1</mml:mn><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:mn mathvariant="normal">2</mml:mn><mml:mo>⋅</mml:mo><mml:mi mathvariant="normal">Precision</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo><mml:mo>⋅</mml:mo><mml:mi mathvariant="normal">Recall</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mi mathvariant="normal">Precision</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo><mml:mspace linebreak="nobreak" width="0.25em"/><mml:mo>+</mml:mo><mml:mspace width="0.25em" linebreak="nobreak"/><mml:mi mathvariant="normal">Recall</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>×</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></disp-formula>

                  This metric provides a single, balanced measure of a model's ability to achieve both high precision and high recall for class <inline-formula><mml:math id="M39" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula>, penalizing extreme values in either criterion. A high F1 score indicates that the classifier is effective both in accurately identifying instances of class <inline-formula><mml:math id="M40" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula> (precision) and in capturing the majority of actual class <inline-formula><mml:math id="M41" display="inline"><mml:mi>c</mml:mi></mml:math></inline-formula> samples (recall) <xref ref-type="bibr" rid="bib1.bibx33" id="paren.31"/>.</p>
              </list-item>
              <list-item>

      <p id="d2e1201"><italic>Macro-F1:</italic> The macro-averaged F1 score is computed as the arithmetic mean of the per-class F1 scores, assigning equal weight to each class irrespective of its prevalence in the dataset:

                    <disp-formula id="Ch1.Ex5"><mml:math id="M42" display="block"><mml:mrow><mml:mtext>Macro-F1</mml:mtext><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mn mathvariant="normal">1</mml:mn><mml:mi>N</mml:mi></mml:mfrac></mml:mstyle><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mi mathvariant="normal">F</mml:mi><mml:mn mathvariant="normal">1</mml:mn><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo><mml:mo>×</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></disp-formula>

                  where <inline-formula><mml:math id="M43" display="inline"><mml:mi>N</mml:mi></mml:math></inline-formula> denotes the total number of classes. This metric evaluates overall model performance by averaging across all classes and penalizes poor classification performance on minority classes, as each class contributes equally to the final score <xref ref-type="bibr" rid="bib1.bibx33" id="paren.32"/>. Macro-F1 is widely used in multi-class classification evaluation for its insensitivity to class imbalance <xref ref-type="bibr" rid="bib1.bibx6" id="paren.33"/>.</p>
              </list-item>
              <list-item>

      <p id="d2e1265"><italic>Balanced Accuracy:</italic> Balanced accuracy is defined as the average of the per-class recall values:

                    <disp-formula id="Ch1.Ex6"><mml:math id="M44" display="block"><mml:mrow><mml:mi mathvariant="normal">Balanced</mml:mi><mml:mspace width="0.25em" linebreak="nobreak"/><mml:mi mathvariant="normal">Accuracy</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mn mathvariant="normal">1</mml:mn><mml:mi>N</mml:mi></mml:mfrac></mml:mstyle><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>c</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mi mathvariant="normal">Recall</mml:mi><mml:mo>(</mml:mo><mml:mi>c</mml:mi><mml:mo>)</mml:mo><mml:mo>×</mml:mo><mml:mn mathvariant="normal">100</mml:mn></mml:mrow></mml:math></disp-formula>

                  where <inline-formula><mml:math id="M45" display="inline"><mml:mi>N</mml:mi></mml:math></inline-formula> represents the total number of classes. Unlike standard accuracy, balanced accuracy accounts for class imbalance by assigning equal weight to each class regardless of sample frequency. This metric provides a more equitable evaluation in imbalanced classification scenarios, mitigating the bias introduced when a dominant class disproportionately influences the overall accuracy score <xref ref-type="bibr" rid="bib1.bibx6 bib1.bibx39" id="paren.34"/>.</p>
              </list-item>
              <list-item>

      <p id="d2e1327"><italic>Adjusted Rand Index (ARI)</italic> quantifies the similarity between the true cluster assignments and those predicted by the model. It considers all possible pairs of samples and counts how many pairs are assigned to the same or to different clusters in both the predicted and the true clustering <xref ref-type="bibr" rid="bib1.bibx7 bib1.bibx9" id="paren.35"/>. The ARI score ranges from <inline-formula><mml:math id="M46" display="inline"><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1.0</mml:mn></mml:mrow></mml:math></inline-formula> to 1.0. A score approaching 1 indicates strong concordance between the true labels and the model's predictions, meaning that most sample pairs are grouped consistently in both clusterings. A score near 0 suggests the clustering is comparable to random assignment. A negative score indicates that sample pairs are grouped differently in the predicted and true clusterings more often than expected by chance, i.e., the agreement is worse than random assignment.</p>
              </list-item>
              <list-item>

      <p id="d2e1348"><italic>Normalized Mutual Information (NMI)</italic> measures the mutual information between the true clusters of the samples and the clusters assigned by <inline-formula><mml:math id="M47" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means, normalized by the average entropy of the two label sets. It ranges from 0 to 1, where a score close to 1 indicates strong agreement between the true clusters and the <inline-formula><mml:math id="M48" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clusters. A score of 0 indicates no mutual information between clusters <xref ref-type="bibr" rid="bib1.bibx22" id="paren.36"/>.</p>
              </list-item>
            </list>To visualize the <inline-formula><mml:math id="M49" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clusters, we employed Principal Component Analysis (PCA), a dimensionality reduction technique that projects data onto orthogonal axes of maximum variance, enabling the representation of high-dimensional data in two or three dimensions <xref ref-type="bibr" rid="bib1.bibx1" id="paren.37"/>.</p>
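      <p>The classification metrics defined above can all be derived from a confusion matrix, as the following minimal sketch shows. The function name and the confusion matrix are hypothetical and do not reproduce any result of this study. The sketch computes all quantities as fractions and converts them to percent once at the end.</p>
<preformat><![CDATA[
```python
def classification_metrics(cm):
    """Compute metrics (in percent) from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total

    precision, recall, f1 = [], [], []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        fn = sum(cm[c]) - tp                       # true c, predicted otherwise
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)

    return {
        "accuracy": 100 * accuracy,
        "precision": [100 * p for p in precision],
        "recall": [100 * r for r in recall],
        "f1": [100 * f for f in f1],
        "macro_f1": 100 * sum(f1) / n,               # unweighted mean of per-class F1
        "balanced_accuracy": 100 * sum(recall) / n,  # unweighted mean of recall
    }


# Hypothetical confusion matrix; rows are true classes and columns are
# predicted classes, both ordered as (urban, suburban, rural).
cm = [
    [80, 15, 5],
    [20, 50, 30],
    [5, 10, 85],
]
metrics = classification_metrics(cm)
for name, value in metrics.items():
    print(name, value)
```
]]></preformat>
      <p>Because every class in this toy matrix has the same number of samples, accuracy and balanced accuracy coincide; with imbalanced classes they generally differ.</p>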
</sec>
</sec>
</sec>
<sec id="Ch1.S3">
  <label>3</label><title>Results and discussion</title>
      <p id="d2e1393">In this section, we present and analyze the results of the various machine learning models applied to the TOAR station classification task. The first subsection details the outcomes of the unsupervised <inline-formula><mml:math id="M50" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering. The second subsection presents and analyzes the results from the three supervised methods. Finally, the last subsection discusses the overall performance and comparative insights of the different approaches.</p>
<sec id="Ch1.S3.SS1">
  <label>3.1</label><title>Results for <inline-formula><mml:math id="M51" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering</title>
      <p id="d2e1418">Figure <xref ref-type="fig" rid="F3"/>a shows the elbow plot, a heuristic technique used to determine the optimal number of clusters for <inline-formula><mml:math id="M52" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering. As the gradient of classification accuracy flattens at 3 to 4 clusters, these values for <inline-formula><mml:math id="M53" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula> represent the optimal choices. This result is encouraging, since we want to distinguish 3 different types of stations. As a first analysis of the <inline-formula><mml:math id="M54" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering, Fig. <xref ref-type="fig" rid="F3"/>b shows the <inline-formula><mml:math id="M55" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means predictions for 35 manually selected and labeled stations, chosen from the different categories with clear-cut characteristics, as a sanity check of our method. Table <xref ref-type="table" rid="T2"/> presents the accuracy of <inline-formula><mml:math id="M56" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means predictions on these manually labeled stations from the test set, comparing them with the classification reported in the TOAR database.</p>

      <fig id="F3" specific-use="star"><label>Figure 3</label><caption><p id="d2e1465"><bold>(a)</bold> Elbow method to determine the optimal number of clusters for <inline-formula><mml:math id="M57" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means. <bold>(b)</bold> Clusters for the selected hand-labeled stations; the red points mark the cluster centroids.</p></caption>
          <graphic xlink:href="https://gmd.copernicus.org/articles/19/5765/2026/gmd-19-5765-2026-f03.png"/>

        </fig>

<table-wrap id="T2" specific-use="star"><label>Table 2</label><caption><p id="d2e1490">Global evaluation of <inline-formula><mml:math id="M58" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering: Accuracy, Balanced accuracy, ARI-score, and NMI-score on the unseen test set labeled by the TOAR data providers, as well as on the 35 manually selected stations with hand-labeled classifications.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="5">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Dataset</oasis:entry>
         <oasis:entry colname="col2">Accuracy</oasis:entry>
         <oasis:entry colname="col3">Balanced Acc.</oasis:entry>
         <oasis:entry colname="col4">ARI-score</oasis:entry>
         <oasis:entry colname="col5">NMI-score</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">35 manually selected stations (hand labels)</oasis:entry>
         <oasis:entry colname="col2">77.14 %</oasis:entry>
         <oasis:entry colname="col3">58.06 %</oasis:entry>
         <oasis:entry colname="col4">0.55</oasis:entry>
         <oasis:entry colname="col5">0.51</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">35 manually selected stations (TOAR labels)</oasis:entry>
         <oasis:entry colname="col2">77.14 %</oasis:entry>
         <oasis:entry colname="col3">58.82 %</oasis:entry>
         <oasis:entry colname="col4">0.55</oasis:entry>
         <oasis:entry colname="col5">0.546</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Unseen test set (1979 stations)</oasis:entry>
         <oasis:entry colname="col2">61.24 %</oasis:entry>
         <oasis:entry colname="col3">58.46 %</oasis:entry>
         <oasis:entry colname="col4">0.25</oasis:entry>
         <oasis:entry colname="col5">0.22</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <fig id="F4" specific-use="star"><label>Figure 4</label><caption><p id="d2e1596"><bold>(a)</bold> Confusion matrix for <inline-formula><mml:math id="M59" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering, evaluated on 1979 unseen test stations labeled by the TOAR data providers. <bold>(b)</bold> Cluster visualization of the previously unseen test data.</p></caption>
          <graphic xlink:href="https://gmd.copernicus.org/articles/19/5765/2026/gmd-19-5765-2026-f04.png"/>

        </fig>

      <p id="d2e1617">Figure <xref ref-type="fig" rid="F4"/>a presents the confusion matrix computed from the <inline-formula><mml:math id="M60" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means predictions, with the classes from the TOAR database as ground truth, based on the 1979 unseen test stations (see Sect. <xref ref-type="sec" rid="Ch1.S2"/>). This confusion matrix highlights the high accuracy of <inline-formula><mml:math id="M61" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means in classifying urban and rural stations (recall of 71.88 % and 87.67 %, respectively; see the Recall column of Table <xref ref-type="table" rid="T3"/>). However, in many instances stations reported as urban or rural are classified as suburban, and vice versa. This misclassification is primarily due to the subjective nature of defining suburban areas, which often lie at the interface between rural and urban regions or exhibit a mix of urban-suburban or rural-suburban characteristics. Additionally, Fig. <xref ref-type="fig" rid="F4"/>b visualizes the clusters defined by <inline-formula><mml:math id="M62" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means for the 1979 unseen test stations. This visual representation clearly illustrates the fuzzy boundaries between the clusters and the noticeable spacing among the three centroids (depicted as red crosses).</p>

<table-wrap id="T3"><label>Table 3</label><caption><p id="d2e1653">Per-class evaluation of <inline-formula><mml:math id="M63" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering: Precision, recall, and F1-score on the unseen test set of 1979 stations labeled by the TOAR data providers.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="5">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Class</oasis:entry>
         <oasis:entry colname="col2">Precision</oasis:entry>
         <oasis:entry colname="col3">Recall</oasis:entry>
         <oasis:entry colname="col4">F1-score</oasis:entry>
         <oasis:entry colname="col5">Support</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Urban</oasis:entry>
         <oasis:entry colname="col2">66.83 %</oasis:entry>
         <oasis:entry colname="col3">71.88 %</oasis:entry>
         <oasis:entry colname="col4">69.26 %</oasis:entry>
         <oasis:entry colname="col5">928</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Suburban</oasis:entry>
         <oasis:entry colname="col2">33.33 %</oasis:entry>
         <oasis:entry colname="col3">15.84 %</oasis:entry>
         <oasis:entry colname="col4">21.47 %</oasis:entry>
         <oasis:entry colname="col5">524</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Rural</oasis:entry>
         <oasis:entry colname="col2">63.11 %</oasis:entry>
         <oasis:entry colname="col3">87.67 %</oasis:entry>
         <oasis:entry colname="col4">73.39 %</oasis:entry>
         <oasis:entry colname="col5">527</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

</sec>
<sec id="Ch1.S3.SS2">
  <label>3.2</label><title>Results for supervised classifiers</title>
      <p id="d2e1765">Here, we evaluate the results from the three supervised machine learning classifiers: Random Forest, LightGBM, and CatBoost. In addition, the three models were combined in a soft-voting classifier with the aim of maximizing classification accuracy. We observed that the results of all algorithms are quite similar. The models achieved high accuracy (<inline-formula><mml:math id="M64" display="inline"><mml:mo lspace="0mm">&gt;</mml:mo></mml:math></inline-formula> 83 %), high precision (<inline-formula><mml:math id="M65" display="inline"><mml:mo lspace="0mm">&gt;</mml:mo></mml:math></inline-formula> 85 %), and high F1-score (<inline-formula><mml:math id="M66" display="inline"><mml:mo lspace="0mm">&gt;</mml:mo></mml:math></inline-formula> 81 %) in predicting urban and rural areas. However, all models struggled with the suburban class, yielding accuracies only slightly above 60 % (Table <xref ref-type="table" rid="T5"/>). As discussed above, this lower accuracy can mainly be attributed to the inherently subjective nature of defining this category. To address this issue, we implemented a strategy that capitalizes on model uncertainty. By adjusting the prediction probability threshold, as detailed in Sect. <xref ref-type="sec" rid="Ch1.S2.SS3.SSS2"/>, we significantly enhanced the accuracy of suburban area classifications, as shown in Table <xref ref-type="table" rid="T7"/>. Figure <xref ref-type="fig" rid="F5"/>a presents the confusion matrix for the Random Forest classifier and Fig. <xref ref-type="fig" rid="F5"/>b visualizes the feature importance. The global evaluation metrics, i.e., accuracy, balanced accuracy, macro-F1 score, and weighted-F1 score, of the different classifiers are reported in Tables <xref ref-type="table" rid="T4"/> and <xref ref-type="table" rid="T6"/>, which show the results before and after the probability threshold adjustment, respectively.
While the overall accuracy remains similar before and after the adjustment, the probability threshold adjustment significantly enhances the prediction accuracy for suburban stations, increasing it from a range of 60.85 %–65.72 % to 66.22 %–71.95 %, and also slightly increases the suburban F1-score. The details can be seen in Tables <xref ref-type="table" rid="T5"/> and <xref ref-type="table" rid="T7"/>, which present the per-class evaluation before and after applying the probability threshold adjustment. We also note a slight drop in accuracy when classifying urban and rural stations. However, classification performance for these two classes remains high across all classifiers, with accuracy values exceeding 80 %. Additionally, we tested our machine learning models on the manually labeled stations used for the <inline-formula><mml:math id="M67" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means evaluation. In this test, the classifiers reproduce the labels reported to TOAR by the data providers for the 35 manually selected stations with 100 % accuracy and achieve 87.88 % accuracy with respect to the manual classifications.</p>

<table-wrap id="T4" specific-use="star"><label>Table 4</label><caption><p id="d2e1817">Global performance evaluation of models. Reported values are Accuracy, Balanced accuracy, Macro-F1 score, and Weighted-F1 score of random forest, LGBM, CatBoost, and voting classifiers before probability threshold adjustment, evaluated on 1979 test stations.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="5">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Model</oasis:entry>
         <oasis:entry colname="col2">Accuracy</oasis:entry>
         <oasis:entry colname="col3">Balanced Acc.</oasis:entry>
         <oasis:entry colname="col4">Macro-F1</oasis:entry>
         <oasis:entry colname="col5">Weighted-F1</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Random Forest</oasis:entry>
         <oasis:entry colname="col2">76.25 %</oasis:entry>
         <oasis:entry colname="col3">75.39 %</oasis:entry>
         <oasis:entry colname="col4">75.07 %</oasis:entry>
         <oasis:entry colname="col5">76.53 %</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">CatBoost</oasis:entry>
         <oasis:entry colname="col2">76.25 %</oasis:entry>
         <oasis:entry colname="col3">75.70 %</oasis:entry>
         <oasis:entry colname="col4">75.27 %</oasis:entry>
         <oasis:entry colname="col5">76.66 %</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">LightGBM</oasis:entry>
         <oasis:entry colname="col2">76.81 %</oasis:entry>
         <oasis:entry colname="col3">75.54 %</oasis:entry>
         <oasis:entry colname="col4">75.42 %</oasis:entry>
         <oasis:entry colname="col5">76.93 %</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Voting</oasis:entry>
         <oasis:entry colname="col2">76.40 %</oasis:entry>
         <oasis:entry colname="col3">75.56 %</oasis:entry>
         <oasis:entry colname="col4">75.25 %</oasis:entry>
         <oasis:entry colname="col5">76.71 %</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

<table-wrap id="T5" specific-use="star"><label>Table 5</label><caption><p id="d2e1935">Per-class performance evaluation of the models. Reported values are precision, recall, F1 score, and support of the random forest, CatBoost, LightGBM, and voting classifiers before probability threshold adjustment, evaluated on 1979 test stations.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="10">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right" colsep="1"/>
     <oasis:colspec colnum="6" colname="col6" align="left"/>
     <oasis:colspec colnum="7" colname="col7" align="right"/>
     <oasis:colspec colnum="8" colname="col8" align="right"/>
     <oasis:colspec colnum="9" colname="col9" align="right"/>
     <oasis:colspec colnum="10" colname="col10" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry namest="col1" nameend="col5" colsep="1"><bold>(a)</bold> Random Forest </oasis:entry>
         <oasis:entry namest="col6" nameend="col10"><bold>(b)</bold> CatBoost </oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Class</oasis:entry>
         <oasis:entry colname="col2">Precision</oasis:entry>
         <oasis:entry colname="col3">Recall</oasis:entry>
         <oasis:entry colname="col4">F1-score</oasis:entry>
         <oasis:entry colname="col5">Support</oasis:entry>
         <oasis:entry colname="col6">Class</oasis:entry>
         <oasis:entry colname="col7">Precision</oasis:entry>
         <oasis:entry colname="col8">Recall</oasis:entry>
         <oasis:entry colname="col9">F1-score</oasis:entry>
         <oasis:entry colname="col10">Support</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Urban</oasis:entry>
         <oasis:entry colname="col2">85.02 %</oasis:entry>
         <oasis:entry colname="col3">79.53 %</oasis:entry>
         <oasis:entry colname="col4">82.18 %</oasis:entry>
         <oasis:entry colname="col5">928</oasis:entry>
         <oasis:entry colname="col6">Urban</oasis:entry>
         <oasis:entry colname="col7">86.04 %</oasis:entry>
         <oasis:entry colname="col8">78.34 %</oasis:entry>
         <oasis:entry colname="col9">82.01 %</oasis:entry>
         <oasis:entry colname="col10">928</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Suburban</oasis:entry>
         <oasis:entry colname="col2">57.74 %</oasis:entry>
         <oasis:entry colname="col3">63.36 %</oasis:entry>
         <oasis:entry colname="col4">60.42 %</oasis:entry>
         <oasis:entry colname="col5">524</oasis:entry>
         <oasis:entry colname="col6">Suburban</oasis:entry>
         <oasis:entry colname="col7">57.00 %</oasis:entry>
         <oasis:entry colname="col8">65.27 %</oasis:entry>
         <oasis:entry colname="col9">60.85 %</oasis:entry>
         <oasis:entry colname="col10">524</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Rural</oasis:entry>
         <oasis:entry colname="col2">81.90 %</oasis:entry>
         <oasis:entry colname="col3">83.30 %</oasis:entry>
         <oasis:entry colname="col4">82.60 %</oasis:entry>
         <oasis:entry colname="col5">527</oasis:entry>
         <oasis:entry colname="col6">Rural</oasis:entry>
         <oasis:entry colname="col7">82.40 %</oasis:entry>
         <oasis:entry colname="col8">83.49 %</oasis:entry>
         <oasis:entry colname="col9">82.94 %</oasis:entry>
         <oasis:entry colname="col10">527</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry namest="col1" nameend="col5" colsep="1"><bold>(c)</bold> LightGBM </oasis:entry>
         <oasis:entry namest="col6" nameend="col10"><bold>(d)</bold> Voting </oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Class</oasis:entry>
         <oasis:entry colname="col2">Precision</oasis:entry>
         <oasis:entry colname="col3">Recall</oasis:entry>
         <oasis:entry colname="col4">F1-score</oasis:entry>
         <oasis:entry colname="col5">Support</oasis:entry>
         <oasis:entry colname="col6">Class</oasis:entry>
         <oasis:entry colname="col7">Precision</oasis:entry>
         <oasis:entry colname="col8">Recall</oasis:entry>
         <oasis:entry colname="col9">F1-score</oasis:entry>
         <oasis:entry colname="col10">Support</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Urban</oasis:entry>
         <oasis:entry colname="col2">83.85 %</oasis:entry>
         <oasis:entry colname="col3">81.68 %</oasis:entry>
         <oasis:entry colname="col4">82.75 %</oasis:entry>
         <oasis:entry colname="col5">928</oasis:entry>
         <oasis:entry colname="col6">Urban</oasis:entry>
         <oasis:entry colname="col7">85.24 %</oasis:entry>
         <oasis:entry colname="col8">79.63 %</oasis:entry>
         <oasis:entry colname="col9">82.34 %</oasis:entry>
         <oasis:entry colname="col10">928</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Suburban</oasis:entry>
         <oasis:entry colname="col2">59.16 %</oasis:entry>
         <oasis:entry colname="col3">61.64 %</oasis:entry>
         <oasis:entry colname="col4">60.37 %</oasis:entry>
         <oasis:entry colname="col5">524</oasis:entry>
         <oasis:entry colname="col6">Suburban</oasis:entry>
         <oasis:entry colname="col7">57.59 %</oasis:entry>
         <oasis:entry colname="col8">63.74 %</oasis:entry>
         <oasis:entry colname="col9">60.51 %</oasis:entry>
         <oasis:entry colname="col10">524</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Rural</oasis:entry>
         <oasis:entry colname="col2">82.99 %</oasis:entry>
         <oasis:entry colname="col3">83.30 %</oasis:entry>
         <oasis:entry colname="col4">83.14 %</oasis:entry>
         <oasis:entry colname="col5">527</oasis:entry>
         <oasis:entry colname="col6">Rural</oasis:entry>
         <oasis:entry colname="col7">82.52 %</oasis:entry>
         <oasis:entry colname="col8">83.30 %</oasis:entry>
         <oasis:entry colname="col9">82.91 %</oasis:entry>
         <oasis:entry colname="col10">527</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>
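<p>The global scores in Table <xref ref-type="table" rid="T4"/> can be recovered from the per-class values in Table <xref ref-type="table" rid="T5"/>. The following sketch does this for the random forest classifier, assuming the standard definitions: accuracy as the support-weighted mean of the per-class recall, balanced accuracy as its unweighted mean, and macro-F1 and weighted-F1 as the unweighted and support-weighted means of the per-class F1 scores. Because the inputs are already rounded to two decimals, agreement is expected only up to rounding.</p>
<preformat preformat-type="code"><![CDATA[
```python
# Per-class values for the random forest classifier, copied from Table 5a (in %).
recall = {"urban": 79.53, "suburban": 63.36, "rural": 83.30}
f1 = {"urban": 82.18, "suburban": 60.42, "rural": 82.60}
support = {"urban": 928, "suburban": 524, "rural": 527}

n_total = sum(support.values())  # number of test stations


def weighted_mean(values: dict) -> float:
    """Mean of per-class values weighted by the class support."""
    return sum(values[c] * support[c] for c in support) / n_total


def plain_mean(values: dict) -> float:
    """Unweighted mean over the classes."""
    return sum(values.values()) / len(values)


accuracy = weighted_mean(recall)       # fraction of all stations predicted correctly
balanced_accuracy = plain_mean(recall)
macro_f1 = plain_mean(f1)
weighted_f1 = weighted_mean(f1)

for name, value in [("accuracy", accuracy),
                    ("balanced accuracy", balanced_accuracy),
                    ("macro-F1", macro_f1),
                    ("weighted-F1", weighted_f1)]:
    print(f"{name}: {value:.2f} %")
```
]]></preformat>
<p>The same calculation applies to the other classifiers and to the values obtained after the probability threshold adjustment.</p>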

<table-wrap id="T6" specific-use="star"><label>Table 6</label><caption><p id="d2e2274">Global performance evaluation of the models. Reported values are accuracy, balanced accuracy, macro-F1 score, and weighted-F1 score of the random forest, CatBoost, LightGBM, and voting classifiers after applying probability threshold adjustment, evaluated on 1979 test stations.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="5">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Model</oasis:entry>
         <oasis:entry colname="col2">Accuracy</oasis:entry>
         <oasis:entry colname="col3">Balanced Acc.</oasis:entry>
         <oasis:entry colname="col4">Macro-F1</oasis:entry>
         <oasis:entry colname="col5">Weighted-F1</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">Random Forest</oasis:entry>
         <oasis:entry colname="col2">75.95 %</oasis:entry>
         <oasis:entry colname="col3">75.77 %</oasis:entry>
         <oasis:entry colname="col4">75.42 %</oasis:entry>
         <oasis:entry colname="col5">76.73 %</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">CatBoost</oasis:entry>
         <oasis:entry colname="col2">76.30 %</oasis:entry>
         <oasis:entry colname="col3">75.85 %</oasis:entry>
         <oasis:entry colname="col4">75.40 %</oasis:entry>
         <oasis:entry colname="col5">76.74 %</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">LightGBM</oasis:entry>
         <oasis:entry colname="col2">76.96 %</oasis:entry>
         <oasis:entry colname="col3">76.20 %</oasis:entry>
         <oasis:entry colname="col4">76.04 %</oasis:entry>
         <oasis:entry colname="col5">77.36 %</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Voting</oasis:entry>
         <oasis:entry colname="col2">76.50 %</oasis:entry>
         <oasis:entry colname="col3">76.10 %</oasis:entry>
         <oasis:entry colname="col4">75.79 %</oasis:entry>
         <oasis:entry colname="col5">77.07 %</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

<table-wrap id="T7" specific-use="star"><label>Table 7</label><caption><p id="d2e2392">Per-class performance evaluation of the models. Reported values are precision, recall, F1 score, and support of the random forest, CatBoost, LightGBM, and voting classifiers after applying probability threshold adjustment, evaluated on 1979 test stations.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="10">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right" colsep="1"/>
     <oasis:colspec colnum="6" colname="col6" align="left"/>
     <oasis:colspec colnum="7" colname="col7" align="right"/>
     <oasis:colspec colnum="8" colname="col8" align="right"/>
     <oasis:colspec colnum="9" colname="col9" align="right"/>
     <oasis:colspec colnum="10" colname="col10" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry namest="col1" nameend="col5" colsep="1"><bold>(a)</bold> Random Forest </oasis:entry>
         <oasis:entry namest="col6" nameend="col10"><bold>(b)</bold> CatBoost </oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Class</oasis:entry>
         <oasis:entry colname="col2">Precision</oasis:entry>
         <oasis:entry colname="col3">Recall</oasis:entry>
         <oasis:entry colname="col4">F1-score</oasis:entry>
         <oasis:entry colname="col5">Support</oasis:entry>
         <oasis:entry colname="col6">Class</oasis:entry>
         <oasis:entry colname="col7">Precision</oasis:entry>
         <oasis:entry colname="col8">Recall</oasis:entry>
         <oasis:entry colname="col9">F1-score</oasis:entry>
         <oasis:entry colname="col10">Support</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Urban</oasis:entry>
         <oasis:entry colname="col2">87.67 %</oasis:entry>
         <oasis:entry colname="col3">76.62 %</oasis:entry>
         <oasis:entry colname="col4">81.77 %</oasis:entry>
         <oasis:entry colname="col5">928</oasis:entry>
         <oasis:entry colname="col6">Urban</oasis:entry>
         <oasis:entry colname="col7">86.29 %</oasis:entry>
         <oasis:entry colname="col8">78.02 %</oasis:entry>
         <oasis:entry colname="col9">81.95 %</oasis:entry>
         <oasis:entry colname="col10">928</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Suburban</oasis:entry>
         <oasis:entry colname="col2">55.20 %</oasis:entry>
         <oasis:entry colname="col3">71.95 %</oasis:entry>
         <oasis:entry colname="col4">62.47 %</oasis:entry>
         <oasis:entry colname="col5">524</oasis:entry>
         <oasis:entry colname="col6">Suburban</oasis:entry>
         <oasis:entry colname="col7">56.98 %</oasis:entry>
         <oasis:entry colname="col8">66.22 %</oasis:entry>
         <oasis:entry colname="col9">61.25 %</oasis:entry>
         <oasis:entry colname="col10">524</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Rural</oasis:entry>
         <oasis:entry colname="col2">85.57 %</oasis:entry>
         <oasis:entry colname="col3">78.75 %</oasis:entry>
         <oasis:entry colname="col4">82.02 %</oasis:entry>
         <oasis:entry colname="col5">527</oasis:entry>
         <oasis:entry colname="col6">Rural</oasis:entry>
         <oasis:entry colname="col7">82.67 %</oasis:entry>
         <oasis:entry colname="col8">83.30 %</oasis:entry>
         <oasis:entry colname="col9">82.99 %</oasis:entry>
         <oasis:entry colname="col10">527</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry namest="col1" nameend="col5" colsep="1"><bold>(c)</bold> LightGBM </oasis:entry>
         <oasis:entry namest="col6" nameend="col10"><bold>(d)</bold> Voting </oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Class</oasis:entry>
         <oasis:entry colname="col2">Precision</oasis:entry>
         <oasis:entry colname="col3">Recall</oasis:entry>
         <oasis:entry colname="col4">F1-score</oasis:entry>
         <oasis:entry colname="col5">Support</oasis:entry>
         <oasis:entry colname="col6">Class</oasis:entry>
         <oasis:entry colname="col7">Precision</oasis:entry>
         <oasis:entry colname="col8">Recall</oasis:entry>
         <oasis:entry colname="col9">F1-score</oasis:entry>
         <oasis:entry colname="col10">Support</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Urban</oasis:entry>
         <oasis:entry colname="col2">85.27 %</oasis:entry>
         <oasis:entry colname="col3">79.85 %</oasis:entry>
         <oasis:entry colname="col4">82.47 %</oasis:entry>
         <oasis:entry colname="col5">928</oasis:entry>
         <oasis:entry colname="col6">Urban</oasis:entry>
         <oasis:entry colname="col7">86.55 %</oasis:entry>
         <oasis:entry colname="col8">78.34 %</oasis:entry>
         <oasis:entry colname="col9">82.24 %</oasis:entry>
         <oasis:entry colname="col10">928</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Suburban</oasis:entry>
         <oasis:entry colname="col2">57.90 %</oasis:entry>
         <oasis:entry colname="col3">66.41 %</oasis:entry>
         <oasis:entry colname="col4">61.87 %</oasis:entry>
         <oasis:entry colname="col5">524</oasis:entry>
         <oasis:entry colname="col6">Suburban</oasis:entry>
         <oasis:entry colname="col7">56.80 %</oasis:entry>
         <oasis:entry colname="col8">68.51 %</oasis:entry>
         <oasis:entry colname="col9">62.11 %</oasis:entry>
         <oasis:entry colname="col10">524</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">Rural</oasis:entry>
         <oasis:entry colname="col2">85.27 %</oasis:entry>
         <oasis:entry colname="col3">82.35 %</oasis:entry>
         <oasis:entry colname="col4">83.78 %</oasis:entry>
         <oasis:entry colname="col5">527</oasis:entry>
         <oasis:entry colname="col6">Rural</oasis:entry>
         <oasis:entry colname="col7">85.01 %</oasis:entry>
         <oasis:entry colname="col8">81.78 %</oasis:entry>
         <oasis:entry colname="col9">83.37 %</oasis:entry>
         <oasis:entry colname="col10">527</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <fig id="F5" specific-use="star"><label>Figure 5</label><caption><p id="d2e2729"><bold>(a)</bold> Confusion matrix for the random forest classifier, evaluated on 1979 test data points. <bold>(b)</bold> Feature importance for the random forest classifier, measuring the contribution of each variable to the classification.</p></caption>
          <graphic xlink:href="https://gmd.copernicus.org/articles/19/5765/2026/gmd-19-5765-2026-f05.png"/>

        </fig>

      <p id="d2e2743">Applying the adjusted probability threshold yielded notable improvements in our classification models. While the changes for urban and rural station predictions were modest, the impact on the classification of suburban stations was substantial. This is particularly important given the inherent difficulty of identifying suburban zones. Compared to the unsupervised <inline-formula><mml:math id="M68" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means clustering method, our supervised approaches performed better across all categories. The contrast was especially pronounced for suburban areas, where <inline-formula><mml:math id="M69" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means reached an accuracy of just 15.84 %, a precision of 33.33 %, and an F1-score of 21.47 % (see Table <xref ref-type="table" rid="T3"/>). In contrast, our supervised methods achieved considerably higher accuracy, precision, and F1-score, underscoring their effectiveness in distinguishing urban, suburban, and rural stations.</p>
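<p>As a consistency check, the F1-score quoted for the suburban class can be recomputed as the harmonic mean of precision and recall. The sketch below assumes that the per-class accuracy quoted for <inline-formula><mml:math display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula>-means plays the role of recall; the random forest values are taken from Table <xref ref-type="table" rid="T7"/>a.</p>
<preformat preformat-type="code"><![CDATA[
```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in %)."""
    return 2.0 * precision * recall / (precision + recall)


# Suburban class, K-means: precision 33.33 %, per-class accuracy (recall) 15.84 %.
kmeans_suburban_f1 = f1_score(33.33, 15.84)

# Suburban class, random forest after threshold adjustment (Table 7a).
rf_suburban_f1 = f1_score(55.20, 71.95)

print(f"K-means suburban F1:       {kmeans_suburban_f1:.2f} %")
print(f"Random forest suburban F1: {rf_suburban_f1:.2f} %")
```
]]></preformat>
<p>Both values agree with the reported F1-scores to within rounding, which supports reading the per-class accuracy as recall.</p>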
</sec>
<sec id="Ch1.S3.SS3">
  <label>3.3</label><title>Discussion</title>
      <p id="d2e2770">The supervised machine learning approach performs well, achieving prediction accuracies <inline-formula><mml:math id="M70" display="inline"><mml:mo>&gt;</mml:mo></mml:math></inline-formula> 84 % for urban and rural stations when applied to previously unseen test data. It can therefore already be used to predict “urban” and “rural” labels accurately. While the models initially identified suburban areas with lower accuracy, this was partly addressed through the adjusted probability threshold.</p>
      <p id="d2e2780">Despite these promising results, the classification is far from perfect. This can be partially attributed to inaccuracies in the dataset itself. To investigate this issue, we manually inspected 30 randomly selected misclassifications. For these stations, which are listed in Table <xref ref-type="table" rid="T8"/>, we visually inspected the areas around the stations on Google Maps <xref ref-type="bibr" rid="bib1.bibx25" id="paren.38"/>, using a zoom level of 11 or greater. While 6 of these cases revealed wrong classifications by our best ML model, in 16 cases the model's classification is actually more accurate than the label reported by the data providers. In the remaining 8 cases, neither the reported nor the ML-derived label was correct. In one of these cases, both the reported and the ML-based labels were urban, while the station site is apparently located in a rural area. In another case, visual inspection would place the station in the suburban class, while the reported category is urban and the ML model classifies the station as rural. For the stations misclassified by our machine learning model, we observed ambiguous features that could mislead the model. For instance, the neighborhoods of stations reported as urban but classified as rural by our model exhibited rural characteristics, such as lower population density (one of the important features in the training dataset, see Fig. <xref ref-type="fig" rid="F5"/>b) and more green space. The finding that the model can be more accurate than the reported labels, while initially counter-intuitive, can be explained by the robustness of tree-based ensemble methods to noisy datasets. Algorithms such as Random Forest, CatBoost, and LightGBM are designed to mitigate the effects of noise and outliers through randomization and averaging techniques <xref ref-type="bibr" rid="bib1.bibx4" id="paren.39"/>. However, these 30 data points are insufficient to draw definitive conclusions.</p>
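<p>The “Winner” column of Table <xref ref-type="table" rid="T8"/> follows from comparing the reported label and the ML label with the label obtained by visual inspection. A minimal sketch of this bookkeeping is given below, using the first few rows of the table as input; the function name <monospace>winner</monospace> is introduced here for illustration only.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def winner(toar_label: str, ml_label: str, true_label: str) -> str:
    """Return which label source agrees with the visually inspected label.

    For misclassified stations the TOAR and ML labels differ, so at most
    one of them can match the inspected label.
    """
    if ml_label == true_label:
        return "ML"
    if toar_label == true_label:
        return "TOAR"
    return "None"


# (station code, TOAR label, ML label, inspected label), taken from Table 8.
rows = [
    ("06-037-4008", "suburban", "urban", "urban"),
    ("33-007-0015", "urban", "suburban", "rural"),
    ("IE004AP", "urban", "rural", "rural"),
    ("56-003-0003", "rural", "suburban", "rural"),
    ("47-157-0046", "suburban", "rural", "urban"),
]

winners = [winner(toar, ml, true) for _, toar, ml, true in rows]
tally = Counter(winners)
print(winners)
print(dict(tally))
```
]]></preformat>
<p>Applied to all 30 inspected stations, the same tally gives the counts quoted in the text.</p>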

<table-wrap id="T8" specific-use="star"><label>Table 8</label><caption><p id="d2e2796">Closer analysis of some misclassified station locations.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="7">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:colspec colnum="4" colname="col4" align="left"/>
     <oasis:colspec colnum="5" colname="col5" align="left"/>
     <oasis:colspec colnum="6" colname="col6" align="left"/>
     <oasis:colspec colnum="7" colname="col7" align="left"/>
     <oasis:thead>
       <oasis:row>
         <oasis:entry colname="col1">Latitude</oasis:entry>
         <oasis:entry colname="col2">Longitude</oasis:entry>
         <oasis:entry colname="col3">Station code</oasis:entry>
         <oasis:entry colname="col4">Type of area (TOAR)</oasis:entry>
         <oasis:entry colname="col5">Type of area (ML)</oasis:entry>
         <oasis:entry colname="col6">True type of area</oasis:entry>
         <oasis:entry colname="col7">Winner</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2"/>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6">(From Google Maps)</oasis:entry>
         <oasis:entry colname="col7"/>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">33.859662</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M71" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>118.200707</oasis:entry>
         <oasis:entry colname="col3">06-037-4008</oasis:entry>
         <oasis:entry colname="col4">suburban</oasis:entry>
         <oasis:entry colname="col5">urban</oasis:entry>
         <oasis:entry colname="col6">urban</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">44.470336</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M72" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>71.180077</oasis:entry>
         <oasis:entry colname="col3">33-007-0015</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">suburban</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">None</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">53.341875</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M73" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>6.214075</oasis:entry>
         <oasis:entry colname="col3">IE004AP</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">rural</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">42.876469</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M74" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>73.071215</oasis:entry>
         <oasis:entry colname="col3">50-003-0001</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">rural</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">44.307000</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M75" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>86.242649</oasis:entry>
         <oasis:entry colname="col3">26-101-0922</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">rural</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">52.132417</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M76" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>0.300306</oasis:entry>
         <oasis:entry colname="col3">GB0954A</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">suburban</oasis:entry>
         <oasis:entry colname="col6">suburban</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">44.835700</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M77" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>108.386000</oasis:entry>
         <oasis:entry colname="col3">56-003-0003</oasis:entry>
         <oasis:entry colname="col4">rural</oasis:entry>
         <oasis:entry colname="col5">suburban</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">TOAR</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">35.273460</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M78" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>89.961217</oasis:entry>
         <oasis:entry colname="col3">47-157-0046</oasis:entry>
         <oasis:entry colname="col4">suburban</oasis:entry>
         <oasis:entry colname="col5">rural</oasis:entry>
         <oasis:entry colname="col6">urban</oasis:entry>
         <oasis:entry colname="col7">None</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">40.734449</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M79" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>75.312389</oasis:entry>
         <oasis:entry colname="col3">42-095-1000</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">suburban</oasis:entry>
         <oasis:entry colname="col6">suburban</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">45.394410</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M80" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>93.885254</oasis:entry>
         <oasis:entry colname="col3">27-141-0012</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">rural</oasis:entry>
         <oasis:entry colname="col6">suburban</oasis:entry>
         <oasis:entry colname="col7">None</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">36.923100</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M81" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>2.463220</oasis:entry>
         <oasis:entry colname="col3">04024001</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">suburban</oasis:entry>
         <oasis:entry colname="col6">suburban</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">44.390480</oasis:entry>
         <oasis:entry colname="col2">8.201500</oasis:entry>
         <oasis:entry colname="col3">IT1233A</oasis:entry>
         <oasis:entry colname="col4">rural</oasis:entry>
         <oasis:entry colname="col5">suburban</oasis:entry>
         <oasis:entry colname="col6">suburban</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">48.762780</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M82" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>122.440280</oasis:entry>
         <oasis:entry colname="col3">53-073-0015</oasis:entry>
         <oasis:entry colname="col4">suburban</oasis:entry>
         <oasis:entry colname="col5">urban</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">None</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">50.575364</oasis:entry>
         <oasis:entry colname="col2">8.492018</oasis:entry>
         <oasis:entry colname="col3">DEHE095</oasis:entry>
         <oasis:entry colname="col4">suburban</oasis:entry>
         <oasis:entry colname="col5">urban</oasis:entry>
         <oasis:entry colname="col6">urban</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">30.958515</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M83" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>88.028332</oasis:entry>
         <oasis:entry colname="col3">01-097-0028</oasis:entry>
         <oasis:entry colname="col4">suburban</oasis:entry>
         <oasis:entry colname="col5">rural</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">31.813370</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M84" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>106.464520</oasis:entry>
         <oasis:entry colname="col3">48-141-0693</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">suburban</oasis:entry>
         <oasis:entry colname="col6">suburban</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">37.049260</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M85" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>86.214870</oasis:entry>
         <oasis:entry colname="col3">21-227-0009</oasis:entry>
         <oasis:entry colname="col4">suburban</oasis:entry>
         <oasis:entry colname="col5">rural</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">43.629605</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M86" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>72.309499</oasis:entry>
         <oasis:entry colname="col3">33-009-0010</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">rural</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">40.262540</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M87" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>89.230923</oasis:entry>
         <oasis:entry colname="col3">17-107-0001</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">suburban</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">None</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">33.089772</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M88" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>87.459733</oasis:entry>
         <oasis:entry colname="col3">01-125-0010</oasis:entry>
         <oasis:entry colname="col4">suburban</oasis:entry>
         <oasis:entry colname="col5">rural</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">44.003853</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M89" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>92.414896</oasis:entry>
         <oasis:entry colname="col3">27-109-0016</oasis:entry>
         <oasis:entry colname="col4">rural</oasis:entry>
         <oasis:entry colname="col5">suburban</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">TOAR</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">39.652473</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M90" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>104.925926</oasis:entry>
         <oasis:entry colname="col3">08-031-0823</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">suburban</oasis:entry>
         <oasis:entry colname="col6">suburban</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">47.689728</oasis:entry>
         <oasis:entry colname="col2">22.458500</oasis:entry>
         <oasis:entry colname="col3">RO0183A</oasis:entry>
         <oasis:entry colname="col4">suburban</oasis:entry>
         <oasis:entry colname="col5">urban</oasis:entry>
         <oasis:entry colname="col6">suburban</oasis:entry>
         <oasis:entry colname="col7">TOAR</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">43.144070</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M91" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>2.963370</oasis:entry>
         <oasis:entry colname="col3">01036004</oasis:entry>
         <oasis:entry colname="col4">suburban</oasis:entry>
         <oasis:entry colname="col5">urban</oasis:entry>
         <oasis:entry colname="col6">suburban</oasis:entry>
         <oasis:entry colname="col7">TOAR</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">32.891056</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M92" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>111.570503</oasis:entry>
         <oasis:entry colname="col3">04-021-3011</oasis:entry>
         <oasis:entry colname="col4">suburban</oasis:entry>
         <oasis:entry colname="col5">rural</oasis:entry>
         <oasis:entry colname="col6">suburban</oasis:entry>
         <oasis:entry colname="col7">TOAR</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">40.246528</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M93" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>77.186750</oasis:entry>
         <oasis:entry colname="col3">42-041-0101</oasis:entry>
         <oasis:entry colname="col4">rural</oasis:entry>
         <oasis:entry colname="col5">suburban</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">TOAR</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">60.294268</oasis:entry>
         <oasis:entry colname="col2">5.324619</oasis:entry>
         <oasis:entry colname="col3">NO0121A</oasis:entry>
         <oasis:entry colname="col4">suburban</oasis:entry>
         <oasis:entry colname="col5">urban</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">None</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">57.039915</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M94" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>135.272042</oasis:entry>
         <oasis:entry colname="col3">02-220-0007</oasis:entry>
         <oasis:entry colname="col4">suburban</oasis:entry>
         <oasis:entry colname="col5">rural</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">ML</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">62.486778</oasis:entry>
         <oasis:entry colname="col2">17.324437</oasis:entry>
         <oasis:entry colname="col3">SE0012A</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">suburban</oasis:entry>
         <oasis:entry colname="col6">rural</oasis:entry>
         <oasis:entry colname="col7">None</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">38.583226</oasis:entry>
         <oasis:entry colname="col2"><inline-formula><mml:math id="M95" display="inline"><mml:mi mathvariant="normal">−</mml:mi></mml:math></inline-formula>77.121900</oasis:entry>
         <oasis:entry colname="col3">11-001-0025</oasis:entry>
         <oasis:entry colname="col4">urban</oasis:entry>
         <oasis:entry colname="col5">rural</oasis:entry>
         <oasis:entry colname="col6">suburban</oasis:entry>
         <oasis:entry colname="col7">None</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <fig id="F6" specific-use="star"><label>Figure 6</label><caption><p id="d2e3777">Evaluation of the supervised classification with independent data: <bold>(a)</bold> box-and-whisker plot of the 75th percentile of NO<sub><italic>x</italic></sub> concentrations; <bold>(b)</bold> box-and-whisker plot of the 75th percentile of PM<sub>2.5</sub> concentrations.</p></caption>
          <graphic xlink:href="https://gmd.copernicus.org/articles/19/5765/2026/gmd-19-5765-2026-f06.png"/>

        </fig>

      <p id="d2e3810">To lend further confidence to our results, we evaluated the 75th-percentile statistics of the concentrations of the primary air pollutants NO<sub><italic>x</italic></sub> and PM<sub>2.5</sub> from the TOAR database. We chose the 75th percentile because urban areas typically exhibit fresh pollution with many high-concentration events. This percentile captures such characteristics while being more robust than either the maximum value or a higher percentile. While data on these species are incomplete, there are sufficient measurements from several regions to yield a meaningful statistic. Figure <xref ref-type="fig" rid="F6"/> shows box-and-whisker plots of the 75th percentiles of NO<sub><italic>x</italic></sub> and PM<sub>2.5</sub> concentrations aggregated for the year 2015 for the three classes. As expected, urban stations typically show substantially higher concentration levels than suburban sites, while rural stations show the lowest concentrations.</p>
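      <p>As a minimal sketch of how such a per-class summary could be computed (this is not the code used in this study), the following Python example derives the 75th percentile of each station's concentrations, groups these values by station class, and reports the quartiles that define the box of a box-and-whisker plot. The record layout, the function names, the linear-interpolation percentile, and the input numbers are assumptions made for illustration only; the numbers are synthetic and are not TOAR data.</p>
      <preformat preformat-type="code">
```python
from collections import defaultdict


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    # Fractional rank of the requested percentile in the sorted sample.
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


def station_p75_by_class(records):
    """Group per-station 75th percentiles by station class.

    `records` is an iterable of (station_id, station_class, concentrations),
    a hypothetical layout chosen for this illustration.
    """
    grouped = defaultdict(list)
    for _station_id, station_class, concentrations in records:
        grouped[station_class].append(percentile(concentrations, 75))
    return dict(grouped)


def box_stats(values):
    """Quartiles that define the box of a box-and-whisker plot."""
    return {
        "q1": percentile(values, 25),
        "median": percentile(values, 50),
        "q3": percentile(values, 75),
    }


# Synthetic numbers for illustration only; these are not TOAR data.
records = [
    ("A", "urban", [40.0, 55.0, 60.0, 80.0, 90.0]),
    ("B", "urban", [30.0, 45.0, 50.0, 70.0, 75.0]),
    ("C", "suburban", [20.0, 25.0, 30.0, 40.0, 50.0]),
    ("D", "rural", [5.0, 8.0, 10.0, 12.0, 20.0]),
]

p75 = station_p75_by_class(records)
summary = {cls: box_stats(vals) for cls, vals in p75.items()}
```
      </preformat>
      <p>Using per-station percentiles as the unit of the box plot keeps stations with long records from dominating a class; pooling all measurements of a class before taking the percentile would be an alternative design with different weighting.</p>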
</sec>
</sec>
<sec id="Ch1.S4" sec-type="conclusions">
  <label>4</label><title>Conclusion</title>
      <p id="d2e3860">We investigated the use of machine learning models to objectively characterize station locations for global air quality data analysis. Specifically, we wanted to improve the station classification in the TOAR-I database that was described by <xref ref-type="bibr" rid="bib1.bibx37" id="text.40"/> and base it on an objective algorithm. As a side effect, we can now explicitly label as suburban those stations that fell between the urban and rural categories in the TOAR-I classification scheme. Our proposed models demonstrate excellent prediction capabilities for urban and rural areas. With the help of an adjusted probability threshold technique, we also obtain meaningful results for the suburban category, inasmuch as this category can be described objectively at all. The evaluation of our method's accuracy is limited by obvious misclassifications of stations in official databases. As discussed in <xref ref-type="bibr" rid="bib1.bibx37" id="paren.41"/>, such errors can be introduced for various reasons. In some cases, we speculate that these misclassifications actually reflect true land cover changes (e.g., urban development) that have not been updated in the station metadata at the data providers' sites. Manual inspection of random test samples with disagreements revealed that the ML classifier was more often correct than the reported station type.</p>
      <p id="d2e3869">There is still room for improvement of the methods described here. On the one hand, a larger manual labelling effort using high-resolution EO data could reduce the number of wrong target labels and thus the noise in the training data. On the other hand, it may also be possible to employ modern ML methods with spatial context (e.g., <xref ref-type="bibr" rid="bib1.bibx41" id="altparen.42"/>) on such high-resolution EO data directly as a specialized land cover classification task. Nevertheless, the new TOAR station classifiers developed in this study provide a clear improvement over the previous method and can be employed in the TOAR-II ozone data analyses that will be reported in the forthcoming assessment papers.</p>
</sec>

      
      </body>
    <back><notes notes-type="codedataavailability"><title>Code and data availability</title>

      <p id="d2e3880">All code accompanying this paper is available in our GitLab and GitHub repositories (<uri>https://gitlab.jsc.fz-juelich.de/esde/toar-public/ml_toar_station_classification/-/tree/develop?ref_type=heads</uri>, last access: 24 June 2026; <uri>https://github.com/kmache/toar-classifier-v2</uri>, last access: 24 June 2026) and at Zenodo at the following link: <ext-link xlink:href="https://doi.org/10.5281/zenodo.15411286" ext-link-type="DOI">10.5281/zenodo.15411286</ext-link> <xref ref-type="bibr" rid="bib1.bibx23" id="paren.43"/>. This repository also contains a copy of the data used in this study as CSV files. The data can also be obtained directly from the TOAR-II database (<uri>https://toar-data.fz-juelich.de/api/v2</uri>, last access: 24 June 2026).</p>
  </notes><notes notes-type="authorcontribution"><title>Author contributions</title>

      <p id="d2e3901">RKM, SS, and MGS designed the study based on previous work by MGS and SS. RKM, AP, and ML developed the methodology; RKM implemented the methods and evaluated the results. RKM wrote the major part of the text with contributions from all authors. MGS conducted a final review prior to submission.</p>
  </notes><notes notes-type="competinginterests"><title>Competing interests</title>

      <p id="d2e3907">The contact author has declared that none of the authors has any competing interests.</p>
  </notes><notes notes-type="disclaimer"><title>Disclaimer</title>

      <p id="d2e3913">Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. The authors bear the ultimate responsibility for providing appropriate place names. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.</p>
  </notes><ack><title>Acknowledgements</title><p id="d2e3919">The authors are grateful to the EU for funding the IntelliAQ project under grant ERC-AdG-787576. This allowed the buildup of the TOAR-II database. We also greatly appreciate the effort from hundreds of people around the world who established and operate air quality stations, process the data and make the data available to the TOAR initiative. Sebastian Hickman deserves gratitude for his initial analysis of NO<sub><italic>x</italic></sub> and PM<sub>2.5</sub> data in the TOAR-II database and helpful discussions.</p></ack><notes notes-type="financialsupport"><title>Financial support</title>

      <p id="d2e3942">This research has been supported by the European Research Council, H2020 European Research Council (grant no. 787576). The article processing charges for this open-access publication were covered by the Forschungszentrum Jülich.</p>
  </notes><notes notes-type="reviewstatement"><title>Review statement</title>

      <p id="d2e3953">This paper was edited by Jason Williams and reviewed by Frank Techel and one anonymous referee.</p>
  </notes><ref-list>
    <title>References</title>

      <ref id="bib1.bibx1"><label>Abdi and Williams(2010)</label><mixed-citation> Abdi, H. and Williams, L. J.: Principal component analysis, Wiley Interdisciplinary Reviews: Computational Statistics, 2, 433–459, 2010.</mixed-citation></ref>
      <ref id="bib1.bibx2"><label>Airgood-Obrycki and Rieger(2019)</label><mixed-citation>Airgood-Obrycki, W. and Rieger, S.: Defining suburbs: How definitions shape the suburban landscape, Joint Center for Housing Studies of Harvard University, <uri>https://www.jchs.harvard.edu/research-areas/working-papers/defining-suburbs-how-definitions-shape-suburban-landscape</uri> (last access: 24 June 2026), 2019.</mixed-citation></ref>
      <ref id="bib1.bibx3"><label>Bahmani et al.(2012)Bahmani, Moseley, Vattani, Kumar, and Vassilvitskii</label><mixed-citation>Bahmani, B., Moseley, B., Vattani, A., Kumar, R., and Vassilvitskii, S.: Scalable k-means++, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.1203.6402" ext-link-type="DOI">10.48550/arXiv.1203.6402</ext-link>, 2012.</mixed-citation></ref>
      <ref id="bib1.bibx4"><label>Biau and Scornet(2016)</label><mixed-citation> Biau, G. and Scornet, E.: A random forest guided tour, Test, 25, 197–227, 2016.</mixed-citation></ref>
      <ref id="bib1.bibx5"><label>Breiman(2001)</label><mixed-citation> Breiman, L.: Random forests, Machine Learning, 45, 5–32, 2001.</mixed-citation></ref>
      <ref id="bib1.bibx6"><label>Brodersen et al.(2010)Brodersen, Ong, Stephan, and Buhmann</label><mixed-citation>Brodersen, K. H., Ong, C. S., Stephan, K. E., and Buhmann, J. M.: The balanced accuracy and its posterior distribution, in: 2010 20th international conference on pattern recognition,  3121–3124, IEEE, <ext-link xlink:href="https://doi.org/10.1109/ICPR.2010.764" ext-link-type="DOI">10.1109/ICPR.2010.764</ext-link>, 2010.</mixed-citation></ref>
      <ref id="bib1.bibx7"><label>Chacón and Rastrojo(2023)</label><mixed-citation> Chacón, J. E. and Rastrojo, A. I.: Minimum adjusted Rand index for two clusterings of a given size, Advances in Data Analysis and Classification, 17, 125–133, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx8"><label>Chawla et al.(2002)Chawla, Bowyer, Hall, and Kegelmeyer</label><mixed-citation> Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P.: SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16, 321–357, 2002.</mixed-citation></ref>
      <ref id="bib1.bibx9"><label>Chekir et al.(2017)Chekir, Hassas, Descoteaux, Côté, Garyfallidis, and Oulebsir-Boumghar</label><mixed-citation> Chekir, A., Hassas, S., Descoteaux, M., Côté, M., Garyfallidis, E., and Oulebsir-Boumghar, F.: 3D-SSF: A bio-inspired approach for dynamic multi-subject clustering of white matter tracts, Computers in Biology and Medicine, 83, 10–21, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx10"><label>Cooper et al.(2014)Cooper, Parrish, Ziemke, Balashov, Cupeiro, Galbally, Gilge, Horowitz, Jensen, Lamarque et al.</label><mixed-citation>Cooper, O. R., Parrish, D., Ziemke, J., Balashov, N., Cupeiro, M., Galbally, I., Gilge, S., Horowitz, L., Jensen, N., Lamarque, J.-F., Naik, V., Oltmans, S. J., Schwab, J., Shindell, D. T., Thompson, A. M., Thouret, V., Wang, Y., and Zbinden, R. M.: Global distribution and trends of tropospheric ozone: An observation-based review, Elementa: Science of the Anthropocene, 2, 000029, <ext-link xlink:href="https://doi.org/10.12952/journal.elementa.000029" ext-link-type="DOI">10.12952/journal.elementa.000029</ext-link>, 2014.</mixed-citation></ref>
      <ref id="bib1.bibx11"><label>Fleming et al.(2018)Fleming, Payne, Sweet, Craghan, Haines, Hart et al.</label><mixed-citation>Fleming, E., Payne, J., Sweet, W., Craghan, M., Haines, J., Hart, J., Stiller, H., and Sutton-Grier, A.: Coastal Effects, in:  Impacts, Risks, and Adaptation in the United States: Fourth National Climate Assessment, Volume II, edited by: Reidmiller, D. R., Avery, C. W., Easterling, D. R., Kunkel, K. E., Lewis, K. L. M., Maycock, T. K., and Stewart, B. C., 322–352. U.S. Global Change Research Program, Washington, DC, USA, <uri>https://pubs.usgs.gov/publication/70201869</uri> (last access: 29 June 2026), 2018.</mixed-citation></ref>
      <ref id="bib1.bibx12"><label>Florczyk et al.(2019)Florczyk, Corbane, Ehrlich, Freire, Kemper, Maffenini, Melchiorri, Pesaresi, Politis, Schiavina et al.</label><mixed-citation>Florczyk, A. J., Corbane, C., Ehrlich, D., Freire, S., Kemper, T., Maffenini, L., Melchiorri, M., Pesaresi, M., Politis, P., and Schiavina, M.: GHSL data package 2019, Publications Office of the European Union, Luxembourg, <ext-link xlink:href="https://doi.org/10.2760/290498" ext-link-type="DOI">10.2760/290498</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx13"><label>Gaudel et al.(2018)Gaudel, Cooper, Ancellet, Barret, Boynard, Burrows, Clerbaux, Coheur, Cuesta, Cuevas et al.</label><mixed-citation>Gaudel, A., Cooper, O. R., Ancellet, G., Barret, B., Boynard, A., Burrows, J. P., Clerbaux, C., Coheur, P.-F., Cuesta, J., Cuevas, E., Eskes, H., van Roozendael, M., Ziemke, J. R., Liu, X., Tarasick, D. W., Thouret, V., Thompson, A. M., Witte, J. C., Safieddine, S., Steinbrecht, W., Stübi, R., Trickl, T., Wang, T., Vigouroux, C., Xu, X., Wagner, A., and Yu, H.: Tropospheric Ozone Assessment Report: Present-day distribution and trends of tropospheric ozone relevant to climate and global atmospheric chemistry model evaluation, Elementa: Science of the Anthropocene, 6, 39, <ext-link xlink:href="https://doi.org/10.1525/elementa.291" ext-link-type="DOI">10.1525/elementa.291</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx14"><label>Granier et al.(2019)Granier, Darras, van Der Gon, Jana, Elguindi, Bo, Michael, Marc, Jalkanen, Kuenen et al.</label><mixed-citation>Granier, C., Darras, S., van Der Gon, H. D., Jana, D., Elguindi, N., Bo, G., Michael, G., Marc, G., Jalkanen, J.-P., Kuenen, J., Liousse, C., Quack, B., Simpson, D., and Sindelarova, K.: The Copernicus atmosphere monitoring service global and regional emissions (April 2019 version), PhD thesis, Copernicus Atmosphere Monitoring Service, <ext-link xlink:href="https://doi.org/10.24380/d0bn-kx16" ext-link-type="DOI">10.24380/d0bn-kx16</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx15"><label>Griffiths et al.(2021)Griffiths, Murray, Zeng, Shin, Abraham, Archibald, Deushi, Emmons, Galbally, Hassler et al.</label><mixed-citation>Griffiths, P. T., Murray, L. T., Zeng, G., Shin, Y. M., Abraham, N. L., Archibald, A. T., Deushi, M., Emmons, L. K., Galbally, I. E., Hassler, B., Horowitz, L. W., Keeble, J., Liu, J., Moeini, O., Naik, V., O'Connor, F. M., Oshima, N., Tarasick, D., Tilmes, S., Turnock, S. T., Wild, O., Young, P. J., and Zanis, P.: Tropospheric ozone in CMIP6 simulations, Atmos. Chem. Phys., 21, 4187–4218, <ext-link xlink:href="https://doi.org/10.5194/acp-21-4187-2021" ext-link-type="DOI">10.5194/acp-21-4187-2021</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx16"><label>Harris and Jones(2017)</label><mixed-citation>Harris, I. and Jones, P.: University of East Anglia Climatic Research Unit 2017 CRU TS4. 00: Climatic Research Unit (CRU) Time-Series (TS) version 4.00 of high-resolution gridded data of month-by-month variation in climate (Jan. 1901–Dec. 2015), Chilton, Oxfordshire, Centre for Environmental Data Analysis, <ext-link xlink:href="https://doi.org/10.5285/edf8febfdaad48abb2cbaf7d7e846a86" ext-link-type="DOI">10.5285/edf8febfdaad48abb2cbaf7d7e846a86</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx17"><label>Hesse and Siedentop(2018)</label><mixed-citation> Hesse, M. and Siedentop, S.: Suburbanisation and suburbanisms – Making sense of continental European developments, Raumforschung und Raumordnung [Spatial Research and Planning], 76, 97–108, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx18"><label>Jarvis et al.(2008)Jarvis, Reuter, Nelson, Guevara et al.</label><mixed-citation>Jarvis, A., Reuter, H. I., Nelson, A., and Guevara, E.: Hole-filled SRTM for the globe Version 4,  CGIAR-CSI SRTM 90m Database, CGIAR Consortium for Spatial Information, <uri>http://srtm.csi.cgiar.org</uri> (last access: 24 June 2026), 2008.</mixed-citation></ref>
      <ref id="bib1.bibx19"><label>Ke et al.(2017)Ke, Meng, Finley, Wang, Chen, Ma, Ye, and Liu</label><mixed-citation>Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y.: Lightgbm: A highly efficient gradient boosting decision tree, Advances in Neural Information Processing Systems, 30, <uri>https://dl.acm.org/doi/10.5555/3294996.3295074</uri> (last access: 29 June 2026), 2017.</mixed-citation></ref>
      <ref id="bib1.bibx20"><label>Ketchen and Shook(1996)</label><mixed-citation> Ketchen, D. J. and Shook, C. L.: The application of cluster analysis in strategic management research: an analysis and critique, Strategic Management Journal, 17, 441–458, 1996.</mixed-citation></ref>
      <ref id="bib1.bibx21"><label>Kroehl(1982)</label><mixed-citation>Kroehl, H.: National Geophysical and Solar-Terrestrial Data Center, EDIS, NOAA, Boulder, Colorado 80303, American Geophysical Union, p. 98, <uri>https://www.ncei.noaa.gov/products/space-weather/legacy-data/publications</uri> (last access: 24 June 2026), 1982.</mixed-citation></ref>
      <ref id="bib1.bibx22"><label>Kvålseth(2017)</label><mixed-citation>Kvålseth, T. O.: On normalized mutual information: measure derivations and properties, Entropy, 19, 631, <ext-link xlink:href="https://doi.org/10.3390/e19110631" ext-link-type="DOI">10.3390/e19110631</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx23"><label>Mache et al.(2025)Mache, Schröder, Langguth, Patnala, and Schultz</label><mixed-citation>Mache, R. K., Schröder, S., Langguth, M., Patnala, A., and Schultz, M. G.: TOAR-classifier v2: A data-driven classification tool for global air quality stations, Zenodo [code, data set], <ext-link xlink:href="https://doi.org/10.5281/zenodo.15411286" ext-link-type="DOI">10.5281/zenodo.15411286</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx24"><label>Madronich et al.(2023)Madronich, Sulzberger, Longstreth, Schikowski, Andersen, Solomon, and Wilson</label><mixed-citation> Madronich, S., Sulzberger, B., Longstreth, J., Schikowski, T., Andersen, M. S., Solomon, K., and Wilson, S.: Changes in tropospheric air quality related to the protection of stratospheric ozone in a changing climate, Photochemical &amp; Photobiological Sciences, 22, 1129–1176, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx25"><label>Mehta et al.(2019)Mehta, Kanani, and Lande</label><mixed-citation> Mehta, H., Kanani, P., and Lande, P.: Google maps, International Journal of Computer Applications, 178, 41–46, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx26"><label>Mills et al.(2018)Mills, Brown, Laney, Ortega-Retuerta, Lowry, van Dijken, and Arrigo</label><mixed-citation>Mills, M. M., Brown, Z. W., Laney, S. R., Ortega-Retuerta, E., Lowry, K. E., van Dijken, G. L., and Arrigo, K. R.: Nitrogen limitation of the summer phytoplankton and heterotrophic prokaryote communities in the Chukchi Sea, Frontiers in Marine Science, 5, 362, <ext-link xlink:href="https://doi.org/10.3389/fmars.2018.00362" ext-link-type="DOI">10.3389/fmars.2018.00362</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx27"><label>Monks et al.(2015)Monks, Archibald, Colette, Cooper, Coyle, Derwent, Fowler, Granier, Law, Mills et al.</label><mixed-citation>Monks, P. S., Archibald, A. T., Colette, A., Cooper, O., Coyle, M., Derwent, R., Fowler, D., Granier, C., Law, K. S., Mills, G. E., Stevenson, D. S., Tarasova, O., Thouret, V., von Schneidemesser, E., Sommariva, R., Wild, O., and Williams, M. L.: Tropospheric ozone and its precursors from the urban to the global scale from air quality to short-lived climate forcer, Atmos. Chem. Phys., 15, 8889–8973, <ext-link xlink:href="https://doi.org/10.5194/acp-15-8889-2015" ext-link-type="DOI">10.5194/acp-15-8889-2015</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx28"><label>Orru et al.(2013)Orru, Andersson, Ebi, Langner, Åström, and Forsberg</label><mixed-citation> Orru, H., Andersson, C., Ebi, K. L., Langner, J., Åström, C., and Forsberg, B.: Impact of climate change on ozone-related mortality and morbidity in Europe, European Respiratory Journal, 41, 285–294, 2013.</mixed-citation></ref>
      <ref id="bib1.bibx29"><label>Pedregosa et al.(2011a)Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot, and Duchesnay</label><mixed-citation> Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E.: Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, 2825–2830, 2011a.</mixed-citation></ref>
      <ref id="bib1.bibx30"><label>Pedregosa et al.(2011b)Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg et al.</label><mixed-citation> Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, É.: Scikit-learn: Machine learning in Python,  Journal of Machine Learning Research, 12, 2825–2830, 2011b.</mixed-citation></ref>
      <ref id="bib1.bibx31"><label>Pelleg and Moore(2000)Pelleg, Moore et al.</label><mixed-citation> Pelleg, D. and Moore, A.: X-means: Extending K-means with Efficient Estimation of the Number of Clusters, in: Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), 727–734, Morgan Kaufmann, San Francisco, CA, 2000.</mixed-citation></ref>
      <ref id="bib1.bibx32"><label>Post et al.(2012)Post, Grambsch, Weaver, Morefield, Huang, Leung, Nolte, Adams, Liang, Zhu et al.</label><mixed-citation> Post, E. S., Grambsch, A., Weaver, C., Morefield, P., Huang, J., Leung, L.-Y., Nolte, C. G., Adams, P., Liang, X.-Z., Zhu, J.-H., and Mahoney, H.: Variation in estimated ozone-related health impacts of climate change due to modeling choices and assumptions, Environmental Health Perspectives, 120, 1559–1564, 2012.</mixed-citation></ref>
      <ref id="bib1.bibx33"><label>Powers(2020)</label><mixed-citation>Powers, D. M.: Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.2010.16061" ext-link-type="DOI">10.48550/arXiv.2010.16061</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx34"><label>Prokhorenkova et al.(2018)Prokhorenkova, Gusev, Vorobev, Dorogush, and Gulin</label><mixed-citation>Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A.: CatBoost: unbiased boosting with categorical features, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.1810.11363" ext-link-type="DOI">10.48550/arXiv.1810.11363</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx35"><label>Rubinsteyn and Feldman(2016)</label><mixed-citation>Rubinsteyn, A. and Feldman, S.: Fancyimpute: An Imputation Library for Python, GitHub, <uri>https://github.com/iskandr/fancyimpute</uri> (last access: 24 June 2026), 2016.</mixed-citation></ref>
      <ref id="bib1.bibx36"><label>Schultz et al.(2015)Schultz, Akimoto, Bottenheim, Buchmann, Galbally, Gilge, Helmig, Koide, Lewis, Novelli et al.</label><mixed-citation>Schultz, M. G., Akimoto, H., Bottenheim, J., Buchmann, B., Galbally, I. E., Gilge, S., Helmig, D., Koide, H., Lewis, A. C., Novelli, P. C., Plass-Dülmer, C., Ryerson, T. B., Steinbacher, M., Steinbrecher, R., Tarasova, O., Tørseth, K., Thouret, V., and Zellweger, C.: The Global Atmosphere Watch reactive gases measurement network, Elementa: Science of the Anthropocene, 3, 000067, <ext-link xlink:href="https://doi.org/10.12952/journal.elementa.000067" ext-link-type="DOI">10.12952/journal.elementa.000067</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx37"><label>Schultz et al.(2017a)Schultz, Schröder, Lyapina, Cooper, Galbally, Petropavlovskikh, Von Schneidemesser, Tanimoto, Elshorbany, Naja et al.</label><mixed-citation> Schultz, M. G., Schröder, S., Lyapina, O., Cooper, O. R., Galbally, I., Petropavlovskikh, I., Von Schneidemesser, E., Tanimoto, H., Elshorbany, Y., Naja, M., Seguel, R. J., Dauert, U., Eckhardt, P., Feigenspan, S., Fiebig, M., Hjellbrekke, A.-G., Hong, Y.-D., Kjeld, P. C., Koide, H., Lear, G., Tarasick, D., Ueno, M., Wallasch, M., Baumgardner, D., Chuang, M.-T., Gillett, R., Lee, M., Molloy, S., Moolla, R., Wang, T., Sharps, K., Adame, J. A., Ancellet, G., Apadula, F., Artaxo, P., Barlasina, M. E., Bogucka, M., Bonasoni, P., Chang, L., Colomb, A., Cuevas-Agulló, E., Cupeiro, M., Degorska, A., Ding, A., Fröhlich, M., Frolova, M., Gadhavi, H., Gheusi, F., Gilge, S., Gonzalez, M. Y., Gros, V., Hamad, S. H., Helmig, D., Henriques, D., Hermansen, O., Holla, R., Hueber, J., Im, U., Jaffe, D. A., Komala, N., Kubistin, D., Lam, K.-S., Laurila, T., Lee, H., Levy, I., Mazzoleni, C., Mazzoleni, L. R., McClure-Begley, A., Mohamad, M., Murovec, M., Navarro-Comas, M., Nicodim, F., Parrish, D., Read, K. A., Reid, N., Ries, L., Saxena, P., Schwab, J. J., Scorgie, Y., Senik, I., Simmonds, P., Sinha, V., Skorokhod, A. I., Spain, G., Spangl, W., Spoor, R., Springston, S. R., Steer, K., Steinbacher, M., Suharguniyawan, E., Torre, P., Trickl, T., Weili, L., Weller, R., Xu, X., Xue, L., and Ma, Z.: Tropospheric Ozone Assessment Report: Database and metrics data of global surface ozone observations, Elem. Sci. Anth., 5, 58, 2017a.</mixed-citation></ref>
      <ref id="bib1.bibx38"><label>Schultz et al.(2017b)Schultz, Schröder, Lyapina, Cooper, Galbally, Petropavlovskikh et al.</label><mixed-citation>Schultz, M. G., Schröder, S., Lyapina, O., Cooper, O. R., Galbally, I., Petropavlovskikh, I., von Schneidemesser, E., Tanimoto, H., Elshorbany, Y., Naja, M., Seguel, R. J., Dauert, U., Eckhardt, P., Feigenspan, S., Fiebig, M., Hjellbrekke, A.-G., Hong, Y.-D., Kjeld, P. C., Koide, H., Lear, G., Tarasick, D., Ueno, M., Wallasch, M., Baumgardner, D., Chuang, M.-T., Gillett, R., Lee, M., Molloy, S., Moolla, R., Wang, T., Sharps, K., Adame, J. A., Ancellet, G., Apadula, F., Artaxo, P., Barlasina, M. E., Bogucka, M., Bonasoni, P., Chang, L., Colomb, A., Cuevas-Agulló, E., Cupeiro, M., Degorska, A., Ding, A., Fröhlich, M., Frolova, M., Gadhavi, H., Gheusi, F., Gilge, S., Gonzalez, M. Y., Gros, V., Hamad, S. H., Helmig, D., Henriques, D., Hermansen, O., Holla, R., Hueber, J., Im, U., Jaffe, D. A., Komala, N., Kubistin, D., Lam, K.-S., Laurila, T., Lee, H., Levy, I., Mazzoleni, C., Mazzoleni, L. R., McClure-Begley, A., Mohamad, M., Murovec, M., Navarro-Comas, M., Nicodim, F., Parrish, D., Read, K. A., Reid, N., Ries, L., Saxena, P., Schwab, J. J., Scorgie, Y., Senik, I., Simmonds, P., Sinha, V., Skorokhod, A. I., Spain, G., Spangl, W., Spoor, R., Springston, S. R., Steer, K., Steinbacher, M., Suharguniyawan, E., Torre, P., Trickl, T., Weili, L., Weller, R., Xu, X., Xue, L., and Ma, Z.: Tropospheric Ozone Assessment Report: Database and metrics data of global surface ozone observations, Elementa: Science of the Anthropocene, 5, 58, <ext-link xlink:href="https://doi.org/10.1525/elementa.244" ext-link-type="DOI">10.1525/elementa.244</ext-link>, 2017b.</mixed-citation></ref>
      <ref id="bib1.bibx39"><label>Sensoressa(2025)</label><mixed-citation>Sensoressa, N. A.: Balanced Accuracy: When Should You Use It?, Neptune.ai, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx40"><label>Sinaga and Yang(2020)</label><mixed-citation> Sinaga, K. P. and Yang, M.-S.: Unsupervised K-means clustering algorithm, IEEE Access, 8, 80716–80727, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx41"><label>Szwarcman et al.(2024)Szwarcman, Roy, Fraccaro, Gíslason, Blumenstiel, Ghosal, de Oliveira, Almeida, Sedona, Kang et al.</label><mixed-citation>Szwarcman, D., Roy, S., Fraccaro, P., Gíslason, Þ. E., Blumenstiel, B., Ghosal, R., de Oliveira, P. H., Almeida, J. L. d. S., Sedona, R., Kang, Y., Chakraborty, S., Wang, S., Gomes, C., Kumar, A., Truong, M., Godwin, D., Lee, H., Hsu, C.-Y., Lal, R., Asanjan, A. A., Mujeci, B., Shidham, D., Keenan, T., Arevalo, P., Li, W., Alemohammad, H., Olofsson, P., Hain, C., Kennedy, R., Zadrozny, B., Bell, D., Cavallaro, G., Watson, C., Maskey, M., Ramachandran, R., and Moreno, J. B.: Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.2412.02732" ext-link-type="DOI">10.48550/arXiv.2412.02732</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx42"><label>Tapia et al.(2016)Tapia, Escudero, Lozano, Anzano, and Mantilla</label><mixed-citation> Tapia, O., Escudero, M., Lozano, Á., Anzano, J., and Mantilla, E.: New classification scheme for ozone monitoring stations based on frequency distribution of hourly data, Science of The Total Environment, 544, 1–9, 2016.</mixed-citation></ref>
      <ref id="bib1.bibx43"><label>Teakles et al.(2017)Teakles, So, Ainslie, Nissen, Schiller, Vingarzan, McKendry, Macdonald, Jaffe, Bertram et al.</label><mixed-citation>Teakles, A. D., So, R., Ainslie, B., Nissen, R., Schiller, C., Vingarzan, R., McKendry, I., Macdonald, A. M., Jaffe, D. A., Bertram, A. K., Strawbridge, K. B., Leaitch, W. R., Hanna, S., Toom, D., Baik, J., and Huang, L.: Impacts of the July 2012 Siberian fire plume on air quality in the Pacific Northwest, Atmos. Chem. Phys., 17, 2593–2611, <ext-link xlink:href="https://doi.org/10.5194/acp-17-2593-2017" ext-link-type="DOI">10.5194/acp-17-2593-2017</ext-link>, 2017. </mixed-citation></ref>
      <ref id="bib1.bibx44"><label>Van Rossum(2007)</label><mixed-citation>Van Rossum, G.: Python programming language, in: USENIX annual technical conference, Vol. 41, 1–36, Santa Clara, CA, <uri>https://dblp.org/db/conf/usenix/usenix2007</uri> (last access: 24 June 2026), 2007.</mixed-citation></ref>
      <ref id="bib1.bibx45"><label>Zhang et al.(2024)Zhang, Sun, Li, and Li</label><mixed-citation>Zhang, L., Sun, Y., Li, C., and Li, B.: Promoting Sustainable Development in Urban–Rural Areas: A New Approach for Evaluating the Policies of Characteristic Towns in China, Buildings, 14, 1085, <ext-link xlink:href="https://doi.org/10.3390/buildings14041085" ext-link-type="DOI">10.3390/buildings14041085</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx46"><label>Zhou et al.(2022)Zhou, Li, and Zhang</label><mixed-citation>Zhou, M., Li, Y., and Zhang, F.: Spatiotemporal variation in ground level ozone and its driving factors: a comparative study of coastal and inland cities in eastern China, International Journal of Environmental Research and Public Health, 19, 9687, <ext-link xlink:href="https://doi.org/10.3390/ijerph19159687" ext-link-type="DOI">10.3390/ijerph19159687</ext-link>, 2022.</mixed-citation></ref>

  </ref-list></back>
</article>
