Using Machine Learning to Characterise Biomes

Jeremy Walton1,2, Robert Parker1,3, Ranjini Swaminathan1,4, Doug Kelley1,5, Mohamed Redha Sidoumou6 and Alisa Kim7

1UKESM core group; 2Met Office Hadley Centre for Climate Science and Services, Exeter, UK; 3National Centre for Earth Observation, University of Leicester, UK; 4National Centre for Earth Observation, Department of Meteorology, University of Reading, UK; 5UK Centre for Ecology and Hydrology, Wallingford, UK; 6Amazon Web Services, UK; 7Amazon Web Services, Germany

A biome is a community of plants with similar characteristics – for example, grassland, tropical rainforest or desert – which forms in response to a shared climate. The distribution of biomes affects life on Earth; in turn, biomes are affected by human activity and climate change. A reliable method for characterizing biomes and their dynamics would therefore be helpful when looking for signs of climate change. However, current characterization methods (e.g. [1]) usually involve the empirical examination of observational data, looking for patterns and relationships within a small set of features using a limited number of snapshots. This constrains their efficacy and complicates their extension to climatic (temporal and spatial) scales. Similarly, the identification of biomes from model results (instead of observations) would be of interest, but is difficult using current methods.

In the search for methods which are more objective, automated and data-driven, we have investigated the characterization of biomes using machine learning techniques [2]. More specifically, we start by partitioning land surface points into clusters defined by a set of features – that is, points which have similar values for the features are clustered together. Our feature set comprises a mixture of land surface (for example, tree cover, population density) and climate (e.g. mean annual precipitation, mean annual dry days) variables, chosen to reflect the properties of large-scale climate and vegetation distributions. In our initial study [2], we used 15 variables, one of which is shown in Figure 1.

Figure 1.  One of the datasets used as input to our cluster analysis – here, the mean annual number of dry days (MADD), which is a measure of rainfall seasonality.  This comes from v4.01 of the Climatic Research Unit (CRU) Time Series high-resolution dataset.

We preprocess the data by first regridding the variables onto an N96 (1.875° x 1.25° spatial resolution) climate model grid, and then normalizing each variable so that its mean is zero and its variance is one. After stacking the variables at each grid point, we use a segmentation algorithm to partition the data into regions with similar properties; each segment is then characterized by the median value of each variable – that is, by a vector with 15 components. Segmentation reduces the amount of noise in the data (by removing points whose values could be considered outliers), which we found to be a useful step prior to clustering. Points corresponding to the ocean, and any other locations (such as Antarctica) which have no terrestrial biomes of interest, are also removed at this stage (see Figure 2).
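The normalization and stacking steps can be sketched as follows. This is a minimal numpy illustration using synthetic data in place of the real variables (the grid size, NaN mask and random values are placeholders, and the segmentation algorithm itself is omitted):

```python
import numpy as np

# Hypothetical stack of 15 variables on an N96-like grid (var, lat, lon),
# with NaN standing in for ocean points and excluded regions.
n_vars, n_lat, n_lon = 15, 144, 192
rng = np.random.default_rng(0)
data = rng.normal(size=(n_vars, n_lat, n_lon))
data[:, :10, :] = np.nan  # pretend these latitudes are ocean/ice

# Normalise each variable to zero mean and unit variance,
# ignoring masked (NaN) points.
mean = np.nanmean(data, axis=(1, 2), keepdims=True)
std = np.nanstd(data, axis=(1, 2), keepdims=True)
normed = (data - mean) / std

# Stack the variables at each grid point into a 15-component feature
# vector, and drop points with any missing value.
features = normed.reshape(n_vars, -1).T          # shape: (points, 15)
land = ~np.isnan(features).any(axis=1)
X = features[land]
```

The resulting array `X` (one row per land point, one column per variable) is the kind of input the subsequent segmentation and clustering steps operate on.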

Figure 2.  Regions (coloured according to latitude) following the application of a segmentation algorithm, and removal of segments with no terrestrial biomes.

Having preprocessed the data, we identify clusters using the k-means algorithm. This is a distance-based method which groups points with similar feature values by assigning each point to the cluster with the nearest centroid, and then moving the centroids to minimise the distance between each point and its associated centroid. K-means requires the number of clusters (i.e. k) to be set at the start of the clustering process. This can be chosen empirically, but we employed three different evaluation metrics to find the best fit and settled on an optimum value of 11 [2]. Besides being consistent with the metrics, this value is also comparable with expectations from domain-specific empirical approaches to biome characterization. Figure 3 shows the results of applying k-means (with k=11) to the segmented data of Figure 2.
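A sketch of this selection procedure, using scikit-learn with synthetic data in place of the segment median vectors, might look like the following. (The study combined three evaluation metrics; the silhouette score is used here as a single illustrative example, not the actual combination used in [2].)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Synthetic stand-in for the segment median vectors (15 features each).
X = rng.normal(size=(300, 15))

# Scan candidate values of k, scoring each clustering.
scores = {}
for k in range(2, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the best score.
best_k = max(scores, key=scores.get)
```

With real data, the scores for different metrics would be inspected together before fixing k; on the synthetic noise above, `best_k` carries no physical meaning.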

Figure 3.  Clusters determined using the k-means algorithm, identifying 11 biomes. Colours distinguish the clusters, but have no other significance.

The cluster results appear qualitatively similar to the biomes derived from empirical methods [1], which is encouraging. However, they are difficult to explain. A point is assigned to a cluster according to how similar its features are to those of the other points in that cluster. Each cluster consists of points which are closer to its centroid than to any other centroid, but the divisions between clusters are, in general, hyperplanes in feature space (which, it will be recalled, has 15 dimensions in our case) defined by linear combinations of feature values. Explainability would be easier if each hyperplane were parallel to one of the feature axes – that is, dependent on the value of a single variable only [3].

To achieve this, we have applied a supervised learning method to approximate the clusters using a decision tree [4]. The tree can be viewed as a predictive model which maps feature values to classes. The root node (the first node in the decision-making process) contains a rule which compares the value of a single feature at a point against a fixed threshold; the two paths from the node correspond to the rule being true or false. The same applies to the internal nodes of the tree. Branches then represent conjunctions of rules for different feature values, whilst leaf nodes – that is, nodes with no downstream paths – represent classes (in this case, clusters). The tree allows points to be assigned to clusters using a series of rules, each of which depends on the value of a single feature. This aids comparison with empirical characterization methods.
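The idea of approximating cluster labels with a tree of single-feature rules can be sketched with scikit-learn (synthetic data and placeholder variable names; the authors' actual implementation may differ):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 15))                     # synthetic feature vectors
clusters = KMeans(n_clusters=11, n_init=10, random_state=0).fit_predict(X)

# Fit a tree to predict the cluster labels: every split tests a single
# feature against a threshold, so each path is a conjunction of
# one-variable rules.
tree = DecisionTreeClassifier(random_state=0).fit(X, clusters)

# Fidelity: fraction of points the tree assigns to the same cluster.
agreement = (tree.predict(X) == clusters).mean()

# The rules can be printed for inspection, e.g. "var3 <= 0.12".
rules = export_text(tree, feature_names=[f"var{i}" for i in range(15)])
```

A fully grown tree reproduces the training labels exactly; the interesting trade-off, discussed below, is between fidelity and the size of the tree.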

Figure 4 shows a decision tree which approximates our original clusters with 95% agreement. However, it is clearly too large (too many nodes) and too complicated (some leaf nodes lie on multiple paths – that is, more than one set of rules results in a given cluster). This presents a challenge for the usability of the classification scheme, and for any comparison with empirical methods.

Figure 4.  Decision tree obtained from clusters shown in Figure 3.  An internal node contains a rule using the value of a single feature; the two paths from that node correspond to the rule being true or false.  A leaf node corresponds to a class (in our case, a cluster).  Paths along branches of the tree represent conjunctions of rules (involving multiple features) which result in the identification of a class.

Accordingly, we modify the tree by first setting a minimum value for the number of points corresponding to each node, and then recursively trimming the tree by merging neighbouring nodes with similar properties [2]. The result, shown in Figure 5, is a tree with 14 leaf nodes which is in 85% agreement with the original clusters.
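The effect of these constraints can be illustrated with scikit-learn's built-in tree controls. Note that `min_samples_leaf` and `max_leaf_nodes` are stand-ins here for the paper's minimum-points rule and recursive merge, which were implemented separately [2]; the data is again synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 15))
clusters = KMeans(n_clusters=11, n_init=10, random_state=0).fit_predict(X)

# A simpler tree: require a minimum number of points per leaf and cap
# the number of leaves, trading some agreement for readability.
small = DecisionTreeClassifier(min_samples_leaf=20,
                               max_leaf_nodes=14,
                               random_state=0).fit(X, clusters)

agreement = (small.predict(X) == clusters).mean()
n_leaves = small.get_n_leaves()
```

The constrained tree will agree with the original clusters less often than the fully grown one, but its rule set is small enough to be read and compared against expert biome definitions.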

Figure 5. Modified tree derived from tree shown in Figure 4 by enforcing a minimum number of points for each node and recursive trimming.

Figure 6 compares the biomes obtained from our explainable clustering methodology with those determined empirically by Olson et al. [1]. We see large values along the diagonal, corresponding to a high level of overlap between the two sets – for example, the desert biome has 92% overlap and tropical forest has 87%. A more detailed comparison of the biomes derived from the two methods (for example, of the shape of their boundaries) is still required, but these initial results are promising.
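The percentages in a comparison of this kind come from a contingency table between the two categorical maps. A minimal numpy sketch, using fabricated labels purely to show the calculation (the real comparison uses the per-grid-point biome assignments from each method):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical biome labels per grid point from the two methods:
# 'olson' mostly agrees with 'ours', with ~20% of points relabelled.
ours = rng.integers(0, 11, size=5000)
olson = (ours + (rng.random(5000) < 0.2) * rng.integers(1, 11, size=5000)) % 11

# Contingency table: rows = our clusters, columns = Olson biomes.
table = np.zeros((11, 11))
for a, b in zip(ours, olson):
    table[a, b] += 1

# Row-normalise to get the % of each cluster falling in each Olson biome;
# large diagonal entries indicate a high level of overlap.
pct = 100 * table / table.sum(axis=1, keepdims=True)
```

Each row of `pct` sums to 100, matching the reading of Figure 6 ("the % of cluster biomes that fall in each Olson biome").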

Figure 6.  Comparing the biomes derived from our explainable clusters to the empirical biomes of Olson et al. [1].  Each number is the % of cluster biomes that fall in each Olson biome.  In biome names, trop = tropical, med = Mediterranean, frst = forest, wood = woodland, shrub = shrubland, scrub = scrubland, grass = grassland, bl = broadleaf.  The colour of each cell denotes the type of biome (green = forest, blue = woodland, purple = grassy, orange = barren); colour intensity is proportional to the value in the cell.

In addition to the rules which govern the assignment of points to biomes, our method also admits of deeper analysis. For example, it is possible to use a supervised learning method to classify points and calculate a so-called feature importance for each node in the decision tree. This gives a measure of which features have the most influence over cluster assignment. Preliminary results indicate that variables such as mean annual temperature, tree cover and mean maximum temperature of the warmest month are more important for the characterization of biomes than, for example, population density, urban cover and mean maximum windspeed.
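As a sketch of how such importances can be obtained, scikit-learn's decision trees expose an impurity-based feature importance aggregated over the nodes at which each feature is used. The data and variable names below are placeholders, and this built-in measure is one possible choice rather than necessarily the one used in the study:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 15))
clusters = KMeans(n_clusters=11, n_init=10, random_state=0).fit_predict(X)

# Train a classifier on the cluster labels and read off the
# impurity-based feature importances, which sum to one.
clf = DecisionTreeClassifier(random_state=0).fit(X, clusters)

names = [f"var{i}" for i in range(15)]  # placeholder feature names
ranked = sorted(zip(names, clf.feature_importances_),
                key=lambda p: p[1], reverse=True)
```

On real data, the top of `ranked` would list the variables (such as mean annual temperature or tree cover) that most influence cluster assignment.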

We believe that this study demonstrates the strong potential for advancing our understanding of Earth system science by utilising machine learning methods such as explainable clustering. By expanding this work in the future and applying these methods to climate projections from models such as UKESM, we will be able to provide analyses which complement existing insights from experts about how the Earth’s biomes may alter in response to a changing climate.


  1. Olson, D. M., Dinerstein, E., Wikramanayake, E. D., Burgess, N. D., Powell, G. V. N., Underwood, E. C., D’amico, J. A., Itoua, I., Strand, H. E., Morrison, J. C., Loucks, C. J., Allnutt, T. F., Ricketts, T. H., Kura, Y., Lamoreux, J. F., Wettengel, W. W., Hedao, P., and Kassem, K. R. (2001). Terrestrial Ecoregions of the World: A New Map of Life on Earth. BioScience, 51(11), pp 933–938.
  2. Sidoumou, M.; Kim, A.; Walton, J.; Kelley, D.; Parker, R. and Swaminathan, R. (2022). Explainable Clustering Applied to the Definition of Terrestrial Biomes. In Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods – ICPRAM, ISBN 978-989-758-549-4, pp 586–595. DOI: 10.5220/0010842400003122
  3. Moshkovitz, M., Dasgupta, S., Rashtchian, C. and Frost, N. (2020).  Explainable k-means and k-medians clustering. In International conference on machine learning, pp 7055–7065. PMLR.
  4. James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer, New York. DOI: 10.1007/978-1-4614-7138-7