Modeling Species Distributions with Machine Learning

Introduction

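In this lab, we will learn how to model species' geographic distributions using machine learning. This is an important problem in conservation biology, because knowing where a species is likely to occur helps in designing effective conservation strategies. We will use scikit-learn's species distributions dataset, which contains past observations of two South American mammals, the brown-throated sloth (Bradypus variegatus) and the forest small rice rat (Microryzomys minutus), together with 14 environmental variables. Because the dataset records only where each species was observed and contains no confirmed absences, we will treat the task as a density estimation problem and model it with the OneClassSVM algorithm from the scikit-learn library.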

Import Libraries

In this step, we will import the necessary libraries for our analysis. We will import the scikit-learn library for machine learning, numpy for numerical computing, and matplotlib for visualization.

from time import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn.utils import Bunch
from sklearn.datasets import fetch_species_distributions
from sklearn import svm, metrics

Load Data

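In this step, we will load the data with scikit-learn's fetch_species_distributions function. It downloads the dataset on first use, caches it locally, and returns a Bunch object holding the training and test observations for the two South American mammals, the 14 environmental coverage layers, and the parameters that describe the map grid.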

## Load the compressed data
data = fetch_species_distributions()

Construct Map Grid

In this step, we will construct the map grid from the data object. We will create a function called construct_grids that takes the data object as input and returns xgrid and ygrid, two 1-D arrays holding the longitude and latitude coordinates of the grid cells.

def construct_grids(batch):
    """Construct the map grid from the batch object

    Parameters
    ----------
    batch : Bunch object
        The object returned by fetch_species_distributions

    Returns
    -------
    (xgrid, ygrid) : 1-D arrays
        The grid corresponding to the values in batch.coverages
    """
    ## x,y coordinates for corner cells
    xmin = batch.x_left_lower_corner + batch.grid_size
    xmax = xmin + (batch.Nx * batch.grid_size)
    ymin = batch.y_left_lower_corner + batch.grid_size
    ymax = ymin + (batch.Ny * batch.grid_size)

    ## x coordinates of the grid cells
    xgrid = np.arange(xmin, xmax, batch.grid_size)
    ## y coordinates of the grid cells
    ygrid = np.arange(ymin, ymax, batch.grid_size)

    return (xgrid, ygrid)

## Construct the map grid
xgrid, ygrid = construct_grids(data)
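
To see what construct_grids produces without downloading the dataset, the sketch below runs the same function on a small stand-in object. The toy object and all of its values (a 4 x 3 grid with 0.5 degree cells) are made up for illustration; only the attribute names match those used in the lab.

```python
from types import SimpleNamespace

import numpy as np


def construct_grids(batch):
    """Same logic as the lab's construct_grids, repeated so this runs on its own."""
    xmin = batch.x_left_lower_corner + batch.grid_size
    xmax = xmin + (batch.Nx * batch.grid_size)
    ymin = batch.y_left_lower_corner + batch.grid_size
    ymax = ymin + (batch.Ny * batch.grid_size)
    xgrid = np.arange(xmin, xmax, batch.grid_size)
    ygrid = np.arange(ymin, ymax, batch.grid_size)
    return xgrid, ygrid


# Hypothetical stand-in for the object returned by fetch_species_distributions.
toy = SimpleNamespace(
    x_left_lower_corner=-10.0,
    y_left_lower_corner=-5.0,
    Nx=4,
    Ny=3,
    grid_size=0.5,
)

toy_xgrid, toy_ygrid = construct_grids(toy)
print(toy_xgrid)  # Nx evenly spaced x coordinates
print(toy_ygrid)  # Ny evenly spaced y coordinates
```

Each array has one entry per grid column or row, spaced grid_size apart, which is what lets the later steps translate a longitude or latitude into an array index.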

Create Species Bunch

In this step, we will create a bunch with information about a particular organism. We will create a function called create_species_bunch that takes the species name, train, test, coverages, xgrid, and ygrid as input and returns a bunch object.

def create_species_bunch(species_name, train, test, coverages, xgrid, ygrid):
    """Create a bunch with information about a particular organism

    This will use the test/train record arrays to extract the
    data specific to the given species name.
    """
    bunch = Bunch(name=" ".join(species_name.split("_")[:2]))
    species_name = species_name.encode("ascii")
    points = dict(test=test, train=train)

    for label, pts in points.items():
        ## choose points associated with the desired species
        pts = pts[pts["species"] == species_name]
        bunch["pts_%s" % label] = pts

        ## determine coverage values for each of the training & testing points
        ix = np.searchsorted(xgrid, pts["dd long"])
        iy = np.searchsorted(ygrid, pts["dd lat"])
        bunch["cov_%s" % label] = coverages[:, -iy, ix].T

    return bunch

## Create species bunch
BV_bunch = create_species_bunch(
    "bradypus_variegatus_0", data.train, data.test, data.coverages, xgrid, ygrid
)
MM_bunch = create_species_bunch(
    "microryzomys_minutus_0", data.train, data.test, data.coverages, xgrid, ygrid
)
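
The coverage lookup inside create_species_bunch can be hard to follow, so here is a minimal sketch of the same indexing on made-up data. The grid, the two coverage layers, and the observation coordinates are all hypothetical. The point is to show how np.searchsorted turns coordinates into indices, and why the row index is negated: ygrid is in ascending order, while the plotting step of this lab treats the coverage rows as running from north to south.

```python
import numpy as np

# Hypothetical grid coordinates (ascending, like xgrid and ygrid in the lab).
xgrid = np.array([0.0, 1.0, 2.0, 3.0])
ygrid = np.array([10.0, 11.0, 12.0])

# Hypothetical coverages: 2 layers on a grid of 3 rows by 4 columns.
coverages = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Two made-up observation points.
lon = np.array([0.5, 2.5])
lat = np.array([10.5, 11.5])

# Index of the first grid coordinate that is >= each observed coordinate.
ix = np.searchsorted(xgrid, lon)
iy = np.searchsorted(ygrid, lat)

# Negating iy counts rows from the end of the array instead of the start.
cov = coverages[:, -iy, ix].T

print(ix, iy)
print(cov)  # one row per observation, one column per coverage layer
```

The transpose at the end gives the usual scikit-learn layout of samples in rows and features in columns.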

Fit OneClassSVM

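In this step, we will fit the OneClassSVM model to the training data for Bradypus variegatus. We first standardize each environmental feature using the mean and standard deviation of the training set, so that variables measured on different scales contribute comparably to the RBF kernel, and then fit the model to the standardized features.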

## Standardize features
mean = BV_bunch.cov_train.mean(axis=0)
std = BV_bunch.cov_train.std(axis=0)
train_cover_std = (BV_bunch.cov_train - mean) / std

## Fit OneClassSVM
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.5)
clf.fit(train_cover_std)
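
The following sketch repeats the standardize-then-fit pattern on synthetic data, so that its effect can be checked without the real dataset. The three features and their scales are invented for illustration. The OneClassSVM parameters are the same as in the lab.

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)

# Made-up "presence" samples: 200 points, 3 features on very different scales.
train = rng.normal(
    loc=[20.0, 1500.0, 0.3], scale=[2.0, 300.0, 0.05], size=(200, 3)
)

# Standardize with training-set statistics, as in the lab.
mean = train.mean(axis=0)
std = train.std(axis=0)
train_std = (train - mean) / std

clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.5)
clf.fit(train_std)

# Score a typical environment (the training mean) and a very unusual one.
typical = mean.reshape(1, -1)
unusual = (mean + 10 * std).reshape(1, -1)
score_typical = clf.decision_function((typical - mean) / std)[0]
score_unusual = clf.decision_function((unusual - mean) / std)[0]

print(score_typical, score_unusual)
```

New points must be standardized with the same mean and std that were computed from the training data, which is why the lab keeps both variables around for the prediction step. Higher decision scores indicate environments more similar to those where the species was observed.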

Predict Species Distribution

In this step, we will use the fitted OneClassSVM model to predict the species distribution. We will evaluate the model's decision function on every land cell of the map grid, then plot the result as filled contours together with the training and test points.

## Predict species distribution over the map grid
Z = np.ones((data.Ny, data.Nx), dtype=np.float64)

## We'll predict only for the land points.
idx = np.where(data.coverages[6] > -9999)
coverages_land = data.coverages[:, idx[0], idx[1]].T

pred = clf.decision_function((coverages_land - mean) / std)
Z *= pred.min()
Z[idx[0], idx[1]] = pred

levels = np.linspace(Z.min(), Z.max(), 25)
Z[data.coverages[6] == -9999] = -9999

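## plot contours of the prediction; rows of Z run from north to south,
## so the y coordinates are reversed to match
X, Y = xgrid, ygrid[::-1]
plt.contourf(X, Y, Z, levels=levels, cmap=plt.cm.Reds)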
plt.colorbar(format="%.2f")

## scatter training/testing points
plt.scatter(
    BV_bunch.pts_train["dd long"],
    BV_bunch.pts_train["dd lat"],
    s=2**2,
    c="black",
    marker="^",
    label="train",
)
plt.scatter(
    BV_bunch.pts_test["dd long"],
    BV_bunch.pts_test["dd lat"],
    s=2**2,
    c="black",
    marker="x",
    label="test",
)
plt.legend()
plt.title(BV_bunch.name)
plt.axis("equal")
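
The order of operations used to build Z matters, and a small example may make it clearer. The sketch below uses a made-up 3 x 4 land reference layer and invented decision scores. It fills the grid with the minimum score, writes the land scores in place, computes the contour levels, and only then marks the ocean cells with -9999.

```python
import numpy as np

# Toy land reference layer; -9999 marks ocean cells, as in the lab's data.
land_reference = np.array(
    [
        [-9999, 5, 7, -9999],
        [3, 4, -9999, 2],
        [-9999, -9999, 1, 6],
    ],
    dtype=float,
)

land_rows, land_cols = np.where(land_reference > -9999)

# Hypothetical decision scores, one per land cell, in np.where order.
pred = np.linspace(-1.0, 1.0, land_rows.size)

# Start from the lowest score everywhere, then overwrite the land cells.
Z = np.full(land_reference.shape, pred.min())
Z[land_rows, land_cols] = pred

# Compute contour levels before the sentinel is written into Z.
levels = np.linspace(Z.min(), Z.max(), 5)

# Ocean cells now fall below the lowest level and are left unfilled by contourf.
Z[land_reference == -9999] = -9999

print(levels)
print(Z)
```

If the levels were computed after the last assignment, the -9999 sentinel would stretch the range from -9999 to the maximum score and the differences between land cells would be compressed into the top few contour bands.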

Compute AUC

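In this step, we will compute the area under the ROC curve (AUC) with respect to background points. We will score the test observations with the fitted model, draw 10,000 random grid cells as background points and look up their scores in Z, and then compute the AUC by treating the test observations as positives and the background points as negatives.

## Compute AUC with respect to background points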
background_points = np.c_[
    np.random.randint(low=0, high=data.Ny, size=10000),
    np.random.randint(low=0, high=data.Nx, size=10000),
].T

pred_background = Z[background_points[0], background_points[1]]
pred_test = clf.decision_function((BV_bunch.cov_test - mean) / std)
scores = np.r_[pred_test, pred_background]
y = np.r_[np.ones(pred_test.shape), np.zeros(pred_background.shape)]
fpr, tpr, thresholds = metrics.roc_curve(y, scores)
roc_auc = metrics.auc(fpr, tpr)
plt.text(-35, -70, "AUC: %.3f" % roc_auc, ha="right")
print("\n Area under the ROC curve : %f" % roc_auc)
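
As a small worked example of the AUC computation, the sketch below uses a handful of invented scores instead of model output. With presence points labeled 1 and background points labeled 0, the AUC equals the fraction of (presence, background) pairs in which the presence point receives the higher score.

```python
import numpy as np
from sklearn import metrics

# Made-up decision scores for illustration only.
pred_test = np.array([0.8, 0.9, 0.7])               # presence (test) points
pred_background = np.array([0.1, 0.2, 0.75, 0.3])   # random background points

scores = np.r_[pred_test, pred_background]
y = np.r_[np.ones(pred_test.shape), np.zeros(pred_background.shape)]

fpr, tpr, thresholds = metrics.roc_curve(y, scores)
roc_auc = metrics.auc(fpr, tpr)

# Same quantity computed directly by counting correctly ordered pairs.
pairwise = np.mean(pred_test[:, None] > pred_background[None, :])

print(roc_auc, pairwise)
```

Here 11 of the 12 pairs are ordered correctly, so both values are 11/12. An AUC of 0.5 would mean the model ranks presence points no better than random background cells, and 1.0 would mean every presence point outranks every background cell.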

Plot Species Distribution

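In this step, we will display the figure built in the previous steps, which shows the modeled distribution of Bradypus variegatus. To produce the same map for Microryzomys minutus, repeat the Fit OneClassSVM, Predict Species Distribution, and Compute AUC steps with MM_bunch in place of BV_bunch.

## Display the species distribution plot
plt.show()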

Summary

In this lab, we learned how to model species' geographic distributions using machine learning. We loaded past observations of two South American mammals together with 14 environmental variables, built the map grid, extracted the environmental features at each observation, and fit a OneClassSVM model from the scikit-learn library to the standardized features. We also learned how to plot the predicted distribution and how to compute the area under the ROC curve against random background points to evaluate the model.
