Motivation

In the United States since 2000, an average of 70,072 wildfires have burned an average of seven million acres per year, more than double the 3.3 million acres burned annually during the 1990s. In 2021 alone, nearly 6,000 structures burned, sixty percent of which were residential. If wildfires could be predicted in advance, those predictions could aid targeted mitigation efforts, reducing both the likelihood and the overall impact of these fires. As a company headquartered in California, our team has a strong interest in wildfires and their impacts.

Testing in a High Stakes Environment

Using high-resolution data, we will predict the likelihood of fire for any given month and location in the western United States. As the frequency and severity of fires increase over time, the stakes for accurate wildfire prediction are high and only getting higher. A high-stakes environment such as this demands high quality from any deployed ML system. To meet that bar, we establish performance thresholds prior to model training and then raise them iteratively as our models improve.

This proactive testing and performance evaluation approach is uncommon in machine learning today. Still relatively early on its maturity curve, machine learning has yet to adopt the kind of systematic testing that software engineering embraced in the 1990s. Unlike software, which throws exceptions, machine learning models often fail silently. As a result, data scientists who start model development without a testing plan often run into performance problems, and development takes longer than it would if testing were part of the process from the outset. Rigorous model testing helps catch learned issues and improves robustness once the model is deployed to production.

Constructing the Data

To tackle this prediction task, we accessed the Global Fire Emissions Database (GFED) to extract burning, emissions, and biosphere data. The data is available at a spatial resolution of 0.25 degrees on a monthly basis from 1997 onwards.

Below is a descriptive table of the data used:

Fire Prediction Modeling Data Features

Features used for the fire prediction model are extracted from burning, emissions, and biosphere data.

Because the data is stored in HDF5 format, we used the h5py package to extract it into a pandas DataFrame. The Hierarchical Data Format version 5 (HDF5) supports large, complex, and heterogeneous data using a "file directory"-like structure. NSF NEON has additional information on the HDF5 file format.

Below, you can see how we accessed the datasets stored in each HDF5 file using their keys. After extraction, we flattened each dataset to create a DataFrame.

import os

import h5py
import pandas as pd

def load_data_for_year(year) -> pd.DataFrame:
    # GFED4.1s files from 2017 onward are beta releases with a different filename
    if year > 2016:
        filename = f'GFED4.1s_{year}_beta.hdf5'
    else:
        filename = f'GFED4.1s_{year}.hdf5'
    # data_dir points at the directory containing the GFED HDF5 files
    filepath = os.path.join(data_dir, filename)
    data = h5py.File(filepath, 'r')

    # start with the coordinate grids, flattened to one row per grid cell
    df = pd.DataFrame({
        'lat': data['lat'][:].flatten(),
        'lon': data['lon'][:].flatten(),
    })

    # monthly burned area, emissions, and biosphere datasets
    keys = ['burned_area/{:02}/burned_fraction',
            'emissions/{:02}/DM',
            'emissions/{:02}/C',
            'emissions/{:02}/small_fire_fraction',
            'biosphere/{:02}/NPP',
            'biosphere/{:02}/Rh',
            'biosphere/{:02}/BB']
    for month in range(1, 13):
        for key in keys:
            key_str = key.format(month)
            df[key_str] = data[key_str][:].flatten()

    # restrict to the western United States (helper defined elsewhere)
    df = filter_to_west_us_region(df)
    return df

We format the modeling data with one observation per spatial grid cell. The features are then the burned fraction, emissions, and biosphere values for each past month, where a suffix of n minus the number of months back (e.g., n-3 for data three months before the prediction) denotes the age of each feature.

def get_feature_for_year_and_month(data_cache, year, month, suffix) -> pd.DataFrame:
    # load (and cache) the flattened GFED data for the requested year
    if year not in data_cache:
        data_cache[year] = load_data_for_year(year)
    data = data_cache[year]

    # select the columns for the requested month and rename them with the lag suffix
    df = pd.DataFrame({
        f'burned_fraction_{suffix}': data['burned_area/{:02}/burned_fraction'.format(month)],
        f'emissions_DM_{suffix}': data['emissions/{:02}/DM'.format(month)],
        f'emissions_C_{suffix}': data['emissions/{:02}/C'.format(month)],
        f'emissions_small_fire_fraction_{suffix}': data['emissions/{:02}/small_fire_fraction'.format(month)],
        f'biosphere_NPP_{suffix}': data['biosphere/{:02}/NPP'.format(month)],
        f'biosphere_Rh_{suffix}': data['biosphere/{:02}/Rh'.format(month)],
        f'biosphere_BB_{suffix}': data['biosphere/{:02}/BB'.format(month)],
    })
    return df
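As an illustration of how these per-month frames come together, below is a minimal sketch of assembling a training frame with a multi-month lookback window. The helper build_training_frame, the window_years parameter, and the label construction (fire = any burning in the target month) are our own hypothetical additions, not part of the original pipeline.

import pandas as pd

def build_training_frame(data_cache, target_year, target_month, window_years=2) -> pd.DataFrame:
    # hypothetical helper: gather features for the window_years * 12 months
    # preceding (target_year, target_month), suffixed n-1, n-2, ...
    frames = []
    year, month = target_year, target_month
    for lag in range(1, window_years * 12 + 1):
        month -= 1
        if month == 0:
            year, month = year - 1, 12
        frames.append(get_feature_for_year_and_month(data_cache, year, month, f'n-{lag}'))
    features = pd.concat(frames, axis=1)

    # assumed label definition: did any burning occur in the target month?
    if target_year not in data_cache:
        data_cache[target_year] = load_data_for_year(target_year)
    burned = data_cache[target_year]['burned_area/{:02}/burned_fraction'.format(target_month)]
    features['fire'] = (burned > 0).astype(int)
    return features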

Test-Guided Model Development

Before training the first model, we create performance tests with TruEra's Test Harness. Performance thresholds can be established from domain knowledge, stakeholder requirements, or an existing champion model. Thresholds established prior to model training can then be used to benchmark new candidate models.

These tests will guide our iterative model development. To establish a target performance threshold, we consulted Predicting Forest Fire Using Remote Sensing Data And Machine Learning. In that paper, the authors use the FIRMS hotspot dataset, with a spatial granularity of 8 x 8 km and a somewhat limited temporal dimension (only up to one year). Their model, Agni, achieved AUC scores ranging from 0.80 to 0.85.

The baseline model will be tested against Agni's performance: the thresholds will warn if AUC falls below 0.85 and fail if AUC falls below 0.80. We train the model on data from 1997-2009 and test on data from 2010 onwards.

# add an absolute performance test for every data split in every data collection
data_collections = tru.get_data_collections()
for dc in data_collections:
    tru.set_data_collection(dc)
    data_splits = tru.get_data_splits()
    for split in data_splits:
        tru.tester.add_performance_test(
            data_split_name=split,
            metric='AUC',
            warn_if_less_than=0.85,
            fail_if_less_than=0.80
        )

Baseline Performance

For this task, we start with a linear model with a two-year input window (i.e., to produce a prediction for year x, the model takes as input data from years x-1 and x-2). Examining the tests, we can see that this baseline performs poorly against the Agni-based testing regime: it passes zero performance tests, warning on five and failing three.

The linear model for fire prediction produces a warning for five data splits and hits the failure threshold for the remaining three.
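For context, here is a rough sketch of how such a baseline could be trained and scored. The original post does not specify the linear model implementation, so we use scikit-learn logistic regression as a stand-in; X_train/y_train (1997-2009) and X_test/y_test (2010 onwards) are assumed to have been built with the helpers sketched above.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# assumed: X_train/y_train cover 1997-2009, X_test/y_test cover 2010 onwards
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f'Baseline linear model AUC: {auc:.3f}')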

Examining the feature influence of this model, two of the top five features (biosphere_NPP_n-17, biosphere_Rh_n-19) are near the end of the input window. The suffixes n-17 and n-19 correspond to data from 17 and 19 months before the time of prediction, respectively. This gives us a clue that we should extend the input window farther back into the past.

Biosphere features make up the top five features with the highest average influence, measured as the L1 norm. Since two of the top five features are near the end of the input window, we should extend the window farther back in the past.
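The influence computation itself is handled by the TruEra platform; purely as a sketch of the metric, the snippet below ranks features by the mean absolute value of per-row influences, given a hypothetical influences DataFrame (rows = observations, columns = features).

import pandas as pd

def rank_by_mean_abs_influence(influences: pd.DataFrame, top_n: int = 5) -> pd.Series:
    # average influence per feature, measured as the L1 norm of per-row influences
    return influences.abs().mean(axis=0).sort_values(ascending=False).head(top_n)

# example usage, assuming `influences` was exported from the platform or an explainer:
# print(rank_by_mean_abs_influence(influences))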

Extending the Feature History

The literature around fire cycles broadly agrees that surface fire and torching potential is most likely in what's called the gray phase (5 to 10 years following a fire). To account for this cycle, we will train another linear model, but with its input window set to ten years. After evaluation, we can see that the new linear model now passes four of the eight performance tests.

Extending the feature history back ten years drastically improves the performance of the linear model, as observed in the results from the test harness.
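Using the hypothetical build_training_frame helper sketched earlier, the only change for this model is the size of the lookback window (the target year and month below are purely illustrative):

# widen the lookback window from two to ten years of monthly features
frame_10yr = build_training_frame(data_cache, target_year=2010, target_month=7,
                                  window_years=10)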

From Linear to Gradient Boosting Classifier

Now that baseline performance is established, we update our testing scheme to fail relative to the linear model with an equivalent feature-history window. To do so, we use the same add_performance_test() call, set fail_threshold_type to "RELATIVE", and add the reference_model_name argument. The code required to add this new performance testing scheme using the TruEra SDK is below.

data_collections = tru.get_data_collections()
for dc in data_collections:
    tru.set_data_collection(dc)
    data_splits = ["train", "2010", "2011", "2012", "2013", "2014", "2015", "2016"]
    ref_model = "linear_" + dc
    for split in data_splits:
        tru.tester.add_performance_test(
            data_split_name=split,
            metric='AUC',
            warn_threshold_type="ABSOLUTE",  # this is the default, specified for clarity
            warn_if_less_than=0.80,          # Agni's minimum AUC
            fail_threshold_type="RELATIVE",
            fail_if_less_than=0,             # relative threshold is reference model performance + value
            reference_model_name=ref_model   # set the model to test against
        )

From here, we consider a more complex model: a gradient boosting (GB) classifier with its n_estimators parameter (the number of boosting stages) set to 20. Using the new relative failure threshold, the GB model performs better than the linear reference on all data splits except 2012. As shown by the absolute warning threshold of 0.80, the GB model also meets or exceeds Agni's minimum performance on all splits.

The gradient boosting model performs better on all data splits except for 2012, as observed using the results from the test harness.
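As a sketch of what this candidate could look like (the post does not show the training code), here is a scikit-learn GradientBoostingClassifier with 20 boosting stages, trained on the same assumed X_train/y_train as before:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# gb_20: 20 boosting stages; the same sketch with n_estimators=200 yields gb_200
gb_20 = GradientBoostingClassifier(n_estimators=20, random_state=0)
gb_20.fit(X_train, y_train)
print('gb_20 AUC:', roc_auc_score(y_test, gb_20.predict_proba(X_test)[:, 1]))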

Increasing the n_estimators parameter to 200 further improves performance. Using both the warning condition against Agni and the failure condition relative to the linear model, gb_200 passes the performance tests for all data splits.

The GB model with 200 estimators (gb_200) now passes all eight performance tests.

Selecting the Champion Model

At this stage, selecting the champion model is relatively easy: the gb_200 model passed all performance tests. For final verification, we compare the two candidate models that meet Agni's performance criteria against each other. Assessing the difference in AUC score, the gb_200 model outperforms the gb_20 model on all splits but one.

Using area under the curve (AUC) as the performance metric, gb_200 outperforms gb_20 in every year except 2015.
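One way to reproduce this comparison outside the test harness is to score both candidates per split. The splits mapping below (split name to features and labels) and the gb_200 model, assumed to be trained as in the earlier sketch with n_estimators=200, are illustrative assumptions:

from sklearn.metrics import roc_auc_score

# assumed: `splits` maps a split name (e.g. '2015') to its (features, labels) pair
for name, (X_split, y_split) in splits.items():
    auc_20 = roc_auc_score(y_split, gb_20.predict_proba(X_split)[:, 1])
    auc_200 = roc_auc_score(y_split, gb_200.predict_proba(X_split)[:, 1])
    print(f'{name}: AUC(gb_200) - AUC(gb_20) = {auc_200 - auc_20:+.3f}')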

Comparing the precision-recall curves for the best-performing split of each model, the gb_200 curve sustains substantially higher precision as the classification threshold is lowered, thereby increasing the recall of the model (correctly predicting more actual fires).

The precision-recall curve of the gb_200 model is favorable to that of the gb_20 model because it sustains substantially higher precision as recall increases.
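These curves can be reproduced with scikit-learn's precision_recall_curve; a minimal sketch for a single split, with X_split and y_split assumed as above:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

for model, label in [(gb_20, 'gb_20'), (gb_200, 'gb_200')]:
    precision, recall, _ = precision_recall_curve(y_split, model.predict_proba(X_split)[:, 1])
    plt.plot(recall, precision, label=label)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.title('Precision-recall curves for gb_20 and gb_200')
plt.show()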

We can also visualize the champion model's predicted fires (blue) against actual fires (orange) to assess its performance. Notably, the model successfully predicted the fires along the southern coast of California from San Diego to Santa Barbara, the pocket of fires north of Sacramento, and the swath of fire extending from northeast Nevada up through western Idaho to the southeast corner of Washington.
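A plot along these lines can be produced directly from the lat/lon columns carried through from the GFED grid. The df_test DataFrame, with a 'fire' label and a 'fire_pred' column of champion-model predictions, is an assumption for this sketch:

import matplotlib.pyplot as plt

# assumed: df_test holds lat/lon, the actual fire label, and the champion model's predictions
actual = df_test[df_test['fire'] == 1]
predicted = df_test[df_test['fire_pred'] == 1]

plt.scatter(actual['lon'], actual['lat'], s=4, color='tab:orange', label='actual fires')
plt.scatter(predicted['lon'], predicted['lat'], s=4, color='tab:blue', label='predicted fires')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend()
plt.show()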


Automated, Standard Testing for ML

In this blog we followed an iterative, test-driven approach to model development. We created a reasonably good model for predicting forest fires in the western region of the United States. Test-driven modeling helps to create a repeatable and standardized approach to building and assessing candidate models. Let’s hold our ML systems to a higher standard of quality, just like what’s expected for software.

This blog has been republished by AIIA. To view the original article, please click here: https://truera.com/how-ml-model-testing-accelerates-model-improvement-test-driven-modeling-for-fire-prediction/