9  Advanced Modeling Techniques

Learning Objectives

By the end of this chapter, you will be able to:

  1. Interpret machine learning models using model-agnostic methods
  2. Calculate and visualize variable importance
  3. Create partial dependence and ICE plots
  4. Explain individual predictions with SHAP values
  5. Analyze time series data with temporal autocorrelation
  6. Decompose time series into trend, seasonal, and noise components
  7. Build ARIMA models for forecasting
  8. Account for temporal autocorrelation in regression models

9.1 Introduction

This chapter explores advanced modeling techniques that extend beyond traditional regression. We’ll cover model interpretation methods that help us understand “black box” machine learning models, and time series analysis for data collected sequentially over time.

9.2 Model Interpretation and Explainability

While machine learning models like Random Forest can achieve high predictive accuracy, understanding why they make specific predictions is crucial for ecological applications. Model interpretation helps us extract ecological insights and communicate results to stakeholders.

9.2.1 Variable Importance

Code
# Load packages for model interpretation
library(DALEX)
library(DALEXtra)
library(tidymodels)
library(palmerpenguins)

# Prepare the penguins data (same as chapter 08)
penguins <- palmerpenguins::penguins %>%
  drop_na()

# Split data into training and testing sets
set.seed(123)
penguin_split <- initial_split(penguins, prop = 0.75, strata = species)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

# Define a recipe for preprocessing
penguin_recipe <- recipe(body_mass_g ~ bill_length_mm + bill_depth_mm +
                         flipper_length_mm + species,
                         data = penguin_train) %>%
  step_dummy(species) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_zv(all_predictors())

# Define Random Forest model specification
rf_spec <- rand_forest(trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Create workflow
rf_wf <- workflow() %>%
  add_recipe(penguin_recipe) %>%
  add_model(rf_spec)

# Fit the final model
final_fit <- rf_wf %>%
  fit(data = penguin_train)

# Create an explainer object
explainer_rf <- explain_tidymodels(
  final_fit,
  data = penguin_test %>% select(-body_mass_g),
  y = penguin_test$body_mass_g,
  label = "Random Forest",
  verbose = FALSE
)

# Calculate variable importance using permutation
vi_rf <- model_parts(explainer_rf, loss_function = loss_root_mean_square)

# Visualize variable importance
plot(vi_rf) +
  labs(title = "Variable Importance for Body Mass Prediction",
       subtitle = "Permutation-based importance shows contribution of each predictor") +
  theme_minimal()

# Get the importance values
vi_df <- as.data.frame(vi_rf)
vi_summary <- vi_df %>%
  filter(variable != "_baseline_", variable != "_full_model_") %>%
  group_by(variable) %>%
  summarize(
    mean_dropout_loss = mean(dropout_loss),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_dropout_loss))

print(vi_summary)
#> # A tibble: 7 × 2
#>   variable          mean_dropout_loss
#>   <chr>                         <dbl>
#> 1 species                        574.
#> 2 flipper_length_mm              536.
#> 3 bill_depth_mm                  384.
#> 4 bill_length_mm                 351.
#> 5 island                         311.
#> 6 sex                            311.
#> 7 year                           311.
Figure 9.1: Variable importance for body mass prediction
Code Explanation

This code demonstrates model-agnostic variable importance:

  1. DALEX Framework
    • Creates an “explainer” object that wraps the tidymodels workflow
    • Provides a unified interface for model interpretation
    • Works with any model type (linear, tree-based, neural networks)
  2. Permutation Importance
    • Measures how much model performance decreases when a variable is randomly shuffled
    • Higher dropout loss = more important variable
    • Unlike tree-based impurity importance, this is model-agnostic and works for any model; note that strongly correlated predictors can share or dilute importance (a by-hand version is sketched after this list)
  3. Interpretation
    • Variables are ranked by their contribution to predictions
    • island, sex, and year appear in the output because they are in the explainer data but not in the recipe, so permuting them leaves performance essentially unchanged
    • Helps identify which morphological features are most predictive of body mass
    • Provides ecological insights into species morphology
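
To make the permutation idea concrete, here is a minimal by-hand sketch using the fitted workflow and test set from above; the rmse_of() helper is hypothetical, not part of DALEX.

Code
# A by-hand version of permutation importance for one predictor
# (illustrative sketch; DALEX's model_parts() does this for every
# predictor and repeats the shuffling several times)
rmse_of <- function(data) {
  preds <- predict(final_fit, new_data = data)$.pred
  sqrt(mean((data$body_mass_g - preds)^2))
}

baseline_rmse <- rmse_of(penguin_test)

# Shuffle flipper length and measure how much RMSE increases
set.seed(42)
shuffled <- penguin_test %>%
  mutate(flipper_length_mm = sample(flipper_length_mm))

c(baseline   = baseline_rmse,
  permuted   = rmse_of(shuffled),
  importance = rmse_of(shuffled) - baseline_rmse)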

9.2.2 Partial Dependence Plots

Partial dependence plots show how predictions change as a single variable varies, while averaging over all other variables.

Code
# Create partial dependence profiles for key variables
pdp_flipper <- model_profile(
  explainer_rf,
  variables = "flipper_length_mm",
  N = NULL  # Use all observations
)

pdp_bill <- model_profile(
  explainer_rf,
  variables = "bill_length_mm",
  N = NULL
)

# Plot partial dependence
plot(pdp_flipper) +
  labs(title = "Partial Dependence: Flipper Length",
       subtitle = "Shows average effect of flipper length on predicted body mass",
       x = "Flipper Length (mm)",
       y = "Predicted Body Mass (g)") +
  theme_minimal()

plot(pdp_bill) +
  labs(title = "Partial Dependence: Bill Length",
       subtitle = "Shows average effect of bill length on predicted body mass",
       x = "Bill Length (mm)",
       y = "Predicted Body Mass (g)") +
  theme_minimal()

# Compute profiles for two variables at once (a starting point for exploring interactions)
pdp_2d <- model_profile(
  explainer_rf,
  variables = c("flipper_length_mm", "bill_length_mm"),
  N = 100
)

# Note: 2D plots require additional processing
# For simplicity, we'll show individual effects
Figure 9.2: Partial dependence: Flipper length
Figure 9.3: Partial dependence: Bill length
Results Interpretation

Partial dependence plots reveal important ecological relationships:

  1. Marginal Effects
    • Shows how each predictor affects the response on average
    • Averages over the observed values of the other predictors; strongly correlated predictors can make this averaging misleading (a by-hand version is sketched after this list)
    • Reveals non-linear relationships that linear models would miss
  2. Ecological Insights
    • Flipper length typically shows a strong positive relationship with body mass
    • The relationship may be non-linear (e.g., diminishing returns at large sizes)
    • Different species may show different patterns (captured by species variable)
  3. Model Behavior
    • Helps validate that the model learned sensible relationships
    • Can reveal unexpected patterns that warrant further investigation
    • Useful for communicating model predictions to non-technical audiences
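
As a complement to model_profile(), partial dependence can be approximated by hand: fix the variable of interest at a grid value for every observation, predict, and average. A minimal sketch, assuming the fitted workflow and training data from above:

Code
# Partial dependence by hand for flipper length (illustrative sketch)
flipper_grid <- seq(min(penguin_train$flipper_length_mm),
                    max(penguin_train$flipper_length_mm),
                    length.out = 20)

pd_manual <- sapply(flipper_grid, function(x) {
  # Set flipper length to the grid value for all penguins, then average
  # the predictions: this is the partial dependence at x
  newdata <- penguin_train %>% mutate(flipper_length_mm = x)
  mean(predict(final_fit, new_data = newdata)$.pred)
})

plot(flipper_grid, pd_manual, type = "l",
     xlab = "Flipper Length (mm)", ylab = "Mean Predicted Body Mass (g)")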

9.2.3 Individual Conditional Expectation (ICE) Plots

While partial dependence shows average effects, ICE plots show how predictions change for individual observations.

Code
# Create ICE plots for flipper length
ice_flipper <- model_profile(
  explainer_rf,
  variables = "flipper_length_mm",
  N = 50,  # Use 50 observations for clarity
  type = "conditional"
)

# Plot ICE curves
plot(ice_flipper) +
  labs(title = "Individual Conditional Expectation: Flipper Length",
       subtitle = "Each line shows how prediction changes for one penguin",
       x = "Flipper Length (mm)",
       y = "Predicted Body Mass (g)") +
  theme_minimal()

# Compare with partial dependence (average)
plot(ice_flipper, geom = "profiles") +
  labs(title = "ICE Plots with Partial Dependence (yellow line)",
       subtitle = "Individual effects vs. average effect",
       x = "Flipper Length (mm)",
       y = "Predicted Body Mass (g)") +
  theme_minimal()
Figure 9.4: Individual conditional expectation curves
Figure 9.5: ICE plots with partial dependence overlay
PROFESSIONAL TIP: Choosing Interpretation Methods

When interpreting machine learning models for ecological applications:

  1. Variable Importance
    • Use permutation importance for model-agnostic assessment
    • Compare with domain knowledge to validate model behavior
    • Consider both statistical and ecological importance
    • Report uncertainty (e.g., the spread across permutation rounds) when possible; a sketch follows after this list
  2. Partial Dependence Plots
    • Show average marginal effects of predictors
    • Useful for understanding overall patterns
    • Can mask heterogeneous effects across observations
    • Best for communicating general relationships
  3. ICE Plots
    • Reveal individual-level heterogeneity
    • Identify subgroups with different response patterns
    • More complex to interpret than PDP
    • Useful for detecting interactions
  4. Break-Down Plots (for individual predictions)
    • Explain specific predictions step-by-step
    • Useful for understanding outliers or unusual cases
    • Helps build trust in model predictions
    • Essential for high-stakes conservation decisions
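
To report the uncertainty mentioned above, the permutation rounds can be repeated and summarized; a minimal sketch using the B argument of model_parts() (50 rounds is an arbitrary choice):

Code
# Repeat the permutations and summarize the spread of dropout loss
vi_boot <- model_parts(explainer_rf,
                       loss_function = loss_root_mean_square,
                       B = 50)

as.data.frame(vi_boot) %>%
  filter(!variable %in% c("_baseline_", "_full_model_")) %>%
  group_by(variable) %>%
  summarize(
    mean_loss = mean(dropout_loss),
    lower_90  = quantile(dropout_loss, 0.05),
    upper_90  = quantile(dropout_loss, 0.95),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_loss))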

9.2.4 Explaining Individual Predictions

For specific conservation decisions, we often need to understand why the model made a particular prediction.

Code
# Select an interesting observation to explain
observation_to_explain <- penguin_test[1, ]

cat("Explaining prediction for:\n")
#> Explaining prediction for:
print(observation_to_explain %>% select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g))
#> # A tibble: 1 × 5
#>   species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>            <dbl>         <dbl>             <int>       <int>
#> 1 Adelie            39.1          18.7               181        3750

# Create a break-down explanation
bd <- predict_parts(
  explainer_rf,
  new_observation = observation_to_explain,
  type = "break_down"
)

# Plot the break-down
plot(bd) +
  labs(title = "Break-Down Plot for Individual Prediction",
       subtitle = "Shows contribution of each variable to the final prediction") +
  theme_minimal()

# Create SHAP values for more robust attribution
shap <- predict_parts(
  explainer_rf,
  new_observation = observation_to_explain,
  type = "shap",
  B = 25  # Number of random orderings
)

# Plot SHAP values
plot(shap) +
  labs(title = "SHAP Values for Individual Prediction",
       subtitle = "Average contribution across all possible variable orderings") +
  theme_minimal()
Figure 9.6: Break-down plot for individual prediction
Figure 9.7: SHAP values for individual prediction
Code Explanation

Individual prediction explanations help understand specific cases:

  1. Break-Down Plots
    • Show step-by-step how each variable contributes to the prediction
    • Start from the average prediction and add each variable’s effect
    • Order matters: variables are added in order of importance
  2. SHAP Values
    • SHapley Additive exPlanations, grounded in cooperative game theory
    • Average contribution across variable orderings (approximated here with B = 25 random orderings)
    • More robust than break-down plots, but more computationally intensive
    • Provides a fair attribution of the prediction to each feature (a sketch of inspecting the SHAP output follows after this list)
  3. Ecological Applications
    • Explain why a particular species is predicted in a certain location
    • Understand which factors drive high/low abundance predictions
    • Communicate model reasoning to stakeholders and decision-makers
    • Identify data quality issues or model failures
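
The SHAP object computed above can also be inspected as a data frame, which is convenient for reporting; a minimal sketch (the column names follow the DALEX output and may differ across versions):

Code
# Turn the SHAP explanation into a data frame for reporting
shap_df <- as.data.frame(shap)
head(shap_df)

# Rank variables by their average absolute contribution to this prediction
shap_df %>%
  group_by(variable_name) %>%
  summarize(mean_abs_contribution = mean(abs(contribution)),
            .groups = "drop") %>%
  arrange(desc(mean_abs_contribution))
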
Communicating ML Results in Ecology

When presenting machine learning results to ecological audiences:

  1. Focus on Ecological Meaning
    • Translate statistical importance to ecological significance
    • Connect model patterns to known ecological processes
    • Discuss biological plausibility of relationships
    • Relate findings to conservation implications
  2. Visualize Effectively
    • Use partial dependence plots for main effects
    • Show uncertainty in predictions
    • Include actual data points alongside model predictions
    • Use colors and labels that resonate with ecological context
  3. Acknowledge Limitations
    • Discuss what the model cannot capture (e.g., biotic interactions)
    • Explain extrapolation risks
    • Mention data quality and sampling bias issues
    • Suggest validation approaches
  4. Provide Actionable Insights
    • Link predictions to management recommendations
    • Identify key variables that can be monitored or manipulated
    • Suggest priority areas for conservation action
    • Propose adaptive management strategies based on model uncertainty

Model interpretation transforms “black box” machine learning into a tool for ecological understanding. By explaining how models make predictions, we gain insights into ecological processes and build confidence in using ML for conservation decision-making.

9.3 Time Series Analysis for Ecological Data

Time series data—observations collected sequentially over time—are common in ecology. Population counts, climate measurements, and phenological observations all have temporal structure that standard regression methods cannot properly handle.

9.3.1 Understanding Temporal Autocorrelation

Code
# Load packages for time series analysis
library(forecast)
library(tsibble)
library(feasts)
library(ggplot2)
library(dplyr)

# Create simulated population time series
set.seed(2024)
years <- 1990:2023
n_years <- length(years)

# Simulate population with trend, seasonality, and noise
trend <- 100 + 2 * (1:n_years)  # Increasing trend
seasonal <- 10 * sin(2 * pi * (1:n_years) / 5)  # 5-year cycle
noise <- rnorm(n_years, 0, 5)
population <- trend + seasonal + noise

# Create time series data frame
pop_ts_data <- data.frame(
  year = years,
  population = round(population)
)

# Visualize the time series
ggplot(pop_ts_data, aes(x = year, y = population)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue", size = 2) +
  labs(
    title = "Simulated Population Time Series",
    subtitle = "Shows trend and cyclic pattern typical of ecological data",
    x = "Year",
    y = "Population Size"
  ) +
  theme_minimal()

# Test for temporal autocorrelation
# Create ACF and PACF plots
par(mfrow = c(2, 1))
acf(pop_ts_data$population, main = "Autocorrelation Function (ACF)")
pacf(pop_ts_data$population, main = "Partial Autocorrelation Function (PACF)")
Figure 9.8: Simulated population time series
Figure 9.9: Autocorrelation and partial autocorrelation functions
Code Explanation

This code introduces time series concepts for ecological data:

  1. Time Series Components
    • Trend: Long-term increase or decrease (e.g., climate change effects)
    • Seasonality: Regular cyclic patterns (e.g., annual breeding cycles)
    • Noise: Random variation (e.g., environmental stochasticity)
  2. Autocorrelation Functions
    • ACF: Shows correlation between observations at different time lags
    • PACF: Shows direct correlation after removing indirect effects
    • Significant spikes indicate temporal dependence
  3. Ecological Relevance
    • Population dynamics often show autocorrelation due to age structure
    • Climate variables have strong seasonal patterns
    • Ignoring autocorrelation leads to invalid statistical inferences; a formal check (the Ljung-Box test) is sketched after this list
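
Beyond visual inspection of the ACF and PACF, a formal test can confirm whether autocorrelation is present; a minimal sketch using the Ljung-Box test from base R:

Code
# Ljung-Box test for autocorrelation up to lag 10
# (a small p-value indicates significant temporal dependence)
Box.test(pop_ts_data$population, lag = 10, type = "Ljung-Box")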

9.3.2 Time Series Decomposition

Code
# Convert to a ts object (annual series; used later for ARIMA forecasting)
pop_ts <- ts(pop_ts_data$population, start = 1990, frequency = 1)

# Decompose the time series
# stl() requires a seasonal frequency > 1, so we declare the 5-year cycle here
decomp <- stl(ts(pop_ts_data$population, frequency = 5), s.window = "periodic")

# Plot decomposition
plot(decomp, main = "Time Series Decomposition")

# Extract components
trend_component <- decomp$time.series[, "trend"]
seasonal_component <- decomp$time.series[, "seasonal"]
remainder <- decomp$time.series[, "remainder"]

# Create a data frame for ggplot
decomp_df <- data.frame(
  year = years,
  observed = pop_ts_data$population,
  trend = as.numeric(trend_component),
  seasonal = as.numeric(seasonal_component),
  remainder = as.numeric(remainder)
)

# Visualize components with ggplot
library(tidyr)
decomp_long <- decomp_df %>%
  pivot_longer(cols = c(observed, trend, seasonal, remainder),
               names_to = "component",
               values_to = "value")

ggplot(decomp_long, aes(x = year, y = value)) +
  geom_line(color = "steelblue") +
  facet_wrap(~component, scales = "free_y", ncol = 1) +
  labs(
    title = "Time Series Decomposition Components",
    x = "Year",
    y = "Value"
  ) +
  theme_minimal()
Figure 9.10: Time series decomposition
Figure 9.11: Decomposition components visualization
Results Interpretation

Time series decomposition reveals the structure of temporal data:

  1. Trend Component
    • Shows the long-term direction of the population
    • Increasing trend suggests population growth
    • Can be linear or non-linear
    • Important for long-term conservation planning
  2. Seasonal Component
    • Reveals cyclic patterns in the data
    • In this example, shows a 5-year cycle
    • Could represent environmental cycles (e.g., El Niño)
    • Helps identify optimal monitoring times
  3. Remainder (Residuals)
    • Random variation after removing trend and seasonality
    • Should be approximately white noise if the decomposition is adequate (a quick check is sketched after this list)
    • Large residuals may indicate unusual events or data quality issues
  4. Ecological Applications
    • Separate long-term trends from natural cycles
    • Identify critical periods in species life cycles
    • Detect anomalies or regime shifts
    • Inform sampling design for monitoring programs
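
As a rough check on the decomposition, we can compare how much variation each component carries and test whether the remainder behaves like white noise; a minimal sketch using the decomp_df data frame built above:

Code
# Share of total variance carried by each component (rough descriptive
# measure; components are not exactly orthogonal, so shares need not sum to 1)
var_total <- var(decomp_df$observed)
c(trend     = var(decomp_df$trend) / var_total,
  seasonal  = var(decomp_df$seasonal) / var_total,
  remainder = var(decomp_df$remainder) / var_total)

# Ljung-Box test on the remainder: a large p-value is consistent with
# white noise, i.e., the trend and seasonal terms captured the structure
Box.test(decomp_df$remainder, lag = 10, type = "Ljung-Box")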

9.3.3 Forecasting with ARIMA Models

Code
# Fit an ARIMA model
# auto.arima selects the best model automatically
arima_model <- auto.arima(pop_ts)

# Display model summary
summary(arima_model)
#> Series: pop_ts 
#> ARIMA(4,1,0) with drift 
#> 
#> Coefficients:
#>           ar1      ar2      ar3      ar4   drift
#>       -0.6980  -0.6483  -0.8834  -0.4754  1.8761
#> s.e.   0.1499   0.1160   0.1091   0.1491  0.3112
#> 
#> sigma^2 = 46.02:  log likelihood = -109.02
#> AIC=230.04   AICc=233.27   BIC=239.02
#> 
#> Training set error measures:
#>                      ME     RMSE     MAE        MPE     MAPE      MASE
#> Training set -0.3425243 6.156046 4.88382 -0.5880105 3.756309 0.5052228
#>                     ACF1
#> Training set -0.02601765

# Make forecasts for the next 5 years
forecast_result <- forecast(arima_model, h = 5)

# Plot the forecast
autoplot(forecast_result) +
  labs(
    title = "Population Forecast (ARIMA Model)",
    subtitle = "5-year forecast with 80% and 95% confidence intervals",
    x = "Year",
    y = "Population Size"
  ) +
  theme_minimal()

# Extract forecast values
forecast_df <- data.frame(
  year = 2024:2028,
  point_forecast = as.numeric(forecast_result$mean),
  lower_80 = as.numeric(forecast_result$lower[, 1]),
  upper_80 = as.numeric(forecast_result$upper[, 1]),
  lower_95 = as.numeric(forecast_result$lower[, 2]),
  upper_95 = as.numeric(forecast_result$upper[, 2])
)

print(forecast_df)
#>   year point_forecast lower_80 upper_80 lower_95 upper_95
#> 1 2024       167.6379 158.9443 176.3314 154.3422 180.9335
#> 2 2025       175.1418 166.0603 184.2232 161.2529 189.0307
#> 3 2026       173.7157 164.5521 182.8793 159.7011 187.7303
#> 4 2027       171.1390 161.8463 180.4317 156.9270 185.3509
#> 5 2028       172.9301 163.3007 182.5595 158.2032 187.6570
Figure 9.12: Population forecast with ARIMA model
PROFESSIONAL TIP: Time Series Forecasting in Ecology

When forecasting ecological time series:

  1. Model Selection
    • Use auto.arima() for automatic model selection
    • Consider seasonal ARIMA for data with clear seasonality
    • Evaluate multiple models and compare AIC/BIC
    • Check residual diagnostics to validate model assumptions (a sketch follows after this list)
  2. Forecast Interpretation
    • Point forecasts are the expected values
    • Confidence intervals quantify uncertainty
    • Uncertainty increases with forecast horizon
    • Consider biological constraints (e.g., populations can’t be negative)
  3. Ecological Considerations
    • Short-term forecasts are more reliable than long-term
    • Regime shifts or environmental changes can invalidate models
    • Incorporate external covariates when available (e.g., climate indices)
    • Use ensemble forecasts for robust predictions
  4. Communication
    • Always show uncertainty in forecasts
    • Explain assumptions and limitations
    • Relate forecasts to management thresholds
    • Update forecasts as new data become available
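
As a starting point for the residual diagnostics recommended above, a minimal sketch using checkresiduals() from the forecast package:

Code
# Residual diagnostics for the fitted ARIMA model: plots the residuals,
# their ACF, and a histogram, and runs a Ljung-Box test
# (a non-significant p-value suggests the residuals are close to white noise)
checkresiduals(arima_model)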

9.4 Chapter Summary

9.4.1 Key Concepts

  • Model Interpretation: DALEX framework provides model-agnostic interpretation methods
  • Variable Importance: Permutation importance quantifies predictor contributions
  • Partial Dependence: Shows average marginal effects of predictors
  • ICE Plots: Reveal individual-level heterogeneity in predictions
  • SHAP Values: Provide fair attribution of predictions to features
  • Temporal Autocorrelation: Nearby time points are often correlated
  • Time Series Decomposition: Separates trend, seasonal, and noise components
  • ARIMA Models: Flexible framework for time series forecasting
  • GLS Models: Account for temporal autocorrelation in regression (a minimal sketch follows below)
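
As a brief illustration of the GLS approach listed above, here is a minimal sketch fitting a trend model with AR(1) errors to the simulated population data; it assumes the nlme package, which is not loaded elsewhere in this chapter:

Code
# Generalized least squares with an AR(1) correlation structure
# (minimal sketch; nlme is an additional dependency)
library(nlme)

gls_fit <- gls(population ~ year,
               correlation = corAR1(form = ~ year),
               data = pop_ts_data)
summary(gls_fit)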

9.4.2 R Functions Learned

  • explain_tidymodels(), model_parts(), model_profile(), predict_parts(): DALEX functions for building explainers, variable importance, partial dependence/ICE profiles, and individual-prediction explanations
  • loss_root_mean_square(): loss function used for permutation importance
  • acf(), pacf(): autocorrelation and partial autocorrelation diagnostics
  • ts(), stl(): creating time series objects and decomposing them into trend, seasonal, and remainder components
  • auto.arima(), forecast(), autoplot(): fitting ARIMA models, generating forecasts, and plotting them

9.4.3 Next Steps

Congratulations on completing the core chapters of Data Analysis in Natural Sciences! You now possess a comprehensive toolkit for analyzing ecological data, from data cleaning and visualization to advanced modeling and interpretation.

To continue your journey:

  • Apply these techniques to your own datasets
  • Explore the documentation for packages like tidymodels, DALEX, and fable
  • Join the R for Data Science community to keep learning
  • Contribute to open-source projects in ecology and conservation

For applied examples in conservation science, check out Chapter 10: Conservation Applications.

9.5 Exercises

  1. Model Interpretation: Use DALEX to interpret a random forest model predicting species abundance. Which environmental variables are most important?

  2. Partial Dependence: Create partial dependence plots for the top 3 most important variables. Do the relationships make ecological sense?

  3. Individual Predictions: Select an unusual prediction (outlier) and use SHAP values to explain why the model made that prediction.

  4. Time Series Decomposition: Analyze a real ecological time series (e.g., population counts, temperature data). Decompose it and interpret the components.

  5. ARIMA Forecasting: Build an ARIMA model for a population time series. How far into the future can you reliably forecast?

  6. Phenology Analysis: Analyze phenological data (e.g., bird arrival dates, flowering times) for evidence of climate change. Account for temporal autocorrelation.

  7. Model Comparison: Compare interpretations from different ML models (random forest, gradient boosting, neural network) on the same dataset. Do they agree on variable importance?

  8. Ensemble Interpretation: Create an ensemble model and interpret it. How does the interpretation differ from individual models?