9  Advanced Modeling Techniques

Learning Objectives

By the end of this chapter, you will be able to:

  1. Interpret machine learning models using model-agnostic methods
  2. Calculate and visualize variable importance
  3. Create partial dependence and ICE plots
  4. Explain individual predictions with SHAP values
  5. Analyze time series data with temporal autocorrelation
  6. Decompose time series into trend, seasonal, and noise components
  7. Build ARIMA models for forecasting
  8. Account for temporal autocorrelation in regression models

9.1 Introduction

This chapter explores advanced modeling techniques that extend beyond traditional regression. We’ll cover model interpretation methods that help us understand “black box” machine learning models, and time series analysis for data collected sequentially over time.

9.2 Model Interpretation and Explainability

While machine learning models like Random Forest can achieve high predictive accuracy, understanding why they make specific predictions is crucial for ecological applications. Model interpretation helps us extract ecological insights and communicate results to stakeholders.

9.2.1 Variable Importance

Code
# Load packages for model interpretation
library(DALEX)
library(DALEXtra)
library(tidymodels)
library(palmerpenguins)

# Prepare the penguins data (same as chapter 08)
penguins <- palmerpenguins::penguins %>%
  drop_na()

# Split data into training and testing sets
set.seed(123)
penguin_split <- initial_split(penguins, prop = 0.75, strata = species)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

# Define a recipe for preprocessing
penguin_recipe <- recipe(body_mass_g ~ bill_length_mm + bill_depth_mm +
                         flipper_length_mm + species,
                         data = penguin_train) %>%
  step_dummy(species) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_zv(all_predictors())

# Define Random Forest model specification
rf_spec <- rand_forest(trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Create workflow
rf_wf <- workflow() %>%
  add_recipe(penguin_recipe) %>%
  add_model(rf_spec)

# Fit the final model
final_fit <- rf_wf %>%
  fit(data = penguin_train)

# Create an explainer object
explainer_rf <- explain_tidymodels(
  final_fit,
  data = penguin_test %>% select(-body_mass_g),
  y = penguin_test$body_mass_g,
  label = "Random Forest",
  verbose = FALSE
)

# Calculate variable importance using permutation
vi_rf <- model_parts(explainer_rf, loss_function = loss_root_mean_square)

# Visualize variable importance
plot(vi_rf) +
  labs(title = "Variable Importance for Body Mass Prediction",
       subtitle = "Permutation-based importance shows contribution of each predictor") +
  theme_minimal()

# Get the importance values
vi_df <- as.data.frame(vi_rf)
vi_summary <- vi_df %>%
  filter(variable != "_baseline_", variable != "_full_model_") %>%
  group_by(variable) %>%
  summarize(
    mean_dropout_loss = mean(dropout_loss),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_dropout_loss))

print(vi_summary)
#> # A tibble: 7 × 2
#>   variable          mean_dropout_loss
#>   <chr>                         <dbl>
#> 1 species                        574.
#> 2 flipper_length_mm              536.
#> 3 bill_depth_mm                  384.
#> 4 bill_length_mm                 351.
#> 5 island                         311.
#> 6 sex                            311.
#> 7 year                           311.
Figure 9.1: Variable importance for body mass prediction
Code Explanation

This code demonstrates model-agnostic variable importance:

  1. DALEX Framework
    • Creates an “explainer” object that wraps the tidymodels workflow
    • Provides a unified interface for model interpretation
    • Works with any model type (linear, tree-based, neural networks)
  2. Permutation Importance
    • Measures how much model performance decreases when a variable is randomly shuffled
    • Higher dropout loss = more important variable
    • Unlike tree-based impurity importance, this is model-agnostic and works for any model; note that strongly correlated predictors can share or dilute importance (a by-hand version is sketched after this list)
  3. Interpretation
    • Variables are ranked by their contribution to predictions
    • island, sex, and year appear in the output because they are in the explainer data but not in the recipe, so permuting them leaves performance essentially unchanged
    • Helps identify which morphological features are most predictive of body mass
    • Provides ecological insights into species morphology
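
To make the permutation idea concrete, here is a minimal by-hand sketch using the fitted workflow and test set from above; the rmse_of() helper is hypothetical, not part of DALEX.

Code
# A by-hand version of permutation importance for one predictor
# (illustrative sketch; DALEX's model_parts() does this for every
# predictor and repeats the shuffling several times)
rmse_of <- function(data) {
  preds <- predict(final_fit, new_data = data)$.pred
  sqrt(mean((data$body_mass_g - preds)^2))
}

baseline_rmse <- rmse_of(penguin_test)

# Shuffle flipper length and measure how much RMSE increases
set.seed(42)
shuffled <- penguin_test %>%
  mutate(flipper_length_mm = sample(flipper_length_mm))

c(baseline   = baseline_rmse,
  permuted   = rmse_of(shuffled),
  importance = rmse_of(shuffled) - baseline_rmse)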

9.2.2 Partial Dependence Plots

Partial dependence plots show how predictions change as a single variable varies, while averaging over all other variables.

Code
# Create partial dependence profiles for key variables
pdp_flipper <- model_profile(
  explainer_rf,
  variables = "flipper_length_mm",
  N = NULL  # Use all observations
)

pdp_bill <- model_profile(
  explainer_rf,
  variables = "bill_length_mm",
  N = NULL
)

# Plot partial dependence
plot(pdp_flipper) +
  labs(title = "Partial Dependence: Flipper Length",
       subtitle = "Shows average effect of flipper length on predicted body mass",
       x = "Flipper Length (mm)",
       y = "Predicted Body Mass (g)") +
  theme_minimal()

plot(pdp_bill) +
  labs(title = "Partial Dependence: Bill Length",
       subtitle = "Shows average effect of bill length on predicted body mass",
       x = "Bill Length (mm)",
       y = "Predicted Body Mass (g)") +
  theme_minimal()

# Compute profiles for two variables at once (a starting point for exploring interactions)
pdp_2d <- model_profile(
  explainer_rf,
  variables = c("flipper_length_mm", "bill_length_mm"),
  N = 100
)

# Note: 2D plots require additional processing
# For simplicity, we'll show individual effects
Figure 9.2: Partial dependence: Flipper length
Figure 9.3: Partial dependence: Bill length
Results Interpretation

Partial dependence plots reveal important ecological relationships:

  1. Marginal Effects
    • Shows how each predictor affects the response on average
    • Averages over the observed values of the other predictors; strongly correlated predictors can make this averaging misleading (a by-hand version is sketched after this list)
    • Reveals non-linear relationships that linear models would miss
  2. Ecological Insights
    • Flipper length typically shows a strong positive relationship with body mass
    • The relationship may be non-linear (e.g., diminishing returns at large sizes)
    • Different species may show different patterns (captured by species variable)
  3. Model Behavior
    • Helps validate that the model learned sensible relationships
    • Can reveal unexpected patterns that warrant further investigation
    • Useful for communicating model predictions to non-technical audiences
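
As a complement to model_profile(), partial dependence can be approximated by hand: fix the variable of interest at a grid value for every observation, predict, and average. A minimal sketch, assuming the fitted workflow and training data from above:

Code
# Partial dependence by hand for flipper length (illustrative sketch)
flipper_grid <- seq(min(penguin_train$flipper_length_mm),
                    max(penguin_train$flipper_length_mm),
                    length.out = 20)

pd_manual <- sapply(flipper_grid, function(x) {
  # Set flipper length to the grid value for all penguins, then average
  # the predictions: this is the partial dependence at x
  newdata <- penguin_train %>% mutate(flipper_length_mm = x)
  mean(predict(final_fit, new_data = newdata)$.pred)
})

plot(flipper_grid, pd_manual, type = "l",
     xlab = "Flipper Length (mm)", ylab = "Mean Predicted Body Mass (g)")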

9.2.3 Individual Conditional Expectation (ICE) Plots

While partial dependence shows average effects, ICE plots show how predictions change for individual observations.

Code
# Create ICE plots for flipper length
ice_flipper <- model_profile(
  explainer_rf,
  variables = "flipper_length_mm",
  N = 50,  # Use 50 observations for clarity
  type = "conditional"
)

# Plot ICE curves
plot(ice_flipper) +
  labs(title = "Individual Conditional Expectation: Flipper Length",
       subtitle = "Each line shows how prediction changes for one penguin",
       x = "Flipper Length (mm)",
       y = "Predicted Body Mass (g)") +
  theme_minimal()

# Compare with partial dependence (average)
plot(ice_flipper, geom = "profiles") +
  labs(title = "ICE Plots with Partial Dependence (yellow line)",
       subtitle = "Individual effects vs. average effect",
       x = "Flipper Length (mm)",
       y = "Predicted Body Mass (g)") +
  theme_minimal()
Figure 9.4: Individual conditional expectation curves
Figure 9.5: ICE plots with partial dependence overlay
PROFESSIONAL TIP: Choosing Interpretation Methods

When interpreting machine learning models for ecological applications:

  1. Variable Importance
    • Use permutation importance for model-agnostic assessment
    • Compare with domain knowledge to validate model behavior
    • Consider both statistical and ecological importance
    • Report uncertainty (e.g., the spread across permutation rounds) when possible; a sketch follows after this list
  2. Partial Dependence Plots
    • Show average marginal effects of predictors
    • Useful for understanding overall patterns
    • Can mask heterogeneous effects across observations
    • Best for communicating general relationships
  3. ICE Plots
    • Reveal individual-level heterogeneity
    • Identify subgroups with different response patterns
    • More complex to interpret than PDP
    • Useful for detecting interactions
  4. Break-Down Plots (for individual predictions)
    • Explain specific predictions step-by-step
    • Useful for understanding outliers or unusual cases
    • Helps build trust in model predictions
    • Essential for high-stakes conservation decisions
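
To report the uncertainty mentioned above, the permutation rounds can be repeated and summarized; a minimal sketch using the B argument of model_parts() (50 rounds is an arbitrary choice):

Code
# Repeat the permutations and summarize the spread of dropout loss
vi_boot <- model_parts(explainer_rf,
                       loss_function = loss_root_mean_square,
                       B = 50)

as.data.frame(vi_boot) %>%
  filter(!variable %in% c("_baseline_", "_full_model_")) %>%
  group_by(variable) %>%
  summarize(
    mean_loss = mean(dropout_loss),
    lower_90  = quantile(dropout_loss, 0.05),
    upper_90  = quantile(dropout_loss, 0.95),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_loss))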

9.2.4 Explaining Individual Predictions

For specific conservation decisions, we often need to understand why the model made a particular prediction.

Code
# Select an interesting observation to explain
observation_to_explain <- penguin_test[1, ]

cat("Explaining prediction for:\n")
#> Explaining prediction for:
print(observation_to_explain %>% select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g))
#> # A tibble: 1 × 5
#>   species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>            <dbl>         <dbl>             <int>       <int>
#> 1 Adelie            39.1          18.7               181        3750

# Create a break-down explanation
bd <- predict_parts(
  explainer_rf,
  new_observation = observation_to_explain,
  type = "break_down"
)

# Plot the break-down
plot(bd) +
  labs(title = "Break-Down Plot for Individual Prediction",
       subtitle = "Shows contribution of each variable to the final prediction") +
  theme_minimal()

# Create SHAP values for more robust attribution
shap <- predict_parts(
  explainer_rf,
  new_observation = observation_to_explain,
  type = "shap",
  B = 25  # Number of random orderings
)

# Plot SHAP values
plot(shap) +
  labs(title = "SHAP Values for Individual Prediction",
       subtitle = "Average contribution across all possible variable orderings") +
  theme_minimal()
Figure 9.6: Break-down plot for individual prediction
Figure 9.7: SHAP values for individual prediction
Code Explanation

Individual prediction explanations help understand specific cases:

  1. Break-Down Plots
    • Show step-by-step how each variable contributes to the prediction
    • Start from the average prediction and add each variable’s effect
    • Order matters: variables are added in order of importance
  2. SHAP Values
    • SHapley Additive exPlanations, grounded in cooperative game theory
    • Average contribution across variable orderings (approximated here with B = 25 random orderings)
    • More robust than break-down plots, but more computationally intensive
    • Provides a fair attribution of the prediction to each feature (a sketch of inspecting the SHAP output follows after this list)
  3. Ecological Applications
    • Explain why a particular species is predicted in a certain location
    • Understand which factors drive high/low abundance predictions
    • Communicate model reasoning to stakeholders and decision-makers
    • Identify data quality issues or model failures
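
The SHAP object computed above can also be inspected as a data frame, which is convenient for reporting; a minimal sketch (the column names follow the DALEX output and may differ across versions):

Code
# Turn the SHAP explanation into a data frame for reporting
shap_df <- as.data.frame(shap)
head(shap_df)

# Rank variables by their average absolute contribution to this prediction
shap_df %>%
  group_by(variable_name) %>%
  summarize(mean_abs_contribution = mean(abs(contribution)),
            .groups = "drop") %>%
  arrange(desc(mean_abs_contribution))
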
Communicating ML Results in Ecology

When presenting machine learning results to ecological audiences:

  1. Focus on Ecological Meaning
    • Translate statistical importance to ecological significance
    • Connect model patterns to known ecological processes
    • Discuss biological plausibility of relationships
    • Relate findings to conservation implications
  2. Visualize Effectively
    • Use partial dependence plots for main effects
    • Show uncertainty in predictions
    • Include actual data points alongside model predictions
    • Use colors and labels that resonate with ecological context
  3. Acknowledge Limitations
    • Discuss what the model cannot capture (e.g., biotic interactions)
    • Explain extrapolation risks
    • Mention data quality and sampling bias issues
    • Suggest validation approaches
  4. Provide Actionable Insights
    • Link predictions to management recommendations
    • Identify key variables that can be monitored or manipulated
    • Suggest priority areas for conservation action
    • Propose adaptive management strategies based on model uncertainty

Model interpretation transforms “black box” machine learning into a tool for ecological understanding. By explaining how models make predictions, we gain insights into ecological processes and build confidence in using ML for conservation decision-making.

9.3 Time Series Analysis for Ecological Data

Time series data—observations collected sequentially over time—are common in ecology. Population counts, climate measurements, and phenological observations all have temporal structure that standard regression methods cannot properly handle.

9.3.1 Understanding Temporal Autocorrelation

Code
# Load packages for time series analysis
library(forecast)
library(tsibble)
library(feasts)
library(ggplot2)
library(dplyr)

# Create simulated population time series
set.seed(2024)
years <- 1990:2023
n_years <- length(years)

# Simulate population with trend, seasonality, and noise
trend <- 100 + 2 * (1:n_years)  # Increasing trend
seasonal <- 10 * sin(2 * pi * (1:n_years) / 5)  # 5-year cycle
noise <- rnorm(n_years, 0, 5)
population <- trend + seasonal + noise

# Create time series data frame
pop_ts_data <- data.frame(
  year = years,
  population = round(population)
)

# Visualize the time series
ggplot(pop_ts_data, aes(x = year, y = population)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue", size = 2) +
  labs(
    title = "Simulated Population Time Series",
    subtitle = "Shows trend and cyclic pattern typical of ecological data",
    x = "Year",
    y = "Population Size"
  ) +
  theme_minimal()

# Test for temporal autocorrelation
# Create ACF and PACF plots
par(mfrow = c(2, 1))
acf(pop_ts_data$population, main = "Autocorrelation Function (ACF)")
pacf(pop_ts_data$population, main = "Partial Autocorrelation Function (PACF)")
Figure 9.8: Simulated population time series
Figure 9.9: Autocorrelation and partial autocorrelation functions
Code Explanation

This code introduces time series concepts for ecological data:

  1. Time Series Components
    • Trend: Long-term increase or decrease (e.g., climate change effects)
    • Seasonality: Regular cyclic patterns (e.g., annual breeding cycles)
    • Noise: Random variation (e.g., environmental stochasticity)
  2. Autocorrelation Functions
    • ACF: Shows correlation between observations at different time lags
    • PACF: Shows direct correlation after removing indirect effects
    • Significant spikes indicate temporal dependence
  3. Ecological Relevance
    • Population dynamics often show autocorrelation due to age structure
    • Climate variables have strong seasonal patterns
    • Ignoring autocorrelation leads to invalid statistical inferences; a formal check (the Ljung-Box test) is sketched after this list
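
Beyond visual inspection of the ACF and PACF, a formal test can confirm whether autocorrelation is present; a minimal sketch using the Ljung-Box test from base R:

Code
# Ljung-Box test for autocorrelation up to lag 10
# (a small p-value indicates significant temporal dependence)
Box.test(pop_ts_data$population, lag = 10, type = "Ljung-Box")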

9.3.2 Time Series Decomposition

Code
# Convert to a ts object (annual series; used later for ARIMA forecasting)
pop_ts <- ts(pop_ts_data$population, start = 1990, frequency = 1)

# Decompose the time series
# stl() requires a seasonal frequency > 1, so we declare the 5-year cycle here
decomp <- stl(ts(pop_ts_data$population, frequency = 5), s.window = "periodic")

# Plot decomposition
plot(decomp, main = "Time Series Decomposition")

# Extract components
trend_component <- decomp$time.series[, "trend"]
seasonal_component <- decomp$time.series[, "seasonal"]
remainder <- decomp$time.series[, "remainder"]

# Create a data frame for ggplot
decomp_df <- data.frame(
  year = years,
  observed = pop_ts_data$population,
  trend = as.numeric(trend_component),
  seasonal = as.numeric(seasonal_component),
  remainder = as.numeric(remainder)
)

# Visualize components with ggplot
library(tidyr)
decomp_long <- decomp_df %>%
  pivot_longer(cols = c(observed, trend, seasonal, remainder),
               names_to = "component",
               values_to = "value")

ggplot(decomp_long, aes(x = year, y = value)) +
  geom_line(color = "steelblue") +
  facet_wrap(~component, scales = "free_y", ncol = 1) +
  labs(
    title = "Time Series Decomposition Components",
    x = "Year",
    y = "Value"
  ) +
  theme_minimal()
Figure 9.10: Time series decomposition
Figure 9.11: Decomposition components visualization
Results Interpretation

Time series decomposition reveals the structure of temporal data:

  1. Trend Component
    • Shows the long-term direction of the population
    • Increasing trend suggests population growth
    • Can be linear or non-linear
    • Important for long-term conservation planning
  2. Seasonal Component
    • Reveals cyclic patterns in the data
    • In this example, shows a 5-year cycle
    • Could represent environmental cycles (e.g., El Niño)
    • Helps identify optimal monitoring times
  3. Remainder (Residuals)
    • Random variation after removing trend and seasonality
    • Should be approximately white noise if the decomposition is adequate (a quick check is sketched after this list)
    • Large residuals may indicate unusual events or data quality issues
  4. Ecological Applications
    • Separate long-term trends from natural cycles
    • Identify critical periods in species life cycles
    • Detect anomalies or regime shifts
    • Inform sampling design for monitoring programs
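
As a rough check on the decomposition, we can compare how much variation each component carries and test whether the remainder behaves like white noise; a minimal sketch using the decomp_df data frame built above:

Code
# Share of total variance carried by each component (rough descriptive
# measure; components are not exactly orthogonal, so shares need not sum to 1)
var_total <- var(decomp_df$observed)
c(trend     = var(decomp_df$trend) / var_total,
  seasonal  = var(decomp_df$seasonal) / var_total,
  remainder = var(decomp_df$remainder) / var_total)

# Ljung-Box test on the remainder: a large p-value is consistent with
# white noise, i.e., the trend and seasonal terms captured the structure
Box.test(decomp_df$remainder, lag = 10, type = "Ljung-Box")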

9.3.3 Forecasting with ARIMA Models

Code
# Fit an ARIMA model
# auto.arima selects the best model automatically
arima_model <- auto.arima(pop_ts)

# Display model summary
summary(arima_model)
#> Series: pop_ts 
#> ARIMA(4,1,0) with drift 
#> 
#> Coefficients:
#>           ar1      ar2      ar3      ar4   drift
#>       -0.6980  -0.6483  -0.8834  -0.4754  1.8761
#> s.e.   0.1499   0.1160   0.1091   0.1491  0.3112
#> 
#> sigma^2 = 46.02:  log likelihood = -109.02
#> AIC=230.04   AICc=233.27   BIC=239.02
#> 
#> Training set error measures:
#>                      ME     RMSE     MAE        MPE     MAPE      MASE
#> Training set -0.3425243 6.156046 4.88382 -0.5880105 3.756309 0.5052228
#>                     ACF1
#> Training set -0.02601765

# Make forecasts for the next 5 years
forecast_result <- forecast(arima_model, h = 5)

# Plot the forecast
autoplot(forecast_result) +
  labs(
    title = "Population Forecast (ARIMA Model)",
    subtitle = "5-year forecast with 80% and 95% confidence intervals",
    x = "Year",
    y = "Population Size"
  ) +
  theme_minimal()

# Extract forecast values
forecast_df <- data.frame(
  year = 2024:2028,
  point_forecast = as.numeric(forecast_result$mean),
  lower_80 = as.numeric(forecast_result$lower[, 1]),
  upper_80 = as.numeric(forecast_result$upper[, 1]),
  lower_95 = as.numeric(forecast_result$lower[, 2]),
  upper_95 = as.numeric(forecast_result$upper[, 2])
)

print(forecast_df)
#>   year point_forecast lower_80 upper_80 lower_95 upper_95
#> 1 2024       167.6379 158.9443 176.3314 154.3422 180.9335
#> 2 2025       175.1418 166.0603 184.2232 161.2529 189.0307
#> 3 2026       173.7157 164.5521 182.8793 159.7011 187.7303
#> 4 2027       171.1390 161.8463 180.4317 156.9270 185.3509
#> 5 2028       172.9301 163.3007 182.5595 158.2032 187.6570
Figure 9.12: Population forecast with ARIMA model
PROFESSIONAL TIP: Time Series Forecasting in Ecology

When forecasting ecological time series:

  1. Model Selection
    • Use auto.arima() for automatic model selection
    • Consider seasonal ARIMA for data with clear seasonality
    • Evaluate multiple models and compare AIC/BIC
    • Check residual diagnostics to validate model assumptions (a sketch follows after this list)
  2. Forecast Interpretation
    • Point forecasts are the expected values
    • Confidence intervals quantify uncertainty
    • Uncertainty increases with forecast horizon
    • Consider biological constraints (e.g., populations can’t be negative)
  3. Ecological Considerations
    • Short-term forecasts are more reliable than long-term
    • Regime shifts or environmental changes can invalidate models
    • Incorporate external covariates when available (e.g., climate indices)
    • Use ensemble forecasts for robust predictions
  4. Communication
    • Always show uncertainty in forecasts
    • Explain assumptions and limitations
    • Relate forecasts to management thresholds
    • Update forecasts as new data become available
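
As a starting point for the residual diagnostics recommended above, a minimal sketch using checkresiduals() from the forecast package:

Code
# Residual diagnostics for the fitted ARIMA model: plots the residuals,
# their ACF, and a histogram, and runs a Ljung-Box test
# (a non-significant p-value suggests the residuals are close to white noise)
checkresiduals(arima_model)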

9.4 Chapter Summary

9.4.1 Key Concepts

  • Model Interpretation: DALEX framework provides model-agnostic interpretation methods
  • Variable Importance: Permutation importance quantifies predictor contributions
  • Partial Dependence: Shows average marginal effects of predictors
  • ICE Plots: Reveal individual-level heterogeneity in predictions
  • SHAP Values: Provide fair attribution of predictions to features
  • Temporal Autocorrelation: Nearby time points are often correlated
  • Time Series Decomposition: Separates trend, seasonal, and noise components
  • ARIMA Models: Flexible framework for time series forecasting
  • GLS Models: Account for temporal autocorrelation in regression (a minimal sketch follows below)
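
As a brief illustration of the GLS approach listed above, here is a minimal sketch fitting a trend model with AR(1) errors to the simulated population data; it assumes the nlme package, which is not loaded elsewhere in this chapter:

Code
# Generalized least squares with an AR(1) correlation structure
# (minimal sketch; nlme is an additional dependency)
library(nlme)

gls_fit <- gls(population ~ year,
               correlation = corAR1(form = ~ year),
               data = pop_ts_data)
summary(gls_fit)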

9.4.2 R Functions Learned

  • explain_tidymodels(), model_parts(), model_profile(), predict_parts(): DALEX functions for building explainers, variable importance, partial dependence/ICE profiles, and individual-prediction explanations
  • loss_root_mean_square(): loss function used for permutation importance
  • acf(), pacf(): autocorrelation and partial autocorrelation diagnostics
  • ts(), stl(): creating time series objects and decomposing them into trend, seasonal, and remainder components
  • auto.arima(), forecast(), autoplot(): fitting ARIMA models, generating forecasts, and plotting them

9.4.3 Next Steps

Congratulations on completing the core chapters of Data Analysis in Natural Sciences! You now possess a comprehensive toolkit for analyzing ecological data, from data cleaning and visualization to advanced modeling and interpretation.

To continue your journey:

  • Apply these techniques to your own datasets
  • Explore the documentation for packages like tidymodels, DALEX, and fable
  • Join the R for Data Science community to keep learning
  • Contribute to open-source projects in ecology and conservation

For applied examples in conservation science, check out Chapter 10: Conservation Applications.

9.5 Exercises

  1. Model Interpretation: Use DALEX to interpret a random forest model predicting species abundance. Which environmental variables are most important?

  2. Partial Dependence: Create partial dependence plots for the top 3 most important variables. Do the relationships make ecological sense?

  3. Individual Predictions: Select an unusual prediction (outlier) and use SHAP values to explain why the model made that prediction.

  4. Time Series Decomposition: Analyze a real ecological time series (e.g., population counts, temperature data). Decompose it and interpret the components.

  5. ARIMA Forecasting: Build an ARIMA model for a population time series. How far into the future can you reliably forecast?

  6. Phenology Analysis: Analyze phenological data (e.g., bird arrival dates, flowering times) for evidence of climate change. Account for temporal autocorrelation.

  7. Model Comparison: Compare interpretations from different ML models (random forest, gradient boosting, neural network) on the same dataset. Do they agree on variable importance?

  8. Ensemble Interpretation: Create an ensemble model and interpret it. How does the interpretation differ from individual models?