1  Introduction to Data Analysis

1.1 Overview

Learning Objectives

By the end of this chapter, you will be able to:

  1. Explain the importance of data analysis in natural sciences
  2. Install and configure R and RStudio
  3. Describe the tidyverse philosophy and its core packages
  4. Understand the basic data analysis workflow
  5. Identify different types of data used in natural sciences

Data analysis is a critical skill in modern natural sciences research. This chapter introduces the fundamental concepts, tools, and approaches that form the foundation of effective data analysis across various scientific disciplines.

1.2 Why Data Analysis Matters in Natural Sciences

Data analysis plays a pivotal role in natural sciences research for several important reasons:

1.2.1 Evidence-Based Decision Making

Data analysis transforms raw observations into actionable insights, enabling researchers and practitioners to make informed decisions about:

  • Conservation strategies: Identifying priority areas and species for protection
  • Resource management: Optimizing sustainable use of natural resources
  • Agricultural planning: Improving crop yields and farming practices
  • Environmental interventions: Designing effective pollution control measures
  • Climate adaptation: Planning for changing environmental conditions

1.2.2 Pattern Recognition

Through statistical analysis, researchers can identify patterns, trends, and relationships within natural systems that might not be apparent from casual observation alone. This applies to diverse fields including:

  • Ecology and population dynamics
  • Geology and earth processes
  • Marine biology and oceanography
  • Atmospheric science and climatology
  • Agriculture and food systems

1.2.3 Hypothesis Testing

Data analysis provides rigorous methods to test hypotheses about natural phenomena, allowing researchers to build and refine scientific theories about how natural systems function. This is fundamental across all scientific disciplines.

1.2.4 Prediction and Modeling

Advanced analytical techniques enable the development of predictive models that can forecast changes in natural systems, such as:

  • Species distribution shifts under climate change
  • Crop yield predictions based on weather patterns
  • Disease outbreak forecasting
  • Resource depletion projections
  • Ecosystem responses to disturbance

PROFESSIONAL TIP: Principles of Robust Experimental Design

Before diving into data analysis, ensure your experimental design follows these key principles:

  • Formulate clear hypotheses: Define specific, testable hypotheses before collecting data
  • Control for confounding variables: Identify and account for factors that might influence your results
  • Randomize appropriately: Randomly assign treatments to experimental units to reduce bias
  • Include adequate replication: Ensure sufficient sample sizes for statistical power
  • Consider spatial and temporal scales: Match your sampling design to the processes being studied
  • Plan for appropriate controls: Include positive, negative, and procedural controls as needed
  • Pre-register your study: Document your hypotheses and analysis plan before collecting data
  • Plan for data analysis: Select statistical methods based on your design, not just your results
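
The randomization and replication principles above can be sketched in a few lines of base R. This is only an illustration: the plot labels, treatment names, and seed are hypothetical and not taken from any study in this book.

```r
# Hypothetical example: 12 field plots, 3 treatments, 4 replicates each
set.seed(42)  # fix the random number stream so the assignment can be reproduced

plots <- paste0("plot_", 1:12)
treatments <- rep(c("control", "low_N", "high_N"), each = 4)

# sample() with a single vector argument returns a random permutation of it
design <- data.frame(
  plot = plots,
  treatment = sample(treatments)
)

# Check that each treatment still has four replicates
table(design$treatment)
```

Shuffling a balanced vector of treatment labels keeps the replication equal while leaving the assignment to chance.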

1.3 Introduction to R and RStudio

1.3.1 Why R?

R is a programming language and environment designed specifically for statistical computing and graphics. It has become a standard tool for data analysis in many scientific disciplines.

Key advantages of R include:

Advantage                      Description
Open-source and free           Available to anyone without cost
Extensive package ecosystem    Over 20,000 packages for specialized analyses
Reproducibility                Code-based approach ensures analyses can be repeated
Flexibility                    Adaptable to virtually any analytical need
Active community               Large user base provides support and development
Publication-quality graphics   Create professional visualizations
Cross-platform                 Works on Windows, macOS, and Linux

1.3.2 Installing R and RStudio

To get started with R, you need to install two pieces of software:

  1. R - The programming language itself
  2. RStudio - An integrated development environment (IDE) that makes working with R easier

Installation Steps:

  1. Download and install R from CRAN
    • Choose your operating system (Windows, macOS, or Linux)
    • Follow the installation instructions
  2. Download and install RStudio from Posit
    • Choose the free Desktop version
    • Follow the installation instructions

Note: Install R First

Install R before you install RStudio. RStudio is only an interface to R, so it cannot run any code unless R is installed on your computer.

1.3.3 The RStudio Interface

When you open RStudio, you’ll see four main panels:

  1. Source Editor (top-left): Where you write and edit R scripts
  2. Console (bottom-left): Where R commands are executed
  3. Environment/History (top-right): Shows your data objects and command history
  4. Files/Plots/Packages/Help (bottom-right): File browser, plot viewer, package manager, and help documentation

1.4 The Tidyverse: A Modern Approach to Data Science

1.4.1 What is the Tidyverse?

The tidyverse is a collection of R packages designed for data science that share a common philosophy, grammar, and data structures. It represents a modern, coherent approach to data analysis that emphasizes:

  • Readability: Code that is easy to read and understand
  • Consistency: Functions that work in predictable ways
  • Composability: Tools that work well together
  • Human-centered design: Focused on the analyst’s workflow

1.4.2 Core Tidyverse Packages

Code
# Load the tidyverse (this loads multiple packages at once)
library(tidyverse)

As of tidyverse 2.0.0, library(tidyverse) attaches these core packages:

Package    Purpose
ggplot2    Data visualization
dplyr      Data manipulation
tidyr      Data tidying
readr      Data import
purrr      Functional programming
tibble     Modern data frames
stringr    String manipulation
forcats    Factor handling
lubridate  Dates and times

1.4.3 The Pipe Operator

One of the most powerful features of the tidyverse is the pipe operator %>% (or the native R pipe |>). The pipe allows you to chain operations together in a readable, left-to-right flow:

Code
# Without pipes (nested functions - hard to read)
round(mean(sqrt(c(1, 4, 9, 16, 25))), 2)

# With pipes (left-to-right flow - easy to read)
c(1, 4, 9, 16, 25) %>%
  sqrt() %>%
  mean() %>%
  round(2)
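
PROFESSIONAL TIP: Pipe Operator Shortcuts
  • Keyboard shortcut: Use Ctrl+Shift+M (Windows/Linux) or Cmd+Shift+M (macOS) to insert %>% (RStudio can be set to insert |> instead under Tools > Global Options > Code)
  • Native pipe: In R 4.1 and later, you can use the native pipe |> instead of %>%; it needs no package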
  • Best practice: Put each function on its own line for readability
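
As a quick check of the native pipe mentioned in the tip, the sketch below repeats the earlier calculation with |> and compares it with the nested call. It uses base R only and assumes R 4.1 or later.

```r
# Same calculation as the %>% example, written with the native pipe
result <- c(1, 4, 9, 16, 25) |>
  sqrt() |>
  mean() |>
  round(2)

result
#> [1] 3

# The piped and nested versions give the same answer
identical(result, round(mean(sqrt(c(1, 4, 9, 16, 25))), 2))
#> [1] TRUE
```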

1.4.4 Tidy Data Principles

The tidyverse is built around the concept of tidy data:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Tidy data makes analysis easier because it provides a consistent structure that all tidyverse tools expect.
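
To make the three rules concrete, here is a minimal sketch that reshapes a small "wide" table into tidy form with tidyr::pivot_longer(). The site names and counts are invented for illustration only.

```r
library(tidyr)

# Hypothetical "wide" table: one column per survey year
counts_wide <- tibble::tibble(
  site = c("A", "B"),
  `2007` = c(12, 7),
  `2008` = c(15, 9)
)

# Reshape so that year and count are each a single column
counts_tidy <- pivot_longer(
  counts_wide,
  cols = c(`2007`, `2008`),
  names_to = "year",
  values_to = "count"
)

counts_tidy
```

In the wide version the variable "year" is spread across column names; after reshaping, each row is one site-year observation.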

1.5 Installing Required Packages

For the analyses in this book, you’ll need several R packages. Install them with the following code:

Code
# Core tidyverse packages
install.packages("tidyverse")

# Statistical analysis
install.packages(c("rstatix", "car", "performance"))

# Visualization enhancements
install.packages(c("viridis", "patchwork", "scales"))

# Table formatting
install.packages(c("knitr", "kableExtra", "gt"))

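# For this book's datasets (readr is already installed as part of the
# tidyverse, so this line is only needed if you skipped the first step)
install.packages("readr")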

Important: Run This Once

You only need to install packages once on your computer. After installation, you just need to load them with library() at the start of each R session.
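
One way to avoid reinstalling packages every time a script runs is to check what is missing first. The sketch below is one possible approach, not a required step; missing_packages() is a hypothetical helper written for this example, and the package names passed to it are just the ones listed above.

```r
# Return the packages in `pkgs` that are not yet installed
# (missing_packages is a hypothetical helper, not part of any package)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "patchwork", "viridis"))

# Install only what is missing, and only when working interactively
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

The interactive() guard keeps non-interactive runs (for example, rendering a report) from trying to download packages unexpectedly.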

1.6 Your First R Analysis

Let’s walk through a complete analysis using real data to see R and the tidyverse in action.

1.6.1 Loading Data

We’ll use the Palmer Penguins dataset, which contains body measurements for three penguin species recorded at Palmer Station, Antarctica, between 2007 and 2009:

Code
# Load the tidyverse
library(tidyverse)

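# Load the penguin dataset from the CSV file shipped with the palmerpenguins
# package (assumes install.packages("palmerpenguins") has been run once)
penguins <- read_csv(palmerpenguins::path_to_file("penguins.csv"))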

# View the first few rows
head(penguins)
#> # A tibble: 6 × 8
#>   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
#> 1 Adelie  Torgersen           39.1          18.7               181        3750
#> 2 Adelie  Torgersen           39.5          17.4               186        3800
#> 3 Adelie  Torgersen           40.3          18                 195        3250
#> 4 Adelie  Torgersen           NA            NA                  NA          NA
#> 5 Adelie  Torgersen           36.7          19.3               193        3450
#> 6 Adelie  Torgersen           39.3          20.6               190        3650
#> # ℹ 2 more variables: sex <chr>, year <dbl>
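
Code Explanation
  • library(tidyverse): Attaches the core tidyverse packages, including readr and dplyr
  • read_csv(): A readr function that reads a CSV file into a tibble
  • head(): Shows the first six rows of the data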

1.6.2 Exploring the Data Structure

Code
# Get an overview of the data structure
glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
#> $ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

# Summary statistics
summary(penguins)
#>    species             island          bill_length_mm  bill_depth_mm  
#>  Length:344         Length:344         Min.   :32.10   Min.   :13.10  
#>  Class :character   Class :character   1st Qu.:39.23   1st Qu.:15.60  
#>  Mode  :character   Mode  :character   Median :44.45   Median :17.30  
#>                                        Mean   :43.92   Mean   :17.15  
#>                                        3rd Qu.:48.50   3rd Qu.:18.70  
#>                                        Max.   :59.60   Max.   :21.50  
#>                                        NA's   :2       NA's   :2      
#>  flipper_length_mm  body_mass_g       sex                 year     
#>  Min.   :172.0     Min.   :2700   Length:344         Min.   :2007  
#>  1st Qu.:190.0     1st Qu.:3550   Class :character   1st Qu.:2007  
#>  Median :197.0     Median :4050   Mode  :character   Median :2008  
#>  Mean   :200.9     Mean   :4202                      Mean   :2008  
#>  3rd Qu.:213.0     3rd Qu.:4750                      3rd Qu.:2009  
#>  Max.   :231.0     Max.   :6300                      Max.   :2009  
#>  NA's   :2         NA's   :2

Code Explanation
  • glimpse(): A tidyverse function that provides a transposed view of the data, showing each column’s name, type, and first values
  • summary(): Provides basic summary statistics for each column
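
The NA's rows in the summary() output report missing values column by column. A compact way to get the same counts on their own is sketched below. It uses airquality, a dataset that ships with R, rather than the penguin data, so it runs without any files.

```r
# airquality ships with R and contains missing values in two columns
missing_per_column <- colSums(is.na(airquality))
missing_per_column
#>   Ozone Solar.R    Wind    Temp   Month     Day 
#>      37       7       0       0       0       0 

# Proportion of rows with at least one missing value
mean(!complete.cases(airquality))
```

Checking missingness early helps you decide whether to drop rows, as the next section does with filter(), or to handle NA values inside each calculation.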

1.6.3 Data Manipulation with dplyr

The dplyr package provides intuitive verbs for data manipulation:

Code
# Filter: Keep rows that match a condition
adelie_penguins <- penguins %>%
  filter(species == "Adelie")

# Select: Keep only certain columns
measurements <- penguins %>%
  select(species, bill_length_mm, bill_depth_mm, body_mass_g)

# Mutate: Create new columns
penguins_with_ratio <- penguins %>%
  mutate(bill_ratio = bill_length_mm / bill_depth_mm)

# Arrange: Sort rows
sorted_penguins <- penguins %>%
  arrange(desc(body_mass_g))

# Summarize: Calculate summary statistics
species_summary <- penguins %>%
  filter(!is.na(body_mass_g)) %>%
  group_by(species) %>%
  summarize(
    n = n(),
    mean_mass = mean(body_mass_g),
    sd_mass = sd(body_mass_g),
    min_mass = min(body_mass_g),
    max_mass = max(body_mass_g)
  )

# Display the summary
species_summary
#> # A tibble: 3 × 6
#>   species       n mean_mass sd_mass min_mass max_mass
#>   <chr>     <int>     <dbl>   <dbl>    <dbl>    <dbl>
#> 1 Adelie      151     3701.    459.     2850     4775
#> 2 Chinstrap    68     3733.    384.     2700     4800
#> 3 Gentoo      123     5076.    504.     3950     6300

Results Interpretation

The summary table shows key statistics for each penguin species:

  • n: Sample size (number of observations)
  • mean_mass: Average body mass in grams
  • sd_mass: Standard deviation, measuring variability
  • min_mass/max_mass: Range of body masses

This summary reveals that Gentoo penguins are the largest on average, while Chinstrap and Adelie penguins are more similar in size.
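
The summary above removes missing values with filter() before grouping. An alternative, sketched here on R's built-in airquality data, is to keep all rows and pass na.rm = TRUE inside each summary function. The column names n_total, n_measured, and mean_ozone are chosen for this example.

```r
library(dplyr)

# airquality ships with R; Ozone contains missing values
ozone_by_month <- airquality %>%
  group_by(Month) %>%
  summarize(
    n_total = n(),                          # all rows, including missing Ozone
    n_measured = sum(!is.na(Ozone)),        # rows with an Ozone value
    mean_ozone = mean(Ozone, na.rm = TRUE)  # ignore NAs inside the summary
  )

ozone_by_month
```

Reporting both counts makes it visible how many observations each mean is actually based on.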

1.6.4 Visualization with ggplot2

The ggplot2 package creates beautiful, publication-quality graphics:

Code
# Create a boxplot of body mass by species
ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
  scale_fill_viridis_d() +
  labs(
    title = "Body Mass Distribution by Penguin Species",
    subtitle = "Data from Palmer Station, Antarctica (2007-2009)",
    x = "Species",
    y = "Body Mass (g)",
    fill = "Species",
    caption = "Source: Palmer Penguins dataset"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "none"
  )

Body mass distribution across three penguin species from Palmer Station, Antarctica.

Code Explanation

The ggplot2 syntax follows a layered grammar of graphics:

  1. ggplot(): Initialize the plot with data and aesthetic mappings
  2. aes(): Define how variables map to visual properties
  3. geom_boxplot(): Add a boxplot layer
  4. geom_jitter(): Add individual points with slight horizontal spread
  5. scale_fill_viridis_d(): Apply a colorblind-friendly color palette
  6. labs(): Add labels and titles
  7. theme_minimal(): Apply a clean, minimal theme
  8. theme(): Further customize appearance
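
Because each layer is added with +, a plot can be built up and stored one step at a time. The sketch below illustrates this with mpg, a dataset bundled with ggplot2; the output file name and dimensions are illustrative choices, and the file is written to a temporary folder so the example leaves your project untouched.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only (draws empty axes)
p <- ggplot(mpg, aes(x = class, y = hwy))

# Step 2: add a geometry layer
p <- p + geom_boxplot()

# Step 3: add labels and a theme
p <- p +
  labs(x = "Vehicle class", y = "Highway fuel economy (mpg)") +
  theme_minimal()

p

# Save the finished plot; in a real project, use a path inside the project
out_file <- file.path(tempdir(), "boxplot_example.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```

Storing the plot in an object also lets you reuse it, for example to try a different theme without retyping the earlier layers.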

1.6.5 A Scatter Plot with Regression

Code
# Create a scatter plot with regression lines
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = TRUE, alpha = 0.2) +
  scale_color_viridis_d() +
  labs(
    title = "Relationship Between Bill Length and Body Mass",
    subtitle = "Linear relationships shown for each species",
    x = "Bill Length (mm)",
    y = "Body Mass (g)",
    color = "Species",
    caption = "Source: Palmer Penguins dataset"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "bottom"
  )

Relationship between bill length and body mass in three penguin species, showing positive correlations within each species.

1.7 The Data Analysis Workflow

Effective data analysis typically follows a structured workflow:

1.7.1 1. Import

Bring your data into R from files, databases, or APIs:

Code
# CSV files
data <- read_csv("path/to/file.csv")

# Excel files (requires readxl package)
data <- readxl::read_excel("path/to/file.xlsx")

# From URLs
data <- read_csv("https://example.com/data.csv")
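
By default read_csv() guesses each column's type. When you already know what the columns should be, stating the types explicitly is a common safeguard. The sketch below assumes readr 2.0 or later (for I(), which marks literal text) and uses a made-up three-column table in place of a real file.

```r
library(readr)

# Inline text stands in for a file here; I() tells readr it is literal data
csv_text <- I("site,date,temp_c\nA,2024-01-05,12.4\nB,2024-01-06,NA")

obs <- read_csv(
  csv_text,
  col_types = cols(
    site = col_character(),
    date = col_date(format = "%Y-%m-%d"),
    temp_c = col_double()
  ),
  na = c("", "NA")  # strings to treat as missing
)

obs
```

With explicit col_types, a value that does not fit the declared type is reported as a parsing problem instead of silently changing the column's type.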

1.7.2 2. Tidy

Restructure data into a consistent format:

  • Each variable in its own column
  • Each observation in its own row
  • Each value in its own cell

1.7.3 3. Transform

Manipulate data to create the variables you need:

  • Filter observations
  • Create new variables
  • Calculate summaries
  • Join multiple datasets
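
Joining is the one item in this list not shown earlier in the chapter, so here is a minimal sketch using dplyr::left_join(). Both tables and all values in them are hypothetical.

```r
library(dplyr)

# Hypothetical tables: site metadata and repeated pH readings
sites <- tibble(
  site = c("A", "B", "C"),
  habitat = c("forest", "wetland", "grassland")
)
ph_readings <- tibble(
  site = c("A", "A", "B"),
  ph = c(6.8, 7.1, 5.9)
)

# Keep every reading and attach the matching habitat
joined <- left_join(ph_readings, sites, by = "site")

# Summarize per habitat
joined %>%
  group_by(habitat) %>%
  summarize(mean_ph = mean(ph))
```

left_join() keeps all rows of the first table; site C has no readings, so it does not appear in the result.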

1.7.4 4. Visualize

Create graphics to understand patterns:

  • Explore distributions
  • Identify relationships
  • Detect outliers
  • Generate hypotheses

1.7.5 5. Model

Apply statistical methods to test hypotheses:

  • Fit regression models
  • Perform hypothesis tests
  • Estimate parameters
  • Make predictions
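
As a small illustration of the modeling step, the sketch below fits a simple linear regression with base R's lm() on cars, a dataset that ships with R. It is meant only to show the shape of the workflow (fit, inspect, predict); later chapters cover model choice and checking.

```r
# cars ships with R: stopping distance (ft) versus speed (mph)
fit <- lm(dist ~ speed, data = cars)

# Estimated intercept and slope
coef(fit)

# Coefficient table with standard errors and p-values
summary(fit)$coefficients

# Predicted stopping distance at two new speeds
predict(fit, newdata = data.frame(speed = c(10, 20)))
```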

1.7.6 6. Communicate

Share your findings effectively:

  • Create reports with R Markdown or Quarto
  • Build interactive dashboards
  • Write scientific papers
  • Present to stakeholders

1.8 Types of Data in Natural Sciences

Understanding your data type is crucial for choosing appropriate analytical methods:

1.8.1 Categorical Data

Categorical data represent qualitative characteristics:

  • Nominal: Categories with no inherent order (species names, habitat types)
  • Ordinal: Categories with a meaningful order (pollution levels: low, medium, high)
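
In R, both kinds of categorical data are usually stored as factors, and the difference between nominal and ordinal shows up in how the levels are declared. The pollution ratings below are invented for illustration.

```r
# Hypothetical pollution ratings recorded as text
pollution <- c("low", "high", "medium", "low", "high")

# Nominal: levels default to alphabetical order, which carries no meaning
nominal <- factor(pollution)
levels(nominal)
#> [1] "high"   "low"    "medium"

# Ordinal: state the order explicitly
ordinal <- factor(pollution, levels = c("low", "medium", "high"), ordered = TRUE)

# Comparisons are only defined for ordered factors
ordinal < "high"
#> [1]  TRUE FALSE  TRUE  TRUE FALSE
```

Declaring the order also controls how categories are arranged on plot axes and in summary tables.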

1.8.2 Numerical Data

Numerical data involve measurements or counts:

  • Continuous: Can take any value within a range (temperature, pH, biomass)
  • Discrete: Can only take specific values, usually counts (number of individuals)

1.8.3 Spatial Data

Spatial data describe geographical distributions:

  • Coordinates (latitude/longitude)
  • Elevation or depth
  • Land cover maps
  • Remote sensing data

1.8.4 Temporal Data

Temporal data track changes over time:

  • Time series measurements
  • Seasonal patterns
  • Long-term monitoring data
  • Growth curves
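
Temporal data often arrive as text, so a typical first step is converting them to proper dates. The sketch below uses lubridate, which is attached by library(tidyverse) in tidyverse 2.0.0; the sampling dates are hypothetical.

```r
library(lubridate)

# Hypothetical sampling dates stored as text
dates <- ymd(c("2007-11-11", "2008-01-15", "2009-12-01"))

# Extract components for grouping or plotting
year(dates)
#> [1] 2007 2008 2009
month(dates, label = TRUE)

# Days elapsed since the first sample
as.numeric(dates - dates[1])
```

Once dates are stored as Date objects, they sort correctly and can be used directly on plot axes.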

1.9 Best Practices for Reproducible Research

PROFESSIONAL TIP: Reproducible Research Practices

Adopt these practices from the start of your research career:

  1. Use R projects: Organize your work in self-contained RStudio projects
  2. Use relative paths: Never use absolute file paths like C:/Users/Name/...
  3. Document your code: Add comments explaining why, not just what
  4. Version control: Use Git to track changes to your scripts
  5. Record your environment: Capture R and package versions with sessionInfo()
  6. Write functions: Avoid copying and pasting code; write reusable functions
  7. Use R Markdown/Quarto: Combine code, results, and narrative in one document
  8. Set seeds for reproducibility: Use set.seed() before any random operations
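
Two of these practices are easy to demonstrate in a few lines: setting a seed (item 8) and writing a small function instead of repeating code (item 6). The sketch below is illustrative; standard_error() is a hypothetical helper defined here, not a function from any package.

```r
# Same seed, same "random" draw
set.seed(123)
first_draw <- rnorm(5)

set.seed(123)
second_draw <- rnorm(5)

identical(first_draw, second_draw)
#> [1] TRUE

# A small reusable function instead of copy-and-paste (hypothetical helper)
standard_error <- function(x) {
  x <- x[!is.na(x)]          # drop missing values first
  sd(x) / sqrt(length(x))    # standard deviation over root sample size
}

standard_error(c(2, 4, 4, 6, NA))
```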

1.9.1 Session Information

Always record your R environment for reproducibility:

Code
# Display session information
sessionInfo()
#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-redhat-linux-gnu
#> Running under: Fedora Linux 40 (Workstation Edition)
#> 
#> Matrix products: default
#> BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
#>  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
#>  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Australia/Brisbane
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices datasets  utils     methods   base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.4 forcats_1.0.1   stringr_1.6.0   dplyr_1.1.4    
#>  [5] purrr_1.2.0     readr_2.1.6     tidyr_1.3.1     tibble_3.3.0   
#>  [9] ggplot2_4.0.1   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] utf8_1.2.6         generics_0.1.4     renv_1.0.10        lattice_0.22-6    
#>  [5] stringi_1.8.7      hms_1.1.4          digest_0.6.39      magrittr_2.0.4    
#>  [9] evaluate_1.0.5     grid_4.4.3         timechange_0.3.0   RColorBrewer_1.1-3
#> [13] fastmap_1.2.0      Matrix_1.7-2       jsonlite_2.0.0     mgcv_1.9-1        
#> [17] viridisLite_0.4.2  scales_1.4.0       CoprManager_0.5.7  codetools_0.2-20  
#> [21] textshaping_1.0.4  cli_3.6.5          rlang_1.1.6        crayon_1.5.3      
#> [25] splines_4.4.3      bit64_4.6.0-1      withr_3.0.2        yaml_2.3.11       
#> [29] tools_4.4.3        parallel_4.4.3     tzdb_0.5.0         vctrs_0.6.5       
#> [33] R6_2.6.1           lifecycle_1.0.4    htmlwidgets_1.6.4  bit_4.6.0         
#> [37] vroom_1.6.7        ragg_1.5.0         pkgconfig_2.0.3    pillar_1.11.1     
#> [41] gtable_0.3.6       glue_1.8.0         systemfonts_1.3.1  xfun_0.54         
#> [45] tidyselect_1.2.1   knitr_1.50         farver_2.1.2       nlme_3.1-167      
#> [49] htmltools_0.5.8.1  rmarkdown_2.30     labeling_0.4.3     compiler_4.4.3    
#> [53] S7_0.2.1
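
Printing the session information is useful, but saving it alongside your results keeps the record with the analysis. One simple way, using base R only, is sketched below; the file name is an illustrative choice, and tempdir() is used here only so the example does not write into your project.

```r
# Write the session details to a text file
# (in a real project, replace tempdir() with a folder inside the project)
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), info_file)

# Read back the first line to confirm the file was written
readLines(info_file, n = 1)
```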

1.10 Summary

In this chapter, we introduced:

  • The importance of data analysis in natural sciences research
  • R and RStudio as powerful tools for data analysis
  • The tidyverse philosophy and its core packages
  • Basic data manipulation with dplyr
  • Data visualization with ggplot2
  • The data analysis workflow
  • Types of data in natural sciences
  • Best practices for reproducible research

In the next chapter, we’ll dive deeper into data basics, learning more about data structures, importing various file formats, and preparing data for analysis.

1.11 Exercises

  1. Install and explore: Install R and RStudio on your computer. Open RStudio and explore the interface.

  2. Load the tidyverse: Run library(tidyverse) and note which packages are loaded.

  3. Explore built-in data: Use head(), glimpse(), and summary() on R’s built-in iris dataset.

  4. Practice pipes: Rewrite this nested code using pipes:

    round(mean(sqrt(c(4, 9, 16, 25, 36))), 2)

  5. Create a summary: Using the penguins data (or iris), calculate the mean and standard deviation of a numerical variable for each group of a categorical variable.

  6. Make a plot: Create a scatter plot of two numerical variables from the iris dataset, colored by species.

  7. Research question: Think about a research question in your field. What type of data would you need? What visualizations might help you explore the data?

1.12 References