Preface

Welcome. This is Data Analysis in Natural Sciences: An R-Based Approach. It is a practical book for anyone who has to wrangle scientific data into something useful, whether you are a student, a working researcher, a field technician, or somebody who just got handed a CSV and a deadline. The examples come from forestry, agriculture, ecology, marine biology, environmental science, geology, atmospheric science and hydrology, because those are the fields I keep meeting people in, and the same techniques keep coming up.

Why this book?

There are plenty of statistics textbooks already, and plenty of R tutorials. What there is less of is something that walks you through the whole loop a working scientist actually does: load some messy data, look at it, clean it, ask a real question, pick the right test, run it, check whether the test was even appropriate, and write the result up in a way somebody else can reproduce.

So that is what this book tries to do. Each chapter sticks close to that loop. You will find:

  1. A modern R workflow built on the tidyverse and tidymodels packages, because they have become a common standard for data work in R.
  2. Reproducible-research habits baked in from the start (Git, Quarto, renv).
  3. Real datasets from real disciplines, not just mtcars and iris.
  4. Honest treatment of statistical assumptions, including what to do when they fail.
  5. Plotting and reporting strategies that respect the reader’s time.
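
To give a flavour of what item 4 means in practice, here is a minimal sketch of checking an assumption before choosing a test. It uses only base R and the built-in mtcars data so it runs anywhere. The choice of variables and the 0.05 cut-off are my assumptions for illustration, and gating on a Shapiro-Wilk test is one common but debated habit, not the only way to do it.

```r
# Illustrative only: mtcars ships with R; the variables chosen are an assumption.
manual    <- mtcars$mpg[mtcars$am == 1]  # manual transmission
automatic <- mtcars$mpg[mtcars$am == 0]  # automatic transmission

# Look at the data first
summary(manual)
summary(automatic)

# Check the normality assumption in each group (Shapiro-Wilk test)
normal_enough <- shapiro.test(manual)$p.value > 0.05 &&
  shapiro.test(automatic)$p.value > 0.05

# Pick the test to match what the check found
result <- if (normal_enough) {
  t.test(manual, automatic)        # Welch two-sample t-test
} else {
  wilcox.test(manual, automatic)   # rank-based alternative
}

result$method   # which test was actually run
result$p.value
```

The point is the shape of the workflow: look, check, then test, and keep a record of which test you ended up running and why.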

About the author

I am Jimmy Moses, based at the School of Forestry, Faculty of Natural Resources, Papua New Guinea University of Technology. My day job is ecological research, mostly in PNG’s forests and farms, and I wrote this book because I kept giving the same advice to students and colleagues and wanted it in one place.

Target Audience

This book is designed for:

  • Undergraduate and postgraduate students in natural science disciplines
  • Researchers seeking to enhance their data analysis capabilities
  • Technicians working in laboratories and field settings
  • Professionals in government agencies, NGOs, and the private sector
  • Hobbyists with an interest in analyzing scientific data

The content is relevant to those working in:

  • Forestry and agroforestry
  • Agriculture and agronomy
  • Ecology and conservation
  • Environmental science
  • Geography and GIS/remote sensing
  • Marine biology and fisheries
  • Botany and plant sciences
  • Entomology and zoology
  • Epidemiology and veterinary sciences
  • Geology and earth sciences
  • Atmospheric and climate sciences
  • Hydrology and water resources
  • Natural resource management
  • Conservation biology

What makes this book different?

Tidyverse and tidymodels

R has two main dialects. There is base R, which is what the language ships with, and there is the tidyverse, a layer of packages on top that makes most data work easier to read and write. This book leans on the tidyverse side, plus tidymodels for the statistical modelling. If you have used dplyr, ggplot2 or tidyr before, you are already most of the way there. If you have not, that is fine: we start from scratch.
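
As a small taste of the difference, here is the same summary written both ways. This is a sketch, assuming the dplyr package is installed. It uses the built-in mtcars data and the `%>%` pipe so that it also works on R 4.0.

```r
# Assumes dplyr is installed; mtcars ships with R.
library(dplyr)

# Base R: mean fuel economy (mpg) for each cylinder count
base_result <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Tidyverse: the same summary, written as a pipeline read top to bottom
tidy_result <- mtcars %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg))

base_result
tidy_result
```

Both give one row per cylinder count with the same means. The base R version returns a data frame and the dplyr version returns a tibble.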

Real datasets

Every chapter uses data that came from somewhere real. You will still see iris and mtcars occasionally because they are useful for tiny examples, but most of the time you are working with actual scientific datasets bundled in the book’s data/ folder.

Reproducibility, from day one

By the time you finish, your analyses should be runnable by somebody else (or you, in two years) on a different machine without a long email thread of “what version of which package did you use again”. The tools we lean on for that:

  • Git, for tracking changes to your code.
  • Quarto, for writing analyses that mix code, prose and output in one document.
  • renv, for locking the package versions your project depends on.
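
For renv in particular, the day-to-day routine comes down to three calls. The sketch below wraps each one in a small helper so that nothing touches your project until you call it. The helper names are hypothetical, made up for this illustration. The `renv::` functions are the real ones, and the sketch assumes renv is installed.

```r
# A sketch of the usual renv steps. Defining these helpers changes nothing;
# each step only runs when you call the helper from the project's root folder.
# Assumes renv is installed: install.packages("renv")

# Step 1 (once per project): create a project library and an renv.lock file
start_locking <- function() renv::init()

# Step 2 (after installing or updating packages): record the versions in use
record_versions <- function() renv::snapshot()

# Step 3 (on another machine, or in two years): reinstall the recorded versions
rebuild_library <- function() renv::restore()
```

Commit the resulting renv.lock file to Git alongside your code, so the recorded versions travel with the analysis.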

What You Will Learn

This book will guide you through:

  1. Foundations of Data Analysis
    • R programming essentials
    • Data structures and types
    • Modern workflow practices
  2. Data Management
    • Importing data from various sources
    • Tidying and transforming data
    • Handling missing values
    • Data validation and quality control
  3. Exploratory Data Analysis
    • Descriptive statistics
    • Data visualization techniques
    • Pattern recognition
    • Outlier detection
  4. Statistical Analysis
    • Hypothesis testing framework
    • Common statistical tests
    • Analysis of variance (ANOVA)
    • Non-parametric methods
  5. Modeling and Prediction
    • Linear regression
    • Multiple regression
    • Logistic regression
    • Model validation and diagnostics
    • Cross-validation techniques
  6. Advanced Topics
    • Spatial analysis
    • Time series analysis
    • Mixed-effects models
    • Machine learning basics
  7. Communication
    • Professional visualization
    • Report generation
    • Scientific presentation

How to Use This Book

This book is designed to be both a learning resource and a reference guide. You can:

  • Read sequentially from start to finish to build your skills progressively
  • Focus on specific chapters as needed for particular tasks or analyses
  • Use as a reference when encountering specific analytical challenges
  • Adapt code examples to your own datasets and research questions

Code Examples

All code examples are provided in a clear, commented format. You can:

  1. Copy and run directly in R or RStudio
  2. Modify for your needs with confidence
  3. Learn by doing through practical exercises

Exercises

Each chapter ends with exercises to reinforce learning:

  • Conceptual questions that check your understanding of the underlying ideas
  • Applied problems that use the book’s bundled datasets in data/
  • Reporting prompts that ask you to communicate your results

Worked solutions are deliberately not bundled with the book. Working through an exercise, inspecting your output, and iterating is the fastest way to build real fluency. If you get stuck, revisit the relevant section of the chapter or open an issue at github.com/jm0535/dains/issues to discuss your approach with other readers.

Prerequisites

To get the most out of this book, you should have:

  • Basic computer skills: File management, software installation
  • R and RStudio installed: Instructions provided in Chapter 1
  • Statistical awareness: Basic understanding helpful but not required
  • Scientific curiosity: Interest in data-driven discovery

Book Structure

The book is organized into four main parts:

Part I: Getting Started

  • Introduction to data analysis in natural sciences
  • Setting up your R environment
  • Data basics and fundamental concepts

Part II: Data Analysis Fundamentals

  • Exploratory data analysis
  • Hypothesis testing
  • Common statistical tests

Part III: Data Visualization

  • Principles of effective visualization
  • Creating publication-quality graphics
  • Advanced visualization techniques

Part IV: Advanced Topics

  • Regression analysis
  • Modeling workflows with tidymodels
  • Conservation applications
  • Special topics in natural sciences

Companion Resources

This book is accompanied by:

  • GitHub Repository: All code, data, and supplementary materials
  • Online Version: Interactive HTML version with enhanced features
  • Datasets: Carefully curated real-world data from multiple disciplines
  • Updates: Regular updates with new methods and best practices

Conventions Used in This Book

Throughout the book, you’ll encounter several types of highlighted boxes:

Note Boxes

These provide additional context, technical details, or explanations of code.

Important Boxes

These highlight critical concepts, interpretation guidelines, or common pitfalls.

Professional Tips

These offer best practices, efficiency tips, and expert insights for real-world applications.

Warnings

These alert you to common mistakes, limitations, or things to watch out for.

Code Formatting

Code is presented in monospaced font:

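# This is an R code example
library(tidyverse)  # loads readr, dplyr, ggplot2 and the other core packages

# Read a CSV file into a tibble ("data.csv" is a placeholder file name)
data <- read_csv("data.csv")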

Function names are shown as function_name(), and package names as packagename.

Acknowledgments

This book would not have been possible without the contributions of many individuals and the broader R community:

  • The R Core Team for developing and maintaining R
  • The tidyverse team (particularly Hadley Wickham) for revolutionizing R programming
  • The tidymodels team (especially Max Kuhn and Julia Silge) for creating a unified modeling framework
  • The RStudio team for providing excellent development tools
  • Data providers who make their datasets openly available for research and education
  • Students and colleagues who provided feedback and testing
  • The open-source community whose packages make this work possible

Software and Package Information

The examples in this book assume the following software:

  • R (version 4.0.0 or higher)
  • RStudio (2023.06.0 or higher)
  • Quarto (1.3.0 or higher)
  • Tidyverse packages
  • Tidymodels packages

For the most up-to-date package versions and dependencies, see the install_packages.R script included with the book materials.

Feedback and Contributions

This book is a living document that will evolve based on feedback from readers and advances in the field. If you find errors, have suggestions for improvements, or would like to contribute:

  • Report issues: Use the GitHub repository’s issue tracker
  • Suggest improvements: Submit pull requests
  • Share your applications: I’d love to hear how you’ve applied these methods in your research

License

This work is licensed under the MIT License, allowing you to freely use, modify, and share the material with appropriate attribution. See the LICENSE file in the repository for full details.

Let’s Begin!

Data analysis is both a science and an art. While the statistical methods provide the rigorous foundation, the creative application of these tools to real-world problems is where the true value emerges. This book aims to equip you with both the technical skills and the analytical mindset needed to excel in natural sciences research.

Whether you’re analyzing forest inventory data, tracking species populations, studying climate patterns, or investigating any other natural phenomenon, the skills you’ll develop here will serve as a foundation for your scientific journey.

Let’s embark on this journey into the world of data analysis for natural sciences!


Jimmy Moses
School of Forestry
Faculty of Natural Resources
Papua New Guinea University of Technology
PMB 411, Lae, Morobe Province, Papua New Guinea

First released as an open-access book: 2024
Continuously updated through 2026