--- title: "Introduction to hSDM" output: bookdown::html_document2: #base_format: rmarkdown::html_vignette #highlight: tango number_sections: true toc: true #toc_float: true fig_caption: yes link-citations: yes bibliography: bib/biblio-hSDM.bib biblio-style: bib/jae.bst csl: bib/journal-of-applied-ecology.csl pkgdown: as_is: true vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Introduction to hSDM} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( fig.align = "center", fig.width = 6, fig.height = 6, cache = FALSE, collapse = TRUE, comment = "#>", highlight = TRUE ) ``` Species distribution models (SDM) are useful tools to explain or predict species range and abundance from various environmental factors. SDM are thus widely used in conservation biology. When using field observations (occurence or abundance data) to fit SDMs, two major problems can arise leading to potential bias in models' results: imperfect detection [@Lahoz-Monfort2014] and spatial dependence of the observations [@Lichstein2002]. In this vignette, we illustrate the use of the `hSDM` R package wich aims at providing user-friendly statistical functions to account for both imperfect detection and spatial dependence. Package's functions are developped in a hierarchical Bayesian framework. Functions call a Metropolis-within-Gibbs algorithm coded in C to estimate model's parameters. Using compiled C code for the Gibbs sampler reduce drastically the computation time. By making these new statistical tools available to the scientific community, we hope to democratize the use of more complex, but more realistic, statistical models for increasing knowledge in ecology and conserving biodiversity. Below, we show an example of the use of `hSDM` for fitting a species distribution model to abundance data for a bird species. Model types available in `hSDM` are not limited to those described in this example. `hSDM` includes various model types (for various contexts: imperfect detection, spatial dependence, zero-inflation, land transformation) for both occurrence and abundance data: - simple Binomial and Poisson models [@Clark2007] - site-occupancy models [@MacKenzie2002] - N-mixture models [@Royle2004] - Zero-Inflated Binomial (ZIB) models [@Wilson2016] - Zero-Inflated Poisson (ZIP) models [@Flores2009] - less common mixture models to account for land degradation [@Latimer2006] All the above models can include an additional intrinsic conditional autoregressive (iCAR) process to account for the spatial dependence of the observations. It should be easy for the user to fit other model types from the example below as function arguments are rather similar from one model to the other. # Data-set The data-set from @Kery2010b includes repeated count data for the Willow tit (_Poecile montanus_, a pesserine bird, Fig. \@ref(fig:Willow-tit)) in Switzerland on the period 1999-2003. Data come from the Swiss national breeding bird survey MHB (Monitoring Haüfige Brutvögel). MHB is based on 264 1-km$^2$ sampling units (quadrats) laid out as a grid (Fig. \@ref(fig:sites)). Since 1999, every quadrat has been surveyed two to three times during most breeding seasons (15 April to 15 July). The Willow tit is a widespread but moderately rare bird species. It has a weak song and elusive behaviour and can be rather difficult to detect. (ref:cap-Willow-tit) **Willow tit (_Poecile montanus_).** ```{r Willow-tit, echo=FALSE, out.width="\\textwidth", fig.cap="(ref:cap-Willow-tit)"} knitr::include_graphics("figures/Poecile_montanus.jpg") ``` We first load some necessary packages. The `sp` and `raster` packages will be used for manipulating spatial data. ```{r libraries} # Load libraries library(sp) library(raster) library(hSDM) ``` This data-set from @Kery2010b is available in the `hSDM` R package. It can be loaded with the `data` command and formated to be used with `hSDM` functions. The data is in "wide" format: each line is a site and the repeated count data (from count1 to count3) are in columns. A site is characterized by its x-y geographical coordinates, elevation (in m) and forest cover (in %). The date of each observation (julian date) has been recorded. ```{r Kery2010-data} # Load Kéry et al. 2010 data data(data.Kery2010, package="hSDM") head(data.Kery2010) ``` We first normalize the data to facilitate MCMC convergence. We discard the forest cover in our example. ```{r normalizing} # Normalized variables elev.mean <- mean(data.Kery2010$elevation) elev.sd <- sd(data.Kery2010$elevation) juldate.mean <- mean(c(data.Kery2010$juldate1, data.Kery2010$juldate2, data.Kery2010$juldate3),na.rm=TRUE) juldate.sd <- sd(c(data.Kery2010$juldate1, data.Kery2010$juldate2, data.Kery2010$juldate3),na.rm=TRUE) data.Kery2010$elevation <- (data.Kery2010$elevation-elev.mean)/elev.sd data.Kery2010$juldate1 <- (data.Kery2010$juldate1-juldate.mean)/juldate.sd data.Kery2010$juldate2 <- (data.Kery2010$juldate2-juldate.mean)/juldate.sd data.Kery2010$juldate3 <- (data.Kery2010$juldate3-juldate.mean)/juldate.sd ``` We plot the observation sites over a grid of 10 $\times$ 10 km$^2$ cells. (ref:cap-sites) **Location of the 264 10 km$^2$ quadrats of the Swiss national breeding bird survey.** Points are located on a grid of 10 km$^2$ cells. The grid is covering the geographical extent of the observation points. ```{r sites, out.width="\\textwidth", fig.cap="(ref:cap-sites)", fig.width=9, fig.height=6} # Landscape and observation sites sites.sp <- SpatialPointsDataFrame(coords=data.Kery2010[c("coordx","coordy")], data=data.Kery2010[,-c(1,2)]) xmin <- min(data.Kery2010$coordx) xmax <- max(data.Kery2010$coordx) ymin <- min(data.Kery2010$coordy) ymax <- max(data.Kery2010$coordy) res_spatial_cell <- 10 ncol <- ceiling((xmax-xmin)/res_spatial_cell) nrow <- ceiling((ymax-ymin)/res_spatial_cell) Xmax <- xmin+ncol*res_spatial_cell Ymax <- ymin+nrow*res_spatial_cell ext <- extent(c(xmin, Xmax, ymin, Ymax)) landscape <- raster(ncols=ncol,nrows=nrow,ext) values(landscape) <- runif(ncell(landscape),0,1) landscape.po <- rasterToPolygons(landscape) par(mar=c(0.1,0.1,0.1,0.1)) plot(landscape.po) points(sites.sp, pch=16, cex=1, col="black") ``` One important step if we want to account for spatial dependence is to compute the neighborhood for each cell. In our case, the neighborhood is define by the "king move" in chess, in the height directions around the target cell. To compute the neighbordhood for each cell, we need the number of neighbors (which can vary) and the adjacent cell identifiers. We can use the function `adjacent` from the `raster` package to do this. ```{r Neighborhood} # Neighborhood # Rasters must be projected to correctly compute the neighborhood crs(landscape) <- '+proj=utm +zone=1' # Cell for each site cells <- extract(landscape,sites.sp,cell=TRUE)[,1] # Neighborhood matrix ncells <- ncell(landscape) neighbors.mat <- adjacent(landscape, cells=c(1:ncells), directions=8, pairs=TRUE, sorted=TRUE) # Number of neighbors by cell n.neighbors <- as.data.frame(table(as.factor(neighbors.mat[,1])))[,2] # Adjacent cells adj <- neighbors.mat[,2] ``` We rearrange the data in two data-sets, one for the observation (or detection) process and one for the suitability (or ecological) process. The data for the observation process is in "long" format. Each row is an observation which is characterized by the site number and the julian date of the observation. The data for the suitability process (at the site level) includes information on sites (coordinates and elevation) by row. ```{r Arranging-data} # Arranging data # data.obs nsite <- length(data.Kery2010$coordx) count <- c(data.Kery2010$count1,data.Kery2010$count2,data.Kery2010$count3) juldate <- c(data.Kery2010$juldate1,data.Kery2010$juldate2, data.Kery2010$juldate3) site <- rep(1:nsite,3) data.obs <- data.frame(count,juldate,site) data.obs <- data.obs[!is.na(data.obs$juldate),] # data.suit data.suit <- data.Kery2010[c("coordx","coordy","elevation")] data.suit$cells <- cells data.suit <- data.suit[-139,] # Removing site 139 with no juldate ``` We will compare three SDMs with increasing complexity: a first simple Poisson model, a second N-mixture model with imperfect detection, and a third N-mixture iCAR model with imperfect detection and spatial dependence. # Simple Poisson model To fit a simple Poisson model, we use the function `hSDM.poisson`. For this model, we slightly rearrange the observation data-set to associate the site altitude to each observation in a "long" format. The statistical model can be written as follow: \begin{equation} y_i \sim \mathcal{P}oisson(\lambda_i) \\ \log(\lambda_i) = X_i \beta \\ \end{equation} ```{r Kery2010-pois-mod, cache=TRUE} # hSDM.poisson data.pois <- data.obs data.pois$elevation <- data.suit$elevation[as.numeric(as.factor(data.obs$site))] mod.Kery2010.pois <- hSDM.poisson(counts=data.pois$count, suitability=~elevation+I(elevation^2), data=data.pois, beta.start=0) ``` ```{r Kery2010-pois-mcmc, results="markup"} # Outputs summary(mod.Kery2010.pois$mcmc) ``` ```{r Kery2010-pois-pred} # Predictions npred <- 100 nsamp <- dim(mod.Kery2010.pois$mcmc)[1] # Abundance-elevation elev.seq <- seq(500,3000,length.out=npred) elev.seq.n <- (elev.seq-elev.mean)/elev.sd beta <- as.matrix(mod.Kery2010.pois$mcmc[,1:3]) tbeta <- t(beta) X <- matrix(c(rep(1,npred),elev.seq.n,elev.seq.n^2),ncol=3) N <- matrix(NA,nrow=nsamp,ncol=npred) for (i in 1:npred) { N[,i] <- exp(X[i,] %*% tbeta) } N.est.pois <- apply(N,2,mean) N.q1.pois <- apply(N,2,quantile,0.025) N.q2.pois <- apply(N,2,quantile,0.975) ``` # N-mixture model for imperfect detection The model integrates two processes, an _ecological_ process associated to the abundance of the species due to habitat suitability and an _observation_ process that takes into account the fact that the probability of detection of the species is inferior to one. \begin{equation} \textbf{Ecological process:} \\ N_i \sim \mathcal{P}oisson(\lambda_i) \\ \log(\lambda_i) = X_i \beta \\ ~ \\ \textbf{Observation process:} \\ y_{it} \sim \mathcal{B}inomial(N_i, \delta_{it}) \\ \text{logit}(\delta_{it}) = W_{it} \gamma \\ \end{equation} ```{r Kery2010-Nmix-mod, cache=TRUE} # hSDM.Nmixture mod.Kery2010.Nmix <- hSDM.Nmixture(# Observations counts=data.obs$count, observability=~juldate+I(juldate^2), site=data.obs$site, data.observability=data.obs, # Habitat suitability=~elevation+I(elevation^2), data.suitability=data.suit, # Predictions suitability.pred=NULL, # Chains burnin=10000, mcmc=5000, thin=5, # Starting values beta.start=0, gamma.start=0, # Priors mubeta=0, Vbeta=1.0E6, mugamma=0, Vgamma=1.0E6, # Various seed=1234, verbose=1, save.p=0, save.N=0) ``` ```{r Kery2010-Nmix-mcmc, results="markup"} # Outputs summary(mod.Kery2010.Nmix$mcmc) ``` ```{r Kery2010-Nmix-pred} # Predictions nsamp <- dim(mod.Kery2010.Nmix$mcmc)[1] # Abundance-elevation beta <- as.matrix(mod.Kery2010.Nmix$mcmc[,1:3]) tbeta <- t(beta) N <- matrix(NA,nrow=nsamp,ncol=npred) for (i in 1:npred) { N[,i] <- exp(X[i,] %*% tbeta) } N.est.Nmix <- apply(N,2,mean) N.q1.Nmix <- apply(N,2,quantile,0.025) N.q2.Nmix <- apply(N,2,quantile,0.975) # Detection-Julian date juldate.seq <- seq(100,200,length.out=npred) juldate.seq.n <- (juldate.seq-juldate.mean)/juldate.sd gamma <- as.matrix(mod.Kery2010.Nmix$mcmc[,4:6]) tgamma <- t(gamma) W <- matrix(c(rep(1,npred),juldate.seq.n,juldate.seq.n^2),ncol=3) delta <- matrix(NA,nrow=nsamp,ncol=npred) for (i in 1:npred) { delta[,i] <- inv.logit(X[i,] %*% tgamma) } delta.est.Nmix <- apply(delta,2,mean) delta.q1.Nmix <- apply(delta,2,quantile,0.025) delta.q2.Nmix <- apply(delta,2,quantile,0.975) ``` # N-mixture model with iCAR process For this model, we assume that the abundance of the species at one site depends on the abundance of the species on neighboring sites. In this model, the _ecological_ process includes an intrinsic conditional autoregressive (iCAR) subprocess to model the spatial dependence between observations. \begin{equation} \textbf{Ecological process:} \\ N_i \sim \mathcal{P}oisson(\lambda_i) \\ \log(\lambda_i) = X_i \beta + \rho_{j(i)} \\ \text{$\rho_{j(i)}$: spatial random effect for spatial entity $j$ including site $i$} \\ ~ \\ \textbf{Spatial dependence:} \\ \text{An intrinsic conditional autoregressive model (iCAR) is assumed:} \\ \rho_j \sim \mathcal{N}ormal(\mu_j, V_{\rho}/n_j) \\ \text{$\mu_j$: mean of $\rho_{j^{\prime}}$ in the neighborhood of $j$.} \\ \text{$V_{\rho}$: variance of the spatial random effects.} \\ \text{$n_j$: number of neighbors for spatial entity $j$.} \\ ~ \\ \textbf{Observation process:} \\ y_{it} \sim \mathcal{B}inomial(N_i, \delta_{it}) \\ \text{logit}(\delta_{it}) = W_{it} \gamma \\ \end{equation} ```{r Kery2010-Nmix-iCAR-mod, cache=TRUE} # hSDM.Nmixture.iCAR mod.Kery2010.Nmix.iCAR <- hSDM.Nmixture.iCAR(# Observations counts=data.obs$count, observability=~juldate+I(juldate^2), site=data.obs$site, data.observability=data.obs, # Habitat suitability=~elevation+I(elevation^2), data.suitability=data.suit, # Spatial structure spatial.entity=data.suit$cells, n.neighbors=n.neighbors, neighbors=adj, # Chains burnin=20000, mcmc=10000, thin=10, # Starting values beta.start=0, gamma.start=0, Vrho.start=1, # Priors mubeta=0, Vbeta=1.0E6, mugamma=0, Vgamma=1.0E6, priorVrho="1/Gamma", shape=1, rate=1, # Various seed=1234, verbose=1, save.rho=0, save.p=0, save.N=0) ``` ```{r Kery2010-Nmix-iCAR-mcmc,results="markup"} summary(mod.Kery2010.Nmix.iCAR$mcmc) ``` (ref:cap-spatial) **Estimated spatial random effects.** Locations of observation quadrats are represented by dots. The mean abundance on each quadrat is represented by a circle of size proportional to abundance. ```{r Kery2010-Nmix-iCAR-spatial-effects, out.width="\\textwidth", fig.width=9, fig.height=6, fig.cap="(ref:cap-spatial)"} # Spatial random effects rho.pred <- mod.Kery2010.Nmix.iCAR$rho.pred r.rho.pred <- rasterFromXYZ(cbind(coordinates(landscape),rho.pred)) plot(r.rho.pred) # Mean abundance by site ma <- apply(sites.sp@data[,3:5],1,mean,na.rm=TRUE) points(sites.sp,pch=16,cex=0.5) points(sites.sp,pch=1,cex=ma/2) ``` ```{r Kery2010-Nmix-iCAR-pred} # Predictions nsamp <- dim(mod.Kery2010.Nmix.iCAR$mcmc)[1] # Abundance-elevation beta <- as.matrix(mod.Kery2010.Nmix.iCAR$mcmc[,1:3]) tbeta <- t(beta) N <- matrix(NA,nrow=nsamp,ncol=npred) # Simplified way of obtaining samples for rho rho.samp <- sample(rho.pred,nsamp,replace=TRUE) for (i in 1:npred) { N[,i] <- exp(X[i,] %*% tbeta + rho.samp) } N.est.Nmix.iCAR <- apply(N,2,mean) N.q1.Nmix.iCAR <- apply(N,2,quantile,0.025) N.q2.Nmix.iCAR <- apply(N,2,quantile,0.975) # Detection-Julian date gamma <- as.matrix(mod.Kery2010.Nmix.iCAR$mcmc[,4:6]) tgamma <- t(gamma) delta <- matrix(NA,nrow=nsamp,ncol=npred) for (i in 1:npred) { delta[,i] <- inv.logit(X[i,] %*% tgamma) } delta.est.Nmix.iCAR <- apply(delta,2,mean) delta.q1.Nmix.iCAR <- apply(delta,2,quantile,0.025) delta.q2.Nmix.iCAR <- apply(delta,2,quantile,0.975) ``` # Comparing predictions from the three different models (ref:cap-abundance) **Relationship between abundance and altitude.** Black: Poisson model, Red: N-mixture model, Green: N-mixture model with iCAR. ```{r Kery2010-comp-abundance, fig.cap="(ref:cap-abundance)"} # Expected abundance - Elevation par(mar=c(4,4,1,1),cex=1.4,tcl=+0.5) plot(elev.seq,N.est.pois,type="l", xlim=c(500,3000), ylim=c(0,10), lwd=2, xlab="Elevation (m a.s.l.)", ylab="Expected abundance", axes=FALSE) #lines(elev.seq,N.q1.pois,lty=3,lwd=1) #lines(elev.seq,N.q2.pois,lty=3,lwd=1) axis(1,at=seq(500,3000,by=500),labels=seq(500,3000,by=500)) axis(2,at=seq(0,10,by=2),labels=seq(0,10,by=2)) # Nmix lines(elev.seq,N.est.Nmix,lwd=2,col="red") #lines(elev.seq,N.q1.Nmix,lty=3,lwd=1,col="red") #lines(elev.seq,N.q2.Nmix,lty=3,lwd=1,col="red") # Nmix.iCAR lines(elev.seq,N.est.Nmix.iCAR,lwd=2,col="dark green") #lines(elev.seq,N.q1.Nmix.iCAR,lty=3,lwd=1,col="dark green") #lines(elev.seq,N.q2.Nmix.iCAR,lty=3,lwd=1,col="dark green") ``` (ref:cap-detection) **Relationship between detection probability and observation date.** Red: N-mixture model, Green: N-mixture model with iCAR. ```{r Kery2010-comp-detection, fig.cap="(ref:cap-detection)"} # Detection probability - Julian date par(mar=c(4,4,1,1),cex=1.4,tcl=+0.5) plot(juldate.seq,delta.est.Nmix,type="l", xlim=c(100,200), ylim=c(0,1), lwd=2, col="red", xlab="Julian date", ylab="Detection probability", axes=FALSE) lines(juldate.seq,delta.q1.Nmix,lty=3,lwd=1,col="red") lines(juldate.seq,delta.q2.Nmix,lty=3,lwd=1,col="red") axis(1,at=seq(100,200,by=20),labels=seq(100,200,by=20)) axis(2,at=seq(0,1,by=0.2),labels=seq(0,1,by=0.2)) # Nmix.iCAR lines(juldate.seq,delta.est.Nmix.iCAR,lwd=2,col="dark green") lines(juldate.seq,delta.q1.Nmix.iCAR,lty=3,lwd=1,col="dark green") lines(juldate.seq,delta.q2.Nmix.iCAR,lty=3,lwd=1,col="dark green") ``` # References