Comparing model types - synthetic data
1 Purpose
This page documents the production of numerous synthetic data sets that are to be used to compare the performances of various models that are candidates for GCRMN. Each synthetic dataset is intended to represent a unique modelling challenge. The following table summarises the statistical challenge and the path of the synthetic dataset designed to represent this issue.
Statistical challenge | File path |
---|---|
Replacing poor sites with good sites | synthetic/benthos_fixed_locs_obs_1.rds |
Replacing poor sites with good sites (small data) | synthetic/benthos_fixed_locs_obs_2.rds |
Gaps in a temporal series of some sites | synthetic/benthos_fixed_locs_obs_2.rds |
Gaps in a temporal series of all sites | synthetic/benthos_fixed_locs_obs_2.rds |
Use of covariates | synthetic/benthos_fixed_locs_obs_2.rds |
2 Preparations
The bulk of the heavy lifting is performed by the synthos
package. This package can be installed by issuing the following:
Load the necessary R libraries
Establish the global environment and paths
3 Generate synthetic landscape
- create a list of configuration parameters
- seed: a random seed to use for the stoichastic processes
- crs: coordinate reference system to apply to the spatial domain
- model: the covariate model to use. The following models are supported: “Sph”, “Exp”, “Gau”, “Lin”, “Mat”, “Ste”, “Pen”, “Hug”, “Hol”, “Cor”, “Sphlin”, “Sphexp”, “Sphgaus”, “Sphmat”, “Sphste”, “Sphpen”, “Sphhug”, “Sphhol”, “Sphcor”
- psill: partial sill represents the variance explained by the spatial structure. It is the difference between the total sill (the asymptotic value of the variogram) and the nugget.In a variogram, the psill is the plateau value that the variogram reaches as the distance increases.
- range: the distance at which spatial correlation becomes negligible. Beyond this distance, the variogram reaches the sill, and points are considered spatially uncorrelated. It defines the spatial extent of the influence of one location on another.
- nugget: represents the variance at zero distance (the y-intercept of the variogram). It accounts for measurement error or spatial variability at scales smaller than the sampling resolution. A nonzero nugget indicates that there is variability that cannot be explained by the spatial model.
- alpha: controls the smoothness of the spatial field. Larger values of alpha result in smoother spatial fields, while smaller values allow for rougher fields. In the context of SPDEs, alpha is related to the order of the differential operator used to define the spatial process.
- kappa: controls the spatial scale or range of the spatial process. It determines how quickly spatial correlation decays with distance. Smaller values of kappa correspond to larger spatial ranges (more extended correlation), while larger values correspond to shorter spatial ranges (localized correlation). In SPDE-based models, kappa is used to parameterize the precision matrix of the spatial field.
- variance: is used to calculate the precision of the Matérn
- patch_threshold: patch_threshold: a numeric representing the threshold below which the field is masked away to leave only the patches
- reef_width: representing half the width of a ribbon representing the reef that is centered on the outline of the patch
- years: a sequence of years
- dhw_weight: relative importance of degree heating weeks (how much to weight the effects of dhw when calculating cover)
- cyc_weight: relative importance of cyclones (how much to weight the effects of cyclones when calculating cover)
- other_weight: relative importance of other disturbances (how much to weight the effects of other disturbances when calculating cover)
- hcc_cover_range: the range of Hard Coral Cover to be observed over space and time
- hcc_growth: the annual rate of growth of hard coral
- sc_cover_range: the range of Soft Coral Cover to be observed over space and time
- sc_growth: the annual rate of growth of soft coral
config <- list(
seed = 1,
crs = 4326,
model = "Exp",
psill = 1,
range = 15,
nugget = 0,
alpha = 2,
kappa = 1,
variance = 1,
patch_threshold = 1.75,
reef_width = 0.01,
years = 1:12,
## years = 2000:2011,
dhw_weight = 0.5,
cyc_weight = 0.4,
other_weight = 0.1,
hcc_cover_range = c(0.1, 0.4),
hcc_growth = 0.3,
sc_cover_range = c(0.01, 0.1),
sc_growth = 0.3
)
- define the spatial domain
spatial_domain <- st_geometry(
st_multipoint(
x = rbind(
c(0, -10),
c(3, -10),
c(10, -20),
c(1, -21),
c(2, -16),
c(0, -10)
)
)
) |>
st_set_crs(config$crs) |>
st_cast("POLYGON")
set.seed(config$seed)
spatial_grid <- spatial_domain |>
st_set_crs(NA) |>
st_sample(size = 10000, type = "regular") |>
st_set_crs(config$crs)
- create the synthetic landscape
- inspect the synthetic data
benthos_reefs_pts <- readRDS(file = paste0(
data_path,
"synthetic/benthos_reefs_pts.rds"
))
benthos_reefs_pts |> head()
Simple feature collection with 6 features and 8 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 9.398625 ymin: -19.95124 xmax: 9.398625 ymax: -19.95124
Geodetic CRS: WGS 84
# A tibble: 6 × 9
Year HCC Reef geometry SC MA CYC DHW OTHER
<int> <dbl> <chr> <POINT [°]> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.801 Reef1 (9.398625 -19.95124) -2.51 0.262 0 0.399 0.512
2 2 -0.993 Reef1 (9.398625 -19.95124) -2.70 0.374 0.570 0.379 0.745
3 3 -0.951 Reef1 (9.398625 -19.95124) -2.66 0.349 0 0.417 0.486
4 4 -0.889 Reef1 (9.398625 -19.95124) -2.60 0.314 0.0180 0.340 0.615
5 5 -0.842 Reef1 (9.398625 -19.95124) -2.55 0.286 0 0.419 0.429
6 6 -0.773 Reef1 (9.398625 -19.95124) -2.49 0.246 0 0.318 0.728
- convert the synthetic data to an sf object and convert responses to a natural scale
benthos_reefs_sf <- benthos_reefs_pts |>
mutate(across(c(HCC, SC, MA), ~ plogis(.))) |>
dplyr::select(Year, Reef, geometry, everything())
saveRDS(benthos_reefs_sf, file = paste0(data_path, "synthetic/benthos_reefs_sf.rds"))
write_csv(benthos_reefs_sf, paste0(data_path, "synthetic/benthos_reefs_sf.csv"))
- visualise the spatio-temporal pattern of each response and covariate
- visualise the temporal trend in each of the responses and covariates marginalised over space
4 Simulate a fixed site sampling design
- select a set of sites to repeatedly sample
- n_locs: number of locations (e.g. reefs)
- n_sites: number of sites within each location
- seed: random seed used in the random selection of locations and sites
benthos_fixed_locs_sf <- readRDS(file = paste0(
data_path,
"synthetic/benthos_fixed_locs_sf.rds"
))
head(benthos_fixed_locs_sf)
Simple feature collection with 6 features and 9 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3.658625 ymin: -19.53124 xmax: 3.658625 ymax: -19.53124
Geodetic CRS: WGS 84
# A tibble: 6 × 10
Reef geometry Site Year HCC SC MA CYC DHW
<chr> <POINT [°]> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Reef103 (3.658625 -19.53124) S1 1 -1.48 -3.22 0.630 0 0.422
2 Reef103 (3.658625 -19.53124) S1 2 -1.42 -3.16 0.601 0 0.386
3 Reef103 (3.658625 -19.53124) S1 3 -1.45 -3.19 0.614 0 0.535
4 Reef103 (3.658625 -19.53124) S1 4 -1.28 -3.03 0.529 0 0.186
5 Reef103 (3.658625 -19.53124) S1 5 -1.21 -2.95 0.487 0 0.331
6 Reef103 (3.658625 -19.53124) S1 6 -1.17 -2.91 0.464 0.0270 0.368
# ℹ 1 more variable: OTHER <dbl>
benthos_fixed_locs_sf |>
ggplot() +
geom_sf(aes(fill = plogis(HCC), color = plogis(HCC))) +
facet_wrap(~Year) +
scale_fill_viridis_c() +
scale_color_viridis_c() +
theme_bw()
- generate the within site structural hierarchy (transects/photos and points)
- Number_of_transects_per_site: number of transects per site
- Depths: depths to sample at
- Number_of_frames_per_transect: number of photo frames per transect
- Points_per_frame: number of points on each photo
- hcc_site_sigma: HCC variability between sites within locations (logit scale)
- hcc_transect_sigma: HCC variability between transects within sites (logit scale)
- hcc_sigma: HCC stoichastic noise not attributed to anything else (logit scale)
- sc_site_sigma: SC variability between sites within locations (logit scale)
- sc_transect_sigma: SC variability between transects within sites (logit scale)
- sc_sigma: SC stoichastic noise not attributed to anything else (logit scale)
- ma_site_sigma: MA variability between sites within locations (logit scale)
- ma_transect_sigma: MA variability between transects within sites (logit scale)
- ma_sigma: MA stoichastic noise not attributed to anything else (logit scale)
config <- list(
Number_of_transects_per_site = 5,
Depths = 2,
Number_of_frames_per_transect = 100,
Points_per_frame = 5,
## Note, the following are on the link scale
hcc_site_sigma = 0.5, # variability in Sites within Locations
hcc_transect_sigma = 0.2, # variability in Transects within Sites
hcc_sigma = 0.1, # random noise
sc_site_sigma = 0.05, # variability in Sites within Locations
sc_transect_sigma = 0.02, # variability in Transects within Sites
sc_sigma = 0.01, # random noise
ma_site_sigma = 0.5, # variability in Sites within Locations
ma_transect_sigma = 0.2, # variability in Transects within Sites
ma_sigma = 0.1 # random noise
)
benthos_fixed_locs_obs <- sampling_design_fine_scale_fixed(
benthos_fixed_locs_sf,
config
)
saveRDS(benthos_fixed_locs_obs, file = paste0(
data_path,
"synthetic/benthos_fixed_locs_obs.rds"
))
## save as csv
write_csv(
benthos_fixed_locs_obs,
paste0(data_path, "synthetic/benthos_fixed_locs_obs.csv")
)
- inspect the resulting dataset
benthos_fixed_locs_obs <- readRDS(file = paste0(
data_path,
"synthetic/benthos_fixed_locs_obs.rds"
))
benthos_fixed_locs_obs
# A tibble: 3,000 × 10
Reef Longitude Latitude Site Transect Year HCC SC MA
<chr> <dbl> <dbl> <chr> <chr> <int> <dbl> <dbl> <dbl>
1 Reef103 3.66 -19.5 S1 T1 1 18.6 3.99 54.9
2 Reef103 3.66 -19.5 S1 T1 2 19.0 4.32 50.5
3 Reef103 3.66 -19.5 S1 T1 3 19.0 4.18 48.7
4 Reef103 3.66 -19.5 S1 T1 4 21.6 4.81 50.2
5 Reef103 3.66 -19.5 S1 T1 5 25.4 5.20 49.0
6 Reef103 3.66 -19.5 S1 T1 6 21.2 5.47 46.3
7 Reef103 3.66 -19.5 S1 T1 7 22.2 5.26 45.6
8 Reef103 3.66 -19.5 S1 T1 8 23.2 5.18 46.1
9 Reef103 3.66 -19.5 S1 T1 9 19.7 4.43 47.2
10 Reef103 3.66 -19.5 S1 T1 10 17.4 4.09 54.4
# ℹ 2,990 more rows
# ℹ 1 more variable: Date <dttm>
- join in the disturbances
benthos_fixed_locs_obs_disturb <- benthos_fixed_locs_obs |>
left_join(
benthos_fixed_locs_sf |>
dplyr::select(Reef, Year, CYC, DHW, OTHER) |>
distinct(),
by = c("Reef", "Year"),
relationship = "many-to-many"
)
saveRDS(benthos_fixed_locs_obs_disturb, file = paste0(
data_path,
"synthetic/benthos_fixed_locs_obs_disturb.rds"
))
write_csv(
benthos_fixed_locs_obs_disturb,
paste0(data_path, "synthetic/benthos_fixed_locs_obs_disturb.csv")
)
benthos_fixed_locs_obs_disturb <- readRDS(file = paste0(
data_path,
"synthetic/benthos_fixed_locs_obs_disturb.rds"
))
benthos_fixed_locs_obs_disturb |>
as.data.frame() |>
head()
Reef Longitude Latitude Site Transect Year HCC SC MA
1 Reef103 3.658625 -19.53124 S1 T1 1 18.62544 3.985485 54.91391
2 Reef103 3.658625 -19.53124 S1 T1 1 18.62544 3.985485 54.91391
3 Reef103 3.658625 -19.53124 S1 T1 2 18.97497 4.319049 50.48907
4 Reef103 3.658625 -19.53124 S1 T1 2 18.97497 4.319049 50.48907
5 Reef103 3.658625 -19.53124 S1 T1 3 19.04566 4.176435 48.69475
6 Reef103 3.658625 -19.53124 S1 T1 3 19.04566 4.176435 48.69475
Date CYC DHW OTHER geometry
1 1-01-01 14:00:00 0 0.4218127 0.3645608 POINT (3.658625 -19.53124)
2 1-01-01 14:00:00 0 0.3975010 0.3349294 POINT (3.598625 -19.43124)
3 2-01-01 14:00:00 0 0.3864855 0.4973167 POINT (3.658625 -19.53124)
4 2-01-01 14:00:00 0 0.3810641 0.4968247 POINT (3.598625 -19.43124)
5 3-01-01 14:00:00 0 0.5350182 0.5864228 POINT (3.658625 -19.53124)
6 3-01-01 14:00:00 0 0.5235389 0.5418778 POINT (3.598625 -19.43124)
5 Specific data issue scenarios
The goal of the model comparisons exercise is to be able to evaluate how well different models can accommodate different issues. These issues include:
- situations in which certain monitored reefs are discontinued and replaced by alternative reefs for the rest of a time series. The most extreme form of this would be if a poor reef (a reef with very low cover) was replaced by a good reef (higher cover).
- it might be expected that the severity of replacing reefs would intensify with fewer reefs. To explore this, we will create a version of the reef replacement when the pool of reefs is smaller
- a common scenario is when there are gaps in time series. We will create a versions of the data where there are times when no samples are collected from selected reefs
- a more extreme version might be when there is a temporal gap in all sampled reefs
- using covariates (disturbances) to refine predictions in areas where there are no samples
reef_names <- benthos_fixed_locs_obs |>
group_by(Reef) |>
summarise(HCC = mean(HCC)) |>
filter(HCC == max(HCC) | HCC == min(HCC)) |>
arrange(HCC) |>
pull(Reef)
yrs <- benthos_fixed_locs_obs |> pull(Year) |> unique()
benthos_fixed_locs_obs_1 <-
benthos_fixed_locs_obs |>
filter(
!(Reef == reef_names[1] & Year > yrs[floor(length(yrs)*3/4)] |
Reef == reef_names[2] & Year < yrs[floor(length(yrs)*3/4) + 1])
)
saveRDS(benthos_fixed_locs_obs_1,
file = paste0(data_path, "synthetic/benthos_fixed_locs_obs_1.rds")
)
write_csv(
benthos_fixed_locs_obs_1,
paste0(data_path, "synthetic/benthos_fixed_locs_obs_1.csv")
)
set.seed(123)
reef_names <- benthos_fixed_locs_obs |>
group_by(Reef) |>
summarise(HCC = mean(HCC)) |>
filter(HCC == max(HCC) | HCC == min(HCC)) |>
arrange(HCC) |>
pull(Reef)
reef_names_rnd <- benthos_fixed_locs_obs |>
group_by(Reef) |>
summarise(HCC = mean(HCC)) |>
arrange(HCC) |>
filter(!Reef %in% reef_names) |>
pull(Reef) |>
sample(3) |>
c(reef_names)
benthos_fixed_locs_obs_2 <-
benthos_fixed_locs_obs_1 |>
filter(Reef %in% reef_names_rnd)
saveRDS(benthos_fixed_locs_obs_2,
file = paste0(data_path, "synthetic/benthos_fixed_locs_obs_2.rds")
)
write_csv(
benthos_fixed_locs_obs_2,
paste0(data_path, "synthetic/benthos_fixed_locs_obs_2.csv")
)
set.seed(123)
reef_names <- benthos_fixed_locs_obs |>
pull(Reef) |>
unique() |>
sample(5)
benthos_fixed_locs_obs_3 <-
benthos_fixed_locs_obs |>
filter(
!(Reef %in% reef_names & Year %in% 3:5)
)
saveRDS(benthos_fixed_locs_obs_3,
file = paste0(data_path, "synthetic/benthos_fixed_locs_obs_3.rds")
)
write_csv(
benthos_fixed_locs_obs_3,
paste0(data_path, "synthetic/benthos_fixed_locs_obs_3.csv")
)
benthos_fixed_locs_obs_3 <- readRDS(
file = paste0(data_path, "synthetic/benthos_fixed_locs_obs_3.rds")
)
benthos_fixed_locs_obs_3 |>
dplyr::select(-MA, -SC) |>
pivot_wider(
id_cols = c(Reef, Longitude, Latitude, Site, Transect),
names_from = Year,
values_from = HCC
) |>
pivot_longer(
cols = -c(Reef, Longitude, Latitude, Site, Transect),
names_to = "Year",
values_to = "HCC"
) |>
ggplot() +
geom_line(aes(
y = HCC, x = Year, colour = Site,
group = interaction(Site, Transect)
)) +
facet_wrap(~Reef) +
theme_bw()
benthos_fixed_locs_obs_4 <- readRDS(
file = paste0(data_path, "synthetic/benthos_fixed_locs_obs_4.rds")
)
yrs <- 1:12
benthos_fixed_locs_obs_4 |>
complete(Year = yrs, nesting(Reef, Longitude, Latitude, Site, Transect)) |>
ggplot() +
geom_line(aes(
y = HCC, x = Year, colour = Site,
group = interaction(Site, Transect)
)) +
facet_wrap(~Reef) +
theme_bw()