\(\color{orange}{\text{N.B. This tutorial is a live document that will continue to be improved and updated by the authors. Enjoy!}}\)
In this tutorial we will use the AusTraits database (Falster et al. 2021) to explore the benefits of multi-response phylogenetic mixed models (MR-PMM). AusTraits is a relational database containing records of ~500 plant traits across ~30,000 Australian plant taxa (https://austraits.org/). The first step is to install and load any required packages, then load in the AusTraits data. For reproducibility, all analyses will use version 6.0.0 of AusTraits (https://zenodo.org/records/11188867). This tutorial is currently set up for users to download the relevant files directly from GitHub and inspect outputs of each code chunk within Rstudio on their local machine.
A crucial step in any phylogenetic comparative analysis is collating, cleaning and manipulating data records. As we are dealing with a phylogeny, this includes matching the nomenclature (species naming conventions) between our trait data and the tip labels of the phylogenetic tree. The following section covers the decisions made and operations applied to arrive at the final data set for statistical analyses. As we are working from AusTraits, the first step is to pull down data on the organisms we are interested in, which for today is the genus Eucalyptus.
# extract all austraits records for genus Eucalyptus
eucalyptus <- extract_taxa(austraits, genus = "Eucalyptus")
# join all information on these records from austraits database
euc_data <- (eucalyptus %>% austraits::join_all())$traits
# inspect the data table
euc_data %>% head() # first few rows
euc_data %>% names() # column names
euc_data %>% select(trait_name) %>% unique() # traits
euc_data %>% select(basis_of_record) %>% unique() # types of records
We will focus on three continuous traits relevant to leaf economics
and water use efficiency: leaf_mass_per_area
(LMA),
leaf_N_per_dry_mass
(N_mass) and leaf_delta13C
(d13C). Some important decisions to be made are when to exclude data,
and how to treat records that are not reported in a format conducive to
our analyses. Here we choose to exclude data from experimental studies,
limiting ourselves to observations on field and captive-cultivated
(i.e., botanical gardens and arboreta) specimens, as well as
measurements on preserved specimens and expert reports from the
literature (i.e., Flora field guides). We also limit ourselves to
observations on reproductively mature individuals.
Another consideration for these data will be properly integrating
different types of observations. For example, some records in AusTraits
represent raw trait values observed on a single individual (replicates =
1), where-as others represent mean trait values calculated from
observations on multiple individuals (replicates > 1). From a
modelling perspective, we would like observations based on more
replicates to have a greater influence on the estimation of model
parameters. Hence, observations should be weighted proportionate to
their level of replication during model fit. The variable
replicates
reports an integer value for some records, but
is also sometimes reported as a range, e.g., “3-5”. To be conservative,
we can define the lower bound of this range as the level of replication
for that observation. This represents one of many arbitrary decisions
that must be made when attempting to collate data from diverse sources.
The important thing is to keep a good record of all such decisions.
First, so their influence on results may be tested (e.g., with
sensitivity analyses), and second so they can be adequately described
and justified for publication.
# choose traits to extract data on
traits <- c("leaf_mass_per_area", "leaf_N_per_dry_mass", "leaf_delta13C")
num_traits <- traits # we will focus on the numerical leaf traits initially
## clean data
# - remove records from experimental studies and juvenile specimens
# - resolve replicates reported as a character string to an integer value by taking minima of reported range
euc_data <- euc_data %>%
filter(trait_name %in% traits,
basis_of_record %in% c("field", "captive_cultivated", "literature", "preserved_specimen"),
# !location_name %in% c("Currency Creek Arboretum",
# "Currency Creek Arboretum, South Australia"), # exclude records from CCA?
life_stage == "adult",) %>%
mutate(replicates = ifelse(replicates == "unknown" | replicates == "1-10", 1,
ifelse(replicates == "2-17", 2,
ifelse(replicates == "3-5", 3,
ifelse(replicates == "4-5" | replicates == "4-6", 4,
ifelse(replicates == "5-6", 5,
ifelse(replicates == "10-100" | replicates == "10-40" | replicates == "10-20", 10, replicates)))))),
replicates = as.character(replicates)) %>%
replace_na(list(replicates = "1"))
# we have a mix of value types across a diverse set of records
euc_data %>% select(value_type) %>% table()
euc_data %>% select(basis_of_record) %>% table()
# select only the columns we need and create a unique id for each record
euc_data <- euc_data %>%
select(dataset_id, location_name, observation_id, taxon_name, trait_name, value_type, value, replicates) %>%
distinct() %>%
mutate(n = 1:n(), .by = c(dataset_id, location_name, observation_id, taxon_name, trait_name),
id = paste0(dataset_id,"_",observation_id,"_",n), .before = dataset_id) %>% # create unique observation id
select(-n)
After our initial refinement of records to include, the next step is to match the nomenclature (species naming conventions) between our trait data and the tip labels of our phylogenetic tree. We will use the maximum likelihood Eucalypt phylogeny presented by (Thornhill et al. 2019) for all our analyses (N.B. models are easily extended to incorporate phylogenetic uncertainty by fitting over a sample of candidate trees and combining posterior estimates prior to inference). To proceed with analyses, our tree must be binary (fully bifurcating with no polytomies) and ultrametric (clock-like with all tips right adjusted). So the first step is to perform these checks and amend the tree where necessary.
# read in Eucalypt phylogeny
phy <- read.tree(file="tree.ML1.tre")
plot(phy, edge.width = 1, show.tip.label = F) # looks ultrametric
# perform checks
is.binary(phy);is.ultrametric(phy) # but fails both tests
# resolve polytomies and make ultrametric (clock-like)
phy <- multi2di(phy, random = FALSE) # splits polytomies into bifurcations but introduces zero length branches
phy$edge.length[phy$edge.length == 0] <- 1e-5 # replace with small non-zero length
phy <- force.ultrametric(phy, method = "extend") # make ultrametric by extending external branches where necessary
phy$node.label <- (length(phy$tip.label)+1):(phy$Nnode+length(phy$tip.label)) # add node labels
is.binary(phy);is.ultrametric(phy) # now passes checks
Now the tree is in the right shape we can go about cleaning and
matching the nomenclature. A quick preview of taxon names reveals a few
differences between our data and the tip labels of our tree. We can
usually resolve most differences simply by amending text strings using
grep
functions. We will create two versions of the taxon
name; one which retains subspecies identity and one which does not. In
the event that some subspecies do not appear in the tree, using the
latter nomenclature may allow us to include more records by treating
trait data from all subspecies as observations at the species level
(e.g., records for subspecies Eucalyptus_examplus_primus and
Eucalyptus_examplus_secundis are treated as records for species
Eucalyptus_examplus). As taxonomy is in a constant state of revision, it
is also a good idea to check whether any species have undergone recent
name changes which may create conflicts when attempting to match trait
data records to taxon names in the phylogeny, though we will skip this
step for today’s exercise.
# phylogeny contains binomial labels for species and trinomial labels for subspecies, all separated by "_" (genus_species_subspecies)
tail(phy$tip.label)
# austraits also contains records at the species and subspecies level, but with a different naming convention and names separated by " "
# Additional information (e.g., "var." = variety) and unidentified species (e.g., "Eucalyptus sp.") will also have to be removed
euc_data %>% select(taxon_name) %>% unique() %>% tail()
# these differences in nomenclature cause a complete mismatch in taxon names between data and tree
euc_data %>% pull(taxon_name) %>% unique() %>% intersect(.,phy$tip.label) %>% length()
# filter out unwanted records and use grep functions to resolve differences in nomenclature
euc_data <- euc_data %>%
filter(!taxon_name == "Eucalyptus", # remove records of unidentified species
!taxon_name %like% "% sp. %",
!taxon_name %like% "% x %" # remove records from hybrid species
) %>%
mutate(taxon = gsub(" ", "_",taxon_name), # clean taxon names by replacing "subsp." and "var." with "_"
taxon = gsub("_subsp._", "_",taxon),
taxon = gsub("_var._", "_",taxon),
taxon = gsub('^([^_]+_[^_]+_[^_]+).*', "\\1",taxon), # remove everything after the third "_" (retains subsp epithet)
taxon_sp = gsub('^([^_]+_[^_]+).*', "\\1",taxon)) # remove everything after the second "_", (removes subsp epithet)
# quick check that the re-naming worked as intended
euc_data %>% select(taxon, taxon_sp) %>% unique() %>% tail()
Now we have resolved the nomenclature, we can check which and how many taxa in the phylogeny have AusTraits records for our traits of interest.
# check how many species with trait data are also in the phylogeny
match <- euc_data %>%
filter(trait_name %in% traits) %>%
pull(taxon) %>% intersect(phy$tip.label);length(match) # we have data on at least one leaf trait for 421 taxa
# check how many if we drop the subsp epithet off both the tip.labels of
# the phylogeny and the taxon names in the data
match_sp <- euc_data %>%
filter(trait_name %in% traits) %>%
pull(taxon_sp) %>% # subsp epithet dropped
intersect(gsub('^([^_]+_[^_]+).*', "\\1", phy$tip.label)); length(match_sp) # increases to 457 taxa by ignoring subspecies nomenclature
# and adds almost 2000 records across the three traits (~25% increase in sample size)
euc_data %>% filter(taxon %in% match, trait_name %in% traits) %>% select(value) %>% na.omit() %>% nrow()
euc_data %>% filter(taxon_sp %in% match_sp, trait_name %in% traits) %>% select(value) %>% na.omit() %>% nrow()
# One thing to check is that we will not create duplicate tip labels in the phylogeny by dropping subsp epithets.
# This would happen if the phylogeny featured multiple subsp for any given species. To confirm this, we can check
# that the number of unique tip labels doesn't change when we drop the subsp epithet.
length(unique(gsub('^([^_]+_[^_]+).*', "\\1", phy$tip.label))) == length(unique(phy$tip.label)) # all good
# Ok, let's drop subsp epithets off the tip labels of the phylogeny
phy$tip.label <- gsub('^([^_]+_[^_]+).*', "\\1", phy$tip.label)
# match in taxonomic information from Nicolle 2022
euc_data$subgenus <- taxonomy[match(euc_data$taxon_sp, taxonomy$taxon_sp),]$subgenus # match each taxa to subgenus
euc_data$series <- taxonomy[match(euc_data$taxon_sp, taxonomy$taxon_sp),]$series # match each taxa to subgenus
Phylogenetic signal in species traits is to some extant a function of phylogenetic scale (ref Revell XXXX). By including more diverse species in our phylogeny, we capture deeper splits that represent more meaningful divergences in the genotype and phenotype of extant lineages, precisely the effects we intend to model when analysing inter-species data. One consequence of this, is that shallow topology (near the tips) is less informative than deep topology when attempting to infer patterns of phylogenetic niche conservatism, because differences between genera are usually more significant than differences between species within genera. Furthermore, shallow topology can be problematic for clades in which closely related species hybridise, as by definition a phylogeny (a type of acyclic graph) assumes a bifurcating evolutionary process. In such cases, phylogenies defined for a taxonomic rank above the species level (e.g., genus), may be a preferable choice. If taxonomy for the groups in question is well resolved, it may be possible to define a phylogeny with rank just above the species level (e.g., series or section). This may be sufficient to mitigate errors due to hybridization, as inter-fertility often declines rapidly with phylogenetic distance (euc ref). Defining our phylogeny at a higher taxonomic rank also has the surprisingly powerful benefit of allowing us to include data on species that were not featured in the species-level phylogeny; species are now simply treated as replicates of the appropriate taxonomic group within the higher rank tree.
For higher taxonomic ranks (e.g. genus), it will usually be possible to derive a unique consensus tree by sampling a single species from each genus and simply pruning the tree to those tips. This approach may be problematic for lower taxonomic ranks however, because more closely related species are less likely to be monophyletic with respect to the taxonomic rank in question (). Even in such cases, we can easily account for this phylogenetic uncertainty by randomly sampling topologies at the specified rank and fitting our models over this sample of trees.
For the current example, we will sample trees at the level of taxonomic series (which in Eucalyptus contain between 1-XX species), reducing the dimension of our tree from 656 tips in the species-level tree to 87 tips in the series-level tree.
# subset data and tree to species with series level taxonomy
phy_2 <- keep.tip(phy, intersect(euc_data %>% filter(!is.na(series)) %>% pull(taxon),phy$tip.label))
dat_2 <- euc_data %>%
filter(!is.na(series),
taxon %in% phy_2$tip.label) %>%
arrange(factor(taxon, levels = phy_2$tip.label))
# sample series level topologies
n_samp = 10
tree_samp <- list()
C_samp <- list()
# run loop
for (i in 1:n_samp) {
samp <- dat_2 %>% group_by(series) %>% slice_sample(n=1) # sample one species from each series and keep only those tips to create series level tree
phy.ser <- tree.ultra(phy, samp) # make ultrametric tree
samp <- samp %>% arrange(factor(taxon, levels = phy.ser$tip.label))
phy.ser$tip.label <- samp$series
tree_samp[[i]] <- phy.ser
C_samp[[i]] <- vcv.phylo(phy.ser, cor = T)
}
# plot sampled topology
plot(tree_samp[[1]], edge.width = 1, show.tip.label = F)
# plot species-level tree again to compare
plot(phy, edge.width = 1, show.tip.label = F)
# create list of C matrices and dfs for use with brm_multiple()
C_list <- list();for (i in 1:length(C_samp)){C_list[[i]] <- list(C = C_samp[[i]])}
dat_list <- lapply(1:length(C_samp), function(x) dat_2)
We can see that our sampled series-level trees capture the major structural form of the species-level tree, but show some variation in lower order topology and branch lengths. By fitting our model over this sample of trees and combining chains for inference, we are able to integrate over this phylogenetic uncertainty.
We can also check if we are able to include more data by assigning species missing from the original phylogeny to a tip (taxonomic series) in the new tree.
# we can include data on another 47 taxa (~10% increase) by using the series level tree
match_series <- euc_data %>%
filter(trait_name %in% traits,
series %in% tree_samp[[1]]$tip.label) %>%
pull(taxon_sp) %>% unique()
length(match_sp) # 457
length(match_series) # 504
# adding another 700 observations (a further 8% increase in sample size)
euc_data %>% filter(taxon_sp %in% match_sp, trait_name %in% traits) %>% select(value) %>% na.omit() %>% nrow()
euc_data %>% filter(taxon_sp %in% match_series, trait_name %in% traits) %>% select(value) %>% na.omit() %>% nrow()
Now we have cleaned the data and matched the nomenclature it’s time
to get our data into a more useful format. Specifically, we need to
re-frame our data from long to wide format to facilitate plotting and
model fitting. We can use the pivot_wider()
function from
tidyr
to do this.
# Currently, each row represents a single observation (observation_id) on a single trait (trait_name)
# for a single taxon (taxon_name) from a single study (dataset_id). This is called long format.
euc_data %>% head()
# To analyse these data, we need a separate column for each trait, with each row representing a single
# observation (or sampling event) within a study, i.e., wide format.
euc_wide <- euc_data %>%
pivot_wider(names_from = trait_name, values_from = value) %>%
mutate(across(all_of(traits), as.numeric))
# then subset the data to those species in the phylogeny, setting taxon = taxon_sp (subsp epithet dropped)
euc_wide <- euc_wide %>% mutate(taxon = taxon_sp) %>% select(-c(taxon_sp, taxon_name)) #%>% filter(series %in% match_series)
We will also prune the species level-tree to contain only those species in the data (n = 457)
# prune species level-tree to data
# phy <- keep.tip(phy, euc_wide %>% filter(taxon %in% phy$tip.label) %>% pull(taxon) %>% unique)
We can now examine and summarise our new data table.
# we have a good number of observations for LMA across taxa, less so for N_mass and d13C
euc_wide %>% select(all_of(traits)) %>% summarise_all(~ sum((!is.na(.))))
# we can also calculate the number of observations for each trait for each taxon
species_coverage <- euc_wide %>% filter(taxon %in% phy$tip.label) %>% group_by(taxon) %>% summarise(across(all_of(traits),~ sum(!is.na(.))))
series_coverage <- euc_wide %>% filter(series %in% tree_samp[[1]]$tip.label) %>% group_by(series) %>% summarise(across(all_of(traits),~ sum(!is.na(.))))
# print
species_coverage
series_coverage
# We have many missing values (e.g., ~79% of observations do not report values for d13C)
euc_wide %>% select(all_of(traits)) %>% summarise_all(~ sum((!is.na(.)))) %>% `/`(nrow(euc_wide)) %>% round(3)
# However, the proportion of taxa in the species-level tree reporting at least one observation for each trait is actually much higher for Nmass and d13C
species_coverage %>% mutate(across(all_of(traits), ~ ifelse(.>0,1,0))) %>% select(-1) %>% colSums() %>% `/`(Ntip(phy)) %>% round(2)
# and, of course, even higher for the series level tree
series_coverage %>% mutate(across(all_of(traits), ~ ifelse(.>0,1,0))) %>% select(-1) %>% colSums() %>% `/`(Ntip(tree_samp[[1]])) %>% round(2)
As a final data preparation step, we will transform our leaf traits and rename our data table and variables for brevity.
## final clean
# - rename data object for brevity
# - log then z transform leaf traits and abbreviate names
dat <- euc_wide %>%
mutate(N_mass = scale(log(leaf_N_per_dry_mass))[,1],
d13C = scale(-1*log(abs(leaf_delta13C)))[,1], # to aide interpretation, restore negative values with -1*
LMA = scale(log(leaf_mass_per_area))[,1],
replicates = as.numeric(replicates))
To facilitate plots of species mean trait values and demonstrate a meta-analytic implementation of MR-PMM, we will create a summary data table by pooling all observations for each species across studies into a single mean value and associated standard error (each species represented by a single row). This procedure simplifies the structure of our data set considerably, but throws away a lot of information about different potential sources of variation (i.e., within-species, between-study). We will first fit the simplified meta-analytic version for the sake of exposition, then demonstrate a preferable approach of using additional random effects in a multilevel model to capture the cross-classified sampling design of observations.
## calculate species mean trait values
# as each observation reports the number of replicates, we will use a weighted mean and se to
# properly account for variation in the uncertainty of trait values
dat_mean <- dat %>%
group_by(taxon, series, subgenus) %>%
summarise(across(all_of(c("N_mass","d13C","LMA")), ~ .x %>% weighted.mean(w = replicates, na.rm = TRUE))) # weighted mean
# calculate weighted se (ideally achieved using map() within summarise above)
dat_se <- dat %>%
group_by(taxon) %>%
select(taxon, N_mass,d13C,LMA,replicates) %>%
mutate(replicates = as.numeric(replicates))
taxa <- dat_se %>% pull(taxon) %>% unique()
result <- tibble(taxon=taxa, N_mass.se=NA, d13C.se=NA,LMA.se=NA)
for (i in 1:length(taxa)){
result[i,2] <- dat_se %>%
select(taxon,N_mass,replicates) %>%
na.omit() %>%
filter(any(n()>1)) %>%
filter(taxon == taxa[i]) %>%
list(N_mass = .$N_mass, replicates = .$replicates) %>%
{ wtd.stderror(.$N_mass, w = .$replicates) }
result[i,3] <- dat_se %>%
select(taxon,d13C,replicates) %>%
na.omit() %>%
filter(any(n()>1)) %>%
filter(taxon == taxa[i]) %>%
list(d13C = .$d13C, replicates = .$replicates) %>%
{ wtd.stderror(.$d13C, w = .$replicates) }
result[i,4] <- dat_se %>%
select(taxon,LMA,replicates) %>%
na.omit() %>%
filter(any(n()>1)) %>%
filter(taxon == taxa[i]) %>%
list(LMA = .$LMA, replicates = .$replicates) %>%
{ wtd.stderror(.$LMA, w = .$replicates) }
}
result %>% head() # check result
dat_mean <- dat_mean %>% left_join(.,result) %>% ungroup # merge weighted se into data frame
# for observations with NA for se (i.e., species with a single obs for a given trait), assign 90th pctl. se for that trait
dat_mean[is.na(dat_mean$N_mass.se),]$N_mass.se <- quantile(dat_mean[!is.na(dat_mean$N_mass.se),]$N_mass.se, 0.9)
dat_mean[is.na(dat_mean$d13C.se),]$d13C.se <- quantile(dat_mean[!is.na(dat_mean$d13C.se),]$d13C.se, 0.9)
dat_mean[is.na(dat_mean$LMA.se),]$LMA.se <- quantile(dat_mean[!is.na(dat_mean$LMA.se),]$LMA.se, 0.9)
# compare the two data sets
dat %>% select(taxon, series, subgenus, dataset_id, LMA, N_mass, d13C) %>% head() # rows contain individual obs, multiple rows per species and per study.
dat_mean %>% head() # single row containing species mean trait values
Now we have summarised our data to species level, let’s make a few plots to visualize the phylogenetic distribution of trait values.
# filter to complete cases for plotting
dat_cc <- dat_mean %>% select(taxon, series, subgenus, N_mass, d13C, LMA) %>%
filter(taxon %in% phy$tip.label) %>%
drop_na() %>%
filter(subgenus %in% c("Eucalyptus", "Eudesmia", "Symphyomyrtus"),
!(!is.na(LMA) & LMA > 3) & !(!is.na(N_mass) & N_mass < -4)) # remove outliers for plotting
sg_cols <- c("#E69F00","#009E73", "#56B4E9")
X <- dat_cc %>% ungroup() %>% select(N_mass, d13C, LMA) %>% as.matrix()
rownames(X) <- dat_cc %>% pull(taxon)
phy_cc <- keep.tip(phy, dat_cc %>% pull(taxon))
First some bivariate scatter plots to visualise pairwise covariance at the level of species phenotypes.
# scatter-plots
p1 <- dat_cc %>% ggplot(aes(N_mass, LMA, color = subgenus)) + geom_point(size=3) +
theme_classic() + theme(axis.text = element_text(size=15)) + guides(color="none") + scale_color_manual(values=sg_cols)
p2 <- dat_cc %>% ggplot(aes(d13C, LMA, color = subgenus)) + geom_point(size=3) +
theme_classic() + theme(axis.text = element_text(size=15)) + guides(color="none") + scale_color_manual(values=sg_cols)
p3 <- dat_cc %>% ggplot(aes(N_mass, d13C, color = subgenus)) + geom_point(size=3) +
theme_classic() + theme(axis.text = element_text(size=15)) + guides(color="none") + scale_color_manual(values=sg_cols)
ggarrange(p1,p2,p3,ncol = 1)
To get a feel for the phylogenetic distribution of trait values, it is useful to plot our trait data as a heatmap matrix against the tree. All three traits are strongly structured with respect to phylogeny (similar values among closely related species). Furthermore, these traits appear to
## plot data against tree
# group by operational taxonomic unit (OTU)
grp <- list(Eucalyptus = dat_cc[which(dat_cc$subgenus == 'Eucalyptus'), ]$taxon,
Eudesmia = dat_cc[which(dat_cc$subgenus == 'Eudesmia'), ]$taxon,
Symphyomyrtus = dat_cc[which(dat_cc$subgenus == 'Symphyomyrtus'), ]$taxon)
p <- ggtree(phy_cc, layout = 'rectangular') +
scale_color_manual(values=c("#56B4E9","orange", "#009E73","#56B4E9"))
p <- groupOTU(p, grp, 'subgenus') + aes(color=subgenus)
# heatmap
gheatmap(p, X, offset=-1, width=0.4, color=NA,
colnames=T, colnames_position = "top",
colnames_offset_x = 0, colnames_offset_y = 5, colnames_angle = 0,
font.size = 2.5) +
scale_fill_viridis_c(option="D", name="", na.value="white") +
# guides(color="none") +
ylim(-1, 370)
The same pattern is evident at the series level.
## Repeat for series level tree
# summarise by series
series_mean <- dat %>%
filter(!is.na(series)) %>%
group_by(series, subgenus) %>%
summarise(across(all_of(c("N_mass","d13C","LMA")), ~ .x %>%
weighted.mean(w = replicates, na.rm = TRUE)))
# prep trait data frame
dat_cc_ser <- series_mean %>%
drop_na() %>%
filter(subgenus %in% c("Eucalyptus", "Eudesmia", "Symphyomyrtus"))
X <- dat_cc_ser %>% ungroup() %>% select(N_mass, d13C, LMA) %>% as.matrix()
rownames(X) <- dat_cc_ser %>% pull(series)
phy_cc <- keep.tip(tree_samp[[1]], dat_cc_ser %>% filter(series %in% tree_samp[[1]]$tip.label) %>% pull(series))
# group by operational taxonomic unit (OTU)
grp <- list(Eucalyptus = dat_cc_ser[which(dat_cc_ser$subgenus == 'Eucalyptus'), ]$series,
Eudesmia = dat_cc_ser[which(dat_cc_ser$subgenus == 'Eudesmia'), ]$series,
Symphyomyrtus = dat_cc_ser[which(dat_cc_ser$subgenus == 'Symphyomyrtus'), ]$series)
p <- ggtree(phy_cc, layout = 'rectangular') +
scale_color_manual(values=c("#56B4E9","orange", "#009E73","#56B4E9"))
p <- groupOTU(p, grp, 'subgenus') + aes(color=subgenus)
# heatmap
gheatmap(p, X, offset=-1, width=0.4, color=NA,
colnames=T, colnames_position = "top",
colnames_offset_x = 0, colnames_offset_y = 1, colnames_angle = 0,
font.size = 2.5) +
scale_fill_viridis_c(option="D", name="", na.value="white") +
# guides(color="none") +
ylim(-1, 81)
We will now fit a series of models to these data to explore different approaches to phylogenetic comparative analyses and highlight some strengths of multi-response phylogenetic mixed models (MR-PMM).
First, let’s see what conclusions we would come to by using traditional univariate approaches to examine the relationship between two of our chosen traits (N_mass and d13C): ordinary least squares regression (OLS), phylogenetic generalised least squares (PGLS), and finally PGLS with lambda (phylogenetic signal) estimated from the data (LAMBDA).
## OLS
fit_OLS <- dat_mean %>% select(N_mass, d13C) %>% na.omit %>% lm(d13C ~ N_mass, .)
fit_OLS %>% summary() # significant relationship but negligible R^2 (~0.01)
dat_mean %>% select(N_mass, d13C) %>% plot()
fit_OLS %>% abline(col="blue")
## PGLS
dat_PGLS <- dat_mean %>% select(taxon, N_mass, d13C) %>% na.omit() %>% as.data.frame();rownames(dat_PGLS) <- dat_PGLS$taxon
fit_PGLS <- dat_PGLS %>% phylolm(d13C ~ N_mass, ., phy, model = "BM"); fit_PGLS %>% summary() # lambda fixed to 1 = Brownian Motion
fit_LAMBDA <- dat_PGLS %>% phylolm(d13C ~ N_mass, ., phy, model = "lambda"); fit_LAMBDA %>% summary() # lambda estimated
# plot estimated slopes from all univariate models
# once we account for phylogeny the relationship disappears
dat_mean %>% select(N_mass, d13C) %>% plot()
fit_OLS %>% abline(col="blue");fit_PGLS %>% abline(col="red");fit_LAMBDA %>% abline(col="purple")
legend(-4.25, 2.5, legend=c("OLS", expression("PGLS"[lambda]), "PGLS"), fill = c("blue","purple","red"))
OLS reports a significant negative relationship between
d13C
and N_mass
. However, this relationship is
reversed (becomes positive) after accounting for phylogenetic signal in
the response variable d13C
, regardless of whether we use
PGLS or LAMBDA. The explanation? By partitioning out the variance in
d13C
that is associated with phylogeny, we also partition
out any covariance between d13C
and N_mass
that is associated with phylogeny. It has long been argued that
accounting for phylogeny is necessary when evaluating inter-specific
trait relationships. However, this approach risks overlooking functional
relationships between traits that are phylogenetically structured due
to, for example, phylogenetic niche conservatism. A preferable approach
is to explicitly estimate both phylogenetic and non-phylogenetic
correlations between traits using MR-PMM.
Let’s try fitting a MR-PMM in which all three traits are specified as
response variables and see whether a negative correlation between
d13C
and N_mass
is indeed manifesting on the
phylogenetic level (as suggested by PGLS above).
First we will fit a meta-analytic model to the species mean trait
data, passing in the sampling variances for each observation via the the
mev
and resp_se
arguments in
MCMCglmm
and brms
respectively. Using
MCMCglmm
,
## MR-PMM
# number of response variables in model
n_resp = 3
# phylogenetic correlation matrix (here actually the inverse of the phylo covariance matrix)
C <- inverseA(phy %>%
keep.tip(dat_mean %>% filter(taxon %in% phy$tip.label) %>% pull(taxon)))$Ainv
# uninformative prior
p_mean <- list(R = list(V=diag(n_resp), nu=n_resp),
G = list(G1=list(V=diag(n_resp), nu=n_resp)))
# subset data to tree
dat_mean_fit <- dat_mean %>% filter(taxon %in% phy$tip.label)
# fit model
fit_mean_mcmcglmm <- MCMCglmm(cbind(N_mass, d13C, LMA) ~ trait-1, # cbind() specifies multivariate response vector
mev = c(dat_mean_fit$N_mass.se^2, dat_mean_fit$d13C.se^2, dat_mean_fit$LMA.se^2),
random = ~us(trait):taxon, # ~us() specifies unstructured covariance matrix
rcov = ~us(trait):units,
ginv = list(taxon = C), # pass inv phylo cov mat
family = c("gaussian","gaussian","gaussian"), # all Gaussian traits
nitt = 55000,
burnin = 5000,
thin = 50,
data = as.data.frame(dat_mean_fit),
prior = p_mean,
verbose = FALSE)
summary(fit_mean_mcmcglmm)
We now fit the same model using brms
. Note that unlike
MCMCglmm
, brms
does not impute missing values
by default, we must instruct it to do so by including mi()
in the model formula.
# BRMS
C <- vcv.phylo(phy, corr = T)
# model formulae
bf_y1 <- brmsformula(N_mass | mi() + resp_se(N_mass.se, sigma = TRUE) ~ 1 + (1|b|gr(taxon, cov = C)))
bf_y2 <- brmsformula(d13C | mi() + resp_se(d13C.se, sigma = TRUE) ~ 1 + (1|b|gr(taxon, cov = C)))
bf_y3 <- brmsformula(LMA | mi() + resp_se(LMA.se, sigma = TRUE) ~ 1 + (1|b|gr(taxon, cov = C)))
fit_mean_brms <- brm(bf_y1 + bf_y2 + bf_y3 + set_rescor(TRUE),
data = dat_mean %>% filter(taxon %in% phy$tip.label), # filter to species featured in the tree
family = gaussian(),
data2 = list(C = C),
control = list(adapt_delta = 0.95, max_treedepth = 12),
cores = 4,
chains = 4)
saveRDS(fit_mean_brms, "fit_mean_brms.rds")
Importantly, because the Hamiltonian sampler is unable to take
advantage of numerical solutions for the multivariate Gaussian case,
fitting this model in brms
is considerably slower than
MCMCglmm
.
# show runtime of each model
fit_mean_mcmcglmm_sys_time
rstan::get_elapsed_time(fit_mean_brms$fit) %>% apply(1,sum) %>% max
Modelling phylogenetic effects at the series level dramatically reduces fitting time. Furthermore, inferences are qualitatively unchanged in this case (likely to be data and phylogeny dependent).
# phylo vcv
C <- vcv.phylo(tree_samp[[1]], corr = T)
# formulae
bf_y1 <- brmsformula(N_mass | mi() + resp_se(N_mass.se, sigma = TRUE) ~ 1 + (1|b|gr(series, cov = C)))
bf_y2 <- brmsformula(d13C | mi() + resp_se(d13C.se, sigma = TRUE) ~ 1 + (1|b|gr(series, cov = C)))
bf_y3 <- brmsformula(LMA | mi() + resp_se(LMA.se, sigma = TRUE) ~ 1 + (1|b|gr(series, cov = C)))
# fit
fit_mean_series_brms <- brm(bf_y1 + bf_y2 + bf_y3 + set_rescor(TRUE),
data = dat_mean %>% filter(series %in% tree_samp[[1]]$tip.label), # filter to series featured in the tree
family = gaussian(),
data2 = list(C = C),
control = list(adapt_delta = 0.95, max_treedepth = 12),
cores = 4,
chains = 4)
# saveRDS(fit_mean_series_brms, "fit_mean_series_brms.rds")
rstan::get_elapsed_time(fit_mean_series_brms$fit) %>% apply(1,sum) %>% max
So far we have fit models to a simplified data set of species mean
trait values (one row per species). Pooling data in this way discards
information; it is always preferable to perform analyses on the lowest
possible level of observed data. To handle the complex cross-classified
structure of records in our data (multiple observations on multiple taxa
from multiple studies), we need only specify 2 additional random effects
in our MR-PMM. The first is a non-phylogenetic between-species effect
taxon
, which captures the component of between-species
(co)variance that is independent of phylogeny. The second is a random
study effect dataset_id
, which captures between-study
variance. The residual for this model therefore absorbs within-species
variances and measurement error.
# only keep rows with data for at least one of the leaf traits (filter out rows with data for none)
dat_ml <- dat %>%
filter(!(is.na(dat$d13C) & is.na(dat$N_mass) & is.na(dat$LMA))) %>%
select(id,dataset_id,taxon,series,subgenus,d13C,N_mass,LMA,replicates) %>%
mutate(phylo = taxon,
subgenus = as.factor(ifelse(subgenus != "Symphyomyrtus", "Other", subgenus))) # choose which level to model phylo effects
Now we can fit our multilevel model.
# specify prior
p_ml <- list(R = list(V=diag(n_resp),nu=n_resp),
G = list(G1=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp)),
G2=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp)),
G3=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp))))
#-----------------------------------------------------------------------------------------#
## SPECIES LEVEL ANALYSIS
# inverse phylo covariance matrix
C <- inverseA(phy)$Ainv
# fit
fit_ml_mcmcglmm <- MCMCglmm(cbind(N_mass, d13C, LMA) ~ trait-1,
random = ~us(trait):phylo + # between-species (phylo)
us(trait):taxon + # between-species (non-phylo)
idh(trait):dataset_id, # between-study
rcov = ~idh(trait):units, # residual error
ginv = list(phylo = C),
family = c("gaussian","gaussian","gaussian"),
nitt = 55000,
burnin = 5000,
thin = 50,
data = as.data.frame(dat_ml %>%
filter(phylo %in% dimnames(C)[[1]])),
prior = p_ml,
verbose = FALSE)
summary(fit_ml_mcmcglmm)
# saveRDS(fit_ml_mcmcglmm, "fit_ml_mcmcglmm.rds")
#-----------------------------------------------------------------------------------------#
## SERIES LEVEL ANALYSIS
# inverse phylogenetic covariance matrix
C <- inverseA(tree_samp[[1]])$Ainv
# fit
fit_ml_series_mcmcglmm <- MCMCglmm(cbind(N_mass, d13C, LMA) ~ trait-1,
random = ~us(trait):phylo + # between-series (phylo)
us(trait):taxon + # between-species (non-phylo)
idh(trait):dataset_id, # between-study
rcov = ~idh(trait):units, # residual error
ginv = list(phylo = C),
family = c("gaussian","gaussian","gaussian"),
nitt = 55000,
burnin = 5000,
thin = 50,
data = as.data.frame(dat_ml %>%
mutate(phylo = series) %>%
filter(phylo %in% tree_samp[[1]]$tip.label)),
prior = p_ml,
verbose = FALSE)
summary(fit_ml_series_mcmcglmm)
# saveRDS(fit_ml_series_mcmcglmm, "fit_ml_series_mcmcglmm.rds")
Using brms
, we can directly weight observations
according to the number of replicates via the weights()
argument in the model formulae.
## Species level
# phylo vcv
C <- vcv.phylo(phy, corr = T)
# formulae
bf_y1 <- brmsformula(N_mass | weights(replicates) + mi() ~ 1 + (1|a|gr(phylo, cov = C)) + (1|b|taxon) + (1|dataset_id))
bf_y2 <- brmsformula(d13C | weights(replicates) + mi() ~ 1 + (1|a|gr(phylo, cov = C)) + (1|b|taxon) + (1|dataset_id))
bf_y3 <- brmsformula(LMA | weights(replicates) + mi() ~ 1 + (1|a|gr(phylo, cov = C)) + (1|b|taxon) + (1|dataset_id))
# fit
fit_ml_brms <- brm(bf_y1 + bf_y2 + bf_y3 + set_rescor(FALSE),
data = dat_ml %>% mutate(phylo = taxon) %>% filter(taxon %in% phy$tip.label),
family = gaussian(),
data2 = list(C = C),
control = list(adapt_delta = 0.95, max_treedepth = 12),
cores = 4,
chains = 4)
# saveRDS(fit_ml_brms, "fit_ml_brms.rds")
#----------------------------------------------------------------------------------------------#
## Series level
# phylo vcv
C <- vcv.phylo(tree_samp[[1]], corr = T)
# fit
fit_ml_series_brms <- brm(bf_y1 + bf_y2 + bf_y3 + set_rescor(FALSE),
data = dat_ml %>% mutate(phylo = series) %>% filter(series %in% tree_samp[[1]]$tip.label),
family = gaussian(),
data2 = list(C = C),
control = list(adapt_delta = 0.95, max_treedepth = 12),
cores = 4,
chains = 4)
saveRDS(fit_ml_series_brms, "fit_ml_series_brms.rds")
While complex models of evolution are currently not implemented in
MCMCglmm
or brms
, it is possible to fit
separate trait covariance matrices for different levels of a grouping
factor (e.g., clade). This may be desirable when there is a strong a
priori hypothesis that trait correlations should differ between major
phylogenetic clades, e.g., subgenus Symphyomyrtus vs all other
Eucalyptus subgenera.
p_ml_at.level <- list(R = list(V=diag(n_resp),nu=n_resp),
G = list(G1=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp)),
G2=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp)),
G3=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp)),
G4=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp)),
G5=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp))))
fit_ml_mcmcglmm_at.level <- MCMCglmm(cbind(N_mass, d13C, LMA) ~ subgenus:trait-1,
random = ~us(at.level(subgenus,'Symphyomyrtus'):trait):phylo +
us(at.level(subgenus,'Other'):trait):phylo +
us(at.level(subgenus,'Symphyomyrtus'):trait):taxon +
us(at.level(subgenus,'Other'):trait):taxon +
idh(trait):dataset_id, # between-study
rcov = ~idh(trait):units, # residual error
ginv = list(phylo = C),
family = c("gaussian","gaussian","gaussian"),
nitt = 55000,
burnin = 5000,
thin = 50,
data = as.data.frame(dat_ml %>%
filter(phylo %in% dimnames(C)[[1]],
!is.na(subgenus))),
prior = p_ml_at.level,
verbose = FALSE)
summary(fit_ml_mcmcglmm_at.level)
So, what does our multilevel MR-PMM tell us about the relationships
between these leaf traits in Eucalyptus? In the model summary for
MCMCglmm
, the intercepts for each trait are reported as
location effects, the phylogenetic (co)variances under
phylo
and the residual (co)variances under
units
. We can also see that some parameters, particularly
for d13C
, have low ESS indicating some convergence
issues.
fit_ml_mcmcglmm %>% summary
In order to make meaningful comparisons, we must calculate
correlations (scaled covariance) between traits. We will do this using
the custom function mrpmm_cor()
(which takes a
MCMCglmm
model fit and calculates correlations between the
response variables at the levels specified).
# select model
fit_ml <- fit_ml_mcmcglmm
phylo <- "phylo"
# extract samples from multilevel model fit
mod_res <- cbind(cbind(mrpmm_cor(fit_ml, "N_mass", "d13C", phylo, "taxon")[,3:4],
mrpmm_cor(fit_ml, "N_mass", "LMA", phylo, "taxon")[,3:4],
mrpmm_cor(fit_ml, "d13C", "LMA", phylo, "taxon")[,3:4]))
names(mod_res) <- c("N_mass:d13C_phy","N_mass:d13C_ind",
"N_mass:LMA_phy","N_mass:LMA_ind",
"d13C:LMA_phy","d13C:LMA_ind")
#-----------------------------------------------------------------------------------------#
We will then use functions from the tidybayes
package to
plot the posteriors of our correlation estimates.
## CORRELATION PLOT
# set theme
theme_set(theme_classic())
# build plot
p1 <- mod_res %>%
as_tibble %>%
mutate(.chain = 1, .iteration = 1:nrow(fit_ml$Sol), .draw = 1:nrow(fit_ml$Sol)) %>%
pivot_longer(1:6, names_to = "parameter", values_to = "estimate") %>%
mutate(level = ifelse(str_detect(parameter, "_phy"), "phylogenetic","non-phylogenetic"),
parameter = str_replace(parameter, "_phy", ""),
parameter = str_replace(parameter, "_ind", "")) %>%
ggplot(aes(x = estimate,
y = factor(parameter, levels=c("N_mass:LMA","d13C:LMA","N_mass:d13C")),
col = factor(level, levels = c("phylogenetic", "non-phylogenetic")),
fill = factor(level, levels = c("phylogenetic", "non-phylogenetic")))) +
stat_halfeye(position = position_dodge(width = 0.4),
slab_alpha = 0.5,
fatten_point = 3) +
geom_vline(xintercept=0, linetype="dashed") +
theme(axis.text = element_text(size=18),
axis.title.x = element_text(size=20),
legend.text = element_text(size=18),
legend.title = element_text(size=20),
plot.margin = margin(0.2,0.2,0.2,0.2, "cm")) +
xlim(-1,1) +
ylab("") + xlab("") + guides(color="none",fill=guide_legend(title="level")) +
scale_color_manual(values=c("#00447c", "#c1549c")) +
scale_fill_manual(values=c( "#00447c", "#c1549c")) +
scale_y_discrete(labels = c("N_mass:d13C" = bquote(cor(N[mass]*","~delta^13*C)),
"N_mass:LMA" = bquote(cor("LMA"*","~N[mass])),
"d13C:LMA" = bquote(cor("LMA"*","~delta^13*C))))
p1
# ggsave("Figure_4B.tiff", dpi = 900, width = 30, height = 18, units = c("cm"))
The model reports a positive phylogenetic correlation between LMA and
\(\delta\prescript{13}{}{\mathbf{C}}\),
and negative phylogenetic correlations between these traits and \(\mathbf{N}_{\mathrm{mass}}\), indicating
conserved co-evolutionary relationships. Between-species correlations
with LMA show a consistent direction across both phylogenetic and
non-phylogenetic levels. In contrast, the between-species correlation
for d13C
and N_mass
is negative on the
photogenic level and positive on the taxon (non-phylogenetic) level.
This situation highlights an important strength of MR-PMM; the capacity
to disentangle trait relationships operating at different levels in the
model hierarchy.
We can also examine the variance decomposition of each trait to better understand how different sources of variation contribute to overall trait variances. First we calculate the total posterior variance for each trait by adding variance components, then decompose this variance into the proportion of variance explained by each component.
# calculate total variance for each trait
LMA_var <- fit_ml$VCV[,"traitLMA:traitLMA.phylo"] +
fit_ml$VCV[,"traitLMA:traitLMA.taxon"] +
fit_ml$VCV[,"traitLMA.dataset_id"] +
fit_ml$VCV[,"traitLMA.units"]
N_mass_var <- fit_ml$VCV[,"traitN_mass:traitN_mass.phylo"] +
fit_ml$VCV[,"traitN_mass:traitN_mass.taxon"] +
fit_ml$VCV[,"traitN_mass.dataset_id"] +
fit_ml$VCV[,"traitN_mass.units"]
d13C_var <- fit_ml$VCV[,"traitd13C:traitd13C.phylo"] +
fit_ml$VCV[,"traitd13C:traitd13C.taxon"] +
fit_ml$VCV[,"traitd13C.dataset_id"] +
fit_ml$VCV[,"traitd13C.units"]
# calculate the proportion of variance explained by each level across traits
data_stack <- data.frame(
trait = rep(c("LMA","N_mass","d13C"), each=4),
level = rep(c("phylogenetic","non-phylogenetic","study","residual"),3),
post_mode = c(mean(fit_ml$VCV[,"traitLMA:traitLMA.phylo"]/LMA_var),
mean(fit_ml$VCV[,"traitLMA:traitLMA.taxon"]/LMA_var),
mean(fit_ml$VCV[,"traitLMA.dataset_id"]/LMA_var),
mean(fit_ml$VCV[,"traitLMA.units"]/LMA_var),
mean(fit_ml$VCV[,"traitN_mass:traitN_mass.phylo"]/N_mass_var),
mean(fit_ml$VCV[,"traitN_mass:traitN_mass.taxon"]/N_mass_var),
mean(fit_ml$VCV[,"traitN_mass.dataset_id"]/N_mass_var),
mean(fit_ml$VCV[,"traitN_mass.units"]/N_mass_var),
mean(fit_ml$VCV[,"traitd13C:traitd13C.phylo"]/d13C_var),
mean(fit_ml$VCV[,"traitd13C:traitd13C.taxon"]/d13C_var),
mean(fit_ml$VCV[,"traitd13C.dataset_id"]/d13C_var),
mean(fit_ml$VCV[,"traitd13C.units"]/d13C_var))
)
We can then build these estimates into a stacked bargraph,
## VARIANCE DECOMPOSITION PLOT
# stacked bar graph
p2 <- ggplot(data_stack %>%
mutate(post_mode = round(post_mode,2)),
aes(x = factor(trait, levels=c("LMA","N_mass","d13C")),
y = post_mode,
fill = factor(level,
levels=rev(c("phylogenetic","non-phylogenetic","study","residual"))),
label = post_mode)) +
geom_bar(position="fill", stat="identity") +
geom_text(col="white", size = 5, position = position_stack(vjust = 0.5)) +
scale_fill_manual(values=rev(c("#00447c","#c1549c","#fb836f","#FFA600"))) +
guides(fill=guide_legend("")) +
theme(axis.text = element_text(size=18),
axis.title.y = element_text(size=20),
legend.text= element_text(size=18),
plot.margin = margin(0.2,1,0.2,0.2, "cm")) +
ylab("proportion of variance") + xlab("") +
scale_x_discrete(labels = c("N_mass" = bquote(N[mass]),
"d13C" = bquote(delta^13*C)))
p2
# ggsave("Figure_4A.tiff", dpi = 900, width = 24, height = 18, units = c("cm"))
and plot together as a panel.
# plot together in panel
ggarrange(p2, p1, widths = c(0.4,0.6), common.legend = T)
# ggsave("Figure_4AB.tiff", dpi = 900, width = 30, height = 15, units = c("cm"))
Finally, we can use the fitted covariance matrices to derive partial correlations.
mod <- fit_ml_mcmcglmm
phylo <- "phylo"
result_list <- cbind(mod$Sol, mod$VCV) %>% as_tibble %>% mutate(tree = 1) %>% relocate(tree)
# combine
res <- result_list %>% bind_rows() %>% as.matrix() %>% as_tibble()
# extract VCV components
vcv <- res %>%
select(contains(":"))# select VCV components
# # r hat diagnostic (rhat on combined posterior not great, but for any given tree rhat is good)
chains = vcv %>% as.data.frame()
chains %>% map_dbl(rstan::Rhat) %>% {.[.>1.05]}
# calculate correlations
n_resp = 3
cor_list <- level_cor_vcv(vcv, n_resp, part.1 = phylo, part.2 = "taxon", lambda = NULL)
level_cor <- gsub("_cor", "", names(cor_list$posteriors)) # different levels correlations are estimated at
We can use the igraph
package to plot pairwise
correlations as a network.
## NETWORK GRAPH
# choose level to calculate result
level_cor
level <- "par_phy"
# calculate
cor <- paste0(level,"_cor")
mat <- paste0(level,"_mat")
sig.95 <- cor_list$posteriors[[cor]] %>% summarise_all(~ quantile(., prob=c(0.025,0.975)) %>% {. > 0} %>% table %>% length == 1) %>% t() %>% as_tibble %>% mutate(sig.95 = V1, edge = 1:((n_resp^2-n_resp)/2)) %>% select(-V1)
sig.80 <- cor_list$posteriors[[cor]] %>% summarise_all(~ quantile(., prob=c(0.1,0.90)) %>% {. > 0} %>% table %>% length == 1) %>% t() %>% as_tibble %>% mutate(sig.80 = V1, edge = 1:((n_resp^2-n_resp)/2)) %>% select(-V1)
sig <- left_join(sig.95, sig.80, by = "edge") %>% mutate(mean = cor_list$posteriors[[cor]] %>% summarise_all(~ mean(.)) %>% t %>% as.vector)
sig <- sig %>% mutate(col = ifelse(sig.80 == T & sig.95 != T & mean > 0, "lightblue",
ifelse(sig.80 == T & sig.95 == T & mean > 0, "blue",
ifelse(sig.80 == T & sig.95 != T & mean < 0, "pink", "red"))),
lty = ifelse(sig.80 == T & sig.95 != T & mean > 0, "dotted",
ifelse(sig.80 == T & sig.95 == T & mean > 0, "solid",
ifelse(sig.80 == T & sig.95 != T & mean < 0, "dotted", "solid"))))
cor_mat <- cor_list$matrices[[mat]]
dimnames(cor_mat)[[1]] <- c("N mass","leaf\nd13C","LMA")
dimnames(cor_mat)[[2]] <- c("N mass","leaf\nd13C","LMA")
diag(cor_mat) <- 0
g <- igraph::graph_from_adjacency_matrix (cor_mat, weighted=TRUE, mode="upper")
igraph::E(g)$weight <- abs(t(cor_mat[lower.tri(t(cor_mat))]))
# g_sig <- delete.edges(g, igraph::E(g)[sig %>% filter(sig.80 == F) %>% pull(edge)]) # delete no-sig edges
# plot(g_sig,
# layout = layout.circle,
# vertex.size=32,
# edge.width=(abs(igraph::E(g)$weight)^1.5)*30,
# edge.color=ifelse((cor_mat > 0)[lower.tri(cor_mat)][sig %>% filter(sig.80 == T) %>% pull(edge)], "blue","red"),
# edge.lty=sig %>% filter(sig.80 == T) %>% pull(lty))
plot(g,
layout = layout.circle,
vertex.size=32,
edge.width=(abs(igraph::E(g)$weight)^2)*20,
edge.color=ifelse((cor_mat > 0)[lower.tri(cor_mat)][sig %>% filter(sig.80 == T) %>% pull(edge)], "blue","red"))
cor_list$posteriors[["par_phy_cor"]] %>% map(mean)
cor_list$posteriors[["par_ind_cor"]] %>% map(quantile)
cor_list$posteriors[["par_ind_cor"]] %>% mcmc_areas()
cor_list$posteriors[["par_ind_cor"]] %>% mcmc_intervals(prob_outer = 0.95)
#-----------------------------------------------------------------------#
Because Brownian Motion is a noisy process, estimates of phylogenetic correlations are generally associated with high uncertainty. One way to improve precision is to impose a regularizing prior which shrinks estimates toward zero. When applied to correlation coefficients, this has the effect of focusing our inference on the strongest trait relationships in our data.
## Series level
# phylo vcv
C <- vcv.phylo(phy, corr = T)
# get prior for model
get_prior(formula = bf_y1 + bf_y2 + bf_y3 + set_rescor(FALSE),
data = dat_ml,
data2 = list(C = C))
# set higher eta value for LKJ prior on phylo correlations
p_lkj <- prior(lkj(3), class = cor, group = phylo)
# fit
fit_ml_lkj_brms <- brm(bf_y1 + bf_y2 + bf_y3 + set_rescor(FALSE),
data = dat_ml %>% mutate(phylo = taxon) %>% filter(taxon %in% phy$tip.label),
family = gaussian(),
data2 = list(C = C),
prior = p_lkj,
control = list(adapt_delta = 0.95, max_treedepth = 12),
cores = 4,
chains = 4)
# saveRDS(fit_ml_lkj_brms, "fit_ml_lkj_brms.rds")
#-------------------------------------------------------------------------------------------#
## Series level
# phylo vcv
C <- vcv.phylo(tree_samp[[1]], corr = T)
# model formulae
bf_y1 <- brmsformula(N_mass | weights(replicates) + mi() ~ 1 + (1|a|gr(phylo, cov = C)) + (1|b|taxon) + (1|dataset_id))
bf_y2 <- brmsformula(d13C | weights(replicates) + mi() ~ 1 + (1|a|gr(phylo, cov = C)) + (1|b|taxon) + (1|dataset_id))
bf_y3 <- brmsformula(LMA | weights(replicates) + mi() ~ 1 + (1|a|gr(phylo, cov = C)) + (1|b|taxon) + (1|dataset_id))
# fit
fit_ml_series_lkj_brms <- brm(bf_y1 + bf_y2 + bf_y3 + set_rescor(FALSE),
data = dat_ml %>% mutate(phylo = series) %>% filter(series %in% tree_samp[[1]]$tip.label),
family = gaussian(),
data2 = list(C = C),
prior = p_lkj,
control = list(adapt_delta = 0.95, max_treedepth = 12),
cores = 4,
chains = 4)
# saveRDS(fit_ml_series_lkj_brms, "fit_ml_series_lkj_brms.rds")
We explore two approaches to model validation: 1) we assessed within-sample predictive capacity using posterior predictive checks; and 2) we assessed out-of-sample predictive capacity using LOO-CV.
First, we choose a brms
model fit and define the
parameter estimates we wish to validate.
# choose model
fit <- fit_ml_brms
# relevant model parameters (exclude random effects and log-probability)
pars <- fit %>% as_tibble %>%
select(-starts_with("r_"),-starts_with("lp"),-starts_with("Intercept")) %>% names;pars
r_pars <- fit %>% as_tibble %>% select(starts_with("r_")) %>% names; r_pars
Diagnostics for models fit in brms
are easily obtained
via dependent functions. In this case, rhat values indicate convergence
issues for some study effects (especially for d13C), and a bulk ESS plot
shows that some parameters have low ESS (though inspection of the ess
tibble shows these are mostly for data).
# rhat diagnostics
rhat <- rhat(fit, pars = r_pars)
rhat[rhat > 1.1] # convergence issues for study effects (especially for d13C)
# HMC diagnostics
rstan::check_hmc_diagnostics(fit$fit)
# Effective Sample Size (ESS)
ess <-
fit %>%
as_tibble %>%
select(all_of(pars)) %>%
map_dbl(ess_bulk) # map ESS for each parameter
tibble(ess = ess, par = names(ess)) %>%
ggplot(aes(y = par, x = ess))+
geom_linerange(aes(xmin = 0, xmax = ess)) +
geom_point() +
geom_vline(xintercept = 400, col = "grey30", linetype = "dashed") +
theme(axis.text.y = element_blank()) +
labs(y = NULL, title = "Bulk effective sample size (ess)")
ess[ess < 100]
For the first approach to model validation, posterior predictive checks verify the capacity of the model to generate plausible data; the proportion of data included in the nominal predictive intervals is generally close to expectation for each method and response trait.
## posterior predictive checks
# prediction conditional on species-level random-intercept estimates
ppA1 <- pp_check(fit, resp= "LMA", type = "intervals") + labs(subtitle = "Response: LMA")
ppA2 <- pp_check(fit, resp= "Nmass", type = "intervals") + labs(subtitle = "Response: Nmass")
ppA3 <- pp_check(fit, resp= "d13C", type = "intervals") + labs(subtitle = "Response: d13C")
# arrange by predictive mean of the conditional model
ppA1$data <- ppA1$data %>% arrange(match(y_id, ppA1$data %>% arrange(m) %>% pull(y_id)))
ppA2$data <- ppA2$data %>% arrange(match(y_id, ppA2$data %>% arrange(m) %>% pull(y_id)))
ppA3$data <- ppA3$data %>% arrange(match(y_id, ppA3$data %>% arrange(m) %>% pull(y_id)))
# calculate coverage
covA1 <- round(get_coverage(ppA1$data),1)
covA2 <- round(get_coverage(ppA2$data),1)
covA3 <- round(get_coverage(ppA3$data),1)
# plot
ggarrange(plot_pred(ppA1$data %>% select(6:10), ppA1$data$y_obs, cov = covA1, "LMA", subtitle = F) ,
plot_pred(ppA2$data %>% select(6:10), ppA2$data$y_obs, cov = covA2, "Nmass", subtitle = F),
plot_pred(ppA3$data %>% select(6:10), ppA3$data$y_obs, cov = covA3, "d13C", subtitle = F),
ncol = 1)
We can perform residual analysis by building a qq plot
## Residual analysis
pe1 <- predictive_error(fit, resp = "LMA") %>% as.data.frame()
pe2 <- predictive_error(fit, resp = "Nmass") %>% as.data.frame()
pe3 <- predictive_error(fit, resp = "d13C") %>% as.data.frame()
lbs <- labs(x = "Theoretical Quantiles",
y = "Sample Quantiles")
# qq pnorm plots
ggarrange(
tibble(y = pe1 %>% map_dbl(mean)) %>%
ggplot(aes(sample = y)) +
stat_qq(alpha = 0.5) + stat_qq_line() +
lbs + labs(title= "LMA"),
tibble(y = pe2 %>% map_dbl(mean)) %>%
ggplot(aes(sample = y)) +
stat_qq(alpha = 0.5) + stat_qq_line() +
lbs + labs(title= "Nmass"),
tibble(y = pe3 %>% map_dbl(mean)) %>%
ggplot(aes(sample = y)) +
stat_qq(alpha = 0.5) + stat_qq_line() +
lbs + labs(title= "d13C"),
nrow = 1)
For LOO-CV, we find the approximate method (Bürkner, Gabry, and Vehtari 2021) fails for a small proportion of the data (especially for LMA), necessitating manual re-fits of the model. The computational demand of re-fitting such a large model for more than a few troublesome test data points is probably not feasible for the average desktop workstation. However, as HPC becomes more accessible to the average researcher, manual re-fitting procedures may represent a viable option to LOO-CV of MR-PMM.
## LOO-CV predictions
# choose model
fit <- fit_mean_series_brms
fit_loo <- loo(fit, cores = 8)
# loo predict
loo_pred.LMA <- loo_predict(fit, type = "quantile",
probs = c(0.025,0.25,0.5,0.75,0.975), resp = "LMA") %>%
t %>% `colnames<-`(c("ll","l","m","h","hh"))
loo_pred.N_mass <- loo_predict(fit, type = "quantile",
probs = c(0.025,0.25,0.5,0.75,0.975), resp = "Nmass") %>%
t %>% `colnames<-`(c("ll","l","m","h","hh"))
loo_pred.d13C <- loo_predict(fit, type = "quantile",
probs = c(0.025,0.25,0.5,0.75,0.975), resp = "d13C") %>%
t %>% `colnames<-`(c("ll","l","m","h","hh"))
# plot
loo.plot.LMA <-
loo_pred.LMA %>%
as_tibble %>%
mutate(y_obs = fit$data$LMA) %>%
arrange(m) %>%
mutate(n = 1:n()) %>%
ggplot(aes(n)) +
geom_linerange(aes(ymin = ll, ymax = hh), alpha = 0.4) +
geom_linerange(aes(ymin = l, ymax = h), size = 1, alpha = 0.4)+
geom_line(aes(y = m), size = 0.5) +
geom_point(aes(y = y_obs), col = blues9[9], size =1.5) +
labs(y = NULL, x = NULL,
title = "LOO-predictive intervals: LMA")
loo.plot.Nmass <-
loo_pred.Nmass %>%
as_tibble %>%
mutate(y_obs = fit$data$N_mass) %>%
arrange(m) %>%
mutate(n = 1:n()) %>%
ggplot(aes(n)) +
geom_linerange(aes(ymin = ll, ymax = hh), alpha = 0.4) +
geom_linerange(aes(ymin = l, ymax = h), size = 1, alpha = 0.4)+
geom_line(aes(y = m), size = 0.5) +
geom_point(aes(y = y_obs), col = blues9[9], size =1.5) +
labs(y = NULL, x = NULL, title = "LOO-predictive intervals: Nmass")
loo.plot.d13C <-
loo_pred.d13C %>%
as_tibble %>%
mutate(y_obs = fit$data$d13C) %>%
arrange(m) %>%
mutate(n = 1:n()) %>%
ggplot(aes(n)) +
geom_linerange(aes(ymin = ll, ymax = hh), alpha = 0.4) +
geom_linerange(aes(ymin = l, ymax = h), size = 1, alpha = 0.4)+
geom_line(aes(y = m), size = 0.5) +
geom_point(aes(y = y_obs), col = blues9[9], size =1.5) +
labs(y = NULL, x = NULL, title = "LOO-predictive intervals: d13C")
ggarrange(loo.plot.y1,loo.plot.y2, ncol =1)