\(\color{orange}{\text{N.B. This tutorial is a live document that will continue to be improved and updated by the authors. Enjoy!}}\)

Introduction

In this tutorial we will use the AusTraits database (Falster et al. 2021) to explore the benefits of multi-response phylogenetic mixed models (MR-PMM). AusTraits is a relational database containing records of ~500 plant traits across ~30,000 Australian plant taxa (https://austraits.org/). The first step is to install and load any required packages, then load in the AusTraits data. For reproducibility, all analyses will use version 6.0.0 of AusTraits (https://zenodo.org/records/11188867). This tutorial is currently set up for users to download the relevant files directly from GitHub and inspect outputs of each code chunk within Rstudio on their local machine.

Data cleaning and manipulation

A crucial step in any phylogenetic comparative analysis is collating, cleaning and manipulating data records. As we are dealing with a phylogeny, this includes matching the nomenclature (species naming conventions) between our trait data and the tip labels of the phylogenetic tree. The following section covers the decisions made and operations applied to arrive at the final data set for statistical analyses. As we are working from AusTraits, the first step is to pull down data on the organisms we are interested in, which for today is the genus Eucalyptus.

# extract all austraits records for genus Eucalyptus
eucalyptus <- extract_taxa(austraits, genus = "Eucalyptus")

# join all information on these records from austraits database
euc_data <- (eucalyptus %>% austraits::join_all())$traits

# inspect the data table
euc_data %>% head() # first few rows
euc_data %>% names() # column names
euc_data %>% select(trait_name) %>% unique() # traits
euc_data %>% select(basis_of_record) %>% unique() # types of records

We will focus on three continuous traits relevant to leaf economics and water use efficiency: leaf_mass_per_area (LMA), leaf_N_per_dry_mass (N_mass) and leaf_delta13C (d13C). Some important decisions to be made are when to exclude data, and how to treat records that are not reported in a format conducive to our analyses. Here we choose to exclude data from experimental studies, limiting ourselves to observations on field and captive-cultivated (i.e., botanical gardens and arboreta) specimens, as well as measurements on preserved specimens and expert reports from the literature (i.e., Flora field guides). We also limit ourselves to observations on reproductively mature individuals.

Another consideration for these data will be properly integrating different types of observations. For example, some records in AusTraits represent raw trait values observed on a single individual (replicates = 1), where-as others represent mean trait values calculated from observations on multiple individuals (replicates > 1). From a modelling perspective, we would like observations based on more replicates to have a greater influence on the estimation of model parameters. Hence, observations should be weighted proportionate to their level of replication during model fit. The variable replicates reports an integer value for some records, but is also sometimes reported as a range, e.g., “3-5”. To be conservative, we can define the lower bound of this range as the level of replication for that observation. This represents one of many arbitrary decisions that must be made when attempting to collate data from diverse sources. The important thing is to keep a good record of all such decisions. First, so their influence on results may be tested (e.g., with sensitivity analyses), and second so they can be adequately described and justified for publication.

# choose traits to extract data on
traits <- c("leaf_mass_per_area", "leaf_N_per_dry_mass", "leaf_delta13C")
num_traits <- traits # we will focus on the numerical leaf traits initially

## clean data
# - remove records from experimental studies and juvenile specimens 
# - resolve replicates reported as a character string to an integer value by taking minima of reported range
euc_data <- euc_data %>% 
  filter(trait_name %in% traits,
         basis_of_record %in% c("field", "captive_cultivated", "literature", "preserved_specimen"),
         # !location_name %in% c("Currency Creek Arboretum",
         #                       "Currency Creek Arboretum, South Australia"), # exclude records from CCA?
         life_stage == "adult",) %>% 
  mutate(replicates = ifelse(replicates == "unknown" | replicates == "1-10", 1,
                      ifelse(replicates == "2-17", 2,
                      ifelse(replicates == "3-5", 3,
                      ifelse(replicates == "4-5" | replicates == "4-6", 4,
                      ifelse(replicates == "5-6", 5,
                      ifelse(replicates == "10-100" | replicates == "10-40" | replicates == "10-20", 10, replicates)))))),
         replicates = as.character(replicates)) %>% 
  replace_na(list(replicates = "1"))

# we have a mix of value types across a diverse set of records
euc_data %>% select(value_type) %>% table()
euc_data %>% select(basis_of_record) %>% table()

# select only the columns we need and create a unique id for each record
euc_data <- euc_data %>% 
  select(dataset_id, location_name, observation_id, taxon_name, trait_name, value_type, value, replicates) %>% 
  distinct() %>%
  mutate(n = 1:n(), .by = c(dataset_id, location_name, observation_id, taxon_name, trait_name),
         id = paste0(dataset_id,"_",observation_id,"_",n), .before = dataset_id) %>% # create unique observation id
  select(-n)

Phylogeny

After our initial refinement of records to include, the next step is to match the nomenclature (species naming conventions) between our trait data and the tip labels of our phylogenetic tree. We will use the maximum likelihood Eucalypt phylogeny presented by (Thornhill et al. 2019) for all our analyses (N.B. models are easily extended to incorporate phylogenetic uncertainty by fitting over a sample of candidate trees and combining posterior estimates prior to inference). To proceed with analyses, our tree must be binary (fully bifurcating with no polytomies) and ultrametric (clock-like with all tips right adjusted). So the first step is to perform these checks and amend the tree where necessary.

# read in Eucalypt phylogeny
phy <- read.tree(file="tree.ML1.tre")
plot(phy, edge.width = 1, show.tip.label = F) # looks ultrametric

# perform checks
is.binary(phy);is.ultrametric(phy) # but fails both tests

# resolve polytomies and make ultrametric (clock-like)
phy <- multi2di(phy, random = FALSE) # splits polytomies into bifurcations but introduces zero length branches
phy$edge.length[phy$edge.length == 0] <- 1e-5 # replace with small non-zero length
phy <- force.ultrametric(phy, method = "extend") # make ultrametric by extending external branches where necessary
phy$node.label <- (length(phy$tip.label)+1):(phy$Nnode+length(phy$tip.label)) # add node labels
is.binary(phy);is.ultrametric(phy) # now passes checks

Nomenclature

Now the tree is in the right shape we can go about cleaning and matching the nomenclature. A quick preview of taxon names reveals a few differences between our data and the tip labels of our tree. We can usually resolve most differences simply by amending text strings using grep functions. We will create two versions of the taxon name; one which retains subspecies identity and one which does not. In the event that some subspecies do not appear in the tree, using the latter nomenclature may allow us to include more records by treating trait data from all subspecies as observations at the species level (e.g., records for subspecies Eucalyptus_examplus_primus and Eucalyptus_examplus_secundis are treated as records for species Eucalyptus_examplus). As taxonomy is in a constant state of revision, it is also a good idea to check whether any species have undergone recent name changes which may create conflicts when attempting to match trait data records to taxon names in the phylogeny, though we will skip this step for today’s exercise.

# phylogeny contains binomial labels for species and trinomial labels for subspecies, all separated by "_" (genus_species_subspecies)
tail(phy$tip.label)

# austraits also contains records at the species and subspecies level, but with a different naming convention and names separated by " "
# Additional information (e.g., "var." = variety) and unidentified species (e.g., "Eucalyptus sp.") will also have to be removed  
euc_data %>% select(taxon_name) %>% unique() %>% tail()

# these differences in nomenclature cause a complete mismatch in taxon names between data and tree
euc_data %>% pull(taxon_name) %>% unique() %>% intersect(.,phy$tip.label) %>% length()

# filter out unwanted records and use grep functions to resolve differences in nomenclature
euc_data <- euc_data %>% 
  filter(!taxon_name == "Eucalyptus", # remove records of unidentified species
         !taxon_name %like% "% sp. %",
         !taxon_name %like% "% x %" # remove records from hybrid species
         ) %>% 
  mutate(taxon = gsub(" ", "_",taxon_name), # clean taxon names by replacing "subsp." and "var." with "_"
         taxon = gsub("_subsp._", "_",taxon),
         taxon = gsub("_var._", "_",taxon),
         taxon = gsub('^([^_]+_[^_]+_[^_]+).*', "\\1",taxon), # remove everything after the third "_" (retains subsp epithet)
         taxon_sp = gsub('^([^_]+_[^_]+).*', "\\1",taxon))  # remove everything after the second "_", (removes subsp epithet)

# quick check that the re-naming worked as intended
euc_data %>% select(taxon, taxon_sp) %>% unique() %>% tail()

Now we have resolved the nomenclature, we can check which and how many taxa in the phylogeny have AusTraits records for our traits of interest.

# check how many species with trait data are also in the phylogeny
match <- euc_data %>% 
  filter(trait_name %in% traits) %>% 
  pull(taxon) %>% intersect(phy$tip.label);length(match) # we have data on at least one leaf trait for 421 taxa

# check how many if we drop the subsp epithet off both the tip.labels of 
# the phylogeny and the taxon names in the data
match_sp <- euc_data %>% 
  filter(trait_name %in% traits) %>% 
  pull(taxon_sp) %>% # subsp epithet dropped
  intersect(gsub('^([^_]+_[^_]+).*', "\\1", phy$tip.label)); length(match_sp) # increases to 457 taxa by ignoring subspecies nomenclature

# and adds almost 2000 records across the three traits (~25% increase in sample size)
euc_data %>% filter(taxon %in% match, trait_name %in% traits) %>% select(value) %>% na.omit() %>% nrow()
euc_data %>% filter(taxon_sp %in% match_sp, trait_name %in% traits) %>% select(value) %>% na.omit() %>% nrow()

# One thing to check is that we will not create duplicate tip labels in the phylogeny by dropping subsp epithets.
# This would happen if the phylogeny featured multiple subsp for any given species. To confirm this, we can check 
# that the number of unique tip labels doesn't change when we drop the subsp epithet.
length(unique(gsub('^([^_]+_[^_]+).*', "\\1", phy$tip.label))) == length(unique(phy$tip.label)) # all good

# Ok, let's drop subsp epithets off the tip labels of the phylogeny
phy$tip.label <- gsub('^([^_]+_[^_]+).*', "\\1", phy$tip.label)

# match in taxonomic information from Nicolle 2022
euc_data$subgenus <- taxonomy[match(euc_data$taxon_sp, taxonomy$taxon_sp),]$subgenus # match each taxa to subgenus
euc_data$series <- taxonomy[match(euc_data$taxon_sp, taxonomy$taxon_sp),]$series # match each taxa to subgenus

Modelling phylogeny at a higher taxonomic rank

Phylogenetic signal in species traits is to some extant a function of phylogenetic scale (ref Revell XXXX). By including more diverse species in our phylogeny, we capture deeper splits that represent more meaningful divergences in the genotype and phenotype of extant lineages, precisely the effects we intend to model when analysing inter-species data. One consequence of this, is that shallow topology (near the tips) is less informative than deep topology when attempting to infer patterns of phylogenetic niche conservatism, because differences between genera are usually more significant than differences between species within genera. Furthermore, shallow topology can be problematic for clades in which closely related species hybridise, as by definition a phylogeny (a type of acyclic graph) assumes a bifurcating evolutionary process. In such cases, phylogenies defined for a taxonomic rank above the species level (e.g., genus), may be a preferable choice. If taxonomy for the groups in question is well resolved, it may be possible to define a phylogeny with rank just above the species level (e.g., series or section). This may be sufficient to mitigate errors due to hybridization, as inter-fertility often declines rapidly with phylogenetic distance (euc ref). Defining our phylogeny at a higher taxonomic rank also has the surprisingly powerful benefit of allowing us to include data on species that were not featured in the species-level phylogeny; species are now simply treated as replicates of the appropriate taxonomic group within the higher rank tree.

For higher taxonomic ranks (e.g. genus), it will usually be possible to derive a unique consensus tree by sampling a single species from each genus and simply pruning the tree to those tips. This approach may be problematic for lower taxonomic ranks however, because more closely related species are less likely to be monophyletic with respect to the taxonomic rank in question (). Even in such cases, we can easily account for this phylogenetic uncertainty by randomly sampling topologies at the specified rank and fitting our models over this sample of trees.

For the current example, we will sample trees at the level of taxonomic series (which in Eucalyptus contain between 1-XX species), reducing the dimension of our tree from 656 tips in the species-level tree to 87 tips in the series-level tree.

# subset data and tree to species with series level taxonomy
phy_2 <- keep.tip(phy, intersect(euc_data %>% filter(!is.na(series)) %>% pull(taxon),phy$tip.label))
dat_2 <- euc_data %>% 
  filter(!is.na(series),
         taxon %in% phy_2$tip.label) %>% 
  arrange(factor(taxon, levels = phy_2$tip.label))

# sample series level topologies
n_samp = 10
tree_samp <- list()
C_samp <- list()
# run loop
for (i in 1:n_samp) {
  samp <- dat_2 %>% group_by(series) %>% slice_sample(n=1) # sample one species from each series and keep only those tips to create series level tree
  phy.ser <- tree.ultra(phy, samp) # make ultrametric tree
  samp <- samp %>% arrange(factor(taxon, levels = phy.ser$tip.label))
  phy.ser$tip.label <- samp$series
  tree_samp[[i]] <- phy.ser
  C_samp[[i]] <- vcv.phylo(phy.ser, cor = T)
}

# plot sampled topology
plot(tree_samp[[1]], edge.width = 1, show.tip.label = F)

# plot species-level tree again to compare
plot(phy, edge.width = 1, show.tip.label = F)

# create list of C matrices and dfs for use with brm_multiple()
C_list <- list();for (i in 1:length(C_samp)){C_list[[i]] <- list(C = C_samp[[i]])}
dat_list <- lapply(1:length(C_samp), function(x) dat_2)

We can see that our sampled series-level trees capture the major structural form of the species-level tree, but show some variation in lower order topology and branch lengths. By fitting our model over this sample of trees and combining chains for inference, we are able to integrate over this phylogenetic uncertainty.

We can also check if we are able to include more data by assigning species missing from the original phylogeny to a tip (taxonomic series) in the new tree.

# we can include data on another 47 taxa (~10% increase) by using the series level tree
match_series <- euc_data %>% 
  filter(trait_name %in% traits,
         series %in% tree_samp[[1]]$tip.label) %>% 
  pull(taxon_sp) %>% unique()
length(match_sp) # 457
length(match_series) # 504

# adding another 700 observations (a further 8% increase in sample size)
euc_data %>% filter(taxon_sp %in% match_sp, trait_name %in% traits) %>% select(value) %>% na.omit() %>% nrow()
euc_data %>% filter(taxon_sp %in% match_series, trait_name %in% traits) %>% select(value) %>% na.omit() %>% nrow()

Reframe data

Now we have cleaned the data and matched the nomenclature it’s time to get our data into a more useful format. Specifically, we need to re-frame our data from long to wide format to facilitate plotting and model fitting. We can use the pivot_wider() function from tidyr to do this.

# Currently, each row represents a single observation (observation_id) on a single trait (trait_name) 
# for a single taxon (taxon_name) from a single study (dataset_id). This is called long format.
euc_data %>% head()

# To analyse these data, we need a separate column for each trait, with each row representing a single
# observation (or sampling event) within a study, i.e., wide format.
euc_wide <- euc_data %>% 
  pivot_wider(names_from = trait_name, values_from = value) %>% 
  mutate(across(all_of(traits), as.numeric))

# then subset the data to those species in the phylogeny, setting taxon = taxon_sp (subsp epithet dropped)
euc_wide <- euc_wide %>% mutate(taxon = taxon_sp) %>% select(-c(taxon_sp, taxon_name)) #%>% filter(series %in% match_series)

We will also prune the species level-tree to contain only those species in the data (n = 457)

# prune species level-tree to data
# phy <- keep.tip(phy, euc_wide %>% filter(taxon %in% phy$tip.label) %>% pull(taxon) %>% unique)

We can now examine and summarise our new data table.

# we have a good number of observations for LMA across taxa, less so for N_mass and d13C
euc_wide %>% select(all_of(traits)) %>% summarise_all(~ sum((!is.na(.))))

# we can also calculate the number of observations for each trait for each taxon
species_coverage <- euc_wide %>% filter(taxon %in% phy$tip.label) %>% group_by(taxon) %>% summarise(across(all_of(traits),~ sum(!is.na(.))))
series_coverage <- euc_wide %>% filter(series %in% tree_samp[[1]]$tip.label) %>% group_by(series) %>% summarise(across(all_of(traits),~ sum(!is.na(.))))

# print
species_coverage
series_coverage

# We have many missing values (e.g., ~79% of observations do not report values for d13C)
euc_wide %>% select(all_of(traits)) %>% summarise_all(~ sum((!is.na(.)))) %>% `/`(nrow(euc_wide)) %>% round(3)

# However, the proportion of taxa in the species-level tree reporting at least one observation for each trait is actually much higher for Nmass and d13C
species_coverage %>% mutate(across(all_of(traits), ~ ifelse(.>0,1,0))) %>% select(-1) %>% colSums() %>% `/`(Ntip(phy)) %>% round(2)

# and, of course, even higher for the series level tree
series_coverage %>% mutate(across(all_of(traits), ~ ifelse(.>0,1,0))) %>% select(-1) %>% colSums() %>% `/`(Ntip(tree_samp[[1]])) %>% round(2)

As a final data preparation step, we will transform our leaf traits and rename our data table and variables for brevity.

## final clean
# - rename data object for brevity
# - log then z transform leaf traits and abbreviate names
dat <- euc_wide %>% 
  mutate(N_mass = scale(log(leaf_N_per_dry_mass))[,1],
         d13C = scale(-1*log(abs(leaf_delta13C)))[,1], # to aide interpretation, restore negative values with -1*
         LMA = scale(log(leaf_mass_per_area))[,1],
         replicates = as.numeric(replicates))

Data Summaries

To facilitate plots of species mean trait values and demonstrate a meta-analytic implementation of MR-PMM, we will create a summary data table by pooling all observations for each species across studies into a single mean value and associated standard error (each species represented by a single row). This procedure simplifies the structure of our data set considerably, but throws away a lot of information about different potential sources of variation (i.e., within-species, between-study). We will first fit the simplified meta-analytic version for the sake of exposition, then demonstrate a preferable approach of using additional random effects in a multilevel model to capture the cross-classified sampling design of observations.

## calculate species mean trait values
# as each observation reports the number of replicates, we will use a weighted mean and se to 
# properly account for variation in the uncertainty of trait values
dat_mean <- dat %>%
  group_by(taxon, series, subgenus) %>% 
  summarise(across(all_of(c("N_mass","d13C","LMA")), ~ .x %>% weighted.mean(w = replicates, na.rm = TRUE))) # weighted mean

# calculate weighted se (ideally achieved using map() within summarise above)
dat_se <- dat %>%
            group_by(taxon) %>% 
            select(taxon, N_mass,d13C,LMA,replicates) %>%
            mutate(replicates = as.numeric(replicates))
taxa <- dat_se %>% pull(taxon) %>% unique()
result <- tibble(taxon=taxa, N_mass.se=NA, d13C.se=NA,LMA.se=NA)
for (i in 1:length(taxa)){
  result[i,2] <- dat_se %>%
                    select(taxon,N_mass,replicates) %>%
                    na.omit() %>% 
                    filter(any(n()>1)) %>% 
                    filter(taxon == taxa[i]) %>% 
                    list(N_mass = .$N_mass, replicates = .$replicates) %>% 
                    { wtd.stderror(.$N_mass, w = .$replicates) }
  result[i,3] <- dat_se %>%
                    select(taxon,d13C,replicates) %>%
                    na.omit() %>%
                    filter(any(n()>1)) %>%
                    filter(taxon == taxa[i]) %>%
                    list(d13C = .$d13C, replicates = .$replicates) %>%
                      { wtd.stderror(.$d13C, w = .$replicates) }
  result[i,4] <- dat_se %>%
                    select(taxon,LMA,replicates) %>%
                    na.omit() %>%
                    filter(any(n()>1)) %>%
                    filter(taxon == taxa[i]) %>%
                    list(LMA = .$LMA, replicates = .$replicates) %>%
                    { wtd.stderror(.$LMA, w = .$replicates) }
}
result %>% head() # check result
dat_mean <- dat_mean %>% left_join(.,result) %>% ungroup # merge weighted se into data frame

# for observations with NA for se (i.e., species with a single obs for a given trait), assign 90th pctl. se for that trait
dat_mean[is.na(dat_mean$N_mass.se),]$N_mass.se <- quantile(dat_mean[!is.na(dat_mean$N_mass.se),]$N_mass.se, 0.9)
dat_mean[is.na(dat_mean$d13C.se),]$d13C.se <- quantile(dat_mean[!is.na(dat_mean$d13C.se),]$d13C.se, 0.9)
dat_mean[is.na(dat_mean$LMA.se),]$LMA.se <- quantile(dat_mean[!is.na(dat_mean$LMA.se),]$LMA.se, 0.9)

# compare the two data sets
dat %>% select(taxon, series, subgenus, dataset_id,  LMA, N_mass, d13C) %>% head() # rows contain individual obs, multiple rows per species and per study.
dat_mean %>% head() # single row containing species mean trait values

Plotting

Now we have summarised our data to species level, let’s make a few plots to visualize the phylogenetic distribution of trait values.

# filter to complete cases for plotting
dat_cc <- dat_mean %>% select(taxon, series, subgenus, N_mass, d13C, LMA) %>% 
  filter(taxon %in% phy$tip.label) %>% 
  drop_na() %>% 
  filter(subgenus %in% c("Eucalyptus", "Eudesmia", "Symphyomyrtus"),
         !(!is.na(LMA) & LMA > 3) & !(!is.na(N_mass) & N_mass < -4)) # remove outliers for plotting
sg_cols <- c("#E69F00","#009E73", "#56B4E9")
X <- dat_cc %>% ungroup() %>% select(N_mass, d13C, LMA) %>% as.matrix()
rownames(X) <- dat_cc %>% pull(taxon)
phy_cc <- keep.tip(phy, dat_cc %>% pull(taxon))

First some bivariate scatter plots to visualise pairwise covariance at the level of species phenotypes.

# scatter-plots
p1 <- dat_cc %>% ggplot(aes(N_mass, LMA, color = subgenus)) + geom_point(size=3) + 
  theme_classic() + theme(axis.text = element_text(size=15)) + guides(color="none") + scale_color_manual(values=sg_cols)
p2 <- dat_cc %>% ggplot(aes(d13C, LMA, color = subgenus)) + geom_point(size=3) +
  theme_classic() + theme(axis.text = element_text(size=15)) + guides(color="none") + scale_color_manual(values=sg_cols)
p3 <- dat_cc %>% ggplot(aes(N_mass, d13C, color = subgenus)) + geom_point(size=3) +
  theme_classic() + theme(axis.text = element_text(size=15)) + guides(color="none") + scale_color_manual(values=sg_cols)

ggarrange(p1,p2,p3,ncol = 1)

To get a feel for the phylogenetic distribution of trait values, it is useful to plot our trait data as a heatmap matrix against the tree. All three traits are strongly structured with respect to phylogeny (similar values among closely related species). Furthermore, these traits appear to

## plot data against tree
# group by operational taxonomic unit (OTU)
grp <- list(Eucalyptus = dat_cc[which(dat_cc$subgenus == 'Eucalyptus'), ]$taxon,
            Eudesmia = dat_cc[which(dat_cc$subgenus == 'Eudesmia'), ]$taxon,
            Symphyomyrtus = dat_cc[which(dat_cc$subgenus == 'Symphyomyrtus'), ]$taxon)
p <- ggtree(phy_cc, layout = 'rectangular') + 
  scale_color_manual(values=c("#56B4E9","orange", "#009E73","#56B4E9"))
p <- groupOTU(p, grp, 'subgenus') + aes(color=subgenus)

# heatmap
gheatmap(p, X, offset=-1, width=0.4, color=NA,
         colnames=T, colnames_position = "top", 
         colnames_offset_x = 0, colnames_offset_y = 5, colnames_angle = 0, 
         font.size = 2.5) +
  scale_fill_viridis_c(option="D", name="", na.value="white") +
  # guides(color="none") +
  ylim(-1, 370)

The same pattern is evident at the series level.

## Repeat for series level tree
# summarise by series
series_mean <- dat %>%
  filter(!is.na(series)) %>% 
  group_by(series, subgenus) %>% 
  summarise(across(all_of(c("N_mass","d13C","LMA")), ~ .x %>% 
                   weighted.mean(w = replicates, na.rm = TRUE)))
# prep trait data frame
dat_cc_ser <- series_mean %>%
  drop_na() %>% 
  filter(subgenus %in% c("Eucalyptus", "Eudesmia", "Symphyomyrtus"))
X <- dat_cc_ser %>% ungroup() %>% select(N_mass, d13C, LMA) %>% as.matrix()
rownames(X) <- dat_cc_ser %>% pull(series)
phy_cc <- keep.tip(tree_samp[[1]], dat_cc_ser %>% filter(series %in% tree_samp[[1]]$tip.label) %>% pull(series))

# group by operational taxonomic unit (OTU)
grp <- list(Eucalyptus = dat_cc_ser[which(dat_cc_ser$subgenus == 'Eucalyptus'), ]$series,
            Eudesmia = dat_cc_ser[which(dat_cc_ser$subgenus == 'Eudesmia'), ]$series,
            Symphyomyrtus = dat_cc_ser[which(dat_cc_ser$subgenus == 'Symphyomyrtus'), ]$series)
p <- ggtree(phy_cc, layout = 'rectangular') + 
  scale_color_manual(values=c("#56B4E9","orange", "#009E73","#56B4E9"))
p <- groupOTU(p, grp, 'subgenus') + aes(color=subgenus)

# heatmap
gheatmap(p, X, offset=-1, width=0.4, color=NA,
         colnames=T, colnames_position = "top", 
         colnames_offset_x = 0, colnames_offset_y = 1, colnames_angle = 0, 
         font.size = 2.5) +
  scale_fill_viridis_c(option="D", name="", na.value="white") +
  # guides(color="none") +
  ylim(-1, 81)

Models

We will now fit a series of models to these data to explore different approaches to phylogenetic comparative analyses and highlight some strengths of multi-response phylogenetic mixed models (MR-PMM).

Common univariate approaches (OLS, PGLS)

First, let’s see what conclusions we would come to by using traditional univariate approaches to examine the relationship between two of our chosen traits (N_mass and d13C): ordinary least squares regression (OLS), phylogenetic generalised least squares (PGLS), and finally PGLS with lambda (phylogenetic signal) estimated from the data (LAMBDA).

## OLS
fit_OLS <- dat_mean %>% select(N_mass, d13C) %>% na.omit %>% lm(d13C ~ N_mass, .)
fit_OLS %>% summary() # significant relationship but negligible R^2 (~0.01)
dat_mean %>% select(N_mass, d13C) %>% plot()
fit_OLS %>% abline(col="blue")

## PGLS
dat_PGLS <- dat_mean %>% select(taxon, N_mass, d13C) %>% na.omit() %>% as.data.frame();rownames(dat_PGLS) <- dat_PGLS$taxon
fit_PGLS <- dat_PGLS %>% phylolm(d13C ~ N_mass, ., phy, model = "BM"); fit_PGLS %>% summary() # lambda fixed to 1 = Brownian Motion
fit_LAMBDA <- dat_PGLS %>% phylolm(d13C ~ N_mass, ., phy, model = "lambda"); fit_LAMBDA %>% summary() # lambda estimated

# plot estimated slopes from all univariate models
# once we account for phylogeny the relationship disappears
dat_mean %>% select(N_mass, d13C) %>% plot()
fit_OLS %>% abline(col="blue");fit_PGLS %>% abline(col="red");fit_LAMBDA %>% abline(col="purple")
legend(-4.25, 2.5, legend=c("OLS", expression("PGLS"[lambda]), "PGLS"), fill = c("blue","purple","red"))

OLS reports a significant negative relationship between d13C and N_mass. However, this relationship is reversed (becomes positive) after accounting for phylogenetic signal in the response variable d13C, regardless of whether we use PGLS or LAMBDA. The explanation? By partitioning out the variance in d13C that is associated with phylogeny, we also partition out any covariance between d13C and N_mass that is associated with phylogeny. It has long been argued that accounting for phylogeny is necessary when evaluating inter-specific trait relationships. However, this approach risks overlooking functional relationships between traits that are phylogenetically structured due to, for example, phylogenetic niche conservatism. A preferable approach is to explicitly estimate both phylogenetic and non-phylogenetic correlations between traits using MR-PMM.

MR-PMM

Let’s try fitting a MR-PMM in which all three traits are specified as response variables and see whether a negative correlation between d13C and N_mass is indeed manifesting on the phylogenetic level (as suggested by PGLS above).

First we will fit a meta-analytic model to the species mean trait data, passing in the sampling variances for each observation via the the mev and resp_se arguments in MCMCglmmand brms respectively. Using MCMCglmm,

## MR-PMM

# number of response variables in model
n_resp = 3

# phylogenetic correlation matrix (here actually the inverse of the phylo covariance matrix)
C <- inverseA(phy %>% 
              keep.tip(dat_mean %>% filter(taxon %in% phy$tip.label) %>% pull(taxon)))$Ainv

# uninformative prior
p_mean <- list(R = list(V=diag(n_resp), nu=n_resp),
               G = list(G1=list(V=diag(n_resp), nu=n_resp)))

# subset data to tree
dat_mean_fit <- dat_mean %>% filter(taxon %in% phy$tip.label)

# fit model
fit_mean_mcmcglmm <- MCMCglmm(cbind(N_mass, d13C, LMA) ~ trait-1, # cbind() specifies multivariate response vector
                      mev      = c(dat_mean_fit$N_mass.se^2, dat_mean_fit$d13C.se^2, dat_mean_fit$LMA.se^2),
                      random   = ~us(trait):taxon, # ~us() specifies unstructured covariance matrix
                      rcov     = ~us(trait):units,
                      ginv     = list(taxon = C), # pass inv phylo cov mat
                      family   = c("gaussian","gaussian","gaussian"), # all Gaussian traits
                      nitt     = 55000,
                      burnin   = 5000,
                      thin     = 50,
                      data     = as.data.frame(dat_mean_fit),
                      prior    = p_mean,
                      verbose  = FALSE)
summary(fit_mean_mcmcglmm)

We now fit the same model using brms. Note that unlike MCMCglmm, brms does not impute missing values by default, we must instruct it to do so by including mi() in the model formula.

# BRMS
C <- vcv.phylo(phy, corr = T)

# model formulae
bf_y1 <- brmsformula(N_mass | mi() +  resp_se(N_mass.se, sigma = TRUE) ~ 1 + (1|b|gr(taxon, cov = C)))
bf_y2 <- brmsformula(d13C | mi() + resp_se(d13C.se, sigma = TRUE) ~ 1 + (1|b|gr(taxon, cov = C)))
bf_y3 <- brmsformula(LMA | mi() + resp_se(LMA.se, sigma = TRUE) ~ 1 + (1|b|gr(taxon, cov = C)))

fit_mean_brms <- brm(bf_y1 + bf_y2 + bf_y3 + set_rescor(TRUE),
              data    = dat_mean %>% filter(taxon %in% phy$tip.label), # filter to species featured in the tree
              family  = gaussian(), 
              data2   = list(C = C),
              control = list(adapt_delta = 0.95, max_treedepth = 12),
              cores   = 4, 
              chains  = 4)
saveRDS(fit_mean_brms, "fit_mean_brms.rds")

Importantly, because the Hamiltonian sampler is unable to take advantage of numerical solutions for the multivariate Gaussian case, fitting this model in brms is considerably slower than MCMCglmm.

# show runtime of each model

fit_mean_mcmcglmm_sys_time
rstan::get_elapsed_time(fit_mean_brms$fit) %>% apply(1,sum) %>% max

Modelling phylogenetic effects at the series level dramatically reduces fitting time. Furthermore, inferences are qualitatively unchanged in this case (likely to be data and phylogeny dependent).

# phylo vcv
C <- vcv.phylo(tree_samp[[1]], corr = T)

# formulae
bf_y1 <- brmsformula(N_mass | mi() + resp_se(N_mass.se, sigma = TRUE) ~ 1 + (1|b|gr(series, cov = C)))
bf_y2 <- brmsformula(d13C | mi() + resp_se(d13C.se, sigma = TRUE) ~ 1 + (1|b|gr(series, cov = C)))
bf_y3 <- brmsformula(LMA | mi() + resp_se(LMA.se, sigma = TRUE) ~ 1 + (1|b|gr(series, cov = C)))

# fit
fit_mean_series_brms <- brm(bf_y1 + bf_y2 + bf_y3 + set_rescor(TRUE),
              data    = dat_mean %>% filter(series %in% tree_samp[[1]]$tip.label), # filter to series featured in the tree
              family  = gaussian(), 
              data2   = list(C = C),
              control = list(adapt_delta = 0.95, max_treedepth = 12),
              cores   = 4, 
              chains  = 4)
# saveRDS(fit_mean_series_brms, "fit_mean_series_brms.rds")

rstan::get_elapsed_time(fit_mean_series_brms$fit) %>% apply(1,sum) %>% max

Multilevel model

So far we have fit models to a simplified data set of species mean trait values (one row per species). Pooling data in this way discards information; it is always preferable to perform analyses on the lowest possible level of observed data. To handle the complex cross-classified structure of records in our data (multiple observations on multiple taxa from multiple studies), we need only specify 2 additional random effects in our MR-PMM. The first is a non-phylogenetic between-species effect taxon, which captures the component of between-species (co)variance that is independent of phylogeny. The second is a random study effect dataset_id, which captures between-study variance. The residual for this model therefore absorbs within-species variances and measurement error.

# only keep rows with data for at least one of the leaf traits (filter out rows with data for none)
dat_ml <- dat %>%
  filter(!(is.na(dat$d13C) & is.na(dat$N_mass) & is.na(dat$LMA))) %>%
  select(id,dataset_id,taxon,series,subgenus,d13C,N_mass,LMA,replicates) %>% 
  mutate(phylo = taxon,
         subgenus = as.factor(ifelse(subgenus != "Symphyomyrtus", "Other", subgenus))) # choose which level to model phylo effects

Now we can fit our multilevel model.

# specify prior
p_ml <- list(R = list(V=diag(n_resp),nu=n_resp),
             G = list(G1=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp)),
                      G2=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp)),
                      G3=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp))))

#-----------------------------------------------------------------------------------------#


## SPECIES LEVEL ANALYSIS

# inverse phylo covariance matrix
C <- inverseA(phy)$Ainv

# fit
fit_ml_mcmcglmm <- MCMCglmm(cbind(N_mass, d13C, LMA) ~ trait-1,
                     random   = ~us(trait):phylo + # between-species (phylo) 
                                 us(trait):taxon + # between-species (non-phylo)
                                 idh(trait):dataset_id, # between-study
                     rcov     = ~idh(trait):units, # residual error
                     ginv     = list(phylo = C),
                     family   = c("gaussian","gaussian","gaussian"),
                     nitt     = 55000,
                     burnin   = 5000,
                     thin     = 50,
                     data     = as.data.frame(dat_ml %>% 
                                              filter(phylo %in% dimnames(C)[[1]])),
                     prior    = p_ml,
                     verbose  = FALSE)
summary(fit_ml_mcmcglmm)
# saveRDS(fit_ml_mcmcglmm, "fit_ml_mcmcglmm.rds")


#-----------------------------------------------------------------------------------------#

## SERIES LEVEL ANALYSIS

# inverse phylogenetic covariance matrix
C <- inverseA(tree_samp[[1]])$Ainv

# fit
fit_ml_series_mcmcglmm <- MCMCglmm(cbind(N_mass, d13C, LMA) ~ trait-1,
                     random   = ~us(trait):phylo + # between-series (phylo) 
                                 us(trait):taxon + # between-species (non-phylo)
                                 idh(trait):dataset_id, # between-study
                     rcov     = ~idh(trait):units, # residual error
                     ginv     = list(phylo = C),
                     family   = c("gaussian","gaussian","gaussian"),
                     nitt     = 55000,
                     burnin   = 5000,
                     thin     = 50,
                     data     = as.data.frame(dat_ml %>% 
                                              mutate(phylo = series) %>% 
                                              filter(phylo %in% tree_samp[[1]]$tip.label)),
                     prior    = p_ml,
                     verbose  = FALSE)
summary(fit_ml_series_mcmcglmm)
# saveRDS(fit_ml_series_mcmcglmm, "fit_ml_series_mcmcglmm.rds")

Using brms, we can directly weight observations according to the number of replicates via the weights() argument in the model formulae.

## Species level

# phylo vcv
C <- vcv.phylo(phy, corr = T)

# formulae
bf_y1 <- brmsformula(N_mass | weights(replicates) + mi() ~ 1 + (1|a|gr(phylo, cov = C)) + (1|b|taxon) + (1|dataset_id))
bf_y2 <- brmsformula(d13C   | weights(replicates) + mi() ~ 1 + (1|a|gr(phylo, cov = C)) + (1|b|taxon) + (1|dataset_id))
bf_y3 <- brmsformula(LMA    | weights(replicates) + mi() ~ 1 + (1|a|gr(phylo, cov = C)) + (1|b|taxon) + (1|dataset_id))

# fit
fit_ml_brms <- brm(bf_y1 + bf_y2 + bf_y3 + set_rescor(FALSE),
              data    = dat_ml %>% mutate(phylo = taxon) %>% filter(taxon %in% phy$tip.label), 
              family  = gaussian(), 
              data2   = list(C = C),
              control = list(adapt_delta = 0.95, max_treedepth = 12),
              cores   = 4, 
              chains  = 4)
# saveRDS(fit_ml_brms, "fit_ml_brms.rds")


#----------------------------------------------------------------------------------------------#

## Series level

# phylo vcv
C <- vcv.phylo(tree_samp[[1]], corr = T)

# fit
fit_ml_series_brms <- brm(bf_y1 + bf_y2 + bf_y3 + set_rescor(FALSE),
              data    = dat_ml %>% mutate(phylo = series) %>% filter(series %in% tree_samp[[1]]$tip.label), 
              family  = gaussian(), 
              data2   = list(C = C),
              control = list(adapt_delta = 0.95, max_treedepth = 12),
              cores   = 4, 
              chains  = 4)
saveRDS(fit_ml_series_brms, "fit_ml_series_brms.rds")

Covariances by a grouping factor

While complex models of evolution are currently not implemented in MCMCglmm or brms, it is possible to fit separate trait covariance matrices for different levels of a grouping factor (e.g., clade). This may be desirable when there is a strong a priori hypothesis that trait correlations should differ between major phylogenetic clades, e.g., subgenus Symphyomyrtus vs all other Eucalyptus subgenera.

p_ml_at.level <- list(R = list(V=diag(n_resp),nu=n_resp),
                      G = list(G1=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp)),
                               G2=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp)),
                               G3=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp)),
                               G4=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp)),
                               G5=list(V=diag(n_resp), nu=n_resp, alpha.mu = rep(0,n_resp), alpha.V = diag(n_resp))))


fit_ml_mcmcglmm_at.level <- MCMCglmm(cbind(N_mass, d13C, LMA) ~ subgenus:trait-1,
                     random   = ~us(at.level(subgenus,'Symphyomyrtus'):trait):phylo +
                                 us(at.level(subgenus,'Other'):trait):phylo + 
                                 us(at.level(subgenus,'Symphyomyrtus'):trait):taxon +
                                 us(at.level(subgenus,'Other'):trait):taxon + 
                                 idh(trait):dataset_id, # between-study
                     rcov     = ~idh(trait):units, # residual error
                     ginv     = list(phylo = C),
                     family   = c("gaussian","gaussian","gaussian"),
                     nitt     = 55000,
                     burnin   = 5000,
                     thin     = 50,
                     data     = as.data.frame(dat_ml %>% 
                                              filter(phylo %in% dimnames(C)[[1]],
                                                     !is.na(subgenus))),
                     prior    = p_ml_at.level,
                     verbose  = FALSE)

summary(fit_ml_mcmcglmm_at.level)

Inference

So, what does our multilevel MR-PMM tell us about the relationships between these leaf traits in Eucalyptus? In the model summary for MCMCglmm, the intercepts for each trait are reported as location effects, the phylogenetic (co)variances under phylo and the residual (co)variances under units. We can also see that some parameters, particularly for d13C, have low ESS indicating some convergence issues.

fit_ml_mcmcglmm %>% summary

Multilevel Correlations

In order to make meaningful comparisons, we must calculate correlations (scaled covariance) between traits. We will do this using the custom function mrpmm_cor()(which takes a MCMCglmm model fit and calculates correlations between the response variables at the levels specified).

# select model
fit_ml <- fit_ml_mcmcglmm
phylo <- "phylo"


# extract samples from multilevel model fit
mod_res <- cbind(cbind(mrpmm_cor(fit_ml, "N_mass", "d13C", phylo, "taxon")[,3:4],
                       mrpmm_cor(fit_ml, "N_mass", "LMA", phylo, "taxon")[,3:4],
                       mrpmm_cor(fit_ml, "d13C",   "LMA", phylo, "taxon")[,3:4]))
names(mod_res) <- c("N_mass:d13C_phy","N_mass:d13C_ind",
                    "N_mass:LMA_phy","N_mass:LMA_ind",
                    "d13C:LMA_phy","d13C:LMA_ind")

#-----------------------------------------------------------------------------------------#

We will then use functions from the tidybayes package to plot the posteriors of our correlation estimates.

## CORRELATION PLOT

# set theme
theme_set(theme_classic())

# build plot
p1 <- mod_res %>%
  as_tibble %>% 
  mutate(.chain = 1, .iteration = 1:nrow(fit_ml$Sol), .draw = 1:nrow(fit_ml$Sol)) %>% 
  pivot_longer(1:6, names_to = "parameter", values_to = "estimate") %>%
  mutate(level = ifelse(str_detect(parameter, "_phy"), "phylogenetic","non-phylogenetic"),
         parameter = str_replace(parameter, "_phy", ""),
         parameter = str_replace(parameter, "_ind", "")) %>% 
  ggplot(aes(x = estimate, 
             y = factor(parameter, levels=c("N_mass:LMA","d13C:LMA","N_mass:d13C")),
             col = factor(level, levels = c("phylogenetic", "non-phylogenetic")), 
             fill = factor(level, levels = c("phylogenetic", "non-phylogenetic")))) +
  stat_halfeye(position = position_dodge(width = 0.4), 
               slab_alpha = 0.5, 
               fatten_point = 3) +
  geom_vline(xintercept=0, linetype="dashed") +
    theme(axis.text = element_text(size=18),
        axis.title.x = element_text(size=20),
        legend.text = element_text(size=18),
        legend.title = element_text(size=20),
        plot.margin = margin(0.2,0.2,0.2,0.2, "cm")) +
  xlim(-1,1) +
  ylab("") + xlab("") + guides(color="none",fill=guide_legend(title="level")) +
  scale_color_manual(values=c("#00447c", "#c1549c")) +
  scale_fill_manual(values=c( "#00447c", "#c1549c")) +
  scale_y_discrete(labels = c("N_mass:d13C" = bquote(cor(N[mass]*","~delta^13*C)),
                              "N_mass:LMA" = bquote(cor("LMA"*","~N[mass])),
                              "d13C:LMA" = bquote(cor("LMA"*","~delta^13*C))))
p1  

# ggsave("Figure_4B.tiff", dpi = 900, width = 30, height = 18, units = c("cm"))

The model reports a positive phylogenetic correlation between LMA and \(\delta\prescript{13}{}{\mathbf{C}}\), and negative phylogenetic correlations between these traits and \(\mathbf{N}_{\mathrm{mass}}\), indicating conserved co-evolutionary relationships. Between-species correlations with LMA show a consistent direction across both phylogenetic and non-phylogenetic levels. In contrast, the between-species correlation for d13C and N_mass is negative on the photogenic level and positive on the taxon (non-phylogenetic) level. This situation highlights an important strength of MR-PMM; the capacity to disentangle trait relationships operating at different levels in the model hierarchy.

Variance Decomposition

We can also examine the variance decomposition of each trait to better understand how different sources of variation contribute to overall trait variances. First we calculate the total posterior variance for each trait by adding variance components, then decompose this variance into the proportion of variance explained by each component.

# calculate total variance for each trait
LMA_var <- fit_ml$VCV[,"traitLMA:traitLMA.phylo"] + 
           fit_ml$VCV[,"traitLMA:traitLMA.taxon"] +
           fit_ml$VCV[,"traitLMA.dataset_id"] +
           fit_ml$VCV[,"traitLMA.units"]
N_mass_var <- fit_ml$VCV[,"traitN_mass:traitN_mass.phylo"] + 
           fit_ml$VCV[,"traitN_mass:traitN_mass.taxon"] +
           fit_ml$VCV[,"traitN_mass.dataset_id"] +
           fit_ml$VCV[,"traitN_mass.units"]
d13C_var <- fit_ml$VCV[,"traitd13C:traitd13C.phylo"] + 
           fit_ml$VCV[,"traitd13C:traitd13C.taxon"] +
           fit_ml$VCV[,"traitd13C.dataset_id"] +
           fit_ml$VCV[,"traitd13C.units"]

# calculate the proportion of variance explained by each level across traits 
data_stack <- data.frame(
    trait = rep(c("LMA","N_mass","d13C"), each=4),
    level = rep(c("phylogenetic","non-phylogenetic","study","residual"),3),
    post_mode = c(mean(fit_ml$VCV[,"traitLMA:traitLMA.phylo"]/LMA_var),
                  mean(fit_ml$VCV[,"traitLMA:traitLMA.taxon"]/LMA_var),
                  mean(fit_ml$VCV[,"traitLMA.dataset_id"]/LMA_var),
                  mean(fit_ml$VCV[,"traitLMA.units"]/LMA_var),
                  mean(fit_ml$VCV[,"traitN_mass:traitN_mass.phylo"]/N_mass_var),
                  mean(fit_ml$VCV[,"traitN_mass:traitN_mass.taxon"]/N_mass_var),
                  mean(fit_ml$VCV[,"traitN_mass.dataset_id"]/N_mass_var),
                  mean(fit_ml$VCV[,"traitN_mass.units"]/N_mass_var),
                  mean(fit_ml$VCV[,"traitd13C:traitd13C.phylo"]/d13C_var),
                  mean(fit_ml$VCV[,"traitd13C:traitd13C.taxon"]/d13C_var),
                  mean(fit_ml$VCV[,"traitd13C.dataset_id"]/d13C_var),
                  mean(fit_ml$VCV[,"traitd13C.units"]/d13C_var))
)

We can then build these estimates into a stacked bargraph,

## VARIANCE DECOMPOSITION PLOT

# stacked bar graph
p2 <- ggplot(data_stack %>% 
               mutate(post_mode = round(post_mode,2)),
             aes(x = factor(trait, levels=c("LMA","N_mass","d13C")),
                 y = post_mode, 
                 fill = factor(level, 
                               levels=rev(c("phylogenetic","non-phylogenetic","study","residual"))),
                 label = post_mode)) + 
    geom_bar(position="fill", stat="identity") +
    geom_text(col="white", size = 5, position = position_stack(vjust = 0.5)) +
    scale_fill_manual(values=rev(c("#00447c","#c1549c","#fb836f","#FFA600"))) +
    guides(fill=guide_legend("")) +
    theme(axis.text = element_text(size=18),
          axis.title.y = element_text(size=20),
          legend.text= element_text(size=18),
          plot.margin = margin(0.2,1,0.2,0.2, "cm")) +
    ylab("proportion of variance") + xlab("") + 
    scale_x_discrete(labels = c("N_mass" = bquote(N[mass]),
                              "d13C" = bquote(delta^13*C)))
p2
# ggsave("Figure_4A.tiff", dpi = 900, width = 24, height = 18, units = c("cm"))

and plot together as a panel.

# plot together in panel
ggarrange(p2, p1, widths = c(0.4,0.6), common.legend = T)
# ggsave("Figure_4AB.tiff", dpi = 900, width = 30, height = 15, units = c("cm"))

Partial Correlations

Finally, we can use the fitted covariance matrices to derive partial correlations.

mod <- fit_ml_mcmcglmm
phylo <- "phylo"

result_list <- cbind(mod$Sol, mod$VCV) %>% as_tibble %>% mutate(tree = 1) %>% relocate(tree)

# combine
res <- result_list %>% bind_rows() %>% as.matrix() %>% as_tibble()

# extract VCV components
vcv <- res %>% 
  select(contains(":"))# select VCV components

# # r hat diagnostic (rhat on combined posterior not great, but for any given tree rhat is good)
chains = vcv %>% as.data.frame()
chains %>% map_dbl(rstan::Rhat) %>% {.[.>1.05]}

# calculate correlations  
n_resp = 3
cor_list <- level_cor_vcv(vcv, n_resp, part.1 = phylo, part.2 = "taxon", lambda = NULL)
level_cor <- gsub("_cor", "", names(cor_list$posteriors)) # different levels correlations are estimated at

We can use the igraph package to plot pairwise correlations as a network.

## NETWORK GRAPH

# choose level to calculate result
level_cor
level <- "par_phy"

# calculate
cor <- paste0(level,"_cor")
mat <- paste0(level,"_mat")
sig.95 <- cor_list$posteriors[[cor]] %>% summarise_all(~ quantile(., prob=c(0.025,0.975)) %>% {. > 0} %>% table %>% length == 1) %>% t() %>% as_tibble %>% mutate(sig.95 = V1, edge = 1:((n_resp^2-n_resp)/2)) %>% select(-V1)
sig.80 <- cor_list$posteriors[[cor]] %>% summarise_all(~ quantile(., prob=c(0.1,0.90)) %>% {. > 0} %>% table %>% length == 1) %>% t() %>% as_tibble %>% mutate(sig.80 = V1, edge = 1:((n_resp^2-n_resp)/2)) %>% select(-V1)
sig <- left_join(sig.95, sig.80, by = "edge") %>% mutate(mean = cor_list$posteriors[[cor]] %>% summarise_all(~ mean(.)) %>% t %>% as.vector)
sig <- sig %>% mutate(col = ifelse(sig.80 == T & sig.95 != T & mean > 0, "lightblue",
                            ifelse(sig.80 == T & sig.95 == T & mean > 0, "blue",
                            ifelse(sig.80 == T & sig.95 != T & mean < 0, "pink", "red"))),
                     lty = ifelse(sig.80 == T & sig.95 != T & mean > 0, "dotted",
                            ifelse(sig.80 == T & sig.95 == T & mean > 0, "solid",
                            ifelse(sig.80 == T & sig.95 != T & mean < 0, "dotted", "solid"))))
cor_mat <- cor_list$matrices[[mat]]
dimnames(cor_mat)[[1]] <- c("N mass","leaf\nd13C","LMA")
dimnames(cor_mat)[[2]] <- c("N mass","leaf\nd13C","LMA")
diag(cor_mat) <- 0
g <- igraph::graph_from_adjacency_matrix (cor_mat, weighted=TRUE, mode="upper")
igraph::E(g)$weight <- abs(t(cor_mat[lower.tri(t(cor_mat))]))
# g_sig <- delete.edges(g, igraph::E(g)[sig %>% filter(sig.80 == F) %>% pull(edge)]) # delete no-sig edges
# plot(g_sig,
#      layout = layout.circle,
#      vertex.size=32,
#      edge.width=(abs(igraph::E(g)$weight)^1.5)*30,
#      edge.color=ifelse((cor_mat > 0)[lower.tri(cor_mat)][sig %>% filter(sig.80 == T) %>% pull(edge)], "blue","red"),
#      edge.lty=sig %>% filter(sig.80 == T) %>% pull(lty))


plot(g,
     layout = layout.circle,
     vertex.size=32,
     edge.width=(abs(igraph::E(g)$weight)^2)*20,
     edge.color=ifelse((cor_mat > 0)[lower.tri(cor_mat)][sig %>% filter(sig.80 == T) %>% pull(edge)], "blue","red"))


cor_list$posteriors[["par_phy_cor"]] %>% map(mean)
cor_list$posteriors[["par_ind_cor"]] %>% map(quantile)
cor_list$posteriors[["par_ind_cor"]] %>% mcmc_areas()
cor_list$posteriors[["par_ind_cor"]] %>% mcmc_intervals(prob_outer = 0.95)

#-----------------------------------------------------------------------#

Priors

Regularising correlation estimates

Because Brownian Motion is a noisy process, estimates of phylogenetic correlations are generally associated with high uncertainty. One way to improve precision is to impose a regularizing prior which shrinks estimates toward zero. When applied to correlation coefficients, this has the effect of focusing our inference on the strongest trait relationships in our data.

## Series level

# phylo vcv
C <- vcv.phylo(phy, corr = T)

# get prior for model
get_prior(formula =  bf_y1 + bf_y2 + bf_y3 + set_rescor(FALSE), 
          data = dat_ml,
          data2 = list(C = C))

# set higher eta value for LKJ prior on phylo correlations
p_lkj <- prior(lkj(3), class = cor, group = phylo)

# fit
fit_ml_lkj_brms <- brm(bf_y1 + bf_y2 + bf_y3 + set_rescor(FALSE),
              data    = dat_ml %>% mutate(phylo = taxon) %>% filter(taxon %in% phy$tip.label), 
              family  = gaussian(), 
              data2   = list(C = C),
              prior   = p_lkj,
              control = list(adapt_delta = 0.95, max_treedepth = 12),
              cores   = 4, 
              chains  = 4)
# saveRDS(fit_ml_lkj_brms, "fit_ml_lkj_brms.rds")

#-------------------------------------------------------------------------------------------#


## Series level

# phylo vcv
C <- vcv.phylo(tree_samp[[1]], corr = T)

# model formulae
bf_y1 <- brmsformula(N_mass | weights(replicates) + mi() ~ 1 + (1|a|gr(phylo, cov = C)) + (1|b|taxon) + (1|dataset_id))
bf_y2 <- brmsformula(d13C   | weights(replicates) + mi() ~ 1 + (1|a|gr(phylo, cov = C)) + (1|b|taxon) + (1|dataset_id))
bf_y3 <- brmsformula(LMA    | weights(replicates) + mi() ~ 1 + (1|a|gr(phylo, cov = C)) + (1|b|taxon) + (1|dataset_id))

# fit
fit_ml_series_lkj_brms <- brm(bf_y1 + bf_y2 + bf_y3 + set_rescor(FALSE),
              data    = dat_ml %>% mutate(phylo = series) %>% filter(series %in% tree_samp[[1]]$tip.label), 
              family  = gaussian(), 
              data2   = list(C = C),
              prior   = p_lkj,
              control = list(adapt_delta = 0.95, max_treedepth = 12),
              cores   = 4, 
              chains  = 4)
# saveRDS(fit_ml_series_lkj_brms, "fit_ml_series_lkj_brms.rds")

Model Diagnostics and Validation

We explore two approaches to model validation: 1) we assessed within-sample predictive capacity using posterior predictive checks; and 2) we assessed out-of-sample predictive capacity using LOO-CV.

First, we choose a brms model fit and define the parameter estimates we wish to validate.

# choose model
fit <- fit_ml_brms

# relevant model parameters (exclude random effects and log-probability)
pars <- fit %>% as_tibble %>%
  select(-starts_with("r_"),-starts_with("lp"),-starts_with("Intercept")) %>% names;pars
r_pars <- fit %>% as_tibble %>% select(starts_with("r_")) %>% names; r_pars

Convergence Diagnostics

Diagnostics for models fit in brms are easily obtained via dependent functions. In this case, rhat values indicate convergence issues for some study effects (especially for d13C), and a bulk ESS plot shows that some parameters have low ESS (though inspection of the ess tibble shows these are mostly for data).

# rhat diagnostics
rhat <- rhat(fit, pars = r_pars)
rhat[rhat > 1.1] # convergence issues for study effects (especially for d13C)

# HMC diagnostics
rstan::check_hmc_diagnostics(fit$fit)

# Effective Sample Size (ESS)
ess <-
  fit %>%
  as_tibble %>%
  select(all_of(pars)) %>%
  map_dbl(ess_bulk) # map ESS for each parameter

tibble(ess = ess, par = names(ess)) %>%
  ggplot(aes(y = par, x = ess))+
  geom_linerange(aes(xmin = 0, xmax = ess)) +
  geom_point() +
  geom_vline(xintercept = 400, col = "grey30", linetype = "dashed") +
  theme(axis.text.y = element_blank()) +
  labs(y = NULL, title = "Bulk effective sample size (ess)")

ess[ess < 100]

Validation

For the first approach to model validation, posterior predictive checks verify the capacity of the model to generate plausible data; the proportion of data included in the nominal predictive intervals is generally close to expectation for each method and response trait.

## posterior predictive checks

# prediction conditional on species-level random-intercept estimates
ppA1 <- pp_check(fit, resp= "LMA", type = "intervals") + labs(subtitle = "Response: LMA")
ppA2 <- pp_check(fit, resp= "Nmass", type = "intervals") + labs(subtitle = "Response: Nmass")
ppA3 <- pp_check(fit, resp= "d13C", type = "intervals") + labs(subtitle = "Response: d13C")

# arrange by predictive mean of the conditional model
ppA1$data <- ppA1$data %>% arrange(match(y_id, ppA1$data %>% arrange(m) %>% pull(y_id)))
ppA2$data <- ppA2$data %>% arrange(match(y_id, ppA2$data %>% arrange(m) %>% pull(y_id)))
ppA3$data <- ppA3$data %>% arrange(match(y_id, ppA3$data %>% arrange(m) %>% pull(y_id)))

# calculate coverage
covA1 <- round(get_coverage(ppA1$data),1)
covA2 <- round(get_coverage(ppA2$data),1)
covA3 <- round(get_coverage(ppA3$data),1)

# plot
ggarrange(plot_pred(ppA1$data %>% select(6:10), ppA1$data$y_obs, cov = covA1, "LMA", subtitle = F) ,
          plot_pred(ppA2$data %>% select(6:10), ppA2$data$y_obs, cov = covA2, "Nmass", subtitle = F),
          plot_pred(ppA3$data %>% select(6:10), ppA3$data$y_obs, cov = covA3, "d13C", subtitle = F),
          ncol = 1)

We can perform residual analysis by building a qq plot

## Residual analysis

pe1 <- predictive_error(fit, resp = "LMA") %>% as.data.frame()
pe2 <- predictive_error(fit, resp = "Nmass") %>% as.data.frame()
pe3 <- predictive_error(fit, resp = "d13C") %>% as.data.frame()

lbs <- labs(x = "Theoretical Quantiles",
            y = "Sample Quantiles")

# qq pnorm plots
ggarrange(
  tibble(y = pe1 %>% map_dbl(mean)) %>%
  ggplot(aes(sample = y)) +
  stat_qq(alpha = 0.5) + stat_qq_line() +
    lbs + labs(title= "LMA"),
  tibble(y = pe2 %>% map_dbl(mean)) %>%
  ggplot(aes(sample = y)) +
  stat_qq(alpha = 0.5) + stat_qq_line() +
    lbs + labs(title= "Nmass"),
tibble(y = pe3 %>% map_dbl(mean)) %>%
  ggplot(aes(sample = y)) +
  stat_qq(alpha = 0.5) + stat_qq_line() +
  lbs + labs(title= "d13C"),
nrow = 1)

For LOO-CV, we find the approximate method (Bürkner, Gabry, and Vehtari 2021) fails for a small proportion of the data (especially for LMA), necessitating manual re-fits of the model. The computational demand of re-fitting such a large model for more than a few troublesome test data points is probably not feasible for the average desktop workstation. However, as HPC becomes more accessible to the average researcher, manual re-fitting procedures may represent a viable option to LOO-CV of MR-PMM.

## LOO-CV predictions

# choose model
fit <- fit_mean_series_brms

fit_loo <- loo(fit, cores = 8)

# loo predict
loo_pred.LMA <- loo_predict(fit, type = "quantile",
                             probs = c(0.025,0.25,0.5,0.75,0.975), resp = "LMA") %>%
                             t %>% `colnames<-`(c("ll","l","m","h","hh"))
loo_pred.N_mass <- loo_predict(fit, type = "quantile",
                             probs = c(0.025,0.25,0.5,0.75,0.975), resp = "Nmass") %>%
                             t %>% `colnames<-`(c("ll","l","m","h","hh"))
loo_pred.d13C <- loo_predict(fit, type = "quantile",
                             probs = c(0.025,0.25,0.5,0.75,0.975), resp = "d13C") %>%
                             t %>% `colnames<-`(c("ll","l","m","h","hh"))

# plot
loo.plot.LMA <-
  loo_pred.LMA %>%
  as_tibble %>%
  mutate(y_obs = fit$data$LMA) %>%
  arrange(m) %>%
  mutate(n = 1:n()) %>%
  ggplot(aes(n)) +
  geom_linerange(aes(ymin = ll, ymax = hh), alpha = 0.4) +
  geom_linerange(aes(ymin = l, ymax = h), size = 1, alpha = 0.4)+
  geom_line(aes(y = m), size = 0.5) +
  geom_point(aes(y = y_obs), col = blues9[9], size =1.5) +
  labs(y = NULL, x = NULL,
       title = "LOO-predictive intervals: LMA")

loo.plot.Nmass <-
  loo_pred.Nmass %>%
  as_tibble %>%
  mutate(y_obs = fit$data$N_mass) %>%
  arrange(m) %>%
  mutate(n = 1:n()) %>%
  ggplot(aes(n)) +
  geom_linerange(aes(ymin = ll, ymax = hh), alpha = 0.4) +
  geom_linerange(aes(ymin = l, ymax = h), size = 1, alpha = 0.4)+
  geom_line(aes(y = m), size = 0.5) +
  geom_point(aes(y = y_obs), col = blues9[9], size =1.5) +
  labs(y = NULL, x = NULL, title = "LOO-predictive intervals: Nmass")

loo.plot.d13C <-
  loo_pred.d13C %>%
  as_tibble %>%
  mutate(y_obs = fit$data$d13C) %>%
  arrange(m) %>%
  mutate(n = 1:n()) %>%
  ggplot(aes(n)) +
  geom_linerange(aes(ymin = ll, ymax = hh), alpha = 0.4) +
  geom_linerange(aes(ymin = l, ymax = h), size = 1, alpha = 0.4)+
  geom_line(aes(y = m), size = 0.5) +
  geom_point(aes(y = y_obs), col = blues9[9], size =1.5) +
  labs(y = NULL, x = NULL, title = "LOO-predictive intervals: d13C")

ggarrange(loo.plot.y1,loo.plot.y2, ncol =1)

References

Bürkner, Paul-Christian, Jonah Gabry, and Aki Vehtari. 2021. “Efficient leave-one-out cross-validation for Bayesian non-factorized normal and Student-t models.” Computational Statistics 36 (2): 1243–61. https://doi.org/10.1007/s00180-020-01045-4.

Falster, Daniel, Rachael Gallagher, Elizabeth H Wenk, Ian J Wright, Dony Indiarto, Samuel C Andrew, Caitlan Baxter, et al. 2021. “AusTraits, a Curated Plant Trait Database for the Australian Flora.” Scientific Data 8 (1): 1–20.

Thornhill, Andrew H, Michael D Crisp, Carsten Külheim, Kristy E Lam, Leigh A Nelson, David K Yeates, and Joseph T Miller. 2019. “A Dated Molecular Perspective of Eucalypt Taxonomy, Evolution and Diversification.” Australian Systematic Botany 32 (1): 29–48.

Example MR-PMM - Leaf Traits in Eucalyptus

Ben Halliwell

2024