BiotaPhy Metacommunity Phylogenetic Analysis (MCPA)

What is the analysis?

The metacommunity phylogenetic analysis is a “method that aims to evaluate the interaction between phylogenetic structure, historical biogeographic events and environmental filtering in driving species distributions in a large-scale metacommunity”.

What are the inputs?

The MCPA process requires four inputs: an incidence matrix, a phylogenetic tree, environmental data and biogeographic hypotheses. All of the inputs should be encoded as matrices.

Each row in the incidence matrix represents a site and each column represents a species. The value of each cell should be one if the species is found at that site and zero if it is not.
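As a minimal sketch of this layout, the following Python builds a small site-by-species matrix from presence records. The site names, species names, records and the build_incidence_matrix helper are all hypothetical and are not part of BiotaPhy.

```python
import numpy as np

# Hypothetical labels and presence records, for illustration only.
sites = ["site_1", "site_2", "site_3"]
species = ["sp_a", "sp_b", "sp_c", "sp_d"]
presences = [
    ("site_1", "sp_a"), ("site_1", "sp_c"),
    ("site_2", "sp_b"),
    ("site_3", "sp_a"), ("site_3", "sp_d"),
]


def build_incidence_matrix(sites, species, presences):
    """Return a sites x species matrix of ones (present) and zeros (absent)."""
    site_index = {name: i for i, name in enumerate(sites)}
    species_index = {name: j for j, name in enumerate(species)}
    matrix = np.zeros((len(sites), len(species)), dtype=int)
    for site, sp in presences:
        matrix[site_index[site], species_index[sp]] = 1
    return matrix


incidence = build_incidence_matrix(sites, species, presences)
print(incidence)
```

Rows follow the order of the sites list, so the same ordering can be reused for the other site-based matrices.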

The phylogenetic tree must be binary and ultrametric. It is encoded as a matrix with rows representing the tips of the tree and columns representing the internal nodes. For each node (column), each tip that occurs in that node's clade is encoded with a value relative to its branch length. The tips in one of the node's two sister clades are encoded with positive values and the tips in the other with negative values (which is which is chosen arbitrarily), and the values in each column should sum to zero. Tips outside the clade are encoded as zero.
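The shape of this encoding can be illustrated with a simplified Python sketch. It assumes a tree without branch lengths, written as nested two-element tuples, and it weights tips equally within each side of a node (a simplification chosen here, not the branch-length weighting described above). The function names are hypothetical.

```python
import numpy as np


def tips_of(clade):
    """Return the tip names under a clade (a tip name or a 2-tuple of clades)."""
    if isinstance(clade, str):
        return [clade]
    left, right = clade  # raises ValueError if the node is not binary
    return tips_of(left) + tips_of(right)


def encode_tree(tree):
    """Return (tip_names, matrix) with one row per tip, one column per internal node.

    Simplifying assumption: tips on one side of a node share +1 equally and
    tips on the other side share -1 equally, so each column sums to zero.
    """
    tip_names = tips_of(tree)
    row = {name: i for i, name in enumerate(tip_names)}
    columns = []

    def visit(clade):
        if isinstance(clade, str):
            return
        left, right = clade
        column = np.zeros(len(tip_names))
        left_tips, right_tips = tips_of(left), tips_of(right)
        for tip in left_tips:
            column[row[tip]] = 1.0 / len(left_tips)
        for tip in right_tips:
            column[row[tip]] = -1.0 / len(right_tips)
        columns.append(column)
        visit(left)
        visit(right)

    visit(tree)
    return tip_names, np.column_stack(columns)


# Hypothetical five-tip tree.
tree = (("A", "B"), ("C", ("D", "E")))
tip_names, encoded = encode_tree(tree)
print(tip_names)
print(encoded.round(3))
```

A binary tree with five tips has four internal nodes, so the result here has five rows and four columns, and tips outside a node's clade stay at zero.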

The environmental data matrix should be encoded as a matrix of site rows and environmental variable columns. The sites in this matrix should match the sites in the incidence matrix. The values of each cell should be the value of the variable at that particular site.

The biogeographic hypotheses should be encoded as a matrix with rows representing sites and columns representing hypotheses. Each hypothesis should be encoded as a Helmert contrast and should be ternary, taking positive, negative or zero values. Each hypothesis has two sides (for example, one island or the other). Arbitrarily select one side to be encoded with positive values and the other with negative values, then set the value of each cell according to the side that the site falls within. If the site does not fall within either side, the value should be set to zero.
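A minimal Python sketch of this ternary coding follows. It assumes each site has already been assigned to side "A", side "B" or neither (None), and it uses plain +1 / -1 / 0 codes; the side labels, hypothesis names and the encode_hypothesis helper are hypothetical.

```python
import numpy as np


def encode_hypothesis(site_sides):
    """Encode one hypothesis: +1 for side "A", -1 for side "B", 0 for neither.

    Which side is positive is arbitrary, as long as it is applied consistently.
    """
    codes = {"A": 1, "B": -1, None: 0}
    return np.array([codes[side] for side in site_sides], dtype=int)


# Hypothetical side assignments for four sites under two hypotheses.
hyp_islands = encode_hypothesis(["A", "A", "B", None])
hyp_river = encode_hypothesis(["B", "A", None, "A"])

# Rows are sites, columns are hypotheses.
biogeo = np.column_stack([hyp_islands, hyp_river])
print(biogeo)
```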

How does BiotaPhy create these inputs?

BiotaPhy uses presence-absence matrices (PAMs) for the incidence matrix in the MCPA computation. These PAMs are created by generating a species distribution model for each of the taxa to be included. Each model is limited by masking the potential distribution region to the convex hull around the observed occurrence data, intersected with the ecoregions that those points fall within. Without this step, the generated PAM would reflect only the environmental data, because those data were used to generate the models.

Phylogenetic trees are uploaded (or retrieved from a service) as NEXUS files. We ensure that the tree is binary by resolving any polytomies that may exist within it. If the tree has branch lengths, it must be ultrametric. If it does not, the structure of the tree itself can be used to encode it.

The environmental data matrix is generated by intersecting the environmental data with the shapegrid used to generate the PAM. The value for each site is a weighted average of all of the raster cells that fall within that site.
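The weighted average for a single site can be sketched in Python as follows. The weighting scheme is an assumption made here for illustration (the fraction of the site covered by each raster cell), and site_value is a hypothetical helper, not a BiotaPhy function.

```python
import numpy as np


def site_value(raster_values, overlap_weights):
    """Weighted average of the raster cells that fall within one site.

    Assumption: each weight is the share of the site covered by that raster
    cell. Cells without data (NaN) are ignored.
    """
    values = np.asarray(raster_values, dtype=float)
    weights = np.asarray(overlap_weights, dtype=float)
    valid = ~np.isnan(values)
    if not valid.any() or weights[valid].sum() == 0:
        return float("nan")
    return float(np.average(values[valid], weights=weights[valid]))


# Hypothetical site overlapped by three raster cells.
mean_temp = site_value([10.0, 20.0, 30.0], [0.5, 0.25, 0.25])
print(mean_temp)  # 17.5
```

Repeating this for every site and every environmental layer fills the site-by-variable matrix.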

The biogeographic hypotheses matrix is generated from a set of hypothesis shapefiles. For each shapefile, an event field can be specified indicating what features should be considered to be part of the same hypothesis. If there is no event field specified, each feature is treated as its own hypothesis. If there are only two, then they are taken as opposing sides of a hypothesis. If there are more than two, they are split into separate hypotheses.


Performing an MCPA computation can require substantial computing power, especially for larger matrices. The computation generates a large number (10,000) of semi-partial correlation matrices from permutations of the incidence data to assess the impact of each environmental variable and biogeographic hypothesis and to determine which values are significant. To improve performance, these operations are split into chunks of N runs (default 100) that are then spread across the processing nodes and compute cores of a cluster. The results of these computations are then aggregated and summarized before being returned to the user.
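The chunking step can be sketched in Python as below. The function name and the handling of a final partial chunk are assumptions; how chunks are actually dispatched to the cluster is not shown.

```python
def chunk_permutations(total_runs=10_000, chunk_size=100):
    """Split a number of permutation runs into chunk sizes of at most chunk_size."""
    if total_runs < 0 or chunk_size <= 0:
        raise ValueError("total_runs must be >= 0 and chunk_size must be > 0")
    full_chunks, remainder = divmod(total_runs, chunk_size)
    chunks = [chunk_size] * full_chunks
    if remainder:
        chunks.append(remainder)  # last, smaller chunk
    return chunks


chunks = chunk_permutations()
print(len(chunks), sum(chunks))  # 100 10000
```

With the values given above, 10,000 runs in chunks of 100 give 100 independent work units, each of which could be run on a separate core before the results are combined.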

What are the outputs?

The outputs of the MCPA process take the form of a stack of node-by-predictor matrices, where “node” is an internal node of the phylogenetic tree and “predictor” is either an environmental variable or a biogeographic hypothesis. The first layer of this stack holds the observed semi-partial correlations between each node and each predictor. The cells of the second layer are frequency values indicating how often the randomized incidence matrices produced larger semi-partial correlation values for each node-predictor combination. Finally, the third layer indicates whether the semi-partial correlation value of each cell in the observed matrix should be considered significant.
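A rough Python sketch of how such a three-layer stack could be assembled from observed and permuted values follows. Several choices here are assumptions rather than the documented BiotaPhy procedure: values are compared by absolute size, significance is a plain threshold (alpha = 0.05) on the frequency layer, and no multiple-testing correction is applied. The summarize function is hypothetical.

```python
import numpy as np


def summarize(observed, permuted, alpha=0.05):
    """Build a (3, nodes, predictors) stack: observed, frequency, significance.

    observed: (nodes, predictors) semi-partial correlations.
    permuted: (runs, nodes, predictors) values from randomized incidence data.
    Assumptions: comparison on absolute values; significance is frequency < alpha.
    """
    observed = np.asarray(observed, dtype=float)
    permuted = np.asarray(permuted, dtype=float)
    frequency = (np.abs(permuted) > np.abs(observed)).mean(axis=0)
    significant = frequency < alpha
    return np.stack([observed, frequency, significant.astype(float)])


# Hypothetical values: one node, two predictors, four permutation runs.
observed = [[0.9, 0.1]]
permuted = [[[0.2, 0.3]], [[-0.1, 0.05]], [[0.3, 0.2]], [[0.0, -0.4]]]
stack = summarize(observed, permuted)
print(stack[1])  # frequency layer
print(stack[2])  # significance layer (1.0 = significant)
```

In this toy input the first predictor is never exceeded by a permuted value, so it is flagged, while the second is exceeded in three of four runs and is not.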

How are these outputs interpreted?

The results of this analysis are interpreted by first looking at the significance matrix. For every cell that is “true”, the semi-partial correlation between that internal tree node and that predictor is significant. After determining which values are significant, look at the corresponding semi-partial correlation values for those combinations. The closer the absolute value of a cell is to one, the greater the impact of that predictor on the node. This means that one sister clade of that node responds more positively to the predictor than the other, which shows the effect of that evolutionary process. These values can show how clades evolved to match climate conditions, as well as which organisms are vagile enough to disperse and overcome the legacy of previous biogeographic events.
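This reading order (filter by significance, then rank by absolute correlation) can be sketched in Python. The node and predictor labels, the example values and the rank_significant helper are hypothetical.

```python
import numpy as np


def rank_significant(observed, significant, node_labels, predictor_labels):
    """List significant (node, predictor, value) triples, strongest first."""
    observed = np.asarray(observed, dtype=float)
    significant = np.asarray(significant, dtype=bool)
    hits = [
        (node_labels[i], predictor_labels[j], float(observed[i, j]))
        for i, j in zip(*np.nonzero(significant))
    ]
    # Rank by absolute value: the sign only says which sister clade
    # responds more positively to the predictor.
    return sorted(hits, key=lambda hit: abs(hit[2]), reverse=True)


nodes = ["node_1", "node_2"]
predictors = ["temperature", "island_split"]
observed = [[0.2, -0.8], [0.5, 0.1]]
significant = [[True, True], [False, True]]

ranked = rank_significant(observed, significant, nodes, predictors)
for node, predictor, value in ranked:
    print(node, predictor, value)
```

Note that the 0.5 value for node_2 and temperature is dropped because it is not flagged as significant, even though it is larger than some of the values that are kept.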


The paper is provided on the USB drive and linked below:

Leibold, M. A., Economo, E. P., & Peres‐Neto, P. (2010). Metacommunity phylogenetics: separating the roles of environmental filters and historical biogeography. Ecology Letters, 13(10), 1290-1299. doi:10.1111/j.1461-0248.2010.01523.x