This document aims to give a flavour of the justification for using parsimony in general, and profile parsimony in particular, to reconstruct evolutionary history; and to summarize the mathematical/information theoretic underpinning of the profile parsimony approach.

I’ll open by acknowledging that a philosophical approach is not everyone’s cup of tea – it’s certainly not my home turf. I tend to prefer models that can be shown to work well, without worrying too much about their abstract philosophy – though admittedly, identifying a ‘best’ method is not always straightforward (e.g. Smith, 2019).

I should also acknowledge that practical considerations can influence the choice of reconstruction method: a tip-dated phylogeny cannot be accomplished in a parsimony framework, for example; and where a principled model of evolutionary change is available, as with certain subsets of molecular data, some see a compelling case for using such models to infer phylogeny.

My goal here is not to argue that any method is superior, but to develop a case that profile parsimony rests on a principled underpinning – and, if it does prove to outperform other methods in certain circumstances, to give a sense of why this might be.

There are a number of complementary perspectives on the philosophical justification for a parsimony approach (e.g. Farris, 1983), which this brief overview will not do justice to; but hopefully my unqualified and largely unreferenced perspective captures some of the nature of the principal arguments. Better-versed readers are invited to suggest improvements or modifications by e-mail or by opening a GitHub issue.

Parsimony has been defended by reference to Occam’s Razor: the principle that scientists should prefer the simplest explanation that can provide an adequate account for observed data. In the context of morphological phylogenetics, “observed data” are codified as observed character states scored within a matrix. Ideally, a phylogenetic tree would explain the distribution of character states between taxa by reconstructing each character state as representing a homologous feature, with a single evolutionary origin on the tree. An explanation in which a certain trait evolved once is simpler than one in which that trait evolved twice; it is less faithful to the information inherent in the character coding, and attributes less of this information to common ancestry. Each additional step on a tree can be viewed as an additional “assumption,” and on a simple or “pure” view, the tree that makes the fewest assumptions should be preferred.

However, this perspective – which implicitly underpins the practice of *equal-weights* parsimony – treats all assumptions as equivalent; the simplest hypothesis is the one that makes the fewest assumptions (here, assumptions of homoplasy).

A more nuanced interpretation of Occam’s Razor suggests that the simplest hypothesis is the one that is least surprising. A hypothesis predicated on the existence of a flying spaghetti monster may require only a single (barmy) assumption, but we might nevertheless tend to prefer a hypothesis that requires a greater number of assumptions that are better aligned with previous experience.

This interpretation has two applications for phylogenetics. The first is that we may wish to prefer trees (which are depictions of phylogenetic hypotheses) that concentrate homoplasy in characters that we believe to be prone to convergent evolution. This view calls for *character weighting*, that is, assigning less weight to changes in characters that are believed to be less phylogenetically reliable. Such characters may be identified by successive approximations (Farris, 1969), by comparing the pattern of their tokens with that of other characters, by expert judgement, or by other *a priori* means.

The second application argues that each additional case of homoplasy beyond the first is successively less surprising: the first observation of homoplasy (rather than some prior intuition) taught us that the character was not entirely reliable, making a second homoplasy less unexpected; the second observed homoplasy, in turn, makes us less surprised by the third. This approach calls for a *step weighting* approach, in which each additional step in a given character receives less penalty than the last. (Mathematically, this can be expressed in as if it were a character weighting strategy; but I feel that its *a posteriori* nature conveys a subtly different motivation and justifies a separate treatment.)

This raises the question of how each step beyond the first ought to be penalized. Ultimately, any concave function (in which each step is penalized by a positive amount that is smaller than the penalty applied to the previous step) is consistent with this philosophy (Arias & Miranda-Esquivel, 2004). The most widely used step weighting approach is Goloboff’s (1993) implied weighting, where the total cost associated with a character is expressed as *e* / (*e* + *k*), where *e* is the number of homoplasies within a character, and *k* is an arbitrary constant. As *e* tends to infinity, this approach tends to equal weights; as *k* tends to zero, it tends to clique analysis (in which a character is either homologous or ignored) (Farris, 1983). The most appropriate value for *k* may depend on the number of taxa, the number and distribution of observed states, and other factors (Goloboff, Carpenter, Arias, & Esquivel, 2008; Goloboff, Torres, & Arias, 2018) (a more detailed treatment will be provided in a revision of this document). Moreover, some adjustment must be made for ‘missing’ data, i.e. ambiguous tokens, which reduce the opportunity to observe homoplasy (Goloboff, 2014). Implied weighting is described as an approximation (Goloboff, 1993), and I am not aware of a straightforward interpretation of the ‘fit’ score, or a principled definition of the nature of the quantity that is being approximated.

I argue that the quantity that we should seek to minimise is the “surprise” of a tree. Information theory is the science of quantifying unexpectedness. Information is usually measured in *bits*. One bit is the amount of information generated by tossing a fair coin: to record the outcome of a coin toss, I must record either a `H`

or a `T`

, and with each of the two symbols equally likely, there is no way to compress the results of multiple tosses.

The Shannon (1948) information content of an outcome \(x\) is defined to be \(h(x) = -\log_2{P(x)}\), which simplifies to \(\log_2{n}\) when all \(n\) outcomes are equally likely. Thus, the outcome of a fair coin toss delivers \(\log_2{2} = 1\textrm{ bit}\) of information; the outcome of rolling a fair six-sided die contains \(\log_2{6} \approx 2.58\textrm{ bits}\) of information; and the outcome of selecting at random one of the 105 unrooted binary six-leaf trees is \(\log_2{105} \approx 6.71\textrm{ bits}\).

Unlikely outcomes are more surprising, and thus contain more information than likely outcomes. The information content of rolling a twelve on two fair six-sided dice is \(-\log_2{\frac{1}{36}} \approx 5.16\textrm{ bits}\), whereas a seven, which could be produced by six of the 36 possible rolls (`1 & 6`

, `2 & 5`

, …), is less surprising, and thus contains less information: \(-\log_2{\frac{6}{36}} \approx 2.58\textrm{ bits}\). An additional 2.58 bits of information would be required to establish whether which of the rolls `1 & 6`

, `2 & 5`

, … occurred.

Now consider two competing explanations for an event: (i), three consecutive rolls of two dice each produced a seven; (ii), two consecutive rolls of two dice each produced a twelve. The former event corresponds to \(3 \times 2.58 = 7.75\textrm{ bits}\) of information, so is less surprising than the latter, which represents \(2 \times 5.16 = 10.34\textrm{ bits}\), despite involving an additional roll of the dice.

How do we measure the “surprise” associated with additional steps in a character on a phylogenetic tree? Consider a character with the states `0 0 0 1 1 1`

. In the most parsimonious situation in which the character contains a single step on a tree, it is compatible with nine of the 105 labelled six-leaf trees, and thus represents \(-log_2\frac{9}{105} = 3.54\textrm{ bits}\) of phylogenetic information. If we are told that the character contains exactly two steps, it can occur on 63 trees, so yields \(-log_2\frac{63}{105} = 0.74\textrm{ bits}\). (The number of trees with *m* extra steps can be calculated using theorem 1 of Carter, Hendy, Penny, Székely, & Wormald (1990), implemented in the function `Carter1()`

.) Learning that a second step occurred meant that 3.54 − 0.74 = 2.81 bits of information we had previously attributed to common ancestry instead represent a signature of homoplasy; this quantity measures our degree of ‘surprise.’

If we subsequently learn that the character contains three steps, then it can occur on any six-leaf tree, and contains no phylogenetic information. We are less surprised by this third step, which attributes the remaining 0.74 bits of information to factors other than common ancestry, because the second step had already reduced the amount of phylogenetic information we expected the character to hold.

Although this is not how the approach was initially justified (Faith & Trueman, 2001), profile parsimony aims to ascribe as much of the information in each character as possible to common ancestry; in the example above, it assigns the first extra step a penalty of 2.81, and the second extra step a penalty of 0.74, corresponding to the amount of information that could no longer be assigned to common ancestry after learning of the existence of the \(n\)th step.

Whereas the value of a parsimony score obtained under implied weights does not have any inherent meaning, the profile parsimony score of a given tree represents the amount of information within a character matrix that can be attributed to common ancestry under that tree, which, if expressed as a proportion of the total cladistic information content of the characters in the matrix (Cotton & Wilkinson, 2008), gives an indication of the degree of homoplasy in the underlying character matrix.

Profile parsimony produces a concavity profile that reflects the opportunity to observe additional steps in each character, unlike implied weighting, where a single concavity value is applied to all characters, regardless of the opportunity for additional steps to be observed.

The graph shows the profiles under implied weighting (with different concavity constants) and profile parsimony (under different distributions of the tokens `0`

and `1`

in a character coded for forty leaves). One prominent difference between the character of concavity profiles is that implied weighting continues to assign relatively large penalties to additional steps even when the distribution of a character is almost random on a tree. Another is that profile parsimony treats a second (and each subsequent) step as less surprising if there are fewer opportunities to observe a second step by chance, on account of there being a smaller number of tokens with the rarer step; implied weighting is blind to the distribution of tokens within a character.

The present implementation of profile parsimony in “TreeSearch” is restricted: inapplicable tokens are treated as ambiguous; partial ambiguity (e.g. `{02}`

) is treated as complete (`?`

), and informative states (i.e. states present in more than one taxon) beyond the first two are ignored (treated as ambiguous). This reflects the complicated mathematics of calculating the number of trees with a given number of steps.

Tree length can be calculated with `TreeLength(concavity = 'profile')`

, and tree search performed with `MaximizeParsimony(concavity = 'profile')`

. Data can be prepared for profile parsimony using `PrepareDataProfile()`

, and the profile of a character calculated using `StepInformation()`

.

- Conduct tree search using profile parsimony

Arias, J. S., & Miranda-Esquivel, D. R. (2004). Profile parsimony (PP): An analysis under implied weights (IW). *Cladistics*, *20*(1), 56–63. doi:10.1111/j.1096-0031.2003.00001.x

Carter, M., Hendy, M., Penny, D., Székely, L. A., & Wormald, N. C. (1990). On the distribution of lengths of evolutionary trees. *SIAM Journal on Discrete Mathematics*, *3*(1), 38–47. doi:10.1137/0403005

Cotton, J., & Wilkinson, M. (2008). Quantifying the potential utility of phylogenetic characters. *Taxon*, *57*(1), 131–136.

Faith, D. P., & Trueman, J. W. H. (2001). Towards an inclusive philosophy for phylogenetic inference. *Systematic Biology*, *50*(3), 331–350. doi:10.1080/10635150118627

Farris, J. S. (1969). A successive approximations approach to character weighting. *Systematic Biology*, *18*(4), 374–385. doi:10.2307/2412182

Farris, J. S. (1983). The logical basis of phylogenetic analysis. In N. Platnick & V. A. Funk (Eds.), *Advances in Cladistics, Vol. 2* (pp. 7–36). New York: Columbia University Press.

Goloboff, P. A. (1993). Estimating character weights during tree search. *Cladistics*, *9*(1), 83–91. doi:10.1111/j.1096-0031.1993.tb00209.x

Goloboff, P. A. (2014). Extended implied weighting. *Cladistics*, *30*(3), 260–272. doi:10.1111/cla.12047

Goloboff, P. A., Carpenter, J. M., Arias, J. S., & Esquivel, D. R. M. (2008). Weighting against homoplasy improves phylogenetic analysis of morphological data sets. *Cladistics*, *24*(5), 758–773. doi:10.1111/j.1096-0031.2008.00209.x

Goloboff, P. A., Torres, A., & Arias, J. S. (2018). Weighted parsimony outperforms other methods of phylogenetic inference under models appropriate for morphology. *Cladistics*, *34*(4), 407–437. doi:10.1111/cla.12205

Shannon, C. E. (1948). A mathematical theory of communication. *Bell System Technical Journal*, *27*, 379–423, 623–656.

Smith, M. R. (2019). Bayesian and parsimony approaches reconstruct informative trees from simulated morphological datasets. *Biology Letters*, *15*(2), 20180632. doi:10.1098/rsbl.2018.0632