2026-03-20
•Tobia Ochsner
•6 min read
This post describes my proposed improvements over Draft 12.2025 of the language specifications and core component library. The changes take into account feedback from several people, as well as my learnings from trying to describe more complex models in PhyloSpec. Lastly, I tried to improve consistency and adherence to our core principles.
Feel free to provide feedback in the GitHub discussion of this post!
All proposed changes can be found in the corresponding pull request.
tl;dr: skip to the example model to see these changes in action
With the first batch of changes, I tried to make PhyloSpec feel less like a programming language and more like a natural description of the model.
Previously, we used decorators for clamping (@observedAs(10)). However, the dual use of decorators for clamping and engine-specific hints might be confusing. I thus propose a different syntax for clamping:
Alignment alignment ~ PhyloCTMC(
tree, qMatrix, siteRates
) observed as data Decorators for engine-specific hints make a lot of sense, because you give hints to (“at”) a specific engine:
@beast3(someFlag=True) // we tell beast3 what to do here
Tree tree ~ Yule(
...
) Instead of list comprehension, I propose to use the following index notation:
Vector<Real> values = [1, 2, 3]
Real squared[i] = values[i] * values[i] for i in 1:num(values)
Real rate[i] ~ LogNormal(
logMean=values[i], logSd=0.5
) for i in 1:num(values) This very closely mirrors mathematical syntax, while being slightly more concise than list comprehension.
However, this syntax is less flexible than list comprehension. Notably, it can’t be used for inline vector creation:
Alignment alignment = PhyloCTMC(
tree,
qMatrix,
branchRates=[i for i in 1:numBranches(tree)]
) must be written as
Rate branchRates[i] = i for i in 1:numBranches(tree)
Alignment alignment = PhyloCTMC(
tree,
qMatrix,
branchRates
) The name of a function parameter can be dropped for the first argument or if the variable is called like the parameter:
// we drop the file= for the first parameter
Alignment data = fromNexus("file.nex", ageParser=parse(...))
Tree tree ~ ...
QMatrix qMatrix = ...
Alignment alignment ~ PhyloCTMC(
tree, qMatrix // instead of tree=tree, qMatrix=qMatrix
) This greatly reduces verbosity without affecting readability.
Every rate and age in a model is implicitly tied to a time scale. This change makes this more apparent.
Adding explicit units to every variable can get complicated pretty quickly (what is the unit of the log mean?). However, we can indicate the global time scale by adding units to Age-typed literals:
Tree tree ~ Yule(birthRate=1.0, rootAge=20 Ma)
Age cladeAge = age(["humans", "chimpanzees"], tree) observed as 7 Ma This also implicitly sets the time scale for rates and for parsed tip dates. For simplicity, we only allow a single type of time unit in a model.
use keyword to make them feel less like a programming language: use bdmmprime.BirthDeathMigration.Age was added to highlight that we measure time backwards from the present.readXYZ have been renamed to fromXYZ.From our core principles: (…) the language should help researchers understand the Bayesian phylogenetic framework by making core concepts explicit (…).
The following changes address the abstraction level of the building blocks used to describe models. My goal is to use building blocks that correspond to concepts which might show up in a Bayesian Phylogenetics class or in the methodology section of a paper applying Bayesian phylogenetics.
Vector<Rate> branchRates1 ~ StrictClock(2.0, tree)
Vector<Rate> branchRates2 ~ RelaxedClock(
Exponential(rate=2.0), tree
) Vector<Rate> siteRates ~ DiscreteGammaInv(
shape=2.0, numCategories=4, invariantProportion=0.1, numSites=numSites(alignment)
) Another way to look at these changes is that we attach a label to generators, while remaining explicit about what they generate (a StrictClock explicitly generates a vector of branch rates). This is consistent with how we treat substitution models:
QMatrix qMatrix = jc69() The following changes resulted after I attempted to translate existing BEAST 2 XMLs into PhyloSpec. These are required to model realistic analyses in practice.
We can truncate and offset existing scalar distributions:
PositiveReal x ~ Truncated(
Normal(mean=0, sd=1), lower=2.0
)
Real y ~ Truncated(
Normal(mean=0, sd=1), upper=2.0
)
NonNegativeReal z ~ Truncated(
Normal(mean=0, sd=1), lower=0.0, upper=10.0
)
Real w ~ Offset(
Normal(mean=0, sd=1), offset=2.0
)
PositiveReal k ~ Offset(
Exponential(rate=1), offset=2.0
)
Real m ~ Offset(
Exponential(rate=1), offset=-2.0
) Type parameters on the lower and upper arguments provide type safety up to the discrete set of scalar types. Offset has to be hard-coded into the type checker to provide fine-grained type safety.
Ideally, researchers would want to put their PhyloSpec model into a figure of their paper, because it is the most concise way to summarize their analysis.
However, in practice, some level of data wrangling or implementation details (like the chain length or versions) are unavoidable. Thus, we follow the path of LPhy, Stan and BUGS and allow different blocks:
data {
// some data processing
}
model {
// the model description
}
mcmc {
// some general technicalities like chain length and output files
}
revbayes {
// some engine specific block
}
All blocks are optional. Statements in the data and model can also be put outside of any block.
The data block cannot contain any random variables drawn from distributions. No statement in the data can reference a variable defined in another block. Statements in the model block can reference variables defined in the data block.
The statements in all blocks are parsed by the PhyloSpec parser and must adhere to the grammar. However, statements in engine-specific blocks are neither resolved nor type-checked and are directly passed to the engine.
An engine should choose reasonable defaults if no mcmc or engine-specific block is given. There will be a concrete list of allowed variables in the mcmc block (tbd).
We can inject variables into string literals using string interpolation:
String seed = env(seed)
String fileName = "analysis_${seed}.nex" Only variable names can be used within the curly brackets. If a string literal should actually contain ${, we can escape it using \:
String fileName = "analysis_\${seed}.nex" // "analysis_\${seed}.nex" We introduce the parse function which describes a way to extract information out of a string. This can be used to extract information out of taxa names:
Alignment alignment1 = fromNexus(
"file.nex",
speciesName=parse(delimiter="_", part=1),
age=parse(delimiter="_", part=3)
)
Alignment alignment2 = fromNexus(
"datedFile.nex",
date=parse(delimiter="_", part=3)
) If tip dates are parsed and no explicit time scale is set, an implicit time scale of years is used when converting to tip ages.
We can also parse traits:
Alignment traits = traitsFromTaxa(
taxa=taxa(alignment),
trait=parse(regex="^[^_]+_[^_]+_([^_]+)")
) Conceptually, the traits alignment inherits the taxon ages and species names from the alignment alignment. This works because they are tied to the taxa itself:
Age age = age(taxa(alignment)[1])
String speciesName = species(taxa(alignment)[1]) The following syntax allows to clamp a scalar variable into an interval:
Age mrca = mrca(
["humans", "gorillas"], tree
) observed between [7, 8] This syntax is short-hand for the following:
Age mrca = mrca(["humans", "gorillas"], tree)
Real uniformOverInterval ~ Uniform(lower=7, upper=8)
Real diff = (uniformOverInterval - mrca) observed as 0.0 This corresponds to a non-zero likelihood whenever the MRCA is in the given interval and is similar to how RevBayes handles calibration (see e.g. here in section Sampling Fossil Occurrence Times).
env function allows access to environment variables.fromCSV function and the Map<K, V> type have been added.Binomial and Cauchy distributions have been added.mrca and age have been added to retrieve clade and taxon ages.mk substitution model now has an optional parameter for the expected rate.The following model leverages some of the proposed changes:
data {
String seed = env("SEED")
Alignment molecularData = fromNexus(
"alignment_${seed}.nex", age=parse(delimiter="_", part=1)
)
Alignment traitData = traitsFromTaxa(
taxa(molecularData), trait=parse(delimiter="_", part=2)
)
}
model {
Tree tree ~ Yule(
birthRate~Exponential(rate=1),
taxa=taxa(molecularData)
)
QMatrix molecularQ = hky(
kappa~LogNormal(logMean=0.1, logSd=2),
baseFrequencies~Dirichlet([1, 1, 1, 1])
)
QMatrix traitQ = mk(
rate~LogNormal(logMean=0.1, logSd=2)
)
Vector<Rate> siteRates ~ DiscreteGammaInv(
shape=1,
numCategories=4,
invariantProportion=0.1,
numSites=numSites(molecularData)
)
Vector<Rate> branchRates ~ RelaxedClock(
LogNormal(logMean=1, logSd=2),
numBranches=numBranches(tree)
)
Alignment molecular ~ PhyloCTMC(
tree, qMatrix=molecularQ, siteRates, branchRates
) observed as molecularData
Alignment traits ~ PhyloCTMC(
tree, qMatrix=traitQ, branchRates
) observed as traitData
Age rootAge = rootAge(tree) observed between [1.7 yr, 4 yr]
}
mcmc {
Integer chainLength = 1e8
String logFile = "output.log"
String treesFile = "output.trees"
} Feel free to provide feedback in the GitHub discussion of this post! Thank you!