This page describes the philosophy and features of the modeling language.
This example shows some features of the language:
Alignment observedSequences = fromNexus("alignment.nex")
QMatrix qMatrix = hky(
kappa~LogNormal(logMean=1.0, logSd=0.5),
baseFrequencies~Dirichlet(concentration=[1.0, 1.0, 1.0, 1.0])
)
Tree tree ~ Yule(
birthRate~Exponential(10.0),
taxa=taxa(observedSequences)
)
Alignment sequences ~ PhyloCTMC(
tree,
qMatrix,
numSequences=numTaxa(observedSequences)
) observed as observedSequences PhyloSpec models describe a graphical model. Every statement corresponds to one variable in the graph.
The simplest statement is a constant assignment:
Rate birthRate = 2.5 This defines a new variable called birthRate of type Rate with a constant value of 2.5. Every variable has a type. Whenever possible, PhyloSpec uses types that tell you what a variable actually represents. This is why birthRate has type Rate and not just PositiveReal.
There are other basic types:
String taxonName = "Chimpanzee"
Integer number = 100
Vector<String> taxonNames = ["Chimpanzee", "Human"] We can apply functions and the usual numerical expressions:
Rate deathRate = 0.5 * birthRate
Rate diversificationRate = birthRate - deathRate
Real logDiversificationRate = log(diversificationRate) Functions can create more sophisticated types:
Alignment alignment = nexus(file="sequences.nex")
Integer numTaxa = numTaxa(alignment)
Taxa taxa = taxa(alignment) So far, we only have deterministic variables. Let’s change that!
Distribution<Real> normalDistribution = Normal(mean=0.0, sd=1.0)
Real drawnValue ~ normalDistribution Here, Normal is a function which returns a Distribution<Real> object (a distribution on real numbers). This distribution is then assigned to the random variable drawnValue using the ~ operator. This is where the randomness in our model comes from!
~ assigns a distribution to a random variable, hence it always has to be preceded by a Distribution object. Use the = operator for assignments of constant values or for deterministic transformations of random variables.
Some examples of valid and invalid statements:
// ✅ Exponential is a function which returns a Distribution<PositiveReal>
PositiveReal a ~ Exponential(rate=1.0)
// ✅
Real b = log(a)
// 🚫 b is not a distribution
Real c ~ b
// 🚫 can't add a number to a distribution object
Real d ~ Normal(mean=0.0, sd=1.0) + 10
// 🚫 can't apply log to a distribution object
Real e ~ log(Normal(mean=0.0, sd=1.0))
// ✅ the function IID takes a distribution object and creates a new distribution object
Vector<Real> f ~ IID(
base=Normal(mean=0.0, sd=1.0),
n=5
) By convention, functions returning a distribution always start with an uppercase letter, whereas all others start with a lowercase letter.
We now look at more details, but we’ve already covered the main things to know.
Check out the Core Component Library for a current list of all types, distributions, and functions.
Every object has one of many types.
A "string" is always of type String, true and false are always of type Boolean.
What about numbers? 10 could refer to a PositiveInteger, a NonNegativeInteger, a Integer, a PositiveReal, a NonNegativeReal, a Real, a Rate. 0.5 could also refer to a Probability among others. In these cases, the exact type is determined by its usage:
Real a = 10 // here, 10 is a Real
Real b = log(5) // log takes a PositiveReal, so 5 is a PositiveReal One reason why PhyloSpec uses types is to make scripts more readable. One part of this is the use of aliases:
Rate birthRate ~ LogNormal(logMean=1, logSd=2) // Rate is an alias for PositiveReal If type A is an alias of type B, the two of them can be used interchangeably.
Open question: How far do we go with aliases? Do we use things like Count (for NonNegativeInteger), Frequencies (for Simplex), or Path and TaxonName (for String?).
Every type can have one or more type parameters. Examples of parameterized types are Vector<T>, Map<K,V>, and Sequence<T>.
From the perspective of an object, its type parameter is fixed upon generation. The type of an object attribute can dependent on the type parameters:
Vector<Real> numbers = [0.5, 0.1]
Real last = numbers.first // for Vector<T>, .first has type T Some languages allow to set bounds to a type parameter. However, we only ever interact with objects through generators. Hence, it is sufficient (and more flexible) to specify type parameter bounds there.
We have the luxury that we can define our type hierarchy from a purely conceptual perspective—we don’t have to care (too much) about implementation details. A type A extends from a type B if an object of type B is also an object of type A. A PositiveReal is also a Real, a TimeTree is also a Tree.
An object of a subtype can always be used in places where a supertype is required:
PositiveReal a = 10
Real b = a // this still works, as PositiveReal extends Real Open questions: Subtyping combined with generics raises the question of covariance: if A extends B, does T<A> extend T<B>? I think yes, but a more careful argument will follow.
PositiveReal mean ~ Exponential(1.0)
Real y ~ Normal(mean=mean, sd=2.0)
Real y ~ Normal(mean=mean, sd=2.0, offset=1.0) If there is only one argument, it can be passed directly (e.g., Exponential(1.0)). If there are multiple arguments, all but the first one must be named explicitly (e.g., Normal(mean=1.0, sd=2.0) or Normal(1.0, sd=2.0)).
There might be optional arguments. One can have multiple functions with the same name but different argument types. Functions with the same name cannot only differ in their return type.
If a variable has the same name as a function argument, it can be passed directly:
PositiveReal mean ~ Exponential(1.0)
PositiveReal sd ~ Exponential(1.0)
Real y ~ Normal(mean, sd) Open question: Do we allow default values?
Nested expressions are allowed:
Real y ~ Normal(mean=log(100), sd=2.0) is equivalent to:
Real mean = log(100)
Real y = Normal(mean=mean, sd=2.0) Whereas
Real y ~ Normal(mean~Exponential(1.0), sd=2.0) is equivalent to:
Real mean ~ Exponential(1.0)
Real y = Normal(mean=mean, sd=2.0) Instead of overly flexible loops, PhyloSpec supports indexed assignments:
Vector<Real> x = [1.0, 2.0, 3.0]
Real logX[i] = log(x[i]) for i in 1:num(x) // logX[i] is a Real, logX a Vector<Real>
Real xTimesLogX[i] = x[i] * logX[i] for i in 1:num(x) Justification for indexed assignments: Indexed assignments provide a declarative, mathematically-inspired syntax that naturally extends beyond simple function application. They excel at:
x[i]^2 + y[i] for i in 1:num(x)f(x[i], y[j]) for i in 1:num(x) for j in num(j)i * x[i] for i in 1:num(x)This syntax aligns well with mathematical set-builder notation and provides more flexibility than map-style vectorization while remaining declarative and side-effect free - ideal for model specification languages where clarity and mathematical expressiveness are paramount.
Elements of vectors, matrices and arrays can be accessed using special index accessors:
Vector<Real> x = [1.0, 2.0, 3.0]
Real x1 = x[1] // 1.0
Matrix<Real> y = [
[1.0, 2.0],
[2.0, 1.0],
]
Real y12 = y[1, 2] // 2.0
Vector<Map<String, String>> data = fromCSV("file.csv")
String entry = data[1]["header"] We use one-based indexing.
We can inject variables into string literals using string interpolation:
String seed = env("seed") // reads in an environment variable
String fileName = "analysis_${seed}.nex" Only variable names can be used within the curly brackets. If a string literal should actually contain ${, we can escape it using \:
String fileName = "analysis_\${seed}.nex" Distribution are normal objects produced by normal functions and can be assigned to variables and passed as arguments (distributions as first-class citizens):
// Create a vector of distribution objects
Vector<Distribution<Real>> components = [
Normal(mean=0.0, sd=1.0), // Normal is simply a function returning Distribution<Real>
Normal(mean=5.0, sd=2.0),
Normal(mean=10.0, sd=1.5)
]
Vector<Real> weights = [0.3, 0.5, 0.2]
Mixture mixture = Mixture(components=components, weights=weights)
Distribution<Real> components[i] = Normal(mean=i*2.0, sd=1.0) for i in 1:10
Vector<Real> weights = repeat(x=0.1, rep=10)
Mixture mixture = Mixture(components=components, weights=weights)
// Sample from the mixture
Real x ~ mixture We use the syntax observed as to assign an observation to a random variable:
Real x ~ Normal(mean=0.0, sd=1.0) observed as 50 The following syntax allows to clamp a scalar variable into an interval:
Age mrca = mrca(
["humans", "gorillas"], tree
) observed between [7, 8] This syntax is short-hand for the following:
Age mrca = mrca(
["humans", "gorillas"], tree
)
Real uniformOverInterval ~ Uniform(lower=7, upper=8)
Real diff = (uniformOverInterval - mrca) observed as 0.0 This corresponds to a non-zero likelihood whenever the MRCA is in the given interval and mirrors to how RevBayes handles calibration (see e.g. here in section Sampling Fossil Occurrence Times).
Every rate and age in a model is implicitly tied to a time scale. This change makes this more apparent.
Adding explicit units to every variable can get complicated pretty quickly (what is the unit of the log mean?). However, we allow to indicate the global time scale by adding units to Age-typed literals:
Tree tree ~ Yule(birthRate=1.0, rootAge=20 Ma)
Age cladeAge = age(["humans", "chimpanzees"], tree) observed as 7 Ma This also implicitly sets the time scale for rates and for parsed tip dates. For simplicity, we only allow a single type of time unit in a model.
Valid units are d, yr, kyr, Myr, ka, and Ma.
Optionally, blocks can be used to aid readability and group the statements:
data {
Alignment alignment = fromNexus("file.nex")
}
model {
PositiveInteger numAffectedSites ~ Binomial(
p=0.1, n=numSites(alignment)
)
}
mcmc {
// some general technicalities like chain length and output files
} All blocks are optional. Statements in the data and model can also be put outside of any block.
The data block cannot contain any random variables drawn from distributions. No statement in the data can reference a variable defined in another block. Statements in the model block can reference variables defined in the data block.
An engine should choose reasonable defaults if no mcmc or engine-specific block is given. There will be a concrete list of allowed variables in the mcmc block (tbd).
PhyloSpec supports standard numerical operations to enable mathematical transformations of variables. These operations are particularly useful when canonical distributions provide convenient priors, but transformed values are needed for downstream model components.
= assignments)~) on the same line^ > *,/ > +,-PositiveReal). Only alphanumeric characters are allowed, they must start with a letter.rate) and can contain Unicode characters. They must start with a letter (a greek letter like α also works).log). Only alphanumeric characters are allowed, they must start with a letter.log). Only alphanumeric characters are allowed, they must start with a letter.log). Only alphanumeric characters are allowed, they must start with a letter.Engines can add custom types, distributions, and functions to the language. These extensions are namespaced and can be imported:
use revbayes.readAtlas
use revbayes.RlAtlas
RlAtlas atlas = readAtlas("atlas.csv") Extensions are defined in component libraries. A component library is a JSON file describing every additional types and functions.
There can be engine-specific blocks that are not part of the language and can be used to configure the engine. These blocks must not change the model specified and will be ignored by any other engine.
revbayes {
// Configuration for RevBayes
// these could be moves
} Statements in engine-specific should adhere to the same syntax as any other PhyloSpec statements. However, statements in engine specific-blocks are neither resolved nor type-checked and are directly passed to the engine.
@revbayes(someInternalArgument=true)
Real x ~ Normal(mean=0.0, sd=1.0)