Unit 4: Learning Dynamics with Neural Differential Equations

Published

26/06/2026

Unit 3 §3.1 introduced ODEs as \dot{\mathbf{x}} = f(\mathbf{x}, t) and let OrdinaryDiffEq.jl integrate them. This unit takes the next step: make f itself learnable. Two routes — replace f entirely with a neural network (Neural ODEs), or splice a small network into known physics (UDEs).

We treat the solver as a black box throughout — adaptive step-size control, stiff vs non-stiff, implicit methods are deliberately out of scope. Pick the right solver, trust it, and focus on the dynamics.

4.1 ODEs as models of dynamics

Before we let neural networks loose on a vector field we need one more layer of ODE intuition: the qualitative behaviours that distinguish interesting dynamics from boring ones. The rest of this section lays this out (equilibria, limit cycles, chaos) and then puts one example — Lotka–Volterra — under the microscope in Julia and Python so the cross-ecosystem reflex from Unit 3 §3.1 carries over.

The IVP setup from Unit 3 §3.1 carries over verbatim:

\dot{\mathbf{x}}(t) = f(\mathbf{x}(t), t),\qquad \mathbf{x}(0) = \mathbf{x}_0.

What’s new in this unit is the phenomenology — nonlinear ODEs admit qualitative behaviours linear systems can’t:

Equilibria — fixed points \mathbf{x}^* where f(\mathbf{x}^*) = 0. Stability is read off from the eigenvalues of the Jacobian J = \partial f / \partial \mathbf{x} at \mathbf{x}^*.
Limit cycles — isolated closed orbits that attract nearby trajectories. Linear systems can’t have these; the classic example is the Van der Pol oscillator. (Lotka–Volterra below also has closed orbits, but they are neutral centres — a continuous family, none attracting — not limit cycles.)
Chaos — bounded, aperiodic, sensitive dependence on initial conditions. The Lorenz system we revisit in Unit 3 §3.4 (SINDy) lives here.

Worked example: Lotka–Volterra

We return to the predator–prey system from Unit 3 §3.1 — there it introduced what nonlinearity looks like (out-of-phase population cycles); here we use it to expose a hidden conserved quantity that motivates the structure-preserving methods below. The model again:

\dot{x} = \alpha x - \beta x y, \qquad \dot{y} = \delta x y - \gamma y.

For positive (\alpha, \beta, \gamma, \delta) the system has a non-trivial fixed point (\gamma/\delta,\, \alpha/\beta) surrounded by a family of closed orbits — neutrally stable centres, not an attracting limit cycle — and it’s a favourite testbed for SciML methods (we’ll meet it again in §4.3).
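The "neutrally stable centre" claim can be checked against the Jacobian recipe from the list above. The following is a minimal sketch, assuming the parameter values used in the Julia example of this section; the helper name lv_jacobian is introduced here for illustration only. Purely imaginary eigenvalues at the non-trivial fixed point are what a centre looks like in the linearisation, although linearisation alone cannot prove that the closed orbits survive the nonlinear terms (the first integral discussed further down is what settles that).

```python
import numpy as np

# Parameter values taken from the Julia example in this section.
alpha, beta, gamma, delta = 1.5, 1.0, 3.0, 1.0

def lv_jacobian(x, y):
    """Jacobian of the Lotka-Volterra right-hand side at the point (x, y)."""
    return np.array([[alpha - beta * y, -beta * x],
                     [delta * y,        delta * x - gamma]])

x_star, y_star = gamma / delta, alpha / beta        # non-trivial fixed point
eigvals = np.linalg.eigvals(lv_jacobian(x_star, y_star))
origin_eigvals = np.linalg.eigvals(lv_jacobian(0.0, 0.0))

print("fixed point:", (x_star, y_star))
print("eigenvalues at the fixed point:", eigvals)    # expected: purely imaginary pair
print("eigenvalues at the origin:", origin_eigvals)  # expected: one positive, one negative
```

The imaginary parts come out as ±√(αγ), so small oscillations around the fixed point have period roughly 2π/√(αγ). The origin has one positive and one negative eigenvalue, which makes it a saddle.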

using OrdinaryDiffEq, Plots

function lv!(du, u, p, t)
    α, β, γ, δ = p
    du[1] = α*u[1] - β*u[1]*u[2]
    du[2] = δ*u[1]*u[2] - γ*u[2]
end
p     = (1.5, 1.0, 3.0, 1.0)
tspan = (0.0, 12.0)

plt = plot(xlabel="prey x", ylabel="predator y",
           title="Lotka–Volterra phase portrait",
           aspect_ratio = :equal, legend = :outerright)
for u0 in [[1.0, 1.0], [1.5, 1.0], [2.0, 1.0]]
    sol = solve(ODEProblem(lv!, copy(u0), tspan, p), Tsit5();
                saveat = 0.01)
    plot!(plt, sol[1, :], sol[2, :], label = "x(0)=$(u0[1])")
end
plt

Each initial condition lands on a different closed orbit, traced counter-clockwise (prey rise first, predators chase) — a first integral (a conserved quantity along trajectories; V(x, y) = \delta x - \gamma \log x + \beta y - \alpha \log y for this LV system) exists, but it is invisible to a method that only sees data. Recovering invariants like this from observations alone is exactly what Hamiltonian / Lagrangian networks (Unit 3 §3.3) and the SINDy machinery (Unit 3 §3.4) try to do.
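That V is conserved follows from one application of the chain rule along a trajectory. A short derivation, using only the model equations above:

\frac{dV}{dt} = \Bigl(\delta - \frac{\gamma}{x}\Bigr)\dot{x} + \Bigl(\beta - \frac{\alpha}{y}\Bigr)\dot{y} = (\delta x - \gamma)(\alpha - \beta y) + (\beta y - \alpha)(\delta x - \gamma) = 0.

The two products are equal and opposite, so V is constant on every orbit and the closed curves in the phase portrait are exactly its level sets.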

The same model in Python with scipy.integrate.solve_ivp — same shape, same plot, different ecosystem:

units/unit_04/scripts/lotka_volterra_python.py

import numpy as np
from scipy.integrate import solve_ivp

alpha, beta, gamma, delta = 1.5, 1.0, 3.0, 1.0
def lv(t, u):
    x, y = u
    return [alpha * x - beta * x * y,
            delta * x * y - gamma * y]

t_eval = np.linspace(0, 12, 1201)
trajs = []
for u0 in ([1.0, 1.0], [1.5, 1.0], [2.0, 1.0]):
    sol = solve_ivp(lv, (0, 12), u0, t_eval=t_eval, rtol=1e-8)
    trajs.append(sol.y)
    print(f"x(0)={u0[0]:.1f}  x(12)={sol.y[0, -1]:.3f}  "
          f"y(12)={sol.y[1, -1]:.3f}")

Available at scripts/lotka_volterra_python.py.

✏️ Section exercise — test the invariant

The text claims V(x, y) = \delta x - \gamma \log x + \beta y - \alpha \log y is conserved along Lotka–Volterra trajectories. Verify it numerically: solve from (x_0, y_0) = (1, 1) with the parameters above and plot V(x(t), y(t)) - V(x_0, y_0) over t \in [0, 50] at solver tolerances reltol = 1e-3 and reltol = 1e-10. Then answer two questions: (a) is the drift you see physics or numerics, and how do the two tolerance runs prove it? (b) If you fit a neural network to this trajectory and integrated it for 500 time units, what would you expect V to do, and why is that the motivation for §3.3’s Hamiltonian architectures?

💡 Hint

Define V(x, y) = δ*x - γ*log(x) + β*y - α*log(y), set V0 = V(u0...), and solve the same ODEProblem over (0.0, 50.0) twice, changing only reltol/abstol (try 1e-3 then 1e-10); plot [V(u...) - V0 for u in sol.u] for both. The logic: if the drift were physics it wouldn’t shrink when you tighten the solver — the two-tolerance overlay is the whole proof that it’s numerics. For part (b), a NN-fitted field has no reason to keep \nabla V\cdot f_\theta = 0, so like Solution 3.3’s perturbed field it drifts V secularly over 500 t.u. and the orbit spirals — the motivation for §3.3’s structure-preserving architectures.

Go to solution →

Logistic growth

A population rarely grows without limit. The logistic model caps it with a carrying capacity K,

\dot N = r\,N\Bigl(1 - \frac{N}{K}\Bigr),

near-exponential when rare, levelling off at the stable equilibrium N = K. It is exactly the coral-growth term in the §4.3 COTS model. This is one of the rare nonlinear ODEs with a clean closed form — separating variables and integrating gives the logistic sigmoid

N(t) = \frac{K}{1 + \Bigl(\dfrac{K - N_0}{N_0}\Bigr)\,e^{-rt}},

an S-curve that rises from N_0 and saturates at K (and indeed reduces to the familiar 1/(1+e^{-rt}) sigmoid when K=1, N_0=\tfrac12). Solve it numerically from a few starting sizes and that same S-curve appears:

using OrdinaryDiffEq, Plots
logistic(N, p, t) = p.r * N * (1 - N / p.K)
prob = ODEProblem(logistic, 0.0, (0.0, 20.0), (r = 0.4, K = 100.0))

plt = plot(xlabel = "time t", ylabel = "population N", framestyle = :box,
           title = "Logistic growth   Ṅ = rN(1 − N/K)", legend = :bottomright, gridalpha = 0.25)
for N0 in (2.0, 20.0, 60.0)
    sol = solve(remake(prob; u0 = N0), Tsit5(); saveat = 0.1)
    plot!(plt, sol.t, sol.u, lw = 2.6, label = "N(0) = $N0")
end
hline!(plt, [100.0]; ls = :dash, c = :gray, label = "carrying capacity K")
plt
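As a cross-check of the closed form, here is a small Python sketch that integrates the same logistic ODE with scipy.integrate.solve_ivp and compares it to the sigmoid formula. It assumes the r and K values of the Julia example; the helper name logistic_exact and the tolerances are choices made for this illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

r, K = 0.4, 100.0                       # same values as the Julia example

def logistic_exact(t, N0):
    """Closed-form logistic solution N(t) for initial size N0."""
    return K / (1.0 + (K - N0) / N0 * np.exp(-r * t))

t_eval = np.linspace(0.0, 20.0, 201)
max_err = {}
for N0 in (2.0, 20.0, 60.0):
    sol = solve_ivp(lambda t, N: r * N * (1.0 - N / K), (0.0, 20.0), [N0],
                    t_eval=t_eval, rtol=1e-9, atol=1e-9)
    # Largest gap between the numerical trajectory and the closed form.
    max_err[N0] = float(np.max(np.abs(sol.y[0] - logistic_exact(t_eval, N0))))
    print(f"N0 = {N0:5.1f}   max |numeric - exact| = {max_err[N0]:.2e}")
```

The gap should sit at the level of the solver tolerance for all three starting sizes, which is the numerical way of saying the S-curve and the ODE are the same object.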

A compartmental model: SIR

Epidemics are dynamics too. The SIR model partitions a population into Susceptible, Infected and Recovered and moves mass between them:

\dot S = -\beta\, S I, \qquad \dot I = \beta\, S I - \gamma I, \qquad \dot R = \gamma I,

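with S, I, R as population fractions. New infections need an S–I contact (\beta S I, itself a mass-action functional response); recovery drains I at rate \gamma. The outcome hinges on the basic reproduction number R_0 = \beta/\gamma: because \dot I = (\beta S - \gamma)\,I, infections grow at the start only if \beta S(0) > \gamma, which for a nearly fully susceptible population (S(0) \approx 1) means R_0 > 1.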

using OrdinaryDiffEq, Plots
function sir!(du, u, p, t)
    β, γ = p; S, I, R = u
    du[1] = -β*S*I
    du[2] =  β*S*I - γ*I
    du[3] =  γ*I
end
sol = solve(ODEProblem(sir!, [0.99, 0.01, 0.0], (0.0, 120.0), (0.4, 0.1)),
            Tsit5(); saveat = 0.5)
plot(sol; lw = 2.4, label = ["S" "I" "R"], xlabel = "time (days)",
     ylabel = "population fraction", title = "SIR epidemic  (R₀ = β/γ = 4)",
     framestyle = :box, legend = :right)
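The threshold logic has a checkable consequence: since the I equation factors as (βS − γ)·I, the infected fraction should peak exactly when S has fallen to γ/β = 1/R_0. The sketch below tests that in Python with scipy, assuming the same β, γ and initial condition as the Julia example; the grid resolution and tolerances are illustrative choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma = 0.4, 0.1                  # same values as the Julia example, R0 = 4

def sir(t, u):
    S, I, R = u
    return [-beta * S * I, beta * S * I - gamma * I, gamma * I]

t_eval = np.linspace(0.0, 120.0, 12001)
sol = solve_ivp(sir, (0.0, 120.0), [0.99, 0.01, 0.0],
                t_eval=t_eval, rtol=1e-9, atol=1e-12)
S, I, R = sol.y

k_peak = int(np.argmax(I))              # grid index of the infection peak
S_at_peak = float(S[k_peak])
print(f"I peaks at t = {t_eval[k_peak]:.2f} days, where S = {S_at_peak:.4f}")
print(f"gamma / beta = 1 / R0 = {gamma / beta:.4f}")
```

The same run also confirms that S + I + R stays at 1: the three equations sum to zero, so the model only moves mass between compartments.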

Variants just redraw the compartment graph: SIS (no lasting immunity — recovereds flow back to S, so the disease can stay endemic) and SEIR (an Exposed/latent stage before infectiousness). Fitting such models to real spread data is its own craft — the SafeBlues project released a “safe virtual virus” across university campuses to study exactly this kind of dynamic epidemic-model fitting on genuine contact data.

4.2 From ResNets to Neural ODEs

This section walks the bridge between the discrete-layer MLPs of Unit 2 and the continuous-time vector fields we’ll use in the rest of the workshop. The subsections cover (a) the layer-as-Euler-step intuition, (b) what a Neural ODE looks like in Lux.jl, (c) the adjoint trick that makes training through an ODE solver feasible, and (d) a complete training run on a toy problem. All four together form one of the two patterns (Unit 9) the capstone inverse problem uses.

Residual networks: the ML backstory

Before 2015, making a network deeper eventually made it worse, and not from overfitting: the training error itself went up once you stacked enough layers. Vanishing or exploding gradients were part of the story, but careful initialisation and normalisation layers had largely tamed those; very deep plain stacks were still harder to optimise, to the point that they struggled to learn even an identity mapping through their extra layers. This was the degradation problem: depth was supposed to help, and past a point it hurt.

The fix, from He et al. 2015 (“Deep Residual Learning for Image Recognition”), was a one-line change to how a block of layers is wired. Instead of asking the block to output the full transformed signal H(h), ask it to output only the residual f_\theta(h) — the change to make — and add the input back through a skip connection:

h_{\ell+1} = h_\ell + f_\theta(h_\ell).

Now each block only has to learn a small correction to its input. If the best thing a layer can do is nothing, it just learns f_\theta \approx 0 — trivial — and the skip connection carries the signal (and the gradient) straight through. That single idea made networks of 50, 150, even 1000+ layers trainable, won ImageNet 2015, and is now everywhere (transformers are stacks of residual blocks). With that in hand, the bridge to continuous time is almost a relabelling.

A residual block. The input h_\ell skips around a small learned branch f_\theta (two weight layers and an activation) and is added back, so the block only has to learn a correction to its input: h_{\ell+1} = h_\ell + f_\theta(h_\ell). The skip connection is what lets gradients, and the signal, pass straight through, making very deep stacks trainable.

The Euler-step analogy

A residual network layer — the skip-connection block we just met, now with a step size \Delta t written in front of f_\theta — is

h_{\ell+1} = h_\ell + \Delta t\, f_\theta(h_\ell, \ell).

Compare to one explicit Euler step on the ODE \dot{h} = f_\theta(h, t):

h(t + \Delta t) \approx h(t) + \Delta t\, f_\theta(h(t), t).

Identical. Stacking many such steps and taking \Delta t \to 0 as the depth L \to \infty gives a Neural ODE (Chen et al., 2018): the network is a vector field f_\theta. The headline change is what happens to the forward pass. In an ordinary network it is a fixed stack of layer evaluations; in a Neural ODE that stack is gone — the forward pass is now literally “solve an ODE”: hand \dot h = f_\theta(h, t) to an ODE solver and integrate from the input h(0) = \mathbf{x} to the output h(T) = \text{prediction}. The depth — how many “layers” you take — is no longer fixed in the architecture; it is however many steps the solver decides to take.
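The "stack of residual blocks is Euler integration" claim can be made concrete in a few lines. In the sketch below a toy linear field stands in for f_\theta (an assumption made so that the exact ODE solution is available in closed form; it is the damped-spiral matrix used later in this unit), and resnet_forward is a hypothetical helper that applies `depth` weight-shared residual blocks with step size T/depth.

```python
import numpy as np

# Toy stand-in for f_theta: the damped-spiral field used later in this unit,
# chosen because its exact solution is known in closed form.
A = np.array([[-0.1, 2.0], [-2.0, -0.1]])

def f(h):
    return A @ h

def resnet_forward(h0, depth, T=1.0):
    """Apply `depth` residual blocks, each one explicit Euler step of size T/depth."""
    dt = T / depth
    h = np.array(h0, dtype=float)
    for _ in range(depth):
        h = h + dt * f(h)               # h_{l+1} = h_l + dt * f(h_l)
    return h

h0 = np.array([2.0, 0.0])
# Exact ODE solution at T = 1 for this linear field and initial state.
exact = 2.0 * np.exp(-0.1) * np.array([np.cos(2.0), -np.sin(2.0)])

errors = {}
for depth in (4, 16, 64, 256):
    errors[depth] = float(np.linalg.norm(resnet_forward(h0, depth) - exact))
    print(f"depth {depth:4d}: |ResNet output - ODE solution| = {errors[depth]:.3e}")
```

The gap shrinks roughly in proportion to 1/depth, the first-order convergence of explicit Euler. An adaptive solver reaches the same endpoint with far fewer, better-placed steps, which is the practical content of "the solver chooses the depth".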

Tapping out intermediate layers. The loss need not live only at the output. In ordinary deep nets you can tap out hidden layers and attach a loss to them — deep supervision / auxiliary heads (the side-classifiers in GoogLeNet/Inception, or deeply-supervised nets), which inject gradient partway up the stack to steady very deep training. The continuous-depth version is the same idea: instead of scoring only the endpoint h(T), compare the trajectory h(t) to data at intermediate times — a running loss along the path, \int_0^T \ell\!\bigl(h(t), t\bigr)\,dt or a sum over observation times. That is exactly the trajectory loss used in §4.3 (data at many t, not just the end), and the basis of the multiple-shooting tricks for fitting a Neural ODE to a time series.

Continuous-depth models in Lux.jl style

The vector field is just an MLP. Wrap it in an ODEProblem and hand it to the solver:

using Lux, OrdinaryDiffEq, Random

rng    = Random.MersenneTwister(42)
net    = Lux.Chain(Lux.Dense(2 => 16, tanh), Lux.Dense(16 => 2))
ps, st = Lux.setup(rng, net)

# Lift the network into a vector field f_θ(u, p, t)
function nn_rhs!(du, u, p, t)
    y, _ = net(u, p, st)
    du .= y
end

prob = ODEProblem(nn_rhs!, [1.0f0, 0.0f0], (0.0f0, 1.0f0), ps)
sol  = solve(prob, Tsit5(); saveat = 0.05)
size(sol)   # (state, time) — the integrated trajectory

(2, 21)

That’s a forward pass of an untrained Neural ODE. Training fits \theta so the integrated trajectory matches data — and that training is what the workshop’s later units use.

What are Neural ODEs actually good for?

A fair question: if a Neural ODE is “just” a very deep ResNet, why bother solving an ODE? The honest answer is that for plain image classification they don’t beat a discrete ResNet — they’re slower and no more accurate. Their real wins are the places where continuous time is genuinely part of the problem:

Irregularly-sampled time series. Because the model is defined for every t, you can evaluate it at whatever ragged timestamps your data actually has — no resampling onto a fixed grid. This is the big practical use: Latent ODEs / ODE-RNNs for medical records, climate sensors, and finance, where readings arrive at uneven times (Rubanova et al. 2019).
Continuous normalizing flows. Use the ODE to transport a simple probability density into a complicated one, giving an exact, invertible generative model (FFJORD); the same continuous-transport idea underlies today’s flow-matching and diffusion-style generators.
Memory-efficient very deep models. The adjoint (next subsection) trains with memory that doesn’t grow with depth — the original selling point.
Learning physical dynamics from data. Fit f_\theta to observed trajectories of a real system — the direct bridge to the universal differential equations of §4.3 and the scientific applications this workshop is about.
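The irregular-sampling point in the list above is easy to see in code: once the model is a vector field, predictions at ragged timestamps are just the ODE solution evaluated at those times. A minimal Python sketch with scipy follows. It assumes a known linear field as a stand-in for a trained f_\theta (so the answer can be checked), and the timestamps in t_obs are made up for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[-0.1, 2.0], [-2.0, -0.1]])     # stand-in for a trained f_theta
# Hypothetical, unevenly spaced observation times (no fixed grid).
t_obs = np.array([0.13, 0.2, 0.95, 1.0, 2.71, 4.4, 5.98])

sol = solve_ivp(lambda t, u: A @ u, (0.0, 6.0), [2.0, 0.0],
                t_eval=t_obs, rtol=1e-9, atol=1e-12)
pred = sol.y                                   # shape (state, len(t_obs))

# Closed-form solution of this linear system, for comparison only.
exact = 2.0 * np.exp(-0.1 * t_obs) * np.array([np.cos(2.0 * t_obs),
                                               -np.sin(2.0 * t_obs)])
print("max abs error at the ragged timestamps:", np.max(np.abs(pred - exact)))
```

No resampling or interpolation of the data is needed; the loss can be evaluated directly at t_obs.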

State of the art. The active frontier is mostly about making the above practical: faster and more accurate adjoints, solvers that cope with stiff learned dynamics, and hybrids that splice Neural-ODE blocks into transformers and diffusion models. The reference implementations live in Julia’s SciML stack — SciMLSensitivity.jl and DiffEqFlux.jl, which pioneered differentiable solving at scale; on the Python side the modern counterpart is the JAX library diffrax. For this course the takeaway is narrower: a Neural ODE is the right tool when the data is a trajectory and the dynamics are unknown — exactly the capstone setting in Unit 9.

Adjoint sensitivities (sketch)

Naively Zygote.gradient through an ODE solver stores every intermediate state — what’s called discretise-then-differentiate or “direct backprop through the solver”. Memory cost scales with the number of solver steps, which is fine for short trajectories and prohibitive for long ones.

The adjoint method (originally due to Pontryagin in optimal control; rediscovered for Neural ODEs by Chen et al. 2018) replaces this by differentiate-then-discretise: derive an auxiliary ODE for the gradient — the adjoint \boldsymbol{\lambda}(t) = \partial \mathcal{L} / \partial \mathbf{x}(t), the sensitivity of the loss to the state at time t — and solve it backward in time alongside the original equation. Memory cost becomes constant in trajectory length; the price is some numerical-accuracy bookkeeping (the backward solve has to be at least as accurate as the forward one).

A third middle-ground option, checkpointing, stores only a sparse subset of states forward, then re-integrates between them on the backward pass — logarithmic-in-time memory at the cost of logarithmic-in-time extra compute.
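To make the adjoint recipe described above less abstract, here is a hand-rolled Python sketch on the smallest possible problem: the scalar ODE dx/dt = θx with loss L = ½·x(T)². The toy values of θ, x_0 and T are assumptions for this illustration. Nothing here is SciMLSensitivity's actual implementation; it only mirrors the continuous equations: λ(T) = ∂L/∂x(T), dλ/dt = −λ·∂f/∂x, and dL/dθ = ∫ λ·∂f/∂θ dt, with the state re-integrated backward in time alongside the adjoint.

```python
import numpy as np
from scipy.integrate import solve_ivp

theta, x0, T = -0.3, 2.0, 4.0           # hypothetical toy values

# Forward pass: integrate dx/dt = theta * x and keep only the final state.
fwd = solve_ivp(lambda t, x: theta * x, (0.0, T), [x0], rtol=1e-10, atol=1e-12)
xT = fwd.y[0, -1]
loss = 0.5 * xT**2

def backward_rhs(t, z):
    """Augmented system [x, lam, g], integrated from t = T down to t = 0."""
    x, lam, g = z
    return [theta * x,        # state, re-integrated backward in time
            -lam * theta,     # adjoint:     dlam/dt = -lam * df/dx
            -lam * x]         # accumulator: dg/dt   = -lam * df/dtheta

# Terminal conditions: lam(T) = dL/dx(T) = x(T); the accumulator starts at zero.
bwd = solve_ivp(backward_rhs, (T, 0.0), [xT, xT, 0.0], rtol=1e-10, atol=1e-12)
grad_adjoint = bwd.y[2, -1]             # g(0) = integral over [0, T] of lam * x

# Analytic check: L = 0.5 * (x0 * exp(theta*T))**2, differentiated w.r.t. theta.
grad_exact = T * x0**2 * np.exp(2.0 * theta * T)
print(f"adjoint gradient = {grad_adjoint:.8f}   exact = {grad_exact:.8f}")
```

Only x(T) is carried from the forward pass into the backward one, which is where the constant-memory claim comes from. The backward re-integration of the state is also where the accuracy caveat bites: if that solve drifts away from the true forward trajectory, the gradient inherits the error.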

In Julia that machinery lives in SciMLSensitivity.jl (the sensealg = InterpolatingAdjoint() you’ll pass to solve); the Python equivalents are torchdiffeq (Chen’s reference implementation) and the JAX library diffrax. One practical caveat for this workshop: SciMLSensitivity.jl is not in the default Project.toml — it is heavy to precompile, so we keep it out and the rendered site never executes the adjoint-based code below. It is a one-line Pkg.add("SciMLSensitivity") when you want to run the training script yourself. (The §4.3 UDE worked example deliberately avoids it — it backprops through a hand-written RK4 step with plain Zygote — so that one needs nothing beyond the course environment.)

Training a Neural ODE — a complete run

Not a sketch — the whole thing end to end: make data from a known system, let a small MLP be the vector field, train through the ODE solver with adjoint sensitivities, and check the fit. This is the full scripts/neural_ode_train.jl. It is shown here but not run by the site — it is the one example that needs SciMLSensitivity (for InterpolatingAdjoint), which (as just noted) is not in the default Project.toml. To run it yourself, add that one package; everything else it uses — ComponentArrays, Lux, OrdinaryDiffEq, Zygote — is already in the course environment:

units/unit_04/scripts/neural_ode_train.jl

using Lux, OrdinaryDiffEq, SciMLSensitivity, Zygote, Optimisers,
      ComponentArrays, Random

rng = Random.MersenneTwister(0)

# ground-truth system: a damped spiral, sampled at 40 times
A_true = Float32[-0.1 2.0; -2.0 -0.1]
u0, tspan = Float32[2.0, 0.0], (0.0f0, 6.0f0)
ts   = range(tspan[1], tspan[2]; length = 40)
data = Array(solve(ODEProblem((u,p,t) -> A_true*u, u0, tspan), Tsit5(); saveat = ts))
data .+= 0.08f0 .* randn(rng, size(data))   # a bit of observation noise

# the Neural ODE: a small MLP IS the vector field f_θ(u)
fθ      = Lux.Chain(Lux.Dense(2 => 16, tanh), Lux.Dense(16 => 2))
ps0, st = Lux.setup(rng, fθ)
ps      = ComponentArray(ps0)               # flat params the solver can carry

rhs(u, p, t) = first(fθ(u, p, st))
predict(p) = Array(solve(ODEProblem(rhs, u0, tspan, p), Tsit5();
                   saveat = ts, sensealg = InterpolatingAdjoint(autojacvec = ZygoteVJP())))
loss(p) = sum(abs2, predict(p) .- data)

# train: Adam on θ; the gradient flows through the solver via the adjoint
opt_state = Optimisers.setup(Optimisers.Adam(0.05f0), ps)
for epoch in 1:300
    global opt_state, ps   # required when this loop runs at top level in a script file
    l, back = Zygote.pullback(loss, ps)
    opt_state, ps = Optimisers.update(opt_state, ps, back(one(l))[1])
end
# `predict(ps)` now tracks the true damped spiral; the script saves the fit + loss curve.

Trained on the noisy samples, the learned vector field’s trajectory threads the data and recovers the underlying spiral:

Figure 1: A Neural ODE trained on 40 noisy samples of a damped spiral. The black dots are the noisy observations, the crimson curve is the trajectory of the learned field f_\theta integrated from \mathbf{x}_0 = (2, 0), and the dotted grey curve is the true spiral. The network never sees the equation, only the dots, yet its field reproduces the inward spiral.

Three structural differences from the standard supervised-learning loop in Unit 2 §2.8:

The forward pass is an ODE solve, not a feed-forward evaluation. Cost is dominated by the integrator.
The loss is a trajectory misfit — a sum over time, not a single per-sample term.
The gradient flows through the solver via sensealg. Pick the wrong one and training time blows up.

✏️ Section exercise — train a tiny Neural ODE on a spiral

Make the §4.2 skeleton run end-to-end on the classic toy: data from the linear spiral \dot{\mathbf{u}} = A\mathbf{u}, A = \begin{pmatrix} -0.1 & 2 \\ -2 & -0.1\end{pmatrix}, sampled at 30 points over t \in [0, 6] from \mathbf{u}_0 = (2, 0). Fit the 2 → 16 → 2 tanh network with the adjoint-backed training loop (SciMLSensitivity + Adam, ~1 000 iterations), then extrapolate the trained model to t = 12 and plot it against the true spiral. Where does the learned dynamics stay faithful, and where does it drift? Bonus: repeat with only the first half of the spiral as training data and watch the extrapolation get worse — the data-efficiency argument for UDEs in one picture.

💡 Hint

You’ll need one package that is not in the workshop Project.toml: SciMLSensitivity (adjoints). ComponentArrays, which provides the flat parameter vector Zygote can differentiate, is already in the course environment; wrap the Lux params with ps = ComponentArray(ps). Write a predict(p, tspan, saveat) that solves the Neural ODE with sensealg = InterpolatingAdjoint(), so you can reuse it at two horizons; train on loss(p) = sum(abs2, predict(p, (0f0,6f0), ts) .- data) with Adam(1f-2) for ~1000 iters. To extrapolate, just call predict(ps, (0f0,12f0), t_ext) with the trained ps (don’t retrain) and plot against the true spiral re-solved over the same span. It stays faithful while the orbit revisits states it saw in training, then drifts where it didn’t.

Go to solution →

4.3 Universal Differential Equations

A Neural ODE replaces all of f with a network. A Universal Differential Equation (UDE) only replaces the bit you don’t know, keeping the trusted physics intact. That’s almost always the right move in scientific applications — most domains have some equation that’s known to be correct (mass conservation, a governing PDE in part of the domain, a well-established empirical law) and some term that’s empirical or unknown. UDEs let you write the known bits with a pencil and learn the rest with a small network — the position taken throughout Units 8–10 of the capstone.

Known physics + learned closure

A Universal Differential Equation (UDE) (Rackauckas et al. 2020) is an ODE whose right-hand side splits into a known term and a learned term:

\dot{\mathbf{x}} = \underbrace{f_{\text{phys}}(\mathbf{x}, t)}_{\text{known}} \;+\; \underbrace{N_\theta(\mathbf{x}, t)}_{\text{learned}}.

You write down whatever physics you trust (f_{\text{phys}}) and let a small neural network absorb the rest. Train on observed \mathbf{x}(t). The result is a model you can simulate, analyse, and interpret — the NN is small and you can probe it.
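The additive split is simple to express in code. The sketch below is a structural illustration in Python, not a training script: f_phys is a hypothetical "trusted" term (plain linear damping), N_theta is a small untrained NumPy MLP, and init_mlp, ude_rhs and simulate are helper names invented here. Zeroing the network's output layer switches the closure off, so the UDE should collapse back to the physics-only solution.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f_phys(x, t):
    """Trusted physics (hypothetical example): linear damping of both states."""
    return -0.5 * x

def init_mlp(rng, n_in=2, n_hidden=8, n_out=2):
    """Random small MLP, stored as a dict of weight arrays."""
    return {"W1": rng.normal(scale=0.5, size=(n_hidden, n_in)),
            "b1": np.zeros(n_hidden),
            "W2": rng.normal(scale=0.5, size=(n_out, n_hidden)),
            "b2": np.zeros(n_out)}

def N_theta(x, theta):
    """Learned correction: one tanh hidden layer."""
    return theta["W2"] @ np.tanh(theta["W1"] @ x + theta["b1"]) + theta["b2"]

def ude_rhs(t, x, theta):
    return f_phys(x, t) + N_theta(x, theta)     # known term + learned term

def simulate(theta, x0=(1.0, 0.5), T=5.0):
    sol = solve_ivp(ude_rhs, (0.0, T), x0, args=(theta,), rtol=1e-8, atol=1e-10)
    return sol.y[:, -1]

rng = np.random.default_rng(0)
theta = init_mlp(rng)
theta_off = {**theta, "W2": np.zeros_like(theta["W2"])}   # closure switched off

x_phys = np.array([1.0, 0.5]) * np.exp(-0.5 * 5.0)        # exact physics-only answer
x_off = simulate(theta_off)
x_on = simulate(theta)
print("physics only (exact):", x_phys)
print("UDE, closure off    :", x_off)
print("UDE, closure on     :", x_on)
```

Training would adjust theta so that the closure-on trajectory matches data, while f_phys is never touched. That asymmetry is the design choice: the network only has to explain what the physics leaves over.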

Worked example: Crown-of-Thorns starfish on the Great Barrier Reef

The Great Barrier Reef lost roughly half its hard-coral cover between 1985 and 2012, and the peer-reviewed AIMS attribution (De’ath et al. 2012, PNAS) pins ~42% of that loss on Crown-of-Thorns starfish (COTS) predation, second only to cyclones and one of the drivers reef managers attempt to address directly. Reefs sit at 1–10 adult COTS per hectare until an outbreak (>30/ha) strips the coral; densities are monitored and targeted culling is one management response. For a modeller, the grazing interaction is relatively well-understood textbook predator–prey; more uncertain is why COTS populations crash, that is, their mortality, which is where we will place the learned term.

The modelling question we’ll borrow: given monitored time-series of coral cover C(t) and COTS density S(t) on a reef, can a UDE separate the (mostly) known population dynamics from the genuinely uncertain density-dependent COTS mortality?

Known physics: logistic coral + Holling-II grazing

A standard predator–prey model for COTS / coral (Morello et al. 2014, Marine Ecology Progress Series; historically Antonelli et al. 1990, Ecological Modelling) takes the form

\begin{aligned} \dot C \;&=\; r_C\, C\,\Bigl(1 - \frac{C}{K}\Bigr) \;-\; \frac{a\, S\, C}{1 + a\,h\,C} \\ \dot S \;&=\; e\,\frac{a\, S\, C}{1 + a\,h\,C} \;-\; m(S, \ldots)\, S \end{aligned}

with parameters (the values used in the simulation below):

r_C = 0.30\;\text{yr}^{-1} — coral intrinsic growth rate,
K = 80\% cover — carrying capacity,
a = 0.005 — COTS attack rate,
h = 0.10 — handling time,
e = 0.40 — conversion efficiency,
m(\cdot) = 0.06 — COTS mortality rate, held at this placeholder constant for now (it is the term the UDE will learn).

The two structural terms — logistic coral growth and Holling type-II grazing — are well established. The mortality m(S, \ldots) is exactly where the modelling literature is honest about not knowing the right functional form (Babcock et al. 2016, PLOS ONE flags it as the dominant source of structural uncertainty), and it’s the natural place to put a learnable closure.

Pin that mortality at a constant value for a moment and the known skeleton already produces the signature COTS boom–bust — run it forward:

using OrdinaryDiffEq, Plots

r_C, K = 0.30, 80.0      # coral growth (1/yr), carrying capacity (% cover)
a, h   = 0.005, 0.10     # COTS attack rate, handling time
e      = 0.40            # conversion efficiency
m      = 0.06            # placeholder mortality — the term the UDE will learn

function cots_coral!(du, u, p, t)
    C, S = u                                  # coral cover (%), COTS density (per ha)
    graze = a * S * C / (1 + a * h * C)
    du[1] = r_C * C * (1 - C / K) - graze      # coral:  logistic growth − grazing
    du[2] = e * graze - m * S                  # COTS:   grazing gain − mortality
end

sol = solve(ODEProblem(cots_coral!, [60.0, 2.0], (0.0, 200.0)), Tsit5(); saveat = 0.5)
t = sol.t;  C = [u[1] for u in sol.u];  S = [u[2] for u in sol.u]

p1 = plot(t, C; lw = 2.4, c = :seagreen, label = "coral cover C (%)",
          xlabel = "time (years)", ylabel = "% cover  /  COTS per ha",
          title = "COTS–coral skeleton (constant mortality)", legend = :topright,
          framestyle = :box)
plot!(p1, t, S; lw = 2.4, c = :firebrick, label = "COTS density S (per ha)")
hline!(p1, [30]; ls = :dash, c = :gray, label = "outbreak threshold")
p2 = plot(C, S; lw = 2, c = :purple, xlabel = "coral C (%)", ylabel = "COTS S (per ha)",
          title = "phase portrait", legend = false, framestyle = :box)
plot(p1, p2; layout = (1, 2), size = (900, 350),
     bottom_margin = 5Plots.mm, left_margin = 5Plots.mm)

Even this crude constant-mortality skeleton reproduces the outbreak-and-crash: COTS irrupt past the 30/ha threshold, coral collapses from ~75% to ~30%, then the system spirals to a low-coral coexistence. What it can’t capture is why real outbreaks recur and vary in size — that is set by the density- and temperature-dependent mortality, the one term the UDE replaces with N_\theta.
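The coexistence state the trajectory settles toward can be computed by hand from the two nullclines, which gives a quick sanity check on the "~30%" figure. The sketch below does the arithmetic in Python, assuming the constant-mortality parameter values of the simulation above; rhs is a helper written for this check, and the closed-form expressions in the comments follow from setting each right-hand side to zero.

```python
import numpy as np

# Parameter values from the constant-mortality simulation above.
r_C, K = 0.30, 80.0
a, h = 0.005, 0.10
e, m = 0.40, 0.06

def rhs(C, S):
    """Right-hand side of the COTS-coral skeleton at the state (C, S)."""
    graze = a * S * C / (1.0 + a * h * C)
    return np.array([r_C * C * (1.0 - C / K) - graze, e * graze - m * S])

# dS/dt = 0 with S > 0:  e*a*C / (1 + a*h*C) = m   =>   C* = m / (a * (e - m*h))
C_star = m / (a * (e - m * h))
# dC/dt = 0 with C > 0:  r_C*(1 - C/K) = a*S / (1 + a*h*C)
S_star = r_C * (1.0 - C_star / K) * (1.0 + a * h * C_star) / a

print(f"C* = {C_star:.2f} % cover,  S* = {S_star:.2f} COTS per ha")
print("right-hand side at (C*, S*):", rhs(C_star, S_star))
```

Note that C* depends only on the COTS parameters (a, h, e, m), not on coral growth: in this skeleton the predator's balance sets the equilibrium coral cover, which is one reason the functional form of the mortality matters so much.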

Learned closure: COTS mortality as a small MLP

Replace the unknown mortality with a small network of the state and one or two environmental drivers — a sea-surface-temperature anomaly T_a(t) from satellite data is a reasonable starting choice (warmer water suppresses larval metamorphic success; Lang et al. 2022):

\dot S \;=\; e\,\frac{a\, S\, C}{1 + a\,h\,C} \;-\; \underbrace{N_\theta\!\bigl(S, C, T_a(t)\bigr)}_{\text{learned mortality}} \cdot S.

Training is exactly the §4.2 Neural-ODE loop — integrate the model forward, compare to the observed C(t), S(t) trajectories, backprop through the solver, step Adam — except N_\theta is now the only learned piece, with everything else held at its known value.

Let’s make that concrete and run it end to end on the cleanest version of the problem: the mortality depends on density alone, m(S), and is the only unknown. We invent a hidden truth m_\star(S) = 0.035 + 0.045\,(S/35)^2 (a baseline plus a density-dependent crowding/disease term), generate noisy COTS / coral trajectories from it, then ask whether a small network, dropped into the mortality slot with all other physics held fixed, can recover that curve from the trajectories alone. (Adding C and a temperature anomaly T_a(t) as extra inputs is the same machinery with a wider first layer.) Here is the whole thing: data generation, the hybrid field, the differentiable solver, and a two-stage Adam → L-BFGS fit. Only the ~20 lines of Plots code that draw the figure below are omitted here; they live in scripts/cots_ude_train.jl:

units/unit_04/scripts/cots_ude_train.jl

using Lux, Zygote, Optimisers, Optimization, OptimizationOptimJL,
      ComponentArrays, Random, Statistics

rng = Random.MersenneTwister(1)

# known physics — held FIXED in both the ground truth and the UDE
const r_C, K = 0.30, 80.0      # coral growth rate (1/yr), carrying capacity (% cover)
const a, h   = 0.005, 0.10     # COTS attack rate, handling time
const e      = 0.40            # conversion efficiency
graze(C, S) = a*S*C / (1 + a*h*C)

# the HIDDEN mortality law the UDE must recover from data alone
m_true(S) = 0.035 + 0.045*(S/35)^2

# explicit RK4 rollout of a 2-state field; Zygote.Buffer makes it differentiable
function rollout(field, u0, dt, n)
    buf = Zygote.Buffer(zeros(eltype(u0), length(u0), n + 1))
    buf[:, 1] = u0; u = u0
    for k in 1:n
        k1 = field(u);                k2 = field(u .+ 0.5dt .* k1)
        k3 = field(u .+ 0.5dt .* k2); k4 = field(u .+ dt .* k3)
        u  = u .+ (dt/6) .* (k1 .+ 2 .* k2 .+ 2 .* k3 .+ k4); buf[:, k+1] = u
    end
    copy(buf)
end

# ground-truth data: integrate the true model, sample yearly, add 3% noise
true_field(u) = (C = u[1]; S = u[2]; g = graze(C, S);
                 [r_C*C*(1 - C/K) - g,  e*g - m_true(S)*S])
u0    = [60.0, 2.0]
T, dt = 60.0, 0.1
n     = Int(T / dt)
idx   = 1:10:(n + 1)                       # observe once a year
clean = rollout(true_field, u0, dt, n)[:, idx]
Cv, Sv = clean[1, :], clean[2, :]
data   = clean .* (1 .+ 0.03 .* randn(rng, size(clean)))

# the UDE: mortality is a small NN of S; every other term stays known
net = Lux.Chain(Lux.Dense(1 => 16, tanh), Lux.Dense(16 => 16, tanh), Lux.Dense(16 => 1))
p0, stt = Lux.setup(rng, net)
ps = ComponentArray{Float64}(p0)
ps.layer_3.bias .= -2.95                   # softplus(-2.95)≈0.05 → sane initial mortality
mθ(S, p) = log1p(exp(first(net([S], p, stt))[1]))   # softplus keeps mortality ≥ 0

ude_field(u, p) = (C = u[1]; S = u[2]; g = graze(C, S);
                   [r_C*C*(1 - C/K) - g,  e*g - mθ(S, p)*S])
predict(p) = rollout(u -> ude_field(u, p), u0, dt, n)

wC, wS = 1/std(Cv), 1/std(Sv)              # weight coral and COTS misfits equally
function loss(p)
    sol = predict(p)[:, idx]
    mean(abs2, (sol[1, :] .- data[1, :]) .* wC) + mean(abs2, (sol[2, :] .- data[2, :]) .* wS)
end

# train in two stages — the standard SciML recipe:
#   1) Adam: robust first-order steps that get us into the right basin;
#   2) L-BFGS: a quasi-Newton polish that converges sharply once we are close.
function adam!(ps, lr, iters)
    os = Optimisers.setup(Optimisers.Adam(lr), ps)
    for _ in 1:iters
        os, ps = Optimisers.update(os, ps, Zygote.gradient(loss, ps)[1])
    end
    ps
end
ps = adam!(ps, 0.01, 500)                                          # stage 1: Adam
optf = OptimizationFunction((p, _) -> loss(p), Optimization.AutoZygote())
ps   = solve(OptimizationProblem(optf, ps),                        # stage 2: L-BFGS
             OptimizationOptimJL.LBFGS(); maxiters = 300).u
# mθ(·, ps) is now the recovered mortality law; it is plotted against m_true in Figure 2 below.

Why an explicit RK4 here, not the §4.2 adjoint?

§4.2 trained its Neural ODE by backpropagating through OrdinaryDiffEq’s solver with SciMLSensitivity’s adjoint (sensealg = InterpolatingAdjoint(...)), the right tool for stiff or high-dimensional systems, where a hand-written solver would be slow or unstable. Here the system is tiny and non-stiff (two states, smooth), so we instead unroll a fixed-step RK4 loop and let plain Zygote differentiate straight through it (Zygote.Buffer makes the in-place writes differentiable). Two reasons: it keeps the example self-contained (nothing beyond what the course environment already ships, no SciMLSensitivity), and it makes the “backprop through the solver” idea literal: the integrator you differentiate is right there in rollout, not hidden behind a sensealg. For a production-scale UDE you would switch back to the adjoint, exactly as in §4.2.

Train it — 500 Adam steps to drop into the right basin, then an L-BFGS polish — and the network recovers the hidden mortality law to under 10% mean error over the density range the data actually visits:

Figure 2: Left: the trained UDE (lines) tracks the noisy COTS / coral observations (dots). Right: the learned closure (blue dashed) recovers the hidden mortality law m_\star(S) (black) closely inside the density range the trajectories explore (shaded), and is left unconstrained outside it, the honest signature of every UDE result.

That right panel is the whole point of the method. The UDE separates what we trust (the logistic-grazing skeleton) from what we don’t (why COTS adults die), and hands the ecologist a plottable, biologically-interpretable mortality curve — recovered from trajectories alone, agreeing with truth where the data constrains it and openly extrapolating where it doesn’t. That last contrast — faithful interpolation, unconstrained extrapolation — is exactly what makes UDEs honest, and it is the same caution you’ll apply to the learned closures in the capstone.

Fitting the physical constants too

Nothing forces the known scalars (r_C, K, a, h, e) to stay fixed. In practice you rarely know them exactly — and the same trajectory loss can estimate them jointly with the closure: just put them in the parameter vector. With ComponentArrays the packing is one line — a phys block alongside the network weights — and the rest is a handful of mechanical edits to the code above: the constants stop being globals and get read from p.phys, while the closure reads its weights from p.net.

# (1) pack physics + weights together — replaces `ps = ComponentArray{Float64}(p0)`
ps = ComponentArray(phys = (r_C = 0.30, K = 80.0, a = 0.005, h = 0.10, e = 0.40),
                    net  = ComponentArray{Float64}(p0))
ps.net.layer_3.bias .= -2.95                 # NN init now lives under .net

# (2) graze and the closure take their constants/weights from the parameter vector
graze(C, S, a, h) = a*S*C / (1 + a*h*C)
mθ(S, p)          = log1p(exp(first(net([S], p.net, stt))[1]))   # weights ← p.net

# (3) ude_field reads every known constant from p.phys instead of the globals
function ude_field(u, p)
    C, S = u;  ph = p.phys
    g = graze(C, S, ph.a, ph.h)
    [ph.r_C * C * (1 - C/ph.K) - g,  ph.e * g - mθ(S, p) * S]
end

Everything downstream — predict, loss, and the Adam → L-BFGS loop — is unchanged: it only ever calls ude_field(u, p). The loss gradient now flows to both the handful of mechanistic constants and the neural closure, and the two optimisers fit them together. Two cautions: the scalars and the network weights live on very different scales, so standardise them or give the physical block its own learning rate; and watch identifiability — fit too many free constants against too little data and the parameters and the closure trade off against one another. This grey-box estimation — a few interpretable numbers and an unknown function from a single trajectory loss — is exactly the capstone setting in Units 8–10.
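The scale caution can be sketched numerically. The snippet below is an illustration under assumptions, not part of the worked example: it uses plain NamedTuples instead of ComponentArrays, and `PHYS_REF`, `to_scaled` and `to_physical` are hypothetical helpers. The idea is to let the optimiser work on dimensionless multipliers near 1 and convert back to physical values inside the vector field. The learning rate 0.01 is the one passed to `adam!` earlier.

```julia
# Reference magnitudes: here simply the initial guesses from the phys block.
const PHYS_REF = (r_C = 0.30, K = 80.0, a = 0.005, h = 0.10, e = 0.40)

# The optimiser sees dimensionless multipliers θ (all close to 1) ...
to_scaled(ph::NamedTuple)  = map(/, ph, PHYS_REF)
# ... and the model converts back before evaluating the vector field.
to_physical(θ::NamedTuple) = map(*, θ, PHYS_REF)

θ0 = to_scaled(PHYS_REF)                                # every entry is 1.0

# Adam moves each coordinate by roughly lr per step, largely independent of
# the gradient's magnitude, so the relative change depends on the value's scale.
lr = 0.01
rel_step_raw    = map(ref -> lr / abs(ref), PHYS_REF)   # relative change on raw values
rel_step_scaled = map(_ -> lr, PHYS_REF)                # relative change on multipliers

println("raw:    ", rel_step_raw)
println("scaled: ", rel_step_scaled)
```

On the raw values a single step of size 0.01 is twice the magnitude of `a` but a negligible fraction of `K`; on the multipliers every constant moves by about 1% per step. In the code above the analogous change would be to store multipliers in the phys block and convert them at the top of `ude_field`; how that is best written with ComponentArrays is left open here.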

✏️ Section exercise — the functional inverse, in miniature

This exercise is a functional inverse problem, the natural contrast to a parameter inverse. A parameter inverse assumes you know the model’s form and recovers a few scalars; a functional inverse drops a whole term and learns it as a network, recovering a function rather than a number. It is the §4.3 COTS example in miniature: there the unknown was the mortality m(S); here it will be the predator–prey interaction.

Build ude_lv.jl yourself, in three steps:

Data. Solve Lotka–Volterra (\alpha, \beta, \gamma, \delta as in §4.1, t \in [0, 8], 60 samples) and add 2% multiplicative noise.
Model. Keep the terms you trust and replace the predator-growth term +\delta x y with a 2 \to 8 \to 1 network: \dot x = \alpha x - \beta x y, \qquad \dot y = -\gamma y + N_\theta(x, y). Train on the trajectory loss (Adam for a few hundred steps; an L-BFGS polish like §4.3’s is an optional extra).
Read it off. Plot the learned N_\theta(x, y) against the true surface \delta x y over the region the orbit actually visits.

The one question that matters: does N_\theta match \delta x y inside the training orbit’s range, and what does it do outside it? That contrast — faithful interpolation, unconstrained extrapolation — is the honest summary of every UDE result you’ll ever publish.

Grey-box stretch. Following the Fitting the physical constants too block above, also pack one interpretable scalar into the parameter vector and recover it jointly with the closure. Free \beta (the prey-loss rate in the trusted equation \dot x = \alpha x - \beta x y), not \gamma — the worked solution explains why the latter is non-identifiable here.

💡 Hint

Generate data with truth = solve(...) then data = truth .* (1 .+ 0.02f0 .* randn(...)). The UDE rhs keeps the known terms inline and swaps the predator-growth term for the net: du[1] = α*u[1] - β*u[1]*u[2]; du[2] = first(nn(u, p, st))[1] - γ*u[2] — i.e. +\delta xy becomes N_\theta(x,y). For training, either reuse ex-4-2’s adjoint loop (InterpolatingAdjoint) or §4.3’s Zygote.Buffer RK4 rollout, which needs no extra package. For the comparison, evaluate N_θ and δ·x·y on a grid with two comprehensions and heatmap both — then overlay the training orbit with plot! to see exactly where the agreement ends.
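If you would rather check the data step before involving any solver package, the following dependency-free sketch covers step 1 only. The parameter values and initial state are placeholders for illustration (substitute the §4.1 values), and `lv_rk4_step`, `sample_orbit` and `invariant` are hypothetical names. It also evaluates the Lotka–Volterra invariant \delta x - \gamma \ln x + \beta y - \alpha \ln y along the clean orbit as a sanity check on the integrator.

```julia
using Random

# Placeholder values for illustration only: substitute the §4.1 parameters and state.
const α, β, γ, δ = 1.5, 1.0, 3.0, 1.0
u0 = [1.0, 1.0]

lv(u) = [α * u[1] - β * u[1] * u[2], -γ * u[2] + δ * u[1] * u[2]]

# One classical RK4 step of size h.
function lv_rk4_step(u, h)
    k1 = lv(u)
    k2 = lv(u .+ (h / 2) .* k1)
    k3 = lv(u .+ (h / 2) .* k2)
    k4 = lv(u .+ h .* k3)
    u .+ (h / 6) .* (k1 .+ 2 .* k2 .+ 2 .* k3 .+ k4)
end

# Integrate between sample times with a few substeps; returns a 2 × length(ts) matrix.
function sample_orbit(u0, ts; substeps = 20)
    u, cols = copy(u0), [copy(u0)]
    for i in 2:length(ts)
        h = (ts[i] - ts[i-1]) / substeps
        for _ in 1:substeps
            u = lv_rk4_step(u, h)
        end
        push!(cols, u)
    end
    reduce(hcat, cols)
end

ts    = range(0.0, stop = 8.0, length = 60)      # 60 samples on [0, 8]
truth = sample_orbit(u0, ts)
rng   = MersenneTwister(42)                      # fixed seed for reproducibility
data  = truth .* (1 .+ 0.02 .* randn(rng, size(truth)...))   # 2% multiplicative noise

# Conserved along exact Lotka–Volterra orbits; drift measures integrator error.
invariant(u) = δ * u[1] - γ * log(u[1]) + β * u[2] - α * log(u[2])
drift = maximum(abs(invariant(c) - invariant(u0)) for c in eachcol(truth))
println("size = ", size(data), ", invariant drift = ", drift)
```

A small drift indicates the clean trajectory is trustworthy before noise is added. The same check applied to a trained UDE rollout is one way to see whether the learned term respects the structure of the true dynamics.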

Go to solution →

4.4 What comes next

This unit established the trajectory-loss family of physics-aware training: integrate the ODE forward, compare to data, and backpropagate through the solver (via the adjoint in §4.2, or by differentiating an unrolled RK4 loop directly in §4.3). The remaining units add the rest of the toolkit:

Unit 5 introduces the other family: residual-loss training. Instead of integrating, you evaluate the equation residual at random collocation points and train the network to satisfy it. No solver, no adjoint — just autodiff.
Unit 6 provides the PDE background the residual-loss approach assumes (function spaces, weak solutions, well-posedness).
Unit 7 covers what goes wrong in practice with the residual approach (the so-called “PINN failure modes”) and the modern fixes.
Units 8–10 combine both approaches on the capstone column problem: a UDE forward model, then a hybrid residual + trajectory inverse for the unknown driver.

The trajectory-loss skeleton in §4.2 and the residual-loss skeleton in Unit 5 are the two recipes every Sci-ML / PINN paper builds on; understanding when to reach for each is the practical takeaway of these four units.
