Unit 3: Scientific Machine Learning and Physics-Informed Machine Learning — an overview

Published

26/06/2026

Unit 2 gave us a function-approximation view of ML — pick a function class, minimise a loss, generalise. This unit takes that picture and adds physics to it. It is a deliberately broad overview — a guided tour that samples the main physics-aware ML ideas (hybrid models, conservation-aware architectures, equation discovery, reduced-order modelling) at survey depth, to map the landscape rather than drill into any one corner. The very next unit, Unit 4, then goes deep on one of those corners — Neural ODEs and universal differential equations. This unit closes with a conceptual preview of physics-informed machine learning (PIML) — the framing that Units 5–9 implement against the AIMS capstone.

Before any of that, an ODE refresher — the dynamics language we need to start doing meaningful Sci-ML.

3.1 An ODE refresher

An ordinary differential equation (ODE) is a relation between a function and its derivatives in one independent variable (typically time t):

\dot{\mathbf{x}}(t) \;=\; f(\mathbf{x}(t), t),\qquad \mathbf{x}(0) \;=\; \mathbf{x}_0.

We’ll meet partial differential equations (PDEs) — derivatives in more than one variable — in Unit 6. For now everything is an ODE, and everything is an initial-value problem (IVP): given \mathbf{x}_0 at t = 0, integrate forward.

Vocabulary worth keeping straight:

\dot{\mathbf{x}} is the time derivative d\mathbf{x}/dt.
Order — highest derivative appearing. A second-order scalar ODE \ddot{x} = g(x, \dot{x}) rewrites as a first-order system in (x, \dot{x}). Throughout the unit we assume that’s been done.
Autonomous if f has no explicit t dependence (\dot{\mathbf{x}} = f(\mathbf{x})); non-autonomous otherwise.
Linear if f(\mathbf{x}, t) = A(t)\,\mathbf{x} + b(t); nonlinear otherwise. Linear systems with constant A have explicit solutions via matrix exponentials; nonlinear systems generally do not.
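
To make the last bullet concrete, here is a minimal Python sketch. It assumes the harmonic oscillator written as a first-order system, with ω = 2π and release from rest at x = 1 (illustrative values, not fixed by the text). It evaluates the matrix-exponential solution u(t) = e^{At} u_0 with scipy.linalg.expm and compares it with the closed form x(t) = cos(ωt).

```python
import numpy as np
from scipy.linalg import expm

omega = 2 * np.pi                       # assumed frequency: one cycle per unit time
A = np.array([[0.0, 1.0],
              [-omega**2, 0.0]])        # x'' = -omega^2 x  written as  d/dt (x, v) = A (x, v)
u0 = np.array([1.0, 0.0])               # released from rest at x = 1

def linear_solution(t):
    """Exact state (x, v) of the constant-coefficient linear system at time t."""
    return expm(A * t) @ u0

t = 0.3
x_expm = linear_solution(t)[0]
x_closed = np.cos(omega * t)            # known closed form for this initial condition
print(f"expm: {x_expm:.12f}   cos(omega t): {x_closed:.12f}")
```

For a nonlinear f no such formula is available in general, which is why the later examples lean on a numerical solver.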

The next subsections walk a sequence of ODEs in increasing complexity: exponential decay, the undamped harmonic oscillator, a damped oscillator, and then two nonlinear systems, predator–prey (closed cycles) and the chaotic Lorenz attractor. Exponential decay and the damped oscillator come with both Julia and Python scripts, so the ecosystem fluency built in Unit 2 carries over.

The simplest ODE: exponential decay

A scalar, first-order, linear, autonomous ODE — the canonical warm-up:

\dot x \;=\; -\lambda\, x, \qquad x(0) = x_0, \qquad \lambda > 0,

with exact solution x(t) = x_0\, e^{-\lambda t}. It models radioactive decay, RC-circuit discharge, drug clearance from a bloodstream — anything where the rate of change is proportional to the current amount. The point of starting here is that we can compare numerical integration to a known closed-form answer, which makes solver choices concrete.

In Julia, using the OrdinaryDiffEq ecosystem that the rest of the workshop standardises on:

using OrdinaryDiffEq, Plots

λ, x0 = 0.5, 1.0
decay(x, p, t) = -λ * x
prob = ODEProblem(decay, x0, (0.0, 6.0))
sol  = solve(prob, Tsit5(); saveat = 0.05)

plot(sol.t, sol.u; label = "Tsit5 (numerical)", lw = 2,
     xlabel = "t", ylabel = "x(t)",
     title = "Exponential decay: ẋ = -λx,  λ = $λ")
plot!(sol.t, x0 .* exp.(-λ .* sol.t);
      label = "exact: x₀·e^(-λt)", ls = :dash, c = :black)

In Python with scipy.integrate.solve_ivp (the SciPy equivalent of OrdinaryDiffEq):

units/unit_03/scripts/decay_python.py

import numpy as np
from scipy.integrate import solve_ivp

lam, x0 = 0.5, 1.0
sol = solve_ivp(lambda t, x: -lam * x,
                t_span=(0.0, 6.0), y0=[x0],
                method="RK45", dense_output=True)

t = np.linspace(0, 6, 121)
print("max |numerical - exact| =",
      float(np.max(np.abs(sol.sol(t)[0] - x0 * np.exp(-lam * t)))))

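At default tolerances (relative 10^{-3}, absolute 10^{-6} in both libraries) expect a max error of roughly 10^{-4} to 10^{-5}: the solver delivers about the accuracy you ask for, not machine precision, so tighten the tolerances if you need more digits. Either way it is a useful baseline before we ask anything harder.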
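
How close the numerical answer gets is set by the tolerances handed to the solver. A small sketch, assuming the same decay problem and SciPy's RK45; the particular tolerance values and the choice atol = rtol × 10⁻³ are illustrative only:

```python
import numpy as np
from scipy.integrate import solve_ivp

lam, x0 = 0.5, 1.0
t = np.linspace(0.0, 6.0, 121)
exact = x0 * np.exp(-lam * t)

def max_error(rtol, atol):
    """Max |numerical - exact| on the grid for the given solver tolerances."""
    sol = solve_ivp(lambda t, x: -lam * x, (0.0, 6.0), [x0],
                    method="RK45", rtol=rtol, atol=atol, t_eval=t)
    return float(np.max(np.abs(sol.y[0] - exact)))

# sweep the relative tolerance; the absolute tolerance is tied to it for simplicity
errors = {rtol: max_error(rtol, atol=rtol * 1e-3) for rtol in (1e-3, 1e-6, 1e-9)}
for rtol, err in errors.items():
    print(f"rtol = {rtol:.0e}:  max error = {err:.2e}")
```

The same knobs are called reltol and abstol in OrdinaryDiffEq.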

A second-order system: the harmonic oscillator

The simplest non-trivial second-order ODE — a mass on a spring,

\ddot{x} + \omega^2 x = 0 \quad\Longleftrightarrow\quad \begin{pmatrix} \dot{x} \\ \dot{v} \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ -\omega^2 & 0 \end{pmatrix} \begin{pmatrix} x \\ v \end{pmatrix},

with v = \dot{x}. Energy E = \tfrac{1}{2}(v^2 + \omega^2 x^2) is conserved exactly; the phase portrait — trajectories in the (x, v) plane — is a family of concentric ellipses. This is the example the Hamiltonian / Lagrangian neural networks of §3.3 are designed to handle without drifting.

An ODE is a vector field. At every point (x, v) of the phase plane the right-hand side f(x, v) = (v, -\omega^2 x) pins down a little velocity arrow — “if you are here, move that way” — and a solution is simply the curve that follows the arrows from its starting point. Solving the ODE is threading those arrows; the grey field below is f itself, and the coloured orbits thread through it. (This is the same “the network is a vector field” picture that Unit 4 §4.2 builds Neural ODEs on.)

using OrdinaryDiffEq, Plots

ω = 2π                                # one cycle per second
oscillator(u, p, t) = [u[2], -ω^2 * u[1]]

u0s   = [0.5, 1.0, 1.5, 2.0]          # starting positions (all released from rest)
tspan = (0.0, 1.0)                    # exactly one period
pal   = cgrad(:viridis, length(u0s), categorical = true)

plt = plot(xlabel = "position  x", ylabel = "velocity  v",
           title = "Harmonic oscillator — phase portrait",
           legend = :outerright, size = (720, 460),
           framestyle = :box, gridalpha = 0.25)

# the ODE drawn as a vector field: an arrow f(x,v) = (v, -ω²x) at each grid point.
# directions are normalised in screen units (x and v have very different scales)
# so the arrows show *flow direction*, not raw magnitude; orbits are tangent to them.
bx = Float64[]; bv = Float64[]; qx = Float64[]; qv = Float64[]
Rx, Rv, len = 4.4, 27.0, 0.05          # axis spans + screen-normalised arrow length
for x in range(-1.9, 1.9; length = 9), v in range(-11.5, 11.5; length = 7)
    sx, sv = v / Rx, (-ω^2 * x) / Rv
    n = hypot(sx, sv); n == 0 && continue
    push!(bx, x); push!(bv, v)
    push!(qx, sx/n * Rx * len); push!(qv, sv/n * Rv * len)
end
quiver!(plt, bx, bv, quiver = (qx, qv), c = :gray75, lw = 0.8, label = "")

for (k, a) in enumerate(u0s)
    sol = solve(ODEProblem(oscillator, [a, 0.0], tspan), Tsit5(); saveat = 0.005)
    plot!(plt, sol[1, :], sol[2, :], lw = 2.6, c = pal[k], label = "x(0) = $a")
    scatter!(plt, [a], [0.0], c = pal[k], ms = 5, msw = 0, label = "")   # release point
end
annotate!(plt, 0.0, 11.0, text("↻  clockwise flow", 9, :gray))
plt

Each closed loop is a constant-energy orbit, everywhere tangent to the grey vector-field arrows; the dots mark where the mass is released from rest. The flow runs clockwise (the grey arrow): released at positive x with zero velocity, the restoring force drives the velocity negative first, so the orbit sweeps downward. The loops never cross or decay — energy is conserved exactly, which is precisely the structure the Hamiltonian networks of §3.3 are built to respect.
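
A quick numerical check of the conservation claim, sketched in Python under the same assumptions as the Julia block (ω = 2π, release from rest at x = 1). The oscillator itself conserves E exactly; a general-purpose Runge–Kutta solver only conserves it to within its tolerances, so the sketch asks for tight ones. The DOP853 method and the tolerance values are choices made for illustration:

```python
import numpy as np
from scipy.integrate import solve_ivp

omega = 2 * np.pi

def oscillator(t, u):
    """First-order form of x'' + omega^2 x = 0 with u = (x, v)."""
    x, v = u
    return [v, -omega**2 * x]

def energy(x, v):
    """E = (v^2 + omega^2 x^2) / 2, the conserved quantity from the text."""
    return 0.5 * (v**2 + omega**2 * x**2)

sol = solve_ivp(oscillator, (0.0, 10.0), [1.0, 0.0],
                method="DOP853", rtol=1e-10, atol=1e-12,
                t_eval=np.linspace(0.0, 10.0, 1001))
E = energy(sol.y[0], sol.y[1])
rel_drift = float(np.max(np.abs(E - E[0])) / E[0])
print(f"max relative energy drift over 10 periods: {rel_drift:.2e}")
```

Loosen rtol and the reported drift grows with it: the conservation law belongs to the equation, and a solver or a learned model inherits it only if it is built to.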

The second-order equation, two equivalent codings. The phase portrait above integrated the oscillator as a first-order system in the two coordinates (x, v) — the right-hand column of the \Longleftrightarrow at the top of this section. That rewrite is the standard move, but it is worth seeing both forms in code side by side, because OrdinaryDiffEq can also take the second-order equation directly via SecondOrderODEProblem, where you write the acceleration \ddot{x} as a function of \dot{x} and x. We show the general damped equation \ddot{x} + 2\zeta\omega\,\dot{x} + \omega^2 x = 0 so \zeta = 0 is exactly the undamped oscillator above and \zeta > 0 is the next subsection; both codings integrate to the same trajectory:

using OrdinaryDiffEq

ω, ζ          = 2π, 0.1                 # ζ = 0 recovers ẍ + ω²x = 0 above
x0, v0, tspan = 1.0, 0.0, (0.0, 3.0)

# (1) first-order system in two coordinates u = (x, v), with v = ẋ
firstorder(u, p, t) = [u[2], -ω^2 * u[1] - 2ζ * ω * u[2]]
sol1 = solve(ODEProblem(firstorder, [x0, v0], tspan), Tsit5(); saveat = 0.01)

# (2) the literal second-order ODE  ẍ = -ω²x - 2ζω ẋ, written as the acceleration
accel(dx, x, p, t) = -ω^2 * x - 2ζ * ω * dx
sol2 = solve(SecondOrderODEProblem(accel, v0, x0, tspan), Tsit5(); saveat = 0.01)
# SecondOrderODEProblem stores the state as (ẋ, x): velocity first, position second

x1 = [u[1] for u in sol1.u]            # position from the first-order system
x2 = [u[2] for u in sol2.u]            # position from the second-order problem
maximum(abs, x1 .- x2)                 # ≈ 0 — the two codings agree

0.0

OrdinaryDiffEq.Tsit5() is a 5th-order explicit Runge–Kutta solver — a sensible default for non-stiff problems. For stiff systems (e.g. fast diffusion coupled to slow advection) one picks an implicit solver like QNDF() or Rodas5P() instead. Choosing the solver is most of classical numerics; in Unit 4 we’ll let the network learn the dynamics directly and side-step the choice.
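
The stiff versus non-stiff distinction is easy to see on a toy problem. The sketch below is an illustration under stated assumptions, not one of the unit's scripts: a scalar ODE that relaxes at a fast rate k = 1000 onto the slow target cos t, solved once with SciPy's explicit RK45 and once with its implicit Radau method (standing in here for the implicit Julia solvers named above). It counts right-hand-side evaluations:

```python
import numpy as np
from scipy.integrate import solve_ivp

k = 1000.0                                   # assumed stiffness: fast relaxation rate

def relax(t, y):
    """y relaxes quickly (rate k) onto the slowly varying target cos(t)."""
    return [-k * (y[0] - np.cos(t))]

def slow_solution(t):
    """Exact solution once the fast transient e^{-k t} has died out."""
    return (k**2 * np.cos(t) + k * np.sin(t)) / (k**2 + 1)

results = {}
for method in ("RK45", "Radau"):             # explicit vs implicit, default tolerances
    sol = solve_ivp(relax, (0.0, 10.0), [0.0], method=method)
    results[method] = sol
    err = abs(sol.y[0, -1] - slow_solution(10.0))
    print(f"{method:>5}: {sol.nfev:6d} right-hand-side calls, error at t = 10: {err:.1e}")
```

The explicit method is forced to take steps sized by the fast rate k even after the transient has died out, while the implicit one can step at the pace of the slow solution.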

Adding damping: the three regimes

Real oscillators bleed energy. Add a velocity-proportional drag 2\zeta\omega\dot x and the equation becomes

\ddot{x} + 2\zeta\omega\, \dot{x} + \omega^2 x = 0,

with damping ratio \zeta \geq 0 — the dimensionless number that controls how the oscillator decays:

0 < \zeta < 1: underdamped. Oscillation with an exponential envelope e^{-\zeta\omega t}. The default for, e.g., a car suspension or a ringing bell. (\zeta = 0 is the undamped oscillator of the previous subsection.)
\zeta = 1: critically damped. Fastest return to zero with no overshoot. The design target for, e.g., a closing door or a feedback controller.
\zeta > 1: overdamped. Slow exponential return; no oscillation. Heavy honey on a spring.
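
The three regimes fall out of the characteristic equation: substituting x = e^{st} gives s² + 2ζω s + ω² = 0, so s = ω(−ζ ± √(ζ² − 1)), and the sign of ζ² − 1 decides whether the roots are complex (oscillation) or real (pure decay). A short Python sketch of that classification; the helper names are illustrative:

```python
import numpy as np

def characteristic_roots(zeta, omega):
    """Roots of s^2 + 2*zeta*omega*s + omega^2 = 0 (from the ansatz x = e^{s t})."""
    return np.roots([1.0, 2.0 * zeta * omega, omega**2])

def regime(zeta):
    """Classify the oscillator by the sign of the discriminant zeta^2 - 1."""
    if zeta == 0.0:
        return "undamped"
    if zeta < 1.0:
        return "underdamped"
    if zeta == 1.0:
        return "critically damped"
    return "overdamped"

omega = 2 * np.pi
for zeta in (0.1, 1.0, 2.0):
    roots = characteristic_roots(zeta, omega)
    print(f"zeta = {zeta}: {regime(zeta):17s} roots = {np.round(roots, 3)}")
```

Complex roots carry the real part −ζω, which is the envelope e^{−ζωt} quoted in the list; real roots give two pure decay rates, the slower of which dominates the overdamped return.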

A single Julia script overlays the three regimes:

using OrdinaryDiffEq, Plots

ω = 2π
function damped(u, p, t)
    ζ = p[1]
    [u[2], -ω^2 * u[1] - 2ζ * ω * u[2]]
end

u0  = [1.0, 0.0]
ts  = (0.0, 3.0)
plt = plot(xlabel = "t", ylabel = "x(t)",
           title = "Damped oscillator: ẍ + 2ζω ẋ + ω²x = 0",
           legend = :topright)
for (ζ, label) in [(0.1, "ζ=0.1 (under)"),
                   (1.0, "ζ=1.0 (critical)"),
                   (2.0, "ζ=2.0 (over)")]
    sol = solve(ODEProblem(damped, u0, ts, [ζ]), Tsit5(); saveat = 0.01)
    plot!(plt, sol.t, [u[1] for u in sol.u], label = label, lw = 2)
end
plt

Same model in Python — same shape:

units/unit_03/scripts/damped_python.py

import numpy as np
from scipy.integrate import solve_ivp

omega = 2 * np.pi
def damped(t, u, zeta):
    return [u[1], -omega**2 * u[0] - 2 * zeta * omega * u[1]]

for zeta in (0.1, 1.0, 2.0):
    sol = solve_ivp(damped, (0.0, 3.0), [1.0, 0.0],
                    args=(zeta,), t_eval=np.linspace(0, 3, 301))
    print(f"zeta={zeta}: x(3.0) = {sol.y[0, -1]:+.4f}")

Notice \zeta = 1 minimises the time to “land” near zero — a fact control engineers exploit. We’ll come back to damped oscillators when we discuss the inverse problem in Unit 4: recovering \zeta from a noisy trajectory is the simplest possible “physics parameter from data” problem and a useful sanity check before attempting the AIMS capstone in Units 8–10.

Peek inside the solver: the Euler scheme

Every example so far handed the dynamics to solve and trusted the black box. The damped oscillator is a good place to open it up with the simplest integrator, forward Euler — partly because the very same one-line update is what links Neural ODEs to ResNets in Unit 4 §4.2.

First, write the second-order equation as a first-order system in the two coordinates \mathbf{u} = (x, v) with v = \dot x — exactly the trick the harmonic oscillator used above, now with the damping term carried into the matrix:

\ddot x + 2\zeta\omega\,\dot x + \omega^2 x = 0 \quad\Longleftrightarrow\quad \frac{d}{dt}\begin{pmatrix} x \\ v \end{pmatrix} = \begin{pmatrix} v \\ -\omega^2 x - 2\zeta\omega\, v \end{pmatrix} = \underbrace{\begin{pmatrix} 0 & 1 \\ -\omega^2 & -2\zeta\omega \end{pmatrix}}_{A} \begin{pmatrix} x \\ v \end{pmatrix}.

With the system in the form \dot{\mathbf{u}} = f(\mathbf{u}), forward Euler just replaces the derivative by its forward finite difference (§2.7) and rearranges — march from \mathbf{u}_n \approx \mathbf{u}(t_n) to the next sample by a single nudge along the vector field:

\mathbf{u}_{n+1} \;=\; \mathbf{u}_n \;+\; \Delta t\, f(\mathbf{u}_n), \qquad t_n = n\,\Delta t.

That is the whole method: take the current state, add a small step in the direction the field points. (Hold onto this line — a ResNet layer is exactly one Euler step of a learned field f_\theta, which is the bridge to Neural ODEs in Unit 4 §4.2.) Coded by hand and raced against Tsit5:

using OrdinaryDiffEq, Plots

ω, ζ = 2π, 0.3
f(u) = [u[2], -ω^2 * u[1] - 2ζ * ω * u[2]]        # the first-order system, u = (x, v)

# forward Euler, by hand:  uₙ₊₁ = uₙ + Δt · f(uₙ)
function euler(f, u0, dt, T)
    us = [u0]
    for _ in 1:round(Int, T / dt)
        push!(us, us[end] .+ dt .* f(us[end]))
    end
    reduce(hcat, us)
end

T, u0 = 3.0, [1.0, 0.0]
ref = solve(ODEProblem((u, p, t) -> f(u), u0, (0.0, T)), Tsit5(); saveat = 0.01)

plt = plot(ref.t, [u[1] for u in ref.u]; label = "Tsit5 (reference)", lw = 3, c = :black,
           xlabel = "t", ylabel = "x(t)", legend = :topright,
           title = "Forward Euler vs a real solver (ζ = 0.3)")
for dt in (0.1, 0.04, 0.01)
    U = euler(f, u0, dt, T)
    plot!(plt, range(0, T; length = size(U, 2)), U[1, :]; lw = 1.8, label = "Euler Δt = $dt")
end
plt

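At \Delta t = 0.1 the Euler curve is already badly off: that step sits just above forward Euler’s stability limit for this oscillator (\Delta t < 2\zeta/\omega \approx 0.095), so the amplitude creeps upward instead of decaying. Shrinking the step to 0.04 and then 0.01 brings the curve progressively onto the reference. A real solver like Tsit5 buys that accuracy with a far better per-step rule and automatic step-size control, so you rarely set \Delta t by hand.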

Try it. Rerun the loop at \Delta t = 0.2 and watch the amplitude grow instead of decay: forward Euler injects a little spurious energy each step, so on an oscillator a too-coarse step is not merely inaccurate but unstable — the trajectory blows up. Then shrink \Delta t until the Euler curve sits on the Tsit5 reference, and compare the step counts: that is the price of the simplest possible integrator.
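
The instability can be predicted without running the loop. For a linear system one Euler step is multiplication by I + Δt A, so the amplitude grows whenever that matrix has an eigenvalue of modulus greater than 1; for the underdamped oscillator this works out to the limit Δt < 2ζ/ω. A minimal Python sketch, assuming the same ω and ζ as the Euler demo (the helper name growth_per_step is illustrative):

```python
import numpy as np

omega, zeta = 2 * np.pi, 0.3                      # same values as the Euler demo
A = np.array([[0.0, 1.0],
              [-omega**2, -2 * zeta * omega]])    # damped oscillator, u = (x, v)

def growth_per_step(dt):
    """Spectral radius of the Euler update matrix I + dt*A (> 1 means growth)."""
    return float(np.max(np.abs(np.linalg.eigvals(np.eye(2) + dt * A))))

dt_limit = 2 * zeta / omega                       # where |1 + dt*lambda| = 1 (valid for zeta < 1)
print(f"stability limit: dt < {dt_limit:.4f}")
for dt in (0.2, 0.1, 0.04, 0.01):
    rho = growth_per_step(dt)
    verdict = "grows" if rho > 1 else "decays"
    print(f"dt = {dt:<5}: growth per step = {rho:.4f}  ({verdict})")
```

Note that a step below the limit is only stable, not accurate: the exact solution shrinks by e^{−ζωΔt} per step, and Euler matches that factor only as Δt becomes small.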

Nonlinearity: predator and prey

Everything so far has been linear — f(\mathbf{x}) = A\mathbf{x}, solvable in closed form with a matrix exponential. Real systems are usually nonlinear, and nonlinearity changes the qualitative behaviour, not just the numbers. The classic example is Lotka–Volterra, a two-species predator–prey model:

\dot{R} = \alpha R - \beta R F, \qquad \dot{F} = \delta R F - \gamma F,

with prey R (rabbits) and predators F (foxes). The RF terms — prey eaten in proportion to both populations — are what make it nonlinear: you cannot write the right-hand side as a matrix times \mathbf{x}, and solutions no longer superpose.

using OrdinaryDiffEq, Plots

function lotka(u, p, t)
    α, β, δ, γ = p
    R, F = u
    [α*R - β*R*F, δ*R*F - γ*F]
end

p  = (1.5, 1.0, 1.0, 3.0)            # prey growth, predation, predator gain, predator death
u0 = [1.0, 1.0]
sol = solve(ODEProblem(lotka, u0, (0.0, 12.0), p), Tsit5(); saveat = 0.02)

p1 = plot(sol.t, [u[1] for u in sol.u], lw = 2.6, c = :seagreen, label = "prey (rabbits)",
          xlabel = "time  t", ylabel = "population", legend = :topright,
          title = "Predator–prey (Lotka–Volterra)", framestyle = :box, gridalpha = 0.25)
plot!(p1, sol.t, [u[2] for u in sol.u], lw = 2.6, c = :firebrick, label = "predator (foxes)")

p2 = plot([u[1] for u in sol.u], [u[2] for u in sol.u], lw = 2.4, c = :purple,
          xlabel = "prey", ylabel = "predator", title = "phase portrait (closed orbit)",
          legend = false, framestyle = :box, gridalpha = 0.25)
scatter!(p2, [u0[1]], [u0[2]], c = :black, ms = 5, msw = 0)
plot(p1, p2, layout = (1, 2), size = (960, 400), bottom_margin = 4Plots.mm, left_margin = 4Plots.mm)

The populations cycle forever, out of phase: prey boom, predators follow, prey crash, predators starve, prey recover. In the phase plane that is a closed orbit, traced counter-clockwise (prey lead, predators lag a quarter-cycle behind) — like the oscillator’s ellipses, but bent by the nonlinearity. Systems like this generally have no closed-form solution, which is exactly why we reach for numerical solvers (and, later, networks that learn the dynamics directly). We meet this same predator–prey system again in Unit 4 §4.1, where its closed orbits turn out to hide a conserved quantity — making it a standard testbed for the structure-preserving methods of this unit.
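
A Python counterpart, sketched with the same parameter values as the Julia block. As a preview of the conserved quantity mentioned above, it also evaluates V = δR − γ ln R + βF − α ln F along the orbit. This is the standard first integral of the Lotka–Volterra equations, quoted here as an assumption rather than derived in this unit, and the solver settings are illustrative:

```python
import numpy as np
from scipy.integrate import solve_ivp

alpha, beta, delta, gamma = 1.5, 1.0, 1.0, 3.0    # same values as the Julia block

def lotka(t, u):
    """Predator-prey right-hand side with prey R and predators F."""
    R, F = u
    return [alpha * R - beta * R * F, delta * R * F - gamma * F]

def invariant(R, F):
    """V = delta*R - gamma*ln(R) + beta*F - alpha*ln(F), constant along exact orbits."""
    return delta * R - gamma * np.log(R) + beta * F - alpha * np.log(F)

sol = solve_ivp(lotka, (0.0, 12.0), [1.0, 1.0], method="DOP853",
                rtol=1e-10, atol=1e-12, t_eval=np.linspace(0.0, 12.0, 601))
V = invariant(sol.y[0], sol.y[1])
drift = float(np.max(np.abs(V - V[0])))
fixed_point = (gamma / delta, alpha / beta)       # coexistence equilibrium the orbit circles
print(f"coexistence point (R*, F*) = {fixed_point}")
print(f"max drift of V along the orbit: {drift:.2e}")
```

V stays constant to within solver tolerance, which is why the phase-plane curve closes on itself instead of spiralling in or out.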

Chaos: the Lorenz system

Turn the nonlinearity up and you can get chaos — bounded, deterministic motion that never repeats and depends sensitively on the starting point. Edward Lorenz’s 1963 three-variable model of atmospheric convection is the iconic example:

\dot{x} = \sigma(y - x), \qquad \dot{y} = x(\rho - z) - y, \qquad \dot{z} = x y - \beta z.

using OrdinaryDiffEq, Plots

function lorenz(u, p, t)
    σ, ρ, β = p
    x, y, z = u
    [σ*(y - x), x*(ρ - z) - y, x*y - β*z]
end

sol = solve(ODEProblem(lorenz, [1.0, 1.0, 1.0], (0.0, 60.0), (10.0, 28.0, 8/3)),
            Tsit5(); saveat = 0.005)
xs = [u[1] for u in sol.u]; ys = [u[2] for u in sol.u]; zs = [u[3] for u in sol.u]
plot(xs, ys, zs; lw = 0.7, line_z = sol.t, c = cgrad(:viridis), colorbar = false,
     legend = false, xlabel = "x", ylabel = "y", zlabel = "z",
     title = "Lorenz attractor (σ=10, ρ=28, β=8/3)", size = (720, 560), camera = (30, 25))

The trajectory winds forever around two lobes — the Lorenz attractor, the original “butterfly”. Two trajectories starting a millimetre apart diverge exponentially, which is why weather is unpredictable beyond a week or so. We meet this exact system again in §3.4, where SINDy recovers these three equations from data alone.
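
Sensitive dependence is easy to measure. The sketch below, in Python with the same σ, ρ, β, integrates two copies of the system whose initial x values differ by 10⁻⁸ and prints how far apart they are at a few times. The size of the perturbation, the tolerances, and the horizon of 40 time units are arbitrary illustrative choices:

```python
import numpy as np
from scipy.integrate import solve_ivp

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0

def lorenz(t, u):
    x, y, z = u
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t_eval = np.linspace(0.0, 40.0, 4001)
opts = dict(method="DOP853", rtol=1e-10, atol=1e-12, t_eval=t_eval)
a = solve_ivp(lorenz, (0.0, 40.0), [1.0, 1.0, 1.0], **opts)
b = solve_ivp(lorenz, (0.0, 40.0), [1.0 + 1e-8, 1.0, 1.0], **opts)   # tiny nudge in x

separation = np.linalg.norm(a.y - b.y, axis=0)    # distance between the two runs over time
for t_probe in (0.0, 10.0, 20.0, 30.0, 40.0):
    i = int(np.argmin(np.abs(t_eval - t_probe)))  # nearest saved sample
    print(f"t = {t_probe:4.1f}: separation = {separation[i]:.2e}")
```

The separation grows by many orders of magnitude before saturating at the size of the attractor, after which the two runs are effectively unrelated.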

Why ODEs matter for the rest of the unit

Every Sci-ML idea in §§3.2–3.5 starts from an ODE picture:

Hybrid models replace part of f with a neural network.
Hamiltonian / Lagrangian NNs parametrise an energy function and derive f from it.
SINDy assumes f is a sparse combination of library terms and recovers it from data.
PINNs turn the IVP itself into a loss — they approximate the solution map t \mapsto \mathbf{x}(t) by a neural network.

Keep the picture \dot{\mathbf{x}} = f(\mathbf{x}, t) in mind as we go through each.

✏️ Section exercise — find the fastest landing

The text claims \zeta = 1 minimises the time to “land” near zero. Test it. Define the settling time as the first time after which |x(t)| stays below 0.02 forever (numerically: below 0.02 for the rest of a 10-second solve). Sweep \zeta \in \{0.1, 0.2, \ldots, 2.0\} with the damped-oscillator code above, compute the settling time for each, and plot it against \zeta. Where is the minimum — exactly at \zeta = 1, or slightly below? (Control engineers will tell you the answer is ~0.7 for this tolerance; find out whether they’re right.)

💡 Hint

Reuse the §3.1 damped function unchanged. For one ζ: solve on (0.0, 10.0) with a fine saveat (0.001 is plenty), pull x = [u[1] for u in sol.u], then with tol = 0.02 the settling time is the time of the last excursion outside the band: sol.t[findall(abs.(x) .> tol)[end]]. Wrap that in a settling_time(ζ) function, broadcast over ζs = 0.1:0.05:2.0, and read the minimum with ζs[argmin(ts)] before plotting. Expect it near ζ ≈ 0.7–0.8, not 1.0 — then shrink tol and watch the minimum migrate toward 1.

Go to solution →

3.2 The SciML landscape

Hybrid physics + ML

A hybrid (or universal) model uses a known physical equation for the parts of the dynamics we understand and a neural network for the parts we don’t. Concretely, swap f in \dot{\mathbf{x}} = f(\mathbf{x}, t) for a sum of known and learned pieces:

\dot{\mathbf{x}} = \underbrace{f_{\text{phys}}(\mathbf{x}, t)}_{\text{known}} \;+\; \underbrace{N_\theta(\mathbf{x}, t)}_{\text{learned}}.

Example: a population model with known logistic growth and an unknown immigration term modelled by a small NN; train \theta against observed time series. Compared with pure ML, the hybrid gains data efficiency and better out-of-distribution extrapolation; compared with pure physics, it gains flexibility where the physics is genuinely unknown. Universal differential equations (UDEs) in Unit 4 are exactly this construction.

In code the shape is short — a known physics term plus a small network spliced into the same right-hand side, handed to an ordinary ODE solver:

# Illustrative skeleton — the full *trainable* version is Unit 4 §4.3.
using Lux, OrdinaryDiffEq, Random

r, K = 1.2, 10.0                                  # known logistic growth
immigration = Lux.Chain(Lux.Dense(1 => 8, tanh), Lux.Dense(8 => 1))   # learned term Nθ
ps, st = Lux.setup(Random.default_rng(), immigration)

function hybrid!(du, u, p, t)
    known   = r * u[1] * (1 - u[1] / K)           # the physics we trust
    learned = first(immigration([u[1]], p, st))[1]  # the bit we don't
    du[1] = known + learned
end

prob = ODEProblem(hybrid!, [1.0], (0.0, 10.0), ps)
sol  = solve(prob, Tsit5())     # a forward pass; training then fits θ to observed data

Training fits only \theta (the network) while the logistic term stays fixed — so the model can bend to data without discarding the physics you already know. Unit 4 §4.3 builds this out into a worked domain example: a Crown-of-Thorns starfish / coral-reef model on the Great Barrier Reef.

Surrogate models for expensive simulations

A surrogate replaces an expensive PDE / ODE solver with a fast learned approximation. Train on a dataset of (input, solver output) pairs; deploy in design optimisation, uncertainty quantification, or real-time inference. The catch: surrogates are only trustworthy inside the distribution they were trained on. Out-of-distribution inputs can produce confident-but-wrong outputs.
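
A toy version of the workflow, sketched in Python. Everything here is an illustrative assumption: the "expensive" solver is just the decay ODE from §3.1 with the decay rate λ as its input, the surrogate is a degree-5 polynomial rather than a neural network, and the training range [0.2, 1.0] is arbitrary. The point is the final comparison of the error inside and outside the training range:

```python
import numpy as np
from scipy.integrate import solve_ivp

def expensive_solver(lam, T=6.0):
    """Stand-in for a costly simulation: x(T) for x' = -lam*x, x(0) = 1."""
    sol = solve_ivp(lambda t, x: -lam * x, (0.0, T), [1.0], rtol=1e-10, atol=1e-12)
    return float(sol.y[0, -1])

# training set: decay rates inside the assumed design range [0.2, 1.0]
lam_train = np.linspace(0.2, 1.0, 25)
x_train = np.array([expensive_solver(lam) for lam in lam_train])

# the "surrogate": a degree-5 least-squares polynomial (a deliberately simple choice)
surrogate = np.poly1d(np.polyfit(lam_train, x_train, deg=5))

lam_in = np.linspace(0.2, 1.0, 200)                       # inside the training range
err_in = float(np.max(np.abs(surrogate(lam_in) - np.exp(-6.0 * lam_in))))
lam_out = 3.0                                             # far outside it
err_out = float(abs(surrogate(lam_out) - np.exp(-6.0 * lam_out)))
print(f"max error inside [0.2, 1.0]:        {err_in:.1e}")
print(f"error at lam = {lam_out} (outside range): {err_out:.1e}")
```

Nothing in the fitted model signals that λ = 3 is out of range; it returns a number just as readily as it does inside the training set, which is the confident-but-wrong failure described above.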

Conservation, symmetries, and inductive biases

A neural network that cannot violate conservation of mass, energy, or momentum tends to generalise better than one that has to learn the constraint from data. Inductive biases that enforce conservation:

Hard architectural constraints — output a flux, then take its divergence; output an energy, then take its gradient (HNNs in §3.3 work this way).
Soft loss penalties — add a violation term to the loss.
Equivariant architectures — networks whose outputs transform correctly under symmetries (translation, rotation, gauge).

The pattern recurs throughout the unit: the more physics we bake in, the less data we need.
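
The hard-constraint bullet above can be checked in a few lines. In the sketch below (Python, one space dimension, periodic grid) each cell is updated by the difference of the fluxes through its two faces, so whatever the flux function returns, total mass cannot change. The function learned_flux is a hypothetical stand-in for a trained network, and the grid size is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, dx = 64, 1.0 / 64
u = rng.random(n_cells)                            # cell averages of a conserved density

def learned_flux(u_left, u_right):
    """Placeholder for a network: any function of the two neighbouring cells works."""
    return np.tanh(3.0 * u_left - 2.0 * u_right) + 0.5 * u_left * u_right

def rhs(u):
    """du_i/dt = -(F_{i+1/2} - F_{i-1/2}) / dx on a periodic grid."""
    flux_right = learned_flux(u, np.roll(u, -1))   # F_{i+1/2}: face between cell i and i+1
    flux_left = np.roll(flux_right, 1)             # F_{i-1/2}: the left neighbour's right face
    return -(flux_right - flux_left) / dx

mass_rate = float(np.sum(rhs(u)) * dx)             # d/dt of total mass
print(f"rate of change of total mass: {mass_rate:.2e}")
```

Every face flux enters the sum once with each sign, so the total cancels to rounding error. The conservation law holds by construction and never has to be learned.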

✏️ Section exercise — pick the right tool

For each scenario below, decide which SciML pattern fits best — hybrid model, surrogate, or conservation-aware architecture — and say in one sentence what the “physics part” and the “ML part” would be:

A wind-farm operator needs aerodynamic loads for 10 000 candidate turbine layouts; each CFD evaluation takes 8 hours.
A pharmacokinetic model has well-understood drug clearance but an unknown absorption mechanism that varies between patients.
A satellite-dynamics propagator slowly drifts in total energy over multi-year simulations, corrupting conjunction forecasts.
A coastal model needs the current field between expensive simulation runs, but its predictions are only ever queried inside the simulated parameter range.

💡 Hint

Classify by bottleneck: is the problem cost per query (→ surrogate), an unknown term inside known dynamics (→ hybrid), or a violated invariant (→ conservation-aware architecture)? Scenario 4’s stated guarantee about where the model is queried is itself a clue.

Go to solution →

3.3 Hamiltonian and Lagrangian neural networks

The construction

A Hamiltonian Neural Network (HNN) (Greydanus et al., 2019) parametrises the Hamiltonian H_\theta(q, p) of a mechanical system with a neural network, then derives the dynamics via Hamilton’s equations with autodiff. Here q is the generalised position (an angle, a displacement) and p is the generalised (conjugate) momentum; together they are the system’s canonical coordinates, and the scalar H(q, p) is its total energy. For the pendulum below, q is the angle from vertical and p is the angular momentum. The dynamics follow from

\dot{q} \;=\; \frac{\partial H_\theta}{\partial p}, \qquad \dot{p} \;=\; -\,\frac{\partial H_\theta}{\partial q}.

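The learned energy H_\theta is conserved exactly along the learned flow, with no soft penalty needed: substituting Hamilton’s equations into the chain rule gives \dot H_\theta = \partial_q H_\theta\,\dot q + \partial_p H_\theta\,\dot p = \partial_q H_\theta\,\partial_p H_\theta - \partial_p H_\theta\,\partial_q H_\theta = 0. How closely H_\theta matches the true energy still depends on training. Train on observed trajectories by minimising the residual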

\mathcal{L}(\theta) \;=\; \sum_i \left\| \begin{pmatrix} \dot q_i \\[2pt] \dot p_i \end{pmatrix} - \begin{pmatrix} \partial_p H_\theta(q_i, p_i) \\[2pt] -\partial_q H_\theta(q_i, p_i) \end{pmatrix} \right\|^2.
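
The conservation property does not depend on training at all, which a small Python sketch can show. It assumes a one-hidden-layer tanh network with random, untrained weights as H_θ and writes the input-gradient out by hand (in practice autodiff supplies it); all names and sizes are illustrative. The field built from Hamilton's equations is perpendicular to ∇H_θ, so H_θ stays constant along a rollout:

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)   # untrained, random weights
w2 = rng.normal(size=16) / 4.0

def H_theta(z):
    """Scalar 'energy' of a one-hidden-layer tanh network, z = (q, p)."""
    return float(w2 @ np.tanh(W1 @ z + b1))

def grad_H(z):
    """Analytic input-gradient (dH/dq, dH/dp) of H_theta."""
    h = np.tanh(W1 @ z + b1)
    return W1.T @ (w2 * (1.0 - h**2))

def hnn_field(t, z):
    """Hamilton's equations: q' = dH/dp, p' = -dH/dq."""
    dH_dq, dH_dp = grad_H(z)
    return [dH_dp, -dH_dq]

z0 = np.array([1.0, 0.0])
power = float(grad_H(z0) @ np.array(hnn_field(0.0, z0)))  # dH/dt along the field at z0
sol = solve_ivp(hnn_field, (0.0, 5.0), z0, method="DOP853", rtol=1e-10, atol=1e-12,
                t_eval=np.linspace(0.0, 5.0, 101))
drift = max(abs(H_theta(z) - H_theta(z0)) for z in sol.y.T)
print(f"grad(H) . field at z0 = {power:.1e},  max |H_theta drift| on rollout = {drift:.1e}")
```

Training changes which function H_θ is, not whether it is conserved; the loss above only decides how closely the conserved quantity resembles the true energy.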

Contrast: what a plain MLP would do. Before going further, picture the obvious alternative — the baseline we race the HNN against below. A vanilla MLP skips the energy entirely: it takes the state (q, p) straight in and is trained to predict the field (\dot q, \dot p) straight out, as ordinary input→output regression on the same noisy samples. So the network is the vector field, and what it learns is just “at this (q,p), move this way.” Two inputs, two outputs, nothing else imposed. Crucially, nothing in that setup forces the learned field to be the gradient of any energy — there is no H_\theta behind it — so the recovered dynamics carry no conservation law and, on a long rollout, quietly drift. The HNN sees the exact same data; the only difference is what the network represents — one scalar energy H_\theta, differentiated into the field — and that single structural choice is what keeps energy from leaking.

Lagrangian Neural Networks (LNNs) (Cranmer et al., 2020) do the analogous trick with the Lagrangian L_\theta(q, \dot q) and the Euler–Lagrange equations.

Autodiff recall. This is the autodiff machinery from §2.7 in a new costume — worth pausing on, because it is exactly the pattern PINNs use. Reading the field off H_\theta needs its gradient with respect to the inputs (q, p) — only two of them — which is the cheap direction for forward mode (ForwardDiff), the same reason a PINN takes its spatial derivatives in forward mode. Training then needs the gradient of the loss with respect to the parameters — many of them, one scalar loss — the cheap direction for reverse mode. And because a derivative of the network sits inside the loss, the two differentiations nest: a forward-mode input-gradient wrapped in a reverse-mode parameter-gradient, the forward-over-reverse composition of §2.7. (The script below shows both sides: the vanilla baseline trains its parameters with reverse-mode Zygote, but the HNN’s parameter gradient has to fall back to ForwardDiff — reverse mode silently returns nothing when asked to differentiate through the inner forward-mode input-gradient, the exact nesting caveat flagged in §5.3 and Solution 2.7, and a big part of why real PINNs lean on NeuralPDE.jl.) Same engine as a PINN; here it conserves energy instead of enforcing a PDE residual.

A pendulum sanity check

The simple pendulum has the closed-form Hamiltonian H(q, p) = \tfrac{1}{2} p^2 + (1 - \cos q), written in units where the mass, the length and g are all 1. The script below trains an HNN, a small MLP for H_\theta(q, p) whose field is read off by ForwardDiff (\dot q = \partial_p H_\theta, \dot p = -\partial_q H_\theta), and, for contrast, a vanilla MLP that predicts the field (\dot q, \dot p) directly from the same 120 noisy points. This is the nested-autodiff pattern of §2.7 put to work: a derivative of the network inside the training loss.

units/unit_03/scripts/hnn_pendulum.jl

#!/usr/bin/env julia
# Hamiltonian Neural Network (HNN) on the simple pendulum — Greydanus et al. (2019),
# the worked example for unit_03.qmd §3.3.
#
# We parametrise the Hamiltonian H_θ(q,p) with a small Lux MLP and DERIVE the
# dynamics by autodiff — q̇ = ∂H/∂p, ṗ = -∂H/∂q — so the flow is symplectic and
# energy is conserved by construction. For contrast we train a vanilla Lux MLP
# that predicts the field (q̇, ṗ) directly from the SAME noisy data. On a long
# rollout the HNN's energy barely drifts while the vanilla MLP's wanders off.
#
# Autodiff split — the §2.7 forward-over-reverse pattern AND its one caveat:
#   • the HNN FIELD is the input-gradient of H_θ → ForwardDiff (cheap for 2 inputs);
#   • the VANILLA model trains with Zygote (reverse mode over the parameters);
#   • the HNN PARAMETER gradient, however, must ALSO use ForwardDiff: Zygote
#     silently returns `nothing` when asked to differentiate THROUGH the inner
#     ForwardDiff input-gradient (the exact nesting caveat called out in §5.3 /
#     Solution 2.7 — the central reason real PINNs lean on NeuralPDE.jl). The net
#     is tiny, so the forward-over-forward cost is negligible.
#
# Run via ./build.sh execute 3 (writes output to ../output/hnn_pendulum.md).

using Lux, Random, Zygote, ForwardDiff, ComponentArrays, Optimisers,
      Statistics, OrdinaryDiffEq, Printf

# ── true pendulum ───────────────────────────────────────────────────────
Htrue(q, p)     = 0.5 * p^2 + (1 - cos(q))
truefield(q, p) = (p, -sin(q))                  # (q̇, ṗ) = (∂_p H, -∂_q H)

# ── data: noisy field samples over the region the test orbit visits ─────────
rng = Random.MersenneTwister(0)
N   = 120
U   = vcat((5.0 .* rand(rng, N) .- 2.5)',        # q ∈ [-2.5, 2.5]
           (4.0 .* rand(rng, N) .- 2.0)')        # p ∈ [-2.0, 2.0]
F   = reduce(hcat, [collect(truefield(U[1, i], U[2, i])) for i in 1:N])
F .+= 0.05 .* randn(rng, size(F))                # additive Gaussian noise, σ = 0.05 (the "5%" reported below)

# ── two small Lux MLPs, with Float64 parameter vectors ──────────────────────
hnet = Lux.Chain(Lux.Dense(2 => 16, tanh), Lux.Dense(16 => 16, tanh), Lux.Dense(16 => 1))
vnet = Lux.Chain(Lux.Dense(2 => 16, tanh), Lux.Dense(16 => 16, tanh), Lux.Dense(16 => 2))
psh, sth = Lux.setup(rng, hnet); psh = ComponentArray{Float64}(psh)
psv, stv = Lux.setup(rng, vnet); psv = ComponentArray{Float64}(psv)

# HNN: H_θ(q,p) is a scalar; the field is its INPUT-gradient (ForwardDiff).
Hθ(u, ps)        = first(hnet(u, ps, sth))[1]
hnn_field(u, ps) = (g = ForwardDiff.gradient(z -> Hθ(z, ps), u); (g[2], -g[1]))
# Vanilla baseline: (q,p) ↦ (q̇, ṗ) directly — no structure imposed.
van_field(u, ps) = vnet(u, ps, stv)[1]

# ── losses (mean squared field error) ───────────────────────────────────
function loss_hnn(ps)
    s = zero(eltype(ps))
    for i in 1:N
        q̇, ṗ = hnn_field(U[:, i], ps)
        s += (q̇ - F[1, i])^2 + (ṗ - F[2, i])^2
    end
    s / N
end
loss_van(ps) = mean(sum(abs2, van_field(U[:, i], ps) .- F[:, i]) for i in 1:N)

# ── train: Optimisers.Adam; the HNN gradient via ForwardDiff (forward-over-
#    forward), the vanilla gradient via Zygote (plain reverse mode) ──────────
function train(loss, ps, grad; iters = 1500, lr = 5e-3)
    opt = Optimisers.setup(Optimisers.Adam(lr), ps)
    for _ in 1:iters
        opt, ps = Optimisers.update(opt, ps, grad(loss, ps))
    end
    ps
end
psh = train(loss_hnn, psh, (l, p) -> ForwardDiff.gradient(l, p))
psv = train(loss_van, psv, (l, p) -> Zygote.gradient(l, p)[1])

# ── long rollout from a held-out initial condition; measure energy drift ────
u0, tspan = [2.0, 0.0], (0.0, 50.0)
H0 = Htrue(u0...)
rollout(rhs) = solve(ODEProblem((u, _, _) -> rhs(u), u0, tspan), Tsit5();
                     reltol = 1e-8, abstol = 1e-8, saveat = 0.5)
solh = rollout(u -> collect(hnn_field(u, psh)))
solv = rollout(u -> van_field(u, psv))
drifth = [Htrue(u...) - H0 for u in solh.u]
driftv = [Htrue(u...) - H0 for u in solv.u]

@printf("HNN pendulum demo\n")
@printf("training points     : %d  (5%% field noise)\n", N)
@printf("rollout horizon     : t ∈ [0, %d]\n", Int(tspan[2]))
@printf("HNN     max |ΔH|    : %.4f\n", maximum(abs, drifth))
@printf("vanilla max |ΔH|    : %.4f\n", maximum(abs, driftv))
@printf("vanilla drifts %.0f× more than the energy-conserving HNN\n",
        maximum(abs, driftv) / maximum(abs, drifth))

# ── figure: energy drift vs time ────────────────────────────────────────
using Plots
plot(solh.t, drifth; lw = 2, label = "HNN (∂H/∂p, -∂H/∂q)",
     xlabel = "time", ylabel = "H(t) - H(0)", title = "Pendulum energy drift",
     legend = :topleft)
plot!(solv.t, driftv; lw = 2, ls = :dash, label = "vanilla MLP field")
savefig(joinpath(@__DIR__, "..", "figures", "hnn_energy_drift.png"))

# ── figure: WHAT THE HNN LEARNS — the scalar H_θ(q,p) it fits, and the orbit
#    its autodiff-derived field traces (vs the true pendulum and the vanilla MLP).
#    The HNN's only output is the surface H_θ; the field and the orbit are
#    *derived* from it, never fit directly. ──────────────────────────────────
fine(rhs) = solve(ODEProblem((u, _, _) -> rhs(u), u0, (0.0, 15.0)), Tsit5();
                  reltol = 1e-8, abstol = 1e-8, saveat = 0.02)
orbit(sol) = ([u[1] for u in sol.u], [u[2] for u in sol.u])
ot = orbit(fine(u -> collect(truefield(u...))))     # true pendulum
oh = orbit(fine(u -> collect(hnn_field(u, psh))))   # HNN's learned field
ov = orbit(fine(u -> van_field(u, psv)))            # vanilla's learned field

pq = plot(ot...; lw = 3, c = :black, label = "true pendulum",
          xlabel = "q (angle)", ylabel = "p (momentum)", framestyle = :box,
          title = "orbit traced by the learned field", legend = :topright)
plot!(pq, oh...; lw = 2, c = :seagreen,  label = "HNN")
plot!(pq, ov...; lw = 2, c = :firebrick, ls = :dash, label = "vanilla MLP")

qs = range(-2.6, 2.6; length = 100)
pg = range(-2.0, 2.0; length = 100)
Htrue_grid  = [Htrue(q, p)    for p in pg, q in qs]
Hlearn_grid = [Hθ([q, p], psh) for p in pg, q in qs]
Hlearn_grid .-= Hθ([0.0, 0.0], psh)                 # H_θ is fixed only up to a constant; pin H(0,0)=0
ph = contourf(qs, pg, Hlearn_grid; c = :viridis, levels = 12, framestyle = :box,
              xlabel = "q (angle)", ylabel = "p (momentum)",
              title = "learned Hθ(q,p)  (white dashed = true H)")
contour!(ph, qs, pg, Htrue_grid; levels = 8, c = :white, ls = :dash, lw = 1, colorbar = false)

plot(pq, ph; layout = (1, 2), size = (1040, 430),
     bottom_margin = 5Plots.mm, left_margin = 5Plots.mm)
savefig(joinpath(@__DIR__, "..", "figures", "hnn_pendulum_learned.png"))

Running it produces:

HNN pendulum demo
training points     : 120  (5% field noise)
rollout horizon     : t ∈ [0, 50]
HNN     max |ΔH|    : 0.0313
vanilla max |ΔH|    : 0.9299
vanilla drifts 30× more than the energy-conserving HNN

What is actually being learned? The HNN’s network has a single scalar output — the Hamiltonian H_\theta(q, p), the energy surface on the right below. It never sees the field (\dot q, \dot p) as a target; the field is the gradient of H_\theta, produced on demand by ForwardDiff, and the trajectory is that field integrated. So learning a good H_\theta is the whole game — and the left panel confirms it worked: integrate the learned field from a held-out start and the HNN’s orbit sits right on the true pendulum orbit, while the vanilla MLP — which fits the field directly, with no energy behind it — already wanders off its loop.

**What the HNN learns.** *Right:* the network’s only output, the scalar Hamiltonian H_\theta(q, p) (filled contours), against the true energy contours (white dashed) — recovered up to an additive constant, which the dynamics ignore because only \nabla H_\theta enters. *Left:* the phase-space orbit obtained by integrating each model’s learned field from a held-out start; the HNN (green) tracks the true closed orbit (black), the vanilla MLP (red dashed) does not. The HNN fits an *energy*, and its motion is derived from it.

Both models fit the data, but only the HNN’s field is derived from an energy, as (\partial_p H_\theta, -\partial_q H_\theta), so its flow is symplectic and conserves H_\theta exactly. On a long rollout that shows up directly as drift in the true energy: the HNN holds it almost flat while the vanilla MLP’s leaks away (here about 30× more, by the max |ΔH| figures above), and the gap widens the longer you integrate.

Energy drift H(t) - H(0) on a 50-time-unit pendulum rollout from (q_0, p_0) = (2, 0). The HNN derives its field from \partial H_\theta, so energy barely moves; the vanilla MLP fits the same data but its field is *not* a gradient, so energy steadily leaks away.
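The "field derived from a scalar" construction can be sketched in a few lines of Python. This is a minimal illustration, not the workshop script: the energy H_quartic is a hypothetical example chosen only for the demo, and a central-difference gradient (grad_fd) stands in for ForwardDiff. The point it shows is that \nabla H \cdot (\partial_p H, -\partial_q H) vanishes for any scalar H, so energy conservation comes from the construction rather than from training.

```python
import numpy as np

def grad_fd(H, u, h=1e-6):
    """Central-difference gradient of a scalar function H at u (stand-in for autodiff)."""
    g = np.zeros_like(u)
    for i in range(u.size):
        e = np.zeros_like(u)
        e[i] = h
        g[i] = (H(u + e) - H(u - e)) / (2 * h)
    return g

def hamiltonian_field(H, u):
    """Hamilton's equations: (dq/dt, dp/dt) = (dH/dp, -dH/dq)."""
    dHdq, dHdp = grad_fd(H, u)
    return np.array([dHdp, -dHdq])

# Hypothetical energy, used only for illustration (not the pendulum of the demo).
def H_quartic(u):
    q, p = u
    return 0.5 * p**2 + 0.25 * q**4

rng = np.random.default_rng(0)
points = rng.uniform(-2.0, 2.0, size=(5, 2))

# dH/dt along the flow is grad(H) . field; the two products cancel term by term.
hdots = np.array([grad_fd(H_quartic, u) @ hamiltonian_field(H_quartic, u)
                  for u in points])
print("max |dH/dt| over sample points:", np.max(np.abs(hdots)))
```

Swapping H_quartic for any other smooth scalar function leaves the cancellation intact, which is why the HNN only has to learn a good H_\theta.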

✏️ Section exercise — energy conservation by construction

Verify the HNN’s core claim without training anything. Take the pendulum Hamiltonian H(q, p) = \tfrac{1}{2}p^2 + (1 - \cos q) and build the vector field (\dot q, \dot p) = (\partial_p H, -\partial_q H) via ForwardDiff (don’t differentiate by hand — the point is that this is exactly what an HNN does with H_\theta). Integrate from (q_0, p_0) = (2.0, 0.0) for 100 time units and plot the energy drift H(t) - H(0). Then do the same for a “slightly wrong” field — multiply the \dot p component by 1.02, mimicking a vanilla MLP’s 2% error — and compare the drifts. What property of the exact Hamiltonian field does the perturbed field lose?

💡 Hint

Write H(u) = 0.5*u[2]^2 + (1 - cos(u[1])). ForwardDiff.gradient(H, u) returns [∂H/∂q, ∂H/∂p] = (g[1], g[2]) in one call — never differentiate by hand. Hamilton’s equations map directly: in ham!(du,u,p,t) set du[1] = g[2] and du[2] = -g[1]. The 2%-wrong field is the same with one extra factor: du[2] = -1.02*g[1]. Integrate both from u0 = [2.0, 0.0] over (0.0, 100.0) at reltol = abstol = 1e-9 so the drift is the field’s property, not the integrator’s, then plot [H(u) - H(u0) for u in sol.u]. The property the perturbed field loses is orthogonality to ∇H (i.e. Ḣ ≡ 0).

Go to solution →

3.4 Equation discovery: SINDy

The next three subsections cover (a) the sparse-regression idea, (b) how to choose the candidate-term library, and (c) a worked Lorenz example in Julia and Python. We then close with when SINDy beats — and loses to — black-box function approximators.

The sparse regression idea

SINDy (Sparse Identification of Nonlinear Dynamics) (Brunton et al., 2016) asks: instead of fitting a black-box neural network to data, can we directly recover the governing equation by assuming it’s a short symbolic expression? The construction:

Observe time-series data X = [\mathbf{x}(t_1), \mathbf{x}(t_2), \ldots].
Numerically differentiate to estimate \dot X (the hard part — noise destroys naive finite differences; total-variation regularised differentiation or smoothing splines help).
Build a library matrix of candidate right-hand-side terms evaluated at each sample, \Theta(\mathbf{x}) \;=\; [1,\ x_1,\ x_2,\ \ldots,\ x_1^2,\ x_1 x_2,\ \ldots,\ \sin x_1,\ \cos x_1,\ \ldots].
Solve the sparse regression \dot X = \Theta(X)\,\Xi for a sparse coefficient matrix \Xi. The nonzero entries are the equation.

Written out, that linear system is what the sparse regression actually solves — stack the m time samples as rows, with n states (here n = 3: x, y, z) and p library terms:

\underbrace{\begin{bmatrix} \dot x_1 & \dot y_1 & \dot z_1 \\ \vdots & \vdots & \vdots \\ \dot x_m & \dot y_m & \dot z_m \end{bmatrix}}_{\dot X\ \,(m \times n)} \;=\; \underbrace{\begin{bmatrix} 1 & x_1 & y_1 & z_1 & x_1^2 & x_1 y_1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \\ 1 & x_m & y_m & z_m & x_m^2 & x_m y_m & \cdots \end{bmatrix}}_{\Theta(X)\ \,(m \times p)} \; \underbrace{\begin{bmatrix} \xi^x_1 & \xi^y_1 & \xi^z_1 \\ \vdots & \vdots & \vdots \\ \xi^x_p & \xi^y_p & \xi^z_p \end{bmatrix}}_{\Xi\ \,(p \times n)}

Each column of \Xi is the coefficient vector for one state’s equation — \dot x = \Theta(X)\,\xi^x, and likewise for y and z — so the columns decouple and can be solved one at a time. The library is deliberately wide (many candidate terms, large p) but the truth is sparse: for the Lorenz \dot x = \sigma(y - x), the column \xi^x is zero everywhere except the two rows for x and y. Sparse regression is exactly the search for that — the fewest library columns whose combination reproduces each column of \dot X — and the surviving nonzero pattern of \Xi is the recovered system.

A note on the symbol. \Xi is a capital Greek Xi (lower-case \xi, pronounced “ksi”). Brunton et al. wrote the coefficient matrix this way in the original SINDy paper and the convention stuck, so we keep it; apologies if it is an unfamiliar glyph (blame the sparse-regression literature, not the maths). If you are wondering how to type it: in Julia (the REPL, VS Code, a Quarto cell) you type the characters \Xi (backslash, X, i) and press Tab, and they turn into the glyph Ξ. The same backslash-name-then-Tab trick produces every Greek letter and most symbols in this course: \xi⇥ gives ξ, \Theta⇥ gives Θ, \alpha⇥ gives α, and it even covers \dot-accents and subscripts.

The sparsity is enforced either by \ell_1 regularisation (LASSO) or by STLSQ — Sequentially Thresholded Least Squares: ordinary least squares → set small coefficients to zero → refit on the survivors → repeat until stable. STLSQ is what the original Brunton paper proposed and what the workshop script uses.
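The STLSQ loop just described can be sketched with NumPy on a synthetic problem. Everything here is illustrative: the random library matrix, the sparse Xi_true and the threshold of 0.5 are assumptions made up for the demo, not values from the workshop script.

```python
import numpy as np

def stlsq(Theta, dX, threshold, n_iters=10):
    """Sequentially thresholded least squares, refitting one column of dX at a time."""
    Xi = np.linalg.lstsq(Theta, dX, rcond=None)[0]
    for _ in range(n_iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0                      # prune small coefficients
        for j in range(dX.shape[1]):
            keep = ~small[:, j]
            if keep.any():                   # refit on the surviving terms only
                Xi[keep, j] = np.linalg.lstsq(Theta[:, keep], dX[:, j], rcond=None)[0]
    return Xi

# Synthetic test problem (hypothetical): 6 candidate columns, 2 target columns.
rng = np.random.default_rng(1)
Theta = rng.normal(size=(200, 6))
Xi_true = np.zeros((6, 2))
Xi_true[1, 0], Xi_true[2, 0] = -10.0, 10.0
Xi_true[4, 1] = 2.5
dX = Theta @ Xi_true + 0.01 * rng.normal(size=(200, 2))

Xi_hat = stlsq(Theta, dX, threshold=0.5)
print(np.round(Xi_hat, 2))
```

The threshold is the one real tuning knob: it has to sit above the noise-induced coefficients and below the smallest true coefficient, which is easy here and harder on real data.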

Why SINDy is popular. Compared to fitting a neural network to the same dynamics, SINDy offers:

Interpretability. The output is a human-readable equation, not a 10-layer black box. Domain experts can sanity-check it.
Extrapolation. A 3-term equation is far more likely to be correct outside its training regime than a network that memorised the training trajectory.
Data efficiency. Hundreds of samples are usually enough, vs. tens of thousands for a deep model on the same task.
Compatibility with downstream physics. Once you have an equation, you can hand it to a stiff ODE solver, do stability analysis, derive control laws — any classical tool works.

Compared to physics-informed neural networks (next units), SINDy is the right move when you don’t know the equation but suspect it’s sparse. PINNs assume the equation is given and solve it; SINDy discovers the equation in the first place. They’re complementary, not competing.

Building a library

A standard polynomial library up to degree 3 plus a few trigonometric terms is the default. Domain knowledge guides which terms to include: include too many and the regression becomes unstable (many candidate terms ⇒ overfitting); include too few and recovery is impossible, because the true terms aren’t in the basis. Numerical differentiation of \dot X from time-series X is a separate art (total-variation regularised differentiation, smoothing splines).
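Why differentiation is "a separate art" can be seen in a small experiment. The sketch below is a toy under stated assumptions (a unit-amplitude sine, a 1% noise level, the same dt = 0.002 as the Lorenz script): central differences divide the noise by 2·dt, so even small noise produces a large derivative error.

```python
import numpy as np

dt = 0.002
t = np.arange(0.0, 2.0 + dt / 2, dt)
x_clean = np.sin(2 * np.pi * t)
dx_true = 2 * np.pi * np.cos(2 * np.pi * t)

def central_diff(x, dt):
    """Second-order central difference on the interior points."""
    return (x[2:] - x[:-2]) / (2 * dt)

rng = np.random.default_rng(0)
sigma = 0.01                      # assumed noise std, 1% of the signal amplitude
x_noisy = x_clean + sigma * rng.normal(size=t.size)

err_clean = np.sqrt(np.mean((central_diff(x_clean, dt) - dx_true[1:-1]) ** 2))
err_noisy = np.sqrt(np.mean((central_diff(x_noisy, dt) - dx_true[1:-1]) ** 2))
noise_estimate = sigma / (np.sqrt(2) * dt)   # std of (n[i+1] - n[i-1]) / (2 dt)

print(f"RMS derivative error, clean data : {err_clean:.2e}")
print(f"RMS derivative error, noisy data : {err_noisy:.2e}")
print(f"noise-only estimate sigma/(sqrt(2) dt): {noise_estimate:.2e}")
```

Shrinking dt makes the clean error smaller but the noisy error larger, which is the trade-off that smoothing or regularised differentiation is meant to manage.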

Worked example: recovering Lorenz

Given trajectories of the Lorenz system

\dot x = \sigma(y - x),\quad \dot y = x(\rho - z) - y,\quad \dot z = xy - \beta z,

with light observation noise (1% in the script below), SINDy with a degree-2 polynomial library typically recovers the correct terms, with parameters close to (\sigma, \rho, \beta) = (10, 28, 8/3). It is the canonical demonstration that a sparse prior plus enough data is sometimes all you need.

Julia does have a full SINDy stack — SciML’s DataDrivenDiffEq.jl with DataDrivenSparse.jl for the sparse solvers (STLSQ, ADMM, …), the Julia counterpart to Python’s pysindy. Here, though, we roll our own bare-bones STLSQ (no DataDrivenDiffEq dependency) to make the recipe explicit: build the library, threshold the OLS coefficients, re-fit on the surviving terms, repeat.

# SINDy on the Lorenz system.
#
# Simulate a Lorenz trajectory, add observation noise, and recover the
# symbolic equations by sparse regression against a degree-2 polynomial
# library — the worked example in unit_03.qmd §3.4.
#
# Algorithm: sequentially thresholded least squares (STLSQ), the
# canonical SINDy update from Brunton, Proctor & Kutz (2016).
# Derivatives are obtained by second-order central differences; the
# qmd notes that derivative estimation is a separate art and we keep
# this script simple by relying on a dense save interval.
#
# Run via ./build.sh execute 3 (writes output to ../output/sindy_lorenz.md).

using OrdinaryDiffEq, Random, LinearAlgebra, Printf, Statistics

# ── 1. simulate Lorenz ─────────────────────────────────────────────────
σ_true, ρ_true, β_true = 10.0, 28.0, 8/3
function lorenz!(du, u, p, t)
    x, y, z = u
    du[1] = σ_true * (y - x)
    du[2] = x * (ρ_true - z) - y
    du[3] = x * y - β_true * z
end

u0    = [-8.0, 7.0, 27.0]
tspan = (0.0, 20.0)
dt    = 0.002
sol   = solve(ODEProblem(lorenz!, u0, tspan), Tsit5();
              saveat = dt, abstol = 1e-10, reltol = 1e-10)

X_clean = permutedims(reduce(hcat, sol.u))            # N × 3

# ── 2. add observation noise ───────────────────────────────────────────
rng = Random.MersenneTwister(0)
noise_level = 0.01                                    # 1 % of per-axis std
σ_obs = noise_level .* vec(std(X_clean; dims = 1))'
X = X_clean .+ σ_obs .* randn(rng, size(X_clean))

# ── 3. central-difference derivatives, trim end points ─────────────────
Ẋ = (X[3:end, :] .- X[1:end-2, :]) ./ (2 * dt)
Xc = X[2:end-1, :]

# ── 4. degree-2 polynomial feature library ─────────────────────────────
# columns: 1, x, y, z, x², xy, xz, y², yz, z²
function poly_library(X)
    x, y, z = X[:, 1], X[:, 2], X[:, 3]
    hcat(ones(length(x)), x, y, z,
         x.^2, x.*y, x.*z, y.^2, y.*z, z.^2)
end
labels = ["1", "x", "y", "z", "x^2", "x*y", "x*z", "y^2", "y*z", "z^2"]
Θ = poly_library(Xc)

# ── 5. STLSQ — sequentially thresholded least squares ──────────────────
function stlsq(Θ, Ẋ, λ; n_iters = 15)
    Ξ = Θ \ Ẋ
    for _ in 1:n_iters
        small = abs.(Ξ) .< λ
        Ξ[small] .= 0
        for j in axes(Ξ, 2)
            keep = .!small[:, j]
            sum(keep) == 0 && continue
            Ξ[keep, j] = Θ[:, keep] \ Ẋ[:, j]
        end
    end
    Ξ
end

λ = 0.1
Ξ = stlsq(Θ, Ẋ, λ)

# ── 6. report ──────────────────────────────────────────────────────────
state_names = ["dx/dt", "dy/dt", "dz/dt"]

println("Lorenz SINDy demo")
println("samples used      : ", size(Xc, 1))
println("observation noise : ", noise_level * 100, " % of per-axis std")
println("STLSQ threshold λ : ", λ)
println()
println("True system:")
println("  dx/dt = -10·x + 10·y")
println("  dy/dt =  28·x -  y - x·z")
println("  dz/dt =  x·y - (8/3)·z")
println()
println("Recovered equations:")
for j in 1:3
    terms = String[]
    for i in eachindex(labels)
        c = Ξ[i, j]
        c == 0 && continue
        push!(terms, @sprintf("%+0.3f·%s", c, labels[i]))
    end
    println("  $(state_names[j]) = ", isempty(terms) ? "0" : join(terms, "  "))
end

println()
@printf("σ̂ ≈ %0.3f   (true %0.3f)\n", -Ξ[2, 1], σ_true)
@printf("ρ̂ ≈ %0.3f   (true %0.3f)\n",  Ξ[2, 2], ρ_true)
@printf("β̂ ≈ %0.3f   (true %0.3f)\n", -Ξ[4, 3], β_true)

Running it produces:

Lorenz SINDy demo
samples used      : 9999
observation noise : 1.0 % of per-axis std
STLSQ threshold λ : 0.1

True system:
  dx/dt = -10·x + 10·y
  dy/dt =  28·x -  y - x·z
  dz/dt =  x·y - (8/3)·z

Recovered equations:
  dx/dt = -9.995·x  +9.995·y
  dy/dt = +27.971·x  -0.990·y  -0.999·x*z
  dz/dt = -2.667·z  +1.000·x*y

σ̂ ≈ 9.995   (true 10.000)
ρ̂ ≈ 27.971   (true 28.000)
β̂ ≈ 2.667   (true 2.667)

The same workflow in Python uses the pysindy package — the canonical SINDy implementation, maintained by the Brunton group:

units/unit_03/scripts/sindy_lorenz_python.py

# pip install pysindy
import numpy as np
from scipy.integrate import solve_ivp
import pysindy as ps

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
def lorenz(t, u): return [sigma * (u[1] - u[0]),
                          u[0] * (rho - u[2]) - u[1],
                          u[0] * u[1] - beta * u[2]]

t = np.linspace(0, 10, 5001)
sol = solve_ivp(lorenz, (t[0], t[-1]), [-8.0, 7.0, 27.0],
                t_eval=t, rtol=1e-9, atol=1e-9)
X = sol.y.T

model = ps.SINDy(
    feature_library=ps.PolynomialLibrary(degree=2),
    optimizer=ps.STLSQ(threshold=0.1),
)
model.fit(X, t=t)
model.print()

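PySINDy bundles all the choices we made explicit in the Julia script (library, optimiser, differentiation) into a scikit-learn style estimator; the trade-off is less visibility into the inner loop. Note that this Python script fits a noise-free trajectory on t ∈ [0, 10], whereas the Julia script adds 1% noise on t ∈ [0, 20], so the two sets of coefficients will not match digit for digit. Both produce the same Lorenz equation, recovered to ~3 significant figures.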

When discovery beats approximation

SINDy wins when

the underlying equation is genuinely sparse in your library,
you care about interpretability — a 3-term equation is more useful than a 10-layer network, and
you have enough clean trajectory data to estimate derivatives.

It loses when the dynamics are intrinsically high-dimensional, the right basis isn’t in your library, or the noise drowns the derivatives.
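The "right basis isn’t in your library" failure is easy to reproduce. The sketch below uses an assumed toy system, \dot x = -\sin x, with exact derivatives so that differentiation noise plays no role; it compares a cubic polynomial library against the same library with \sin x appended.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 200)
dx = -np.sin(x)          # exact derivatives of the assumed toy system x' = -sin(x)

def fit(Theta, target):
    """Ordinary least squares plus the relative residual of the fit."""
    xi = np.linalg.lstsq(Theta, target, rcond=None)[0]
    return xi, np.linalg.norm(Theta @ xi - target) / np.linalg.norm(target)

poly = np.column_stack([np.ones_like(x), x, x**2, x**3])   # 1, x, x^2, x^3
with_sin = np.column_stack([poly, np.sin(x)])              # ... plus sin(x)

xi_poly, res_poly = fit(poly, dx)
xi_sin, res_sin = fit(with_sin, dx)
print("polynomial library  :", np.round(xi_poly, 3), f"residual {res_poly:.3f}")
print("polynomial + sin(x) :", np.round(xi_sin, 3), f"residual {res_sin:.1e}")
```

With the polynomial library the fit is a truncated-series approximation that leaves a visible residual and would extrapolate poorly; once \sin x is a candidate, a single coefficient carries the whole equation.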

✏️ Section exercise — SINDy on the damped oscillator

Run the full SINDy recipe yourself on a system where you know the answer: the damped oscillator of §3.1, \dot x = v, \dot v = -\omega^2 x - 2\zeta\omega v with \omega = 2\pi, \zeta = 0.1. Simulate a trajectory, estimate \dot X by central differences, build a polynomial library up to degree 2 in (x, v), and run 5 iterations of STLSQ (threshold 0.5). You should recover the two equations with coefficients (-\omega^2, -2\zeta\omega) \approx (-39.5, -1.26). Then add 1% Gaussian noise to the trajectory before differentiating and watch what happens to the recovered coefficients. Which step of the pipeline failed?

💡 Hint

Estimate Ẋ by central differences (X[3:end,:] .- X[1:end-2,:]) ./ (t[3:end] .- t[1:end-2]), dropping the endpoints, and pair it with the interior states X[2:end-1,:]. The degree-2 library is one hcat: Θ = hcat(ones(n), x, v, x.^2, x.*v, v.^2). STLSQ on a target column dx: start ξ = Θ \ dx; then five times over, mask small = abs.(ξ) .< 0.5, set ξ[small] .= 0, and refit the survivors ξ[.!small] = Θ[:, .!small] \ dx. You should recover -ω² ≈ -39.5 and -2ζω ≈ -1.26. For the noise experiment, perturb X before differencing — that ordering is the whole point, and it’s the differentiation step that breaks.

Go to solution →

3.5 Proper Orthogonal Decomposition (POD)

Dynamical systems often look high-dimensional but live on a low-dimensional manifold — a turbulent flow field has millions of grid cells but the coherent motion (vortices, waves) is described by a handful of modes. Proper Orthogonal Decomposition (POD) — also called Principal Component Analysis when applied to snapshot matrices — extracts those modes systematically. It’s the classical standard for reduced-order modelling (ROM) and the first thing to reach for when you suspect your system has hidden low-dimensional structure.

The snapshot SVD

Here is the picture before the algebra. Suppose you record a system as it evolves — a temperature profile along a rod, a velocity field over a grid: at each instant you measure M numbers (one per grid point), and you take N such snapshots in time. Each snapshot is a single point in \mathbb{R}^M, so the whole recording is a cloud of N points in that M-dimensional space. The question POD asks is: does that cloud really fill all M dimensions, or does it sit close to a low-dimensional plane? If just a few spatial patterns, faded in and out over time, explain almost every snapshot, then we can keep only those patterns plus their time-coefficients and reconstruct \mathbf{x}(t) from a handful of numbers instead of M. The singular value decomposition is the tool that finds those patterns — and tells you how many you need.

Collect N snapshots of the state \mathbf{x}_k \in \mathbb{R}^M (e.g. M = number of grid cells, N = number of time samples) into a snapshot matrix:

X \;=\; \bigl[\mathbf{x}_1 \;|\; \mathbf{x}_2 \;|\; \ldots \;|\; \mathbf{x}_N\bigr] \;\in\; \mathbb{R}^{M \times N}.

Compute its singular value decomposition,

X \;=\; U \Sigma V^\top, \qquad U \in \mathbb{R}^{M \times r},\ \Sigma \in \mathbb{R}^{r \times r},\ V \in \mathbb{R}^{N \times r},

with \Sigma = \mathrm{diag}(\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0). The columns of U are the POD modes — an orthonormal basis for the snapshots, ranked by how much variance they explain. Keep the first k \ll r modes and you have a k-dimensional approximation:

\mathbf{x}(t) \;\approx\; U_k \, \mathbf{a}(t), \qquad \mathbf{a}(t) = U_k^\top \mathbf{x}(t) \;\in\; \mathbb{R}^k.
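As a minimal sketch of the truncation formula, the NumPy snippet below builds a hypothetical snapshot matrix out of three spatial patterns (the field, grid sizes and amplitudes are all made up for illustration), then forms the modes U_k, the coefficients \mathbf{a}(t) = U_k^\top \mathbf{x}(t) and the rank-k reconstruction.

```python
import numpy as np

M, N, k = 40, 120, 3
x = np.linspace(0.0, 1.0, M)
t = np.linspace(0.0, 2.0, N)

# Hypothetical field: three spatial patterns, each with its own time course.
X = (np.outer(np.sin(np.pi * x), np.cos(2 * np.pi * t))
     + 0.5 * np.outer(np.sin(2 * np.pi * x), np.sin(3 * np.pi * t))
     + 0.1 * np.outer(np.sin(3 * np.pi * x), np.exp(-t)))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]        # POD modes, shape (M, k)
A = U_k.T @ X         # time coefficients a(t), shape (k, N)
X_k = U_k @ A         # rank-k reconstruction of every snapshot

rel_err = np.linalg.norm(X - X_k) / np.linalg.norm(X)
print("relative singular values:", np.round(s[:5] / s[0], 6))
print("rank-3 relative error   :", rel_err)
```

Because the field was built from exactly three patterns, three modes reproduce it to round-off; dropping to two modes leaves an error equal to the discarded singular values.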

Need the SVD itself refreshed first? Expand the box.

An SVD refresher (on a small 3×5 matrix)

Every real matrix A factorises as A = U\Sigma V^\top with U and V orthogonal (orthonormal columns) and the singular values \sigma_1 \geq \sigma_2 \geq \ldots \geq 0 on the diagonal of \Sigma:

using LinearAlgebra

A = [ 1.0  2.0  0.0  1.0  3.0
      0.0  1.0  1.0  2.0  1.0
      2.0  0.0  1.0  1.0  0.0 ]      # a 3×5 matrix

U, σ, V = svd(A)                      # the singular value decomposition: A = U Σ Vᵀ
@show size(U) size(σ) size(V)         # (3,3) (3,) (5,3) — economy SVD, r = min(3,5) = 3
@show round.(σ; digits = 3)           # singular values, largest first
@show A ≈ U * Diagonal(σ) * V'        # the three factors rebuild A exactly

size(U) = (3, 3)
size(σ) = (3,)
size(V) = (5, 3)
round.(σ; digits = 3) = [4.5, 2.299, 1.571]
A ≈ U * Diagonal(σ) * V' = true

true

svd hands back U (3{\times}3), the vector of singular values \sigma, and V (5{\times}3). This is the economy form, keeping the r = \min(3,5) = 3 nonzero directions, and U\,\mathrm{diag}(\sigma)\,V^\top multiplies straight back to A. The singular vectors are orthonormal and the \sigma_i come out sorted, which is why truncating to the first k columns, as above, gives the best rank-k approximation of the matrix in the least-squares sense (the Eckart–Young theorem).

The decay of \sigma_i tells you whether the reduction will work. Fast decay → a few modes capture almost everything → ROM is viable. Slow decay → genuinely high-dimensional → POD will lose detail.
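One common way to turn "fast decay" into a number is the cumulative-energy criterion: pick the smallest k whose leading \sigma_i^2 sum to a target fraction of the total. The helper modes_for_energy and the two spectra below are hypothetical, written only to illustrate the contrast.

```python
import numpy as np

def modes_for_energy(s, target=0.99):
    """Smallest k whose leading singular values hold `target` of sum(s**2)."""
    energy = np.cumsum(s**2) / np.sum(s**2)
    # Clamp guards against round-off leaving energy[-1] a hair below a target of 1.0.
    return min(int(np.searchsorted(energy, target) + 1), len(s))

# Two hypothetical spectra of the same length.
i = np.arange(1, 51)
s_fast = np.exp(-1.0 * (i - 1))   # geometric decay
s_slow = 1.0 / i                  # algebraic decay

print("fast decay:", modes_for_energy(s_fast), "modes for 99% of the energy")
print("slow decay:", modes_for_energy(s_slow), "modes for 99% of the energy")
```

The same function applies unchanged to the singular values returned by any snapshot SVD.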

Worked example: damped oscillator chain

A chain of M = 50 masses coupled by springs, given a smooth initial bump and lightly damped, is a useful test bench. The state-space dimension is 100 (positions + velocities), yet the motion is a superposition of just a few normal modes — standing-wave shapes, each ringing at its own frequency. A smooth initial push excites mainly the low-frequency modes, so the snapshot SVD should compress the field into a handful of POD modes. We check three things in one figure: how fast the singular values fall, how the reconstruction error drops as we keep more modes, and what the leading mode shapes actually look like.

using OrdinaryDiffEq, LinearAlgebra, Plots

# 50-mass chain: x¨ᵢ = ω²(xᵢ₋₁ - 2xᵢ + xᵢ₊₁) - 2ζω x˙ᵢ  (fixed at endpoints)
M, ω, ζ = 50, 2π, 0.02
function chain!(du, u, p, t)
    x, v = @views u[1:M], u[M+1:end]
    du[1:M] .= v
    @inbounds for i in 1:M
        l = i == 1 ? 0.0 : x[i-1]
        r = i == M ? 0.0 : x[i+1]
        du[M+i] = ω^2 * (l - 2x[i] + r) - 2ζ * ω * v[i]
    end
end

u0 = zeros(2M)
for i in 1:M; u0[i] = exp(-((i - M/2) / 5.0)^2); end   # smooth Gaussian bump
sol = solve(ODEProblem(chain!, u0, (0.0, 8.0)), Tsit5(); saveat = 0.02)

# Snapshot matrix (rows = positions, columns = time samples), then SVD
X = reduce(hcat, [u[1:M] for u in sol.u])
U, Σ, V = svd(X)

p1 = plot(1:15, Σ[1:15] ./ Σ[1]; lw = 2.4, marker = :circle, ms = 4, c = :steelblue,
          yaxis = :log, xlabel = "mode i", ylabel = "σᵢ / σ₁", legend = false,
          title = "singular-value spectrum", framestyle = :box, gridalpha = 0.25)

errs = [norm(X - U[:, 1:k] * Diagonal(Σ[1:k]) * V[:, 1:k]') / norm(X) for k in 1:15]
p2 = plot(1:15, errs; lw = 2.4, marker = :circle, ms = 4, c = :firebrick,
          yaxis = :log, xlabel = "modes kept k", ylabel = "rel. reconstruction error",
          legend = false, title = "reconstruction error", framestyle = :box, gridalpha = 0.25)

p3 = plot(xlabel = "mass index", ylabel = "mode amplitude", title = "first 3 POD modes",
          framestyle = :box, gridalpha = 0.25, legend = :topright)
for k in 1:3
    plot!(p3, 1:M, U[:, k]; lw = 2.2, label = "mode $k")
end

plot(p1, p2, p3; layout = @layout([a b; c _]), size = (900, 660),
     bottom_margin = 5Plots.mm, left_margin = 5Plots.mm)

The singular values fall off fast and the reconstruction error drops below 1% by about a dozen modes — so the effective dimensionality is far below the 100 state variables. The leading mode shapes are exactly the chain’s smooth normal modes: mode 1 is the half-sine “breathing” shape, mode 2 the full sine, mode 3 the next overtone — modes ordered by spatial frequency, just as a Fourier basis would predict for this linear system. (A localised single-cell push, by contrast, excites all frequencies and compresses far less — POD rewards smooth, low-rank structure.)

Same workflow in Python — scipy.linalg.svd is the SVD; the rest is numpy:

units/unit_03/scripts/pod_chain_python.py

import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import svd

M, omega, zeta = 50, 2 * np.pi, 0.02
def chain(t, u):
    x, v = u[:M], u[M:]
    du = np.zeros_like(u)
    du[:M] = v
    for i in range(M):
        l = 0.0 if i == 0     else x[i - 1]
        r = 0.0 if i == M - 1 else x[i + 1]
        du[M + i] = omega**2 * (l - 2 * x[i] + r) - 2 * zeta * omega * v[i]
    return du

u0 = np.zeros(2 * M)
u0[:M] = np.exp(-((np.arange(1, M + 1) - M / 2) / 5.0) ** 2)   # smooth Gaussian bump, as in the Julia script
sol = solve_ivp(chain, (0.0, 8.0), u0, t_eval=np.linspace(0, 8, 401))
X = sol.y[:M, :]                                       # (M, N) positions

U, Sigma, Vt = svd(X, full_matrices=False)
print("first 6 singular values (relative):",
      np.round(Sigma[:6] / Sigma[0], 4))
print(f"variance captured by k=6 modes: "
      f"{(Sigma[:6]**2).sum() / (Sigma**2).sum():.4f}")

POD’s role in modern Sci-ML

POD by itself is linear and deterministic — it pre-dates deep learning by half a century. But it shows up everywhere in contemporary work:

As a preprocessor. Reduce 1M-dimensional snapshots to 100 POD coefficients, then learn the evolution of those coefficients with a small neural ODE or transformer. Modern weather emulators (Pangu-Weather, GraphCast) follow the same encode, evolve, decode pattern, though with learned nonlinear encoders rather than POD modes.
As a sanity check. If POD already captures 99% of variance in 5 modes, you don’t need deep learning — a linear ROM is enough.
As a comparison baseline. Any nonlinear surrogate should beat the POD-truncation baseline; if it doesn’t, the nonlinearity isn’t pulling its weight.

POD’s relatives (autoencoders, dynamic mode decomposition (DMD), Koopman operators) generalise the idea: project the dynamics onto a learned, possibly nonlinear, low-dimensional manifold, then evolve there. An autoencoder is the most direct generalisation: a neural network trained to reconstruct its input through a narrow bottleneck layer, so the bottleneck learns a compressed code. POD is exactly the linear special case, since a linear encoder/decoder with a squared-error loss recovers the same subspace the leading POD modes span; an autoencoder simply lets the encoder and decoder be nonlinear, which can capture curved low-dimensional structure that a linear basis cannot. See Mathematical Engineering of Deep Learning (Liquet, Moka & Nazarathy, 2024) for the autoencoder construction. We won’t go deep into these in the workshop, but the POD picture is the right mental starting point.

✏️ Section exercise — POD on a diffusing field

The chain example needed about a dozen modes to push the reconstruction error below 1% for an oscillatory system. Now try a diffusive one, where POD shines. Solve the 1-D heat equation u_t = \alpha u_{xx} (\alpha = 0.01, x \in [0, 1], homogeneous Dirichlet BCs, initial condition u_0(x) = \exp(-200(x - 0.3)^2)) with simple finite differences, collect ~200 snapshots, and compute the snapshot SVD. How many modes capture 99% of the variance, and how does that compare to the oscillator chain? Bonus: plot the first three POD modes. What classical basis do they resemble, and why was that predictable from §6.2’s separation of variables (peek ahead if curious)?

💡 Hint

The FTCS update is one line (the §6.4 simulator): on x = range(0,1; length=101) with dt = 0.4*dx^2/α, step the interior u_new[i] = u[i] + α*dt/dx^2*(u[i+1] - 2u[i] + u[i-1]) and force u[1] = u[end] = 0 each step. Start from u = exp.(-200 .* (x .- 0.3).^2) (then zero the ends), and push!(snapshots, copy(u)) every ~40 steps for ~200 columns. Then U, Σ, V = svd(reduce(hcat, snapshots)) from LinearAlgebra; cumulative variance is cumsum(Σ.^2) ./ sum(Σ.^2), so the 99% mode count is findfirst(energy .≥ 0.99) — expect 2–4, far fewer than the chain. Plot U[:, 1:3] vs x: they look like sin(nπx), the heat equation’s separation-of-variables eigenbasis (Unit 6).

Go to solution →

3.6 Physics-informed machine learning: a conceptual preview

The core idea

A neural network u_\theta(\mathbf{x}, t) approximates the solution of a PDE (or ODE). The loss has a physics term — the equation residual evaluated by autodiff at scattered collocation points — plus boundary / initial conditions and (optionally) a data term:

\mathcal{L}(\theta) \;=\; \lambda_r\,\mathcal{L}_{\text{PDE}} + \lambda_b\,\mathcal{L}_{\text{BC}} + \lambda_i\,\mathcal{L}_{\text{IC}} + \lambda_d\,\mathcal{L}_{\text{data}}.

Minimising it simultaneously fits the data and respects the equation. No grid, no time-stepping, no mesh.

Mechanically the residual is the part that’s new: for the oscillator at the top of this unit, \mathcal{L}_{\text{PDE}}(\theta) = \frac{1}{N_r}\sum_i \bigl| \ddot u_\theta(t_i) + \omega^2 u_\theta(t_i) \bigr|^2, with the second derivative computed by autodiff straight through u_\theta.
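A small sketch shows why the residual term needs the initial-condition term beside it. It assumes a hypothetical two-parameter trial family u(t) = a\cos\omega t + b\sin\omega t in place of a network, takes u(0) = 1, \dot u(0) = 0 as the initial condition, and uses central finite differences (d1_fd, d2_fd) as a stand-in for autodiff.

```python
import numpy as np

omega = 2 * np.pi
t_r = np.linspace(0.0, 2.0, 100)          # collocation points

def d1_fd(u, t, h=1e-4):
    """Central first difference (finite-difference stand-in for autodiff)."""
    return (u(t + h) - u(t - h)) / (2 * h)

def d2_fd(u, t, h=1e-4):
    """Central second difference (finite-difference stand-in for autodiff)."""
    return (u(t + h) - 2 * u(t) + u(t - h)) / h**2

def losses(u, u0=1.0, v0=0.0):
    """Residual loss of u'' + omega^2 u = 0 and the initial-condition loss."""
    residual = d2_fd(u, t_r) + omega**2 * u(t_r)
    loss_pde = np.mean(residual**2)
    loss_ic = (u(0.0) - u0) ** 2 + (d1_fd(u, 0.0) - v0) ** 2
    return loss_pde, loss_ic

def trial(a, b):
    """Hypothetical trial family; every member satisfies the ODE exactly."""
    return lambda t: a * np.cos(omega * t) + b * np.sin(omega * t)

for a, b in [(1.0, 0.0), (0.3, -2.0)]:
    pde, ic = losses(trial(a, b))
    print(f"a={a:+.1f}, b={b:+.1f}:  L_PDE = {pde:.2e}   L_IC = {ic:.2e}")
```

Both trial functions drive the residual to (numerically) zero, so the physics term alone cannot tell them apart; only the initial-condition term singles out the intended solution, which is why the weights \lambda_i and \lambda_b matter in practice.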

What makes PIML different from other Sci-ML methods

vs surrogates: PIML doesn’t need a precomputed dataset — the equation itself supervises the network.
vs equation discovery: PIML assumes the equation is known and asks for its solution (or its parameters via the inverse problem).
vs hybrid models: PIML is a special case where the entire solution field is the “ML part” and the equation residual is the “physics part”.

Where this lands in the workshop

Unit 4 takes the ODE picture from §3.1 and adds learnable terms — the Neural-ODE / UDE picture from §3.2. Unit 5 implements vanilla PINNs on the oscillator above and a small collection of ODE / PDE toys. Unit 6 provides the PDE foundations. Unit 7 addresses where naïve PINNs fail and how the modern fixes close the gap. Units 8–10 assemble the AIMS capstone — forward (Part I) and inverse (Part II).

✏️ Section exercise — evaluate a residual by hand

No training yet — just the residual. For the oscillator \ddot u + \omega^2 u = 0 with \omega = 2\pi, take three candidate solutions: (a) the exact u(t) = \cos(\omega t), (b) the wrong-frequency u(t) = \cos(0.9\,\omega t), and (c) the damped impostor u(t) = e^{-t/2}\cos(\omega t). Compute the PINN residual loss \mathcal{L}_{\text{PDE}} = \frac{1}{N}\sum_i |\ddot u(t_i) + \omega^2 u(t_i)|^2 for each over N = 100 uniform points on [0, 2], computing \ddot u with nested ForwardDiff.derivative rather than by hand. Rank the three losses. Candidate (c) fits the IC perfectly and looks plausible — what does its residual reveal, and at which t is the residual largest?

💡 Hint

Each candidate is a plain Julia function of t, with no network anywhere, e.g. a = t -> cos(ω*t). Nest ForwardDiff.derivative twice for ü: d2(f, t) = ForwardDiff.derivative(τ -> ForwardDiff.derivative(f, τ), t). The per-point residual is d2(f,t) + ω^2*f(t); the loss is its mean of squares over ts = range(0, 2; length = 100). Rank the three losses by printing each. For “where is (c) largest”, don’t reason it out; just plot candidate (c)’s residual against ts and read off the peak (it sits early, near t ≈ 0.23, about a quarter of a period in).

Go to solution →