Unit 3: Scientific Machine Learning and Physics-Informed Machine Learning

Published

12/06/2026

Unit 2 gave us a function-approximation view of ML — pick a function class, minimise a loss, generalise. This unit takes that picture and adds physics to it, in three flavours: hybrid models, conservation-aware architectures, and equation discovery. The unit closes with a conceptual preview of physics-informed machine learning (PIML) — the framing that Units 5–9 implement against the AIMS capstone.

Before any of that, an ODE refresher — the dynamics language we need to start meaningful Sci-ML.

3.1 An ODE refresher

An ordinary differential equation (ODE) is a relation between a function and its derivatives in one independent variable (typically time t):

\dot{\mathbf{x}}(t) \;=\; f(\mathbf{x}(t), t),\qquad \mathbf{x}(0) \;=\; \mathbf{x}_0.

We’ll meet partial differential equations (PDEs) — derivatives in more than one variable — in Unit 6. For now everything is an ODE, and everything is an initial-value problem (IVP): given \mathbf{x}_0 at t = 0, integrate forward.

Vocabulary worth keeping straight:

  • \dot{\mathbf{x}} is the time derivative d\mathbf{x}/dt.
  • Order — highest derivative appearing. A second-order scalar ODE \ddot{x} = g(x, \dot{x}) rewrites as a first-order system in (x, \dot{x}). Throughout the unit we assume that’s been done.
  • Autonomous if f has no explicit t dependence (\dot{\mathbf{x}} = f(\mathbf{x})); non-autonomous otherwise.
  • Linear if f(\mathbf{x}, t) = A(t)\,\mathbf{x} + b(t); nonlinear otherwise. Linear systems have explicit solutions via matrix exponentials; nonlinear systems generally do not.

The next three subsubsections walk three ODEs in increasing complexity — exponential decay, the undamped harmonic oscillator, and a second-order system with damping. Each comes with Julia and Python code so the ecosystem fluency built in Unit 2 carries over.

The simplest ODE: exponential decay

A scalar, first-order, linear, autonomous ODE — the canonical warm-up:

\dot x \;=\; -\lambda\, x, \qquad x(0) = x_0, \qquad \lambda > 0,

with exact solution x(t) = x_0\, e^{-\lambda t}. It models radioactive decay, RC-circuit discharge, drug clearance from a bloodstream — anything where the rate of change is proportional to the current amount. The point of starting here is that we can compare numerical integration to a known closed-form answer, which makes solver choices concrete.

In Julia, using the OrdinaryDiffEq ecosystem that the rest of the workshop standardises on:

using OrdinaryDiffEq, Plots

λ, x0 = 0.5, 1.0
decay(x, p, t) = -λ * x
prob = ODEProblem(decay, x0, (0.0, 6.0))
sol  = solve(prob, Tsit5(); saveat = 0.05)

plot(sol.t, sol.u; label = "Tsit5 (numerical)", lw = 2,
     xlabel = "t", ylabel = "x(t)",
     title = "Exponential decay: ẋ = -λx,  λ = $λ")
plot!(sol.t, x0 .* exp.(-λ .* sol.t);
      label = "exact: x₀·e^(-λt)", ls = :dash, c = :black)

In Python with scipy.integrate.solve_ivp (the SciPy equivalent of OrdinaryDiffEq):

units/unit_03/scripts/decay_python.py
import numpy as np
from scipy.integrate import solve_ivp

lam, x0 = 0.5, 1.0
sol = solve_ivp(lambda t, x: -lam * x,
                t_span=(0.0, 6.0), y0=[x0],
                method="RK45", dense_output=True)

t = np.linspace(0, 6, 121)
print("max |numerical - exact| =",
      float(np.max(np.abs(sol.sol(t)[0] - x0 * np.exp(-lam * t)))))

The max error is \sim 10^{-6} for both solvers at default tolerances — a useful baseline before we ask anything harder.

A second-order system: the harmonic oscillator

The simplest non-trivial second-order ODE — a mass on a spring,

\ddot{x} + \omega^2 x = 0 \quad\Longleftrightarrow\quad \begin{pmatrix} \dot{x} \\ \dot{v} \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ -\omega^2 & 0 \end{pmatrix} \begin{pmatrix} x \\ v \end{pmatrix},

with v = \dot{x}. Energy E = \tfrac{1}{2}(v^2 + \omega^2 x^2) is conserved exactly; the phase portrait — trajectories in the (x, v) plane — is a family of concentric ellipses. This is the example the Hamiltonian / Lagrangian neural networks of §3.3 are designed to handle without drifting.

using OrdinaryDiffEq, Plots

ω = 2π                                # one cycle per second
oscillator(u, p, t) = [u[2], -ω^2 * u[1]]

u0s  = [[1.0, 0.0], [1.5, 0.0], [2.0, 0.0]]
tspan = (0.0, 2.0)

plt = plot(xlabel = "x", ylabel = "v",
           title = "Harmonic oscillator phase portrait",
           aspect_ratio = :equal, legend = :outerright)
for u0 in u0s
    sol = solve(ODEProblem(oscillator, u0, tspan), Tsit5();
                saveat = 0.01)
    plot!(plt, sol[1, :], sol[2, :], label = "x(0) = $(u0[1])")
end
plt

OrdinaryDiffEq.Tsit5() is a 5th-order explicit Runge–Kutta solver — a sensible default for non-stiff problems. For stiff systems (e.g. fast diffusion coupled to slow advection) one picks an implicit solver like QNDF() or Rodas5P() instead. Choosing the solver is most of classical numerics; in Unit 4 we’ll let the network learn the dynamics directly and side-step the choice.

Adding damping: the three regimes

Real oscillators bleed energy. Add a velocity-proportional drag 2\zeta\omega\dot x and the equation becomes

\ddot{x} + 2\zeta\omega\, \dot{x} + \omega^2 x = 0,

with damping ratio \zeta \geq 0 — the dimensionless number that controls how the oscillator decays:

  • \zeta < 1underdamped. Oscillation with an exponential envelope e^{-\zeta\omega t}. The default for, e.g., a car suspension or a ringing bell.
  • \zeta = 1critically damped. Fastest return to zero with no overshoot. The design target for, e.g., a closing door or a feedback controller.
  • \zeta > 1overdamped. Slow exponential return; no oscillation. Heavy honey on a spring.

A single Julia script overlays the three regimes:

using OrdinaryDiffEq, Plots

ω = 2π
function damped(u, p, t)
    ζ = p[1]
    [u[2], -ω^2 * u[1] - 2ζ * ω * u[2]]
end

u0  = [1.0, 0.0]
ts  = (0.0, 3.0)
plt = plot(xlabel = "t", ylabel = "x(t)",
           title = "Damped oscillator: ẍ + 2ζω ẋ + ω²x = 0",
           legend = :topright)
for (ζ, label) in [(0.1, "ζ=0.1 (under)"),
                   (1.0, "ζ=1.0 (critical)"),
                   (2.0, "ζ=2.0 (over)")]
    sol = solve(ODEProblem(damped, u0, ts, [ζ]), Tsit5(); saveat = 0.01)
    plot!(plt, sol.t, [u[1] for u in sol.u], label = label, lw = 2)
end
plt

Same model in Python — same shape:

units/unit_03/scripts/damped_python.py
import numpy as np
from scipy.integrate import solve_ivp

omega = 2 * np.pi
def damped(t, u, zeta):
    return [u[1], -omega**2 * u[0] - 2 * zeta * omega * u[1]]

for zeta in (0.1, 1.0, 2.0):
    sol = solve_ivp(damped, (0.0, 3.0), [1.0, 0.0],
                    args=(zeta,), t_eval=np.linspace(0, 3, 301))
    print(f"ζ={zeta}: x(3.0) = {sol.y[0, -1]:+.4f}")

Notice \zeta = 1 minimises the time to “land” near zero — a fact control engineers exploit. We’ll come back to damped oscillators when we discuss the inverse problem in Unit 4: recovering \zeta from a noisy trajectory is the simplest possible “physics parameter from data” problem and a useful sanity check before attempting the AIMS capstone in Units 8–10.

Why ODEs matter for the rest of the unit

Every Sci-ML idea in §§3.2–3.5 starts from an ODE picture:

  • Hybrid models replace part of f with a neural network.
  • Hamiltonian / Lagrangian NNs parametrise an energy function and derive f from it.
  • SINDy assumes f is a sparse combination of library terms and recovers it from data.
  • PINNs turn the IVP itself into a loss — they approximate the solution map t \mapsto \mathbf{x}(t) by a neural network.

Keep the picture \dot{\mathbf{x}} = f(\mathbf{x}, t) in mind as we go through each.

Note✏️ Section exercise — find the fastest landing

The text claims \zeta = 1 minimises the time to “land” near zero. Test it. Define the settling time as the first time after which |x(t)| stays below 0.02 forever (numerically: below 0.02 for the rest of a 10-second solve). Sweep \zeta \in \{0.1, 0.2, \ldots, 2.0\} with the damped-oscillator code above, compute the settling time for each, and plot it against \zeta. Where is the minimum — exactly at \zeta = 1, or slightly below? (Control engineers will tell you the answer is ~0.7 for this tolerance; find out whether they’re right.)

💡 Hint

Solve once per ζ with a fine saveat (0.001 is plenty), then the settling time is sol.t[findall(abs.(x) .> tol)[end]] — the time of the last excursion outside the band. A comprehension over ζs = 0.1:0.05:2.0 gives the whole curve in seconds. Watch the minimum move when you change tol.

3.2 The SciML landscape

Hybrid physics + ML

A hybrid (or universal) model uses a known physical equation for the parts of the dynamics we understand and a neural network for the parts we don’t. Concretely, swap f in \dot{\mathbf{x}} = f(\mathbf{x}, t) for a sum of known and learned pieces:

\dot{\mathbf{x}} = \underbrace{f_{\text{phys}}(\mathbf{x}, t)}_{\text{known}} \;+\; \underbrace{N_\theta(\mathbf{x}, t)}_{\text{learned}}.

Example: a population model with known logistic growth and an unknown immigration term modelled by a small NN; train \theta against observed time series. The benefits over pure ML — data efficiency and out-of-distribution extrapolation. The benefit over pure physics — flexibility where the physics is genuinely unknown. Universal differential equations (UDEs) in Unit 4 are exactly this construction.

Surrogate models for expensive simulations

A surrogate replaces an expensive PDE / ODE solver with a fast learned approximation. Train on a dataset of (input, solver output) pairs; deploy in design optimisation, uncertainty quantification, or real-time inference. The catch: surrogates are only trustworthy inside the distribution they were trained on. Out-of-distribution inputs can produce confident-but-wrong outputs.

Conservation, symmetries, and inductive biases

A neural network that cannot violate conservation of mass, energy, or momentum will generalise better than one that has to learn the constraint from data. Inductive biases that enforce conservation:

  • Hard architectural constraints — output a flux, then take its divergence; output an energy, then take its gradient (HNNs in §3.3 work this way).
  • Soft loss penalties — add a violation term to the loss.
  • Equivariant architectures — networks whose outputs transform correctly under symmetries (translation, rotation, gauge).

The pattern recurs throughout the unit: the more physics we bake in, the less data we need.

Note✏️ Section exercise — pick the right tool

For each scenario below, decide which SciML pattern fits best — hybrid model, surrogate, or conservation-aware architecture — and say in one sentence what the “physics part” and the “ML part” would be:

  1. A wind-farm operator needs aerodynamic loads for 10 000 candidate turbine layouts; each CFD evaluation takes 8 hours.
  2. A pharmacokinetic model has well-understood drug clearance but an unknown absorption mechanism that varies between patients.
  3. A satellite-dynamics propagator slowly drifts in total energy over multi-year simulations, corrupting conjunction forecasts.
  4. A coastal model needs the current field between expensive simulation runs, but its predictions are only ever queried inside the simulated parameter range.

💡 Hint

Classify by bottleneck: is the problem cost per query (→ surrogate), an unknown term inside known dynamics (→ hybrid), or a violated invariant (→ conservation-aware architecture)? Scenario 4’s stated guarantee about where the model is queried is itself a clue.

3.3 Hamiltonian and Lagrangian neural networks

The construction

A Hamiltonian Neural Network (HNN) (Greydanus et al., 2019) parametrises the Hamiltonian H_\theta(q, p) of a mechanical system with a neural network, then derives the dynamics via Hamilton’s equations with autodiff:

\dot{q} \;=\; \frac{\partial H_\theta}{\partial p}, \qquad \dot{p} \;=\; -\,\frac{\partial H_\theta}{\partial q}.

Energy is exactly conserved by construction — no soft penalty needed — because \dot H = \nabla H \cdot \dot{\mathbf{x}} = 0 for the symplectic flow. Train on observed trajectories by minimising the residual

\mathcal{L}(\theta) \;=\; \sum_i \bigl\| (\dot q_i, \dot p_i) - (\partial_p H_\theta, -\partial_q H_\theta)(q_i, p_i) \bigr\|^2.

Lagrangian Neural Networks (LNNs) (Cranmer et al., 2020) do the analogous trick with the Lagrangian L_\theta(q, \dot q) and the Euler–Lagrange equations.

A pendulum sanity check

The simple pendulum has the closed-form Hamiltonian H(q, p) = \tfrac{1}{2} p^2 + (1 - \cos q). An HNN trained on a hundred noisy trajectory points typically recovers an energy-conserving model that drifts an order of magnitude less than a vanilla MLP fit to the same data. The win is sharpest on long extrapolation horizons.

Code placeholder — units/unit_03/scripts/hnn_pendulum.jl (TODO).

Note✏️ Section exercise — energy conservation by construction

Verify the HNN’s core claim without training anything. Take the pendulum Hamiltonian H(q, p) = \tfrac{1}{2}p^2 + (1 - \cos q) and build the vector field (\dot q, \dot p) = (\partial_p H, -\partial_q H) via ForwardDiff (don’t differentiate by hand — the point is that this is exactly what an HNN does with H_\theta). Integrate from (q_0, p_0) = (2.0, 0.0) for 100 time units and plot the energy drift H(t) - H(0). Then do the same for a “slightly wrong” field — multiply the \dot p component by 1.02, mimicking a vanilla MLP’s 2% error — and compare the drifts. What property of the exact Hamiltonian field does the perturbed field lose?

💡 Hint

ForwardDiff.gradient(H, u) returns (\partial_q H, \partial_p H) in one call — build the field by swapping and negating, never differentiating by hand. Run the solver at reltol = 1e-9 so the drift you plot is the field’s property, not the integrator’s. The 2%-wrong field needs only one extra factor in du[2].

3.4 Equation discovery: SINDy

The next three subsubsections cover (a) the sparse-regression idea, (b) how to choose the candidate-term library, and (c) a worked Lorenz example in Julia and Python. We then close with when SINDy beats — and loses to — black-box function approximators.

The sparse regression idea

SINDy (Sparse Identification of Nonlinear Dynamics) (Brunton et al., 2016) asks: instead of fitting a black-box neural network to data, can we directly recover the governing equation by assuming it’s a short symbolic expression? The construction:

  1. Observe time-series data X = [\mathbf{x}(t_1), \mathbf{x}(t_2), \ldots].
  2. Numerically differentiate to estimate \dot X (the hard part — noise destroys naive finite differences; total-variation regularised differentiation or smoothing splines help).
  3. Build a library matrix of candidate right-hand-side terms evaluated at each sample, \Theta(\mathbf{x}) \;=\; [1,\ x_1,\ x_2,\ \ldots,\ x_1^2,\ x_1 x_2,\ \ldots,\ \sin x_1,\ \cos x_1,\ \ldots].
  4. Solve the sparse regression \dot X = \Theta(X)\,\Xi for a sparse coefficient matrix \Xi. The nonzero entries are the equation.

The sparsity is enforced either by \ell_1 regularisation (LASSO) or by STLSQ — Sequentially Thresholded Least Squares: ordinary least squares → set small coefficients to zero → refit on the survivors → repeat until stable. STLSQ is what the original Brunton paper proposed and what the workshop script uses.

Why SINDy is popular. Compared to fitting a neural network to the same dynamics, SINDy offers:

  • Interpretability. The output is a human-readable equation, not a 10-layer black box. Domain experts can sanity-check it.
  • Extrapolation. A 3-term equation is far more likely to be correct outside its training regime than a network that memorised the training trajectory.
  • Data efficiency. Hundreds of samples are usually enough, vs. tens of thousands for a deep model on the same task.
  • Compatibility with downstream physics. Once you have an equation, you can hand it to a stiff ODE solver, do stability analysis, derive control laws — any classical tool works.

Compared to physics-informed neural networks (next units), SINDy is the right move when you don’t know the equation but suspect it’s sparse. PINNs assume the equation is given and solve it; SINDy discovers the equation in the first place. They’re complementary, not competing.

Building a library

A standard polynomial library up to degree 3 plus a few trigonometric terms is the default. Domain knowledge guides which terms to include; including too many makes the regression unstable (many candidate terms ⇒ overfitting), too few makes recovery impossible (the true terms aren’t in the basis). Numerical differentiation of \dot X from time-series X is a separate art (total-variation regularised differentiation, smoothed splines).

Worked example: recovering Lorenz

Given trajectories of the Lorenz system

\dot x = \sigma(y - x),\quad \dot y = x(\rho - z) - y,\quad \dot z = xy - \beta z,

with moderate observation noise, SINDy with a degree-2 polynomial library typically recovers the exact equations and the parameters (\sigma, \rho, \beta) \approx (10, 28, 8/3) — the canonical demonstration that a sparse prior plus enough data is sometimes all you need.

A bare-bones STLSQ implementation (no DataDrivenDiffEq dependency) makes the recipe explicit: build the library, threshold the OLS coefficients, re-fit on the surviving terms, repeat.

# SINDy on the Lorenz system.
#
# Simulate a Lorenz trajectory, add observation noise, and recover the
# symbolic equations by sparse regression against a degree-2 polynomial
# library — the worked example in unit_03.qmd §3.4.
#
# Algorithm: sequentially thresholded least squares (STLSQ), the
# canonical SINDy update from Brunton, Proctor & Kutz (2016).
# Derivatives are obtained by second-order central differences; the
# qmd notes that derivative estimation is a separate art and we keep
# this script simple by relying on a dense save interval.
#
# Run via ./build.sh execute 3 (writes output to ../output/sindy_lorenz.md).

using OrdinaryDiffEq, Random, LinearAlgebra, Printf, Statistics

# ── 1. simulate Lorenz ─────────────────────────────────────────────────
σ_true, ρ_true, β_true = 10.0, 28.0, 8/3
function lorenz!(du, u, p, t)
    x, y, z = u
    du[1] = σ_true * (y - x)
    du[2] = x * (ρ_true - z) - y
    du[3] = x * y - β_true * z
end

u0    = [-8.0, 7.0, 27.0]
tspan = (0.0, 20.0)
dt    = 0.002
sol   = solve(ODEProblem(lorenz!, u0, tspan), Tsit5();
              saveat = dt, abstol = 1e-10, reltol = 1e-10)

X_clean = permutedims(reduce(hcat, sol.u))            # N × 3

# ── 2. add observation noise ───────────────────────────────────────────
rng = Random.MersenneTwister(0)
noise_level = 0.01                                    # 1 % of per-axis std
σ_obs = noise_level .* vec(std(X_clean; dims = 1))'
X = X_clean .+ σ_obs .* randn(rng, size(X_clean))

# ── 3. central-difference derivatives, trim end points ─────────────────
= (X[3:end, :] .- X[1:end-2, :]) ./ (2 * dt)
Xc = X[2:end-1, :]

# ── 4. degree-2 polynomial feature library ─────────────────────────────
# columns: 1, x, y, z, x², xy, xz, y², yz, z²
function poly_library(X)
    x, y, z = X[:, 1], X[:, 2], X[:, 3]
    hcat(ones(length(x)), x, y, z,
         x.^2, x.*y, x.*z, y.^2, y.*z, z.^2)
end
labels = ["1", "x", "y", "z", "x^2", "x*y", "x*z", "y^2", "y*z", "z^2"]
Θ = poly_library(Xc)

# ── 5. STLSQ — sequentially thresholded least squares ──────────────────
function stlsq(Θ, Ẋ, λ; n_iters = 15)
    Ξ = Θ \
    for _ in 1:n_iters
        small = abs.(Ξ) .< λ
        Ξ[small] .= 0
        for j in axes(Ξ, 2)
            keep = .!small[:, j]
            sum(keep) == 0 && continue
            Ξ[keep, j] = Θ[:, keep] \ Ẋ[:, j]
        end
    end
    Ξ
end

λ = 0.1
Ξ = stlsq(Θ, Ẋ, λ)

# ── 6. report ──────────────────────────────────────────────────────────
state_names = ["dx/dt", "dy/dt", "dz/dt"]

println("Lorenz SINDy demo")
println("samples used      : ", size(Xc, 1))
println("observation noise : ", noise_level * 100, " % of per-axis std")
println("STLSQ threshold λ : ", λ)
println()
println("True system:")
println("  dx/dt = -10·x + 10·y")
println("  dy/dt =  28·x -  y - x·z")
println("  dz/dt =  x·y - (8/3)·z")
println()
println("Recovered equations:")
for j in 1:3
    terms = String[]
    for i in eachindex(labels)
        c = Ξ[i, j]
        c == 0 && continue
        push!(terms, @sprintf("%+0.3f·%s", c, labels[i]))
    end
    println("  $(state_names[j]) = ", isempty(terms) ? "0" : join(terms, "  "))
end

println()
@printf("σ̂ ≈ %0.3f   (true %0.3f)\n", -Ξ[2, 1], σ_true)
@printf("ρ̂ ≈ %0.3f   (true %0.3f)\n",  Ξ[2, 2], ρ_true)
@printf("β̂ ≈ %0.3f   (true %0.3f)\n", -Ξ[4, 3], β_true)

Captured output from ./build.sh execute 3:

Lorenz SINDy demo
samples used      : 9999
observation noise : 1.0 % of per-axis std
STLSQ threshold λ : 0.1

True system:
  dx/dt = -10·x + 10·y
  dy/dt =  28·x -  y - x·z
  dz/dt =  x·y - (8/3)·z

Recovered equations:
  dx/dt = -9.995·x  +9.995·y
  dy/dt = +27.971·x  -0.990·y  -0.999·x*z
  dz/dt = -2.667·z  +1.000·x*y

σ̂ ≈ 9.995   (true 10.000)
ρ̂ ≈ 27.971   (true 28.000)
β̂ ≈ 2.667   (true 2.667)

The same workflow in Python uses the pysindy package — the canonical SINDy implementation, maintained by the Brunton group:

units/unit_03/scripts/sindy_lorenz_python.py
# pip install pysindy
import numpy as np
from scipy.integrate import solve_ivp
import pysindy as ps

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
def lorenz(t, u): return [sigma * (u[1] - u[0]),
                          u[0] * (rho - u[2]) - u[1],
                          u[0] * u[1] - beta * u[2]]

t = np.linspace(0, 10, 5001)
sol = solve_ivp(lorenz, (t[0], t[-1]), [-8.0, 7.0, 27.0],
                t_eval=t, rtol=1e-9, atol=1e-9)
X = sol.y.T

model = ps.SINDy(
    feature_library=ps.PolynomialLibrary(degree=2),
    optimizer=ps.STLSQ(threshold=0.1),
)
model.fit(X, t=t)
model.print()

PySINDy bundles all the choices we made explicit in the Julia script (library, optimiser, differentiation) into a scikit-learn style estimator; the trade-off is less visibility into the inner loop. Both produce the same Lorenz equation, recovered to ~3 significant figures.

When discovery beats approximation

SINDy wins when

  1. the underlying equation is genuinely sparse in your library,
  2. you care about interpretability — a 3-term equation is more useful than a 10-layer network, and
  3. you have enough clean trajectory data to estimate derivatives.

It loses when the dynamics are intrinsically high-dimensional, the right basis isn’t in your library, or the noise drowns the derivatives.

Note✏️ Section exercise — SINDy on the damped oscillator

Run the full SINDy recipe yourself on a system where you know the answer: the damped oscillator of §3.1, \dot x = v, \dot v = -\omega^2 x - 2\zeta\omega v with \omega = 2\pi, \zeta = 0.1. Simulate a trajectory, estimate \dot X by central differences, build a polynomial library up to degree 2 in (x, v), and run 5 iterations of STLSQ (threshold 0.5). You should recover the two equations with coefficients (-\omega^2, -2\zeta\omega) \approx (-39.5, -1.26). Then add 1% Gaussian noise to the trajectory before differentiating and watch what happens to the recovered coefficients. Which step of the pipeline failed?

💡 Hint

Differentiate with central differences (X[3:end,:] .- X[1:end-2,:]) ./ (t[3:end] .- t[1:end-2]) and drop the endpoints. The library is one hcat; STLSQ is literally: fit with \, zero the small coefficients, refit the survivors, repeat five times. For the noise experiment, perturb X before differencing — that ordering is the whole point.

3.5 Proper Orthogonal Decomposition (POD)

Dynamical systems often look high-dimensional but live on a low-dimensional manifold — a turbulent flow field has millions of grid cells but the coherent motion (vortices, waves) is described by a handful of modes. Proper Orthogonal Decomposition (POD) — also called Principal Component Analysis when applied to snapshot matrices — extracts those modes systematically. It’s the classical workhorse for reduced-order modelling (ROM) and the first thing to reach for when you suspect your system has hidden low-dimensional structure.

The snapshot SVD

Collect N snapshots of the state \mathbf{x}_k \in \mathbb{R}^M (e.g. M = number of grid cells, N = number of time samples) into a snapshot matrix:

X \;=\; \bigl[\mathbf{x}_1 \;|\; \mathbf{x}_2 \;|\; \ldots \;|\; \mathbf{x}_N\bigr] \;\in\; \mathbb{R}^{M \times N}.

Compute its singular value decomposition,

X \;=\; U \Sigma V^\top, \qquad U \in \mathbb{R}^{M \times r},\ \Sigma \in \mathbb{R}^{r \times r},\ V \in \mathbb{R}^{N \times r},

with \Sigma = \mathrm{diag}(\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0). The columns of U are the POD modes — an orthonormal basis for the snapshots, ranked by how much variance they explain. Keep the first k \ll r modes and you have a k-dimensional approximation:

\mathbf{x}(t) \;\approx\; U_k \, \mathbf{a}(t), \qquad \mathbf{a}(t) = U_k^\top \mathbf{x}(t) \;\in\; \mathbb{R}^k.

The decay of \sigma_i tells you whether the reduction will work. Fast decay → a few modes capture almost everything → ROM is viable. Slow decay → genuinely high-dimensional → POD will lose detail.

Worked example: damped oscillator chain

A chain of M = 50 masses coupled by springs, given a localised initial push and lightly damped, is a useful test bench. The state-space dimension is 100 (positions + velocities) but a handful of normal modes carry almost all the energy:

units/unit_03/scripts/pod_chain.jl
using OrdinaryDiffEq, LinearAlgebra, Plots

# 50-mass chain: x¨ᵢ = ω²(xᵢ₋₁ - 2xᵢ + xᵢ₊₁) - 2ζω x˙ᵢ  (free at endpoints)
M, ω, ζ = 50, 2π, 0.02
function chain!(du, u, p, t)
    x, v = @views u[1:M], u[M+1:end]
    du[1:M] .= v
    @inbounds for i in 1:M
        l = i == 1     ? 0.0 : x[i - 1]
        r = i == M     ? 0.0 : x[i + 1]
        du[M + i] = ω^2 * (l - 2x[i] + r) - 2ζ * ω * v[i]
    end
end

u0 = zeros(2M); u0[M ÷ 2] = 1.0          # localised initial displacement
sol = solve(ODEProblem(chain!, u0, (0.0, 8.0)), Tsit5(); saveat = 0.02)

# Snapshot matrix: rows = positions, columns = time samples
X = reduce(hcat, [u[1:M] for u in sol.u])
U, Σ, V = svd(X)

println("first 10 singular values (relative):")
foreach(i -> println("  σ_$i / σ_1 = ", round(Σ[i] / Σ[1]; digits = 4)),
        1:10)

# Reconstruction error vs number of modes kept
k_range = 1:15
errs = [norm(X - U[:, 1:k] * Diagonal(Σ[1:k]) * V[:, 1:k]') / norm(X)
        for k in k_range]
plot(k_range, errs; lw = 2, marker = :circle,
     xlabel = "modes kept k", ylabel = "relative reconstruction error",
     title = "POD spectrum: 50-mass damped chain", yaxis = :log)

Typical behaviour: ~6 modes recover >99\% of the variance even though the system has 100 state-space dimensions. The first mode is the symmetric breathing of the chain, the second is the first-overtone shape, and so on — modes hierarchically encoded by spatial frequency, exactly as a Fourier basis would predict for this linear system.

Same workflow in Python — scipy.linalg.svd is the SVD; the rest is numpy:

units/unit_03/scripts/pod_chain_python.py
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import svd

M, omega, zeta = 50, 2 * np.pi, 0.02
def chain(t, u):
    x, v = u[:M], u[M:]
    du = np.zeros_like(u)
    du[:M] = v
    for i in range(M):
        l = 0.0 if i == 0     else x[i - 1]
        r = 0.0 if i == M - 1 else x[i + 1]
        du[M + i] = omega**2 * (l - 2 * x[i] + r) - 2 * zeta * omega * v[i]
    return du

u0 = np.zeros(2 * M); u0[M // 2] = 1.0
sol = solve_ivp(chain, (0.0, 8.0), u0, t_eval=np.linspace(0, 8, 401))
X = sol.y[:M, :]                                       # (M, N) positions

U, Sigma, Vt = svd(X, full_matrices=False)
print("first 6 singular values (relative):",
      np.round(Sigma[:6] / Sigma[0], 4))
print(f"variance captured by k=6 modes: "
      f"{(Sigma[:6]**2).sum() / (Sigma**2).sum():.4f}")

POD’s role in modern Sci-ML

POD by itself is linear and deterministic — it pre-dates deep learning by half a century. But it shows up everywhere in contemporary work:

  • As a preprocessor. Reduce 1M-dimensional snapshots to 100 POD coefficients, then learn the evolution of those coefficients with a small neural ODE or transformer. Modern weather emulators (Pangu-Weather, GraphCast) lean on this idea.
  • As a sanity check. If POD already captures 99% of variance in 5 modes, you don’t need deep learning — a linear ROM is enough.
  • As a comparison baseline. Any nonlinear surrogate should beat the POD-truncation baseline; if it doesn’t, the nonlinearity isn’t pulling its weight.

POD’s nonlinear cousins — autoencoders, dynamic mode decomposition (DMD), Koopman operators — generalise the idea: project the dynamics onto a learned (possibly nonlinear) low-dimensional manifold, then evolve there. We won’t go deep into them in the workshop, but the POD picture is the right mental starting point.

Note✏️ Section exercise — POD on a diffusing field

The chain example showed slow singular-value decay for an oscillatory system. Now try a diffusive one, where POD shines. Solve the 1-D heat equation u_t = \alpha u_{xx} (\alpha = 0.01, x \in [0, 1], homogeneous Dirichlet BCs, initial condition u_0(x) = \exp(-200(x - 0.3)^2)) with simple finite differences, collect ~200 snapshots, and compute the snapshot SVD. How many modes capture 99% of the variance, and how does that compare to the oscillator chain? Bonus: plot the first three POD modes — what classical basis do they resemble, and why was that predictable from §6.2’s separation of variables (peek ahead if curious)?

💡 Hint

The §6.4 FTCS code is your simulator — just change the IC and push a copy(u) into a snapshot vector every ~40 steps. Then svd from LinearAlgebra on reduce(hcat, snapshots), and the variance captured is cumsum(Σ.^2) ./ sum(Σ.^2). Plot U[:, 1:3] against x for the modes.

3.6 Physics-informed machine learning: a conceptual preview

The core idea

A neural network u_\theta(\mathbf{x}, t) approximates the solution of a PDE (or ODE). The loss has a physics term — the equation residual evaluated by autodiff at scattered collocation points — plus boundary / initial conditions and (optionally) a data term:

\mathcal{L}(\theta) \;=\; \lambda_r\,\mathcal{L}_{\text{PDE}} + \lambda_b\,\mathcal{L}_{\text{BC}} + \lambda_i\,\mathcal{L}_{\text{IC}} + \lambda_d\,\mathcal{L}_{\text{data}}.

Minimising it simultaneously fits the data and respects the equation. No grid, no time-stepping, no mesh.

Mechanically the residual is the part that’s new: for the oscillator at the top of this unit, \mathcal{L}_{\text{PDE}}(\theta) = \frac{1}{N_r}\sum_i \bigl| \ddot u_\theta(t_i) + \omega^2 u_\theta(t_i) \bigr|^2, with the second derivative computed by autodiff straight through u_\theta.

What makes PIML different from other Sci-ML methods

  • vs surrogates: PIML doesn’t need a precomputed dataset — the equation itself supervises the network.
  • vs equation discovery: PIML assumes the equation is known and asks for its solution (or its parameters via the inverse problem).
  • vs hybrid models: PIML is a special case where the entire solution field is the “ML part” and the equation residual is the “physics part”.

Where this lands in the workshop

Unit 4 takes the ODE picture from §3.1 and adds learnable terms — the Neural-ODE / UDE picture from §3.2. Unit 5 implements vanilla PINNs on the oscillator above and a small collection of ODE / PDE toys. Unit 6 provides the PDE foundations. Unit 7 addresses where naïve PINNs fail and how the modern fixes close the gap. Units 8–10 assemble the AIMS capstone — forward (Part I) and inverse (Part II).

Note✏️ Section exercise — evaluate a residual by hand

No training yet — just the residual. For the oscillator \ddot u + \omega^2 u = 0 with \omega = 2\pi, take three candidate solutions: (a) the exact u(t) = \cos(\omega t), (b) the wrong-frequency u(t) = \cos(0.9\,\omega t), and (c) the damped impostor u(t) = e^{-t/2}\cos(\omega t). Compute the PINN residual loss \mathcal{L}_{\text{PDE}} = \frac{1}{N}\sum_i |\ddot u(t_i) + \omega^2 u(t_i)|^2 for each over N = 100 uniform points on [0, 2], computing \ddot u with nested ForwardDiff.derivative rather than by hand. Rank the three losses. Candidate (c) fits the IC perfectly and looks plausible — what does its residual reveal, and at which t is the residual largest?

💡 Hint

Each candidate is a plain Julia function of t — no network anywhere. Get \ddot u by nesting ForwardDiff.derivative twice on the function, then mean the squared residual over the grid. For the ‘where is it largest’ question, plot the residual of candidate (c) against t rather than reasoning it out.