Unit 5: PINNs on Basic Models

Published

26/06/2026

Our first hands-on PINN, in the simplest possible settings. We do it twice:

From first principles — a one-page implementation in Lux + Zygote that you can read end-to-end. No PINN library, no symbolic machinery; just a network, a residual, and an optimiser. This is the inline code in §5.3.
With NeuralPDE.jl — the same problem written declaratively against a ModelingToolkit PDESystem and discretised in a single line. Concise, scalable, the right tool for the AIMS capstone — but harder to debug if you don’t know the first-principles version first. This is the standalone script in §5.4.

Before the implementations, §5.2 is a focused autodiff deep-dive — the forward/reverse/forward-over-forward composition that PINNs depend on, in both Julia and JAX. We then push NeuralPDE.jl from the ODE up to the simplest PDE (1D diffusion, §5.5) and call out the failure modes that Unit 7 finally fixes.

5.1 The PINN idea

Residual loss from a differential equation

For an ODE \dot{x} = f(x, t), define a network x_\theta(t) — input time, output state — and the residual

r_\theta(t) \;=\; \dot{x}_\theta(t) - f(x_\theta(t), t),

with \dot{x}_\theta computed by autodiff through the network. The residual loss is

\mathcal{L}_{\text{ODE}}(\theta) \;=\; \frac{1}{N_r}\sum_{i=1}^{N_r} \bigl|r_\theta(t_i)\bigr|^2

evaluated at scattered collocation points \{t_i\}. Minimising it pushes x_\theta toward a solution of the ODE.
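As a minimal sketch of this loss in plain Python (no network involved): the hypothetical helper `residual_loss` below accepts any candidate function in place of x_\theta, approximates \dot{x} with a central difference rather than autodiff, and averages the squared residual of \dot{x} = -x over a uniform set of collocation points. The step size `h` and the two candidate functions are illustrative assumptions.

```python
import math

def residual_loss(x, f, ts, h=1e-4):
    """Mean squared residual of dx/dt = f(x, t) at the collocation points ts.

    x : candidate solution, a function t -> x(t) (stands in for x_theta)
    f : right-hand side of the ODE, a function (x, t) -> dx/dt
    h : central-difference step (assumption: a stencil replaces autodiff here)
    """
    total = 0.0
    for t in ts:
        dxdt = (x(t + h) - x(t - h)) / (2 * h)   # approximate time derivative
        r = dxdt - f(x(t), t)                    # residual r(t)
        total += r * r
    return total / len(ts)

f = lambda x, t: -x                              # the ODE dx/dt = -x
ts = [4 * i / 63 for i in range(64)]             # 64 uniform points on [0, 4]

loss_exact = residual_loss(lambda t: math.exp(-t), f, ts)   # a true solution
loss_line = residual_loss(lambda t: 1.0 - t, f, ts)         # not a solution

print(loss_exact, loss_line)
```

The first loss is zero up to stencil and rounding error. For the straight line the residual works out to -t, so the second loss is the mean of t^2 over the grid (about 5.4), which is what the optimiser would be pushing down.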

Initial conditions are what pin it down

A residual loss alone has infinitely many minimisers: the ODE's solutions form a one-parameter family indexed by x_0. The IC term

\mathcal{L}_{\text{IC}}(\theta) \;=\; \bigl|x_\theta(0) - x_0\bigr|^2

(with weight \lambda_{\text{IC}}) selects the trajectory we want. The same idea applies to boundary conditions on PDEs. The weights \lambda_{\text{IC}}, \lambda_{\text{BC}}, \lambda_{\text{PDE}} are the main hyperparameters — tuning them well is a major theme of Unit 7.

A second example: the heat equation

The same recipe carries over to PDEs — now with a genuine boundary term. Take the 1-D heat equation on x \in [0, 1], t \in [0, T],

\partial_t u \;=\; \alpha\, \partial_{xx} u, \tag{1}

approximated by a network u_\theta(x, t). Three losses combine. The PDE residual at interior collocation points \{(x_i, t_i)\},

\mathcal{L}_{\text{PDE}}(\theta) \;=\; \frac{1}{N_r}\sum_{i=1}^{N_r} \bigl|\,\partial_t u_\theta - \alpha\, \partial_{xx} u_\theta\,\bigr|^2_{(x_i, t_i)},

the initial condition u(x, 0) = u_0(x),

\mathcal{L}_{\text{IC}}(\theta) \;=\; \frac{1}{N_0}\sum_{j=1}^{N_0} \bigl|\,u_\theta(x_j, 0) - u_0(x_j)\,\bigr|^2,

and the boundary condition — say the homogeneous Dirichlet ends u(0, t) = u(1, t) = 0,

\mathcal{L}_{\text{BC}}(\theta) \;=\; \frac{1}{N_b}\sum_{k=1}^{N_b} \Bigl(\bigl|\,u_\theta(0, t_k)\,\bigr|^2 + \bigl|\,u_\theta(1, t_k)\,\bigr|^2\Bigr).

The PINN minimises their weighted sum \lambda_{\text{PDE}}\mathcal{L}_{\text{PDE}} + \lambda_{\text{IC}}\mathcal{L}_{\text{IC}} + \lambda_{\text{BC}}\mathcal{L}_{\text{BC}}. The only genuinely new ingredient versus the ODE is \partial_{xx} u_\theta — a second derivative of the network with respect to its input — which is exactly what §5.2 below shows how to compute.
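To make the three-term loss concrete, here is a minimal sketch in plain Python. It is not a PINN: a known closed-form function stands in for u_\theta, and central differences stand in for autodiff. The helper names, the stencil step `H`, the value of `ALPHA`, the point sets, and the unit weights are all illustrative assumptions.

```python
import math

ALPHA = 0.1   # diffusivity (illustrative value)
H = 1e-3      # stencil step standing in for autodiff (assumption)

def u0(x):
    """Initial condition, chosen so that the exact solution is known."""
    return math.sin(math.pi * x)

def u(x, t):
    """Stand-in for u_theta: the exact solution for this initial condition."""
    return math.exp(-ALPHA * math.pi**2 * t) * math.sin(math.pi * x)

def pde_residual(u, x, t):
    u_t = (u(x, t + H) - u(x, t - H)) / (2 * H)
    u_xx = (u(x + H, t) - 2 * u(x, t) + u(x - H, t)) / H**2
    return u_t - ALPHA * u_xx

def pinn_losses(u, interior, xs_ic, ts_bc):
    """Return (L_PDE, L_IC, L_BC) as defined in the three formulas above."""
    l_pde = sum(pde_residual(u, x, t) ** 2 for x, t in interior) / len(interior)
    l_ic = sum((u(x, 0.0) - u0(x)) ** 2 for x in xs_ic) / len(xs_ic)
    l_bc = sum(u(0.0, t) ** 2 + u(1.0, t) ** 2 for t in ts_bc) / len(ts_bc)
    return l_pde, l_ic, l_bc

grid = [(i + 0.5) / 10 for i in range(10)]         # 10 midpoints in (0, 1)
interior = [(x, t) for x in grid for t in grid]    # 100 interior points
l_pde, l_ic, l_bc = pinn_losses(u, interior, grid, grid)

lam_pde, lam_ic, lam_bc = 1.0, 1.0, 1.0            # loss weights
total = lam_pde * l_pde + lam_ic * l_ic + lam_bc * l_bc
print(l_pde, l_ic, l_bc, total)
```

Because `u` is an exact solution, all three terms come out at (or within stencil error of) zero. Passing `lambda x, t: u0(x)` instead, a profile that never decays, keeps the IC and BC terms near zero but makes the PDE term clearly nonzero, which shows how each term polices a different condition.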

Collocation points

Where do you put the \{t_i\}? Options:

Uniform grid — simple, biased toward regular regions.
Random uniform — eliminates grid bias.
Latin hypercube — better space coverage in higher dimensions.
Adaptive — sample more densely where the residual is high.

For a first pass on a smooth ODE, a uniform grid of a few hundred points is plenty.
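The first three strategies can be sketched in a few lines of plain Python (the adaptive strategy needs a residual from a partly trained network, so it is left out). The function names and the seeded `random.Random` generator are illustrative assumptions, not part of any PINN library.

```python
import random

def uniform_grid(a, b, n):
    """n equally spaced points on [a, b], endpoints included."""
    return [a + (b - a) * i / (n - 1) for i in range(n)]

def random_uniform(a, b, n, rng):
    """n independent uniform draws on [a, b)."""
    return [a + (b - a) * rng.random() for _ in range(n)]

def latin_hypercube(bounds, n, rng):
    """n points in a box; each axis is cut into n strata, one point per stratum.

    bounds : list of (low, high) pairs, one per dimension
    """
    columns = []
    for low, high in bounds:
        strata = list(range(n))
        rng.shuffle(strata)                        # independent shuffle per axis
        columns.append([low + (high - low) * (s + rng.random()) / n
                        for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n)]

rng = random.Random(0)
ts_grid = uniform_grid(0.0, 4.0, 64)
ts_rand = random_uniform(0.0, 4.0, 64, rng)
xt_lhs = latin_hypercube([(0.0, 1.0), (0.0, 4.0)], 64, rng)   # (x, t) pairs
print(len(ts_grid), len(ts_rand), len(xt_lhs))
```

The Latin hypercube guarantees that every one of the 64 slices along x, and every one of the 64 slices along t, contains exactly one point, which is the coverage property that plain random sampling lacks.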

✏️ Section exercise — count the solutions

Two short pencil-and-paper checks of the §5.1 picture:

For \dot x = -x without any IC term in the loss, write down three different functions that drive \mathcal{L}_{\text{ODE}} to exactly zero. What is the full family of zero-residual solutions, and which single member does adding \mathcal{L}_{\text{IC}} = |x_\theta(0) - 1|^2 select?
Suppose you set \lambda_{\text{IC}} = 10^{-6} and train to a total loss of 10^{-8}. Roughly how far can x_\theta(0) sit from 1 while still “succeeding”? What does this tell you about reading a small total loss as evidence of a correct solution?

💡 Hint

Part 1: the zero-residual set is exactly the ODE’s solution family. Solve \dot x = -x by hand: the general solution is x(t) = Ce^{-t}, so any C (try 0, 1, 5) gives zero residual; note that C=0 (the flat x\equiv0) is a valid zero-residual function and a favourite of lazy optimisers. Adding the IC term forces C=1. Part 2: both loss terms are non-negative, so the weighted IC term alone cannot exceed the total, \lambda_{IC}\,|x_\theta(0)-1|^2 \le \mathcal{L}_{total}; plug in \lambda_{IC} = 10^{-6} and \mathcal{L}_{total} = 10^{-8} and solve |x_\theta(0)-1| \le \sqrt{10^{-8}/10^{-6}}. The number is the moral: a tiny total loss can hide a wrong solution. No code for either part.

Go to solution →

5.2 Autodiff for PINNs: forward, reverse, and second derivatives

Unit 2 §2.7 introduced forward vs reverse mode at the level of “scalar loss gradient over parameters”. PINNs need more than that. The residual r_\theta(x, t) = \partial_t u_\theta - \alpha\,\partial_{xx} u_\theta contains derivatives of the network output with respect to the inputs (x, t), and the training gradient needs the parameter-gradient of that residual loss: a derivative of a derivative. Getting the AD composition right is what makes a PINN train in seconds rather than minutes (or not at all). This section walks through the four AD operations PINNs actually need, side-by-side in Julia (ForwardDiff / Zygote / Enzyme) and JAX (jax.grad / jax.jacfwd / jax.hessian).

Three derivatives, three modes

The running example is the heat equation of Equation 1, \partial_t u = \alpha\, \partial_{xx} u, whose residual

r_\theta(x, t) \;=\; \partial_t u_\theta(x, t) \;-\; \alpha\, \partial_{xx} u_\theta(x, t)

needs exactly the derivatives in the table below. For a network u_\theta(\mathbf{x}, t): \mathbb{R}^{d+1} \to \mathbb{R} with P parameters, three derivative computations recur:

What	Shape	Right mode
\partial u_\theta / \partial x_i at one (\mathbf{x}, t)	\mathbb{R}^{d+1} \to \mathbb{R}	forward (cheap inputs)
\partial^2 u_\theta / \partial x_i^2 at one (\mathbf{x}, t)	scalar	forward-over-forward
\partial \mathcal{L} / \partial \theta over all \theta	\mathbb{R}^P \to \mathbb{R}	reverse (cheap output)

The composition that trains a PINN is therefore reverse-over-(forward-over-forward): forward for the \partial_x, forward again for the \partial_{xx}, reverse on the outside for \partial_\theta \mathcal{L}. The Julia stack (Lux + ForwardDiff + Zygote) and the JAX stack (flax + jax.jacfwd + jax.grad) used in this section are built around this composition. PyTorch PINNs usually reach the same derivatives a different way, by nesting reverse-mode torch.autograd.grad calls with create_graph=True.

Forward mode: spatial derivatives of the network

In Julia, ForwardDiff.derivative is the standard tool for scalar inputs:

using Lux, Random, ForwardDiff

rng = Random.MersenneTwister(0)
u = Lux.Chain(Lux.Dense(2 => 16, tanh), Lux.Dense(16 => 1))
ps, st = Lux.setup(rng, u)

u_θ(x, t) = first(u([x, t], ps, st)[1])

# ∂u/∂x via forward mode
∂x(x, t) = ForwardDiff.derivative(ξ -> u_θ(ξ, t), x)
@info "∂u/∂x at (0.3, 0.5) = $(∂x(0.3, 0.5))"

The JAX equivalent uses jax.jacfwd, the forward-mode Jacobian. For a scalar-in, scalar-out function it returns the same number as jax.grad, but jax.grad computes it in reverse mode, so jacfwd is the call that matches ForwardDiff here:

import jax, jax.numpy as jnp
from flax import linen as nn

class MLP(nn.Module):
    @nn.compact
    def __call__(self, xt):
        h = nn.tanh(nn.Dense(16)(xt))
        return nn.Dense(1)(h).squeeze()

mlp = MLP()
key = jax.random.PRNGKey(0)
params = mlp.init(key, jnp.zeros(2))

u_theta = lambda x, t: mlp.apply(params, jnp.array([x, t]))
du_dx   = jax.jacfwd(u_theta, argnums=0)        # ∂u/∂x, forward mode
print("∂u/∂x at (0.3, 0.5) =", du_dx(0.3, 0.5))

Second derivatives: the PDE residual

For the heat equation we need \partial^2_x u_\theta. The cleanest composition is forward-over-forward: differentiate once with ForwardDiff, then differentiate the result again with ForwardDiff. The dual-number machinery handles the nesting automatically:

# ∂²u/∂x² via forward-over-forward
∂xx(x, t) = ForwardDiff.derivative(
                ξ -> ForwardDiff.derivative(η -> u_θ(η, t), ξ),
                x,
            )
@info "∂²u/∂x² at (0.3, 0.5) = $(∂xx(0.3, 0.5))"

# Residual of the heat equation ∂_t u = α ∂_xx u
α = 0.1
∂t(x, t) = ForwardDiff.derivative(τ -> u_θ(x, τ), t)
r(x, t)  = ∂t(x, t) - α * ∂xx(x, t)
@info "heat-eq residual at (0.3, 0.5) = $(r(0.3, 0.5))"

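In JAX the analogous pattern nests jax.jacfwd twice (jax.hessian would give the same value, but internally it is jacfwd ∘ jacrev, i.e. forward-over-reverse rather than the forward-over-forward used here):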

import jax

# ∂²u/∂x² at fixed t
d2u_dx2 = jax.jacfwd(jax.jacfwd(u_theta, argnums=0), argnums=0)
du_dt   = jax.jacfwd(u_theta, argnums=1)

alpha = 0.1
def residual(x, t):
    return du_dt(x, t) - alpha * d2u_dx2(x, t)

print("heat-eq residual at (0.3, 0.5) =", residual(0.3, 0.5))

For higher-dimensional inputs (say a 3-D wave equation needing \partial^2_x + \partial^2_y + \partial^2_z), write the network as a function of the spatial vector (x, y, z); jax.hessian of that function returns the full 3 \times 3 Hessian at a point, and its trace is the Laplacian.
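The dual-number nesting behind forward-over-forward can be illustrated with a toy implementation in plain Python. This is a sketch of the idea only, not how ForwardDiff or JAX are implemented; the `Dual` class, the `derivative` helper, and the one-neuron stand-in network with its arbitrary weights are all hypothetical.

```python
import math

class Dual:
    """Number a + b*eps with eps**2 = 0; the eps slot carries the derivative."""
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps

    @staticmethod
    def lift(z):
        return z if isinstance(z, Dual) else Dual(z)

    def __neg__(self):
        return Dual(-self.val, -self.eps)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.eps + other.eps)
    __radd__ = __add__

    def __sub__(self, other):
        return self + (-Dual.lift(other))

    def __rsub__(self, other):
        return Dual.lift(other) + (-self)

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)   # product rule
    __rmul__ = __mul__

def tanh(z):
    """tanh for floats and for (possibly nested) Duals."""
    if isinstance(z, Dual):
        y = tanh(z.val)
        return Dual(y, (1 - y * y) * z.eps)      # chain rule: tanh' = 1 - tanh^2
    return math.tanh(z)

def derivative(f, x):
    """Forward-mode derivative of f at x: seed eps = 1, read eps back out."""
    return Dual.lift(f(Dual(x, 1.0))).eps

# Toy one-neuron "network" u(x) = 0.7 * tanh(1.5 x + 0.2); weights are arbitrary.
u = lambda x: 0.7 * tanh(1.5 * x + 0.2)

du = derivative(u, 0.3)
d2u = derivative(lambda xi: derivative(u, xi), 0.3)   # forward-over-forward
print(du, d2u)
```

In the nested call the inner `Dual` holds an outer `Dual` in its value slot, so one forward pass carries the value, the first derivative, and the second derivative together. Both results agree with the closed forms 1.05(1 - y^2) and -3.15\,y(1 - y^2), where y = \tanh(0.65).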

The outer gradient: reverse mode over parameters

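The residual at a collocation point is a scalar; the loss sums many of them, and then we want \partial_\theta \mathcal{L}: a single output, P inputs. This is classic reverse-mode territory: Zygote.gradient in Julia, jax.grad in JAX. In JAX the two modes compose by design. The Julia listing below shows the intended composition, but read “A subtler pitfall” later in this section before relying on its gradient.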

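using Zygote

# Collocation points (illustrative): 64 random points in [0, 1] × [0, 1]
xs, ts = rand(64), rand(64)

function pde_loss(ps, st, xs, ts)
    # u_θ closes over the current parameters ps
    u_θ(x, t) = first(u([x, t], ps, st)[1])
    ŝ = 0.0
    for (x, t) in zip(xs, ts)
        ∂t  = ForwardDiff.derivative(τ -> u_θ(x, τ), t)
        ∂xx = ForwardDiff.derivative(
                  ξ -> ForwardDiff.derivative(η -> u_θ(η, t), ξ), x)
        ŝ += (∂t - α * ∂xx)^2      # squared heat-equation residual
    end
    ŝ / length(xs)                 # mean over collocation points
end

# Reverse-mode autodiff over θ, with forward-over-forward inside.
# Caveat: see "A subtler pitfall" below before trusting this gradient in Julia.
grad = first(Zygote.gradient(p -> pde_loss(p, st, xs, ts), ps))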

The same shape in JAX is breathtakingly compact:

def loss(params, xs, ts):
    def r_single(x, t):
        u   = lambda x, t: mlp.apply(params, jnp.array([x, t]))
        d2  = jax.jacfwd(jax.jacfwd(u, argnums=0), argnums=0)
        dt  = jax.jacfwd(u, argnums=1)
        return (dt(x, t) - alpha * d2(x, t)) ** 2
    return jnp.mean(jax.vmap(r_single)(xs, ts))

grad_fn = jax.grad(loss)             # reverse-mode over params

Why nesting order matters

A practical pitfall: don’t use reverse mode for the inner derivatives. Reverse-mode AD records a tape proportional to the size of the network’s intermediate activations; nesting two reverse passes through the same network multiplies the tape size. On a modest 5-layer MLP this can turn a 100 ms step into a 20-second step. Forward mode’s cost scales with the input dimension, so for PDEs in 1D / 2D / 3D — exactly the regime PINNs operate in — forward-over-forward is the right choice for the spatial pieces.

The general rule (true in both Julia and JAX):

Inner derivatives over the few-dimensional input → forward mode. Outer derivative over the many-dimensional parameter vector → reverse mode.

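That single rule of thumb is what makes the Lux + ForwardDiff + Zygote stack work. It is also what NeuralPDE.jl assembles for you in Julia, and what JAX's composable transforms (jax.jacfwd inside jax.grad, with a library such as optax handling only the optimiser step) give you directly. Get the composition right and PINN training scales; get it wrong and it doesn’t.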

A subtler pitfall — silently wrong gradients

The rule above is about speed. There is a second trap that is about correctness, and it is specific to Julia’s split-package AD, so it is worth knowing before you hand-roll anything. The clean ideal is ForwardDiff for the inner input-derivative wrapped in Zygote for the outer parameter-gradient, i.e. reverse-over-forward, and it silently misfires when you wire it by hand. Ask Zygote.gradient to differentiate a loss whose residual calls ForwardDiff.gradient(u_θ, x), and Zygote returns nothing for the parameters: it can’t route cotangents back through the inner ForwardDiff call, so it treats that whole term as a constant. You get at most a buried “cannot track gradients” warning, never a hard error, just a PINN that trains to the wrong thing. (The mirror image, Zygote nested in Zygote, breaks differently: it hits an array-mutation limitation in Lux’s pullbacks.) This is the reason hand-rolled PINNs are fiddly in Julia today, and JAX is the clean contrast (see the jax.jacfwd snippets earlier in this section, and §2.7): there grad/jacfwd/jacrev are transforms of one traced program and compose by design.

Three robust ways out, in increasing order of “do this for real work”:

Central differences for the inner derivative. Replace ForwardDiff.gradient(u_θ, x) with a finite difference of plain network evaluations, so Zygote only ever differentiates ordinary forward passes. The O(h^2) bias sits far below the training error. This is exactly what the hand-rolled PINN in §5.3 does — and why.
Let NeuralPDE.jl assemble the residual. It wires ForwardDiff, Zygote, and ChainRules together so the nesting is correct and fast; you never touch the composition yourself. This is the route §5.4 onward — and the capstone — takes.
Enzyme.jl. It differentiates compiled code (not a recorded tape) and composes nested modes natively, closing the gap at the source; SciML is moving toward it as the default backend.
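The O(h^2) claim behind the central-difference route (the first option above) is easy to check numerically. The sketch below is plain Python on a known function, \tanh, chosen only because it is the activation used throughout this unit; the evaluation point and the step sizes are illustrative assumptions.

```python
import math

def central_diff(f, x, h):
    """Symmetric central difference, a second-order approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 0.5
exact = 1.0 - math.tanh(x0) ** 2                 # closed form: tanh' = 1 - tanh^2

steps = [1e-1, 5e-2, 2.5e-2]                     # each step halves h
errors = [abs(central_diff(math.tanh, x0, h) - exact) for h in steps]
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]

print(errors)
print(ratios)
```

Each halving of h cuts the error by a factor close to 4, the signature of an O(h^2) scheme. Pushing h much smaller eventually stops helping, because rounding error in the subtraction grows like 1/h, which matters most in Float32.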

Connecting back to Unit 1

The forward solver for the shallow-water equations in Unit 1 computed \partial^2 \eta / \partial t^2 by finite differences, on a 100×190 grid, in \sim 12\,\text{s}. The PINN above computes the same kind of second derivative — but at arbitrary (x, y, t) points, with exact (rounding error only) accuracy, on demand from the network. That is the headline trade-off PINNs offer: you give up the structured-grid efficiency of FD/FV/FEM and you get a meshless, autodiff-exact, easily-differentiable surrogate in return. The rest of Units 5–7 is mostly about how to realise that trade in practice.

✏️ Section exercise — trust, then verify, the autodiff stack

Take a function whose derivatives you know exactly: u(x, t) = \sin(3x)\,e^{-2t}, so \partial_x u = 3\cos(3x)e^{-2t}, \partial_{xx} u = -9\sin(3x)e^{-2t}, and \partial_t u = -2\sin(3x)e^{-2t}. Implement u as plain Julia code and compute all three derivatives at (x, t) = (0.7, 0.4) with nested ForwardDiff.derivative, exactly as the section does for the network. Verify each against the closed form to machine precision. Then confirm the residual of the heat equation \partial_t u - \alpha\,\partial_{xx} u vanishes when \alpha = 2/9 — i.e. your u is a heat-equation solution for that diffusivity. If you have a Python environment handy, repeat the whole check in JAX with jax.jacfwd and confirm the same numbers come out.

💡 Hint

Validate the derivative pipeline before any network is involved — u(x, t) = sin(3x) * exp(-2t) is just an ordinary function. Use the §5.2 nesting: ∂x = ForwardDiff.derivative(ξ -> u(ξ, 0.4), 0.7), and wrap that in a second ForwardDiff.derivative for \partial_{xx}. Compare each against the closed forms 3\cos(3x)e^{-2t}, -9\sin(3x)e^{-2t}, -2\sin(3x)e^{-2t}. For α: this is one separated heat mode with decay rate \alpha k^2, k=3, and the mode decays as e^{-2t}, so 9\alpha = 2 \Rightarrow \alpha = 2/9. In JAX, jax.jacfwd nests identically.

Go to solution →

5.3 A PINN from first principles

We solve \dot{x}(t) = -x(t) with x(0) = 1 — the simplest non-trivial ODE — whose exact solution is x(t) = e^{-t}. Our PINN will be a 1 → 16 → 1 MLP with \tanh activations. The three pieces of machinery are exactly what §5.2 motivated:

Lux for the model and its parameter container;
a symmetric central difference to compute \dot{x}_\theta(t) at collocation points — this sidesteps the silent-gradient pitfall of §5.2 (see the comment in the listing for the specifics);
Zygote.gradient to differentiate the loss with respect to \theta (many-input scalar-output → reverse mode).

using Lux, Random, Zygote, Optimisers, Plots

rng = Random.MersenneTwister(0)
model = Lux.Chain(
    Lux.Dense(1 => 16, tanh),
    Lux.Dense(16 => 1),
)
ps, st = Lux.setup(rng, model)

# Float32 throughout — Lux initialises Float32 weights; Float64 input
# triggers a "mixed-precision matmul fallback" warning from LuxLib and
# kicks Octavian.jl out of the matmul path. Keep ps, st, t all Float32.
const F = Float32

# scalar t → scalar x_θ(t)
function x_θ(t::F, ps, st)
    y, _ = model([t], ps, st)
    y[1]
end

# Time derivative of the network at a collocation point, by a symmetric
# central difference. Why not AD-inside-AD? Both pure compositions are
# version-sensitive in the current Lux/Zygote/ForwardDiff stack:
# nesting Zygote inside Zygote (reverse-over-reverse) hits Zygote's
# array-mutation limitation in Lux's pullbacks, and ForwardDiff inside
# Zygote silently DROPS the parameter gradient of the derivative term
# (Zygote warns it "cannot track gradients with respect to f" for a
# parameter-closure). The central difference sidesteps both: Zygote
# differentiates straight through two plain network evaluations, so the
# training gradient is exact for the discretised residual — at the cost
# of an O(h²) bias in ẋ_θ that is far below the training error here.
# §5.2's clean AD compositions remain the right mental model; this is
# the robust portable implementation as of mid-2026.
const h = F(1e-2)
dxdt(t::F, ps, st) = (x_θ(t + h, ps, st) - x_θ(t - h, ps, st)) / (2h)

ts = collect(range(F(0), F(4); length = 64))   # collocation points

function loss(ps)
    L_res = sum(t -> (dxdt(t, ps, st) + x_θ(t, ps, st))^2, ts) / length(ts)
    L_ic  = (x_θ(F(0), ps, st) - one(F))^2
    L_res + F(100) * L_ic
end

opt_state = Optimisers.setup(Optimisers.Adam(F(2e-2)), ps)
for epoch in 1:1500
    g = first(Zygote.gradient(loss, ps))
    opt_state, ps = Optimisers.update(opt_state, ps, g)
end

# compare to the exact solution
tgrid  = range(F(0), F(4); length = 200)
exact  = exp.(-collect(tgrid))
learnt = [x_θ(t, ps, st) for t in tgrid]

plt = plot(tgrid, exact;  label = "exact e^(-t)",
           xlabel = "t", ylabel = "x", lw = 2)
plot!(plt, tgrid, learnt; label = "PINN", lw = 2, ls = :dash)

Notice what this is, and what it isn’t:

There is no time-stepping. The PINN doesn’t march forward in time; it fits a function on the whole interval at once.
There is no training data — only the equation and the IC. The collocation points are unlabelled.
The loss is differentiable end-to-end: \dot{x}_\theta here is a central difference of two network evaluations, so Zygote.gradient flows through the whole residual exactly. If you can write the residual, you can train.

That last point is the whole game. Whatever differential equation you can encode as a residual r(t, ps), the same loop trains a PINN for it. The difficulty in §§5.4–5.5 is scaling the bookkeeping: multiple BCs, multiple spatial dimensions, multiple loss terms. That’s what NeuralPDE.jl automates.

✏️ Section exercise — your second PINN: the oscillator

Upgrade the §5.3 script from first-order to second-order physics. Solve \ddot x + \omega^2 x = 0 with \omega = 2, x(0) = 1, \dot x(0) = 0 on t \in [0, 4] (exact solution \cos(2t) — about 1¼ periods). You’ll need three changes: a second time derivative of the network in the residual, a second IC term for \dot x_\theta(0), and probably a wider network (1 → 32 → 32 → 1) with more Adam iterations. Plot the PINN against \cos(2t). Then stretch the domain to t \in [0, 12], retrain, and describe what happens to the late-time fit — your first encounter with the failure mode that §5.6 names.

💡 Hint

Graft three local changes onto the §5.3 script. (1) A second derivative by the central second difference — same trick as §5.3’s dxdt, one order up: d2(t) = (x_θ(t+h) - 2x_θ(t) + x_θ(t-h)) / h^2. Keep using central differences, not AD-inside-AD (see the §5.3 listing comment for why); Zygote still differentiates the loss exactly through these. (2) The residual is (d2(t) + ω^2*x_θ(t))^2, plus a second IC penalty (dxdt(0))^2 for \dot x(0)=0 alongside (x_θ(0) - 1)^2. (3) Widen to 1 → 32 → 32 → 1 and run ~4 000 Adam iterations. Why two ICs? The general solution A\cos\omega t + B\sin\omega t has two free constants (same counting as Solution 5.1). On [0,12] watch the late-time amplitude decay — that’s the §5.6 failure mode.

Go to solution →

5.4 The same ODE with NeuralPDE.jl

NeuralPDE.jl turns a ModelingToolkit.PDESystem (equation + domains + BCs) and a PhysicsInformedNN(chain, training_strategy) discretisation into an OptimizationProblem that any Optimization.jl optimiser can solve. The same \dot{x} = -x problem becomes:

# A PINN for ẋ = -x, x(0) = 1 using NeuralPDE.jl.
#
# Same problem as the hand-rolled version in unit_05.qmd §5.3, but
# written declaratively against a ModelingToolkit PDESystem.
#
# Run via ./build.sh execute 5 (writes output to ../output/pinn_neuralpde_ode.md).

using NeuralPDE, Lux, ModelingToolkit
using Optimization, OptimizationOptimJL
using DomainSets: ClosedInterval
using Random

@parameters t
@variables x(..)
Dt = Differential(t)

eq      = Dt(x(t)) ~ -x(t)
bcs     = [x(0.0) ~ 1.0]
domains = [t ∈ ClosedInterval(0.0, 4.0)]

@named pde_system = PDESystem(eq, bcs, domains, [t], [x(t)])

chain = Lux.Chain(Lux.Dense(1 => 16, tanh), Lux.Dense(16 => 1))
disc  = PhysicsInformedNN(chain, GridTraining(0.05))

prob = discretize(pde_system, disc)
@info "solving with LBFGS..."
sol  = solve(prob, LBFGS(); maxiters = 200)

# Pull the trained function out of the solution and score it against
# the exact e^{-t} at a held-out grid.
phi    = disc.phi
params = sol.u
tgrid  = collect(range(0.0, 4.0; length = 200))
learnt = [phi([t], params)[1] for t in tgrid]
exact  = exp.(-tgrid)
err    = maximum(abs, learnt .- exact)

println("retcode         : $(sol.retcode)")
println("final loss      : $(round(sol.objective; sigdigits = 4))")
println("max |PINN-exact|: $(round(err; sigdigits = 3))")

Running it produces:

[ Info: solving with LBFGS...
retcode         : MaxIters
final loss      : 1.391e-7
max |PINN-exact|: 0.000114

What the package buys you: a one-line residual derivation (no hand-rolled dxdt), built-in collocation strategies, and a clean path from Optimization.solve to Adam, LBFGS, or any other adapter in the SciML stack. What you pay: heavier precompile and a lot of symbolic machinery between your problem and the gradient that ends up training the network.

✏️ Section exercise — change the equation, not the code

The declarative payoff of NeuralPDE.jl is that a different ODE is a one-line edit. Swap the equation in the standalone script for the logistic equation \dot x = x(1 - x) with x(0) = 0.1 on t \in [0, 8] (exact solution x(t) = 1 / (1 + 9e^{-t})), re-discretise, and re-solve. Compare the PINN to the closed form. Then look at what you did not have to touch — no hand-rolled dxdt, no loss assembly — and note the one thing you did have to reconsider (hint: the solution now saturates near 1; does the default network/strategy still fit it well at the plateau?).

💡 Hint

Only three lines of the standalone script change; the symbolic layer regenerates the residual, loss, and gradient downstream: eq = Dt(x(t)) ~ x(t)*(1 - x(t)), bcs = [x(0.0) ~ 0.1], and domains = [t ∈ ClosedInterval(0.0, 8.0)]. Grade against the logistic curve 1/(1+9e^{-t}). For the ‘reconsider’ question: the sigmoid does all its work near the transition t=\ln 9 \approx 2.2 and is glued to the plateau for t \gtrsim 5, so a uniform GridTraining spends a large share of its points where nothing happens, which is the problem §5.1’s adaptive sampling addresses.

Go to solution →

5.5 The simplest PDE: 1D diffusion

We now solve the 1D heat equation \partial_t u = \alpha\,\partial_x^2 u on (x, t) \in [0, 1]\times[0, 1] with Dirichlet boundaries u(0, t) = u(1, t) = 0 and a Gaussian initial bump u(x, 0) = \exp\!\bigl(-200(x - 0.5)^2\bigr). It is parabolic, smoothing, and well-posed, and it is the first proper PDE PINN benchmark.

The implementation is structurally identical to §5.4, with one extra spatial coordinate and one extra BC pair:

# 1D heat equation PINN with NeuralPDE.jl.
#
#   ∂t u = α ∂xx u       on (x,t) ∈ [0,1] × [0,1]
#   u(x, 0) = exp(-200 (x - 0.5)^2)
#   u(0, t) = u(1, t) = 0
#
# Compares the trained network to a quick second-order finite-
# difference reference solve.
#
# Run via ./build.sh execute 5; writes output/pinn_neuralpde_heat.md
# plus output/pinn_neuralpde_heat.png.

using NeuralPDE, Lux, ModelingToolkit
using Optimization, OptimizationOptimJL
using OrdinaryDiffEq, OrdinaryDiffEqBDF
using DomainSets: ClosedInterval
using LinearAlgebra, Plots

@parameters x t
@variables u(..)
Dt  = Differential(t)
Dxx = Differential(x)^2

α   = 0.01
eq  = Dt(u(x, t)) ~ α * Dxx(u(x, t))

ic_bump(x_) = exp(-200 * (x_ - 0.5)^2)
bcs = [
    u(x, 0.0) ~ ic_bump(x),
    u(0.0, t) ~ 0.0,
    u(1.0, t) ~ 0.0,
]
domains = [x ∈ ClosedInterval(0.0, 1.0), t ∈ ClosedInterval(0.0, 1.0)]

@named pde_system = PDESystem(eq, bcs, domains, [x, t], [u(x, t)])

chain = Lux.Chain(
    Lux.Dense(2 => 32, tanh),
    Lux.Dense(32 => 32, tanh),
    Lux.Dense(32 => 1),
)
disc = PhysicsInformedNN(chain, GridTraining([0.02, 0.02]))
prob = discretize(pde_system, disc)

@info "solving heat equation PINN..."
sol = solve(prob, LBFGS(); maxiters = 1500)
println("retcode    : $(sol.retcode)")
println("final loss : $(round(sol.objective; sigdigits = 4))")

# ── finite-difference reference (centred diffs, BDF in time) ───────────
Nx = 101
xg = range(0.0, 1.0; length = Nx)
Δx = step(xg)
u0 = ic_bump.(xg)
u0[1] = 0.0; u0[end] = 0.0          # enforce BCs at t=0

function rhs!(du, u, p, t)
    du[1] = 0.0
    du[end] = 0.0
    @inbounds for i in 2:length(u)-1
        du[i] = α * (u[i+1] - 2u[i] + u[i-1]) / Δx^2
    end
end
ref_prob = ODEProblem(rhs!, u0, (0.0, 1.0))
ref_sol  = solve(ref_prob, FBDF(); saveat = [0.0, 0.3, 0.7, 1.0])

# ── evaluate PINN at the same grid+times ───────────────────────────────
phi    = disc.phi
params = sol.u
pinn_at(t) = [phi([xi, t], params)[1] for xi in xg]

max_err = let m = 0.0
    for (i, t) in enumerate(ref_sol.t)
        ref = ref_sol.u[i]
        pin = pinn_at(t)
        m = max(m, maximum(abs, pin .- ref))
    end
    m
end
println("max |PINN-FD|: $(round(max_err; sigdigits = 3))")

# ── plot ───────────────────────────────────────────────────────────────
plt = plot(xlabel = "x", ylabel = "u(x, t)",
           title  = "1D diffusion: PINN vs. FD reference", legend = :topright)
colors = [:black, :red, :blue, :green]
for (i, t) in enumerate(ref_sol.t)
    plot!(plt, xg, ref_sol.u[i]; lw = 2,  ls = :solid, color = colors[i],
          label = "FD t=$(round(t; digits=2))")
    plot!(plt, xg, pinn_at(t);    lw = 2,  ls = :dash,  color = colors[i],
          label = "PINN t=$(round(t; digits=2))")
end
outpath = joinpath(@__DIR__, "..", "output", "pinn_neuralpde_heat.png")
savefig(plt, outpath)
println("saved $outpath")

Running it produces:

[ Info: solving heat equation PINN...
retcode    : MaxIters
final loss : 0.0003209
max |PINN-FD|: 0.0276
saved /Users/uqjnazar/git/PIML/Julia_PINN_training_2026/units/unit_05/scripts/../output/pinn_neuralpde_heat.png

PINN solution at four times (t = 0, 0.3, 0.7, 1.0) vs. a finite-difference reference. The PINN matches the smoothing trend; the residual at the Gaussian peak is the visible discrepancy.
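The reference solve in the script uses centred differences in space and an implicit BDF integrator in time. The sketch below illustrates the same spatial stencil in plain Python but swaps in explicit Euler time stepping, an assumption made to keep it dependency-free, with the time step chosen to respect the explicit stability bound \alpha\,\Delta t/\Delta x^2 \le 1/2. It uses a sine initial condition rather than the Gaussian bump, because that case has a closed-form solution to check against. The function name `heat_fd` and the grid sizes are hypothetical.

```python
import math

def heat_fd(u0, alpha, dx, dt, n_steps):
    """Explicit Euler + centred differences for u_t = alpha * u_xx, u = 0 at both ends."""
    assert alpha * dt / dx**2 <= 0.5, "explicit scheme is unstable for this dt"
    u = list(u0)
    for _ in range(n_steps):
        new = u[:]                                # boundary values stay fixed at 0
        for i in range(1, len(u) - 1):
            new[i] = u[i] + alpha * dt * (u[i + 1] - 2 * u[i] + u[i - 1]) / dx**2
        u = new
    return u

alpha, nx = 0.01, 51
dx = 1.0 / (nx - 1)
dt, n_steps = 0.01, 100                           # integrate to t = 1
xs = [i * dx for i in range(nx)]

u0 = [math.sin(math.pi * x) for x in xs]
u0[0] = u0[-1] = 0.0                              # enforce the Dirichlet ends
u_fd = heat_fd(u0, alpha, dx, dt, n_steps)

decay = math.exp(-alpha * math.pi**2 * 1.0)       # exact decay of the sin(pi x) mode
max_err = max(abs(uf - decay * math.sin(math.pi * x)) for uf, x in zip(u_fd, xs))
print(max_err)
```

Checking the stencil against a known mode first gives some confidence in the reference itself before it is used to grade the PINN on the Gaussian bump.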

✏️ Section exercise — sharpen the bump until it breaks

The heat-equation PINN above fits a Gaussian initial bump of width parameter 200. Make the problem harder one notch at a time: rerun the script with the IC sharpness at 200, 800, 3200 (the bump’s width halves each step), keeping everything else fixed. For each run record the final loss and plot the PINN against the FD reference at t = 0 — the IC is where the damage shows. At what sharpness does the PINN visibly fail to represent the initial condition, and does throwing 4× more collocation points at it fix the problem? Keep your three loss numbers; §5.6 explains the trend and Unit 7 sells the cure.

💡 Hint

The sharpness is the 200 inside ic_bump, which the IC line of the bcs vector calls: u(x, 0) ~ exp(-S*(x - 0.5)^2). Sweep S ∈ (200, 800, 3200) (the bump width halves each step), record sol.objective for each, and plot the t=0 slice against exp.(-S .* (xg .- 0.5).^2). Expect the peak to start clipping around S=800 and smear at S=3200, with the loss climbing ~10× per step. For the collocation test, halve the GridTraining spacings in each dimension (4× the points) at S=3200, and notice it barely helps. That “more points didn’t fix it” is the whole point; §5.6 names why.

Go to solution →

The 2-D version, on a GPU

§5.3 built a PINN from first principles; §5.5 solved the 1-D heat equation. Add one space dimension — \partial_t u = \alpha\,(\partial_{xx} u + \partial_{yy} u), \qquad (x, y) \in [0, 1]^2,\ t \in [0, 0.2], with u(x, y, 0) = \sin\pi x\,\sin\pi y and u = 0 on the spatial boundary — and the collocation grid multiplies. Every collocation point is an independent network evaluation, so a large batch is exactly the embarrassingly-parallel workload a GPU was built for (the same lever as Unit 6 and Unit 7).

We keep the §5.3 recipe (a hand-built residual, derivatives by the finite-difference-in-input stencil) and bake the initial condition and all four boundaries in exactly with the ansatz u_\theta(x, y, t) = \sin\pi x\,\sin\pi y\,\bigl(1 + t\,N_\theta(x, y, t)\bigr), which vanishes on the boundary and equals \sin\pi x\,\sin\pi y at t = 0. For that initial condition the exact solution is u^\star = e^{-2\alpha\pi^2 t}\sin\pi x\,\sin\pi y, so we can report a true L2 error. The same model trains on the CPU (Array) and the GPU (CuArray); only the device of the parameters and collocation arrays changes.
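The claim that the ansatz satisfies the initial and boundary conditions for any network output can be checked without training anything. The sketch below is plain Python with an arbitrary smooth function `N_any` standing in for an untrained network; that stand-in, the sample points, and the helper names are illustrative assumptions.

```python
import math

ALPHA, TMAX = 0.1, 0.2

def base(x, y):
    return math.sin(math.pi * x) * math.sin(math.pi * y)

def u_ansatz(N, x, y, t):
    """Hard-constrained form base * (1 + t * N) for any network stand-in N."""
    return base(x, y) * (1.0 + t * N(x, y, t))

def u_exact(x, y, t):
    return math.exp(-2 * ALPHA * math.pi**2 * t) * base(x, y)

# An arbitrary smooth function standing in for an untrained network.
N_any = lambda x, y, t: math.cos(3 * x) - 2.0 * y + t

pts = [i / 10 for i in range(11)]

# The initial condition holds exactly at t = 0 ...
ic_gap = max(abs(u_ansatz(N_any, x, y, 0.0) - base(x, y)) for x in pts for y in pts)

# ... and u vanishes (to rounding) on all four edges at every sampled time.
edges = ([(e, s) for e in (0.0, 1.0) for s in pts]
         + [(s, e) for e in (0.0, 1.0) for s in pts])
bc_gap = max(abs(u_ansatz(N_any, x, y, t)) for x, y in edges for t in (0.0, 0.1, TMAX))

# The output that makes the ansatz exact: N* = (exp(-2 alpha pi^2 t) - 1) / t.
N_star = lambda x, y, t: math.expm1(-2 * ALPHA * math.pi**2 * t) / t
fit_gap = max(abs(u_ansatz(N_star, x, y, t) - u_exact(x, y, t))
              for x in pts for y in pts for t in (0.05, 0.1, TMAX))
print(ic_gap, bc_gap, fit_gap)
```

With the constraints built in, the only thing left for training to learn is the function N^\star, which here depends on t alone and tends smoothly to -2\alpha\pi^2 as t \to 0. That is a much easier target than the full solution plus its boundary behaviour.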

units/unit_05/scripts/pinn_2d_diffusion_gpu.jl

# ===========================================================================
# Unit 5 — a 2-D diffusion PINN, trained on the GPU.
#
# Section 5.3 builds a PINN "from first principles" — a Lux network, a residual
# loss assembled by hand, gradient descent — and §5.5 solves the 1-D heat
# equation. This extends that hand-built recipe to TWO space dimensions,
#   ∂t u = α (∂xx u + ∂yy u)   on (x,y) ∈ [0,1]², t ∈ [0, 0.2],
#   u(x,y,0) = sin(πx) sin(πy),   u = 0 on the spatial boundary,
# and trains it on the GPU. Going from u(x,t) to u(x,y,t) multiplies the
# collocation grid, which is exactly when a PINN starts to want a GPU: every
# collocation point is an independent network evaluation, so a large batch fills
# the card.
#
# We keep the §5.3 spirit — no NeuralPDE, a residual built by hand — and use the
# finite-difference-in-input trick from §5.3 for the derivatives, so the
# parameter gradient is a single reverse-mode pass that runs cleanly on the GPU.
# The IC and all four boundaries are enforced *exactly* by construction:
#   u_θ(x,y,t) = sin(πx) sin(πy) · (1 + t · N_θ(x,y,t)),
# which is sin·sin (so u=0 on the boundary) and equals sin·sin at t=0 (the IC).
# This IC has the exact analytic solution
#   u*(x,y,t) = exp(-2 α π² t) · sin(πx) sin(πy),
# so we can report the PINN's true L2 error, and we train the SAME model on the
# CPU and the GPU to show the device move and the speed-up.
#
# Run on the GPU hub (the @pinn env has Lux + LuxCUDA + CUDA):
#   julia --project=@pinn units/unit_05/scripts/pinn_2d_diffusion_gpu.jl
# Nothing here runs during `quarto render` — the .qmd embeds it as a listing only.
# ===========================================================================

using Lux, LuxCUDA, CUDA, Optimisers, Zygote, Random, Printf, Statistics

const α = 0.1f0
const Tmax = 0.2f0
const HX = 2f-3; const HT = 2f-3

make_model() = Chain(Dense(3 => 48, tanh), Dense(48 => 48, tanh),
                     Dense(48 => 48, tanh), Dense(48 => 1))

u_exact(x, y, t) = exp(-2 * α * π^2 * t) * sinpi(x) * sinpi(y)

function solve_2d(Ncol, dev; iters = 3000, seed = 1)
    model = make_model()
    ps, st = Lux.setup(Xoshiro(seed), model)
    ps = ps |> dev; st = st |> dev
    opt = Optimisers.setup(Adam(2f-3), ps)

    rng = Xoshiro(seed + 7)
    X = rand(rng, Float32, 1, Ncol); Y = rand(rng, Float32, 1, Ncol); T = Tmax .* rand(rng, Float32, 1, Ncol)
    X, Y, T = dev(X), dev(Y), dev(T)

    N(p, x, y, t) = first(model(vcat(x, y, t), p, st))
    # hard IC (t=0 ⇒ sin·sin) and hard Dirichlet boundary (sin factors vanish):
    uθ(p, x, y, t) = sinpi.(x) .* sinpi.(y) .* (1f0 .+ t .* N(p, x, y, t))

    function loss(p)
        ut  = (uθ(p, X, Y, T .+ HT) .- uθ(p, X, Y, T .- HT)) ./ (2HT)
        uxx = (uθ(p, X .+ HX, Y, T) .- 2f0 .* uθ(p, X, Y, T) .+ uθ(p, X .- HX, Y, T)) ./ HX^2
        uyy = (uθ(p, X, Y .+ HX, T) .- 2f0 .* uθ(p, X, Y, T) .+ uθ(p, X, Y .- HX, T)) ./ HX^2
        return mean(abs2, ut .- α .* (uxx .+ uyy))
    end

    Zygote.gradient(loss, ps)                            # warm up / compile
    dev === identity || CUDA.synchronize()
    t0 = time()
    for _ in 1:iters
        g = Zygote.gradient(loss, ps)[1]
        opt, ps = Optimisers.update(opt, ps, g)
    end
    dev === identity || CUDA.synchronize()
    elapsed = time() - t0

    # L2 error vs analytic at t = Tmax on a 41×41 grid
    gr = range(0f0, 1f0; length = 41)
    xs = reshape(repeat(gr, inner = 41), 1, :); ys = reshape(repeat(gr, outer = 41), 1, :)
    ts = fill(Tmax, 1, length(xs))
    up = Array(uθ(ps, dev(xs), dev(ys), dev(ts)))
    ex = [u_exact(x, y, Tmax) for (x, y) in zip(vec(Array(xs)), vec(Array(ys)))]
    err = sqrt(mean((vec(up) .- ex).^2))
    return (; ps, model, st, uθ, elapsed, err, finalloss = loss(ps))
end

println("="^64)
println("Unit 5 — 2-D diffusion PINN (hand-built), CPU vs GPU")
println("="^64)
have_gpu = CUDA.functional()
@printf("GPU available: %s%s\n", have_gpu, have_gpu ? "  ($(CUDA.name(CUDA.device())))" : "")
println("u_θ = sin(πx)sin(πy)(1 + t·N);  residual ∂tu − α(∂xxu+∂yyu) by input stencil\n")

print("CPU  (N=4000)   … "); flush(stdout)
cpu = solve_2d(4_000, identity; iters = 2000)
@printf("%.1fs | loss %.2e | L2 err vs exact %.2e\n", cpu.elapsed, cpu.finalloss, cpu.err)

if have_gpu
    print("GPU  (N=120000) … "); flush(stdout)
    gpu = solve_2d(120_000, gpu_device(); iters = 3000)
    @printf("%.1fs | loss %.2e | L2 err vs exact %.2e\n", gpu.elapsed, gpu.finalloss, gpu.err)
    @printf("\nGPU trains 30× the collocation points (and 1.5× the iterations) in %.1f× the\n", gpu.elapsed/cpu.elapsed)
    @printf("CPU wall-clock, and reaches the analytic solution to L2 = %.1e.\n", gpu.err)

    try
        using CairoMakie
        ng = 81
        gr = range(0f0,1f0,length=ng)
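        # x varies fastest, so the column-major reshape below gives up[i,j] = u(gr[i], gr[j]), matching `ex`
        xs = reshape(repeat(gr, outer=ng),1,:); ys = reshape(repeat(gr, inner=ng),1,:); ts = fill(Tmax,1,ng*ng)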
        up = reshape(Array(gpu.uθ(gpu.ps, CuArray(xs), CuArray(ys), CuArray(ts))), ng, ng)
        ex = [u_exact(x,y,Tmax) for x in gr, y in gr]
        f = Figure(size=(820,330))
        a1=Axis(f[1,1],title="PINN u(x,y,0.2) (GPU)",xlabel="x",ylabel="y",aspect=1)
        h1=heatmap!(a1,gr,gr,up;colormap=:viridis); Colorbar(f[1,2],h1)
        a2=Axis(f[1,3],title="exact u(x,y,0.2)",xlabel="x",ylabel="y",aspect=1)
        h2=heatmap!(a2,gr,gr,ex;colormap=:viridis); Colorbar(f[1,4],h2)
        figdir = get(ENV,"GPU_FIG_DIR",joinpath(@__DIR__,"..","figures")); isdir(figdir)||mkpath(figdir)
        save(joinpath(figdir,"pinn_2d_diffusion_gpu.png"), f)
        println("wrote figures/pinn_2d_diffusion_gpu.png")
    catch e
        println("(figure skipped: ", e, ")")
    end
end

Captured on the workshop GPU hub (NVIDIA L4):

================================================================
Unit 5 — 2-D diffusion PINN (hand-built), CPU vs GPU
================================================================
GPU available: true  (NVIDIA L4)
u_θ = sin(πx)sin(πy)(1 + t·N);  residual ∂tu − α(∂xxu+∂yyu) by input stencil

CPU  (N=4000)   … 93.4s | loss 9.98e-06 | L2 err vs exact 6.61e-05
GPU  (N=120000) … 105.9s | loss 7.11e-06 | L2 err vs exact 3.20e-05

GPU trains 30× the collocation points (and 1.5× the iterations) in 1.1× the
CPU wall-clock, and reaches the analytic solution to L2 = 3.2e-05.

The win here isn’t the stopwatch; it’s resolution. In about the same wall-clock time (1.1×), the GPU trains on 30× the collocation points for 1.5× the iterations, and the error against the analytic solution roughly halves (L2 6.6\times10^{-5} \to 3.2\times10^{-5}). The run changes both the point count and the iteration count, so it does not isolate how much of that gain comes from denser sampling alone. On the CPU, 30× the points would cost roughly 30× the time; on the GPU the extra points are nearly free until the card fills. That is the same lever Unit 6’s finite-difference solver pulls on a grid and Unit 7 measures as raw training-step throughput, and it is what makes the Unit 9 / Unit 10 capstone columns tractable.
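One way to read those numbers is as throughput rather than wall-clock. The short calculation below is an illustrative sketch that uses only the figures from the captured run. The names `cpu_run`, `gpu_run`, and `throughput` are introduced here and are not part of the script, and "point-iterations per second" is an assumed, informal measure (collocation points × iterations ÷ seconds).

```julia
# Figures copied from the captured run on the NVIDIA L4 hub.
cpu_run = (points = 4_000,   iters = 2_000, seconds = 93.4)
gpu_run = (points = 120_000, iters = 3_000, seconds = 105.9)

# Informal throughput: collocation points processed per second of training.
throughput(r) = r.points * r.iters / r.seconds

wallclock_ratio  = gpu_run.seconds / cpu_run.seconds
work_ratio       = (gpu_run.points * gpu_run.iters) / (cpu_run.points * cpu_run.iters)
throughput_ratio = throughput(gpu_run) / throughput(cpu_run)

println("wall-clock ratio : ", round(wallclock_ratio;  digits = 2))
println("work ratio       : ", work_ratio)
println("throughput ratio : ", round(throughput_ratio; digits = 1))
```

By this measure the GPU run does 45× the work in about 1.13× the time, roughly 40× the point-iterations per second, which is the sense in which the extra resolution comes nearly free.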

The trained PINN field u(x, y, 0.2) (left, on the GPU) beside the analytic solution e^{-2\alpha\pi^2 t}\sin\pi x\,\sin\pi y (right) — a single \sin\cdot\sin mode decaying under 2-D diffusion. The two are visually indistinguishable (L2 3.2\times10^{-5}).
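The analytic solution and the hard-constraint form can both be sanity-checked without training anything. The sketch below is a standalone illustration, not part of the script: it applies the same central-difference stencils to u* in Float64, with an assumed step `H = 1e-3`, and swaps in an arbitrary smooth function `N_standin` (hypothetical) for the network to show that the initial and boundary conditions hold whatever the network outputs.

```julia
# Float64 throughout, so stencil truncation error (not round-off) dominates.
const α = 0.1
const H = 1e-3          # stencil half-width; an assumed value for this check only

u_exact(x, y, t) = exp(-2α * π^2 * t) * sinpi(x) * sinpi(y)

# Residual ∂t u − α(∂xx u + ∂yy u), with every derivative taken by central differences.
function fd_residual(u, x, y, t; h = H)
    ut  = (u(x, y, t + h) - u(x, y, t - h)) / (2h)
    uxx = (u(x + h, y, t) - 2u(x, y, t) + u(x - h, y, t)) / h^2
    uyy = (u(x, y + h, t) - 2u(x, y, t) + u(x, y - h, t)) / h^2
    return ut - α * (uxx + uyy)
end

# Hypothetical stand-in for the network N_θ; any smooth function would do.
N_standin(x, y, t) = cos(3x) - y^2 + 0.5t
u_ansatz(x, y, t) = sinpi(x) * sinpi(y) * (1 + t * N_standin(x, y, t))

r = fd_residual(u_exact, 0.3, 0.7, 0.1)
ic_gap = u_ansatz(0.3, 0.7, 0.0) - sinpi(0.3) * sinpi(0.7)
edges = (u_ansatz(0.0, 0.4, 0.1), u_ansatz(1.0, 0.4, 0.1),
         u_ansatz(0.4, 0.0, 0.1), u_ansatz(0.4, 1.0, 0.1))

println("stencil residual of u*  : ", r)       # small: O(h²) truncation error
println("IC mismatch of ansatz   : ", ic_gap)  # zero by construction
println("ansatz on the four edges: ", edges)   # zero by construction
```

The residual of u* is not exactly zero because the stencils carry an O(h²) truncation error; that stencil error is a separate quantity from the L2 error the script reports against u*.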

5.6 Failure modes already visible

Even on benign problems three pathologies show up that motivate everything in Unit 7:

Spectral bias. MLPs with smooth activations preferentially fit low frequencies first (Rahaman et al., 2019). The Gaussian initial bump in §5.5 — locally high curvature — is the slowest-converging feature of the loss.
Loss imbalance. With residual, IC, and BC contributing terms of very different magnitude, the optimiser drives the easiest one to zero first. The PINN can land on a solution that satisfies the equation inside the domain but ignores the boundary, or fits the IC and the PDE while drifting at the BC.
Causal violation. A time-dependent PINN can fit the residual at every t simultaneously, including futures whose dependence on the IC hasn’t propagated forward. The result is locally consistent at each (x, t) point but globally inconsistent as a trajectory.

Fixing these three is the meat of Unit 7: adaptive loss weighting, Fourier feature embeddings, hard BC enforcement, and causal training. With those tools we can train PINNs on the capstone column in Unit 9 / Unit 10; without them we can’t.

✏️ Section exercise — diagnose before you medicate

Match symptom to disease. For each observation below — all from real PINN training logs — name which of the three §5.6 pathologies (spectral bias, loss imbalance, causal violation) is the prime suspect, and which Unit 7 fix you’d reach for first:

1. The total loss is 10^{-6} but the solution at the domain boundary is visibly wrong; the BC term contributes 10^{-9} of the total with \lambda_{\text{BC}} = 1.
2. A wave-equation PINN reproduces the first quarter-period beautifully, then the predicted field “freezes”: late times look like a smeared copy of early times.
3. A PINN fits \sin(2\pi x) initial data in 500 iterations but needs 50 000 for \sin(16\pi x), with the same network.
4. Doubling \lambda_{\text{IC}} fixes the initial condition but the interior residual gets 100× worse.

💡 Hint

Three diagnostics separate the three diseases: per-term loss magnitudes (imbalance), residual-binned-by-t (causality), and whether difficulty scales with the target’s frequency (spectral bias). Each numbered symptom answers exactly one of those probes — start by deciding which probe the symptom is reporting.
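Two of those probes are easy to compute from quantities a training loop already has. The helpers below are a minimal sketch: `binned_residual` and `term_shares` are hypothetical names, not part of any library, and the data fed to them is synthetic, made up purely to show the call pattern. The frequency-scaling probe needs repeated training runs and is not sketched here.

```julia
using Statistics

# Causality probe: mean squared residual per time slab.
# `t` and `r` are same-length vectors of collocation times and pointwise residuals.
function binned_residual(t, r; nbins = 4, tmax = maximum(t))
    edges = range(0, tmax; length = nbins + 1)
    # clamp so that t == tmax lands in the last slab rather than past it
    bin = clamp.(searchsortedlast.(Ref(edges), t), 1, nbins)
    return [mean(abs2, r[bin .== b]) for b in 1:nbins]
end

# Imbalance probe: fraction of the total loss carried by each (already weighted) term.
term_shares(terms::NamedTuple) = map(v -> v / sum(values(terms)), terms)

# Synthetic illustration only: residuals that grow with t, and made-up loss terms.
t = collect(range(0, 1; length = 200))
r = 1e-3 .* (1 .+ 10 .* t)
slabs  = binned_residual(t, r)
shares = term_shares((pde = 1e-6, ic = 1e-8, bc = 1e-9))

println("residual by time slab: ", slabs)
println("loss shares          : ", shares)
```

A slab profile that rises with t, or a single term holding nearly all of the loss, is what each probe is meant to surface; which disease that indicates for a given symptom is the exercise.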

Go to solution →

