Unit 7: When PINNs Meet PDEs

Published

26/06/2026

Vanilla PINNs work on some PDEs and fail surprisingly often on others. We diagnose the failure modes (loss imbalance, spectral bias, causal violation) on a column heat equation and on shallow water, then walk through the modern fixes that mostly close the gap: causal training, Fourier feature embeddings, hard boundary-condition enforcement, adaptive loss weighting. By the end of the unit the toolkit needed for the capstone is in place.

7.1 Vanilla PINNs on PDEs

The PINN Loss

A PINN approximates the solution u(\mathbf{x}, t) with a neural network u_\theta(\mathbf{x}, t). The total loss combines a residual term, boundary terms, and (for time-dependent problems) an initial condition term:

\mathcal{L} = \lambda_r\,\mathcal{L}_{\text{PDE}} + \lambda_b\,\mathcal{L}_{\text{BC}} + \lambda_i\,\mathcal{L}_{\text{IC}},

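with \mathcal{L}_{\text{PDE}} = \frac{1}{N_r}\sum_{i=1}^{N_r}|r(\mathbf{x}_i, t_i)|^2, where r is the PDE residual of u_\theta (for the heat equation below, r = \partial_t u_\theta - \alpha\,\partial_x^2 u_\theta). The residual is evaluated by autodiff at collocation points: no mesh, no time-stepping.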
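As a minimal sketch of how the three terms are assembled, the snippet below evaluates the composite loss for a hand-written trial function on the heat equation rather than for a trained network. It rests on stated assumptions: central finite differences stand in for autodiff, the helper names (`residual`, `pinn_loss`, `sine_ic`) are hypothetical, and the first trial function is an exact sine mode, so every term should come out close to zero.

```julia
using Statistics

α = 0.01                                          # diffusivity from the heat-equation example
u_exact(x, t) = exp(-α * π^2 * t) * sin(π * x)    # exact sine mode, stands in for u_θ

# PDE residual r = ∂ₜu - α ∂ₓ²u.
# ASSUMPTION: central finite differences replace the autodiff a real PINN uses.
function residual(u, x, t; h = 1e-3)
    u_t  = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h^2
    return u_t - α * u_xx
end

# Composite loss λ_r L_PDE + λ_b L_BC + λ_i L_IC on N scattered points.
function pinn_loss(u, u0; λ_r = 1.0, λ_b = 1.0, λ_i = 1.0, N = 200)
    xs, ts = rand(N), 2 .* rand(N)                # points in [0, 1] × [0, 2]
    L_pde = mean(abs2(residual(u, x, t)) for (x, t) in zip(xs, ts))
    L_bc  = mean(abs2, u.(0.0, ts)) + mean(abs2, u.(1.0, ts))
    L_ic  = mean(abs2, u.(xs, 0.0) .- u0.(xs))
    return λ_r * L_pde + λ_b * L_bc + λ_i * L_ic
end

sine_ic(x) = sin(π * x)
loss_exact  = pinn_loss(u_exact, sine_ic)                  # exact solution: all terms tiny
loss_frozen = pinn_loss((x, t) -> sin(π * x), sine_ic)     # ignores the decay in time
```

The second call matches the initial and boundary data perfectly and still scores a visibly larger loss, because only the residual term notices that the field never decays.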

Heat Equation: A Sanity Check

The first test is the 1D diffusion equation \partial_t u = \alpha\,\partial_x^2 u on [0, 1]\times[0, 2] with a Gaussian initial bump and zero Dirichlet boundaries. Vanilla PINNs handle this well, which makes it a useful first concrete implementation.

using NeuralPDE, Lux, ModelingToolkit, Optimization, OptimizationOptimJL
using DomainSets: ClosedInterval
using Plots

@parameters x t
@variables u(..)

Dₜ  = Differential(t)
Dₓₓ = Differential(x) ∘ Differential(x)

α = 0.01

eq = Dₜ(u(x, t)) ~ α * Dₓₓ(u(x, t))

domains = [
    x ∈ ClosedInterval(0.0, 1.0),
    t ∈ ClosedInterval(0.0, 2.0),
]

bcs = [
    u(x, 0)    ~ exp(-200 * (x - 0.5)^2),
    u(0.0, t)  ~ 0.0,
    u(1.0, t)  ~ 0.0,
]

@named pde_system = PDESystem(eq, bcs, domains, [x, t], [u(x, t)])

chain = Lux.Chain(
    Lux.Dense(2, 20, Lux.tanh),
    Lux.Dense(20, 20, Lux.tanh),
    Lux.Dense(20, 1),
)

discretization = PhysicsInformedNN(chain, GridTraining([0.05, 0.05]))
prob = discretize(pde_system, discretization)
res = Optimization.solve(prob, LBFGS(); maxiters=1500)

If the @parameters / @variables / Differential / ~ / PDESystem / PhysicsInformedNN / discretize machinery above looks unfamiliar, it is dissected line-by-line in §7.6 — anatomy of a NeuralPDE program; the examples here use it without ceremony so you see the shape of a PINN program before the parts list.

Laplace on a Disk: Where Mesh-Free Helps

A clean elliptic example. The PINN handles the circular geometry mesh-free — a classical advantage over finite differences. Closed-form solution: u(r, \theta) = r^3 \sin(3\theta).

Figure 1: Left: the closed-form three-lobed harmonic u(r, \theta) = r^3 \sin(3\theta) on the unit disk, with Dirichlet boundary u(R, \theta) = \sin(3\theta) and a small inner cutoff r_{\min}. Right: a schematic PINN collocation cloud — the solve evaluates the PDE residual at interior points (green) and the Dirichlet condition at a sparser ring of boundary points (orange); the diagram draws fewer for clarity. The mesh-free advantage: the collocation points sit directly inside the disk — no curvilinear mesh or quadrature scheme needed.

using NeuralPDE, Lux, ModelingToolkit, Optimization, OptimizationOptimJL
using DomainSets, Plots

@parameters r θ
@variables u(..)

Dᵣ  = Differential(r);  Dᵣᵣ = Differential(r) ∘ Differential(r)
Dθθ = Differential(θ) ∘ Differential(θ)

eq = r^2 * Dᵣᵣ(u(r, θ)) + r * Dᵣ(u(r, θ)) + Dθθ(u(r, θ)) ~ 0

R = 1.0;  r_min = 0.05
domains = [r ∈ ClosedInterval(r_min, R), θ ∈ ClosedInterval(0.0, 2π)]
bcs = [
    u(R, θ)     ~ sin(3θ),
    u(r_min, θ) ~ r_min^3 * sin(3θ),
    u(r, 0.0)   ~ 0.0,
    u(r, 2π)    ~ 0.0,
]
@named pde_system = PDESystem(eq, bcs, domains, [r, θ], [u(r, θ)])

chain = Lux.Chain(
    Lux.Dense(2, 32, Lux.tanh), Lux.Dense(32, 32, Lux.tanh),
    Lux.Dense(32, 32, Lux.tanh), Lux.Dense(32, 1),
)

discretization = PhysicsInformedNN(chain, GridTraining([0.05, 0.1]))
prob = discretize(pde_system, discretization)
res = Optimization.solve(prob, LBFGS(); maxiters=3000)

Run it and the trained network reproduces the three-lobed harmonic to a relative L^2 error of about 2% after 3000 LBFGS iterations on a single CPU — the runnable version, which adds the grading and plotting calls, is units/unit_07/scripts/laplace_disk_solve.jl. Where that error concentrates, and what the collocation cloud has to do with it, is the section exercise.

✏️ Section exercise — grade the disk PINN against the exact harmonic

The Laplace-on-a-disk example has a closed-form answer, u(r, \theta) = r^3 \sin(3\theta) — so use it. Run the disk script above, then evaluate the trained network on a 50 \times 100 polar grid and compute the relative L^2 error and the worst-case pointwise error. Two follow-ups: (a) where on the disk does the worst error sit — interior, outer boundary, or near the r_{\min} cutoff — and what about the collocation cloud explains it? (b) Re-run with the boundary condition changed to \sin(6\theta) (exact solution r^6\sin(6\theta)) and report what happens to the error — a first taste of the frequency story §7.2 tells.

💡 Hint

With phi = discretization.phi, each call phi([r,θ], res.u) returns a 1-element vector, so wrap each evaluation in first: u_pinn = [first(phi([r,θ], res.u)) for r in rs, θ in θs] over rs = range(r_min, R; length=50), θs = range(0, 2π; length=100). The exact field is [r^3*sin(3θ) for r in rs, θ in θs]. Relative L^2 is sqrt(sum(abs2, u_pinn .- u_exact)/sum(abs2, u_exact)); findmax(abs.(u_pinn .- u_exact)) returns (value, index). Map the index back through rs/θs to see it land near the rim (recall the §7.1 figure: many interior residual points but only a thin ring of soft boundary points, and the field’s amplitude all lives near r=R). For (b), change only the two bcs lines to sin(6θ) and r_min^6*sin(6θ).

Go to solution →

Warm-starting a neighbour: the disk PINN as a pre-trained init

A solved PINN is a reusable asset. Because the trial network is just N(x, y), a problem on a nearby geometry can start from the trained weights instead of random noise — transfer learning, exactly as in vision. Take the disk solution above and reuse it for a very mild ellipse (semi-axes 1.05 \times 0.95): same Laplace equation, same \sin 3\theta boundary data, same network — only the collocation and boundary points move from the circle onto the ellipse.

using Lux, Random, Zygote, Optimisers, Statistics

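net = Lux.Chain(Lux.Dense(2 => 32, tanh), Lux.Dense(32 => 32, tanh),
                Lux.Dense(32 => 32, tanh), Lux.Dense(32 => 1))
_, st = Lux.setup(Random.default_rng(), net)     # layer states; the weights ps are passed in explicitly
N(ps, X) = first(Lux.apply(net, X, ps, st))      # u(x, y); st carries no trainable state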

# Laplacian uₓₓ + u_yy by the finite-difference-in-input stencil (Zygote-safe, §5.3):
h = 1f-2;  ex = Float32[h, 0];  ey = Float32[0, h]
lap(ps, X) = (N(ps, X .+ ex) .+ N(ps, X .- ex) .+ N(ps, X .+ ey) .+ N(ps, X .- ey)
              .- 4f0 .* N(ps, X)) ./ h^2
loss(ps, XY, XYb, gb) = mean(abs2, lap(ps, XY)) + 50f0 * mean(abs2, N(ps, XYb) .- gb)

# train, random_init, disk_points and ellipse_points are helpers not shown in this excerpt.
# 1. disk: interior + boundary points on the unit circle, BC u = sin(3θ):
ps_disk = train(random_init(), disk_points()...)            # from scratch

# 2. ellipse (a, b) = (1.05, 0.95) — SAME net, points stretched onto the ellipse.
#    The geometry enters ONLY through where the point clouds sit:
ps_ellipse = train(deepcopy(ps_disk), ellipse_points(1.05f0, 0.95f0)...)   # ← warm start

The disk network drops in unchanged as the initial guess. On a hand-rolled Lux + Zygote PINN (scripts/ellipse_warmstart_solve.jl, the §5.3 finite-difference-in-input pattern, runnable end to end), the payoff is large:

ellipse run	initial weights	loss after 200 steps	Adam steps to loss < 0.002
warm start	trained disk PINN	2.4e-03	353
from scratch	random init	4.6e+00	3524

The warm start begins already inside the basin: after 200 Adam steps its loss (2.4e-03) is within a hair of the 0.002 tolerance, which it crosses at step 353, while the cold start needs ~3500 steps to get there. That is about 10× sooner. The lesson generalises: when you must solve a family of related PDEs (a parameter sweep, a shape-optimisation loop, a slowly moving domain), solve one and warm-start its neighbours. The caveat is the §7.2 flip side: transfer only helps when the neighbour is genuinely close. Stretch the ellipse hard, or change the boundary data, and the disk init is no better than random; the optimiser still has to cross the whole landscape.

7.2 Why Naïve PINNs Fail

“Vanilla” PINN training (one global loss weight, smooth-MLP ansatz, soft BC enforcement, simultaneous residual evaluation across the whole space-time domain) works on the textbook examples of §7.1 and breaks on most realistic problems. The reasons are by now well understood; the dominant references are Krishnapriyan et al. 2021 (“Characterizing possible failure modes in physics-informed neural networks”), Wang, Sankaran & Perdikaris 2022 on causality, and the NTK-based diagnostics of Wang, Yu & Perdikaris 2022. They converge on three intertwined failure modes, covered below in order of how often they hit you in practice: loss imbalance, spectral bias, causal violation. The benchmark survey of de Wolff et al. 2021 adds a useful reproducibility study, showing the same failures on shallow-water and advection–diffusion problems, but it isn’t the foundational reference; the failure modes themselves were identified earlier and more cleanly elsewhere.

Loss imbalance between residual and boundary terms

The total loss

\mathcal{L} \;=\; \lambda_r\,\mathcal{L}_{\text{PDE}} + \lambda_b\,\mathcal{L}_{\text{BC}} + \lambda_i\,\mathcal{L}_{\text{IC}} + \lambda_d\,\mathcal{L}_{\text{data}}

is a sum of terms whose magnitudes can differ by orders of magnitude — easily 10^3 in problems with sharp ICs or stiff BCs. The optimiser drives the largest term to zero first; if that’s a term you don’t care about, the others stagnate. The classic failure: \mathcal{L}_{\text{PDE}} hits machine precision but the BC is still off by 10^{-1}, producing a function that satisfies the equation in the interior but ignores the boundary. Krishnapriyan et al. show this is the generic outcome for PINNs on convection-dominated and high-frequency problems.
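To see the mechanism in isolation, here is a toy sketch that is deliberately not a PINN: two quadratic terms stand in for a dominant and a neglected loss term, the first 1000× stiffer than the second. Plain gradient descent needs a step small enough for the stiff term, so the other term barely moves. The function names, the step size and the factor of 1000 are all illustrative assumptions.

```julia
# Toy stand-in for an imbalanced loss: L = 1000 (a - 1)² + (b - 1)².
# `a` is the parameter of the dominant term, `b` of the neglected one.
stiff(a) = 1000 * (a - 1)^2
soft(b)  = (b - 1)^2

function descend(; lr = 4e-4, steps = 100)
    a, b = 0.0, 0.0
    for _ in 1:steps
        a -= lr * 2000 * (a - 1)    # gradient of the stiff term
        b -= lr * 2 * (b - 1)       # gradient of the soft term
    end
    return a, b
end

a, b = descend()
println("stiff term: ", stiff(a), "   soft term: ", soft(b))
```

After 100 steps the stiff term has collapsed while the soft term has lost only a small fraction of its starting value. Multiplying the soft term by 1000 would give both the same contraction rate, which is the idea behind the loss weighting discussed in §7.3.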

Spectral bias

Smooth-activation MLPs are biased toward learning low-frequency components first (Rahaman et al. 2019; Tancik et al. 2020). High-frequency features — sharp ICs, oscillatory solutions, fine spatial structure — take many more iterations or never converge. The §6.2 Fourier-mode analysis says exactly why: each Fourier mode has its own loss-landscape curvature; the high-k modes are poorly conditioned and Adam barely moves their parameters.

The effect resembles the way diffusion smooths rapidly varying data (every Fourier component decays at a rate \propto k^2), but the PINN inherits its bias from the parameterisation, not from the physics. The fix (Fourier feature embeddings, §7.3) is also Fourier-shaped.

Causal violation in time-dependent problems

The PDE residual is evaluated at all (\mathbf{x}_i, t_i) collocation points simultaneously. Nothing in the loss requires that the solution at t_2 be a consequence of the solution at t_1 < t_2. The optimiser can find a globally inconsistent solution whose residual is small everywhere but whose time-direction information flow is wrong — invisible to the loss but obvious in the predicted trajectory.

Wang et al. 2022 gave the clean diagnostic — tracking how the residual at each time t falls during training, which for a causally-failing PINN happens out of order (late t before early t) — and the matching fix (causal training, §7.3). It’s the failure mode that most often bites real-world inverse problems, including the Unit 1 shallow-water source-recovery example.
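The diagnostic needs nothing more than a per-time-bin average of the squared residual, recomputed at a few training checkpoints. The helper below is a minimal sketch under assumptions: `binned_residual` is a hypothetical name, and `ts` and `rs` are plain vectors of collocation times and residual values that you have already evaluated by whatever means your solver offers.

```julia
# Mean squared residual per time bin.
# `ts`: collocation times, `rs`: residual values at those points (same length).
function binned_residual(ts, rs; nbins = 10)
    t0, t1 = extrema(ts)
    sums   = zeros(nbins)
    counts = zeros(Int, nbins)
    for (t, r) in zip(ts, rs)
        # Map t to a bin index in 1:nbins; the right endpoint goes in the last bin.
        k = min(nbins, 1 + floor(Int, nbins * (t - t0) / (t1 - t0)))
        sums[k]   += abs2(r)
        counts[k] += 1
    end
    return sums ./ max.(counts, 1)      # empty bins report 0
end

# Synthetic check: a residual that grows linearly in time.
ts = collect(range(0.0, 1.0; length = 1001))
rs = ts
profile = binned_residual(ts, rs; nbins = 10)
```

Calling this at several checkpoints and overlaying the profiles shows the ordering directly: in a causally healthy run the early bins fall first, in a failing run the late bins drop while the early ones are still high.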

A benchmark catalogue worth knowing about

For a head-to-head comparison of vanilla PINNs on three non-trivial PDEs (a variable-depth wave equation, the 2-D linearised shallow-water system with Coriolis and viscosity, and SWE-driven advection–diffusion of a thermal anomaly), de Wolff et al. 2021 is a useful reference. Their headline finding (vanilla PINN underperforms a pure-data MLP on all three) is consistent with the failure modes above; it’s a negative reproducibility result, not a new diagnosis. We reproduce the cleanest case, the 1-D reduction of their linearised SWE, below to make the failure concretely visible before we discuss fixes.

Worked example: naïve linearised shallow water

A 1-D reduction of the linearised SWE (\eta_t + H u_x = 0,\;u_t + g\eta_x = 0 on [0, 2]\times[0, 0.05]) trains in seconds and runs straight into the failure modes above. The same example in 2-D has the same form, just slower:

Figure 2: Naïve PINN training on the 1-D linearised SWE, from an actual run (scripts/naive_swe_failure_solve.jl). Left: at the half-time snapshot the exact d’Alembert solution has split the initial pulse into two counter-propagating bumps (height ½ at x \approx 0.75, 1.25); the trained PINN over-smooths them into a single low lump, a relative L^2 error around 75%. Right: the binned PDE residual is largest at the sharp t = 0 pulse and small afterwards. Note the trap: the residual is small almost everywhere, yet the solution is badly wrong; driving the residual down is not the same as solving the problem. This run cleanly shows loss imbalance and spectral bias; causal violation is a training-dynamics effect taken up in §7.3.

using NeuralPDE, Lux, ModelingToolkit, Optimization, OptimizationOptimJL
using DomainSets, Plots

@parameters x t
@variables η(..) u_vel(..)

Dₜ = Differential(t);  Dₓ = Differential(x)
g = 9.81;  H = 10.0   # wave speed c ≈ 9.9 m/s

eqs = [
    Dₜ(η(x, t)) + H * Dₓ(u_vel(x, t)) ~ 0,
    Dₜ(u_vel(x, t)) + g * Dₓ(η(x, t)) ~ 0,
]

domains = [x ∈ ClosedInterval(0.0, 2.0), t ∈ ClosedInterval(0.0, 0.05)]

bcs = [
    η(x, 0)        ~ exp(-100 * (x - 1.0)^2),
    u_vel(x, 0)    ~ 0.0,
    u_vel(0.0, t)  ~ 0.0,
    u_vel(2.0, t)  ~ 0.0,
]

@named pde_system = PDESystem(eqs, bcs, domains, [x, t], [η(x, t), u_vel(x, t)])

chain_η = Lux.Chain(Lux.Dense(2, 32, tanh), Lux.Dense(32, 32, tanh),
                    Lux.Dense(32, 32, tanh), Lux.Dense(32, 1))
chain_u = Lux.Chain(Lux.Dense(2, 32, tanh), Lux.Dense(32, 32, tanh),
                    Lux.Dense(32, 32, tanh), Lux.Dense(32, 1))

discretization = PhysicsInformedNN([chain_η, chain_u], GridTraining([0.02, 0.001]))
prob = discretize(pde_system, discretization)
res = Optimization.solve(prob, LBFGS(); maxiters=3000)

Train this and the run in Figure 2 shows two of the failure modes plainly: the sharp initial bump oversmooths into a single low lump (spectral bias), and because \mathcal{L}_{\text{PDE}} dominates the loss budget the boundary and initial terms never tighten (loss imbalance) — the result is about 75% wrong. The PDE residual, meanwhile, is small everywhere except at the sharp initial pulse: a reminder that driving the residual down is not the same as solving the problem. The third failure mode, causal violation, is a property of the training dynamics — the optimiser reducing late-time residual before the early-time structure is resolved — and is best seen and cured through causal training (§7.3) rather than read off a static snapshot. Together these are the target failure that §7.3 dismantles.
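The exact reference quoted in Figure 2 can be sketched in a few lines, which is handy for grading any trained network against it. This is an illustration under assumptions: the names `η_exact` and `rel_l2` are hypothetical, and the d’Alembert form is only used while the pulses are still far from the walls at x = 0 and x = 2, which holds on this time window.

```julia
# Exact reference for the 1-D linearised SWE started from rest:
# η(x, t) = ½ η₀(x - ct) + ½ η₀(x + ct), with c = √(gH).
g, H = 9.81, 10.0
c = sqrt(g * H)
η0(x) = exp(-100 * (x - 1.0)^2)
η_exact(x, t) = 0.5 * η0(x - c * t) + 0.5 * η0(x + c * t)

# Relative L² error of a candidate field against the reference on a grid.
rel_l2(approx, exact) = sqrt(sum(abs2, approx .- exact) / sum(abs2, exact))

xs      = range(0.0, 2.0; length = 401)
t_half  = 0.025
profile = η_exact.(xs, t_half)
x_peak  = xs[argmax(profile[1:200])]      # location of the left-moving bump

# Baseline for calibration: a flat-zero guess scores a relative error of 1.
err_zero = rel_l2(zeros(length(xs)), profile)
```

The left bump sits near x = 0.75 with height close to ½, matching the caption. A PINN that scores around 0.75 in `rel_l2` is therefore only modestly better than predicting zero everywhere.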

✏️ Section exercise — reproduce the failure, then prove it

Run the 1-D linearised-SWE script above (it trains in seconds) and produce the two diagnostics from Figure 2 yourself:

The half-time snapshot of \eta(x, t) against the exact reference — d’Alembert gives two half-height bumps moving at c = \sqrt{gH} \approx 9.9 m/s (a 20-line leapfrog solver of the same system works too). Measure the relative L^2 error.
The residual-vs-time histogram: evaluate the PDE residual on a dense grid, bin by t into 10 bins, and plot the mean per bin.

What each one tells you: the snapshot exposes spectral bias (one lump, not two) and loss imbalance (the boundary and initial terms left loose). The residual histogram makes a subtler point — it is small almost everywhere even though the snapshot is badly wrong, so a low residual does not certify a correct solution. Then the cheap experiment: multiply the BC loss terms by 100 (crude manual loss-balancing) and report which symptom improves and which doesn’t — evidence that loss imbalance and spectral bias are distinct diseases, each needing its own §7.3 fix.

💡 Hint

The cheapest reference isn’t FD at all — d’Alembert gives \eta(x,t) = \tfrac12\eta_0(x-ct) + \tfrac12\eta_0(x+ct) with c=\sqrt{gH} \approx 9.9 m/s, so at the half-time snapshot the initial bump has split into two half-height copies. This is a two-network system, so phi = discretization.phi is a vector, and each network takes its OWN parameter slice — use ηf(x,t) = first(phi[1]([x,t], res.u.depvar.η)) and uf(x,t) = first(phi[2]([x,t], res.u.depvar.u_vel)), not the full res.u. The residual is first-order, so each term is a single ForwardDiff.derivative (not nested): ForwardDiff.derivative(τ->ηf(x,τ), t) + 10*ForwardDiff.derivative(ξ->uf(ξ,t), x). Bin t into ~10 ranges and plot the per-bin mean. The snapshot exposes spectral bias (one lump not two) + loss imbalance; the residual histogram is small except at the sharp t=0 pulse — its lesson is that a small residual does not mean a correct solution.

Go to solution →

7.3 Modern Fixes

Causal Training

Wang et al. (2022) weight residual losses at later times by how well earlier times have already converged. Concretely,

\lambda(t_i) = \exp\!\Bigl(-\epsilon \sum_{t_j < t_i} \mathcal{L}_{\text{PDE}}(t_j)\Bigr),

so a time slice contributes only after earlier slices have been resolved: the running sum stays large — keeping the weight near zero — until the earlier-time loss is small, and \epsilon sets how strictly this ordering is enforced. This trains the network in the right temporal causal order without explicit time-stepping.
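The weight formula is a cumulative sum followed by an exponential. The sketch below assumes the residual has already been averaged into time slices ordered from early to late; `causal_weights` and `slice_losses` are hypothetical names, and the numbers are made up for illustration.

```julia
# Causal weights λ(tᵢ) = exp(-ε Σ_{tⱼ < tᵢ} L_PDE(tⱼ)) from per-slice residual losses.
# `slice_losses[i]` is the mean squared residual in time slice i, ordered by time.
function causal_weights(slice_losses; ε = 1.0)
    earlier = cumsum(slice_losses) .- slice_losses   # sum over strictly earlier slices
    return exp.(-ε .* earlier)
end

slice_losses  = [1.0, 0.5, 0.1]
w             = causal_weights(slice_losses)         # first slice always has weight 1
weighted_loss = sum(w .* slice_losses) / length(slice_losses)
```

The first slice always gets weight 1, and later slices stay suppressed until the losses before them shrink. In a training loop the weights would typically be treated as constants when the loss is differentiated, so the optimiser cannot lower the loss by inflating early residuals.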

Fourier Feature Embeddings

Wrap each input in a Fourier feature transform

\gamma(\mathbf{x}) = [\sin(B\mathbf{x}),\, \cos(B\mathbf{x})]

with frequencies B drawn from a distribution that includes the high frequencies you need (Tancik et al., 2020). The network sees high-frequency content in its input directly, breaking spectral bias.

How it is done. Choose a number of frequencies m and sample a matrix B \in \mathbb{R}^{m \times d} once, where d is the input dimension (d = 2 for (x, t)); each row is one frequency vector. The transform turns the d-dimensional input into a 2m-dimensional vector \gamma(\mathbf{x}), which is fed to the first dense layer — so the model is u_\theta(\mathbf{x}) = N_\theta(\gamma(\mathbf{x})) and the first layer has shape 2m \to \text{width}. B is drawn once and held fixed; it is not a trained parameter (a learnable B is a known variant, but the fixed version is the standard recipe). A common choice is B \sim \mathcal{N}(0, \sigma^2 I), and the scale \sigma is the one knob that matters: too small and the embedding is still smooth, so spectral bias survives; too large and the network starts fitting high-frequency noise and the PDE residual becomes hard to drive down. Tune it by scanning a few values of \sigma and watching the residual, starting near the highest frequency the true solution actually contains. When space and time have different scales, use a separate \sigma for the spatial and temporal rows of B. The residual machinery is unchanged: autodiff differentiates straight through the \sin / \cos, so the rest of the training loop stays the same. The §7.3 exercise builds the whole thing in about twenty lines.
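The recipe above can be sketched without any deep-learning library. The closure below is a minimal illustration, with `fourier_embedding` a hypothetical name and the seed, m and σ chosen arbitrarily; in a real model its output would feed the first dense layer.

```julia
using Random

# Fixed random Fourier embedding γ(x) = [sin(Bx); cos(Bx)].
# B is m × d with entries drawn once from N(0, σ²) and never trained.
function fourier_embedding(rng, m, d; σ = 10.0)
    B = σ .* randn(rng, m, d)
    return x -> vcat(sin.(B * x), cos.(B * x))
end

rng = MersenneTwister(0)                        # any seeded RNG; the seed is arbitrary
γ   = fourier_embedding(rng, 16, 2; σ = 10.0)   # m = 16 frequencies, inputs (x, t)
z   = γ([0.3, 0.01])                            # 2m = 32 features for the first dense layer
```

Because B is captured once inside the closure, every call uses the same frequencies, which is what keeps the embedding a fixed feature map and not an extra set of parameters.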

Hard Boundary Condition Enforcement

Instead of penalising BC violations softly, build the network output so that it satisfies them by construction. For Dirichlet BCs on [0, 1] with u(0, t) = u(1, t) = 0, write

u_\theta(x, t) = x(1 - x)\,N_\theta(x, t).

Now u_\theta = 0 on the boundary regardless of N_\theta. The loss has one fewer term to balance, and the optimiser has fewer ways to fail. The trick generalises to other BC types via clever ansätze (Sukumar & Srivastava, 2022).
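A quick numerical check makes the point that the inner function is irrelevant on the boundary. In this sketch `N_raw` is a hypothetical stand-in for the network, chosen so that it is clearly nonzero at x = 0 and x = 1.

```julia
# Hard Dirichlet enforcement: u(x, t) = x (1 - x) N(x, t) vanishes at x = 0 and x = 1
# for ANY inner function N. N_raw is an arbitrary stand-in for the network.
N_raw(x, t)  = 3.7 + sin(5 * x) * cos(2 * t)
u_hard(x, t) = x * (1 - x) * N_raw(x, t)

left  = u_hard(0.0, 0.3)     # boundary value at x = 0
right = u_hard(1.0, 1.7)     # boundary value at x = 1
mid   = u_hard(0.5, 0.0)     # interior value, free to be anything
```

The boundary values are zero exactly, not approximately, so there is no boundary loss left to weight.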

Adaptive Loss Weighting

Tune \lambda_r, \lambda_b, \lambda_i during training so all terms stay roughly balanced. Three families of approaches:

Gradient-balancing (Wang, Teng & Perdikaris, 2021): rescale the weights so the gradient norms of the terms match. The self-adaptive per-point weights of McClenny & Braga-Neto (2022) are a related alternative.
NTK-based (Wang et al., 2022): use the neural tangent kernel to compute target weights.
Annealing: start with one term dominant, gradually shift weight to others.

All three typically beat hand-tuning. With these, the modern PINN is ready for the problems §7.4 catalogues.
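As a rough sketch of the gradient-balancing idea, the update below picks the boundary weight so that the weighted boundary gradient has about the same norm as the residual gradient, smoothed with a moving average. It is a simplified variant written for illustration, not any paper's exact algorithm; `balance_weight`, `settle` and the stand-in gradients are assumptions, and in a real loop the gradients would be recomputed at every step.

```julia
using LinearAlgebra

# One update of a simplified gradient-norm balancing rule:
# aim for ‖λ_b ∇L_BC‖ ≈ ‖∇L_PDE‖, smoothed with a moving average of rate β.
function balance_weight(λ_b, grad_pde, grad_bc; β = 0.9)
    target = norm(grad_pde) / (norm(grad_bc) + eps())
    return β * λ_b + (1 - β) * target
end

# Repeat the update with frozen gradients to see where the weight settles.
function settle(λ0, grad_pde, grad_bc; steps = 200)
    λ = λ0
    for _ in 1:steps
        λ = balance_weight(λ, grad_pde, grad_bc)
    end
    return λ
end

# Stand-in gradients: the residual gradient is 100× larger than the boundary one.
grad_pde = fill(10.0, 50)
grad_bc  = fill(0.1, 50)
λ_b = settle(1.0, grad_pde, grad_bc)
```

With these stand-in gradients the weight settles at the norm ratio of 100. The moving average keeps the weight from jumping when gradient norms fluctuate from step to step.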

✏️ Section exercise — Fourier features in twenty lines

Isolate the spectral-bias cure from all PINN machinery. Fit f(x) = \sin(25x) on [0, 1] from 200 samples with two Lux networks: (a) a plain 1 → 64 → 64 → 1 tanh MLP, and (b) the same MLP fed the Fourier-feature embedding \gamma(x) = [\sin(Bx), \cos(Bx)] with 16 frequencies B \sim \mathcal{N}(0, 10^2) (so input width 32). Train both with identical Adam budgets (2 000 iterations) and plot the two loss curves and the two fits. Then break it: re-draw B with \sigma = 1 and with \sigma = 100, and describe both failure directions. The takeaway to write in one sentence: what does \sigma have to match for the embedding to work? Separately, verify the hard-BC trick: show that u_\theta(x, t) = x(1 - x)N_\theta(x, t) satisfies u(0, t) = u(1, t) = 0 for any network, and find an analogous ansatz that enforces u(x, 0) = g(x).

💡 Hint

The embedding is a fixed (untrained) layer: Lux.WrappedFunction(x -> vcat(sin.(B*x), cos.(B*x))) with B = σ .* randn(rng, Float32, 16, 1) drawn ONCE — the next Dense then has input width 2\times16 = 32. Keep the Adam budget identical across all runs or the comparison proves nothing. The σ story: σ=1 keeps all frequencies far below 25 (bias unbroken, looks like the plain MLP); σ=100 scatters them far above (over-oscillates between samples) — so σ must bracket the target’s own frequency (~25). Hard-IC ansatz: u_θ(x,t) = g(x) + t·N_θ(x,t) gives u(x,0)=g(x) for any network, because the t factor vanishes at t=0 (the same trick Unit 10 §10.3 uses).

Go to solution →

7.4 A taxonomy of PINN workflows

“A PINN” isn’t a single recipe — it’s a family of training configurations differentiated by which loss terms are present, what the network represents, and what’s known vs unknown going in. The four workflows below cover the great majority of PINN applications, as far as we know. Pick the one whose “Unknown” column matches what you don’t know.

Quick-reference: pick a workflow

	W1 · Forward solve	W2 · Parameter ID	W3 · Source / driver recovery	W4 · Hybrid simulator
Unknown	the field u(\mathbf{x}, t) itself	scalar \boldsymbol{\xi} in the PDE	a function f(t) or g(\mathbf{x})	a closure / sub-grid term
Loss terms	residual + IC + BC	+ data	+ data	+ data
Network represents	u_\theta	u_\theta + scalar \xi	u_\theta + f_\phi	u_\theta + closure N_\phi
Canonical example	Burgers’ eq forward solve	recover diffusivity from sparse u	capstone: storm \tau(t) from moorings	sub-grid turbulence closure
In the workshop	Unit 5 §5.4	(course tangent)	Unit 9–10 (PINN); Unit 1 §1.2 (naïve PINN fails → adjoint wins)	Unit 4 §4.3 (UDEs, ODE form)

Each workflow is laid out below as a self-contained card.

W1 · Forward solve — replace the numerical solver

Setup. Given a PDE + IC + BC, train u_\theta(\mathbf{x}, t) to satisfy them. Loss = \mathcal{L}_{\text{PDE}} + \mathcal{L}_{\text{IC}} + \mathcal{L}_{\text{BC}}. No data term. The trained network is the solution.

Canonical example. Solve u_t = \alpha u_{xx} with given initial and boundary conditions — the heat-equation sanity check in §7.1 above is exactly this shape.

Reach for it when… Mesh-free needed (irregular geometry, no quadrature scheme); continuous-resolution surrogate wanted (query u_\theta(\mathbf{x}, t) at any coordinate, no re-meshing); downstream optimisation needs \partial u / \partial \mathbf{x} cheaply via autodiff.

Don’t reach for it when… A tuned FD / FV / FEM solver on its own mesh already exists. PINN forward solves are rarely faster or more accurate; the benefit is the meshless + differentiable properties, not raw speed.

W2 · Parameter identification — recover scalar unknowns

Setup. As W1 plus a data loss \mathcal{L}_{\text{data}} = \frac{1}{N}\sum_i (u_\theta(\mathbf{x}_i, t_i) - u_{\text{obs}, i})^2, and the PDE contains unknown scalar parameters \boldsymbol{\xi} = (\xi_1, \xi_2, \ldots) that are optimised jointly with the network weights \theta.

Canonical example. Recover a diffusion coefficient \alpha in u_t = \alpha u_{xx} from sparse, noisy measurements of u. Or: recover (\sigma, \rho, \beta) in Lorenz from a trajectory.

Reach for it when… The parametric form is known and the data are informative enough to pin each parameter down to at least its order of magnitude. PINNs beat classical inverse methods when the forward map is hard to make adjoint-differentiable (e.g., the simulator is a black box, or the adjoint code doesn’t exist).

Don’t reach for it when… The parameter set is high-dimensional (\boldsymbol{\xi} \in \mathbb{R}^{100+}); classical Bayesian inference with MCMC is the right tool, not gradient descent on a PINN loss.
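To make the W2 idea concrete without a network, the sketch below freezes the field and identifies \alpha from the residual alone. It is a deliberate simplification under assumptions: an exact solution stands in for a field already fitted to data, finite differences replace autodiff, and \alpha is solved in closed form, whereas a W2 PINN optimises it jointly with the weights. All names here are hypothetical.

```julia
# Identify α in u_t = α u_xx by least squares on the PDE residual.
α_true = 0.01
u(x, t) = exp(-α_true * π^2 * t) * sin(π * x)    # stands in for a field fitted to data

h = 1e-3                                          # finite-difference step (replaces autodiff)
u_t(x, t)  = (u(x, t + h) - u(x, t - h)) / (2 * h)
u_xx(x, t) = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h^2

# Interior sample points, kept away from the domain edges.
pts = [(x, t) for x in 0.1:0.1:0.9, t in 0.2:0.2:1.8]

# Minimising Σ (u_t - α u_xx)² over α has the closed form α̂ = Σ u_t u_xx / Σ u_xx².
num   = sum(u_t(x, t) * u_xx(x, t) for (x, t) in pts)
den   = sum(u_xx(x, t)^2 for (x, t) in pts)
α_hat = num / den
```

The estimate is only as good as the derivatives of the fitted field, which is why noisy or sparse data make parameter identification harder than the forward solve.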

W3 · Source / driver recovery — recover an unknown function (functional inverse)

Setup. As W2 but the unknown is a function, not a scalar: a boundary-condition timeseries \psi(t), a spatially varying source f(\mathbf{x}), a closure law g(\mathbf{u}). Parameterise the unknown with a small auxiliary network \psi_\phi trained jointly with u_\theta, or directly read it off u_\theta at a designated coordinate.

Canonical example. The Unit 9 / Unit 10 capstone — recover the storm wind-stress envelope \tau(t) from mooring data with a PINN. Unit 1 §1.2 poses the same problem shape (recover the river-mouth ψ(t) from four bay tide-gauge timeseries), but is where the naïve PINN is shown falling short — there the non-PINN adjoint inverse is what recovers ψ(t) (see Don’t reach for it when… below).

Reach for it when… The missing piece isn’t a number but a shape, and you suspect it’s reasonably smooth (so a small auxiliary network + smoothness prior is well-posed). Most field-science source-recovery problems live here.

Don’t reach for it when… A well-tested adjoint inversion exists for your forward map — the Unit 1 worked example shows the adjoint inverse often beats the naïve PINN by a wide margin on linear forward maps. PINNs come into their own when the forward map is nonlinear and the adjoint is hard to write.

W4 · Hybrid simulator (data assimilation + UDE) — learn a closure

Setup. Combine a known PDE residual, sparse observations, and a learned closure / forcing term in one loss. This is the Universal Differential Equation picture of Unit 4 §4.3 lifted to PDEs. The network N_\phi represents the unknown closure (e.g. \dot{\mathbf{x}} = f_{\text{known}} + N_\phi), not the full solution.

Canonical example. Sub-grid turbulence closures in climate models; reaction-rate corrections in chemistry; bottom-friction laws in shallow-water models calibrated against gauge data. The Crown-of-Thorns starfish UDE in Unit 4 §4.3 is a one-dimensional analogue: known logistic growth + Holling-II grazing + learned mortality closure.

Reach for it when… You have trustworthy physics, observations, and a known gap between them. The biggest current growth area in scientific ML and the most defensible choice for production work — the learned piece is small, local, and individually interpretable.

Don’t reach for it when… You don’t actually have a known physics skeleton — at that point you’re doing pure data-driven ML, not physics-informed ML, and you should reach for a different toolset.

How to pick: a decision tree

Walk the questions top-to-bottom and stop at the first one that names a workflow:

Do you have field data? — No → W1 (forward solve).
Is the missing piece a small number of scalars? — Yes → W2 (parameter ID).
Is the missing piece a function (timeseries, spatial field, closure law)?
- …and you trust the whole governing equation as written → W3 (source recovery).
- …and there’s a known gap between the governing equation and reality → W4 (hybrid simulator).
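The same tree reads naturally as a small function. The argument names and symbols below are illustrative choices, not part of any library.

```julia
# The decision tree as a function. `unknown` is :scalars or :func when data exist.
function pick_workflow(; has_data::Bool, unknown::Symbol = :field, trust_equation::Bool = true)
    has_data || return :W1                       # no field data: forward solve
    unknown === :scalars && return :W2           # a few scalar parameters
    unknown === :func || error("unknown must be :scalars or :func when data are present")
    return trust_equation ? :W3 : :W4            # trusted PDE vs. known gap
end

forward  = pick_workflow(has_data = false)
param_id = pick_workflow(has_data = true, unknown = :scalars)
source   = pick_workflow(has_data = true, unknown = :func)
hybrid   = pick_workflow(has_data = true, unknown = :func, trust_equation = false)
```

Writing it this way makes the ordering explicit: the data question is asked first, and the trust question only matters once the unknown is a function.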

Almost every PINN paper in the 2019–2026 corpus lands in one of these four. The Unit 1 worked example (where the naïve PINN falls short), the capstone, and the modern fixes from §7.3 are all aimed at making W3 practical on real problems; that’s the through-line of the workshop.

✏️ Section exercise — file these five papers

Classify each published setup into W1–W4 using the quick-reference table, and name the giveaway:

Hidden Fluid Mechanics (§7.7): given dye-concentration movies, reconstruct the unmeasured velocity and pressure fields under a Navier–Stokes residual.
A PINN solves Burgers’ equation from IC + BCs alone, benchmarked against a spectral solver.
From sparse pressure-head wells, recover the scalar parameters of a van-Genuchten soil-retention curve and the head field.
A climate model keeps its dynamical core but learns a sub-grid cloud-microphysics correction from high-resolution simulation output.
The Unit 1 problem: four tide gauges, known SWE physics, unknown river-mouth timeseries ψ(t).

One of the five is genuinely arguable between two workflows — say which, and defend both readings.

💡 Hint

Run the §7.4 decision tree on each: is the unknown the field itself, a few scalars, a function, or a correction to trusted physics? For the arguable one, ask whether the governing equations contain any unknown at all — and then ask what’s substituting for the missing BCs/ICs.

Go to solution →

7.5 The inverse problem in practice

Workflows 2 and 3 above are the operational reason to use PINNs in field science. They also fail in characteristic ways worth flagging.

When it works

Reasonably dense observations. The Unit 1 example uses 4 gauges over 6 hours, ~120 samples each — enough to identify a smooth ψ(t).
Well-posed forward physics. If the forward map is invertible in principle (i.e., the data determines the unknown up to a small null space), regularisation gets you the rest of the way.
Smooth unknowns. Smooth ψ(t) / smooth source field; smooth-MLP ansatz is well-matched.
Modern fixes applied. Causal training, Fourier features, hard BC, adaptive weights — pre-§7.3 PINNs lose to classical adjoint methods on most realistic inverse problems.

When it doesn’t

Severe ill-posedness. Many possible unknowns produce near-identical observations; no amount of data fixes this. Tikhonov / Bayesian regularisation is required. The Unit 1 comparison (“naïve PINN fits the gauges but misses the source amplitude”) is exactly this story.
Discontinuous unknowns (step changes, shocks). Smooth-MLP PINNs round them off. KANs (Unit 2 §2.9) or spectral basis ansätze are alternatives worth trying.
Very noisy data. Without a noise model in the loss, PINNs overfit observation noise. Weighting \mathcal{L}_{\text{data}} by the noise variance (a Gaussian likelihood) helps, but it isn’t free: the noise level has to be known or estimated.

A practical recipe

For inverse problems of Workflow 2 or 3 type:

Start with the classical solution if you have one. Adjoint inversion or Tikhonov-regularised LSQ — these are battle-tested on linear forward maps. Use them as the reference.
Set up the PINN forward problem first (Workflow 1) and verify it converges on a representative parameter set.
Add the data term gradually — start with the data weight at 10^{-2}\times the residual weight, then ramp it up. (Adaptive weighting from §7.3 makes this automatic.)
Watch the residual in time, not just in total. Track how the residual at each time t falls during training; if late t is driven down before early t is resolved, the fit is causality-violating (§7.3). And remember the converse trap from §7.2: a small total residual on its own does not prove the recovered source is right.
Sanity-check on a known truth. Generate synthetic data from a known ψ; recover it; ensure the recovered shape and amplitude are within tolerance. Only then turn it loose on real data.

The two-network source-recovery paradigm

Workflow 3 has a recurring shape that does not depend on the physics. Every source-recovery problem is this same shape in different clothes — the Unit 1 river-mouth \psi(t) (the motivating example, where the naïve PINN falls short and the non-PINN adjoint inverse is what recovers it), the capstone storm envelope \tau(t) (where a PINN does recover the driver), a spatial source g(\mathbf{x}), a closure law. Write the forward problem so the unknown enters linearly, multiplying a known shape:

\partial_t u = \mathcal{A}\,u + s(\mathbf{x})\,f(t),

where \mathcal{A} is the known spatial operator (for the capstone column, advection plus diffusion), s(\mathbf{x}) is the known source shape, and f(t) is the unknown driver we want back. Three structural choices define the paradigm:

Two networks, trained jointly. A field network u_\theta(\mathbf{x}, t) for the solution and a small second network f_\phi(t) for the unknown driver, optimised together in one loss — not in two separate stages.
Hard-wire the IC and BCs into the field ansatz (§7.3). With the initial and boundary conditions built in for free, the only soft loss terms left are the PDE residual and the data misfit (plus a flux BC if there is one).
The driver is recovered indirectly. Apart from the smoothness penalty introduced below, f_\phi appears in only one loss term, the residual, and rearranging it gives f_\phi \approx (\partial_t u_\theta - \mathcal{A}\,u_\theta)/s. So f is never fitted to data directly: it is set by the derivatives of the field network.

That last point is the whole difficulty. The sensor data constrains the values of u_\theta at a handful of points, but f is fixed by the derivatives of u_\theta. A field can sit right on the sensor values while its derivatives are too small, and then f_\phi comes out too low, especially at the peak, even though the data loss looks fine. Under-recovery, not blow-up, is the failure mode.
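A manufactured example makes the under-recovery mechanism concrete. Everything in the sketch below is an illustrative assumption: a separable field u(x,t) = sin(πx) g(t), the operator A = κ ∂xx, the shape s(x) = sin(πx), and sensors every 0.25 time units. The second field, `u_flat`, matches the true field exactly at the sensor times but is piecewise linear in between, so its time derivative is flattened.

```julia
# Manufactured setup (all choices are illustrative assumptions):
#   u(x,t) = sin(πx) g(t),  A = κ ∂xx,  s(x) = sin(πx)
# so the exact driver is f(t) = g'(t) + κ π² g(t).
κ = 0.01
g(t) = exp(-(t - 1)^2 / (2 * 0.1^2))          # narrow pulse peaking at t = 1
u_true(x, t) = sinpi(x) * g(t)
s(x) = sinpi(x)

# A field that agrees with u_true at the sensor times 0, 0.25, ..., 2
# but only interpolates linearly in time between them.
t_obs = 0:0.25:2
function g_lin(t)
    k = clamp(searchsortedlast(t_obs, t), 1, length(t_obs) - 1)
    ta, tb = t_obs[k], t_obs[k + 1]
    return g(ta) + (g(tb) - g(ta)) * (t - ta) / (tb - ta)
end
u_flat(x, t) = sinpi(x) * g_lin(t)

# Driver implied by a field: f ≈ (∂t u − κ ∂xx u) / s, by central differences.
function implied_driver(u, x, t; h = 1e-4)
    ut  = (u(x, t + h) - u(x, t - h)) / (2h)
    uxx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h^2
    return (ut - κ * uxx) / s(x)
end

ts     = 0.05:0.01:1.95
f_true = [implied_driver(u_true, 0.5, t) for t in ts]
f_flat = [implied_driver(u_flat, 0.5, t) for t in ts]

data_misfit = maximum(abs(u_flat(0.5, t) - u_true(0.5, t)) for t in t_obs)
println("max sensor misfit of the flattened field: ", data_misfit)
println("peak driver from the true field:      ", maximum(f_true))
println("peak driver from the flattened field: ", maximum(f_flat))
```

The flattened field has essentially zero data misfit at the sensors yet implies a visibly lower driver peak, which is the failure mode described above in its simplest form.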

The fix is the loss weights. Being precise about what to run:

Set the data weight well above the residual weight, \lambda_d \gg \lambda_r (ramp \lambda_d up from the gentle start of step 3 above). Forcing u_\theta to match the sensor values tightly is what forces its derivatives — and hence the recovered f — to the right size.
Add a small smoothness penalty on the driver, \lambda_\text{reg}\int|f_\phi'(t)|^2\,dt: large enough to remove point-to-point jitter in f_\phi, small enough not to flatten the real peak.

Too small a data weight is the single most common way a W3 PINN silently under-recovers. The loss below is exactly this recipe.

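# Schematic W3 source recovery for a generic ∂t u = κ ∂xx u + s(x) f(t):
# known κ and source shape s(x); unknown driver f(t). The concrete
# capstone instance (real column operator, validated weights) is
# Unit 10 §10.2; the paradigm, not the physics, is the point here.
# Assumed already in scope: κ, s, the collocation points (xc, tc), the sensor
# data (x_obs, t_obs, u_obs), the driver grid tg (all 1×N rows), and the Lux
# states stu, stf returned by Lux.setup.
using Lux, Statistics   # Chain / Dense, and mean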
u_net = Chain(Dense(2=>32, tanh), Dense(32=>32, tanh), Dense(32=>1))  # field u_θ(x,t)
f_net = Chain(Dense(1=>16, tanh), Dense(16=>1))                       # driver f_φ(t)

# hard IC + Dirichlet BC baked into the ansatz (§7.3) — no IC/BC loss terms:
u(p, x, t) = (x .* (1 .- x)) .* t .* first(u_net(vcat(x, t), p, stu))  # u(0,·)=u(1,·)=u(·,0)=0
f(q, t)    = first(f_net(t, q, stf))

# input derivatives by central differences, so Zygote trains through them (§5.3):
h = 1f-3
∂t(p,x,t)  = (u(p,x,t.+h) .- u(p,x,t.-h)) ./ (2h)
∂xx(p,x,t) = (u(p,x.+h,t) .- 2 .* u(p,x,t) .+ u(p,x.-h,t)) ./ h^2

λ_d, λ_reg = 1f3, 1f-5            # ← the lesson lives here: data dominates, smoothing is a whisper
function loss(p, q)
    L_pde  = mean(abs2, ∂t(p,xc,tc) .- κ .* ∂xx(p,xc,tc) .- s.(xc) .* f(q,tc))
    L_data = mean(abs2, u(p, x_obs, t_obs) .- u_obs)            # noisy sensor traces
    L_reg  = mean(abs2, (f(q,tg.+h) .- f(q,tg.-h)) ./ (2h))     # H¹ prior on the driver
    L_pde + λ_d*L_data + λ_reg*L_reg
end
# train p (field) and q (driver) JOINTLY — one Adam state each, both updated every step.
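Before trusting a central-difference stencil like the one in the schematic, it is worth checking the step size against the floating-point precision in use: the roundoff error of a second difference grows like ε/h², so a step that is fine in Float64 can be noisy in Float32. The sketch below is a side experiment under stated assumptions (sin(πx) as a stand-in for the field, a fixed grid of test points), not part of the course code.

```julia
# Second derivative of u(x) = sin(πx) by the central-difference stencil,
# compared against the exact value −π² sin(πx), for two precisions and steps.
u(x) = sinpi(x)
d2_exact(x) = -π^2 * sinpi(x)
d2_fd(x, h) = (u(x + h) - 2 * u(x) + u(x - h)) / h^2

# Worst-case absolute error over a grid of interior points, in number type T.
function max_error(::Type{T}, h) where {T}
    xs = range(T(0.1), T(0.9); length = 81)
    return maximum(abs(Float64(d2_fd(x, T(h))) - d2_exact(Float64(x))) for x in xs)
end

for (T, h) in ((Float32, 1e-3), (Float32, 1e-2), (Float64, 1e-3))
    println(rpad("$T, h = $h", 22), "max |error| = ", max_error(T, h))
end
```

If the Float32 error at your chosen `h` is not small compared with the residual you are trying to drive down, either enlarge `h` or evaluate the stencil in Float64.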

Two guard-rails keep this honest. Run it as a twin experiment first (step 5): plant a known f^\star, forward-solve, sample with noise, and recover — only a known truth lets you quote a recovery error. And expect a deconvolution floor: at finite sensor noise the peak lands a few percent short no matter how you tune, because deconvolution amplifies noise — that residual gap is the inverse problem’s physics, not a bug to optimise away. The capstone walks this paradigm end-to-end on a real mooring geometry, weight-sweep table and all (Unit 9 §9.9 → Unit 10 §10.2).
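The data-generation half of a twin experiment can be sketched in a few lines. The block below is a minimal version under stated assumptions: a diffusion-only operator, zero Dirichlet boundaries and zero initial state, an explicit-Euler forward solve, and a planted driver `f_star`, sensor positions, and noise level that are all made up for illustration. The recovery step itself (the two-network training) is deliberately left out.

```julia
using Random

# Twin-experiment data generator for ∂t u = κ ∂xx u + s(x) f(t) on x ∈ [0,1],
# with u = 0 on both boundaries and at t = 0. All numbers are illustrative.
κ         = 0.01
s(x)      = sinpi(x)                            # known source shape (assumed)
f_star(t) = exp(-(t - 0.5)^2 / (2 * 0.1^2))     # planted driver (assumed)

function forward_solve(f; nx = 51, dt = 0.005, T = 1.0)
    xs = range(0, 1; length = nx)
    dx = step(xs)
    @assert κ * dt / dx^2 <= 0.5 "explicit Euler stability limit violated"
    nt = round(Int, T / dt)
    U  = zeros(nx, nt + 1)                      # U[i, n] ≈ u(xs[i], (n-1)*dt)
    for n in 1:nt
        t = (n - 1) * dt
        for i in 2:nx-1
            uxx = (U[i+1, n] - 2 * U[i, n] + U[i-1, n]) / dx^2
            U[i, n+1] = U[i, n] + dt * (κ * uxx + s(xs[i]) * f(t))
        end
    end
    return xs, (0:nt) .* dt, U
end

xs, ts, U = forward_solve(f_star)

# Sample three hypothetical sensors at every 10th time level, with noise of
# standard deviation 2% of the largest field value.
rng      = Xoshiro(1)
sensor_i = [13, 26, 39]
time_j   = 1:10:length(ts)
σ_noise  = 0.02 * maximum(abs, U)
u_obs    = U[sensor_i, time_j] .+ σ_noise .* randn(rng, length(sensor_i), length(time_j))

# A metric you can only quote because the truth is known: relative peak error.
peak_error(f_rec) = abs(maximum(f_rec.(ts)) - maximum(f_star.(ts))) / maximum(f_star.(ts))
println("sensor array size: ", size(u_obs))
println("peak error of a driver scaled by 0.9: ", peak_error(t -> 0.9 * f_star(t)))
```

Feeding `u_obs` to the joint training and then calling `peak_error` on the recovered driver gives the number the weight sweep is tuned against.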

The next three sections (§§7.6–7.8) zoom out to the practical ecosystem around all of this: the software you’d build a PINN with (§7.6), where PINNs have actually delivered in the published literature (§7.7), and the commercial / industrial deployments worth knowing about (§7.8).

✏️ Section exercise — recover a diffusivity (Workflow 2, end-to-end)

The smallest honest inverse problem. Generate synthetic data from the heat equation with known \alpha^\star = 0.07: solve u_t = \alpha^\star u_{xx} on [0,1]\times[0,1] (FD, Dirichlet BCs, u_0 = \sin(\pi x) — or just use the exact solution e^{-\alpha^\star \pi^2 t}\sin(\pi x)), sample 50 random interior points, add 1% noise. Now forget \alpha^\star and recover it: train a small PINN u_\theta with a trainable scalar \alpha in the residual, jointly over (\theta, \alpha), following the §7.5 recipe (forward sanity-check first, then add the data term). Report the recovered \alpha, then re-run with 5% noise and with 10 data points and tabulate the degradation. Which item of the “when it doesn’t work” list do you hit first?

💡 Hint

Make α trainable by log-parameterising it (logα, init deliberately wrong, e.g. log(0.02)) and packing it into a ComponentArray alongside the net params — Zygote differentiates through both. Kill BC+IC with the hard ansatz u_θ(x,t) = sin(πx)·(1 + t·N_θ(x,t)), so the loss is just residual + data (weight data ~100×). Important: take the residual’s input derivatives by CENTRAL DIFFERENCES, not ForwardDiff — Zygote can’t track a parameter gradient through ForwardDiff.derivative of a parameter-closure and silently drops it (the §5.3 trap), so uxx = (u_θ(x+h,t) - 2u_θ(x,t) + u_θ(x-h,t))/h^2. Follow the §7.5 recipe: forward sanity-check before adding the data term. The truth u(x,t) = exp(-α*π²t)sin(πx) gives you data directly.

Go to solution →

7.6 The PINN software ecosystem

A practitioner reading the 2026 PINN literature will hit at least three software stacks. The subsections below lay out the Julia, Python, and commercial offerings so the reader can place any PINN paper or product on the map — and, because the workshop is Julia-first, a dedicated walkthrough dissects the actual NeuralPDE / ModelingToolkit building blocks (@parameters, @variables, Differential, PDESystem, …) that the §7.1–§7.2 code used without ceremony. All claims are sourced from canonical project pages or release notes as of mid-2026.

Julia: the SciML stack

The workshop standardises on the SciML stack — the same one Zubov et al. 2021 describe in the NeuralPDE.jl companion paper. The pieces every PINN workflow touches:

Package	Role
`NeuralPDE.jl`	Symbolic PINN solver — takes a `PDESystem` and builds the physics-informed loss automatically.
`Lux.jl`	Explicit-parameter neural-network framework; the recommended NN backend inside `NeuralPDE`.
`ModelingToolkit.jl`	Symbolic CAS — defines `PDESystem`, the canonical PDE input.
`MethodOfLines.jl`	Finite-difference discretiser of `PDESystem`s — useful for FD reference solutions to benchmark a PINN against.
`OrdinaryDiffEq.jl`	Tsit5, Rodas5P, QNDF, … — used inside Neural-ODE / UDE training and to generate ground-truth trajectories.
`Optimization.jl`	Unified wrapper over Optim, Optimisers, NLopt, … — what `NeuralPDE` drives for Adam / L-BFGS training.
`SciMLSensitivity.jl`	Forward / adjoint backends (`InterpolatingAdjoint`, `BacksolveAdjoint`, `EnzymeVJP`) for gradients through ODE/PDE solves.
`Zygote.jl` / `ForwardDiff.jl` / `Enzyme.jl`	Reverse-mode, dual-number forward, and LLVM-level AD. Compose as Unit 5 §5.2 describes; Enzyme is increasingly the default in 2025–26.
`NeuralOperators.jl`	DeepONet, FNO, Markov NO, NOMAD — pairs with `NeuralPDE` for PINO-style losses.
`KolmogorovArnold.jl`	Lux-compatible KAN layers (see Unit 2 §2.9).

A consolidation note worth knowing: the older Flux-based stack (DiffEqFlux, FluxNeuralOperators) is being superseded by the Lux + Reactant + Enzyme stack across SciML; new projects should start with the Lux-based stack.

Anatomy of a `NeuralPDE` program

The package table says which libraries you load. This subsection says how they fit together — the symbolic pipeline that turns a PDE written on paper into a trained network. Every code block in §7.1–§7.2 is an instance of the same six-stage skeleton; learn it once and the rest of the unit reads as variations.

The headline idea — and what distinguishes the Julia stack from a hand-rolled PINN like the one Unit 5 §5.3 builds — is that you never write the loss function. You declare the PDE symbolically with ModelingToolkit, and NeuralPDE differentiates that symbolic expression to assemble the residual, IC, and BC loss terms automatically. The vocabulary you need is small:

Building block	Package	What it does
`@parameters x t`	ModelingToolkit	Declares the independent variables — the PDE’s coordinates (space, time). These become the inputs of the network.
`@variables u(..)`	ModelingToolkit	Declares the dependent variable(s) — the unknown field(s) being solved for. The `(..)` means “a function whose arguments I’ll supply later” (e.g. `u(x, t)`); these become the outputs.
`Differential(t)`	ModelingToolkit	A symbolic derivative operator. `Dₜ = Differential(t)` gives `Dₜ(u(x,t))` = \partial_t u. Compose with `∘` for higher order: `Differential(x) ∘ Differential(x)` is \partial_{xx}.
`~`	Symbolics	Symbolic equality — builds an equation, not an assignment. `Dₜ(u) ~ α*Dₓₓ(u)` is the heat equation as a data structure the solver can read.
`x ∈ ClosedInterval(0, 1)`	DomainSets	The geometric domain each independent variable ranges over. One entry per coordinate.
`PDESystem(eqs, bcs, domains, ivs, dvs)`	ModelingToolkit	Bundles the whole symbolic problem — equation(s), boundary/initial conditions (also written with `~`), domains, and the lists of independent (`ivs`) and dependent (`dvs`) variables — into one object. `@named` tags it for readable error messages.
`Lux.Chain(Dense(…), …)`	Lux	The trial function u_\theta. Input width = number of `@parameters`; output width = number of `@variables`. For a system, pass a vector of chains, one per dependent variable.
`GridTraining(dx)`	NeuralPDE	A training strategy — where the residual is sampled. `GridTraining` lays a fixed lattice with spacing `dx`; the alternatives (below) resample instead.
`PhysicsInformedNN(chain, strategy)`	NeuralPDE	Bundles network + sampling strategy and is the object that builds the physics-informed loss by autodiff when discretized.
`discretize(pdesys, disc)`	NeuralPDE	Compiles the symbolic `PDESystem` + the `PhysicsInformedNN` into a concrete `OptimizationProblem` — a plain loss-minimisation problem with no PDE symbols left in it.
`Optimization.solve(prob, opt; maxiters)`	Optimization	Minimises that loss. The usual idiom is a few hundred steps of `Adam()` to get into the basin, then `LBFGS()` to polish. (The short §7.1/§7.2 demos call `LBFGS()` directly, for brevity.)
`disc.phi`, `res.u`	NeuralPDE	After solving, `res.u` is the optimal parameter vector and `disc.phi([x,t], res.u)` evaluates the trained solution at any coordinate. For a system, `phi` is a vector — `phi[1]`, `phi[2]`, ….

Read against the heat-equation block of §7.1, the six stages are:

Declare the symbols — @parameters x t, @variables u(..), the Differentials.
Write the physics — the equation and the domains/bcs, all in ~ form.
Assemble — @named pde_system = PDESystem(…).
Choose the ansatz — a Lux.Chain of the right input/output width.
Discretize — PhysicsInformedNN(chain, strategy) then discretize(…) → an OptimizationProblem.
Train — Optimization.solve(…), then query disc.phi.

Which training strategy? This is the one knob with real performance consequences, and it connects straight to the GPU story in §7.6 ▸ Why a GPU changes which PINNs you can afford:

GridTraining(dx) — a fixed lattice. Cheap, deterministic, fine for the low-dimensional textbook problems of §7.1; scales badly because the point count is \mathcal{O}(\text{res}^{d}).
StochasticTraining(n) / QuasiRandomTraining(n) — draw n fresh (pseudo- or low-discrepancy) collocation points each iteration. This is the resampling that the §7.3 fixes (adaptive sampling, curriculum schedules) build on, and the workload the GPU benchmark feeds.
QuadratureTraining() — integrate the residual with an adaptive quadrature rule rather than a Monte-Carlo average; most accurate per point, best in low dimension.
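The difference between these sampling styles can be illustrated without NeuralPDE at all. The sketch below is plain Julia under stated assumptions: the Halton sequence stands in for "a low-discrepancy set" (it is not claimed to be what `QuasiRandomTraining` uses internally), and `cell_imbalance` is a made-up, crude uniformity measure.

```julia
using Random

# Grid: the point count grows like res^d, which is what limits lattice
# sampling in higher dimension.
grid_count(res, d) = res^d
for d in 1:4
    println("d = $d, 21 points per axis: ", grid_count(21, d), " collocation points")
end

# Pseudo-random: n fresh uniform points in the unit square (2 × n matrix).
uniform_points(rng, n) = rand(rng, 2, n)

# Low-discrepancy: a Halton sequence (radical inverse in bases 2 and 3).
function radical_inverse(i::Int, base::Int)
    x, f = 0.0, 1.0 / base
    while i > 0
        x += f * (i % base)
        i ÷= base
        f /= base
    end
    return x
end
halton_points(n) = [radical_inverse(i, b) for b in [2, 3], i in 1:n]   # 2 × n

# Crude uniformity check: worst deviation of per-cell counts from the ideal
# on a k × k partition of the unit square (smaller means more even).
function cell_imbalance(P; k = 4)
    counts = zeros(Int, k, k)
    for j in axes(P, 2)
        a = min(floor(Int, k * P[1, j]) + 1, k)
        b = min(floor(Int, k * P[2, j]) + 1, k)
        counts[a, b] += 1
    end
    return maximum(abs.(counts .- size(P, 2) / k^2))
end

n = 1024
println("uniform imbalance: ", cell_imbalance(uniform_points(Xoshiro(0), n)))
println("Halton  imbalance: ", cell_imbalance(halton_points(n)))
```

The point of the comparison is only that low-discrepancy points cover the domain more evenly than independent draws at the same n; which strategy trains best is problem-dependent.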

Extending the skeleton to inverse problems. Workflows W2/W3 (§7.4) reuse all six stages and bolt on three hooks:

For an unknown scalar (W2 — e.g. recover the diffusivity \alpha in ex-7-5): add it to the system’s parameter list with a starting guess and pass param_estim = true to PhysicsInformedNN, so the optimiser tunes \alpha jointly with \theta.
For the data-misfit term (W2 and W3): pass an additional_loss callback to PhysicsInformedNN that evaluates phi at the sensor coordinates and returns the misfit scalar — this is the \mathcal{L}_{\text{data}} of §7.2 added on top of the auto-built residual loss.
For an unknown function (W3 — e.g. the capstone’s \tau(t)): give it its own @variables and a second Lux.Chain, trained jointly.

And adaptive_loss = GradientScaleAdaptiveLoss(…) (or MiniMaxAdaptiveLoss) is the built-in switch for the §7.3 adaptive loss-weighting — so the whole §7.3 toolkit is reachable through keyword arguments to the same PhysicsInformedNN constructor.
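To give a feel for what a gradient-scale weighting does, here is a sketch of one gradient-balancing rule from the literature (the learning-rate-annealing idea: set each weight to the ratio of the largest residual-gradient entry to the mean absolute gradient of that term, smoothed by a moving average). It is an assumption that this matches the spirit of the built-in option; it is not NeuralPDE's internal code, and the function names and synthetic gradients are hypothetical.

```julia
# One gradient-balancing update. `g_res` is the gradient of the residual loss
# and `g_terms` the gradients of the other loss terms, all as flat vectors.
mean_abs(v) = sum(abs, v) / length(v)

function update_weights(λ, g_res, g_terms; α = 0.1)
    target = maximum(abs, g_res)
    return [(1 - α) * λi + α * target / mean_abs(gi) for (λi, gi) in zip(λ, g_terms)]
end

# Repeat the update with fixed gradients to see where the weights settle.
function run_updates(λ, g_res, g_terms; steps = 200, α = 0.1)
    for _ in 1:steps
        λ = update_weights(λ, g_res, g_terms; α = α)
    end
    return λ
end

# Synthetic gradients: the boundary term's gradient is about 100× weaker than
# the residual's, while the data term's is comparable.
g_res  = [0.8, -1.0, 0.5, 0.2]
g_bc   = [0.01, -0.005, 0.008, 0.002]
g_data = [0.5, 0.9, -0.7, 0.3]

λ = run_updates([1.0, 1.0], g_res, [g_bc, g_data])
println("weights after 200 updates (bc, data): ", λ)
```

The weak boundary term ends up with a large weight and the comparable data term with a weight of order one, which is the rebalancing the adaptive options automate during real training, where the gradients change every step.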

Python: DeepXDE, PhysicsNeMo, JAX, PyTorch

The Python side is more fragmented because there’s no single maintainer. The pieces:

Package	Role
`DeepXDE`	Reference Python PINN library (Lu Lu, originally Karniadakis group). Multi-backend (TF, PyTorch, JAX, PaddlePaddle); forward + inverse ODE/PDE/IDE with CSG geometries.
`NVIDIA PhysicsNeMo`	PyTorch-based physics-ML framework (formerly Modulus). Covers PINNs, FNOs, MeshGraphNets, diffusion surrogates.
`PhysicsNeMo-Sym`	Symbolic PDE/BC layer (formerly Modulus-Sym) — closest Python analogue to `NeuralPDE.jl`.
`PINA`	PyTorch-Lightning-based PINN + neural operator + PINO library (mathLab @ SISSA, Rozza group).
`Diffrax`	JAX ODE/SDE/CDE solvers (Patrick Kidger) — the JAX counterpart to `OrdinaryDiffEq.jl`.
`Equinox`	Callable-PyTree NN layers in JAX; the foundation of `Diffrax` and most JAX PINN code.
`Flax (NNX)`	Google’s JAX NN library; NNX (2024) is the current object-oriented API.
`jinns`	JAX-native PINN library (PINNs, SPINNs, HyperPINNs, adaptive weights) on Equinox + Optax.
`torchdiffeq` / `torchdyn`	Differentiable ODE solvers in PyTorch; standard for neural-ODE and Burgers-class PINN demos.
`PySINDy`	Sparse identification of nonlinear dynamics; the discovery-side companion to PINNs (Unit 3 §3.4).
`pykan`	Reference KAN implementation accompanying Liu et al. 2024.

Commercial and enterprise offerings

The line between “PINN” and “AI surrogate trained on simulation data” is blurry in vendor marketing. Be precise: only the items in the first column below train against a PDE residual; the items in the second column are physics-aware data surrogates (still useful, still physics-informed in a loose sense, but not PINNs).

Residual-loss PINN-aware	AI surrogate on simulation data
NVIDIA PhysicsNeMo (commercial framing of the open-source core)	Ansys SimAI (physics-agnostic surrogate, FEM-trained)
Altair PhysicsAI (geometric DL + surrogate; explicit-dynamics speedups)	COMSOL 6.2+ surrogate nodes (DNN, GP, polynomial-chaos)
SimScale Physics AI (built on PhysicsNeMo; centrifugal-pump foundation model GA March 2025)	Siemens Simcenter AI assistants (post-processing + generative design)
Pasteur Labs SI Platform (pre-GA late 2025; differentiable simulators + AI; defence/energy)	Wolfram (PINN as community notebooks, not branded product)

The Ansys × NVIDIA SeaScape integration (November 2024) is the hybrid case — Ansys’ semiconductor signoff toolchain (RedHawk-SC, Totem-SC, PathFinder-SC) now embeds PhysicsNeMo for power-integrity surrogates.

✏️ Section exercise — pick the package

For each job below, name the one package from the §7.6 tables you’d reach for first (Julia or Python — your call), and the runner-up from the other ecosystem:

Generate a trusted FD reference solution for the capstone column PDE, starting from a symbolic PDESystem.
Train a PINN for a PDE specified symbolically, in Julia.
Backprop through an ODE solve in a JAX codebase.
Recover a sparse symbolic equation from trajectory data.
A PyTorch team wants industrial-grade PINN + neural-operator tooling with vendor support and GPU recipes.
Swap a KAN into an existing Lux-based PINN.

💡 Hint

Rows 1–2 reward the symbolic Julia stack (ModelingToolkit-based); row 3 is the Kidger JAX ecosystem; row 4 is the Brunton group’s; row 5 is NVIDIA’s; row 6 needs the one Lux-compatible KAN package in the §7.6 table. The interesting part is which rows have no good answer in the other ecosystem.

Go to solution →

Why a GPU changes which PINNs you can afford

Everything in §7.2–§7.3 — causal training, adaptive resampling, curriculum schedules, high-frequency problems — ultimately spends the same currency: collocation points. Each residual point is one forward pass through the network (plus a few more for the derivative stencil) and one reverse-mode pass for the parameter gradient. So the practical ceiling on a PINN is blunt: how many collocation points can you push through a training step per second? That is a batched-matmul throughput question — exactly the workload a GPU rewrites.

The script below measures one full training step of a Poisson PINN (\Delta u = f, with the Laplacian taken by the finite-difference-in-input stencil from §5.3 and the parameter gradient by Zygote — the same Lux path used throughout the course), sweeping the collocation batch from 10^3 to 5\times10^5 on the CPU and the GPU:

units/unit_07/scripts/pinn_throughput_gpu.jl

# ===========================================================================
# Unit 7 — why a GPU changes what PINNs you can afford: collocation throughput.
#
# Section 7.2/7.3 make the case that serious PINNs need *many* collocation points
# — causal training, adaptive resampling, and high-frequency problems all pile on
# residual points, and each point is a forward+backward pass through the network.
# The practical question is then: how many collocation points can you evaluate per
# second? That is precisely a batched-matmul throughput question, and it is where
# the GPU rewrites the budget.
#
# This measures the cost of ONE training step of a Poisson PINN
#   residual:  Δu_θ(x,y) − f(x,y),    f chosen so u* = sin(πx)sin(πy)
# where the Laplacian is taken by a cheap finite-difference stencil in the input
# (five batched forward passes), and the parameter gradient by reverse-mode AD —
# the same Lux + Zygote path used throughout the course. We sweep the collocation
# batch size N and report points/second on CPU vs GPU.
#
# Run on the GPU hub (the @pinn env has Lux + LuxCUDA + CUDA):
#   julia --project=@pinn units/unit_07/scripts/pinn_throughput_gpu.jl
# Nothing here runs during `quarto render` — the .qmd shows it `eval: false`.
# ===========================================================================

using Lux, LuxCUDA, CUDA, Optimisers, Zygote, Random, Printf, Statistics

make_model() = Chain(Dense(2 => 128, tanh), Dense(128 => 128, tanh),
                     Dense(128 => 128, tanh), Dense(128 => 1))

const H = 1f-3                          # FD step for the input-space Laplacian
fsource(x, y) = -2f0 * Float32(π)^2 .* sinpi.(x) .* sinpi.(y)   # so u* = sin(πx)sin(πy); Float32(π) keeps f in Float32 (π^2 alone promotes to Float64)

# One full training step over a batch of N collocation points; returns seconds.
function step_time(N, dev; nrep = 3)
    model = make_model()
    ps, st = Lux.setup(Xoshiro(0), model)
    ps = ps |> dev; st = st |> dev
    opt = Optimisers.setup(Adam(1f-3), ps)
    P = (rand(Float32, 2, N)) |> dev          # collocation points in (0,1)²
    ex = reshape(P[1, :], 1, N); ey = reshape(P[2, :], 1, N)
    f  = fsource(ex, ey)

    u(p, X) = first(model(X, p, st))
    function lap_residual(p)
        c = u(p, P)
        xp = u(p, vcat(ex .+ H, ey)); xm = u(p, vcat(ex .- H, ey))
        yp = u(p, vcat(ex, ey .+ H)); ym = u(p, vcat(ex, ey .- H))
        Δu = (xp .+ xm .+ yp .+ ym .- 4f0 .* c) ./ H^2
        return mean(abs2, Δu .- f)
    end

    g = Zygote.gradient(lap_residual, ps)[1]; opt, ps = Optimisers.update(opt, ps, g)  # warmup
    dev === identity || CUDA.synchronize()
    t0 = time()
    for _ in 1:nrep
        g = Zygote.gradient(lap_residual, ps)[1]
        opt, ps = Optimisers.update(opt, ps, g)
    end
    dev === identity || CUDA.synchronize()
    return (time() - t0) / nrep
end

println("="^66)
println("Unit 7 — Poisson-PINN training-step throughput: CPU vs GPU")
println("="^66)
have_gpu = CUDA.functional()
@printf("GPU available: %s%s\n", have_gpu, have_gpu ? "  ($(CUDA.name(CUDA.device())))" : "")
println("net 2-128-128-128-1; one step = 5 batched forward passes + reverse-mode AD\n")

@printf("%-12s %12s %12s %10s %14s %14s\n",
        "N (colloc)", "CPU (ms)", "GPU (ms)", "speedup", "CPU pts/s", "GPU pts/s")
println("-"^78)
Ns = [1_000, 10_000, 100_000, 500_000]
cpu_tp = Float64[]; gpu_tp = Float64[]
for N in Ns
    tc = step_time(N, identity)
    tg = have_gpu ? step_time(N, gpu_device()) : NaN
    spd = have_gpu ? tc / tg : NaN
    push!(cpu_tp, N / tc); have_gpu && push!(gpu_tp, N / tg)
    @printf("%-12d %12.1f %12.1f %9.1fx %14.2e %14.2e\n",
            N, 1e3*tc, 1e3*tg, spd, N/tc, have_gpu ? N/tg : NaN)
end

if have_gpu
    @printf("\nAt N=%d the GPU evaluates %.1fx more collocation points per second;\n", Ns[end], gpu_tp[end] / cpu_tp[end])
    @printf("the GPU throughput peaks at N=%d (%.1fx the CPU there).\n", Ns[argmax(gpu_tp)], maximum(gpu_tp) / cpu_tp[argmax(gpu_tp)])
    try
        using CairoMakie
        f = Figure(size = (560, 400))
        ax = Axis(f[1,1], title = "PINN collocation throughput (higher is better)",
                  xlabel = "collocation points N", ylabel = "points / second",
                  xscale = log10, yscale = log10)
        lines!(ax, Ns, cpu_tp, label = "CPU", linewidth = 3)
        scatter!(ax, Ns, cpu_tp, markersize = 9)
        lines!(ax, Ns, gpu_tp, label = "GPU", linewidth = 3)
        scatter!(ax, Ns, gpu_tp, markersize = 9)
        axislegend(ax, position = :rb)
        figdir = get(ENV, "GPU_FIG_DIR", joinpath(@__DIR__, "..", "figures"))
        isdir(figdir) || mkpath(figdir)
        save(joinpath(figdir, "pinn_throughput_gpu.png"), f)
        println("wrote figures/pinn_throughput_gpu.png")
    catch e
        println("(figure skipped: ", e, ")")
    end
end

Captured on the workshop GPU hub (NVIDIA A10G):

==================================================================
Unit 7 — Poisson-PINN training-step throughput: CPU vs GPU
==================================================================
GPU available: true  (NVIDIA A10G)
net 2-128-128-128-1; one step = 5 batched forward passes + reverse-mode AD

N (colloc)       CPU (ms)     GPU (ms)    speedup      CPU pts/s      GPU pts/s
------------------------------------------------------------------------------
1000                 31.2          6.3       5.0x       3.20e+04       1.59e+05
10000               412.1          6.9      59.3x       2.43e+04       1.44e+06
100000             2563.1         29.7      86.2x       3.90e+04       3.36e+06
500000             9826.9        141.6      69.4x       5.09e+04       3.53e+06

At N=500000 the GPU evaluates 69.4x more collocation points per second;
the GPU throughput peaks at N=500000 (69.4x the CPU there).

At a thousand points the GPU is only ~5× ahead: the batch is too small to hide kernel-launch latency. But the CPU’s throughput is roughly flat (it is already saturated at N=1000), while the GPU keeps filling: by N=10^4 it is ~60× faster per step, and across the useful range it sustains 1.4–3.5 million collocation points per second against the CPU’s ~25,000–50,000. That is the difference between a PINN you resample aggressively every few hundred iterations and one you cannot afford to. It is the same lever Unit 6’s heat solver pulled (§6.4), a perfectly parallel, matmul-bound workload, applied to the residual loss instead of a stencil step.
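The throughput and speedup columns follow directly from the step times, and the same arithmetic turns them into a training budget. The sketch below copies the millisecond timings from the captured table; the 10^8-point budget is an arbitrary illustrative figure, and because the inputs are the rounded printed values, the recomputed speedups can differ from the table in the last digit.

```julia
# Step times in milliseconds, copied from the captured table.
Ns     = [1_000, 10_000, 100_000, 500_000]
cpu_ms = [31.2, 412.1, 2563.1, 9826.9]
gpu_ms = [6.3, 6.9, 29.7, 141.6]

cpu_pts = Ns ./ (cpu_ms ./ 1e3)      # collocation points per second
gpu_pts = Ns ./ (gpu_ms ./ 1e3)
speedup = cpu_ms ./ gpu_ms

for (N, c, g, s) in zip(Ns, cpu_pts, gpu_pts, speedup)
    println(rpad(N, 8), " CPU ", round(Int, c), " pts/s   GPU ", round(Int, g),
            " pts/s   speedup ", round(s; digits = 1), "x")
end

# Wall time to push 10^8 collocation-point evaluations through training at the
# N = 10^5 batch size (the budget is chosen purely for illustration).
budget = 1e8
println("CPU: ", round(budget / cpu_pts[3] / 60; digits = 1), " min")
println("GPU: ", round(budget / gpu_pts[3]; digits = 1), " s")
```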

PINN collocation throughput, CPU vs GPU (log–log). The CPU stays at or below roughly 5\times10^4 points/s; the GPU climbs to ~3.5\times10^6. At N=10^5, a step that costs about 2.6 s on the CPU runs in about 30 ms on the GPU (~86×).

7.7 Where PINNs have shown success — a survey of the literature

The published-results landscape is uneven: some domains have flagship PINN successes, others have almost no PINN penetration despite headline ML activity. This section names the keystone papers in seven application areas, with the limitations the original authors acknowledged. The three surveys worth keeping next to a screen (Karniadakis et al. 2021, Cuomo et al. 2022, Toscano et al. 2025) cover the same territory in much more depth.

Fluid mechanics and hidden physics

The flagship PINN-for-fluids paper is Raissi, Yazdani & Karniadakis (Science 2020) — Hidden Fluid Mechanics. Given only passive-scalar concentration fields (smoke, dye, contrast-agent imaging), the PINN reconstructs the full velocity and pressure fields by enforcing the incompressible Navier–Stokes residual. Demoed on 2-D cylinder wake, 3-D intracranial aneurysm, and synthetic blood-flow visualisations, recovering pressure (never directly measured) to a few-percent error. Limit acknowledged by the authors: long training times, struggles at high Reynolds number, fine-scale turbulence not demonstrated.

Cardiovascular and biomedical

The keystone clinical-data PINN is Kissas et al. 2020 — 4D-flow MRI of a thoracic aorta in, arterial blood-pressure waveforms out, with reduced 1-D blood-flow equations as the physics constraint. Sahli Costabal et al. 2020 solves the Eikonal equation on patient atrial geometries to reconstruct cardiac activation maps for atrial-fibrillation workups, with uncertainty quantified via randomised priors. The EP-PINNs follow-up (Herrero Martin et al., Front. Cardiovasc. Med. 2021) handles full Aliev–Panfilov monodomain dynamics. Limit flagged by all three: small geometries, anisotropy and fibre orientation degrade accuracy.

Subsurface and geophysics

The canonical groundwater reference is Tartakovsky et al. 2020 — Richards-equation flow with both the hydraulic-conductivity field and the unsaturated K(ψ) constitutive relation recovered from sparse pressure-head observations; beats Gaussian-process regression in the data-sparse regime. For seismic full-waveform inversion, Song, Alkhalifah & Waheed 2021 parameterises the scattered Helmholtz wavefield in anisotropic VTI media; their PINNup follow-up (arXiv:2109.14536) uses frequency upscaling + neuron splitting to climb the frequency ladder. Authors note vanilla PINNs fail in highly heterogeneous porous media — a mixed pressure-velocity formulation is required.

Materials and solid mechanics

The keystone is Haghighat et al. 2021 — momentum balance + constitutive relations enforced, Lamé parameters identified in heterogeneous linear elasticity, then extended to von Mises elastoplasticity using a multi-network architecture, recovering parameters to ~1–2% from synthetic displacement fields. For metal additive manufacturing the most commonly cited published example is Liao et al. 2024 (arXiv:2401.02403) — real-time 2D temperature-field prediction with <3% error on thin walls. Authors flag the need for multi-network designs at stress concentrations.

Weather and climate — the operator-vs-PINN distinction

This is the area where students need the sharpest warning. GraphCast (Lam et al., Science 2023), FourCastNet (Pathak et al. 2022, NVIDIA), ClimaX (Nguyen et al., ICML 2023) and the operational NVIDIA Earth-2 systems are operator-learning models (GNNs, transformers, spherical neural operators) trained on ERA5 reanalysis without an explicit PDE-residual loss. They are not PINNs. Genuine PINN-for-atmosphere work exists but is narrower: e.g. atmospheric radiative transfer (Zhao et al. 2025, JQSRT) and sparse-station weather reconstruction with Navier–Stokes regularisation (Vadyala/Betancourt et al. 2024, Open Res. Europe). For workshop-honest framing: cite GraphCast/FourCastNet as the operator-learning success story, not the PINN one. The PINN niche in this area is data-sparse reconstruction at regional scale.

Power systems and engineering

The line was opened by Misyris, Venzke & Chatzivasileiadis 2020 — swing-equation PINN that learns rotor-angle / frequency dynamics from far fewer trajectories than a pure data-driven RNN, with the same network used for inverse identification of damping and inertia. Stiasny et al. 2021 (arXiv:2004.04026) and the transient-stability follow-up (arXiv:2106.13638) accelerate IEEE-benchmark simulations by orders of magnitude. The most recent, Nellikkath/Stiasny et al. 2024 (arXiv:2404.13325), integrates trained PINN components plug-and-play into a conventional time-domain solver. Limit: scaling to large networks is still open; current demos are a handful of buses.

Quantum and molecular

The canonical demonstration is the nonlinear Schrödinger equation worked example in Raissi, Perdikaris & Karniadakis 2019. Follow-ups cover the time-dependent linear Schrödinger equation (Shah et al., arXiv:2210.12522) and quantum-spectrum eigenvalue problems (Brevi et al. 2024 tutorial, arXiv:2407.20669). This is largely a method-development literature — the chemistry/condensed-matter community has adopted neural wavefunctions (FermiNet, PauliNet) rather than residual-loss PINNs. Authors note PINNs struggle with high-dimensional many-body problems and lose orthogonality of eigenstates.

What’s not in this list

A useful honesty check: areas where PINNs are conspicuously absent from the success stories include large-scale turbulent CFD, multiphase flow, plasma physics for fusion, and high-Reynolds aerodynamics. Operator-learning approaches now dominate those areas; whether PINNs return to them depends on whether the §7.3 fixes scale further than they currently do.

✏️ Section exercise — read one keystone paper properly

Pick one of the seven application areas above, open its keystone paper (all are linked), and fill in a five-row claims table: (1) the PDE(s) enforced in the residual; (2) what was measured vs what was reconstructed; (3) which of the W1–W4 workflows it is; (4) the limitation the authors themselves acknowledge; (5) one modern fix from §7.3 that the paper pre-dates or omits, and whether it would plausibly help. Finally, the calibration question: does the paper’s headline claim survive its own limitations section?

💡 Hint

Read in this order: abstract (claim), figures (what’s measured vs reconstructed), then the limitations/discussion section before the methods — it calibrates everything else. For row 5, compare the paper’s publication year against the §7.3 fix papers (2020–2022): anything it predates, it couldn’t have used.

Go to solution →

7.8 Commercial and industrial deployments

Where the residual-loss PINN idea has actually shipped. This is a narrower list than §7.7: most production “physics-AI” is operator-learning or surrogate modelling, not strict PINN. The catalogue below is grouped by vendor, and each entry is tagged by maturity: GA (generally available product), field-trial (validated prototypes with named customers), or R&D (credible commercial sponsorship, no deployed product yet).

NVIDIA PhysicsNeMo — the most documented production stack

PhysicsNeMo (formerly Modulus / SimNet) is the only ecosystem with a meaningful corpus of named, multi-year industrial deployments. The case studies on the NVIDIA developer page are listed below with the method each one uses; note that some are GNN or neural-operator surrogates rather than strict residual-loss PINNs:

Siemens Energy — field-trial. PINN for static heat conduction in transformer bushings (<4% error, sub-second inference); GNN surrogate for gas-insulated-switchgear thermals (~10,000× over transient CFD). Bushing models in customer field-evaluation as of 2025. (NVIDIA case study)
Siemens Energy HRSG digital twin — deployed prototype on AWS / A100, predicting corrosion in heat-recovery steam generators across a 600-unit fleet (weeks → hours). (brief)
Siemens Gamesa — active partnership, wind-farm wake modelling, ~4,000× over CFD (weeks → minutes). (NVIDIA blog)
SimScale centrifugal-pump foundation model — GA March 2025 to 600k+ users, ~2,700× over CFD design-point analysis. (SimScale press)
Shell — R&D, paper-backed. Nested Fourier Neural Operator for CO₂ plume migration in CCS site screening, ~10^5× faster than reservoir simulators. (NVIDIA spotlight / paper)
Ansys × NVIDIA SeaScape — integration announced Nov 2024, demoed GTC 2025. PhysicsNeMo embedded into RedHawk-SC / Totem-SC / PathFinder-SC for semiconductor power-integrity signoff. (Ansys press)

Other commercial offerings

Altair PhysicsAI — GA. The most explicitly “PINN-aware” CAE product from a major vendor (geometric DL + surrogate), with a claimed ~1,000× speed-up for explicit dynamics; bumper-impact case study with Cyient. (Altair)
Pasteur Labs SI Platform — pre-GA late 2025. Brooklyn-based public-benefit corp commercialising “Simulation Intelligence” (differentiable simulators + AI). Acquired the Cornell-spinout FOSAI in August 2025, bringing US Space Force / DARPA / commercial aerospace customers with it. (Pasteur Labs)
SandboxAQ ($5.75 B valuation, $95 M raise in July 2025) — markets Large Quantitative Models, which are physics- and chemistry-grounded models rather than strict residual-loss PINNs. Named customers include Vodafone, SoftBank, Mount Sinai, and the US government, for cryptography and drug discovery. (coverage)

Honest disambiguation

Three classes of marketing that look like PINN deployments but aren’t:

Ansys SimAI, COMSOL surrogate nodes, Siemens Simcenter AI — physics-agnostic surrogates trained on FEM/CFD output. Useful, but not PINNs in the residual-loss sense.
NVIDIA Earth-2 / FourCastNet 3, GraphCast — neural operator models. Most production “physics ML” at scale is in this bucket.
“TSMC PINN production deployment”, “Aramco PINN reservoir modelling” — the named industrial partnerships involve PINN research at KAUST / NVIDIA but the production tooling is unverified or covered by NDA. Treat such claims as R&D-with-industrial-sponsor, not deployed product.
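
The distinction in the list above comes down to what the training loss contains. As a minimal sketch, assuming a toy one-parameter candidate for the 1D heat equation of §7.1 (the names `u_model`, `data_loss`, `residual_loss`, `dt_fd` and `dxx_fd` are hypothetical, and central finite differences stand in for a network's autodiff), the two kinds of loss can be written in plain Julia with no packages:

```julia
# Toy sketch: one candidate model, scored by two different kinds of loss.
# Every name below is hypothetical; finite differences stand in for autodiff.

const α = 0.01                          # diffusivity from the §7.1 heat example

# One-parameter candidate u(x, t) = exp(-k t) sin(πx); k plays the role of θ.
u_model(x, t, k) = exp(-k * t) * sin(π * x)

# Central-difference derivatives of a function f(x, t).
dt_fd(f, x, t; h = 1e-4)  = (f(x, t + h) - f(x, t - h)) / (2 * h)
dxx_fd(f, x, t; h = 1e-3) = (f(x + h, t) - 2 * f(x, t) + f(x - h, t)) / h^2

# Surrogate-style loss: pure data misfit, needs labelled targets ys.
function data_loss(k, xs, ts, ys)
    sum((u_model(x, t, k) - y)^2 for (x, t, y) in zip(xs, ts, ys)) / length(ys)
end

# PINN-style loss: PDE residual ∂t u - α ∂xx u at collocation points, no targets.
function residual_loss(k, xs, ts)
    f(x, t) = u_model(x, t, k)
    sum((dt_fd(f, x, t) - α * dxx_fd(f, x, t))^2 for (x, t) in zip(xs, ts)) / length(xs)
end

k_true = α * π^2                        # decay rate that satisfies the PDE exactly

# "Simulation output" covering only early times t ≤ 0.5.
xs_data = collect(range(0.05, 0.95; length = 20))
ts_data = collect(range(0.0, 0.5; length = 20))
ys_data = u_model.(xs_data, ts_data, k_true)

# Collocation points over the whole time range; no labels required.
xs_col = collect(range(0.05, 0.95; length = 20))
ts_col = collect(range(0.1, 1.9; length = 20))

println("data loss at k_true:        ", data_loss(k_true, xs_data, ts_data, ys_data))
println("residual loss at k_true:    ", residual_loss(k_true, xs_col, ts_col))
println("residual loss at 2 k_true:  ", residual_loss(2 * k_true, xs_col, ts_col))
```

In this sketch the residual loss is close to zero only at the decay rate that satisfies the PDE, and it can be evaluated at times where no simulation output exists. The data loss carries no information outside the sampled window, which is the sense in which a surrogate trained on FEM/CFD output is physics-agnostic.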

Bottom line for the workshop

The single most defensible message: most “physics-AI” in production today is either a surrogate trained on FEM/CFD output or a neural operator, not a classic residual-loss PINN. The clearest residual-loss-PINN industrial story is Siemens Energy transformer-bushing thermals on NVIDIA PhysicsNeMo, currently in customer field trials. The §7.3 modern-fix toolkit is what makes that case work; the rest of the field is catching up.

With the toolkit, the literature, and the commercial landscape in hand, the modern PINN is ready for the capstone — Unit 10 walks Workflow 3 end-to-end on the AIMS thermistor column.

✏️ Section exercise — the marketing audit

You receive five vendor claims. Sort each into the section’s three buckets — residual-loss PINN, neural operator, or FEM/CFD-trained surrogate — and state the one question you’d ask the vendor to confirm your classification:

(1) “Our model was trained on 40 000 Fluent simulations and predicts drag for new geometries in 50 ms.”
(2) “The network minimises the Navier–Stokes residual at 2M collocation points alongside the sensor data.”
(3) “A graph neural network trained on 40 years of ERA5 reanalysis produces a 10-day global forecast in under a minute.”
(4) “Physics-informed AI: our surrogate respects conservation of energy because the training data came from an energy-conserving solver.”
(5) “We embed the governing equations as soft constraints during training, so the model needs 100× less simulation data.”

Claim 4 is the trap — explain precisely why “trained on physics-respecting data” is not the same as “physics-informed training”, and what can go wrong out-of-distribution.

💡 Hint

One question sorts everything: what exactly is in the training loss? Data misfit only → surrogate; PDE residual at collocation points → PINN; reanalysis/simulation targets with a learned operator → neural operator. For claim 4, ask what enforces the conservation law on an input unlike the training set.

Go to solution →
