Unit 2: ML Foundations: From Classical to Deep

Published

26/06/2026

A walk from supervised learning and random forests through to neural networks, optimisation, and the autodiff stack that PINNs sit on. This unit follows the framing of Mathematical Engineering of Deep Learning (Liquet, Moka & Nazarathy), chapters 2–6, with code in Julia (DecisionTree.jl for classical ML, Lux.jl for deep networks) and a scikit-learn / PyTorch comparison so the cross-ecosystem story stays in view.

A single unit can’t cover everything — the goal here is shared vocabulary and a feel for which classical ideas survive the journey to PINNs.

2.1 Supervised learning fundamentals

The setup

Given labelled data \{(x_i, y_i)\}_{i=1}^N, learn a function f: \mathbb{R}^n \to \mathcal{Y} such that f(x) \approx y on unseen inputs. The output space \mathcal{Y} is continuous (regression) or a finite set (classification). The art of supervised learning is choosing a function class — linear, tree-based, neural — and a fitting procedure that generalises beyond the training set.

Loss, true risk, empirical risk

A loss \ell(\hat y, y) measures how bad a single prediction is. The true risk is the population mean

R(f) = \mathbb{E}_{(x, y)}[\ell(f(x), y)].

We can’t compute it, so we minimise the empirical risk

\hat R(f) = \frac{1}{N}\sum_{i=1}^N \ell(f(x_i), y_i).

All of supervised learning is a story about how, when, and whether \hat R approximates R. PINNs are no exception — their “data” includes scattered collocation points where the loss is a PDE residual rather than a label, but the ERM picture survives intact (Unit 5).
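One way to get a feel for the gap between \hat R and R is a toy setting where the true risk is known in closed form. The sketch below is an illustration under made-up assumptions: the data-generating line, the noise level, and the helper names draw and emp_risk are not from the course scripts.

```julia
using Random, Statistics

rng = MersenneTwister(42)
σ   = 0.5                                   # assumed noise level

# Assumed data-generating process: y = 2x + ε with ε ~ N(0, σ²)
function draw(n)
    x = rand(rng, n)
    return x, 2 .* x .+ σ .* randn(rng, n)
end

f(x) = 2x                                   # a fixed predictor (here the true line)
loss(ŷ, y) = (ŷ - y)^2                      # squared loss

# Empirical risk: the average loss over a finite sample
emp_risk(x, y) = mean(loss.(f.(x), y))

true_risk = σ^2                             # for this f, R(f) = E[ε²] = σ²
for n in (10, 100, 10_000)
    xs, ys = draw(n)
    println("N = ", n, "   empirical risk = ", round(emp_risk(xs, ys), digits = 3),
            "   true risk = ", true_risk)
end
```

The empirical risk scatters around \sigma^2 for small N and settles onto it as N grows. That convergence is what ERM relies on when f is fixed in advance. Once f is chosen by minimising \hat R on the same sample, the estimate becomes optimistic, which is why the validation and test splits below exist.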

Train, validate, test — and k-fold cross-validation

The standard split: training to fit, validation to pick hyperparameters, test to estimate true risk. Never let the test set influence model choice.

When data is scarce, hold out one validation set at a time and rotate. That’s k-fold cross-validation:

k-fold cross-validation. Each row trains on the unshaded blocks and validates on the shaded one; cross-validated risk averages the k held-out scores. (Figure from *Mathematical Engineering of Deep Learning*, §2.5.)
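The rotation in the figure can be sketched in a few lines of plain Julia. This is a minimal illustration, not the course’s implementation: kfold_indices, cv_risk, and the mean-predictor toy model are hypothetical names chosen for the example, and squared loss is assumed.

```julia
using Random, Statistics

# Split the indices 1:N into k shuffled folds whose sizes differ by at most one.
function kfold_indices(rng, N, k)
    perm = randperm(rng, N)
    return [perm[i:k:N] for i in 1:k]
end

# Cross-validated risk of a (fit, predict) pair under squared loss.
function cv_risk(fit, predict, x, y; k = 5, rng = MersenneTwister(0))
    folds = kfold_indices(rng, length(x), k)
    scores = map(folds) do val
        train = setdiff(1:length(x), val)       # everything outside the held-out fold
        model = fit(x[train], y[train])
        mean(abs2, predict(model, x[val]) .- y[val])
    end
    return mean(scores)                         # average of the k held-out scores
end

# Toy usage: a "model" that always predicts the training mean.
x = collect(range(0, 1; length = 20))
y = 3 .* x .+ 1
fit_mean(x, y) = mean(y)
predict_mean(m, x) = fill(m, length(x))
println("5-fold CV risk of the mean predictor: ", cv_risk(fit_mean, predict_mean, x, y))
```

Passing fit and predict as functions keeps the rotation independent of the model, so the same loop can score a polynomial fit, a forest, or a network.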

Bias, variance, and model complexity

Holding N fixed and sweeping the function class from simple to complex, training error decreases monotonically while test error is U-shaped: too simple under-fits (high bias), too complex over-fits (high variance). The minimum of the test-error curve marks the “right” complexity for that dataset size.

Test (red) and training (green) error vs. model complexity. The classical bias–variance trade-off: increasing complexity past the dotted line buys lower training error at the cost of higher test error. (Figure from *MEDL*, §2.7.)

For PINNs the same picture applies: the function class is set by the network architecture, “complexity” is roughly the parameter count, and the test set is replaced by held-out collocation points in the PDE domain.

Ordinary least squares: the `\` operator

Before adding any penalty, recall the plainest fit of all. Julia’s left-division \ solves linear systems: for a square invertible A, A \ b returns the x with Ax = b. For a tall matrix X — more data rows than parameters — there is no exact solution, so X \ y instead returns the ordinary least-squares (OLS) estimate: the \beta minimising \lVert y - X\beta\rVert_2^2, computed stably by QR (no explicit normal equations). Stacking a column of ones for the intercept turns line-fitting into a single call:

units/unit_02/scripts/ols_regression.jl

# 5 noisy points, roughly on the line  y = 2 + 3x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.1, 5.2, 7.8, 11.1, 13.9]

# design matrix: column of 1s (intercept) next to x (slope)
X = [ones(length(x)) x]

# least-squares fit — `\` minimizes ‖y - Xβ‖² for this tall X
β = X \ y

println("intercept = ", round(β[1], digits = 3))   # ≈ 2.12
println("slope     = ", round(β[2], digits = 3))   # ≈ 2.95

The recovered intercept \approx 2.12 and slope \approx 2.95 are close to the y = 2 + 3x the data came from. The textbook closed form is \hat\beta = (X^\top X)^{-1} X^\top y, but \ is what you actually call — and it’s the building block the next section regularises.
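As a quick check that the two routes agree, the sketch below solves the same five-point problem with \ and with the normal equations. It reuses the data from the script above; the variable names β_qr and β_normal are just for this comparison.

```julia
using LinearAlgebra

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.1, 5.2, 7.8, 11.1, 13.9]
X = [ones(length(x)) x]

β_qr     = X \ y                   # least squares via QR
β_normal = (X' * X) \ (X' * y)     # textbook normal equations, for comparison only

println(β_qr ≈ β_normal)           # the two routes agree on this well-conditioned problem
println(cond(X' * X) ≈ cond(X)^2)  # forming XᵀX squares the condition number
```

The second line is the practical reason to prefer \: squaring the condition number is harmless here, but it costs accuracy once the columns of X become nearly collinear.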

Regularisation: penalising complexity

Cross-validation measures over-fitting; regularisation actively fights it. Rather than trimming the function class by hand, keep it flexible but add a penalty \Omega(\theta) on the parameters to the empirical risk, pulling the fit toward simpler solutions:

\min_{\theta}\; \hat R(f_\theta) \;+\; \lambda\,\Omega(\theta), \qquad \lambda \ge 0 .

The regularisation strength \lambda trades data-fit against simplicity: \lambda = 0 is the unpenalised fit, large \lambda forces small parameters. For a linear model f(x) = \beta^\top x under squared loss this is simply least squares plus a penalty. Taking the squared L_2 norm \Omega(\beta) = \lVert\beta\rVert_2^2 gives ridge regression,

\hat\beta_\lambda = \arg\min_{\beta}\; \underbrace{\lVert y - X\beta\rVert_2^2}_{\text{least squares}} + \lambda\,\lVert\beta\rVert_2^2 \;=\; (X^\top X + \lambda I)^{-1} X^\top y ,

which is exactly the ordinary-least-squares normal equations \hat\beta = (X^\top X)^{-1} X^\top y with \lambda I added to the diagonal. Two consequences: \lambda = 0 recovers OLS, and any \lambda > 0 both shrinks the coefficients toward zero and makes X^\top X + \lambda I invertible even when X^\top X is singular or ill-conditioned (MEDL §2.5).
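The closed form follows from setting the gradient of the penalised objective to zero. A short derivation, using only the symbols already defined above:

```latex
\begin{aligned}
J(\beta) &= \lVert y - X\beta\rVert_2^2 + \lambda\,\lVert\beta\rVert_2^2, \\
\nabla_\beta J(\beta) &= -2\,X^\top (y - X\beta) + 2\lambda\,\beta = 0, \\
(X^\top X + \lambda I)\,\hat\beta_\lambda &= X^\top y .
\end{aligned}
```

The eigenvalues of X^\top X are non-negative, and adding \lambda I shifts each of them up by \lambda. For any \lambda > 0 the matrix is therefore positive definite, which is where the guaranteed invertibility comes from.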

A 12-feature problem where only 3 features matter, fit from just 15 points, makes the effect concrete:

units/unit_02/scripts/ridge_regression.jl

using LinearAlgebra, Random, Statistics
Random.seed!(1)

# synthetic linear data: y = Xβ + noise, only 3 of 12 features matter
p      = 12
β_true = vcat([3.0, -2.0, 1.5], zeros(p - 3))
gen(m) = (X = randn(m, p); (X, X * β_true .+ 0.5 .* randn(m)))
Xtr, ytr = gen(15)     # small training set → OLS overfits
Xte, yte = gen(2000)   # large test set → clean risk estimate

rmse(β, X, y) = sqrt(mean(abs2, y .- X * β))

# ordinary least squares:  minimize ‖y − Xβ‖²            (β̂ = X \ y)
β_ols = Xtr \ ytr

# ridge (L2):  minimize ‖y − Xβ‖² + λ‖β‖²
#   closed form  β̂(λ) = (XᵀX + λI)⁻¹ Xᵀy  — OLS normal equations + λI
ridge(X, y, λ) = (X'X + λ * I) \ (X'y)

println("λ        ‖β‖      test RMSE")
for λ in (0.0, 1.0, 10.0, 100.0)
    β = ridge(Xtr, ytr, λ)
    println(rpad(λ, 8), rpad(round(norm(β), digits = 2), 9),
            round(rmse(β, Xte, yte), digits = 2))
end

Sweeping \lambda traces the same U-shape as the bias–variance curve above — now along one continuous knob instead of by swapping models:

\lambda	\lVert\hat\beta\rVert	test RMSE
0 (OLS)	3.89	1.54
1	3.57	\mathbf{1.31}
10	2.51	1.91
100	0.73	3.33

OLS over-fits the 15 noisy points; a little regularisation (\lambda = 1) gives the lowest test error; too much (\lambda = 100) over-shrinks and under-fits. Swapping the L_2 penalty for the L_1 norm \Omega(\beta) = \lVert\beta\rVert_1 gives the LASSO, which drives many coefficients exactly to zero (automatic feature selection) but has no closed form. The same L_2 idea returns later as weight decay in deep networks (§2.8), and in PINNs, where the extra PDE-residual loss terms act as physics-informed regularisers (Unit 5).

✏️ Section exercise — watch the U-shape appear

Make the bias–variance trade-off concrete. Generate 60 noisy samples of y = \sin(2\pi x) + \varepsilon, \varepsilon \sim \mathcal{N}(0, 0.2^2), on x \in [0, 1]. Fit polynomials of degree 1, 2, \ldots, 15 by least squares, and for each degree compute (a) the training RMSE and (b) the 5-fold cross-validated RMSE — implementing the fold rotation yourself, no library. Plot both curves against degree. Where is the minimum of the CV curve, and what happens to the two curves past it?

💡 Hint

Design matrix: vander(x, d) = hcat((x .^ p for p in 0:d)...); fit each degree with c = vander(x, d) \ y and score with rmse(a, b) = sqrt(mean(abs2, a .- b)). For the 5 interleaved folds use folds = [i:5:N for i in 1:5]; per fold build a boolean test, refit on vander(x[.!test], d) \ y[.!test], and score on x[test]. Loop for d in 1:15, plot both RMSE curves (try yscale = :log10), and expect numerical warnings past degree ~12 — that ill-conditioning is part of the story.

Go to solution →

2.2 Random forests on MNIST: a cross-language comparison

Before deep learning takes over, one quick stop on the non-deep-learning side of the world — and a first look at MNIST, the dataset we’ll keep using as a running example. The next few subsections cover what a random forest is, a first look at the digits, and the same MNIST problem in Julia and in Python, so the cross-ecosystem template is established up front.

Decision trees, ensembled

A decision tree partitions the input space by axis-aligned splits, with a constant prediction at each leaf. Greedy fitting picks the split that reduces a loss criterion most at each node — variance for regression, Gini or entropy for classification. Trees are interpretable but high-variance: small data perturbations produce wildly different trees.
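The greedy step at a single node can be sketched for one feature and the Gini criterion. This is an illustrative toy, not how DecisionTree.jl is implemented internally; the helper names gini and best_split and the six-point dataset are assumptions made for the example.

```julia
# Gini impurity of a vector of class labels: 1 − Σ_k p_k²
function gini(labels)
    isempty(labels) && return 0.0
    n = length(labels)
    return 1.0 - sum((count(==(c), labels) / n)^2 for c in unique(labels))
end

# Best axis-aligned threshold on one feature: minimise the size-weighted
# impurity of the two children over all candidate thresholds.
function best_split(x, labels)
    best_t, best_score = NaN, Inf
    xs = sort(unique(x))
    for i in 1:length(xs)-1
        t = (xs[i] + xs[i+1]) / 2                 # midpoint between neighbouring values
        left, right = labels[x .<= t], labels[x .> t]
        score = (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
        if score < best_score
            best_t, best_score = t, score
        end
    end
    return best_t, best_score
end

x      = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
t, score = best_split(x, labels)
println("threshold = ", t, ", weighted Gini after split = ", score)
```

A full tree repeats this search over every feature at every node and recurses into the two children. Moving one training point can change which threshold wins near the root, and everything below it changes too, which is the source of the high variance.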

A random forest mitigates the variance by averaging many trees trained on bootstrap resamples of the data, with each split considering only a random subset of features. The variance falls and generalisation improves, at the cost of interpretability. Random forests are still a serious baseline on tabular or flat image problems, and often the right tool when you have hundreds of thousands of rows of modestly sized feature vectors, hand-designed or otherwise.
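Why the random feature subset matters can be seen from a standard identity: the average of B identically distributed errors with variance \sigma^2 and pairwise correlation \rho has variance \rho\sigma^2 + (1-\rho)\sigma^2/B. The simulation below checks it numerically. It is a sketch under stated assumptions: the "tree errors" are synthetic Gaussians built from a shared component, not real trees, and avg_variance and theory are hypothetical helper names.

```julia
using Random, Statistics

# Variance of the average of B unit-variance "tree errors" with pairwise correlation ρ.
# Construction (an assumption for illustration): e_b = √ρ·z₀ + √(1−ρ)·z_b with z ~ N(0, 1).
function avg_variance(rng, B, ρ; trials = 20_000)
    avgs = map(1:trials) do _
        shared = randn(rng)                       # component common to every tree
        mean(sqrt(ρ) * shared .+ sqrt(1 - ρ) .* randn(rng, B))
    end
    return var(avgs)
end

theory(B, ρ) = ρ + (1 - ρ) / B                    # ρσ² + (1−ρ)σ²/B with σ² = 1

rng = MersenneTwister(0)
for ρ in (0.0, 0.3, 0.9)
    println("ρ = ", ρ, "   simulated = ", round(avg_variance(rng, 100, ρ), digits = 3),
            "   theory = ", round(theory(100, ρ), digits = 3))
end
```

Adding trees only shrinks the (1-\rho)/B term; the \rho floor remains however many trees are averaged. Restricting each split to a random feature subset lowers \rho, which is the lever the forest pulls.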

The MNIST benchmark

MNIST is 70 000 grayscale images of handwritten digits 0–9, each 28 \times 28 pixels (so 784 features when flattened to a vector). 60 000 are reserved for training, 10 000 for test. It is the canonical “easy classification benchmark” in deep learning — small enough to fit in memory, hard enough that a naive baseline isn’t perfect, and used in Mathematical Engineering of Deep Learning (Chapters 5–6) for exactly the kind of side-by-side comparison we’re about to do.

A first look at the digits

Before training anything, it’s worth looking at the data. Each MNIST sample is a 28 \times 28 grayscale image; flattened to a length-784 vector of pixel intensities, that raw vector is all every model in this unit — forest, softmax, MLP — ever sees (only the convolutional network in §2.8 keeps the 2-D grid). Load the training split and show the first 40 digits with their labels:

using MLDatasets, Plots

# train_x is 28×28×60000 with pixel values already in [0,1]; train_y are labels 0–9
train_x, train_y = MLDatasets.MNIST(split = :train)[:]

panels = map(1:40) do i
    heatmap(transpose(train_x[:, :, i]);          # transpose + yflip ⇒ upright digit
            yflip = true, color = :grays, aspect_ratio = 1,
            axis = false, ticks = false, colorbar = false, legend = false,
            title = string(train_y[i]), titlefontsize = 7)
end
plot(panels...; layout = (5, 8), size = (760, 500),
     plot_title = "MNIST — first 40 training digits")   # in a notebook this renders inline

The first 40 MNIST training digits, with their labels. Each is a 28\times28 grayscale image — 784 raw pixel values, which is all a *flat* classifier ever sees.

Setup note. MLDatasets.jl — the dataset loader used here and by the forest below — isn’t in the workshop Project.toml by default (we keep the PINN deps minimal). Add it once with Pkg.add("MLDatasets") from the repo root. This re-resolves your personal environment and precompiles MLDatasets’ own dependency closure — a few minutes, one-off, cached afterwards — so do it at a break, not mid-exercise. It installs into your home environment and doesn’t disturb the shared PINN stack, so using NeuralPDE stays fast. Why this is a separate step: The @pinn Julia environment.

MNIST random forest in Julia (`DecisionTree.jl`)

A 100-tree forest on flattened MNIST takes a couple of minutes to fit and lands around 97% test accuracy: a strong baseline for a model that never sees the 2-D layout of the image, although convolutional networks such as LeNet were already past 99% by the late 1990s. We call DecisionTree.jl (already in the workshop Project.toml) directly, with MLDatasets.jl for the MNIST loader. For broader classical-ML work, Julia’s umbrella framework is MLJ.jl, a unified, scikit-learn-style interface over dozens of models, but we use DecisionTree.jl on its own here to keep the dependencies light:

units/unit_02/scripts/mnist_rf_julia.jl

using MLDatasets, DecisionTree, Statistics

train_x, train_y = MLDatasets.MNIST(split = :train)[:]
test_x,  test_y  = MLDatasets.MNIST(split = :test)[:]

# (28, 28, N) → (N, 784); MLDatasets already returns pixels in [0, 1]
flatten(x) = reshape(Float32.(x), 28 * 28, size(x, 3))'
X_train, X_test = flatten(train_x), flatten(test_x)
y_train, y_test = Int.(train_y), Int.(test_y)

# Native DecisionTree.jl API
forest = build_forest(
    y_train, X_train,
    28,           # n_subfeatures (= √784)
    100,          # n_trees
    0.7,          # partial_sampling
    -1,           # max_depth
    1, 2, 0.0;    # min_samples_leaf, min_samples_split, min_purity_increase
    rng = 0,
)
ŷ = apply_forest(forest, X_test)
@info "MNIST test accuracy (DecisionTree.jl RF, 100 trees) = $(round(mean(ŷ .== y_test); digits = 4))"

(This uses the same MLDatasets loader as the digit display above — see that setup note if you haven’t added it yet.)

MNIST random forest in Python (`scikit-learn`)

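Same problem and the same headline settings (100 trees, \sqrt{784} = 28 candidate features per split, which is scikit-learn’s default for classifiers), with similar accuracy, in the Python ecosystem: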

units/unit_02/scripts/mnist_rf_sklearn.py

from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# fetch_openml caches to ~/scikit_learn_data/ after the first download
X, y = fetch_openml("mnist_784", version=1, as_frame=False, return_X_y=True)
X = X.astype("float32") / 255.0
y = y.astype("int64")

X_train, X_test = X[:60_000], X[60_000:]
y_train, y_test = y[:60_000], y[60_000:]

clf = RandomForestClassifier(
    n_estimators=100, n_jobs=-1, random_state=0,
)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"MNIST test accuracy (sklearn RF, 100 trees): {acc:.4f}")

Run it yourself with scripts/mnist_rf_sklearn.py (or ./build.sh execute 2 once you’ve installed scikit-learn). Typical first-run cost: ~30 s download (cached after) + ~2 min fit on a modern laptop CPU.

The point isn’t the accuracy number — it’s that the same problem in two ecosystems has the same shape: load data, choose a model, fit, score. This template recurs throughout the workshop.

When classical ML wins

Random forests (and gradient boosting, and regularised linear models) still beat neural networks when:

the dataset is small (hundreds to thousands of rows),
the features are tabular and well-engineered, or
interpretability matters and you can pay the per-tree inspection cost.

Deep learning’s edge sharpens with high-dimensional unstructured inputs (images, text, audio), large datasets, and — as we’ll see in Unit 5 onwards — PDE residuals.

✏️ Section exercise — how many trees is enough?

Take the DecisionTree.jl MNIST script above and sweep the forest size: n_{\text{trees}} \in \{1, 5, 20, 50, 100, 200\} (use a 10 000-image training subset to keep each fit under ~20 s). Plot test accuracy against tree count. Then hold 100 trees fixed and sweep n_subfeatures over \{5, 28, 100, 784\}. Two questions: where does the accuracy curve flatten, and why does n_subfeatures = 784 (every split sees every feature) hurt the ensemble even though each individual tree gets stronger?

💡 Hint

Reuse the section’s build_forest call: argument 3 is n_subfeatures, argument 4 is n_trees; vary just those in a loop. MLDatasets already returns pixels in [0,1], so no /255 rescaling is needed (unlike the scikit-learn script); flatten with reshape(Float32.(x), 784, :)'. Subset X[1:10_000, :] (and matching labels) to keep each fit fast; the shape of the curves is robust to the subset. Score with mean(apply_forest(forest, Xt) .== yt). For the 784 case: when every split sees every feature, all trees pick the same dominant splits and their errors stop being decorrelated, and averaging correlated errors removes less noise.

Go to solution →

2.3 From logistic regression to a neural network

The simplest possible “neural network” is a single linear layer followed by a smooth squashing function. It is also the first genuinely useful classifier — the softmax regression baseline that Mathematical Engineering of Deep Learning derives in detail in Chapter 3 (Simple Neural Networks). The next four subsections walk through the theory (Bernoulli → sigmoid → softmax) and then fit it on MNIST in both Julia and Python, so the jump to deeper networks in §2.4 is a single conceptual step rather than a leap.

Bernoulli likelihood and the sigmoid

For binary classification, model P(y = 1 \mid x) = \sigma(w^\top x + b) with the sigmoid \sigma(z) = 1/(1 + e^{-z}). Maximising the Bernoulli likelihood over training data is equivalent to minimising the binary cross-entropy

\mathcal{L}(w, b) = -\tfrac{1}{N}\sum_i \bigl[y_i \log \hat p_i + (1-y_i) \log (1-\hat p_i)\bigr],

where \hat p_i = \sigma(w^\top x_i + b). That single-layer x \mapsto \sigma(w^\top x + b) is the simplest neural network — a linear map followed by a smooth nonlinearity.
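The equivalence between maximum likelihood and cross-entropy is a one-line computation, spelled out here for completeness. It assumes the training pairs are independent, which is the usual reading of the setup above.

```latex
\begin{aligned}
P(y_i \mid x_i) &= \hat p_i^{\,y_i}\,(1 - \hat p_i)^{1 - y_i}, \qquad y_i \in \{0, 1\}, \\
-\tfrac{1}{N}\log \prod_{i=1}^N P(y_i \mid x_i)
  &= -\tfrac{1}{N}\sum_{i=1}^N \bigl[y_i \log \hat p_i + (1 - y_i)\log(1 - \hat p_i)\bigr]
   = \mathcal{L}(w, b).
\end{aligned}
```

Maximising the likelihood and minimising \mathcal{L} therefore pick out the same (w, b); the factor 1/N only rescales the objective.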

Softmax for multiclass

Stack the binary case K times and renormalise with the softmax

\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}.

The model becomes P(y = k \mid x) = \mathrm{softmax}(Wx + b)_k, fit by categorical cross-entropy.

A logistic / softmax classifier as a one-layer neural network. The affine map Wx + b feeds a softmax that outputs class probabilities. (Figure from *MEDL*, §3.3.)
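In code, the softmax is usually computed after subtracting the largest logit, which leaves the probabilities unchanged but keeps the exponentials from overflowing. The sketch below is plain Julia for illustration; stable_softmax and cross_entropy are hypothetical helper names, not functions from Lux or NNlib.

```julia
# Softmax over the rows of a K × N matrix of logits (one column per sample).
# Subtracting the column maximum leaves the result unchanged but avoids overflow in exp.
function stable_softmax(Z)
    E = exp.(Z .- maximum(Z; dims = 1))
    return E ./ sum(E; dims = 1)
end

# Categorical cross-entropy for integer labels 0…K−1, averaged over the batch.
function cross_entropy(P, labels)
    return -sum(log(P[labels[n] + 1, n]) for n in eachindex(labels)) / length(labels)
end

Z = [1000.0 0.0; 1001.0 0.0; 999.0 0.0]   # first column would overflow a naive exp
P = stable_softmax(Z)
println(sum(P; dims = 1))                  # each column sums to 1 (up to rounding)
println(cross_entropy(P, [1, 2]))
```

The shift works because softmax(z + c) = softmax(z) for any constant c: the factor e^{c} cancels between numerator and denominator.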

MNIST softmax regression in `Lux.jl`

The “deep learning baseline”, a single linear layer + softmax trained by maximum likelihood, on the same MNIST data the forest just saw. With no hidden units and no nonlinearity other than the output softmax, this is literally the simple network MEDL introduces in Chapter 3 before motivating the deeper feedforward networks of Chapter 5. Expect ~92% test accuracy, a striking gap below the random forest’s ~97%, which is the point.

units/unit_02/scripts/mnist_linear_lux.jl

using MLDatasets, Lux, Random, Zygote, Optimisers, Statistics

# data ─────────────────────────────────────────────────────────────────
train_x, train_y = MLDatasets.MNIST(split = :train)[:]
test_x,  test_y  = MLDatasets.MNIST(split = :test)[:]
flatten(x) = reshape(Float32.(x), 28 * 28, size(x, 3))
X_train, X_test = flatten(train_x), flatten(test_x)
y_train, y_test = Int.(train_y), Int.(test_y)

onehot(k, K = 10) = (e = zeros(Float32, K); e[k + 1] = 1f0; e)
Y_train = reduce(hcat, onehot.(y_train))

# model: 784 → 10 (single linear layer + softmax) ─────────────────────
rng = Random.MersenneTwister(0)
model = Lux.Chain(Lux.Dense(784 => 10), Lux.softmax)
ps, st = Lux.setup(rng, model)

# cross-entropy loss ───────────────────────────────────────────────────
function loss_fn(ps, st, X, Y)
    ŷ, st = model(X, ps, st)
    -mean(sum(Y .* log.(ŷ .+ 1f-9); dims = 1)), st
end

# mini-batch training with Adam ────────────────────────────────────────
opt_state = Optimisers.setup(Optimisers.Adam(1f-2), ps)
batch     = 128
nb        = size(X_train, 2) ÷ batch
for epoch in 1:10
    perm = randperm(rng, size(X_train, 2))
    for b in 1:nb
        global opt_state, ps   # needed when run as a script: rebind the top-level variables
        idx = perm[(b - 1) * batch + 1 : b * batch]
        gs  = first(Zygote.gradient(
                p -> first(loss_fn(p, st, X_train[:, idx], Y_train[:, idx])),
                ps,
              ))
        opt_state, ps = Optimisers.update(opt_state, ps, gs)
    end
end

ŷ_test, _ = model(X_test, ps, st)
preds = vec(map(argmax, eachcol(ŷ_test))) .- 1
@info "MNIST test accuracy (Lux softmax) = $(round(mean(preds .== y_test); digits = 4))"

This is the same skeleton you’ll see again in every PINN: define the model as a pure function, separate parameters ps from inference state st, build a loss, take gradients with Zygote.gradient, step with Optimisers.update. The only thing PINNs add is more loss terms.

The Lux style — parameters as plain arrays, not hidden state — matters because NeuralPDE.jl (Unit 5) needs to compose layers with PDE-residual losses that contain derivatives of the network output. Frameworks like scikit-learn that hide their parameters can’t do that composition.

MNIST softmax regression in Python (`scikit-learn`)

Same model, fewer lines, very similar numbers:

units/unit_02/scripts/mnist_linear_sklearn.py

from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = fetch_openml("mnist_784", version=1, as_frame=False, return_X_y=True)
X = X.astype("float32") / 255.0
y = y.astype("int64")

X_train, X_test = X[:60_000], X[60_000:]
y_train, y_test = y[:60_000], y[60_000:]

# Multinomial logistic = softmax regression (scikit-learn adds an L2 penalty, C=1.0, by default)
clf = LogisticRegression(
    solver="lbfgs", max_iter=200, n_jobs=-1, random_state=0,
)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"MNIST test accuracy (softmax regression): {acc:.4f}")

Available as scripts/mnist_linear_sklearn.py. Runtime: ~30 s on a laptop CPU. Notice that scikit-learn itself uses L-BFGS internally for this convex problem, the same optimiser we’ll use for the fine-tuning stage of every PINN in Units 5–7. Convexity is why L-BFGS converges so cleanly here; in deeper networks the loss surface isn’t convex, so we’ll start with Adam and bring in L-BFGS only for the final polish.

✏️ Section exercise — logistic regression from scratch

Strip the framework away entirely. Using only plain Julia arrays (no Lux, no Zygote), implement binary logistic regression for the digits 0 vs 1 of MNIST: write the sigmoid, the binary cross-entropy, derive the gradient \nabla_w \mathcal{L} = \frac{1}{N} X (\hat p - y) by hand, and run 500 steps of plain gradient descent. You should clear 99% test accuracy on this easy pair. Then swap in the digits 4 vs 9 and watch the same model struggle — why is that pair so much harder for a linear decision boundary?

💡 Hint

No autodiff needed: for sigmoid + binary cross-entropy the per-example gradient collapses to p̂ .- y (the \sigma' cancels against the log), so with X kept as 784×N (columns are images) the update is w -= η .* (X * δ) ./ N and b -= η .* mean(δ), where δ = σ(X'w .+ b) .- y. Select the pair with a mask (y .== 0) .| (y .== 1) and set the label Float32.(y .== 1). Use η = 0.5, 500 steps — about twenty lines. Then swap in 4-vs-9: the discriminator (does the upper loop close?) is a conjunction of pixels no single linear threshold can express.

Go to solution →

2.4 Feedforward networks

MLPs: depth \times width

A multilayer perceptron (MLP) stacks affine layers with elementwise nonlinearities \sigma in between:

f_\theta(x) = W_L\,\sigma\!\bigl(W_{L-1}\,\sigma(\cdots W_1 x + b_1 \cdots) + b_{L-1}\bigr) + b_L.

A one-hidden-layer network is enough to demonstrate the architecture shape:

One hidden layer. The input vector flows through an affine map + nonlinearity to a hidden representation, then a final linear projection. (Figure from *MEDL*, §5.1.)

Depth lets the network compose features; width gives it room to represent each:

A deeper feedforward network — same picture, repeated. (Figure from *MEDL*, §5.1.)
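The MLP formula translates almost symbol for symbol into plain Julia. The sketch below is a framework-free illustration, not the Lux implementation: init_layer, mlp, the field names W and b, and the 1/\sqrt{n_{\text{in}}} weight scaling are all choices made for the example.

```julia
using Random

# One affine layer's parameters as a NamedTuple (field names W, b are illustrative).
init_layer(rng, n_in, n_out) = (W = randn(rng, Float32, n_out, n_in) ./ sqrt(Float32(n_in)),
                                b = zeros(Float32, n_out))

# Forward pass: σ after every layer except the last, matching the MLP formula.
function mlp(x, layers; σ = tanh)
    h = x
    for layer in layers[1:end-1]
        h = σ.(layer.W * h .+ layer.b)
    end
    out = layers[end]
    return out.W * h .+ out.b
end

rng    = MersenneTwister(0)
sizes  = [4, 16, 16, 2]                              # input, two hidden layers, output
layers = [init_layer(rng, sizes[i], sizes[i+1]) for i in 1:length(sizes)-1]

X = randn(rng, Float32, 4, 8)                        # 8 inputs, one per column
println(size(mlp(X, layers)))                        # (2, 8)
```

Depth is the length of layers and width is the hidden entries of sizes. The parameters are passed in explicitly rather than stored inside the function.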

Universal approximation

A one-hidden-layer MLP (two layers of weights) with a non-polynomial activation can approximate any continuous function on a compact set to arbitrary accuracy, given enough hidden units (Cybenko, 1989 and Hornik, 1991 for sigmoidal and bounded activations; Leshno et al., 1993 for the general non-polynomial condition). The theorem is qualitative: it doesn’t tell you how many units, how much data, or how to train. Depth, architectural priors, and good optimisation make the practical difference.

Activation choices (with a PINN aside)

Common nonlinearities — the two smooth ones first, since they matter most for PINNs:

Sigmoid \sigma(z) = \dfrac{1}{1 + e^{-z}} — squashes the whole real line into (0, 1); the classic “soft switch.”
tanh \tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\,\sigma(2z) - 1 — the same S-shape, rescaled to (-1, 1) and centred at zero. Sigmoid and tanh are smooth and bounded, so they saturate: deep networks built from them tend to suffer from vanishing gradients.
ReLU \max(0, z) — cheap, non-saturating, the modern default for vision and language tasks.
Swish z\,\sigma(\beta z) — a smoothed ReLU; a good default for PINNs.
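The four definitions fit in four lines of plain Julia, which also makes the tanh identity and the saturation claim easy to check numerically. This is a sketch for a fresh session: the hand-written sigmoid, relu, and swish stand in for the versions Lux exports, and dsigmoid is a helper added for the example.

```julia
sigmoid(z)        = 1 / (1 + exp(-z))
relu(z)           = max(zero(z), z)
swish(z; β = 1.0) = z * sigmoid(β * z)
# tanh is already provided by Julia's Base

zs = range(-5, 5; length = 11)

# The identity tanh(z) = 2σ(2z) − 1 from the list above, checked numerically
println(all(tanh(z) ≈ 2 * sigmoid(2z) - 1 for z in zs))

# Saturation: the sigmoid's slope σ'(z) = σ(z)(1 − σ(z)) peaks at 0.25 and decays in the tails
dsigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
println(dsigmoid(0.0), "  ", dsigmoid(5.0))
```

Backpropagation multiplies one such slope per layer, and each factor is at most 0.25, so the product shrinks geometrically with depth. That is the vanishing-gradient problem in its simplest form.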

The four activations on [-5, 5]: sigmoid and tanh are the smooth, bounded S-curves; ReLU is the piecewise-linear default; Swish is a smoothed ReLU. (Generated by `scripts/activations.jl`.)

Why the smooth ones are central for PINNs. A PINN’s loss contains derivatives of the network’s output, often the second derivative, which shows up in the PDE residual (the autodiff machinery later in this unit, leaned on hard from Unit 5 on). ReLU’s second derivative is zero almost everywhere, so a ReLU network cannot even form a diffusion-type residual like u_{xx}. Smooth activations keep every derivative alive, which is why PINNs use them: the workshop defaults to Swish from Unit 5 on, with tanh the other standard choice. Sigmoid and tanh are the two to keep at your fingertips: the other smooth activations in this list are built from the same S-curve (Swish is z times a sigmoid).
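The vanishing second derivative is easy to see on a tiny network, without any autodiff. The sketch below uses a central finite difference as a stand-in for the automatic differentiation the workshop relies on. The three-neuron weights are arbitrary numbers chosen for illustration, and the grid is picked to stay away from the ReLU kinks.

```julia
# Tiny one-hidden-layer net u(x) = Σ_j a_j · act(w_j x + c_j); the weights are arbitrary
w = [1.5, -2.0, 0.7];  c = [0.1, 0.4, -0.3];  a = [0.8, -0.5, 1.2]
net(act) = x -> sum(a .* act.(w .* x .+ c))

relu(z) = max(zero(z), z)

# Central second difference: u''(x) ≈ (u(x+h) − 2u(x) + u(x−h)) / h²
d2(u, x; h = 1e-3) = (u(x + h) - 2u(x) + u(x - h)) / h^2

xs = range(-1, 1; length = 9)                 # grid chosen to avoid the ReLU kinks
println("tanh net u'': ", round.(d2.(net(tanh), xs), digits = 3))
println("ReLU net u'': ", round.(d2.(net(relu), xs), digits = 3))
```

Between its kinks the ReLU network is a straight line, so its second difference is zero up to rounding at every grid point, whatever the weights are. A residual built on that u_{xx} carries no information about the network’s parameters.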

✏️ Section exercise — universal approximation, and the ReLU trap

Fit a one-hidden-layer Lux MLP to f(x) = \sin(4\pi x) \cdot e^{-x} on [0, 1] at widths 5, 20, 100, and plot the three fits over the target. Then — the PINN-relevant part — take the width-20 network in two versions, tanh and relu, and use nested ForwardDiff.derivative to plot the second derivative f_\theta''(x) of each. The tanh network gives a smooth curve; what does the ReLU network give, and why does that disqualify it from any diffusion-type PDE residual?

💡 Hint

Build each net as Lux.Chain(Lux.Dense(1 => w, act), Lux.Dense(w => 1)) with act = tanh or relu, and train it with the same Zygote.gradient + Optimisers.Adam(1f-2) loop used in §2.3. For the second derivative, reuse the nested-ForwardDiff.derivative pattern from §2.7: wrap the trained net in a scalar closure u(x) = first(model([x], ps, st))[1], then f2(x) = ForwardDiff.derivative(ξ -> ForwardDiff.derivative(u, ξ), x). The tanh f2 is smooth; the ReLU one is identically zero (a piecewise-linear net), so a u_xx residual built on ReLU silently solves the wrong equation.

Go to solution →

2.5 On using Lux.jl

Every neural network in this workshop — and every PINN from Unit 5 on — is built with Lux.jl. The previous sections already used it in passing; this section is a short, standalone tour of the handful of functions you actually need, with the Lux documentation worth keeping open alongside. The snippets build on one another — pasted in order into a Julia session (the @pinn environment), they run as one self-contained example on a small synthetic batch.

The core idea: parameters live outside the model

A Lux model is an immutable description of an architecture — it holds no weights. The parameters (ps) and any auxiliary state (st, e.g. a BatchNorm running average) live in separate objects you create once and then thread through every call by hand:

using Lux, Random

model = Lux.Chain(
    Lux.Dense(784 => 128, tanh),   # affine 784→128, then tanh
    Lux.Dense(128 => 10),          # affine 128→10 (logits)
)

rng    = Random.MersenneTwister(0)
ps, st = Lux.setup(rng, model)     # ps = parameters, st = state

Lux.setup initialises the parameters with each layer’s default initialiser (Glorot uniform for Dense in older Lux releases; the defaults changed in Lux 1.0, so check the docs for your version) and returns the pair. From here on model, ps, and st are three separate things you carry together. This is the single most important Lux convention, and the reason the SciML / PINN stack is built on it: a loss can be differentiated cleanly with respect to ps because ps is just a NamedTuple of arrays, not hidden inside the model.

Specifying a network

The two main building blocks:

Lux.Dense(in => out, activation) — a fully-connected layer x \mapsto \sigma(Wx + b). Omit the activation for a plain affine map.
Lux.Chain(layer1, layer2, …) — compose layers front to back.

That covers every MLP in this unit. Other layers slot in the same way — Lux.Conv, Lux.MaxPool (the LeNet CNN of §2.8), Lux.BatchNorm, Lux.Dropout, or Lux.WrappedFunction(f) to drop an arbitrary function into a chain.

Inference: calling the model

A Lux model is called as a pure function model(x, ps, st) and returns a tuple (output, new_state):

x = randn(rng, Float32, 784, 32)    # a batch of 32 (synthetic) inputs, columns
ŷ, st = model(x, ps, st)            # ŷ is 10×32; st unchanged here (Dense is stateless)

Inputs put features down the rows and the batch across the columns (one sample per column), so a 784 => 128 layer expects a 784 × batch matrix. For layers whose behaviour differs between training and inference (Dropout, BatchNorm), switch to evaluation behaviour with st = Lux.testmode(st) before predicting.

Activations and one-hot encoding

Activations are ordinary functions: pass them by name to a layer. Lux re-exports the common ones from NNlib.jl (relu, sigmoid, swish, gelu, softplus, … plus softmax for the output), and tanh comes from Julia’s Base. This is exactly the §2.4 menu; for PINNs the default is tanh or swish.

Lux.Dense(128 => 10, relu)          # ReLU layer
Lux.Dense(128 => 10, swish)         # Swish — the PINN default

For classification you want integer labels as one-hot columns. Hand-rolling it is a one-liner (as in §2.3), but the ecosystem helper is onehotbatch from OneHotArrays.jl:

using OneHotArrays                  # not in @pinn by default — see the setup note below
y_train = rand(rng, 0:9, 32)        # 32 synthetic integer labels (one per input column)
Y       = onehotbatch(y_train, 0:9) # 10 × 32 one-hot matrix
preds   = onecold(ŷ, 0:9)           # argmax of ŷ back to labels 0–9

Setup note. OneHotArrays.jl is not in the @pinn environment — add it yourself once with Pkg.add("OneHotArrays") (it installs in a few seconds, pulls in NNlib which is already present, and doesn’t disturb the shared PINN stack).

Training: the standard loop

Training is the §2.3 skeleton — and it is the same skeleton every PINN uses. Take gradients of a scalar loss with respect to ps (reverse-mode AD, §2.7), then step an optimiser:

using Zygote, Optimisers, Statistics

function loss_fn(ps)
    ŷ, _ = model(x, ps, st)
    mean(abs2, ŷ .- Y)              # any scalar loss (here MSE on the one-hot targets)
end

opt_state = Optimisers.setup(Optimisers.Adam(1f-2), ps)   # Adam: see §2.6
for step in 1:1000
    global opt_state, ps                                  # harmless in the REPL; required if run as a script
    g = first(Zygote.gradient(loss_fn, ps))               # ∂loss/∂ps, a NamedTuple
    opt_state, ps = Optimisers.update(opt_state, ps, g)   # functional update
end

Two things to notice. Optimisers.update returns a new (opt_state, ps) rather than mutating in place — you rebind them each step. And the optimiser itself (Adam, Descent, and the L-BFGS that finishes every PINN) is the subject of §2.6; here just treat Adam(1f-2) as a sensible default.

Inspecting a trained model

ps is a plain NamedTuple, so the trained weights are right there to read:

Lux.parameterlength(model)    # total trainable parameter count
Lux.statelength(model)        # size of the state st
ps.layer_1.weight             # the 128×784 weight matrix of the first Dense
ps.layer_1.bias               # its bias vector

Lux.parameterlength is what §2.8’s exercise uses to compare model sizes; and because ps is just nested arrays you can save it, move it to the GPU (ps |> gpu_device(), §2.8), or hand it to NeuralPDE.jl unchanged.

Why this design — Lux vs Flux, and the other frameworks

Julia’s older deep-learning library, Flux.jl, stores parameters inside the layer objects (a Dense carries its own W and b), so a model is a mutable, stateful thing. Lux.jl deliberately splits them apart: the model is an immutable, explicit-parameter function f(x, ps, st). That functional style is what makes Lux the right substrate for SciML and PINNs — when the loss contains derivatives of the network itself (§2.7), the autodiff and the PDE machinery need the parameters as an explicit, immutable argument they can differentiate and reshape, not as hidden state. It also keeps GPU movement and reproducibility (rng in, ps out) clean.

For orientation across ecosystems:

PyTorch — like Flux: parameters live on the nn.Module, you call model(x), and the weights are implicit. Define-by-run, imperative.
TensorFlow / Keras — Keras is the high-level layer API; classic TensorFlow built a static computation graph first and ran it afterwards. Mostly implicit-parameter, like PyTorch.
JAX / Flax — functional and explicit-parameter, the same philosophy as Lux: pure functions with parameters passed in. If you know Flax, Lux will feel familiar.

The payoff of the Lux convention is uniformity: the model(x, ps, st) / Zygote.gradient / Optimisers.update loop above is byte-for-byte the shape of every training run in this course, plain classifier or PINN alike.

Saving and loading a trained model

Because the parameters live outside the model, saving a Lux model means saving ps (and st) — plain NamedTuples of arrays — while the architecture stays in your code. You reload by rebuilding the same Chain and reading the arrays back in. A few options, simplest first:

Serialization (Julia standard library, nothing to install) — one call each way. Convenient, but the format is tied to your exact Julia/package versions: fine for a scratch checkpoint, not for long-term archival or sharing.
JLD2.jl — an HDF5-style container, robust across versions; the option the Lux docs recommend, and what the example below uses.
BSON.jl — binary JSON, the long-standing choice in the Flux world.
Cross-language / archival — export the raw weight arrays with HDF5.jl or NPZ.jl (e.g. to read them back from Python/NumPy).

A complete round-trip on a baby model — build it, pretend to train, save, and load. (JLD2.jl and BSON.jl are not in @pinn by default; Pkg.add them once. Serialization is always available.)

using Lux, Random, JLD2

# build a baby MLP (1 → 8 → 1) and initialise its parameters
model  = Lux.Chain(Lux.Dense(1 => 8, tanh), Lux.Dense(8 => 1))
rng    = Random.MersenneTwister(0)
ps, st = Lux.setup(rng, model)
# ... you would train here (the loop above), updating ps; we skip it ...

# SAVE — store the parameters and state, not the model object
jldsave("baby_model.jld2"; ps, st)

# LOAD — rebuild the IDENTICAL architecture in code, then read the params back
model_loaded = Lux.Chain(Lux.Dense(1 => 8, tanh), Lux.Dense(8 => 1))
saved        = JLD2.load("baby_model.jld2")
ps2, st2     = saved["ps"], saved["st"]

# confirm it round-trips: same input → same output
x = randn(rng, Float32, 1, 4)
@assert model(x, ps, st)[1] ≈ model_loaded(x, ps2, st2)[1]

The same shape with the two alternatives:

# standard library — zero dependencies
using Serialization
serialize("baby_model.jls", (ps, st))
ps3, st3 = deserialize("baby_model.jls")

# BSON.jl — the Flux-ecosystem choice
using BSON
BSON.@save "baby_model.bson" ps st
BSON.@load "baby_model.bson" ps st

Rule of thumb: save ps/st (the arrays), keep the model as code. Don’t rely on serialising the model object across Julia or package versions — rebuilding the Chain and loading the parameters into it is what stays portable.

2.6 Optimisation

Training a neural network — or fitting a PINN — comes down to minimising a scalar loss over a large parameter vector \theta \in \mathbb{R}^P, given its gradient g = \nabla_\theta \mathcal{L}. This section is about the optimisers that use that gradient: plain gradient descent and its geometry, the stochastic / mini-batch variants, momentum and Adam, and the quasi-Newton method (L-BFGS) that often finishes the job. How the gradient itself is computed — automatic differentiation — is the next section (§2.7). Most of this material is treated at greater depth in Mathematical Engineering of Deep Learning, Chapter 4 (Optimization Algorithms), our primary reference throughout this unit — with the classic, thorough treatment in Goodfellow, Bengio & Courville (2016) Chapter 8 as the standard alternative. The goal here is a working understanding sufficient to read PINN training code and know which knob to turn when it doesn’t converge.

How hard this minimisation is depends entirely on the shape of the loss surface, and that shape is something we partly choose. The two panels below are the same logistic-regression model on the same data, differing only in the loss function: squared error gives a non-convex landscape riddled with flat regions and a saddle, while binary cross-entropy gives a smooth convex bowl that gradient descent walks straight down. PINN losses live at the hard end of this spectrum — a lesson worth keeping in mind for the rest of the unit.

The loss landscape of logistic regression on a synthetic dataset, as a function of the two parameters (w, b). **(a)** Squared loss C_i(\theta) = (y^{(i)} - \hat y^{(i)})^2 — non-convex, with flat plateaus and a saddle. **(b)** Binary cross-entropy C_i(\theta) = \mathrm{CE}(y^{(i)}, \hat y^{(i)}) — a single convex basin. Same model, same data; only the loss differs. (Figure 3.4 from *Mathematical Engineering of Deep Learning*.)

Gradient descent: the geometric core

Given a smooth loss \mathcal{L}: \mathbb{R}^P \to \mathbb{R} and a parameter vector \theta_t, the gradient g_t = \nabla_\theta \mathcal{L}(\theta_t) \in \mathbb{R}^P is the direction in which \mathcal{L} increases fastest. Gradient descent takes a small step in the opposite direction:

\theta_{t+1} \;=\; \theta_t \;-\; \eta\, g_t, \tag{1}

with learning rate \eta > 0. To first order in \eta,

\mathcal{L}(\theta_{t+1}) \;\approx\; \mathcal{L}(\theta_t) - \eta\, \|g_t\|^2 + O(\eta^2),

so any sufficiently small positive \eta decreases the loss (provided g_t \neq 0); make \eta too large and the O(\eta^2) Hessian term wins, the loss goes up instead of down, and the iterate can diverge. The practical learning-rate range for deep networks (10^{-4} to 10^{-2}) is set by this balance; see MEDL Chapter 4 for the convergence analysis.

Gradient descent on the loss contours of a simple linear-regression problem, started from the origin (black dot) and converging to the optimum (green dot). The learning rate \eta (“\alpha” in the legend) decides everything: \eta = 0.01 and \eta = 0.0235 both converge — the larger rate just gets there in fewer steps — but a slightly larger \eta = 0.024 overshoots and *diverges* (the iterate runs off through the grey band). The window between “fast” and “broken” can be very narrow. (Figure 2.7 from *Mathematical Engineering of Deep Learning*.)
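The narrow window between "fast" and "broken" can be reproduced on a small quadratic, where the threshold is known in closed form: for a loss ½θᵀAθ, gradient descent converges only if \eta < 2/\lambda_{\max}(A). The sketch below is illustrative only. The matrix A, the starting point, and the helper run_gd are assumptions made for this example, not the regression problem in the figure.

```julia
using LinearAlgebra

# Hypothetical quadratic loss L(θ) = ½ θᵀAθ with its minimum at θ = 0.
A = [10.0 0.0; 0.0 1.0]            # eigenvalues 10 and 1, so λ_max = 10
loss(θ) = 0.5 * dot(θ, A * θ)
grad(θ) = A * θ

# Plain gradient descent, equation (1): θ ← θ − η g
function run_gd(η; θ0 = [1.0, 1.0], steps = 200)
    θ = copy(θ0)
    for _ in 1:steps
        θ = θ - η * grad(θ)
    end
    return loss(θ)
end

η_crit = 2 / eigmax(A)             # stability threshold for this quadratic

loss_slow   = run_gd(0.05)         # well below the threshold: converges
loss_fast   = run_gd(0.19)         # just below the threshold: still converges
loss_broken = run_gd(0.21)         # just above the threshold: blows up

println("η_crit = $η_crit")
println("final loss: η=0.05 → $loss_slow, η=0.19 → $loss_fast, η=0.21 → $loss_broken")
```

A real network has no single known \lambda_{\max}, which is why the stable range has to be found by trial in practice.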

Why isn’t plain gradient descent enough? Three failure modes that recur throughout deep learning and PINN training:

Pathological curvature. Loss surfaces often have long, thin valleys (the ravine problem). g_t points across the valley, not along it. Plain GD zig-zags.
Bad scaling between parameters. Different layers have wildly different gradient magnitudes. A single global \eta can’t be right for all of them at once.
Stochastic noise when g_t is estimated from a mini-batch rather than the full dataset (covered below).

Each of momentum, Adam, and L-BFGS is a different fix.

Stochastic and mini-batch gradient descent

For empirical risk \mathcal{L}(\theta) = \frac{1}{N}\sum_i \ell_i(\theta), the full-batch gradient sums over all N training examples — exact, but expensive when N \in [10^4, 10^9]. Stochastic gradient descent (SGD) replaces the sum with a single random example i_t:

\theta_{t+1} \;=\; \theta_t - \eta \, \nabla_\theta \ell_{i_t}(\theta_t),

a noisy but unbiased estimator of the true gradient. The middle ground is mini-batch SGD — sum over a random subset B \subset \{1, \ldots, N\} of size |B| (typically 32 – 1024):

\theta_{t+1} \;=\; \theta_t - \frac{\eta}{|B|}\sum_{i \in B} \nabla_\theta \ell_i(\theta_t).

The gradient variance falls like 1/|B|. Three reasons mini-batches dominate practice:

Computational cost. A modern GPU/TPU is massively parallel; cost per gradient is roughly constant up to |B| \sim hundreds. So a single mini-batch update is nearly as cheap as a single SGD update but with far lower variance.
Gradient noise as a regulariser. Noisy gradients prevent the iterate from settling into sharp minima; flat minima generalise better empirically. See MEDL Chapter 4.
Memory. Full-batch gradients on a million-example dataset don’t fit on a GPU.
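The 1/|B| variance claim is easy to check numerically. The sketch below assumes a made-up one-parameter problem with per-example losses \ell_i(\theta) = \tfrac12(\theta - a_i)^2 and samples each mini-batch with replacement; the data a, the seed, and the helper names are all hypothetical.

```julia
using Random, Statistics

# Hypothetical per-example losses ℓ_i(θ) = ½ (θ − a_i)², so ∇ℓ_i(θ) = θ − a_i.
rng = MersenneTwister(42)
N = 10_000
a = randn(rng, N)                  # synthetic "data", one number per example
θ = 0.5
per_example_grads = θ .- a
full_grad = mean(per_example_grads)

# Mini-batch gradient: average of B per-example gradients, drawn with replacement
minibatch_grad(B) = mean(per_example_grads[rand(rng, 1:N, B)])

# Empirical variance of the mini-batch estimator over many independent draws
est_var(B; trials = 20_000) = var([minibatch_grad(B) for _ in 1:trials])

v1, v16, v256 = est_var(1), est_var(16), est_var(256)
println("full-batch gradient  = $full_grad")
println("var(|B| = 1)         = $v1")
println("var(|B| = 16)  × 16  = $(16 * v16)")
println("var(|B| = 256) × 256 = $(256 * v256)")
```

If the variance really scales as 1/|B|, the last three printed numbers should roughly agree.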

Gradient descent (red) vs. stochastic gradient descent (blue) on a cost C(\theta) = \frac1n\sum_i C_i(\theta) built from individual terms, both started from \theta^{(0)}. **(a)** When the per-term minimisers u_1, \ldots, u_n differ (green dots), SGD follows a noisy path that rattles around inside their convex hull — each step chases one C_i, not the average. **(b)** When all terms share the same minimiser, the stochastic gradient is the full gradient (up to scale), so SGD tracks GD almost exactly. The noise comes from disagreement between the C_i, not from randomness per se. (Figure 4.3 from *Mathematical Engineering of Deep Learning*.)

PINNs invert this picture: their “dataset” is a few thousand collocation points, so full-batch methods are feasible — Adam typically runs over all points, and L-BFGS often finishes the job. Mini-batching reappears only when the collocation set is large (e.g. high-dimensional domains or data-assimilation terms), so the vocabulary is still worth having.

Momentum and Adam

SGD + momentum (Polyak 1964; Sutskever et al. 2013) adds inertia:

\begin{aligned} v_{t+1} &= \mu\, v_t + g_t, \\ \theta_{t+1} &= \theta_t - \eta\, v_{t+1}, \end{aligned}

with momentum coefficient \mu \in [0, 1) (typically 0.9). The velocity v_t is an exponentially-weighted sum of past gradients: it cancels in directions where successive g_t oscillate (across the ravine) and accumulates in directions where they agree (along the valley floor), where the effective step grows to as much as \eta/(1-\mu). For a quadratic loss with condition number \kappa, well-tuned momentum reduces the iterations to converge from O(\kappa) to O(\sqrt{\kappa}), a square-root speedup.
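The square-root speedup can be seen directly on a two-dimensional quadratic with \kappa = 100. This is a minimal sketch under stated assumptions: the matrix A, the tolerance, and the helper iterations_to_converge are invented for illustration, and both methods use the classical optimal settings for a quadratic with known extreme eigenvalues, which you would not have for a real network.

```julia
using LinearAlgebra

# Hypothetical ill-conditioned quadratic: L(θ) = ½ θᵀAθ, condition number κ = 100
A = Diagonal([100.0, 1.0])
grad(θ) = A * θ

# Count iterations until ‖θ‖ < tol. Setting μ = 0 recovers plain gradient descent.
function iterations_to_converge(η, μ; θ0 = [1.0, 1.0], tol = 1e-6, maxiter = 100_000)
    θ, v = copy(θ0), zeros(2)
    for t in 1:maxiter
        v = μ .* v .+ grad(θ)      # v accumulates past gradients
        θ = θ .- η .* v
        norm(θ) < tol && return t
    end
    return maxiter
end

λmax, λmin = 100.0, 1.0
κ = λmax / λmin

# Classical optimal settings for a quadratic (assumes the spectrum is known)
η_gd = 2 / (λmax + λmin)
η_hb = 4 / (sqrt(λmax) + sqrt(λmin))^2
μ_hb = ((sqrt(κ) - 1) / (sqrt(κ) + 1))^2

n_gd = iterations_to_converge(η_gd, 0.0)
n_hb = iterations_to_converge(η_hb, μ_hb)
println("plain GD: $n_gd iterations, momentum: $n_hb iterations")
```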

Adam (Kingma & Ba 2015) is best understood as the fusion of two independently-invented ideas, each useful on its own:

Momentum (above) — an exponentially-weighted average of the gradient itself, smoothing the direction of travel.
RMSprop (Hinton, unpublished 2012) — an exponentially-weighted average of the gradient’s squared magnitude, used to give each parameter its own per-parameter step size: directions with large recent gradients are damped, small ones amplified.

Adam runs both averages at once and divides one by the other:

\begin{aligned} m_t &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t, &\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \\ v_t &= \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2, &\hat{v}_t &= \frac{v_t}{1 - \beta_2^t}, \\ \theta_{t+1} &= \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}. \end{aligned}

Read as: m_t is the momentum part (running average of the gradient), v_t is the RMSprop part (running average of the squared gradient, component-wise), and the update divides one by the square root of the other so that each parameter gets a step size inversely proportional to how loud its historical gradient has been.

The hats \hat{m}_t, \hat{v}_t are bias correction, and they are a general fact about exponential smoothing, not an Adam-specific trick. Initialising m_0 = v_0 = 0 biases an exponentially-weighted average toward zero for the first several steps: after t steps the weights (1-\beta)\sum_{k} \beta^{k} only sum to 1 - \beta^t, not 1. Dividing by 1 - \beta_1^t and 1 - \beta_2^t exactly rescales the two averages back to unbiased estimates, and the correction fades away (\beta^t \to 0) once the running averages have warmed up.
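The bias and its correction show up on the simplest possible input, a constant gradient g_t = 1, where an unbiased running average should return exactly 1 at every step. The helper ema_trace below is a hypothetical name for this illustration.

```julia
# Exponentially-weighted average of a constant signal g_t = 1, started at m_0 = 0.
function ema_trace(β; steps = 50)
    m = 0.0
    raw, corrected = Float64[], Float64[]
    for t in 1:steps
        m = β * m + (1 - β) * 1.0
        push!(raw, m)                    # biased toward 0 early on: equals 1 − β^t
        push!(corrected, m / (1 - β^t))  # the "hat": rescaled back to 1
    end
    return raw, corrected
end

raw, corrected = ema_trace(0.9)
for t in (1, 10, 50)
    println("t = $t   raw = $(raw[t])   corrected = $(corrected[t])")
end
```

With \beta_2 = 0.999 the warm-up takes on the order of a thousand steps, which is why the correction matters most for the v_t average.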

Hyperparameter defaults — what MEDL Chapter 4 recommends and the workshop scripts use: \eta \in [10^{-4}, 10^{-2}], \beta_1 = 0.9, \beta_2 = 0.999, \varepsilon = 10^{-8}. Adam almost always works; it is the right first move for any neural-network training, including the initial phase of every PINN in Units 5–7.

L-BFGS: quasi-Newton for the full-batch regime

Newton’s method takes step \theta_{t+1} = \theta_t - H_t^{-1} g_t, using the Hessian H_t = \nabla_\theta^2 \mathcal{L}. In one step on a quadratic loss it lands at the minimum. The catch: H_t is P \times P and inverting it costs O(P^3) — infeasible for P \gtrsim 10^4 parameters.
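The one-step claim can be checked on a tiny quadratic. The matrix A, the vector b, and the starting point below are arbitrary values chosen for illustration.

```julia
using LinearAlgebra

# Hypothetical quadratic loss L(θ) = ½ θᵀAθ − bᵀθ, so g = Aθ − b and H = A.
A = [4.0 1.0; 1.0 3.0]             # symmetric positive definite
b = [1.0, 2.0]
grad(θ) = A * θ - b
hess(θ) = A

θ0 = [10.0, -7.0]                  # arbitrary starting point
θ1 = θ0 - hess(θ0) \ grad(θ0)      # Newton step; `\` solves H x = g without forming H⁻¹

θ_star = A \ b                     # exact minimiser, where the gradient vanishes
println("after one Newton step: $θ1")
println("exact minimiser:       $θ_star")
```

The linear solve is the O(P^3) step that becomes infeasible at network scale.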

BFGS builds a positive-definite approximation B_t \approx H_t incrementally from successive gradient differences using a rank-2 update. L-BFGS (Liu & Nocedal 1989) is the limited-memory variant: store only the last m pairs of (s_k, y_k) = (\theta_{k+1} - \theta_k, \, g_{k+1} - g_k) and reconstruct B_t^{-1} g_t implicitly in O(mP) time. A typical default is m = 10, so L-BFGS stores 2m = 20 vectors of length P and does a small multiple of mP work per step, against O(P^2) memory for full BFGS. Crucially:

No learning rate to tune. L-BFGS does an internal Wolfe line search.
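Fast convergence near the minimum. Quadratic convergence belongs to Newton's method proper; full BFGS converges superlinearly, and L-BFGS is only guaranteed a linear rate, but its curvature information typically lets it finish in far fewer iterations than a first-order method once it is close.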
Needs an accurate gradient. L-BFGS assumes the gradient is deterministic — feed it a noisy mini-batch gradient and the Wolfe conditions break.

This is exactly the regime PINNs land in: a few thousand collocation points (small enough to take the full-batch gradient on every step), a non-convex but locally well-conditioned loss after Adam has done the rough work. The recipe across Units 5–7 is therefore:

Adam for a few thousand iterations (escape bad initialisations and settle into a good basin) → L-BFGS for a few hundred iterations (drive the residual to convergence with exact full-batch gradients).
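For the curious, the way L-BFGS reconstructs B_t^{-1} g_t from the stored (s_k, y_k) pairs is the standard two-loop recursion, sketched here. The function name lbfgs_direction and the test data are hypothetical, the initial scaling \gamma I is the common textbook choice, and the line search that a real implementation wraps around this direction is omitted.

```julia
using LinearAlgebra

# L-BFGS two-loop recursion: returns r ≈ H⁻¹ g using only the stored pairs.
# S[i] = θ_{i+1} − θ_i and Y[i] = g_{i+1} − g_i, ordered oldest → newest.
function lbfgs_direction(g, S, Y)
    m = length(S)
    ρ = [1 / dot(Y[i], S[i]) for i in 1:m]
    α = zeros(m)
    q = copy(g)
    for i in m:-1:1                        # first loop: newest → oldest
        α[i] = ρ[i] * dot(S[i], q)
        q .-= α[i] .* Y[i]
    end
    γ = dot(S[m], Y[m]) / dot(Y[m], Y[m])  # initial inverse-Hessian guess γ I
    r = γ .* q
    for i in 1:m                           # second loop: oldest → newest
        β = ρ[i] * dot(Y[i], r)
        r .+= (α[i] - β) .* S[i]
    end
    return r                               # the update is θ ← θ − (step length) · r
end

# Tiny check on a hypothetical quadratic, where y = A s holds exactly
A = [4.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 2.0]
S = [[1.0, 0.0, 0.5], [0.2, -1.0, 0.3]]
Y = [A * s for s in S]

g = [1.0, -2.0, 0.5]
r = lbfgs_direction(g, S, Y)
secant = lbfgs_direction(Y[end], S, Y)     # should reproduce the newest s

println("quasi-Newton direction: $r")
println("secant check: $secant vs $(S[end])")
```

Each loop touches m vectors of length P once, which is where the O(mP) cost comes from; no P \times P matrix is ever formed.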

✏️ Section exercise — race the optimisers down a ravine

Implement gradient descent, momentum (\mu = 0.9), and Adam by hand — each is under ten lines — and race them on the Rosenbrock function f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2 from (-1.5, 2). Plot the three trajectories over the loss contours and the loss-vs-iteration curves (log scale, 5 000 iterations, tune each learning rate to the largest stable value). Which method zig-zags across the valley, which one tracks the valley floor, and where does Adam’s per-parameter scaling visibly help?

💡 Hint

Rosenbrock gradient (the part that’s easy to get wrong): ∇f(x, y) = [-2(1 - x) - 400x*(y - x^2), 200*(y - x^2)]. Keep each optimiser under ten lines by mutating the iterate in place with state in a NamedTuple — GD is x .-= η .* g; momentum is v .= μ .* v .+ g; x .-= η .* v; Adam is the bias-corrected m/v pair from §2.6 as x .-= η .* m̂ ./ (sqrt.(v̂) .+ 1e-8). The real work is fair tuning: push each η up until it just diverges, then back off (GD lands near 1e-3, Adam near 2e-2, μ = 0.9). Plot log10 ∘ f for the contours or the colour scale saturates, then map what you see onto the three §2.6 failure modes.

Go to solution →

2.7 Automatic differentiation

Every optimiser in §2.6 needs the gradient \nabla_\theta \mathcal{L}, and so far we have taken it “for free.” This section is how you get it — and it matters more for PINNs than for ordinary deep learning, because a PINN differentiates its own network inside the loss.

Automatic differentiation is neither of the two things people first guess. It is not symbolic differentiation (manipulating formulas, which blows up in size), and it is not finite differences \bigl(f(x+h)-f(x)\bigr)/h (which forces a bad trade between truncation error at large h and round-off at small h). Instead it computes the exact derivative — to machine precision — by applying the chain rule to the elementary operations your code already runs.

Numerical differentiation, briefly

The intro flagged finite differences as the thing AD is not — but they are worth keeping in your pocket, because they stay the practical fallback when AD won’t compose. Sampling f at points a step h apart gives the three formulas you actually use:

f'(x) \approx \frac{f(x+h) - f(x)}{h}\ \ (\text{forward},\ O(h)), \qquad f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}\ \ (\text{central},\ O(h^2)),

f''(x) \approx \frac{f(x+h) - 2f(x) + f(x-h)}{h^2}\ \ (\text{central second derivative},\ O(h^2)).

The central difference is the one to reach for: it costs the same two function evaluations as the forward formula, but its leading error term cancels, leaving O(h^2). Everything then hinges on the step h: too large and the truncation error (O(h) or O(h^2)) dominates; too small and round-off does, because you are subtracting two nearly-equal floats and dividing by a tiny number. The error is minimised at an awkward middle value (h \sim \sqrt{\varepsilon_{\text{mach}}} \approx 10^{-8} for forward, \varepsilon_{\text{mach}}^{1/3} \approx 10^{-5} for central in Float64), and you never reach full machine precision.
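The trade-off is easy to watch on a function whose derivative is known. A small sketch, assuming f(x) = \sin x at x = 1 as the test case (any smooth function would do):

```julia
# Forward and central differences on f(x) = sin(x) at x = 1, where f'(1) = cos(1).
f(x) = sin(x)
x0, exact = 1.0, cos(1.0)

forward(f, x, h) = (f(x + h) - f(x)) / h
central(f, x, h) = (f(x + h) - f(x - h)) / (2h)

for h in (1e-2, 1e-5, 1e-8, 1e-11)
    ef = abs(forward(f, x0, h) - exact)   # truncation O(h) plus round-off ~ ε/h
    ec = abs(central(f, x0, h) - exact)   # truncation O(h²) plus round-off ~ ε/h
    println("h = $h   forward error = $ef   central error = $ec")
end
```

Reading down the printed table, both errors first shrink with h and then grow again once round-off takes over, with the central difference bottoming out at a larger h and a smaller error.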

Automatic differentiation exists to delete that whole trade-off — exact derivatives, no h to tune — and it comes in two flavours, forward and reverse, which the rest of this section builds up. Finite differences still earn their keep as the robust fallback when the AD backends won’t nest cleanly — exactly the role the central difference plays in the hand-rolled PINN of Unit 5 §5.2.

Computational graphs and backprop

Every neural-network loss is the output of a finite computation graph — a DAG whose nodes are tensors and whose edges are elementary operations.

The computational graph for C(\theta) = \sin(\theta_1 \theta_2) + \exp(\theta_2 + \theta_3) - 2\theta_3: the inputs \theta_1, \theta_2, \theta_3 enter on the left, each node applies one elementary operation, and the output C(\theta) comes out on the right. Forward evaluation flows left-to-right; reverse-mode AD (backpropagation) propagates gradients right-to-left along the same edges. (Figure 4.7 from *Mathematical Engineering of Deep Learning*.)

Backpropagation is exactly reverse-mode automatic differentiation applied to that graph: one forward pass evaluates intermediate quantities, one backward pass propagates gradients using the chain rule. Cost is O(\text{nodes}) per gradient — i.e., similar to one forward pass.
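To make the two passes concrete, here is the graph from the figure differentiated by hand: one line per node on the way forward, one line per node on the way back. This is a sketch of the mechanism only, with hypothetical names (C_with_backward, the *_bar adjoints) and an arbitrary evaluation point; real frameworks generate the backward pass for you.

```julia
# C(θ) = sin(θ₁θ₂) + exp(θ₂ + θ₃) − 2θ₃, evaluated node by node.
function C_with_backward(θ)
    # forward pass, left to right: one line per node of the graph
    a = θ[1] * θ[2]
    b = sin(a)
    c = θ[2] + θ[3]
    d = exp(c)
    e = 2 * θ[3]
    C = b + d - e

    # backward pass, right to left: x_bar means ∂C/∂x
    function backward()
        b_bar, d_bar, e_bar = 1.0, 1.0, -1.0   # C = b + d − e
        a_bar = b_bar * cos(a)                 # b = sin(a)
        c_bar = d_bar * d                      # d = exp(c), so ∂d/∂c = d
        θ1_bar = a_bar * θ[2]                  # a = θ₁θ₂
        θ2_bar = a_bar * θ[1] + c_bar          # θ₂ feeds a and c: adjoints add
        θ3_bar = c_bar + 2 * e_bar             # θ₃ feeds c and e
        return [θ1_bar, θ2_bar, θ3_bar]
    end
    return C, backward
end

θ = [0.5, 1.5, -0.3]
C, backward = C_with_backward(θ)
grad_tape = backward()

# closed-form gradient for comparison
grad_exact = [θ[2] * cos(θ[1] * θ[2]),
              θ[1] * cos(θ[1] * θ[2]) + exp(θ[2] + θ[3]),
              exp(θ[2] + θ[3]) - 2]
println("backward pass: $grad_tape")
println("closed form:   $grad_exact")
```

Note that the backward pass reuses a and d from the forward pass, which is why reverse mode has to keep intermediate values in memory.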

Dual numbers and the tape

Forward mode = dual numbers. Carry each value together with its derivative as a pair (x, \dot x). Formally, adjoin a symbol \varepsilon with \varepsilon^2 = 0 and evaluate f at a + \varepsilon: every elementary rule gives f(a + \varepsilon) = f(a) + f'(a)\,\varepsilon, so the coefficient of \varepsilon that drops out the far end is the derivative. ForwardDiff.jl is exactly this — no formulas, no step size:

using ForwardDiff

f(x) = sin(x^2)
ForwardDiff.derivative(f, 1.0)            # 1.08060…  (= 2x·cos(x²) at x=1, exact)

g(v) = sum(v .^ 2)
ForwardDiff.gradient(g, [1.0, 2.0, 3.0])  # [2.0, 4.0, 6.0]

One forward pass carries one derivative direction, so a full gradient costs about n passes for n inputs — cheap when n is small.
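The dual-number idea fits in a dozen lines if you write it out by hand. The Dual type and derivative helper below are a toy written for illustration, not ForwardDiff.jl's actual implementation, and they cover only the four operations this example needs.

```julia
# A minimal dual number: a value plus the coefficient of ε (with ε² = 0).
struct Dual
    val::Float64
    der::Float64
end

# Each elementary rule carries the derivative along with the value.
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)
Base.exp(a::Dual) = Dual(exp(a.val), exp(a.val) * a.der)

# Seed the input with ε-coefficient 1, then read the coefficient off the output.
derivative(f, x) = f(Dual(x, 1.0)).der

f(x) = sin(x * x)
println("dual-number derivative: ", derivative(f, 1.0))
println("hand derivative 2cos(1): ", 2 * cos(1.0))
```

The function f is written once, for ordinary numbers; the derivative appears purely because the arithmetic was overloaded.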

Reverse mode = the tape (backpropagation). Record every elementary operation on the forward pass — that is the “tape”, i.e. the computational graph above — then walk it backwards once, accumulating the adjoint \partial\mathcal{L}/\partial(\text{each node}) by the chain rule. A single backward pass produces the gradient with respect to all inputs at once, at a cost of roughly one forward evaluation — independent of how many parameters there are. Zygote.jl does this:

using Zygote

g(v) = sum(v .^ 2)
Zygote.gradient(g, [1.0, 2.0, 3.0])[1]    # [2.0, 4.0, 6.0] — same answer, one backward pass

So which mode? For f:\mathbb{R}^n \to \mathbb{R}^m, forward mode costs \propto n and reverse mode \propto m. Training a network is “millions of parameters in, one scalar loss out” (n huge, m = 1), so reverse mode wins — which is why backprop, and every deep-learning framework, is reverse-mode AD.

Hands-on: time the two modes yourself

Don’t take the “\propto n vs. \propto m” claim on faith — measure it. Take the simplest possible scalar-valued function of an n-vector and watch the two modes scale as n grows. You only need @elapsed (built into Julia); the first call to each is slow because it compiles, so we time a second batch of calls.

using ForwardDiff, Zygote

f(v) = sum(sin.(v) .* v.^2)        # R^n → R, one scalar out

for n in (3, 30, 300, 3000)
    v = collect(range(0.1, 1.0; length = n))
    ForwardDiff.gradient(f, v); Zygote.gradient(f, v)         # warm up the compiler
    tf = @elapsed for _ in 1:100; ForwardDiff.gradient(f, v); end
    tr = @elapsed for _ in 1:100; Zygote.gradient(f, v);     end
    println("n=$n  forward=$(round(tf/100*1e6; digits=1))µs  reverse=$(round(tr/100*1e6; digits=1))µs")
end

On a laptop this prints roughly (your absolute numbers will differ; the trend is the point):

n=3     forward=0.4µs    reverse=0.7µs      # tiny n: forward wins (reverse has setup overhead)
n=30    forward=3.0µs    reverse=1.0µs      # crossover already passed
n=300   forward=721µs    reverse=6.3µs      # reverse ~115× faster
n=3000  forward=21000µs  reverse=98µs       # reverse ~215× faster

Read the two columns. Forward mode needs a number of passes that grows linearly in n (each input direction carries its own “\dot x” through the computation), and each pass here also costs O(n), so its total time grows faster than linearly. Reverse mode returns all n partials from a single backward pass, so its time grows only as fast as one evaluation of f. For two or three inputs forward mode is actually faster, because reverse mode pays a fixed bookkeeping overhead for its tape. Somewhere around n \approx 10–30 the two cross, and past that reverse mode runs away with it. That one table is the whole reason training uses reverse mode and the inner spatial derivative of a PINN uses forward mode; keep reading.

One honest caveat — loops. If you rewrite f with an explicit scalar loop,

function g(v)
    s = zero(eltype(v))
    for x in v
        s += sin(x) * x^2          # same maths, written element-by-element
    end
    return s
end

forward mode is unchanged, but Zygote is now much slower. Zygote is a source-to-source reverse engine, and inside a loop it has to store one pullback per iteration (in effect a tape), so its overhead grows with the loop length and it can lose even at n = 3000. The lesson is practical, not theoretical: this style of reverse AD loves vectorised array code and struggles with tight scalar loops. (Newer engines such as Enzyme.jl, which differentiate compiled code rather than a recorded tape, close much of this gap, which is exactly why the SciML stack is moving toward Enzyme.) Try both f and g and compare.

Differentiating the network itself: forward-over-reverse (the PINN twist)

PINNs add something ordinary deep learning never does: the loss contains derivatives of the network’s output with respect to its inputs. For a network u(x;\theta) standing in for a PDE solution, the residual needs spatial derivatives like u_x, u_{xx} at each collocation point — and we still need the gradient of that loss with respect to the parameters \theta. Two differentiations, nested.

The inner derivative — network output w.r.t. its input. Few inputs (x, maybe t), often to second order or higher: exactly where forward mode shines, and it nests — just differentiate the derivative:

using ForwardDiff

u(x)  = sin(2x) * exp(-x)               # pretend this is the network output u(x)
u1(x) = ForwardDiff.derivative(u, x)
u2(x) = ForwardDiff.derivative(u1, x)  # nest forward mode → the 2nd derivative

u2(0.3)    # -3.7006…  (matches the hand derivative e^{-x}(-3 sin 2x - 4 cos 2x))

That u2 is precisely the operator a diffusion or Poisson PINN puts in its residual (Unit 5). It is also why §2.4’s ReLU warning bites: a ReLU network is piecewise linear, so its second derivative is zero almost everywhere and the u_{xx} term in the residual carries no training signal.

The outer gradient — loss w.r.t. parameters. Sum the residual over collocation points into a scalar loss and differentiate w.r.t. the parameters. Here is the whole shape on a tiny three-parameter stand-in “network” u(x;p) = p_1 \sin(p_2 x) + p_3, asked to satisfy u'' = -\pi^2 \sin(\pi x):

using ForwardDiff

unet(x, p) = p[1] * sin(p[2] * x) + p[3]
uxx(x, p)  = ForwardDiff.derivative(z -> ForwardDiff.derivative(y -> unet(y, p), z), x)

xs   = range(0, 1; length = 20)
loss(p) = sum((uxx(x, p) + pi^2 * sin(pi * x))^2 for x in xs)   # PDE residual, summed

ForwardDiff.gradient(loss, [1.0, 3.0, 0.0])   # ∇ₚ loss = [-108.8, -143.7, 0.0]

That gradient drives the optimiser (Adam, then L-BFGS) just like every other fit in this unit — the only new ingredient is that loss had a derivative of the network inside it.

Forward-over-reverse, and what SciML actually uses. Both differentiations above use forward mode because the toy has just three parameters. A real PINN has thousands, so the outer parameter-gradient switches to reverse mode (cheap when m=1 over many parameters) while the inner input-derivative stays forward mode (cheap for one or two inputs). That composition, a reverse-mode gradient of a loss that contains forward-mode input-derivatives, is what this unit calls forward-over-reverse. (Much of the AD literature names compositions outer-over-inner, so you may see this same pairing called reverse-over-forward.) Getting the nesting right is the difference between a PINN that trains in minutes and one that crawls.

You won’t assemble this by hand. NeuralPDE.jl builds the residual and its derivatives for you, layered on Lux.jl (the network), ForwardDiff.jl (the input derivatives), and reverse-mode AD (Zygote.jl, increasingly Enzyme.jl) for the parameter gradient — with ChainRules.jl supplying the rules that let the modes compose. Picking and nesting these backends correctly is a large part of what the SciML stack does so that the residual you write in maths comes out differentiated correctly and fast.

✏️ Section exercise — differentiate a network’s output, then its loss

No training — just the autodiff. Everything here uses only ForwardDiff, runs in under a second, and needs no data. Work through the three parts in order.

(a) Nest forward mode to get a second derivative. Take u(x) = e^{\sin x}.

Define u(x) = exp(sin(x)).
Get the first derivative as a function with u1(x) = ForwardDiff.derivative(u, x), then nest once more to get the second: u2(x) = ForwardDiff.derivative(u1, x). (You can also write the nesting in one line — see the hint.)
Evaluate u2(0.7) and compare it to the closed form u'' = (\cos^2 x - \sin x)\,e^{\sin x} at x = 0.7. They should match to ~15 digits (≈ -0.11281) — this is exact AD, not a finite-difference approximation, so there is no step size to tune.

(b) The forward-over-reverse shape on a toy “network”. Now mimic the structure of a PINN loss, where a derivative of the network sits inside the loss. Use the two-parameter stand-in u(x; p) = p_1 \tanh(p_2 x).

Define unet(x, p) = p[1] * tanh(p[2] * x).
Build the second spatial derivative u_{xx} by nesting ForwardDiff.derivative over the spatial argument x while carrying the parameters p: uxx(x, p) = ForwardDiff.derivative(z -> ForwardDiff.derivative(y -> unet(y, p), z), x).
Over ten points x_i \in [0, 1] (xs = range(0, 1; length = 10)), form the residual loss \mathcal{L}(p) = \sum_i \bigl(u''(x_i; p) + u(x_i; p)\bigr)^2 — this asks the network to satisfy u'' = -u, the simple-harmonic oscillator.
Compute the parameter gradient \nabla_p \mathcal{L} two ways at, say, p = [1.0, 2.0]: once with ForwardDiff.gradient(loss, p), and once with a finite-difference check on each component, \bigl(\mathcal{L}(p + h\,e_i) - \mathcal{L}(p)\bigr)/h with h = 1e-6. The two should agree to ~5 digits (≈ [42.6, 45.0]).

(c) Which mode at scale? In part (b) both differentiations used forward mode because p has only two entries. Suppose instead p had thousands of entries (a real network). Which mode would you switch the outer parameter gradient to, and why? Answer in one sentence using the input-vs-output cost rule (§2.7 timing) — and note that the inner spatial derivative would stay forward mode. That split is exactly the forward-over-reverse pattern PINNs rely on.

💡 Hint

The second derivative is ForwardDiff.derivative(t -> ForwardDiff.derivative(u, t), x). For (b), uxx(x, p) nests the same way over the spatial argument while carrying p; then ForwardDiff.gradient(loss, p) is the whole parameter gradient at once. The finite-difference check is (loss(p .+ h .* eᵢ) - loss(p)) / h with h = 1e-6. For the last part, count inputs vs outputs: reverse mode is the cheap direction when there are many parameters and one scalar loss.

Go to solution →

2.8 Training a model in practice

The previous sections set up everything we need — a function class (MLP), an autodiff engine, an optimiser, and a dataset (MNIST) with two known baselines (random forest ~97%, softmax regression ~92%). Now we train real networks on MNIST and close the gap. We lead with the strongest, most genuinely-Julia piece — a convolutional network in Lux.jl, trained on CPU and GPU — then step back to a plain MLP to see how the GPU payoff scales, and finally show the same MLP in the Python ecosystem (scikit-learn, PyTorch) for the cross-language comparison. The Julia runs are the ones we execute (they fetch MNIST directly, so they need no extra dataset package); the Python listings are static.

A convolutional network on MNIST (Lux.jl)

The softmax classifier (§2.3) and the MLP both flatten each 28 \times 28 image into a length-784 vector — they throw the 2-D structure away before they ever see it. A convolutional network keeps it: each conv layer slides small learnable filters across the image, sharing one set of weights over every location.

That weight sharing is a textbook inductive bias: an assumption about the problem, baked into the architecture, that biases the model toward the right kind of solution before it sees any data. A convolution assumes two things about images: locality (nearby pixels are what matter) and translation equivariance, often loosely called translation invariance (an edge or stroke means the same thing wherever it sits, so one filter should detect it everywhere, and shifting the input simply shifts the feature map). Encoding those assumptions in the architecture means the network doesn’t have to learn them, so it needs far fewer parameters and generalises better. Hold onto this idea, because it is exactly how PINNs work too: there the inductive bias is the PDE itself, wired into the loss, telling the network what physics the solution must satisfy (§2.10, Unit 5). A CNN biases the model toward image structure; a PINN biases it toward physics. It is the same principle, with the prior coming from a law of nature instead of an architecture. Convolutional networks are the subject of Mathematical Engineering of Deep Learning, Chapter 6 (Convolutional Neural Networks).
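Weight sharing and its consequence, that shifting the input just shifts the output, can be checked with a hand-rolled filter. The sketch below is illustrative only: circ_corr is a hypothetical helper that wraps around at the borders so the shift test is exact (Lux's Conv layer uses zero padding or none instead), and the "image" and the 3×3 edge filter are made up.

```julia
# Hand-rolled 2-D cross-correlation with wrap-around borders.
function circ_corr(img, k)
    H, W = size(img)
    kh, kw = size(k)
    out = zeros(eltype(img), H, W)
    for j in 1:W, i in 1:H
        s = zero(eltype(img))
        for b in 1:kw, a in 1:kh
            # the SAME weights k[a, b] are reused at every location (i, j)
            s += k[a, b] * img[mod1(i + a - 1, H), mod1(j + b - 1, W)]
        end
        out[i, j] = s
    end
    return out
end

img = [sin(0.3i) * cos(0.2j) for i in 1:28, j in 1:28]   # hypothetical 28×28 "image"
k   = [1.0 0.0 -1.0; 2.0 0.0 -2.0; 1.0 0.0 -1.0]         # one 3×3 edge filter

shifted_then_filtered = circ_corr(circshift(img, (2, 3)), k)
filtered_then_shifted = circshift(circ_corr(img, k), (2, 3))

println("shift commutes with the filter: ", shifted_then_filtered ≈ filtered_then_shifted)
println("weights: one 3×3 filter = $(length(k)), one dense 784→784 layer = $(784 * 784)")
```

A dense layer would have to learn the same edge detector separately at every pixel; the convolution gets it everywhere from nine numbers.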

The script below is the classic LeNet-5 (LeCun et al., 1998) in Lux.jl — two Conv + MaxPool blocks that learn 2-D features, then three Dense layers that classify them — trained on the same MNIST data, on the CPU and the GPU:

units/unit_02/scripts/conv_mnist_lux.jl

#!/usr/bin/env julia
# ===========================================================================
# Unit 2.8: a convolutional network in Lux.jl (LeNet-5 on MNIST).
#
# A real, multi-layer CONVOLUTIONAL network: the architecture family of
# *Mathematical Engineering of Deep Learning* Chapter 6 (Convolutional Neural
# Networks), and the headline model of §2.8. It is the classic LeNet-5
# (LeCun et al., 1998): two conv+pool blocks that learn local 2-D features,
# then three dense layers that classify them.
#
# Why convolutions? The softmax (§2.3) and MLP (§2.8) flatten each 28×28 image
# to a length-784 vector, throwing away the 2-D spatial structure. A conv layer
# instead slides small learnable filters over the image, sharing weights across
# locations — far fewer parameters, and translation-aware. On the same MNIST
# data that gave us random forest ~97%, softmax ~92% and the MLP ~98%, the CNN
# reaches ~98.8%, the best in the unit.
#
# Convolutions are also where the GPU really earns its keep: each layer is many
# small matmuls, so the per-epoch speed-up is much larger than the MLP's ~3×.
#
# Run on the GPU hub (the @pinn env has Lux + LuxCUDA + cuDNN + CUDA):
#   julia --project=@pinn units/unit_02/scripts/conv_mnist_lux.jl
# Nothing here runs during `quarto render` — the .qmd shows it `eval: false`.
# ===========================================================================

using Lux, LuxCUDA, CUDA, Optimisers, Zygote, Random, Printf, Statistics, Downloads

# --- MNIST loader (direct download + pure-Julia IDX parse) → image tensors ---
const MIRRORS = ["https://storage.googleapis.com/cvdf-datasets/mnist/",
                 "https://ossci-datasets.s3.amazonaws.com/mnist/"]

function fetch_idx(name)
    cache = joinpath(get(ENV, "MNIST_DIR", "/tmp/mnist"), name)
    raw   = replace(cache, ".gz" => "")
    isdir(dirname(cache)) || mkpath(dirname(cache))
    if !isfile(raw)
        ok = false
        for m in MIRRORS
            try
                Downloads.download(m * name, cache); ok = true; break
            catch; end
        end
        ok || error("could not download $name from any mirror")
        run(`gunzip -f $cache`)
    end
    return read(raw)
end

function load_mnist()
    function images(name)
        b  = fetch_idx(name)
        n  = Int(b[5])<<24 | Int(b[6])<<16 | Int(b[7])<<8 | Int(b[8])
        nr = Int(b[9])<<24 | Int(b[10])<<16 | Int(b[11])<<8 | Int(b[12])
        nc = Int(b[13])<<24 | Int(b[14])<<16 | Int(b[15])<<8 | Int(b[16])
        # Lux/cuDNN want WHCN: (width, height, channels, batch). IDX stores each
        # image row by row, so the fastest-varying (first) Julia axis is the column.
        return reshape(Float32.(b[17:end]) ./ 255f0, nc, nr, 1, n)
    end
    function labels(name)
        b = fetch_idx(name)
        n = Int(b[5])<<24 | Int(b[6])<<16 | Int(b[7])<<8 | Int(b[8])
        return Int.(b[9:8+n])                  # 0..9
    end
    Xtr = images("train-images-idx3-ubyte.gz"); ytr = labels("train-labels-idx1-ubyte.gz")
    Xte = images("t10k-images-idx3-ubyte.gz");  yte = labels("t10k-labels-idx1-ubyte.gz")
    return Xtr, ytr, Xte, yte
end

onehot(y) = (Y = zeros(Float32, 10, length(y)); for (j, c) in enumerate(y); Y[c+1, j] = 1f0; end; Y)

# numerically-stable softmax cross-entropy (works on Array or CuArray)
function logitce(logits, Y)
    m  = maximum(logits; dims = 1)
    ls = logits .- m .- log.(sum(exp.(logits .- m); dims = 1))
    return -sum(Y .* ls) / size(Y, 2)
end

# LeNet-5: conv→pool→conv→pool→dense×3. ~62k parameters.
make_cnn() = Chain(
    Conv((5, 5), 1 => 6, relu, pad = 2),   # 28×28×6
    MaxPool((2, 2)),                        # 14×14×6
    Conv((5, 5), 6 => 16, relu),            # 10×10×16
    MaxPool((2, 2)),                        #  5×5×16
    FlattenLayer(),                         # 400
    Dense(400 => 120, relu),
    Dense(120 => 84, relu),
    Dense(84 => 10),
)

function accuracy(model, ps, st, X, y, dev; bs = 1000)
    correct = 0
    for s in 1:bs:size(X, 4)
        idx = s:min(s + bs - 1, size(X, 4))
        ŷ, _ = model(dev(X[:, :, :, idx]), ps, st)
        pred = vec(map(i -> i[1] - 1, argmax(Array(ŷ); dims = 1)))
        correct += sum(pred .== y[idx])
    end
    return correct / length(y)
end

# one mini-batch SGD/Adam epoch; returns updated (ps, opt)
function train_epoch!(model, ps, st, opt, Xd, Yd, n, bs)
    order = randperm(n)
    for s in 1:bs:n
        idx = order[s:min(s + bs - 1, n)]
        xb = Xd[:, :, :, idx]; yb = Yd[:, idx]
        gs = Zygote.gradient(p -> logitce(first(model(xb, p, st)), yb), ps)[1]
        opt, ps = Optimisers.update(opt, ps, gs)
    end
    return ps, opt
end

# ---------------------------------------------------------------------------
println("="^64); println("Unit 2.7 — MNIST convolutional network (LeNet-5, Lux.jl)"); println("="^64)
have_gpu = CUDA.functional()
@printf("GPU available: %s%s\n", have_gpu, have_gpu ? "  ($(CUDA.name(CUDA.device())))" : "")

print("loading MNIST … "); Xtr, ytr, Xte, yte = load_mnist()
Ytr = onehot(ytr)
@printf("train=%d  test=%d  (28×28×1 image tensors)\n", size(Xtr, 4), size(Xte, 4))
@printf("LeNet-5: 2 conv blocks + 3 dense layers, %d parameters\n\n", Lux.parameterlength(make_cnn()))
const BS = 128

# --- per-epoch wall-clock: CPU vs GPU (convolutions favour the GPU heavily) --
println("Per-epoch wall-clock (batch=$BS, full 60k train set):")
@printf("%-8s %12s\n", "device", "sec / epoch"); println("-"^24)
results = Dict{String,Float64}()
for (tag, dev) in (have_gpu ? (("CPU", identity), ("GPU", gpu_device())) : (("CPU", identity),))
    rng = Xoshiro(0)
    model = make_cnn(); ps, st = Lux.setup(rng, model)
    ps = ps |> dev; st = st |> dev
    Xd = Xtr |> dev; Yd = Ytr |> dev
    ps, _ = train_epoch!(model, ps, st, Optimisers.setup(Adam(1f-3), ps), Xd, Yd, size(Xtr,4), BS) # warmup/compile
    dev === identity || CUDA.synchronize()
    t0 = time()
    ps, _ = train_epoch!(model, ps, st, Optimisers.setup(Adam(1f-3), ps), Xd, Yd, size(Xtr,4), BS)
    dev === identity || CUDA.synchronize()
    results[tag] = time() - t0
    @printf("%-8s %12.2f\n", tag, results[tag])
end
if haskey(results, "GPU")
    @printf("\nGPU speed-up: %.1fx per epoch (vs the MLP's ~3x — convolutions are far more compute-dense)\n",
            results["CPU"] / results["GPU"])
end

# --- a full training run on the GPU (or CPU fallback) to real accuracy -------
dev = have_gpu ? gpu_device() : identity
println("\nFull training run on $(have_gpu ? "GPU" : "CPU") — 10 epochs:")
rng = Xoshiro(1); model = make_cnn(); ps, st = Lux.setup(rng, model)
ps = ps |> dev; st = st |> dev
Xd = Xtr |> dev; Yd = Ytr |> dev
opt = Optimisers.setup(Adam(1f-3), ps)
hist = Float64[]
for epoch in 1:10
    global ps, opt = train_epoch!(model, ps, st, opt, Xd, Yd, size(Xtr,4), BS)
    acc = accuracy(model, ps, st, Xte, yte, dev)
    push!(hist, acc)
    @printf("  epoch %2d   test accuracy = %.4f\n", epoch, acc)
end
@printf("\nFinal MNIST test accuracy: %.4f  (forest ~0.97, softmax ~0.92, MLP ~0.98)\n", hist[end])

Captured on the workshop GPU hub (NVIDIA L4):

================================================================
Unit 2.7 — MNIST convolutional network (LeNet-5, Lux.jl)
================================================================
GPU available: true  (NVIDIA L4)
loading MNIST … train=60000  test=10000  (28×28×1 image tensors)
LeNet-5: 2 conv blocks + 3 dense layers, 61706 parameters

Per-epoch wall-clock (batch=128, full 60k train set):
device    sec / epoch
------------------------
CPU             21.18
GPU              1.33

GPU speed-up: 16.0x per epoch (vs the MLP's ~3x — convolutions are far more compute-dense)

Full training run on GPU — 10 epochs:
  epoch  1   test accuracy = 0.9721
  epoch  2   test accuracy = 0.9789
  epoch  3   test accuracy = 0.9843
  epoch  4   test accuracy = 0.9864
  epoch  5   test accuracy = 0.9870
  epoch  6   test accuracy = 0.9848
  epoch  7   test accuracy = 0.9895
  epoch  8   test accuracy = 0.9861
  epoch  9   test accuracy = 0.9886
  epoch 10   test accuracy = 0.9875

Final MNIST test accuracy: 0.9875  (forest ~0.97, softmax ~0.92, MLP ~0.98)

Two payoffs. First, accuracy: the CNN reaches ~98.8%, the best in the unit, topping the random forest (~97%) and the softmax baseline (~92%) and edging out the plain MLP (~98%) we train next. It does so with 61,706 parameters, roughly a quarter (about 26%) of that MLP’s 235,146: the convolution’s inductive bias buys more accuracy with fewer weights, the opposite of the “depth costs parameters” trade a plain MLP makes. Second, the GPU: because a CNN is many small matmuls per layer it is far more compute-dense than a flat MLP, so the per-epoch speed-up here is ~16×, far above what the same card gives the small dense MLP (~3×, which we see next). Convolutions are where the GPU starts to matter, and the collocation counts of PINNs (from Unit 5) push that lever harder still.
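The shape comments in make_cnn() (28×28×6, 14×14×6, 10×10×16, 5×5×16, then 400) follow from the standard output-size formula for a sliding window. The sketch below re-derives them in plain Python. The helper names conv_out and lenet_shapes are made up for this illustration, and it assumes, as the script's own comments imply, that each 2×2 max pool moves with stride 2.

```python
def conv_out(size, kernel, pad=0, stride=1):
    """Output length of one spatial axis after a convolution or pooling window."""
    return (size + 2 * pad - kernel) // stride + 1


def lenet_shapes(side=28):
    """Trace (height, width, channels) through the conv/pool stack of make_cnn()."""
    shapes = []
    side = conv_out(side, kernel=5, pad=2)      # Conv 5x5, pad = 2 keeps the size
    shapes.append((side, side, 6))
    side = conv_out(side, kernel=2, stride=2)   # MaxPool 2x2 halves it
    shapes.append((side, side, 6))
    side = conv_out(side, kernel=5)             # Conv 5x5, no padding, loses 4 pixels
    shapes.append((side, side, 16))
    side = conv_out(side, kernel=2, stride=2)   # MaxPool 2x2 halves it again
    shapes.append((side, side, 16))
    return shapes


shapes = lenet_shapes()
height, width, channels = shapes[-1]
flat = height * width * channels                # input width of the first Dense layer

print(shapes)   # [(28, 28, 6), (14, 14, 6), (10, 10, 16), (5, 5, 16)]
print(flat)     # 400
```

The final 5 × 5 × 16 = 400 is why the first dense layer is written Dense(400 => 120): change the padding or a kernel size and that number has to change with it.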

Training on the GPU — the same one-liner on a plain MLP (Julia, Lux + CUDA)

Now step down to a simpler model: a plain MLP (the §2.3 softmax classifier with two hidden layers added, 784 → 256 → 128 → 10), which makes it easy to see how the GPU’s advantage scales with the model. The same MLP is trained on the CPU and on the GPU; the only change is pushing the parameters and the data to the device with |> gpu_device(). Everything else (the model, the loss, the Adam update) is the same code.

units/unit_02/scripts/mnist_gpu_lux.jl

#!/usr/bin/env julia
# ===========================================================================
# Unit 2.7 — a real deep-learning training run on the GPU (MNIST, Lux.jl).
#
# Section 2.7 trains an MLP on MNIST in scikit-learn and PyTorch. The PyTorch
# version already moves tensors to `cuda`; this
# is the *Julia* side of that story: the identical Lux model trained on the CPU
# and on the GPU, so you can see (a) that the move is a one-liner — push the
# params and the data to the device — and (b) the speed-up on an image-scale
# dataset where the GPU's batched matmuls actually pay off.
#
# MNIST is fetched directly from a public mirror and parsed in a few lines of
# pure Julia (no MLDatasets / PyCall dependency). Model: 784-256-128-10 MLP.
#
# Run on the GPU hub (the @pinn env has Lux + LuxCUDA + cuDNN + CUDA):
#   julia --project=@pinn units/unit_02/scripts/mnist_gpu_lux.jl
# Nothing here runs during `quarto render` — the .qmd shows it `eval: false`.
# ===========================================================================

using Lux, LuxCUDA, CUDA, Optimisers, Zygote, Random, Printf, Statistics, Downloads

# --- MNIST loader (direct download + pure-Julia IDX parse) -----------------
const MIRRORS = ["https://storage.googleapis.com/cvdf-datasets/mnist/",
                 "https://ossci-datasets.s3.amazonaws.com/mnist/"]

function fetch_idx(name)
    cache = joinpath(get(ENV, "MNIST_DIR", "/tmp/mnist"), name)
    raw   = replace(cache, ".gz" => "")
    isdir(dirname(cache)) || mkpath(dirname(cache))
    if !isfile(raw)
        ok = false
        for m in MIRRORS
            try
                Downloads.download(m * name, cache); ok = true; break
            catch; end
        end
        ok || error("could not download $name from any mirror")
        run(`gunzip -f $cache`)
    end
    return read(raw)
end

function load_mnist()
    function images(name)
        b = fetch_idx(name)
        n  = Int(b[5])<<24 | Int(b[6])<<16 | Int(b[7])<<8 | Int(b[8])
        nr = Int(b[9])<<24 | Int(b[10])<<16 | Int(b[11])<<8 | Int(b[12])
        nc = Int(b[13])<<24 | Int(b[14])<<16 | Int(b[15])<<8 | Int(b[16])
        px = reshape(b[17:end], nr*nc, n)
        return Float32.(px) ./ 255f0          # (784, n), normalised
    end
    function labels(name)
        b = fetch_idx(name)
        n = Int(b[5])<<24 | Int(b[6])<<16 | Int(b[7])<<8 | Int(b[8])
        return Int.(b[9:8+n])                  # 0..9
    end
    Xtr = images("train-images-idx3-ubyte.gz"); ytr = labels("train-labels-idx1-ubyte.gz")
    Xte = images("t10k-images-idx3-ubyte.gz");  yte = labels("t10k-labels-idx1-ubyte.gz")
    return Xtr, ytr, Xte, yte
end

onehot(y) = (Y = zeros(Float32, 10, length(y)); for (j, c) in enumerate(y); Y[c+1, j] = 1f0; end; Y)

# numerically-stable softmax cross-entropy (works on Array or CuArray)
function logitce(logits, Y)
    m  = maximum(logits; dims = 1)
    ls = logits .- m .- log.(sum(exp.(logits .- m); dims = 1))
    return -sum(Y .* ls) / size(Y, 2)
end

function accuracy(model, ps, st, X, y, dev)
    ŷ, _ = model(dev(X), ps, st)
    pred = vec(map(i -> i[1] - 1, argmax(Array(ŷ); dims = 1)))
    return mean(pred .== y)
end

# --- one training epoch (mini-batch Adam); returns updated (ps, opt) --------
function train_epoch!(model, ps, st, opt, Xd, Yd, n, bs)
    order = randperm(n)
    for s in 1:bs:n
        idx = order[s:min(s+bs-1, n)]
        xb = Xd[:, idx]; yb = Yd[:, idx]
        gs = Zygote.gradient(p -> logitce(first(model(xb, p, st)), yb), ps)[1]
        opt, ps = Optimisers.update(opt, ps, gs)
    end
    return ps, opt
end

# ---------------------------------------------------------------------------
println("="^64); println("Unit 2.7 — MNIST MLP training on CPU vs GPU (Lux.jl)"); println("="^64)
have_gpu = CUDA.functional()
@printf("GPU available: %s%s\n", have_gpu, have_gpu ? "  ($(CUDA.name(CUDA.device())))" : "")

print("loading MNIST … "); Xtr, ytr, Xte, yte = load_mnist()
Ytr = onehot(ytr)
@printf("train=%d  test=%d  (28x28 → 784)\n\n", size(Xtr, 2), size(Xte, 2))

make_model() = Chain(Dense(784 => 256, relu), Dense(256 => 128, relu), Dense(128 => 10))
const BS = 128

# --- per-epoch wall-clock: CPU vs GPU (same model, same data) --------------
println("Per-epoch wall-clock (batch=$BS, full 60k train set):")
@printf("%-8s %12s\n", "device", "sec / epoch"); println("-"^24)
results = Dict{String,Float64}()
for (tag, dev) in (have_gpu ? (("CPU", identity), ("GPU", gpu_device())) : (("CPU", identity),))
    rng = Xoshiro(0)
    model = make_model(); ps, st = Lux.setup(rng, model)
    ps = ps |> dev; st = st |> dev
    Xd = Xtr |> dev; Yd = Ytr |> dev
    ps, _ = train_epoch!(model, ps, st, Optimisers.setup(Adam(1f-3), ps), Xd, Yd, size(Xtr,2), BS) # warmup/compile
    dev === identity || CUDA.synchronize()
    t0 = time()
    ps, _ = train_epoch!(model, ps, st, Optimisers.setup(Adam(1f-3), ps), Xd, Yd, size(Xtr,2), BS)
    dev === identity || CUDA.synchronize()
    results[tag] = time() - t0
    @printf("%-8s %12.2f\n", tag, results[tag])
end
if haskey(results, "GPU")
    @printf("\nGPU speed-up: %.1fx per epoch\n", results["CPU"] / results["GPU"])
end

# --- a full training run on the GPU (or CPU fallback) to real accuracy ------
dev = have_gpu ? gpu_device() : identity
println("\nFull training run on $(have_gpu ? "GPU" : "CPU") — 10 epochs:")
rng = Xoshiro(1); model = make_model(); ps, st = Lux.setup(rng, model)
ps = ps |> dev; st = st |> dev
Xd = Xtr |> dev; Yd = Ytr |> dev
opt = Optimisers.setup(Adam(1f-3), ps)
hist = Float64[]
for epoch in 1:10
    global ps, opt = train_epoch!(model, ps, st, opt, Xd, Yd, size(Xtr,2), BS)
    acc = accuracy(model, ps, st, Xte, yte, dev)
    push!(hist, acc)
    @printf("  epoch %2d   test accuracy = %.4f\n", epoch, acc)
end
@printf("\nFinal MNIST test accuracy: %.4f\n", hist[end])

try
    using CairoMakie
    f = Figure(size = (560, 380))
    ax = Axis(f[1, 1], title = "MNIST MLP on the GPU — test accuracy",
              xlabel = "epoch", ylabel = "test accuracy")
    lines!(ax, 1:length(hist), hist, linewidth = 3)
    scatter!(ax, 1:length(hist), hist, markersize = 9)
    figdir = get(ENV, "GPU_FIG_DIR", joinpath(@__DIR__, "..", "figures"))
    isdir(figdir) || mkpath(figdir)
    save(joinpath(figdir, "mnist_gpu_accuracy.png"), f)
    println("wrote figures/mnist_gpu_accuracy.png")
catch e
    println("(figure skipped: ", e, ")")
end

Captured on the workshop GPU hub (NVIDIA L4):

================================================================
Unit 2.7 — MNIST MLP training on CPU vs GPU (Lux.jl)
================================================================
GPU available: true  (NVIDIA L4)
loading MNIST … train=60000  test=10000  (28x28 → 784)

Per-epoch wall-clock (batch=128, full 60k train set):
device    sec / epoch
------------------------
CPU              2.01
GPU              0.67

GPU speed-up: 3.0x per epoch

Full training run on GPU — 10 epochs:
  epoch  1   test accuracy = 0.9616
  epoch  2   test accuracy = 0.9680
  epoch  3   test accuracy = 0.9768
  epoch  4   test accuracy = 0.9780
  epoch  5   test accuracy = 0.9766
  epoch  6   test accuracy = 0.9784
  epoch  7   test accuracy = 0.9791
  epoch  8   test accuracy = 0.9826
  epoch  9   test accuracy = 0.9801
  epoch 10   test accuracy = 0.9796

Final MNIST test accuracy: 0.9796

Two things to take away. First, the GPU is about 3× faster per epoch here — real, but not dramatic, because a 784-wide MLP on 28×28 images is small: the batched matmuls don’t fully occupy the card and the CPU is no slouch. That’s the flip side of the CNN’s ~16× above — the more compute-dense the model, the more the GPU pulls ahead, and the collocation-point counts that PINNs demand (Unit 5) push it further still. Second, the move itself is trivial: one gpu_device() and the same code trains on either processor. The full 10-epoch run reaches 97.96% test accuracy:

MNIST test accuracy over 10 training epochs of the Lux MLP, trained on the GPU.

Regularisation, batch norm, dropout

Three standard knobs to keep training honest:

Weight regularisation — add \lambda \|\theta\|^2 to the loss (L_2, “weight decay”) to keep the parameters small.
Batch normalisation — normalise activations within each mini-batch; lets you use higher learning rates and reduces sensitivity to initialisation.
Dropout — randomly zero a fraction of activations during training; approximately equivalent to averaging over an ensemble of thinned networks.

Dropout: at each training step, a random subset of hidden units is masked out; at test time all units are used with weights rescaled. (Figure from *MEDL*, §5.7.)
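Two of these knobs fit in a few lines each. The sketch below is a plain-Python illustration, not the code any of the unit's scripts use. The names dropout and l2_penalty are hypothetical, and it implements the "inverted" dropout variant, which rescales the surviving activations during training so that nothing needs rescaling at test time. That differs from the test-time rescaling in the caption above, but gives the same expected activation.

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale survivors."""
    if not training or p_drop == 0.0:
        return list(activations)            # test time: every unit passes through
    keep = 1.0 - p_drop
    # dividing by keep leaves the expected value of each activation unchanged
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def l2_penalty(weights, lam):
    """Weight-decay term lam * ||theta||^2 that is added to the data loss."""
    return lam * sum(w * w for w in weights)


rng = random.Random(0)                      # seeded so the run is repeatable
h = [1.0] * 10_000                          # a layer of unit activations
dropped = dropout(h, p_drop=0.5, rng=rng)

print(sum(dropped) / len(dropped))          # close to 1.0: the mean is preserved
print(l2_penalty([3.0, 4.0], lam=0.1))      # 0.1 * (9 + 16)
```

Roughly half the entries of dropped are zero and the rest are 2.0, so the layer's average output stays near 1.0 even though a different random half of the units is silenced on each call.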

For PINNs the regularisation knobs you care about are physics terms in the loss rather than weight penalties or dropout — the PDE residual itself is a powerful regulariser, as we’ll see from Unit 5 onward.

The same MLP in Python — `scikit-learn`

For the cross-language picture, here is that same MLP — the 784 → 256 → 128 → 10 network we just trained on the GPU above — in the Python ecosystem. These listings are static — we don’t execute them in the build; they’re the Python mirror of the Julia workflow above. First the few-line scikit-learn version; expect ~98% test accuracy — comfortably beating the softmax baseline, edging out the random forest:

units/unit_02/scripts/mnist_mlp_sklearn.py

from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = fetch_openml("mnist_784", version=1, as_frame=False, return_X_y=True)
X = X.astype("float32") / 255.0
y = y.astype("int64")
X_train, X_test = X[:60_000], X[60_000:]
y_train, y_test = y[:60_000], y[60_000:]

clf = MLPClassifier(
    hidden_layer_sizes=(256, 128), activation="relu",
    solver="adam", batch_size=128, max_iter=20, random_state=0,
)
clf.fit(X_train, y_train)
print(f"MNIST test accuracy (sklearn MLP): "
      f"{accuracy_score(y_test, clf.predict(X_test)):.4f}")

The same MLP in Python — `PyTorch`

The same model with explicit autodiff and an explicit training loop. This is the PyTorch idiom you’ll see everywhere in the deep learning book:

units/unit_02/scripts/mnist_mlp_pytorch.py

# pip install torch torchvision
import torch, torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
tf  = transforms.Compose([transforms.ToTensor(), transforms.Lambda(torch.flatten)])
tr  = datasets.MNIST("~/torch-data", train=True,  download=True, transform=tf)
te  = datasets.MNIST("~/torch-data", train=False, download=True, transform=tf)
tl  = DataLoader(tr, batch_size=128, shuffle=True)
vl  = DataLoader(te, batch_size=512)

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
).to(device)
opt   = torch.optim.Adam(model.parameters(), lr=1e-3)
loss  = nn.CrossEntropyLoss()

for epoch in range(10):                 # mini-batch SGD with Adam
    model.train()
    for X, y in tl:
        X, y = X.to(device), y.to(device)
        opt.zero_grad()
        loss(model(X), y).backward()    # backprop
        opt.step()

    # eval
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for X, y in vl:
            X, y = X.to(device), y.to(device)
            pred = model(X).argmax(dim=1)
            correct += (pred == y).sum().item()
            total   += y.size(0)
    print(f"epoch {epoch:2d}  test acc = {correct / total:.4f}")

The structural beats — model.train() / model.eval() mode, opt.zero_grad() to clear accumulated gradients, the inner loss.backward(); opt.step() two-liner — are PyTorch idiom you’ll see in every textbook and tutorial. Functionally it’s the same training loop as mnist_linear_lux.jl (§2.3). The only ecosystem difference is whether parameters live on the model object (PyTorch) or in a separate ps tuple (Lux).

Run it via scripts/mnist_mlp_pytorch.py after installing torch + torchvision; expect ~30 s/epoch on a modern laptop CPU, much less on GPU.

All three implementations follow the same skeleton: load data, define model, set optimiser, loop over minibatches, compute loss, step. The Julia version makes the parameter state explicit; PyTorch and scikit-learn hide it. Beyond that, the differences are surface-level, so pick by ecosystem.
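To see that skeleton with every framework stripped away, here is a minimal sketch in plain Python. It fits a two-parameter line y = w·x + b to noiseless synthetic data with hand-written gradients and plain mini-batch SGD. This is an illustrative stand-in: the function train_linear, the data, and the hyperparameters are all assumptions made for the example, and the scripts above use Adam and automatic differentiation instead.

```python
import random


def train_linear(xs, ys, lr=0.2, epochs=200, batch_size=8, seed=0):
    """Mini-batch SGD on mean-squared error for the model y_hat = w * x + b."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                              # define model (explicit parameters)
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)                         # new mini-batch order every epoch
        for s in range(0, len(idx), batch_size):
            batch = idx[s:s + batch_size]
            # compute loss gradients: d/dw and d/db of mean((w*x + b - y)^2)
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * gw                         # step
            b -= lr * gb
    return w, b


# load data: 32 points on the line y = 2x + 1
xs = [i / 31 for i in range(32)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = train_linear(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")               # should land near w = 2, b = 1
```

Swapping the two gradient lines for an autodiff call and the two update lines for an optimiser object gives the PyTorch and Lux loops above. Like Lux, the sketch keeps its parameters (w, b) outside any model object.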

✏️ Section exercise — parameters, accuracy, and a two-phase finetune

The unit has now climbed the whole MNIST ladder: random forest (~97%), softmax regression (~92%), an MLP in Python and in Julia (~98%), and the LeNet CNN above (~98.8%). Two questions about that ladder:

Parameter efficiency. Use Lux.parameterlength to count the parameters of three Julia models: the §2.3 softmax classifier (784 → 10), the GPU-section MLP (784 → 256 → 128 → 10), and the LeNet CNN. Rank them by accuracy and by parameter count — the CNN should come out most accurate while not being the largest. Explain in one sentence why a convolution needs so few weights.
A miniature Adam → L-BFGS schedule. Re-run conv_mnist_lux.jl but, after the 10 Adam(1f-3) epochs, do 5 more epochs at a 10× smaller learning rate (Adam(1f-4)). Note the effect on the final test accuracy. That fast-then-careful pattern is a miniature of the Adam → L-BFGS schedule every PINN in Units 5–10 uses.

💡 Hint

Lux.parameterlength(model) counts any model with no hand-arithmetic — build all three the same way: softmax Lux.Chain(Lux.Dense(784 => 10)), MLP Lux.Chain(Lux.Dense(784=>256,relu), Lux.Dense(256=>128,relu), Lux.Dense(128=>10)), and the CNN via the script’s make_cnn(). You should get 7,850 / 235,146 / 61,706 — the CNN is most accurate yet under a third of the MLP’s weights, because a convolution reuses one small filter at every image position. For the finetune, keep the trained ps, rebuild only the optimiser state opt = Optimisers.setup(Adam(1f-4), ps), then run 5 more epochs with the same per-epoch training call the script already defines.
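As a cross-check on what Lux.parameterlength reports, the three totals can be reproduced by hand. The sketch below is plain Python with two made-up helpers, dense_params and conv_params, and assumes every layer carries a bias, as the Lux defaults used in the scripts do.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolution (one filter per output channel)."""
    return kernel * kernel * c_in * c_out + c_out


softmax = dense_params(784, 10)
mlp = dense_params(784, 256) + dense_params(256, 128) + dense_params(128, 10)
cnn = (conv_params(5, 1, 6) + conv_params(5, 6, 16)
       + dense_params(400, 120) + dense_params(120, 84) + dense_params(84, 10))

print(softmax, mlp, cnn)        # 7850 235146 61706
print(round(cnn / mlp, 3))      # 0.262
```

The two convolution layers together contribute only 156 + 2,416 = 2,572 of the CNN's weights, because each 5×5 filter is reused at every image position. Almost all of the remainder sits in the first dense layer.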

Go to solution →

2.9 Kolmogorov-Arnold networks (KANs)

A recent alternative to the MLP that’s been picking up traction in the PINN community: Kolmogorov-Arnold networks (Liu et al. 2024). Where an MLP puts a fixed activation \sigma at every node and learns the weights on the edges, a KAN puts a learnable univariate function (a B-spline) on every edge and just sums them at the nodes:

\underbrace{f_{\text{MLP}}(x) = W_2\, \sigma(W_1 x + b_1) + b_2}_{\text{fixed }\sigma\text{, learnable }W} \qquad\text{vs.}\qquad \underbrace{f_{\text{KAN}}(x)_q = \sum_p \phi_{q, p}^{(2)}\!\Bigl(\sum_r \phi_{p, r}^{(1)}(x_r)\Bigr)}_{\text{learnable univariate }\phi\text{ on every edge}}.

The motivation is the Kolmogorov-Arnold superposition theorem (1957): any continuous multivariate function on a compact domain can be written as a finite sum of single-variable functions of single-variable functions. That is the same universal-approximation spirit as MLPs, but with a structurally different parameterisation. In practice KANs promise fewer parameters and more interpretability (you can plot every edge’s learned \phi), at the cost of a more expensive forward pass.
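For reference, the theorem in its usual form, written with the same index convention as the KAN formula above (input index r, hidden index p). The symbols \Phi_p and the domain [0,1]^n are the standard textbook statement rather than notation taken from this unit:

```latex
f(x_1, \dots, x_n) \;=\; \sum_{p=0}^{2n} \Phi_p\!\Bigl(\sum_{r=1}^{n} \phi_{p,r}(x_r)\Bigr),
\qquad f : [0,1]^n \to \mathbb{R} \ \text{continuous},
\quad \phi_{p,r} : [0,1] \to \mathbb{R}, \ \ \Phi_p : \mathbb{R} \to \mathbb{R} \ \text{continuous}.
```

Read as a network, this is a two-layer KAN of shape n → 2n+1 → 1 with a scalar output: the inner functions \phi_{p,r} play the role of \phi^{(1)}_{p,r} and the outer functions \Phi_p the role of \phi^{(2)}_{1,p}. The theorem only guarantees that continuous univariate functions exist; they can be very irregular, which is one reason practical KANs use wider and deeper stacks of smooth, learnable splines instead of this exact shape.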

For PINNs the appeal is that learnable univariate splines often need many fewer collocation points to capture sharp features in the PDE solution. Some preliminary results (KAN-PINN) suggest KANs can outperform MLPs on stiff or oscillatory PDEs — this is an active research front.

KAN in Julia (`KolmogorovArnold.jl`)

A 2-layer KAN approximating a spatiotemporal field — the decaying heat-equation eigenmode u(x, t) = e^{-t}\sin(\pi x) on (x, t) \in [0, 1]^2 — in the Julia ecosystem using KolmogorovArnold.jl. Two inputs (x, t), one scalar output: the same shape a PINN’s solution network has.

units/unit_02/scripts/kan_julia.jl

#!/usr/bin/env julia
# A two-input KAN. Approximate the spatiotemporal field
#
#   u(x, t) = e^{-t} sin(pi x)
#
# — a single decaying eigenmode of the heat equation on (x, t) in [0, 1]^2 —
# with a small Kolmogorov-Arnold network, then plot the learned field next to
# the exact one. Two inputs (x, t) in, one scalar out: the same shape a PINN's
# solution network has.

using KolmogorovArnold, Lux, Random, Zygote, Optimisers, Statistics, Plots

rng = Random.MersenneTwister(0)
u(x, t) = exp(-t) * sinpi(x)                  # the field we want the KAN to learn

# training data: random (x, t) samples in the unit square
n = 2000
X = rand(rng, Float32, 2, n)                  # row 1 = x, row 2 = t
y = reshape(u.(X[1, :], X[2, :]), 1, n)

# KAN 2 → 5 → 1. KDense(in, out, grid_len): grid_len is the number of basis
# centres on each edge — the per-edge spline/RBF resolution. More centres = a
# more flexible learnable φ, and more parameters.
grid_len = 6
model = Lux.Chain(
    KDense(2, 5, grid_len; basis_func = rbf, normalizer = softsign),
    KDense(5, 1, grid_len; basis_func = rbf, normalizer = softsign),
)
ps, st = Lux.setup(rng, model)

loss(ps) = mean(abs2, first(model(X, ps, st)) .- y)
opt_state = Optimisers.setup(Optimisers.Adam(1f-2), ps)
for k in 1:3000
    global ps, opt_state            # reassign the script-level vars (soft scope)
    gs = first(Zygote.gradient(loss, ps))
    opt_state, ps = Optimisers.update(opt_state, ps, gs)
end
@info "final training MSE = $(loss(ps))"

# evaluate the trained KAN on a regular grid and compare to the exact field
Ng = 100
xs = range(0f0, 1f0; length = Ng)
ts = range(0f0, 1f0; length = Ng)
grid  = reduce(hcat, [[x, t] for t in ts for x in xs])
û     = reshape(first(model(grid, ps, st)), Ng, Ng)   # KAN prediction, indexed [x, t]
exact = [u(x, t) for x in xs, t in ts]

base = (xlabel = "x", ylabel = "t", aspect_ratio = :equal,
        xlims = (0, 1), ylims = (0, 1), framestyle = :box)
p1 = heatmap(xs, ts, exact'; title = "exact  u(x,t) = e⁻ᵗ sin(πx)", c = :viridis, base...)
p2 = heatmap(xs, ts, û';     title = "KAN approximation",           c = :viridis, base...)
p3 = heatmap(xs, ts, abs.(û .- exact)'; title = "|KAN − exact|",    c = :magma,   base...)
plt = plot(p1, p2, p3; layout = (1, 3), size = (1150, 360),
           bottom_margin = 5Plots.mm, left_margin = 5Plots.mm)
savefig(plt, joinpath(@__DIR__, "..", "figures", "kan_xt.png"))

A `2 → 5 → 1` KAN (105 parameters) trained on 2000 random (x, t) samples to approximate the heat-equation eigenmode u(x, t) = e^{-t}\sin(\pi x). **Left:** the exact field; **middle:** the KAN’s prediction; **right:** the pointwise error — under 1.5\times10^{-2} everywhere after 3000 Adam steps (training MSE \approx 4\times10^{-6}). Two inputs in, one scalar out — the shape of a PINN solution network.

What’s actually on each edge — the splines and the parameters. Every edge’s learnable function \phi is a weighted sum of grid_len fixed basis bumps — B-splines in the original Liu et al. paper, radial basis functions (smooth Gaussian-like bumps) in KolmogorovArnold.jl. So the three KDense arguments are (in_dims, out_dims, grid_len): an in → out layer has in × out edges, and each edge carries grid_len trainable coefficients — the heights of its bumps — so the layer holds roughly in × out × grid_len parameters. The normalizer (here softsign) squashes each input into the fixed range the bumps live on, so one grid covers any input. Turning grid_len up makes every \phi more flexible — it captures sharper features, but adds parameters and slows the forward pass; it is the KAN’s main capacity knob, the way width and depth are an MLP’s. The 2 → 5 → 1 network above has 2{\times}5 + 5{\times}1 = 15 edges, and with grid_len = 6 spline heights plus one base weight on each edge (7 apiece) that comes to 105 trainable parameters — and it fits the field above to a training MSE of \approx 4\times10^{-6}. (Contrast the MLP, where each edge holds a single weight; a KAN moves the parameters onto the edges as little functions.)

On one KAN edge the basis bumps are **fixed** (left): six smooth Gaussian-like humps spaced across the input range. The edge’s learnable function \phi (right, thick black) is just their **weighted sum** — training only slides the six heights w_k up and down (the faint dashed curves are the scaled bumps that add up to it). Turning `grid_len` up adds more bumps, so \phi can bend more sharply.
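The parameter arithmetic above is easy to reproduce. The sketch below follows the text's own accounting (grid_len bump heights plus one base weight per edge) and does not query KolmogorovArnold.jl itself. The helper kan_params is hypothetical, and in Julia Lux.parameterlength remains the authoritative count.

```python
def kan_params(widths, grid_len, base_weight=True):
    """Return (edges, parameters) for a KAN whose layer widths are given in order."""
    per_edge = grid_len + (1 if base_weight else 0)     # bump heights + optional base weight
    edges = sum(n_in * n_out for n_in, n_out in zip(widths[:-1], widths[1:]))
    return edges, edges * per_edge


print(kan_params([2, 5, 1], grid_len=6))     # (15, 105)
print(kan_params([2, 5, 1], grid_len=12))    # (15, 195)
```

Doubling grid_len leaves the edge count alone and nearly doubles the parameters, which is the sense in which grid_len is the KAN's capacity knob.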

A spline, in plain terms, is a curve built from simple local pieces instead of one global formula. Each bump only matters near its own centre, so changing one height w_k reshapes \phi near that bump and leaves the rest of the curve alone. That locality is the whole point: a spline can capture a sharp feature — a kink, a steep front — by adjusting a couple of nearby bumps, without disturbing the function everywhere else. It is exactly the property KANs lean on for PINNs, where a few well-placed bumps can resolve a steep PDE solution that a single smooth global activation would smear out. (The original Liu et al. paper uses B-splines — piecewise-polynomial bumps tied to a grid of knots; KolmogorovArnold.jl uses radial basis functions, which behave the same way for our purposes.)
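Locality is easy to check numerically. The sketch below builds one edge function as a weighted sum of six Gaussian bumps and then raises a single height. The centres, the width of 0.2, the heights and the helper make_edge are all illustrative assumptions; the basis and input normalisation inside KolmogorovArnold.jl may differ in detail.

```python
import math


def make_edge(centres, heights, width):
    """phi(x) = sum_k heights[k] * exp(-((x - centres[k]) / width)^2)."""
    def phi(x):
        return sum(w * math.exp(-((x - c) / width) ** 2)
                   for c, w in zip(centres, heights))
    return phi


centres = [-1.0 + 2.0 * k / 5 for k in range(6)]    # six fixed bumps spread over [-1, 1]
heights = [0.5, -0.2, 0.8, 0.1, -0.4, 0.3]          # the only trainable numbers
width = 0.2

phi_before = make_edge(centres, heights, width)

bumped = list(heights)
bumped[0] += 1.0                                    # raise only the bump centred at x = -1
phi_after = make_edge(centres, bumped, width)

near = abs(phi_after(-1.0) - phi_before(-1.0))      # change right at that bump's centre
far = abs(phi_after(1.0) - phi_before(1.0))         # change at the opposite end

print(near, far)
```

Here near comes out at about 1, the full change in height, while far is negligible: the edited bump has decayed to essentially zero by the time it reaches x = 1. A weight in an MLP has no such guarantee, since changing it moves the output over a whole half-space of inputs.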

KAN in Python (`pykan`)

The reference Liu et al. implementation, in Python — on their original two-variable benchmark f(x_1, x_2) = \exp(\sin(\pi x_1) + x_2^2):

units/unit_02/scripts/kan_pykan.py

# pip install pykan
import torch
from kan import KAN, create_dataset

# build dataset for f(x1, x2) = exp(sin(pi x1) + x2^2)
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]]**2)
dataset = create_dataset(f, n_var=2, train_num=1000, test_num=1000)

# KAN of shape 2→5→1: grid=5 grid intervals per edge spline, k=3 (cubic) B-splines
model = KAN(width=[2, 5, 1], grid=5, k=3, seed=0)
model.fit(dataset, opt="LBFGS", steps=50, lamb=0.0)

print(f"test loss = {torch.mean((model(dataset['test_input']) - dataset['test_label'])**2).item():.3e}")

Available as scripts/kan_pykan.py. Requires pip install pykan (which pulls in torch). Notice the default training optimiser is L-BFGS: with only 1,000 training points this is a small, full-batch problem, exactly the regime we just discussed.

When to reach for a KAN

The honest answer in mid-2026: rarely, so far, but worth knowing.

Where they shine. Low-dimensional problems where interpretability matters; PDE solutions with sharp features that an MLP would need many neurons to capture; symbolic-regression applications.
Where they don’t. High-dimensional image-style inputs (CNNs / transformers still dominate); production deep learning (the MLP-Adam-cross-entropy stack is still vastly more debugged).
For PINNs specifically. A reasonable experimental swap-in for the inner MLP of a PINN — NeuralPDE.jl accepts any Lux chain, so the substitution is a one-liner. Whether KANs meaningfully improve PINN convergence is still being worked out.

We won’t return to KANs in the worked examples — Units 5–9 stick with MLP-PINNs — but they belong in the working vocabulary of any practitioner reading the 2024–26 PINN literature.

✏️ Section exercise — KAN vs MLP on a sharp feature

Sharp features are where KANs claim an edge. Fit both a 2-layer KAN (KolmogorovArnold.jl, as in the script above) and a parameter-matched tanh MLP to the 1-D function f(x) = |x| + 0.2\sin(5x) on [-1, 1] — smooth everywhere except the kink at zero. Train both with Adam for the same number of iterations, compare final MSE, and plot both fits near x = 0. Which model captures the kink better at equal parameter count, and what happens to the MLP if you give it 10× the training iterations?

💡 Hint

KolmogorovArnold.jl is already in @pinn, so using KolmogorovArnold just works. Target: y = abs.(x) .+ 0.2f0 .* sin.(5f0 .* x). KAN: stack two KDense(in, out, 6; basis_func = rbf, normalizer = softsign) layers (e.g. 1=>6 then 6=>1). Parameter-match the MLP with a quick Lux.parameterlength check, because an unmatched comparison is meaningless. By the per-edge accounting above, a 1=>6=>1 KAN has 12 × 7 = 84 parameters, whereas Lux.Chain(Lux.Dense(1=>14,tanh), Lux.Dense(14=>14,tanh), Lux.Dense(14=>1)) has 253, so widen the KAN (1=>18=>1 gives 252) or shrink the MLP until the two counts are close. Train both with the same Adam loop and iteration count, then zoom the plot into x ∈ [-0.1, 0.1]; the global MSE hides where the kink lives. Then give the MLP 10× the iterations and watch it slowly close the gap (an efficiency edge, not an expressiveness one).

Go to solution →

2.10 What carries over to PINNs

Three ideas from this unit survive the journey to physics-informed networks:

Loss minimisation as the unifying training principle. PINNs add residual terms to the loss but keep the empirical-risk skeleton.
Generalisation as the goal. A PINN that fits its collocation points but violates physics elsewhere is overfitting; held-out collocation samples play the role of the test set.
Regularisation by physics. The PDE residual is a domain prior — a regulariser sharper than any L_2 penalty, because it actually encodes what the function must satisfy.

With those in hand, Unit 3 lifts the function-approximation idea into a SciML landscape, and Unit 4 introduces the ODE side that PINNs will meet head-on in Unit 5.

✏️ Section exercise — translate the vocabulary

A quick dictionary drill before moving on. For each supervised-learning concept listed here, write down its PINN counterpart and one sentence on what changes: (a) a labelled training example; (b) the test set; (c) overfitting; (d) the L_2 weight penalty; (e) the mini-batch. Then answer: in what sense does a PINN have “infinite training data”, and what is the catch?

💡 Hint

For each concept, ask: what supplies the supervision signal in a PINN (the equation), and what plays the role of unseen data (points you didn’t sample)? The §2.1 ERM picture maps over almost mechanically — the one genuinely new idea is that the regulariser is the physics.

Go to solution →
