Unit 2: ML Foundations: From Classical to Deep

Published

12/06/2026

A walk from supervised learning and random forests through to neural networks, optimisation, and the autodiff stack that PINNs sit on. This unit follows the framing of Mathematical Engineering of Deep Learning (Nazarathy & Klok), chapters 2–5, with code in Julia (MLJ.jl for classical ML, Lux.jl for deep networks) and a scikit-learn / PyTorch cameo so the cross-ecosystem story stays alive.

A single unit can’t cover everything — the goal here is shared vocabulary and a feel for which classical ideas survive the journey to PINNs. Deep-network theory and the modern PINN toolkit get their own treatment in Unit 7.

2.1 Supervised learning fundamentals

The setup

Given labelled data \{(x_i, y_i)\}_{i=1}^N, learn a function f: \mathbb{R}^n \to \mathcal{Y} such that f(x) \approx y on unseen inputs. The output space \mathcal{Y} is continuous (regression) or a finite set (classification). The art of supervised learning is choosing a function class — linear, tree-based, neural — and a fitting procedure that generalises beyond the training set.

Loss, true risk, empirical risk

A loss \ell(\hat y, y) measures how bad a single prediction is. The true risk is the population mean

R(f) = \mathbb{E}_{(x, y)\sim p}[\ell(f(x), y)].

We can’t compute it, so we minimise the empirical risk

\hat R(f) = \frac{1}{N}\sum_{i=1}^N \ell(f(x_i), y_i).

All of supervised learning is a story about how, when, and whether \hat R approximates R. PINNs are no exception — their “data” includes scattered collocation points where the loss is a PDE residual rather than a label, but the ERM picture survives intact (Unit 5).

Train, validate, test — and k-fold cross-validation

The standard split: training to fit, validation to pick hyperparameters, test to estimate true risk. Never let the test set influence model choice.

When data is scarce, hold out one validation set at a time and rotate. That’s k-fold cross-validation:

k-fold cross-validation. Each row trains on the unshaded blocks and validates on the shaded one; cross-validated risk averages the k held-out scores. (Figure from Mathematical Engineering of Deep Learning, §2.5.)

Bias, variance, and model complexity

Holding N fixed and sweeping the function class from simple to complex, training error decreases monotonically while test error is U-shaped: too simple under-fits (high bias), too complex over-fits (high variance). The minimum of the test-error curve marks the “right” complexity for that dataset size.

Test (orange) and training (blue) error vs. model complexity. The classical bias–variance trade-off: complexity beyond the right-hand side of the dotted line buys lower training error at the cost of higher test error. (Figure from MEDL, §2.7.)

For PINNs the same picture applies: the function class is set by the network architecture, “complexity” is roughly the parameter count, and the test set is replaced by held-out collocation points in the PDE domain.

Note✏️ Section exercise — watch the U-shape appear

Make the bias–variance trade-off concrete. Generate 60 noisy samples of y = \sin(2\pi x) + \varepsilon, \varepsilon \sim \mathcal{N}(0, 0.2^2), on x \in [0, 1]. Fit polynomials of degree 1, 2, \ldots, 15 by least squares, and for each degree compute (a) the training RMSE and (b) the 5-fold cross-validated RMSE — implementing the fold rotation yourself, no library. Plot both curves against degree. Where is the minimum of the CV curve, and what happens to the two curves past it?

💡 Hint

Build the polynomial design matrix with hcat((x .^ p for p in 0:d)...) and fit with \. For the folds, interleaved index sets [i:5:N for i in 1:5] are the least code; train on .!mask, score on mask. Expect numerical warnings at high degree — that’s part of the story.

2.2 Random forests on MNIST: a cross-language cameo

Before deep learning takes over, one quick stop on the non-deep-learning side of the world — and a first look at MNIST, the dataset we’ll keep using as a running example. The next two subsubsections cover (a) what a random forest is and (b) the same MNIST problem in Julia and in Python, so the cross-ecosystem template is established up front.

Decision trees, ensembled

A decision tree partitions the input space by axis-aligned splits, with a constant prediction at each leaf. Greedy fitting picks the split that reduces a loss criterion most at each node — variance for regression, Gini or entropy for classification. Trees are interpretable but high-variance: small data perturbations produce wildly different trees.

A random forest mitigates the variance by averaging many trees trained on bootstrap resamples of the data, with each split considering only a random subset of features. The variance falls, generalisation goes up, interpretability becomes harder to maintain. Random forests are still a serious baseline on tabular or flat image problems — often the right tool when you have hundreds-of-thousands of rows of modestly sized feature vectors, hand-designed or otherwise.

The MNIST benchmark

MNIST is 70 000 grayscale images of handwritten digits 0–9, each 28 \times 28 pixels (so 784 features when flattened to a vector). 60 000 are reserved for training, 10 000 for test. It is the canonical “easy classification benchmark” in deep learning — small enough to fit in memory, hard enough that a naive baseline isn’t perfect, and used in Goodfellow, Bengio & Courville (Chapters 1, 5, 6) and Mathematical Engineering of Deep Learning (Chapters 5–6) for exactly the kind of side-by-side comparison we’re about to do.

MNIST random forest in Julia (DecisionTree.jl)

A 100-tree forest on flattened MNIST takes a couple of minutes to fit and lands around 97% test accuracy — a strong baseline that took deep learning a decade to convincingly beat. The workshop’s DecisionTree.jl (already in the workshop Project.toml) plus MLDatasets.jl for the MNIST loader:

units/unit_02/scripts/mnist_rf_julia.jl
using MLDatasets, DecisionTree, Statistics

train_x, train_y = MLDatasets.MNIST(split = :train)[:]
test_x,  test_y  = MLDatasets.MNIST(split = :test)[:]

# (28, 28, N) → (N, 784), normalise to [0, 1]
flatten(x) = reshape(Float32.(x), 28 * 28, size(x, 3))' ./ 255f0
X_train, X_test = flatten(train_x), flatten(test_x)
y_train, y_test = Int.(train_y), Int.(test_y)

# Native DecisionTree.jl API
forest = build_forest(
    y_train, X_train,
    28,           # n_subfeatures (≈ √784)
    100,          # n_trees
    0.7,          # partial_sampling
    -1,           # max_depth
    1, 0.0, 0;
    rng = 0,
)
ŷ = apply_forest(forest, X_test)
@info "MNIST test accuracy (DecisionTree.jl RF, 100 trees) = $(round(mean.== y_test); digits = 4))"

Setup note. MLDatasets.jl isn’t in the workshop Project.toml by default (we keep PINN-focused deps minimal). To run this script, add it once with Pkg.add("MLDatasets") from the repo root.

MNIST random forest in Python (scikit-learn)

Same problem, same numbers, in the Python ecosystem:

units/unit_02/scripts/mnist_rf_sklearn.py
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# fetch_openml caches to ~/scikit_learn_data/ after the first download
X, y = fetch_openml("mnist_784", version=1, as_frame=False, return_X_y=True)
X = X.astype("float32") / 255.0
y = y.astype("int64")

X_train, X_test = X[:60_000], X[60_000:]
y_train, y_test = y[:60_000], y[60_000:]

clf = RandomForestClassifier(
    n_estimators=100, n_jobs=-1, random_state=0,
)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"MNIST test accuracy (sklearn RF, 100 trees): {acc:.4f}")

Run it yourself with scripts/mnist_rf_sklearn.py (or ./build.sh execute 2 once you’ve installed scikit-learn). Typical first-run cost: ~30 s download (cached after) + ~2 min fit on a modern laptop CPU.

The point isn’t the accuracy number — it’s that the same problem in two ecosystems has the same shape: load data, choose a model, fit, score. This template recurs throughout the workshop.

When classical ML wins

Random forests (and gradient boosting, and regularised linear models) still beat neural networks when:

  • the dataset is small (hundreds to thousands of rows),
  • the features are tabular and well-engineered, or
  • interpretability matters and you can pay the per-tree inspection cost.

Deep learning’s edge sharpens with high-dimensional unstructured inputs (images, text, audio), large datasets, and — as we’ll see in Unit 5 onwards — PDE residuals.

Note✏️ Section exercise — how many trees is enough?

Take the DecisionTree.jl MNIST script above and sweep the forest size: n_{\text{trees}} \in \{1, 5, 20, 50, 100, 200\} (use a 10 000-image training subset to keep each fit under ~20 s). Plot test accuracy against tree count. Then hold 100 trees fixed and sweep n_subfeatures over \{5, 28, 100, 784\}. Two questions: where does the accuracy curve flatten, and why does n_subfeatures = 784 (every split sees every feature) hurt the ensemble even though each individual tree gets stronger?

💡 Hint

Reuse the build_forest call from the section verbatim — the third positional argument is n_subfeatures and the fourth is n_trees. Subset with X[1:10_000, :] to keep each fit fast; the shape of the curves is robust to the subset. For the conceptual part, think about what averaging does to correlated vs uncorrelated errors.

2.3 From logistic regression to a neural network

The simplest possible “neural network” is a single linear layer followed by a smooth squashing function. It is also the first genuinely useful classifier — the softmax regression baseline that Goodfellow, Bengio & Courville open Chapter 5 with, and that MEDL derives in detail in Chapter 3. The next four subsubsections walk through the theory (Bernoulli → sigmoid → softmax) and then fit it on MNIST in both Julia and Python, so the jump to deeper networks in §2.4 is a single conceptual step rather than a leap.

Bernoulli likelihood and the sigmoid

For binary classification, model P(y = 1 \mid x) = \sigma(w^\top x + b) with the sigmoid \sigma(z) = 1/(1 + e^{-z}). Maximising the Bernoulli likelihood over training data is equivalent to minimising the binary cross-entropy

\mathcal{L}(w, b) = -\tfrac{1}{N}\sum_i \bigl[y_i \log \hat p_i + (1-y_i) \log (1-\hat p_i)\bigr],

where \hat p_i = \sigma(w^\top x_i + b). That single-layer x \mapsto \sigma(w^\top x + b) is the simplest neural network — a linear map followed by a smooth nonlinearity.

Softmax for multiclass

Stack the binary case K times and renormalise with the softmax

\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}.

The model becomes P(y = k \mid x) = \mathrm{softmax}(Wx + b)_k, fit by categorical cross-entropy.

A logistic / softmax classifier as a one-layer neural network. The affine map Wx + b feeds a softmax that outputs class probabilities. (Figure from MEDL, §3.3.)

MNIST softmax regression in Lux.jl

The “deep learning baseline” — a single linear layer + softmax, trained by maximum likelihood — on the same MNIST data the forest just saw. With no nonlinearity and no hidden units, this is literally the model in Goodfellow, Bengio & Courville 2016 Chapter 5 used to motivate why we need deeper networks. Expect ~92% test accuracy — a striking gap below the random forest’s ~97%, which is the point.

units/unit_02/scripts/mnist_linear_lux.jl
using MLDatasets, Lux, Random, Zygote, Optimisers, Statistics

# data ─────────────────────────────────────────────────────────────────
train_x, train_y = MLDatasets.MNIST(split = :train)[:]
test_x,  test_y  = MLDatasets.MNIST(split = :test)[:]
flatten(x) = reshape(Float32.(x), 28 * 28, size(x, 3)) ./ 255f0
X_train, X_test = flatten(train_x), flatten(test_x)
y_train, y_test = Int.(train_y), Int.(test_y)

onehot(k, K = 10) = (e = zeros(Float32, K); e[k + 1] = 1f0; e)
Y_train = reduce(hcat, onehot.(y_train))

# model: 784 → 10 (single linear layer + softmax) ─────────────────────
rng = Random.MersenneTwister(0)
model = Lux.Chain(Lux.Dense(784 => 10), Lux.softmax)
ps, st = Lux.setup(rng, model)

# cross-entropy loss ───────────────────────────────────────────────────
function loss_fn(ps, st, X, Y)
    ŷ, st = model(X, ps, st)
    -mean(sum(Y .* log.(ŷ .+ 1f-9); dims = 1)), st
end

# mini-batch SGD (Adam) ────────────────────────────────────────────────
opt_state = Optimisers.setup(Optimisers.Adam(1f-2), ps)
batch     = 128
nb        = size(X_train, 2) ÷ batch
for epoch in 1:10
    perm = randperm(rng, size(X_train, 2))
    for b in 1:nb
        idx = perm[(b - 1) * batch + 1 : b * batch]
        gs  = first(Zygote.gradient(
                p -> first(loss_fn(p, st, X_train[:, idx], Y_train[:, idx])),
                ps,
              ))
        opt_state, ps = Optimisers.update(opt_state, ps, gs)
    end
end

ŷ_test, _ = model(X_test, ps, st)
preds = vec(map(argmax, eachcol(ŷ_test))) .- 1
@info "MNIST test accuracy (Lux softmax) = $(round(mean(preds .== y_test); digits = 4))"

This is the same skeleton you’ll see again in every PINN: define the model as a pure function, separate parameters ps from inference state st, build a loss, take gradients with Zygote.gradient, step with Optimisers.update. The only thing PINNs add is more loss terms.

The Lux style — parameters as plain arrays, not hidden state — matters because NeuralPDE.jl (Unit 5) needs to compose layers with PDE-residual losses that contain derivatives of the network output. Frameworks like scikit-learn that hide their parameters can’t do that composition.

MNIST softmax regression in Python (scikit-learn)

Same model, fewer lines, very similar numbers:

units/unit_02/scripts/mnist_linear_sklearn.py
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = fetch_openml("mnist_784", version=1, as_frame=False, return_X_y=True)
X = X.astype("float32") / 255.0
y = y.astype("int64")

X_train, X_test = X[:60_000], X[60_000:]
y_train, y_test = y[:60_000], y[60_000:]

# Multinomial logistic = softmax regression
clf = LogisticRegression(
    solver="lbfgs", max_iter=200, n_jobs=-1, random_state=0,
)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"MNIST test accuracy (softmax regression): {acc:.4f}")

Available as scripts/mnist_linear_sklearn.py. Runtime: ~30 s on a laptop CPU. Notice that scikit-learn itself uses L-BFGS under the hood for this convex problem — the same optimiser we’ll use for the fine-tuning stage of every PINN in Units 5–7. Convex losses are why L-BFGS converges so cleanly here; in deeper networks the loss surface isn’t convex and we’ll need Adam.

Note✏️ Section exercise — logistic regression from scratch

Strip the framework away entirely. Using only plain Julia arrays (no Lux, no Zygote), implement binary logistic regression for the digits 0 vs 1 of MNIST: write the sigmoid, the binary cross-entropy, derive the gradient \nabla_w \mathcal{L} = \frac{1}{N} X (\hat p - y) by hand, and run 500 steps of plain gradient descent. You should clear 99% test accuracy on this easy pair. Then swap in the digits 4 vs 9 and watch the same model struggle — why is that pair so much harder for a linear decision boundary?

💡 Hint

No autodiff needed: for sigmoid + binary cross-entropy the gradient collapses to X * (p̂ .- y) ./ N — derive it by chaining \partial L/\partial \hat p and \partial \hat p/\partial z and watch the \sigma' cancel. Select the two digits with a boolean mask like (y .== 0) .| (y .== 1). Twenty lines total.

2.4 Feedforward networks

MLPs: depth \times width

A multilayer perceptron (MLP) stacks affine layers with elementwise nonlinearities \sigma in between:

f_\theta(x) = W_L\,\sigma\!\bigl(W_{L-1}\,\sigma(\cdots W_1 x + b_1 \cdots) + b_{L-1}\bigr) + b_L.

A one-hidden-layer network is enough to demonstrate the architecture shape:

One hidden layer. The input vector flows through an affine map + nonlinearity to a hidden representation, then a final linear projection. (Figure from MEDL, §5.1.)

Depth lets the network compose features; width gives it room to represent each:

A deeper feedforward network — same picture, repeated. (Figure from MEDL, §5.1.)

Universal approximation

A two-layer MLP with a non-polynomial activation can approximate any continuous function on a compact set to arbitrary accuracy, given enough hidden units (Hornik, 1991; Cybenko, 1989). The theorem is qualitative: it doesn’t tell you how many units, how much data, or how to train. Depth, architectural priors, and good optimisation make the practical difference.

Activation choices (with a PINN aside)

Common nonlinearities:

  • Sigmoid / tanh — smooth, bounded, classical. Saturate, so deep networks gradient-vanish.
  • ReLU \max(0, x) — cheap, non-saturating, the modern default for vision / language tasks.
  • Swish x\,\sigma(\beta x) — smooth ReLU; good default for PINNs.

For PINNs the second derivatives matter (they show up in the PDE residual). ReLU’s second derivative is zero almost everywhere — useless for diffusion-type PDEs. Smooth activations (tanh, Swish, GELU) are the safe default; we use Swish throughout Units 5–7.

Note✏️ Section exercise — universal approximation, and the ReLU trap

Fit a one-hidden-layer Lux MLP to f(x) = \sin(4\pi x) \cdot e^{-x} on [0, 1] at widths 5, 20, 100, and plot the three fits over the target. Then — the PINN-relevant part — take the width-20 network in two flavours, tanh and relu, and use nested ForwardDiff.derivative to plot the second derivative f_\theta''(x) of each. The tanh network gives a smooth curve; what does the ReLU network give, and why does that disqualify it from any diffusion-type PDE residual?

💡 Hint

Lux.Chain(Lux.Dense(1 => w, act), Lux.Dense(w => 1)) with act = tanh or relu, trained with the standard Zygote.gradient + Optimisers.Adam loop from the section. For the second derivative, wrap the network in a scalar closure u(x) = first(model([x], ps, st))[1] and nest ForwardDiff.derivative twice — exactly the §5.2 pattern, used here a unit early.

2.5 Optimisation and autodiff

Training a neural network — or fitting a PINN — comes down to minimising a scalar loss over a large parameter vector \theta \in \mathbb{R}^P. The next six subsubsections cover the machinery: how gradients of the loss are computed (computational graph + backprop), the family of first-order methods that use those gradients (SGD, momentum, Adam), the quasi-Newton method that often finishes the job (L-BFGS), the batching choices that determine how noisy each gradient estimate is, and finally the forward-vs-reverse-mode autodiff distinction that PINNs depend on. Most of this material is treated at greater depth in Goodfellow, Bengio & Courville Chapter 8; the goal here is a working understanding sufficient to read PINN training code and know which knob to turn when it doesn’t converge.

Computational graphs and backprop

Every neural-network loss is the output of a finite computation graph — a DAG whose nodes are tensors and whose edges are elementary operations.

A small computational graph. Forward evaluation flows left-to-right; reverse-mode AD (backpropagation) populates gradients right-to-left. (Figure from MEDL, §4.4.)

Backpropagation is exactly reverse-mode automatic differentiation applied to that graph: one forward pass evaluates intermediate quantities, one backward pass propagates gradients using the chain rule. Cost is O(\text{nodes}) per gradient — i.e., similar to one forward pass.

Gradient descent: the geometric core

Given a smooth loss \mathcal{L}: \mathbb{R}^P \to \mathbb{R} and a parameter vector \theta_t, the gradient g_t = \nabla_\theta \mathcal{L}(\theta_t) \in \mathbb{R}^P is the direction in which \mathcal{L} increases fastest. Gradient descent takes a small step in the opposite direction:

\theta_{t+1} \;=\; \theta_t \;-\; \eta\, g_t, \tag{1}

with learning rate \eta > 0. To first order in \eta,

\mathcal{L}(\theta_{t+1}) \;\approx\; \mathcal{L}(\theta_t) - \eta\, \|g_t\|^2 + O(\eta^2),

so any positive \eta decreases the loss locally; too large and the O(\eta^2) Hessian term wins and the iterate diverges. The practical learning-rate range for deep networks (10^{-4} to 10^{-2}) is set by this balance — see Goodfellow, Bengio & Courville 2016 §8.3 for the convergence analysis.

Why isn’t plain gradient descent enough? Three failure modes that recur throughout deep learning and PINN training:

  1. Pathological curvature. Loss surfaces often have long, thin valleys (the ravine problem). g_t points across the valley, not along it. Plain GD zig-zags.
  2. Bad scaling between parameters. Different layers have wildly different gradient magnitudes. A single global \eta can’t be right for all of them at once.
  3. Stochastic noise when g_t is estimated from a mini-batch rather than the full dataset (covered below).

Each of momentum, Adam, and L-BFGS is a different fix.

Stochastic and mini-batch gradient descent

For empirical risk \mathcal{L}(\theta) = \frac{1}{N}\sum_i \ell_i(\theta), the full-batch gradient sums over all N training examples — exact, but expensive when N \in [10^4, 10^9]. Stochastic gradient descent (SGD) replaces the sum with a single random example i_t:

\theta_{t+1} \;=\; \theta_t - \eta \, \nabla_\theta \ell_{i_t}(\theta_t),

a noisy but unbiased estimator of the true gradient. The middle ground is mini-batch SGD — sum over a random subset B \subset \{1, \ldots, N\} of size |B| (typically 32 – 1024):

\theta_{t+1} \;=\; \theta_t - \frac{\eta}{|B|}\sum_{i \in B} \nabla_\theta \ell_i(\theta_t).

The gradient variance falls like 1/|B|. Three reasons mini-batches dominate practice:

  • Computational cost. A modern GPU/TPU is massively parallel; cost per gradient is roughly constant up to |B| \sim hundreds. So a single mini-batch update is nearly as cheap as a single SGD update but with far lower variance.
  • Gradient noise as a regulariser. Noisy gradients prevent the iterate from settling into sharp minima; flat minima generalise better empirically. See Goodfellow et al. 2016 §8.1.3.
  • Memory. Full-batch gradients on a million-example dataset don’t fit on a GPU.

PINNs invert this picture: their “dataset” is a few thousand collocation points, so full-batch L-BFGS is feasible and usually wins. But every PINN training loop still uses mini-batches for the first phase (Adam), so the vocabulary matters.

Momentum and Adam

SGD + momentum (Polyak 1964; Sutskever et al. 2013) adds inertia:

\begin{aligned} v_{t+1} &= \mu\, v_t + g_t, \\ \theta_{t+1} &= \theta_t - \eta\, v_{t+1}, \end{aligned}

with momentum coefficient \mu \in [0, 1) (typically 0.9). The update v_t is an exponentially-weighted moving average of past gradients — it cancels in directions where successive g_t oscillate (the ravine), and accumulates in directions where they agree (the valley floor). For a quadratic loss with condition number \kappa, momentum reduces the iterations to converge from O(\kappa) to O(\sqrt{\kappa}) — a square-root speedup.

Adam (Kingma & Ba 2015) combines momentum with per-parameter step sizes based on the magnitude of recent gradients:

\begin{aligned} m_t &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t, &\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \\ v_t &= \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2, &\hat{v}_t &= \frac{v_t}{1 - \beta_2^t}, \\ \theta_{t+1} &= \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}. \end{aligned}

Read as: \hat{m}_t is the (bias-corrected) running average of the gradient (the “momentum part”), \hat{v}_t is the running average of the squared gradient component-wise (the “scale part”), and the update divides one by the square root of the other so that each parameter gets a step size inversely proportional to how loud its historical gradient has been. The bias correction (/(1 - \beta_1^t), /(1 - \beta_2^t)) fixes the initial-transient underestimate when m_0 = v_0 = 0.

Hyperparameter defaults from Goodfellow et al. 2016 §8.5.3 (and what the workshop scripts use): \eta \in [10^{-4}, 10^{-2}], \beta_1 = 0.9, \beta_2 = 0.999, \varepsilon = 10^{-8}. Adam almost always works; it is the right first move for any neural-network training, including the initial phase of every PINN in Units 5–7.

L-BFGS: quasi-Newton for the small-batch regime

Newton’s method takes step \theta_{t+1} = \theta_t - H_t^{-1} g_t, using the Hessian H_t = \nabla_\theta^2 \mathcal{L}. In one step on a quadratic loss it lands at the minimum. The catch: H_t is P \times P and inverting it costs O(P^3) — infeasible for P \gtrsim 10^4 parameters.

BFGS builds a positive-definite approximation B_t \approx H_t incrementally from successive gradient differences using a rank-2 update. L-BFGS (Liu & Nocedal 1989) is the limited-memory variant: store only the last m pairs of (s_k, y_k) = (\theta_{k+1} - \theta_k, \, g_{k+1} - g_k) and reconstruct B_t^{-1} g_t implicitly in O(mP) time. Default m = 10 — so L-BFGS uses ~10 P memory and ~10 P work per step. Crucially:

  • No learning rate to tune. L-BFGS does an internal Wolfe line search.
  • Quadratic convergence near the minimum. Once close, it finishes in a handful of steps.
  • Needs an accurate gradient. L-BFGS assumes the gradient is deterministic — feed it a noisy mini-batch gradient and the Wolfe conditions break.

This is exactly the regime PINNs land in: a few thousand collocation points (small enough to take the full-batch gradient on every step), a non-convex but locally well-conditioned loss after Adam has done the rough work. The recipe across Units 5–7 is therefore:

Adam for a few thousand iterations (escape bad initialisations under mini-batch noise) → L-BFGS for a few hundred iterations (drive the residual to convergence with exact gradients).

Forward vs reverse mode (why PINNs need both)

For a function f: \mathbb{R}^n \to \mathbb{R}^m:

  • Forward mode propagates derivatives along with values. Cost scales with n (inputs).
  • Reverse mode propagates derivatives backward. Cost scales with m (outputs).

Standard deep learning is “many inputs, one scalar loss”: m = 1, reverse mode wins. PINNs are different: the loss involves spatial derivatives of the network output evaluated at collocation points. Those derivatives go into the loss, and the final loss-gradient pass needs gradients of those derivatives — a derivative of a derivative.

Computing this efficiently means forward-over-reverse: forward-mode for the spatial derivative (one or two inputs), reverse-mode for the outer gradient over parameters. Getting the nesting right is what makes a PINN frame work; getting it wrong makes training 100× slower. Lux.jl + Zygote.jl + ForwardDiff.jl handle this composition cleanly, which is why the workshop uses that stack.

Note✏️ Section exercise — race the optimisers down a ravine

Implement gradient descent, momentum (\mu = 0.9), and Adam by hand — each is under ten lines — and race them on the Rosenbrock function f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2 from (-1.5, 2). Plot the three trajectories over the loss contours and the loss-vs-iteration curves (log scale, 5 000 iterations, tune each learning rate to the largest stable value). Which method zig-zags across the valley, which one tracks the valley floor, and where does Adam’s per-parameter scaling visibly help?

💡 Hint

Each optimiser is under ten lines if you mutate the iterate in place and keep state in a NamedTuple. The hard part is fair tuning: push each learning rate up until the method just diverges, then back off (GD lands near 1e-3, Adam near 1e-2). Plot log10 ∘ f for the contours or the colour scale saturates.

2.6 Training a model in practice

The previous sections set up everything we need — a function class (MLP), an autodiff engine, an optimiser, and a dataset (MNIST) with two known baselines (random forest ~97%, softmax regression ~92%). Now we close the gap with an actual MLP, written three times so the cross-ecosystem mechanics are explicit. The Julia (Lux.jl) version doubles as the iris sidecar that already runs under ./build.sh execute 2; the MNIST PyTorch and scikit-learn parallels are static, for the cross-ecosystem comparison.

Lux.jl MLP on iris (the runnable sidecar)

units/unit_02/scripts/iris_lux.jl
# Small MLP on the iris dataset, in Lux.jl + Zygote + Adam.
# Demonstrates the explicit-parameter Lux style and a hand-rolled
# training loop — the same skeleton we use in later units when the
# loss starts to include PDE residuals.
#
# Run via ./build.sh execute 2 (writes output to ../output/iris_lux.md).

using Lux, Random, Zygote, Optimisers, Statistics
using MLJ: load_iris, unpack

# ── data ────────────────────────────────────────────────────────────────
iris = load_iris()
y, X = unpack(iris, ==(:target); rng = 123)
class_index = Dict(c => i - 1 for (i, c) in enumerate(unique(y)))
y_int       = [class_index[v] for v in y]                       # 0..2
# `unpack` returns X as a NamedTuple of feature columns; stack into a
# (4, 150) Float32 matrix with one sample per column.
X_mat = Float32.(permutedims(reduce(hcat, [collect(c) for c in values(X)])))

rng = Random.MersenneTwister(123)
perm = randperm(rng, length(y_int))
ntrain = 120
i_train, i_test = perm[1:ntrain], perm[ntrain+1:end]

# one-hot targets
onehot(k, K) = (e = zeros(Float32, K); e[k+1] = 1f0; e)
Y_train = reduce(hcat, [onehot(y_int[i], 3) for i in i_train])
Y_test  = reduce(hcat, [onehot(y_int[i], 3) for i in i_test])

X_train = X_mat[:, i_train]
X_test  = X_mat[:, i_test]

# ── model ───────────────────────────────────────────────────────────────
model = Lux.Chain(
    Lux.Dense(4 => 16, tanh),
    Lux.Dense(16 => 3),
)
ps, st = Lux.setup(rng, model)

# ── loss + accuracy ─────────────────────────────────────────────────────
function logits_then_softmax_xent(logits, Y)
    z = logits .- maximum(logits; dims = 1)           # stability
    logp = z .- log.(sum(exp.(z); dims = 1))
    -mean(sum(Y .* logp; dims = 1))
end

function loss_fn(ps, st, X, Y)
    logits, st = model(X, ps, st)
    logits_then_softmax_xent(logits, Y), st
end

function accuracy(ps, st, X, Y)
    logits, _ = model(X, ps, st)
    preds  = vec(map(argmax, eachcol(logits))) .- 1
    truth  = vec(map(argmax, eachcol(Y))) .- 1
    mean(preds .== truth)
end

# ── training loop with Adam (Optimisers.jl) ────────────────────────────
function train!(ps, st, X_train, Y_train, X_test, Y_test;
                epochs = 500, lr = 1.0f-2)
    opt_state = Optimisers.setup(Optimisers.Adam(lr), ps)
    for epoch in 1:epochs
        gs = first(Zygote.gradient(
            p -> first(loss_fn(p, st, X_train, Y_train)), ps,
        ))
        opt_state, ps = Optimisers.update(opt_state, ps, gs)
        if epoch % 100 == 0
            L, _ = loss_fn(ps, st, X_train, Y_train)
            a_tr = accuracy(ps, st, X_train, Y_train)
            a_te = accuracy(ps, st, X_test,  Y_test)
            @info "epoch $epoch  loss=$(round(L; digits=4))  train=$(round(a_tr; digits=3))  test=$(round(a_te; digits=3))"
        end
    end
    ps
end

ps = train!(ps, st, X_train, Y_train, X_test, Y_test)

# ── final report ────────────────────────────────────────────────────────
println("\nfinal train accuracy: ", round(accuracy(ps, st, X_train, Y_train); digits = 3))
println("final test  accuracy: ", round(accuracy(ps, st, X_test,  Y_test);  digits = 3))

Captured output (after ./build.sh execute 2):

[ Info: epoch 100  loss=0.1523  train=0.967  test=0.967
[ Info: epoch 200  loss=0.0733  train=0.983  test=0.967
[ Info: epoch 300  loss=0.0581  train=0.983  test=0.967
[ Info: epoch 400  loss=0.0501  train=0.983  test=0.967
[ Info: epoch 500  loss=0.0455  train=0.983  test=0.967

final train accuracy: 0.983
final test  accuracy: 0.967

The next two parallels swap iris for MNIST so the depth-vs-baseline comparison from §2.2–§2.3 closes.

scikit-learn MLP on MNIST

The same architecture in scikit-learn’s MLPClassifier. Expect ~98% test accuracy — comfortably beating the softmax baseline, edging out the random forest:

units/unit_02/scripts/mnist_mlp_sklearn.py
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = fetch_openml("mnist_784", version=1, as_frame=False, return_X_y=True)
X = X.astype("float32") / 255.0
y = y.astype("int64")
X_train, X_test = X[:60_000], X[60_000:]
y_train, y_test = y[:60_000], y[60_000:]

clf = MLPClassifier(
    hidden_layer_sizes=(256, 128), activation="relu",
    solver="adam", batch_size=128, max_iter=20, random_state=0,
)
clf.fit(X_train, y_train)
print(f"MNIST test accuracy (sklearn MLP): "
      f"{accuracy_score(y_test, clf.predict(X_test)):.4f}")

PyTorch MLP on MNIST

The same model with explicit autodiff and an explicit training loop. This is the PyTorch idiom you’ll see everywhere in the deep learning book:

units/unit_02/scripts/mnist_mlp_pytorch.py
# pip install torch torchvision
import torch, torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
tf  = transforms.Compose([transforms.ToTensor(), transforms.Lambda(torch.flatten)])
tr  = datasets.MNIST("~/torch-data", train=True,  download=True, transform=tf)
te  = datasets.MNIST("~/torch-data", train=False, download=True, transform=tf)
tl  = DataLoader(tr, batch_size=128, shuffle=True)
vl  = DataLoader(te, batch_size=512)

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
).to(device)
opt   = torch.optim.Adam(model.parameters(), lr=1e-3)
loss  = nn.CrossEntropyLoss()

for epoch in range(10):                 # mini-batch SGD with Adam
    model.train()
    for X, y in tl:
        X, y = X.to(device), y.to(device)
        opt.zero_grad()
        loss(model(X), y).backward()    # backprop
        opt.step()

    # eval
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for X, y in vl:
            X, y = X.to(device), y.to(device)
            pred = model(X).argmax(dim=1)
            correct += (pred == y).sum().item()
            total   += y.size(0)
    print(f"epoch {epoch:2d}  test acc = {correct / total:.4f}")

The structural beats — model.train() / model.eval() mode, opt.zero_grad() to clear accumulated gradients, the inner loss.backward(); opt.step() two-liner — are PyTorch idiom you’ll see in every textbook and tutorial. Functionally it’s the same training loop as mnist_linear_lux.jl above. The only ecosystem difference is whether parameters live on the model object (PyTorch) or in a separate ps tuple (Lux).

Run it via scripts/mnist_mlp_pytorch.py after installing torch + torchvision; expect ~30 s/epoch on a modern laptop CPU, much less on GPU.

All three implementations follow the same skeleton — load data, define model, set optimiser, loop minibatches, compute loss, step. The Julia version makes the parameter state explicit; PyTorch and scikit-learn hide it. The only meaningful differences are surface-level. Pick by ecosystem.

Regularisation, batch norm, dropout

Three standard knobs to keep training honest:

  • Weight regularisation — add \lambda \|\theta\|^2 to the loss (L_2, “weight decay”) to keep the parameters small.
  • Batch normalisation — normalise activations within each mini-batch; lets you use higher learning rates and reduces sensitivity to initialisation.
  • Dropout — randomly zero a fraction of activations during training; equivalent to averaging over an ensemble of thinned networks.

Dropout: at each training step, a random subset of hidden units is masked out; at test time all units are used with weights rescaled. (Figure from MEDL, §5.7.)

For PINNs the regularisation knobs you care about are physics terms in the loss rather than weight penalties or dropout — the PDE residual itself is a powerful regulariser, as we’ll see from Unit 5 onward.

Note✏️ Section exercise — close the MNIST gap in Lux

The unit has now shown you MNIST three ways: random forest (~97%), softmax regression (~92%), and an MLP in two Python frameworks (~98%). Close the loop in Julia: extend mnist_linear_lux.jl from §2.3 with two hidden layers (784 → 256 → 128 → 10, ReLU), keep the same Adam mini-batch loop, and train for 10 epochs. Verify you beat the forest. Then re-run with the Adam → L-BFGS-style trick in miniature: after the Adam epochs, do 5 epochs at a 10× smaller learning rate and note the effect on the final accuracy. Roughly how many parameters does your network have, and how does that compare to the softmax baseline?

💡 Hint

Copy mnist_linear_lux.jl and change only the model definition and epoch count — the data pipeline and mini-batch loop are already right. Lux.parameterlength(model) gives the parameter count without hand-arithmetic. For the fine-tune, rebuild opt_state with the smaller learning rate but keep the trained ps.

2.7 Kolmogorov-Arnold networks (KANs)

A recent alternative to the MLP that’s been picking up traction in the PINN community: Kolmogorov-Arnold networks (Liu et al. 2024). Where an MLP puts a fixed activation \sigma at every node and learns the weights on the edges, a KAN puts a learnable univariate function (a B-spline) on every edge and just sums them at the nodes:

\underbrace{f_{\text{MLP}}(x) = W_2\, \sigma(W_1 x + b_1) + b_2}_{\text{fixed }\sigma\text{, learnable }W} \qquad\text{vs.}\qquad \underbrace{f_{\text{KAN}}(x)_q = \sum_p \phi_{q, p}^{(2)}\!\Bigl(\sum_r \phi_{p, r}^{(1)}(x_r)\Bigr)}_{\text{learnable univariate }\phi\text{ on every edge}}.

The motivation is the Kolmogorov-Arnold superposition theorem (1957): any continuous multivariate function on a compact domain can be written as a finite sum of single-variable functions of single-variable functions — the same universal-approximation spirit as MLPs but with a structurally different parameterisation. In practice KANs trade fewer parameters for more interpretability (you can plot every edge’s learned \phi) at the cost of a more expensive forward pass.

For PINNs the appeal is that learnable univariate splines often need many fewer collocation points to capture sharp features in the PDE solution. Some preliminary results (KAN-PINN) suggest KANs can outperform MLPs on stiff or oscillatory PDEs — this is an active research front.

KAN in Julia (KolmogorovArnold.jl)

A 2-layer KAN approximating f(x_1, x_2) = \exp(\sin(\pi x_1) + x_2^2) — a Liu et al. demo — in the Julia ecosystem using KolmogorovArnold.jl:

units/unit_02/scripts/kan_julia.jl
# Pkg.add("KolmogorovArnold")
using KolmogorovArnold, Lux, Random, Zygote, Optimisers, Statistics

rng = Random.MersenneTwister(0)

# data ─────────────────────────────────────────────────────────────────
n = 1000
X = 2f0 .* rand(rng, Float32, 2, n) .- 1f0
y = reshape(exp.(sinpi.(X[1, :]) .+ X[2, :].^2), 1, n)

# KAN: 2 → 5 → 1, B-spline basis ──────────────────────────────────────
model = Lux.Chain(
    KDense(2 => 5; basis_func = rbf,    normalizer = softsign),
    KDense(5 => 1; basis_func = rbf,    normalizer = softsign),
)
ps, st = Lux.setup(rng, model)

loss(ps, st) = mean(abs2, first(model(X, ps, st)) .- y)
opt_state = Optimisers.setup(Optimisers.Adam(1f-2), ps)
for k in 1:2000
    gs = first(Zygote.gradient(p -> loss(p, st), ps))
    opt_state, ps = Optimisers.update(opt_state, ps, gs)
end
@info "test MSE = $(loss(ps, st))"

Setup note. KolmogorovArnold.jl is not in the workshop Project.toml; add it once with Pkg.add("KolmogorovArnold"). The package uses radial basis functions as the per-edge basis instead of B-splines — a near-equivalent parameterisation with better Julia ergonomics.

KAN in Python (pykan)

The reference Liu-et-al. implementation, in Python:

units/unit_02/scripts/kan_pykan.py
# pip install pykan
import torch
from kan import KAN, create_dataset

# build dataset for f(x1, x2) = exp(sin(pi x1) + x2^2)
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]]**2)
dataset = create_dataset(f, n_var=2, train_num=1000, test_num=1000)

# KAN with grid=5 B-spline knots, k=3 cubic splines, shape 2→5→1
model = KAN(width=[2, 5, 1], grid=5, k=3, seed=0)
model.fit(dataset, opt="LBFGS", steps=50, lamb=0.0)

print(f"test loss = {torch.mean((model(dataset['test_input']) - dataset['test_label'])**2).item():.3e}")

Available as scripts/kan_pykan.py. Requires pip install pykan (which pulls in torch). Notice the default training optimiser is L-BFGS — exactly the small-batch regime we just discussed.

When to reach for a KAN

The honest answer in mid-2026: rarely yet, but worth knowing.

  • Where they shine. Low-dimensional problems where interpretability matters; PDE solutions with sharp features that an MLP would need many neurons to capture; symbolic-regression applications.
  • Where they don’t. High-dimensional image-style inputs (CNNs / transformers still dominate); production deep learning (the MLP-Adam-cross-entropy stack is still vastly more debugged).
  • For PINNs specifically. A reasonable experimental swap-in for the inner MLP of a PINN — NeuralPDE.jl accepts any Lux chain, so the substitution is a one-liner. Whether KANs meaningfully improve PINN convergence is still being worked out.

We won’t return to KANs in the worked examples — Units 5–9 stick with MLP-PINNs — but they belong in the working vocabulary of any practitioner reading the 2024–26 PINN literature.

Note✏️ Section exercise — KAN vs MLP on a sharp feature

Sharp features are where KANs claim an edge. Fit both a 2-layer KAN (KolmogorovArnold.jl, as in the script above) and a parameter-matched tanh MLP to the 1-D function f(x) = |x| + 0.2\sin(5x) on [-1, 1] — smooth everywhere except the kink at zero. Train both with Adam for the same number of iterations, compare final MSE, and plot both fits near x = 0. Which model captures the kink better at equal parameter count, and what happens to the MLP if you give it 10× the training iterations?

💡 Hint

KolmogorovArnold.jl needs a one-off Pkg.add; KDense(1 => 6; basis_func = rbf, normalizer = softsign) mirrors the section’s script. Match parameter counts with Lux.parameterlength before comparing — an unmatched comparison is meaningless. Zoom the comparison plot into x ∈ [-0.1, 0.1]; the global MSE hides where the action is.

2.8 What carries over to PINNs

Three ideas from this unit survive the journey to physics-informed networks:

  1. Loss minimisation as the unifying training principle. PINNs add residual terms to the loss but keep the empirical-risk skeleton.
  2. Generalisation as the goal. A PINN that fits its collocation points but violates physics elsewhere is overfitting; held-out collocation samples play the role of the test set.
  3. Regularisation by physics. The PDE residual is a domain prior — a regulariser sharper than any L_2 penalty, because it actually encodes what the function must satisfy.

With those in hand, Unit 3 lifts the function-approximation idea into a SciML landscape, and Unit 4 introduces the ODE side that PINNs will meet head-on in Unit 5.

Note✏️ Section exercise — translate the vocabulary

A quick dictionary drill before moving on. For each supervised-learning concept on the left, write down its PINN counterpart and one sentence on what changes: (a) a labelled training example; (b) the test set; (c) overfitting; (d) the L_2 weight penalty; (e) the mini-batch. Then answer: in what sense does a PINN have “infinite training data”, and what is the catch?

💡 Hint

For each concept, ask: what supplies the supervision signal in a PINN (the equation), and what plays the role of unseen data (points you didn’t sample)? The §2.1 ERM picture maps over almost mechanically — the one genuinely new idea is that the regulariser is the physics.