An Accumulation Point workshop for AIMS, delivered during June, 2025. See workshop materials in the AIMS GitHub repo.
For decades, computers got faster mainly because chipmakers squeezed more transistors onto chips (Moore's law) and raised clock speeds. But after the early 2000s, increasing clock speeds further ran into heat and power dissipation limits. To get more performance out of machines, CPU manufacturers instead started putting multiple cores on CPUs: identical and somewhat isolated execution units that can do work independently. In essence, it became harder to make individual CPUs faster, so they just put many cores on one die.
GPUs, which were designed for graphics, also became much more powerful and began to be used for general-purpose computing.
Now, to speed up programs beyond what a single core can do, we have to use parallel computing across multiple cores, CPUs, GPUs, or even multiple machines working together—this is called distributed computing. See for example the article Sutter et al. (2005).
This brings a whole new set of challenges to writing programs: we can no longer just write code that runs linearly in the obvious order; instead we have to come up with algorithms that can be run in parallel, speeding up execution by performing different parts on different logical computation units. This brings with it a whole new horde of bugs, but also incredible savings in time and gains in throughput, and even the ability to do computations that would otherwise be impossible (e.g. training LLMs would never be feasible on a single-threaded CPU).
Julia is well suited for distributed computing, parallel computing, and GPU use. See the overview of parallel computing in the Julia documentation. In this unit we deal with several layers of parallelism for speedup:

1. Multi-threading: multiple threads sharing memory within a single Julia process.
2. Distributed (multi-process) computing: multiple Julia processes, on one machine or across several machines.
3. GPU computing: offloading suitable computations to a graphics card.

Combinations of 1, 2, and 3 are obviously possible.
Note that a related item, also in the Julia documentation, is “Asynchronous ‘tasks’, or coroutines”. See the Asynchronous Programming docs.
We now focus on multi-threading, distributed computing, and GPUs. We then run an example with Oceananigans.jl from the CliMA organization for ocean simulation. See the recent preprint Wagner et al. (2025).
Knowing when and where to apply parallelisation is as crucial as knowing how. Not all code benefits from parallelisation, and overheads can even slow things down.
If your computation can be broken into many independent chunks (for example, when each iteration of a `for` loop is self-contained), it is ideal for parallelization.

These are great resources for Julia's parallel computing ecosystem:

- Distributed.jl, MPI.jl
- CUDA.jl, AMDGPU.jl, etc.

We will not delve into MPI (Message Passing Interface), but it is implemented in Julia through MPI.jl. It is often used for computations on HPC clusters, so you might want to check it out. In Julia, multi-process functionality is largely provided by Distributed.jl.
Because the focus is on numerical computing relevant to AIMS, we won't be discussing `@async` and non-heterogeneous multithreading in detail, nor will we talk in depth about locking and synchronization.

We also won't be delving into low-level CPU optimizations like explicit SIMD vectorization or LoopVectorization.jl. Note that Julia (like other LLVM-based languages) is able to utilize SIMD automatically in a number of cases.
This Quarto notebook was generated with the command line flag `-t 8` (or `--threads 8`), meaning 8 threads are available. Without this flag, Julia starts up with only 1 thread. If you try `julia --help` in the command line, you'll see:
-t, --threads {auto|N[,auto|M]}
Enable N[+M] threads; N threads are assigned to the `default`
threadpool, and if M is specified, M threads are assigned to the
`interactive` threadpool; `auto` tries to infer a useful
default number of threads to use but the exact behavior might change
in the future. Currently sets N to the number of CPUs assigned to
this Julia process based on the OS-specific affinity assignment
interface if supported (Linux and Windows) or to the number of CPU
threads if not supported (MacOS) or if process affinity is not
configured, and sets M to 1.
The `interactive` threadpool has higher priority and is meant for interactive tasks, such as updating the UI or doing something user facing, whereas `default` is meant for things like background jobs.
Let's use the `nthreads` function inside Julia to see how many threads there are:
Threads.nthreads()
8
To help us visualize execution times, we're going to build a small function, `time_stamp`, that computes the number of tenths of a second elapsed since last midnight. Remember that there are 86400 seconds in a day, so expect the values of this function to reach almost up to \(10^6\). We'll use this as a relative timestamp:
using Dates

"""
Returns how many tenths of seconds passed since midnight
"""
function time_stamp()
    now = Dates.now()
    midnight = DateTime(Date(now)) # Today at 00:00:00
    return Millisecond(now - midnight).value ÷ 100
end;
For basic parallelism, one of the best tools we have is the `Threads.@threads` macro. Now since we have 8 threads in our system, observe that this loop essentially runs in three batches (since there are 17 tasks, which is more than 16 but fewer than 24).
before_start = time_stamp()

Threads.@threads for i in 1:17
    println(time_stamp(),
            "."^(time_stamp()-before_start), # spacing proportional to elapsed time
            " Starting iteration $i")
    sleep(1) # sleep for one second as though "processing something"
    println(time_stamp(),
            "."^(time_stamp()-before_start), # spacing proportional to elapsed time
            " Finished sleeping (\"processing\") on thread $(Threads.threadid())")
end
77667.. Starting iteration 8
77667.. Starting iteration 10
77667.. Starting iteration 4
77667.. Starting iteration 16
77667.. Starting iteration 14
77667.. Starting iteration 6
77667.. Starting iteration 12
77667.. Starting iteration 1
77677............ Finished sleeping ("processing") on thread 3
77677............ Finished sleeping ("processing") on thread 5
77677............ Starting iteration 9
77677............ Finished sleeping ("processing") on thread 1
77677............ Finished sleeping ("processing") on thread 8
77677............ Finished sleeping ("processing") on thread 4
77677............ Starting iteration 7
77677............ Starting iteration 15
77677............ Finished sleeping ("processing") on thread 7
77677............ Starting iteration 5
77677............ Starting iteration 11
77677............ Starting iteration 17
77677............ Finished sleeping ("processing") on thread 8
77677............ Finished sleeping ("processing") on thread 6
77677............ Starting iteration 2
77677............ Starting iteration 13
77687...................... Finished sleeping ("processing") on thread 5
77687...................... Finished sleeping ("processing") on thread 7
77687...................... Finished sleeping ("processing") on thread 2
77687...................... Finished sleeping ("processing") on thread 8
77687...................... Finished sleeping ("processing") on thread 6
77687...................... Finished sleeping ("processing") on thread 1
77687...................... Finished sleeping ("processing") on thread 4
77687...................... Finished sleeping ("processing") on thread 3
77687...................... Starting iteration 3
77697................................ Finished sleeping ("processing") on thread 4
The simplest class of algorithms to parallelize are the embarrassingly parallel ones: those where we perform completely independent, isolated computations. They are a good place to start learning how to split computation apart in Julia.
The Mandelbrot set \(\mathcal M\subseteq\mathbb{C}\) is the set of numbers \(c\in\mathbb{C}\) (in the complex plane) where iterating the function \(f_c(z)=z^2+c\) does not diverge to infinity (where we start iterating with \(z=0\)). Clearly for example \(0+0i\in\mathcal M\) but \(4+0i\not\in\mathcal M\).
More formally we define the Mandelbrot set \(\mathcal M\) as follows. Fix \(c\in\mathbb{C}\), define \(z_0=0\) and set \(z_{k+1}=f_c(z_k)=z_k^2+c\). Then \(c\in\mathcal M\) if and only if \(\limsup_{k\to\infty}|z_k|<\infty\).
For example, at \(c=-1+0i\), we have the iterates \(0,\ -1,\ 0,\ -1,\ \ldots\), which stay bounded forever.
On the other hand, at \(c=3i\), we have the iterates \(0,\ 3i,\ -9+3i,\ 72-51i,\ \ldots\), which blow up to infinity.
Observe that if \(|z_n|>2\) at any point, then we are guaranteed that the iterates go to infinity. This helps us simplify our code to compute \(\mathcal M\).
We can write a Julia function to compute this as follows:
function mandelbrot_iter(c::Complex, max_iter::Int=50)::Int
    # initializes z to the zero element of its type, in this case 0+0i
    z = zero(c)
    for i in 1:max_iter
        z = z*z + c
        # abs2 computes the square of the absolute value, saves us a tiny bit by not computing the square root
        if abs2(z) > 4
            return i
        end
    end
    return max_iter
end
mandelbrot_iter (generic function with 2 methods)
Here we are outputting the iteration on which the sequence has \(|z|>2\), as this will let us make pretty images.
We said earlier that \(0+0i\in\mathcal{M}\),
mandelbrot_iter(0im)
50
And that \(-1+0i\in\mathcal{M}\),
mandelbrot_iter(-1+0im)
50
And that \(3i\not\in\mathcal{M}\),
mandelbrot_iter(3im)
1
Let’s now write a function to run it over a grid:
function compute_mandelbrot(
        real_range,     # real (x-axis) values
        imag_range;     # imaginary (y-axis) values
        max_iter::Int   # max iterations per run
    )::Matrix{Int}
    # produce the grid of c values
    c_grid = complex.(real_range', imag_range)
    # apply mandelbrot_iter to each value of c_grid
    return mandelbrot_iter.(c_grid, max_iter)
end;
Let’s define a box that contains most of \(\mathcal M\):
width, height = 1500, 1500

real_range_full = LinRange(-2., 0.5, width)
imag_range_full = LinRange(-1.25, 1.25, height);
And run it to see how fast we go:
using BenchmarkTools
@time my_mb = compute_mandelbrot(real_range_full, imag_range_full, max_iter=100);
0.502244 seconds (610.02 k allocations: 81.952 MiB, 2.02% gc time, 64.51% compilation time)
Run it a second time to get a clean time:
@time my_mb = compute_mandelbrot(real_range_full, imag_range_full, max_iter=100);
0.176925 seconds (6 allocations: 51.504 MiB, 5.10% gc time)
On my machine, this takes about 0.155540 seconds.
Nice picture:
using Plots
heatmap(my_mb, size=(500, 500), axis=false, aspect_ratio=:equal, colorbar_title="Iterations until provable divergence")
Zoom in on the top part with some more iterations:
real_range_top = LinRange(-.3, .1, width)
imag_range_top = LinRange(.6, 1, height)
@time my_mb_top = compute_mandelbrot(real_range_top, imag_range_top, max_iter=250);
heatmap(my_mb_top, size=(500, 500), axis=false, aspect_ratio=:equal, colorbar_title="Iterations until provable divergence")
0.484760 seconds (6 allocations: 51.502 MiB, 0.72% gc time)
function compute_mandelbrot_multithreaded(
        real_range,     # real (x-axis) values
        imag_range;     # imaginary (y-axis) values
        max_iter::Int   # max iterations per run
    )::Matrix{Int}
    # produce the grid of c values
    c_grid = complex.(real_range', imag_range)
    # output array
    out = zeros(Int64, size(c_grid))
    # note the @threads annotation
    Threads.@threads for i in eachindex(out)
        out[i] = mandelbrot_iter(c_grid[i], max_iter)
    end
    return out
end
@time mt_mb = compute_mandelbrot_multithreaded(real_range_full, imag_range_full, max_iter=100);
0.141279 seconds (60.00 k allocations: 54.446 MiB, 139.61% compilation time)
Run it a second time to get a clean time:
@time mt_mb = compute_mandelbrot_multithreaded(real_range_full, imag_range_full, max_iter=100);
0.077174 seconds (48 allocations: 51.506 MiB, 27.83% gc time)
On my machine, this takes about 0.046682 seconds, about 3.3x as fast as the original one.
Threads.@threads
The `Threads.@threads` macro allows you to specify different schedulers for how loop iterations are assigned to threads. The default scheduler, `:dynamic`, assumes loop iterations are approximately uniform in cost. If this is not the case, you are better off using `:greedy`, which greedily picks up the next iteration whenever a thread finishes its current one. This is better for non-uniform iterations.
As an example, consider this loop with artificial and highly non-uniform iteration times, where with 10% probability an iteration sleeps for a long time:
using Random
using BenchmarkTools

function do_work()
    if rand() < 0.1
        sleep(.3)
    end
end
do_work (generic function with 1 method)
With the standard scheduler (`:dynamic`, omitted):
Random.seed!(1974)
@btime begin
    # same as Threads.@threads :dynamic for _ ∈ 1:200
    Threads.@threads for _ ∈ 1:200
        do_work()
    end
end
1.207 s (98 allocations: 5.66 KiB)
With `:greedy`:
Random.seed!(1974)
@btime begin
    Threads.@threads :greedy for _ ∈ 1:200
        do_work()
    end
end
603.718 ms (133 allocations: 7.94 KiB)
There is also `:static`. See the docs.
Why not always use `:greedy`? Because of cache coherence and other reasons, `:dynamic`, which hands out blocks of iterations at a time, is often faster for uniform iterations.
Consider this simple function that computes (a truncation of) the sum
\[ \sum_{n=1}^\infty\frac{1}{n^2} \]
It's a well-known identity that this sum equals \(\pi^2/6\) (but not so easy to prove).
function inverse_squares_simple(arr::Vector{Int64})
    sum = 0.
    for i in eachindex(arr)
        sum += 1/(arr[i]^2)
    end
    return sum
end
inverse_squares_simple (generic function with 1 method)
using Base.Threads

function inverse_squares_buggy_multithreaded(arr::Vector{Int64})
    sum = 0.
    # Try to use @threads to parallelize the loop over array indices
    @threads for i in eachindex(arr)
        # Each thread attempts to add to the sum
        sum += 1/(arr[i]^2) # <-- This is problematic!
    end
    return sum
end
N = 1_000
numbers = collect(1:N)
true_val = π^2/6
naive_sum = inverse_squares_simple(numbers)

println("True value is $true_val, naive method gives $naive_sum, or around $(round(100*naive_sum/true_val, digits=2))% of true val")

for _ in 1:10
    result = inverse_squares_buggy_multithreaded(numbers)
    println("\"sum\" is $result, got $(round(100*result/naive_sum, digits=2))% of naive sum $naive_sum")
end
True value is 1.6449340668482264, naive method gives 1.6439345666815615, or around 99.94% of true val
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6383048604989814, got 99.66% of naive sum 1.6439345666815615
"sum" is 1.5285646487446394, got 92.98% of naive sum 1.6439345666815615
"sum" is 0.0015750123760240658, got 0.1% of naive sum 1.6439345666815615
"sum" is 1.3650871857767202, got 83.04% of naive sum 1.6439345666815615
"sum" is 0.0007901012204946777, got 0.05% of naive sum 1.6439345666815615
"sum" is 1.5882829451460938, got 96.61% of naive sum 1.6439345666815615
"sum" is 1.627378117950526, got 98.99% of naive sum 1.6439345666815615
"sum" is 1.5963059480767745, got 97.1% of naive sum 1.6439345666815615
"sum" is 1.2509147702998493, got 76.09% of naive sum 1.6439345666815615
The problem is caused by a data race: one thread reads the value of `sum`, and then another thread reads it before the first has written its update back. Regardless of the order in which the two threads now write their results to that piece of memory, the result will be wrong, as at least one summand will be “lost”.

One can get around this by wrapping the shared variable in an “atomic” type (atomic means that its updates are “atoms”: they cannot be broken down into a separate read and a write that could end up un-synchronized):
function inverse_squares_with_atomic(arr::Vector{Int64})
    sum = Atomic{Float64}(0.)
    @threads for i in eachindex(arr)
        atomic_add!(sum, 1/(arr[i]^2))
    end
    # The `[]` syntax dereferences the Atomic value.
    return sum[]
end
for _ in 1:10
    result = inverse_squares_with_atomic(numbers)
    println("\"sum\" is $result, got $(round(100*result/naive_sum, digits=2))% of naive sum $naive_sum")
end
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815612, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815617, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
But note:
@benchmark inverse_squares_simple(numbers)
BenchmarkTools.Trial: 10000 samples with 18 evaluations per sample.
 Range (min … max):  977.778 ns … 1.723 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.044 μs                ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.053 μs ± 55.308 ns    ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark inverse_squares_buggy_multithreaded(numbers)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  17.600 μs … 22.555 ms   ┊ GC (min … max): 0.00% … 97.70%
 Time  (median):     20.394 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.107 μs ± 225.469 μs  ┊ GC (mean ± σ):  9.54% ± 0.98%
 Memory estimate: 35.52 KiB, allocs estimate: 2043.
@benchmark inverse_squares_with_atomic(numbers)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  25.772 μs … 140.243 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     52.242 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   51.633 μs ± 5.828 μs    ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 4.27 KiB, allocs estimate: 43.
Trying to do it with multithreading in this case made it slower (and broke it!), and fixing it up with an atomic made it even slower: it's now around 40x slower for no benefit.

Parallelism can be hard. You have to be careful with various failure modes like race conditions. We won't delve deeper into locking and synchronization primitives; it's a huge area.
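That said, for this kind of reduction there is a standard pattern that avoids both the race and the atomics: give each chunk of the work its own local accumulator and combine the partial results at the end. Here is a minimal sketch (not part of the original workshop code; the function name and its nchunks keyword are made up for illustration):

function inverse_squares_chunked(arr::Vector{Int64}; nchunks::Int = Threads.nthreads())
    # split the index range into contiguous chunks
    chunks = collect(Iterators.partition(eachindex(arr), cld(length(arr), nchunks)))
    partials = zeros(length(chunks))
    Threads.@threads for k in eachindex(chunks)
        s = 0.0                 # each chunk gets its own local accumulator
        for i in chunks[k]
            s += 1/(arr[i]^2)
        end
        partials[k] = s         # each k is written by exactly one iteration: no race
    end
    return sum(partials)        # combine the partial sums serially
end

On the array above, `inverse_squares_chunked(numbers)` agrees with `inverse_squares_simple(numbers)` up to floating-point rounding, and for large arrays it scales with the number of threads instead of serializing on a single shared counter.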
Threads.@spawn
Instead of simply splitting a loop's iterations, `Threads.@spawn` allows you to launch an individual function call or expression as a separate, lightweight task (often called a “coroutine” or “goroutine” in other languages). Julia's runtime then intelligently schedules these tasks to run on the available threads in your thread pool.
Note that these spawned tasks are not separate operating system threads: they are scheduled by Julia onto its own threadpool, whose size is fixed at startup. (This is why you need to specify the number of `--threads` on startup.)
When you use `Threads.@spawn`, it immediately returns a `Task` object. This `Task` object is like a promise that the computation will be done eventually. To get the actual result from the `Task`, you use `fetch()`. If you just want to know when the task is complete without needing its return value, you can use `wait()`.
function simulate_work(it)
    println(time_stamp(),
            "."^(time_stamp()-before_start), # spacing proportional to elapsed time
            " Starting iteration $it")
    sleep(1) # sleep for one second as though "processing something"
    println(time_stamp(),
            "."^(time_stamp()-before_start), # spacing proportional to elapsed time
            " Finished sleeping (\"processing\") on thread $(Threads.threadid())")
    return it
end

tasks = Vector{Task}()

before_start = time_stamp()
for i in 1:17
    push!(tasks, Threads.@spawn simulate_work(i))
end
println("Spawned $(length(tasks)) tasks")

results = [fetch(task) for task in tasks]
println(time_stamp(), "."^(time_stamp()-before_start) * " Task return values: $results")
78293. Starting iteration 11
78293. Starting iteration 5
78293. Starting iteration 8
78293. Starting iteration 12
78293. Starting iteration 3
78293. Starting iteration 15
78293. Starting iteration 16
78293. Starting iteration 7
78293. Starting iteration 13
78293. Starting iteration 6
78293. Starting iteration 14
78293. Starting iteration 17
78293. Starting iteration 1
78293. Starting iteration 2
78293. Starting iteration 9
78293. Starting iteration 4
78293. Starting iteration 10
Spawned 17 tasks
78303........... Finished sleeping ("processing") on thread 8
78303........... Finished sleeping ("processing") on thread 7
78303........... Finished sleeping ("processing") on thread 5
78303........... Finished sleeping ("processing") on thread 3
78303........... Finished sleeping ("processing") on thread 2
78303........... Finished sleeping ("processing") on thread 1
78303........... Finished sleeping ("processing") on thread 6
78303........... Finished sleeping ("processing") on thread 4
78303........... Finished sleeping ("processing") on thread 1
78303........... Finished sleeping ("processing") on thread 8
78303........... Finished sleeping ("processing") on thread 3
78303........... Finished sleeping ("processing") on thread 6
78303........... Finished sleeping ("processing") on thread 1
78303........... Finished sleeping ("processing") on thread 4
78303........... Finished sleeping ("processing") on thread 7
78303........... Finished sleeping ("processing") on thread 2
78303........... Finished sleeping ("processing") on thread 5
78303........... Task return values: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
Threads.@spawn vs. Threads.@threads

| Feature | `Threads.@threads` | `Threads.@spawn` |
|---|---|---|
| Purpose | Loop parallelization (iterating over collections) | Launching individual, independent computations |
| Control Flow | Synchronous (main thread waits for loop to finish) | Asynchronous (main thread continues immediately) |
| Return Value | No direct return from individual iterations | Returns a `Task` object, which holds the result |
| Use Case | Homogeneous, iterative work | Heterogeneous, distinct, or long-running computations |
It makes sense to use `Threads.@spawn` if your “computation graph” is heterogeneous, i.e. different pieces of work take different amounts of time or depend on the inputs/outputs of other pieces, as sketched below.
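As a hypothetical sketch of such a graph (the functions `load_a`, `load_b`, and `combine` are made-up stand-ins for real work), two independent inputs can be produced concurrently, and a third task combines them once both are ready:

load_a()      = (sleep(1.0); collect(1:10))    # pretend: read a file
load_b()      = (sleep(0.5); collect(11:20))   # pretend: query a database
combine(a, b) = sum(a) + sum(b)

ta = Threads.@spawn load_a()
tb = Threads.@spawn load_b()
tc = Threads.@spawn combine(fetch(ta), fetch(tb))  # this task blocks until both inputs are ready

println(fetch(tc))  # finishes after roughly max(1.0, 0.5) seconds, not their sum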
GPGPU stands for General Purpose GPU computing: originally, GPUs were built for, and only really good at, accelerating graphics workflows: texture mapping, shading, rendering polygons and such for gaming and professional applications (CAD, architecture, rendering, etc.). They were highly optimized for these kinds of computations, which often involved computing small matrix multiplications over and over again. GPU performance was measured in things like polygons per second (which is a bit of a nonsensical measure). See Vuduc and Choi (2013).
Around the early 2000s, some hackers and researchers started coming up with novel ways of using GPUs for non-graphics applications. These worked largely by writing custom pixel shaders, which were meant to give graphics programs the ability to run very restricted but general code over fragments of textures and bitmaps. These applications creatively abused the graphics pipeline to do simple parallel computations.

For example, Ian Buck showed through Brook (during his PhD) that it's possible to implement operations like linear algebra and the FFT on GPUs with this methodology. He was quickly hired by NVIDIA and was one of the core individuals who created CUDA.
The stats on the number of floating point operations (FLOPs) a GPU could do compared to a CPU made it a very attractive platform for early research. This was directly enabled by the parallelism story on GPUs compared to CPUs – they were parallel from the beginning. A 2003 article from the magazine Computer states “The new GPUs are very fast. Stanford’s Ian Buck calculates that the current GeForce FX 5900’s performance peaks at 20 Gigaflops, the equivalent of a 10-GHz Pentium”, from Macedonia (2003) (the whole article is a very interesting read).
The cryptanalysis and hacker communities also used GPUs for computing rainbow tables and brute-forcing passwords and hash pre-images1. This was exacerbated by the Crypto Wars: futile efforts by the United States government to limit the availability of strong encryption to the general public and in export products (cryptography is still classified as a dual-use technology and its export is somewhat restricted). The new availability of highly parallel hardware, combined with new ideas around this time, made for practically feasible attacks on widely deployed cryptosystems, such as cracking passwords in networked Microsoft operating systems (e.g. LAN Manager), cracking early WiFi encryption (WEP, and later WPA), and breaking commonly used hashes (such as the abysmal MD5). Cryptography (and the laws around cryptography) have come a long way since then.
Julia makes it extremely simple to get code running on a GPU with CUDA.jl. Let's see an example:
# this might take a moment
using CUDA

# CUDA.@time is needed to synchronize GPU data to make sure we time everything
CUDA.@time cuda_mb = compute_mandelbrot(CuArray(real_range_full), CuArray(imag_range_full), max_iter=100);

heatmap(cuda_mb, size=(500, 500), axis=false, aspect_ratio=:equal, colorbar_title="Iterations until provable divergence")
11.919268 seconds (18.46 M CPU allocations: 949.890 MiB, 1.15% gc time) (4 GPU allocations: 51.521 MiB, 0.02% memmgmt time)
Run it a second time to get a clean time:
CUDA.@time cuda_mb = compute_mandelbrot(CuArray(real_range_full), CuArray(imag_range_full), max_iter=100);
0.004562 seconds (295 CPU allocations: 17.199 MiB) (4 GPU allocations: 51.521 MiB, 0.59% memmgmt time)
On my machine, this takes about 0.005708 seconds, 27.2x faster than the single-threaded CPU version.
And the top part:
# the first run is normally slow due to compiling, this run should be much much faster
CUDA.@time cuda_mb_top = compute_mandelbrot(CuArray(real_range_top), CuArray(imag_range_top), max_iter=250);

heatmap(cuda_mb_top, size=(500, 500), axis=false, aspect_ratio=:equal, colorbar_title="Iterations until provable divergence")
0.010385 seconds (297 CPU allocations: 17.199 MiB) (4 GPU allocations: 51.521 MiB, 2.37% memmgmt time)
| Execution mode | Time (msec) | Speedup |
|---|---|---|
| CPU (single thread) | 155.54 | 1x |
| CPU (multithreaded) | 46.68 | 3.3x |
| GPU (CUDA) | 5.71 | 27.2x |
GPUs are largely fast for two reasons: they offer massive parallelism (thousands of simple execution units working on different data at the same time), and they have much higher memory bandwidth than CPUs.
GPUs achieve these incredible stats with a completely different architecture to CPUs.
The core of GPU parallelism lies in the Single Instruction, Multiple Thread (SIMT) execution model. Unlike CPUs where each core executes independent instructions, a GPU’s Streaming Multiprocessor (SM) operates on groups of threads called warps (32 threads). All threads within a warp execute the same instruction simultaneously, but on different data.
A CPU has a very sophisticated superscalar pipeline to execute a rich set of instructions one after the other, to allow for expressing a huge generality of programs and executing them simultaneously. A GPU on the other hand is built to run basically the same code on different data extremely efficiently.
The NVIDIA GPU execution model organizes computation in a hierarchy: a kernel launch creates a grid, the grid is made up of thread blocks, each block is scheduled onto an SM, and within an SM the block's threads execute in warps of 32. If threads within a warp diverge (e.g. take different branches of `if-else` statements), the execution paths are serialized (run one after the other, e.g. first the `if` branch, then the `else` branch), impacting performance.

Hierarchy flow: Grid → Thread Blocks → Warps → Threads.
This hierarchy allows for massive parallelism by distributing work across many SMs, and within each SM, by executing multiple warps concurrently and leveraging specialized hardware units.
Here is what a CPU cache hierarchy looks like: each core has a small, very fast L1 cache and a larger L2 cache, all cores share a bigger-still L3 cache, and beyond that sits main memory (RAM).

Consider my main desktop from around 2020, consisting of an AMD Ryzen 5800X (8 cores, each with 32 KB of L1 data cache and 512 KB of L2 cache, plus 32 MB of shared L3 cache) and 64 GB of DDR4 RAM at 3200 MT/s.
Modern CPUs read memory one cache line at a time: that is, 64 bytes, or 512 bits. When my CPU fetches some data, it first looks in L1 cache, then L2 cache, then L3 cache. If it doesn’t find it in any of them, it issues a DDR burst to grab a cache line worth of data.
When the CPU issues a DDR read command, the memory module will transfer 64 bits (that’s the memory data path per channel) of data at a time from RAM to the CPU. The CPU generally reads one cache line at a time so 64 bytes, which is 8x64 bit reads.
With two DDR4 modules (sticks of RAM), in dual-channel mode, we can read twice the bytes. 3200 MT/s means 3200 million transfers per second. This gives \(3200*2*64/8\) million bytes per second, or roughly 50 GB/s of sequential read from RAM. This is the theoretical maximum rate at which we can process large datasets on my PC.
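Spelling out the units in that calculation:

\[
3200 \times 10^{6}\ \tfrac{\text{transfers}}{\text{s}} \times 2\ \text{channels} \times 8\ \tfrac{\text{bytes}}{\text{transfer}} = 51.2 \times 10^{9}\ \tfrac{\text{bytes}}{\text{s}} \approx 50\ \text{GB/s}.
\]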
This is just the maximum bandwidth. Random reads will be much slower; a good rule of thumb is a tenth of that. This also does not consider latency. See this:
From <https://gist.github.com/jboner/2841832>
Latency Comparison Numbers
----------------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
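The gap between sequential and random access is easy to demonstrate from Julia. Below is a rough sketch (the function `sum_by_index` is made up for illustration; exact timings depend on your machine, but on a typical cache hierarchy the shuffled order is several times slower):

using BenchmarkTools
using Random

data = rand(10^7)                      # ~80 MB of Float64, much larger than the caches
seq_idx  = collect(1:length(data))
rand_idx = shuffle(seq_idx)            # same indices, scrambled order

function sum_by_index(x, idx)
    s = 0.0
    for i in idx
        s += x[i]
    end
    return s
end

@btime sum_by_index($data, $seq_idx)   # cache-line and prefetch friendly
@btime sum_by_index($data, $rand_idx)  # mostly cache misses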
While our CPU's memory bandwidth of approximately 50 GB/s for sequential reads is impressive for a general-purpose processor, this is peanuts compared to GPUs. Let's consider the NVIDIA RTX 3090, a late-2020 high-end consumer GPU, with its 384-bit memory bus and GDDR6X memory running at 19.5 Gbps per pin.

This gives a theoretical maximum memory bandwidth of \(384/8\) bytes at 19.5 Gbps \(= 936\) GB/s. That's 18.7x the CPU.
A “kernel” is a function that runs on the GPU. When you launch a kernel, thousands of threads are created, and each thread executes the kernel function on its own piece of data.
# First, ensure CUDA is functional and you have a device
if !CUDA.functional()
    error("CUDA is not functional! Do you have an NVIDIA GPU and proper drivers?")
end
CUDA.device!(0) # Select GPU 0 (if you have multiple)
CuDevice(0): NVIDIA GeForce RTX 3090
To write your own kernel, you define a regular Julia function, but it needs to be compatible with GPU execution. You then launch it using the `@cuda` macro. Inside a kernel, you access special functions to determine the current thread's ID and block ID within the grid:
- `blockIdx().x`, `blockIdx().y`, `blockIdx().z`: the 1D, 2D, or 3D index of the current thread block.
- `blockDim().x`, `blockDim().y`, `blockDim().z`: the dimensions (number of threads) of the current thread block.
- `threadIdx().x`, `threadIdx().y`, `threadIdx().z`: the 1D, 2D, or 3D index of the current thread within its block.
- `gridDim().x`, `gridDim().y`, `gridDim().z`: the dimensions (number of blocks) of the grid of thread blocks.

From these, you calculate a global unique index for each thread:
# A simple kernel that adds two arrays element-wise
function add_kernel!(C, A, B)
    # Calculate global linear index for the current thread
    # This formula works for 1D arrays or flat 2D/3D arrays
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # Check if the index is within the bounds of the array
    # This is crucial as you often launch more threads than elements
    if idx <= length(A)
        @inbounds C[idx] = A[idx] + B[idx] # @inbounds for performance, assumes bounds checking is handled
    end
    return
end
# Host-side function to prepare data and launch the kernel
function gpu_add_arrays(A_host, B_host)
    # Move input data to the GPU
    A_d = CuArray(A_host)
    B_d = CuArray(B_host)
    # Allocate memory for the result on the GPU
    C_d = similar(A_d)

    N = length(A_host)
    # Determine launch configuration:
    # threads_per_block: How many threads in each block (typically 128, 256, 512, 1024)
    # blocks_per_grid: How many blocks needed to cover all N elements
    threads_per_block = 256
    blocks_per_grid = ceil(Int, N / threads_per_block)

    # Launch the kernel!
    @cuda threads=threads_per_block blocks=blocks_per_grid add_kernel!(C_d, A_d, B_d)

    # Synchronize the GPU (wait for all kernel computations to finish)
    # This happens implicitly when you transfer data back or if you need results immediately.
    # explicit CUDA.synchronize() can be used for timing.

    # Move the result back to the CPU (if needed for further CPU processing)
    return Array(C_d)
end
N_gpu_example = 10^7
A_h = rand(Float32, N_gpu_example)
B_h = rand(Float32, N_gpu_example)

println("Starting custom GPU kernel array addition...")
@time C_h_gpu_custom = gpu_add_arrays(A_h, B_h)
println("Custom GPU kernel array addition finished.")
Starting custom GPU kernel array addition...
0.412340 seconds (667.84 k allocations: 72.529 MiB, 2.76% gc time, 84.90% compilation time)
Custom GPU kernel array addition finished.
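As a quick sanity check (a small addition, assuming the variables from the example above are still in scope), the result computed by our custom kernel should match a plain CPU broadcast:

# Compare the custom-kernel GPU result with an ordinary CPU computation
@assert C_h_gpu_custom ≈ A_h .+ B_h
println("GPU and CPU results agree")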
This direct kernel programming gives you maximum control but can be complex.
When you perform element-wise operations on `CuArray`s using Julia's broadcasting syntax (e.g., `.+`, `.*`, or the dot syntax `@.`), CUDA.jl automatically compiles and launches an optimized GPU kernel for you. This is incredibly efficient because it can fuse multiple operations into a single kernel, minimizing memory accesses and launch overheads.
using CUDA

N_implicit = 10^7
X_h = rand(Float32, N_implicit)
Y_h = rand(Float32, N_implicit)

# Move to device once
X_d = CuArray(X_h)
Y_d = CuArray(Y_h)

println("\nStarting GPU implicit kernel operations...")
# Complex broadcast operation - Julia automatically generates an efficient kernel
@time Z_d = @. X_d * sin(Y_d^2) + sqrt(X_d)
println("GPU implicit kernel operations finished.")

# Julia's Base functions often have GPU implementations that work seamlessly
@time S_d = CUDA.sum(Z_d) # Optimized GPU reduction
println("GPU sum finished.")

# If you need the result back on the CPU:
Z_h_implicit = Array(Z_d)
S_h_implicit = S_d[] # Get the scalar value from the CuArray scalar

# Compare with CPU version for timing and correctness
println("\nStarting CPU operations for comparison...")
@time Z_h_cpu = @. X_h * sin(Y_h^2) + sqrt(X_h)
println("CPU element-wise finished.")
@time S_h_cpu = sum(Z_h_cpu)
println("CPU sum finished.")

println("Difference in sum: $(abs(S_h_implicit - S_h_cpu))")
Starting GPU implicit kernel operations...
0.971768 seconds (1.67 M allocations: 83.383 MiB, 86.54% compilation time)
GPU implicit kernel operations finished.
1.491146 seconds (3.32 M allocations: 168.340 MiB, 2.77% gc time, 87.43% compilation time)
GPU sum finished.
Starting CPU operations for comparison...
0.179868 seconds (437.39 k allocations: 60.804 MiB, 64.51% compilation time)
CPU element-wise finished.
0.022318 seconds (56.61 k allocations: 2.852 MiB, 93.37% compilation time)
CPU sum finished.
Difference in sum: 0.0
You’ll often find that for typical data processing tasks, this implicit kernel generation through broadcasting is all you need, and it performs remarkably well.
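To see why fusion matters, here is a sketch comparing the fused broadcast with the same computation written as separate steps (`X_d` and `Y_d` are the device arrays from the example above; the temporary names are made up):

# Fused: a single generated kernel, one pass over the data, one output array
Z_fused = @. X_d * sin(Y_d^2) + sqrt(X_d)

# Unfused: each line launches its own kernel and allocates its own temporary array
T1 = Y_d .^ 2
T2 = sin.(T1)
T3 = X_d .* T2
T4 = sqrt.(X_d)
Z_unfused = T3 .+ T4

Z_fused ≈ Z_unfused  # same result, but the fused version does far less memory traffic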
While GPUs are powerful, they are not a magic bullet. To get the most out of them, consider, for example, memory management: CUDA.jl generally manages memory well, but for very large problems you might need to reuse memory or process data in chunks.

Once you have your CUDA code running, you'll often want to optimize its performance. Profiling helps you identify bottlenecks in your code. Julia's CUDA.jl package integrates well with NVIDIA's profiling tools.
For a quick and easy way to see the time spent on individual kernel launches within your Julia code, you can use the built-in `CUDA.@profile` macro. This provides a high-level overview directly in your Julia REPL or output.
using CUDA

function saxpy_kernel!(Y, A, X)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= length(Y)
        @inbounds Y[idx] = A * X[idx] + Y[idx]
    end
    return
end

N = 10_000_000
A_scalar = Float32(2.0)
X = CUDA.fill(Float32(1.0), N)
Y = CUDA.fill(Float32(3.0), N)

# Profile the kernel call
CUDA.@profile begin
    @cuda threads=256 blocks=cld(N, 256) saxpy_kernel!(Y, A_scalar, X)
end
Profiler ran for 162.58 ms, capturing 36 events.

Host-side activity: calling CUDA APIs took 241.04 µs (0.15% of the trace)
┌──────────┬────────────┬───────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                │
├──────────┼────────────┼───────┼─────────────────────┤
│    0.04% │    68.9 µs │     1 │ cuModuleLoadDataEx  │
│    0.04% │   68.66 µs │     1 │ cuModuleGetFunction │
│    0.02% │   36.24 µs │     1 │ cuLaunchKernel      │
│    0.01% │    9.78 µs │     1 │ cuCtxSynchronize    │
└──────────┴────────────┴───────┴─────────────────────┘

Device-side activity: GPU was busy for 143.29 µs (0.09% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Name                                        ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────────────
│    0.09% │  143.29 µs │     1 │ saxpy_kernel_(CuDeviceArray<Float32, 1, 1>, ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────────────
                                                                1 column omitted
This tells you the execution time of the kernel itself on the GPU.
For deeper profiling you will want to use NVIDIA Nsight Systems.
In this section we briefly investigate Distributed.jl for distributed computing.
Distributed.jl runs on workers: processes running on the current host or on other hosts, communicating via the network. Much like with tasks in multithreading, Julia keeps track of remote references to objects that reside on other workers. However, unlike multithreading, where data is shared in the same memory space, in distributed computing data must be explicitly sent between workers, which means it must also be serialized between processes.
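As a small standalone sketch of that data movement (assuming a fresh Julia session where we add one worker, which gets id 2), `remotecall_fetch` ships its arguments to the worker and ships the result back:

using Distributed
addprocs(1)                         # one extra worker process (id 2)

data = rand(1_000)

# `data` is serialized, sent to worker 2, summed there, and the result sent back
result = remotecall_fetch(sum, 2, data)
println(result ≈ sum(data))         # true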
using Distributed

# Start with a few processes if not already done
if nprocs() == 1
    # addprocs(3) # Adds 3 workers, total 4 processes
end

# the beauty of Distributed.jl is that you can run this on multiple machines on a cluster,
# e.g. the following would add a process on another machine by SSHing into it
# addprocs("user@othermachine")
See our workers:
println(workers())
[1]
We can execute expressions on all workers with `@everywhere`:
# The @everywhere macro runs this on all processes, i.e. it creates `remote_task` in all procs
@everywhere function remote_task(x)
    println("Hello from worker $(myid()) with x = $x")
    sleep(x) # Simulate some work
    return "Worker $(myid()) finished task in $x seconds"
end
Then we can spawn tasks on specific workers with `@spawnat`:
# Spawn a task on worker 1
f1 = @spawnat 1 remote_task(3)
Future(1, 1, 1, ReentrantLock(nothing, 0x00000000, 0x00, Base.GenericCondition{SpinLock}(Base.IntrusiveLinkedList{Task}(nothing, nothing), SpinLock(0)), (135234848938624, 135234849061584, 141733920768)), nothing)
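The `@spawnat` call returns a `Future` immediately. To block until the remote task finishes and retrieve its return value, fetch it (continuing the example above):

fetch(f1)  # blocks for about 3 seconds, then returns "Worker 1 finished task in 3 seconds"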
@everywhere function estimate_pi_hits(num_samples_per_worker)
    hits = 0
    for _ in 1:num_samples_per_worker
        x = rand()
        y = rand()
        if x^2 + y^2 <= 1.0
            hits += 1
        end
    end
    return hits
end
total_samples = 10^8
num_workers = nprocs()
samples_per_worker = ceil(Int, total_samples / num_workers)

println("Starting distributed estimation of pi with $(num_workers) workers...")

@time begin
    # The `(+)` specifies a reduction operation: sum the results from each worker
    total_hits = @distributed (+) for i in 1:num_workers
        # Each worker calculates hits for its share of samples
        estimate_pi_hits(samples_per_worker)
    end
    estimated_pi = 4.0 * total_hits / (samples_per_worker * num_workers)
end

println("Estimated Pi: $estimated_pi")
println("Actual Pi: $(π)")

# Clean up workers if they were added programmatically
rmprocs(workers())
Starting distributed estimation of pi with 1 workers...
TaskFailedException nested task error: UndefVarError: `estimate_pi_hits` not defined in `Main.Notebook` Suggestion: check for spelling errors or missing imports. Stacktrace: [1] macro expansion @ /work/julia-ml/Julia_ML_training/Julia_ML_training/unit4/unit_4.qmd:848 [inlined] [2] (::var"#27#28")(reducer::typeof(+), R::UnitRange{Int64}, lo::Int64, hi::Int64) @ Main.Notebook /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/macros.jl:288 [3] #invokelatest#2 @ ./essentials.jl:1055 [inlined] [4] invokelatest @ ./essentials.jl:1052 [inlined] [5] #153 @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:425 [inlined] [6] run_work_thunk(thunk::Distributed.var"#153#154"{var"#27#28", Tuple{typeof(+), UnitRange{Int64}, Int64, Int64}, @Kwargs{}}, print_error::Bool) @ Distributed /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:70 [7] #remotecall_fetch#158 @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:450 [inlined] [8] remotecall_fetch @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:449 [inlined] [9] remotecall_fetch @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:492 [inlined] [10] (::Distributed.var"#175#176"{typeof(+), var"#27#28", UnitRange{Int64}, Vector{UnitRange{Int64}}, Int64, Int64})() @ Distributed /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/macros.jl:270 Stacktrace: [1] #remotecall_fetch#158 @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:451 [inlined] [2] remotecall_fetch @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:449 [inlined] [3] remotecall_fetch @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:492 [inlined] [4] (::Distributed.var"#175#176"{typeof(+), var"#27#28", UnitRange{Int64}, Vector{UnitRange{Int64}}, Int64, Int64})() @ Distributed /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/macros.jl:270 Stacktrace: [1] wait(t::Task) @ Base ./task.jl:370 [2] fetch @ ./task.jl:390 [inlined] [3] preduce(reducer::Function, f::Function, R::UnitRange{Int64}) @ Distributed /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/macros.jl:274 [4] macro expansion @ /work/julia-ml/Julia_ML_training/Julia_ML_training/unit4/unit_4.qmd:846 [inlined] [5] macro expansion @ ./timing.jl:581 [inlined] [6] top-level scope @ /work/julia-ml/Julia_ML_training/Julia_ML_training/unit4/unit_4.qmd:844
Let’s explore the Oceananigans.jl package and try to run it both on GPU and on CPU.
Let’s start with a very basic example from their quick start guide:
using Oceananigans
using CairoMakie
grid = RectilinearGrid(GPU(),
                       size = (1024, 1024),
                       x = (-π, π),
                       y = (-π, π),
                       topology = (Periodic, Periodic, Flat))

model = NonhydrostaticModel(; grid, advection=WENO(), tracers=:c)

δ = 0.5
cᵢ(x, y) = exp(-(x^2 + y^2) / 2δ^2)
ϵ(x, y) = 2rand() - 1
set!(model, u=ϵ, v=ϵ, c=cᵢ)

simulation = Simulation(model; Δt=1e-3, stop_time=10)
conjure_time_step_wizard!(simulation, cfl=0.2, IterationInterval(10))
run!(simulation)

u, v, w = model.velocities
ζ = Field(∂x(v) - ∂y(u))

fig = Figure(size=(1200, 600))
axζ = Axis(fig[1, 1], aspect=1, title="vorticity")
axc = Axis(fig[1, 2], aspect=1, title="tracer")
heatmap!(axζ, ζ, colormap=:balance)
heatmap!(axc, model.tracers.c)
current_figure()
┌ Warning: You are using Julia v1.11 or later!" │ Oceananigans is currently tested on Julia v1.10." │ If you find issues with Julia v1.11 or later," │ please report at https://github.com/CliMA/Oceananigans.jl/issues/new └ @ Oceananigans ~/.julia/packages/Oceananigans/Xawvn/src/Oceananigans.jl:124 [ Info: Oceananigans will use 8 threads [ Info: Initializing simulation... [ Info: ... simulation initialization complete (8.027 seconds) [ Info: Executing initial time step... [ Info: ... initial time step complete (2.272 seconds). [ Info: Simulation is stopping after running for 2.704 minutes. [ Info: Simulation time 10 seconds equals or exceeds stop time 10 seconds.
UndefVarError: `heatmap!` not defined in `Main.Notebook` Hint: It looks like two or more modules export different bindings with this name, resulting in ambiguity. Try explicitly importing it from a particular module, or qualifying the name with the module it should come from. Hint: a global variable of this name may be made accessible by importing Plots in the current active module Main Hint: a global variable of this name may be made accessible by importing MakieCore in the current active module Main Hint: a global variable of this name may be made accessible by importing Makie in the current active module Main Hint: a global variable of this name may be made accessible by importing CairoMakie in the current active module Main Stacktrace: [1] top-level scope @ /work/julia-ml/Julia_ML_training/Julia_ML_training/unit4/unit_4.qmd:908
We'll then go through one of their examples.
Try the “port `sqrt_sum()` to GPU” exercise provided here.

1. It's not entirely clear how widespread the use of GPUs was for cracking cryptosystems before the introduction of CUDA (after which it exploded). There were certainly examples and it was a budding technique in the community, but I couldn't find concrete and widely used examples to cite.↩︎