An Accumulation Point workshop for AIMS, delivered during June, 2025. See workshop materials in the AIMS GitHub repo.
For decades, computers got faster mainly because chipmakers squeezed more transistors onto chips (Moore's law) and raised clock speeds. But after the early 2000s, increasing clock speeds further ran into heat and power dissipation limits. To get more performance out of machines, CPU manufacturers instead started putting multiple cores on CPUs: identical and somewhat isolated execution units that can do work independently. In essence, it became harder to make individual CPUs faster, so they just put many cores on one die.
GPUs, which were designed for graphics, also became much more powerful and began to be used for general-purpose computing.
Now, to speed up programs beyond what a single core can do, we have to use parallel computing across multiple cores, CPUs, GPUs, or even multiple machines working together—this is called distributed computing. See for example the article Sutter et al. (2005).
This brings a whole new set of challenges to writing programs: we can no longer just write code that runs linearly in the obvious order; instead we have to come up with algorithms that can be run in parallel, speeding up execution by performing different parts on different logical computation units. This brings with it a whole new horde of bugs, but also incredible savings in time and gains in throughput, and even the ability to do computations that would otherwise be impossible (e.g. training LLMs would never be feasible on a single-threaded CPU).
Julia is well suited for distributed computing, parallel computing, and GPU use. See the overview of parallel computing in the Julia documentation. In this unit we deal with several layers of parallelism for speedup:

1. Multi-threading: multiple threads sharing memory within a single Julia process.
2. Distributed (multi-process) computing: multiple Julia processes, on one machine or across several machines.
3. GPU computing: offloading suitable computations to a graphics card.

Combinations of 1, 2, and 3 are obviously possible.
Note that a related item, also in the Julia documentation, is “Asynchronous ‘tasks’, or coroutines”. See the Asynchronous Programming docs.
We now focus on multi-threading, distributed computing, and GPUs. We then run an example with Oceananigans.jl from the CliMA organization for ocean simulation. See the recent preprint Wagner et al. (2025).
Knowing when and where to apply parallelisation is as crucial as knowing how. Not all code benefits from parallelisation, and overheads can even slow things down.
If your computation can be broken into many independent chunks (for example, when each iteration of a `for` loop is self-contained), it is ideal for parallelization.

These are great resources for Julia's parallel computing ecosystem:

- Distributed.jl, MPI.jl
- CUDA.jl, AMDGPU.jl, etc.

We will not delve into MPI (Message Passing Interface), but it is implemented in Julia through MPI.jl. It is often used for computations on HPC clusters, so you might want to check it out. In Julia, multi-process functionality is largely provided by Distributed.jl.
Because the focus is on numerical computing relevant to AIMS, we won't be discussing `@async` and non-heterogeneous multithreading in detail, nor will we talk in depth about locking and synchronization.

We also won't be delving into low-level CPU optimizations like explicit SIMD vectorization or LoopVectorization.jl. Note that Julia (like other LLVM-based languages) is able to utilize SIMD automatically in a number of cases.
This Quarto notebook was generated with the command line flag `-t 8` (or `--threads 8`), meaning 8 threads are available. Without this flag, Julia starts up with only 1 thread. If you try `julia --help` in the command line, you'll see:
-t, --threads {auto|N[,auto|M]}
Enable N[+M] threads; N threads are assigned to the `default`
threadpool, and if M is specified, M threads are assigned to the
`interactive` threadpool; `auto` tries to infer a useful
default number of threads to use but the exact behavior might change
in the future. Currently sets N to the number of CPUs assigned to
this Julia process based on the OS-specific affinity assignment
interface if supported (Linux and Windows) or to the number of CPU
threads if not supported (MacOS) or if process affinity is not
configured, and sets M to 1.
The `interactive` threadpool has higher priority and is meant for interactive tasks, such as updating the UI or doing something user facing, whereas `default` is meant for things like background jobs.
Let's use the `nthreads` function inside Julia to see how many threads there are:
Threads.nthreads()
8
To help us visualize execution times, we're going to build a small function, `time_stamp`, that computes the number of tenths of a second elapsed since last midnight. Remember that there are 86400 seconds in a day, so expect the values of this function to reach almost up to \(10^6\). We'll use this as a relative timestamp:
using Dates

"""
Returns how many tenths of seconds passed since midnight
"""
function time_stamp()
    now = Dates.now()
    midnight = DateTime(Date(now)) # Today at 00:00:00
    return Millisecond(now - midnight).value ÷ 100
end;
For basic parallelism, one of the best tools we have is the `Threads.@threads` macro. Now since we have 8 threads in our system, observe that this loop essentially runs in three batches (since there are 17 tasks, which is more than 16 but fewer than 24).
before_start = time_stamp()

Threads.@threads for i in 1:17
    println(time_stamp(),
            "."^(time_stamp()-before_start), # spacing proportional to elapsed time
            " Starting iteration $i")
    sleep(1) # sleep for one second as though "processing something"
    println(time_stamp(),
            "."^(time_stamp()-before_start), # spacing proportional to elapsed time
            " Finished sleeping (\"processing\") on thread $(Threads.threadid())")
end
77667.. Starting iteration 8
77667.. Starting iteration 10
77667.. Starting iteration 4
77667.. Starting iteration 16
77667.. Starting iteration 14
77667.. Starting iteration 6
77667.. Starting iteration 12
77667.. Starting iteration 1
77677............ Finished sleeping ("processing") on thread 3
77677............ Finished sleeping ("processing") on thread 5
77677............ Starting iteration 9
77677............ Finished sleeping ("processing") on thread 1
77677............ Finished sleeping ("processing") on thread 8
77677............ Finished sleeping ("processing") on thread 4
77677............ Starting iteration 7
77677............ Starting iteration 15
77677............ Finished sleeping ("processing") on thread 7
77677............ Starting iteration 5
77677............ Starting iteration 11
77677............ Starting iteration 17
77677............ Finished sleeping ("processing") on thread 8
77677............ Finished sleeping ("processing") on thread 6
77677............ Starting iteration 2
77677............ Starting iteration 13
77687...................... Finished sleeping ("processing") on thread 5
77687...................... Finished sleeping ("processing") on thread 7
77687...................... Finished sleeping ("processing") on thread 2
77687...................... Finished sleeping ("processing") on thread 8
77687...................... Finished sleeping ("processing") on thread 6
77687...................... Finished sleeping ("processing") on thread 1
77687...................... Finished sleeping ("processing") on thread 4
77687...................... Finished sleeping ("processing") on thread 3
77687...................... Starting iteration 3
77697................................ Finished sleeping ("processing") on thread 4
The simplest class of algorithms to parallelize are the embarrassingly parallel ones: those where we perform completely independent, isolated computations. They are a good place to start learning how to split computation apart in Julia.
The Mandelbrot set \(\mathcal M\subseteq\mathbb{C}\) is the set of numbers \(c\in\mathbb{C}\) (in the complex plane) where iterating the function \(f_c(z)=z^2+c\) does not diverge to infinity (where we start iterating with \(z=0\)). Clearly for example \(0+0i\in\mathcal M\) but \(4+0i\not\in\mathcal M\).
More formally we define the Mandelbrot set \(\mathcal M\) as follows. Fix \(c\in\mathbb{C}\), define \(z_0=0\) and set \(z_{k+1}=f_c(z_k)=z_k^2+c\). Then \(c\in\mathcal M\) if and only if \(\limsup_{k\to\infty}|z_k|<\infty\).
For example, at \(c=-1+0i\), we have the iterates \(0,\ -1,\ 0,\ -1,\ \ldots\), which stay bounded forever.
On the other hand, at \(c=3i\), we have the iterates \(0,\ 3i,\ -9+3i,\ 72-51i,\ \ldots\), which blow up to infinity.
Observe that if \(|z_n|>2\) at any point, then we are guaranteed that the iterates go to infinity. This helps us simplify our code to compute \(\mathcal M\).
We can write a Julia function to compute this as follows:
function mandelbrot_iter(c::Complex, max_iter::Int=50)::Int
    # initializes z to the zero element of its type, in this case 0+0i
    z = zero(c)
    for i in 1:max_iter
        z = z*z + c
        # abs2 computes the square of the absolute value, saves us a tiny bit by not computing the square root
        if abs2(z) > 4
            return i
        end
    end
    return max_iter
end
mandelbrot_iter (generic function with 2 methods)
Here we are outputting the iteration on which the sequence has \(|z|>2\), as this will let us make pretty images.
We said earlier that \(0+0i\in\mathcal{M}\),
mandelbrot_iter(0im)
50
And that \(-1+0i\in\mathcal{M}\),
mandelbrot_iter(-1+0im)
50
And that \(3i\not\in\mathcal{M}\),
mandelbrot_iter(3im)
1
Let’s now write a function to run it over a grid:
function compute_mandelbrot(
        real_range,     # real (x-axis) values
        imag_range;     # imaginary (y-axis) values
        max_iter::Int   # max iterations per run
    )::Matrix{Int}
    # produce the grid of c values
    c_grid = complex.(real_range', imag_range)
    # apply mandelbrot_iter to each value of c_grid
    return mandelbrot_iter.(c_grid, max_iter)
end;
Let’s define a box that contains most of \(\mathcal M\):
width, height = 1500, 1500

real_range_full = LinRange(-2., 0.5, width)
imag_range_full = LinRange(-1.25, 1.25, height);
And run it to see how fast we go:
using BenchmarkTools
@time my_mb = compute_mandelbrot(real_range_full, imag_range_full, max_iter=100);
0.502244 seconds (610.02 k allocations: 81.952 MiB, 2.02% gc time, 64.51% compilation time)
Run it a second time to get a clean time:
@time my_mb = compute_mandelbrot(real_range_full, imag_range_full, max_iter=100);
0.176925 seconds (6 allocations: 51.504 MiB, 5.10% gc time)
On my machine, this takes about 0.155540 seconds.
Nice picture:
using Plots
heatmap(my_mb, size=(500, 500), axis=false, aspect_ratio=:equal, colorbar_title="Iterations until provable divergence")
Zoom in on the top part with some more iterations:
real_range_top = LinRange(-.3, .1, width)
imag_range_top = LinRange(.6, 1, height)
@time my_mb_top = compute_mandelbrot(real_range_top, imag_range_top, max_iter=250);
heatmap(my_mb_top, size=(500, 500), axis=false, aspect_ratio=:equal, colorbar_title="Iterations until provable divergence")
0.484760 seconds (6 allocations: 51.502 MiB, 0.72% gc time)
function compute_mandelbrot_multithreaded(
        real_range,     # real (x-axis) values
        imag_range;     # imaginary (y-axis) values
        max_iter::Int   # max iterations per run
    )::Matrix{Int}
    # produce the grid of c values
    c_grid = complex.(real_range', imag_range)
    # output array
    out = zeros(Int64, size(c_grid))
    # note the @threads annotation
    Threads.@threads for i in eachindex(out)
        out[i] = mandelbrot_iter(c_grid[i], max_iter)
    end
    return out
end
@time mt_mb = compute_mandelbrot_multithreaded(real_range_full, imag_range_full, max_iter=100);
0.141279 seconds (60.00 k allocations: 54.446 MiB, 139.61% compilation time)
Run it a second time to get a clean time:
@time mt_mb = compute_mandelbrot_multithreaded(real_range_full, imag_range_full, max_iter=100);
0.077174 seconds (48 allocations: 51.506 MiB, 27.83% gc time)
On my machine, this takes about 0.046682 seconds, about 3.3x as fast as the original one.
Threads.@threads
The `Threads.@threads` macro allows you to specify different schedulers for how loop iterations are assigned to threads. The default scheduler, `:dynamic`, assumes loop iterations are approximately uniform in cost. If this is not the case, you are better off using `:greedy`, which greedily picks up the next iteration whenever a thread finishes its current one. This is better for non-uniform iterations.
As an example, consider this loop with artificial and highly non-uniform iteration times, where with 10% probability an iteration sleeps for a long time:
using Random
using BenchmarkTools

function do_work()
    if rand() < 0.1
        sleep(.3)
    end
end
do_work (generic function with 1 method)
With the standard scheduler (`:dynamic`, omitted):
Random.seed!(1974)
@btime begin
    # same as Threads.@threads :dynamic for _ ∈ 1:200
    Threads.@threads for _ ∈ 1:200
        do_work()
    end
end
1.207 s (98 allocations: 5.66 KiB)
With `:greedy`:
Random.seed!(1974)
@btime begin
    Threads.@threads :greedy for _ ∈ 1:200
        do_work()
    end
end
603.718 ms (133 allocations: 7.94 KiB)
There is also `:static`. See the docs.
Why not always use `:greedy`? Because of cache coherence and other reasons, `:dynamic`, which hands out blocks of iterations at a time, is often faster for uniform iterations.
Consider this simple function that computes (a truncation of) the sum
\[ \sum_{n=1}^\infty\frac{1}{n^2} \]
It's a well-known identity that this sum equals \(\pi^2/6\) (but not so easy to prove).
function inverse_squares_simple(arr::Vector{Int64})
    sum = 0.
    for i in eachindex(arr)
        sum += 1/(arr[i]^2)
    end
    return sum
end
inverse_squares_simple (generic function with 1 method)
using Base.Threads

function inverse_squares_buggy_multithreaded(arr::Vector{Int64})
    sum = 0.
    # Try to use @threads to parallelize the loop over array indices
    @threads for i in eachindex(arr)
        # Each thread attempts to add to the sum
        sum += 1/(arr[i]^2) # <-- This is problematic!
    end
    return sum
end
N = 1_000
numbers = collect(1:N)
true_val = π^2/6
naive_sum = inverse_squares_simple(numbers)

println("True value is $true_val, naive method gives $naive_sum, or around $(round(100*naive_sum/true_val, digits=2))% of true val")

for _ in 1:10
    result = inverse_squares_buggy_multithreaded(numbers)
    println("\"sum\" is $result, got $(round(100*result/naive_sum, digits=2))% of naive sum $naive_sum")
end
True value is 1.6449340668482264, naive method gives 1.6439345666815615, or around 99.94% of true val
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6383048604989814, got 99.66% of naive sum 1.6439345666815615
"sum" is 1.5285646487446394, got 92.98% of naive sum 1.6439345666815615
"sum" is 0.0015750123760240658, got 0.1% of naive sum 1.6439345666815615
"sum" is 1.3650871857767202, got 83.04% of naive sum 1.6439345666815615
"sum" is 0.0007901012204946777, got 0.05% of naive sum 1.6439345666815615
"sum" is 1.5882829451460938, got 96.61% of naive sum 1.6439345666815615
"sum" is 1.627378117950526, got 98.99% of naive sum 1.6439345666815615
"sum" is 1.5963059480767745, got 97.1% of naive sum 1.6439345666815615
"sum" is 1.2509147702998493, got 76.09% of naive sum 1.6439345666815615
The problem is caused by a data race: one thread reads the value of `sum`, and then another thread reads it before the first has written its update back. Regardless of the order in which the two threads now write their results to that piece of memory, the result will be wrong, as at least one summand will be “lost”.

One can get around this by wrapping the shared variable in an “atomic” type (atomic means that its updates are “atoms”: they cannot be broken down into a separate read and a write that could end up un-synchronized):
function inverse_squares_with_atomic(arr::Vector{Int64})
    sum = Atomic{Float64}(0.)
    @threads for i in eachindex(arr)
        atomic_add!(sum, 1/(arr[i]^2))
    end
    # The `[]` syntax dereferences the Atomic value.
    return sum[]
end
for _ in 1:10
    result = inverse_squares_with_atomic(numbers)
    println("\"sum\" is $result, got $(round(100*result/naive_sum, digits=2))% of naive sum $naive_sum")
end
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815612, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815617, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
"sum" is 1.6439345666815615, got 100.0% of naive sum 1.6439345666815615
But note:
@benchmark inverse_squares_simple(numbers)
BenchmarkTools.Trial: 10000 samples with 18 evaluations per sample.
 Range (min … max):  977.778 ns … 1.723 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.044 μs                ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.053 μs ± 55.308 ns    ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark inverse_squares_buggy_multithreaded(numbers)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  17.600 μs … 22.555 ms   ┊ GC (min … max): 0.00% … 97.70%
 Time  (median):     20.394 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.107 μs ± 225.469 μs  ┊ GC (mean ± σ):  9.54% ± 0.98%
 Memory estimate: 35.52 KiB, allocs estimate: 2043.
@benchmark inverse_squares_with_atomic(numbers)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  25.772 μs … 140.243 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     52.242 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   51.633 μs ± 5.828 μs    ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 4.27 KiB, allocs estimate: 43.
Trying to do it with multithreading in this case made it slower (and broke it!), and fixing it up with an atomic made it even slower: it's now around 40x slower for no benefit.

Parallelism can be hard. You have to be careful with various failure modes like race conditions. We won't delve deeper into locking and synchronization primitives; it's a huge area.
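That said, for this kind of reduction there is a standard pattern that avoids both the race and the atomics: give each chunk of the work its own local accumulator and combine the partial results at the end. Here is a minimal sketch (not part of the original workshop code; the function name and its nchunks keyword are made up for illustration):

function inverse_squares_chunked(arr::Vector{Int64}; nchunks::Int = Threads.nthreads())
    # split the index range into contiguous chunks
    chunks = collect(Iterators.partition(eachindex(arr), cld(length(arr), nchunks)))
    partials = zeros(length(chunks))
    Threads.@threads for k in eachindex(chunks)
        s = 0.0                 # each chunk gets its own local accumulator
        for i in chunks[k]
            s += 1/(arr[i]^2)
        end
        partials[k] = s         # each k is written by exactly one iteration: no race
    end
    return sum(partials)        # combine the partial sums serially
end

On the array above, `inverse_squares_chunked(numbers)` agrees with `inverse_squares_simple(numbers)` up to floating-point rounding, and for large arrays it scales with the number of threads instead of serializing on a single shared counter.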
Threads.@spawn
Instead of simply splitting a loop's iterations, `Threads.@spawn` allows you to launch an individual function call or expression as a separate, lightweight task (often called a “coroutine” or “goroutine” in other languages). Julia's runtime then intelligently schedules these tasks to run on the available threads in your thread pool.
Note that these spawned tasks are not separate operating system threads: they are scheduled by Julia onto its own threadpool, whose size is fixed at startup. (This is why you need to specify the number of `--threads` on startup.)
When you use `Threads.@spawn`, it immediately returns a `Task` object. This `Task` object is like a promise that the computation will be done eventually. To get the actual result from the `Task`, you use `fetch()`. If you just want to know when the task is complete without needing its return value, you can use `wait()`.
function simulate_work(it)
    println(time_stamp(),
            "."^(time_stamp()-before_start), # spacing proportional to elapsed time
            " Starting iteration $it")
    sleep(1) # sleep for one second as though "processing something"
    println(time_stamp(),
            "."^(time_stamp()-before_start), # spacing proportional to elapsed time
            " Finished sleeping (\"processing\") on thread $(Threads.threadid())")
    return it
end

tasks = Vector{Task}()

before_start = time_stamp()
for i in 1:17
    push!(tasks, Threads.@spawn simulate_work(i))
end
println("Spawned $(length(tasks)) tasks")

results = [fetch(task) for task in tasks]
println(time_stamp(), "."^(time_stamp()-before_start) * " Task return values: $results")
78293. Starting iteration 11
78293. Starting iteration 5
78293. Starting iteration 8
78293. Starting iteration 12
78293. Starting iteration 3
78293. Starting iteration 15
78293. Starting iteration 16
78293. Starting iteration 7
78293. Starting iteration 13
78293. Starting iteration 6
78293. Starting iteration 14
78293. Starting iteration 17
78293. Starting iteration 1
78293. Starting iteration 2
78293. Starting iteration 9
78293. Starting iteration 4
78293. Starting iteration 10
Spawned 17 tasks
78303........... Finished sleeping ("processing") on thread 8
78303........... Finished sleeping ("processing") on thread 7
78303........... Finished sleeping ("processing") on thread 5
78303........... Finished sleeping ("processing") on thread 3
78303........... Finished sleeping ("processing") on thread 2
78303........... Finished sleeping ("processing") on thread 1
78303........... Finished sleeping ("processing") on thread 6
78303........... Finished sleeping ("processing") on thread 4
78303........... Finished sleeping ("processing") on thread 1
78303........... Finished sleeping ("processing") on thread 8
78303........... Finished sleeping ("processing") on thread 3
78303........... Finished sleeping ("processing") on thread 6
78303........... Finished sleeping ("processing") on thread 1
78303........... Finished sleeping ("processing") on thread 4
78303........... Finished sleeping ("processing") on thread 7
78303........... Finished sleeping ("processing") on thread 2
78303........... Finished sleeping ("processing") on thread 5
78303........... Task return values: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
Threads.@spawn vs. Threads.@threads

| Feature | `Threads.@threads` | `Threads.@spawn` |
|---|---|---|
| Purpose | Loop parallelization (iterating over collections) | Launching individual, independent computations |
| Control Flow | Synchronous (main thread waits for loop to finish) | Asynchronous (main thread continues immediately) |
| Return Value | No direct return from individual iterations | Returns a `Task` object, which holds the result |
| Use Case | Homogeneous, iterative work | Heterogeneous, distinct, or long-running computations |
It makes sense to use `Threads.@spawn` if your “computation graph” is heterogeneous, i.e. different pieces of work take different amounts of time or depend on the inputs/outputs of other pieces, as sketched below.
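As a hypothetical sketch of such a graph (the functions `load_a`, `load_b`, and `combine` are made-up stand-ins for real work), two independent inputs can be produced concurrently, and a third task combines them once both are ready:

load_a()      = (sleep(1.0); collect(1:10))    # pretend: read a file
load_b()      = (sleep(0.5); collect(11:20))   # pretend: query a database
combine(a, b) = sum(a) + sum(b)

ta = Threads.@spawn load_a()
tb = Threads.@spawn load_b()
tc = Threads.@spawn combine(fetch(ta), fetch(tb))  # this task blocks until both inputs are ready

println(fetch(tc))  # finishes after roughly max(1.0, 0.5) seconds, not their sum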
GPGPU stands for General Purpose GPU computing: originally, GPUs were built for, and only really good at, accelerating graphics workflows: texture mapping, shading, rendering polygons and such for gaming and professional applications (CAD, architecture, rendering, etc.). They were highly optimized for these kinds of computations, which often involved computing small matrix multiplications over and over again. GPU performance was measured in things like polygons per second (which is a bit of a nonsensical measure). See Vuduc and Choi (2013).
Around the early 2000s, some hackers and researchers started coming up with novel ways of using GPUs for non-graphics applications. These worked largely by writing custom pixel shaders, which were meant to give graphics programs the ability to run very restricted but general code over fragments of textures and bitmaps. These applications creatively abused the graphics pipeline to do simple parallel computations.

For example, Ian Buck showed through Brook (during his PhD) that it's possible to implement operations like linear algebra and the FFT on GPUs with this methodology. He was quickly hired by NVIDIA and was one of the core individuals who created CUDA.
The stats on the number of floating point operations (FLOPs) a GPU could do compared to a CPU made it a very attractive platform for early research. This was directly enabled by the parallelism story on GPUs compared to CPUs – they were parallel from the beginning. A 2003 article from the magazine Computer states “The new GPUs are very fast. Stanford’s Ian Buck calculates that the current GeForce FX 5900’s performance peaks at 20 Gigaflops, the equivalent of a 10-GHz Pentium”, from Macedonia (2003) (the whole article is a very interesting read).
The cryptanalysis and hacker communities also used GPUs for computing rainbow tables and brute-forcing passwords and hash pre-images1. This was exacerbated by the Crypto Wars: futile efforts by the United States government to limit the availability of strong encryption to the general public and in export products (cryptography is still classified as a dual-use technology and its export is somewhat restricted). The new availability of highly parallel hardware, combined with new ideas around this time, made for practically feasible attacks on widely deployed cryptosystems, such as cracking passwords in networked Microsoft operating systems (e.g. LAN Manager), cracking early WiFi encryption (WEP, and later WPA), and breaking commonly used hashes (such as the abysmal MD5). Cryptography (and the laws around cryptography) have come a long way since then.
Julia makes it extremely simple to get code running on a GPU with CUDA.jl. Let's see an example:
# this might take a moment
using CUDA

# CUDA.@time is needed to synchronize GPU data to make sure we time everything
CUDA.@time cuda_mb = compute_mandelbrot(CuArray(real_range_full), CuArray(imag_range_full), max_iter=100);

heatmap(cuda_mb, size=(500, 500), axis=false, aspect_ratio=:equal, colorbar_title="Iterations until provable divergence")
11.919268 seconds (18.46 M CPU allocations: 949.890 MiB, 1.15% gc time) (4 GPU allocations: 51.521 MiB, 0.02% memmgmt time)
Run it a second time to get a clean time:
CUDA.@time cuda_mb = compute_mandelbrot(CuArray(real_range_full), CuArray(imag_range_full), max_iter=100);
0.004562 seconds (295 CPU allocations: 17.199 MiB) (4 GPU allocations: 51.521 MiB, 0.59% memmgmt time)
On my machine, this takes about 0.005708 seconds, 27.2x faster than the single-threaded CPU version.
And the top part:
# the first run is normally slow due to compiling, this run should be much much faster
CUDA.@time cuda_mb_top = compute_mandelbrot(CuArray(real_range_top), CuArray(imag_range_top), max_iter=250);

heatmap(cuda_mb_top, size=(500, 500), axis=false, aspect_ratio=:equal, colorbar_title="Iterations until provable divergence")
0.010385 seconds (297 CPU allocations: 17.199 MiB) (4 GPU allocations: 51.521 MiB, 2.37% memmgmt time)
| Execution mode | Time (msec) | Speedup |
|---|---|---|
| CPU (single thread) | 155.54 | 1x |
| CPU (multithreaded) | 46.68 | 3.3x |
| GPU (CUDA) | 5.71 | 27.2x |
GPUs are largely fast for two reasons: they offer massive parallelism (thousands of simple execution units working on different data at the same time), and they have much higher memory bandwidth than CPUs.
GPUs achieve these incredible stats with a completely different architecture to CPUs.
The core of GPU parallelism lies in the Single Instruction, Multiple Thread (SIMT) execution model. Unlike CPUs where each core executes independent instructions, a GPU’s Streaming Multiprocessor (SM) operates on groups of threads called warps (32 threads). All threads within a warp execute the same instruction simultaneously, but on different data.
A CPU has a very sophisticated superscalar pipeline to execute a rich set of instructions one after the other, to allow for expressing a huge generality of programs and executing them simultaneously. A GPU on the other hand is built to run basically the same code on different data extremely efficiently.
The NVIDIA GPU execution model organizes computation in a hierarchy: a kernel launch creates a grid, the grid is made up of thread blocks, each block is scheduled onto an SM, and within an SM the block's threads execute in warps of 32. If threads within a warp diverge (e.g. take different branches of `if-else` statements), the execution paths are serialized (run one after the other, e.g. first the `if` branch, then the `else` branch), impacting performance.

Hierarchy flow: Grid → Thread Blocks → Warps → Threads.
This hierarchy allows for massive parallelism by distributing work across many SMs, and within each SM, by executing multiple warps concurrently and leveraging specialized hardware units.
Here is what a CPU cache hierarchy looks like: each core has a small, very fast L1 cache and a larger L2 cache, all cores share a bigger-still L3 cache, and beyond that sits main memory (RAM).

Consider my main desktop from around 2020, consisting of an AMD Ryzen 5800X (8 cores, each with 32 KB of L1 data cache and 512 KB of L2 cache, plus 32 MB of shared L3 cache) and 64 GB of DDR4 RAM at 3200 MT/s.
Modern CPUs read memory one cache line at a time: that is, 64 bytes, or 512 bits. When my CPU fetches some data, it first looks in L1 cache, then L2 cache, then L3 cache. If it doesn’t find it in any of them, it issues a DDR burst to grab a cache line worth of data.
When the CPU issues a DDR read command, the memory module will transfer 64 bits (that’s the memory data path per channel) of data at a time from RAM to the CPU. The CPU generally reads one cache line at a time so 64 bytes, which is 8x64 bit reads.
With two DDR4 modules (sticks of RAM), in dual-channel mode, we can read twice the bytes. 3200 MT/s means 3200 million transfers per second. This gives \(3200*2*64/8\) million bytes per second, or roughly 50 GB/s of sequential read from RAM. This is the theoretical maximum rate at which we can process large datasets on my PC.
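Spelling out the units in that calculation:

\[
3200 \times 10^{6}\ \tfrac{\text{transfers}}{\text{s}} \times 2\ \text{channels} \times 8\ \tfrac{\text{bytes}}{\text{transfer}} = 51.2 \times 10^{9}\ \tfrac{\text{bytes}}{\text{s}} \approx 50\ \text{GB/s}.
\]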
This is just the maximum bandwidth. Random reads will be much slower; a good rule of thumb is a tenth of that. This also does not consider latency. See this:
From <https://gist.github.com/jboner/2841832>
Latency Comparison Numbers
----------------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
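The gap between sequential and random access is easy to demonstrate from Julia. Below is a rough sketch (the function `sum_by_index` is made up for illustration; exact timings depend on your machine, but on a typical cache hierarchy the shuffled order is several times slower):

using BenchmarkTools
using Random

data = rand(10^7)                      # ~80 MB of Float64, much larger than the caches
seq_idx  = collect(1:length(data))
rand_idx = shuffle(seq_idx)            # same indices, scrambled order

function sum_by_index(x, idx)
    s = 0.0
    for i in idx
        s += x[i]
    end
    return s
end

@btime sum_by_index($data, $seq_idx)   # cache-line and prefetch friendly
@btime sum_by_index($data, $rand_idx)  # mostly cache misses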
While our CPU's memory bandwidth of approximately 50 GB/s for sequential reads is impressive for a general-purpose processor, this is peanuts compared to GPUs. Let's consider the NVIDIA RTX 3090, a late-2020 high-end consumer GPU, with its 384-bit memory bus and GDDR6X memory running at 19.5 Gbps per pin.

This gives a theoretical maximum memory bandwidth of \(384/8\) bytes at 19.5 Gbps \(= 936\) GB/s. That's 18.7x the CPU.
A “kernel” is a function that runs on the GPU. When you launch a kernel, thousands of threads are created, and each thread executes the kernel function on its own piece of data.
# First, ensure CUDA is functional and you have a device
if !CUDA.functional()
    error("CUDA is not functional! Do you have an NVIDIA GPU and proper drivers?")
end
CUDA.device!(0) # Select GPU 0 (if you have multiple)
CuDevice(0): NVIDIA GeForce RTX 3090
To write your own kernel, you define a regular Julia function, but it needs to be compatible with GPU execution. You then launch it using the `@cuda` macro. Inside a kernel, you access special functions to determine the current thread's ID and block ID within the grid:
- `blockIdx().x`, `blockIdx().y`, `blockIdx().z`: the 1D, 2D, or 3D index of the current thread block.
- `blockDim().x`, `blockDim().y`, `blockDim().z`: the dimensions (number of threads) of the current thread block.
- `threadIdx().x`, `threadIdx().y`, `threadIdx().z`: the 1D, 2D, or 3D index of the current thread within its block.
- `gridDim().x`, `gridDim().y`, `gridDim().z`: the dimensions (number of blocks) of the grid of thread blocks.

From these, you calculate a global unique index for each thread:
# A simple kernel that adds two arrays element-wise
function add_kernel!(C, A, B)
    # Calculate global linear index for the current thread
    # This formula works for 1D arrays or flat 2D/3D arrays
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # Check if the index is within the bounds of the array
    # This is crucial as you often launch more threads than elements
    if idx <= length(A)
        @inbounds C[idx] = A[idx] + B[idx] # @inbounds for performance, assumes bounds checking is handled
    end
    return
end
# Host-side function to prepare data and launch the kernel
function gpu_add_arrays(A_host, B_host)
    # Move input data to the GPU
    A_d = CuArray(A_host)
    B_d = CuArray(B_host)
    # Allocate memory for the result on the GPU
    C_d = similar(A_d)

    N = length(A_host)
    # Determine launch configuration:
    # threads_per_block: How many threads in each block (typically 128, 256, 512, 1024)
    # blocks_per_grid: How many blocks needed to cover all N elements
    threads_per_block = 256
    blocks_per_grid = ceil(Int, N / threads_per_block)

    # Launch the kernel!
    @cuda threads=threads_per_block blocks=blocks_per_grid add_kernel!(C_d, A_d, B_d)

    # Synchronize the GPU (wait for all kernel computations to finish)
    # This happens implicitly when you transfer data back or if you need results immediately.
    # explicit CUDA.synchronize() can be used for timing.

    # Move the result back to the CPU (if needed for further CPU processing)
    return Array(C_d)
end
N_gpu_example = 10^7
A_h = rand(Float32, N_gpu_example)
B_h = rand(Float32, N_gpu_example)

println("Starting custom GPU kernel array addition...")
@time C_h_gpu_custom = gpu_add_arrays(A_h, B_h)
println("Custom GPU kernel array addition finished.")
Starting custom GPU kernel array addition...
0.412340 seconds (667.84 k allocations: 72.529 MiB, 2.76% gc time, 84.90% compilation time)
Custom GPU kernel array addition finished.
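As a quick sanity check (a small addition, assuming the variables from the example above are still in scope), the result computed by our custom kernel should match a plain CPU broadcast:

# Compare the custom-kernel GPU result with an ordinary CPU computation
@assert C_h_gpu_custom ≈ A_h .+ B_h
println("GPU and CPU results agree")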
This direct kernel programming gives you maximum control but can be complex.
When you perform element-wise operations on `CuArray`s using Julia's broadcasting syntax (e.g., `.+`, `.*`, or the dot syntax `@.`), CUDA.jl automatically compiles and launches an optimized GPU kernel for you. This is incredibly efficient because it can fuse multiple operations into a single kernel, minimizing memory accesses and launch overheads.
using CUDA

N_implicit = 10^7
X_h = rand(Float32, N_implicit)
Y_h = rand(Float32, N_implicit)

# Move to device once
X_d = CuArray(X_h)
Y_d = CuArray(Y_h)

println("\nStarting GPU implicit kernel operations...")
# Complex broadcast operation - Julia automatically generates an efficient kernel
@time Z_d = @. X_d * sin(Y_d^2) + sqrt(X_d)
println("GPU implicit kernel operations finished.")

# Julia's Base functions often have GPU implementations that work seamlessly
@time S_d = CUDA.sum(Z_d) # Optimized GPU reduction
println("GPU sum finished.")

# If you need the result back on the CPU:
Z_h_implicit = Array(Z_d)
S_h_implicit = S_d[] # Get the scalar value from the CuArray scalar

# Compare with CPU version for timing and correctness
println("\nStarting CPU operations for comparison...")
@time Z_h_cpu = @. X_h * sin(Y_h^2) + sqrt(X_h)
println("CPU element-wise finished.")
@time S_h_cpu = sum(Z_h_cpu)
println("CPU sum finished.")

println("Difference in sum: $(abs(S_h_implicit - S_h_cpu))")
Starting GPU implicit kernel operations...
0.971768 seconds (1.67 M allocations: 83.383 MiB, 86.54% compilation time)
GPU implicit kernel operations finished.
1.491146 seconds (3.32 M allocations: 168.340 MiB, 2.77% gc time, 87.43% compilation time)
GPU sum finished.
Starting CPU operations for comparison...
0.179868 seconds (437.39 k allocations: 60.804 MiB, 64.51% compilation time)
CPU element-wise finished.
0.022318 seconds (56.61 k allocations: 2.852 MiB, 93.37% compilation time)
CPU sum finished.
Difference in sum: 0.0
You’ll often find that for typical data processing tasks, this implicit kernel generation through broadcasting is all you need, and it performs remarkably well.
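To see why fusion matters, here is a sketch comparing the fused broadcast with the same computation written as separate steps (`X_d` and `Y_d` are the device arrays from the example above; the temporary names are made up):

# Fused: a single generated kernel, one pass over the data, one output array
Z_fused = @. X_d * sin(Y_d^2) + sqrt(X_d)

# Unfused: each line launches its own kernel and allocates its own temporary array
T1 = Y_d .^ 2
T2 = sin.(T1)
T3 = X_d .* T2
T4 = sqrt.(X_d)
Z_unfused = T3 .+ T4

Z_fused ≈ Z_unfused  # same result, but the fused version does far less memory traffic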
While GPUs are powerful, they are not a magic bullet. To get the most out of them, consider, for example, memory management: CUDA.jl generally manages memory well, but for very large problems you might need to reuse memory or process data in chunks.

Once you have your CUDA code running, you'll often want to optimize its performance. Profiling helps you identify bottlenecks in your code. Julia's CUDA.jl package integrates well with NVIDIA's profiling tools.
For a quick and easy way to see the time spent on individual kernel launches within your Julia code, you can use the built-in `CUDA.@profile` macro. This provides a high-level overview directly in your Julia REPL or output.
using CUDA

function saxpy_kernel!(Y, A, X)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= length(Y)
        @inbounds Y[idx] = A * X[idx] + Y[idx]
    end
    return
end

N = 10_000_000
A_scalar = Float32(2.0)
X = CUDA.fill(Float32(1.0), N)
Y = CUDA.fill(Float32(3.0), N)

# Profile the kernel call
CUDA.@profile begin
    @cuda threads=256 blocks=cld(N, 256) saxpy_kernel!(Y, A_scalar, X)
end
Profiler ran for 162.58 ms, capturing 36 events.

Host-side activity: calling CUDA APIs took 241.04 µs (0.15% of the trace)
┌──────────┬────────────┬───────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                │
├──────────┼────────────┼───────┼─────────────────────┤
│    0.04% │    68.9 µs │     1 │ cuModuleLoadDataEx  │
│    0.04% │   68.66 µs │     1 │ cuModuleGetFunction │
│    0.02% │   36.24 µs │     1 │ cuLaunchKernel      │
│    0.01% │    9.78 µs │     1 │ cuCtxSynchronize    │
└──────────┴────────────┴───────┴─────────────────────┘

Device-side activity: GPU was busy for 143.29 µs (0.09% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Name                                        ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────────────
│    0.09% │  143.29 µs │     1 │ saxpy_kernel_(CuDeviceArray<Float32, 1, 1>, ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────────────
                                                                1 column omitted
This tells you the execution time of the kernel itself on the GPU.
For deeper profiling you will want to use NVIDIA Nsight Systems.
In this section we briefly investigate Distributed.jl for distributed computing.
Distributed.jl runs on workers: processes running on the current host or on other hosts, communicating via the network. Much like with tasks in multithreading, Julia keeps track of remote references to objects that reside on other workers. However, unlike multithreading, where data is shared in the same memory space, in distributed computing data must be explicitly sent between workers, which means it must also be serialized between processes.
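As a small standalone sketch of that data movement (assuming a fresh Julia session where we add one worker, which gets id 2), `remotecall_fetch` ships its arguments to the worker and ships the result back:

using Distributed
addprocs(1)                         # one extra worker process (id 2)

data = rand(1_000)

# `data` is serialized, sent to worker 2, summed there, and the result sent back
result = remotecall_fetch(sum, 2, data)
println(result ≈ sum(data))         # true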
using Distributed

# Start with a few processes if not already done
if nprocs() == 1
    # addprocs(3) # Adds 3 workers, total 4 processes
end

# the beauty of Distributed.jl is that you can run this on multiple machines on a cluster,
# e.g. the following would add a process on another machine by SSHing into it
# addprocs("user@othermachine")
See our workers:
println(workers())
[1]
We can execute expressions on all workers with `@everywhere`:
# The @everywhere macro runs this on all processes, i.e. it creates `remote_task` in all procs
@everywhere function remote_task(x)
    println("Hello from worker $(myid()) with x = $x")
    sleep(x) # Simulate some work
    return "Worker $(myid()) finished task in $x seconds"
end
Then we can spawn tasks on specific workers with `@spawnat`:
# Spawn a task on worker 1
f1 = @spawnat 1 remote_task(3)
Future(1, 1, 1, ReentrantLock(nothing, 0x00000000, 0x00, Base.GenericCondition{SpinLock}(Base.IntrusiveLinkedList{Task}(nothing, nothing), SpinLock(0)), (135234848938624, 135234849061584, 141733920768)), nothing)
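The `@spawnat` call returns a `Future` immediately. To block until the remote task finishes and retrieve its return value, fetch it (continuing the example above):

fetch(f1)  # blocks for about 3 seconds, then returns "Worker 1 finished task in 3 seconds"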
@everywhere function estimate_pi_hits(num_samples_per_worker)
    hits = 0
    for _ in 1:num_samples_per_worker
        x = rand()
        y = rand()
        if x^2 + y^2 <= 1.0
            hits += 1
        end
    end
    return hits
end
total_samples = 10^8
num_workers = nprocs()
samples_per_worker = ceil(Int, total_samples / num_workers)

println("Starting distributed estimation of pi with $(num_workers) workers...")

@time begin
    # The `(+)` specifies a reduction operation: sum the results from each worker
    total_hits = @distributed (+) for i in 1:num_workers
        # Each worker calculates hits for its share of samples
        estimate_pi_hits(samples_per_worker)
    end
    estimated_pi = 4.0 * total_hits / (samples_per_worker * num_workers)
end

println("Estimated Pi: $estimated_pi")
println("Actual Pi: $(π)")

# Clean up workers if they were added programmatically
rmprocs(workers())
Starting distributed estimation of pi with 1 workers...
TaskFailedException nested task error: UndefVarError: `estimate_pi_hits` not defined in `Main.Notebook` Suggestion: check for spelling errors or missing imports. Stacktrace: [1] macro expansion @ /work/julia-ml/Julia_ML_training/Julia_ML_training/unit4/unit_4.qmd:848 [inlined] [2] (::var"#27#28")(reducer::typeof(+), R::UnitRange{Int64}, lo::Int64, hi::Int64) @ Main.Notebook /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/macros.jl:288 [3] #invokelatest#2 @ ./essentials.jl:1055 [inlined] [4] invokelatest @ ./essentials.jl:1052 [inlined] [5] #153 @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:425 [inlined] [6] run_work_thunk(thunk::Distributed.var"#153#154"{var"#27#28", Tuple{typeof(+), UnitRange{Int64}, Int64, Int64}, @Kwargs{}}, print_error::Bool) @ Distributed /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:70 [7] #remotecall_fetch#158 @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:450 [inlined] [8] remotecall_fetch @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:449 [inlined] [9] remotecall_fetch @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:492 [inlined] [10] (::Distributed.var"#175#176"{typeof(+), var"#27#28", UnitRange{Int64}, Vector{UnitRange{Int64}}, Int64, Int64})() @ Distributed /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/macros.jl:270 Stacktrace: [1] #remotecall_fetch#158 @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:451 [inlined] [2] remotecall_fetch @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:449 [inlined] [3] remotecall_fetch @ /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:492 [inlined] [4] (::Distributed.var"#175#176"{typeof(+), var"#27#28", UnitRange{Int64}, Vector{UnitRange{Int64}}, Int64, Int64})() @ Distributed /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/macros.jl:270 Stacktrace: [1] wait(t::Task) @ Base ./task.jl:370 [2] fetch @ ./task.jl:390 [inlined] [3] preduce(reducer::Function, f::Function, R::UnitRange{Int64}) @ Distributed /work/julia-ml/_tool/julia/1.11.5/x64/share/julia/stdlib/v1.11/Distributed/src/macros.jl:274 [4] macro expansion @ /work/julia-ml/Julia_ML_training/Julia_ML_training/unit4/unit_4.qmd:846 [inlined] [5] macro expansion @ ./timing.jl:581 [inlined] [6] top-level scope @ /work/julia-ml/Julia_ML_training/Julia_ML_training/unit4/unit_4.qmd:844
Let’s explore the Oceananigans.jl package and try to run it both on GPU and on CPU.
Let’s start with a very basic example from their quick start guide:
using Oceananigans
using CairoMakie
grid = RectilinearGrid(GPU(),
                       size = (1024, 1024),
                       x = (-π, π),
                       y = (-π, π),
                       topology = (Periodic, Periodic, Flat))

model = NonhydrostaticModel(; grid, advection=WENO(), tracers=:c)

δ = 0.5
cᵢ(x, y) = exp(-(x^2 + y^2) / 2δ^2)
ϵ(x, y) = 2rand() - 1
set!(model, u=ϵ, v=ϵ, c=cᵢ)

simulation = Simulation(model; Δt=1e-3, stop_time=10)
conjure_time_step_wizard!(simulation, cfl=0.2, IterationInterval(10))
run!(simulation)

u, v, w = model.velocities
ζ = Field(∂x(v) - ∂y(u))

fig = Figure(size=(1200, 600))
axζ = Axis(fig[1, 1], aspect=1, title="vorticity")
axc = Axis(fig[1, 2], aspect=1, title="tracer")
heatmap!(axζ, ζ, colormap=:balance)
heatmap!(axc, model.tracers.c)
current_figure()
┌ Warning: You are using Julia v1.11 or later!" │ Oceananigans is currently tested on Julia v1.10." │ If you find issues with Julia v1.11 or later," │ please report at https://github.com/CliMA/Oceananigans.jl/issues/new └ @ Oceananigans ~/.julia/packages/Oceananigans/Xawvn/src/Oceananigans.jl:124 [ Info: Oceananigans will use 8 threads [ Info: Initializing simulation... [ Info: ... simulation initialization complete (8.027 seconds) [ Info: Executing initial time step... [ Info: ... initial time step complete (2.272 seconds). [ Info: Simulation is stopping after running for 2.704 minutes. [ Info: Simulation time 10 seconds equals or exceeds stop time 10 seconds.
UndefVarError: `heatmap!` not defined in `Main.Notebook` Hint: It looks like two or more modules export different bindings with this name, resulting in ambiguity. Try explicitly importing it from a particular module, or qualifying the name with the module it should come from. Hint: a global variable of this name may be made accessible by importing Plots in the current active module Main Hint: a global variable of this name may be made accessible by importing MakieCore in the current active module Main Hint: a global variable of this name may be made accessible by importing Makie in the current active module Main Hint: a global variable of this name may be made accessible by importing CairoMakie in the current active module Main Stacktrace: [1] top-level scope @ /work/julia-ml/Julia_ML_training/Julia_ML_training/unit4/unit_4.qmd:908
We'll then go through one of their examples.
Try the “port `sqrt_sum()` to GPU” exercise provided here.

1. It's not entirely clear how widespread the use of GPUs was for cracking cryptosystems before the introduction of CUDA (after which it exploded). There were certainly examples and it was a budding technique in the community, but I couldn't find concrete and widely used examples to cite.↩︎