An Accumulation Point workshop for AIMS, delivered during June, 2025. See workshop materials in the AIMS GitHub repo.
In this unit we focus on data. We start by considering basic Julia data structures, including dictionaries, sets, named tuples, and others. We then focus on basic text (string) processing in Julia. Then we move on to DataFrames - a general and useful way to keep tabular data. We then touch on JSON files and serialization.
Beyond arrays, which are very important and include Vector and Matrix, here are some basic data structures in Julia:
See Dictionaries in the Julia docs.
Dictionaries (often called hash maps or associative arrays) store key-value pairs. Each key in a dictionary must be unique. They are incredibly useful for many purposes because they look up values quickly based on a unique identifier. In particular, well-designed hash maps implement lookup (get value by key), insertion (insert value for key), and deletion (remove value by key) in average \(O(1)\) (constant) time1. This makes them very popular both for their simplicity and for speeding up algorithms with smart tricks (like reverse indices built on hash maps).
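As a small sketch of the reverse-index idea (the `docs` and `index` names here are just illustrative), a Dict can map each word to the indices of the documents containing it:

```julia
# Build a tiny inverted index: word => indices of documents containing it
docs = ["the quick fox", "the lazy dog", "quick brown dog"]
index = Dict{String, Vector{Int}}()
for (i, doc) in enumerate(docs)
    for word in split(doc)
        # get! inserts an empty vector the first time a word is seen
        push!(get!(index, word, Int[]), i)
    end
end
index["dog"]  # [2, 3]: documents 2 and 3 contain "dog"
```

Each lookup in the loop is an average O(1) hash-map operation, which is what makes this construction fast.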
pop = Dict()
pop["Australia"] = 27_864_000
pop["United States"] = 340_111_000
pop["Finland"] = 5_634_000
pop
Dict{Any, Any} with 3 entries:
"United States" => 340111000
"Finland" => 5634000
"Australia" => 27864000
Infer its type:
@show typeof(pop)
typeof(pop) = Dict{Any, Any}
Dict{Any, Any}
We can restrict the types:
strict_pop = Dict{String,Int}()
strict_pop["Australia"] = 27_864_000
strict_pop["United States"] = 340_111_000
strict_pop["Finland"] = 5_634_000
strict_pop
Dict{String, Int64} with 3 entries:
"United States" => 340111000
"Finland" => 5634000
"Australia" => 27864000
pop["North Pole"] = 0.5 # this is okay
strict_pop["North Pole"] = 0.5 # not okay
InexactError: Int64(0.5)
Stacktrace:
 [1] Int64
   @ ./float.jl:994 [inlined]
 [2] convert
   @ ./number.jl:7 [inlined]
 [3] setindex!(h::Dict{String, Int64}, v0::Float64, key::String)
   @ Base ./dict.jl:355
 [4] top-level scope
   @ /work/julia-ml/Julia_ML_training/Julia_ML_training/unit2/unit_2.qmd:48
Checking and accessing dictionary values:
# Accessing a value
population_australia = pop["Australia"]
println("Population of Australia: ", population_australia)
mars_pop_safe = get(pop, "Mars", nothing)
Population of Australia: 27864000
Use haskey to check if the key exists:
if haskey(pop, "United States")
println("United States population exists: ", pop["United States"])
end
if !haskey(pop, "Atlantis")
println("Atlantis population does not exist.")
end
United States population exists: 340111000
Atlantis population does not exist.
More useful operations:
- keys(): Returns an iterable collection of all keys in the dictionary.
- values(): Returns an iterable collection of all values in the dictionary.
- pairs(): Returns an iterable collection of Pair objects (key => value) for all entries.
- length(): Returns the number of key-value pairs in the dictionary.
- empty!(): Removes all key-value pairs from the dictionary.
println("Keys in pop: ", keys(pop))
println("Values in pop: ", values(pop))
println("Pairs in pop: ", pairs(pop))
println("Number of entries in pop: ", length(pop))
# Iterating through a dictionary
println()
println("Iterating through pop:")
for (country, population) in pop
println("$country: $population")
end
# Create a dictionary using the Dict constructor with pairs
new_countries = Dict("Canada" => 38_000_000, "Mexico" => 126_000_000)
println()
println("New countries dictionary: ", new_countries)
# Note that `=>` constructs a pair:
typeof(:s => 2)
# Merging dictionaries (creates a new dictionary)
merged_pop = merge(pop, new_countries)
println("Merged population dictionary: ", merged_pop)
# In-place merge (modifies the first dictionary)
merge!(pop, new_countries)
println("Pop after in-place merge: ", pop)
# Clearing a dictionary
empty!(pop)
println("Pop after empty!: ", pop)
Keys in pop: Any["North Pole", "United States", "Finland", "Australia"]
Values in pop: Any[0.5, 340111000, 5634000, 27864000]
Pairs in pop: Dict{Any, Any}("North Pole" => 0.5, "United States" => 340111000, "Finland" => 5634000, "Australia" => 27864000)
Number of entries in pop: 4
Iterating through pop:
North Pole: 0.5
United States: 340111000
Finland: 5634000
Australia: 27864000
New countries dictionary: Dict("Mexico" => 126000000, "Canada" => 38000000)
Merged population dictionary: Dict{Any, Any}("North Pole" => 0.5, "United States" => 340111000, "Finland" => 5634000, "Mexico" => 126000000, "Australia" => 27864000, "Canada" => 38000000)
Pop after in-place merge: Dict{Any, Any}("North Pole" => 0.5, "United States" => 340111000, "Finland" => 5634000, "Mexico" => 126000000, "Australia" => 27864000, "Canada" => 38000000)
Pop after empty!: Dict{Any, Any}()
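Two further operations worth knowing (a small sketch with an illustrative `inventory` dictionary): `get!`, which inserts a default value when the key is missing, and `delete!`, which removes a single key:

```julia
inventory = Dict("apples" => 3)
get!(inventory, "pears", 0)   # key is absent, so "pears" => 0 is inserted and 0 returned
inventory["apples"] += 1      # update an existing value
delete!(inventory, "pears")   # remove a single key-value pair
haskey(inventory, "pears")    # false
```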
See Set-Like Collections in the Julia docs. Here are some examples.
A = Set([2,7,2,3])
B = Set(1:6)
omega = Set(1:10)

AunionB = union(A, B)
AintersectionB = intersect(A, B)
BdifferenceA = setdiff(B, A)
Bcomplement = setdiff(omega, B)
AsymDifferenceB = union(setdiff(A, B), setdiff(B, A))
println("A = $A, B = $B")
println("A union B = $AunionB")
println("A intersection B = $AintersectionB")
println("B diff A = $BdifferenceA")
println("B complement = $Bcomplement")
println("A symDifference B = $AsymDifferenceB")
println("The element '6' is an element of A: $(in(6,A))")
println("Symmetric difference and intersection are subsets of the union: ",
issubset(AsymDifferenceB,AunionB),", ", issubset(AintersectionB,AunionB))
A = Set([7, 2, 3]), B = Set([5, 4, 6, 2, 3, 1])
A union B = Set([5, 4, 6, 7, 2, 3, 1])
A intersection B = Set([2, 3])
B diff A = Set([5, 4, 6, 1])
B complement = Set([7, 10, 9, 8])
A symDifference B = Set([5, 4, 6, 7, 1])
The element '6' is an element of A: false
Symmetric difference and intersection are subsets of the union: true, true
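Several of these set functions also have Unicode operator forms, and sets support push! and in; a small sketch:

```julia
A = Set([2, 7, 3])
B = Set(1:6)
A ∪ B == union(A, B)     # ∪ is typed \cup<TAB>
A ∩ B == intersect(A, B) # ∩ is typed \cap<TAB>
push!(A, 10)             # add an element in place
10 in A                  # true
Set([2, 3]) ⊆ A          # ⊆ (\subseteq<TAB>) is issubset; true
```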
Internally, sets are a thin wrapper around dictionaries with no values:
# base/set.jl
struct Set{T} <: AbstractSet{T}
    dict::Dict{T,Nothing}

    global _Set(dict::Dict{T,Nothing}) where {T} = new{T}(dict)
end
In addition to tuples (see docs), Julia has named tuples. Here are some examples:
my_stuff = (age=28, gender=:male, name="Aapeli")
yonis_stuff = (age=51, gender=:male, name="Yoni")

my_stuff.gender
:male
Named tuples are also used as keyword arguments.
function my_function_kwargs(; keyword_arg1=default_value1, keyword_arg2=default_value2)
println("Keyword 1: $keyword_arg1")
println("Keyword 2: $keyword_arg2")
end
todays_args = (keyword_arg1="hello!", keyword_arg2="nothing")
my_function_kwargs(; todays_args...)
Keyword 1: hello!
Keyword 2: nothing
An example with Plots:
using Plots
using LaTeXStrings
# we can use named tuples to pass in keyword arguments
args = (label=false, xlim=(-1,1), xlabel=L"x")
# `...` is the "splat" operator, similar to `**kwargs` in python
p1 = plot(x->sin(1/x); ylabel=L"\sin(\frac{1}{x})", args...)
p2 = plot(x->cos(1/x); ylabel=L"\cos(\frac{1}{x})", args...)
plot(p1, p2, size=(700,300))
You can also define your own types; see composite types in the docs. You can use struct, which is immutable by default, or mutable struct. In terms of memory management, immutable types can often live on the stack, while mutable types live on the heap and require allocations and garbage collection.
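A minimal sketch of the difference (the `Counter` and `ImmutablePoint` types here are just illustrative): fields of a mutable struct can be reassigned, while fields of a plain struct cannot:

```julia
struct ImmutablePoint
    x::Int
end

mutable struct Counter
    n::Int
end

c = Counter(0)
c.n += 1               # fine: Counter is mutable
p = ImmutablePoint(1)
# p.x = 2             # would throw: immutable struct cannot be changed
```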
struct Place
    name::String
    lon::Float64
    lat::Float64
end
# Constructing Place instances
new_york = Place("New York", -74.0060, 40.7128)
brisbane = Place("Brisbane", 153.0251, -27.4698)
townsville = Place("Townsville", 146.8169, -19.2581)
println(new_york)
println(brisbane)
println(townsville)
# access fields
println("Latitude of new_york: ", new_york.lat)
Place("New York", -74.006, 40.7128)
Place("Brisbane", 153.0251, -27.4698)
Place("Townsville", 146.8169, -19.2581)
Latitude of new_york: 40.7128
We can also have constructors with logic
"""
A fancier place that wraps longitude automatically
"""
struct FancyPlace
    name::String
    lon::Float64
    lat::Float64
# Default constructor (provided automatically if no inner constructors are defined)
function FancyPlace(name::String, lon::Float64, lat::Float64)
# make sure longitude is in [-180,180)
wrapped_lon = mod(lon + 180, 360) - 180
# new is a special keyword used to create the actual struct instance
# It takes the values for the fields in the order they are defined in
# the struct, effectively calling the "primary" constructor
new(name, wrapped_lon, lat)
end
# Custom constructor for an "unnamed" place; it delegates to the inner constructor above
FancyPlace(lon::Float64, lat::Float64) = FancyPlace("[unnamed]", lon, lat)
end
# Now we can use the new constructor
unnamed_location = FancyPlace(1000.0, 20.0)
println("\nUnnamed location: ", unnamed_location)
println("Name of unnamed_location: ", unnamed_location.name)
Unnamed location: FancyPlace("[unnamed]", -80.0, 20.0)
Name of unnamed_location: [unnamed]
We can add additional “outer” constructors, but they cannot call new directly. For example, suppose you use a GIS package with your own coordinate type:
struct WGS84Coordinates{T}
    x::T
    y::T
end
function FancyPlace(name::String, coords::WGS84Coordinates)
return FancyPlace(name, Float64(coords.x), Float64(coords.y))
end
zero_coords = WGS84Coordinates{Float32}(142.2, 11.35)
mariana_trench = FancyPlace("Mariana Trench", zero_coords)
@show mariana_trench
mariana_trench = Main.Notebook.FancyPlace("Mariana Trench", 142.1999969482422, 11.350000381469727)
FancyPlace("Mariana Trench", 142.1999969482422, 11.350000381469727)
The Parameters.jl package extends this functionality by automatically creating keyword-based constructors for structs, beyond the default constructors.
using Parameters
@with_kw struct MyStruct
    a::Int = 6
    b::Float64 = -1.1
    c::UInt8
end
MyStruct(c=4) # call to the constructor created with the @with_kw with a keyword argument
MyStruct
a: Int64 6
b: Float64 -1.1
c: UInt8 0x04
Another useful macro-based modification of the language is the Accessors.jl package. It lets you update values of (immutable) structs easily by creating a copy, without having to copy over all the other fields manually:
using Accessors
a = MyStruct(a=10, c=4)
@show a
b = @set a.c = 0
@show b;
# but observe a is still untouched
@show a
a = Main.Notebook.MyStruct
a: Int64 10
b: Float64 -1.1
c: UInt8 0x04
b = Main.Notebook.MyStruct
a: Int64 10
b: Float64 -1.1
c: UInt8 0x00
a = Main.Notebook.MyStruct
a: Int64 10
b: Float64 -1.1
c: UInt8 0x04
MyStruct
a: Int64 10
b: Float64 -1.1
c: UInt8 0x04
The JuliaCollections organization provides other data structures. One useful package is DataStructures.jl. Let's use, for example, a heap for heap sort (note that this is only for illustrative purposes; the system's sort will be more efficient).
using Random, DataStructures
Random.seed!(0)
function heap_sort!(a::AbstractArray)
    h = BinaryMinHeap{eltype(a)}()
    for e in a
        push!(h, e) # This is an O(log n) operation
    end
    # Write back onto the original array
    for i in 1:length(a)
        a[i] = pop!(h) # This is an O(log n) operation
    end
    return a
end

data = [65, 51, 32, 12, 23, 84, 68, 1]
heap_sort!(data)
@show data
@show heap_sort!(["Finland", "USA", "Australia", "Brazil"]);
data = [1, 12, 23, 32, 51, 65, 68, 84]
heap_sort!(["Finland", "USA", "Australia", "Brazil"]) = ["Australia", "Brazil", "Finland", "USA"]
Again, note that this is considerably slower than the standard library sort:
using BenchmarkTools
numbers = rand(10_000);
@benchmark sort!(numbers)
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  8.730 μs … 16.809 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.196 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.227 μs ± 412.276 ns ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark heap_sort!(numbers)
BenchmarkTools.Trial: 9555 samples with 1 evaluation per sample.
 Range (min … max):  494.762 μs … 3.795 ms  ┊ GC (min … max): 0.00% … 85.86%
 Time  (median):     507.543 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   519.436 μs ± 83.512 μs ┊ GC (mean ± σ):  1.64% ± 6.07%
 Memory estimate: 326.45 KiB, allocs estimate: 14.
Here are strings in the Julia docs. Let’s see some examples:
x = 2
"The value of x is $x"
"The value of x is 2"
split("Hello world!")
2-element Vector{SubString{String}}:
"Hello"
"world!"
# multiline blocks will clear up whitespace to make life nice with indentation
= """
my_life_story I was born
in 1935.
"""
println(my_life_story)
I was born
in 1935.
ismutable(String)
true
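A few common string operations worth knowing (a small sketch):

```julia
s = "Hello, Julia!"
uppercase(s)                    # "HELLO, JULIA!"
replace(s, "Julia" => "world")  # "Hello, world!" (returns a new string)
startswith(s, "Hello")          # true
join(["a", "b", "c"], ", ")     # "a, b, c"
length(s)                       # 13 (number of characters)
```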
println("A rough ASCII table")
println("Decimal\tHex\tCharacter")
for c in 0x20:0x7E
println(c,"\t","0x" * string(c,base=16),"\t",Char(c))
end
A rough ASCII table
Decimal Hex Character
32 0x20
33 0x21 !
34 0x22 "
35 0x23 #
36 0x24 $
37 0x25 %
38 0x26 &
39 0x27 '
40 0x28 (
41 0x29 )
42 0x2a *
43 0x2b +
44 0x2c ,
45 0x2d -
46 0x2e .
47 0x2f /
48 0x30 0
49 0x31 1
50 0x32 2
51 0x33 3
52 0x34 4
53 0x35 5
54 0x36 6
55 0x37 7
56 0x38 8
57 0x39 9
58 0x3a :
59 0x3b ;
60 0x3c <
61 0x3d =
62 0x3e >
63 0x3f ?
64 0x40 @
65 0x41 A
66 0x42 B
67 0x43 C
68 0x44 D
69 0x45 E
70 0x46 F
71 0x47 G
72 0x48 H
73 0x49 I
74 0x4a J
75 0x4b K
76 0x4c L
77 0x4d M
78 0x4e N
79 0x4f O
80 0x50 P
81 0x51 Q
82 0x52 R
83 0x53 S
84 0x54 T
85 0x55 U
86 0x56 V
87 0x57 W
88 0x58 X
89 0x59 Y
90 0x5a Z
91 0x5b [
92 0x5c \
93 0x5d ]
94 0x5e ^
95 0x5f _
96 0x60 `
97 0x61 a
98 0x62 b
99 0x63 c
100 0x64 d
101 0x65 e
102 0x66 f
103 0x67 g
104 0x68 h
105 0x69 i
106 0x6a j
107 0x6b k
108 0x6c l
109 0x6d m
110 0x6e n
111 0x6f o
112 0x70 p
113 0x71 q
114 0x72 r
115 0x73 s
116 0x74 t
117 0x75 u
118 0x76 v
119 0x77 w
120 0x78 x
121 0x79 y
122 0x7a z
123 0x7b {
124 0x7c |
125 0x7d }
126 0x7e ~
Julia has built-in regex!
= "Julia is fun!"
text = r"Julia"
pattern occursin(pattern, text) # true
true
= "Call me at 0468879289 when I'm home, or 0468879555 if I'm at work"
text for m in eachmatch(r"04\d{8}", text)
println("Found phone number $(m.match)")
end
Found phone number 0468879289
Found phone number 0468879555
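`match` returns a RegexMatch whose capture groups can be read off; for example, splitting one of the illustrative phone numbers above into its prefix and the rest:

```julia
m = match(r"(04)(\d{8})", "Call me at 0468879289")
m.match        # "0468879289", the whole match
m.captures[1]  # "04", the first capture group
m.captures[2]  # "68879289", the second capture group
```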
The open function is your primary tool, often used with do blocks to ensure files are automatically closed.
To write text to a file:
open("work/my_output.txt", "w") do io
write(io, "Hello from Julia!\n")
write(io, "This is a second line.")
end
22
Here, "w"
signifies “write mode.” If the file doesn’t exist, it’s created; if it does, its contents are overwritten.
To append text to an existing file:
open("work/my_output.txt", "a") do io
write(io, "\nAppending a new line.")
end
22
The "a"
mode means “append.” New stuff is added to the end of the file.
To read the entire content of a file:
= read("work/my_output.txt", String)
file_content println(file_content)
Hello from Julia!
This is a second line.
Appending a new line.
The read function with String as the type argument reads the whole file into a single string.
For reading a file line by line, which is more memory-efficient for large files:
open("work/my_output.txt", "r") do io
for line in eachline(io)
println("Line: ", line)
end
end
Line: Hello from Julia!
Line: This is a second line.
Line: Appending a new line.
The Printf package is built-in and provides formatted output functions similar to the C standard library.
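For instance, @printf writes formatted output to stdout and @sprintf returns the formatted string:

```julia
using Printf
@printf("pi ≈ %.3f\n", π)        # prints: pi ≈ 3.142
s = @sprintf("%6.2f%%", 12.3456) # width 6, 2 decimal places, literal %
println(s)
```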
Strings are related to IO. See the I/O and Network docs. Something quite common is to use flush(stdout).
Sometimes when writing test code we want strings to be approximately equal. For this it is useful to use the StringDistances.jl package.
Consider the YAML.jl package for YAML files.
DataFrames are a huge subject. The Julia DataFrames.jl package provides functionality similar to Python's pandas or R's data frames.
Let’s get started
using DataFrames
The most common way to create a DataFrame is by providing column names (as symbols) and their corresponding vectors of data.
# Create a DataFrame with two columns 'a' and 'b'
df = DataFrame(a = [1, 2, 3], b = [2.0, 4.0, 6.0])
Row | a | b |
---|---|---|
Int64 | Float64 | |
1 | 1 | 2.0 |
2 | 2 | 4.0 |
3 | 3 | 6.0 |
Notice that Julia infers the data types for each column. Here, a is Int64 and b is Float64.
We can also create DataFrames using Pairs:
DataFrame(:c => ["apple", "banana", "cherry"], :d => [true, false, true])
Row | c | d |
---|---|---|
String | Bool | |
1 | apple | true |
2 | banana | false |
3 | cherry | true |
You can also construct a DataFrame from a dictionary where keys are column names (symbols or strings) and values are vectors.
DataFrame(Dict(
:name => ["Aapeli", "Yoni", "Jesse"],
:age => [25, 30, 35],
:city => ["New York", "Brisbane", "Berlin"]
))
Row | age | city | name |
---|---|---|---|
Int64 | String | String | |
1 | 25 | New York | Aapeli |
2 | 30 | Brisbane | Yoni |
3 | 35 | Berlin | Jesse |
Creating a DataFrame from a vector of NamedTuples is very flexible.
DataFrame([
    (id = 1, value = 10.5, tag = "A"),
    (id = 2, value = 20.1, tag = "B"),
    (id = 3, value = 15.0, tag = "C")
])
Row | id | value | tag |
---|---|---|---|
Int64 | Float64 | String | |
1 | 1 | 10.5 | A |
2 | 2 | 20.1 | B |
3 | 3 | 15.0 | C |
If the NamedTuples have different fields or different orders, we can use Tables.dictcolumntable to fill missing values with missing.
DataFrame(Tables.dictcolumntable([
    (id = 1, name = "Julia"),
    (id = 2, score = 95.5),
    (id = 3, name = "DataFrame", type = "Table")
]))
Row | id | name | score | type |
---|---|---|---|---|
Int64 | String? | Float64? | String? | |
1 | 1 | Julia | missing | missing |
2 | 2 | missing | 95.5 | missing |
3 | 3 | DataFrame | missing | Table |
Notice the ? after the types, indicating that these columns now allow missing values.
In DataFrames.jl, columns are primarily accessed using Symbols.
df = DataFrame(a = [1, 2, 3], b = [2.0, 4.0, 6.0], c = ["x", "y", "z"])

df[:, :a]
3-element Vector{Int64}:
1
2
3
You can get the column names:
names(df)
3-element Vector{String}:
"a"
"b"
"c"
And column types:
eltype.(eachcol(df))
3-element Vector{DataType}:
Int64
Float64
String
To get the dimensions of a DataFrame, similar to matrices:
size(df) # (rows, columns)
(3, 3)
You can also specify the dimension:
@show size(df, 1) # Number of rows
@show size(df, 2) # Number of columns
size(df, 1) = 3
size(df, 2) = 3
3
DataFrames.jl stores data in a column-oriented fashion. This means each column is essentially a Vector.
You can retrieve a column using dot syntax or indexing:
df.a      # Access column 'a' using dot syntax
df[!, :b] # Access column 'b' using ! (returns a view, i.e., no copy)
df[:, :c] # Access column 'c' using :, which makes a copy
3-element Vector{String}:
"x"
"y"
"z"
The difference between . and ! versus : for column retrieval is crucial for performance and understanding data manipulation.
df.a === df[!, :a] # They refer to the same underlying data
true
df.a === df[:, :a] # The : operator creates a copy, so they are not the same object
false
When you need to iterate through rows, you can use eachrow(df):
for row in eachrow(df)
println("Row: $(row.a), $(row.b), $(row.c)")
end
Row: 1, 2.0, x
Row: 2, 4.0, y
Row: 3, 6.0, z
Each row here is a DataFrameRow object, which behaves like a NamedTuple for row-wise access.
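For instance (a small sketch with an illustrative two-column frame), a DataFrameRow supports property access and indexing, and copy materializes it as a NamedTuple:

```julia
using DataFrames
df_rows = DataFrame(a = [1, 2], c = ["x", "y"])
row = first(eachrow(df_rows))
row.a       # 1: property access, like a NamedTuple
row[:c]     # "x": indexing by column name
copy(row)   # (a = 1, c = "x"): the row as an actual NamedTuple
```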
DataFrames can be indexed similar to matrices, but with the added flexibility of column names.
df[1, 1]  # First row, first column
df[2, :b] # Second row, column 'b'
df[1, :]  # First row (returns a DataFrameRow)
df[:, 1]  # First column (returns a Vector, copy)
3-element Vector{Int64}:
1
2
3
You can select multiple columns by passing a vector of column names (symbols or strings):
df[:, [:a, :c]] # Select columns 'a' and 'c' (creates a new DataFrame)
Row | a | c |
---|---|---|
Int64 | String | |
1 | 1 | x |
2 | 2 | y |
3 | 3 | z |
Or exclude columns using Not:
df[:, Not(:b)] # Select all columns except 'b'
Row | a | c |
---|---|---|
Int64 | String | |
1 | 1 | x |
2 | 2 | y |
3 | 3 | z |
You can combine Not with a vector of columns:
df[:, Not([:a])] # Select all columns except 'a'
Row | b | c |
---|---|---|
Float64 | String | |
1 | 2.0 | x |
2 | 4.0 | y |
3 | 6.0 | z |
Recall the distinction between ! and : for column access. This also applies to row and full DataFrame indexing.
- df[!, :colname] returns a view of the column (no copy).
- df[:, :colname] returns a copy of the column.
- df[!, [col1, col2]] returns the selected columns without copying the data.
- df[:, [col1, col2]] returns a copy of the selected columns (a new DataFrame).
- @view df[row_indices, col_indices] returns a SubDataFrame (a view).
- df[row_indices, col_indices] returns a new DataFrame (a copy).
Using views (!) is more memory-efficient when you don’t need a separate copy of the data and want changes to the view to reflect in the original DataFrame. However, views require translating between the parent df indices and the view indices, which might in theory cause performance issues in edge cases.
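The practical consequence, as a small sketch (df_vc is an illustrative frame): mutating a view writes through to the parent DataFrame, while mutating a copy does not:

```julia
using DataFrames
df_vc = DataFrame(a = [1, 2, 3])
col_view = df_vc[!, :a]  # no copy: the column vector itself
col_copy = df_vc[:, :a]  # an independent copy
col_view[1] = 100        # writes through to df_vc
col_copy[2] = 200        # leaves df_vc untouched
df_vc.a                  # [100, 2, 3]
```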
You can retrieve, set, and modify individual cells, rows, or columns.
df[1, :a] = 100 # Set value at row 1, column 'a'
100
df.b = [10.0, 20.0, 30.0] # Replace column 'b'
3-element Vector{Float64}:
10.0
20.0
30.0
If the new column has a different type, it will be converted if possible, or an error will occur. If a column doesn’t exist, it will be added.
= ["alpha", "beta", "gamma"] # Add a new column 'd' df.d
3-element Vector{String}:
"alpha"
"beta"
"gamma"
Broadcasting (.=) is extremely powerful for performing element-wise operations and assignments efficiently.
df.a .= 0 # Set all values in column 'a' to 0
3-element Vector{Int64}:
0
0
0
You can also use it with a scalar or a vector of compatible size:
df.b .= df.b * 2 # Double all values in column 'b'
3-element Vector{Float64}:
20.0
40.0
60.0
Or apply a function:
df.c .= uppercase.(df.c) # Convert all strings in column 'c' to uppercase
3-element Vector{String}:
"X"
"Y"
"Z"
Broadcasting assignment works with sub-selections as well:
df[1:2, :a] .= 99 # Set the first two values of column 'a' to 99
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
99
99
We’ll now look at a more in-depth, hands-on exercise of using DataFrames.
The Queensland government has an open data portal, and makes available tide predictions at various locations on the state’s coast. (There’s some other interesting data as well at https://www.qld.gov.au/tides).
Let’s use this to do some exploration. We’ll first download the data with the HTTP.jl package and write it to tides.csv
using HTTP
response = HTTP.get("https://www.data.qld.gov.au/datastore/dump/1311fc19-1e60-444f-b5cf-24687f1c15a7?bom=True")
write("work/tides.csv", response.body)
1603979
Let’s explore the first few lines
open("work/tides.csv") do io
for i ∈ 1:5
line = readline(io)
println(line)
end
end
_id,Site,Seconds,DateTime,Water Level,Prediction,Residual,Latitude,Longitude
1,abellpoint,1750082400,2025-06-17T00:00,2.713,2.535,0.178,-20.2608,148.7103
2,abellpoint,1750083000,2025-06-17T00:10,2.765,2.605,0.160,-20.2608,148.7103
3,abellpoint,1750083600,2025-06-17T00:20,2.838,2.670,0.168,-20.2608,148.7103
4,abellpoint,1750084200,2025-06-17T00:30,2.898,2.731,0.167,-20.2608,148.7103
We can read it into a DataFrame with CSV.read, and show the first few lines with first
using CSV
= CSV.read("work/tides.csv", DataFrame)
df first(df, 5)
Row | _id | Site | Seconds | DateTime | Water Level | Prediction | Residual | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|---|
Int64 | String15 | Int64 | DateTime | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1 | abellpoint | 1750082400 | 2025-06-17T00:00:00 | 2.713 | 2.535 | 0.178 | -20.2608 | 148.71 |
2 | 2 | abellpoint | 1750083000 | 2025-06-17T00:10:00 | 2.765 | 2.605 | 0.16 | -20.2608 | 148.71 |
3 | 3 | abellpoint | 1750083600 | 2025-06-17T00:20:00 | 2.838 | 2.67 | 0.168 | -20.2608 | 148.71 |
4 | 4 | abellpoint | 1750084200 | 2025-06-17T00:30:00 | 2.898 | 2.731 | 0.167 | -20.2608 | 148.71 |
5 | 5 | abellpoint | 1750084800 | 2025-06-17T00:40:00 | 2.934 | 2.786 | 0.148 | -20.2608 | 148.71 |
Note the inferred datatypes, including the automatically converted DateTime. We can customize this
# we could also do
= CSV.read("work/tides.csv", DataFrame; types=Dict("Water Level" => Float32, "Prediction" => Float32, "Residual" => Float32, "Latitude" => Float32, "Longitude" => Float32)); df32
println("With Float32s, we saved $(round((1-Base.summarysize(df32)/Base.summarysize(df))*100; digits=2))% memory")
With Float32s, we saved 29.64% memory
(This is silly, don’t do it in practice.)
Let’s look also at the last rows
last(df, 3)
Row | _id | Site | Seconds | DateTime | Water Level | Prediction | Residual | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|---|
Int64 | String15 | Int64 | DateTime | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 19420 | whyteislandnx | 1750728000 | 2025-06-24T11:20:00 | 1.154 | 1.022 | 0.132 | -27.4017 | 153.157 |
2 | 19421 | whyteislandnx | 1750728600 | 2025-06-24T11:30:00 | 1.091 | 0.964 | 0.127 | -27.4017 | 153.157 |
3 | 19422 | whyteislandnx | 1750729200 | 2025-06-24T11:40:00 | -99.0 | 0.907 | -99.0 | -27.4017 | 153.157 |
Here it seems that “-99.0” means missing. Let’s see where it’s coming from in the CSV
open("work/tides.csv") do io
while true
line = readline(io)
if contains(line, "-99")
println(line)
break
end
end
end
1073,abellpoint,1750725600,2025-06-24T10:40,-99.000,2.131,-99.000,-20.2608,148.7103
We can tell CSV.read to mark values of “-99.000” as missing
= CSV.read("work/tides.csv", DataFrame; missingstring=["-99.000"])
df last(df, 3)
Row | _id | Site | Seconds | DateTime | Water Level | Prediction | Residual | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|---|
Int64 | String15 | Int64 | DateTime | Float64? | Float64 | Float64? | Float64 | Float64 | |
1 | 19420 | whyteislandnx | 1750728000 | 2025-06-24T11:20:00 | 1.154 | 1.022 | 0.132 | -27.4017 | 153.157 |
2 | 19421 | whyteislandnx | 1750728600 | 2025-06-24T11:30:00 | 1.091 | 0.964 | 0.127 | -27.4017 | 153.157 |
3 | 19422 | whyteislandnx | 1750729200 | 2025-06-24T11:40:00 | missing | 0.907 | missing | -27.4017 | 153.157 |
Note the “?” in water level/residual: this is DataFrames notation for columns which contain missing data.
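Reductions over such columns propagate missing, so wrap the column in skipmissing when aggregating; a small standalone sketch:

```julia
using Statistics
v = [1.0, missing, 3.0]
# mean(v) would simply return missing
mean(skipmissing(v))  # 2.0
count(ismissing, v)   # 1
```

For the tides data this would look like, e.g., mean(skipmissing(df[:, Symbol("Water Level")])).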
Referring to Water Level is a bit annoying now:
df[:, Symbol("Water Level")]
19422-element Vector{Union{Missing, Float64}}:
2.713
2.765
2.838
2.898
2.934
2.986
3.029
3.078
3.103
3.175
⋮
1.519
1.462
1.397
1.335
1.269
1.212
1.154
1.091
missing
Let’s rename it, and let’s rename DateTime too to avoid confusion:
# ! means in-place
rename!(df, Symbol("Water Level") => :WaterLevel, Symbol("DateTime") => :Time)
first(df, 5)
Row | _id | Site | Seconds | Time | WaterLevel | Prediction | Residual | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|---|
Int64 | String15 | Int64 | DateTime | Float64? | Float64 | Float64? | Float64 | Float64 | |
1 | 1 | abellpoint | 1750082400 | 2025-06-17T00:00:00 | 2.713 | 2.535 | 0.178 | -20.2608 | 148.71 |
2 | 2 | abellpoint | 1750083000 | 2025-06-17T00:10:00 | 2.765 | 2.605 | 0.16 | -20.2608 | 148.71 |
3 | 3 | abellpoint | 1750083600 | 2025-06-17T00:20:00 | 2.838 | 2.67 | 0.168 | -20.2608 | 148.71 |
4 | 4 | abellpoint | 1750084200 | 2025-06-17T00:30:00 | 2.898 | 2.731 | 0.167 | -20.2608 | 148.71 |
5 | 5 | abellpoint | 1750084800 | 2025-06-17T00:40:00 | 2.934 | 2.786 | 0.148 | -20.2608 | 148.71 |
Drop some redundant columns
select!(df, [:Site, :Latitude, :Longitude, :Time, :WaterLevel, :Prediction])
first(df, 5)
Row | Site | Latitude | Longitude | Time | WaterLevel | Prediction |
---|---|---|---|---|---|---|
String15 | Float64 | Float64 | DateTime | Float64? | Float64 | |
1 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:00:00 | 2.713 | 2.535 |
2 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:10:00 | 2.765 | 2.605 |
3 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:20:00 | 2.838 | 2.67 |
4 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:30:00 | 2.898 | 2.731 |
5 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:40:00 | 2.934 | 2.786 |
Here is our list of columns:
names(df)
6-element Vector{String}:
"Site"
"Latitude"
"Longitude"
"Time"
"WaterLevel"
"Prediction"
Or by piping
df |> names
6-element Vector{String}:
"Site"
"Latitude"
"Longitude"
"Time"
"WaterLevel"
"Prediction"
Let’s dive a bit deeper, what do we have?
describe(df)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Any | Any | Int64 | Type | |
1 | Site | abellpoint | whyteislandnx | 0 | String15 | ||
2 | Latitude | -26.2292 | -28.1721 | -27.4382 | -19.1266 | 0 | Float64 |
3 | Longitude | 152.434 | 146.91 | 153.249 | 153.558 | 0 | Float64 |
4 | Time | 2025-06-17T00:00:00 | 2025-06-20T17:50:00 | 2025-06-24T11:40:00 | 0 | DateTime | |
5 | WaterLevel | 1.36547 | -0.233 | 1.153 | 5.774 | 7743 | Union{Missing, Float64} |
6 | Prediction | 1.18008 | -0.179 | 1.007 | 5.537 | 0 | Float64 |
What are the site names?
unique(df.Site)
18-element Vector{String15}:
"abellpoint"
"bananabank"
"birkdale"
"coombabahst"
"hallsbay"
"husseycreek"
"maroochydore"
"rabybay"
"russellislande"
"russellislandw"
"seaforth"
"tangalooma"
"theskids"
"townsvillecard"
"tweedsbj"
"wavebreaknc"
"wavebreakwc"
"whyteislandnx"
A note on String15:
df.Site
19422-element PooledArrays.PooledVector{String15, UInt32, Vector{UInt32}}:
"abellpoint"
"abellpoint"
"abellpoint"
"abellpoint"
"abellpoint"
"abellpoint"
"abellpoint"
"abellpoint"
"abellpoint"
"abellpoint"
⋮
"whyteislandnx"
"whyteislandnx"
"whyteislandnx"
"whyteislandnx"
"whyteislandnx"
"whyteislandnx"
"whyteislandnx"
"whyteislandnx"
"whyteislandnx"
Compute the squared error in prediction with transform
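A sketch of that computation with transform! (the tiny illustrative frame and the :SqError column name here are our own choices):

```julia
using DataFrames
# a tiny frame with the same column names as the renamed tides data
tides = DataFrame(WaterLevel = [2.7, missing, 2.8], Prediction = [2.5, 2.6, 2.7])
# source columns => row-wise function => name of the new column
transform!(tides, [:WaterLevel, :Prediction] => ByRow((w, p) -> (w - p)^2) => :SqError)
tides.SqError  # missing water levels propagate to missing squared errors
```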
Let’s group by site
# groupby takes a dataframe and a list of columns to group by
by_site = groupby(df, :Site)
GroupedDataFrame with 18 groups based on key: Site
Row | Site | Latitude | Longitude | Time | WaterLevel | Prediction |
---|---|---|---|---|---|---|
String15 | Float64 | Float64 | DateTime | Float64? | Float64 | |
1 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:00:00 | 2.713 | 2.535 |
2 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:10:00 | 2.765 | 2.605 |
3 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:20:00 | 2.838 | 2.67 |
4 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:30:00 | 2.898 | 2.731 |
5 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:40:00 | 2.934 | 2.786 |
6 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:50:00 | 2.986 | 2.839 |
7 | abellpoint | -20.2608 | 148.71 | 2025-06-17T01:00:00 | 3.029 | 2.887 |
8 | abellpoint | -20.2608 | 148.71 | 2025-06-17T01:10:00 | 3.078 | 2.93 |
9 | abellpoint | -20.2608 | 148.71 | 2025-06-17T01:20:00 | 3.103 | 2.971 |
10 | abellpoint | -20.2608 | 148.71 | 2025-06-17T01:30:00 | 3.175 | 3.007 |
11 | abellpoint | -20.2608 | 148.71 | 2025-06-17T01:40:00 | 3.204 | 3.037 |
12 | abellpoint | -20.2608 | 148.71 | 2025-06-17T01:50:00 | 3.202 | 3.063 |
13 | abellpoint | -20.2608 | 148.71 | 2025-06-17T02:00:00 | 3.224 | 3.084 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1068 | abellpoint | -20.2608 | 148.71 | 2025-06-24T09:50:00 | 2.577 | 2.379 |
1069 | abellpoint | -20.2608 | 148.71 | 2025-06-24T10:00:00 | 2.587 | 2.347 |
1070 | abellpoint | -20.2608 | 148.71 | 2025-06-24T10:10:00 | 2.513 | 2.306 |
1071 | abellpoint | -20.2608 | 148.71 | 2025-06-24T10:20:00 | 2.492 | 2.255 |
1072 | abellpoint | -20.2608 | 148.71 | 2025-06-24T10:30:00 | 2.43 | 2.196 |
1073 | abellpoint | -20.2608 | 148.71 | 2025-06-24T10:40:00 | missing | 2.131 |
1074 | abellpoint | -20.2608 | 148.71 | 2025-06-24T10:50:00 | missing | 2.057 |
1075 | abellpoint | -20.2608 | 148.71 | 2025-06-24T11:00:00 | missing | 1.979 |
1076 | abellpoint | -20.2608 | 148.71 | 2025-06-24T11:10:00 | missing | 1.897 |
1077 | abellpoint | -20.2608 | 148.71 | 2025-06-24T11:20:00 | missing | 1.811 |
1078 | abellpoint | -20.2608 | 148.71 | 2025-06-24T11:30:00 | missing | 1.721 |
1079 | abellpoint | -20.2608 | 148.71 | 2025-06-24T11:40:00 | missing | 1.631 |
⋮
Row | Site | Latitude | Longitude | Time | WaterLevel | Prediction |
---|---|---|---|---|---|---|
String15 | Float64 | Float64 | DateTime | Float64? | Float64 | |
1 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T00:00:00 | 2.373 | 2.215 |
2 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T00:10:00 | 2.416 | 2.251 |
3 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T00:20:00 | 2.437 | 2.282 |
4 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T00:30:00 | 2.471 | 2.309 |
5 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T00:40:00 | 2.491 | 2.331 |
6 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T00:50:00 | 2.51 | 2.347 |
7 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T01:00:00 | 2.509 | 2.357 |
8 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T01:10:00 | 2.516 | 2.362 |
9 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T01:20:00 | 2.497 | 2.361 |
10 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T01:30:00 | 2.486 | 2.353 |
11 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T01:40:00 | 2.459 | 2.339 |
12 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T01:50:00 | 2.428 | 2.318 |
13 | whyteislandnx | -27.4017 | 153.157 | 2025-06-17T02:00:00 | 2.4 | 2.291 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
1068 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T09:50:00 | 1.691 | 1.541 |
1069 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T10:00:00 | 1.638 | 1.487 |
1070 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T10:10:00 | 1.578 | 1.431 |
1071 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T10:20:00 | 1.519 | 1.375 |
1072 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T10:30:00 | 1.462 | 1.316 |
1073 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T10:40:00 | 1.397 | 1.256 |
1074 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T10:50:00 | 1.335 | 1.198 |
1075 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T11:00:00 | 1.269 | 1.138 |
1076 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T11:10:00 | 1.212 | 1.079 |
1077 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T11:20:00 | 1.154 | 1.022 |
1078 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T11:30:00 | 1.091 | 0.964 |
1079 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T11:40:00 | missing | 0.907 |
This produces a grouped DataFrame:
typeof(by_site)
GroupedDataFrame{DataFrame}
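On a toy DataFrame (not the tide data), `groupby` returns one sub-frame per distinct key:

```julia
using DataFrames

small_df = DataFrame(Site = ["a", "a", "b"], Level = [1.0, 3.0, 5.0])
g = groupby(small_df, :Site)  # a GroupedDataFrame
length(g)                     # 2 groups: "a" and "b"
```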
What’s the mean water level per site?
# get the mean function
using Statistics
# enter ∘ with \circ TAB
# combine takes the grouped df and a list of operations
combine(by_site, :WaterLevel => mean ∘ skipmissing => :MeanWaterLevel)
Row | Site | MeanWaterLevel |
---|---|---|
String15 | Float64 | |
1 | abellpoint | 2.03237 |
2 | bananabank | 1.64445 |
3 | birkdale | 1.43709 |
4 | coombabahst | 0.275567 |
5 | hallsbay | 0.725246 |
6 | husseycreek | NaN |
7 | maroochydore | 0.77414 |
8 | rabybay | 1.61293 |
9 | russellislande | 0.723282 |
10 | russellislandw | NaN |
11 | seaforth | 3.01772 |
12 | tangalooma | NaN |
13 | theskids | NaN |
14 | townsvillecard | NaN |
15 | tweedsbj | 1.19848 |
16 | wavebreaknc | NaN |
17 | wavebreakwc | NaN |
18 | whyteislandnx | 1.4865 |
Here we applied `mean(skipmissing(...))` to the `:WaterLevel` column.
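As a quick, self-contained check of the `mean ∘ skipmissing` composition (on a toy vector, not the tide data):

```julia
using Statistics

# compose: first drop the missings, then take the mean
f = mean ∘ skipmissing

v = [1.0, missing, 3.0]
f(v)  # 2.0
```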
Let’s plot the water level at some sites:
using Plots
my_sites = ["coombabahst", "russellislande", "rabybay"]

p = plot(
    xlabel = "Time",
    ylabel = "Water Level",
    title = "Water Level Over Time for Selected Sites",
    legend = :topleft
)

for group in by_site
    site_name = group.Site[1] # get the site name from the first row of the group
    if site_name ∉ my_sites
        continue
    end
    plot!(
        p,
        group.Time,
        group.WaterLevel,
        label = site_name, # label for the legend
        linealpha = 0.8,
        linewidth = 2
    )
end
p
How many data points do we have per site?
combine(by_site, nrow => :Count)
Row | Site | Count |
---|---|---|
String15 | Int64 | |
1 | abellpoint | 1079 |
2 | bananabank | 1079 |
3 | birkdale | 1079 |
4 | coombabahst | 1079 |
5 | hallsbay | 1079 |
6 | husseycreek | 1079 |
7 | maroochydore | 1079 |
8 | rabybay | 1079 |
9 | russellislande | 1079 |
10 | russellislandw | 1079 |
11 | seaforth | 1079 |
12 | tangalooma | 1079 |
13 | theskids | 1079 |
14 | townsvillecard | 1079 |
15 | tweedsbj | 1079 |
16 | wavebreaknc | 1079 |
17 | wavebreakwc | 1079 |
18 | whyteislandnx | 1079 |
Let’s compute the squared residual:
df[!, :SqResidual] = (df.WaterLevel - df.Prediction).^2
19422-element Vector{Union{Missing, Float64}}:
0.031683999999999976
0.025600000000000046
0.02822400000000005
0.027889000000000087
0.021904000000000038
0.021609000000000073
0.020163999999999974
0.021903999999999906
0.01742400000000003
0.028223999999999902
⋮
0.020735999999999973
0.021315999999999974
0.019881000000000003
0.018769000000000004
0.017161000000000003
0.017689000000000003
0.01742399999999997
0.016129
missing
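The `missing` at the end appears because subtraction and squaring propagate `missing` elementwise. A toy illustration (the values are made up):

```julia
a = [1.0, 2.0, missing]   # "observed"
b = [0.5, 1.5, 2.5]       # "predicted"
sq = (a - b).^2           # missing propagates through - and ^
```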
There were some sites with fully missing water levels:

all_missing = combine(groupby(df, :Site), :WaterLevel => (x -> all(ismissing, x)) => :IsMissing)
Row | Site | IsMissing |
---|---|---|
String15 | Bool | |
1 | abellpoint | false |
2 | bananabank | false |
3 | birkdale | false |
4 | coombabahst | false |
5 | hallsbay | false |
6 | husseycreek | true |
7 | maroochydore | false |
8 | rabybay | false |
9 | russellislande | false |
10 | russellislandw | true |
11 | seaforth | false |
12 | tangalooma | true |
13 | theskids | true |
14 | townsvillecard | true |
15 | tweedsbj | false |
16 | wavebreaknc | true |
17 | wavebreakwc | true |
18 | whyteislandnx | false |
filter!(row -> row.IsMissing == false, all_missing)
Row | Site | IsMissing |
---|---|---|
String15 | Bool | |
1 | abellpoint | false |
2 | bananabank | false |
3 | birkdale | false |
4 | coombabahst | false |
5 | hallsbay | false |
6 | maroochydore | false |
7 | rabybay | false |
8 | russellislande | false |
9 | seaforth | false |
10 | tweedsbj | false |
11 | whyteislandnx | false |
select!(all_missing, Not(:IsMissing))
Row | Site |
---|---|
String15 | |
1 | abellpoint |
2 | bananabank |
3 | birkdale |
4 | coombabahst |
5 | hallsbay |
6 | maroochydore |
7 | rabybay |
8 | russellislande |
9 | seaforth |
10 | tweedsbj |
11 | whyteislandnx |
df_clean = innerjoin(df, all_missing, on = :Site)
Row | Site | Latitude | Longitude | Time | WaterLevel | Prediction | SqResidual |
---|---|---|---|---|---|---|---|
String15 | Float64 | Float64 | DateTime | Float64? | Float64 | Float64? | |
1 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:00:00 | 2.713 | 2.535 | 0.031684 |
2 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:10:00 | 2.765 | 2.605 | 0.0256 |
3 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:20:00 | 2.838 | 2.67 | 0.028224 |
4 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:30:00 | 2.898 | 2.731 | 0.027889 |
5 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:40:00 | 2.934 | 2.786 | 0.021904 |
6 | abellpoint | -20.2608 | 148.71 | 2025-06-17T00:50:00 | 2.986 | 2.839 | 0.021609 |
7 | abellpoint | -20.2608 | 148.71 | 2025-06-17T01:00:00 | 3.029 | 2.887 | 0.020164 |
8 | abellpoint | -20.2608 | 148.71 | 2025-06-17T01:10:00 | 3.078 | 2.93 | 0.021904 |
9 | abellpoint | -20.2608 | 148.71 | 2025-06-17T01:20:00 | 3.103 | 2.971 | 0.017424 |
10 | abellpoint | -20.2608 | 148.71 | 2025-06-17T01:30:00 | 3.175 | 3.007 | 0.028224 |
11 | abellpoint | -20.2608 | 148.71 | 2025-06-17T01:40:00 | 3.204 | 3.037 | 0.027889 |
12 | abellpoint | -20.2608 | 148.71 | 2025-06-17T01:50:00 | 3.202 | 3.063 | 0.019321 |
13 | abellpoint | -20.2608 | 148.71 | 2025-06-17T02:00:00 | 3.224 | 3.084 | 0.0196 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
11858 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T09:50:00 | 1.691 | 1.541 | 0.0225 |
11859 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T10:00:00 | 1.638 | 1.487 | 0.022801 |
11860 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T10:10:00 | 1.578 | 1.431 | 0.021609 |
11861 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T10:20:00 | 1.519 | 1.375 | 0.020736 |
11862 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T10:30:00 | 1.462 | 1.316 | 0.021316 |
11863 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T10:40:00 | 1.397 | 1.256 | 0.019881 |
11864 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T10:50:00 | 1.335 | 1.198 | 0.018769 |
11865 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T11:00:00 | 1.269 | 1.138 | 0.017161 |
11866 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T11:10:00 | 1.212 | 1.079 | 0.017689 |
11867 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T11:20:00 | 1.154 | 1.022 | 0.017424 |
11868 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T11:30:00 | 1.091 | 0.964 | 0.016129 |
11869 | whyteislandnx | -27.4017 | 153.157 | 2025-06-24T11:40:00 | missing | 0.907 | missing |
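The inner join keeps only rows whose `:Site` appears in both frames. Here is the same pattern on toy data (the names are illustrative):

```julia
using DataFrames

left = DataFrame(Site = ["a", "b", "c"], x = [1, 2, 3])
keep = DataFrame(Site = ["a", "c"])
joined = innerjoin(left, keep, on = :Site)  # rows for "a" and "c" only
```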
Let’s compute the 90th percentile of the water level per site:

p90(x) = quantile(x, 0.9)
combine(groupby(df_clean, :Site), :WaterLevel => p90 ∘ skipmissing => :WaterLevelP90)
Row | Site | WaterLevelP90 |
---|---|---|
String15 | Float64 | |
1 | abellpoint | 3.0888 |
2 | bananabank | 2.4426 |
3 | birkdale | 2.2089 |
4 | coombabahst | 0.6794 |
5 | hallsbay | 1.2018 |
6 | maroochydore | 1.1973 |
7 | rabybay | 2.3942 |
8 | russellislande | 1.18 |
9 | seaforth | 4.5453 |
10 | tweedsbj | 1.761 |
11 | whyteislandnx | 2.232 |
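`quantile` interpolates linearly between order statistics; for example, on toy data:

```julia
using Statistics

p90(x) = quantile(x, 0.9)
p90(collect(1:100))  # 90.1, interpolated between the 90th and 91st values
```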
Let’s plot the mean squared error of the prediction per site:

mse_by_site = combine(groupby(df_clean, :Site), :SqResidual => mean ∘ skipmissing => :MSE)
plot(mse_by_site.Site, mse_by_site.MSE, seriestype=:bar, xrotation=45, title="MSE in water level prediction by site")
Here are the key operations:

- `groupby` – Split a DataFrame into groups by one or more columns.
- `combine` – Apply functions to groups or columns and combine the results into a new DataFrame.
- `transform` – Create or modify columns (optionally in place).
- `select` – Select (and transform) columns, optionally creating new ones.

With DataFramesMeta.jl:

- `@subset` – Filter rows based on row-wise conditions.
- `@select` – Select or transform columns.
- `@transform` – Add or modify columns by assignment.
- `@combine` – Combine the results of group operations into a DataFrame.
For more, see the official DataFrames.jl documentation and the DataFramesMeta.jl documentation.
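A minimal sketch of `transform!` and `select` on a toy DataFrame:

```julia
using DataFrames

toy = DataFrame(x = [1, 2, 3])
transform!(toy, :x => (v -> v .* 2) => :double)  # add a :double column in place
sel = select(toy, :double)                       # a new DataFrame with just :double
```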
As there are already great resources for this on the web, let us go through them:
Here are the common packages in this ecosystem:
There are two competing JSON libraries: JSON.jl and JSON3.jl. Here is a `JSON.jl` example:
using HTTP
using JSON

response = HTTP.get("https://couchers.org/api/status")
data = JSON.parse(String(response.body))

println(data)
Dict{String, Any}("coucherCount" => "55920", "version" => "develop-24532dd9", "nonce" => "")
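`JSON.parse` also works directly on a string, which avoids the network call (the string below is made up):

```julia
using JSON

s = """{"name": "Alice", "age": 30}"""
d = JSON.parse(s)   # returns a Dict{String, Any}
d["name"]           # "Alice"
```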
Julia provides out-of-the-box serialization. Here is an example. The example is slightly interesting because we also create a tree data structure.
using Random
Random.seed!(0)

struct Node
    id::UInt16
    friends::Vector{Node}

    # inner constructor, uses the default constructor
    Node() = new(rand(UInt16), [])

    # another inner constructor
    Node(friend::Node) = new(rand(UInt16), [friend])
end

"""
Make `n` children for `node`, each with `friend` as their single friend.
"""
function make_children(node::Node, n::Int, friend::Node)
    for _ in 1:n
        new_node = Node(friend)
        push!(node.friends, new_node)
    end
end;

# make a tree
root = Node()
make_children(root, 3, root)
for node in root.friends
    make_children(node, 2, root)
end
root
Node(0x67db, Node[Node(0x118c, Node[Node(#= circular reference @-4 =#), Node(0xa95f, Node[Node(#= circular reference @-6 =#)]), Node(0x1dc7, Node[Node(#= circular reference @-6 =#)])]), Node(0xdcb5, Node[Node(#= circular reference @-4 =#), Node(0x1c00, Node[Node(#= circular reference @-6 =#)]), Node(0xb3b6, Node[Node(#= circular reference @-6 =#)])]), Node(0x1602, Node[Node(#= circular reference @-4 =#), Node(0x4a1d, Node[Node(#= circular reference @-6 =#)]), Node(0x074f, Node[Node(#= circular reference @-6 =#)])])])
Note that when we try to show `root`, the output is complete gibberish. We can write a `Base.show` method to make it pretty:
# make it show up pretty
function Base.show(io::IO, x::Node)
    shown = Set{Node}()
    function recursive_show(y::Node, depth::Int)
        print(io, " "^depth * "Node: $(y.id)")
        if y in shown
            println(io, " (already shown)")
        else
            push!(shown, y)
            println(io, ", friends:")
            for f in y.friends
                recursive_show(f, depth + 1)
            end
        end
    end
    recursive_show(x, 0)
    return nothing
end
root
Node: 26587, friends:
Node: 4492, friends:
Node: 26587 (already shown)
Node: 43359, friends:
Node: 26587 (already shown)
Node: 7623, friends:
Node: 26587 (already shown)
Node: 56501, friends:
Node: 26587 (already shown)
Node: 7168, friends:
Node: 26587 (already shown)
Node: 46006, friends:
Node: 26587 (already shown)
Node: 5634, friends:
Node: 26587 (already shown)
Node: 18973, friends:
Node: 26587 (already shown)
Node: 1871, friends:
Node: 26587 (already shown)
Suppose we now want to save this in a file…
using Serialization

serialize("work/tree.dat", root)
newroot = deserialize("work/tree.dat")
Node: 26587, friends:
Node: 4492, friends:
Node: 26587 (already shown)
Node: 43359, friends:
Node: 26587 (already shown)
Node: 7623, friends:
Node: 26587 (already shown)
Node: 56501, friends:
Node: 26587 (already shown)
Node: 7168, friends:
Node: 26587 (already shown)
Node: 46006, friends:
Node: 26587 (already shown)
Node: 5634, friends:
Node: 26587 (already shown)
Node: 18973, friends:
Node: 26587 (already shown)
Node: 1871, friends:
Node: 26587 (already shown)
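The round trip works for ordinary containers too; a minimal sketch using a temporary file:

```julia
using Serialization

path = tempname()                        # a throwaway file path
serialize(path, Dict("a" => [1, 2, 3]))
obj = deserialize(path)                  # same contents back
```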
- `DataFrames.jl` made it into the Journal of Statistical Software; see Bouchet-Valat and Kamiński (2023).
- The official documentation of `DataFrames.jl`.
- This `DataFramesMeta.jl` tutorial is useful.

country_capital = Dict(
"France" => "Paris",
"Germany" => "Berlin",
"Italy" => "Rome",
"Spain" => "Madrid")
Now create a new dictionary, `capital_country`, where the keys are the capital cities and the values are the country names.
- Checking membership with the `in` (or `∈`) operator is possible both in an array and a set. You can create an array with `rand(1:10^10, 10^7)`, which will have \(10^7\) entries, selected from the numbers \(1,\ldots,10^{10}\). You can also wrap this to create a set. Now compare lookup timings with `@time` or `@btime` (from BenchmarkTools.jl) to see if a single `rand(1:10^10)` is an element of the set.
- Given `text = "Julia is a high-level, high-performance programming language."`, write Julia code to count how many times the substring “high” appears in the text (case-insensitive).
- Install the RDatasets.jl package. Then load the “iris” dataset. Then filter the DataFrame to only include rows where the SepalLength is greater than its mean, and display the first five rows of the result.
- Load the “mtcars” dataset from `RDatasets`. Then group the data by the `Cyl` (number of cylinders) column and compute the average `MPG` (miles per gallon) for each group. Display the resulting summary DataFrame.

{
"name": "Alice",
"age": 30,
"skills": ["Julia", "Python", "SQL"]
}
Given the JSON string above, write Julia code to parse it and print the person’s name and the number of skills they have.
- Create a small vector of `Float64` (you can use `rand(Float64, 3)`). Then serialize it and inspect the file size. See that it makes sense given `sizeof(Float64)`. Now do the same with `Float16`, `Float32`, `UInt8`, and another type of your choice.

“Constant time” suffices in practice, though there are minutiae and the worst case is \(O(n)\), which is bad. For theoretical applications, these operations can be implemented in \(O(\log n)\) worst-case time with self-balancing trees, but all practical implementations rely on constant average time and engineering tricks to avoid the linear-time worst case.↩︎