A Crash Course on the Julia Language and Ecosystem

An Accumulation Point workshop for AIMS, delivered in June 2025. See workshop materials in the AIMS GitHub repo.


Unit 2 - Processing Data

In this unit we focus on data. We start by considering basic Julia data structures, including dictionaries, sets, named tuples, and others. We then focus on basic text (string) processing in Julia. Then we move on to DataFrames, a general and useful way to keep tabular data. Finally, we touch on JSON files and serialization.

1 Basic data structures

Beyond arrays, which are very important and include Vector and Matrix, here are some basic data structures in Julia:

1.1 Dictionaries

See Dictionaries in the Julia docs.

Dictionaries (often called hash maps or associative arrays) store key-value pairs. Each key in a dictionary must be unique. They are incredibly useful for many purposes because they let you look up values quickly based on a unique identifier. In particular, well-designed hash maps are implemented so that lookup (get value by key), insertion (associate a value with a key), and deletion (remove value by key) all take average \(O(1)\) (constant) time¹. This makes them popular both for their simplicity and for speeding up algorithms with smart tricks (like reverse indices built on hash maps).
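
As a small illustration of the reverse-index idea, here is a sketch (with made-up sentences, not from the workshop materials) that maps each word to the indices of the sentences containing it:

# a sketch of a reverse index built on a Dict (hypothetical example data)
sentences = ["the cat sat", "the dog ran", "a cat ran"]

reverse_index = Dict{String, Vector{Int}}()
for (i, sentence) in enumerate(sentences)
    for word in split(sentence)
        # get! returns the stored vector, inserting an empty one if the key is new
        push!(get!(reverse_index, word, Int[]), i)
    end
end

reverse_index["cat"]   # [1, 3] — found with an average O(1) lookup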

pop = Dict()
pop["Australia"] = 27_864_000
pop["United States"] = 340_111_000
pop["Finland"] = 5_634_000

pop
Dict{Any, Any} with 3 entries:
  "United States" => 340111000
  "Finland"       => 5634000
  "Australia"     => 27864000

Let's inspect its type:

@show typeof(pop)
typeof(pop) = Dict{Any, Any}
Dict{Any, Any}

We can restrict the types:

strict_pop = Dict{String,Int}()
strict_pop["Australia"] = 27_864_000
strict_pop["United States"] = 340_111_000
strict_pop["Finland"] = 5_634_000

strict_pop
Dict{String, Int64} with 3 entries:
  "United States" => 340111000
  "Finland"       => 5634000
  "Australia"     => 27864000
# this is okay
pop["North Pole"] = 0.5
# not okay
strict_pop["North Pole"] = 0.5
InexactError: Int64(0.5)
Stacktrace:
 [1] Int64
   @ ./float.jl:994 [inlined]
 [2] convert
   @ ./number.jl:7 [inlined]
 [3] setindex!(h::Dict{String, Int64}, v0::Float64, key::String)
   @ Base ./dict.jl:355
 [4] top-level scope
   @ /work/julia-ml/Julia_ML_training/Julia_ML_training/unit2/unit_2.qmd:48

Checking and accessing dictionary values:

# Accessing a value
population_australia = pop["Australia"]
println("Population of Australia: ", population_australia)

mars_pop_safe = get(pop, "Mars", nothing)
Population of Australia: 27864000

Use haskey to check if the key exists:

if haskey(pop, "United States")
    println("United States population exists: ", pop["United States"])
end

if !haskey(pop, "Atlantis")
    println("Atlantis population does not exist.")
end
United States population exists: 340111000
Atlantis population does not exist.

More useful operations:

  • keys(): Returns an iterable collection of all keys in the dictionary.
  • values(): Returns an iterable collection of all values in the dictionary.
  • pairs(): Returns an iterable collection of Pair objects (key => value) for all entries.
  • length(): Returns the number of key-value pairs in the dictionary.
  • empty!(): Removes all key-value pairs from the dictionary.
println()
println("Keys in pop: ", keys(pop))
println("Values in pop: ", values(pop))
println("Pairs in pop: ", pairs(pop))
println("Number of entries in pop: ", length(pop))

# Iterating through a dictionary
println()
println("Iterating through pop:")
for (country, population) in pop
    println("$country: $population")
end

# Create a dictionary using the Dict constructor with pairs
new_countries = Dict("Canada" => 38_000_000, "Mexico" => 126_000_000)
println()
println("New countries dictionary: ", new_countries)

# Note that `=>` constructs a pair:
typeof(:s => 2)

# Merging dictionaries (creates a new dictionary)
merged_pop = merge(pop, new_countries)
println("Merged population dictionary: ", merged_pop)

# In-place merge (modifies the first dictionary)
merge!(pop, new_countries)
println("Pop after in-place merge: ", pop)

# Clearing a dictionary
empty!(pop)
println("Pop after empty!: ", pop)

Keys in pop: Any["North Pole", "United States", "Finland", "Australia"]
Values in pop: Any[0.5, 340111000, 5634000, 27864000]
Pairs in pop: Dict{Any, Any}("North Pole" => 0.5, "United States" => 340111000, "Finland" => 5634000, "Australia" => 27864000)
Number of entries in pop: 4

Iterating through pop:
North Pole: 0.5
United States: 340111000
Finland: 5634000
Australia: 27864000

New countries dictionary: Dict("Mexico" => 126000000, "Canada" => 38000000)
Merged population dictionary: Dict{Any, Any}("North Pole" => 0.5, "United States" => 340111000, "Finland" => 5634000, "Mexico" => 126000000, "Australia" => 27864000, "Canada" => 38000000)
Pop after in-place merge: Dict{Any, Any}("North Pole" => 0.5, "United States" => 340111000, "Finland" => 5634000, "Mexico" => 126000000, "Australia" => 27864000, "Canada" => 38000000)
Pop after empty!: Dict{Any, Any}()

1.2 Sets

See Set-Like Collections in the Julia docs. Here are some examples.

A = Set([2,7,2,3])
B = Set(1:6)
omega = Set(1:10)

AunionB = union(A, B)
AintersectionB = intersect(A, B)
BdifferenceA = setdiff(B,A)
Bcomplement = setdiff(omega,B)
AsymDifferenceB = union(setdiff(A,B),setdiff(B,A))
println("A = $A, B = $B")
println("A union B = $AunionB")
println("A intersection B = $AintersectionB")
println("B diff A = $BdifferenceA")
println("B complement = $Bcomplement")
println("A symDifference B = $AsymDifferenceB")
println("The element '6' is an element of A: $(in(6,A))")
println("Symmetric difference and intersection are subsets of the union: ",
        issubset(AsymDifferenceB,AunionB),", ", issubset(AintersectionB,AunionB))
A = Set([7, 2, 3]), B = Set([5, 4, 6, 2, 3, 1])
A union B = Set([5, 4, 6, 7, 2, 3, 1])
A intersection B = Set([2, 3])
B diff A = Set([5, 4, 6, 1])
B complement = Set([7, 10, 9, 8])
A symDifference B = Set([5, 4, 6, 7, 1])
The element '6' is an element of A: false
Symmetric difference and intersection are subsets of the union: true, true

Internally, sets are a thin wrapper around dictionaries with no values:

# base/set.jl
struct Set{T} <: AbstractSet{T}
    dict::Dict{T,Nothing}

    global _Set(dict::Dict{T,Nothing}) where {T} = new{T}(dict)
end

1.3 Named tuples

In addition to tuples (see docs), Julia has named tuples. Here are some examples:

my_stuff = (age=28, gender=:male, name="Aapeli")
yonis_stuff = (age=51, gender=:male, name="Yoni")

my_stuff.gender
:male

Named tuples are also used as keyword arguments.

function my_function_kwargs(; keyword_arg1=default_value1, keyword_arg2=default_value2)
    println("Keyword 1: $keyword_arg1")
    println("Keyword 2: $keyword_arg2")
end

todays_args = (keyword_arg1="hello!", keyword_arg2="nothing")
my_function_kwargs(; todays_args...)
Keyword 1: hello!
Keyword 2: nothing

An example with Plots:

using Plots
using LaTeXStrings

# we can use named tuples to pass in keyword arguments
args = (label=false, xlim=(-1,1), xlabel=L"x")
# `...` is the "splat" operator, similar to `**kwargs` in Python
p1 = plot(x->sin(1/x); ylabel=L"\sin(\frac{1}{x})", args...)
p2 = plot(x->cos(1/x); ylabel=L"\cos(\frac{1}{x})", args...)
plot(p1, p2, size=(700,300))

1.4 Structs (Composite Types)

You can of course define your own types; see composite types in the docs. You can use struct, which is immutable by default, or mutable struct. In terms of memory management, immutable types can typically live on the stack, while mutable types live on the heap and require allocations and garbage collection.
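
For contrast with the immutable structs below, here is a minimal sketch (a hypothetical example, not from the workshop) of a mutable struct, whose fields can be reassigned after construction:

mutable struct Counter
    count::Int
end

c = Counter(0)
c.count += 1   # fields of a mutable struct can be reassigned in place
c.count        # 1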

struct Place
  name::String
  lon::Float64
  lat::Float64
end
# Constructing Place instances
new_york = Place("New York", -74.0060, 40.7128)
brisbane = Place("Brisbane", 153.0251, -27.4698)
townsville = Place("Townsville", 146.8169, -19.2581)

println(new_york)
println(brisbane)
println(townsville)

# access fields
println("Latitude of new_york: ", new_york.lat)
Place("New York", -74.006, 40.7128)
Place("Brisbane", 153.0251, -27.4698)
Place("Townsville", 146.8169, -19.2581)
Latitude of new_york: 40.7128

We can also define constructors containing logic:

"""
A fancier place that wraps longitude automatically
"""
struct FancyPlace
  name::String
  lon::Float64
  lat::Float64

  # Inner constructor; defining it replaces the default constructor
  # (which is only provided automatically when no inner constructors are defined)
  function FancyPlace(name::String, lon::Float64, lat::Float64)
    # make sure longitude is in [-180,180)
    wrapped_lon = mod(lon + 180, 360) - 180
    # new is a special keyword used to create the actual struct instance
    # It takes the values for the fields in the order they are defined in
    # the struct, effectively calling the "primary" constructor
    new(name, wrapped_lon, lat)
  end

  # Custom constructor for an "unnamed" place
  FancyPlace(lon::Float64, lat::Float64) = FancyPlace("[unnamed]", lon, lat) # this calls the three-argument inner constructor above
end

# Now we can use the new constructor
unnamed_location = FancyPlace(1000.0, 20.0)
println("\nUnnamed location: ", unnamed_location)
println("Name of unnamed_location: ", unnamed_location.name)

Unnamed location: FancyPlace("[unnamed]", -80.0, 20.0)
Name of unnamed_location: [unnamed]

We can add additional “outer” constructors (defined outside the struct definition), but they cannot call new directly. For example, suppose you use a GIS package with its own coordinate type:

struct WGS84Coordinates{T}
  x::T
  y::T
end

function FancyPlace(name::String, coords::WGS84Coordinates)
    return FancyPlace(name, Float64(coords.x), Float64(coords.y))
end

zero_coords = WGS84Coordinates{Float32}(142.2, 11.35)
mariana_trench = FancyPlace("Mariana Trench", zero_coords)

@show mariana_trench
mariana_trench = Main.Notebook.FancyPlace("Mariana Trench", 142.1999969482422, 11.350000381469727)
FancyPlace("Mariana Trench", 142.1999969482422, 11.350000381469727)

The Parameters.jl package extends this functionality by automatically creating keyword-based constructors (with default field values) for structs, beyond the default constructors.

using Parameters

@with_kw struct MyStruct
    a::Int = 6
    b::Float64 = -1.1
    c::UInt8
end

MyStruct(c=4) # call the keyword-based constructor created by @with_kw
MyStruct
  a: Int64 6
  b: Float64 -1.1
  c: UInt8 0x04

Another useful macro-based extension of the language is the Accessors.jl package. It makes it easy to “update” a field of an immutable struct by creating a modified copy, without having to spell out all the other field values:

using Accessors

a = MyStruct(a=10, c=4)
@show a

b = @set a.c = 0
@show b;

# but observe a is still untouched
@show a
a = Main.Notebook.MyStruct
  a: Int64 10
  b: Float64 -1.1
  c: UInt8 0x04

b = Main.Notebook.MyStruct
  a: Int64 10
  b: Float64 -1.1
  c: UInt8 0x00

a = Main.Notebook.MyStruct
  a: Int64 10
  b: Float64 -1.1
  c: UInt8 0x04
MyStruct
  a: Int64 10
  b: Float64 -1.1
  c: UInt8 0x04

1.5 Data structures (not in the standard library)

The JuliaCollections organization provides other data structures. One useful package is DataStructures.jl. As an example, let's use a heap for heap sort (note that this is only for illustrative purposes; the built-in sort will be more efficient).

using Random, DataStructures
Random.seed!(0)

function heap_sort!(a::AbstractArray)
    h = BinaryMinHeap{eltype(a)}()
    for e in a
        push!(h, e) #This is an O(log n) operation
    end

    #Write back onto the original array
    for i in 1:length(a)
        a[i] = pop!(h) #This is an O(log n) operation
    end
    return a
end

data = [65, 51, 32, 12, 23, 84, 68, 1]
heap_sort!(data)
@show data
@show heap_sort!(["Finland", "USA", "Australia", "Brazil"]);
data = [1, 12, 23, 32, 51, 65, 68, 84]
heap_sort!(["Finland", "USA", "Australia", "Brazil"]) = ["Australia", "Brazil", "Finland", "USA"]

Again, note that this is considerably slower than the standard library sort:

using BenchmarkTools

numbers = rand(10_000);
@benchmark sort!(numbers)
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (minmax):  8.730 μs 16.809 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.196 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.227 μs ± 412.276 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▂▇▇▆▇▃▁▂  ▁  ▁                                    ▂   ▂
  ▄▃▁▃▅████████▇▅██▇██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▁▁▃▁▁▃▅▅▅███ █
  8.73 μs      Histogram: log(frequency) by time      11.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark heap_sort!(numbers)
BenchmarkTools.Trial: 9555 samples with 1 evaluation per sample.
 Range (minmax):  494.762 μs 3.795 ms   GC (min … max): 0.00% … 85.86%
 Time  (median):     507.543 μs               GC (median):    0.00%
 Time  (mean ± σ):   519.436 μs ± 83.512 μs   GC (mean ± σ):  1.64% ±  6.07%

  ▇█▂   ▁▃▃                                               ▁ ▂
  ███▇▆▅███▇▅▁▃▁▁▁▃▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆██ █
  495 μs        Histogram: log(frequency) by time       917 μs <

 Memory estimate: 326.45 KiB, allocs estimate: 14.

2 Basic text processing

See strings in the Julia docs. Let's see some examples:

x = 2
"The value of x is $x"
"The value of x is 2"
split("Hello world!")
2-element Vector{SubString{String}}:
 "Hello"
 "world!"
# triple-quoted blocks strip the common leading whitespace, which plays nicely with indentation
my_life_story = """
    I was born
       in 1935.
    """

println(my_life_story)
I was born
   in 1935.
ismutable(String)
true
println("A rough ASCII table")
println("Decimal\tHex\tCharacter")
for c in 0x20:0x7E
    println(c,"\t","0x" * string(c,base=16),"\t",Char(c))
end
A rough ASCII table
Decimal Hex Character
32  0x20     
33  0x21    !
34  0x22    "
35  0x23    #
36  0x24    $
37  0x25    %
38  0x26    &
39  0x27    '
40  0x28    (
41  0x29    )
42  0x2a    *
43  0x2b    +
44  0x2c    ,
45  0x2d    -
46  0x2e    .
47  0x2f    /
48  0x30    0
49  0x31    1
50  0x32    2
51  0x33    3
52  0x34    4
53  0x35    5
54  0x36    6
55  0x37    7
56  0x38    8
57  0x39    9
58  0x3a    :
59  0x3b    ;
60  0x3c    <
61  0x3d    =
62  0x3e    >
63  0x3f    ?
64  0x40    @
65  0x41    A
66  0x42    B
67  0x43    C
68  0x44    D
69  0x45    E
70  0x46    F
71  0x47    G
72  0x48    H
73  0x49    I
74  0x4a    J
75  0x4b    K
76  0x4c    L
77  0x4d    M
78  0x4e    N
79  0x4f    O
80  0x50    P
81  0x51    Q
82  0x52    R
83  0x53    S
84  0x54    T
85  0x55    U
86  0x56    V
87  0x57    W
88  0x58    X
89  0x59    Y
90  0x5a    Z
91  0x5b    [
92  0x5c    \
93  0x5d    ]
94  0x5e    ^
95  0x5f    _
96  0x60    `
97  0x61    a
98  0x62    b
99  0x63    c
100 0x64    d
101 0x65    e
102 0x66    f
103 0x67    g
104 0x68    h
105 0x69    i
106 0x6a    j
107 0x6b    k
108 0x6c    l
109 0x6d    m
110 0x6e    n
111 0x6f    o
112 0x70    p
113 0x71    q
114 0x72    r
115 0x73    s
116 0x74    t
117 0x75    u
118 0x76    v
119 0x77    w
120 0x78    x
121 0x79    y
122 0x7a    z
123 0x7b    {
124 0x7c    |
125 0x7d    }
126 0x7e    ~

2.1 Regular Expressions

Julia has built-in regex!

text = "Julia is fun!"
pattern = r"Julia"
occursin(pattern, text)   # true
true
text = "Call me at 0468879289 when I'm home, or 0468879555 if I'm at work"
for m in eachmatch(r"04\d{8}", text)
  println("Found phone number $(m.match)")
end
Found phone number 0468879289
Found phone number 0468879555

2.2 Reading and writing files

The open function is your primary tool, often used with do blocks to ensure files are automatically closed.

To write text to a file:

open("work/my_output.txt", "w") do io
    write(io, "Hello from Julia!\n")
    write(io, "This is a second line.")
end
22

Here, "w" signifies “write mode.” If the file doesn’t exist, it’s created; if it does, its contents are overwritten.

To append text to an existing file:

open("work/my_output.txt", "a") do io
    write(io, "\nAppending a new line.")
end
22

The "a" mode means “append.” New stuff is added to the end of the file.

To read the entire content of a file:

file_content = read("work/my_output.txt", String)
println(file_content)
Hello from Julia!
This is a second line.
Appending a new line.

The read function with String as the type argument reads the whole file into a single string.

For reading a file line by line, which is more memory-efficient for large files:

open("work/my_output.txt", "r") do io
    for line in eachline(io)
        println("Line: ", line)
    end
end
Line: Hello from Julia!
Line: This is a second line.
Line: Appending a new line.

2.3 Some extras

The Printf package is built-in and provides formatted output functions similar to the C standard library.
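
For example, a small sketch using the @printf and @sprintf macros:

using Printf

@printf("Pi to 3 decimal places: %.3f\n", π)
line = @sprintf("%-10s %8.2f", "total:", 1234.5678)   # @sprintf returns the formatted string
println(line)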

Strings are closely related to I/O. See the I/O and Network docs. Something quite common is to use flush(stdout) to force buffered output to be written.

Sometimes when writing test code we want to check that strings are approximately equal. For this, the StringDistances.jl package is useful.

Consider the YAML.jl package for YAML files.

3 Dataframes

DataFrames are a huge subject. The Julia DataFrames.jl package provides functionality similar to Python's pandas or R's data frames.

Let’s get started

using DataFrames

3.1 Constructing DataFrames

3.1.1 From Column-Value Pairs

The most common way to create a DataFrame is by providing column names (as symbols) and their corresponding vectors of data.

# Create a DataFrame with two columns 'a' and 'b'
df = DataFrame(a = [1, 2, 3], b = [2.0, 4.0, 6.0])
3×2 DataFrame
 Row │ a      b
     │ Int64  Float64
─────┼────────────────
   1 │     1      2.0
   2 │     2      4.0
   3 │     3      6.0

Notice that Julia infers the data types for each column. Here, a is Int64 and b is Float64.

We can also create DataFrames using Pairs:

DataFrame(:c => ["apple", "banana", "cherry"], :d => [true, false, true])
3×2 DataFrame
 Row │ c       d
     │ String  Bool
─────┼───────────────
   1 │ apple    true
   2 │ banana  false
   3 │ cherry   true

3.1.2 From Dictionaries

You can also construct a DataFrame from a dictionary where keys are column names (symbols or strings) and values are vectors.

DataFrame(Dict(
    :name => ["Aapeli", "Yoni", "Jesse"],
    :age => [25, 30, 35],
    :city => ["New York", "Brisbane", "Berlin"]
))
3×3 DataFrame
 Row │ age    city      name
     │ Int64  String    String
─────┼─────────────────────────
   1 │    25  New York  Aapeli
   2 │    30  Brisbane  Yoni
   3 │    35  Berlin    Jesse

3.1.3 From NamedTuples

Creating a DataFrame from a vector of NamedTuples is very flexible.

DataFrame([
    (id = 1, value = 10.5, tag = "A"),
    (id = 2, value = 20.1, tag = "B"),
    (id = 3, value = 15.0, tag = "C")
])
3×3 DataFrame
 Row │ id     value    tag
     │ Int64  Float64  String
─────┼────────────────────────
   1 │     1     10.5  A
   2 │     2     20.1  B
   3 │     3     15.0  C

If the NamedTuples have different fields or different orders, we can use Tables.dictcolumntable to fill missing values with missing.

DataFrame(Tables.dictcolumntable([
    (id = 1, name = "Julia"),
    (id = 2, score = 95.5),
    (id = 3, name = "DataFrame", type = "Table")
]))
3×4 DataFrame
 Row │ id     name       score     type
     │ Int64  String?    Float64?  String?
─────┼──────────────────────────────────────
   1 │     1  Julia       missing  missing
   2 │     2  missing        95.5  missing
   3 │     3  DataFrame   missing  Table

Notice the ? after the types, indicating that these columns now allow missing values.

3.2 Column Names and Basic Information

In DataFrames.jl, columns are primarily accessed using Symbols.

df = DataFrame(a = [1, 2, 3], b = [2.0, 4.0, 6.0], c = ["x", "y", "z"])

df[:, :a]
3-element Vector{Int64}:
 1
 2
 3

You can get the column names:

names(df)
3-element Vector{String}:
 "a"
 "b"
 "c"

And column types:

eltype.(eachcol(df))
3-element Vector{DataType}:
 Int64
 Float64
 String

3.2.1 Size and Dimensions

To get the dimensions of a DataFrame, similar to matrices:

size(df) # (rows, columns)
(3, 3)

You can also specify the dimension:

@show size(df, 1) # Number of rows
@show size(df, 2) # Number of columns
size(df, 1) = 3
size(df, 2) = 3
3

3.2.2 Column-based Storage and Iterators

DataFrames.jl stores data in a column-oriented fashion. This means each column is essentially a Vector.

You can retrieve a column using dot syntax or indexing:

df.a # Access column 'a' using dot syntax
df[!, :b] # Access column 'b' using ! (returns a view, i.e., no copy)
df[:, :c] # Access column 'c' using :, which makes a copy
3-element Vector{String}:
 "x"
 "y"
 "z"

The difference between . and ! versus : for column retrieval is crucial for performance and understanding data manipulation.

df.a === df[!, :a] # They refer to the same underlying data
true
df.a === df[:, :a] # The : operator creates a copy, so they are not the same object
false

When you need to iterate through rows, you can use eachrow(df):

for row in eachrow(df)
    println("Row: $(row.a), $(row.b), $(row.c)")
end
Row: 1, 2.0, x
Row: 2, 4.0, y
Row: 3, 6.0, z

Each row here is a DataFrameRow object, which behaves like a NamedTuple for row-wise access.
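
For example, a quick sketch of row-wise access on the df defined above:

row = first(eachrow(df))   # a DataFrameRow
row.a                      # access fields with dot syntax, like a NamedTuple
row[:c]                    # or index by column name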

3.3 Indexing and Slicing

DataFrames can be indexed similar to matrices, but with the added flexibility of column names.

3.3.1 Positional Indexing

df[1, 1] # First row, first column
df[2, :b] # Second row, column 'b'
df[1, :] # First row (returns a DataFrameRow)
df[:, 1] # First column (returns a copied Vector)
3-element Vector{Int64}:
 1
 2
 3

3.3.2 Column Selection

You can select multiple columns by passing a vector of column names (symbols or strings):

df[:, [:a, :c]] # Select columns 'a' and 'c' (creates a new DataFrame)
3×2 DataFrame
 Row │ a      c
     │ Int64  String
─────┼───────────────
   1 │     1  x
   2 │     2  y
   3 │     3  z

Or exclude columns using Not:

df[:, Not(:b)] # Select all columns except 'b'
3×2 DataFrame
 Row │ a      c
     │ Int64  String
─────┼───────────────
   1 │     1  x
   2 │     2  y
   3 │     3  z

You can combine Not with a vector of columns:

df[:, Not([:a])] # Select all columns except 'a'
3×2 DataFrame
 Row │ b        c
     │ Float64  String
─────┼─────────────────
   1 │     2.0  x
   2 │     4.0  y
   3 │     6.0  z

3.3.3 Views vs. Copies

Recall the distinction between ! and : for column access. This also applies to row and full DataFrame indexing.

  • df[!, :colname] returns a view of the column (no copy).
  • df[:, :colname] returns a copy of the column.
  • df[!, [col1, col2]] returns the selected columns without copying them (a new DataFrame whose columns alias the parent's columns).
  • df[:, [col1, col2]] returns a copy of the selected columns (a new DataFrame).
  • @view df[row_indices, col_indices] (or view(df, row_indices, col_indices)) returns a SubDataFrame (view).
  • df[row_indices, col_indices] returns a new DataFrame (copy).

Using views (!) is more memory-efficient when you don’t need a separate copy of the data and want changes to the view to reflect in the original DataFrame. However, views require translating between the parent df indices and the view indices, which might in theory cause performance issues in edge cases.
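
Here is a small sketch (using the df from above) showing that mutating the un-copied column changes the parent DataFrame, while mutating a copy does not:

col_nocopy = df[!, :a]   # the actual column vector, no copy
col_copy   = df[:, :a]   # an independent copy

col_nocopy[1] = -1       # this changes df itself
col_copy[2]   = -2       # this does not affect df

df.a                     # the first entry is now -1; the second is unchanged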

3.4 Getting, Setting, and Mutating Data

You can retrieve, set, and modify individual cells, rows, or columns.

3.4.1 Setting Individual Values

df[1, :a] = 100 # Set value at row 1, column 'a'
100

3.4.2 Setting Entire Columns

df.b = [10.0, 20.0, 30.0] # Replace column 'b'
3-element Vector{Float64}:
 10.0
 20.0
 30.0

If the new column has a different type, it will be converted if possible, or an error will occur. If a column doesn’t exist, it will be added.

df.d = ["alpha", "beta", "gamma"] # Add a new column 'd'
3-element Vector{String}:
 "alpha"
 "beta"
 "gamma"

3.4.3 Broadcasting Assignment

Broadcasting (.=) is extremely powerful for performing element-wise operations and assignments efficiently.

df.a .= 0 # Set all values in column 'a' to 0
3-element Vector{Int64}:
 0
 0
 0

You can also use it with a scalar or a vector of compatible size:

df.b .= df.b * 2 # Double all values in column 'b'
3-element Vector{Float64}:
 20.0
 40.0
 60.0

Or apply a function:

df.c .= uppercase.(df.c) # Convert all strings in column 'c' to uppercase
3-element Vector{String}:
 "X"
 "Y"
 "Z"

Broadcasting assignment works with sub-selections as well:

df[1:2, :a] .= 99 # Set the first two values of column 'a' to 99
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
 99
 99

4 Story: Working with real data

We’ll now look at a more in-depth, hands-on exercise of using DataFrames.

The Queensland government has an open data portal, and makes available tide predictions at various locations on the state’s coast. (There’s some other interesting data as well at https://www.qld.gov.au/tides).

Let’s use this to do some exploration. We’ll first download it with the HTTP.jl package and write it to work/tides.csv:

using HTTP

response = HTTP.get("https://www.data.qld.gov.au/datastore/dump/1311fc19-1e60-444f-b5cf-24687f1c15a7?bom=True")
write("work/tides.csv", response.body)
1603979

Let’s explore the first few lines

open("work/tides.csv") do io
    for i  1:5
        line = readline(io)
        println(line)
    end
end
_id,Site,Seconds,DateTime,Water Level,Prediction,Residual,Latitude,Longitude
1,abellpoint,1750082400,2025-06-17T00:00,2.713,2.535,0.178,-20.2608,148.7103
2,abellpoint,1750083000,2025-06-17T00:10,2.765,2.605,0.160,-20.2608,148.7103
3,abellpoint,1750083600,2025-06-17T00:20,2.838,2.670,0.168,-20.2608,148.7103
4,abellpoint,1750084200,2025-06-17T00:30,2.898,2.731,0.167,-20.2608,148.7103

We can read it into a dataframe with CSV.read, and show the first few lines with first

using CSV

df = CSV.read("work/tides.csv", DataFrame)
first(df, 5)
5×9 DataFrame
Row _id Site Seconds DateTime Water Level Prediction Residual Latitude Longitude
Int64 String15 Int64 DateTime Float64 Float64 Float64 Float64 Float64
1 1 abellpoint 1750082400 2025-06-17T00:00:00 2.713 2.535 0.178 -20.2608 148.71
2 2 abellpoint 1750083000 2025-06-17T00:10:00 2.765 2.605 0.16 -20.2608 148.71
3 3 abellpoint 1750083600 2025-06-17T00:20:00 2.838 2.67 0.168 -20.2608 148.71
4 4 abellpoint 1750084200 2025-06-17T00:30:00 2.898 2.731 0.167 -20.2608 148.71
5 5 abellpoint 1750084800 2025-06-17T00:40:00 2.934 2.786 0.148 -20.2608 148.71

Note the inferred datatypes, including the automatically converted DateTime. We can customize this

# we could also do
df32 = CSV.read("work/tides.csv", DataFrame; types=Dict("Water Level" => Float32, "Prediction" => Float32, "Residual" => Float32, "Latitude" => Float32, "Longitude" => Float32));
println("With Float32s, we saved $(round((1-Base.summarysize(df32)/Base.summarysize(df))*100; digits=2))% memory")
With Float32s, we saved 29.64% memory

(This is silly, don’t do it in practice.)

Let’s also look at the last rows:

last(df, 3)
3×9 DataFrame
Row _id Site Seconds DateTime Water Level Prediction Residual Latitude Longitude
Int64 String15 Int64 DateTime Float64 Float64 Float64 Float64 Float64
1 19420 whyteislandnx 1750728000 2025-06-24T11:20:00 1.154 1.022 0.132 -27.4017 153.157
2 19421 whyteislandnx 1750728600 2025-06-24T11:30:00 1.091 0.964 0.127 -27.4017 153.157
3 19422 whyteislandnx 1750729200 2025-06-24T11:40:00 -99.0 0.907 -99.0 -27.4017 153.157

Here “-99.0” seems to mean missing. Let’s see where it comes from in the CSV:

open("work/tides.csv") do io
    while true
        line = readline(io)
        if contains(line, "-99")
            println(line)
            break
        end
    end
end
1073,abellpoint,1750725600,2025-06-24T10:40,-99.000,2.131,-99.000,-20.2608,148.7103

We can tell CSV.read to treat “-99.000” values as missing:

df = CSV.read("work/tides.csv", DataFrame; missingstring=["-99.000"])
last(df, 3)
3×9 DataFrame
Row _id Site Seconds DateTime Water Level Prediction Residual Latitude Longitude
Int64 String15 Int64 DateTime Float64? Float64 Float64? Float64 Float64
1 19420 whyteislandnx 1750728000 2025-06-24T11:20:00 1.154 1.022 0.132 -27.4017 153.157
2 19421 whyteislandnx 1750728600 2025-06-24T11:30:00 1.091 0.964 0.127 -27.4017 153.157
3 19422 whyteislandnx 1750729200 2025-06-24T11:40:00 missing 0.907 missing -27.4017 153.157

Note the “?” in water level/residual: this is DataFrames notation for columns which contain missing data.
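
When the missing values get in the way, a few standard tools help; here is a small sketch (each of these returns a new object and leaves df unchanged):

using Statistics

# skipmissing lets reductions ignore the missing entries
mean(skipmissing(df[:, Symbol("Water Level")]))

# dropmissing returns a copy without the rows that have missing in the given column(s)
dropmissing(df, Symbol("Water Level"))

# coalesce replaces missing entries with a fallback value
coalesce.(df[:, Symbol("Water Level")], 0.0);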

Referring to Water Level is a bit annoying now:

df[:, Symbol("Water Level")]
19422-element Vector{Union{Missing, Float64}}:
 2.713
 2.765
 2.838
 2.898
 2.934
 2.986
 3.029
 3.078
 3.103
 3.175
 ⋮
 1.519
 1.462
 1.397
 1.335
 1.269
 1.212
 1.154
 1.091
  missing

Let’s rename it, and let’s rename DateTime too to avoid confusion:

# ! means in-place
rename!(df, Symbol("Water Level") => :WaterLevel, Symbol("DateTime") => :Time)
first(df, 5)
5×9 DataFrame
Row _id Site Seconds Time WaterLevel Prediction Residual Latitude Longitude
Int64 String15 Int64 DateTime Float64? Float64 Float64? Float64 Float64
1 1 abellpoint 1750082400 2025-06-17T00:00:00 2.713 2.535 0.178 -20.2608 148.71
2 2 abellpoint 1750083000 2025-06-17T00:10:00 2.765 2.605 0.16 -20.2608 148.71
3 3 abellpoint 1750083600 2025-06-17T00:20:00 2.838 2.67 0.168 -20.2608 148.71
4 4 abellpoint 1750084200 2025-06-17T00:30:00 2.898 2.731 0.167 -20.2608 148.71
5 5 abellpoint 1750084800 2025-06-17T00:40:00 2.934 2.786 0.148 -20.2608 148.71

Drop some redundant columns

select!(df, [:Site, :Latitude, :Longitude, :Time, :WaterLevel, :Prediction])
first(df, 5)
5×6 DataFrame
Row Site Latitude Longitude Time WaterLevel Prediction
String15 Float64 Float64 DateTime Float64? Float64
1 abellpoint -20.2608 148.71 2025-06-17T00:00:00 2.713 2.535
2 abellpoint -20.2608 148.71 2025-06-17T00:10:00 2.765 2.605
3 abellpoint -20.2608 148.71 2025-06-17T00:20:00 2.838 2.67
4 abellpoint -20.2608 148.71 2025-06-17T00:30:00 2.898 2.731
5 abellpoint -20.2608 148.71 2025-06-17T00:40:00 2.934 2.786

Here is our list of columns:

names(df)
6-element Vector{String}:
 "Site"
 "Latitude"
 "Longitude"
 "Time"
 "WaterLevel"
 "Prediction"

Or by piping

df |> names
6-element Vector{String}:
 "Site"
 "Latitude"
 "Longitude"
 "Time"
 "WaterLevel"
 "Prediction"

4.1 Getting to know our data

Let’s dive a bit deeper, what do we have?

describe(df)
6×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Union… Any Any Any Int64 Type
1 Site abellpoint whyteislandnx 0 String15
2 Latitude -26.2292 -28.1721 -27.4382 -19.1266 0 Float64
3 Longitude 152.434 146.91 153.249 153.558 0 Float64
4 Time 2025-06-17T00:00:00 2025-06-20T17:50:00 2025-06-24T11:40:00 0 DateTime
5 WaterLevel 1.36547 -0.233 1.153 5.774 7743 Union{Missing, Float64}
6 Prediction 1.18008 -0.179 1.007 5.537 0 Float64

What are the site names?

unique(df.Site)
18-element Vector{String15}:
 "abellpoint"
 "bananabank"
 "birkdale"
 "coombabahst"
 "hallsbay"
 "husseycreek"
 "maroochydore"
 "rabybay"
 "russellislande"
 "russellislandw"
 "seaforth"
 "tangalooma"
 "theskids"
 "townsvillecard"
 "tweedsbj"
 "wavebreaknc"
 "wavebreakwc"
 "whyteislandnx"

A note on String15: CSV.jl reads short strings as fixed-width string types from InlineStrings.jl (such as String15), and columns with few unique values are pooled into a PooledArrays.PooledVector, which saves memory:

df.Site
19422-element PooledArrays.PooledVector{String15, UInt32, Vector{UInt32}}:
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 ⋮
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"

One way is to compute the squared error in prediction with transform, as sketched below.
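
A sketch with the non-mutating transform (using the already-renamed :WaterLevel and :Prediction columns); later on we will add this column to df with a direct assignment instead:

transform(df, [:WaterLevel, :Prediction] => ByRow((w, p) -> (w - p)^2) => :SqResidual)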

Let’s group by site

# groupby takes a dataframe and a list of columns to group by
by_site = groupby(df, :Site)

GroupedDataFrame with 18 groups based on key: Site

First Group (1079 rows): Site = "abellpoint"
1054 rows omitted
Row Site Latitude Longitude Time WaterLevel Prediction
String15 Float64 Float64 DateTime Float64? Float64
1 abellpoint -20.2608 148.71 2025-06-17T00:00:00 2.713 2.535
2 abellpoint -20.2608 148.71 2025-06-17T00:10:00 2.765 2.605
3 abellpoint -20.2608 148.71 2025-06-17T00:20:00 2.838 2.67
4 abellpoint -20.2608 148.71 2025-06-17T00:30:00 2.898 2.731
5 abellpoint -20.2608 148.71 2025-06-17T00:40:00 2.934 2.786
6 abellpoint -20.2608 148.71 2025-06-17T00:50:00 2.986 2.839
7 abellpoint -20.2608 148.71 2025-06-17T01:00:00 3.029 2.887
8 abellpoint -20.2608 148.71 2025-06-17T01:10:00 3.078 2.93
9 abellpoint -20.2608 148.71 2025-06-17T01:20:00 3.103 2.971
10 abellpoint -20.2608 148.71 2025-06-17T01:30:00 3.175 3.007
11 abellpoint -20.2608 148.71 2025-06-17T01:40:00 3.204 3.037
12 abellpoint -20.2608 148.71 2025-06-17T01:50:00 3.202 3.063
13 abellpoint -20.2608 148.71 2025-06-17T02:00:00 3.224 3.084
1068 abellpoint -20.2608 148.71 2025-06-24T09:50:00 2.577 2.379
1069 abellpoint -20.2608 148.71 2025-06-24T10:00:00 2.587 2.347
1070 abellpoint -20.2608 148.71 2025-06-24T10:10:00 2.513 2.306
1071 abellpoint -20.2608 148.71 2025-06-24T10:20:00 2.492 2.255
1072 abellpoint -20.2608 148.71 2025-06-24T10:30:00 2.43 2.196
1073 abellpoint -20.2608 148.71 2025-06-24T10:40:00 missing 2.131
1074 abellpoint -20.2608 148.71 2025-06-24T10:50:00 missing 2.057
1075 abellpoint -20.2608 148.71 2025-06-24T11:00:00 missing 1.979
1076 abellpoint -20.2608 148.71 2025-06-24T11:10:00 missing 1.897
1077 abellpoint -20.2608 148.71 2025-06-24T11:20:00 missing 1.811
1078 abellpoint -20.2608 148.71 2025-06-24T11:30:00 missing 1.721
1079 abellpoint -20.2608 148.71 2025-06-24T11:40:00 missing 1.631

Last Group (1079 rows): Site = "whyteislandnx"
1054 rows omitted
Row Site Latitude Longitude Time WaterLevel Prediction
String15 Float64 Float64 DateTime Float64? Float64
1 whyteislandnx -27.4017 153.157 2025-06-17T00:00:00 2.373 2.215
2 whyteislandnx -27.4017 153.157 2025-06-17T00:10:00 2.416 2.251
3 whyteislandnx -27.4017 153.157 2025-06-17T00:20:00 2.437 2.282
4 whyteislandnx -27.4017 153.157 2025-06-17T00:30:00 2.471 2.309
5 whyteislandnx -27.4017 153.157 2025-06-17T00:40:00 2.491 2.331
6 whyteislandnx -27.4017 153.157 2025-06-17T00:50:00 2.51 2.347
7 whyteislandnx -27.4017 153.157 2025-06-17T01:00:00 2.509 2.357
8 whyteislandnx -27.4017 153.157 2025-06-17T01:10:00 2.516 2.362
9 whyteislandnx -27.4017 153.157 2025-06-17T01:20:00 2.497 2.361
10 whyteislandnx -27.4017 153.157 2025-06-17T01:30:00 2.486 2.353
11 whyteislandnx -27.4017 153.157 2025-06-17T01:40:00 2.459 2.339
12 whyteislandnx -27.4017 153.157 2025-06-17T01:50:00 2.428 2.318
13 whyteislandnx -27.4017 153.157 2025-06-17T02:00:00 2.4 2.291
1068 whyteislandnx -27.4017 153.157 2025-06-24T09:50:00 1.691 1.541
1069 whyteislandnx -27.4017 153.157 2025-06-24T10:00:00 1.638 1.487
1070 whyteislandnx -27.4017 153.157 2025-06-24T10:10:00 1.578 1.431
1071 whyteislandnx -27.4017 153.157 2025-06-24T10:20:00 1.519 1.375
1072 whyteislandnx -27.4017 153.157 2025-06-24T10:30:00 1.462 1.316
1073 whyteislandnx -27.4017 153.157 2025-06-24T10:40:00 1.397 1.256
1074 whyteislandnx -27.4017 153.157 2025-06-24T10:50:00 1.335 1.198
1075 whyteislandnx -27.4017 153.157 2025-06-24T11:00:00 1.269 1.138
1076 whyteislandnx -27.4017 153.157 2025-06-24T11:10:00 1.212 1.079
1077 whyteislandnx -27.4017 153.157 2025-06-24T11:20:00 1.154 1.022
1078 whyteislandnx -27.4017 153.157 2025-06-24T11:30:00 1.091 0.964
1079 whyteislandnx -27.4017 153.157 2025-06-24T11:40:00 missing 0.907

This produces a grouped dataframe

typeof(by_site)
GroupedDataFrame{DataFrame}

What’s the mean water level per site?

# get the mean function
using Statistics

# enter ∘ with \circ TAB
# combine takes the grouped df and a list of operations
combine(by_site, :WaterLevel => mean  skipmissing => :MeanWaterLevel)
18×2 DataFrame
Row Site MeanWaterLevel
String15 Float64
1 abellpoint 2.03237
2 bananabank 1.64445
3 birkdale 1.43709
4 coombabahst 0.275567
5 hallsbay 0.725246
6 husseycreek NaN
7 maroochydore 0.77414
8 rabybay 1.61293
9 russellislande 0.723282
10 russellislandw NaN
11 seaforth 3.01772
12 tangalooma NaN
13 theskids NaN
14 townsvillecard NaN
15 tweedsbj 1.19848
16 wavebreaknc NaN
17 wavebreakwc NaN
18 whyteislandnx 1.4865

Here we applied mean(skipmissing(...)) to the :WaterLevel column.

Let’s plot the water level at some sites

using Plots

my_sites = ["coombabahst", "russellislande", "rabybay"]

p = plot(
    xlabel="Time",
    ylabel="Water Level",
    title="Water Level Over Time for Selected Sites",
    legend=:topleft
)

for group in by_site
    site_name = group.Site[1] # Get the site name from the first row of the group
    if site_name  my_sites
        continue
    end
    plot!(
        p,
        group.Time,
        group.WaterLevel,
        # Label for the legend
        label=site_name,
        linealpha=0.8,
        linewidth=2
    )
end

p

How many data points do we have per site?

combine(by_site, nrow => :Count)
18×2 DataFrame
Row Site Count
String15 Int64
1 abellpoint 1079
2 bananabank 1079
3 birkdale 1079
4 coombabahst 1079
5 hallsbay 1079
6 husseycreek 1079
7 maroochydore 1079
8 rabybay 1079
9 russellislande 1079
10 russellislandw 1079
11 seaforth 1079
12 tangalooma 1079
13 theskids 1079
14 townsvillecard 1079
15 tweedsbj 1079
16 wavebreaknc 1079
17 wavebreakwc 1079
18 whyteislandnx 1079

Let’s compute the squared residual:

df[!, :SqResidual] = (df.WaterLevel - df.Prediction).^2
19422-element Vector{Union{Missing, Float64}}:
 0.031683999999999976
 0.025600000000000046
 0.02822400000000005
 0.027889000000000087
 0.021904000000000038
 0.021609000000000073
 0.020163999999999974
 0.021903999999999906
 0.01742400000000003
 0.028223999999999902
 ⋮
 0.020735999999999973
 0.021315999999999974
 0.019881000000000003
 0.018769000000000004
 0.017161000000000003
 0.017689000000000003
 0.01742399999999997
 0.016129
  missing

There were some sites with fully missing water levels

all_missing = combine(groupby(df, :Site), :WaterLevel => (x -> all(ismissing, x)) => :IsMissing)
18×2 DataFrame
Row Site IsMissing
String15 Bool
1 abellpoint false
2 bananabank false
3 birkdale false
4 coombabahst false
5 hallsbay false
6 husseycreek true
7 maroochydore false
8 rabybay false
9 russellislande false
10 russellislandw true
11 seaforth false
12 tangalooma true
13 theskids true
14 townsvillecard true
15 tweedsbj false
16 wavebreaknc true
17 wavebreakwc true
18 whyteislandnx false
filter!(row -> row.IsMissing == false, all_missing)
11×2 DataFrame
Row Site IsMissing
String15 Bool
1 abellpoint false
2 bananabank false
3 birkdale false
4 coombabahst false
5 hallsbay false
6 maroochydore false
7 rabybay false
8 russellislande false
9 seaforth false
10 tweedsbj false
11 whyteislandnx false
select!(all_missing, Not(:IsMissing))
11×1 DataFrame
Row Site
String15
1 abellpoint
2 bananabank
3 birkdale
4 coombabahst
5 hallsbay
6 maroochydore
7 rabybay
8 russellislande
9 seaforth
10 tweedsbj
11 whyteislandnx
df_clean = innerjoin(df, all_missing, on=:Site)
11869×7 DataFrame
11844 rows omitted
Row Site Latitude Longitude Time WaterLevel Prediction SqResidual
String15 Float64 Float64 DateTime Float64? Float64 Float64?
1 abellpoint -20.2608 148.71 2025-06-17T00:00:00 2.713 2.535 0.031684
2 abellpoint -20.2608 148.71 2025-06-17T00:10:00 2.765 2.605 0.0256
3 abellpoint -20.2608 148.71 2025-06-17T00:20:00 2.838 2.67 0.028224
4 abellpoint -20.2608 148.71 2025-06-17T00:30:00 2.898 2.731 0.027889
5 abellpoint -20.2608 148.71 2025-06-17T00:40:00 2.934 2.786 0.021904
6 abellpoint -20.2608 148.71 2025-06-17T00:50:00 2.986 2.839 0.021609
7 abellpoint -20.2608 148.71 2025-06-17T01:00:00 3.029 2.887 0.020164
8 abellpoint -20.2608 148.71 2025-06-17T01:10:00 3.078 2.93 0.021904
9 abellpoint -20.2608 148.71 2025-06-17T01:20:00 3.103 2.971 0.017424
10 abellpoint -20.2608 148.71 2025-06-17T01:30:00 3.175 3.007 0.028224
11 abellpoint -20.2608 148.71 2025-06-17T01:40:00 3.204 3.037 0.027889
12 abellpoint -20.2608 148.71 2025-06-17T01:50:00 3.202 3.063 0.019321
13 abellpoint -20.2608 148.71 2025-06-17T02:00:00 3.224 3.084 0.0196
11858 whyteislandnx -27.4017 153.157 2025-06-24T09:50:00 1.691 1.541 0.0225
11859 whyteislandnx -27.4017 153.157 2025-06-24T10:00:00 1.638 1.487 0.022801
11860 whyteislandnx -27.4017 153.157 2025-06-24T10:10:00 1.578 1.431 0.021609
11861 whyteislandnx -27.4017 153.157 2025-06-24T10:20:00 1.519 1.375 0.020736
11862 whyteislandnx -27.4017 153.157 2025-06-24T10:30:00 1.462 1.316 0.021316
11863 whyteislandnx -27.4017 153.157 2025-06-24T10:40:00 1.397 1.256 0.019881
11864 whyteislandnx -27.4017 153.157 2025-06-24T10:50:00 1.335 1.198 0.018769
11865 whyteislandnx -27.4017 153.157 2025-06-24T11:00:00 1.269 1.138 0.017161
11866 whyteislandnx -27.4017 153.157 2025-06-24T11:10:00 1.212 1.079 0.017689
11867 whyteislandnx -27.4017 153.157 2025-06-24T11:20:00 1.154 1.022 0.017424
11868 whyteislandnx -27.4017 153.157 2025-06-24T11:30:00 1.091 0.964 0.016129
11869 whyteislandnx -27.4017 153.157 2025-06-24T11:40:00 missing 0.907 missing

Let’s compute the 90th percentile of water level per site:

p90(x) = quantile(x, .9)
combine(groupby(df_clean, :Site), :WaterLevel => p90  skipmissing => :WaterLevelP90)
11×2 DataFrame
Row Site WaterLevelP90
String15 Float64
1 abellpoint 3.0888
2 bananabank 2.4426
3 birkdale 2.2089
4 coombabahst 0.6794
5 hallsbay 1.2018
6 maroochydore 1.1973
7 rabybay 2.3942
8 russellislande 1.18
9 seaforth 4.5453
10 tweedsbj 1.761
11 whyteislandnx 2.232

Let’s plot the mean square error in prediction per site

mse_by_site = combine(groupby(df_clean, :Site), :SqResidual => mean  skipmissing => :MSE)

plot(mse_by_site.Site, mse_by_site.MSE, seriestype=:bar, xrotation=45, title="MSE in water level prediction by site")

4.2 More Transformations

Here are key operations:

  • groupby – Split a DataFrame into groups by one or more columns.
  • combine – Apply functions to groups or columns and combine results in a new DataFrame.
  • transform – Create or modify columns (optionally in-place).
  • select – Select (and transform) columns, optionally creating new ones.

With DataFramesMeta.jl:

  • @subset – Filter rows based on row-wise conditions.
  • @select – Select or transform columns.
  • @transform – Add or modify columns by assignment.
  • @combine – Combine results of group operations into a DataFrame.
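
Here is a small sketch of what these macros look like (reusing df_clean and Statistics from above, and assuming DataFramesMeta.jl is installed; the assignment-style syntax is that of recent DataFramesMeta.jl versions):

using DataFramesMeta

# rows where the predicted water level exceeds 2 metres
@subset(df_clean, :Prediction .> 2.0)

# mean water level per site
@combine(groupby(df_clean, :Site), :MeanWaterLevel = mean(skipmissing(:WaterLevel)))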

For more, see the official DataFrames.jl documentation and the DataFramesMeta.jl documentation.

As there are already great resources for this on the web, let us go through these resources:

  1. A PumasAI tutorial
  2. A UQ course tutorial

4.3 More dataframes

Here are the common packages in this ecosystem:

  • DataFrames.jl - the main dataframes package.
  • DataFramesMeta.jl - metaprogramming tools for DataFrames.jl objects.
  • CSV.jl - read and write to CSV files.
  • CategoricalArrays.jl - provides tools for working with categorical variables, both with unordered (nominal variables) and ordered categories (ordinal variables), optionally with missing values.
  • Chain.jl - provides a useful macro that rewrites a series of expressions into a chain (see the sketch after this list).
  • XLSX.jl - Excel file reader/writer for the Julia language.
  • SummaryTables.jl - creating publication-ready tables in HTML, docx, LaTeX and Typst formats.
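
For instance, Chain.jl's @chain macro threads the result of each line into the next call as its first argument; a sketch reusing df_clean from above (assuming Chain.jl is installed):

using Chain, Statistics

@chain df_clean begin
    groupby(:Site)
    combine(:WaterLevel => mean ∘ skipmissing => :MeanWaterLevel)
    sort(:MeanWaterLevel)
end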

5 JSON

There are two competing JSON libraries: JSON.jl and JSON3.jl. Here is a JSON.jl example:

using HTTP
using JSON

response = HTTP.get("https://couchers.org/api/status")
data = JSON.parse(String(response.body))

println(data)
Dict{String, Any}("coucherCount" => "55920", "version" => "develop-24532dd9", "nonce" => "")
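
Going the other way, JSON.json serializes a Julia data structure to a JSON string; a small sketch (the file name here is just an example):

json_string = JSON.json(data)
println(json_string)

# write it to a (hypothetical) file
open("work/status.json", "w") do io
    write(io, json_string)
end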

6 Serialization

Julia provides serialization out of the box. Here is an example. The example is slightly more interesting because we also create a tree-like data structure.

using Random

Random.seed!(0)

struct Node
    id::UInt16
    friends::Vector{Node}

    # inner constructor (uses `new` to construct the instance)
    Node() = new(rand(UInt16), [])
    # another inner constructor
    Node(friend::Node) = new(rand(UInt16),[friend])
end
"""
Makes `n` children for `node`, each having `friend` as its single friend
"""
function make_children(node::Node, n::Int, friend::Node)
    for _ in 1:n
        new_node = Node(friend)
        push!(node.friends, new_node)
    end
end;
# make a tree
root = Node()
make_children(root, 3, root)
for node in root.friends
    make_children(node, 2,root)
end

root
Node(0x67db, Node[Node(0x118c, Node[Node(#= circular reference @-4 =#), Node(0xa95f, Node[Node(#= circular reference @-6 =#)]), Node(0x1dc7, Node[Node(#= circular reference @-6 =#)])]), Node(0xdcb5, Node[Node(#= circular reference @-4 =#), Node(0x1c00, Node[Node(#= circular reference @-6 =#)]), Node(0xb3b6, Node[Node(#= circular reference @-6 =#)])]), Node(0x1602, Node[Node(#= circular reference @-4 =#), Node(0x4a1d, Node[Node(#= circular reference @-6 =#)]), Node(0x074f, Node[Node(#= circular reference @-6 =#)])])])

Note that when we try to show root, it’s complete gibberish. We can write a Base.show() function to make this pretty:

# make it show up pretty
function Base.show(io::IO, x::Node)
    shown = Set{Node}()
    function recursive_show(y::Node, depth::Int)
        print(io, "  "^depth*"Node: $(y.id)")
        if y in shown
            println(io, " (already shown)")
        else
            push!(shown, y)
            println(io, ", friends:")
            for f in y.friends
                recursive_show(f, depth+1)
            end
        end
    end
    recursive_show(x, 0)
    return nothing
end

root
Node: 26587, friends:
  Node: 4492, friends:
    Node: 26587 (already shown)
    Node: 43359, friends:
      Node: 26587 (already shown)
    Node: 7623, friends:
      Node: 26587 (already shown)
  Node: 56501, friends:
    Node: 26587 (already shown)
    Node: 7168, friends:
      Node: 26587 (already shown)
    Node: 46006, friends:
      Node: 26587 (already shown)
  Node: 5634, friends:
    Node: 26587 (already shown)
    Node: 18973, friends:
      Node: 26587 (already shown)
    Node: 1871, friends:
      Node: 26587 (already shown)

Suppose we now want to save this in a file…

using Serialization
serialize("work/tree.dat", root)
newroot = deserialize("work/tree.dat")
Node: 26587, friends:
  Node: 4492, friends:
    Node: 26587 (already shown)
    Node: 43359, friends:
      Node: 26587 (already shown)
    Node: 7623, friends:
      Node: 26587 (already shown)
  Node: 56501, friends:
    Node: 26587 (already shown)
    Node: 7168, friends:
      Node: 26587 (already shown)
    Node: 46006, friends:
      Node: 26587 (already shown)
  Node: 5634, friends:
    Node: 26587 (already shown)
    Node: 18973, friends:
      Node: 26587 (already shown)
    Node: 1871, friends:
      Node: 26587 (already shown)

7 Additional online resources

8 Exercises

  1. You have this dictionary:
        country_capital = Dict(
                                "France" => "Paris",
                                "Germany" => "Berlin",
                                "Italy" => "Rome",
                                "Spain" => "Madrid")

Now create a new dictionary, capital_country where the keys are the capital cities and the values are the country names.

  2. Looking up membership with in (or the ∈ symbol) works both for an array and for a set. You can create an array with rand(1:10^10, 10^7), which will have \(10^7\) entries selected from the numbers \(1,\ldots,10^{10}\). You can also wrap this in Set(...) to create a set. Now compare lookup timings with @time or @btime (from BenchmarkTools.jl) to see whether a single rand(1:10^10) is an element of the array versus the set.
  3. Given the string text = "Julia is a high-level, high-performance programming language.", write Julia code to count how many times the substring “high” appears in the text (case-insensitive).
  4. Install the RDatasets.jl package. Then load the “iris” dataset. Then, filter the DataFrame to only include rows where the SepalLength is greater than its mean, and display the first five rows of the result.
  5. Load the “mtcars” dataset from RDatasets. Then, group the data by the Cyl (number of cylinders) column and compute the average MPG (miles per gallon) for each group. Display the resulting summary DataFrame.
  6. Consider this JSON file (put it in a string):
                {
                  "name": "Alice",
                  "age": 30,
                  "skills": ["Julia", "Python", "SQL"]
                }

Given the JSON string above, write Julia code to parse it and print the person’s name and the number of skills they have.

  7. Create an array of \(10^6\) random Float64 values (you can use rand(Float64, 10^6)). Then serialize it and inspect the file size. Check that the size makes sense given sizeof(Float64). Now do the same with Float16, Float32, UInt8, and another type of your choice.


Footnotes

  1. “Constant time” suffices in practice; there are some minutiae, and the worst case is \(O(n)\), which is bad. For theoretical applications, these operations can be implemented in \(O(\log n)\) worst-case time with self-balancing trees, but practical implementations rely on constant average time and engineering tricks to avoid the linear-time worst case.↩︎
