A Crash Course on the Julia Language and Ecosystem

An Accumulation Point workshop for AIMS, delivered in June 2025. See workshop materials in the AIMS GitHub repo.


Unit 2 - Processing Data

In this unit we focus on data. We start by considering basic Julia data structures, including dictionaries, sets, named tuples, and others. We then focus on basic text (string) processing in Julia. Then we move on to DataFrames, a general and useful way to keep tabular data. Finally, we touch on JSON files and serialization.

1 Basic data structures

Beyond arrays, which are very important and include Vector and Matrix, here are some basic data structures in Julia:

1.1 Dictionaries

See Dictionaries in the Julia docs.

Dictionaries (often called hash maps or associative arrays) store key-value pairs. Each key in a dictionary must be unique. They are incredibly useful for many purposes because they let you look up values quickly based on a unique identifier. In particular, well-designed hash maps are implemented so that lookup (get value by key), insertion (associate a value with a key), and deletion (remove value by key) all take average \(O(1)\) (constant) time¹. This makes them popular both for their simplicity and for speeding up algorithms with smart tricks (like reverse indices built on hash maps).
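
As a small illustration of the reverse-index idea, here is a sketch (with made-up sentences, not from the workshop materials) that maps each word to the indices of the sentences containing it:

# a sketch of a reverse index built on a Dict (hypothetical example data)
sentences = ["the cat sat", "the dog ran", "a cat ran"]

reverse_index = Dict{String, Vector{Int}}()
for (i, sentence) in enumerate(sentences)
    for word in split(sentence)
        # get! returns the stored vector, inserting an empty one if the key is new
        push!(get!(reverse_index, word, Int[]), i)
    end
end

reverse_index["cat"]   # [1, 3] — found with an average O(1) lookup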

pop = Dict()
pop["Australia"] = 27_864_000
pop["United States"] = 340_111_000
pop["Finland"] = 5_634_000

pop
Dict{Any, Any} with 3 entries:
  "United States" => 340111000
  "Finland"       => 5634000
  "Australia"     => 27864000

Let's inspect its type:

@show typeof(pop)
typeof(pop) = Dict{Any, Any}
Dict{Any, Any}

We can restrict the types:

strict_pop = Dict{String,Int}()
strict_pop["Australia"] = 27_864_000
strict_pop["United States"] = 340_111_000
strict_pop["Finland"] = 5_634_000

strict_pop
Dict{String, Int64} with 3 entries:
  "United States" => 340111000
  "Finland"       => 5634000
  "Australia"     => 27864000
# this is okay
pop["North Pole"] = 0.5
# not okay
strict_pop["North Pole"] = 0.5
InexactError: Int64(0.5)
Stacktrace:
 [1] Int64
   @ ./float.jl:994 [inlined]
 [2] convert
   @ ./number.jl:7 [inlined]
 [3] setindex!(h::Dict{String, Int64}, v0::Float64, key::String)
   @ Base ./dict.jl:355
 [4] top-level scope
   @ /work/julia-ml/Julia_ML_training/Julia_ML_training/unit2/unit_2.qmd:48

Checking and accessing dictionary values:

# Accessing a value
population_australia = pop["Australia"]
println("Population of Australia: ", population_australia)

mars_pop_safe = get(pop, "Mars", nothing)
Population of Australia: 27864000

Use haskey to check if the key exists:

if haskey(pop, "United States")
    println("United States population exists: ", pop["United States"])
end

if !haskey(pop, "Atlantis")
    println("Atlantis population does not exist.")
end
United States population exists: 340111000
Atlantis population does not exist.

More useful operations:

  • keys(): Returns an iterable collection of all keys in the dictionary.
  • values(): Returns an iterable collection of all values in the dictionary.
  • pairs(): Returns an iterable collection of Pair objects (key => value) for all entries.
  • length(): Returns the number of key-value pairs in the dictionary.
  • empty!(): Removes all key-value pairs from the dictionary.
println()
println("Keys in pop: ", keys(pop))
println("Values in pop: ", values(pop))
println("Pairs in pop: ", pairs(pop))
println("Number of entries in pop: ", length(pop))

# Iterating through a dictionary
println()
println("Iterating through pop:")
for (country, population) in pop
    println("$country: $population")
end

# Create a dictionary using the Dict constructor with pairs
new_countries = Dict("Canada" => 38_000_000, "Mexico" => 126_000_000)
println()
println("New countries dictionary: ", new_countries)

# Note that `=>` constructs a pair:
typeof(:s => 2)

# Merging dictionaries (creates a new dictionary)
merged_pop = merge(pop, new_countries)
println("Merged population dictionary: ", merged_pop)

# In-place merge (modifies the first dictionary)
merge!(pop, new_countries)
println("Pop after in-place merge: ", pop)

# Clearing a dictionary
empty!(pop)
println("Pop after empty!: ", pop)

Keys in pop: Any["North Pole", "United States", "Finland", "Australia"]
Values in pop: Any[0.5, 340111000, 5634000, 27864000]
Pairs in pop: Dict{Any, Any}("North Pole" => 0.5, "United States" => 340111000, "Finland" => 5634000, "Australia" => 27864000)
Number of entries in pop: 4

Iterating through pop:
North Pole: 0.5
United States: 340111000
Finland: 5634000
Australia: 27864000

New countries dictionary: Dict("Mexico" => 126000000, "Canada" => 38000000)
Merged population dictionary: Dict{Any, Any}("North Pole" => 0.5, "United States" => 340111000, "Finland" => 5634000, "Mexico" => 126000000, "Australia" => 27864000, "Canada" => 38000000)
Pop after in-place merge: Dict{Any, Any}("North Pole" => 0.5, "United States" => 340111000, "Finland" => 5634000, "Mexico" => 126000000, "Australia" => 27864000, "Canada" => 38000000)
Pop after empty!: Dict{Any, Any}()

1.2 Sets

See Set-Like Collections in the Julia docs. Here are some examples.

A = Set([2,7,2,3])
B = Set(1:6)
omega = Set(1:10)

AunionB = union(A, B)
AintersectionB = intersect(A, B)
BdifferenceA = setdiff(B,A)
Bcomplement = setdiff(omega,B)
AsymDifferenceB = union(setdiff(A,B),setdiff(B,A))
println("A = $A, B = $B")
println("A union B = $AunionB")
println("A intersection B = $AintersectionB")
println("B diff A = $BdifferenceA")
println("B complement = $Bcomplement")
println("A symDifference B = $AsymDifferenceB")
println("The element '6' is an element of A: $(in(6,A))")
println("Symmetric difference and intersection are subsets of the union: ",
        issubset(AsymDifferenceB,AunionB),", ", issubset(AintersectionB,AunionB))
A = Set([7, 2, 3]), B = Set([5, 4, 6, 2, 3, 1])
A union B = Set([5, 4, 6, 7, 2, 3, 1])
A intersection B = Set([2, 3])
B diff A = Set([5, 4, 6, 1])
B complement = Set([7, 10, 9, 8])
A symDifference B = Set([5, 4, 6, 7, 1])
The element '6' is an element of A: false
Symmetric difference and intersection are subsets of the union: true, true

Internally, sets are a thin wrapper around dictionaries with no values:

# base/set.jl
struct Set{T} <: AbstractSet{T}
    dict::Dict{T,Nothing}

    global _Set(dict::Dict{T,Nothing}) where {T} = new{T}(dict)
end

1.3 Named tuples

In addition to tuples (see docs), Julia has named tuples. Here are some examples:

my_stuff = (age=28, gender=:male, name="Aapeli")
yonis_stuff = (age=51, gender=:male, name="Yoni")

my_stuff.gender
:male

Named tuples are also used as keyword arguments.

function my_function_kwargs(; keyword_arg1=default_value1, keyword_arg2=default_value2)
    println("Keyword 1: $keyword_arg1")
    println("Keyword 2: $keyword_arg2")
end

todays_args = (keyword_arg1="hello!", keyword_arg2="nothing")
my_function_kwargs(; todays_args...)
Keyword 1: hello!
Keyword 2: nothing

An example with Plots:

using Plots
using LaTeXStrings

# we can use named tuples to pass in keyword arguments
args = (label=false, xlim=(-1,1), xlabel=L"x")
# `...` is the "splat" operator, similar to `**kwargs` in Python
p1 = plot(x->sin(1/x); ylabel=L"\sin(\frac{1}{x})", args...)
p2 = plot(x->cos(1/x); ylabel=L"\cos(\frac{1}{x})", args...)
plot(p1, p2, size=(700,300))

1.4 Structs (Composite Types)

You can of course define your own types; see composite types in the docs. You can use struct, which is immutable by default, or mutable struct. In terms of memory management, immutable types can typically live on the stack, while mutable types live on the heap and require allocations and garbage collection.
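
For contrast with the immutable structs below, here is a minimal sketch (a hypothetical example, not from the workshop) of a mutable struct, whose fields can be reassigned after construction:

mutable struct Counter
    count::Int
end

c = Counter(0)
c.count += 1   # fields of a mutable struct can be reassigned in place
c.count        # 1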

struct Place
  name::String
  lon::Float64
  lat::Float64
end
# Constructing Place instances
new_york = Place("New York", -74.0060, 40.7128)
brisbane = Place("Brisbane", 153.0251, -27.4698)
townsville = Place("Townsville", 146.8169, -19.2581)

println(new_york)
println(brisbane)
println(townsville)

# access fields
println("Latitude of new_york: ", new_york.lat)
Place("New York", -74.006, 40.7128)
Place("Brisbane", 153.0251, -27.4698)
Place("Townsville", 146.8169, -19.2581)
Latitude of new_york: 40.7128

We can also define constructors containing logic:

"""
A fancier place that wraps longitude automatically
"""
struct FancyPlace
  name::String
  lon::Float64
  lat::Float64

  # Inner constructor; defining it replaces the default constructor
  # (which is only provided automatically when no inner constructors are defined)
  function FancyPlace(name::String, lon::Float64, lat::Float64)
    # make sure longitude is in [-180,180)
    wrapped_lon = mod(lon + 180, 360) - 180
    # new is a special keyword used to create the actual struct instance
    # It takes the values for the fields in the order they are defined in
    # the struct, effectively calling the "primary" constructor
    new(name, wrapped_lon, lat)
  end

  # Custom constructor for an "unnamed" place
  FancyPlace(lon::Float64, lat::Float64) = FancyPlace("[unnamed]", lon, lat) # this calls the three-argument inner constructor above
end

# Now we can use the new constructor
unnamed_location = FancyPlace(1000.0, 20.0)
println("\nUnnamed location: ", unnamed_location)
println("Name of unnamed_location: ", unnamed_location.name)

Unnamed location: FancyPlace("[unnamed]", -80.0, 20.0)
Name of unnamed_location: [unnamed]

We can add additional “outer” constructors (defined outside the struct definition), but they cannot call new directly. For example, suppose you use a GIS package with its own coordinate type:

struct WGS84Coordinates{T}
  x::T
  y::T
end

function FancyPlace(name::String, coords::WGS84Coordinates)
    return FancyPlace(name, Float64(coords.x), Float64(coords.y))
end

zero_coords = WGS84Coordinates{Float32}(142.2, 11.35)
mariana_trench = FancyPlace("Mariana Trench", zero_coords)

@show mariana_trench
mariana_trench = Main.Notebook.FancyPlace("Mariana Trench", 142.1999969482422, 11.350000381469727)
FancyPlace("Mariana Trench", 142.1999969482422, 11.350000381469727)

The Parameters.jl package extends this functionality by automatically creating keyword-based constructors (with default field values) for structs, beyond the default constructors.

using Parameters

@with_kw struct MyStruct
    a::Int = 6
    b::Float64 = -1.1
    c::UInt8
end

MyStruct(c=4) # call the keyword-based constructor created by @with_kw
MyStruct
  a: Int64 6
  b: Float64 -1.1
  c: UInt8 0x04

Another useful macro-based extension of the language is the Accessors.jl package. It makes it easy to “update” a field of an immutable struct by creating a modified copy, without having to spell out all the other field values:

using Accessors

a = MyStruct(a=10, c=4)
@show a

b = @set a.c = 0
@show b;

# but observe a is still untouched
@show a
a = Main.Notebook.MyStruct
  a: Int64 10
  b: Float64 -1.1
  c: UInt8 0x04

b = Main.Notebook.MyStruct
  a: Int64 10
  b: Float64 -1.1
  c: UInt8 0x00

a = Main.Notebook.MyStruct
  a: Int64 10
  b: Float64 -1.1
  c: UInt8 0x04
MyStruct
  a: Int64 10
  b: Float64 -1.1
  c: UInt8 0x04

1.5 Data structures (not in the standard library)

The JuliaCollections organization provides other data structures. One useful package is DataStructures.jl. As an example, let's use a heap for heap sort (note that this is only for illustrative purposes; the built-in sort will be more efficient).

using Random, DataStructures
Random.seed!(0)

function heap_sort!(a::AbstractArray)
    h = BinaryMinHeap{eltype(a)}()
    for e in a
        push!(h, e) #This is an O(log n) operation
    end

    #Write back onto the original array
    for i in 1:length(a)
        a[i] = pop!(h) #This is an O(log n) operation
    end
    return a
end

data = [65, 51, 32, 12, 23, 84, 68, 1]
heap_sort!(data)
@show data
@show heap_sort!(["Finland", "USA", "Australia", "Brazil"]);
data = [1, 12, 23, 32, 51, 65, 68, 84]
heap_sort!(["Finland", "USA", "Australia", "Brazil"]) = ["Australia", "Brazil", "Finland", "USA"]

Again, note that this is considerably slower than the standard library sort:

using BenchmarkTools

numbers = rand(10_000);
@benchmark sort!(numbers)
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (minmax):  8.730 μs 16.809 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.196 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.227 μs ± 412.276 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▂▇▇▆▇▃▁▂  ▁  ▁                                    ▂   ▂
  ▄▃▁▃▅████████▇▅██▇██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▁▁▃▁▁▃▅▅▅███ █
  8.73 μs      Histogram: log(frequency) by time      11.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark heap_sort!(numbers)
BenchmarkTools.Trial: 9555 samples with 1 evaluation per sample.
 Range (minmax):  494.762 μs 3.795 ms   GC (min … max): 0.00% … 85.86%
 Time  (median):     507.543 μs               GC (median):    0.00%
 Time  (mean ± σ):   519.436 μs ± 83.512 μs   GC (mean ± σ):  1.64% ±  6.07%

  ▇█▂   ▁▃▃                                               ▁ ▂
  ███▇▆▅███▇▅▁▃▁▁▁▃▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆██ █
  495 μs        Histogram: log(frequency) by time       917 μs <

 Memory estimate: 326.45 KiB, allocs estimate: 14.

2 Basic text processing

See strings in the Julia docs. Let's see some examples:

x = 2
"The value of x is $x"
"The value of x is 2"
split("Hello world!")
2-element Vector{SubString{String}}:
 "Hello"
 "world!"
# triple-quoted blocks strip the common leading whitespace, which plays nicely with indentation
my_life_story = """
    I was born
       in 1935.
    """

println(my_life_story)
I was born
   in 1935.
ismutable(String)
true
println("A rough ASCII table")
println("Decimal\tHex\tCharacter")
for c in 0x20:0x7E
    println(c,"\t","0x" * string(c,base=16),"\t",Char(c))
end
A rough ASCII table
Decimal Hex Character
32  0x20     
33  0x21    !
34  0x22    "
35  0x23    #
36  0x24    $
37  0x25    %
38  0x26    &
39  0x27    '
40  0x28    (
41  0x29    )
42  0x2a    *
43  0x2b    +
44  0x2c    ,
45  0x2d    -
46  0x2e    .
47  0x2f    /
48  0x30    0
49  0x31    1
50  0x32    2
51  0x33    3
52  0x34    4
53  0x35    5
54  0x36    6
55  0x37    7
56  0x38    8
57  0x39    9
58  0x3a    :
59  0x3b    ;
60  0x3c    <
61  0x3d    =
62  0x3e    >
63  0x3f    ?
64  0x40    @
65  0x41    A
66  0x42    B
67  0x43    C
68  0x44    D
69  0x45    E
70  0x46    F
71  0x47    G
72  0x48    H
73  0x49    I
74  0x4a    J
75  0x4b    K
76  0x4c    L
77  0x4d    M
78  0x4e    N
79  0x4f    O
80  0x50    P
81  0x51    Q
82  0x52    R
83  0x53    S
84  0x54    T
85  0x55    U
86  0x56    V
87  0x57    W
88  0x58    X
89  0x59    Y
90  0x5a    Z
91  0x5b    [
92  0x5c    \
93  0x5d    ]
94  0x5e    ^
95  0x5f    _
96  0x60    `
97  0x61    a
98  0x62    b
99  0x63    c
100 0x64    d
101 0x65    e
102 0x66    f
103 0x67    g
104 0x68    h
105 0x69    i
106 0x6a    j
107 0x6b    k
108 0x6c    l
109 0x6d    m
110 0x6e    n
111 0x6f    o
112 0x70    p
113 0x71    q
114 0x72    r
115 0x73    s
116 0x74    t
117 0x75    u
118 0x76    v
119 0x77    w
120 0x78    x
121 0x79    y
122 0x7a    z
123 0x7b    {
124 0x7c    |
125 0x7d    }
126 0x7e    ~

2.1 Regular Expressions

Julia has built-in regex!

text = "Julia is fun!"
pattern = r"Julia"
occursin(pattern, text)   # true
true
text = "Call me at 0468879289 when I'm home, or 0468879555 if I'm at work"
for m in eachmatch(r"04\d{8}", text)
  println("Found phone number $(m.match)")
end
Found phone number 0468879289
Found phone number 0468879555

2.2 Reading and writing files

The open function is your primary tool, often used with do blocks to ensure files are automatically closed.

To write text to a file:

open("work/my_output.txt", "w") do io
    write(io, "Hello from Julia!\n")
    write(io, "This is a second line.")
end
22

Here, "w" signifies “write mode.” If the file doesn’t exist, it’s created; if it does, its contents are overwritten.

To append text to an existing file:

open("work/my_output.txt", "a") do io
    write(io, "\nAppending a new line.")
end
22

The "a" mode means “append.” New stuff is added to the end of the file.

To read the entire content of a file:

file_content = read("work/my_output.txt", String)
println(file_content)
Hello from Julia!
This is a second line.
Appending a new line.

The read function with String as the type argument reads the whole file into a single string.

For reading a file line by line, which is more memory-efficient for large files:

open("work/my_output.txt", "r") do io
    for line in eachline(io)
        println("Line: ", line)
    end
end
Line: Hello from Julia!
Line: This is a second line.
Line: Appending a new line.

2.3 Some extras

The Printf package is built-in and provides formatted output functions similar to the C standard library.
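
For example, a small sketch using the @printf and @sprintf macros:

using Printf

@printf("Pi to 3 decimal places: %.3f\n", π)
line = @sprintf("%-10s %8.2f", "total:", 1234.5678)   # @sprintf returns the formatted string
println(line)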

Strings are closely related to I/O. See the I/O and Network docs. Something quite common is to use flush(stdout) to force buffered output to be written.

Sometimes when writing test code we want to check that strings are approximately equal. For this, the StringDistances.jl package is useful.

Consider the YAML.jl package for YAML files.

3 Dataframes

DataFrames are a huge subject. The Julia DataFrames.jl package provides functionality similar to Python's pandas or R's data frames.

Let’s get started

using DataFrames

3.1 Constructing DataFrames

3.1.1 From Column-Value Pairs

The most common way to create a DataFrame is by providing column names (as symbols) and their corresponding vectors of data.

# Create a DataFrame with two columns 'a' and 'b'
df = DataFrame(a = [1, 2, 3], b = [2.0, 4.0, 6.0])
3×2 DataFrame
 Row │ a      b
     │ Int64  Float64
─────┼────────────────
   1 │     1      2.0
   2 │     2      4.0
   3 │     3      6.0

Notice that Julia infers the data types for each column. Here, a is Int64 and b is Float64.

We can also create DataFrames using Pairs:

DataFrame(:c => ["apple", "banana", "cherry"], :d => [true, false, true])
3×2 DataFrame
 Row │ c       d
     │ String  Bool
─────┼───────────────
   1 │ apple    true
   2 │ banana  false
   3 │ cherry   true

3.1.2 From Dictionaries

You can also construct a DataFrame from a dictionary where keys are column names (symbols or strings) and values are vectors.

DataFrame(Dict(
    :name => ["Aapeli", "Yoni", "Jesse"],
    :age => [25, 30, 35],
    :city => ["New York", "Brisbane", "Berlin"]
))
3×3 DataFrame
 Row │ age    city      name
     │ Int64  String    String
─────┼─────────────────────────
   1 │    25  New York  Aapeli
   2 │    30  Brisbane  Yoni
   3 │    35  Berlin    Jesse

3.1.3 From NamedTuples

Creating a DataFrame from a vector of NamedTuples is very flexible.

DataFrame([
    (id = 1, value = 10.5, tag = "A"),
    (id = 2, value = 20.1, tag = "B"),
    (id = 3, value = 15.0, tag = "C")
])
3×3 DataFrame
 Row │ id     value    tag
     │ Int64  Float64  String
─────┼────────────────────────
   1 │     1     10.5  A
   2 │     2     20.1  B
   3 │     3     15.0  C

If the NamedTuples have different fields or different orders, we can use Tables.dictcolumntable to fill missing values with missing.

DataFrame(Tables.dictcolumntable([
    (id = 1, name = "Julia"),
    (id = 2, score = 95.5),
    (id = 3, name = "DataFrame", type = "Table")
]))
3×4 DataFrame
 Row │ id     name       score     type
     │ Int64  String?    Float64?  String?
─────┼──────────────────────────────────────
   1 │     1  Julia       missing  missing
   2 │     2  missing        95.5  missing
   3 │     3  DataFrame   missing  Table

Notice the ? after the types, indicating that these columns now allow missing values.

3.2 Column Names and Basic Information

In DataFrames.jl, columns are primarily accessed using Symbols.

df = DataFrame(a = [1, 2, 3], b = [2.0, 4.0, 6.0], c = ["x", "y", "z"])

df[:, :a]
3-element Vector{Int64}:
 1
 2
 3

You can get the column names:

names(df)
3-element Vector{String}:
 "a"
 "b"
 "c"

And column types:

eltype.(eachcol(df))
3-element Vector{DataType}:
 Int64
 Float64
 String

3.2.1 Size and Dimensions

To get the dimensions of a DataFrame, similar to matrices:

size(df) # (rows, columns)
(3, 3)

You can also specify the dimension:

@show size(df, 1) # Number of rows
@show size(df, 2) # Number of columns
size(df, 1) = 3
size(df, 2) = 3
3

3.2.2 Column-based Storage and Iterators

DataFrames.jl stores data in a column-oriented fashion. This means each column is essentially a Vector.

You can retrieve a column using dot syntax or indexing:

df.a # Access column 'a' using dot syntax
df[!, :b] # Access column 'b' using ! (returns a view, i.e., no copy)
df[:, :c] # Access column 'c' using :, which makes a copy
3-element Vector{String}:
 "x"
 "y"
 "z"

The difference between . and ! versus : for column retrieval is crucial for performance and understanding data manipulation.

df.a === df[!, :a] # They refer to the same underlying data
true
df.a === df[:, :a] # The : operator creates a copy, so they are not the same object
false

When you need to iterate through rows, you can use eachrow(df):

for row in eachrow(df)
    println("Row: $(row.a), $(row.b), $(row.c)")
end
Row: 1, 2.0, x
Row: 2, 4.0, y
Row: 3, 6.0, z

Each row here is a DataFrameRow object, which behaves like a NamedTuple for row-wise access.
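
For example, a quick sketch of row-wise access on the df defined above:

row = first(eachrow(df))   # a DataFrameRow
row.a                      # access fields with dot syntax, like a NamedTuple
row[:c]                    # or index by column name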

3.3 Indexing and Slicing

DataFrames can be indexed similar to matrices, but with the added flexibility of column names.

3.3.1 Positional Indexing

df[1, 1] # First row, first column
df[2, :b] # Second row, column 'b'
df[1, :] # First row (returns a DataFrameRow)
df[:, 1] # First column (returns a copied Vector)
3-element Vector{Int64}:
 1
 2
 3

3.3.2 Column Selection

You can select multiple columns by passing a vector of column names (symbols or strings):

df[:, [:a, :c]] # Select columns 'a' and 'c' (creates a new DataFrame)
3×2 DataFrame
 Row │ a      c
     │ Int64  String
─────┼───────────────
   1 │     1  x
   2 │     2  y
   3 │     3  z

Or exclude columns using Not:

df[:, Not(:b)] # Select all columns except 'b'
3×2 DataFrame
 Row │ a      c
     │ Int64  String
─────┼───────────────
   1 │     1  x
   2 │     2  y
   3 │     3  z

You can combine Not with a vector of columns:

df[:, Not([:a])] # Select all columns except 'a'
3×2 DataFrame
 Row │ b        c
     │ Float64  String
─────┼─────────────────
   1 │     2.0  x
   2 │     4.0  y
   3 │     6.0  z

3.3.3 Views vs. Copies

Recall the distinction between ! and : for column access. This also applies to row and full DataFrame indexing.

  • df[!, :colname] returns a view of the column (no copy).
  • df[:, :colname] returns a copy of the column.
  • df[!, [col1, col2]] returns the selected columns without copying them (a new DataFrame whose columns alias the parent's columns).
  • df[:, [col1, col2]] returns a copy of the selected columns (a new DataFrame).
  • @view df[row_indices, col_indices] (or view(df, row_indices, col_indices)) returns a SubDataFrame (view).
  • df[row_indices, col_indices] returns a new DataFrame (copy).

Using views (!) is more memory-efficient when you don’t need a separate copy of the data and want changes to the view to reflect in the original DataFrame. However, views require translating between the parent df indices and the view indices, which might in theory cause performance issues in edge cases.
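
Here is a small sketch (using the df from above) showing that mutating the un-copied column changes the parent DataFrame, while mutating a copy does not:

col_nocopy = df[!, :a]   # the actual column vector, no copy
col_copy   = df[:, :a]   # an independent copy

col_nocopy[1] = -1       # this changes df itself
col_copy[2]   = -2       # this does not affect df

df.a                     # the first entry is now -1; the second is unchanged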

3.4 Getting, Setting, and Mutating Data

You can retrieve, set, and modify individual cells, rows, or columns.

3.4.1 Setting Individual Values

df[1, :a] = 100 # Set value at row 1, column 'a'
100

3.4.2 Setting Entire Columns

df.b = [10.0, 20.0, 30.0] # Replace column 'b'
3-element Vector{Float64}:
 10.0
 20.0
 30.0

If the new column has a different type, it will be converted if possible, or an error will occur. If a column doesn’t exist, it will be added.

df.d = ["alpha", "beta", "gamma"] # Add a new column 'd'
3-element Vector{String}:
 "alpha"
 "beta"
 "gamma"

3.4.3 Broadcasting Assignment

Broadcasting (.=) is extremely powerful for performing element-wise operations and assignments efficiently.

df.a .= 0 # Set all values in column 'a' to 0
3-element Vector{Int64}:
 0
 0
 0

You can also use it with a scalar or a vector of compatible size:

df.b .= df.b * 2 # Double all values in column 'b'
3-element Vector{Float64}:
 20.0
 40.0
 60.0

Or apply a function:

df.c .= uppercase.(df.c) # Convert all strings in column 'c' to uppercase
3-element Vector{String}:
 "X"
 "Y"
 "Z"

Broadcasting assignment works with sub-selections as well:

df[1:2, :a] .= 99 # Set the first two values of column 'a' to 99
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
 99
 99

4 Story: Working with real data

We’ll now look at a more in-depth, hands-on exercise of using DataFrames.

The Queensland government has an open data portal, and makes available tide predictions at various locations on the state’s coast. (There’s some other interesting data as well at https://www.qld.gov.au/tides).

Let’s use this to do some exploration. We’ll first download it with the HTTP.jl package and write it to work/tides.csv:

using HTTP

response = HTTP.get("https://www.data.qld.gov.au/datastore/dump/1311fc19-1e60-444f-b5cf-24687f1c15a7?bom=True")
write("work/tides.csv", response.body)
1603979

Let’s explore the first few lines

open("work/tides.csv") do io
    for i  1:5
        line = readline(io)
        println(line)
    end
end
_id,Site,Seconds,DateTime,Water Level,Prediction,Residual,Latitude,Longitude
1,abellpoint,1750082400,2025-06-17T00:00,2.713,2.535,0.178,-20.2608,148.7103
2,abellpoint,1750083000,2025-06-17T00:10,2.765,2.605,0.160,-20.2608,148.7103
3,abellpoint,1750083600,2025-06-17T00:20,2.838,2.670,0.168,-20.2608,148.7103
4,abellpoint,1750084200,2025-06-17T00:30,2.898,2.731,0.167,-20.2608,148.7103

We can read it into a dataframe with CSV.read, and show the first few lines with first

using CSV

df = CSV.read("work/tides.csv", DataFrame)
first(df, 5)
5×9 DataFrame
Row _id Site Seconds DateTime Water Level Prediction Residual Latitude Longitude
Int64 String15 Int64 DateTime Float64 Float64 Float64 Float64 Float64
1 1 abellpoint 1750082400 2025-06-17T00:00:00 2.713 2.535 0.178 -20.2608 148.71
2 2 abellpoint 1750083000 2025-06-17T00:10:00 2.765 2.605 0.16 -20.2608 148.71
3 3 abellpoint 1750083600 2025-06-17T00:20:00 2.838 2.67 0.168 -20.2608 148.71
4 4 abellpoint 1750084200 2025-06-17T00:30:00 2.898 2.731 0.167 -20.2608 148.71
5 5 abellpoint 1750084800 2025-06-17T00:40:00 2.934 2.786 0.148 -20.2608 148.71

Note the inferred datatypes, including the automatically converted DateTime. We can customize this

# we could also do
df32 = CSV.read("work/tides.csv", DataFrame; types=Dict("Water Level" => Float32, "Prediction" => Float32, "Residual" => Float32, "Latitude" => Float32, "Longitude" => Float32));
println("With Float32s, we saved $(round((1-Base.summarysize(df32)/Base.summarysize(df))*100; digits=2))% memory")
With Float32s, we saved 29.64% memory

(This is silly, don’t do it in practice.)

Let’s also look at the last rows:

last(df, 3)
3×9 DataFrame
Row _id Site Seconds DateTime Water Level Prediction Residual Latitude Longitude
Int64 String15 Int64 DateTime Float64 Float64 Float64 Float64 Float64
1 19420 whyteislandnx 1750728000 2025-06-24T11:20:00 1.154 1.022 0.132 -27.4017 153.157
2 19421 whyteislandnx 1750728600 2025-06-24T11:30:00 1.091 0.964 0.127 -27.4017 153.157
3 19422 whyteislandnx 1750729200 2025-06-24T11:40:00 -99.0 0.907 -99.0 -27.4017 153.157

Here “-99.0” seems to mean missing. Let’s see where it comes from in the CSV:

open("work/tides.csv") do io
    while true
        line = readline(io)
        if contains(line, "-99")
            println(line)
            break
        end
    end
end
1073,abellpoint,1750725600,2025-06-24T10:40,-99.000,2.131,-99.000,-20.2608,148.7103

We can tell CSV.read to treat “-99.000” values as missing:

df = CSV.read("work/tides.csv", DataFrame; missingstring=["-99.000"])
last(df, 3)
3×9 DataFrame
Row _id Site Seconds DateTime Water Level Prediction Residual Latitude Longitude
Int64 String15 Int64 DateTime Float64? Float64 Float64? Float64 Float64
1 19420 whyteislandnx 1750728000 2025-06-24T11:20:00 1.154 1.022 0.132 -27.4017 153.157
2 19421 whyteislandnx 1750728600 2025-06-24T11:30:00 1.091 0.964 0.127 -27.4017 153.157
3 19422 whyteislandnx 1750729200 2025-06-24T11:40:00 missing 0.907 missing -27.4017 153.157

Note the “?” in water level/residual: this is DataFrames notation for columns which contain missing data.
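
When the missing values get in the way, a few standard tools help; here is a small sketch (each of these returns a new object and leaves df unchanged):

using Statistics

# skipmissing lets reductions ignore the missing entries
mean(skipmissing(df[:, Symbol("Water Level")]))

# dropmissing returns a copy without the rows that have missing in the given column(s)
dropmissing(df, Symbol("Water Level"))

# coalesce replaces missing entries with a fallback value
coalesce.(df[:, Symbol("Water Level")], 0.0);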

Referring to Water Level is a bit annoying now:

df[:, Symbol("Water Level")]
19422-element Vector{Union{Missing, Float64}}:
 2.713
 2.765
 2.838
 2.898
 2.934
 2.986
 3.029
 3.078
 3.103
 3.175
 ⋮
 1.519
 1.462
 1.397
 1.335
 1.269
 1.212
 1.154
 1.091
  missing

Let’s rename it, and let’s rename DateTime too to avoid confusion:

# ! means in-place
rename!(df, Symbol("Water Level") => :WaterLevel, Symbol("DateTime") => :Time)
first(df, 5)
5×9 DataFrame
Row _id Site Seconds Time WaterLevel Prediction Residual Latitude Longitude
Int64 String15 Int64 DateTime Float64? Float64 Float64? Float64 Float64
1 1 abellpoint 1750082400 2025-06-17T00:00:00 2.713 2.535 0.178 -20.2608 148.71
2 2 abellpoint 1750083000 2025-06-17T00:10:00 2.765 2.605 0.16 -20.2608 148.71
3 3 abellpoint 1750083600 2025-06-17T00:20:00 2.838 2.67 0.168 -20.2608 148.71
4 4 abellpoint 1750084200 2025-06-17T00:30:00 2.898 2.731 0.167 -20.2608 148.71
5 5 abellpoint 1750084800 2025-06-17T00:40:00 2.934 2.786 0.148 -20.2608 148.71

Drop some redundant columns

select!(df, [:Site, :Latitude, :Longitude, :Time, :WaterLevel, :Prediction])
first(df, 5)
5×6 DataFrame
Row Site Latitude Longitude Time WaterLevel Prediction
String15 Float64 Float64 DateTime Float64? Float64
1 abellpoint -20.2608 148.71 2025-06-17T00:00:00 2.713 2.535
2 abellpoint -20.2608 148.71 2025-06-17T00:10:00 2.765 2.605
3 abellpoint -20.2608 148.71 2025-06-17T00:20:00 2.838 2.67
4 abellpoint -20.2608 148.71 2025-06-17T00:30:00 2.898 2.731
5 abellpoint -20.2608 148.71 2025-06-17T00:40:00 2.934 2.786

Here is our list of columns:

names(df)
6-element Vector{String}:
 "Site"
 "Latitude"
 "Longitude"
 "Time"
 "WaterLevel"
 "Prediction"

Or by piping

df |> names
6-element Vector{String}:
 "Site"
 "Latitude"
 "Longitude"
 "Time"
 "WaterLevel"
 "Prediction"

4.1 Getting to know our data

Let’s dive a bit deeper, what do we have?

describe(df)
6×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Union… Any Any Any Int64 Type
1 Site abellpoint whyteislandnx 0 String15
2 Latitude -26.2292 -28.1721 -27.4382 -19.1266 0 Float64
3 Longitude 152.434 146.91 153.249 153.558 0 Float64
4 Time 2025-06-17T00:00:00 2025-06-20T17:50:00 2025-06-24T11:40:00 0 DateTime
5 WaterLevel 1.36547 -0.233 1.153 5.774 7743 Union{Missing, Float64}
6 Prediction 1.18008 -0.179 1.007 5.537 0 Float64

What are the site names?

unique(df.Site)
18-element Vector{String15}:
 "abellpoint"
 "bananabank"
 "birkdale"
 "coombabahst"
 "hallsbay"
 "husseycreek"
 "maroochydore"
 "rabybay"
 "russellislande"
 "russellislandw"
 "seaforth"
 "tangalooma"
 "theskids"
 "townsvillecard"
 "tweedsbj"
 "wavebreaknc"
 "wavebreakwc"
 "whyteislandnx"

A note on String15: CSV.jl reads short strings as fixed-width string types from InlineStrings.jl (such as String15), and columns with few unique values are pooled into a PooledArrays.PooledVector, which saves memory:

df.Site
19422-element PooledArrays.PooledVector{String15, UInt32, Vector{UInt32}}:
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 "abellpoint"
 ⋮
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"
 "whyteislandnx"

One way is to compute the squared error in prediction with transform, as sketched below.
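
A sketch with the non-mutating transform (using the already-renamed :WaterLevel and :Prediction columns); later on we will add this column to df with a direct assignment instead:

transform(df, [:WaterLevel, :Prediction] => ByRow((w, p) -> (w - p)^2) => :SqResidual)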

Let’s group by site

# groupby takes a dataframe and a list of columns to group by
by_site = groupby(df, :Site)

GroupedDataFrame with 18 groups based on key: Site

First Group (1079 rows): Site = "abellpoint"
1054 rows omitted
Row Site Latitude Longitude Time WaterLevel Prediction
String15 Float64 Float64 DateTime Float64? Float64
1 abellpoint -20.2608 148.71 2025-06-17T00:00:00 2.713 2.535
2 abellpoint -20.2608 148.71 2025-06-17T00:10:00 2.765 2.605
3 abellpoint -20.2608 148.71 2025-06-17T00:20:00 2.838 2.67
4 abellpoint -20.2608 148.71 2025-06-17T00:30:00 2.898 2.731
5 abellpoint -20.2608 148.71 2025-06-17T00:40:00 2.934 2.786
6 abellpoint -20.2608 148.71 2025-06-17T00:50:00 2.986 2.839
7 abellpoint -20.2608 148.71 2025-06-17T01:00:00 3.029 2.887
8 abellpoint -20.2608 148.71 2025-06-17T01:10:00 3.078 2.93
9 abellpoint -20.2608 148.71 2025-06-17T01:20:00 3.103 2.971
10 abellpoint -20.2608 148.71 2025-06-17T01:30:00 3.175 3.007
11 abellpoint -20.2608 148.71 2025-06-17T01:40:00 3.204 3.037
12 abellpoint -20.2608 148.71 2025-06-17T01:50:00 3.202 3.063
13 abellpoint -20.2608 148.71 2025-06-17T02:00:00 3.224 3.084
1068 abellpoint -20.2608 148.71 2025-06-24T09:50:00 2.577 2.379
1069 abellpoint -20.2608 148.71 2025-06-24T10:00:00 2.587 2.347
1070 abellpoint -20.2608 148.71 2025-06-24T10:10:00 2.513 2.306
1071 abellpoint -20.2608 148.71 2025-06-24T10:20:00 2.492 2.255
1072 abellpoint -20.2608 148.71 2025-06-24T10:30:00 2.43 2.196
1073 abellpoint -20.2608 148.71 2025-06-24T10:40:00 missing 2.131
1074 abellpoint -20.2608 148.71 2025-06-24T10:50:00 missing 2.057
1075 abellpoint -20.2608 148.71 2025-06-24T11:00:00 missing 1.979
1076 abellpoint -20.2608 148.71 2025-06-24T11:10:00 missing 1.897
1077 abellpoint -20.2608 148.71 2025-06-24T11:20:00 missing 1.811
1078 abellpoint -20.2608 148.71 2025-06-24T11:30:00 missing 1.721
1079 abellpoint -20.2608 148.71 2025-06-24T11:40:00 missing 1.631

Last Group (1079 rows): Site = "whyteislandnx"
1054 rows omitted
Row Site Latitude Longitude Time WaterLevel Prediction
String15 Float64 Float64 DateTime Float64? Float64
1 whyteislandnx -27.4017 153.157 2025-06-17T00:00:00 2.373 2.215
2 whyteislandnx -27.4017 153.157 2025-06-17T00:10:00 2.416 2.251
3 whyteislandnx -27.4017 153.157 2025-06-17T00:20:00 2.437 2.282
4 whyteislandnx -27.4017 153.157 2025-06-17T00:30:00 2.471 2.309
5 whyteislandnx -27.4017 153.157 2025-06-17T00:40:00 2.491 2.331
6 whyteislandnx -27.4017 153.157 2025-06-17T00:50:00 2.51 2.347
7 whyteislandnx -27.4017 153.157 2025-06-17T01:00:00 2.509 2.357
8 whyteislandnx -27.4017 153.157 2025-06-17T01:10:00 2.516 2.362
9 whyteislandnx -27.4017 153.157 2025-06-17T01:20:00 2.497 2.361
10 whyteislandnx -27.4017 153.157 2025-06-17T01:30:00 2.486 2.353
11 whyteislandnx -27.4017 153.157 2025-06-17T01:40:00 2.459 2.339
12 whyteislandnx -27.4017 153.157 2025-06-17T01:50:00 2.428 2.318
13 whyteislandnx -27.4017 153.157 2025-06-17T02:00:00 2.4 2.291
1068 whyteislandnx -27.4017 153.157 2025-06-24T09:50:00 1.691 1.541
1069 whyteislandnx -27.4017 153.157 2025-06-24T10:00:00 1.638 1.487
1070 whyteislandnx -27.4017 153.157 2025-06-24T10:10:00 1.578 1.431
1071 whyteislandnx -27.4017 153.157 2025-06-24T10:20:00 1.519 1.375
1072 whyteislandnx -27.4017 153.157 2025-06-24T10:30:00 1.462 1.316
1073 whyteislandnx -27.4017 153.157 2025-06-24T10:40:00 1.397 1.256
1074 whyteislandnx -27.4017 153.157 2025-06-24T10:50:00 1.335 1.198
1075 whyteislandnx -27.4017 153.157 2025-06-24T11:00:00 1.269 1.138
1076 whyteislandnx -27.4017 153.157 2025-06-24T11:10:00 1.212 1.079
1077 whyteislandnx -27.4017 153.157 2025-06-24T11:20:00 1.154 1.022
1078 whyteislandnx -27.4017 153.157 2025-06-24T11:30:00 1.091 0.964
1079 whyteislandnx -27.4017 153.157 2025-06-24T11:40:00 missing 0.907

This produces a grouped dataframe

typeof(by_site)
GroupedDataFrame{DataFrame}

What’s the mean water level per site?

# get the mean function
using Statistics

# enter ∘ with \circ TAB
# combine takes the grouped df and a list of operations
combine(by_site, :WaterLevel => mean  skipmissing => :MeanWaterLevel)
18×2 DataFrame
Row Site MeanWaterLevel
String15 Float64
1 abellpoint 2.03237
2 bananabank 1.64445
3 birkdale 1.43709
4 coombabahst 0.275567
5 hallsbay 0.725246
6 husseycreek NaN
7 maroochydore 0.77414
8 rabybay 1.61293
9 russellislande 0.723282
10 russellislandw NaN
11 seaforth 3.01772
12 tangalooma NaN
13 theskids NaN
14 townsvillecard NaN
15 tweedsbj 1.19848
16 wavebreaknc NaN
17 wavebreakwc NaN
18 whyteislandnx 1.4865

Here we applied mean(skipmissing(...)) to the :WaterLevel column.

Let’s plot the water level at some sites

using Plots

my_sites = ["coombabahst", "russellislande", "rabybay"]

p = plot(
    xlabel="Time",
    ylabel="Water Level",
    title="Water Level Over Time for Selected Sites",
    legend=:topleft
)

for group in by_site
    site_name = group.Site[1] # Get the site name from the first row of the group
    if site_name  my_sites
        continue
    end
    plot!(
        p,
        group.Time,
        group.WaterLevel,
        # Label for the legend
        label=site_name,
        linealpha=0.8,
        linewidth=2
    )
end

p

How many data points do we have per site?

combine(by_site, nrow => :Count)
18×2 DataFrame
Row Site Count
String15 Int64
1 abellpoint 1079
2 bananabank 1079
3 birkdale 1079
4 coombabahst 1079
5 hallsbay 1079
6 husseycreek 1079
7 maroochydore 1079
8 rabybay 1079
9 russellislande 1079
10 russellislandw 1079
11 seaforth 1079
12 tangalooma 1079
13 theskids 1079
14 townsvillecard 1079
15 tweedsbj 1079
16 wavebreaknc 1079
17 wavebreakwc 1079
18 whyteislandnx 1079

Let’s compute the squared residual:

df[!, :SqResidual] = (df.WaterLevel - df.Prediction).^2
19422-element Vector{Union{Missing, Float64}}:
 0.031683999999999976
 0.025600000000000046
 0.02822400000000005
 0.027889000000000087
 0.021904000000000038
 0.021609000000000073
 0.020163999999999974
 0.021903999999999906
 0.01742400000000003
 0.028223999999999902
 ⋮
 0.020735999999999973
 0.021315999999999974
 0.019881000000000003
 0.018769000000000004
 0.017161000000000003
 0.017689000000000003
 0.01742399999999997
 0.016129
  missing

There were some sites with fully missing water levels

all_missing = combine(groupby(df, :Site), :WaterLevel => (x -> all(ismissing, x)) => :IsMissing)
18×2 DataFrame
Row Site IsMissing
String15 Bool
1 abellpoint false
2 bananabank false
3 birkdale false
4 coombabahst false
5 hallsbay false
6 husseycreek true
7 maroochydore false
8 rabybay false
9 russellislande false
10 russellislandw true
11 seaforth false
12 tangalooma true
13 theskids true
14 townsvillecard true
15 tweedsbj false
16 wavebreaknc true
17 wavebreakwc true
18 whyteislandnx false
filter!(row -> row.IsMissing == false, all_missing)
11×2 DataFrame
Row Site IsMissing
String15 Bool
1 abellpoint false
2 bananabank false
3 birkdale false
4 coombabahst false
5 hallsbay false
6 maroochydore false
7 rabybay false
8 russellislande false
9 seaforth false
10 tweedsbj false
11 whyteislandnx false
select!(all_missing, Not(:IsMissing))
11×1 DataFrame
Row Site
String15
1 abellpoint
2 bananabank
3 birkdale
4 coombabahst
5 hallsbay
6 maroochydore
7 rabybay
8 russellislande
9 seaforth
10 tweedsbj
11 whyteislandnx
df_clean = innerjoin(df, all_missing, on=:Site)
11869×7 DataFrame
11844 rows omitted
Row Site Latitude Longitude Time WaterLevel Prediction SqResidual
String15 Float64 Float64 DateTime Float64? Float64 Float64?
1 abellpoint -20.2608 148.71 2025-06-17T00:00:00 2.713 2.535 0.031684
2 abellpoint -20.2608 148.71 2025-06-17T00:10:00 2.765 2.605 0.0256
3 abellpoint -20.2608 148.71 2025-06-17T00:20:00 2.838 2.67 0.028224
4 abellpoint -20.2608 148.71 2025-06-17T00:30:00 2.898 2.731 0.027889
5 abellpoint -20.2608 148.71 2025-06-17T00:40:00 2.934 2.786 0.021904
6 abellpoint -20.2608 148.71 2025-06-17T00:50:00 2.986 2.839 0.021609
7 abellpoint -20.2608 148.71 2025-06-17T01:00:00 3.029 2.887 0.020164
8 abellpoint -20.2608 148.71 2025-06-17T01:10:00 3.078 2.93 0.021904
9 abellpoint -20.2608 148.71 2025-06-17T01:20:00 3.103 2.971 0.017424
10 abellpoint -20.2608 148.71 2025-06-17T01:30:00 3.175 3.007 0.028224
11 abellpoint -20.2608 148.71 2025-06-17T01:40:00 3.204 3.037 0.027889
12 abellpoint -20.2608 148.71 2025-06-17T01:50:00 3.202 3.063 0.019321
13 abellpoint -20.2608 148.71 2025-06-17T02:00:00 3.224 3.084 0.0196
11858 whyteislandnx -27.4017 153.157 2025-06-24T09:50:00 1.691 1.541 0.0225
11859 whyteislandnx -27.4017 153.157 2025-06-24T10:00:00 1.638 1.487 0.022801
11860 whyteislandnx -27.4017 153.157 2025-06-24T10:10:00 1.578 1.431 0.021609
11861 whyteislandnx -27.4017 153.157 2025-06-24T10:20:00 1.519 1.375 0.020736
11862 whyteislandnx -27.4017 153.157 2025-06-24T10:30:00 1.462 1.316 0.021316
11863 whyteislandnx -27.4017 153.157 2025-06-24T10:40:00 1.397 1.256 0.019881
11864 whyteislandnx -27.4017 153.157 2025-06-24T10:50:00 1.335 1.198 0.018769
11865 whyteislandnx -27.4017 153.157 2025-06-24T11:00:00 1.269 1.138 0.017161
11866 whyteislandnx -27.4017 153.157 2025-06-24T11:10:00 1.212 1.079 0.017689
11867 whyteislandnx -27.4017 153.157 2025-06-24T11:20:00 1.154 1.022 0.017424
11868 whyteislandnx -27.4017 153.157 2025-06-24T11:30:00 1.091 0.964 0.016129
11869 whyteislandnx -27.4017 153.157 2025-06-24T11:40:00 missing 0.907 missing

Let’s compute the 90th percentile of water level per site:

p90(x) = quantile(x, .9)
combine(groupby(df_clean, :Site), :WaterLevel => p90  skipmissing => :WaterLevelP90)
11×2 DataFrame
Row Site WaterLevelP90
String15 Float64
1 abellpoint 3.0888
2 bananabank 2.4426
3 birkdale 2.2089
4 coombabahst 0.6794
5 hallsbay 1.2018
6 maroochydore 1.1973
7 rabybay 2.3942
8 russellislande 1.18
9 seaforth 4.5453
10 tweedsbj 1.761
11 whyteislandnx 2.232

Let’s plot the mean square error in prediction per site

mse_by_site = combine(groupby(df_clean, :Site), :SqResidual => mean  skipmissing => :MSE)

plot(mse_by_site.Site, mse_by_site.MSE, seriestype=:bar, xrotation=45, title="MSE in water level prediction by site")

4.2 More Transformations

Here are key operations:

  • groupby – Split a DataFrame into groups by one or more columns.
  • combine – Apply functions to groups or columns and combine results in a new DataFrame.
  • transform – Create or modify columns (optionally in-place).
  • select – Select (and transform) columns, optionally creating new ones.

With DataFramesMeta.jl:

  • @subset – Filter rows based on row-wise conditions.
  • @select – Select or transform columns.
  • @transform – Add or modify columns by assignment.
  • @combine – Combine results of group operations into a DataFrame.
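
Here is a small sketch of what these macros look like (reusing df_clean and Statistics from above, and assuming DataFramesMeta.jl is installed; the assignment-style syntax is that of recent DataFramesMeta.jl versions):

using DataFramesMeta

# rows where the predicted water level exceeds 2 metres
@subset(df_clean, :Prediction .> 2.0)

# mean water level per site
@combine(groupby(df_clean, :Site), :MeanWaterLevel = mean(skipmissing(:WaterLevel)))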

For more, see the official DataFrames.jl documentation and the DataFramesMeta.jl documentation.

As there are already great resources for this on the web, let us go through these resources:

  1. A PumasAI tutorial
  2. A UQ course tutorial

4.3 More dataframes

Here are the common packages in this ecosystem:

  • DataFrames.jl - the main dataframes package.
  • DataFramesMeta.jl - metaprogramming tools for DataFrames.jl objects.
  • CSV.jl - read and write to CSV files.
  • CategoricalArrays.jl - provides tools for working with categorical variables, both with unordered (nominal variables) and ordered categories (ordinal variables), optionally with missing values.
  • Chain.jl - provides a useful macro that rewrites a series of expressions into a chain (see the sketch after this list).
  • XLSX.jl - Excel file reader/writer for the Julia language.
  • SummaryTables.jl - creating publication-ready tables in HTML, docx, LaTeX and Typst formats.
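
For instance, Chain.jl's @chain macro threads the result of each line into the next call as its first argument; a sketch reusing df_clean from above (assuming Chain.jl is installed):

using Chain, Statistics

@chain df_clean begin
    groupby(:Site)
    combine(:WaterLevel => mean ∘ skipmissing => :MeanWaterLevel)
    sort(:MeanWaterLevel)
end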

5 JSON

There are two competing JSON libraries: JSON.jl and JSON3.jl. Here is a JSON.jl example:

using HTTP
using JSON

response = HTTP.get("https://couchers.org/api/status")
data = JSON.parse(String(response.body))

println(data)
Dict{String, Any}("coucherCount" => "55920", "version" => "develop-24532dd9", "nonce" => "")
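
Going the other way, JSON.json serializes a Julia data structure to a JSON string; a small sketch (the file name here is just an example):

json_string = JSON.json(data)
println(json_string)

# write it to a (hypothetical) file
open("work/status.json", "w") do io
    write(io, json_string)
end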

6 Serialization

Julia provides serialization out of the box. Here is an example. The example is slightly more interesting because we also create a tree-like data structure.

using Random

Random.seed!(0)

struct Node
    id::UInt16
    friends::Vector{Node}

    # inner constructor (uses `new` to construct the instance)
    Node() = new(rand(UInt16), [])
    # another inner constructor
    Node(friend::Node) = new(rand(UInt16),[friend])
end
"""
Makes `n` children for `node`, each having `friend` as its single friend
"""
function make_children(node::Node, n::Int, friend::Node)
    for _ in 1:n
        new_node = Node(friend)
        push!(node.friends, new_node)
    end
end;
# make a tree
root = Node()
make_children(root, 3, root)
for node in root.friends
    make_children(node, 2,root)
end

root
Node(0x67db, Node[Node(0x118c, Node[Node(#= circular reference @-4 =#), Node(0xa95f, Node[Node(#= circular reference @-6 =#)]), Node(0x1dc7, Node[Node(#= circular reference @-6 =#)])]), Node(0xdcb5, Node[Node(#= circular reference @-4 =#), Node(0x1c00, Node[Node(#= circular reference @-6 =#)]), Node(0xb3b6, Node[Node(#= circular reference @-6 =#)])]), Node(0x1602, Node[Node(#= circular reference @-4 =#), Node(0x4a1d, Node[Node(#= circular reference @-6 =#)]), Node(0x074f, Node[Node(#= circular reference @-6 =#)])])])

Note that when we try to show root, it’s complete gibberish. We can write a Base.show() function to make this pretty:

# make it show up pretty
function Base.show(io::IO, x::Node)
    shown = Set{Node}()
    function recursive_show(y::Node, depth::Int)
        print(io, "  "^depth*"Node: $(y.id)")
        if y in shown
            println(io, " (already shown)")
        else
            push!(shown, y)
            println(io, ", friends:")
            for f in y.friends
                recursive_show(f, depth+1)
            end
        end
    end
    recursive_show(x, 0)
    return nothing
end

root
Node: 26587, friends:
  Node: 4492, friends:
    Node: 26587 (already shown)
    Node: 43359, friends:
      Node: 26587 (already shown)
    Node: 7623, friends:
      Node: 26587 (already shown)
  Node: 56501, friends:
    Node: 26587 (already shown)
    Node: 7168, friends:
      Node: 26587 (already shown)
    Node: 46006, friends:
      Node: 26587 (already shown)
  Node: 5634, friends:
    Node: 26587 (already shown)
    Node: 18973, friends:
      Node: 26587 (already shown)
    Node: 1871, friends:
      Node: 26587 (already shown)

Suppose we now want to save this in a file…

using Serialization
serialize("work/tree.dat", root)
newroot = deserialize("work/tree.dat")
Node: 26587, friends:
  Node: 4492, friends:
    Node: 26587 (already shown)
    Node: 43359, friends:
      Node: 26587 (already shown)
    Node: 7623, friends:
      Node: 26587 (already shown)
  Node: 56501, friends:
    Node: 26587 (already shown)
    Node: 7168, friends:
      Node: 26587 (already shown)
    Node: 46006, friends:
      Node: 26587 (already shown)
  Node: 5634, friends:
    Node: 26587 (already shown)
    Node: 18973, friends:
      Node: 26587 (already shown)
    Node: 1871, friends:
      Node: 26587 (already shown)

7 Additional online resources

8 Exercises

  1. You have this dictionary:
        country_capital = Dict(
                                "France" => "Paris",
                                "Germany" => "Berlin",
                                "Italy" => "Rome",
                                "Spain" => "Madrid")

Now create a new dictionary, capital_country where the keys are the capital cities and the values are the country names.

  2. Looking up membership with in (or the ∈ symbol) works both for an array and for a set. You can create an array with rand(1:10^10, 10^7), which will have \(10^7\) entries selected from the numbers \(1,\ldots,10^{10}\). You can also wrap this in Set(...) to create a set. Now compare lookup timings with @time or @btime (from BenchmarkTools.jl) to see whether a single rand(1:10^10) is an element of the array versus the set.
  3. Given the string text = "Julia is a high-level, high-performance programming language.", write Julia code to count how many times the substring “high” appears in the text (case-insensitive).
  4. Install the RDatasets.jl package. Then load the “iris” dataset. Then, filter the DataFrame to only include rows where the SepalLength is greater than its mean, and display the first five rows of the result.
  5. Load the “mtcars” dataset from RDatasets. Then, group the data by the Cyl (number of cylinders) column and compute the average MPG (miles per gallon) for each group. Display the resulting summary DataFrame.
  6. Consider this JSON file (put it in a string):
                {
                  "name": "Alice",
                  "age": 30,
                  "skills": ["Julia", "Python", "SQL"]
                }

Given the JSON string above, write Julia code to parse it and print the person’s name and the number of skills they have.

  7. Create an array of \(10^6\) random Float64 values (you can use rand(Float64, 10^6)). Then serialize it and inspect the file size. Check that the size makes sense given sizeof(Float64). Now do the same with Float16, Float32, UInt8, and another type of your choice.


Footnotes

  1. “Constant time” suffices in practice; there are some minutiae, and the worst case is \(O(n)\), which is bad. For theoretical applications, these operations can be implemented in \(O(\log n)\) worst-case time with self-balancing trees, but practical implementations rely on constant average time and engineering tricks to avoid the linear-time worst case.↩︎
