Part of my work with Lob as a Staff Engineer is to implement various USPS standards pertaining to address autocorrection (aka CASS Cycle ‘N’ and ‘O’). These standards have a lot of subtle details, so the Elixir code that implements them can be intimidating. It is difficult to get this complex code to do what you want, so I started to investigate techniques that would help me track how my changes impacted the results. Before we dive into my solution, let’s first give a quick recap on how functional data structures differ from their imperative counterparts.
One of the counterintuitive aspects of writing functional code is the constructive nature of data structures. You cannot mutate an existing data structure — if you want to make a change, you must create new elements containing the change that you desire without altering the existing structure. Essentially you make a new version of the data structure that embodies your change that will share common parts of the previous version that stayed the same.
Let’s take the concrete example of a binary tree. The tree, T1, has six nodes: A, B, C, D, E, and F:
The traditional imperative way to add G to this tree would be to overwrite the right field of the parent node containing F to point to a newly created node containing G as illustrated by the left side of the diagram. But in a functional language like Elixir, we don’t have the ability to overwrite fields. Instead, we must create a new spine for the tree containing the nodes leading to the newly created G node while sharing the unchanged nodes from the original tree, T1.
At first, this seems terribly inefficient — after all the imperative counterpart just needed to perform one (destructive) update. But it’s not all bad news in the functional world; although in the tree example we create log₂(n) extra nodes, our old version and new version can happily co-exist. In fact, by merely keeping a pointer to old versions of the tree, we can trivially implement an infinite undo facility (space permitting). In the imperative world, this is a bit trickier — we would need to record the changes via a list of deltas and rerun those deltas if we wanted to access an earlier state. In the functional world, all these versions can simultaneously coexist for a small investment in space. Neat! In the imperative world? Not so much…
This also illustrates why it’s easier to reason about functional code versus imperative code — in the functional code, everything stays constant. In the imperative code, a pointer to a structure is no guarantee that structure will remain intact. This is also the reason that imperative concurrent programming is fraught with difficulty whereas functional programming lends itself quite naturally to working in parallel. No need for critical sections, semaphores, monitors, etc. Those constructs are all about protecting some piece of code that side effects from being executed simultaneously in different threads/processes. No side effects — no problem!
Implementing data structures functionally is actually a fascinating area of study — I highly recommend Chris Osaki’s Ph.D. thesis on the topic: Purely Functional Data Structures.
Now that we’ve refreshed our understanding of functional techniques for implementing data structures, we can move forward with an explanation of the basic idea. The core of our program doing address correction manipulates a very large %Address{} struct. The program works in phases and in each phase a different set of fields are populated with new metadata either deduced from machine learning sub-systems, or extracted from various USPS- supplied metadata. The final Address struct will have the answer we seek (we hope) and a whole lot of additional metadata that might help us understand how it came to that conclusion. But with a data structure as large as Address with over a hundred fields, IO.inspect at some strategic locations isn’t going to cut it — the amount of information printed to the screen is just too overwhelming. We need something a little bit more precise.
A key component of our library is the difference engine. This takes two objects and computes a list of the differences between them which makes it easier for us to visualize the evolution of our data structures over time. First, let’s decide how to represent a difference. A delta has three possibilities:
@type delta() :: {:update, any(), any()} | {:delete, any()} | {:add, any()}
@type path() :: list()
@type delta_spec() :: [{path(), delta()}]
With our representation settled, we need to figure out our implementation strategy. We simultaneously recurse over the two objects we wish to compare until we find a point where they diverge and we return the difference. So our main function will look like this:
@spec delta(any(), any()) :: delta_spec()
def delta(a, b) do
[]
|> delta(a, b)
|> List.flatten()
|> Enum.map(fn {path, element} -> {Enum.reverse(path), element} end)
|> Enum.sort_by(fn {path, _} -> {length(path), path} end)
end
This delegates to a three-argument version that takes the current path as the first argument. We flatten the result, reverse the paths (because they’re call stacks) and then order the results by length of path.
So now we tackle the major cases for delta: struct, map, list, tuple, and everything else:
@spec delta(path(), any(), any()) :: delta_spec()
defp delta(path, a, b) when is_struct(a) and is_struct(b), do: delta_struct(path, a, b)
defp delta(path, a, b) when is_map(a) and is_map(b), do: delta_map(path, a, b)
defp delta(path, a, b) when is_list(a) and is_list(b), do: delta_list(path, a, b)
defp delta(path, a, b) when is_tuple(a) and is_tuple(b), do: delta_tuple(path, a, b)
defp delta(path, a, b), do: delta_simple(path, a, b)
Let’s tackle these one at a time. For structs, if they are the same kind of struct, convert them into a map and carry on; otherwise, they’re already different.
@spec delta_struct(path(), struct(), struct()) :: delta_spec()
def delta_struct(path, %a_s{} = a, %b_s{} = b) when a_s != b_s, do: delta_simple(path, a, b)
def delta_struct(path, a, b), do: delta_map(path, Map.from_struct(a), Map.from_struct(b))
For maps, we want to compute the overlap of keys: a_only, only in a; b_only, only in b; and common, appears in both. With that information, it’s simply a matter of computing the difference of the values of the common keys:
@spec delta_map(path(), map(), map()) :: [delta()]
def delta_map(path, a, b) do
a_keys = MapSet.new(Map.keys(a))
b_keys = MapSet.new(Map.keys(b))
common = MapSet.intersection(a_keys, b_keys)
a_only = a_keys |> MapSet.difference(common)
b_only = b_keys |> MapSet.difference(common)
Enum.flat_map(common, fn key -> delta([key | path], a[key], b[key]) end) ++
Enum.flat_map(a_only, fn key -> [{[key | path], {:delete, a[key]}}] end) ++
Enum.flat_map(b_only, fn key -> [{[key | path], {:add, b[key]}}] end)
end
For lists, I take a simplistic approach of zipping them together to compare the elements pairwise. (I think computing the Levenshtein edits would be overkill).
@spec delta_list(path(), list(), list()) :: [delta()]
defp delta_list(path, as, bs) when length(as) != length(bs), do: delta_simple(path, as, bs)
defp delta_list(path, as, bs) do
as
|> indexed_zip(bs)
|> Enum.flat_map(fn {i, {a, b}} -> delta([i | path], a, b) end)
end
def indexed_zip(as, bs, i \\ 0)
def indexed_zip([], [], _), do: []
def indexed_zip([a | as], [], i), do: [{i, {a, nil}} | indexed_zip(as, [], i + 1)]
def indexed_zip([], [b | bs], i), do: [{i, {nil, b}} | indexed_zip([], bs, i + 1)]
def indexed_zip([a | as], [b | bs], i), do: [{i, {a, b}} | indexed_zip(as, bs, i + 1)]
For tuples, if they are the same size convert them to lists and let delta_list process them, otherwise, they’re different.
@spec delta_tuple(path(), tuple(), tuple()) :: [delta()]
defp delta_tuple(path, a, b) when tuple_size(a) != tuple_size(b), do: delta_simple(path, a, b)
defp delta_tuple(path, a, b), do: delta_list(path, Tuple.to_list(a), Tuple.to_list(b))
This just leaves the base case and we’re done.
@spec delta_simple(path(), term(), term()) :: [delta()]
defp delta_simple(_path, a, b) when a == b, do: []
defp delta_simple(path, a, b) do
cond do
empty?(a) -> [{path, {:add, b}}]
empty?(b) -> [{path, {:delete, a}}]
true -> [{path, {:update, a, b}}]
end
end
Ok, I suppose we should define empty?
@empty [false, "", nil, [], 0, 0.0]
@spec empty?(any()) :: boolean()
def empty?(value) when value in @empty, do: true
def empty?(value) when is_list(value), do: Enum.all?(value, &empty?/1)
def empty?(value) when is_tuple(value), do: value |> Tuple.to_list() |> empty?()
def empty?(value) when is_struct(value), do: value |> Map.from_struct() |> empty?()
def empty?(value) when is_map(value), do: value |> Map.values() |> empty?()
def empty?(_), do: false
And that wraps up our description of our difference engine.
A lot of folks don’t realize that Elixir is actually quite a small language; most of the features that you know and love are implemented via the powerful macro languages. Don’t believe me? Check out all the language features implemented inside the Kernel module: @variables, def, defdelegate, defexception, defguard, defimpl, defmacro, |> etc etc. This also explains why the syntax feels is a little clunky in places — it’s likely implemented as a macro!
Our basic idea is to define a macro, audit, which embeds a paper trail inside an __audittrail__ field. It needs to be a macro because we want to capture the file and line number from where the macro was called. Basically, this field will contain a triple of: the last version of the data structure, the filename, and the line number.
Let’s get started by defining some types:
@key :__audit_trail__
@enabled? Application.compile_env(:audit, :enabled?, false)
@type file_t :: binary()
@type line_t :: non_neg_integer()
@type trail_t :: {struct(), file_t(), line_t()}
@type change_t :: {struct(), trail_t()}
So our audit consists of a snapshot of the struct, a filename, and a line number. It’s considered good macro design practice to isolate as much of the functionality as possible in a function. So here’s our audit_fun:
@spec audit_fun(struct(), Macro.Env.t()) :: struct()
def audit_fun(r, e) do
r |> struct([{@key, payload(r, e)}])
end
@spec payload(struct(), Macro.Env) :: trail_t()
def payload(r, e), do: {r, e.file, e.line}
So now our macro is extremely simple:
defmacro audit(record) do
quote generated: true do
unquote(__MODULE__).audit_fun(unquote(record), __ENV__)
end
end
To put it to use, we want to be able to generate a changelist given a struct that has our magic __audit_trail__ field:
@spec changelist(term) :: [change_t()]
defp changelist(r = %_{__audit_trail__: audit_trail}) do
if audit_trail, do: [{r, audit_trail} | changelist(record(audit_trail))], else: []
end
defp changelist(_), do: []
Now we just need some helper functions to turn that changelist into a human-readable string:
def to_string(r) do
r
|> changelist()
|> Enum.map_join("\n=====\n", &stringify_change/1)
end
@spec changelist(term) :: [change_t()]
defp changelist(r = %_{__audit_trail__: audit_trail}) do
if audit_trail, do: [{r, audit_trail} | changelist(record(audit_trail))], else: []
end
defp changelist(_), do: []
@spec stringify_change(change_t()) :: binary()
defp stringify_change({post, {pre, filename, line}}) do
diff = Audit.Delta.delta(unaudit_fun(pre), unaudit_fun(post))
code = Audit.FileCache.get(filename) |> Enum.at(line - 1)
filename = String.replace_prefix(filename, "#{Audit.Github.git_root()}/", "")
url = Audit.Github.git_url(filename, line)
[
"github url: #{url}",
"local path: #{filename}:#{line}",
"code: #{String.trim(code)}",
"diff: #{inspect(diff)}"
]
|> Enum.join("\n")
end
There are two nuances in this code. Firstly, as we’ll be looking up our sources files a lot, it behooves us to cache them. So we implement a simple Agent to accomplish that:
defmodule Audit.FileCache do
@moduledoc """
Simple File cache Agent
"""
use Agent
def start_link(_args) do
Agent.start_link(fn -> %{} end, name: __MODULE__)
end
def get(filename) do
Agent.get_and_update(
__MODULE__,
fn state ->
if cached = Map.get(state, filename) do
{cached, state}
else
file = File.stream!(filename)
{file, Map.put(state, filename, file)}
end
end
)
end
end
Secondly, by their very nature source files are subject to change. So for this information to be useful at a later point in time, we should use git SHAs to specify the version of the file that we’re looking at. To accomplish this, we have a module to do all the GitHub URL wrangling:
defmodule Audit.Github do
@moduledoc """
Helper functions for generating Github URLs from filename and line number
"""
@type line_number_t :: binary | non_neg_integer()
@type filename_t :: binary()
@type git_t :: binary() | nil
@spec git([binary()]) :: git_t()
def git(args) do
case System.cmd("git", args) do
{s, 0} -> String.trim(s)
_ -> nil
end
end
@spec git_url(filename_t(), line_number_t()) :: git_t() | nil
def git_url(filename, line) do
branch()
|> remote()
|> get_url()
|> to_https()
|> format_string(branch(), filename, line)
end
@spec format_string(git_t(), git_t(), filename_t(), line_number_t()) :: git_t()
def format_string(nil, _branch, _filename, _line), do: nil
def format_string(_base, nil, _filename, _line), do: nil
def format_string(base, branch, filename, line) do
[base, "tree", branch, "#{filename}#L#{line}"] |> Path.join()
end
@spec branch() :: git_t()
def branch(), do: git(~w(rev-parse --abbrev-ref HEAD))
@spec git_root() :: git_t()
def git_root(), do: git(~w(rev-parse --show-toplevel))
@spec remote(git_t()) :: git_t()
def remote(nil), do: nil
def remote(branch) do
git(["config", "branch.#{branch}.remote"])
end
@spec get_url(git_t()) :: git_t()
def get_url(nil), do: nil
def get_url(remote) do
git(["remote", "get-url", remote])
end
@spec to_https(git_t()) :: git_t()
def to_https(nil), do: nil
def to_https(get_url) do
get_url
|> String.replace(":", "/")
|> String.replace("git@", "https://")
|> String.replace("\.git", "")
end
@spec commit_hash() :: git_t()
def commit_hash() do
git(["log", "-n1", ~s(--format=format:"%H")]) |> String.replace("\"", "")
end
end
With all these components in place, we’re finally in a position to demonstrate our library in action.
defmodule Tree do
alias __MODULE__
use Audit
defstruct left: nil, item: nil, right: nil, __audit_trail__: nil
def add(nil, item), do: %Tree{item: item}
def add(tree = %Tree{left: l, item: i, right: r}, item) do
cond do
item < i -> %Tree{audit(tree) | left: add(l, item)}
item > i -> %Tree{audit(tree) | right: add(r, item)}
true -> tree
end
end
end
This is a somewhat trivial example. The ideal use case is for much larger struct where it becomes hard to see the wood from the trees when you IO.inspect results. The use of the difference engine shines in these scenarios.
iex> nil |> Tree.add(2) |> Tree.add(1) |> Tree.add(3) |> Audit.to_string() |> IO.puts
github url: https://github.com/rugyoga/tree_ex/tree/main/lib/tree.ex#L15
local path: lib/tree.ex:15
code: item > i -> %Tree{audit(tree) | right: add(r, item)}
diff: [{[:right], {:add, %Tree{item: 3, left: nil, right: nil}}}]
=====
github url: https://github.com/rugyoga/tree_ex/tree/main/lib/tree.ex#L14
local path: lib/tree.ex:14
code: item < i -> %Tree{audit(tree) | left: add(l, item)}
diff: [{[:left], {:add, %Tree{item: 1, left: nil, right: nil}}}]
For extremely large %Address structs in our Address Verification code, this library has saved my bacon on several occasions. I hope you find it equally useful.
We’ve described a simple library to track changes to data structures. If you’d like to kick the tires and try it out, you can find it in hex.pm under audit.