Robin van der Vleuten

Why my Ruby gem ships address formats twice

I already wrote about why international address handling gets messy in Ruby. That post is about forms, labels, postal rules, and the public API around all of it.

This one is smaller.

It is about why Addressing has both of these files:

text
data/address_formats.json
data/address_formats.dump

At first that looks wasteful. Why ship the same address formats twice?

Because the two files have different jobs.

The obvious version: JSON

The JSON file is the source of truth.

Not because JSON is exciting. It isn't. That is the useful part.

Address formats are data, and I want to review them as data. When a format changes, the diff should show the country code, the fields, the required values, and the postal code pattern.

I do not want to review generated Ruby full of escaping, nested hashes, and noise.

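A simplified entry, pretty-printed here for readability (the Rake task further down reads the real file one JSON object per line), looks roughly like this: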

json
{
  "country_code": "NL",
  "format": "%given_name %family_name\n%organization\n%address_line1\n%postal_code %locality",
  "required_fields": ["address_line1", "postal_code", "locality"],
  "uppercase_fields": ["locality"],
  "postal_code_pattern": "\\d{4}\\s?[A-Z]{2}"
}

That is the kind of thing I want in Git. You can open it, search it, and compare it with the upstream source. If a country needs a tweak, the change is boring in the best possible way.

But JSON has a cost at runtime. Every process that needs address formats has to read the file, parse it, build Ruby objects, normalize keys, and only then answer the question the application actually asked.

For one request, who cares.

For a gem that gets loaded during application boot, tests, background jobs, consoles, and one-off scripts, that repeated setup starts to feel wasteful. The data is static anyway. It only changes when the gem changes.

I don't need to parse the same address definitions again and again.

The file Ruby actually loads

At runtime, Addressing reads the dump:

ruby
def definitions
  # Marshal data is binary, so read it with binread rather than in text mode.
  @definitions ||= Marshal.load(
    File.binread(
      File.expand_path("../../../data/address_formats.dump", __FILE__)
    )
  )
end

Marshal is Ruby's built-in object serialization format. Give it Ruby objects with Marshal.dump, and it writes bytes. Read those bytes with Marshal.load, and you get Ruby objects back.
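A minimal sketch of that round trip, using a made-up definition hash rather than the gem's real data:

```ruby
# Sample data for illustration only, not taken from the gem's files.
definition = {
  country_code: "NL",
  required_fields: [:address_line1, :postal_code, :locality],
  postal_code_pattern: "\\d{4}\\s?[A-Z]{2}"
}

bytes    = Marshal.dump(definition) # a binary String
restored = Marshal.load(bytes)      # a new Hash with the same contents

restored == definition              # true: same structure and values
restored.equal?(definition)         # false: a copy, not the same object
restored.keys.all?(Symbol)          # true: symbol keys survive the trip
bytes.encoding == Encoding::BINARY  # true: treat the dump as binary data
```

Symbols come back as symbols, so nothing needs re-symbolizing after the load.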

That is the whole trick. JSON parsing happens when the dump is generated, not every time an application boots.

The dump is still generated from the readable source. It is not a second source of truth.
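If you want to measure the difference on your own machine, a rough timing sketch could look like this. The 250 definitions are synthetic stand-ins, `time_it` is a hypothetical helper, and the numbers depend heavily on the Ruby and json versions in use, so treat it as a way to check rather than as a result.

```ruby
require "json"

# Synthetic stand-in data: 250 fake definitions, one JSON object per line.
lines = Array.new(250) do |i|
  JSON.generate({
    country_code: format("X%03d", i),
    format: "%given_name %family_name\n%address_line1\n%postal_code %locality",
    required_fields: %w[address_line1 postal_code locality],
    postal_code_pattern: "\\d{4}\\s?[A-Z]{2}"
  })
end

# Path A: parse every line and index it by country code.
parse_json = lambda do
  lines.each_with_object({}) do |line, index|
    definition = JSON.parse(line, symbolize_names: true)
    index[definition[:country_code]] = definition
  end
end

# Path B: restore the same hash from a prebuilt dump.
dump = Marshal.dump(parse_json.call)
load_dump = -> { Marshal.load(dump) }

# Both paths must produce identical data for the comparison to be fair.
same_data = parse_json.call == load_dump.call

# Hypothetical helper: wall-clock seconds for a number of runs.
def time_it(runs)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  runs.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

json_seconds = time_it(200) { parse_json.call }
dump_seconds = time_it(200) { load_dump.call }

puts format("JSON parse + index: %.4fs", json_seconds)
puts format("Marshal.load:       %.4fs", dump_seconds)
```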

Once the definitions are loaded, the gem also memoizes the AddressFormat objects it builds from them:

ruby
def get(country_code)
  country_code = country_code.upcase
  @address_formats ||= {}
  unless @address_formats.key?(country_code)
    definition = process_definition(
      definitions[country_code] || { country_code: country_code }
    )
    @address_formats[country_code] = new(definition)
  end
  @address_formats[country_code]
end

So there are two layers:

  • definitions caches the raw format definitions loaded from disk.
  • @address_formats caches the actual AddressFormat objects by country code.

The common path stays small:

  • ask for NL
  • get the Dutch format
  • ask again
  • get the same object back

No JSON parsing, no repeated normalization, no walking through a big file line by line.
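The common path can be sketched with a small stand-in class. `FormatSketch` and its `DEFINITIONS` constant are hypothetical and exist only to show the caching behavior, not the gem's real `AddressFormat`.

```ruby
# Hypothetical stand-in for the gem's class, only to show the caching.
class FormatSketch
  attr_reader :country_code

  # Stand-in for the definitions loaded from the dump.
  DEFINITIONS = {
    "NL" => { country_code: "NL", required_fields: [:postal_code] }
  }.freeze

  def initialize(definition)
    @country_code = definition[:country_code]
  end

  def self.get(country_code)
    country_code = country_code.upcase
    @address_formats ||= {}
    # Build the object once per country code, then reuse it.
    @address_formats[country_code] ||=
      new(DEFINITIONS[country_code] || { country_code: country_code })
  end
end

first  = FormatSketch.get("nl")
second = FormatSketch.get("NL")

first.equal?(second)                # true: the cached object is reused
first.country_code                  # "NL"
FormatSketch.get("ZZ").country_code # "ZZ": unknown codes get a bare definition
```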

Where the dump comes from

The dump file is generated. Nobody edits it by hand, because that would be miserable and pointless.

The Rake task is deliberately plain:

ruby
namespace :addressing do
  task dump: :generate do
    require "json"

    root_dir = File.expand_path("../..", __FILE__)
    definitions = {}

    # The source file holds one JSON object per line.
    File.readlines("#{root_dir}/data/address_formats.json").each do |line|
      definition = JSON.parse(line, symbolize_names: true)
      definitions[definition[:country_code]] = definition
    end

    # Marshal output is binary, so open the file in binary mode ("wb").
    File.open("#{root_dir}/data/address_formats.dump", "wb") do |f|
      Marshal.dump(definitions, f)
    end
  end
end

Read each JSON line, parse it into a Ruby hash, index it by country code, and let Marshal.dump write the result.
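To make the resulting shape concrete, here is the same parse-and-index step run on two made-up lines. The entries are trimmed-down examples, not the gem's real data.

```ruby
require "json"

# Two illustrative lines in a one-object-per-line layout.
lines = [
  '{"country_code":"NL","required_fields":["address_line1","postal_code","locality"]}',
  '{"country_code":"BE","required_fields":["address_line1","postal_code","locality"]}'
]

definitions = lines.each_with_object({}) do |line, index|
  definition = JSON.parse(line, symbolize_names: true)
  index[definition[:country_code]] = definition
end

definitions.keys                    # ["NL", "BE"]: outer keys are strings
definitions["NL"].keys              # [:country_code, :required_fields]: inner keys are symbols
definitions["NL"][:required_fields] # ["address_line1", "postal_code", "locality"]
```

The outer keys stay strings because they come from a JSON value, while `symbolize_names: true` only affects the keys inside each definition.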

There is no separate schema language hiding in the middle, no custom compiler, and no clever code generation step that I will forget how to debug six months later.

That matters because generated files tend to attract suspicion over time. If the dump ever looks wrong, I can rebuild it from the source data. If the source data changes, I can review the JSON diff and treat the binary blob as an artifact.
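One way to keep that promise honest is a check that rebuilds the definitions from JSON and compares them with the shipped dump. This is a sketch under assumptions: `dump_in_sync?` is a hypothetical helper, not something the gem is known to provide, and the demo uses throwaway files in a temporary directory.

```ruby
require "json"
require "tmpdir"

# Hypothetical helper: does the dump match what the JSON would generate?
def dump_in_sync?(json_path, dump_path)
  expected = File.readlines(json_path).each_with_object({}) do |line, index|
    definition = JSON.parse(line, symbolize_names: true)
    index[definition[:country_code]] = definition
  end
  Marshal.load(File.binread(dump_path)) == expected
end

# Demo with temporary files instead of the gem's data directory.
fresh, stale = Dir.mktmpdir do |dir|
  json_path = File.join(dir, "address_formats.json")
  dump_path = File.join(dir, "address_formats.dump")

  File.write(json_path, %({"country_code":"NL","required_fields":["postal_code"]}\n))
  File.binwrite(
    dump_path,
    Marshal.dump({ "NL" => { country_code: "NL", required_fields: ["postal_code"] } })
  )
  in_sync = dump_in_sync?(json_path, dump_path)

  # Change the JSON without regenerating the dump.
  File.write(json_path, %({"country_code":"NL","required_fields":["postal_code","locality"]}\n))
  out_of_sync = dump_in_sync?(json_path, dump_path)

  [in_sync, out_of_sync]
end

fresh # true: dump matches the JSON
stale # false: the JSON changed, the dump did not
```

Comparing the loaded objects rather than the raw bytes keeps the check focused on the data itself.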

Why not generate Ruby?

I did consider the obvious alternative: generate Ruby.

Something like:

ruby
ADDRESS_FORMATS = {
  "NL" => {
    format: "...",
    required_fields: [...]
  }
}

That would avoid JSON parsing too. Ruby could load it like any other file.

I don't like it here.

Large generated constants are noisy to review, easy to format into chaos, and not especially pleasant to diff when a lot of data changes at once.

More importantly, generated Ruby blurs the line between code and data. Address formats are definitions. I want the behavior in Ruby and the definitions in a data file.

The Marshal dump is an implementation detail. It is not the thing you edit.

Marshal is not for strangers

There is one big warning with Marshal: don't load untrusted data.

Marshal.load can instantiate Ruby objects. That makes it the wrong tool for user-provided files, uploads, API responses, or anything that came from outside your trust boundary.
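A small sketch of what "instantiate Ruby objects" means in practice. `Surprise` is a made-up class for illustration.

```ruby
require "json"

# A made-up class standing in for anything loadable in the process.
class Surprise
  attr_reader :note

  def initialize(note)
    @note = note
  end
end

payload = Marshal.dump(Surprise.new("state chosen by whoever wrote the bytes"))

# Marshal.load rebuilds an instance of the class named in the bytes
# and restores its instance variables.
revived = Marshal.load(payload)
revived.class # Surprise
revived.note  # "state chosen by whoever wrote the bytes"

# JSON.parse with default options only produces plain data structures.
parsed = JSON.parse('{"note":"just data"}')
parsed.class  # Hash
```

With Marshal, the input decides which classes come back and what state they carry. With plain `JSON.parse`, the input can only describe hashes, arrays, strings, numbers, booleans, and nil.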

Addressing does something narrower.

The dump file is generated as part of the gem's own data pipeline and shipped with the gem. Applications don't upload it. Users don't edit it. The library loads its own packaged data from its own gem.

The distinction matters. Marshal is a bad interchange format. It is Ruby-specific, opaque, and unsafe for untrusted input.

Used as a private cache for static, trusted Ruby data inside a gem, it has a much narrower job. That is the only reason I am comfortable using it here.

The boundary

The split only works because the boundary stays strict:

  • edit JSON
  • generate Marshal
  • load Marshal

If the cache ever becomes the source of truth, the whole thing gets worse. Then you have a binary file nobody can review and a maintenance process nobody wants to touch.

So I try to keep the roles boring.

JSON is for people. Marshal is for Ruby. The generated file can always be rebuilt.

That is enough structure.

The useful lesson

I like this pattern more than I expected because it avoids a false choice.

You can keep source data nice to maintain and runtime data cheap to load, as long as one is clearly generated from the other.

It is not an architecture I would reach for everywhere. Most data files should probably stay as data files. Most caches do not need to ship with your package.

But when the data is static, trusted, and expensive enough to parse and reshape on every boot, this is a useful little trick.

Keep the readable thing readable. Keep the fast thing generated.