Why my Ruby gem ships address formats twice
I already wrote about why international address handling gets messy in Ruby. That post is about forms, labels, postal rules, and the public API around all of it.
This one is smaller.
It is about why Addressing has both of these files:
```text
data/address_formats.json
data/address_formats.dump
```
At first that looks wasteful. Why ship the same address formats twice?
Because the two files have different jobs.
The obvious version: JSON
The JSON file is the source of truth.
Not because JSON is exciting. It isn't. That is the useful part.
Address formats are data, and I want to review them as data. When a format changes, the diff should show the country code, the fields, the required values, and the postal code pattern.
I do not want to review generated Ruby full of escaping, nested hashes, and noise.
A simplified entry looks roughly like this:
```json
{
  "country_code": "NL",
  "format": "%given_name %family_name\n%organization\n%address_line1\n%postal_code %locality",
  "required_fields": ["address_line1", "postal_code", "locality"],
  "uppercase_fields": ["locality"],
  "postal_code_pattern": "\\d{4}\\s?[A-Z]{2}"
}
```
That is the kind of thing I want in Git. You can open it, search it, and compare it with the upstream source. If a country needs a tweak, the change is boring in the best possible way.
JSON has a cost at runtime. Every process that needs address formats has to read the file, parse it, build Ruby objects, normalize keys, and then finally answer the actual question the application asked.
For one request, who cares.
For a gem that gets loaded during application boot, tests, background jobs, consoles, and one-off scripts, that repeated setup starts to feel wasteful. The data is static anyway. It only changes when the gem changes.
I don't need to parse the same address definitions again and again.
The file Ruby actually loads
At runtime, Addressing reads the dump:
```ruby
def definitions
  @definitions ||= Marshal.load(File.read(File.expand_path("../../../data/address_formats.dump", __FILE__).to_s))
end
```
Marshal is Ruby's built-in object serialization format. Give it Ruby objects with Marshal.dump, and it writes bytes. Read those bytes with Marshal.load, and you get Ruby objects back.
That is the whole trick. JSON parsing happens when the dump is generated, not every time an application boots.
The dump is still generated from the readable source. It is not a second source of truth.
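To make the round trip concrete, here is a small sketch. It is not the gem's code. The two sample entries, the constant names, and the helper methods are made up for illustration, and the timings it prints will vary by machine, so treat them as a rough feel rather than a benchmark.

```ruby
require "json"
require "benchmark"

# Hypothetical stand-in for data/address_formats.json: one JSON object per line.
JSON_SOURCE = <<~JSONL
  {"country_code":"NL","required_fields":["address_line1","postal_code","locality"]}
  {"country_code":"ZZ","required_fields":["address_line1"]}
JSONL

# The slow path: parse every line and index the result by country code.
def definitions_from_json(source)
  source.each_line.each_with_object({}) do |line, definitions|
    definition = JSON.parse(line, symbolize_names: true)
    definitions[definition[:country_code]] = definition
  end
end

# The fast path: hand Marshal the bytes and get the finished hash back.
def definitions_from_dump(bytes)
  Marshal.load(bytes)
end

# Done once, at "build time".
DUMP_BYTES = Marshal.dump(definitions_from_json(JSON_SOURCE))

json_time = Benchmark.realtime { 1_000.times { definitions_from_json(JSON_SOURCE) } }
dump_time = Benchmark.realtime { 1_000.times { definitions_from_dump(DUMP_BYTES) } }

puts "same data: #{definitions_from_json(JSON_SOURCE) == definitions_from_dump(DUMP_BYTES)}"
puts format("JSON: %.4fs, Marshal: %.4fs", json_time, dump_time)
```

The point is the equality check, not the numbers: both paths end up with the same hash, and only one of them has to parse text to get there.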
Once the definitions are loaded, the gem also memoizes the AddressFormat objects it builds from them:
```ruby
def get(country_code)
  country_code = country_code.upcase

  @address_formats ||= {}
  unless @address_formats.key?(country_code)
    definition = process_definition(definitions[country_code] || { country_code: country_code })
    @address_formats[country_code] = new(definition)
  end

  @address_formats[country_code]
end
```
So there are two layers:
- definitions caches the raw format definitions loaded from disk.
- @address_formats caches the actual AddressFormat objects by country code.
The common path stays small:
- ask for NL
- get the Dutch format
- ask again
- get the same object back
No JSON parsing, repeated normalization, or walking through a big file line by line.
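Here is a stripped-down sketch of that second cache layer. TinyAddressFormat, its DEFINITIONS constant, and build_count are hypothetical names invented for this example. The real class does more work per definition, but the memoization has the same shape.

```ruby
# Hypothetical, minimal stand-in for the gem's AddressFormat class.
class TinyAddressFormat
  attr_reader :country_code

  # Pretend this hash came out of Marshal.load.
  DEFINITIONS = { "NL" => { country_code: "NL" } }.freeze

  # Counts how many objects were actually built, to make the caching visible.
  def self.build_count
    @build_count ||= 0
  end

  def self.get(country_code)
    country_code = country_code.upcase
    @address_formats ||= {}
    # Build the object only on the first request for this country code.
    @address_formats[country_code] ||= begin
      @build_count = build_count + 1
      new(DEFINITIONS[country_code] || { country_code: country_code })
    end
  end

  def initialize(definition)
    @country_code = definition[:country_code]
  end
end

first = TinyAddressFormat.get("nl")
second = TinyAddressFormat.get("NL")

puts first.equal?(second)           # true: the very same object, not a copy
puts TinyAddressFormat.build_count  # 1
```

Because the lookup key is upcased first, "nl" and "NL" land on the same cache entry.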
Where the dump comes from
The dump file is generated. Nobody edits it by hand, because that would be miserable and pointless.
The Rake task is deliberately plain:
```ruby
namespace :addressing do
  task dump: :generate do
    require "json"

    root_dir = File.expand_path("../..", __FILE__)
    definitions = {}

    File.readlines("#{root_dir}/data/address_formats.json").each do |line|
      definition = JSON.parse(line, symbolize_names: true)
      definitions[definition[:country_code]] = definition
    end

    File.open("#{root_dir}/data/address_formats.dump", "w") do |f|
      Marshal.dump definitions, f
    end
  end
end
```
Read each JSON line, parse it into a Ruby hash, index it by country code, and let Marshal.dump write the result.
There is no separate schema language hiding in the middle, no custom compiler, and no clever code generation step that I will forget how to debug six months later.
That matters because generated files tend to become suspicious over time. If the generated file changes, I can rebuild it from the source data. If the source data changes, I can review the JSON and ignore the binary blob as an artifact.
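One way to keep the blob honest is a staleness check: rebuild the hash from the JSON and compare it with what the dump contains. The sketch below is an assumption about how such a check could look, not something the gem ships. build_definitions and dump_current? are hypothetical helpers, and the demo runs against throwaway files in a temp directory.

```ruby
require "json"
require "tmpdir"

# Rebuild the definitions hash from a JSON-lines file, the same way the Rake task does.
def build_definitions(json_path)
  File.foreach(json_path).each_with_object({}) do |line, definitions|
    definition = JSON.parse(line, symbolize_names: true)
    definitions[definition[:country_code]] = definition
  end
end

# True when the dump on disk holds exactly what the JSON would generate today.
def dump_current?(json_path, dump_path)
  Marshal.load(File.binread(dump_path)) == build_definitions(json_path)
end

# Demo with throwaway files; the contents are made up.
Dir.mktmpdir do |dir|
  json_path = File.join(dir, "address_formats.json")
  dump_path = File.join(dir, "address_formats.dump")

  File.write(json_path, %({"country_code":"NL","uppercase_fields":["locality"]}\n))
  File.binwrite(dump_path, Marshal.dump(build_definitions(json_path)))
  puts dump_current?(json_path, dump_path) # true: dump was just generated

  # Edit the source without regenerating the dump.
  File.write(json_path, %({"country_code":"NL","uppercase_fields":[]}\n))
  puts dump_current?(json_path, dump_path) # false: dump is stale
end
```

The check compares loaded objects rather than raw bytes, so it does not depend on Marshal producing byte-identical output across Ruby versions.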
Why not generate Ruby?
I did consider the obvious alternative: generate Ruby.
Something like:
```ruby
ADDRESS_FORMATS = {
  "NL" => {
    format: "...",
    required_fields: [...]
  }
}
```
That would avoid JSON parsing too. Ruby could load it like any other file.
I don't like it here.
Large generated constants are noisy to review, easy to format into chaos, and not especially pleasant to diff when a lot of data changes at once.
More importantly, generated Ruby blurs the line between code and data. Address formats are definitions. I want the behavior in Ruby and the definitions in a data file.
The Marshal dump is an implementation detail. It is not the thing you edit.
Marshal is not for strangers
There is one big warning with Marshal: don't load untrusted data.
Marshal.load can instantiate Ruby objects. That makes it the wrong tool for user-provided files, uploads, API responses, or anything that came from outside your trust boundary.
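A harmless illustration of the difference, using a made-up Note struct. JSON.parse only ever hands back plain hashes, arrays, strings, numbers, booleans, and nil. Marshal.load rebuilds whatever class the bytes name, as long as that class is loaded in the process.

```ruby
require "json"

# Hypothetical class standing in for "any class available in the process".
Note = Struct.new(:text)

marshal_bytes = Marshal.dump(Note.new("hello"))
json_text = JSON.generate({ "text" => "hello" })

# Only reasonable here because these bytes were produced two lines above.
from_marshal = Marshal.load(marshal_bytes)
from_json = JSON.parse(json_text)

puts from_marshal.class # Note
puts from_json.class    # Hash
```

With bytes you generated yourself, getting a Note back is a convenience. With bytes from someone else, the sender is choosing which of your classes get instantiated.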
Addressing does something narrower.
The dump file is generated as part of the gem's own data pipeline and shipped with the gem. Applications don't upload it. Users don't edit it. The library loads its own packaged data from its own gem.
The distinction matters. Marshal is a bad interchange format. It is Ruby-specific, opaque, and unsafe for untrusted input.
As a private cache for static Ruby data inside a gem, it is a much narrower tool. That is the only reason I am comfortable using it here.
The boundary
The split only works because the boundary stays strict:
- edit JSON
- generate Marshal
- load Marshal
If the cache ever becomes the source of truth, the whole thing gets worse. Then you have a binary file nobody can review and a maintenance process nobody wants to touch.
So I try to keep the roles boring.
JSON is for people. Marshal is for Ruby. The generated file can always be rebuilt.
That is enough structure.
The useful lesson
I like this pattern more than I expected because it avoids a false choice.
You can keep source data nice to maintain and runtime data cheap to load, as long as one is clearly generated from the other.
It is not an architecture I would reach for everywhere. Most data files should probably stay as data files. Most caches do not need to ship with your package.
But when the data is static, trusted, and expensive enough to shape repeatedly, this is a useful little trick.
Keep the readable thing readable. Keep the fast thing generated.