It has been said that “software has eaten the world”. To the extent that statement is true, some things remain a bit chewy, and there’s still a bit of indigestion, especially around street addresses.
In this article, we’re going to explore some common mistakes that programmers make when building systems to work with addresses, with a particular focus on GIS.
For a more general introduction to the complexity of addresses read this.
A simple and concise explanation of the complexity of addresses is that addresses are a shorthand convention for expressing directions to tell a human how to get to a particular place, structure, or sub-unit of a structure. Standards bodies, like the USPS, and governments (at all levels) have spent considerable time and effort working to standardize addresses, but at the end of the day reality is complicated and the weight of history heavy. As long as the directions are followable then even the worst address gets its job done. At the same time, this great flexibility is a source of tremendous frustration for folks who need to integrate addresses with computer systems.
Bad Assumptions to Make About Addresses
Addresses are just a simple string
You may be tempted to store an address as a String. After all, it’s fairly easy to get a user to input their address in a text box, right? In certain situations, this can actually work, but in general, it’s a Bad Idea:
- Most systems that you might want to integrate with (e.g. billing, delivery, etc) will require you to send addresses in some componentized format. That means if you do use a String your system will still have to destructure the address later on.
- Should you want to analyze the address — say you want to group by city and state for some fancy machine learning — then you’ll need to extract this component information regardless.
- Requirements change and you may need to translate these addresses to another format. This is nearly equivalent to the translation of unstructured text given how complex addresses can be.
In short, don’t store addresses as Strings unless you absolutely know you can get away with it.
Addresses have just a few components
Ok, you’ve decided that you won’t store the address as a String; instead, you’ll store the individual components of an address. Can’t be that bad, right?
In the US there are as many as 13 different components, of which several can be mixed and/or matched in one of three flavors — commonly seen as single-line, double-line, and triple-line addresses. So you not only need to be able to store any or all of these components, but you also need to know which flavor you’ve stored. In addition, some components (like State), have both abbreviated and spelled-out forms, which you may need to detect and/or distinguish.
The issue here is not that modern databases have trouble with 10s or even 100s of columns, but that for an address nearly all of these columns are Strings which themselves don’t have nice length limits or universally consistent structure. If you plan to store lots of addresses, finding a scheme that works well with database compression is advised.
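To make the trade-off concrete, here is a minimal sketch of a componentized record. The class names, the field names, and the small subset of components are our own illustrative assumptions, not a standard schema, and the sample values are made up.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LineFormat(Enum):
    """Which 'flavor' the address was captured in (hypothetical names)."""
    SINGLE_LINE = 1
    DOUBLE_LINE = 2
    TRIPLE_LINE = 3


@dataclass(frozen=True)
class AddressRecord:
    # Illustrative subset of components; a real schema would carry more.
    line_format: LineFormat
    house_number: Optional[str] = None   # text, so "12A" or "12 1/2" fit
    street_name: Optional[str] = None
    street_suffix: Optional[str] = None  # "Ln" or "Lane"
    unit_type: Optional[str] = None      # "Apt", "Suite", ...
    unit_number: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None          # abbreviated or spelled out
    zip_code: Optional[str] = None       # text, so leading zeros survive


# Made-up sample data, not real addresses.
records = [
    AddressRecord(LineFormat.DOUBLE_LINE, house_number="12",
                  street_name="Tower", street_suffix="Ln",
                  city="Springfield", state="XX", zip_code="00000"),
    AddressRecord(LineFormat.TRIPLE_LINE, house_number="7",
                  street_name="Mill", street_suffix="Rd",
                  unit_type="Apt", unit_number="3",
                  city="Springfield", state="XX", zip_code="00000"),
    AddressRecord(LineFormat.SINGLE_LINE, house_number="40",
                  street_name="Oak", street_suffix="St",
                  city="Riverton", state="XX", zip_code="00001"),
]

# With components stored separately, grouping by city and state is trivial.
by_city_state = Counter((r.city, r.state) for r in records)
```

Every field other than the format marker is optional text, which is exactly why such a table is awkward to constrain and why compression matters at scale.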
We don’t need to store every component type, just single/double/triple line is enough structure for our Addresses
This is true a reasonably large percentage of the time, especially for non-GIS applications. However, in GIS, forgoing the more granular scheme isn’t usually the best trade-off. For example, you might give up:
- grouping by street name
- inferring neighborhood-level groups
- detecting apartments/duplexes
- differentiating commercial from residential addresses.
The downside is that while you may want to store more granular information it may not be readily accessible. Many user-facing systems tend to follow single/double/triple input line schemes that don’t typically require, for example, street name as a separate component. Doing this more granular componentization yourself is also not recommended.
Addresses can be stored as entered and re-used later without issue
It is entirely reasonable, and sometimes even required by law, to store the address the customer/user/whoever gave you. However, if you attempt to actually do anything with that address it MUST be normalized first; otherwise, you risk working with invalid, poorly formatted, unprocessable junk data. It’s not that users intend to give us junk data; it’s that the crumbs in their keyboards sometimes interfere with their typing.
If you are unfamiliar, normalization is the process of applying a standardizing set of rules to a domain. In this case, the rules for normalizing addresses include things like:
- Abbreviating certain forms — Apartment becomes Apt, Lane becomes Ln, etc
- Writing street numbers as digits instead of spelling them out, e.g. “Twelve Tower Lane” becomes “12 Tower Lane”
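The two rules above can be sketched in a few lines. This is a deliberately naive illustration: the lookup tables and the function name are our own assumptions, and a real rule set is far larger.

```python
# Tiny illustrative lookup tables (assumed, not an official rule set).
ABBREVIATIONS = {"apartment": "Apt", "lane": "Ln"}
NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9", "ten": "10", "eleven": "11", "twelve": "12",
}


def normalize_street_line(line: str) -> str:
    """Apply both toy rules to a single street line."""
    normalized = []
    for position, word in enumerate(line.split()):
        key = word.lower().strip(".,")
        if position == 0 and key in NUMBER_WORDS:
            # A spelled-out leading street number becomes digits.
            normalized.append(NUMBER_WORDS[key])
        elif key in ABBREVIATIONS:
            # Known long forms are replaced by their abbreviations.
            normalized.append(ABBREVIATIONS[key])
        else:
            normalized.append(word)
    return " ".join(normalized)


print(normalize_street_line("Twelve Tower Lane"))  # 12 Tower Ln
```

Because this sketch applies both rules at once, “Twelve Tower Lane” comes out as “12 Tower Ln”. It would also happily rewrite a street that is actually named “Lane”, which hints at how quickly simple rules run into edge cases.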
Normalizing addresses is easy
Correctly normalizing addresses is quite difficult. The rules for how to normalize vary from country to country, contain a great many edge cases, usually allow for legacy or antiquated usage, and have varying levels of specificity. Even existing commercial normalization engines differ in how they normalize the same address. This can lead to interesting experiences for those folks who live in more rural settings. For GIS systems this often means that their normalized data isn’t much better than the raw data.
Every address can be normalized
Nope. Certainly, yes, most assigned postal addresses can be normalized. But with more than 150 million postal addresses in the US alone, all we need is one address to not normalize.
Remember, an address is just directions. You can (in some instances) draw a map on a letter and get it delivered. From Falsehoods about Addresses:
Kirk Kerekes spent several years using an address of the form “2 mi N then 3 mi W of Jennings, OK 74038” which regularly got successful deliveries. Mike Riley used to mail the Very Large Array radio telescope at “50 miles (80 km) West of Socorro, New Mexico, USA”
Addresses are immutable
You might be tempted to assume that addresses are immutable, i.e. addresses don’t change once assigned to a location. And how nice a world that would be! But, alas, it is not so.
While it is obvious that people move (roughly 37 million address changes were processed in 2016), and therefore update addresses somewhat regularly, it is not as obvious that buildings and locations change addresses as well. In either case, the relationship between a person and an address, or between a location and its address, involves a hidden, implicit temporal component that may need to be tracked.
A location’s address changes for a variety of reasons:
- The structure physically moves — houseboats are capable of moving
- The addressing scheme changes — Rural Route to Street Name
- Street Names change
- Zip Codes change
- Towns are annexed or created
- Political borders change
Addresses can also fail to change, even though something relevant about the location has changed. For example, structures may be razed and re-built but have old addresses remain, which is what happens with re-development.
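One way to make that temporal component explicit is to store each address assignment with a validity interval. The sketch below is an assumption on our part: the class, the field names, and the sample history (a Rural Route address replaced by a street address) are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class AddressAssignment:
    location_id: str                 # hypothetical stable ID for the place
    address: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still current"


def address_on(history: List[AddressAssignment],
               location_id: str, when: date) -> Optional[str]:
    """Return the address a location had on a given day, if any."""
    for entry in history:
        if entry.location_id != location_id:
            continue
        started = entry.valid_from <= when
        not_ended = entry.valid_to is None or when < entry.valid_to
        if started and not_ended:
            return entry.address
    return None


# Made-up history: the addressing scheme changed in mid-2005.
history = [
    AddressAssignment("loc-1", "RR 2 BOX 15",
                      date(1990, 1, 1), date(2005, 6, 1)),
    AddressAssignment("loc-1", "410 OAK HOLLOW RD", date(2005, 6, 1)),
]

print(address_on(history, "loc-1", date(2000, 1, 1)))  # RR 2 BOX 15
print(address_on(history, "loc-1", date(2010, 1, 1)))  # 410 OAK HOLLOW RD
```

The same shape works for the person-to-address relationship: swap the location ID for a person ID and the history records moves instead of renames.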
Ok, addresses change, but not enough to worry about
The most insidious part of storing addresses is that addresses rot. Just because you had a valid address last week doesn’t mean it will be valid next week. Quite literally addresses become obsolete.
The best numbers I can find online assume that on average about 0.01% of addresses per year become obsolete due to just zip code changes. At least in the US zip code changes are by far the largest source of address deprecation. If we assume that the other factors contribute another 0.005%, then on average about 0.015% of addresses per year become obsolete.
Doing the math:
150,000,000 postal addresses * 0.00015 = 22,500 addresses/year
While that isn’t bad, most large GIS databases don’t have data that is recent or even recently validated. Let’s assume that we have data that is on average around 10 years old. What happens then? The yearly losses pile up and compound. Instead of 22k addresses, it looks more like:
(150,000,000 * 1.00015^10) - 150,000,000 ≈ 225,000 addresses
The problem is now roughly 10 times larger! The older the data set, the worse it gets.
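The arithmetic above is easy to reproduce and to rerun with your own assumptions. In this sketch the function names are ours, the rate is the assumed 0.015% per year from the text, and the second function is an alternative formulation we add for comparison (each address survives a year with probability 1 - rate), not something the original figures rely on.

```python
TOTAL_ADDRESSES = 150_000_000
ANNUAL_RATE = 0.00015  # 0.015% per year, the figure assumed in the text


def obsolete_compounded(years: int,
                        total: int = TOTAL_ADDRESSES,
                        rate: float = ANNUAL_RATE) -> float:
    """The article's estimate: total * (1 + rate)^years - total."""
    return total * (1 + rate) ** years - total


def obsolete_by_survival(years: int,
                         total: int = TOTAL_ADDRESSES,
                         rate: float = ANNUAL_RATE) -> float:
    """Alternative (our assumption): each address independently
    survives any given year with probability (1 - rate)."""
    return total * (1 - (1 - rate) ** years)


for years in (1, 10, 25):
    print(years,
          round(obsolete_compounded(years)),
          round(obsolete_by_survival(years)))
```

At rates this small, both formulations land at roughly 225,000 addresses after 10 years, so the choice of model matters much less than the age of the data.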
Addresses are unique — each location only has one address
It would be awfully convenient for there to be a one-to-one mapping between addresses and locations, but the reality is much more complicated.
A given location or building may have multiple synonymous and interchangeable addresses, for example:
Both refer to the same location because the two street/road names refer to the same physical street, but notice that Google doesn’t drop a pin for the second address. In truth, Google doesn’t even map the second address to the same location — we’ve had to supply the correct latitude and longitude!
Some structures may have multiple addresses, maybe because of how large they are, or because of a quirk of their history. The Port Authority building in New York is an example. However, just because the building has multiple addresses doesn’t mean they are interchangeable: each address refers only to a distinct sub-unit of the larger Port Authority building, and a sub-unit may or may not have its own entrance.
Nor can we assume that addresses map to one unique location or building. For example, many large organizations that operate over a campus with many buildings will still usually have a single mailing address. A building name or code may be required to get more specific.
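In data-model terms, this is a many-to-many relationship rather than a one-to-one mapping. A minimal sketch, assuming hypothetical location IDs and made-up addresses, keeps an index in both directions:

```python
from collections import defaultdict
from typing import DefaultDict, Set

# Two indexes so we can look up in either direction.
locations_by_address: DefaultDict[str, Set[str]] = defaultdict(set)
addresses_by_location: DefaultDict[str, Set[str]] = defaultdict(set)


def link(address: str, location_id: str) -> None:
    """Record that an address refers to a location (IDs are hypothetical)."""
    locations_by_address[address].add(location_id)
    addresses_by_location[location_id].add(address)


# Two names for the same physical street: two addresses, one location.
link("100 MAIN ST", "house-1")
link("100 STATE ROUTE 9", "house-1")

# One campus mailing address shared by several buildings.
link("1 CAMPUS DR", "campus-bldg-a")
link("1 CAMPUS DR", "campus-bldg-b")

print(sorted(addresses_by_location["house-1"]))
print(sorted(locations_by_address["1 CAMPUS DR"]))
```

Any code that assumes a single result in either direction will eventually be wrong for some record.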
Every building has a postal address
Not every building has a postal address. In fact, a registered, active postal address is only applicable if there is a need to receive mail, and many businesses only need to receive shipments from their distributors. This means that a building can have an address but not be a valid delivery point according to databases like the one the USPS maintains.
A valid postal address is valid
Just because an address passes syntactic validation and normalizes doesn’t mean that it corresponds to an actual building or delivery location. To know that, you have to cross-check it against a known set of delivery points, like the one the USPS maintains.
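The difference between the two checks can be sketched as follows. Everything here is a stand-in: the regular expression is a toy pattern we made up, and the in-memory set plays the role of an authoritative delivery-point database that you would query in practice.

```python
import re

# Toy syntactic pattern (assumed): "<number> <STREET>, <CITY>, <ST> <ZIP5>".
WELL_FORMED = re.compile(r"^\d+ [A-Z ]+, [A-Z ]+, [A-Z]{2} \d{5}$")

# Hypothetical stand-in for an authoritative list of delivery points.
KNOWN_DELIVERY_POINTS = {
    "12 TOWER LN, SPRINGFIELD, XX 00000",
}


def is_well_formed(address: str) -> bool:
    """Syntactic check only: does the string look like an address?"""
    return WELL_FORMED.match(address) is not None


def is_deliverable(address: str) -> bool:
    """Cross-check against the known delivery points."""
    return is_well_formed(address) and address in KNOWN_DELIVERY_POINTS


candidate = "99 TOWER LN, SPRINGFIELD, XX 00000"
print(is_well_formed(candidate))   # True: it looks fine
print(is_deliverable(candidate))   # False: no such delivery point
```

Only the second check says anything about the real world, and it is only as good as the reference data behind it.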
At least addresses map to a location, yes?
In practice, it will be possible to get a location associated with most addresses, at least here in the US. However, the quality of those inferences will change radically from address to address, and the location inferred may change with time. What you do with the inferred locations can be problematic if you’ve made bad assumptions.
- What happens if you try to locate a PO Box?
- Does a duplex with two distinct addresses have two distinct locations?
- Does an apartment building with N distinct addresses have N distinct locations?
- Does a campus get a location that is the centroid of all buildings? Or do you try to locate each separate building?
An address is just a list of directions and those directions can lead you to a different place each time you follow them. Beware!
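If you do attach coordinates to an address, it helps to record how they were obtained so downstream code can tell a rooftop match from a rough centroid. The sketch below is our own assumption about how that might look; the precision labels are made up, and the centroid is a plain average of coordinates, which is only a reasonable approximation over a small area such as a campus.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Precision(Enum):
    ROOFTOP = "rooftop"    # matched to a specific structure
    CENTROID = "centroid"  # middle of a group of structures


@dataclass(frozen=True)
class Location:
    lat: float
    lon: float
    precision: Precision


def campus_centroid(buildings: List[Tuple[float, float]]) -> Location:
    """Average the (lat, lon) pairs of all buildings on a campus."""
    if not buildings:
        raise ValueError("at least one building is required")
    lat = sum(b[0] for b in buildings) / len(buildings)
    lon = sum(b[1] for b in buildings) / len(buildings)
    return Location(lat, lon, Precision.CENTROID)


# Made-up coordinates for two buildings.
campus = campus_centroid([(10.0, 20.0), (12.0, 22.0)])
print(campus)
```

Keeping the precision next to the coordinates lets a routing or analysis step decide whether a given point is good enough for its purpose.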
Storing an address is ok
The simple takeaway: if you can, don’t ever store an address! Addresses are complicated, fragile, rotting, legacy data that will only cause you pain.
But what if there was a better way? Stay tuned since we may just have the answer you’re looking for!