DEV Community

My attempt to build a worldwide zip code data set

Andreas A.
・2 min read

I always thought getting worldwide postal codes by myself was an easy task because postal codes seem to be nothing more than a simple shortcode that is publicly available. I quickly realized this was not the case, because:

  • There is no single source of truth
  • Most sources are incomplete
  • The data is often presented in an unstructured way

After doing some general research, I soon understood that the problems above have their origin in the history of postal codes. Each country has a different format, area granularity, and way of structuring postal codes as a whole.
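To make the "different format per country" point concrete, here is a minimal sketch that checks a code against a per-country pattern. The table `POSTAL_CODE_PATTERNS` and the helper `looks_valid` are names I made up for illustration, and the patterns only describe the rough shape of each format; they are not official validation rules.

```python
import re

# Simplified shapes of a few national formats (illustrative, not exhaustive).
POSTAL_CODE_PATTERNS = {
    'AT': re.compile(r'\d{4}'),                    # Austria: 4 digits
    'DE': re.compile(r'\d{5}'),                    # Germany: 5 digits
    'US': re.compile(r'\d{5}(-\d{4})?'),           # USA: ZIP or ZIP+4
    'NL': re.compile(r'\d{4} ?[A-Z]{2}'),          # Netherlands: 4 digits + 2 letters
    'CA': re.compile(r'[A-Z]\d[A-Z] ?\d[A-Z]\d'),  # Canada: A1A 1A1
}


def looks_valid(country, code):
    """Return True if `code` has the rough shape used in `country`."""
    pattern = POSTAL_CODE_PATTERNS.get(country)
    if pattern is None:
        raise KeyError(f'no pattern known for country {country!r}')
    return pattern.fullmatch(code) is not None


print(looks_valid('AT', '1010'))     # Austrian code checked as Austrian
print(looks_valid('AT', '10115'))    # German code checked as Austrian
print(looks_valid('NL', '1012 AB'))
```

Even this tiny table needs a separate rule for every country, which hints at why a single generic parser does not get very far.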

I first tried to scrape Wikipedia. For this post, I will use Austria as the example.

For this, I wrote a small Python script.
Before running it, make sure to install all dependencies:

  • pip3 install lxml
  • pip3 install requests
  • pip3 install bs4

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_in_Austria'

# fire GET request and fail early on HTTP errors
response = requests.get(url, timeout=30)
response.raise_for_status()

# parse content
content = BeautifulSoup(response.text, 'lxml')

# get the text of every list item that looks like "<code> - <place>"
postcodes = [
    postcode.text for postcode in content.find_all('li')
    if ' - ' in postcode.text
]

# filter edge cases: keep only entries with 3 or 4 whitespace-separated
# tokens and take the first token as the postal code
postcodes = [
    postcode.split()[0] for postcode in postcodes
    if len(postcode.split()) in (3, 4)
]

# write output to file ('w' overwrites, so reruns do not append duplicates)
with open('at_postcodes.txt', 'w') as f:
    for postcode in postcodes:
        f.write(postcode + '\n')
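The token-count filter in the script is fragile, because it drops any place name that has more than two words. As a rough alternative, the sketch below matches a four-digit code at the start of each list item and de-duplicates the result. The function name `extract_at_postcodes` and the sample strings are made up for illustration; the real list items on Wikipedia may be formatted differently.

```python
import re

# A four-digit code at the start of the item, followed by " - " and some text.
AT_CODE_AT_START = re.compile(r'^\s*(\d{4})\s+-\s+\S')


def extract_at_postcodes(items):
    """Return the sorted, de-duplicated codes found in list-item strings."""
    codes = set()
    for text in items:
        match = AT_CODE_AT_START.match(text)
        if match:
            codes.add(match.group(1))
    return sorted(codes)


# Hypothetical list-item texts, not copied from the actual page.
sample_items = [
    '1010 - Wien',
    '4020 - Linz',
    '4020 - Linz',
    'See also - other lists',
    '5020 - Salzburg',
]
print(extract_at_postcodes(sample_items))  # ['1010', '4020', '5020']
```

This still only works for one country and one page layout, so it does not remove the need to adapt the parser per source.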

The obtained data set and the related approach might be enough for some use cases, but since I wanted to get global postal code data, I was not satisfied.

I live in Austria and realized very quickly that the data I had just scraped was not complete (some postal codes were missing). Considering the time it took me to build the parser and the fact that I would have to adapt it for every single data source (adaptations are even needed across Wikipedia, since every article is written differently), I decided to give up on this approach.
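If you have a trusted list to compare against, the gap can be measured with a plain set difference. This is only a sketch: `coverage_report` is a name I made up, and the `reference` list here is a tiny hand-written stand-in for a real, complete source.

```python
def coverage_report(scraped, reference):
    """Compare scraped codes against a reference list of codes."""
    scraped_set = set(scraped)
    reference_set = set(reference)
    missing = sorted(reference_set - scraped_set)     # in reference, not scraped
    unexpected = sorted(scraped_set - reference_set)  # scraped, not in reference
    if reference_set:
        coverage = 1 - len(missing) / len(reference_set)
    else:
        coverage = 0.0
    return missing, unexpected, coverage


# Stand-in data for illustration only.
reference = ['1010', '4020', '5020', '6020', '8010']
scraped = ['1010', '4020', '8010']

missing, unexpected, coverage = coverage_report(scraped, reference)
print(missing)             # ['5020', '6020']
print(unexpected)          # []
print(f'{coverage:.0%}')   # 60%
```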

That was when I started to look for ready-to-use solutions instead.

I hope this article will save you some time, in case you are trying to achieve the same.
