
Parsing address strings in Python: structure address text data in Python

Getting structured address components (street name, street number, street direction, city, state/province, etc.) from unstructured text

Examples of using the ez_address_parser and usaddress packages

Use-case:

Let’s say we want to extract consistent, structured address information from raw text or from somewhat inconsistent address data in our database. Users can enter addresses in many ways, which leads to inconsistencies in street names and numbers, house or unit numbers, city names, postal or zip codes, and so on. In this tutorial, we parse raw address text to break it down into its structured components. We use Python and two packages as alternative solutions.


Step 1: packages

# install the two alternative address-parsing packages (notebook syntax, run once)
!pip install ez-address-parser
!pip install usaddress
# general packages
import pandas as pd
import numpy as np
# two alternative packages for address parsing
from ez_address_parser import AddressParser
import usaddress

The specific packages to test and apply here are ez_address_parser (the main one) and usaddress.

Step 2: data input

Here is some random address data, stored as a list, that we are going to try. To use the solution here, you need to convert your input addresses to a list, as you will see in Step 4.

# random address data as string list
list1 = ['HouseA ,#123 11 something Avenue NorthWest, cityXX, CA T606X7',
'811 Roberts Drive West Bloomfield MI 48322',
'808 Kingston Ave., Macon, GA 31204', 
'792 Annadale Street, Mableton, GA 30126',
'504 Hall Ave., Urbandale, IA 50322',
'7335 Pumpkin Hill St. NorthWest, Atlanta, GA 30303',
'37 Littleton Ave., Leesburg, VA 20175']
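
If your addresses live in a database column or a spreadsheet rather than a hand-typed list, they first need to become a clean list of strings. The sketch below is one possible way to do that. `to_address_list` is a hypothetical helper name, not part of either package, and it assumes missing values arrive as `None`, `NaN`, or blank strings.

```python
import math


def to_address_list(raw_values):
    """Return a list of non-empty address strings from any iterable.

    Hypothetical helper: drops None, NaN and blank entries and trims
    surrounding whitespace so every item can be passed to a parser.
    """
    cleaned = []
    for value in raw_values:
        # skip missing values (None, or the float NaN that pandas uses for gaps)
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        text = str(value).strip()
        # skip strings that are empty after trimming
        if text:
            cleaned.append(text)
    return cleaned


# made-up input with the kinds of gaps a real column might contain
raw = ["  808 Kingston Ave., Macon, GA 31204 ", None, "", float("nan"),
       "504 Hall Ave., Urbandale, IA 50322"]
address_list = to_address_list(raw)
print(address_list)
```

With a pandas dataframe, something like `to_address_list(my_df["address"].tolist())` would produce the input list, where `my_df` and the `address` column name are placeholders for your own data.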

Method 1: ez_address_parser

Step 3: Let’s test the method

Here, we test what the address parser returns for the first value in the address list.

# initiate the parser
ap = AddressParser()
# test the address parser
result = ap.parse(list1[0])
for token, label in result:
      print(f"{token:20s} -> {label}")

[Figure: result of the test]

Not bad!

Step 4: Apply the address parser on all address data and store the results

To get the df dataframe as output, replace list1 in the code below with your own list of unstructured address strings.


# object to store all address components
df = pd.DataFrame()
# loop over the address string list
for address in list1:
    # parse each address first
    result = ap.parse(address)
    # lists to store the component values and labels of each address string
    values = []
    labels = []
    # loop over the components of each address string so they can be stored in a dataframe
    for token, label in result:
        # optional: print the outcome for each address string
        ## print(f"{token:20s} -> {label}")
        #
        # if more than one component has the same label (e.g. two street-name tokens),
        # merge the values so that each label appears only once
        # example: 'Pumpkin Hill' is stored as one street-name value, not as two columns
        if label in labels:
            # in our experimentation the repeated labels were consecutive;
            # merging into the value that already holds this label also
            # stays correct if they are not
            idx = labels.index(label)
            values[idx] = str(values[idx]) + ' ' + str(token)
            continue
        # store the component value and its label
        values.append(token)
        labels.append(label)
    # turn the components of this address into a one-row dataframe
    temp = pd.DataFrame([values], columns=labels)
    # append the row to the final dataframe
    df = pd.concat([df, temp], axis=0, ignore_index=True)

Two sidenotes:

1. In our experimental test, we noticed that this method can split one address component into two consecutive values. For instance, if there is a space in a street name, it is broken down into two street-name values. This is obviously not ideal, so the `if label in labels:` block in the code above merges them back into one value.

2. For this case, it was convenient to store the final result in a dataframe, but in each loop iteration we create two lists: one for the address component labels and one for the component values. Our goal was to build a single dataframe from those lists by appending to it inside the loop. The `temp` dataframe and the `pd.concat` call in the code above deal with that.
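
The two sidenotes above can also be sketched without growing a dataframe inside the loop. The helper below is illustrative only: `merge_repeated_labels` is a hypothetical name, and the `(token, label)` pairs are hand-written sample data in the shape that `ap.parse` returned in the test step, with label names assumed for illustration.

```python
def merge_repeated_labels(pairs):
    """Collapse (token, label) pairs into one dict per address.

    Tokens that share a label are joined with a space, so a two-word
    street name ends up as a single value under a single key.
    """
    merged = {}
    for token, label in pairs:
        if label in merged:
            # append to the value that already holds this label
            merged[label] = f"{merged[label]} {token}"
        else:
            merged[label] = str(token)
    return merged


# hand-written sample pairs (assumed label names, for illustration only)
sample_pairs = [
    ("7335", "StreetNumber"),
    ("Pumpkin", "StreetName"),
    ("Hill", "StreetName"),
    ("St", "StreetType"),
    ("Atlanta", "Municipality"),
]

# one dict per address; here the list holds a single address
rows = [merge_repeated_labels(sample_pairs)]
print(rows[0])
```

A list of such dicts, one per address, can be passed to `pd.DataFrame(rows)` in a single call. Labels that are missing from an address then show up as NaN in that row.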

Finally, let’s look at the final dataframe, which breaks down and standardizes the address components for us.

Method 2: using the “usaddress” package

You can use this method instead of ez_address_parser. The two differ in how they label the final result and in performance. In our experimentation, ez_address_parser was easier to work with and addressed our needs, but of course it could be different for your use case.

The main difference when adapting the snippet above is that usaddress returns its result as an OrderedDict rather than a plain dict or list. So you need to pull the labels and values out of that object and use them in place of the labels and values lists in the loop above, so that the result is stored the same way as in Step 4.
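
As a rough sketch of that swap: to our knowledge, `usaddress.tag` returns a pair made of a label-to-value mapping and an address type, and it can raise `usaddress.RepeatedLabelError` when it cannot merge repeated labels. The helper names below are hypothetical, and the import sits inside the function so the conversion helper can be tried without the package installed.

```python
from collections import OrderedDict


def tagged_to_lists(tagged):
    """Split a label -> value mapping into the labels and values lists used in Step 4."""
    labels = list(tagged.keys())
    values = list(tagged.values())
    return labels, values


def tag_address(address):
    """Tag one address with usaddress and return (labels, values).

    Falls back to empty lists when usaddress reports repeated labels
    it cannot merge; handle that case however suits your data.
    """
    import usaddress  # requires: pip install usaddress

    try:
        tagged, _address_type = usaddress.tag(address)
    except usaddress.RepeatedLabelError:
        return [], []
    return tagged_to_lists(tagged)


# hand-written mapping in the shape usaddress.tag is expected to return
example = OrderedDict([
    ("AddressNumber", "808"),
    ("StreetName", "Kingston"),
    ("StreetNamePostType", "Ave."),
    ("PlaceName", "Macon"),
    ("StateName", "GA"),
    ("ZipCode", "31204"),
])
labels, values = tagged_to_lists(example)
print(labels)
print(values)
```

Inside the Step 4 loop, `labels, values = tag_address(address)` would then take the place of the inner loop over `result`.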
