Finding The Same (Misspelled) Name Using Python/NLTK

Published on Friday, September 13, 2013

I have been meaning to play around with the Natural Language Toolkit (NLTK) for quite some time, but I had been waiting for a chance to experiment with it and actually create some value (as opposed to just playing with it). A suitable use case appeared this week: matching strings. In particular, matching two different lists, each containing many thousands of names.

To give you an example, let's say you had two lists of names, with some of the names spelled incorrectly in one of them:

List 1:
Leonard Hofstadter
Sheldon Cooper
Penny
Howard Wolowitz
Raj Koothrappali
Leslie Winkle
Bernadette Rostenkowski
Amy Farrah Fowler
Stuart Bloom
Alex Jensen
Barry Kripke

List 2:
Leonard Hofstadter
Sheldon Coopers
Howie Wolowits
Rav Toothrapaly
Ami Sarah Fowler
Stu Broom
Alexander Jensen

This could easily occur if somebody was typing in the lists manually, if names were dictated over the phone, or if people spelled their names differently at different times (e.g. Phil vs. Phillip).

If we wanted to match people on List 1 to List 2, how could we go about that? For a small list like this you can just look and see, but with many thousands of people, something more sophisticated would be useful. One tool could be NLTK's edit_distance function. The following Python script shows how easy this is:

import nltk

list_1 = ['Leonard Hofstadter', 'Sheldon Cooper', 'Penny', 'Howard Wolowitz', 'Raj Koothrappali', 'Leslie Winkle', 'Bernadette Rostenkowski', 'Amy Farrah Fowler', 'Stuart Bloom', 'Alex Jensen', 'Barry Kripke']

list_2 = ['Leonard Hofstadter', 'Sheldon Coopers', 'Howie Wolowits', 'Rav Toothrapaly', 'Ami Sarah Fowler', 'Stu Broom', 'Alexander Jensen']

for person_1 in list_1:
    for person_2 in list_2:
        print(nltk.metrics.edit_distance(person_1, person_2), person_1, person_2)

0 Leonard Hofstadter Leonard Hofstadter
15 Leonard Hofstadter Sheldon Coopers
14 Leonard Hofstadter Howie Wolowits
15 Leonard Hofstadter Rav Toothrapaly
14 Leonard Hofstadter Ami Sarah Fowler
16 Leonard Hofstadter Stu Broom
15 Leonard Hofstadter Alexander Jensen
14 Sheldon Cooper Leonard Hofstadter
1 Sheldon Cooper Sheldon Coopers
13 Sheldon Cooper Howie Wolowits
13 Sheldon Cooper Rav Toothrapaly
12 Sheldon Cooper Ami Sarah Fowler
11 Sheldon Cooper Stu Broom
12 Sheldon Cooper Alexander Jensen
16 Penny Leonard Hofstadter
13 Penny Sheldon Coopers
13 Penny Howie Wolowits
14 Penny Rav Toothrapaly
16 Penny Ami Sarah Fowler
9 Penny Stu Broom
13 Penny Alexander Jensen
11 Howard Wolowitz Leonard Hofstadter
13 Howard Wolowitz Sheldon Coopers
4 Howard Wolowitz Howie Wolowits
15 Howard Wolowitz Rav Toothrapaly
13 Howard Wolowitz Ami Sarah Fowler
13 Howard Wolowitz Stu Broom
14 Howard Wolowitz Alexander Jensen
16 Raj Koothrappali Leonard Hofstadter
14 Raj Koothrappali Sheldon Coopers
16 Raj Koothrappali Howie Wolowits
4 Raj Koothrappali Rav Toothrapaly
14 Raj Koothrappali Ami Sarah Fowler
14 Raj Koothrappali Stu Broom
16 Raj Koothrappali Alexander Jensen
14 Leslie Winkle Leonard Hofstadter
13 Leslie Winkle Sheldon Coopers
11 Leslie Winkle Howie Wolowits
14 Leslie Winkle Rav Toothrapaly
14 Leslie Winkle Ami Sarah Fowler
12 Leslie Winkle Stu Broom
12 Leslie Winkle Alexander Jensen
17 Bernadette Rostenkowski Leonard Hofstadter
18 Bernadette Rostenkowski Sheldon Coopers
18 Bernadette Rostenkowski Howie Wolowits
19 Bernadette Rostenkowski Rav Toothrapaly
20 Bernadette Rostenkowski Ami Sarah Fowler
20 Bernadette Rostenkowski Stu Broom
17 Bernadette Rostenkowski Alexander Jensen
15 Amy Farrah Fowler Leonard Hofstadter
14 Amy Farrah Fowler Sheldon Coopers
15 Amy Farrah Fowler Howie Wolowits
14 Amy Farrah Fowler Rav Toothrapaly
3 Amy Farrah Fowler Ami Sarah Fowler
14 Amy Farrah Fowler Stu Broom
13 Amy Farrah Fowler Alexander Jensen
15 Stuart Bloom Leonard Hofstadter
12 Stuart Bloom Sheldon Coopers
12 Stuart Bloom Howie Wolowits
14 Stuart Bloom Rav Toothrapaly
13 Stuart Bloom Ami Sarah Fowler
4 Stuart Bloom Stu Broom
14 Stuart Bloom Alexander Jensen
15 Alex Jensen Leonard Hofstadter
12 Alex Jensen Sheldon Coopers
13 Alex Jensen Howie Wolowits
15 Alex Jensen Rav Toothrapaly
13 Alex Jensen Ami Sarah Fowler
10 Alex Jensen Stu Broom
5 Alex Jensen Alexander Jensen
15 Barry Kripke Leonard Hofstadter
13 Barry Kripke Sheldon Coopers
13 Barry Kripke Howie Wolowits
12 Barry Kripke Rav Toothrapaly
13 Barry Kripke Ami Sarah Fowler
10 Barry Kripke Stu Broom
14 Barry Kripke Alexander Jensen
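With many thousands of names, nobody will read a listing like the one above, so a natural next step is to keep only the closest candidate for each name. The following is a minimal sketch of that idea, not part of the original script. To stay self-contained it uses a small pure-Python Levenshtein function as a stand-in for nltk.metrics.edit_distance, and the best_match helper and its max_distance cut-off of 5 are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete and substitute all cost 1)."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1,     # delete char_a
                               current[j - 1] + 1,  # insert char_b
                               substitution))
        previous = current
    return previous[-1]


def best_match(name, candidates, max_distance=5):
    """Return (closest candidate, distance), or None if nothing is close enough.

    max_distance is a hypothetical cut-off; tune it for your own data.
    """
    if not candidates:
        return None
    closest = min(candidates, key=lambda c: edit_distance(name, c))
    distance = edit_distance(name, closest)
    return (closest, distance) if distance <= max_distance else None


list_1 = ['Leonard Hofstadter', 'Sheldon Cooper', 'Penny', 'Howard Wolowitz',
          'Raj Koothrappali', 'Leslie Winkle', 'Bernadette Rostenkowski',
          'Amy Farrah Fowler', 'Stuart Bloom', 'Alex Jensen', 'Barry Kripke']
list_2 = ['Leonard Hofstadter', 'Sheldon Coopers', 'Howie Wolowits',
          'Rav Toothrapaly', 'Ami Sarah Fowler', 'Stu Broom', 'Alexander Jensen']

for person_1 in list_1:
    # Prints one line per name: the best candidate, or None for no match
    print(person_1, '->', best_match(person_1, list_2))
```

Names such as Penny, whose nearest candidate is still 9 edits away, come back as None instead of being paired with an unrelated person.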

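The first number on each output line is the Levenshtein distance between the two names: the number of single-character insertions, deletions and substitutions needed to turn one into the other. Another option is to convert the distance into a ratio, where 1.0 means an exact match, and only print the pairs above a threshold.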

# Note: len1 and len2 count the names in each list (11 and 7),
# so lensum is the same constant (18) for every pair.
len1 = len(list_1)
len2 = len(list_2)
lensum = len1 + len2
for person_1 in list_1:
    for person_2 in list_2:
        levdist = nltk.metrics.edit_distance(person_1, person_2)
        nltkratio = (float(lensum) - float(levdist)) / float(lensum)
        # Only show pairs similar enough to be likely matches
        if nltkratio > 0.70:
            print('%.3f' % nltkratio, person_1, person_2)

1.000 Leonard Hofstadter Leonard Hofstadter
0.944 Sheldon Cooper Sheldon Coopers
0.778 Howard Wolowitz Howie Wolowits
0.778 Raj Koothrappali Rav Toothrapaly
0.833 Amy Farrah Fowler Ami Sarah Fowler
0.778 Stuart Bloom Stu Broom
0.722 Alex Jensen Alexander Jensen
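One caveat: lensum above is built from the number of names in the two lists (11 + 7 = 18), not from the lengths of the names being compared, so the 0.70 cut-off amounts to a fixed limit of 5 edits regardless of how long the names are. A common alternative is to divide by the combined length of the two strings, so that 5 edits count for more in a short name than in a long one. The sketch below illustrates that variant using distances taken from the first listing. The length_ratio function is a hypothetical helper, not something NLTK provides.

```python
def length_ratio(name_1, name_2, distance):
    """Scale an edit distance by the combined length of the two names."""
    total = len(name_1) + len(name_2)
    if total == 0:
        return 1.0  # two empty strings are identical
    return (total - distance) / total


# (name from List 1, name from List 2, edit distance from the first listing)
pairs = [
    ('Sheldon Cooper', 'Sheldon Coopers', 1),
    ('Alex Jensen', 'Alexander Jensen', 5),
    ('Penny', 'Stu Broom', 9),
]

for person_1, person_2, levdist in pairs:
    print('%.3f' % length_ratio(person_1, person_2, levdist), person_1, person_2)
```

With this scaling the Penny / Stu Broom pair drops to about 0.36, compared with 0.5 under the list-length version, so unrelated short names are pushed further from the threshold.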

Sydney's Education Levels Mapped

Published on Sunday, September 8, 2013

I was talking with my wife about what education levels might look like across Sydney, so she challenged me to map it. The map below is my first draft.

The map was derived by combining three datasets from the Australian Bureau of Statistics (ABS - a department releasing some great datasets). The first dataset was the spatial data for "SA2" level boundaries, the second the population data for various geographic areas, and the third the 2011 Census data on Non-School Qualification Level of Education (e.g. Certificates, Diplomas, Masters, Doctorates). I added up all people with a bachelor's degree or higher in each SA2 region, and then divided that number by the total number of people in that region. A different methodology could have been used.
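As a rough sketch of that calculation, the snippet below sums the bachelor-and-above counts for each region and divides by the region's total population. Everything in it is an assumption for illustration: the region names, the qualification labels and all of the numbers are made up and are not ABS data.

```python
# Hypothetical qualification counts per region (NOT real ABS figures)
qualifications = {
    'Region A': {'Certificate': 900, 'Diploma': 600, 'Bachelor Degree': 1200,
                 'Graduate Diploma': 200, 'Postgraduate Degree': 400},
    'Region B': {'Certificate': 1100, 'Diploma': 300, 'Bachelor Degree': 150,
                 'Graduate Diploma': 20, 'Postgraduate Degree': 30},
}

# Hypothetical total population per region
population = {'Region A': 6000, 'Region B': 5000}

# Labels treated as "bachelor's degree or higher" in this sketch
BACHELOR_OR_HIGHER = {'Bachelor Degree', 'Graduate Diploma', 'Postgraduate Degree'}


def higher_education_share(region):
    """Share of a region's whole population holding a bachelor's or higher."""
    counts = qualifications[region]
    qualified = sum(n for level, n in counts.items() if level in BACHELOR_OR_HIGHER)
    return qualified / population[region]


for region in sorted(qualifications):
    print(region, '%.0f%%' % (100 * higher_education_share(region)))
```

Swapping the denominator for a narrower group, such as a single age band, only requires changing the population figures (and restricting the qualification counts to the same group).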

EDIT: I should have paid more attention to mapping education levels. I mapped the percentage of overall population, but should have mapped the percentage of 25 to 34 year olds, as this would have aligned to various government metrics.

Reported education levels differ vastly by region, e.g. "North Sydney - Lavender Bay" (40%) vs. "Bidwill - Hebersham - Emerton" (3%). It is interesting to look at the different urban density levels of the areas, as well as the commute times to the nearest centre.

Without trying to sound too elitist, I was hoping to use this map to guide me on where to consider buying our next property (i.e. looking for a well-educated, clean area with decent schools and frequent public transport). It was interesting to discover that the SA2 region we currently live in has the second highest percentage in NSW.

Feel free to take a look at the aggregated data yourself or download it (attribution to ABS for source datasets).

