Free Splunk Hosting

I first used Splunk about 10 years ago after an old colleague installed it on a computer in the corner, and ever since then I have preached about it. If you have log data, of any kind, I’d recommend you give it a go.

The Splunk people have a few pretty good options for trying Splunk out: you can use either Splunk Storm or Splunk Free. The first option is obviously hosted and comes with a generous storage allowance, but it does not allow long-term storage of data. I send system log data to Splunk Storm.

However, what if you don’t have a lot of data, but you want to keep that data forever? After reading Ed Hunsinger’s Go Splunk Yourself entry about using it for Quantified Self data, I knew I had to do the same.

From personal experience, Splunk requires at least 1GB of memory to even start. You can probably get it to run on less, but I haven’t had much success. This leaves two options: look on Low End Box for a VPS with enough memory (as cheap as $5/month), or use OpenShift. Red Hat generously provides three “gears” to host applications for free, each with 1GB of memory. I have sort of a love-hate relationship with OpenShift, maybe a bit like using OAuth. Red Hat calls OpenShift the “Open Hybrid Cloud Application Platform”, and I can attest that it lives up to the name. They provide a way to bundle an application stack and push it into production without fussing over infrastructure, or even over provisioning and managing the application. It feels like what would happen if Google App Engine and Amazon’s EC2 had a child. Heroku or dotCloud might be its closest alternatives.

Anyway, this isn’t a review of OpenShift (although it would be a positive one), but rather a guide to using OpenShift to host Splunk. I first installed Splunk in a gear using Nginx as a proxy, and it worked. However, this felt overly complex, and after one of my colleagues started working on installing Splunk in a cartridge, I eventually agreed that this would be the way to go. The result was a Splunk cartridge that can be installed inside any existing gear. Here are the instructions; you need an OpenShift account, obviously. The install should take less than ten clicks of your mouse, and one copy/paste.

From the cartridge’s GitHub README:

  1. Create an Application based on an existing web framework. If in doubt, just pick “Do-It-Yourself 0.1” or “Python 2.7”
  2. Click on “Continue to the application overview page.”
  3. On the Application page, click on “Or, see the entire list of cartridges you can add”.
  4. Under “Install your own cartridge” enter the following URL: https://raw.github.com/kelvinn/openshift-splunk-cartridge/master/metadata/manifest.yml
  5. Click “Next” and then “Add Cartridge”. Wait a few minutes for Splunk to download and install.
  6. Log on to Splunk at: https://your-app.rhcloud.com/ui

More details can be read on the cartridge’s GitHub page, and I would especially direct you to the limitations of this configuration. This will all stop working if Splunk makes the installer file unavailable, but I will deal with that when the time comes. Feel free to alert me if this happens.

Finding The Same (Misspelled) Name Using Python/NLTK

I have been meaning to play around with the Natural Language Toolkit for quite some time, but I had been waiting for a time when I could experiment with it and actually create some value (as opposed to just play with it). A suitable use case appeared this week: matching strings. In particular, matching two different lists of many, many thousands of names.

To give you an example, let’s say you had two lists of names, but with the name spelled incorrectly in one list:

List 1:
Leonard Hofstadter
Sheldon Cooper
Penny
Howard Wolowitz
Raj Koothrappali
Leslie Winkle
Bernadette Rostenkowski
Amy Farrah Fowler
Stuart Bloom
Alex Jensen
Barry Kripke

List 2:
Leonard Hofstadter
Sheldon Coopers
Howie Wolowits
Rav Toothrapaly
Ami Sarah Fowler
Stu Broom
Alexander Jensen

This could easily occur if somebody were manually typing in the lists, dictating names over the phone, or spelling their name differently (e.g. Phil vs. Phillip) at different times.

If we wanted to match people on List 1 to List 2, how could we go about that? For a small list like this you can just look and see, but with many thousands of people, something more sophisticated would be useful. One tool could be NLTK’s edit_distance function. The following Python script displays how easy this is:

import nltk
 
list_1 = ['Leonard Hofstadter', 'Sheldon Cooper', 'Penny', 'Howard Wolowitz', 'Raj Koothrappali', 'Leslie Winkle', 'Bernadette Rostenkowski', 'Amy Farrah Fowler', 'Stuart Bloom', 'Alex Jensen', 'Barry Kripke']
 
list_2 = ['Leonard Hofstadter', 'Sheldon Coopers', 'Howie Wolowits', 'Rav Toothrapaly', 'Ami Sarah Fowler', 'Stu Broom', 'Alexander Jensen']
 
for person_1 in list_1:
    for person_2 in list_2:
        print nltk.metrics.edit_distance(person_1, person_2), person_1, person_2

And we get this output:

0 Leonard Hofstadter Leonard Hofstadter  
15 Leonard Hofstadter Sheldon Coopers  
14 Leonard Hofstadter Howie Wolowits  
15 Leonard Hofstadter Rav Toothrapaly  
14 Leonard Hofstadter Ami Sarah Fowler  
16 Leonard Hofstadter Stu Broom  
15 Leonard Hofstadter Alexander Jensen  
14 Sheldon Cooper Leonard Hofstadter  
1 Sheldon Cooper Sheldon Coopers  
13 Sheldon Cooper Howie Wolowits  
13 Sheldon Cooper Rav Toothrapaly  
12 Sheldon Cooper Ami Sarah Fowler  
11 Sheldon Cooper Stu Broom  
12 Sheldon Cooper Alexander Jensen  
16 Penny Leonard Hofstadter  
13 Penny Sheldon Coopers  
13 Penny Howie Wolowits  
14 Penny Rav Toothrapaly  
16 Penny Ami Sarah Fowler  
9 Penny Stu Broom  
13 Penny Alexander Jensen  
11 Howard Wolowitz Leonard Hofstadter  
13 Howard Wolowitz Sheldon Coopers  
4 Howard Wolowitz Howie Wolowits  
15 Howard Wolowitz Rav Toothrapaly  
13 Howard Wolowitz Ami Sarah Fowler  
13 Howard Wolowitz Stu Broom  
14 Howard Wolowitz Alexander Jensen  
16 Raj Koothrappali Leonard Hofstadter  
14 Raj Koothrappali Sheldon Coopers  
16 Raj Koothrappali Howie Wolowits  
4 Raj Koothrappali Rav Toothrapaly  
14 Raj Koothrappali Ami Sarah Fowler  
14 Raj Koothrappali Stu Broom  
16 Raj Koothrappali Alexander Jensen  
14 Leslie Winkle Leonard Hofstadter  
13 Leslie Winkle Sheldon Coopers  
11 Leslie Winkle Howie Wolowits  
14 Leslie Winkle Rav Toothrapaly  
14 Leslie Winkle Ami Sarah Fowler  
12 Leslie Winkle Stu Broom  
12 Leslie Winkle Alexander Jensen  
17 Bernadette Rostenkowski Leonard Hofstadter  
18 Bernadette Rostenkowski Sheldon Coopers  
18 Bernadette Rostenkowski Howie Wolowits  
19 Bernadette Rostenkowski Rav Toothrapaly  
20 Bernadette Rostenkowski Ami Sarah Fowler  
20 Bernadette Rostenkowski Stu Broom  
17 Bernadette Rostenkowski Alexander Jensen  
15 Amy Farrah Fowler Leonard Hofstadter  
14 Amy Farrah Fowler Sheldon Coopers  
15 Amy Farrah Fowler Howie Wolowits  
14 Amy Farrah Fowler Rav Toothrapaly  
3 Amy Farrah Fowler Ami Sarah Fowler  
14 Amy Farrah Fowler Stu Broom  
13 Amy Farrah Fowler Alexander Jensen  
15 Stuart Bloom Leonard Hofstadter  
12 Stuart Bloom Sheldon Coopers  
12 Stuart Bloom Howie Wolowits  
14 Stuart Bloom Rav Toothrapaly  
13 Stuart Bloom Ami Sarah Fowler  
4 Stuart Bloom Stu Broom  
14 Stuart Bloom Alexander Jensen  
15 Alex Jensen Leonard Hofstadter  
12 Alex Jensen Sheldon Coopers  
13 Alex Jensen Howie Wolowits  
15 Alex Jensen Rav Toothrapaly  
13 Alex Jensen Ami Sarah Fowler  
10 Alex Jensen Stu Broom  
5 Alex Jensen Alexander Jensen  
15 Barry Kripke Leonard Hofstadter  
13 Barry Kripke Sheldon Coopers  
13 Barry Kripke Howie Wolowits  
12 Barry Kripke Rav Toothrapaly  
13 Barry Kripke Ami Sarah Fowler  
10 Barry Kripke Stu Broom  
14 Barry Kripke Alexander Jensen  

As you can see, this displays the Levenshtein distance for each pair of names. Another option is to look at a ratio.

# Continues the previous snippet (reuses nltk, list_1 and list_2).
# lensum is the total number of names across both lists (11 + 7 = 18),
# used here to normalise the edit distance into a rough similarity ratio.
len1 = len(list_1)
len2 = len(list_2)
lensum = len1 + len2
for person_1 in list_1:
    for person_2 in list_2:
        levdist = nltk.metrics.edit_distance(person_1, person_2)
        nltkratio = (float(lensum) - float(levdist)) / float(lensum)
        if nltkratio > 0.70:
            print nltkratio, person_1, person_2

The end result can be seen below:

1.0 Leonard Hofstadter Leonard Hofstadter  
0.944444444444 Sheldon Cooper Sheldon Coopers  
0.777777777778 Howard Wolowitz Howie Wolowits  
0.777777777778 Raj Koothrappali Rav Toothrapaly  
0.833333333333 Amy Farrah Fowler Ami Sarah Fowler  
0.777777777778 Stuart Bloom Stu Broom  
0.722222222222 Alex Jensen Alexander Jensen
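
To go one step further and actually pair each name in List 2 with its most likely counterpart in List 1, one approach is to keep, for each name in List 2, the List 1 name with the smallest edit distance. This is only a rough sketch: the cut-off of 5 is a guess based on the distances shown above and would need tuning for real data.

import nltk

# For each name in list_2, find the closest name in list_1 by edit distance.
# Reuses list_1 and list_2 from the snippets above; the cut-off is arbitrary.
MAX_DISTANCE = 5

for person_2 in list_2:
    best_match = min(list_1, key=lambda person_1: nltk.metrics.edit_distance(person_1, person_2))
    distance = nltk.metrics.edit_distance(best_match, person_2)
    if distance <= MAX_DISTANCE:
        print(distance, person_2, '->', best_match)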

Sydney's Education Levels Mapped

I was talking to a friend about what education levels might look like across Sydney, and they challenged me to map it.

The map was derived by combining three datasets from the Australian Bureau of Statistics (ABS – a department that releases some great datasets). The first dataset was the spatial data for “SA2” level boundaries, the second was population data for various geographic areas, and the third came from the 2011 Census data on Non-School Qualification Level of Education (e.g. Certificates, Diplomas, Masters, Doctorates). I aggregated everybody with a bachelor degree or higher in each SA2 region, and then divided that number by the total number of people in that region. A different methodology could have been used.
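
As a rough sketch of that aggregation (using pandas, with made-up file and column names rather than the actual ABS field names), the calculation boils down to something like this:

import pandas as pd

# Hypothetical sketch of the SA2 aggregation described above.
# File and column names are placeholders, not the real ABS identifiers.
qualifications = pd.read_csv('non_school_qualifications_by_sa2.csv')  # one row per SA2 x qualification level
population = pd.read_csv('population_by_sa2.csv')                     # one row per SA2

degree_levels = ['Bachelor Degree', 'Graduate Diploma', 'Postgraduate Degree']
degrees = (qualifications[qualifications['qualification_level'].isin(degree_levels)]
           .groupby('sa2_code')['persons'].sum()
           .rename('degree_holders'))

result = population.set_index('sa2_code').join(degrees)
result['pct_bachelor_or_higher'] = result['degree_holders'] / result['total_persons'] * 100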

EDIT: I should have paid more attention to mapping education levels. I mapped the percentage of overall population, but should have mapped the percentage of 25 to 34 year olds, as this would have aligned to various government metrics.

Reported education levels differ vastly by region, e.g. “North Sydney - Lavender Bay” (40%) vs. “Bidwell - Hebersham - Emerton” (3%). It is interesting to look at the different urban density levels of the areas, as well as the commute times to the nearest centre.

Without trying to sound too elitist, I was hoping to use this map to guide me where to consider moving (i.e. looking for a well educated, clean area with decent schools and frequent public transport). It was interesting to discover that the SA2 region I currently live in has the second highest percentage in NSW.

Sydney Commute Times Mapped Part 2

EDIT 12-03-2025: I accidentally broke the maps when deleting my AWS account, as the mbtiles were hosted there. Oops.

In Sydney Commute Times Mapped Part 1 I took a small step towards a bigger goal of mashing together public transport data for Sydney and the Metropolitan Strategy for Sydney to 2031. The question I wanted to answer is this: how well aligned are Sydney’s public transport infrastructure and the Metropolitan Strategy’s vision of a “city of cities”?

I decided to find out.

Thanks to the release of GTFS data by 131500 it is possible to visualise how long it takes via public transport to commute to the nearest “centre”.

Cities and Corridors - Metropolitan Strategy for Sydney to 2031

The Australian Bureau of Statistics collects data based on “mesh blocks”, areas each containing roughly 50 dwellings. Last week I had some fun mapping the mesh blocks, as well as looking at Sydney’s urban densities. These mesh blocks are a good size to use when calculating commute times.

The simplified process I used was this, for the technically minded (a rough sketch of step 3 follows the list):

  1. Calculate the centre of each mesh block
  2. Calculate the commute time via public transport from each block to every “centre” (using 131500’s GTFS and OpenTripPlanner’s Analyst tool)
  3. Import the times into a database and calculate the lowest commute time from each mesh block to its nearest centre
  4. Visualise in TileMill
  5. Serve tiles in TileStache and visualise with Leaflet
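
To illustrate step 3, the “lowest commute time” part is essentially one GROUP BY over the exported travel times. A minimal sketch, assuming an SQLite database and made-up table and column names:

import sqlite3

# Step 3 only: given one row per (mesh block, centre, travel time) pairing,
# keep the minimum travel time for each mesh block.
# The database, table and column names are illustrative, not my actual schema.
conn = sqlite3.connect('commute_times.db')
query = """
    SELECT mesh_block_id, MIN(travel_time_minutes) AS minutes_to_nearest_centre
    FROM commute_times
    GROUP BY mesh_block_id
"""
for mesh_block_id, minutes in conn.execute(query):
    print(mesh_block_id, minutes)
conn.close()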

The first map I created simply indicates how long it takes to reach the nearest centre. Accessibility appears to drop off rapidly on the fringe of Sydney. I was also surprised by what appears to be a belt of higher commute times stretching from Wetherill Park all the way to Marrickville. There also appears to be poorer accessibility in parts of Western Sydney. It is worth noting that I offer no guarantee of the integrity of the data in these maps, and I have seen a few spots where the commute times increase significantly in adjacent mesh blocks. This tells me the street data (from OpenStreetMap) might not be connected correctly.

My next map shows what areas are within 30 minutes.

These maps were both created using open data and open source tools, which I find quite neat.

I have been interested in mapping traffic for a number of years, maybe ever since arriving in Sydney. It is sort of a hobby; I find making maps relaxing. My first little map was way back in 2008, when I visualised speed from a GPS unit. A little later I added some colour to the visualisations, and then used this as an excuse to create a little GUI for driving speed. My interest in visualising individual vehicles has decreased recently, as it has shifted to mapping wider systems. Have an idea you would like to see mapped? Leave a note in the comments.

Quantified Self Interview

YS and I were recently interviewed about self-tracking and Quantified Self by one of the major news channels in Australia. I will reflect on the experience after the show has aired, but it was an overall great experience. We have a new respect for filming what may ultimately be just a two minute segment. Depending on how the editing is done it will either provoke the hosts to contemplate the value of a data-centric macroscopic view of the world, or give them lots of fodder.

That said, as you would expect, I had to track my heart rate during the interview – see below. My interpretation is that my heart rate jumped at the start of every question, and dropped as I answered it. It also dropped when the interview finished. I wish I had a more expensive heart rate monitor (e.g. Zephyr BioHarness or Scanadu) that tracked skin temperature and breathing. My hands felt cold by the end.


Coffee, Beer, Wine and Time of Day

One of the things I like about Tableau, a piece of software for visualising data, is that it aggregates on dates really well. Below is a spread of beer / wine / coffee over 18 months, grouped by the hour of day each drink fell in. You can see some trends: I usually consume coffee in the morning, and I usually drink alcohol after 17:00. There are exceptions, of course, like that beer I had at 10AM, and that coffee I had at 1AM.
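
The grouping itself would only be a few lines of pandas if I wanted to reproduce it outside Tableau. A minimal sketch, assuming a simple export with a timestamp and a drink type (the file and column names are made up):

import pandas as pd

# Count drinks by type and hour of day, roughly what the Tableau view shows.
# 'drinks.csv' and its columns are placeholders for my actual export.
drinks = pd.read_csv('drinks.csv', parse_dates=['timestamp'])  # columns: timestamp, drink
by_hour = (drinks.assign(hour=drinks['timestamp'].dt.hour)
                 .groupby(['hour', 'drink'])
                 .size()
                 .unstack(fill_value=0))
print(by_hour)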

Some QS Numbers

There is the possibility I will be giving an interview on the Quantified Self “movement”. What follows is a brief summary of QS, the things I track, and some pretty charts.

What is Quantified Self

I suppose it depends on who you talk to. Wikipedia states that it is “a movement to incorporate technology into data acquisition on aspects of a person’s daily life in terms of inputs”, but I side more on the idea that the movement is “a collaboration of users and tool makers who share an interest in self knowledge through self-tracking.” It is at this point that it is probably important to interject that most people are self-trackers: weight, height, reps at the gym, hours worked, and so forth. If you have ever made a goal, you probably tracked how you could reach it. What makes us QS folk a bit different is that we tend to track lots of things, correlate between them, and share our results. So, with this theme, let me share what I track.

What I Track, and How

This is a list of some of the things I track, and the tools I use to do so.

  • Weight / Body Fat / Temperature / Measurements -> scales, callipers, ear thermometer
  • Resting Heart Rate -> oximeter
  • Drinks (wine, beer, coffee – and previously water) -> Android app (bespoke)
  • Drugs and vitamins -> Android app (bespoke)
  • Various conditions (headaches, “colds”, itchiness, nausea, sore throats, “the runs”) -> Android app (bespoke)
  • Finances (family) -> Android app (TOSHL)
  • Start/Stop times of work -> Excel…
  • Mood (Terrible to Great) -> Android app (How Are You Feeling)
  • Indoor air quality (not really QS) -> various sensors
  • Computer activity (Keystrokes / mouse clicks / mouse movement) -> WorkRave
  • Location -> Google Latitude
  • Steps & sleep -> Fitbit
  • Fitness -> Android app (Sports Tracker) and a Zephyr Bluetooth Heart Rate Monitor
  • Health History -> Microsoft HealthVault
  • Photo every day -> Android app (PhotoChron)

You can see that this list seems utterly normal, but still gives me enough to work with to start forming a macroscopic view of life.

A Few Charts

I created these using Tableau, a fabulous piece of software for putting meaning behind numbers. These are not good examples of what the software is capable of, but it is the quickest way for me to visualise them.

I like coffee. It is, in all honesty, a drug. There have been times (I could probably find the date!) when I went from two cups a day to none, and I had withdrawals (headaches and nausea). I track the amount of coffee I consume to remind myself to not get into the habit of having two cups/day for too long. It is also bad for my stomach.

If I chart the days of the week on which I drank coffee over the last 18 months, it turns out I drink the most coffee on Saturday.

I also enjoy an alcoholic drink from time to time, but was told in January to cut back (for my stomach’s sake).

I track both beer and wine consumption. I have managed to cut back on wine, but not so much on beer.

This is probably because I tend to have beer when I go out with work colleagues or friends, but wine at home. It appears to have been easier to stop drinking with dinner than when out.

For the last two years I have usually been wearing a Fitbit, and using it to “track” my sleep.

It looks like I averaged about 7500 steps/day, yet started walking more in January of this year. Walking more was not a New Year’s resolution. In May I broke the clip on my Fitbit, but a friend was kind enough to give me theirs as a replacement. I should walk more.

I should also sleep more. It appears as though maybe, just maybe, I am starting to sleep more. My average is about 7.5hr/night. This is one area I would like to experiment more with.

I have also started tracking happiness on a simple Terrible -> Great! scale.

This graph shows my average happiness on a weekly basis for the last ~8 months. We could conclude that I am getting happier, and was really unhappy around Christmas.

And here we have my happiness levels grouped by day of the week. We could conclude that I am, on average, the most content on a Sunday. I would like to believe it is just a coincidence that Sunday is both my most content day and the day I drink the least coffee.

This is the standard deviation of my happiness tracking on a monthly basis. It looks like I am also getting less moody.

And finally, weight. Nothing interesting here. I need to get back down to 77 kg, which is a more natural weight for me. I use a normal scale, so I only record a reading every few months – if I had a wi-fi scale, I would be able to record much more frequently.

Final Thoughts

In the last ~18 months I have become happier and less moody, with Sunday being my happiest day, and Monday and Wednesday my least content. I have put on three kilograms. I drink the most coffee on Saturday and the least on Sunday, and have been able to drink less wine, but keep drinking the same amount of beer.

By looking at this evaluation I know I should probably start to incorporate a lunchtime walk into my daily routine, and stop drinking coffee on one day of the weekend. I should also drink my beer at a slower pace when I’m out, as this will prevent me from buying more than one, or, even harder to resist, friends and colleagues buying it for me.

Finally: I know none of the charts have a title. Read the text.

Sydney Commute Times Mapped Part 1

EDIT 12-03-2025: I accidentally broke the maps when deleting my AWS account, as the mbtiles were hosted there. Oops.

I quite like open data. I like data based on open standards (or mostly open standards) even better. Many transport operators around the world have started releasing their timetable data using (mostly) open standards, e.g. GTFS. One of the nice things about using a standard is that clever people have created tools to work with the timetable data, and those tools can now be used to manipulate timetable data from hundreds of agencies. The magnificent OpenTripPlanner is one such tool, and it works well with 131500’s GTFS data.

New South Wales Planning & Infrastructure have released a draft plan for how they hope to shape Sydney’s growth, which is where they detail the idea of a “city of cities”. I thought it would be interesting to mash these smaller “cities” with 131500’s transport data, and then display a map with the shortest commute to the nearest city. Various cities, I believe including Melbourne, have goals of re-achieving a “20-minute” city, or something similar (i.e. X% of the population can reach X% of the city within X minutes).

This map is the first stage. It only displays the commute time to St Leonards from every Mesh Block in the greater Sydney area. I used the open source tool OpenTripPlanner to compute the commute times, with OpenStreetMap data to support walking distances. The next map I release will probably have all the regional cities, and a similarly styled map depicting the time to the nearest “centre”.

Mapping Mesh Blocks with TileMill

This quick tutorial details how to prepare the ABS Mesh Blocks to be used with MapBox’s TileMill. Installing PostgreSQL, PostGIS and TileMill is beyond the scope of this tutorial; there is plenty of documentation on how to do these tasks.

First, we create a database to import the shapefile and population data into:

Using ‘psql’ or ‘SQL Query’, create a new database:

CREATE DATABASE transport WITH TEMPLATE postgis20 OWNER postgres;
-- Query returned successfully with no result in 5527 ms.

It is necessary to first import the Mesh Block spatial file using something like PostGIS Loader.

We then create a table to import the Mesh Block population data:

CREATE TABLE tmp_x (id character varying(11), Dwellings numeric, Persons_Usually_Resident numeric);

And then load the data:

COPY tmp_x FROM '/home/kelvinn/censuscounts_mb_2011_aust_good.csv' DELIMITERS ',' CSV HEADER;

It is possible to import the GIS information and view it in QGIS:

Now that we know the shapefile was imported correctly we can merge the population with spatial data. The following query is used to merge the datasets:

UPDATE mb_2011_nsw
SET    dwellings = tmp_x.dwellings FROM tmp_x
WHERE  mb_2011_nsw.mb_code11 = tmp_x.id;

UPDATE mb_2011_nsw
SET    pop = tmp_x.persons_usually_resident FROM tmp_x
WHERE  mb_2011_nsw.mb_code11 = tmp_x.id;

We can do a rough validation by using this query:

SELECT sum(pop) FROM mb_2011_nsw;

And we get 6916971, which is about right (ABS has the 2011 official NSW population of 7.21 million).

Finally, using TileMill, we can connect to the PostGIS database and apply some themes to the map.

host=127.0.0.1 user=MyUsername password=MyPassword dbname=transport
(SELECT * from mb_2011_nsw JOIN westmead_health on mb_2011_nsw.mb_code11 = westmead_health.label) as mb

After generating the MBTiles file I pushed it to my little $15/year VPS and used TileStache to serve the tiles and UTFGrids. The TileStache configuration I am using looks something like this:

{
  "cache": {
    "class": "TileStache.Goodies.Caches.LimitedDisk.Cache",
    "kwargs": {
        "path": "/tmp/limited-cache",
        "limit": 16777216
    }
  },
  "layers": 
  {
    "NSWUrbanDensity":
    {
        "provider": {
            "name": "mbtiles",
            "tileset": "/home/user/mbtiles/NSWUrbanDensity.mbtiles"
        }
    },
    "NSWPopDensity":
    {
        "provider": {
            "name": "mbtiles",
            "tileset": "/home/user/mbtiles/NSWPopDensity.mbtiles"
        }
    }
  }
}

Mapping Urban Density in Sydney

EDIT 12-03-2025: I broke the maps when I deleted my AWS account, forgetting that it was hosting the mbtiles.

Five years ago I started exploring different mapping technologies by detailing instructions on installing Mapnik and mod_tile. Times have changed significantly in the last five years, thanks in large part to the products offered by MapBox. After playing with TileMill, MBTiles, Leaflet and UTFGrids, it is great to see how many annoyances have been fixed by MapBox. I find making maps enjoyable now, as I no longer need to worry about patching code just to get it to run, or mucking about with oddities in web browsers.

Each night this week I have created a new map using Mesh Block spatial data from the Australian Bureau of Statistics (Mesh Blocks are the smallest area used when conducting surveys). I am thankful to live in a country that provides a certain amount of open data, and the ABS should be applauded for the amount of data they provide. They provide spatial data about Mesh Blocks, as well as population counts for this spatial data. It is relatively easy to merge the two and then visualise them using TileMill.

First up - population density of Sydney, i.e. persons reported to be living in each mesh block. Darker red indicates a higher population count.

I find it interesting to see how many people live in certain Mesh Blocks. You will notice that Mesh Blocks with high population levels tend to be nearer public transport - either major roads with frequent bus service, or train stations.

We can look at urban density by calculating dwellings per hectare for each Mesh Block. The definitions I used for urban density come from Ann Forsyth in “Measuring Density: Working Definitions for Residential Density and Building Intensity” (pdf). Ann discusses the need to consider net or gross densities, depending on the type of land use. At the Mesh Block level the land use type appears to be singular: Industrial, Parkland, Commercial, Residential, and Transport. Because the land use type was generally singular I have not adjusted to gross/net, but simply used Ann’s density bands:

  • Very low density: fewer than 11 dw/ha
  • Low density: 11-22 dw/ha
  • Medium density: 23-45 dw/ha
  • High density: more than 45 dw/ha

“dw/ha” is dwellings per hectare. I decided to map the four density levels, which can be relatively easily achieved using TileMill. See below for an example.
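
As a sketch of the banding logic (the thresholds are Ann Forsyth’s; the function, field names and unit conversion are just an illustration), the classification per Mesh Block is simply:

# Classify a mesh block into one of the four density bands used on the map.
# Assumes a dwelling count from the merged ABS data and an area in square metres.
def density_band(dwellings, area_m2):
    dw_per_ha = dwellings / (area_m2 / 10000.0)  # one hectare = 10,000 square metres
    if dw_per_ha < 11:
        return 'Very low density'
    elif dw_per_ha <= 22:
        return 'Low density'
    elif dw_per_ha <= 45:
        return 'Medium density'
    return 'High density'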

You can zoom in and scroll over any Mesh Block in Sydney to find out more. Additional installation information on how I did this can be found on this special page: Mapping Mesh Block Data.