Solved: slow build times from Dockerfiles with Python packages (pip)

Published on Wednesday, July 2, 2014

I have recently had the opportunity to begin exploring Docker, the currently hip way to build application containers, and I generally like it. It feels a bit like using Xen back in 2005, when you still had to download it from cl.cam.ac.uk, but there is huge momentum right now. I like the idea of breaking down each component of your application into separate services and bundling them up - it seems clean. The next year is going to be very interesting with Docker; I am especially looking forward to seeing how Google's App Engine allows Docker usage, and what's in store for the likes of Flynn, Deis, CoreOS, or Stackdock.

One element I had been frustrated with is the build time of the image that hosts a Django application I'm working on. I kept hearing about these crazy low rebuild times, but my container was taking ages to rebuild. I noticed that the build was cached up until I re-added my code, and then pip would reinstall all my packages.

It appeared as though anything after I used ADD for my code was being rebuilt, and reading online seemed to confirm this. Most of the items were very quick, e.g. "EXPOSE 80", but then it hit "RUN pip install -r requirements.txt".

There are various documented ways around this, from two Dockerfiles to just using packaged libraries. However, I found it easier to just use multiple ADD statements, and the good Docker folks have added caching for them. The idea is to ADD your requirements first, then RUN pip, and then ADD your code. This will mean that any code changes don't invalidate the pip cache.

For instance, I had something (abbreviated snippet) like this:

# Set the base image to Ubuntu
FROM ubuntu:14.04

# Update the sources list
RUN apt-get update
RUN apt-get upgrade -y

# Install basic applications
RUN apt-get install -y build-essential

# Install Python and Basic Python Tools
RUN apt-get install -y python python-dev python-distribute python-pip postgresql-client

# Copy the application folder inside the container
ADD . /app

# Get pip to download and install requirements:
RUN pip install -r /app/requirements.txt

# Expose ports
EXPOSE 80 8000

# Set the default directory where CMD will execute
WORKDIR /app

VOLUME ["/app"]

CMD ["sh", "/app/run.sh"]

And it reran pip whenever the code changed. The fix is to ADD the requirements file on its own and move the RUN pip line above the ADD for the code:

# Set the base image to Ubuntu
FROM ubuntu:14.04

# Update the sources list
RUN apt-get update
RUN apt-get upgrade -y

# Install basic applications
RUN apt-get install -y build-essential

# Install Python and Basic Python Tools
RUN apt-get install -y python python-dev python-distribute python-pip postgresql-client

ADD requirements.txt /app/requirements.txt

# Get pip to download and install requirements:
RUN pip install -r /app/requirements.txt

# Copy the application folder inside the container
ADD . /app

# Expose ports
EXPOSE 80 8000

# Set the default directory where CMD will execute
WORKDIR /app

VOLUME ["/app"]

CMD ["sh", "/app/run.sh"]
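To see why the ordering matters, here is a toy model of the build cache in Python. It is only a sketch under simplifying assumptions, not Docker's actual implementation: each step's cache key is taken to depend on the previous key, the instruction text, and (for ADD) a digest of the added files. The function names and the digest strings are made up for illustration.

```python
import hashlib


def layer_keys(steps, digests):
    """Return one cache key per build step.

    Each key chains the parent key, the instruction text and, for ADD
    steps, a digest of the files being added (a simplification).
    """
    keys, parent = [], ""
    for instruction, source in steps:
        content = digests[source] if source is not None else ""
        raw = "\n".join([parent, instruction, content])
        parent = hashlib.sha256(raw.encode()).hexdigest()
        keys.append(parent)
    return keys


def rebuilt(steps, before, after):
    """List the instructions whose cache key changes between two builds."""
    old, new = layer_keys(steps, before), layer_keys(steps, after)
    return [instr for (instr, _), o, n in zip(steps, old, new) if o != n]


PIP = "RUN pip install -r /app/requirements.txt"

original = [
    ("FROM ubuntu:14.04", None),
    ("ADD . /app", "."),
    (PIP, None),
    ("EXPOSE 80 8000", None),
]
reordered = [
    ("FROM ubuntu:14.04", None),
    ("ADD requirements.txt /app/requirements.txt", "requirements.txt"),
    (PIP, None),
    ("ADD . /app", "."),
    ("EXPOSE 80 8000", None),
]

# Hypothetical digests: the code changed, requirements.txt did not.
before = {"requirements.txt": "reqs-v1", ".": "code-v1"}
after = {"requirements.txt": "reqs-v1", ".": "code-v2"}

print(rebuilt(original, before, after))   # pip step is in this list
print(rebuilt(reordered, before, after))  # pip step is not
```

In the first ordering the pip step sits below the invalidated ADD, so it is rerun; in the second it sits above it and stays cached.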

I feel a bit awkward for having missed something that must be so obvious, so hopefully this can help somebody in a similar situation.

TLS Module In SaltStack Not Available (Fixed)

Published on Wednesday, May 7, 2014

I was trying to install HALite, the WebUI for SaltStack, using the provided instructions. However, I kept getting the following errors when trying to create the certificates using Salt:

'tls.create_ca_signed_cert' is not available.
'tls.create_ca' is not available.


Basically, the 'tls' module in Salt simply didn't appear to work. The reason for this is detailed on intothesaltmind.org:

Note: Use of the tls module within Salt requires the pyopenssl python extension.

That makes sense. We can fix this with something like:

apt-get install libffi-dev
pip install -U pyOpenSSL
/etc/init.d/salt-minion restart


Or, better yet, with Salt alone:

salt '*' cmd.run 'apt-get install -y libffi-dev'
salt '*' pip.install pyOpenSSL
salt '*' cmd.run "service salt-minion restart"


The commands to create the PKI key should work now:

Created Private Key: "/etc/pki/salt/salt_ca_cert.key." Created CA "salt": "/etc/pki/salt/salt_ca_cert.crt."
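If you want to check a machine before (or after) the fix, a quick test is whether pyOpenSSL can be imported at all. This is a minimal sketch, assuming you run it with the same Python interpreter that Salt uses; pyOpenSSL is imported under the name `OpenSSL`, and the helper name `module_available` is my own. It only checks the import, not Salt's own module loading.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # pyOpenSSL installs itself as the top-level package "OpenSSL".
    print("pyOpenSSL importable:", module_available("OpenSSL"))
```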

Beers of Myanmar

Published on Wednesday, April 30, 2014



While in Myanmar on a recent trip the wife and I did a brief taste comparison of the three main beers available in most supermarkets.


Andaman - Not to my taste, perhaps like XXXX, VB, Natural Light, or a light Steel Reserve.

Myanmar - Quite refreshing, a bit like similar beers in the region, e.g. Chang, Tiger, or Laos Beer.

ABC - An extra stout (and 8%!) in such a hot country? That's a surprise.

Error opening /dev/sda: No medium found

Published on Saturday, March 1, 2014

I have had this issue before, solved it, and had it again.

Let's say you plug a USB drive into a Linux machine, try to access it (mount it, partition it with fdisk/parted, or format it), and get the error:

Error opening /dev/sda: No medium found


Naturally the first thing you will do is ensure that it appeared when you plugged it in, so you run 'dmesg' and get:

sd 2:0:0:0: [sda] 125045424 512-byte logical blocks: (64.0 GB/59.6 GiB)


And it appears in /dev

Computer:~ $ ls /dev/sd*
/dev/sda
Computer:~ $


Now what? Here's what has bitten me twice: make sure the drive has enough power. Let's say you plugged a 2.5" USB drive into a Raspberry Pi. The Pi probably can't supply enough current to spin up the drive, but it can supply enough to make the drive recognisable. Or, if you are like me, the USB charger powering the drive is faulty, so even though the drive has power, it doesn't have enough.

The next troubleshooting step should be obvious: give the drive enough power to completely spin up.
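If you want to tell this situation apart from a missing device or a permissions problem in a script, one option is to try opening the device read-only and look at the error code. The sketch below assumes Linux, where "No medium found" corresponds to the errno ENOMEDIUM; the function name and the returned labels are my own.

```python
import errno
import os

# ENOMEDIUM is Linux-specific; 123 is its value on Linux.
ENOMEDIUM = getattr(errno, "ENOMEDIUM", 123)


def probe_device(path):
    """Open a device read-only and classify the outcome as a short label."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        return "permission-denied"
    except OSError as exc:
        if exc.errno == ENOMEDIUM:
            # Device node exists, but the kernel cannot reach the media.
            return "no-medium"
        return "error: " + os.strerror(exc.errno)
    os.close(fd)
    return "ok"


if __name__ == "__main__":
    print(probe_device("/dev/sda"))
```

A result of "no-medium" for a drive that shows up in dmesg would be the cue to check the power supply.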

Continuous Flow Through Worm Bin

Published on Sunday, January 5, 2014

Status: Done!

A few months ago we decided we wanted a worm bin, as we were eating a lot of vegetables, and tossing away bits that weren't used. We were also buying soil for our plants, so it made sense to try to turn one into another.

One of our friends gave us some worms from her compost - no idea what kind - and I built an experimental CFT worm bin (sample plans). We harvested once at about two months, but I don't think it was quite ready. We'll keep experimenting.


Free Splunk Hosting

Published on Thursday, November 28, 2013

I first used Splunk about 10 years ago after an old colleague installed it on a computer in the corner, and ever since then I have preached about it. If you have log data, of any kind, I'd recommend you give it a go.

The Splunk people have a few pretty good options for trying Splunk out: you can use either Splunk Storm or Splunk Free. The first option is hosted and has a generous storage allowance, but does not allow long-term storage of data. I send system log data to Splunk Storm.

However, what if you don't have a lot of data, but you want to keep that data forever? After reading Ed Hunsinger's Go Splunk Yourself entry about using it for Quantified Self data, I knew I had to do the same.

From personal experience, Splunk requires at least 1GB of memory to even start. You can probably get it to run on less, but I haven't had much success. This leaves two options: look at Low End Box for a VPS with enough memory (as cheap as $5/month), or use OpenShift. Red Hat generously provides three "gears" to host applications, for free, and each with 1GB of memory. I have sort of a love-hate relationship with OpenShift, maybe a bit like using OAuth. Red Hat calls OpenShift the "Open Hybrid Cloud Application Platform", and I can attest that it is really this. They have provided a method to bundle an application stack and push it into production without needing to fuss about infrastructure, or even provisioning and management of the application. It feels like what would happen if Google App Engine and Amazon's EC2 had a child. Heroku or dotCloud might be its closest alternatives.

Anyway, this isn't a review of OpenShift (although it would be a positive one), but a guide on how to use OpenShift to host Splunk. I first installed Splunk in a gear using Nginx as a proxy, and it worked. However, this felt overly complex, and after one of my colleagues started working on installing Splunk in a cartridge, I eventually agreed this would be the way to go. The result was a Splunk cartridge that can be installed inside any existing gear. Here are the instructions; you need an OpenShift account, obviously. The install should take less than ten clicks of your mouse, and one copy/paste.

From the cartridge's GitHub README:

  1. Create an Application based on an existing web framework. If in doubt, just pick "Do-It-Yourself 0.1" or "Python 2.7"
  2. Click on "Continue to the application overview page."
  3. On the Application page, click on "Or, see the entire list of cartridges you can add".
  4. Under "Install your own cartridge" enter the following URL: https://raw.github.com/kelvinn/openshift-splunk-cartridge/master/metadata/manifest.yml
  5. Click Next and then Add Cartridge. Wait a few minutes for Splunk to download and install.
  6. Logon to Splunk at: https://your-app.rhcloud.com/ui

More details can be read on the cartridge's GitHub page, and I would especially direct you to the limitations of this configuration. This will all stop working if Splunk makes the installer file unavailable, but I will deal with that when the time comes. Feel free to alert me if this happens.


Finding The Same (Misspelled) Name Using Python/NLTK

Published on Friday, September 13, 2013

I have been meaning to play around with the Natural Language Toolkit for quite some time, but I had been waiting for a time when I could experiment with it and actually create some value (as opposed to just play with it). A suitable use case appeared this week: matching strings. In particular, matching two different lists of many, many thousands of names.

To give you an example, let's say you had two lists of names, but with the name spelled incorrectly in one list:

List 1:
Leonard Hofstadter
Sheldon Cooper
Penny
Howard Wolowitz
Raj Koothrappali
Leslie Winkle
Bernadette Rostenkowski
Amy Farrah Fowler
Stuart Bloom
Alex Jensen
Barry Kripke

List 2:
Leonard Hofstadter
Sheldon Coopers
Howie Wolowits
Rav Toothrapaly
Ami Sarah Fowler
Stu Broom
Alexander Jensen

This could easily occur if somebody was manually typing in the lists, dictating names over the phone, or if people spelled their names differently (e.g. Phil vs. Phillip) at different times.

If we wanted to match people on List 1 to List 2, how could we go about that? For a small list like this you can just look and see, but with many thousands of people, something more sophisticated would be useful. One tool could be NLTK's edit_distance function. The following Python script displays how easy this is:

import nltk

list_1 = ['Leonard Hofstadter', 'Sheldon Cooper', 'Penny', 'Howard Wolowitz', 'Raj Koothrappali', 'Leslie Winkle', 'Bernadette Rostenkowski', 'Amy Farrah Fowler', 'Stuart Bloom', 'Alex Jensen', 'Barry Kripke']

list_2 = ['Leonard Hofstadter', 'Sheldon Coopers', 'Howie Wolowits', 'Rav Toothrapaly', 'Ami Sarah Fowler', 'Stu Broom', 'Alexander Jensen']

for person_1 in list_1:
    for person_2 in list_2:
        # Formatting the line explicitly works in both Python 2 and Python 3
        print("%d %s %s" % (nltk.metrics.edit_distance(person_1, person_2), person_1, person_2))

0 Leonard Hofstadter Leonard Hofstadter
15 Leonard Hofstadter Sheldon Coopers
14 Leonard Hofstadter Howie Wolowits
15 Leonard Hofstadter Rav Toothrapaly
14 Leonard Hofstadter Ami Sarah Fowler
16 Leonard Hofstadter Stu Broom
15 Leonard Hofstadter Alexander Jensen
14 Sheldon Cooper Leonard Hofstadter
1 Sheldon Cooper Sheldon Coopers
13 Sheldon Cooper Howie Wolowits
13 Sheldon Cooper Rav Toothrapaly
12 Sheldon Cooper Ami Sarah Fowler
11 Sheldon Cooper Stu Broom
12 Sheldon Cooper Alexander Jensen
16 Penny Leonard Hofstadter
13 Penny Sheldon Coopers
13 Penny Howie Wolowits
14 Penny Rav Toothrapaly
16 Penny Ami Sarah Fowler
9 Penny Stu Broom
13 Penny Alexander Jensen
11 Howard Wolowitz Leonard Hofstadter
13 Howard Wolowitz Sheldon Coopers
4 Howard Wolowitz Howie Wolowits
15 Howard Wolowitz Rav Toothrapaly
13 Howard Wolowitz Ami Sarah Fowler
13 Howard Wolowitz Stu Broom
14 Howard Wolowitz Alexander Jensen
16 Raj Koothrappali Leonard Hofstadter
14 Raj Koothrappali Sheldon Coopers
16 Raj Koothrappali Howie Wolowits
4 Raj Koothrappali Rav Toothrapaly
14 Raj Koothrappali Ami Sarah Fowler
14 Raj Koothrappali Stu Broom
16 Raj Koothrappali Alexander Jensen
14 Leslie Winkle Leonard Hofstadter
13 Leslie Winkle Sheldon Coopers
11 Leslie Winkle Howie Wolowits
14 Leslie Winkle Rav Toothrapaly
14 Leslie Winkle Ami Sarah Fowler
12 Leslie Winkle Stu Broom
12 Leslie Winkle Alexander Jensen
17 Bernadette Rostenkowski Leonard Hofstadter
18 Bernadette Rostenkowski Sheldon Coopers
18 Bernadette Rostenkowski Howie Wolowits
19 Bernadette Rostenkowski Rav Toothrapaly
20 Bernadette Rostenkowski Ami Sarah Fowler
20 Bernadette Rostenkowski Stu Broom
17 Bernadette Rostenkowski Alexander Jensen
15 Amy Farrah Fowler Leonard Hofstadter
14 Amy Farrah Fowler Sheldon Coopers
15 Amy Farrah Fowler Howie Wolowits
14 Amy Farrah Fowler Rav Toothrapaly
3 Amy Farrah Fowler Ami Sarah Fowler
14 Amy Farrah Fowler Stu Broom
13 Amy Farrah Fowler Alexander Jensen
15 Stuart Bloom Leonard Hofstadter
12 Stuart Bloom Sheldon Coopers
12 Stuart Bloom Howie Wolowits
14 Stuart Bloom Rav Toothrapaly
13 Stuart Bloom Ami Sarah Fowler
4 Stuart Bloom Stu Broom
14 Stuart Bloom Alexander Jensen
15 Alex Jensen Leonard Hofstadter
12 Alex Jensen Sheldon Coopers
13 Alex Jensen Howie Wolowits
15 Alex Jensen Rav Toothrapaly
13 Alex Jensen Ami Sarah Fowler
10 Alex Jensen Stu Broom
5 Alex Jensen Alexander Jensen
15 Barry Kripke Leonard Hofstadter
13 Barry Kripke Sheldon Coopers
13 Barry Kripke Howie Wolowits
12 Barry Kripke Rav Toothrapaly
13 Barry Kripke Ami Sarah Fowler
10 Barry Kripke Stu Broom
14 Barry Kripke Alexander Jensen

As you can see, this displays the Levenshtein distance of the two sequences. Another option we have is to look at the ratio.

for person_1 in list_1:
    for person_2 in list_2:
        # Normalise by the combined length of the two names being compared
        # (not the number of names in each list)
        lensum = len(person_1) + len(person_2)
        levdist = nltk.metrics.edit_distance(person_1, person_2)
        nltkratio = float(lensum - levdist) / lensum
        if nltkratio > 0.70:
            print("%.3f %s %s" % (nltkratio, person_1, person_2))

1.000 Leonard Hofstadter Leonard Hofstadter
0.966 Sheldon Cooper Sheldon Coopers
0.862 Howard Wolowitz Howie Wolowits
0.871 Raj Koothrappali Rav Toothrapaly
0.909 Amy Farrah Fowler Ami Sarah Fowler
0.810 Stuart Bloom Stu Broom
0.815 Alex Jensen Alexander Jensen
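To go from a table of scores to an actual mapping, one approach is to keep only the best-scoring candidate for each name. The following is a rough sketch rather than a polished matcher. It uses a small pure-Python Levenshtein function so it runs without NLTK (with unit costs it should agree with NLTK's default edit_distance), and it normalises by the combined length of the two names. The function names and the 0.70 threshold default are my own choices.

```python
def levenshtein(a, b):
    """Edit distance where insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the two strings are identical."""
    total = len(a) + len(b)
    return (total - levenshtein(a, b)) / total if total else 1.0


def best_matches(names, candidates, threshold=0.70):
    """Map each name to its closest candidate, or None if below threshold."""
    matches = {}
    for name in names:
        best = max(candidates, key=lambda c: similarity(name, c))
        matches[name] = best if similarity(name, best) > threshold else None
    return matches


if __name__ == "__main__":
    list_1 = ["Sheldon Cooper", "Penny", "Stuart Bloom"]
    list_2 = ["Sheldon Coopers", "Stu Broom", "Alexander Jensen"]
    for name, match in best_matches(list_1, list_2).items():
        print(name, "->", match)
```

Names with no close counterpart, such as Penny, come back as None, which makes it easy to pull out the leftovers for a manual check.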