December 1, 2014

Back from Clojure Part 1: Python

A series applying the lessons of functional programming

One of my favorite genres of article to write are the ones that involve refactoring some code to make it more functional, and (hopefully) improve it on the way. With that in mind, I’ve decided to embark on a tour of some of the things users of other popular dynamic languages can take away from the ideas behind Clojure, even if they never use it themselves. Today, I’ll be taking an old Python library I wrote and refactoring it to fit a few ground rules.

Said ground rules will be cribbing extensively from Rich Hickey’s Simple made Easy talk, which besides being insightful is humorous and in general worth watching. If you haven’t seen that, you should go watch it now – it’s a better use of your time than this article is.

Setting up

Every good functional programming tutorial needs a strawman. In this case, I’ll be using a python library I wrote back in the dark ages called googlegantt, a library that produces a google charts URL showing a basic gantt chart.

from googlegantt import GanttChart, GanttCategory

gc = GanttChart('Test Chart', width=650, height=200, progress=(2011, 02, 27))

on_time = GanttCategory('On Time', '0c0')
late = GanttCategory('Late', 'c00')
upcoming = GanttCategory('Upcoming', '00c')

t1 = gc.add_task('Late Task', (2011, 02, 20), (2011, 02, 25), category=late)
t2 = gc.add_task('On Time Task', depends_on=t1, duration=3, category=on_time)
t3 = gc.add_task('Future Task', depends_on=t2, duration=7, category=upcoming)

url = gc.get_url()
image = gc.get_image()

So what’s wrong with that? It seems pretty good to me. Unfortunately, it’s pretty hard to illustrate the issues with stateful code in small examples. The problems only start to arise as long-lived objects live in your code and change over time. But in general, mutable objects mean more work mentally when it comes to keeping track of what your code is doing. Every time you pass a mutable object along to somewhere else in your code, you have to know that it won’t get changed somehow without you knowing.

This stackoverflow discussion contains some interesting commentary on the fact, but let’s just proceed for now with the assumption that functional and immutable are good.

So how do we want our library to work? Well, we’ll get to that. First, let’s take a look at some ground rules.

1. Data is better than code

Code looks like this:

GanttChart('Test Chart', width=650, height=200)

Data looks like this:

{'name': "Test Chart", 'width': 650, 'height': 200}

We will refactor to use the latter exclusively. This way, anyone using our library will know exactly what’s happening, if they care to inspect the intermediary variables. A few more benefits of plain-old-data:

  • Trivially serializable. In Python it’s practically JSON already
  • Easily adjusted: the user can easily see what can be changed, and make their own changes if they so desire, without us having to go all-out implementing methods for every possible change.

Note that this advice may superficially sound like opposite of what might be given to users of a statically-typed language, especially one with a full-featured type system like Haskell or OCaml. It’s true that it make a lot of sense to lean on the type system in those languages, but luckily, we get to skip that argument entirely, because classes in Python buy you no type-safety at all. Instead, we get to go all-in with our dynamic language and its homogeneous collections. However, remember that features like typed records, or else libraries like schema or core.typed in clojure, go very well with this strategy.

So we should strive to make any intermediate values regular, inspectable python data structures. But, we’ll make a small exception to that to follow this next rule:

2. Immutable is better than mutable

Python’s tuple type is immutable; the rest of its built-in collection types are not. Plus, the tuple type is not optimized for most collection operations.

Immutable collections in most modern functional languages are implemented as trees, with operations that change the data returning new objects that share a lot of the existing data. This saves memory and computation compared to straight-up copying everything all the time. In Python, this can be achieved using a library called pysistence. Relevant to our interests is the PDict structure, which is an immutable dict:

from pysistence import PDict

gc = PDict(name='Test Chart', width=650, height=200)

gc['name']  # "Test Chart"

The benefit to this is that we can pass it all around and still be confident that it won’t change on us. Sure, this could be accomplished with discipline and a few stern comments, but how could we enforce it? The more mistakes our users are prevented from making, the better.

Sometimes state is unavoidable. Files and database, for example, are intractably stateful, and this is not necessarily a bad thing. On the other hand are classes that follow a setter pattern, which are stateful for no good reason:

gc = GanttChart()
gc.width = 650
gc.height = 200
gc.add_task(GanttTask("Task1", ...))
# ...

We will use immutable data structures to avoid this sort of thing.

3. (Pure) Functions are better than methods

You don’t need any special machinery to write pure functions in Python. You just need to make sure that every function you write returns the same output for any given input (we’ll give IO a pass; Python doesn’t really have the machinery to work with Monads effectively in general). Usually, python functions end up being pure anyhow; they may have internal state as variables, but nobody will hold it against you as long as you don’t let that leak out.

Methods are another story. Every method in Python is passed an instance of the class. Nothing can be done with a method that can’t be done with a function, and the function will probably have better manners if it’s not given implicit permission to have its way with the self parameter.

In general, classes represent an interleaving of data (fields) and functionality (methods). This is not necessary, and it leads to complexity such as this. In our refactoring, we’ll just skip classes and pass immutable data around between functions – it’s no big deal, as you’ll see.

Pulling it together

I always like to start with my top-level API and work down to the implementation, so let’s start there. Here’s what our functional GanttChart will look like:

import datetime
import googlegantt as gg

d = datetime.date

late = gg.gantt_category('Late', 'c00')
on_time = gg.gantt_category('On Time', '0c0')
upcoming = gg.gantt_category('Upcoming', '00c')

task1 = gg.gantt_task('Late Task', d(2011, 2, 20), d(2011, 2, 25), category=late)
task2 = gg.gantt_task('On Time Task', depends_on=t1, duration=3, category=on_time)
task3 = gg.gantt_task('Future Task', depends_on=t2, duration=7, category=upcoming)

gc = gg.gantt_chart('Test Chart', width=650, height=200, progress=d(2011, 2, 27), tasks=(task1, task2, task3))

url = gg.get_url(gc)
img = gg.get_image(gc)

Well, it’s certainly not worse at first glance, so that’s something. What’s different?

  • We use functions instead of classes to instantiate our data.
  • All these functions will return PDicts, which carry the benefits described above.
  • We pass in all our tasks when we instantiate the chart.
  • I eliminated those silly date tuples, which were a bad idea from the get-go. but that’s neither here nor there.

Note that our constructor functions are still a bit cleverer than simple dict instantiations: they do some work depending on the arguments they receive. But, they should return dicts of the same structure.

gc = gg.gantt_chart(
    'Test Chart',
    width=650,
    height=200,
    progress=d(2011, 2, 27),
    tasks=(task1, task2, task3))
# PDict({
#   "name": "Test Chart",
#   "width": 650,
#   "height": 200,
#   "progress": d(2011, 2, 27),
#   "start_date": d(2011, 2, 20),
#   "end_date": d(2011, 3, 7),
#   "tasks": (...)})

task = gg.gantt_task("New Task", d(2011, 2, 22), d(2011, 2, 25))
# PDict({
#   "name": "New Task",
#   "start_date": d(2011, 2, 22),
#   "end_date": d(2011, 2, 25)})

Here’s another example where functions-on-data has benefits. In the original implementation, both GanttChart and GanttTask have a property/method called “duration”. In our new version, we just need one function, which operates on any dict with a start_date and an end_date. Similarly, get_url and get_image become functions operating on data.

Reducing complexity

Supposing we do have a need for an add_task function. How would that work with our immutable chart data? Well, the answer is, it would accept a chart and return a new chart without changing the old one. Here’s how that could go:

def add_task(chart, task):
    start_date = min(task['start_date'], chart['start_date'])
    end_date = max(task['end_date'], chart['end_date'])
    return chart.using(
        start_date=start_date,
        end_date=end_date,
        tasks=chart['tasks'] + (task,))

We take special care to update the start and end dates accordingly. Of course, if that got tedious you could just strip those from the chart object and create functions to extract those values from the tasks, but I don’t think they’re necessarily inappropriate here.

So this function has a very useful signature: it takes a collection and an item, and returns an updated collection. This means that adding multiple tasks is as simple as:

def add_tasks(chart, tasks):
    return reduce(add_task, tasks, chart)

Puns aside, the refactored-to-functional version is significantly smaller too, even after accounting for all the error-checking I lazily omitted.

That’s all for today. You can find the functional refactor of the whole library in this gist, and compare it to the original