October 1, 2015

Writing Friendlier Clojure

Yet another article about refactoring

I love the Clojure language, but I don’t think there’s any use pretending that the combination of expressiveness, power, and repl-driven development can result in some staggeringly dense code. Everyone that writes Clojure is guilty of this at one time or another; you start with the core of your function, evaluate it, wrap it a bit, eval again, and before you know it you have a lopsided, deeply nested, organically-grown stalagmite of a program.

Now you have a choice: you can leave your edifice as is, or you can go back and re-organize it with your newfound understanding of the problem, so that in the future when someone else (or future you) has occasion to read it, they won’t have to start from zero.

In this post, I’ll be refactoring the following snippet of code taken from this reddit thread, which in turn comes from this blog post. I don’t mean to pick on it in particular – far more egregious examples exist, and for all its warts the implementation of a markov-chain sentence generator is perfectly sound and reasonable. I choose it simply because it was held up by a discouraged Clojure user as an example of code that was difficult to understand.


(defn markov-data [text]
  (let [maps
        (for [line (clojure.string/split text #"\.")
              m (let [l (str line ".")
                      words
                      (cons :start (clojure.string/split l #"\s+"))]
                  (for [p (partition 2 1 (remove #(= "" %) words))]
                    {(first p) [(second p)]}))]
          m)]
    (apply merge-with concat maps)))

(defn sentence [data]
  (loop [ws (data :start)
         acc []]
    (let [w (rand-nth ws)
          nws (data w)
          nacc (concat acc [w])]
      (if (= \. (last w))
        (clojure.string/join " " nacc)
        (recur nws nacc)))))

This code is for a markov text generator. The markov-data function accepts a string, and returns a hash-map mapping each unique word in the string to each word that ever followed it. The sentence function uses this data to generate a new string, choosing each word from the corresponding lookup of the previous word.

One of the features that tends to make Clojure code harder to parse than it needs to be is explicit nesting. This isn’t something that Clojure is uniquely capable/guilty of; other languages nest just as much, and just pretend they don’t by pulling out variables and avoiding nested function calls. There’s no reason why we can’t apply the same sensibilities to our Clojure code.

On the flip side of this, Clojure’s explicit nesting and immutable so-and-so makes it easy to logically treat all nested parts as black boxes. This helps greatly in refactoring. Let’s start with the first function, and write outside-in:

(defn markov-data [text]
  (let [sentences (clojure.string/split text #"\.")
        maps (mapcat make-markov-map sentences)]
    (apply merge-with concat maps)))

One of the nice things about refactoring from the outside is that we can just make up functions that do what we want, and leave the implementations for later. I’m still not happy with how many times the word “map” appears in that one line meaning different things, though. Calling it something like “dict” or “hash” would confuse matters more, so let’s call it a “mapping.” That’s enough distinction for me.

We might also notice that our let block is following a classic pattern: define binding, use binding to define next binding. Whenever you notice this pattern, you can probably turn to one of the threading macros instead:

(defn markov-data [text]
  (->> (clojure.string/split text #"\.")
       (mapcat make-markov-mapping)
       (apply merge-with concat)))

The threading macros tend to confuse newbies a bit, but they’re really good at their job: reorganizing code so that the order of operations is the order you read them in. Here, it’s easy to see (with a bit of knowledge) that we’re splitting the text on periods, applying some function to each sentence, and then merging the result. You may have to refer to the make-markov-mapping function and the docs for merge-with to help you understand the first time, but you have all those extra brain cycles spent not untangling the code’s logic to dedicate to that actually-important task.

On to make-markov-mapping. Here’s the un-changed version wrapped in a function:

(defn make-markov-mapping [sentence]
  (for [m (let [l (str sentence ".")
       words (cons :start (clojure.string/split l #"\s+"))]
    (for [p (partition 2 1 (remove #(= "" %) words))]
      {(first p) [(second p)]}))]
   m))

Take a while to read the function and understand what is happening. First off, since we removed the iteration over each sentence to the outer function, the outer for is not necessary. Besides that, this function does a few things, most of which concern the details of splitting the sentence into words (appending a period, filtering blanks, prepending :start, etc.). We can pull that all out into its own function and just worry about the partitioning:

(defn make-markov-mapping [sentence]
  (let [wordlist (sentence->wordlist sentence)]
    (for [[word & words] (partition 2 1 wordlist)]
      {word words})))

Here, we’ve moved all the wordlist generation to a new function, and also used destructuring to remove the need for the first and second in the body of the inner for from the original function. Again, this code still may not be obvious if you’re not familiar with partition, but that’s a meaty and useful bit of information (hint: we’re partitioning the list into groups of 2, stepping through it 1 at a time. So, [:start "This" "is" "a" "sentence"] becomes [[:start "This"] ["This" "is"] ["is" "a"] ["a" "sentence"]]).

Finally, the sentence->wordlist function:

(defn sentence->wordlist [sentence]
  (-> sentence
      (str ".")
      (clojure.string/split #"\s+")
      (->> (cons :start)
           (remove #(= "" %)))))

In this function, we add a period to the end of the sentence, split the sentence into words, prepend the list of words with :start, and remove empty words. Thanks to the threading macros, we do all those things, one per line, in that order. Note that nesting the ->> macro inside -> works fine, but you can’t do it the other way around; I’ll leave why as an exercise.

So now we’ve broken the markov-data function into three, each of which is pretty easy to understand.

(defn sentence->wordlist [sentence]
  (-> sentence
      (str ".")
      (clojure.string/split #"\s+")
      (->> (cons :start)
           (remove empty?))))

(defn make-markov-mapping [sentence]
  (let [wordlist (sentence->wordlist sentence)]
    (for [[word & words] (partition 2 1 wordlist)]
      {word words})))

(defn markov-data [text]
  (->> (clojure.string/split text #"\.")
       (mapcat make-markov-mapping)
       (apply merge-with concat)))

Next, we move on to the function to generate a sentence.

(defn sentence [data]
  (loop [ws (data :start)
         acc []]
    (let [w (rand-nth ws)
          nws (data w)
          nacc (concat acc [w])]
      (if (= \. (last w))
        (clojure.string/join " " nacc)
        (recur nws nacc)))))

What we want to do here is to build a sentence. We start with the special :start keyword to pick the first word, and then use each word to look up the options for the next word, of which we pick one at random. When we pick a word that ends with a period, our sentence is complete.

Let’s start refactoring this from the inside this time, by writing a function to accomplish the job of picking the next word:

(defn pick-next-word [mapping this-word]
  (let [choices (get mapping this-word)]
    (rand-nth choices)))

We only save one nested call here, but I think adding a meaningful function name is worth the effort.

Next, we notice that this implementation uses a loop with an accumulator. Whenever you see that pattern, you should probably start thinking about how reduce can simplify it. However, in this case, reduce doesn’t work; each word depends on the previous word, and the length of the sentence is unbounded. Instead, the simplest solution is a recursive function:

(defn sentence [mapping words this-word]
  (let [next-word (pick-next-word mapping this-word)
        words (conj words next-word)]
  (if (= (last next-word) \.)
    (clojure.string/join " " words)
    (recur mapping words next-word))))

Notice that this is almost exactly the body of the original function’s loop with longer variable names, although we’ve changed one thing slightly; we no longer include the candidates for next word in our looping, instead passing the current word to be chosen so that we can use our pick-next-word function. This removes a detail about how words are picked from the overall sentence construction function.

This function works well enough as-is if we call it using (build-sentence mapping [] :start), but we’ll probably want to make that explicit. You can do that with either a loop or a variadic function; both work well:

(defn build-sentence-with-loop [mapping]
  (loop [words []
         this-word :start]
    (let [next-word (pick-next-word mapping this-word)
          words (conj words next-word)]
      (if (= (last next-word) \.)
        (clojure.string/join " " words)
        (recur words next-word)))))

(defn build-sentence-multivariadic
  ([mapping] (build-sentence-multivariadic mapping [] :start))
  ([mapping words this-word]
    (let [next-word (pick-next-word mapping this-word)
          words (conj words next-word)]
      (if (= (last next-word) \.)
        (clojure.string/join " " words)
        (recur mapping words next-word)))))

In either case, the bulk of the refactoring I did was simply changing variable names.

And so, here is our final, refactored version:

(defn sentence->wordlist [sentence]
  (-> sentence
      (str ".")
      (clojure.string/split #"\s+")
      (->> (cons :start)
           (remove #(= "" %)))))

(defn make-markov-mapping [sentence]
  (let [wordlist (sentence->wordlist sentence)]
    (for [[word & words] (partition 2 1 wordlist)]
      {word words})))

(defn markov-data [text]
  (->> (clojure.string/split text #"\.")
       (mapcat make-markov-mapping)
       (apply merge-with concat)))

(defn pick-next-word [mapping this-word]
  (let [choices (get mapping this-word)]
    (rand-nth choices)))

(defn sentence [mapping]
  (loop [words []
         this-word :start]
    (let [next-word (pick-next-word mapping this-word)
          words (conj words next-word)]
      (if (= (last next-word) \.)
        (clojure.string/join " " words)
        (recur words next-word)))))

Our line and function count has increased, but now each function encapsulates a smaller, more logically coherent operation. Beauty is subjective, but I think it’s easier to read and understand. And, best of all, it works the exact same way. If you swapped out the former code for the latter, nobody who was using it would ever notice, but anyone who had to dig in – which, remember, is usually you from 6 months in the future when you can’t remember anything about it – will thank you for it.

Here are some quick tips in case you couldn’t be bothered to read or remember the whole thing:

  • Use more, smaller functions.
  • Use -> and ->> wherever possible to make logical order match reading order.
  • Use self-documenting variable and function names.