August 6, 2017

Domain modelling with clojure.spec

Clojure.spec is, among other things, Clojure’s official answer to tools like Typed Clojure and Plumatic’s Schema. It represents an attempt to apply some validation to data and functions, without compromising Clojure’s dynamism and data-is-data philosophy. In this post, I’ll be working through a sample program by first outlining and modelling it with the help of clojure.spec, then using spec to guide me while I develop the implementation.

Today’s demonstration problem

The project I’ve chosen for this demo is an RSS feed fetcher and formatter. This is actually a port of an F# project a friend is working on, so I already had a type structure to port.

The program is pretty simple – it’s a tool to fetch the latest from Hacker News’ RSS feed, then go through each of those items and fetch a cleaned-up version of the link via Mercury’s API. Specifically, we’ll need to accomplish a few tasks:

  • Retrieve the RSS feed
  • Parse the feed to retrieve its contents
  • For each item:
    • Check whether its domain appears on a pre-configured blacklist (using regexes). This is important because Mercury simply doesn’t work on some domains, so we want to be able to skip those.
    • Retrieve the article’s content via the Mercury API

Make sense? Good, let’s get started!

Creating the project

I tend to use boot for my Clojure projects, so I created a build.boot file with the following:

(set-env!
  :source-paths #{"src"}
  :resource-paths #{"resources"}
  :dependencies '[[org.clojure/clojure "1.9.0-alpha17"]
                  [org.clojure/spec.alpha "0.1.123"]
                  [cheshire "5.7.1"]
                  [failjure "1.0.1"]
                  [org.clojure/core.async "0.3.443"]
                  [bidi "2.1.2"]
                  [http-kit "2.2.0"]
                  [clj-http "3.6.1"]])

I also need to ensure that I’m using Clojure 1.9, so I set up the boot.properties file accordingly:

BOOT_CLOJURE_NAME=org.clojure/clojure
BOOT_CLOJURE_VERSION=1.9.0-alpha17
BOOT_VERSION=2.7.1

After running mkdir src and mkdir resources, I can run boot repl and start developing.

A quick primer on namespaced keywords

I’ll assume you’re familiar with Clojure’s keywords. They look like this: :keyword.

You may have encountered keywords with two colons instead of one. These keywords are namespaced, and the double-colon syntax is shorthand for “use this namespace”. So, if I’m in (ns myproject.myns), ::keyword returns :myproject.myns/keyword.

A second shorthand lets you expand a keyword through the alias of a required namespace. For example, if I’m in (ns myproject.myotherns) and I’ve run (require '[myproject.myns :as myns]), then ::myns/keyword also returns :myproject.myns/keyword.
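Both shorthands can be checked at the REPL. The sketch below is only an illustration: it uses clojure.string as a stand-in for myproject.myns, since that namespace is always available, and the names local-kw and aliased-kw are made up for the demo.

```clojure
(ns myproject.myotherns
  (:require [clojure.string :as str]))

;; ::name is expanded by the reader to the current namespace.
(def local-kw ::keyword)

;; ::alias/name is expanded through the alias set up in :require.
;; clojure.string is just a stand-in for myproject.myns here.
(def aliased-kw ::str/keyword)

(println local-kw)   ; :myproject.myotherns/keyword
(println aliased-kw) ; :clojure.string/keyword
```

Note that the keyword does not have to name anything that exists in the aliased namespace; the alias only supplies the namespace part.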

Creating the Domain definitions

For this project, I’ve decided to put all the domain definitions in a single namespace. This is because, besides being a way to usefully validate values, specs (like static types) offer a valuable sort of documentation, and keeping them in one place creates a very useful reference.

I created a new directory in src/hackynews, opened up src/hackynews/domain.clj and added a namespace declaration:

(ns hackynews.domain
  (:require [clojure.spec.alpha :as s]
            [failjure.core :as f]))

Failjure is a library I maintain to help work with errors as values, and it turns out to play nice with spec – at least, nicer than thrown exceptions, which can’t really be specced.

Our domain definitions will not only define our data types, but also the steps in our process. We’ll go through and write our domain specs in three parts:

  • Inputs
  • Outputs
  • Process Steps

Defining the Domain Inputs

I started by defining the structure of the Domain inputs: namely, the feed.

(s/def ::rss-feed
  (s/keys
    :req-un [::title ::description ::link-uri ::items]))

Here, I define a single spec using s/keys, which checks that keys are present in a map. I also used s/def to register the spec under a name, which must be a namespaced keyword.

This is already a perfectly valid and useful spec. It will ensure that its input is a map, and require some unqualified keys (hence, :req-un): title, description, link-uri, and items. However, even though the spec will accept unqualified keywords as valid, it demands that I use namespaced keywords to define them, for reasons I’ll explain right now.

We’ve already run into one of spec’s most interesting design decisions. Notice that I’ve specified nothing about what the values of these keys might be. That’s because s/keys gives me no way to spec the value of a feed’s :link-uri key inline. Instead, I can attach a spec to ::link-uri, which then applies wherever an s/keys spec names ::link-uri, including to the unqualified :link-uri key when it is listed under :req-un. And in fact this comes up right away, because each feed item also has a link-uri:

(s/def ::feed-item
  (s/keys
    :req-un [::title ::description ::link-uri ::comments-uri ::pub-date]))

You may have noticed that the feed item spec contains several keys in common with the rss-feed spec, so let’s enforce those a little bit:

(s/def ::title string?)
(s/def ::description string?)
(s/def ::link-uri uri?)
(s/def ::comments-uri uri?)

Now, I’ve applied some additional validation to both the ::feed-item spec and the ::rss-feed spec. I’ve left out a spec for the ::pub-date key because, even though it appears in the RSS data, I won’t actually be using it at all.
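To see how the key specs and the map spec cooperate, here is a small standalone sketch. The namespace spec-demo.keys and the ::entry spec are hypothetical, trimmed-down versions of the specs above, not part of the project.

```clojure
(ns spec-demo.keys
  (:require [clojure.spec.alpha :as s]))

;; Value specs are registered under namespaced keywords...
(s/def ::title string?)
(s/def ::link-uri uri?)

;; ...and s/keys picks them up for the unqualified keys named in :req-un.
(s/def ::entry (s/keys :req-un [::title ::link-uri]))

(def good {:title "Hello" :link-uri (java.net.URI. "https://example.com/a")})
(def wrong-type (assoc good :link-uri "https://example.com/a"))
(def missing-key (dissoc good :link-uri))

(println (s/valid? ::entry good))        ; true
(println (s/valid? ::entry wrong-type))  ; false, a string is not a URI
(println (s/valid? ::entry missing-key)) ; false, :link-uri is required
```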

Next, we can tie our two major specs together.

(s/def ::items (s/coll-of ::feed-item))

Here I’ve added an additional constraint to ::rss-feed, which is that the value of the :items key must be a collection of (hence, coll-of) values that match the ::feed-item spec.

We also mentioned that we wanted to have a predefined blacklist of regular expressions, which we want to use to skip links that we don’t want to fetch the content of. Here’s what that spec looks like:

(s/def ::blacklist (s/coll-of #(instance? java.util.regex.Pattern %)))

As demonstrated here, any function with the signature (x) -> boolean can be used as a spec.
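A quick sketch of predicates acting as specs, assuming a throwaway namespace spec-demo.predicates and a helper named pattern? that I made up for readability:

```clojure
(ns spec-demo.predicates
  (:require [clojure.spec.alpha :as s]))

;; Any one-argument predicate can be used where a spec is expected.
(defn pattern? [x]
  (instance? java.util.regex.Pattern x))

(s/def ::blacklist (s/coll-of pattern?))

(println (s/valid? even? 4))                                   ; true
(println (s/valid? ::blacklist [#"medium\.com" #"youtube\.com"])) ; true
(println (s/valid? ::blacklist ["medium.com"]))                ; false, strings are not patterns
```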

Defining the outputs

We don’t really need anything as the output except a list of ::feed-items with an extra key, ::content, which we’ll spec as a regular string:


(s/def ::content string?)

(s/def ::feed-item-with-content
  (s/and ::feed-item (s/keys :req-un [::content])))

Here, I’ve used s/and to combine two specs.
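As a rough illustration of how the combined spec behaves, here is a reduced version with a single required key. The namespace and the ::entry names are hypothetical.

```clojure
(ns spec-demo.combine
  (:require [clojure.spec.alpha :as s]))

(s/def ::title string?)
(s/def ::content string?)

(s/def ::entry (s/keys :req-un [::title]))

;; s/and requires a value to satisfy every spec it is given.
(s/def ::entry-with-content
  (s/and ::entry (s/keys :req-un [::content])))

(def bare {:title "Hello"})
(def full (assoc bare :content "<p>Hi</p>"))

(println (s/valid? ::entry-with-content full)) ; true
(println (s/valid? ::entry-with-content bare)) ; false, :content is missing
(println (s/valid? ::entry full))              ; true, extra keys are allowed
```

The last line matters for the next section: a map with content still satisfies the plain ::entry spec, because s/keys never rejects extra keys.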

Defining the process

Next, we can pre-spec the functions that will compose our overall program.

Let’s start from the bottom: We’ll need to be able to turn a ::feed-item into a ::feed-item-with-content. That’s the whole point! However, we can add two constraints:

  • The content retrieval might fail, in which case we want the ::feed-item as a fallback
  • The item’s url might be on the blacklist, so we’ll need access to the blacklist to check against. A failure of this check should also return the ::feed-item.

So, here’s our spec:

(s/def ::fetched-item-result
  (s/or
    :ok ::feed-item-with-content
    :error ::feed-item))


(s/def ::try-fetch-item-content
  (s/fspec
    :args (s/cat
            :blacklist ::blacklist
            :item ::feed-item)
    :ret ::fetched-item-result))

Here, we have first defined a spec that represents either failure or success. In case of failure, we fall back on the unfetched feed item. We’ve also defined a spec for a function using s/fspec; the function accepts two arguments (the blacklist and a feed item) and returns something matching the ::fetched-item-result that we defined.

The s/cat here is a bit interesting. It represents the concatenation of several tagged values. The tags show up in the error messages spec reports, to help point out which condition failed. s/cat, along with a few other operators, is part of a branch of spec called “regular expression specs”, which are beyond the scope of this article (and problem) but worth reading about anyhow.
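The tags in s/or and s/cat are easiest to understand by conforming a few values. This is a sketch with deliberately simple stand-in specs (::result and ::args are hypothetical, not the project’s specs):

```clojure
(ns spec-demo.tags
  (:require [clojure.spec.alpha :as s]))

;; s/or tags each alternative; conform reports which branch matched.
(s/def ::result (s/or :ok string? :error keyword?))

;; s/cat tags each position in a sequence of arguments.
(s/def ::args (s/cat :blacklist (s/coll-of string?) :item map?))

(println (s/conform ::result "body"))   ; [:ok "body"]
(println (s/conform ::result :timeout)) ; [:error :timeout]

;; Conforming an argument list yields a map keyed by the s/cat tags.
(def conformed-args (s/conform ::args [["a"] {:title "x"}]))
(println conformed-args) ; {:blacklist ["a"], :item {:title "x"}}

;; When an argument fails, the tag appears in the problem's :path.
(def failing-path
  (-> (s/explain-data ::args [["a"] 42]) ::s/problems first :path))
(println failing-path) ; [:item]
```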

Next, we’ll need a function that turns an rss feed into a list of ::fetched-item-result. We can spec that straightforwardly:

(s/def ::try-fetch-items
  (s/fspec
    :args (s/cat
            :blacklist ::blacklist
            :feed ::rss-feed)
    :ret (s/coll-of ::fetched-item-result)))

We pass in the blacklist because we need to pass it along.

We’ll need a way to get the rss feed, which will be a function that is given a uri:

(s/def ::get-rss-feed
  (s/fspec
    :args (s/cat :uri uri?)
    :ret (s/or
           :ok ::rss-feed
           :error f/failed?)))

Here, I use failjure’s failed? as a spec, which does a fine job if I may say so myself.

Finally, we’ll want one more function that ties everything together, accepting the blacklist and a feed url and returning a list of fetched items:

(s/def ::fetch-rss-feed-items
  (s/fspec
    :args (s/cat
            :blacklist ::blacklist
            :uri ::link-uri)
    :ret (s/or
           :ok (s/coll-of ::fetched-item-result)
           :error f/failed?)))

With the spec done, it’s time to see how it can help us actually write the code – after all, we haven’t actually done anything yet!

Developing the implementation

Now that our domain is laid out, the implementation becomes a matter of filling out those function specs we crafted so nicely. We wrote out the specs back-to-front, but for the sake of repl-driven development it’s probably a bit easier to write out the implementation the right way around, so that we have values to pass to the next step.

Here’s my namespace declaration:

(ns hackynews.impl
  (:require [hackynews.domain :as domain]
            [clojure.spec.alpha :as s]
            [clj-http.client :as http]
            [failjure.core :as f]
            [clojure.xml :as xml]
            [cheshire.core :refer [parse-string]]))

Setting up for development

Before beginning to write these functions, I prepared a little helper in a (comment) at the bottom of the file:

(comment
  (require '[clojure.spec.test.alpha :as stest])
  (stest/instrument)
)

The instrument function, called with no arguments, attaches automatic spec-checking to every function that has a registered function spec, which makes spec errors very obvious. Note that instrumentation only checks a function’s arguments (the :args spec), not its return value, and that this is (necessarily, for Clojure) run-time checking. You also need to re-run instrument when you add or change a spec.
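Here is a minimal sketch of what instrumentation looks like in practice. The shout function and the spec-demo.instrument namespace are invented for the example, and I use s/fdef here, which is the usual shorthand for registering an fspec against a function’s name.

```clojure
(ns spec-demo.instrument
  (:require [clojure.spec.alpha :as s]
            [clojure.spec.test.alpha :as stest]))

(defn shout [text]
  (str text "!"))

;; Register a function spec under the function's own name.
(s/fdef shout
  :args (s/cat :text string?)
  :ret string?)

;; Instrument just this one function; the syntax quote fully qualifies the symbol.
(stest/instrument `shout)

(def ok-result (shout "hi"))

;; A call with bad arguments now throws an ExceptionInfo describing the failure.
(def bad-call
  (try
    (shout 42)
    :no-error
    (catch clojure.lang.ExceptionInfo e
      :spec-error)))

(println ok-result) ; hi!
(println bad-call)  ; :spec-error
```

Calling (stest/unstrument) restores the original functions, which is handy if the extra checking gets in the way.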

Retrieving the feed

This is actually where having access to specs helped the most. I used clojure.xml to retrieve the feed, which returns a somewhat verbose data structure of the format {:tag :rss :attrs {} :content [{:tag :title ...} ...]}. Getting this down into the format we want to work with ended up being most of the implementation:


(defn- parse-item [item-node]
  (reduce (fn [item node]
            (case (:tag node)
              :title (assoc item :title (-> node :content first))
              :description (assoc item :description (-> node :content first))
              :link (assoc item :link-uri (-> node :content first (java.net.URI.)))
              :comments (assoc item :comments-uri (-> node :content first (java.net.URI.)))
              :pubDate (assoc item :pub-date (-> node :content first))
              ;; Ignore any other tag; without a default, case throws on it.
              item))
          {} (:content item-node)))

(defn- parse-channel [channel-node]
  (reduce (fn [feed node]
            (case (:tag node)
              :title (assoc feed :title (-> node :content first))
              :description (assoc feed :description (-> node :content first))
              :link (assoc feed :link-uri (-> node :content first (java.net.URI.)))
              ;; fnil starts :items as a vector, so items keep their feed order.
              :item (update feed :items (fnil conj []) (parse-item node))
              feed))
          {} (:content channel-node)))

(defn get-rss-feed [uri]
  (f/attempt-all [feed (f/try* (xml/parse (str uri)))
                  channel (-> feed :content first)]
    (parse-channel channel)))

(s/def get-rss-feed ::domain/get-rss-feed)

While developing the above, I kept checking my output with (s/explain ::domain/rss-feed result). Explain takes a spec and a value, and tells you just where your value is failing to conform to the spec (or prints a nice success message if it does conform). This gave me a lot more confidence in my implementation.
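The loop of parsing, explaining and fixing looks roughly like the sketch below. Everything in it is a stand-in: the inline sample-xml replaces the real feed, ::channel is a cut-down version of the feed spec, and the two attempts mimic the kind of mistake explain caught for me. I use s/explain-str so the report comes back as a string.

```clojure
(ns spec-demo.explain
  (:require [clojure.spec.alpha :as s]
            [clojure.xml :as xml]))

(s/def ::title string?)
(s/def ::link-uri uri?)
(s/def ::channel (s/keys :req-un [::title ::link-uri]))

;; A tiny inline document instead of a real feed.
(def sample-xml
  "<rss><channel><title>Demo</title><link>https://example.com/</link></channel></rss>")

;; clojure.xml returns nested {:tag ... :attrs ... :content [...]} maps.
(def channel-node
  (-> (java.io.ByteArrayInputStream. (.getBytes sample-xml "UTF-8"))
      xml/parse
      :content
      first))

;; Index the child nodes by tag, keeping the first piece of content.
(def by-tag
  (into {} (map (juxt :tag (comp first :content))) (:content channel-node)))

;; First attempt: the link is still a string, so the spec complains.
(def first-attempt {:title (:title by-tag) :link-uri (:link by-tag)})

;; Second attempt: convert the link to a java.net.URI.
(def second-attempt (update first-attempt :link-uri #(java.net.URI. %)))

(println (s/explain-str ::channel first-attempt))  ; reports the :link-uri problem
(println (s/explain-str ::channel second-attempt)) ; Success!
```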

Fetching the items

Retrieving the items is a pretty straightforward operation, simply requiring me to make a request to Mercury’s JSON endpoint and add the result to the item.


(defn- fetch-item-content [item]
  (f/attempt-all
    [req {:query-params {:url (str (:link-uri item))}
          :headers {"x-api-key" "XXXXXXXXXXXXXXXx"}}
     resp (f/try* (http/get "https://mercury.postlight.com/parser" req))
     content (-> resp
                 (:body)
                 (parse-string true)
                 (:content))]

    (assoc item :content content)
    (f/when-failed [e] item)))


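(defn try-fetch-item-content [blacklist item]
  ;; :link-uri is a java.net.URI, but re-matches needs a string.
  ;; Note that re-matches only succeeds if a pattern matches the whole URI.
  (if (some #(re-matches % (str (:link-uri item))) blacklist)
    item
    (fetch-item-content item)))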

(s/def try-fetch-item-content ::domain/try-fetch-item-content)


(defn try-fetch-items [blacklist feed]
  (map #(try-fetch-item-content blacklist %) (:items feed)))

(s/def try-fetch-items ::domain/try-fetch-items)
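The blacklist check can also be tried on its own. This sketch makes one assumption that differs from the code above: it uses re-find, so a pattern only has to match somewhere in the URI, whereas re-matches requires the pattern to cover the whole string. The blacklisted? helper and the sample patterns are hypothetical.

```clojure
(ns spec-demo.blacklist)

;; Sample patterns, made up for the demo.
(def blacklist [#"medium\.com" #"youtube\.com"])

(defn blacklisted?
  "True if any pattern in blacklist matches somewhere in uri."
  [blacklist uri]
  ;; Regex functions work on strings, so convert the URI first.
  (let [uri-str (str uri)]
    (boolean (some #(re-find % uri-str) blacklist))))

(println (blacklisted? blacklist (java.net.URI. "https://medium.com/some-post"))) ; true
(println (blacklisted? blacklist (java.net.URI. "https://example.com/post")))     ; false
```

Which of the two matching functions is right depends on how the patterns are written: short domain fragments like the ones here suit re-find, while patterns such as #".*medium\.com.*" would work with re-matches.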

Tying it together

The final piece of the puzzle was the overarching function, which turned out a bit anticlimactic:


(defn fetch-rss-feed-items [blacklist uri]
  (f/attempt-all
    [feed (get-rss-feed uri)]
    (try-fetch-items blacklist feed)))

(s/def fetch-rss-feed-items ::domain/fetch-rss-feed-items)

And with that, we have a working project! Now it’s pretty straightforward to hook this up to a template generator of one description or another and come up with a nice, readable summary of the day’s HN posts.

Conclusion

Using spec on this project was a bit overkill, not because it’s too small a project, but because I’m never going to touch it again. However, the benefits of spec, like the benefits of static typing and other validation systems, come mostly when someone else (or perhaps yourself, in a year or two) has to understand it to use or maintain it.

Here’s hoping that spec catches on as a standard for Clojure libraries!