Building a graph-powered suggestion engine for redditlater with Loom
Er, I mean Later (for reddit)
Besides the post scheduling, the main reason people come to http://www.redditlater.com/ is the subreddit analysis feature. It lets folks enter a subreddit and tells them the best time to post there.
Little do they know, it tells me about them, too.
Well, by that I mean I keep a record of which session searches for which subreddits. I don’t link the session ids up to logged-in users or anything; I just know if the same person searched for “baking” and “cooking”. I started collecting this information a while back because I thought it might be fun to build a graph of who was searching for what, and I finally got around to actually doing the work and using that information to show suggestions when subs are analyzed. That is, if you look up a subreddit, it tells you which other subreddits people who searched for that subreddit also looked up.
To do this, I used a library called Loom. Using Loom, the site builds and stores a graph of connected subreddits, weighted by the number of times each pair of subreddits has appeared together in one session’s history. This obviously biases results towards more popular subreddits, as expected, but it also turns up some interesting suggestions. In this post, I’ll go into detail about how I built the graph.
Getting the data out of Mongo
The analysis log table contains records with two fields: a session id (user) and the subreddit they searched for. The first step in building the weighted graph needed to power this feature is to get all that data grouped up.
Redditlater is (presently) backed by MongoDB, because it seemed like a good idea two years ago. Mongo has actually done a fine job keeping up with what traffic the site gets, so I don’t regret it yet, but one thing it struggles with is aggregation. The only mechanism provided to do aggregation on the database side of things is the Map-Reduce function.
With the congomongo driver, map-reduce is called with two strings, each a JavaScript function definition to be used for the map and reduce steps. I love writing JavaScript in my Clojure.
If the functions were complex enough to warrant it, I might have tried to do some ClojureScript-style compilation and written the functions that way; as is, I just made them strings and moved on with my life. Here’s how that ended up looking:
(def MAP_BY_USER "
  function() {
    // One emit per log record: key = session id, value = [sub, 1]
    emit(this.user, [this.sub, 1]);
  }")
(def REDUCE_BY_USER "
  function(user, values) {
    // Fold the emitted [sub, count] pairs into one {sub: count} object.
    // NB: keys with only a single emit skip reduce entirely, so their
    // value stays a [sub, 1] array; those get filtered out later.
    var result = {};
    for (var ii = 0; ii < values.length; ii++) {
      var val = values[ii];
      var sub = val[0];
      var count = val[1];
      if (result[sub] == null) {
        result[sub] = count;
      } else {
        result[sub] += count;
      }
    }
    return result;
  }")
(defn build-user-queries-table [table-name]
  (mongo/map-reduce :analysis-log
                    MAP_BY_USER
                    REDUCE_BY_USER
                    {:replace table-name}))
After that, I have another table that contains the session id as a key, and a map of subreddit to count as the value. Now, we can start building the graph.
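For reference, a document in that output table looks roughly like this (a sketch with made-up values; depending on your congomongo fetch options, the sub names may come back as keywords or strings):

{:_id "some-session-id"          ; the session id, used as the map-reduce key
 :value {:baking 2, :cooking 1}} ; subreddit -> number of searches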
Wrangling data
The actual function for building the graph ended up as one long ->> pipeline, which I’ll show you here:
(->> query-result
     (r/map :value)
     (r/filter map?) ; Single-entry = useless
     (r/map keys)
     (r/mapcat all-pairs)
     (frequencies)
     (r/map #(concat (first %) (rest %)))
     (into [])
     (apply loom/weighted-graph))
Line-by-line, this:
- Pulls out just the list of subreddits (value) for each item
- Filters out those that only have one subreddit, which are returned as vectors rather than maps
- Grabs just the keys, i.e. the subreddit names, disregarding counts
- For each list of subs, uses a function called all-pairs (sketched after this list) to expand the list into a long list of pairs. So, [:1 :2 :3] becomes [[:1 :2] [:1 :3] [:2 :3]]
- Calls frequencies on this long list of pairs, getting a count for each unique pair
- Transforms each result from e.g. [[:sub1 :sub2] 1] to (:sub1 :sub2 1), the format expected by Loom’s weighted-graph function
- Uses into to get a realized result, since we were using reducers for the maps and filters
- Calls Loom’s weighted-graph function to construct a graph from all those values
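all-pairs isn’t from Loom; it’s a small helper. A minimal version (not necessarily the exact code the site runs) looks like this:

(defn all-pairs
  "Expands a collection into every unordered pair of distinct elements,
  e.g. [:1 :2 :3] => ([:1 :2] [:1 :3] [:2 :3])."
  [coll]
  (let [xs (vec coll)
        n  (count xs)]
    (for [i (range n)
          j (range (inc i) n)]
      [(xs i) (xs j)])))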
We use a weighted graph because we want to be able to sort the suggestions by relevance. The weight on each edge is the number of times that pair of subreddits has appeared in the same session.
This graph is stored in an agent in the suggestion namespace. I used an agent because it can take a bit of time to build the graph, and I don’t want to block on updating it. This way, I can just send the build-suggestion-graph function to the agent, and updating happens in its own time without blocking. I don’t really mind if the first few queries after a server restart have no associated suggestions.
To keep the suggestions up to date, the reset-suggestion-graph! function that updates the agent is passed to an at-at/every job, so that the graph is re-built once per hour.
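Sketched out, that wiring looks something like the following (scheduler-pool and the fetch-user-queries helper are illustrative names, not the site’s exact code):

(require '[overtone.at-at :as at-at])

;; The graph lives in an agent; it starts out empty after a restart.
(def suggestion-graph (agent (loom/weighted-graph)))

(defn reset-suggestion-graph! []
  ;; send is asynchronous: the rebuild runs on the agent's thread pool
  ;; and never blocks the caller.
  (send suggestion-graph
        (fn [_old-graph]
          (build-suggestion-graph (fetch-user-queries)))))

;; Rebuild the graph once per hour.
(def scheduler-pool (at-at/mk-pool))
(at-at/every (* 1000 60 60) reset-suggestion-graph! scheduler-pool)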
Getting suggestions out
This is the really easy part. I just call loom/successors, passing the graph and the subreddit I want suggestions for, and it gives me back all the connections that exist for that sub. Then I use loom/weight to get the weight of each of those connections, so I can sort by the heaviest ones.
(defn weighted-successors [graph k]
  (->> (loom/successors graph k)
       (map #(vector % (loom/weight graph k %)))
       (sort-by last >))) ; descending, so the heaviest connections come first
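A hypothetical call (made-up subs and weights) then looks like:

(weighted-successors graph :baking)
;; => ([:cooking 12] [:food 7] [:aww 2]) ; heaviest connections first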
I expose that data to the analysis API, which shows it to the user. Easy.
Loom is super great
My takeaway from this is that, if you have a problem that may call for a graph, Loom is super easy to use and effective. You build a graph, then you query the graph, and that’s all there is to it. If you have such a problem, I encourage you to try using Loom to solve it. And let me know how that goes, too!