Building a graph-powered suggestion engine for redditlater with Loom
Er, I mean Later (for reddit)
Besides the post scheduling, the main reason people come to http://www.redditlater.com/ is the subreddit analysis feature. It lets folks enter a subreddit and tells them the best time to post there.
Little do they know, it tells me about them, too.
Well, by that I mean I keep a record of which session searches for which subreddits. I don’t link the session ids up to logged-in users or anything; I just know if the same person searched for “baking” and “cooking”. I started collecting this information a while back because I thought it might be fun to build a graph of who was searching for what, and I finally got around to actually doing the work and using that information to show suggestions when subs are analyzed. That is, if you look up a subreddit, it tells you which other subreddits people who searched for that subreddit also looked up.
To do this, I used a library called Loom. Using Loom, the site builds and stores a graph of connected subreddits, weighted by the number of times each pair of subreddits has appeared together in one session’s history. This obviously biases results towards more popular subreddits, as expected, but it also turns up some interesting suggestions. In this post, I’ll go into detail about how I built the graph.
Getting the data out of Mongo
The analysis log table contains records with two fields: a session id (user) and the subreddit they searched for. The first step in building the weighted graph needed to power this feature is to get all that data grouped up.
Redditlater is (presently) backed by MongoDB, because it seemed like a good idea two years ago. Mongo has actually done a fine job keeping up with what traffic the site gets, so I don’t regret it yet, but one thing it struggles with is aggregation. The only mechanism provided to do aggregation on the database side of things is the Map-Reduce function.
With the congomongo driver, map-reduce is called with two strings, each a JavaScript function definition to be used for the map and reduce steps. I love writing JavaScript in my Clojure.
If the functions were complex enough to warrant it, I might have tried to do some ClojureScript-style compilation and written the functions that way; as is, I just made them strings and moved on with my life. Here’s how that ended up looking:
(def MAP_BY_USER "
  function() {
    // One emit per log record: key = session id, value = [sub, 1]
    emit(this.user, [this.sub, 1]);
  }")
(def REDUCE_BY_USER "
  function(user, values) {
    // Fold the emitted [sub, count] pairs into one {sub: count} object.
    // NB: keys with only a single emit skip reduce entirely, so their
    // value stays a [sub, 1] array; those get filtered out later.
    var result = {};
    for (var ii = 0; ii < values.length; ii++) {
      var val = values[ii];
      var sub = val[0];
      var count = val[1];
      if (result[sub] == null) {
        result[sub] = count;
      } else {
        result[sub] += count;
      }
    }
    return result;
  }")
(defn build-user-queries-table [table-name]
  (mongo/map-reduce :analysis-log
                    MAP_BY_USER
                    REDUCE_BY_USER
                    {:replace table-name}))
After that, I have another table that contains the session id as a key, and a map of subreddit to count as the value. Now, we can start building the graph.
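For reference, a document in that output table looks roughly like this (a sketch with made-up values; depending on your congomongo fetch options, the sub names may come back as keywords or strings):

{:_id "some-session-id"          ; the session id, used as the map-reduce key
 :value {:baking 2, :cooking 1}} ; subreddit -> number of searches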
Wrangling data
The actual function for building the graph ended up as one long ->> pipeline, which I’ll show you here:
(->> query-result
     (r/map :value)
     (r/filter map?) ; Single-entry = useless
     (r/map keys)
     (r/mapcat all-pairs)
     (frequencies)
     (r/map #(concat (first %) (rest %)))
     (into [])
     (apply loom/weighted-graph))
Line-by-line, this:
- Pulls out just the list of subreddits (value) for each item
- Filters out those that only have one subreddit, which are returned as vectors rather than maps
- Grabs just the keys, i.e. the subreddit names, disregarding counts
- For each list of subs, uses a function called all-pairs (sketched after this list) to expand the list into a long list of pairs. So, [:1 :2 :3] becomes [[:1 :2] [:1 :3] [:2 :3]]
- Calls frequencies on this long list of pairs, getting a count for each unique pair
- Transforms each result from e.g. [[:sub1 :sub2] 1] to (:sub1 :sub2 1), the format expected by Loom’s weighted-graph function
- Uses into to get a realized result, since we were using reducers for the maps and filters
- Calls Loom’s weighted-graph function to construct a graph from all those values
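all-pairs isn’t from Loom; it’s a small helper. A minimal version (not necessarily the exact code the site runs) looks like this:

(defn all-pairs
  "Expands a collection into every unordered pair of distinct elements,
  e.g. [:1 :2 :3] => ([:1 :2] [:1 :3] [:2 :3])."
  [coll]
  (let [xs (vec coll)
        n  (count xs)]
    (for [i (range n)
          j (range (inc i) n)]
      [(xs i) (xs j)])))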
We use a weighted graph because we want to be able to sort the suggestions by relevance. The weight on each edge is the number of times that pair of subreddits has appeared in the same session.
This graph is stored in an agent in the suggestion namespace. I used an agent because it can take a bit of time to build the graph, and I don’t want to block on updating it. This way, I can just send the build-suggestion-graph function to the agent, and updating happens in its own time without blocking. I don’t really mind if the first few queries after a server restart have no associated suggestions.
To keep the suggestions up to date, the reset-suggestion-graph! function that updates the agent is passed to an at-at/every job, so that the graph is re-built once per hour.
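Sketched out, that wiring looks something like the following (scheduler-pool and the fetch-user-queries helper are illustrative names, not the site’s exact code):

(require '[overtone.at-at :as at-at])

;; The graph lives in an agent; it starts out empty after a restart.
(def suggestion-graph (agent (loom/weighted-graph)))

(defn reset-suggestion-graph! []
  ;; send is asynchronous: the rebuild runs on the agent's thread pool
  ;; and never blocks the caller.
  (send suggestion-graph
        (fn [_old-graph]
          (build-suggestion-graph (fetch-user-queries)))))

;; Rebuild the graph once per hour.
(def scheduler-pool (at-at/mk-pool))
(at-at/every (* 1000 60 60) reset-suggestion-graph! scheduler-pool)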
Getting suggestions out
This is the really easy part. I just call loom/successors, passing the graph and the subreddit I want suggestions for, and it gives me back all the connections that exist for that sub. Then I use loom/weight to get the weight of each of those connections, so I can sort by the heaviest ones.
(defn weighted-successors [graph k]
  (->> (loom/successors graph k)
       (map #(vector % (loom/weight graph k %)))
       (sort-by last >))) ; descending, so the heaviest connections come first
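A hypothetical call (made-up subs and weights) then looks like:

(weighted-successors graph :baking)
;; => ([:cooking 12] [:food 7] [:aww 2]) ; heaviest connections first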
I expose that data to the analysis API, which shows it to the user. Easy.
Loom is super great
My takeaway from this is that, if you have a problem that may call for a graph, Loom is super easy to use and effective. You build a graph, then you query the graph, and that’s all there is to it. If you have such a problem, I encourage you to try using Loom to solve it. And let me know how that goes, too!