On October 2nd 2018, over 800 participants came together for the 12th ACM Conference on Recommender Systems (RecSys) in Vancouver, Canada. The conference ran over six days, with tutorials and workshops in addition to the main conference, which used a single-track format for the first time.

Spotify had a strong presence: a tutorial, "Mixed Methods for Evaluating User Satisfaction", by Jean Garcia-Gathright, Christine Hosey, Brian St. Thomas, Ben Carterette, and Fernando Diaz; our paper "Explore, Exploit, and Explain: Personalizing Explainable Recommendations with Bandits"; a position paper, "Assessing and Addressing Algorithmic Bias - But Before We Get There", by Jean Garcia-Gathright, Aaron Springer, and Henriette Cramer at the FATREC workshop; and the RecSys Challenge, organised by Ching-Wei Chen, Markus Schedl of Johannes Kepler University, Hamed Zamani of the University of Massachusetts Amherst, and Paul Lamere.

**Confounding in Recommender Systems**

Confounded data in recommendation systems was a recurring theme at RecSys this year. As several papers discuss, implicit feedback collected by a recommender system in production is confounded by the recommender itself. Naively using this data for offline evaluation can be misleading, especially if the recommender being evaluated offline is very different from the production recommender. This also affects batch training, because there is always the danger that the new recommender is trained to emulate the production recommender rather than to optimize user engagement. We describe some papers from the main conference and workshops that deal with these issues.

**Societal Impacts of Recommender Systems**

The conference both started and ended with keynotes emphasizing the social impact of recommender systems. Elizabeth Churchill, Director of User Experience at Google, called on designers and engineers to keep five Es in mind when designing recommendation systems:

*Expedient*: convenient and practical; *Exigent*: pressing and demanding; *Explainable*: understandable or intelligible; *Equitable*: fair and impartial; *Ethical*: morally good or correct.

Christopher Berry, Director of Product Intelligence at CBC, ended the conference with a fascinating story of the rocky path to social cohesion in Canada. He invited the community to explore how recommender systems could help promote cohesion and understand the differences between content that polarizes and content that unites. In the main conference, "RecSys That Cares" was the title of a track where researchers addressed topics including diversity, sustainability, and bias. Among the workshops, FATREC was held for the second consecutive year as a full-day workshop on the ideals and challenges of Fairness, Accountability, and Transparency in recommender systems.

**User Understanding**

Another apparent theme was the growing effort to understand users at a deeper level across different domains. These efforts ranged from building a more comprehensive representation of users' interests and goals from diverse sources of information to interpreting their implicit and explicit signals more realistically. Researchers aspired to use these learnings to design and optimize methods that satisfy users' real needs and capture their longer-term satisfaction.

Here are some of the highlights from the tutorials, conference, and workshops. This list is by no means exhaustive; many other interesting contributions are left unmentioned.

**Evaluation**

In this tutorial, Jean Garcia-Gathright, Christine Hosey, Brian St. Thomas, and Fernando Diaz explained in depth how a mixed-methods approach can provide a framework for integrating qualitative and quantitative analysis. Using this approach, researchers can build a holistic picture of users and interpret implicit signals more realistically. The tutorial ranged from small qualitative in-lab studies to large-scale data collection and validation analysis for understanding the complex world of user satisfaction.

In another part of this tutorial, Ben Carterette presented the fundamentals of significance testing, with examples of non-parametric, parametric, and bootstrap tests and guidelines on how researchers can understand and interpret their results.

This tutorial was well received by the audience and made a great topic of discussion for researchers visiting Spotify's booth, who shared their own challenges and learnings in designing satisfaction metrics.

**Sequences**

Sequences was a half-day tutorial covering the recent survey paper on sequence-aware recommender systems. This area explores the opportunities and challenges that time as an extra dimension adds to the design of recommender systems. Four types of tasks were discussed as the main applications of sequence-aware recommendation: context adaptation, trend detection, repeated recommendation, and tasks with order constraints. The tutorial was organized into two main parts, evaluation and algorithms, and concluded with a hands-on session. Dataset partitioning (an example shown in the picture below) was highlighted as a distinct challenge in the offline evaluation of these systems. On the algorithms side, sequence learning, sequence-aware matrix factorization, and hybrid models were described. You can find the slides here.
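As an illustration of the partitioning issue, here is a minimal sketch (my own, not taken from the tutorial) of a leave-last-out temporal split, one common way to partition interaction logs for sequence-aware evaluation: each user's most recent event is held out for testing so the model is never evaluated on events that precede its training data.

```python
from collections import defaultdict

def leave_last_out(events):
    """Split per-user interaction logs into train/test by time:
    the most recent event of each user is held out for testing.
    `events` is a list of (user, item, timestamp) tuples."""
    by_user = defaultdict(list)
    for user, item, ts in events:
        by_user[user].append((ts, item))
    train, test = [], []
    for user, hist in by_user.items():
        hist.sort()  # chronological order
        for ts, item in hist[:-1]:
            train.append((user, item, ts))
        ts, item = hist[-1]
        test.append((user, item, ts))
    return train, test

events = [("u1", "a", 1), ("u1", "b", 3), ("u2", "c", 2), ("u1", "d", 2)]
train, test = leave_last_out(events)
# note: users with a single event contribute only to the test set in this sketch
```

Real evaluations often split on a global timestamp instead, to avoid leaking future popularity trends into training.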

On the topic of recommender confounding, there were several interesting papers. Allison Chaney et al. articulate and clarify the problem in their paper “How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility”. They introduce a model of how users interact with recommendations over several iterations of training and data collection. They use the model to examine confounding in simulations and find that it results in homogeneity in recommended items, as measured by the overlap of recommended items for similar users in the simulation vs. similar users in the ideal non-filter bubble setting (a measure they call “change in Jaccard index”).

There are different ways to address confounding. One is inverse propensity score estimation (e.g., Yang et al. 2018 and McInerney et al. 2018 at RecSys 2018). The best long paper award went to Stephen Bonner and Flavian Vasile for their impressive work "Causal Embedding for Recommendation". They take a different approach using domain adaptation and show that a small amount of pure exploration data, in combination with a larger amount of confounded data, helps with de-confounding. The role domain adaptation plays in a factorization approach is to constrain the item vectors to be the same for both exploration and exploitation. The user vectors are allowed to vary across domains but are regularized to be similar. (There are alternative hierarchical parameter assumptions, such as allowing the item vectors to vary and the user vectors to either vary or stay the same.) They find improved offline training with their method compared to the usual confounded factorization, on data adjusted offline to exhibit less skew toward popular and exposed items.

On the theme of choosing the right user signal to optimize the model, there is increasing interest in optimizing longer term user satisfaction using reinforcement learning. An example this year of advancing beyond a stationary reward distribution for full-page optimization is a paper by Zhao et al. called “Deep Reinforcement Learning for Page-wise Recommendations”. They propose an approach using the Actor-Critic method where two networks are learnt. The actor network is trained to map states (i.e. user context) to a best action (i.e. whole page of recommendations). The critic network is trained to map both the state and best action to the long-term reward (i.e. reward for the whole page as measured by clicks and purchases in an online store). Training the two networks jointly avoids the problems associated with a large action space common to recommender systems.

Another typical assumption in recommenders is that the set of items being selected already exist. An intriguing approach by Vo and Soh “Generation meets recommendation: proposing novel items for groups of users” looks at how one could decide what items to generate next based on a set of historical user consumption data. The basic idea is to learn a joint embedding of users and items using consumption data then to use submodular optimization to “cover” regions in the embedding space that would satisfy the most people. Once these regions have been discovered, the potential new items are mapped back to the original item space. They show that this method can be used to suggest new works of art, and, more convincingly, movie plots. For example, they posit that a narrated movie with strong storytelling, social commentary, and dark humour would be a hit (think “American Beauty meets Pulp Fiction and One Flew Over the Cuckoo’s Nest”). The method is capable of suggesting new items that satisfy a pre-existing demand in new ways (though not new markets for original content).

In one of the attempts to consider a more comprehensive representation of users, "Calibrated Recommendations" by Harald Steck asks recommender systems to be "fair" towards a user's various tastes. This work shows that, especially when training data is noisy and limited, optimizing for accuracy can produce a list of recommended items dominated by the user's major taste, ignoring a subset of their tastes altogether. To prevent users from getting into a personal filter bubble, it suggests a "calibration" metric that compares the distribution of categories/genres in the user's consumption history with that of the recommended list. The paper frames this as a trade-off between accuracy and calibration and proposes a simple greedy algorithm for post-processing the recommendation list to increase calibration. The algorithm starts with an empty list and iteratively appends one item at a time, optimizing the trade-off between calibration and accuracy using a pre-set parameter.
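A minimal sketch of this kind of greedy calibrated re-ranking might look as follows (the genre representation, smoothing, and trade-off form are simplified relative to the paper; `lam` plays the role of the pre-set trade-off parameter):

```python
import math

def kl(p, q, eps=0.01):
    """KL divergence between genre distributions, smoothed so that
    genres missing from q do not yield infinite divergence."""
    return sum(p[g] * math.log(p[g] / ((1 - eps) * q.get(g, 0.0) + eps * p[g]))
               for g in p if p[g] > 0)

def genre_dist(items, genres):
    """Genre distribution of a list, items with several genres split evenly."""
    counts = {}
    for i in items:
        for g in genres[i]:
            counts[g] = counts.get(g, 0.0) + 1.0 / len(genres[i])
    total = sum(counts.values()) or 1.0
    return {g: c / total for g, c in counts.items()}

def calibrated_rerank(candidates, scores, genres, target, k, lam=0.5):
    """Greedy list construction trading off accuracy (sum of scores)
    against calibration (KL between target and list genre distributions)."""
    chosen = []
    while len(chosen) < k:
        best, best_obj = None, -float("inf")
        for i in candidates:
            if i in chosen:
                continue
            trial = chosen + [i]
            obj = ((1 - lam) * sum(scores[j] for j in trial)
                   - lam * kl(target, genre_dist(trial, genres)))
            if obj > best_obj:
                best, best_obj = i, obj
        chosen.append(best)
    return chosen

# user history is 70% rock / 30% jazz; pure accuracy returns two rock items,
# while a strongly calibrated re-rank surfaces the jazz item
genres = {"r1": ["rock"], "r2": ["rock"], "j1": ["jazz"]}
scores = {"r1": 1.0, "r2": 0.9, "j1": 0.5}
target = {"rock": 0.7, "jazz": 0.3}
accurate = calibrated_rerank(["r1", "r2", "j1"], scores, genres, target, k=2, lam=0.0)
calibrated = calibrated_rerank(["r1", "r2", "j1"], scores, genres, target, k=2, lam=0.9)
```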

Yet another interesting work aiming at a better understanding of users was presented by Zhao et al. "Interpreting User Inaction in Recommender Systems" explores what most recommender systems ignore or treat as negative feedback: a user's *inaction* towards a recommended item. Inspired by Decision Field Theory, the researchers designed a survey exploring seven reasons behind users' inaction, and studied which reasons still make good candidates for future recommendation and which ones the recommender system should discard. Here is the conclusion:

"Would Not Enjoy < Watched < Not Noticed < Not Now or Others Better < Explore Later or Decided To Watch", meaning that an "Explore Later or Decided To Watch" reason is the best candidate for later recommendation, while ignored items with a "Would Not Enjoy" reason should be discarded. Classifiers trained to predict inaction reasons performed better than chance; however, the authors suggest that sensors such as eye-tracking equipment could further improve inaction-reason detection.

**RecSys Challenge 2018 Workshop**

The RecSys Challenge Workshop featured oral presentations from 16 of the top performing systems submitted to the RecSys Challenge 2018 on the task of Automatic Playlist Continuation. Overall participation was high (1791 registrants, 117 active teams, and 1467 submissions). On the surface, a majority of systems, including the winning entry, took a similar ensemble approach including a high-recall candidate generation phase(s), followed by a high-precision ranking phase, with some special techniques applied to handle the “cold start” use case (title-only playlists). However, when examined in more detail, the systems showed great variety and novelty in approaches to the task.

Approaches based on simple neighborhood methods were surprisingly effective on the task, for example here and here. Matrix factorization formed the basis of many other systems, using algorithms such as Weighted Regularized Matrix Factorization (WRMF) with playlist-track matrices in place of the traditional user-item matrix. Many other systems integrated features beyond basic playlist-track co-occurrence through the LightFM matrix factorization library, which is a form of Factorization Machine.
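As a toy illustration of why neighborhood methods work here, the following sketch (my own simplification, not any team's entry) scores continuation candidates by their co-occurrence with the seed tracks across training playlists, using a binary playlist-track matrix in place of the traditional user-item matrix:

```python
import numpy as np

def cooccurrence_candidates(playlists, seed_tracks, top_k=3):
    """Score candidate tracks for continuing a partial playlist by how
    often they co-occur with the seed tracks in the training playlists."""
    tracks = sorted({t for pl in playlists for t in pl})
    idx = {t: i for i, t in enumerate(tracks)}
    M = np.zeros((len(playlists), len(tracks)))   # binary playlist-track matrix
    for r, pl in enumerate(playlists):
        for t in pl:
            M[r, idx[t]] = 1.0
    co = M.T @ M                                  # track-track co-occurrence counts
    np.fill_diagonal(co, 0.0)
    scores = co[[idx[t] for t in seed_tracks]].sum(axis=0)
    for t in seed_tracks:                         # never re-recommend seed tracks
        scores[idx[t]] = -np.inf
    order = np.argsort(-scores, kind="stable")
    return [tracks[i] for i in order[:top_k]]

playlists = [["a", "b", "c"], ["a", "b", "d"], ["c", "e", "f"]]
result = cooccurrence_candidates(playlists, ["a"], top_k=2)
# "b" appears with "a" in two playlists, so it ranks first
```

Competitive entries layered normalization, inverse-popularity weighting, and a second ranking stage on top of this kind of candidate generation.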

Beyond the general multi-stage ensemble framework, there was a great diversity of features and techniques presented. There was a graph-based random-walk approach, an IR-based query-expansion approach, and a sub-profile-aware diversification approach. Deep learning was used in a variety of ways as well, from autoencoders that predict playlist contents, to character-level CNNs on playlist titles, to recurrent networks that model sequences of tracks.

Entries to the Creative Track integrated data from other sources, including audio features and lyrics, in an ensemble method for playlist prediction. One of the unique features of this dataset and challenge is the importance of the playlist title: titles can indicate the intent of a playlist (for example "Music from Woodstock" or "Awesome Cover Songs") and can have a big impact on the types of songs that fit in a playlist. While several entries to the Main Track took creative approaches to learning from the playlist titles contained in the Million Playlist Dataset, we can imagine further gains from integrating approaches and external datasets from the Natural Language Processing (NLP) communities.

All accepted papers, along with code, and slides (where available) are posted on the Workshop website. A detailed summary of the Challenge outcomes, approaches, and findings can be found in this preprint whitepaper.

**REVEAL: Offline Evaluation for Recommender Systems**

The REVEAL workshop was about offline evaluation of recommender systems, with themes around inverse propensity scoring, reinforcement learning, and evaluation testbeds. Here we highlight a small selection of papers; you can find many other excellent papers on the workshop website. Nikos Vlassis et al. present a generalization of inverse propensity score estimators in their paper On the Design of Estimators for Off-Policy Evaluation and show how one can retrieve various existing off-policy evaluators, such as doubly robust and control variates, as special cases. Tao Ye and Mohit Singh of Pandora presented their contextual bandit approach using LinUCB on the Pandora home page and showed how it responds to breaking events in music (e.g. an artist passing away). Finally, Minmin Chen of Google Brain discussed their methods for long-term optimization of user satisfaction in simulation, combining off-policy evaluation with the REINFORCE algorithm so that a new policy can be trained offline using randomized online data.

*Spotify was a diamond sponsor of RecSys 2018.*

In our recent RecSys publication, we propose a bandit method for personalizing explainable recommendations (e.g. cards on a grid with shelf titles). Our method provides a way of approximating online metrics using counterfactual evaluation with grid-based propensities. Experiments on Spotify's Home page show a significant improvement in stream rate over non-bandit methods and good agreement between the offline and online rankings of approaches.

Here are some associated resources:

J. McInerney, B. Lacker, S. Hansen, K. Higley, H. Bouchard, A. Gruson, R. Mehrotra. Explore, Exploit, Explain: Personalizing Explainable Recommendations with Bandits. In *ACM Conference on Recommender Systems (RecSys)*, October 2018.

Slides | Poster | Blog | Paper

I start by describing the motivation for bandits in recommendation, then describe how explained recommendations have become a common use case in modern recommender systems.

A small number of content producers dominate consumption in culture. This happens wherever exposure (e.g., word of mouth, media coverage, recommendation algorithms) is higher for popular producers than less popular producers. More popularity leads to greater exposure and greater exposure leads to more popularity (see Fig. 1). This is known as the Matthew effect (biblical reference) or the Pareto principle [Juran, 1937]. The way this plays out in recommendation algorithms is interesting and worth diving into next.

Figure 1. More popular content producers receive greater exposure which feeds their popularity.

Filter bubbles occur when you train a model on user response data gathered by a recommender [Chaney et al., 2018]. The problem originates because items receive different levels of exposure, which biases the learnt relationship between treatment (i.e. recommended item) and response (i.e. observed relevance) [Liang et al, 2016]. Fig. 2 gives a schematic of the way that most recommenders in production across industry are deployed. An alternative way to describe the problem is with Pearl’s do-calculus: in training a model of user response given recommendations, our model is biased by the back-door path that goes via the contextual information (e.g. the user history or user covariates).

Figure 2. Feedback loops in existing recommender systems result in biased models.

This motivates the need for contextual bandit approaches that explore the item space sufficiently and correct for the bias in the training objective.

To understand the role of bandits in all this, it helps to introduce the notion of recommender uncertainty (see Fig. 3). When all a recommender can do is either exploit (i.e. recommend) or ignore (i.e. not recommend) an item, it ends up ignoring potentially relevant items. This happens because it only ever has access to finite data about user responses.

Figure 3. When a recommender is uncertain about an item's relevance to a user, all it can do is either exploit or ignore.

Wouldn't it be nice if our recommender could also recommend items on an exploratory basis when it is uncertain about their relevance? This motivates using a contextual bandit (see Fig. 4) [Sutton & Barto, 1998]. The type of bandit we focus on here is a *stochastic policy contextual bandit* because it affords importance sample reweighting (described in the "Counterfactual Training" section).

Figure 4. When a bandit recommender is uncertain about an item's relevance to a user, it can make sure to explore those items.

The final piece of the motivation relates to explained recommendations, or *recsplanations* (a term coined by Spotify’s Paul Lamere a few years ago). Recsplanations are now a common way for a recommender to tell a user why they are being recommended a particular item. They are often manifested in a shelf interface: e.g. on the Spotify home page on the mobile app you see playlists arranged on shelves, and each shelf has a title or theme that tells you a bit about where the recommendations come from (“based on your recent listening”, “because your friends like”).

Figure 5: example of a shelf interface recommender.

The research question is: how can we extend the bandit approach to recsplanations?

Bart consists of a reward model, a stochastic policy, a counterfactual training method, and a propensity score scheme. I discuss each choice next.

The job of the reward model is to predict the user response given a context and a recommended item:

$$\mathrm{predicted \; reward\;} = \mathbb{E}_{p_\theta(R | A, X)}[R]$$

where X is the context (e.g. recent user listening, region, platform), A is the recommended item, R is the user response (e.g. stream = 1, no stream = 0), and \( \theta \) are a set of model parameters.

What makes a good reward model? That depends on the kind of data and problem. For recommendation, it helps to use methods that do not fold under data sparsity, since there may only be a few responses per user for you to train on. For our experiments, we focused on the class of factorization machines [Rendle, 2010].
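For reference, a second-order factorization machine prediction can be written in a few lines using Rendle's O(nk) identity for the pairwise interaction term (this is a generic sketch of the model class, not our production reward model):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction [Rendle, 2010].
    x: feature vector (e.g. concatenated one-hot user and item features),
    w0: global bias, w: linear weights, V: (n_features, k) factor matrix.
    The O(nk) identity replaces the explicit sum over feature pairs."""
    xv = V.T @ x                                         # (k,) per-factor sums
    pairwise = 0.5 * np.sum(xv ** 2 - (V ** 2).T @ (x ** 2))
    return w0 + w @ x + pairwise

# with two active features the pairwise term reduces to <v_0, v_1>
x = np.array([1.0, 1.0, 0.0])
w0, w = 0.1, np.array([0.2, 0.3, 0.4])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = fm_predict(x, w0, w, V)   # 0.1 + 0.2 + 0.3 + (1*3 + 2*4) = 11.6
```

Because factors are shared across features, the model can score user-item pairs that never co-occurred in training, which is what makes it robust to sparsity.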

How you explore items with an uncertain reward is also a design choice. Given that exploration is wasteful, it makes sense to focus it where there is the most uncertainty. Ideally, as our recommender gains confidence in the quality of some items, it explores them less. For simplicity we chose epsilon-greedy, which takes the following form:

$$\pi(A | X) = (1-\epsilon) + \frac{\epsilon}{|\mathcal{A}|} \mathrm{\;when\;}A=A^*$$

$$\pi(A | X) = \frac{\epsilon}{|\mathcal{A}|} \mathrm{\;otherwise} $$
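Sampling from this policy can be sketched as follows (a hypothetical helper, not our production code); the important detail is that the propensity is returned alongside the action so it can be logged for counterfactual training:

```python
import numpy as np

def epsilon_greedy(predicted_rewards, epsilon, rng):
    """Sample an action from the epsilon-greedy policy above and return
    it with its propensity pi(A|X): the greedy action A* gets probability
    (1 - eps) + eps/|A|, every other action gets eps/|A|."""
    n = len(predicted_rewards)
    probs = np.full(n, epsilon / n)                      # eps/|A| everywhere
    probs[np.argmax(predicted_rewards)] += 1.0 - epsilon # extra mass on A*
    action = rng.choice(n, p=probs)
    return action, probs[action]

rng = np.random.default_rng(0)
action, propensity = epsilon_greedy(np.array([0.1, 0.9, 0.2]), 0.3, rng)
# with eps = 0.3 and 3 actions: greedy propensity 0.8, others 0.1
```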

Could we have used upper confidence bounds (UCB) or Thompson sampling? UCB is a deterministic policy so is not compatible with the importance reweighting trick we are about to use. Thompson sampling requires an accurate quantification of uncertainty in the reward prediction, so we leave that for future work.

Our ideal training objective would be computed on data collected through a randomized controlled trial:

$$\mathbb{E}_{X, A \sim \mathrm{Uniform}(\mathcal{A}),R}[\log p_\theta(R | A, X)]$$

But you can't always get what you want: users would get dissatisfied if we just recommended them things regardless of relevance. Therefore, our approach is to collect data with a biased policy, the \( \pi \) described above, and then manipulate the objective so that it approximates the randomized controlled trial:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^N \frac{\log p_\theta(R_n | A_n, X_n)}{|\mathcal{A}| \, \pi(A_n | X_n)}$$

In fact, we do the counterfactual training using normalized capped importance sampling (NCIS) in order to reduce the variance of the estimator [Gilotte et al., 2018].
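The weighting scheme can be sketched as follows (the cap value and function names are my own illustration; see Gilotte et al. 2018 for the exact NCIS form): each term is weighted by \(1 / (|\mathcal{A}| \, \pi(a_n|x_n))\), then the weights are capped and self-normalized to trade a little bias for much lower variance.

```python
import numpy as np

def ncis_weighted_objective(log_liks, logged_propensities, n_actions, cap=10.0):
    """Approximate the uniform-exploration objective from logged data:
    weight each log-likelihood by 1 / (|A| * pi(a_n|x_n)), then cap and
    self-normalize the weights (NCIS-style) to reduce variance."""
    w = 1.0 / (n_actions * np.asarray(logged_propensities))
    w = np.minimum(w, cap)          # capping
    w = w / w.sum()                 # self-normalization
    return float(np.sum(w * np.asarray(log_liks)))

# a frequently shown action (propensity 0.5) is down-weighted relative
# to a rarely shown one (propensity 0.05)
obj = ncis_weighted_objective([-1.0, -2.0], [0.5, 0.05], n_actions=10)
```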

The novel aspect of the training procedure is how to specify the propensity scores (i.e. the logging policy) and, concomitantly, the stochastic policy on the explanation grid.

Let’s start from what would make our lives easy: given a database of impressions and stream outcomes, the most natural plug-n-play approach is to train on the rows of this database as weighted i.i.d. observations in a classifier (or regressor for continuous rewards). What are the weights? None other than the inverse propensities.

How does that translate into bandit world? It turns out it is equivalent to filling in slots in two directions, with a bandit action in each slot. First, fill up each shelf with a diminishing action space of items, preselected to be relevant to the shelf title (a form of *safe exploration*):

Figure 6

Figure 7

Second, fill up the page with a diminishing action space of shelves.

Why are we allowed to decide the ordering before a single impression has been collected? The reason is that our reward model is only trained episodically: \( \theta \) remains the same within an episode, so we can greedily choose item placements under the strong assumption that the rewards for each action are independent.

We studied a range of settings, comparing against a baseline that does not use dynamic ordering of items (control) and against a random placement of items. We found a significant improvement when using contextual bandits. Further details of the experimental setup are in the paper.

We introduced Bart, a contextual bandit approach to performing exploration-exploitation with recsplanations. There are clearly several extensions to improve the approach. First, the user preference model is amenable to several improvements: it assumes independence of rewards (ruling out rankings that encourage diversity of content), estimates absolute reward when comparative reward (à la LambdaMART) could be more accurate, and it is vulnerable to our choice of user stream as the success metric (why not stream length?). The second avenue of improvement is whole-page optimization: can we use the slate method to improve this [Swaminathan et al., 2017]?

[Chaney et al., 2018] Chaney, A. J., Stewart, B. M., & Engelhardt, B. E. (2018, September). How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. In *Proceedings of the 12th ACM Conference on Recommender Systems* (pp. 224-232). ACM.

[Gilotte et al., 2018] Gilotte, A., Calauzènes, C., Nedelec, T., Abraham, A., & Dollé, S. (2018, February). Offline A/B testing for Recommender Systems. In *Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining* (pp. 198-206). ACM.

[Juran, 1937] Juran, J. (1937). Pareto principle. *Juran Institute*.

[Liang et al., 2016] Liang, D., Charlin, L., McInerney, J., & Blei, D. M. (2016, April). Modeling user exposure in recommendation. In *Proceedings of the 25th International Conference on World Wide Web* (pp. 951-961). International World Wide Web Conferences Steering Committee.

[Pearl, 2009] Pearl, Judea. *Causality*. Cambridge University Press, 2009.

[Rendle, 2010] Rendle, S. (2010, December). Factorization machines. In *2010 IEEE International Conference on Data Mining (ICDM)* (pp. 995-1000). IEEE.

[Sutton & Barto, 1998] Sutton, Richard S., and Andrew G. Barto. *Reinforcement learning: An introduction*. MIT Press, 1998.

[Swaminathan et al., 2017] Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudik, M., Langford, J., Jose, D., & Zitouni, I. (2017). Off-policy evaluation for slate recommendation. In *Advances in Neural Information Processing Systems* (pp. 3632-3642).

IPS requires the policies to be stochastic and for the probability of actions under the collection policy to be non-zero wherever the target policy is non-zero. But what happens when the collection policy has additional noise?

For example, let's say a ghost haunts your production pipeline and at random times during collection it tweaks the collection policy -- assume you have still logged the correct propensity; it's just different from what was intended.

It might seem that you would need to correct for this additional noise in the logged propensity somehow. But in fact, counter-intuitively, the standard IPS estimator works fine in this case too.

Let's assume the ghost in the machine ensures that the collection policy is actually a mixture of S sub-policies:

$$\pi_c = \sum_{s=1}^S w_s \pi_c^s$$

In words, for each action the bandit randomly chooses one of S policies according to the weights \(w\), where \( \sum_s w_s = 1\) and \( w_s \ge 0 \; \forall s \), then randomly selects an action according to the selected policy \( \pi_c^s \).

As usual, we are interested in estimating the average reward (r.v. \( R \)) under the target policy \( \pi_t \) as \( \mathbb{E}_{\pi_t}[R] \), using data logged from the collection policy \( (x_i, a_i, r_i) \sim F_X \pi_c F_R \). How can we show that the standard IPS estimator (which uses whatever logged propensity applied to each action) equals \( \mathbb{E}_{\pi_t}[R] \) in expectation? That is, how can we show:

$$ \mathbb{E} \left[ \frac{1}{N} \sum_{n=1}^N \frac{\pi_t(a_n)}{\pi_c^{s_n}(a_n)} r_n \right] = \mathbb{E}_{\pi_t}[R], $$

where I have introduced \( s_n \) as the index of the sub-policy used for action \( n \). The trick here is to separate the data from the different policies into S "sub-experiments", then notice that we have an unbiased estimator for each sub-experiment and use the fact that \( \sum_s w_s = 1\):

$$ \mathbb{E} \left[ \frac{1}{N} \sum_{n=1}^N \frac{\pi_t(a_n)}{\pi_c^{s_n}(a_n)} r_n \right] \\ = \mathbb{E} \left[\sum_{s=1}^S \frac{1}{N} \sum_{n:\, s_n = s} \frac{\pi_t(a_n)}{\pi_c^{s}(a_n)} r_n \right] \\ = \mathbb{E} \left[ \sum_{s=1}^S \frac{|\{n : s_n = s\}|}{N} \mathbb{E}_{\pi_t}[ R ] \right ] \\ = \sum_{s=1}^S w_s \mathbb{E}_{\pi_t}[ R ] \\ = \mathbb{E}_{\pi_t}[ R ]. $$

Needless to say, all this requires that none of the sub-policies violates the assumption that the collection policy has non-zero mass wherever the target policy has non-zero mass.
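A quick simulation (my own construction) illustrates the claim: IPS with the logged per-action propensities recovers the target policy's value even though the ghost mixes two sub-policies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
pi_t = np.array([0.8, 0.2])                 # target policy over 2 actions
subs = np.array([[0.5, 0.5], [0.1, 0.9]])   # the ghost's two sub-policies
w = np.array([0.3, 0.7])                    # mixture weights, sum to 1

s = rng.choice(2, size=n, p=w)              # which sub-policy was active
p1 = subs[s, 1]                             # P(action = 1) under that sub-policy
a = (rng.random(n) < p1).astype(int)        # logged actions
prop = np.where(a == 1, p1, 1 - p1)         # logged (correct) propensities
r = rng.normal(np.where(a == 0, 1.0, 0.0), 0.1)  # E[R|a=0]=1, E[R|a=1]=0

est = np.mean(pi_t[a] / prop * r)           # standard IPS, logged propensities
true_value = pi_t @ np.array([1.0, 0.0])    # = 0.8; est converges to this
```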

*Acknowledgements:* thanks to Alois Gruson, Christophe Charbuillet, and Damien Tardieu for the helpful discussion.

- Off-policy evaluation for slate recommendation. Swaminathan, Adith, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. In *Advances in Neural Information Processing Systems*, pp. 3635-3645. 2017.

Slate recommendation concerns a set of recommendations that are exposed to a user at once. For example, when you do a search, you get a page of ranked results. When you visit the home page of a music, movie, or book app, you usually see a whole page of recommendations. This paper is about how to do off-policy evaluation and optimization with data collected with a known logging policy.

They set up the problem as a contextual bandit where each action is a slate of sub-actions:

$$s = (s_1, \dots, s_l)$$

where \(s\) refers to the slate and \(s_j\) refers to the \(j^\mathrm{th}\) item in the slate.

The key idea is that they estimate the reward for a particular context \(x\) and slate \(s\) using a linear regression:

$$\hat{V}(x, s) = \mathbb{1}_s^\top \phi_{x}$$

where \(\mathbb{1}_s\) is an indicator vector of size \(lm\) telling us which items (out of a possible \(m\) items) appear in the \(l\) slots, and \(\phi_x\) represents the contribution of each item at each slot to the overall reward. The assumption is therefore that, in any given context, the contribution of an item to the overall reward is linear.

The attraction of this approach is that it does not try to model the rewards and so does not suffer model mismatch bias. Instead, they impose a linearity assumption on how the individual item rewards relate to the overall reward for a slate (and make the typical importance-sample-reweighting assumption that the logging policy \(\mu\) is non-zero wherever the target policy \(\pi\) is non-zero, for all contexts). After that, they replace the predicted reward with a single-sample Monte Carlo estimate from the logged data to arrive at the pseudoinverse (PI) estimator

$$\hat{V}_\mathrm{PI} = \frac{1}{N} \sum_{n=1}^N r_n\, \mathbb{1}_{s_n}^\top \Gamma_{x_n}^\dagger\, \mathbb{E}_{s \sim \pi(\cdot | x_n)}[\mathbb{1}_s], \qquad \Gamma_x = \mathbb{E}_{s \sim \mu(\cdot | x)}[\mathbb{1}_s \mathbb{1}_s^\top],$$

where the weights with the smallest norm can be estimated using the closed-form solution for linear regression. One thing that was not clear from the paper was how to estimate the uncentered covariance matrix \(\Gamma_x\), but I guess that can be done in the same way using a single-sample empirical estimate.

This is a pretty neat idea and they go on to show that the PI estimator is unbiased and has low variance in relation to the number of items and slots in the slate.
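To make this concrete, here is a small numpy simulation of the PI estimator (my own construction, not code from the paper) on a toy slate problem with a truly linear reward, a uniform logging policy, and a deterministic target policy:

```python
import numpy as np

rng = np.random.default_rng(1)
m, l = 3, 2                                      # 3 items, 2 slots
slates = [(i, j) for i in range(m) for j in range(m) if i != j]

def indicator(s):
    """Length-lm indicator of which item fills which slot."""
    v = np.zeros(l * m)
    for slot, item in enumerate(s):
        v[slot * m + item] = 1.0
    return v

phi = np.array([0.5, 0.2, 0.1, 0.0, 0.3, 0.4])   # true per-slot item contributions
mu = np.full(len(slates), 1 / len(slates))       # uniform logging policy
pi = np.zeros(len(slates))
pi[0] = 1.0                                      # target: always slate (0, 1)

ones = np.stack([indicator(s) for s in slates])
Gamma = ones.T @ (mu[:, None] * ones)            # uncentered covariance under mu
q_pi = pi @ ones                                 # E_pi[indicator]

N = 50_000
ks = rng.choice(len(slates), size=N, p=mu)       # logged slates
rewards = ones[ks] @ phi + rng.normal(0, 0.1, N) # linear slate reward + noise
w_pi = ones[ks] @ np.linalg.pinv(Gamma) @ q_pi   # PI weights per logged slate
V_hat = np.mean(rewards * w_pi)

V_true = indicator(slates[0]) @ phi              # = 0.5 + 0.3 = 0.8
```

Because the reward really is linear here, V_hat concentrates around V_true as N grows, matching the unbiasedness argument in the paper.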

I like the approach of avoiding bias in the estimator by making assumptions about how the reward decomposes and how it relates to the context. The single sample estimate, on the face of it, looks pretty crude but due to the unbiasedness, the errors cancel out over a large number of logged impressions. The same cannot be said for the model-based approach (what they call the *direct method* or DM). However, as they show in their experiments, DM does well for smaller sets of impressions (< 10^5) which makes sense because that's where the bias introduced by model assumptions can compensate for the small data issues.

While they show that the linear reward assumption has common metrics like NDCG as special cases, I wonder how much interaction exists between items within a slate. An item that is a bad match for a context could tank the reward for the whole page, but it might also be bad in how it interacts with other items. Such interactions cannot be captured by the linearity assumption, only indirectly through the context.

I also wonder what the payoff would be for introducing better models in DM. Specifically, they seem to have considered only regression trees and lasso regression as their DM baselines. Model mismatch bias can be addressed by better models, reducing the error in the DM estimate.

Imagine you're in charge of a bunch of hospitals and are looking at the survival statistics of all the patients with a certain illness (e.g., brain tumours) to help you decide which doctors to promote. (This is probably not how doctors are promoted, but anyway). You might be tempted to pick the doctors with the best survival rates. This is wrong because it ignores the assignment mechanism of patients to doctors. Let's see how wrong you could be with a specific example.

doctor id | num patients | survival rate
---|---|---
0 | 117 | 0.205
1 | 85 | 0.812
2 | 93 | 0.129
3 | 88 | 0.216
4 | 98 | 0.153
5 | 112 | 0.830
6 | 89 | 0.809
7 | 109 | 0.138
8 | 102 | 0.824
9 | 107 | 0.841

Let's make a set of statistical modelling assumptions about how this data was generated.

Assumptions:

- \(D\) doctors, \(N\) patients
- \(Z_{n,d}\) indicates whether patient \(n\) is assigned to doctor \(d\)
- \(Y_n(d)\) is the binary recovery outcome if patient \(n\) is treated by doctor \(d\)
- \(\beta_d\) is the skill of doctor \(d\) (latent; binary: low or high skill)
- \(\theta_n\) is the severity of patient \(n\)'s illness (observed; binary: low or high severity)

Generative procedure, for each patient \(n = 1,\dots,N\):

- draw an indicator for whether the patient is assigned a highly skilled doctor, \(s_n \sim \mathrm{Bernoulli}( \theta_n p_\mathrm{high} + (1-\theta_n) p_\mathrm{low} )\)
- if \(s_n = 1\): draw \(Z_{n,\cdot}\) uniformly from the pool of highly skilled doctors; otherwise, draw \(Z_{n,\cdot}\) uniformly from the pool of other doctors
- calculate the probability of recovery, \(p_\mathrm{recover} = \beta_d \theta_n p_\mathrm{low} + (1-\beta_d) \theta_n p_\mathrm{very\; low} + (1-\theta_n) p_\mathrm{high}\)
- draw recovery \(Y_n(d) \sim \mathrm{Bernoulli}( p_\mathrm{recover} )\)

What this generative process is saying is that we try to assign highly skilled doctors to severely ill patients and let the rest of the doctors take care of the rest of the patients, but this process is noisy: there is a non-zero probability that an average doctor cares for a very ill patient, or that a very skilled doctor cares for a less ill patient.

Why this recovery probability? If the illness is severe, then the best that a skilled doctor can do is ensure a low probability of recovery, while a less skilled doctor will do even worse than that. If the illness is not severe, then it doesn't matter whether the patient gets a very skilled doctor or not, they are likely to recover.
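A minimal simulation of this generative process might look as follows. The \(p\) values here are my own illustrative guesses (the post does not list the ones it used), and the skill vector \(\beta\) is fixed by hand:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed parameter values (illustrative, not taken from the post).
p_high, p_low, p_very_low = 0.85, 0.20, 0.10
D, N = 10, 1000
beta = np.array([1, 0, 1, 1, 1, 0, 0, 1, 0, 0])  # latent doctor skill
theta = rng.integers(0, 2, size=N)               # observed illness severity
high_pool = np.flatnonzero(beta == 1)
low_pool = np.flatnonzero(beta == 0)

Z = np.empty(N, dtype=int)   # doctor assignment
Y = np.empty(N, dtype=int)   # recovery outcome
for n in range(N):
    # severe patients are (noisily) routed to highly skilled doctors
    s = rng.random() < (p_high if theta[n] else p_low)
    d = rng.choice(high_pool if s else low_pool)
    p_recover = (beta[d] * theta[n] * p_low
                 + (1 - beta[d]) * theta[n] * p_very_low
                 + (1 - theta[n]) * p_high)
    Z[n], Y[n] = d, int(rng.random() < p_recover)

# Naive survival rates: skilled doctors look worse because of their case-mix.
for d in range(D):
    mask = Z == d
    print(d, beta[d], mask.sum(), round(Y[mask].mean(), 3))
```

Running this reproduces the qualitative pattern in the table above: the highly skilled doctors end up with the *lowest* naive survival rates.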

In reality, the patient covariates \(\theta_n\) would not be a simple binary number; they would be features of the illness (e.g., tumour size and location, age of patient), and doctors might have skills along multiple dimensions. Binary numbers suit our thought experiment for the moment, but we'll revisit the dimensionality of covariates later.

Using this generative model and a given set of parameters \((p_\mathrm{high}, p_\mathrm{low}, p_\mathrm{very\; low}, N, D, \theta, \beta)\), I generated 10 doctors and 1000 patients (the data you see in the table above, actually). Now, let's pretend that we don't know \(\beta\), the doctor skill, and that we are trying to estimate it from the data alone. Would you want to see doctor 4? Naively, no, but who knows, they might be the top specialist in the country for your illness. I consider how to answer that question next.

doctor id | num severe patients | recovery rate (non-severe) | recovery rate (severe)
---|---|---|---
0 | 103 | 0.857 | 0.117
1 | 7 | 0.885 | 0.000
2 | 87 | 0.833 | 0.080
3 | 78 | 0.900 | 0.128
4 | 92 | 1.000 | 0.098
5 | 6 | 0.877 | 0.000
6 | 11 | 0.923 | 0.000
7 | 97 | 0.917 | 0.041
8 | 11 | 0.923 | 0.000
9 | 8 | 0.909 | 0.000

doctor id | avg treatment effect | doctor skill | num treated | avg recovery rate
---|---|---|---|---
0 | 0.000 | 1 | 117 | 0.205
1 | -0.024 | 0 | 85 | 0.812
2 | 0.000 | 1 | 93 | 0.129
3 | 0.023 | 1 | 88 | 0.216
4 | -0.020 | 1 | 98 | 0.153
5 | -0.080 | 0 | 112 | 0.830
6 | 0.034 | 0 | 89 | 0.809
7 | -0.064 | 1 | 109 | 0.138
8 | 0.010 | 0 | 102 | 0.824
9 | 0.037 | 0 | 107 | 0.841

unit covariate | doc_0 | doc_1 | doc_2 | doc_3 | doc_4
---|---|---|---|---|---
non-severe | 0.028 | 0.156 | 0.012 | 0.020 | 0.012
severe | 0.206 | 0.014 | 0.174 | 0.156 | 0.184

unit covariate | doc_5 | doc_6 | doc_7 | doc_8 | doc_9
---|---|---|---|---|---
non-severe | 0.212 | 0.156 | 0.024 | 0.182 | 0.198
severe | 0.012 | 0.022 | 0.194 | 0.022 | 0.016

doctor id | doctor skill | treated recovery | untreated recovery | avg treatment effect
---|---|---|---|---
0 | 1 | 0.487 | 0.489 | -0.003
1 | 0 | 0.442 | 0.495 | -0.053
2 | 1 | 0.457 | 0.494 | -0.037
3 | 1 | 0.514 | 0.489 | 0.025
4 | 1 | 0.549 | 0.491 | 0.058
5 | 0 | 0.439 | 0.497 | -0.058
6 | 0 | 0.462 | 0.492 | -0.030
7 | 1 | 0.479 | 0.498 | -0.019
8 | 0 | 0.462 | 0.492 | -0.030
9 | 0 | 0.455 | 0.493 | -0.038
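The treatment-effect columns in the tables above can be approximated by comparing recovery rates *within* severity strata rather than naively. The post does not spell out its exact estimator, so the following is a hedged sketch of one standard covariate-adjustment recipe (function name and toy data are mine):

```python
import numpy as np

def adjusted_effect(Y, Z, theta, d):
    """Covariate-adjusted comparison of doctor d against the other doctors.

    Within each severity stratum, compare d's recovery rate to the rate of
    the remaining doctors, then average the differences weighted by how
    common each stratum is.
    """
    effects, weights = [], []
    for severity in (0, 1):
        stratum = theta == severity
        mine = stratum & (Z == d)
        others = stratum & (Z != d)
        if mine.sum() == 0 or others.sum() == 0:
            continue  # no within-stratum comparison possible
        effects.append(Y[mine].mean() - Y[others].mean())
        weights.append(stratum.mean())
    return float(np.average(effects, weights=weights))

# Tiny illustrative dataset: two doctors, severity in {0, 1}.
Y     = np.array([1, 1, 1, 0, 0, 0, 1, 0])
Z     = np.array([0, 1, 0, 1, 0, 1, 0, 1])
theta = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(adjusted_effect(Y, Z, theta, 0))   # doctor 0 beats doctor 1 in both strata
```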

There are many modes of evidence accepted in courts of law. Each mode has its strengths and weaknesses, which will usually be highlighted to suit either side of the case. For example, if a witness places the defendant at the scene of the crime, the defense lawyer will attack her credibility. If fingerprint evidence is lacking, the prosecution will say it's because the defendant was careful. Will inferences from probabilistic models ever become a mode of evidence?

It's a natural idea. Courts institutionally engage with uncertainty. They use phrases like "beyond reasonable doubt", they talk about the balance of evidence, and they consider precision-recall trade-offs (it is "better that ten guilty persons escape than that one innocent suffer", according to English jurist William Blackstone). And the closest thing we have to a science of uncertainty is Bayesian modelling.

In a limited sense we already have probabilistic models as a mode of evidence. For example, the most damning piece of evidence in the Sally Clark cot death case was the testimony of a paediatrician who said that the chance of two cot deaths happening in one household, without malicious action, is 1 in 73 million. This figure was wrong because the model assumptions were wrong -- there could be a genetic or non-malicious environmental component to cot death but this was not captured in the model. But as is vividly illustrated by the Sally Clark case, inferential evidence is currently gatekept by experts. In that sense, the expert is the witness and the court rarely interacts with the model itself. The law has a long history of attacking witness testimony. But what will happen when we have truly democratized Bayesian inference?

Perhaps one day, in a courtroom near you, the defense and prosecution will negotiate an inference method, then present alternative models for explaining data relevant to the case. The lawyers will use their own model to make their side of the case while attacking the opposing side's model.

In what scenarios would probabilistic models be an important mode of evidence?

When there are large amounts of ambiguous data, too large for people to fit into their heads, and even too large/complex to visualize without making significant assumptions.

Consider a trove of emails between employees of a large corporation. The prosecution might propose a network model to support the accusation that management was active or complicit in criminal activities. The defense might counter-propose an alternative model that shows that several key players outside of the management team were the most responsible and took steps to hide the malfeasance from management.

In these types of cases, one would not hope for, or expect, a definitive answer. Inferences are witnesses and they can be validly attacked from both sides on the grounds of model assumptions (and the inference method).

If this were to happen, lawyers would quickly become model criticism ninjas, because they would need model criticism skills to argue their cases. Who knows, maybe those proceedings will make their way onto court room drama TV. In that case, probabilistic model criticism will enter into the public psyche the same way jury selection, unreliable witnesses, and reasonable doubt have. The expertise will come from machines, not humans, and the world will want to develop ever richer language and concepts that enable it to attack the conclusions of that expertise.

Does your attitude to risk change based on the type of uncertainty you harbour? This is a blog post about epistemic and non-epistemic risk.

Here is a quote from theconversation.com:

"Australians [have] an 8.2% chance of being diagnosed with bowel cancer over their lifetime [...] If we assume that a quarter of the Australian population eats 50 grams per day of processed meat, then the lifetime risk for the three-quarters who eat no processed meat would be 7.9% (or about one in 13). For those who eat 50 grams per day, the lifetime risk would be 9.3% (or about one in 11)."

There are at least two ways to interpret the above quote:

Option 1:

- there is a 9.3% chance of getting bowel cancer for processed meat eaters and a 7.9% chance for non-processed meat eaters; genes don't matter

Option 2:

- there is a x% chance of having the genes that make you susceptible to bowel cancer
- if you have the genes that make you susceptible: there is a high chance of getting bowel cancer if you eat meat
- if you don't have the genes that make you susceptible: it doesn't matter what you do, there is a low chance of getting bowel cancer

In either case, we can assume the marginal probability of getting the illness is the same (i.e., we can adjust the percentages in option 2 to make them the same as option 1). If you're a processed meat eater, look at Option 1 and think to yourself: is never eating bacon or a burger again worth 1.6 percentage points reduction in risk? I'm not sure what my answer is, which means that the choices are fairly balanced for me.
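To make Option 2 concrete, here is one way to choose its numbers so that the marginal risks match Option 1. The gene prevalence and non-carrier risk are my own illustrative choices, not from the article:

```python
# Illustrative assumptions: 10% of people carry the susceptibility genes,
# and non-carriers face a 5% lifetime risk regardless of diet. Matching
# the article's marginal risks of 7.9% (no processed meat) and 9.3%
# (50 g/day) then pins down the carriers' risks.
p_genes = 0.10
p_noncarrier = 0.05

p_carrier_no_meat = (0.079 - (1 - p_genes) * p_noncarrier) / p_genes
p_carrier_meat = (0.093 - (1 - p_genes) * p_noncarrier) / p_genes

print(round(p_carrier_no_meat, 2))  # 0.34
print(round(p_carrier_meat, 2))     # 0.48
```

So under these assumptions, carriers face a 34% vs 48% lifetime risk depending on diet, while the population-level numbers are exactly those of Option 1.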

Now look at Option 2. Does your answer change? For a rational agent it should not change. My inner monologue for Option 2 goes as follows: if I have the bad genes, then I'm definitely screwing myself over by eating processed meat, and I want to avoid doing that.

But you don't get to know what genes you have (at least, not yet, that will probably change in the next few years), so the main source of risk is epistemic. That is, you already have the genes that you have (tautological though it is to say), you just don't know which kind you have.

Here's what I think is going on: as we go about our lives we have to "satisfice", which means that we focus on actions that are expected to make big differences and try to avoid fiddling at the margins. Option 1 looks a lot like fiddling at the margins to me. Option 2 instead gives me more control: if I have the bad genes then I'm in much greater control over the risk of a terrible illness. But the greater control is illusory: as long as I remain uncertain about the state of my genes, the utility of eating or not eating processed meat is the same for both Option 1 and Option 2. I call this the paradox of epistemic risk.

I learnt from this book that causality is fun. That hasn't happened to me before. Maybe it's just that the ideas around causal inference that we've been studying in the reading group for a decent length of time now are finally starting to click.

One argument from the book has stuck with me. It goes like this: there is a popular notion that the internet is not an invention or a technology, but an *inevitable ideal* that must be protected from all outside influence. Perhaps this visionary stance was useful in building the internet in the early days, but now that we have the internet, it's less useful.

For example, most recently, this ideal underlies a lot of the objections in the debate about the right to be forgotten. The Court of Justice of the European Union recently ruled that European citizens have a right to have damaging links about them removed from search engine results. People who believe in the idealised internet warn of the dangers of regulating the internet in this way. Here is a quote from Jennifer Granick (Stanford University) in the New Yorker magazine (“Solace of Oblivion”, Sept 29, 2014 issue):

[This legislation] marks the beginning of the end of the global Internet, where everyone has access to the same information, and the beginning of an Internet where there are national networks, where decisions by governments dictate which information people get access to. The Internet as a whole is being Balkanized, and Europeans are going to have a very different access to information than we have.

This warning appears to be designed to send shivers up the spine, to paint a dystopian future where, *gasp*, Europeans see different information to everyone else. Jonathan Zittrain (Harvard Law School) makes the danger explicit in the same article:

"[... ] what happens if the Brazilians come along and say, ‘We want only search results that are consistent with our laws’? It becomes a contest about who can exert the most muscle on Google.” Search companies might decide to tailor their search results in order to offend the fewest countries, limiting all searches according to the rules of the most restrictive country. As Zittrain put it, “Then the convoy will move only as fast as the slowest ship.”

This quote makes explicit the fear of government control and of technology companies’ harmful reactions to such control.

Underlying both quotes is the assumption that the internet is sacred: its destiny is to be global and pure. But, to repeat, the internet is a *technology*, it is built and maintained by us because we find it useful. Replace “internet” with “car” in the above quotes, and consider again the first quote by Jennifer Granick. Does the experience of driving have to be identical wherever you are in the world? Does the fact that the speed limit varies from country to country, or that you can turn right on a red light in some US states and not in others, keep anybody up at night? Road laws, manufacturing safety laws, drink driving laws are highly Balkanised across states and countries, but no-one is worrying whether some authoritarian government of the future is going to take our cars away from us (except maybe some Tea Party activists).

Or consider the second quote, by Jonathan Zittrain. Do car manufacturers have a right to demand to sell identical cars globally? Is car manufacturing technology held back by the country with the most stringent safety laws? Of course not. But even if it were, it wouldn’t be the only consideration on the table. We don’t feel beholden to any visions that Henry Ford may have had a century ago about a future with open roads and rapid transit, certainly not when it comes to preventing people from dying horrifically on the roads.

Yet, when it comes to the internet, a lot of people believe that an unregulated internet trumps everything, including the grief of parents struggling to get over their daughter’s death when pictures of her dead body are circulating online, or the struggles of a person falsely accused of a crime who has to live with those accusations every time someone searches their name. A balanced viewpoint would look at freedom of speech versus the harm such speech causes in broad classes (e.g., criminal cases, breaches of privacy) and make a decision. Different countries have different priorities which will lead to different regulations on internet freedoms, and that’s ok. If the government is authoritarian, then it has *already* put huge restrictions on the local internet (e.g., China, North Korea). You can add that to the list of reasons why it’s unpleasant to live in authoritarian countries.

At this point, the person arguing for a global and pure internet retreats to practicalities. They cite two main practical barriers to the right to be forgotten: 1) it’s impossible to exert complete control over information - a determined person will find the restricted information anyway; 2) it’s too labour intensive to enact the right to be forgotten. Let’s start with the first barrier. I agree that a demand for perfection is misguided, and I don’t believe anyone is making such a demand. It’s possible to take reasonable and simple steps to allow people to be forgotten that get you 95% of the way there. In the same way that a determinedly bad driver can still kill people, a determinedly assiduous employer will still be able to dig up information about a potential employee. But this was always the case, even before the internet.

The second practical barrier is the more important one, I feel, and is a manifestation of the fact that technology enables mass automation (e.g., indexing websites) while the judgement that society requires of Google (i.e., “is this index removal request valid?”) cannot currently be automated. While this challenge is substantial, it’s ironic that the same technologists (me included) who claim that they can solve the world’s societal problems throw their hands up in despair when asked to automate such judgement.

The book was given an interesting review by Gillian Tett of the Financial Times last weekend. Her main point of agreement with the book is that the difference between new and old research on human behaviour analysis is due to the size of the data, plus the extent to which that data is interpreted subjectively vs. objectively (e.g., anthropologists analysing small groups of people vs. the 10GB that Prof. Pentland collected on 80 call centre workers at an American bank).

So far so good. But she goes on to criticise computer scientists for analysing people like atoms "using the tools of physics" when in reality they can be "culturally influenced too". I think this is a false dichotomy.

There are two ways you can read the aforementioned criticism. The first is that statistical models of human behaviour are unable to infer (or learn) cultural influences from the raw data. The second is that the models themselves are constructed under the bias of the researcher's culture.

In the first case, there's nothing to stop researchers from finding new ways of allowing the models to pick up on cultural influences in behaviour (in an unsupervised learning manner), as long as the tools are rich and flexible enough (which they almost certainly are in the case of statistical machine learning).

The second case is more subtle. In my experience, the assumptions that are expressed in the models are highly flexible and don't appear to impose cultural constraints on the data. How do I know? I could be wrong, but my confidence in the flexibility of the models is due to the fact that model performance can be quantified with hard data (e.g., using measures of precision/recall or held-out data likelihood). This means that any undue bias (cultural or otherwise) that one researcher imposes will be almost immediately penalised by another researcher coming along and claiming he or she can do "X%" better in predictive accuracy. This is the kind of honesty that hard data brings to the table, though I agree that it is not perfect and that many other things can go wrong with the academic process (e.g., proprietary data making publications impervious to criticism, and the fact that researchers will spend more effort on optimising their own approach for the dataset in question vs. the benchmarks).

My line of argument of course assumes that the data itself is culturally diverse. This wouldn't be the case if the only data we collected and used came from, for example, university students in developed countries. But the trend is definitely moving away from that situation. Firstly, the amount of data available is growing roughly tenfold every few years (driven by data available from social networking sites like Twitter and Foursquare). At a certain point, the amount of data makes user homogeneity impossible, simply because there just aren't that many university students out there! Secondly, forward-thinking network operators like Orange are making cell tower data from developing countries available to academic researchers (under very restricted circumstances). In conclusion, since the data is becoming more diverse, this should force out any cultural bias in the models (if there was any there to start with).

Mobile location services have been a topic of considerable interest in recent years, both in industry and academia. In industry, software applications (or apps) with location-based components enjoy widespread use. This is evidenced, for example, by the 20 million active users who opt to check in (i.e., share their current location with friends) on Foursquare, the 50 million users who search for local services on Yelp when out and about, and the increasing number who electronically hail a taxi with Uber in 11 cities (up from 1 city in 2011), which they can then watch arrive at their location on a real-time map.

In updating this paragraph, I found that the statistics reflecting 2013's progress by mobile location services are as follows: Foursquare grew from 20 to 30 million users, Yelp grew from 50 million to 100 million users, and Uber is now in 66 cities (up from 11 cities in 2012).

From this small sample of progress, it seems that there is still a lot of growth in location-based services, especially ones involving crowdsourcing of physical tasks (e.g., Uber, TaskRabbit, Gigwalk).

The disappointing one of the pack in terms of user growth is Foursquare (which "only" grew by 50% in 2013), but they are arguably facing the different challenge of proving "that there's a real business there", in CEO Dennis Crowley's words (i.e., finding sustainable revenue streams). But in general, the most promising location-based services are still following the exponential growth curve (in number of users) which is good news for innovation.

During my work, I've often noticed similarities between language and individual daily life location behaviour (as detected by GPS, cell towers, tweets, check-ins etc.). To summarise these thoughts, I've compiled a list of the similarities and differences between language and location below. I then mention a few papers that exploit these similarities to create more powerful or interesting approaches to analysing location data.

**Similarities between Location and Language Data**

- Both exhibit power laws. A lot of words are used very rarely while a few words are used very frequently. The same happens with the frequency of visits to locations (e.g., how often you visit home vs. your favourite theme park). This is not a truism: the most frequently visited locations or words used are *much* more likely to be visited/used than most other places/words.
- Both exhibit sequential structure. Words are highly correlated with words near to them on the page. The same for locations on a particular day.
- Both exhibit topics or themes. In the case of language, groups of words tend to co-occur in the same document (e.g., two webpages that talk about cars are both likely to mention words from a similar group of words representing the "car" topic). In the case of location data, a similar thing happens. I mention two interpretations from specific papers later in this post.
- The availability of both language data and location data has exploded in the last decade (the former from the web, the latter from mobile devices).
- There are cultural differences in using language just as there are cultural differences in location behaviour (e.g., Spanish people like to eat out later than people of other cultures).
- Both are hierarchical. Languages have letters, words, sentences, and paragraphs. A person can be moving around at the level of the street, city, or country (during an hour, day, or week).
- Both exhibit social interactions. Language is exchanged in emails, texts, verbally, or in scholarly debate. Friends, co-workers, and family may have interesting patterns of co-location.

**Differences between Location and Language Data**

- Many words are shared between texts (of same language) but locations are usually highly personal to individuals (except for the special cases of friends, co-workers, and family).
- There are no periodicities in text but strong periodicities in location (e.g., daily, weekly, and monthly).
- Language data is not noisy (except for spelling and grammar mistakes) while location data is usually noisy.
- Language analysts do not usually need to worry about privacy issues whilst location analysts usually do.

**Work that Exploits These Similarities**

Here are a few papers that apply or adapt approaches that were primarily used for language models to location data:

Farrahi and Gatica-Perez use topic modelling to capture the tendency of visiting certain locations on the same day. This is similar to using the presence of words like "windshield" and "wheel" to place higher predictive density on words like "road" and "bumper" (i.e., topic modelling bags of words). I have talked previously about why I think this is a good paper.

This paper uses a similar application of topic modelling as the one by Farrahi and Gatica-Perez.

This paper uses a model that was previously used to capture sequential structure in words and applies it to Foursquare checkins.

In my own work, I've used the concept of topics to refer to location habits that represent the tendency of an individual to be at a given location at a certain time of day or week. This way of thinking about locations is useful in generalising temporal structure in location behaviour across people, while still allowing for topics/habits to be present to greater or varying degrees in different people's location histories (just as topics are more or less prevalent in different documents).

Both language and location data are results of human behaviour, so it is unsurprising to find similarities, even if I think some of the similarities are coincidental (e.g., power laws crop up in many places and often for different reasons, and the increasing availability of data is part of the general trend of moving the things we care about into the digital domain). The benefits of analysis approaches seem to be flowing in the language -> location direction only at the moment, though I hope one day that will change.


There are a few standard techniques for performing inference on hierarchical Bayesian models. Finding the posterior distribution over parameters or performing prediction requires an intractable integral for most Bayesian models, arising from the need to marginalise ("integrate out") nuisance parameters. In the face of this intractability there are two main ways to perform approximate inference: either transform the integration into a sampling problem (e.g., Gibbs sampling, slice sampling) or an optimisation problem (e.g., expectation-maximisation, variational inference).

Probably the most straightforward method is Gibbs sampling (see Chapter 29 of "Information Theory, Inference, and Learning Algorithms" by David MacKay) because you only need to derive conditional probability distributions for each random variable and then sample from these distributions in turn. Of course you have to handle convergence of the Markov chain, and make sure your samples are independent, but you can't go far wrong with the derivation of the conditional distributions themselves. The downside of sampling methods is their slow speed. A related issue is that sampling methods are not ideal for online scalable inference (e.g., learning from streaming social network data).
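To make the recipe concrete, here is a toy Gibbs sampler (my own sketch, not taken from MacKay) for a two-component Gaussian mixture with unit observation variance, fixed equal mixing weights, and a \(\mathcal{N}(0, 10^2)\) prior on each mean. Each step samples from a conditional distribution derived by conjugacy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated Gaussian clusters with unit variance.
x = np.concatenate([rng.normal(-3, 1, 150), rng.normal(3, 1, 150)])
K, N = 2, len(x)
mu = rng.normal(0, 1, K)      # component means
z = rng.integers(0, K, N)     # component assignments

for it in range(200):
    # 1) sample each assignment from its conditional given the means
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2   # up to a constant
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = (rng.random(N)[:, None] > np.cumsum(p, axis=1)).sum(axis=1)
    # 2) sample each mean from its conditional given the assignments
    # (conjugate normal prior N(0, 10^2) on each mean)
    for k in range(K):
        n_k = (z == k).sum()
        var = 1.0 / (n_k + 1.0 / 100.0)
        mu[k] = rng.normal(var * x[z == k].sum(), np.sqrt(var))

print(np.sort(mu))   # should land near the true means -3 and 3
```

In practice you would also sample the mixing weights and variances, monitor convergence, and thin the chain, but the two alternating conditional draws are the essence of the method.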

For these reasons, I have spent the last 6 months learning how to apply variational inference to my mobility models. While there are some very good sources describing variational inference (e.g., chapter 10 of "Pattern Recognition and Machine Learning" by Bishop, this tutorial by Fox and Roberts, this tutorial by Blei), I feel that the operational details can get lost among the theoretical motivation. This makes it hard for someone just starting out to know what steps to follow. Having successfully derived variational inference for several custom hierarchical models (e.g., stick-breaking hierarchical HMMs, extended mixture models), I'm writing a practical summary for anyone about to go down the same path. So, here is my summary for how you actually apply variational Bayes to your model.

**Preliminaries**

I'm omitting an in-depth motivation because it has been covered so well by the aforementioned tutorials. But briefly, the way that mean-field variational inference transforms an integration problem into an optimisation problem is by first assuming that your model factorises further than you originally specified. It then defines a measure of error between the simpler factorised model and the original model (usually, this function is the Kullback-Leibler divergence, which is a measure of distance between two distributions). *The optimisation problem is to minimise this error by modifying the parameters to the factorised model (i.e., the variational parameters)*.
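Concretely, writing \(q(Z, V)\) for the factorised model, the KL divergence to be minimised decomposes as

$$\mathrm{KL}\left( q \,\|\, p \right) = \mathrm{ln} \; p(X | \alpha) - \left( \mathbb{E}_q\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right) - \mathbb{E}_q\left( \mathrm{ln} \; q(Z, V) \right) \right)$$

and since \(\mathrm{ln} \; p(X | \alpha)\) does not depend on \(q\), minimising the KL divergence is equivalent to maximising the bracketed term, known as the evidence lower bound (ELBO).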

Something that can be confusing is that these variational parameters have a similar role in the variational model as the (often, fixed) hyperparameters have in the original model, which is to control things like prior mean, variance, and concentration. The difference is that you will be updating the variational parameters to optimise the factorised model, while the fixed hyperparameters to the original model are left alone. The way that you do this is by using the following equations for the optimal distributions over the parameters and latent variables, which follow from the assumptions made earlier:

$$\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right)$$

$$\mathrm{ln} \; q^*(Z_i) = \mathbb{E}_{-Z_i}\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right)$$

where \(X\) is the observed data, \(V\) is the set of parameters, \(Z\) is the set of latent variables, and \(\alpha\) is the set of hyperparameters. Another source of possible confusion is that these equations do not explicitly include the variational parameters, yet these parameters are the primary source of interest in the variational scheme. In the steps below, I describe how to derive the update equations for the variational parameters from these equations.

**1. Write down the joint probability of your model**

Specify the distributions and conditional dependencies of the data, parameters, and latent variables for your original model. Then write down the joint probability of the model, given the hyperparameters. In the following steps, I'm assuming that all the distributions are conjugate to each other (e.g., multinomial data have Dirichlet priors, Gaussian data have Gaussian-Wishart priors and so on).

The joint probability will usually look like this:

$$p(X, Z, V | \alpha) = p(V | \alpha) \prod_n^N \mathrm{<data \; likelihood \; of \; V, Z_n>} \mathrm{<probability \; of \; Z_n>}$$

where \(N\) is the number of observations. For example, in a mixture model, the data likelihood is \(p(X_n | Z_n, V)\) and the probability of \(Z_n\) is \(p(Z_n | V)\). An HMM has the same form, except that \(Z_n\) now has probability \(p(Z_n | Z_{n-1}, V)\). A Kalman filter is an HMM with continuous \(Z_n\). A topic model introduces an outer product over documents and additional set of (global) parameters.
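For the mixture model case, the template above instantiates to

$$p(X, Z, V | \alpha) = p(V | \alpha) \prod_n^N p(X_n | Z_n, V) \; p(Z_n | V)$$

and the HMM and topic model variants follow by swapping in the corresponding probability of \(Z_n\) and adding the outer product over documents respectively.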

**2. Decide on the independence assumptions for the variational model**

Decide on the factorisation that will allow tractable inference on the simpler model. The assumption that the latent variables are independent of the parameters is a common way to achieve this. Interestingly, you will find that a single assumption of factorisation will often induce further factorisations as a consequence. These come "for free" in the sense that you get simpler and easier equations without having to make any additional assumptions about the structure of the variational model.

Your variational model will probably factorise like this:

$$q(Z, V) = q(Z) q(V)$$

and you will probably get \(q(V) = \prod_i q(V_i)\) as a set of induced factorisations.

**3. Derive the variational update equations**

We now address the optimisation problem of minimising the difference between the factorised model and the original one.

*Parameters*

Use the general formula that we saw earlier:

$$\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln}\; p(X, Z, V | \alpha) \right)$$

The trick is that most of the terms in \(p(X, Z, V | \alpha)\) do not involve \(V_i\), so can be removed from the expectation and absorbed into a single constant (which becomes a normalising factor when you take the exponential of both sides). You will get something that looks like this:

$$\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln} \; p(V_i | \alpha) + \sum_n^N \mathrm{ln} \; \mathrm{<data \; likelihood \; of \; V_i, Z_n>} \right) + \mathrm{constant}$$

What you are left with is the log prior distribution of \(V_i\) plus the total log data likelihood of \(V_i\) given \(Z_n\). Even in the two remaining equations, you can often find terms that do not involve \(V_i\), so a lot of the work in this step involves discarding irrelevant parts.

The remaining work, assuming you chose conjugate distributions for your model, is to manipulate the equations to look like the prior distribution of \(V_i\) (i.e., to have the same functional form as \(p(V_i | \alpha)\)). You will end up with something that looks like this:

$$\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln} \; p(V_i | \alpha_i') \right) + \mathrm{constant}$$

where your goal is to find the value of \(\alpha_i'\) through equation manipulation. \(\alpha_i'\) is your variational parameter, and it will involve expectations of other parameters \(V_{-i}\) and/or \(Z\) (if it didn't, then you wouldn't need an iterative method). It's helpful to remember at this point that there are standard equations to calculate \(\mathbb{E} \left( \mathrm{ln} \; V_j \right)\) for common types of distribution (e.g., for a Dirichlet \(V_j\) with parameters \(\alpha_j\), we have \(\mathbb{E} \left( \mathrm{ln} \; V_{j,k} \right) = \psi(\alpha_{j,k}) - \psi(\sum_{k'} \alpha_{j,k'})\), where \(\psi\) is the digamma function). Sometimes you will have to do further manipulation to find expectations of other functions of the parameters. We consider next how to find the expectations of the latent variables \(\mathbb{E}(Z)\).
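As a minimal sketch of the Dirichlet case (the parameter values here are made up for illustration), the expected log of a Dirichlet-distributed \(V_j\) under \(q\) can be computed with SciPy's digamma function:

```python
import numpy as np
from scipy.special import digamma

# Hypothetical variational Dirichlet parameters for one factor q(V_j)
alpha = np.array([2.0, 5.0, 3.0])

# E(ln V_{j,k}) = psi(alpha_k) - psi(sum over k' of alpha_k')
E_ln_V = digamma(alpha) - digamma(alpha.sum())
```

These expectations are exactly what gets plugged into the update equations for the other factors.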

*Latent variables*

Start with:

$$\mathrm{ln} \; q^*(Z_n) = \mathbb{E}_{-Z_{n}}\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right)$$

and try to factor out \(Z_{n}\). This will usually be the largest update equation because you will not be able to absorb many terms into the constant. This is because you need to consider the parameters generating the latent variables as well as the parameters that control their effect on observed data. Using the example of multinomial independent \(Z_n\) (e.g., in a mixture model), this works out to be:

$$\mathrm{ln} \; q^*(Z_{n,k}) = \mathbb{E}_{-Z_{n,k}}\left( Z_{n,k} \mathrm{ln} \; V_k + Z_{n,k} \mathrm{ln} \; p(X_n | V_k) \right) + \mathrm{constant}$$

factorising out \(Z_{n,k}\) to get:

$$\mathrm{ln} \; \mathbb{E}(Z_{n,k}) = \mathbb{E}(\mathrm{ln} \; V_k) + \mathbb{E}(\mathrm{ln} \; p(X_n | V_k)) + \mathrm{constant}$$
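In code, recovering \(\mathbb{E}(Z_{n,k})\) from the equation above amounts to exponentiating and normalising over \(k\) (subtracting the row maximum first for numerical stability; the unknown additive constant cancels in the normalisation). A small sketch with made-up values for the right-hand side:

```python
import numpy as np

# Hypothetical values of E(ln V_k) + E(ln p(X_n | V_k)) for N=2 points, K=2 components
ln_rho = np.array([[-1.2, -0.3],
                   [-2.0, -0.1]])

# Exponentiate and normalise each row; the constant term cancels here
r = np.exp(ln_rho - ln_rho.max(axis=1, keepdims=True))
r /= r.sum(axis=1, keepdims=True)  # r[n, k] = E(Z_{n,k}), each row sums to one
```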

**4. Implement the update equations**

Put your update equations from step 3 into code. Iterate over the parameters (M-step) and latent variables (E-step) in turn until your parameters converge. Multiple restarts from random initialisations of the expected latent variables are recommended, as variational inference only converges to a local optimum.
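As a rough sketch of what such a loop can look like, here is a deliberately simplified variational mixture of two univariate Gaussians: the precision is assumed known, and the component means are updated as responsibility-weighted point estimates rather than full variational posteriors, so this illustrates the E-step/M-step alternation rather than being a complete implementation:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
# Synthetic data: two well-separated clusters
X = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
N, K = len(X), 2
alpha0, tau = 1.0, 1.0           # symmetric Dirichlet prior, known precision
alpha = np.full(K, alpha0)
m = np.array([-1.0, 1.0])        # initial means; restarts would randomise these

for _ in range(50):
    # E-step: responsibilities from expected log mixing weights + Gaussian log-likelihood
    E_ln_pi = digamma(alpha) - digamma(alpha.sum())
    ln_rho = E_ln_pi - 0.5 * tau * (X[:, None] - m) ** 2
    r = np.exp(ln_rho - ln_rho.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update the variational Dirichlet counts and the component means
    Nk = r.sum(axis=0)
    alpha = alpha0 + Nk
    m = (r * X[:, None]).sum(axis=0) / Nk
```

A full implementation would also place (and update) variational posteriors on the means and precisions, and would wrap this loop in multiple random restarts.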

The video below shows what variational inference looks like on a mixture model. The green scatter points represent the observed data, the blue diamonds are the ground-truth means (not known by the model, obviously), the red dots are the inferred means and the ellipses are the inferred covariance matrices:

I was in London yesterday to watch a debate about the future of mobility (i.e., transport) hosted by Intelligence Squared at the Royal Institution, in the same room where Faraday demonstrated electricity. We saw four speakers give their visions for the future. Despite my impression that any of the speakers could have talked for 10 times longer than they did and still be fascinating, the organisers went for TED-style short talks. The format seemed to work quite well because the talks were polished and energy levels were high in the room throughout.

1. Paul Newman

Prof Newman has been developing self-driving cars at Oxford University's Information Engineering department. In a word, his idea of how self-driving cars will take over the world is "gradually". Imagine driving yourself through a traffic jam when your car offers to take over for you in that stretch of road, and, having already participated in a behind-the-scenes auction, has agreed with your insurance company to offer you a 9 pence discount if you do so (given that computers will be safer drivers than people in future).

The motivations for the gradual change seem to be mostly technical. There will be weather conditions and extreme situations in which the car decides for itself that it wouldn't be able to drive very well. The self-driving cars also need a period in which they will learn our driving styles and learn about the world's roads.

2. Robin Chase

Robin Chase co-founded car sharing startup ZipCar and has since gone on to create new ride sharing companies BuzzCar and GoLoco. She painted two possible extremes for car occupancy in the future.

Future hell is the proliferation of zero-occupancy cars, where owners of self-driving cars send their cars on trips round the block while they shop, for lack of available parking options. Or imagine us all cycling to work, but having our cars follow us with our laptops and a change of clothes (in the style of David Cameron). This scenario was not so plausible to me because I think it currently is, or will be made, illegal to have empty cars driving on the road anywhere in the world.

The future car paradise, in her view, is having 6 people sharing a self-driving car (boosting car occupancy way beyond the current level), but those who can afford it can ride with lower occupancy if they wish. It seems to me that if digital records of car occupancy are kept (which they would be, if we are all ride-sharing in future) then it makes it easy to apply a direct car-occupancy tax, so not many people would be tempted by the much more expensive single-occupancy option.

The other interesting thing she mentioned about the future of car sharing is that car owners would start to fit the car to the occasion (e.g., a swanky car for a first date or a job interview, but something more basic the rest of the time), instead of having to cover all possible scenarios, as they do now, when buying a car that they permanently own.

3. Jerry Sanders

In a polished pitch, Jerry Sanders told us about his plans for SkyTran, a rapid transit system to be built in Tel Aviv. Passengers each get their own pod that will hang from a pylon above the city and be propelled magnetically forward ("surfing on a wave of magnetism") at speeds of 100 mph. The speaker painted SkyTran as the solution to congestion.

I like the idea and would definitely take a ride in one if I ever had the chance, but it seems futuristic in the same way that monorails were meant to be the solution to city congestion 30 years ago. What happened to monorails? According to Wikipedia, city councils perceived them to be a high cost and untested solution, so preferred to invest in more mature (but ultimately more expensive) public transport methods like underground rail.

4. Ben Hamilton-Baillie

After hearing three technology-related visions, it was refreshing to hear low-tech but pragmatic ideas from Ben Hamilton-Baillie. He has pioneered "shared spaces" in this country, which boils down to removing all the signs and road markings in an urban area (e.g., Exhibition Road in London) to induce a sense of shared risk and responsibility.

Someone in the audience asked whether that increases the stress levels of pedestrians. "I'm glad you're less comfortable", he replied: pedestrians are in danger all the time; it's just that they don't feel it.

The talks were followed by a panel Q+A session. Questions from the audience covered topics such as the transport of goods (Robin Chase said that there is still massive unspent capacity in freight), the fate of the £40 billion spent by motorists today (someone on the panel said they would prefer we spend it on massages and take-aways), and the love of cars for their own sake (Paul Newman emphasised that technology gives us options, so car lovers can still indulge themselves in future).

- Ashbrook, Daniel, and Thad Starner. "Using GPS to learn significant locations and predict movement across multiple users." Personal and Ubiquitous Computing 7.5 (2003): 275-286.
- A classic in the field, introducing a vision for finding significant places (with respect to individuals) and learning structure from past behaviour to make predictions. Although there are some shortcomings (e.g., their method of finding significant locations involved manual calibration, and they only considered a first-order Markov model to make predictions), I think they provide a compelling vision that stands the test of time 10 years later.

- Eagle, Nathan, and Alex Sandy Pentland. "Eigenbehaviors: Identifying structure in routine." Behavioral Ecology and Sociobiology 63.7 (2009): 1057-1066.
- In this paper, Eagle and Pentland represent each day of an individual's mobility as a point in high-dimensional space, then use dimensionality reduction (principal component analysis) to find a set of "eigenbehaviors" that best characterises the location behaviour of that individual. Prediction can then be done by finding the mix of eigenbehaviors that best recreates a partially-seen day. The general form of the eigenbehaviors also allows comparison of habits between people, and they found some nice results showing how students from different faculties have similar location habits. I prefer the exploratory applications of this approach to the predictive aspect, because I think the high accuracy results they report are more a function of people staying a long time at the same location in their dataset (since they consider a person to always be in 1 of 4 locations: home, work, other, or no signal). Still, I was always inspired by this paper. An extension that considered richer data sets was done by Sadilek and Krumm (2012), who found good results when also incorporating temporal information (day of week, whether the day was a national holiday).

- Gao, Huiji, Jiliang Tang, and Huan Liu. "Exploring Social-Historical Ties on Location-Based Social Networks." ICWSM. 2012.
- I think this is a strong paper that deserves more attention. It is an extension of the Ashbrook and Starner paper in the sense that the authors provide a more sophisticated way of doing sequential location prediction (i.e., "given your recent locations, where are you likely to be next?") using hierarchical Pitman-Yor processes (HPY). HPY assumes that the pattern of individual daily life locations follows a power law, with the rich getting richer (i.e., there are a few locations that are highly visited, and lots of locations that are hardly ever visited), which has been empirically observed in Gonzalez et al. (2008). Furthermore, HPY is a Bayesian way of varying the number of recent locations considered in the prediction (similar to variable-order Markov models, but with a more principled motivation that doesn't require a max-depth parameter). I suggest reading the paper for a more detailed description than I can give here.

- Horvitz, Eric, and John Krumm. "Some help on the way: Opportunistic routing under uncertainty." Proceedings of the 2012 ACM Conference on Ubiquitous Computing. ACM, 2012.
- The authors propose an anticipatory mobile computing system that recommends interesting places to drivers on their way to a predicted location. The prediction part of the system is assumed (using a previous approach that assumes a rational driver) so the focus is on how to calculate the places to recommend using expected utility. They also consider how to balance the expected benefit of asking a driver to confirm their destination (making the suggested places more relevant) against the cost of interruption. Clearly, there will be times when knowing the user's destination for certain would not change the recommendation very much (v.s. only having a probabilistic prediction), so this approach is useful in avoiding the "talking paperclip" syndrome where anticipatory applications interrupt users too much.

- Farrahi, Katayoun, and Daniel Gatica-Perez. "A probabilistic approach to mining mobile phone data sequences." Personal and Ubiquitous Computing (2013): 1-16.
- This work provides a Bayesian probabilistic model of individual location behaviour that is a hybrid of latent Dirichlet allocation (LDA) and the eigenbehaviors approach of Eagle and Pentland. The similarity to eigenbehaviors comes from the assumption that there exists a set of characteristic motifs that repeat themselves in the data (e.g., leaving home to go to the bus station, then to work). The authors' comparison to n-gram models appears to be a bit of a red herring to me, as the next location in their model is not dependent, in general, on the most recent previous locations (that is not to say a mobility model requires the n-gram assumption to be useful). The benefit of having LDA underneath it all (as opposed to, say, a mixture model) is to express the assumption that motifs are locally correlated within a single day. Intuitively, if I follow the aforementioned "going to work" motif in the morning, then I am probably more likely to follow other workday-related motifs (e.g., going to the gym, then home) than other types of motif later that day. With hierarchical Bayesian modelling, this type of structure can be learnt in an unsupervised way from the data, and then be used to make predictions about future behaviour.

I had to leave a lot of very good papers out, so I hope to make a longer list in future.

The list is focused on datasets from which one might be able to identify individual location routines. Therefore, aggregated or short datasets (< 1 week) are omitted. In no particular order:

GeoLife GPS Trajectories
182 users over a period of 2 years. **Public**.

MIT Reality Mining Dataset (Page on Mit)
Cell tower locations (and various phone logs) of 100 people at MIT. **Semi-public** (you need to submit your details on a web form).

~~Foursquare Data Set - University of Minnesota~~ *[update: seems that Foursquare has asked the collectors of this data to no longer distribute it]*
The check-in behaviour of over 2 million users. **Semi-public** (you have to request a link from them by email).

OpenPaths
Comprises a mixture of GPS, cell tower, and check-in data. **Semi-public** (you have to submit a project description and have individuals donate their data).

Dataset of LifeMap Users
Dataset of GPS and WiFi locations that includes crowdsourced semantic labels for locations by 68 people in South Korea. **Public**.

CRAWDAD
Long list of WiFi fingerprint datasets collected at Dartmouth College, US. **Public**.

The concept we agreed on involved an animation of a log fire that the home occupant would interact with by gesture (in front of the screen). When the occupant wants to increase the temperature of the house, they can put their hands in front of the fire (visualised on their screen) in the way that people naturally do with campfires (a sort of inverted cupping gesture). When they want to decrease the temperature, they can push the fire away, in a sort of rejection gesture (though of course no-one would do this with a real fire).

The other people in the interface team worked on a nifty animation of a log fire in Javascript, and on communicating with the thermostat. However, Leap does not make it very easy for developers to define custom gestures. It seems that the makers of Leap are trying to establish some standard gestures, which users can learn and use for many apps. Good idea for them, but we wanted more flexibility so I set to work on using machine learning to do online classification of gestures.

My first step was to decide upon a set of features. After downloading the Python developer package, I found a script called Sample.py that prints the data that Leap provides to developers. The most important pieces of data for the two gestures I was interested in were the positions and orientations of the hands (both 3-dimensional).

I wanted the features to be invariant to the absolute position of the hands, so decided on using the speed and orientation of the hands only (4 numbers per time step). Clearly, a more general custom gesture recogniser might use more or different features.
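A sketch of what this feature extraction could look like (the array shapes and names here are my assumptions about how one might collect the Leap data into NumPy arrays, not the actual SDK API):

```python
import numpy as np

def gesture_features(positions, orientations):
    """Position-invariant features: log-speed plus hand orientation.

    positions:    (T, 3) array of hand positions per time step (hypothetical)
    orientations: (T, 3) array of hand orientation angles (hypothetical)
    Returns a (T-1, 4) feature matrix.
    """
    # Speed as the norm of the displacement between consecutive time steps
    speed = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    log_speed = np.log(speed + 1e-8)  # speed is non-negative, so log-transform
    return np.hstack([log_speed[:, None], orientations[1:]])

# Toy demonstration with random motion
rng = np.random.default_rng(0)
F = gesture_features(rng.normal(size=(10, 3)), rng.normal(size=(10, 3)))
```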

Having decided on the features, I thought more about the machine learning approach to use. The approach needs to take into account the dynamic nature of gestures, i.e., a single gesture may be expressed over multiple time steps. One idea is to assume that the steps of a gesture can be explained by discrete latent states that follow a first-order Markov property. Fortunately, I have just written a package to do variational inference on non-parametric hidden Markov models (HMM) with multi-modal data. The original purpose was for location data (my main area of research), but since I had written it in a fairly re-usable way, only a few lines were required to run inference on the gesture features I defined (4 dimensional continuous feature data):

K = 20 # truncation parameter for number of components to Dirichlet Process
mvgSensor = MVGaussianSensor(K, XDim, hyp=obsHyperparams[0])
exp_z, sensors, exp_a, Zmax = infer(N, [X], K, [mvgSensor], VERBOSE=1, useMix=1, useHMM=1, plotSensor=mvgSensor, plotX=X[:,:2])

In line 1, I set the truncation parameter to a suitably large number of possible components (though the HMM is not required to use all of these). Line 2 defines the assumed distribution on the data (a multivariate Gaussian), while line 3 runs the actual inference on the data (X). The assumption is that the logarithm of the velocity, and the 3-dimensional orientation of the hands, are normally distributed (and covariant). Clearly, orientation needs a periodic distribution (as the range of values is between 0 and 2*pi, and values very close to 0 are actually close to 2*pi as well), so I would have preferred a von Mises likelihood function there, but time was limited, so a Gaussian would have to be a good enough approximation. The effects should not be too noticeable, since neither of our gestures involves angles near 0 or 2*pi.

Assuming that the latent states of any user movement (which may include a gesture, or may just be noise) can be inferred correctly, the next step is to include supervision, in order to map the unsupervised inference of movement to known concepts like 'pushing away' or 'hands over fire' (which correspond to 'turn the temperature down!' and 'turn it up!', respectively). Since I hope to have already adequately represented complex movements with the latent states of an HMM, a simple supervised learning approach should be all that is needed. I chose linear least squares regression for this, mostly because it can be done in a single line in NumPy:

weights = numpy.linalg.lstsq(Z, numpy.reshape(Y, (N, 1)))[0] # [0] selects the solution from the (solution, residuals, rank, singular values) tuple

where Z is the set of inferred distributions over latent states and Y is the set of corresponding target values (both for N time steps).
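To illustrate the idea end-to-end with stand-in data (random state distributions in place of real HMM output, and a made-up 0/1 supervision signal), fitting the weights and then thresholding the regression output at the midpoint between the two target values gives a classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 5
Z = rng.dirichlet(np.ones(K), size=N)            # stand-in for inferred state distributions
Y = (Z[:, :2].sum(axis=1) > 0.5).astype(float)   # toy 0/1 supervision signal

# Least squares fit; [0] selects the solution from the returned tuple
weights = np.linalg.lstsq(Z, Y.reshape((N, 1)), rcond=None)[0]

# Classify by thresholding the regression output at the class midpoint
pred = (Z @ weights).ravel() > 0.5
accuracy = (pred == Y.astype(bool)).mean()
```

No intercept column is needed here because the rows of Z sum to one, so a constant offset is already representable by the weights.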

Finally, I created some training data. I spent a few minutes cupping my hands around an imaginary fire, and then another few minutes pushing it away. This gave me a 4000-by-4 feature matrix (with an observation on each row) with which to train. I gave a rough and ready supervision signal (for the final, linear regression step) of 2000 negative outputs ('turn up the heat') followed by 2000 positive outputs ('turn down the heat').

Performing nonparametric HMM inference on this data, the model told me that 19 components were enough to explain the data (which seemed a lot to me):

The graph shows parameters to the Gaussians (mean and covariance) that were learnt to two of the feature dimensions (the other two dimensions are not shown here, but are obviously included in the inference). After applying linear regression, I first tested on the original training data:

In the data I collected, I switched gestures half way through (i.e., a minute or so of me doing the first gesture, and then another minute doing the second gesture). Therefore, one would expect the classification line to be under the threshold at first, then switch to be over the threshold about half way through the data. For comparison, I also plotted how straightforward linear regression (again with least squares, with the green line) does on the raw data (without applying an HMM). We see that simple regression is not enough (achieving 55.6% accuracy), indicating that the raw data is not linearly separable. The combined approach (with the HMM) does much better at 89.8% accuracy.

Of course, this result was from testing with data seen during training, so I then tested on a shorter, unseen data set that I also collected:

My approach achieved 73.8% accuracy (vs. 53.3% for the simple linear regression). For both data sets, there appear to be portions of the 'hands over fire' gesture (where the expected target value is 0) that spike above the threshold. I think this is because I had to move my hands into place, causing the speed to be non-zero. This is important because I think that one important characteristic of the 'hands over fire' gesture is likely to be zero motion. It would be interesting to see how well we do with just the speed feature (without the angles), though I didn't have time for this, and in any case, I've noticed a difference in the angle of my hands between the two gestures.

Overall, I think that kind of accuracy is not bad for a day's work, especially considering that I collected a very small training dataset in a few minutes of hand waving. If I collected more data, I think the results would improve. 73.8% accuracy is probably not good enough yet for this sort of application, as it would obviously be undesirable for the system to turn the thermostat down when the user meant to turn it up (or vice versa). I think the accuracy could be improved a lot with some smoothing, and by somehow dealing with the initial and tail ends of a gesture (which appear to result in a lot of noise). Also, it might be interesting to see how well this works when training on one set of people and testing on another.

In general, I'm excited by the idea of learning a rich representation of behaviour in an unsupervised way, then applying supervision only at the final steps, because it points towards the possibility of collecting a much larger training data set while spending a much smaller amount of time on labelling only some of the data (obviously, this is semi-supervised learning).

*I will make all the code I wrote for this available soon.*

Here are some additional notes about preprocessing that made a big difference in accuracy for me:

- since the speed feature is non-negative, I took the logarithm of that dimension (resulting in the assumption that the speed is log-normally distributed).
- as per usual, I centred all the data by the mean, and gave it unit standard deviation.
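Those two preprocessing steps together, as a small sketch (the speed is assumed to sit in column 0 of the feature matrix):

```python
import numpy as np

def preprocess(F):
    """Log-transform the non-negative speed column, then standardise every column."""
    F = F.copy()
    F[:, 0] = np.log(F[:, 0] + 1e-8)             # speed -> log-speed (log-normal assumption)
    return (F - F.mean(axis=0)) / F.std(axis=0)  # zero mean, unit standard deviation

# Toy demonstration: positive speeds in column 0, arbitrary angles elsewhere
rng = np.random.default_rng(0)
F = np.hstack([np.abs(rng.normal(size=(100, 1))) + 0.1, rng.normal(size=(100, 3))])
G = preprocess(F)
```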