- Swaminathan, Adith, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. In
*Advances in Neural Information Processing Systems*, pp. 3635-3645, 2017.

Slate recommendation concerns a set of recommendations that are exposed to a user at once. For example, when you do a search, you get a page of ranked results. When you visit the home page of a music, movie, or book app, you usually see a whole page of recommendations. This paper is about how to do off-policy evaluation and optimization using data collected under a known logging policy.

They set up the problem as a contextual bandit where each action is a slate of sub-actions:

$$s = (s_1, \dots, s_l)$$

where \(s\) refers to the slate and \(s_j\) to the \(j^\mathrm{th}\) item in the slate.

The key idea is that they estimate the reward for a particular context \(x\) and slate \(s\) using a linear regression:

$$\hat{V}(x, s) = \mathbb{1}_s^\top \phi_{x}$$

where \(\mathbb{1}_s\) is an indicator vector of size \(lm\) telling us which items (out of a possible \(m\)) appear in the \(l\) slots, and \(\phi_x\) represents each item's contribution to the overall reward at each slot. The assumption is therefore that, in any given context, an item's contribution to the overall reward is linear.

The attraction of this approach is that it does not try to model the rewards and so does not suffer from model mismatch bias. Instead, they impose a linearity assumption on how the individual item rewards relate to the overall reward for a slate (and make the typical importance-sampling reweighting assumption that the logging policy \(\mu\) is non-zero wherever the target policy \(\pi\) is non-zero, for all contexts). After that, they replace the predicted reward with a single-sample Monte Carlo estimate from the logged data to arrive at their pseudoinverse (PI) estimator, in which the weights with the smallest norm can be estimated using the closed-form solution for linear regression. One thing that was not clear from the paper was how to estimate the uncentered covariance matrix, but I guess that can be done in the same way using a single-sample empirical estimate.
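The closed-form weight-recovery step can be sketched concretely. This is a rough illustration under my own simplifying assumptions (a single fixed context, noiseless linear rewards, enumerable slates), not the paper's full PI estimator; the function names are mine.

```python
import numpy as np

def min_norm_weights(indicators, rewards):
    """Recover minimum-norm slot-item weights phi for one context.

    indicators: (n, l*m) array of logged slate indicator vectors 1_s
    rewards:    (n,) array of observed slate rewards

    Solves E[1_s 1_s^T] phi = E[r 1_s] with a pseudoinverse, which picks
    the smallest-norm phi when the system is underdetermined.
    """
    n = len(rewards)
    gamma = indicators.T @ indicators / n   # empirical uncentered covariance
    target = indicators.T @ rewards / n     # empirical E[r * 1_s]
    return np.linalg.pinv(gamma) @ target

def value_estimate(phi, expected_indicator):
    """Estimated slate value of a target policy: E_pi[1_s]^T phi."""
    return expected_indicator @ phi
```

Even though \(\phi\) itself is only identified up to the span of the observed indicators, the recovered weights reproduce every logged reward exactly when the rewards really are linear.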

This is a pretty neat idea, and they go on to show that the PI estimator is unbiased and that its variance scales well with the number of items and slots in the slate.

I like the approach of avoiding bias in the estimator by making assumptions about how the reward decomposes and how it relates to the context. The single sample estimate, on the face of it, looks pretty crude but due to the unbiasedness, the errors cancel out over a large number of logged impressions. The same cannot be said for the model-based approach (what they call the *direct method* or DM). However, as they show in their experiments, DM does well for smaller sets of impressions (< 10^5) which makes sense because that's where the bias introduced by model assumptions can compensate for the small data issues.

While they show that the linear reward assumption has common metrics like NDCG as special cases, I wonder how much interaction exists between items within a slate. An item that is a bad match for a context could tank the reward for the whole page, but it might also be bad in how it interacts with other items. Such interactions cannot be captured by the linear form, only indirectly through the context.

And I also wonder what the payoff is for introducing better models in DM. Specifically, they seem to have considered only regression trees and lasso regression as their baselines for DM. Model mismatch bias can be addressed by better models, reducing the error in the DM estimate.

Imagine you're in charge of a bunch of hospitals and are looking at the survival statistics of all the patients with a certain illness (e.g., brain tumours) to help you decide which doctors to promote. (This is probably not how doctors are promoted, but anyway). You might be tempted to pick the doctors with the best survival rates. This is wrong because it ignores the assignment mechanism of patients to doctors. Let's see how wrong you could be with a specific example.

doctor id | num patients | survival rate
---|---|---
0 | 117 | 0.205
1 | 85 | 0.812
2 | 93 | 0.129
3 | 88 | 0.216
4 | 98 | 0.153
5 | 112 | 0.830
6 | 89 | 0.809
7 | 109 | 0.138
8 | 102 | 0.824
9 | 107 | 0.841

Let's make a set of statistical modelling assumptions about how this data was generated.

Assumptions:

- \(D\) doctors, \(N\) patients
- \(Z_{n,d}\) indicates whether patient \(n\) is assigned to doctor \(d\)
- \(Y_n(d)\) is the binary recovery outcome if patient \(n\) is treated by doctor \(d\)
- \(\beta_d\) is the skill of doctor \(d\) (latent, binary: low or high skill)
- \(\theta_n\) is the severity of patient \(n\)'s illness (observed, binary: low or high severity)

Generative procedure, for each patient \(n = 1,\dots,N\):

- draw an indicator for whether the patient is assigned a highly skilled doctor, \(s_n \sim \mathrm{Bernoulli}( \theta_n p_\mathrm{high} + (1-\theta_n) p_\mathrm{low} )\)
- if \(s_n = 1\): draw \(Z_{n\cdot}\) uniformly from the pool of highly skilled doctors
- otherwise: draw \(Z_{n\cdot}\) uniformly from the pool of other doctors
- calculate the probability of recovery as \(p_\mathrm{recover} = \beta_d \theta_n p_\mathrm{low} + (1-\beta_d) \theta_n p_\mathrm{very\; low} + (1-\theta_n) p_\mathrm{high}\)
- draw recovery \(Y_n(d) \sim \mathrm{Bernoulli}( p_\mathrm{recover} )\)

What this generative process says is that we try to assign highly skilled doctors to severely ill patients and let the rest of the doctors take care of the rest of the patients, but this process is noisy: there is a non-zero probability that an average doctor will care for a very ill patient, or that a very skilled doctor will care for a less ill patient.

Why this recovery probability? If the illness is severe, then the best that a skilled doctor can do is ensure a low probability of recovery, while a less skilled doctor will do even worse than that. If the illness is not severe, then it doesn't matter whether the patient gets a very skilled doctor or not, they are likely to recover.
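The generative procedure can be simulated in a few lines. The parameter values below are my own assumptions for illustration (the post does not state them); the doctor skills \(\beta\) are taken from the skill column of the later tables.

```python
import random

rng = random.Random(0)

# assumed parameter values (not given in the post)
p_high, p_low, p_very_low = 0.9, 0.15, 0.05
D, N = 10, 1000
beta = [1, 0, 1, 1, 1, 0, 0, 1, 0, 0]   # doctor skill, from the later tables
skilled = [d for d in range(D) if beta[d]]
unskilled = [d for d in range(D) if not beta[d]]

records = []                            # (doctor, severe, recovered) triples
for n in range(N):
    theta = rng.random() < 0.5          # observed severity of patient n
    # s_n ~ Bernoulli(theta * p_high + (1 - theta) * p_low)
    s = rng.random() < (p_high if theta else p_low)
    d = rng.choice(skilled) if s else rng.choice(unskilled)
    # p_recover = beta_d theta p_low + (1-beta_d) theta p_very_low + (1-theta) p_high
    if theta:
        p_recover = p_low if beta[d] else p_very_low
    else:
        p_recover = p_high
    records.append((d, theta, rng.random() < p_recover))
```

Because skilled doctors mostly see severe patients, their raw survival rates come out worse, which is exactly the confound in the first table.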

In reality, the patient covariates \(\theta_n\) would not be a single binary value; they would be features of the illness (e.g., tumour size and location, age of patient), and doctors might have skills along multiple dimensions. Binary values suit our thought experiment for the moment, but we'll revisit the dimensionality of covariates later.

Using this generative model and a given set of parameters \((p_\mathrm{high}, p_\mathrm{low}, p_\mathrm{very\; low}, N, D, \theta, \beta)\), I generated 10 doctors and 1000 patients (the data you see in the table above, actually). Now, let's pretend that we don't know \(\beta\), the doctor skills that we are trying to estimate from the data alone. Would you want to see doctor 4? Naively, no, but who knows, they might be the top specialist in the country for your illness. I consider how to answer that question next.

doctor id | num severe patients | recovery rate (non-severe) | recovery rate (severe)
---|---|---|---
0 | 103 | 0.857 | 0.117
1 | 7 | 0.885 | 0.000
2 | 87 | 0.833 | 0.080
3 | 78 | 0.900 | 0.128
4 | 92 | 1.000 | 0.098
5 | 6 | 0.877 | 0.000
6 | 11 | 0.923 | 0.000
7 | 97 | 0.917 | 0.041
8 | 11 | 0.923 | 0.000
9 | 8 | 0.909 | 0.000

doctor id | avg treatment effect | doctor skill | num treated | avg recovery rate
---|---|---|---|---
0 | 0.000 | 1 | 117 | 0.205
1 | -0.024 | 0 | 85 | 0.812
2 | 0.000 | 1 | 93 | 0.129
3 | 0.023 | 1 | 88 | 0.216
4 | -0.020 | 1 | 98 | 0.153
5 | -0.080 | 0 | 112 | 0.830
6 | 0.034 | 0 | 89 | 0.809
7 | -0.064 | 1 | 109 | 0.138
8 | 0.010 | 0 | 102 | 0.824
9 | 0.037 | 0 | 107 | 0.841

unit covariate | doc_0 | doc_1 | doc_2 | doc_3 | doc_4
---|---|---|---|---|---
non-severe | 0.028 | 0.156 | 0.012 | 0.020 | 0.012
severe | 0.206 | 0.014 | 0.174 | 0.156 | 0.184

unit covariate | doc_5 | doc_6 | doc_7 | doc_8 | doc_9
---|---|---|---|---|---
non-severe | 0.212 | 0.156 | 0.024 | 0.182 | 0.198
severe | 0.012 | 0.022 | 0.194 | 0.022 | 0.016

doctor id | doctor skill | treated recovery | untreated recovery | avg treatment effect
---|---|---|---|---
0 | 1 | 0.487 | 0.489 | -0.003
1 | 0 | 0.442 | 0.495 | -0.053
2 | 1 | 0.457 | 0.494 | -0.037
3 | 1 | 0.514 | 0.489 | 0.025
4 | 1 | 0.549 | 0.491 | 0.058
5 | 0 | 0.439 | 0.497 | -0.058
6 | 0 | 0.462 | 0.492 | -0.030
7 | 1 | 0.479 | 0.498 | -0.019
8 | 0 | 0.462 | 0.492 | -0.030
9 | 0 | 0.455 | 0.493 | -0.038
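The severity-stratified comparisons behind these tables can be sketched as follows. This is a hypothetical helper (the record format and function name are mine, not from the post) that compares one doctor's recovery rate to everyone else's within each severity stratum.

```python
def stratified_comparison(records, doctor):
    """Recovery rate of one doctor vs. all other doctors, per severity stratum.

    records: iterable of (doctor_id, severe, recovered) triples, where
    severe and recovered are booleans (or 0/1).
    Returns {severe: (doctor_rate, others_rate)} for strata with data.
    """
    out = {}
    for severe in (False, True):
        mine = [r for d, s, r in records if d == doctor and bool(s) == severe]
        rest = [r for d, s, r in records if d != doctor and bool(s) == severe]
        if mine and rest:
            out[severe] = (sum(mine) / len(mine), sum(rest) / len(rest))
    return out
```

Averaging the per-stratum differences, weighted by how common each severity level is, gives a crude per-doctor treatment-effect estimate in the spirit of the tables above.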

There are many modes of evidence accepted in courts of law. Each mode has its strengths and weaknesses, which will usually be highlighted to suit either side of the case. For example, if a witness places the defendant at the scene of the crime, the defense lawyer will attack her credibility. If fingerprint evidence is lacking, the prosecution will say it's because the defendant was careful. Will inferences from probabilistic models ever become a mode of evidence?

It's a natural idea. Courts institutionally engage with uncertainty. They use phrases like "beyond reasonable doubt", they talk about the balance of evidence, and they consider precision-recall trade-offs (it is "better that ten guilty persons escape than that one innocent suffer", according to English jurist William Blackstone). And the closest thing we have to a science of uncertainty is Bayesian modelling.

In a limited sense we already have probabilistic models as a mode of evidence. For example, the most damning piece of evidence in the Sally Clark cot death case was the testimony of a paediatrician who said that the chance of two cot deaths happening in one household, without malicious action, is 1 in 73 million. This figure was wrong because the model assumptions were wrong -- there could be a genetic or non-malicious environmental component to cot death but this was not captured in the model. But as is vividly illustrated by the Sally Clark case, inferential evidence is currently gatekept by experts. In that sense, the expert is the witness and the court rarely interacts with the model itself. The law has a long history of attacking witness testimony. But what will happen when we have truly democratized Bayesian inference?

Perhaps one day, in a courtroom near you, the defense and prosecution will negotiate an inference method, then present alternative models for explaining data relevant to the case. The lawyers will use their own model to make their side of the case while attacking the opposing side's model.

In what scenarios would probabilistic models be an important mode of evidence?

When there are large amounts of ambiguous data, too large for people to fit into their heads, and even too large/complex to visualize without making significant assumptions.

Consider a trove of emails between employees of a large corporation. The prosecution might propose a network model to support the accusation that management was active or complicit in criminal activities. The defense might counter-propose an alternative model that shows that several key players outside of the management team were the most responsible and took steps to hide the malfeasance from management.

In these types of cases, one would not hope for, or expect, a definitive answer. Inferences are witnesses and they can be validly attacked from both sides on the grounds of model assumptions (and the inference method).

If this were to happen, lawyers would quickly become model criticism ninjas, because they would need model criticism skills to argue their cases. Who knows, maybe those proceedings will make their way onto courtroom drama TV. In that case, probabilistic model criticism will enter the public psyche the same way jury selection, unreliable witnesses, and reasonable doubt have. The expertise will come from machines, not humans, and the world will want to develop ever richer language and concepts that enable it to attack the conclusions of that expertise.

Does your attitude to risk change based on the type of uncertainty you harbour? This is a blog post about epistemic and non-epistemic risk.

Here is a quote from theconversation.com:

"Australians [have] an 8.2% chance of being diagnosed with bowel cancer over their lifetime [...] If we assume that a quarter of the Australian population eats 50 grams per day of processed meat, then the lifetime risk for the three-quarters who eat no processed meat would be 7.9% (or about one in 13). For those who eat 50 grams per day, the lifetime risk would be 9.3% (or about one in 11)."

There are at least two ways to interpret the above quote:

Option 1:

- there is a 9.3% chance of getting bowel cancer for processed meat eaters and a 7.9% chance for non-processed meat eaters; genes don't matter

Option 2:

- there is a x% chance of having the genes that make you susceptible to bowel cancer
- if you have the genes that make you susceptible: there is a high chance of getting bowel cancer if you eat meat
- if you don't have the genes that make you susceptible: it doesn't matter what you do, there is a low chance of getting bowel cancer

In either case, we can assume the marginal probability of getting the illness is the same (i.e., we can adjust the percentages in Option 2 to make them the same as Option 1). If you're a processed meat eater, look at Option 1 and think to yourself: is never eating bacon or a burger again worth a 1.4 percentage point reduction in risk? I'm not sure what my answer is, which means that the choices are fairly balanced for me.

Now look at Option 2. Does your answer change? For a rational agent it should not change. My inner monologue for Option 2 goes as follows: if I have the bad genes, then I'm definitely screwing myself over by eating processed meat, and I want to avoid doing that.

But you don't get to know what genes you have (at least, not yet, that will probably change in the next few years), so the main source of risk is epistemic. That is, you already have the genes that you have (tautological though it is to say), you just don't know which kind you have.

Here's what I think is going on: as we go about our lives we have to "satisfice", which means that we focus on actions that are expected to make big differences and try to avoid fiddling at the margins. Option 1 looks a lot like fiddling at the margins to me. Option 2 instead gives me more control: if I have the bad genes then I'm in much greater control over the risk of a terrible illness. But the greater control is illusory: as long as I remain uncertain about the state of my genes, the utility of eating or not eating processed meat is the same for both Option 1 and Option 2. I call this the paradox of epistemic risk.
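The equivalence can be made concrete with an invented parameterisation of Option 2; the numbers below are my own, chosen only to reproduce the quoted marginals:

```latex
g = P(\text{susceptible genes}) = 0.2 \\
\text{Eaters: } P(\text{cancer}) = 0.2 \times 0.35 + 0.8 \times 0.029 = 0.093 \\
\text{Non-eaters: } P(\text{cancer}) = 0.2 \times 0.28 + 0.8 \times 0.029 = 0.079 \\
\text{Expected gain from abstaining: } 0.2 \times (0.35 - 0.28) = 0.014
```

However the risk is decomposed, the expected reduction from giving up processed meat matches the Option 1 gap of 9.3% - 7.9% = 1.4 percentage points.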

I learnt from this book that causality is fun. That hasn't happened to me before. Maybe it's just that the ideas around causal inference that we've been studying in the reading group for a decent length of time now are finally starting to click.

One argument from the book has stuck with me. It goes like this: there is a popular notion that the internet is not an invention or a technology, but an *inevitable ideal* that must be protected from all outside influence. Perhaps this visionary stance was useful in building the internet in the early days, but now that we have the internet, it's less useful.

For example, most recently, this ideal underlies a lot of the objections in the debate about the right to be forgotten. The Court of Justice of the European Union recently ruled that European citizens have a right to have damaging links about them removed from search engine results. People who believe in the idealised internet warn of the dangers of regulating the internet in this way. Here is a quote from Jennifer Granick (Stanford University) in the New Yorker magazine (“Solace of Oblivion”, Sept 29, 2014 issue):

[This legislation] marks the beginning of the end of the global Internet, where everyone has access to the same information, and the beginning of an Internet where there are national networks, where decisions by governments dictate which information people get access to. The Internet as a whole is being Balkanized, and Europeans are going to have a very different access to information than we have.

This warning appears to be designed to send shivers up the spine, to paint a dystopian future where, *gasp*, Europeans see different information to everyone else. Jonathan Zittrain (Harvard Law School) makes the danger explicit in the same article:

"[... ] what happens if the Brazilians come along and say, ‘We want only search results that are consistent with our laws’? It becomes a contest about who can exert the most muscle on Google.” Search companies might decide to tailor their search results in order to offend the fewest countries, limiting all searches according to the rules of the most restrictive country. As Zittrain put it, “Then the convoy will move only as fast as the slowest ship.”

This quote makes explicit the fear of government control and of technology companies’ harmful reactions to such control.

Underlying both quotes is the assumption that the internet is sacred: its destiny is to be global and pure. But, to repeat, the internet is a *technology*, it is built and maintained by us because we find it useful. Replace “internet” with “car” in the above quotes, and consider again the first quote by Jennifer Granick. Does the experience of driving have to be identical wherever you are in the world? Does the fact that the speed limit varies from country to country, or that you can turn right on a red light in some US states and not in others, keep anybody up at night? Road laws, manufacturing safety laws, drink driving laws are highly Balkanised across states and countries, but no-one is worrying whether some authoritarian government of the future is going to take our cars away from us (except maybe some Tea Party activists).

Or consider the second quote, by Jonathan Zittrain. Do car manufacturers have a right to demand to sell identical cars globally? Is car manufacturing technology held back by the country with the most stringent safety laws? Of course not. But even if it were, it wouldn’t be the only consideration on the table. We don’t feel beholden to any visions that Henry Ford may have had a century ago about a future with open roads and rapid transit, certainly not when it comes to preventing people from dying horrifically on the roads.

Yet, when it comes to the internet, a lot of people believe that an unregulated internet trumps everything, including the grief of parents struggling to get over their daughter’s death when pictures of her dead body are circulating online, or the struggles of a person falsely accused of a crime who has to live with those accusations every time someone searches their name. A balanced viewpoint would look at freedom of speech versus the harm such speech causes in broad classes (e.g., criminal cases, breaches of privacy) and make a decision. Different countries have different priorities which will lead to different regulations on internet freedoms, and that’s ok. If the government is authoritarian, then it has *already* put huge restrictions on the local internet (e.g., China, North Korea). You can add that to the list of reasons why it’s unpleasant to live in authoritarian countries.

At this point, the person arguing for a global and pure internet retreats to practicalities. They cite two main practical barriers to the right to be forgotten: 1) it’s impossible to exert complete control over information - a determined person will find the restricted information anyway; 2) it’s too labour intensive to enact the right to be forgotten. Let’s start with the first barrier. I agree that a demand for perfection is misguided, and I don’t believe anyone is making such a demand. It’s possible to take reasonable and simple steps to allow people to be forgotten that get you 95% of the way there. In the same way that a determinedly bad driver can still kill people, a determinedly assiduous employer will still be able to dig up information about a potential employee. But this was always the case, even before the internet.

The second practical barrier is the more important one, I feel, and is a manifestation of the fact that technology enables mass automation (e.g., indexing websites) while the judgement that society requires of Google (i.e., “is this index removal request valid?”) cannot currently be automated. While this challenge is substantial, it’s ironic that the same technologists (me included) who claim that they can solve the world’s societal problems, throw their hands up in despair when asked to automate such judgement.

The book was given an interesting review by Gillian Tett of the Financial Times last weekend. Her main point of agreement with the book is that the difference between new and old research on human behaviour analysis is due to the size of the data, plus the extent to which that data is interpreted subjectively vs. objectively (e.g., anthropologists analysing small groups of people vs. the 10GB that Prof. Pentland collected on 80 call centre workers at an American bank).

So far so good. But she goes on to criticise computer scientists for analysing people like atoms "using the tools of physics" when in reality they can be "culturally influenced too". I think this is a false dichotomy.

There are two ways you can read the aforementioned criticism. The first is that statistical models of human behaviour are unable to infer (or learn) cultural influences from the raw data. The second is that the models themselves are constructed under the bias of the researcher's culture.

In the first case, there's nothing to stop researchers from finding new ways of allowing the models to pick up on cultural influences in behaviour (in an unsupervised learning manner), as long as the tools are rich and flexible enough (which they almost certainly are in the case of statistical machine learning).

The second case is more subtle. In my experience, the assumptions that are expressed in the models are highly flexible and don't appear to impose cultural constraints on the data. How do I know? I could be wrong, but my confidence in the flexibility of the models is due to the fact that model performance can be quantified with hard data (e.g., using measures of precision/recall or held-out data likelihood). This means that any undue bias (cultural or otherwise) that one researcher imposes will be almost immediately penalised by another researcher coming along and claiming he or she can do "X%" better in predictive accuracy. This is the kind of honesty that hard data brings to the table, though I agree that it is not perfect and that many other things can go wrong with the academic process (e.g., proprietary data making publications impervious to criticism, and the fact that researchers will spend more effort on optimising their own approach for the dataset in question vs. the benchmarks).

My line of argument of course assumes that the data itself is culturally diverse. This wouldn't be the case if the only data we collect and use comes from, for example, university students in developed countries. But the trend is definitely moving away from that situation. Firstly, the amount of data available is growing by a factor of 10 about every few years (driven by data available from social networking sites like Twitter and Foursquare). At a certain point, the amount of data makes user homogeneity impossible, simply because there just aren't that many university students out there! Secondly, forward thinking network operators like Orange are making cell tower data from developing countries available to academic researchers (under very restricted circumstances). So in conclusion, since the data is becoming more diverse, this should force out any cultural bias in the models (if there was any there to start with).

Mobile location services have been a topic of considerable interest in recent years, both in industry and academia. In industry, software applications (or apps) with location-based components enjoy widespread use. This is evidenced, for example, by the 20 million active users who opt to check in (i.e., share their current location with friends) on Foursquare, the 50 million users who search for local services on Yelp when out and about, and the increasing number who electronically hail a taxi with Uber in 11 cities (up from 1 city in 2011), which they can then watch arrive at their location on a real-time map.

In updating this paragraph, I found that the statistics reflecting 2013's progress by mobile location services are as follows: Foursquare grew from 20 to 30 million users, Yelp grew from 50 million to 100 million users, and Uber is now in 66 cities (up from 11 cities in 2012).

From this small sample of progress, it seems that there is still a lot of growth in location-based services, especially ones involving crowdsourcing of physical tasks (e.g., Uber, TaskRabbit, Gigwalk).

The disappointing one of the pack in terms of user growth is Foursquare (which "only" grew by 50% in 2013), but they are arguably facing the different challenge of proving "that there's a real business there", in CEO Dennis Crowley's words (i.e., finding sustainable revenue streams). But in general, the most promising location-based services are still following the exponential growth curve (in number of users) which is good news for innovation.

During my work, I've often noticed similarities between language and individual daily-life location behaviour (as detected by GPS, cell towers, tweets, check-ins, etc.). To summarise these thoughts, I've compiled a list of the similarities and differences between language and location below. I then mention a few papers that exploit these similarities to create more powerful or interesting approaches to analysing location data.

**Similarities between Location and Language Data**

- Both exhibit power laws. A lot of words are used very rarely while a few words are used very frequently. The same happens with the frequency of visits to locations (e.g., how often you visit home vs. your favourite theme park). This is not a truism: the most frequently visited locations or most frequently used words are *much* more likely to be visited/used than most other places/words.
- Both exhibit sequential structure. Words are highly correlated with words near to them on the page. The same for locations on a particular day.
- Both exhibit topics or themes. In the case of language, groups of words tend to co-occur in the same document (e.g., two webpages that talk about cars are both likely to mention words from a similar group of words representing the "car" topic). In the case of location data, a similar thing happens. I mention two interpretations from specific papers later in this post.
- The availability of both language data and location data has exploded in the last decade (the former from the web, the latter from mobile devices).
- There are cultural differences in using language just as there are cultural differences in location behaviour (e.g., Spanish people like to eat out later than people of other cultures).
- Both are hierarchical. Languages have letters, words, sentences, and paragraphs. A person can be moving around at the level of the street, city, or country (during an hour, day, or week).
- Both exhibit social interactions. Language is exchanged in emails, texts, verbally, or in scholarly debate. Friends, co-workers, and family may have interesting patterns of co-location.
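The power-law point in the first bullet is easy to check on any token stream or visit log. A sketch (the function names are my own): compute rank-frequency counts and the log-log slope, which is roughly \(-1\) for Zipf-like data.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Sorted (rank, count) pairs for a stream of words or location IDs."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))

def loglog_slope(pairs):
    """Least-squares slope of log(count) against log(rank).
    Zipf-like data gives a slope of roughly -1."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

The same two functions apply unchanged whether the tokens are words from documents or location IDs from a check-in history.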

**Differences between Location and Language Data**

- Many words are shared between texts (of same language) but locations are usually highly personal to individuals (except for the special cases of friends, co-workers, and family).
- There are no periodicities in text but strong periodicities in location (i.e., hourly, weekly, and monthly).
- Language data is not noisy (except for spelling and grammar mistakes) while location data is usually noisy.
- Language analysts do not usually need to worry about privacy issues whilst location analysts usually do.

**Work that Exploits These Similarities**

Here are a few papers that apply or adapt approaches that were primarily used for language models to location data:

They use topic modelling to capture the tendency of visiting certain locations on the same day. This is similar to using the presence of words like "windshield" and "wheel" to place higher predictive density on words like "road" and "bumper" (i.e., topic modelling bags of words). I have talked previously about why I think this is a good paper.

This paper uses a similar application of topic modelling as the one by Farrahi and Gatica-Perez.

This paper uses a model that was previously used to capture sequential structure in words and applies it to Foursquare checkins.

In my own work, I've used the concept of topics to refer to location habits that represent the tendency of an individual to be at a given location at a certain time of day or week. This way of thinking about locations is useful in generalising temporal structure in location behaviour across people, while still allowing for topics/habits to be present to greater or varying degrees in different people's location histories (just as topics are more or less prevalent in different documents).

Both language and location data are results of human behaviour, so it is unsurprising to find similarities, even if I think some of the similarities are coincidental (e.g., power laws crop up in many places and often for different reasons, and the increasing availability of data is part of the general trend of moving the things we care about into the digital domain). The benefits of analysis approaches seem to be flowing in the language -> location direction only at the moment, though I hope one day that will change.


There are a few standard techniques for performing inference on hierarchical Bayesian models. Finding the posterior distribution over parameters or performing prediction requires an intractable integral for most Bayesian models, arising from the need to marginalise ("integrate out") nuisance parameters. In the face of this intractability there are two main ways to perform approximate inference: either transform the integration into a sampling problem (e.g., Gibbs sampling, slice sampling) or an optimisation problem (e.g., expectation-maximisation, variational inference).

Probably the most straightforward method is Gibbs sampling (see Chapter 29 of "Information Theory, Inference, and Learning Algorithms" by David MacKay) because you only need to derive conditional probability distributions for each random variable and then sample from these distributions in turn. Of course you have to handle convergence of the Markov chain, and make sure your samples are independent, but you can't go far wrong with the derivation of the conditional distributions themselves. The downside of sampling methods is their slow speed. A related issue is that sampling methods are not ideal for online scalable inference (e.g., learning from streaming social network data).
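As a toy illustration of the Gibbs recipe (my own example, not one of the mobility models), here is a sampler for a conjugate Normal-Gamma model, where both conditional distributions are available in closed form:

```python
import math
import random

def gibbs_gaussian(x, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0, iters=1000, seed=0):
    """Gibbs sampler for the conjugate model
        lam ~ Gamma(a0, rate=b0),  mu | lam ~ N(mu0, 1/(kappa0*lam)),
        x_i | mu, lam ~ N(mu, 1/lam),
    alternately sampling from the two conditional distributions."""
    rng = random.Random(seed)
    n, xbar = len(x), sum(x) / len(x)
    lam = 1.0  # arbitrary initialisation
    mus, lams = [], []
    for _ in range(iters):
        # mu | lam, x ~ N(m, 1 / ((kappa0 + n) * lam))
        m = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
        mu = rng.gauss(m, 1.0 / math.sqrt((kappa0 + n) * lam))
        # lam | mu, x ~ Gamma(a0 + (n+1)/2, rate = b0 + quadratic terms)
        shape = a0 + (n + 1) / 2.0
        rate = b0 + 0.5 * (sum((xi - mu) ** 2 for xi in x)
                           + kappa0 * (mu - mu0) ** 2)
        lam = rng.gammavariate(shape, 1.0 / rate)  # gammavariate takes a scale
        mus.append(mu)
        lams.append(lam)
    return mus, lams
```

After discarding a burn-in prefix, the averages of the sampled `mus` and `lams` approximate the posterior means of the mean and precision.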

For these reasons, I have spent the last 6 months learning how to apply variational inference to my mobility models. While there are some very good sources describing variational inference (e.g., chapter 10 of "Pattern Recognition and Machine Learning" by Bishop, this tutorial by Fox and Roberts, this tutorial by Blei), I feel that the operational details can get lost among the theoretical motivation. This makes it hard for someone just starting out to know what steps to follow. Having successfully derived variational inference for several custom hierarchical models (e.g., stick-breaking hierarchical HMMs, extended mixture models), I'm writing a practical summary for anyone about to go down the same path. So, here is my summary of how to actually apply variational Bayes to your model.

**Preliminaries**

I'm omitting an in-depth motivation because it has been covered so well by the aforementioned tutorials. But briefly, the way that mean-field variational inference transforms an integration problem into an optimisation problem is by first assuming that your model factorises further than you originally specified. It then defines a measure of error between the simpler factorised model and the original model (usually, this function is the Kullback-Leibler divergence, a measure of dissimilarity between two distributions). *The optimisation problem is to minimise this error by modifying the parameters of the factorised model (i.e., the variational parameters)*.

Something that can be confusing is that these variational parameters have a similar role in the variational model as the (often, fixed) hyperparameters have in the original model, which is to control things like prior mean, variance, and concentration. The difference is that you will be updating the variational parameters to optimise the factorised model, while the fixed hyperparameters to the original model are left alone. The way that you do this is by using the following equations for the optimal distributions over the parameters and latent variables, which follow from the assumptions made earlier:

$$\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right)$$

$$\mathrm{ln} \; q^*(Z_i) = \mathbb{E}_{-Z_i}\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right)$$

where \(X\) is the observed data, \(V\) is the set of parameters, \(Z\) is the set of latent variables, and \(\alpha\) is the set of hyperparameters. Another source of possible confusion is that these equations do not explicitly include the variational parameters, yet these parameters are the primary source of interest in the variational scheme. In the steps below, I describe how to derive the update equations for the variational parameters from these equations.

**1. Write down the joint probability of your model**

Specify the distributions and conditional dependencies of the data, parameters, and latent variables for your original model. Then write down the joint probability of the model, given the hyperparameters. In the following steps, I'm assuming that all the distributions are conjugate to each other (e.g., multinomial data have Dirichlet priors, Gaussian data have Gaussian-Wishart priors and so on).

The joint probability will usually look like this:

$$p(X, Z, V | \alpha) = p(V | \alpha) \prod_n^N \mathrm{<data \; likelihood \; of \; V, Z_n>} \mathrm{<probability \; of \; Z_n>}$$

where \(N\) is the number of observations. For example, in a mixture model, the data likelihood is \(p(X_n | Z_n, V)\) and the probability of \(Z_n\) is \(p(Z_n | V)\). An HMM has the same form, except that \(Z_n\) now has probability \(p(Z_n | Z_{n-1}, V)\). A Kalman filter is an HMM with continuous \(Z_n\). A topic model introduces an outer product over documents and an additional set of (global) parameters.
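To make the notation concrete, here is a sketch of the joint log-probability for a one-dimensional Gaussian mixture with this structure (the function and its priors are my own illustrative choices, not from any particular library):

```python
import numpy as np
from scipy.stats import dirichlet, norm

def log_joint_mixture(X, Z, means, weights, alpha0):
    """Log joint p(X, Z, V | alpha) for a 1-D Gaussian mixture.

    V = (weights, means); Z[n] is the component index of observation n.
    Mirrors the factorisation above: a prior on V, then for each n the
    probability of Z_n times the data likelihood of X_n given Z_n.
    """
    lp = dirichlet.logpdf(weights, alpha0)       # p(V | alpha): prior on mixing weights
    lp += norm.logpdf(means, 0.0, 10.0).sum()    # broad Gaussian prior on component means
    for n, k in enumerate(Z):
        lp += np.log(weights[k])                 # p(Z_n | V)
        lp += norm.logpdf(X[n], means[k], 1.0)   # p(X_n | Z_n, V), unit variance assumed
    return lp

X = np.array([0.1, 5.2, -0.3])
Z = [0, 1, 0]
lp = log_joint_mixture(X, Z, means=np.array([0.0, 5.0]),
                       weights=np.array([0.5, 0.5]), alpha0=[1.0, 1.0])
print(lp)
```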

**2. Decide on the independence assumptions for the variational model**

Decide on the factorisation that will allow tractable inference on the simpler model. The assumption that the latent variables are independent of the parameters is a common way to achieve this. Interestingly, you will find that a single assumption of factorisation will often induce further factorisations as a consequence. These come "for free" in the sense that you get simpler and easier equations without having to make any additional assumptions about the structure of the variational model.

Your variational model will probably factorise like this:

$$q(Z, V) = q(Z) q(V)$$

and you will probably get \(q(V) = \prod_i q(V_i)\) as a set of induced factorisations.

**3. Derive the variational update equations**

We now address the optimisation problem of minimising the difference between the factorised model and the original one.

*Parameters*

Use the general formula that we saw earlier:

$$\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln}\; p(X, Z, V | \alpha) \right)$$

The trick is that most of the terms in \(p(X, Z, V | \alpha)\) do not involve \(V_i\), so can be removed from the expectation and absorbed into a single constant (which becomes a normalising factor when you take the exponential of both sides). You will get something that looks like this:

$$\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln} \; p(V_i | \alpha) + \sum_n^N \mathrm{ln} \; \mathrm{<data \; likelihood \; of \; V_i, Z_n>} \right) + \mathrm{constant}$$

What you are left with is the log prior distribution of \(V_i\) plus the total log data likelihood of \(V_i\) given \(Z_n\). Even in these two remaining terms, you can often find factors that do not involve \(V_i\), so a lot of the work in this step involves discarding irrelevant parts.

The remaining work, assuming you chose conjugate distributions for your model, is to manipulate the equations to look like the prior distribution of \(V_i\) (i.e., to have the same functional form as \(p(V_i | \alpha)\)). You will end up with something that looks like this:

$$\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln} \; p(V_i | \alpha_i') \right) + \mathrm{constant}$$

where your goal is to find the value of \(\alpha_i'\) through equation manipulation. \(\alpha_i'\) is your variational parameter, and it will involve expectations of other parameters \(V_{-i}\) and/or \(Z\) (if it didn't, then you wouldn't need an iterative method). It's helpful to remember at this point that there are standard equations to calculate \(\mathbb{E} \left( \mathrm{ln} \; V_j \right)\) for common types of distribution (e.g., for Dirichlet \(V_j\) with parameters \(\alpha_j\), \(\mathbb{E} \left( \mathrm{ln} \; V_{j,k} \right) = \psi(\alpha_{j,k}) - \psi(\sum_{k'} \alpha_{j,k'})\), where \(\psi\) is the digamma function). Sometimes you will have to do further manipulation to find expectations of other functions of the parameters. We consider next how to find the expectations of the latent variables \(\mathbb{E}(Z)\).
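The Dirichlet identity can be checked numerically in a few lines (a quick sketch using SciPy's digamma and a Monte Carlo estimate):

```python
import numpy as np
from scipy.special import psi  # digamma function

def expected_log_dirichlet(alpha):
    """E[ln V_k] under V ~ Dirichlet(alpha), via the digamma identity."""
    return psi(alpha) - psi(alpha.sum())

alpha = np.array([2.0, 3.0, 5.0])
analytic = expected_log_dirichlet(alpha)

# Monte Carlo check of the identity
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=200000)
mc = np.log(samples).mean(axis=0)
print(np.max(np.abs(analytic - mc)))  # small
```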

*Latent variables*

Start with:

$$\mathrm{ln} \; q^*(Z_n) = \mathbb{E}_{-Z_{n}}\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right)$$

and try to factor out \(Z_{n}\). This will usually be the largest update equation because you will not be able to absorb many terms into the constant. This is because you need to consider the parameters generating the latent variables as well as the parameters that control their effect on observed data. Using the example of multinomial independent \(Z_n\) (e.g., in a mixture model), this works out to be:

$$\mathrm{ln} \; q^*(Z_{n,k}) = \mathbb{E}_{-Z_{n,k}}\left( Z_{n,k} \mathrm{ln} \; V_k + Z_{n,k} \mathrm{ln} \; p(X_n | V_k) \right) + \mathrm{constant}$$

factorising out \(Z_{n,k}\) to get:

$$\mathrm{ln} \; \mathbb{E}(Z_{n,k}) = \mathbb{E}(\mathrm{ln} \; V_k) + \mathbb{E}(\mathrm{ln} \; p(X_n | V_k)) + \mathrm{constant}$$
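In code, this E-step update amounts to exponentiating and normalising; subtracting the row maximum first keeps the exponential from overflowing, and the unknown constant cancels in the normalisation (the function name here is my own):

```python
import numpy as np

def update_responsibilities(log_rho):
    """Exponentiate-and-normalise the E-step update.

    log_rho[n, k] = E[ln V_k] + E[ln p(X_n | V_k)] (up to the constant);
    subtracting the row max before exponentiating avoids overflow, and
    the unknown constant cancels when each row is normalised to sum to 1.
    """
    log_rho = log_rho - log_rho.max(axis=1, keepdims=True)
    rho = np.exp(log_rho)
    return rho / rho.sum(axis=1, keepdims=True)

r = update_responsibilities(np.array([[0.0, -1.0], [-2.0, 0.0]]))
print(r)
```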

**4. Implement the update equations**

Put your update equations from step 3 into code. Iterate over the parameters (M-step) and latent variables (E-step) in turn until your parameters converge. Multiple restarts from random initialisations of the expected latent variables are recommended, since variational inference only converges to a local optimum.
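The overall scheme can be sketched as a generic coordinate-ascent loop (`e_step` and `m_step` are hypothetical placeholders for your model-specific update equations from step 3):

```python
import numpy as np

def variational_em(X, init_params, e_step, m_step, tol=1e-6, max_iter=500):
    """Generic coordinate-ascent skeleton for mean-field variational inference.

    e_step(X, params) returns the expected latent variables (responsibilities);
    m_step(X, exp_z) returns updated variational parameters. Both are
    model-specific. Iterates until the parameters stop changing.
    """
    params = init_params
    exp_z = None
    for _ in range(max_iter):
        exp_z = e_step(X, params)        # update q(Z) given the current q(V)
        new_params = m_step(X, exp_z)    # update q(V) given the current q(Z)
        if np.allclose(new_params, params, atol=tol):
            break
        params = new_params
    return params, exp_z
```

In practice you would monitor the variational lower bound rather than raw parameter change, since it also catches bugs (the bound must never decrease).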

The video below shows what variational inference looks like on a mixture model. The green scatter points represent the observed data, the blue diamonds are the ground-truth means (not known by the model, obviously), the red dots are the inferred means, and the ellipses are the inferred covariance matrices:

I was in London yesterday to watch a debate about the future of mobility (i.e., transport) hosted by Intelligence Squared at the Royal Institution, in the same room where Faraday demonstrated electricity. We saw four speakers give their visions for the future. Despite my impression that any of the speakers could have talked for ten times longer than they did and still been fascinating, the organisers went for TED-style short talks. The format seemed to work well because the talks were polished and energy levels were high in the room throughout.

1. Paul Newman

Prof Newman has been developing self-driving cars at Oxford University's Information Engineering department. In a word, his idea of how self-driving cars will take over the world is "gradually". Imagine driving yourself through a traffic jam when your car offers to take over for you in that stretch of road, and, having already participated in a behind-the-scenes auction, has agreed with your insurance company to offer you a 9 pence discount if you do so (given that computers will be safer drivers than people in future).

The motivations for the gradual change seem to be mostly technical. There will be weather conditions and extreme situations in which the car decides for itself that it wouldn't be able to drive very well. The self-driving cars also need a period in which they will learn our driving styles and learn about the world's roads.

2. Robin Chase

Robin Chase co-founded car sharing startup ZipCar and has since gone on to create new ride sharing companies BuzzCar and GoLoco. She painted two possible extremes for car occupancy in the future.

Future hell is the proliferation of zero-occupancy cars, where owners of self-driving cars send their cars on trips round the block while they shop, for lack of available parking options. Or imagine us all cycling to work, but having our cars follow us with our laptops and a change of clothes (in the style of David Cameron)? This scenario was not so plausible to me because I think it currently is, or will be made, illegal to have empty cars driving on the road anywhere in the world.

The future car paradise, in her view, is having 6 people sharing a self-driving car (boosting car occupancy way beyond the current level), but those who can afford it can ride with lower occupancy if they wish. It seems to me that if digital records of car occupancy are kept (which they would be, if we are all ride-sharing in future) then it makes it easy to apply a direct car-occupancy tax, so not many people would be tempted by the much more expensive single-occupancy option.

The other interesting thing she mentioned about the future of car sharing is that car owners would start to fit the car to the occasion (e.g., a swanky car for a first date or a job interview, but something more basic the rest of the time), instead of having to cover all possible scenarios, as they do now, when buying a car that they permanently own.

3. Jerry Sanders

In a polished pitch, Jerry Sanders told us about his plans for SkyTran, a rapid transit system to be built in Tel Aviv. Passengers each get their own pod that will hang from a pylon above the city, but will be propelled magnetically forward ("surfing on a wave of magnetism") at speeds of 100 mph. The speaker painted SkyTran as the solution to congestion.

I like the idea and would definitely take a ride in one if I ever had the chance, but it seems futuristic in the same way that monorails were meant to be the solution to city congestion 30 years ago. What happened to monorails? According to Wikipedia, city councils perceived them to be a high cost and untested solution, so preferred to invest in more mature (but ultimately more expensive) public transport methods like underground rail.

4. Ben Hamilton-Baillie

After hearing three technology-related visions, it was refreshing to hear low-tech but pragmatic ideas from Ben Hamilton-Baillie. He has pioneered "shared spaces" in this country, which boils down to removing all the signs and road markings in an urban area (e.g., Exhibition Road in London) to induce a sense of shared risk and responsibility.

Someone in the audience asked whether that increases the stress levels of pedestrians; "I'm glad you're less comfortable", he replied, because pedestrians are in danger all the time, it's just that they don't feel it.

The talks were followed by a panel Q&A session. Questions from the audience covered topics such as the transport of goods (Robin Chase said that there is still massive unspent capacity in freight), the fate of the £40 billion spent by motorists today (someone on the panel said they would prefer we spent it on massages and take-aways), and the love of cars for their own sake (Paul Newman emphasised that technology gives us options, so car lovers can still indulge themselves in future).

- Ashbrook, Daniel, and Thad Starner. "Using GPS to learn significant locations and predict movement across multiple users." Personal and Ubiquitous Computing 7.5 (2003): 275-286.
- A classic in the field, introducing a vision for finding significant places (with respect to individuals) and learning structure from past behaviour to make predictions. Although there are some shortcomings (e.g., their method of finding significant locations involved manual calibration, and they only considered a first-order Markov model to make predictions), I think they provide a compelling vision that stands the test of time 10 years later.

- Eagle, Nathan, and Alex Sandy Pentland. "Eigenbehaviors: Identifying structure in routine."
*Behavioral Ecology and Sociobiology* 63.7 (2009): 1057-1066.
- In this paper, Eagle and Pentland represent each day of an individual's mobility as a point in high-dimensional space, then use dimensionality reduction (principal component analysis) to find a set of "eigenbehaviors" that best characterises the location behaviour of that individual. Prediction can then be done by finding the mix of eigenbehaviors that best recreates a partially-seen day. The general form of the eigenbehaviors also allows comparison of habits between people, and they found some nice results showing how students from different faculties have similar location habits. I prefer the exploratory applications of this approach to the predictive aspect, because I think the high accuracy results they report are more a function of people staying a long time at the same location in their dataset (since they consider a person to always be in 1 of 4 locations: home, work, other, or no signal). Still, I was always inspired by this paper. An extension that considered richer data sets was done by Sadilek and Krumm (2012), who found good results when also incorporating temporal information (day of week, whether the day was a national holiday).

- Gao, Huiji, Jiliang Tang, and Huan Liu. "Exploring Social-Historical Ties on Location-Based Social Networks." ICWSM. 2012.
- I think this is a strong paper that deserves more attention. It is an extension of the Ashbrook and Starner paper in the sense that the authors provide a more sophisticated way of doing sequential location prediction (i.e., "given your recent locations, where are you likely to be next?") using hierarchical Pitman-Yor processes (HPY). HPY assumes that the pattern of individual daily life locations follows a power law, with the rich getting richer (i.e., there are a few locations that are highly visited, and lots of locations that are hardly ever visited), which has been empirically observed in Gonzalez et al. (2008). Furthermore, HPY is a Bayesian way of varying the number of recent locations considered in the prediction (similar to variable-order Markov models, but with a more principled motivation that doesn't require a max-depth parameter). I suggest reading the paper for a more detailed description than I can give here.

- Horvitz, Eric, and John Krumm. "Some help on the way: Opportunistic routing under uncertainty." Proceedings of the 2012 ACM Conference on Ubiquitous Computing. ACM, 2012.
- The authors propose an anticipatory mobile computing system that recommends interesting places to drivers on their way to a predicted location. The prediction part of the system is assumed (using a previous approach that assumes a rational driver) so the focus is on how to calculate the places to recommend using expected utility. They also consider how to balance the expected benefit of asking a driver to confirm their destination (making the suggested places more relevant) against the cost of interruption. Clearly, there will be times when knowing the user's destination for certain would not change the recommendation very much (v.s. only having a probabilistic prediction), so this approach is useful in avoiding the "talking paperclip" syndrome where anticipatory applications interrupt users too much.

- Farrahi, Katayoun, and Daniel Gatica-Perez. "A probabilistic approach to mining mobile phone data sequences." Personal and Ubiquitous Computing (2013): 1-16.
- This work provides a Bayesian probabilistic model of individual location behaviour that is a hybrid of latent Dirichlet allocation (LDA) and the eigenbehaviors approach of Eagle and Pentland. The similarity to eigenbehaviors comes from the assumption that there exists a set of characteristic motifs that repeat themselves in the data (e.g., leaving home to go to the bus station, then to work). The authors' comparison to n-gram models appears to be a bit of a red herring to me, as the next location in their model is not dependent, in general, on the most recent previous locations (that is not to say a mobility model requires the n-gram assumption to be useful). The benefit of having LDA underneath it all (as opposed to, say, a mixture model) is to express the assumption that motifs are locally correlated within a single day. Intuitively, if I follow the aforementioned "going to work" motif in the morning, then I am probably more likely to follow other workday-related motifs (e.g., going to the gym, then home) than other types of motif later that day. With hierarchical Bayesian modelling, this type of structure can be learnt in an unsupervised way from the data, and then be used to make predictions about future behaviour.

I had to leave a lot of very good papers out, so I hope to make a longer list in future.

The list is focused on datasets from which one might be able to identify individual location routines. Therefore, aggregated or short datasets (< 1 week) are omitted. In no particular order:

GeoLife GPS Trajectories
182 users over a period of 2 years. **Public**.

MIT Reality Mining Dataset (Page on Mit)
Cell tower locations (and various phone logs) of 100 people at MIT. **Semi-public** (need to submit your details on a web form).

~~Foursquare Data Set - University of Minnesota~~ *[update: seems that Foursquare has asked the collectors of this data to no longer distribute it]*
The check-in behaviour of over 2 million users. **Semi-public** (you have to request a link from them by email).

OpenPaths
Comprises a mixture of GPS, cell tower, and check-in data. **Semi-public** (you have to submit a project description and have individuals donate their data).

Dataset of LifeMap Users
Dataset of GPS and WiFi locations that includes crowdsourced semantic labels for locations by 68 people in South Korea. **Public**.

CRAWDAD
Long list of WiFi fingerprint data at University of Dartmouth, US. **Public**.

The concept we agreed on involved an animation of a log fire that the home occupant would interact with by gesture (in front of the screen). When the occupant wants to increase the temperature of the house, they can put their hands in front of the fire (visualised on their screen) in the way that people naturally do with campfires (a sort of inverted cupping gesture). When they want to decrease the temperature, they can push the fire away, in a sort of rejection gesture (though of course no-one would do this with a real fire).

The other people in the interface team worked on a nifty animation of a log fire in Javascript, and on communicating with the thermostat. However, Leap does not make it very easy for developers to define custom gestures. It seems that the makers of Leap are trying to establish some standard gestures, which users can learn and use for many apps. Good idea for them, but we wanted more flexibility so I set to work on using machine learning to do online classification of gestures.

My first step was to decide upon a set of features. Downloading the Python developer package, I found a script called Sample.py that prints the data that Leap provides to developers. The most important pieces of data for the two gestures I was interested in were the positions and orientations of the hands (both 3-dimensional).

I wanted the features to be invariant to the absolute position of the hands, so decided on using the speed and orientation of the hands only (4 numbers per time step). Clearly, a more general custom gesture recogniser might use more or different features.
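As a sketch of what this feature extraction might look like (illustrative code, not the code from the day itself; the function name and frame layout are my assumptions):

```python
import numpy as np

def gesture_features(positions, orientations, dt=1.0 / 60.0):
    """Position-invariant features per time step: hand speed plus orientation.

    positions: (T, 3) array of hand positions per frame;
    orientations: (T, 3) array of orientation angles per frame;
    dt: frame interval in seconds. Returns a (T-1, 4) feature matrix:
    one speed column followed by the three orientation angles.
    """
    velocity = np.diff(positions, axis=0) / dt    # finite-difference velocity, (T-1, 3)
    speed = np.linalg.norm(velocity, axis=1)      # speed is invariant to absolute position
    return np.column_stack([speed, orientations[1:]])
```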

Having decided on the features, I thought more about the machine learning approach to use. The approach needs to take into account the dynamic nature of gestures, i.e., a single gesture may be expressed over multiple time steps. One idea is to assume that the steps of a gesture can be explained by discrete latent states that follow a first-order Markov property. Fortunately, I have just written a package to do variational inference on non-parametric hidden Markov models (HMM) with multi-modal data. The original purpose was for location data (my main area of research), but since I had written it in a fairly re-usable way, only a few lines were required to run inference on the gesture features I defined (4 dimensional continuous feature data):

K = 20  # truncation parameter for number of components to Dirichlet Process
mvgSensor = MVGaussianSensor(K, XDim, hyp=obsHyperparams[0])
exp_z, sensors, exp_a, Zmax = infer(N, [X], K, [mvgSensor], VERBOSE=1, useMix=1, useHMM=1, plotSensor=mvgSensor, plotX=X[:, :2])

In line 1, I set the truncation parameter to a suitably large number of possible components (though the HMM is not required to use all of these). Line 2 defines the assumed distribution on the data (a multivariate Gaussian), while line 3 runs the actual inference on the data (X). The assumption is that the logarithm of the speed, and the 3-dimensional orientation of the hands, are normally distributed (and covariant). Clearly, orientation needs a periodic distribution (the range of values is between 0 and 2*pi, and values very close to 0 are actually close to 2*pi as well), so I would have preferred a von Mises likelihood function there, but time was limited, so a Gaussian would have to be a good enough approximation. The effects should not be too noticeable, since neither of our gestures involves angles near 0 or 2*pi.

Assuming that the latent states of any user movement (which may include a gesture, or may just be noise), can be inferred correctly, the next step is to include supervision, in order to map the unsupervised inference of movement to known concepts like 'pushing away' or 'hands over fire', (which correspond to 'turn the temperature down!' or 'turn it up!', respectively). Since I hope to have already adequately represented complex movements with the latent states of an HMM, a simple supervised learning approach should be all that is needed. I chose linear least squares regression for this, mostly because it can be done in a single line in NumPy:

weights = numpy.linalg.lstsq(Z, numpy.reshape(Y, (N, 1)))[0]  # lstsq returns (solution, residuals, rank, singular values)

where Z is the set of inferred distributions over latent states and Y is the set of corresponding target values (both for N time steps).
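As a tiny worked example of this regression step, here is the same least-squares classification applied to synthetic stand-in responsibilities (the data below is made up purely for illustration, not my gesture data):

```python
import numpy as np

# Synthetic stand-in for the HMM output: N time steps, K latent states.
# The first half of the sequence mostly occupies state 0, the second half state 1.
rng = np.random.default_rng(0)
N, K = 200, 2
Z = np.abs(rng.normal(0.1, 0.05, size=(N, K)))
Z[:N // 2, 0] += 1.0   # first gesture concentrates on state 0
Z[N // 2:, 1] += 1.0   # second gesture concentrates on state 1
Z /= Z.sum(axis=1, keepdims=True)

# Targets: -1 for the first gesture, +1 for the second.
Y = np.concatenate([-np.ones(N // 2), np.ones(N // 2)])

weights, *_ = np.linalg.lstsq(Z, Y.reshape(N, 1), rcond=None)
pred = (Z @ weights).ravel()
accuracy = np.mean((pred > 0) == (Y > 0))
print(accuracy)
```

When the latent states separate the gestures this cleanly, a linear read-out is all that is needed, which is the whole point of putting the HMM underneath.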

Finally, I created some training data. I spent a few minutes cupping my hands around an imaginary fire, and then another few minutes pushing it away. This gave me a 4000-by-4 feature matrix (with an observation on each row) with which to train. I gave a rough and ready supervision signal (for the final, linear regression step) of 2000 negative outputs ('turn up the heat') followed by 2000 positive outputs ('turn down the heat').

Performing nonparametric HMM inference on this data, the model told me that 19 components were enough to explain the data (which seemed a lot to me):

The graph shows parameters to the Gaussians (mean and covariance) that were learnt to two of the feature dimensions (the other two dimensions are not shown here, but are obviously included in the inference). After applying linear regression, I first tested on the original training data:

In the data I collected, I switched gestures half way through (i.e., a minute or so of me doing the first gesture, and then another minute doing the second gesture). Therefore, one would expect the classification line to be under the threshold at first, then switch to be over the threshold about half way through the data. For comparison, I also plotted how straightforward linear regression (again with least squares, with the green line) does on the raw data (without applying an HMM). We see that simple regression is not enough (achieving 55.6% accuracy), indicating that the raw data is not linearly separable. The combined approach (with the HMM) does much better at 89.8% accuracy.

Of course, this result was from testing with data seen during training, so I then tested on a shorter, unseen data set that I also collected:

My approach achieved 73.8% accuracy (vs. 53.3% for simple linear regression). For both data sets, there appear to be portions of the 'hands over fire' gesture (where the expected target value is 0) that spike above the threshold. I think this is because I had to move my hands into place, causing the speed to be non-zero. This matters because one important characteristic of the 'hands over fire' gesture is likely to be zero motion. It would be interesting to see how well we do with just the speed feature (without the angles), though I didn't have time for this, and in any case, I've noticed a difference in the angle of my hands between the two gestures.

Overall, I think that kind of accuracy is not bad for a day's work, especially considering that I collected a very small training dataset in a few minutes of hand waving. I expect the results would improve with more data. 73.8% accuracy is probably not good enough yet for this sort of application, as it would obviously be undesirable for the system to turn the thermostat down when the user meant to turn it up (or vice versa). I think the accuracy could be improved a lot with some smoothing, and by somehow dealing with the initial and tail ends of a gesture (which appear to result in a lot of noise). Also, it might be interesting to see how well this works when training on one set of people and testing on another.

In general, I'm excited by the idea of learning a rich representation of behaviour in an unsupervised way, then applying supervision only at the final steps, because it points towards the possibility of collecting a much larger training data set while spending a much smaller amount of time on labelling only some of the data (obviously, this is semi-supervised learning).

*I will make all the code I wrote for this available soon.*

Here are some additional notes about preprocessing that made a big difference in accuracy for me:

- since the speed feature is non-negative, I took the logarithm of that dimension (resulting in the assumption that the speed is log-normally distributed).
- as per usual, I centred all the data by the mean, and gave it unit standard deviation.
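Putting those two preprocessing steps into code (a sketch, assuming the feature layout I described, with speed in column 0):

```python
import numpy as np

def preprocess(features, eps=1e-8):
    """Log-transform the speed column, then standardise every column.

    features: (N, 4) matrix with speed in column 0 and the three
    orientation angles in the remaining columns (layout assumed).
    """
    out = features.astype(float).copy()
    out[:, 0] = np.log(out[:, 0] + eps)   # speed >= 0, so model it as log-normal
    out -= out.mean(axis=0)               # centre each dimension on its mean
    out /= out.std(axis=0) + eps          # scale to unit standard deviation
    return out
```

Note that the same mean and standard deviation fitted on the training data should be reused on the test data, otherwise the two sets end up on slightly different scales.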