A Causal Study of Simulated Patient Recovery Data

I'm currently reading "Do No Harm", by Henry Marsh, a fascinating autobiography of a neurosurgeon. Since I'm also reading about causality in statistics, naturally the two have combined in my mind to make the following worked example of causality.

Imagine you're in charge of a bunch of hospitals and are looking at the survival statistics of all the patients with a certain illness (e.g., brain tumours) to help you decide which doctors to promote. (This is probably not how doctors are promoted, but anyway). You might be tempted to pick the doctors with the best survival rates. This is wrong because it ignores the assignment mechanism of patients to doctors. Let's see how wrong you could be with a specific example.

doctor id   num patients   survival rate
0           117            0.205
1           85             0.812
2           93             0.129
3           88             0.216
4           98             0.153
5           112            0.830
6           89             0.809
7           109            0.138
8           102            0.824
9           107            0.841

Let's make a set of statistical modelling assumptions about how this data was generated.

Assumptions:

D doctors
N patients
Z_{n,d} is whether patient n is assigned to doctor d
Y_n(d) is binary recovery if patient n is treated by doctor d
\beta_d is skill of doctor d (latent, binary: low or high skill)
\theta_n is the severity of patient n's illness (observed, binary: low or high severity)

Generative procedure:

for patient n = 1,...,N:
     draw an indicator for whether the patient is assigned a highly skilled or less highly skilled doctor, s_n ~ Bernoulli( \theta_n p_high + (1-\theta_n) p_low )
     if s_n = 1: draw the assigned doctor Z_{n .} uniformly from the pool of highly skilled doctors
     otherwise: draw the assigned doctor Z_{n .} uniformly from the pool of less skilled doctors
     calculate probability of recovery as follows: p_recover = \beta_d \theta_n p_low + (1-\beta_d) \theta_n p_{very low} + (1-\theta_n) p_high
     draw recovery Y_n(d) ~ Bernoulli( p_recover )

What this generative process says is that we should try to assign highly skilled doctors to severely ill patients and let the rest of the doctors take care of the rest of the patients, but this process is noisy: there is a non-zero probability that an average doctor will care for a very ill patient, or that a very skilled doctor will care for a less ill patient.

Why this recovery probability? If the illness is severe, then the best that a skilled doctor can do is ensure a low probability of recovery, while a less skilled doctor will do even worse than that. If the illness is not severe, then it doesn't matter whether the patient gets a very skilled doctor or not, they are likely to recover.
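To make the procedure concrete, here is a minimal Python sketch of the generative model. The parameter values, severity rate, and split of skilled vs. unskilled doctors below are illustrative guesses, not the exact settings used to produce the tables in this post (the full code is in the notebook linked at the end).

import numpy as np

rng = np.random.default_rng(0)

# illustrative parameters (not the exact ones behind the tables)
D, N = 10, 1000
p_high, p_low, p_very_low = 0.9, 0.1, 0.03
beta = np.array([1] * 5 + [0] * 5)       # latent doctor skill
theta = rng.binomial(1, 0.5, size=N)     # observed patient severity

skilled = np.where(beta == 1)[0]
unskilled = np.where(beta == 0)[0]

Z = np.zeros(N, dtype=int)   # assigned doctor for each patient
Y = np.zeros(N, dtype=int)   # observed recovery

for n in range(N):
    # severely ill patients are (noisily) routed to highly skilled doctors
    s_n = rng.binomial(1, theta[n] * p_high + (1 - theta[n]) * p_low)
    d = rng.choice(skilled) if s_n == 1 else rng.choice(unskilled)
    Z[n] = d
    # recovery probability as described above
    p_recover = (beta[d] * theta[n] * p_low
                 + (1 - beta[d]) * theta[n] * p_very_low
                 + (1 - theta[n]) * p_high)
    Y[n] = rng.binomial(1, p_recover)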

In reality, the patient covariates (\theta_n) would not be a simple binary number; they would be features of the illness (e.g., tumour size and location, age of patient), and doctors might have skills along multiple dimensions. Binary numbers suit our thought experiment for the moment, but we'll revisit the dimensionality of covariates later.

Using this generative model and a given set of parameters (p_high, p_low, p_{very low}, N, D, \theta, \beta), I generated 10 doctors and 1000 patients (the data you see in the table above, actually). Now, let's pretend that we don't know \beta, the doctor skill that we are trying to estimate from the data alone.

Would you want to see doctor 4? Naively, no, but who knows, they might be the top specialist in the country for your illness. I consider how to answer that question next.

Causality Assumptions

We need to make a set of assumptions to allow us to perform a causal analysis. I'm writing them out explicitly for good statistical hygiene but won't discuss them at length here.

First, we need the stable unit treatment value assumption (SUTVA), which says that there are no interactions between units (patients) in the sense that the outcome of treatment for patient n does not depend on the assignment or recovery of any of the other patients. This might conceivably be broken if doctors become tied up with existing patients, limiting their ability to see new ones.

Second, we need unconfoundedness, which says that assignment is conditionally independent of the outcome (i.e., recovery) given the patient covariates (i.e., judged severity of illness). This might be broken if there are some unknown patient covariates (e.g., where they live) that affect assignment. And the bad news from Imbens & Rubin 2015 is that we can never tell if this is the case.

Third, we need individualistic and probabilistic assignments (see Imbens & Rubin 2015).

Recovery Rate Stratification

Here is what the recovery rates look like for each doctor when we stratify by patient severity:

doctor id   num severe patients   recovery rate (non-severe)   recovery rate (severe)
0           103                   0.857                        0.117
1           7                     0.885                        0.000
2           87                    0.833                        0.080
3           78                    0.900                        0.128
4           92                    1.000                        0.098
5           6                     0.877                        0.000
6           11                    0.923                        0.000
7           97                    0.917                        0.041
8           11                    0.923                        0.000
9           8                     0.909                        0.000

The first thing to notice is that the number of severely ill patient assignments per doctor reveals the assignment mechanism quite clearly. But this only reveals information about doctor skill if we believe in benevolent assignments. One could easily design a perverse universe in which severely ill people get sent to the person least qualified to treat them. It is the stratified survival rate columns that tell you everything you need to know if you ever find yourself a patient in the simulated universe. If you're severely ill, see doctors 0, 2, 3, 4, or 7. Otherwise, see the first available doctor.
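For reference, here is how the stratified rates can be computed, assuming the Z, Y, and theta arrays from the generative sketch above:

for d in range(D):
    for sev, label in ((0, "non-severe"), (1, "severe")):
        mask = (Z == d) & (theta == sev)
        print(d, label, mask.sum(), round(Y[mask].mean(), 3))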

Next, matching: split the patients into low- and high-severity groups, then, within each group, randomly match patients (one treated by doctor d, another not treated by doctor d) and calculate the average difference in recovery. A Python sketch of this matching estimator follows the table. This gives us the following average treatment effects:

doctor id   avg treatment effect   doctor skill   num treated   avg recovery rate
0           0.000                  1              117           0.205
1           -0.024                 0              85            0.812
2           0.000                  1              93            0.129
3           0.023                  1              88            0.216
4           -0.020                 1              98            0.153
5           -0.080                 0              112           0.830
6           0.034                  0              89            0.809
7           -0.064                 1              109           0.138
8           0.010                  0              102           0.824
9           0.037                  0              107           0.841
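Here is a minimal sketch of that matching estimator, again assuming the arrays (and numpy import) from the generative sketch above; the details (one random match per treated patient, sampled with replacement) are my assumptions rather than the exact procedure used to produce the table.

def matched_ate(d, Z, Y, theta, rng):
    """Average treatment effect of doctor d via within-severity matching."""
    effects = []
    for sev in (0, 1):
        treated = np.where((Z == d) & (theta == sev))[0]
        control = np.where((Z != d) & (theta == sev))[0]
        if len(treated) == 0 or len(control) == 0:
            continue
        # match each treated patient to a random control patient of the same severity
        matches = rng.choice(control, size=len(treated), replace=True)
        effects.extend(Y[treated] - Y[matches])
    return np.mean(effects)

matching_ate = {d: matched_ate(d, Z, Y, theta, rng) for d in range(D)}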

The average treatment effect doesn't really help us identify skilled doctors in this example. It successfully identifies doctor 3 as skilled, even though he has a low survival rate, and doctor 1 as unskilled, even though he has a high survival rate, but the rest are muddled. The problem arises because severely ill patients are rare and there are not enough patients overall (I reran the simulation with 10k patients and found I could reliably identify skilled and unskilled doctors by looking at the average treatment effect). Skilled doctors raise the recovery rate of the average patient because they greatly help severely ill patients and perform like average doctors with less ill patients. Less skilled doctors lower the average recovery rate because they do a bad job with severely ill patients.

In general, looking at stratified recovery rates was successful because the unit covariate is low dimensional and is itself a good balancing score: it satisfies Y_n(d) conditionally independent of Z_{n d} given \theta_n for all n and d. This didn't have to be the case. In reality, we would expect many patient covariates, some of them irrelevant, others affecting assignment but not outcomes, and others affecting both. Stratification is hopeless in that case because the number of strata overwhelms the number of units. For the purposes of continuing, let's assume that patient covariates are of this more complex kind, motivating propensity score reweighting (or matching, though we have already done matching based on a balancing score, so we will focus on reweighting next).

Propensity Score Reweighting

Another strategy for causal analysis is to account for the assignment mechanism using propensity scoring. A propensity score is defined as the coarsest function e of the covariates that satisfies:

Y_n( . ) conditionally independent of Z_{n .} given e( \theta_n )

Coarseness here refers to the size of the image of the function, and it matters when considering high dimensional covariates (the coarser the score, the fewer strata we need to adjust for).

In this example, we'd like e( \theta_n ) to be the probability of being assigned a highly skilled doctor. But we don't know who is highly skilled in the first place (that's the unknown \beta). So we have to make do with e( \theta_n ) being a vector of length D representing the probability of being assigned each doctor. Here is the propensity score, dependent on \theta_n (split into two tables to fit on the page):

unit covariate   doc_0   doc_1   doc_2   doc_3   doc_4
non-severe       0.028   0.156   0.012   0.020   0.012
severe           0.206   0.014   0.174   0.156   0.184

unit covariate   doc_5   doc_6   doc_7   doc_8   doc_9
non-severe       0.212   0.156   0.024   0.182   0.198
severe           0.012   0.022   0.194   0.022   0.016

The propensity score reweighting estimate of the treatment effect for doctor d is

\frac{1}{N} \sum_{n} \left( \frac{Z_{n d} \, Y_n^{\mathrm{obs}}}{e(\theta_n)_d} - \frac{(1 - Z_{n d}) \, Y_n^{\mathrm{obs}}}{1 - e(\theta_n)_d} \right)

where N is the number of patients, Y_n^{\mathrm{obs}} is the observed recovery for patient n, and e(\theta_n)_d denotes the component of the propensity score corresponding to doctor d.
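A minimal Python sketch of this estimator, again assuming the Z, Y, and theta arrays (and numpy import) from the generative sketch above. Here the propensity score is estimated empirically from the observed assignments within each severity stratum; that choice is my assumption, not necessarily the exact computation behind the tables.

def ipw_ate(d, Z, Y, theta):
    """Propensity score reweighting estimate of doctor d's treatment effect."""
    # empirical probability of being assigned doctor d, for each severity level
    e_d = np.array([np.mean(Z[theta == sev] == d) for sev in (0, 1)])
    e_n = e_d[theta]                      # propensity score component for each patient
    treated = (Z == d).astype(float)
    return np.mean(treated * Y / e_n - (1 - treated) * Y / (1 - e_n))

ipw = {d: ipw_ate(d, Z, Y, theta) for d in range(D)}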

Here is the average treatment effect after propensity score reweighting:

doctor id   doctor skill   treated recovery   untreated recovery   avg treatment effect
0           1              0.487              0.489                -0.003
1           0              0.442              0.495                -0.053
2           1              0.457              0.494                -0.037
3           1              0.514              0.489                0.025
4           1              0.549              0.491                0.058
5           0              0.439              0.497                -0.058
6           0              0.462              0.492                -0.030
7           1              0.479              0.498                -0.019
8           0              0.462              0.492                -0.030
9           0              0.455              0.493                -0.038

The result of this analysis is that we have now identified doctors 3 and 4 as skilled, and we no longer think several unskilled doctors are skilled (as matching previously suggested); in particular, matching estimated that doctor 9 increased the survival of their patients relative to other doctors, but propensity reweighting reveals this not to be the case. How is this possible? Because propensity reweighting gives us a better estimate of the control (i.e., NOT being treated by any given doctor d), since we can use the patients of all other doctors instead of matching with a single other patient.

iPython Notebook

Python code for this entire analysis (and the generative model) can be found in this iPython notebook:
link to iPython notebook code

Conclusions

Causality, even (perhaps especially) in simulation, is fun. I hope I have highlighted some interesting themes about finding counter-intuitive conclusions in data that at first look straightforward. One main difference between my simulated scenario and many others that come up in causal analysis is that usually we're in danger of over-estimating the effect of a treatment (e.g., of going to university, of enrolling in a jobs program), whereas in this scenario we are under-estimating the effectiveness of highly skilled doctors.
 

Probabilistic Models on Trial


There are many modes of evidence accepted in courts of law. Each mode has its strengths and weaknesses, which will usually be highlighted to suit either side of the case. For example, if a witness places the defendant at the scene of the crime, the defense lawyer will attack her credibility. If fingerprint evidence is lacking, the prosecution will say it's because the defendant was careful. Will inferences from probabilistic models ever become a mode of evidence?

It's a natural idea. Courts institutionally engage in uncertainty. They use phrases like "beyond reasonable doubt", they talk about balance of evidence, and consider precision-recall rates (it is "better that ten guilty persons escape than that one innocent suffer" according to English jurist William Blackstone). And the closest thing we have to a science of uncertainty is Bayesian modelling.

In a limited sense we already have probabilistic models as a mode of evidence. For example, the most damning piece of evidence in the Sally Clark cot death case was the testimony of a paediatrician who said that the chance of two cot deaths happening in one household, without malicious action, is 1 in 73 million. This figure was wrong because the model assumptions were wrong -- there could be a genetic or non-malicious environmental component to cot death but this was not captured in the model. But as is vividly illustrated by the Sally Clark case, inferential evidence is currently gatekept by experts. In that sense, the expert is the witness and the court rarely interacts with the model itself. The law has a long history of attacking witness testimony. But what will happen when we have truly democratized Bayesian inference?

Perhaps one day, in a courtroom near you, the defense and prosecution will negotiate an inference method, then present alternative models for explaining data relevant to the case. The lawyers will use their own model to make their side of the case while attacking the opposing side's model.

In what scenarios would probabilistic models be an important mode of evidence?

When there are large amounts of ambiguous data, too large for people to fit into their heads, and even too large/complex to visualize without making significant assumptions.

Consider a trove of emails between employees of a large corporation. The prosecution might propose a network model to support the accusation that management was active or complicit in criminal activities. The defense might counter-propose an alternative model that shows that several key players outside of the management team were the most responsible and took steps to hide the malfeasance from management.

In these types of cases, one would not hope for, or expect, a definitive answer. Inferences are witnesses and they can be validly attacked from both sides on the grounds of model assumptions (and the inference method).

If this were to happen, lawyers would quickly become model criticism ninjas, because they would need model criticism skills to argue their cases. Who knows, maybe those proceedings will make their way onto court room drama TV. In that case, probabilistic model criticism will enter into the public psyche the same way jury selection, unreliable witnesses, and reasonable doubt have. The expertise will come from machines, not humans, and the world will want to develop ever richer language and concepts that enable it to attack the conclusions of that expertise.

The Paradox of Epistemic Risk

Does your attitude to risk change based on the type of uncertainty you harbour? This is a blog post about epistemic vs. non-epistemic risk.

Here is a quote from theconversation.com:

"Australians [have] an 8.2% chance of being diagnosed with bowel cancer over their lifetime [...] If we assume that a quarter of the Australian population eats 50 grams per day of processed meat, then the lifetime risk for the three-quarters who eat no processed meat would be 7.9% (or about one in 13). For those who eat 50 grams per day, the lifetime risk would be 9.3% (or about one in 11)."
There are at least two ways to interpret the above quote:
Option 1:
  • there is a 9.3% chance of getting bowel cancer for processed meat eaters and a 7.9% chance for non-processed meat eaters; genes don't matter

Option 2:

  • there is an x% chance of having the genes that make you susceptible to bowel cancer
  • if you have the genes that make you susceptible: there is a high chance of getting bowel cancer if you eat meat
  • if you don't have the genes that make you susceptible: it doesn't matter what you do, there is a low chance of getting bowel cancer
In either case, we can assume the marginal probability of getting bowel cancer is the same (i.e., we can adjust the percentages in Option 2 to make them match Option 1).

If you're a processed meat eater, look at Option 1 and ask yourself: is never eating bacon or a burger again worth a 1.6 percentage point reduction in risk? I'm not sure what my answer is, which means that the choices are fairly balanced for me.

Now look at Option 2. Does your answer change? For a rational agent it should not change. My inner monologue for Option 2 goes as follows: if I have the bad genes, then I'm definitely screwing myself over by eating processed meat, and I want to avoid doing that.

But you don't get to know what genes you have (at least, not yet, that will probably change in the next few years), so the main source of risk is epistemic. That is, you already have the genes that you have (tautological though it is to say), you just don't know which kind you have.

Here's what I think is going on: as we go about our lives we have to "satisfice", which means that we focus on actions that are expected to make big differences and try to avoid fiddling at the margins. Option 1 looks a lot like fiddling at the margins to me. Option 2 instead gives me more control: if I have the bad genes then I have much greater control over the risk of a terrible illness. But the greater control is illusory: as long as I remain uncertain about the state of my genes, the utility of eating or not eating processed meat is the same under both Option 1 and Option 2. I call this the paradox of epistemic risk.

I must say that I'm ignorant of the latest research in psychology; any references to related ideas are welcome, so please leave a comment!

Summary of Chapter 12 of Imbens & Rubin, "Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction"

This month we're reading Imbens & Rubin's book on causal inference at our machine learning reading group at Columbia. It was my turn to present chapter 12 of that book this week. Here are my summary notes for that chapter:

Summary Notes of Chapter 12

I learnt from this book that causality is fun. That hasn't happened to me before. Maybe it's just that the ideas around causal inference that we've been studying in the reading group for a decent length of time now are finally starting to click.

The Right to be Forgotten

 

I like to read books whose premises I disagree with. It’s good for the soul. Recently, I read “To Save Everything, Click Here: The Folly of Technological Solutionism” by Evgeny Morozov. In it, the author argues that technologists (i.e., people who spend their time improving or advocating technology), particularly from Silicon Valley, are overconfident in their belief that they can solve the world’s problems with technology. Pick any speech by Google chairman Eric Schmidt to see an example of the type of thinking that the author rails against.

One argument from the book has stuck with me. It goes like this: there is a popular notion that the internet is not an invention or a technology, but an inevitable ideal that must be protected from all outside influence. Perhaps this visionary stance was useful in building the internet in the early days, but now that we have the internet, it's less useful.

For example, most recently, this ideal underlies a lot of the objections in the debate about the right to be forgotten. The Court of Justice of the European Union recently ruled that European citizens have a right to have damaging links about them removed from search engine results. People who believe in the idealised internet warn of the dangers of regulating the internet in this way. Here is a quote from Jennifer Granick (Stanford University) in the New Yorker magazine (“Solace of Oblivion”, Sept 29, 2014 issue):

[This legislation] marks the beginning of the end of the global Internet, where everyone has access to the same information, and the beginning of an Internet where there are national networks, where decisions by governments dictate which information people get access to. The Internet as a whole is being Balkanized, and Europeans are going to have a very different access to information than we have.

This warning appears to be designed to send shivers up the spine, to paint a dystopian future where, *gasp*, Europeans see different information to everyone else. Jonathan Zittrain (Harvard Law School) makes the danger explicit in the same article:

"[... ] what happens if the Brazilians come along and say, ‘We want only search results that are consistent with our laws’? It becomes a contest about who can exert the most muscle on Google.” Search companies might decide to tailor their search results in order to offend the fewest countries, limiting all searches according to the rules of the most restrictive country. As Zittrain put it, “Then the convoy will move only as fast as the slowest ship.”

This quote makes explicit the fear of government control and of technology companies’ harmful reactions to such control.

Underlying both quotes is the assumption that the internet is sacred: its destiny is to be global and pure. But, to repeat, the internet is a technology, it is built and maintained by us because we find it useful. Replace “internet” with “car” in the above quotes, and consider again the first quote by Jennifer Granick. Does the experience of driving have to be identical wherever you are in the world? Does the fact that the speed limit varies from country to country, or that you can turn right on a red light in some US states and not in others, keep anybody up at night? Road laws, manufacturing safety laws, drink driving laws are highly Balkanised across states and countries, but no-one is worrying whether some authoritarian government of the future is going to take our cars away from us (except maybe some Tea Party activists).

Or consider the second quote, by Jonathan Zittrain. Do car manufacturers have a right to demand to sell identical cars globally? Is car manufacturing technology held back by the country with the most stringent safety laws? Of course not. But even if it were, it wouldn’t be the only consideration on the table. We don’t feel beholden to any visions that Henry Ford may have had a century ago about a future with open roads and rapid transit, certainly not when it comes to preventing people from dying horrifically on the roads.

Yet, when it comes to the internet, a lot of people believe that an unregulated internet trumps everything, including the grief of parents struggling to get over their daughter’s death when pictures of her dead body are circulating online, or the struggles of a person falsely accused of a crime who has to live with those accusations every time someone searches their name. A balanced viewpoint would look at freedom of speech versus the harm such speech causes in broad classes (e.g., criminal cases, breaches of privacy) and make a decision. Different countries have different priorities which will lead to different regulations on internet freedoms, and that’s ok. If the government is authoritarian, then it has already put huge restrictions on the local internet (e.g., China, North Korea). You can add that to the list of reasons why it’s unpleasant to live in authoritarian countries.

At this point, the person arguing for a global and pure internet retreats to practicalities. They cite two main practical barriers to the right to be forgotten: 1) it’s impossible to exert complete control over information - a determined person will find the restricted information anyway; 2) it’s too labour intensive to enact the right to be forgotten. Let’s start with the first barrier. I agree that a demand for perfection is misguided, and I don’t believe anyone is making such a demand. It’s possible to take reasonable and simple steps to allow people to be forgotten that get you 95% of the way there. In the same way that a determinedly bad driver can still kill people, a determinedly assiduous employer will still be able to dig up information about a potential employee. But this was always the case, even before the internet.

The second practical barrier is the more important one, I feel, and is a manifestation of the fact that technology enables mass automation (e.g., indexing websites) while the judgement that society requires of Google (i.e., “is this index removal request valid?”) cannot currently be automated. While this challenge is substantial, it’s ironic that the same technologists (me included) who claim they can solve the world’s societal problems throw their hands up in despair when asked to automate such judgement.

Comment on "How Should We Analyse Our Lives?"

 

I just heard about a new book, called "Social Physics", being released at the end of the month by Alex "Sandy" Pentland (the famous professor of the MIT media lab). According to the synopsis, the book is about "idea flow" and "the way human social networks spread ideas and transform those ideas into behaviours".

More thoughts on this once it's actually released. But at first glance, the question of how ideas spread is typical popular science fare, so I hope Prof. Pentland's unique perspective will make it truly different from what's already out there. Also, I'm always dubious of attempts to raise the spectre of "physics" in the context of behaviour analysis. The intended analogy is clearly between finding universal laws of human behaviour and finding universal laws in physics. But scientists in every field are trying to find universal laws! Should we now rebrand economics as "currency physics", computer science as "algorithmic physics", and biology as "organic physics", etc.?

With those (rather superficial) caveats, the book is clearly relevant to my research topic of trying to analyse and predict location behaviour using the digital breadcrumbs left by humans in daily life. I'm looking forward to reading the book as I have often found inspiration in Prof. Pentland's research.

The book was given an interesting review by Gillian Tett of the Financial Times last weekend. Her main point of agreement with the book is that the difference between new and old research on human behaviour analysis is due to the size of the data, plus the extent to which that data is interpreted subjectively vs. objectively (e.g., anthropologists analysing small groups of people vs. the 10GB that Prof. Pentland collected on 80 call centre workers at an American bank).

So far so good. But she goes on to criticise computer scientists for analysing people like atoms "using the tools of physics" when in reality they can be "culturally influenced too". As someone who uses the tools of physics to analyse behaviour (i.e., simulation techniques and mean field approximations that originated in statistical physics), I think this is a false dichotomy.

There are two ways you can read the aforementioned criticism. The first is that statistical models of human behaviour are unable to infer (or learn) cultural influences from the raw data. The second is that the models themselves are constructed under the bias of the researcher's culture.

In the first case, there's nothing to stop researchers from finding new ways of allowing the models to pick up on cultural influences in behaviour (in an unsupervised learning manner), as long as the tools are rich and flexible enough (which they almost certainly are in the case of statistical machine learning).

The second case is more subtle. In my experience, the assumptions that are expressed in my, and my colleagues', models are highly flexible and don't appear to impose cultural constraints on the data. How do I know? I could be wrong, but my confidence in the flexibility of our models comes from the fact that model performance can be quantified with hard data (e.g., using measures of precision/recall or held-out data likelihood). This means that any undue bias (cultural or otherwise) that one researcher imposes will be almost immediately penalised by another researcher coming along and claiming he or she can do "X%" better in predictive accuracy. This is the kind of honesty that hard data brings to the table, though I agree that it is not perfect and that many other things can go wrong with the academic process (e.g., proprietary data making publications impervious to criticism, and the fact that researchers will spend more effort optimising their own approach for the dataset in question than the benchmarks).

My line of argument of course assumes that the data itself is culturally diverse. This wouldn't be the case if the only data we collect and use comes from, for example, university students in developed countries. But the trend is definitely moving away from that situation. Firstly, the amount of data available is growing by a factor of 10 about every few years (driven by data available from social networking sites like Twitter and Foursquare). At a certain point, the amount of data makes user homogeneity impossible, simply because there just aren't that many university students out there! Secondly, forward thinking network operators like Orange are making cell tower data from developing countries available to academic researchers (under very restricted circumstances I should add). So in conclusion, since the data is becoming more diverse, this should force out any cultural bias in the models (if there was any there to start with).

Growth of Mobile Location Services in 2013

 

I'm writing up my PhD thesis at the moment, and found myself having to update the opening paragraph of page 1 (you know, the part where I say how incredibly relevant my research is). The previous version, from my transfer thesis written in 2012, went like this:

Mobile location services have been a topic of considerable interest in recent years, both in industry and academia. In industry, software applications (or apps) with location-based components enjoy widespread use. This is evidenced, for example, by the 20 million active users who opt to check in (i.e., share their current location with friends) on Foursquare, the 50 million users who search for local services on Yelp when out and about, and the increasing number who electronically hail a taxi with Uber in 11 cities (up from 1 city in 2011), which they can then watch arrive at their location on a real-time map.

In updating this paragraph, I found that the statistics reflecting 2013's progress by mobile location services are as follows: Foursquare grew from 20 to 30 million users, Yelp grew from 50 million to 100 million users, and Uber is now in 66 cities (up from 11 cities in 2012).

From this small sample of progress, it seems that there is still a lot of growth in location-based services, especially ones involving crowdsourcing of physical tasks (e.g., Uber, TaskRabbit, Gigwalk).

The disappointing one of the pack in terms of user growth is Foursquare (which "only" grew by 50% in 2013), but they are arguably facing the different challenge of proving "that there's a real business there", in CEO Dennis Crowley's words (i.e., finding sustainable revenue streams). But in general, the most promising location-based services are still following the exponential growth curve (in number of users) which is good news for innovation.

The Language of Location

 

During my work, I've often noticed similarities between language and individual daily life location behaviour (as detected by GPS, cell towers, tweets, check-ins etc.). To summarise these thoughts, I've compiled a list of the similarities and differences between language and location below. I then mention a few papers that exploit these similarities to create more powerful or interesting approaches to analysing location data.

Similarities between Location and Language Data

  • Both exhibit power laws. A lot of words are used very rarely while a few words are used very frequently. The same happens with the frequency of visits to locations (e.g., how often you visit home vs. your favourite theme park). This is not a truism: the most frequently visited locations or words are *much* more likely to be visited/used than most other places/words.
  • Both exhibit sequential structure. Words are highly correlated with words near to them on the page. The same for locations on a particular day.
  • Both exhibit topics or themes. In the case of language, groups of words tend to co-occur in the same document (e.g., two webpages that talk about cars are both likely to mention words from a similar group of words representing the "car" topic). In the case of location data, a similar thing happens. I mention two interpretations from specific papers later in this post.
  • The availability of both language data and location data has exploded in the last decade (the former from the web, the latter from mobile devices).
  • There are cultural differences in using language just as there are cultural differences in location behaviour (e.g., Spanish people like to eat out later than people of other cultures).
  • Both are hierarchical. Languages have letters, words, sentences, and paragraphs. A person can be moving around at the level of the street, city, or country (during an hour, day, or week).
  • Both exhibit social interactions. Language is exchanged in emails, texts, verbally, or in scholarly debate. Friends, co-workers, and family may have interesting patterns of co-location.

Differences between Location and Language Data

  • Many words are shared between texts (of same language) but locations are usually highly personal to individuals (except for the special cases of friends, co-workers, and family).
  • There are no periodicities in text but strong periodicities in location (i.e., hourly, weekly, and monthly).
  • Language data is not noisy (except for spelling and grammar mistakes) while location data is usually noisy.
  • Language analysts do not usually need to worry about privacy issues whilst location analysts usually do.

Work that Exploits These Similarities

Here are a few papers that apply or adapt approaches that were primarily used for language models to location data:

K. Farrahi and D. Gatica-Perez. Extracting mobile behavioral patterns with the distant n-gram topic model. In Proc. ISWC, 2012.

They use topic modelling to capture the tendency of visiting certain locations on the same day. This is similar to using the presence of words like "windshield" and "wheel" to place higher predictive density on words like "road" and "bumper" (i.e., topic modelling bags of words). I have talked previously about why I think this is a good paper.

L. Ferrari and M. Mamei. Discovering daily routines from Google Latitude with topic models. In PerCom Workshops, pages 432–437, 2011.

This paper uses a similar application of topic modelling as the one by Farrahi and Gatica-Perez.

H. Gao, J. Tang, and H. Liu. Exploring social-historical ties on location-based social networks. In 6th ICWSM, 2012.

This paper uses a model that was previously used to capture sequential structure in words and applies it to Foursquare checkins.

J. McInerney, J. Zheng, A. Rogers, N. R. Jennings. Modelling Heterogeneous Location Habits in Human Populations for Location Prediction Under Data Sparsity. In Proc. UbiComp, 2013.

In my own work, I've used the concept of topics to refer to location habits that represent the tendency of an individual to be at a given location at a certain time of day or week. This way of thinking about locations is useful in generalising temporal structure in location behaviour across people, while still allowing for topics/habits to be present to greater or varying degrees in different people's location histories (just as topics are more or less prevalent in different documents).

Both language and location data are results of human behaviour, so it is unsurprising to find similarities, even if I think some of the similarities are coincidental (e.g., power laws crop up in many places and often for different reasons, and the increasing availability of data is part of the general trend of moving the things we care about into the digital domain). The benefits of analysis approaches seem to be flowing in the language -> location direction only at the moment, though I hope one day that will change.

Practical Guide to Variational Inference

 

There are a few standard techniques for performing inference on hierarchical Bayesian models. Finding the posterior distribution over parameters or performing prediction requires an intractable integral for most Bayesian models, arising from the need to marginalise ("integrate out") nuisance parameters. In the face of this intractability there are two main ways to perform approximate inference: either transform the integration into a sampling problem (e.g., Gibbs sampling, slice sampling) or an optimisation problem (e.g., expectation-maximisation, variational inference).

For applied machine learning researchers, probably the most straightforward method is Gibbs sampling (see Chapter 29 of "Information Theory, Inference, and Learning Algorithms" by David MacKay) because you only need to derive conditional probability distributions for each random variable and then sample from these distributions in turn. Of course you have to handle convergence of the Markov chain, and make sure your samples are independent, but you can't go far wrong with the derivation of the conditional distributions themselves. The downside of sampling methods is their slow speed. A related issue is that sampling methods are not ideal for online scalable inference (e.g., learning from streaming social network data).

For these reasons, I have spent the last 6 months learning how to apply variational inference to my mobility models. While there are some very good sources describing variational inference (e.g., chapter 10 of "Pattern Recognition and Machine Learning" by Bishop, this tutorial by Fox and Roberts, this tutorial by Blei), I feel that the operational details can get lost among the theoretical motivation. This makes it hard for someone just starting out to know what steps to follow. Having successfully derived variational inference for several custom hierarchical models (e.g., stick-breaking hierarchical HMMs, extended mixture models), I'm writing a practical summary for anyone about to go down the same path. So, here is my summary of how you actually apply variational Bayes to your model.

 

Preliminaries

I'm omitting an in-depth motivation because it has been covered so well by the aforementioned tutorials. But briefly, the way that mean-field variational inference transforms an integration problem into an optimisation problem is by first assuming that your model factorises further than you originally specified. It then defines a measure of error between the simpler factorised model and the original model (usually, this function is the Kullback-Leibler divergence, which is a measure of distance between two distributions). The optimisation problem is to minimise this error by modifying the parameters to the factorised model (i.e., the variational parameters).

Something that can be confusing is that these variational parameters have a similar role in the variational model as the (often, fixed) hyperparameters have in the original model, which is to control things like prior mean, variance, and concentration. The difference is that you will be updating the variational parameters to optimise the factorised model, while the fixed hyperparameters to the original model are left alone. The way that you do this is by using the following equations for the optimal distributions over the parameters and latent variables, which follow from the assumptions made earlier:

\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right)

\mathrm{ln} \; q^*(Z_i) = \mathbb{E}_{-Z_i}\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right)

where X is the observed data, V is the set of parameters, Z is the set of latent variables, and \alpha is the set of hyperparameters. Another source of possible confusion is that these equations do not explicitly include the variational parameters, yet these parameters are the primary object of interest in the variational scheme. In the steps below, I describe how to derive the update equations for the variational parameters from these equations.


1. Write down the joint probability of your model

Specify the distributions and conditional dependencies of the data, parameters, and latent variables for your original model. Then write down the joint probability of the model, given the hyperparameters. In the following steps, I'm assuming that all the distributions are conjugate to each other (e.g., multinomial data have Dirichlet priors, Gaussian data have Gaussian-Wishart priors and so on).

The joint probability will usually look like this:

p(X, Z, V | \alpha) = p(V | \alpha) \prod_n^N \mathrm{<data \; likelihood \; of \; V, Z_n>} \mathrm{<probability \; of \; Z_n>}

where N is the number of observations. For example, in a mixture model, the data likelihood is p(X_n | Z_n, V) and the probability of Z_n is p(Z_n | V). An HMM has the same form, except that Z_n now has probability p(Z_n | Z_{n-1}, V). A Kalman filter is an HMM with continuous Z_n. A topic model introduces an outer product over documents and an additional set of (global) parameters.
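To make the template concrete, here is the joint for a simple mixture model, writing V = (\pi, \mu) for the mixing weights and component parameters (my notation, chosen to match the general form above):

p(X, Z, V | \alpha) = p(\pi | \alpha) \prod_k p(\mu_k | \alpha) \prod_n^N p(X_n | Z_n, \mu) \; p(Z_n | \pi)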


2. Decide on the independence assumptions for the variational model

Decide on the factorisation that will allow tractable inference on the simpler model. The assumption that the latent variables are independent of the parameters is a common way to achieve this. Interestingly, you will find that a single assumption of factorisation will often induce further factorisations as a consequence. These come "for free" in the sense that you get simpler and easier equations without having to make any additional assumptions about the structure of the variational model.

Your variational model will probably factorise like this:

q(Z, V) = q(Z) q(V)

and you will probably get q(V) = \prod_i q(V_i) as a set of induced factorisations.


3. Derive the variational update equations

We now address the optimisation problem of minimising the difference between the factorised model and the original one.

Parameters

Use the general formula that we saw earlier:

\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln}\; p(X, Z, V | \alpha) \right)

The trick is that most of the terms in p(X, Z, V | \alpha) do not involve V_i, so can be removed from the expectation and absorbed into a single constant (which becomes a normalising factor when you take the exponential of both sides). You will get something that looks like this:

\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln} \; p(V_i | \alpha) + \sum_n^N \mathrm{ln} \; \mathrm{<data \; likelihood \; of \; V_i, Z_n>} \right) + \mathrm{constant}

What you are left with is the log prior distribution of V_i plus the total log data likelihood of V_i given Z_n. Even within these remaining terms, you can often find factors that do not involve V_i, so a lot of the work in this step involves discarding irrelevant parts.

The remaining work, assuming you chose conjugate distributions for your model, is to manipulate the equations to look like the prior distribution of V_i (i.e., to have the same functional form as p(V_i | \alpha)). You will end up with something that looks like this:

\mathrm{ln} \; q^*(V_i) = \mathrm{ln} \; p(V_i | \alpha_i') + \mathrm{constant}

where your goal is to find the value of \alpha_i' through equation manipulation. \alpha_i' is your variational parameter, and it will involve expectations of other parameters V_{-i} and/or Z (if it didn't, then you wouldn't need an iterative method). It's helpful to remember at this point that there are standard results for \mathbb{E} \left( \mathrm{ln} \; V_j \right) for common types of distribution (e.g., if q(V_j) is Dirichlet with parameter \alpha_j', then \mathbb{E} \left( \mathrm{ln} \; V_{j,k} \right) = \psi(\alpha_{j,k}') - \psi(\sum_{k'} \alpha_{j,k'}'), where \psi is the digamma function). Sometimes you will have to do further manipulation to find expectations of other functions of the parameters. We consider next how to find the expectations of the latent variables \mathbb{E}(Z).

Latent variables

Start with:

\mathrm{ln} \; q^*(Z_n) = \mathbb{E}_{-Z_{n}}\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right)

and try to factor out Z_{n}. This will usually be the largest update equation because you will not be able to absorb many terms into the constant. This is because you need to consider the parameters generating the latent variables as well as the parameters that control their effect on observed data. Using the example of multinomial independent Z_n (e.g., in a mixture model), this works out to be:

\mathrm{ln} \; q^*(Z_{n,k}) = \mathbb{E}_{-Z_{n,k}}\left( Z_{n,k} \mathrm{ln} \; V_k + Z_{n,k} \mathrm{ln} \; p(X_n | V_k) \right) + \mathrm{constant}

factorising out Z_{n,k} to get:

\mathrm{ln} \; \mathbb{E}(Z_{n,k}) = \mathbb{E}(\mathrm{ln} \; V_k) + \mathbb{E}(\mathrm{ln} \; p(X_n | V_k)) + \mathrm{constant}


4. Implement the update equations

Put your update equations from step 3 into code. Iterate over the parameters (M-step) and latent variables (E-step) in turn until the variational parameters converge. Multiple restarts from random initialisations of the expected latent variables are recommended, as variational inference only converges to a local optimum.
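As an illustration of steps 1 to 4, here is a minimal CAVI sketch for a toy model of my own choosing (not taken from the tutorials above): a K-component mixture of one-dimensional Gaussians with known unit variance, a Dirichlet prior on the mixing weights, and zero-mean Gaussian priors on the component means.

import numpy as np
from scipy.special import digamma

def cavi_gmm(x, K=3, alpha0=1.0, prior_var=100.0, n_iter=100, seed=0):
    """Mean-field VB for a 1-D Gaussian mixture with known unit variance.

    Variational family: q(pi) = Dirichlet(alpha), q(mu_k) = N(m_k, s2_k),
    q(z_n) = Categorical(r_n).
    """
    rng = np.random.default_rng(seed)
    N = len(x)
    r = rng.dirichlet(np.ones(K), size=N)          # responsibilities E[z_nk]

    for _ in range(n_iter):
        # parameter updates (M-step) given current responsibilities
        Nk = r.sum(axis=0)
        alpha = alpha0 + Nk                        # Dirichlet variational parameter
        s2 = 1.0 / (1.0 / prior_var + Nk)          # variational variance of mu_k
        m = s2 * (r * x[:, None]).sum(axis=0)      # variational mean of mu_k (prior mean 0)

        # latent variable updates (E-step) given current q(pi) and q(mu)
        E_ln_pi = digamma(alpha) - digamma(alpha.sum())
        # E[ln N(x_n | mu_k, 1)] up to an additive constant
        E_ln_lik = -0.5 * (x[:, None] ** 2 - 2 * x[:, None] * m + (m ** 2 + s2))
        log_r = E_ln_pi + E_ln_lik
        log_r -= log_r.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

    return m, s2, alpha, r

# toy data from three well-separated components
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(mu, 1.0, 200) for mu in (-5.0, 0.0, 5.0)])
m, s2, alpha, r = cavi_gmm(x, K=3)
print(np.sort(m))   # inferred means should lie near -5, 0, 5

The E-step and M-step here correspond to the latent variable and parameter update equations derived in step 3.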

 

The video below shows what variational inference looks like on a mixture model. The green scatters represent the observed data, the blue diamonds are the ground truth means (not known by the model, obviously), the red dots are the inferred means and the ellipses are the inferred covariance matrices:

Intelligence Squared Debate on the Future of Mobility

 


Jerry Sanders talking about SkyTran (picture by @LinstockOnline)

I was in London yesterday to watch a debate about the future of mobility (i.e., transport) hosted by Intelligence Squared at the Royal Institution, in the same room where Faraday demonstrated electricity. We saw four speakers give their visions for the future. Despite my impression that any of the speakers could have talked for 10 times longer than they did and still been fascinating, the organisers went for TED-style short talks. The format seemed to work quite well because the talks were polished and energy levels were high in the room throughout.

1. Paul Newman

Prof Newman has been developing self-driving cars at Oxford University's Information Engineering department. In a word, his idea of how self-driving cars will take over the world is "gradually". Imagine driving yourself through a traffic jam when your car offers to take over for you in that stretch of road, and, having already participated in a behind-the-scenes auction, has agreed with your insurance company to offer you a 9 pence discount if you do so (given that computers will be safer drivers than people in future).

The motivations for the gradual change seem to be mostly technical. There will be weather conditions and extreme situations in which the car decides for itself that it wouldn't be able to drive very well. The self-driving cars also need a period in which they will learn our driving styles and learn about the world's roads.

2. Robin Chase

Robin Chase co-founded car sharing startup ZipCar and has since gone on to create new ride sharing companies BuzzCar and GoLoco. She painted two possible extremes for car occupancy in the future.

Future hell is the proliferation of zero-occupancy cars, where owners of self-driving cars send their cars on trips round the block while they shop, for lack of available parking options. Or imagine us all cycling to work, but having our cars follow us with our laptops and a change of clothes (in the style of David Cameron)? This scenario was not so plausible to me because I think it currently is, or will be made, illegal to have empty cars driving on the road anywhere in the world.

The future car paradise, in her view, is having 6 people sharing a self-driving car (boosting car occupancy way beyond the current level), though those who can afford it can ride with lower occupancy if they wish. It seems to me that if digital records of car occupancy are kept (which they would be, if we are all ride-sharing in future), it would be easy to apply a direct car-occupancy tax, so not many people would be tempted by the much more expensive single-occupancy option.

The other interesting thing she mentioned about the future of car sharing is that car owners would start to fit the car to the occasion (e.g., a swanky car for a first date or a job interview, but something more basic the rest of the time), instead of having to cover all possible scenarios, as they do now, when buying a car that they permanently own.

3. Jerry Sanders

In a polished pitch, Jerry Sanders told us about his plans for SkyTran, a rapid transit system to be built in Tel Aviv. Passengers each get their own pod that will hang from a pylon above the city, but will be propelled magnetically forward ("surfing on a wave of magnetism") at speeds of 100 mph. The speaker painted SkyTran as the solution to congestion.

I like the idea and would definitely take a ride in one if I ever had the chance, but it seems futuristic in the same way that monorails were meant to be the solution to city congestion 30 years ago. What happened to monorails? According to Wikipedia, city councils perceived them to be a high cost and untested solution, so preferred to invest in more mature (but ultimately more expensive) public transport methods like underground rail.

4. Ben Hamilton-Baillie

After hearing three technology-related visions, it was refreshing to hear low-tech but pragmatic ideas from Ben Hamilton-Baillie. He has pioneered "shared spaces" in this country, which boils down to removing all the signs and road markings in an urban area (e.g., Exhibition Road in London) to induce a sense of shared risk and responsibility.

Someone in the audience asked whether that increases the stress levels of pedestrians; "I'm glad you're less comfortable", he replied, because pedestrians are in danger all the time, it's just that they don't feel it.

The talks were followed by a panel Q+A session. Questions from the audience were on topics such as the transport of goods (Robin Chase said that there is still massive unused capacity in freight), the fate of the £40 billion spent by motorists today (someone on the panel said they would prefer we spend it on massages and take-aways), and the love of cars for their own sake (Paul Newman emphasised that technology gives us options, so car lovers can still indulge themselves in future).