## Comment on "How Should We Analyse Our Lives?"

I just heard about a new book, called "Social Physics", being released at the end of the month by Alex "Sandy" Pentland (the famous professor of the MIT media lab). According to the synopsis, the book is about "idea flow" and "the way human social networks spread ideas and transform those ideas into behaviours".

More thoughts on this once it's actually released. But at first glance, the question of how ideas spread is typical popular science fare, so I hope Prof. Pentland's unique perspective will make it truly different from what's already out there. Also, I'm always dubious of attempts to raise the spectre of "physics" in the context of behaviour analysis. The intended analogy is clearly between finding universal laws of human behaviour and finding universal laws in physics. But scientists in every field are trying to find universal laws! Should we now rebrand economics as "currency physics", computer science as "algorithmic physics", and biology as "organic physics", etc.?

With those (rather superficial) caveats, the book is clearly relevant to my research topic of trying to analyse and predict location behaviour using the digital breadcrumbs left by humans in daily life. I'm looking forward to reading the book as I have often found inspiration in Prof. Pentland's research.

The book was given an interesting review by Gillian Tett of the Financial Times last weekend. Her main point of agreement with the book is that the difference between new and old research on human behaviour analysis is due to the size of the data, plus the extent to which that data is interpreted subjectively v.s. objectively (e.g., anthropologists analysing small groups of people v.s. the 10GB that Prof. Pentland collected on 80 call centre workers at an American bank).

So far so good. But she goes on to criticise computer scientists for analysing people like atoms "using the tools of physics" when in reality they can be "culturally influenced too". As someone who uses the tools of physics to analyse behaviour (i.e., simulation techniques and mean field approximations that originated in statistical physics) I think this is false dichotomy.

There are two ways you can read the aforementioned criticism. The first is that statistical models of human behaviour are unable to infer (or learn) cultural influences from the raw data. The second is that the models themselves are constructed under the bias of the researcher's culture.

In the first case, there's nothing to stop researchers from finding new ways of allowing the models to pick up on cultural influences in behaviour (in an unsupervised learning manner), as long as the tools are rich and flexible enough (which they almost certainly are in the case of statistical machine learning).

The second case is more subtle. In my experience, the assumptions that are expressed in my, and my colleagues', models are highly flexible and don't appear to impose cultural constraints on the data. How do I know? I could be wrong, but my confidence in the flexibility of our models is due to the fact that model performance can be quantified with hard data (e.g., using measures of precision/recall or held-out data likelihood). This means that any undue bias (cultural or otherwise) that one researcher imposes will be almost immediately penalised by another researcher coming along and claiming he or she can do "X%" better in predictive accuracy. This is the kind of honesty that hard data brings to the table, though I agree that it is not perfect and that many other things can go wrong with the academic process (e.g., proprietary data making publications impervious to criticism, and the fact that researchers will spend more effort on optimising their own approach for the dataset in question v.s. the benchmarks).

My line of argument of course assumes that the data itself is culturally diverse. This wouldn't be the case if the only data we collect and use comes from, for example, university students in developed countries. But the trend is definitely moving away from that situation. Firstly, the amount of data available is growing by a factor of 10 about every few years (driven by data available from social networking sites like Twitter and Foursquare). At a certain point, the amount of data makes user homogeneity impossible, simply because there just aren't that many university students out there! Secondly, forward thinking network operators like Orange are making cell tower data from developing countries available to academic researchers (under very restricted circumstances I should add). So in conclusion, since the data is becoming more diverse, this should force out any cultural bias in the models (if there was any there to start with).

## The Language of Location

During my work, I've often noticed similarities between language and individual daily life location behaviour (as detected by GPS, cell towers, tweets, check-ins etc.). To summarise these thoughts, I've compiled a list of the similarities and differences between language and location below. I then mention a few papers that exploit these similarities to create more powerful or interesting approaches to analysing location data.

Similarities between Location and Language Data

• Both exhibit power laws. A lot of words are used very rarely while a few words are very frequently used. The same happens with the frequency of visits to locations (e.g., how often you visit home v.s. your favourite theme park). This is not a truism. The most frequently visited locations or words used are *much* more likely to be visited/used than most other places/words.
• Both exhibit sequential structure. Words are highly correlated with words near to them on the page. The same for locations on a particular day.
• Both exhibit topics or themes. In the case of language, groups of words tend to co-occur in the same document (e.g., two webpages that talk about cars are both likely to mention words from a similar group of words representing the "car" topic). In the case of location data, a similar thing happens. I mention two interpretations from specific papers later in this post.
• The availability of both language data and location data has exploded in the last decade (the former from the web, the latter from mobile devices).
• There are cultural differences in using language just as there are cultural differences in location behaviour (e.g., Spanish people like to eat out later than people of other cultures).
• Both are hierarchical. Languages have letters, words, sentences, and paragraphs. A person can be moving around at the level of the street, city, or country (during an hour, day, or week).
• Both exhibit social interactions. Language is exchanged in emails, texts, verbally, or in scholarly debate. Friends, co-workers, and family may have interesting patterns of co-location.

Differences between Location and Language Data

• Many words are shared between texts (of same language) but locations are usually highly personal to individuals (except for the special cases of friends, co-workers, and family).
• There are no periodicities in text but strong periodicities in location (i.e., hourly, weekly, and monthly).
• Language data is not noisy (except for spelling and grammar mistakes) while location data is usually noisy.
• Language analysts do not usually need to worry about privacy issues whilst location analysts usually do.

Work that Exploits These Similarities

Here are a few papers that apply or adapt approaches that were primarily used for language models to location data:

K. Farrahi and D. Gatica-Perez. Extracting mobile behavioral patterns with the distant n-gram topic model. In Proc. ISWC, 2012.

They use topic modelling to capture the tendency of visiting certain locations on the same day. This is similar to using the presence of words like "windshield" and "wheel" to place higher predictive density on words like "road" and "bumper" (i.e., topic modelling bags of words). I have talked previously about why I think this is a good paper.

L. Ferrari and M. Mamei. Discovering daily routines from google latitude with topic models. In PerCom Workshops, pages 432–437, 2011.

This paper uses a similar application of topic modelling as the one by Farrahi and Gatica-Perez.

H. Gao, J. Tang, and H. Liu. Exploring social-historical ties on location-based social networks. In 6th ICWSM, 2012.

This paper uses a model that was previously used to capture sequential structure in words and applies it to Foursquare checkins.

J. McInerney, J. Zheng, A. Rogers, N. R. Jennings. Modelling Heterogeneous Location Habits in Human Populations for Location Prediction Under Data Sparsity. In Proc. UbiComp, 2013.

In my own work, I've used the concept of topics to refer to location habits that represent the tendency of an individual to be at a given location at a certain time of day or week. This way of thinking about locations is useful in generalising temporal structure in location behaviour across people, while still allowing for topics/habits to be present to greater or varying degrees in different people's location histories (just as topics are more or less prevalent in different documents).

Both language and location data are results of human behaviour, so it is unsurprising to find similarities, even if I think some of the similarities are coincidental (e.g., power laws crop up in many places and often for different reasons, and the increasing availability of data is part of the general trend of moving the things we care about into the digital domain). The benefits of analysis approaches seem to be flowing in the language -> location direction only at the moment, though I hope one day that will change.

## Practical Guide to Variational Inference

There are a few standard techniques for performing inference on hierarchical Bayesian models. Finding the posterior distribution over parameters or performing prediction requires an intractable integral for most Bayesian models, arising from the need to marginalise ("integrate out") nuisance parameters. In the face of this intractability there are two main ways to perform approximate inference: either transform the integration into a sampling problem (e.g., Gibbs sampling, slice sampling) or an optimisation problem (e.g., expectation-maximisation, variational inference).

For applied machine learning researchers, probably the most straightforward method is Gibbs sampling (see Chapter 29 of "Information Theory, Inference, and Learning Algorithms" by David MacKay) because you only need to derive conditional probability distributions for each random variable and then sample from these distributions in turn. Of course you have to handle convergence of the Markov chain, and make sure your samples are independent, but you can't go far wrong with the derivation of the conditional distributions themselves. The downside of sampling methods is their slow speed. A related issue is that sampling methods are not ideal for online scalable inference (e.g., learning from streaming social network data).

For these reasons, I have spent the last 6 months learning how to apply variational inference to my mobility models. While there are some very good sources describing variational inference (e.g., chapter 10 of "Pattern Recognition and Machine Learning" by Bishop, this tutorial by Fox and Roberts, this tutorial by Blei), I feel that the operational details can get lost among the theoretical motivation. This makes it hard for someone just starting out to know what steps to follow. Having successfully derived variational inference for several custom hierarchical models (e.g., stick-breaking hierarchical HMMs, extended mixture models), I'm writing a practical summary for anyone about go down the same path. So, here is my summary for how you actually apply variational Bayes to your model.

Preliminaries

I'm omitting an in-depth motivation because it has been covered so well by the aforementioned tutorials. But briefly, the way that mean-field variational inference transforms an integration problem into an optimisation problem is by first assuming that your model factorises further than you originally specified. It then defines a measure of error between the simpler factorised model and the original model (usually, this function is the Kullback-Leibler divergence, which is a measure of distance between two distributions). The optimisation problem is to minimise this error by modifying the parameters to the factorised model (i.e., the variational parameters).

Something that can be confusing is that these variational parameters have a similar role in the variational model as the (often, fixed) hyperparameters have in the original model, which is to control things like prior mean, variance, and concentration. The difference is that you will be updating the variational parameters to optimise the factorised model, while the fixed hyperparameters to the original model are left alone. The way that you do this is by using the following equations for the optimal distributions over the parameters and latent variables, which follow from the assumptions made earlier:

$\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right)$

$\mathrm{ln} \; q^*(Z_i) = \mathbb{E}_{-Z_i}\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right)$

where $X$ is the observed data, $V$ is the set of parameters, $Z$ is the set latent variables, and $\alpha$ is the set of hyperparameters. Another source of possible confusion is that these equations do not explicitly include the variational parameters, yet these parameters are the primary source of interest in the variational scheme. In the steps below, I describe how to derive the update equations for the variational parameters from these equations.

1. Write down the joint probability of your model

Specify the distributions and conditional dependencies of the data, parameters, and latent variables for your original model. Then write down the joint probability of the model, given the hyperparameters. In the following steps, I'm assuming that all the distributions are conjugate to each other (e.g., multinomial data have Dirichlet priors, Gaussian data have Gaussian-Wishart priors and so on).

The joint probability will usually look like this:

$p(X, Z, V | \alpha) = p(V | \alpha) \prod_n^N \mathrm{} \mathrm{}$

where $N$ is the number of observations. For example, in a mixture model, the data likelihood is $p(X_n | Z_n, V)$ and the probability of $Z_n$ is $p(Z_n | V)$. An HMM has the same form, except that $Z_n$ now has probability $p(Z_n | Z_{n-1}, V)$. A Kalman filter is an HMM with continuous $Z_n$. A topic model introduces an outer product over documents and additional set of (global) parameters.

2. Decide on the independence assumptions for the variational model

Decide on the factorisation that will allow tractable inference on the simpler model. The assumption that the latent variables are independent of the parameters is a common way to achieve this. Interestingly, you will find that a single assumption of factorisation will often induce further factorisations as a consequence. These come "for free" in the sense that you get simpler and easier equations without having to make any additional assumptions about the structure of the variational model.

Your variational model will probably factorise like this:

$q(Z, V) = q(Z) q(V)$

and you will probably get $q(V) = \prod_i q(V_i)$ as a set of induced factorisations.

3. Derive the variational update equations

We now address the optimisation problem of minimising the difference between the factorised model and the original one.

Parameters

Use the general formula that we saw earlier:

$\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln}\; p(X, Z, V | \alpha) \right)$

The trick is that most of the terms in $p(X, Z, V | \alpha)$ do not involve $V_i$, so can be removed from the expectation and absorbed into a single constant (which becomes a normalising factor when you take the exponential of both sides). You will get something that looks like this:

$\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln} \; p(V_i | \alpha) + \sum_n^N \mathrm{ln} \; \mathrm{} \right) + \mathrm{constant}$

What you are left with is the log prior distribution of $V_i$ plus the total log data likelihood of $V_i$ given $Z_n$. Even in the two remaining equations, you can often find terms that do not involve $V_i$, so a lot of the work in this step involves discarding irrelevant parts.

The remaining work, assuming you chose conjugate distributions for your model, is to manipulate the equations to look like the prior distribution of $V_i$ (i.e., to have the same functional form as $p(V_i | \alpha)$). You will end up with something that looks like this:

$\mathrm{ln} \; q^*(V_i) = \mathbb{E}_{-V_i}\left( \mathrm{ln} \; p(V_i | \alpha_i') \right) + \mathrm{constant}$

where your goal is to find the value of $\alpha_i'$ through equation manipulation. $\alpha_i'$ is your variational parameter, and it will involve expectations of other parameters $V_{-i}$ and/or $Z$ (if it didn't, then you wouldn't need an iterative method). It's helpful to remember at this point that there are standard equations to calculate $\mathbb{E} \left( \mathrm{ln} \; V_j \right)$ for common types of distribution (e.g., Dirichlet $V_j$ has $\mathbb{E} \left( \mathrm{ln} \; V_{j,k} \right) = \psi(V_{j,k}) - \psi(\sum_{k'} V_{j,k'})$, where $\psi$ is the digamma function). Sometimes you will have to do further manipulation to find expectations of other functions of the parameters. We consider next how to find the expectations of the latent variables $\mathbb{E}(Z)$.

Latent variables

$\mathrm{ln} \; q^*(Z_n) = \mathbb{E}_{-Z_{n}}\left( \mathrm{ln} \; p(X, Z, V | \alpha) \right)$

and try to factor out $Z_{n}$. This will usually be the largest update equation because you will not be able to absorb many terms into the constant. This is because you need to consider the parameters generating the latent variables as well as the parameters that control their effect on observed data. Using the example of multinomial independent $Z_n$ (e.g., in a mixture model), this works out to be:

$\mathrm{ln} \; q^*(Z_{n,k}) = \mathbb{E}_{-Z_{n,k}}\left( Z_{n,k} \mathrm{ln} \; V_k + Z_{n,k} \mathrm{ln} \; p(X_n | V_k) \right) + \mathrm{constant}$

factorising out $Z_{n,k}$ to get:

$\mathrm{ln} \; \mathbb{E}(Z_{n,k}) = \mathbb{E}(\mathrm{ln} \; V_k) + \mathbb{E}(\mathrm{ln} \; p(X_n | V_k)) + \mathrm{constant}$

4. Implement the update equations

Put your update equations from step 3 into code. Iterate over the parameters (M-step) and latent variables (E-step) in turn until your parameters converge. Multiple restarts from random initialisations of the expected latent variables are recommended, as variational inference converges to the local optimum.

The video below shows what variational inference looks like on a mixture model. The green scatters represent the observed data, the blue diamonds are the ground truth means (not known by the model, obviously), the red dots are the inferred means and the ellipses are the inferred covariance matrices:

## Top 5 Papers for Mobile Location Prediction

Below are some of the most useful papers for location prediction from mobile data (GPS, cell towers, Foursquare checkins):

• Ashbrook, Daniel, and Thad Starner. "Using GPS to learn significant locations and predict movement across multiple users." Personal and Ubiquitous Computing 7.5 (2003): 275-286.
• A classic in the field, introducing a vision for finding significant places (with respect to individuals) and learning structure from past behaviour to make predictions. Although there are some shortcomings (e.g., their method of finding significant locations involved manual calibration, and they only considered a first-order Markov model to make predictions), I think they provide a compelling vision that stands the test of time 10 years later.
• Eagle, Nathan, and Alex Sandy Pentland. "Eigenbehaviors: Identifying structure in routine." Behavioral Ecology and Sociobiology 63.7 (2009): 1057-1066.
• In this paper, Eagle and Pentland represent each day of an individual's mobility as a point in high dimensional space, then use dimensionality reduction (principle component analysis) to find a set of "eigenbehaviors" that best characterises the location behaviour of that individual. Prediction can then be done by finding the mix of eigenbehaviors that best recreates a partially-seen day. The general form of the eigenbehaviors also allows comparison of habits between people, and they found some nice results showing how students from different faculties have similar location habits. I prefer the exploratory applications of this approach more than the predictive aspect, because I think the high accuracy results they report are more a function of people staying a long time at the same location in their dataset (since they consider a person to always be in 1 of 4 locations: home, work, other, or no signal). Still, I was always inspired by this paper. An extension that considered richer data sets was done by Sadilek and Krumm (2012), which found good results when also incorporating temporal information (day of week, whether the day was a national holiday).
• Gao, Huiji, Jiliang Tang, and Huan Liu. "Exploring Social-Historical Ties on Location-Based Social Networks." ICWSM. 2012.
• I think this is a strong paper that deserves more attention. It is an extension of the Ashbrook and Starner paper in the sense that the authors provide a more sophisticated way of doing sequential location prediction (i.e., "given your recent locations, where are you likely to be next?") using hierarchical Pitman-Yor processes (HPY). HPY assumes that the pattern of individual daily life locations follows a power law, with the rich getting richer (i.e., there are a few locations that are highly visited, and lots of locations that are hardly ever visited), which has been empirically observed in Gonzalez et al. (2008). Furthermore, HPY is a Bayesian way of varying the number of recent locations considered in the prediction (similar to variable-order Markov models, but with a more principled motivation that doesn't require a max-depth parameter). I suggest reading the paper for a more detailed description than I can give here.
• Horvitz, Eric, and John Krumm. "Some help on the way: Opportunistic routing under uncertainty." Proceedings of the 2012 ACM Conference on Ubiquitous Computing. ACM, 2012.
• The authors propose an anticipatory mobile computing system that recommends interesting places to drivers on their way to a predicted location. The prediction part of the system is assumed (using a previous approach that assumes a rational driver) so the focus is on how to calculate the places to recommend using expected utility. They also consider how to balance the expected benefit of asking a driver to confirm their destination (making the suggested places more relevant) against the cost of interruption. Clearly, there will be times when knowing the user's destination for certain would not change the recommendation very much (v.s. only having a probabilistic prediction), so this approach is useful in avoiding the "talking paperclip" syndrome where anticipatory applications interrupt users too much.
• Farrahi, Katayoun, and Daniel Gatica-Perez. "A probabilistic approach to mining mobile phone data sequences." Personal and Ubiquitous Computing (2013): 1-16.
• This work provides a Bayesian probabilistic model of individual location behaviour that is a hybrid of latent Dirichlet allocation (LDA) and the eigenbehaviors approach of Eagle and Pentland. The similarity to eigenbehaviors comes from the assumption that there exists a set of characteristic motifs that repeat themselves in the data (e.g., leaving home to go to the bus station, then to work). The authors' comparison to n-gram models appears to be a bit of a red herring to me, as the next location in their model is not dependent, in general, on the most recent previous locations (that is not to say a mobility model requires the n-gram assumption to be useful). The benefit of having LDA underneath it all (as opposed to, say, a mixture model) is to express the assumption that motifs are locally correlated within a single day. Intuitively, if I follow the aforementioned "going to work" motif in the morning, then I am probably more likely to follow other workday-related motifs (e.g., going to the gym, then home) than other types of motif later that day. With hierarchical Bayesian modelling, this type of structure can be learnt in an unsupervised way from the data, and then be used to make predictions about future behaviour.

I had to leave a lot of very good papers out, so I hope to make a longer list in future.