Archive for datasets

Comment on "How Should We Analyse Our Lives?"

 

I just heard about a new book called "Social Physics", being released at the end of the month by Alex "Sandy" Pentland (the famous professor at the MIT Media Lab). According to the synopsis, the book is about "idea flow" and "the way human social networks spread ideas and transform those ideas into behaviours".

More thoughts on this once it's actually released. But at first glance, the question of how ideas spread is typical popular science fare, so I hope Prof. Pentland's unique perspective will make it truly different from what's already out there. Also, I'm always dubious of attempts to raise the spectre of "physics" in the context of behaviour analysis. The intended analogy is clearly between finding universal laws of human behaviour and finding universal laws in physics. But scientists in every field are trying to find universal laws! Should we now rebrand economics as "currency physics", computer science as "algorithmic physics", and biology as "organic physics", etc.?

With those (rather superficial) caveats, the book is clearly relevant to my research topic of trying to analyse and predict location behaviour using the digital breadcrumbs left by humans in daily life. I'm looking forward to reading the book as I have often found inspiration in Prof. Pentland's research.

The book was given an interesting review by Gillian Tett of the Financial Times last weekend. Her main point of agreement with the book is that the difference between new and old research on human behaviour analysis is due to the size of the data, plus the extent to which that data is interpreted subjectively vs. objectively (e.g., anthropologists analysing small groups of people vs. the 10 GB that Prof. Pentland collected on 80 call centre workers at an American bank).

So far so good. But she goes on to criticise computer scientists for analysing people like atoms "using the tools of physics" when in reality they can be "culturally influenced too". As someone who uses the tools of physics to analyse behaviour (i.e., simulation techniques and mean field approximations that originated in statistical physics), I think this is a false dichotomy.

There are two ways you can read the aforementioned criticism. The first is that statistical models of human behaviour are unable to infer (or learn) cultural influences from the raw data. The second is that the models themselves are constructed under the bias of the researcher's culture.

In the first case, there's nothing to stop researchers from finding new ways of allowing the models to pick up on cultural influences in behaviour (in an unsupervised learning manner), as long as the tools are rich and flexible enough (which they almost certainly are in the case of statistical machine learning).

The second case is more subtle. In my experience, the assumptions that are expressed in my, and my colleagues', models are highly flexible and don't appear to impose cultural constraints on the data. How do I know? I could be wrong, but my confidence in the flexibility of our models is due to the fact that model performance can be quantified with hard data (e.g., using measures of precision/recall or held-out data likelihood). This means that any undue bias (cultural or otherwise) that one researcher imposes will be almost immediately penalised by another researcher coming along and claiming he or she can do "X%" better in predictive accuracy. This is the kind of honesty that hard data brings to the table, though I agree that it is not perfect and that many other things can go wrong with the academic process (e.g., proprietary data making publications impervious to criticism, and the fact that researchers will spend more effort on optimising their own approach for the dataset in question than on optimising the benchmarks).
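
To make the held-out likelihood idea concrete, here is a minimal sketch in Python. The location labels, the two toy models, and the function names are all made up for illustration; nothing here comes from a real dataset or from any particular paper. The point is only that two competing models can be scored on data neither of them saw during fitting.

```python
import math
from collections import Counter


def fit_categorical(train, vocab, alpha=1.0):
    """Fit an additively smoothed categorical distribution over locations."""
    counts = Counter(train)
    total = len(train) + alpha * len(vocab)
    return {loc: (counts[loc] + alpha) / total for loc in vocab}


def heldout_log_likelihood(model, heldout):
    """Average log-probability per held-out observation (higher is better)."""
    return sum(math.log(model[loc]) for loc in heldout) / len(heldout)


# Hypothetical toy data: observed location labels for one person.
train = ["home", "work", "home", "work", "home", "gym", "home", "work"]
heldout = ["home", "work", "home", "cafe"]
vocab = sorted(set(train) | set(heldout))

# Model A assumes every location is equally likely.
uniform = {loc: 1.0 / len(vocab) for loc in vocab}
# Model B learns visit frequencies from the training data.
smoothed = fit_categorical(train, vocab)

ll_uniform = heldout_log_likelihood(uniform, heldout)
ll_smoothed = heldout_log_likelihood(smoothed, heldout)
print(f"uniform: {ll_uniform:.3f}  smoothed: {ll_smoothed:.3f}")
```

Whichever model assigns higher probability to the held-out visits wins, regardless of who built it or what assumptions they brought along.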

My line of argument of course assumes that the data itself is culturally diverse. This wouldn't be the case if the only data we collect and use comes from, for example, university students in developed countries. But the trend is definitely moving away from that situation. Firstly, the amount of data available is growing roughly tenfold every few years (driven by data available from social networking sites like Twitter and Foursquare). At a certain point, the amount of data makes user homogeneity impossible, simply because there just aren't that many university students out there! Secondly, forward-thinking network operators like Orange are making cell tower data from developing countries available to academic researchers (under very restricted circumstances, I should add). So in conclusion, since the data is becoming more diverse, this should force out any cultural bias in the models (if there was any there to start with).

The Language of Location

 

During my work, I've often noticed similarities between language and individual daily life location behaviour (as detected by GPS, cell towers, tweets, check-ins etc.). To summarise these thoughts, I've compiled a list of the similarities and differences between language and location below. I then mention a few papers that exploit these similarities to create more powerful or interesting approaches to analysing location data.

Similarities between Location and Language Data

  • Both exhibit power laws. A lot of words are used very rarely while a few words are very frequently used. The same happens with the frequency of visits to locations (e.g., how often you visit home vs. your favourite theme park). This is not a truism: the most frequently visited locations or words used are *much* more likely to be visited/used than most other places/words.
  • Both exhibit sequential structure. Words are highly correlated with words near to them on the page. The same is true of the locations visited on a particular day.
  • Both exhibit topics or themes. In the case of language, groups of words tend to co-occur in the same document (e.g., two webpages that talk about cars are both likely to mention words from a similar group of words representing the "car" topic). In the case of location data, a similar thing happens. I mention two interpretations from specific papers later in this post.
  • The availability of both language data and location data has exploded in the last decade (the former from the web, the latter from mobile devices).
  • There are cultural differences in using language just as there are cultural differences in location behaviour (e.g., Spanish people like to eat out later than people of other cultures).
  • Both are hierarchical. Languages have letters, words, sentences, and paragraphs. A person can be moving around at the level of the street, city, or country (during an hour, day, or week).
  • Both exhibit social interactions. Language is exchanged in emails, texts, verbally, or in scholarly debate. Friends, co-workers, and family may have interesting patterns of co-location.
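
As a rough illustration of the first similarity in the list above, the following Python sketch builds a rank-frequency table from a list of visits. The visit log is invented, and the function name is just a placeholder. With real data, one would typically plot frequency against rank on log-log axes and look for an approximately straight line, exactly as is done for word counts.

```python
from collections import Counter


def rank_frequency(visits):
    """Return (rank, location, share of all visits), most visited first."""
    counts = Counter(visits)
    total = len(visits)
    return [
        (rank, loc, n / total)
        for rank, (loc, n) in enumerate(counts.most_common(), start=1)
    ]


# Hypothetical visit log for one person (made-up counts).
visits = ["home"] * 10 + ["work"] * 6 + ["gym"] * 2 + ["cafe", "theme park"]

table = rank_frequency(visits)
for rank, loc, share in table:
    print(rank, loc, round(share, 2))
```

The same function works unchanged on a list of words, which is really the whole point of the analogy.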

Differences between Location and Language Data

  • Many words are shared between texts (of the same language) but locations are usually highly personal to individuals (except for the special cases of friends, co-workers, and family).
  • There are no periodicities in text but strong periodicities in location (i.e., hourly, weekly, and monthly).
  • Language data is not noisy (except for spelling and grammar mistakes) while location data is usually noisy.
  • Language analysts do not usually need to worry about privacy issues whilst location analysts usually do.

Work that Exploits These Similarities

Here are a few papers that apply or adapt approaches that were primarily used for language models to location data:

K. Farrahi and D. Gatica-Perez. Extracting mobile behavioral patterns with the distant n-gram topic model. In Proc. ISWC, 2012.

They use topic modelling to capture the tendency of visiting certain locations on the same day. This is similar to using the presence of words like "windshield" and "wheel" to place higher predictive density on words like "road" and "bumper" (i.e., topic modelling bags of words). I have talked previously about why I think this is a good paper.
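
To show what "bags of words" means for locations, here is a small Python sketch that turns each day into a bag of locations (a row of visit counts). This is only the input representation that a generic topic model would consume, not the distant n-gram topic model from the paper, and the days and place names are invented for illustration.

```python
from collections import Counter


def bag_of_locations(days):
    """Build a day-by-location count matrix, ignoring the order of visits."""
    vocab = sorted({loc for day in days for loc in day})
    matrix = []
    for day in days:
        counts = Counter(day)
        matrix.append([counts[loc] for loc in vocab])
    return vocab, matrix


# Hypothetical data: each inner list is the sequence of places visited in one day.
days = [
    ["home", "work", "home"],
    ["home", "gym", "home"],
    ["home", "work", "gym", "home"],
]

vocab, matrix = bag_of_locations(days)
print(vocab)
for row in matrix:
    print(row)
```

Each row plays the role of a document and each column the role of a word, so any off-the-shelf topic model that accepts a document-term matrix could in principle be fitted to it.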

L. Ferrari and M. Mamei. Discovering daily routines from Google Latitude with topic models. In PerCom Workshops, pages 432–437, 2011.

This paper uses a similar application of topic modelling as the one by Farrahi and Gatica-Perez.

H. Gao, J. Tang, and H. Liu. Exploring social-historical ties on location-based social networks. In 6th ICWSM, 2012.

This paper uses a model that was previously used to capture sequential structure in words and applies it to Foursquare check-ins.
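
For a feel of how a word-sequence model carries over to check-ins, here is a Python sketch of a plain bigram model with additive smoothing. To be clear, this is a much simpler stand-in that I have swapped in for illustration, not the model used in the paper, and the check-in sequences and function names are made up.

```python
from collections import Counter, defaultdict


def fit_bigram(sequences, alpha=1.0):
    """Estimate P(next location | previous location) with additive smoothing."""
    vocab = sorted({loc for seq in sequences for loc in seq})
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each consecutive pair of check-ins, as one would for word bigrams.
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1

    def prob(prev, nxt):
        total = sum(counts[prev].values()) + alpha * len(vocab)
        return (counts[prev][nxt] + alpha) / total

    return vocab, prob


# Hypothetical check-in sequences, one list per day.
sequences = [
    ["home", "cafe", "work", "gym", "home"],
    ["home", "work", "cafe", "work", "home"],
    ["home", "cafe", "work", "home"],
]

vocab, prob = fit_bigram(sequences)
for nxt in vocab:
    print("home ->", nxt, round(prob("home", nxt), 3))
```

Replacing the location labels with words gives a standard smoothed bigram language model, which is why techniques from that literature transfer so readily.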

J. McInerney, J. Zheng, A. Rogers, N. R. Jennings. Modelling Heterogeneous Location Habits in Human Populations for Location Prediction Under Data Sparsity. In Proc. UbiComp, 2013.

In my own work, I've used the concept of topics to refer to location habits that represent the tendency of an individual to be at a given location at a certain time of day or week. This way of thinking about locations is useful in generalising temporal structure in location behaviour across people, while still allowing for topics/habits to be present to greater or varying degrees in different people's location histories (just as topics are more or less prevalent in different documents).

Both language and location data are results of human behaviour, so it is unsurprising to find similarities, even if I think some of the similarities are coincidental (e.g., power laws crop up in many places and often for different reasons, and the increasing availability of data is part of the general trend of moving the things we care about into the digital domain). The benefits of analysis approaches seem to be flowing in the language -> location direction only at the moment, though I hope one day that will change.

Location Datasets

 

Here is a list of daily life location behaviour datasets available online. I will try to keep this updated, and additional suggestions are obviously welcome. N.B., some of the semi-public datasets are available to people in academia only.

The list is focused on datasets from which one might be able to identify individual location routines. Therefore, aggregated or short datasets (< 1 week) are omitted. In no particular order:

GeoLife GPS Trajectories
182 users over a period of 2 years. Public.

MIT Reality Mining Dataset
Cell tower locations (and various phone logs) of 100 people at MIT. Semi-public (need to submit your details on a web form).

Foursquare Data Set - University of Minnesota [update: it seems that Foursquare has asked the collectors of this data to no longer distribute it]
The check-in behaviour of over 2 million users. Semi-public (you have to request a link from them by email).

OpenPaths
Comprises a mixture of GPS, cell tower, and check-in data. Semi-public (you have to submit a project description and have individuals donate their data).

Dataset of LifeMap Users
Dataset of GPS and WiFi locations that includes crowdsourced semantic labels for locations by 68 people in South Korea. Public.

CRAWDAD
Long list of WiFi fingerprint data at Dartmouth College, US. Public.