Highlights from RecSys 2018 — James McInerney

Here is a summary of the recent Conference on Recommender Systems I wrote with my Spotify colleagues Zahra Nazari and Ching-Wei Chen.

Introduction

On October 2nd 2018, over 800 participants came together for the 12th ACM Conference on Recommender Systems in Vancouver, Canada. The conference was held over six days, with various tutorials and workshops in addition to the main conference. The single-track format was used for the first time, with the following distribution over areas of research:

Spotify had a strong presence, starting with the tutorial “Mixed Methods for Evaluating User Satisfaction” by Jean Garcia-Gathright, Christine Hosey, Brian St. Thomas, Ben Carterette and Fernando Diaz; our paper “Explore, Exploit, and Explain: Personalizing Explainable Recommendations with Bandits”; the position paper “Assessing and Addressing Algorithmic Bias - But Before We Get There” by Jean Garcia-Gathright, Aaron Springer and Henriette Cramer at the FATREC workshop; and finishing with the RecSys Challenge, organised by Ching-Wei Chen, Markus Schedl from Johannes Kepler University, Hamed Zamani from the University of Massachusetts Amherst, and Paul Lamere.

Key Themes

Confounding in Recommender Systems

Confounded data in recommendation systems was a recurring theme at RecSys this year. As several papers discuss, implicit feedback collected by a recommender system in production is confounded by that recommender. Naively using this data for offline evaluation can be misleading, especially if the recommender being evaluated offline is very different from the production recommender. This also affects batch training, because there is always the danger that the new recommender is being trained to emulate the production recommender, not necessarily to optimize user engagement. We describe some papers from the main conference and workshops that deal with these issues.

Societal Impacts of Recommender Systems

The conference started and ended with keynotes emphasizing the social impact of recommender systems. Elizabeth Churchill, Director of User Experience at Google, called on designers and engineers to keep five Es in mind when designing recommendation systems:

Expedient: convenient and practical
Exigent: pressing and demanding
Explainable: understandable or intelligible
Equitable: fair and impartial
Ethical: morally good or correct

Christopher Berry, Director of Product Intelligence at CBC, ended the conference with a fascinating story of the rocky path to social cohesion in Canada. He invited the community to explore how recommender systems could help promote cohesion and to understand the differences between the content that polarizes and the content that unites. In the main conference, “RecSys that care” was the title of a track where researchers addressed various topics including diversity, sustainability and bias. Among the workshops, FATREC was held for the second consecutive year as a full-day workshop presenting the ideals and challenges of Fairness, Accountability, and Transparency in recommender systems.

User Understanding

Another apparent theme was the growing effort to understand users on a deeper level across different domains. These efforts ranged from building a more comprehensive representation of users’ interests and goals from diverse sources of information to interpreting their implicit and explicit signals more realistically. Researchers aspired to use what they learned to design and optimize methods that satisfy users’ real needs and capture their longer-term satisfaction.

Here are some of the highlights from the tutorials, conference, and workshops. This list is by no means exhaustive; there were many other interesting contributions left unmentioned.

Tutorials

Evaluation

In this tutorial, Jean Garcia-Gathright, Christine Hosey, Brian St. Thomas, and Fernando Diaz explained in depth how a mixed-methods approach can provide a framework for integrating qualitative and quantitative analysis. Using this approach, researchers can build a holistic picture of users and interpret implicit signals more realistically. The tutorial ranged from small qualitative in-lab studies to large-scale data collection and validation analysis, all aimed at understanding the complex world of user satisfaction.

In another part of this tutorial, Ben Carterette presented the fundamentals of significance testing and walked through examples of non-parametric, parametric, and bootstrap tests, with guidelines on how researchers can understand and interpret the results of these tests.
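To make the bootstrap idea concrete, here is a minimal sketch of a paired bootstrap test for comparing two recommenders on a per-user metric. It is not taken from the tutorial; the function name `paired_bootstrap_pvalue` and its parameters are our own illustrative choices, and it assumes both systems were scored on the same users.

```python
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired bootstrap test for a difference in mean metric.

    scores_a[i] and scores_b[i] are assumed to be the metric values of
    systems A and B for the same user i (hypothetical inputs).
    Returns (observed mean difference, approximate p-value).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Center the differences so that resampling simulates the null
    # hypothesis of "no difference in means".
    centered = [d - observed for d in diffs]
    extreme = 0
    for _ in range(n_resamples):
        resampled_mean = sum(rng.choice(centered) for _ in range(n)) / n
        if abs(resampled_mean) >= abs(observed):
            extreme += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return observed, (extreme + 1) / (n_resamples + 1)
```

A small p-value suggests the observed difference is unlikely under the null; as the tutorial stressed, interpreting it still requires care (for example, about multiple comparisons and effect size).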

This tutorial was well-received by the audience and made a great topic for researchers visiting Spotify’s booth to discuss their own challenges and learnings in designing satisfaction metrics.

Sequences

Sequences was a half-day tutorial covering the recent survey paper on sequence-aware recommender systems. This area explores the opportunities and challenges that time, as an extra dimension, brings to the design of recommender systems. Four types of tasks were discussed as the main applications of sequence-aware recommendations: context adaptation, trend detection, repeated recommendation, and tasks with order constraints. The tutorial was organized in two main parts, evaluation and algorithms, and concluded with a hands-on session. Dataset partitioning (an example is shown in the picture below) was highlighted as a distinct challenge in offline evaluation of these systems. On the algorithms side, sequence learning, sequence-aware matrix factorization and hybrid models were described. You can find the slides here.
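To illustrate why partitioning is different when order matters, here is a minimal sketch of one common scheme: hold out each user's most recent interaction for testing and train on everything before it. This is our own illustrative example and may not match the scheme shown in the tutorial; the function name `leave_last_out` and the `(user, item, timestamp)` input format are assumptions.

```python
from collections import defaultdict

def leave_last_out(interactions):
    """Split (user, item, timestamp) tuples by time, per user.

    Returns (train, test) where train[user] is the chronologically
    ordered list of earlier items and test[user] is the last item.
    Users with a single interaction are kept in train only.
    """
    by_user = defaultdict(list)
    for user, item, timestamp in interactions:
        by_user[user].append((timestamp, item))

    train, test = {}, {}
    for user, events in by_user.items():
        events.sort()  # chronological order
        items = [item for _, item in events]
        if len(items) < 2:
            train[user] = items
            continue
        train[user], test[user] = items[:-1], items[-1]
    return train, test
```

Unlike a random split, this never lets the model see an interaction that happened after the one it is asked to predict.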

Main Conference

On the topic of recommender confounding, there were several interesting papers. Allison Chaney et al. articulate and clarify the problem in their paper “How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility”. They introduce a model of how users interact with recommendations over several iterations of training and data collection. They use the model to examine confounding in simulations and find that it results in homogeneity in recommended items, as measured by the overlap of recommended items for similar users in the simulation vs. similar users in the ideal non-filter bubble setting (a measure they call “change in Jaccard index”).
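As a rough illustration of the overlap measure, the sketch below computes the Jaccard index between two users' recommendation sets and the difference between a simulated and an ideal setting. The exact definition in the paper (for example, how user pairs are chosen and averaged) may differ; `change_in_jaccard` and its arguments are hypothetical names.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)

def change_in_jaccard(recs_u, recs_v, ideal_u, ideal_v):
    """Overlap of two similar users' recommendations in the simulation,
    minus their overlap in the ideal (unconfounded) setting.
    A positive value indicates more homogeneous recommendations."""
    return jaccard(recs_u, recs_v) - jaccard(ideal_u, ideal_v)
```

For example, if two similar users receive identical lists in the simulation but would share only half of their items in the ideal setting, the change is 0.5.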

There are different ways to address confounding. One way is inverse propensity score estimation (e.g., Yang et al. 2018, McInerney et al. 2018 at RecSys 2018). The best long paper award went to Stephen Bonner and Flavian Vasile for their impressive work “Causal Embedding for Recommendation”. They take a different approach using domain adaptation and show that a small amount of pure exploration data, in combination with a larger amount of confounded data, helps with de-confounding. The role domain adaptation plays in a factorization approach is to constrain the item vectors to be the same for both exploration and exploitation. The user vectors are allowed to vary across domains but are regularized to be similar. (There are alternative hierarchical parameter assumptions, such as allowing the item vectors to vary and user vectors to either vary or stay the same.) They find improved offline training with their method compared to the usual confounded factorization on data adjusted offline to exhibit less skew toward popular and exposed items.
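For readers unfamiliar with inverse propensity scoring, here is a minimal sketch of the basic estimator for the value of a new policy from logged data. It is the textbook form, not the specific estimators of the cited papers; the log format and the `target_prob` callback are assumptions made for illustration.

```python
def ips_estimate(logs, target_prob):
    """Inverse propensity score estimate of a target policy's average reward.

    logs: iterable of (context, action, reward, propensity), where
          propensity is the probability with which the production
          (logging) recommender chose that action (assumed known and > 0).
    target_prob(context, action): probability that the new policy would
          choose the same action (hypothetical callback).
    """
    total, n = 0.0, 0
    for context, action, reward, propensity in logs:
        # Re-weight each logged reward by how much more (or less) likely
        # the new policy is to take the logged action.
        weight = target_prob(context, action) / propensity
        total += weight * reward
        n += 1
    return total / n
```

The re-weighting corrects for the fact that the logged actions were chosen by the production recommender, at the cost of high variance when propensities are small.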

On the theme of choosing the right user signal to optimize the model, there is increasing interest in optimizing longer term user satisfaction using reinforcement learning. An example this year of advancing beyond a stationary reward distribution for full-page optimization is a paper by Zhao et al. called “Deep Reinforcement Learning for Page-wise Recommendations”. They propose an approach using the Actor-Critic method where two networks are learnt. The actor network is trained to map states (i.e. user context) to a best action (i.e. whole page of recommendations). The critic network is trained to map both the state and best action to the long-term reward (i.e. reward for the whole page as measured by clicks and purchases in an online store). Training the two networks jointly avoids the problems associated with a large action space common to recommender systems.

Another typical assumption in recommenders is that the set of items being selected already exists. An intriguing approach by Vo and Soh, “Generation meets recommendation: proposing novel items for groups of users”, looks at how one could decide what items to generate next based on a set of historical user consumption data. The basic idea is to learn a joint embedding of users and items using consumption data, then to use submodular optimization to “cover” regions in the embedding space that would satisfy the most people. Once these regions have been discovered, the potential new items are mapped back to the original item space. They show that this method can be used to suggest new works of art and, more convincingly, movie plots. For example, they posit that a narrated movie with strong storytelling, social commentary, and dark humour would be a hit (think “American Beauty meets Pulp Fiction and One Flew Over the Cuckoo’s Nest”). The method is capable of suggesting new items that satisfy a pre-existing demand in new ways (though not new markets for original content).
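The “cover” step can be illustrated with the classic greedy algorithm for maximum coverage, a standard example of submodular optimization. This is a simplified sketch under our own assumptions (a fixed set of candidate points and a user counted as satisfied when a point lies within a given radius), not the objective used in the paper.

```python
import math

def greedy_cover(user_vecs, candidate_vecs, k, radius):
    """Greedily pick up to k candidate points in embedding space so that
    as many users as possible lie within `radius` of a chosen point.

    Returns (indices of chosen candidates, set of covered user indices).
    """
    cover_sets = [
        {i for i, u in enumerate(user_vecs) if math.dist(u, c) <= radius}
        for c in candidate_vecs
    ]
    chosen, covered = [], set()
    for _ in range(k):
        # Pick the candidate that covers the most not-yet-covered users.
        best = max(range(len(candidate_vecs)),
                   key=lambda j: len(cover_sets[j] - covered))
        if not cover_sets[best] - covered:
            break  # no candidate adds any new user
        chosen.append(best)
        covered |= cover_sets[best]
    return chosen, covered
```

Each chosen point would then need to be mapped back to an actual item description, which is the harder generative step the paper addresses.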

In one attempt to build a more comprehensive representation of users, “Calibrated Recommendations” by Harald Steck asks recommender systems to be “fair” towards a user’s various tastes. This work shows that, especially when training data is noisy and limited, optimizing for accuracy can produce a list of recommended items that is dominated by the user’s major taste, ignoring a subset of their tastes altogether. To prevent users from getting into a personal filter bubble, the paper suggests a “calibration” metric that compares the distribution of categories/genres in a user’s consumption history with the distribution in the list recommended to them. The problem is framed as a trade-off between accuracy and calibration, and the paper proposes a simple greedy algorithm for post-processing the recommendation list to increase calibration. The algorithm starts with an empty list and iteratively appends one item at a time, optimizing for the trade-off between calibration and accuracy using a pre-set parameter.
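The greedy re-ranking can be sketched as follows. This is a simplified illustration rather than the paper's exact formulation: we assume each item has a single genre, measure miscalibration with a smoothed KL divergence, and use a hypothetical weight `lam` for the pre-set trade-off parameter.

```python
import math

def kl_calibration(p, q, alpha=0.01):
    """Smoothed KL divergence between the user's genre distribution p and
    the recommended list's genre distribution q (0 means calibrated)."""
    total = 0.0
    for genre, p_g in p.items():
        if p_g > 0:
            # Mix a little of p into q so the log never divides by zero.
            q_g = (1 - alpha) * q.get(genre, 0.0) + alpha * p_g
            total += p_g * math.log(p_g / q_g)
    return total

def genre_distribution(items, genre_of):
    """Fraction of the given items that belongs to each genre."""
    counts = {}
    for item in items:
        counts[genre_of[item]] = counts.get(genre_of[item], 0) + 1
    return {g: c / len(items) for g, c in counts.items()}

def calibrated_rerank(scores, genre_of, p, n, lam=0.5):
    """Start from an empty list and append, one at a time, the item that
    maximizes (1 - lam) * total score - lam * miscalibration."""
    chosen, remaining = [], set(scores)
    while remaining and len(chosen) < n:
        def objective(item):
            candidate = chosen + [item]
            relevance = sum(scores[i] for i in candidate)
            q = genre_distribution(candidate, genre_of)
            return (1 - lam) * relevance - lam * kl_calibration(p, q)
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=objective)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

With `lam=0` this reduces to ranking by score alone; larger values pull the list's genre mix towards the user's history.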

Yet another interesting work aimed at better understanding users was presented by Zhao et al. “Interpreting User Inaction in Recommender Systems” explores what most recommender systems ignore or treat as negative feedback: users’ inaction towards a recommended item. Inspired by Decision Field Theory, the researchers designed a survey exploring seven reasons behind users’ inaction. They studied which reasons still make an item a good candidate for future recommendation and which ones mean the recommender system should discard it. Here is the conclusion:

“Would Not Enjoy < Watched < Not Noticed < Not Now or Others Better < Explore Later or Decided To Watch”

This means that items ignored for the reason “Explore Later or Decided To Watch” are the best candidates for later recommendation, while items ignored for the reason “Would Not Enjoy” should be discarded. Classifiers trained to predict inaction reasons achieved better-than-chance performance. However, the authors suggest that sensors such as eye-tracking equipment could improve the detection of inaction reasons.

Workshops

RecSys Challenge 2018 Workshop

The RecSys Challenge Workshop featured oral presentations from 16 of the top performing systems submitted to the RecSys Challenge 2018 on the task of Automatic Playlist Continuation. Overall participation was high (1791 registrants, 117 active teams, and 1467 submissions). On the surface, a majority of systems, including the winning entry, took a similar ensemble approach including a high-recall candidate generation phase(s), followed by a high-precision ranking phase, with some special techniques applied to handle the “cold start” use case (title-only playlists). However, when examined in more detail, the systems showed great variety and novelty in approaches to the task.

Approaches based on simple neighborhood methods were surprisingly effective on the task, for example here and here. Matrix factorization formed the basis of many other systems, using algorithms such as Weighted Regularized Matrix Factorization (WRMF) with playlist-track matrices in place of the traditional user-item matrix. Many other systems integrated features beyond basic playlist-track co-occurrence through the LightFM matrix factorization library, which is a form of Factorization Machine.
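To give a flavour of how simple a neighborhood method can be, here is a minimal sketch of playlist-based nearest-neighbor continuation. It does not reproduce any team's submission; the function name, the Jaccard similarity and the voting scheme are our own illustrative choices.

```python
from collections import Counter

def continue_playlist(seed_tracks, playlists, k_neighbors=2, n_tracks=3):
    """Recommend tracks for a partial playlist from its most similar playlists.

    seed_tracks: tracks already in the playlist.
    playlists:   list of track lists used as the training collection.
    """
    seed = set(seed_tracks)
    # Similarity between the seed and each training playlist (Jaccard).
    sims = []
    for idx, playlist in enumerate(playlists):
        tracks = set(playlist)
        overlap = len(seed & tracks)
        if overlap:
            sims.append((overlap / len(seed | tracks), idx))
    sims.sort(key=lambda pair: (-pair[0], pair[1]))

    # Each neighbor votes for its unseen tracks, weighted by similarity.
    votes = Counter()
    for sim, idx in sims[:k_neighbors]:
        for track in playlists[idx]:
            if track not in seed:
                votes[track] += sim
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [track for track, _ in ranked[:n_tracks]]
```

A sketch like this has no answer for title-only playlists, which is one reason the submitted systems added special handling for the cold-start case.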

Beyond the general multi-stage ensemble framework, there was a great diversity of features and techniques presented. There was a graph-based random-walk approach, an IR-based query-expansion approach, and a sub-profile-aware diversification approach. Deep learning was used in a variety of ways as well, from autoencoders that predict playlist contents, to character-level CNNs on playlist titles, to recurrent networks that model sequences of tracks.

Entries to the Creative Track integrated data from other sources, including audio features, and lyrics, in an ensemble method for playlist prediction. One of the unique features of this dataset and challenge is the importance of the playlist title: titles can indicate the intent of a playlist (for example “Music from Woodstock” or “Awesome Cover Songs”), and can have a big impact on the types of songs that fit in a playlist. While several entries to the Main Track took creative approaches to learn from the playlist titles contained in the Million Playlist Dataset, we can imagine further gains from integrating approaches and external datasets from the Natural Language Processing (NLP) communities.

All accepted papers, along with code, and slides (where available) are posted on the Workshop website. A detailed summary of the Challenge outcomes, approaches, and findings can be found in this preprint whitepaper.

REVEAL: Offline Evaluation for Recommender Systems

The REVEAL workshop was about offline evaluation of recommender systems, with themes around inverse propensity scoring, reinforcement learning, and evaluation testbeds. Here we highlight a small selection of papers; you can find many other excellent papers on the workshop website. Nikos Vlassis et al. present a generalization of inverse propensity score estimators in their paper “On the Design of Estimators for Off-Policy Evaluation” and show how one can recover various existing off-policy evaluators, such as doubly robust and control variates, as special cases. Tao Ye and Mohit Singh of Pandora presented their contextual bandit approach using LinUCB for the home page of Pandora and showed how it responds to breaking events in music (e.g. an artist passing away). Finally, Minmin Chen of Google Brain discussed their methods for long-term optimization of user satisfaction using simulations, combining off-policy evaluation with the REINFORCE algorithm so that a new policy can be trained offline using randomized online data.
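As a point of reference for the doubly robust evaluator mentioned above, here is a minimal sketch of its textbook form, which combines a reward model with an importance-weighted correction. It is not the generalized estimator from the paper; the log format and the `target_prob` and `reward_model` callbacks are assumptions made for illustration.

```python
def doubly_robust_estimate(logs, actions, target_prob, reward_model):
    """Doubly robust off-policy estimate of a target policy's average reward.

    logs: iterable of (context, action, reward, propensity) collected by
          the logging policy (propensity assumed known and > 0).
    actions: all actions the target policy can choose from.
    target_prob(context, action): target policy's action probability.
    reward_model(context, action): predicted reward (hypothetical model).
    """
    total, n = 0.0, 0
    for context, action, reward, propensity in logs:
        # Model-based estimate of the target policy's reward in this context.
        baseline = sum(target_prob(context, a) * reward_model(context, a)
                       for a in actions)
        # Importance-weighted correction for the model's error on the
        # action that was actually logged.
        weight = target_prob(context, action) / propensity
        total += baseline + weight * (reward - reward_model(context, action))
        n += 1
    return total / n
```

With a reward model that always predicts zero, this reduces to plain inverse propensity scoring; with a perfect reward model, the correction term vanishes.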

Spotify was a diamond sponsor of RecSys 2018.