# "Off-Policy Evaluation for Slate Recommendation" by Swaminathan et al. 2017

- Swaminathan, Adith, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. "Off-Policy Evaluation for Slate Recommendation." In *Advances in Neural Information Processing Systems*, pp. 3635-3645. 2017.

## Summary

Slate recommendation concerns a set of recommendations that are exposed to a user at once. For example, when you do a search, you get a page of ranked results, and when you visit the home page of a music, movie, or book app, you usually see a whole page of recommendations. This paper is about how to do off-policy evaluation and optimization using data collected by a known logging policy.

They set up the problem as a contextual bandit where each action is a slate of sub-actions:

$$s = (s_1, \dots, s_l)$$

where \(s\) refers to the slate and \(s_j\) to the \(j^\mathrm{th}\) item in the slate.

The key idea is that they estimate the reward for a particular context \(x\) and slate \(s\) using a linear regression:

$$\hat{V}(x, s) = \mathbb{1}_s^\top \phi_{x}$$

where \(\mathbb{1}_s\) is an indicator vector of size \(lm\) telling us which of the \(m\) possible items appear in each of the \(l\) slots, and \(\phi_x\) represents the contribution of each item at each slot to the overall reward. The assumption is therefore that, in any given context, the contribution of an item to the overall reward is linear.

The attraction of this approach is that it does not try to model the rewards and so does not suffer model-mismatch bias. Instead, they impose a linearity assumption on how the individual item rewards combine into the overall reward for a slate (and make the typical importance-sampling assumption that the logging policy \(\mu\) is non-zero wherever the target policy \(\pi\) is non-zero, for all contexts). They then replace the expected reward with a single-sample Monte Carlo estimate from the logged data to arrive at the PI estimator,

$$\hat{V}_\mathrm{PI}(\pi) = \frac{1}{n} \sum_{i=1}^{n} r_i \, q_\pi(x_i)^\top \Gamma_{x_i}^{\dagger} \mathbb{1}_{s_i},$$

where \(q_\pi(x) = \mathbb{E}_{s \sim \pi}[\mathbb{1}_s]\), \(\Gamma_x = \mathbb{E}_{s \sim \mu}[\mathbb{1}_s \mathbb{1}_s^\top]\) is the uncentered covariance matrix of the slate indicators under the logging policy, and the weights with the smallest norm can be estimated using the closed-form solution for linear regression, \(\phi_x = \Gamma_x^\dagger \, \mathbb{E}_\mu[r \, \mathbb{1}_s]\). One thing that was not clear from the paper was how to estimate the uncentered covariance matrix \(\Gamma_x\), but I guess that can be done in the same way, using a single-sample empirical estimate.
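The whole pipeline is short enough to simulate. Here is a toy sketch (all sizes, policies, and values made up, not from the paper's experiments): a uniform logging policy over distinct-item slates, rewards that are exactly linear in the slate indicator, and the PI estimate of a deterministic target policy, which should land close to that policy's true value:

```python
import numpy as np

# Toy sizes (made up): m candidate items, l slots per slate, n logged impressions.
m, l, n = 4, 2, 20000
rng = np.random.default_rng(0)

def indicator(slate):
    """Length l*m indicator vector: block j marks which item sits in slot j."""
    v = np.zeros(l * m)
    for j, item in enumerate(slate):
        v[j * m + item] = 1.0
    return v

# Logging policy mu: uniform over slates of distinct items.
slates = [tuple(rng.choice(m, size=l, replace=False)) for _ in range(n)]

# Rewards that really are linear in the indicator (the paper's assumption).
phi_true = rng.normal(size=l * m)
rewards = [indicator(s) @ phi_true for s in slates]

# Uncentered covariance Gamma = E_mu[1_s 1_s^T], here by Monte Carlo
# (with a known logging policy it could also be computed exactly).
Gamma = np.mean([np.outer(indicator(s), indicator(s)) for s in slates], axis=0)
Gamma_pinv = np.linalg.pinv(Gamma)

# PI estimate for a deterministic target policy that always shows slate (0, 1),
# so q_pi(x) = E_{s~pi}[1_s] is just that slate's indicator.
q_pi = indicator((0, 1))
V_pi_hat = np.mean([r * (q_pi @ Gamma_pinv @ indicator(s))
                    for r, s in zip(rewards, slates)])
print(V_pi_hat, q_pi @ phi_true)  # estimate vs. true value of the target policy
```

Nothing about the target slate `(0, 1)` needs to have been logged often; the pseudoinverse pools information across all logged slates that share its slot-item entries.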

This is a pretty neat idea, and they go on to show that the PI estimator is unbiased and that its variance grows with the number of items and slots in the slate rather than with the exponentially larger number of possible slates.

## Discussion

I like the approach of avoiding bias in the estimator by making assumptions about how the reward decomposes and how it relates to the context. The single-sample estimate looks pretty crude on the face of it, but because it is unbiased, the errors cancel out over a large number of logged impressions. The same cannot be said for the model-based approach (what they call the *direct method*, or DM). However, as they show in their experiments, DM does well for smaller sets of impressions (\(< 10^5\)), which makes sense because that is where the bias introduced by model assumptions can compensate for the variance of a small sample.
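For contrast, the direct method can be sketched just as compactly: fit a reward model on the logged data, then score slates drawn from the target policy with it. A minimal sketch with a plain least-squares model (all data made up; the paper's DM baselines used lasso regression and regression trees instead):

```python
import numpy as np

# Made-up logged data: m items, l slots, uniform logging over distinct-item slates.
m, l, n = 4, 2, 1000
rng = np.random.default_rng(1)

def indicator(slate):
    """Length l*m indicator vector: block j marks which item sits in slot j."""
    v = np.zeros(l * m)
    for j, item in enumerate(slate):
        v[j * m + item] = 1.0
    return v

slates = [tuple(rng.choice(m, size=l, replace=False)) for _ in range(n)]
X = np.stack([indicator(s) for s in slates])
phi_true = rng.normal(size=l * m)
y = X @ phi_true + 0.1 * rng.normal(size=n)  # noisy, but truly linear, rewards

# DM step 1: fit a reward model to the logged (slate, reward) pairs.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# DM step 2: score the target policy with the model; for a deterministic
# policy that always shows slate (0, 1) this is a single prediction.
v_dm = indicator((0, 1)) @ w_hat
print(v_dm)
```

Here the model family happens to contain the true reward function, so DM looks great; the bias the paper worries about appears exactly when that stops being true.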

While they show that the linear reward assumption admits common metrics like NDCG as special cases, I wonder how much interaction exists between items within a slate. An item that is a bad match for a context could tank the reward for the whole page, but it might also interact badly with the other items shown. Such interactions cannot be captured by the linear form, only indirectly through the context.
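The NDCG special case is easy to verify concretely: each slot contributes its item's relevance times a position discount, so the whole metric has the \(\mathbb{1}_s^\top \phi_x\) form with \(\phi_x\) built from relevances and discounts (the relevance values below are made up; NDCG just divides DCG by a per-context normalizing constant, which preserves linearity):

```python
import numpy as np

# Made-up relevance judgements for m = 3 items in some fixed context x.
rel = np.array([3.0, 1.0, 0.0])
m, l = 3, 2

def dcg(slate):
    """Standard DCG: discounted sum of the shown items' relevances."""
    return sum(rel[item] / np.log2(j + 2) for j, item in enumerate(slate))

def indicator(slate):
    """Length l*m indicator vector: block j marks which item sits in slot j."""
    v = np.zeros(l * m)
    for j, item in enumerate(slate):
        v[j * m + item] = 1.0
    return v

# The same metric as 1_s^T phi_x, with phi_x[j*m + a] = rel[a] / log2(j + 2).
phi_x = np.concatenate([rel / np.log2(j + 2) for j in range(l)])

for slate in [(0, 1), (1, 0), (2, 1)]:
    assert np.isclose(dcg(slate), indicator(slate) @ phi_x)
```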

I also wonder what the payoff would be for introducing better models in DM. Specifically, they seem to have considered only regression trees and lasso regression as their DM baselines. Model-mismatch bias can be addressed by better models, reducing the error in the DM estimate.