- Assume nothing about positional dependence, negative externalities, etc. In this case one can treat entire sets as actions and use the offset tree directly (a sketch of the resulting action space follows the list). The disadvantages of this approach are that generalization is restricted to combinations that have historical support, and that training-time computation scales with the number of combinations. However, if the number of combinations is small, this is perhaps the best way to go.
- Assume a greedy approach to constructing result vectors is optimal, but otherwise do not assume anything about positional dependence (or negative externalities?). In this case a sequence of offset trees arranged in greedy fashion reduces CSBM-PF to CSMC-PF (also sketched after the list). Generalization is restricted to combinations in which the individual action-position pairs have historical support, rather than to entire combinations, and training-time computation scales as $m |A|$ rather than ${|A| \choose m}$.
- Ideally, I could find a necessary condition for the greedy approach to be optimal and use that in the learning algorithm, but I haven't found one.
- Assume swap supremacy (sufficient for the greedy approach to be optimal) and preservation of relative order by position (orthogonal to greediness, but necessary to make what follows work). Again, a sequence of offset trees arranged in greedy fashion reduces CSBM-PF to CSMC-PF, but now data from earlier positions is reused at later positions. Generalization is restricted to combinations in which each individual action has historical support at or before the position where it is used. I had hoped these assumptions would also let me use data from later positions at earlier ones, but further reflection suggests not. For instance, consider the following expected reward distribution as $\epsilon \to 0$, \[
\begin{array}{|c|c|c|}
\hline
\mbox{Action} & r (a, 1) & r (a, 2) \\ \hline
1 & 1 + 1 / \epsilon & 1 + \epsilon \\
2 & 1 & 1 \\ \hline
\end{array}
\] which obeys swap supremacy and preserves relative order by position, yet an action with small regret in the second position (action 2 is only $\epsilon$ worse there) has large regret in the first position ($1 / \epsilon$ worse). Basically, the assumptions I have so far are not strong enough to let me go backwards. This is sad because exploration at the earlier positions is the most expensive, in the sense that they are the higher-value spots.
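To make the scaling concern in the first option above concrete, here is a minimal sketch of the sets-as-actions construction. The function name is mine and purely illustrative; the point is just that the induced action space an offset tree would have to cover has ${|A| \choose m}$ elements.

```python
from itertools import combinations
from math import comb

def meta_actions(actions, m):
    """Treat each size-m result set as a single action for the offset tree.

    The induced action space has C(|A|, m) elements, which is why this
    route only makes sense when the number of combinations is small.
    """
    return [tuple(c) for c in combinations(actions, m)]

# e.g. |A| = 20 and m = 3 already gives 1140 meta-actions
acts = meta_actions(range(20), 3)
assert len(acts) == comb(20, 3) == 1140
```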
Now to just make that third approach work.
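As a starting point, here is a minimal sketch of the greedy per-position construction from the second bullet. Everything below is my own scaffolding rather than the actual offset tree: `PerPositionLearner` is a crude importance-weighted stand-in for a CSMC-with-partial-feedback learner, and the exploration log format `(x, position, action, reward, prob)` is invented for illustration.

```python
from collections import defaultdict

class PerPositionLearner:
    """Stand-in for an offset tree: a crude importance-weighted value
    estimate per action that ignores the context x entirely."""
    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def fit(self, records):
        # records: iterable of (x, action, reward, prob) exploration tuples
        for x, a, r, p in records:
            self.total[a] += r / p   # inverse-propensity weighting
            self.count[a] += 1

    def predict(self, x, allowed):
        # choose the allowed action with the largest estimated value
        return max(allowed,
                   key=lambda a: self.total[a] / max(self.count[a], 1))

def train_greedy(log, m):
    """One learner per position; position k only sees data logged at
    position k, so training touches m |A| (action, position) pairs."""
    learners = []
    for pos in range(1, m + 1):
        learner = PerPositionLearner()
        learner.fit((x, a, r, p)
                    for (x, k, a, r, p) in log if k == pos)
        learners.append(learner)
    return learners

def predict_greedy(learners, x, actions):
    """Fill the result vector greedily, never repeating an action."""
    chosen, remaining = [], set(actions)
    for learner in learners:
        a = learner.predict(x, remaining)
        chosen.append(a)
        remaining.discard(a)
    return chosen
```

Under the third bullet's assumptions the data filter would relax to `k <= pos`, i.e., each position also reuses data logged at earlier positions; the counterexample above is what rules out relaxing it in the other direction.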