Machined Learnings: Dependent Reward Revelation: Part II

In my previous post I talked about price differentiation and how it motivates dependent reward revelation, which is a fancy way of saying which rewards are revealed depends upon the reward values. I have to admit I'm somewhat stuck about how to exploit this additional information.

I'll repeat the setup here:

World chooses $(x, \omega, r)$ from $D$ and reveals $(x, \omega)$.
Player chooses $a \in A$ via $p (a | x, \omega)$.
World chooses $\mathcal{A} \in \mathcal{P} (A)$ via $q (\mathcal{A} | x, \omega, r, a)$, where $a \in \mathcal{A}$ is required.
World reveals $\{ r (a) | a \in \mathcal{A} \}$.

The usual ``see what you did'' scenario is $q (\mathcal{A} | x, \omega, r, a) = 1_{\mathcal{A} = \{ a \}}$, and for price differentiation it is \[
q (\mathcal{A} | x, \omega, r, a) =
\begin{cases}
\{ a^\prime | a^\prime \leq a \} & \mbox{if } r (a) > 0; \\
\{ a^\prime | a^\prime \geq a \} & \mbox{if } r (a) = 0.
\end{cases}
\] Since $a \in \mathcal{A}$ is required, I can always throw away the additional information and convert any $q$ into $q = 1_{\mathcal{A} = \{ a \}}$, then use the offset tree. That seems wasteful, but at the moment I don't have another option for the price differentiation problem.

A Filter-Offset Style Update (Fail)

Consider a filter-offset tree style solution. Fix $(x, \omega, r)$, and consider a fixed internal node with inputs $\lambda \not \in \omega$ and $\phi \not \in \omega$. The expected importance weight of $\lambda$ would be \[
\begin{aligned}
w_{\lambda|r} &= E_{a \sim p} \biggl[ E_{\mathcal{A} \sim q|r,a} \biggl[ \alpha_{\lambda,\neg \phi} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} 1_{r (\lambda) \geq \frac{1}{2}} \left( r(\lambda) - \frac{1}{2} \right) \\
&\quad\quad\quad + \alpha_{\neg \lambda, \phi} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\phi) \leq \frac{1}{2}} \left( \frac{1}{2} - r(\phi) \right) \\
&\quad\quad\quad + \alpha_{\lambda, \phi} 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\lambda) > r (\phi)} \left( r (\lambda) - r (\phi) \right) \biggr] \biggr] \biggl/ \\
&E_{a \sim p} \left[ E_{\mathcal{A} \sim q|r,a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right].
\end{aligned}
\] Analogy with the filter-offset update suggests the choices \[
\begin{aligned}
\alpha_{\lambda, \neg \phi} &= (1 - \gamma) \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in A} 1_{\phi \not \in \mathcal{A}} \right] \right]}, \\
\alpha_{\neg \lambda, \phi} &= (1 - \gamma) \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \not \in A} 1_{\phi \in \mathcal{A}} \right] \right]}, \\
\alpha_{\lambda, \phi} &= \gamma \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in A} 1_{\phi \in \mathcal{A}} \right] \right]},
\end{aligned}
\] for some $\gamma \in [0, 1]$. Unfortunately in general these quantities cannot be computed since $r$ is only partially revealed per instance. For the price differentiation $q$, for instance, only when $a$ is the largest possible price and $r (a) > 0$, or when $a$ is the smallest possible price and $r (a) = 0$, can these quantities be computed.

My suspicion is that the only way to proceed with this filter-offset style update is if the set of rewards that $q$ depends upon is always revealed. So something like \[
q (\mathcal{A} | x, \omega, r, a) =
\begin{cases}
\{ \tilde a \} \cup \{ a^\prime | a^\prime \geq a \} & \mbox{if } r (\tilde a) > 0; \\
\{ a, \tilde a \} & \mbox{if } r (\tilde a) = 0; \\
\{ \tilde a \} \cup \{ a^\prime | a^\prime \leq a \} & \mbox{if } r (\tilde a) < 0, \end{cases} \] would work since $q$ only depends upon $r (\tilde a)$ which is always revealed, so the above expectations can always be computed. With such a cooperative $q$, the rest of the filter-offset tree crank can be turned and the weighting factors would be \[
\begin{aligned}
\alpha_{\lambda, \neg \phi} &= \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in A} 1_{\phi \not \in \mathcal{A}} \right] \right]}, \\
\alpha_{\neg \lambda, \phi} &= \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \not \in A} 1_{\phi \in \mathcal{A}} \right] \right]}, \\
\alpha_{\lambda, \phi} &= 1,
\end{aligned}
\] which is neat, but still leaves me wondering how to exploit the additional information available in the price differentiation problem.

Machined Learnings

Saturday, October 9, 2010

Dependent Reward Revelation: Part II

A Filter-Offset Style Update (Fail)

No comments:

Post a Comment