In a previous post I talked about the
Ad Selection Problem, which I now prefer to call cost-sensitive best m (CSBM), since actual ad selection is vastly more complicated. CSBM is akin to cost-sensitive multiclass classification (CSMC) where
m unique classes have to be chosen and the goal is to minimize the total cost. In that previous post I reduced CSBM to CSMC, but the reduction was a bit tortured: first, the costs all had to be bounded above by a constant; second, the algorithm could possibly generate duplicates anyway, so a randomization step was included to enforce feasibility; and third, the analysis was cluttered with the implications of the randomization step.
Since then I introduced
constrained CSMC, which is like vanilla CSMC but with some of the classes forbidden as part of the instance specification. Constrained CSMC was designed with the goal of simplifying the CSBM reduction, but it does slightly more. It actually allows me to define a reduction from average constrained CSBM to average constrained CSMC, and since CSBM is a special case of average constrained CSBM, this accomplishes my original goal plus a bonus.
The average constrained CSBM definition is as follows. There is a distribution
D = D_x \times D_{\omega|x} \times D_{c|\omega,x}, where
c: K \to \mathbf{R} takes values in the extended reals
\mathbf{R} = \mathbb{R} \cup \{ \infty \}, and the components of
c which are
\infty-valued for a particular instance are revealed as part of the problem instance via
\omega \in \mathcal{P} (K) (i.e.,
\omega is a subset of
K). Allowed outputs in response to a problem instance are subsets of
K of size
m, denoted
S_m = \{ S | S \subseteq K, |S| = m \}. The regret of a particular classifier
h: X \times \mathcal{P} (K) \to S_m is given by
r_{csbm} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h (x, \omega)} c (k) \right] - \min_{h \in S_m}\, E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h} c (k) \right] \right]. Note when
|K \setminus \omega| < m, any strategy achieves zero regret (via
\infty cost); therefore the ``interesting'' parts of the problem space are when
|K \setminus \omega| \geq m.
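To make the objective concrete, here is a minimal Python sketch (all names made up) of the regret-minimizing output for a fixed (x, \omega) when the conditional expected costs are known: it is simply the m cheapest classes among K \setminus \omega, provided |K \setminus \omega| \geq m.

```python
# Minimal sketch (made-up names): given the conditional expected costs
# E[c(k) | x, omega], the regret-minimizing size-m output is the m cheapest
# classes among K \ omega, provided |K \ omega| >= m.
def best_m_set(expected_cost, omega, m):
    """expected_cost: dict class -> E[c(k) | x, omega]; omega: forbidden classes."""
    allowed = [k for k in expected_cost if k not in omega]
    if len(allowed) < m:
        # every size-m output has infinite expected cost, hence zero regret
        return set(list(expected_cost)[:m])
    return set(sorted(allowed, key=lambda k: expected_cost[k])[:m])

# toy usage: with 'b' forbidden, the best 2-set is {'a', 'c'}
print(best_m_set({'a': 1.0, 'b': 0.2, 'c': 0.7, 'd': 3.0}, omega={'b'}, m=2))
```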
The reduction takes an average constrained CSBM problem and converts it into a set of average constrained CSMC problems. The average constrained CSMC definition is as follows. There is a distribution
D = D_x \times D_{\omega|x} \times D_{c|\omega,x}, where
c: K \to \mathbf{R} takes values in the extended reals
\mathbf{R} = \mathbb{R} \cup \{ \infty \}, and the components of
c which are
\infty-valued for a particular instance are revealed as part of the problem instance via
\omega \in \mathcal{P} (K) (i.e.,
\omega is a subset of
K). The regret of a particular classifier
h: X \times \mathcal{P} (K) \to K is given by
r_{av} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\, E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \right]. Average constrained CSMC can be attacked using variants of reductions designed for unconstrained CSMC, such as
argmax regression or the
filter tree. These reductions have the property that they always achieve finite regret, i.e., they choose a feasible class whenever possible. In this context, it means the subproblems will never create duplicates.
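As a rough illustration of the constrained variant of argmax regression (the filter tree needs more machinery), the idea is to regress the per-class cost and take the argmin only over classes not in \omega, so the prediction is feasible whenever a feasible class exists. The regressor interface below is an assumption, not a prescription.

```python
# Sketch of a constrained argmax-regression style CSMC prediction.  `regressor`
# is any object with predict(x, k) -> estimated E[c(k) | x]; the interface is
# an assumption for illustration only.
def constrained_argmax_regression(regressor, x, omega, classes):
    """Return the feasible class with smallest predicted cost, if one exists."""
    feasible = [k for k in classes if k not in omega]
    candidates = feasible if feasible else list(classes)  # nothing is feasible: any choice
    return min(candidates, key=lambda k: regressor.predict(x, k))
```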
The reduction works as follows: first the lowest cost choice is chosen, then its cost is adjusted to
\infty, and the process is repeated until a set of size
m has been achieved. It is basically the same reduction as presented in a
previous post, but reducing to average constrained CSMC instead of unconstrained CSMC leads to some advantages: costs need not be bounded above and feasibility is naturally enforced.
Input: Class labels
K, (maximum) size of set to select
m \leq |K| / 2.
Input: Average constrained CSMC classifier
\mbox{Learn}.
Data: Training data set
S.
Result: Trained classifiers
\{\Psi_n | n \in [1, m] \}.
- Define \gamma_0 (\cdot, \cdot) = \emptyset.
- For each n from 1 to m:
- S_n = \emptyset.
- For each example (x, \omega, c) \in S such that |K \setminus \omega| \geq m:
- Let \gamma_{n-1} (x, \omega) be the predicted best set from the previous iteration.
- For each class k:
- If k \in \gamma_{n-1} (x, \omega), c (n, k) = \infty;
- else c (n, k) = c (k).
- S_n \leftarrow S_n \cup \left\{ \bigl( x, \omega \cup \gamma_{n-1} (x, \omega), c (n, \cdot) \bigr) \right\}.
- Let \Psi_n = \mbox{Learn} (S_n).
- Let \gamma_n (x, \omega) = \gamma_{n-1} (x, \omega) \cup \Psi_n \bigl( x, \omega \cup \gamma_{n-1} (x, \omega) \bigr).
- Return \{ \Psi_n | n \in [1, m] \}.
Comment: If
m > |K|/2, negate all finite costs and choose complement of size
|K| - m.
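Here is a sketch of Set Select Train in Python, under an assumed interface where \mbox{Learn} consumes (x, \omega, c) triples and returns a classifier \psi (x, \omega) mapping to a class; the point is the bookkeeping of the duplicate penalty and the augmented \omega.

```python
# Sketch of Set Select Train under an assumed interface: `learn(examples)` is an
# average constrained CSMC learner taking (x, omega, cost_dict) triples and
# returning a classifier psi(x, omega) -> class.  INF marks forbidden classes.
INF = float('inf')

def set_select_train(S, K, m, learn):
    """S: list of (x, omega, cost_dict) triples; K: set of classes; returns [psi_1, ..., psi_m]."""
    psis = []

    def gamma(x, omega, n):
        # gamma_n(x, omega): classes chosen by psi_1, ..., psi_n applied greedily
        chosen = set()
        for psi in psis[:n]:
            chosen.add(psi(x, omega | chosen))
        return chosen

    for n in range(1, m + 1):
        S_n = []
        for (x, omega, c) in S:
            if len(K - omega) < m:
                continue  # every size-m output has zero regret; skip such examples
            prev = gamma(x, omega, n - 1)
            c_n = {k: (INF if k in prev else c[k]) for k in K}  # duplicate penalty
            S_n.append((x, omega | prev, c_n))
        psis.append(learn(S_n))
    return psis
```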
Data: Class labels
K, number of positions to populate
l \leq m \leq |K|/2.
Data: Instance feature realization
(x, \omega).
Data: Trained classifiers
\{\Psi_n | n \in [1, m] \}.
Result: Set
h^\Psi: X \times \mathcal{P} (K) \to S_l.
- \gamma_0^\Psi (x, \omega) = \emptyset.
- For n from 1 to l:
- \gamma_n^\Psi (x, \omega) = \gamma_{n-1}^\Psi (x, \omega) \cup \Psi_n (x, \omega \cup \gamma_{n-1}^\Psi (x, \omega)).
- If |\gamma_l^\Psi (x, \omega)| = l, h^\Psi (x, \omega) = \gamma_l^\Psi (x, \omega);
- else set h^\Psi (x, \omega) to an arbitrary element of S_l.
Comment: If
l \geq m > |K|/2, negate all finite costs, and choose complement of size
|K| - l.
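And a matching sketch of Set Select Test under the same assumed \psi (x, \omega) interface; the arbitrary fallback when fewer than l distinct classes come back mirrors the convention used in the analysis below.

```python
# Sketch of Set Select Test with the same assumed psi(x, omega) -> class interface.
def set_select_test(psis, x, omega, K, l):
    """Apply psi_1, ..., psi_l greedily; return a set of size l."""
    gamma = set()
    for psi in psis[:l]:
        gamma.add(psi(x, omega | gamma))
    if len(gamma) == l:
        return gamma
    # a duplicate was produced (infinite-regret case): return an arbitrary size-l set
    return set(list(K)[:l])
```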
The Set Select Train algorithm ignores training data where
|K \setminus \omega| < m, but for such an input any strategy achieves infinite cost and zero regret, so learning is pointless. Similarly, the Set Select Test algorithm is not defined when
|K \setminus \omega| < l \leq m, but for such an input any strategy achieves infinite cost and zero regret, so for the purposes of subsequent analysis I'll suppose that we pick an arbitrary element in
S_l.
My goal is to bound the average constrained CSBM regret
r_{csbm} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h (x, \omega)} c (k) \right] - \min_{h \in S_m}\, E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h} c (k) \right] \right] in terms of the average constrained CSMC regret on the induced subproblems. Once again I'll leverage a trick from the filter tree derivation and collapse the multiple subproblems into a single subproblem by defining an induced distribution. Let
D be the distribution of average constrained CSBM instances
(x, \omega, c). Define the induced distribution
D^\prime (\Psi, l) where
l \leq m of average constrained CSMC instances
(x^\prime, \omega^\prime, c^\prime) as follows:
- Draw (x, \omega, c) from D.
- Draw n uniform on [1, l].
- Let x^\prime = (x, n).
- Let \omega^\prime = \omega \cup \gamma_{n-1} (x, \omega).
- For each class k:
- If k \in \gamma_{n-1} (x, \omega), c^\prime (k) = \infty;
- else c^\prime (k) = c (k).
- Create cost-sensitive example (x^\prime, \omega^\prime, c^\prime).
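Spelled out as a sampler (with an assumed draw_csbm routine and the pair (x, n) standing in for x^\prime), the construction looks like this:

```python
# Sketch: sample one induced CSMC example from D'(Psi, l).  `draw_csbm` (assumed)
# returns a CSBM instance (x, omega, c); psis are the trained sub-classifiers.
import random

INF = float('inf')

def draw_induced_example(draw_csbm, psis, K, l):
    x, omega, c = draw_csbm()           # draw (x, omega, c) from D
    n = random.randint(1, l)            # n uniform on [1, l]
    gamma = set()                       # gamma_{n-1}(x, omega)
    for psi in psis[:n - 1]:
        gamma.add(psi(x, omega | gamma))
    x_prime = (x, n)
    omega_prime = omega | gamma
    c_prime = {k: (INF if k in gamma else c[k]) for k in K}
    return x_prime, omega_prime, c_prime
```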
Note
D^\prime (\Psi, l) depends upon the classifier via the duplicate penalty, but it is well defined for any fixed classifier. Ultimately I'd like to relate the average constrained CSBM regret
r_{csbm} (h^\Psi) to the induced cost-sensitive regret
\begin{aligned} q (\Psi, l) &= E_{(x^\prime, \omega^\prime) \sim D^\prime_{x^\prime, \omega^\prime} (\Psi, l) } \left[ E_{c^\prime \sim D^\prime_{c^\prime|\omega^\prime,x^\prime} (\Psi, l)} \left[ c^\prime \bigl( \Psi (x^\prime, \omega^\prime) \bigr) \right] - \min_{k \in K}\, E_{c^\prime \sim D^\prime_{c^\prime|\omega^\prime,x^\prime} (\Psi, l)} \left[ c^\prime (k) \right] \right] \\ &= \frac{1}{l} \sum_{n=1}^l q_n (\Psi), \end{aligned} where
\begin{aligned} q_n (\Psi) &= E_{(x, \omega) \sim D_{x,\omega}} \left[ q_n (\Psi | x, \omega) \right], \\ q_n (\Psi | x, \omega) &= E_{c \sim D_{c|\omega,x}} \left[ \tilde c_n \bigl( \Psi_n (x, \omega \cup \gamma_{n-1} (x, \omega)) \bigr) \right] - \min_{k \in K}\, E_{c \sim D_{c|\omega,x}} \left[ \tilde c_n (k) \right], \\ \tilde c_n (k) &= \begin{cases} \infty & \mbox{if } k \in \gamma_{n-1} (x, \omega); \\ c (k) & \mbox{otherwise}. \end{cases} \end{aligned}
For all average constrained CSBM distributions
D and average constrained CSMC classifiers
\Psi,
r_{csbm} (h^\Psi) \leq l\, q (\Psi, l). Proof: See
appendix.
Some of the remarks from the previous version of this reduction still apply, and others do not.
The reduction still seems inefficient when comparing reduction to regression directly (
\sqrt{m} \sqrt{|K|} \sqrt{\epsilon_{L^2}}) versus reduction to regression via CSMC (
m \sqrt{|K|} \sqrt{\epsilon_{L^2}}). This suggests there is a way to reduce this problem which only leverages
\sqrt{m} CSMC subproblems. One possible source of inefficiency: the reduction is retrieving the elements in order, whereas the objective function is indifferent to order.
The regret bound indicates the following property: once I have trained to select sets of size
m, I can get a regret bound for selecting sets of size
l for any
l \leq m. This suggests a variant with
m = |K| could be used to reduce minimax constrained CSMC to average constrained CSMC. I'll explore that in a future blog post.
Appendix
This is the proof of the regret bound.
If
\Psi achieves infinite regret on the induced subproblem, the bound holds trivially. Thus consider a
\Psi that achieves finite regret.
If
|K \setminus \omega| < l, then
the conditional regret is zero for any choice in
S_l, and the bound conditionally holds trivially. Thus consider
|K \setminus \omega| \geq l: since
\Psi achieves finite regret no duplicates are generated from any sub-classifier and
h^\Psi (x, \omega) = \gamma^\Psi_l (x, \omega).
Consider a fixed
(x, \omega) with
|K \setminus \omega| \geq l. It is convenient to talk about
r_{csbm} (h^\Psi | x, \omega, n) = E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in \gamma^\Psi_n (x, \omega)} c (k) \right] - \min_{h \in S_n}\, E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h} c (k) \right], the conditional regret on this instance at the
n^\mathrm{th} step in Set Select Test. Let
h^* (x, \omega, n) = \underset{h \in S_n}{\operatorname{arg\,min\,}} E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h} c (k) \right] be any minimizer of the second term (which is unique up to ties); note any
h^* (x, \omega, n) will select
the n classes with smallest conditional expected cost.
The proof proceeds by demonstrating the property
r_{csbm} (h^\Psi | x, \omega, n) \leq \sum_{r=1}^n q_r (\Psi | x, \omega). The property holds with equality for
n = 1. For
n > 1 note
\begin{aligned} r_{csbm} (h^\Psi | x, \omega, n) - r_{csbm} (h^\Psi | x, \omega, n - 1) &= E_{c \sim D_{c|\omega,x}} \left[ c (\Psi_n (x, \omega \cup \gamma^\Psi_{n-1} (x, \omega))) \right] \\ &\quad - \min_{k \in K \setminus h^* (x, \omega, n - 1)} E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \\ &\leq E_{c \sim D_{c|\omega,x}} \left[ c (\Psi_n (x, \omega \cup \gamma^\Psi_{n-1} (x, \omega))) \right] \\ &\quad - \min_{k \in K \setminus \gamma^\Psi_{n-1} (x, \omega)} E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \\ &\leq E_{c \sim D_{c|\omega,x}} \left[ \tilde c_n (\Psi_n (x, \omega \cup \gamma^\Psi_{n-1} (x, \omega))) \right] \\ &\quad - \min_{k \in K \setminus \gamma^\Psi_{n-1} (x, \omega)} E_{c \sim D_{c|\omega,x}} \left[ \tilde c_n (k) \right] \\ &= q_n (\Psi | x, \omega), \end{aligned} where the first inequality is due to the optimality of
h^* (x, \omega, n - 1) and the second inequality is because
\tilde c_n (k) \geq c (k) with equality if
k \not \in \gamma^\Psi_{n-1} (x, \omega). Summing the telescoping series establishes
r_{csbm} (h^\Psi | x, \omega) = r_{csbm} (h^\Psi | x, \omega, l) \leq \sum_{r=1}^l q_r (\Psi | x, \omega). Taking the expectation with respect to
D_{x, \omega} and using q (\Psi, l) = \frac{1}{l} \sum_{n=1}^l q_n (\Psi) gives r_{csbm} (h^\Psi) \leq l\, q (\Psi, l), which completes the proof.
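As a quick numerical sanity check of the conditional inequality r_{csbm} (h^\Psi | x, \omega) \leq \sum_{r=1}^l q_r (\Psi | x, \omega), here is a toy Python script with made-up deterministic costs and a deliberately imperfect per-step classifier; it merely exercises the definitions on one instance.

```python
# Toy check of r_csbm(h^Psi | x, omega) <= sum_n q_n(Psi | x, omega) on a single
# instance with deterministic (hence trivially "expected") costs.  All names and
# costs are made up for illustration.
INF = float('inf')

K = {'a', 'b', 'c', 'd', 'e'}
omega = {'e'}                                      # 'e' is forbidden (infinite cost)
cost = {'a': 0.1, 'b': 0.4, 'c': 0.5, 'd': 2.0, 'e': INF}
l = 3

def psi(n, x, w):
    """A deliberately imperfect per-step classifier: step 2 picks the worst allowed class."""
    allowed = sorted((k for k in K if k not in w), key=cost.get)
    return allowed[-1] if n == 2 else allowed[0]

gamma, q_sum = set(), 0.0
for n in range(1, l + 1):
    c_tilde = {k: (INF if k in gamma else cost[k]) for k in K}  # duplicate penalty
    pick = psi(n, None, omega | gamma)
    q_sum += c_tilde[pick] - min(c_tilde.values())              # q_n(Psi | x, omega)
    gamma.add(pick)

# conditional CSBM regret: greedy set cost minus cost of the cheapest size-l set
r_csbm = sum(cost[k] for k in gamma) - sum(sorted(cost[k] for k in K - omega)[:l])
print(r_csbm, q_sum, r_csbm <= q_sum)              # the bound holds on this toy instance
```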