Over the holidays I really dug into
Agnostic Active Learning without Constraints by Beygelzimer et al., and I've developed an intuition about this approach that I feel like sharing.
The nice thing about this approach is that it works with any classifier which can be said to be doing empirical risk minimization: linear predictors, neural networks, decision trees, etc. The consistency guarantees even hold for multiclass 0-1 loss (although the label complexity guarantees do not); and in fact due to the importance-weighting the consistency guarantees hold even if you don't exactly compute the sampling probabilities in the manner described (for example, Vowpal Wabbit computes an approximation). It is therefore an eminently practical technique.
Optimal Sequential Label Censorship
Imagine we have some function $f: \mathcal{X} \times \mathcal{Y} \to [-1, 1]$ whose expectation $E_D [f (X, Y)]$ we want to estimate; we don't know $D$, but we can sample from it. Think of $X$ as features, $Y$ as labels, and $f$ as something telling us how good a hypothesis is. So we grab $n$ samples $Z_k = (X_k, Y_k)$ and compute \[
\hat f_p (Z_{1:n}) = \frac{1}{n} \sum_{k=1}^n f (X_k, Y_k),
\] i.e., the empirical sample mean which converges to the true mean with high probability, e.g., via
Hoeffding's inequality, since $f$ is bounded. That's great, but it takes $n$ labels and we want to use fewer. The above estimator is analogous to a passive learner, and by using fewer labels we get an active learner. If we can get close to $\hat f_p$ with high probability while using fewer labels, then by the triangle inequality we'll be close to $E_D[f (X, Y)]$ with high probability.
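To make this concrete, here's a minimal sketch of the passive estimator on a toy distribution; the distribution, the 10% label noise, and the particular $f$ (the 0-1 error of a fixed hypothesis) are all illustrative assumptions, not anything from the paper.
```python
import numpy as np

rng = np.random.default_rng(0)

def sample_D(n):
    """Toy stand-in for D: uniform features, noisy threshold labels."""
    X = rng.uniform(size=n)
    Y = (X > 0.5).astype(int) ^ (rng.uniform(size=n) < 0.1)  # 10% label noise
    return X, Y

def f(x, y):
    """Any bounded function of (X, Y) works; here, the 0-1 error of a fixed hypothesis."""
    return float((x > 0.4) != y)

n = 10_000
X, Y = sample_D(n)
f_hat_p = np.mean([f(x, y) for x, y in zip(X, Y)])  # passive estimate of E_D[f(X, Y)]
print(f_hat_p)
```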
Let $Q_k \in \{ 0, 1 \}$ be an indicator variable which says whether or not we censor a particular data point from the set. We're going to be shown the $n$ samples that the passive learner utilized above one at a time: after seeing the $k^\mathrm{th}$ features $X_k$, we're going to look at all the previous $X_{1:k-1}$ and also any previous $Y_i$ such that $Q_i = 1$. Using all that information we're going to choose a $P_k$. Once we have $P_k$ we'll sample $Q_k$ by flipping a coin with bias $P_k$. If $Q_k = 1$ then we use $Y_k$; otherwise it is not revealed. Note that looking at the features is free, so we can always do that on the current instance, but looking at a label has a cost. Our active estimator is \[
\hat f_a (Z_{1:n}) = \frac{1}{n} \sum_{k=1}^n \frac{Q_k}{P_k} f (X_k, Y_k).
\] We have to scale the values we use by $1 / P_k$ to keep the estimator unbiased. Hopefully it is intuitive that, if the $P_k$ are very small, $\hat f_a$ is more likely to differ from $\hat f_p$ by a large amount, because whether or not we censor the data point makes a large difference in the estimate (of course, proving this is a cool trick, because the $Q_k$ are not i.i.d.). However, there is one important caveat: if $f (X_k, Y_k) = 0$, it doesn't matter whether we censor the data point, and it doesn't matter what $P_k$ is, because the estimate is unchanged.
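Here's a quick numeric check of the unbiasedness claim, again with made-up ingredients: the distribution, $f$, and the query rule below are all just for illustration.
```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

X = rng.uniform(size=n)
Y = (X > 0.5).astype(int)
f_vals = ((X > 0.4).astype(int) != Y).astype(float)  # f(X_k, Y_k), bounded in [0, 1]

# Any P_k in (0, 1] chosen from the past and X_k alone will do; this
# hypothetical rule queries points near the decision boundary more often.
P = np.clip(1.0 - 2.0 * np.abs(X - 0.5), 0.1, 1.0)
Q = rng.uniform(size=n) < P                # flip a coin with bias P_k

f_hat_p = f_vals.mean()                    # passive estimate
f_hat_a = (Q / P * f_vals).mean()          # active estimate, rescaled by 1 / P_k
print(f_hat_p, f_hat_a, Q.mean())          # close estimates; last is fraction queried
```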
Exploiting that loophole is key, because it turns out the function whose expectation we want to estimate is the instantaneous regret \[
1_{h_n (X_k) \neq Y_k} - 1_{h^* (X_k) \neq Y_k},
\] between the current empirical risk minimizer \[
\begin{aligned}
h_n &= \mathop{\operatorname{arg\,min\;}}_{h \in H} \hat r_a (Z_{1:n}; h) \\
&= \mathop{\operatorname{arg\,min\;}}_{h \in H} \frac{1}{n} \sum_{k=1}^n \frac{Q_k}{P_k} 1_{h (X_k) \neq Y_k},
\end{aligned}
\] and the true optimal hypothesis $h^* = \operatorname{arg\,min}_{h \in H} E_D[ 1_{h (X) \neq Y} ]$. If our error in estimating this expected regret is small, then our empirical minimizer will be almost as good as the true minimizer.
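Here's what importance-weighted empirical risk minimization looks like over a toy finite class of thresholds; the class, the noise rate, and the constant query probability are assumptions for illustration, not the paper's choices.
```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
X = rng.uniform(size=n)
Y = ((X > 0.5) ^ (rng.uniform(size=n) < 0.1)).astype(int)  # noisy threshold labels

thresholds = np.linspace(0, 1, 101)      # finite class H: h_t(x) = 1[x > t]

P = np.full(n, 0.5)                      # placeholder query probabilities
Q = rng.uniform(size=n) < P
w = Q / P                                # importance weights; 0 when censored

# Importance-weighted empirical risk for every h in H, then the minimizer.
risks = [(w * ((X > t).astype(int) != Y)).mean() for t in thresholds]
h_n = thresholds[int(np.argmin(risks))]
print(h_n)                               # near the true threshold 0.5
```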
Note this function has a zero on any input where the empirical minimizer and the true minimizer agree. So ideally we would have a scheme such that
- whenever $h_k (X_k) \neq h^* (X_k)$, $P_k$ is ``large'', and
- whenever $h_k (X_k) = h^* (X_k)$, $P_k$ is ``small''.
The first condition ensures our estimator is accurate and the second condition ensures we are censoring unnecessary data points.
(Understand this before continuing)
The Sampling Scheme
Suppose we have consumed $k - 1$ samples, censoring some of the labels according to some scheme as yet undefined, and we are now looking at the next set of features $X_k$. The current empirical risk minimizer is $h_k$. Clearly the true risk minimizer $h^*$ either does or does not agree with $h_k$ on $X_k$. Suppose the latter: intuitively, since empirical averages are converging to their true means, we would expect \[
\begin{aligned}
h_k (X_k) \neq h^* (X_k) &\implies h^* \in \{ h \in H | h (X_k) \neq h_k (X_k) \} \\
&\mathop{\implies}_{\mathrm{Pr} > 1 - \delta} \hat r_a (Z_{1:k-1}; h^\prime_k) - \hat r_a (Z_{1:k-1}; h_k) < O (\frac{1}{\sqrt{k}} \log \frac{1}{\delta}), \\
h^\prime_k &= \mathop{\operatorname{arg\,min\;}}_{h \in H | h (X_k) \neq h_k (X_k)} \hat r_a (Z_{1:k-1}; h),
\end{aligned}
\] i.e., the best hypothesis $h^\prime_k$ that disagrees with $h_k$ on $X_k$ will have an empirical risk estimate that is almost as good as $h_k$ with high probability. This is because
- $h^*$ is a candidate for $h^\prime_k$, so $h^\prime_k$ has an empirical risk which is at most the empirical risk of $h^*$, and
- the empirical risk of $h_k$ is converging to the empirical risk of $h^*$ with high probability.
Conversely, if the empirical risk of $h^\prime_k$ is much worse than the empirical risk of $h_k$, it indicates $h_k (X_k) = h^* (X_k)$ with high probability. Recall points where $h_k (X_k) = h^* (X_k)$ are precisely the points we can safely censor, so the basic idea is to censor more aggressively when $h^\prime_k$ is sufficiently empirically worse than $h_k$. Of course, the exact formula for $P_k$ depends upon the details of the deviation bound, and working those out is the impressive part, but as intuition suggests $P_k$ is monotonically decreasing in $\hat r_a (Z_{1:k-1}; h^\prime_k) - \hat r_a (Z_{1:k-1}; h_k)$.
(Understand this before continuing)
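Here's one rule with that qualitative shape, purely illustrative: the constant $c$, the deviation-bound scale, and the quadratic decay below are placeholders, not the actual formula of Beygelzimer et al. (which falls out of their deviation bound).
```python
import numpy as np

def query_probability(gap, k, c=1.0):
    """Illustrative P_k: monotonically decreasing in the empirical risk gap.

    `gap` is hat r_a(h'_k) - hat r_a(h_k) >= 0. When the gap is within the
    deviation-bound scale we can't rule out h_k(X_k) != h*(X_k), so query
    with certainty; when the gap is convincing, query rarely.
    """
    dev = c * np.sqrt(np.log(k + 1) / (k + 1))  # deviation-bound scale (assumed)
    if gap <= dev:
        return 1.0
    return min(1.0, (dev / gap) ** 2)
```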
Now this scheme has no ``false negatives'', in the sense that whenever $h_k (X_k) \neq h^* (X_k)$, Beygelzimer et al. arrange for $P_k \geq 1/2$ with high probability. So we are including the ``important points'' from an estimation perspective, and our active learner will have an error bound comparable to a passive learner on the same set of data, except that the active learner might use significantly fewer labels.
Label Complexity
On the surface there are two ways for the active learner to frequently query labels.
- (``Many important points''): If the empirical minimizer $h_k$ frequently disagrees with $h^*$, the active learner will frequently query labels, because there are many non-zeroes in the expected regret function for the empirical minimizer.
- (``Many false positives''): If the empirical risk difference between $h^\prime_k$ and $h_k$ is small even when $h_k (X_k) = h^* (X_k)$, the active learner will fail to detect that a point can be safely censored.
Upon some reflection these can be seen to be two symptoms of the same underlying cause: the existence of many near-optimal hypotheses that frequently disagree with the optimal hypothesis.
To explore this insight further [1], consider that the probability that we utilize the $k^\mathrm{th}$ data point is some function $g$ of $k$ and the empirical risk difference between the empirical risk minimizer $h_k$ and the empirical disagreeing risk minimizer $h^\prime_k$, \[
\begin{aligned}
P_k &= g \bigl(\hat r_a (Z_{1:k-1}; h^\prime_k) - \hat r_a (Z_{1:k-1}; h_k), k \bigr).
\end{aligned}
\] If we can bound $P_k$, then by the linearity of expectation we can bound the expected number of samples that we will use by iteration $k$ by the sum of $P_k$.
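For instance, under a hypothetical per-round bound $P_k \leq \min (1, \sqrt{\log k / k})$ (an assumed stand-in, not the paper's bound), the expected label count is a straightforward sum:
```python
import numpy as np

def expected_label_bound(p_bound, n):
    """Linearity of expectation: E[labels by round n] <= sum of per-round bounds."""
    return sum(p_bound(k) for k in range(1, n + 1))

print(expected_label_bound(lambda k: min(1.0, np.sqrt(np.log(k + 1) / k)), 10_000))
```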
If we knew the cumulative distribution $F_k$ of $\hat r_a (Z_{1:k-1}; h^\prime_k) - \hat r_a (Z_{1:k-1}; h_k)$, then we could say \[
P_k = \int_0^1 d\gamma\; \frac{d F_k (\gamma)}{d\gamma} g (\gamma, k).
\] Alas this cumulative distribution $F_k$ is a bit difficult to pin down directly. It is more convenient to talk about the true regret between a hypothesis and the true optimum, and leverage
the disagreement coefficient. Therefore the strategy will be to lower bound the first argument to $g$ as a function of $\mathrm{err} (\tilde h) - \mathrm{err} (h^*)$ for a suitably chosen $\tilde h$, and leverage the fact that $g$ is decreasing in its first argument to get an upper bound on $P_k$.
If $h_k (X_k) = h^* (X_k)$, then the empirical optimality of $h_k$ allows us to lower bound the first argument to $g$ with $\hat r_a (Z_{1:k-1}; h^\prime_k) - \hat r_a (Z_{1:k-1}; h^*)$, after which we can subtract something $O (\frac{1}{\sqrt{k}} \log \frac{1}{\delta})$ from the empirical risk difference to get a high probability lower bound based upon the true risk difference $\mathrm{err} (h^\prime_k) - \mathrm{err} (h^*)$, and therefore an upper bound on $g$ since $g$ decreases in both arguments. If $h^\prime_k (X_k) = h^* (X_k)$ the reasoning is more complicated but analogous, and the net result is that by choosing \[
\tilde h_k =
\begin{cases}
h_k &\mathrm{if}\; h_k (X_k) \neq h^* (X_k) \\
h^\prime_k &\mathrm{if}\; h^\prime_k (X_k) \neq h^* (X_k)
\end{cases},
\] then $P_k$ can be bounded by \[
P_k \leq \tilde g \bigl(\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*), k \bigr),
\] for a slightly different $\tilde g$ which is also decreasing in both arguments.
Now the cumulative distribution we care about is $\mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma)$, and the upper bound looks like \[
\begin{aligned}
P_k &\leq \int_0^1 d\gamma\; \frac{d \mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma)}{d\gamma} \tilde g (\gamma, k) \\
&\leq \tilde g (1, k) - \int_0^1 d\gamma\; \mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma) \frac{d \tilde g (\gamma, k)}{d \gamma} \\
&= \tilde g (1, k) + \int_0^1 d\gamma\; \mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma) \left| \frac{d \tilde g (\gamma, k)}{d \gamma} \right|.
\end{aligned}
\] The second line is integration by parts (!!), discarding a boundary term at $\gamma = 0$, and the third line uses the fact that $\tilde g$ is decreasing in its first argument.
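A tiny numeric sanity check of that manipulation, with a toy decreasing $\tilde g$ and a toy cumulative distribution, neither of which is the paper's actual choice; with $F (0) = 0$ the discarded boundary term vanishes and the two sides agree exactly.
```python
import numpy as np

gamma = np.linspace(0.0, 1.0, 200_001)
dx = gamma[1] - gamma[0]
g = 1.0 / (1.0 + 5.0 * gamma)   # stand-in for gamma -> g~(gamma, k): decreasing
F = gamma ** 2                  # stand-in CDF with F(0) = 0, F(1) = 1

lhs = np.sum(np.gradient(F, dx) * g) * dx                  # int F'(y) g~(y) dy
rhs = g[-1] + np.sum(F * np.abs(np.gradient(g, dx))) * dx  # g~(1) + int F(y) |g~'(y)| dy
print(lhs, rhs)                 # agree up to discretization error
```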
The next step is to bound $\mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma)$. If $\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma$, then in the worst case $\tilde h_k$ could be correct on all cases where $h^*$ makes an error, make another $\mathrm{err} (h^*) + \gamma$ errors besides, and still have a regret of $\gamma$, i.e., \[
\mathrm{Pr} (\tilde h_k (X_k) \neq h^* (X_k)) \leq 2 \mathrm{err} (h^*) + \gamma.
\] However, for our $\tilde h_k$ the disagreement $\tilde h_k (X_k) \neq h^* (X_k)$ actually did happen (go back and look at the definition!). At this point we can leverage the disagreement coefficient $\theta$, which bounds the probability of drawing an input on which some hypothesis sufficiently close to $h^*$ disagrees with it, \[
\mathrm{Pr} (X \in \{ x \mid \exists h \:\mathrm{s.t.}\; h (x) \neq h^* (x) \land \mathrm{Pr}_{X^\prime} (h (X^\prime) \neq h^* (X^\prime)) \leq r \}) \leq \theta r.
\] Therefore, \[
\mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma) \leq \theta (2 \mathrm{err} (h^*) + \gamma),
\] which substituting into the bound gives \[
P_k \leq \tilde g (1, k) + 2 \theta \; \mathrm{err} (h^*) \left( \int_0^1 d\gamma\; \left| \frac{d \tilde g (\gamma, k)}{d \gamma} \right| \right) + \theta \int_0^1 d\gamma\; \gamma \left| \frac{d \tilde g (\gamma, k)}{d \gamma} \right|.
\] Now it's completely unclear from my exposition, but it turns out for the $\tilde g$ utilized, \[
\begin{aligned}
\left( \int_0^1 d\gamma\; \left| \frac{d \tilde g (\gamma, k)}{d \gamma} \right| \right) &\leq 1 + O \left( \sqrt{\frac{\log k}{k - 1}} \right), \\
\int_0^1 d\gamma\; \gamma \left| \frac{d \tilde g (\gamma, k)}{d \gamma} \right| &\leq O \left( \frac{\log^2 k}{k - 1} \right).
\end{aligned}
\] So good news and bad news:
- Bad news (non-realizable): if $\mathrm{err} (h^*) > 0$, in the worst case we will equilibrate at a sampling rate proportional to $\mathrm{err} (h^*)$. While disappointing, this limitation appears fundamental.
- Good news: if $\mathrm{err} (h^*) = 0$, we utilize roughly a $\sqrt{n / \log n}$ factor fewer labels than the passive learner, i.e., about $\sqrt{n \log n}$ labels out of $n$ (see the numeric sketch below). Not an exponential speedup, but better than a sharp stick in the eye.
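A quick numeric illustration of that rate, assuming the per-round bound really does decay like $\sqrt{\log k / (k - 1)}$ with constants suppressed, so only the order of growth is meaningful:
```python
import numpy as np

# Sum the per-round query-probability bounds and compare against sqrt(n log n).
for n in [10**3, 10**4, 10**5, 10**6]:
    ks = np.arange(2, n + 1)
    labels = 1 + np.sum(np.sqrt(np.log(ks) / (ks - 1)))      # round 1 always queries
    print(n, int(labels), int(np.sqrt(n * np.log(n))))       # same order of growth
```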
Even better label complexity bounds are possible if further assumptions about the structure of the hypothesis space are made.
[1] Past this point the explanation gets a bit less ``intuitive''. I take this as a sign that I don't understand it as well. I'll try my best to maintain intelligibility.