Passive Learning with a Drifting Distribution
I wasn't sufficiently familiar with passive learning with a drifting distribution, so I had to spend some time with one of the citations: On the Complexity of Learning from Drifting Distributions by Barve and Long.
In passive learning in the IID setting we can get arbitrarily good estimates of true expectations via empirical averages. This means that, for hypothesis sets of finite VC dimension, with high probability we can get arbitrarily close to optimal given enough data.
Once the world is allowed to start changing over time, we may or may not have enough information to track the changes. There are multiple modeling choices here to formally characterize how the world drifts, including infrequent large changes, and possibly large changes with a slowly changing direction and rate. Barve and Long study the case that the world changes slowly. In particular, the world generates a countable sequence of feature-label pairs, i.e., an element of $(X \times Y)^{\mathbb{N}}$. These samples are independent but not identically distributed, and $D_t$ denotes the distribution of $(x_t, y_t)$. The $D_t$ are constrained via the total variation distance \[
\sup_U |D_t (U) - D_{t+1} (U)| \leq \gamma,
\] where $U$ ranges over all measurable events, i.e., for anything to which you can assign a probability, consecutive distributions $D_t$ and $D_{t+1}$ will differ in the probability they assign by at most $\gamma$. In this scenario, if the world wants to change significantly it has to supply us with training data associated with intermediate states, so the learning algorithm can succeed. Barve and Long show a sufficient condition for agnostic learning is \[
\gamma = O \left( \frac{\epsilon^3}{d \ln (1 / \epsilon)} \right),
\] where $d$ is the VC dimension of the hypothesis set and $\epsilon$ is our target regret. This is achieved by doing ERM on a window of the most recent examples, where the window size is determined by $\epsilon$ and $\gamma$. Since regret at time $t$ is relative to $D_t$, this is a really cool result. Note the joint distribution of $X$ and $Y$ is allowed to change over time, so both the distribution of input features and the conditional label density (``concept'') are changing.
Unfortunately this result says that, for any fixed drift rate $\gamma > 0$, we cannot get arbitrarily accurate even with unlimited data. In hindsight this is not surprising, since we only use recent history for our ERM even with unlimited data. An accompanying necessary condition, without the $\ln (1 / \epsilon)$ term in the denominator, indicates this limitation is fundamental and not merely an artifact of the ``windowed ERM'' strategy.
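Here's a minimal sketch of the windowed-ERM strategy in Python. The finite hypothesis list and the hand-picked `window_size` are simplifications for illustration; in the paper the window size is chosen as a function of $\epsilon$ and $\gamma$, and the hypothesis class only needs finite VC dimension.

```python
from collections import deque

def windowed_erm_predictions(stream, hypotheses, window_size):
    """Windowed-ERM sketch: at each step, fit by empirical risk minimization
    over only the most recent `window_size` examples, then predict on the
    incoming point before observing its label."""
    window = deque(maxlen=window_size)  # only recent history is kept
    predictions = []
    for x, y in stream:
        if window:
            # ERM over the window: the hypothesis with the fewest recent mistakes.
            best = min(hypotheses,
                       key=lambda h: sum(h(xi) != yi for xi, yi in window))
        else:
            best = hypotheses[0]  # arbitrary choice before any data arrives
        predictions.append(best(x))
        window.append((x, y))  # the oldest example falls off automatically
    return predictions
```

For instance, `hypotheses` could be a list of threshold classifiers such as `lambda x, t=t: int(x > t)` for a grid of thresholds `t`.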
Back to Active Learning
Returning to Liu Yang's paper, the first observation is that the above assumption of ``slow drift'' is not powerful enough for active learning. In active learning, we want to throw away some (most!) of the labels, which is equivalent to speeding up the world (scaling $\gamma$). In particular, if we want to get invited to the VIP section of the Active Learning Club, we'd really like a sub-linear label complexity, which would be like passive learning with $\gamma \to \infty$, and we've seen above that means essentially no accuracy. (However, since linear label complexity is the worst-case lower bound for agnostic active learning in the IID case, the above ``slow drift'' condition might still be worth investigating.)
Yang instead assumes that the various $D_t$ are drawn from a set of distributions $\mathbb{D}$ that are all close to each other. This is formalized via the minimal $\epsilon$-cover $\mathbb{D}_{\epsilon}$ of $\mathbb{D}$, defined as the smallest subset of $\mathbb{D}$ such that every member of $\mathbb{D}$ is within $\epsilon$ of some element of the subset, as measured by the total variation distance. In particular Yang assumes $\forall \epsilon > 0, |\mathbb{D}_{\epsilon}| < \infty$, and gets more precise bounds by assuming $\forall \epsilon > 0, |\mathbb{D}_\epsilon| < c \cdot \epsilon^{-m}$ for some constants $c, m \in [0, \infty)$.
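To make the $\epsilon$-cover concrete, here's a tiny sketch over discrete distributions represented as probability vectors. The greedy construction is only illustrative: it produces some $\epsilon$-cover, not necessarily the minimal one that defines $\mathbb{D}_\epsilon$.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions
    given as probability vectors over the same finite support."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def greedy_epsilon_cover(distributions, epsilon):
    """Build an epsilon-cover greedily: every input distribution ends up
    within epsilon (in total variation) of some element of the cover.
    Note this is not guaranteed to be the minimal cover."""
    cover = []
    for d in distributions:
        if all(total_variation(d, c) > epsilon for c in cover):
            cover.append(d)
    return cover
```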
Essentially, instead of being allowed to change arbitrarily over time, Yang confines the world to bounce around inside a region of distribution space characterized by a finite set of prototype distributions. To achieve a certain level of accuracy, we need only consider the world shifting between the prototype distributions, which in turn suggests that under suitable conditions we will eventually have accumulated enough information that we can forgo looking at some of the labels.
Yang further restricts the conditional distribution of labels given features to be fixed, and only allows the distribution of features to change over time. This second assumption means that it is meaningful to talk about the realizable case, and in the realizable case an algorithm called CAL achieves a sub-linear mistake bound and label complexity in this kind of drifting environment, provided the (largest) disagreement coefficient is sufficiently small. The quick summary of CAL is that it only queries the label if, for each possible label, there is an empirically perfect hypothesis which predicts that label; otherwise it infers the unobserved label to be the one on which the remaining empirically perfect hypotheses agree. CAL uses the entire history, but leverages realizability to succeed; note the conditional distribution (and therefore realizability) is constant here.
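Here's a minimal sketch of the CAL query rule, assuming a finite hypothesis class, a realizable stream of unlabeled points, and a hypothetical `query_label` oracle; the actual analysis covers general VC classes, so this only illustrates the query-versus-infer decision.

```python
def cal(stream, hypotheses, query_label):
    """CAL-style selective sampling over a finite hypothesis class.
    Realizability is assumed, so the version space (the set of hypotheses
    consistent with every label seen or inferred so far) never empties."""
    version_space = list(hypotheses)
    num_queries = 0
    for t, x in enumerate(stream):
        labels = {h(x) for h in version_space}
        if len(labels) == 1:
            # Every empirically perfect hypothesis agrees: infer the label for free.
            y = labels.pop()
        else:
            # Disagreement region: pay for the label, then prune the version space.
            y = query_label(t, x)
            num_queries += 1
            version_space = [h for h in version_space if h(x) == y]
    return version_space, num_queries
```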
Extending the realizable analysis, Yang next considers the strictly benign noise case, defined as there being an optimal hypothesis which is always conditionally more likely to be correct than incorrect (for all distributions in $\mathbb{D}$). In this case an agnostic variant of CAL called ACAL achieves a sub-linear mistake bound and label complexity. Like CAL, ACAL will infer labels, but it does so via an error difference instead of demanding empirical perfection. In Yang's paper ACAL uses windowed data where periodically all history is discarded and the window size is increased geometrically, but this is not due to drift; rather it is due to the online setting with an unbounded stream, because standard specifications of ACAL assume a particular label budget. Therefore ACAL is effectively using ``all the data''.
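The restart schedule just described can be sketched as follows; the initial epoch length and growth factor are illustrative choices, not Yang's constants.

```python
def geometric_epochs(total_steps, first_length=1, growth=2):
    """Cut a stream of `total_steps` rounds into epochs whose lengths grow
    geometrically; at the start of each epoch all prior history is discarded.
    The initial length and growth factor here are illustrative choices."""
    epochs, start, length = [], 0, first_length
    while start < total_steps:
        end = min(start + length, total_steps)
        epochs.append((start, end))  # half-open interval [start, end)
        start, length = end, length * growth
    return epochs

# e.g. geometric_epochs(100) -> [(0, 1), (1, 3), (3, 7), (7, 15), (15, 31), (31, 63), (63, 100)]
```

Since each epoch is comparable in length to all preceding epochs combined, by the end of an epoch the learner has been fitting on a constant fraction of the entire history.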
Some Takeaways
This paper (and the discussed citation) helped me appreciate that active learning in changing environments will only be ``efficient'' (from a label complexity standpoint) if the environment is somehow globally constrained in how it can change. One research strategy, exemplified by this paper, is to take an existing active learning algorithm for the IID case and adapt it to some model of world drift. Of course I'd like to see AALwC manhandled in this fashion: easier said than done! A mashup of AALwC with Barve and Long would suggest running AALwC over the most recent window of data and not incrementing the label counter once the window is full. This would have linear label complexity, but sometimes constant factors matter (like on my Mechanical Turk invoices!). It's less clear to me what should happen if AALwC is to be used under the stricter Yang assumptions of world drift. The fact that ACAL essentially uses all the historical data suggests that AALwC might work under these conditions by merely slowing the schedule for decreasing the query probability.
A different research strategy would be to understand active learning in the adversarial setting and then apply different online-to-batch conversions to turn a retrospective guarantee into multiple prospective guarantees. That approach holds the promise of yielding a single algorithm that does as well as possible under a variety of different world drift scenarios. If somebody achieves that, they will be justifiably famous!