A general agnostic active learning algorithm

Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. Twenty-First Annual Conference on Neural Information Processing Systems, 2007.

Other material: poster (pdf.gz), spotlight slide (pdf)

Abstract

We present a simple, agnostic active learning algorithm that works for any hypothesis class of bounded VC dimension, and any data distribution. Our algorithm extends a scheme of Cohn, Atlas, and Ladner to the agnostic setting, by (1) reformulating it using a reduction to supervised learning and (2) showing how to apply generalization bounds even for the non-i.i.d. samples that result from selective sampling. We provide a general characterization of the label complexity of our algorithm. This quantity is never more than the usual PAC sample complexity of supervised learning, and is exponentially smaller for some hypothesis classes and distributions. We also demonstrate improvements experimentally.

Code

MATLAB routines to simulate our algorithm on the learning problem defined by a given problem matrix and label vector. [tar.gz]

Video demonstrations

The following videos illustrate the locations of label requests throughout four different executions of our algorithm. In the first three executions, the data distribution is uniform over the unit interval [0,1], so we show a histogram of the label requests. In the fourth execution, the data distribution is uniform over the unit box [0,1]^2, so we show a heat map of the label requests. In the fifth execution, the data distribution is uniform over the unit circle; we show both the label request locations as well as a histogram of the inner product of label request locations with the target weight vector.

  1. Hypothesis class: thresholds on the line (1) [avi, mp4]
    Target is h(x) = sgn(x-0.5). Noise model: label x with h(x) with probability 0.8, and -h(x) otherwise (random misclassification). Total of 1061 labels requested after seeing 10000 data.
  2. Hypothesis class: thresholds on the line (2) [avi, mp4]
    Target is h(x) = sgn(x-0.5). Noise model: label x with +1 with probability (x-0.5)/0.8 + 0.5 (clipped at 0 and 1), and -1 otherwise (more noise near the boundary). Total of 3691 labels requested after seeing 10000 data.
  3. Hypothesis class: intervals on the line [avi, mp4]
    Target is h(x) = {if x in [0.4,0.6] then +1 else -1}. Noise model: label x with h(x) with probability 0.9, and -h(x) otherwise (random misclassification). Total of 2141 labels requested after seeing 10000 data.
  4. Hypothesis class: boxes in the plane [avi, mp4]
    Target is h(x) = {if x in [0.15,0.85]^2 then +1 else -1}. Noise model: label x with h(x) with probability 0.9, and -h(x) otherwise (random misclassification). Total of 509 labels requested after seeing 1000 data.
  5. Hypothesis class: linear separators* in the plane [avi, mp4]
    Target is h(x) = sgn(0.528 * x_1 + 0.8493 * x_2). Noise model: label x with h(x) with probability 0.9 and -h(x) otherwise (random misclassification). Total of 207 labels requestsed after seeing 10000 data. (*) The hypothesis class used was just the set of halfspaces normal to the data.