-
Notifications
You must be signed in to change notification settings - Fork 0
Aggregated measures in Active Learning
WebLabeler uses an Active Learning module that, given a dataset S = {x1,..., xN} he algorithm generates a sequence of samples for labelling,
z1, z2, ..., zM
We want to estimate the average of the samples in the training set:
mx = (x1 + ... + xN) / N
If samples zi were selected at random from S, we could take
Y = (z1 + ... + zN) / M
which is an unbiased estimate (E{Y} = mx).
If samples zi are selected with probabilities pi(xn), we can use the estimator
Y' = (z1/p1(z1) + ... + zN/pN(zN)) / MN
which is also unbiased, because E{z1/p1(z1)} = N mx and, thus, E{Y'} = mx.
Computing pn(zn) for the active learning algorithm used in WebLabeler is difficult, because the sample selected at any iteration depends of the history of the previous selections.
In general, if sample zi depends on z1,..., z{i-1}, we can take
Y'' = (z1/(N1·p1(z1)) + z2/(N1·N2·p2(z1,z2)) + ... + zN/(N1·N2·...·N_N·pN(z1,...,zN))) / M
where Nk is the size of the domain of pk(zk| z1,...,z{k-1}. In particular, is sampling is done with replacement, Nk = N^k and
Y'' = (z1/(N·p1(z1)) + z2/(N^2·p2(z1,z2)) + ... + zN/(N^N·pN(z1,...,zN))) / M
but, if sampling is done without replacement:
Y'' = (z1/(N·p1(z1)) + z2/(N(N-1)·p2(z1,z2)) + ... + zN/(N!·pN(z1,...,zN))) / M
This is because
E{zk / (N^k pk(z1,...,zk)) } = 1/(N1·...·Nk) sum_{n1} ... sum_{nk} x_{nk}/pk(x_n1,...,x_nk)·pk(x_n1,...,x_nk)
that is
E{zk / (N^k pk(z1,...,zk)) } = 1/(N1·...·Nk) sum_{nk=1}^N x_{nk} = mx
The advantage of this eq. is that the denominator of the estimator can be computed recursively: defining
w<sub>k</sub> = (N1·...·Nk) pk(z1,...,zk)
we can write
wk = Nk w_{k-1} pk(zk | z1,...,z_{k-1})
For sampling with replacement:
wk = N w_{k-1} pk(zk | z1,...,z_{k-1})
For sampling without replacement:
wk = (N-k) w_{k-1} pk(zk | z1,...,z_{k-1})
The conditional distribution is specific of the Active Learning algorithm, and it can be usually computed for any sequential active learning algorithm, and also for the 'tourney' algorithm implemented in WebLabeler.