Semi-Supervised Regression with Co-Training Zhi-Hua Zhou and Ming Li
National Laboratory for Novel Software Technology
Nanjing University, Nanjing 210093, China
{zhouzh, lim}@lamda.nju.edu.cn
Abstract
A prominent achievement in this area is the co-training
paradigm proposed by Blum and Mitchell [1998], which
In many practical machine learning and data min-
trains two classifiers separately on two sufficient and redun-
ing applications, unlabeled training examples are
dant views, i.e. two attribute sets each of which is sufficient
readily available but labeled ones are fairly expen-
for learning and conditionally independent to the other given
sive to obtain. Therefore, semi-supervised learn-
the class label, and uses the predictions of each classifier on
ing algorithms such as co-training have attracted
unlabeled examples to augment the training set of the other.
much attention. Previous research mainly focuses
Dasgupta et al. [2002] have shown that when the require-
on semi-supervised classification. In this paper, a
ment of sufficient and redundant views is met, the co-trained
co-training style semi-supervised regression algo-
classifiers could make few generalization errors by maxi-
rithm, i.e. COREG, is proposed. This algorithm
mizing their agreement over the unlabeled data. Unfortu-
uses two k-nearest neighbor regressors with differ-
nately, such a requirement can hardly be met in most sce-
ent distance metrics, each of which labels the unla-
narios. Goldman and Zhou [2000] proposed an algorithm
beled data for the other regressor where the label-
which does not exploit attribute partition. This algorithm re-
ing confidence is estimated through consulting the
quires using two different supervised learning algorithms that
influence of the labeling of unlabeled examples on
partition the instance space into a set of equivalence classes,
the labeled ones. Experiments show that COREG
and employs cross validation technique to determine how to
can effectively exploit unlabeled data to improve
label the unlabeled examples and how to produce the final
hypothesis. Although the requirement of sufficient and re-dundant views is quite strict, the co-training paradigm has al-ready been used in many domains such as statistical parsing
Introduction
and noun phrase identification [Hwa et al., 2003][Pierce and
In many practical machine learning and data mining appli-
Cardie, 2001][Sarkar, 2001][Steedman et al., 2003].
cations such as web user profile analysis, unlabeled training
It is noteworthy that previous research mainly focuses on
examples are readily available but labeled ones are fairly ex-
classification while regression remains almost untouched. In
pensive to obtain because they require human effort. There-
this paper, a co-training style semi-supervised regression al-
fore, semi-supervised learning methods that exploit unlabeled
examples in addition to labeled ones have attracted much at-
proposed. This algorithm employs two k-nearest neighbor
(kNN) regressors, each of which labels the unlabeled data
Many current semi-supervised learning methods use a gen-
for the other during the learning process. In order to choose
erative model for the classifier and employ Expectation-
appropriate unlabeled examples to label, COREG estimates
Maximization (EM) [Dempster et al., 1977] to model the
the labeling confidence through consulting the influence of
label estimation or parameter estimation process. For ex-
the labeling of unlabeled examples on the labeled examples.
ample, mixture of Gaussians [Shahshahani and Landgrebe,
The final prediction is made by averaging the regression es-
1994], mixture of experts [Miller and Uyar, 1997], and naive
timates generated by both regressors. Since COREG utilizes
Bayes [Nigam et al., 2000] have been respectively used as the
different distance metrics instead of requiring sufficient and
generative model, while EM is used to combine labeled and
redundant views, its applicability is broad. Moreover, experi-
unlabeled data for classification. There are also many other
mental results show that this algorithm can effectively exploit
methods such as using transductive inference for support vec-
unlabeled data to improve regression estimates.
tor machines to optimize performance on a specific test set
The rest of this paper is organized as follows. Section 2
[Joachims, 1999], constructing a graph on the examples such
proposes the COREG algorithm. Section 3 presents an anal-
that the minimum cut on the graph yields an optimal labeling
ysis on the algorithm. Section 4 reports on the experimental
of the unlabeled examples according to certain optimization
results. Finally, Section 5 concludes and raises several issues
functions [Blum and Chawla, 2001], etc.
and h2 can be diverse through instantiating them with differ-ent p values. Such a setting can also bring another profit, that
Let L = {(x1, y1), · · · , (x|L|, y|L|)} denote the labeled ex-
is, since it is usually difficult to decide which p value is better
ample set, where xi is the i-th instance described by d at-
for the concerned task, the functions of these regressors may
tributes, yi is its real-valued label, i.e. its expected real-
be somewhat complementary to be combined.
valued output, and |L| is the number of labeled examples;let U denote the unlabeled data set, where the instances are
also described by the d attributes, whose real-valued labels
M inkowskyp(xr, xs) =
are unknown, and |U | is the number of unlabeled examples.
Two regressors, i.e. h1 and h2, are generated from L, each
In order to choose appropriate unlabeled examples to la-
of which is then refined with the help of unlabeled exam-
bel, the labeling confidence should be estimated such that the
ples that are labeled by the latest version of the other re-
most confidently labeled example can be identified. In classi-
gressor. Here the kNN regressor [Dasarathy, 1991] is used
fication this is relatively straightforward because when mak-
as the base learner to instantiate h1 and h2, which labels a
ing classifications, many classifiers can also provide an esti-
new instance through averaging the real-valued labels of its
mated probability (or an approximation) for the classification,
k-nearest neighboring examples.
e.g. a Naive Bayes classifier returns the maximum posteriori
The use of kNN regressor as the base learner is due to the
hypothesis where the posterior probabilities can be used, a
following considerations. First, the regressors will be refined
BP neural network classifier returns thresholded classification
in each of many learning iterations. If neural networks or re-
where the real-valued outputs can be used, etc. Therefore, the
gression trees are used, then in each iteration the regressors
labeling confidence can be estimated through consulting the
have to be re-trained with the labeled examples in addition to
probabilities of the unlabeled examples being labeled to dif-
the newly labeled ones, the computational load of which will
ferent classes. For example, suppose the probability of the
be quite heavy. Since kNN is a lazy learning method which
instance a being classified to the classes c1 and c2 is 0.90
does not hold a separate training phase, the refinement of the
and 0.10, respectively, while that of the instance b is 0.60 and
kNN regressors can be efficiently realized. Second, in order
0.40, respectively. Then the instance a is more confident to
to choose appropriate unlabeled examples to label, the label-
ing confidence should be estimated. In COREG the estimation
Unfortunately, in regression there is no such estimated
utilizes the neighboring properties of the training examples,
probability that can be used directly. This is because in con-
which can be easily coupled with the kNN regressors.
trast to classification where the number of class labels to be
It is noteworthy that the initial regressors should be diverse
predicted is finite, the possible predictions in regression are
because if they are identical, then for either regressor, the
infinite. Therefore, a key of COREG is the mechanism for
unlabeled examples labeled by the other regressor may be
the same as these labeled by the regressor for itself. Thus,
Heuristically, the most confidently labeled example of a
the algorithm degenerates to self-training [Nigam and Ghani,
regressor should be with such a property, i.e. the error of
2000] with a single learner. In the standard setting of co-
the regressor on the labeled example set should decrease the
training, the use of sufficient and redundant views enables the
most if the most confidently labeled example is utilized. In
learners be different. Previous research has also shown that
other words, the most confidently labeled example should be
even when there is no natural attribute partitions, if there are
the one which makes the regressor most consistent with the
sufficient redundancy among the attributes then a fairly rea-
labeled example set. Thus, the mean squared error (MSE)
sonable attribute partition will enable co-training to exhibit
of the regressor on the labeled example set can be evaluated
advantages [Nigam and Ghani, 2000]. While in the extended
first. Then, the MSE of the regressor utilizing the information
co-training algorithm which does not require sufficient and
yu) can be evaluated on the labeled exam-
redundant views, the diversity among the learners is achieved
ple set, where xu is an unlabeled instance while ˆ
through using different learning algorithms [Goldman and
real-valued label generated by the original regressor. Let ∆u
Zhou, 2000]. Since COREG does not assume sufficient and
denote the result of subtracting the latter MSE from the for-
redundant views and different learning algorithms, the diver-
mer MSE. Note that the number of ∆u to be estimated equals
sity of the regressors should be sought from other channels.
to the number of unlabeled examples. Finally, (xu, ˆ
Here the diversity is achieved through utilizing different
sociated with the biggest positive ∆u can be regarded as the
distance metrics. In fact, a key of kNN learner is how to
determine the distances between different instances.
Since repeatedly measuring the MSE of the kNN regres-
Minkowsky distance shown in Eq. 1 is usually used for this
sor on the whole labeled example set in each iteration will be
purpose. Note that different concrete distance metrics can be
time-consuming, considering that kNN regressor mainly uti-
generated through setting different values to the distance or-
lizes local information, COREG employs an approximation.
der, p. Roughly speaking, the smaller the order, the more
That is, for each xu, COREG identifies its k-nearest neighbor-
robust the resulting distance metric to data variations; while
ing labeled examples and uses them to compute the MSE. In
the bigger the order, the more sensitive the resulting distance
detail, let Ω denote the set of k-nearest neighboring labeled
metric to data variations. Therefore, the vicinities identified
examples of xu, then the most confidently labeled example
for a given instance may be different using the Minkowsky
is identified through maximizing the value of ∆x in Eq. 2,
distance with different orders. Thus, the kNN regressors h1
where h denotes the original regressor while h denotes the
regression estimates. In order to simplify the discussion, here
Table 1: Pseudo-code describing the COREG algorithm
the effect of the pool U is not considered as in [Blum andMitchell, 1998]. That is, the unlabeled examples are assumed
as being picked from the unlabeled example set U directly.
INPUT: labeled example set L, unlabeled example set U ,
In each learning iteration of COREG, for each unlabeled ex-
ample xu, its k-nearest neighboring labeled examples are put
maximum number of learning iterations T ,
into the set Ω. As mentioned before, the newly labeled exam-
ple should make the regressor become more consistent with
the labeled data set. Therefore, a criterion shown in Eq. 3 can
Create pool U by randomly picking examples from U
nal regressor while h is the one refined with (xu, ˆ
1 ← kN N (L1, k, p1); h2 ← kN N (L2, k, p2)
u is positive, then utilizing (xu, ˆ
Repeat for T rounds: for j ∈ {1, 2} do i − h(xi))2 − |L|for each xu ∈ U do
Ω ← N eighbors(xu, k, Lj)
In the COREG algorithm, the unlabeled example which
j ← kN N (Lj ∪ {(xu, ˆ
fore, the question is, whether the unlabeled example chosen
i − hj (xi))2 − yi − hj (xi) 2)
according to the maximization of ∆x will result in a positive
end of for if there exists an ∆x > 0
yu) is among the k-nearest neigh-
bors of some examples in Ω, and is not among the k-nearest
neighbors of any other examples in L. In this case, it is obvi-
πj ← {(˜xj, ˜yj)}; U ← U − πj
yu) will only change the regression es-
else πj ← ∅
timates on the examples in Ω, therefore Eq. 3 becomes Eq. 4. end of for
Comparing Eqs. 2 with 4 it can be found that the maximiza-
L1 ← L1 ∪ π2; L2 ← L2 ∪ π1
tion of ∆x also results in the maximization of ∆
if neither of L1 and L2 changes then exit else
1 ← kN N (L1, k, p1); h2 ← kN N (L2, k, p2)
Replenish U by randomly picking examples from Uend of Repeat
yu) is not among the k-nearest
neighbors of any example in Ω. In this case, the value of ∆
UTPUT: regressor h∗(x) ← 1 (h
yu) won’t be chosen in COREG.
yu) is among the k-nearest neigh-
refined regressor which has utilized the information provided
bors of some examples in Ω as well as some examples
in L − Ω, and assume these examples in L − Ω are
(x1, y1), · · · , (xm, ym). Then Eq. 3 becomes Eq. 5. i − h(xi))2 − (yi − h (xi))2)
i − h(xi))2 − (yi − h (xi))2)+
the function kN N (Lj, k, pj) returns a kNN regressor on the
j , whose distance order is pi. The learn-
ing process stops when the maximum number of learning it-
will maximize the first sum term of Eq. 5,
erations, i.e. T , is reached, or there is no unlabeled exam-
ple which is capable of reducing the MSE of any of the re-
u be positive should also refer the
second sum term. Unfortunately, the value of this sum term
gressors on the labeled example set. According to Blum and
is difficult to be measured except that the neighboring rela-
Mitchell [1998]’s suggestion, a pool of unlabeled examples
tionships between all the labeled examples and (x
smaller than U is used. Note that in each iteration the unla-
evaluated. Therefore, there may exist cases where the unla-
beled example chosen by h1 won’t be chosen by h2, which
beled example chosen according to the maximization of ∆
is an extra mechanism for encouraging the diversity of the
u, which is the cost COREG takes for using
that can be more efficiently computed to approximate
ples they label for each other will still be different.
∆u. Nevertheless, experiments show that in most cases suchan approximation is effective. Analysis
It seems that using only one regressor to label the unlabeled
This section attempts to analyze whether the learning process
examples for itself might be feasible, where the unlabeled ex-
of COREG can use the unlabeled examples to improve the
amples can be chosen according to the maximization of ∆x .
While considering that the labeled example set usually con-
from the original ones. For example, on 3-d Mexican Hat two
tains noise, the use of two regressors can be helpful to reduce
new attributes, i.e. x3 and x4, are constructed from x1 + x2
and x1 − x2, and then a kNN regressor is built on x1 and x2
Let Λ denote the subset of noisy examples in L. For the
while the other is built on x3 and x4. In each iteration, each
unlabeled instance xu, either of the regressors h1 and h2 will
kNN regressor chooses the unlabeled example which maxi-
identify a set of k-nearest neighboring labeled examples for
mizes the value of ∆x in Eq. 2 to label for the other regres-
xu. Assume these sets are Ω1 and Ω2, respectively. Since h1
sor. The final prediction is made by averaging the regression
and h2 use different distance orders, Ω1 and Ω2 are usually
estimates of these two refined regressors. Besides, a kNN re-
different, and therefore Ω1 ∩ Λ and Ω2 ∩ Λ are also usually
gressor using only the labeled data is tested as a baseline for
different. Suppose xu is labeled by h1 and then (xu, h1(xu))
the comparison, which is denoted by LABELED.
is put into L1, where h1(xu) suffers from Ω1 ∩ Λ. For an-
All the kNN regressors used in SELF, ARTRE, and LA-
other unlabeled instance xv which is very close to xu, its
BELED employ 2nd-order Minkowski distance, and the kk-nearest neighbors identified by h1 will be very similar to
value is set to 3. The same pool, U , as that used by COREG is
Ω1 except that (xu, h1(xu)) has replaced a previous neigh-
used in each iteration of SELF and ARTRE, and the maximum
bor. Thus, h1(xv) will suffer from Ω1 ∩ Λ more seriously
number of iterations is also set to 100.
than h1(xu) does. While, if the instance xu is labeled by h2
One hundred runs of experiments are carried out on each
and (xu, h2(xu)) is put into L1, then h1(xv) will suffer from
data set. In each run, the performance of all the four algo-
Ω1 ∩ Λ only once, although xu is still very close to xv.
rithms, i.e. COREG, SELF, ARTRE, and LABELED, are eval-uated on randomly partitioned labeled/unlabeled/test splits. Experiments
The average MSE at each iteration is recorded. Note that
Experiments are performed on ten data sets listed in Table 2
the learning processes of the algorithms may stop before the
where “# attribute” denotes the number of input attributes.
maximum number of iterations is reached, and in that case,
These data sets have been used in [Zhou et al., 2002] where
the final MSE is used in computing the average MSE of the
the detailed descriptions of the data sets can be found. Note
that the input attributes as well as the real-valued labels have
The improvement on average MSE obtained by exploiting
been normalized to [0.0, 1.0].
unlabeled examples is tabulated in Table 3, which is com-puted by subtracting the final MSE from the initial MSE and
Table 3: Improvement on average mean squared error
For each data set, 25% data are kept as the test set, while
the remaining 75% data are partitioned into the labeled andunlabeled sets where 10% (of the 75%) data are used as la-
Table 3 shows that SELF and ARTRE improve the regres-
beled examples while the remaining 90% (of the 75%) data
sion estimates on only five data sets, while COREG improves
on eight data sets. Moreover, Table 3 discloses that the im-
In the experiments, the distance orders used by the two
provement of COREG is always bigger than that of SELF and
kNN regressors in COREG are set to 2 and 5, respectively,
ARTRE. These observations tell that COREG can effectively
the k value is set to 3, the maximum number of iterations T is
exploit unlabeled examples to improve regression estimates.
set to 100, and the pool U contains 100 unlabeled examples
For further studying the compared algorithms, the average
randomly picked from the unlabeled set in each iteration.
MSE of different algorithms at different iterations are plotted
A self-training style algorithm is tested for comparison,
in Fig 1, where the average MSE of the two kNN regressors
which is denoted by SELF. This algorithm uses a kNN re-
used in COREG are also depicted. Note that in each subfig-
gressor and in each iteration, it chooses the unlabeled exam-
ure, every curve contains 101 points corresponding to the 100
ple which maximizes the value of ∆x in Eq. 2 to label for
learning iterations in addition to the initial condition, where
itself. Moreover, a co-training style algorithm, denoted by
only 11 of them are explicitly depicted for better presentation.
ARTRE, is also tested. Since the experimental data sets are
Fig. 1 shows that COREG can exploit unlabeled examples
with no sufficient and redundant views, here an artificial re-
to improve the regression estimates on most data sets, except
dundant view is developed through deriving new attributes
that on Friedman #3 there is no improvement while on Plane
Figure 1: Comparisons on average mean squared error of different algorithms at different iterations
the performance is degenerated. While, SELF and ARTRE de-
the final regression estimates of COREG are significantly bet-
generate the regression estimates on five data sets, i.e. Fried-
ter than these of ARTRE on almost all the data sets except
man #1 to #3, Multi, and Plane. Moreover, the average MSE
on Friedman #1 where the latter is better. Furthermore, the
of the final prediction made by COREG is almost always the
final regression estimates of COREG are significantly better
best except on Friedman #1 where ARTRE is slightly better
than these of SELF and LABELED on almost all the data sets
and on Plane where LABELED is the best while all the semi-
except on Plane where LABELED is better and on Fried-
supervised learning algorithms degenerate the performance. man #3 where there is no significant difference. These re-
These observations disclose that COREG is apparently the
sults of t-tests confirm that COREG is the strongest among
the compared algorithms, which can effectively exploit unla-beled data to improve the regression estimates.
Pairwise two-tailed t-tests with 0.05 significance level re-
veal that the final regression estimates of COREG are signif-icantly better than its initial regression estimates on almost
Conclusion
all the data sets except that on Plane the latter is better while
This paper proposes a co-training style semi-supervised re-
on Friedman #3 there is no significant difference. Moreover,
gression algorithm COREG. This algorithm employs two k-
nearest neighbor regressors using different distance metrics.
[Goldman and Zhou, 2000] S. Goldman and Y. Zhou. En-
In each learning iteration, each regressor labels the unlabeled
hancing supervised learning with unlabeled data. In Pro-
example which can be most confidently labeled for the other
ceedings of the 17th International Conference on Machine
learner, where the labeling confidence is estimated through
Learning, pages 327–334, San Francisco, CA, 2000. Mor-
considering the consistency of the regressor with the labeled
example set. The final prediction is made by averaging the
[Hwa et al., 2003] R. Hwa, M. Osborne, A. Sarkar, and
predictions of both the refined kNN regressors. Experiments
M. Steedman. Corrected co-training for statistical parsers.
show that COREG can effectively exploit unlabeled data to
In Working Notes of the ICML’03 Workshop on the Contin-uum from Labeled to Unlabeled Data in Machine Learning
In contrast to standard co-training setting, COREG does not
and Data Mining, Washington, DC, 2003.
require sufficient and redundant views, which enables it havebroad applicability. However, this forces C
verse initial regressors with specific mechanisms. In this pa-
text classification using support vector machines. In Pro-
per the diversity is obtained by instantiating the Minkowski
ceedings of the 16th International Conference on Machine
distance with different distance orders. It is obvious that us-
Learning, pages 200–209, Bled, Slovenia, 1999. Morgan
ing completely different distance metrics may be more help-
ful. Moreover, trying to obtain the diversity of the initial
[Miller and Uyar, 1997] D. J. Miller and H. S. Uyar. A mix-
regressors from channels other than using different distance
ture of experts classifier with learning based on both la-
metrics is an issue to be investigated in future work. Note that
belled and unlabelled data. In M. Mozer, M. I. Jordan, and
although this paper uses kNN regressor as the base learner,
T. Petsche, editors, Advances in Neural Information Pro-
an important idea of COREG, i.e. regarding the labeling of
cessing Systems 9, pages 571–577. MIT Press, Cambridge,
the unlabeled example which makes the regressor most con-
sistent with the labeled example set as with the most confi-
[Nigam and Ghani, 2000] K. Nigam and R. Ghani. Analyz-
dence, can also be used with other base learners. Therefore,
ing the effectiveness and applicability of co-training. In
designing semi-supervised regression algorithms with other
Proceedings of the 9th ACM International Conference on
base learners along the way of COREG is another interest-
Information and Knowledge Management, pages 86–93,
ing issue to be explored in the future. Furthermore, designing
semi-supervised regression algorithms outside the co-trainingframework is also well-worth studying.
[Nigam et al., 2000] K. Nigam, A. K. McCallum, S. Thrun,
and T. Mitchell. Text classification from labeled and un-
Acknowledgments
labeled documents using EM. Machine Learning, 39(2–3):103–134, 2000.
Supported by NSFC (60325207), FANEDD (200343), andJ
[Pierce and Cardie, 2001] D. Pierce and C. Cardie. Limi-
tations of co-training for natural language learning from
References
large data sets. In Proceedings of the 2001 Conferenceon Empirical Methods in Natural Language Processing,
[Blum and Chawla, 2001] A. Blum and S. Chawla. Learn-
ing from labeled and unlabeled data using graph mincuts.
In Proceedings of the 18th International Conference on
Sarkar, 2001] A. Sarkar. Applying co-training methods to
Machine Learning, pages 19–26, Williamston, MA, 2001.
statistical parsing. In Proceedings of the 2nd Annual Meet-ing of the North American Chapter of the Association forComputational Linguistics, pages 95–102, Pittsburgh, PA,
[Blum and Mitchell, 1998] A. Blum and T. Mitchell. Com-
bining labeled and unlabeled data with co-training. Proceedings of the 11th Annual Conference on Compu-tational Learning Theory, pages 92–100, Madison, WI,
ducing the small sample size problem and mitigating thehughes phenomenon. IEEE Transactions on Geoscienceand Remote Sensing, 32(5):1087–1095, 1994. Norms: NN Pattern Classification Techniques. IEEE Com-
puter Society Press, Los Alamitos, CA, 1991.
A. Sarkar, S. Clark, R. Hwa, J. Hockenmaier, P. Ruhlen,
[Dasgupta et al., 2002] S. Dasgupta,
S. Baker, and J. Crim. Bootstrapping statistical parsers
D. McAllester. PAC generalization bounds for co-training.
from small data sets. In Proceedings of the 11th Con-
In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors,
ference on the European Chapter of the Association forAdvances in Neural Information Processing Systems 14,
Computational Linguistics, pages 331–338, Budapest,
pages 375–382. MIT Press, Cambridge, MA, 2002.
[Dempster et al., 1977] A. P. Dempster, N. M. Laird, and
[Zhou et al., 2002] Z.-H. Zhou, J. Wu, and W. Tang. Ensem-
D. B. Rubin. Maximum likelihood from incomplete data
bling neural networks: many could be better than all. Ar-
via the EM algorithm. Journal of the Royal Statistical So-tificial Intelligence, 137(1–2):239–263, 2002. ciety, Series B, 39(1):1–38, 1977.

Animal-Kind International Annual Report for 2012 (By Karen Menczer, Executive Director) The 2012 Annual Report is organized in line with the AKI mission: to raise visibility of the work of AKI partner organizations with the aim of connecting people, and donating funds and supplies to our partners. During 2012, AKI kept the same partner organizations as we had during 2011. We added Anim