Semi-Supervised Regression with Co-Training
Zhi-Hua Zhou and Ming Li
National Laboratory for Novel Software Technology Nanjing University, Nanjing 210093, China {zhouzh, lim}@lamda.nju.edu.cn Abstract
A prominent achievement in this area is the co-training paradigm proposed by Blum and Mitchell [1998], which In many practical machine learning and data min- trains two classifiers separately on two sufficient and redun- ing applications, unlabeled training examples are dant views, i.e. two attribute sets each of which is sufficient readily available but labeled ones are fairly expen- for learning and conditionally independent to the other given sive to obtain. Therefore, semi-supervised learn- the class label, and uses the predictions of each classifier on ing algorithms such as co-training have attracted unlabeled examples to augment the training set of the other.
much attention. Previous research mainly focuses Dasgupta et al. [2002] have shown that when the require- on semi-supervised classification. In this paper, a ment of sufficient and redundant views is met, the co-trained co-training style semi-supervised regression algo- classifiers could make few generalization errors by maxi- rithm, i.e. COREG, is proposed. This algorithm mizing their agreement over the unlabeled data. Unfortu- uses two k-nearest neighbor regressors with differ- nately, such a requirement can hardly be met in most sce- ent distance metrics, each of which labels the unla- narios. Goldman and Zhou [2000] proposed an algorithm beled data for the other regressor where the label- which does not exploit attribute partition. This algorithm re- ing confidence is estimated through consulting the quires using two different supervised learning algorithms that influence of the labeling of unlabeled examples on partition the instance space into a set of equivalence classes, the labeled ones. Experiments show that COREG and employs cross validation technique to determine how to can effectively exploit unlabeled data to improve label the unlabeled examples and how to produce the final hypothesis. Although the requirement of sufficient and re-dundant views is quite strict, the co-training paradigm has al-ready been used in many domains such as statistical parsing Introduction
and noun phrase identification [Hwa et al., 2003][Pierce and In many practical machine learning and data mining appli- Cardie, 2001][Sarkar, 2001][Steedman et al., 2003].
cations such as web user profile analysis, unlabeled training It is noteworthy that previous research mainly focuses on examples are readily available but labeled ones are fairly ex- classification while regression remains almost untouched. In pensive to obtain because they require human effort. There- this paper, a co-training style semi-supervised regression al- fore, semi-supervised learning methods that exploit unlabeled examples in addition to labeled ones have attracted much at- proposed. This algorithm employs two k-nearest neighbor (kNN) regressors, each of which labels the unlabeled data Many current semi-supervised learning methods use a gen- for the other during the learning process. In order to choose erative model for the classifier and employ Expectation- appropriate unlabeled examples to label, COREG estimates Maximization (EM) [Dempster et al., 1977] to model the the labeling confidence through consulting the influence of label estimation or parameter estimation process. For ex- the labeling of unlabeled examples on the labeled examples.
ample, mixture of Gaussians [Shahshahani and Landgrebe, The final prediction is made by averaging the regression es- 1994], mixture of experts [Miller and Uyar, 1997], and naive timates generated by both regressors. Since COREG utilizes Bayes [Nigam et al., 2000] have been respectively used as the different distance metrics instead of requiring sufficient and generative model, while EM is used to combine labeled and redundant views, its applicability is broad. Moreover, experi- unlabeled data for classification. There are also many other mental results show that this algorithm can effectively exploit methods such as using transductive inference for support vec- unlabeled data to improve regression estimates.
tor machines to optimize performance on a specific test set The rest of this paper is organized as follows. Section 2 [Joachims, 1999], constructing a graph on the examples such proposes the COREG algorithm. Section 3 presents an anal- that the minimum cut on the graph yields an optimal labeling ysis on the algorithm. Section 4 reports on the experimental of the unlabeled examples according to certain optimization results. Finally, Section 5 concludes and raises several issues functions [Blum and Chawla, 2001], etc.
and h2 can be diverse through instantiating them with differ-ent p values. Such a setting can also bring another profit, that Let L = {(x1, y1), · · · , (x|L|, y|L|)} denote the labeled ex- is, since it is usually difficult to decide which p value is better ample set, where xi is the i-th instance described by d at- for the concerned task, the functions of these regressors may tributes, yi is its real-valued label, i.e. its expected real- be somewhat complementary to be combined.
valued output, and |L| is the number of labeled examples;let U denote the unlabeled data set, where the instances are also described by the d attributes, whose real-valued labels M inkowskyp(xr, xs) = are unknown, and |U | is the number of unlabeled examples.
Two regressors, i.e. h1 and h2, are generated from L, each In order to choose appropriate unlabeled examples to la- of which is then refined with the help of unlabeled exam- bel, the labeling confidence should be estimated such that the ples that are labeled by the latest version of the other re- most confidently labeled example can be identified. In classi- gressor. Here the kNN regressor [Dasarathy, 1991] is used fication this is relatively straightforward because when mak- as the base learner to instantiate h1 and h2, which labels a ing classifications, many classifiers can also provide an esti- new instance through averaging the real-valued labels of its mated probability (or an approximation) for the classification, k-nearest neighboring examples.
e.g. a Naive Bayes classifier returns the maximum posteriori The use of kNN regressor as the base learner is due to the hypothesis where the posterior probabilities can be used, a following considerations. First, the regressors will be refined BP neural network classifier returns thresholded classification in each of many learning iterations. If neural networks or re- where the real-valued outputs can be used, etc. Therefore, the gression trees are used, then in each iteration the regressors labeling confidence can be estimated through consulting the have to be re-trained with the labeled examples in addition to probabilities of the unlabeled examples being labeled to dif- the newly labeled ones, the computational load of which will ferent classes. For example, suppose the probability of the be quite heavy. Since kNN is a lazy learning method which instance a being classified to the classes c1 and c2 is 0.90 does not hold a separate training phase, the refinement of the and 0.10, respectively, while that of the instance b is 0.60 and kNN regressors can be efficiently realized. Second, in order 0.40, respectively. Then the instance a is more confident to to choose appropriate unlabeled examples to label, the label- ing confidence should be estimated. In COREG the estimation Unfortunately, in regression there is no such estimated utilizes the neighboring properties of the training examples, probability that can be used directly. This is because in con- which can be easily coupled with the kNN regressors.
trast to classification where the number of class labels to be It is noteworthy that the initial regressors should be diverse predicted is finite, the possible predictions in regression are because if they are identical, then for either regressor, the infinite. Therefore, a key of COREG is the mechanism for unlabeled examples labeled by the other regressor may be the same as these labeled by the regressor for itself. Thus, Heuristically, the most confidently labeled example of a the algorithm degenerates to self-training [Nigam and Ghani, regressor should be with such a property, i.e. the error of 2000] with a single learner. In the standard setting of co- the regressor on the labeled example set should decrease the training, the use of sufficient and redundant views enables the most if the most confidently labeled example is utilized. In learners be different. Previous research has also shown that other words, the most confidently labeled example should be even when there is no natural attribute partitions, if there are the one which makes the regressor most consistent with the sufficient redundancy among the attributes then a fairly rea- labeled example set. Thus, the mean squared error (MSE) sonable attribute partition will enable co-training to exhibit of the regressor on the labeled example set can be evaluated advantages [Nigam and Ghani, 2000]. While in the extended first. Then, the MSE of the regressor utilizing the information co-training algorithm which does not require sufficient and yu) can be evaluated on the labeled exam- redundant views, the diversity among the learners is achieved ple set, where xu is an unlabeled instance while ˆ through using different learning algorithms [Goldman and real-valued label generated by the original regressor. Let ∆u Zhou, 2000]. Since COREG does not assume sufficient and denote the result of subtracting the latter MSE from the for- redundant views and different learning algorithms, the diver- mer MSE. Note that the number of ∆u to be estimated equals sity of the regressors should be sought from other channels.
to the number of unlabeled examples. Finally, (xu, ˆ Here the diversity is achieved through utilizing different sociated with the biggest positive ∆u can be regarded as the distance metrics. In fact, a key of kNN learner is how to determine the distances between different instances.
Since repeatedly measuring the MSE of the kNN regres- Minkowsky distance shown in Eq. 1 is usually used for this sor on the whole labeled example set in each iteration will be purpose. Note that different concrete distance metrics can be time-consuming, considering that kNN regressor mainly uti- generated through setting different values to the distance or- lizes local information, COREG employs an approximation.
der, p. Roughly speaking, the smaller the order, the more That is, for each xu, COREG identifies its k-nearest neighbor- robust the resulting distance metric to data variations; while ing labeled examples and uses them to compute the MSE. In the bigger the order, the more sensitive the resulting distance detail, let Ω denote the set of k-nearest neighboring labeled metric to data variations. Therefore, the vicinities identified examples of xu, then the most confidently labeled example for a given instance may be different using the Minkowsky is identified through maximizing the value of ∆x in Eq. 2, distance with different orders. Thus, the kNN regressors h1 where h denotes the original regressor while h denotes the regression estimates. In order to simplify the discussion, here Table 1: Pseudo-code describing the COREG algorithm the effect of the pool U is not considered as in [Blum andMitchell, 1998]. That is, the unlabeled examples are assumed as being picked from the unlabeled example set U directly.
INPUT: labeled example set L, unlabeled example set U , In each learning iteration of COREG, for each unlabeled ex- ample xu, its k-nearest neighboring labeled examples are put maximum number of learning iterations T , into the set Ω. As mentioned before, the newly labeled exam- ple should make the regressor become more consistent with the labeled data set. Therefore, a criterion shown in Eq. 3 can Create pool U by randomly picking examples from U nal regressor while h is the one refined with (xu, ˆ 1 ← kN N (L1, k, p1); h2 ← kN N (L2, k, p2) u is positive, then utilizing (xu, ˆ Repeat for T rounds:
for j ∈ {1, 2} do
i − h(xi))2 − |L| for each xu ∈ U do
← N eighbors(xu, k, Lj) In the COREG algorithm, the unlabeled example which j ← kN N (Lj ∪ {(xu, ˆ fore, the question is, whether the unlabeled example chosen i − hj (xi))2 yi − hj (xi) 2) according to the maximization of ∆x will result in a positive end of for
if there exists an ∆x > 0
yu) is among the k-nearest neigh- bors of some examples in Ω, and is not among the k-nearest neighbors of any other examples in L. In this case, it is obvi- πj ← {(˜xj, ˜yj)}; U ← U − πj yu) will only change the regression es- else πj ← ∅
timates on the examples in Ω, therefore Eq. 3 becomes Eq. 4.
end of for
Comparing Eqs. 2 with 4 it can be found that the maximiza- L1 ← L1 ∪ π2; L2 ← L2 ∪ π1 tion of ∆x also results in the maximization of ∆ if neither of L1 and L2 changes then exit

1 ← kN N (L1, k, p1); h2 ← kN N (L2, k, p2) Replenish U by randomly picking examples from U end of Repeat
yu) is not among the k-nearest neighbors of any example in Ω. In this case, the value of ∆ UTPUT: regressor h∗(x) 1 (h yu) won’t be chosen in COREG.
yu) is among the k-nearest neigh- refined regressor which has utilized the information provided bors of some examples in Ω as well as some examples in L − Ω, and assume these examples in L − Ω are (x1, y1), · · · , (xm, ym). Then Eq. 3 becomes Eq. 5.
i − h(xi))2 (yi − h (xi))2) i − h(xi))2 (yi − h (xi))2)+ the function kN N (Lj, k, pj) returns a kNN regressor on the j , whose distance order is pi. The learn- ing process stops when the maximum number of learning it- will maximize the first sum term of Eq. 5, erations, i.e. T , is reached, or there is no unlabeled exam- ple which is capable of reducing the MSE of any of the re- u be positive should also refer the second sum term. Unfortunately, the value of this sum term gressors on the labeled example set. According to Blum and is difficult to be measured except that the neighboring rela- Mitchell [1998]’s suggestion, a pool of unlabeled examples tionships between all the labeled examples and (x smaller than U is used. Note that in each iteration the unla- evaluated. Therefore, there may exist cases where the unla- beled example chosen by h1 won’t be chosen by h2, which beled example chosen according to the maximization of ∆ is an extra mechanism for encouraging the diversity of the u, which is the cost COREG takes for using that can be more efficiently computed to approximate ples they label for each other will still be different.
u. Nevertheless, experiments show that in most cases suchan approximation is effective.
It seems that using only one regressor to label the unlabeled This section attempts to analyze whether the learning process examples for itself might be feasible, where the unlabeled ex- of COREG can use the unlabeled examples to improve the amples can be chosen according to the maximization of ∆x .
While considering that the labeled example set usually con- from the original ones. For example, on 3-d Mexican Hat two tains noise, the use of two regressors can be helpful to reduce new attributes, i.e. x3 and x4, are constructed from x1 + x2 and x1 − x2, and then a kNN regressor is built on x1 and x2 Let Λ denote the subset of noisy examples in L. For the while the other is built on x3 and x4. In each iteration, each unlabeled instance xu, either of the regressors h1 and h2 will kNN regressor chooses the unlabeled example which maxi- identify a set of k-nearest neighboring labeled examples for mizes the value of ∆x in Eq. 2 to label for the other regres- xu. Assume these sets are Ω1 and Ω2, respectively. Since h1 sor. The final prediction is made by averaging the regression and h2 use different distance orders, Ω1 and Ω2 are usually estimates of these two refined regressors. Besides, a kNN re- different, and therefore Ω1 Λ and Ω2 Λ are also usually gressor using only the labeled data is tested as a baseline for different. Suppose xu is labeled by h1 and then (xu, h1(xu)) the comparison, which is denoted by LABELED.
is put into L1, where h1(xu) suffers from Ω1 Λ. For an- All the kNN regressors used in SELF, ARTRE, and LA- other unlabeled instance xv which is very close to xu, its BELED employ 2nd-order Minkowski distance, and the k k-nearest neighbors identified by h1 will be very similar to value is set to 3. The same pool, U , as that used by COREG is Ω1 except that (xu, h1(xu)) has replaced a previous neigh- used in each iteration of SELF and ARTRE, and the maximum bor. Thus, h1(xv) will suffer from Ω1 Λ more seriously number of iterations is also set to 100.
than h1(xu) does. While, if the instance xu is labeled by h2 One hundred runs of experiments are carried out on each and (xu, h2(xu)) is put into L1, then h1(xv) will suffer from data set. In each run, the performance of all the four algo- Ω1 Λ only once, although xu is still very close to xv.
rithms, i.e. COREG, SELF, ARTRE, and LABELED, are eval-uated on randomly partitioned labeled/unlabeled/test splits.
The average MSE at each iteration is recorded. Note that Experiments are performed on ten data sets listed in Table 2 the learning processes of the algorithms may stop before the where “# attribute” denotes the number of input attributes.
maximum number of iterations is reached, and in that case, These data sets have been used in [Zhou et al., 2002] where the final MSE is used in computing the average MSE of the the detailed descriptions of the data sets can be found. Note that the input attributes as well as the real-valued labels have The improvement on average MSE obtained by exploiting been normalized to [0.0, 1.0].
unlabeled examples is tabulated in Table 3, which is com-puted by subtracting the final MSE from the initial MSE and Table 3: Improvement on average mean squared error For each data set, 25% data are kept as the test set, while the remaining 75% data are partitioned into the labeled andunlabeled sets where 10% (of the 75%) data are used as la- Table 3 shows that SELF and ARTRE improve the regres- beled examples while the remaining 90% (of the 75%) data sion estimates on only five data sets, while COREG improves on eight data sets. Moreover, Table 3 discloses that the im- In the experiments, the distance orders used by the two provement of COREG is always bigger than that of SELF and kNN regressors in COREG are set to 2 and 5, respectively, ARTRE. These observations tell that COREG can effectively the k value is set to 3, the maximum number of iterations T is exploit unlabeled examples to improve regression estimates.
set to 100, and the pool U contains 100 unlabeled examples For further studying the compared algorithms, the average randomly picked from the unlabeled set in each iteration.
MSE of different algorithms at different iterations are plotted A self-training style algorithm is tested for comparison, in Fig 1, where the average MSE of the two kNN regressors which is denoted by SELF. This algorithm uses a kNN re- used in COREG are also depicted. Note that in each subfig- gressor and in each iteration, it chooses the unlabeled exam- ure, every curve contains 101 points corresponding to the 100 ple which maximizes the value of ∆x in Eq. 2 to label for learning iterations in addition to the initial condition, where itself. Moreover, a co-training style algorithm, denoted by only 11 of them are explicitly depicted for better presentation.
ARTRE, is also tested. Since the experimental data sets are Fig. 1 shows that COREG can exploit unlabeled examples with no sufficient and redundant views, here an artificial re- to improve the regression estimates on most data sets, except dundant view is developed through deriving new attributes that on Friedman #3 there is no improvement while on Plane Figure 1: Comparisons on average mean squared error of different algorithms at different iterations the performance is degenerated. While, SELF and ARTRE de- the final regression estimates of COREG are significantly bet- generate the regression estimates on five data sets, i.e. Fried- ter than these of ARTRE on almost all the data sets except man #1 to #3, Multi, and Plane. Moreover, the average MSE on Friedman #1 where the latter is better. Furthermore, the of the final prediction made by COREG is almost always the final regression estimates of COREG are significantly better best except on Friedman #1 where ARTRE is slightly better than these of SELF and LABELED on almost all the data sets and on Plane where LABELED is the best while all the semi- except on Plane where LABELED is better and on Fried- supervised learning algorithms degenerate the performance.
man #3 where there is no significant difference. These re- These observations disclose that COREG is apparently the sults of t-tests confirm that COREG is the strongest among the compared algorithms, which can effectively exploit unla-beled data to improve the regression estimates.
Pairwise two-tailed t-tests with 0.05 significance level re- veal that the final regression estimates of COREG are signif-icantly better than its initial regression estimates on almost Conclusion
all the data sets except that on Plane the latter is better while This paper proposes a co-training style semi-supervised re- on Friedman #3 there is no significant difference. Moreover, gression algorithm COREG. This algorithm employs two k- nearest neighbor regressors using different distance metrics.
[Goldman and Zhou, 2000] S. Goldman and Y. Zhou. En- In each learning iteration, each regressor labels the unlabeled hancing supervised learning with unlabeled data. In Pro- example which can be most confidently labeled for the other ceedings of the 17th International Conference on Machine learner, where the labeling confidence is estimated through Learning, pages 327–334, San Francisco, CA, 2000. Mor- considering the consistency of the regressor with the labeled example set. The final prediction is made by averaging the [Hwa et al., 2003] R. Hwa, M. Osborne, A. Sarkar, and predictions of both the refined kNN regressors. Experiments M. Steedman. Corrected co-training for statistical parsers.
show that COREG can effectively exploit unlabeled data to In Working Notes of the ICML’03 Workshop on the Contin- uum from Labeled to Unlabeled Data in Machine Learning In contrast to standard co-training setting, COREG does not and Data Mining, Washington, DC, 2003.
require sufficient and redundant views, which enables it havebroad applicability. However, this forces C verse initial regressors with specific mechanisms. In this pa- text classification using support vector machines. In Pro- per the diversity is obtained by instantiating the Minkowski ceedings of the 16th International Conference on Machine distance with different distance orders. It is obvious that us- Learning, pages 200–209, Bled, Slovenia, 1999. Morgan ing completely different distance metrics may be more help- ful. Moreover, trying to obtain the diversity of the initial [Miller and Uyar, 1997] D. J. Miller and H. S. Uyar. A mix- regressors from channels other than using different distance ture of experts classifier with learning based on both la- metrics is an issue to be investigated in future work. Note that belled and unlabelled data. In M. Mozer, M. I. Jordan, and although this paper uses kNN regressor as the base learner, T. Petsche, editors, Advances in Neural Information Pro- an important idea of COREG, i.e. regarding the labeling of cessing Systems 9, pages 571–577. MIT Press, Cambridge, the unlabeled example which makes the regressor most con- sistent with the labeled example set as with the most confi- [Nigam and Ghani, 2000] K. Nigam and R. Ghani. Analyz- dence, can also be used with other base learners. Therefore, ing the effectiveness and applicability of co-training. In designing semi-supervised regression algorithms with other Proceedings of the 9th ACM International Conference on base learners along the way of COREG is another interest- Information and Knowledge Management, pages 86–93, ing issue to be explored in the future. Furthermore, designing semi-supervised regression algorithms outside the co-trainingframework is also well-worth studying.
[Nigam et al., 2000] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and un- Acknowledgments
labeled documents using EM. Machine Learning, 39(2–3):103–134, 2000.
Supported by NSFC (60325207), FANEDD (200343), andJ [Pierce and Cardie, 2001] D. Pierce and C. Cardie. Limi- tations of co-training for natural language learning from References
large data sets. In Proceedings of the 2001 Conferenceon Empirical Methods in Natural Language Processing, [Blum and Chawla, 2001] A. Blum and S. Chawla. Learn- ing from labeled and unlabeled data using graph mincuts.
In Proceedings of the 18th International Conference on Sarkar, 2001] A. Sarkar. Applying co-training methods to Machine Learning, pages 19–26, Williamston, MA, 2001.
statistical parsing. In Proceedings of the 2nd Annual Meet- ing of the North American Chapter of the Association forComputational Linguistics, pages 95–102, Pittsburgh, PA, [Blum and Mitchell, 1998] A. Blum and T. Mitchell. Com- bining labeled and unlabeled data with co-training.
Proceedings of the 11th Annual Conference on Compu- tational Learning Theory, pages 92–100, Madison, WI, ducing the small sample size problem and mitigating thehughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 32(5):1087–1095, 1994.
Norms: NN Pattern Classification Techniques. IEEE Com- puter Society Press, Los Alamitos, CA, 1991.
A. Sarkar, S. Clark, R. Hwa, J. Hockenmaier, P. Ruhlen, [Dasgupta et al., 2002] S. Dasgupta, S. Baker, and J. Crim. Bootstrapping statistical parsers D. McAllester. PAC generalization bounds for co-training.
from small data sets. In Proceedings of the 11th Con- In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, ference on the European Chapter of the Association for Advances in Neural Information Processing Systems 14, Computational Linguistics, pages 331–338, Budapest, pages 375–382. MIT Press, Cambridge, MA, 2002.
[Dempster et al., 1977] A. P. Dempster, N. M. Laird, and [Zhou et al., 2002] Z.-H. Zhou, J. Wu, and W. Tang. Ensem- D. B. Rubin. Maximum likelihood from incomplete data bling neural networks: many could be better than all. Ar- via the EM algorithm. Journal of the Royal Statistical So- tificial Intelligence, 137(1–2):239–263, 2002.
ciety, Series B, 39(1):1–38, 1977.

Source: http://hapci.mit.bme.hu/eng/system/files/oktatas/targyak/7153/SemiSupervisedRegressionWithCoTraining_Zhou.pdf


Animal-Kind International Annual Report for 2012 (By Karen Menczer, Executive Director) The 2012 Annual Report is organized in line with the AKI mission: to raise visibility of the work of AKI partner organizations with the aim of connecting people, and donating funds and supplies to our partners. During 2012, AKI kept the same partner organizations as we had during 2011. We added Anim


Southeast Georgia Urology Associates PATIENT INFORMATION FORM Preferred Urology Office Location: Sex: Male Race: White Black Preferred Pharmacy:____________________________ Phone:__________________ Fax:__________________ City:________________________ State:_______ Zip:___________ Reason for Appointment today (Chief Complaint) PRIMARY INSURANCE : SECONDARY INSURANCE

Copyright © 2009-2018 Drugs Today