Semi-supervised learning
Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. Semi-supervised learning falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data). It is a special instance of weak supervision.
Unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render large, fully labeled training sets infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.
A set of $l$ independently identically distributed examples $x_{1},\dots ,x_{l}\in X$ with corresponding labels $y_{1},\dots ,y_{l}\in Y$ and $u$ unlabeled examples $x_{l+1},\dots ,x_{l+u}\in X$ is processed. Semi-supervised learning combines this information to surpass the classification performance that can be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning.
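As a minimal illustration of this setup (the array names, sizes, and synthetic data below are arbitrary choices, not part of any standard), the input to a semi-supervised learner is a small labeled sample together with a much larger unlabeled one:

```python
import numpy as np

rng = np.random.default_rng(0)

l, u = 20, 1000                            # few labeled points, many unlabeled ones
X_labeled = rng.normal(size=(l, 2))        # x_1, ..., x_l
y_labeled = rng.integers(0, 2, size=l)     # y_1, ..., y_l in Y = {0, 1}
X_unlabeled = rng.normal(size=(u, 2))      # x_{l+1}, ..., x_{l+u} (no labels)
```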
Semi-supervised learning may refer to either transductive learning or inductive learning.^{[1]} The goal of transductive learning is to infer the correct labels for the given unlabeled data $x_{l+1},\dots ,x_{l+u}$ only. The goal of inductive learning is to infer the correct mapping from $X$ to $Y$.
Intuitively, the learning problem can be seen as an exam and labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam.
It is unnecessary (and, according to Vapnik's principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.
Assumptions
In order to make any use of unlabeled data, some relationship to the underlying distribution of data must exist. Semi-supervised learning algorithms make use of at least one of the following assumptions:^{[2]}
Continuity / smoothness assumption
Points that are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so that few points are close to each other but in different classes.^{[citation needed]}
Cluster assumption
The data tend to form discrete clusters, and points in the same cluster are more likely to share a label (although data that shares a label may spread across multiple clusters). This is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms.
Manifold assumption
The data lie approximately on a manifold of much lower dimension than the input space. In this case learning the manifold using both the labeled and unlabeled data can avoid the curse of dimensionality. Then learning can proceed using distances and densities defined on the manifold.
The manifold assumption is practical when high-dimensional data are generated by some process that may be hard to model directly, but which has only a few degrees of freedom. For instance, human voice is controlled by a few vocal folds,^{[3]} and images of various facial expressions are controlled by a few muscles. In these cases, it is better to consider distances and smoothness in the natural space of the generating problem, rather than in the space of all possible acoustic waves or images, respectively.
History
The heuristic approach of self-training (also known as self-learning or self-labeling) is historically the oldest approach to semi-supervised learning,^{[2]} with examples of applications starting in the 1960s.^{[4]}
The transductive learning framework was formally introduced by Vladimir Vapnik in the 1970s.^{[5]} Interest in inductive learning using generative models also began in the 1970s. A probably approximately correct learning bound for semi-supervised learning of a Gaussian mixture was demonstrated by Ratsaby and Venkatesh in 1995.^{[6]}
Semi-supervised learning has recently^{[when?]} become more popular and practically relevant due to the variety of problems for which vast quantities of unlabeled data are available—e.g. text on websites, protein sequences, or images.^{[7]}
Methods
Generative models
Generative approaches to statistical learning first seek to estimate $p(x\mid y)$,^{[disputed – discuss]} the distribution of data points belonging to each class. The probability $p(y\mid x)$ that a given point $x$ has label $y$ is then proportional to $p(x\mid y)p(y)$ by Bayes' rule. Semi-supervised learning with generative models can be viewed either as an extension of supervised learning (classification plus information about $p(x)$) or as an extension of unsupervised learning (clustering plus some labels).
Generative models assume that the distributions take some particular form $p(x\mid y,\theta )$ parameterized by the vector $\theta$. If these assumptions are incorrect, the unlabeled data may actually decrease the accuracy of the solution relative to what would have been obtained from labeled data alone.^{[8]} However, if the assumptions are correct, then the unlabeled data necessarily improves performance.^{[6]}
The unlabeled data are distributed according to a mixture of individual-class distributions. In order to learn the mixture distribution from the unlabeled data, it must be identifiable, that is, different parameters must yield different summed distributions. Gaussian mixture distributions are identifiable and commonly used for generative models.
The parameterized joint distribution can be written as $p(x,y\mid \theta )=p(y\mid \theta )p(x\mid y,\theta )$ by using the chain rule. Each parameter vector $\theta$ is associated with a decision function $f_{\theta }(x)={\underset {y}{\operatorname {argmax} }}\ p(y\mid x,\theta )$. The parameter is then chosen based on fit to both the labeled and unlabeled data, weighted by $\lambda$:
 ${\underset {\Theta }{\operatorname {argmax} }}\left(\log p(\{x_{i},y_{i}\}_{i=1}^{l}\mid \theta )+\lambda \log p(\{x_{i}\}_{i=l+1}^{l+u}\mid \theta )\right)$^{[9]}
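As a concrete reading of this criterion, the sketch below evaluates the weighted log-likelihood for a Gaussian mixture with one component per class. The actual maximization is typically carried out with expectation-maximization rather than by direct evaluation, and the function name and the theta container are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def weighted_log_likelihood(X_lab, y_lab, X_unlab, theta, lam=1.0):
    """Evaluate log p({x_i, y_i} | theta) + lam * log p({x_i} | theta)
    for a Gaussian mixture with one component per class."""
    priors, means, covs = theta["priors"], theta["means"], theta["covs"]
    classes = range(len(priors))

    # Labeled term: each (x, y) contributes log p(y | theta) + log p(x | y, theta).
    lab = sum(np.log(priors[y]) + multivariate_normal.logpdf(x, means[y], covs[y])
              for x, y in zip(X_lab, y_lab))

    # Unlabeled term: each x contributes log of the class-marginalized density,
    # i.e. log sum_y p(y | theta) p(x | y, theta).
    unlab = sum(np.log(sum(priors[c] * multivariate_normal.pdf(x, means[c], covs[c])
                           for c in classes))
                for x in X_unlab)
    return lab + lam * unlab
```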
Low-density separation
Another major class of methods attempts to place boundaries in regions with few data points (labeled or unlabeled). One of the most commonly used algorithms is the transductive support vector machine, or TSVM (which, despite its name, may be used for inductive learning as well). Whereas support vector machines for supervised learning seek a decision boundary with maximal margin over the labeled data, the goal of TSVM is a labeling of the unlabeled data such that the decision boundary has maximal margin over all of the data. In addition to the standard hinge loss $(1-yf(x))_{+}$ for labeled data, a loss function $(1-|f(x)|)_{+}$ is introduced over the unlabeled data by letting $y=\operatorname {sign} f(x)$. TSVM then selects $f^{*}(x)=h^{*}(x)+b$ from a reproducing kernel Hilbert space ${\mathcal {H}}$ by minimizing the regularized empirical risk:
 $f^{*}={\underset {f}{\operatorname {argmin} }}\left(\displaystyle \sum _{i=1}^{l}(1-y_{i}f(x_{i}))_{+}+\lambda _{1}\|h\|_{\mathcal {H}}^{2}+\lambda _{2}\sum _{i=l+1}^{l+u}(1-|f(x_{i})|)_{+}\right)$
An exact solution is intractable due to the non-convex term $(1-|f(x)|)_{+}$, so research focuses on useful approximations.^{[9]}
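As a concrete reading of the objective above, the following sketch evaluates the regularized empirical risk for a candidate $f$, assuming its decision values and RKHS norm have already been computed (the function name is illustrative; the hard part, minimizing over $f$, is not shown):

```python
import numpy as np

def tsvm_objective(f_lab, y_lab, f_unlab, h_norm_sq, lam1, lam2):
    """Regularized empirical risk of a candidate f: hinge loss (1 - y f(x))_+ on
    labeled points, symmetric hinge (1 - |f(x)|)_+ on unlabeled points."""
    labeled_hinge = np.maximum(0.0, 1.0 - y_lab * f_lab).sum()
    unlabeled_hinge = np.maximum(0.0, 1.0 - np.abs(f_unlab)).sum()
    return labeled_hinge + lam1 * h_norm_sq + lam2 * unlabeled_hinge
```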
Other approaches that implement low-density separation include Gaussian process models, information regularization, and entropy minimization (of which TSVM is a special case).
Laplacian regularization
Laplacian regularization has historically been approached through the graph Laplacian. Graph-based methods for semi-supervised learning use a graph representation of the data, with a node for each labeled and unlabeled example. The graph may be constructed using domain knowledge or similarity of examples; two common methods are to connect each data point to its $k$ nearest neighbors or to examples within some distance $\epsilon$. The weight $W_{ij}$ of an edge between $x_{i}$ and $x_{j}$ is then set to $e^{-{\frac {\|x_{i}-x_{j}\|^{2}}{\epsilon }}}$.
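A minimal sketch of such a construction, assuming a k-nearest-neighbor connectivity rule combined with the Gaussian weights above (the helper name is illustrative; scikit-learn's kneighbors_graph is used only to obtain neighbor distances):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def gaussian_knn_weights(X, k=10, eps=1.0):
    """Symmetric k-NN graph with weights W_ij = exp(-||x_i - x_j||^2 / eps) on edges."""
    # Euclidean distances to each point's k nearest neighbours; zeros elsewhere.
    D = kneighbors_graph(X, n_neighbors=k, mode="distance").toarray()
    W = np.where(D > 0, np.exp(-D ** 2 / eps), 0.0)
    # Symmetrize, since the k-NN relation is not symmetric in general.
    return np.maximum(W, W.T)
```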
Within the framework of manifold regularization,^{[10]}^{[11]} the graph serves as a proxy for the manifold. A term is added to the standard Tikhonov regularization problem to enforce smoothness of the solution relative to the manifold (in the intrinsic space of the problem) as well as relative to the ambient input space. The minimization problem becomes
 ${\underset {f\in {\mathcal {H}}}{\operatorname {argmin} }}\left({\frac {1}{l}}\displaystyle \sum _{i=1}^{l}V(f(x_{i}),y_{i})+\lambda _{A}\|f\|_{\mathcal {H}}^{2}+\lambda _{I}\int _{\mathcal {M}}\|\nabla _{\mathcal {M}}f(x)\|^{2}\,dp(x)\right)$^{[9]}
where ${\mathcal {H}}$ is a reproducing kernel Hilbert space and ${\mathcal {M}}$ is the manifold on which the data lie. The regularization parameters $\lambda _{A}$ and $\lambda _{I}$ control smoothness in the ambient and intrinsic spaces respectively. The graph is used to approximate the intrinsic regularization term. Defining the graph Laplacian $L=D-W$, where $D_{ii}=\sum _{j=1}^{l+u}W_{ij}$ and $\mathbf {f}$ is the vector $[f(x_{1})\dots f(x_{l+u})]$, we have
 $\mathbf {f} ^{T}L\mathbf {f} ={\frac {1}{2}}\displaystyle \sum _{i,j=1}^{l+u}W_{ij}(f_{i}-f_{j})^{2}\approx \int _{\mathcal {M}}\|\nabla _{\mathcal {M}}f(x)\|^{2}\,dp(x)$.
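Expressed on the graph, the intrinsic penalty is simply a quadratic form in the vector of function values; a minimal sketch with a dense weight matrix:

```python
import numpy as np

def laplacian_penalty(W, f):
    """Compute f^T L f with L = D - W, which equals (1/2) * sum_ij W_ij (f_i - f_j)^2."""
    L = np.diag(W.sum(axis=1)) - W   # D - W, with D the diagonal degree matrix
    return f @ L @ f
```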
The graph-based approach to Laplacian regularization can be related to the finite difference method.^{[clarification needed]}^{[citation needed]}
The Laplacian can also be used to extend supervised learning algorithms such as regularized least squares and support vector machines (SVM) to semi-supervised versions, Laplacian regularized least squares and Laplacian SVM.
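For Laplacian regularized least squares, the representer theorem gives a solution of the form $f(x)=\sum _{i=1}^{l+u}\alpha _{i}k(x_{i},x)$, and the coefficients $\alpha$ solve a linear system. The sketch below follows one common scaling of the two regularization terms; the exact constants and the helper name are assumptions rather than a canonical implementation:

```python
import numpy as np

def laplacian_rls(K, L, y_lab, l, lam_A=1e-2, lam_I=1e-2):
    """Expansion coefficients alpha for Laplacian regularized least squares over
    all l + u points, with squared loss on the first l (labeled) points.

    K : (l+u, l+u) kernel matrix over labeled and unlabeled points
    L : (l+u, l+u) graph Laplacian"""
    n = K.shape[0]                       # n = l + u
    J = np.zeros((n, n))
    J[:l, :l] = np.eye(l)                # selects the labeled points
    Y = np.zeros(n)
    Y[:l] = y_lab                        # labels padded with zeros for unlabeled points
    A = J @ K + lam_A * l * np.eye(n) + (lam_I * l / n ** 2) * (L @ K)
    return np.linalg.solve(A, Y)
```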
Heuristic approaches
Some methods for semi-supervised learning are not intrinsically geared to learning from both unlabeled and labeled data, but instead make use of unlabeled data within a supervised learning framework. For instance, the labeled and unlabeled examples $x_{1},\dots ,x_{l+u}$ may inform a choice of representation, distance metric, or kernel for the data in an unsupervised first step. Then supervised learning proceeds from only the labeled examples. In this vein, some methods learn a low-dimensional representation using the supervised data and then apply either low-density separation or graph-based methods to the learned representation.^{[12]}^{[13]} Iteratively refining the representation and then performing semi-supervised learning on said representation may further improve performance.
Self-training is a wrapper method for semi-supervised learning.^{[14]} First a supervised learning algorithm is trained based on the labeled data only. This classifier is then applied to the unlabeled data to generate more labeled examples as input for the supervised learning algorithm. Generally only the labels the classifier is most confident in are added at each step.^{[15]}
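Recent versions of scikit-learn ship a generic wrapper of this kind, SelfTrainingClassifier, which follows the library's convention of marking unlabeled examples with the label -1. A minimal usage sketch (the synthetic dataset, base classifier, and confidence threshold are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1        # keep 50 labels; -1 marks a point as unlabeled

base = SVC(probability=True, gamma="scale")           # base learner must provide predict_proba
model = SelfTrainingClassifier(base, threshold=0.9)    # pseudo-label only confident predictions
model.fit(X, y_partial)
print(accuracy_score(y, model.predict(X)))             # accuracy against the withheld true labels
```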
Co-training is an extension of self-training in which multiple classifiers are trained on different (ideally disjoint) sets of features and generate labeled examples for one another.^{[16]}
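A minimal sketch of this procedure under the assumption of two precomputed feature views (the function name, the use of logistic regression, and the per-round labeling budget are illustrative choices, not part of the original co-training formulation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1, X2, y, U1, U2, rounds=5, per_round=10):
    """Two classifiers on two feature views; each round, each view pseudo-labels
    the unlabeled points it is most confident about, and both training sets grow
    with the union of those picks."""
    clf1, clf2 = LogisticRegression(max_iter=1000), LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        if len(U1) == 0:
            break
        clf1.fit(X1, y)
        clf2.fit(X2, y)
        picks = {}  # index into U1/U2 -> pseudo-label (first view to pick an index wins)
        for clf, U_view in ((clf1, U1), (clf2, U2)):
            conf = clf.predict_proba(U_view).max(axis=1)
            for i in np.argsort(conf)[-per_round:]:
                picks.setdefault(int(i), int(clf.predict(U_view[i:i + 1])[0]))
        idx = np.array(sorted(picks))
        X1, X2 = np.vstack([X1, U1[idx]]), np.vstack([X2, U2[idx]])
        y = np.concatenate([y, [picks[i] for i in idx]])
        keep = np.ones(len(U1), dtype=bool)   # drop the newly pseudo-labeled points
        keep[idx] = False
        U1, U2 = U1[keep], U2[keep]
    return clf1, clf2
```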
In human cognition
Human responses to formal semi-supervised learning problems have yielded varying conclusions about the degree of influence of the unlabeled data.^{[17]} More natural learning problems may also be viewed as instances of semi-supervised learning. Much of human concept learning involves a small amount of direct instruction (e.g. parental labeling of objects during childhood) combined with large amounts of unlabeled experience (e.g. observation of objects without naming or counting them, or at least without feedback).
Human infants are sensitive to the structure of unlabeled natural categories such as images of dogs and cats or male and female faces.^{[18]} Infants and children take into account not only unlabeled examples, but the sampling process from which labeled examples arise.^{[19]}^{[20]}
References
 ^ "SemiSupervised Learning Literature Survey, Page 5". 2007. CiteSeerX 10.1.1.99.9681.
{{cite journal}}
: Cite journal requiresjournal=
(help)  ^ ^{a} ^{b} Chapelle, Schölkopf & Zien 2006.
 ^ Stevens, Kenneth N., 1924 (1998). Acoustic phonetics. Cambridge, Mass.: MIT Press. ISBN 0585087202. OCLC 42856189.
{{cite book}}
: CS1 maint: multiple names: authors list (link)  ^ Scudder, H. (July 1965). "Probability of error of some adaptive patternrecognition machines". IEEE Transactions on Information Theory. 11 (3): 363–371. doi:10.1109/TIT.1965.1053799. ISSN 15579654.
 ^ Vapnik, V.; Chervonenkis, A. (1974). Theory of Pattern Recognition (in Russian). Moscow: Nauka. cited in Chapelle, Schölkopf & Zien 2006, p. 3
 ^ ^{a} ^{b} Ratsaby, J.; Venkatesh, S. "Learning from a mixture of labeled and unlabeled examples with parametric side information" (PDF). in Proceedings of the eighth annual conference on Computational learning theory  COLT '95. New York, New York, USA: ACM Press. 1995. pp. 412–417. doi:10.1145/225298.225348. ISBN 0897917235. S2CID 17561403.. Cited in Chapelle, Schölkopf & Zien 2006, p. 4
 ^ Zhu, Xiaojin (2008). "Semisupervised learning literature survey" (PDF). University of WisconsinMadison.
 ^ Fabio, Cozman; Ira, Cohen (20060922), "Risks of SemiSupervised Learning: How Unlabeled Data Can Degrade Performance of Generative Classifiers", SemiSupervised Learning, The MIT Press, pp. 56–72, doi:10.7551/mitpress/9780262033589.003.0004, ISBN 9780262033589 In: Chapelle, Schölkopf & Zien 2006
 ^ ^{a} ^{b} ^{c} Zhu, Xiaojin. SemiSupervised Learning University of WisconsinMadison.
 ^ M. Belkin; P. Niyogi (2004). "Semisupervised Learning on Riemannian Manifolds". Machine Learning. 56 (Special Issue on Clustering): 209–239. doi:10.1023/b:mach.0000033120.25363.1e.
 ^ M. Belkin, P. Niyogi, V. Sindhwani. On Manifold Regularization. AISTATS 2005.
 ^ Iscen, Ahmet; Tolias, Giorgos; Avrithis, Yannis; Chum, Ondrej (2019). "Label Propagation for Deep SemiSupervised Learning". Conference on Computer Vision and Pattern Recognition (CVPR): 5065–5074. arXiv:1904.04717. doi:10.1109/CVPR.2019.00521. ISBN 9781728132938. S2CID 104291869. Retrieved 26 March 2021.
 ^ Burkhart, Michael C.; Shan, Kyle (2020). "Deep LowDensity Separation for Semisupervised Classification". International Conference on Computational Science (ICCS). Lecture Notes in Computer Science. 12139: 297–311. doi:10.1007/9783030504205_22. ISBN 9783030504199.
 ^ Triguero, Isaac; García, Salvador; Herrera, Francisco (20131126). "Selflabeled techniques for semisupervised learning: taxonomy, software and empirical study". Knowledge and Information Systems. 42 (2): 245–284. doi:10.1007/s101150130706y. ISSN 02191377. S2CID 1955810.
 ^ Fazakis, Nikos; Karlos, Stamatis; Kotsiantis, Sotiris; Sgarbas, Kyriakos (20151229). "SelfTrained LMT for Semisupervised Learning". Computational Intelligence and Neuroscience. 2016: 3057481. doi:10.1155/2016/3057481. PMC 4709606. PMID 26839531.
 ^ Didaci, Luca; Fumera, Giorgio; Roli, Fabio (20121107). Gimel’farb, Georgy; Hancock, Edwin; Imiya, Atsushi; Kuijper, Arjan; Kudo, Mineichi; Omachi, Shinichiro; Windeatt, Terry; Yamada, Keiji (eds.). Analysis of Cotraining Algorithm with Very Small Training Sets. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 719–726. doi:10.1007/9783642341663_79. ISBN 9783642341656. S2CID 46063225.
 ^ Zhu, Xiaojin (2009). Introduction to semisupervised learning. Goldberg, A. B. (Andrew B.). [San Rafael, Calif.]: Morgan & Claypool Publishers. ISBN 9781598295481. OCLC 428541480.
 ^ Younger B. A.; Fearing D. D. (1999). "Parsing Items into Separate Categories: Developmental Change in Infant Categorization". Child Development. 70 (2): 291–303. doi:10.1111/14678624.00022.
 ^ Xu, F. & Tenenbaum, J. B. (2007). "Sensitivity to sampling in Bayesian word learning". Developmental Science. 10 (3): 288–297. CiteSeerX 10.1.1.141.7505. doi:10.1111/j.14677687.2007.00590.x. PMID 17444970.
 ^ Gweon, H., Tenenbaum J.B., and Schulz L.E (2010). "Infants consider both the sample and the sampling process in inductive generalization". Proc Natl Acad Sci U S A. 107 (20): 9066–71. Bibcode:2010PNAS..107.9066G. doi:10.1073/pnas.1003095107. PMC 2889113. PMID 20435914.
{{cite journal}}
: CS1 maint: multiple names: authors list (link)
Sources
 Chapelle, Olivier; Schölkopf, Bernhard; Zien, Alexander, eds. (2006). Semi-Supervised Learning. Cambridge, Mass.: MIT Press. ISBN 9780262033589.
External links
 Manifold Regularization: A freely available MATLAB implementation of the graph-based semi-supervised algorithms Laplacian support vector machines and Laplacian regularized least squares.
 KEEL: A software tool to assess evolutionary algorithms for data mining problems (regression, classification, clustering, pattern mining and so on), including a module for semi-supervised learning.
 Semi-Supervised Learning Software
 1.14. Semi-Supervised — scikit-learn 0.22.1 documentation: semi-supervised algorithms in scikit-learn.