When building a classifier, we assume that a sufficiently large labeled training data set is available. This situation is what we call supervised learning. In a real-world setting, such labeled training examples must first be acquired. In any application domain where labeling requires domain expertise, such as medicine, gathering a large labeled training set is an expensive and time-consuming task. In such cases, it is not uncommon to use a small set of correctly labeled examples to label the rest of the training examples. This type of learning is referred to as semi-supervised learning, and it falls somewhere between supervised and unsupervised learning.

Often the term semi-supervised classification is used to describe the process of labeling training examples using a small set of labeled examples, to differentiate it from semi-supervised clustering. In semi-supervised clustering, the goal is to group a given set of examples into clusters under the condition that certain examples must be clustered together and certain examples must be put in different clusters. In other words, constraints are imposed on the resulting clusters in terms of the cluster memberships of certain specified examples. In this blog post, I am going to illustrate semi-supervised classification, leaving semi-supervised clustering for another post.

When we have a small set of labeled examples and want to rely on them to label a much larger set of unlabeled training examples, we need to make some assumptions. For example, we might assume that training examples close to each other are likely to have similar class labels, the same assumption made when applying k-nearest neighbor classification. Alternatively, we might assume that each class follows a Gaussian distribution and try to find the distribution parameters iteratively. We must remember that our result will only be as good as our assumptions.
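As a concrete illustration of the proximity assumption, here is a minimal sketch, assuming scikit-learn and a handful of made-up two-feature points (these are not the nine examples used later in this post): each unlabeled example simply receives the label of its nearest labeled neighbor.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical two-feature examples; only the first three are labeled.
# -1 marks an example whose label is unknown.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0],
              [0.9, 1.1], [5.2, 4.8], [4.9, 5.3]])
y = np.array([0, 0, 1, -1, -1, -1])

labeled = y != -1

# Proximity assumption in action: give each unlabeled example the label
# of its nearest labeled neighbor (a 1-nearest-neighbor classifier).
knn = KNeighborsClassifier(n_neighbors=1).fit(X[labeled], y[labeled])
y[~labeled] = knn.predict(X[~labeled])

print(y)  # every example now carries a label
```

This one-shot nearest-neighbor labeling is the crudest use of the proximity assumption; the label propagation algorithm described next refines it by letting labels flow iteratively through a network of all examples.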
Label Propagation Algorithm (LPA)
One semi-supervised classification method is label propagation, which I will explain here. This method is based on the assumption that examples near each other are likely to have similar class labels. The basic idea is to consider all examples, labeled and unlabeled, as interconnected nodes in a network. Each node in the network tries to propagate its label to the other nodes. How strongly a node's label influences another node is determined by their closeness, or proximity. We will work through a series of steps to illustrate how the label propagation algorithm works. Let us consider the following nine training examples, each with two features: