Semi-Supervised Learning vs. Self-Supervised Learning: What is the difference?

Published on:
April 7, 2023
Published by:
Professor Ishwar Sethi

Have you often wondered about the difference between semi-supervised learning and self-supervised learning? If so, this post might help. Let us begin with supervised learning, the most popular machine learning methodology for building predictive models. It uses annotated, or labeled, data to train a predictive model. The label attached to a data vector is simply the response we expect from the predictive model when that data vector is its input during training. For example, while building a cat-versus-dog classifier, we label pictures of cats and dogs as cat or dog. When building such a classifier, we assume a large enough labeled training data set is available.
When the data has no labels attached to it, the learning is known as unsupervised learning. It is a methodology that tries to group the data into different groups based upon similarities among the training vectors. The k-means clustering algorithm is the most well-known unsupervised learning technique.
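As a minimal sketch of the idea, here is k-means in pure NumPy on two synthetic blobs. The data, the fixed iteration count, and the deterministic centroid seeding are all simplifications for illustration; real implementations use random restarts and convergence checks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unlabeled data: two well-separated 2-D blobs of 50 points each.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

def kmeans(X, k, iters=20):
    # Seed centroids with evenly spaced data points (a simplification;
    # real implementations use random restarts).
    centroids = X[::len(X) // k][:k].copy()
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        # Update step: each centroid moves to the mean of its members.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

labels, centroids = kmeans(X, 2)
```

With well-separated blobs, the two recovered clusters coincide with the two groups that generated the data, even though no labels were ever supplied.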

Semi-Supervised Learning

In a real-world setting, training examples with labels must be acquired for a predictive modeling task. Attaching labels is costly and time-consuming, and it often requires expert annotators, so we frequently need ways to work with a small labeled training set. In certain situations we may be able to acquire, in addition to the small labeled set, extra training examples without labels, labeling them being too expensive to perform. In such cases, it is possible to label the unlabeled examples using the small available set of labeled examples. This type of learning is referred to as semi-supervised learning, and it falls somewhere between supervised and unsupervised learning. The term semi-supervised classification is often used for this process of labeling training examples using a small labeled set, to differentiate it from semi-supervised clustering. In semi-supervised clustering, the goal is to group a given set of examples into clusters under the condition that certain examples must be clustered together and certain others must be placed in different clusters; in other words, constraints are imposed on the resulting clusters in terms of the cluster memberships of specified examples. You can see an example of semi-supervised classification in one of my earlier blog posts. In another blog post, you can read about constrained k-means clustering as a technique for semi-supervised clustering.
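The simplest semi-supervised classification recipe, pseudo-labeling, can be sketched in a few lines. Here a nearest-centroid classifier stands in for whatever base model you would actually use, and the synthetic data and label counts are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes; only the first 3 examples of each carry a label.
X0, X1 = rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))
X_lab = np.vstack([X0[:3], X1[:3]])
y_lab = np.array([0, 0, 0, 1, 1, 1])
X_unl = np.vstack([X0[3:], X1[3:]])        # 94 unlabeled examples

def predict(X, centroids):
    # Nearest-centroid classifier: label of the closest class mean.
    return np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)

# 1. Fit a simple model (here, class centroids) on the small labeled set.
centroids = np.array([X_lab[y_lab == c].mean(0) for c in (0, 1)])

# 2. Pseudo-label the unlabeled pool with that model.
pseudo = predict(X_unl, centroids)

# 3. Retrain on the union of labeled and pseudo-labeled examples.
X_all = np.vstack([X_lab, X_unl])
y_all = np.concatenate([y_lab, pseudo])
centroids = np.array([X_all[y_all == c].mean(0) for c in (0, 1)])
```

Practical variants keep only pseudo-labels the model is confident about and iterate the label-then-retrain loop several times.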

Transfer Learning

Let us now consider another possibility: we have a small set of labeled examples but cannot acquire more training examples, even unlabeled ones. How do we deal with such situations? One possible solution is transfer learning. In transfer learning, we take a predictive model that was trained on a related task and re-train it with our available labeled data. The purpose of re-training is to fine-tune the parameters of the trained model so that it performs well on our predictive task. Transfer learning is popular in deep learning, where numerous trained models are publicly available. While performing transfer learning, we often apply data augmentation to the available labeled examples to create additional labeled examples. Common data augmentation operations include translation, rotation, cropping and resizing, and blurring.
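A quick NumPy sketch of those four augmentation operations on a toy grayscale image. The wrapping shift and the 3x3 box blur are crude simplifications; image libraries provide proper border handling and Gaussian blurs.

```python
import numpy as np

rng = np.random.default_rng(2)
img = rng.random((32, 32))           # stand-in for a grayscale image

# Translation: shift 4 pixels right (wrapping at the border for simplicity).
translated = np.roll(img, shift=4, axis=1)

# Rotation: a 90-degree turn.
rotated = np.rot90(img)

# Cropping and resizing: take a 24x24 crop, then stretch it back to
# 32x32 with nearest-neighbour indexing.
crop = img[4:28, 4:28]
idx = np.arange(32) * 24 // 32       # maps output pixels to crop pixels
resized = crop[np.ix_(idx, idx)]

# Blurring: 3x3 box filter built from shifted copies of the image.
shifts = [np.roll(np.roll(img, dy, 0), dx, 1)
          for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
blurred = np.mean(shifts, axis=0)
```

Each augmented image keeps the label of the original, so a handful of labeled examples can be multiplied many times over.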

Self-Supervised Learning

Self-supervised learning is essentially unsupervised learning in which the labels, the desired predictions, are provided by the data itself; hence the name. Unlike traditional unsupervised learning, where the goal is to group data using some similarity measure, the objective of self-supervised learning is to learn latent characteristics of the data that can be useful in many ways. Although self-supervised learning has been around for a long time, for example in autoencoders, its current popularity is primarily due to its use in training large language models such as BERT, RoBERTa, and ALBERT.

To understand how the desired output is defined via self-supervision, consider masked word prediction: some words in a sentence are masked, and the model is trained to predict the masked words from the surrounding words. The masked words thus function as labels. The masking is done at random over the given corpus, so no manual labeling is needed.

The idea of self-generating labels is not limited to random masking of words; several variations at the word level as well as the sentence level are possible and have been used successfully in different language modeling efforts. For example, self-supervised learning can be employed to predict the neighboring sentences that come before and after a selected sentence in a given document. The tasks defined to perform self-supervised learning are called pretext tasks because they are not the end goal; their results are used for building the final systems.

The above ideas of self-generating labels are easily extended to images, yielding a variety of pretext tasks for self-supervised learning. For example, images can be subjected to rotations of 90, 180, or 270 degrees, and the pretext task is to predict the rotation applied to each image. Such a task makes the model learn the canonical orientation of image objects. Data augmentation is also commonly used in self-supervised learning to create image variations, as in the SimCLR paper; the pretext task there is to learn that a pair of augmented versions of the same image are similar.
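The data side of the rotation pretext task is a one-liner in NumPy; the predictive model that consumes these pairs is omitted from this sketch, and the toy image is an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
img = rng.random((8, 8))             # stand-in for a training image

# Self-generate (input, label) pairs: apply k quarter-turns and keep k
# (i.e. 0, 90, 180 or 270 degrees) as the label to be predicted.
dataset = [(np.rot90(img, k), k) for k in range(4)]
```

A model trained on such pairs must pick up on the usual orientation of objects (sky at the top, faces upright, and so on) to predict k correctly.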

Self-Supervised Learning Example

Now that we know how self-labeling is done, let us look at an example of training in self-supervised mode, based on the SimCLR paper referred to above. The basic idea is to take a minibatch of N images. For every image in the minibatch, we generate two different versions through data augmentation operators. The two augmented images are fed to a network consisting of two identical parallel encoder-projector paths. The basic processing generates hidden/latent representations hi and hj of the augmented image pair and then projects them to a suitable space in which the level of agreement can be measured.

The agreement between the projections, and hence between the hidden representations, is measured using a contrastive loss function computed over the minibatch. A minibatch of N images yields 2N augmented images, forming N positive pairs; for any positive pair, the remaining 2(N−1) augmented images in the minibatch serve as negative examples. Learning minimizes the contrastive loss, defined for a positive pair {i, j} as

ℓ(i, j) = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1[k≠i] exp(sim(z_i, z_k)/τ) ],

where z_i and z_j are the projections of the pair.

Let us try to understand this contrastive loss. The function sim(·, ·) in the numerator is similarity measured by the normalized dot product (cosine similarity), and 1[k≠i] in the denominator is the indicator function, whose value is 1 whenever k and i are not equal. The argument of the log thus measures the closeness of the {i, j} image pair relative to the closeness of the i-th image to all the other augmented images in the minibatch. The temperature τ in the loss function simply scales the similarity values.
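This loss, known as NT-Xent in the SimCLR paper, fits in a few lines of NumPy. The pair layout (rows 2m and 2m+1 are the two views of image m) and the toy projection vectors are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """Contrastive (NT-Xent) loss over a minibatch of 2N projections.

    z: array of shape (2N, d), arranged so rows 2m and 2m+1 hold the
    two augmented views of image m.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau                                # sim(z_i, z_k) / tau
    n2 = len(z)
    # Indicator 1[k != i]: drop self-similarity from the denominator.
    np.fill_diagonal(sim, -np.inf)                     # exp(-inf) = 0
    log_den = np.log(np.exp(sim).sum(axis=1))
    # The positive partner of row i is its augmented twin.
    pos = np.arange(n2) ^ 1                            # 0<->1, 2<->3, ...
    loss = -(sim[np.arange(n2), pos] - log_den)
    return loss.mean()

# Two images, two views each; views of the same image made nearly identical.
z = np.array([[1.0, 0.0], [0.99, 0.01],
              [0.0, 1.0], [0.01, 0.99]])
print(nt_xent_loss(z))
```

Swapping rows so that each "pair" holds views of different images drives the loss up, which is exactly the behavior the training objective exploits.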

The result of the pretext task in this example is the hidden representation h, which can be used for any suitable task, for example classification. Tasks that make use of the output of the pretext task are called downstream tasks. In the present example, the downstream task was a linear classifier, which was shown to yield better accuracy than previous approaches, thus establishing the benefit of self-supervised learning.

To summarize, self-supervised learning appears to be of great help in building AI systems from very large amounts of unlabeled data. Application domains such as medical decision making, where annotation calls for experts and is expensive, are poised to be major beneficiaries of self-supervised learning.

Finally, if you want to train the SimCLR model using STL10 or CIFAR10 datasets, follow the link below courtesy of Thalles Silva.
