SBERT

SBERT adds a pooling operation to the output of BERT to derive a fixed-size sentence embedding. The sentence is converted into word embeddings and passed through the BERT network to get contextual token vectors. The researchers experimented with different pooling options and found that mean pooling works best: the contextual token vectors are averaged to obtain the sentence embedding.
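
To make the pooling step concrete, here is a minimal sketch (not SBERT's internal code) of mean pooling over BERT token outputs using the Hugging Face transformers library; the backbone name 'bert-base-uncased' and the example sentences are placeholders chosen for illustration.

Python3

import torch
from transformers import AutoTokenizer, AutoModel

# Sketch of SBERT-style mean pooling over BERT token embeddings.
# 'bert-base-uncased' is only an example backbone.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = AutoModel.from_pretrained('bert-base-uncased')

sentences = ["The movie is awesome.", "We are learning NLP."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    token_embeddings = bert(**encoded).last_hidden_state   # (batch, seq_len, hidden)

# Average the token vectors, ignoring padding positions, to get one vector per sentence.
mask = encoded['attention_mask'].unsqueeze(-1).float()      # (batch, seq_len, 1)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)                            # torch.Size([2, 768])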

SBERT uses three objective functions to update the weights of the BERT model. The BERT model is structured differently depending on the type of training data that drives the objective function.

1. Classification Objective Function

  • This model architecture uses pairs of sentences along with labels as training data.
  • Here the BERT model is structured as a Siamese network: two identical subnetworks, each of which is a BERT model, share the same parameters/weights, and parameter updates are mirrored across both sub-models. On top of the pooling layer, we have a softmax classifier with as many nodes as there are labels in the training data.
  • The two sentences are passed through the Siamese network to get sentence embeddings u and v along with the element-wise difference |u-v|. These three vectors (u, v, |u-v|) are multiplied with a trainable weight matrix W of size (3n*k) to get the softmax classification.

    Here,
    • n is the dimension of the sentence embeddings
    • k is the number of labels.
  • The optimization is performed using cross-entropy loss.

SBERT with Classification Objective Function
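
The following is an illustrative PyTorch sketch of this classification head (not the official SBERT training code): the embeddings u and v are stand-in tensors, and the label is a placeholder.

Python3

import torch
import torch.nn as nn

n, k = 768, 3                     # n = sentence embedding dimension, k = number of labels
W = nn.Linear(3 * n, k)           # trainable weight matrix of size (3n x k)

u = torch.randn(1, n)             # pooled embedding of sentence A (placeholder values)
v = torch.randn(1, n)             # pooled embedding of sentence B (placeholder values)

features = torch.cat([u, v, torch.abs(u - v)], dim=1)   # concatenate (u, v, |u-v|) -> (1, 3n)
logits = W(features)                                     # (1, k)
label = torch.tensor([1])                                # placeholder gold label

# CrossEntropyLoss applies the softmax internally; in actual training the gradients
# flow back through the shared BERT encoder that produced u and v.
loss = nn.CrossEntropyLoss()(logits, label)
loss.backward()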

2. Regression Objective Function

This also uses pairs of sentences with labels as training data, and the network is again structured as a Siamese network. However, instead of the softmax layer, the output of the pooling layer is used to calculate the cosine similarity, and mean-squared-error loss is used as the objective function to train the BERT model weights.

SBERT with Regression Objective Function
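
As a rough sketch (again with placeholder tensors standing in for the pooled BERT outputs and an assumed gold score), the regression objective can be written as:

Python3

import torch
import torch.nn as nn
import torch.nn.functional as F

u = torch.randn(1, 768, requires_grad=True)   # pooled embedding of sentence A (placeholder)
v = torch.randn(1, 768, requires_grad=True)   # pooled embedding of sentence B (placeholder)
gold_score = torch.tensor([0.8])              # assumed gold similarity label, e.g. an STS score scaled to [0, 1]

cos_sim = F.cosine_similarity(u, v)           # predicted similarity
loss = nn.MSELoss()(cos_sim, gold_score)      # mean-squared-error objective
loss.backward()                                # in training this updates the shared BERT weights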

3. Triplet Objective Function

Here the model is structured as a triplet network.

  • In a Triplet Network, three subnetworks process an anchor sentence, a positive (similar) sentence, and a negative (dissimilar) sentence. The model learns to minimize the distance between the anchor and positive sentences while maximizing the distance between the anchor and negative sentences.
  • To train the model we need a dataset that has an anchor sentence a, a positive sentence p, and a negative sentence n. An example of such a dataset is 'The Wikipedia section triplets dataset'.

Mathematically, we minimize the following loss function:

max(||s_a − s_p|| − ||s_a − s_n|| + ϵ, 0)

where
  • s_a, s_p, s_n are the sentence embeddings of the anchor, positive, and negative sentences respectively,
  • || · || is a distance metric, and
  • ϵ is the margin. The margin ϵ ensures that s_p is at least ϵ closer to s_a than s_n.
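
A minimal sketch of this triplet loss with Euclidean distance is shown below; the three embeddings are placeholder tensors standing in for the pooled BERT outputs, and the margin value is arbitrary.

Python3

import torch
import torch.nn.functional as F

s_a = torch.randn(1, 768, requires_grad=True)   # anchor sentence embedding (placeholder)
s_p = torch.randn(1, 768, requires_grad=True)   # positive sentence embedding (placeholder)
s_n = torch.randn(1, 768, requires_grad=True)   # negative sentence embedding (placeholder)
margin = 1.0                                    # the margin epsilon in the formula above

d_pos = F.pairwise_distance(s_a, s_p)           # ||s_a - s_p||
d_neg = F.pairwise_distance(s_a, s_n)           # ||s_a - s_n||
loss = torch.clamp(d_pos - d_neg + margin, min=0).mean()
loss.backward()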

Python Implementation

To implement it, we first need to install the Sentence Transformers framework:

!pip install -U sentence-transformers
  • The SentenceTransformer class is used to load an SBERT model.
  • We use the SciPy cosine distance to calculate the distance between the two vectors. To get the similarity, we subtract it from 1.

Python3

#!pip install -U sentence-transformers
 
from scipy.spatial import distance
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
 
# Sample sentences
sentences = ["The movie is awesome. It was a good thriller",
             "We are learning NLP throughg w3wiki",
             "The baby learned to walk in the 5th month itself"]
 
 
test = "I liked the movie."
print('Test sentence:',test)
test_vec = model.encode([test])[0]
 
 
# Similarity = 1 - cosine distance between the test sentence and each sample sentence
for sent in sentences:
    similarity_score = 1 - distance.cosine(test_vec, model.encode([sent])[0])
    print(f'\nFor {sent}\nSimilarity Score = {similarity_score} ')

                    

Output:

Test sentence: I liked the movie.

For The movie is awesome. It was a good thriller
Similarity Score = 0.682051956653595

For We are learning NLP throughg w3wiki
Similarity Score = 0.0878136083483696

For The baby learned to walk in the 5th month itself
Similarity Score = 0.04816452041268349

Different Techniques for Sentence Semantic Similarity in NLP

Semantic similarity is the similarity between two words, sentences, phrases, or texts. It measures how close or how different two pieces of text are in terms of their meaning and context.

In this article, we will focus on how the semantic similarity between two sentences is derived. We will cover the following most commonly used models.

  1. Doc2Vec – An extension of word2vec
  2. SBERT – A transformer-based model in which the encoder part captures the meaning of words in a sentence.
  3. InferSent – It uses a bi-directional LSTM to encode sentences and infer semantics.
  4. USE (Universal Sentence Encoder) – A model trained by Google that generates fixed-size embeddings for sentences that can be used for any NLP task.
