Doc2Vec

Similar to Word2Vec, Doc2Vec has two types of models, analogous to CBOW and skip-gram: the Distributed Memory Model of Paragraph Vectors (PV-DM) and the Distributed Bag of Words version of Paragraph Vector (PV-DBOW). We will look at PV-DM, as it generally performs better than the PV-DBOW model.

PV-DM model

PV-DM is an extension of Word2Vec in the sense that it consists of one paragraph vector in addition to the word vectors.

  • Paragraph Vector and Word Vectors: Suppose there are N paragraphs in the corpus, and M words in the vocabulary.
    • Along the lines of Word2Vec, we will have an M*Q matrix (where Q is the dimension of the word embeddings), which is our embedding matrix for the words. Additionally, we will have an N*P matrix for the paragraphs (where P is the dimension of the paragraph embeddings).
    • The paragraph vector is shared across all contexts generated from the same paragraph but not across different paragraphs.
    • The word vector matrix W is shared across paragraphs.
  • Averaging or Concatenation: To predict the next word in a context, the paragraph vector and the context word vectors are combined using either averaging or concatenation (a minimal sketch of this prediction step follows this list).
  • Distributed Memory Model (PV-DM): The paragraph token acts as a memory that retains information about what is missing from the current context or the topic of the paragraph.
  • Training with Stochastic Gradient Descent: Stochastic gradient descent is used to train the paragraph vectors and word vectors. The gradient is obtained via backpropagation. During each step of stochastic gradient descent, a fixed-length context is sampled from a random paragraph, and the error gradient is computed to update the model parameters.
  • Inference at Prediction Time: Once the model is trained the paragraph vectors are discarded and only the word vectors and softmax weights are retained.
    • To find the paragraph vector of a new text at prediction time, an inference step is performed, again through gradient descent: the word vectors W and the softmax weights are held fixed while the new paragraph vector is learned through backpropagation.
    • To measure the similarity of two sentences, a paragraph vector is inferred for each, and the similarity between the two vectors gives the similarity between the sentences.
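
The snippet below is a minimal NumPy sketch of a single PV-DM prediction step under the notation above; the sizes, the sampled context, and the matrices D, W, U, and b are illustrative assumptions, not gensim's internals.

Python3

import numpy as np

# Illustrative sizes (assumed for this sketch, not taken from the article's example)
N, M = 3, 10           # number of paragraphs, vocabulary size
P, Q = 8, 8            # paragraph- and word-embedding dimensions
context = [2, 5, 7]    # word ids of a sampled fixed-length context
paragraph_id = 1       # id of the paragraph the context was sampled from

rng = np.random.default_rng(0)
D = rng.normal(size=(N, P))                      # paragraph embedding matrix
W = rng.normal(size=(M, Q))                      # word embedding matrix (shared across paragraphs)
U = rng.normal(size=(M, P + len(context) * Q))   # softmax weights
b = np.zeros(M)                                  # softmax bias

# Combine the paragraph vector with the context word vectors (concatenation variant)
h = np.concatenate([D[paragraph_id]] + [W[w] for w in context])

# Softmax over the vocabulary gives the probability of the next word;
# training would backpropagate the prediction error into D, W, U, and b
logits = U @ h + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("Most probable next-word id:", probs.argmax())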

In summary, the algorithm itself has two key stages:

  • Training to get word vectors W, softmax weights U, b, and paragraph vectors D on already seen paragraphs.
  • The inference stage, to get paragraph vectors D for new (never-before-seen) paragraphs by adding more columns to D and running gradient descent on D while holding W, U, and b fixed.

We use the learned paragraph vectors to predict some particular labels using a standard classifier, e.g., logistic regression.
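
As a rough sketch of this classification step, paragraph vectors learned by gensim's Doc2Vec can be fed to scikit-learn's LogisticRegression; the corpus, labels, and hyperparameters below are invented for illustration.

Python3

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled corpus (1 = positive review, 0 = negative review)
docs = ["the plot was gripping and well acted",
        "a dull, predictable film with flat characters",
        "brilliant performances and a satisfying ending",
        "i regret spending two hours on this movie"]
labels = [1, 0, 1, 0]

tagged = [TaggedDocument(words=doc.split(), tags=[str(idx)])
          for idx, doc in enumerate(docs)]

# Stage 1: train paragraph vectors (along with word vectors and softmax weights)
model = Doc2Vec(vector_size=50, min_count=1, epochs=200)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Stage 2: use the learned paragraph vectors as features for a standard classifier
X = [model.dv[str(idx)] for idx in range(len(docs))]
clf = LogisticRegression().fit(X, labels)

# For an unseen document, infer its paragraph vector first, then classify it
new_vec = model.infer_vector("a gripping thriller with great acting".split())
print("Predicted label:", clf.predict([new_vec])[0])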

Python Implementation of Doc2Vec

Below is a simple implementation of Doc2Vec using gensim.

  1. We first tokenize the words in each document and convert them to lowercase.
  2. We then create the TaggedDocument objects required for training the Doc2Vec model. Each document is associated with a unique tag (document ID), which identifies its paragraph vector.
  3. The parameters (vector_size, window, min_count, workers, epochs) control the model’s dimensions, context window size, minimum word count, parallelization, and training epochs.
  4. We then infer a vector representation for a new document that was not part of the training data.
  5. We then calculate the similarity score.

Python3

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
 
# Sample data
data = ["The movie is awesome. It was a good thriller",
        "We are learning NLP throughg w3wiki",
        "The baby learned to walk in the 5th month itself"]
 
# Tokenizing the data
tokenized_data = [word_tokenize(document.lower()) for document in data]
 
# Creating TaggedDocument objects
tagged_data = [TaggedDocument(words=words, tags=[str(idx)])
               for idx, words in enumerate(tokenized_data)]
 
 
# Training the Doc2Vec model
model = Doc2Vec(vector_size=100, window=2, min_count=1, workers=4, epochs=1000)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count,
            epochs=model.epochs)
 
# Infer vector for a new document
new_document = "The baby was laughing and palying"
print('Original Document:', new_document)
 
inferred_vector = model.infer_vector(word_tokenize(new_document.lower()))
 
# Find most similar documents
similar_documents = model.dv.most_similar(
    [inferred_vector], topn=len(model.dv))
 
# Print the most similar documents
for index, score in similar_documents:
    print(f"Document {index}: Similarity Score: {score}")
    print(f"Document Text: {data[int(index)]}")
    print()


Output:

Original Document: The baby was laughing and palying
Document 2: Similarity Score: 0.9838361740112305
Document Text: The baby learned to walk in the 5th month itself

Document 0: Similarity Score: 0.9455077648162842
Document Text: The movie is awesome. It was a good thriller

Document 1: Similarity Score: 0.8828089833259583
Document Text: We are learning NLP throughg w3wiki
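
To compare two arbitrary sentences directly, as described in the PV-DM section, one can infer a paragraph vector for each and take the cosine similarity between them. Below is a short sketch, assuming the model and word_tokenize from the example above are still in scope and using two made-up sentences.

Python3

import numpy as np

# Two hypothetical sentences to compare
sentence_1 = "The baby started walking very early"
sentence_2 = "The movie was a good thriller"

# Infer a paragraph vector for each sentence
vec_1 = model.infer_vector(word_tokenize(sentence_1.lower()))
vec_2 = model.infer_vector(word_tokenize(sentence_2.lower()))

# Cosine similarity between the two inferred vectors
similarity = np.dot(vec_1, vec_2) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))
print(f"Sentence similarity: {similarity:.4f}")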
