Code Implementation of Negative Sampling for word2vec

1. Importing Necessary Libraries, Hyperparameters, and Corpus

This section sets up the initial parameters required for training the Skip-gram model with negative sampling. It also defines a small example corpus consisting of motivational quotes for training purposes.

Python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
from torch.utils.data import Dataset, DataLoader

# Hyperparameters
embedding_dim = 100
context_size = 2  # Context window size: number of words on each side of the target
num_negative_samples = 5  # Number of negative samples per positive sample
learning_rate = 0.001
num_epochs = 5

# Example corpus
corpus = [
    "we are what we repeatedly do excellence then is not an act but a habit",
    "the only way to do great work is to love what you do",
    "if you can dream it you can do it",
    "do not wait to strike till the iron is hot but make it hot by striking",
    "whether you think you can or you think you cannot you are right",
]


2. Preprocessing the Corpus

The function preprocess_corpus tokenizes the corpus into individual words and creates a vocabulary from these words. It then maps each word to a unique index and vice versa, which will be used for training the model.

Python
# Preprocess the corpus
def preprocess_corpus(corpus):
    words = [word for sentence in corpus for word in sentence.split()]
    vocab = set(words)
    word_to_idx = {word: idx for idx, word in enumerate(vocab)}
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return words, word_to_idx, idx_to_word

words, word_to_idx, idx_to_word = preprocess_corpus(corpus)
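
A quick sanity check confirms what the preprocessing step produced (this uses only the variables defined above):

Python
# Inspect the preprocessing output
print(f"Total tokens in corpus: {len(words)}")
print(f"Vocabulary size: {len(word_to_idx)}")
print(f"Index assigned to 'do': {word_to_idx['do']}")  # exact index depends on set ordering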


3. Generating Training Data

The function generate_training_data creates training pairs (target, context) by considering a window of context words around each target word in the corpus. This data will be used to train the Skip-gram model.

Python
# Generate training data
def generate_training_data(words, word_to_idx, context_size):
    data = []
    for i in range(context_size, len(words) - context_size):
        target_word = word_to_idx[words[i]]
        context_words = [word_to_idx[words[i - j - 1]] for j in range(context_size)]
        context_words += [word_to_idx[words[i + j + 1]] for j in range(context_size)]
        for context_word in context_words:
            data.append((target_word, context_word))
    return data

training_data = generate_training_data(words, word_to_idx, context_size)
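
To see what these pairs look like, the first few (target, context) index pairs can be decoded back into words (a small inspection snippet, not part of the training pipeline itself):

Python
# Decode a few (target, context) pairs back into words
print(f"Number of training pairs: {len(training_data)}")
for target_idx, context_idx in training_data[:5]:
    print(idx_to_word[target_idx], "->", idx_to_word[context_idx])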


4. Custom Dataset Class

A custom PyTorch dataset class, Word2VecDataset, is defined to handle the training data. This class is then wrapped in a DataLoader to facilitate batching and shuffling during training.

Python
# Custom Dataset class
class Word2VecDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

dataset = Word2VecDataset(training_data)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
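
Pulling a single batch from the DataLoader confirms the shapes it yields (the exact values change with shuffling):

Python
# Peek at one batch: two 1-D LongTensors of length batch_size (the last batch may be smaller)
targets, contexts = next(iter(dataloader))
print(targets.shape, contexts.shape)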


5. Negative Sampling

The function get_negative_samples draws random word indices (uniformly, excluding the target word itself) to serve as negative samples for each target word. During training, the Skip-gram model learns to assign low scores to these words as context for the given target, which teaches it to distinguish true context words from random ones.

Python
# Negative Sampling
def get_negative_samples(target, num_negative_samples, vocab_size):
    neg_samples = []
    while len(neg_samples) < num_negative_samples:
        neg_sample = np.random.randint(0, vocab_size)
        if neg_sample != target:
            neg_samples.append(neg_sample)
    return neg_samples
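
A quick call illustrates the function. Note that this implementation samples negatives uniformly at random; the original word2vec paper instead draws them from the unigram distribution raised to the 3/4 power. A sketch of that variant is shown below (the get_negative_samples_freq helper is illustrative and not part of the original code):

Python
# Example call: 5 negative indices, none equal to the target index
print(get_negative_samples(word_to_idx["do"], num_negative_samples, len(word_to_idx)))

# Sketch: frequency-based negative sampling (unigram counts raised to 0.75),
# closer to the noise distribution used in the original word2vec paper.
word_counts = Counter(words)
freqs = np.array([word_counts[idx_to_word[i]] for i in range(len(word_to_idx))], dtype=np.float64)
probs = freqs ** 0.75
probs /= probs.sum()

def get_negative_samples_freq(target, num_negative_samples):
    neg_samples = []
    while len(neg_samples) < num_negative_samples:
        neg_sample = int(np.random.choice(len(probs), p=probs))
        if neg_sample != target:
            neg_samples.append(neg_sample)
    return neg_samples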


6. Skip-gram Model with Negative Sampling

A PyTorch neural network model, SkipGramNegSampling, is defined to implement the Skip-gram model with negative sampling. This model includes embeddings for both target and context words and calculates the loss using log-sigmoid functions.

Python
# Skip-gram Model with Negative Sampling
class SkipGramNegSampling(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGramNegSampling, self).__init__()
        self.vocab_size = vocab_size
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.log_sigmoid = nn.LogSigmoid()

    def forward(self, target, context, negative_samples):
        target_embedding = self.embeddings(target)
        context_embedding = self.context_embeddings(context)
        negative_embeddings = self.context_embeddings(negative_samples)
        
        positive_score = self.log_sigmoid(torch.sum(target_embedding * context_embedding, dim=1))
        negative_score = self.log_sigmoid(-torch.bmm(negative_embeddings, target_embedding.unsqueeze(2)).squeeze(2)).sum(1)
        
        loss = - (positive_score + negative_score).mean()
        return loss
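
For reference, the quantity computed in forward corresponds to the standard skip-gram negative-sampling loss, averaged over the batch, where v_t, v_c, and v_{n_i} are the target, context, and negative-sample embeddings and k is the number of negatives:

LaTeX
\mathcal{L} = -\log \sigma\left(v_c^{\top} v_t\right) - \sum_{i=1}^{k} \log \sigma\left(-v_{n_i}^{\top} v_t\right)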


7. Training the Model

This section initializes the model and optimizer and then trains the model over several epochs. During each epoch, it processes the training data, computes the loss, and updates the model parameters to minimize the loss.

Python
# Training the model
vocab_size = len(word_to_idx)
model = SkipGramNegSampling(vocab_size, embedding_dim)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    total_loss = 0
    for target, context in dataloader:
        target = target.long()
        context = context.long()
        negative_samples = torch.LongTensor([get_negative_samples(t.item(), num_negative_samples, vocab_size) for t in target])

        optimizer.zero_grad()
        loss = model(target, context, negative_samples)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}, Loss: {total_loss / len(dataloader)}")


8. Getting Word Embeddings and Finding Similar Words

After training, the word embeddings are extracted from the model. A function get_similar_words is defined to find the words whose embeddings are most similar to that of a given word, ranked here by a simple dot product (a cosine-similarity variant is sketched after the output). The code then demonstrates how to find similar words for the word “do”.

Python
# Getting the word embeddings
embeddings = model.embeddings.weight.detach().numpy()

# Function to get similar words
def get_similar_words(word, top_n=5):
    idx = word_to_idx[word]
    word_embedding = embeddings[idx]
    similarities = np.dot(embeddings, word_embedding)
    closest_idxs = (-similarities).argsort()[1:top_n+1]
    return [idx_to_word[idx] for idx in closest_idxs]

# Example usage
print(get_similar_words("do"))

Output:

['dream', 'right', 'hot', 'if', 'strike']
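
Because the snippet above ranks words by a raw dot product, vectors with larger norms can dominate the ranking. A variant that normalizes the embeddings first, giving true cosine similarity, might look like this (get_similar_words_cosine is an illustrative name, not part of the original code):

Python
# Cosine-similarity variant: normalize embeddings to unit length before ranking
def get_similar_words_cosine(word, top_n=5):
    idx = word_to_idx[word]
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = normalized @ normalized[idx]
    closest_idxs = (-similarities).argsort()[1:top_n + 1]  # index 0 is the word itself
    return [idx_to_word[i] for i in closest_idxs]

print(get_similar_words_cosine("do"))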

Advantages of Negative Sampling

  • Computational Efficiency: By reducing the number of words whose weights are updated for each training pair, negative sampling makes the training of large-scale models feasible (see the comparison after this list).
  • Scalability: It enables the training of word embeddings on very large corpora with extensive vocabularies.
  • Improved Performance: Negative sampling often leads to better word embeddings by focusing on distinguishing true context pairs from random pairs, which helps in capturing the semantic relationships more effectively.
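
To make the efficiency argument concrete, compare the per-pair cost of the full softmax objective, whose denominator sums over the entire vocabulary V, with the negative-sampling loss shown earlier, which involves only the true context vector and k sampled negatives (k = 5 in this example):

LaTeX
\text{Full softmax: } -\log \frac{\exp\left(v_c^{\top} v_t\right)}{\sum_{w=1}^{|V|} \exp\left(v_w^{\top} v_t\right)} \quad \left(O(|V|) \text{ terms per pair}\right)
\text{Negative sampling: } -\log \sigma\left(v_c^{\top} v_t\right) - \sum_{i=1}^{k} \log \sigma\left(-v_{n_i}^{\top} v_t\right) \quad \left(O(k) \text{ terms per pair}\right)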


Conclusion

Negative sampling is a cornerstone technique that significantly enhances the efficiency and scalability of Word2Vec models. By simplifying the training objective, it allows for the effective learning of high-quality word embeddings even from large and complex datasets. Understanding and implementing negative sampling is crucial for anyone looking to leverage Word2Vec for natural language processing tasks.
