Negative Sampling Using Word2Vec

Word2Vec, developed by Tomas Mikolov and colleagues at Google, has revolutionized natural language processing by transforming words into meaningful vector representations. Among the key innovations that made Word2Vec both efficient and effective is the technique of negative sampling. This article delves into what negative sampling is, why it’s crucial, and how it works within the Word2Vec framework.

What is Word2Vec?

Word2Vec is a set of neural network models that learn word embeddings—continuous vector representations of words—based on their context within a corpus. The two main architectures of Word2Vec are:

  • Continuous Bag of Words (CBOW): Predicts the target word from its context.
  • Skip-Gram: Predicts the context words given a target word.

Both models aim to maximize the probability of word-context pairs observed in the training corpus.
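
As a quick illustration of the Skip-Gram setup, the snippet below lists the (target, context) pairs produced from a single example sentence with a context window of 1 (the sentence and window size here are arbitrary choices made purely for illustration):

Python
# Build Skip-Gram (target, context) pairs from one sentence (window size 1).
sentence = "the cat sat on the mat".split()
window = 1

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat'), ('sat', 'on'), ...]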

The Role of Negative Sampling

Training Word2Vec models, especially the Skip-Gram model, involves handling vast amounts of data. The main bottleneck is the softmax function, which must be computed over the entire vocabulary for every training example; for large vocabularies this is prohibitively expensive. Negative sampling addresses this by simplifying the problem.
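
To see where the cost comes from, recall that the Skip-Gram model with a full softmax defines the probability of a context word $w_O$ given a target word $w_I$ as

$$P(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$

where $v_w$ and $v'_w$ are the input (target) and output (context) vectors of word $w$, and $W$ is the vocabulary size. The denominator sums over every word in the vocabulary, so each training update scales with $W$, which can run into the hundreds of thousands or millions of words.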

What is Negative Sampling?

Negative sampling is a technique that modifies the training objective from predicting the entire probability distribution of the vocabulary (as in softmax) to focusing on distinguishing the target word from a few noise (negative) words. Instead of updating the weights for all words in the vocabulary, negative sampling updates the weights for only a small number of words, significantly reducing computation.

How Does Negative Sampling Work?

In negative sampling, for each word-context pair, the model not only processes the actual context words (positive samples) but also a few randomly chosen words from the vocabulary that do not appear in the context (negative samples). The modified objective function (formalized below) aims to:

  • Maximize the probability that a word-context pair (target word and its context word) is observed in the corpus.
  • Minimize the probability that randomly sampled word-context pairs are observed.
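
Concretely, for an observed pair of target word $w_I$ and context word $w_O$, negative sampling maximizes the objective introduced in the original Word2Vec paper:

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$

where $\sigma$ is the sigmoid function, $k$ is the number of negative samples, and $P_n(w)$ is the noise distribution from which negatives are drawn (the unigram distribution raised to the 3/4 power in the original paper). The first term rewards high scores for true pairs, the second rewards low scores for the sampled noise pairs, and only the vectors of $w_I$, $w_O$, and the $k$ negatives are updated for each training pair.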

Code Implementation of Negative Sampling for Word2Vec

1. Importing Necessary Libraries and Defining Hyperparameters and the Corpus

This section sets up the initial parameters required for training the Skip-gram model with negative sampling. It also defines a small example corpus consisting of motivational quotes for training purposes.

Python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
from torch.utils.data import Dataset, DataLoader

# Hyperparameters
embedding_dim = 100
context_size = 2  # Number of context words to use
num_negative_samples = 5  # Number of negative samples per positive sample
learning_rate = 0.001
num_epochs = 5

# Example corpus
corpus = [
    "we are what we repeatedly do excellence then is not an act but a habit",
    "the only way to do great work is to love what you do",
    "if you can dream it you can do it",
    "do not wait to strike till the iron is hot but make it hot by striking",
    "whether you think you can or you think you cannot you are right",
]


2. Preprocessing the Corpus

The function preprocess_corpus tokenizes the corpus into individual words and creates a vocabulary from these words. It then maps each word to a unique index and vice versa, which will be used for training the model.

Python
# Preprocess the corpus
def preprocess_corpus(corpus):
    words = [word for sentence in corpus for word in sentence.split()]
    vocab = set(words)
    word_to_idx = {word: idx for idx, word in enumerate(vocab)}
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return words, word_to_idx, idx_to_word

words, word_to_idx, idx_to_word = preprocess_corpus(corpus)


3. Generating Training Data

The function generate_training_data creates training pairs (target, context) by considering a window of context words around each target word in the corpus. This data will be used to train the Skip-gram model.

Python
# Generate training data
def generate_training_data(words, word_to_idx, context_size):
    data = []
    # Note: sentences are concatenated into one word list, so context windows can
    # span sentence boundaries; words near the ends without a full window are skipped.
    for i in range(context_size, len(words) - context_size):
        target_word = word_to_idx[words[i]]
        context_words = [word_to_idx[words[i - j - 1]] for j in range(context_size)]
        context_words += [word_to_idx[words[i + j + 1]] for j in range(context_size)]
        for context_word in context_words:
            data.append((target_word, context_word))
    return data

training_data = generate_training_data(words, word_to_idx, context_size)
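
To verify what the generated pairs look like, you can decode and print a few of them (the exact words printed depend on the corpus and context size defined above):

Python
# Inspect a few (target, context) pairs by mapping indices back to words.
for target_idx, context_idx in training_data[:4]:
    print(idx_to_word[target_idx], "->", idx_to_word[context_idx])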


4. Custom Dataset Class

A custom PyTorch dataset class, Word2VecDataset, is defined to handle the training data. This class is then wrapped in a DataLoader to facilitate batching and shuffling during training.

Python
# Custom Dataset class
class Word2VecDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

dataset = Word2VecDataset(training_data)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)


5. Negative Sampling

The function get_negative_samples generates negative samples for each target word. These samples are used in the Skip-gram model to improve its performance by teaching it what words should not be predicted as context for a given target.

Python
# Negative Sampling
def get_negative_samples(target, num_negative_samples, vocab_size):
    neg_samples = []
    while len(neg_samples) < num_negative_samples:
        neg_sample = np.random.randint(0, vocab_size)
        if neg_sample != target:
            neg_samples.append(neg_sample)
    return neg_samples
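
The function above keeps things simple by sampling negatives uniformly and only excluding the target word itself. The original Word2Vec paper instead draws negatives from the unigram distribution raised to the 3/4 power. A minimal sketch of that variant is shown below; it reuses the words and idx_to_word variables from step 2, and the name get_negative_samples_unigram is just an illustrative choice:

Python
# Optional variant: draw negatives from the unigram distribution raised to the
# 3/4 power, as in the original Word2Vec paper (a sketch, not used below).
word_counts = Counter(words)
unigram_probs = np.array(
    [word_counts[idx_to_word[i]] for i in range(len(idx_to_word))], dtype=np.float64
) ** 0.75
unigram_probs /= unigram_probs.sum()

def get_negative_samples_unigram(target, num_negative_samples, vocab_size):
    neg_samples = []
    while len(neg_samples) < num_negative_samples:
        neg_sample = int(np.random.choice(vocab_size, p=unigram_probs))
        if neg_sample != target:
            neg_samples.append(neg_sample)
    return neg_samples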


6. Skip-gram Model with Negative Sampling

A PyTorch neural network model, SkipGramNegSampling, is defined to implement the Skip-gram model with negative sampling. This model includes embeddings for both target and context words and calculates the loss using log-sigmoid functions.

Python
# Skip-gram Model with Negative Sampling
class SkipGramNegSampling(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGramNegSampling, self).__init__()
        self.vocab_size = vocab_size
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.log_sigmoid = nn.LogSigmoid()

    def forward(self, target, context, negative_samples):
        target_embedding = self.embeddings(target)                        # (batch, dim)
        context_embedding = self.context_embeddings(context)              # (batch, dim)
        negative_embeddings = self.context_embeddings(negative_samples)   # (batch, k, dim)

        # Log-sigmoid score of the true (target, context) pair
        positive_score = self.log_sigmoid(torch.sum(target_embedding * context_embedding, dim=1))
        # Log-sigmoid scores of the k negative pairs, summed per example
        negative_score = self.log_sigmoid(-torch.bmm(negative_embeddings, target_embedding.unsqueeze(2)).squeeze(2)).sum(1)

        loss = -(positive_score + negative_score).mean()
        return loss
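
Before training, a quick sanity check with a tiny batch of dummy indices can confirm that the forward pass returns a scalar loss (a minimal sketch; the demo_ names and index values are arbitrary and carry no meaning):

Python
# Sanity check: a forward pass with arbitrary dummy indices should return a scalar loss.
demo_model = SkipGramNegSampling(len(word_to_idx), embedding_dim)
demo_target = torch.LongTensor([0, 1])                    # shape (batch,)
demo_context = torch.LongTensor([2, 3])                   # shape (batch,)
demo_negatives = torch.LongTensor([[4, 5, 6, 7, 8],
                                   [9, 10, 11, 12, 13]])  # shape (batch, num_negative_samples)
print(demo_model(demo_target, demo_context, demo_negatives))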


7. Training the Model

This section initializes the model and optimizer and then trains the model over several epochs. During each epoch, it processes the training data, computes the loss, and updates the model parameters to minimize the loss.

Python
# Training the model
vocab_size = len(word_to_idx)
model = SkipGramNegSampling(vocab_size, embedding_dim)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    total_loss = 0
    for target, context in dataloader:
        target = target.long()
        context = context.long()
        negative_samples = torch.LongTensor([get_negative_samples(t.item(), num_negative_samples, vocab_size) for t in target])

        optimizer.zero_grad()
        loss = model(target, context, negative_samples)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}, Loss: {total_loss / len(dataloader)}")


8. Getting Word Embeddings and Finding Similar Words

After training, the word embeddings are extracted from the model. A function get_similar_words is defined to find words with similar embeddings to a given word, based on cosine similarity. The code then demonstrates how to find similar words for the word “do”.

Python
# Getting the word embeddings
embeddings = model.embeddings.weight.detach().numpy()

# Function to get similar words using cosine similarity
def get_similar_words(word, top_n=5):
    idx = word_to_idx[word]
    word_embedding = embeddings[idx]
    # Cosine similarity between the query word and every word in the vocabulary
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(word_embedding)
    similarities = np.dot(embeddings, word_embedding) / norms
    closest_idxs = (-similarities).argsort()[1:top_n + 1]  # skip the word itself
    return [idx_to_word[idx] for idx in closest_idxs]

# Example usage
print(get_similar_words("do"))

Output:

['dream', 'right', 'hot', 'if', 'strike']

Advantages of Negative Sampling

  • Computational Efficiency: By reducing the number of words whose weights are updated, negative sampling makes the training of large-scale models feasible.
  • Scalability: It enables the training of word embeddings on very large corpora with extensive vocabularies.
  • Improved Performance: Negative sampling often yields better word embeddings by training the model to distinguish true context pairs from random pairs, which helps capture semantic relationships more effectively.

Conclusion

Negative sampling is a cornerstone technique that significantly enhances the efficiency and scalability of Word2Vec models. By simplifying the training objective, it allows for the effective learning of high-quality word embeddings even from large and complex datasets. Understanding and implementing negative sampling is crucial for anyone looking to leverage Word2Vec for natural language processing tasks.




