A. Clip Model Finetuning

Part I. Model Building & Training Pipeline

Step 3: Setting up Configurations:

The code below defines the hyperparameters and configuration for the CLIP model: image and caption paths, batch size, per-component learning rates, and the number of training epochs. It selects a GPU when one is available and names the encoder architectures (ResNet50 for images, DistilBERT for text). It also sets the parameters of the projection head shared by both encoders, namely the projection dimensionality and the dropout rate.

Python3

# ----- Setting up Hyper Parameters in Configurations ----- #
 
class CFG:
    debug = False
    image_path = img_folder  # Specify your image directory path
    captions_path = "."
    batch_size = 128
    num_workers = 4
    head_lr = 1e-3
    image_encoder_lr = 1e-4
    text_encoder_lr = 1e-5
    weight_decay = 1e-3
    patience = 3
    factor = 0.8
    epochs = 15
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_name = 'resnet50'
    image_embedding = 2048
    text_encoder_model = "distilbert-base-uncased"
    text_embedding = 768
    text_tokenizer = "distilbert-base-uncased"
    max_length = 200
    pretrained = True  # for both image encoder and text encoder
    trainable = True  # for both image encoder and text encoder
    temperature = 1.0
    size = 224
    # For projection head: used for both image and text encoders
    num_projection_layers = 1
    projection_dim = 256
    dropout = 0.1

Step 4: Setting up Utils:

The code below defines utilities for tracking metrics during training: an AvgMeter class that maintains a sample-weighted running average, and a get_lr function that returns the learning rate of the optimizer's first parameter group.

Python3

# ----- Setting up Utils ----- #
 
class AvgMeter:
    def __init__(self, name="Metric"):
        self.name = name
        self.reset()
    def reset(self):
        self.avg, self.sum, self.count = [0] * 3
    def update(self, val, count=1):
        self.count += count
        self.sum += val * count
        self.avg = self.sum / self.count
    def __repr__(self):
        text = f"{self.name}: {self.avg:.4f}"
        return text
 
def get_lr(optimizer):
    # Return the learning rate of the first parameter group only
    for param_group in optimizer.param_groups:
        return param_group["lr"]
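
To see why AvgMeter weights each update by count, consider a final batch that is smaller than the others. The following minimal sketch reproduces the same arithmetic in plain Python; the loss values and batch sizes are made up for illustration.

```python
# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller).
batch_losses = [3.0, 1.0, 5.0]
batch_sizes = [128, 128, 64]

running_sum, running_count = 0.0, 0
for loss, size in zip(batch_losses, batch_sizes):
    running_count += size          # same as AvgMeter.count += count
    running_sum += loss * size     # same as AvgMeter.sum += val * count
    running_avg = running_sum / running_count

# Averaging the batch means directly over-weights the small last batch.
naive_avg = sum(batch_losses) / len(batch_losses)

print(running_avg)  # sample-weighted average
print(naive_avg)    # unweighted average of batch means
```

The two numbers differ whenever batch sizes are unequal, which is why update takes a count argument.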

Step 5: Building Custom Torch Dataset:

The code below defines a custom dataset class that converts images and captions into the format the CLIP model expects. It takes image filenames, captions, a tokenizer, and transforms as inputs; the captions are tokenized once up front, and each image is loaded and transformed on access. The get_transforms function returns the image transformations for a given mode (train or otherwise); as written, both modes apply the same resize and normalization.

Python3

# ----- Building Custom Dataset ----- #
 
 
class CLIPDataset(torch.utils.data.Dataset):
    def __init__(self, image_filenames, captions, tokenizer, transforms):
        """
        image_filenames and captions must have the same length; so, if there are
        multiple captions for each image, the image_filenames must have repetitive
        file names.
        """
        self.image_filenames = image_filenames
        self.captions = list(captions)
        self.encoded_captions = tokenizer(
            list(captions), padding=True, truncation=True, max_length=CFG.max_length)
        self.transforms = transforms
 
    def __getitem__(self, idx):
        item = {
            key: torch.tensor(values[idx])
            for key, values in self.encoded_captions.items()
        }
        image = cv2.imread(f"{CFG.image_path}/{self.image_filenames[idx]}")
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = self.transforms(image=image)['image']
        item['image'] = torch.tensor(image).permute(2, 0, 1).float()
        item['caption'] = self.captions[idx]
        return item
 
    def __len__(self):
        return len(self.captions)
 
 
def get_transforms(mode="train"):
    if mode == "train":
        return A.Compose(
            [
                A.Resize(CFG.size, CFG.size, always_apply=True),
                A.Normalize(max_pixel_value=255.0, always_apply=True),
            ]
        )
    else:
        return A.Compose(
            [
                A.Resize(CFG.size, CFG.size, always_apply=True),
                A.Normalize(max_pixel_value=255.0, always_apply=True),
            ]
        )
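
The docstring's requirement that image_filenames and captions have the same length means an image with several captions appears once per caption. A minimal sketch of that flattening follows; the file names and captions are hypothetical and not taken from the article's dataset.

```python
# Hypothetical mapping from image file to its captions.
captions_by_image = {
    "sky_001.jpg": ["clear sky", "almost no clouds"],
    "sky_002.jpg": ["fully overcast"],
}

image_filenames, captions = [], []
for filename, image_captions in captions_by_image.items():
    for caption in image_captions:
        image_filenames.append(filename)  # repeat the file name once per caption
        captions.append(caption)

print(image_filenames)
print(captions)
```

Index i in both lists then refers to one (image, caption) pair, which is what __getitem__ relies on.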

Step 6: Image Encoder Class:

The image encoder class below wraps ResNet50, the image encoder of our CLIP model, and extracts features from image data. It uses a pretrained model from timm to encode each image into a fixed-size vector (2048 dimensions for ResNet50). The model architecture, the use of pretrained weights, and trainability are all configurable.

Python3

# ----- Image Encoder ----- #
 
class ImageEncoder(nn.Module):
    # Encode images to a fixed size vector
    def __init__(self, model_name=CFG.model_name, pretrained=CFG.pretrained, trainable=CFG.trainable):
        super().__init__()
        self.model = timm.create_model(
            model_name, pretrained, num_classes=0, global_pool="avg")
        for p in self.model.parameters():
            p.requires_grad = trainable
 
    def forward(self, x):
        return self.model(x)

Step 7: Text Encoder Class:

Our CLIP model uses DistilBERT as its text encoder. It extracts a sentence embedding from the tokenized text by taking the final hidden state of the first ([CLS]) token.

Python3

# ----- Text Encoder ----- #
 
class TextEncoder(nn.Module):
    def __init__(self, model_name=CFG.text_encoder_model, pretrained=CFG.pretrained, trainable=CFG.trainable):
        super().__init__()
        if pretrained:
            self.model = DistilBertModel.from_pretrained(model_name)
        else:
            self.model = DistilBertModel(config=DistilBertConfig())
        for p in self.model.parameters():
            p.requires_grad = trainable
        # We are using the CLS token hidden representation as the sentence's embedding
        self.target_token_idx = 0
 
    def forward(self, input_ids, attention_mask):
        output = self.model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = output.last_hidden_state
        return last_hidden_state[:, self.target_token_idx, :]
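
The attention_mask passed to forward tells DistilBERT which positions hold real tokens and which are padding. The sketch below mimics the effect of padding=True and truncation=True on hand-written token ID lists. The IDs are illustrative rather than real tokenizer output, pad_batch is a hypothetical helper (not part of the transformers library), and the truncation is simplified to a plain slice.

```python
PAD_ID = 0  # padding ID assumed for this illustration


def pad_batch(sequences, pad_id=PAD_ID, max_length=200):
    """Truncate to max_length, then pad every sequence to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the encoder should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2423, 102], [101, 2423, 3712, 2007, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Position 0 of every row holds the first token, which is the position TextEncoder selects with target_token_idx = 0.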

Step 8: Projection Head Class:

The code below defines a projection head that maps the image embeddings (2048-dimensional) and the text embeddings (768-dimensional) into a shared 256-dimensional space. It consists of a linear projection, a GELU activation, a second linear layer, dropout, a residual connection, and layer normalization. Projecting both modalities to the same, lower dimensionality is what allows image and text embeddings to be compared directly, and it keeps the similarity computation cheap.

Python3

# ----- Projection Head ----- #
 
class ProjectionHead(nn.Module):
    def __init__(
        self,
        embedding_dim,
        projection_dim=CFG.projection_dim,
        dropout=CFG.dropout
    ):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)
        self.gelu = nn.GELU()
        self.fc = nn.Linear(projection_dim, projection_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(projection_dim)
 
    def forward(self, x):
        projected = self.projection(x)
        x = self.gelu(projected)
        x = self.fc(x)
        x = self.dropout(x)
        x = x + projected
        x = self.layer_norm(x)
        return x
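
As a rough size check, the trainable parameters of each projection head can be counted by hand. This sketch assumes the default PyTorch behaviour in which nn.Linear carries a bias vector and nn.LayerNorm carries a weight and a bias; the helper names are hypothetical.

```python
def linear_params(in_features, out_features):
    # weight matrix plus bias vector
    return in_features * out_features + out_features


def projection_head_params(embedding_dim, projection_dim=256):
    projection = linear_params(embedding_dim, projection_dim)  # self.projection
    fc = linear_params(projection_dim, projection_dim)         # self.fc
    layer_norm = 2 * projection_dim                            # weight + bias
    # GELU and Dropout have no trainable parameters.
    return projection + fc + layer_norm


image_head = projection_head_params(2048)  # ResNet50 features
text_head = projection_head_params(768)    # DistilBERT features
print(image_head, text_head)
```

Both heads are small compared with the encoders they sit on, which is consistent with giving them the largest learning rate (head_lr) in CFG.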

Step 9: Defining Clip Model:

Now we define our custom CLIP model class, whose constructor creates the image encoder, the text encoder, and one projection head per modality. The model computes embeddings for images and texts and calculates a loss that encourages matching image-text pairs to have high similarity scores. The loss is a symmetric cross-entropy over the text-to-image similarity matrix, with soft targets derived from the image-to-image and text-to-text similarities. Training aligns image and text embeddings in a joint embedding space, which can then be used for tasks such as image-text retrieval.

Python3

# ----- CLIP Model Define ----- #
 
class CLIPModel(nn.Module):
    def __init__(
        self,
        temperature=CFG.temperature,
        image_embedding=CFG.image_embedding,
        text_embedding=CFG.text_embedding,
    ):
        super().__init__()
        self.image_encoder = ImageEncoder()
        self.text_encoder = TextEncoder()
        self.image_projection = ProjectionHead(embedding_dim=image_embedding)
        self.text_projection = ProjectionHead(embedding_dim=text_embedding)
        self.temperature = temperature
 
    def forward(self, batch):
        # Getting Image and Text Features
        image_features = self.image_encoder(batch["image"])
        text_features = self.text_encoder(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
        )
        # Getting Image and Text Embeddings (with same dimension)
        image_embeddings = self.image_projection(image_features)
        text_embeddings = self.text_projection(text_features)
        # Calculating the Loss
        logits = (text_embeddings @ image_embeddings.T) / self.temperature
        images_similarity = image_embeddings @ image_embeddings.T
        texts_similarity = text_embeddings @ text_embeddings.T
        targets = F.softmax(
            (images_similarity + texts_similarity) / 2 * self.temperature, dim=-1
        )
        texts_loss = cross_entropy(logits, targets, reduction='none')
        images_loss = cross_entropy(logits.T, targets.T, reduction='none')
        loss = (images_loss + texts_loss) / 2.0  # shape: (batch_size)
        return loss.mean()
 
def cross_entropy(preds, targets, reduction='none'):
    log_softmax = nn.LogSoftmax(dim=-1)
    loss = (-targets * log_softmax(preds)).sum(1)
    if reduction == "none":
        return loss
    elif reduction == "mean":
        return loss.mean()
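
The loss in forward can be traced by hand on a tiny batch. The following dependency-free sketch repeats the same steps with nested lists in place of tensors; the 2-dimensional embeddings are made up for illustration and the helper names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    return [e / sum(exps) for e in exps]


def log_softmax(row):
    m = max(row)
    log_total = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_total for v in row]


def matmul_t(a, b):
    """Return a @ b.T for two lists of row vectors."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def soft_cross_entropy(logits, targets):
    # One loss per row: -sum_j targets[i][j] * log_softmax(logits[i])[j]
    return [
        -sum(t * lp for t, lp in zip(t_row, log_softmax(l_row)))
        for l_row, t_row in zip(logits, targets)
    ]


def clip_loss(text_emb, image_emb, temperature=1.0):
    logits = [[v / temperature for v in row] for row in matmul_t(text_emb, image_emb)]
    images_similarity = matmul_t(image_emb, image_emb)
    texts_similarity = matmul_t(text_emb, text_emb)
    targets = [
        softmax([(i + t) / 2 * temperature for i, t in zip(i_row, t_row)])
        for i_row, t_row in zip(images_similarity, texts_similarity)
    ]
    texts_loss = soft_cross_entropy(logits, targets)
    images_loss = soft_cross_entropy(transpose(logits), transpose(targets))
    per_sample = [(i + t) / 2.0 for i, t in zip(images_loss, texts_loss)]
    return sum(per_sample) / len(per_sample)


text_emb = [[3.0, 0.0], [0.0, 3.0]]
aligned = clip_loss(text_emb, image_emb=[[3.0, 0.0], [0.0, 3.0]])
mismatched = clip_loss(text_emb, image_emb=[[0.0, 3.0], [3.0, 0.0]])
print(aligned, mismatched)
```

When each caption embedding points the same way as its own image embedding the loss is close to zero, and swapping the images makes it large, which is the behaviour the training objective relies on.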

Step 10: Defining Training Functions for Clip Model:

The code below contains the functions used to train the CLIP model. It includes functions for splitting the dataset into training and validation sets, building data loaders with the appropriate transforms, and running one training or validation epoch while tracking the average loss.

Python3

# ----- Training Methods ----- #
 
def make_train_valid_dfs(df):
    # First 130,000 records for training
    train_dataframe = df.iloc[:130000, :]
    valid_dataframe = df.iloc[130000:, :]  # Remaining records for validation
    return train_dataframe.reset_index(drop=True), valid_dataframe.reset_index(drop=True)
 
def build_loaders(dataframe, tokenizer, mode):
    transforms = get_transforms(mode=mode)
    dataset = CLIPDataset(
        dataframe["image"].values,
        dataframe["caption"].values,
        tokenizer=tokenizer,
        transforms=transforms,
    )
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=CFG.batch_size,
        num_workers=CFG.num_workers,
        shuffle=(mode == "train"),  # shuffle only the training data
    )
    return dataloader
 
def train_epoch(model, train_loader, optimizer, lr_scheduler, step):
    loss_meter = AvgMeter()
    tqdm_object = tqdm(train_loader, total=len(train_loader))
    for batch in tqdm_object:
        batch = {k: v.to(CFG.device)
                 for k, v in batch.items() if k != "caption"}
        loss = model(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step == "batch":
            lr_scheduler.step()
        count = batch["image"].size(0)
        loss_meter.update(loss.item(), count)
        tqdm_object.set_postfix(
            train_loss=loss_meter.avg, lr=get_lr(optimizer))
    return loss_meter
 
def valid_epoch(model, valid_loader):
    loss_meter = AvgMeter()
    tqdm_object = tqdm(valid_loader, total=len(valid_loader))
    for batch in tqdm_object:
        batch = {k: v.to(CFG.device)
                 for k, v in batch.items() if k != "caption"}
        loss = model(batch)
        count = batch["image"].size(0)
        loss_meter.update(loss.item(), count)
        tqdm_object.set_postfix(valid_loss=loss_meter.avg)
    return loss_meter

Step 11: Train Validation Split:

We split the input data into training and validation sets: 130,000 records for training and 3,654 for validation. We do not use a test set here, because the CLIP model is used only to extract feature embeddings from skycam images.

Python3

# ----- Train-Valid Split ----- #
 
train_df, valid_df = make_train_valid_dfs(df) 
print(len(train_df), len(valid_df))
tokenizer = DistilBertTokenizer.from_pretrained(CFG.text_tokenizer)
train_loader = build_loaders(train_df, tokenizer, mode="train")
valid_loader = build_loaders(valid_df, tokenizer, mode="valid")

Output:

130000 3654
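
The printed sizes follow directly from the position-based split, and together with the batch size they determine how many batches each loader yields. A small sketch of that arithmetic, using a list of indices as a stand-in for the DataFrame rows and assuming the DataLoader default of keeping the last, smaller batch:

```python
import math

total_records = 130_000 + 3_654       # sizes printed above
indices = list(range(total_records))  # stand-in for the DataFrame rows

# Same cut as make_train_valid_dfs: first 130,000 rows, then the rest.
train_idx, valid_idx = indices[:130_000], indices[130_000:]

batch_size = 128  # CFG.batch_size
train_batches = math.ceil(len(train_idx) / batch_size)
valid_batches = math.ceil(len(valid_idx) / batch_size)

print(len(train_idx), len(valid_idx))
print(train_batches, valid_batches)
```

Because the split is purely positional, any ordering in the source data carries over into the two sets.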

Step 12: Clip Model Finetuning:

Now we fine-tune the CLIP model on our custom data. The code below instantiates the CLIP model, whose ResNet50 and DistilBERT encoders start from pretrained weights, and sets up training. It assigns a separate learning rate to the image encoder, the text encoder, and the projection heads, and creates an AdamW optimizer with a ReduceLROnPlateau scheduler. The training loop then runs for the configured number of epochs, saves the weights whenever the validation loss improves, and lets the scheduler lower the learning rate when the validation loss plateaus.

Python3

# ----- Loading Pretrained Model ----- #
 
model = CLIPModel().to(CFG.device)
params = [
    {"params": model.image_encoder.parameters(), "lr": CFG.image_encoder_lr},
    {"params": model.text_encoder.parameters(), "lr": CFG.text_encoder_lr},
    {"params": itertools.chain(
        model.image_projection.parameters(), model.text_projection.parameters()
    ), "lr": CFG.head_lr, "weight_decay": CFG.weight_decay}
]
optimizer = torch.optim.AdamW(params, weight_decay=0.)
lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", patience=CFG.patience, factor=CFG.factor
)

Model Training

Python3

# ----- Model Training ----- #
step = "epoch"
best_loss = float('inf')
for epoch in range(CFG.epochs):
    print(f"Epoch: {epoch + 1}")
    model.train()
    train_loss = train_epoch(
        model, train_loader, optimizer, lr_scheduler, step)
    model.eval()
    with torch.no_grad():
        valid_loss = valid_epoch(model, valid_loader)
    if valid_loss.avg < best_loss:
        best_loss = valid_loss.avg
        torch.save(model.state_dict(), "CLIP_model.pt")
        print("Saved Best Model!")
    lr_scheduler.step(valid_loss.avg)
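
To get a feel for how patience=3 and factor=0.8 interact, the plateau rule can be simulated without PyTorch. This is a simplified sketch, not the library implementation: it ignores the scheduler's threshold, cooldown, and minimum learning rate options, and the validation losses are made up.

```python
def simulate_plateau(losses, lr, patience=3, factor=0.8):
    """Return the learning rate in effect after each epoch's scheduler step."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
        # Reduce only once the number of bad epochs exceeds the patience.
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Hypothetical validation losses: improvement, a four-epoch plateau, improvement.
valid_losses = [1.00, 0.90, 0.95, 0.95, 0.95, 0.95, 0.80]
lrs = simulate_plateau(valid_losses, lr=1e-3)
print(lrs)
```

In this run the learning rate stays at its initial value for three stagnant epochs and drops by a factor of 0.8 on the fourth.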

Step 13: Save the Clip Model & its configurations

Finally, we save the CLIP model and its configuration to pickle files. Step 12 already saved the best model's weights to a .pt file; as an additional backup, the .pkl files store the full model object as it stands after the last epoch, together with the CFG class.

Python3

with open('clip_mdl.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('clip_cfg.pkl', 'wb') as f:
    pickle.dump(CFG, f)
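
Note that pickle stores a class such as CFG by reference (its module and name) rather than by value, so clip_cfg.pkl can only be loaded where a CFG class with that name is defined. One possible alternative, shown here as a sketch and not as the article's method, is to pickle a plain dictionary of the configuration values. DemoCFG is a hypothetical stand-in with only a few fields.

```python
import pickle


class DemoCFG:  # hypothetical stand-in for the article's CFG class
    batch_size = 128
    projection_dim = 256
    temperature = 1.0


# Collect the class attributes, skipping dunder entries such as __module__.
cfg_values = {k: v for k, v in vars(DemoCFG).items() if not k.startswith("__")}

payload = pickle.dumps(cfg_values)  # the dictionary is stored by value
restored = pickle.loads(payload)
print(restored)
```

A dictionary saved this way can be read back without the class definition, at the cost of accessing values by key instead of by attribute.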

Cloud Coverage Prediction using Skycam Images

Cloud coverage prediction is critical in weather forecasting and a variety of applications such as solar energy generation, aviation, and climate monitoring. Accurate forecasts help decision-makers and sectors plan for and adapt to changing weather conditions. The advancement of artificial intelligence and computer vision techniques in recent years has created new opportunities for enhancing cloud coverage forecasts.

One promising approach is the use of SkyCam images.

In the face of rapidly changing global climate patterns, there is an urgent need for innovative tools and technologies to better understand and predict weather-related phenomena.
One crucial aspect of climate analysis is the assessment of cloud coverage, which plays a pivotal role in influencing weather conditions and climate trends.
Experts may not always be available to monitor climatic shifts. Therefore, developing an automated weather monitoring system is crucial for various applications, including agriculture and disaster management.

The purpose of this research is to estimate the opaque Cloud Coverage from a Skycam Image using AI/ML methodologies.

Table of Contents

Cloud Coverage Prediction using SkyCam Images
Implementations Cloud Coverage Prediction using SkyCam Images
Cloud Coverage Prediction Models:
Part I. Model Building & Training Pipeline
A. Clip Model Finetuning
B. Catboost Regressor Model Building
Part II. UI Inference Codes for Deployed Model
Results:
