A. Clip Model Finetuning
Step 3: Setting up Configurations:
The code below sets the hyperparameters and configuration for the CLIP model: the image and caption paths, batch size, per-component learning rates, and number of training epochs. It selects a GPU when one is available and names the image encoder (ResNet50) and text encoder (DistilBERT) architectures. It also defines the parameters of the projection head used for both encoders, including the projection dimensionality and dropout rate.
Python3
# ----- Setting up Hyper Parameters in Configurations ----- #
class CFG:
    debug = False
    image_path = img_folder  # Specify your Image directory path
    captions_path = "."
    batch_size = 128
    num_workers = 4
    head_lr = 1e-3
    image_encoder_lr = 1e-4
    text_encoder_lr = 1e-5
    weight_decay = 1e-3
    patience = 3
    factor = 0.8
    epochs = 15
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model_name = 'resnet50'
    image_embedding = 2048
    text_encoder_model = "distilbert-base-uncased"
    text_embedding = 768
    text_tokenizer = "distilbert-base-uncased"
    max_length = 200

    pretrained = True  # for both image encoder and text encoder
    trainable = True  # for both image encoder and text encoder
    temperature = 1.0

    size = 224

    # For projection head: used for both image and text encoders
    num_projection_layers = 1
    projection_dim = 256
    dropout = 0.1
Step 4: Setting up Utils:
The code below defines utility helpers for tracking metrics during training. It includes an AvgMeter class that keeps a sample-weighted running average and a get_lr function that returns the learning rate of the optimizer's first parameter group.
Python3
# ----- Setting up Utils ----- #
class AvgMeter:
    def __init__(self, name="Metric"):
        self.name = name
        self.reset()

    def reset(self):
        self.avg, self.sum, self.count = [0] * 3

    def update(self, val, count=1):
        self.count += count
        self.sum += val * count
        self.avg = self.sum / self.count

    def __repr__(self):
        text = f"{self.name}: {self.avg:.4f}"
        return text


def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group["lr"]
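The count argument of update matters when batches differ in size, as the last batch of an epoch usually does. The sketch below repeats the AvgMeter logic so that it runs on its own; the loss values and batch sizes are made up for illustration.

```python
class AvgMeter:
    """Same running-average logic as the AvgMeter defined above."""

    def __init__(self, name="Metric"):
        self.name = name
        self.reset()

    def reset(self):
        self.avg, self.sum, self.count = [0] * 3

    def update(self, val, count=1):
        self.count += count
        self.sum += val * count
        self.avg = self.sum / self.count


# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller).
batch_losses = [2.0, 1.0, 4.0]
batch_sizes = [128, 128, 64]

meter = AvgMeter("train_loss")
for loss, size in zip(batch_losses, batch_sizes):
    meter.update(loss, size)

# Weighted by sample count: (2.0*128 + 1.0*128 + 4.0*64) / 320
weighted_avg = meter.avg
# A plain mean over batches would over-weight the short last batch.
naive_avg = sum(batch_losses) / len(batch_losses)
print(weighted_avg, naive_avg)
```

Passing the batch size as count is what train_epoch and valid_epoch do later, so the reported loss is a per-sample average rather than a per-batch one.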
Step 5: Building Custom Torch Dataset:
The code below defines a custom dataset class that converts the input images and captions into the format the CLIP model expects. It takes image filenames, captions, a tokenizer, and transforms as inputs; the captions are tokenized once up front, and each image is read and transformed when it is accessed. The get_transforms function returns the image transforms for a given mode; as written, the train and validation pipelines are identical (resize and normalize).
Python3
# ----- Building Custom Dataset ----- #
class CLIPDataset(torch.utils.data.Dataset):
    def __init__(self, image_filenames, captions, tokenizer, transforms):
        """
        image_filenames and captions must have the same length; so, if there are
        multiple captions for each image, the image_filenames must have repetitive
        file names.
        """
        self.image_filenames = image_filenames
        self.captions = list(captions)
        self.encoded_captions = tokenizer(
            list(captions), padding=True, truncation=True, max_length=CFG.max_length
        )
        self.transforms = transforms

    def __getitem__(self, idx):
        item = {
            key: torch.tensor(values[idx])
            for key, values in self.encoded_captions.items()
        }
        image = cv2.imread(f"{CFG.image_path}/{self.image_filenames[idx]}")
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = self.transforms(image=image)['image']
        item['image'] = torch.tensor(image).permute(2, 0, 1).float()
        item['caption'] = self.captions[idx]
        return item

    def __len__(self):
        return len(self.captions)


def get_transforms(mode="train"):
    if mode == "train":
        return A.Compose(
            [
                A.Resize(CFG.size, CFG.size, always_apply=True),
                A.Normalize(max_pixel_value=255.0, always_apply=True),
            ]
        )
    else:
        return A.Compose(
            [
                A.Resize(CFG.size, CFG.size, always_apply=True),
                A.Normalize(max_pixel_value=255.0, always_apply=True),
            ]
        )
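The docstring's requirement that image_filenames and captions have the same length can be met by repeating a file name once per caption. A minimal sketch, assuming the annotations start out as a mapping from file name to a list of captions (the file names and captions here are hypothetical):

```python
# Hypothetical annotations: one image can have several captions.
captions_per_image = {
    "sky_001.jpg": ["clear sky", "no clouds visible"],
    "sky_002.jpg": ["overcast sky"],
}

image_filenames = []
captions = []
for filename, image_captions in captions_per_image.items():
    for caption in image_captions:
        image_filenames.append(filename)  # repeat the file name per caption
        captions.append(caption)

# The two lists are now parallel, as CLIPDataset expects.
assert len(image_filenames) == len(captions)
print(list(zip(image_filenames, captions)))
```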
Step 6: Image Encoder Class:
The CLIP model uses the Image Encoder class below to pass images through ResNet50, the image encoder for this CLIP model. The class wraps a pre-trained model created with timm and encodes each image into a fixed-size feature vector (2048 dimensions for ResNet50, matching CFG.image_embedding). The model's architecture, pre-training status, and trainability are configurable.
Python3
# ----- Image Encoder ----- #
class ImageEncoder(nn.Module):
    # Encode images to a fixed size vector
    def __init__(self, model_name=CFG.model_name, pretrained=CFG.pretrained,
                 trainable=CFG.trainable):
        super().__init__()
        self.model = timm.create_model(
            model_name, pretrained, num_classes=0, global_pool="avg"
        )
        for p in self.model.parameters():
            p.requires_grad = trainable

    def forward(self, x):
        return self.model(x)
Step 7: Text Encoder Class:
The CLIP model's text encoder is DistilBERT. It extracts a sentence embedding from the text input by taking the final hidden state of the first (CLS) token.
Python3
# ----- Text Encoder ----- #
class TextEncoder(nn.Module):
    def __init__(self, model_name=CFG.text_encoder_model, pretrained=CFG.pretrained,
                 trainable=CFG.trainable):
        super().__init__()
        if pretrained:
            self.model = DistilBertModel.from_pretrained(model_name)
        else:
            self.model = DistilBertModel(config=DistilBertConfig())

        for p in self.model.parameters():
            p.requires_grad = trainable

        # We are using the CLS token hidden representation as the sentence's embedding
        self.target_token_idx = 0

    def forward(self, input_ids, attention_mask):
        output = self.model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = output.last_hidden_state
        return last_hidden_state[:, self.target_token_idx, :]
Step 8: Projection Head Class:
The code below defines a projection head that maps the image embeddings and text embeddings into a space of the same, lower dimensionality. It consists of a linear projection, a GELU activation, a second linear layer, dropout, a residual connection, and layer normalization. Projecting both encoders' outputs (2048-dimensional for images, 768-dimensional for text) to a common 256-dimensional space makes them directly comparable and keeps the similarity computation small.
Python3
# ----- Projection Head ----- #
class ProjectionHead(nn.Module):
    def __init__(
        self,
        embedding_dim,
        projection_dim=CFG.projection_dim,
        dropout=CFG.dropout
    ):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)
        self.gelu = nn.GELU()
        self.fc = nn.Linear(projection_dim, projection_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(projection_dim)

    def forward(self, x):
        projected = self.projection(x)
        x = self.gelu(projected)
        x = self.fc(x)
        x = self.dropout(x)
        x = x + projected
        x = self.layer_norm(x)
        return x
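To get a feel for the size of each projection head, its trainable parameters can be counted by hand. The sketch below is plain arithmetic rather than PyTorch code; it assumes the default settings of nn.Linear (weight plus bias) and nn.LayerNorm (learnable scale and shift), and the helper name projection_head_params is introduced here only for illustration.

```python
def projection_head_params(embedding_dim, projection_dim=256):
    """Count trainable parameters of the ProjectionHead defined above."""
    projection = embedding_dim * projection_dim + projection_dim  # Linear weight + bias
    fc = projection_dim * projection_dim + projection_dim         # Linear weight + bias
    layer_norm = 2 * projection_dim                               # scale + shift
    return projection + fc + layer_norm  # GELU and Dropout have no parameters


image_head = projection_head_params(2048)  # ResNet50 feature size
text_head = projection_head_params(768)    # DistilBERT hidden size
print(image_head, text_head)
```

Under these assumptions the image head has 590,848 parameters and the text head 263,168, both small next to the encoders they sit on.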
Step 9: Defining Clip Model:
Now we define our custom CLIP model class, whose constructor initializes the image encoder, the text encoder, and a projection head for each. The model computes embeddings for images and texts and calculates a loss that encourages matching image-text pairs to have high similarity scores. The targets are soft: they are a softmax over the average of the image-to-image and text-to-text similarity matrices, and cross-entropy is computed against them in both directions (text-to-image and image-to-text). The aim is to align image and text embeddings in a joint embedding space for applications such as image-text retrieval.
Python3
# ----- CLIP Model Define ----- #
class CLIPModel(nn.Module):
    def __init__(
        self,
        temperature=CFG.temperature,
        image_embedding=CFG.image_embedding,
        text_embedding=CFG.text_embedding,
    ):
        super().__init__()
        self.image_encoder = ImageEncoder()
        self.text_encoder = TextEncoder()
        self.image_projection = ProjectionHead(embedding_dim=image_embedding)
        self.text_projection = ProjectionHead(embedding_dim=text_embedding)
        self.temperature = temperature

    def forward(self, batch):
        # Getting Image and Text Features
        image_features = self.image_encoder(batch["image"])
        text_features = self.text_encoder(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
        )
        # Getting Image and Text Embeddings (with same dimension)
        image_embeddings = self.image_projection(image_features)
        text_embeddings = self.text_projection(text_features)

        # Calculating the Loss
        logits = (text_embeddings @ image_embeddings.T) / self.temperature
        images_similarity = image_embeddings @ image_embeddings.T
        texts_similarity = text_embeddings @ text_embeddings.T
        targets = F.softmax(
            (images_similarity + texts_similarity) / 2 * self.temperature, dim=-1
        )
        texts_loss = cross_entropy(logits, targets, reduction='none')
        images_loss = cross_entropy(logits.T, targets.T, reduction='none')
        loss = (images_loss + texts_loss) / 2.0  # shape: (batch_size)
        return loss.mean()


def cross_entropy(preds, targets, reduction='none'):
    log_softmax = nn.LogSoftmax(dim=-1)
    loss = (-targets * log_softmax(preds)).sum(1)
    if reduction == "none":
        return loss
    elif reduction == "mean":
        return loss.mean()
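The loss can be easier to follow on a tiny batch. The sketch below re-implements the same steps with plain Python lists instead of tensors, purely for inspection; the helper names and the 2-D embeddings are made up for illustration, and it is not a replacement for the PyTorch code above.

```python
import math


def matmul_t(a, b):
    """Return a @ b.T for two lists of row vectors."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(logits, targets):
    """Per-row cross-entropy between soft targets and softmax(logits)."""
    losses = []
    for logit_row, target_row in zip(logits, targets):
        log_probs = [math.log(p) for p in softmax(logit_row)]
        losses.append(-sum(t * lp for t, lp in zip(target_row, log_probs)))
    return losses


def clip_loss(image_emb, text_emb, temperature=1.0):
    logits = [[v / temperature for v in row] for row in matmul_t(text_emb, image_emb)]
    img_sim = matmul_t(image_emb, image_emb)
    txt_sim = matmul_t(text_emb, text_emb)
    targets = [
        softmax([(i + t) / 2 * temperature for i, t in zip(img_row, txt_row)])
        for img_row, txt_row in zip(img_sim, txt_sim)
    ]
    texts_loss = soft_cross_entropy(logits, targets)
    images_loss = soft_cross_entropy(transpose(logits), transpose(targets))
    per_pair = [(i + t) / 2 for i, t in zip(images_loss, texts_loss)]
    return sum(per_pair) / len(per_pair)


# Hypothetical 2-D embeddings for a batch of two image-caption pairs.
images = [[3.0, 0.0], [0.0, 3.0]]
aligned = clip_loss(images, [[3.0, 0.0], [0.0, 3.0]])  # captions match their images
swapped = clip_loss(images, [[0.0, 3.0], [3.0, 0.0]])  # captions are swapped
print(aligned, swapped)
```

When each caption embedding points the same way as its image embedding, the loss is close to zero; swapping the captions makes it large, which is the signal that pulls matching pairs together during training.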
Step 10: Defining Training Functions for Clip Model:
The code below contains the methods needed to train the CLIP model: a function that splits the dataset into training and validation sets, a function that builds data loaders with the appropriate transforms using the batch size and worker count from CFG, and functions that run one training epoch and one validation epoch.
Python3
# ----- Training Methods ----- #
def make_train_valid_dfs(df):
    # First 130,000 records for training
    train_dataframe = df.iloc[:130000, :]
    # Remaining records for validation
    valid_dataframe = df.iloc[130000:, :]
    return train_dataframe.reset_index(drop=True), valid_dataframe.reset_index(drop=True)


def build_loaders(dataframe, tokenizer, mode):
    transforms = get_transforms(mode=mode)
    dataset = CLIPDataset(
        dataframe["image"].values,
        dataframe["caption"].values,
        tokenizer=tokenizer,
        transforms=transforms,
    )
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=CFG.batch_size,
        num_workers=CFG.num_workers,
        shuffle=(mode == "train"),  # shuffle only the training data
    )
    return dataloader


def train_epoch(model, train_loader, optimizer, lr_scheduler, step):
    loss_meter = AvgMeter()
    tqdm_object = tqdm(train_loader, total=len(train_loader))
    for batch in tqdm_object:
        batch = {k: v.to(CFG.device) for k, v in batch.items() if k != "caption"}
        loss = model(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step == "batch":
            lr_scheduler.step()

        count = batch["image"].size(0)
        loss_meter.update(loss.item(), count)

        tqdm_object.set_postfix(train_loss=loss_meter.avg, lr=get_lr(optimizer))
    return loss_meter


def valid_epoch(model, valid_loader):
    loss_meter = AvgMeter()

    tqdm_object = tqdm(valid_loader, total=len(valid_loader))
    for batch in tqdm_object:
        batch = {k: v.to(CFG.device) for k, v in batch.items() if k != "caption"}
        loss = model(batch)

        count = batch["image"].size(0)
        loss_meter.update(loss.item(), count)

        tqdm_object.set_postfix(valid_loss=loss_meter.avg)
    return loss_meter
Step 11: Train Validation Split:
We split the input data into training and validation sets. The training set has 130,000 records and the validation set has 3,654 records. We do not use a test set here because the CLIP model is used only to extract feature embeddings from the skycam images.
Python3
# ----- Train-Valid Split ----- #
train_df, valid_df = make_train_valid_dfs(df)
print(len(train_df), len(valid_df))
tokenizer = DistilBertTokenizer.from_pretrained(CFG.text_tokenizer)
train_loader = build_loaders(train_df, tokenizer, mode="train")
valid_loader = build_loaders(valid_df, tokenizer, mode="valid")
Output:
130000 3654
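With these sizes and CFG.batch_size = 128, the number of batches per epoch can be worked out directly. The sketch below is plain arithmetic and assumes the DataLoader keeps the final partial batch, which is its default behaviour (drop_last=False).

```python
import math

train_size, valid_size = 130000, 3654  # sizes printed above
batch_size = 128                        # CFG.batch_size

# The final partial batch is kept, so the number of batches is the
# ceiling of size / batch_size.
train_batches = math.ceil(train_size / batch_size)
valid_batches = math.ceil(valid_size / batch_size)
last_train_batch = train_size - (train_batches - 1) * batch_size

print(train_batches, valid_batches, last_train_batch)
```

This gives 1,016 training batches and 29 validation batches per epoch, with a final training batch of only 80 samples, which is why the loss meter weights each batch by its size.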
Step 12: Clip Model Finetuning:
Now we finetune the CLIP model on our custom data. The code below instantiates the CLIP model with its pre-trained encoders and sets up the training process. It defines the optimizer's parameter groups with separate learning rates for the image encoder, the text encoder, and the projection heads, and creates a scheduler that reduces the learning rate when the validation loss stops improving. The training loop then runs for the configured number of epochs, saving the model whenever the validation loss improves and stepping the scheduler on the validation loss.
Python3
# ----- Loading Pretrained Model ----- #
model = CLIPModel().to(CFG.device)
params = [
    {"params": model.image_encoder.parameters(), "lr": CFG.image_encoder_lr},
    {"params": model.text_encoder.parameters(), "lr": CFG.text_encoder_lr},
    {"params": itertools.chain(
        model.image_projection.parameters(), model.text_projection.parameters()
    ), "lr": CFG.head_lr, "weight_decay": CFG.weight_decay}
]
optimizer = torch.optim.AdamW(params, weight_decay=0.)
lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", patience=CFG.patience, factor=CFG.factor
)
Model Training
Python3
# ----- Model Training ----- #
step = "epoch"

best_loss = float('inf')
for epoch in range(CFG.epochs):
    print(f"Epoch: {epoch + 1}")
    model.train()
    train_loss = train_epoch(model, train_loader, optimizer, lr_scheduler, step)
    model.eval()
    with torch.no_grad():
        valid_loss = valid_epoch(model, valid_loader)

    if valid_loss.avg < best_loss:
        best_loss = valid_loss.avg
        torch.save(model.state_dict(), "CLIP_model.pt")
        print("Saved Best Model!")

    lr_scheduler.step(valid_loss.avg)
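The interaction between patience and factor is easier to see on a made-up loss curve. The function below is a simplified stand-in for the plateau rule, not PyTorch's implementation: it ignores the scheduler's threshold and cooldown options, and the validation losses are hypothetical.

```python
def simulate_plateau_schedule(valid_losses, lr=1e-3, patience=3, factor=0.8):
    """Simplified mode="min" plateau rule; ignores threshold and cooldown."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in valid_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor  # decay once the plateau outlasts `patience`
            bad_epochs = 0
        history.append(lr)
    return history


# Hypothetical validation losses: two improving epochs, then a plateau.
losses = [2.0, 1.5, 1.6, 1.6, 1.7, 1.6, 1.4]
lr_history = simulate_plateau_schedule(losses)
print(lr_history)
```

With patience = 3, three epochs without improvement are tolerated and the learning rate is multiplied by 0.8 only on the fourth, which here is epoch 6. With CFG.epochs = 15, that leaves room for only a few reductions over the whole run.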
Step 13: Save the Clip Model & its configurations
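Now we save the CLIP model and its configuration to pickle files. Step 12 already saved the best model's weights to a .pt file; as an additional backup, we also save the model object to a .pkl file. Note that the pickled object is the model as it stands after the final epoch, which is not necessarily the best checkpoint.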
Python3
with open('clip_mdl.pkl', 'wb') as f:
    pickle.dump(model, f)

with open('clip_cfg.pkl', 'wb') as f:
    pickle.dump(CFG, f)
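One caveat: pickle stores a class by reference (its module and name), so pickling CFG itself records which class to import, not the hyperparameter values. If the values themselves should be in the file, one option is to copy them into a plain dict first. The sketch below uses a hypothetical ToyCFG with only a few fields; for the real CFG, the torch.device entry may be better stored as a string.

```python
import pickle


class ToyCFG:
    """Hypothetical stand-in for CFG with a few of its fields."""
    model_name = "resnet50"
    batch_size = 128
    head_lr = 1e-3


def config_to_dict(cfg_class):
    """Collect the public class attributes into a plain dict."""
    return {
        key: value
        for key, value in vars(cfg_class).items()
        if not key.startswith("__")
    }


cfg_values = config_to_dict(ToyCFG)
payload = pickle.dumps(cfg_values)  # the bytes now contain the actual values
restored = pickle.loads(payload)
print(restored)
```

A dict saved this way can be loaded without the CFG class being importable, which helps when the inference code lives in a different file from the training notebook.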
Cloud Coverage Prediction using Skycam Images
Cloud coverage prediction is critical in weather forecasting and a variety of applications such as solar energy generation, aviation, and climate monitoring. Accurate forecasts help decision-makers and sectors plan for and adapt to changing weather conditions. The advancement of artificial intelligence and computer vision techniques in recent years has created new opportunities for enhancing cloud coverage forecasts.
One promising approach is the use of SkyCam images.
- In the face of rapidly changing global climate patterns, there is an urgent need for innovative tools and technologies to better understand and predict weather-related phenomena.
- One crucial aspect of climate analysis is the assessment of cloud coverage, which plays a pivotal role in influencing weather conditions and climate trends.
- Experts may not always be available to monitor climatic shifts. Therefore, developing an automated weather monitoring system is crucial for various applications, including agriculture and disaster management.
The purpose of this research is to estimate the opaque Cloud Coverage from a Skycam Image using AI/ML methodologies.
Table of Content
- Cloud Coverage Prediction using SkyCam Images
- Implementations Cloud Coverage Prediction using SkyCam Images
- Cloud Coverage Prediction Models:
- Part I. Model Building & Training Pipeline
- A. Clip Model Finetuning
- B. Catboost Regressor Model Building
- Part II. UI Inference Codes for Deployed Model
- Results: