Data Preprocessing

Data preprocessing involves the following steps:

Removal of extra characters from columns (like spaces in the duration)
Conversion of values as required (e.g. 2.6K must become 2600)
Conversion of the duration column into categories

To implement the above steps, let’s start by loading the Excel file we created above.

Python3

import pandas as pd 
data = pd.read_excel('file.xlsx') 
data.head()

The extra characters in the Views column are removed by first stripping the " views" suffix and then checking whether the value ends with ‘K’. If it does, the ‘K’ is removed, and the remaining number is converted to a float and multiplied by 1000. Refer to the code below.

Python3

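# Strip the " views" suffix, e.g. "2.6K views" -> "2.6K"
data['Views'] = data['Views'].str.replace(" views", "")

new = []

for i in data['Views']:
    if i.endswith('K'):
        # "2.6K" -> 2600.0
        new.append(float(i.replace('K', '')) * 1000)
    else:
        # Convert plain counts too, so the whole column is numeric
        new.append(float(i))

data['Views'] = new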
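
The same conversion can also be written as a small helper applied to the whole column. This is only a minimal sketch, not the article's own code: the name `parse_views` and the sample values are hypothetical, and it assumes every value looks like `2.6K views` or `845 views` (other suffixes are not handled).

```python
import pandas as pd

def parse_views(text):
    """Convert a view count such as '2.6K views' or '845 views' to a float."""
    value = str(text).replace("views", "").strip()
    if value.upper().endswith("K"):
        # 'K' stands for thousands: drop it and scale the number
        return float(value[:-1]) * 1000
    return float(value)

# Hypothetical sample values, used only to illustrate the helper
sample = pd.Series(["2.6K views", "845 views", "12K views"])
views = sample.apply(parse_views)
```

On the real dataset this would be used as `data['Views'] = data['Views'].apply(parse_views)`.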

The extra characters in the Duration column are removed by deleting the ‘\n’ newline. The value then needs to be converted into seconds. For that, we loop over the values, multiply the hour value by 3600 and the minute value by 60, and add both to the seconds value.

Python3

# Duration column cleaning
data['Duration'] = data['Duration'].str.replace("\n", "")

new2 = []

for i in data['Duration']:
    parts = i.split(':')
    if len(parts) == 2:
        # MM:SS
        new2.append(int(parts[0]) * 60 + int(parts[1]))
    elif len(parts) == 3:
        # HH:MM:SS
        new2.append(int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2]))
    else:
        # 'SHORTS' and any other value that is not a time stay unchanged
        new2.append(i)

data['Duration'] = new2
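
The hour/minute/second arithmetic can be checked in isolation with a small function. This is a sketch under stated assumptions rather than the article's method: `duration_to_seconds` is a hypothetical name, and the inputs are assumed to be either a label without a colon (such as `SHORTS`) or a colon-separated time.

```python
def duration_to_seconds(text):
    """Convert 'MM:SS' or 'HH:MM:SS' to seconds; return other values unchanged."""
    parts = str(text).strip().split(":")
    if len(parts) == 1:
        # No colon, e.g. 'SHORTS': nothing to convert
        return text
    seconds = 0
    for part in parts:
        # Each field to the left is worth 60 times the one to its right
        seconds = seconds * 60 + int(part)
    return seconds

short_video = duration_to_seconds("12:34")    # 12*60 + 34
long_video = duration_to_seconds("1:02:03")   # 1*3600 + 2*60 + 3
label = duration_to_seconds("SHORTS")         # returned as-is
```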

Once we have the durations in seconds, we can easily categorize the values. In this article, we use four categories:

SHORTS
Mini-Videos
Long-Videos
Very-Long-Videos

You can use more or fewer categories, as you prefer.

Python3

# Duration column categorization
for i in data.index:
    val = data.loc[i, 'Duration']
    if isinstance(val, str):
        # 'SHORTS' (with or without stray spaces) is already a category
        continue
    elif val < 900:       # under 15 minutes
        data.loc[i, 'Duration'] = 'Mini-Videos'
    elif val < 3600:      # 15 minutes up to, but not including, 1 hour
        data.loc[i, 'Duration'] = 'Long-Videos'
    else:                 # 1 hour or more
        data.loc[i, 'Duration'] = 'Very-Long-Videos'
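
As an alternative to the index loop, the categorization can be expressed as a function applied to the column. The sketch below is illustrative only: `categorize_duration` and the sample values are hypothetical, the 900 and 3600 second thresholds are taken from the code above, and placing a video of exactly 900 seconds in Long-Videos is an assumption.

```python
import pandas as pd

def categorize_duration(value):
    """Map a duration in seconds (or the label 'SHORTS') to a category name."""
    if isinstance(value, str) and value.strip() == "SHORTS":
        return "SHORTS"
    seconds = int(value)
    if seconds < 900:        # under 15 minutes
        return "Mini-Videos"
    if seconds < 3600:       # under 1 hour
        return "Long-Videos"
    return "Very-Long-Videos"

# Hypothetical sample: one short plus three durations in seconds
durations = pd.Series(["SHORTS", 754, 900, 3723], dtype=object)
categories = durations.apply(categorize_duration)
```

Using explicit comparisons instead of `range` membership avoids gaps at the boundary values.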

After all the preprocessing, let’s check the new dataset.

Python3

data.head()

YouTube Data Scraping, Preprocessing and Analysis using Python

YouTube is one of the oldest and most popular video distribution platforms in the world, and the amount of video content available on it is enormous. It has billions of users and viewers, and that number keeps growing.

Since its origins, YouTube and its content have changed considerably. It now offers SHORTS, likes, and many more features.

So here we will analyze the w3wiki YouTube channel, looking at the video duration, likes, titles, and so on.

Before that, we need the data. We can collect it using web scraping.
