Data Preprocessing

Data preprocessing involves the following steps:

  1. Removal of extra characters from columns (such as spaces in the duration values)
  2. Conversion of values as required (e.g. 2.6K must become 2600)
  3. Conversion of the duration column into categories.

To implement these steps, let’s start by loading the Excel file we created above.

Python3




import pandas as pd
data = pd.read_excel('file.xlsx')
data.head()


Output:

 

To clean the Views column, we first strip the ‘ views’ suffix. If a value ends with ‘K’, we remove the ‘K’, convert the rest to a float, and multiply it by 1000. Refer to the code below.

Python3




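# Views column cleaning: drop the trailing " views" label
data['Views'] = data['Views'].str.replace(" views", "")

new = []

for i in data['Views']:
    if i.endswith('K'):
        # e.g. "2.6K" -> 2600.0
        new.append(float(i.replace('K', '')) * 1000)
    else:
        # Assumes the remaining values are plain numbers such as "950"
        new.append(float(i))

data['Views'] = new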
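
The same conversion can also be written as a small helper function, which is easier to test on individual values. This is only a minimal sketch, not the article's original code: the name `parse_views` is hypothetical, and it assumes the values look like `2.6K views` or `950 views`. Anything it cannot read as a number is returned as `None`.

```python
def parse_views(text):
    """Convert a view-count string such as '2.6K views' to a float."""
    # Drop the trailing ' views' label and any surrounding whitespace
    value = text.replace(" views", "").strip()
    # Assumption: 'K' (thousands) is the only suffix that occurs
    multiplier = 1.0
    if value.upper().endswith("K"):
        value = value[:-1]
        multiplier = 1000.0
    try:
        return float(value) * multiplier
    except ValueError:
        # Text that is not a number is reported as missing
        return None


print(parse_views("2.6K views"))
print(parse_views("950 views"))
```

With pandas, such a helper could be applied to the whole column using `data['Views'].map(parse_views)`.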


To clean the Duration column, we remove the ‘\n’ characters and then convert each value into seconds. The loop below multiplies the hour value by 3600 and the minute value by 60, and adds both to the seconds value.

Python3




# Duration column cleaning: remove the newline characters
data['Duration'] = data['Duration'].str.replace("\n", "")

new2 = []

for i in data['Duration']:
    parts = i.split(':')
    if len(parts) == 2:
        # MM:SS
        new2.append(int(parts[0]) * 60 + int(parts[1]))
    elif len(parts) == 3:
        # HH:MM:SS
        new2.append(int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2]))
    else:
        # SHORTS and any other value without a timestamp stay unchanged
        new2.append(i)

data['Duration'] = new2
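
The hour, minute and second arithmetic can be checked in isolation with a small function. The sketch below is illustrative only: `duration_to_seconds` is a hypothetical name, and it assumes timestamps are written as `MM:SS` or `HH:MM:SS`. A value with two or three parts that are not numbers would raise a `ValueError`.

```python
def duration_to_seconds(text):
    """Convert 'MM:SS' or 'HH:MM:SS' to seconds; return other text unchanged."""
    parts = text.strip().split(":")
    if len(parts) not in (2, 3):
        # Labels such as 'SHORTS' carry no timestamp, so keep them as they are
        return text
    seconds = 0
    for part in parts:
        # Each step shifts the running total up by one base-60 position
        seconds = seconds * 60 + int(part)
    return seconds


print(duration_to_seconds("12:34"))
print(duration_to_seconds("1:02:03"))
```

Accumulating in base 60 gives the same result as multiplying hours by 3600 and minutes by 60, without a separate branch for each format.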


Once we have the seconds, we can easily categorize the values. In this article, we use 4 categories:

  • SHORTS
  • Mini-Videos
  • Long-Videos
  • Very-Long-Videos

You can use more or fewer categories, as per your choice.

Python3




# Duration column categorization
for i in data['Duration'].index:
    val = data.loc[i, 'Duration']
    if isinstance(val, str):
        # SHORTS (and any other text value) was not converted to seconds
        continue
    elif val < 900:
        data.loc[i, 'Duration'] = 'Mini-Videos'      # under 15 minutes
    elif val < 3600:
        data.loc[i, 'Duration'] = 'Long-Videos'      # 15 minutes to 1 hour
    else:
        data.loc[i, 'Duration'] = 'Very-Long-Videos'
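
The category boundaries are easier to verify when the rule is written as a standalone function. The sketch below is one possible reading of the thresholds, not the article's original code: `categorize_duration` is a hypothetical name, and placing exactly 900 seconds in Long-Videos and exactly 3600 seconds in Very-Long-Videos is an assumption.

```python
def categorize_duration(value):
    """Map a duration in seconds to one of the four categories."""
    if isinstance(value, str):
        # Text entries such as 'SHORTS' were never converted to seconds
        return value.strip()
    if value < 900:
        # Under 15 minutes
        return "Mini-Videos"
    if value < 3600:
        # From 15 minutes up to, but not including, one hour
        return "Long-Videos"
    return "Very-Long-Videos"


for sample in (120, 900, 5400, "SHORTS"):
    print(sample, categorize_duration(sample))
```

With pandas, the function could be applied to the column using `data['Duration'].map(categorize_duration)`, which avoids writing to the DataFrame row by row.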


After all the preprocessing, let’s check the new dataset.

Python3




data.head()


Output:

 

YouTube Data Scraping, Preprocessing and Analysis using Python

YouTube is one of the oldest and most popular video distribution platforms in the world. The amount of video content available on it is hard to imagine. It has billions of users and viewers, and that number keeps increasing.

Since its origins, YouTube and its content have changed a great deal. Now we have SHORTS, likes, and many more features.

Here we will analyze the w3wiki YouTube channel, including the duration, likes, and titles of its videos.

Before that, we need the data. We can scrape the data using web scraping.
