Data Preprocessing
Data preprocessing involves the following steps:
- Removal of extra characters from columns (like newline characters in the Duration column)
- Conversion of values as required (e.g. 2.6K must become 2600)
- Conversion of duration column into categories.
To implement these steps, let’s start by loading the Excel file we created above.
Python3
import pandas as pd

data = pd.read_excel('file.xlsx')
data.head()
Output:
To clean the Views column, first strip the trailing ‘ views’ text. Then, if a value ends with ‘K’, remove the ‘K’, convert the remainder to a float, and multiply it by 1000. Values without a ‘K’ suffix are left unchanged. The code below does this.
Python3
# Strip the trailing " views" text from every value
data['Views'] = data['Views'].str.replace(" views", "")

new = []
for i in data['Views']:
    if i.endswith('K'):
        # e.g. '2.6K' -> 2600.0
        i = i.replace('K', '')
        new.append(float(i) * 1000)
    else:
        # Values without a 'K' suffix are kept unchanged
        new.append(i)

data['Views'] = new
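The same conversion can also be written as a small reusable function. The following is only a sketch: the name `parse_views`, the `SUFFIXES` table, and the handling of ‘M’ and ‘B’ suffixes, thousands separators, and non-numeric text are assumptions that go beyond the original code, which only handles ‘K’.

```python
# Hypothetical suffix table; only 'K' is handled in the article's code.
SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_views(text):
    """Convert a string such as '2.6K views' to a float (here 2600.0)."""
    # Drop the trailing label and any thousands separators (assumed format).
    value = str(text).replace(" views", "").replace(",", "").strip()
    if value and value[-1].upper() in SUFFIXES:
        return float(value[:-1]) * SUFFIXES[value[-1].upper()]
    try:
        return float(value)
    except ValueError:
        # Anything that is not a number (e.g. 'No views') becomes NaN.
        return float("nan")


samples = ["2.6K views", "856 views", "1.2M views"]
converted = [parse_views(s) for s in samples]
```

With the DataFrame loaded above, this could be applied as `data['Views'] = data['Views'].map(parse_views)`, which gives a fully numeric column instead of a mix of floats and strings.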
To clean the Duration column, first remove the ‘\n’ characters. Then convert each duration to seconds: loop over the values, multiply the hour value by 3600 and the minute value by 60, and add both to the seconds value. Values that are not in a time format, such as ‘SHORTS’, are left unchanged.
Python3
# Duration column cleaning
data['Duration'] = data['Duration'].str.replace("\n", "")

new2 = []
for i in data['Duration']:
    if i == 'SHORTS' or len(i.split(':')) == 1:
        new2.append(i)
    elif len(i.split(':')) == 2:
        i = i.split(':')
        tim = int(i[0]) * 60 + int(i[1])
        new2.append(tim)
    elif len(i.split(':')) == 3:
        i = i.split(':')
        tim = int(i[0]) * 3600 + int(i[1]) * 60 + int(i[2])
        new2.append(tim)

data['Duration'] = new2
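As a worked example of the arithmetic, ‘1:02:03’ becomes 1 × 3600 + 2 × 60 + 3 = 3723 seconds. The sketch below wraps the conversion in a helper; the name `duration_to_seconds` is hypothetical, and it assumes durations look like ‘MM:SS’ or ‘HH:MM:SS’, with everything else returned unchanged.

```python
def duration_to_seconds(text):
    """Convert 'MM:SS' or 'HH:MM:SS' to seconds; return other values as-is."""
    cleaned = str(text).replace("\n", "").strip()
    parts = cleaned.split(":")
    # Anything that is not two or three numeric fields (e.g. 'SHORTS')
    # is passed through untouched.
    if len(parts) not in (2, 3) or not all(p.isdigit() for p in parts):
        return cleaned
    seconds = 0
    for part in parts:
        # Each step shifts the running total by one base-60 place.
        seconds = seconds * 60 + int(part)
    return seconds


examples = [duration_to_seconds(d) for d in ["12:34", "1:02:03", "\nSHORTS\n"]]
```

It could be applied to the column with `data['Duration'] = data['Duration'].map(duration_to_seconds)`.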
Once we have the durations in seconds, we can easily categorize them. In this article, we use four categories:
- SHORTS
- Mini-Videos
- Long-Videos
- Very-Long-Videos
You can use more or fewer categories, as you prefer.
Python3
# Duration column categorization
for i in data['Duration'].index:
    val = data['Duration'].iloc[i]
    if val == 'SHORTS':
        # Shorts already carry their category label
        continue
    elif val in range(0, 900):
        # Under 15 minutes
        data.loc[i, 'Duration'] = 'Mini-Videos'
    elif val in range(900, 3600):
        # From 15 minutes up to, but not including, 1 hour
        data.loc[i, 'Duration'] = 'Long-Videos'
    else:
        data.loc[i, 'Duration'] = 'Very-Long-Videos'
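The thresholds can also be kept in one place so that adding or removing a category only means editing two lists. This is a minimal sketch using the standard-library `bisect` module; the names `BOUNDS`, `LABELS`, and `categorize` are hypothetical, and it assumes that a video of exactly 900 seconds counts as ‘Long-Videos’ and one of exactly 3600 seconds as ‘Very-Long-Videos’.

```python
from bisect import bisect_right

# Category boundaries in seconds (15 minutes and 1 hour), as used above.
BOUNDS = [900, 3600]
LABELS = ["Mini-Videos", "Long-Videos", "Very-Long-Videos"]


def categorize(value):
    """Map a duration in seconds to a category label."""
    if isinstance(value, str):
        # Text values such as 'SHORTS' are already labels.
        return value
    # bisect_right counts how many boundaries are <= value,
    # which is exactly the index of the matching label.
    return LABELS[bisect_right(BOUNDS, value)]


labels = [categorize(v) for v in ["SHORTS", 120, 900, 3599, 7200]]
```

With the DataFrame from above, `data['Duration'] = data['Duration'].map(categorize)` would replace the explicit loop.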
After all the preprocessing, let’s check the new dataset.
Python3
data.head()
Output:
YouTube Data Scraping, Preprocessing and Analysis using Python
YouTube is one of the oldest and most popular video distribution platforms in the world. The amount of video content available on it is hard to imagine. It has billions of users and viewers, and that number keeps increasing every minute.
Since its origins, YouTube and its content have changed a great deal. It now has SHORTS, likes, and many other features.
So here we will analyze the w3wiki YouTube channel, looking at the duration, likes, titles, and other attributes of its videos.
Before that, we need the data. We can scrape it using web scraping.