Web scraping the data

Web scraping is the automation of the data extraction process from websites. Web scrapers automatically load and extract data based on user requirements, and they can be custom-built for a single site or configured to work with any website.

Here, we will be using Selenium and BeautifulSoup for web scraping.

After extracting the data, we will convert it into an Excel file. For that, we will use the XlsxWriter library.

Python3




import time
from selenium import webdriver
from bs4 import BeautifulSoup
import xlsxwriter


The URL of the channel's videos page must be provided in this format:

Python3




# provide the url of the channel whose data you want to fetch
urls = [
    'https://www.youtube.com/c/w3wikiVideos/videos'
]
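The scraper later requests the channel's popular-videos grid by appending a query string to this URL. As a quick stdlib sketch, the same request URL can be built with urlencode instead of a hard-coded query string:

```python
from urllib.parse import urlencode

# Query parameters used by the scraper: sort=p orders the grid by popularity.
params = {'view': 0, 'sort': 'p', 'flow': 'grid'}
channel = 'https://www.youtube.com/c/w3wikiVideos/videos'
request_url = '{}?{}'.format(channel, urlencode(params))
print(request_url)
# → https://www.youtube.com/c/w3wikiVideos/videos?view=0&sort=p&flow=grid
```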


Now, we will create the soup from the page content fetched by the Chrome driver.

You need to specify the path of the Chrome driver in place of path_of_chrome_driver.

If you don't have it, please install ChromeDriver and then specify the correct location.

Note: On Windows, the download location is usually 'C:\Downloads\chromedriver.exe'.

Python3




# on Selenium 4+, pass the driver path via a Service object instead
driver = webdriver.Chrome(executable_path='path_of_chrome_driver')

for url in urls:
    # the channel URL already ends in /videos, so only the query string is added;
    # sort=p orders the video grid by popularity
    driver.get('{}?view=0&sort=p&flow=grid'.format(url))

    # scroll to the bottom 5 times so more videos are loaded into the page
    times = 0
    while times < 5:
        time.sleep(1)
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);")
        times += 1

    content = driver.page_source.encode('utf-8').strip()
    soup = BeautifulSoup(content, 'lxml')


Now we will extract the title, duration, and views by their respective id/class and store each in a separate list. You can add more columns in the same way.

Python3




# Title
titles = soup.findAll('a', id='video-title')
t = [i.text for i in titles]

# Views: the matching spans alternate, so keep every even-indexed one
views = soup.findAll('span', class_='style-scope ytd-grid-video-renderer')
v = [s.text for s in views[::2]]

# Duration
duration = soup.findAll(
    'span', class_='style-scope ytd-thumbnail-overlay-time-status-renderer')
d = [i.text for i in duration]
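The scraped values are raw strings such as "1.2M views" and "10:05", which are awkward to analyze directly. As a hedged sketch (these helpers are not part of the scraper above), they can be converted to numbers like this:

```python
def parse_views(text):
    """Convert '1.2M views', '347K views', or '982 views' to an int."""
    token = text.split()[0]
    multipliers = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}
    if token[-1] in multipliers:
        return int(float(token[:-1]) * multipliers[token[-1]])
    return int(token.replace(',', ''))

def parse_duration(text):
    """Convert 'H:MM:SS' or 'M:SS' to total seconds."""
    seconds = 0
    for part in text.strip().split(':'):
        seconds = seconds * 60 + int(part)
    return seconds

print(parse_views('1.2M views'))   # → 1200000
print(parse_duration('10:05'))     # → 605
```

Such numeric columns make the later duration/likes analysis straightforward.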


Once we have the lists, we are ready to create the Excel file.

Note: After creating the Excel file and adding all the items, close it with the workbook.close() command; otherwise, the file will not appear at the specified location.

Python3




workbook = xlsxwriter.Workbook('file.xlsx')
worksheet = workbook.add_worksheet()
  
worksheet.write(0, 0, "Title")
worksheet.write(0, 1, "Views")
worksheet.write(0, 2, "Duration")
  
row = 1
for title, view, dura in zip(t, v, d):
    worksheet.write(row, 0, title)
    worksheet.write(row, 1, view)
    worksheet.write(row, 2, dura)
    row += 1
  
workbook.close()
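One thing to keep in mind: zip pairs the three lists positionally and stops at the shortest one, so if the selectors return different counts, trailing rows are silently dropped. A quick illustration with made-up sample values:

```python
titles = ['Video A', 'Video B', 'Video C']
views = ['10K views', '2.3K views']          # one entry short
durations = ['4:10', '0:58', '12:03']

# zip stops at the shortest input, so only 2 rows are produced
rows = list(zip(titles, views, durations))
print(len(rows))   # → 2
```

Checking that len(t), len(v), and len(d) match before writing is a cheap way to catch a broken selector.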


YouTube Data Scraping, Preprocessing and Analysis using Python

YouTube is one of the oldest and most popular video distribution platforms in the world. The amount of video content available on it is hard to even imagine, and its billions of users and viewers keep increasing every passing minute.

Since its origin, YouTube and its content have changed a great deal. Now we have Shorts, likes, and many more features.

Here we will analyze the w3wiki YouTube channel, looking at the duration, likes, titles of the videos, etc.

Before that, we need the data, which we can collect using web scraping.
