Text Preprocessing
Text preprocessing cleans the text data through the following steps:
- Remove punctuation
- Lowercase the characters
- Create tokens
- Remove stopwords
We can do all of these using Python's re module and the NLTK library.
Python3
import re
from tqdm import tqdm

import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
After importing the libraries, run the code below to process the Title column.
Python3
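def preprocess_text(text_data):
    # Build the stopword set once; calling stopwords.words() for every token is slow
    stop_words = set(stopwords.words('english'))
    preprocessed_text = []
    for sentence in tqdm(text_data):
        # Remove punctuation (anything that is not a word character or whitespace)
        sentence = re.sub(r'[^\w\s]', '', str(sentence))
        # Lowercase first so capitalized stopwords such as "The" are also removed
        preprocessed_text.append(' '.join(
            token for token in sentence.lower().split()
            if token not in stop_words))
    return preprocessed_text

preprocessed_review = preprocess_text(data['Title'].values)
data['Title'] = preprocessed_review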
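To see what the steps do to a single title, here is a minimal sketch that runs without NLTK. The names `SAMPLE_STOPWORDS` and `preprocess_title` and the sample title are hypothetical, and the small hand-written stopword set only stands in for NLTK's much longer English list, so results on real data will differ.

```python
import re

# Hypothetical, hand-picked stopwords for illustration only;
# NLTK's stopwords.words('english') list is much longer.
SAMPLE_STOPWORDS = {"a", "an", "the", "in", "of", "to", "and", "is", "for", "using"}


def preprocess_title(title, stop_words=SAMPLE_STOPWORDS):
    # Step 1: remove punctuation (keep word characters and whitespace)
    no_punct = re.sub(r"[^\w\s]", "", str(title))
    # Steps 2 and 3: lowercase, then split on whitespace to create tokens
    tokens = no_punct.lower().split()
    # Step 4: drop tokens that appear in the stopword set
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(kept)


sample = "How to Reverse a Linked List in Python?"
print(preprocess_title(sample))  # how reverse linked list python
```

Splitting on whitespace is the simplest way to create tokens; NLTK's word_tokenize could be used instead if finer-grained tokens are needed.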
YouTube Data Scraping, Preprocessing and Analysis using Python
YouTube is one of the oldest and most popular video distribution platforms in the world. The amount of video content available on it is hard to imagine. It has billions of users and viewers, and the number keeps increasing every minute.
Since its launch, YouTube and its content have changed a great deal. It now has Shorts, likes, and many more features.
Here we will analyze the w3wiki YouTube channel, including the time duration, likes, and titles of its videos.
Before that, we need the data. We can scrape it using web scraping.