Build a Recommendation Engine With Collaborative Filtering
Recommendation engines enhance the user experience in almost every domain, whether it's online shopping, social media, or movie streaming. With enormous volumes of content generated every second, it is extremely difficult for businesses to surface content that matches each customer's interests and behavior. This is where recommendation systems come into play and help with personalized recommendations.
In this article, we will understand what collaborative filtering is and how we can use it to build our own recommendation system.
Building a Recommendation Engine With Collaborative Filtering in Python
In this implementation, we will build an item-item memory-based recommendation engine in Python which recommends the top 5 books similar to a book of the user's choice. You can download the datasets from here:
- books.csv
- ratings.csv
- users.csv
Step 1: Importing Necessary Libraries
We need to import the libraries below to implement the recommendation engine.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
Step 2: Load the Dataset
We load the three datasets and get their descriptions using the 'info()' method.
# Load datasets
users = pd.read_csv('/kaggle/input/book-recommendation-dataset/Users.csv')
books = pd.read_csv('/kaggle/input/book-recommendation-dataset/Books.csv')
ratings = pd.read_csv('/kaggle/input/book-recommendation-dataset/Ratings.csv')
# Get dataset info
users.info()
books.info()
ratings.info()
Output:
users:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User-ID 278858 non-null int64
1 Location 278858 non-null object
2 Age 168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB
books:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ISBN 271360 non-null object
1 Book-Title 271360 non-null object
2 Book-Author 271358 non-null object
3 Year-Of-Publication 271360 non-null object
4 Publisher 271358 non-null object
5 Image-URL-S 271360 non-null object
6 Image-URL-M 271360 non-null object
7 Image-URL-L 271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB
ratings:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User-ID 1149780 non-null int64
1 ISBN 1149780 non-null object
2 Book-Rating 1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB
Step 3: Data Cleaning and Preparation
In this step, we clean the data and get it ready for model building.
We have many records with the same book title but different publishers and publication years. So, we drop the rows with duplicate book titles and store the result in the 'new_books' data frame.
# Drop rows with duplicate book title
new_books = books.drop_duplicates('Book-Title')
We then merge the 'ratings' df with the 'new_books' df on 'ISBN' (the unique identification number for books) and store the result in 'ratings_with_name'. We also drop the columns that we do not require, like 'ISBN', 'Image-URL-S', etc.
# Merge ratings and new_books df
ratings_with_name = ratings.merge(new_books, on='ISBN')
# Drop non-relevant columns
ratings_with_name.drop(['ISBN', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L'], axis = 1, inplace = True)
Now, we merge the 'ratings_with_name' df with the 'users' df to get 'users_ratings_matrix'. Similarly, we drop the non-relevant columns.
# Merge new 'ratings_with_name' df with users df
users_ratings_matrix = ratings_with_name.merge(users, on='User-ID')
# Drop non-relevant columns
users_ratings_matrix.drop(['Location', 'Age'], axis = 1, inplace = True)
# Print the first few rows of the new dataframe
users_ratings_matrix.head()
Output:
User-ID Book-Rating Book-Title Book-Author Year-Of-Publication Publisher
0 276725 0 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books
1 2313 5 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books
2 2313 8 In Cold Blood (Vintage International) TRUMAN CAPOTE 1994 Vintage
3 2313 9 Divine Secrets of the Ya-Ya Sisterhood : A Novel Rebecca Wells 1996 HarperCollins
4 2313 5 The Mistress of Spices Chitra Banerjee Divakaruni 1998 Anchor Books/Doubleday
Checking and dropping null values.
# Check for null values
users_ratings_matrix.isna().sum()
# Drop null values
users_ratings_matrix.dropna(inplace = True)
print(users_ratings_matrix.isna().sum())
Output:
User-ID 0
Book-Rating 0
Book-Title 0
Book-Author 0
Year-Of-Publication 0
Publisher 0
dtype: int64
Since 'users_ratings_matrix' has too many entries, we filter it down to users who gave many book ratings, and then filter further on the basis of the most-rated books:
- Users with many book ratings: Group the DataFrame by the 'User-ID' column and count the number of ratings each user has given, creating a boolean mask 'x' where each entry indicates whether that user has given more than 100 ratings.
- Books with the most ratings: Further filter 'filtered_users_ratings' (which contains only the users with many ratings) down to books that have received at least 50 ratings.
# Filter down 'users_ratings_matrix' on the basis of users who gave many book ratings
x = users_ratings_matrix.groupby('User-ID').count()['Book-Rating'] > 100
knowledgeable_users = x[x].index
filtered_users_ratings = users_ratings_matrix[users_ratings_matrix['User-ID'].isin(knowledgeable_users)]
# Filter down 'users_ratings_matrix' on the basis of books with most ratings
y = filtered_users_ratings.groupby('Book-Title').count()['Book-Rating'] >= 50
famous_books = y[y].index
final_users_ratings = filtered_users_ratings[filtered_users_ratings['Book-Title'].isin(famous_books)]
Now, we will create the pivot table for the 'final_users_ratings' df. It will be a sparse item-user matrix where each row contains all the user ratings for a particular book and each column contains all the book ratings given by a particular user.
# Pivot table creation
pivot_table = final_users_ratings.pivot_table(index = 'Book-Title', columns = 'User-ID', values = 'Book-Rating')
# Filling the NA values with '0'
pivot_table.fillna(0, inplace = True)
pivot_table.head()
Output:
User-ID 254 507 882 1424 1435 1733 1903 2033 2110 2276 ... 274549 274808 275020 275970 276680 277427 277478 277639 278188 278418
Book-Title
1984 9.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1st to Die: A Novel 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2nd Chance 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Blondes 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
There is no standalone implementation of centered cosine similarity in scikit-learn. So, we first standardize the pivot table using 'StandardScaler' and then apply cosine similarity to the standardized data.
# Standardize the pivot table
scaler = StandardScaler(with_mean=True, with_std=True)
pivot_table_normalized = scaler.fit_transform(pivot_table)
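Note that 'StandardScaler' standardizes each column of the matrix. If you instead want classic item-mean centering (each book's ratings centered on that book's own average, as in adjusted cosine similarity), you can do it manually with NumPy. A minimal sketch on a small illustrative matrix, not the actual book data:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy item-user matrix: 3 items (rows) x 4 users (columns)
ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
])

# Centered cosine: subtract each item's mean rating so that items
# with generally high or low ratings are put on a common scale
row_means = ratings.mean(axis=1, keepdims=True)
centered = ratings - row_means

# Pairwise similarity between the mean-centered item vectors
item_similarity = cosine_similarity(centered)
print(item_similarity.shape)  # (3, 3)
```

The resulting matrix is square and symmetric, with 1.0 on the diagonal (each item is perfectly similar to itself).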
Step 4: Model Building
First, we calculate the similarity matrix for all the books using 'cosine_similarity'.
# Calculate the similarity matrix for all the books
similarity_score = cosine_similarity(pivot_table_normalized)
Then, we create a function called 'recommend()' which recommends the top 5 similar books based on the user's choice.
- The code finds the numerical index of the given book name in the pivot table.
- It sorts the similarity scores for the given book in descending order.
- It selects the top 5 similar books (excluding the given book itself).
- It retrieves the details (title, author, and image URL) of each similar book from the 'new_books' DataFrame.
- It formats the information and returns it as a list.
def recommend(book_name):
# Returns the numerical index for the book_name
index = np.where(pivot_table.index==book_name)[0][0]
# Sorts the similarities for the book_name in descending order
similar_books = sorted(list(enumerate(similarity_score[index])),key=lambda x:x[1], reverse=True)[1:6]
# To return result in list format
data = []
for index,similarity in similar_books:
item = []
# Get the book details by index
temp_df = new_books[new_books['Book-Title'] == pivot_table.index[index]]
# Only add the title, author, and image-url to the result
item.extend(temp_df['Book-Title'].values)
item.extend(temp_df['Book-Author'].values)
item.extend(temp_df['Image-URL-M'].values)
data.append(item)
return data
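The core of 'recommend()' is the sorted(enumerate(...)) idiom: it pairs each similarity score with its row index, sorts the pairs by score in descending order, and slices off the first pair, which is the book compared with itself. A self-contained illustration on a toy similarity row:

```python
# Toy similarity row for one book against 5 books
# (the first entry, 1.0, is the book's similarity to itself)
scores = [1.0, 0.2, 0.9, 0.1, 0.5]

# Pair each score with its index, sort by score descending,
# and skip the first pair (the book itself); keep the next 2
top = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[1:3]
print(top)  # [(2, 0.9), (4, 0.5)]
```

The indices in the resulting pairs are then used to look up book titles in the pivot table's index, exactly as the function does with 'pivot_table.index[index]'.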
Step 5: Validating the Model
# Call the recommend function
recommend('1984')
Output:
[["Foucault's Pendulum",
'Umberto Eco',
'http://images.amazon.com/images/P/0345368754.01.MZZZZZZZ.jpg'],
['Tis : A Memoir',
'Frank McCourt',
'http://images.amazon.com/images/P/0684848783.01.MZZZZZZZ.jpg'],
['Animal Farm',
'George Orwell',
'http://images.amazon.com/images/P/0451526341.01.MZZZZZZZ.jpg'],
['The Glass Lake',
'Maeve Binchy',
'http://images.amazon.com/images/P/0440221595.01.MZZZZZZZ.jpg'],
['Summer Pleasures',
'Nora Roberts',
'http://images.amazon.com/images/P/0373218397.01.MZZZZZZZ.jpg']]
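One caveat: 'recommend()' looks up the title with np.where, which returns an empty array for a title that was filtered out or misspelled, so indexing it with [0][0] raises an IndexError. A defensive lookup might look like the following sketch, where 'book_titles' is a hypothetical stand-in for 'pivot_table.index':

```python
import numpy as np

# Stand-in for pivot_table.index; in the real engine this is the
# array of popular book titles that survived the filtering step
book_titles = np.array(['1984', 'Animal Farm', '2nd Chance'])

def find_book_index(book_name):
    """Return the row index of book_name, or None if it is unknown."""
    matches = np.where(book_titles == book_name)[0]
    if matches.size == 0:
        return None  # title was filtered out or misspelled
    return int(matches[0])

print(find_book_index('1984'))        # 0
print(find_book_index('Not a Book'))  # None
```

Returning None (or a friendly message) lets the caller handle unknown titles gracefully instead of crashing.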
Conclusion
Building a recommendation engine using collaborative filtering is a robust way to enhance personalization in services. By following the above steps, one can achieve a highly effective recommendation system that is sensitive to user preferences and behaviors.