Build a Recommendation Engine With Collaborative Filtering
Recommendation engines enhance the user experience in almost every domain, whether it's online shopping, social media, or movie streaming. With enormous volumes of content generated every second, it is extremely difficult for businesses to surface content that matches each customer's interests and behavior. This is where recommendation systems come into play and help with personalized recommendations.
In this article, we will understand what collaborative filtering is and how we can use it to build our own recommendation system.
Building a Recommendation Engine With Collaborative Filtering in Python
In this implementation, we will build an item-item memory-based recommendation engine in Python which recommends the top 5 books similar to a book of the user's choice. You can download the datasets from here:
- books.csv
- ratings.csv
- users.csv
Step 1: Importing Necessary Libraries
We need to import the libraries below to implement the recommendation engine.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
Step 2: Load the Dataset
We load the three datasets and get their descriptions using the 'info()' method.
# Load datasets
users = pd.read_csv('/kaggle/input/book-recommendation-dataset/Users.csv')
books = pd.read_csv('/kaggle/input/book-recommendation-dataset/Books.csv')
ratings = pd.read_csv('/kaggle/input/book-recommendation-dataset/Ratings.csv')
# Get dataset info
users.info()
books.info()
ratings.info()
Output:
users:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User-ID 278858 non-null int64
1 Location 278858 non-null object
2 Age 168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB
books:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ISBN 271360 non-null object
1 Book-Title 271360 non-null object
2 Book-Author 271358 non-null object
3 Year-Of-Publication 271360 non-null object
4 Publisher 271358 non-null object
5 Image-URL-S 271360 non-null object
6 Image-URL-M 271360 non-null object
7 Image-URL-L 271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB
ratings:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User-ID 1149780 non-null int64
1 ISBN 1149780 non-null object
2 Book-Rating 1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB
Step 3: Data Cleaning and Preparation
In this step, we clean the data and get it ready for model building.
We have many records with the same book title but different publishers and publication years. So, we drop the rows with duplicate book titles and store the result in the 'new_books' data frame.
# Drop rows with duplicate book title
new_books = books.drop_duplicates('Book-Title')
We then merge the 'ratings' df with the 'new_books' df on 'ISBN' (the unique identification number for books) and store the result in 'ratings_with_name'. We also drop the columns that we do not require, like 'ISBN', 'Image-URL-S', etc.
# Merge ratings and new_books df
ratings_with_name = ratings.merge(new_books, on='ISBN')
# Drop non-relevant columns
ratings_with_name.drop(['ISBN', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L'], axis = 1, inplace = True)
Now, we merge the 'ratings_with_name' df with the 'users' df to get 'users_ratings_matrix'. Similarly, we drop the non-relevant columns.
# Merge new 'ratings_with_name' df with users df
users_ratings_matrix = ratings_with_name.merge(users, on='User-ID')
# Drop non-relevant columns
users_ratings_matrix.drop(['Location', 'Age'], axis = 1, inplace = True)
# Print the first few rows of the new dataframe
users_ratings_matrix.head()
Output:
User-ID Book-Rating Book-Title Book-Author Year-Of-Publication Publisher
0 276725 0 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books
1 2313 5 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books
2 2313 8 In Cold Blood (Vintage International) TRUMAN CAPOTE 1994 Vintage
3 2313 9 Divine Secrets of the Ya-Ya Sisterhood : A Novel Rebecca Wells 1996 HarperCollins
4 2313 5 The Mistress of Spices Chitra Banerjee Divakaruni 1998 Anchor Books/Doubleday
Checking and dropping null values.
# Check for null values
users_ratings_matrix.isna().sum()
# Drop null values
users_ratings_matrix.dropna(inplace = True)
print(users_ratings_matrix.isna().sum())
Output:
User-ID 0
Book-Rating 0
Book-Title 0
Book-Author 0
Year-Of-Publication 0
Publisher 0
dtype: int64
Since 'users_ratings_matrix' has too many entries, we filter it down to users who gave many book ratings, and then filter further on the basis of the most-rated books:
- Users with many book ratings: Group the DataFrame by the 'User-ID' column and count the number of ratings each user has given, creating a boolean mask 'x' where each entry indicates whether that user has given more than 100 ratings.
- Books with the most ratings: Further filter 'filtered_users_ratings' (which contains only the users with many ratings) down to books that have received at least 50 ratings.
# Filter down 'users_ratings_matrix' on the basis of users who gave many book ratings
x = users_ratings_matrix.groupby('User-ID').count()['Book-Rating'] > 100
knowledgeable_users = x[x].index
filtered_users_ratings = users_ratings_matrix[users_ratings_matrix['User-ID'].isin(knowledgeable_users)]
# Filter down 'users_ratings_matrix' on the basis of books with most ratings
y = filtered_users_ratings.groupby('Book-Title').count()['Book-Rating'] >= 50
famous_books = y[y].index
final_users_ratings = filtered_users_ratings[filtered_users_ratings['Book-Title'].isin(famous_books)]
Now, we will create the pivot table for the 'final_users_ratings' df. It will be a sparse item-user matrix where each row contains all the user ratings for a particular book and each column contains all the book ratings given by a particular user.
# Pivot table creation
pivot_table = final_users_ratings.pivot_table(index = 'Book-Title', columns = 'User-ID', values = 'Book-Rating')
# Filling the NA values with '0'
pivot_table.fillna(0, inplace = True)
pivot_table.head()
Output:
User-ID 254 507 882 1424 1435 1733 1903 2033 2110 2276 ... 274549 274808 275020 275970 276680 277427 277478 277639 278188 278418
Book-Title
1984 9.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1st to Die: A Novel 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2nd Chance 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Blondes 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
There is no standalone implementation of centered cosine similarity in scikit-learn. So, we first standardize the pivot table using 'StandardScaler' and then apply cosine similarity to the standardized data.
# Standardize the pivot table
scaler = StandardScaler(with_mean=True, with_std=True)
pivot_table_normalized = scaler.fit_transform(pivot_table)
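Note that 'StandardScaler' standardizes each column of the matrix. If you instead want classic item-mean centering (each book's ratings centered on that book's own average, as in adjusted cosine similarity), you can do it manually with NumPy. A minimal sketch on a small illustrative matrix, not the actual book data:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy item-user matrix: 3 items (rows) x 4 users (columns)
ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
])

# Centered cosine: subtract each item's mean rating so that items
# with generally high or low ratings are put on a common scale
row_means = ratings.mean(axis=1, keepdims=True)
centered = ratings - row_means

# Pairwise similarity between the mean-centered item vectors
item_similarity = cosine_similarity(centered)
print(item_similarity.shape)  # (3, 3)
```

The resulting matrix is square and symmetric, with 1.0 on the diagonal (each item is perfectly similar to itself).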
Step 4: Model Building
First, we calculate the similarity matrix for all the books using 'cosine_similarity'.
# Calculate the similarity matrix for all the books
similarity_score = cosine_similarity(pivot_table_normalized)
Then, we create a function called 'recommend()' which recommends the top 5 similar books based on the user's choice.
- The code finds the numerical index of the given book name in the pivot table.
- It sorts the similarity scores for the given book in descending order.
- It selects the top 5 similar books (excluding the given book itself).
- It retrieves the details (title, author, and image URL) of each similar book from the 'new_books' DataFrame.
- It formats the information and returns it as a list.
def recommend(book_name):
# Returns the numerical index for the book_name
index = np.where(pivot_table.index==book_name)[0][0]
# Sorts the similarities for the book_name in descending order
similar_books = sorted(list(enumerate(similarity_score[index])),key=lambda x:x[1], reverse=True)[1:6]
# To return result in list format
data = []
for index,similarity in similar_books:
item = []
# Get the book details by index
temp_df = new_books[new_books['Book-Title'] == pivot_table.index[index]]
# Only add the title, author, and image-url to the result
item.extend(temp_df['Book-Title'].values)
item.extend(temp_df['Book-Author'].values)
item.extend(temp_df['Image-URL-M'].values)
data.append(item)
return data
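The core of 'recommend()' is the sorted(enumerate(...)) idiom: it pairs each similarity score with its row index, sorts the pairs by score in descending order, and slices off the first pair, which is the book compared with itself. A self-contained illustration on a toy similarity row:

```python
# Toy similarity row for one book against 5 books
# (the first entry, 1.0, is the book's similarity to itself)
scores = [1.0, 0.2, 0.9, 0.1, 0.5]

# Pair each score with its index, sort by score descending,
# and skip the first pair (the book itself); keep the next 2
top = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[1:3]
print(top)  # [(2, 0.9), (4, 0.5)]
```

The indices in the resulting pairs are then used to look up book titles in the pivot table's index, exactly as the function does with 'pivot_table.index[index]'.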
Step 5: Validating the Model
# Call the recommend function
recommend('1984')
Output:
[["Foucault's Pendulum",
'Umberto Eco',
'http://images.amazon.com/images/P/0345368754.01.MZZZZZZZ.jpg'],
['Tis : A Memoir',
'Frank McCourt',
'http://images.amazon.com/images/P/0684848783.01.MZZZZZZZ.jpg'],
['Animal Farm',
'George Orwell',
'http://images.amazon.com/images/P/0451526341.01.MZZZZZZZ.jpg'],
['The Glass Lake',
'Maeve Binchy',
'http://images.amazon.com/images/P/0440221595.01.MZZZZZZZ.jpg'],
['Summer Pleasures',
'Nora Roberts',
'http://images.amazon.com/images/P/0373218397.01.MZZZZZZZ.jpg']]
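One caveat: 'recommend()' looks up the title with np.where, which returns an empty array for a title that was filtered out or misspelled, so indexing it with [0][0] raises an IndexError. A defensive lookup might look like the following sketch, where 'book_titles' is a hypothetical stand-in for 'pivot_table.index':

```python
import numpy as np

# Stand-in for pivot_table.index; in the real engine this is the
# array of popular book titles that survived the filtering step
book_titles = np.array(['1984', 'Animal Farm', '2nd Chance'])

def find_book_index(book_name):
    """Return the row index of book_name, or None if it is unknown."""
    matches = np.where(book_titles == book_name)[0]
    if matches.size == 0:
        return None  # title was filtered out or misspelled
    return int(matches[0])

print(find_book_index('1984'))        # 0
print(find_book_index('Not a Book'))  # None
```

Returning None (or a friendly message) lets the caller handle unknown titles gracefully instead of crashing.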
Conclusion
Building a recommendation engine using collaborative filtering is a robust way to enhance personalization in services. By following the above steps, one can achieve a highly effective recommendation system that is sensitive to user preferences and behaviors.