Build a Recommendation Engine With Collaborative Filtering

Recommendation engines enhance the user experience in almost every domain, whether it’s online shopping, social media, or movie streaming. With vast amounts of content generated every second, it is extremely difficult for businesses to surface content that matches each customer’s interests and behavior. This is where recommendation systems come into play, providing personalized recommendations.

In this article, we will understand what collaborative filtering is and how we can use it to build our own recommendation system.

Building a Recommendation Engine With Collaborative Filtering in Python

In this implementation, we will build an item-item, memory-based recommendation engine in Python that recommends the top 5 books to a user based on their choice. You can download the datasets from here:

  • books.csv
  • ratings.csv
  • users.csv

Step 1: Importing Necessary Libraries

We need to import the following libraries to implement the recommendation engine.

Python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

Step 2: Load the Dataset

We load the three CSV files and get a description of each using the ‘info()’ method.

Python
# Load datasets
users = pd.read_csv('/kaggle/input/book-recommendation-dataset/Users.csv')
books = pd.read_csv('/kaggle/input/book-recommendation-dataset/Books.csv')
ratings = pd.read_csv('/kaggle/input/book-recommendation-dataset/Ratings.csv')

# Get dataset info
users.info()
books.info()
ratings.info()

Output:

users:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   User-ID   278858 non-null  int64
 1   Location  278858 non-null  object
 2   Age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB

books:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271358 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB

ratings:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype
---  ------       --------------    -----
 0   User-ID      1149780 non-null  int64
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB

Step 3: Data Cleaning and Preparation

In this step, we clean the data and get it ready for model building.

We have many records with the same book title but different publishers and publication years. So, we drop the rows with duplicate book titles and store the result in the ‘new_books’ data frame.

Python
# Drop rows with duplicate book title
new_books = books.drop_duplicates('Book-Title')
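To see what this deduplication does, here is a minimal sketch on a hypothetical toy frame (the titles, ISBNs, and publishers are made up for illustration). ‘drop_duplicates’ keeps the first row for each title by default:

```python
import pandas as pd

# Hypothetical toy frame: the same title appears under two ISBNs
books_demo = pd.DataFrame({
    'ISBN': ['111', '222', '333'],
    'Book-Title': ['Dune', 'Dune', '1984'],
    'Publisher': ['A', 'B', 'C'],
})

# Keeps the first occurrence of each 'Book-Title', drops the rest
deduped = books_demo.drop_duplicates('Book-Title')
print(deduped['ISBN'].tolist())  # ['111', '333']
```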

We then merge the ‘ratings’ df with the ‘new_books’ df on ‘ISBN’, the unique identification number for books, and store the result in ‘ratings_with_name’. We also drop the columns we do not require, such as ‘ISBN’ and the image URL columns.

Python
# Merge ratings and new_books df
ratings_with_name = ratings.merge(new_books, on='ISBN')

# Drop non-relevant columns
ratings_with_name.drop(['ISBN', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L'], axis = 1, inplace = True)

Now, we merge the ‘ratings_with_name’ df with the ‘users’ df on ‘User-ID’ to get ‘users_ratings_matrix’. Similarly, we drop the non-relevant columns (‘Location’ and ‘Age’).

Python
# Merge new 'ratings_with_name' df with users df
users_ratings_matrix = ratings_with_name.merge(users, on='User-ID')

# Drop non-relevant columns
users_ratings_matrix.drop(['Location', 'Age'], axis = 1, inplace = True)

# Print the first few rows of the new dataframe
users_ratings_matrix.head()

Output:

    User-ID  Book-Rating  Book-Title                                        Book-Author                 Year-Of-Publication  Publisher
0   276725   0            Flesh Tones: A Novel                              M. J. Rose                  2002                 Ballantine Books
1   2313     5            Flesh Tones: A Novel                              M. J. Rose                  2002                 Ballantine Books
2   2313     8            In Cold Blood (Vintage International)             TRUMAN CAPOTE               1994                 Vintage
3   2313     9            Divine Secrets of the Ya-Ya Sisterhood : A Novel  Rebecca Wells               1996                 HarperCollins
4   2313     5            The Mistress of Spices                            Chitra Banerjee Divakaruni  1998                 Anchor Books/Doubleday

Next, we check for null values and drop them.

Python
# Check for null values
users_ratings_matrix.isna().sum()
# Drop null values
users_ratings_matrix.dropna(inplace = True)
print(users_ratings_matrix.isna().sum())

Output:

User-ID                0
Book-Rating            0
Book-Title             0
Book-Author            0
Year-Of-Publication    0
Publisher              0
dtype: int64

Since ‘users_ratings_matrix’ has too many entries, we filter it down to users who have given many book ratings, and then further filter on the basis of the most-rated books.

The code filters a DataFrame users_ratings_matrix containing user-book interactions based on two criteria:

  1. Users with many book ratings: It groups the DataFrame by the ‘User-ID’ column and counts the number of ratings each user has given, creating a boolean mask x where each entry indicates whether that user has given more than 100 ratings.
  2. Books with most ratings: It then filters the resulting DataFrame filtered_users_ratings (which contains only the knowledgeable users) down to books that have received at least 50 ratings.
Python
# Filter down 'users_ratings_matrix' on the basis of users who gave many book ratings
x = users_ratings_matrix.groupby('User-ID').count()['Book-Rating'] > 100
knowledgeable_users = x[x].index
filtered_users_ratings = users_ratings_matrix[users_ratings_matrix['User-ID'].isin(knowledgeable_users)]

# Filter down 'users_ratings_matrix' on the basis of books with most ratings
y = filtered_users_ratings.groupby('Book-Title').count()['Book-Rating'] >= 50
famous_books = y[y].index
final_users_ratings = filtered_users_ratings[filtered_users_ratings['Book-Title'].isin(famous_books)]
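The same groupby–mask–isin pattern can be sketched on a tiny hypothetical ratings table (a threshold of 2 stands in for the article’s 100):

```python
import pandas as pd

# Hypothetical mini ratings table
df = pd.DataFrame({
    'User-ID': [1, 1, 1, 2, 3, 3],
    'Book-Title': ['A', 'B', 'C', 'A', 'A', 'B'],
    'Book-Rating': [5, 4, 3, 5, 2, 1],
})

# Boolean mask: users with more than 2 ratings
mask = df.groupby('User-ID').count()['Book-Rating'] > 2

# mask[mask].index keeps only the User-IDs where the mask is True
active_users = mask[mask].index

# Keep only rows belonging to those users
filtered = df[df['User-ID'].isin(active_users)]
print(filtered['Book-Title'].tolist())  # ['A', 'B', 'C'] -- only user 1 survives
```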

Now, we create a pivot table from the ‘final_users_ratings’ df. It is a sparse user-item rating matrix where each row contains all the user ratings for a particular book and each column contains all the book ratings given by a particular user.

Python
# Pivot table creation
pivot_table = final_users_ratings.pivot_table(index = 'Book-Title', columns = 'User-ID', values = 'Book-Rating')

# Filling the NA values with '0'
pivot_table.fillna(0, inplace = True)
pivot_table.head()

Output:


User-ID 254 507 882 1424 1435 1733 1903 2033 2110 2276 ... 274549 274808 275020 275970 276680 277427 277478 277639 278188 278418
Book-Title
1984 9.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1st to Die: A Novel 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2nd Chance 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Blondes 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
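The pivoting step above can be illustrated on a tiny hypothetical long-format table, where each (title, user) pair becomes one cell and missing ratings become 0 after ‘fillna’:

```python
import pandas as pd

# Hypothetical long-format ratings
df = pd.DataFrame({
    'Book-Title': ['A', 'A', 'B'],
    'User-ID': [1, 2, 1],
    'Book-Rating': [5, 3, 4],
})

# Rows = book titles, columns = user IDs, cells = ratings
pt = df.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')
pt = pt.fillna(0)
print(pt)
# User-ID       1    2
# Book-Title
# A           5.0  3.0
# B           4.0  0.0
```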

scikit-learn has no standalone function for centered cosine similarity. So, we first standardize the pivot table using ‘StandardScaler’ and then apply cosine similarity to the standardized data.

Python
# Standardize the pivot table
scaler = StandardScaler(with_mean=True, with_std=True)
pivot_table_normalized = scaler.fit_transform(pivot_table)
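As a quick sanity check on the normalization, here is a minimal sketch on a made-up 3-book × 4-user rating matrix. Note that ‘StandardScaler’ operates column-wise, so it centers and scales each user’s ratings to zero mean and unit variance before the item-item similarity is computed:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 3-book x 4-user rating matrix (values made up for illustration)
R = np.array([
    [5.0, 3.0, 2.0, 1.0],
    [4.0, 2.0, 1.0, 1.0],
    [1.0, 1.0, 4.0, 5.0],
])

# Each column (user) is centered to mean 0 and scaled to unit variance
R_norm = StandardScaler(with_mean=True, with_std=True).fit_transform(R)
print(np.allclose(R_norm.mean(axis=0), 0.0))  # True: every column is centered

# Item-item cosine similarity on the normalized matrix, as in Step 4
sim = cosine_similarity(R_norm)
print(sim.shape)  # (3, 3): one entry per pair of books
```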

Step 4: Model Building

First, we calculate the similarity matrix for all the items using β€˜cosine_similarityβ€˜.

Python
# Calculate the similarity matrix for all the books
similarity_score = cosine_similarity(pivot_table_normalized)

Then, we create a function called ‘recommend()’ which recommends the top 5 books to the user based on their chosen book.

  1. The code finds the numerical index of the given book name in the pivot table.
  2. It sorts the similarity scores for the given book in descending order.
  3. It selects the top 5 similar books (excluding the given book itself).
  4. It retrieves the details (title, author, and image URL) of the similar books from the new_books DataFrame.
  5. It formats the information and returns it as a list.
Python
def recommend(book_name):
    
    # Returns the numerical index for the book_name
    index = np.where(pivot_table.index==book_name)[0][0]
    
    # Sorts the similarities for the book_name in descending order
    similar_books = sorted(list(enumerate(similarity_score[index])),key=lambda x:x[1], reverse=True)[1:6]
    
    # To return result in list format
    data = []
    
    for index,similarity in similar_books:
        item = []
        # Get the book details by index
        temp_df = new_books[new_books['Book-Title'] == pivot_table.index[index]]
        
        # Only add the title, author, and image-url to the result
        item.extend(temp_df['Book-Title'].values)
        item.extend(temp_df['Book-Author'].values)
        item.extend(temp_df['Image-URL-M'].values)
        
        data.append(item)
    return data

Step 5: Validating the Model

Python
# Call the recommend method
recommend('1984')

Output:

[["Foucault's Pendulum",
  'Umberto Eco',
  'http://images.amazon.com/images/P/0345368754.01.MZZZZZZZ.jpg'],
 ['Tis : A Memoir',
  'Frank McCourt',
  'http://images.amazon.com/images/P/0684848783.01.MZZZZZZZ.jpg'],
 ['Animal Farm',
  'George Orwell',
  'http://images.amazon.com/images/P/0451526341.01.MZZZZZZZ.jpg'],
 ['The Glass Lake',
  'Maeve Binchy',
  'http://images.amazon.com/images/P/0440221595.01.MZZZZZZZ.jpg'],
 ['Summer Pleasures',
  'Nora Roberts',
  'http://images.amazon.com/images/P/0373218397.01.MZZZZZZZ.jpg']]

Conclusion

Building a recommendation engine with collaborative filtering is a robust way to add personalization to a service. By following the steps above, you can build an effective recommendation system that reflects user preferences and behavior.
