Predict Tinder Matches with Machine Learning

In this article, we build a match-making recommender system modeled on Tinder. Most social media platforms have their own recommender algorithms. Ours recommends profiles to a user based on shared characteristics and interests, with the aim of surfacing the profiles the user is most likely to find interesting and want to connect with. We build the project from scratch in the following steps:

Importing Libraries

We import all the libraries in one place, at the top of the script, so that every dependency is visible at a glance and nothing has to be imported again later.

Numpy – A Python library for numerical computation on multidimensional arrays (ndarray), with a large collection of mathematical functions that operate on those arrays.
Pandas – A Python library built on top of NumPy for working with tabular data (dataframes). It is used for data cleaning, merging, reshaping, and aggregation.
Matplotlib – A plotting library for 2D and 3D visualizations that can save figures in a variety of output formats.
Seaborn – A statistical visualization library built on top of Matplotlib that produces attractive plots with little code.
Scikit-learn and category_encoders – Provide the encoders, TF-IDF vectorizer, truncated SVD, and cosine similarity used later in the article.

Python3

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid",
              {"grid.color": ".6",
               "grid.linestyle": ":"})
import category_encoders as ce
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# encoders used in the Data Manipulation section
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

We will use the pandas read_csv() function to read our CSV file. The dataset used in this article for demonstration can be downloaded from here.

Python3

# reading dataset using pandas
tinder_df = pd.read_csv("data.csv")

After executing this function the dataset will be stored as a dataframe in the tinder_df variable. We can view the first five rows of the dataframe using tinder_df.head().

Exploratory Data Analysis of the Dataset

In exploratory data analysis (EDA), we try to extract the essential information from the dataframe. EDA is often one of the most time-consuming parts of a data science project and can take up a large share of the total effort. However, as we will see, that effort pays off later.

We will first look at the dimensions of our dataset using the pandas shape attribute. It is a tuple holding the number of rows followed by the number of columns.

Python3

# shape of the dataset 
print(tinder_df.shape) 

output:

(2001, 22)
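
As a quick check of the ordering, the sketch below builds a tiny made-up dataframe (not the Tinder dataset) and unpacks shape into rows and columns. The names toy_df, n_rows and n_cols are illustrative only.

```python
import pandas as pd

# Toy frame with 3 rows and 2 columns (hypothetical data, not from data.csv)
toy_df = pd.DataFrame({"age": [22, 31, 45], "sex": ["f", "m", "f"]})

# shape is an attribute, not a method: (number of rows, number of columns)
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)
```

Read the same way, the (2001, 22) above means 2001 rows and 22 columns.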

Next, we will use the pandas info() method to see information about the dataset. It reports the Dtype and non-null count of every column.

Python3

# information about the dataset 
tinder_df.info() 

Output :

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2001 entries, 0 to 2000
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   user_id              2001 non-null   object 
 1   username             2001 non-null   object 
 2   age                  2001 non-null   int64  
 3   status               2001 non-null   object 
 4   sex                  2001 non-null   object 
 5   orientation          2001 non-null   object 
 6   drinks               2001 non-null   object 
 7   drugs                2001 non-null   object 
 8   height               2001 non-null   float64
 9   job                  2001 non-null   object 
 10  location             2001 non-null   object 
 11  pets                 2001 non-null   object 
 12  smokes               2001 non-null   object 
 13  language             2001 non-null   object 
 14  new_languages        2001 non-null   object 
 15  body_profile         2001 non-null   object 
 16  education_level      2001 non-null   float64
 17  dropped_out          2001 non-null   object 
 18  bio                  2001 non-null   object 
 19  interests            2001 non-null   object 
 20  other_interests      2001 non-null   object 
 21  location_preference  2001 non-null   object 
dtypes: float64(2), int64(1), object(19)
memory usage: 344.0+ KB

The output shows that the dataset has 2 float columns, 1 int column, and 19 object columns, and that no column has missing values. To see the number of unique values in each column, we will use the pandas nunique() method.

Python3

# Number of unique element in the columns 
tinder_df.nunique() 

Output:

user_id                2001
username               1995
age                      52
status                    4
sex                       2
orientation               3
drinks                    6
drugs                     3
height                   25
job                      21
location                 70
pets                     15
smokes                    5
language                575
new_languages             3
body_profile             12
education_level           5
dropped_out               2
bio                    2001
interests                31
other_interests          31
location_preference       3
dtype: int64

Data Wrangling

In data wrangling, we process and transform the data into a more useful structure. To split and summarize the dataset by category, we will use the pandas groupby() method. The first example counts the rows for each combination of sex and drug use.

Python3

(tinder_df.groupby(['sex', 'drugs'])['drugs']
    .count()
    .reset_index(name='unique_drug_count'))

Output:

    sex    drugs    unique_drug_count
0    f    never        711
1    f    often        5
2    f    sometimes    146
3    m    never        875
4    m    often        13
5    m    sometimes    251
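
The same pattern can be tried on a small made-up frame. This is only a sketch: toy_df and its values are hypothetical, and size() is used instead of count() because it counts rows per group directly (the two agree when the counted column has no missing values).

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset
toy_df = pd.DataFrame({
    "sex":   ["f", "f", "m", "m", "m"],
    "drugs": ["never", "never", "never", "sometimes", "never"],
})

# size() counts the rows in each (sex, drugs) group
counts = (toy_df.groupby(["sex", "drugs"])
          .size()
          .reset_index(name="count"))
print(counts)
```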

We can also group people by their interest in learning new languages and by whether they dropped out of college.

Python3

(tinder_df.groupby(['new_languages', 'dropped_out'])['dropped_out']
    .count()
    .reset_index(name='drop_out_people count'))

Output:

new_languages    dropped_out    drop_out_people count
0    interested            no                 594
1    interested            yes                 39
2    not interested        no                 999
3    not interested        yes                 51
4    somewhat interested    no                 305
5    somewhat interested    yes                 13

Data Visualization

Data visualization is an important part of storytelling. Here we use Python plotting libraries to show what individual columns have to tell us.

Python3

# distribution of age 
sns.histplot(tinder_df["age"], kde=True) 

Output:

Histplot of age using seaborn

The age column has a long right tail, which shows that it deviates from a normal distribution. A log transformation is one way to reduce this skew, as discussed in the Data Manipulation section. Next, we will plot a histogram of the height column.

Python3

# Distribution of height 
sns.histplot(tinder_df["height"], kde=True) 

Output:

Histplot of Height column using seaborn

We can also bin numerical data and plot a pie chart to see what share falls in each range. For example, we may want to know the percentage of users in each age range. We will use the pandas cut() function to create bins for the numerical data.

Python3

# Set the size of the figure to 6 inches
# wide by 6 inches tall
plt.figure(figsize=(6, 6))
  
# Divide the data into categories 
bins = [18, 30, 40, 50, 60, 70] 
  
# Use the `cut` function to assign 
# each data point to a category 
categories = pd.cut(tinder_df["age"], bins, 
                    labels=["18-30", "30-40", 
                            "40-50", "50-60", "60-70"]) 
  
# Count the number of data points in each category 
counts = categories.value_counts() 
  
# Plot the data as a pie chart 
plt.pie(counts, labels=counts.index, autopct='%1.1f%%') 
plt.show() 

Output:

Pie chart for the percentage of age distribution
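
One detail of cut() is worth knowing when reading the chart: by default each bin is open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin. The sketch below uses a few made-up ages to show this and the include_lowest option. Whether the dataset contains anyone aged exactly 18 is not stated in this article.

```python
import pandas as pd

ages = pd.Series([18, 25, 30, 31, 70])          # made-up ages
bins = [18, 30, 40, 50, 60, 70]
labels = ["18-30", "30-40", "40-50", "50-60", "60-70"]

# Default: intervals are (18, 30], (30, 40], ... so 18 itself is left out
default_cut = pd.cut(ages, bins, labels=labels)

# include_lowest=True closes the first interval on the left: [18, 30]
inclusive_cut = pd.cut(ages, bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

A boundary value such as 30 is assigned to the lower bin, "18-30", because the right edge of each interval is inclusive.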

We can use the histplot function from Seaborn to plot the number of people in each job.

Python3

plt.figure(figsize=(6, 6)) 
sns.histplot(x="job", data=tinder_df, 
             color="coral") 
  
# rotate x-axis labels vertically 
plt.xticks(rotation=90) 
plt.title("Distribution of job of each candidate", 
          fontsize=14) 
  
plt.xlabel("Job", fontsize=12)
plt.ylabel("Count of people", fontsize=12) 
  
plt.show() 

Output:

Count of people in a particular job using Histplot

Data Manipulation

In data manipulation, we transform the columns of the dataset into a form suitable for modeling. We saw earlier that the numerical column age has a long right tail, so it is right-skewed. A log transformation can be applied to such a column to bring it closer to a normal distribution.

To encode data from categorical object Dtype into numerical data, we will use three types of encoding.

One-Hot encoding – Creates one new 0/1 column per category. It suits nominal columns with a small number of categories.
Label encoding – Replaces each category with an integer code in a single column, so no new columns are created.
Binary encoding – Similar to one-hot encoding, but it creates fewer new columns by writing each category's integer code in binary digits.
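
To make the first two encodings concrete, here is a minimal sketch on a made-up column using pandas only. The series toy and its values are hypothetical, and the article itself uses scikit-learn encoders rather than these pandas calls.

```python
import pandas as pd

toy = pd.Series(["cat", "dog", "bird", "dog"], name="pets")

# One-hot: one 0/1 column per category
one_hot = pd.get_dummies(toy, prefix="pets", dtype=int)

# Label: a single column of integer codes (categories sorted alphabetically)
label = toy.astype("category").cat.codes

print(one_hot)
print(label.tolist())
```

One-hot encoding adds three columns here, while label encoding keeps a single column but introduces an arbitrary ordering (bird < cat < dog) that a model may treat as meaningful.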

Several transformations are available for decreasing skewness, such as the inverse, square root, and log transformations. Which one is appropriate depends on how skewed the column is.
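
As an illustration of the effect, the sketch below applies a log transformation to synthetic right-skewed data and compares the skewness before and after. The data is generated for the example and only stands in for a column like age. Using np.log1p is one possible choice, not something the dataset requires.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "ages" (hypothetical, not the Tinder data)
rng = np.random.default_rng(0)
ages = pd.Series(18 + rng.exponential(scale=10, size=2000))

# log1p computes log(1 + x), which is also safe for values near zero
log_ages = np.log1p(ages)

# Skewness closer to 0 means a more symmetric distribution
print(round(ages.skew(), 2), round(log_ages.skew(), 2))
```

On the real dataframe the same idea would read np.log1p(tinder_df["age"]), stored in a new column so that the original values remain available.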

We will now go through the categorical columns one by one and convert each into a corresponding numerical column. We start with the language column.

Python3

# check if every row has a
# common language as english
tinder_df['language'].str.contains('english').unique()

Output:

array([ True])

The language column has 575 unique values, and every row includes English. One-hot encoding this column would create a very sparse matrix, so instead we create a new column that counts the number of languages each person knows and then drop the language column.

Python3

# count the number of languages in each row
tinder_df['num_languages'] = (
    tinder_df['language'].str.count(',') + 1)
tinder_df.drop(["language"], axis=1, inplace=True)
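
The counting trick relies on the languages being separated by commas, so the number of entries is the number of commas plus one. A small sketch with made-up strings (the exact formatting of the real column is assumed here):

```python
import pandas as pd

# Hypothetical values in the style of a comma-separated language column
toy = pd.DataFrame({"language": ["english",
                                 "english, spanish",
                                 "english (fluently), french, c++"]})

# Number of comma-separated entries = number of commas + 1
toy["num_languages"] = toy["language"].str.count(",") + 1
print(toy["num_languages"].tolist())
```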

To encode location preference, we assign a numerical weight to each option: anywhere gets the lowest weight, 1.0, same state gets 2.0, and same city gets the highest weight, 2.5.

Python3

place_type_strength = {
    'anywhere': 1.0,
    'same state': 2.0,
    'same city': 2.5
}

tinder_df['location_preference'] = (
    tinder_df['location_preference']
    .apply(lambda x: place_type_strength[x]))

We can easily handle columns that have only two unique categorical values by label encoding.

Python3

two_unique_values_column = { 
    'sex': {'f': 1, 'm': 0}, 
    'dropped_out': {'no': 0, 'yes': 1} 
} 
  
tinder_df.replace(two_unique_values_column, 
                  inplace=True)

The status column has four distinct values, which we divide into two groups:

Single or available.
Married or seeing someone.

A higher weight is given to people who are single or available.

Python3

status_type_strength = {
    'single': 2.0,
    'available': 2.0,
    'seeing someone': 1.0,
    'married': 1.0
}
tinder_df['status'] = (
    tinder_df['status']
    .apply(lambda x: status_type_strength[x]))

Orientation is nominal categorical data, so one-hot encoding would be the usual choice. The code below instead label-encodes the column with scikit-learn's LabelEncoder and then drops it, so orientation is not part of the final feature set used in the rest of this article.

Python3

# create a LabelEncoder object
orientation_encoder = LabelEncoder()

# fit the encoder on the orientation column
orientation_encoder.fit(tinder_df['orientation'])

# encode the orientation column using the fitted encoder
tinder_df['orientation'] = orientation_encoder.transform(
    tinder_df['orientation'])

# Drop the orientation column; remove this line to keep
# the encoded values as a feature
tinder_df.drop("orientation", axis=1, inplace=True)

The drinks column has 6 unique values. However, we can group these 6 values into three broader categories, which keeps the number of distinct codes small. We then label-encode the drinks and drugs columns with a single shared encoder.

Python3

drinking_habit = {
    'socially': 'sometimes',
    'rarely': 'sometimes',
    'not at all': 'do not drink',
    'often': 'drinks often',
    'very often': 'drinks often',
    'desperately': 'drinks often'
}
tinder_df['drinks'] = (
    tinder_df['drinks']
    .apply(lambda x: drinking_habit[x]))
# create a LabelEncoder object
habit_encoder = LabelEncoder()

# fit the encoder on the values of both the
# drinks and drugs columns, so they share one set of codes
habit_encoder.fit(tinder_df[['drinks', 'drugs']]
                  .values.reshape(-1))

# encode the drinks and drugs columns
# using the fitted encoder
tinder_df['drinks_encoded'] = \
    habit_encoder.transform(tinder_df['drinks'])
tinder_df['drugs_encoded'] = \
    habit_encoder.transform(tinder_df['drugs'])

# Drop the existing drink and drugs column
tinder_df.drop(["drinks", "drugs"], axis=1,
               inplace=True)
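
Because one encoder is fitted on the values of both columns, the two columns share a single sorted list of classes. The sketch below reproduces that with the category names from the code above, using small standalone arrays instead of the dataframe. It shows which integer each category ends up with.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Category names as they appear after the drinking_habit mapping
drinks = np.array(["sometimes", "do not drink", "drinks often"])
drugs = np.array(["never", "often", "sometimes"])

# Fitting on both columns together gives one sorted list of classes
enc = LabelEncoder().fit(np.concatenate([drinks, drugs]))

print(list(enc.classes_))
print(enc.transform(drinks).tolist(), enc.transform(drugs).tolist())
```

The codes run from 0 to 4, and "sometimes" maps to 4 in both columns. Fitting a separate encoder per column would instead give each column its own codes starting at 0.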

The location column has 70 unique values. One-hot encoding it directly would create 70 new columns, so we use geographical knowledge to group the cities into broader regions first and then one-hot encode the regions.

Python3

region_dict = {'southern_california': ['los angeles', 
                         'san diego', 'hacienda heights', 
                         'north hollywood', 'phoenix'], 
               'new_york': ['brooklyn', 
                            'new york']} 
  
def get_region(city): 
    for region, cities in region_dict.items(): 
        if city.lower() in [c.lower() for c in cities]: 
            return region 
    return "northern_california"
  
  
tinder_df['location'] = (tinder_df['location']
                         .str.split(', ')
                         .str[0].apply(get_region))
# perform one hot encoding
location_encoder = OneHotEncoder()

# fit and transform the location column
location_encoded = location_encoder.fit_transform(
    tinder_df[['location']])

# create a new DataFrame with the encoded columns
location_encoded_df = pd.DataFrame(
    location_encoded.toarray(),
    columns=location_encoder.get_feature_names_out(['location']))

# concatenate the new DataFrame with the original DataFrame
tinder_df = pd.concat([tinder_df, location_encoded_df], axis=1)
# Drop the existing location column
tinder_df.drop(["location"], axis=1, inplace=True)

A person's job is an important part of their identity, so we cannot drop this column, and the 21 job categories cannot easily be merged into broader groups. We label-encode the column into a single job_encoded column.

Python3

# create a LabelEncoder object
job_encoder = LabelEncoder()

# fit the encoder on the job column
job_encoder.fit(tinder_df['job'])

# encode the job column using the fitted encoder
tinder_df['job_encoded'] = job_encoder.transform(
    tinder_df['job'])

# drop the original job column
tinder_df.drop('job', axis=1, inplace=True)

The smokes column has 5 distinct values. We reduce these to two: 1 for people who do not smoke and 0 for everyone else.

Python3

smokes = {
    'no': 1.0,
    'sometimes': 0,
    'yes': 0,
    'when drinking': 0,
    'trying to quit': 0
}
tinder_df['smokes'] = (
    tinder_df['smokes']
    .apply(lambda x: smokes[x]))

For the pets column, we will do Binary encoding.

Python3

bin_enc = ce.BinaryEncoder(cols=['pets']) 
  
# fit and transform the pet column 
pet_enc = bin_enc.fit_transform(tinder_df['pets']) 
  
# add the encoded columns to the original dataframe 
tinder_df = pd.concat([tinder_df, pet_enc], axis=1) 
  
tinder_df.drop("pets",axis=1,inplace = True) 
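
To show where the four pets_0 to pets_3 columns used later come from, here is a hand-rolled sketch of binary encoding. It is not the category_encoders implementation, and the exact bit pattern that library assigns to each category may differ. The function binary_encode and the category names are made up for the example.

```python
import numpy as np
import pandas as pd

def binary_encode(series, prefix):
    """Hand-rolled binary encoding: integer code -> one column per bit."""
    # Codes start at 1, so the all-zero pattern is never assigned to a category
    codes = pd.factorize(series)[0] + 1
    n_bits = max(int(codes.max()).bit_length(), 1)
    # Most significant bit first: prefix_0, prefix_1, ...
    bits = {f"{prefix}_{i}": (codes >> (n_bits - 1 - i)) & 1
            for i in range(n_bits)}
    return pd.DataFrame(bits, index=series.index)

# 15 hypothetical categories, matching the unique count reported for pets
toy_pets = pd.Series([f"pet_type_{k}" for k in range(15)], name="pets")
encoded = binary_encode(toy_pets, "pets")
print(encoded.shape)
```

Fifteen categories fit into four binary columns, where one-hot encoding would need fifteen.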

For the new_languages and body_profile columns, we will simply do label encoding.

Python3

# create a LabelEncoder object
new_languages_encoder = LabelEncoder()

# fit the encoder on the new_languages column
new_languages_encoder.fit(tinder_df['new_languages'])

# encode the new_languages column using the fitted encoder
tinder_df['new_languages'] = new_languages_encoder.transform(
    tinder_df['new_languages'])

# create an instance of LabelEncoder
le = LabelEncoder()

# encode the body_profile column
tinder_df["body_profile"] = le.fit_transform(tinder_df["body_profile"])

Data Modelling

In data modeling, we first use TfidfVectorizer from the sklearn package to convert the free-text bio column into numerical features. The output of TfidfVectorizer is a large sparse matrix, so we then use truncated SVD (Singular Value Decomposition) to reduce the dimensionality of the combined feature matrix.

To measure how similar a new user is to each stored profile, we use the cosine similarity between their feature vectors.

This is a content-based filtering algorithm: we use the user's profile information to recommend other profiles with similar characteristics. The algorithm recommends the profiles that have the highest cosine similarity score with the user.
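
Cosine similarity is the dot product of two vectors divided by the product of their lengths, cos(a, b) = (a · b) / (||a|| ||b||). It depends only on the direction of the vectors, not on their magnitude. A minimal NumPy sketch with made-up vectors (the helper cosine is written for this example):

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

user = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # same direction, different length
orthogonal = [0.0, 0.0, 3.0]       # shares no non-zero feature

print(cosine(user, same_direction), cosine(user, orthogonal))
```

The first pair scores 1 (up to floating-point rounding) and the second scores 0, which is why a higher score is read as a more similar profile.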

Python3

# Initialize TfidfVectorizer object
tfidf = TfidfVectorizer(stop_words='english')

# Fit and transform the text data
tfidf_matrix = tfidf.fit_transform(tinder_df['bio'])

# Get the feature names in column order
# (vocabulary_ is a dict and is not ordered by column index)
feature_names = tfidf.get_feature_names_out()

# Convert tfidf matrix to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=feature_names)

# Drop identifiers and the remaining text columns.
# interests and other_interests are still text and are not
# part of the user input collected later, so they go too
tinder_dfs = tinder_df.drop(["bio", "user_id", "username",
                             "interests", "other_interests"],
                            axis=1)
# Add the tf-idf features to the non-text features
tinder_dfs = pd.concat([tinder_dfs,
                        tfidf_df], axis=1)
# Apply SVD to the feature matrix
svd = TruncatedSVD(n_components=100)
svd_matrix = svd.fit_transform(tinder_dfs)

# Calculate the cosine similarity
# between all pairs of users
cosine_sim = cosine_similarity(svd_matrix)
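
The same three steps can be tried on a handful of made-up bios to see what each stage produces. This is a sketch only: the bios are invented, the names tfidf_toy and svd_toy are illustrative, and two components are used because the toy vocabulary is tiny.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical bios, not taken from the dataset
bios = [
    "I love hiking and mountain trails",
    "Hiking, camping and long mountain walks",
    "Coffee, novels and quiet evenings",
    "Reading novels with good coffee",
]

tfidf_toy = TfidfVectorizer(stop_words="english")
X = tfidf_toy.fit_transform(bios)          # sparse, one column per term

# n_components must not exceed the number of features
svd_toy = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd_toy.fit_transform(X)       # dense, shape (4, 2)

sim = cosine_similarity(X_reduced)
print(X.shape, X_reduced.shape)
print(sim.round(2))
```

The two hiking bios should come out much more similar to each other than to the coffee bios, since the two groups share no terms.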

Model Prediction

To get recommendations for the new user we will define a new recommend function.

Python3

def recommend(user_df, num_recommendations=5):

    # Apply the fitted SVD to the feature
    # matrix of the user_df dataframe
    svd_matrixs = svd.transform(user_df)

    # Calculate the cosine similarity
    # between the user_df and training set users
    cosine_sim_new = cosine_similarity(svd_matrixs, svd_matrix)

    # Sort the training set users by similarity, best first
    sim_scores = list(enumerate(cosine_sim_new[0]))
    sim_scores = sorted(sim_scores,
                        key=lambda x: x[1], reverse=True)
    # The slice starts at 1 and so skips the single best match.
    # That is only needed when the query user is also a row of
    # the training data; use sim_scores[:num_recommendations]
    # to include the best match for a brand-new user.
    sim_indices = [i[0] for i in
                   sim_scores[1:num_recommendations+1]]

    # Return the usernames of the recommended users
    return tinder_df['username'].iloc[sim_indices]
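
The ranking step inside recommend() can also be written with NumPy's argsort. The helper below is a sketch with made-up scores, and top_k_indices is a hypothetical name, not part of the code above.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    scores = np.asarray(scores)
    # argsort is ascending, so reverse it and keep the first k
    return np.argsort(scores)[::-1][:k].tolist()

similarities = [0.10, 0.95, 0.40, 0.80, 0.05]   # made-up scores
print(top_k_indices(similarities, 3))           # [1, 3, 2]
```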

Next, we will take input from the user and convert it into a dataframe so that we can use this information to make new predictions.

Python3

user_df = {}

# Get user input for numerical columns.
# The codes must match the encodings used above.
user_df['age'] = float(input("Enter age: "))
user_df['status'] = float(input("Enter status (1 or 2): "))
user_df['sex'] = float(input(
    "Enter sex (1 for female, 0 for male): "))
user_df['height'] = float(input("Enter height in inches: "))
user_df['smokes'] = float(input(
    "Enter smokes (1 for no, 0 for yes): "))
user_df['new_languages'] = float(input(
    "Enter new languages encoded (0-2): "))
user_df['body_profile'] = float(input(
    "Enter body profile encoded (0-11): "))
user_df['education_level'] = float(input(
    "Enter education level (1-5): "))
user_df['dropped_out'] = float(input(
    "Enter dropped out (0 for no, 1 for yes): "))
user_df['bio'] = [input("Enter bio: ")]
user_df['location_preference'] = float(input(
    "Enter location preference (1, 2 or 2.5): "))
user_df['num_languages'] = float(input(
    "Enter number of languages known: "))
# drinks and drugs share one encoder, so both use codes 0-4
user_df['drinks_encoded'] = float(input(
    "Enter drinks encoded (0-4): "))
user_df['drugs_encoded'] = float(input(
    "Enter drugs encoded (0-4): "))

# Get user input for one-hot encoded categorical columns
user_df['location_new_york'] = float(
    input("Enter location_new_york (0 or 1): "))
user_df['location_northern_california'] = float(
    input("Enter location_northern_california (0 or 1): "))
user_df['location_southern_california'] = float(
    input("Enter location_southern_california (0 or 1): "))
user_df['job_encoded'] = float(input(
    "Enter job encoded (0-20): "))
user_df['pets_0'] = float(input("Enter pets_0 (0 or 1): "))
user_df['pets_1'] = float(input("Enter pets_1 (0 or 1): "))
user_df['pets_2'] = float(input("Enter pets_2 (0 or 1): "))
user_df['pets_3'] = float(input("Enter pets_3 (0 or 1): "))

# Convert the tf-idf vector of the bio to a DataFrame
tfidf_df = pd.DataFrame(tfidf.transform(
    user_df['bio']).toarray(), columns=feature_names)

# Convert the user input
# dictionary to a Pandas DataFrame
user_df = pd.DataFrame(user_df, index=[0])
user_df.drop("bio", axis=1, inplace=True)
user_df = pd.concat([user_df, tfidf_df], axis=1)

Output:

Enter age: 22
Enter status (1 or 2): 1
Enter sex (1 for female, 0 for male): 1
Enter height in inches: 60
Enter smokes (1 for no, 0 for yes): 0
Enter new languages encoded (0-2): 2
Enter body profile encoded (0-11): 1
Enter education level (1-5): 4
Enter dropped out (0 for no, 1 for yes): 1
Enter bio: I am a foodie and traveller. But sometimes like to sit alone in a corner and read a good fiction.
Enter location preference (1, 2 or 2.5): 2
Enter number of languages known: 2
Enter drinks encoded (0-4): 0
Enter drugs encoded (0-4): 0
Enter location_new_york (0 or 1): 0
Enter location_northern_california (0 or 1): 1
Enter location_southern_california (0 or 1): 0
Enter job encoded (0-20): 4
Enter pets_0 (0 or 1): 0
Enter pets_1 (0 or 1): 0
Enter pets_2 (0 or 1): 0
Enter pets_3 (0 or 1): 0
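
The user dataframe has to contain the same columns, in the same order, as the matrix the SVD was fitted on, because scikit-learn expects the input of transform() to match what it saw during fit. One way to guard against ordering mistakes is to reindex the user row against the training columns. The sketch below uses a shortened, hypothetical column list. In the article's code the list would be tinder_dfs.columns.

```python
import pandas as pd

# Column order the model was fitted on (shortened, hypothetical example)
train_columns = ["age", "height", "pets_0", "pets_1"]

# User input collected in a different order and with one column missing
user_row = pd.DataFrame([{"pets_1": 1.0, "age": 22.0, "height": 60.0}])

# reindex puts the columns in training order and fills gaps with 0
aligned = user_row.reindex(columns=train_columns, fill_value=0.0)
print(aligned.columns.tolist(), aligned.iloc[0].tolist())
```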

Finally, we call the function to print the recommended users.

Python3

print(recommend(user_df))

Output:

23      Ronald Millwood
550        Terry Ostrov
1685       Thomas Moran
1044    Travis Pergande
241       Carol Valente
Name: username, dtype: object

This is a very basic content-based recommender system. More sophisticated recommenders, including models based on deep learning, are available and tend to perform well on real-world datasets.
