Predict Tinder Matches with Machine Learning
In this article, we build a Tinder-style match-making recommender system. Most social media platforms have their own recommender algorithms. Our project works like Tinder: it recommends profiles to a user based on similar interests, aiming to surface the profiles that the user is most likely to find interesting and try to connect with. We will build the project from scratch, following the steps below.
Importing Libraries
We will import all the libraries in one place so that the dependencies are visible at the top of the script and we do not have to repeat import statements later.
- NumPy – A Python library for numerical computation on multidimensional arrays (ndarray); it also provides a large collection of mathematical functions that operate on these arrays.
- Pandas – A Python library built on top of NumPy for working with tabular data (dataframes); it is used for data cleaning, data merging, data reshaping, and data aggregation.
- Matplotlib – Used for plotting 2D and 3D visualizations; it can save figures in a variety of output formats.
- Seaborn – Built on top of Matplotlib; it provides a higher-level interface for statistical plots.
Python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})

import category_encoders as ce
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# LabelEncoder and OneHotEncoder are used in the Data Manipulation section
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
We will use the pandas read_csv() function to read our CSV file. You can download the dataset used in this article for demonstration purposes from here.
Python3
# read the dataset using pandas
tinder_df = pd.read_csv("data.csv")
After executing this function the dataset will be stored as a dataframe in the tinder_df variable. We can view the first five rows of the dataframe using tinder_df.head().
Exploratory Data Analysis of the Dataset
In exploratory data analysis (EDA), we try to extract essential information from the dataframe. EDA is often one of the most time-consuming parts of a data science project and can easily account for most of the work. However, as we will see, the effort pays off in the end.
We will first look at the dimensions of our dataset using the pandas shape attribute. It returns a tuple with the number of rows followed by the number of columns.
Python3
# shape of the dataset
print(tinder_df.shape)
Output:
(2001, 22)
Next, we will use the pandas info() function to see information about the dataset. It reports the Dtype and non-null count of every column.
Python3
# information about the dataset
tinder_df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2001 entries, 0 to 2000
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   user_id              2001 non-null   object
 1   username             2001 non-null   object
 2   age                  2001 non-null   int64
 3   status               2001 non-null   object
 4   sex                  2001 non-null   object
 5   orientation          2001 non-null   object
 6   drinks               2001 non-null   object
 7   drugs                2001 non-null   object
 8   height               2001 non-null   float64
 9   job                  2001 non-null   object
 10  location             2001 non-null   object
 11  pets                 2001 non-null   object
 12  smokes               2001 non-null   object
 13  language             2001 non-null   object
 14  new_languages        2001 non-null   object
 15  body_profile         2001 non-null   object
 16  education_level      2001 non-null   float64
 17  dropped_out          2001 non-null   object
 18  bio                  2001 non-null   object
 19  interests            2001 non-null   object
 20  other_interests      2001 non-null   object
 21  location_preference  2001 non-null   object
dtypes: float64(2), int64(1), object(19)
memory usage: 344.0+ KB
The output shows that the dataset has 2 float columns, 1 int column, and 19 object columns. To see the number of unique values in each column, we will use the pandas nunique() function.
Python3
# Number of unique elements in the columns
tinder_df.nunique()
Output:
user_id                2001
username               1995
age                      52
status                    4
sex                       2
orientation               3
drinks                    6
drugs                     3
height                   25
job                      21
location                 70
pets                     15
smokes                    5
language                575
new_languages             3
body_profile             12
education_level           5
dropped_out               2
bio                    2001
interests                31
other_interests          31
location_preference       3
dtype: int64
Data Wrangling
In data wrangling, we process and transform the data into a more useful structure. To split and summarize our dataset by category, we will use the pandas groupby() method.
Python3
tinder_df.groupby(['sex', 'drugs'])['drugs'] \
    .count() \
    .reset_index(name='unique_drug_count')
Output:
  sex      drugs  unique_drug_count
0   f      never                711
1   f      often                  5
2   f  sometimes                146
3   m      never                875
4   m      often                 13
5   m  sometimes                251
We can also group people by their interest in learning new languages and by whether they dropped out of college.
Python3
tinder_df.groupby(['new_languages', 'dropped_out']) \
    ['dropped_out'].count(). \
    reset_index(name='drop_out_people count')
Output:
         new_languages dropped_out  drop_out_people count
0           interested          no                    594
1           interested         yes                     39
2       not interested          no                    999
3       not interested         yes                     51
4  somewhat interested          no                    305
5  somewhat interested         yes                     13
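The same kind of two-column summary can also be read as a contingency table. The sketch below uses a small made-up dataframe (the values are illustrative and not taken from the dataset) to show that groupby(...).size() and pd.crosstab() produce the same counts in long and wide form.

```python
import pandas as pd

# Toy data standing in for two categorical columns (values are made up)
toy = pd.DataFrame({
    "sex":   ["f", "f", "m", "m", "m", "f"],
    "drugs": ["never", "sometimes", "never", "never", "often", "never"],
})

# Long format: one row per (sex, drugs) pair that occurs in the data
long_counts = (
    toy.groupby(["sex", "drugs"])
       .size()
       .reset_index(name="count")
)

# Wide format: rows are sex, columns are drugs, absent pairs become 0
wide_counts = pd.crosstab(toy["sex"], toy["drugs"])

print(long_counts)
print(wide_counts)
```

The wide form makes empty combinations visible as zeros, which the long form simply omits.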
Data Visualization
Data visualization is an important part of storytelling with data. Here we use Python plotting libraries to show what the individual columns can tell us.
Python3
# distribution of age
sns.histplot(tinder_df["age"], kde=True)
Output:
The age column has a long right tail, which shows that it deviates from a normal distribution. Later we will look at transformations that can bring such a column closer to a normal distribution. Next, we will plot a histogram of the height column.
Python3
# Distribution of height
sns.histplot(tinder_df["height"], kde=True)
Output:
We can also plot a pie chart of numerical data to see what percentage of the values falls in each range. For example, we may be interested in the percentage of Tinder users in each age range. We will use the pandas cut() function to create bins for the numerical data.
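Before binning the real column, it helps to see how cut() treats the bin edges. The following is a minimal sketch on a few made-up ages: by default each interval is open on the left and closed on the right, so a value equal to the first edge falls outside every bin unless include_lowest=True is passed.

```python
import pandas as pd

# A few made-up ages, including values that sit exactly on bin edges
ages = pd.Series([18, 25, 30, 31, 70])
bins = [18, 30, 40, 50, 60, 70]
labels = ["18-30", "30-40", "40-50", "50-60", "60-70"]

# Default: intervals are (18, 30], (30, 40], ... so 18 itself is left out
default_cut = pd.cut(ages, bins, labels=labels)

# include_lowest=True closes the first interval on the left: [18, 30]
inclusive_cut = pd.cut(ages, bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

Values left outside every bin become NaN and are silently excluded by value_counts(), so they would not appear in a pie chart built from the counts.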
Python3
# Set the size of the figure to 6 inches
# wide by 6 inches tall
plt.figure(figsize=(6, 6))

# Divide the data into categories
bins = [18, 30, 40, 50, 60, 70]

# Use the `cut` function to assign
# each data point to a category
categories = pd.cut(tinder_df["age"], bins,
                    labels=["18-30", "30-40", "40-50",
                            "50-60", "60-70"])

# Count the number of data points in each category
counts = categories.value_counts()

# Plot the data as a pie chart
plt.pie(counts, labels=counts.index, autopct='%1.1f%%')
plt.show()
Output:
We can use the histplot() function from Seaborn to plot the number of people in each job.
Python3
plt.figure(figsize=(6, 6))
sns.histplot(x="job", data=tinder_df, color="coral")

# rotate x-axis labels vertically
plt.xticks(rotation=90)
plt.title("Distribution of job of each candidate", fontsize=14)
plt.xlabel("Job id", fontsize=12)
plt.ylabel("Count of people", fontsize=12)
plt.show()
Output:
Data Manipulation
In data manipulation, we modify the columns of the dataset so that they can be used for modeling. We saw earlier that the numerical column age has a long right tail, so it is a right-skewed column. A log transformation is a common way to bring such a column closer to a normal distribution.
To encode categorical columns of object Dtype as numerical data, we will use three types of encoding.
- One-Hot encoding – Creates one 0/1 column per category. We will use this when a nominal column has multiple categories.
- Label encoding – Assigns one integer to each category. We will use this when a column has very few categories.
- Binary encoding – Similar to one-hot encoding, but it creates fewer new columns by encoding each category as binary digits.
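The three encodings listed above can be compared side by side on a toy column. The sketch below uses only pandas and NumPy on made-up values; the binary-encoding part is a hand-written illustration of the idea, not the category_encoders implementation, which may order categories differently.

```python
import numpy as np
import pandas as pd

# Toy categorical column (made-up values)
pets = pd.Series(["cat", "dog", "none", "bird", "dog"], name="pets")

# One-hot: one 0/1 column per category
one_hot = pd.get_dummies(pets, prefix="pets", dtype=int)

# Label encoding: one integer code per category (alphabetical order here)
codes = pets.astype("category").cat.codes

# Binary encoding (illustrative): shift codes to start at 1, then write
# each code in base 2 with one column per bit, most significant bit first
n_bits = int(np.ceil(np.log2(codes.max() + 2)))
binary = pd.DataFrame(
    {f"pets_{b}": ((codes + 1) // 2 ** (n_bits - 1 - b)) % 2
     for b in range(n_bits)}
)

print(one_hot.shape, binary.shape)
```

With four categories the one-hot version needs four columns while the binary version needs three; the gap widens as the number of categories grows.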
Several transformations are available for reducing skewness, such as the inverse transformation, the square root transformation, and the log transformation. Which one to choose depends on how skewed the column is.
We will now go through the categorical columns one by one and convert each into a corresponding numerical column.
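To see how such transformations affect skewness, here is a minimal sketch on synthetic data. The sample is drawn from a log-normal distribution as a stand-in for a right-skewed column; it is not the article's age column, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (an assumption, not the real dataset)
rng = np.random.default_rng(0)
ages = pd.Series(rng.lognormal(mean=3.3, sigma=0.5, size=2000))

raw_skew = ages.skew()
sqrt_skew = np.sqrt(ages).skew()
log_skew = np.log1p(ages).skew()   # log(1 + x), also safe when x is 0

print(f"raw: {raw_skew:.2f}  sqrt: {sqrt_skew:.2f}  log: {log_skew:.2f}")
```

A skewness close to zero indicates a roughly symmetric distribution. Note that any transformation applied to the training column would also have to be applied to the same column of a new user before comparing profiles.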
Python3
# check if every row has a
# common language as english
tinder_df['language'].str.contains('english') \
    .unique()
Output:
array([ True])
The language column has 575 unique values, and every row includes English. One-hot encoding this column would create a very sparse matrix, so instead we create a new column that counts the number of languages each person knows and then drop the language column.
Python3
# count the number of languages in each row
tinder_df['num_languages'] = tinder_df['language']\
    .str.count(',') + 1
tinder_df.drop(["language"], axis=1, inplace=True)
To encode location preference, we assign a number to each option: anywhere gets the lowest weight of 1.0, same state gets 2.0, and same city gets the highest weight of 2.5.
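The counting trick above relies on the languages being separated by commas. A small sketch on made-up entries (assuming the same comma-separated style that the code above implies) shows the comma count next to an equivalent split-based version.

```python
import pandas as pd

# Made-up entries, assumed to follow a comma-separated style
language = pd.Series([
    "english",
    "english, spanish (okay)",
    "english (fluently), french, german (poorly)",
])

# Number of commas + 1 = number of listed languages
num_languages = language.str.count(",") + 1

# Equivalent and slightly more explicit: split on commas, count the pieces
num_languages_split = language.str.split(",").str.len()

print(num_languages.tolist())
```

Both versions would miscount an entry that contains a comma for another reason, so it is worth inspecting a few raw values first.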
Python3
place_type_strength = {
    'anywhere': 1.0,
    'same state': 2.0,
    'same city': 2.5
}

tinder_df['location_preference'] = \
    tinder_df['location_preference']\
    .apply(lambda x: place_type_strength[x])
We can easily handle columns that have only two unique categorical values by label encoding.
Python3
two_unique_values_column = {
    'sex': {'f': 1, 'm': 0},
    'dropped_out': {'no': 0, 'yes': 1}
}

tinder_df.replace(two_unique_values_column,
                  inplace=True)
We will divide the four distinct values of the status column into two groups, giving a higher weight to people who are single or available.
- Single or available (weight 2.0).
- Married or seeing someone (weight 1.0).
Python3
status_type_strength = {
    'single': 2.0,
    'available': 2.0,
    'seeing someone': 1.0,
    'married': 1.0
}
tinder_df['status'] = tinder_df['status']\
    .apply(lambda x: status_type_strength[x])
Orientation is an important attribute and is nominal categorical data, for which one-hot encoding is the usual choice. Note, however, that the code below label-encodes the column with LabelEncoder and then drops it, so orientation is not part of the feature matrix used later.
Python3
# create a LabelEncoder object
orientation_encoder = LabelEncoder()

# fit the encoder on the orientation column
orientation_encoder.fit(tinder_df['orientation'])

# encode the orientation column using the fitted encoder
tinder_df['orientation'] = orientation_encoder.\
    transform(tinder_df['orientation'])

# Drop the orientation column (the encoded values
# are therefore not used in the later steps)
tinder_df.drop("orientation", axis=1, inplace=True)
The drinks column has 6 unique values. However, we can group these 6 values into three broader categories. We then label-encode the drinks and drugs columns with a single shared encoder.
Python3
drinking_habit = {
    'socially': 'sometimes',
    'rarely': 'sometimes',
    'not at all': 'do not drink',
    'often': 'drinks often',
    'very often': 'drinks often',
    'desperately': 'drinks often'
}
tinder_df['drinks'] = tinder_df['drinks']\
    .apply(lambda x: drinking_habit[x])

# create a LabelEncoder object
habit_encoder = LabelEncoder()

# fit the encoder on the drinks and drugs columns
habit_encoder.fit(tinder_df[['drinks', 'drugs']]
                  .values.reshape(-1))

# encode the drinks and drugs columns
# using the fitted encoder
tinder_df['drinks_encoded'] = \
    habit_encoder.transform(tinder_df['drinks'])
tinder_df['drugs_encoded'] = \
    habit_encoder.transform(tinder_df['drugs'])

# Drop the existing drink and drugs column
tinder_df.drop(["drinks", "drugs"], axis=1,
               inplace=True)
The location column has 70 unique values. One-hot encoding it directly would create 70 more columns, so here we use our geographical knowledge to group the locations into broader regions first.
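Grouping categories with a dictionary lookup raises a KeyError as soon as a value is missing from the dictionary. As a hedged alternative, the sketch below uses Series.map on made-up values, which leaves unmapped entries as NaN so that gaps can be listed explicitly; the value "unknown" is hypothetical and only there to trigger that case.

```python
import pandas as pd

# Made-up raw values; "unknown" is a hypothetical entry not in the mapping
drinks = pd.Series(["socially", "rarely", "not at all", "very often", "unknown"])

drinking_habit = {
    "socially": "sometimes",
    "rarely": "sometimes",
    "not at all": "do not drink",
    "often": "drinks often",
    "very often": "drinks often",
    "desperately": "drinks often",
}

# Series.map leaves unmapped values as NaN instead of raising KeyError
grouped = drinks.map(drinking_habit)

# Listing the unmapped raw values makes gaps in the dictionary visible
unmapped = drinks[grouped.isna()].unique().tolist()

print(grouped.tolist())
print(unmapped)
```

Whether to fail fast or to collect the unmapped values depends on how much the raw data can be trusted.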
Python3
region_dict = {'southern_california': ['los angeles',
                                       'san diego',
                                       'hacienda heights',
                                       'north hollywood',
                                       'phoenix'],
               'new_york': ['brooklyn',
                            'new york']}


def get_region(city):
    for region, cities in region_dict.items():
        if city.lower() in [c.lower() for c in cities]:
            return region
    return "northern_california"


tinder_df['location'] = tinder_df['location']\
    .str.split(', ')\
    .str[0].apply(get_region)

# perform one hot encoding
location_encoder = OneHotEncoder()

# fit and transform the location column
location_encoded = location_encoder.fit_transform(
    tinder_df[['location']])

# create a new DataFrame with the encoded columns
location_encoded_df = pd.DataFrame(
    location_encoded.toarray(),
    columns=location_encoder.get_feature_names_out(['location']))

# concatenate the new DataFrame with the original DataFrame
tinder_df = pd.concat([tinder_df, location_encoded_df], axis=1)

# Drop the existing location column
tinder_df.drop(["location"], axis=1, inplace=True)
Job is an important part of an individual's identity, so we cannot drop this column, and we cannot easily generalize the jobs into broader categories either. Here we encode the column with LabelEncoder.
Python3
# create a LabelEncoder object
job_encoder = LabelEncoder()

# fit the encoder on the job column
job_encoder.fit(tinder_df['job'])

# encode the job column using the fitted encoder
tinder_df['job_encoded'] = job_encoder.\
    transform(tinder_df['job'])

# drop the original job column
tinder_df.drop('job', axis=1, inplace=True)
The smokes column has 5 different values. We reduce these to two: the person either does not smoke (1.0) or smokes at least occasionally (0).
Python3
smokes = {
    'no': 1.0,
    'sometimes': 0,
    'yes': 0,
    'when drinking': 0,
    'trying to quit': 0
}
tinder_df['smokes'] = tinder_df['smokes']\
    .apply(lambda x: smokes[x])
For the pets column, we will do Binary encoding.
Python3
bin_enc = ce.BinaryEncoder(cols=['pets'])

# fit and transform the pet column
pet_enc = bin_enc.fit_transform(tinder_df['pets'])

# add the encoded columns to the original dataframe
tinder_df = pd.concat([tinder_df, pet_enc], axis=1)

tinder_df.drop("pets", axis=1, inplace=True)
For the new_languages and body_profile columns, we will simply use label encoding.
Python3
# create a LabelEncoder object
new_languages_encoder = LabelEncoder()

# fit the encoder on the new_languages column
new_languages_encoder.fit(tinder_df['new_languages'])

# encode the new_languages column using the fitted encoder
tinder_df['new_languages'] = new_languages_encoder.transform(
    tinder_df['new_languages'])

# create an instance of LabelEncoder
le = LabelEncoder()

# encode the body_profile column
tinder_df["body_profile"] = le.fit_transform(tinder_df["body_profile"])
Data Modelling
In data modeling, we first use TfidfVectorizer from the sklearn package to convert the free-text bio column into numerical features. The output of TfidfVectorizer is a high-dimensional sparse matrix, so we then use truncated SVD (Singular Value Decomposition) to reduce the dimensionality of the feature matrix.
To measure how similar a user is to each stored profile, we use the cosine similarity between their feature vectors.
This is a content-based filtering algorithm: we use the user's profile information to recommend other profiles with similar characteristics. The algorithm recommends the profiles that have the highest cosine similarity score with the user.
Python3
# Initialize TfidfVectorizer object
tfidf = TfidfVectorizer(stop_words='english')

# Fit and transform the text data
tfidf_matrix = tfidf.fit_transform(tinder_df['bio'])

# Get the feature names in column order
# (vocabulary_ is a term -> index dict, not an ordered list)
feature_names = tfidf.get_feature_names_out()

# Convert tfidf matrix to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=feature_names)

# Add non-text features to the tfidf_df dataframe.
# interests and other_interests are still text at this point and are
# not collected from the new user later, so they are dropped as well.
tinder_dfs = tinder_df.drop(["bio", "user_id", "username",
                             "interests", "other_interests"], axis=1)
tinder_dfs = pd.concat([tinder_dfs,
                        tfidf_df], axis=1)

# Apply SVD to the feature matrix
svd = TruncatedSVD(n_components=100)
svd_matrix = svd.fit_transform(tinder_dfs)

# Calculate the cosine similarity
# between all pairs of users
cosine_sim = cosine_similarity(svd_matrix)
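To make the similarity step concrete, the sketch below computes pairwise cosine similarity with plain NumPy on three made-up profile vectors. The helper name cosine_similarity_matrix is hypothetical; it illustrates the formula (dot product divided by the product of the vector lengths) rather than reproducing the sklearn implementation.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for all-zero rows
    unit = X / norms                 # scale every row to length 1
    return unit @ unit.T             # dot products of unit vectors

# Three toy profile vectors (made up)
profiles = np.array([
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],   # same direction as row 0
    [0.0, 3.0, 0.0],   # orthogonal to rows 0 and 1
])
sim = cosine_similarity_matrix(profiles)
print(np.round(sim, 3))
```

Cosine similarity ignores the overall length of a vector but not the relative scale of its features, so columns with large raw values can outweigh small TF-IDF weights unless the features are rescaled first.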
Model Prediction
To get recommendations for the new user we will define a new recommend function.
Python3
def recommend(user_df, num_recommendations=5):

    # Apply SVD to the feature
    # matrix of the user_df dataframe
    svd_matrixs = svd.transform(user_df)

    # Calculate the cosine similarity
    # between the user_df and training set users
    cosine_sim_new = cosine_similarity(svd_matrixs, svd_matrix)

    # Get the indices of the top
    # num_recommendations similar users
    sim_scores = list(enumerate(cosine_sim_new[0]))
    sim_scores = sorted(sim_scores,
                        key=lambda x: x[1], reverse=True)
    # Note: the slice starts at 1, so the single closest profile is skipped
    sim_indices = [i[0] for i in
                   sim_scores[1:num_recommendations + 1]]

    # Return the usernames of the recommended users
    return tinder_df['username'].iloc[sim_indices]
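The ranking step inside such a function can be sketched in isolation. The helper below (top_k_indices and its skip_self parameter are hypothetical names) returns the indices of the highest scores and optionally removes the query's own row, which only matters when the query user is already part of the stored profiles.

```python
import numpy as np

def top_k_indices(scores, k, skip_self=None):
    """Return the indices of the k largest scores, best first."""
    order = np.argsort(scores)[::-1]          # indices sorted by descending score
    if skip_self is not None:
        order = order[order != skip_self]     # drop the query's own row
    return order[:k].tolist()

# Made-up similarity scores between one query and five stored profiles
scores = np.array([0.10, 0.95, 0.40, 0.80, 0.99])

new_user_top = top_k_indices(scores, k=3)                    # query not stored
existing_user_top = top_k_indices(scores, k=3, skip_self=4)  # query is row 4

print(new_user_top, existing_user_top)
```

For a brand-new user there is no self-match to remove, so the closest stored profile can be kept in the result.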
Next, we will take input from the user and convert it into a dataframe so that we can use this information to make new predictions.
Python3
user_df = {}

# Get user input for numerical columns.
# The values must follow the encodings used for the training data above.
# Caution: the training data encodes sex as f=1, m=0 and smokes as
# no=1, otherwise 0, which is the reverse of what these two prompts say.
user_df['age'] = float(input("Enter age: "))
user_df['status'] = float(input("Enter status: "))
user_df['sex'] = float(input("Enter sex (0 for female, 1 for male): "))
user_df['height'] = float(input("Enter height in inches: "))
user_df['smokes'] = float(input("Enter smokes (0 for no, 1 for yes): "))
user_df['new_languages'] = float(input("Enter number of new languages learned: "))
user_df['body_profile'] = float(input("Enter body profile (0-1)"))
user_df['education_level'] = float(input("Enter education level (1-5): "))
user_df['dropped_out'] = float(input("Enter dropped out (0 for no, 1 for yes): "))
user_df['bio'] = [input("Enter bio: ")]
user_df['location_preference'] = float(input("Enter location preference (0-2): "))
user_df['num_languages'] = float(input("Enter number of languages known: "))
user_df['drinks_encoded'] = float(input("Enter drinks encoded (0-3): "))
user_df['drugs_encoded'] = float(input("Enter drugs encoded (0-2): "))

# Get user input for one-hot encoded categorical columns
user_df['location_new_york'] = float(
    input("Enter location_new_york (0 or 1): "))
user_df['location_northern_california'] = float(
    input("Enter location_northern_california (0 or 1): "))
user_df['location_southern_california'] = float(
    input("Enter location_southern_california (0 or 1): "))
user_df['job_encoded'] = float(input("Enter job encoded (0-9): "))
user_df['pets_0'] = float(input("Enter pets_0 (0 or 1): "))
user_df['pets_1'] = float(input("Enter pets_1 (0 or 1): "))
user_df['pets_2'] = float(input("Enter pets_2 (0 or 1): "))
user_df['pets_3'] = float(input("Enter pets_3 (0 or 1): "))

# Convert tfidf matrix to DataFrame
tfidf_df = pd.DataFrame(tfidf.transform(user_df['bio']).toarray(),
                        columns=feature_names)

# Convert the user input
# dictionary to a Pandas DataFrame
user_df = pd.DataFrame(user_df, index=[0])
user_df.drop("bio", axis=1, inplace=True)
user_df = pd.concat([user_df, tfidf_df], axis=1)
Output:
Enter age: 22
Enter status: 1
Enter sex (0 for female, 1 for male): 1
Enter height in inches: 60
Enter smokes (0 for no, 1 for yes): 0
Enter number of new languages learned: 2
Enter body profile (0-1)1
Enter education level (1-5): 4
Enter dropped out (0 for no, 1 for yes): 1
Enter bio: I am a foodie and traveller. But sometimes like to sit alone in a corner and read a good fiction.
Enter location preference (0-2): 2
Enter number of languages known: 2
Enter drinks encoded (0-3): 0
Enter drugs encoded (0-2): 0
Enter location_new_york (0 or 1): 0
Enter location_northern_california (0 or 1): 1
Enter location_southern_california (0 or 1): 0
Enter job encoded (0-9): 4
Enter pets_0 (0 or 1): 0
Enter pets_1 (0 or 1): 0
Enter pets_2 (0 or 1): 0
Enter pets_3 (0 or 1): 0
Call the function to print the recommended users.
Python3
print(recommend(user_df))
Output:
23      Ronald Millwood
550        Terry Ostrov
1685       Thomas Moran
1044    Travis Pergande
241       Carol Valente
Name: username, dtype: object
This is a very basic content-based recommender system. There are many models based on deep learning that work really well on real-world datasets.