Data Manipulation
In data manipulation, we transform the columns of the dataset so that they are suitable for modeling. We saw earlier that the numerical column age has a long right tail, i.e. it is right-skewed. We will therefore apply a log transformation to this column to bring its distribution closer to normal.
To convert categorical columns of object dtype into numerical data, we will use three types of encoding.
- One-Hot encoding – Creates one new column per category. We will use this when the column has multiple categories.
- Label encoding – Replaces each category with an integer in a single column. We will use this when the column has very few categories.
- Binary encoding – Similar to one-hot encoding, but it creates fewer new columns by encoding each category as binary digits.
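To make the difference between the three encodings concrete, here is a minimal sketch on a made-up pets column. The values are hypothetical, and the binary encoding is hand-rolled with pandas to show the idea; it assumes 1-based category codes and is not the category_encoders implementation.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative, not taken from the dataset.
pets = pd.Series(["cat", "dog", "none", "bird", "fish", "both", "dog"], name="pets")

# One-hot encoding: one indicator column per category (6 categories -> 6 columns).
one_hot = pd.get_dummies(pets, prefix="pets", dtype=int)

# Label encoding: a single integer column (codes follow the sorted category order).
label = pets.astype("category").cat.codes

# Binary encoding (hand-rolled sketch): write each 1-based code in base 2
# and store one bit per column, most significant bit first.
codes = label.to_numpy() + 1
n_bits = int(codes.max()).bit_length()
binary = pd.DataFrame(
    {f"pets_{i}": (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)}
)

print("one-hot columns:", one_hot.shape[1])
print("binary columns: ", n_bits)
```

With six categories, one-hot encoding needs six columns while the binary sketch needs three, and the gap widens as the number of categories grows.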
Several transformations can reduce skewness, such as the inverse, square root, and log transformations. Which one is appropriate depends on how skewed the column is.
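One way to choose between these transformations is to compare the skewness before and after each one. The sketch below uses synthetic right-skewed values as a stand-in for the age column (an assumption, since the real data is not shown here), so the numbers it prints are only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "age" values; an assumption standing in for tinder_df['age'].
rng = np.random.default_rng(0)
age = pd.Series(18 + rng.lognormal(mean=2.0, sigma=0.6, size=5000), name="age")

# Sample skewness of the raw column and of each candidate transformation.
skewness = {
    "raw": age.skew(),
    "sqrt": np.sqrt(age).skew(),
    "log": np.log(age).skew(),
    "inverse": (1.0 / age).skew(),
}

for name, value in skewness.items():
    print(f"{name:>8}: {value:+.3f}")
```

A value closer to zero indicates a more symmetric distribution. If a column can contain zeros, np.log1p is a common substitute for np.log.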
We will go through the columns one by one and convert each into a corresponding numerical column.
Python3
# check whether every row lists English as a language
tinder_df['language'].str.contains('english').unique()
Output:
array([ True])
The language column has 571 unique values, and every row includes English. One-hot encoding this column would produce a very sparse matrix, so instead we create a new column that counts the number of languages each person knows and then drop the language column.
Python3
# count the number of languages in each row (commas + 1)
tinder_df['num_languages'] = tinder_df['language'].str.count(',') + 1
tinder_df.drop(["language"], axis=1, inplace=True)
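The counting trick relies on the languages being comma-separated: one comma separates two languages, so the count is the number of commas plus one. A small sketch on hypothetical rows (the strings are made up to mimic that format):

```python
import pandas as pd

# Hypothetical rows in the same comma-separated format as the language column.
demo = pd.DataFrame(
    {"language": ["english", "english, spanish", "english, french, c++"]}
)

# One comma separates two languages, so languages = commas + 1.
demo["num_languages"] = demo["language"].str.count(",") + 1

# Equivalent check that splits the string instead of counting separators.
split_count = demo["language"].str.split(",").str.len()

print(demo["num_languages"].tolist())
```

Both approaches propagate missing values as NaN, so rows with no language entry would need to be handled before modeling.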
To encode location preference, we assign a number to each preferred place type: anywhere gets the lowest value, 1.0, same state gets 2.0, and same city gets the highest value, 2.5.
Python3
# numeric strength for each location preference
place_type_strength = {
    'anywhere': 1.0,
    'same state': 2.0,
    'same city': 2.5
}

tinder_df['location_preference'] = tinder_df['location_preference'].apply(
    lambda x: place_type_strength[x])
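A dictionary lookup inside apply raises a KeyError as soon as a row holds a label that is missing from the dictionary. As a hedged alternative, Series.map returns NaN for such labels, which makes them easy to list. The sample values below, including the extra label "same country", are hypothetical.

```python
import pandas as pd

place_type_strength = {"anywhere": 1.0, "same state": 2.0, "same city": 2.5}

# Hypothetical sample, including one label that is not in the mapping.
pref = pd.Series(["anywhere", "same city", "same country", "same state"])

# map() looks each value up in the dictionary; unmapped labels become NaN.
encoded = pref.map(place_type_strength)

# List the labels that the mapping does not cover.
unmapped = pref[encoded.isna()].unique().tolist()
print(unmapped)
```

Checking that the list of unmapped labels is empty before overwriting the column avoids silently introducing missing values.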
We can easily handle columns that have only two unique categorical values by label encoding.
Python3
# per-column mapping for the binary categorical columns
two_unique_values_column = {
    'sex': {'f': 1, 'm': 0},
    'dropped_out': {'no': 0, 'yes': 1}
}

tinder_df.replace(two_unique_values_column, inplace=True)
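With a nested dictionary, DataFrame.replace applies each inner mapping only to the column named by the outer key. The sketch below uses a hypothetical frame with an extra column, other, that happens to contain the same letters, to show that it is left untouched.

```python
import pandas as pd

# Hypothetical frame; 'other' reuses the letters 'm' and 'f' on purpose.
demo = pd.DataFrame(
    {"sex": ["f", "m"], "dropped_out": ["no", "yes"], "other": ["m", "f"]}
)

mapping = {"sex": {"f": 1, "m": 0}, "dropped_out": {"no": 0, "yes": 1}}

# Each inner dictionary is applied only to the column named by its outer key.
encoded = demo.replace(mapping)
print(encoded)
```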
The status column has four distinct values, which we divide into two groups.
- The person is single or available.
- The person is married or seeing someone.
A higher weight is given to people who are single or available.
Python3
# higher weight for people who are single or available
status_type_strength = {
    'single': 2.0,
    'available': 2.0,
    'seeing someone': 1.0,
    'married': 1.0
}

tinder_df['status'] = tinder_df['status'].apply(
    lambda x: status_type_strength[x])
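Orientation is an important feature of the dataframe and is nominal categorical data, for which one-hot encoding is the usual choice. The code below, however, uses scikit-learn's LabelEncoder, which maps each category to an integer in a single column.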
Python3
from sklearn.preprocessing import LabelEncoder

# create a LabelEncoder object
orientation_encoder = LabelEncoder()

# fit the encoder on the orientation column
orientation_encoder.fit(tinder_df['orientation'])

# encode the orientation column in place using the fitted encoder
# (the column is not dropped afterwards, otherwise the encoded values would be lost)
tinder_df['orientation'] = orientation_encoder.transform(
    tinder_df['orientation'])
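For comparison, a one-hot encoding of a nominal column such as orientation could be sketched with pandas.get_dummies. This is an illustrative alternative on hypothetical values, not the approach used in the code above.

```python
import pandas as pd

# Hypothetical orientation values; the real column may contain other labels.
demo = pd.DataFrame({"orientation": ["straight", "gay", "bisexual", "straight"]})

# One indicator column per category, so no artificial ordering is introduced.
dummies = pd.get_dummies(demo["orientation"], prefix="orientation", dtype=int)

# Replace the original column with its indicator columns.
demo = pd.concat([demo.drop(columns="orientation"), dummies], axis=1)
print(demo.columns.tolist())
```

Each row then has exactly one indicator set to 1, whereas label encoding implies an order between categories that a model may pick up on.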
The drinks column has 6 unique values, which we can group into three broader categories. With one-hot encoding this would mean only three extra columns; the code below instead label-encodes the drinks and drugs columns with a single shared LabelEncoder.
Python3
# group the six drinking habits into three broader categories
drinking_habit = {
    'socially': 'sometimes',
    'rarely': 'sometimes',
    'not at all': 'do not drink',
    'often': 'drinks often',
    'very often': 'drinks often',
    'desperately': 'drinks often'
}
tinder_df['drinks'] = tinder_df['drinks'].apply(lambda x: drinking_habit[x])

# create a LabelEncoder object
habit_encoder = LabelEncoder()

# fit the encoder on the values of both the drinks and drugs columns
habit_encoder.fit(tinder_df[['drinks', 'drugs']].values.reshape(-1))

# encode the drinks and drugs columns using the fitted encoder
tinder_df['drinks_encoded'] = habit_encoder.transform(tinder_df['drinks'])
tinder_df['drugs_encoded'] = habit_encoder.transform(tinder_df['drugs'])

# drop the original drinks and drugs columns
tinder_df.drop(["drinks", "drugs"], axis=1, inplace=True)
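Fitting one LabelEncoder on the combined values of two columns means that a label shared by both columns receives the same integer in each. A minimal sketch, assuming hypothetical values for the drugs column (its real categories are not shown here):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical values; 'sometimes' appears in both columns.
demo = pd.DataFrame({
    "drinks": ["sometimes", "drinks often", "do not drink"],
    "drugs": ["never", "sometimes", "never"],
})

# Fit a single encoder on the flattened values of both columns.
encoder = LabelEncoder()
encoder.fit(demo[["drinks", "drugs"]].values.reshape(-1))

drinks_encoded = encoder.transform(demo["drinks"])
drugs_encoded = encoder.transform(demo["drugs"])

# classes_ is sorted, and the position of each label is its integer code.
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
```

The trade-off is that the codes of each column are no longer contiguous, since the integers are shared across the vocabulary of both columns.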
The location column has 70 unique values, so one-hot encoding it directly would create 70 new columns. Instead, we use geographical knowledge to group the locations into broader regions.
Python3
from sklearn.preprocessing import OneHotEncoder

region_dict = {
    'southern_california': ['los angeles', 'san diego', 'hacienda heights',
                            'north hollywood', 'phoenix'],
    'new_york': ['brooklyn', 'new york']
}


def get_region(city):
    # return the region whose city list contains the given city
    for region, cities in region_dict.items():
        if city.lower() in [c.lower() for c in cities]:
            return region
    # every other city falls back to northern California
    return "northern_california"


# keep only the city name (the part before ', ') and map it to a region
tinder_df['location'] = tinder_df['location'].str.split(', ').str[0].apply(get_region)

# perform one hot encoding
location_encoder = OneHotEncoder()

# fit and transform the location column
location_encoded = location_encoder.fit_transform(tinder_df[['location']])

# create a new DataFrame with the encoded columns,
# reusing the original index so the rows stay aligned in concat
location_encoded_df = pd.DataFrame(
    location_encoded.toarray(),
    columns=location_encoder.get_feature_names_out(['location']),
    index=tinder_df.index)

# concatenate the new DataFrame with the original DataFrame
tinder_df = pd.concat([tinder_df, location_encoded_df], axis=1)

# drop the existing location column
tinder_df.drop(["location"], axis=1, inplace=True)
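One detail worth checking when concatenating encoded columns is that pd.concat with axis=1 aligns rows on the index. The sketch below uses a hypothetical frame whose index is not 0..n-1, and pandas.get_dummies as a stand-in for the encoder, to show what happens when the encoded frame is rebuilt from a bare array.

```python
import pandas as pd

# Hypothetical frame whose index is not 0..n-1, e.g. after rows were filtered out.
df = pd.DataFrame({"location": ["new_york", "southern_california"]}, index=[3, 7])

encoded = pd.get_dummies(df["location"], prefix="location", dtype=int)

# Rebuilding the encoded frame from a bare array resets its index to 0..n-1,
# so concat pads the unmatched rows with NaN.
reset = pd.DataFrame(encoded.to_numpy(), columns=encoded.columns)
misaligned = pd.concat([df, reset], axis=1)

# Passing index=df.index keeps the rows lined up.
kept = pd.DataFrame(encoded.to_numpy(), columns=encoded.columns, index=df.index)
aligned = pd.concat([df, kept], axis=1)

print(misaligned.shape, aligned.shape)
```

If the dataframe still has its default RangeIndex, both variants give the same result; the difference only appears once rows have been dropped or reordered.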
Job is an important part of a person's identity, so we cannot drop this column, and its values do not generalize well into broader categories. The code below encodes it with LabelEncoder, giving each job an integer code.
Python3
# create a LabelEncoder object
job_encoder = LabelEncoder()

# fit the encoder on the job column
job_encoder.fit(tinder_df['job'])

# encode the job column using the fitted encoder
tinder_df['job_encoded'] = job_encoder.transform(tinder_df['job'])

# drop the original job column
tinder_df.drop('job', axis=1, inplace=True)
The smokes column distinguishes several kinds of smoking behaviour. Here we reduce them to two: the person either does not smoke (encoded as 1.0) or smokes in some form (encoded as 0).
Python3
# 1.0 for non-smokers, 0 for every kind of smoker
smokes = {
    'no': 1.0,
    'sometimes': 0,
    'yes': 0,
    'when drinking': 0,
    'trying to quit': 0
}
tinder_df['smokes'] = tinder_df['smokes'].apply(lambda x: smokes[x])
For the pets column, we will do Binary encoding.
Python3
import category_encoders as ce

# create a binary encoder for the pets column
bin_enc = ce.BinaryEncoder(cols=['pets'])

# fit and transform the pets column (passed as a one-column DataFrame)
pet_enc = bin_enc.fit_transform(tinder_df[['pets']])

# add the encoded columns to the original dataframe
tinder_df = pd.concat([tinder_df, pet_enc], axis=1)

# drop the original pets column
tinder_df.drop("pets", axis=1, inplace=True)
For the new_languages and body_profile columns, the code below applies a LabelEncoder to each.
Python3
# create a LabelEncoder object for the new_languages column
new_languages_encoder = LabelEncoder()

# fit the encoder on the new_languages column
new_languages_encoder.fit(tinder_df['new_languages'])

# encode the new_languages column using the fitted encoder
tinder_df['new_languages'] = new_languages_encoder.transform(
    tinder_df['new_languages'])

# create a separate LabelEncoder for the body_profile column
le = LabelEncoder()

# fit and encode the body_profile column in one step
tinder_df["body_profile"] = le.fit_transform(tinder_df["body_profile"])
Predict Tinder Matches with Machine Learning
In this article, we build a Tinder-style match-making recommender system. Most social media platforms have their own recommender algorithms. In this project, we build a recommender algorithm that suggests profiles to people based on similar interests, aiming to predict the profiles a user will find most interesting and will want to connect with. We build the project from the basics, step by step.