Data Manipulation

In data manipulation, we transform the columns of the dataset so that they can be used for modeling. Previously we saw that the numerical column age has a long right tail, that is, it is right-skewed. We will therefore apply a log transformation to this column to reduce the skew and bring it closer to a normal distribution.
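The transformation itself is not shown in this section, so here is a minimal sketch of how it might be applied. The toy ages and the `age_log` column name are assumptions for illustration; only the age column itself comes from the dataset.

```python
import numpy as np
import pandas as pd

# Toy right-skewed ages (made up for illustration, not taken from the dataset)
toy_df = pd.DataFrame({"age": [18, 19, 20, 21, 22, 23, 25, 28, 35, 60]})

# The natural log compresses large values more than small ones,
# which pulls in the long right tail
toy_df["age_log"] = np.log(toy_df["age"])

skew_before = toy_df["age"].skew()
skew_after = toy_df["age_log"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The log only works on strictly positive values, which holds for ages. It reduces the skew but does not guarantee an exactly normal column.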

To encode categorical data of object Dtype as numerical data, we will use three types of encoding.

One-Hot encoding – We will use this when a column has multiple categories.
Label encoding – We will use this when a column has very few categories.
Binary encoding – This is similar to one-hot encoding, but it creates fewer new columns by encoding each category as binary digits.
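As a rough illustration of how the three encodings differ in the number of columns they produce, the sketch below encodes a toy column with plain pandas. The data, the variable names, and the bit-count formula (which assumes category indices start at 1) are assumptions for illustration and not part of the project code.

```python
import math
import pandas as pd

# Toy nominal column (values are made up for illustration)
pets = pd.Series(["cat", "dog", "bird", "fish", "dog", "cat"], name="pets")

# One-hot: one indicator column per category
one_hot = pd.get_dummies(pets, prefix="pets")

# Label: a single integer column; factorize numbers categories
# in order of first appearance
label_codes, categories = pd.factorize(pets)

# Binary: category indices written in base 2, so the column count
# grows with log2 of the number of categories
n_categories = len(categories)
n_binary_cols = math.ceil(math.log2(n_categories + 1))

print(one_hot.shape[1], "one-hot columns")
print(label_codes.tolist())
print(n_binary_cols, "binary columns")
```

With four categories this gives four one-hot columns, one label column, and three binary columns; the gap widens quickly as the number of categories grows.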

Several transformations are available for reducing skewness, such as the inverse, square root, and log transformations. Which one is right depends on how skewed the column is, so the choice is left to us.
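One way to make that choice less arbitrary is to compute the skewness after each candidate transformation and compare. The sketch below does this on toy values; the data and all names in it are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy right-skewed, strictly positive values (made up for illustration)
values = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 15, 40], dtype=float)

# Candidate transformations; all of them need positive inputs here
candidates = {
    "original": values,
    "square root": np.sqrt(values),
    "log": np.log(values),
    "inverse": 1.0 / values,
}

# Sample skewness of each version; values near 0 mean a more symmetric column
skews = {name: series.skew() for name, series in candidates.items()}
for name, skew in skews.items():
    print(f"{name:12s} {skew:+.2f}")

# Pick the transformation whose skewness is closest to zero
best = min(skews, key=lambda name: abs(skews[name]))
print("closest to symmetric:", best)
```

Skewness is only one criterion. A histogram of the transformed column is still worth checking before settling on a transformation.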

We will handle each variable in turn and convert it into a corresponding numerical column.

Python3

# check whether every row lists
# english as a language
tinder_df['language'].str.contains('english').unique()

Output:

array([ True])

The language column has 571 unique values, and every row includes English. One-hot encoding this column would create a very sparse matrix, so instead we create a new column that counts the number of languages each person knows and then drop the language column.

Python3

# count the number of languages in each row
# (n commas separate n + 1 languages)
tinder_df['num_languages'] = tinder_df['language'].str.count(',') + 1
tinder_df.drop(["language"], axis=1, inplace=True)
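To see why counting commas works, here is a small worked example on made-up rows. The toy values are assumptions; only the `language` and `num_languages` column names come from the code above.

```python
import pandas as pd

# Toy rows in the same comma-separated style (values are made up)
toy_df = pd.DataFrame({
    "language": ["english", "english, spanish", "english, french, c++"]
})

# n commas separate n + 1 languages
toy_df["num_languages"] = toy_df["language"].str.count(",") + 1
print(toy_df["num_languages"].tolist())
```

Note that this counts list entries rather than distinct languages, so a language listed twice in one row would be counted twice.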

To encode location preference, we assign a weight to each preferred place: anywhere gets the lowest weight, 1.0, and same city gets the highest, 2.5.

Python3

place_type_strength = { 
    'anywhere': 1.0, 
    'same state': 2.0, 
    'same city': 2.5
} 
  
tinder_df['location_preference'] = (
    tinder_df['location_preference']
    .apply(lambda x: place_type_strength[x])
)
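Looking up the dictionary inside apply raises a KeyError as soon as the column contains a value that is missing from the mapping. A possible alternative, sketched below on toy data, is Series.map, which returns NaN for unmapped values so they can be inspected first. The value 'same country' is a made-up example of such a value.

```python
import pandas as pd

place_type_strength = {"anywhere": 1.0, "same state": 2.0, "same city": 2.5}

# Toy column with one value that is missing from the mapping
prefs = pd.Series(["anywhere", "same city", "same country"])

# Series.map returns NaN for keys that are not in the dict,
# whereas indexing the dict inside apply would raise KeyError
encoded = prefs.map(place_type_strength)

# List the raw values that were not mapped so the dict can be extended
unmapped = prefs[encoded.isna()].unique().tolist()
print(encoded.tolist())
print(unmapped)
```

Whether a loud KeyError or a quiet NaN is preferable depends on how much the input data is trusted.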

We can easily handle columns that have only two unique categorical values by label encoding.

Python3

two_unique_values_column = { 
    'sex': {'f': 1, 'm': 0}, 
    'dropped_out': {'no': 0, 'yes': 1} 
} 
  
tinder_df.replace(two_unique_values_column, 
                  inplace=True)

The status column has four distinct values, which we divide into two groups.

The person is either single or available.
The person is either married or seeing someone.

A higher weight is given to people who are single or available.

Python3

status_type_strength = { 
    'single': 2.0, 
    'available': 2.0, 
    'seeing someone': 1.0, 
    'married': 1.0
} 
tinder_df['status'] = (
    tinder_df['status']
    .apply(lambda x: status_type_strength[x])
)

Orientation is an important element of the dataframe and is nominal categorical data. The code below encodes it with LabelEncoder, which produces a single integer column rather than one-hot columns, so the integer codes carry no ordering.

Python3

from sklearn.preprocessing import LabelEncoder

# create a LabelEncoder object
orientation_encoder = LabelEncoder()

# fit the encoder on the orientation column
orientation_encoder.fit(tinder_df['orientation'])

# encode into a new column, so that dropping the original
# column below does not discard the encoded values
tinder_df['orientation_encoded'] = orientation_encoder.transform(
    tinder_df['orientation'])

# drop the original orientation column
tinder_df.drop("orientation", axis=1, inplace=True)
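For comparison, a one-hot encoding of a nominal column such as orientation could look like the sketch below, which uses pandas.get_dummies on toy data. The toy values are assumptions, and this is an alternative illustration rather than the code used in the project.

```python
import pandas as pd

# Toy orientation values (made up for illustration)
toy_df = pd.DataFrame(
    {"orientation": ["straight", "gay", "bisexual", "straight"]}
)

# One 0/1 indicator column per category, so no category
# is numerically "larger" than another
dummies = pd.get_dummies(toy_df["orientation"], prefix="orientation", dtype=int)

# Attach the indicator columns and drop the original text column
toy_df = pd.concat([toy_df.drop(columns="orientation"), dummies], axis=1)
print(toy_df.columns.tolist())
```

The trade-off is one extra column per category in exchange for not implying any order between categories.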

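The drinks column has 6 unique values. We can group these 6 values into three broader categories, so that a one-hot encoding would need only three extra columns. The code below label-encodes the grouped drinks column together with the drugs column, using one LabelEncoder fitted on both.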

Python3

drinking_habit = { 
    'socially': 'sometimes', 
    'rarely': 'sometimes', 
    'not at all': 'do not drink', 
    'often': 'drinks often', 
    'very often': 'drinks often', 
    'desperately': 'drinks often'
} 
tinder_df['drinks'] = (
    tinder_df['drinks']
    .apply(lambda x: drinking_habit[x])
)

# create a LabelEncoder object
habit_encoder = LabelEncoder()

# fit the encoder on the values of both the drinks and
# drugs columns, so that they share one set of codes
habit_encoder.fit(tinder_df[['drinks', 'drugs']]
                  .values.reshape(-1))

# encode the drinks and drugs columns
# using the fitted encoder
tinder_df['drinks_encoded'] = habit_encoder.transform(
    tinder_df['drinks'])
tinder_df['drugs_encoded'] = habit_encoder.transform(
    tinder_df['drugs'])

# drop the original drinks and drugs columns
tinder_df.drop(["drinks", "drugs"], axis=1,
               inplace=True)
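To make the effect of one encoder shared by two columns visible, the sketch below imitates it in plain Python: the codes are positions in the sorted list of all distinct values, which is how scikit-learn's LabelEncoder assigns them. The toy values and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy columns (values are made up for illustration)
toy_df = pd.DataFrame({
    "drinks": ["sometimes", "do not drink", "drinks often"],
    "drugs": ["never", "sometimes", "never"],
})

# Shared vocabulary: sorted distinct values across both columns
classes = sorted(set(toy_df["drinks"]) | set(toy_df["drugs"]))

# A value's code is its position in the sorted vocabulary
code_of = {value: code for code, value in enumerate(classes)}
toy_df["drinks_encoded"] = toy_df["drinks"].map(code_of)
toy_df["drugs_encoded"] = toy_df["drugs"].map(code_of)

print(classes)
print(toy_df[["drinks_encoded", "drugs_encoded"]].values.tolist())
```

A string that occurs in both columns, such as 'sometimes' here, gets the same code in both, and the codes of one column depend on which values occur in the other.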

The location column has 70 unique values. One-hot encoding it directly would create 70 more columns, so we use our geographical knowledge to group the locations into broader regions first.

Python3

from sklearn.preprocessing import OneHotEncoder

region_dict = {
    'southern_california': ['los angeles', 'san diego',
                            'hacienda heights',
                            'north hollywood', 'phoenix'],
    'new_york': ['brooklyn', 'new york'],
}


def get_region(city):
    # cities not listed in region_dict fall back to northern_california
    for region, cities in region_dict.items():
        if city.lower() in [c.lower() for c in cities]:
            return region
    return "northern_california"


# keep the city part of "city, state" and map it to a region
tinder_df['location'] = (
    tinder_df['location']
    .str.split(', ')
    .str[0]
    .apply(get_region)
)

# perform one hot encoding
location_encoder = OneHotEncoder()

# fit and transform the location column
location_encoded = location_encoder.fit_transform(
    tinder_df[['location']])

# create a new DataFrame with the encoded columns, reusing
# tinder_df's index so that concat lines the rows up
location_encoded_df = pd.DataFrame(
    location_encoded.toarray(),
    columns=location_encoder.get_feature_names_out(['location']),
    index=tinder_df.index)

# concatenate the new DataFrame with the original DataFrame
tinder_df = pd.concat([tinder_df, location_encoded_df], axis=1)
# drop the existing location column
tinder_df.drop(["location"], axis=1, inplace=True)
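One thing to watch when concatenating encoded columns back onto a dataframe is that pd.concat matches rows by index label, not by position. The sketch below uses toy data and hypothetical names to show what happens when the original frame's index is not 0..n-1, for example after rows have been dropped, and how reusing the original index avoids the problem.

```python
import pandas as pd

# Toy frame whose index has a gap, as happens after rows are dropped
toy_df = pd.DataFrame(
    {"location": ["new_york", "northern_california"]}, index=[0, 2]
)

# Encoded values as a bare array, which carries no index information
encoded = pd.get_dummies(toy_df["location"], dtype=int).to_numpy()
columns = ["location_new_york", "location_northern_california"]

# Without an explicit index the new frame is labelled 0, 1
misaligned = pd.DataFrame(encoded, columns=columns)
# Reusing the original index keeps the rows lined up
aligned = pd.DataFrame(encoded, columns=columns, index=toy_df.index)

bad = pd.concat([toy_df, misaligned], axis=1)
good = pd.concat([toy_df, aligned], axis=1)
print(len(bad), len(good))
```

The misaligned result has an extra row and NaN cells, while the aligned one keeps the two original rows intact.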

A person's job is an important part of their identity, so we cannot drop this column, and the jobs do not generalize into broader categories. The code below therefore label-encodes the column as it is.

Python3

# create a LabelEncoder object 
job_encoder = LabelEncoder() 
  
# fit the encoder on the job column 
job_encoder.fit(tinder_df['job']) 
  
# encode the job column using the fitted encoder 
tinder_df['job_encoded'] = job_encoder.transform(
    tinder_df['job'])
  
# drop the original job column 
tinder_df.drop('job', axis=1, inplace=True) 

The smokes column has several distinct values (the mapping below lists five). We reduce these to two: a person either does not smoke, encoded as 1, or smokes in some form, encoded as 0.

Python3

smokes = {
    'no': 1.0,
    'sometimes': 0.0,
    'yes': 0.0,
    'when drinking': 0.0,
    'trying to quit': 0.0
}
tinder_df['smokes'] = (
    tinder_df['smokes']
    .apply(lambda x: smokes[x])
)

For the pets column, we will do Binary encoding.

Python3

import category_encoders as ce

bin_enc = ce.BinaryEncoder(cols=['pets'])

# fit and transform the pets column
# (passed as a one-column DataFrame so the column name is kept)
pet_enc = bin_enc.fit_transform(tinder_df[['pets']])

# add the encoded columns to the original dataframe
tinder_df = pd.concat([tinder_df, pet_enc], axis=1)

# drop the original pets column
tinder_df.drop("pets", axis=1, inplace=True)
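To make "encoding each category into binary digits" concrete, here is a hand-rolled sketch in plain pandas. It is not the category_encoders implementation; the toy data, the choice to start the integer codes at 1, and the `pets_0`, `pets_1`, ... column names are assumptions for illustration.

```python
import pandas as pd

# Toy pets column (values are made up for illustration)
pets = pd.Series(["cat", "dog", "bird", "fish", "dog"], name="pets")

# Step 1: give each category an integer, starting at 1
codes = pd.factorize(pets)[0] + 1

# Step 2: write each integer in base 2, one column per bit,
# with the most significant bit first
n_bits = int(codes.max()).bit_length()
bit_columns = {
    f"pets_{i}": (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
}
binary_df = pd.DataFrame(bit_columns, index=pets.index)
print(binary_df)
```

Four categories fit into three bit columns here, whereas one-hot encoding would need four; both rows containing 'dog' receive the same bit pattern.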

For the new_languages and body_profile columns, we will simply do label encoding.

Python3

# create a LabelEncoder object
language_encoder = LabelEncoder()

# fit the encoder on the new_languages column
language_encoder.fit(tinder_df['new_languages'])

# encode the new_languages column using the fitted encoder
tinder_df['new_languages'] = language_encoder.transform(
    tinder_df['new_languages'])

# create an instance of LabelEncoder
le = LabelEncoder()

# encode the body_profile column
tinder_df["body_profile"] = le.fit_transform(tinder_df["body_profile"])

Predict Tinder Matches with Machine Learning

In this article, we build a Tinder-style match-making recommender system. Most social media platforms have their own recommender algorithms. Our project, which works like Tinder, recommends profiles to people based on their shared interests: the aim is to predict the profiles that a user will find most interesting and try to connect with. We build the project from scratch, and the steps we follow are:


Data Wrangling

Data Modelling

Data Manipulation
