Data Manipulation

In data manipulation, we transform the elements of the dataset to prepare it for modeling. Previously we saw that the numerical column age has a long right tail, i.e. it is right-skewed. Hence we will apply a log transformation to this column to bring its distribution closer to normal.

To encode data from categorical object Dtype into numerical data we will use 3 types of encodings.

  1. One-Hot encoding – We will use this when a column has multiple nominal categories.
  2. Label encoding – This method will be used when a column has only a few categories.
  3. Binary encoding – This type of encoding is similar to one-hot encoding, but it creates fewer new columns by encoding each category as binary digits.
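
To make the difference between the three encodings concrete, here is a small hedged sketch on a toy column (the values are hypothetical, not from the Tinder dataset); one-hot and label encoding use pandas built-ins, while the binary encoding is written out by hand to show where the column savings come from:

```python
import pandas as pd

# toy column with three categories (hypothetical example data)
s = pd.Series(["cat", "dog", "fish", "dog"])

# one-hot encoding: one new column per category
one_hot = pd.get_dummies(s, prefix="pet")
print(list(one_hot.columns))  # ['pet_cat', 'pet_dog', 'pet_fish']

# label encoding: a single integer code per category
codes, uniques = pd.factorize(s)
print(list(codes))  # [0, 1, 2, 1]

# binary encoding: write each integer code in binary,
# one column per bit -> ceil(log2(n_categories)) columns
n_bits = max(1, (len(uniques) - 1).bit_length())
binary = pd.DataFrame(
    {f"pet_bin_{i}": (codes >> i) & 1 for i in range(n_bits)}
)
print(binary.shape[1])  # 2 columns instead of 3
```

With 3 categories, binary encoding needs only 2 bit-columns where one-hot needs 3; the gap widens quickly as the number of categories grows.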

There are several different transformations available for decreasing skewness, such as the inverse, square root, and log transformations. Which one is right depends on the direction and degree of skewness in the column.
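
As a quick sketch of the log transformation described above, applied to a hypothetical right-skewed age sample (stand-in values, not rows from the actual dataset):

```python
import numpy as np
import pandas as pd

# hypothetical right-skewed ages (stand-in for tinder_df['age'])
age = pd.Series([18, 20, 22, 25, 27, 30, 45, 60])

# log1p is a common choice: it compresses the long right
# tail while staying defined at 0
log_age = np.log1p(age)

# skewness should shrink toward 0 after the transform
print(age.skew(), log_age.skew())
```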

We will now handle each of these columns in turn and convert it into a suitable numerical column.

Python3




# check whether every row lists english
# among its languages
tinder_df['language'].str.contains('english')\
    .unique()


Output:

array([ True])

There are 571 unique values in the language column, and every row includes English. One-hot encoding the language column would therefore create a very sparse matrix, so instead we will create a column that counts the number of languages a person knows and then drop the language column.

Python3




# count the number of languages in each row
tinder_df['num_languages'] = tinder_df['language']\
    .str.count(',') + 1
tinder_df.drop(["language"], axis=1, inplace=True)
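
The `str.count(',') + 1` trick assumes the languages in a row are comma-separated; a quick check on toy strings in that format (hypothetical values, not actual rows) confirms the logic:

```python
import pandas as pd

# hypothetical language strings in the dataset's format
langs = pd.Series(["english",
                   "english, spanish",
                   "english, french, german"])

# one more language than there are commas
num = langs.str.count(',') + 1
print(list(num))  # [1, 2, 3]
```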


To encode location preference we will assign a numeric strength to each preferred place: 'anywhere' gets the lowest value (1.0) and 'same city' the highest (2.5).

Python3




place_type_strength = {
    'anywhere': 1.0,
    'same state': 2.0,
    'same city': 2.5
}
  
tinder_df['location_preference'] = \
    tinder_df['location_preference']\
    .apply(lambda x: place_type_strength[x])


Columns that have only two unique categorical values can be handled easily with label encoding.

Python3




two_unique_values_column = {
    'sex': {'f': 1, 'm': 0},
    'dropped_out': {'no': 0, 'yes': 1}
}
  
tinder_df.replace(two_unique_values_column,
                  inplace=True)


The status column has four distinct values, which we will divide into two groups:

  1. Either the person is single or available.
  2. Either the person is married or seeing someone.

A higher weight is given to the people who are single or available.

Python3




status_type_strength = {
    'single': 2.0,
    'available': 2.0,
    'seeing someone': 1.0,
    'married': 1.0
}
tinder_df['status'] = tinder_df['status']\
    .apply(lambda x:
           status_type_strength[x])


Since orientation is an important element of the dataframe and it is nominal categorical data, we will label-encode this column.

Python3




# create a LabelEncoder object
orientation_encoder = LabelEncoder()
  
# fit the encoder on the orientation column
orientation_encoder.fit(tinder_df['orientation'])
  
# encode the orientation column in place using the
# fitted encoder; the original string values are
# overwritten, so there is nothing left to drop
tinder_df['orientation'] = orientation_encoder.\
    transform(tinder_df['orientation'])


The drinks column has 6 unique values. However, we can group these 6 values into three broader categories before encoding, so the column ends up with only three distinct codes.

Python3




drinking_habit = {
    'socially': 'sometimes',
    'rarely': 'sometimes',
    'not at all': 'do not drink',
    'often': 'drinks often',
    'very often': 'drinks often',
    'desperately': 'drinks often'
}
tinder_df['drinks'] = tinder_df['drinks']\
    .apply(lambda x:
           drinking_habit[x])
# create a LabelEncoder object
habit_encoder = LabelEncoder()
  
# fit the encoder on the drinks and drugs columns
habit_encoder.fit(tinder_df[['drinks', 'drugs']]
                  .values.reshape(-1))
  
# encode the drinks and drugs columns
# using the fitted encoder
tinder_df['drinks_encoded'] = \
    habit_encoder.transform(tinder_df['drinks'])
tinder_df['drugs_encoded'] = \
    habit_encoder.transform(tinder_df['drugs'])
  
# Drop the existing drink and drugs column
tinder_df.drop(["drinks", "drugs"], axis=1,
               inplace=True)
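
Fitting a single `LabelEncoder` on the pooled values of both columns, as above, guarantees that the same string maps to the same integer in either column. A toy check (the habit strings are hypothetical stand-ins):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical habit values shared across two columns
df = pd.DataFrame({
    'drinks': ['sometimes', 'do not drink'],
    'drugs': ['never', 'sometimes']
})

# fit one encoder on the flattened values of both columns
enc = LabelEncoder()
enc.fit(df[['drinks', 'drugs']].values.reshape(-1))

# 'sometimes' maps to the same integer in either column
print(enc.transform(df['drinks']))  # [2 0]
print(enc.transform(df['drugs']))   # [1 2]
```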


The location column has 70 unique values. One-hot encoding it directly would create 70 more columns, so here we will use our geographical knowledge to group the data into broader regions.

Python3




region_dict = {'southern_california': ['los angeles',
                         'san diego', 'hacienda heights',
                         'north hollywood', 'phoenix'],
               'new_york': ['brooklyn',
                            'new york']}
  
def get_region(city):
    for region, cities in region_dict.items():
        if city.lower() in [c.lower() for c in cities]:
            return region
    return "northern_california"
  
  
tinder_df['location'] = tinder_df['location']\
           .str.split(', ')\
          .str[0].apply(get_region)
# perform one hot encoding
location_encoder = OneHotEncoder()
  
# fit and transform the location column
location_encoded = location_encoder\
    .fit_transform(tinder_df[['location']])
  
# create a new DataFrame with the encoded columns
location_encoded_df = pd.DataFrame(
    location_encoded.toarray(),
    columns=location_encoder
    .get_feature_names_out(['location']))
  
# concatenate the new DataFrame with the original DataFrame
tinder_df = pd.concat([tinder_df, location_encoded_df], axis=1)
# Drop the existing location column
tinder_df.drop(["location"], axis=1, inplace=True)


Job is an important part of an individual's identity, so we cannot drop this column, and we cannot generalize it into broader categories either, so here we will label-encode it.

Python3




# create a LabelEncoder object
job_encoder = LabelEncoder()
  
# fit the encoder on the job column
job_encoder.fit(tinder_df['job'])
  
# encode the job column using the fitted encoder
tinder_df['job_encoded'] = job_encoder.\
    transform(tinder_df['job'])
  
# drop the original job column
tinder_df.drop('job', axis=1, inplace=True)


The smokes column records several different smoking habits; we will reduce these to just two characteristics: either the person smokes or they do not.

Python3




smokes = {
    'no': 1.0,
    'sometimes': 0,
    'yes': 0,
    'when drinking': 0,
    'trying to quit': 0
}
tinder_df['smokes'] = tinder_df['smokes']\
                             .apply(lambda x: smokes[x])


For the pets column, we will do Binary encoding.

Python3




bin_enc = ce.BinaryEncoder(cols=['pets'])
  
# fit and transform the pet column
pet_enc = bin_enc.fit_transform(tinder_df['pets'])
  
# add the encoded columns to the original dataframe
tinder_df = pd.concat([tinder_df, pet_enc], axis=1)
  
tinder_df.drop("pets", axis=1, inplace=True)


For the new_languages and body_profile columns, we will simply do label encoding.

Python3




# create a LabelEncoder object
language_encoder = LabelEncoder()
  
# fit the encoder on the new_languages column
language_encoder.fit(tinder_df['new_languages'])
  
# encode the new_languages column using the fitted encoder
tinder_df['new_languages'] = language_encoder.transform(
    tinder_df['new_languages'])
  
# create an instance of LabelEncoder
le = LabelEncoder()
  
# encode the body_profile column
tinder_df["body_profile"] = le.fit_transform(tinder_df["body_profile"])

