Cleaning Categorical Data in Python
To illustrate this problem, we create a new DataFrame with a single feature, phone numbers.
Python3
import random
import pandas as pd

phone_numbers = []
for i in range(100):
    # phone numbers could be of length 9 or 10
    number = random.randint(100000000, 9999999999)
    # +91 code is inserted in some cases
    if i % 2 == 0:
        phone_numbers.append('+91 ' + str(number))
    else:
        phone_numbers.append(str(number))

phone_numbers_data = pd.DataFrame({'phone_numbers': phone_numbers})
phone_numbers_data.head()
Depending on the use case, the country code can either be stripped from the numbers that have it or added to the numbers that lack it. Similarly, phone numbers with fewer than 10 digits should be discarded.
Python3
# Remove the '+91 ' country code (literal match, not a regex)
phone_numbers_data['phone_numbers'] = phone_numbers_data['phone_numbers'].str.replace('+91 ', '', regex=False)

# Drop numbers that have fewer than 10 digits
num_digits = phone_numbers_data['phone_numbers'].str.len()
invalid_numbers_index = phone_numbers_data[num_digits < 10].index
phone_numbers_data = phone_numbers_data.drop(invalid_numbers_index)
phone_numbers_data.head()
Finally, we can verify whether the data is clean or not.
Python3
# No number should still carry the country code
assert not phone_numbers_data['phone_numbers'].str.contains('+91 ', regex=False).any()
# Every remaining number should be exactly 10 digits long
assert (phone_numbers_data['phone_numbers'].str.len() == 10).all()
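The text above notes that the country code could instead be added to the numbers that lack it. The following is a minimal sketch of that alternative, not part of the original walkthrough. The `sample` data and the `add_country_code` helper are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample data: a mix of prefixed, unprefixed, and too-short numbers
sample = pd.DataFrame(
    {'phone_numbers': ['+91 9876543210', '9123456780', '912345678']}
)


def add_country_code(series, code='+91 '):
    """Prefix `code` to every entry that does not already start with it."""
    has_code = series.str.startswith(code)
    # Keep entries that already have the code; prefix the rest
    return series.where(has_code, code + series)


# Keep only numbers whose digits (ignoring the code) are exactly 10 long
digits = sample['phone_numbers'].str.replace('+91 ', '', regex=False)
sample = sample[digits.str.len() == 10].copy()

# Make the format uniform by adding the code where it is missing
sample['phone_numbers'] = add_country_code(sample['phone_numbers'])
```

Either direction works, provided every number ends up in the same format, so that later comparisons and joins treat equal numbers as equal.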
Handling Categorical Data in Python
Categorical data takes its values from a set of predefined categories or groups that an observation can fall into. It appears everywhere, for instance in survey responses such as marital status, profession, and educational qualifications. However, categorical data can pose problems that must be dealt with before proceeding with any other task. This article looks at some of these problems and at methods for handling categorical data in a DataFrame.
As mentioned earlier, categorical data can only take a finite set of values. However, because of human error while filling out a survey form, or for other reasons, some bogus values may appear in the dataset.
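One possible way to spot such bogus values is to compare a column against the set of categories it is allowed to take. The sketch below assumes a hypothetical `survey` DataFrame with a `marital_status` column and an assumed set `valid_statuses`. None of these names come from the article.

```python
import pandas as pd

# Hypothetical survey data; 'marriedd' and 'unknown?' stand in for bogus entries
survey = pd.DataFrame(
    {'marital_status': ['married', 'single', 'marriedd', 'divorced', 'unknown?']}
)

# Assumed set of valid categories for this feature
valid_statuses = {'married', 'single', 'divorced'}

# Flag rows whose value is not one of the allowed categories
is_bogus = ~survey['marital_status'].isin(valid_statuses)
bogus_values = set(survey.loc[is_bogus, 'marital_status'])

# Drop the inconsistent rows, keeping only valid categories
clean_survey = survey[~is_bogus]
```

Dropping the rows is only one option. Depending on the use case, obvious misspellings could instead be mapped back to the category they were meant to be.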