Handling Missing Values
You may be wondering why a dataset would contain missing values at all. They occur when no information is provided for one or more items, or for a whole unit. For example, some survey respondents may choose not to share their income and others may choose not to share their address, so those fields end up empty. Missing data is a significant problem in real-life scenarios.
In pandas, missing data is also referred to as NA (Not Available) values. pandas provides several useful functions for detecting, removing, and replacing null values in a DataFrame, including isnull(), dropna(), fillna(), and replace().
Let’s check whether our dataset contains any missing values.
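As a quick illustration of detection on its own, here is a minimal sketch on a tiny made-up DataFrame. The `toy` frame and its column names are assumptions for illustration only and are not part of the employee dataset used in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the employee dataset used in this article
toy = pd.DataFrame({
    "name": ["Ann", None, "Cid"],
    "salary": [52000.0, np.nan, 61000.0],
})

# isnull() flags every missing cell; notnull() is its inverse
missing_mask = toy.isnull()
present_mask = toy.notnull()

# Summing the boolean mask counts the missing values in each column
missing_per_column = missing_mask.sum()
print(missing_per_column)
```

Both `None` and `np.nan` are treated as missing here, which is why each column reports one missing value.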
df.isnull().sum()
Output:
First Name 67
Gender 145
Start Date 0
Last Login Time 0
Salary 0
Bonus % 0
Senior Management 67
Team 43
dtype: int64
Each column has a different number of missing values: Gender has 145, for example, while Salary has none. There are several ways to handle them, such as dropping the rows that contain NaN or replacing NaN with the mean, median, mode, or some other value.
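To make these options concrete, the sketch below fills a numeric column with its median and a categorical column with its mode. It uses a small made-up DataFrame; the `toy` frame and the `age` and `city` columns are hypothetical and only stand in for whatever columns your own data has.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration
toy = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean() * 100

# Numeric column: fill with the median, which is less sensitive to outliers than the mean
age_median = toy["age"].median()
toy["age"] = toy["age"].fillna(age_median)

# Categorical column: fill with the most frequent value (the mode)
city_mode = toy["city"].mode()[0]
toy["city"] = toy["city"].fillna(city_mode)

print(missing_pct)
print(toy)
```

Which statistic is appropriate depends on the column: a mean or median only makes sense for numeric data, while the mode also works for categories.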
Now, let’s try to fill in the missing values of gender with the string “No Gender”.
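# Assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas
df["Gender"] = df["Gender"].fillna("No Gender")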
df.isnull().sum()
Output:
First Name 67
Gender 0
Start Date 0
Last Login Time 0
Salary 0
Bonus % 0
Senior Management 67
Team 43
dtype: int64
The Gender column now has no null values. Next, let’s fill Senior Management with its mode, that is, its most frequent value.
# mode() returns a Series, so take its first entry
mode = df['Senior Management'].mode()[0]
df['Senior Management'] = df['Senior Management'].fillna(mode)
df.isnull().sum()
Output:
First Name 67
Gender 0
Start Date 0
Last Login Time 0
Salary 0
Bonus % 0
Senior Management 0
Team 43
dtype: int64
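One detail worth knowing when filling with the mode: `mode()` returns a Series rather than a single value, because a column can have several equally frequent values. The sketch below shows this on a made-up column; the `teams` Series is hypothetical and not taken from the article's dataset.

```python
import pandas as pd

# Hypothetical toy column in which two values are tied for most frequent
teams = pd.Series(["Sales", "Finance", None, "Sales", "Finance"])

# Missing values are ignored by default, and tied modes are all returned
modes = teams.mode()
print(list(modes))

# Taking the first entry is an arbitrary tie-break, not a statistically better choice
chosen = modes.iloc[0]
filled = teams.fillna(chosen)
print(filled)
```

If a tie like this matters for your analysis, it may be safer to fill with an explicit placeholder value instead of picking one mode.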
For First Name and Team, we cannot fill the missing values with arbitrary data, so let’s drop all the rows that contain these missing values.
df = df.dropna(axis = 0, how ='any')
print(df.isnull().sum())
df.shape
Output:
First Name 0
Gender 0
Start Date 0
Last Login Time 0
Salary 0
Bonus % 0
Senior Management 0
Team 0
dtype: int64
(899, 8)
Our dataset is now free of missing values, and dropping those rows reduced the number of rows from 1000 to 899.
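The call above drops a row if any column is missing. When other columns may still legitimately contain NaN, a narrower variant is to pass `subset` so that only the listed columns are checked. The following is a minimal sketch on made-up data; the `toy` frame is hypothetical and merely reuses some of the article's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that reuses a few of the article's column names
toy = pd.DataFrame({
    "First Name": ["Ann", None, "Cid", "Dee"],
    "Team": ["Sales", "Finance", None, "Legal"],
    "Salary": [52000.0, 48000.0, 50000.0, np.nan],
})

rows_before = len(toy)

# Drop a row only if it is missing a value in one of the listed columns
cleaned = toy.dropna(subset=["First Name", "Team"])

rows_dropped = rows_before - len(cleaned)
print(rows_dropped)
print(cleaned)
```

Here the last row is kept even though its Salary is missing, because Salary is not in `subset`.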
For more information, refer to Working with Missing Data in Pandas.
Steps for Mastering Exploratory Data Analysis | EDA Steps
Mastering exploratory data analysis (EDA) is crucial for understanding your data, identifying patterns, and generating insights that can inform further analysis or decision-making. Data is the lifeblood of modern organizations, and the ability to extract insights from it has become a crucial skill in today’s data-driven world. Exploratory Data Analysis (EDA) is a powerful method that allows analysts, scientists, and researchers to gain a thorough understanding of their data before moving on to formal modeling or hypothesis testing.
It is an iterative process that involves summarizing, visualizing, and exploring data to find patterns, anomalies, and relationships that might not be immediately apparent. In this article, we will understand and implement the key steps for performing Exploratory Data Analysis. Here are the steps to help you master EDA:
Steps for Mastering Exploratory Data Analysis
- Step 1: Understand the Problem and the Data
- Step 2: Import and Inspect the Data
- Step 3: Handling Missing Values
- Step 4: Explore Data Characteristics
- Step 5: Perform Data Transformation
- Step 6: Visualize Data Relationships
- Step 7: Handling Outliers
- Step 8: Communicate Findings and Insights