Import and Inspect the Data
Once you have a clear understanding of the problem and the data, the next step is to import the data into your analysis environment (e.g., Python, R, or a spreadsheet program). During this step, inspecting the data is critical to gain an initial understanding of its structure, variable types, and potential issues.
Here are a few tasks you can perform at this stage:
- Load the data into your analysis environment, ensuring it is imported correctly, without errors or truncation.
- Examine the shape of the data (number of rows and columns) to get a sense of its size and complexity.
- Identify the data type and format of each variable, as this information is needed for the subsequent data manipulation and analysis steps.
- Look for obvious errors or inconsistencies, such as invalid values, mismatched units, or outliers, which can indicate data quality issues.
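The checks above can be sketched in a few lines of pandas. The tiny DataFrame below is a hypothetical stand-in for a real CSV such as Employees.csv (including a deliberately invalid negative salary), used only to illustrate the inspection steps:

```python
import pandas as pd

# Hypothetical mini-frame standing in for a real file like Employees.csv
df = pd.DataFrame({
    "First Name": ["Douglas", "Maria", None],
    "Salary": [97308, 130590, -5],  # -5 is a deliberately invalid value
})

# 1. Size: number of rows and columns
n_rows, n_cols = df.shape

# 2. Data type of each variable
dtypes = df.dtypes

# 3. Scan for obvious errors, e.g. salaries cannot be negative
bad_salaries = df[df["Salary"] < 0]

print(n_rows, n_cols)          # 3 2
print(len(bad_salaries))       # 1
```

In a real workflow the validity rules (non-negative salaries, expected units, allowed categories) come from domain knowledge of the dataset, not from the data itself.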
For this article, we will use the employee dataset. It contains 8 columns: First Name, Gender, Start Date, Last Login Time, Salary, Bonus %, Senior Management, and Team. We can get the dataset here: Employees.csv.
Let's read the dataset using the pandas read_csv() function and print the first five rows with the head() method.
import pandas as pd
import numpy as np
# read dataset using pandas
df = pd.read_csv('employees.csv')
df.head()
Output:
First Name Gender Start Date Last Login Time Salary Bonus % Senior Management Team
0 Douglas Male 8/6/1993 12:42 PM 97308 6.945 True Marketing
1 Thomas Male 3/31/1996 6:53 AM 61933 4.170 True NaN
2 Maria Female 4/23/1993 11:17 AM 130590 11.858 False Finance
3 Jerry Male 3/4/2005 1:00 PM 138705 9.340 True Finance
4 Larry Male 1/24/1998 4:47 PM 101004 1.389 True Client Services
Getting Insights About The Dataset
Let's see the shape of the data using the shape attribute.
df.shape
Output:
(1000, 8)
This means that this dataset has 1000 rows and 8 columns.
Now, let’s also see the columns and their data types. For this, we will use the info() method.
# information about the dataset
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 First Name 933 non-null object
1 Gender 855 non-null object
2 Start Date 1000 non-null object
3 Last Login Time 1000 non-null object
4 Salary 1000 non-null int64
5 Bonus % 1000 non-null float64
6 Senior Management 933 non-null object
7 Team 957 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 62.6+ KB
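Notice that Start Date and Last Login Time were read in as plain object (string) columns. A common follow-up at this stage is to convert such columns to proper types. This is a minimal sketch on a small illustrative frame, assuming the month/day/year format seen in the sample rows:

```python
import pandas as pd

# Small frame mimicking the relevant columns (values are illustrative)
df = pd.DataFrame({
    "Start Date": ["8/6/1993", "3/31/1996"],
    "Senior Management": [True, False],
})

# Parse the date strings; the format is assumed from the sample rows above
df["Start Date"] = pd.to_datetime(df["Start Date"], format="%m/%d/%Y")

print(df.dtypes)
```

With a proper datetime dtype, date-based operations (sorting, filtering by year, computing tenure) become straightforward.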
Next, let's see the number of unique values in each column. This will help us decide which type of encoding to choose when converting categorical columns into numerical ones.
df.nunique()
Output:
First Name 200
Gender 2
Start Date 972
Last Login Time 720
Salary 995
Bonus % 971
Senior Management 2
Team 10
dtype: int64
Let's get a quick summary of the dataset using the pandas describe() method. It computes basic statistics for the numeric columns, such as the count of data points, mean, standard deviation, and extreme values; any missing (NaN) values are automatically skipped. describe() gives a good picture of the distribution of the data.
df.describe()
Output:
Salary Bonus %
count 1000.000000 1000.000000
mean 90662.181000 10.207555
std 32923.693342 5.528481
min 35013.000000 1.015000
25% 62613.000000 5.401750
50% 90428.000000 9.838500
75% 118740.250000 14.838000
max 149908.000000 19.944000
Note: we can also get a description of the categorical columns of the dataset if we specify include='all' in the describe() function.
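For example, on a small illustrative frame with one numeric and one categorical column, include='all' adds rows such as unique, top, and freq for the categorical column:

```python
import pandas as pd

df = pd.DataFrame({
    "Salary": [97308, 61933, 130590],
    "Team": ["Marketing", "Finance", "Finance"],
})

# include='all' summarizes categorical columns alongside numeric ones
summary = df.describe(include="all")

# 'unique' counts distinct categories; it is NaN for numeric columns
print(summary.loc["unique", "Team"])
```

Here Team has 2 distinct values, with "Finance" reported as top (the mode) and freq 2.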
So far we have gained an overview of the dataset. Next, let's see whether it contains any missing values.
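The standard check is isnull() combined with sum(), which counts the missing entries per column. A minimal sketch on an illustrative frame with one missing name:

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Douglas", None, "Maria"],
    "Salary": [97308, 61933, 130590],
})

# Count missing values in each column
missing = df.isnull().sum()

print(missing)
```

On the full employee dataset this would mirror the info() output above, e.g. 1000 - 933 = 67 missing entries in First Name.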
Steps for Mastering Exploratory Data Analysis | EDA Steps
Mastering exploratory data analysis (EDA) is crucial for understanding your data, identifying patterns, and generating insights that can inform further analysis or decision-making. Data is the lifeblood of modern organizations, and the ability to extract insights from it has become an essential skill in today's data-driven world. Exploratory Data Analysis (EDA) is a powerful approach that allows analysts, scientists, and researchers to gain a thorough understanding of their data before moving on to formal modeling or hypothesis testing.
It is an iterative process that involves summarizing, visualizing, and exploring the data to find patterns, anomalies, and relationships that may not be immediately apparent. In this article, we will walk through the key steps for performing exploratory data analysis. Here are the steps to help you master EDA:
- Step 1: Understand the Problem and the Data
- Step 2: Import and Inspect the Data
- Step 3: Handling Missing Values
- Step 4: Explore Data Characteristics
- Step 5: Perform Data Transformation
- Step 6: Visualize Data Relationships
- Step 7: Handling Outliers
- Step 8: Communicate Findings and Insights