Import and Inspect the Data

Once you have a clear understanding of the problem and the data, the next step is to import the data into your analysis environment (e.g., Python, R, or a spreadsheet program). During this step, inspecting the data is critical for gaining an initial understanding of its structure, variable types, and potential issues.

Here are a few tasks you can carry out at this stage:

  • Load the data into your analysis environment, ensuring that it is imported correctly, without errors or truncation.
  • Examine the size of the data (number of rows and columns) to get a sense of its scale and complexity.
  • Identify the data type and format of each variable, as this information will be needed in the subsequent data manipulation and analysis steps.
  • Look for any apparent errors or inconsistencies in the data, such as invalid values, mismatched units, or outliers, which can indicate quality issues (see the sketch below).
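
As a minimal sketch of such checks (the snippet below assumes a pandas DataFrame named df, loaded as in the next code block):

Python3
# generic sanity checks on a freshly loaded DataFrame `df`
print(df.shape)     # number of rows and columns
print(df.dtypes)    # data type of each column

# flag numeric columns containing negative values, which can
# indicate invalid entries for quantities that should be non-negative
numeric_cols = df.select_dtypes(include='number')
print((numeric_cols < 0).any())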

For this article, we will use the employee dataset. It contains 8 columns, namely First Name, Gender, Start Date, Last Login Time, Salary, Bonus %, Senior Management, and Team. We can get the dataset here: Employees.csv.

Let’s read the dataset using the Pandas read_csv() function and print the first five rows using the head() method.

Python3
import pandas as pd
import numpy as np
# read dataset using pandas
df = pd.read_csv('employees.csv')
df.head()

Output:

  First Name  Gender  Start Date  Last Login Time  Salary  Bonus %  Senior Management             Team
0    Douglas    Male    8/6/1993         12:42 PM   97308    6.945               True        Marketing
1     Thomas    Male   3/31/1996          6:53 AM   61933    4.170               True              NaN
2      Maria  Female   4/23/1993         11:17 AM  130590   11.858              False          Finance
3      Jerry    Male    3/4/2005          1:00 PM  138705    9.340               True          Finance
4      Larry    Male   1/24/1998          4:47 PM  101004    1.389               True  Client Services

Getting Insights About The Dataset

Let’s see the shape of the data using the shape attribute.

Python3
df.shape

Output:

(1000, 8)

This means that this dataset has 1000 rows and 8 columns.

Now, let’s also see the columns and their data types. For this, we will use the info() method.

Python3
# information about the dataset
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   First Name         933 non-null    object 
 1   Gender             855 non-null    object 
 2   Start Date         1000 non-null   object 
 3   Last Login Time    1000 non-null   object 
 4   Salary             1000 non-null   int64  
 5   Bonus %            1000 non-null   float64
 6   Senior Management  933 non-null    object 
 7   Team               957 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 62.6+ KB
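
Note that Start Date and Last Login Time are stored as object (i.e., plain strings), and Senior Management also shows up as object rather than bool because it contains missing values. As a minimal sketch (assuming the date and time formats shown in the head() output above), the string columns could be parsed into proper datetime dtypes for later analysis:

Python3
# parse the date/time string columns into datetime dtypes
# (formats assumed from the sample rows shown above)
df['Start Date'] = pd.to_datetime(df['Start Date'], format='%m/%d/%Y')
# a time-only column gets a default date of 1900-01-01 when parsed
df['Last Login Time'] = pd.to_datetime(df['Last Login Time'], format='%I:%M %p')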

Next, let’s see the number of unique values in each column using the nunique() method. This will help us decide which type of encoding to choose when converting categorical columns into numerical ones.

Python3
df.nunique()

Output:

First Name           200
Gender                 2
Start Date           972
Last Login Time      720
Salary               995
Bonus %              971
Senior Management      2
Team                  10
dtype: int64
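
Based on these counts, Gender (2 unique values), Senior Management (2), and Team (10) are low-cardinality categorical columns, which makes them natural candidates for one-hot encoding. As a minimal sketch (not part of the original walkthrough):

Python3
# one-hot encode the low-cardinality categorical columns;
# Team alone has 10 unique values, so this adds 10 indicator columns
encoded = pd.get_dummies(df, columns=['Gender', 'Senior Management', 'Team'])
print(encoded.shape)

In contrast, a high-cardinality column such as First Name (200 unique values) would blow up the feature space if one-hot encoded, so a different encoding (or dropping the column) is usually preferable.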

Let’s get a quick summary of the dataset using the pandas describe() method. describe() applies basic statistical computations to the dataset, such as extreme values, the count of data points, and the standard deviation. Any missing or NaN values are automatically skipped. describe() gives a good picture of the distribution of the numerical data.

Python3
df.describe()

Output:

              Salary      Bonus %
count    1000.000000  1000.000000
mean    90662.181000    10.207555
std     32923.693342     5.528481
min     35013.000000     1.015000
25%     62613.000000     5.401750
50%     90428.000000     9.838500
75%    118740.250000    14.838000
max    149908.000000    19.944000

Note that we can also get a description of the categorical columns of the dataset if we specify include='all' in the describe() method.
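
For instance:

Python3
# include categorical (object) columns in the summary as well
df.describe(include='all')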

So far, we have developed a good idea of the dataset. Now let’s see whether our dataset contains any missing values.
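
As a quick preview of that step, pandas can count the missing (NaN) values in each column:

Python3
# number of missing values in each column
df.isnull().sum()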
