Utilizing Python Libraries for Effective Data Processing

Let’s analyze a sales dataset and use these Python libraries for data wrangling. The dataset reveals valuable insights into customer purchasing behavior, item popularity, and category-specific trends. Businesses can leverage this information to optimize marketing strategies, enhance customer engagement, and increase sales.
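If you don’t have the customers_data.csv file used below, you can first create a small synthetic version with the same columns so the steps are reproducible. The snippet below is only a minimal sketch; the names, categories, and price values are assumptions inferred from the sample outputs shown later in this section.

Python
# Minimal sketch: generate a synthetic customers_data.csv so the steps below can be followed along.
# The names, categories, and price values are assumptions based on the sample outputs in this section.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
sample = pd.DataFrame({
    "Customer ID": range(1, n + 1),
    "Item ID": rng.integers(1, 101, size=n),
    "Customer Name": rng.choice(["Om", "Karan", "Bhavesh", "Chetan", "Naveen", "Yash"], size=n),
    "Item Category": rng.choice(["clothing", "homeware", "sports"], size=n),
    "Price": rng.choice([27.0, 55.0, 56.0, 66.0, 71.0], size=n),
})
sample.to_csv("customers_data.csv", index=False)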

Import Required Libraries and Load the CSV File

Python
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("customers_data.csv")
# Display the first few rows of the dataset
print("\nFirst few rows of the dataset:\n", df.head())

Output:

First few rows of the dataset:
   Customer ID  Item ID Customer Name Item Category  Price
0            1       22            Om      clothing   56.0
1            2       22         Karan      homeware   71.0
2            3       77       Bhavesh        sports   66.0
3            4       70        Chetan      clothing   56.0
4            5       67         Karan      clothing   56.0

Data Cleaning and Validation

Python
# Checking for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)
# Fill missing values if any (forward fill for simplicity)
df.ffill(inplace=True)  # fillna(method='ffill') is deprecated in recent pandas versions

Output:

Missing values in each column:
Customer ID 0
Item ID 0
Customer Name 0
Item Category 0
Price 0
dtype: int64

Ensure Correct Data Types

Converts the Customer ID, Item ID, and Price columns to the appropriate data types using astype().

Python
# Ensure correct data types
df["Customer ID"] = df["Customer ID"].astype(int)
df["Item ID"] = df["Item ID"].astype(int)
df["Price"] = df["Price"].astype(float)
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Customer ID    1000 non-null   int64
 1   Item ID        1000 non-null   int64
 2   Customer Name  1000 non-null   object
 3   Item Category  1000 non-null   object
 4   Price          1000 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB

Exploratory Data Analysis

Display Basic Statistics

Let’s look at basic statistical details, such as the mean, standard deviation, and quartiles, for the numerical columns using describe().

Python
# Display basic statistics
print("\nBasic statistics:\n", df.describe())

Output:

Basic statistics:
       Customer ID      Item ID        Price
count  1000.000000  1000.000000  1000.000000
mean    500.500000    50.736000    55.917000
std     288.819436    28.557273    14.890192
min       1.000000     1.000000    27.000000
25%     250.750000    26.000000    55.000000
50%     500.500000    51.000000    56.000000
75%     750.250000    75.000000    66.000000
max    1000.000000   100.000000    71.000000

Define the Target Item Category

Specifies the item category of interest. You can change “sports” to any other category as needed.

Python
#Define the target item category
target_category = "sports"

Filter Data for Purchases in the Target Category

Filters the DataFrame to include only rows where the item category matches the target category.

Python
#Filter data for purchases belonging to the target category
df_filtered = df[df["Item Category"] == target_category]
df_filtered.head()

Output:

    Customer ID  Item ID Customer Name Item Category  Price
2             3       77       Bhavesh        sports   66.0
6             7       44        Naveen        sports   66.0
9            10       35          Yash        sports   66.0
11           12       90        Zubair        sports   66.0
16           17       24       Jagdish        sports   66.0

Group Purchases by Customer ID and Calculate Total Spent per Customer

Groups the filtered data by Customer ID and calculates the total spending for each customer using groupby().

Python
#Group purchases by customer ID and calculate total spent per customer
customer_spending = df_filtered.groupby("Customer ID")["Price"].sum()
customer_spending.head()

Output:

Customer ID
3 66.0
7 66.0
10 66.0
12 66.0
17 66.0
...
967 66.0
968 66.0
978 66.0
981 66.0
990 66.0
Name: Price, Length: 202, dtype: float64

Identify Frequent Buyers

Sorts customers by total spending in descending order and selects the top 10 spenders.

Python
# Identify frequent buyers (e.g., top 10 customers spending the most)
frequent_buyers = customer_spending.sort_values(ascending=False).head(10)

Calculate Total Revenue from Frequent Buyers

Calculates the total revenue generated by the top 10 spenders.

Python
# Calculate total revenue from frequent buyers
total_revenue_frequent = frequent_buyers.sum()
total_revenue_frequent

Output:

660.0

Analyzing the Results

Prints the top 10 customers and the total revenue generated by them. In this sample each of the top 10 customers spent 66.0 on a single sports purchase, so the combined revenue is 10 × 66.0 = 660.0.

Python
# Presenting Results
print("\nTop 10 Customers (by spending) on", target_category, "items:")
print(frequent_buyers)

print("\nTotal Revenue Generated by Frequent Buyers:", total_revenue_frequent)

Output:

Top 10 Customers (by spending) on sports items:
Customer ID
3 66.0
726 66.0
699 66.0
701 66.0
708 66.0
711 66.0
712 66.0
714 66.0
715 66.0
717 66.0
Name: Price, dtype: float64

Total Revenue Generated by Frequent Buyers: 660.0

Visualize Results

Bar Plot of Top 10 Customers by Spending

Creates a bar plot of the top 10 customers by spending and saves it as frequent_buyers.png.

Python
# Visualize Results
plt.figure(figsize=(10, 6))
frequent_buyers.plot(kind='bar')
plt.title('Top 10 Customers by Spending on Sports Items')
plt.xlabel('Customer ID')
plt.ylabel('Total Spending')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('frequent_buyers.png')  # Save the plot as an image file
plt.show()

Output:

Top 10 Customers by Spending

Histogram of Spending Distribution

Creates a histogram showing the distribution of spending on sports items and saves it as spending_distribution.png.

Python
# Distribution of spending in the target category
plt.figure(figsize=(10, 6))
df_filtered['Price'].plot(kind='hist', bins=20, edgecolor='black')
plt.title('Distribution of Spending on Sports Items')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('spending_distribution.png')
plt.show()

Output:

Distribution of spending on sports items
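
As an optional extension of this analysis (not part of the walkthrough above), the same groupby() pattern can summarize item popularity and revenue per category across the whole dataset, which ties back to the category-specific trends mentioned at the start. The snippet below is a minimal sketch, assuming the df loaded and cleaned earlier is still in memory.

Python
# Minimal sketch (optional extension): purchases, revenue, and average price per category,
# computed from the same df that was loaded and cleaned above.
category_summary = (
    df.groupby("Item Category")["Price"]
      .agg(purchases="count", total_revenue="sum", avg_price="mean")
      .sort_values("total_revenue", ascending=False)
)
print(category_summary)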
