Utilizing Python Libraries for Effective Data Processing

Let’s analyze a sales dataset and use these Python libraries for data wrangling. The dataset reveals valuable insights into customer purchasing behavior, item popularity, and category-specific trends. Businesses can leverage this information to optimize marketing strategies, enhance customer engagement, and increase sales.
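If you don’t have the customers_data.csv file used below, you can first create a small synthetic version with the same columns so the steps are reproducible. The snippet below is only a minimal sketch; the names, categories, and price values are assumptions inferred from the sample outputs shown later in this section.

Python
# Minimal sketch: generate a synthetic customers_data.csv so the steps below can be followed along.
# The names, categories, and price values are assumptions based on the sample outputs in this section.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
sample = pd.DataFrame({
    "Customer ID": range(1, n + 1),
    "Item ID": rng.integers(1, 101, size=n),
    "Customer Name": rng.choice(["Om", "Karan", "Bhavesh", "Chetan", "Naveen", "Yash"], size=n),
    "Item Category": rng.choice(["clothing", "homeware", "sports"], size=n),
    "Price": rng.choice([27.0, 55.0, 56.0, 66.0, 71.0], size=n),
})
sample.to_csv("customers_data.csv", index=False)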

Import Required Libraries and Load the CSV File

Python
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("customers_data.csv")
# Display the first few rows of the dataset
print("\nFirst few rows of the dataset:\n", df.head())

Output:

First few rows of the dataset:
   Customer ID  Item ID Customer Name Item Category  Price
0            1       22            Om      clothing   56.0
1            2       22         Karan      homeware   71.0
2            3       77       Bhavesh        sports   66.0
3            4       70        Chetan      clothing   56.0
4            5       67         Karan      clothing   56.0

Data Cleaning and Validation

Python
# Checking for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)
# Fill missing values if any (forward fill for simplicity)
df.ffill(inplace=True)  # fillna(method='ffill') is deprecated in recent pandas versions

Output:

Missing values in each column:
Customer ID 0
Item ID 0
Customer Name 0
Item Category 0
Price 0
dtype: int64

Ensure Correct Data Types

Converts the Customer ID, Item ID, and Price columns to the appropriate data types using astype().

Python
# Ensure correct data types
df["Customer ID"] = df["Customer ID"].astype(int)
df["Item ID"] = df["Item ID"].astype(int)
df["Price"] = df["Price"].astype(float)
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Customer ID    1000 non-null   int64
 1   Item ID        1000 non-null   int64
 2   Customer Name  1000 non-null   object
 3   Item Category  1000 non-null   object
 4   Price          1000 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB

Exploratory Data Analysis

Display Basic Statistics

Let’s look at basic statistical details, such as the mean, standard deviation, and quartiles, for the numerical columns using describe().

Python
# Display basic statistics
print("\nBasic statistics:\n", df.describe())

Output:

Basic statistics:
       Customer ID      Item ID        Price
count  1000.000000  1000.000000  1000.000000
mean    500.500000    50.736000    55.917000
std     288.819436    28.557273    14.890192
min       1.000000     1.000000    27.000000
25%     250.750000    26.000000    55.000000
50%     500.500000    51.000000    56.000000
75%     750.250000    75.000000    66.000000
max    1000.000000   100.000000    71.000000

Define the Target Item Category

Specifies the item category of interest. You can change “sports” to any other category as needed.

Python
#Define the target item category
target_category = "sports"

Filter Data for Purchases in the Target Category

Filters the DataFrame to include only rows where the item category matches the target category.

Python
#Filter data for purchases belonging to the target category
df_filtered = df[df["Item Category"] == target_category]
df_filtered.head()

Output:

    Customer ID  Item ID Customer Name Item Category  Price
2             3       77       Bhavesh        sports   66.0
6             7       44        Naveen        sports   66.0
9            10       35          Yash        sports   66.0
11           12       90        Zubair        sports   66.0
16           17       24       Jagdish        sports   66.0

Group Purchases by Customer ID and Calculate Total Spent per Customer

Groups the filtered data by Customer ID and calculates the total spending for each customer using groupby().

Python
#Group purchases by customer ID and calculate total spent per customer
customer_spending = df_filtered.groupby("Customer ID")["Price"].sum()
customer_spending.head()

Output:

Customer ID
3 66.0
7 66.0
10 66.0
12 66.0
17 66.0
...
967 66.0
968 66.0
978 66.0
981 66.0
990 66.0
Name: Price, Length: 202, dtype: float64

Identify Frequent Buyers

Sorts customers by total spending in descending order and selects the top 10 spenders.

Python
# Identify frequent buyers (e.g., top 10 customers spending the most)
frequent_buyers = customer_spending.sort_values(ascending=False).head(10)

Calculate Total Revenue from Frequent Buyers

Calculates the total revenue generated by the top 10 spenders.

Python
# Calculate total revenue from frequent buyers
total_revenue_frequent = frequent_buyers.sum()
total_revenue_frequent

Output:

660.0

Analyzing the Results

Prints the top 10 customers and the total revenue generated by them. In this sample each of the top 10 customers spent 66.0 on a single sports purchase, so the combined revenue is 10 × 66.0 = 660.0.

Python
# Presenting Results
print("\nTop 10 Customers (by spending) on", target_category, "items:")
print(frequent_buyers)

print("\nTotal Revenue Generated by Frequent Buyers:", total_revenue_frequent)

Output:

Top 10 Customers (by spending) on sports items:
Customer ID
3 66.0
726 66.0
699 66.0
701 66.0
708 66.0
711 66.0
712 66.0
714 66.0
715 66.0
717 66.0
Name: Price, dtype: float64

Total Revenue Generated by Frequent Buyers: 660.0

Visualize Results

Bar Plot of Top 10 Customers by Spending

Creates a bar plot of the top 10 customers by spending and saves it as frequent_buyers.png.

Python
# Visualize Results
plt.figure(figsize=(10, 6))
frequent_buyers.plot(kind='bar')
plt.title('Top 10 Customers by Spending on Sports Items')
plt.xlabel('Customer ID')
plt.ylabel('Total Spending')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('frequent_buyers.png')  # Save the plot as an image file
plt.show()

Output:

Top 10 Customers by Spending

Histogram of Spending Distribution

Creates a histogram showing the distribution of spending on sports items and saves it as spending_distribution.png.

Python
# Distribution of spending in the target category
plt.figure(figsize=(10, 6))
df_filtered['Price'].plot(kind='hist', bins=20, edgecolor='black')
plt.title('Distribution of Spending on Sports Items')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('spending_distribution.png')
plt.show()

Output:

Distribution of spending on sports items
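
As an optional extension of this analysis (not part of the walkthrough above), the same groupby() pattern can summarize item popularity and revenue per category across the whole dataset, which ties back to the category-specific trends mentioned at the start. The snippet below is a minimal sketch, assuming the df loaded and cleaned earlier is still in memory.

Python
# Minimal sketch (optional extension): purchases, revenue, and average price per category,
# computed from the same df that was loaded and cleaned above.
category_summary = (
    df.groupby("Item Category")["Price"]
      .agg(purchases="count", total_revenue="sum", avg_price="mean")
      .sort_values("total_revenue", ascending=False)
)
print(category_summary)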
