Descriptive Analysis with Pandas
Describe the dataset
A. For numerical data types
print(df.describe())
Output:
QUANTITY PRICE DISCOUNT
count 6.00000 6.000000 5.000000
mean 22.50000 65.000000 7.200000
std 10.05485 24.289916 1.923538
min 10.00000 30.000000 5.000000
25% 17.75000 52.500000 6.000000
50% 21.50000 65.000000 7.000000
75% 24.50000 77.500000 8.000000
max 40.00000 100.000000 10.000000
B. For object data types
print(df.describe(include=['O']))
Output:
FRUITS
count 6
unique 6
top Mango
freq 1
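Both summaries can also be produced in one call with include='all'. A minimal, self-contained sketch, recreating the article's sample data so it runs standalone:

```python
import numpy as np
import pandas as pd

# Recreate the article's sample data so the snippet runs standalone
df = pd.DataFrame({
    'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
    'QUANTITY': [40, 20, 25, 10, 23, 17],
    'PRICE': [80, 100, 50, 70, 60, 30],
    'DISCOUNT': [5.0, 7.0, 10.0, 8.0, 6.0, np.nan],
})

# include='all' combines numeric stats with object stats (unique, top, freq);
# cells that do not apply to a column are filled with NaN
desc = df.describe(include='all')
print(desc)
```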
Unique values
# Check the unique values in the dataset
df.FRUITS.unique()
Output:
array(['Mango', 'Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
dtype=object)
# Count occurrences of each unique value
df.FRUITS.value_counts()
Output:
Mango 1
Apple 1
Banana 1
Orange 1
Grapes 1
Pineapple 1
Name: FRUITS, dtype: int64
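If only the number of distinct values is needed rather than their counts, nunique() returns it directly:

```python
import pandas as pd

df = pd.DataFrame({'FRUITS': ['Mango', 'Apple', 'Banana',
                              'Orange', 'Grapes', 'Pineapple']})

# nunique() returns just the number of distinct values
n = df.FRUITS.nunique()
print(n)  # 6
```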
Sum values
print(df['PRICE'].sum())
Output:
390
Cumulative Sum
print(df['PRICE'].cumsum())
Output:
0     80
1    180
2    230
3    300
4    360
5    390
Name: PRICE, dtype: int64
Minimum Values
# Minimum PRICE
df['PRICE'].min()
Output:
30
Maximum Values
# Maximum PRICE
df['PRICE'].max()
Output:
100
Mean
# Mean PRICE
df['PRICE'].mean()
Output:
65.0
Median
# Median PRICE
df['PRICE'].median()
Output:
65.0
Variance
# Variance
df['PRICE'].var()
Output:
590.0
Standard Deviation
# Standard Deviation
df['PRICE'].std()
Output:
24.289915602982237
Quantile
# Quantile
df['PRICE'].quantile([0, 0.25, 0.75, 1])
Output:
0.00 30.0
0.25 52.5
0.75 77.5
1.00 100.0
Name: PRICE, dtype: float64
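Several of the statistics above can be computed in a single call with agg(). A short sketch on the PRICE column:

```python
import pandas as pd

df = pd.DataFrame({'PRICE': [80, 100, 50, 70, 60, 30]})

# agg() computes several statistics in one call,
# returning a Series indexed by the function names
stats = df['PRICE'].agg(['min', 'max', 'mean', 'median', 'std'])
print(stats)
```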
Apply a custom function
# Apply a custom function column-wise
def summation(col):
    if col.dtypes != 'int64':
        return col.count()
    else:
        return col.sum()
df.apply(summation)
Output:
FRUITS 6
QUANTITY 135
PRICE 390
DISCOUNT 5
dtype: int64
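The same per-dtype logic can also be expressed without a custom function by selecting columns by dtype. A sketch, assuming the article's sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
    'QUANTITY': [40, 20, 25, 10, 23, 17],
    'PRICE': [80, 100, 50, 70, 60, 30],
    'DISCOUNT': [5.0, 7.0, 10.0, 8.0, 6.0, np.nan],
})

# Sum the integer columns, count the object (string) columns
num_sum = df.select_dtypes('int64').sum()
obj_count = df.select_dtypes('object').count()
print(num_sum)
print(obj_count)
```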
Covariance
print(df.cov(numeric_only=True))
Output:
QUANTITY PRICE DISCOUNT
QUANTITY 101.1 53.0 -10.4
PRICE 53.0 590.0 -18.0
DISCOUNT -10.4 -18.0 3.7
Correlation
print(df.corr(numeric_only=True))
Output:
QUANTITY PRICE DISCOUNT
QUANTITY 1.000000 0.217007 -0.499210
PRICE 0.217007 1.000000 -0.486486
DISCOUNT -0.499210 -0.486486 1.000000
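corr() also accepts a method parameter ('pearson' is the default; 'kendall' and 'spearman' are available). For example, a rank-based correlation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'QUANTITY': [40, 20, 25, 10, 23, 17],
    'PRICE': [80, 100, 50, 70, 60, 30],
    'DISCOUNT': [5.0, 7.0, 10.0, 8.0, 6.0, np.nan],
})

# Rank-based (Spearman) correlation is less sensitive to outliers
spearman = df.corr(numeric_only=True, method='spearman')
print(spearman)
```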
Missing Values
Check for null values using the isnull() method.
# Check for null values
print(df.isnull())
Output:
FRUITS QUANTITY PRICE DISCOUNT
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 False False False True
Column-wise null value counts
# Total count of null values
print(df.isnull().sum())
Output:
FRUITS 0
QUANTITY 0
PRICE 0
DISCOUNT 1
dtype: int64
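The counts are easier to compare across columns as percentages; since isnull() yields booleans, mean() gives the fraction of missing values directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
    'DISCOUNT': [5.0, 7.0, 10.0, 8.0, 6.0, np.nan],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```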
Fill the null values with the mean
Mean = df.DISCOUNT.mean()
# Fill the null values
df['DISCOUNT'] = df['DISCOUNT'].fillna(Mean)
print(df)
Output:
FRUITS QUANTITY PRICE DISCOUNT
0 Mango 40 80 5.0
1 Apple 20 100 7.0
2 Banana 25 50 10.0
3 Orange 10 70 8.0
4 Grapes 23 60 6.0
5 Pineapple 17 30 7.2
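The median is a common alternative fill value when the column contains outliers. A sketch on the same DISCOUNT column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'DISCOUNT': [5.0, 7.0, 10.0, 8.0, 6.0, np.nan]})

# fillna() with the column median instead of the mean
filled = df['DISCOUNT'].fillna(df['DISCOUNT'].median())
print(filled)
```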
We can also drop rows containing null values using the command below:
# Drop the null values
df.dropna(inplace=True)
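dropna() also takes a subset parameter, so rows are dropped only when specific columns are null:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
    'DISCOUNT': [5.0, 7.0, 10.0, 8.0, 6.0, np.nan],
})

# Drop a row only if DISCOUNT (not any other column) is null
cleaned = df.dropna(subset=['DISCOUNT'])
print(cleaned)
```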
Add a column to the existing dataset
# Values to add
Origin = pd.Series(data=['BH', 'J&K', 'BH', 'MP', 'WB', 'WB'])
# Add a column in dataset
df['Origin'] = Origin
print(df)
Output:
FRUITS QUANTITY PRICE DISCOUNT Origin
0 Mango 40 80 5.0 BH
1 Apple 20 100 7.0 J&K
2 Banana 25 50 10.0 BH
3 Orange 10 70 8.0 MP
4 Grapes 23 60 6.0 WB
5 Pineapple 17 30 NaN WB
Add a column using existing column values
# Add a column derived from existing columns
df = df.assign(Paid_Price=lambda df: (df.QUANTITY * df.PRICE)
               - (df.QUANTITY * df.PRICE) * df.DISCOUNT / 100)
print(df)
Output:
FRUITS QUANTITY PRICE DISCOUNT Origin Paid_Price
0 Mango 40 80 5.0 BH 3040.0
1 Apple 20 100 7.0 J&K 1860.0
2 Banana 25 50 10.0 BH 1125.0
3 Orange 10 70 8.0 MP 644.0
4 Grapes 23 60 6.0 WB 1297.2
5 Pineapple 17 30 NaN WB NaN
Group By
Group the DataFrame by the ‘Origin’ column using the groupby() method.
# Group the DataFrame by 'Origin' column
grouped = df.groupby(by='Origin')
# Compute the sum as per Origin State
# All the above function can be
# applied here like median, std etc
print(grouped.agg(['sum', 'mean']))
Output:
QUANTITY PRICE DISCOUNT Paid_Price
sum mean sum mean sum mean sum mean
Origin
BH 65 32.5 130 65.0 15.0 7.5 4165.0 2082.5
J&K 20 20.0 100 100.0 7.0 7.0 1860.0 1860.0
MP 10 10.0 70 70.0 8.0 8.0 644.0 644.0
WB 40 20.0 90 45.0 6.0 6.0 1297.2 1297.2
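agg() also supports named aggregation, which lets you pick a function per column and name the result columns explicitly. A sketch using the Origin data above:

```python
import pandas as pd

df = pd.DataFrame({
    'Origin': ['BH', 'J&K', 'BH', 'MP', 'WB', 'WB'],
    'QUANTITY': [40, 20, 25, 10, 23, 17],
    'PRICE': [80, 100, 50, 70, 60, 30],
})

# Named aggregation: result_column=(source_column, function)
summary = df.groupby('Origin').agg(
    total_qty=('QUANTITY', 'sum'),
    avg_price=('PRICE', 'mean'),
)
print(summary)
```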
Outlier Detection using Box Plot
We can use a box plot to detect outliers.
# Box plot
df.boxplot(column='PRICE', grid=False)
Output:
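The rule a box plot's whiskers follow (flag points beyond 1.5 × IQR from the quartiles) can also be computed directly from the quantiles shown earlier:

```python
import pandas as pd

df = pd.DataFrame({'PRICE': [80, 100, 50, 70, 60, 30]})

# 1.5 * IQR rule, the convention box plot whiskers follow
q1 = df['PRICE'].quantile(0.25)
q3 = df['PRICE'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['PRICE'] < lower) | (df['PRICE'] > upper)]
print(outliers)  # empty: no PRICE value falls outside the fences
```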
Bar Plot with Pandas
The plot.bar() method is used to create a bar plot in pandas.
df.plot.bar(x='FRUITS', y=['QUANTITY', 'PRICE', 'DISCOUNT'])
Output:
Histogram with Pandas
The plot.hist() method is used to create a histogram.
df['QUANTITY'].plot.hist(bins=3)
Output:
Scatter Plot with Pandas
The plot.scatter() method is used to create a scatter plot in pandas.
df.plot.scatter(x='PRICE', y='DISCOUNT')
Output:
Pie Chart with Pandas
The plot.pie() method is used to create a pie chart.
grouped = df.groupby(['Origin'])
grouped.sum().plot.pie(y='Paid_Price', subplots=True)
Output: