Descriptive Analysis with Pandas
Describe the dataset
A. For numerical data types
print(df.describe())
Output:
QUANTITY PRICE DISCOUNT
count 6.00000 6.000000 5.000000
mean 22.50000 65.000000 7.200000
std 10.05485 24.289916 1.923538
min 10.00000 30.000000 5.000000
25% 17.75000 52.500000 6.000000
50% 21.50000 65.000000 7.000000
75% 24.50000 77.500000 8.000000
max 40.00000 100.000000 10.000000
B. For object data types
print(df.describe(include=['O']))
Output:
FRUITS
count 6
unique 6
top Mango
freq 1
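Both summaries can also be produced in one call with include='all'. A minimal, self-contained sketch, recreating the article's sample data so it runs standalone:

```python
import numpy as np
import pandas as pd

# Recreate the article's sample data so the snippet runs standalone
df = pd.DataFrame({
    'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
    'QUANTITY': [40, 20, 25, 10, 23, 17],
    'PRICE': [80, 100, 50, 70, 60, 30],
    'DISCOUNT': [5.0, 7.0, 10.0, 8.0, 6.0, np.nan],
})

# include='all' combines numeric stats with object stats (unique, top, freq);
# cells that do not apply to a column are filled with NaN
desc = df.describe(include='all')
print(desc)
```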
Unique values
# Check the unique values in the dataset
df.FRUITS.unique()
Output:
array(['Mango', 'Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
dtype=object)
# Count occurrences of each unique value
df.FRUITS.value_counts()
Output:
Mango 1
Apple 1
Banana 1
Orange 1
Grapes 1
Pineapple 1
Name: FRUITS, dtype: int64
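If only the number of distinct values is needed rather than their counts, nunique() returns it directly:

```python
import pandas as pd

df = pd.DataFrame({'FRUITS': ['Mango', 'Apple', 'Banana',
                              'Orange', 'Grapes', 'Pineapple']})

# nunique() returns just the number of distinct values
n = df.FRUITS.nunique()
print(n)  # 6
```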
Sum values
print(df['PRICE'].sum())
Output:
390
Cumulative Sum
print(df['PRICE'].cumsum())
Output:
0     80
1    180
2    230
3    300
4    360
5    390
Name: PRICE, dtype: int64
Minimum Values
# Minimum PRICE
df['PRICE'].min()
Output:
30
Maximum Values
# Maximum PRICE
df['PRICE'].max()
Output:
100
Mean
# Mean PRICE
df['PRICE'].mean()
Output:
65.0
Median
# Median PRICE
df['PRICE'].median()
Output:
65.0
Variance
# Variance
df['PRICE'].var()
Output:
590.0
Standard Deviation
# Standard Deviation
df['PRICE'].std()
Output:
24.289915602982237
Quantile
# Quantile
df['PRICE'].quantile([0, 0.25, 0.75, 1])
Output:
0.00 30.0
0.25 52.5
0.75 77.5
1.00 100.0
Name: PRICE, dtype: float64
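Several of the statistics above can be computed in a single call with agg(). A short sketch on the PRICE column:

```python
import pandas as pd

df = pd.DataFrame({'PRICE': [80, 100, 50, 70, 60, 30]})

# agg() computes several statistics in one call,
# returning a Series indexed by the function names
stats = df['PRICE'].agg(['min', 'max', 'mean', 'median', 'std'])
print(stats)
```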
Apply a custom function
# Apply a custom function column-wise
def summation(col):
    if col.dtypes != 'int64':
        return col.count()
    else:
        return col.sum()
df.apply(summation)
Output:
FRUITS 6
QUANTITY 135
PRICE 390
DISCOUNT 5
dtype: int64
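The same per-dtype logic can also be expressed without a custom function by selecting columns by dtype. A sketch, assuming the article's sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
    'QUANTITY': [40, 20, 25, 10, 23, 17],
    'PRICE': [80, 100, 50, 70, 60, 30],
    'DISCOUNT': [5.0, 7.0, 10.0, 8.0, 6.0, np.nan],
})

# Sum the integer columns, count the object (string) columns
num_sum = df.select_dtypes('int64').sum()
obj_count = df.select_dtypes('object').count()
print(num_sum)
print(obj_count)
```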
Covariance
print(df.cov(numeric_only=True))
Output:
QUANTITY PRICE DISCOUNT
QUANTITY 101.1 53.0 -10.4
PRICE 53.0 590.0 -18.0
DISCOUNT -10.4 -18.0 3.7
Correlation
print(df.corr(numeric_only=True))
Output:
QUANTITY PRICE DISCOUNT
QUANTITY 1.000000 0.217007 -0.499210
PRICE 0.217007 1.000000 -0.486486
DISCOUNT -0.499210 -0.486486 1.000000
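corr() also accepts a method parameter ('pearson' is the default; 'kendall' and 'spearman' are available). For example, a rank-based correlation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'QUANTITY': [40, 20, 25, 10, 23, 17],
    'PRICE': [80, 100, 50, 70, 60, 30],
    'DISCOUNT': [5.0, 7.0, 10.0, 8.0, 6.0, np.nan],
})

# Rank-based (Spearman) correlation is less sensitive to outliers
spearman = df.corr(numeric_only=True, method='spearman')
print(spearman)
```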
Missing Values
Check for null values using the isnull() method.
# Check for null values
print(df.isnull())
Output:
FRUITS QUANTITY PRICE DISCOUNT
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 False False False True
Column-wise null value counts
# Total count of null values
print(df.isnull().sum())
Output:
FRUITS 0
QUANTITY 0
PRICE 0
DISCOUNT 1
dtype: int64
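The counts are easier to compare across columns as percentages; since isnull() yields booleans, mean() gives the fraction of missing values directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
    'DISCOUNT': [5.0, 7.0, 10.0, 8.0, 6.0, np.nan],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```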
Fill the null values with the mean
Mean = df.DISCOUNT.mean()
# Fill the null values
df['DISCOUNT'] = df['DISCOUNT'].fillna(Mean)
print(df)
Output:
FRUITS QUANTITY PRICE DISCOUNT
0 Mango 40 80 5.0
1 Apple 20 100 7.0
2 Banana 25 50 10.0
3 Orange 10 70 8.0
4 Grapes 23 60 6.0
5 Pineapple 17 30 7.2
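The median is a common alternative fill value when the column contains outliers. A sketch on the same DISCOUNT column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'DISCOUNT': [5.0, 7.0, 10.0, 8.0, 6.0, np.nan]})

# fillna() with the column median instead of the mean
filled = df['DISCOUNT'].fillna(df['DISCOUNT'].median())
print(filled)
```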
We can also drop rows containing null values using the command below:
# Drop the null values
df.dropna(inplace=True)
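dropna() also takes a subset parameter, so rows are dropped only when specific columns are null:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'FRUITS': ['Mango', 'Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
    'DISCOUNT': [5.0, 7.0, 10.0, 8.0, 6.0, np.nan],
})

# Drop a row only if DISCOUNT (not any other column) is null
cleaned = df.dropna(subset=['DISCOUNT'])
print(cleaned)
```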
Add a column to the existing dataset
# Values to add
Origin = pd.Series(data=['BH', 'J&K', 'BH', 'MP', 'WB', 'WB'])
# Add a column in dataset
df['Origin'] = Origin
print(df)
Output:
FRUITS QUANTITY PRICE DISCOUNT Origin
0 Mango 40 80 5.0 BH
1 Apple 20 100 7.0 J&K
2 Banana 25 50 10.0 BH
3 Orange 10 70 8.0 MP
4 Grapes 23 60 6.0 WB
5 Pineapple 17 30 NaN WB
Add a column using existing column values
# Add a column derived from existing columns
df = df.assign(Paid_Price=lambda df: (df.QUANTITY * df.PRICE)
               - (df.QUANTITY * df.PRICE) * df.DISCOUNT / 100)
print(df)
Output:
FRUITS QUANTITY PRICE DISCOUNT Origin Paid_Price
0 Mango 40 80 5.0 BH 3040.0
1 Apple 20 100 7.0 J&K 1860.0
2 Banana 25 50 10.0 BH 1125.0
3 Orange 10 70 8.0 MP 644.0
4 Grapes 23 60 6.0 WB 1297.2
5 Pineapple 17 30 NaN WB NaN
Group By
Group the DataFrame by the ‘Origin’ column using the groupby() method.
# Group the DataFrame by 'Origin' column
grouped = df.groupby(by='Origin')
# Compute the sum as per Origin State
# All the above function can be
# applied here like median, std etc
print(grouped.agg(['sum', 'mean']))
Output:
QUANTITY PRICE DISCOUNT Paid_Price
sum mean sum mean sum mean sum mean
Origin
BH 65 32.5 130 65.0 15.0 7.5 4165.0 2082.5
J&K 20 20.0 100 100.0 7.0 7.0 1860.0 1860.0
MP 10 10.0 70 70.0 8.0 8.0 644.0 644.0
WB 40 20.0 90 45.0 6.0 6.0 1297.2 1297.2
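agg() also supports named aggregation, which lets you pick a function per column and name the result columns explicitly. A sketch using the Origin data above:

```python
import pandas as pd

df = pd.DataFrame({
    'Origin': ['BH', 'J&K', 'BH', 'MP', 'WB', 'WB'],
    'QUANTITY': [40, 20, 25, 10, 23, 17],
    'PRICE': [80, 100, 50, 70, 60, 30],
})

# Named aggregation: result_column=(source_column, function)
summary = df.groupby('Origin').agg(
    total_qty=('QUANTITY', 'sum'),
    avg_price=('PRICE', 'mean'),
)
print(summary)
```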
Outlier Detection using Box Plot
We can use a box plot to detect outliers.
# Box plot
df.boxplot(column='PRICE', grid=False)
Output:
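The rule a box plot's whiskers follow (flag points beyond 1.5 × IQR from the quartiles) can also be computed directly from the quantiles shown earlier:

```python
import pandas as pd

df = pd.DataFrame({'PRICE': [80, 100, 50, 70, 60, 30]})

# 1.5 * IQR rule, the convention box plot whiskers follow
q1 = df['PRICE'].quantile(0.25)
q3 = df['PRICE'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['PRICE'] < lower) | (df['PRICE'] > upper)]
print(outliers)  # empty: no PRICE value falls outside the fences
```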
Bar Plot with Pandas
The plot.bar() method is used to create a bar plot in pandas.
df.plot.bar(x='FRUITS', y=['QUANTITY', 'PRICE', 'DISCOUNT'])
Output:
Histogram with Pandas
The plot.hist() method is used to create a histogram.
df['QUANTITY'].plot.hist(bins=3)
Output:
Scatter Plot with Pandas
The plot.scatter() method is used to create a scatter plot in pandas.
df.plot.scatter(x='PRICE', y='DISCOUNT')
Output:
Pie Chart with Pandas
The plot.pie() method is used to create a pie chart.
grouped = df.groupby(['Origin'])
grouped.sum().plot.pie(y='Paid_Price', subplots=True)
Output: