Descriptive Analysis Pandas

Describe dataset

A. For numerical datatype

Python3
print(df.describe())

Output:

       QUANTITY       PRICE   DISCOUNT
count   6.00000    6.000000   5.000000
mean   22.50000   65.000000   7.200000
std    10.05485   24.289916   1.923538
min    10.00000   30.000000   5.000000
25%    17.75000   52.500000   6.000000
50%    21.50000   65.000000   7.000000
75%    24.50000   77.500000   8.000000
max    40.00000  100.000000  10.000000
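
The DataFrame df is created earlier in the cheat sheet and is not repeated in this section. As a minimal sketch, the reconstruction below rebuilds it from the values visible in the outputs of this section (so the exact construction is an assumption) and reproduces the summary above. Note that count is 5 for DISCOUNT because describe() skips the missing value.

```python
import numpy as np
import pandas as pd

# Sample data inferred from the outputs shown in this section (assumption:
# the original DataFrame definition is not part of this section).
df = pd.DataFrame({
    "FRUITS": ["Mango", "Apple", "Banana", "Orange", "Grapes", "Pineapple"],
    "QUANTITY": [40, 20, 25, 10, 23, 17],
    "PRICE": [80, 100, 50, 70, 60, 30],
    "DISCOUNT": [5, 7, 10, 8, 6, np.nan],
})

# Summary statistics for the numeric columns only
summary = df.describe()
print(summary)
```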

B. For object datatype

Python3
print(df.describe(include=['O']))

Output:

       FRUITS
count       6
unique      6
top     Mango
freq        1

Unique values

Python3
# Check the unique values in the dataset
df.FRUITS.unique()

Output:

array(['Mango', 'Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
      dtype=object)

Python3
# Count how many times each unique value occurs
df.FRUITS.value_counts()

Output:

Mango        1
Apple        1
Banana       1
Orange       1
Grapes       1
Pineapple    1
Name: FRUITS, dtype: int64
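
When only the number of distinct values is needed, or their relative share, nunique() and value_counts(normalize=True) are convenient companions to the calls above. The sketch below uses a standalone Series named fruits (a hypothetical stand-in for df.FRUITS).

```python
import pandas as pd

# Hypothetical stand-in for df.FRUITS
fruits = pd.Series(
    ["Mango", "Apple", "Banana", "Orange", "Grapes", "Pineapple"],
    name="FRUITS",
)

n_unique = fruits.nunique()                    # number of distinct values
shares = fruits.value_counts(normalize=True)   # relative frequency of each value

print(n_unique)
print(shares)
```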

Sum values

Python3
print(df['PRICE'].sum())

Output:

390

Cumulative Sum

Python3
print(df['PRICE'].cumsum())

Output:

0     80
1    180
2    230
3    300
4    360
5    390
Name: PRICE, dtype: int64
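
A quick consistency check: the last element of the cumulative sum always equals the plain sum. The sketch below uses a standalone Series named price (a hypothetical stand-in for df['PRICE'], with the six prices seen elsewhere in this section).

```python
import pandas as pd

# Hypothetical stand-in for df['PRICE']
price = pd.Series([80, 100, 50, 70, 60, 30], name="PRICE")

total = price.sum()        # single total over all rows
running = price.cumsum()   # running total, one value per row

print(total)               # 390
print(running.tolist())    # [80, 180, 230, 300, 360, 390]
```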

Minimum Values

Python3
# Minimum PRICE
df['PRICE'].min()

Output:

30

Maximum Values

Python3
# Maximum PRICE
df['PRICE'].max()

Output:

100

Mean

Python3
# Mean PRICE
df['PRICE'].mean()

Output:

65.0

Median

Python3
# Median PRICE
df['PRICE'].median()

Output:

65.0

Variance

Python3
# Variance
df['PRICE'].var()

Output:

590.0

Standard Deviation

Python3
# Standard Deviation
df['PRICE'].std()

Output:

24.289915602982237
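
By default, var() and std() compute the sample statistics, dividing by n - 1 (ddof=1) rather than n. The sketch below recomputes both by hand on a standalone Series named price (a hypothetical stand-in for df['PRICE']) to make the divisor explicit.

```python
import math

import pandas as pd

# Hypothetical stand-in for df['PRICE']
price = pd.Series([80, 100, 50, 70, 60, 30], name="PRICE")

n = len(price)
squared_dev = ((price - price.mean()) ** 2).sum()  # sum of squared deviations

sample_var = squared_dev / (n - 1)   # same as price.var(), i.e. ddof=1
population_var = squared_dev / n     # same as price.var(ddof=0)
sample_std = math.sqrt(sample_var)   # same as price.std()

print(sample_var, population_var, sample_std)
```

Pass ddof=0 to var() or std() when the data is the whole population rather than a sample.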

Quantile

Python3
# Quantile
df['PRICE'].quantile([0, 0.25, 0.75, 1])

Output:

0.00     30.0
0.25     52.5
0.75     77.5
1.00    100.0
Name: PRICE, dtype: float64
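
The fractional results (52.5 and 77.5) come from linear interpolation between neighbouring sorted values, which is the default for quantile(). The helper linear_quantile below is a hypothetical illustration of that rule, not part of pandas, applied to a stand-in for df['PRICE'].

```python
import math

import pandas as pd

# Hypothetical stand-in for df['PRICE']
price = pd.Series([80, 100, 50, 70, 60, 30], name="PRICE")


def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q       # fractional position in the sorted data
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


manual_q25 = linear_quantile(price, 0.25)
print(manual_q25, price.quantile(0.25))
```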

Apply any custom function

Python3
# Apply any custom function
def summation(col):
    if col.dtypes != 'int64':
        return col.count()
    else:
        return col.sum()


df.apply(summation)

Output:

FRUITS        6
QUANTITY    135
PRICE       390
DISCOUNT      5
dtype: int64
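
Comparing the dtype to the string 'int64' only matches that exact width. A slightly more general variant, sketched below, uses pd.api.types.is_integer_dtype so that any integer width (for example int32) is summed as well. The DataFrame is rebuilt from the values shown in this section, which is an assumption about the original data.

```python
import numpy as np
import pandas as pd

# Sample data inferred from the outputs in this section (assumption)
df = pd.DataFrame({
    "FRUITS": ["Mango", "Apple", "Banana", "Orange", "Grapes", "Pineapple"],
    "QUANTITY": [40, 20, 25, 10, 23, 17],
    "PRICE": [80, 100, 50, 70, 60, 30],
    "DISCOUNT": [5, 7, 10, 8, 6, np.nan],
})


def summation(col):
    # Sum integer columns of any width; count non-null values otherwise
    if pd.api.types.is_integer_dtype(col):
        return col.sum()
    return col.count()


result = df.apply(summation)
print(result)
```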

Covariance

Python3
print(df.cov(numeric_only=True))

Output:

          QUANTITY  PRICE  DISCOUNT
QUANTITY     101.1   53.0     -10.4
PRICE         53.0  590.0     -18.0
DISCOUNT     -10.4  -18.0       3.7

Correlation

Python3
print(df.corr(numeric_only=True))

Output:

          QUANTITY     PRICE  DISCOUNT
QUANTITY  1.000000  0.217007 -0.499210
PRICE     0.217007  1.000000 -0.486486
DISCOUNT -0.499210 -0.486486  1.000000
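
The two tables are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this for QUANTITY and PRICE using standalone Series (hypothetical stand-ins for the DataFrame columns).

```python
import pandas as pd

# Hypothetical stand-ins for df['QUANTITY'] and df['PRICE']
quantity = pd.Series([40, 20, 25, 10, 23, 17], name="QUANTITY")
price = pd.Series([80, 100, 50, 70, 60, 30], name="PRICE")

cov_qp = quantity.cov(price)                          # sample covariance
corr_manual = cov_qp / (quantity.std() * price.std()) # normalise by both spreads
corr_pandas = quantity.corr(price)                    # Pearson by default

print(cov_qp, corr_manual, corr_pandas)
```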

Missing Values

Check for null values using the isnull() function, which returns a Boolean DataFrame that is True wherever a value is missing.

Python3
# Check for null values
print(df.isnull())

Output:

   FRUITS  QUANTITY  PRICE  DISCOUNT
0   False     False  False     False
1   False     False  False     False
2   False     False  False     False
3   False     False  False     False
4   False     False  False     False
5   False     False  False      True

Column-wise null values count

Python3
# Total count of null values
print(df.isnull().sum())

Output:

FRUITS      0
QUANTITY    0
PRICE       0
DISCOUNT    1
dtype: int64
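
To see which rows are affected rather than only how many, the Boolean mask can be reduced row-wise with any(axis=1). The sketch below uses a two-column DataFrame rebuilt from the values in this section (an assumption about the original data).

```python
import numpy as np
import pandas as pd

# Reduced sample data inferred from this section (assumption)
df = pd.DataFrame({
    "FRUITS": ["Mango", "Apple", "Banana", "Orange", "Grapes", "Pineapple"],
    "DISCOUNT": [5, 7, 10, 8, 6, np.nan],
})

rows_with_nulls = df[df.isnull().any(axis=1)]  # rows with at least one null
total_nulls = int(df.isnull().sum().sum())     # nulls across the whole frame

print(rows_with_nulls)
print(total_nulls)
```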

Fill the null values with mean()

Python3
# Mean of the non-null DISCOUNT values
mean_discount = df['DISCOUNT'].mean()

# Fill the null values
df['DISCOUNT'] = df['DISCOUNT'].fillna(mean_discount)
print(df)

Output:

      FRUITS  QUANTITY  PRICE  DISCOUNT
0      Mango        40     80       5.0
1      Apple        20    100       7.0
2     Banana        25     50      10.0
3     Orange        10     70       8.0
4     Grapes        23     60       6.0
5  Pineapple        17     30       7.2
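
The mean is only one possible fill value. When a column contains extreme values, the median is often a more robust choice. The sketch below compares both on a standalone Series named discount (a hypothetical stand-in for df['DISCOUNT']); fillna() returns a new object, so the original Series is left unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['DISCOUNT']
discount = pd.Series([5, 7, 10, 8, 6, np.nan], name="DISCOUNT")

filled_mean = discount.fillna(discount.mean())      # fills with 7.2
filled_median = discount.fillna(discount.median())  # fills with 7.0

print(filled_mean.tolist())
print(filled_median.tolist())
```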

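Alternatively, rows that contain null values can be dropped with the command below. Note that the outputs in the following sections still show a NaN discount for Pineapple, so they were produced from the original DataFrame, with neither the fill nor the drop applied.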

Python3
# Drop the null values
df.dropna(inplace=True)
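
Because inplace=True modifies df permanently, a non-destructive variant can be safer while exploring. The sketch below keeps the original intact and uses the subset parameter to consider only the DISCOUNT column; the DataFrame is a reduced reconstruction of the data in this section (an assumption).

```python
import numpy as np
import pandas as pd

# Reduced sample data inferred from this section (assumption)
df = pd.DataFrame({
    "FRUITS": ["Mango", "Apple", "Banana", "Orange", "Grapes", "Pineapple"],
    "DISCOUNT": [5, 7, 10, 8, 6, np.nan],
})

# Returns a new DataFrame; df itself still has all six rows
cleaned = df.dropna(subset=["DISCOUNT"])

print(len(df), len(cleaned))
```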

Add a column to the existing dataset

Python3
# Values to add
Origin = pd.Series(data=['BH', 'J&K',
                         'BH', 'MP',
                         'WB', 'WB'])

# Add a column in dataset
df['Origin'] = Origin
print(df)

Output:

      FRUITS  QUANTITY  PRICE  DISCOUNT Origin
0      Mango        40     80       5.0     BH
1      Apple        20    100       7.0    J&K
2     Banana        25     50      10.0     BH
3     Orange        10     70       8.0     MP
4     Grapes        23     60       6.0     WB
5  Pineapple        17     30       NaN     WB
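
One detail worth knowing: assigning a Series as a column aligns on index labels, not on position. If rows have been dropped or the index is not 0..n-1, unmatched labels receive NaN. The sketch below shows the difference on a small hypothetical frame; converting the Series with to_numpy() assigns by position instead.

```python
import pandas as pd

# Small hypothetical frame; dropping row 1 leaves index labels 0 and 2
subset = pd.DataFrame({"FRUITS": ["Mango", "Apple", "Banana"]}).drop(index=1)

origin = pd.Series(["BH", "BH"])  # index labels 0 and 1

subset["Origin_aligned"] = origin                # label 2 has no match -> NaN
subset["Origin_positional"] = origin.to_numpy()  # assigned row by row

print(subset)
```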

Add a column using existing column values

Python3
# Add a column computed from existing columns:
# Paid_Price = QUANTITY * PRICE, less DISCOUNT percent
df = df.assign(Paid_Price=lambda d:
               (d.QUANTITY * d.PRICE)
               - (d.QUANTITY * d.PRICE) * d.DISCOUNT / 100)
print(df)

Output:

      FRUITS  QUANTITY  PRICE  DISCOUNT Origin  Paid_Price
0      Mango        40     80       5.0     BH      3040.0
1      Apple        20    100       7.0    J&K      1860.0
2     Banana        25     50      10.0     BH      1125.0
3     Orange        10     70       8.0     MP       644.0
4     Grapes        23     60       6.0     WB      1297.2
5  Pineapple        17     30       NaN     WB         NaN

Group By

Group the DataFrame by the 'Origin' column using the groupby() method.

Python3
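# Group the DataFrame by 'Origin' column
grouped = df.groupby(by='Origin')

# Compute the sum and mean per Origin state for the numeric columns.
# Selecting them explicitly avoids an error on the text column FRUITS
# in recent pandas versions. Other aggregations (median, std, etc.)
# can be applied the same way.
numeric_cols = ['QUANTITY', 'PRICE', 'DISCOUNT', 'Paid_Price']
print(grouped[numeric_cols].agg(['sum', 'mean']))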

Output:

        QUANTITY        PRICE         DISCOUNT        Paid_Price
             sum  mean    sum   mean       sum  mean         sum    mean
Origin
BH            65  32.5    130   65.0      15.0   7.5      4165.0  2082.5
J&K           20  20.0    100  100.0       7.0   7.0      1860.0  1860.0
MP            10  10.0     70   70.0       8.0   8.0       644.0   644.0
WB            40  20.0     90   45.0       6.0   6.0      1297.2  1297.2
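
When only a few statistics are needed, named aggregation produces flat, readable column names instead of a two-level header. The sketch below is one way to do this; the output names total_quantity and mean_price are hypothetical, and the DataFrame is a reduced reconstruction of the data in this section (an assumption).

```python
import pandas as pd

# Reduced sample data inferred from this section (assumption)
df = pd.DataFrame({
    "QUANTITY": [40, 20, 25, 10, 23, 17],
    "PRICE": [80, 100, 50, 70, 60, 30],
    "Origin": ["BH", "J&K", "BH", "MP", "WB", "WB"],
})

# Each keyword is an output column: (source column, aggregation)
per_origin = df.groupby("Origin").agg(
    total_quantity=("QUANTITY", "sum"),
    mean_price=("PRICE", "mean"),
)

print(per_origin)
```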

Outlier Detection using Box plot

A box plot can be used to detect outliers: points drawn beyond the whiskers are candidates.

Python3
# Box plot
df.boxplot(column='PRICE', grid=False)

Output: [Figure: box plot of the PRICE column]
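
The same check can be done numerically. With matplotlib's default whisker setting (whis=1.5), points more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile are drawn as outliers. The sketch below applies that rule to a standalone Series named price (a hypothetical stand-in for df['PRICE']).

```python
import pandas as pd

# Hypothetical stand-in for df['PRICE']
price = pd.Series([80, 100, 50, 70, 60, 30], name="PRICE")

q1 = price.quantile(0.25)
q3 = price.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # values below this are flagged
upper = q3 + 1.5 * iqr   # values above this are flagged

outliers = price[(price < lower) | (price > upper)]
print(lower, upper)
print(outliers)
```

For this data every price falls inside the bounds, so the selection is empty.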

Bar Plot with Pandas

The plot.bar() method draws a bar chart from a DataFrame.

Python3
df.plot.bar(x='FRUITS', y=['QUANTITY', 'PRICE', 'DISCOUNT'])

Output: [Figure: grouped bar chart of QUANTITY, PRICE and DISCOUNT for each fruit]

Histogram with pandas

The plot.hist() method creates a histogram.

Python3
df['QUANTITY'].plot.hist(bins=3)

Output: [Figure: histogram of QUANTITY with 3 bins]

Scatter Plot with Pandas

The plot.scatter() method creates a scatter plot of one column against another.

Python3
df.plot.scatter(x='PRICE', y='DISCOUNT')

Output: [Figure: scatter plot of DISCOUNT against PRICE]

Pie Chart with Pandas

The plot.pie() method creates a pie chart.

Python3
grouped = df.groupby(['Origin'])
grouped.sum().plot.pie(y='Paid_Price', subplots=True)

Output: [Figure: pie chart of total Paid_Price per Origin]

Pandas Cheat Sheet for Data Science in Python

Pandas is a powerful and versatile library that allows you to work with data in Python. It offers a range of features and functions that make data analysis fast, easy, and efficient. Whether you are a data scientist, analyst, or engineer, Pandas can help you handle large datasets, perform complex operations, and visualize your results.

This Pandas Cheat Sheet is designed to help you master the basics of Pandas and boost your data skills. It covers the most common and useful commands and methods that you need to know when working with data in Python. You will learn how to create, manipulate, and explore data frames, how to apply various functions and calculations, how to deal with missing values and duplicates, how to merge and reshape data, and much more.

If you are new to Data Science using Python and Pandas, or if you want to refresh your memory, this cheat sheet is a handy reference that you can use anytime. It will save you time and effort by providing you with clear and concise examples of how to use Pandas effectively.


Conclusion

In conclusion, the Pandas Cheat Sheet serves as an invaluable resource for data scientists and Python users. Its concise format and practical examples provide quick access to essential Pandas functions and methods. By leveraging this pandas cheat sheet, users can streamline their data manipulation tasks, gain insights from complex datasets, and make informed decisions. Overall, the Pandas Cheat Sheet is a must-have tool for enhancing productivity and efficiency in data science projects....

Pandas Cheat Sheet – FAQs

1. What is a Pandas cheat sheet?...
