Determining the distribution of a variable

Example:

Consider the Iris dataset, and suppose we want to understand how petal length is distributed. The steps are as follows.

Run the code in a Jupyter notebook or any other IDE where pandas, NumPy, Matplotlib, and SciPy are installed.

Python3

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm  # loading normal distribution
 
# Step 1: Load the Iris dataset
url = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
iris_data = pd.read_csv(url)
 
# Step 2: Select the feature for analysis (e.g., petal length)
selected_feature = 'petal_length'
selected_data = iris_data[selected_feature]
 
# Step 3: Plot the histogram of the selected feature
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(selected_data, bins=30, density=True, color='skyblue', alpha=0.6)
plt.title('Histogram of {}'.format(selected_feature))
plt.xlabel(selected_feature)
plt.ylabel('Density')
plt.grid(True)
 
# Step 4: Fit a Gaussian distribution to the selected feature
# (np.std defaults to ddof=0, which is the maximum-likelihood estimate of the standard deviation)
estimated_mean, estimated_std = np.mean(selected_data), np.std(selected_data)
 
# Step 5: Plot the histogram along with the fitted Gaussian distribution
plt.subplot(1, 2, 2)
plt.hist(selected_data, bins=30, density=True, color='skyblue', alpha=0.6, label='Histogram')
 
# Evaluate the fitted normal PDF across the observed range of the feature
x = np.linspace(np.min(selected_data), np.max(selected_data), 100)
pdf = norm.pdf(x, estimated_mean, estimated_std)
plt.plot(x, pdf, color='red', linestyle='--', linewidth=2, label='Fitted Gaussian Distribution')
 
plt.title('Histogram and Fitted Gaussian Distribution of {}'.format(selected_feature))
plt.xlabel(selected_feature)
plt.ylabel('Density')
plt.legend()  # labels are taken from the label= arguments above, so they cannot be mismatched
plt.grid(True)
 
plt.tight_layout()
plt.show()

Output:

Explanation for the output:

Histogram: The left panel shows a histogram of petal lengths. Each bar covers a range of petal lengths. Because the plot uses density=True, the bar heights are densities scaled so that the total bar area equals 1, not raw counts. The histogram is not bell-shaped: it has two separate clusters of bars, which already suggests that a single normal distribution may not describe petal length well.
Gaussian Distribution: The right panel overlays a fitted Gaussian (normal) distribution on the same histogram. The curve is the normal density whose mean and standard deviation were estimated from the data. Because the histogram has two clusters of bars, this single curve fits poorly. It is possible that the data come from two distinct populations with different petal lengths, so that the overall distribution is a mixture of normal distributions. This could happen if the sample includes flowers from different varieties or grown under different conditions, and the Iris dataset does contain three species.
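
One way to probe the two-population idea is to fit a separate Gaussian to each group instead of a single one to the pooled data. The following is a minimal sketch using only the standard library. `fit_gaussian_per_group` is a hypothetical helper, not part of the original example, and the labels and values in the usage lines are made-up toy numbers, not Iris measurements.

```python
from collections import defaultdict
from statistics import NormalDist


def fit_gaussian_per_group(labels, values):
    """Fit one Gaussian per label (hypothetical helper for illustration).

    Returns a dict mapping each label to a NormalDist. Note that
    NormalDist.from_samples uses the sample standard deviation
    (n - 1 denominator), unlike np.std's default of ddof=0.
    """
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(float(value))
    return {label: NormalDist.from_samples(vals) for label, vals in groups.items()}


# Toy values for illustration only (NOT taken from the Iris dataset).
toy_labels = ["a", "a", "a", "b", "b", "b"]
toy_values = [1.0, 1.5, 2.0, 4.0, 5.0, 6.0]

fits = fit_gaussian_per_group(toy_labels, toy_values)
for label, dist in fits.items():
    print(label, round(dist.mean, 2), round(dist.stdev, 2))
```

With the DataFrame from the example above, this might be called as `fit_gaussian_per_group(iris_data['species'], iris_data[selected_feature])`, assuming the CSV provides a `species` column.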

These graphs provide insights into the distribution of petal lengths in the Iris dataset and help us assess whether a Gaussian distribution is a suitable model for representing this data.
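
A formal normality test can complement the visual check. The sketch below applies the Shapiro-Wilk test from `scipy.stats` to a synthetic two-cluster sample. The sample is generated, not Iris data, and the cluster parameters, the seed, and the 0.05 threshold are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic stand-in for a two-cluster feature (NOT the Iris measurements).
rng = np.random.default_rng(0)
sample = np.concatenate([
    rng.normal(loc=1.5, scale=0.2, size=50),   # cluster of small values
    rng.normal(loc=5.0, scale=0.8, size=100),  # cluster of large values
])

# Null hypothesis: the sample was drawn from a normal distribution.
statistic, p_value = shapiro(sample)

alpha = 0.05  # conventional significance level (an assumption)
reject_normality = p_value < alpha
print(f"W = {statistic:.3f}, p = {p_value:.3g}, reject normality: {reject_normality}")
```

For the data in the example above, `shapiro(selected_data)` would run the same test on petal length. A small p-value is evidence against a single Gaussian, but it does not indicate which alternative model is appropriate.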

Continuous Probability Distributions for Machine Learning

Machine learning relies heavily on probability distributions because they offer a framework for describing the uncertainty and variability present in data. Continuous probability distributions, in particular, describe how likely a continuous outcome, such as a real-valued measurement, is to fall within a given range.
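
As a small illustration of this idea, a continuous distribution assigns probability to intervals, and the density at a single point is not itself a probability. The sketch below uses the standard library's `statistics.NormalDist`. The mean of 0 and standard deviation of 1 are arbitrary choices for the example.

```python
from statistics import NormalDist

# Standard normal distribution; the parameters are illustrative assumptions.
dist = NormalDist(mu=0.0, sigma=1.0)

# Probability of landing within one standard deviation of the mean:
# P(a <= X <= b) = CDF(b) - CDF(a).
p_within_one_sigma = dist.cdf(1.0) - dist.cdf(-1.0)

# Density at a single point. This is the height of the curve, not a probability.
density_at_mean = dist.pdf(0.0)

print(f"P(-1 <= X <= 1) = {p_within_one_sigma:.4f}")
print(f"pdf(0) = {density_at_mean:.4f}")
```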

Table of Contents

What are Continuous Probability Distributions?
Importance in Machine Learning
Types of Continuous Probability Distributions
Determining the distribution of a variable
