How to Use Regression Metrics on the California House Prices Dataset in Python

The steps below apply regression metrics to a model, using house price prediction as the worked example.

Import Libraries and Load the Dataset

Python

# Import libraries
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

                    

We import the necessary libraries. The dataset can come from your own source or, as in this example, from the scikit-learn library.

Loading the Dataset

Python

# Load the California Housing Prices dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

                    

The code loads the California Housing Prices dataset with scikit-learn’s fetch_california_housing function, builds a DataFrame (df) from the feature values and feature names, and then adds the target variable as a ‘target’ column. In this dataset the target is the median house value of a district, expressed in units of $100,000.

Feature-Target Separation and Train-Test Split

Python

# Split the data into features (X) and target variable (y)
X = df.drop(columns=['target'])
y = df['target']
 
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

                    

The code separates the dataset into features (X) and the target variable (y): X is the DataFrame with the ‘target’ column dropped, and y is the ‘target’ column itself. It then splits the data into training and testing sets, using 80% of the rows for training (X_train and y_train) and 20% for testing (X_test and y_test). The fixed random seed (random_state=42) makes the split reproducible.
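
The effect of test_size and random_state can be checked on a tiny example. The sketch below uses made-up toy arrays (X_toy and y_toy are hypothetical stand-ins, not the housing data) and assumes only that NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical, not the housing dataset): 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Split twice with the same seed
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print("train rows:", X_tr.shape[0])  # 8 of the 10 rows
print("test rows:", X_te.shape[0])   # 2 of the 10 rows

# The same seed should give identical partitions on every run
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)
```

Every row ends up in exactly one of the two sets, which is why the test set can serve as unseen data when the metrics are computed later.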

Create and Train the Regression Model

Python

# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

                    

This code creates a linear regression model (model) and fits it to the training data (X_train and y_train), learning a linear relationship between the features and the target variable.
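
To see what fitting learns, here is a minimal sketch on synthetic data where the true relationship is known. The data, the coefficients 3 and -2, the intercept 5, and the name demo_model are all assumptions chosen for illustration; they are not part of the housing example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (hypothetical): y = 3*x1 - 2*x2 + 5
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 2))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + 5.0

# Fit ordinary least squares, as in the article
demo_model = LinearRegression()
demo_model.fit(X_syn, y_syn)

# The fitted parameters should be very close to the values used above
print("coefficients:", demo_model.coef_)
print("intercept:", demo_model.intercept_)
```

On real data such as the housing dataset the relationship is not exactly linear, so the fitted coefficients are only the best linear approximation in the least-squares sense.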

Make Predictions

Python

# Make predictions on the test set
y_pred = model.predict(X_test)

                    

The code uses the trained Linear Regression model (model) to predict the target variable for the test set (X_test). The predictions (y_pred) are computed from the linear relationship the model learned during training.
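
For a linear model, a prediction is just a weighted sum of the features plus an intercept. The following sketch checks this on synthetic data (X_fit, y_fit, X_new and demo_model are hypothetical names introduced here), using the coef_ and intercept_ attributes of a fitted LinearRegression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (hypothetical): 3 features plus a little noise
rng = np.random.default_rng(1)
X_fit = rng.normal(size=(40, 3))
y_fit = X_fit @ np.array([1.5, -0.5, 2.0]) + rng.normal(scale=0.1, size=40)

demo_model = LinearRegression().fit(X_fit, y_fit)

# New, unseen rows to predict on
X_new = rng.normal(size=(5, 3))

# Prediction through the API and through the explicit linear formula
pred_api = demo_model.predict(X_new)
pred_manual = X_new @ demo_model.coef_ + demo_model.intercept_

print("predictions match:", np.allclose(pred_api, pred_manual))
```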

Calculate Evaluation Metrics

Python

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)
 
# Print the evaluation metrics
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r_squared)
print("Root Mean Squared Error (RMSE):", rmse)

                    

Output:

Mean Absolute Error (MAE): 0.5332001304956553
Mean Squared Error (MSE): 0.5558915986952444
R-squared (R²): 0.5757877060324508
Root Mean Squared Error (RMSE): 0.7455813830127764

The code computes four regression evaluation metrics, Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared (R²), and Root Mean Squared Error (RMSE), from the predicted values (y_pred) and the actual test-set values (y_test). It then prints them. Together they indicate how large the prediction errors are and how well the model fits the data.
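
To make the definitions concrete, the four metrics can also be computed by hand with NumPy. This is a minimal sketch on four made-up values (y_true_toy and y_pred_toy are hypothetical, not taken from the housing data); it follows the standard definitions of each metric.

```python
import numpy as np

# Made-up actual and predicted values (hypothetical)
y_true_toy = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_toy = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true_toy - y_pred_toy

# MAE: mean of the absolute errors
mae_toy = np.mean(np.abs(errors))

# MSE: mean of the squared errors
mse_toy = np.mean(errors ** 2)

# RMSE: square root of the MSE, back in the units of the target
rmse_toy = np.sqrt(mse_toy)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1.0 - ss_res / ss_tot

print("MAE:", mae_toy)    # 0.75
print("MSE:", mse_toy)    # 0.875
print("RMSE:", rmse_toy)  # about 0.9354
print("R²:", r2_toy)      # about 0.7241
```

Note that the RMSE (about 0.94) is larger than the MAE (0.75) here because squaring gives the single 1.5 error more weight.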

Understanding the output:

1. Mean Absolute Error (MAE): 0.5332

  • An MAE of 0.5332 means that, on average, the model’s predictions are off by about 0.5332 target units. Because the target is expressed in units of $100,000, that is roughly $53,320 away from the true median house value.

2. Mean Squared Error (MSE): 0.5559

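  • An MSE of 0.5559 means that the average squared prediction error is approximately 0.5559. The value is in squared target units, so it is harder to interpret directly, but it penalizes large errors more heavily than MAE does.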

3. R-squared (R²): 0.5758

  • An R² of 0.5758 indicates that the model explains approximately 57.58% of the variance in house prices on the test set.

4. Root Mean Squared Error (RMSE): 0.7456

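  • An RMSE of 0.7456 is the square root of the MSE and is expressed in the same units as the house prices, so the typical prediction error is about 0.7456 target units, or roughly $74,560. It is larger than the MAE because squaring gives large errors more weight.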
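
The conversion from target units to dollars can be sketched in a few lines. This assumes, as the scikit-learn dataset description states, that the target is in units of $100,000; the constant name UNIT_IN_DOLLARS is introduced here for illustration, and the metric values are the rounded ones from the list above.

```python
# Assumption: the California housing target is in units of $100,000
UNIT_IN_DOLLARS = 100_000

# Rounded metric values from the output above
mae_units = 0.5332
rmse_units = 0.7456

# MAE and RMSE share the target's units, so they scale linearly
mae_dollars = mae_units * UNIT_IN_DOLLARS
rmse_dollars = rmse_units * UNIT_IN_DOLLARS

print(f"MAE  is about ${mae_dollars:,.0f}")   # $53,320
print(f"RMSE is about ${rmse_dollars:,.0f}")  # $74,560
```

MSE is in squared units, so it would need to be scaled by the square of the unit, and R² is unitless and needs no conversion.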

Regression Metrics

Machine learning is an effective tool for predicting numerical values, and regression is one of its key applications. In regression analysis, accurate evaluation is crucial for measuring the overall performance of predictive models. This is where scikit-learn, a widely used Python machine learning library, comes in. It provides a comprehensive set of regression metrics for evaluating the quality of regression models.

In this article, we explore the basics of regression metrics in scikit-learn, discuss the steps needed to use them effectively, provide some examples, and show the expected output for each metric.
