How to Use Regression Metrics on the California House Prices Dataset in Python
The steps below apply regression metrics to a model, using house price prediction as a worked example.
Import Libraries and Load the Dataset
Python
# Import libraries
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
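We import pandas and NumPy for data handling, along with the scikit-learn pieces used in the rest of the example: the dataset loader, the train/test split helper, the linear regression model, and the evaluation metrics.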
Loading the Dataset
Python
# Load the California Housing Prices dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
The code loads the California Housing Prices dataset with scikit-learn's fetch_california_housing function, builds a DataFrame (df) from the feature matrix using the dataset's feature names as column names, and then adds the target variable as a 'target' column. The target is the median house value of each district, expressed in units of $100,000.
Data Splitting and Train-Test Split
Python
# Split the data into features (X) and target variable (y)
X = df.drop(columns=['target'])
y = df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The code separates the dataset into features (X) and the target variable (y): X is the DataFrame with the 'target' column dropped, and y is the 'target' column itself. It then splits the data into training and testing sets, using 80% of the rows for training (X_train and y_train) and 20% for testing (X_test and y_test). The fixed random seed (random_state=42) makes the split reproducible.
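To see what test_size and random_state do in practice, here is a minimal sketch on toy data. The arrays X_demo and y_demo and the helper split are hypothetical names introduced only for this illustration; they stand in for the housing data so the example runs without downloading anything.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 2 features each
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

def split(seed):
    """Split the toy data 80/20 using the given random seed."""
    return train_test_split(X_demo, y_demo, test_size=0.2, random_state=seed)

X_tr, X_te, y_tr, y_te = split(42)
X_tr2, X_te2, y_tr2, y_te2 = split(42)

print(X_tr.shape, X_te.shape)       # (80, 2) (20, 2)
print(np.array_equal(y_te, y_te2))  # True: the same seed gives the same split
```

Changing the seed would generally produce a different partition, which is why reported metrics can shift slightly from one split to another.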
Create and Train the Regression Model
Python
# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
This code creates a linear regression model (model) and fits it to the training data (X_train and y_train), learning a linear relationship between the features and the target variable.
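After fitting, the learned relationship can be read from the model's coef_ and intercept_ attributes. The sketch below uses synthetic, noise-free data (an assumption made purely for illustration, with hypothetical names X_demo, y_demo, and demo_model) so that the recovered coefficients can be compared against known values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear rule (assumed for illustration):
# y = 3*x0 - 2*x1 + 1, with no noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.0

demo_model = LinearRegression().fit(X_demo, y_demo)

print(demo_model.coef_)       # approximately [ 3. -2.]
print(demo_model.intercept_)  # approximately 1.0
```

On the housing data, the same attributes on model would show one coefficient per feature column in X_train, though real data is noisy and the coefficients are only estimates.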
Make Predictions
Python
# Make predictions on the test set
y_pred = model.predict(X_test)
The code uses the trained Linear Regression model (model) to predict the target variable for the test set (X_test). The predictions (y_pred) are computed from the linear relationship learned during training and are later compared against the true values (y_test).
Calculate Evaluation Metrics
Python
# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

# Print the evaluation metrics
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r_squared)
print("Root Mean Squared Error (RMSE):", rmse)
Output:
Mean Absolute Error (MAE): 0.5332001304956553
Mean Squared Error (MSE): 0.5558915986952444
R-squared (R²): 0.5757877060324508
Root Mean Squared Error (RMSE): 0.7455813830127764
The code computes four regression evaluation metrics, Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared (R²), and Root Mean Squared Error (RMSE), by comparing the predicted values (y_pred) with the actual values from the test set (y_test). Printing these metrics shows how accurate the model's predictions are and how well the model fits the data.
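To make the definitions concrete, the following sketch recomputes the four metrics by hand with NumPy and checks them against scikit-learn. The arrays y_true and y_hat hold made-up values chosen only for illustration; they are not taken from the housing model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted values, for illustration only
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.5, 4.0])

errors = y_true - y_hat

mae_manual = np.mean(np.abs(errors))   # average absolute error
mse_manual = np.mean(errors ** 2)      # average squared error
rmse_manual = np.sqrt(mse_manual)      # square root of the MSE

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual, mse_manual, rmse_manual)  # 0.5 0.25 0.5
print(round(r2_manual, 4))                  # 0.8095

# The hand-computed values agree with scikit-learn's functions
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))  # True
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))   # True
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))              # True
```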
Understanding the output:
1. Mean Absolute Error (MAE): 0.5332
- An MAE of 0.5332 means that, on average, the model's predictions are off by about 0.5332 target units. Because the target is expressed in units of $100,000, this corresponds to roughly $53,320.
2. Mean Squared Error (MSE): 0.5559
- An MSE of 0.5559 is the average of the squared prediction errors. It is expressed in squared target units, so it is harder to interpret directly, but it penalizes large errors more heavily than the MAE does.
3. R-squared (R²): 0.5758
- An R² of 0.5758 indicates that the model explains approximately 57.58% of the variance in house prices on the test set.
4. Root Mean Squared Error (RMSE): 0.7456
- An RMSE of 0.7456 is the square root of the MSE and is in the same units as the house prices, so it corresponds to a typical error of roughly $74,558. It weights large errors more heavily than the MAE, which is why it is the larger of the two.
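Because the California housing target is expressed in units of $100,000, the MAE and RMSE can be converted to dollars with a single multiplication. The sketch below copies the two values from the printed output; the constant name UNIT_DOLLARS is introduced here only for illustration.

```python
# Metric values copied from the printed output
mae = 0.5332001304956553
rmse = 0.7455813830127764

# The dataset's target is the median house value in units of $100,000
UNIT_DOLLARS = 100_000

mae_dollars = mae * UNIT_DOLLARS
rmse_dollars = rmse * UNIT_DOLLARS

print(f"MAE:  ${mae_dollars:,.0f}")   # MAE:  $53,320
print(f"RMSE: ${rmse_dollars:,.0f}")  # RMSE: $74,558
```

The MSE is left out on purpose: it is in squared units, so multiplying it by 100,000 would not give a meaningful dollar amount.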
Regression Metrics
Machine learning is an effective tool for predicting numerical values, and regression is one of its key applications. In regression analysis, reliable evaluation is crucial for measuring the performance of predictive models. This is where scikit-learn, the popular Python machine learning library, comes in: it provides a comprehensive set of regression metrics for evaluating the quality of regression models.
In this article, we explore the basics of regression metrics in scikit-learn, discuss the steps needed to use them effectively, provide some examples, and show the output for each metric.