Regression Models for California Housing Price Prediction

In this article, we will build a machine learning model that predicts the median housing price using the California housing dataset from the StatLib repository. The dataset is based on the 1990 California census and contains metrics such as the population, median income and median house value for each district. This is a supervised learning task (we have labeled training data) because each instance comes with an expected output (the median housing price). It is also a univariate multiple regression task, since we predict a single value from multiple features.

Table of Contents

  • California House Price Prediction
  • Training Models for California Housing Price Forecasting
    • 1. Linear Regression Model
    • 2. Decision Tree Regression Model
    • 3. Random Forest Regression Model
    • Evaluating Using Cross-Validation
  • Fine-Tune the Models

California House Price Prediction

The California housing dataset is a popular dataset for practising regression with machine learning models. We will follow the steps below to predict the house prices.

Step 1: Loading California House Price Dataset

The read_csv() method reads a CSV file into a DataFrame, and the info() method gives a quick description of the data, such as the columns, the total number of rows, each attribute’s type and the number of non-null values.

Python
import pandas as pd
housing = pd.read_csv("https://media.w3wiki.net/wp-content/uploads/20240319120216/housing.csv")
housing.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object

As we can see, there are 20640 instances in the dataset. The total_bedrooms attribute has only 20433 non-null values (207 districts are missing this value), and all attributes are numerical except the ocean_proximity field (a text field). The median_house_value is the housing price, which we need to predict with our machine learning model.
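
As a quick check (a minimal sketch), we can count the missing values per column directly:

Python
# count missing values per column; only total_bedrooms should have any
missing = housing.isnull().sum()
print(missing[missing > 0])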

Before getting to model training, let’s analyse how the attributes in the housing data correlate with the median house value (the house price). We can easily find the standard correlation coefficient using the corr() method. Since the ocean_proximity attribute is non-numeric, we need to drop that field before calculating the correlation.

Python
def find_correlation(housing_numeric):
  # computing standard correlation coefficient
  corr_matrix = housing_numeric.corr()
  # fetch and return attribute correlates 
  # with the median housing value
  return corr_matrix["median_house_value"].sort_values(
    ascending=False)
  
# drop ocean_proximity column
housing_numeric = housing.drop("ocean_proximity", axis=1)
# find correlation coefficient
cor_coef = find_correlation(housing_numeric)
print("Correlation Coefficient::", cor_coef)

Output:

Correlation Coefficient:: median_house_value    1.000000
median_income 0.688075
total_rooms 0.134153
housing_median_age 0.105623
households 0.065843
total_bedrooms 0.049686
population -0.024650
longitude -0.045967
latitude -0.144160
Name: median_house_value, dtype: float64

  • Here, the median house value tends to go up when the median income increases. Similarly, you can notice a small negative correlation with latitude: the median house value has a slight tendency to go down as we go north.

Training Models for California Housing Price Forecasting

The process of training a machine learning model involves preparing the data and feeding it to an ML algorithm. Since our aim is to predict a value from a labeled training dataset, we must use regression algorithms. Here we will explore a few regression models and identify a promising one based on its prediction error. We will be using the following models:

  1. Linear Regression Model
  2. Decision Tree Regression Model
  3. Random Forest Regression Model

1. Linear Regression Model

Let’s start with a linear regression model. Our dataset has several attributes, such as median_income, total_rooms, latitude, longitude and so on. Before using multiple features to predict median_house_value, let’s start with a single feature.

From the correlation data, it is clear that median_income is a promising attribute. Let’s check the scatterplot diagram between median_income and median_house_value.

Python
# scatter plot diagram
housing.plot(kind='scatter', 
             x="median_income", y="median_house_value", 
             alpha=0.1)

Output:

median income versus median house value

From the above diagram, it is clear that the correlation is very strong: there is a clear upward trend, and the points are not too dispersed.

Let’s make use of median_income to train a linear regression model. Since median_income doesn’t have any missing data, we don’t need to deal with missing values.

Here, we need to create training and test sets for the machine learning model. Scikit-Learn provides a few functions to split datasets into multiple subsets. The simplest function is train_test_split().

But the problem with train_test_split() is that it is a purely random sampling method. That is fine if the dataset is large enough, but our housing dataset is fairly small and purely random sampling may introduce sampling bias. So, we need a stratified sampling method.
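
For reference, a purely random split with train_test_split() would look like the sketch below; we won’t rely on it here because of the sampling-bias concern:

Python
from sklearn.model_selection import train_test_split

# purely random 80/20 split; random_state makes the split reproducible
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(len(train_set), "train +", len(test_set), "test")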

In stratified sampling, the right number of instances are sampled from each stratum to ensure that the test set represents the entire population. From correlation data, we understood that the median income is an important attribute to predict median housing prices. So our training and test set should represent various categories of income in the whole dataset.

The StratifiedShuffleSplit cross-validator is a merge of StratifiedKFold and ShuffleSplit, and it returns stratified randomized folds. The folds are made by preserving the percentage of samples from each class.

Let’s create a histogram of the median income for a more detailed analysis.

Python
housing.hist(column='median_income', 
             bins=50, figsize=(9,6))

Output:

median income histogram

From the above histogram, it’s clear that most median income values are clustered around 1.5 to 6.

To do stratified sampling based on the income category, we need to create an income category attribute. Let’s create 5 categories (labeled from 1 to 5): category 1 ranges from 0 to 1.5 (i.e., less than $15,000), category 2 ranges from 1.5 to 3, and so on.

Python
import numpy as np

housing["income_cat"] = pd.cut(housing["median_income"], 
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                              labels=[1, 2, 3, 4, 5])

housing["income_cat"].hist()

Output:

income category histogram

Here we use the pd.cut() method to create an income category attribute with 5 categories.

Note: It is important not to have too many strata, and each stratum should have a sufficient number of instances in your dataset.
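
As a quick sanity check (a minimal sketch), we can look at the proportion of districts that fall into each stratum:

Python
# proportion of districts per income category; every stratum should be reasonably large
print(housing["income_cat"].value_counts() / len(housing))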

Let’s predict median_house_value based on median_income alone. We need to create training and test sets using stratified sampling, use a linear regression model to find the best fit, and finally predict the housing price.

The data is categorized based on the income category so that the stratified sampling method can draw training and test data from the various income categories.

Python
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import mean_squared_error

def fetch_housing_data():
  return pd.read_csv(
    "https://media.w3wiki.net/wp-content/uploads/20240319120216/housing.csv")

def set_income_category(housing_selected):
    # set income category based on median income
    housing_selected["income_cat"] = pd.cut(housing_selected["median_income"], 
                                            bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                            labels=[1, 2, 3, 4, 5])
    return housing_selected

def get_strat_train_test_dataset(housing_selected):
    # stratified sampling
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    housing_split = split.split(housing_selected, housing_selected["income_cat"])
    # get train and test dataset
    for train_index, test_index in housing_split:
        train_set = housing_selected.loc[train_index]
        test_set = housing_selected.loc[test_index]
        
    # drop income_category from train and test dataset
    for set_ in (train_set, test_set):
        set_.drop("income_cat", axis=1, inplace=True)
    
    return train_set, test_set

def sprt_train_and_label_set(train_set):
    # drop median_house_value from training data
    housing_tr = train_set.drop("median_house_value", axis=1)
    # create a new dataframe with median_house_value
    housing_labels = train_set["median_house_value"].copy()
    return housing_tr, housing_labels

def get_rmse(housing_labels, predicted_data):
    # get mean squared error to analyse prediction error
    mse = mean_squared_error(housing_labels, predicted_data)
    rmse = np.sqrt(mse)
    return rmse

# get housing data
housing = fetch_housing_data()
# copy median_income and median_house_value
housing_selected = housing[['median_income', 'median_house_value']].copy()
# set income category based on median_income
housing_selected = set_income_category(housing_selected)
# stratified sampling
train_set, test_set = get_strat_train_test_dataset(housing_selected)
# separate labels and data from the training set
housing_tr, housing_labels = sprt_train_and_label_set(train_set)


In the above code, we have prepared the data for regression. Now, let’s make use of the prepared data to find the best fit and make predictions using linear regression.

Python
from sklearn.linear_model import LinearRegression
# linear regression model for best fit
lin_reg = LinearRegression()
lin_reg.fit(housing_tr, housing_labels)

# sample data to test from training set
sample_data = housing_tr.iloc[:5]
sample_labels = housing_labels.iloc[:5]

# predict the median_house_value
predicted_data = lin_reg.predict(sample_data)
print("Predicted Price:", predicted_data)
print("Actual Price:", list(sample_labels))

Output:

Predicted Price: [135958.75805364 309735.008975   165232.3998617  138162.41971241
232903.1766333 ]
Actual Price: [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]

Let’s compute the prediction error for the training data.

Python
# pass the training data and identify the prediction error
predicted_data = lin_reg.predict(housing_tr)
lin_rmse = get_rmse(housing_labels, predicted_data)
print("root mean square error:", lin_rmse)

Output:

root mean square error: 84056.18763327331

The prediction error is not very satisfying, but it is better than nothing. Let’s look at other features that can help predict the housing price. You may have noticed the negative correlation with latitude; we can plot a scatter diagram of the housing price with respect to location for a better understanding.

Python
import matplotlib.pyplot as plt
# plot
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, 
             s=housing["population"]/100, label="population", figsize=(7,5), 
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,)
plt.legend()

Output:

Housing price: red is expensive, blue is cheap, larger circle indicates areas with a large population

The prices have a slight tendency to go down as we go north. The ocean_proximity attribute may be useful as well, although housing prices in the coastal districts are not uniformly high, so it is not a simple rule.

Let’s try a linear regression model to predict the median_house_value based on the remaining features except the ocean_proximity field (non-numeric). Before starting, we need to consider a few scenarios:

  • Since total_bedrooms has missing values, we need to deal with them. Scikit-Learn provides the SimpleImputer class to take care of missing values.
  • Our numeric attributes are on very different scales, so we need to apply feature scaling. For standardization, we can use the StandardScaler transformer from Scikit-Learn.

Dealing with missing values is part of the data cleaning process. Apart from handling missing values, data cleaning also includes removing unwanted observations, fixing structural errors and managing unwanted outliers.
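
For context, plain pandas offers the same basic options for handling missing values (a minimal sketch; total_bedrooms is the affected column in our dataset, and each call returns a new object rather than modifying housing in place):

Python
# option 1: drop the rows that are missing total_bedrooms
housing.dropna(subset=["total_bedrooms"])
# option 2: drop the whole attribute
housing.drop("total_bedrooms", axis=1)
# option 3: fill the missing values with the column median
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median)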

Here, we can compute the median value and fill in the missing values with it. First, you need to create a SimpleImputer instance and set the strategy to "median" so that missing values are replaced with the median.

Let’s apply it to our housing dataset. We need to drop the ocean_proximity feature, since the median can only be computed for numeric attributes.

Python
from sklearn.impute import SimpleImputer

# drop ocean_proximity column 
housing_num = housing.drop("ocean_proximity", axis=1)

# imputer instance
imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)
housing_imputer = imputer.transform(housing_num)
print("Imputer return data type:", type(housing_imputer))

# convert back to pandas dataframe
housing_new = pd.DataFrame(
    housing_imputer, columns=housing_num.columns, index=housing_num.index)
print("New Housing Dataset:")
print(housing_new.info())

Output:

Imputer return data type: <class 'numpy.ndarray'>
New Housing Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20640 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB

You can see that total_bedrooms now has no missing values. Here we applied the imputer to all the numeric attributes, so that it can also handle any missing values that appear in other attributes in the future.

StandardScaler:

Machine learning algorithms don’t perform well when the input numerical attributes have very different scales. For example, the total number of rooms ranges from about 6 to 39,320, but the median income only ranges from 0 to 15. Hence, feature scaling is an important transformation that we need to apply to our dataset.

StandardScaler from Scikit-Learn uses the standardization technique: it subtracts the mean value and then divides by the standard deviation, so the resulting distribution has zero mean and unit variance.
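
As a small illustration of standardization with made-up numbers (a minimal sketch), each value x is mapped to (x - mean) / std:

Python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy column: the mean is 5.0 and the standard deviation is ~2.24
sample = np.array([[2.0], [4.0], [6.0], [8.0]])
scaled = StandardScaler().fit_transform(sample)
# e.g. 2.0 becomes (2 - 5) / 2.24, which is roughly -1.34
print(scaled.ravel())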

We have explored two transformation techniques: SimpleImputer and StandardScaler. We now need to prepare our housing data using these transformations, provide the data to the linear regression model to identify the best fit, and finally predict the housing price.

Since we need to execute multiple transformation steps, it is important to understand the transformation pipeline.

Transformation Pipelines:

Scikit-Learn’s Pipeline class chains a sequence of transformers so that they are executed in order, with the output of each step passed to the next. Now it’s time to prepare our data based on the above steps and feed it to the linear regression model to predict the median_house_value.

Python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


def transformation_pipeline():
    # pipeline execution
    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('std_scaler', StandardScaler())
    ])
    return num_pipeline


# get housing data
housing = fetch_housing_data()
# set income category based on median_income
housing = set_income_category(housing)
# stratified sampling
train_set, test_set = get_strat_train_test_dataset(housing)
# separate labels and data from the training set
housing_tr, housing_labels = sprt_train_and_label_set(train_set)
# drop ocean_proximity 
housing_num = housing_tr.drop("ocean_proximity", axis=1)
# numeric pipeline execution
num_pipeline = transformation_pipeline()
housing_num_tr = num_pipeline.fit_transform(housing_num)

In the above code, we provided all the attributes for data preparation except the ocean_proximity attribute. Using the pipeline, we executed the SimpleImputer() and StandardScaler() transformers sequentially.
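
To confirm the pipeline did what we expect (a minimal sketch), each transformed column should now have roughly zero mean and unit variance:

Python
# housing_num_tr is the NumPy array returned by the numeric pipeline
print(housing_num_tr.shape)
print(housing_num_tr.mean(axis=0).round(2))  # approximately 0 for every column
print(housing_num_tr.std(axis=0).round(2))   # approximately 1 for every column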

The data for the ML model is ready. Let’s pass it to a linear regression model.

Python
from sklearn.linear_model import LinearRegression

# linear regression model to identify best fit
lin_reg = LinearRegression()
lin_reg.fit(housing_num_tr, housing_labels)
# pass the prepared data and predict the housing price
predicted_data = lin_reg.predict(housing_num_tr)
# pass the predicted data and identify the prediction error
lin_rmse = get_rmse(housing_labels, predicted_data)
print("Linear Regression Prediction error:", lin_rmse)

Output:

Linear Regression Prediction error: 69957.72079714121

In the above code, we passed our prepared data and the labels to the linear regression model for prediction. Based on the predicted data, we identified the prediction error using root mean square error.

The prediction error is now about 69,957, which is much better than before. So far, we haven’t included the ocean_proximity feature. Since most machine learning algorithms prefer to work with numeric attributes, we need to convert ocean_proximity to a numeric attribute.

Let’s deal with the ocean_proximity attribute. You may have noticed that ocean_proximity is not just free text: its values are repetitive, which means it is probably a categorical attribute. Let’s look at how many districts (rows) belong to each category.

Python
housing[["ocean_proximity"]].value_counts()

Output:

ocean_proximity
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: count, dtype: int64

So our aim is to convert category text attributes to numeric. Let’s make use of the Scikit-Learn OneHotEncoder class to convert categorical values into one-hot vectors.

OneHotEncoder:

A one-hot encoder creates one binary attribute per category. When the category is <1H OCEAN, the value is [1., 0., 0., 0., 0.]; for INLAND, the value is [0., 1., 0., 0., 0.]; and so on.

The output of the encoder is a SciPy sparse matrix, which is useful when we have a large number of categories: a sparse matrix only stores the locations of the non-zero elements. You can take a look at the code below:

Python
from sklearn.preprocessing import OneHotEncoder

# fetch ocean_proximity
housing_cat = housing[["ocean_proximity"]]
# one hot encoder
cat_encoder = OneHotEncoder()
housing_cat_encoder = cat_encoder.fit_transform(housing_cat)
# convert sparse matrix to array to display the result
result = housing_cat_encoder.toarray()
print("Data:")
print(result)
category_list = cat_encoder.categories_
print("Category List::", category_list)

Output:

Data:
[[0. 0. 0. 1. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 1. 0.]
...
[0. 1. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 1. 0. 0. 0.]]
Category List:: [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
dtype=object)]

The above code converts the ocean_proximity feature into binary attributes using OneHotEncoder.

So now the question is how to add this category conversion to our existing pipeline. We cannot simply append it, because the existing pipeline deals only with numeric attributes. The good news is that Scikit-Learn provides a ColumnTransformer class that can apply the numeric pipeline and the category conversion to the appropriate columns.

ColumnTransformer:

With the help of the ColumnTransformer class, the numeric columns will be transformed using num_pipeline, and the categorical columns will be transformed using the OneHotEncoder class. Let’s look at the code:

Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing_tr)


Attribute Combination:

Before getting to the complete code, let’s cross-check the attributes to see whether we need to omit any unrealistic features or combine any attributes for more insights. With the help of the describe() method, we can get a summary of numerical attributes.

Python
housing.describe()

Output:

    longitude    latitude    housing_median_age    total_rooms    total_bedrooms    population    households    median_income    median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

The mean row describes the average value, and the std row shows the standard deviation (which measures how dispersed the values are). The 25%, 50% and 75% rows show the corresponding percentiles; for example, 50% of the districts have a housing_median_age lower than 29. Let’s also analyse the histograms of the available attributes.

Python
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20, 15))
plt.show()

Output:

Histogram for numerical attribute

If we look at total_rooms, the data says that 25% of the districts have fewer than about 1447 rooms in total, which is not very useful information on its own. The data provides more insight if we know the number of rooms per household. Similarly, the number of bedrooms relative to the number of rooms, and the population per household, provide more insight. Let’s create these new attributes:

Python
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

Let’s create a custom transformer for the attribute combinations. Since the newly combined attributes are numeric, we can add this custom transformer to the existing numeric pipeline (num_pipeline). The code below performs the attribute combination using a custom transformer.

Python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# column indices of total_rooms, total_bedrooms, population and households
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
  def fit(self, X, y=None):
    # nothing to learn from the data, so just return self
    return self
  def transform(self, X, y=None):
    # compute the combined ratio attributes
    rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
    population_per_household = X[:, population_ix] / X[:, households_ix]
    bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
    # append them as new columns to the input array
    return np.c_[X, rooms_per_household, population_per_household,
                 bedrooms_per_room]

Here we use two base classes: TransformerMixin (which provides the fit_transform() method) and BaseEstimator (which provides the get_params() and set_params() methods that are useful for automatic hyperparameter tuning). In the transform() method, we append the combined attributes to the existing dataset.
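
As a quick standalone check (a minimal sketch that assumes the numeric DataFrame housing_num created earlier), the transformer can be applied directly to a NumPy array:

Python
# apply the custom transformer on its own; transform() expects a NumPy array
attr_adder = CombinedAttributesAdder()
housing_extra_attribs = attr_adder.transform(housing_num.values)
# three new columns are appended to the original numeric attributes
print(housing_num.shape, "->", housing_extra_attribs.shape)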

Finally, let’s look at the complete code with OneHotEncoder and attribute combination. The code is as follows:

Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


def transformation_pipeline():
    # pipeline execution
    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler())
    ])
    return num_pipeline

def complete_pipeline(num_pipeline, num_attribs, cat_attribs):
    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),])
    return full_pipeline


# numeric pipeline execution
num_pipeline = transformation_pipeline()
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
# pipeline execution of numeric and column attribute
full_pipeline = complete_pipeline(num_pipeline, num_attribs, cat_attribs)
housing_prepared = full_pipeline.fit_transform(housing_tr)
print(housing_prepared)

Output:

[[-0.94135046  1.34743822  0.02756357 ...  0.          0.          0.        ]
 [ 1.17178212 -1.19243966 -1.72201763 ...  0.          0.          1.        ]
 [ 0.26758118 -0.1259716   1.22045984 ...  0.          0.          0.        ]
 ...
 [-1.5707942   1.31001828  1.53856552 ...  0.          0.          0.        ]
 [-1.56080303  1.2492109  -1.1653327  ...  0.          0.          0.        ]
 [-1.28105026  2.02567448 -0.13148926 ...  0.          0.          0.        ]]

In the above code, we prepared our data, handling both the categorical attribute and the numeric attributes with the ColumnTransformer class, and we also combined a few more attributes into the existing dataset.
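
If you are curious about the shape of the prepared data (a minimal sketch), the column count should be the 8 numeric attributes plus the 3 combined attributes plus the 5 one-hot categories:

Python
# expect 8 numeric + 3 combined + 5 one-hot = 16 columns
print(housing_prepared.shape)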

Now let’s check the prediction error for the training dataset.

Python
# linear regression model to identify best fit
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
# pass the prepared data and predict the housing price
predicted_data = lin_reg.predict(housing_prepared)
# pass the predicted data and identify the prediction error
lin_rmse = get_rmse(housing_labels, predicted_data)
print("Linear Regression Prediction error:", lin_rmse)

Output

Linear Regression Prediction error: 68627.87390018745

The prediction error of $68,627 is not a great score, since most district housing prices range between $120,000 and $265,000. Our linear regression model has underfit the data. Let’s try other regression models.

2. Decision Tree Regression Model

A decision tree is a powerful model that can perform both classification and regression tasks and even multi-output tasks. A decision tree regressor is capable of identifying complex, non-linear relationships in the data. Let’s train a DecisionTreeRegressor model on the California housing dataset.

Python
from sklearn.tree import DecisionTreeRegressor

# Decision tree regressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
# predict the price
predicted_data = tree_reg.predict(housing_prepared)
# pass the predicted data and identify the prediction error
tree_rmse = get_rmse(predicted_data, housing_labels)
print("Decision tree prediction error", tree_rmse)

Output

Decision tree prediction error 0.0

In the above code, we used DecisionTreeRegressor from sklearn, trained it on the prepared data and predicted the housing prices. The predicted data is then used to compute the prediction error.

The output says there is no error at all. This is most likely because the model has badly overfit the data. We can use cross-validation for a better evaluation, but before that, let’s try another powerful model: the Random Forest Regressor.

3. Random Forest Regression Model

Decision trees are the fundamental components of random forests. A Random Forest Regressor works by training many decision trees on random subsets of the data and features, and then averaging their predictions.

Python
from sklearn.ensemble import RandomForestRegressor
# Random Forest Regressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
# predict the price
predicted_data = forest_reg.predict(housing_prepared)
# identify the prediction error
rf_rmse = get_rmse(housing_labels, predicted_data)
print("Random forest prediction error", rf_rmse)

Output

Random forest prediction error 18647.49680314922

In the above code, we used RandomForestRegressor from sklearn to predict the housing price. Let’s evaluate the models using cross-validation to get a more reliable estimate of the prediction error.

Evaluating Using Cross-Validation

As noted above, our decision tree regressor reported zero prediction error. Either the model is absolutely perfect, or it has badly overfit the data. Let’s use K-fold cross-validation to find out.

Python
from sklearn.model_selection import cross_val_score

def cross_validation(reg_model, housing_prepared, housing_labels):
    scores = cross_val_score(
      reg_model, housing_prepared,
      housing_labels,
      scoring="neg_mean_squared_error", cv=10)
    rmse_scores = np.sqrt(-scores)
    print("Scores:", rmse_scores)
    print("Mean:", rmse_scores.mean())
    print("StandardDeviation:", rmse_scores.std())

print("Decision Tree Regression Model Cross Validation")
cross_validation(tree_reg, housing_prepared, housing_labels)
print("Linear Regression Model Cross Validation")
cross_validation(lin_reg, housing_prepared, housing_labels)
print("Random Forest Regression Model Cross Validation")
cross_validation(forest_reg, housing_prepared, housing_labels)

Output:

-----Decision Tree Regression Model Cross Validation------
Scores: [71533.47891958 70166.21545877 68593.53658755 71619.11809769
69314.10628101 79025.77574459 71115.88174428 73372.74838203
68064.45351565 70566.00011375]
Mean: 71337.13148448928
StandardDeviation: 2961.791010100865
-----Linear Regression Model Cross Validation------
Scores: [71762.76364394 64114.99166359 67771.17124356 68635.19072082
66846.14089488 72528.03725385 73997.08050233 68802.33629334
66443.28836884 70139.79923956]
Mean: 69104.07998247063
StandardDeviation: 2880.3282098180634
-----Random Forest Regression Model Cross Validation------
Scores: [51422.13025992 49000.24363195 47048.3962238 51669.06395259
47679.55344003 52020.75772926 52488.54494522 49884.32230139
48273.7784342 53751.73962139]
Mean: 50323.85305397397
StandardDeviation: 2149.429249563004

In the above code, we computed the prediction error using cross-validation for the different regression models. Here, the training set is randomly split into 10 distinct subsets called folds; cross-validation then trains and evaluates each model 10 times, picking a different fold for evaluation each time and training on the other 9 folds.

Now the decision tree has a mean prediction error of about 71,337, whereas the linear regression score is about 69,104. The Random Forest Regressor looks the most promising, with a prediction error of about 50,323.

Fine-Tune the Models

We now have a promising model: the RandomForestRegressor, whose prediction error is lower than that of the other models. You can also save the different regression models using the Python pickle module or the joblib library, so that you can reload each model later and analyse it on new data.
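
For example, saving and reloading a trained model with joblib could look like this (a minimal sketch; the file name is arbitrary):

Python
import joblib

# persist the trained random forest to disk
joblib.dump(forest_reg, "forest_reg.pkl")
# ...later, load it back and use it for predictions
loaded_model = joblib.load("forest_reg.pkl")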

Now let’s fine-tune the RandomForestRegressor model. Here, we need to try different sets of hyperparameter values to find a good combination. Scikit-Learn provides GridSearchCV to search for the best combination.

GridSearchCV:

Let’s search for the best combination of parameters using GridSearchCV for the RandomForestRegressor model.

Python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# set combination of parameter values
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features':[2, 4, 6, 8]}, 
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features':[2, 3, 4]},
]
# Random Forest Regressor model
forest_reg = RandomForestRegressor()

# GridSearchCV for best combination
grid_search = GridSearchCV(
    forest_reg, param_grid, cv=5, 
    scoring='neg_mean_squared_error', 
    return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

# check evaluation score for each combination
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

Output:

63489.44678439431 {'max_features': 2, 'n_estimators': 3}
55914.06605689392 {'max_features': 2, 'n_estimators': 10}
52583.048916101296 {'max_features': 2, 'n_estimators': 30}
59608.064875188174 {'max_features': 4, 'n_estimators': 3}
53074.52102481039 {'max_features': 4, 'n_estimators': 10}
50537.70482111732 {'max_features': 4, 'n_estimators': 30}
59281.87297911482 {'max_features': 6, 'n_estimators': 3}
52192.80565540959 {'max_features': 6, 'n_estimators': 10}
49968.7758384899 {'max_features': 6, 'n_estimators': 30}
58471.73925661535 {'max_features': 8, 'n_estimators': 3}
51953.125658262914 {'max_features': 8, 'n_estimators': 10}
50230.79990492117 {'max_features': 8, 'n_estimators': 30}
62707.854380894714 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54159.58534875528 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59738.24656857167 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52598.8191076302 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
59127.10881498229 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51869.16885620829 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

In the above code, we used GridSearchCV from sklearn for fine-tuning. In param_grid, we have two dictionaries.

  • Scikit-Learn first evaluates all combinations of n_estimators and max_features from the first dictionary (3 * 4 = 12 combinations), and then evaluates the combinations from the second dictionary (2 * 3 = 6 combinations) with the bootstrap hyperparameter set to False.
  • Bootstrap is a method of sampling data points (with or without replacement).
  • So the grid search evaluates a total of 18 combinations (12 + 6) of RandomForestRegressor hyperparameters and trains each model 5 times (cv=5).

You can see the evaluation score for each combination. The RMSE of about 49,968 corresponds to the best estimator, with max_features set to 6 and n_estimators set to 30.

Python
# best estimator
print(grid_search.best_estimator_)

Output:

RandomForestRegressor(max_features=6, n_estimators=30)

So let’s consider the best estimator as our final model and evaluate it on the test set.

Python
final_model = grid_search.best_estimator_

housing_test = test_set.drop("median_house_value", axis=1)
housing_lbl_test = test_set["median_house_value"].copy()

housing_test_prepared = full_pipeline.transform(housing_test)

final_predictions = final_model.predict(housing_test_prepared)

final_rmse = get_rmse(final_predictions, housing_lbl_test)
print("Test set prediction error:", final_rmse)

Output:

Test set prediction error: 47491.062677250884

Here we took the best estimator as our final model, prepared the test data by passing it through the transformation pipeline (calling transform(), not fit_transform(), so the pipeline is not refit on the test data), and fed the prepared data to the final model for prediction. The resulting prediction error is about $47,491.

In this article, we used various regression models, including Linear Regression, Decision Tree Regression and Random Forest Regression, to predict median house prices in California. After fine-tuning with GridSearchCV, our final Random Forest model achieved a prediction error of about $47,491 on the test set.


