Fine Tune The Models

Now we have a promising model: the RandomForestRegressor, whose prediction error is lower than that of the other models. You can also save the different regression models using the Python pickle module or the joblib library, so that you can reload each model later and analyse it on future data.
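
For example, a trained model can be saved and reloaded with joblib in a couple of lines. Below is a minimal sketch; my_model and the file name my_model.pkl are placeholders for any fitted estimator:

Python
import joblib

# save a trained model to disk (my_model is a placeholder for any fitted estimator)
joblib.dump(my_model, "my_model.pkl")

# later, load it back to reuse or analyse it
my_model_loaded = joblib.load("my_model.pkl")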

Now let’s fine-tune the RandomForestRegressor model. To do so, we need to try different sets of hyperparameter values and find the combination that performs best. Scikit-Learn provides GridSearchCV to automate this search.

GridSearchCV:

Let’s search for the best combination of parameters using GridSearchCV for the RandomForestRegressor model.

Python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# grid of hyperparameter values to try
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# Random Forest Regressor model
forest_reg = RandomForestRegressor()

# grid search over all combinations with 5-fold cross-validation
grid_search = GridSearchCV(
    forest_reg, param_grid, cv=5,
    scoring='neg_mean_squared_error',
    return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

# print the RMSE for each combination of hyperparameters
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

Output:

63489.44678439431 {'max_features': 2, 'n_estimators': 3}
55914.06605689392 {'max_features': 2, 'n_estimators': 10}
52583.048916101296 {'max_features': 2, 'n_estimators': 30}
59608.064875188174 {'max_features': 4, 'n_estimators': 3}
53074.52102481039 {'max_features': 4, 'n_estimators': 10}
50537.70482111732 {'max_features': 4, 'n_estimators': 30}
59281.87297911482 {'max_features': 6, 'n_estimators': 3}
52192.80565540959 {'max_features': 6, 'n_estimators': 10}
49968.7758384899 {'max_features': 6, 'n_estimators': 30}
58471.73925661535 {'max_features': 8, 'n_estimators': 3}
51953.125658262914 {'max_features': 8, 'n_estimators': 10}
50230.79990492117 {'max_features': 8, 'n_estimators': 30}
62707.854380894714 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54159.58534875528 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59738.24656857167 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52598.8191076302 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
59127.10881498229 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51869.16885620829 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

In the above code, we used GridSearchCV from Scikit-Learn for fine-tuning. The param_grid contains two dictionaries.

  • Scikit-Learn first evaluates all 3 × 4 = 12 combinations of the n_estimators and max_features values in the first dictionary, and then the 2 × 3 = 6 combinations in the second dictionary, this time with the bootstrap hyperparameter set to False.
  • Bootstrapping is a method of sampling data points with replacement; setting bootstrap=False trains each tree on the full training set instead.
  • So the grid search evaluates a total of 18 combinations (12 + 6) of RandomForestRegressor hyperparameters and trains each model 5 times (cv=5), for 90 rounds of training in all. You can verify the combination count with ParameterGrid, as shown below.
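
As a quick check, ParameterGrid from Scikit-Learn expands the same param_grid and reports how many combinations it contains:

Python
from sklearn.model_selection import ParameterGrid

# count the hyperparameter combinations the grid search will evaluate
print(len(ParameterGrid(param_grid)))  # 18 = 12 from the first dict + 6 from the second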

You can notice the evaluation score for each combination. The lowest RMSE, about 49968, is obtained with max_features set to 6 and n_estimators set to 30, so that combination gives the best estimator.

Python
# best estimator
print(grid_search.best_estimator_)

Output:

RandomForestRegressor(max_features=6, n_estimators=30)
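
You can also read the winning combination and its cross-validation score directly from the fitted grid_search object:

Python
# best hyperparameter combination found by the grid search
print(grid_search.best_params_)

# best cross-validation score, converted from negative MSE back to RMSE
print(np.sqrt(-grid_search.best_score_))

This prints {'max_features': 6, 'n_estimators': 30} and the corresponding RMSE of about 49968, matching the results above.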

So let’s consider the best estimator as our final model and evaluate it on the test set.

Python
final_model = grid_search.best_estimator_

# separate the features and the labels in the test set
housing_test = test_set.drop("median_house_value", axis=1)
housing_lbl_test = test_set["median_house_value"].copy()

# transform the test data with the pipeline fitted on the training data
housing_test_prepared = full_pipeline.transform(housing_test)

# predict on the prepared test data and compute the RMSE
final_predictions = final_model.predict(housing_test_prepared)

final_rmse = get_rmse(final_predictions, housing_lbl_test)
print("Test set prediction error:", final_rmse)

Output:

Test set prediction error: 47491.062677250884

Here we took the best estimator as our final model, then prepared the test data by passing it through our transformation pipeline. Note that we call transform(), not fit_transform(), so the test data is transformed using statistics learned from the training set. Finally, the prepared data is fed to the final model for prediction, giving a prediction error of about $47,491.
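
If you did not carry over the get_rmse helper from the earlier section, an equivalent RMSE computation with Scikit-Learn looks like this (a minimal sketch):

Python
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE: square root of the mean squared error between labels and predictions
def get_rmse(predictions, labels):
    return np.sqrt(mean_squared_error(labels, predictions))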

We utilized various regression models, including Linear Regression, Decision Tree Regression, and Random Forest Regression, to predict median house prices in California. After fine-tuning with GridSearchCV, our final Random Forest model achieved a prediction error of about $47,491 on the test set.


