Fine Tune The Models
We now have a promising model, RandomForestRegressor, whose prediction error is lower than that of the other models. You can also save each regression model with the Python pickle module or the joblib library, so that it can be reloaded later and analysed on future data.
Let’s fine-tune the RandomForestRegressor model. To do so, we need to try different sets of hyperparameter values to find a good combination. Scikit-Learn provides GridSearchCV, which evaluates every combination in a grid using cross-validation.
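As a minimal sketch of saving and reloading a model with pickle: the tiny synthetic dataset, the LinearRegression stand-in, and the file name lin_reg.pkl below are assumptions made for illustration, not part of the housing pipeline.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic data standing in for the prepared housing data (assumption).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

model = LinearRegression().fit(X, y)

# Save the fitted model to a temporary directory (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "lin_reg.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back, as you would in a later session.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = bool(np.allclose(model.predict(X), restored.predict(X)))
print(same_predictions)
```

The joblib library follows the same pattern with joblib.dump(model, path) and joblib.load(path). Only load pickle files from sources you trust, since unpickling can execute arbitrary code.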
GridSearchCV:
Let’s search for the best combination of parameters using GridSearchCV for the RandomForestRegressor model.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# set combination of parameter values
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# Random Forest Regressor model
forest_reg = RandomForestRegressor()

# GridSearchCV for best combination
grid_search = GridSearchCV(
    forest_reg, param_grid, cv=5,
    scoring='neg_mean_squared_error',
    return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

# check evaluation score for each combination
# (scores are negative MSE, so negate before taking the square root)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
Output:
63489.44678439431 {'max_features': 2, 'n_estimators': 3}
55914.06605689392 {'max_features': 2, 'n_estimators': 10}
52583.048916101296 {'max_features': 2, 'n_estimators': 30}
59608.064875188174 {'max_features': 4, 'n_estimators': 3}
53074.52102481039 {'max_features': 4, 'n_estimators': 10}
50537.70482111732 {'max_features': 4, 'n_estimators': 30}
59281.87297911482 {'max_features': 6, 'n_estimators': 3}
52192.80565540959 {'max_features': 6, 'n_estimators': 10}
49968.7758384899 {'max_features': 6, 'n_estimators': 30}
58471.73925661535 {'max_features': 8, 'n_estimators': 3}
51953.125658262914 {'max_features': 8, 'n_estimators': 10}
50230.79990492117 {'max_features': 8, 'n_estimators': 30}
62707.854380894714 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54159.58534875528 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59738.24656857167 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52598.8191076302 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
59127.10881498229 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51869.16885620829 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
In the above code, we used GridSearchCV from sklearn for fine tuning. In param_grid, we have two dictionaries.
- Scikit-Learn first evaluates the combinations in the first dictionary (3 * 4 = 12 combinations of n_estimators and max_features), and then the combinations in the second dictionary (2 * 3 = 6 combinations) with the bootstrap hyperparameter set to False.
- The bootstrap hyperparameter controls how the training data for each tree is chosen: when it is True (the default), each tree is trained on a sample drawn with replacement; when it is False, each tree is trained on the whole training set.
- So the grid search evaluates a total of 18 combinations (12 + 6) of RandomForestRegressor hyperparameters and trains each one 5 times (cv=5), for 90 rounds of training in total.
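The counting above can be checked with a short sketch using ParameterGrid from sklearn.model_selection, which expands a grid the same way GridSearchCV does. The variable names cv_folds, n_combinations, and total_fits are chosen here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the article.
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

cv_folds = 5  # matches cv=5 in the grid search

# ParameterGrid enumerates every combination in every dictionary.
n_combinations = len(ParameterGrid(param_grid))
total_fits = n_combinations * cv_folds

print(n_combinations, total_fits)
```

With the default refit=True, GridSearchCV also retrains the best combination once on the full training set after the search finishes.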
Each line of the output shows the evaluation score (RMSE) for one combination. The lowest RMSE, about 49968, is obtained with max_features at 6 and n_estimators at 30, so this combination is the best estimator.
# best estimator
print(grid_search.best_estimator_)
Output:
RandomForestRegressor(max_features=6, n_estimators=30)
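Besides best_estimator_, a fitted GridSearchCV also exposes best_params_ and best_score_. The sketch below shows how they might be read; it runs on synthetic data from make_regression and a reduced grid (both assumptions for illustration), so its numbers will not match the housing results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the housing data (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# A reduced grid to keep the example fast (assumption).
small_grid = {"n_estimators": [3, 10], "max_features": [2, 4]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    small_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

# best_params_ holds the winning combination as a dictionary.
best_params = search.best_params_

# best_score_ is the mean negative MSE, so negate it before the square root.
best_rmse = float(np.sqrt(-search.best_score_))

print(best_params, best_rmse)
```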
So let’s consider the best estimator as our final model and evaluate it on the test set.
final_model = grid_search.best_estimator_
housing_test = test_set.drop("median_house_value", axis=1)
housing_lbl_test = test_set["median_house_value"].copy()
housing_test_prepared = full_pipeline.transform(housing_test)
final_predictions = final_model.predict(housing_test_prepared)
final_rmse = get_rmse(final_predictions, housing_lbl_test)
print("Test set prediction error:", final_rmse)
Output:
Test set prediction error: 47491.062677250884
Here we take the best estimator as our final model and prepare the test data by passing it through our transformation pipeline. Note that we call transform(), not fit_transform(), so the pipeline is not refitted on the test set. The prepared data is then fed to the final model for prediction. The prediction error (RMSE) on the test set is about $47,491.
We used several regression models, including Linear Regression, Decision Tree Regression, and Random Forest Regression, to predict median house prices in California. After fine-tuning with GridSearchCV, our final Random Forest model achieved a prediction error of $47,491 on the test set.
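The get_rmse helper used above is not shown in this section. A minimal sketch of what such a helper might look like, assuming it takes predictions first and labels second as in the call above, is:

```python
import numpy as np


def get_rmse(predictions, labels):
    """Hypothetical helper: root mean squared error between two sequences."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Square the errors, average them, then take the square root.
    return float(np.sqrt(np.mean((predictions - labels) ** 2)))


# Every prediction is off by exactly 3, so the RMSE is 3.0.
example_rmse = get_rmse([103.0, 97.0, 103.0, 97.0], [100.0, 100.0, 100.0, 100.0])
print(example_rmse)  # 3.0
```

Because the errors are squared before averaging, RMSE is expressed in the same unit as the label (dollars here) and penalises large errors more than small ones.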
Regression Models for California Housing Price Prediction
In this article, we will build a machine-learning model that predicts the median housing price using the California housing price dataset from the StatLib repository. The dataset is based on the 1990 California census and contains metrics such as population, median income, and median house value for each district. It is a supervised learning task (trained on labeled data) because each instance has an expected output (median housing price). It is a univariate multiple regression task since we predict a single value based on multiple features.
Table of Contents
- California House Price Prediction
- Training Models for California Housing Price Forecasting
- 1. Linear Regression Model
- 2. Decision Tree Regression Model
- 3. Random Forest Regression Model
- Evaluating Using Cross-Validation
- Fine Tune The Models