California House Price Prediction
The California housing dataset is a popular choice for practicing how to build machine learning models for regression tasks. We will follow these steps to predict house prices.
Step 1: Loading California House Price Dataset
The read_csv() method reads a CSV file into a DataFrame, and the info() method gives a quick description of the data: the columns, the total number of rows, each attribute's type, and the number of non-null values.
import pandas as pd

# load the dataset into a DataFrame
housing = pd.read_csv("https://media.w3wiki.org/wp-content/uploads/20240319120216/housing.csv")
# print columns, dtypes and non-null counts
housing.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
As we can see, there are 20640 instances in the dataset. The total_bedrooms attribute has only 20433 non-null values (207 districts are missing this value), and all attributes are numerical except ocean_proximity, which is a text field (dtype object). The median_house_value attribute is the housing price that our machine learning model needs to predict.
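The observations above (missing values and one non-numeric column) can also be checked programmatically. The following is a minimal sketch on a small hand-made DataFrame; the name toy and its rows are hypothetical stand-ins, not values taken from the dataset. The same calls should work on the housing DataFrame loaded earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real housing DataFrame
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# count the missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# list the columns that are not numeric
text_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(text_columns)

# count how many rows fall into each category of the text column
category_counts = toy["ocean_proximity"].value_counts()
print(category_counts)
```

On the full dataset, the count for total_bedrooms should match the 207 missing districts reported by info().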
Before getting to model training, let’s analyse how the attributes in the housing data correlate with the median house value (house price). We can easily find the standard correlation coefficient (Pearson's r, the default of the corr() method) between every pair of attributes. Since ocean_proximity is non-numeric, we drop it before calculating the correlation.
def find_correlation(housing_numeric):
    # compute the standard correlation coefficient
    corr_matrix = housing_numeric.corr()
    # return each attribute's correlation with the
    # median house value, strongest positive first
    return corr_matrix["median_house_value"].sort_values(
        ascending=False)

# drop ocean_proximity column
housing_numeric = housing.drop("ocean_proximity", axis=1)
# find correlation coefficient
cor_coef = find_correlation(housing_numeric)
print("Correlation Coefficient::", cor_coef)
Output:
Correlation Coefficient:: median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64
Here, median_income has the strongest positive correlation (about 0.69), so the median house value tends to go up when the median income increases. There is also a small negative correlation with latitude (about -0.14): the median house value has a slight tendency to go down as we go north.
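To make the coefficient less of a black box, the sketch below recomputes Pearson's r by hand for one attribute and compares it with the value returned by corr(). It also uses select_dtypes() as an alternative way to leave out the text column. The toy DataFrame and its values are hypothetical and only illustrate the calculation; they are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the values are illustrative only
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 2.1],
    "latitude": [37.9, 37.8, 36.5, 34.1, 33.9],
    "median_house_value": [452600.0, 358500.0, 341300.0, 226700.0, 140000.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "INLAND"],
})

# keep only the numeric columns instead of dropping the text column by name
numeric = toy.select_dtypes(include="number")

# correlation of every numeric attribute with the target
corr_with_target = numeric.corr()["median_house_value"].sort_values(
    ascending=False)
print(corr_with_target)

# Pearson's r by hand: covariance of the centred values divided by
# the product of their spreads
x = numeric["median_income"].to_numpy()
y = numeric["median_house_value"].to_numpy()
x_centred = x - x.mean()
y_centred = y - y.mean()
r_manual = (x_centred * y_centred).sum() / np.sqrt(
    (x_centred ** 2).sum() * (y_centred ** 2).sum())
print(r_manual)
```

The hand-computed value should agree with the median_income entry of corr_with_target up to floating-point error. Note that this coefficient only measures linear relationships, so a value near zero does not rule out a nonlinear one.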
Regression Models for California Housing Price Prediction
In this article, we will build a machine learning model that predicts the median housing price using the California housing price dataset from the StatLib repository. The dataset is based on the 1990 California census and has metrics such as population, median income, and median house value for each district. It is a supervised learning task (the training data is labeled) because each instance comes with an expected output (the median housing price). It is a univariate multiple regression task since we predict a single value based on multiple features.
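In code, "supervised" and "univariate multiple regression" translate into one label column predicted from several feature columns. The sketch below shows that split on a hypothetical three-row DataFrame; the names toy, X and y are assumptions made for illustration and the rows are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows using column names from the housing dataset
toy = pd.DataFrame({
    "median_income": [8.3, 5.6, 2.1],
    "housing_median_age": [41.0, 52.0, 17.0],
    "total_rooms": [880.0, 1467.0, 2300.0],
    "median_house_value": [452600.0, 341300.0, 140000.0],
})

# label: the single value we want to predict (univariate)
y = toy["median_house_value"]

# features: every other column (multiple regression)
X = toy.drop("median_house_value", axis=1)

print(X.shape)  # (number of instances, number of features)
print(y.shape)  # (number of instances,)
```

Every row of X is paired with exactly one entry of y, which is what makes the task supervised.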
Table of Contents
- California House Price Prediction
- Training Models for California Housing Price Forecasting
- 1. Linear Regression Model
- 2. Decision Tree Regression Model
- 3. Random Forest Regression Model
- Evaluating Using Cross-Validation
- Fine Tune The Models