California House Price Prediction

The California housing dataset is a popular choice for practicing regression with machine learning models. We will follow the steps below to predict house prices.

Step 1: Loading California House Price Dataset

The read_csv() function reads a CSV file into a DataFrame, and the info() method gives a quick description of the data: the columns, the total number of rows, each attribute's type, and the number of non-null values.

Python
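import pandas as pd
# Load the housing data from the hosted CSV file into a DataFrame
housing = pd.read_csv("https://media.w3wiki.org/wp-content/uploads/20240319120216/housing.csv")
# Print the column names, dtypes and non-null counts
housing.info()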

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object

As we can see, there are 20640 instances in the dataset. The total_bedrooms attribute has only 20433 non-null values, so 207 districts are missing this value. All attributes are numerical except the ocean_proximity field, which is a text field. The median_house_value attribute is the housing price that we need to predict using our machine learning model.
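The same observations can be checked programmatically instead of being read off the info() output. The following is a minimal sketch, assuming a small hand-made DataFrame named sample (hypothetical, with made-up values) in place of the full housing data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the housing DataFrame (values are made up).
sample = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Count the missing values in each column.
missing_counts = sample.isna().sum()
print(missing_counts)

# List the columns that are not numeric (text fields such as ocean_proximity).
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```

Run against the full housing DataFrame, the same isna().sum() call should report 207 for total_bedrooms, which matches 20640 - 20433 from the info() output above.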

Before getting to model training, let’s analyse how attributes in the housing data correlate with the median house value (house price). We can easily find the standard correlation coefficient using the corr() method. Since the ocean_proximity attribute field is non-numeric, we need to drop the field to calculate the correlation.

Python
def find_correlation(housing_numeric):
    # Compute the standard (Pearson) correlation coefficient
    # between every pair of numeric attributes
    corr_matrix = housing_numeric.corr()
    # Return each attribute's correlation with the median
    # house value, strongest positive correlation first
    return corr_matrix["median_house_value"].sort_values(
        ascending=False)

# Drop the non-numeric ocean_proximity column
housing_numeric = housing.drop("ocean_proximity", axis=1)
# Find the correlation coefficients
cor_coef = find_correlation(housing_numeric)
print("Correlation Coefficient::", cor_coef)

Output:

Correlation Coefficient:: median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64

  • Here, the median house value tends to go up when the median income increases (coefficient of about 0.69). In contrast, there is a small negative correlation with the latitude (about -0.14): the median house value has a slight tendency to go down as we go north.
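To make the meaning of these coefficients concrete, the sketch below computes the standard (Pearson) correlation coefficient by hand and compares it with the pandas result. The helper pearson_r and the sample values are hypothetical, introduced only for illustration, and the helper assumes there are no missing values:

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Deviations of each value from its column mean.
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# Hypothetical values, not taken from the real dataset.
sample = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 2.1],
    "median_house_value": [452600.0, 358500.0, 341300.0, 342200.0, 150000.0],
})

manual = pearson_r(sample["median_income"], sample["median_house_value"])
pandas_value = sample["median_income"].corr(sample["median_house_value"])
print(manual, pandas_value)
```

The two numbers should agree up to floating-point error. A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship. Note that corr() only measures linear correlation, so it can miss nonlinear relationships.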

Regression Models for California Housing Price Prediction

In this article, we will build a machine learning model that predicts the median housing price using the California housing price dataset from the StatLib repository. The dataset is based on the 1990 California census and includes metrics such as population, median income, and median house value for each district. It is a supervised learning task because each instance in the training data is labeled with an expected output (the median housing price). It is also a univariate multiple regression task, since we predict a single value from multiple features.
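In code, "a single value from multiple features" amounts to separating the label column from the feature columns. The following is a minimal sketch of that split, assuming a tiny hypothetical DataFrame named sample with made-up values; the variable names features and labels are illustrative as well:

```python
import pandas as pd

# Hypothetical miniature stand-in for the housing data (values are made up).
sample = pd.DataFrame({
    "housing_median_age": [41.0, 21.0, 52.0],
    "median_income": [8.3, 7.2, 5.6],
    "median_house_value": [452600.0, 358500.0, 341300.0],
})

# Multiple features: every column except the value we want to predict.
features = sample.drop("median_house_value", axis=1)
# Single target (label): the median house value of each district.
labels = sample["median_house_value"]

print(features.shape)
print(labels.shape)
```

The features table has one column per input attribute, while labels holds exactly one expected output per row, which is what makes the task both supervised and univariate.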

Table of Contents

  • California House Price Prediction
  • Training Models for California Housing Price Forecasting
    • 1. Linear Regression Model
    • 2. Decision Tree Regression Model
    • 3. Random Forest Regression Model
    • Evaluating Using Cross-Validation
  • Fine Tune The Models
