Pair plots using Scatter matrix in Pandas
Checking for collinearity among attributes of a dataset, is one of the most important steps in data preprocessing. A good way to understand the correlation among the features, is to create scatter plots for each pair of attributes. Pandas has a function scatter_matrix(), for this purpose. scatter_matrix() can be used to easily generate a group of scatter plots between all pairs of numerical features. It creates a plot for each numerical feature against every other numerical feature and also a histogram for each of them.
Syntax : pandas.plotting.scatter_matrix(frame)
Parameters :
frame : the dataframe to be plotted.
The dataset contains prices and other statistics about the houses in the California district.
import pandas as pd # loading the dataset data = pd.read_csv( 'housing.csv' ) # inspecting the data data.info() |
RangeIndex: 20640 entries, 0 to 20639 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 longitude 20640 non-null float64 1 latitude 20640 non-null float64 2 housing_median_age 20640 non-null float64 3 total_rooms 20640 non-null float64 4 total_bedrooms 20433 non-null float64 5 population 20640 non-null float64 6 households 20640 non-null float64 7 median_income 20640 non-null float64 8 median_house_value 20640 non-null float64 9 ocean_proximity 20640 non-null object dtypes: float64(9), object(1) memory usage: 1.6+ MB
Creating the scatter plots
Let us select three numeric columns; median_house_value, housing_median_age and median_income, for plotting. Note that Pandas plots depend on Matplotlib, so it needs to be imported first.
import matplotlib.pyplot as plt from pandas.plotting import scatter_matrix # selecting three numerical features features = [ 'median_house_value' , 'housing_median_age' , 'median_income' ] # plotting the scatter matrix # with the features scatter_matrix(data[features]) plt.show() |
Each scatter plot in the matrix helps us understand the correlation between the corresponding pair of attributes. As we can see, median_income and median_house_value are quite strongly correlated. The main diagonal contains the histograms for each attribute.
Contact Us