Linear Regression
This project will work on how to predict the prices of homes based on the properties of the house. I will determine which house affected the final sale price and how effectively we can predict the sale price.
This notebook uses the following pedagogical patterns:
Learning Objectives
By the end of this notebook, the reader should be able to perform Linear Regression techniques in Python. This includes:
- Importing and formatting data
- Training the LinearRegression model from the
sklearn.linear_model
library - Work with qualitative and quantitative data, and effectively deal with instances of categorical data
- Analyze and determine proper handling of redundant and/or inconsistent data features
- Create a heatmap visual with
matplot.lib
library
Read the Data
For this exercise we will use the Ames Housing dataset to delineate our training and testing data. The Ames dataset is a collection of different characteristics and observations on houses in Ames, Iowa that is commonly used to create and test algorithms for predicting the prices of homes in that area. It contains 82 columns, or features, of data that describe the residences, such as:
- Lot Area: Lot size in square feet
- Overall Qual: Rates the overall material and finish of the house
- Overall Cond: Rates the overall condition of the house
- Year Built: Original construction date
- Low Qual Fin SF: Low quality finished square feet (all floors)
- Full Bath: Full bathrooms above grade
- Fireplaces: Number of fireplaces
and so on.
We will import this dataset using the pandas
library, which is an open-source data analytics tool for Python that allows the use of dataframe
objects and clean file parsing. Libraries in general are simply collections of functions in Python that users can import instead of having to write out the code themselves. This is done with the following lines of code:
import pandas as pd # The 'as' renames the library to whatever follows, in this case 'pd' to simplify the syntax
data = pd.read_csv("data/AmesHousing.txt", delimiter = '\t')
Here the read_csv
function looks at the fie AmesHousing
and turns it into a dataframe data
, which we can now use in Python. The delimiter \t
is simply telling the function to distinguish any characteristics separated by a tab as a new value.
Next we will split the data so that the first 1,460 entries will be our training set, which we will use to build our linear regression model, while the remaining entries will represent the test set, which we will use to test its accuracy.
train = data[0:1460]
test = data[1460:]
target = 'SalePrice'
print(train.info())
Model the Data
Simple Regression
In this case, we will use a simple linear regression to evaluate the relationship between two variables: living area (“Gr Liv Area”) and price (“SalePrice”). Assuming a linear relationship, the equation for the two variables will of course look like this:
where $x$ is the independent variable (“Gr Liv Area”), $y$ is the dependent (“SalePrice”), $m$ is the slope, and $b$ is the intercept. Because these variables are data and thus cannot be changed, a linear regression looks to control $m$ and $b$ instead. Essentially, a linear regression simply changes these two terms to fit multiple lines onto the data, reviews the error between the line and the data points, and then returns the line that generates the least amount of error.
In Python, the function that performs the regression is linearRegression.fit()
, which can be found in the scikit learning library. Simply include the two variables as arguments and it will find the line that best fits for you. We can also use the mean_squared_error
function to get the total variance of the linear function, which can be found in the numpy
library.
import numpy as np
from sklearn.linear_model import LinearRegression # You can use this method to directly call for a function in a library
lr = LinearRegression()
lr.fit(train[['Gr Liv Area']], train['SalePrice'])
from sklearn.metrics import mean_squared_error
train_predictions = lr.predict(train[['Gr Liv Area']])
test_predictions = lr.predict(test[['Gr Liv Area']])
train_mse = mean_squared_error(train_predictions, train['SalePrice'])
test_mse = mean_squared_error(test_predictions, test['SalePrice'])
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
print(lr.coef_)
print(train_rmse)
print(test_rmse)
In this case, we use the lr.coef_()
to get the coefficient of the linear function which is 116.87. More than that, the standard error for the train data is 56034 and test data is 57088. Now, let’s make the result more visible by plotting.
The following is the linear regression line made from data in “train”:
import matplotlib.pyplot as plt
plt.scatter(train[['Gr Liv Area']], train[['SalePrice']], color='black')
plt.xlabel('Gr Liv Area in Train', fontsize = '18')
plt.ylabel('train_predictions' ,fontsize = '18')
trainPlot =plt.plot(train[['Gr Liv Area']], train_predictions, color='blue', linewidth=3)
trainPlot
Now let’s juxtapose the model with the test dataset to see if it can accurately predict the value:
import matplotlib.pyplot as plt
plt.scatter(test[['Gr Liv Area']], test[['SalePrice']], color='black')
plt.xlabel('Gr Liv Area in Test', fontsize = '18')
plt.ylabel('test_predictions' ,fontsize = '18')
testPlot = plt.plot(test[['Gr Liv Area']], test_predictions, color='blue', linewidth=3)
testPlot
Since the test data looks like it concentrates around the linear regression model, we can conclude that the model can predict the “Sale Price”.
Multiple Regression
In the real world, a multiple regression is a more useful technique since we need to evaluate more than one correlation in most cases. Now, we will still predict the SalePrice, but with one more variable - Overall Condition (Overall Cond). In this case the model will be a binary linear equation in the form of:
where $a_0$ stands for the intial value while both “Overall Cond” and “Gr Liv Area” is zero, $coef_{Cond}$ stands for the coefficient of Overall Cond, and $coef_{Area}$ stands for the coefficient of Gr Liv Area.
from sklearn.metrics import mean_squared_error
cols = ['Overall Cond', 'Gr Liv Area']
lr.fit(train[cols], train['SalePrice'])
train_predictions = lr.predict(train[cols])
test_predictions = lr.predict(test[cols])
train_rmse_2 = np.sqrt(mean_squared_error(train_predictions, train['SalePrice']))
test_rmse_2 = np.sqrt(mean_squared_error(test_predictions, test['SalePrice']))
print(lr.coef_)
print(lr.intercept_)
print(train_rmse_2)
print(test_rmse_2)
such that the linear model will be like:
However, it’s hard to make a geometric explanation since the model will be either surface or high-dimension which can’t be plotted.
Handling Data Types with Missing/Non-numeric Values
In the machine learning workflow, once we’ve selected the model we want to use, selecting the appropriate features for that model is the next important step. In the following code snippets, I will explore how to use correlation between features and the target column, correlation between features, and variance of features to select features.
I will specifically focus on selecting from feature columns that don’t have any missing values or don’t need to be transformed to be useful (e.g. columns like Year Built and Year Remod/Add).
numerical_train = train.select_dtypes(include=['int64', 'float'])
numerical_train = numerical_train.drop(['PID', 'Year Built', 'Year Remod/Add', 'Garage Yr Blt', 'Mo Sold', 'Yr Sold'], axis=1)
null_series = numerical_train.isnull().sum()
full_cols_series = null_series[null_series == 0]
print(full_cols_series)
Correlating Feature Columns with Target Columns
I will show the the correlation between feature columns and target columns(SalesPrice) by percentage.
train_subset = train[full_cols_series.index]
corrmat = train_subset.corr()
sorted_corrs = corrmat['SalePrice'].abs().sort_values()
print(sorted_corrs)
Correlation Matrix Heatmap
We now have a decent list of candidate features to use in our model, sorted by how strongly they’re correlated with the SalePrice column. For now, I will keep only the features that have a correlation of 0.3 or higher. This cutoff is a bit arbitrary and, in general, it’s a good idea to experiment with this cutoff. For example, you can train and test models using the columns selected using different cutoffs and see where your model stops improving.
The next thing we need to look for is for potential collinearity between some of these feature columns. Collinearity is when two feature columns are highly correlated and stand the risk of duplicating information. If we have two features that convey the same information using two different measures or metrics, we need to choose just one or predictive accuracy can suffer.
While we can check for collinearity between two columns using the correlation matrix, we run the risk of information overload. We can instead generate a correlation matrix heatmap using Seaborn to visually compare the correlations and look for problematic pairwise feature correlations. Because we’re looking for outlier values in the heatmap, this visual representation is easier.
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
strong_corrs = sorted_corrs[sorted_corrs > 0.3]
corrmat = train_subset[strong_corrs.index].corr()
ax = sns.heatmap(corrmat)
ax
Train and Test the Model
Based on the correlation matrix heatmap, we can tell that the following pairs of columns are strongly correlated:
- Gr Liv Area and TotRms AbvGrd
- Garage Area and Garage Cars
We will only use one of these pairs and remove any columns with missing values
final_corr_cols = strong_corrs.drop(['Garage Cars', 'TotRms AbvGrd'])
features = final_corr_cols.drop(['SalePrice']).index
target = 'SalePrice'
clean_test = test[final_corr_cols.index].dropna()
lr = LinearRegression()
lr.fit(train[features], train['SalePrice'])
train_predictions = lr.predict(train[features])
test_predictions = lr.predict(clean_test[features])
train_mse = mean_squared_error(train_predictions, train[target])
test_mse = mean_squared_error(test_predictions, clean_test[target])
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
print(train_rmse)
print(test_rmse)
Removing Low Variance Features
The last technique I will explore is removing features with low variance. When the values in a feature column have low variance, they don’t meaningfully contribute to the model’s predictive capability. On the extreme end, let’s imagine a column with a variance of 0. This would mean that all of the values in that column were exactly the same. This means that the column isn’t informative and isn’t going to help the model make better predictions.
To make apples to apples comparisions between columns, we need to standardize all of the columns to vary between 0 and 1. Then, we can set a cutoff value for variance and remove features that have less than that variance amount.
unit_train = train[features]/(train[features].max())
sorted_vars = unit_train.var().sort_values()
print(sorted_vars)
Final Model
Let’s set a cutoff variance of 0.015, remove the Open Porch SF feature, and train and test a model using the remaining features.
features = features.drop(['Open Porch SF'])
clean_test = test[final_corr_cols.index].dropna()
lr = LinearRegression()
lr.fit(train[features], train['SalePrice'])
train_predictions = lr.predict(train[features])
test_predictions = lr.predict(clean_test[features])
train_mse = mean_squared_error(train_predictions, train[target])
test_mse = mean_squared_error(test_predictions, clean_test[target])
train_rmse_2 = np.sqrt(train_mse)
test_rmse_2 = np.sqrt(test_mse)
print(lr.intercept_)
print(lr.coef_)
print(train_rmse_2)
print(test_rmse_2)
The final model will be a 7-dimension linear function which looks like:
Feature Transformation
To understand how linear regression works, I have stuck to using features from the training dataset that contained no missing values and were already in a convenient numeric representation. In this section, we’ll explore how to transform some of the the remaining features so we can use them in our model. Broadly, the process of processing and creating new features is known as feature engineering.
train = data[0:1460]
test = data[1460:]
train_null_counts = train.isnull().sum()
df_no_mv = train[train_null_counts[train_null_counts==0].index]
Categorical Features
You’ll notice that some of the columns in the data frame df_no_mv contain string values. To use these features in our model, we need to transform them into numerical representations:
text_cols = df_no_mv.select_dtypes(include=['object']).columns
for col in text_cols:
print(col+":", len(train[col].unique()))
for col in text_cols:
train[col] = train[col].astype('category')
train['Utilities'].cat.codes.value_counts()
Dummy Coding
When we convert a column to the categorical data type, pandas assigns a number from 0 to n-1 (where n is the number of unique values in a column) for each value. The drawback with this approach is that one of the assumptions of linear regression is violated here. Linear regression operates under the assumption that the features are linearly correlated with the target column. For a categorical feature, however, there’s no actual numerical meaning to the categorical codes that pandas assigned for that colum. An increase in the Utilities column from 1 to 2 has no correlation value with the target column, and the categorical codes are instead used for uniqueness and exclusivity (the category associated with 0 is different than the one associated with 1).
The common solution is to use a technique called dummy coding. Dummy coding takes a categorical variable and turns it into a set of dichotomous (0 or 1) variables. For example, if we were to look at feature of ‘House Style’ we would see that there are three participants, each with a nominal variable:
Style | Code |
---|---|
1 Floor | 1 |
1.5 Floors | 2 |
2 Floors | 3 |
can be made useful by setting 1 Floor as a baseline and then comparing the other categories to that baseline, like so:
Style | 1.5 Floors | 2 Floors |
---|---|---|
1.5 Floors | 1 | 0 |
2 Floors | 0 | 1 |
1 Floor | 0 | 0 |
The pandas function .get_dummies
will perform this conversion automatically:
dummy_cols = pd.DataFrame()
for col in text_cols:
col_dummies = pd.get_dummies(train[col])
train = pd.concat([train, col_dummies], axis=1)
del train[col]
train['years_until_remod'] = train['Year Remod/Add'] - train['Year Built']
Missing Values
Now I will focus on handling columns with missing values. When values are missing in a column, there are two main approaches we can take:
- Remove rows containing missing values for specific columns Pro: Rows containing missing values are removed, leaving only clean data for modeling Con: Entire observations from the training set are removed, which can reduce overall prediction accuracy
- Impute (or replace) missing values using a descriptive statistic from the column Pro: Missing values are replaced with potentially similar estimates, preserving the rest of the observation in the model. Con: Depending on the approach, we may be adding noisy data for the model to learn
Given that we only have 1460 training examples (with ~80 potentially useful features), we don’t want to remove any of these rows from the dataset. Let’s instead focus on imputation techniques.
We’ll focus on columns that contain at least 1 missing value but less than 365 missing values (or 25% of the number of rows in the training set). There’s no strict threshold, and many people instead use a 50% cutoff (if half the values in a column are missing, it’s automatically dropped). Having some domain knowledge can help with determining an acceptable cutoff value.
df_missing_values = train[train_null_counts[(train_null_counts>0) & (train_null_counts<584)].index]
print(df_missing_values.isnull().sum())
print(df_missing_values.dtypes)
Inputing Missing Values
It looks like about half of the columns in df_missing_values
are string columns (object data type), while about half are float64 columns. For numerical columns with missing values, a common strategy is to compute the mean, median, or mode of each column and replace all missing values in that column with that value. So we will take the float64 columns and fill all of the missing values with the mean using the function df_missing_values.mean()
:
float_cols = df_missing_values.select_dtypes(include=['float'])
float_cols = float_cols.fillna(df_missing_values.mean())
print(float_cols.isnull().sum())
And as you can see from the print statement, now all of the float64 columns are without missing values.
Conclusion
This Notebook talks about how to do linear regression in machine learning by analyzing the real example of the Ames Housing dataset. In this case, to do the linear regression not only means we need to figure out the correlation among all the variable, but also eliminate the variable with either insignificant influence or missing value.