House Prices: Advanced Regression Techniques (Kaggle)
In this article, we'll average a stacked ensemble with its base learners and a strong public kernel to rank in the top 10% in the Kaggle competition [House Prices: Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).
The competition challenges teams to predict the sale price of houses in Ames, Iowa, given 79 explanatory variables, each of which is described [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).
If you aren't familiar with ensemble methods, I recommend reading the [Kaggle Ensembling Guide](https://mlwave.com/kaggle-ensembling-guide/), [Scikit-learn's documentation](https://scikit-learn.org/stable/modules/ensemble.html) and the [Wikipedia page](https://en.wikipedia.org/wiki/Ensemble_learning).
The full code and data for this article are available here: https://github.com/jackdry/house-prices-advanced-regression-techniques-kaggle.
## Import packages
```python
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.ensemble import GradientBoostingRegressor
from mlxtend.regressor import StackingCVRegressor
from sklearn.preprocessing import RobustScaler
from scipy.stats import boxcox_normmax, zscore
from multiprocessing import cpu_count
from lightgbm import LGBMRegressor
from scipy.special import boxcox1p
import matplotlib.pyplot as plt
from sklearn.svm import SVR
import pandas as pd
import numpy as np
```
## Read data
```python
train = pd.read_csv("train.csv", index_col=0)
test = pd.read_csv("test.csv", index_col=0)
sample = pd.read_csv("sample_submission.csv")
```
```python
train.head()
```
```output
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
Id
1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500
2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500
3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500
4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000
```
```python
train.shape
```
```output
(1460, 80)
```
## Concatenate training and test features
We concatenate features so that we don't have to impute missing values, transform features, etc., separately for the training and test sets.
First, however, we remove houses from the training set with ground living area greater than 4,500 sq. ft, since, as shown by the plot below, these are outliers.
```python
fig, ax = plt.subplots(figsize=(10, 6))
ax.grid()
ax.scatter(train["GrLivArea"], train["SalePrice"], c="#3f72af", zorder=3, alpha=0.9)
ax.axvline(4500, c="#112d4e", ls="--", zorder=2)
ax.set_xlabel("Ground living area (sq. ft)", labelpad=10)
ax.set_ylabel("Sale price ($)", labelpad=10)
plt.show()
```
![Sale price against above-grade living area; the dashed line marks the 4,500 sq. ft cutoff](https://jackdry.s3.us-east-2.amazonaws.com/content/articles/house-prices-advanced-regression-techniques-kaggle/outliers.png)
```python
train = train[train["GrLivArea"] < 4500]
X = pd.concat([train.drop("SalePrice", axis=1), test])
```
## Transform target variable
The competition is scored on the root mean squared error between the logarithm of the predicted and the logarithm of the observed sale prices, and `SalePrice` is heavily right-skewed, so we model the log of the target rather than the raw price.
```python
y_train = np.log(train["SalePrice"])
```
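As a quick check (not part of the original pipeline), we can compare the skewness of the raw and log-transformed target; the exact values will shift slightly once outliers are dropped later on.
```python
# Skewness of the raw vs. log-transformed sale price: the log pulls in the long right tail.
print("SalePrice skew:     ", train["SalePrice"].skew())
print("log(SalePrice) skew:", y_train.skew())
```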
## Impute missing values
The plot below shows the number of missing values in columns with at least one missing value.
```python
nans = X.isna().sum().sort_values(ascending=False)
nans = nans[nans > 0]
fig, ax = plt.subplots(figsize=(10, 6))
ax.grid()
ax.bar(nans.index, nans.values, zorder=2, color="#3f72af")
ax.set_ylabel("No. of missing values", labelpad=10)
ax.set_xlim(-0.6, len(nans) - 0.4)
ax.xaxis.set_tick_params(rotation=90)
plt.show()
```
![Number of missing values in each column with at least one missing value](https://jackdry.s3.us-east-2.amazonaws.com/content/articles/house-prices-advanced-regression-techniques-kaggle/nans.png)
We impute the missing values in these columns using the code below.
```python
cols = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageCond", "GarageQual", "GarageFinish", "GarageType", "BsmtCond", "BsmtExposure", "BsmtQual", "BsmtFinType2", "BsmtFinType1"]
X[cols] = X[cols].fillna("None")
cols = ["GarageYrBlt", "MasVnrArea", "BsmtHalfBath", "BsmtFullBath", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "GarageCars"]
X[cols] = X[cols].fillna(0)
cols = ["MasVnrType", "MSZoning", "Utilities", "Exterior1st", "Exterior2nd", "SaleType", "Electrical", "KitchenQual", "Functional"]
X[cols] = X.groupby("Neighborhood")[cols].transform(lambda x: x.fillna(x.mode()[0]))
cols = ["GarageArea", "LotFrontage"]
X[cols] = X.groupby("Neighborhood")[cols].transform(lambda x: x.fillna(x.median()))
```
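A quick sanity check (a sketch, not part of the submission pipeline) that the imputation above covered every column with missing values:
```python
# Columns that still contain missing values after imputation (ideally none are printed).
remaining = X.isna().sum()
print(remaining[remaining > 0])
```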
## Engineer features
```python
X["TotalSF"] = X["GrLivArea"] + X["TotalBsmtSF"]
X["TotalPorchSF"] = X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
X["TotalBath"] = X["FullBath"] + X["BsmtFullBath"] + 0.5 * (X["BsmtHalfBath"] + X["HalfBath"])
```
## Categorize `MSSubClass` and `YrSold`
Looking at its description below, the levels of `MSSubClass` don't seem to have a natural ordering, so we'll represent it as a categorical, rather than numerical, feature.
```output
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
```
We'll also represent `YrSold` as a categorical feature, to allow for a more flexible relationship with `SalePrice` (especially important because of the [2008 financial crisis](https://en.wikipedia.org/wiki/Financial_crisis_of_2007%E2%80%932008)).
```python
cols = ["MSSubClass", "YrSold"]
X[cols] = X[cols].astype("category")
```
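This cast matters later: by default `pd.get_dummies` one-hot encodes only object and category columns, so without it `MSSubClass` and `YrSold` would pass through as plain numbers. A tiny illustration on made-up data:
```python
# get_dummies leaves plain integer columns untouched but expands category columns.
demo = pd.DataFrame({"MSSubClass": [20, 60, 20]})
print(pd.get_dummies(demo).columns.tolist())                     # ['MSSubClass']
print(pd.get_dummies(demo.astype("category")).columns.tolist())  # ['MSSubClass_20', 'MSSubClass_60']
```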
## Transform features
To better exploit any seasonality in `SalePrice`, we transform `MoSold` (which is a cyclical feature) using the method described [here](http://blog.davidkaleko.com/feature-engineering-cyclical-features.html).
```python
X["SinMoSold"] = np.sin(2 * np.pi * X["MoSold"] / 12)
X["CosMoSold"] = np.cos(2 * np.pi * X["MoSold"] / 12)
X = X.drop("MoSold", axis=1)
```
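To see why this encoding helps (a sketch, not part of the pipeline): in (sin, cos) space, December and January end up as close together as any other pair of adjacent months, whereas as raw month numbers they are 11 apart.
```python
# Distances between months in (sin, cos) space: December and January are neighbours.
months = np.arange(1, 13)
encoded = np.column_stack([np.sin(2 * np.pi * months / 12), np.cos(2 * np.pi * months / 12)])
print(np.linalg.norm(encoded[11] - encoded[0]))  # Dec vs. Jan: small (~0.52)
print(np.linalg.norm(encoded[5] - encoded[0]))   # Jun vs. Jan: large (~1.93)
```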
We then transform highly skewed features using [`boxcox1p`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.boxcox1p.html) and [`boxcox_normmax`](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.boxcox_normmax.html), and scale features using [`RobustScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html).
```python
skew = X.skew(numeric_only=True).abs()
cols = skew[skew > 1].index
for col in cols:
    X[col] = boxcox1p(X[col], boxcox_normmax(X[col] + 1))
```
```python
cols = X.select_dtypes(np.number).columns
X[cols] = RobustScaler().fit_transform(X[cols])
```
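If you want to confirm the transform had the intended effect, a rough check (a sketch; near-constant columns such as pool area may stay skewed regardless):
```python
# Skewness of the numeric columns after the Box-Cox transform (robust scaling doesn't change skew).
print(X[cols].skew().abs().sort_values(ascending=False).head())
```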
Lastly, we use [`pd.get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to convert all categorical variables into dummy variables.
```python
X = pd.get_dummies(X)
```
## Recover training and test features
```python
X_train = X.loc[train.index]
X_test = X.loc[test.index]
```
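Because all preprocessing was done on the concatenated frame, the training and test matrices end up with identical columns; a quick check (a sketch):
```python
# Both sets come from the same preprocessed frame, so their columns line up exactly.
assert (X_train.columns == X_test.columns).all()
print(X_train.shape, X_test.shape)
```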
## Remove outliers from training data
To remove outliers, we fit a linear model to the training data and drop examples whose standardized residual has an absolute value greater than 3.
```python
residuals = y_train - LinearRegression().fit(X_train, y_train).predict(X_train)
outliers = residuals[np.abs(zscore(residuals)) > 3].index
```
```python
outliers
```
```output
Int64Index([31, 89, 108, 432, 463, 496, 582, 589, 633, 689, 729, 775, 875, 969, 971, 1063, 1213, 1325, 1433, 1454], dtype='int64', name='Id')
```
```python
X_train = X_train.drop(outliers)
y_train = y_train.drop(outliers)
```
## Define `random_search`
We'll use a random search to optimize hyperparameters for each of our models. Since we don't have much data, we'll use 5-fold cross-validation (instead of a validation set) to score each iteration.
```python
kf = KFold(n_splits=5, random_state=0, shuffle=True)
rmse = lambda y, y_pred: np.sqrt(mean_squared_error(y, y_pred))
scorer = make_scorer(rmse, greater_is_better=False)
```
```python
def random_search(model, grid, n_iter=100):
    n_jobs = max(cpu_count() - 2, 1)
    search = RandomizedSearchCV(model, grid, n_iter=n_iter, scoring=scorer, n_jobs=n_jobs, cv=kf, random_state=0, verbose=True)
    return search.fit(X_train, y_train)
```
## Optimize stacked ensemble
The code below optimizes 5 base learners and then a stacked ensemble of them.
```python
ridge_search = random_search(Ridge(), {"alpha": np.logspace(-1, 2, 500)})
lasso_search = random_search(Lasso(), {"alpha": np.logspace(-5, -1, 500)})
svr_search = random_search(SVR(), {"C": np.arange(1, 100), "gamma": np.linspace(0.00001, 0.001, 50), "epsilon": np.linspace(0.01, 0.1, 50)})
lgbm_search = random_search(LGBMRegressor(n_estimators=2000, max_depth=3), {"colsample_bytree": np.linspace(0.2, 0.7, 6), "learning_rate": np.logspace(-3, -1, 100)})
gbm_search = random_search(GradientBoostingRegressor(n_estimators=2000, max_depth=3), {"max_features": np.linspace(0.2, 0.7, 6), "learning_rate": np.logspace(-3, -1, 100)})
```
```python
models = [search.best_estimator_ for search in [ridge_search, lasso_search, svr_search, lgbm_search, gbm_search]]
stack_search = random_search(StackingCVRegressor(models, Ridge(), cv=kf), {"meta_regressor__alpha": np.logspace(-3, -2, 500)}, n_iter=20)
models.append(stack_search.best_estimator_)
```
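To see whether stacking actually helped, we can compare the cross-validated RMSEs of the tuned models (a sketch; `best_score_` is negated because the scorer was built with `greater_is_better=False`):
```python
# Cross-validated RMSE of each tuned model (lower is better).
searches = {"ridge": ridge_search, "lasso": lasso_search, "svr": svr_search,
            "lgbm": lgbm_search, "gbm": gbm_search, "stack": stack_search}
for name, search in searches.items():
    print(f"{name:>6}: {-search.best_score_:.5f}")
```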
## Create submission
We store the predictions of the base learners and stacked ensemble in a list and add predictions from [paulorzp's kernel](https://www.kaggle.com/paulorzp/blend-linear-regressors) which, as of writing, ranks about 820th with a score of 0.11563.
```python
preds = [model.predict(X_test) for model in models]
preds.append(np.log(pd.read_csv("blend-linear-regressors.csv")["SalePrice"]))
```
We create our submission by averaging `preds`, giving weights of 0.1 to each of the base learners and 0.25 to the ensemble and paulorzp's submission.
```python
preds = np.average(preds, axis=0, weights=[0.1] * 5 + [0.25] * 2)
submission = pd.DataFrame({"Id": sample["Id"], "SalePrice": np.exp(preds)})
submission.to_csv("submission.csv", index=False)
```
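Before uploading, a few quick sanity checks on the file are cheap insurance (a sketch, not part of the original code):
```python
# Basic sanity checks: expected columns, one row per test house, and sensible prices.
assert list(submission.columns) == ["Id", "SalePrice"]
assert len(submission) == len(test)
assert submission["SalePrice"].notna().all() and (submission["SalePrice"] > 0).all()
```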
This submission achieves a score of 0.11488 which, as of writing, ranks about 550th, just inside the top 10%.