Jack Dry

# House Prices: Advanced Regression Techniques (Kaggle)


In this article, we'll average a stacked ensemble with its base learners and a strong public kernel to rank in the top 10% in the Kaggle competition [House Prices: Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). The competition challenges teams to predict the sale price of houses in Ames, Iowa, given 79 explanatory variables, each of which is described [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). If you aren't familiar with ensemble methods, I recommend reading the [Kaggle Ensembling Guide](https://mlwave.com/kaggle-ensembling-guide/), [Scikit-learn's documentation](https://scikit-learn.org/stable/modules/ensemble.html) and the [Wikipedia page](https://en.wikipedia.org/wiki/Ensemble_learning). The full code and data for this article are available here: https://github.com/jackdry/house-prices-advanced-regression-techniques-kaggle.

## Import packages

```python
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.ensemble import GradientBoostingRegressor
from mlxtend.regressor import StackingCVRegressor
from sklearn.preprocessing import RobustScaler
from scipy.stats import boxcox_normmax, zscore
from multiprocessing import cpu_count
from lightgbm import LGBMRegressor
from scipy.special import boxcox1p
import matplotlib.pyplot as plt
from sklearn.svm import SVR
import pandas as pd
import numpy as np
```

## Read data

```python
train = pd.read_csv("train.csv", index_col=0)
test = pd.read_csv("test.csv", index_col=0)
sample = pd.read_csv("sample_submission.csv")
```

```python
train.head()
```

```output
    MSSubClass MSZoning  LotFrontage  LotArea Street  ... SaleCondition  SalePrice
Id                                                    ...
1           60       RL         65.0     8450   Pave  ...        Normal     208500
2           20       RL         80.0     9600   Pave  ...        Normal     181500
3           60       RL         68.0    11250   Pave  ...        Normal     223500
4           70       RL         60.0     9550   Pave  ...       Abnorml     140000
5           60       RL         84.0    14260   Pave  ...        Normal     250000

[5 rows x 80 columns]
```
```python
train.shape
```

```output
(1460, 80)
```

## Concatenate training and test features

We concatenate features so that we don't have to impute missing values, transform features, etc., separately for the training and test sets. First, however, we remove houses from the training set with ground living area greater than 4,500 sq. ft, since, as shown by the plot below, these are outliers.

```python
fig, ax = plt.subplots(figsize=(10, 6))
ax.grid()
ax.scatter(train["GrLivArea"], train["SalePrice"], c="#3f72af", zorder=3, alpha=0.9)
ax.axvline(4500, c="#112d4e", ls="--", zorder=2)
ax.set_xlabel("Ground living area (sq. ft)", labelpad=10)
ax.set_ylabel("Sale price ($)", labelpad=10)
plt.show()
```

```output
<img src="https://jackdry.s3.us-east-2.amazonaws.com/content/articles/house-prices-advanced-regression-techniques-kaggle/outliers.png">
```

```python
train = train[train["GrLivArea"] < 4500]
X = pd.concat([train.drop("SalePrice", axis=1), test])
```

## Transform target variable

We model the log of `SalePrice` rather than the raw price: the competition is scored on the RMSE between the logarithms of the predicted and actual sale prices, and the raw prices are strongly right-skewed, so the log transform both matches the evaluation metric and gives our (mostly linear) models a better-behaved target.

```python
y_train = np.log(train["SalePrice"])
```
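If you want to see the effect of the transform, a quick check along these lines (not part of the modelling pipeline itself) compares the skew of the raw and log prices:

```python
# Compare the skewness of the raw and log-transformed target.
raw_skew = train["SalePrice"].skew()
log_skew = np.log(train["SalePrice"]).skew()
print(f"Skew of SalePrice: {raw_skew:.2f}, skew of log(SalePrice): {log_skew:.2f}")
```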
## Impute missing values

The plot below shows the number of missing values in columns with at least one missing value.

```python
nans = X.isna().sum().sort_values(ascending=False)
nans = nans[nans > 0]
fig, ax = plt.subplots(figsize=(10, 6))
ax.grid()
ax.bar(nans.index, nans.values, zorder=2, color="#3f72af")
ax.set_ylabel("No. of missing values", labelpad=10)
ax.set_xlim(-0.6, len(nans) - 0.4)
ax.xaxis.set_tick_params(rotation=90)
plt.show()
```

```output
<img src="https://jackdry.s3.us-east-2.amazonaws.com/content/articles/house-prices-advanced-regression-techniques-kaggle/nans.png">
```

We impute the missing values in these columns using the code below.

```python
cols = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageCond", "GarageQual",
        "GarageFinish", "GarageType", "BsmtCond", "BsmtExposure", "BsmtQual", "BsmtFinType2",
        "BsmtFinType1"]
X[cols] = X[cols].fillna("None")

cols = ["GarageYrBlt", "MasVnrArea", "BsmtHalfBath", "BsmtFullBath", "BsmtFinSF1", "BsmtFinSF2",
        "BsmtUnfSF", "TotalBsmtSF", "GarageCars"]
X[cols] = X[cols].fillna(0)

cols = ["MasVnrType", "MSZoning", "Utilities", "Exterior1st", "Exterior2nd", "SaleType",
        "Electrical", "KitchenQual", "Functional"]
X[cols] = X.groupby("Neighborhood")[cols].transform(lambda x: x.fillna(x.mode()[0]))

cols = ["GarageArea", "LotFrontage"]
X[cols] = X.groupby("Neighborhood")[cols].transform(lambda x: x.fillna(x.median()))
```

## Engineer features

```python
X["TotalSF"] = X["GrLivArea"] + X["TotalBsmtSF"]
X["TotalPorchSF"] = X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
X["TotalBath"] = X["FullBath"] + X["BsmtFullBath"] + 0.5 * (X["BsmtHalfBath"] + X["HalfBath"])
```

## Categorize `MSSubClass` and `YrSold`

Looking at its description below, the levels of `MSSubClass` don't seem to have a natural ordering, so we'll represent it as a categorical, rather than numerical, feature.

```output
MSSubClass: Identifies the type of dwelling involved in the sale.

        20  1-STORY 1946 & NEWER ALL STYLES
        30  1-STORY 1945 & OLDER
        40  1-STORY W/FINISHED ATTIC ALL AGES
        45  1-1/2 STORY - UNFINISHED ALL AGES
        50  1-1/2 STORY FINISHED ALL AGES
        60  2-STORY 1946 & NEWER
        70  2-STORY 1945 & OLDER
        75  2-1/2 STORY ALL AGES
        80  SPLIT OR MULTI-LEVEL
        85  SPLIT FOYER
        90  DUPLEX - ALL STYLES AND AGES
       120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150  1-1/2 STORY PUD - ALL AGES
       160  2-STORY PUD - 1946 & NEWER
       180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190  2 FAMILY CONVERSION - ALL STYLES AND AGES
```

We'll also represent `YrSold` as a categorical feature, to allow for a more flexible relationship with `SalePrice` (especially important because of the [2008 financial crisis](https://en.wikipedia.org/wiki/Financial_crisis_of_2007%E2%80%932008)).

```python
cols = ["MSSubClass", "YrSold"]
X[cols] = X[cols].astype("category")
```

## Transform features

To better exploit any seasonality in `SalePrice`, we transform `MoSold` (which is a cyclical feature) using the method described [here](http://blog.davidkaleko.com/feature-engineering-cyclical-features.html).

```python
X["SinMoSold"] = np.sin(2 * np.pi * X["MoSold"] / 12)
X["CosMoSold"] = np.cos(2 * np.pi * X["MoSold"] / 12)
X = X.drop("MoSold", axis=1)
```

We then transform highly skewed features using [`boxcox1p`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.boxcox1p.html) and [`boxcox_normmax`](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.boxcox_normmax.html), and scale features using [`RobustScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html).

```python
skew = X.skew(numeric_only=True).abs()
cols = skew[skew > 1].index
for col in cols:
    X[col] = boxcox1p(X[col], boxcox_normmax(X[col] + 1))
```

```python
cols = X.select_dtypes(np.number).columns
X[cols] = RobustScaler().fit_transform(X[cols])
```

Lastly, we use [`pd.get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to convert all categorical variables into dummy variables.

```python
X = pd.get_dummies(X)
```
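Before recovering the training and test sets, a quick check along these lines (optional, and not part of the pipeline above) confirms that the encoded feature matrix has no remaining missing values:

```python
# Optional sanity check: the encoded design matrix should be complete.
print("Shape after encoding:", X.shape)
print("Remaining missing values:", X.isna().sum().sum())
```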
## Recover training and test features

```python
X_train = X.loc[train.index]
X_test = X.loc[test.index]
```

## Remove outliers from training data

To remove outliers, we fit a linear model to the training data and remove examples whose residuals have an absolute z-score greater than 3.

```python
residuals = y_train - LinearRegression().fit(X_train, y_train).predict(X_train)
outliers = residuals[np.abs(zscore(residuals)) > 3].index
```

```python
outliers
```

```output
Int64Index([31, 89, 108, 432, 463, 496, 582, 589, 633, 689, 729, 775, 875, 969,
            971, 1063, 1213, 1325, 1433, 1454],
           dtype='int64', name='Id')
```

```python
X_train = X_train.drop(outliers)
y_train = y_train.drop(outliers)
```

## Define `random_search`

We'll use a random search to optimize hyperparameters for each of our models. Since we don't have much data, we'll use 5-fold cross-validation (instead of a held-out validation set) to score each iteration.

```python
kf = KFold(n_splits=5, random_state=0, shuffle=True)
rmse = lambda y, y_pred: np.sqrt(mean_squared_error(y, y_pred))
scorer = make_scorer(rmse, greater_is_better=False)
```

```python
def random_search(model, grid, n_iter=100):
    n_jobs = max(cpu_count() - 2, 1)
    search = RandomizedSearchCV(model, grid, n_iter=n_iter, scoring=scorer, n_jobs=n_jobs,
                                cv=kf, random_state=0, verbose=True)
    return search.fit(X_train, y_train)
```

## Optimize stacked ensemble

The code below optimizes 5 base learners and then a stacked ensemble of them.

```python
ridge_search = random_search(Ridge(), {"alpha": np.logspace(-1, 2, 500)})
lasso_search = random_search(Lasso(), {"alpha": np.logspace(-5, -1, 500)})
svr_search = random_search(SVR(), {"C": np.arange(1, 100),
                                   "gamma": np.linspace(0.00001, 0.001, 50),
                                   "epsilon": np.linspace(0.01, 0.1, 50)})
lgbm_search = random_search(LGBMRegressor(n_estimators=2000, max_depth=3),
                            {"colsample_bytree": np.linspace(0.2, 0.7, 6),
                             "learning_rate": np.logspace(-3, -1, 100)})
gbm_search = random_search(GradientBoostingRegressor(n_estimators=2000, max_depth=3),
                           {"max_features": np.linspace(0.2, 0.7, 6),
                            "learning_rate": np.logspace(-3, -1, 100)})
```

```python
models = [search.best_estimator_ for search in [ridge_search, lasso_search, svr_search,
                                                lgbm_search, gbm_search]]
stack_search = random_search(StackingCVRegressor(models, Ridge(), cv=kf),
                             {"meta_regressor__alpha": np.logspace(-3, -2, 500)},
                             n_iter=20)
models.append(stack_search.best_estimator_)
```
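If you want to compare the tuned models before blending, a quick helper along these lines (the `report` function here is just illustrative) prints each search's cross-validated RMSE. Remember that `best_score_` is negated because the scorer was built with `greater_is_better=False`.

```python
def report(search, name):
    # best_score_ is the negated RMSE, so flip the sign before printing.
    print(f"{name}: CV RMSE = {-search.best_score_:.5f}")

for name, search in [("Ridge", ridge_search), ("Lasso", lasso_search), ("SVR", svr_search),
                     ("LGBM", lgbm_search), ("GBM", gbm_search), ("Stack", stack_search)]:
    report(search, name)
```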
## Create submission

We store the predictions of the base learners and stacked ensemble in a list and add predictions from [paulorzp's kernel](https://www.kaggle.com/paulorzp/blend-linear-regressors), which, as of writing, ranks about 820th with a score of 0.11563.

```python
preds = [model.predict(X_test) for model in models]
preds.append(np.log(pd.read_csv("blend-linear-regressors.csv")["SalePrice"]))
```

We create our submission by averaging `preds`, giving a weight of 0.1 to each of the base learners and 0.25 to both the ensemble and paulorzp's submission.

```python
preds = np.average(preds, axis=0, weights=[0.1] * 5 + [0.25] * 2)
submission = pd.DataFrame({"Id": sample["Id"], "SalePrice": np.exp(preds)})
submission.to_csv("submission.csv", index=False)
```

The submission achieves a score of 0.11488 which, as of writing, ranks about 550th, just inside the top 10%.