I found some free time and thought I'd finally get some more practice at dimensionality reduction. With this goal in mind, I went onto Kaggle and found a competition (estimating house prices) which looked appropriate for practicing these skills. Throughout this post I walk through the steps I took, from cleaning and standardizing the data to finally performing PCA and fitting a simple linear regression on the five most influential principal components! Not the most accurate regression ever, but great practice, and surprisingly efficient given it reduces 81 variables to only 5.

In [ ]:
##-2018.05.09 // Andrew Trick

As mentioned in the summary above, I've been wanting to get some practice with PCA and other dimension-reduction techniques. I found a house price dataset on Kaggle which looks to be a great dataset to practice these techniques on. While not technically too large to model a regression with, 81 variables is a lot to take in, and I thought it would be interesting to finally explore feature reduction. Below is my step-by-step process from importing the data, cleaning and standardizing, and reducing, to finally fitting a model for Kaggle submission. I iterate over steps a few times, try things that eventually don't work, and even run a KMeans that is fairly pointless in the project aside from letting me explore sklearn some more.

In [1]:
import pandas as pd
import mca

from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix, r2_score, mean_squared_error

import matplotlib.pyplot as plt

First I'll import the data and inspect it a bit. It's already split into train and test sets, as is typical of Kaggle. Let's import all of this and then work primarily with the train set. I'll also grab the sample submission now, as I plan to upload to Kaggle and see if a regression using either clusters or principal components is effective.

In [2]:
data_dir = 'data/'
df_train = pd.read_csv(data_dir + 'train.csv')
df_test = pd.read_csv(data_dir + 'test.csv')
df_sample = pd.read_csv(data_dir + 'sample_submission.csv')

df_train.shape
Out[2]:
(1460, 81)
In [3]:
df_train.head()
Out[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [4]:
df_train.dtypes.head()
Out[4]:
Id               int64
MSSubClass       int64
MSZoning        object
LotFrontage    float64
LotArea          int64
dtype: object

I cut this output to the first 5 columns, but the dataset is roughly half continuous and half categorical variables, 81 features in total: a perfect candidate for dimensionality reduction! Further exploring:
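
For what it's worth, here's a quick way to check that split (same df_train, nothing fancy; the counts should match the 38 numeric columns seen in the describe output below):

#count columns by dtype: the object columns are the categoricals
print(df_train.dtypes.value_counts())
print("%d categorical columns" % df_train.select_dtypes(include=['object']).shape[1])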

In [5]:
df_train.describe()
Out[5]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1452.000000 1460.000000 ... 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 730.500000 56.897260 70.049958 10516.828082 6.099315 5.575342 1971.267808 1984.865753 103.685262 443.639726 ... 94.244521 46.660274 21.954110 3.409589 15.060959 2.758904 43.489041 6.321918 2007.815753 180921.195890
std 421.610009 42.300571 24.284752 9981.264932 1.382997 1.112799 30.202904 20.645407 181.066207 456.098091 ... 125.338794 66.256028 61.119149 29.317331 55.757415 40.177307 496.123024 2.703626 1.328095 79442.502883
min 1.000000 20.000000 21.000000 1300.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 34900.000000
25% 365.750000 20.000000 59.000000 7553.500000 5.000000 5.000000 1954.000000 1967.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 2007.000000 129975.000000
50% 730.500000 50.000000 69.000000 9478.500000 6.000000 5.000000 1973.000000 1994.000000 0.000000 383.500000 ... 0.000000 25.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000 163000.000000
75% 1095.250000 70.000000 80.000000 11601.500000 7.000000 6.000000 2000.000000 2004.000000 166.000000 712.250000 ... 168.000000 68.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000 214000.000000
max 1460.000000 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 5644.000000 ... 857.000000 547.000000 552.000000 508.000000 480.000000 738.000000 15500.000000 12.000000 2010.000000 755000.000000

8 rows × 38 columns

That's a lot to take in... enough description of the dataframe. Time to check for any missing values and correct them if necessary.

In [6]:
df_train.isnull().sum()
Out[6]:
Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
                 ... 
BedroomAbvGr        0
KitchenAbvGr        0
KitchenQual         0
TotRmsAbvGrd        0
Functional          0
Fireplaces          0
FireplaceQu       690
GarageType         81
GarageYrBlt        81
GarageFinish       81
GarageCars          0
GarageArea          0
GarageQual         81
GarageCond         81
PavedDrive          0
WoodDeckSF          0
OpenPorchSF         0
EnclosedPorch       0
3SsnPorch           0
ScreenPorch         0
PoolArea            0
PoolQC           1453
Fence            1179
MiscFeature      1406
MiscVal             0
MoSold              0
YrSold              0
SaleType            0
SaleCondition       0
SalePrice           0
Length: 81, dtype: int64

A few of these are almost entirely null. Let's completely drop any variables with over 1/3 of their values missing:
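
If I wanted to pick these out programmatically rather than by eye, something like the following should flag the same five columns (a sketch on the same df_train):

null_frac = df_train.isnull().mean()                   #fraction of missing rows per column
to_drop = null_frac[null_frac > 1.0/3].index.tolist()
print(to_drop)                                         #should list the five columns dropped below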

In [7]:
df_train = df_train.drop(["Alley", 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis = 1)
df_train.shape
Out[7]:
(1460, 76)
In [8]:
df_train_cont = pd.DataFrame()

for i in list(df_train.columns):  #iterate over a copy of the column list since columns get deleted below
    if df_train[i].dtype == "int64" or df_train[i].dtype == "float64":
        df_train_cont[i] = df_train[i]
        del df_train[i]
        
df_train_cont.head()
Out[8]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
0 1 60 65.0 8450 7 5 2003 2003 196.0 706 ... 0 61 0 0 0 0 0 2 2008 208500
1 2 20 80.0 9600 6 8 1976 1976 0.0 978 ... 298 0 0 0 0 0 0 5 2007 181500
2 3 60 68.0 11250 7 5 2001 2002 162.0 486 ... 0 42 0 0 0 0 0 9 2008 223500
3 4 70 60.0 9550 7 5 1915 1970 0.0 216 ... 0 35 272 0 0 0 0 2 2006 140000
4 5 60 84.0 14260 8 5 2000 2000 350.0 655 ... 192 84 0 0 0 0 0 12 2008 250000

5 rows × 38 columns

So we've split up the continuous and the categorical variables in the dataset. Time to fill in the remaining missing values in the continuous set with their respective column means:

In [9]:
imp = Imputer(missing_values='NaN', strategy='mean')

df_train_cont_imp = imp.fit_transform(df_train_cont)
df_train_cont_imp = pd.DataFrame(df_train_cont_imp, columns = df_train_cont.columns)

df_train_cont_imp.isnull().sum().any()
Out[9]:
False
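
(Side note: in newer scikit-learn releases Imputer has been removed; as far as I know the drop-in equivalent is SimpleImputer, along these lines:)

#newer scikit-learn equivalent of the Imputer call above
from sklearn.impute import SimpleImputer

imp_new = SimpleImputer(strategy='mean')  #missing_values defaults to np.nan
df_train_cont_imp_alt = pd.DataFrame(imp_new.fit_transform(df_train_cont),
                                     columns=df_train_cont.columns)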

Successfully filled in the null values. Let's split out the target (SalePrice) and perform some PCA!

In [10]:
y_train = df_train_cont_imp["SalePrice"]
del df_train_cont_imp["SalePrice"]
y_train.head()
Out[10]:
0    208500.0
1    181500.0
2    223500.0
3    140000.0
4    250000.0
Name: SalePrice, dtype: float64

Before we perform any dimensionality reduction, let's review where we're at!

df_train: categorical training variables

df_train_cont_imp: (no missing) continuous training variables

y_train: dependent variable for modeling

df_test: test data for Kaggle - NEEDS TO GO THROUGH THE PIPELINE

Objective from here:

-standardize continuous values

-PCA on continuous

-K-modes clustering on categorical!?

-Combine principal components and clusters of categorical to get X_train

-fit models

In [11]:
scaler = StandardScaler()


X_train = scaler.fit_transform(df_train_cont_imp)
X_train = pd.DataFrame(X_train, columns = df_train_cont_imp.columns)

X_train.head()
Out[11]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold
0 -1.730865 0.073375 -0.229372 -0.207142 0.651479 -0.517200 1.050994 0.878668 0.511418 0.575425 ... 0.351000 -0.752176 0.216503 -0.359325 -0.116339 -0.270208 -0.068692 -0.087688 -1.599111 0.138777
1 -1.728492 -0.872563 0.451936 -0.091886 -0.071836 2.179628 0.156734 -0.429577 -0.574410 1.171992 ... -0.060731 1.626195 -0.704483 -0.359325 -0.116339 -0.270208 -0.068692 -0.087688 -0.489110 -0.614439
2 -1.726120 0.073375 -0.093110 0.073480 0.651479 -0.517200 0.984752 0.830215 0.323060 0.092907 ... 0.631726 -0.752176 -0.070361 -0.359325 -0.116339 -0.270208 -0.068692 -0.087688 0.990891 0.138777
3 -1.723747 0.309859 -0.456474 -0.096897 0.651479 -0.517200 -1.863632 -0.720298 -0.574410 -0.499274 ... 0.790804 -0.752176 -0.176048 4.092524 -0.116339 -0.270208 -0.068692 -0.087688 -1.599111 -1.367655
4 -1.721374 0.073375 0.633618 0.375148 1.374795 -0.517200 0.951632 0.733308 1.364570 0.463568 ... 1.698485 0.780197 0.563760 -0.359325 -0.116339 -0.270208 -0.068692 -0.087688 2.100892 0.138777

5 rows × 37 columns

In [12]:
del X_train['Id']

X_train.shape
Out[12]:
(1460, 36)

Time to run PCA on this set. I'll first convert it to an array and then apply sklearn's PCA to the X_train data.

In [13]:
X_train = X_train.values

pca = PCA(n_components = 5)
X_train = pca.fit_transform(X_train)

explained_variance = pca.explained_variance_ratio_
print(explained_variance)

X_train_pca = X_train[:,:5]
X_train_pca = pd.DataFrame(X_train_pca, columns = ["pca1", "pca2", "pca3", "pca4", "pca5"])
[ 0.19812105  0.08900553  0.0714888   0.05614703  0.04091307]

About 46% of the variance explained by only 5 principal components. Not too bad... I think. Let's continue with this and try a k-means on the components just for fun!
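
(Side note before the k-means: here's a quick sketch of how many components it would take to explain more of the variance. It refits a fresh scaler and a full PCA so nothing fitted above gets disturbed.)

import numpy as np

#cumulative explained variance across all 36 components
scaled_all = StandardScaler().fit_transform(df_train_cont_imp.drop('Id', axis=1))
full_pca = PCA().fit(scaled_all)
cum_var = np.cumsum(full_pca.explained_variance_ratio_)
print("components needed for ~90%% of the variance: %d" % (np.argmax(cum_var >= 0.90) + 1))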

In [14]:
kmeans = KMeans(n_clusters = 3, random_state = 42)

X_train_clusters = kmeans.fit(X_train)
 
X_train_clusters.labels_
X_train_clusters.cluster_centers_
Out[14]:
array([[ 1.8471566 ,  1.67522714, -0.68752213, -0.74411962, -0.14086944],
       [-2.20431745, -0.19357077,  0.38318628, -0.00859105, -0.08315146],
       [ 2.21848876, -1.55823088,  0.03890235,  0.87915433,  0.32763546]])
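
I picked k=3 arbitrarily; if I cared about the clustering, a quick elbow check on the components would look something like this (a sketch, reusing the 5-component array):

#inertia for a few cluster counts, to eyeball an elbow
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42).fit(X_train)
    print("k=%d  inertia=%.1f" % (k, km.inertia_))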

Alright, on to the categorical data: first clean it of nulls and then dummy it out into indicator variables. As this is more a project to explore dimensionality reduction, I'll just let the label encoder fold the null values into the encoding rather than imputing them.

In [15]:
encoder = LabelEncoder()
hot_encoder = OneHotEncoder()

#initialize empty frames for storage
df_train_enc = pd.DataFrame()
df_train_hot_enc = pd.DataFrame()

#iterate over categorical cols and transform into int values
for i in df_train:
    df_train_enc[i] = encoder.fit_transform(df_train[i])
    
#encoder into binary dummies
df_train_hot_enc = hot_encoder.fit_transform(df_train_enc)
df_train_dummies = pd.DataFrame(df_train_hot_enc.toarray())

df_train_dummies.head()
Out[15]:
0 1 2 3 4 5 6 7 8 9 ... 744 745 746 747 748 749 750 751 752 753
0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

5 rows × 754 columns
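
(As an aside, pandas can do the label-encode plus one-hot dance in a single call; the column count won't match the above exactly since NaNs are handled differently, but it's a tidier route:)

#one-step dummy encoding straight from the categorical frame
df_train_dummies_alt = pd.get_dummies(df_train, dummy_na=True)
print(df_train_dummies_alt.shape)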

Okay... so we just did the exact opposite of dimensionality reduction; that's what dummy variables do. Let's try multiple correspondence analysis, the categorical PCA!

In [16]:
df_train_mca = mca.MCA(df_train_dummies, ncols = 5)
print(df_train_mca.L) #eigenvalues
[ 0.00895017  0.00755986  0.00481571  0.0029876   0.00267983  0.00247353
  0.00244682  0.00078794]

Well, not too valuable really; it's not explaining all that much of the variance. Let's go back to the first categorical frame and take a more traditional approach. First, view the unique counts per column:

In [17]:
for i in df_train:
    print(i)
    print(df_train[i].value_counts())
MSZoning
RL         1151
RM          218
FV           65
RH           16
C (all)      10
Name: MSZoning, dtype: int64
Street
Pave    1454
Grvl       6
Name: Street, dtype: int64
LotShape
Reg    925
IR1    484
IR2     41
IR3     10
Name: LotShape, dtype: int64
LandContour
Lvl    1311
Bnk      63
HLS      50
Low      36
Name: LandContour, dtype: int64
Utilities
AllPub    1459
NoSeWa       1
Name: Utilities, dtype: int64
LotConfig
Inside     1052
Corner      263
CulDSac      94
FR2          47
FR3           4
Name: LotConfig, dtype: int64
LandSlope
Gtl    1382
Mod      65
Sev      13
Name: LandSlope, dtype: int64
Neighborhood
NAmes      225
CollgCr    150
OldTown    113
Edwards    100
Somerst     86
Gilbert     79
NridgHt     77
Sawyer      74
NWAmes      73
SawyerW     59
BrkSide     58
Crawfor     51
Mitchel     49
NoRidge     41
Timber      38
IDOTRR      37
ClearCr     28
SWISU       25
StoneBr     25
MeadowV     17
Blmngtn     17
BrDale      16
Veenker     11
NPkVill      9
Blueste      2
Name: Neighborhood, dtype: int64
Condition1
Norm      1260
Feedr       81
Artery      48
RRAn        26
PosN        19
RRAe        11
PosA         8
RRNn         5
RRNe         2
Name: Condition1, dtype: int64
Condition2
Norm      1445
Feedr        6
Artery       2
RRNn         2
PosN         2
RRAn         1
RRAe         1
PosA         1
Name: Condition2, dtype: int64
BldgType
1Fam      1220
TwnhsE     114
Duplex      52
Twnhs       43
2fmCon      31
Name: BldgType, dtype: int64
HouseStyle
1Story    726
2Story    445
1.5Fin    154
SLvl       65
SFoyer     37
1.5Unf     14
2.5Unf     11
2.5Fin      8
Name: HouseStyle, dtype: int64
RoofStyle
Gable      1141
Hip         286
Flat         13
Gambrel      11
Mansard       7
Shed          2
Name: RoofStyle, dtype: int64
RoofMatl
CompShg    1434
Tar&Grv;      11
WdShngl       6
WdShake       5
Membran       1
Metal         1
ClyTile       1
Roll          1
Name: RoofMatl, dtype: int64
Exterior1st
VinylSd    515
HdBoard    222
MetalSd    220
Wd Sdng    206
Plywood    108
CemntBd     61
BrkFace     50
WdShing     26
Stucco      25
AsbShng     20
Stone        2
BrkComm      2
AsphShn      1
ImStucc      1
CBlock       1
Name: Exterior1st, dtype: int64
Exterior2nd
VinylSd    504
MetalSd    214
HdBoard    207
Wd Sdng    197
Plywood    142
CmentBd     60
Wd Shng     38
Stucco      26
BrkFace     25
AsbShng     20
ImStucc     10
Brk Cmn      7
Stone        5
AsphShn      3
Other        1
CBlock       1
Name: Exterior2nd, dtype: int64
MasVnrType
None       864
BrkFace    445
Stone      128
BrkCmn      15
Name: MasVnrType, dtype: int64
ExterQual
TA    906
Gd    488
Ex     52
Fa     14
Name: ExterQual, dtype: int64
ExterCond
TA    1282
Gd     146
Fa      28
Ex       3
Po       1
Name: ExterCond, dtype: int64
Foundation
PConc     647
CBlock    634
BrkTil    146
Slab       24
Stone       6
Wood        3
Name: Foundation, dtype: int64
BsmtQual
TA    649
Gd    618
Ex    121
Fa     35
Name: BsmtQual, dtype: int64
BsmtCond
TA    1311
Gd      65
Fa      45
Po       2
Name: BsmtCond, dtype: int64
BsmtExposure
No    953
Av    221
Gd    134
Mn    114
Name: BsmtExposure, dtype: int64
BsmtFinType1
Unf    430
GLQ    418
ALQ    220
BLQ    148
Rec    133
LwQ     74
Name: BsmtFinType1, dtype: int64
BsmtFinType2
Unf    1256
Rec      54
LwQ      46
BLQ      33
ALQ      19
GLQ      14
Name: BsmtFinType2, dtype: int64
Heating
GasA     1428
GasW       18
Grav        7
Wall        4
OthW        2
Floor       1
Name: Heating, dtype: int64
HeatingQC
Ex    741
TA    428
Gd    241
Fa     49
Po      1
Name: HeatingQC, dtype: int64
CentralAir
Y    1365
N      95
Name: CentralAir, dtype: int64
Electrical
SBrkr    1334
FuseA      94
FuseF      27
FuseP       3
Mix         1
Name: Electrical, dtype: int64
KitchenQual
TA    735
Gd    586
Ex    100
Fa     39
Name: KitchenQual, dtype: int64
Functional
Typ     1360
Min2      34
Min1      31
Mod       15
Maj1      14
Maj2       5
Sev        1
Name: Functional, dtype: int64
GarageType
Attchd     870
Detchd     387
BuiltIn     88
Basment     19
CarPort      9
2Types       6
Name: GarageType, dtype: int64
GarageFinish
Unf    605
RFn    422
Fin    352
Name: GarageFinish, dtype: int64
GarageQual
TA    1311
Fa      48
Gd      14
Ex       3
Po       3
Name: GarageQual, dtype: int64
GarageCond
TA    1326
Fa      35
Gd       9
Po       7
Ex       2
Name: GarageCond, dtype: int64
PavedDrive
Y    1340
N      90
P      30
Name: PavedDrive, dtype: int64
SaleType
WD       1267
New       122
COD        43
ConLD       9
ConLw       5
ConLI       5
CWD         4
Oth         3
Con         2
Name: SaleType, dtype: int64
SaleCondition
Normal     1198
Partial     125
Abnorml     101
Family       20
Alloca       12
AdjLand       4
Name: SaleCondition, dtype: int64

Several of these categories are overwhelmingly concentrated in one value. As such, they will not provide much useful information for forecasting. I'm going to set a threshold of 75% within one value, i.e. if one value in a category holds >= 75% of the total count, the category gets removed from the dataset. This accounts for:

MSZoning, Street, LandContour, Utilities, LandSlope, Condition1, Condition2, BldgType, RoofStyle, RoofMatl, ExterCond, BsmtQual, BsmtFinType2, Heating, CentralAir, Electrical, Functional, GarageQual, GarageCond, PavedDrive, SaleType, SaleCondition

While many of these would be expected to be useful in predicting price (central air, paved driveway, sale condition), I'm still removing them at this threshold to try to reduce dimensionality. I'll eventually compare to a full-on regression with all variables and see what the difference amounts to.
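
For reference, here's a quick sketch of how that threshold could be computed programmatically (it won't exactly reproduce my hand-picked list, since value_counts ignores NaNs, but it's close):

#flag categorical columns dominated by a single value
threshold = 0.75
dominated = [col for col in df_train
             if df_train[col].value_counts(normalize=True).iloc[0] >= threshold]
print(dominated)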

Let's remove them:

In [18]:
df_train = df_train.drop(["MSZoning", "Street", "LandContour", "Utilities", "LandSlope", "Condition1", "Condition2", "BldgType", 
                          "RoofStyle", "RoofMatl", "ExterCond", "BsmtQual", "BsmtFinType2", "Heating", "CentralAir", "Electrical", 
                          "Functional", "GarageQual", "GarageCond", "PavedDrive", "SaleType", "SaleCondition"], axis=1)
df_train.head()
Out[18]:
LotShape LotConfig Neighborhood HouseStyle Exterior1st Exterior2nd MasVnrType ExterQual Foundation BsmtCond BsmtExposure BsmtFinType1 HeatingQC KitchenQual GarageType GarageFinish
0 Reg Inside CollgCr 2Story VinylSd VinylSd BrkFace Gd PConc TA No GLQ Ex Gd Attchd RFn
1 Reg FR2 Veenker 1Story MetalSd MetalSd None TA CBlock TA Gd ALQ Ex TA Attchd RFn
2 IR1 Inside CollgCr 2Story VinylSd VinylSd BrkFace Gd PConc TA Mn GLQ Ex Gd Attchd RFn
3 IR1 Corner Crawfor 2Story Wd Sdng Wd Shng None TA BrkTil Gd No ALQ Gd Gd Detchd Unf
4 IR1 FR2 NoRidge 2Story VinylSd VinylSd BrkFace Gd PConc TA Av GLQ Ex Gd Attchd RFn

Let's encode these and get to model fitting. (This is just a rerun of the earlier encoding, but on the smaller dataset.)

In [19]:
#initialize empty frames for storage
df_train_enc = pd.DataFrame()
df_train_hot_enc = pd.DataFrame()

#iterate over categorical cols and transform into int values
for i in df_train:
    df_train_enc[i] = encoder.fit_transform(df_train[i])
    
#encoder into binary dummies
df_train_hot_enc = hot_encoder.fit_transform(df_train_enc)
df_train_dummies = pd.DataFrame(df_train_hot_enc.toarray())

df_train_dummies.head()
Out[19]:
0 1 2 3 4 5 6 7 8 9 ... 391 392 393 394 395 396 397 398 399 400
0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
3 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

5 rows × 401 columns

Alright, let's combine the two train sets (continuous and categorical). We'll split the Kaggle train set into a train and test subset so we can check accuracy as well. After that, we'll fit a regression and see the results:

In [20]:
X_train = X_train_pca
X_train.head()

for i in df_train_dummies:
    X_train[i] =df_train_dummies[i]
    
X_train.shape
Out[20]:
(1460, 406)
In [21]:
X_train_save = X_train #needed for later
y_train_save = y_train #needed for later

X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.33, random_state=42)
print("okay, this is getting messy and difficult to track in jupyter.. lets model and gtfo")
okay, this is getting messy and difficult to track in jupyter.. lets model and gtfo
In [22]:
#linear
clf = LinearRegression().fit(X_train, y_train)

y_pred = clf.predict(X_test)

# The mean squared error
print("Mean squared error: %.2f") % mean_squared_error(y_test, y_pred)     

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))
Mean squared error: 1063481822160301124330534731776.00
Variance score: -144861071958222913536.00

Well, something is off here. Let's print the predictions against the true values:

In [23]:
print("True       Predicted")
for i in range(0, len(y_pred)):
    print("%.2f        %.2f") %(y_test.iloc[i], y_pred[i])
True       Predicted
154500.00        150211.00
325000.00        301361.00
115000.00        108992.00
159000.00        173719.00
315500.00        312825.00
75500.00        138320126160922.00
311500.00        253260.00
146000.00        148172.00
84500.00        138320126158674.00
135500.00        131382.00
145000.00        129060.00
130000.00        112941.00
81000.00        124811.00
214000.00        234873.00
181000.00        184833.00
134500.00        117976.00
183500.00        193918.00
135000.00        123773.00
118400.00        137025.00
226000.00        208449.00
155000.00        160173.00
210000.00        199413.00
173500.00        177851.00
129000.00        120808.00
192000.00        203707.00
153900.00        162917.00
181134.00        202883.00
141000.00        120239.00
181000.00        181342.00
208900.00        191221.00
127000.00        125141.00
284000.00        288563.00
200500.00        138320126372920.00
135750.00        96995.00
255000.00        256505.00
140000.00        134191.00
138000.00        120298.00
219500.00        213180.00
310000.00        307259.00
97000.00        -774136251902434.00
114500.00        142973.00
205000.00        222909.00
119500.00        95397.00
253293.00        355635.00
128500.00        126239.00
117500.00        143418.00
115000.00        106424.00
127000.00        119092.00
451950.00        412803.00
144000.00        132118.00
119000.00        112934.00
196000.00        209345.00
115000.00        112250.00
287000.00        257300.00
144500.00        193217.00
260000.00        224363.00
213000.00        194860.00
175000.00        202897.00
107000.00        121559.00
107500.00        1725825612082280.50
68500.00        78946.00
154000.00        176947.00
317000.00        306067.00
264132.00        264680.00
283463.00        275617.00
243000.00        185570.00
109000.00        87792.00
305000.00        310483.00
93500.00        138320126241910.00
176000.00        183413.00
118858.00        1725825504232654.50
134000.00        128268.00
109008.00        102973.00
93500.00        -774136254329829.00
611657.00        423795.00
173000.00        191545.00
348000.00        301939.00
341000.00        323617.00
141000.00        157451.00
124900.00        122659.00
118000.00        113566.00
67000.00        -1554537323787872.00
113000.00        114221.00
91300.00        96145.00
149500.00        151220.00
133000.00        138577.00
266000.00        243967.00
190000.00        228239.00
155900.00        159075.00
155835.00        187658.00
153500.00        144105.00
152000.00        168074.00
124500.00        142124.00
301000.00        257600.00
136500.00        138320126227596.00
169990.00        201483.00
205000.00        209734.00
183900.00        161974.00
204900.00        192513.00
260000.00        263733.00
163500.00        113063.00
224900.00        203469.00
244000.00        246738.00
132000.00        148597.00
194000.00        206130.00
156500.00        198485.00
156000.00        136923.00
275000.00        296543.00
145000.00        165230.00
135000.00        210297.00
60000.00        138320126187433.00
124000.00        138320126239118.00
127000.00        136659.00
137500.00        141155.00
213500.00        201041.00
119000.00        127766.00
107900.00        99353.00
123000.00        117569.00
112000.00        1725825618631814.50
284000.00        283433.00
133000.00        118655.00
149000.00        163041.00
169000.00        179597.00
207000.00        185300.00
175000.00        195858.00
137000.00        127718.00
236000.00        234605.00
79500.00        105241.00
144000.00        166678.00
162900.00        195239.00
185900.00        190545.00
369900.00        296073.00
197900.00        187585.00
104000.00        138320126283852.00
35311.00        74353.00
337500.00        370623.00
367294.00        350601.00
130250.00        150079.00
230000.00        229071.00
755000.00        485857.00
403000.00        364357.00
132000.00        117788.00
178000.00        181370.00
136500.00        157698.00
145000.00        156978.00
123000.00        118015.00
250000.00        259562.00
187100.00        183581.00
133900.00        135873.00
67000.00        138320126169044.00
137500.00        142958.00
155000.00        148617.00
200624.00        -8653893472972565.00
154300.00        145019.00
91000.00        92689.00
136000.00        138320126228596.00
108959.00        92647.00
140000.00        128766.00
86000.00        138320126201490.00
131400.00        128927.00
179900.00        188881.00
144000.00        84173.00
293077.00        279677.00
144500.00        156131.00
118500.00        104763.00
141000.00        138320126276588.00
239000.00        244213.00
276000.00        295022.00
556581.00        401247.00
244400.00        189966.00
360000.00        376705.00
103200.00        138320126203828.00
102000.00        138320126226476.00
151000.00        142324.00
285000.00        306595.00
134432.00        -774136251876770.00
113000.00        112747.00
187500.00        218847.00
125500.00        125485.00
177500.00        185998.00
179900.00        224773.00
55993.00        112097.00
132500.00        122842.00
135000.00        128932.00
255000.00        237135.00
140000.00        116236.00
271000.00        252505.00
246578.00        204611.00
202500.00        226097.00
75000.00        67890.00
122500.00        123907.00
108480.00        99525.00
160000.00        179893.00
171000.00        157812.00
196000.00        228883.00
225000.00        180511.00
197000.00        201043.00
40000.00        109905.00
172500.00        208755.00
154900.00        147669.00
280000.00        228550.00
175000.00        212461.00
147000.00        107849.00
315000.00        302866.00
185000.00        199013.00
135500.00        121819.00
239500.00        222069.00
139000.00        137251.00
140000.00        116580.00
110000.00        98251.00
225000.00        211325.00
143500.00        153057.00
128950.00        138320126227238.00
172500.00        178613.00
241500.00        220326.00
262500.00        245885.00
194201.00        214354.00
143000.00        122727.00
130000.00        127966.00
126000.00        -774136211604861.00
142500.00        131248.00
254000.00        243943.00
217500.00        225380.00
66500.00        107989.00
201000.00        231147.00
155000.00        136984.00
68400.00        92379.00
64500.00        112649.00
173000.00        175250.00
102776.00        105775.00
84900.00        -774136243291964.00
165600.00        173911.00
120000.00        120584.00
135000.00        107419.00
220000.00        224611.00
153575.00        145213.00
195400.00        191925.00
147000.00        151899.00
277000.00        235819.00
143000.00        129173.00
105900.00        117508.00
242000.00        233884.00
194500.00        217067.00
438780.00        414802.00
185000.00        185123.00
107500.00        138320126235972.00
165000.00        161152.00
176000.00        168580.00
129900.00        135860.00
115000.00        99227.00
192140.00        157348.00
160000.00        152723.00
145000.00        130093.00
86000.00        114795.00
158000.00        150398.00
127500.00        150855.00
115000.00        121843.00
119500.00        124727.00
175900.00        186719.00
240000.00        230065.00
395000.00        292513.00
165000.00        190207.00
128200.00        126002.00
275000.00        241374.00
311872.00        333837.00
214000.00        264196.00
153500.00        160347.00
144000.00        162383.00
115000.00        109338.00
180000.00        194829.00
465000.00        -8652679814782547.00
180000.00        234497.00
253000.00        236517.00
85000.00        78592.00
101800.00        1725825612516762.00
148500.00        126353.00
137500.00        139437.00
318061.00        305143.00
143000.00        243483.00
140000.00        130494.00
192500.00        5849406439234922.00
92000.00        138320126211547.00
197000.00        191633.00
109500.00        113817.00
297000.00        276491.00
185750.00        191003.00
230000.00        199221.00
89471.00        123403.00
260000.00        288824.00
189000.00        211275.00
108000.00        98723.00
124500.00        118627.00
145000.00        120283.00
178000.00        166875.00
85000.00        138320126207371.00
175000.00        148939.00
127000.00        156835.00
149900.00        150848.00
174000.00        204456.00
125500.00        111787.00
175500.00        192644.00
225000.00        232678.00
129000.00        124576.00
159950.00        143923.00
157000.00        166119.00
205000.00        194794.00
140000.00        148306.00
200000.00        211151.00
217000.00        247819.00
125000.00        122265.00
159500.00        116462.00
184000.00        170168.00
96500.00        126309.00
200000.00        213463.00
149000.00        131097.00
178900.00        190158.00
184900.00        -8653581893279559.00
164700.00        162167.00
335000.00        332750.00
87500.00        1725825641496850.00
233170.00        243361.00
149350.00        148185.00
133000.00        115034.00
98000.00        138320126189518.00
207500.00        194142.00
150000.00        160024.00
134900.00        139607.00
187750.00        200477.00
149500.00        152687.00
116050.00        92949.00
153337.00        157879.00
165000.00        143732.00
130500.00        159244.00
206300.00        199895.00
146000.00        159029.00
126500.00        107896.00
139000.00        184977.00
113000.00        138320126168551.00
79000.00        138320126185880.00
182000.00        -8652809840427027.00
188000.00        189026.00
129000.00        134809.00
160000.00        138320126237881.00
174000.00        179009.00
219210.00        231392.00
310000.00        336157.00
374000.00        348145.00
100000.00        117929.00
250000.00        219850.00
145000.00        126189.00
325000.00        279416.00
380000.00        377130.00
275000.00        283691.00
180000.00        179456.00
245350.00        215241.00
151500.00        140879.00
134500.00        107990.00
94000.00        138320126207187.00
216000.00        210017.00
350000.00        344335.00
195000.00        190204.00
120000.00        126433.00
228500.00        218045.00
248000.00        256537.00
124000.00        125469.00
191000.00        180685.00
181000.00        193363.00
105000.00        101319.00
139900.00        121797.00
157900.00        139285.00
130500.00        103079.00
172400.00        178517.00
130000.00        114067.00
178000.00        193458.00
161500.00        213060.00
119900.00        139045.00
239000.00        218895.00
190000.00        187535.00
85400.00        90611.00
205000.00        187157.00
134000.00        157096.00
168000.00        174856.00
185000.00        184494.00
180500.00        124479.00
152000.00        142014.00
215000.00        220313.00
153000.00        129503.00
106000.00        124801.00
340000.00        248105.00
159000.00        174063.00
120000.00        121939.00
115000.00        1725825612612082.50
128000.00        150991.00
314813.00        338699.00
131000.00        134171.00
446261.00        380933.00
127500.00        136555.00
155000.00        185090.00
177500.00        180883.00
128500.00        141651.00
176000.00        173214.00
402000.00        338309.00
130000.00        138320126255192.00
145000.00        138453.00
147000.00        133947.00
115000.00        138320126202910.00
189000.00        187737.00
143000.00        141074.00
240000.00        220399.00
230000.00        239589.00
190000.00        257705.00
213000.00        210433.00
82500.00        138320126210703.00
274900.00        323037.00
155000.00        176213.00
423000.00        335065.00
128500.00        138165.00
335000.00        362733.00
262000.00        1977392763534562.50
129500.00        113577.00
222500.00        226664.00
270000.00        303206.00
207500.00        241393.00
175000.00        197931.00
238000.00        236415.00
135900.00        138320126259177.00
224000.00        243293.00
170000.00        185195.00
185000.00        199637.00
263000.00        245408.00
62383.00        111109.00
150000.00        163169.00
171000.00        204763.00
139000.00        122581.00
126175.00        136056.00
205950.00        -8652818881057405.00
110000.00        118022.00
485000.00        409522.00
174000.00        166949.00
145000.00        136073.00
215200.00        217373.00
320000.00        280178.00
212000.00        224297.00
168500.00        183295.00
189000.00        207801.00
140000.00        138871.00
187500.00        175023.00
254900.00        236189.00
135000.00        115596.00
151000.00        132415.00
112500.00        142966.00
184000.00        165857.00
197000.00        241757.00
161500.00        177418.00
85000.00        119763.00
203000.00        198777.00
325000.00        346638.00
313000.00        297451.00
237000.00        -8652690454334707.00
180500.00        149137.00
137000.00        131775.00
90350.00        160051.00
150900.00        179002.00
214000.00        203415.00
121500.00        165591.00
745000.00        431961.00
190000.00        186177.00
181000.00        179219.00
60000.00        893910032540262.00
239000.00        252748.00
118500.00        138320126212778.00
109500.00        -774136251924978.00
110000.00        112669.00
105500.00        99684.00
199900.00        199245.00
271900.00        243597.00
131500.00        143228.00
135000.00        137710.00
119000.00        108789.00
215000.00        194265.00
133500.00        140300.00
169000.00        177765.00

So it appears that some of the variables are drastically throwing off a few of the predictions. The majority, while not great, are not too far off from expected.

In [24]:
clf.coef_
Out[24]:
array([  1.62516326e+04,   4.40655927e+03,   2.03253328e+03,
        -3.04495955e+03,  -7.80482467e+03,   9.28415933e+15,
         9.28415933e+15,   9.28415933e+15,   9.28415933e+15,
         7.63267190e+15,   7.63267190e+15,   7.63267190e+15,
         7.63267190e+15,   7.63267190e+15,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -8.36533038e+14,  -8.36533038e+14,  -8.36533038e+14,
        -5.77037165e+15,  -5.77037165e+15,  -5.77037165e+15,
        -5.77037165e+15,  -5.77037165e+15,  -5.77037165e+15,
        -5.77037165e+15,  -5.77037165e+15,  -1.77139543e+15,
        -7.02077691e+13,  -8.77485397e+14,  -1.77139543e+15,
        -1.83102432e+14,  -1.77139543e+15,  -1.77139543e+15,
         2.05997334e+14,  -1.77139543e+15,  -1.77139543e+15,
        -1.77139543e+15,  -1.77139543e+15,  -1.77139543e+15,
        -1.77139543e+15,  -1.77139543e+15,  -1.29524422e+14,
        -1.29524422e+14,  -1.29524422e+14,  -1.29524422e+14,
        -1.71781742e+15,  -1.29524422e+14,  -1.29524422e+14,
        -1.29524422e+14,  -1.29524422e+14,  -1.29524422e+14,
        -1.29524422e+14,  -1.29524422e+14,  -1.29524422e+14,
        -1.29524422e+14,  -1.29524422e+14,  -1.29524422e+14,
        -8.06110533e+11,  -1.11769025e+12,   8.65277578e+15,
         8.65277578e+15,  -3.40577044e+10,   8.53283709e+10,
        -4.30983234e+10,   9.59677471e+10,   8.65277578e+15,
         8.65277578e+15,   8.65277578e+15,   8.65277578e+15,
        -9.03357387e+15,  -9.03357387e+15,  -9.03357387e+15,
        -9.03357387e+15,  -1.50534191e+16,  -1.50534191e+16,
        -1.50534191e+16,  -1.25534572e+16,  -1.50534191e+16,
        -1.50534191e+16,  -1.08351220e+08,   7.96750505e+14,
        -4.80882164e+14,   4.02981534e+07,   8.64768686e+06,
         2.89682608e+07,  -2.31971348e+06,   6.03989554e+06,
         1.36059843e+14,  -4.84235608e+05,  -2.80539364e+14,
        -6.45212466e+14,  -2.93144416e+14,  -8.16910341e+14,
        -8.05137463e+14,  -4.47669191e+14,  -5.89947766e+14,
        -2.57311280e+14,   5.35000000e+01,  -1.02261277e+15,
        -5.03262310e+14,  -7.43156193e+13,   1.60000000e+01,
        -9.23200167e+14,  -4.00000000e+00,   7.50000000e-01,
        -5.17347298e+14,  -3.21850117e+14,   6.73470178e+12,
        -6.60887834e+14,  -9.56502604e+14,   2.67892076e+14,
         4.59802839e+14,   1.78581717e+14,  -6.67916101e+14,
         5.20303742e+13,   6.87599182e-04,   2.51998155e+15,
         2.51998155e+15,   9.65444226e+14,   2.51998155e+15,
        -3.05514637e+14,   3.59945209e+13,  -4.80559816e+14,
         3.74497854e-02,   1.07409219e-02,  -2.84644040e-03,
         1.41220980e+15,  -2.80583663e+14,  -9.26421126e-06,
         5.47993185e+14,   3.48050889e-07,   3.07561780e+14,
        -6.59850280e+14,  -3.28780392e+14,  -1.00422572e+15,
        -9.54965663e+14,  -3.54725088e+14,   5.00222086e-12,
         6.55896725e+13,   5.45071479e+14,  -9.26715498e+14,
        -9.57680202e+13,  -9.12247184e+14,  -8.48216793e+14,
         0.00000000e+00,   2.97777268e+14,   0.00000000e+00,
        -1.45519152e-11,  -1.26016051e+15,  -5.28954756e+14,
        -6.01891084e+14,  -1.45395814e+15,   3.63797881e-12,
        -6.82121026e-13,   9.09494702e-13,  -9.09494702e-13,
        -4.97442151e+14,  -1.20462126e+15,  -5.84940644e+15,
        -5.84940644e+15,  -5.84940644e+15,  -5.84940644e+15,
        -5.28954756e+14,  -6.59850280e+14,   5.47993185e+14,
         2.97777268e+14,   0.00000000e+00,   0.00000000e+00,
        -2.80791437e+14,  -5.80134879e+14,   4.59839771e+14,
        -1.09972116e+14,  -3.04151171e+14,   2.48743740e+14,
         0.00000000e+00,   1.29047575e+13,  -2.09162348e+14,
         0.00000000e+00,   0.00000000e+00,  -4.18309330e+14,
         0.00000000e+00,  -7.27918499e+14,  -1.26241561e+15,
        -5.20434013e+14,  -2.73930701e+14,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,  -6.20672219e+14,
        -8.48318685e+14,  -2.40664986e+14,  -6.15173512e+14,
         0.00000000e+00,   0.00000000e+00,  -2.05770692e+14,
         0.00000000e+00,  -5.39759547e+14,  -1.84306015e+14,
        -2.98429298e+14,   4.10356114e+15,   4.10356114e+15,
         4.10356114e+15,   4.10356114e+15,   4.10356114e+15,
         4.10356114e+15,   4.35371297e+15,   4.35371297e+15,
         4.35371297e+15,   4.35371297e+15,   4.35371297e+15,
        -2.77641934e+15,  -2.77641934e+15,  -2.77641934e+15,
        -2.77641934e+15,   0.00000000e+00,   1.49586085e+14,
        -4.55706690e+14,  -3.95475089e+14,   8.83466323e+13,
        -7.95538603e+13,  -4.75835651e+14,   0.00000000e+00,
         1.29047575e+13,  -5.09616641e+14,  -7.53905798e+13,
         5.43556264e+13,  -2.47236503e+14,   0.00000000e+00,
         0.00000000e+00,  -6.16894023e+14,  -3.00429275e+15,
        -1.43566949e+14,  -4.18309331e+14,   0.00000000e+00,
         0.00000000e+00,  -1.05576023e+13,  -2.86981769e+14,
         7.08108599e+14,   2.17149181e+14,  -1.18262878e+15,
         1.58346906e+14,   6.32700306e+13,   1.92131286e+14,
        -1.84306015e+14,   0.00000000e+00,   0.00000000e+00,
        -9.44916123e+13,   0.00000000e+00,   0.00000000e+00,
        -4.10944237e+14,  -1.26770966e+14,   9.71200810e+13,
         3.15937475e+13,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,  -1.79219044e+14,  -4.91634772e+14,
        -3.88248968e+13,   0.00000000e+00,   0.00000000e+00,
        -2.81575971e+14,   0.00000000e+00,   2.22254187e+14,
        -1.22489141e+14,  -9.96608043e+13,  -2.97824709e+14,
        -1.37040245e+13,   0.00000000e+00,  -2.46124845e+14,
        -3.04151171e+14,  -1.95416363e+13,  -1.68850042e+14,
         0.00000000e+00,   0.00000000e+00,   5.95814089e+13,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        -8.52313998e+14,   0.00000000e+00,   0.00000000e+00,
        -2.01035139e+14,  -5.26732230e+13,   0.00000000e+00,
        -2.98429298e+14,   1.60391021e+14,  -1.45637624e+14,
        -9.60563158e+12,  -1.73184325e+14,   0.00000000e+00,
         0.00000000e+00,  -4.12913995e+13,  -3.18248396e+13,
         0.00000000e+00,  -2.50284060e+15,  -2.50284060e+15,
        -2.50284060e+15,  -2.50284060e+15,  -2.50284060e+15,
        -2.50284060e+15,   1.59504583e+14,  -2.26666758e+14,
         0.00000000e+00,   0.00000000e+00,   4.78573897e+14,
         2.86597262e+15,  -4.18309133e+14,   0.00000000e+00,
        -1.27762524e+14,   0.00000000e+00,   3.71296515e+14,
        -1.92675752e+14,   0.00000000e+00,   1.08916377e+14,
         1.04430865e+15,  -3.30451412e+14,   0.00000000e+00,
        -1.84306015e+14,   0.00000000e+00,   3.05299159e+13,
        -2.98429298e+14,   0.00000000e+00,   4.08989175e+13,
         0.00000000e+00,   0.00000000e+00,   7.08108599e+14,
        -2.87906212e+14,   3.37515524e+14,   1.29047575e+13,
        -2.35440207e+14,   0.00000000e+00,   1.43255845e+14,
        -4.38285138e+13,  -1.69913874e+14,   1.07804718e+14,
         0.00000000e+00,   1.48661642e+14,   0.00000000e+00,
         3.53314646e+14,  -1.06495287e+14,  -1.24616102e+14,
        -3.55469307e+14,   0.00000000e+00,   7.31749797e+12,
        -1.28714495e+14,  -2.01590157e+14,  -1.58309855e+13,
        -1.15491601e+13,  -3.86593219e+13,   0.00000000e+00,
         6.27150125e+13,  -4.10944237e+14,   0.00000000e+00,
         0.00000000e+00,  -1.18778490e+14,   3.48641994e+13,
        -8.56469031e+13,   0.00000000e+00,   0.00000000e+00,
        -3.04151171e+14,   0.00000000e+00,   0.00000000e+00,
        -2.98711147e+14,  -1.97901535e+14,   3.17386564e+14,
        -9.94952294e+13,   0.00000000e+00,   0.00000000e+00,
         2.57154962e+14,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,  -6.29295464e+13,
         0.00000000e+00,  -3.60574314e+14,  -9.70287266e+13,
         7.13993872e+14,  -5.87662659e+13,  -1.43566949e+14,
        -2.96667032e+14,   2.36452047e+15,   2.36452047e+15,
         2.36452047e+15])

Looks like the coefficients on a large portion of the dummy variables are... severe. That's the classic symptom of an unregularized least-squares fit on perfectly collinear one-hot columns. Let's just cut them and see how the principal components do on their own.
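
(As a side check, not something I ran for the submission: a regularized fit on the same full 406-column matrix should be far better behaved. A sketch using the train/test split from In [21]:)

#hypothetical check: ridge regression tames the collinear dummy columns
ridge_full = Ridge(alpha=1.0).fit(X_train, y_train)
print("Ridge R^2 on the held-out split: %.2f" % r2_score(y_test, ridge_full.predict(X_test)))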

In [37]:
X_train = X_train_save[["pca1","pca2","pca3", "pca4", "pca5"]]
y_train = y_train_save
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.33, random_state=42)
In [38]:
#linear
clf = LinearRegression().fit(X_train, y_train)

y_pred = clf.predict(X_test)

# The mean squared error
print("Mean squared error: %.2f") % mean_squared_error(y_test, y_pred)     

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))
Mean squared error: 1570704767.92
Variance score: 0.79
In [40]:
#Lasso
clf = LassoCV().fit(X_train, y_train)

y_pred = clf.predict(X_test)

# The mean squared error
print("Mean squared error: %.2f") % mean_squared_error(y_test, y_pred)     

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))
Mean squared error: 1611351349.64
Variance score: 0.78
In [41]:
#Ridge
clf = Ridge().fit(X_train, y_train)

y_pred = clf.predict(X_test)

# The mean squared error
print("Mean squared error: %.2f") % mean_squared_error(y_test, y_pred)     

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))
Mean squared error: 1570874057.65
Variance score: 0.79
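
Since I imported cross_validate at the top and never actually used it, a cross-validated check of the PCA-only linear model would look something like this (a sketch on the same training split):

#5-fold cross-validated R^2 on the PCA-only training set
cv_results = cross_validate(LinearRegression(), X_train, y_train, scoring='r2', cv=5)
print(cv_results['test_score'])
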
In [42]:
print("True       Predicted")
for i in range(0, 10):
    print("%.2f        %.2f") %(y_test.iloc[i], y_pred[i])
True       Predicted
154500.00        147888.31
325000.00        293508.12
115000.00        96145.85
159000.00        162123.22
315500.00        272257.67
75500.00        53688.24
311500.00        213377.88
146000.00        160354.30
84500.00        51587.39
135500.00        138603.64

So in effect, we've reduced an 81-variable dataset (>1,000 variables if we count all possible dummies) to only 5 variables thanks to PCA. While 79% variance explained isn't all that great for something like predicting final house sale prices, I'll take it for now (as I've run out of free time for the week). I'd like to get back into this and further explore fine-tuning the categorical variables and fitting some non-linear regression models... Till then, let's upload this to Kaggle just to see how it goes!

To do this, I'll need to run the df_test data through the pipeline.
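
(In hindsight, wrapping the continuous steps in an sklearn Pipeline would make this re-application much cleaner. A rough sketch of what that could look like, assuming the raw continuous frames df_train_cont and df_test_cont built below; it's not what I actually ran:)

from sklearn.pipeline import Pipeline

#rough sketch: impute -> scale -> PCA in one reusable object
num_pipeline = Pipeline([
    ('impute', Imputer(missing_values='NaN', strategy='mean')),
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=5)),
])

train_pcs = num_pipeline.fit_transform(df_train_cont.drop(['Id', 'SalePrice'], axis=1))
test_pcs = num_pipeline.transform(df_test_cont.drop('Id', axis=1))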

In [44]:
df_test = df_test.drop(["Alley", 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis = 1)
Out[44]:
Id                  0
MSSubClass          0
MSZoning            4
LotFrontage       227
LotArea             0
Street              0
Alley            1352
LotShape            0
LandContour         0
Utilities           2
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         1
Exterior2nd         1
MasVnrType         16
MasVnrArea         15
ExterQual           0
ExterCond           0
Foundation          0
                 ... 
HalfBath            0
BedroomAbvGr        0
KitchenAbvGr        0
KitchenQual         1
TotRmsAbvGrd        0
Functional          2
Fireplaces          0
FireplaceQu       730
GarageType         76
GarageYrBlt        78
GarageFinish       78
GarageCars          1
GarageArea          1
GarageQual         78
GarageCond         78
PavedDrive          0
WoodDeckSF          0
OpenPorchSF         0
EnclosedPorch       0
3SsnPorch           0
ScreenPorch         0
PoolArea            0
PoolQC           1456
Fence            1169
MiscFeature      1408
MiscVal             0
MoSold              0
YrSold              0
SaleType            1
SaleCondition       0
Length: 80, dtype: int64
In [45]:
df_test_cont = pd.DataFrame()

for i in df_test:
    if df_test[i].dtype == "int64" or df_test[i].dtype == "float64":
        df_test_cont[i] = df_test[i]

df_train_cont.head()  #note: this displays the train frame, not the new df_test_cont
Out[45]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
0 1 60 65.0 8450 7 5 2003 2003 196.0 706 ... 0 61 0 0 0 0 0 2 2008 208500
1 2 20 80.0 9600 6 8 1976 1976 0.0 978 ... 298 0 0 0 0 0 0 5 2007 181500
2 3 60 68.0 11250 7 5 2001 2002 162.0 486 ... 0 42 0 0 0 0 0 9 2008 223500
3 4 70 60.0 9550 7 5 1915 1970 0.0 216 ... 0 35 272 0 0 0 0 2 2006 140000
4 5 60 84.0 14260 8 5 2000 2000 350.0 655 ... 192 84 0 0 0 0 0 12 2008 250000

5 rows × 38 columns

In [47]:
df_test_cont_imp = imp.fit_transform(df_test_cont)  #note: re-fits the imputer on the test set's own column means
df_test_cont_imp = pd.DataFrame(df_test_cont_imp, columns = df_test_cont.columns)

df_test_cont_imp.isnull().sum().any()
Out[47]:
False
In [48]:
X_test = scaler.transform(df_test_cont_imp)
X_test = pd.DataFrame(X_test, columns = df_test_cont_imp.columns)

X_test.head()
Out[48]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold
0 1.733238 -0.872563 0.451936 0.110763 -0.795151 0.381743 -0.340077 -1.156380 -0.574410 0.053428 ... 1.202536 0.365179 -0.704483 -0.359325 -0.116339 1.882709 -0.068692 -0.087688 -0.119110 1.64521
1 1.735610 -0.872563 0.497357 0.375850 -0.071836 0.381743 -0.439440 -1.301740 0.023903 1.051363 ... -0.753188 2.384400 -0.160950 -0.359325 -0.116339 -0.270208 -0.068692 25.116309 -0.119110 1.64521
2 1.737983 0.073375 0.179413 0.332053 -0.795151 -0.517200 0.852269 0.636400 -0.574410 0.761852 ... 0.042202 0.939819 -0.191147 -0.359325 -0.116339 -0.270208 -0.068692 -0.087688 -1.229111 1.64521
3 1.740356 0.073375 0.361095 -0.054002 -0.071836 0.381743 0.885390 0.636400 -0.463612 0.347326 ... -0.013943 2.121024 -0.160950 -0.359325 -0.116339 -0.270208 -0.068692 -0.087688 -0.119110 1.64521
4 1.742728 1.492282 -1.228623 -0.552407 1.374795 -0.517200 0.686666 0.345679 -0.574410 -0.396190 ... 0.154492 -0.752176 0.533564 -0.359325 -0.116339 2.313293 -0.068692 -0.087688 -1.969111 1.64521

5 rows × 37 columns

In [49]:
del X_test['Id']
X_test = X_test.values

X_test = pca.transform(X_test)

X_test_pca = X_test[:,:5]
X_test_pca = pd.DataFrame(X_test_pca, columns = ["pca1", "pca2", "pca3", "pca4", "pca5"])
X_test_pca.head()
Out[49]:
pca1 pca2 pca3 pca4 pca5
0 -2.480251 -1.442848 0.787127 0.232835 -0.824734
1 -1.205705 0.018728 2.461575 -0.715957 -0.199358
2 0.719761 0.213787 -1.043330 -1.509953 -0.001801
3 1.176124 0.546120 -1.118030 -1.286063 -0.589375
4 0.137644 -1.025616 -2.117799 0.992937 0.336105
In [53]:
preds = clf.predict(X_test_pca)

df_sample.SalePrice = preds

#write to csv
df_sample.to_csv('Linear_Reg_PCA.csv', index=False)

Kaggle results: as expected, not all that great; around the 75th percentile. The project was great practice though. I got some hands-on time with PCA, KMeans clustering, and multiple correspondence analysis, along with all the typical cleaning and scaling done throughout.

As this page eventually turned into a much longer and messier version than I had hoped for, any further exploration of this project will be done and uploaded in a new post!