I found some free time and thought I'd finally get some more practice with dimensionality reduction. With this goal in mind, I went onto Kaggle and found a competition (estimating house prices) which looked appropriate for practicing these skills. Throughout this post I walk through the steps I took, from cleaning and standardizing the data to finally performing PCA and fitting a simple linear regression on the five most influential principal components! Not the most accurate regression ever, but great practice and surprisingly efficient given it reduces 81 variables down to only 5.
##-2018.05.09 // Andrew Trick
As mentioned in the summary above, I've been wanting to get some practice with PCA and other dimension reduction techniques. I found a house price dataset on Kaggle which looks to be a great dataset to practice these techniques with. While not technically too large to model a regression on, 81 variables is a lot to take in, and I thought it would be interesting to finally explore feature reduction. Below is my step-by-step process from importing the data, cleaning and standardizing, and reducing, to finally fitting a model for a Kaggle submission. I iterate over steps a few times, try things that eventually don't work, and even run a KMeans that is fairly pointless in the project aside from letting me explore scikit-learn a bit more.
import pandas as pd
import mca
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix, r2_score, mean_squared_error
import matplotlib.pyplot as plt
First step, I'll import the data and inspect it a bit. It's already split into a train and a test set, as is typical of Kaggle. Let's import all of this and then work primarily with the train set. I'll also grab the sample submission now, as I plan to upload to Kaggle and see if a regression using either clusters or principal components is effective.
data_dir = 'data/'
df_train = pd.read_csv(data_dir + 'train.csv')
df_test = pd.read_csv(data_dir + 'test.csv')
df_sample = pd.read_csv(data_dir + 'sample_submission.csv')
df_train.shape
df_train.head()
df_train.dtypes.head()
I cut the output to 5 rows, but there's a combination of roughly half continuous and half categorical variables in the dataset, 81 features in total. A perfect candidate for dimensionality reduction! Further exploring:
df_train.describe()
That's a lot to take in.. enough description of the dataframe. Time to check for any missing values and correct them if necessary.
df_train.isnull().sum()
A few of these are almost all null values.. Let's completely drop any variables with over 1/3 of their values missing:
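(For reference, here's a quick sketch of how that 1/3 rule could be computed programmatically rather than eyeballed from the output above; the explicit drop below is what the rest of the post actually uses.)
#fraction of missing values per column; anything above 1/3 is a drop candidate
null_frac = df_train.isnull().sum() / float(len(df_train))
print(null_frac[null_frac > 1.0 / 3].index.tolist())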
df_train = df_train.drop(["Alley", 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis = 1)
df_train.shape
df_train_cont = pd.DataFrame()
for i in list(df_train.columns):  #iterate over a copy of the column names since we delete columns as we go
    if df_train[i].dtype == "int64" or df_train[i].dtype == "float64":
        df_train_cont[i] = df_train[i]
        del df_train[i]
df_train_cont.head()
So we've split the continuous and the categorical variables into separate frames. Time to fill in the remaining missing values in the continuous set with their respective column means:
imp = Imputer(missing_values='NaN', strategy='mean')
df_train_cont_imp = imp.fit_transform(df_train_cont)
df_train_cont_imp = pd.DataFrame(df_train_cont_imp, columns = df_train_cont.columns)
df_train_cont_imp.isnull().sum().any()
Successfully filled in the null values.. let's split out the target (sale price) and perform some PCA!
y_train = df_train_cont_imp["SalePrice"]
del df_train_cont_imp["SalePrice"]
Before we perform any dimensionality reduction, let's review where we're at!
-df_train: categorical training variables
-df_train_cont_imp: continuous training variables (no missing values)
-y_train: dependent variable for modeling
-df_test: test data for Kaggle- NEEDS TO GO THROUGH THE SAME PIPELINE
Objective from here:
-standardize continuous values
-PCA on continuous
-K-modes clustering on categorical!?
-Combine principal components and clusters of categorical to get X_train
-fit models
scaler = StandardScaler()
X_train = scaler.fit_transform(df_train_cont_imp)  #scale all continuous columns at once
X_train = pd.DataFrame(X_train, columns = df_train_cont_imp.columns)
X_train.head()
del X_train['Id']
X_train.shape
Time to run PCA on this set. I'll first convert it to an array and then apply sklearn.decomposition.PCA to the X_train dataset.
X_train = X_train.values
pca = PCA(n_components = 5)
X_train = pca.fit_transform(X_train)
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
X_train_pca = X_train[:,:5]
X_train_pca = pd.DataFrame(X_train_pca, columns = ["pca1", "pca2", "pca3", "pca4", "pca5"])
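Before interpreting the numbers, here's a small sketch of a cumulative explained-variance plot for the retained components, since matplotlib is imported above but never used (a fuller scree plot would mean refitting PCA with more components on the scaled data):
import numpy as np
#cumulative explained variance across the five retained components
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()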
About 46% of the variance explained from only 5 principal components. Not too bad... I think.. Let's continue with this and try a k-means on the components just for fun!
kmeans = KMeans(n_clusters = 3, random_state = 42)
X_train_clusters = kmeans.fit(X_train)
X_train_clusters.labels_
X_train_clusters.cluster_centers_
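Just so the k-means has something to show for itself, a quick sketch plotting the cluster labels against the first two components (purely illustrative; the clusters never get used downstream):
#scatter the observations in the first two principal components, colored by cluster
plt.scatter(X_train[:, 0], X_train[:, 1], c=X_train_clusters.labels_, s=10)
plt.xlabel('pca1')
plt.ylabel('pca2')
plt.show()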
Alright, on to the categorical data.. first to clean it of nulls and then dummy it out. As this is more a project to explore dimensionality reduction, I'll allow the label encoder to encode the null values over to a value of their own.
encoder = LabelEncoder()
hot_encoder = OneHotEncoder()
#initialize empty frames for storage
df_train_enc = pd.DataFrame()
df_train_hot_enc = pd.DataFrame()
#iterate over categorical cols and transform into int values
for i in df_train:
    #cast to str so NaN becomes its own category instead of breaking the encoder
    df_train_enc[i] = encoder.fit_transform(df_train[i].astype(str))
#encode into binary dummies
df_train_hot_enc = hot_encoder.fit_transform(df_train_enc)
df_train_dummies = pd.DataFrame(df_train_hot_enc.toarray())
df_train_dummies.head()
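A quick check of how far the one-hot encoding inflated the feature count (the exact numbers depend on the data, but it should be well into the hundreds of columns):
print(df_train_dummies.shape)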
Okay.. so we just did the exact opposite of dimensionality reduction.. that's what dummy variables do. Let's try multiple correspondence analysis, the categorical analogue of PCA!
df_train_mca = mca.MCA(df_train_dummies, ncols = 5)
print(df_train_mca.L) #eigenvalues
Well, not too valuable really. It's not explaining all that much of the variance. Let's go back to the first categorical frame and take a more traditional approach. First, to view unique counts per column:
for i in df_train:
    print(i)
    print(df_train[i].value_counts())
Several of these categories are overwhelmingly concentrated in one value. As such, they won't provide useful information for forecasting.. I'm going to set a threshold of 75% within one value, i.e. if one value in a category holds >= 75% of the total count, it gets removed from the dataset.. This accounts for:
MSZoning, Street, LandContour, Utilities, LandSlope, Condition1, Condition2, BldgType, RoofStyle, RoofMatl, ExterCond, BsmtQual, BsmtFinType2, Heating, CentralAir, Electrical, Functional, GarageQual, GarageCond, PavedDrive, SaleType, SaleCondition
While many of these would be expected to be useful in predicting price (central air, paved driveway, sale condition), I'm still removing them at this threshold to try to reduce dimensionality. I'll eventually compare to a full regression with all variables and see how much of a difference it makes.
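(As a sanity check, here's a hedged sketch of how that 75% dominance rule could be computed instead of read off the value_counts output by hand; the explicit drop below is what I actually use.)
#flag categorical columns where the most common level covers >= 75% of rows
dominated = [col for col in df_train
             if df_train[col].value_counts(normalize=True, dropna=False).iloc[0] >= 0.75]
print(dominated)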
Let's remove them:
df_train = df_train.drop(["MSZoning", "Street", "LandContour", "Utilities", "LandSlope", "Condition1", "Condition2", "BldgType",
"RoofStyle", "RoofMatl", "ExterCond", "BsmtQual", "BsmtFinType2", "Heating", "CentralAir", "Electrical",
"Functional", "GarageQual", "GarageCond", "PavedDrive", "SaleType", "SaleCondition"], axis=1)
df_train.head()
Let's encode these and get to model fitting. (This is just a rerun of the encoding above, but on the smaller dataset.)
#initialize empty frames for storage
df_train_enc = pd.DataFrame()
df_train_hot_enc = pd.DataFrame()
#iterate over categorical cols and transform into int values
for i in df_train:
    #cast to str so NaN becomes its own category instead of breaking the encoder
    df_train_enc[i] = encoder.fit_transform(df_train[i].astype(str))
#encode into binary dummies
df_train_hot_enc = hot_encoder.fit_transform(df_train_enc)
df_train_dummies = pd.DataFrame(df_train_hot_enc.toarray())
df_train_dummies.head()
Alright, let's combine the two train sets (continuous and categorical). We'll split the Kaggle train set into train and test subsets so we can check the accuracy as well. After that, we'll fit a regression and see the results:
X_train = X_train_pca
X_train.head()
for i in df_train_dummies:
    X_train[i] = df_train_dummies[i]
X_train.shape
X_train_save = X_train #needed for later
y_train_save = y_train #needed for later
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.33, random_state=42)
print("okay, this is getting messy and difficult to track in jupyter.. lets model and gtfo")
#linear
clf = LinearRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f") % mean_squared_error(y_test, y_pred)
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f') % r2_score(y_test, y_pred)
Well, something is off here. Let's print the predictions against the true values:
print("True Predicted")
for i in range(0, len(y_pred)):
print("%.2f %.2f") %(y_test.iloc[i], y_pred[i])
So it appears one of the variables is drastically throwing off a few of the predictions. The majority are, while not great, not too far off from what's expected.
clf.coef_
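To see which features carry those weights, a small sketch pairing each coefficient with its column name and sorting by magnitude (the integer names are the one-hot dummy columns):
#coefficients indexed by column name, largest magnitudes first
coefs = pd.Series(clf.coef_, index=X_train.columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(15))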
Looks like a large portion of the dummy variables have rather extreme coefficients. Let's just cut them and see how the principal components do on their own.
X_train = X_train_save[["pca1","pca2","pca3", "pca4", "pca5"]]
y_train = y_train_save
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.33, random_state=42)
#linear
clf = LinearRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f") % mean_squared_error(y_test, y_pred)
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f') % r2_score(y_test, y_pred)
#Lasso
clf = LassoCV().fit(X_train, y_train)
y_pred = clf.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f") % mean_squared_error(y_test, y_pred)
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f') % r2_score(y_test, y_pred)
#Ridge
clf = Ridge().fit(X_train, y_train)
y_pred = clf.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f") % mean_squared_error(y_test, y_pred)
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f') % r2_score(y_test, y_pred)
print("True Predicted")
for i in range(0, 10):
print("%.2f %.2f") %(y_test.iloc[i], y_pred[i])
So in effect, we've reduced an 81-variable dataset (>1,000 variables if we count all the possible dummies) down to only 5 variables thanks to PCA. While 79% of the variance explained isn't all that great for something like predicting final house sale prices, I'll take it for now (as I've run out of free time for the week). I'd like to get back into this and further explore fine-tuning the categorical variables and fitting some non-linear regression models... Till then, let's upload this to Kaggle just to see how it goes!
To do this, I'll need to run the df_test data through the pipeline.
df_test = df_test.drop(["Alley", 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis = 1)
df_test_cont = pd.DataFrame()
for i in df_test:
    if df_test[i].dtype == "int64" or df_test[i].dtype == "float64":
        df_test_cont[i] = df_test[i]
df_test_cont.head()
df_test_cont_imp = imp.fit_transform(df_test_cont)  #refit here; the train-fit imputer expects a SalePrice column that the test set doesn't have
df_test_cont_imp = pd.DataFrame(df_test_cont_imp, columns = df_test_cont.columns)
df_test_cont_imp.isnull().sum().any()
X_test = scaler.transform(df_test_cont_imp)  #reuse the scaler fit on the training columns
X_test = pd.DataFrame(X_test, columns = df_test_cont_imp.columns)
X_test.head()
del X_test['Id']
X_test = X_test.values
X_test = pca.transform(X_test)
X_test_pca = X_test[:,:5]
X_test_pca = pd.DataFrame(X_test_pca, columns = ["pca1", "pca2", "pca3", "pca4", "pca5"])
X_test_pca.head()
preds = clf.predict(X_test_pca)
df_sample.SalePrice = preds
#write predictions to csv
df_sample.to_csv('Linear_Reg_PCA.csv', index=False)
Kaggle results: as expected.. not all that great, around the 75th percentile.. The project was great practice though. I got hands-on time with PCA, KMeans clustering, and Multiple Correspondence Analysis, along with all the typical cleaning and scaling done throughout.
As this page eventually turned into a much longer and messier post than I had hoped for, any further exploration of this project will be done and shared in a new post!