I found some free time and thought I'd finally get some more practice with dimensionality reduction. With this goal in mind, I went onto Kaggle and found a competition (estimating house prices) which looked appropriate for practicing these skills. Throughout this post I walk through the steps I took, from cleaning and standardizing the data to finally performing PCA and fitting a simple linear regression on the five most influential principal components! Not the most accurate regression ever, but great practice and surprisingly efficient given it reduces 81 variables down to only 5.
##-2018.05.09 // Andrew Trick
As mentioned in the summary above, I've been wanting to get some practice with PCA and other dimension reduction techniques. I found a house price dataset on Kaggle which looks to be a great dataset to practice these techniques with. While not technically too large to model a regression on, 81 variables is a lot to take in, and I thought it would be interesting to finally explore feature reduction. Below is my step-by-step process from importing the data, cleaning and standardizing, and reducing, to finally fitting a model for a Kaggle submission. I iterate over steps a few times, try things that eventually don't work, and even run a KMeans that is fairly pointless in the project aside from letting me explore scikit-learn a bit more.
import pandas as pd
import mca
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix, r2_score, mean_squared_error
import matplotlib.pyplot as plt
First step, I'll import the data and inspect it a bit. It's already split into a train and a test set, as is typical of Kaggle. Let's import all of this and then work primarily with the train set. I'll also grab the sample submission now, as I plan to upload to Kaggle and see if a regression using either clusters or principal components is effective.
data_dir = 'data/'
df_train = pd.read_csv(data_dir + 'train.csv')
df_test = pd.read_csv(data_dir + 'test.csv')
df_sample = pd.read_csv(data_dir + 'sample_submission.csv')
df_train.shape
df_train.head()
df_train.dtypes.head()
I cut the output to 5 rows, but there's a combination of roughly half continuous and half categorical variables in the dataset, 81 features in total. A perfect candidate for dimensionality reduction! Further exploring:
df_train.describe()
That's a lot to take in.. enough description of the dataframe. Time to check for any missing values and correct them if necessary.
df_train.isnull().sum()
A few of these are almost all null values.. Let's completely drop any variables with over 1/3 of their values missing:
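(For reference, here's a quick sketch of how that 1/3 rule could be computed programmatically rather than eyeballed from the output above; the explicit drop below is what the rest of the post actually uses.)
#fraction of missing values per column; anything above 1/3 is a drop candidate
null_frac = df_train.isnull().sum() / float(len(df_train))
print(null_frac[null_frac > 1.0 / 3].index.tolist())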
df_train = df_train.drop(["Alley", 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis = 1)
df_train.shape
df_train_cont = pd.DataFrame()
for i in list(df_train.columns):  #iterate over a copy of the column names since we delete columns as we go
    if df_train[i].dtype == "int64" or df_train[i].dtype == "float64":
        df_train_cont[i] = df_train[i]
        del df_train[i]
df_train_cont.head()
So we've split the continuous and the categorical variables into separate frames. Time to fill in the remaining missing values in the continuous set with their respective column means:
imp = Imputer(missing_values='NaN', strategy='mean')
df_train_cont_imp = imp.fit_transform(df_train_cont)
df_train_cont_imp = pd.DataFrame(df_train_cont_imp, columns = df_train_cont.columns)
df_train_cont_imp.isnull().sum().any()
Successfully filled in the null values.. let's split out the target (sale price) and perform some PCA!
y_train = df_train_cont_imp["SalePrice"]
del df_train_cont_imp["SalePrice"]
Before we perform any dimensionality reduction, let's review where we're at!
-df_train: categorical training variables
-df_train_cont_imp: continuous training variables (no missing values)
-y_train: dependent variable for modeling
-df_test: test data for Kaggle- NEEDS TO GO THROUGH THE SAME PIPELINE
Objective from here:
-standardize continuous values
-PCA on continuous
-K-modes clustering on categorical!?
-Combine principal components and clusters of categorical to get X_train
-fit models
scaler = StandardScaler()
X_train = scaler.fit_transform(df_train_cont_imp)  #scale all continuous columns at once
X_train = pd.DataFrame(X_train, columns = df_train_cont_imp.columns)
X_train.head()
del X_train['Id']
X_train.shape
Time to run PCA on this set. I'll first convert it to an array and then apply sklearn.decomposition.PCA to the X_train dataset.
X_train = X_train.values
pca = PCA(n_components = 5)
X_train = pca.fit_transform(X_train)
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
X_train_pca = X_train[:,:5]
X_train_pca = pd.DataFrame(X_train_pca, columns = ["pca1", "pca2", "pca3", "pca4", "pca5"])
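Before interpreting the numbers, here's a small sketch of a cumulative explained-variance plot for the retained components, since matplotlib is imported above but never used (a fuller scree plot would mean refitting PCA with more components on the scaled data):
import numpy as np
#cumulative explained variance across the five retained components
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()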
About 46% of the variance explained from only 5 principal components. Not too bad... I think.. Let's continue with this and try a k-means on the components just for fun!
kmeans = KMeans(n_clusters = 3, random_state = 42)
X_train_clusters = kmeans.fit(X_train)
X_train_clusters.labels_
X_train_clusters.cluster_centers_
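Just so the k-means has something to show for itself, a quick sketch plotting the cluster labels against the first two components (purely illustrative; the clusters never get used downstream):
#scatter the observations in the first two principal components, colored by cluster
plt.scatter(X_train[:, 0], X_train[:, 1], c=X_train_clusters.labels_, s=10)
plt.xlabel('pca1')
plt.ylabel('pca2')
plt.show()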
Alright, on to the categorical data.. first to clean it of nulls and then dummy it out. As this is more a project to explore dimensionality reduction, I'll allow the label encoder to encode the null values over to a value of their own.
encoder = LabelEncoder()
hot_encoder = OneHotEncoder()
#initialize empty frames for storage
df_train_enc = pd.DataFrame()
df_train_hot_enc = pd.DataFrame()
#iterate over categorical cols and transform into int values
for i in df_train:
    #cast to str so NaN becomes its own category instead of breaking the encoder
    df_train_enc[i] = encoder.fit_transform(df_train[i].astype(str))
#encode into binary dummies
df_train_hot_enc = hot_encoder.fit_transform(df_train_enc)
df_train_dummies = pd.DataFrame(df_train_hot_enc.toarray())
df_train_dummies.head()
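A quick check of how far the one-hot encoding inflated the feature count (the exact numbers depend on the data, but it should be well into the hundreds of columns):
print(df_train_dummies.shape)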
Okay.. so we just did the exact opposite of dimensionality reduction.. that's what dummy variables do. Let's try multiple correspondence analysis, the categorical analogue of PCA!
df_train_mca = mca.MCA(df_train_dummies, ncols = 5)
print(df_train_mca.L) #eigenvalues
Well, not too valuable really. It's not explaining all that much of the variance. Let's go back to the first categorical frame and take a more traditional approach. First, to view unique counts per column:
for i in df_train:
    print(i)
    print(df_train[i].value_counts())
Several of these categories are overwhelmingly concentrated in one value. As such, they won't provide useful information for forecasting.. I'm going to set a threshold of 75% within one value, i.e. if one value in a category holds >= 75% of the total count, it gets removed from the dataset.. This accounts for:
MSZoning, Street, LandContour, Utilities, LandSlope, Condition1, Condition2, BldgType, RoofStyle, RoofMatl, ExterCond, BsmtQual, BsmtFinType2, Heating, CentralAir, Electrical, Functional, GarageQual, GarageCond, PavedDrive, SaleType, SaleCondition
While many of these would be expected to be useful in predicting price (central air, paved driveway, sale condition), I'm still removing them at this threshold to try to reduce dimensionality. I'll eventually compare to a full regression with all variables and see how much of a difference it makes.
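(As a sanity check, here's a hedged sketch of how that 75% dominance rule could be computed instead of read off the value_counts output by hand; the explicit drop below is what I actually use.)
#flag categorical columns where the most common level covers >= 75% of rows
dominated = [col for col in df_train
             if df_train[col].value_counts(normalize=True, dropna=False).iloc[0] >= 0.75]
print(dominated)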
Let's remove them:
df_train = df_train.drop(["MSZoning", "Street", "LandContour", "Utilities", "LandSlope", "Condition1", "Condition2", "BldgType",
"RoofStyle", "RoofMatl", "ExterCond", "BsmtQual", "BsmtFinType2", "Heating", "CentralAir", "Electrical",
"Functional", "GarageQual", "GarageCond", "PavedDrive", "SaleType", "SaleCondition"], axis=1)
df_train.head()
Let's encode these and get to model fitting. (This is just a rerun of the encoding above, but on the smaller dataset.)
#initialize empty frames for storage
df_train_enc = pd.DataFrame()
df_train_hot_enc = pd.DataFrame()
#iterate over categorical cols and transform into int values
for i in df_train:
    #cast to str so NaN becomes its own category instead of breaking the encoder
    df_train_enc[i] = encoder.fit_transform(df_train[i].astype(str))
#encode into binary dummies
df_train_hot_enc = hot_encoder.fit_transform(df_train_enc)
df_train_dummies = pd.DataFrame(df_train_hot_enc.toarray())
df_train_dummies.head()
Alright, let's combine the two train sets (continuous and categorical). We'll split the Kaggle train set into train and test subsets so we can check the accuracy as well. After that, we'll fit a regression and see the results:
X_train = X_train_pca
X_train.head()
for i in df_train_dummies:
    X_train[i] = df_train_dummies[i]
X_train.shape
X_train_save = X_train #needed for later
y_train_save = y_train #needed for later
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.33, random_state=42)
print("okay, this is getting messy and difficult to track in jupyter.. lets model and gtfo")
#linear
clf = LinearRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f") % mean_squared_error(y_test, y_pred)
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f') % r2_score(y_test, y_pred)
Well, something is off here. Let's print the predictions against the true values:
print("True Predicted")
for i in range(0, len(y_pred)):
print("%.2f %.2f") %(y_test.iloc[i], y_pred[i])
So it appears one of the variables is drastically throwing off a few of the predictions. The majority are, while not great, not too far off from what's expected.
clf.coef_
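To see which features carry those weights, a small sketch pairing each coefficient with its column name and sorting by magnitude (the integer names are the one-hot dummy columns):
#coefficients indexed by column name, largest magnitudes first
coefs = pd.Series(clf.coef_, index=X_train.columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(15))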
Looks like a large portion of the dummy variables have rather extreme coefficients. Let's just cut them and see how the principal components do on their own.
X_train = X_train_save[["pca1","pca2","pca3", "pca4", "pca5"]]
y_train = y_train_save
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.33, random_state=42)
#linear
clf = LinearRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f") % mean_squared_error(y_test, y_pred)
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f') % r2_score(y_test, y_pred)
#Lasso
clf = LassoCV().fit(X_train, y_train)
y_pred = clf.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f") % mean_squared_error(y_test, y_pred)
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f') % r2_score(y_test, y_pred)
#Ridge
clf = Ridge().fit(X_train, y_train)
y_pred = clf.predict(X_test)
# The mean squared error
print("Mean squared error: %.2f") % mean_squared_error(y_test, y_pred)
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f') % r2_score(y_test, y_pred)
print("True Predicted")
for i in range(0, 10):
print("%.2f %.2f") %(y_test.iloc[i], y_pred[i])
So in effect, we've reduced an 81-variable dataset (>1,000 variables if we count all the possible dummies) down to only 5 variables thanks to PCA. While 79% of the variance explained isn't all that great for something like predicting final house sale prices, I'll take it for now (as I've run out of free time for the week). I'd like to get back into this and further explore fine-tuning the categorical variables and fitting some non-linear regression models... Till then, let's upload this to Kaggle just to see how it goes!
To do this, I'll need to run the df_test data through the pipeline.
df_test = df_test.drop(["Alley", 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis = 1)
df_test_cont = pd.DataFrame()
for i in df_test:
    if df_test[i].dtype == "int64" or df_test[i].dtype == "float64":
        df_test_cont[i] = df_test[i]
df_test_cont.head()
df_test_cont_imp = imp.fit_transform(df_test_cont)  #refit here; the train-fit imputer expects a SalePrice column that the test set doesn't have
df_test_cont_imp = pd.DataFrame(df_test_cont_imp, columns = df_test_cont.columns)
df_test_cont_imp.isnull().sum().any()
X_test = scaler.transform(df_test_cont_imp)  #reuse the scaler fit on the training columns
X_test = pd.DataFrame(X_test, columns = df_test_cont_imp.columns)
X_test.head()
del X_test['Id']
X_test = X_test.values
X_test = pca.transform(X_test)
X_test_pca = X_test[:,:5]
X_test_pca = pd.DataFrame(X_test_pca, columns = ["pca1", "pca2", "pca3", "pca4", "pca5"])
X_test_pca.head()
preds = clf.predict(X_test_pca)
df_sample.SalePrice = preds
#write predictions to csv
df_sample.to_csv('Linear_Reg_PCA.csv', index=False)
Kaggle results: as expected.. not all that great, around the 75th percentile.. The project was great practice though. I got hands-on time with PCA, KMeans clustering, and Multiple Correspondence Analysis, along with all the typical cleaning and scaling done throughout.
As this page eventually turned into a much longer and messier post than I had hoped for, any further exploration of this project will be done and shared in a new post!