Summary: It's been a little over three years since I attempted my first ML project: classifying survival on the Titanic. A pretty famous dataset and task for this field, I'd say. I achieved an accuracy of around 75% in my first attempt (I believe with a random forest model). I thought it would be nice to revisit this and give it another go now that I'm a little more experienced. Additionally, I've wanted to explore ensemble learning some more, particularly XGBoost. Let's dive in!
#By Andrew Trick
Titanic Revisited (w/ XGBoost)
Goal
I originally worked with this dataset about 3.5 years ago while working through Udacity's Nanodegree in Data Analytics. I had, more or less, no idea what I was doing then. I thought it would be nice to revisit the dataset and see if I could beat the accuracy from my first time through (which was around 74%, if I recall). In particular, I've been wanting to work with XGBoost for a while now, and this seemed like an appropriate classification problem to give it a go on!
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math
import xgboost
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Input data files are available in the "input/" directory.
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
sub = pd.read_csv('input/gender_submission.csv')
train.head()
Feature Generation and Removal
Time to generate a few features which may be of use in classification.
# Does the passenger have a cabin?
train['cabin_binary'] = train["Cabin"].apply(lambda i: 0 if str(i) == "nan" else 1)
#Family Size
train['family_size'] = 1 + train['SibSp'] + train['Parch']
train['solo'] = train["family_size"].apply(lambda i: 1 if i == 1 else 0)
#Fix Nulls
train['Embarked'] = train['Embarked'].fillna('S')
train['Age'] = train['Age'].fillna(int(np.mean(train['Age'])))
train['Fare'] = train['Fare'].fillna(np.mean(train['Fare']))
#A few age specific Binaries
train['Child'] = train["Age"].apply(lambda i: 1 if i <= 17 and i > 6 else 0)
train['toddler'] = train["Age"].apply(lambda i: 1 if i <= 6 else 0)
train['Elderly'] = train["Age"].apply(lambda i: 1 if i >= 60 else 0)
# Fancy fancy
train['fancy'] = train['Fare'].apply(lambda i: 1 if i >= 100 else 0)
# standard
train['standard_fare'] = train['Fare'].apply(lambda i: 1 if i <= 10.0 else 0)
#No requirement to standardize in DT models, but might as well
fare_scaler = StandardScaler()
fare_scaler.fit(train['Fare'].values.reshape(-1, 1))
train['fare_std'] = fare_scaler.transform(train['Fare'].values.reshape(-1, 1))
# Get the title/status of each passenger
train['title'] = 'default'
for name in train['Name']:
    # Check for rare titles first (thanks to Anisotropic's wonderful kernel for inspiration/help here!)
    for e in ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']:
        if e in name:
            train.loc[train['Name'] == name, 'title'] = 'rare'
    if 'Miss' in name or 'Mlle' in name or 'Ms' in name or 'Mme' in name or 'Mrs' in name:
        train.loc[train['Name'] == name, 'title'] = 'Ms'
    if 'Mr.' in name or 'Master' in name:
        train.loc[train['Name'] == name, 'title'] = 'Mr'
train.head(10)
Let's send the test data through the same pipeline!
# Does the passenger have a cabin?
test['cabin_binary'] = test["Cabin"].apply(lambda i: 0 if str(i) == "nan" else 1)
#Family Size
test['family_size'] = 1 + test['SibSp'] + test['Parch']
test['solo'] = test["family_size"].apply(lambda i: 1 if i == 1 else 0)
#Fix Nulls
test['Embarked'] = test['Embarked'].fillna('S')
test['Age'] = test['Age'].fillna(int(np.mean(test['Age'])))
test['Fare'] = test['Fare'].fillna(np.mean(test['Fare']))
#A few age specific Binaries
test['Child'] = test["Age"].apply(lambda i: 1 if i <= 17 and i > 6 else 0)
test['toddler'] = test["Age"].apply(lambda i: 1 if i <= 6 else 0)
test['Elderly'] = test["Age"].apply(lambda i: 1 if i >= 60 else 0)
# Fancy fancy
test['fancy'] = test['Fare'].apply(lambda i: 1 if i >= 100 else 0)
test['standard_fare'] = test['Fare'].apply(lambda i: 1 if i <= 10.0 else 0)
#standardize
test['fare_std'] = fare_scaler.transform(test['Fare'].values.reshape(-1, 1))
# Get the title/status of each passenger
test['title'] = 'default'
for name in test['Name']:
    # Check for rare titles first (thanks to Anisotropic's wonderful kernel for inspiration/help here!)
    for e in ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']:
        if e in name:
            test.loc[test['Name'] == name, 'title'] = 'rare'
    if 'Miss' in name or 'Mlle' in name or 'Ms' in name or 'Mme' in name or 'Mrs' in name:
        test.loc[test['Name'] == name, 'title'] = 'Ms'
    if 'Mr.' in name or 'Master' in name:
        test.loc[test['Name'] == name, 'title'] = 'Mr'
test.head(10)
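The train and test blocks above repeat the same steps line for line. Wrapping them in one helper would keep the two frames in sync and cut the copy-paste; here's a minimal sketch (the add_features function is hypothetical and leaves out the title loop for brevity):
def add_features(df, scaler):
    # Apply the same feature engineering as above to a raw Titanic frame
    df = df.copy()
    df['cabin_binary'] = df['Cabin'].notnull().astype(int)
    df['family_size'] = 1 + df['SibSp'] + df['Parch']
    df['solo'] = (df['family_size'] == 1).astype(int)
    df['Embarked'] = df['Embarked'].fillna('S')
    df['Age'] = df['Age'].fillna(int(df['Age'].mean()))
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
    df['Child'] = ((df['Age'] > 6) & (df['Age'] <= 17)).astype(int)
    df['toddler'] = (df['Age'] <= 6).astype(int)
    df['Elderly'] = (df['Age'] >= 60).astype(int)
    df['fancy'] = (df['Fare'] >= 100).astype(int)
    df['standard_fare'] = (df['Fare'] <= 10.0).astype(int)
    df['fare_std'] = scaler.transform(df['Fare'].values.reshape(-1, 1))
    return df
# Usage (after fitting fare_scaler on train['Fare']): train = add_features(train, fare_scaler)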
Remove Unnecessary Features and Encode Categorical Variables
train = pd.get_dummies(train, columns=["Sex", "Embarked", "title"])
test = pd.get_dummies(test, columns=["Sex", "Embarked", "title"])
train = train.drop(['Name','PassengerId', 'Ticket', 'Cabin', 'Fare', 'SibSp'], axis = 1)
test = test.drop(['Name','PassengerId', 'Ticket', 'Cabin', 'Fare', 'SibSp'], axis = 1)
train.head()
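One thing to watch when calling pd.get_dummies on train and test separately: if a title or embarkation category shows up in only one of the frames, their column sets drift apart and the later predict call will fail. As a hedged safety step (not something the original run needed), test could be aligned to train's feature columns like this:
# Reindex test to the exact training schema, filling any dummy columns missing from test with 0
feature_cols = train.drop('Survived', axis=1).columns
test = test.reindex(columns=feature_cols, fill_value=0)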
Quick EDA Visuals
Let's do just a little bit of EDA with Seaborn:
# Correlation matrix of numerical variables
plt.figure(figsize=(14,12))
plt.title('Correlation Matrix', size=8)
sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0,
square=True, cmap=plt.cm.RdBu, linecolor='white', annot=True)
plt.show()
#Family size histo
sns.distplot(train['family_size'])
plt.show()
#boxplot of family size and survival
sns.boxplot("Survived", y="family_size", data = train)
plt.show()
#Fare to Age relationship?
sns.lmplot(x='fare_std', y='Age', data=train,
           fit_reg=False, scatter_kws={"marker": "D", "s": 20})
plt.show()
#boxplot of age and survival
sns.boxplot("Survived", "Age", data = train)
plt.show()
# Scatter of fare and survival
sns.lmplot(x="Survived", y="fare_std", data=train, fit_reg=False)
plt.show()
Classifying with XGBoost
I'll be comparing XGBoost to AdaBoost, GradientBoosting, RandomForest, and maybe an SVC or something else.
First off, let's split the training data into train/test sets for validation.
X = train.drop(['Survived'], axis = 1)
y = train['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Alright, let's first try some more traditional models, AKA a random forest and a standard decision tree.
#Random Forest Setup
ranfor = RandomForestClassifier()
parameters = {'n_estimators':[10,50,100], 'random_state': [42, 138], \
'max_features': ['auto', 'log2', 'sqrt']}
ranfor_clf = GridSearchCV(ranfor, parameters)
ranfor_clf.fit(X_train, y_train)
# Cross-validate
cv_results = cross_validate(ranfor_clf, X_train, y_train)
cv_results['test_score']
y_pred = ranfor_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
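The raw cv_results['test_score'] output above is a per-fold array; if a single summary number is easier to compare across models, something like this works (just a convenience, same information):
# Summarize the per-fold cross-validation scores
scores = cv_results['test_score']
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))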
# Decision tree
dt = DecisionTreeClassifier()
parameters = {'random_state': [42, 138],'max_features': ['auto', 'log2', 'sqrt']}
dt_clf = GridSearchCV(dt, parameters)
dt_clf.fit(X_train, y_train)
y_pred = dt_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
At 81% with the random forest, I'm stoked: that's already better than my last attempt. Let's keep pushing and give some boosting models a go.
ada = AdaBoostClassifier(base_estimator = DecisionTreeClassifier())
parameters = {'n_estimators':[10,50,100], 'random_state': [42, 138], 'learning_rate': [0.1, 0.5, 0.8, 1.0]}
ada_clf = GridSearchCV(ada, parameters)
ada_clf.fit(X_train, y_train)
cv_results = cross_validate(ada_clf, X_train, y_train)
cv_results['test_score']
y_pred = ada_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
gradBoost = GradientBoostingClassifier()
parameters = {'n_estimators':[10,50,100], 'random_state': [42, 138], 'learning_rate': [0.1, 0.5, 0.8, 1.0], \
'loss' : ['deviance', 'exponential']}
gb_clf = GridSearchCV(gradBoost, parameters)
gb_clf.fit(X_train, y_train)
cv_results = cross_validate(gb_clf, X_train, y_train)
cv_results['test_score']
y_pred = gb_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
So GradientBoosting has given me the best score so far at 82%. Let's finally make our way to XGBoost:
xg = xgboost.XGBClassifier(max_depth = 3, n_estimators = 400, learning_rate = 0.1)
xg.fit(X_train, y_train)
cv_results = cross_validate(xg, X_train, y_train)
cv_results['test_score']
y_pred = xg.predict(X_test)
print(accuracy_score(y_test, y_pred))
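The XGBoost hyperparameters above were hand-picked; the same GridSearchCV pattern used for the earlier models could just as well be applied here. A sketch with an illustrative grid (the values are examples, not tuned results):
# Grid search over a small, example XGBoost parameter grid
xg_params = {'max_depth': [3, 5], 'n_estimators': [100, 400], 'learning_rate': [0.05, 0.1]}
xg_search = GridSearchCV(xgboost.XGBClassifier(), xg_params)
xg_search.fit(X_train, y_train)
print(xg_search.best_params_, xg_search.best_score_)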
# Confusion matrix
y_pred = xg.predict(X_test)
# TN, FP, FN, TP
confusion_matrix(y_test, y_pred)
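Scikit-learn orders the 2x2 matrix as [[TN, FP], [FN, TP]], so the four counts can also be unpacked into named values if that's easier to read:
# Unpack the confusion matrix counts
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))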
So... a liiittle better than the gradient boost. I'm happy with that for a quick project like this. Let's write it out and submit.
xg = xgboost.XGBClassifier(max_depth = 3, n_estimators = 400, learning_rate = 0.1)
xg.fit(X, y)
cv_results = cross_validate(xg, X, y)
predictions = xg.predict(test)
sub['Survived'] = predictions
sub.to_csv("first_submission_xgb.csv", index=False)
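As a final sanity check (optional, and not needed for the submission), XGBoost's built-in importance plot gives a quick look at which of the engineered features the final model leaned on:
# Plot feature importances from the fitted model
xgboost.plot_importance(xg)
plt.show()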