Part three of my attempt to predict NCAA tourney results based on past game data. As cleaning and var creation is out of the way now, this segment will focus on fitting the data to different classifiers in scikit-lelarn and tweaking parameters to determine a classification model with the best prediction power.

In [17]:
import os
import numpy as np 
import pandas as pd 
import math
from sklearn.utils import shuffle
from sklearn.preprocessing import Imputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn import grid_search
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn import svm

import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

Import list is large here as I threw in a large amount of the classifiers included in sklearn. As noticeable later, once I get the trainn and test data split I toss it into several different models and use the accuracy scores of each to fine tune//pick a final model. Data importing next.

A few ideas of importance- I was originally using only the cleaned regular season data (train_data_diff.csv) and splitting it into a test and train. This resulted in extremely high accuracy with most models due to a look ahead bias which was ineherent in training and testing temporal data from the same set (Thanks for the help, Jenn!). As such, I included the tourny_test variable so as to train from regular season data and test on the post-season. This accurately should repressent fitting the 2018 regular season data to the upcoming tourney season.

I additionaly cut the 2017 year off of the tourney data so as to have a year completely cut from the dataset to also test. 2017 test data is included in the 'season_test.csv'.

In [18]:
####################
### IMPORT FILES ###
####################
data_dir = '../data/'
df = pd.read_csv(data_dir + 'train_data_diff.csv')
sample_sub = pd.read_csv(data_dir + 'SampleSubmissionStage1.csv')
n_test_games = len(sample_sub)
tourney_tester = pd.read_csv(data_dir + 'season_test.csv')
tourny_data = pd.read_csv(data_dir + 'NCAATourneyCompactResults.csv')
tourny_data = tourny_data[(tourny_data.Season >= 2003) & (tourny_data.Season <= 2017)]

Next is a handful of helper function that will facilitate in extracting the year and teams from Kaggles sample submission. This will allow me to use their sample submission to set predictions and submit to the competition. The get_year_t1_t2 function came from the Basic Starter Kernel of the competition site by Julie Elliott.

The first functions is used to create a win/loss ration which I missed in earlier manipulation. Get_stat will be used to iterate over the test dataset and the sample submission from Kaggle to retrieve the appropriate stat differentials as this is what is primarily used in the training of the model. Probably should have set win and loss variables first instead of writing out this long line using .means().. but this worked out well it seems.

Set and format Train is the function which will be called when creating new data tables for creating the X variables for the test and submission data. This works primarily in connection with get stat. It also creates a Win Loss Ratio for each row and includes the Team ID's for the winners and losers.

The final function created provides a way to retrieve the average wins for a TeamID in their post season. IE: Duke on average wins 2.15 games per year in March Madness. Just another generated var I thought would be (and later after feature selection decidedly is) useful in predicting winners.

It take a decent amount of time to iterate through the data and return (~7 minutes or so).

In [19]:
##############################################
########## HELPER FUNCTIONS ##################
##############################################
'''Set WIN LOSS RATION'''
def get_count(teamID, year, wl):
    if wl == 1:
        try:
            return df[(df.TeamID == teamID) & (df.Season == year) & (df.Result == 1)].TeamID.value_counts().iloc[0]
        except IndexError:
            return 0
    else:
        try:
            return df[(df.TeamID == teamID) & (df.Season == year) & (df.Result == 0)].TeamID.value_counts().iloc[0]
        except IndexError:
            return 0


'''PULL INFO FROM SAMPLE SUBMISSION'''
def get_year_t1_t2(ID):
    """Return a tuple with ints `year`, `team1` and `team2`."""
    return (int(x) for x in ID.split('_'))


'''SETS TEAMS STAT MEAN DIFFERENTIALS BASED ON REGULAR SEASON DATA'''
def get_stat(stat, t1, t2, year):
    if not math.isnan(df[(df.TeamID == t1) & (df.Season == year)][stat].mean() - df[(df.TeamID == t2) & (df.Season == year)][stat].mean()):
        return df[(df.TeamID == t1) & (df.Season == year)][stat].mean() - df[(df.TeamID == t2) & (df.Season == year)][stat].mean()  
    else:
        return df[(df.TeamID == t1)][stat].mean() - df[(df.TeamID == t2)][stat].mean()



'''PULLS TRAINING DATA AND ENTERS INTO EMPTY DATASET'''
def set_and_format_train(data_set, input_df, stat_list):
    for ii, row in input_df.iterrows():
        year, t1, t2 = get_year_t1_t2(row.ID)
        col_num = 0
    
        for team_stat in stat_list:
            data_set[ii, col_num] = get_stat(team_stat, t1, t2, year)
            col_num += 1
            #WL Ratio
        data_set[ii, col_num] =  get_count(t1, year, 1)/ (get_count(t1, year, 0) + get_count(t1, year, 1)).astype('float') - \
        get_count(t2, year, 1)/ (get_count(t2, year, 0) + get_count(t2, year, 1)).astype('float')
        col_num += 1
            #Win average in Tourny
        data_set[ii, col_num] = get_win_avg(t1, t2)
            
'''Get avg wins of tourny games for team'''
tourny_data_train = tourny_data[(tourny_data.Season < 2017)]
tourny_win_avg = tourny_data_train.groupby('WTeamID').count().Season / tourny_data.groupby('WTeamID').Season.nunique()
 
def get_win_avg(t1, t2):
    try:
        return tourny_win_avg[t1] - tourny_win_avg[t2]    
    except KeyError:
        try:
            return -tourny_win_avg[t2]
        except KeyError:
            try:
                return tourny_win_avg[t1]
            except KeyError:
                return 0

A little more housecleaning before actual model experimentation. This will set the win loss ratio for each team in the training dataframe and turn each location variable into categorical dummies. I actually end up never using this as it is extremely difficult to retrieve this information for the test data.. I hope to include it in the final for the 2018 tourney though!

NOTE: I also want to eventually include teamID and coaches in as categorical indepent variables as well. Hopefully I'll get some time to add this in eventually.

In [20]:
'''SET DUMMIES'''
loc_dummies = pd.get_dummies(df.Loc)
df = pd.concat([df, loc_dummies], axis = 1)
df.head()
Out[20]:
AST AST_Diff BLK BLK_Diff Coach DR DR_Diff FGP FGP3 FGP3_Diff ... Result SEED_Diff STL STL_Diff Season Seed TeamID A H N
0 14.000000 4.666667 4.176471 0.676471 mark_gottfried 26.411765 3.911765 0.444393 0.347418 0.040750 ... 1 9.0 7.235294 1.401961 2003 10.0 1104 0 0 1
1 15.380952 6.047619 4.095238 0.595238 rick_stansbury 26.380952 3.880952 0.495357 0.361980 0.055311 ... 1 4.0 9.285714 3.452381 2003 5.0 1280 0 0 1
2 13.400000 4.066667 5.750000 2.250000 eddie_sutton 24.500000 2.000000 0.474798 0.382793 0.076124 ... 1 5.0 9.650000 3.816667 2003 6.0 1329 0 1 0
3 14.590909 5.257576 3.818182 0.318182 rick_barnes 26.636364 4.136364 0.456127 0.343126 0.036458 ... 1 0.0 6.954545 1.121212 2003 1.0 1400 0 1 0
4 14.590909 5.257576 3.818182 0.318182 rick_barnes 26.636364 4.136364 0.456127 0.343126 0.036458 ... 1 0.0 6.954545 1.121212 2003 1.0 1400 1 0 0

5 rows × 31 columns

Next I create the test dataset. This pulls the year and team names for a season game then creates team differential variables based on regular season data of that year. If data from that specific year is unavailable as is the case with some of the teams, an overall team average (throughout all years) is considered.

In [21]:
#######################################
###### CREATE TEST DATA ## ############
#######################################
'''SET TEST DATAFRAME'''
test_data = np.zeros(shape=(len(tourney_tester), 12))

'''SETTING FEATURES'''
stat_list = ['PPG', 'FGP', 'AST', 'FGP3', 'Seed', 'FTP', 'DR', 'STL', 'BLK', 'Rank'] 
set_and_format_train(test_data, tourney_tester, stat_list)

'''REULTS AND SHUFFLE DATA'''
test_data_results = tourney_tester['Result']      
X_data, y_data = shuffle(test_data, test_data_results)

'''FILL NaN's'''
imp = Imputer(missing_values='NaN', strategy='median', axis=1) 
imp.fit(X_data)
X_data = imp.fit_transform(X_data)

Time to rescale the features in the dataset! I opted to standardize as, suggested again by a friend, a mean of 0 is nice and efficient for most classifiers.

In [22]:
######################################
######### RESCALE DATA ###############
######################################
'''Standardize'''
scaler = StandardScaler()
scaler.fit(X_data)
X_data = scaler.transform(X_data)

And now for setting the training data and splitting into a test set. This will allow for a good amount (80%) of the data to be used in training the model, with the remainder available to test the accuracy of the fit.

I had tried a feature selector with a few of the final models, yet inclusion of all 14 features yeilded better results over the versions with only the top 20% influencers... So this is commented out. SelectKBest also appears to increase accuracy for some models, while decreasing for others.. Regardless, I'm leaving it out for this final version.

While feature selection might be useful in explaination of the predictions (PPG, WLRatio, Rank, and Win_Avg are by far the top influenctial variables), I opted to include even minimally influental variables in final model if they added to the accuracy of the final model even slightly.

The only unexpected var that required a drop was offensive rebounds. They were all over the place and had no correlation with winning the game whatsoever. Similarly, game location can not be deciphered in the kaggle submission file, so I drop this dummy var as well. I hope to include it back in for the final test when the tourny teams are selected.

In [23]:
############################################
#### FEATURE SELECTION AND TRAIN SPLIT #####
############################################
'''SELECTION SCORES'''
#X_new = SelectPercentile(percentile = 20).fit_transform(X_data, y_data)

selector = SelectKBest(k = 6)
X_new = selector.fit_transform(X_data, y_data)
#selector.scores_

'''SPLIT TRAIN AND TEST'''
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20, random_state=42)
y_train.head()
Out[23]:
960     1
543     1
1247    0
1834    0
583     1
Name: Result, dtype: int64

And finally to the model fitting. Most are commented out which indicates that they were not asd powerful and accurate as the final selected: SVC with a linear kernel. I worry simply judging by accuracy is not the most efficient and apt method to deciding between fits, but it appears to have a direct relation to how highly the final submission scores on Kaggle as well.

While Random Forest and SVC both have great accuracy.. as excpected with a dataset like this.. I was excited to see that Adaboost was so efficient at prediction based on the inputs. I had luck with it in past projects and was excited to tune it again. It was also extremely fun to give a neural network a try for the first time with scikit's Multi-layer perceptron classifier. While it gave great accuracy for the cross validation, test results were lackluster.

In [24]:
#######################################
######### FITTING MODELS ##############
#######################################
'''GAUSSIAN NAIVE BAYES'''
#clf = GaussianNB()
#clf.fit(X_train, y_train)


'''LOGISTIC REGRESSION'''
#logreg = LogisticRegression()
#params = {'C': np.logspace(start=-5, stop=3, num=9)}
#clf = GridSearchCV(logreg, params, scoring='neg_log_loss', refit=True)
#clf.fit(X_train, y_train)


'''LOG REG CV'''
#clf = LogisticRegressionCV()
#clf.fit(X_train, y_train)


'''TRAIN MODEL - RANDOM FOREST'''
#clf = RandomForestClassifier(random_state=42)
#clf.fit(X_train, y_train)


'''ADABOOST'''
ada = AdaBoostClassifier()
parameters = {'n_estimators':[10,50,100], 'random_state': [None, 0, 42, 138], \
              'learning_rate': [0.1, 0.5, 0.8, 1.0]}
clf = grid_search.GridSearchCV(ada, parameters)

clf.fit(X_train, y_train)


'''K Nearest Neighbor'''
#knn = KNeighborsClassifier()
#parameters = {'algorithm':['auto','ball_tree', 'kd_tree'], 'n_neighbors': [5, 100], \
 #             'weights': ['uniform', 'distance'], 'p': [1, 2]}
#clf = grid_search.GridSearchCV(knn, parameters)

#clf.fit(X_train, y_train) 


'''Support Vector Machine''' 
#svc = svm.SVC()
#parameters = {'kernel':['linear','rbf'], 'random_state': [None, 0, 42, 138], \
#              'gamma': ['auto', 0.25, 0.5, 0.7, 0.9], 'C': [0.2, 0.5, 0.8, 1.0], \
#              'probability': [True]}
#clf = grid_search.GridSearchCV(svc, parameters)

#clf.fit(X_train, y_train)


'''Multi-layer Perceptron Classifier (NN)'''
#mlp = MLPClassifier()
#parameters = {'hidden_layer_sizes':[(100,), (50,),(200,)], 'activation': ['identity','relu', 'logistic', 'tanh'], \
#              'solver': ['adam', 'lbfgs'], 'learning_rate': ['constant', 'invscaling', 'adaptive'], 'max_iter': [100, 200, 300], \
#              'early_stopping': [True]}
#clf = grid_search.GridSearchCV(mlp, parameters)
 
#clf.fit(X_train, y_train)
Out[24]:
'Multi-layer Perceptron Classifier (NN)'

The next section encompasses the cross validation and accuracy testing for the models. I utilized a grid search for parameter turning for most of these as well, so gridsearch results are also called for below. Adaboost classifier sits with a final accuracy of around 74% and a average cross validation of around 72%. Not great, but a better predictor than I could personal result in for march madness games.

Confusion matrix is also included jsut to give an idea of where the mistakes in the estimator are.

In [27]:
################################
### CLASSIFIER  REVIEW ########
################################
'''FEATURE SELECTION (IF GRIDSEARCH)'''
clf.best_estimator_
clf.best_score_

'''CROSS VALIDATE'''
cv_results = cross_validate(clf, X_train, y_train)
cv_results['test_score']  

'''Confusion Matrix'''
y_pred = clf.predict(X_test)
# TN,FP, FN, TP
confusion_matrix(y_test, y_pred)

'''PREDICTOR ACCURACY'''
score = clf.score(X_test, y_test)
print score
0.737913486005

Final step is to use the sample submission from Kaggle and report the model's predicted probability for each game.

As with the test creation, this section provides the method to iterate through the sample submission and report the model predicted probabilities my fit finds. Sets the differentials and other stats to create appropriate variables to run the model on. Again the iterative functions used add to the overall run time of this cell: (~15 mins).

In [ ]:
#######################################
###### FORMAT SUBMISSION FILE #########
#######################################
'''SET TEST DATAFRAME'''
X_sub = np.zeros(shape=(n_test_games,  12)) 

'''SETTING FEATURES'''
stat_list = ['PPG', 'FGP', 'AST', 'FGP3', 'Seed', 'FTP', 'DR', 'STL', 'BLK', 'Rank'] 
set_and_format_train(X_sub, sample_sub, stat_list)

'''Fill NaN's'''
imp = Imputer(missing_values='NaN', strategy='median', axis=1) 
imp.fit(X_sub)
X_sub = imp.fit_transform(X_sub)

And to the final part of the project.. Using the model on the Kaggle sample submission and turning in on the competition site to see how I fare. Predict_proba is used here as the competition is graded in a log loss method which scores based on the probability scores set to each game. I then clip the extreme ends of the scores as log loss methods highly deduct for highly certain false positives.

In [ ]:
'''MAKE PREDICTIONS'''
preds = clf.predict_proba(X_sub)[:,1]

'''CLIP PREDICTIONS'''
clipped_preds = np.clip(preds, 0.05, 0.95)
sample_sub.Pred = clipped_preds

'''WRITE TO CSV'''
#sample_sub.to_csv('Ada_12Feature_clipped_sub.csv', index=False)

Kaggle Results

So the decent accuracy of the model on here relates about as expected in a log loss environment. I scored 0.603 on the competition site.. I believe this equates to around 70% accuracy? Slightly better than the Basic starter kernel which only used Seed differential in predicting probabilities. I'm happy about it, but still far from the top scores in the competition.

This is not to say the model will not be effective in round two of the competition where current tourney data is unavailable. I believe my model's use of only regular season data is both a pro and a con. It can allow me to appropriately fit each game in the 2018 bracket based soley on regular season data from this year. Conversely, it also does not take into account the randomness that is seen within Tourny games- Hence the 'madness' of the bracket.

If time becomes available again for me before the beginning of the tourney, I plan to find a way to include betting market odds into the training data. I also think utilizing those dummy location variables and introducing team Id's or possibly including some lagged past season tourney placements may help the efficiency of the model as well.

How Did My Predicted Bracket Do:

Kaggle: About 50th Percntile.. I'll take it for my first competition submission.

ESPN: Top 4% of all submitted brackets in ESPN's bracket contest.. Woo!