Section two working towards a finding a model in which to predict the 2018 NCAA tourney results. This segment takes the cleaned csv file from part I and further manipulates it into a format appropriate for fitting to models. The primary problem with the current format is the fact each row revolves around a game. This code will split the rows into two- One for the winner and one for the loser.

In [2]:
import pandas as pd 

The primary problem with the current format is the fact each row revolves around a game. This code will iterate through each row and split it into two- One for the winner and one for the loser. Results of a 1 for the winner and 0 for the loser are then added to provide the models label variables in which to test with.

As normal, step one will import data and look at current data. I took some time and threw together a quick correlation matrix in R before this and detemined a few unneccesary variables. Length is one. Another I'll cut later on is Offensive Rebound.

In [18]:
#import data and trim unnecessary columns
data_dir = '../data/'
df = pd.read_csv('../data/MM_Cleaned_2.csv')

df.drop(labels=['WLength', 'LLength'], inplace=True, axis=1) 

df.head()
Out[18]:
Season WTeamID LTeamID WLoc LLoc WSeed LSeed SeedDiff WCoach LCoach ... LSeason_OR LSeason_DR Season_OR_diff Season_DR_diff WSeason_stl LSeason_stl Season_stl_diff WSeason_blk LSeason_blk Season_blk_diff
0 2003 1104 1328 N N 10.0 1 9.0 mark_gottfried kelvin_sampson ... 13.333333 22.5 0.196078 3.911765 7.235294 5.833333 1.401961 4.176471 3.5 0.676471
1 2003 1280 1328 N N 5.0 1 4.0 rick_stansbury kelvin_sampson ... 13.333333 22.5 -0.666667 3.880952 9.285714 5.833333 3.452381 4.095238 3.5 0.595238
2 2003 1329 1328 H A 6.0 1 5.0 eddie_sutton kelvin_sampson ... 13.333333 22.5 -0.733333 2.000000 9.650000 5.833333 3.816667 5.750000 3.5 2.250000
3 2003 1400 1328 H A 1.0 1 0.0 rick_barnes kelvin_sampson ... 13.333333 22.5 2.893939 4.136364 6.954545 5.833333 1.121212 3.818182 3.5 0.318182
4 2003 1400 1328 A H 1.0 1 0.0 rick_barnes kelvin_sampson ... 13.333333 22.5 2.893939 4.136364 6.954545 5.833333 1.121212 3.818182 3.5 0.318182

5 rows × 37 columns

I then wanted to incorperate an RPI variable for each tearm/year.. A var I missed to include in part I. Thanks to Kaggle, this was easy to achieve as a dataset included in the comp had this held within. I import this and pull out RPI data, then merge with the working dataframe.

In [19]:
'''Import RPI and Connect'''
df_massey = pd.read_csv(data_dir + 'MasseyOrdinals.csv')
df_rpi = df_massey[(df_massey.SystemName == 'RPI') & (df_massey.RankingDayNum == 133)]
df_rpi.drop(labels=['SystemName', 'RankingDayNum'], inplace=True, axis=1) 

df_winrpi = df_rpi.rename(columns={'TeamID' : 'WTeamID', 'OrdinalRank' : 'WOrdinalRank'})
df_lossrpi = df_rpi.rename(columns={'TeamID' : 'LTeamID', 'OrdinalRank' : 'LOrdinalRank'})
df_dummy = pd.merge(left=df, right=df_winrpi, on=['Season', 'WTeamID'])
df = pd.merge(left=df_dummy, right=df_lossrpi, on=['Season', 'LTeamID'])
df['Rank_Diff'] = df['WOrdinalRank'] - df['LOrdinalRank']

df.head()
C:\Users\andyj\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
Out[19]:
Season WTeamID LTeamID WLoc LLoc WSeed LSeed SeedDiff WCoach LCoach ... Season_DR_diff WSeason_stl LSeason_stl Season_stl_diff WSeason_blk LSeason_blk Season_blk_diff WOrdinalRank LOrdinalRank Rank_Diff
0 2003 1104 1328 N N 10.0 1 9.0 mark_gottfried kelvin_sampson ... 3.911765 7.235294 5.833333 1.401961 4.176471 3.5 0.676471 38 3 35
1 2003 1280 1328 N N 5.0 1 4.0 rick_stansbury kelvin_sampson ... 3.880952 9.285714 5.833333 3.452381 4.095238 3.5 0.595238 24 3 21
2 2003 1329 1328 H A 6.0 1 5.0 eddie_sutton kelvin_sampson ... 2.000000 9.650000 5.833333 3.816667 5.750000 3.5 2.250000 22 3 19
3 2003 1400 1328 H A 1.0 1 0.0 rick_barnes kelvin_sampson ... 4.136364 6.954545 5.833333 1.121212 3.818182 3.5 0.318182 4 3 1
4 2003 1400 1328 A H 1.0 1 0.0 rick_barnes kelvin_sampson ... 4.136364 6.954545 5.833333 1.121212 3.818182 3.5 0.318182 4 3 1

5 rows × 40 columns

Not to split the rows into winners and losers.. with included 1, 0 results. I simply pull each var of interest andf put it into either a win or loss dataframe. I incorperate the differential scores by adding the actual to the winning team and the opposite sign for the losing team... These vars will be the primary values of interest in model I imagine.

In [20]:
'''TRAIN cols'''
df_wins = pd.DataFrame()
df_wins['Season'] = df['Season']
df_wins['TeamID'] = df['WTeamID']
df_wins['OtherTeamID'] = df['LTeamID']
df_wins['Loc'] = df['WLoc']
df_wins['Coach'] = df['WCoach']
df_wins['Seed'] = df['WSeed']

df_wins['PPG'] = df['WSeason_ppg']
df_wins['AST'] = df['WSeason_ast']
df_wins['FGP'] = df['WSeason_FGPercent']
df_wins['FGP3'] = df['WSeason_FGPercent3']
df_wins['FTP'] = df['WSeason_FTPercent']
df_wins['OR'] = df['WSeason_OR']
df_wins['DR'] = df['WSeason_DR']
df_wins['STL'] = df['WSeason_stl']
df_wins['BLK'] = df['WSeason_blk']
df_wins['Rank'] = df['WOrdinalRank']

df_wins['PPG_Diff'] = df['Season_ppg_diff']
df_wins['SEED_Diff'] = df['SeedDiff']
df_wins['FGP_Diff'] = df['Season_FGP_Diff']
df_wins['AST_Diff'] = df['Season_ast_diff']
df_wins['FGP3_Diff'] = df['Season_FGP3_Diff']
df_wins['FTP_Diff'] = df['Season_FTP_Diff']
df_wins['OR_Diff'] = df['Season_OR_diff']
df_wins['DR_Diff'] = df['Season_DR_diff'] 
df_wins['STL_Diff'] = df['Season_stl_diff']
df_wins['BLK_Diff'] = df['Season_blk_diff']
df_wins['Rank_Diff'] = df['Rank_Diff']
df_wins['Result'] = 1

df_losses = pd.DataFrame()
df_losses['Season'] = df['Season']
df_losses['TeamID'] = df['LTeamID']
df_losses['OtherTeamID'] = df['WTeamID']
df_losses['Loc'] = df['LLoc']
df_losses['Coach'] = df['LCoach']
df_losses['Seed'] = df['LSeed']

df_losses['PPG'] = df['LSeason_ppg']
df_losses['AST'] = df['LSeason_ast']
df_losses['FGP'] = df['LSeason_FGPercent']
df_losses['FGP3'] = df['LSeason_FGPercent3']
df_losses['FTP'] = df['LSeason_FTPercent']
df_losses['OR'] = df['LSeason_OR']
df_losses['DR'] = df['LSeason_DR']
df_losses['STL'] = df['LSeason_stl']
df_losses['BLK'] = df['LSeason_blk']
df_wins['Rank'] = df['LOrdinalRank']

df_losses['PPG_Diff'] = -df['Season_ppg_diff']
df_losses['SEED_Diff'] = -df['SeedDiff']
df_losses['FGP_Diff'] = -df['Season_FGP_Diff']
df_losses['AST_Diff'] = -df['Season_ast_diff']
df_losses['FGP3_Diff'] = -df['Season_FGP3_Diff']
df_losses['FTP_Diff'] = -df['Season_FTP_Diff']
df_losses['OR_Diff'] = -df['Season_OR_diff']
df_losses['DR_Diff'] = -df['Season_DR_diff'] 
df_losses['STL_Diff'] = -df['Season_stl_diff']
df_losses['BLK_Diff'] = -df['Season_blk_diff']
df_losses['Rank_Diff'] = -df['Rank_Diff']
df_losses['Result'] = 0

With these splits done, lets combine them!

In [21]:
df_predictions = pd.concat((df_wins, df_losses))
df_predictions.TeamID = df_predictions.TeamID.astype('category')
df_predictions.head()
Out[21]:
AST AST_Diff BLK BLK_Diff Coach DR DR_Diff FGP FGP3 FGP3_Diff ... PPG_Diff Rank Rank_Diff Result SEED_Diff STL STL_Diff Season Seed TeamID
0 14.000000 4.666667 4.176471 0.676471 mark_gottfried 26.411765 3.911765 0.444393 0.347418 0.040750 ... 18.539216 3.0 35 1 9.0 7.235294 1.401961 2003 10.0 1104
1 15.380952 6.047619 4.095238 0.595238 rick_stansbury 26.380952 3.880952 0.495357 0.361980 0.055311 ... 17.500000 3.0 21 1 4.0 9.285714 3.452381 2003 5.0 1280
2 13.400000 4.066667 5.750000 2.250000 eddie_sutton 24.500000 2.000000 0.474798 0.382793 0.076124 ... 17.233333 3.0 19 1 5.0 9.650000 3.816667 2003 6.0 1329
3 14.590909 5.257576 3.818182 0.318182 rick_barnes 26.636364 4.136364 0.456127 0.343126 0.036458 ... 22.060606 3.0 1 1 0.0 6.954545 1.121212 2003 1.0 1400
4 14.590909 5.257576 3.818182 0.318182 rick_barnes 26.636364 4.136364 0.456127 0.343126 0.036458 ... 22.060606 3.0 1 1 0.0 6.954545 1.121212 2003 1.0 1400

5 rows × 28 columns

Looks all good. We have a new dataframe in pandas with each team's stats per year//other information of interest along with the necessary label variable of game Result (w/l). Time to print it and I'll have the final model building phase up soon!

In [22]:
'''PRINT OUT CLEAN CSV'''
df_predictions.to_csv('../data/train_data_diff.csv', index=False)