Section two working towards a finding a model in which to predict the 2018 NCAA tourney results. This segment takes the cleaned csv file from part I and further manipulates it into a format appropriate for fitting to models. The primary problem with the current format is the fact each row revolves around a game. This code will split the rows into two- One for the winner and one for the loser.

In [2]:

import pandas as pd

The primary problem with the current format is the fact each row revolves around a game. This code will iterate through each row and split it into two- One for the winner and one for the loser. Results of a 1 for the winner and 0 for the loser are then added to provide the models label variables in which to test with.

As normal, step one will import data and look at current data. I took some time and threw together a quick correlation matrix in R before this and detemined a few unneccesary variables. Length is one. Another I'll cut later on is Offensive Rebound.

In [18]:

#import data and trim unnecessary columns
data_dir = '../data/'
df = pd.read_csv('../data/MM_Cleaned_2.csv')

df.drop(labels=['WLength', 'LLength'], inplace=True, axis=1) 

df.head()

Out[18]:

	Season	WTeamID	LTeamID	WLoc	LLoc	WSeed	LSeed	SeedDiff	WCoach	LCoach	...	LSeason_OR	LSeason_DR	Season_OR_diff	Season_DR_diff	WSeason_stl	LSeason_stl	Season_stl_diff	WSeason_blk	LSeason_blk	Season_blk_diff
0	2003	1104	1328	N	N	10.0	1	9.0	mark_gottfried	kelvin_sampson	...	13.333333	22.5	0.196078	3.911765	7.235294	5.833333	1.401961	4.176471	3.5	0.676471
1	2003	1280	1328	N	N	5.0	1	4.0	rick_stansbury	kelvin_sampson	...	13.333333	22.5	-0.666667	3.880952	9.285714	5.833333	3.452381	4.095238	3.5	0.595238
2	2003	1329	1328	H	A	6.0	1	5.0	eddie_sutton	kelvin_sampson	...	13.333333	22.5	-0.733333	2.000000	9.650000	5.833333	3.816667	5.750000	3.5	2.250000
3	2003	1400	1328	H	A	1.0	1	0.0	rick_barnes	kelvin_sampson	...	13.333333	22.5	2.893939	4.136364	6.954545	5.833333	1.121212	3.818182	3.5	0.318182
4	2003	1400	1328	A	H	1.0	1	0.0	rick_barnes	kelvin_sampson	...	13.333333	22.5	2.893939	4.136364	6.954545	5.833333	1.121212	3.818182	3.5	0.318182

5 rows × 37 columns

I then wanted to incorperate an RPI variable for each tearm/year.. A var I missed to include in part I. Thanks to Kaggle, this was easy to achieve as a dataset included in the comp had this held within. I import this and pull out RPI data, then merge with the working dataframe.

In [19]:

'''Import RPI and Connect'''
df_massey = pd.read_csv(data_dir + 'MasseyOrdinals.csv')
df_rpi = df_massey[(df_massey.SystemName == 'RPI') & (df_massey.RankingDayNum == 133)]
df_rpi.drop(labels=['SystemName', 'RankingDayNum'], inplace=True, axis=1) 

df_winrpi = df_rpi.rename(columns={'TeamID' : 'WTeamID', 'OrdinalRank' : 'WOrdinalRank'})
df_lossrpi = df_rpi.rename(columns={'TeamID' : 'LTeamID', 'OrdinalRank' : 'LOrdinalRank'})
df_dummy = pd.merge(left=df, right=df_winrpi, on=['Season', 'WTeamID'])
df = pd.merge(left=df_dummy, right=df_lossrpi, on=['Season', 'LTeamID'])
df['Rank_Diff'] = df['WOrdinalRank'] - df['LOrdinalRank']

df.head()

C:\Users\andyj\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.

Out[19]:

	Season	WTeamID	LTeamID	WLoc	LLoc	WSeed	LSeed	SeedDiff	WCoach	LCoach	...	Season_DR_diff	WSeason_stl	LSeason_stl	Season_stl_diff	WSeason_blk	LSeason_blk	Season_blk_diff	WOrdinalRank	LOrdinalRank	Rank_Diff
0	2003	1104	1328	N	N	10.0	1	9.0	mark_gottfried	kelvin_sampson	...	3.911765	7.235294	5.833333	1.401961	4.176471	3.5	0.676471	38	3	35
1	2003	1280	1328	N	N	5.0	1	4.0	rick_stansbury	kelvin_sampson	...	3.880952	9.285714	5.833333	3.452381	4.095238	3.5	0.595238	24	3	21
2	2003	1329	1328	H	A	6.0	1	5.0	eddie_sutton	kelvin_sampson	...	2.000000	9.650000	5.833333	3.816667	5.750000	3.5	2.250000	22	3	19
3	2003	1400	1328	H	A	1.0	1	0.0	rick_barnes	kelvin_sampson	...	4.136364	6.954545	5.833333	1.121212	3.818182	3.5	0.318182	4	3	1
4	2003	1400	1328	A	H	1.0	1	0.0	rick_barnes	kelvin_sampson	...	4.136364	6.954545	5.833333	1.121212	3.818182	3.5	0.318182	4	3	1

5 rows × 40 columns

Not to split the rows into winners and losers.. with included 1, 0 results. I simply pull each var of interest andf put it into either a win or loss dataframe. I incorperate the differential scores by adding the actual to the winning team and the opposite sign for the losing team... These vars will be the primary values of interest in model I imagine.

In [20]:

'''TRAIN cols'''
df_wins = pd.DataFrame()
df_wins['Season'] = df['Season']
df_wins['TeamID'] = df['WTeamID']
df_wins['OtherTeamID'] = df['LTeamID']
df_wins['Loc'] = df['WLoc']
df_wins['Coach'] = df['WCoach']
df_wins['Seed'] = df['WSeed']

df_wins['PPG'] = df['WSeason_ppg']
df_wins['AST'] = df['WSeason_ast']
df_wins['FGP'] = df['WSeason_FGPercent']
df_wins['FGP3'] = df['WSeason_FGPercent3']
df_wins['FTP'] = df['WSeason_FTPercent']
df_wins['OR'] = df['WSeason_OR']
df_wins['DR'] = df['WSeason_DR']
df_wins['STL'] = df['WSeason_stl']
df_wins['BLK'] = df['WSeason_blk']
df_wins['Rank'] = df['WOrdinalRank']

df_wins['PPG_Diff'] = df['Season_ppg_diff']
df_wins['SEED_Diff'] = df['SeedDiff']
df_wins['FGP_Diff'] = df['Season_FGP_Diff']
df_wins['AST_Diff'] = df['Season_ast_diff']
df_wins['FGP3_Diff'] = df['Season_FGP3_Diff']
df_wins['FTP_Diff'] = df['Season_FTP_Diff']
df_wins['OR_Diff'] = df['Season_OR_diff']
df_wins['DR_Diff'] = df['Season_DR_diff'] 
df_wins['STL_Diff'] = df['Season_stl_diff']
df_wins['BLK_Diff'] = df['Season_blk_diff']
df_wins['Rank_Diff'] = df['Rank_Diff']
df_wins['Result'] = 1

df_losses = pd.DataFrame()
df_losses['Season'] = df['Season']
df_losses['TeamID'] = df['LTeamID']
df_losses['OtherTeamID'] = df['WTeamID']
df_losses['Loc'] = df['LLoc']
df_losses['Coach'] = df['LCoach']
df_losses['Seed'] = df['LSeed']

df_losses['PPG'] = df['LSeason_ppg']
df_losses['AST'] = df['LSeason_ast']
df_losses['FGP'] = df['LSeason_FGPercent']
df_losses['FGP3'] = df['LSeason_FGPercent3']
df_losses['FTP'] = df['LSeason_FTPercent']
df_losses['OR'] = df['LSeason_OR']
df_losses['DR'] = df['LSeason_DR']
df_losses['STL'] = df['LSeason_stl']
df_losses['BLK'] = df['LSeason_blk']
df_wins['Rank'] = df['LOrdinalRank']

df_losses['PPG_Diff'] = -df['Season_ppg_diff']
df_losses['SEED_Diff'] = -df['SeedDiff']
df_losses['FGP_Diff'] = -df['Season_FGP_Diff']
df_losses['AST_Diff'] = -df['Season_ast_diff']
df_losses['FGP3_Diff'] = -df['Season_FGP3_Diff']
df_losses['FTP_Diff'] = -df['Season_FTP_Diff']
df_losses['OR_Diff'] = -df['Season_OR_diff']
df_losses['DR_Diff'] = -df['Season_DR_diff'] 
df_losses['STL_Diff'] = -df['Season_stl_diff']
df_losses['BLK_Diff'] = -df['Season_blk_diff']
df_losses['Rank_Diff'] = -df['Rank_Diff']
df_losses['Result'] = 0

With these splits done, lets combine them!

In [21]:

df_predictions = pd.concat((df_wins, df_losses))
df_predictions.TeamID = df_predictions.TeamID.astype('category')
df_predictions.head()

Out[21]:

	AST	AST_Diff	BLK	BLK_Diff	Coach	DR	DR_Diff	FGP	FGP3	FGP3_Diff	...	PPG_Diff	Rank	Rank_Diff	Result	SEED_Diff	STL	STL_Diff	Season	Seed	TeamID
0	14.000000	4.666667	4.176471	0.676471	mark_gottfried	26.411765	3.911765	0.444393	0.347418	0.040750	...	18.539216	3.0	35	1	9.0	7.235294	1.401961	2003	10.0	1104
1	15.380952	6.047619	4.095238	0.595238	rick_stansbury	26.380952	3.880952	0.495357	0.361980	0.055311	...	17.500000	3.0	21	1	4.0	9.285714	3.452381	2003	5.0	1280
2	13.400000	4.066667	5.750000	2.250000	eddie_sutton	24.500000	2.000000	0.474798	0.382793	0.076124	...	17.233333	3.0	19	1	5.0	9.650000	3.816667	2003	6.0	1329
3	14.590909	5.257576	3.818182	0.318182	rick_barnes	26.636364	4.136364	0.456127	0.343126	0.036458	...	22.060606	3.0	1	1	0.0	6.954545	1.121212	2003	1.0	1400
4	14.590909	5.257576	3.818182	0.318182	rick_barnes	26.636364	4.136364	0.456127	0.343126	0.036458	...	22.060606	3.0	1	1	0.0	6.954545	1.121212	2003	1.0	1400

5 rows × 28 columns

Looks all good. We have a new dataframe in pandas with each team's stats per year//other information of interest along with the necessary label variable of game Result (w/l). Time to print it and I'll have the final model building phase up soon!

In [22]:

'''PRINT OUT CLEAN CSV'''
df_predictions.to_csv('../data/train_data_diff.csv', index=False)