First step in creating a 2018 March Madness Bracket predictor for both personal practice with scikit-learn and to compete in the Kaggle competition this year. Part 1 revolves around cleaning the data and creating new variables to use in modeling.

In [ ]:

import pandas as pd

I've been meaning to give this a try for a few years now and a timely week off from school provided me the chance to delve into this project. Multiple datasets obtained from Kaggle.com are the primary resources used throughout the process.. This worked out to also provide me experience in contributing to a Kaggle competition- something else I have been meaning to do for a long time.

Step One - Import the data:

In [1]:

#import seed and results data
data_dir = '../data/'
df_seeds = pd.read_csv(data_dir + 'NCAATourneySeeds.csv')
df_tour = pd.read_csv(data_dir + 'RegularSeasonDetailedResults.csv')

df_seeds.head()

df_tour.head()

Out[1]:

	Season	DayNum	WTeamID	WScore	LTeamID	LScore	WLoc	WFGM	WFGA	...	LFGA3	LFTM	LFTA	LOR	LDR	LAst	LTO	LStl	LBlk	LPF
0	2003	10	1104	68	1328	62	N	27	58	...	10	16	22	10	22	8	18	9	2	20
1	2003	10	1272	70	1393	63	N	26	62	...	24	9	20	20	25	7	12	8	6	16
2	2003	11	1266	73	1437	61	N	24	58	...	26	14	23	31	22	9	12	2	5	23
3	2003	11	1296	56	1457	50	N	18	38	...	22	8	15	17	20	9	19	4	3	23
4	2003	11	1400	77	1208	71	N	30	61	...	16	17	27	21	15	12	10	7	1	14

5 rows × 34 columns

Next step I set down some functions to add new columns to the data. Thanks to Kaggle this dataset is largely clean and organized, therefore most of this process is creating new columns and manipulating the data. I found the most useful manipulations for predicting game winners was to average out regular season stats and then figure the differential between the two teams. Although tourney stats are available as inputs, regular season makes more sense when predicting the results of the coming tourney.

With that... Setting loser location, cutting region off of seed variable (Function obtained from *Basic Starter kernel - Julia Elliott), and dropping unneeded columns:

In [2]:

'''Function to set Losing Location'''
def get_lose_loc(win_loc):
    lose_loc = ''
    if win_loc == 'N':
        lose_loc = 'N'  
    elif win_loc == 'H':
        lose_loc = 'A'  
    else:
        lose_loc = 'H'
    return lose_loc

#Get Loser Location
df_tour['LLoc'] = df_tour.apply(lambda row: get_lose_loc(row.WLoc), axis = 1)  

""" SET SEED TO INT ONLY """
def seed_to_int(seed):
    #Get just the digits from the seeding. Return as int
    s_int = int(seed[1:3])
    return s_int
df_seeds['seed_int'] = df_seeds.Seed.apply(seed_to_int)
df_seeds.drop(labels=['Seed'], inplace=True, axis=1) # This is the string label


"""DROP COLS and ADD SCORE DIFF"""
df_tour.drop(labels=['DayNum', 'NumOT'], inplace=True, axis=1)

#ADD score differential
df_tour['Point_Diff'] = df_tour.WScore - df_tour.LScore
df_tour.head()

Out[2]:

	Season	WTeamID	WScore	LTeamID	LScore	WLoc	WFGM	WFGA	WFGM3	WFGA3	...	LFTA	LOR	LDR	LAst	LTO	LStl	LBlk	LPF	LLoc	Point_Diff
0	2003	1104	68	1328	62	N	27	58	3	14	...	22	10	22	8	18	9	2	20	N	6
1	2003	1272	70	1393	63	N	26	62	8	20	...	20	20	25	7	12	8	6	16	N	7
2	2003	1266	73	1437	61	N	24	58	8	18	...	23	31	22	9	12	2	5	23	N	12
3	2003	1296	56	1457	50	N	18	38	3	9	...	15	17	20	9	19	4	3	23	N	6
4	2003	1400	77	1208	71	N	30	61	6	14	...	27	21	15	12	10	7	1	14	N	6

5 rows × 34 columns

Next up I used the same process more or less to import or create new variables and merge them into the working dataframe. Pandas made this quite simple to do and allowed me to spend more time thinking of important variables rather than figuring how to connect the data. Almost all introduced variables include a 'differential' variables as well.

In [4]:

"""MERGE SEED and TEAM"""
df_winseeds = df_seeds.rename(columns={'TeamID':'WTeamID', 'seed_int':'WSeed'})
df_lossseeds = df_seeds.rename(columns={'TeamID':'LTeamID', 'seed_int':'LSeed'})
df_dummy = pd.merge(left=df_tour, right=df_winseeds, how='left', on=['Season', 'WTeamID']).fillna(17)
df_concat = pd.merge(left=df_dummy, right=df_lossseeds, on=['Season', 'LTeamID']).fillna(17)

df_concat['SeedDiff'] = df_concat.WSeed - df_concat.LSeed
df_concat.head()


"""IMPORT COACHES AND MERGE WITH CONCAT"""
df_coaches = pd.read_csv(data_dir + "TeamCoaches.csv")
df_coaches.drop(labels=['FirstDayNum', 'LastDayNum'], inplace=True, axis=1) 

df_wincoaches = df_coaches.rename(columns={'TeamID':'WTeamID', 'CoachName':'WCoach'})
df_losscoaches = df_coaches.rename(columns={'TeamID':'LTeamID', 'CoachName':'LCoach'})
df_dummy = pd.merge(left=df_concat, right=df_wincoaches, on=['Season', 'WTeamID'])
df_working = pd.merge(left=df_dummy, right=df_losscoaches, on=['Season', 'LTeamID'])

df_working.head()

Out[4]:

	Season	WTeamID	WScore	LTeamID	LScore	WLoc	WFGM	WFGA	WFGM3	WFGA3	...	LStl	LBlk	LPF	LLoc	Point_Diff	WSeed	LSeed	SeedDiff	WCoach	LCoach
0	2003	1104	68	1328	62	N	27	58	3	14	...	9	2	20	N	6	10.0	1	9.0	mark_gottfried	kelvin_sampson
1	2003	1280	54	1328	45	N	18	50	3	13	...	8	4	18	N	9	5.0	1	4.0	rick_stansbury	kelvin_sampson
2	2003	1329	48	1328	46	H	19	49	1	8	...	4	3	18	A	2	6.0	1	5.0	eddie_sutton	kelvin_sampson
3	2003	1400	67	1328	61	H	25	64	5	14	...	5	3	20	A	6	1.0	1	0.0	rick_barnes	kelvin_sampson
4	2003	1400	76	1328	71	A	27	50	6	11	...	8	6	23	H	5	1.0	1	0.0	rick_barnes	kelvin_sampson

5 rows × 39 columns

Continuing on... Adding years in the division tourneys.

In [5]:

"""ADD TIME IN DIV ONE"""
df_length = pd.read_csv(data_dir + "Teams.csv")
df_length['length'] = df_length.LastD1Season - df_length.FirstD1Season
df_length.drop(labels=['TeamName', 'FirstD1Season', 'LastD1Season'], inplace=True, axis=1)

#MERGE
df_winlength = df_length.rename(columns={'TeamID':'WTeamID', 'length':'WLength'})
df_losslength = df_length.rename(columns={'TeamID':'LTeamID', 'length':'LLength'})
df_dummy = pd.merge(left=df_working, right=df_winlength, on=['WTeamID'])
df_working = pd.merge(left=df_dummy, right=df_losslength, on=['LTeamID'])

df_working.head()

Out[5]:

	Season	WTeamID	WScore	LTeamID	LScore	WLoc	WFGM	WFGA	WFGM3	WFGA3	...	LPF	LLoc	Point_Diff	WSeed	LSeed	SeedDiff	WCoach	LCoach	WLength	LLength
0	2003	1104	68	1328	62	N	27	58	3	14	...	20	N	6	10.0	1	9.0	mark_gottfried	kelvin_sampson	33	33
1	2003	1280	54	1328	45	N	18	50	3	13	...	18	N	9	5.0	1	4.0	rick_stansbury	kelvin_sampson	33	33
2	2003	1329	48	1328	46	H	19	49	1	8	...	18	A	2	6.0	1	5.0	eddie_sutton	kelvin_sampson	33	33
3	2005	1329	79	1328	67	H	30	57	8	14	...	18	A	12	2.0	3	-1.0	eddie_sutton	kelvin_sampson	33	33
4	2009	1329	71	1328	70	N	25	60	9	25	...	18	N	1	8.0	2	6.0	travis_ford	jeff_capel	33	33

5 rows × 41 columns

Andd now a long tedious process of averaging out each stat available in the regular season dataset- for both winner and loser- and merging it to the working dataframe, including differential. Basically the same code was copy-pasted for each stat. In retrospect, I should have written a generalized function and just run for each stat in a list.

In [6]:

"""EXPLORE AND INCLUDE DETAILED RESULTS"""
df_detailed = pd.read_csv(data_dir + 'RegularSeasonDetailedResults.csv')

#POINTS PER GAME
df_Wppg = df_detailed.copy()
df_Wppg.drop(labels=['LTeamID', 'LScore', 'DayNum', 'WLoc', 'NumOT', 
                     'WFGM', 'WFGA', 'WFGM3', 'WFGA3','WFTM', 'WFTA', 
                     'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
WSeason_ppg = df_Wppg.groupby(['Season', 'WTeamID'])['WScore'].mean()
WSeason_ppg = WSeason_ppg.to_frame()
WSeason_ppg.reset_index(inplace=True)  
#MERGE TO WORKING
WSeason_ppg.columns = ["Season", "WTeamID", "WSeason_ppg"]
df_working = pd.merge(left=df_working, right=WSeason_ppg, on=['Season', 'WTeamID'])

##LOSING PPG
df_Lppg = df_detailed.copy()
df_Lppg.drop(labels=['WTeamID', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                     'WFGM', 'WFGA', 'WFGM3', 'WFGA3','WFTM', 'WFTA', 
                     'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
LSeason_ppg = df_Lppg.groupby(['Season', 'LTeamID'])['LScore'].mean()
LSeason_ppg = LSeason_ppg.to_frame()
LSeason_ppg.reset_index(inplace=True)  
#MERGE TO WORKING
LSeason_ppg.columns = ["Season", "LTeamID", "LSeason_ppg"]
df_working = pd.merge(left=df_working, right=LSeason_ppg, on=['Season', 'LTeamID'])

##SEASON PPG DIFFERENTIAL
df_working['Season_ppg_diff'] = df_working['WSeason_ppg'] - df_working['LSeason_ppg']


########## ASSISTS PER GAME SEAONS
df_assist = df_detailed.copy()
df_assist.drop(labels=['LTeamID', 'LScore', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                     'WFGM', 'WFGA', 'WFGM3', 'WFGA3','WFTM', 'WFTA', 
                     'WOR', 'WDR', 'WTO', 'WStl', 'WBlk', 'WPF', 
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
WSeason_ast = df_assist.groupby(['Season', 'WTeamID'])['WAst'].mean()
WSeason_ast = WSeason_ast.to_frame()
WSeason_ast.reset_index(inplace=True)  
#MERGE TO WORKING
WSeason_ast.columns = ["Season", "WTeamID", "WSeason_ast"]
df_working = pd.merge(left=df_working, right=WSeason_ast, on=['Season', 'WTeamID'])


##LOSING ASSIST
df_assist = df_detailed.copy()
df_assist.drop(labels=['WTeamID', 'WScore', 'LScore', 'DayNum', 'WLoc', 'NumOT', 
                     'WFGM', 'WFGA', 'WFGM3', 'WFGA3','WFTM', 'WFTA', 
                     'WOR', 'WDR', 'WTO', 'WStl', 'WBlk', 'WPF', 
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'WAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
LSeason_ast = df_assist.groupby(['Season', 'LTeamID'])['LAst'].mean()
LSeason_ast = LSeason_ast.to_frame()
LSeason_ast.reset_index(inplace=True)  
#MERGE TO WORKING
LSeason_ast.columns = ["Season", "LTeamID", "LSeason_ast"]
df_working = pd.merge(left=df_working, right=LSeason_ast, on=['Season', 'LTeamID'])

#ASSIST SDIFF
df_working['Season_ast_diff'] = df_working['WSeason_ast'] - df_working['LSeason_ast']


#FIELD GOAL PERCENTAGES
df_fgp = df_detailed.copy()
df_fgp.drop(labels=['LTeamID', 'LScore', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                      'WFGM3', 'WFGA3','WFTM', 'WFTA', 
                     'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
df_fgp['WPerent'] = df_fgp['WFGM'] / df_fgp['WFGA']    
WSeason_FGPercent = df_fgp.groupby(['Season', 'WTeamID'])['WPerent'].mean()
WSeason_FGPercent = WSeason_FGPercent.to_frame()
WSeason_FGPercent.reset_index(inplace=True)  
#MERGE TO WORKING
WSeason_FGPercent.columns = ["Season", "WTeamID", "WSeason_FGPercent"]
df_working = pd.merge(left=df_working, right=WSeason_FGPercent, on=['Season', 'WTeamID'])

##LOSING FGPercent
df_fgp = df_detailed.copy()
df_fgp.drop(labels=['WTeamID', 'WScore', 'LScore', 'DayNum', 'WLoc', 'NumOT', 
                      'WFGM3', 'WFGA3','WFTM', 'WFTA', 
                     'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 
                     'WFGM', 'WFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
df_fgp['LPerent'] = df_fgp['LFGM'] / df_fgp['LFGA']    
LSeason_FGPercent = df_fgp.groupby(['Season', 'LTeamID'])['LPerent'].mean()
LSeason_FGPercent = LSeason_FGPercent.to_frame()
LSeason_FGPercent.reset_index(inplace=True)  
#MERGE TO WORKING
LSeason_FGPercent.columns = ["Season", "LTeamID", "LSeason_FGPercent"]
df_working = pd.merge(left=df_working, right=LSeason_FGPercent, on=['Season', 'LTeamID'])

#SEASON FGPERCENT DIFF
df_working['Season_FGP_Diff'] = df_working['WSeason_FGPercent'] - df_working['LSeason_FGPercent']


#####THREES####################################
#THREE PERCENTAGES
df_fgp = df_detailed.copy()
df_fgp.drop(labels=['LTeamID', 'LScore', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                      'WFTM', 'WFTA', 'WFGM', 'WFGA',
                     'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
df_fgp['WPerent'] = df_fgp['WFGM3'] / df_fgp['WFGA3']    
WSeason_FGPercent = df_fgp.groupby(['Season', 'WTeamID'])['WPerent'].mean()
WSeason_FGPercent = WSeason_FGPercent.to_frame()
WSeason_FGPercent.reset_index(inplace=True)  
#MERGE TO WORKING
WSeason_FGPercent.columns = ["Season", "WTeamID", "WSeason_FGPercent3"]
df_working = pd.merge(left=df_working, right=WSeason_FGPercent, on=['Season', 'WTeamID'])

##LOSING FGPercent
df_fgp = df_detailed.copy()
df_fgp.drop(labels=['WTeamID', 'WScore', 'LScore', 'DayNum', 'WLoc', 'NumOT', 
                      'WFGM3', 'WFGA3','WFTM', 'WFTA', 
                     'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 
                     'WFGM', 'WFGA', 'LFGM', 'LFGA','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
df_fgp['LPerent'] = df_fgp['LFGM3'] / df_fgp['LFGA3']    
LSeason_FGPercent = df_fgp.groupby(['Season', 'LTeamID'])['LPerent'].mean()
LSeason_FGPercent = LSeason_FGPercent.to_frame()
LSeason_FGPercent.reset_index(inplace=True)  
#MERGE TO WORKING
LSeason_FGPercent.columns = ["Season", "LTeamID", "LSeason_FGPercent3"]
df_working = pd.merge(left=df_working, right=LSeason_FGPercent, on=['Season', 'LTeamID'])

#SEASON FGPERCENT DIFF
df_working['Season_FGP3_Diff'] = df_working['WSeason_FGPercent3'] - df_working['LSeason_FGPercent3']



######FREE THROWSSS#################3
#FIELD GOAL PERCENTAGES
df_fgp = df_detailed.copy()
df_fgp.drop(labels=['LTeamID', 'LScore', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                      'WFGM3', 'WFGA3','WFGM', 'WFGA', 
                     'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
df_fgp['WPerent'] = df_fgp['WFTM'] / df_fgp['WFTA']    
WSeason_FGPercent = df_fgp.groupby(['Season', 'WTeamID'])['WPerent'].mean()
WSeason_FGPercent = WSeason_FGPercent.to_frame()
WSeason_FGPercent.reset_index(inplace=True)  
#MERGE TO WORKING
WSeason_FGPercent.columns = ["Season", "WTeamID", "WSeason_FTPercent"]
df_working = pd.merge(left=df_working, right=WSeason_FGPercent, on=['Season', 'WTeamID'])

##LOSING FGPercent
df_fgp = df_detailed.copy()
df_fgp.drop(labels=['WTeamID', 'WScore', 'LScore', 'DayNum', 'WLoc', 'NumOT', 
                      'WFGM3', 'WFGA3','WFTM', 'WFTA', 
                     'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 
                     'WFGM', 'WFGA', 'LFGM3', 'LFGA3','LFGM', 'LFGA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
df_fgp['LPerent'] = df_fgp['LFTM'] / df_fgp['LFTA']    
LSeason_FGPercent = df_fgp.groupby(['Season', 'LTeamID'])['LPerent'].mean()
LSeason_FGPercent = LSeason_FGPercent.to_frame()
LSeason_FGPercent.reset_index(inplace=True)  
#MERGE TO WORKING
LSeason_FGPercent.columns = ["Season", "LTeamID", "LSeason_FTPercent"]
df_working = pd.merge(left=df_working, right=LSeason_FGPercent, on=['Season', 'LTeamID'])

#SEASON FGPERCENT DIFF
df_working['Season_FTP_Diff'] = df_working['WSeason_FTPercent'] - df_working['LSeason_FTPercent']


########## REBOUNDS PER GAME SEAONS
df_rebounds = df_detailed.copy()
df_rebounds.drop(labels=['LTeamID', 'LScore', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                     'WFGM', 'WFGA', 'WFGM3', 'WFGA3','WFTM', 'WFTA', 
                      'WTO', 'WStl', 'WBlk', 'WPF', 'WAst',
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
WSeason_OR = df_rebounds.groupby(['Season', 'WTeamID'])['WOR'].mean()
WSeason_OR = WSeason_OR.to_frame()
WSeason_OR.reset_index(inplace=True)  
#MERGE TO WORKING
WSeason_OR.columns = ["Season", "WTeamID", "WSeason_OR"]
df_working = pd.merge(left=df_working, right=WSeason_OR, on=['Season', 'WTeamID'])

##Defensive Rebounds
df_rebounds = df_detailed.copy()
df_rebounds.drop(labels=['LTeamID', 'LScore', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                     'WFGM', 'WFGA', 'WFGM3', 'WFGA3','WFTM', 'WFTA', 
                      'WTO', 'WStl', 'WBlk', 'WPF', 'WAst',
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
WSeason_DR = df_rebounds.groupby(['Season', 'WTeamID'])['WDR'].mean()
WSeason_DR = WSeason_DR.to_frame()
WSeason_DR.reset_index(inplace=True)  
#MERGE TO WORKING
WSeason_DR.columns = ["Season", "WTeamID", "WSeason_DR"]
df_working = pd.merge(left=df_working, right=WSeason_DR, on=['Season', 'WTeamID'])


####LOSING REBOUNDS
df_rebounds = df_detailed.copy()
df_rebounds.drop(labels=['WTeamID', 'LScore', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                     'WFGM', 'WFGA', 'WFGM3', 'WFGA3','WFTM', 'WFTA', 
                      'WTO', 'WStl', 'WBlk', 'WPF', 'WAst', 'WOR', 'WDR',
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                      'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
LSeason_OR = df_rebounds.groupby(['Season', 'LTeamID'])['LOR'].mean()
LSeason_OR = LSeason_OR.to_frame()
LSeason_OR.reset_index(inplace=True)  
#MERGE TO WORKING
LSeason_OR.columns = ["Season", "LTeamID", "LSeason_OR"]
df_working = pd.merge(left=df_working, right=LSeason_OR, on=['Season', 'LTeamID'])

##Defensive Rebounds
df_rebounds = df_detailed.copy()
df_rebounds.drop(labels=['WTeamID', 'LScore', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                     'WFGM', 'WFGA', 'WFGM3', 'WFGA3','WFTM', 'WFTA', 
                      'WTO', 'WStl', 'WBlk', 'WPF', 'WAst', 'WOR', 'WDR',
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                      'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
LSeason_DR = df_rebounds.groupby(['Season', 'LTeamID'])['LDR'].mean()
LSeason_DR = LSeason_DR.to_frame()
LSeason_DR.reset_index(inplace=True)  
#MERGE TO WORKING
LSeason_DR.columns = ["Season", "LTeamID", "LSeason_DR"]
df_working = pd.merge(left=df_working, right=LSeason_DR, on=['Season', 'LTeamID'])


#rebound SDIFF
df_working['Season_OR_diff'] = df_working['WSeason_OR'] - df_working['LSeason_OR']
df_working['Season_DR_diff'] = df_working['WSeason_DR'] - df_working['LSeason_DR']


##STEALS
df_steals = df_detailed.copy()
df_steals.drop(labels=['LTeamID', 'LScore', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                     'WFGM', 'WFGA', 'WFGM3', 'WFGA3','WFTM', 'WFTA', 'WAst',
                     'WOR', 'WDR', 'WTO', 'WBlk', 'WPF', 
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
WSeason_stl = df_steals.groupby(['Season', 'WTeamID'])['WStl'].mean()
WSeason_stl = WSeason_stl.to_frame()
WSeason_stl.reset_index(inplace=True)  
#MERGE TO WORKING
WSeason_stl.columns = ["Season", "WTeamID", "WSeason_stl"]
df_working = pd.merge(left=df_working, right=WSeason_stl, on=['Season', 'WTeamID'])

##LOSING STEALS\]
df_steals = df_detailed.copy()
df_steals.drop(labels=['WTeamID', 'LScore', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                     'WFGM', 'WFGA', 'WFGM3', 'WFGA3','WFTM', 'WFTA', 'WAst',
                     'WOR', 'WDR', 'WTO', 'WBlk', 'WPF', 
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'WStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
LSeason_stl = df_steals.groupby(['Season', 'LTeamID'])['LStl'].mean()
LSeason_stl = LSeason_stl.to_frame()
LSeason_stl.reset_index(inplace=True)  
#MERGE TO WORKING
LSeason_stl.columns = ["Season", "LTeamID", "LSeason_stl"]
df_working = pd.merge(left=df_working, right=LSeason_stl, on=['Season', 'LTeamID'])

#STEALS SDIFF
df_working['Season_stl_diff'] = df_working['WSeason_stl'] - df_working['LSeason_stl']


###BLOCKS
df_blocks = df_detailed.copy()
df_blocks.drop(labels=['LTeamID', 'LScore', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                     'WFGM', 'WFGA', 'WFGM3', 'WFGA3','WFTM', 'WFTA', 'WAst',
                     'WOR', 'WDR', 'WTO', 'WStl', 'WPF', 
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF'], inplace=True, axis=1)
    
WSeason_blk = df_blocks.groupby(['Season', 'WTeamID'])['WBlk'].mean()
WSeason_blk = WSeason_blk.to_frame()
WSeason_blk.reset_index(inplace=True)  
#MERGE TO WORKING
WSeason_blk.columns = ["Season", "WTeamID", "WSeason_blk"]
df_working = pd.merge(left=df_working, right=WSeason_blk, on=['Season', 'WTeamID'])

##LOSING blocks\]
df_blocks = df_detailed.copy()
df_blocks.drop(labels=['WTeamID', 'LScore', 'WScore', 'DayNum', 'WLoc', 'NumOT', 
                     'WFGM', 'WFGA', 'WFGM3', 'WFGA3','WFTM', 'WFTA', 'WAst',
                     'WOR', 'WDR', 'WTO', 'WBlk', 'WPF', 
                     'LFGM', 'LFGA', 'LFGM3', 'LFGA3','LFTM', 'LFTA', 
                     'LOR', 'LDR', 'LAst', 'LTO', 'WStl', 'LStl', 'LPF'], inplace=True, axis=1)
    
LSeason_blk = df_blocks.groupby(['Season', 'LTeamID'])['LBlk'].mean()
LSeason_blk = LSeason_blk.to_frame()
LSeason_blk.reset_index(inplace=True)  
#MERGE TO WORKING
LSeason_blk.columns = ["Season", "LTeamID", "LSeason_blk"]
df_working = pd.merge(left=df_working, right=LSeason_blk, on=['Season', 'LTeamID'])

#blocks SDIFF
df_working['Season_blk_diff'] = df_working['WSeason_blk'] - df_working['LSeason_blk']

"""SET TEAM ID as Categorica; and drop Scores """
df_working.WTeamID = df_working.WTeamID.astype('category')
df_working.LTeamID = df_working.LTeamID.astype('category')

df_working.drop(labels=['WScore', 'LScore', 'Point_Diff'], inplace=True, axis=1)
df_working.drop(labels=['WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 
                        'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk',
                        'WPF','LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 
                        'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk',
                        'LPF' ], inplace=True, axis=1)

In [7]:

df_working.head()

Out[7]:

	Season	WTeamID	LTeamID	WLoc	LLoc	WSeed	LSeed	SeedDiff	WCoach	LCoach	...	LSeason_OR	LSeason_DR	Season_OR_diff	Season_DR_diff	WSeason_stl	LSeason_stl	Season_stl_diff	WSeason_blk	LSeason_blk	Season_blk_diff
0	2003	1104	1328	N	N	10.0	1	9.0	mark_gottfried	kelvin_sampson	...	13.333333	22.5	0.196078	3.911765	7.235294	5.833333	1.401961	4.176471	3.5	0.676471
1	2003	1280	1328	N	N	5.0	1	4.0	rick_stansbury	kelvin_sampson	...	13.333333	22.5	-0.666667	3.880952	9.285714	5.833333	3.452381	4.095238	3.5	0.595238
2	2003	1329	1328	H	A	6.0	1	5.0	eddie_sutton	kelvin_sampson	...	13.333333	22.5	-0.733333	2.000000	9.650000	5.833333	3.816667	5.750000	3.5	2.250000
3	2003	1400	1328	H	A	1.0	1	0.0	rick_barnes	kelvin_sampson	...	13.333333	22.5	2.893939	4.136364	6.954545	5.833333	1.121212	3.818182	3.5	0.318182
4	2003	1400	1328	A	H	1.0	1	0.0	rick_barnes	kelvin_sampson	...	13.333333	22.5	2.893939	4.136364	6.954545	5.833333	1.121212	3.818182	3.5	0.318182

5 rows × 39 columns

Last step of this process was to print out the edited dataset as a csv for further work:

In [8]:

'''RETURN NEW CSV WITH CREATED COLUMNS'''
#df_working.to_csv('MM_Cleaned_2.csv', index=False)

Out[8]:

'RETURN NEW CSV WITH CREATED COLUMNS'