I found some free time and thought I'd finally get some more practice at dimensionality reduction. With this goal in mind, I went onto Kaggle and found a competition (estimating house prices) that looked appropriate for practicing these skills. Throughout this post I walk through the steps I took, from cleaning and standardizing the data to performing PCA and fitting a simple linear regression on the top five principal components (the eigenvectors with the largest eigenvalues). Not the most accurate regression ever, but great practice and surprisingly efficient given that it reduces 81 variables to only 5.
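To make the standardize, PCA, regress sequence concrete, here is a minimal sketch in plain NumPy. It is not the code from the post: the data is synthetic, and the names `X`, `y`, `components`, and `scores` are placeholders I am assuming for illustration.

```python
import numpy as np

# Hypothetical stand-in for the cleaned, all-numeric housing features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=200)

# Standardize each column to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decompose the covariance matrix; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1][:5]          # indices of the five largest eigenvalues
components = eigvecs[:, order]                 # shape (n_features, 5)
explained_ratio = eigvals[order] / eigvals.sum()

# Project onto the five components, then fit ordinary least squares with an intercept.
scores = Z @ components
design = np.column_stack([np.ones(len(scores)), scores])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
predictions = design @ coef

print(scores.shape, explained_ratio.round(3))
```

The `explained_ratio` values show how much of the total variance each kept component carries, which is a quick check on how much information is lost by keeping only five.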
Part three of my attempt to predict NCAA tourney results based on past game data. With cleaning and variable creation out of the way, this segment focuses on fitting the data to different classifiers in scikit-learn and tweaking parameters to find the classification model with the best predictive power.
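One common way to compare classifiers and tune their parameters in scikit-learn is a cross-validated grid search. The sketch below assumes synthetic data and two arbitrarily chosen models with small parameter grids; the actual classifiers and grids used in the post may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the game features (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate models and the parameter grids to search (illustrative choices only).
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [25, 50], "max_depth": [3, None]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validation over every parameter combination in the grid.
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# Pick the model whose best cross-validated accuracy is highest.
best_name = max(results, key=lambda n: results[n][0])
print(best_name, results[best_name])
```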
Part two of working towards a model to predict the 2018 NCAA tourney results. This segment takes the cleaned CSV file from part 1 and further manipulates it into a format appropriate for fitting models. The primary problem with the current format is that each row revolves around a game. This code splits each row into two: one for the winner and one for the loser.
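The row split can be sketched with pandas roughly as follows. The column names (`WTeamID`, `WScore`, `LTeamID`, `LScore`) and the tiny example frame are assumptions for illustration, not necessarily the layout of the cleaned file from part 1.

```python
import pandas as pd

# Hypothetical game-level data: one row per game, winner and loser side by side.
games = pd.DataFrame({
    "Season": [2017, 2017],
    "WTeamID": [1101, 1202],
    "WScore": [80, 65],
    "LTeamID": [1303, 1404],
    "LScore": [70, 60],
})

# View each game from the winner's side...
winners = games.rename(columns={
    "WTeamID": "TeamID", "WScore": "Score",
    "LTeamID": "OppID", "LScore": "OppScore",
}).assign(Won=1)

# ...and again from the loser's side, swapping the roles of the columns.
losers = games.rename(columns={
    "LTeamID": "TeamID", "LScore": "Score",
    "WTeamID": "OppID", "WScore": "OppScore",
}).assign(Won=0)

# Stack both views: two rows per game, one per team; concat aligns by column name.
team_rows = pd.concat([winners, losers], ignore_index=True, sort=False)
print(team_rows)
```

The `Won` column then becomes a natural target for a classifier, since every game contributes one positive and one negative example.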
First step in creating a 2018 March Madness bracket predictor, both for personal practice with scikit-learn and to compete in this year's Kaggle competition. Part 1 revolves around cleaning the data and creating new variables to use in modeling.
Testing and evaluating numerous machine learning techniques to determine the best option for predicting fraud occurrences in the Enron email dataset. The most effective predictor ended up being an AdaBoost algorithm with n_estimators set to 50. This method, using a decision tree as the 'weak learner', came out with about 85% accuracy, a p-value of 39, and an r-squared of around 32. Originally conducted for a Udacity Nanodegree project.
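For reference, an AdaBoost classifier with 50 estimators can be set up in scikit-learn along these lines. This is a sketch on synthetic, imbalanced data rather than the Enron features. It relies on scikit-learn's default weak learner (a depth-1 decision tree), and the precision and recall scores are my own addition, since accuracy alone can look good when one class is rare.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the labeled email features (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The default weak learner is a depth-1 decision tree (a decision stump).
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

scores = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred, zero_division=0),
    "recall": recall_score(y_test, pred, zero_division=0),
}
print(scores)
```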
A look at the world happiness index and an evaluation of the factors that contribute to the general happiness of a country's population. Variables were primarily focused on economic, political, and egalitarian factors. Originally conducted as a project for my SNHU Applied Stats II class. As it ended up at around 30 pages, only a subset of the full project is included.