Titanic Survival Classification w/ XGBoost

Summary: It's been a little over three years since I attempted my first ML project: classifying survival on the Titanic, a pretty famous dataset and task in this field. I achieved an accuracy of around 75% on my first attempt (I believe with a random forest model). I thought it would be nice to revisit the problem and give it another go now that I'm a little more experienced. Additionally, I've been wanting to explore ensemble learning some more, particularly XGBoost. Let's dive in!
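
To set the stage, here's a minimal sketch of what an XGBoost baseline on the Titanic data might look like. The column names (Pclass, Sex, Age, Fare, Survived) follow the standard Kaggle train.csv, and the feature handling is deliberately bare-bones; the actual post goes further than this.

```python
# Minimal XGBoost baseline on the Kaggle Titanic training file.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

train = pd.read_csv("train.csv")

# Simple encoding: map Sex to 0/1 and fill missing ages with the median.
X = train[["Pclass", "Sex", "Age", "Fare"]].copy()
X["Sex"] = X["Sex"].map({"male": 0, "female": 1})
X["Age"] = X["Age"].fillna(X["Age"].median())
y = train["Survived"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```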


Lyric Analysis

An assortment of word clouds I made from lyrics I scraped for a school project. The size of each word is proportional to how frequently the artist uses it in (up to) 150 of their most recent songs. Mount Eerie references nature a lot, while Aesop Rock just references A LOT of things. Harry and the Potters, well...
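
For the curious, something along these lines is roughly how such a word cloud can be produced with the wordcloud package (which may or may not be what the original project used); the lyrics file name here is just a placeholder for an artist's scraped text.

```python
# Rough sketch: build a word cloud where word size scales with frequency.
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Placeholder file: the concatenated lyrics of one artist's recent songs.
lyrics = open("mount_eerie_lyrics.txt").read()

wc = WordCloud(width=800, height=400, stopwords=STOPWORDS,
               background_color="white").generate(lyrics)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```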


Practice at Dimensionality Reduction with SKlearn

I found some free time and thought I'd finally get some more practice with dimensionality reduction. With this goal in mind, I went on Kaggle and found a competition (estimating house prices) that looked appropriate for practicing these skills. Throughout this post I walk through the steps I took, from cleaning and standardizing the data to finally performing PCA and fitting a simple linear regression on the top five principal components (the most influential eigenvectors). Not the most accurate regression ever, but great practice, and surprisingly efficient given that it reduces 81 variables down to only 5.
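
The end-to-end pipeline boils down to something like the sketch below: standardize, project onto the top five principal components, then regress. Column names follow the Kaggle House Prices train.csv; for brevity this keeps only numeric columns, whereas the post handles the cleaning in more detail.

```python
# Sketch of a standardize -> PCA -> linear regression pipeline.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

df = pd.read_csv("train.csv")  # Kaggle house-prices training data
y = df["SalePrice"]
# Simplification: keep numeric columns only and fill missing values with 0.
X = df.drop(columns=["SalePrice", "Id"]).select_dtypes("number").fillna(0)

model = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
model.fit(X, y)
print("R^2 on training data:", model.score(X, y))
```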


2018 March Madness Bracket Predictor (part III)

Part three of my attempt to predict NCAA tourney results based on past game data. With cleaning and variable creation out of the way, this segment focuses on fitting the data to different classifiers in scikit-learn and tweaking parameters to find the classification model with the best predictive power.
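
A rough sketch of that kind of cross-validated model comparison is below, using synthetic stand-in data in place of the real game features; the models and parameter grids here are illustrative, not the exact ones tried in the post.

```python
# Compare a couple of scikit-learn classifiers with grid-searched parameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the per-team features and win/loss labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}),
}

# Cross-validated grid search over each candidate's parameters.
for name, (clf, grid) in candidates.items():
    search = GridSearchCV(clf, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```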


2018 March Madness Bracket Predictor (part II)

Part two of working toward a model to predict the 2018 NCAA tourney results. This segment takes the cleaned CSV file from part I and further manipulates it into a format appropriate for fitting models. The primary problem with the current format is that each row revolves around a game. This code splits each game row into two: one for the winner and one for the loser.
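
The core transform is roughly the following. The file name is a hypothetical stand-in for part I's output, and the W*/L* column names follow the Kaggle NCAA compact-results format, so they may not match the post's exact code.

```python
# Split each game row into a winner row and a loser row.
import pandas as pd

games = pd.read_csv("cleaned_games.csv")  # hypothetical output of part I

winners = games.rename(columns={"WTeamID": "TeamID", "WScore": "Score",
                                "LTeamID": "OppTeamID", "LScore": "OppScore"})
winners["won"] = 1

losers = games.rename(columns={"LTeamID": "TeamID", "LScore": "Score",
                               "WTeamID": "OppTeamID", "WScore": "OppScore"})
losers["won"] = 0

# Stack the two views: every game now contributes one row per team.
team_rows = pd.concat([winners, losers], ignore_index=True)
```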


2018 March Madness Bracket Predictor (part I)

First step in creating a 2018 March Madness bracket predictor, both for personal practice with scikit-learn and to compete in this year's Kaggle competition. Part I revolves around cleaning the data and creating new variables to use in modeling.
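
For a flavor of that step, a minimal sketch might look like the following; the file and column names come from the Kaggle NCAA data, and the derived variable and output file name are illustrative assumptions rather than the post's actual code.

```python
# Basic cleaning plus one example derived variable from the Kaggle results file.
import pandas as pd

games = pd.read_csv("RegularSeasonCompactResults.csv")
games = games.dropna()  # basic cleaning
games["ScoreMargin"] = games["WScore"] - games["LScore"]  # example new variable
games.to_csv("cleaned_games.csv", index=False)  # hypothetical file used in part II
```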