Identifying Advertisements with ANN's

I pull data from the UCI Machine Learning Repo and use it to train a model which can identify advertisements based upon their image size and URL terminology. I work through cleaning the data, attempting a few different fitting algorithms, and end with some parameter-tuning of an ANN. My final model results in over 97% accuracy in classifying advertisements in testing. Originally conducted for a Machine Learning course as SHHU focused on the R language.

more ...

Multiple Linear Regression to Predict Consumer Spending

As in the last post, here's some more work in excel with economic variables. This time I use value forecasts of 30y mortgage, unemployment, and personal income rates, figured in a similar manner as before (annual growth/change rates - 10y moving averages) to predict future levels of personal consumption expenditures. I run a multilinear regression analysis to forecast PCE based upon the three independent variables and end up with some pretty strong results and an adjusted R-squared of .974.

more ...

Classifying Comment Toxicity w/ NLTK and NB

I've wanted to get more practice with natural language processing, so I grabbed a dataset of Wikipedia comments from a past Kaggle challenge to attempt to classify toxicity of each comment. Train data was a collection of over 150k comments connected to user-defined classifications of 'toxic', 'severe toxic', 'obscene', 'insult', 'threat', 'hate'. Here's the process I took to create a model which identifies these particular classifications of future comments!

First is to import libraries we'll be using


Titanic Survival Classification w/ XGBoost

Summary: It's been alittle over three years since I attempted my first ML project of classifying survival of the titanic.. A pretty famous dataset and task for this field I'd say. I achieved a accuracy of around 75% in my first attempt at this (I believe with a random forest model). I though it would be nice to revisit this and give it another go now that I'm a little more experienced. Additionally, I've wanted to explore ensemble learned some more, particularly XGBoost. Let's dive in!


Time Series Analysis in R

Lately I've been looking to explore time-series modeling and to get more practice with manipulating data in R. Luckily, a dataset on Kaggle provided me the opportunity to do both of these things. Attempting to predict future sale volume of items in stores (from 1-C russian store sales dataset) gave me the chance to apply the theoretical knowledge I've been studying of time-series analysis (ARIMA in particual). Additionally, I was able to get more comfortable with dplyr and lubridate in the process. more ...

Practice at Dimensionality Reduction with SKlearn

I found some free time and thought I'd finally get some more practice at dimensionality reduction. With this goal in mind, I went onto Kaggle and found a competition(Estimate house prices) which looked appropriate to practice these skill with. Throughout this post I walk through the steps I took from cleaning and standardizing the data, to finally performing PCA and fitting a simple linear regression to the top five most influential eigenvectors! Not the most accurate regression ever, but great practice and surprisingly efficient given it drops 81 variables into only 5.