I've wanted to get more practice with natural language processing, so I grabbed a dataset of Wikipedia comments from a past Kaggle challenge and tried to classify the toxicity of each comment. The training data was a collection of over 150k comments paired with user-defined classifications of 'toxic', 'severe toxic', 'obscene', 'insult', 'threat', 'hate'. Here's the process I took to create a model that predicts these classifications for future comments!
First, import the libraries we'll be using:
import pandas as pd
import numpy as np
import re
import os
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
As the imports show, I opted for a Naive Bayes model for the final classification. While a few other classifiers may have been suitable as well (decision trees or random forests in particular), I wanted to use this primarily as NLP practice, so I trained and fit only one model without tweaking.
Data imports next. The files were way too large, so I won't include them on the site or in this notebook.
#set working dir
os.chdir('D:/Projects/Kaggle/toxic_comments')
# import
df_test = pd.read_csv("data/test.csv")
df_test_labs = pd.read_csv("data/test_labels.csv")
df_train = pd.read_csv("data/train.csv")
df_sub = pd.read_csv("data/sample_submission.csv")
So the first step of the actual preprocessing is to create a corpus of the words from each comment. I iterate through each row of the csv, grab the comment, strip the punctuation, split it into a list, drop the stopwords, stem each word, join the words back into a string, and finally append it to the corpus.
NOTE: The data was too large for my comp to deal with all at once. I kept running into memory issues, so I ended up cutting the data and using about 1/3 of it to train.
#init corpus
corpus = []
#build the stopword set and the stemmer once, outside the loop
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
#loop through df and clean comments
for i in range(0, 50000):
    #reg_exp to replace anything that is not a letter with a space and drop to lower case
    comment = re.sub('[^a-zA-Z]', ' ', df_train['comment_text'][i]).lower()
    #split into list for processing
    comment = comment.split()
    #check for stopwords and remove
    comment = [word for word in comment if word not in stop_words]
    #stem the word!
    comment = [str(ps.stem(word)) for word in comment]
    #back to string
    comment = ' '.join(comment)
    corpus.append(comment)
    #track progress (percent of the full training set)
    if i%1000 == 0:
        print((float(i)/len(df_train))*100)
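To make the clean-up steps concrete, here is a minimal sketch of what they do to a single comment. It is an illustration only: `clean_comment` and `STOP_WORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list so the snippet runs without any downloads, and the stemming step is left out.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption: the real
# list is much longer; this subset only keeps the example dependency-free).
STOP_WORDS = {"you", "re", "the", "a", "is", "are", "and"}


def clean_comment(text, stop_words=STOP_WORDS):
    """Lowercase, replace non-letters with spaces, split, and drop stopwords.

    Stemming is intentionally omitted in this sketch.
    """
    letters_only = re.sub('[^a-zA-Z]', ' ', text).lower()
    tokens = letters_only.split()
    return ' '.join(word for word in tokens if word not in stop_words)


example = "You're WRONG!!! Go away, troll #42."
cleaned = clean_comment(example)
print(cleaned)  # wrong go away troll
```

Note that the regular expression also turns the apostrophe in "You're" into a space, which is why "re" shows up as its own token before the stopword filter removes it.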
Next I took the corpus and vectorized it into a bag-of-words matrix, keeping the 25,000 most frequent words. CountVectorizer returns a sparse matrix, and calling .toarray() converts it to a dense array, which GaussianNB requires. Of course the training also needs target labels, so I pull each classification column in here as well.
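As a plain-Python illustration of what a bag-of-words matrix holds, the sketch below builds one by hand for three made-up cleaned comments. This is not how CountVectorizer is implemented and it skips its tokenization rules; `bag_of_words` and `toy_corpus` are hypothetical names used only to show the idea of one row per comment and one column per vocabulary word.

```python
def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for whitespace-tokenized documents."""
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word in doc.split():
            row[column[word]] += 1  # count occurrences of each word
        matrix.append(row)
    return vocabulary, matrix


# Made-up cleaned comments, for illustration only
toy_corpus = ["wrong go away troll", "thank help", "go away"]
vocabulary, matrix = bag_of_words(toy_corpus)

print(vocabulary)  # ['away', 'go', 'help', 'thank', 'troll', 'wrong']
for row in matrix:
    print(row)
# [1, 1, 0, 0, 1, 1]
# [0, 0, 1, 1, 0, 0]
# [1, 1, 0, 0, 0, 0]
```

Most entries are zero even in this tiny case, which is why the vectorizer's native output is a sparse matrix.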
#Bag of Words Model - sparse matrix (tokenize)
cv = CountVectorizer(max_features = 25000) #max words to store
X = cv.fit_transform(corpus).toarray()
y_tox = df_train.iloc[0:50000,2].values
y_sev_tox = df_train.iloc[0:50000,3].values
y_obs = df_train.iloc[0:50000,4].values
y_threat = df_train.iloc[0:50000,5].values
y_insult = df_train.iloc[0:50000,6].values
y_hate = df_train.iloc[0:50000,7].values
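The memory issues mentioned earlier are easy to estimate with a bit of arithmetic. The sketch below assumes the vocabulary actually reaches the 25,000-word cap and that the counts are stored as 64-bit integers (CountVectorizer's default dtype); the variable names are mine.

```python
n_comments = 50000    # rows taken from the training set
n_features = 25000    # max_features passed to CountVectorizer
bytes_per_count = 8   # assumption: 64-bit integer counts

# A dense array stores every cell, including all the zeros
dense_bytes = n_comments * n_features * bytes_per_count
dense_gib = dense_bytes / 2**30

print(dense_bytes)          # 10000000000
print(round(dense_gib, 2))  # 9.31
```

Roughly 9.3 GiB for X alone, before any copies made during the train/test split, goes a long way toward explaining why only about a third of the data fit in memory.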
So, as I'm looking to classify 6 different possibilities rather than just one, I ended up creating two dictionaries to store the targets and models in, respectively. I use a list of the dict keys to iterate through them both.
#model for each predicted type
tests = {'y_tox' : y_tox,
'y_sev_tox' : y_sev_tox,
'y_obs' : y_obs,
'y_threat' : y_threat,
'y_insult' : y_insult,
'y_hate' : y_hate}
models = {'y_tox' : GaussianNB(),
'y_sev_tox' : GaussianNB(),
'y_obs' : GaussianNB(),
'y_threat' : GaussianNB(),
'y_insult' : GaussianNB(),
'y_hate' : GaussianNB()}
preds = {}
test_names = ['y_tox', 'y_sev_tox', 'y_obs', 'y_threat', 'y_insult', 'y_hate']
Finally, to train the models, I iterate through each target classification and fit a model on the matrix (X) against it.
While I'm not bringing the data into this notebook, the accuracy and confusion matrix values were pretty solid. Across the six classifications, NB averages an accuracy of around .94.
for i in test_names:
    #train/test split for the current label
    X_train, X_test, y_train, y_test = train_test_split(X, tests[i], test_size = 0.05, random_state = 42)
    #Train Model (naive bayes)
    models[i].fit(X_train, y_train)
    #predict
    preds[i] = models[i].predict(X_test)
    #review model
    print(i)
    print(confusion_matrix(y_test, preds[i]))
    print(accuracy_score(y_test, preds[i]))
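One caveat worth keeping in mind when reading accuracy for labels like these: when positives are rare, accuracy alone can look strong even if few toxic comments are caught. The sketch below uses made-up confusion matrix counts (not my actual results) for a 2,500-comment hold-out set, which is 5% of 50,000, and a hypothetical helper `summarize`. The argument order matches what `confusion_matrix(...).ravel()` returns for a binary problem: tn, fp, fn, tp.

```python
def summarize(tn, fp, fn, tp):
    """Return (accuracy, precision, recall) from binary confusion counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up counts: 100 of 2,500 hold-out comments carry the label.
# Baseline that predicts "not toxic" for everything:
all_negative = summarize(tn=2400, fp=0, fn=100, tp=0)
# Hypothetical classifier that catches 60 of the 100 positives:
some_model = summarize(tn=2300, fp=100, fn=40, tp=60)

print(all_negative)  # (0.96, 0.0, 0.0)
print(some_model)    # (0.944, 0.375, 0.6)
```

In this invented example the do-nothing baseline has the higher accuracy, so precision and recall from the confusion matrix are the more telling numbers to compare.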
Alright, with the models trained, it's simply a matter of running the test data through the pipeline!
First: create the test corpus to build another matrix from:
# RUN TEST THROUGH PIPELINE
test_corpus = []
#reuse one stopword set and one stemmer for the whole loop
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
for i in range(130000, 153164):
    #reg_exp to replace anything that is not a letter with a space and drop to lower case
    comment = re.sub('[^a-zA-Z]', ' ', df_test['comment_text'][i]).lower()
    #split into list for processing
    comment = comment.split()
    #check for stopwords and remove
    comment = [word for word in comment if word not in stop_words]
    #stem the word!
    comment = [str(ps.stem(word)) for word in comment]
    #back to string
    comment = ' '.join(comment)
    test_corpus.append(comment)
    #track progress (percent of the full test set)
    if i%1000 == 0:
        print((float(i)/len(df_test))*100)
#kaggle test array
X_kaggle = cv.transform(test_corpus).toarray()
kaggle_preds = {}
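The loop above only covers rows 130000 to 153164, which is one segment of the test set. A minimal sketch of generating segment boundaries instead of editing the range by hand could look like this. The helper `segment_ranges` and the segment size of 40,000 are assumptions for illustration; only the 153,164 row count comes from the code above, and these are not the exact boundaries I used.

```python
def segment_ranges(n_rows, segment_size):
    """Yield (start, stop) index pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, segment_size):
        yield start, min(start + segment_size, n_rows)


N_TEST_ROWS = 153164   # upper bound of the test loop
SEGMENT_SIZE = 40000   # hypothetical size chosen to fit in memory

segments = list(segment_ranges(N_TEST_ROWS, SEGMENT_SIZE))
print(segments)
# [(0, 40000), (40000, 80000), (80000, 120000), (120000, 153164)]
```

Each (start, stop) pair can then drive one pass of the clean, transform, and predict steps, so only one segment's dense matrix is in memory at a time.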
As (again) we're predicting 6 target possibilities, I ran a for loop to hit each model and predict the binary yes/no for each comment classification. Predictions were stored in another dictionary for the same reasons as all the others.
#predict the class (0/1) for each label
for i in test_names:
    print(i)
    kaggle_preds[i] = models[i].predict(X_kaggle)
#list out in sample sub
sample_sub = pd.DataFrame(df_sub.iloc[130000:,0])
sample_sub = sample_sub.reset_index(drop=True)
sample_placement = pd.DataFrame(kaggle_preds)
sample_placement = sample_placement.reindex(columns=['y_tox','y_sev_tox', 'y_obs','y_threat', 'y_insult', 'y_hate'])
sample_sub = sample_sub.join(sample_placement)
sample_sub = sample_sub.rename(columns={'y_tox': 'toxic', 'y_sev_tox': 'severe_toxic',
                                        'y_obs': 'obscene', 'y_threat': 'threat',
                                        'y_insult': 'insult', 'y_hate': 'identity_hate'})
And with that, all that remains is to write the completed sample_sub to a csv and upload it to Kaggle. Fun fact: due to memory issues I had to run the transforming and prediction steps in 4 different segments, each taking about 45 minutes to run.
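Since the predictions came out in segments, the submission file has to be stitched together with the header written only once. Here is a minimal standard-library sketch of that idea. The ids and prediction values are made up, `write_segment` is a hypothetical helper, and an in-memory buffer stands in for the real file so the snippet runs anywhere.

```python
import csv
import io

# Column layout of the competition's submission file
COLUMNS = ['id', 'toxic', 'severe_toxic', 'obscene',
           'threat', 'insult', 'identity_hate']


def write_segment(handle, rows, write_header):
    """Append one segment of prediction rows, with the header only if asked."""
    writer = csv.writer(handle, lineterminator='\n')
    if write_header:
        writer.writerow(COLUMNS)
    writer.writerows(rows)


# Made-up ids and 0/1 predictions, one short list per segment
segment_1 = [['id_0001', 0, 0, 0, 0, 0, 0]]
segment_2 = [['id_0002', 1, 0, 1, 0, 1, 0]]

buffer = io.StringIO()  # stands in for an opened submission file
write_segment(buffer, segment_1, write_header=True)
write_segment(buffer, segment_2, write_header=False)
print(buffer.getvalue())
```

With pandas, the usual equivalent is `to_csv` with `index=False` for the first segment, then `mode='a'` and `header=False` for the rest.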
Kaggle results were decent but nothing special. Then again, I fit only one model, didn't compare it to any others, and didn't attempt any parameter tuning, dimensionality reduction, standardization, or other steps which could have helped out here. Again, the goal was to practice NLP (mainly with NLTK). With that in mind, I feel quite successful in this project.