
This project required a natural language processing (NLP) machine learning model to efficiently categorize and label thousands of existing pieces of content on the client's website.
This project was not only interesting in itself, but has also led to valuable analytics opportunities for analyzing how potential customers interact with the website based on the purpose of individual pages.
For this project I was given a limited labeled dataset of about 100 pages and asked to label the remaining web pages across 5 main categories. I then expanded the labeled dataset by reviewing high-funnel pages and assigning labels to an additional 100 pages.
Once I was confident I had a good representation of what high-funnel pages look like, I built an NLP model to assign a purpose to the remaining pages. The final model achieved an accuracy of 87%, a strong result given the limited dataset.
Below I have shared a demo of the code I used to produce this model.
Coding Demo
Importing Libraries
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
Importing The Dataset
In [ ]:
dataset = pd.read_csv('Page_Content.csv')
labeled_dataset = pd.read_csv('Page Labeld Data.csv')
Data Cleaning
In [ ]:
# creating the labeled training set
df = pd.merge(dataset, labeled_dataset, how="right", on=["Page URL"])
df.shape
In [ ]:
# removing rows with empty 'Page Content' cells
df = df.dropna(subset=["Page Content"])
In [ ]:
df.shape
In [ ]:
# labels represented in the labeled training set
sns.countplot(x=df['Page Purpose'])
Text Cleaning
In [ ]:
# importing text cleaning libraries
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# nltk.download('stopwords')  # uncomment on first run if the stopword list is not installed
In [ ]:
# Text cleaning function: strips punctuation and numbers, then removes English stopwords
stop_words = set(stopwords.words('english'))  # build the set once instead of per word

def message_cleaning(message):
    test_punc_removed = [char for char in message if char not in string.punctuation]
    test_punc_removed_join = ''.join(test_punc_removed)
    test_punc_removed_join_nums = re.sub('[^a-zA-Z]+', ' ', test_punc_removed_join)
    test_punc_removed_join_nums_clean = [word for word in test_punc_removed_join_nums.split()
                                         if word.lower() not in stop_words]
    return test_punc_removed_join_nums_clean
In [ ]:
# Vectorizing the text with a bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer=message_cleaning)
df_countvectorizer = vectorizer.fit_transform(df['Page Content'])
print(vectorizer.get_feature_names_out())
In [ ]:
print(df_countvectorizer.toarray())
In [ ]:
df_countvectorizer.shape
Splitting Train & Test Sets
In [ ]:
# defining the feature matrix X and the label vector y
label = df["Page Purpose"].values
X = df_countvectorizer
y = label
In [ ]:
# Splitting the training and testing data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
In [ ]:
# Training the Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

NB_Classifier = MultinomialNB()
NB_Classifier.fit(X_train, y_train)
Evaluating the Model
In [ ]:
from sklearn.metrics import classification_report, confusion_matrix

y_predict_train = NB_Classifier.predict(X_train)
In [ ]:
# confusion matrix on the training set
cm = confusion_matrix(y_train, y_predict_train)
sns.heatmap(cm, annot=True)
In [ ]:
print(classification_report(y_train, y_predict_train))
In [ ]:
# confusion matrix on the held-out test set
y_predict_test = NB_Classifier.predict(X_test)
cm2 = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm2, annot=True)
In [ ]:
print(classification_report(y_test, y_predict_test))
Because I was working with a small amount of data, I wanted to validate the model using a train-test split; however, I also wanted to take advantage of the whole labeled dataset when training the final model.
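As a side note, k-fold cross-validation is another way to estimate performance while still letting the final fit use every labeled row. A minimal sketch (not part of the original notebook), assuming the same df_countvectorizer features and 'Page Purpose' labels defined above:

# hypothetical 5-fold cross-validation on the full labeled set
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

cv_scores = cross_val_score(MultinomialNB(), df_countvectorizer,
                            df["Page Purpose"].values, cv=5)
print("Mean CV accuracy:", cv_scores.mean())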
Training On The Whole Labeled Dataset
In [ ]:
from sklearn.naive_bayes import MultinomialNB

NB_Classifier = MultinomialNB()
label = df["Page Purpose"].values
In [ ]:
# training the model on the full labeled training set
NB_Classifier.fit(df_countvectorizer, label)
In [ ]:
# sample test on a single made-up page
testing_sample = ["This school is the best! "]
testing_sample_countvectorizer = vectorizer.transform(testing_sample)
NB_Classifier.predict(testing_sample_countvectorizer)
In [ ]:
# loop to predict and assign labels to every page in the dataset
Page_Purpose = []
for row in dataset['Page Content'].fillna(''):  # guard against pages with no content
    row_countvectorizer = vectorizer.transform([row])
    purpose = NB_Classifier.predict(row_countvectorizer)[0]
    Page_Purpose.append(purpose)
dataset['Page_Purpose'] = Page_Purpose
In [ ]:
dataset
In [ ]:
# Saving predictions to a csv
dataset.to_csv('Page Purpose Remaining Labeled Dataset.csv', index=False)
In [ ]:
final_df = pd.read_csv('Page Purpose Remaining Labeled Dataset.csv')
In [ ]:
# distribution of final predicted labels
sns.countplot(x=final_df['Page_Purpose'])