Natural Language Processing

This project required a natural language processing (NLP) machine learning model to efficiently categorize and label thousands of existing pieces of content on the client's website.

This project was not only interesting in itself, but has also led to insightful analytics opportunities for analyzing how potential customers interact with the website based on the purpose of individual pages.

For this project I was given a limited labeled dataset of about 100 pages and asked to label the rest of the site's pages using 5 main categories. I then expanded the labeled dataset by reviewing high-funnel pages and manually labeling an additional 100 pages.

Once I was confident I had a good representation of what high-funnel pages look like, I built an NLP model to assign a purpose to the remaining pages. The final model achieved an accuracy of 87%, a strong result given the limited dataset.

Below I have shared a demo of the code I used to produce this model.

Coding Demo

Importing Libraries

In [ ]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Importing The Datasets

In [ ]:

dataset = pd.read_csv('Page_Content.csv')
labeled_dataset = pd.read_csv('Page Labeled Data.csv')

Data Cleaning

In [ ]:

# creating the labeled training set
df = pd.merge(dataset, labeled_dataset, how = "right", on = ['Page URL'])
df.shape

In [ ]:

# removing rows with empty 'Page Content' cells
df = df.dropna(subset = ["Page Content"])

In [ ]:

df.shape

In [ ]:

# labels represented in the labeled training set
sns.countplot(x = df['Page Purpose'])

Text Cleaning

In [ ]:

# importing text cleaning libraries
import re
import nltk
from nltk.corpus import stopwords

# the NLTK stopword list is a separate download (only needed once per environment)
nltk.download('stopwords')

In [ ]:

# Text cleaning function
stop_words = set(stopwords.words('english'))

def message_cleaning(message):
    # keep letters only: punctuation, digits, and other symbols become spaces
    letters_only = re.sub('[^a-zA-Z]+', ' ', message)
    # drop common English stopwords and return the remaining tokens
    return [word for word in letters_only.split() if word.lower() not in stop_words]

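To sanity-check the cleaning function, here is a quick test on a made-up sentence (the sample text is purely illustrative):

In [ ]:

# punctuation, digits, and stopwords should all be stripped
message_cleaning('Apply now! Our 2021 admissions deadline is March 1st.')
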
In [ ]:

# Vectorizing the text
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = message_cleaning)
df_countvectorizer = vectorizer.fit_transform(df['Page Content'])
print(vectorizer.get_feature_names_out())

In [ ]:

print(df_countvectorizer.toarray())

In [ ]:

df_countvectorizer.shape
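
Raw term counts worked well here, but TF-IDF weighting is a common drop-in alternative that down-weights terms shared across many pages. A minimal sketch reusing the same cleaning function (TF-IDF was not part of the original pipeline):

In [ ]:

# re-weighting raw counts by inverse document frequency
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer = message_cleaning)
df_tfidf = tfidf_vectorizer.fit_transform(df['Page Content'])
df_tfidf.shape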

Splitting Train & Test Sets

In [ ]:

# defining the feature matrix X and the target vector y
X = df_countvectorizer
y = df['Page Purpose'].values

In [ ]:

# Splitting the training and testing data (random_state fixed for reproducibility)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [ ]:

# Training the Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
NB_Classifier = MultinomialNB()
NB_Classifier.fit(X_train, y_train)

Evaluating the Model

In [ ]:

from sklearn.metrics import classification_report, confusion_matrix
y_predict_train = NB_Classifier.predict(X_train)

In [ ]:

# confusion matrix for the training set predictions
cm = confusion_matrix(y_train, y_predict_train)
sns.heatmap(cm, annot = True, fmt = 'd')

In [ ]:

print(classification_report(y_train, y_predict_train))

In [ ]:

# confusion matrix for the held-out test set predictions
y_predict_test = NB_Classifier.predict(X_test)
cm2 = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm2, annot = True, fmt = 'd')

In [ ]:

print(classification_report(y_test, y_predict_test))

Because I was working with a small amount of labeled data, I wanted to validate the model with a train/test split; however, for the final model I wanted to take advantage of the whole dataset, so I retrained on every labeled example.
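
With roughly 200 labeled pages, a single 80/20 split can give a noisy accuracy estimate. As a complementary check, k-fold cross-validation averages the score over several splits; a minimal sketch reusing the X and y defined above (the 5-fold choice is my addition, not part of the original workflow):

In [ ]:

# 5-fold cross-validation: each fold takes one turn as the held-out test set
from sklearn.model_selection import cross_val_score

scores = cross_val_score(MultinomialNB(), X, y, cv = 5)
print(scores)
print('Mean accuracy: %.2f (+/- %.2f)' % (scores.mean(), scores.std()))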

Training On The Whole Labeled Dataset

In [ ]:

# re-initializing the classifier and collecting the full set of labels
NB_Classifier = MultinomialNB()
label = df["Page Purpose"].values

In [ ]:

# training the model on the full labeled training set
NB_Classifier.fit(df_countvectorizer, label)

In [ ]:

# sample test
testing_sample = ["This school is the best! "]

testing_sample_countvectorizer = vectorizer.transform(testing_sample)
NB_Classifier.predict(testing_sample_countvectorizer)
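
predict() returns only the single most likely label. For pages near a category boundary it can help to inspect the full class probabilities; a short sketch using predict_proba on the same sample:

In [ ]:

# class probabilities for the sample, one column per category
probabilities = NB_Classifier.predict_proba(testing_sample_countvectorizer)
for category, prob in zip(NB_Classifier.classes_, probabilities[0]):
    print('%s: %.3f' % (category, prob))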

In [ ]:

# predicting a purpose label for every page in the dataset
# (empty 'Page Content' cells are filled with an empty string so the cleaner can handle them)
all_pages_countvectorizer = vectorizer.transform(dataset['Page Content'].fillna(''))
dataset['Page_Purpose'] = NB_Classifier.predict(all_pages_countvectorizer)

In [ ]:

dataset

In [ ]:

# Saving predictions to a csv
dataset.to_csv('Page Purpose Remaining Labeled Dataset.csv', index = False)

In [ ]:

final_df = pd.read_csv('Page Purpose Remaining Labeled Dataset.csv')

In [ ]:

# distribution of the final predicted labels
sns.countplot(x = final_df['Page_Purpose'])