Natural Language Processing

This project required a natural language processing (NLP) machine learning model to efficiently categorize and label thousands of existing pieces of content on the client's website.

This project was not only interesting in itself, but has also led to valuable analytics opportunities, making it possible to analyze how potential customers interact with the website based on the purpose of individual pages.

For this project I was given a limited labeled data set of about 100 pages and asked to label the rest of the web pages according to 5 main categories. I then expanded the labeled data set by reviewing high-funnel pages and assigning labels to an additional 100 pages.

Once I was confident I had a good representation of what high-funnel pages look like, I built an NLP model to assign a purpose to the remaining pages. The final model achieved an accuracy of 87%, a strong result given the limited data set.

Below I have shared a demo of the code I used to produce this model.

Coding Demo

Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import category_encoders

Importing The Dataset

dataset = pd.read_csv('Page_Content.csv')
labeled_dataset = pd.read_csv('Page Purpose Labeld Data.csv')
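For orientation, here is a quick peek at the two inputs. The column names are inferred from the merge key and the model code below rather than spelled out in the files themselves:

# inspect the first few rows of each file
# (inferred columns: 'Page URL' and 'Page Content' in the scraped data,
#  'Page URL' and 'Page Purpose' in the hand-labeled file)
print(dataset.head())
print(labeled_dataset.head())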

Data Cleaning

# creating the labeled training set
df = pd.merge(dataset, labeled_dataset, how = "right", on = ['Page URL'])
df.shape
# removing rows with empty 'Page Content' cells
df = df.dropna(subset = ["Page Content"], how = "all")
df.shape
(430, 3)
# creating a label encoder (did not use in final model)
#enc = category_encoders.one_hot.OneHotEncoder(cols = ['Page Purpose'], drop_invariant = False, use_cat_names = True, return_df = True,)
# df2 = enc.fit_transform(df)
# df2 = df2.drop(['Page Purpose_-1'], axis = 1)
# df2
# labels represented in the labeled training set
sns.countplot(x = 'Page Purpose', data = df)

Text Cleaning

# importing text cleaning libraries
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
# nltk.download('stopwords')  # uncomment on first run to fetch the stopword list
# Text cleaning function
def message_cleaning(message):
    # strip punctuation characters
    test_punc_removed = [char for char in message if char not in string.punctuation]
    test_punc_removed_join = ''.join(test_punc_removed)
    # drop anything that is not a letter
    test_punc_removed_join_nums = re.sub('[^a-zA-Z]+', ' ', test_punc_removed_join)
    # remove English stopwords
    test_punc_removed_join_nums_clean = [word for word in test_punc_removed_join_nums.split() if word.lower() not in stopwords.words('english')]
    return test_punc_removed_join_nums_clean
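As a quick sanity check (the sample sentence is made up, not from the client site), the cleaner strips punctuation and digits and drops stopwords such as 'our':

message_cleaning("Hello!! Visit our 3 campuses today.")
# expected output: ['Hello', 'Visit', 'campuses', 'today']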
# Vectorizing the text
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = message_cleaning)
df_countvectorizer = vectorizer.fit_transform(df['Page Content'])
print(vectorizer.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0
print(df_countvectorizer.toarray())
[[0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
df_countvectorizer.shape
(430, 9150)
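A side note on the output above: fit_transform returns a SciPy sparse matrix rather than a regular NumPy array, which is why toarray() was needed for printing and why a 430 x 9,150 matrix is cheap to store:

# the document-term matrix stores only the nonzero counts
print(type(df_countvectorizer))  # a SciPy CSR sparse matrix
print(df_countvectorizer.nnz)    # number of nonzero entries actually stored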

Splitting Train & Test Sets

# labeling the X and y data
X = df_countvectorizer
y = df['Page Purpose'].values
# Splitting the training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
# Training the Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
NB_Classifier = MultinomialNB()
NB_Classifier.fit(X_train, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Evaluating the Model

from sklearn.metrics import classification_report, confusion_matrix
y_predict_train = NB_Classifier.predict(X_train)
# confusion matrix for the training set predictions
cm = confusion_matrix(y_train, y_predict_train)
sns.heatmap(cm, annot = True)
print(classification_report(y_train, y_predict_train))
              precision    recall  f1-score   support

     Convert       0.99      0.98      0.99       154
      Engage       0.92      0.92      0.92        13
     Enhance       1.00      1.00      1.00         9
      Evolve       0.92      1.00      0.96        45
     Inspire       1.00      1.00      1.00        17
       Trust       1.00      0.98      0.99       106

    accuracy                           0.98       344
   macro avg       0.97      0.98      0.98       344
weighted avg       0.98      0.98      0.98       344

# confusion matrix for the test set predictions
y_predict_test = NB_Classifier.predict(X_test)
cm2 = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm2, annot = True)
print(classification_report(y_test, y_predict_test))
              precision    recall  f1-score   support

   Add Value       0.00      0.00      0.00         2
     Convert       0.95      1.00      0.98        42
      Engage       0.00      0.00      0.00         2
     Enhance       1.00      0.75      0.86         4
      Evolve       0.85      1.00      0.92        11
     Inspire       0.00      0.00      0.00         2
       Trust       0.88      0.91      0.89        23

    accuracy                           0.90        86
   macro avg       0.53      0.52      0.52        86
weighted avg       0.85      0.90      0.87        86

Because I was working with a small amount of data, I validated the model using a train/test split; for the final model, however, I wanted to take advantage of the whole labeled dataset for training.
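With a data set this small, a single train/test split can give a noisy accuracy estimate. A minimal sketch of an additional check that could be used here, k-fold cross-validation (not part of the original workflow), looks like this:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
# 5-fold cross-validation: each fold serves once as the held-out test set,
# producing five accuracy estimates instead of one
scores = cross_val_score(MultinomialNB(), X, y, cv = 5, scoring = 'accuracy')
print(scores.mean(), scores.std())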

Training On The Whole Labeled Dataset

from sklearn.naive_bayes import MultinomialNB 
NB_Classifier = MultinomialNB()
label = df["Page Purpose"].values
# training the model on the full labeled training set
NB_Classifier.fit(df_countvectorizer, label)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
# sample test
testing_sample = ["This school is the best! "]
testing_sample_countvectorizer = vectorizer.transform(testing_sample)
NB_Classifier.predict(testing_sample_countvectorizer)
array(['Convert'], dtype='<U9')
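As an optional extension (not shown in the original demo), MultinomialNB also exposes per-class probabilities, which can be used to flag low-confidence pages for manual review:

# probability for each label, in NB_Classifier.classes_ order
probs = NB_Classifier.predict_proba(testing_sample_countvectorizer)
print(dict(zip(NB_Classifier.classes_, probs[0])))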
# loop to predict and assign labels to all unlabeled pages in the dataset
Page_Purpose = []
for row in dataset['Page Content']:
    testing_sample_countvectorizer = vectorizer.transform([row])
    # predict() returns an array; take its single element so the column holds plain strings
    purpose = NB_Classifier.predict(testing_sample_countvectorizer)[0]
    Page_Purpose.append(purpose)

dataset['Page_Purpose'] = Page_Purpose
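The same result can be obtained without an explicit loop by transforming the whole column in one call; a sketch, assuming any missing 'Page Content' values are filled first:

# vectorize every page at once and predict in a single batch
all_vectors = vectorizer.transform(dataset['Page Content'].fillna(''))
dataset['Page_Purpose'] = NB_Classifier.predict(all_vectors)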
# Saving predictions to a csv
dataset.to_csv('Page Purpose Whole Labeled Dataset.csv', index = False)
final_df = pd.read_csv('Page Purpose Whole Labeled Dataset.csv')
# distribution of final predicted labels
sns.countplot(x = 'Page_Purpose', data = final_df)