Approach
First, we collected the spend data from the stakeholders responsible for those budgets. We were able to gather four years of spend history; however, the finest granularity available for Brand TV spend was monthly. With this in mind, we built two models: one without Brand TV spend, built at the week level and focused on digital campaigns, and one with Brand TV spend, built at the month level. I was responsible for building the Brand TV spend model.
Because my model would be based on TV spend, the next step was to create the Ad Stock feature. The effects of TV advertising accumulate and decay over time, which is why it is common practice to represent TV spend as Ad Stock, a running total that captures that carryover. First, I multiplied Brand TV spend by GRP, which represents the percentage of our target audience reached by our TV ads. Next, I fit a linear regression of applications on the ad-stocked Brand TV spend. Finally, I calculated the Ad Stock rate (the rate at which Brand TV spend carries over from one period to the next) by varying that rate, a parameter in the regression equation, so as to minimize the Sum of Squared Errors of the model. This fitted rate was then used to create the final Brand TV Ad Stock feature, as sketched below.
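The following is a minimal sketch of that calculation, not the exact project code: the column names ('Brand TV Spend', 'GRP', 'Applications'), the monthly ordering of the dataframe, and the grid of candidate rates are assumptions made for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

def adstock(x, rate):
    # geometric ad stock: each month keeps `rate` of the previous month's stock
    stock = np.zeros(len(x))
    carry = 0.0
    for i, value in enumerate(x):
        carry = value + rate * carry
        stock[i] = carry
    return stock

# weight spend by GRP (share of the target audience reached) -- assumed column names
grp_spend = (df['Brand TV Spend'] * df['GRP']).values
applications = df['Applications'].values

# choose the ad stock rate that minimizes the SSE of applications ~ ad stock
best_rate, best_sse = 0.0, float('inf')
for rate in np.linspace(0, 0.99, 100):
    X = adstock(grp_spend, rate).reshape(-1, 1)
    model = LinearRegression().fit(X, applications)
    sse = ((applications - model.predict(X)) ** 2).sum()
    if sse < best_sse:
        best_rate, best_sse = rate, sse

df['Brand TV Ad Stock'] = adstock(grp_spend, best_rate)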
The next step was to check for collinearity between features. With nearly 50 candidate features there were several instances of collinearity; furthermore, with only four years of data available at the month level (roughly 48 observations), it was necessary to reduce the number of features fed into the model.
To select the most relevant features, I ran a random forest regression for feature selection, which narrowed the inputs to 11, with the Ad Stock spend variable as most important, followed by Inflation, Google Search Volumes, and High Spend Digital Campaigns. This aligned well with the focus of my model, which was on macro-level factors.
I then trained a multiple linear regression model on the selected features. Because Google Search Volumes were an important input to the final model, I used fbprophet to forecast them over the prediction horizon, and I also produced forward estimates of GRP and inflation to feed into the final model.
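The snippet below shows one way the search-volume forecast could be done with fbprophet; the column names ('Month', 'Google Search Volume') and the 12-month horizon are assumptions for the example, not the project's actual settings.

from fbprophet import Prophet

# fbprophet expects a dataframe with 'ds' (date) and 'y' (value) columns
search = df[['Month', 'Google Search Volume']].rename(
    columns={'Month': 'ds', 'Google Search Volume': 'y'})

m = Prophet(yearly_seasonality=True)
m.fit(search)

# forecast the next 12 months of search volume
future = m.make_future_dataframe(periods=12, freq='MS')
forecast = m.predict(future)
forecast[['ds', 'yhat']].tail(12)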
My final model was combined with the micro-focused model built by my teammate, and it is currently predicting future applications more accurately than the department's original ARIMA model.
Code Demo
#importing the libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
In [ ]:
#importing the data
df = pd.read_csv("SpendData.csv")
In [ ]:
df.dropna(inplace=True)
In [ ]:
df.columns
In [ ]:
# checking for collinearity
df_corr = df.corr()
In [ ]:
sns.heatmap(df_corr, cmap='seismic_r')
plt.show()
df_corr
In [ ]:
#Random Forest Feature Selection
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
In [ ]:
labels = df.iloc[:,1:].columns
labels = pd.DataFrame(columns=labels)
In [ ]:
X = df.iloc[:,1:].values
y = df.iloc[:,0].values
In [ ]:
sel = SelectFromModel(RandomForestRegressor(n_estimators=100))
In [ ]:
sel.fit(X,y)
In [ ]:
#number of features selected
selected_feat = labels.columns[(sel.get_support())]
len(selected_feat)
In [ ]:
#name of selected features
print(selected_feat)
In [ ]:
#Create the final df with the selected features
testdf = df[['Conversions', 'Selected Feature 1', 'Selected Feature 2',
             'Selected Feature 3', 'Selected Feature 4',
             'Selected Feature 5', 'Selected Feature 6']]
In [ ]:
#splitting the final df into train and test sets
X = testdf.iloc[:,1:].values
y = testdf.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
In [ ]:
#Linear Regression Model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
In [ ]:
regressor.fit(X_train, y_train)
In [ ]:
y_pred = regressor.predict(X_test)
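As a possible next step that was not part of the original notebook, the held-out predictions could be scored with standard regression metrics, for example:
In [ ]:
#evaluating the model on the test set (illustrative addition)
from sklearn.metrics import mean_squared_error, r2_score
print('R^2:', r2_score(y_test, y_pred))
print('RMSE:', mean_squared_error(y_test, y_pred) ** 0.5)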