
# An introduction to regression analysis for marketers

#### How to use regression to analyse everything from PPC to print to web analytics to radio and more using R

Regression analysis is a powerful tool for marketers. Regression sounds really important, doesn’t it? It sounds like proper ninja mathematics stuff, particularly when you add various qualifiers in front of it: linear regression, multiple regression, polynomial regression. “What are you working on, Katie?” “Oh, just analysing how our ROAS changes with PPC spend using non-linear regression.”

Pretty dang sweet.

While it sounds really ninja, it’s a concept that’s very easy to get started with. Completely understanding the assumptions that underpin the models, interpreting models and drawing conclusions, tweaking and improving models (and more besides) can take time, but that doesn’t mean that you can’t get started with regression today. So let’s do that.

In this article, I’ll cover the basics of performing a simple linear regression in R, how to pull out what we’re interested in from the model summary, and some basic model checking. In a [shortly forthcoming] part two, we’ll look at what to do when a simple straight line doesn’t fit your data and how to compare different models.

#### Moving forward while regressing

You know in Excel, when you have a scatterplot, perhaps sales versus customers, and you ask Excel to add a trendline — the one that (hopefully) shows that, as your number of customers increases, so do your sales? Well, that’s regression.

At its heart, regression is about understanding trends and variability in your data. Do sales go up with PPC spend? They do? Great. If I double PPC spend, do sales double? How confident can I be that the sales will double? The great thing about regression when you’re a marketer, though, is that it’s not just about the trendline; it can tell you what’s working and what’s not. You just have to make sure the data is there for it to get to work on.

In this article, we’ll be using the amazing data science platform that is R to perform our analyses. Yes, R is a programming language, but — if you’re an Excel user — you probably code already; you just do it in a small, white strip at the top of the screen and it starts with =…

This isn’t an article about the technical background to regression. We won’t be spending too much time on y-hat or talking about residuals. For some interactive background to regression, I recommend DataCamp’s Correlation and Regression course; you can do the first chapter for free.

Without further ado, it’s time to dive into our first regression example: a simple linear regression to look at what happens to our revenue when we spend more on PPC.

#### What goes up…

…must be analysed with a linear regression analysis. Okay, not really true, but that’s what we’re going to do here. We have a dataset (all of the data and code in this article can be copied/pasted/forked and more from GitHub) that summarises some of our PPC activity over time. One thing I mentioned earlier on was making sure the regression has the data to work with. This is what I mean: in this case, we played about with the daily spend, randomly moving it lower and higher.

If we don’t change a variable, how will we ever know the effect of changing it? And, perhaps more importantly, what’s the point of calling it a variable in the first place?

Let’s get our dataset imported and see what we’re working with:

```r
# load the tidyverse for read_csv(), glimpse() and, later, ggplot()
library(tidyverse)

# import data and have a quick look
display <- read_csv("display_data.csv")
glimpse(display)
```

```
Observations: 40
Variables: 8
$ spend         22.61, 37.28, 55.57, 45.42, 50.22
$ clicks        165, 228, 291, 247, 290, 172, 68, 112
$ impressions   8672, 11875, 14631, 11709, 14768, 8698
$ display       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ transactions  2, 2, 3, 2, 3, 2, 1, 1, 3, 3, 4, 5, 2
$ revenue       58.88, 44.92, 141.56, 209.76, 197.68
$ ctr           1.90, 1.92, 1.99, 2.11, 1.96, 1.98
$ con_rate      1.21, 0.88, 1.03, 0.81, 1.03, 1.16
```

We have 40 days of data that tell us how much we spent, how many clicks, impressions and transactions we got, whether or not a display campaign was running, as well as our revenue, click-through-rate and conversion rate. It’s revenue that pays the bills, so that’s our target. We’ll start by having a look at the spend and revenue data:

```r
# plot the data to see what it looks like
plot(display$spend, display$revenue)
```

It would appear that, as we spend more, we generate more revenue. That’s a really good start.

While R’s built-in plotting functions are great, we’ll switch to using ggplot2, and add that trendline as if we were in Excel. Only faster.

```r
ggplot(display, aes(x = spend,
                    y = revenue)) +
  geom_point() +
  geom_smooth(method = "lm")   # add trendline using linear model
```

What a difference a line makes. That ‘up’ trend would convince the C-level, right? The grey border around the blue line indicates how confident we are in that line. You can see how it gets wider to the extremes; we have fewer data points there so it’s less confident.
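If you want to see that widening in numbers rather than in grey shading, here is a minimal sketch. It assumes simulated stand-in data (the `toy` data frame and every number in it are made up, not the article's dataset) and asks `predict()` for a 95% confidence interval around the fitted line at a central spend value and at the highest one.

```r
# Simulated stand-in data; NOT the article's display_data.csv
set.seed(42)
toy <- data.frame(spend = seq(10, 100, length.out = 40))
toy$revenue <- 15 + 5 * toy$spend + rnorm(40, sd = 90)

fit <- lm(revenue ~ spend, data = toy)

# Confidence interval for the fitted line at the mean spend and at the maximum
new_spend <- data.frame(spend = c(mean(toy$spend), max(toy$spend)))
ci <- predict(fit, newdata = new_spend, interval = "confidence")

# Width of the band (upper minus lower limit) at each of the two points
band_width <- ci[, "upr"] - ci[, "lwr"]
names(band_width) <- c("at_mean_spend", "at_max_spend")
print(band_width)
```

The band is at its narrowest at the average spend and grows as you move away from it, which is the same shape the grey ribbon draws.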

It looks like we have a reasonable correlation, so let’s formalise our model and see what we can learn.

```r
# build linear model
lm_mod1 <- lm(revenue ~ spend, data = display)

# look at our model with summary function
summary(lm_mod1)
```

```
Call:
lm(formula = revenue ~ spend, data = display)

Residuals:
     Min       1Q   Median       3Q      Max 
-175.640  -56.226    1.448   65.235  210.987 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  15.7058    35.1727   0.447    0.658    
spend         5.2517     0.6624   7.928 1.42e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 93.82 on 38 degrees of freedom
Multiple R-squared:  0.6232, Adjusted R-squared:  0.6133 
F-statistic: 62.86 on 1 and 38 DF,  p-value: 1.415e-09
```

There is a lot of information in there, but let’s pick out what would be most important to us in understanding how our PPC campaign is performing. If you remember some maths from high school, you might recall that the equation for a straight line on a graph is y = ax + b; a tells us how steep the line is, and b describes at what point it crosses the y axis (where x = 0). The Estimate column in the Coefficients section of the output gives us a and b.

For our model, b (the intercept) is 15.71, and a (the gradient) is 5.25. For our purposes, a is the interesting number here, as it tells us what we’re getting back from our PPC spend. With a gradient of 5.25, our model suggests that each £1 we spend on PPC is leading to an increase of revenue of £5.25.
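To make the y = ax + b reading concrete, here is a small sketch that plugs the two estimates from the summary output into the straight-line equation. The helper `predict_revenue()` is a hypothetical name introduced only for illustration, and the £50 spend is an arbitrary example value.

```r
# Coefficients copied from the summary(lm_mod1) output
intercept <- 15.7058  # b: predicted revenue when spend is 0
gradient  <- 5.2517   # a: extra revenue per extra £1 of spend

# Hypothetical helper: predicted revenue for a given daily spend
predict_revenue <- function(spend) {
  intercept + gradient * spend
}

# Predicted revenue for a £50 day: 15.7058 + 5.2517 * 50
print(predict_revenue(50))

# Spending one more pound moves the prediction by exactly the gradient
print(predict_revenue(51) - predict_revenue(50))
```

With the fitted model object itself, `coef(lm_mod1)` returns the same two numbers and `predict(lm_mod1, newdata = data.frame(spend = 50))` does the arithmetic for you.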

However, note that this only works because we are using the same units on each axis. If our y axis was measured in millions (1 = 1 million etc.), then we’d still get the same estimate of a, but we’d have to remember that meant we’d generate an extra £5.25m for each £1 of spend. I wish my campaigns did that.

The Pr(>|t|) value for spend is something we want to look at as well. It tells us how statistically significant this pattern is. A value of less than 0.05 means that, if there were really no relationship between spend and revenue, we’d have less than a 5% chance of seeing a trend at least this strong through random chance alone. With a value in the order of 10^-9, we can be fairly confident that we’re looking at a real trend.
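You don’t have to read the p-value off the printed summary; it can be pulled out of the model programmatically. The sketch below assumes simulated stand-in data (the `toy` data frame is made up, not the article's dataset) and also prints a 95% confidence interval for the gradient, which is often easier to explain to a colleague than a p-value.

```r
# Simulated stand-in data; NOT the article's dataset
set.seed(1)
toy <- data.frame(spend = seq(10, 100, length.out = 40))
toy$revenue <- 15 + 5 * toy$spend + rnorm(40, sd = 90)

fit <- lm(revenue ~ spend, data = toy)

# The coefficient table from summary() is a matrix we can index by name
coef_table <- summary(fit)$coefficients
p_spend <- coef_table["spend", "Pr(>|t|)"]
print(p_spend)

# 95% confidence interval for the gradient (revenue per £1 of spend)
spend_ci <- confint(fit, "spend", level = 0.95)
print(spend_ci)
```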

IMPORTANT! Just because we have a statistically significant trend, it doesn’t mean that the increased revenue is BECAUSE of our increased PPC spend; CORRELATION DOES NOT EQUAL CAUSATION. If you have a sale that leads to huge amounts of revenue, chances are you’d up your PPC budget, right? But did that drive the revenue, or would the great deals have driven more conversion anyway?

The R-squared value (the coefficient of determination, which for a simple linear regression is the square of the correlation coefficient) lets us know how much of the variation in the data is explained by our model. In this case, our simple linear model with a single independent variable is explaining around 61% of the variability (going by the adjusted R-squared of 0.6133). Not too bad. But I think we can do better…
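If “variation explained” feels a bit abstract, it can be rebuilt by hand: R-squared is 1 minus the ratio of the leftover (residual) variation to the total variation in revenue. The sketch below does that on simulated stand-in data (the `toy` data frame is made up, not the article's dataset) and checks it against what `summary()` reports. With a single explanatory variable, it also matches the squared Pearson correlation.

```r
# Simulated stand-in data; NOT the article's dataset
set.seed(7)
toy <- data.frame(spend = seq(10, 100, length.out = 40))
toy$revenue <- 15 + 5 * toy$spend + rnorm(40, sd = 90)

fit <- lm(revenue ~ spend, data = toy)

# Variation left over after fitting the line vs total variation in revenue
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((toy$revenue - mean(toy$revenue))^2)
r2_by_hand <- 1 - ss_res / ss_tot

# The same quantity from the model summary and from the correlation
r2_from_summary <- summary(fit)$r.squared
r2_from_cor     <- cor(toy$spend, toy$revenue)^2

# Adjusted R-squared penalises the number of explanatory variables
adj_r2 <- summary(fit)$adj.r.squared

print(c(by_hand = r2_by_hand, summary = r2_from_summary,
        cor_squared = r2_from_cor, adjusted = adj_r2))
```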

#### Extending our regression model

By adding another explanatory variable, can we explain more of the variability in our data? Looking at the dataset, we can see that there is a variable called display. This indicates whether additional display marketing campaigns were running and takes a value of 1 for yes and 0 for no. How do we include this information in our model?

First, we make sure that R knows that this variable is categorical and not really a number:

```r
# convert display variable to categorical
display$display <- as.factor(display$display)
```
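To get a feel for what that conversion does, here is a hedged sketch on simulated stand-in data (the `toy` data frame and its column `display_f` are made up for illustration). For a flag that only takes the values 0 and 1, a linear model gives the same estimates either way; the factor version mainly changes the labelling, and it becomes essential once a category has more than two levels.

```r
# Simulated stand-in data; NOT the article's dataset
set.seed(2)
toy <- data.frame(
  spend   = seq(10, 100, length.out = 40),
  display = rep(c(0, 1), times = 20)   # numeric 0/1 flag
)
toy$revenue <- -40 + 5 * toy$spend + 100 * toy$display + rnorm(40, sd = 80)

# The same flag stored as a factor with levels "0" and "1"
toy$display_f <- as.factor(toy$display)
print(levels(toy$display_f))

fit_numeric <- lm(revenue ~ spend + display,   data = toy)
fit_factor  <- lm(revenue ~ spend + display_f, data = toy)

# Coefficient names differ ("display" vs "display_f1"), the estimates match
print(coef(fit_numeric))
print(coef(fit_factor))
```

The trailing `1` in the factor coefficient's name is the level being compared against the baseline level `0`.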

Then all we need to do is add it in to our linear model expression. We’ll run the summary again to see what that does to our figures:

```r
# build extended regression model including display
lm_mod2 <- lm(revenue ~ spend + display, data = display)

# look at our model with summary function
summary(lm_mod2)
```

```
Call:
lm(formula = revenue ~ spend + display, data = display)

Residuals:
     Min       1Q   Median       3Q      Max 
-189.420  -45.527    5.566   54.943  154.340 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -41.4377    32.2789  -1.284 0.207214    
spend         5.3556     0.5523   9.698 1.05e-11 ***
display1    104.2878    24.7353   4.216 0.000154 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 78.14 on 37 degrees of freedom
Multiple R-squared:  0.7455, Adjusted R-squared:  0.7317 
F-statistic: 54.19 on 2 and 37 DF,  p-value: 1.012e-11
```

Great! We can now explain around 73% of our data’s variability. The gradient estimate for our spend has stayed about the same, but now we’ve added a new number: what does that 104.29 mean for display1? display1 simply refers to cases where our display campaign is on. The 104.29 doesn’t change the slope of the line, but it shifts the whole line upwards: revenue still increases by a bit over £5 for each extra £1 we spend, and when the additional campaigns are running we’ll see an extra £104.29 of revenue at any given level of PPC spend. You can see what that looks like in this article on ‘parallel slopes regression’.
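The two parallel lines can be written out as plain arithmetic. This sketch plugs the three estimates from the `lm_mod2` summary into the model equation; `predict_revenue2()` is a hypothetical helper name and the £50 spend is an arbitrary example value.

```r
# Coefficients copied from the summary(lm_mod2) output
intercept      <- -41.4377
spend_gradient <-   5.3556   # extra revenue per extra £1 of spend
display_uplift <- 104.2878   # shift applied when the display campaign is on

# Hypothetical helper: predicted revenue for a spend level and display flag
predict_revenue2 <- function(spend, display_on) {
  intercept + spend_gradient * spend + display_uplift * as.numeric(display_on)
}

display_off <- predict_revenue2(50, FALSE)
display_on  <- predict_revenue2(50, TRUE)

print(c(display_off = display_off, display_on = display_on))

# The gap between the two lines is the same at every spend level
print(display_on - display_off)
```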

If you have additional explanatory variables in your dataset, you can add those in the same way, but don’t get carried away using everything that you have available; sometimes the best, most robust models can be quite lean. As you continue to add more explanatory variables to your model, you’ll likely be able to explain more of the variability, but that doesn’t always give you a more useful model. If you want to know more, search for the wonderfully-named Akaike’s Information Criterion.
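As a taster, base R’s `AIC()` function will score several models side by side; lower is better, and each extra parameter carries a penalty. The sketch below assumes simulated stand-in data (the `toy` data frame and the model names `mod_small` and `mod_big` are made up for illustration), so treat the numbers it prints as an example of the mechanics rather than a result.

```r
# Simulated stand-in data; NOT the article's dataset
set.seed(123)
n <- 40
toy <- data.frame(
  spend   = runif(n, 10, 100),
  display = factor(rep(c(0, 1), each = n / 2))
)
toy$revenue <- -40 + 5 * toy$spend + 100 * (toy$display == "1") +
  rnorm(n, sd = 80)

mod_small <- lm(revenue ~ spend, data = toy)
mod_big   <- lm(revenue ~ spend + display, data = toy)

# One row per model: df counts the estimated parameters, AIC is the score
aic_table <- AIC(mod_small, mod_big)
print(aic_table)
```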

#### Checking a linear regression model in R

Linear regression, as with most statistical techniques, comes with a set of assumptions. In order to be confident in our model, we want to check that we haven’t violated these assumptions. Using the built-in plot function in R, we can check these fairly easily. All we need to do is pass our model object to plot, and we get back a helpful collection of diagnostic figures.

We’ll do just that with our first model to check that all is well and everything is as it should be. Running the plot(lm_mod1) code will generate a series of four plots. We won’t spend much time on picking them apart. The key thing is to understand what they look like when all is well, and recognise the patterns when something’s not right.

```r
# plot first model to produce diagnostic plots
plot(lm_mod1)
```
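By default the four plots appear one after another. If you would rather see them together, one option is to split the plotting window into a 2 x 2 grid first. This is a minimal sketch on simulated stand-in data (the `toy` data frame is made up, not the article's dataset).

```r
# Simulated stand-in data; NOT the article's dataset
set.seed(10)
toy <- data.frame(spend = seq(10, 100, length.out = 40))
toy$revenue <- 15 + 5 * toy$spend + rnorm(40, sd = 90)

fit <- lm(revenue ~ spend, data = toy)

# Split the plotting area into a 2 x 2 grid, keeping the old settings
old_par <- par(mfrow = c(2, 2))

# Draws Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
plot(fit)

# Restore the previous plotting layout
par(old_par)
```

To draw just one of them, pass the `which` argument, for example `plot(fit, which = 1)` for Residuals vs Fitted.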

The first of the four, the Residuals vs Fitted plot, is a reasonable example of what this plot should look like. There aren’t any strange patterns in the dots, beyond a linear spread along the centreline. As we don’t see any funky, curvy patterns, we can probably be happy that a linear model is a safe choice.

The normal Q-Q plot checks that our residuals are normally distributed (an important assumption of linear regression). Dots lying along the dotted line are good, dots lining up in a banana shape or making an s-shaped curve are not.

In the Scale-Location plot, we want to check that the spread of the dots from top to bottom remains about constant as we move from left to right. Here, we are checking the assumption of equal variance (homoscedasticity). What we wouldn’t want to see is the dots spreading out as we move from one end to the other.

The Residuals vs Leverage plot checks for points (such as outliers) that might have a significant influence on the model. Points in the top right and bottom right can be a problem, but, here, none of the points cross over the dotted red (Cook’s distance) lines, so we’re in reasonable shape. If you have points that step over the line, you can get a feel for their effect by removing them from the dataset, rebuilding the model and looking at how the coefficients have changed.
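Here is a hedged sketch of that remove-and-rebuild check. It uses simulated stand-in data with one deliberately planted outlier (the `toy` data frame and the size of the outlier are made up for illustration), finds the point with the largest Cook’s distance using `cooks.distance()`, and compares the coefficients with and without it.

```r
# Simulated stand-in data; NOT the article's dataset
set.seed(99)
toy <- data.frame(spend = seq(10, 100, length.out = 40))
toy$revenue <- 15 + 5 * toy$spend + rnorm(40, sd = 30)

# Deliberately plant one extreme point on the highest-spend day
toy$revenue[40] <- toy$revenue[40] - 600

fit_all <- lm(revenue ~ spend, data = toy)

# Cook's distance for every observation; pick out the most influential one
cooks <- cooks.distance(fit_all)
worst <- which.max(cooks)
print(cooks[worst])

# Rebuild the model without that point and compare the coefficients
fit_trimmed <- lm(revenue ~ spend, data = toy[-worst, ])
print(rbind(all_points = coef(fit_all), without_worst = coef(fit_trimmed)))
```

Dropping a point like this is a way to gauge its influence, not an automatic licence to delete it; whether it stays in the final model depends on whether it is a genuine observation.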

With that very quick overview of the linear regression diagnostic plots out of the way, I’ll now refer you to this more comprehensive resource from the University of Virginia, which gives a much better description of these plots and provides examples of what they should and shouldn’t look like.

That was a very brief introduction to linear regression using R. Regression is a very useful and important technique in data analysis, and not just for marketers. If you are a marketer, regression can help you get a feel for your return on advertising spend, the effect of device type on website visit behaviour, and what concurrent print or TV advertising campaigns have on your KPIs.

Unlike many machine learning techniques, regression can be used effectively with much smaller datasets, so download a report from Google Analytics and give it a go!
