Google Analytics is an excellent tracking tool for determining click-through paths and their resulting behaviour. But what if you wanted to determine the effect of something more subtle, with no direct click path?…
A good example of this might be TV or radio ad-slot data. How do you know if your TV or radio ads have an effect on visits to your website, or on some other time-related data?
Obviously, the first thing to ensure is that you have time-related data for both metrics that you want to compare. For this example, I am going to describe TV ad-data and visits to a website, but you could equally look at other time-sensitive data such as social media conversations, radio ads, online PR, online sales, visits from search, display schedules and so on.
Your other data set must also be time-related in a similarly robust way.
Normalising your data
Once you have two data sets, you need to ensure they both use the same time scale. For example, if your TV data is only accurate to the nearest 5-minute period, then you should aggregate the website visit data up to 5-minute periods as well. Always aggregate to the resolution of the least precise data set you need to compare.
Once you have the two data sets on the same time scale, you will probably need to fill in the blanks – i.e. add in all the time points on the scale with no visits, or no TV impacts.
I’ll talk about impacts when describing the TV data. Without getting too deeply involved in definition, an “impact” is a measure describing an estimate of the target audience that the ad is likely to have been seen by. It is similar in a way to an impression on a display ad, although not really the same thing at all!
What you should have once you have normalised your data is two data sets with the same scale (e.g. 5-minute intervals) and the same length (i.e. starting and finishing at the same time, with the same number of data points).
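To make the normalising step concrete, here is a minimal Python sketch using only the standard library. The input shapes are my assumptions for illustration: `visit_times` is a list of visit timestamps, and `tv_spots` is a list of (timestamp, impacts) pairs. The function names are hypothetical, and the 5-minute bucket is just the example resolution used above. Buckets are aligned to midnight, so this assumes a bucket size that divides evenly into a day.

```python
from collections import Counter
from datetime import datetime, timedelta

BUCKET = timedelta(minutes=5)  # assumed resolution of the least precise data set


def floor_to_bucket(ts, bucket=BUCKET):
    """Round a timestamp down to the start of its bucket (aligned to midnight)."""
    size = int(bucket.total_seconds())
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds())
    return midnight + timedelta(seconds=elapsed - elapsed % size)


def normalise(visit_times, tv_spots, start, end, bucket=BUCKET):
    """Return (times, impacts, visits): three lists of equal length.

    visit_times: one datetime per website visit (hypothetical input shape).
    tv_spots:    (datetime, impacts) pairs, one per ad slot (hypothetical).
    """
    # Aggregate both data sets up to the same bucket size.
    visits = Counter(floor_to_bucket(t, bucket) for t in visit_times)
    impacts = Counter()
    for t, n in tv_spots:
        impacts[floor_to_bucket(t, bucket)] += n

    # Walk every bucket from start to end, filling the blanks with zero.
    times, xs, ys = [], [], []
    t = floor_to_bucket(start, bucket)
    while t < end:
        times.append(t)
        xs.append(impacts.get(t, 0))  # zero where no ad was shown
        ys.append(visits.get(t, 0))   # zero where nobody visited
        t += bucket
    return times, xs, ys
```

Both output lists start and finish at the same time and have one entry per bucket, which is all the later steps need.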
Here is a simple data table as an example:
The two data sets are the same length and on the same scale, and zero values have been filled into the gaps.
Obviously, you will typically have a bigger data set than this – I expect you’d work with data sets with many tens of thousands of data points for this kind of analysis.
The next step is to carry out a regression analysis on the data.
This is where you try to recall all that statistical maths you vaguely remember from high school and finally put it to some use!
If you need to remind yourself how to do all this, Khan Academy is a great resource.
First of all, we need to ignore the time in your data. By this I mean that the next step does not use your time stamp at all. What we want to do is consider our two data series as though they were two-dimensional points on a scatter graph, i.e. your TV impacts data is the X coordinate, and the visit data is the Y coordinate. We are asking ourselves the question “what is the correlation between X and Y?” Or in another sense, “if we know X, how confidently can we predict Y?”
Therefore, by undertaking this analysis we could answer the question, if we know how many TV impacts will happen, can we predict how many visits there will be to the website?
So having turned our two data sets into one long list of (X,Y) coordinates, we need to calculate the regression line for our data. The regression line is sometimes called the line of best fit. In essence, it is the straight line that passes as close as possible to all the two-dimensional points. In statistical terms, it is the line with the least total squared error.
Once we have this, we can calculate the coefficient of determination (R2), which tells us how much of the variation in Y is explained by X. In other words, it measures the strength of the linear correlation between our X and Y coordinates, i.e. between our two original data sets.
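For anyone who would rather see this as code than dig out the textbook, here is a small Python sketch of the regression line and R2 using the standard least-squares formulas. The function names are my own, and it assumes both lists are the same length and that neither is constant (otherwise you get a division by zero).

```python
def linear_regression(xs, ys):
    """Least-squares line of best fit: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys):
    """Coefficient of determination: 1 - (residual error / total variation)."""
    slope, intercept = linear_regression(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Squared error of the points around the regression line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Squared error of the points around the plain average of Y.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot
```

Here `xs` would be the TV impacts and `ys` the visits, one value per time period, with the time stamps themselves left out.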
Now we have a measure of the correlation between the data sets, we need to time-shift the original data and do the same again.
In the example we are discussing here, we are using TV impacts as our X coordinate. We want to predict visits based on TV impacts, so we keep the TV data intact, and time shift the visit data only.
Using the example data from earlier, we move the visit data up one row, keeping everything else the same. Each TV impacts value is now paired with the visits from the following time period.
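In code, the shift is just a matter of slicing the two lists. This Python sketch makes one assumption the description above does not spell out: the rows left without a partner after the shift (the last TV rows and the first visit rows) are dropped rather than padded. The function name is hypothetical.

```python
def shift_visits(impacts, visits, rows):
    """Pair each impacts value with the visits recorded `rows` periods later.

    Rows that no longer have a partner after the shift are dropped
    (an assumption; padding with zeros would bias the result).
    """
    if rows == 0:
        return list(impacts), list(visits)
    return list(impacts[:-rows]), list(visits[rows:])
```

With 5-minute periods, `rows=1` compares each ad slot with the visits five minutes later, `rows=2` with the visits ten minutes later, and so on.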
Now we repeat the same procedure exactly, and build another regression line used to calculate another R2 value.
Keep repeating this process, time shifting the visit data and calculating R2 values, and you will end up with a data set that looks something like the following (note that this has not been calculated from the example data I gave above, as there will not be anywhere near enough data points in the sample).
|Time Shift||R2 value|
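The whole shift-and-repeat procedure can be sketched in a few lines of Python. This is an illustration under my own assumptions, not a definitive implementation: the function names are hypothetical, unmatched rows are dropped after each shift, and R2 is computed as the square of the correlation coefficient, which gives the same number as the regression-line calculation when there is a single X variable.

```python
def r_squared(xs, ys):
    """R2 for a straight-line fit, computed as the squared correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # assumed convention: a constant series shows no correlation
    return (sxy * sxy) / (sxx * syy)


def lag_sweep(impacts, visits, max_shift, bucket_minutes=5):
    """Return a list of (time shift in minutes, R2), one per shift tried."""
    results = []
    for rows in range(max_shift + 1):
        xs = impacts[:len(impacts) - rows]  # TV data stays in place
        ys = visits[rows:]                  # visit data moves up `rows` rows
        results.append((rows * bucket_minutes, r_squared(xs, ys)))
    return results


def strongest_shift(results):
    """Pick the (time shift, R2) pair with the highest R2."""
    return max(results, key=lambda pair: pair[1])
```

Calling `lag_sweep(impacts, visits, max_shift=12)` on 5-minute data would try every shift from zero to one hour, and `strongest_shift` then picks out the row of the table with the highest R2.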
Understanding the results
The example result above shows the strongest correlation at a time shift of 5 minutes. Given all the different things that could affect visits, 0.33 is a strong enough value to say there is at least some correlation. The closer the R2 value is to zero, the less correlation there is; the closer it is to 1, the stronger the correlation.
So what we have learned about our fictitious TV and visit data above is that there is some correlation between TV impacts and website visits for our example and that the visits are most likely to occur five minutes after the ad is shown.
This example is entirely fictitious, by the way, and I’m only giving it as an example of the technique. However, this technique is exactly the right way to go about determining the correlation between time-logged data sets.