Stop Google Analytics referral spam for good

By John Hughes

24 February 2021

Discover how the newest Google Analytics referral spam requires new ways of filtering. A step-by-step guide to fixing the problem for good.

Recently, we found a new, more sophisticated version of referral spam being pushed into Google Analytics data. It turned out that many people around the globe were noticing the same thing. This raises the question: how can we stop referral spam for good? After all, any spammed data is a severe threat to the validity and accuracy of analytics. How can you take the right actions to protect yourself from it?

This post is taken from the talk given by me at Martech Summit 2021, on February 24th, 2021.

Let us begin with a story

The story is taken from ancient Roman literature – an epic poem called the Aeneid by Virgil. In his poem, Virgil tells the legendary story of Aeneas, a refugee soldier who escaped from the city of Troy. Over the almost 10,000 lines of the poem, Aeneas escapes Troy, travels to Italy, fights a war against the natives, and founds Rome.

It is a little more detailed than I have just made it sound, but the point is that the Aeneid is the literary source of the story of the Wooden Horse of Troy.

After ten long years of fruitless siege, the Greek besiegers, under the guidance of Odysseus, constructed an enormous wooden horse, inside which they hid a force of their best soldiers. The Greeks left the horse at the gates of Troy inscribed with the words,

“For their return home, the Greeks dedicate this offering to Athena.”

The remaining Greek army pretended to lift the siege and leave. On seeing this, the Trojans brought the gift of the wooden horse inside their city walls as a trophy and began to celebrate.

That night, the Greek soldiers came out of the horse and opened the city gates for the remaining Greek army outside. The Trojan army was routed, and Aeneas was left without a home, thus beginning his epic journey.

There is a new Trojan Horse, and this time it is being left by hackers who want to push fake traffic into your Google Analytics accounts.

Much like the city gates of Troy, I’m going to show you that the gates to your Google Analytics are currently wide open to let this horse in. This “Greek army” of fake traffic is on its way. Hackers are trying to send fake website traffic to your Google Analytics accounts.

Why is this important?

Your digital analytics data is the foundation of your decision making. Every marketing tactic you use; every website optimisation you make; every penny you spend on traffic is predicated on the feedback loop you get through analytics.

What would happen if you were to find out that large chunks of the data inside your analytics account were in fact not real? Made up. Spurious. How would you feel about spending money on marketing, only to find that some of the traffic you thought it was driving was in fact more grounded in fiction than most of Virgil’s Aeneid?

For example, you might be running a paid traffic campaign and mistakenly think that large increases in non-paid traffic were down to some kind of “halo effect”. You might move budget to spend more on the parts of the campaign that, in reality, weren’t working that well and neglect those parts that were. You would be making incorrect budget investment decisions that damage the overall ROI of your digital marketing.

Or, you might see a large increase in organic traffic, and assume there was a change in user search behaviour or maybe a Google algorithm shift. You might dedicate resources to trying to capitalise on this, moving them away from what really needs to be done.

Maybe you will see declining conversion rates, not realising that the cause is actually an inflation in website traffic, not a decline in real conversions. Perhaps you also notice the increase in bounce rate that the fake traffic causes. You divert resources away from proactive marketing, or reduce marketing spend, while investigating how to tackle the imagined on-site conversion and engagement issues.
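To make the arithmetic behind that last example concrete, here is a minimal sketch. All of the figures are invented purely for illustration; the point is only that fake sessions which never convert drag the reported conversion rate down, even though real visitor behaviour has not changed.

```python
def conversion_rate(conversions: int, sessions: int) -> float:
    """Conversions divided by sessions, expressed as a percentage."""
    return 100.0 * conversions / sessions


# Hypothetical figures, for illustration only.
real_sessions = 10_000
conversions = 300
bot_sessions = 2_500  # fake sessions injected by a bot; they never convert

true_rate = conversion_rate(conversions, real_sessions)
reported_rate = conversion_rate(conversions, real_sessions + bot_sessions)

print(f"true conversion rate:     {true_rate:.2f}%")
print(f"reported conversion rate: {reported_rate:.2f}%")
```

With these made-up numbers, the real conversion rate is unchanged, yet the figure in the report falls simply because the denominator has been inflated.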

These examples are just some of the ways in which such bot traffic can cause you to unknowingly make poor decisions and leave marketing potentially hamstrung through uncertainty and confusion.

You may already be familiar with the phrase “referral spam”. This new bot traffic issue is very similar in many ways to referral spam, but it has some key differences that make it considerably more dangerous and tricky to deal with this time.

Referral spam appeared in your Google Analytics data, usually populating your referrer reports with the names of websites selling nefarious services. Sometimes there were dodgy SEO service websites or sites selling some kind of video optimisation service, or even ironically trying to sell you services to “stop referral spam”.

For some reason, they believed that their best way to market their own services was to inject their data into your referral reports, and that as an analytics user, you would think to yourself, “well, this guy certainly knows his way around screwing up my analytics data, maybe I should trust him with my marketing budget too.”

I guess it must work: there have been several variants of referral spam over the years.

A slightly more recent variant was much more random, originating in Russia and sending random strings into the browser language reports – which I think only anally retentive analysts like me ever bother looking at. Honourable mention goes to Donald Trump, getting support from Russian spammers. Who would’ve thought?

However, both of these techniques had one thing in common that made them relatively easy to protect against – they had a generic fingerprint. Usually in these cases, the traffic sent to Google Analytics had no hostname associated with it.

This meant it was relatively trivial to block almost all instances of referral spam from being collected into your Google Analytics data. Some other fields could also be used to generically fingerprint this referral spam and prevent it from becoming a nuisance.

A more targeted, but less generic protection would be to look for the more specific fingerprint of the referral spam and block that. For example, if the referral spam had a “referrer” of “XYZ-SEO.biz”, then block that specific referrer.

The problem with targeted protection, of course, is that it only blocks specific examples of referral spam, not all of them, and so you are still left wide open to other referral spam attacks.
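The difference between the two approaches can be sketched in a few lines. This is only an illustration of the logic, not how Google Analytics implements its filters: the hostnames, referrers and function names are hypothetical, and the hits are simplified to two fields.

```python
from typing import Optional

# Hypothetical values, for illustration only.
VALID_HOSTNAMES = {"www.example.com", "example.com"}
BLOCKED_REFERRERS = {"xyz-seo.biz"}


def passes_generic_filter(hostname: Optional[str]) -> bool:
    """Generic protection: keep only hits that carry one of our own hostnames."""
    return hostname is not None and hostname.lower() in VALID_HOSTNAMES


def passes_targeted_filter(referrer: Optional[str]) -> bool:
    """Targeted protection: drop hits from one specific, known spam referrer."""
    return referrer is None or referrer.lower() not in BLOCKED_REFERRERS


hits = [
    {"hostname": "www.example.com", "referrer": "news.example.org"},  # real visit
    {"hostname": None, "referrer": "XYZ-SEO.biz"},          # spam on the blocklist
    {"hostname": None, "referrer": "other-spam.example"},   # spam not on the blocklist
]

generic = [h for h in hits if passes_generic_filter(h["hostname"])]
targeted = [h for h in hits if passes_targeted_filter(h["referrer"])]

print(len(generic), "hit(s) survive the generic filter")
print(len(targeted), "hit(s) survive the targeted filter")
```

In this toy data the generic filter removes both spam hits because neither carries a hostname, while the targeted filter lets through the spammer it has not been told about.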

What makes the newer variants of referral spam more problematic?

The problem is that there is no generic fingerprint.

There is nothing like, for example, the absence of a hostname. The bot traffic appears to be tracked from many different devices across many different locations, using many different browsers and operating systems, and arriving as referral, direct and organic traffic. The only way to tell, in the most recent attacks, that the traffic was indeed bot traffic is that the “landing page” URL was the spammed data.

Screenshot of Google Analytics landing page showing spammed data

The second problem is that the URL spammed into the data is not always the same. Within each website attack we have seen, the URL has been consistent (thereby pushing the landing page into prominence in reporting), but across the range of attacked websites we have seen, the URL differs.

Also, the landing page was the URL of a site selling bot traffic. This is troublesome in a few ways – first of all, it implies that unscrupulous digital marketers could inflate the supposed traffic to your site by paying for bot traffic.

Alternatively, competitors could damage your analytics data by pointing bots at your analytics account.

I absolutely do not condone either of those things, but you should understand the very real risk that someone might do them to you.

Also, never visit sites spamming in this way. There is no telling what malware and so on you may end up infected with!

How can we protect ourselves against this?

We can’t defend against it using a generic filter as it has no generic fingerprint, but if we defend using targeted filters, the chances are that we won’t be protected against the next attack, or the one after that.

Let us start by taking a look at how the recent bot attack emerged.

It seems to have begun around 31st January, but we have also seen some intermittent activity in the days following that.

Originally, I became aware of the bot attack when a number of users I follow on Twitter first started to report that they could see bot traffic in their Google Analytics accounts. In fact, digging deeper, hundreds of Twitter users began to report seeing some form of this traffic. Very soon, some Analytics and SEO blogs also began to report what they were seeing.

Consequently, we began a review across the various Google Analytics accounts to which we have access to determine if this had affected any of them.

It turns out it had. In fact, something in the order of 2% of the properties whose GA data we could review had already been affected.

It is a small sample size, but given we could see this was a fairly universal issue across websites around the world, it is disturbing to consider the number of Google Analytics accounts that could also be compromised.

How can you protect yourself against these spam bot traffic attacks? How can you protect against unscrupulous competitors pointing bot traffic at your site, or shady marketers buying bot traffic to gain some pretence at success?

How the bot traffic ends up in your Google Analytics account

Google has a number of APIs for Google Analytics. One of these is the measurement API (Google's own name for it is the Measurement Protocol), and it is entirely open. Anyone can send any data to it, and that data will be queued for inclusion in Google Analytics reports.
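To show how little a hit contains, here is a minimal sketch of a Universal Analytics pageview built for the measurement API (the Measurement Protocol). The parameter names v, tid, cid, t, dh and dp come from Google's published protocol; the tracking ID, client ID, hostname and path are placeholders, and the sketch only builds the request string without sending anything.

```python
from urllib.parse import urlencode

# Universal Analytics Measurement Protocol collection endpoint.
ENDPOINT = "https://www.google-analytics.com/collect"


def build_pageview(tracking_id: str, client_id: str, hostname: str, path: str) -> str:
    """Build the URL-encoded payload for a single pageview hit."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # the property's tracking ID
        "cid": client_id,    # anonymous client identifier
        "t": "pageview",     # hit type
        "dh": hostname,      # document hostname
        "dp": path,          # document path
    }
    return urlencode(params)


# Placeholder values, for illustration only.
payload = build_pageview("UA-XXXXX-Y", "555", "www.example.com", "/pricing")
print(ENDPOINT + "?" + payload)
```

Nothing in that payload proves the hit came from a real browser on your site, which is why anyone who knows the tracking ID and hostname can produce plausible-looking traffic.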

At Storm, we utilise the measurement API ourselves in certain scenarios. For example, where we want to track digital engagement and usage data to Google Analytics but don’t want to risk using scripts managed by third-parties, such as Google Analytics or Google Tag Manager. We do this in some secure environments such as in financial services and healthcare, where third-party scripts carry a specific security risk.

The bot similarly uses the measurement API to push data. We could see this by comparing against Azure server log data, which shows that no real HTTP requests occurred for these rogue landing pages.
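That comparison can be sketched as a simple set difference. This is an assumption-laden illustration rather than the exact process we used: it supposes you have already exported the landing page paths from Google Analytics and the requested paths from your server logs into plain lists, and the paths shown are invented.

```python
from typing import Iterable, List


def landing_pages_missing_from_logs(
    ga_landing_pages: Iterable[str], logged_paths: Iterable[str]
) -> List[str]:
    """Return landing pages reported by analytics that the server never served."""
    logged = set(logged_paths)
    return [page for page in ga_landing_pages if page not in logged]


# Hypothetical exports, for illustration only.
ga_landing_pages = ["/", "/pricing", "/bot-traffic-for-sale.example"]
logged_paths = ["/", "/pricing", "/contact", "/"]

suspects = landing_pages_missing_from_logs(ga_landing_pages, logged_paths)
print(suspects)
```

In practice you would probably need to normalise the two sources first (query strings, trailing slashes, letter case) before a difference like this is trustworthy.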

The bot has seemingly been fed the hostname and analytics account ID that it needs to send realistic-looking traffic to the API. These are not difficult to obtain, either using an automated scraper or using a tool like Google Tag Assistant. Once armed with this data, the bot can simply send randomised device and location information with the hostname and account ID. It could even send actual real page URL information – if it were to do that, you would have no way of knowing at all that the data in your reports was fake.

It is likely that most referral spam is done by a scatter gun approach – a scraper grabbing the analytics ID and hostnames from sites to feed the bot software. A person is not literally using Google Tag Assistant but more likely sending out scraper bots to harvest hostnames and Google Analytics IDs.
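To see why harvesting is so easy, consider that the tracking ID sits in plain view in the page source. The sketch below runs a regular expression over a sample snippet; the pattern is a rough approximation of the Universal Analytics ID format that I am assuming for illustration, and the HTML and ID are made up.

```python
import re
from typing import List

# Rough pattern for Universal Analytics IDs such as UA-12345678-1 (an approximation).
UA_ID_PATTERN = re.compile(r"\bUA-\d{4,10}-\d{1,4}\b")


def find_tracking_ids(html: str) -> List[str]:
    """Return the distinct tracking IDs visible in a page's source."""
    return sorted(set(UA_ID_PATTERN.findall(html)))


# Made-up page source, for illustration only.
sample_html = """
<script>
  ga('create', 'UA-12345678-1', 'auto');
  ga('send', 'pageview');
</script>
"""

print(find_tracking_ids(sample_html))
```

The ID is not a secret and was never meant to be one, so the protection has to come from something else that a generic scraper will not think to look for.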

To protect your Google Analytics data, we have to think laterally about how to create a fingerprint that we can work with. Something that is not easy for the bot to capture. The most obvious way to do this is to add something to your tracking code that is structurally unique: something the bot would have to know specifically about your site, and that on a different site would not just have a different value but sit in a different place.

Effectively, we want to give our tracking code a key to unlock a gate that we put in front of our Google Analytics reports. We can do this using a number of different fields in Google Analytics that we can sacrifice to become the “key”.

In the example I will describe, we will use a custom dimension for this purpose.

In the freemium version of Google Analytics, so long as you are using Universal Analytics, you will have available 20 custom dimensions, or 200 for Analytics 360 accounts. Each custom dimension is indicated by an index number from 1-20, or 1-200 for GA360. For the sake of illustration, let’s assume that we use custom dimension index 5 as our key. We want to make sure that every hit sent to Google Analytics carries the correct value for custom dimension 5 so that we can set up a filter in Google Analytics to only allow that traffic into our reports. This is a three-step process.

First, set up the custom dimension. The name doesn’t matter as you can only see the dimension name in the GA admin. Make sure the custom dimension is set to record with ‘Hit’ scope.

Screenshot of Google Analytics showing how to create a custom dimension

Second, add your key to Google Tag Manager using a GA settings variable, or add it to your tracking code if using native GA tracking instead of Tag Manager. If you are using the measurement protocol, simply add the custom dimension onto the HTTP request.

Screenshot of Google Analytics showing how to update your GA settings variable in Google Tag Manager
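For the measurement protocol case, attaching the key is a one-parameter change: custom dimensions travel as cd followed by the index, so index 5 becomes cd5. The sketch below assumes a hit already held as a dictionary of parameters; the helper name, the key value "article" and the other values are hypothetical.

```python
from urllib.parse import urlencode

KEY_INDEX = 5          # custom dimension index used in the example above
KEY_VALUE = "article"  # hypothetical key value, disguised as a content type


def with_key(params: dict) -> dict:
    """Return a copy of the hit parameters with the key dimension attached."""
    keyed = dict(params)
    keyed[f"cd{KEY_INDEX}"] = KEY_VALUE
    return keyed


# Placeholder hit, for illustration only.
hit = {
    "v": "1",
    "tid": "UA-XXXXX-Y",
    "cid": "555",
    "t": "pageview",
    "dh": "www.example.com",
    "dp": "/pricing",
}

payload = urlencode(with_key(hit))
print(payload)
```

Every hit type you send (pageviews, events and so on) would need to carry the same parameter, otherwise the filter in the next step would discard your own legitimate data.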

Then, set up a filter in your Google Analytics views that requires the key. By doing this, traffic not carrying the right value in that custom dimension will not be saved in your reports.

Screenshot of Google Analytics showing how to create an appropriate Google Analytics filter
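The effect of that filter can be pictured as follows. This is only a model of the behaviour, assuming hits are simplified to dictionaries and the key is the hypothetical value "article" in custom dimension 5; the real filtering happens inside Google Analytics when the view is processed.

```python
from typing import Dict, List

KEY_FIELD = "cd5"      # custom dimension 5, as in the example above
KEY_VALUE = "article"  # hypothetical key value


def include_filter(hits: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only hits whose key dimension carries exactly the expected value."""
    return [hit for hit in hits if hit.get(KEY_FIELD) == KEY_VALUE]


# Invented hits, for illustration only.
hits = [
    {"dp": "/pricing", "cd5": "article"},     # real tracking code, carries the key
    {"dp": "/spam-landing-page"},             # bot hit with no key at all
    {"dp": "/spam-landing-page", "cd5": "x"}, # bot hit guessing the wrong value
]

kept = include_filter(hits)
print(kept)
```

Only the hit carrying the right value in the right place gets through, regardless of what device, location or landing page the bot pretends to have.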

If you want to make it even harder for the spammers, there is more you can do:

If you have spare custom dimensions, you can send some decoy keys to those.
You might select keys on some pages differently to others and check off against a list of allowed keys using regular expressions.
Make the value of your key look like a piece of data you might want to collect like a content type rather than a random string of characters.

Doing this makes the effort to spam your Google Analytics property considerably harder – the spammer needs to figure out first of all that you are using a key, secondly where structurally your key is in your tracking code, and thirdly, what value needs to be sent. If you do this, your Google Analytics account should be well protected against such bot traffic in the future.
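If you do vary the key between pages, the filter needs a pattern that accepts the whole list of allowed keys and nothing else. Here is one way that pattern might be built and checked; the key values are hypothetical, and Python's regular expression dialect may differ in small ways from the one used in Google Analytics filters, so treat the generated pattern as a starting point to verify.

```python
import re
from typing import Iterable

# Hypothetical keys, chosen to look like ordinary content types.
ALLOWED_KEYS = ["article", "product", "guide"]


def build_filter_pattern(keys: Iterable[str]) -> str:
    """Build an anchored alternation that matches any one allowed key exactly."""
    return "^(" + "|".join(re.escape(key) for key in keys) + ")$"


def is_allowed(value: str, pattern: str) -> bool:
    """Check a custom dimension value against the filter pattern."""
    return re.search(pattern, value) is not None


pattern = build_filter_pattern(ALLOWED_KEYS)
print(pattern)

for candidate in ["article", "articles", "guide", ""]:
    print(repr(candidate), is_allowed(candidate, pattern))
```

The anchors matter: without them, a value that merely contains one of the keys would pass, which makes the gate easier to slip through by accident or by guesswork.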

What if your data is already affected?

What if you discover that you are already a victim of such a bot attack, and your analytics traffic has already been artificially inflated?

Luckily, Google Analytics has a feature that can help – Advanced Segments.

Advanced segments allow you to apply a kind of lens across traffic that has already been collected through which you can look at just some segment of traffic. They are, for many reasons, one of the single most useful, but also most under-utilised features of Google Analytics.

For example, you could apply a segment for “Paid traffic” and look at the information for traffic coming from your paid search, social and display campaigns – for example, what devices are they using, what are their landing pages, what browser language is set for those users, and so on.

Advanced segments help us here, because if we discover bot traffic in our Analytics data then we can use an advanced segment to remove it from our view of the data. Let’s be clear: it does not delete the data. The advanced segment just gives you a view of a subset of data, and if you define it correctly, then that subset would be “everything that is not bot traffic”. When you are not using the advanced segment, the bot traffic is back in the mixer with all the other traffic.

It is relatively easy to set up an advanced segment.

The first thing you need to do is identify how you can recognise the bot amongst all the other traffic. For these recent bot attacks, this has been the landing page URL, so we will use that in our example.

When you are in your Google Analytics reports, near the top of the page you will see the link to “Add segment”.

Screenshot of Google Analytics showing how to add a segment

Clicking this opens a new menu, listing all the pre-built system segments, and any custom ones you might already have created or that have been shared with you.

Top left of this menu is a big red button, labelled “+ New Segment”. Clicking that button opens the tool that you use to create a new segment.

Screenshot of Google Analytics showing how to create a new segment

The tool itself looks quite complicated but let us ignore most of its functionality and focus in only on what we need.

In the left-hand menu, we can see a sub-menu under the heading “Advanced”, and one of the two options in that menu is “Conditions”.

Screenshot of Google Analytics showing how to select conditions for the segment

Clicking that opens a very simple interface where we want to add the conditions for our segment.

First of all, check that the filter is set to “sessions”, and change the filter type so that it is set to “exclude”.

Screenshot of Google Analytics showing how to set filters to sessions

Then, click the left drop down menu in the tool and it will open up a list of dimensions on which you can filter the data. In our example, we are looking for “landing page”.

You can scroll down and look through the lists of dimensions for it, but it is quicker to start to type it in the search.

Screenshot of Google Analytics showing how to select the dimension to filter traffic

There are still a few dimensions with the phrase “landing page” in the name, so scroll down those few until you find the correct dimension.

Select that, and then in the second drop down menu, select “exactly matches”.

Screenshot of Google Analytics showing how to set match type to 'exactly matches'

Then in the text box, type exactly what appears in the landing page report as the landing page URL that you want to exclude.

Screenshot of Google Analytics showing how to type the landing page path into the text box

If you have been victim to more than one landing page from the bot traffic, you can chain these into a single advanced segment by clicking “OR” on the right-hand side and repeating these last few steps to add the other landing pages you want to block.

Screenshot of Google Analytics showing how to add any additional conditions to a filter
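The logic of the segment built in these steps amounts to "exclude sessions whose landing page exactly matches A or B". A small model of that, with invented session data and hypothetical spam paths, looks like this:

```python
from typing import Dict, List, Set

# Hypothetical spam landing pages, chained together as with the "OR" option.
SPAM_LANDING_PAGES: Set[str] = {"/spam-page-one.example", "/spam-page-two.example"}


def apply_segment(sessions: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the sessions whose landing page is not an exact match for a spam URL."""
    return [s for s in sessions if s["landing_page"] not in SPAM_LANDING_PAGES]


# Invented sessions, for illustration only.
sessions = [
    {"id": "a", "landing_page": "/"},
    {"id": "b", "landing_page": "/spam-page-one.example"},
    {"id": "c", "landing_page": "/pricing"},
    {"id": "d", "landing_page": "/spam-page-two.example"},
]

segment_view = apply_segment(sessions)
print(len(segment_view), "sessions in the segment")
print(len(sessions), "sessions still stored")
```

Note that the original list is untouched: just as in Google Analytics, the segment is a way of looking at the data, and the bot sessions are still there whenever the segment is removed.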

Finally, when you are done, remember to give your segment a useful and memorable name, and then click “save”.

Screenshot of Google Analytics showing how to choose a memorable name and save the segment

When you click save, the advanced segment is applied automatically to your view.

Screenshot of Google Analytics showing graphs of traffic with bot traffic removed

To go back to normal, click on the small dropdown menu at the top right of the segment name, and select the option “remove”.

To add it back again, click the “Add segment” link as you did when you wanted to create the segment before, but this time, you do not need to create it, you can find it quickly by searching the segment list for it by name.

As you have seen, this gives a useful method for viewing your data should it be infected already with bot traffic. Of course, prevention is better than cure, so please all take the steps you need to prevent future infection as I described earlier.

Back to our Wooden Horse of Troy

Once the Trojans had opened the gates and brought the horse inside the city wall, their fate was sealed. That night, the soldiers hidden inside sneaked out of the horse, overpowered the defenders of the city gates and opened the gates for the other Greek soldiers, who had quietly returned to the siege.

Troy was utterly destroyed. So much so that the city went from being a significant seat of power throughout the Bronze Age to being abandoned entirely at the end of the war. In fact, the location of Troy was forgotten during the Dark Ages, and in the early modern era, many scholars believed the stories of Troy to be simply myths. However, the ruins were rediscovered in North Western Turkey in the mid-nineteenth century, and it is now a UNESCO World Heritage site. What it is no longer, though, is a significant seat of power.

The message is that you may have been hit by bot traffic already or you may not have.

Don’t wait for the next bot traffic attacks before you decide to protect yourself.

Do these two things

Take steps to protect yourself against future bot attacks. Don’t wait to find out if you have already become a victim. Do it now!
Undertake some forensic analysis of your own Google Analytics data to see if you have been hit, and if so take account of this in your reporting and decision making. Validate your recent decision making. Ignorance is not bliss.

Don’t let bot traffic cause bad decision making!

To get help with either or both of these things, why not get in touch?