Referral spam in Google Analytics has been a growing issue, and one that has affected most accounts at some point in the last few months.
It results in odd data appearing in Google Analytics reports, data that is not true – it has been pushed into the account by some nefarious means.
There are a number of misconceptions about referral spam, and these can cause alarm, so before we look at how to tackle referral spam, let’s look at what it actually is, and why it happens.
To Beat a Crook You Must Think Like a Crook
For a moment, put yourself in the shoes of a spammer. Email spam still works, but it is getting more and more difficult to get emails through spam filters, both at mail servers and in email clients – is there another way to get your messages through to people who are not paying attention? What if the services you want to promote are aimed at webmasters in particular? What if you just want to cause some inconvenience? Some or all of these factors are at play for Google Analytics referral spammers.
As a technically adept spammer already, you realise that you can use Google’s Analytics platform to push data to any account. Maybe you can make your website appear like a top referrer so webmasters check it out to see what it is! It just needs a bit of knowledge about how the tracking code works and a real account number. But if you are a spammer at scale, you don’t even need to know real account numbers – you can just make them up. Enough will be real that you will get through frequently enough!
It’s not personal, then. Referral spam isn’t really targeting you specifically. You’re just a victim of it, as you would be of a disease – viruses don’t choose you, you just end up catching them. Referral spam is just like a head cold for Google Analytics!
How Tracking Code Works
Google Analytics tracking code works by sending data to the Analytics servers whenever a hit – a pageview, event, transaction etc. – occurs. A whole host of data is included, depending on the type of hit. When a pageview is sent, this might include “source”. For any hit it might include “hostname”.
Source is the referrer of the hit – the URL of the website the user was at immediately before this hit.
Hostname is the website domain of your website.
This distinction is important, as in my experience many people confuse the two. I suspect the reason for the confusion is the mismatch between the name given to the problem we want to solve – referral spam – and the best method of fixing it, which relies on the hostname rather than the referrer.
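To make the two fields concrete, here is a rough sketch of the data a single pageview hit might carry. The parameter names (`tid` for the tracking ID, `dh` for the hostname, `dr` for the referrer) are the ones I understand Google's Measurement Protocol for Universal Analytics to use; the `build_pageview` helper and every value in it are made up for illustration, and nothing is actually sent anywhere.

```python
from urllib.parse import urlencode, parse_qs


def build_pageview(tracking_id, hostname, path, referrer):
    """Return the query string for one hypothetical pageview hit."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # the account/property the hit is recorded against
        "cid": "555",        # anonymous client ID (placeholder value)
        "t": "pageview",     # hit type
        "dh": hostname,      # hostname: the site the page was served from
        "dp": path,          # page path
        "dr": referrer,      # referrer: where the visitor came from ("source")
    }
    return urlencode(params)


# A genuine hit: the hostname is one of our own domains, the source could be anything.
genuine = build_pageview("UA-XXXXX-Y", "www.example.com", "/pricing",
                         "https://news.example.org/article")

# A spam hit: same tracking ID, but the hostname is not ours.
spam = build_pageview("UA-XXXXX-Y", "spam-host.invalid", "/",
                      "http://spam-referrer.invalid/")

for label, hit in (("genuine", genuine), ("spam", spam)):
    fields = parse_qs(hit)
    print(label, "| hostname:", fields["dh"][0], "| source:", fields["dr"][0])
```

Both hits land in the same account, and both have a source. Only the hostname tells you which one really came from your website.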
Solving the problem of referral spam is deceptively simple. We can look for its fingerprint and apply filters to Analytics profile views to prevent the data being recorded in the first place.
This is where it is important to understand the difference between source and hostname. A genuine hit to your website could, in theory, have any source, but it can only ever record a hostname that your website uses. Of course, your website might have more than one domain – example.com and example.co.uk, perhaps – but it is still a limited number, often just one or two.
The Fingerprint of Referral Spam
Remember, referral spam is not specifically targeting you. It uses a scattergun approach with a random tracking account number. The hostname sent with referral spam is always wrong – it is either blank, or not your website.
A common misconception is that to prevent referral spam, you must block the offending referrer – i.e. any “source” that looks like it is referral spam. However, as I have explained, the real fingerprint is in the hostname – this is what we want to use to filter out the spam.
Another common misconception is that we must block each nefarious hostname. This is an incredibly inefficient way to solve the problem though; in theory there could be hundreds, and in the future, perhaps, thousands. Spammers could easily get around this solution by changing their hostname frequently. This is obviously not ideal.
Therefore, the sensible approach is to block ALL hostnames that are not your real website domains. This is achieved by using an INCLUDE filter in your profile view, only allowing in the hostnames you specify. This will block all referral spam using one simple filter.
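The logic of an INCLUDE filter can be sketched in a few lines of Python. This is only a model of the idea, not how Analytics implements it: the `include_filter` function, the hit dictionaries and all the hostnames are hypothetical.

```python
# Our real domains: the only hostnames a genuine hit can carry.
ALLOWED_HOSTNAMES = {"www.example.com", "www.example.co.uk"}

# Hypothetical hits as (hostname, source) records.
hits = [
    {"hostname": "www.example.com", "source": "google"},
    {"hostname": "www.example.co.uk", "source": "partner-site.example.org"},
    {"hostname": "", "source": "spam-referrer.invalid"},             # blank hostname
    {"hostname": "some-other-site.invalid", "source": "another-spammer.invalid"},
]


def include_filter(hits, allowed):
    """Keep only the hits whose hostname is one of our own domains."""
    return [hit for hit in hits if hit["hostname"] in allowed]


kept = include_filter(hits, ALLOWED_HOSTNAMES)
for hit in kept:
    print(hit["hostname"], "<-", hit["source"])
```

Notice that the filter never looks at the source at all. A spammer can invent a new referrer every day and the allow-list does not need to change.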
Yayy! Regular Expressions
To do this effectively, especially if you have multiple domains, you should probably use regular expressions. I let out a small cheer. I’m a regular expression nerd. You might not be, so I’m going to walk you through the process. Prepare to fall in love with regex. Prepare, goddammit!
Let’s say your website domains are www.example.com and www.example.co.uk. We want to block all hostnames that aren’t these. Let’s also say that you also have blog.example.com (and .co.uk) tracking to the same account, and that users don’t need to have www on the domain to see the main website. These are minor complications – they just increase the domains we need to let through.
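In this example, there are six:
example.com
www.example.com
blog.example.com
example.co.uk
www.example.co.uk
blog.example.co.uk
So, we need a regular expression to match all these.
Here it is:
(www\.|blog\.)?example\.co(m|\.uk)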
Scary, huh? Well not really, here it is one part at a time, plus an explanation.
(www\.|blog\.)? This part says we want to see the domain start with www or with blog or with nothing. There’s a lot going on here, so let’s look at the mechanics of it.
The dot is a special character (on its own it matches any character), so we use a backslash to escape it – for example, www\. is transformed to www.
The pipe (|) means logical OR – www\.|blog\. transforms to www. OR blog.
Brackets group the items inside together, so the question mark applies to the whole group. The question mark means ‘this preceding thing is optional’.
(www\.|blog\.)? transforms to ‘Optionally, (www. OR blog.)’
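If you want to see the optional group behave as described, here is a small check using Python's `re` module. Python is my choice for the demonstration, not something Analytics requires, and the test strings are made up.

```python
import re

# The optional prefix group on its own.
prefix = re.compile(r"(www\.|blog\.)?")

# fullmatch succeeds only if the whole string matches the pattern.
for text in ("www.", "blog.", "", "shop."):
    print(repr(text), bool(prefix.fullmatch(text)))

# Why the backslash matters: an unescaped dot matches any character.
unescaped_matches = re.fullmatch(r"www.", "wwwx") is not None
escaped_matches = re.fullmatch(r"www\.", "wwwx") is not None
print("unescaped dot matches 'wwwx':", unescaped_matches)
print("escaped dot matches 'wwwx':", escaped_matches)
```

The empty string passes because the whole group is optional, while `shop.` fails because it is neither of the two alternatives.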
The next part is easy: ‘example’ has no special characters, so it undergoes no transformation. It simply matches ‘example’.
\.co(m|\.uk) means we must match either .com or .co.uk at the end of the domain. We’ve met all these special characters above, so I’ll leave you to figure out how it fits together. Notice, though, that there is no ? – this match is not optional.
The above is an example; hopefully you can see how to apply this principle to your own domains. If you have a quiet moment to yourself, why not love regular expressions some more!
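Before trusting a pattern, it is worth checking it against the hostnames you expect to pass and a few you expect to fail. The sketch below assembles the three parts described above and tests them in Python. The `is_own_hostname` helper and the rejected hostnames are hypothetical. I have used `fullmatch` so that only a complete hostname counts; in a tool that matches a pattern anywhere in the text, wrapping the pattern in `^` and `$` gives the same effect.

```python
import re

# Optional www./blog. prefix, then "example", then .com or .co.uk.
HOSTNAME_PATTERN = re.compile(r"(www\.|blog\.)?example\.co(m|\.uk)")

own_domains = [
    "example.com", "www.example.com", "blog.example.com",
    "example.co.uk", "www.example.co.uk", "blog.example.co.uk",
]

# Hypothetical hostnames that should NOT get through.
other_hostnames = ["", "spam-host.invalid", "example.org", "shop.example.com"]


def is_own_hostname(hostname):
    """True only if the entire hostname matches the pattern."""
    return HOSTNAME_PATTERN.fullmatch(hostname) is not None


for hostname in own_domains + other_hostnames:
    verdict = "allow" if is_own_hostname(hostname) else "block"
    print(f"{verdict:5}  {hostname!r}")
```

If one of your genuine domains shows up as blocked here, fix the pattern before it goes anywhere near a filter.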
Applying the Regex in a Profile View
To use this regular expression to filter the data recorded by Analytics, we need to create a profile filter. You need administrator rights to the full Google Analytics account to do this. Admin rights to just the property or profile are not enough. However, you create and add the profile filters at the profile view level.
When you first set this up, you probably want to work on a duplicate of your profile view so you can test the filter safely – if you get your filter wrong, you won’t know until you have already blocked data you didn’t mean to.
Set up a new profile view, and apply your new filter to that first. If, after a few days, it seems to be working OK, you’ll be safe to apply it to your main profile views.*
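One way to judge whether the test view "seems to be working OK" is to take the hostnames seen in an unfiltered view and work out which ones the pattern would drop. The following is a sketch of that check only: the session figures are invented, `split_by_filter` is a hypothetical helper, and I am assuming blank hostnames appear as `(not set)` in your report. The pattern is anchored with `^` and `$` because `search` would otherwise match anywhere in the hostname.

```python
import re

PATTERN = re.compile(r"^(www\.|blog\.)?example\.co(m|\.uk)$")

# Hypothetical hostname -> sessions figures copied from an unfiltered view.
hostname_sessions = {
    "www.example.com": 1200,
    "blog.example.co.uk": 310,
    "shop.example.com": 45,     # a genuine subdomain the pattern forgot about
    "(not set)": 80,
    "spam-host.invalid": 150,
}


def split_by_filter(sessions_by_hostname, pattern):
    """Split hostnames into those an include filter would keep and drop."""
    kept, dropped = {}, {}
    for hostname, sessions in sessions_by_hostname.items():
        target = kept if pattern.search(hostname) else dropped
        target[hostname] = sessions
    return kept, dropped


kept, dropped = split_by_filter(hostname_sessions, PATTERN)
print("kept:   ", sorted(kept))
print("dropped:", sorted(dropped))
print("sessions dropped:", sum(dropped.values()))
```

In this made-up data the dropped list includes shop.example.com, which is exactly the kind of mistake the test view is there to catch.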
Create a new filter. We want it to be a custom filter. Select ‘Include’, and change the Filter Field setting to be ‘hostname’. In the Filter Pattern field, paste the regular expression.
You are done. It’s that simple.
There’s no need to add more filters later as spammers change the hostnames they use – they’re already blocked!
Piece of cake.
* Remember, you should always have one extra profile view with NO filters, including this referral spam filter. If you don’t, now is a good time to create one.