Search engines like Google have a problem – it’s called ‘duplicate content’. Duplicate content means that similar content appears at multiple locations (URLs) on the web, and as a result search engines don’t know which URL to show in the search results. This can hurt the ranking of a webpage, and the problem only gets worse when people start linking to the different versions of the same content. This article will help you to understand the various causes of duplicate content, and to find the solution to each of them.
What is duplicate content?
Duplicate content is content that is available on multiple URLs on the web. Because more than one URL shows the same content, search engines don’t know which URL to list higher in the search results. Therefore, they might rank both URLs lower and give preference to other webpages.
In this article, we’ll mostly focus on the technical causes of duplicate content and their solutions. If you’d like to get a broader perspective on duplicate content and learn how it relates to copied or scraped content or even keyword cannibalization, we’d advise you to read this post: What is duplicate content.
Let’s illustrate this with an example
Duplicate content can be likened to being at a crossroads where road signs point in two different directions for the same destination: Which road should you take? To make matters worse, the final destination is different too, but only ever so slightly. As a reader, you don’t mind because you get the content you came for, but a search engine has to pick which page to show in the search results because, of course, it doesn’t want to show the same content twice.
Let’s say your article about ‘keyword x’ appears at http://www.example.com/keyword-x/ and the same content also appears at http://www.example.com/article-category/keyword-x/. This situation is not fictitious: it happens in lots of modern Content Management Systems. Then let’s say your article has been picked up by several bloggers and some of them link to the first URL, while others link to the second. This is when the search engine’s problem shows its true nature: it’s your problem. The duplicate content is your problem because those links are split between two different URLs. If they were all linking to the same URL, your chances of ranking for ‘keyword x’ would be higher.
If you don’t know whether your rankings are suffering from duplicate content issues, these duplicate content discovery tools will help you find out!
Causes of duplicate content
There are dozens of reasons for duplicate content. Most of them are technical: it’s not very often that a human decides to put the same content in two different places without making clear which is the original, unless they’ve cloned a post and published it by accident, of course. Otherwise, it feels unnatural to most of us.
There are many technical reasons, though, and most of them arise because developers don’t think like a browser or even a user, let alone a search engine spider – they think like a programmer. Take that article we mentioned earlier, which appears at http://www.example.com/keyword-x/ and http://www.example.com/article-category/keyword-x/. If you ask the developer, they will say it only exists once.
Misunderstanding the concept of a URL
No, that developer hasn’t gone mad, they are just speaking a different language. A CMS probably powers the website, and in its database there’s only one article, but the website’s software allows that same article to be retrieved through several URLs. That’s because, in the eyes of the developer, the unique identifier for that article is the ID that article has in the database, not the URL. But for the search engine, the URL is the unique identifier for a piece of content. If you explain that to a developer, they will begin to get the problem. And after reading this article, you’ll even be able to provide them with a solution right away.
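To make the difference between the two points of view concrete, here is a minimal sketch in Python. The route table, the article IDs, and the function names are all made up for illustration; a real CMS stores this very differently. It simply groups URL paths by the database ID they resolve to and reports every article that can be reached at more than one URL.

```python
from collections import defaultdict

# Hypothetical route table: URL path -> article ID in the CMS database.
# The IDs (42, 43) and the third path are invented for this example.
ROUTES = {
    "/keyword-x/": 42,
    "/article-category/keyword-x/": 42,
    "/another-post/": 43,
}


def urls_per_article(routes):
    """Group URL paths by the database ID they resolve to."""
    grouped = defaultdict(list)
    for path, article_id in routes.items():
        grouped[article_id].append(path)
    return dict(grouped)


def duplicated_articles(routes):
    """Return only the articles that are reachable at more than one URL."""
    return {
        article_id: sorted(paths)
        for article_id, paths in urls_per_article(routes).items()
        if len(paths) > 1
    }


print(duplicated_articles(ROUTES))
# {42: ['/article-category/keyword-x/', '/keyword-x/']}
```

The developer sees one row with ID 42. The search engine sees two entries in that list, and treats them as two pieces of content.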
Session IDs
You often want to keep track of your visitors and allow them, for instance, to store items they want to buy in a shopping cart. In order to do that, you have to give them a ‘session.’ A session is a brief history of what the visitor did on your site and can contain things like the items in their shopping cart. To maintain that session as a visitor clicks from one page to another, the unique identifier for that session – called the Session ID – needs to be stored somewhere. The most common solution is to do that with cookies. However, search engines don’t usually store cookies.
At that point, some systems fall back to using Session IDs in the URL. This means that every internal link on the website gets that Session ID added to its URL, and because that Session ID is unique to that session, it creates a new URL, and therefore duplicate content.
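Here is a small Python sketch of that effect, using only the standard library. The parameter name `sid` and the session values are assumptions made for this example; real systems use names such as PHPSESSID or something of their own. The first function mimics a system that appends the Session ID to a link, the second removes it again.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter name; real systems use their own.
SESSION_PARAM = "sid"


def add_session_id(url, session_id):
    """Mimic a system that appends the Session ID to an internal link."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((SESSION_PARAM, session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


def strip_session_id(url):
    """Remove the Session ID parameter and keep every other parameter."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != SESSION_PARAM
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


page = "http://www.example.com/keyword-x/"
first_visit = add_session_id(page, "abc123")
second_visit = add_session_id(page, "def456")

print(first_visit)   # http://www.example.com/keyword-x/?sid=abc123
print(second_visit)  # http://www.example.com/keyword-x/?sid=def456
print(strip_session_id(first_visit) == strip_session_id(second_visit))  # True
```

Two visits to the same page produce two different URLs, and only after the Session ID is stripped do they collapse back into one.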
URL parameters used for tracking and sorting
Another cause of duplicate content is using URL parameters that do not change the content of a page, for instance in tracking links. You see, to a search engine, http://www.example.com/keyword-x/ and http://www.example.com/keyword-x/?source=rss are not the same URL. The latter might allow you to track what source people came from, but it might also make it harder for you to rank well – very much an unwanted side effect!
This doesn’t just go for tracking parameters, of course. It goes for every parameter you can add to a URL that doesn’t change the vital piece of content, whether that parameter is for ‘changing the sorting on a set of products’ or for ‘showing another sidebar’: all of them cause duplicate content.
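If you have a list of crawled URLs, you can get a rough idea of how much of this is going on with a sketch like the one below. It is only an illustration: the set of ignored parameters is an assumption (`source` comes from the example above, while `sort` and `sidebar` are invented stand-ins for the sorting and sidebar cases), and the product URLs are hypothetical. It groups URLs that become identical once those parameters are removed.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not change the main content of a page.
IGNORED_PARAMS = {"source", "sort", "sidebar"}


def without_ignored_params(url):
    """Return the URL with all ignored parameters removed."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


def group_duplicates(urls):
    """Map each cleaned URL to the crawled URLs that collapse into it."""
    groups = defaultdict(list)
    for url in urls:
        groups[without_ignored_params(url)].append(url)
    # Keep only the groups that contain more than one crawled URL.
    return {clean: found for clean, found in groups.items() if len(found) > 1}


crawled = [
    "http://www.example.com/keyword-x/",
    "http://www.example.com/keyword-x/?source=rss",
    "http://www.example.com/products/?sort=price",
    "http://www.example.com/products/?page=2",
]

for clean, found in group_duplicates(crawled).items():
    print(clean, "<-", found)
```

Note that `page=2` is left alone, because a second page of products really does show different content.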
Scrapers and content syndication
Most of the reasons for duplicate content are your own or your website’s ‘fault’. Sometimes, however, other websites use your content, with or without your consent. They don’t always link to your original article, and therefore the search engine doesn’t ‘get’ it and has to deal with yet another version of the same article. The more popular your site becomes, the more scrapers you’ll get, making this problem bigger and bigger.
Order of parameters
Another common cause is that a CMS doesn’t use nice clean URLs, but rather URLs like /?id=1&cat=2, where id refers to the article and cat refers to the category. The URL /?cat=2&id=1 will render the same results in most website systems, but to a search engine the two are completely different URLs.
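One way to picture the fix is to always put the parameters in the same order before a URL is printed or compared. The sketch below does that with Python’s standard library; the function name is made up, and sorting alphabetically is just one possible convention, not something any particular CMS is known to do.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sort_query(url):
    """Return the URL with its query parameters in alphabetical order."""
    parts = urlsplit(url)
    # Sorting the (key, value) pairs gives every ordering the same result.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


print(sort_query("/?id=1&cat=2"))  # /?cat=2&id=1
print(sort_query("/?cat=2&id=1"))  # /?cat=2&id=1
```

Be aware that this also reorders repeated parameters with the same name, which matters on the rare site where their order carries meaning.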
Comment pagination
In my beloved WordPress, but also in some other systems, there is an option to paginate your comments. This leads to the content being duplicated across the article URL, and the article URL + /comment-page-1/, /comment-page-2/ etc.
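To see which comment pages belong to which article, you could map each paginated URL back to the article URL. The following is a rough sketch, assuming the /comment-page-N/ pattern described above always appears at the end of the path; the function name is invented for the example.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches a trailing /comment-page-<number>/ segment, with or without
# the final slash.
COMMENT_PAGE = re.compile(r"/comment-page-\d+/?$")


def article_url(url):
    """Return the article URL that a paginated comment URL belongs to."""
    parts = urlsplit(url)
    path = COMMENT_PAGE.sub("/", parts.path)
    return urlunsplit(parts._replace(path=path))


print(article_url("http://www.example.com/keyword-x/comment-page-2/"))
# http://www.example.com/keyword-x/
```

URLs without a comment page segment come back unchanged.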
Printer-friendly pages
If your content management system creates printer-friendly pages and you link to those from your article pages, Google will usually find them, unless you specifically block them. Now, ask yourself: Which version do you want Google to show? The one with your ads and peripheral content, or the one that only shows your article?
WWW vs. non-WWW
This is one of the oldest causes in the book, but sometimes search engines still get it wrong: WWW vs. non-WWW duplicate content, when both versions of your site are accessible. Another, less common situation, but one I’ve seen as well, is HTTP vs. HTTPS duplicate content, where the same content is served over both.
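Both situations come down to the same page being reachable under several scheme and host combinations. As a minimal sketch, assuming you have picked https and www.example.com as the one preferred version (that choice is an assumption for this example, not a recommendation from the text above), the following Python function maps every variant to that single URL.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for this example; pick whichever version you like,
# as long as you pick exactly one.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.example.com"
HOST_VARIANTS = {"example.com", "www.example.com"}


def preferred_url(url):
    """Map any scheme/host variant of the site to the preferred version."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in HOST_VARIANTS:
        # Not our site: leave the URL alone.
        return url
    # Simplification: any port or login info in the URL is dropped.
    return urlunsplit(
        parts._replace(scheme=PREFERRED_SCHEME, netloc=PREFERRED_HOST)
    )


variants = [
    "http://example.com/keyword-x/",
    "http://www.example.com/keyword-x/",
    "https://example.com/keyword-x/",
    "https://www.example.com/keyword-x/",
]

print({preferred_url(url) for url in variants})
# {'https://www.example.com/keyword-x/'}
```

Four URLs that a search engine could treat as four pages end up as one.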