Cleaning the website's search index

Added: 06.08.2017
Search engines quite often index a large number of pages that carry no useful information from their point of view: exact or near-duplicate pages, technical garbage, service pages, etc. Such pages can get in the way of timely re-indexing and correct ranking of the website, so it is highly advisable to minimize their number in the index. There are several ways to do this, which fall into two large groups: banning pages from indexing and gluing (merging) them with other pages of the website. Let us consider the peculiarities of each method and the preferred cases for applying them.
The main difference between banning and gluing is that, with gluing, the non-text characteristics of the page being glued (let's call it non-canonical), such as the values of link, behavioral and temporal factors, are summed with the values of the corresponding factors of the target page (let's call it canonical). With an indexing ban, all this information is simply lost. Therefore, the first candidates for a ban are pages that have no significant non-text characteristics: for example, no links lead to them and the traffic they receive is negligible. As a rule, these are service pages, such as RSS feeds, user account pages or on-site search results.

The page indexing ban can be done in the following ways:
1) Using the Disallow directive in the section of the robots.txt file for the corresponding search engine user agent
2) Using the noindex value in the content attribute of the robots meta tag
In the first case, the crawling budget allocated for re-indexing the site's pages that return 200 OK is not consumed, since the indexing robot cannot access pages forbidden in the robots.txt file. This method is therefore preferable. In the second case, the robot downloads the pages and only then detects the directive that forbids indexing them. Thus, part of the site's crawling budget is spent on constantly re-crawling such pages.
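The first method can be checked locally before deploying a robots.txt file. The following is a minimal sketch using Python's standard urllib.robotparser; the /search/ and /rss/ paths and the example.com host are assumptions chosen for illustration, not rules taken from any real site. Note that this parser implements only plain prefix matching, not the wildcard extensions supported by search engines.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: service pages are closed for all robots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /rss/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, user_agent="*"):
    """Return True if the given robot may download the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/search/results",
                "https://example.com/catalog/shoes"):
        print(url, is_crawlable(url))
```

A URL for which is_crawlable returns False is never downloaded by a compliant robot, which is why this method does not spend the crawling budget.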
This problem can be partially solved by correctly configuring the handling of If-Modified-Since requests (for more details, see my article "Last-Modified and If-Modified-Since headers"). Moreover, in the second case, pages forbidden from indexing may get into the index for some time, sometimes for months. Therefore, it is advisable to use the second method only in the following cases:
1) If there are many such pages and the peculiarities of their URLs do not allow them to be listed compactly in robots.txt directives using the robots exclusion rules and the extensions supported by search engines (see, for example, the relevant documentation for Yandex and Google). Yandex limits the size of the robots.txt file to 32 KB, while Google allows up to 500 KB.
2) If the pages banned from indexing are, for some reason, the only source of internal links to pages of the website that should be present in the search index. In this case, the content attribute of the robots meta tag must contain, in addition to the noindex value, the follow value, which allows the search robot to follow the links on the page.
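The combination of noindex and follow described in the second case can be verified on a downloaded page. Below is a minimal sketch based on Python's standard html.parser; the class and function names are hypothetical helpers introduced for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the values listed in <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Values are comma-separated and case-insensitive.
        for value in content.split(","):
            value = value.strip().lower()
            if value:
                self.directives.add(value)


def robots_directives(html):
    """Return the set of robots meta tag values found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noindex, follow"></head>'
    found = robots_directives(page)
    print("banned from indexing:", "noindex" in found)
    print("links may be followed:", "follow" in found)
```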
As mentioned above, unlike an indexing ban, gluing pages sums the values of the non-text factors of the glued (non-canonical) page with the corresponding values of the target (canonical) page. Gluing can be carried out in the following ways:
1) Using a 301 Moved Permanently redirect
2) Using the Clean-param directive in the robots.txt file (only for URLs with dynamic parameters)
3) Using the rel="canonical" attribute of the link tag

The 301 redirect is applicable when the content of the non-canonical page is completely identical to that of the canonical page, so the user can simply be redirected from one URL to the other. In this case, requests to the non-canonical URL do not consume the crawling budget, because its response differs from 200. Note that gluing will not occur if you use a redirect with a 302 response.
This method is advisable, for example, when changing the URL structure of the website or for gluing duplicate URLs with and without a trailing slash. If the non-canonical URL must still serve content to the user, that is, it must return a 200 response, one of the other two gluing methods has to be used.
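The trailing-slash case can be illustrated with a small sketch of the decision a server would make. It assumes that the variant with the trailing slash is chosen as canonical and that a last path segment containing a dot is a file; both are assumptions for this example, and the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def trailing_slash_redirect(url):
    """Return (301, canonical_url) if the URL must be redirected, else None.

    Assumption: the canonical variant is the one WITH a trailing slash.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        return None  # already canonical, serve with 200
    last_segment = path.rsplit("/", 1)[-1]
    if "." in last_segment:
        return None  # looks like a file (assumption), leave as is
    # 301, not 302: according to the text, only a permanent redirect glues.
    return 301, urlunsplit(parts._replace(path=path + "/"))


if __name__ == "__main__":
    print(trailing_slash_redirect("https://example.com/catalog"))
    print(trailing_slash_redirect("https://example.com/catalog/"))
```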
The Clean-param directive in the robots.txt file applies only to pages that have dynamic parameters in the URL. These can be parameters that do not affect the content (for example, session IDs or referrers) as well as parameters that do (for example, sorting modes). The non-canonical URL is glued to the canonical one, which is formed by removing the parameters specified in the directive. Naturally, the resulting canonical URL must return a 200 response, otherwise no gluing will occur.
This method does not consume the crawling budget either: the search robot simply does not download the non-canonical URL. However, bear in mind that for the same reason the search engine will not learn about the links placed on the non-canonical URL. Therefore, this method is recommended when the "cut-off" parameters do not affect the content of the page, or when the number of possible values of these parameters is large enough to have a significant effect on the consumption of the crawling budget (for instance, search results on the website).
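How the canonical URL is formed by removing parameters can be sketched as follows. This only imitates the transformation for checking your own URLs; it is not the search engine's actual implementation, and the parameter names sid and ref are assumptions made for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url, params):
    """Return the URL with the given query parameters removed.

    Imitates what a directive such as "Clean-param: sid&ref /catalog/"
    asks the robot to do (hypothetical parameters and path).
    """
    parts = urlsplit(url)
    kept = [(name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name not in params]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    url = "https://example.com/catalog/?sid=abc123&page=2&ref=mail"
    print(strip_params(url, {"sid", "ref"}))
```

The resulting URL is the one that has to return a 200 response for the gluing to take place.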
And finally, the third option, which is in my opinion the most preferable, is using the rel="canonical" attribute of the link tag. The advantages of this method are that, as with any gluing, the non-text factors of the non-canonical and canonical pages are summed (which was confirmed by Yandex employee Alexander Smirnov at the Sixth Webmaster's), and that the links on the non-canonical page are taken into account (which was also confirmed in the blog of Platon Shchukin, the collective persona of the Yandex support service).
The only downside of this method is that non-canonical pages consume the crawling budget because they return a 200 response, just as with noindex in the robots meta tag. In addition, the non-canonical page can stay in the index for a long time before it is glued to the canonical one.
Nevertheless, this method is well suited, for example, to gluing pagination pages, different sorting options, the results of applying filters to lists, etc., as well as to "cutting off" dynamic URL parameters. As for pagination, Google employees recommend using the rel="next" and rel="prev" attributes of the link tag. Note, however, that Yandex does not support these directives. Therefore, I still recommend using rel="canonical" for both search engines; this directive works fine for Google too. There is one difference between Yandex and Google in handling rel="canonical": unlike Google, Yandex does not support cross-domain use of this directive, that is, you cannot glue pages that are on different subdomains.
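To audit pages for this directive, one can extract the canonical address from the HTML and check that it stays on the same host, given the Yandex limitation described above. This is a minimal sketch using the standard html.parser; the helper names and the example.com URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class CanonicalParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical">."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.href = attr.get("href")


def canonical_url(html):
    """Return the canonical URL declared in the page, or None."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.href


def same_host(page_url, target_url):
    """True if both URLs share the host (subdomains count as different)."""
    return urlsplit(page_url).hostname == urlsplit(target_url).hostname


if __name__ == "__main__":
    page_url = "https://example.com/catalog/?sort=price"
    html = '<head><link rel="canonical" href="https://example.com/catalog/"></head>'
    target = canonical_url(html)
    print(target, same_host(page_url, target))
```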
In conclusion, I would like to note that applying gluing directives several times in sequence should be avoided. Examples are redirect chains, or specifying as canonical a page that itself contains a rel="canonical" directive pointing to a third page. The same applies to sequentially combining different gluing methods, for instance, when the URL formed by "cutting off" parameters with the Clean-param directive is itself non-canonical. In such cases, the search engine may simply ignore the directives.
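Such chains are easy to find once the gluing targets of a site have been collected. The sketch below assumes a hypothetical dictionary that maps each URL to the URL it is glued to, regardless of whether the gluing is done by a redirect, Clean-param or rel="canonical"; how this dictionary is built is outside the scope of the example.

```python
def gluing_chain(url, targets):
    """Follow gluing targets from url and return the list of visited URLs.

    targets maps a URL to the URL it points to; a URL that is absent
    from the mapping or points to itself is treated as final.
    """
    chain = [url]
    while True:
        nxt = targets.get(chain[-1])
        if nxt is None or nxt == chain[-1]:
            return chain
        if nxt in chain:
            chain.append(nxt)  # loop detected, stop here
            return chain
        chain.append(nxt)


if __name__ == "__main__":
    # Hypothetical data: /a is glued to /b, which is itself glued to /c.
    targets = {"/a": "/b", "/b": "/c", "/c": "/c"}
    for start in targets:
        chain = gluing_chain(start, targets)
        if len(chain) > 2:
            print("sequential gluing:", " -> ".join(chain))
```

Any chain longer than two elements means that a page is glued to a target that is itself non-canonical, which is the situation to avoid.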