
Disallowed Pages Showing Up as Good in the Core Web Vitals Report

Did you disallow some pages in robots.txt, only to see them show up as “good” in the Core Web Vitals report? Here is why that happens.

Here is the situation we saw on one of the sites where we provide SEO consulting. The webmaster disallowed a directory in robots.txt. The directory hosts a handful of pages that are not important for the business: they don’t carry any links and they don’t have any rich content. There was still some crawling activity on these pages, however, and it was affecting the site’s performance, so we decided to disallow the whole directory.

The disallow rule /directory/subdirectory/* was added to robots.txt.
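As a minimal sketch, assuming the rule applies to all crawlers and using the same placeholder path as above (not the site’s real directory), the addition looked roughly like this:

User-agent: *
Disallow: /directory/subdirectory/*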

At the same time, another team member was working on improving the site’s Core Web Vitals (CWV) scores. Several weeks after the disallow rule was introduced, they noticed that the disallowed pages showed up as “Good” in the Core Web Vitals report in Google Search Console (GSC). The team was confused. Let’s look at what is happening here.

Disallow Directive in Robots.txt

A robots.txt file is a set of directives that tells search engine crawlers which pages and sections of a site they may access. It is used mostly to avoid overloading the site with requests from crawlers.

As Google itself notes, robots.txt is not a mechanism for keeping a page out of Google. This is important to keep in mind.

A disallow directive in robots.txt tells search engines which files, pages, or site sections not to access. It consists of the word “Disallow” followed by the path that should not be accessed. If the Disallow line is empty (no path), it does not disallow anything.

A disallow directive looks like this in a site’s robots.txt:

User-agent: *
Disallow: /wp-login/

In this example, all search engine crawlers are told not to access the /wp-login/ directory.
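For comparison, a Disallow line with no path blocks nothing; the following rule leaves every crawler free to access the whole site:

User-agent: *
Disallow: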

Core Web Vitals Report

The Core Web Vitals report in Google Search Console shows how a site’s pages perform for real users. The data is based on real-world usage (field data), which should give webmasters a high level of confidence in it.

Google introduced this report into GSC only a couple of years ago to help webmasters identify pages with a poor user experience and fix them. Since longer page load times increase bounce rates, Google wants webmasters to have the data and act on improvements. It is a win-win: better user engagement for site owners and easier crawling for Google.

What’s Happening?

The confusion is about how a blocked page can show up as “good” in the Core Web Vitals report. It sounds like a contradiction, but the two things are not actually related.

The disallow directive in robots.txt only tells search bots not to crawl the file, page, or directory. If the page was crawled and indexed in the past, because it was in a sitemap or internal links pointed to it, it is already in Google’s index.

The Core Web Vitals report includes data for indexed URLs only. The URLs shown in the CWV report are the actual URLs from Google’s index for which data was recorded. The report works with actual URLs, not canonicals (which other GSC reports may use). This is why you may see a bunch of URLs with parameters in the Core Web Vitals report.

In short, by telling search engine bots not to crawl a page in robots.txt, we are not removing that page from Google’s index.
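To make the distinction concrete, here is a minimal Python sketch using the standard library’s urllib.robotparser; the site and page URLs are hypothetical placeholders. The answer from can_fetch() only says whether a crawler may fetch the URL, nothing about whether the URL is indexed or has CWV data:

import urllib.robotparser

# Hypothetical site used for illustration; replace with your own domain.
robots_url = "https://www.example.com/robots.txt"
page_url = "https://www.example.com/wp-login/index.php"

parser = urllib.robotparser.RobotFileParser()
parser.set_url(robots_url)
parser.read()  # fetches and parses the live robots.txt

# can_fetch() only answers "may this user agent crawl this URL?";
# a blocked URL can still sit in Google's index and show CWV data.
# Note: urllib.robotparser matches plain path prefixes and does not
# understand Google-style wildcards like the trailing * used above.
print(parser.can_fetch("Googlebot", page_url))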

What to Do?

Depending on the goal, further steps may be needed.

  • If the goal of the disallow rule is simply to keep search bots off those pages and avoid overloading the server, stop here (our case).
  • If the goal is to stop the pages from being indexed, also add a “noindex” tag to these pages (see the snippet after this list). Note that crawlers can only see the noindex tag on pages they are allowed to fetch, so it will not be picked up while the pages are blocked in robots.txt.
  • If the goal is to remove the pages from Google’s or Bing’s cache, use the “noarchive” tag on the pages. Bing also supports several other meta tags, such as nocache, so it is worth using them too.
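As a minimal sketch, the two directives can be combined in a single robots meta tag in the <head> of each page:

<meta name="robots" content="noindex, noarchive">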

Final Thoughts

Adding a disallow directive to robots.txt does not create a 100% reliable block. Compliant search engines like Google will stop crawling the disallowed path, but the URLs can still appear in the index, especially if there are external or internal links pointing to those pages.

There is also a cohort of crawlers that simply disregard disallow directives. To keep them out, you will need to block them at the server or CDN level, as sketched below.
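As one possible approach, assuming nginx as the web server and a hypothetical user-agent string, a simple block might look like the rule below; most CDNs offer equivalent firewall rules for blocking by user agent.

# Inside the relevant server { } block in the nginx configuration:
# return 403 Forbidden to a hypothetical bot that ignores robots.txt.
if ($http_user_agent ~* "badbot") {
    return 403;
}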
