
High-severity issue

Robots.txt rules blocking crawlable pages

A mistaken robots.txt rule can block crawlers from pages, assets, or entire sections that need to remain discoverable.

Why it matters

Robots.txt controls crawler access before a page is fetched. A broad disallow rule can prevent search engines from seeing the very pages that otherwise have correct metadata and canonical tags.

Common signals

  • A broad Disallow rule matches public product, pricing, article, or landing pages.
  • Robots.txt changed during a deploy even though public routes did not change.
  • A sitemap is advertised but its URLs are blocked by matching robots rules.
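To make the first signal concrete, here is a hypothetical robots.txt (the paths and domain are illustrative, not from any real site) where a rule intended for an API sub-path also matches every public product page:

```text
User-agent: *
# Intended to block /shop/api/ only, but the broad prefix
# also matches /shop/red-shoes and every other product page.
Disallow: /shop
Sitemap: https://example.com/sitemap.xml
```

Because robots.txt rules are prefix matches, `Disallow: /shop` covers `/shop`, `/shop/red-shoes`, and anything else under that route, not just the API sub-path.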

Stable fix pattern

  1. Scope Disallow rules to private, API, or low-value paths instead of broad route prefixes.
  2. Keep the sitemap URL present and reachable from robots.txt.
  3. Test representative public URLs against the final production robots file.
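Step 3 can be sketched with Python's standard-library `urllib.robotparser`. The robots content and URLs below are assumed examples; in practice you would load the final production robots file and your own priority URLs:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical production robots.txt, scoped to private and API paths.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
Disallow: /account/
Sitemap: https://example.com/sitemap.xml
"""

# Representative public URLs that must stay crawlable (illustrative).
PUBLIC_URLS = [
    "https://example.com/pricing",
    "https://example.com/blog/launch-post",
]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in PUBLIC_URLS:
    # can_fetch() applies the parsed rules to each URL's path.
    status = "allowed" if parser.can_fetch("*", url) else "BLOCKED"
    print(f"{url}: {status}")
```

Running a check like this in CI against the file that actually ships catches broad-prefix mistakes before deploy.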

How SitePulse will monitor this in v1

  • Fetches and parses robots.txt from the public origin.
  • Compares robots rules against priority URLs and sitemap samples.
  • Separates intentional API blocking from accidental public-page blocking.
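The last step, separating intentional from accidental blocking, can be sketched as follows. This is a minimal illustration, not SitePulse's actual implementation; the prefix list marking "intentionally blocked" paths is an assumption:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Paths assumed to be blocked on purpose (illustrative convention).
INTENTIONAL_PREFIXES = ("/api/", "/admin/", "/internal/")

def classify_blocked(robots_lines, sample_urls, agent="*"):
    """Split robots-blocked URLs into intentional vs. accidental buckets."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    intentional, accidental = [], []
    for url in sample_urls:
        if parser.can_fetch(agent, url):
            continue  # crawlable, nothing to flag
        path = urlsplit(url).path
        if path.startswith(INTENTIONAL_PREFIXES):
            intentional.append(url)
        else:
            accidental.append(url)  # public page blocked: raise this issue
    return intentional, accidental
```

Feeding sitemap samples and priority URLs through a classifier like this lets a monitor alert only on the accidental bucket.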