
Robots.txt Examples for SEO: Common Rules, Mistakes, and Safe Patterns (2026)

Roast My Web Team · 6 min read
robots.txt · seo · technical seo · crawlability · website launch

If you searched for robots.txt examples, you probably do not need theory first. You need safe patterns you can copy, adapt, and test before they block the wrong pages.

The biggest mistake teams make with robots.txt is treating it like a deindex button. It is not. robots.txt controls crawling, not guaranteed index removal.

If you want to validate a live or draft file against a real URL, use the Robots.txt Tester. If you need to draft a file first, use the Robots.txt Generator. For the broader crawlability workflow, keep the Technical SEO Audit guide and SEO Website Audit Checklist nearby.

What robots.txt can and cannot do

Robots.txt can help you:

  • block crawlers from low-value sections
  • prevent staging or internal search pages from being crawled
  • reduce crawl waste on parameter-heavy areas
  • publish sitemap locations

Robots.txt cannot reliably:

  • remove already-known URLs from search on its own
  • replace a proper noindex strategy where indexing control is needed
  • fix duplicate content by itself
  • protect private content from real access

If content is sensitive, use authentication. Do not rely on robots.txt alone.

Example 1: Simple open site with sitemap

Use this when you want normal crawling and just need a clean baseline file.

User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml

Why it works:

  • nothing important is blocked
  • the sitemap is easy for crawlers to discover
  • the file is easy to audit later
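If you want to sanity-check this baseline file programmatically, Python's standard-library `urllib.robotparser` can parse it from a string. This is a minimal sketch; the example.com URLs are placeholders for your own site:

```python
from urllib.robotparser import RobotFileParser

# Baseline file from Example 1: allow everything, advertise the sitemap.
ROBOTS = """\
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

# An empty Disallow means every path stays crawlable.
print(rp.can_fetch("*", "https://www.example.com/products/widget"))  # True

# site_maps() (Python 3.8+) returns the Sitemap lines, if any.
print(rp.site_maps())  # ['https://www.example.com/sitemap.xml']
```

Parsing from a string keeps the check offline, so it can run in CI before the file is deployed.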

Example 2: Block staging completely

Use this only on staging or pre-production environments.

User-agent: *
Disallow: /

Important:

  • this should never survive a production launch
  • staging should also be protected with login or IP controls

If a launch went live with Disallow: /, treat that as a release issue and fix it immediately.
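One way to catch this in an automated release check is to parse the live file and probe an ordinary page. A minimal sketch with the stdlib parser (the probe URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The Example 2 staging file: block everything for every agent.
STAGING_ROBOTS = """\
User-agent: *
Disallow: /
"""

def blanket_blocked(robots_txt: str, probe_url: str) -> bool:
    """Return True if the file blocks even an ordinary page for all agents."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return not rp.can_fetch("*", probe_url)

# Expected on staging; a release blocker if it is ever true on production.
print(blanket_blocked(STAGING_ROBOTS, "https://www.example.com/"))  # True
```

Run the same check against the production homepage after go-live and fail the pipeline if it returns True.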

Example 3: Block internal search pages

Internal search result pages rarely need to be crawled by search engines.

User-agent: *
Disallow: /search
Disallow: /?s=

Sitemap: https://www.example.com/sitemap.xml

Use this when:

  • site search pages create thin or duplicate combinations
  • filters or query pages expand infinitely
  • you want crawl budget focused on canonical pages

Test exact URLs before publishing. A bad rule here can accidentally catch intended landing pages or faceted navigation routes.
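You can test exact URLs offline with the stdlib parser. Note that `Disallow: /search` is a prefix rule, so it also catches paths like `/search-landing-page`; the URLs below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# The Example 3 rules for internal search pages.
ROBOTS = """\
User-agent: *
Disallow: /search
Disallow: /?s=
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

base = "https://www.example.com"
print(rp.can_fetch("*", base + "/search?q=shoes"))       # False: blocked as intended
print(rp.can_fetch("*", base + "/?s=shoes"))             # False: blocked as intended
print(rp.can_fetch("*", base + "/search-landing-page"))  # False: the prefix rule catches it too
print(rp.can_fetch("*", base + "/products/shoes"))       # True: normal pages unaffected
```

The third result is exactly the kind of accidental match the warning above is about: if `/search-landing-page` were a real landing page, this ruleset would block it.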

Example 4: Block admin and login areas

User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /checkout/

This is common and usually safe, but remember:

  • blocked URLs can still be discovered from links
  • private sections still need real access control

Example 5: Faceted navigation with a narrow allow pattern

This is where teams often get into trouble.

User-agent: *
Disallow: /collections/*?color=
Disallow: /collections/*?size=
Allow: /collections/new-arrivals

Sitemap: https://www.example.com/sitemap.xml

This kind of rule can help on parameter-heavy ecommerce sites, but only when:

  • the blocked combinations are truly low value
  • the allowed landing pages are intentional canonical pages
  • you have tested exact URLs with the Robots.txt Tester

Do not block filter paths blindly. Some faceted URLs may be valuable landing pages.
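Be aware that Python's stdlib `urllib.robotparser` does only prefix matching and does not implement Google's `*` wildcard extension, so it will mis-handle rules like these. As a rough illustration of how a Google-style matcher treats them, here is a simplified sketch; it ignores `$` anchors and other details of the real precedence algorithm, and the paths are illustrative:

```python
import re

# The Example 5 ruleset, as (kind, pattern) pairs.
RULES = [
    ("disallow", "/collections/*?color="),
    ("disallow", "/collections/*?size="),
    ("allow",    "/collections/new-arrivals"),
]

def rule_matches(pattern: str, path: str) -> bool:
    # Escape regex metacharacters, then let * match any run of characters.
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, path) is not None

def allowed(path: str) -> bool:
    matches = [(len(p), kind) for kind, p in RULES if rule_matches(p, path)]
    if not matches:
        return True
    # Longest (most specific) rule wins; on a length tie, allow wins.
    best = max(matches, key=lambda m: (m[0], m[1] == "allow"))
    return best[1] == "allow"

print(allowed("/collections/shirts?color=blue"))  # False: wildcard disallow matches
print(allowed("/collections/new-arrivals"))       # True: the explicit allow applies
```

This is why testing exact URLs matters: the outcome depends on which rule is most specific, not on the order the rules appear in the file.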

Example 6: Multi-bot rules

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /tmp/
Disallow: /search

Sitemap: https://www.example.com/sitemap.xml

This is useful only when you have a very specific reason to treat bots differently. Most sites should keep the file simple unless there is a clear crawl-management need.
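Per-agent behavior is easy to check offline, since `can_fetch` takes the user-agent as its first argument. A minimal sketch with placeholder URLs:

```python
from urllib.robotparser import RobotFileParser

# The Example 6 multi-bot file.
ROBOTS = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /tmp/
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

url = "https://www.example.com/tmp/report.html"
print(rp.can_fetch("Googlebot", url))     # True: Googlebot has its own open group
print(rp.can_fetch("SomeOtherBot", url))  # False: falls through to the * group
```

A bot matches the most specific group that names it, and only falls back to `*` when no named group applies, which is why the blank `Disallow:` under Googlebot overrides the `/tmp/` block.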

Common robots.txt mistakes

1. Using robots.txt to try to deindex pages

If a URL is already known externally, search engines can keep it indexed without crawling its content.

Use noindex where appropriate and remove internal references when the goal is index cleanup.

2. Leaving Disallow: / live after launch

This happens more often than teams admit, especially after rushed launches.

Check production robots.txt as part of your Website Launch Checklist, not just staging QA.

3. Blocking assets that pages need to render

If important CSS, JS, or media routes are blocked, crawlers may not fully understand the page layout or content.

Keep rendering resources crawlable unless you have a strong reason not to.

4. Publishing rules without testing exact URLs

Pattern assumptions are where most mistakes happen.

Always test:

  • one intended allowed URL
  • one intended blocked URL
  • one edge-case URL near the same pattern
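That three-URL habit is easy to automate. A minimal sketch with the stdlib parser; the rules and URLs are placeholders for your own, and the edge case shows why the trailing slash in `/admin/` matters:

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

# Map each test URL to its expected crawlability.
EXPECTATIONS = {
    "https://www.example.com/pricing": True,        # intended allowed URL
    "https://www.example.com/admin/users": False,   # intended blocked URL
    "https://www.example.com/administrator": True,  # edge case near the pattern
}

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

for url, expected in EXPECTATIONS.items():
    actual = rp.can_fetch("*", url)
    status = "ok" if actual == expected else "MISMATCH"
    print(f"{status}: {url} crawlable={actual}")
```

Because `/admin/` ends with a slash, `/administrator` does not match the prefix and stays crawlable; dropping the slash would block both.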

5. Treating crawl-delay like a Google control

Crawl-delay is not a reliable Google lever. Keep it out unless you know another bot in your environment needs it.

10-minute robots.txt QA workflow

  1. Production file loads: /robots.txt returns 200 as plain text
  2. Core pages: important pages are crawlable
  3. Low-value paths: internal search, admin, or staging patterns are blocked as intended
  4. Sitemaps: correct sitemap lines are present
  5. Launch safety: no accidental Disallow: / on production
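Steps 1 and 5 can be scripted. Here is a sketch of the pass conditions written as a pure function, so it is easy to unit-test; the status, content type, and body would come from whatever HTTP client you already use to GET /robots.txt on the production host:

```python
def robots_response_ok(status: int, content_type: str, body: str) -> bool:
    """Pass conditions for steps 1 and 5 of the QA workflow."""
    # Step 1: the file must load as plain text.
    if status != 200:
        return False
    if not content_type.lower().startswith("text/plain"):
        return False
    # Step 5: no blanket 'Disallow: /' line anywhere in the file.
    # (Appending a newline catches a file with no trailing newline.)
    return "disallow: /\n" not in (body.lower() + "\n")

print(robots_response_ok(200, "text/plain; charset=utf-8",
                         "User-agent: *\nDisallow:\n"))      # True
print(robots_response_ok(200, "text/html", "<html></html>")) # False
```

Keeping the check pure means the fetch and the assertion can live in different places, for example a monitoring job and a deploy gate sharing the same function.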

Use the Robots.txt Tester for live validation and the Robots.txt Generator to rebuild a cleaner file if the current one is messy.

Safe launch checklist for robots.txt

  • block staging, not production
  • keep primary revenue pages crawlable
  • test one real URL per rule before deploy
  • publish the production sitemap line
  • check the file again right after go-live

If your launch includes URL changes and redirects, pair this with How to Find Redirect Chains After a Website Migration and 301 vs 302 Redirects: When to Use Each and How to Test Them.

FAQ

Should I block internal search pages in robots.txt?

Often yes, especially when those pages create thin, duplicate, or infinite crawl paths.

Should I block tag pages or filter pages?

Only if they are genuinely low value and not part of your organic strategy. Test specific paths before publishing broad rules.

Can I use robots.txt to hide a staging site?

Use it as one layer, but staging should also be protected with authentication or IP restrictions.

What is the safest default robots.txt file?

The safest default for most live sites is a simple open file plus a sitemap line:

User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml

Final rule

Good robots.txt files are short, intentional, and tested.

If a rule exists, you should be able to explain:

  • what exact URLs it is meant to affect
  • why those URLs should be blocked or allowed
  • how you tested the rule before deploy

If you cannot answer those three questions, simplify the file and test it again.
