AI Crawlers & robots.txt: Complete Guide to GPTBot, ClaudeBot, PerplexityBot for E-Commerce

Published June 2026 · Last updated: June 15, 2026 · 11 min read · Technical SEO

✓ Fact-checked by Shop2LLM Research Team

Table of Contents

Meet the AI Crawlers: Who's Knocking at Your Digital Door
Why Default robots.txt Configurations Block AI Crawlers
How to Configure robots.txt for AI Crawlers: The Complete Template
The Relationship Between robots.txt and llms.txt
How to Verify AI Crawlers Are Visiting Your Store
Common robots.txt Mistakes That Block AI Discovery
AI Crawler Ethics and Best Practices
Shop2LLM: Automatic AI Crawler Management

Your robots.txt file is a gatekeeper. It decides which bots can enter your store and which are turned away. The problem? Most e-commerce stores are unknowingly turning away the most important visitors of 2026: AI crawlers like GPTBot, ClaudeBot, and PerplexityBot. This guide will show you exactly how to configure your robots.txt to welcome AI discovery — without compromising security or performance.

The silent crisis: A 2026 audit of 10,000 e-commerce robots.txt files found that 71% of stores block or fail to explicitly allow at least one major AI crawler. These stores have invested in product content, SEO, and user experience — but they're invisible to the AI platforms that are now driving purchase decisions for millions of shoppers.

Meet the AI Crawlers: Who's Knocking at Your Digital Door

Before you can configure access, you need to know who's trying to get in. Here are the major AI crawlers that matter for e-commerce in 2026:

GPTBot (OpenAI / ChatGPT)

User-agent token: GPTBot

OpenAI documents GPTBot as the crawler that collects content which may be used to train its models. Live retrieval is handled by separate agents: OAI-SearchBot for ChatGPT search results and ChatGPT-User for browsing a user triggers. GPTBot respects robots.txt directives and can be controlled with standard allow/disallow rules. It crawls from documented IP ranges published by OpenAI.

ClaudeBot (Anthropic / Claude)

User-agent token: ClaudeBot (anthropic-ai is an older token that some sites still list)

Anthropic's crawler supports Claude's web-aware responses. ClaudeBot is more conservative in its crawl rate than GPTBot but equally important for e-commerce visibility — Claude is rapidly gaining market share as a shopping assistant and product recommendation engine.

PerplexityBot (Perplexity AI)

User-agent token: PerplexityBot

Perplexity's crawler is arguably the most important for e-commerce because Perplexity is specifically optimized for product search and comparison. It crawls product pages aggressively to populate its real-time search index. Stores that block PerplexityBot are invisible to one of the fastest-growing product discovery platforms.

Google-Extended (Google AI)

User-agent token: Google-Extended

This is separate from Googlebot. Google-Extended is a robots.txt control token, not a crawler with its own user-agent string: Google's existing crawlers do the fetching, and the token tells Google whether your content may be used for Gemini model training and grounding. According to Google's documentation, it does not affect inclusion or ranking in Google Search. AI features inside Search, such as AI Overviews, follow your Googlebot and snippet controls instead.

CCBot (Common Crawl)

User-agent token: CCBot

Common Crawl's bot maintains a massive open web corpus used by many AI companies for training. While CCBot itself doesn't directly recommend products, the data it collects feeds into models that power AI recommendations. Allowing CCBot contributes to your products being represented in the broader AI ecosystem.

AppleBot (Apple Intelligence)

User-agent token: Applebot (Apple's documented spelling; robots.txt matching is case-insensitive, so AppleBot also matches)

Apple's crawler supports Apple Intelligence features, including the increasingly popular shopping and product comparison capabilities built into Siri and Apple's AI platform. With Apple's massive installed base, this crawler's importance for e-commerce is growing rapidly.

Why Default robots.txt Configurations Block AI Crawlers

Most e-commerce stores don't intentionally block AI crawlers. They block them by accident — through aggressive bot protection defaults, security plugins, and generic blocking rules. Here are the most common ways stores unknowingly shut out AI:

The Blanket Disallow

Some security-focused configurations use a blanket rule that blocks everything not explicitly allowed:

User-agent: *
Disallow: /
Allow: /$

User-agent: Googlebot
Allow: /

This pattern locks every AI crawler out of everything except the homepage (the only URL that Allow: /$ matches), while Googlebot gets the whole site. Your store is perfectly indexed on Google and perfectly invisible on ChatGPT, Claude, Perplexity, and Gemini.
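
A quick way to confirm what a pattern like this does is to run it through a parser. The sketch below uses Python's standard-library urllib.robotparser. The product URL is a hypothetical placeholder, and the Allow: /$ line is left out because that parser matches plain path prefixes and does not implement the $ anchor.

```python
from urllib.robotparser import RobotFileParser

# The blanket pattern from above, minus "Allow: /$" (not supported by this parser).
BLANKET_RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""


def allowed_agents(rules: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return {agent: may_fetch} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical product URL, used only for illustration.
PRODUCT_URL = "https://www.yourstore.com/product/blue-widget"

result = allowed_agents(
    BLANKET_RULES,
    PRODUCT_URL,
    ["Googlebot", "GPTBot", "ClaudeBot", "PerplexityBot"],
)
for agent, ok in result.items():
    print(f"{agent:15} {'allowed' if ok else 'blocked'}")
```

Only Googlebot comes back as allowed; every other agent falls through to the wildcard group and is blocked.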

Aggressive Bot-Blocking Plugins

WordPress security plugins like Wordfence, Sucuri, and iThemes Security often include bot-blocking features that identify and block non-search-engine crawlers by default. Many of these plugins haven't been updated to distinguish between malicious bots and legitimate AI crawlers — they just block anything that isn't Googlebot or Bingbot.

CDN-Level Bot Blocking

Cloudflare's Bot Fight Mode and similar CDN-level protections can block AI crawlers before they even reach your robots.txt file. These protections are designed to stop malicious bots, but AI crawlers can get caught in their broad detection nets. If you're using Cloudflare, check your Bot Management settings specifically for AI crawler handling.

Rate Limiting That's Too Aggressive

AI crawlers tend to be more aggressive than traditional search engine crawlers — they may request more pages in a shorter time window. If your server or CDN has aggressive rate limiting, AI crawlers may be throttled or temporarily blocked, causing them to deprioritize your site in future crawls.

How to Configure robots.txt for AI Crawlers: The Complete Template

Here's a production-ready robots.txt template that explicitly welcomes all major AI crawlers while maintaining standard SEO rules:

User-agent: *
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout
Disallow: /my-account/
Allow: /wp-admin/admin-ajax.php

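User-agent: Googlebot
Allow: /
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout
Disallow: /my-account/

User-agent: Bingbot
Allow: /
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout
Disallow: /my-account/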

# --- AI Crawlers ---

User-agent: GPTBot
Allow: /
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout
Disallow: /my-account/

User-agent: ClaudeBot
Allow: /
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout
Disallow: /my-account/

User-agent: anthropic-ai
Allow: /
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout
Disallow: /my-account/

User-agent: PerplexityBot
Allow: /
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout
Disallow: /my-account/

User-agent: Google-Extended
Allow: /
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout
Disallow: /my-account/

User-agent: CCBot
Allow: /
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout
Disallow: /my-account/

User-agent: AppleBot
Allow: /
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout
Disallow: /my-account/

Sitemap: https://www.yourstore.com/sitemap.xml

Key principles behind this configuration:

Allow AI crawlers to access product and category pages: These are the pages that need to be discoverable for AI product recommendations.
Disallow admin, cart, checkout, and account pages for all bots: These contain no product discovery value and shouldn't be crawled by any bot.
Use separate User-agent blocks for each crawler: This gives you granular control — you can adjust settings for individual crawlers without affecting others.
Include the sitemap directive: Helps crawlers discover your product pages efficiently.
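
Because the per-crawler groups in the template are identical, one option is to generate the file rather than maintain it by hand. The following is a minimal sketch, not a definitive tool: build_robots_txt and its parameters are names invented for this example, and it leaves out the Googlebot and Bingbot groups for brevity.

```python
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "PerplexityBot",
    "Google-Extended", "CCBot", "AppleBot",
]
PRIVATE_PATHS = ["/wp-admin/", "/cart", "/checkout", "/my-account/"]


def build_robots_txt(crawlers, private_paths, sitemap_url):
    """Render a robots.txt with a wildcard group plus one group per crawler.

    Every group repeats the private paths, because a crawler that has its
    own group does not read the wildcard group.
    """
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in private_paths]
    lines.append("Allow: /wp-admin/admin-ajax.php")

    for crawler in crawlers:
        lines += ["", f"User-agent: {crawler}", "Allow: /"]
        lines += [f"Disallow: {path}" for path in private_paths]

    lines += ["", f"Sitemap: {sitemap_url}", ""]
    return "\n".join(lines)


robots_txt = build_robots_txt(
    AI_CRAWLERS, PRIVATE_PATHS, "https://www.yourstore.com/sitemap.xml"
)
print(robots_txt)
```

Adding a new crawler then means appending one name to AI_CRAWLERS, which avoids the copy-paste drift that hand-edited groups tend to accumulate.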

Important: Crawl-delay is a non-standard directive. It is not part of the robots.txt standard (RFC 9309), Google ignores it, and AI crawlers (PerplexityBot included) may not honor it. If you're concerned about server load, implement rate limiting at the CDN or server level rather than relying on a Crawl-delay line in robots.txt.

The Relationship Between robots.txt and llms.txt

Many store owners confuse robots.txt and llms.txt or assume they serve the same purpose. They don't — and understanding the difference is critical:

robots.txt: Access Control

robots.txt is a gatekeeper. It tells crawlers which parts of your site they may access. Access is allowed by default, so what matters is that no Disallow rule, and no bot-protection layer in front of the file, shuts AI crawlers out. If one does, your structured data, product content, and SEO work are invisible to them.

llms.txt: Content Discovery

llms.txt is a tour guide. Once a crawler is allowed in (via robots.txt), llms.txt tells it what your store sells, which pages are most important, and where to find detailed product data. It's a Markdown file that provides a structured summary optimized for AI consumption. Keep in mind that llms.txt is a proposed convention rather than a ratified standard, so not every crawler is guaranteed to read it.

The critical insight: robots.txt and llms.txt are a two-layer system. robots.txt opens the door. llms.txt shows the way. You need both. A store with perfect llms.txt but an AI-blocking robots.txt is invisible. A store with permissive robots.txt but no llms.txt is discoverable but poorly understood — AI crawlers waste crawl budget trying to reverse-engineer your site structure.
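
To make the "tour guide" idea concrete, here is a minimal sketch that renders an llms.txt file. It assumes the layout described in the llms.txt proposal (an H1 title, a blockquote summary, then H2 sections containing link lists). The store name, URLs, and the render_llms_txt helper are all hypothetical.

```python
def render_llms_txt(store_name, summary, sections):
    """Render a minimal llms.txt document as Markdown text.

    sections maps a heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {store_name}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}", ""]
        lines += [f"- [{title}]({url}): {note}" for title, url, note in links]
    return "\n".join(lines) + "\n"


# Placeholder catalog data, invented for this example.
llms_txt = render_llms_txt(
    "Example Store",
    "Online shop selling widgets and widget accessories.",
    {
        "Products": [
            ("Blue Widget",
             "https://www.yourstore.com/product/blue-widget",
             "Flagship widget"),
        ],
        "Policies": [
            ("Shipping",
             "https://www.yourstore.com/shipping",
             "Delivery times and costs"),
        ],
    },
)
print(llms_txt)
```

The result would typically be served at /llms.txt, alongside robots.txt at the site root.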

How to Verify AI Crawlers Are Visiting Your Store

After configuring your robots.txt, you need to confirm that AI crawlers are actually visiting. Here's how to verify:

1. Check Your Server Access Logs

Search your access logs for AI crawler user-agent strings. Google-Extended will not show up here: it is a robots.txt token with no user-agent string of its own. Match case-insensitively, because Apple's crawler identifies itself as Applebot:

# Example: Check for GPTBot visits
grep -i "GPTBot" /var/log/nginx/access.log

# Example: Check for all AI crawlers in one command
grep -iE "GPTBot|ClaudeBot|PerplexityBot|CCBot|Applebot" /var/log/nginx/access.log
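
If you want per-crawler counts rather than raw matching lines, a short script can tally them. This is a sketch under stated assumptions: the token list is a starting point, count_ai_hits is a name invented here, and the sample log lines are synthetic, with simplified user-agent strings and documentation-range IP addresses.

```python
import re
from collections import Counter

# Tokens to look for in the User-Agent field; extend as needed.
AI_AGENT_TOKENS = ["GPTBot", "ClaudeBot", "anthropic-ai", "PerplexityBot", "CCBot", "Applebot"]
TOKEN_PATTERN = re.compile(
    "|".join(re.escape(token) for token in AI_AGENT_TOKENS), re.IGNORECASE
)


def count_ai_hits(log_lines):
    """Count log lines per AI crawler token, matching case-insensitively."""
    canonical = {token.lower(): token for token in AI_AGENT_TOKENS}
    hits = Counter()
    for line in log_lines:
        match = TOKEN_PATTERN.search(line)
        if match:
            hits[canonical[match.group(0).lower()]] += 1
    return hits


# Synthetic sample lines (not real crawler traffic).
SAMPLE_LOG = [
    '192.0.2.10 - - [15/Jun/2026:10:00:00 +0000] "GET /product/blue-widget HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [15/Jun/2026:10:00:05 +0000] "GET /category/widgets HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Applebot/0.1)"',
    '192.0.2.12 - - [15/Jun/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '192.0.2.10 - - [15/Jun/2026:10:00:12 +0000] "GET /sitemap.xml HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; gptbot/1.0)"',
]

counts = count_ai_hits(SAMPLE_LOG)
for token, total in counts.most_common():
    print(f"{token}: {total}")
```

To run it against a real log, pass an open file object instead of SAMPLE_LOG, for example count_ai_hits(open("/var/log/nginx/access.log")). User-agent strings can be spoofed, so treat the counts as indicative unless you also verify source IPs against the ranges each vendor publishes.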

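2. Use Google Search Console (for Google's crawlers)

Google Search Console's Crawl Stats report shows requests from Google's crawlers, broken down by Googlebot type. It does not report Google-Extended separately, because Google-Extended is a control token rather than a crawler. Use the report to confirm Googlebot is reaching your product pages, and check your robots.txt directly to confirm Google-Extended is not disallowed.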

3. Monitor with Shop2LLM's AI Visitor Analytics

Shop2LLM provides built-in AI crawler detection and analytics. You can see exactly which AI platforms are visiting your store, what pages they're crawling, and how frequently — all in a single dashboard. No log parsing required.

Common robots.txt Mistakes That Block AI Discovery

Even when store owners intentionally try to welcome AI crawlers, configuration mistakes can still block them. Here are the most common errors:

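Mistake 1: Assuming a Crawler Group Inherits the Wildcard Rules

A missing Allow line is not what blocks a crawler. Under the Robots Exclusion Protocol (RFC 9309), any path that no Disallow rule matches is allowed by default, so Allow: / is optional and mainly states intent. The real trap is that a crawler obeys only the group that names it. Once you add a GPTBot group, GPTBot ignores the User-agent: * group entirely, so every path you want kept private has to be repeated there.

# INCOMPLETE: GPTBot ignores the * group, so /cart and /checkout are open to it
User-agent: *
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout

User-agent: GPTBot
Disallow: /wp-admin/

# COMPLETE: private paths repeated in the crawler's own group
User-agent: GPTBot
Allow: /
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout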
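
Rather than reasoning about a group by hand, you can check what it actually permits. The sketch below feeds a small rule set to Python's standard-library urllib.robotparser. The base URL, the paths, and the OtherBot agent name are hypothetical placeholders.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /cart

User-agent: GPTBot
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

BASE = "https://www.yourstore.com"  # placeholder domain

checks = {
    # A group with only Disallow lines still allows everything else.
    "GPTBot /product/blue-widget": parser.can_fetch("GPTBot", BASE + "/product/blue-widget"),
    "GPTBot /wp-admin/": parser.can_fetch("GPTBot", BASE + "/wp-admin/"),
    # GPTBot has its own group, so the wildcard's /cart rule does not apply to it.
    "GPTBot /cart": parser.can_fetch("GPTBot", BASE + "/cart"),
    # An agent without its own group falls back to the wildcard group.
    "OtherBot /cart": parser.can_fetch("OtherBot", BASE + "/cart"),
}
for label, allowed in checks.items():
    print(f"{label:30} {'allowed' if allowed else 'blocked'}")
```

GPTBot can fetch the product page and the cart but not /wp-admin/, while OtherBot is blocked from the cart. The Disallow-only group does not lock the crawler out, and the wildcard rules are not inherited.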

Mistake 2: Blocking by IP Range

Some stores block entire IP ranges that they associate with bots. AI crawlers may use IP ranges that overlap with these blocks. Unless you're updating your IP blocklists weekly based on published crawler IP ranges, you're likely blocking legitimate AI crawlers.
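
One way to audit this is to compare your blocklist against the ranges a crawler operator publishes. This is a sketch only: overlapping_blocks and ExampleAIBot are invented for illustration, and the CIDR blocks are documentation-range placeholders, not real crawler ranges. In practice you would load the current ranges from each vendor's published list.

```python
import ipaddress


def overlapping_blocks(blocklist, crawler_ranges):
    """Return (crawler, crawler_cidr, blocked_cidr) for every overlap found."""
    blocked = [ipaddress.ip_network(cidr) for cidr in blocklist]
    found = []
    for crawler, cidrs in crawler_ranges.items():
        for cidr in cidrs:
            network = ipaddress.ip_network(cidr)
            for block in blocked:
                # Only compare networks of the same IP version.
                if block.version == network.version and block.overlaps(network):
                    found.append((crawler, cidr, str(block)))
    return found


# Placeholder data: documentation-only address ranges.
BLOCKLIST = ["192.0.2.0/24", "203.0.113.0/24"]
CRAWLER_RANGES = {"ExampleAIBot": ["192.0.2.128/25", "198.51.100.0/24"]}

conflicts = overlapping_blocks(BLOCKLIST, CRAWLER_RANGES)
for crawler, crawler_cidr, blocked_cidr in conflicts:
    print(f"{crawler}: {crawler_cidr} is covered by blocklist entry {blocked_cidr}")
```

Running a check like this whenever the blocklist or the published ranges change is cheaper than discovering the conflict from missing crawler traffic.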

Mistake 3: Using Case-Sensitive User-Agent Matching

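User-agent matching in robots.txt is case-insensitive per the standard (RFC 9309), and it is performed by the crawler's parser, not by your server. Not every parser is guaranteed to follow the standard, and firewall or server rules that match on the User-Agent header are often case-sensitive, so always use the exact token published in the crawler's documentation. For GPTBot, use GPTBot, not gptbot or GptBot.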

Mistake 4: Forgetting About Google-Extended

Many stores that carefully configure their robots.txt for GPTBot and ClaudeBot completely forget about Google-Extended. The token governs whether Google may use your content for Gemini training and grounding. It does not control AI Overviews: those are a Search feature and follow your Googlebot rules, so a store that allows Googlebot stays eligible for AI Overviews whatever its Google-Extended setting. If you also want your product data available to Gemini, make sure no rule disallows Google-Extended, including a blanket User-agent: * block.

AI Crawler Ethics and Best Practices

While making your store accessible to AI crawlers is important for visibility, you should also understand the ethical and business implications:

Content usage: When you allow AI crawlers, you're permitting AI companies to read your product data. This data may be used for training models and generating responses. This is generally beneficial — it means your products appear in AI recommendations — but you should be aware of it.
Competitive intelligence: Your competitors' AI crawler configurations are publicly visible in their robots.txt files. Check what your competitors are doing — if they're welcoming AI crawlers and you're not, they're capturing AI visibility you're missing.
Crawl budget: AI crawlers consume server resources. If you're on shared hosting with limited resources, consider implementing Crawl-delay directives. However, know that some AI crawlers may not honor these delays perfectly.
Data freshness: AI crawlers need to re-crawl your pages regularly to keep their data current. If your products change frequently (prices, stock, descriptions), keep your sitemap up to date, with accurate last-modified dates, so crawlers can tell which pages have changed.

Shop2LLM: Automatic AI Crawler Management

Configuring and maintaining robots.txt for multiple AI crawlers — and keeping it updated as new crawlers emerge — is tedious, error-prone manual work. Shop2LLM automates the entire process:

Auto-generated robots.txt with proper allow rules for all major AI crawlers
Automatic updates when new AI crawlers launch — your configuration stays current without manual intervention
AI crawler visit tracking — see which AI platforms are crawling your store, how often, and what they're accessing
Integrated llms.txt generation — the complete two-layer AI discovery system managed automatically
JSON-LD schema and MCP endpoint — beyond access control, Shop2LLM provides the structured data AI crawlers need once they're inside

Make your store discoverable by AI in 60 seconds

Don't let a misconfigured robots.txt keep your products invisible to AI. Shop2LLM handles everything automatically.
