I Audited a Series C ESG SaaS Platform.
They Blocked the AI Crawlers Their European Buyers Use.

By Lesli Rose · May 14, 2026 · 9 min read

This company is one of the more cited names in mid-to-enterprise ESG and CSRD reporting. Fortune 500 customers across pharma, telco, banking, aerospace, retail, and energy. Three consecutive years of Gartner Market Guide recognition. A 2024 Technology Fast 50 award. A recent Series C of $37.7M CAD with a major Montreal venture firm leading and the existing investor base re-upping. The headline use of funds: European expansion ahead of the next CSRD reporting cycle.

And then I pulled their robots.txt.

The file actively blocks GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider, Amazonbot, FacebookBot, and Twitterbot. It allows ChatGPT-User, PerplexityBot, OAI-SearchBot, and Applebot. The intent is the sensible 2024 stance: protect the training corpus, allow live citation. The side effect in 2026 is that the company is suppressing its own visibility in the exact AI tools its European procurement buyers are using to build the shortlist.

The Scores

Seven categories: Technical SEO, On-Page SEO, Content, Schema, AI Discoverability, Social SEO, and Earned Visibility.

An overall weighted score of 50/100. The AI Discoverability score of 22 is one of the lowest I have given to a Series C SaaS, and the gap is entirely self-inflicted through the robots.txt configuration. Strong content, strong customer base, strong stack, and the AI front door is bolted shut.

Finding #1: The robots.txt Blocks the Training Crawlers Your European Buyers Use

When a sustainability director in Frankfurt or Paris with a CSRD obligation opens Perplexity and types "best CSRD reporting platform," Perplexity does two things at once. It retrieves live web pages with PerplexityBot (which this company allows, fine), and it leans on the underlying model's trained picture of "the CSRD platform category." That picture comes from training data, and nearly every foundation model draws part of its training data from Common Crawl, which this company blocks.

The trained layer names the shortlist. Live retrieval only fetches information about the platforms the model already "knows." When training crawlers cannot access your site, your entity graph in every downstream model gets thinner over time. Competitors that allow training crawlers accumulate weight every week.

The mention rate bears this out. I ran five buyer-intent prompts on ChatGPT, Claude, and the web results Perplexity Sonar retrieves against. The company appeared in 40 percent of the resulting AI shortlists. Every appearance described the platform as "mid-market" or "gaining traction," even though its customer roster is Fortune 500 multinationals.

The fix is a few lines of robots.txt. Remove the Disallow on GPTBot, ClaudeBot, and Google-Extended. Reconsider CCBot. Keep the explicit Allow on PerplexityBot and ChatGPT-User. Document the change with security and legal.
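As a rough sketch of that change, the snippet below renders a revised robots.txt and checks it with Python's standard-library parser. The grouping is my assumption, not the company's actual file: the three training crawlers move to the allow list, CCBot stays blocked until legal has reviewed it, and the four crawlers the fix does not mention (Bytespider, Amazonbot, FacebookBot, Twitterbot) stay blocked. The example.com URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Crawler names come from the audit; how they are grouped here is an assumption.
ALLOW = ["GPTBot", "ClaudeBot", "Google-Extended",
         "ChatGPT-User", "PerplexityBot", "OAI-SearchBot", "Applebot"]
STILL_BLOCKED = ["Bytespider", "Amazonbot", "FacebookBot", "Twitterbot"]
UNDER_REVIEW = ["CCBot"]  # "Reconsider CCBot": kept blocked until legal decides


def build_robots_txt(allow, block):
    """Render one User-agent group per crawler, plus a permissive default group."""
    groups = [f"User-agent: {agent}\nAllow: /" for agent in allow]
    groups += [f"User-agent: {agent}\nDisallow: /" for agent in block]
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


def crawl_permissions(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True/False} for whether each agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    robots_txt = build_robots_txt(ALLOW, STILL_BLOCKED + UNDER_REVIEW)
    print(robots_txt)
    print(crawl_permissions(robots_txt, ALLOW + STILL_BLOCKED + UNDER_REVIEW))
```

The permission check is the part worth rerunning after every edit to the file, since one misplaced group can quietly re-block a crawler.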

Finding #2: /llms.txt Returns the Homepage HTML

I requested https://[domain]/llms.txt. The server returned 200 OK with the full homepage HTML as the response body. To an AI agent, this reads as "the file does not exist." For a Series C SaaS in an AI-discovered regulated category, the missing /llms.txt is a daily lost-citation event.
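This failure mode is easy to miss because the status code looks healthy. A minimal sketch of the check, assuming a plain HTTP GET is enough to reproduce it: `classify_llms_txt` and `fetch_llms_txt` are names I made up for illustration, and the HTML sniffing is a heuristic, not a standard.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def classify_llms_txt(status, content_type, body):
    """Classify a /llms.txt response as 'ok', 'missing', or 'soft-404'."""
    if status != 200:
        return "missing"
    media_type = content_type.split(";")[0].strip().lower()
    head = body.lstrip()[:200].lower()
    looks_like_html = head.startswith("<!doctype html") or "<html" in head
    if media_type == "text/html" or looks_like_html:
        # 200 OK, but the body is a web page rather than the text file.
        return "soft-404"
    return "ok"


def fetch_llms_txt(base_url):
    """Fetch <base_url>/llms.txt and return (status, content_type, body)."""
    request = Request(base_url.rstrip("/") + "/llms.txt",
                      headers={"User-Agent": "audit-sketch"})
    try:
        with urlopen(request, timeout=10) as response:
            body = response.read().decode("utf-8", "replace")
            return response.status, response.headers.get("Content-Type", ""), body
    except HTTPError as error:
        return error.code, error.headers.get("Content-Type", ""), ""
```

The response described above (200 OK with the homepage HTML) comes back as "soft-404" from this classifier.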

A properly structured /llms.txt for this company would name the platform, the four product modules, the supported frameworks (CSRD, ESRS, SFDR, EU Taxonomy, GRI, SASB, CDP, TCFD), the named Fortune 500 customer roster, the Series C anchor, and the Gartner Market Guide recognition. That single file, served as text/plain at the root, materially changes how AI agents describe the company. Most competitors do not ship one. The ones that do get cited more.
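To make that concrete, here is a hedged sketch that renders such a file. The layout (an H1 name, a blockquote summary, then H2 sections) loosely follows the public llms.txt proposal, which expects links under each section, so a real file would point each bullet at a URL. The platform name, module names, and customer names are placeholders. Only the framework list and the funding and recognition facts come from the audit.

```python
FRAMEWORKS = ["CSRD", "ESRS", "SFDR", "EU Taxonomy", "GRI", "SASB", "CDP", "TCFD"]


def build_llms_txt(name, summary, modules, frameworks, customers, proof_points):
    """Render an llms.txt-style Markdown document as a single string."""
    sections = {
        "Product modules": modules,
        "Supported frameworks": frameworks,
        "Named customers": customers,
        "Funding and recognition": proof_points,
    }
    lines = [f"# {name}", "", f"> {summary}", ""]
    for title, items in sections.items():
        lines.append(f"## {title}")
        lines.append("")
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_llms_txt(
        name="Example ESG Platform",                        # placeholder
        summary="ESG and CSRD reporting software for enterprises.",
        modules=["Module 1", "Module 2", "Module 3", "Module 4"],  # placeholders
        frameworks=FRAMEWORKS,
        customers=["Customer A (pharma)", "Customer B (telco)"],   # placeholders
        proof_points=[
            "Series C: $37.7M CAD",
            "Gartner Market Guide: three consecutive years",
            "2024 Technology Fast 50 award",
        ],
    ))
```

Whatever generates the file, the server still has to return it as text/plain at /llms.txt, otherwise the result is the same soft 404 as before.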

Finding #3: No SoftwareApplication Schema. No FAQPage. No Person.

The homepage carries one JSON-LD block: WebSite, LocalBusiness (with the Montreal HQ address), BreadcrumbList, and an empty placeholder Organization block (no name, no URL, no logo). That is it.
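This inventory is simple to reproduce. The sketch below, using only the standard library, pulls every JSON-LD block out of a page, lists the @type values it finds, and flags Organization nodes with no name, URL, or logo. The function names and the list of "required" types are my own choices for this audit, not a schema.org rule.

```python
import json
from html.parser import HTMLParser

REQUIRED_TYPES = {"SoftwareApplication", "FAQPage", "Person", "Review"}
ORG_FIELDS = ("name", "url", "logo")


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append(json.loads("".join(self._buffer)))


def iter_nodes(block):
    """Yield top-level nodes, flattening lists and @graph containers."""
    if isinstance(block, list):
        for item in block:
            yield from iter_nodes(item)
    elif isinstance(block, dict):
        if "@graph" in block:
            yield from iter_nodes(block["@graph"])
        else:
            yield block


def audit_jsonld(html):
    """Report which schema types are present, missing, or present but empty."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    found, empty_orgs = set(), 0
    for block in extractor.blocks:
        for node in iter_nodes(block):
            types = node.get("@type", [])
            types = [types] if isinstance(types, str) else types
            found.update(types)
            if "Organization" in types and not any(node.get(f) for f in ORG_FIELDS):
                empty_orgs += 1
    return {
        "found": sorted(found),
        "missing": sorted(REQUIRED_TYPES - found),
        "empty_organizations": empty_orgs,
    }
```

Note that it only inspects top-level nodes, so a Person nested inside an Organization would be reported as missing. That is acceptable for a first pass but worth extending for a full audit.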

No SoftwareApplication schema. Mandatory for an enterprise SaaS. AI parsers use this to confirm that the company sells a software product, not just that it exists. Without it, the AI has to infer the product from unstructured text and is less confident citing it.

No FAQPage schema anywhere on the site. FAQPage is what ChatGPT, Claude, and Perplexity directly parse for citation-style answers. On a site whose buyers ask CSRD and ESRS questions, the absence of FAQ markup is the single highest-volume lost-citation surface.

No Person schema for the founding team. CEO and co-founder data is on LinkedIn and Crunchbase but not linked from the site as sameAs. AI builds entity graphs by connecting Organization to Person to LinkedIn to Crunchbase. Each missing link weakens the citation.

No Review schema for the named customer roster. A dozen Fortune 500 companies are listed as customers on the homepage as logos. None carry Review schema with author Organization and reviewBody. The proof exists; the structured signal does not.

The site has the strongest customer roster of any platform in its mid-tier AI shortlist. The schema layer that surfaces that roster to AI is the layer that has not been built. That is the single biggest reason AI keeps describing the company as "mid-market."
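A minimal sketch of what the missing layer could look like as one JSON-LD graph. The property names (applicationCategory, sameAs, worksFor, mainEntity, acceptedAnswer, itemReviewed, reviewBody) are schema.org vocabulary. Everything else is an assumption for illustration: the function name, the @id scheme, and every value passed in. In particular, reviewBody should only ever carry a quote the customer has actually approved.

```python
import json


def build_entity_graph(org_name, site_url, product_name, founder, faqs, reviews):
    """Return a JSON-LD document linking the org, product, founder, FAQs, and reviews."""
    base = site_url.rstrip("/")
    org_id = base + "/#organization"
    software_id = base + "/#software"
    graph = [
        {"@type": "Organization", "@id": org_id, "name": org_name, "url": site_url},
        {
            "@type": "SoftwareApplication",
            "@id": software_id,
            "name": product_name,
            "applicationCategory": "BusinessApplication",
            "operatingSystem": "Web",
            "publisher": {"@id": org_id},
        },
        {
            "@type": "Person",
            "name": founder["name"],
            "jobTitle": founder["job_title"],
            "worksFor": {"@id": org_id},
            "sameAs": founder["profiles"],  # e.g. LinkedIn and Crunchbase URLs
        },
        {
            "@type": "FAQPage",
            "mainEntity": [
                {"@type": "Question", "name": question,
                 "acceptedAnswer": {"@type": "Answer", "text": answer}}
                for question, answer in faqs
            ],
        },
    ]
    # One Review node per named customer, authored by the customer organization.
    for customer, quote in reviews:
        graph.append({
            "@type": "Review",
            "itemReviewed": {"@id": software_id},
            "author": {"@type": "Organization", "name": customer},
            "reviewBody": quote,
        })
    return {"@context": "https://schema.org", "@graph": graph}


if __name__ == "__main__":
    document = build_entity_graph(
        org_name="Example ESG Inc.",          # placeholder values throughout
        site_url="https://example.com/",
        product_name="Example ESG Platform",
        founder={"name": "Founder Name", "job_title": "CEO",
                 "profiles": ["https://www.linkedin.com/in/placeholder"]},
        faqs=[("What is CSRD?", "Placeholder answer.")],
        reviews=[("Customer A", "Placeholder quote approved by the customer.")],
    )
    print(json.dumps(document, indent=2))
```

The shared @id values are what tie the nodes together, so the Person and the Review resolve to the same Organization and product instead of standing alone.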

Finding #4: The Pricing Page 302-Redirects to the Homepage

The natural URL a buyer or an AI agent tries for the pricing page returns 302 to the homepage. There is no /pricing page anywhere on the site. AI tools repeatedly answer pricing queries with "custom quote, contact sales," which routes the buyer to whichever competitor has any disclosed pricing signal. A pricing page does not need to publish a number to be useful; it needs to disclose the model (per-entity, per-report, per-user, ARR floor) so the AI has something to cite.
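The redirect is worth checking without following it, because a client that follows redirects just sees a 200 on the homepage. A small sketch under stated assumptions: both function names are mine, and "homepage" here means a root path only, so a localized homepage such as /en/ would count as "redirects-elsewhere".

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def classify_pricing_response(status, location, request_url):
    """Classify the response for a pricing URL without following redirects."""
    if status == 200:
        return "served"
    if status in REDIRECT_CODES and location:
        # Resolve relative Location headers against the requested URL.
        target = urlsplit(urljoin(request_url, location))
        if target.path in ("", "/"):
            return "redirects-to-homepage"
        return "redirects-elsewhere"
    return "missing"


def head_without_redirects(url):
    """Send one HEAD request over HTTPS and return (status, Location header)."""
    parts = urlsplit(url)
    connection = http.client.HTTPSConnection(parts.netloc, timeout=10)
    try:
        connection.request("HEAD", parts.path or "/")
        response = connection.getresponse()
        return response.status, response.getheader("Location")
    finally:
        connection.close()
```

For the case in this audit, a 302 with a Location of "/" classifies as "redirects-to-homepage".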

Finding #5: Cited in 3 of 8 Listicles, Always as "Mid-Market"

The company appears in three of the eight major "best CSRD platform 2026" listicles I checked. In every one, the framing is mid-tier ("closest to a full CSRD solution in the mid-market") while platforms with fewer Fortune 500 customers are framed as enterprise leaders. The listicle authors are competitors or analyst-adjacent publications; the framing they use is what the AI then cites.

The fix is two layers. First, repair the entity graph (Findings #1 through #3) so the AI has a corrected source to anchor against. Second, reach out to the EU-focused listicles (KEY ESG, Tanso in Munich, Coolset, Safaqes) and ask for re-positioning, supported by the named-customer evidence and the Series C announcement. Most listicle authors update their rankings on request when the supporting evidence is strong enough.

What is Actually Working

Server-rendered Craft CMS stack. Fully crawlable HTML, Cloudflare CDN, Tailwind frontend. The platform is not the bottleneck.

Multi-locale. English, French, and German variants with proper hreflang declarations. EU expansion-ready at the infrastructure layer.

Customer roster. Twelve named Fortune 500 customers across pharma, telco, banking, aerospace, retail, and energy. Rarest asset in the category.

Analyst recognition. Three consecutive Gartner Market Guide inclusions. That is institutional credibility AI extractors weigh heavily once they can read it from the right surfaces.

Content velocity. 62 blog posts, 52 resources, 42 news items, recent CSRD-specific publishing cadence. The engine is running.

Strong LinkedIn presence. 14K followers, 142 employees, verified company page, active CEO voice. The social proof exists.

The Takeaway for Regulated-Category SaaS

CSRD reporting is the textbook AI-discovered category. The buyer has a regulator-imposed deadline. The platform decision is a board-visible procurement. The two-minute initial scan happens inside an AI tool before any human at the vendor knows the deal exists. Being absent from that initial scan, or being framed wrongly inside it, is the new version of being absent from the search results page.

This audit pattern repeats across regulated-category SaaS. Strong product, strong customer proof, strong content engine. The structural layer that decides whether AI recommends them is built for 2022 search behavior, not 2026 AI procurement behavior. The fixes are configuration and schema, not product. The hard part is already done. The missing piece is making the proof legible to the systems that decide the shortlist.

Frequently Asked Questions

Why is blocking GPTBot and ClaudeBot a problem for B2B SaaS?

GPTBot, ClaudeBot, and Google-Extended are the training crawlers that build the long-term entity knowledge AI models use to answer category questions. Live search crawlers only retrieve in the moment. The trained layer names the shortlist. Block the training crawlers and your entity graph in every downstream model thins out while competitors who allow them accumulate weight.

Why are CSRD reporting buyers using AI before contacting vendors?

EU CSRD compliance is a mandatory, time-bound, regulator-driven workflow. The sustainability director with the obligation has roughly two minutes for the initial vendor scan. AI-assisted research collapses that scan into a single Perplexity or ChatGPT query. The procurement shortlist comes out of that answer. Being absent is existential.

Should enterprise SaaS allow CCBot to crawl their site?

CCBot is Common Crawl, a major training source for nearly every foundation model. Blocking CCBot is asking the AI ecosystem to forget you. The 2024 reasoning was protection from training without compensation. The 2026 reality is that the AI shortlist your buyers consult is built from the data CCBot collected. The cost of blocking now usually exceeds the cost of allowing.

How does a SaaS get re-categorized from "mid-market" to "enterprise" in AI responses?

AI categorization is downstream of three signals: entity statements on the company site, schema that names customers and product scope, and the third-party sources AI cites. Most "mid-market" framings persist because the source documents say so and nothing in the entity graph contradicts them. Fix: Review schema for named enterprise customers, an explicit /enterprise landing page, founder Person schema, and listicle re-positioning.

Does Your SaaS Have the Same AI-Citation Pattern?

I run the same audit on regulated-category SaaS, fintech, healthtech, and B2B platforms. Technical SEO, schema, AI discoverability, earned visibility, and a clear roadmap.

Run Your Visibility Report
