robots.txt control for host aliases via mod_rewrite

Suppose you have a website served under two different hostnames by a single virtual host.

<VirtualHost *:80>
    ServerName www.example.com
    ServerAlias beta.example.com
    ....
</VirtualHost>

The content is the same, but you want each host to serve a different robots.txt file, for example to exclude the secondary host from indexing entirely.

It would be handy if we could simply say:

User-agent: *
Allow: http://www.example.com/

User-agent: *
Disallow: http://beta.example.com/

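to allow all bots to crawl the primary host and disallow them from the secondary one, but this syntax is imaginary. Firstly, the original robots.txt spec has no Allow keyword (it only appeared later as an extension), and secondly, rule values must be paths relative to the host root, not absolute URLs.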

The solution is to have 2 different robots.txt files:

robots-www.txt

User-agent: *
Disallow:

robots-beta.txt

User-agent: *
Disallow: /

and serve them via mod_rewrite like this:

<VirtualHost *:80>
    ServerName www.example.com
    ServerAlias beta.example.com
    ...
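    RewriteEngine On
    # Requests for the primary host get the permissive file
    RewriteCond %{HTTP_HOST} ^www\.example\.com$
    RewriteRule ^/robots\.txt$ /robots-www.txt [L]
    # Requests for the secondary host get the blocking file
    RewriteCond %{HTTP_HOST} ^beta\.example\.com$
    RewriteRule ^/robots\.txt$ /robots-beta.txt [L]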
</VirtualHost>

Now http://www.example.com/robots.txt silently serves robots-www.txt and http://beta.example.com/robots.txt serves robots-beta.txt. The rewrite is internal, so the client sees no redirect.
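
Because the two file names differ only in the first label of the host name, the two rule pairs could probably be folded into one by capturing that label in the RewriteCond and reusing it as %1 in the substitution. This is only a sketch under that naming assumption; it is case-sensitive on purpose, since a host sent as WWW.example.com would otherwise map to a robots-WWW.txt file that does not exist:

    RewriteEngine On
    # Capture "www" or "beta" from the Host header...
    RewriteCond %{HTTP_HOST} ^(www|beta)\.example\.com$
    # ...and reuse it (%1) to pick robots-www.txt or robots-beta.txt
    RewriteRule ^/robots\.txt$ /robots-%1.txt [L]

Any host that does not match the condition falls through and gets whatever /robots.txt the document root provides, if any.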

This is also handy during domain name migrations, while you wait for DNS changes to propagate around the globe, until you feel safe to shut down the secondary host completely and possibly set up 301 redirects to the primary.
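
For that final step, a 301 redirect from the secondary host to the primary could look roughly like the following, placed in the same virtual host. It is a minimal sketch that assumes plain http (as in the *:80 virtual host above) and that every path on the secondary host has an equivalent on the primary:

    RewriteEngine On
    # Only act on requests that arrived for the secondary host
    RewriteCond %{HTTP_HOST} ^beta\.example\.com$ [NC]
    # Send a permanent redirect to the same path on the primary host
    RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]

Once this is in place, the robots-beta.txt rule is no longer reached if the redirect comes first, since /robots.txt on the secondary host is redirected like any other path.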

3 Responses to “robots.txt control for host aliases via mod_rewrite”

  1. Israel WebDev Says:

    This is perfect! Thanks.
    Small note: the robots-beta.txt should read

    User-agent: *
    Disallow: /

    and robots-www.txt should be:

    User-agent: *

    or a blank file (or no file)

  2. Neil Ferns Says:

    We could use an
    Alias /robots.txt /home/www/www.example.com/htdocs/noallow-robots.txt
    which would avoid complicated mod_rewrite rules,

    and have 2 different virtual hosts for the 2 different domains, pointing to the same htdocs. The other one would use
    Alias /robots.txt /home/www/www.example.com/htdocs/Allow-robots.txt
    or no file.

  3. cherouvim Says:

    @Israel WebDev: I think plain “Disallow:” is OK as well.

    @Neil Ferns: In this example there is only 1 virtual host, with 2 domains (aliases).