robots.txt control for host aliases via mod_rewrite
Suppose you have a website served under two different hostnames:
<VirtualHost *:80>
ServerName www.example.com
ServerAlias beta.example.com
....
</VirtualHost>
The content is the same but you want to serve a different robots.txt file, possibly excluding any indexing from the secondary host.
It would be handy if we could simply say:
User-agent: *
Allow: http://www.example.com/
User-agent: *
Disallow: http://beta.example.com/
to allow all bots to crawl the primary host and disallow them from the secondary one, but this syntax is imaginary. First, the original robots.txt spec has no Allow keyword, and second, rule paths must be relative to the site root, not full URLs.
The solution is to have 2 different robots.txt files:
robots-www.txt
User-agent: *
Disallow:
robots-beta.txt
User-agent: *
Disallow: /
and serve them via mod_rewrite like this:
<VirtualHost *:80>
ServerName www.example.com
ServerAlias beta.example.com
...
RewriteEngine On
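# Primary host: serve the permissive file (the dot in the pattern is escaped)
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule ^/robots\.txt$ /robots-www.txt [L]
# Secondary host: serve the file that blocks all crawling
RewriteCond %{HTTP_HOST} ^beta\.example\.com$
RewriteRule ^/robots\.txt$ /robots-beta.txt [L]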
</VirtualHost>
Now http://www.example.com/robots.txt will silently serve robots-www.txt and http://beta.example.com/robots.txt will serve robots-beta.txt.
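As a possible variation, the two rule pairs can be collapsed into one by capturing the host prefix in the RewriteCond and reusing it as %1 in the substitution. This is only a sketch: it assumes the files keep the robots-www.txt / robots-beta.txt naming used here and that the rules sit inside the same VirtualHost block.

```apache
RewriteEngine On
# Capture "www" or "beta" from the Host header; %1 refers to this group.
RewriteCond %{HTTP_HOST} ^(www|beta)\.example\.com$
# Map /robots.txt to /robots-www.txt or /robots-beta.txt accordingly.
RewriteRule ^/robots\.txt$ /robots-%1.txt [L]
```

The match is deliberately case-sensitive: with [NC], a Host header such as WWW.example.com would match and point at a robots-WWW.txt file that most likely does not exist.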
This is also handy during domain name migrations, when you are waiting for DNS changes to propagate around the globe before you feel safe shutting down the secondary host completely and possibly adding 301 redirects to the primary.
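Once the migration is over, the 301 redirects mentioned above could look roughly like this. It is a sketch, assuming the same single VirtualHost and that these lines replace the robots.txt rules, so that every request to the secondary host, robots.txt included, is sent to the primary:

```apache
RewriteEngine On
# Any request that arrives on the secondary host...
RewriteCond %{HTTP_HOST} ^beta\.example\.com$ [NC]
# ...is permanently redirected to the same path on the primary host.
RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]
```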
April 21st, 2009 at 23:39
This is perfect! Thanks.
Small note: the robots-beta.txt should read
User-agent: *
Disallow: /
and robots-www.txt should be:
User-agent: *
or a blank file (or no file at all).
April 28th, 2009 at 8:50
We could use an
Alias /robots.txt /home/www/www.example.com/htdocs/noallow-robots.txt
which would avoid complicated mod_rewrite rules,
and have two different virtual hosts for the two domains, both pointing to the same htdocs:
Alias /robots.txt /home/www/www.example.com/htdocs/Allow-robots.txt
or no file.
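Spelled out, that suggestion might look like the sketch below. The Alias paths come from the comment itself; the DocumentRoot value and the split into two VirtualHost blocks are assumptions made for illustration, and mod_alias has to be enabled.

```apache
# Primary host: crawling allowed
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /home/www/www.example.com/htdocs
    Alias /robots.txt /home/www/www.example.com/htdocs/Allow-robots.txt
</VirtualHost>

# Secondary host: same content, crawling blocked
<VirtualHost *:80>
    ServerName beta.example.com
    DocumentRoot /home/www/www.example.com/htdocs
    Alias /robots.txt /home/www/www.example.com/htdocs/noallow-robots.txt
</VirtualHost>
```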
April 28th, 2009 at 8:57
@Israel WebDev: I think plain “Disallow:” is OK as well.
@Neil Ferns: This example uses a single virtual host with two domains (aliases).