Dynamic Robots.txt with ASP.NET 2.0

The Problem

Scenario: You have an ASP.NET 2.0 website in IIS 6.0 that receives both http and https requests. Some people might link to the https versions of your pages, but you really want Google (and other search engines) to crawl only the http versions. There are relative links in the site, so you know that as soon as Google crawls one https link it is going to pick up the whole site over https. So as well as https links in the Google search results, you may also have to contend with lower rankings, because Google sees each page in both http and https form and thinks the site has duplicate content.

Google recommends that you serve a different robots.txt file for http and https, so that it knows not to crawl the https version of your site. But how can you get IIS to serve up different versions of robots.txt?
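In practice, that means a crawler requesting robots.txt over http should see something along these lines:

User-agent: *
Allow: /

while the same request over https should get back a blanket block:

User-agent: *
Disallow: /

(The handler we build below also adds a Disallow line for any directories you want hidden on the http side.)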

The Solution

Let's make the robots.txt file dynamic. We'll turn it into an ASP.NET handler so we can put .NET code in it. So that you don't need to recompile your whole web project just for this, we'll take advantage of ASP.NET 2.0's on-demand compilation, and to boot we'll make sure the site still serves up other text files as it normally would.

Awesome, wot? :-)

The ASP.NET Pipeline

I’m assuming you know what the ASP.NET HTTP Pipeline is and vaguely how it works.

If you don’t, here is a crash course in a few sentences. When IIS receives a request it can, based on the file extension, either a) handle it by itself, or b) pass the request on to some other engine (e.g. ASP.NET) and then just hand back to the client whatever that engine responds with. By default, IIS will serve text files on its own, but we are going to tell IIS to let ASP.NET process text files – then we are going to tell ASP.NET to treat the robots.txt file as if it were executable code (in this case an HTTP handler).

Create your Text File

Create a robots.txt file in the root of your web project, containing the text below. The robots.txt here is written in C#. There is no reason a similar solution won't work in VB.NET, but I prefer C#.

Robots.txt


<%@ WebHandler Language="C#" Class="MyNamespace.robotshandler" %>

using System;
using System.Web;

namespace MyNamespace {

    public class robotshandler : IHttpHandler {

        public void ProcessRequest(HttpContext context) {

            context.Response.ContentType = "text/plain";
            context.Response.Write("User-agent: *\n");

            // The HTTPS server variable is "off" for plain http requests
            // (equivalently, context.Request.IsSecureConnection is false).
            if (context.Request.ServerVariables["HTTPS"] == "off") {
                // HTTP: let crawlers in, except for any directories we want hidden
                context.Response.Write("Allow: /\n");
                context.Response.Write("Disallow: /MyDisallowedDirectory/\n");
            } else {
                // HTTPS: keep crawlers out of the whole site
                context.Response.Write("Disallow: /\n");
            }
        }

        public bool IsReusable {
            get { return false; }
        }
    }
}

Check that you can browse to your robots.txt file. You should see the code come through exactly as you typed it, WebHandler directive and all.

Getting IIS to Pass .txt Requests to ASP.NET

At the moment, IIS is still serving .txt files itself instead of letting .NET know about them – a request for a .txt file doesn't even reach .NET. Let's tell IIS to pass those requests through to .NET.

Get the path to the ASPX engine

  1. Open IIS, right-click on your website and bring up the Properties screen
  2. Go to Home Directory > Configuration. You will be on the Mappings tab.
  3. Locate the .aspx item and click Edit. Copy the path in the Executable field and cancel out of that window – we don't want to change anything there, we just want the path (a typical value is shown below)
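On a default 32-bit .NET 2.0 install, the path you copy will typically look something like the following (your framework version folder may differ):

C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\aspnet_isapi.dll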

Create the ISAPI Entry for .txt

You are still on the Mappings tab for your site in IIS.

  1. Click "Add"
  2. Populate the Executable field with the value you copied in the last section
  3. Enter ".txt" in the Extension field
  4. Enter "GET" in the "Limit To" field
  5. Press OK a million times to save all your changes. You're done with IIS.

Check the result

When you now try to browse to your text file you should get a blank page. This means IIS is now passing the request to ASP.NET, but ASP.NET doesn’t know what to do with it.

Getting ASP.NET to Handle a .txt Request as a Handler

OK. Web.config time!

Add the HTTP Handler

Add the following under system.web

<httpHandlers>
    <add path="/robots.txt" verb="GET" type="System.Web.UI.SimpleHandlerFactory" />
</httpHandlers>

What does this mean? Basically it is saying "when a GET request comes in for /robots.txt, process it the same way you would a generic HTTP handler (an .ashx file)". Funnily enough, a handler is exactly what we defined in our robots.txt file!

Browse to your robots.txt file now, and you will get a .NET error along the lines of "No build provider registered for extension '.txt'". Uh-oh! What's happening here is that .NET knows it needs to process the code in our file, and that it needs to do some compiling, but it doesn't know what to build it with.

I think that if you had deployed a compiled HTTP Handler solution instead, the step below would not be required (I haven’t tested this – if you try it, please let me know how you go).
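For what it's worth, here is a rough sketch of what that compiled alternative might look like: if the handler class above lived in App_Code (or in an assembly in your bin folder), you could point the httpHandlers entry straight at the class and skip both the code in the .txt file and the build provider, because there would be nothing for ASP.NET to compile on demand. The type name below is just the one from the example handler; adjust it to match your own class (and add the assembly name after a comma if it is precompiled into bin).

<httpHandlers>
    <add path="/robots.txt" verb="GET" type="MyNamespace.robotshandler" />
</httpHandlers>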

Add the Build Provider

Add the following to the web.config under system.web, compilation:

<buildProviders>
    <add extension=".txt" type="System.Web.Compilation.WebHandlerBuildProvider" />
</buildProviders>
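For clarity, the buildProviders element sits inside the compilation element, so that part of web.config ends up looking roughly like this:

<system.web>
    <compilation>
        <buildProviders>
            <add extension=".txt" type="System.Web.Compilation.WebHandlerBuildProvider" />
        </buildProviders>
    </compilation>
</system.web>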

Check your robots.txt file! It should be working, yay!

BUT – check another text file in your site. Oh noes! It is blank! .NET still doesn’t know what to do with text files other than our robots.txt. Read on for the fix0r.

Getting ASP.NET to Continue to Serve all Other .txt Files Normally

Let's add one more httpHandlers entry to our web.config, under system.web, httpHandlers.

<add path="*.txt" verb="GET" type="System.Web.StaticFileHandler" />

This line says that for everything ending in .txt, don’t try to process it – just return the static file. Note that our robots.txt matches the more specific rule we supplied earlier, so it will still be processed by the .NET engine.
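Putting the two entries together, the httpHandlers section of web.config now looks something like this:

<httpHandlers>
    <add path="/robots.txt" verb="GET" type="System.Web.UI.SimpleHandlerFactory" />
    <add path="*.txt" verb="GET" type="System.Web.StaticFileHandler" />
</httpHandlers>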

Conclusion

This is one technique for a quick and easy dynamic robots.txt file, taking advantage of ASP.NET 2.0's on-demand compilation support. You could apply this principle to almost any file extension if you need other dynamic files in your site, but if you need to do anything really funky or heavy I would recommend writing proper compiled HTTP handlers rather than using this quick-fix technique.
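As a sketch of that generalisation (untested, and the sitemap.xml name is just an example), the recipe for another extension is the same three pieces: map the extension to aspnet_isapi.dll in IIS, then register the specific handler path, a static catch-all, and a build provider in web.config:

<httpHandlers>
    <add path="/sitemap.xml" verb="GET" type="System.Web.UI.SimpleHandlerFactory" />
    <add path="*.xml" verb="GET" type="System.Web.StaticFileHandler" />
</httpHandlers>

<compilation>
    <buildProviders>
        <add extension=".xml" type="System.Web.Compilation.WebHandlerBuildProvider" />
    </buildProviders>
</compilation>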

7 Comments:

  1. Hi there, great article!

    Would it be possible to update the article with instructions on configuring same in IIS7?

    Great blog BTW, you should post more!

    Best Wishes,

    Paul

  2. The instructions for IIS7 are to set “System.Web.UI.SimpleHandlerFactory” as the handler, using the “Handler Mappings” feature module for the site you want to change, in IIS7 Manager.

    IIS7 Manager > Sites > SiteToChange > Handler Mappings > Add Managed Handler…

    Request Path: “*.txt”
    Type: “System.Web.UI.SimpleHandlerFactory”
    Name: “SimpleHandlerFactory”

  3. [...] By default IIS will treat .txt files as non dynamic and will simply render the text in them if they are called.  We want to tell IIS to parse them in the same way they would parse .asp files.  Note there’s a handy example of how to parse robots.txt files with ASP.net here. [...]

  4. How would I tell the robots.txt file which portions of the site to restrict? Would I use the web.config to do this? A code example would be great.

  5. Great article FTW!
    Not only helped me understand the concept of adding robots.txt for dealing with an HTTP/HTTPS site but also introduced me to the concepts of the ASP.NET HTTP Pipeline, HTTP Handlers and IIS App extension mappings.. :D

    Thanks a ton bud!

    Cheers

  6. With IIS7 it’s possible to set up the handler to only handle /robots.txt and you save the extra configuration to route everything_else.txt via the StaticFile handler.

  7. Hi guys!

    Here’s an updated tutorial for dynamic robots.txt in IIS7, with info for both classic and integrated pipeline mode.

    http://www.kleenecode.net/2012/08/16/dynamic-robots-txt-with-asp-net-in-iis7/

    Cheers,
    Carly
