Real-world Blazor websites: Robots.txt

Dennis Frühauff on November 8th, 2023

Blazor has been around for some years now as an emerging web application platform for .NET users and developers. If, at some point, you decide to use it for a real-world project, there are a few things you will have to solve in a .NET way. Today, we will look at how you can prevent search engine crawlers from inspecting all of your pages.


I have followed the development and evolution of Blazor closely over the past few years because I consider it a great and powerful alternative to the many PHP frameworks out there. As of now, this website is built with Statamic, and I am happy with it. But, to be honest, I would be much more confident implementing and changing features if it were a web application built on top of .NET. To me, Blazor continuously proves to be a valid alternative, and I have used it successfully in a few projects.


In this article, we will have a look at a real-world problem when it comes to publicly accessible sites with Blazor. Well, not a problem actually, but something we will have to do in a .NET/Blazor way instead of plain HTML: robots.txt and robot meta tags.


What are robots.txt and robot tags?

All the well-known search engines (remember Lycos? Still out there!) use a piece of software called a crawler to continuously go through the internet's web pages and scan them for new content. This means that every search performed on any search engine only acts upon an indexed snapshot of the web; it can never be entirely up to date. There will obviously be occasions when you'd like to instruct those crawlers not to inspect a specific portion of your web application. For example, who would want their WordPress admin pages to appear in search results? Likewise, you'd probably prefer your disclaimer, and consequently your address, not to show up in the top five results on any search engine.


There are basically two different ways in which we can guide crawlers on how to inspect our web pages and applications. One is with a file called robots.txt, and the other is with meta tags that are placed directly in the head section of one or more pages on our site.


robots.txt

The robots.txt is a file that is usually served at www.yourdomain.com/robots.txt and is considered valid for your whole site. With it, you can instruct crawlers to skip crawling unimportant content on your pages; it is primarily a means to prevent crawlers from overwhelming your site with traffic.
It is, however, not a way to reliably keep crawlers away from certain areas. Pages, even if disallowed in the robots.txt, can still appear in search results (they just won't have any meaningful description text).


Also, how the robots.txt is interpreted is up to the crawler, i.e., up to the search engine. There is no guarantee that any given crawler truly obeys the instructions in this file.
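
To make this a bit more concrete: a robots.txt that allows everything except a (purely hypothetical) admin area could look like this:

# Allow everything except the admin area (path is just an example)
User-agent: *
Disallow: /admin/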


Meta tags

Using meta tags such as noindex is a way to instruct crawlers to explicitly remove pages from search results. These tags can be specified on a page-by-page basis and thus allow more fine-grained control over your content.
There are also other tags, such as nofollow and noarchive, which are usually used in combination with each other.


<head>
  ...
  <meta name="robots" content="noindex, nofollow, noarchive" />
  ...
</head>

Again, it is not guaranteed that all search engine crawlers support or obey these instructions. If you really want to hide your content, use password protection.


Also, as a general heads-up: using robots.txt and meta tags in combination can be self-defeating. A noindex tag on a page that is disallowed via robots.txt is useless, because any well-behaved crawler will never fetch that page and thus never see the tag, and it might therefore still include the page in its search results.


Serving robots.txt in Blazor

The easiest way to serve a robots.txt file in your Blazor application is to use a middleware. We will have:


  1. A file robots.txt with the desired content placed in the wwwroot folder of our Blazor project.
  2. A RobotsMiddleware class serving exactly this file upon request.
  3. A line in Program.cs registering this middleware in our project.

First, we place the actual txt file in the wwwroot folder. For starters, its content can be something like this:


# An example robots.txt allowing everything
User-agent: *
Allow: /

For information on what you can actually specify in there, I encourage you to take a look at Google's documentation or even Microsoft's real-world example.


Second, we implement our middleware class like this:


public class RobotsMiddleware
{
    private const string RobotsFileName = "robots.txt";
    private static readonly string RobotsFilePath = Path.Combine(Directory.GetCurrentDirectory(), "wwwroot", RobotsFileName);
    private readonly RequestDelegate next;

    public RobotsMiddleware(RequestDelegate next)
    {
        this.next = next;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        // Only handle requests for /robots.txt; everything else falls through
        // to the rest of the pipeline below.
        if (context.Request.Path.StartsWithSegments("/" + RobotsFileName))
        {
            if (File.Exists(RobotsFilePath))
            {
                // Serve the file as plain text and short-circuit the pipeline.
                var output = await File.ReadAllTextAsync(RobotsFilePath);
                context.Response.ContentType = "text/plain";
                await context.Response.WriteAsync(output);
                return;
            }
        }

        await this.next(context);
    }
}

This is a standard middleware implementation that is hit on every request made to our web application.
If the request path matches the robots.txt file name, it reads the file, writes it to the response, and terminates the pipeline. If it does not, the next element in the pipeline is called.


Now, this implementation is not terribly efficient when your application is hit by many crawlers in a short amount of time because it is doing file operations for every matching request. As homework, you might want to have a look at IMemoryCache to improve it ;-).
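
To sketch that homework (just one possible approach, and it assumes IMemoryCache has been registered via builder.Services.AddMemoryCache() in Program.cs), the middleware could cache the file content instead of reading it from disk on every matching request:

// Requires: using Microsoft.Extensions.Caching.Memory;
public class CachedRobotsMiddleware
{
    private const string RobotsFileName = "robots.txt";
    private static readonly string RobotsFilePath = Path.Combine(Directory.GetCurrentDirectory(), "wwwroot", RobotsFileName);
    private readonly RequestDelegate next;
    private readonly IMemoryCache cache;

    public CachedRobotsMiddleware(RequestDelegate next, IMemoryCache cache)
    {
        this.next = next;
        this.cache = cache;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        if (context.Request.Path.StartsWithSegments("/" + RobotsFileName))
        {
            // Read the file once and keep the content in memory for an hour.
            var output = await this.cache.GetOrCreateAsync(RobotsFileName, async entry =>
            {
                entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(1);
                return File.Exists(RobotsFilePath)
                    ? await File.ReadAllTextAsync(RobotsFilePath)
                    : null;
            });

            if (output is not null)
            {
                context.Response.ContentType = "text/plain";
                await context.Response.WriteAsync(output);
                return;
            }
        }

        await this.next(context);
    }
}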


Lastly, we need to add a single line to our Program.cs to actually register our middleware:


app.UseMiddleware<RobotsMiddleware>();

And that is it. If you now run the application and navigate to the corresponding path, you will be served the content of the file from your project. Updating its content will also update what crawlers (and users) see, just like it would on classic web application servers.


Please note that with this approach you could also serve other static or dynamic files that are usually found on web pages, such as a sitemap or feeds. The middleware approach lets you fully customize those responses in a language you are familiar with.
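
As a rough illustration (the URLs here are made up), the same pattern could serve a hand-written sitemap by adding another branch like this to InvokeAsync, before the call to next:

if (context.Request.Path.StartsWithSegments("/sitemap.xml"))
{
    // A minimal, hard-coded sitemap; a real application would generate
    // this from its routing or content data instead.
    const string sitemap =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
        "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">" +
        "<url><loc>https://www.yourdomain.com/</loc></url>" +
        "<url><loc>https://www.yourdomain.com/disclaimer</loc></url>" +
        "</urlset>";

    context.Response.ContentType = "application/xml";
    await context.Response.WriteAsync(sitemap);
    return;
}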


Adding page-specific <head> content in Blazor

To gain more fine-grained control over the behavior of crawlers on your pages, we can also add specific meta tags to them individually.
For this, Blazor has the concept of so-called outlets, which are prebuilt for both the body and the head of your pages. If you open the _Host.cshtml of any of your recent Blazor projects, you will notice a line like this:


<component type="typeof(HeadOutlet)" render-mode="ServerPrerendered" />

This is where Blazor will render page-specific head content for you if it is defined.
To define it, you can place the Blazor component HeadContent on any of your pages:


@page "/disclaimer"

<HeadContent>
    <meta name="robots" content="noindex, nofollow, noarchive" />
</HeadContent>

After running the application and navigating to this page, you should be able to find this tag inside the head of the rendered page. If you want to find out more about the specific robot tags, head over to the Google Developers Guide.


Now, one might be tempted to put this snippet inside a Blazor component and reuse it, and maybe put other tags into different components to reuse those as well. But be careful: Blazor will only render the latest set of HeadContent and will not magically merge all of your calls together.
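
For illustration, such a reusable component (the name NoRobots.razor is just an example) would simply wrap the snippet from above, and it works fine as long as it is the only HeadContent rendered on a page:

@* Shared/NoRobots.razor (hypothetical name) *@
<HeadContent>
    <meta name="robots" content="noindex, nofollow, noarchive" />
</HeadContent>

A page would then simply include <NoRobots />, but as soon as another component on the same page renders its own HeadContent, only the last one wins.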


Conclusion

I hope this article gave you a brief idea of how to serve both a robots.txt file and page-specific meta tags in your Blazor application, to point search engine crawlers in the right direction.


As the coding part in this article is rather short, I do not provide a GitHub repository this time.



Please share on social media, stay in touch via the contact form, and subscribe to our post newsletter!
