Add Robots.txt to your Django App

Sept. 3, 2020, 12:19 p.m.

Django Production SEO · 6 min read

The robots.txt file is a standard text file that tells search engine crawlers which pages they can access, scrape, and ultimately list in their search engine results.

The file is served at the root URL of your site, https://domain.com/robots.txt, and can easily be added to your Django project. 

Be sure to use the correct casing and spelling for anything added to the robots.txt file or it will not work correctly. 

 

Create a robots.txt file

env > mysite > main > templates > main > (New File) robots.txt

User-Agent: [name of search engine crawler]
Disallow: [disallowed URL]
Disallow: [disallowed URL]

Sitemap: https://domain.com/sitemap.xml

First, create a new template called robots.txt in your app's templates folder, the same directory as all of your HTML templates. 

The basic structure of the robots.txt file specifies the user agent and a list of disallowed URL slugs, followed by the sitemap URL. 

Be sure to name it correctly, using only lowercase letters. If there is a spelling error or a change in case, the file will not be found by search engine crawlers. 

 


 

Syntax for robots.txt

User-Agents

The user agent is the name of the search engine web crawler. Web crawlers are also commonly known as bots or spiders since they crawl pages on the internet, copying page content for search engine indexing.

specifying one user agent

User-Agent: Googlebot

If you are looking to set rules for one particular crawler, list that web crawler's name as the user agent.

 

specifying more than one user agent

User-Agent: Googlebot
User-Agent: Bingbot
User-Agent: Slurp

If you want the rules to apply to multiple crawlers, list each web crawler's name as its own User-Agent entry. The example above lists the Google, Bing, and Yahoo web crawlers.

 

specifying all user agents

User-Agent: *

There are hundreds of web crawlers and bots on the internet, so if you do not have a specific user agent in mind, you can declare rules for all web crawlers at once.

Use an asterisk, also known as a wildcard, as the user agent to include all web crawlers. 

 

Disallow URLs

Disallow entries are placed after the user agent, letting crawlers know which URLs they cannot access or scrape. Be careful not to block any pages or subdirectories you want to appear in Google or any other search engine's results.

disallowing one URL

User-Agent: *

Disallow: /page

If you have the page https://domain.com/page on your site and you do not want it crawled, just add the slug /page as a disallow entry.

 

disallowing multiple URLs

User-Agent: *

Disallow: /page1
Disallow: /page2

Each URL needs its own disallow entry. There is no limit to the number of disallowed URLs in a robots.txt file.

 

disallowing a directory and all of its subdirectories/pages

User-Agent: *

Disallow: /directory/

You can also add directories on the disallow list.

For example, if you have the URL https://domain.com/directory, and it contains the page https://domain.com/directory/page1 and the subdirectory https://domain.com/directory/subdirectory, list the main directory between two forward slashes.

The directory URL, along with any pages and subdirectories within it, is now covered by the entry. 

There is also no limit to the number of disallowed directories in a robots.txt file.

 

disallowing the entire site

User-Agent: *

Disallow: /

If you don't want any of your website crawled, add a single forward slash as a disallow entry. The forward slash represents the root directory of your website.

However, if none of your pages are crawled, your site will not appear in search engine results, which will negatively impact your SEO.

 

disallowing nothing

User-Agent: *

Disallow:

Finally, if you want all of your site to be accessible to web crawlers, list nothing after the disallow entry. 

 

Disallow URLs by ending

disallowing URLs by file type

User-Agent: *

Disallow: /*.xls$

To disallow all URLs with the same ending, add a forward slash and an asterisk (wildcard), followed by the exact URL ending and a dollar sign. The example above blocks any URL ending in .xls.

 

Disallow Images

disallowing images

User-Agent: Googlebot-Image

Disallow: /images/photo.jpg

If you do not want Google Images, for example, to scrape one of your site's images, add a disallow entry that lists the URL of the image.

 

Allow URLs

allowing a page within a disallowed directory

User-Agent: *

Disallow: /directory/
Allow: /directory/page

Although less commonly used, allow entries are used to give web crawlers access to a subdirectory or page in a disallowed directory.

Using the same directory as earlier, add an allow entry that gives web crawlers access to the page https://domain.com/directory/page.

Now the crawlers can access this page while still not having access to the main directory or any of its other pages and subdirectories. 

 

Multiple groups in robots.txt

adding multiple groups within robots.txt

# Group 1 - Google
User-Agent: Googlebot

Disallow: /directory/
Disallow: /page1
Disallow: /page2



# Group 2 - Bing
User-Agent: Bingbot

Disallow: /directory/
Allow: /directory/page

A robots.txt file can also contain multiple groups, each with its own user agent and disallowed URLs.

 

Sitemap

adding a sitemap

User-Agent: *

Disallow: /directory/
Disallow: /page1
Disallow: /page2

Sitemap: https://domain.com/sitemap.xml

The last thing to add, at the bottom of the file, is the sitemap URL: a link to an XML file that web crawlers read so they can intelligently crawl the pages you deem important on your site. 

If you have not created a sitemap for your Django project, go to How to Create a Sitemap in Django.
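
The urls.py in the next section imports an ArticleSitemap class produced by that tutorial. For reference, a minimal main/sitemaps.py might look like the sketch below; it assumes an Article model with a get_absolute_url() method, so adjust it to your own models.

# main/sitemaps.py - a minimal sketch; assumes an Article model
# with a get_absolute_url() method
from django.contrib.sitemaps import Sitemap
from .models import Article

class ArticleSitemap(Sitemap):
    changefreq = "weekly"  # how often the pages are expected to change
    priority = 0.8         # relative importance of these pages

    def items(self):
        # every article to include in sitemap.xml
        return Article.objects.all()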

 


 

Add a URL path to robots.txt

env > mysite > main > urls.py

from django.urls import path
from . import views
from django.contrib.sitemaps.views import sitemap
from .sitemaps import ArticleSitemap
from django.views.generic.base import TemplateView #import TemplateView

app_name = "main"

sitemaps = {
    'blog':ArticleSitemap
}

urlpatterns = [
    path("", views.homepage, name="homepage"),
    path('sitemap.xml', sitemap, {'sitemaps': sitemaps}, name='django.contrib.sitemaps.views.sitemap'),
    path("robots.txt",TemplateView.as_view(template_name="main/robots.txt", content_type="text/plain")),  #add the robots.txt file
]

Once you have determined what needs to be added to robots.txt, save the file then go to your app's urls.py.

In urls.py, import TemplateView at the top of the file, then add the robots.txt path, creating the view inline. We have done this to limit the extra lines of code needed for such a simple view.

Be sure to specify the correct path to the template. If you created your robots.txt file in the templates > main directory, you need to add main/robots.txt as the template_name.

Note that the sitemap URL pattern is also included in the URL patterns.
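
As an alternative to the TemplateView approach, you could skip the template entirely and return the rules from a small function-based view. The sketch below is not part of the original setup, and the rules in it are placeholders to replace with your own:

# main/views.py - a sketch of serving robots.txt without a template
from django.http import HttpResponse
from django.views.decorators.http import require_GET

@require_GET  # any method other than GET receives a 405 response
def robots_txt(request):
    lines = [
        "User-Agent: *",
        "Disallow: /admin/",  # placeholder rule
        "Sitemap: https://domain.com/sitemap.xml",
    ]
    return HttpResponse("\n".join(lines), content_type="text/plain")

The URL pattern would then point at this view, for example path("robots.txt", views.robots_txt), instead of the TemplateView line shown above.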

 

View robots.txt

At this point, you can run your server and view the robots.txt file in your browser at http://127.0.0.1:8000/robots.txt. It will appear as a plain text file with the user agent, disallowed URLs, and sitemap listed. 

Robots.txt file

 

Test robots.txt

env > mysite > main > tests.py

from django.test import TestCase
from http import HTTPStatus

# Create your tests here.

class RobotsTest(TestCase):
    def test_get(self):
        response = self.client.get("/robots.txt")

        self.assertEqual(response.status_code, HTTPStatus.OK)
        self.assertEqual(response["content-type"], "text/plain")
        lines = response.content.decode().splitlines()
        self.assertEqual(lines[0], "User-Agent: *")

    def test_post(self):
        response = self.client.post("/robots.txt")

        self.assertEqual(response.status_code, HTTPStatus.METHOD_NOT_ALLOWED)

Time to test robots.txt. This code is based on the code provided on adamj.eu.

These tests are important to run before deployment to make sure that the file is written correctly.

The test_get() function checks that robots.txt is returned on a GET request, that the content type is plain text, and that the user agent declaration is the first line of the file.

The second function, test_post(), checks that a POST request to robots.txt returns a "method not allowed" response.
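
Before deploying, run the test suite from the directory containing manage.py. The app name main matches this tutorial's project structure; adjust it if yours differs.

python manage.py test main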

