TECHNOPHILE

All about Robots – All you need to know about robots.txt

In this blog post we will take a deep look at the Robots Exclusion Standard, or simply robots.txt. You probably know the definition already: robots.txt is a text file placed on a website to tell search robots which pages you would like them not to visit. So what are search robots? Let's begin from there.

What are crawlers a.k.a spiders?

Let's try to understand crawlers by thinking about how Google Search works. Ever wondered how Google comes up with the best possible results when you query for something? There are billions of websites on the internet, and each one has at least ten pages. Do you think Google can go and read all of them the moment you ask? No supercomputer could do that much work that fast.

Think of a crawler as a special piece of software whose job is to browse web pages across the internet and store their data. A crawler is sometimes called a spider; Google named its crawler Googlebot. When the crawler fetches a web page, it sees the URLs linked inside it, follows those links, crawls that content as well, and the process keeps going. The crawler software decides which links should be crawled with higher priority, what to do when a page returns a not-found error, and so on.
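To make the idea concrete, here is a minimal crawler sketch in Python. It is purely illustrative: the breadth-first queue, the regex used to pick out links and the page limit are my own simplifications, not how Googlebot actually works.

# A toy crawler: fetch a page, store it, follow its links, repeat.
import re
import urllib.request
from collections import deque

def crawl(start_url, max_pages=10):
    queue = deque([start_url])   # URLs waiting to be crawled
    visited = set()              # URLs already fetched
    store = {}                   # crawled content, keyed by URL
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue             # e.g. a not-found error: skip it and move on
        visited.add(url)
        store[url] = html        # "store the data"
        # follow every absolute link found on the page
        queue.extend(re.findall(r'href="(https?://[^"]+)"', html))
    return store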

Crawling algorithm

When the crawler finds a web page, the crawling systems render its contents, much like a browser does when we load a URL in the address bar. Google then runs specially written algorithms over the rendered page: the systems extract a set of keywords from each crawled page and map them together. You can think of it like the index at the back of a book, which points you to the book's contents through a specific set of keywords. The same idea applies here, and hence it is called the Google Index. According to Google, the index is well over 100,000,000 gigabytes in size. No wonder, right?
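As a rough analogy in code, an index of this kind can be modelled as a mapping from keywords to the pages that contain them. The sketch below is deliberately naive (real indexing involves ranking signals, language analysis and far more), and the example pages are made up.

# A toy inverted index: keyword -> set of pages containing it.
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> page text."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

pages = {
    "https://example.com/cats": "cats are small furry animals",
    "https://example.com/dogs": "dogs are loyal animals",
}
index = build_index(pages)
print(index["animals"])   # both example URLs contain this keyword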


Google's indexing algorithm is extremely complex, and it has been continuously upgraded since the day Google was born. These days, apart from links and web pages, Google uses many other signals such as location, machine and browser data, language and artificial intelligence while indexing and serving content. With the help of hundreds of such parameters, Google ranks the content as it indexes it.

Serving the Search query

When you search for something in Google, the first thing it does is analyse and understand the query: the system works out what kind of information the user is requesting. You might have noticed Google correcting your spelling mistakes many times, haven't you? According to Google, it took five years to build such a comprehensive mechanism, and it keeps improving over time.

Once the query is analysed, the system checks the language in which it was entered so that results in the same language can be prioritised. Google also has a mechanism to understand the content of the pages it indexes. For example, if you simply enter “cat” into the search field, Google will not return a website that merely repeats the keyword a hundred times; it gives priority to websites with richer details and images about cats, such as a Wikipedia page or the National Geographic website.

Ranking algorithm

Page ranking is decided by several factors. For example, if a website's URL is linked from many places on the internet, it is probably an important website, so it gets a better ranking. Google also analyses web traffic: sites it finds to have high traffic also rank higher. And naturally, links shared on highly ranked websites get a better ranking too.
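That link-counting intuition is the heart of the classic PageRank idea. The following is only a toy sketch of it, with made-up site names; the real ranking combines hundreds of signals on top of this.

# Toy PageRank: pages linked from many (and important) pages score higher.
def pagerank(links, damping=0.85, iterations=20):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                if target in new_rank:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {
    "wikipedia.org": ["natgeo.com"],
    "natgeo.com":    ["wikipedia.org"],
    "spammy.blog":   ["wikipedia.org", "natgeo.com"],
}
print(pagerank(links))   # the two well-linked sites outrank spammy.blog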

Along with this, Google considers other aspects such as previous searches and location before delivering the results. For example, if you just query “weather”, Google first works out your location from your IP address and serves you the weather information for the nearest major town. The result would be different if I ran the same query from France.

You can read about how Google Search works in more depth in the official documentation.

Coming back to robots

So now you know the basic idea of a crawler and what it does. By default, these crawlers will crawl everything they see. Is that advisable? We do allow strangers to visit our home, and we greet them decently. But what if they peek into the bathroom without permission? Of course, that is unacceptable. The same goes for crawlers. Our application will have several sensitive pages that we don't want the public to visit, for example the admin login panel of a WordPress blog. Do we really want crawlers to index such pages? It does not benefit us in any way other than inviting attackers. For example, search for inurl:wp-login.php intitle:”Log in” in Google and see the results: so many admin login panels are exposed to the web. Who knows whether their passwords are strong enough?

And the solution is..

We clearly have an exposure here, so there must be a way to control the activity of crawlers. The solution is the robots.txt file. It is a simple text file in which we write rules telling crawlers not to crawl particular portions of the site, and we place it in the root directory of the website. For example, Google's own robots file lives at https://www.google.com/robots.txt.

There you can see a huge list of rules. Remember that Googlebot is not the only crawler on the web; there are MSNBot, Yahoo! Slurp and many more. For these bots, the general rule of thumb is to check and obey the robots.txt file supplied by a website before actually crawling it. So when these crawlers visit our web application, ideally they first check whether a robots.txt file is available under the / directory, read its rules, and then crawl accordingly. Let's see how to write rules for these crawlers.
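As a side note, Python's standard library ships a parser for exactly this check, so a well-behaved crawler can ask whether a URL is permitted before fetching it. A small sketch (the printed results assume the /search rules discussed in the next section, so they may differ if Google changes its file):

# A polite crawler consults robots.txt before fetching anything.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()   # download and parse the rules

# can_fetch(user_agent, url) tells us whether the rules allow the request.
print(rp.can_fetch("*", "https://www.google.com/search"))        # False: /search is disallowed
print(rp.can_fetch("*", "https://www.google.com/search/about"))  # True: explicitly allowed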

Content of robots.txt

Let's take the first few lines from Google's robots.txt file and analyse them.
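Based on the rules analysed below, the opening lines look roughly like this (the live file may have changed since this was written):

User-agent: *
Disallow: /search
Allow: /search/about
Disallow: /sdch
Disallow: /groups
Disallow: /index.html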

The first directive is User-agent. Here we specify the name of the crawler to which our rules should apply; if I want to write rules for Google's crawler, I write Googlebot here. The * symbol indicates that all crawlers should follow the rules written under it. Then we have two more directives, namely Allow and Disallow.

    • Disallow: If you don't want crawlers to visit specific paths or pages of your website, you specify them here. You can see that Google restricts bots from crawling /search, /sdch, /groups and /index.html.
    • Allow: If you want certain pages to be explicitly allowed, you specify them here. See the entry /search/about? The line above it disallows /search, which means every subdirectory and page under /search would normally be excluded from crawling. But Google explicitly wants crawlers to crawl the about page under /search, hence this rule.

That is the very basic explanation of a robots.txt file. Let's go through a few examples of common cases so that you can understand the idea more deeply.

To exclude all robots from accessing anything under the root:

User-agent: *
Disallow: /

To allow all crawlers complete access:

User-agent: *
Disallow:

Alternatively, you can skip creating a robots.txt file, or create one with empty content.

To exclude a single robot:

User-agent: Googlebot
Disallow: /

This will keep Google's crawler away from the entire website.

To allow only Google's crawler and exclude everyone else:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Googlebot matches its own, more specific group and is allowed everywhere, while every other bot falls under the * group and is kept out. You can check this behaviour yourself with the sketch below.
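A quick way to sanity-check a rule set like this without publishing it is to feed it straight to Python's robotparser:

# Evaluate the "allow only Googlebot" rule set locally.
from urllib import robotparser

rules = [
    "User-agent: Googlebot",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("Googlebot", "/any/page"))     # True: Googlebot is allowed everywhere
print(rp.can_fetch("SomeOtherBot", "/any/page"))  # False: everyone else is excluded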

The robots file is just one mechanism for controlling web crawlers. If you understand the crawling, indexing and ranking algorithms really well, there are more advanced techniques to shape and steer the crawling activity of bots such as Googlebot; this is the territory of Search Engine Optimisation, or simply SEO. We will discuss how SEO works for a WordPress website at a later stage.

How secure is robots.txt?

The security of the contents listed in robots.txt is our main area of interest. The question is: are you really sure that robots follow the rules written in robots.txt? What is the guarantee? The plain answer is that there is none. We can write whatever rules we need inside the robots.txt file, but obeying them is entirely the robot's call. Since Google and MSN are trusted search engines, we can assume they follow the rules in robots.txt. But who knows whether Google is crawling the restricted pages as well? It could do so and simply hide the crawled content from the public. Moreover, anyone can write a crawler, and I can write one that does not obey the robots file at all.

What does that mean? It means the robots.txt file should never be used to hide sensitive information. Moreover, it's just a plain text file: anyone can open and read it, not just robots. A hacker targeting your website would obviously peek into robots.txt. Do you really want to tell them that the admin login page lives under /admin by advertising it in a robots rule?
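To drive the point home, here is a small illustration: any script can download the file and list exactly the paths the owner hoped to keep quiet. Google's file is used here only as a harmless example.

# Nothing stops anyone from reading the "hidden" paths straight out of robots.txt.
import urllib.request

rules = urllib.request.urlopen("https://www.google.com/robots.txt").read().decode("utf-8", "ignore")
disallowed = [line for line in rules.splitlines() if line.startswith("Disallow:")]
print(disallowed[:5])   # the very paths the site owner asked crawlers to skip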

So, that’s it about robots for now. Please let me know your thoughts in the comment box.

What’s next?

You have probably noticed that I used a special type of Google search query earlier in this post.

inurl:wp-login.php intitle:”Log in”

inurl and intitle are Google-specific operators that extract specific data from Google's index. This kind of search technique is called Google hacking, and we will discuss it in more detail in a coming blog post. Please keep visiting. Thanks!

