In this blog post we will take a deep look at the Robots Exclusion Standard, or simply robots.txt. You probably know the definition: robots.txt is a text file placed on a website to tell search robots which pages you would like them not to visit. So what are search robots? Let’s begin there.
What are crawlers, a.k.a. spiders?
Let’s try to understand crawlers by thinking about how Google Search works. Have you ever wondered how Google comes up with the best possible results when you query for something? There are billions of websites on the internet, and each one has at least a handful of pages. Do you think Google can go and read all of them after you ask your question? No supercomputer could do that much work that fast.
Think of a crawler as special software assigned to browse web pages around the internet and store their data. A crawler is sometimes called a spider; Google named its crawler Googlebot. When the crawler browses a web page, it sees the other URLs linked from it, follows those links, and crawls their contents too. This process keeps going. The underlying crawler software decides which links to crawl with higher priority, what to do if a not-found error appears, and so on.
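To make the follow-the-links loop concrete, here is a minimal sketch in Python. The pages and URLs are invented for illustration, and an in-memory dictionary stands in for real HTTP fetches; a real crawler would download pages, prioritise links, and handle errors.

```python
from html.parser import HTMLParser
from collections import deque

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A tiny in-memory "web" standing in for real HTTP fetches (made-up pages).
PAGES = {
    "/":            '<a href="/blog">Blog</a> <a href="/about">About</a>',
    "/blog":        '<a href="/">Home</a> <a href="/blog/post-1">Post 1</a>',
    "/about":       '<a href="/">Home</a>',
    "/blog/post-1": "No links here.",
}

def crawl(start):
    """Breadth-first crawl: fetch a page, then queue every unseen link on it."""
    seen, queue, order = set(), deque([start]), []
    while queue:
        url = queue.popleft()
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        order.append(url)
        parser = LinkExtractor()
        parser.feed(PAGES[url])
        queue.extend(parser.links)
    return order

print(crawl("/"))  # → ['/', '/blog', '/about', '/blog/post-1']
```

Starting from a single seed page, the loop discovers every reachable page exactly once, which is the essence of what Googlebot does at planetary scale.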
When the crawler finds a web page, the crawling systems render its contents, much like a browser does when we load a URL in the address bar. Google then runs specially written algorithms that extract a set of keywords from each crawled page and map them together. You can think of it like the index page of a book: the index summarises the book’s overall content through a specific set of keywords, right? The same happens here, and hence it is called the Google Index. According to Google, the index is well over 100,000,000 gigabytes in size. No wonder, right?
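The book-index analogy can be sketched as a tiny inverted index in Python. The URLs and page text below are invented for illustration; Google’s real index is vastly more sophisticated, but the keyword-to-pages mapping is the same idea.

```python
# Map each keyword to the set of pages containing it, like a book's index.
pages = {
    "wikipedia.org/cat": "the cat is a small domesticated animal",
    "natgeo.com/cats":   "cat photos and cat behaviour explained",
    "example.com/dogs":  "dogs are loyal animals",
}

index = {}
for url, text in pages.items():
    for word in set(text.split()):          # set(): count each word once per page
        index.setdefault(word, set()).add(url)

print(sorted(index["cat"]))      # → ['natgeo.com/cats', 'wikipedia.org/cat']
print(sorted(index["animals"]))  # → ['example.com/dogs']
```

A query for “cat” never scans the pages themselves; it is answered by a single lookup in the index, which is why search can be fast even over billions of documents.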
Google’s indexing algorithm is really complex, and they have been upgrading it since the day Google was born. These days, apart from just links and web pages, Google uses many other signals such as location, machine and browser data, language, and artificial intelligence while indexing and serving content. With the help of hundreds of such parameters, Google ranks the content while indexing it.
Serving the Search query
When you query Google for something, the first thing it does is analyse and understand the query: the system works out what kind of information you are requesting. You might have noticed Google correcting your spelling mistakes, haven’t you? According to Google, it took five years to build such a comprehensive mechanism, and it keeps improving over time.
Once the query is analysed, the system checks the language in which it was entered, so that it can prioritise results in the same language. Google also has a mechanism to understand the content of the pages it indexes. For example, if you just enter “cat” into the search field, Google will not return a website that merely repeats the keyword a hundred times. Instead it gives priority to websites with more details and images about cats, such as a Wikipedia page or the National Geographic website.
Page ranking is decided by several factors. For example, if a website’s URL appears in many places around the internet, it is probably an important website, so it gets a better ranking. Google also analyses web traffic: sites it finds to have high traffic rank higher. And naturally, links shared on highly ranked websites get a better ranking themselves.
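A toy version of the “linked from many places” signal can be written in a few lines. The link graph below is entirely made up; real ranking combines hundreds of signals, but counting inbound links is the intuition behind the original PageRank idea.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to (invented for illustration).
links = {
    "blog.example":  ["wikipedia.org", "news.example"],
    "news.example":  ["wikipedia.org"],
    "forum.example": ["wikipedia.org", "blog.example"],
}

# Count inbound links: the more places a URL is referenced, the higher it ranks.
inbound = Counter(target for targets in links.values() for target in targets)
ranking = [url for url, _ in inbound.most_common()]
print(ranking[0])  # → wikipedia.org, since three different sites link to it
```

Here wikipedia.org comes out on top simply because it is referenced the most, mirroring the “seen at many places, probably important” heuristic described above.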
Along with this, Google considers other aspects such as previous searches and location before delivering the results. For example, if you just query “weather”, Google analyses your IP location first and serves the weather information for the nearest major town; the result would be different if you queried it from France.
You can read about the working of Google Search in depth in the official documentation.
Coming back to robots
So now you know the basic idea of a crawler and what it does. By default, crawlers will crawl everything they see. Is that advisable? We do allow guests to visit our home, and we greet them decently. But what if they peek into your bathroom without permission? Unacceptable, of course. Same with crawlers. Our application will have several sensitive pages which we don’t want the public to visit, for example the admin login panel of a WordPress blog. Do we really want crawlers to index these pages? It will not benefit us in any way other than inviting attackers. For example, search inurl:wp-login.php intitle:”Log in” in Google and see the results: so many admin login panels exposed to the web. Who knows whether their passwords are strong enough?
And the solution is..
We have a real exposure here, so there must be a way to control the activity of crawlers. The solution is the robots.txt file. It is a simple text file in which we write rules telling crawlers not to crawl particular portions of the site, and we place this file under the root directory of the website. For example, Google’s own robots file lives at https://www.google.com/robots.txt.
There you can see a huge list of rules. Remember that Googlebot is not the only crawler on the web; there are MSNBot, Yahoo! Slurp and many more. The general rule of thumb for these bots is to fetch and obey the robots.txt file supplied by a website before actually crawling it. So when these crawlers visit our web application, ideally they first check whether a robots.txt file is available under the / directory, read its rules, and then crawl accordingly. Let’s see how to write rules for these crawlers.
Content of robots.txt
Let’s take the first few lines from Google’s robots.txt file and analyse them.
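The opening lines look roughly like this (an illustrative excerpt reconstructed from the rules discussed below; fetch https://www.google.com/robots.txt for the current live version):

```
User-agent: *
Disallow: /search
Allow: /search/about
Disallow: /sdch
Disallow: /groups
Disallow: /index.html
```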
The first parameter is User-agent, where we specify the name of the crawler our rules should apply to. If I want to write rules for Google’s crawler, I write Googlebot here; the * symbol means all crawlers should follow the rules written under it. Then we have two more parameters, Allow and Disallow.
- Disallow: If you don’t want crawlers to visit specific paths and pages of your website, you specify them here. You can see that Google wants to stop bots from crawling /search, /sdch, /groups and /index.html.
- Allow: If you want certain pages to be explicitly permitted, you specify them here. See the entry /search/about? The line above disallows /search, which means every subdirectory and page under /search is excluded from crawling. But Google explicitly wants crawlers to crawl the about page under /search, hence the rule.
That is the basic anatomy of a robots.txt file. Let’s look at a few common cases so you can understand the idea more deeply.
To exclude all robots from accessing anything under the root
User-agent: *
Disallow: /
To allow all crawlers complete access
User-agent: *
Disallow:
Alternatively, you can skip creating a robots.txt file, or create one with empty content.
To exclude a single robot
User-agent: Googlebot
Disallow: /
This will disallow Google’s crawler from the entire website.
To allow just Google crawler
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
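As a sanity check, Python’s standard library ships a parser for exactly this format, so you can test how a well-behaved crawler would interpret your rules before publishing them. A quick sketch (example.com is a placeholder domain; note that Python’s parser applies rules in file order, first match wins, so the more specific Allow line is listed first):

```python
from urllib.robotparser import RobotFileParser

# Rules in the same style as the examples above.
rules = """\
User-agent: *
Allow: /search/about
Disallow: /search
Disallow: /admin
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# can_fetch(user_agent, url) answers: may this bot crawl this URL?
print(rp.can_fetch("*", "https://example.com/search/about"))   # → True
print(rp.can_fetch("*", "https://example.com/search/images"))  # → False
print(rp.can_fetch("*", "https://example.com/index.html"))     # → True
```

A polite crawler runs exactly this check before every request; an impolite one, as we will see next, simply skips it.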
The robots file is just one mechanism for dealing with web crawlers. If you understand the crawling, indexing and ranking algorithms really well, there are advanced techniques to shape how bots such as Googlebot crawl and rank your content. That discipline is called Search Engine Optimisation, or simply SEO. We will discuss how SEO works for a WordPress website at a later stage.
How secure is robots.txt?
The security of the contents of robots.txt is our main interest here. The question is: are you really sure the robots follow the rules written in robots.txt? What is the guarantee? The answer is a plain NO. We can write whatever rules we need inside the file, but obeying them is entirely the robot’s call. Since Google and MSN are trusted search engines, we can assume they follow the rules. But who knows whether Google actually crawls the restricted pages as well? They could do it and hide the crawled contents from the public. Moreover, anyone can write a crawler, and I can write one that simply ignores the robots file.
What does that mean? It means robots.txt should never be used to hide sensitive information. It is just a plain text file: anyone can open and read it, not just robots. A hacker whose target is your website will obviously peek into robots.txt. Do you want to tell him the admin login page is under /admin by listing it in a robots rule?
So, that’s it about robots for now. Please let me know your thoughts in the comment box.
You probably noticed that I used a special type of Google search query earlier in this post.
inurl and intitle are Google-specific operators for extracting specific data from Google’s indices. This kind of search is called Google Hacking, and we will discuss it in more depth in an upcoming blog post. Stay tuned, and thanks for reading!