All about Robots – All you need to know about robots.txt
In this blog post we will take a deep look at the Robots Exclusion Standard, or simply robots.txt. You probably know the definition already: robots.txt is a text file placed on a website to tell search robots which pages you would like them not to visit. So what are search robots? Let's begin from there.
What are crawlers, a.k.a. spiders?
Let's try to understand crawlers by thinking about how Google Search works. Ever wondered how Google comes up with the best possible search results when you query for something? There are billions of websites on the internet, and each would have at least ten pages. Do you think Google can go and read all of that at the moment you ask? No supercomputer could do that much work that fast.
Think of a crawler as a special piece of software assigned to browse web pages around the internet and store their data. A crawler is sometimes called a spider; Google named its crawler Googlebot. When the crawler browses a web page, it notices the other URLs linked from it, follows those links and crawls their contents too, and the process keeps going. The underlying crawler software decides which links should be crawled with higher priority, what to do when a not-found error appears, and so on.
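To make the idea concrete, here is a minimal sketch of that crawl loop in Python. This is an illustration of the concept only, not how Googlebot actually works, and the link extraction is deliberately naive:

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: fetch a page, store it, queue every link found."""
    queue = deque([seed_url])
    seen = {seed_url}
    store = {}  # url -> raw HTML, i.e. the crawled data
    while queue and len(store) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # e.g. a not-found error: skip the page and move on
        store[url] = html
        # Follow every link seen on the page, exactly as described above
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return store

A real crawler replaces that simple queue with the priority logic mentioned above, deciding which links deserve to be fetched first.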
Crawling algorithm
When a crawler finds a web page, the crawling systems render its contents, much like a browser does when we load a URL from the address bar. Google then applies its specially written algorithms: with their help, the systems extract a certain set of keywords from each crawled website and map them together. You can think of it like the index page of a book. The index page summarises the overall content of the book through a specific set of keywords, right? The same happens here, and hence it can be called the Google Index. According to Google, the index is well over 100,000,000 gigabytes in size. No wonder, right?
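The book-index analogy corresponds to a data structure called an inverted index: for every keyword, store the set of pages containing it. A toy sketch in Python, just to fix the idea (the pages and their text are made up, and Google's real index is of course vastly more sophisticated):

from collections import defaultdict

def build_index(pages):
    """pages maps url -> page text; returns keyword -> set of urls."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

pages = {
    "nationalgeographic.com/cat": "cat facts photos habitat diet",
    "example.com/spam": "cat cat cat cat cat",
}
index = build_index(pages)
print(index["cat"])  # both urls appear: the index only records where a word
                     # occurs; ranking decides which result is actually better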

Google's indexing algorithm is genuinely complex, and we should remember that they have been upgrading it since the day Google was born. These days, apart from just links and web pages, Google uses many other parameters such as location, machine and browser data, language and artificial intelligence while indexing and serving content. With the help of hundreds of such parameters, Google ranks the content as it indexes it.
Serving the Search query
When you query Google for particular data, the first thing it does is analyse and understand the query. In this step the system works out what kind of information the user is requesting. You have probably noticed Google correcting your spelling mistakes many times, haven't you? According to Google, it took them five years to build such a comprehensive mechanism, and it keeps improving with time.
Once the query is analysed, the system checks the language in which the query was entered, so that it can give that language priority while serving the results. Google also has a mechanism for understanding the content of the pages it indexes. For example, if you just enter "cat" into the search field, Google will not return a website that merely contains the keyword a hundred times. Instead it gives priority to websites with more details and images about cats, for example a Wikipedia page or the National Geographic website.
Ranking algorithm
Page ranking is decided by several factors. For example, if a website's URL is seen in many places on the internet, it is probably an important website, so it gets a better ranking. Google also analyses the web traffic of many websites; sites found to have high traffic get a higher ranking as well. And, as common sense suggests, links shared on highly ranked websites earn a better ranking too.
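The intuition that links from highly ranked sites count for more is exactly what Google's original PageRank algorithm formalised. Here is a bare-bones power-iteration sketch (the three-page link graph is invented, and modern ranking adds hundreds of further signals, as noted earlier):

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                # A link passes on a share of the linking page's own rank,
                # so a link from an important page is worth more
                new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

links = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
print(pagerank(links))  # "b" ranks highest: it is linked from both "a" and "c"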
Along with this, Google considers other aspects such as previous searches, location and so on before delivering the search results. For example, if you just query "weather", Google analyses your IP location first and serves you the weather information for the nearest major town. The result will be different if I query the same thing from France.
You can read about the working of Google Search in depth in the official documentation.
Coming back to robots
So now you know the basic idea of a crawler and what it does. By default, these crawlers will crawl everything they see. Is it advisable to let them? We do allow guests to visit our home, and we greet them decently. But what if they peek into the bathroom without permission? Unacceptable, of course. The same goes for crawlers. Our application will have several sensitive pages which we don't want the public to visit, for example the admin login panel of a WordPress blog. Do we really want crawlers to index these pages? It is not going to benefit us in any way other than inviting attackers. For example, search for inurl:wp-login.php intitle:"Log in" in Google and see the results: so many admin login panels are exposed to the web. Who knows whether their passwords are strong enough?
And the solution is…
We clearly have a security concern now, so there must be a way to control the activity of crawlers. The solution is the robots.txt file. It is a simple text file in which we can write rules telling the crawlers not to crawl particular portions of our site, and we place this file under the root directory of the website. For example, the robots file of Google lies at https://www.google.com/robots.txt.
There you can see a huge list of rules. Remember that Googlebot is not the only crawler on the web; there are MSNBot, Yahoo! Slurp and many more. For these bots, the general rule of thumb is to check and obey the contents of the robots.txt file supplied by a website before actually starting to crawl it. So when these crawlers visit our web application, ideally they first check whether a robots.txt file is available for the website under the / directory, read its rules and then start crawling accordingly. Let's see how to write rules for these crawlers.
Content of robots.txt
Let's take the first few lines from Google's robots.txt file and analyse them.
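The opening lines look roughly like this (reconstructed here from the rules discussed below, so treat it as illustrative rather than a verbatim copy of the live file):

User-agent: *
Disallow: /search
Allow: /search/about
Disallow: /sdch
Disallow: /groups
Disallow: /index.html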
The first parameter is User-agent. Here we specify the name of the crawler to which our rules should apply; if I want to write rules for Google's crawler, I write Googlebot here. The * symbol indicates that all crawlers should follow the rules written under it. Then we have two more parameters, namely Allow and Disallow.
- Disallow: If you don't want crawlers to visit specific paths and pages of your website, you can specify them here. You can see that Google wants to restrict bots from crawling /search, /sdch, /groups and /index.html.
- Allow: If you want certain pages to be explicitly permitted, they can be specified here. See the entry /search/about? The line above it disallows /search, which means that any subdirectories and pages under /search will be excluded from crawling. But Google explicitly wants the crawlers to crawl the about page under /search, and hence the rule.
That is the very basic explanation of a robots.txt file. Let's look at examples of various cases so that you can understand the idea more deeply.
To exclude all robots from accessing anything under the root
User-agent: *
Disallow: /
To allow all crawlers complete access
User-agent: *
Disallow:
Alternatively, you can skip creating a robots.txt file, or create one with empty content.
To exclude a single robot
User-agent: Googlebot
Disallow: /
This will block Google's crawler from the entire website.
To allow just Google's crawler
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
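If you want to check how a crawler would interpret rules like these, Python ships a robots.txt parser in its standard library. A quick sketch that feeds it the rule set above (the example URLs are hypothetical):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("Googlebot", "https://example.com/page"))     # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/page"))  # False

This is also exactly the check a well-behaved crawler performs before fetching any page.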
The robots file is just one mechanism to control web crawlers. If you understand the crawling, indexing and ranking algorithms really well, there are more advanced techniques to influence the crawling and indexing activity of bots such as Googlebot; that broader practice is called Search Engine Optimisation, or simply SEO. We will discuss how SEO works for a WordPress website at a later stage.
How secure is robots.txt?
The security of the contents of a robots.txt file is the area of major interest to us. Well, the question is: are you really sure that the robots actually follow the rules written in robots.txt? What is the guarantee? The answer is a plain NO. We can write whatever rules we need inside the robots.txt file, but obeying them or not is completely the robot's call. Since Google and MSN are trusted search engines, we can assume that they follow the rules in robots.txt. But who knows whether Google actually crawls the restricted pages as well? They could do it and hide the crawled contents from the public. Moreover, anyone can write a crawler, and I can write one that does not obey the robots file.
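Nothing enforces this at the protocol level. Compliance is a voluntary check in the crawler's own code, and a rogue crawler simply never performs it:

from urllib.request import urlopen

# A rogue crawler never reads robots.txt at all; it fetches whatever it likes.
# The path below is purely illustrative.
html = urlopen("https://example.com/secret-admin-page").read()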
What does that mean? It means the robots.txt file should never be used to hide sensitive information. Moreover, it is just a plain text file; anyone can open and read it, not just the robots. A hacker whose target is your website will obviously peek into its robots.txt file. Do you want to tell him that the admin login page is under /admin by displaying it in a robots rule?
So, that’s it about robots for now. Please let me know your thoughts in the comment box.
What’s next?
You have probably noticed that I used a special type of Google search query earlier in this blog post.
inurl: and intitle: are Google-specific operators used to extract specific data from Google's indices. This kind of search technique is called Google Hacking. We will discuss more about it in the coming blog post. Stay tuned. Thanks!