B2B Marketing Insights by Ironpaper

Using a robots.txt file to disallow a directory

Written by Ironpaper | May 27, 2011

Website owners and designers can provide instructions to search robots that deny or allow access to specific parts of their website. This is helpful if you have pages or sections of a site that you do not want to show up in search results. The practice can be abused, however: some engineers build robots that specifically search for and cache the content listed in restricted areas. Any sensitive areas referenced in the robots.txt file should therefore be protected by other means, because the robots.txt file is only a set of instructions, not a security mechanism. Google and other major search robots do follow the instructions in the file, but the /robots.txt file itself is publicly available to anyone.
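
To see just how public the file is, here is a minimal sketch using Python's standard library; the domain is the illustrative example used later in this article, not a real site:

from urllib import request

# Fetch the publicly available robots.txt file and print its contents.
# The domain below is the example domain used in this article.
with request.urlopen("https://www.cool-website-example.com/robots.txt") as response:
    print(response.read().decode("utf-8"))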

It works like this: a robot wants to visit a website URL, say https://www.cool-website-example.com/mypage.html. Before it does so, it first checks https://www.cool-website-example.com/robots.txt and finds:

User-agent: *
Disallow: /

These instructions tell robots not to crawl or index any content on the website: the asterisk means the rule applies to every robot, and the lone slash means every page is off limits.
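
To see what a well-behaved crawler does with those two lines, here is a minimal sketch using Python's built-in urllib.robotparser module (the user agent and URLs are simply the examples from above):

from urllib import robotparser

# Download and parse the site's robots.txt file, then ask whether a
# given URL may be fetched on behalf of a given user agent.
parser = robotparser.RobotFileParser()
parser.set_url("https://www.cool-website-example.com/robots.txt")
parser.read()

url = "https://www.cool-website-example.com/mypage.html"
if parser.can_fetch("*", url):
    print("Allowed to crawl:", url)
else:
    print("Blocked by robots.txt:", url)

With "Disallow: /" in place, can_fetch() returns False for every page on the site, so the robot should move on without requesting mypage.html.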

Here is a more specific set of instructions:

User-agent: *
Disallow: /cgi-bin/
Disallow: /data/
Disallow: /~bob/

You need to create a separate line for each directory you want to restrict. You cannot combine /cgi-bin/ and /data/ into a single instruction (like this: "Disallow: /cgi-bin/ /data/"). That simply won't work, so follow the model above instead.
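
If you want to double-check a set of rules like the one above, one option is to feed the lines straight into Python's robots.txt parser and test a few paths; the file paths below are made up for illustration:

from urllib import robotparser

# Parse the example directives directly, without fetching a live site.
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /data/",
    "Disallow: /~bob/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Paths under /cgi-bin/ and /data/ are blocked; everything else stays crawlable.
for path in ["/cgi-bin/form.cgi", "/data/report.csv", "/about.html"]:
    print(path, "allowed" if parser.can_fetch("*", path) else "blocked")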