In this edition of the Daily Golden Nugget, I'm going to cover a security-related topic. I'll split it into a non-technical discussion at the beginning and then the technical stuff at the end. Website owners can read the non-techie stuff and forward the rest to their web programmer.
The topic is the robots.txt file and meta robots tags.
Basic Explanation ...
Every website should have a robots.txt file so that Google and Bing know which parts of your website they may read. Ideally, you create this file the day your website goes live.
Users of your site will never see the robots.txt file because you will never link to it or include it in your website navigation. It's only there for search engines to look at to see if you included any directions for them. Those directions would tell them to read some pages of your website while ignoring others.
Why would you want to tell Google to ignore pages of your site?
That's a good question, and the answer is that normally you don't, unless that information should be hidden from prying eyes.
One good example of things to hide would be a directory of free stuff that you give to people after they sign up for your mailing list. For example, you give users a link to a special PDF about the 4C's of Diamonds after they sign up for your newsletter. The PDF needs to be readily available on your website, but you don't want Google to find it and include it in search results, so you include information about that PDF directory in your robots.txt file.
Speaking of directories, you also don't want Google or Bing to attempt to pry through the back office areas of your site, so it's a good idea to include blocking instructions for those administrative folders in the robots.txt file.
Here are 4 popular directories you should include in the robots.txt file:
The robots.txt File Can't Be Used For Security
If you truly do not want something to be found on your website, then you need to password protect it. I've stumbled across PDFs in Google search results a few times that clearly were supposed to be internally confidential.
One time I found a private price list including company costs, and another time I found a private placement offering memorandum that was only supposed to be seen by private investors. In both cases Google would not have included those files in the SERPs had they been listed in the robots.txt file; even so, those types of things should require a password.
Google and Bing are not the only companies that read the robots.txt file. According to the bot list on robotstxt.org, there are more than 300 known services that will read the file, and BotsVsBrowsers.com tracks almost 20,000 different software bots that read it.
Don't Give Bad Guys The Keys To Your Site
It's a good idea to include a list in your robots.txt file of all the directories and pages you want to exclude from prying eyes, but don't specifically name each one. The better idea is to use a wild card to hide whole directories.
For example, you can organize all your free downloadable PDFs into one location like:
In there you might have:
You can easily hide both files simply by using the character string "/fr*" in the robots.txt file.
Google, Bing, and most of the other good bots will not read anything starting with "fr" on your website. Just make sure you don't also give a real page a name starting with "fr"; otherwise they will ignore that too.
By using the asterisk (*) wild card you make it a little harder for bad guys (those hackers and crackers) to find the important pages of your website. Remember that this is not real security and it won't really stop a smart hacker from finding your files, but this method will help hide your activities from your competitor down the street.
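To see what a rule like "/fr*" actually covers, here is a minimal Python sketch of wildcard matching in the style the major search engines use, where a rule matches from the start of the path and the asterisk stands for any run of characters. The function name `rule_matches`, the directory `/free-downloads/`, and the file names are hypothetical and chosen only for illustration.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a Disallow pattern covers the given URL path.

    Simplified model: the rule is matched from the start of the path,
    and each '*' in the rule matches any run of characters.
    """
    # Escape the literal pieces, then join them with ".*" where '*' was.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None


# Hypothetical paths, for illustration only.
hidden_pdf = rule_matches("/fr*", "/free-downloads/4cs-of-diamonds.pdf")
real_page = rule_matches("/fr*", "/friends-and-family-sale.html")
other_page = rule_matches("/fr*", "/about-us.html")
```

In this sketch `hidden_pdf` is True, but so is `real_page`, which is exactly the side effect to watch for: any real page whose name starts with "fr" is hidden along with the downloads. Because rules match by prefix, "/fr" without the asterisk gives the same results here.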
If you don't manage your Google Webmaster Tools account you might want to skip the rest of this Nugget.
One thing you always need to pay attention to is whether Google can read your robots.txt file. The Site Dashboard in Webmaster Tools always shows the Current Status of your site. The screen shot below shows an important indicator in the Robots.txt fetch box. When you see that, you should immediately look at the Crawl Errors page, and the Robots.txt fetch report as shown here:
Google doesn't want to index anything on your site that you want hidden. If they can't read the robots.txt file they will stop crawling your site until they can.
Why wouldn't they be able to read robots.txt?
There are times when your web server might hiccup, or the file might become corrupted, or the internet connection to your web server might go down. It's times like this when Google knows that you have a robots.txt file, but they can't read it. That's when the wheels of website progress stop.
Just to make things clear: it's okay if you do not have a robots.txt file on your website, in which case you are giving Google permission to access everything.
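The difference between "no file" and "a file that can't be read" can be sketched as a small decision function. This is a deliberately simplified model of the behavior described above, not Google's actual logic: real crawlers draw these lines in more detail, and the function name `crawl_decision` and the returned strings are made up for illustration.

```python
def crawl_decision(status_code: int) -> str:
    """Simplified model of a crawler's reaction to fetching /robots.txt.

    status_code is the HTTP status returned for the robots.txt request.
    """
    if 200 <= status_code < 300:
        # The file was delivered, so the crawler reads and obeys it.
        return "read the rules and obey them"
    if status_code == 404:
        # The file does not exist: everything may be crawled.
        return "no file: crawl everything"
    # Anything else (for example a server error) means the file may
    # exist but cannot be read, so crawling waits.
    return "cannot read the file: pause crawling"


ok = crawl_decision(200)
missing = crawl_decision(404)
server_hiccup = crawl_decision(503)
```

The point of the sketch is the last branch: a server hiccup is treated very differently from a file that simply isn't there.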
The robots.txt File
The first thing to know about the file itself is that its name needs to be in all lowercase letters. That means "robots.txt" and not "Robots.txt" or "ROBOTS.TXT". It also needs to be saved in the root of your website so it appears at the location domain.com/robots.txt.
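For programmers, a quick way to work out where a bot will look for the file is to take any page address and swap its path for "/robots.txt". A minimal Python sketch, where the helper name `robots_url` and the page address are hypothetical:

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the site of page_url."""
    # A leading slash makes the join replace the whole path (and drop
    # any query string), keeping only the scheme and host name.
    return urljoin(page_url, "/robots.txt")


# Hypothetical page address, for illustration only.
location = robots_url("https://domain.com/jewelry/rings.html?sort=price")
```

Here `location` comes out as "https://domain.com/robots.txt", no matter how deep in the site the starting page is.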
It is a normal text file that you would create in Windows Notepad or some other text only editor.
The instructions in the file are really basic, in that you allow or disallow different programs from reading your website. Although it is basic stuff, it does have a specific structure.
We start off with naming which bots, like Googlebot or Bingbot, we are giving instructions to, and then we tell them what they can and cannot do. Here's how a simple command to Googlebot would look:
You see we are using the command "User-agent:" to identify Googlebot. Every program and device connected to the internet has a signature footprint to identify itself, called a user-agent. Most of the bots used by search engines include "bot" as part of their name, while an iPhone includes "iphone" in its agent name and Galaxy S smartphones include "Samsung" in theirs.
The "Disallow:" command tells Googlebot to avoid looking for or indexing files whose path begins with "/adm". Notice I'm using the asterisk (*) wildcard. Google likes to randomly try to access files on a website even if they are not linked anywhere. They are sneaky like that. The wildcard tells them to stop sneaking around where they don't belong.
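The User-agent and Disallow structure described above can be tried out with Python's standard-library `urllib.robotparser`. This is a minimal sketch, assuming a two-line rule set for Googlebot and a hypothetical example.com site. The trailing asterisk is left off the rule because, as far as I know, that parser matches plain path prefixes and does not interpret "*" inside a path; a prefix rule of "/adm" covers the same files.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: one instruction block addressed to Googlebot only.
RULES = """\
User-agent: Googlebot
Disallow: /adm
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical URLs, for illustration only.
admin_page = "https://example.com/admin/login.html"
home_page = "https://example.com/index.html"

googlebot_admin = parser.can_fetch("Googlebot", admin_page)
googlebot_home = parser.can_fetch("Googlebot", home_page)
bingbot_admin = parser.can_fetch("Bingbot", admin_page)
```

Googlebot is refused the admin page (`googlebot_admin` is False) but allowed the home page. Bingbot is allowed everywhere, because the block names only Googlebot and says nothing to other bots.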
Speaking of wildcards, you can use it to indicate all user-agents and bots like this:
This would block all software programs and search engine crawlers from reading your website. This is good to use when you are building a new website that you don't want indexed by search engines until you are ready to go live. Remember that there are a lot of bad guys out there that will ignore this "block everyone" command, so make sure you always have those important files securely locked up behind a password.
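A minimal sketch of the "block everyone" case, assuming the standard form of an asterisk on the User-agent line and a single slash on the Disallow line, checked with Python's `urllib.robotparser`. The helper name `is_allowed`, the bot name "SomeOtherBot", and the example.com addresses are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed standard form: '*' addresses every bot, '/' covers every path.
BLOCK_EVERYONE = """\
User-agent: *
Disallow: /
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


googlebot_home = is_allowed(BLOCK_EVERYONE, "Googlebot", "https://example.com/")
other_bot_page = is_allowed(
    BLOCK_EVERYONE, "SomeOtherBot", "https://example.com/rings.html"
)
```

Both checks come back False. Keep in mind this only models bots that choose to obey the file; it does nothing against a program that ignores it.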
To allow everyone full access to your site, you can either delete the robots.txt file from your server, have a blank file, or you can put these instructions in it:
Notice how nothing is indicated on the Disallow line.
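The "allow everyone" options can be compared the same way. This sketch assumes the standard form of an asterisk on the User-agent line with an empty Disallow line, and also checks a completely blank file; the helper name `is_allowed` and the example.com address are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed standard form: every bot is addressed, nothing is disallowed.
ALLOW_EVERYONE = """\
User-agent: *
Disallow:
"""

# A blank robots.txt file contains no instructions at all.
BLANK_FILE = ""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://example.com/admin/login.html"
with_empty_disallow = is_allowed(ALLOW_EVERYONE, "Googlebot", page)
with_blank_file = is_allowed(BLANK_FILE, "Googlebot", page)
```

Both results are True: an empty Disallow line and a blank file give bots the same full access.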
Meta Robots Tag
Even though you might disallow pages or directories from being directly read and indexed by Google and Bing, there's still a chance your hidden pages will get found whenever someone links to them or shares them socially.
To block HTML pages from appearing in the search engines, you need to add the meta robots tag to the HTML code of the page. Here's what it looks like:
<meta name="robots" content="noindex">
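That HTML tag is the most reliable way to keep pages out of the search results when they otherwise need to be visible to the public. Keep in mind that a search engine has to read the page in order to see the tag, so if the same page is also disallowed in robots.txt, the noindex instruction may never be noticed.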
Just for clarity, a more common variety of the meta robots command looks like this:
<meta name="robots" content="noindex,nofollow">
In this case, you are telling the search engine to ignore the page and also ignore all the links on the page. You see, with "noindex" alone the search engines will listen to you and they will not save the information from the page, but they still like to follow its hyperlinks to deeper, perhaps even more secret areas of your site.
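To double-check that a page really carries the tag, a programmer can pull the directives out of the HTML. Here is a minimal sketch using Python's standard-library `html.parser`; the class name `RobotsMetaParser`, the helper `robots_directives`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        content = attributes.get("content") or ""
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of meta robots directives present in an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page, for illustration only.
SAMPLE_PAGE = """
<html><head>
<title>Free diamond guide</title>
<meta name="robots" content="noindex,nofollow">
</head><body>Download your guide.</body></html>
"""

found = robots_directives(SAMPLE_PAGE)
```

For the sample page, `found` contains both "noindex" and "nofollow". A page with no meta robots tag gives an empty set, which means search engines are free to index it and follow its links.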
I've only touched briefly on the robots.txt file and what you can do with it. To find out more, you should also refer to these pages:
Google Help: https://support.google.com/webmasters/answer/6062608?hl=en
Google Help: https://support.google.com/webmasters/answer/93710