A robots.txt file gives instructions to the search engine spiders that crawl the web, which matters at a tactical level of search engine optimization in Sri Lanka. Before these bots access the pages of a specific site, they check whether the site has a robots.txt file. A robots.txt file makes web crawling more straightforward and efficient because it keeps the bots from accessing certain web pages that should not be indexed by search engines.
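As a rough illustration of that check, the sketch below uses Python's standard urllib.robotparser module to do what a well-behaved bot does before requesting a page. The robots.txt content, the bot name "ExampleBot" and the example.com URLs are all made up for the example, and real crawlers may interpret rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite bot asks before it requests each URL.
allowed_public = parser.can_fetch("ExampleBot", "https://www.example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "https://www.example.com/private/report.html")

print(allowed_public)   # the home page is not covered by any Disallow rule
print(allowed_private)  # /private/ is disallowed for every user agent
```

Nothing forces a bot to run a check like this, which is why robots.txt only works for crawlers that choose to cooperate.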
Using a robots.txt file is a best practice that should not be overlooked during web development in Sri Lanka, if only because some metrics programs interpret the 404 response to a request for a missing robots.txt file as an error, which in turn skews performance reporting. But the question is: what goes in that robots.txt file?
Both robots.txt and robots meta tags rely on cooperation from the robots, so they are not guaranteed to work for every bot. If you require sturdier protection against unethical bots and other such agents, you need to use alternative methods such as password protection. A common mistake is for webmasters to unsuspectingly list sensitive URLs, such as administrative areas, in robots.txt. This can be a grievous error because robots.txt is publicly readable and is one of the first ports of call for hackers, who use it to look for possible gaps to break in through.
Robots.txt works well for:
Avoiding the indexation of duplicate content on a website, like print versions of HTML pages.
Auto-discovery of XML Sitemaps.
Keeping crawlers out of non-public parts of a website.
Preventing search engines from attempting to index scripts, utilities and other such types of code.
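The uses listed above can be sketched with a small robots.txt file, checked here with Python's standard urllib.robotparser module. Every path, the sitemap URL and the bot name are hypothetical, chosen only to mirror the list; a real file would use the site's own directories, and should not list anything sensitive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: print versions, a members-only area and a scripts
# directory are disallowed, and an XML Sitemap is advertised.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /members/
Disallow: /cgi-bin/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "https://www.example.com"
paths = ["/articles/seo.html", "/print/seo.html", "/members/home", "/cgi-bin/search.pl"]

# Map each path to whether a cooperating bot may crawl it.
crawlable = {path: parser.can_fetch("ExampleBot", BASE + path) for path in paths}

# site_maps() is available from Python 3.8; it returns None if no Sitemap line exists.
sitemaps = parser.site_maps()

for path, allowed in crawlable.items():
    print(path, "allowed" if allowed else "disallowed")
print(sitemaps)
```

Only the regular article path comes back as allowed, and the Sitemap line is picked up independently of the Disallow rules.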
To work at all, the robots.txt file needs to be in the root of the domain and must be named ‘robots.txt’ in lowercase. A robots.txt file located in a subdirectory has no effect because bots only check for this file in the root of a domain.
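To make the root-only rule concrete, here is a minimal sketch of how a crawler could work out where to look for robots.txt from any page URL. The helper name robots_txt_url and the example URLs are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the only location a bot checks: /robots.txt at the root of the host."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; discard the page's path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/2020/post.html?ref=home"))
print(robots_txt_url("https://www.example.com/shop/"))
```

Both calls resolve to the same root-level file, so a copy placed at a path such as /blog/robots.txt would never be requested.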
Common problems with robots.txt
Pages that are blocked by robots.txt disallow rules may still be in Google’s index and show up in search results, particularly if other sites link back to them. A high ranking is unlikely, since Google can’t ‘see’ the page content and has little to go on other than the anchor text of internal and inbound links and the URL (and the ODP title and description if the page is in ODP/DMOZ).
As a result, the URL of a page and possibly other publicly available information can show up in search results. On the other hand, no content from your pages will be crawled, indexed or displayed. To completely prevent a page from being added to a search engine’s index even if other sites link to it, use a ‘noindex’ robots meta tag and make sure that the page is not disallowed in robots.txt. This way, when spiders crawl the page, they will recognize the ‘noindex’ meta tag and drop the URL from the index.
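The two conditions in that advice can be checked together. The sketch below, written with Python's standard library only, treats a ‘noindex’ tag as effective only when robots.txt still lets the bot crawl the page. The function names, the sample page and the sample rules are hypothetical, and search engines apply their own, more detailed logic.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list, e.g. "noindex, follow".
        for directive in (attributes.get("content") or "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


def noindex_can_take_effect(robots_txt: str, user_agent: str, url: str, html: str) -> bool:
    """True only if the bot may crawl the page AND the page carries noindex."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url) and has_noindex(html)


# Hypothetical page and rules, for illustration only.
PAGE_URL = "https://www.example.com/landing.html"
PAGE_HTML = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
CRAWLABLE_RULES = "User-agent: *\nDisallow: /old/\n"
BLOCKING_RULES = "User-agent: *\nDisallow: /landing.html\n"

print(noindex_can_take_effect(CRAWLABLE_RULES, "ExampleBot", PAGE_URL, PAGE_HTML))
print(noindex_can_take_effect(BLOCKING_RULES, "ExampleBot", PAGE_URL, PAGE_HTML))
```

In the second case the tag is present but the bot is never allowed to fetch the page, so it cannot see the instruction.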