Most website owners are happy to see their website indexed and want Google to show it in search results as early as possible.
Search engine spiders crawl your whole website, analyse your content and store it in their database so they can return better results when users search for a particular query.
But there are situations where you do not want to showcase your website to others, for example when you have just started and are still changing the content day by day to get it right.
Until you feel it is ready, you may not want anyone to see it in that state. In that case you want to prevent search engines from indexing pages, folders or the entire site.
Built-in WordPress Feature: “Discourage search engines from indexing this site”
1. Log in to the WordPress Dashboard
2. Navigate to Settings
3. Click on Reading
4. Enable the option “Discourage search engines from indexing this site”
5. Click “Save Changes”
WordPress now asks search engines not to index your entire website.
Stop Search Engines From Crawling Using a robots.txt File
The robots.txt file tells search engine spiders what they should index and what they should not.
1. Log in to cPanel or connect over FTP to access the File Manager
2. Open the public_html folder
3. Locate the robots.txt file. If it does not exist, create a text file and save it as robots.txt
4. Enter the following code in the robots.txt file and save it.
User-agent: *
Disallow: /
The above code blocks search engines from indexing the whole WordPress website, so none of your pages will be indexed.
“User-agent: *” means this section applies to all robots.
“Disallow: /” tells the robot that it should not visit any pages on the site.
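If you prefer to script steps 3 and 4 rather than edit the file by hand, a minimal Python sketch could look like the one below. The function name write_block_all is hypothetical, and the sketch assumes you run it somewhere that has write access to the site's root folder (public_html in the steps above).

```python
from pathlib import Path

# The two-line rule set shown above: every robot, every path.
BLOCK_ALL_RULES = "User-agent: *\nDisallow: /\n"


def write_block_all(site_root):
    """Create (or overwrite) robots.txt in site_root with the block-all rules.

    site_root is assumed to be the folder the web server serves as "/".
    """
    target = Path(site_root) / "robots.txt"
    target.write_text(BLOCK_ALL_RULES, encoding="utf-8")
    return target
```

Calling write_block_all("public_html") would replace any existing robots.txt in that folder, so keep a copy of the old file if it contains rules you still need.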
The Robots Exclusion Standard determines what search engine spiders should index and what they should not. To use it, you create a text file and save it as robots.txt.
The concept behind the robots.txt protocol is the same as that of the robots meta tag, which I discuss at length in this article. There are only a few basic rules.
User-agent – The search engine spider that the rule applies to
Disallow – The URL or directory that you want to block
If you want to block a specific search engine from indexing the site, name its spider in the User-agent line and follow it with a Disallow rule, like below
User-agent: googlebot
Disallow: /
User-agent: bingbot
Disallow: /
The first group blocks Google and the second blocks Bing. Most people use User-agent: * to block all search engines.
The Disallow rule defines the URL or directory you block. So / blocks search engines from indexing the whole website, and /gallery/ blocks them from the gallery section of your WordPress site.
Here are a few examples of how to use the robots.txt file to keep search engines away from specific content.
To stop search engines from indexing your WordPress comment section, you could use something like this:
User-agent: *
Disallow: /*?comments=all
To block a newsletter confirmation page from search engines, you could use something like this:
User-agent: *
Disallow: /email-subscription-confirmed/
To hide a New Year discount page, you could use something like this:
User-agent: *
Disallow: /new-year-discount/
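When the list of blocked pages grows, it can be easier to generate the file than to type each group by hand. The following Python sketch is only an illustration: build_robots_txt is a hypothetical helper, and it assumes you only need User-agent and Disallow lines.

```python
def build_robots_txt(rules):
    """Render a {user_agent: [blocked paths]} mapping as robots.txt text."""
    lines = []
    for user_agent, blocked_paths in rules.items():
        if lines:
            lines.append("")  # blank line between groups for readability
        lines.append(f"User-agent: {user_agent}")
        for path in blocked_paths:
            lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The two example pages from this article, combined into one group.
    print(build_robots_txt({
        "*": ["/email-subscription-confirmed/", "/new-year-discount/"],
    }))
```

Putting both Disallow lines under one User-agent: * group has the same effect as the two separate examples above.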
Paths in the robots.txt file are case-sensitive, so write them exactly as they appear on your site. If the file is named WordPress-Book.pdf, a rule that mentions /wordpress-book.pdf will not match it.
Another rule that is available to you is Allow. This rule lets you specify paths that a user agent is permitted to visit. The code below blocks all search engines but lets Google Images index the content inside your WordPress uploads folder. A crawler follows only the group that names it most specifically, so the Googlebot-Image group needs its own Disallow line to stay blocked everywhere else.
User-agent: *
Disallow: /
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
Disallow: /
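You can check a rule set like this before uploading it. The sketch below uses Python's standard urllib.robotparser module. The example.com URLs and the can_crawl helper are made up for illustration, and the rule set gives the Googlebot-Image group its own Disallow line, because a crawler that matches a named group ignores the * group. Note that this parser applies the first matching rule inside a group, which may differ from how a particular search engine resolves conflicts, so treat the result as a rough check.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: block everyone, but let Google Images into the uploads folder.
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot-Image
Allow: /wp-content/uploads/
Disallow: /
"""


def can_crawl(user_agent, url, robots_txt=RULES):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    image = "https://example.com/wp-content/uploads/photo.jpg"
    print(can_crawl("Googlebot-Image", image))
    print(can_crawl("bingbot", image))
```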
The robots.txt file also supports pattern matching, which is useful for blocking files with similar names, such as file names that start or end with a particular string. If you only need to block a few pages, you do not need to learn pattern matching.
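To get a feel for how such patterns behave, here is a small Python sketch. It assumes the two wildcards that Google documents for robots.txt: * matches any sequence of characters and a trailing $ anchors the rule to the end of the URL. The helper names are hypothetical, and real crawlers may differ in details.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression.

    Assumes '*' means "any characters" and a trailing '$' means "end of URL".
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where a "*" stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(pattern, path):
    """Return True if the rule pattern covers the given URL path."""
    # re.match anchors at the start, so an unanchored rule is a prefix match.
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    print(rule_matches("/*?comments=all", "/my-post/?comments=all"))
    print(rule_matches("/*.pdf$", "/files/WordPress-Book.pdf"))
```

Because matching is case-sensitive, a rule written for /gallery/ would not cover a path that starts with /Gallery/.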
If you want to learn more about robots.txt, you can get help from the documentation published by Google and Bing. You can also view any website’s robots.txt file to learn how it is used by adding /robots.txt to the site’s address. Some websites do not use robots.txt, so you may get a 404 error.
Here are the robots.txt files of some well-known websites, which will help you understand how to take control over search engines.
- Walmart’s Robots.txt File
- Instagram’s Robots.txt File
- Alibaba’s Robots.txt File
- Wall Street Journal’s Robots.txt File
The robots.txt file is one way of preventing search engines from indexing your whole website, posts or pages. You can verify the rules in your robots.txt file by opening https://yourdomainname/robots.txt in a browser.
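The same check can be done from a script. The sketch below uses only Python's standard library. The function names are hypothetical, and it assumes the site serves robots.txt at the root of its domain. A 404 response is treated as "this site has no robots.txt".

```python
from urllib.error import HTTPError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(page_url):
    """Return the robots.txt address for the site that page_url belongs to."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(page_url, timeout=10):
    """Download robots.txt as text, or return None if the site has none (404)."""
    try:
        with urlopen(robots_url(page_url), timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as error:
        if error.code == 404:
            return None
        raise  # other HTTP errors are left for the caller to handle
```

For example, fetch_robots("https://yourdomainname/any-page/") would request https://yourdomainname/robots.txt, using the placeholder domain from the paragraph above.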
About the Robots Meta Tag
Robots meta tags (meta directives) are pieces of code that instruct crawlers how to index a web page’s content. Whereas the robots.txt file gives suggestions on how to crawl a website’s pages, the robots meta tag gives more precise instructions on how to crawl and index individual web pages.
Google recommends blocking URLs with the robots meta tag.
There are two types of robots meta directives. One is part of the HTML page (the meta robots tag) and the other is sent by the web server as an HTTP header (X-Robots-Tag). Both carry the same crawling and indexing instructions: “noindex” and “nofollow”, for example, can be used with both the meta robots tag and the X-Robots-Tag. The major difference is how they are communicated to crawlers.
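To illustrate the HTTP header variant, here is a minimal sketch of a Python WSGI application that sends X-Robots-Tag with every response. WordPress itself runs on PHP, so this is not how you would configure a WordPress site. The function name noindex_app and the page body are made up, and the sketch only shows where the header sits in a response.

```python
def noindex_app(environ, start_response):
    """Tiny WSGI app that marks its response as noindex, nofollow."""
    body = b"<html><body>Work in progress</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Same directives a robots meta tag would carry, sent as a header.
        ("X-Robots-Tag", "noindex, nofollow"),
    ]
    start_response("200 OK", headers)
    return [body]


if __name__ == "__main__":
    # Serve on a local port for a quick manual check with a browser or curl.
    from wsgiref.simple_server import make_server

    with make_server("127.0.0.1", 8000, noindex_app) as server:
        server.serve_forever()
```

The header form is mainly useful for files that have no HTML head section, such as PDFs or images.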
The HTML meta tag has the following general form:
<meta name="value" content="value">
The robots meta tag should be placed within the <head> section, between <head> and </head>, of your WordPress theme header. There are different values available for both name and content. The values that Google says block access to a page are robots and noindex:
<meta name="robots" content="noindex">
The value robots refers to all search engines, and the value noindex stops search engines from showing the web page in their results.
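If you want to confirm which directives a page actually carries, you can read them back out of its HTML. The sketch below uses Python's standard html.parser module. The class and function names are hypothetical, and it assumes the directives are comma-separated inside the content attribute.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content values of <meta> tags, keyed by their name."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content") or ""
        if name:
            values = [v.strip().lower() for v in content.split(",") if v.strip()]
            self.directives.setdefault(name, []).extend(values)


def robots_directives(html, crawler="robots"):
    """Return the directives addressed to crawler (default: all robots)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives.get(crawler.lower(), [])


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    print(robots_directives(page))
```

Passing a spider name as the second argument, for example robots_directives(page, "googlebot"), returns only the directives addressed to that spider.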
If you want to block a web page from a specific search engine, use that search engine’s spider name as the name value. Some common search engine spiders are:
1. googlebot – Google
2. googlebot-news – Google News
3. googlebot-image – Google Images
4. bingbot – Bing
5. teoma – Ask
Two major spiders, MSNBot and Slurp, are not listed above. Windows Live Search and MSN Search both used MSNBot to build their index. These search engines were rebranded as Bing in 2009, and MSNBot was replaced by Bingbot in October 2010. MSNBot-Media is Bing’s crawler for images and video, and MSNBot still handles some of the crawling for Bing. Slurp used to crawl pages for Yahoo!. Later, in 2009, Yahoo! stopped this and began showing results with the help of Bing.
If you want to block your web page from Google Images, you could use something like this:
<meta name="googlebot-image" content="noindex">
To block multiple search engines, add one meta tag for each spider, which is the form Google documents for addressing several crawlers:
<meta name="googlebot-news" content="noindex">
<meta name="googlebot-image" content="noindex">
Meta tags tell bots how to crawl and index the information they find on a web page. When bots discover these tags, the content value serves as a strong instruction for how the crawler indexes the page. Many values can be used in the content attribute. The values are not case-sensitive, but some search engines may interpret them slightly differently.
Values That Control Crawling and Indexing
1. index: Instructs search engines to index the web page. This is the default, so you don’t have to add it to your pages.
2. noindex: Instructs search engines not to index the web page.
3. follow: Instructs search engines to follow the links on the web page, whether or not they can index it.
4. nofollow: Instructs crawlers not to follow any links on the page at all.
5. noimageindex: Tells crawlers not to index any images on the page. Of course, if the images are linked to directly from elsewhere, search engines can still index them.
6. none: A shortcut for noindex and nofollow. It tells search engines not to do anything with this page.
7. noarchive: Tells search engines not to show a cached copy of this page.
8. nocache: Same as noarchive, but only used by MSN/Live.
9. nosnippet: Tells search engines not to show a snippet (i.e. the meta description) of this page in the search results and prevents them from caching the web page.
10. noodp: Prevents search engines from using the DMOZ description of this page as the snippet in the search results. However, DMOZ was retired in early 2017.
11. noydir: Prevents Yahoo! from using the Yahoo! Directory description of this page as the snippet in the search results. No other search engine uses the Yahoo! Directory, so none of them support this tag.
12. unavailable_after: [RFC-850 date/time]: Removes the page from search results after the date and time specified in RFC 850 format.
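The RFC 850 date layout (for example "Sunday, 06-Nov-94 08:49:37 GMT") is easy to get wrong by hand, so here is a small Python sketch that formats it. The helper names are hypothetical, and the tag layout assumes the content="unavailable_after: date" form. Check your target search engine's documentation for the date formats it accepts.

```python
from datetime import datetime, timezone

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def rfc850(moment):
    """Format a timezone-aware datetime as an RFC 850 date in GMT."""
    utc = moment.astimezone(timezone.utc)
    # Names come from fixed lists so the output does not depend on the locale.
    return "{}, {:02d}-{}-{:02d} {:02d}:{:02d}:{:02d} GMT".format(
        DAYS[utc.weekday()], utc.day, MONTHS[utc.month - 1], utc.year % 100,
        utc.hour, utc.minute, utc.second,
    )


def unavailable_after_tag(moment):
    """Build a robots meta tag that expires the page at the given moment."""
    return '<meta name="robots" content="unavailable_after: {}">'.format(
        rfc850(moment)
    )


if __name__ == "__main__":
    print(unavailable_after_tag(datetime(2025, 12, 31, 23, 59, tzinfo=timezone.utc)))
```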
The table below shows which search engines support which values:
Robots value | Google | Yahoo! | Bing | Ask
index | Yes | Yes | Yes | Yes
noindex | Yes | Yes | Yes | Yes
none | Yes | – | – | Yes
follow | Yes | – | – | Yes
nofollow | Yes | Yes | Yes | Yes
noarchive | Yes | Yes | Yes | Yes
nosnippet | Yes | No | No | No
noodp | Yes | Yes | Yes | No
noydir | No | Yes | No | No