What is robots.txt? | Advantages and disadvantages of using robots.txt

What is robots.txt?

Robots.txt is a plain-text file, not HTML or any other markup format. It gives webmasters the flexibility to allow or disallow search engine (SE) bots from crawling particular areas of a website.

Create robots.txt file

When using a robots.txt file, you need to be careful: if it is configured incorrectly, all your SEO results can be lost.

If your project is small and you are not sure what you are doing, it is best not to use a robots.txt file. Just leave things as they are. Quang’s blog doesn’t use a robots.txt file either.

However, for large projects, especially in e-commerce, using a robots.txt file is almost mandatory. It helps Google index your website more efficiently, keeps backlink-analysis bots from crawling your site, and limits the duplicate content that is very common in e-commerce SEO.

Advantages of using robots.txt

Block bots during the site setup process

During website design (interface design, plugin installation, building the site structure), things are still very messy. You should block Google’s bots so that they don’t index incomplete content that you don’t want to appear in search results.
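As a rough illustration of that idea, the sketch below uses Python's standard urllib.robotparser module to check what a "block everything" robots.txt does. The example.com URL and the helper name is_allowed are assumptions made up for this example, not part of any real setup.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt for a site still under construction:
# every bot ("*") is asked to stay away from every path ("/").
STAGING_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain used only for illustration.
print(is_allowed(STAGING_ROBOTS_TXT, "Googlebot", "https://example.com/draft-page"))  # False
```

Remember to remove the blanket Disallow once the site goes live, otherwise well-behaved search engines will keep staying away.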

Insert Sitemap

A sitemap is like a map that helps Google discover your site. If the website has a very large number of pages to index and no sitemap, Google’s bots may not have enough resources (crawl budget) to crawl the whole site. As a result, Google may fail to index some important content.

A website can have more than one sitemap (e.g. article sitemap, image sitemap, news sitemap …). You should use a tool to generate the sitemaps for your website, and then declare the sitemap URLs in the robots.txt file.
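Here is a minimal sketch of what such declarations might look like and how a crawler could read them. The three sitemap URLs are invented for illustration, and site_maps() assumes Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt declaring several sitemaps.
# An empty "Disallow:" means nothing is blocked.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap-articles.xml
Sitemap: https://example.com/sitemap-images.xml
Sitemap: https://example.com/sitemap-news.xml
"""


def declared_sitemaps(robots_txt: str) -> list:
    """Return the sitemap URLs listed in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


for sitemap_url in declared_sitemaps(ROBOTS_TXT):
    print(sitemap_url)
```

Sitemap lines are independent of any User-agent group, so they can sit anywhere in the file.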

Block backlink-checking bots

Currently in Vietnam, the three most popular backlink-checking tools are Ahrefs, Majestic and Moz. Their bots are named AhrefsBot (Ahrefs), mj12bot (Majestic) and rogerbot (Moz), respectively.

To prevent competitors from using these tools to analyze your backlinks, you can block their bots in the robots.txt file.
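One possible way to write that rule is sketched below: the three bot names mentioned above share a single Disallow group, and everyone else is left alone. The example.com URL and the helper name is_allowed are placeholders for this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Block the backlink-analysis bots
User-agent: AhrefsBot
User-agent: mj12bot
User-agent: rogerbot
Disallow: /

# Every other bot may crawl everything
User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Check the rules above for one user-agent and one URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("AhrefsBot", "https://example.com/"))  # False
print(is_allowed("Googlebot", "https://example.com/"))  # True
```

This only works for bots that choose to respect robots.txt; it is a request, not an access control.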

Block harmful bots

In addition to backlink-checking bots, there are other types of harmful bots.

For example, Amazon, the giant of the world e-commerce industry, blocks a bot called EtaoSpider.

Block sensitive folders

A website’s source tree usually contains sensitive directories and files, such as wp-admin, wp-includes, phpinfo.php, cgi-bin, memcache….

You should not let search engine bots index this content, because it would then be publicly searchable on the internet. Hackers could use that information to attack your system.
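A hedged sketch of such rules follows, using a few of the paths listed above. The exact paths depend on your own installation, and example.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; adjust the paths to match your own site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /cgi-bin/
Disallow: /phpinfo.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    "https://example.com/wp-admin/options.php",
    "https://example.com/phpinfo.php",
    "https://example.com/blog/hello-world",
]
for url in checks:
    # can_fetch() uses simple prefix matching on the URL path.
    print(url, parser.can_fetch("Googlebot", url))
```

Only the last URL is reported as crawlable. Keep in mind that robots.txt itself is public, so these rules keep well-behaved bots out but do not replace real access protection on the server.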

Block bots in e-commerce

In e-commerce, there are some unique features for users such as:

Sign up for an account
Log in to your account
Cart
Transaction history
User interest (wishlist)
Internal search bar
Price comparison
Sort options (price high to low, bestsellers, alphabetical order ….)
Filter attributes (manufacturer, color, price, capacity …)
Discontinued products (served with 301 redirects)

These functions are indispensable for users, but they often create duplicate content and offer no relevant content to support keyword SEO. Therefore, you can block crawling of these paths in the robots.txt file.

In the robots.txt file, you can use * (matches any string of characters) and $ (marks the end of the URL, which is handy for file extensions such as .doc, .pdf, .ppt, .swf …) to block the corresponding files.
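To make the two symbols more concrete, here is a small sketch that translates such patterns into regular expressions. Python's built-in urllib.robotparser only does plain prefix matching, so this is a hand-written approximation of how wildcard-aware crawlers commonly interpret * and $, not an official implementation. The patterns and paths are invented examples, and Allow rules and rule precedence are deliberately left out.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk, then join the chunks with '.*' for every '*'.
    body = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_patterns) -> bool:
    """Return True if any Disallow pattern matches the URL path."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


# Hypothetical Disallow patterns for a shop.
DISALLOW = ["/*.pdf$", "/*?sort=", "/search"]

print(is_blocked("/files/manual.pdf", DISALLOW))             # True
print(is_blocked("/files/manual.pdf?download=1", DISALLOW))  # False, "$" requires the URL to end in .pdf
print(is_blocked("/shoes?sort=price_desc", DISALLOW))        # True
print(is_blocked("/shoes", DISALLOW))                        # False
```

Without the $, a pattern behaves like a prefix, which is why "/search" also covers "/search?q=shoes".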

Disadvantages of using robots.txt

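When using a robots.txt file, be careful: if it is configured incorrectly, your SEO results can be lost. A single stray "Disallow: /" line, for example, asks search engines to stay away from the entire site.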

How it works

The robots.txt file works by identifying a user-agent and giving a directive to that user-agent.

Parameters in the robots.txt file

User-agent: declares the name of the search engine bot you want to control, for example: Googlebot, Yahoo! Slurp

Disallow: specifies the area that you do not want search engines to access.

Crawl-delay: sets how long (in seconds) a bot should wait between one request and the next. This helps keep search engines from overloading your server. Not every crawler honors it; Googlebot, for example, ignores this directive.

#: placed at the start of a line to turn it into a comment.
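The directives described in this section can be combined as in the sketch below. ExampleBot is a made-up bot name, and the paths and the example.com URL are placeholders; the parsing is done with Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Rules for one specific bot (lines starting with # are comments)
User-agent: ExampleBot
Crawl-delay: 10
Disallow: /private/

# Rules for every other bot
User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# ExampleBot gets its own group: a 10 second delay and no access to /private/.
print(parser.crawl_delay("ExampleBot"))                                      # 10
print(parser.can_fetch("ExampleBot", "https://example.com/private/report"))  # False

# Any other bot falls back to the "*" group, which only blocks /tmp/.
print(parser.crawl_delay("Googlebot"))                                       # None
print(parser.can_fetch("Googlebot", "https://example.com/private/report"))   # True
```

Each group starts with one or more User-agent lines, and a bot follows the group that names it, falling back to * only when no group does.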

Notes when using robots.txt

To be found by bots, the robots.txt file must be placed in the top-level directory of the site.
The robots.txt filename is case sensitive, so the file must be named robots.txt (not Robots.txt, robots.TXT, …).
Do not put /wp-content/themes/ or /wp-content/plugins/ in a Disallow rule. That would prevent search engines from correctly rendering the look of your blog or website.
Some user-agents may choose to ignore your robots.txt file. This is quite common with nefarious user-agents such as:
Malware robots (bots carrying malicious code)
Email address scrapers (bots that harvest email addresses on their own)
Robots.txt files are publicly available on the web. You only need to add /robots.txt to the end of any root domain to see that site’s directives.

This means that anyone can see which pages you do or don’t want crawled. So do not use this file to hide users’ personal information.

Each subdomain on a root domain uses its own robots.txt file. This means that blog.example.com and example.com should each have their own robots.txt file (blog.example.com/robots.txt and example.com/robots.txt). Finally, it is considered best practice to indicate the location of any sitemaps associated with the domain at the end of the robots.txt file.
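Since the file always lives at the top level of each host, its address can be worked out from any page URL. The helper below is a small sketch of that rule; the function name robots_txt_url and the page URLs are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the host of page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://blog.example.com/2020/05/some-post?id=3"))
# https://blog.example.com/robots.txt
print(robots_txt_url("https://example.com/shop/cart"))
# https://example.com/robots.txt
```

The two results differ because the subdomain and the root domain are treated as separate hosts, each with its own file.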