A web crawler (also known as a web spider or web robot) is a program or automated script that browses the Internet methodically. Search engines are the main users of crawlers, which download pages so they can be indexed. Crawlers are also used for automated maintenance tasks on a website, such as checking links or validating HTML code.

Well-behaved robots identify themselves, typically in their user-agent string, and often supply a web or email address you can contact. Good robots also read robots.txt to see what your site policy is. Even when a robot does not announce itself, the pattern of pages being read and the IP addresses being used soon sorts the men from the robots.

In addition to the search engine robots, other "user agents" will visit your site, e.g. to validate links to your site from other people's pages. Often these send only a HEAD request, which returns just the headers for a file, rather than doing a GET on the whole file.
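To make the robots.txt check concrete, here is a minimal sketch of how a well-behaved robot might decide whether it is allowed to fetch a page. It uses Python's standard urllib.robotparser module. The robots.txt rules, the robot names (ExampleBot, BadBot) and the example.com URLs are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the site policy lets this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    policy = build_parser(ROBOTS_TXT)
    checks = [
        ("ExampleBot", "https://example.com/index.html"),
        ("ExampleBot", "https://example.com/private/page.html"),
        ("BadBot", "https://example.com/index.html"),
    ]
    for agent, url in checks:
        print(agent, url, may_fetch(policy, agent, url))
```

Keep in mind that robots.txt is only a request. A good robot runs a check like this before every fetch, but nothing forces a bad one to do the same.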
However, not all crawlers are good crawlers, so it pays to know the good from the bad. Just as the major search engines use robots for ranking purposes, some agents come from software that lets users download a mirrored copy of your site onto their hard drive, sometimes for unethical purposes such as plagiarism or harvesting e-mail addresses (usually for spam). If you have a large or image-heavy site, this practice of website stripping can also have a serious impact on your monthly bandwidth usage.
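One rough way to see who is eating your bandwidth is to total up requests and bytes per visitor from your server's access log. The sketch below assumes the log is in the common "combined" format used by many web servers. The sample log lines, IP addresses and agent names (SiteCopier, LinkChecker) are invented for illustration, so adjust the pattern to match your own logs.

```python
import re
from collections import defaultdict

# Assumed "combined" log format: ip, identity, user, [time], "request",
# status, size, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) \S+ [^"]*" '
    r'\d{3} (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Made-up sample entries, not real traffic.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Oct/2023:13:55:37 +0000] "GET /img/a.jpg HTTP/1.1" 200 204800 "-" "SiteCopier/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:55:38 +0000] "GET /img/b.jpg HTTP/1.1" 200 307200 "-" "SiteCopier/1.0"',
    '192.0.2.9 - - [10/Oct/2023:13:55:39 +0000] "HEAD /about.html HTTP/1.1" 200 - "-" "LinkChecker/2.0"',
]


def tally_clients(log_lines):
    """Count requests, bytes sent and HEAD requests per (ip, user agent)."""
    totals = defaultdict(lambda: {"requests": 0, "bytes": 0, "head": 0})
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        entry = totals[(match["ip"], match["agent"])]
        entry["requests"] += 1
        if match["size"] != "-":  # "-" means no body was sent
            entry["bytes"] += int(match["size"])
        if match["method"] == "HEAD":
            entry["head"] += 1
    return dict(totals)


if __name__ == "__main__":
    totals = tally_clients(SAMPLE_LOG)
    # List the heaviest bandwidth users first.
    for client, stats in sorted(totals.items(), key=lambda kv: -kv[1]["bytes"]):
        print(client, stats)
```

A visitor that requests many files in quick succession and accounts for most of the bytes is a likely site stripper, while one that sends only HEAD requests is more likely a link checker. Treat the numbers as a hint rather than proof, since user-agent strings can be faked.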
Next week we’ll talk about how to protect your site from robots and crawlers. Be sure you consult your SEO professional for additional ways to keep your content safe and original.