I use XenForo 1.2.2 and I've noticed that many XenForo forums disallow /posts/ in their robots.txt files. Why not let robots crawl /posts/?
User-agent: *
Disallow: /posts/
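For anyone who wants to see what that rule actually blocks, here is a minimal sketch using Python's standard `urllib.robotparser`. The host `example.com`, the bot name `ExampleBot`, the sample paths, and the helper `is_allowed` are placeholders made up for illustration, not taken from any real forum.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule set quoted in the post above.
RULES = """\
User-agent: *
Disallow: /posts/
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parsed in memory, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com and ExampleBot are hypothetical placeholders.
    print(is_allowed(RULES, "ExampleBot", "http://example.com/posts/123/"))
    print(is_allowed(RULES, "ExampleBot", "http://example.com/threads/welcome.1/"))
```

Only URLs whose path starts with /posts/ are affected; thread URLs stay crawlable for bots that obey the file.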
There must be a good reason. I also found that many forums disallow all pages in their robots.txt file.
C&P of here... Which is a pretty good one
Code:
User-agent: *
Disallow: /misc/
Disallow: /help/
Disallow: /search/
Disallow: /register/
Disallow: /login/
Disallow: /online/
Disallow: /lost-password/
Disallow: /account/
Disallow: /admin.php
Disallow: /events/birthdays/
Disallow: /events/monthly
Disallow: /events/weekly
Disallow: /goto/
Disallow: /help/
Disallow: /login/
Disallow: /media/keyword/
Disallow: /media/user/
Disallow: /media/service/
Disallow: /media/submit/
Disallow: /misc/style?*
Disallow: /misc/quick-navigation-menu?*
Disallow: /online/
Disallow: /forums/7/
Disallow: /forums/20/
Disallow: /forums/70/
Disallow: /forums/49/
Disallow: /forums/155/
Disallow: /forums/156/
Disallow: /forums/184/
Disallow: /forums/200/
Disallow: /forums/188/
Disallow: /forums/186/
Disallow: /forums/187/
Disallow: /forums/189/
Disallow: /forums/191/
Allow: /
Sitemap: http://admin-talk.com/sitemap/sitemap.xml.gz
Sitemap: http://dir.admin-talk.com/sitemap.xml
Pretty much similar to my file:
User-agent: *
Disallow: /find-new/
Disallow: /account/
Disallow: /goto/
Disallow: /login/
Disallow: /admin.php
Disallow: /search/
Disallow: /search.php
Disallow: /help/
Disallow: /members/
Disallow: /misc/
Disallow: /online/
Allow: /
User-agent: ia_archiver
Allow: /
User-agent: BoardTracker
Disallow: /
User-agent: BoardReader
Disallow: /
User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /
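To illustrate how the per-bot groups in a file like that play out, here is a rough sketch with Python's standard `urllib.robotparser`. It uses a shortened copy of the rules above; `example.com`, `SomeOtherBot`, the sample paths, and the helper names `build_parser` and `allowed` are assumptions for the example. Real crawlers may resolve edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# A shortened version of the file in the post above (only some of its rules).
RULES = """\
User-agent: *
Disallow: /account/
Disallow: /search/
Allow: /

User-agent: ia_archiver
Allow: /

User-agent: BoardReader
Disallow: /

User-agent: Baiduspider
User-agent: Baiduspider-image
Disallow: /
"""


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt text in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


PARSER = build_parser(RULES)


def allowed(agent: str, path: str) -> bool:
    """Check a path on a placeholder host against the parsed rules."""
    return PARSER.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    for agent in ("SomeOtherBot", "ia_archiver", "BoardReader", "Baiduspider"):
        print(agent, allowed(agent, "/account/"), allowed(agent, "/threads/x.1/"))
```

A bot that has its own group follows that group and ignores the `*` group, which is why ia_archiver gets through to /account/ here while unnamed bots do not.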
Well, there is always the option of tweaking robots.txt to suit particular needs. The problem could be that the robots file gets modified on the fly at every forum settings update. I remember having a long, long disallow list that included search engines, crawlers and bots, but sometimes I wonder whether all of them abide by the rules and refrain from crawling what is on the site. I read some time ago that the best way to keep bots away from directories or files we don't want crawled is to make no reference to them in robots.txt, so that bots never learn they exist... unless they are linked from other pages that are open to crawling.
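The point about not advertising paths can be made concrete: robots.txt is a public file, so anyone can read the Disallow list. A small sketch, assuming a hypothetical helper `disallowed_paths` and a made-up sample file, that lists what such a file gives away:

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow path found in robots.txt text."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Made-up sample content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /admin.php
Disallow: /account/  # private area
Allow: /
"""

if __name__ == "__main__":
    print(disallowed_paths(SAMPLE))
```

Anything that truly must stay private needs access control on the server; the file only asks well-behaved bots to stay away.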
Read this: http://xenforo.com/community/threads/posts-urls-being-indexed-by-google.48729/ I don't use robots.txt at all with XenForo. I let it index what it wants and let the built-in SEO for XenForo do what it is supposed to do.
Well, the idea of a robots.txt file is to stop the bots that obey it from wasting time on files that don't need indexing, which speeds things up for them. I wouldn't go daft adding loads of entries, but some files and folders are obvious candidates for blocking: cache folders, and files like install and config.php, etc. Blocking those obvious files and folders from being indexed can help a little. You can also use it to limit how much bots crawl, which can save you bandwidth.
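The post doesn't say how the crawl limiting is done; one common way is the nonstandard `Crawl-delay` directive, which some crawlers honor and others (Google among them) ignore. A minimal sketch under that assumption, using Python's standard `urllib.robotparser`; the paths, the 10-second value, and the bot name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the folder names and the delay are illustrative only.
RULES = """\
User-agent: *
Crawl-delay: 10
Disallow: /cache/
Disallow: /install/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parsed in memory, no network access

# Seconds a polite bot is asked to wait between requests (None if not set).
delay = parser.crawl_delay("ExampleBot")
cache_ok = parser.can_fetch("ExampleBot", "http://example.com/cache/style.css")

if __name__ == "__main__":
    print(delay, cache_ok)
```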
Yes. Though today I did spot an instance where /posts/ URLs were being indexed. Will add it to that thread.
I have noticed that more and more websites and blogs are using the robots.txt file recently. In the last month I have lost count of how many times I have searched for something in Google and found a site listed with a description saying that robots.txt had blocked the search engine from crawling the site. I didn't think they would index those sites if they were restricted by robots.txt.
It's not a very wise thing to do. Of course it's none of my business, but I wouldn't do that. Some of your users may post sensitive data, and by not using robots.txt (that is, by letting robots crawl your whole site) that data gets revealed. It's a strange policy of yours.
Someone said that the golden rule for deciding whether a file like robots.txt should be used is to type the address of a huge company followed by robots.txt, and here you go:
http://www.google.com/robots.txt
http://www.nyc.com/robots.txt
http://www.microsoft.com/robots.txt
http://www.adobe.com/robots.txt
And so on...
I don't have any competitors, and when I did, most of them used robots.txt to block spiders from crawling their content.
Disallowing directories also makes your site less vulnerable (to an extent). You'd be surprised how much information a robot can disclose about your website. It can index vulnerable files or sensitive directories, or even expose URL parameters that could lead to SQL injection.