Robots.txt Disallow /posts/ ???

Discussion in 'SEO, Traffic and Revenue' started by Sylvain, Oct 11, 2013.

  1. Sylvain

    Sylvain Regular Member

    I use XenForo 1.2.2 and I've noticed that many XenForo forums disallow /posts/ in their robots.txt files. Why not let robots crawl /posts/?

    User-agent: *
    Disallow: /posts/
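
To see what this rule actually covers, here is a minimal sketch using Python's standard `urllib.robotparser`. The sample URLs (`/posts/123/`, `/threads/example-thread.45/`) and the `Googlebot` agent name are illustrative assumptions, not taken from any real forum.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /posts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, agent="Googlebot"):
    """Return True if `agent` may fetch `path` under the rules above."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical URLs, chosen only for illustration.
    for path in ("/posts/123/", "/threads/example-thread.45/"):
        print(path, "->", "allowed" if is_allowed(path) else "blocked")
```

Only URLs whose path starts with /posts/ are blocked by this rule. Thread pages stay crawlable, so the posts themselves can still be reached through their threads.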
     
  2. Programmers World

    Programmers World Regular Member

    That's strange. I would think that it would be extremely useful to let robots crawl your posts.
     
  3. Sylvain

    Sylvain Regular Member

    There must be a good reason.

    I also found that many forums disallow all pages in their robots.txt file
     
  4. Cerberus

    Cerberus Admin Talk Staff

    Copied and pasted from the one we use here, which is a pretty good one:

    Code:
    User-agent: *
    Disallow: /misc/
    Disallow: /help/
    Disallow: /search/
    Disallow: /register/
    Disallow: /login/
    Disallow: /online/
    Disallow: /lost-password/
    Disallow: /account/
    Disallow: /admin.php
    Disallow: /events/birthdays/
    Disallow: /events/monthly
    Disallow: /events/weekly
    Disallow: /goto/
    Disallow: /help/
    Disallow: /login/
    Disallow: /media/keyword/
    Disallow: /media/user/
    Disallow: /media/service/
    Disallow: /media/submit/
    Disallow: /misc/style?*
    Disallow: /misc/quick-navigation-menu?*
    Disallow: /online/
    Disallow: /forums/7/
    Disallow: /forums/20/
    Disallow: /forums/70/
    Disallow: /forums/49/
    Disallow: /forums/155/
    Disallow: /forums/156/
    Disallow: /forums/184/
    Disallow: /forums/200/
    Disallow: /forums/188/
    Disallow: /forums/186/
    Disallow: /forums/187/
    Disallow: /forums/189/
    Disallow: /forums/191/
    
    Allow: /
    
    Sitemap: http://admin-talk.com/sitemap/sitemap.xml.gz
    Sitemap: http://dir.admin-talk.com/sitemap.xml
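
The list above repeats a few entries (/help/, /login/ and /online/ each appear twice). Repeats are harmless to crawlers, but they make the file harder to maintain. A rough sketch of a duplicate check follows; the `duplicate_disallows` helper is hypothetical, the sample is an abbreviated copy of the file rather than the whole thing, and the check ignores User-agent groups for simplicity.

```python
from collections import Counter

# Abbreviated sample of the Disallow lines above (not the full file).
SAMPLE = """\
User-agent: *
Disallow: /misc/
Disallow: /help/
Disallow: /search/
Disallow: /login/
Disallow: /online/
Disallow: /goto/
Disallow: /help/
Disallow: /login/
Disallow: /online/
"""


def duplicate_disallows(text):
    """Return Disallow paths that appear more than once, in first-seen order."""
    paths = []
    for line in text.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow":
            paths.append(value.strip())
    counts = Counter(paths)
    return [path for path, n in counts.items() if n > 1]


if __name__ == "__main__":
    for path in duplicate_disallows(SAMPLE):
        print("duplicate rule:", path)
```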
     
  5. Sylvain

    Sylvain Regular Member

    Pretty much similar to my file

    User-agent: *
    Disallow: /find-new/
    Disallow: /account/
    Disallow: /goto/
    Disallow: /login/
    Disallow: /admin.php
    Disallow: /search/
    Disallow: /search.php
    Disallow: /help/
    Disallow: /members/
    Disallow: /misc/
    Disallow: /online/
    Allow: /
    User-agent: ia_archiver
    Allow: /
    User-agent: BoardTracker
    Disallow: /
    User-agent: BoardReader
    Disallow: /
    User-agent: Baiduspider
    User-agent: Baiduspider-video
    User-agent: Baiduspider-image
    Disallow: /
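
A file with several User-agent groups behaves differently per bot: a crawler that has its own group follows only that group, not the `*` rules. The sketch below feeds the file above to Python's `urllib.robotparser` to illustrate this. Two assumptions: `search.php` is written with a leading slash, since rule paths are matched from the site root, and the sample URLs are made up.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /find-new/
Disallow: /account/
Disallow: /goto/
Disallow: /login/
Disallow: /admin.php
Disallow: /search/
Disallow: /search.php
Disallow: /help/
Disallow: /members/
Disallow: /misc/
Disallow: /online/
Allow: /
User-agent: ia_archiver
Allow: /
User-agent: BoardTracker
Disallow: /
User-agent: BoardReader
Disallow: /
User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def can(agent, path):
    """Return True if `agent` may fetch `path` under the rules above."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    checks = [
        ("Googlebot", "/threads/example.1/"),
        ("Googlebot", "/members/someone.5/"),
        ("ia_archiver", "/members/someone.5/"),
        ("BoardReader", "/threads/example.1/"),
        ("Baiduspider-image", "/threads/example.1/"),
    ]
    for agent, path in checks:
        print(agent, path, "->", "allowed" if can(agent, path) else "blocked")
```

Note that ia_archiver ends up allowed into /members/, because its own group contains only `Allow: /` and the `*` group no longer applies to it.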
     
  6. MyDigitalpoint

    MyDigitalpoint Regular Member

    Well, you can always tweak robots.txt to suit particular needs. The problem could be that the robots file gets modified on the fly at every forum settings update.

    I remember having a long, long disallow list covering search engines, crawlers and bots, but sometimes I wonder whether all of them abide by the rules and refrain from crawling what is on the site.

    I read some time ago that the best way to keep bots away from directories or files we don't want crawled is to make no reference to them in robots.txt, so that bots never learn they exist... unless they are linked from other pages that bots are allowed to crawl.
     
  7. BamaStangGuy

    BamaStangGuy Administrator

  8. GTB

    GTB Regular Member

    Well, the idea of a robots.txt file is to stop the bots that obey it from wasting time on files that don't need indexing, which speeds things up for them. I wouldn't go daft adding loads of entries, but some files and folders are obvious candidates for blocking: cache folders, install files, config.php and so on.

    It can help a little to block those obvious files and folders from being indexed. You can also use it to limit how much bots crawl, which can save you bandwidth.
     
    Last edited: Oct 28, 2013
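
One way to limit crawling is the non-standard `Crawl-delay` directive. The post above does not name it, so treat this as an assumption about what is meant, and note that not every crawler honors it. The `/cache/` and `/install/` paths and the `ExampleBot` agent name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block two obvious folders and ask bots to wait
# 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /cache/
Disallow: /install/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawl_delay_for(agent):
    """Return the Crawl-delay (in seconds) that applies to `agent`, or None."""
    return parser.crawl_delay(agent)


if __name__ == "__main__":
    print("delay:", crawl_delay_for("ExampleBot"))
    print("cache allowed:", parser.can_fetch("ExampleBot", "/cache/page.html"))
```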
  9. Sylvain

    Sylvain Regular Member

  10. BamaStangGuy

    BamaStangGuy Administrator

    Yes. Though today I did spot an instance where /posts/ was being indexed. Will add it to that thread.
     
  11. deansaliba

    deansaliba Regular Member

    I have noticed that more and more websites and blogs are using the robots.txt file recently. In the last month I have lost count of how many times I have looked for something in Google and found a site listed with a description saying that robots.txt had blocked the search engine from crawling the site. I didn't think they would index those sites if they were restricted by robots.txt. :confused:
     
  12. Sylvain

    Sylvain Regular Member

    Some robots bypass the robots.txt file
     
  13. pixelek

    pixelek Regular Member

    It's not a very wise thing to do. Of course it's none of my business, but I wouldn't do that. Some of your users may post sensitive data, and by not using robots.txt (that is, by letting robots crawl your site) that data gets revealed. It's a strange policy of yours.
     
  14. BamaStangGuy

    BamaStangGuy Administrator

    I have no idea how a robots.txt is going to protect my members from themselves?
     
  15. MyDigitalpoint

    MyDigitalpoint Regular Member

  16. Sylvain

    Sylvain Regular Member

    I checked my competitors... some use robots.txt, others don't.
     
  17. cpvr

    cpvr Regular Member

    I don't have any competitors, and when I did, most of them used robots.txt to block spiders from crawling their content.
     
  18. JoeyJ

    JoeyJ Regular Member

    Disallowing directories also makes your site less vulnerable (to an extent). You'd be surprised how much information a robot can disclose about your website. It can index vulnerable files or sensitive directories, or even show parameters that could lead to MySQL injection.
     
