SEO and robots.txt

Well, my new year resolution is to try and learn more about SEO and get my site up the rankings. I have checked all the excellent help on this forum (though I haven’t read through it all yet), so thanks to everyone for that. I am also going through Google’s extensive help pages but didn’t quite understand robots.txt. It seems it’s a good idea to add one to my site root, but what should I type in the file? Here is the basic Google suggestion: “User-agent: * Allow: /” … is that all I need? (Not that I understand that either.) Any advice very welcome, and if anyone would care to check out my site with its current content I’d be very grateful. Best regards and a happy new year to all … Roger

http://www.rogerburton.co.uk/


freewaytalk mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

By putting in “User-agent: *” you address any “spider” that visits your site, and “Allow: /”, which permits everything, isn’t restricting anything at all.

I was always told that a robots file should only have “Disallow” (no “Allow”) in it, followed by the directories that are to be hidden from indexing: usually your cgi-bin, any tmp folders, other sub-sites you may be running, etc. Remember that robots.txt is a public file and can be loaded by anyone to see what is being hidden.

So what you have in there right now (with “Allow” changed to an empty “Disallow:”) is the equivalent of having a blank robots.txt file, or not having one at all. Both accomplish the same thing, and it looks like you want all spiders to index everything anyway.

You can also adjust things within the txt file in more depth, for specific spiders. For example:

User-agent: Google
Disallow:

User-agent: *
Disallow: /

This would allow Google complete access and then exclude all others.
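If you want to sanity-check a two-record file like this before uploading it, one option (an assumption on my part — any robots.txt checker will do) is Python’s standard `urllib.robotparser` module, which parses the rules and reports what a given user-agent may fetch. The URLs here are placeholders:

```python
from urllib import robotparser

# The two-record example from above: Google gets full access,
# every other spider is excluded from the whole site.
RULES = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

google_allowed = rp.can_fetch("Google", "http://www.example.com/index.html")
others_allowed = rp.can_fetch("SomeOtherBot", "http://www.example.com/index.html")
print(google_allowed, others_allowed)  # → True False
```

An empty `Disallow:` means “nothing is disallowed”, so the first record gives Google the run of the site while the catch-all record shuts everyone else out.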

Here’s a few articles that might help:

http://www.robotstxt.org/robotstxt.html

http://www.robotstxt.org/faq.html

Good luck Roger, hope the rank goes to the top!



Excellent Dan, thank you so much … your links will give me some interesting reading this afternoon. Regards Roger



Also, remember that your robots.txt file works on the honor system:
bad robots can simply ignore it if they wish.
LLE


In my robots.txt files, I have things like members-only directories, my
test directory, anything that I don’t want crawled by a search engine.
Yes, I have my member directory password protected, but I also include
it in the robots.txt. You can add directories or single documents.
This is an example of how my robots.txt file looks:

User-agent: *
Disallow: /memberdirectory/
Disallow: /members/
Disallow: /programs.php
Disallow: /TEST/
Disallow: /uploadedphotos/
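As a quick check that rules like these block what you intend, here’s a small sketch using Python’s standard `urllib.robotparser` (my choice of tool, not anything the post prescribes; the directory names are just the examples above and the URLs are placeholders). Note all the `Disallow` lines sit in a single `User-agent: *` record, since a crawler reads only the first record that matches it:

```python
from urllib import robotparser

# The example directories above, in one record so that every
# Disallow line applies to all spiders.
RULES = """\
User-agent: *
Disallow: /memberdirectory/
Disallow: /members/
Disallow: /programs.php
Disallow: /TEST/
Disallow: /uploadedphotos/
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

members_ok = rp.can_fetch("AnyBot", "http://www.example.com/members/list.html")
about_ok = rp.can_fetch("AnyBot", "http://www.example.com/about.html")
print(members_ok, about_ok)  # → False True
```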


Robin Stark


Sometime around 27/12/08 (at 16:33 -0600) Robin Stark said:

I have my member directory password protected, but I also include it
in the robots.txt.

This will prevent a search engine spider from trying and failing to
look in that directory, but in terms of preventing access the
robots.txt file won’t do anything that hasn’t been dealt with already.

k



Right, it does not protect it in any way except to keep search engines
from crawling that directory. So you still have to protect it first.
I just put it in the robots file too, because I had somebody insist
on using [gulp] FrontPage on one site, and when I turned on the FP
extensions, it deleted all the .htaccess files … well, it’s a long
story, but I figured I would add it to the robots.txt file as well.


Robin Stark


Right, it does not protect it in any way except to keep search
engines from crawling that directory.

Technically, they wouldn’t be able to get in any more than a human
visitor without the password. But it doesn’t hurt to include this,
and it might help keep the logs slightly more free of unnecessary
errors.

k



Robots.txt helps search engines determine which DIRECTORIES they should (and should not) be reading and indexing.

Think of it as a way for the spider to optimise its work when indexing your site. In effect you are telling it what is and isn’t relevant for indexing. A spider has to index billions of sites out there, so it is nice for it to know that some of them are kind enough to leave it a message about which files can be excluded from indexing.

That’s basically the point of the robots file.

It’s very easy to create a small text file using a text editor and save it as robots.txt.

Then, in Freeway, create a new page in the main folder called, for example, “siteincludes.html”, apply the “PHP Use Include Pages” Action from Page Actions, and link your text file to it. (You might have to download this Action first.)

The siteincludes page is really a “dummy page”, so you can also apply the PHP Make Markup Page Action to remove any HTML that Freeway creates.

The result is that the robots.txt file you included in the siteincludes page is added to the main directory of your site, exactly where it needs to be.

You should also create a sitemap.xml file and include a reference to it in your robots.txt file.

If you want to see what I mean have a look at http://www.google.com/robots.txt

Yes, even Google has its own!

The sitemap.xml file is yet another method of instructing the web spiders to index specific pages of your site, but it performs the opposite of exclusion: with a few small bits of added info, the spider can determine how important each page is and how often it should come back to check for changes.

You can create and add the file in Freeway the same way as described above; just don’t forget that for a sitemap you need the .xml extension instead of .txt.
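For reference, a minimal sitemap.xml might look like this (a sketch only: the URL, date, and values are placeholders, and the optional `<changefreq>` and `<priority>` elements are the “small bits of added info” mentioned above):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/index.html</loc>
    <lastmod>2008-12-27</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

The reference back from robots.txt is then a single line, e.g. `Sitemap: http://www.example.com/sitemap.xml`.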

Hope this helps some people.

Oh, BTW, a sitemap XML file is not the same as a page containing a map of your site that happens to be called a SiteMap. The XML file is for the spider; the page is for the human visitor.



Thanks Chris, that’s an excellent, well-written explanation … even a dummy like me understood (most of) it. Good of you to take the time. Roger



Hi Roger

I actually posted this in the Tutorials section as well, with a couple of links, after I realised it was quite “tutorial”-like.

Check out the links for more information on how to structure the files.



Hi Roger,

Subject to other folks’ opinions, you might try simplifying your page titles and making them more specific. The title of your home page, for example, is quite long and may be counterproductive.

Page titles and meta description tags seem to be important. Google no longer gives weight to the keywords meta tag because of abuse.

Jim



Thanks Jim, I’ll certainly give that a try … strange beast, SEO! Roger



I once had the same problem. Here is a copy-and-paste robots.txt, including a few “unnecessary” crawlers.

# URL: http://www.yourname.com
# All robots may spider the domain

User-agent: *
Disallow:

User-agent: Yahoo
Disallow:

User-agent: Google
Disallow:

User-agent: infoseek
Disallow:

User-agent: swissguide
Disallow:

User-agent: altavista
Disallow:

User-agent: lycos
Disallow:

User-agent: CherryPicker
Disallow: /

User-agent: PicScout
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: WebBandit
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: ExtractorPro
Disallow: /

User-agent: CopyRightCheck
Disallow: /

User-agent: Crescent
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: ProWebWalker
Disallow: /

User-agent: CheeseBot
Disallow: /

User-agent: LNSpiderguy
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: ia_archiver/1.6
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: MIIxpc
Disallow: /

User-agent: Telesoft
Disallow: /

User-agent: Website Quester
Disallow: /

User-agent: moget/2.1
Disallow: /

User-agent: WebZip/4.0
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebSauger
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: NetAnts
Disallow: /

User-agent: Mister PiX
Disallow: /

User-agent: WebAuto
Disallow: /

User-agent: TheNomad
Disallow: /

User-agent: grub-client
Disallow: /

User-agent: grub
Disallow: /

User-agent: looksmart
Disallow: /

User-agent: WebZip
Disallow: /

User-agent: larbin
Disallow: /

User-agent: b2w/0.1
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Python-urllib
Disallow: /

User-agent: NetMechanic
Disallow: /

User-agent: URL_Spider_Pro
Disallow: /

User-agent: WWW-Collector-E
Disallow: /

User-agent: RMA
Disallow: /

User-agent: libWeb/clsHTTP
Disallow: /

User-agent: asterias
Disallow: /

User-agent: httplib
Disallow: /

User-agent: turingos
Disallow: /

User-agent: spanner
Disallow: /

User-agent: InfoNaviRobot
Disallow: /

User-agent: Harvest/1.5
Disallow: /

User-agent: Bullseye/1.0
Disallow: /

User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /

User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /

User-agent: CherryPickerSE/1.0
Disallow: /

User-agent: CherryPickerElite/1.0
Disallow: /

User-agent: WebBandit/3.50
Disallow: /

User-agent: NICErsPRO
Disallow: /

User-agent: Microsoft URL Control - 5.01.4511
Disallow: /

User-agent: DittoSpyder
Disallow: /

User-agent: Foobot
Disallow: /

User-agent: WebmasterWorldForumBot
Disallow: /

User-agent: SpankBot
Disallow: /

User-agent: BotALot
Disallow: /

User-agent: lwp-trivial/1.34
Disallow: /

User-agent: lwp-trivial
Disallow: /

User-agent: BunnySlippers
Disallow: /

User-agent: Microsoft URL Control - 6.00.8169
Disallow: /

User-agent: URLy Warning
Disallow: /

User-agent: Wget/1.6
Disallow: /

User-agent: Wget/1.5.3
Disallow: /

User-agent: Wget
Disallow: /

User-agent: LinkWalker
Disallow: /

User-agent: cosmos
Disallow: /

User-agent: moget
Disallow: /

User-agent: hloader
Disallow: /

User-agent: humanlinks
Disallow: /

User-agent: LinkextractorPro
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Mata Hari
Disallow: /

User-agent: LexiBot
Disallow: /

User-agent: Web Image Collector
Disallow: /

User-agent: The Intraformant
Disallow: /

User-agent: True_Robot/1.0
Disallow: /

User-agent: True_Robot
Disallow: /

User-agent: BlowFish/1.0
Disallow: /

User-agent: JennyBot
Disallow: /

User-agent: MIIxpc/4.2
Disallow: /

User-agent: BuiltBotTough
Disallow: /

User-agent: ProPowerBot/2.14
Disallow: /

User-agent: BackDoorBot/1.0
Disallow: /

User-agent: toCrawl/UrlDispatcher
Disallow: /

User-agent: WebEnhancer
Disallow: /

User-agent: suzuran
Disallow: /

User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /

User-agent: VCI
Disallow: /

User-agent: Szukacz/1.4
Disallow: /

User-agent: QueryN Metasearch
Disallow: /

User-agent: Openfind data gathere
Disallow: /

User-agent: Openfind
Disallow: /

User-agent: Xenu’s Link Sleuth 1.1c
Disallow: /

User-agent: Xenu’s
Disallow: /

User-agent: Zeus
Disallow: /

User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /

User-agent: RepoMonkey
Disallow: /

User-agent: Microsoft URL Control
Disallow: /

User-agent: Openbot
Disallow: /

User-agent: URL Control
Disallow: /

User-agent: Zeus Link Scout
Disallow: /

User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /

User-agent: Webster Pro
Disallow: /

User-agent: EroCrawler
Disallow: /

User-agent: LinkScan/8.1a Unix
Disallow: /

User-agent: Keyword Density/0.9
Disallow: /

User-agent: Kenjin Spider
Disallow: /

User-agent: Iron33/1.0.2
Disallow: /

User-agent: Bookmark search tool
Disallow: /

User-agent: GetRight/4.2
Disallow: /

User-agent: FairAd Client
Disallow: /

User-agent: Gaisbot
Disallow: /

User-agent: Aqua_Products
Disallow: /

User-agent: Radiation Retriever 1.1
Disallow: /

User-agent: Flaming AttackBot
Disallow: /

User-agent: Oracle Ultra Search
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: PerMan
Disallow: /

User-agent: searchpreview
Disallow: /



Why would you disallow so many search engines? Isn’t more better?

Why not just have:

User-agent: *
Disallow:

And be done with it?

I have a five-page website with content and graphics. I want the bots to search the whole thing (no disallow entries), but why would I be selective as to which search engines index my site?

Also, what is the proper entry for Google? Just Google or Googlebot?

Thanks everyone!

~The Fuz



I agree that a simple:

User-agent: *
Disallow: /

would block all of the search engines Thomas has listed (as well as
the rest).

Typically, most people want as many search engines as they can get to
spider and trawl their sites. The exceptions are generally sites that
are confidential or in other ways ‘pre-release’; blocking spiders
that trawl for e-mail addresses (99% of these won’t adhere to the
robots exclusion standard anyway, so it is pointless trying to block
them); or preventing your image content from showing up in the results
of image searches. If you are a photographer or illustrator, for
example, seeing your work appear in the Google image search results
pages can be a little worrying.
The other search spider a lot of people tend to block is the Wayback
Machine (http://www.archive.org/). It’s a great product, but the spider
can really hog your bandwidth as it comes back again and again to
archive your sites.
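A robots.txt for that last case might look like the sketch below: let every normal search engine in, but ask the Wayback Machine’s spider (ia_archiver) to stay out. The check with Python’s standard `urllib.robotparser` and the example URLs are my own additions, not anything from the thread:

```python
from urllib import robotparser

# Allow everything by default, but ask the Internet Archive's
# spider (ia_archiver) to keep away.
RULES = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

archive_ok = rp.can_fetch("ia_archiver", "http://www.example.com/")
google_ok = rp.can_fetch("Googlebot", "http://www.example.com/")
print(archive_ok, google_ok)  # → False True
```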
Regards,
Tim.

On 17 May 2009, at 13:53, Fuzzy Z wrote:

Why would you disallow so many search engines? Is not more the better?

FreewayActions.com - Freeware and shareware actions for Freeway
Express & Pro.

Protect your mailto links from being harvested by spambots with Anti
Spam.
Only available at FreewayActions.com

http://www.freewayactions.com

