[Pro] Robots.txt and sitemap.xml

This has also been posted elsewhere but it is a small tutorial so apologies if you’ve seen it before. I have added a couple of references though.

Robots.txt helps search engines determine what DIRECTORIES they should be reading and indexing.

Think of it as a way for the spider to optimise the effort it spends indexing your site. In effect you are telling it what is and isn’t relevant for indexing. Think of it this way: if a spider has to index everything on the billions of sites out there, it is nice for it to know that some of them are kind enough to leave a message about which files can be EXCLUDED from indexing.

That’s basically the point of the robots file.

It’s very easy to create a small text file using a text editor (I use the fantastic HyperEdit) and save it as robots.txt.

You can find out how to structure the file here.

Then in Freeway create a new page in the main folder called, for example, “siteincludes.html”, then apply the “PHP Use Include Pages” Action from Page Actions and link your text file to it. (You might have to download this Action if you don’t already have it.)

The siteincludes page is really a “dummy page” so you can also apply the PHP Make Markup Page to remove any HTML that Freeway creates.

The result will be that the robots.txt file that you included in the siteincludes page will be added to the main directory of your site, exactly where it needs to be.

Another thing is that you need to create a sitemap.xml file and include a reference to this in your robots.txt file.

If you want to see what I mean have a look at http://www.google.com/robots.txt

Yes, they even have their own one!
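
To illustrate what such a reference looks like, here is a minimal sketch in Python (used only for illustration; nothing in Freeway requires it). The robots.txt content and the www.yoursite.com domain are made-up placeholders. The standard-library parser simply reads the Sitemap line back, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; www.yoursite.com is a placeholder domain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /Resources/

Sitemap: http://www.yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the list of Sitemap URLs found in the file,
# or None if the file contains no Sitemap line.
sitemaps = parser.site_maps()
print(sitemaps)
```

The Sitemap line stands on its own, outside any User-agent group, and should be a full URL rather than a relative path.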

The sitemap.xml file is yet another method for instructing the webspiders to index specific pages of your site, but it performs the opposite of exclusion. With a few small bits of added info the spider determines how important the page is and how often it should come back to check for changes.
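
As a rough sketch of those “small bits of added info”, the Python below builds a two-page sitemap.xml. The page URLs, change frequencies and priorities are invented for the example; the element names (urlset, url, loc, changefreq, priority) and the namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical pages: (URL, how often it changes, priority from 0.0 to 1.0).
PAGES = [
    ("http://www.yoursite.com/", "weekly", "1.0"),
    ("http://www.yoursite.com/contact.html", "yearly", "0.3"),
]


def build_sitemap(pages):
    """Build sitemap XML text from (loc, changefreq, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = priority
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```

The printed text is what you would save as sitemap.xml. The changefreq and priority values are hints to the spider, not commands.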

You can read the official information here.

You can create and add the file in Freeway the same way as described above. Just don’t forget that a sitemap file needs .xml as the extension to the file name instead of .txt (i.e. the file should be called “sitemap.xml”).

I hope this helps some people.

Oh yes, by the way, a sitemap XML file is not the same as a page containing a list of pages on your site that happens to be called a SiteMap. The XML file is for the spider; the page is for the Internet user.


tutorials mailing list
email@hidden
Update your subscriptions at:
http://freewaytalk.net/person/options

Great write-up. Just one tip to add: to get the robots.txt file into the home directory, there’s no need to use PHP Use Include Pages. Simply use Upload Stuff (Tim Plumb, FreewayActions.com) or Extra Resources (Softpress, ActionsForge). Either one of those Actions can add files either to the same directory as the page they’re applied to or to the Resources folder. So you won’t even need to add another page to your site.

Walter



You could even use the convert to stylesheet action to create your robots.txt or sitemap.xml right inside your Freeway document.



Hello again.

Actually the reason why I use PHP Use Include Pages is that it has a nice button that lets you edit the text file directly, but I agree these alternatives are equally good and may be better depending on the way you work.

Thanks for adding the comments.



I know this is an old thread, but I am adding a robots.txt file through Upload Stuff. I tested it through the Google Analytics / Webmaster crawl and it shows a 404, but when you click on the actual link in the Google Webmaster view, it shows the robots file.
www.babybootyshop.com/robots.txt

Is there a reason for this? Do I have something wrong? I want the different engines to search the site.

Thank you again for all the help.

Julie



Hi Julie

I’m not sure why Google Analytics gives a 404 either. However, it will take a while for the spider to index your site, maybe even several weeks, so don’t panic if it hasn’t done so yet. But to be clear, including the file does not “attract” spiders by default.

Indexing your site is not the same thing as submitting your site to appear in search results.

And I think that’s what you are after, isn’t it? By far the quickest way to get your site into Google, Yahoo, Bing etc. is to follow their instructions to submit your URL, and you should have your site in Google within 48 hours.

As far as the file goes, you put it in the right place, at the root of your domain’s www directory, so I don’t see any reason why it won’t be seen by a spider. However, I noted that you haven’t written any instructions to not visit anything, which is rather the point of the file. Otherwise you don’t need it at all.
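
If you want to check for yourself what your server answers when robots.txt is requested, a small sketch like the one below would do it. It is only an assumption about how you might test, not an explanation of what Google’s tool reported; robots_status is a made-up helper name, and the code uses nothing beyond Python’s standard library.

```python
import urllib.error
import urllib.request


def robots_status(site_root):
    """Return the HTTP status code the server sends for <site_root>/robots.txt."""
    url = site_root.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 404 and other error responses are raised as exceptions by urlopen.
        return err.code
```

Calling robots_status("http://www.yoursite.com") with your own domain in place of the placeholder returns 200 if the file is served from the root and 404 if the server cannot find it there.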

Chris



Thank you very much.

I used Google to auto-generate it. When I went to the original site in this thread, I was unsure what to put.

I am still unsure.

And yes, that is what I am looking for: to get in, and to get higher in the rankings.

Julie

(In reply to Chris D’Costa’s message of Jun 22, 2010, 17:49.)

Hi Julie

The main purpose of these files is to direct the web spider to index your site in a thoughtful and efficient way.

I recall from experience that it’s not clear at first what the difference is between “having your site indexed”, “having it appear in search engines”, and lastly “having it appear high up in search results”.

These three aspects are different subject areas and should not be confused, although they are related, because you have to deal with all three at some point.

I would have thought the Wikipedia entry on the robots.txt file is fairly explicit, but if you want to relate what it says there to how Freeway works, then what you need to do is look at the hierarchy navigator in Freeway (the left-hand column) to identify the FOLDERS that you want to prevent spiders from reading. You can even apply it to specific pages if you want to.

The position of the folders and files in Freeway is an exact representation of the path that the files are eventually uploaded to on your site:

The first “forward slash” represents the root directory of your site. You don’t need to tell it explicitly to find www.yoursite.com

So, for example, Freeway normally puts a Resources folder underneath that root, and if you wanted to prevent the spider from going there you would type: /Resources/

Nested folders would appear like this: /Folder1/Folder2/

and so on.
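
To see how those paths behave once they are in a robots.txt file, here is a minimal Python sketch. The file content is hypothetical and reuses the folder names above; the check uses Python’s standard-library parser, which is a reasonable stand-in for a well-behaved spider but not a guarantee of how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content built from the folder examples above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /Resources/
Disallow: /Folder1/Folder2/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, site="http://www.yoursite.com"):
    """Return True if a generic spider ("*") may fetch the given path."""
    return parser.can_fetch("*", site + path)


for path in (
    "/index.html",
    "/Resources/logo.png",
    "/Folder1/page.html",
    "/Folder1/Folder2/page.html",
):
    print(path, "allowed" if is_allowed(path) else "excluded")
```

Each Disallow line is matched as a prefix of the path, so /Folder1/page.html stays open to the spider while everything under /Folder1/Folder2/ is excluded.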

The second aspect, “having it appear in search engines”, I mentioned earlier: follow the instructions on Google.

Lastly, getting a high ranking (Search Engine Optimisation, or SEO for short) is a whole industry in itself, with highly paid consultants advising on how to spend your money. I’m not convinced that this is any better than a blanket mailshot approach to promoting your site, and in fact one project I’m currently working on aims to give high visibility to new players like yourself without any fees at all. I’d like to tell you more but it’s really at the drawing-board stage.

Chris



Thank you for your explanation and help!

Julie

(In reply to Chris D’Costa’s message of Jun 23, 2010, 6:00 AM.)

Please, visit this site: http://www.avantibg.net/2011/08/talk-to-a-file-called-robots-txt/ I want to know your comments. Thank you!

