From CLAWs olden times wiki archive

TheTome: HowtoSpider

How to spider the CLAWsite?

If you're on dial-up, the most economical way to keep up to date with
your favourite discussion-type sites (forums, wikis, blogs and the like)
is to spider 'em while you're downloading your mail, for example with
the Synchronize tool in IE or a specialised spider program. A spider
downloads a base HTML page and follows its links to a certain depth.
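
To make "follows links to a certain depth" concrete, here is a minimal Python sketch of a depth-limited spider. It is an illustration only, not how the IE Synchronize tool or any particular spider program works. The names spider, fetch_html and LinkParser are made up for this example. Depth 0 means the base page alone.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download one page over the network (hypothetical helper)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def spider(root, max_depth, fetch=fetch_html):
    """Return {url: html} for root plus every page within max_depth links."""
    pages = {}
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        pages[url] = html
        if depth >= max_depth:
            continue  # depth limit reached: keep the page, ignore its links
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

Calling spider("http://claws.uct.ac.za/", 0) would fetch the base page only; a depth of 1 adds every page it links to. The fetch argument is there so the link-following logic can be tried against canned pages without going online.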

The first run is likely to be a bit long-winded, as your spider will be
fetching everything, but thereafter it will only get pages with changes
(and new links). Most efficient.
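
The page doesn't say how a spider decides that a page has changed. One common mechanism, assumed here purely for illustration, is the HTTP conditional GET: send an If-Modified-Since header and treat a 304 reply as "nothing new". The function names below are hypothetical, and a server (especially one generating pages dynamically) may ignore the header and resend the page anyway.

```python
from email.utils import formatdate
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def conditional_request(url, last_fetch_epoch):
    """Build a GET that asks the server to skip pages unchanged since then."""
    stamp = formatdate(last_fetch_epoch, usegmt=True)  # HTTP date format
    return Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_changed(url, last_fetch_epoch):
    """Return the new page body, or None when the server answers 304."""
    request = conditional_request(url, last_fetch_epoch)
    try:
        with urlopen(request, timeout=30) as response:
            return response.read()
    except HTTPError as err:
        if err.code == 304:  # Not Modified: the local copy is still current
            return None
        raise
```

Here last_fetch_epoch is the time of your previous download in seconds since the Unix epoch, for example the value time.time() returned when you last spidered.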

Update for CLAWtiki: since tikiwiki puts many more links on the
last-changes page (to page histories, for example), you'll end up
downloading a whole lot more than you need. So, unless you specify a
long list of pages to exclude (which can be done with wget), you'll be
wasting an awful lot of time and bandwidth.
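
As a rough idea of what an exclusion list looks like, here is a small Python filter a spider could run on each link before fetching it. The patterns are assumptions made up for the example, not a checked list of what CLAWtiki links to; you'd build the real list from the links you actually see being downloaded.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

# Hypothetical exclusion patterns, matched against the URL path only.
EXCLUDE = [
    "*/tiki-pagehistory.php",
    "*/tiki-print.php",
]


def wanted(url, exclude=EXCLUDE):
    """True if the URL's path matches none of the exclusion patterns."""
    path = urlparse(url).path
    return not any(fnmatchcase(path, pattern) for pattern in exclude)
```

Matching on the path alone means the query string is ignored, so every page-history link is skipped whichever page it refers to.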

Recommended Spiders

To keep up to date with CLAWsite changes, choose these root pages and
depths:

CLAWsite proper
http://claws.uct.ac.za/
Depth: 0 links (unless you feel like following all the old links)

CLAWtiki base
http://claws.uct.ac.za/clawtiki/
Depth: 1 link (to capture all the important WikiWords; shouldn't change much)

CLAWtiki Last Changes
http://claws.uct.ac.za/clawtiki/tiki-lastchanges.php
Depth: 1 link (the most important; gets all the new content when it arrives)

That's it! Anything else you come across while browsing offline, just
make a note to visit when you're next online. For a more exhaustive (but
obviously slower) download, increase these depths by one. (The same
applies to following links off the CLAWsite.)
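
If your spider is scripted, the roots and depths listed above can be kept as plain data, with the "increase by one" variant derived from it. This is only a sketch: the Target type and the exhaustive function are names invented for this example, and only the URLs and depths come from the list above.

```python
from typing import NamedTuple


class Target(NamedTuple):
    name: str
    root: str
    depth: int


# Roots and depths as recommended on this page.
TARGETS = [
    Target("CLAWsite proper", "http://claws.uct.ac.za/", 0),
    Target("CLAWtiki base", "http://claws.uct.ac.za/clawtiki/", 1),
    Target(
        "CLAWtiki Last Changes",
        "http://claws.uct.ac.za/clawtiki/tiki-lastchanges.php",
        1,
    ),
]


def exhaustive(targets, extra=1):
    """Same roots, each followed `extra` links deeper (slower, more complete)."""
    return [t._replace(depth=t.depth + extra) for t in targets]
```

Keeping the everyday list and the exhaustive one in step this way means a changed root only has to be edited in one place.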

For other sites, spider the base/root URI to a depth of 1, then find the
page that best keeps track of changes and spider that to a depth of 1 as
well (often the base is the best page). If you need to log on, or there
are links that log you off, you will need to ask the webmaster to fix
the robots.txt file so that your spider stops following these links.
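
To show what such a robots.txt fix does from the spider's side, here is a short sketch using Python's standard urllib.robotparser. The robots.txt content is hypothetical (the page doesn't say what the real file contains, or which scripts handle logging on and off), and allowed_by is a name made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps spiders away from log-on/log-off links.
ROBOTS_TXT = """\
User-agent: *
Disallow: /clawtiki/tiki-login.php
Disallow: /clawtiki/tiki-logout.php
"""


def allowed_by(robots_text, url, agent="my-spider"):
    """True if a well-behaved spider called `agent` may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)
```

Bear in mind that this only helps with spiders that actually read robots.txt and respect it; the file is a request, not an access control.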


d@vid Apr 20:

obviously amendments need to be made when the CLAWsite (eventually)
gets updated -- and do we have robots.txt files in these directories?

Retrieved from http://claws.za.net/wikiarchive/TheTome/HowtoSpider