HowtoSpider

How to spider the CLAWsite

If you're on dial-up, the most economical way to keep up to date
with your favourite discussion-type sites (forums, wikis, blogs and
the like) is to spider 'em while you're downloading your mail (for
example, using the Synchronize tool in IE, or a specialised spider
program). A spider downloads a base HTML page and follows its links
to a certain depth.

The first run is likely to be slow, as your spider will be fetching
everything for the first time, but after that it only fetches pages
that have changed (and anything newly linked). Most efficient.
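
With wget, the "only get what changed" behaviour can be approximated with its --timestamping option. A minimal sketch follows; the function name respider and the default depth of 1 are made up for illustration, and how much it saves depends on the server sending Last-Modified headers (dynamically generated pages often don't, in which case wget fetches them again).

```shell
#!/bin/sh
# respider URL [DEPTH]
# Fetch URL and the pages it links to, down to DEPTH links (default 1),
# skipping files whose local copy is already up to date.
# "respider" is an illustrative name, not part of wget.
respider() {
    url=$1
    depth=${2:-1}
    # --timestamping: only download when the remote file is newer
    # --no-verbose:   keep the log short
    # WGET can be overridden (e.g. WGET=echo) to preview the command.
    "${WGET:-wget}" --recursive --level "$depth" --timestamping --no-verbose "$url"
}
```

Run it as: respider http://claws.uct.ac.za/clawtiki/ 1 (or set WGET=echo first to see the command without dialling up).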

Update for CLAWtiki: Since tikiwiki has many more links on the
last-changes page (to page histories, for example), you'll end up
downloading a whole lot more than you need. So, unless you specify
a long list of pages to exclude (which can be done with wget), you'll
be wasting an awful lot of time and bandwidth...
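
One way to write that exclude list, sketched with wget's --reject-regex option (needs wget 1.14 or later). The script names in SKIP are guesses at typical tikiwiki pages, not taken from the CLAWtiki itself, so look at the links on the real last-changes page and adjust.

```shell
#!/bin/sh
# URLs matching this extended regex are never fetched.
# ASSUMPTION: these script names are typical tikiwiki ones; edit to taste.
SKIP='tiki-(pagehistory|editpage|print)\.php'

# spider_lastchanges is an illustrative name, not part of wget.
spider_lastchanges() {
    # WGET can be overridden (e.g. WGET=echo) to preview the command.
    "${WGET:-wget}" --recursive --level 1 --timestamping \
        --reject-regex "$SKIP" \
        "http://claws.uct.ac.za/clawtiki/tiki-lastchanges.php"
}
```

The regex is matched against the whole URL, so one pattern with alternatives is usually easier to maintain than a long comma-separated list.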

Recommended Spiders

  • Synchronize (it's built into IE!)
  • wget --recursive --level <depth> <URL>

To keep up to date with CLAWsite changes, choose these root pages
and depths:

CLAWsite proper
http://claws.uct.ac.za/
Depth: 0 links (unless you feel like following all the old links)

CLAWtiki base
http://claws.uct.ac.za/clawtiki/
Depth: 1 link (to capture all the important WikiWords, shouldn't
change much)

CLAWtiki Last Changes
http://claws.uct.ac.za/clawtiki/tiki-lastchanges.php
Depth: 1 link (the most important, get all the new content when
it arrives)
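
For wget users, the three roots above could be wrapped up roughly like this. The function names are made up for illustration. One thing to watch: wget treats --level 0 as "no limit", so a depth of 0 links here means leaving out --recursive altogether.

```shell
#!/bin/sh
# fetch_root URL DEPTH
# DEPTH is the number of links to follow from URL; 0 means the page alone.
# fetch_root and spider_claws are illustrative names, not part of wget.
fetch_root() {
    url=$1
    depth=$2
    if [ "$depth" -eq 0 ]; then
        # Just the page itself. Do NOT pass --level 0: wget reads 0 as infinite.
        "${WGET:-wget}" --timestamping "$url"
    else
        "${WGET:-wget}" --recursive --level "$depth" --timestamping "$url"
    fi
}

spider_claws() {
    fetch_root http://claws.uct.ac.za/ 0
    fetch_root http://claws.uct.ac.za/clawtiki/ 1
    fetch_root http://claws.uct.ac.za/clawtiki/tiki-lastchanges.php 1
}
```

Setting WGET=echo before calling spider_claws prints the three commands instead of running them, which is handy for checking the depths before you dial up.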

That's it! Anything else you come across while browsing offline,
just make a note to visit when you're next online. For a more
exhaustive (but obviously slower) download, increase these depths
by one. (Same applies for following links off the CLAWsite.)

For other sites, spider the base/root URI to a depth of 1, plus
whichever page best keeps track of changes, also to a depth of 1
(often the base page is the best one). If you need to log on, or
there are links that log you off, you will need to ask the webmaster
to fix the robots.txt file so that your spider stops following these
links.
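
As a rough idea of what to ask the webmaster for, here is a sketch that writes a small robots.txt telling all spiders to stay away from a log-off link. The path /clawtiki/tiki-logout.php is an assumption (it is the usual tikiwiki logout script); substitute whatever the log-off link on the site actually points to. The function name write_robots is made up too.

```shell
#!/bin/sh
# write_robots [FILE]
# Write an example robots.txt (default: ./robots.txt) for the webmaster to
# adapt and place at the root of the site.
# ASSUMPTION: the log-off link is /clawtiki/tiki-logout.php.
write_robots() {
    cat > "${1:-robots.txt}" <<'EOF'
User-agent: *
Disallow: /clawtiki/tiki-logout.php
EOF
}
```

wget honours robots.txt by default when recursing, so once a rule like this is in place the spider should skip the log-off link on its own.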


d@vid Apr 20:

obviously amendments need to be made when the CLAWsite? (eventually)
gets updated -- and do we have robots.txt files in these directories?
