nsite - tool for generating WWW site maps
nsite.pl [ -verbose ] [ -help ] [ -doc ] [ -depth <depth> ] [ -proxy <proxy URL> ] [ -[no]envproxy ] [ -agent <agent> ] [ -authen ] [ -format <html|text|xml|none> ] [ -summary <number of chars> ] [ -title <page title> ] [ -email <e-mail address> ] [ -index ] [ -nolinks ] [ -stats <filename> ] [ -output <filename> ] [ -altstart <filename> ] -url <root URL>
nSite generates site maps for a given WWW site. It walks the site from the root URL and generates an HTML, TEXT, or XML link page that illustrates the structure of the site.
The HTML site map consists of the page URL, title, unique fingerprint, summary, and a list of internal and external links. The links are clickable, with internal links shown in blue and external links in orange.
The TEXT site map consists of the page URL, title, and unique fingerprint.
The XML site map is a list of XML <LINK>/<URL> structures.
The structure reflects the depth from the root page to the pages listed: the first-level bullets are pages accessible directly from the root page, the next-level bullets are pages accessible from those pages, and so on. nSite assumes a typical breadth-first, top-down site structure, so pages may appear in a different order than originally intended.
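The depth-based structure described above can be illustrated with a minimal Python sketch. It is not nSite's own code (nSite is a Perl script); it assumes a breadth-first walk over an in-memory link graph, and the names `assign_depths`, `site`, and the sample paths are hypothetical.

```python
from collections import deque

def assign_depths(links, root, max_depth=None):
    """Walk a link graph breadth-first and record the depth at which
    each page is first reached (the root page has depth 0)."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        # Do not follow links from pages that already sit at the depth limit.
        if max_depth is not None and depths[page] >= max_depth:
            continue
        for target in links.get(page, []):
            if target not in depths:  # the first (shallowest) visit wins
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical site: each page maps to the internal pages it links to.
site = {
    "/": ["/about", "/products"],
    "/about": ["/team", "/products"],
    "/products": ["/team"],
}

depths = assign_depths(site, "/")
for page, depth in depths.items():
    print("  " * depth + page)  # indent each page by its depth
```

Because a page is listed at the depth where it is first reached, a page linked from several places appears only once, which is one reason the order may differ from what the site's author had in mind.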
-url <root URL>: The root URL to generate a site map for. This option is required.
-depth <depth>: The maximum depth of the site map. If not specified, nSite generates a site map of unlimited depth.
-email <e-mail address>: The e-mail address that the robot reports to the sites it fetches pages from.
-proxy <proxy URL>: An HTTP proxy to use.
-[no]envproxy: If -envproxy is set, the proxy specified by the $http_proxy environment variable is used (this is the default behaviour). Use -noenvproxy to suppress this. -proxy takes precedence over -envproxy.
-agent <agent>: The agent string that the robot presents to the server (e.g. 'Mozilla/4.5'). This can be necessary for sites that use browser sniffing to decide which content to serve.
-format <html|text|xml|none>: The output format of the site map. Possible values are html, text, xml, and none.
-summary <number of chars>: Automatically extract a summary to display with the title. The summary is truncated at the specified number of characters (default: 200). To disable the summary display, set the number of characters to -1.
-title <page title>: The page title for the site map.
-authen: Use LWP::AuthenAgent to get HTML pages. This allows the user to type a username and password for pages that are access controlled.
-index: Display an index (table of contents) for the site map.
-nolinks: Disable the display of the internal and external links for each page in the site map.
-altstart <filename>: Start the mapping at a specific file instead of the default index file.
-stats <filename>: Output a statistics file in which each line has the following format:
URL<tab>FINGERPRINT<tab>NUMBER_OF_LINKS<tab>DEPTH<tab>TITLE.
-output <filename>: Output the site map to a file. (Defaults to standard output.)
-help: Display a help message on standard output, with a brief description of nSite and its command-line switches.
-doc: Display the full documentation for nSite, generated from the embedded pod documentation.
Print out the current version number for nSite.
-verbose: Turn on verbose messages.
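The tab-separated statistics file written by -stats lends itself to post-processing. The following Python sketch reads such lines back into records. It assumes that NUMBER_OF_LINKS and DEPTH are written as plain integers and that only the TITLE field might itself contain tabs; the names `PageStats` and `parse_stats`, and the sample lines (including the fingerprint values), are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple

class PageStats(NamedTuple):
    url: str
    fingerprint: str
    links: int
    depth: int
    title: str

def parse_stats(lines: Iterable[str]) -> Iterator[PageStats]:
    """Parse URL<tab>FINGERPRINT<tab>NUMBER_OF_LINKS<tab>DEPTH<tab>TITLE lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first four tabs only, so a tab in the title is preserved.
        url, fingerprint, links, depth, title = line.split("\t", 4)
        yield PageStats(url, fingerprint, int(links), int(depth), title)

# Hypothetical sample data in the documented field order.
sample = [
    "http://www.example.com/\tabc123\t12\t0\tHome\n",
    "http://www.example.com/about.html\tdef456\t3\t1\tAbout Us\n",
]

rows = list(parse_stats(sample))
deepest = max(row.depth for row in rows)
print(f"{len(rows)} pages, maximum depth {deepest}")
```

For a real file, the same function can be applied to an open file handle, for example `parse_stats(open(path, encoding="utf-8"))`, where `path` is whatever filename was passed to -stats.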
nSite makes use of the $http_proxy environment variable, if it is set.
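The proxy selection rules documented for -proxy and -[no]envproxy can be summarised in a short Python sketch. It only illustrates the stated precedence and is not taken from nSite's source; the function name `resolve_proxy`, its parameters, and the example proxy URLs are hypothetical.

```python
import os

def resolve_proxy(proxy=None, envproxy=True, environ=None):
    """Return the proxy URL to use, or None for a direct connection.

    proxy    -- value given with -proxy, if any
    envproxy -- True for -envproxy (the default), False for -noenvproxy
    environ  -- environment mapping; defaults to the process environment
    """
    env = os.environ if environ is None else environ
    if proxy:
        return proxy  # an explicit -proxy takes precedence
    if envproxy:
        return env.get("http_proxy") or None  # fall back to $http_proxy
    return None  # -noenvproxy and no explicit proxy

# Hypothetical environment with $http_proxy set.
env = {"http_proxy": "http://proxy.example.com:8080"}

print(resolve_proxy(environ=env))
print(resolve_proxy(proxy="http://other.example.com:3128", environ=env))
print(resolve_proxy(envproxy=False, environ=env))
```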
HTML::Entities, Getopt::Long, LWP::AuthenAgent, LWP::UserAgent, Pod::Usage
XML support is very basic. nSite has been tested only on some Linux, Windows, and IRIX systems.
Steve Horsburgh <shorsburgh@horsburgh.com>
This script is based on the 1997 sitemapper.pl script by Ave Wrigley <wrigley@cre.canon.co.uk>.
Copyright (c) 2000, Horsburgh.com. All rights reserved.
This script is free software; you can redistribute it and/or modify it under the terms of the GNU GPL (see the file COPYING).