NAME

nsite - tool for generating WWW site maps


SYNOPSIS

    nsite.pl 
        [ -verbose ] 
        [ -help ]
        [ -doc ]
        [ -version ]
        [ -depth <depth> ] 
        [ -proxy <proxy URL> ] 
        [ -[no]envproxy ] 
        [ -agent <agent> ]
        [ -authen ] 
        [ -format <html|text|xml|none> ] 
        [ -summary <number of chars> ] 
        [ -title <page title> ] 
        [ -email <e-mail address> ]
        [ -index ]
        [ -nolinks ]
        [ -stats <filename> ]
        [ -output <filename> ]
        [ -altstart <filename> ]
        -url <root URL>


DESCRIPTION

nSite generates a site map for a given WWW site. It walks the site from the root URL and generates an HTML, TEXT, or XML link page that illustrates the structure of the site.

The HTML site map consists of the page URL, title, unique fingerprint, summary, and a list of internal and external links. The links are 'clickable', with internal links shown in blue and external links in orange.
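The internal/external split can be illustrated with a short sketch. This is not nSite's own code (nSite is a Perl script); it is a Python illustration that assumes a link counts as internal when its host matches the host of the root URL. The function name classify_links and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def classify_links(page_url, hrefs, root_url):
    """Split the hrefs found on page_url into (internal, external) absolute URLs.

    Assumption: "internal" means "same host as the root URL".
    """
    root_host = urlsplit(root_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links against the page
        host = urlsplit(absolute).netloc.lower()
        (internal if host == root_host else external).append(absolute)
    return internal, external


internal, external = classify_links(
    "http://www.example.com/docs/index.html",
    ["intro.html", "/about.html", "http://other.example.org/"],
    "http://www.example.com/",
)
```

Here internal holds the two www.example.com pages and external holds the other.example.org link.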

The TEXT site map consists of the page URL, title, and unique fingerprint.

The XML site map is a list of XML <LINK>/<URL> structures.

The structure reflects the depth from the root page to the pages listed; i.e., the first-level bullets are pages accessible directly from the root page, the second-level bullets are pages accessible from those pages, and so on. nSite assumes a typical top-down site structure and walks it breadth-first, so each page is listed at the shallowest level where it is first found, and pages may appear in a different order than the site's author intended.
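A breadth-first walk of this kind can be sketched as follows. This is a Python illustration over an in-memory link table, not nSite's implementation; the name map_depths, the max_depth parameter (standing in for the -depth option), and the sample pages are all hypothetical.

```python
from collections import deque


def map_depths(links, root, max_depth=None):
    """Return {page: depth}, visiting pages breadth-first from root.

    links maps each page to the list of pages it links to.
    max_depth=None means unlimited depth.
    """
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        if max_depth is not None and depths[page] >= max_depth:
            continue  # do not follow links beyond the requested depth
        for target in links.get(page, []):
            if target not in depths:  # first discovery fixes the page's depth
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


links = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/"],
    "/c": [],
}
depths = map_depths(links, "/")
shallow = map_depths(links, "/", max_depth=1)
```

In this example "/c" is reached from both "/a" and "/b", but it is recorded once, under whichever page reaches it first. That is why a page can end up somewhere other than where the site's author would have placed it.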


OPTIONS

-url <root URL>

Option to specify a root URL to generate a site map for. This option is required.

-depth <depth>

Option to specify the depth of the site map generated. If not specified, nSite will generate a site map of unlimited depth.

-email <email address>

Option to specify the email address that the robot reports to the sites it fetches pages from.

-proxy <proxy URL>

Specify an HTTP proxy to use.

-[no]envproxy

If -envproxy is set, the proxy specified by the $http_proxy environment variable will be used (this is the default behaviour). Use -noenvproxy to suppress this. -proxy takes precedence over -envproxy.
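The precedence described above can be written out as a small decision function. This is a Python sketch, not nSite's code; the function name resolve_proxy and its parameters are hypothetical stand-ins for the -proxy and -[no]envproxy options, and the environment is passed in as a dictionary so the example is easy to check.

```python
import os


def resolve_proxy(proxy=None, envproxy=True, environ=None):
    """Pick the proxy URL to use, or None for a direct connection.

    proxy    -- value of -proxy, if given
    envproxy -- True for -envproxy (the default), False for -noenvproxy
    environ  -- mapping to read http_proxy from (defaults to os.environ)
    """
    environ = os.environ if environ is None else environ
    if proxy:
        return proxy  # -proxy takes precedence over -envproxy
    if envproxy:
        return environ.get("http_proxy") or None
    return None


env = {"http_proxy": "http://envproxy.example.com:3128"}
explicit = resolve_proxy("http://proxy.example.com:8080", True, env)
from_env = resolve_proxy(None, True, env)
suppressed = resolve_proxy(None, False, env)
```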

-agent <agent>

Allows the user to specify an agent for the robot to pretend to be (e.g. 'Mozilla/4.5'). This can be necessary for sites that use browser sniffing to decide which content to serve.
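As an illustration of what the -agent and -email values amount to on the wire, the following Python sketch builds a request carrying both. It assumes the agent string is sent as the HTTP User-Agent header and the email address as the HTTP From header; how nSite itself passes these to LWP is not shown here. The helper name build_request and the example values are hypothetical, and no network request is made.

```python
from urllib.request import Request


def build_request(url, agent=None, email=None):
    """Build (but do not send) a request identifying the robot."""
    headers = {}
    if agent:
        headers["User-Agent"] = agent  # what the robot pretends to be
    if email:
        headers["From"] = email  # contact address reported to the site
    return Request(url, headers=headers)


req = build_request(
    "http://www.example.com/",
    agent="Mozilla/4.5",
    email="webmaster@example.com",
)
```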

-format <formatting option>

Option for specifying the output format of the site map. Possible values are:

html
Simple HTML bulleted list (default). Consists of the page URL, title, unique fingerprint, summary, and a list of internal and external links. The links are 'clickable', with internal links shown in blue and external links in orange.

text
Plain text with indenting. Consists of the page URL, title, and a unique fingerprint.

xml
An XML graph of linkage between pages. Consists of a list of XML <LINK>/<URL> structures.

none
Do not output the site map. Useful when you only want the stats file (see -stats).

-summary <number of chars>

Automatically extract a summary to display with the title. The summary is truncated at the specified number of characters (default: 200). To disable the summary display, set the number of chars to -1.
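The truncation rule can be sketched in a few lines. This is a Python illustration, not nSite's extraction logic: it assumes the summary is taken from already-extracted page text, collapses runs of whitespace first (an assumption), and treats any negative limit like -1. The function name summarize is hypothetical.

```python
def summarize(text, limit=200):
    """Return at most `limit` characters of text, or None if disabled."""
    if limit < 0:
        return None  # -summary -1 disables the summary
    collapsed = " ".join(text.split())  # assumption: normalize whitespace
    return collapsed[:limit]


short = summarize("Welcome   to\nthe example site", 7)
default = summarize("x" * 300)
disabled = summarize("Welcome to the example site", -1)
```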

-title <page title>

Option to specify a page title for the site map.

-authen

Option to use LWP::AuthenAgent to get HTML pages. This allows the user to type a username / password for pages that are access controlled.

-index

Option to display an index (table of contents) for the site map.

-nolinks

Option to disable the display of the internal and external links for each page in the site map.

-altstart <filename>

Option to start the mapping at a specific file instead of the default index file.

-stats <filename>

Option to output a statistics file with lines containing the following:

 URL<tab>FINGERPRINT<tab>NUMBER_OF_LINKS<tab>DEPTH<tab>TITLE
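A stats file in this layout is straightforward to read back. The Python sketch below assumes one page per line, five tab-separated fields, and integer values for NUMBER_OF_LINKS and DEPTH; the names StatsRow and parse_stats and the sample data are hypothetical.

```python
from typing import NamedTuple


class StatsRow(NamedTuple):
    url: str
    fingerprint: str
    links: int
    depth: int
    title: str


def parse_stats(text):
    """Parse stats-file text into a list of StatsRow records."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first four tabs only, so the title is kept whole.
        url, fingerprint, links, depth, title = line.split("\t", 4)
        rows.append(StatsRow(url, fingerprint, int(links), int(depth), title))
    return rows


sample = (
    "http://www.example.com/\tabc123\t12\t0\tHome\n"
    "http://www.example.com/about.html\tdef456\t3\t1\tAbout us\n"
)
rows = parse_stats(sample)
deepest = max(row.depth for row in rows)
```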

-output <filename>

Option to output the site map to a file. (Defaults to standard output.)

-help

Display a help message to standard output, with a brief description of nSite and its command-line switches.

-doc

Display the full documentation for nSite, generated from the embedded pod format documentation.

-version

Print out the current version number for nSite.

-verbose

Turn on verbose messages.


ENVIRONMENT

nSite makes use of the $http_proxy environment variable, if it is set.


PREREQUISITES

    HTML::Entities
    Getopt::Long
    LWP::AuthenAgent
    LWP::UserAgent
    Pod::Usage


BUGS

XML support is very basic. nSite has been tested only on some Linux, Windows, and IRIX systems.


AUTHOR

Steve Horsburgh <shorsburgh@horsburgh.com>


CREDITS

This script is based on the 1997 sitemapper.pl script by Ave Wrigley <wrigley@cre.canon.co.uk>.


COPYRIGHT

Copyright (c) 2000, Horsburgh.com. All rights reserved.

This script is free software; you can redistribute it and/or modify it under the GNU GPL (see the file COPYING).