NAME

nsitemap - functions for generating a site map for a given site URL


SYNOPSIS

    use nsitemap;
    use LWP::UserAgent;
    my $ua = LWP::UserAgent->new;
    my $sitemap = nsitemap->new(
        EMAIL       => 'your@email.address',
        USERAGENT   => $ua,
        ROOT        => 'http://your.ip.address/'
    );
    $sitemap->generate();
    $sitemap->option( 'VERBOSE' => 1 );
    my $len = $sitemap->option( 'SUMMARY_LENGTH' );
    my $root = $sitemap->root();
    for my $url ( $sitemap->urls() )
    {
        if ( $sitemap->is_internal_url( $url ) )
        {
            # do something ...
        }
        my @links   = $sitemap->links( $url );
        my $title   = $sitemap->title( $url );
        my $summary = $sitemap->summary( $url );
        my $depth   = $sitemap->depth( $url );
        my $digest  = $sitemap->MD5digest( $url );
    }
    $sitemap->traverse(
        sub {
            my ( $sitemap, $url, $depth, $flag ) = @_;
            if ( $flag == 0 )
            {
                # do something at the start of a list of sub-pages ...
            }
            elsif( $flag == 1 )
            {
                # do something for each page ...
            }
            elsif( $flag == 2 )
            {
                # do something at the end of a list of sub-pages ...
            }
        }
    );


DESCRIPTION

The nsitemap module creates a site map for a WWW site by traversing the site using the WWW::Robot module. The nsitemap object provides methods to access a list of all the URLs in the site; a list of all the links for each URL; page titles; page summaries; page fingerprints (MD5digest); and the depth, or minimum number of links from the root URL to a page.


CONSTRUCTOR

nsitemap->new [ $option => $value ] ...

    my $sitemap = nsitemap->new(
        EMAIL       => 'your@email.address',
        USERAGENT   => LWP::UserAgent->new,
        ROOT        => 'http://www.my.com/'
    );

Possible options are:

EMAIL
The email address the robot uses to identify itself. This option is required.

ROOT
Root URL of the site for which the site map is being created. This option is required.

USERAGENT
User agent object (typically created with LWP::UserAgent->new) used by the robot. This option is required.

VERBOSE
Verbose flag (0 or 1). When set to 1, progress messages are printed during traversal. Defaults to 0.

SUMMARY_LENGTH
Maximum length of (automatically generated) summary. Defaults to 200.

DEPTH
Maximum depth of traversal. Defaults to no limit.
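
The split between required and optional settings can be pictured as a merge of caller options over a table of defaults. This is a minimal sketch, not the module's actual constructor: build_options, %DEFAULTS and @REQUIRED are hypothetical names, and using undef to mean "no limit" for DEPTH is an assumption.

```perl
use strict;
use warnings;

# Hypothetical defaults table mirroring the documented option defaults.
# DEPTH => undef stands for "no limit" (an assumption of this sketch).
my %DEFAULTS = (
    VERBOSE        => 0,
    SUMMARY_LENGTH => 200,
    DEPTH          => undef,
);
my @REQUIRED = qw( EMAIL ROOT USERAGENT );

# Merge caller-supplied options over the defaults and insist on the
# required ones. Returns a hash reference of the effective options.
sub build_options {
    my %given = @_;
    for my $name ( @REQUIRED ) {
        die "Missing required option $name\n" unless defined $given{$name};
    }
    return { %DEFAULTS, %given };
}

my $opts = build_options(
    EMAIL     => 'your@email.address',
    ROOT      => 'http://www.my.com/',
    USERAGENT => 'placeholder for an LWP::UserAgent object',
);
print "SUMMARY_LENGTH = $opts->{SUMMARY_LENGTH}\n";
```

Because the caller's pairs come last in the merged list, any option passed explicitly overrides its default.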


METHODS

generate( )

Generates the site map by traversing the site from ROOT, using the options given to the constructor.

    $sitemap->generate();

option( $option [=> $value ] )

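Gets or sets an option after object construction. Called with a name and a value, it sets the option; called with a name alone, it returns the current value.

    $sitemap->option( 'VERBOSE' => 1 );
    my $len = $sitemap->option( 'SUMMARY_LENGTH' );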

root( )

Returns the root URL for the site.

    my $root = $sitemap->root();

urls( )

Returns a list of all the URLs on the site map.

    my @urls = $sitemap->urls();

is_internal_url( $url )

Returns 1 if $url is internal to the site, as determined by the ROOT value; otherwise returns 0.

    if ( $sitemap->is_internal_url( $url ) )
    {
        # do something ...
    }
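
How "internal" is decided is not spelled out here. One plausible reading, shown as a minimal sketch, is a simple prefix test against ROOT. The helper name is_internal is hypothetical, and the real method may compare URLs differently (for example by host name).

```perl
use strict;
use warnings;

# Hypothetical helper: treat a URL as internal when it begins with the
# root URL. Returns 1 or 0, matching the documented return values.
sub is_internal {
    my ( $root, $url ) = @_;
    return index( $url, $root ) == 0 ? 1 : 0;
}

my $root = 'http://www.my.com/';
for my $url ( 'http://www.my.com/about.html', 'http://www.example.org/' ) {
    printf "%-30s %s\n", $url,
        is_internal( $root, $url ) ? 'internal' : 'external';
}
```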

links( $url )

Returns a list of all the links from a given URL in the site map.

    my @links = $sitemap->links( $url );

title( $url )

Returns the title of the page at $url, taken from its TITLE tag.

    my $title = $sitemap->title( $url );

MD5digest( $url )

Returns the hexadecimal MD5 digest (fingerprint) of the page at $url.

    my $fingerprint = $sitemap->MD5digest( $url );
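
A fingerprint of this kind is useful for spotting pages whose content has changed between two runs. The sketch below uses the standard Digest::MD5 module directly. It assumes the digest is computed over the page content, and fingerprint is a hypothetical helper rather than part of nsitemap.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

# Hypothetical helper: 32-character hexadecimal MD5 digest of a page body.
sub fingerprint {
    my ( $content ) = @_;
    return md5_hex( $content );
}

# Any change in content, however small, yields a different digest.
my $old = fingerprint( 'hello world' );
my $new = fingerprint( 'hello world!' );

print "old: $old\n";    # old: 5eb63bbbe01eeed093cb22bb8f5acdc3
print "page changed\n" if $old ne $new;
```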

summary( $url )

Returns a summary of the page at $url, generated using HTML::Summary. If the page has a META tag with NAME='description', the value of its CONTENT attribute is returned. Otherwise the method attempts to summarize the page text.

    my $summary = $sitemap->summary( $url );
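
The "description first, generated summary second" rule can be illustrated without HTML::Summary. The sketch below is a crude stand-in, not the module's code: simple_summary is a hypothetical helper, it finds the META tag with a regular expression that expects NAME before CONTENT, and it "summarizes" by simply truncating the page text to the given length.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the documented fallback order.
sub simple_summary {
    my ( $html, $max_len ) = @_;

    # Prefer the author's own description when one is present.
    if ( $html =~ /<meta\s+name=["']description["']\s+content=["']([^"']*)["']/i ) {
        return $1;
    }

    # Otherwise strip tags, collapse whitespace and truncate. This only
    # approximates what a real summarizer such as HTML::Summary does.
    ( my $text = $html ) =~ s/<[^>]*>/ /g;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return substr( $text, 0, $max_len );
}

print simple_summary( '<meta name="description" content="About us">', 200 ), "\n";
print simple_summary( '<p>Hello   there</p>', 5 ), "\n";
```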

depth( $url )

Returns the minimum number of links to traverse from the root URL of the site to this URL. The root URL is at depth zero.

    my $depth = $sitemap->depth( $url );
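
The minimum number of links from the root is what a breadth-first search computes. The following sketch shows the idea on a small in-memory link table. compute_depths and the %links data are hypothetical, and nothing here is taken from the module's own implementation.

```perl
use strict;
use warnings;

# Breadth-first search from the root over a hash of outgoing links.
# A URL is first reached along a shortest path, so the depth recorded
# at that moment is the minimum number of links from the root.
sub compute_depths {
    my ( $root, $links ) = @_;
    my %depth = ( $root => 0 );
    my @queue = ( $root );
    while ( defined( my $url = shift @queue ) ) {
        for my $next ( @{ $links->{$url} || [] } ) {
            next if exists $depth{$next};    # already reached by a shorter path
            $depth{$next} = $depth{$url} + 1;
            push @queue, $next;
        }
    }
    return \%depth;
}

# Hypothetical link table: URL => list of URLs it links to.
my %links = (
    '/'  => [ '/a', '/b' ],
    '/a' => [ '/a/deep', '/b' ],
    '/b' => [ '/a/deep' ],
);
my $depth = compute_depths( '/', \%links );
print "$_ => $depth->{$_}\n" for sort keys %$depth;
```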

traverse( \&callback )

The traverse method walks the site map, starting at the root node (specified by the ROOT option), and visits each URL in the order in which it would be displayed in a sequential site map of the site. The callback receives the arguments ( $sitemap, $url, $depth, $flag ) and is called at several points in the traversal, as indicated by the $flag argument:

$flag = 0
Triggered before each set of daughter URLs of a given URL.

$flag = 1
Triggered for each URL.

$flag = 2
Triggered after each set of daughter URLs of a given URL.
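
The three flag values map naturally onto a nested list: open a list on 0, emit an item on 1, close the list on 2. The sketch below imitates the traversal on a toy tree to show a callback of that shape. The walk function and %daughters table are hypothetical, undef is passed where the real method would pass the nsitemap object, and the exact ordering of the flag 1 call relative to flag 0 is an assumption.

```perl
use strict;
use warnings;

# Toy site map: each URL maps to its daughter URLs (hypothetical data).
my %daughters = (
    '/'  => [ '/a', '/b' ],
    '/a' => [ '/a/1' ],
);

# Hypothetical stand-in for traverse(): report the URL itself (flag 1),
# then bracket its daughters with a flag 0 call and a flag 2 call.
sub walk {
    my ( $callback, $url, $depth ) = @_;
    $callback->( undef, $url, $depth, 1 );
    my @kids = @{ $daughters{$url} || [] };
    return unless @kids;
    $callback->( undef, $url, $depth, 0 );
    walk( $callback, $_, $depth + 1 ) for @kids;
    $callback->( undef, $url, $depth, 2 );
}

# Callback that collects an indented outline, one line per event.
my @out;
walk(
    sub {
        my ( $sitemap, $url, $depth, $flag ) = @_;
        my $pad = '  ' x $depth;
        if    ( $flag == 0 ) { push @out, "$pad<ul>" }
        elsif ( $flag == 1 ) { push @out, "$pad<li>$url" }
        elsif ( $flag == 2 ) { push @out, "$pad</ul>" }
    },
    '/', 0
);
print "$_\n" for @out;
```

With the real module, the same callback body could be passed to $sitemap->traverse( ... ) unchanged, since it only relies on the documented argument order.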


SEE ALSO

    LWP::UserAgent
    HTML::Summary
    WWW::Robot


AUTHOR

Steve Horsburgh <shorsburgh@horsburgh.com>


CREDITS

This utility was inspired by the 1997 Sitemap.pm utility by Ave Wrigley <wrigley@cre.canon.co.uk>.


COPYRIGHT

Copyright (c) 2000, Horsburgh.com. All rights reserved.

This script is free software; you can redistribute it and/or modify it under the GNU GPL (see the file COPYING).