nsitemap - functions for generating a site map for a given site URL
    use nsitemap;
    use LWP::UserAgent;
    my $ua = new LWP::UserAgent;
    my $sitemap = new nsitemap(
        EMAIL     => 'your@email.address',
        USERAGENT => $ua,
        ROOT      => 'http://your.ip.address/'
    );
    $sitemap->generate();
    $sitemap->option( 'VERBOSE' => 1 );
    my $len = $sitemap->option( 'SUMMARY_LENGTH' );
    my $root = $sitemap->root();
    for my $url ( $sitemap->urls() ) {
        if ( $sitemap->is_internal_url( $url ) ) {
            # do something ...
        }
        my @links   = $sitemap->links( $url );
        my $title   = $sitemap->title( $url );
        my $summary = $sitemap->summary( $url );
        my $depth   = $sitemap->depth( $url );
        my $digest  = $sitemap->MD5digest( $url );
    }
    $sitemap->traverse(
        sub {
            my ( $sitemap, $url, $depth, $flag ) = @_;
            if ( $flag == 0 ) {
                # do something at the start of a list of sub-pages ...
            }
            elsif ( $flag == 1 ) {
                # do something for each page ...
            }
            elsif ( $flag == 2 ) {
                # do something at the end of a list of sub-pages ...
            }
        }
    );
The nsitemap module creates a site map for a WWW site by traversing the
site using the WWW::Robot module. The nsitemap object has methods to access
a list of all the URLs in the site; a list of all the links for each URL;
page titles; page summaries; page fingerprints (MD5digest); and the depth,
or minimum number of links from the root URL to a page.
    my $sitemap = new nsitemap(
        EMAIL     => 'your@email.address',
        USERAGENT => new LWP::UserAgent,
        ROOT      => 'http://www.my.com/'
    );
Possible options are:
Method for generating the site map, based on the constructor options.
$site->generate();
Interface to get / set options after object construction.
    $site->option( 'VERBOSE' => 1 );
    my $len = $site->option( 'SUMMARY_LENGTH' );
Returns the root URL for the site.
my $root = $site->root();
Returns a list of all the URLs on the site map.
my @urls = $site->urls();
Returns 1 (one) if $url is an internal URL based on the ROOT value. Otherwise returns 0 (zero).
    if ( $site->is_internal_url( $url ) ) {
        # do something ...
    }
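As a rough illustration of what "internal based on the ROOT value" could mean, the sketch below treats a URL as internal when it begins with the ROOT string. This is an assumption: the module's actual test is not documented here, and looks_internal is a hypothetical helper, not part of nsitemap.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of nsitemap): a URL counts as internal
# when it starts with the ROOT string. Returns 1 or 0, like the method.
sub looks_internal {
    my ( $root, $url ) = @_;
    return index( $url, $root ) == 0 ? 1 : 0;
}

my $root = 'http://www.my.com/';
for my $url ( 'http://www.my.com/about.html', 'http://www.other.com/' ) {
    printf "%-32s %s\n", $url,
        looks_internal( $root, $url ) ? 'internal' : 'external';
}
```

A plain prefix test ignores differences in case, port, or trailing slashes, so a real implementation may well normalize URLs first.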
Returns a list of all the links from a given URL in the site map.
my @links = $site->links( $url );
Returns the title of the URL, based on the TITLE tag.
my $title = $site->title( $url );
Returns the MD5_hex (fingerprint) of the URL.
my $fingerprint = $site->MD5digest( $url );
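One possible use of a fingerprint is spotting pages that are reachable under more than one URL. The sketch below assumes the fingerprint is an MD5 hex digest of the page content (the text above does not say exactly what is hashed) and uses the core Digest::MD5 module. The fingerprint helper and the %content_for data are made up for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

# Hypothetical helper: fingerprint a page body as a 32-character hex string.
sub fingerprint {
    my ( $content ) = @_;
    return md5_hex( $content );
}

# Made-up page bodies standing in for fetched content.
my %content_for = (
    'http://www.my.com/'           => '<html>home</html>',
    'http://www.my.com/index.html' => '<html>home</html>',
    'http://www.my.com/about.html' => '<html>about</html>',
);

# Group URLs by digest; any group with more than one URL is a duplicate.
my %urls_by_digest;
for my $url ( sort keys %content_for ) {
    push @{ $urls_by_digest{ fingerprint( $content_for{ $url } ) } }, $url;
}
for my $digest ( sort keys %urls_by_digest ) {
    my @same = @{ $urls_by_digest{ $digest } };
    print "duplicate content: @same\n" if @same > 1;
}
```

With the real module, the same grouping could presumably be done by calling MD5digest for each URL returned by urls.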
Returns a summary of the URL, generated using HTML::Summary. If the URL has a NAME='description' META tag, returns the value of CONTENT. Otherwise it attempts to summarize the text.
    my $summary = $site->summary( $url );
Returns the minimum number of links to traverse from the root URL of the site to this URL. The root URL is at depth zero.
my $depth = $sitemap->depth( $url );
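The "minimum number of links" definition corresponds to a shortest-path distance, which a breadth-first search computes naturally. The sketch below is not the module's implementation; it works on a made-up %links hash (URL to list of linked URLs), and min_depths is a hypothetical helper.

```perl
use strict;
use warnings;

# Made-up link data: URL => list of URLs it links to.
my %links = (
    '/'  => [ '/a', '/b' ],
    '/a' => [ '/c' ],
    '/b' => [ '/c', '/' ],
    '/c' => [ '/d' ],
    '/d' => [],
);

# Breadth-first search: the first time a URL is reached is along a
# shortest path, so that hop count is its minimum depth.
sub min_depths {
    my ( $root, $links ) = @_;
    my %depth = ( $root => 0 );
    my @queue = ( $root );
    while ( @queue ) {
        my $url = shift @queue;
        for my $next ( @{ $links->{ $url } || [] } ) {
            next if exists $depth{ $next };    # already reached earlier
            $depth{ $next } = $depth{ $url } + 1;
            push @queue, $next;
        }
    }
    return \%depth;
}

my $depth = min_depths( '/', \%links );
printf "%s is at depth %d\n", $_, $depth->{ $_ } for sort keys %$depth;
```

Here '/c' is linked from both '/a' and '/b', and it ends up at depth 2 either way; the back-link from '/b' to '/' does not change the root's depth of zero.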
The traverse method walks the site map, starting at the root node (specified by -url), and visits each URL in the order that they would be displayed in a sequential site map of the site. The callback is called in a number of places in the traversal, as indicated by the $flag argument to the callback: 0 at the start of a list of sub-pages, 1 for each page, and 2 at the end of a list of sub-pages.
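To make the callback contract concrete, here is a small stand-in that can be run without crawling a site. The walk function and the %links hash are hypothetical; walk only mimics the three $flag values over a plain hash and passes the hash where the real method passes the nsitemap object. The exact order in which the module issues the callbacks, and which URL it passes for flags 0 and 2, are assumptions.

```perl
use strict;
use warnings;

# Made-up site structure: URL => list of sub-page URLs.
my %links = (
    '/'  => [ '/a', '/b' ],
    '/a' => [ '/a/1' ],
);

# Stand-in traversal: flag 1 for each page, flag 0 before its sub-pages,
# flag 2 after them. Each URL is visited at most once.
sub walk {
    my ( $map, $callback, $url, $depth, $seen ) = @_;
    $depth ||= 0;
    $seen  ||= {};
    return if $seen->{ $url }++;
    $callback->( $map, $url, $depth, 1 );
    my @subs = grep { !$seen->{ $_ } } @{ $map->{ $url } || [] };
    return unless @subs;
    $callback->( $map, $url, $depth, 0 );
    walk( $map, $callback, $_, $depth + 1, $seen ) for @subs;
    $callback->( $map, $url, $depth, 2 );
}

# The callback has the same shape as the one passed to traverse.
my @out;
walk( \%links, sub {
    my ( $map, $url, $depth, $flag ) = @_;
    my $pad = '  ' x $depth;
    if    ( $flag == 0 ) { push @out, $pad . "(sub-pages of $url start)" }
    elsif ( $flag == 1 ) { push @out, $pad . $url }
    elsif ( $flag == 2 ) { push @out, $pad . "(sub-pages of $url end)" }
}, '/' );
print "$_\n" for @out;
```

A callback written this way should carry over to $sitemap->traverse largely unchanged, since it relies only on the four arguments shown in the synopsis.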
LWP::UserAgent, HTML::Summary, WWW::Robot
Steve Horsburgh <shorsburgh@horsburgh.com>
This utility was inspired by the 1997 Sitemap.pm utility by Ave Wrigley <wrigley@cre.canon.co.uk>.
Copyright (c) 2000, Horsburgh.com. All rights reserved.
This script is free software; you can redistribute it and/or modify it under the GNU GPL. (See the file COPYING.)