Carmen::AP11::HTMLHeuristics - Extracting metadata from HTML documents.
use Carmen::AP11::HTMLHeuristics;

my $file = "dok.html";
my $heuristics = Carmen::AP11::HTMLHeuristics->new($file);

# getTitle returns a reference to an array of [weight, method, title] entries
my $array_ref = $heuristics->getTitle;
foreach my $title (@$array_ref) {
    print "Weight: ", $title->[0], "\n";
    print "Method: ", $title->[1], "\n";
    print "Title : ", $title->[2], "\n";
}
This module extracts metadata from HTML documents according to the heuristics defined in the file 'heuristics.txt' of the distribution. It currently provides methods for title, keywords and abstract. Each method returns an array reference of array references; each inner array holds the weight, the extraction method used and the extracted value, in that order (see the example above).
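Because every method returns candidates in the same shape, a caller will often want just one of them. The following is a minimal sketch, not part of the module: it uses made-up sample data in place of a real getTitle result, the helper name best_candidate is hypothetical, and it assumes that a larger weight marks a more reliable candidate, which 'heuristics.txt' should confirm.

```perl
use strict;
use warnings;

# Hypothetical sample data in the documented shape: [weight, method, value].
# The weights and method names are invented for illustration only.
my $titles = [
    [ 0.4, 'h1',    'Welcome' ],
    [ 0.9, 'title', 'Carmen Project Home' ],
    [ 0.6, 'meta',  'Carmen' ],
];

# Return the candidate with the largest weight, or undef for an empty list.
# Assumption: a larger weight means a more trustworthy candidate.
sub best_candidate {
    my ($candidates) = @_;
    my ($best) = sort { $b->[0] <=> $a->[0] } @$candidates;
    return $best;
}

my $best = best_candidate($titles);
printf "%s (method: %s, weight: %s)\n", $best->[2], $best->[1], $best->[0];
```

With a real document, the array reference returned by $heuristics->getTitle could be passed to best_candidate in place of $titles. The result should be checked with defined() first, since no candidates may have been found.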
new($file)
Creates a new Carmen::AP11::HTMLHeuristics object for the HTML file named by $file.
Stefan Kokkelink.
Copyright 2000/2001 Stefan Kokkelink. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.