NAME

Carmen::AP11::HTMLHeuristics - Extracting metadata from HTML documents.


SYNOPSIS


  use Carmen::AP11::HTMLHeuristics;

  my $file = "dok.html";
  my $heuristics = Carmen::AP11::HTMLHeuristics->new($file);
  my $array_ref = $heuristics->getTitle;
  foreach my $title (@$array_ref) {
        print "Weight: ", $title->[0], "\n";
        print "Method: ", $title->[1], "\n";
        print "Title : ", $title->[2], "\n";
  }


DESCRIPTION

This module extracts metadata from HTML documents according to the heuristics defined in the file 'heuristics.txt' of the distribution. It currently provides methods for the title, the keywords and the abstract. Each method returns a reference to an array of array references, one per extracted candidate. Every candidate holds three elements in this order: the weight, the extraction method used, and the extracted value. See the SYNOPSIS for an example.
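Because every method returns the same list-of-triples structure, picking a single result is a matter of sorting by the first element. The following is a minimal sketch rather than part of the module: the sample data, the method labels 'h1', 'title' and 'meta', the helper name best_candidate, and the assumption that weights are numeric and that a larger weight marks a more reliable candidate are all hypothetical. Check 'heuristics.txt' for the actual weights and method names.

```perl
use strict;
use warnings;

# Hypothetical sample data in the shape the get* methods return:
# each entry is [ weight, method, value ].
my $candidates = [
    [ 0.4, 'h1',    'Welcome' ],
    [ 0.9, 'title', 'Example Title' ],
    [ 0.6, 'meta',  'Example' ],
];

# Hypothetical helper. Assumption: weights are numeric and a larger
# weight marks a more reliable candidate.
sub best_candidate {
    my ($list) = @_;
    return unless $list && @$list;    # nothing was extracted
    # Sort in descending order of weight and keep the first entry.
    my ($best) = sort { $b->[0] <=> $a->[0] } @$list;
    return $best;
}

my $best = best_candidate($candidates);
printf "%s (method: %s, weight: %s)\n", $best->[2], $best->[1], $best->[0];
```

With a real document, the array reference returned by getTitle, getKeyword or getAbstract would take the place of $candidates.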


METHODS

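new($file)
Constructor. Takes the name of the HTML document to process.

getTitle()
Extracts the title. Returns a reference to an array of [weight, method, value] array references.

getKeyword()
Extracts the keywords. Returns a reference to an array of [weight, method, value] array references.

getAbstract()
Extracts the abstract. Returns a reference to an array of [weight, method, value] array references.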


AUTHOR

Stefan Kokkelink.


COPYRIGHT

Copyright 2000/2001 Stefan Kokkelink. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.