Perl Cookbook

Chapter 20: Web Automation

20.5. Converting HTML to ASCII

Problem

You want to convert an HTML file into formatted plain ASCII.

Solution

If you have an external formatter like lynx, call it:

$ascii = `lynx -dump $filename`;
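The backtick form passes $filename through the shell, so a name containing spaces or shell metacharacters can be misread. As a minimal sketch, assuming a Unix-like system where the list form of a piped open is available, you can run lynx without a shell. The helper names capture_command and lynx_dump are illustrative, not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper: run a command without involving the shell
# and return everything it writes to standard output.
sub capture_command {
    my @cmd = @_;
    open(my $fh, '-|', @cmd) or die "cannot run $cmd[0]: $!";
    local $/;                       # slurp mode: read all output at once
    my $output = <$fh>;
    close($fh) or die "$cmd[0] failed: " . ($! || "wait status $?");
    return $output;
}

# Hypothetical wrapper: the filename reaches lynx as a single argument,
# exactly as given.
sub lynx_dump {
    my ($filename) = @_;
    return capture_command('lynx', '-dump', $filename);
}
```

With this in place, `$ascii = lynx_dump($filename);` replaces the backtick call.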

If you want to do the conversion within your program and don't care about the things that the HTML::FormatText formatter doesn't yet handle (tables and frames), use HTML::FormatText:

use HTML::FormatText;
use HTML::Parse;

$html = parse_htmlfile($filename);
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
$ascii = $formatter->format($html);

Discussion

These examples both assume you have the HTML text in a file. If your HTML is in a variable, you need to write it to a file for lynx to read. If you are using HTML::FormatText, use the HTML::TreeBuilder module to parse the string directly:

use HTML::TreeBuilder;
use HTML::FormatText;

$html = HTML::TreeBuilder->new();
$html->parse($document);
$html->eof;     # signal end of input so the tree is complete

$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);

$ascii = $formatter->format($html);
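For the lynx route, the step of writing the HTML in a variable out to a file might look like the following sketch. It uses the standard File::Temp module. The helper name html_to_tempfile is hypothetical, and the .html suffix is an assumption made so that lynx treats the file as HTML rather than plain text:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write an HTML string to a temporary file and
# return the file's path. The file is removed when the program exits.
sub html_to_tempfile {
    my ($document) = @_;
    my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
    print {$fh} $document;
    close($fh) or die "cannot close $path: $!";
    return $path;
}
```

You could then call lynx on the result, as in `$path = html_to_tempfile($document); $ascii = `lynx -dump $path`;`.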

If you use Netscape, its "Save as" option with the type set to "Text" does the best job with tables.

See Also

The documentation for the CPAN modules HTML::Parse, HTML::TreeBuilder, and HTML::FormatText; your system's lynx(1) manpage; Recipe 20.6

