pullparser

A simple "pull API" for HTML parsing, modelled on Perl's HTML::TokeParser. Many straightforward HTML parsing tasks are easier this way than with the callback-based HTMLParser module. pullparser.PullParser is a subclass of HTMLParser.HTMLParser.
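
To make the difference between a "pull" and a callback ("push") interface concrete, here is a rough sketch of the idea. It is not pullparser's implementation: it targets Python 3's html.parser module, and the names TinyPullParser, Token, get_token and extract_links are hypothetical, invented for this illustration. It also parses the whole input up front, which a real implementation need not do.

```python
from collections import deque, namedtuple
from html.parser import HTMLParser

# Hypothetical token record: type is "starttag", "endtag" or "data".
Token = namedtuple("Token", "type data attrs")


class TinyPullParser(HTMLParser):
    """Illustration only: queue up parser callbacks so the caller can pull them."""

    def __init__(self, html):
        super().__init__()
        self._tokens = deque()
        # Eagerly parse everything; the callbacks below fill the queue.
        self.feed(html)
        self.close()

    def handle_starttag(self, tag, attrs):
        self._tokens.append(Token("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self._tokens.append(Token("endtag", tag, None))

    def handle_data(self, data):
        self._tokens.append(Token("data", data, None))

    def get_token(self):
        """Return the next token, or None when the input is exhausted."""
        return self._tokens.popleft() if self._tokens else None


def extract_links(html):
    """Return (url, text) pairs for each <a> element, pulling tokens on demand."""
    p = TinyPullParser(html)
    links = []
    token = p.get_token()
    while token is not None:
        if token.type == "starttag" and token.data == "a":
            url = dict(token.attrs).get("href", "-")
            parts = []
            token = p.get_token()
            # Gather text until the closing </a> (or the end of input).
            while token is not None and not (
                token.type == "endtag" and token.data == "a"
            ):
                if token.type == "data":
                    parts.append(token.data)
                token = p.get_token()
            # Collapse runs of whitespace into single spaces.
            links.append((url, " ".join("".join(parts).split())))
        token = p.get_token()
    return links
```

The point of the sketch is that the calling code drives the loop and asks for the next token when it wants one, so state such as "currently inside an <a> element" lives in ordinary local variables rather than being spread across handler methods.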

Examples:

This program extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the <a>...</a> tags:

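import pullparser, sys

f = open(sys.argv[1])
p = pullparser.PullParser(f)
# Iterate over every <a> start tag and </a> end tag in the document.
for token in p.tags("a"):
    if token.type == "endtag": continue
    # "-" stands in for a missing href attribute.
    url = dict(token.attrs).get("href", "-")
    try:
        # Read the text up to the closing </a>.
        text = p.get_compressed_text(endat=("endtag", "a"))
    except pullparser.NoMoreTokensError:
        break
    print "%s\t%s" % (url, text)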

This program extracts the <title> from the document:

import pullparser, sys

f = open(sys.argv[1])
p = pullparser.PullParser(f)
# Skip ahead to the first <title> tag, then read the text that follows it.
if p.get_tag("title"):
    title = p.get_compressed_text()
    print "Title: %s" % title

Thanks to Gisle Aas, who wrote HTML::TokeParser.

Download

All documentation (including this web page) is included in the distribution. This is the initial alpha release, but it's simple, working, documented & tested, and I don't anticipate any significant changes.

Development release.

For installation instructions, see the INSTALL file included in the distribution.

FAQs

John J. Lee, January 2004.