unfluffStatistical HTML content extraction in Python | |
Download |
unfluff Ranking & Summary
Advertisement
unfluff Tags
unfluff Description
Statistical HTML content extraction in Python unfluff is a statistical content extraction tool written in python - remove the useless fluff from arbitrary HTML pages.Based on methods discussed (and implemented) in various places, but most directly: * http://www.spicylogic.com/allenday/blog/2008/05/27/statistical-html-content-extraction/ * http://www2003.org/cdrom /papers/refereed/p583/p583-gupta.htmlAn experiment / work in progress.Usage:The command line tool can either take a file or a URL to extract. It prints the content tree to stdout:unfluff /path/to/something.htmlorunfluff -u 'http://some-website.com/interesting-article.html'The unfluff library has a few functions, which pretty much all do the same thing via different formats:import unfluffunfluff.from_url('http://whatever/')unfluff.from_file('/tmp/input.html')unfluff.from_string("< html >inline content< /html >")Both of these are native (C) extensions, which means you're best off looking for them in your friendly neighborhood package manager. Requirements: · Python · lxml · SciPy
unfluff Related Software