boilerpipe

A Java library for boilerplate removal and full-text extraction from HTML pages

boilerpipe Ranking & Summary

  • License: Apache
  • Publisher Name: Christian Kohlschütter
  • Publisher web site: http://code.google.com/u/@UBhURFFSDxBAWAV8/
  • Operating Systems: Mac OS X
  • File Size: 2 MB


boilerpipe Description

boilerpipe is a free and open-source Java library that provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

boilerpipe already provides specific strategies for common tasks (for example, news article extraction) and can be easily extended for individual problem settings.

Extraction is very fast (milliseconds), needs only the input document (no global or site-level information), and is usually quite accurate.

Detailed instructions on how to install and use boilerpipe on your Mac are available HERE. boilerpipe is cross-platform and runs on any operating system with Java support (e.g. Mac OS X, Windows, Linux).
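
To give a feel for what "needs only the input document" can mean in practice, here is a minimal, standard-library-only Java sketch of single-document boilerplate filtering. It is not boilerpipe's code or API: the class name `BlockFilterSketch`, the method names, and both thresholds are hypothetical, and the regex-based HTML splitting is a deliberate simplification. The sketch splits a page into blocks, then keeps blocks that have enough words and a low share of link text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names and thresholds are assumptions, not boilerpipe's.
public class BlockFilterSketch {

    // Hypothetical thresholds chosen for this example.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    // Tags treated as block boundaries (a rough approximation of HTML layout).
    private static final Pattern BLOCK_BREAK = Pattern.compile(
        "(?i)</?(p|div|li|ul|ol|td|tr|table|h[1-6]|br|section|article|nav|header|footer)\\b[^>]*>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern SCRIPT_STYLE = Pattern.compile("(?is)<(script|style)\\b.*?</\\1>");

    // Counts whitespace-separated tokens in plain text.
    static int countWords(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Fraction of a block's words that sit inside <a>...</a> elements.
    static double linkDensity(String blockHtml) {
        int total = countWords(TAG.matcher(blockHtml).replaceAll(" "));
        if (total == 0) {
            return 1.0; // treat empty blocks as pure boilerplate
        }
        int linked = 0;
        Matcher m = ANCHOR.matcher(blockHtml);
        while (m.find()) {
            linked += countWords(TAG.matcher(m.group(1)).replaceAll(" "));
        }
        return (double) linked / total;
    }

    // Keeps blocks that are long enough and not dominated by link text.
    static String extractMainText(String html) {
        String cleaned = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(cleaned)) {
            String text = TAG.matcher(block).replaceAll(" ").replaceAll("\\s+", " ").trim();
            if (countWords(text) >= MIN_WORDS && linkDensity(block) <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String html =
            "<div><a href='/'>Home</a> <a href='/news'>News</a> <a href='/about'>About</a></div>"
            + "<p>Boilerplate removal keeps the long article paragraph because it has many words"
            + " and no links at all.</p>"
            + "<p><a href='/x'>Read more related stories from our partners around the web today"
            + " please</a></p>";
        System.out.println(extractMainText(html));
    }
}
```

In this sample the navigation bar is dropped for being too short and the "Read more" teaser is dropped for consisting entirely of link text, so only the article paragraph remains. A real extractor would use a proper HTML parser and better-tuned rules than this sketch.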


boilerpipe Related Software