HTML stripping

Description

HTML2TEXT is a tool for removing HTML markup from web pages. Only the plain text and the necessary metadata remain.

Input

The tool expects a local HTML file as input. The file may be in any encoding.

Output

HTML2TEXT writes plain text to standard output, which can be redirected to a file. The output is encoded as UTF-8.
The text is preceded by a metadata line in the following format:
<source><location>URL</location><date>YYYY-MM-DD</date></source>
URL - the URL or filename of the input
YYYY-MM-DD - the date of processing
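
As an illustration of the metadata format above, the following Python sketch builds such a line. The function name make_source_header is hypothetical and not part of HTML2TEXT. Whether the tool escapes special characters in the URL is not documented, so the sketch assumes the location is inserted unchanged.

```python
import datetime


def make_source_header(location, processed=None):
    """Build a <source> metadata line in the format described above.

    Assumption: the location is inserted verbatim (no XML escaping),
    because the documentation does not describe any escaping.
    """
    if processed is None:
        processed = datetime.date.today()
    # isoformat() on a date yields YYYY-MM-DD.
    return (
        "<source><location>{}</location>"
        "<date>{}</date></source>".format(location, processed.isoformat())
    )


print(make_source_header("main.html", datetime.date(2015, 3, 9)))
# prints: <source><location>main.html</location><date>2015-03-09</date></source>
```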

This <source> format is the standard input format for further processing. The output for several HTML files can be stored in a single file by simple concatenation.
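
Because each document starts with its own <source> line, a concatenated file can be split back into individual documents. The following Python sketch shows one possible way to do this. It assumes that the metadata line appears exactly as documented above and that the stripped text never contains a string of the same shape. The names SOURCE_RE and split_documents are hypothetical.

```python
import re

# Matches one metadata line of the documented form.
SOURCE_RE = re.compile(
    r"<source><location>(.*?)</location>"
    r"<date>(\d{4}-\d{2}-\d{2})</date></source>"
)


def split_documents(text):
    """Yield (location, date, body) for each document in concatenated output."""
    matches = list(SOURCE_RE.finditer(text))
    for i, m in enumerate(matches):
        # The body runs from the end of this header to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        yield m.group(1), m.group(2), text[m.end():end].strip()


sample = (
    "<source><location>a.html</location><date>2015-03-09</date></source>\n"
    "First page text.\n"
    "<source><location>b.html</location><date>2015-03-10</date></source>\n"
    "Second page text.\n"
)
for location, date, body in split_documents(sample):
    print(location, date, body)
```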

Parameters

java -jar html2text.jar INPUT
INPUT    path to the local input file

Examples

java -jar html2text.jar main.html >> output.txt
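
The example above appends the output for one file to output.txt. To process several files in the same way, the call can be wrapped in a small script. The following Python sketch is one possible approach, not part of HTML2TEXT. It assumes Java is on the PATH and html2text.jar is in the working directory. The function names and the file about.html are hypothetical.

```python
import subprocess


def html2text_command(input_path, jar="html2text.jar"):
    """Return the documented command line as an argument list."""
    return ["java", "-jar", jar, input_path]


def strip_all(input_paths, output_path, jar="html2text.jar"):
    """Run HTML2TEXT on each file and append its output to output_path."""
    # Binary append mode mirrors the shell's >> redirection.
    with open(output_path, "ab") as out:
        for path in input_paths:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(html2text_command(path, jar), stdout=out, check=True)


# Example call (requires Java and html2text.jar):
# strip_all(["main.html", "about.html"], "output.txt")
```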

Download

http://asvdoku.informatik.uni-leipzig.de/corpora/data/uploads/html2text.zip