HTML stripping


HTML2TEXT is a tool for removing HTML markup from web pages. Only the necessary metadata and the plain text remain.


The tool expects a local HTML file as input. The file may be in any encoding.


The output of HTML2TEXT is plain text, which is written to standard output and can be redirected into a file. Output is encoded as UTF-8.
A metadata header is inserted before the text, in the following format:
URL - the URL or filename of the input
YYYY-MM-DD - the date of processing

This <source> format is the standard input format for further processing. Several stripped HTML files can be stored in a single file by simple concatenation.


java -jar html2text.jar INPUT
INPUT    path to the local input file


java -jar html2text.jar main.html >> output.txt
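
Because stripped files can be concatenated, several HTML files can be collected into one output file by calling the tool once per input and appending each result. The following is a minimal sketch, not part of HTML2TEXT itself: the function name strip_all and the HTML2TEXT variable are hypothetical, and it assumes html2text.jar is in the current directory.

```shell
# Hypothetical helper: strip each HTML file and append the result to one file.
# Usage: strip_all OUTPUT INPUT...
# Assumption: html2text.jar is in the current directory. Set the HTML2TEXT
# variable to use a different command line.
strip_all() {
    out=$1
    shift
    for f in "$@"; do
        # The expansion is left unquoted on purpose so that the default
        # command is split into the words: java, -jar, html2text.jar
        ${HTML2TEXT:-java -jar html2text.jar} "$f" >> "$out" || return 1
    done
}

# Example call (commented out so that sourcing this file runs nothing):
# strip_all output.txt main.html about.html
```

The function stops at the first input for which the command fails, so output.txt then holds only the files processed up to that point.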