HTML2TEXT is a tool for removing HTML markup from websites. Only necessary metadata and plain text remain.
The tool expects a local file in HTML format as input. The file may be in any encoding.
HTML2TEXT writes plain text to standard output, which can be redirected into a file. The output is encoded as UTF-8.
A metadata header is inserted before the text. It has the following format:
URL - the URL or filename of the input
YYYY-MM-DD - the date of processing
This <source> format is the standard input format for further processing. Several stripped HTML files can be stored in a single file by simple concatenation.
java -jar html2text.jar INPUT
INPUT    path to the local input file
java -jar html2text.jar main.html >> output.txt
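Since several stripped files can be concatenated into one file, the single-file call above can be wrapped in a loop. The following is only a minimal sketch: the helper name convert_all and the file names in the usage note are illustrative assumptions, not part of HTML2TEXT, and it assumes html2text.jar is in the current directory.

```shell
#!/bin/sh
# Sketch: convert several local HTML files and append each result to one file.
# convert_all is a hypothetical helper name, not part of HTML2TEXT.
convert_all() {
    out="$1"   # file that collects the concatenated output
    shift      # the remaining arguments are the HTML input files
    for f in "$@"; do
        # Same invocation as in the usage line; ">>" appends instead of overwriting.
        java -jar html2text.jar "$f" >> "$out" || return 1
    done
}
```

A call such as convert_all output.txt main.html about.html would append the converted text of both files to output.txt, each preceded by its own metadata. The loop stops at the first file for which the tool reports an error.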