The ASV Segmentizer splits running text into individual sentences.
The tool expects a local plain-text file (or a file in quelle/source format) as input. The input must be encoded in UTF-8.
The output is plain text or XML and is written to an output file, also encoded in UTF-8.
java -jar AsvSegmentizer.jar -i=test_input.txt -o=test_input_out.txt -a=preList.txt -b=postList.txt -r=preRules.txt -s=postRules.txt
-i the input-file
-o the output-file
-a a list of words before a sentence boundary candidate that vote against a sentence boundary
-b a list of words after a sentence boundary candidate that vote against a sentence boundary
-r a list of rules (Java RegExp) before a sentence boundary candidate that vote against a sentence boundary
-s a list of rules (Java RegExp) after a sentence boundary candidate that vote against a sentence boundary
-c (carriage return) switch handling of carriage returns as sentence boundaries on/off
-el (empty line) handling of empty lines; true: empty lines are treated as sentence boundaries
(sensible in most cases, e.g. for titles and headings that do not end in a sentence boundary); false: empty lines are ignored (treated like single spaces)
-ih remove duplicates (generates a hash for each sentence and removes sentences already present)
-n enables the use of <quelle> tags
-t (trim) multiple spaces are reduced to a single one and spaces at the start/end of a sentence are removed
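
To make the idea of list entries and rules that "vote against" a boundary more concrete, here is a minimal sketch in Java. It is not the Segmentizer's actual implementation: the class BoundaryVoteSketch, the sample entries standing in for preList.txt and preRules.txt, and the token-based matching are all assumptions made for illustration. Only the "before the candidate" side (-a and -r) is shown; the "after" side (-b and -s) would check the following token in the same way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class BoundaryVoteSketch {

    // Hypothetical stand-in for the contents of preList.txt (-a).
    static final Set<String> PRE_LIST =
            new HashSet<>(Arrays.asList("Dr.", "e.g.", "No."));

    // Hypothetical stand-in for preRules.txt (-r): a single capital
    // letter followed by a period, i.e. an initial such as "J.".
    static final List<Pattern> PRE_RULES =
            Arrays.asList(Pattern.compile("^[A-Z]\\.$"));

    // Returns true if the token before the candidate votes against a boundary.
    static boolean votedAgainst(String tokenBefore) {
        if (PRE_LIST.contains(tokenBefore)) {
            return true;
        }
        for (Pattern rule : PRE_RULES) {
            if (rule.matcher(tokenBefore).find()) {
                return true;
            }
        }
        return false;
    }

    // Splits text at tokens ending in '.', '!' or '?' unless a vote blocks it.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            boolean candidate = token.endsWith(".")
                    || token.endsWith("!")
                    || token.endsWith("?");
            if (candidate && !votedAgainst(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep any trailing text that has no closing punctuation.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String sentence : split("Dr. Smith met J. Doe. They talked.")) {
            System.out.println(sentence);
        }
    }
}
```

In this sketch, "Dr." is blocked by the list and "J." by the rule, so the sample text yields two sentences rather than four.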
java -jar AsvSegmentizer.jar -i=quelle_name_input.txt -o=quelle_name_input_out.txt -a=preList.txt -b=postList.txt -r=preRules.txt -s=postRules.txt -n=true
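
The effect of the -t and -ih options on the segmented sentences can be approximated as follows. This is only a sketch under stated assumptions: the class name OutputCleanupSketch is hypothetical, the tool's real hash function is not documented here (a HashSet stands in for it), and applying the trim step before duplicate removal is an assumption as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OutputCleanupSketch {

    // Approximates -t: collapse runs of spaces into one space and
    // remove whitespace at the start and end of the sentence.
    static String trim(String sentence) {
        return sentence.replaceAll(" {2,}", " ").trim();
    }

    // Approximates -ih: keep only the first occurrence of each sentence.
    // HashSet hashes every sentence internally; the tool's own hash
    // function may differ (assumption).
    static List<String> removeDuplicates(List<String> sentences) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String sentence : sentences) {
            if (seen.add(sentence)) {
                unique.add(sentence);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "  Hello   world. ", "Hello world.", "Second  sentence.");
        List<String> trimmed = new ArrayList<>();
        for (String sentence : raw) {
            trimmed.add(trim(sentence));
        }
        for (String sentence : removeDuplicates(trimmed)) {
            System.out.println(sentence);
        }
    }
}
```

Because the first two sample sentences become identical after trimming, only one copy of "Hello world." remains in the result.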