Sentence segmentation

Description

The ASV Segmentizer splits running text into individual sentences.

Input

The tool expects a local plain-text file (or a file in quelle-/source-format) as input. The input must be encoded in UTF-8.
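
Because the tool expects UTF-8 input, it can help to check a file before passing it in. The following is a minimal sketch of such a check and is not part of the Segmentizer; the function name is_valid_utf8 is made up for this illustration.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file at `path` decodes cleanly as UTF-8.

    Hypothetical helper for illustration; the Segmentizer itself does not
    ship this function.
    """
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A file that fails this check would have to be converted to UTF-8 first, for example with a tool such as iconv.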

Output

The output is plain text or XML and is written to an output file. The output is encoded in UTF-8.

Parameters

java -jar AsvSegmentizer.jar -i=test_input.txt -o=test_input_out.txt -a=preList.txt -b=postList.txt -r=preRules.txt -s=postRules.txt


-i the input-file

-o the output-file

-a a list of words before a sentence boundary candidate that vote against a sentence boundary

-b a list of words after a sentence boundary candidate that vote against a sentence boundary

-r a list of rules (Java RegExp) before a sentence boundary candidate that vote against a sentence boundary

-s a list of rules (Java RegExp) after a sentence boundary candidate that vote against a sentence boundary

-c (carriage return) switches the handling of carriage returns as sentence boundaries on or off

-el (empty line) handling of empty lines. true: empty lines are treated as sentence boundaries
  (sensible in most cases, e.g. for titles and headings that carry no sentence boundary of their own); false: empty lines are ignored (treated like single spaces)

-ih removes duplicates (a hash is generated for each sentence, and sentences already present are removed)

-n enables the use of <quelle>-tags

-t (trim) multiple spaces are reduced to a single one and spaces at the start/end of a sentence are removed
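
To make the roles of -a, -b, -r, -s, -t and -ih more concrete, here is a rough Python sketch of how such options could interact. It is an illustration under assumptions, not the Segmentizer's actual algorithm: the word lists and rules are invented stand-ins for the contents of preList.txt, postList.txt, preRules.txt and postRules.txt, the rules use Python's re syntax rather than Java RegExp, the choice of SHA-1 is arbitrary, and the handling of carriage returns and empty lines (-c, -el) is left out.

```python
import hashlib
import re

# Hypothetical stand-ins for the four resource files (assumed content).
PRE_WORDS = {"Dr.", "etc."}                  # like -a: words before a candidate
POST_WORDS = {"and", "or"}                   # like -b: words after a candidate
PRE_RULES = [re.compile(r"\b[A-Z]\.$")]      # like -r: a single initial such as "J."
POST_RULES = [re.compile(r"^[a-z]")]         # like -s: next text starts lower-case

# A boundary candidate: ., ! or ? followed by whitespace or end of text.
BOUNDARY = re.compile(r"[.!?](?=\s|$)")


def split_sentences(text, trim=True, dedupe=True):
    """Split text at boundary candidates unless a list or rule votes against."""
    sentences, seen = [], set()

    def emit(sentence):
        if trim:
            # Like -t: collapse runs of spaces, drop surrounding whitespace.
            sentence = re.sub(r" {2,}", " ", sentence).strip()
        if not sentence.strip():
            return
        if dedupe:
            # Like -ih: hash each sentence and skip those already seen.
            digest = hashlib.sha1(sentence.encode("utf-8")).hexdigest()
            if digest in seen:
                return
            seen.add(digest)
        sentences.append(sentence)

    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        before = text[start:end]
        after = text[end:].lstrip()
        before_tokens = before.split()
        after_tokens = after.split()
        prev_word = before_tokens[-1] if before_tokens else ""
        next_word = after_tokens[0] if after_tokens else ""
        vetoed = (
            prev_word in PRE_WORDS
            or next_word in POST_WORDS
            or any(rule.search(before) for rule in PRE_RULES)
            or any(rule.search(after) for rule in POST_RULES)
        )
        if vetoed:
            continue  # a vote against: keep reading the same sentence
        emit(before)
        start = end
    emit(text[start:])  # whatever remains after the last accepted boundary
    return sentences
```

In this sketch a single vote is enough to reject a candidate; whether the real tool weighs votes differently is not stated here. Calling split_sentences("Dr. Smith arrived.  He met J. Doe. He left. He left.") keeps "Dr." and "J." inside their sentences, collapses the double space, and drops the repeated final sentence.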

Examples

java -jar AsvSegmentizer.jar -i=quelle_name_input.txt -o=quelle_name_input_out.txt -a=preList.txt -b=postList.txt -r=preRules.txt -s=postRules.txt -n=true
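
When the tool is called from a script, the command line can be assembled programmatically. The sketch below builds the same call in Python; build_command and run_segmentizer are hypothetical helper names, and passing boolean options as =true/=false is an assumption extrapolated from the -n=true form shown above.

```python
import subprocess


def build_command(input_file, output_file, jar="AsvSegmentizer.jar", **options):
    """Assemble the java -jar call; option names map to the documented flags."""
    cmd = ["java", "-jar", jar, f"-i={input_file}", f"-o={output_file}"]
    for flag, value in options.items():
        if isinstance(value, bool):
            # Assumption: boolean switches are written as true/false.
            value = "true" if value else "false"
        cmd.append(f"-{flag}={value}")
    return cmd


def run_segmentizer(input_file, output_file, **options):
    """Run the tool and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(input_file, output_file, **options), check=True)
```

For instance, build_command("quelle_name_input.txt", "quelle_name_input_out.txt", a="preList.txt", b="postList.txt", r="preRules.txt", s="postRules.txt", n=True) yields the argument list for the example call above.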

Download

http://asvdoku.informatik.uni-leipzig.de/corpora/data/uploads/segmentizer.zip