Pattern-based sentence cleaning


SentenceCleaner is a simple tool for cleaning sentences; it replaces the old PHP cleaning scripts. Rules are defined through regular expressions and sentence length restrictions, and every sentence of the input file is checked against them. Sentences that match no rule are written to the output file.
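
The filtering idea can be sketched in a few lines of Java. This is only an illustration, not the tool's actual code: the class CleanerSketch, the Rule fields, and the way the regular expression and the length limits are combined are all assumptions. The real rule syntax is described in "rules/general.rules".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of the filtering step; names and semantics are assumptions.
final class CleanerSketch {

    /** A rule with an optional regular expression and assumed length limits. */
    static final class Rule {
        final int id;
        final Pattern pattern;   // null if the rule only restricts the length
        final int minLength;
        final int maxLength;

        Rule(int id, String regex, int minLength, int maxLength) {
            this.id = id;
            this.pattern = (regex == null) ? null : Pattern.compile(regex);
            this.minLength = minLength;
            this.maxLength = maxLength;
        }

        /** Assumption: a rule matches if the length is out of bounds or the regex is found. */
        boolean matches(String sentence) {
            int length = sentence.length();
            if (length < minLength || length > maxLength) {
                return true;
            }
            return pattern != null && pattern.matcher(sentence).find();
        }
    }

    /** Keeps only the sentences that match none of the rules. */
    static List<String> clean(List<String> sentences, List<Rule> rules) {
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            boolean rejected = false;
            for (Rule rule : rules) {
                if (rule.matches(sentence)) {
                    rejected = true;
                    break;
                }
            }
            if (!rejected) {
                kept.add(sentence);
            }
        }
        return kept;
    }
}
```

With one rule that rejects runs of five or more digits and one rule that requires at least 10 characters, only well-formed sentences of sufficient length would pass through clean().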

All rules are stored in the directory "rules/". They are divided into general rules (file "general.rules"), language-specific rules (e.g. lang_deu.rules for German), and text-type-specific rules (e.g. texttype_web.rules). The rule syntax is described in the file "rules/general.rules". Every rule has a unique identifier (the integer after "RULE"). A rule in the text type file overrides a rule with the same ID in the general file; a rule in the language-specific file overrides rules with the same ID in both other files.
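
The override order (general, then text type, then language) can be pictured as three maps merged by rule ID. The sketch below is an assumption-laden illustration: the class RuleOverrideSketch is hypothetical, and rule bodies are represented as plain strings because the real file syntax is not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the override order; not the tool's actual loader.
final class RuleOverrideSketch {

    /**
     * Merges rule sets keyed by rule ID. Later putAll calls win, so
     * text type rules replace general rules, and language rules replace both.
     */
    static Map<Integer, String> merge(Map<Integer, String> general,
                                      Map<Integer, String> textType,
                                      Map<Integer, String> language) {
        Map<Integer, String> merged = new LinkedHashMap<>(general);
        merged.putAll(textType);   // overrides general rules with the same ID
        merged.putAll(language);   // overrides both other files
        return merged;
    }
}
```

For example, if rule 3 is defined in all three files, the merged set contains the definition from the language-specific file.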

Input format

Two input formats are supported:

  • Wortschatz Rawtext format (old or new format "Rohtext" (<sources>)) (default)
  • Tab-separated file, selected with the parameter -c COLUMN (position of the column containing the sentences, counting from 0)
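
To illustrate the zero-based column index used with -c, here is a minimal Java sketch of picking the sentence out of one tab-separated line. The class ColumnSketch and the choice to return null for a missing column are assumptions made for this example, not the tool's documented behavior.

```java
// Hypothetical helper; shows only how a zero-based column index selects a field.
final class ColumnSketch {

    /** Returns the field at the given zero-based column, or null if the line has no such column. */
    static String sentenceColumn(String line, int column) {
        // A limit of -1 keeps trailing empty fields, so column positions stay stable.
        String[] fields = line.split("\t", -1);
        if (column < 0 || column >= fields.length) {
            return null;
        }
        return fields[column];
    }
}
```

With -c 1, a line consisting of an ID, a tab, and a sentence would have the sentence checked and the ID left untouched.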

Output format

The output format matches the input format.


Usage

java -jar SentenceCleaner.jar -i INPUT -o OUTPUT [-l LANG_CODE] [-t TEXTTYPE] [-c COLUMN] [-m] [-r] [-s] [-v] [-e]

  -i INPUT       path to the input file
  -o OUTPUT      path to the output file
  -l LANG_CODE   language code (ISO 639-3)
  -t TEXTTYPE    text type: web|news|wikipedia
  -c COLUMN      column number: treat the input as a tab-separated file and check only the specified column (index starts at 0)
  -r             replace: replace HTML entities with UTF-8 characters
  -s             summary: write a summary to stdout
  -v             verbose: verbose output
  -e             exchange: write the ill-formed sentences to the output (plus the triggered rule)
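
As an illustration of what the -r option describes, the following Java sketch replaces numeric HTML entities and a handful of named ones with their characters. It is not the tool's implementation: the class EntitySketch is hypothetical, and the small table of named entities is an assumption chosen for brevity. Unknown entities are left unchanged.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of entity replacement; covers only a few named entities.
final class EntitySketch {

    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z]+);");

    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00A0");
    }

    /** Replaces known entities with their characters; anything else is copied as-is. */
    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                replacement = NAMED.getOrDefault(body, m.group());
            }
            // quoteReplacement prevents '$' and '\' from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String fromCodePoint(String digits, int radix, String fallback) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : fallback;
        } catch (NumberFormatException e) {
            return fallback;   // number too large to be a code point
        }
    }
}
```

A production version would need the full table of named entities, which is why a library is usually preferable to a hand-written map.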


Examples

Tab-separated file (Medusa file), sentences in column 1:

java -jar SentenceCleaner.jar -i testdata/inputfile -o testdata/outputfile -c 1 -r


Wortschatz Rawtext format:

java -jar SentenceCleaner.jar -i testdata/inputfile_raw -o testdata/outputfile_raw -r