Pattern-based sentence cleaning
The SentenceCleaner is a simple tool for cleaning sentences; it replaces the old PHP cleaning scripts. Rules are defined through regular expressions and sentence length restrictions, and every rule is checked against every sentence of the input file. Sentences that match no rule are written to the output file.
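The filtering idea described above can be pictured roughly as follows. This is only a minimal sketch, not the SentenceCleaner's actual code: the class RuleCheckSketch, its Rule type, and the two sample rules are hypothetical, and it assumes that a rule matches when the sentence length lies outside the rule's bounds or when its regular expression is found in the sentence. The real rule syntax is the one documented in "rules/general.rules".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; names and semantics are assumptions, not the tool's API.
class RuleCheckSketch {

    static final class Rule {
        final int id;
        final Pattern pattern; // null if the rule only restricts the length
        final int minLength;   // inclusive lower bound on the sentence length
        final int maxLength;   // inclusive upper bound on the sentence length

        Rule(int id, String regex, int minLength, int maxLength) {
            this.id = id;
            this.pattern = regex == null ? null : Pattern.compile(regex);
            this.minLength = minLength;
            this.maxLength = maxLength;
        }

        // Assumed semantics: a rule matches (rejects) a sentence if the length
        // is out of bounds or the pattern occurs anywhere in the sentence.
        boolean matches(String sentence) {
            if (sentence.length() < minLength || sentence.length() > maxLength) {
                return true;
            }
            return pattern != null && pattern.matcher(sentence).find();
        }
    }

    // Returns the ID of the first matching rule, or -1 if no rule matches.
    static int firstMatchingRule(String sentence, List<Rule> rules) {
        for (Rule rule : rules) {
            if (rule.matches(sentence)) {
                return rule.id;
            }
        }
        return -1;
    }

    // Keeps only the sentences that match no rule.
    static List<String> clean(List<String> sentences, List<Rule> rules) {
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            if (firstMatchingRule(sentence, rules) == -1) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule(1, "^[a-z]", 0, Integer.MAX_VALUE), // starts with a lowercase letter
                new Rule(2, null, 10, 200));                 // shorter than 10 or longer than 200 characters
        List<String> input = List.of(
                "This is a well-formed sentence.",
                "broken start of a sentence.",
                "Too short");
        System.out.println(clean(input, rules)); // prints [This is a well-formed sentence.]
    }
}
```

Returning the ID of the matching rule, rather than a plain boolean, makes it possible to report which rule rejected a sentence.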
All rules are stored in the directory "rules/". They are divided into general rules (file "general.rules"), language-specific rules (e.g. "lang_deu.rules" for German) and text-type-specific rules (e.g. "texttype_web.rules"). The syntax of the rules is described in the file "rules/general.rules". Every rule has a unique identifier (the integer after "RULE"). A rule in the text-type file overrides the rule with the same ID in the general file; a rule in the language-specific file overrides the rules with the same ID in both other files.
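The override order (general, then text type, then language) can be illustrated with a small merge keyed by rule ID. This is a sketch under stated assumptions, not the tool's implementation: the class RuleOverrideSketch is hypothetical, and each rule is reduced to a placeholder string instead of a parsed rule.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the precedence general < text type < language.
class RuleOverrideSketch {

    // Later putAll calls replace earlier entries with the same rule ID.
    static Map<Integer, String> effectiveRules(Map<Integer, String> general,
                                               Map<Integer, String> textType,
                                               Map<Integer, String> language) {
        Map<Integer, String> merged = new TreeMap<>(general);
        merged.putAll(textType);  // text-type rules override general rules
        merged.putAll(language);  // language rules override both
        return merged;
    }

    public static void main(String[] args) {
        // Placeholder values stand in for rule definitions.
        Map<Integer, String> general = Map.of(1, "general-1", 2, "general-2", 3, "general-3");
        Map<Integer, String> textType = Map.of(2, "web-2", 3, "web-3");
        Map<Integer, String> language = Map.of(3, "deu-3");
        System.out.println(effectiveRules(general, textType, language));
        // prints {1=general-1, 2=web-2, 3=deu-3}
    }
}
```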
Two input formats are supported:
- Wortschatz Rawtext format (old or new format "Rohtext" (<sources>)) (default)
- Tab-separated file, selected with the parameter -c COLUMN (position of the column containing the sentences, counting from 0)
The output is written in the same format as the input.
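For tab-separated input, the zero-based column handling can be sketched as follows. The class ColumnFilterSketch and the predicate standing in for the rule check are hypothetical. The sketch assumes that only the selected column is checked and that kept lines are written back unchanged with all their columns; dropping lines that have too few columns is likewise an assumption, not documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of checking one column of a tab-separated file.
class ColumnFilterSketch {

    // Keeps every line whose field at the given zero-based index passes the check.
    static List<String> filter(List<String> lines, int column, Predicate<String> isWellFormed) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
            // Assumption: lines without the requested column are dropped.
            if (column < fields.length && isWellFormed.test(fields[column])) {
                kept.add(line); // the whole line is kept, not only the sentence
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "1\tA proper sentence.\tsource-a",
                "2\tx\tsource-b");
        // Stand-in check: a sentence needs at least five characters.
        for (String line : filter(lines, 1, s -> s.length() >= 5)) {
            System.out.println(line);
        }
    }
}
```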
java -jar SentenceCleaner.jar -i INPUT -o OUTPUT [-l LANG_CODE] [-t TEXTTYPE] [-c COLUMN] [-m] [-r] [-s] [-v] [-e]

  INPUT      path to inputfile
  OUTPUT     path to outputfile
  LANG_CODE  language code in ISO 639-3
  TEXTTYPE   text type: web|news|wikipedia
  COLUMN     column number: treats input as tabulator separated file, checks only specified column, index starts with 0
  -r         replace: replace HTML entities with UTF8 characters
  -s         summary: write summary to stdout
  -v         verbose: verbose output
  -e         exchange: write the ill-formed sentences to output (+triggered rule)
Tab-separated input (CSV, Medusa file) with the sentences in column 1:
java -jar SentenceCleaner.jar -i testdata/inputfile -o testdata/outputfile -c 1 -r
Wortschatz Rawtext input:
java -jar SentenceCleaner.jar -i testdata/inputfile_raw -o testdata/outputfile_raw -r