Corpus pre-processing tools
All tools run under Linux. Many of them are written in Java and should also run under Windows. To avoid character-set problems, make sure that the system character set is set to UTF-8.
Web crawling
Web data are the source of our corpora, so first of all web pages need to be downloaded from the World Wide Web. Generic web crawlers such as HTTrack can be used for this.
Input: A list of domains.
Output: Local HTML files of the specified domains.
For more information and download, see: Web crawling
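For real crawling, a dedicated crawler such as HTTrack should be used. As a minimal illustration of the step's input/output contract only (a list of domains in, local HTML files out), the following Python sketch fetches just the start page of each domain; the directory layout and file naming are assumptions, not the actual tool's behavior.

```python
import urllib.request
from pathlib import Path

def fetch_start_pages(domains, out_dir="crawl"):
    """Download the start page of each domain and store it as a local HTML file.
    Returns the list of files written. A real crawler would follow links,
    respect robots.txt, and handle redirects; this sketch does none of that."""
    Path(out_dir).mkdir(exist_ok=True)
    saved = []
    for domain in domains:
        url = "http://" + domain + "/"
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read()
        except OSError:
            continue  # skip unreachable domains
        path = Path(out_dir) / (domain.replace("/", "_") + ".html")
        path.write_bytes(html)
        saved.append(str(path))
    return saved
```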
HTML stripping: HTML2TEXT
In the next step, the text has to be extracted from the raw HTML files. This is called stripping.
Input: An HTML file, possibly with metadata (URL, crawling date, encoding, expected language).
Output: A UTF-8 encoded file containing an XML header with the metadata and the text of the corresponding page. Because <source> is the main XML tag, this format is called the <source>-format.
For more information and download, see: HTML stripping
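The idea behind stripping can be sketched in a few lines of Python with the standard-library HTML parser. Note that the exact layout of the real <source>-format header is not specified here, so the tag names inside the header below are assumptions for illustration only.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def strip_html(html, url, date):
    """Return extracted text preceded by a <source>-style metadata header.
    The header's inner tag names are illustrative assumptions."""
    parser = TextExtractor()
    parser.feed(html)
    header = "<source><location>%s</location><date>%s</date></source>" % (url, date)
    return header + "\n" + "\n".join(parser.parts)
```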
Document based language identification
Typically the aim is to create monolingual corpora, but one rarely knows for sure that all downloaded documents contain text in the desired language.
Input: A file in the <source>-format
Output: Language-separated files in the <source>-format.
For more information and download, see: Document based language identification
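One common approach to document-level language identification is to compare the character n-gram profile of a document against profiles built from sample texts of each candidate language (in the spirit of Cavnar and Trenkle's out-of-place measure). The sketch below is a toy version under that assumption; it is not the actual tool, and real systems use much larger training samples.

```python
from collections import Counter

def trigram_profile(text, size=50):
    """Character trigrams of a text, ranked by frequency."""
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(size)]

def profile_distance(doc_profile, lang_profile):
    """Sum of rank differences ('out-of-place' measure); trigrams unseen
    in the language profile get the maximum penalty."""
    total = 0
    for rank, gram in enumerate(doc_profile):
        if gram in lang_profile:
            total += abs(rank - lang_profile.index(gram))
        else:
            total += len(lang_profile)
    return total

def identify(doc_text, lang_samples):
    """Return the language whose trigram profile is closest to the document's."""
    profiles = {lang: trigram_profile(s) for lang, s in lang_samples.items()}
    doc = trigram_profile(doc_text)
    return min(profiles, key=lambda lang: profile_distance(doc, profiles[lang]))
```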
Sentence segmentation
For most languages, sentences end with special punctuation. Different writing systems have different sentence-final marks, and sometimes a punctuation mark like the Latin full stop is used both as a sentence ending and for other purposes such as abbreviations, ordinal numbers etc.
An abbreviation list helps to identify well-known abbreviations. For many languages a special abbreviation list is provided; for all other languages a general abbreviation list is used.
Pattern-based rules are used to identify ordinal numbers and to guess further abbreviations.
Input: Output of HTML2TEXT in the <source>-format.
Output: The same file with line breaks only at sentence boundaries. The XML headers are unchanged, so the output file is still in the <source>-format.
For more information and download, see: Sentence segmentation
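The interplay of punctuation, abbreviation lists and ordinal-number rules described above can be sketched as follows. This is a simplified illustration, not the segmenter itself; the abbreviation set and the ordinal rule are sample assumptions.

```python
import re

# Sample abbreviation list; the real tools ship per-language lists.
ABBREVIATIONS = {"dr.", "prof.", "e.g.", "etc.", "vs."}

def segment(text):
    """Split text into sentences at sentence-final punctuation,
    but not after known abbreviations or ordinal numbers like '3.'."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith((".", "!", "?")):
            if tok.lower() in ABBREVIATIONS:
                continue  # known abbreviation, not a sentence end
            if re.fullmatch(r"\d+\.", tok):
                continue  # ordinal number, not a sentence end
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

In the <source>-format output, each sentence returned here would occupy its own line.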
Pattern-based sentence cleaning
The sentence cleaner uses a list of regular expressions to decide whether a sentence is well-formed. For instance, sentences containing TABs are not well-formed. This set of regular expressions may contain language-specific entries.
Input: Output of the sentence segmenter in the <source>-format.
Output: The same file with lines that are not well-formed sentences removed. The XML headers are unchanged, so the output file is still in the <source>-format.
For more detailed description and download: Pattern-based sentence cleaning
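A minimal sketch of such a rule-based cleaner: a sentence is kept only if no ill-formedness pattern matches it. The TAB rule comes from the description above; the other two patterns are example assumptions standing in for the real, larger rule set.

```python
import re

# Sample ill-formedness patterns; the real list is larger and
# partly language-specific.
ILL_FORMED = [
    re.compile(r"\t"),          # contains a TAB
    re.compile(r"^[^A-ZÄÖÜ]"),  # assumed rule: no initial uppercase letter
    re.compile(r"[^.!?]$"),     # assumed rule: no sentence-final punctuation
]

def is_wellformed(sentence):
    """A sentence is well-formed if none of the patterns match it."""
    return not any(p.search(sentence) for p in ILL_FORMED)

def clean(sentences):
    """Keep only the well-formed sentences."""
    return [s for s in sentences if is_wellformed(s)]
```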
Sentence based language checking
Input: Output of the sentence cleaner in the <source>-format.
Output: The same file with sentence lines not in the expected language removed. The XML headers are unchanged, so the output file is still in the <source>-format.
For more detailed description and download: Sentence based language checking
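One simple way to check single sentences, which are often too short for robust n-gram profiling, is stopword coverage: a sentence passes if enough of its tokens are frequent words of the expected language. This is an illustrative assumption about the method, not the actual tool; the word lists and threshold are toy values.

```python
# Toy stopword lists; a real checker would use much larger
# frequency-based word lists per language.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "a"},
    "de": {"der", "die", "und", "das", "ist", "ein"},
}

def in_language(sentence, lang, threshold=0.2):
    """Heuristic: accept the sentence if at least `threshold` of its
    tokens are stopwords of the expected language."""
    tokens = sentence.lower().split()
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in STOPWORDS[lang])
    return hits / len(tokens) >= threshold

def filter_language(sentences, lang):
    """Keep only sentences judged to be in the expected language."""
    return [s for s in sentences if in_language(s, lang)]
```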
Duplicate removal
Before duplicate removal, the sentences are brought into a format independent of their ordering. This is called the line format. Every sentence is still on a single line, but the corresponding metadata (date and URL) are appended to each sentence, separated by TABs. So every line can be interpreted as a row of a three-column table with the columns sentence, date and URL.
For the optional step of duplicate removal, all such data (coming from possibly more than one file of 4 GB size) are sorted and made unique with respect to the first column. This is done using the following command line:
sort -o OUTPUT_FILE -t ' ' -u -k1,1 INPUT_FILE
(Note: The character enclosed by the apostrophes is a TAB.)
Input: Output of the language identifier in the <source>-format.
Output: List of unique sentences in the line format.
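The conversion to the line format and the effect of the `sort -u -k1,1` call can be sketched in Python. This mirrors the behavior described above (one line per distinct first column, sorted order); it is an illustration, not a replacement for the external sort, which also handles files larger than memory.

```python
def to_line_format(sentences, date, url):
    """Turn sentences plus their metadata into tab-separated lines:
    sentence<TAB>date<TAB>url."""
    return ["%s\t%s\t%s" % (s, date, url) for s in sentences]

def unique_by_sentence(lines):
    """Sort the lines and keep one line per distinct sentence (first
    column), like `sort -u -k1,1` does on the line-format files."""
    seen, out = set(), []
    for line in sorted(lines):
        sentence = line.split("\t", 1)[0]
        if sentence not in seen:
            seen.add(sentence)
            out.append(line)
    return out
```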
Database import
Input: List of sentences in the line format.
Output: MySQL corpus database.
More information and the tool for download are coming soon.
All tools are provided under the Creative Commons BY-NC license. This means they can be used freely for non-commercial purposes as long as they include the following copyright notice: © Copyright Abteilung Automatische Sprachverarbeitung, Universität Leipzig.