Corpus pre-processing tools

General requirements

All tools run under Linux. Many of them are written in Java and should also run under Windows. To avoid character-set problems, make sure that the system character set is set to UTF-8.

Web crawling

Web data are the source for our corpora. So, first of all, Web pages need to be downloaded from the World Wide Web. For that, generic web crawlers such as HTTrack can be used.

Input: A list of domains.

Output: Local HTML files of the specified domains.

For more information and download, see: Web crawling

HTML stripping: HTML2TEXT

In the next step, the text has to be extracted from the raw HTML files. This is called stripping.

Input: An HTML file, possibly with metadata (URL, crawling date, encoding, expected language).

Output: A file encoded in UTF-8 containing an XML header with the metadata and the text of the corresponding page. Because <source> is the main XML tag, this format is called the <source>-format.

For more information and download, see: HTML stripping
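The stripping step can be illustrated with a minimal sketch in Python. The function name, header fields, and metadata parameters below are assumptions for illustration; the actual HTML2TEXT tool's header layout may differ.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script and style tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_source(html, url="", date=""):
    # Hypothetical <source>-format header; the real tool may use other fields.
    extractor = TextExtractor()
    extractor.feed(html)
    text = "\n".join(extractor.parts)
    header = f"<source><location>{url}</location><date>{date}</date></source>"
    return header + "\n" + text
```

The output places the metadata header on the first line and the extracted text below it, one text fragment per line.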

Document based language identification

Typically the aim is to create monolingual corpora, but one rarely knows for sure that all downloaded documents contain text in the desired language. Therefore the language of each document is identified, and the documents are separated accordingly.

Input: A file in the <source>-format

Output: Language-separated files in the <source>-format.

For more information and download, see: Document based language identification
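A common approach to document-level language identification is to score a text against small lists of high-frequency function words per language. The following is a toy sketch of that idea, not the actual tool's algorithm; the stopword lists are illustrative.

```python
# Illustrative stopword lists; a real identifier would use much larger
# lists or character n-gram models.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is"},
    "de": {"der", "die", "und", "das", "ist", "nicht"},
}

def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Documents would then be written to one output file per identified language, each still in the <source>-format.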

Sentence segmentation

For most languages, sentences end with special punctuation. Different writing systems have different sentence-final marks, and sometimes a punctuation mark like the Latin full stop is used both as a sentence ending and for other purposes such as abbreviations, ordinal numbers, etc.

An abbreviation list helps to identify well-known abbreviations. For many languages, a special abbreviation list is provided. For all other languages a general abbreviation list is used.

Pattern-based rules are used for identifying ordinal numbers and guessing more abbreviations.

Input: Output of HTML2TEXT in the <source>-format.

Output: Same file with line breaks only at sentence boundaries. The XML headers are unchanged, so the output file is still in the <source>-format.

For more information and download, see: Sentence segmentation
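The combination of an abbreviation list and pattern-based rules for ordinal numbers can be sketched as follows. The abbreviation list and the boundary rule are simplified assumptions, not the segmenter's actual rules.

```python
import re

# Illustrative abbreviation list; the real tool ships per-language lists.
ABBREVIATIONS = {"dr.", "prof.", "e.g.", "etc.", "nr."}

def segment(text):
    """Split text at ., ! or ? followed by whitespace, unless the token
    ending in '.' is a known abbreviation or an ordinal number like '3.'."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?](?=\s)", text):
        end = m.end()
        token = text[start:end].split()[-1].lower()
        if token in ABBREVIATIONS or re.fullmatch(r"\d+\.", token):
            continue  # full stop belongs to an abbreviation or ordinal
        sentences.append(text[start:end].strip())
        start = end
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences
```

In the real pipeline the segmenter operates on the text body of a <source>-format file and writes one sentence per line.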

Pattern-based sentence cleaning

The sentence cleaner uses a list of regular expressions to decide whether a sentence is well-formed. For instance, sentences containing TABs are not well-formed. This set of regular expressions may contain language-specific entries.

Input: Output of the sentence segmenter in the <source>-format.

Output: Same file with non-sentence lines removed. The XML headers are unchanged, so the output file is still in the <source>-format.

For more detailed description and download: Pattern-based sentence cleaning
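The filtering logic can be sketched with a few illustrative patterns. Only the TAB rule is taken from the description above; the other two patterns are hypothetical examples of what a language-specific rule set might contain.

```python
import re

# Illustrative ill-formedness patterns (ASCII-only for this sketch).
BAD_PATTERNS = [
    re.compile(r"\t"),             # sentence contains a TAB
    re.compile(r"^[^A-Z]"),        # does not start with a capital letter
    re.compile(r"[^.!?\"')\]]$"),  # lacks sentence-final punctuation
]

def is_wellformed(sentence):
    return not any(p.search(sentence) for p in BAD_PATTERNS)

def clean(sentences):
    """Keep only the sentences that match no ill-formedness pattern."""
    return [s for s in sentences if is_wellformed(s)]
```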

Sentence based language checking

Input: Output of the sentence cleaner in the <source>-format.

Output: Same file with sentence lines not in the expected language removed. The XML headers are unchanged, so the output file is still in the <source>-format.

For more detailed description and download: Sentence based language checking

Duplicate removal

Before duplicate removal, the sentences are brought into a format independent of their ordering. This is called the line format. Every sentence is still on a single line, but the corresponding metadata (date and URL) are added to each sentence, separated by TABs. So every line can be interpreted as a row in a three-column table with the columns sentence, date, and URL.
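The conversion into the line format can be sketched as follows; the function name and parameters are assumptions for illustration.

```python
def to_line_format(sentences, date, url):
    """Attach date and URL to each sentence, TAB-separated, yielding
    the three-column line format: sentence <TAB> date <TAB> URL."""
    return [f"{s}\t{date}\t{url}" for s in sentences]
```

Sorting the resulting lines and keeping them unique with respect to the first column then removes duplicate sentences regardless of which file they came from.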

For the optional step of duplicate removal, all such data (possibly coming from more than one file of 4 GB size) are sorted and made unique with respect to the first column. This is done using the following command line:

                    sort -o OUTPUT_FILE -t '          ' -u -k1,1 INPUT_FILE

(Note: The character enclosed by the apostrophes is a TAB. In bash it can also be written as $'\t'.)

Input: Output of the language identifier in the <source>-format.

Output: List of unique sentences in the line format.

Corpus production

Input: List of sentences in the line format.

Output: MySQL corpus database.

More information and the tool for download are coming soon.

Licensing

All tools are provided under the Creative Commons BY-NC license. This means they can be used freely for non-commercial purposes as long as they include the following copyright notice: © Copyright Abteilung Automatische Sprachverarbeitung, Universität Leipzig.