About the Leipzig Corpora Collection
The Leipzig Corpora Collection (LCC) is a collection of corpora built from comparable sources and with equivalent processing for more than 250 languages. All data are searchable. Moreover, corpora of up to one million sentences can be downloaded for full access to the data.
According to their sources, the corpora are classified along three dimensions:
- language (sometimes in connection with the country of origin)
- genre (currently: news texts, random web texts, and Wikipedia texts)
- time (year of download)
For language and corpus comparisons, as well as for different usage scenarios, subcorpora of normed sizes (containing 10,000, 30,000, ..., 1 million sentences) are created.
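The creation of such a normed-size subcorpus can be sketched as a reproducible random sample over a list of sentences (the function name and seed below are illustrative, not part of the LCC toolchain):

```python
import random

def sample_subcorpus(sentences, size, seed=42):
    """Draw a reproducible random sample of `size` sentences.

    Mirrors the idea of normed-size subcorpora (10,000, 30,000, ...,
    1,000,000 sentences). Raises ValueError if the corpus is smaller
    than the requested size.
    """
    if size > len(sentences):
        raise ValueError("corpus has fewer sentences than requested size")
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return rng.sample(sentences, size)

corpus = [f"sentence {i}" for i in range(100_000)]
sub = sample_subcorpus(corpus, 10_000)
```

Because each subcorpus is a random sample of the full sentence set, its statistics (word frequencies, co-occurrences) remain comparable across sizes and languages.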
For all languages with major resources on the Internet, corpora will be created. With newspapers in about 120 languages (see ABYZ News Links) and about 230 Wikipedias with more than 1,000 articles each, the number of languages available for corpus production appears to be limited to about 300.
For more information, see the full list of corpora sorted by language and genre.
Corpus processing toolchain
The tools for corpus preprocessing are available here for download. All tools are provided under the Creative Commons BY-NC license.
The standardized toolchain for corpus production contains the following steps.
- Web crawling
- HTML stripping (or XML stripping for Wikipedia)
- Document-based language identification
- Sentence segmentation
- Duplicate removal
- Pattern-based sentence cleaning
- Sentence-based language checking
- Tokenization and word indexing
- Word frequency calculation
- Word co-occurrence calculation
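The final steps of the toolchain can be sketched in a few lines. This is a minimal illustration, not the actual LCC implementation: it uses naive whitespace tokenization and counts sentence-level co-occurrences, whereas the real toolchain uses proper tokenizers per language:

```python
from collections import Counter
from itertools import combinations

def process(sentences):
    """Sketch of duplicate removal, tokenization, word frequency
    calculation, and sentence-level word co-occurrence counting."""
    # Duplicate removal: keep only the first occurrence of each sentence.
    seen, unique = set(), []
    for s in sentences:
        if s not in seen:
            seen.add(s)
            unique.append(s)
    word_freq = Counter()
    cooc = Counter()
    for s in unique:
        tokens = s.split()  # placeholder for a real tokenizer
        word_freq.update(tokens)
        # Count each unordered word pair at most once per sentence.
        for a, b in combinations(sorted(set(tokens)), 2):
            cooc[(a, b)] += 1
    return unique, word_freq, cooc

unique, freq, cooc = process(["a b", "a b", "a c"])
```

Counting co-occurrence within the sentence (rather than a fixed window) is one common choice; the resulting counts feed the significance measures used for the co-occurrence statistics.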
Optional post-processing steps (depending on the availability of the corresponding tools):
- POS tagging
- Near-duplicate sentence detection and removal
- Word similarity 1: distributionally similar words
- Word similarity 2: String-similar words
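String similarity is typically measured by edit distance. A minimal sketch (the helper names are illustrative) using the classic Levenshtein dynamic program:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def string_similar(word, vocabulary, max_dist=1):
    """Return vocabulary words within `max_dist` edits of `word`."""
    return [w for w in vocabulary
            if w != word and levenshtein(word, w) <= max_dist]
```

In contrast, distributionally similar words are found by comparing co-occurrence profiles rather than character strings, so the two similarity types capture different relations (spelling variants vs. semantically related words).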
Corpora for download
For most languages and genres, corpora are available for download:
- Subcorpora of the following sizes (measured in number of sentences): 10,000, 30,000, 100,000, 300,000, and 1,000,000. These subcorpora contain a random sample of all sentences.
- Formats: MySQL database files and plain text.
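A minimal reader for the plain-text download, assuming one tab-separated `id<TAB>sentence` record per line (a common layout for LCC sentence files; the exact format may differ by release):

```python
import io

def read_sentences(fileobj):
    """Yield (id, sentence) pairs from a tab-separated sentence file.

    Assumes one `id<TAB>sentence` record per line; lines that do not
    match this shape are skipped.
    """
    for line in fileobj:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) == 2 and parts[0].isdigit():
            yield int(parts[0]), parts[1]

# Illustrative input; a real file would be opened with open(..., encoding="utf-8").
data = io.StringIO("1\tFirst sentence.\n2\tSecond sentence.\n")
pairs = list(read_sentences(data))
```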
All corpora are provided under the Creative Commons BY license.
For all corpora, a set of standard statistics is computed. These statistics can be used for:
- Corpus quality analysis (Are curves as smooth as expected? Do similar corpora have similar parameters?)
- Language comparison: In which parameters do corpora in different languages (but same genre and size) differ?
At the moment, more than 250 different analysis pages are generated per corpus. Different analysis types use information about:
- Characters and character n-grams
- Words and multi-words
- Word co-occurrences
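Of the analysis types above, character n-gram statistics are the simplest to sketch (the function name is illustrative):

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams in `text` by sliding a window of width n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

bigrams = char_ngrams("corpus", 2)
```

Comparing such distributions across corpora of the same genre and size is one way to spot quality problems: a corpus contaminated with foreign-language material shows an unusual character n-gram profile.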
The set of statistics is continuously being extended and developed further.
Further reading
- How to cite the Leipzig Corpora Collection
- Technical Report Series on Corpus Building
- Frequency Dictionaries
- Research papers