About the Leipzig Corpora Collection

General Description

The Leipzig Corpora Collection (LCC) is a collection of corpora drawn from comparable sources and processed in the same way for more than 250 languages. All data are searchable. In addition, corpora of up to one million sentences can be downloaded for full access to the data.

According to their sources, the corpora are classified in three dimensions:

  • language (sometimes in connection with country of origin)
  • genre (currently: news texts, random web texts, and Wikipedia texts)
  • time: year of download

For language and corpus comparisons, as well as for different usage scenarios, subcorpora of standardized sizes (containing 10,000, 30,000, ..., 1 million sentences) are created.
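
The three dimensions plus the standardized size can be pictured as a small record type. The following is only an illustrative sketch: the class name CorpusId, the field names, the genre labels, and the language codes are assumptions made for this example and are not the collection's actual naming scheme.

```python
from dataclasses import dataclass
from typing import Optional

# Genres named in the text; the short labels are illustrative assumptions.
GENRES = {"news", "web", "wikipedia"}


@dataclass(frozen=True)
class CorpusId:
    """Hypothetical descriptor for one corpus in the collection."""
    language: str                  # assumed to be a language code such as "deu"
    genre: str                     # one of GENRES
    year: int                      # year of download
    sentences: int                 # standardized size, e.g. 1_000_000
    country: Optional[str] = None  # optional country of origin

    def __post_init__(self):
        if self.genre not in GENRES:
            raise ValueError(f"unknown genre: {self.genre!r}")


def comparable(a: CorpusId, b: CorpusId) -> bool:
    """True if two corpora differ in language but share genre and size."""
    return (
        a.language != b.language
        and a.genre == b.genre
        and a.sentences == b.sentences
    )
```

With such a descriptor, a language comparison would pair, for example, a German and an English news corpus of one million sentences each, while rejecting pairs that differ in genre or size.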

Languages

Corpora will be created for all languages with major resources on the Internet. With newspapers in about 120 languages (see ABYZ News Links) and about 230 Wikipedias with more than 1,000 articles, the number of languages for corpus production seems to be limited to about 300.

For more information, see the full list of corpora sorted by language and genre.

Corpus processing toolchain

The tools for corpus preprocessing are available for download. All tools are provided under the Creative Commons BY-NC license.

The standardized toolchain for corpus production contains the following steps.

Corpora for download

For most languages and genres, corpora are available for download:

  • Sizes: subcorpora of 10,000, 30,000, 100,000, 300,000, and 1,000,000 sentences. Each subcorpus contains a random sample of all sentences.
  • Formats: MySQL database files and plain text.
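
As a rough illustration of drawing random samples of the listed sizes, the sketch below samples each size independently and without replacement. The function name sample_subcorpora and the fixed seed are assumptions for this example; the actual sampling procedure of the collection is not described here and may differ (for instance, smaller samples might be drawn as subsets of larger ones).

```python
import random

# Sizes listed in the text, measured in number of sentences.
STANDARD_SIZES = [10_000, 30_000, 100_000, 300_000, 1_000_000]


def sample_subcorpora(sentences, sizes=STANDARD_SIZES, seed=0):
    """Return {size: sampled sentences} for every size the corpus can fill.

    `sentences` must be a list; sizes larger than the corpus are skipped.
    """
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    subcorpora = {}
    for size in sizes:
        if size <= len(sentences):
            # Sampling without replacement: each position is used at most once.
            subcorpora[size] = rng.sample(sentences, size)
    return subcorpora
```

For a corpus of 35,000 sentences, this would yield only the 10,000 and 30,000 sentence subcorpora.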

All corpora are provided under the Creative Commons BY license.

Corpus and language statistics (CLS)

For all corpora, a set of standard statistics is computed. These statistics can be used for:

  • Corpus quality analysis: Are curves as smooth as expected? Do similar corpora have similar parameters?
  • Language comparison: In which parameters do corpora in different languages (but same genre and size) differ?

At the moment, more than 250 different analysis pages are generated per corpus. Different analysis types use information about:

  • Characters and character n-grams
  • Words and multi-words
  • Sentences
  • Word co-occurrences
  • Sources
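
Two of these information types, character n-grams and word co-occurrences, can be sketched in a few lines. This is a simplified illustration, not the collection's implementation: the function names are hypothetical, words are split on whitespace and lowercased (an assumption; real corpora need a proper tokenizer), and co-occurrence is counted per sentence without any significance measure.

```python
from collections import Counter
from itertools import combinations


def char_ngrams(sentences, n=3):
    """Count all character n-grams of length n within each sentence."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence[i:i + n] for i in range(len(sentence) - n + 1))
    return counts


def sentence_cooccurrences(sentences):
    """Count unordered word pairs that appear together in a sentence."""
    counts = Counter()
    for sentence in sentences:
        # Each distinct word counts once per sentence; sorting gives every
        # pair a canonical order, so ("cat", "the") and ("the", "cat") merge.
        words = sorted(set(sentence.lower().split()))
        counts.update(combinations(words, 2))
    return counts
```

Running both functions on corpora of the same genre and size in different languages gives frequency tables that can be compared directly.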

These statistics are extended continually.

References

  • How to cite the Leipzig Corpora Collection
  • Technical Report Series on Corpus Building
  • Frequency Dictionaries
  • Research papers