Corpora for download

Download page

Corpus naming convention

The name of a corpus consists of several parts, some of which are optional. The structure is as follows:

lang[-country]_genre_year_size, where

  • lang is the three-character language code according to ISO 639 (see http://www.ethnologue.org)
  • country is the optional additional country of origin
  • genre is one of the following:
    • news (newspaper articles published in the corresponding year and crawled daily),
    • newscrawl (newspaper articles collected in a single crawl in the corresponding year),
    • web (random web pages, mostly crawled using the crawler findlinks),
    • wikipedia (a dump provided by Wikipedia in the corresponding year), or
    • mixed (a mixture of all corpora up to the corresponding year).
  • year is the year of download (for news, identical to the year of publication)
  • size is the number of sentences, one of 10K, 30K, 100K, 300K or 1M
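The naming convention above can be checked mechanically. The following Python sketch splits a corpus name into its parts. The helper names (CorpusName, parse_corpus_name) and the example corpus names are hypothetical, and the assumption that the country part consists of letters only is not stated on this page.

```python
import re
from typing import NamedTuple, Optional

# Genres and sizes exactly as listed in the naming convention.
GENRES = ("news", "newscrawl", "web", "wikipedia", "mixed")
SIZES = ("10K", "30K", "100K", "300K", "1M")


class CorpusName(NamedTuple):
    """Parsed parts of a corpus name (hypothetical helper type)."""
    lang: str
    country: Optional[str]
    genre: str
    year: int
    size: str


# lang[-country]_genre_year_size
# Assumption: the optional country part is made of letters only.
_PATTERN = re.compile(
    r"^(?P<lang>[a-z]{3})"
    r"(?:-(?P<country>[A-Za-z]+))?"
    r"_(?P<genre>" + "|".join(GENRES) + r")"
    r"_(?P<year>\d{4})"
    r"_(?P<size>" + "|".join(SIZES) + r")$"
)


def parse_corpus_name(name: str) -> CorpusName:
    """Split a corpus name into its parts, or raise ValueError."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid corpus name: {name!r}")
    return CorpusName(
        lang=match.group("lang"),
        country=match.group("country"),  # None when the part is absent
        genre=match.group("genre"),
        year=int(match.group("year")),
        size=match.group("size"),
    )


if __name__ == "__main__":
    # Illustrative names only; they need not exist on the download page.
    print(parse_corpus_name("deu_news_2010_100K"))
    print(parse_corpus_name("deu-at_newscrawl_2012_1M"))
```

The sketch accepts only the sizes listed above. To handle the larger corpora available on request, 3M and 10M would have to be added to SIZES.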

Corpus formats

All corpora are available in two formats:

  • as a MySQL database, as described here.
  • as plain text files ready to use with text processing tools. For a first look, the sentences and words files are the most interesting.
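As a rough illustration of working with the plain text format, the following Python sketch reads a sentences file. It assumes that each line holds a numeric sentence ID and the sentence text separated by a tab. This layout is an assumption, not something this page specifies, so check it against the downloaded files. The function name read_sentences is hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_sentences(handle: TextIO) -> Iterator[Tuple[int, str]]:
    """Yield (sentence_id, text) pairs from a sentences file.

    Assumption: every non-empty line has the form "<id><TAB><sentence>".
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        sent_id, _, text = line.partition("\t")
        yield int(sent_id), text


if __name__ == "__main__":
    # In-memory sample standing in for a downloaded sentences file.
    sample = io.StringIO("1\tThis is a sentence.\n2\tHere is another one.\n")
    for sent_id, text in read_sentences(sample):
        print(sent_id, text)
```

For a real file, open it with open(path, encoding="utf-8") and pass the handle to read_sentences. The UTF-8 encoding is likewise an assumption.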

Licensing

All corpora are provided under the Creative Commons BY license. This means they can be used freely for any purpose as long as they include the following copyright notice:

© Copyright Abteilung Automatische Sprachverarbeitung, Universität Leipzig.

Larger corpora

Larger corpora, such as those with 3M or 10M sentences, are available on request for non-commercial use.