About the Leipzig Corpora Collection

General Description

The Leipzig Corpora Collection (LCC) is a collection of corpora built from comparable sources with equivalent processing for more than 250 languages. All data are searchable. In addition, corpora of up to one million sentences can be downloaded for full access to the data.

According to their sources, the corpora are classified in three dimensions:

  • language (sometimes in connection with country of origin)
  • genre (currently: news texts, random web texts, and Wikipedia texts)
  • time: year of download

For language and corpus comparisons as well as different usage scenarios, subcorpora of standardized sizes (containing 10,000, 30,000, ..., 1 million sentences) are created.

Languages

For all languages with major resources on the Internet, corpora will be created. With newspapers in about 120 languages (see ABYZ News Links) and about 230 Wikipedias with more than 1,000 articles each, the number of languages for corpus production appears to be limited to about 300.

For more information, see the full list of corpora sorted by language and genre.

Corpus processing toolchain

The tools for corpus preprocessing are available here for download. All tools are provided under the Creative Commons BY-NC license.

The standardized toolchain for corpus production comprises the following steps:

  • Web crawling
  • HTML stripping (or XML stripping for Wikipedia)
  • Document-based language identification
  • Sentence segmentation
  • Duplicate removal
  • Pattern-based sentence cleaning
  • Sentence-based language checking
  • Corpus production
    • Tokenization and word indexing
    • Word frequency calculation
    • Word co-occurrence calculation: For every word A, the words B that co-occur significantly often with A as immediate left neighbor, as immediate right neighbor, or within the same sentence are given. Significance is measured with the log-likelihood ratio; larger values mean stronger significance. At the moment, the top 50 word co-occurrences are displayed.
  • Optional post-processing (depending on the availability of the corresponding tools)
    • POS tagging
    • Lemmatization
    • Near-duplicate sentence detection and removal
    • Word similarity based on co-occurrences: Each word is described by its set of co-occurring words. To compare two words, the two corresponding sets are compared. The Dice coefficient relates the number of joint co-occurrences to the number of all co-occurrences of the two words. It is a number between 0 (no similarity) and 1 (total similarity). Similar words are displayed if both the Dice coefficient and the absolute number of joint co-occurrences are above some threshold.
    • Word similarity 2: String-similar words
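
The two measures named in the list above can be illustrated with a minimal Python sketch. It is not the LCC toolchain itself: the function names, the toy thresholds, and the use of the standard log-likelihood ratio (G²) over a 2x2 table of sentence counts are assumptions made for illustration, and the exact variant used in production may differ.

```python
import math
from collections import Counter
from itertools import combinations


def _xlogx(x):
    # x * ln(x), with the usual convention 0 * ln(0) = 0
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood(k11, k12, k21, k22):
    """G^2 for a 2x2 table: k11 = sentences with A and B, k12 = A only,
    k21 = B only, k22 = neither. Larger values mean stronger significance."""
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells + _xlogx(n) - rows - cols)


def sentence_cooccurrences(sentences):
    """Score every word pair that shares at least one sentence.
    `sentences` is a list of token lists; keys are alphabetically sorted pairs.
    Note: G^2 alone does not tell attraction from repulsion."""
    n = len(sentences)
    word_freq = Counter()
    pair_freq = Counter()
    for tokens in sentences:
        types = sorted(set(tokens))  # count each word once per sentence
        word_freq.update(types)
        pair_freq.update(combinations(types, 2))
    scores = {}
    for (a, b), k11 in pair_freq.items():
        k12 = word_freq[a] - k11
        k21 = word_freq[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = log_likelihood(k11, k12, k21, k22)
    return scores


def dice(set_a, set_b):
    """Dice coefficient of two co-occurrence sets, between 0 and 1."""
    if not set_a and not set_b:
        return 0.0
    return 2.0 * len(set_a & set_b) / (len(set_a) + len(set_b))


def is_similar(set_a, set_b, min_dice=0.3, min_joint=2):
    # Both thresholds are hypothetical values chosen for this sketch.
    joint = len(set_a & set_b)
    return joint >= min_joint and dice(set_a, set_b) >= min_dice


if __name__ == "__main__":
    toy = [["new", "york"], ["new", "york"], ["old", "town"], ["old", "town"]]
    print(sentence_cooccurrences(toy))
    print(dice({"a", "b"}, {"b", "c"}))  # → 0.5
```

Counting each word only once per sentence keeps the contingency table in units of sentences, which matches the "within the same sentence" reading of co-occurrence; neighbor co-occurrences would need a separate count over adjacent token pairs.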

Corpora for download

For most languages and genres, corpora are available for download:

  • Subcorpora of the following sizes (measured in number of sentences): 10,000, 30,000, 100,000, 300,000, and 1,000,000. Each subcorpus contains a random sample of all sentences.
  • Each subcorpus is provided as MySQL database files and as plain text.
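
As an illustration of how random subcorpora of the sizes listed above could be drawn, here is a minimal Python sketch. The function name, the fixed seed, and the choice to draw each size independently (rather than nesting smaller samples inside larger ones) are assumptions; the source only states that each subcorpus is a random sample of all sentences.

```python
import random

# Standard subcorpus sizes, measured in sentences
SIZES = [10_000, 30_000, 100_000, 300_000, 1_000_000]


def sample_subcorpora(sentences, sizes=SIZES, seed=0):
    """Return {size: list of sentences} for every size the corpus can supply.
    Sampling is without replacement, so no sentence repeats within a sample."""
    rng = random.Random(seed)  # fixed seed makes the samples reproducible
    subcorpora = {}
    for size in sizes:
        if size <= len(sentences):
            subcorpora[size] = rng.sample(sentences, size)
    return subcorpora


if __name__ == "__main__":
    corpus = [f"Sentence number {i}." for i in range(50)]
    subs = sample_subcorpora(corpus, sizes=[10, 30, 100])
    print(sorted(subs))  # sizes larger than the corpus are skipped
```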

All corpora are provided under the Creative Commons BY license.

Corpus and language statistics (CLS)

For all corpora, a set of standard statistics is computed. These statistics can be used for:

  • Corpus quality analysis: Are curves as smooth as expected? Do similar corpora have similar parameters?
  • Language comparison: In which parameters do corpora in different languages (but same genre and size) differ?

At the moment, more than 250 different analysis pages are generated per corpus. Different analysis types use information about:

  • Characters and character n-grams
  • Words and multi-words
  • Sentences
  • Word co-occurrences
  • Sources
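
One of the simplest statistics of this kind, character n-gram frequencies, can be sketched in a few lines of Python. The function name and the default n = 3 are assumptions for illustration; this is not taken from the CLS implementation.

```python
from collections import Counter


def char_ngram_counts(sentences, n=3):
    """Count all overlapping character n-grams across a list of sentences.
    Sentences shorter than n contribute nothing."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence[i:i + n] for i in range(len(sentence) - n + 1))
    return counts


if __name__ == "__main__":
    counts = char_ngram_counts(["abab"], n=2)
    print(counts.most_common(1))  # → [('ab', 2)]
```

Comparing such frequency tables for corpora of the same genre and size is one way to spot differences between languages or irregularities within a single corpus.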

These statistics are being extended continuously.

References

  • How to cite the Leipzig Corpora Collection
  • Technical Report Series on Corpus Building
  • Frequency Dictionaries
  • Research papers