References

How to cite the Leipzig Corpora Collection

For the whole collection, please cite the following general paper:

  • Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012 (Download)

For a more popular description of the German collection (Deutscher Wortschatz):

  • Uwe Quasthoff, Matthias Richter: Projekt Deutscher Wortschatz, Babylonia 3-2005, pp. 33-35. (Download)

For corpora in a given language, please cite the volume of the Technical Report Series on Corpus Building (see below) for the corresponding language. If that volume is not yet available, cite the general paper above.

Technical Report Series on Corpus Building

For a specific language, the Technical Report Series on Corpus Building describes

  • language-specific pre-processing issues
  • sources used for the corpora
  • language statistics, especially related to corpus quality.

Usually, there is one volume per language. For special corpora, there may be an additional report on the individual corpus, possibly written in the language of that corpus.

The series Frequency Dictionaries is published by Leipziger Universitätsverlag. The different dictionaries follow the same scheme:

  • The frequency dictionary is based on the word list of the largest corpus available for the corresponding language.
  • A chapter on language statistics describes the alphabet, the distribution of vowels and consonants, syllable and word length, text coverage, etc.
  • The most frequent words ordered by rank: the top 1,000 words in print and the top 1,000,000 words on the accompanying CD-ROM
  • The most frequent words ordered alphabetically: the top 10,000 words in print and the top 1,000,000 words on the accompanying CD-ROM
  • The word lists provided on the CD-ROM can be used freely under the Creative Commons Licence CC-BY (Vers. 3.0)

  • Vol. 1: Frequency Dictionary German (2011)
  • Vol. 2: Frequency Dictionary English (2012)
  • Vol. 3: Frequency Dictionary Icelandic (2012)
  • Vol. 4: Frequency Dictionary French (2013)
  • Vol. 5: Frequency Dictionary Hungarian (2013)

Research papers

  • Biemann, Chr.; Bordag, S.; Heyer, G.; Quasthoff, U.; Wolff, Chr.: Language-independent Methods for Compiling Monolingual Lexical Data. In: Proceedings of CICLing 2004, Seoul, Korea, and Springer LNCS 2945, pp. 215-228, Springer Verlag Berlin Heidelberg [pdf]
  • Biemann, C., Heyer, G., Quasthoff, U. and Richter, M. (2007): The Leipzig Corpora Collection – Monolingual corpora of standard size. In: Proceedings of Corpus Linguistics 2007, Birmingham, UK [pdf]
  • Eckart, Th., Quasthoff, U. and Goldhahn, D.: Language Statistics-Based Quality Assurance for Large Corpora. In: Proceedings of Asia Pacific Corpus Linguistics Conference 2012, Auckland, New Zealand, 2012. [pdf]
  • Eckart, Th., and Quasthoff, U.: Statistical Corpus and Language Comparison Using Comparable Corpora. In: Workshop on Building and Using Comparable Corpora, LREC, Malta, 2010. [pdf]
  • Goldhahn, D., Eckart, T., Quasthoff, U.: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the 8th Language Resources and Evaluation Conference (LREC) 2012 [pdf]
  • Hallsteinsdóttir, E., Eckart, T., Biemann, C., Quasthoff, U. and Richter, M. (2007): Íslenskur Orðasjóður - Building a Large Icelandic Corpus. In: Proceedings of NODALIDA-07, Tartu, Estonia [pdf]
  • Quasthoff, U., Richter, M. and Biemann, C. (2006): Corpus Portal for Search in Monolingual Corpora. In: Proceedings of the LREC 2006. [pdf]
  • Richter, M., Quasthoff, U., Hallsteinsdóttir, E. and Biemann, C. (2006): Exploiting the Leipzig Corpora Collection. Proceedings of IS-LTC'06, Ljubljana, Slovenia [pdf]