Document-based language identification

Description

Typically the aim is to create monolingual corpora, but one rarely knows for sure that all downloaded documents contain text in the desired language. Our tool LangSepa allows users to classify the language of a large number of documents. The classification is based on the distributions of words and of letter unigrams and trigrams within the text, which are compared to the expected distributions for each language. The most similar language is then chosen as the document language. Reference data for about 20 languages is included with LangSepa.
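
The following is a minimal sketch of this general idea, not LangSepa's actual implementation: it builds a character-trigram profile for a document and picks the reference language whose expected profile is most similar, using cosine similarity as one possible similarity measure. All class and method names are illustrative assumptions.

import java.util.HashMap;
import java.util.Map;

public class TrigramLanguageGuesser {

    // Relative frequency of each character trigram in the given text.
    static Map<String, Double> trigramProfile(String text) {
        Map<String, Double> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1.0, Double::sum);
            total++;
        }
        if (total > 0) {
            for (Map.Entry<String, Double> e : counts.entrySet()) {
                e.setValue(e.getValue() / total);
            }
        }
        return counts;
    }

    // Cosine similarity between two frequency profiles.
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Choose the reference language whose expected profile is most similar.
    static String guessLanguage(String documentText,
                                Map<String, Map<String, Double>> referenceProfiles) {
        Map<String, Double> docProfile = trigramProfile(documentText);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Double>> lang : referenceProfiles.entrySet()) {
            double score = similarity(docProfile, lang.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = lang.getKey();
            }
        }
        return best;
    }
}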

Input

The tool expects one or more local files in <source>-format with the file extension *.txt as input. They need to be encoded in UTF-8. A single file can contain several documents in <source>-format, created simply by concatenating them into one consecutive file.

The files need to be placed in the main directory of LangSepa. All *.txt files in this directory will be processed and deleted afterwards.
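
If your documents are spread over many individual files, a sketch like the following can combine them into one consecutive input file. The folder name "input" and the file name "combined.txt" are placeholders, not names LangSepa requires.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.List;

public class ConcatenateSourceFiles {
    public static void main(String[] args) throws IOException {
        Path inputDir = Paths.get("input");                  // folder holding the individual documents
        Path target = Paths.get("LangSepa", "combined.txt"); // placeholder for LangSepa's main directory

        // Append every UTF-8 *.txt file (already in <source>-format) to one combined file.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir, "*.txt")) {
            for (Path file : files) {
                List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
                Files.write(target, lines, StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }
}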

Output

The output of LangSepa consists of language-separated files in <source>-format. Depending on which statistic decided the language of a single document (the distribution of words, letter unigrams, or letter trigrams), it is placed in a subfolder named accordingly.

Parameters

Simply run:

java -jar LangSepa.jar

There is no need to change the configuration files.

Download

You can download LangSepa by clicking here.

Installation

After downloading the package, unzip it into a working directory of your choice.

A local MySQL database is required for storing the language data needed to run LangSepa.

In MySQL, execute the following statements:

CREATE DATABASE langsepa;
GRANT ALL PRIVILEGES ON langsepa.* TO 'langsepa'@'localhost' IDENTIFIED BY 'langsepa';

Then leave MySQL, return to your working directory, and execute:

mysql -u root -p langsepa < langsepa_DBs.sql
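
To verify that the database was set up correctly, a small check like the one below can be run; this is not part of LangSepa. It assumes the MySQL Connector/J driver is on the classpath, that MySQL listens on the default port 3306, and uses the database name, user, and password from the statements above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class CheckLangSepaDb {
    public static void main(String[] args) {
        // Connection details taken from the GRANT statement above;
        // adjust the port if your MySQL server uses a non-default one.
        String url = "jdbc:mysql://localhost:3306/langsepa";
        try (Connection con = DriverManager.getConnection(url, "langsepa", "langsepa")) {
            System.out.println("Connection to the langsepa database succeeded.");
        } catch (SQLException e) {
            System.out.println("Connection failed: " + e.getMessage());
        }
    }
}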

Now put your <source>-format files in the working directory and run LangSepa.