The EMILLE Corpus: Beta Version

Release date: 28th March 2003

This CD contains a beta-version release of files from the EMILLE Corpus, including samples of written, spoken and parallel data.

This file covers the licensing arrangements for use of the corpus, the text types contained in the beta version, the data that is absent from this beta release but will be included in the full release, the encoding of the corpus, and some known bugs and errors in the current version.

For further up-to-date information, including details of what uses have been made of the corpus data, see the project website at www.emille.lancs.ac.uk.

Feedback

We are keen to hear from beta users who identify errors in the corpus, have problems making use of the data, or wish to make a query or other comment.

To send us feedback on any of these issues, please email Andrew Hardie:

a.hardie@lancaster.ac.uk

Licensing of the beta version

The beta version of the EMILLE Corpus is licensed for non-profit research use until such time as the full release version becomes available. Full details are contained in the file emille_beta_licence in this directory.

Contents

Contained in this beta release are the following datasets:

1. A preliminary release of the EMILLE parallel corpus. This will ultimately consist of 200,000 words of parallel data in six languages (English, Bengali, Hindi, Urdu, Gujarati and Punjabi). The data is drawn from UK government publications that have been published in several of these languages. In some cases, where a document did not exist in one of the languages, an in-house translation was made to “fill the gap” and make the corpus comprehensively parallel; where this is the case it is indicated in the header. The current version contains 121,000 words in each language, and covers 46 out of the approximately 70-75 texts which will be contained in the release version.

2. Written data in Tamil (13 million words), Hindi (5.6 million words), Gujarati (10.6 million words) and Punjabi (1.4 million words in the Gurmukhi alphabet). This data is drawn from news websites.

3. Spoken data in each of the five languages for which such data is being collected. This data is primarily, but not exclusively, drawn from transcripts of radio broadcasts from the BBC Asian Network and BBC Radio Lancashire; it was all collected in the UK. The final release will include 500,000 words in each of Hindi, Bengali, Urdu, Punjabi, and Gujarati; in this release the wordcounts are as follows:

Language	Words
Bengali	265,000
Hindi	40,000
Gujarati	136,000
Punjabi	40,000
Urdu	118,000
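
The figures above can be set against the 500,000-word target for each language. The following Python sketch is purely illustrative: the variable and function names are our own, and the counts are simply those given in the table.

```python
# Spoken word counts in the beta release, copied from the table above.
beta_words = {
    "Bengali": 265_000,
    "Hindi": 40_000,
    "Gujarati": 136_000,
    "Punjabi": 40_000,
    "Urdu": 118_000,
}

# Target for the final release, per language (stated in item 3 above).
TARGET_PER_LANGUAGE = 500_000


def percent_of_target(words, target=TARGET_PER_LANGUAGE):
    """Return the share of the target that has been reached, in percent."""
    return 100 * words / target


progress = {lang: percent_of_target(n) for lang, n in beta_words.items()}
total_words = sum(beta_words.values())

for lang, pct in progress.items():
    print(f"{lang:<9}{beta_words[lang]:>8,}  {pct:5.1f}% of target")
print(f"{'Total':<9}{total_words:>8,}")
```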

What’s missing

Much of the monolingual written data in the Corpus has not been included in this release due to ongoing encoding difficulties. It will be included in the release version, which will add data in Urdu, Sinhala, Punjabi in the Perso-Arabic alphabet, Bengali, Kannada, Malayalam, Oriya, Assamese, Kashmiri, Telugu, and Marathi, plus a wider range of data for the languages already present in this release.

Encoding and annotation

The entirety of the corpus is encoded as plain Unicode text (using 2-byte Unicode characters, not the UTF-8 encoding). For more information on the Unicode Standard, see the website of the Unicode Consortium at http://www.unicode.org/
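
Software that expects UTF-8 will not read the files correctly unless told otherwise. The following Python sketch shows one way to decode 2-byte Unicode text. It assumes that the files are readable as UTF-16 and that, where no byte-order mark is present, the bytes are little-endian; neither point is confirmed in this file, so check a sample file first. The function name is our own.

```python
import codecs


def decode_two_byte_unicode(raw, default="utf-16-le"):
    """Decode bytes that hold 2-byte Unicode text.

    A byte-order mark, if present, decides the byte order. Otherwise
    `default` is used; little-endian is only an assumption here.
    """
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    return raw.decode(default)


def read_corpus_file(path):
    """Read a whole corpus file and return its text (SGML markup included)."""
    with open(path, "rb") as handle:
        return decode_two_byte_unicode(handle.read())
```

If the decoded text begins with the SGML header as expected, the byte-order assumption holds for that file; if it appears as nonsense characters, try passing default="utf-16-be" instead.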

The corpus is marked up in SGML using level 1 CES-compliant markup (CES being the Corpus Encoding Standard), with some exceptions as noted below under Known bugs and errors. Each file also includes a full header.

Some information about each text in the corpus is contained in its filename. The first three letters of a filename always encode the language of the file as follows:

hin Hindi

ben Bengali

urd Urdu

guj Gujarati

pun Punjabi

tam Tamil

eng English (parallel corpus only)

The next element indicates a spoken (-s-) or a written (-w-) file.

Spoken files are marked as demographically sampled (-dem-) or context governed (-cg-) as appropriate. Then the source of the data is indicated (usually the title of the radio programme from which the file was produced, e.g. -asiannet- for the BBC Asian Network).

For written files, the source of the text is indicated (e.g. -dunia- for the Hindi Webdunia news website), sometimes with one or more subcategories (e.g. -news-, -sport-).

For all monolingual texts, the filename ends either in the date of the document’s original publication (or the radio programme’s broadcast, etc.) or, where that information is not available, in a serial number.

Files in the parallel corpus are classified by the topic they address (-health-, -education-, -housing-, etc.); this is followed in the filename by that document’s unique identifier (e.g. -senguide-, -nation-, -rent-). Note that, with the exception of the first three characters indicating language, the filenames of parallel documents are identical.

The filename information is of course replicated within the header of each file, which also goes into somewhat greater detail.

Please note that all dates in both filenames and headers are in the format yy-mm-dd.
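
The conventions above for monolingual files can be applied mechanically. The following Python sketch is one possible reading of them, not a definitive parser. It assumes that the elements of a filename are separated by hyphens, that a date occupies the last three elements, and that two-digit years below 70 belong to the 2000s. The example filenames in the usage note are hypothetical, as are all function and variable names. Parallel-corpus filenames (topic followed by identifier) are not handled.

```python
import re
from datetime import date

LANGUAGES = {
    "hin": "Hindi", "ben": "Bengali", "urd": "Urdu", "guj": "Gujarati",
    "pun": "Punjabi", "tam": "Tamil", "eng": "English",
}
MODES = {"s": "spoken", "w": "written"}
SAMPLING = {"dem": "demographically sampled", "cg": "context governed"}


def parse_filename(name):
    """Split a monolingual corpus filename into its documented elements."""
    stem = name.rsplit(".", 1)[0]  # drop a file extension, if there is one
    parts = stem.split("-")
    if len(parts) < 3 or parts[0] not in LANGUAGES or parts[1] not in MODES:
        raise ValueError(f"unrecognised filename: {name!r}")

    info = {"language": LANGUAGES[parts[0]], "mode": MODES[parts[1]]}
    rest = parts[2:]

    # The name ends in a yy-mm-dd date or, failing that, a serial number.
    if len(rest) >= 3 and all(re.fullmatch(r"[0-9]{2}", p) for p in rest[-3:]):
        yy, mm, dd = (int(p) for p in rest[-3:])
        century = 2000 if yy < 70 else 1900  # assumed pivot, not documented
        info["date"] = date(century + yy, mm, dd)
        rest = rest[:-3]
    elif rest and rest[-1].isdigit():
        info["serial"] = rest[-1]
        rest = rest[:-1]

    # Spoken files carry a sampling marker before the source.
    if info["mode"] == "spoken" and rest and rest[0] in SAMPLING:
        info["sampling"] = SAMPLING[rest[0]]
        rest = rest[1:]

    info["source"] = rest[0] if rest else None
    info["subcategories"] = rest[1:]
    return info
```

For example, parse_filename("hin-w-dunia-news-02-05-01.txt") would report a written Hindi text from the source "dunia", subcategory "news", dated 1 May 2002, while parse_filename("guj-s-cg-asiannet-003") would report a context-governed spoken Gujarati text from "asiannet" with serial number "003".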

Known bugs and errors

The following errors could not be rectified in time for the beta release. They will be amended in the release version.

The written corpus does not yet contain any <s> elements to indicate sentences (although the parallel corpus does).

Due to a typographic error detected too late in the day to fix for the beta release, texts from the “Dinakaran” website have in their filenames the word -dinarkan-.

In the parallel corpus, the parallel file identifiers in the headers (which link a text to the corresponding files in other languages) have not yet been set up.

In the spoken corpus, where demographic information other than sex is listed for a speaker in the header, it should not be considered entirely reliable as final checks of this information are yet to be made.

In the Gujarati spoken texts, the speaker ID codes are not uniform across different texts (e.g. GF901 and GF004 are the same person).

In the Urdu spoken texts, the speaker ID codes are not uniform across different texts (e.g. UM200 is used for different people in different texts).

In the Bengali spoken texts, some texts lack the “id” attribute in the <u> element. In two to four texts, the speaker ID codes have not been implemented and speakers are indicated by forenames in CAPITAL LETTERS. There were previously multiple inconsistencies in the speaker ID codes; although these have been overhauled they have yet to be checked and thus may contain errors in uniformity akin to those in the Gujarati and Urdu texts.
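
Until the speaker ID codes are made uniform, users who count or compare speakers across spoken texts may wish to guard against the two problems described above. The following Python sketch shows one possible approach. The function name, the text names in the usage note, and the choice of GF004 rather than GF901 as the code to keep are all our own assumptions; the only alias listed is the one example given above.

```python
# Known cases where two codes denote one person. Only the single example
# documented above is listed; which code to keep is an arbitrary choice.
ALIASES = {"GF901": "GF004"}


def speaker_key(text_name, speaker_id, ids_are_corpus_wide=True):
    """Return a key that identifies a speaker for cross-text comparison.

    Where one code may denote different people in different texts (as in
    the Urdu data), pass ids_are_corpus_wide=False so that the key is
    qualified by the name of the text.
    """
    canonical = ALIASES.get(speaker_id, speaker_id)
    if ids_are_corpus_wide:
        return canonical
    return f"{text_name}:{canonical}"
```

With this, speaker_key("text-a", "GF901") and speaker_key("text-b", "GF004") give the same key, whereas speaker_key("text-a", "UM200", ids_are_corpus_wide=False) and speaker_key("text-b", "UM200", ids_are_corpus_wide=False) give different keys.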