The EMILLE Corpus: Beta Version
Release Date 28th March 2003
This CD contains a beta-version release of files from the EMILLE Corpus, including
samples of written, spoken and parallel data.
This file contains a discussion of the licensing arrangements for use of the corpus,
a list of the text types contained in the beta version, a brief discussion of
the data which is not contained in this beta release but will be included in
the full release, a summary of the encoding of the corpus, and a brief discussion
of some known bugs and errors in the current version.
For further up-to-date information, including details of what uses have been made
of the corpus data, see the project website at www.emille.lancs.ac.uk.
Feedback
We are keen to hear from beta users who identify errors in the corpus,
have problems making use of the data, or wish to make a query or other comment.
To send us feedback on any of these issues, please email Andrew Hardie:
a.hardie@lancaster.ac.uk
Licensing of the beta version
The beta version of the EMILLE Corpus is licensed for non-profit research use until
such time as the full release version becomes available. Full details are
contained in the file emille_beta_licence in this directory.
Contents
Contained in this beta release are the following datasets:
1. A preliminary release of the EMILLE parallel corpus. This will ultimately consist
of 200,000 words of parallel data in six languages (English, Bengali, Hindi,
Urdu, Gujarati and Punjabi). The data is drawn from UK government publications
that have been published in several of these languages. In some cases, where a
document did not exist in one of the languages, an in-house translation was
made to “fill the gap” and make the corpus comprehensively parallel; where this
is the case it is indicated in the header. The current version contains 121,000
words in each language, and covers 46 out of the approximately 70-75 texts
which will be contained in the release version.
2. Written data in Tamil (13 million words), Hindi (5.6 million words), Gujarati (10.6
million words) and Punjabi (1.4 million words in the Gurmukhi alphabet). This
data is drawn from news websites.
3. Spoken data in each of the five languages for which such data is being collected. This
data is primarily, but not exclusively, drawn from transcripts of radio
broadcasts from the BBC Asian Network and BBC Radio Lancashire; it was all
collected in the UK. The final release will include 500,000 words in each of
Hindi, Bengali, Urdu, Punjabi, and Gujarati; in this release the wordcounts are
as follows:
Language | Words
Bengali  | 265,000
Hindi    | 40,000
Gujarati | 136,000
Punjabi  | 40,000
Urdu     | 118,000
What’s missing
Much of the monolingual written data in the Corpus has not been included in this
release due to ongoing encoding difficulties. It will be included in the
release version. This will include data in Urdu, Sinhala, Punjabi in the
Perso-Arabic alphabet, Bengali, Kannada, Malayalam, Oriya, Assamese, Kashmiri,
Telugu, and Marathi, plus a wider range of data for the languages already
present in this release.
Encoding and annotation
The entirety of the corpus is encoded as plain Unicode text (using 2-byte Unicode
characters, not the UTF-8 encoding). For more information on the Unicode
Standard, see the website of the Unicode Consortium at http://www.unicode.org/.
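As an illustration of what the 2-byte encoding means in practice, the following Python sketch reads one corpus file into a string. It is not part of the corpus distribution. The function name read_corpus_text is hypothetical, and the byte order is an assumption: the sketch uses a byte-order mark if one is present and otherwise falls back to little-endian, which may need changing for your copy of the files.

```python
import codecs
from pathlib import Path


def read_corpus_text(path, default_encoding="utf-16-le"):
    """Decode a file of 2-byte Unicode characters into a Python str.

    Assumption: a leading byte-order mark, if present, gives the byte order.
    If there is none, `default_encoding` is used, and that is only a guess.
    """
    raw = Path(path).read_bytes()
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    return raw.decode(default_encoding)
```

If the decoded text looks like a run of unrelated characters from the wrong script, the byte order is probably the opposite of the one assumed; try default_encoding="utf-16-be".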
The corpus is marked up in SGML using level 1 CES compliant markup (with some
exceptions as noted below under Known bugs and errors). Each file also
includes a full header.
Some information about each text in the corpus is contained in its filename. The
first three letters of a filename always encode the language of the file as
follows:
hin Hindi
ben Bengali
urd Urdu
guj Gujarati
pun Punjabi
tam Tamil
eng English (parallel corpus only)
The next element indicates a spoken (-s-) or a written (-w-) file.
Spoken files are marked as demographically sampled (-dem-) or context governed (-cg-)
as appropriate. Then the source of the data is indicated (usually the title of
the radio programme from which the file was produced, e.g. -asiannet- for the
BBC Asian Network).
For written files, the source of the text is indicated (e.g. -dunia- for the Hindi
Webdunia news website), sometimes with one or more subcategories (e.g. -news-,
-sport-).
For all monolingual texts, the filename ends either in the date of the document’s
original publication (or the radio programme’s broadcast, etc.) or, where that
information is not available, in a serial number.
Files in the parallel corpus are classified by the topic they address (-health-,
-education-, -housing-, etc.); this is followed in the filename by that
document’s unique identifier (e.g. -senguide-, -nation-, -rent-). Note that,
with the exception of the first three characters indicating language, the
filenames of parallel documents are identical.
The filename information is of course replicated within the header of each file,
which also goes into somewhat greater detail.
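The filename conventions described above can be read off mechanically. The Python sketch below is one possible way to do so and is not a tool supplied with the corpus. It assumes that filename elements are separated by hyphens and that any file extension can be discarded; the function name parse_filename and the example filenames are hypothetical and are not taken from the CD.

```python
from pathlib import PurePath

# Language codes as listed in this file.
LANGUAGES = {
    "hin": "Hindi", "ben": "Bengali", "urd": "Urdu", "guj": "Gujarati",
    "pun": "Punjabi", "tam": "Tamil", "eng": "English",
}
MODES = {"s": "spoken", "w": "written"}
SAMPLING = {"dem": "demographically sampled", "cg": "context governed"}


def parse_filename(name):
    """Split a corpus filename into the elements described in this file.

    Assumption: elements are hyphen-separated. Whatever follows the
    recognised elements (source, subcategories, date or serial number)
    is returned unchanged in the "rest" list.
    """
    parts = PurePath(name).stem.lower().split("-")
    code = parts[0]
    if code not in LANGUAGES:
        raise ValueError(f"unknown language code: {code!r}")
    info = {"language": LANGUAGES[code], "mode": None, "sampling": None}
    rest = parts[1:]
    if rest and rest[0] in MODES:
        info["mode"] = MODES[rest[0]]
        rest = rest[1:]
    # Only spoken files carry the -dem- / -cg- element.
    if info["mode"] == "spoken" and rest and rest[0] in SAMPLING:
        info["sampling"] = SAMPLING[rest[0]]
        rest = rest[1:]
    info["rest"] = rest
    return info


# Hypothetical filenames, for illustration only.
print(parse_filename("hin-w-dunia-news-02-05-01.txt"))
print(parse_filename("guj-s-cg-asiannet-02-12-04"))
```

Because the header of each file repeats this information in more detail, the header remains the better source wherever the two disagree.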
Please note that all dates in both filenames and headers are in the format yy-mm-dd.
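For users who need to sort or filter texts by date, a yy-mm-dd string can be converted as in the Python sketch below. The function name parse_corpus_date is hypothetical. Note one assumption about the century: Python's %y directive maps 69-99 to 1969-1999 and 00-68 to 2000-2068, which may or may not suit every text in the corpus.

```python
from datetime import datetime


def parse_corpus_date(text):
    """Convert a yy-mm-dd string into a datetime.date.

    Assumption: the two-digit year follows Python's %y pivot
    (69-99 -> 19xx, 00-68 -> 20xx).
    """
    return datetime.strptime(text, "%y-%m-%d").date()


print(parse_corpus_date("03-03-28"))  # prints 2003-03-28
```

A filename that ends in a serial number rather than a date will raise ValueError here, so callers should be prepared to catch it.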
Known bugs and errors
The following errors could not be rectified in time for the beta release. They will
be amended in the release version.
The written corpus does not yet contain any <s> elements to
indicate sentences (although the parallel corpus does).
Due to a typographic error detected too late to fix for the beta
release, texts from the “Dinakaran” website have in their filenames the word
-dinarkan-.
In the parallel corpus, the parallel file identifiers in the headers
(which link a text to the corresponding files in other languages) have not yet
been set up.
In the spoken corpus, where demographic information other than sex
is listed for a speaker in the header, it should not be considered entirely
reliable as final checks of this information are yet to be made.
In the Gujarati spoken texts, the speaker ID codes are not uniform across
different texts; e.g. GF901 and GF004 are the same person.
In the Urdu spoken texts, the speaker ID codes are not uniform across different
texts; e.g. UM200 is used for different people in different texts.
In the Bengali spoken texts, some texts lack the “id” attribute in the <u>
element. In two to four texts, the speaker ID codes have not been implemented
and speakers are indicated by forenames in CAPITAL LETTERS. There were
previously multiple inconsistencies in the speaker ID codes; although these
have been overhauled they have yet to be checked and thus may contain errors in
uniformity akin to those in the Gujarati and Urdu texts.
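Users who want to find out which spoken texts are affected by the missing “id” attribute could run a rough check along the lines of the Python sketch below. It is a pattern match over already-decoded text, not an SGML parser, so it assumes that <u> start tags are written on one piece without a ">" inside an attribute value. The function name count_u_without_id is hypothetical, and the "who" attribute in the sample string is invented for illustration.

```python
import re

# Matches <u> and <u ...> start tags, but not </u> or longer names like <url>.
U_START_TAG = re.compile(r"<u\b([^>]*)>", re.IGNORECASE)
ID_ATTR = re.compile(r"\bid\s*=", re.IGNORECASE)


def count_u_without_id(text):
    """Return (total <u> start tags, number of them lacking an id attribute)."""
    total = missing = 0
    for match in U_START_TAG.finditer(text):
        total += 1
        if not ID_ATTR.search(match.group(1)):
            missing += 1
    return total, missing


# Invented sample markup, for illustration only.
sample = '<u who="BF001" id="1">...</u><u who="BF002">...</u>'
print(count_u_without_id(sample))  # prints (2, 1)
```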