Creating and Documenting Electronic Texts

 

Chapter 5: SGML/XML and TEI
 

The previous chapter showed what markup is, and how it plays a crucial role in almost every aspect of information processing. Now we shall learn about some important applications of descriptive markup which are ideally suited to the types of texts studied by those working in the arts and humanities disciplines.

5.1: The Standard Generalized Markup Language (SGML)

The late 1970s and early 1980s saw a consensus emerging that descriptive markup languages had numerous advantages over other types of text encoding. A number of products and macro languages appeared which were built around their own descriptive markup languages — and whilst these represented a step forward, they were also constrained by the fact that users were required to learn a new markup language each time, and could only describe those textual features which the markup scheme allowed (sometimes extensions were possible, but implementing them was rarely a straightforward process).

The International Organization for Standardization (ISO) also recognised the value of descriptive markup schemes, and in 1986 an ISO committee released a new standard called ISO 8879, the Standard Generalized Markup Language (SGML). This complex document represented several years' effort by an international committee of experts, working together under the chairmanship of Dr Charles Goldfarb (one of the creators of IBM's descriptive markup language, GML). Since SGML was a product of the International Standards process, the committee also had the benefit of input from experts from the numerous national standards bodies associated with the ISO, such as the UK's British Standards Institution (BSI).

5.1.1: SGML as metalanguage

A great deal of largely unjustified mystique surrounds SGML. You do not have to look very hard to find instances of SGML being described as 'difficult to learn', 'complex to implement', or 'expensive to use', when in fact it is none of these things. People all too frequently confuse the acronym, SGML, with SGML applications — many of which are indeed highly sophisticated and complex operations, designed to meet the rigorous demands of blue-chip companies working in major international industries (automotive, pharmaceutical, or aerospace engineering). It should not be particularly surprising that a documentation system designed to control and support every aspect of the tens of thousands of pages of documentation needed to build and maintain a battleship, fix the latest passenger aircraft, or supplement a legal application for international recognition of a new advanced drug treatment, should appear overwhelmingly complex to an outsider. In fact, despite its name, SGML is not even a markup language. Instead, it would be more appropriate to call SGML a 'metalanguage'.

In a conventional markup language, such as HTML, users are offered a pre-defined set of markup tags from which they must make appropriate selections; if they suddenly introduce new tags which are not part of the HTML specification, then it is clear that the resulting document will not be considered valid HTML, and it may be rejected or incorrectly processed by HTML software (e.g. an HTML-compatible browser). SGML, on the other hand, does not offer a pre-defined set of markup tags. Rather, it offers a grammar and specific vocabulary which can be used to define other markup languages (hence 'metalanguage').

SGML is not constrained to any one particular type of application: it is neither more nor less suited to producing technical documentation and specifications for the semiconductor industry than to marking up the linguistic features of ancient inscribed tablets of stone. In fact, SGML can be used to create a markup language to do pretty well anything, and that is both its greatest strength and its greatest weakness. SGML cannot be used 'out-of-the-box', so to speak, and because of this it has earned an undeserved reputation in some quarters as being troublesome and slow to implement. On the other hand, there are many SGML applications (and later we shall learn about one in particular) which can be used straightaway, as they offer a fully documented markup language which can be recognised by any one of a suite of tools and implemented with a minimum of fuss. SGML provides a mechanism for like-minded people with a shared concern to get together and define a common markup language which satisfies their needs and desires, rather than being limited by the vision of the designers of a closed, possibly proprietary markup scheme which only does half the job.

SGML offers another advantage in that it not only allows (groups of) users to define their own markup languages, it also provides a mechanism for ensuring that the rules of any particular markup language can be rigorously enforced by SGML-aware software. For example, within HTML, although there are six different levels of heading defined (e.g. the tags <H1> to <H6>) there is no requirement that they should be applied in a strictly hierarchical fashion; in other words, it is perfectly possible for a series of headings in an HTML document to be marked up as <H1>, then <H3>, followed by <H5>, followed in turn by <H2>, <H4>, and <H6> — all to achieve a particular visual appearance in a particular HTML browser. By contrast, should such a feature be deemed important, an SGML-based markup language could be written in such a way that suitable software can ensure that levels of heading nest in a strictly hierarchical fashion (and the strength of this approach can perhaps become even more evident when encoding other kinds of hierarchical structure, e.g. a <BOOK> must contain one or more <CHAPTER>s, each of which must in turn contain one or more <PARAGRAPH>s, and so on). We shall learn more about this in the following section.
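
By way of illustration, the following element declarations (written in the declaration syntax defined by the SGML standard, with element names invented on the basis of the example above) would oblige a validating parser to reject any document in which a <BOOK> did not contain at least one <CHAPTER>, or a <CHAPTER> at least one <PARAGRAPH>. This is a minimal sketch rather than a complete DTD:

    <!ELEMENT BOOK      - - (TITLE, CHAPTER+)   >
    <!ELEMENT CHAPTER   - - (TITLE, PARAGRAPH+) >
    <!ELEMENT PARAGRAPH - - (#PCDATA)           >
    <!ELEMENT TITLE     - - (#PCDATA)           >

The '+' occurrence indicator means 'one or more', so the nesting rules are enforced mechanically: a <CHAPTER> appearing outside a <BOOK>, or a <BOOK> containing no <CHAPTER>s, would be reported as an error.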

There is one final, crucial, difference between SGML-based markup languages and other descriptive markup languages: the process by which International Standards are created, maintained, and updated. ISO Standards are subject to periodic formal review, and each time this work is undertaken it happens in full consultation with the various national standards bodies. The Committee which produced SGML has guaranteed that if and when any changes are introduced to the SGML standard, this will be done in such a way as to ensure backwards compatibility. This is not a commitment which has been made lightly, and the full implications can be inferred from the fact that commercial enterprises rarely make such an explicit commitment (and even when they do, users ought to reflect upon the likelihood that such a commitment will actually be fulfilled given the considerable pressures of a highly competitive marketplace). The essential difference has been characterised thus: the creators of SGML believe that a user's data should belong to that user, and not be tied up inextricably in a proprietary markup system over which that user has no control; whereas the creators of a proprietary markup scheme can reasonably be expected to have little motivation to ensure that data encoded using their scheme can be easily migrated to, or processed by, a competitor's software products.

5.1.2: The SGML Document

The SGML standard gives a very rigid definition as to what constitutes an 'SGML document'. Whilst there is no need for us to consider this definition in detail at this stage, it is worthwhile reviewing the major concepts as they offer a valuable insight into some crucial aspects of an electronic text. Perhaps first and foremost amongst these is the notion that an SGML document is a single logical entity, even though in practice that document may be composed of any number of physical data files, spread over a storage medium (e.g. a single computer's hard-disk) or even over different types of storage media connected together via a network. As today's electronic publications become more and more complex, mixing (multilingual) text with image, audio, and video data, the need to ensure that they are created in line with accepted standards only grows. For example, an article from an electronic journal mounted on a website may be delivered to the end-user in the form of a single HTML document, but that article (and indeed the whole journal) may rely upon dozens or hundreds of data files, a database to manage the entire collection of files, several bespoke scripts to handle the interfacing between the web and the database, and so on. Therefore, whenever we talk about an electronic document, it is vitally important to remember that this single logical entity may, in fact, consist of many separate data files.

SGML operates on the basis of there being three major parts which combine to form a single SGML document. Firstly, there is the SGML declaration, which specifies any system and software constraints. Secondly, there is the prolog, which defines the document structure. Lastly, there is the document instance, which contains what one would ordinarily think of as the document. Whilst this may appear unnecessarily complicated, in fact it provides an extremely valuable insight into the key components which are essential to the creation of an electronic document.

The SGML declaration tells any software that is going to process an SGML document all that it should need to know. For example, the SGML declaration specifies which character sets have been used in the document (normally ASCII or ISO 646, but more recently this could be Unicode, or ISO 10646). It also establishes any constraints on system variables (e.g. the length of markup tag names, or the depth to which tags can be nested), and states whether or not any of SGML's optional features have been used. The SGML standard offers a default set-up, so that, for example, the characters < and > are used to delimit markup tag names — and with the widespread acceptance of HTML, this has become the accepted way to indicate markup tags — but if for any reason this presented a problem for a particular application (e.g. encoding a lot of data in which < and > were heavily used to indicate something else), it would be possible to redefine the delimiters as @ or #, or whatever characters were deemed to be more appropriate.
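
The details need not detain us here, but to give an impression of what such a redefinition involves: delimiter roles are assigned in the SYNTAX portion of the SGML declaration, and a heavily abbreviated, purely illustrative fragment reassigning the start-tag open and tag close delimiters might read:

    DELIM
        GENERAL SGMLREF     -- start from the reference delimiters --
        STAGO   "@"         -- start-tag open: @ rather than <     --
        TAGC    "#"         -- tag close: # rather than >          --

With such a declaration in force, a start-tag would be written @TITLE# rather than <TITLE>.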

The SGML declaration is important for a number of reasons. Although it may seem an unduly complicated approach, it is often these fundamental system or application dependencies which make it so difficult to move data around between different software and hardware environments. If the developers of wordprocessing packages had started off by agreeing on a single set of internal markup codes they would all use to indicate a change in font, the centring of a line of text, the occurrence of a page break, etc., then users' lives would have been made a great deal easier; however, this did not happen, and hence we are left in a situation where data created in one application cannot easily be read by another. We should also remember that, as our reliance upon information technology grows and time passes, and as applications and companies appear or go bust, there may be data which we wish to exchange or reuse which were created when the world of computing was a very different place. It is a very telling lesson that although we are still able to access data inscribed on stone tablets or committed to papyri or parchment hundreds (if not thousands) of years ago, we already have masses of computer-based data which are effectively lost to us because of technological progress, the demise of particular markup schemes, and so on. Furthermore, because a default environment is supplied, the average end-user of an SGML-based encoding system is unlikely to have to familiarise him- or herself with the intricacies of the SGML declaration. Indeed it should be enough simply to be aware of the existence of the SGML declaration, and how it might affect one's ability to create, access, or exploit a particular source of data.

The next major part of an SGML document is the prolog, which must conform to the specification set out in the formal SGML standard, and the syntax given in the SGML declaration. Although it is hard to discuss the prolog without getting bogged down in the details of SGML, suffice it to say that it contains (at least one) document type declaration, which in turn contains (or references) a Document Type Definition (or DTD). The DTD is one of the single most important features of SGML, and what sets it apart from — not to say above — other descriptive markup schemes. Although we shall learn a little more about the process in the following section, the DTD contains a series of declarations which define the particular markup language which will be used in the document instance, and also specifies how the different parts of that language can interrelate (e.g. which markup tags are required and optional, the contexts in which they can be used, and so on). Often, when people talk about 'using SGML', they are actually talking about using a particular DTD, which is why some of the negative comments that have been made about SGML (e.g. 'It's too difficult.', or 'It doesn't allow me to encode those features which I consider to be important') are misdirected: such complaints should properly be aimed at the DTD (and thus at the DTD designer) rather than at SGML in general. Other than some of the system constraints imposed by the SGML declaration, there are no strictures imposed by the SGML standard regarding how simple or complex the markup language defined in the DTD should be.
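
To give a flavour of what these declarations look like, here is a hypothetical document type declaration for a trivially simple markup language for poems (the element names are invented for the purposes of illustration):

    <!DOCTYPE POEM [
    <!ELEMENT POEM   - - (TITLE, STANZA+) >
    <!ELEMENT TITLE  - - (#PCDATA)        >
    <!ELEMENT STANZA - - (LINE+)          >
    <!ELEMENT LINE   - O (#PCDATA)        >
    ]>

The four element declarations between the square brackets constitute the DTD: every <POEM> must contain a <TITLE> followed by one or more <STANZA>s, each made up of one or more <LINE>s, and the '- O' in the declaration of <LINE> states that its end-tag may be omitted.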

Whilst the syntax used to write a DTD is fairly straightforward, and most people find that they can start to read and write DTDs with surprising ease, to create a good DTD requires experience and familiarity with the needs and concerns of both data creators and end-users. A good DTD nearly always reflects a designer's understanding of all these aspects, an appreciation of the constraints imposed by the SGML standard, and a thorough process of document analysis (see Chapter 2) and DTD-testing. In many ways this situation is indicative of the fact that the creators of the SGML standard did not envisage that individual users would be very likely to produce their own DTDs for highly specific purposes. Rather, they thought (or perhaps hoped) that groups would form within industry sectors or large-scale enterprises to produce DTDs that were tailored to the needs of their particular application. Indeed, the areas in which the uptake of SGML has been most enthusiastic have been operating under exactly those sorts of conditions — for example, the Air Transport Association seeking to standardise aircraft maintenance documentation, or the pharmaceutical industry's attempts to streamline the documentary evidence needed to support applications to the US Food and Drug Administration. As we shall see, the DTD of prime importance to those working within the Arts and Humanities disciplines has already been written and documented by the members of the Text Encoding Initiative, and in that case the designers had the foresight to build in mechanisms to allow users to adapt or extend the DTD to suit their specific purposes. However, as a general rule, if users wish to write their own DTDs, or tweak an SGML declaration, they are entirely free to do so (within the framework set out by the SGML standard) — but the vast majority of SGML users prefer to rely upon an SGML declaration and DTD created by others, for all the benefits of interoperability and reusability promised by this approach.

This brings us to the third main part of an SGML document: namely, the document instance itself. This is the part of the document which contains a combination of raw data and markup, and its contents are constrained by both the SGML declaration, and the contents of the prolog (especially the declarations in the DTD). Clearly from the perspective of data creators and end-users, this is the most interesting part of an SGML document — and it is common practice for people to use the term 'SGML document' when they are actually referring to a document instance. Such confusion should be largely unproblematic, provided these users always remember that when they are interchanging data (i.e. a document instance) with colleagues, they should also pass on the relevant DTD and SGML declaration. In the next section we shall investigate the practical steps involved in the creation of an SGML document, and the very valuable role that can be played by SGML-aware software.
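
Returning to the hypothetical POEM example above, a complete document instance conforming to that DTD might read:

    <POEM>
    <TITLE>An Example Poem</TITLE>
    <STANZA>
    <LINE>The first line of the first stanza,</LINE>
    <LINE>The second line of the first stanza.</LINE>
    </STANZA>
    </POEM>

(Because <LINE> was declared with an omissible end-tag, the encoder could also have left out the </LINE> tags and relied upon the parser to infer them; as we shall see in 5.1.4, XML does away with such short-cuts.)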

5.1.3: Creating Valid SGML Documents

How you create SGML documents will be greatly influenced by the aims of your project, the materials you are working with, and the resources available to you. For the purposes of this discussion, let us start by assuming that you have a collection of existing non-electronic materials which you wish to turn into some sort of electronic edition.

If you have worked your way through the chapter on document analysis (Chapter 2), then you will know what features of the source material are important to you, and what you will want to be able to encode with your markup. Similarly, if you have considered the options discussed in the chapter on digitization (Chapter 3), you will have some idea of the type of electronic files with which you will be starting to work. Essentially, if you have chosen to OCR the material yourself, you will be using 'clear' or 'plain ASCII' text files, which will need to undergo some sort of editing or translation as part of the markup process. Alternatively, if the material has been re-keyed, then you will either have electronic text files which already contain some basic markup, or you will again have plain ASCII text files.

Having identified the features you wish to encode, you will need to find a DTD which meets your requirements. Rather than trying to write your own DTD from scratch, it is usually worthwhile investing some time to look around for existing public DTDs which you might be able to adopt, extend, or adapt to suit your particular purposes. There are many DTDs available in the public domain, or made freely available for others to use (e.g. see Robin Cover's The SGML/XML Web Page (http://www.oasis-open.org/cover/)), but even if none of these match your needs, some may be worth investigating to see how others have tackled common problems. Although there are some tools available which are designed to facilitate the process of DTD-authoring, they are probably only worth buying if you intend to be doing a great deal of work with DTDs, and they can never compensate for poor document analysis. However, if you are working with literary or linguistic materials, you should take the time to familiarise yourself with the work of the Text Encoding Initiative (see 5.2: The Text Encoding Initiative and TEI Guidelines), and think very carefully before rejecting use of their DTD.

Before we go any further, let us consider two other scenarios: one where you already have the material in electronic form but you need to convert it to SGML; the other, where you will need to create SGML from scratch. Once again, there are many useful tools available to help convert from one markup scheme to another, but if your target format is SGML this may have some bearing on the likelihood of success (or failure) of any conversion process. As we have seen, SGML lends itself most naturally to a structured, hierarchical view of a document's content (although it is perfectly possible to represent very loose organisational structures, and even non-hierarchical document webs, using SGML markup) and this means that it is much simpler to convert from a proprietary markup scheme to SGML if that scheme also has a strong sense of structure (i.e. adopts a descriptive markup approach) and has been used sensibly. However, if a document has been encoded with a presentational markup scheme which has, for example, used codes to indicate that certain words should be rendered in an italic font — regardless of the fact that sometimes this has been for emphasis, at other times to indicate book and journal titles, and elsewhere to indicate non-English words — then this will dramatically reduce the chances of automatically converting the data from this presentation-oriented markup scheme into one which complies with an SGML DTD.

It is probably worth noting at this point that these conversion problems primarily apply when converting from a non-descriptive, non-SGML markup language into SGML; the opposite process, namely converting from SGML into another target markup scheme, is much more straightforward (because it would simply mean that data variously marked-up with, say, <EMPHASIS>, <TITLE>, and <FOREIGN> tags, had their markup converted into the target scheme's markup tags for <ITALIC>). It is also worth noting that such a conversion might not be a particularly good idea, because you would effectively be throwing information away. In practice it would be much more sensible to retain the descriptive/SGML version of your material, and convert to a presentational markup scheme only when absolutely required for the successful rendering of your data on screen or on paper. Indeed, many dedicated SGML applications support the use of stylesheets to offer some control over the on-screen rendition of SGML-encoded material, whilst preserving the SGML markup behind the scenes.
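
To make the loss concrete, consider an invented fragment marked up with the descriptive tags mentioned above:

    He read <TITLE>Middlemarch</TITLE> with <EMPHASIS>great</EMPHASIS>
    attention, noting every <FOREIGN>bon mot</FOREIGN>.

After conversion to a presentational scheme this might become:

    He read <ITALIC>Middlemarch</ITALIC> with <ITALIC>great</ITALIC>
    attention, noting every <ITALIC>bon mot</ITALIC>.

The first conversion is trivial to automate; reversing it is not, because nothing in the second version records which italicised phrases are titles, which are emphasised, and which are foreign words.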

If you are creating SGML documents from scratch, or editing existing SGML documents (perhaps the products of a conversion process, or the results of a re-keying exercise) there are several factors to consider. It is essential that you have access to a validating SGML parser, which is a software program that can read an SGML declaration and a document's prolog, understand the declarations in the DTD, and ensure that the SGML markup used throughout the document instance conforms appropriately. In many commercial SGML- and XML-aware software packages, a validating parser is included as standard and is often very closely integrated with the relevant tools (e.g. to ensure that any simple editing operations, such as cut and paste, do not result in the document failing to conform to the rules set out in the DTD because markup has been inserted or removed inappropriately). It is also possible to find freeware and public domain software which have some understanding of the markup rules expressed in the DTD, while also allowing users to validate their documents with a separate parser in order to guarantee conformance. Your choice will probably be dictated by the kind of software you currently use (e.g. in the case of editors: Windows-based office-type applications, or Unix-style plain text editors?), the budget you have available, and the files with which you will be working. Whatever your decision, it is important to remember that a parser can only validate markup against the declarations in a DTD, and it cannot pick up semantic errors (e.g. incorrectly tagging a person's name as, say, a place name, or an epigraph as if it were a subtitle).

So for the purposes of creating valid SGML documents, we have seen that there are a number of tools which you may wish to consider. If you already have files in electronic form, you will need to investigate translation or auto-tagging software — and if you have a great many files of the same type, you will probably want software which supports batch processing, rather than anything which requires you to work on one file at a time. If you are creating SGML documents from scratch, or cleaning-up the output of a conversion process, you will need some sort of editor (ideally one that is SGML-aware), and if your editor does not incorporate a parser, you will need to obtain one that can be run as a stand-alone application (there are one or two exceptionally good parsers freely available in the public domain). For an idea of the range of SGML and XML tools available, readers should consult Steve Pepper's The Whirlwind Guide to SGML & XML Tools and Vendors (http://www.infotek.no/sgmltool/guide.htm).
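
One such freely available parser is nsgmls, part of James Clark's SP package. As a minimal sketch (the file name here is invented, and in practice the parser must also be able to locate the relevant SGML declaration and DTD, for example via a catalog file), a document can be validated from the command line with:

    nsgmls -s mydocument.sgml

The -s option suppresses the parser's normal output, so that only error messages (if any) are reported; a silent run indicates a valid document.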

Producing valid SGML files which conform to a DTD is, in some respects, only the first stage in any project. If you want to search the files for particular words, phrases, or marked-up features, you may prefer to use an SGML-aware search engine, but some people are perfectly happy writing short scripts in a language like Perl. If you want to conduct sophisticated computer-assisted text analysis of your material, you will almost certainly need to look at adapting an existing tool, or writing your own code. Having obtained your SGML text, whether as complete documents or as fragments resulting from a search, you will need to find some way of displaying it. You might choose simply to convert the SGML markup in the data into another format (e.g. HTML for display in a conventional web browser), or you might use one of the specialist SGML viewing packages to publish the results — which is how many commercial SGML-based electronic texts are produced. We do not have sufficient space to consider all the various alternatives in this publication, but once again you can get an idea of the options available by looking at The Whirlwind Guide to SGML & XML Tools and Vendors (http://www.infotek.no/sgmltool/guide.htm) or, more generally, The SGML/XML Web Page (http://www.oasis-open.org/cover/).

5.1.4: XML: The Future for SGML

As we saw in the previous section, an SGML-based markup language usually offers a number of advantages over other types of markup scheme, especially those which rely upon proprietary encoding. However, although SGML has met with considerable success in certain areas of publishing and many commercial, industrial, and governmental sectors, its uptake by the academic community has been relatively limited (with the notable exception of the Text Encoding Initiative, see 5.2: The Text Encoding Initiative and TEI Guidelines, below). We can speculate on why this might be so — for example, SGML has an undeserved reputation for being difficult and expensive to produce because it imposes prohibitive intellectual overheads, and because the necessary software is lacking (at least at prices academics can afford). While it is true that performing a thorough document analysis and developing a suitable DTD should not be undertaken lightly, it could be argued that to approach the production of any electronic text without first investing such intellectual resources is likely to lead to difficulties (either in the usefulness or the long-term viability of the resulting resource). The apparent lack of readily available, easy-to-use SGML software is perhaps a more valid criticism — yet the resources have been available for those willing to look, and then invest the time necessary to learn a new package (although freely available software tends to put more of an onus on the user than some of the commercial products). However, what is undoubtedly true is the fact that writing a piece of SGML software (e.g. a validating SGML parser), which fully implements the SGML standard, is an extremely demanding task — and this has been reflected in the price and sophistication of some commercial applications.

Whilst SGML is probably more ubiquitous than many people realise, HTML — the markup language of the World Wide Web — is much better known. Nowadays, the notion of 'the Web' is effectively synonymous with the global Internet, and HTML plays a fundamental role in the delivery and presentation of information over the Web. The main advantage of HTML is that it is a fixed set of markup tags designed to support the creation of straightforward hypertext documents. It is easy to learn and easy for developers to implement in their software (e.g. HTML editors and browsers), and the combination of these factors has played a large part in the rapid growth and widespread acceptance of the Web. There is so much information about HTML already available, that there is little to be gained from going into much detail here — however, readers who wish to know more should visit the W3C's HyperText Markup Language Home Page (http://www.w3.org/MarkUp/).

Although HTML was not originally designed as an application of SGML, it soon became one once the designers realised the benefits to be gained from having a DTD (e.g. a validating parser could be used to ensure that markup had been used correctly, and so the resulting files would be easier for browsers to process). However, this meant that the HTML DTD had to be written retrospectively, and in such a way that any existing HTML documents would still conform to the DTD — which in turn meant that the value of the DTD was effectively diminished! This situation led to the release of a succession of different versions of HTML, each with its own slightly different DTD. Nowadays, the most widely accepted release of HTML is probably version 3.2, although the World Wide Web Consortium (W3C) released HTML 4.0 on 18th December 1997 in order to address a number of outstanding concerns about the HTML standard. Further versions of HTML now seem unlikely, although the W3C's HTML committees continue to take account of other developments within the Consortium, and this has led to proposals such as the XHTML 1.0 Proposed Recommendation document released on 24th August 1999 (see http://www.w3.org/TR/1999/PR-xhtml1-19990824/).

It is perfectly possible to deliver SGML documents over the Web, but there are several ways that this can be achieved and each has different implications. In order to retain the full 'added-value' of the SGML markup, you might choose to deliver the raw SGML data over the Web and rely upon a behind-the-scenes negotiation between your web-server and the client's browser to ensure that an appropriate SGML-viewing tool is launched on the client's machine. This enables the end-user to exploit fully the SGML markup included in your document, provided that s/he has been able to obtain and install the appropriate software. Another possibility would be to offer a Web-to-SGML interface on your server, so that end-users can access your documents using an ordinary Web browser whilst all the processing of the SGML markup takes place on the server, and the results are delivered as HTML. Alternatively, you might decide simply to convert the markup into HTML from whatever SGML DTD has been used to encode the document (either on-the-fly, or as part of a batch process) so that the end-user can use an ordinary Web browser and the server will not have to undertake any additional processing. The last of these options, while placing the least demands on the end-user, effectively involves throwing away all the extra intellectual information that is represented by the SGML encoding; for example, if in your original SGML document, proper nouns, place names, foreign words, and certain types of emphasis have each been encoded with different markup according to your SGML DTD, they may all be translated to <EM> tags in HTML — and thus any automatically identifiable distinction between these different types of content will probably have been lost. The first option retains the advantages of using SGML, whilst placing a significant onus on the end-user to configure his or her Web browser correctly to launch supporting applications. The second option represents a middle way: exploiting the SGML markup whilst delivering easy-to-use HTML, but with the disadvantage of having to do much more sophisticated processing at the Web server.

Until recently, therefore, those who create and deliver electronic text were confronted with a dilemma: to use their own SGML DTD with all the additional processing overheads that entails, or use an HTML DTD and suffer a diminution of intellectual rigour and descriptive power? Extending HTML was not an option for individuals and projects, because the developers of Web tools were only interested in supporting the flavours of HTML endorsed by the W3C. Meanwhile, delivering electronic text marked-up according to another SGML DTD meant that end-users were obliged to obtain suitable SGML-aware tools, and very few of them seemed willing to do this. One possible solution to this dilemma is the Extensible Markup Language (XML) 1.0 (see http://www.w3.org/TR/REC-xml), which became a W3C Recommendation (the nearest thing to a formal standard) on 10th February 1998.

The creators of XML adopted the following design goals:

  1. XML shall be straightforwardly usable over the Internet.
  2. XML shall support a wide variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs which process XML documents.
  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  6. XML documents should be human-legible and reasonably clear.
  7. The XML design should be prepared quickly.
  8. The design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness in XML markup is of minimal importance.

They sought to gain the generic advantages offered by supporting arbitrary SGML DTDs, whilst retaining much of the operational simplicity of using HTML. To this end, they 'threw away' all the optional features of the SGML standard which make it difficult (and therefore expensive) to process. At the same time they retained the ability for users to write their own DTDs, so that they can develop markup schemes which are tailored to suit particular applications but which are still enforceable by a validating parser. Perhaps most importantly of all, the committee which designed XML had representatives from several major companies which develop software applications for use with the Web, particularly browsers, and this has helped to encourage a great deal of interest in XML's potential.

SGML has its roots in a time when creating, storing, and processing information on computer was expensive and time-consuming. Many of the optional features supported by the SGML standard were intended to make it cheaper to create and store SGML-conformant documents in an era when it was envisaged that all the markup would be laboriously inserted by hand, and megabytes of disk space were extremely expensive. Nowadays, faster and cheaper processors, and the falling costs of storage media (both magnetic and optical), mean that the designers and users of applications are less worried about the concerns of SGML's original designers. On the other hand, the ever growing volume of electronic information makes it all the more important that any markup which has been used has been applied in a thoroughly consistent and easy to process manner, thereby helping to ensure that today's applications perform satisfactorily.

XML addresses these familiar concerns, whilst taking advantage of modern computer systems and the lessons learned from using SGML. For example, now that the cost of storing data is of less concern to most users (except for those dealing with extremely large quantities of data), there is no need to offer support for markup short-cuts which, while saving storage space, tend to impose an additional load when information is processed. Instead, XML's designers were able to build in the concept of 'well-formed' data, which requires that any marked-up data are explicitly bounded by start- and end-tags, and that all the tagged data in a document nest appropriately (so that it becomes possible, say, to generate a document tree which captures the hierarchical arrangement of all the data elements in the document). This has the added advantage that when two applications (such as a database and a web server) need to exchange data, they can use well-formed XML as their interchange format, because both the sending and receiving application can be certain that any data they receive will be appropriately marked-up and there can be no possible ambiguity about where particular data structures start and end.
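
The following invented fragment is well-formed XML: the XML declaration identifies the version in use, every element has an explicit start- and end-tag, and the elements nest properly inside one another:

    <?xml version="1.0"?>
    <letter>
      <sender>A. Author</sender>
      <recipient>B. Reader</recipient>
      <body>
        <paragraph>Thank you for your <emphasis>very</emphasis>
        helpful comments.</paragraph>
      </body>
    </letter>

No DTD is required for an XML application to build the document tree for this fragment, because the boundaries of every element are completely explicit.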

XML takes this approach one stage further by adopting the SGML concept of DTDs, such that an XML document is said to be 'valid' if it has an associated DTD and the markup used in the document has been checked (by a validating parser) against the declarations expressed in that DTD. If an application knows that it will be handling valid XML, and has an understanding of and access to the relevant DTD, this can greatly improve its ability to process that data — for example, a search and retrieval application would be able to construct a list of all the marked-up data structures in the document, so that a user could refine the search criteria accordingly. Knowing that a vast collection of XML documents have all been validated against a particular DTD will greatly assist the processing of that collection, as valid XML data is also necessarily well-formed. By contrast, while it is possible for an XML application to process a well-formed document such that it can derive one possible DTD which could represent the data structures it contains, that DTD may not be sufficient to represent all the well-formed XML documents of the same type. There are clearly many advantages to be gained from creating and using valid XML data, but the option remains to use well-formed XML data in those situations where it would be appropriate.
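
Continuing the invented example above, the same document becomes a candidate for validation once a document type declaration such as the following is prepended to it (note that XML element declarations dispense with SGML's tag-minimisation parameters, since all tags must now be explicit):

    <!DOCTYPE letter [
    <!ELEMENT letter    (sender, recipient, body)>
    <!ELEMENT sender    (#PCDATA)>
    <!ELEMENT recipient (#PCDATA)>
    <!ELEMENT body      (paragraph+)>
    <!ELEMENT paragraph (#PCDATA | emphasis)*>
    <!ELEMENT emphasis  (#PCDATA)>
    ]>

A validating parser can now confirm not merely that the markup is well-formed, but that a <letter> really does contain a <sender>, a <recipient>, and a <body>, in that order, and that <emphasis> occurs only inside a <paragraph>.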

Today's Web browsers expect to receive conformant HTML data, and any additional markup included in the data which is not recognised by the browser is usually ignored. The next generation of Web browsers will know how to handle XML data, and while all of them will know how to process HTML data by default, they will also be prepared to cope with any well-formed or valid XML data that they receive. This offers the opportunity for groups of users to come together, agree upon a DTD they wish to adopt, and then create and exchange valid XML data which conforms to that DTD. Thus, a group of academics concerned with the creation of electronic scholarly editions of major texts could all agree to prepare their data in accordance with a particular DTD which enabled them to mark up the features of the texts which they felt to be appropriate for their work. They could then exchange the results of their labours safe in the knowledge that these could all be correctly processed by their favourite software (whether browsers, editors, text analysis tools, or whatever).

Readers who wish to explore the similarities and differences between SGML and XML are advised to consult the sources mentioned on Robin Cover's The SGML/XML Web Page (http://www.oasis-open.org/cover/). Projects which have invested heavily in the creation of SGML-conformant resources are well-placed to take advantage of XML developments, because any conversions that are required should be straightforward to implement. However, it is important to bear in mind that at the moment XML is just one of a suite of emerging standards, and it may be a little while yet before the situation becomes completely clear. For example, the Extensible Stylesheet Language (XSL) Specification (http://www.w3.org/TR/WD-xsl/) for expressing stylesheets as XML documents is still under development, as are the proposals to develop XML Schema (http://www.w3.org/TR/xmlschema-1/), which may ultimately replace the role of DTDs when creating XML documents (and provide support not just for declaring data structures, but also for strong data typing such that it would be possible to ensure, say, that the contents of a <DATE> element conformed to a particular international standard date format).
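
As a purely illustrative sketch of what such a declaration might look like (based on the draft schema language, so the details are subject to change), a <DATE> element restricted to ISO-format date values could be declared thus, with the xsd: prefix conventionally bound to the XML Schema namespace:

    <xsd:element name="DATE" type="xsd:date"/>

A schema-aware processor would then reject a document in which a <DATE> contained, say, 'next Tuesday' rather than a properly formatted date.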

5.2: The Text Encoding Initiative and TEI Guidelines

5.2.1: A brief history of the TEI

(Much of the following text is extracted from publicly available TEI documents, and is reproduced here with minor amendments and the permission of the TEI Editors.)

The TEI began with a planning conference convened by the Association for Computers and the Humanities (ACH), gathering together over thirty experts in the field of electronic texts, representing professional societies, research centers, and text and data archives. The planning conference was funded by the U.S. National Endowment for the Humanities (NEH, an independent federal agency) and took place at Vassar College, Poughkeepsie, New York on 12–13 November 1987.

Those attending the conference agreed that there was a pressing need for a common text encoding scheme that researchers could use when creating electronic texts, to replace the existing system in which every text provider and every software developer had to invent and support their own scheme (since existing schemes were typically ad hoc constructs with support for the particular interests of their creators, but not built for general use). At a similar conference ten years earlier, one participant pointed out, everyone had agreed that a common encoding scheme was desirable, and predicted chaos if one was not developed. At the Poughkeepsie meeting, no one predicted chaos: everyone agreed that chaos had already arrived.

After two days of intense discussion, the participants in the meeting reached agreement on the desirability and feasibility of creating a common encoding scheme for use both in creating new documents and in exchanging existing documents among text and data archives; the closing statement — the Poughkeepsie Principles (see http://www-tei.uic.edu/orgs/tei/info/pcp1.html) — enunciated precepts to guide the creation of such a scheme.

After the planning conference, the task of developing an encoding scheme for use in creating electronic texts for research was undertaken by three sponsoring organisations: the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC). Each sponsoring organisation named representatives to a Steering Committee, which was responsible for the overall direction of the project. Furthermore, a number of other interested professional societies were involved in the project as participating organisations, and each of these named a representative to the TEI Advisory Board.

With support from NEH and later from the Commission of the European Communities and the Andrew W. Mellon Foundation, the TEI began the task of developing a draft set of Guidelines for Electronic Text Encoding and Interchange. Working committees, comprising scholars from all over North America and Europe, drafted recommendations on various aspects of the problem, which were integrated into a first public draft (document TEI P1), which was published for public comment in June 1990.

After the publication of the first draft, work began immediately on its revision. Fifteen or so specialised work groups were assigned to refine the contents of TEI P1 and to extend it to areas not yet covered. So much work was produced that a bottleneck developed in preparing it for publication, and the second draft of the Guidelines (TEI P2) was released chapter by chapter from April 1992 through November 1993. During 1993, all published chapters were revised yet again, some other necessary materials were added, and the development phase of the TEI came to its conclusion with the publication of the first 'official' version of the Guidelines — the first one not labelled a draft — in May 1994 (Sperberg-McQueen and Burnard 1994). Since that time, the TEI has concentrated on making the Guidelines (TEI P3) more accessible to users, on teaching workshops and training users, and on preparing ancillary material such as tutorials and introductions.

5.2.2: The TEI Guidelines and TEI Lite

The goals outlined in the Poughkeepsie Principles (see http://www-tei.uic.edu/orgs/tei/info/pcp1.html) were elaborated and interpreted in a series of design documents, which recommended that the Guidelines should:

  • suffice to represent the textual features needed for research
  • be simple, clear, and concrete
  • be easy for researchers to use without special purpose software
  • allow the rigorous definition and efficient processing of texts
  • provide for user-defined extensions
  • conform to existing and emergent standards

As the product of many leading members of the research community, it is perhaps not surprising that research needs are the prime focus of the TEI's Guidelines. The TEI established a plethora of work groups — covering everything from 'Character Sets' and 'Manuscripts and Codicology', to 'Historical Studies' and 'Machine Lexica' — in order to ensure that the interests of the various sectors of the arts and humanities research community were adequately represented. As one of the co-editors of the Guidelines, Michael Sperberg-McQueen wrote, 'Research work requires above all the ability to define rigorously (i.e. precisely, unambiguously, and completely) both the textual objects being encoded and the operations to be performed upon them. Only a rigorous scheme can achieve the generality required for research, while at the same time making possible extensive automation of many text-management tasks.' (Sperberg-McQueen and Burnard 1995, 18). As we saw in the previous section (5.1 The Standard Generalized Markup Language), SGML offers all the necessary techniques to define and enforce a formal grammar, and so it was chosen as the basis for the TEI's encoding scheme.

The designers of the TEI also had to decide how to reconcile the need to represent the textual features required by researchers, with their other expressed intention of keeping the design simple, clear, and concrete. They concluded that rather than have many different SGML DTDs (i.e. one for each area of research), they would develop a single DTD with sufficient flexibility to meet a range of scholars' needs. They began by resolving that wherever possible, the number of markup elements should not proliferate unnecessarily (e.g. have a single <NOTE> tag with a TYPE attribute to say whether it was a footnote, endnote, shouldernote etc., rather than having separate <FOOTNOTE>, <ENDNOTE>, <SHOULDERNOTE> tags). Yet as this would still result in a large and complex DTD, they also decided to implement a modular design — grouping sets of markup tags according to particular descriptive functions — so that scholars could choose to mix and match as many or as few of these markup tags as they required. Lastly, in order to meet the needs of those scholars whose markup requirements could not be met by this comprehensive DTD, they designed it in such a way that the DTD could be adapted or extended in a standard fashion, thereby allowing these scholars to operate within the TEI framework and retain the right to claim compliance with the TEI's Guidelines.
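
To illustrate the first of these decisions with an invented example: rather than choosing between three distinct tags, the encoder of a text uses the single <NOTE> element and records the distinction in its TYPE attribute:

    <NOTE TYPE="footnote">The 1842 edition omits this sentence.</NOTE>
    <NOTE TYPE="endnote">On this episode see further the Appendix.</NOTE>

The content models remain simple, software need only understand one element, and yet nothing of the original distinction is lost.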

There is no doubt that the TEI's DTD and Guidelines can appear rather daunting at first, especially if one is unfamiliar with descriptive markup, text encoding issues, or SGML/XML applications. However, for anyone seriously concerned about creating an electronic textual resource which will remain viable and usable in the 'long-term' (which can be less than a decade in the rapidly changing world of information technology), the TEI's approach certainly merits very serious investigation, and you should think very carefully before deciding to reject the TEI's methods in favour of another apparent solution.

The importance of the modularity and extensibility of the TEI's DTD cannot be overstated. In order to make their design philosophy more accessible to new users of text encoding and SGML/XML, the creators of the TEI's DTD have developed what they describe as the 'Chicago pizza model' of DTD construction. Every Chicago (indeed, U.S.) pizza must have certain ingredients in common — namely, cheese and tomato sauce; pizza bases can be selected from a pre-determined limited range of types (e.g. thin-crust, deep-dish, or stuffed), whilst pizza toppings may vary considerably (from a range of well-known ingredients, through to local specialities or idiosyncratic preferences!). In the same way, every implementation of the TEI DTD must have certain standard components (e.g. header information and the core tag set) and one of the eight base tag sets (see below), to which can then be added any combination of the additional tag sets or user-defined application-specific extensions. TEI headers are discussed in more detail in Chapter 6, whilst the core tag set consists of common elements which are not specific to particular types of text or research application (e.g. the <P> tag used to identify paragraphs). Of the eight base tag sets, six are designed for use with texts of one particular type (i.e. prose, verse, drama, transcriptions of spoken material, printed dictionaries, and terminological data), whilst the other two (general, and mixed) allow for anthologies or unrestricted mixing of the other base types. The additional tag sets (the pizza toppings) provide the necessary markup tags for describing such things as hypertext linking, the transcription of primary sources (especially manuscripts), critical apparatus, names and dates, language corpora, and so on. Readers who wish to know more should consult the full version of the Guidelines, which are also available online at http://www.hcu.ox.ac.uk/TEI/P4beta/.

Even the brief description given above is probably enough to indicate that while the TEI scheme offers immense descriptive possibilities, its application is not something to be undertaken lightly. With this in mind, the designers of the TEI DTD developed a couple of pre-built versions of the DTD, of which the best known and most widely used is called 'TEI Lite'. Each aspect of the TEI Lite DTD is documented in TEI Lite: An Introduction to Text Encoding for Interchange (Burnard and Sperberg-McQueen 1995), which is also available online at http://www.hcu.ox.ac.uk/TEI/Lite/. The abstract of this document states that TEI Lite 'can be used to encode a wide variety of commonly encountered textual features, in such a way as to maximize the usability of electronic transcriptions and to facilitate their interchange among scholars using different types of computer systems'. Indeed, many people find that the TEI Lite DTD is more than adequate for their purposes, but even for those who do need to use the other tag sets available in the full TEI DTD, TEI Lite provides a valuable introduction to the TEI's encoding scheme. Several people involved in the development and maintenance of the TEI DTD have continued to investigate ways to facilitate its use, such as the 'Pizza Chef' (available at http://www.hcu.ox.ac.uk/TEI/newpizza.html) — which offers a web-based method of combining the various tag sets to make your own TEI DTD — and an XML version of TEI Lite (see The TEI Consortium Homepage (http://www.tei-c.org/)). It can only be hoped that as more people appreciate the merits of adopting the TEI scheme, the number of freely available SGML/XML TEI-aware tools and applications will continue to grow.
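
To give a feel for what this looks like in practice, here is the skeleton of a minimal TEI Lite document (the structure follows the TEI Lite documentation; the content is, of course, invented):

    <TEI.2>
    <teiHeader>
     <fileDesc>
      <titleStmt>
       <title>An Example Text: an electronic transcription</title>
      </titleStmt>
      <publicationStmt>
       <p>Distributed for the purposes of illustration only.</p>
      </publicationStmt>
      <sourceDesc>
       <p>Transcribed from a printed original.</p>
      </sourceDesc>
     </fileDesc>
    </teiHeader>
    <text>
     <body>
      <p>The first paragraph of the text itself.</p>
     </body>
    </text>
    </TEI.2>

Everything before the <text> element is the TEI header (discussed in Chapter 6); the core <p> tag does the descriptive work within the body, and further tag sets can be added as the material demands.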

5.3: Where to find out more about SGML/XML and the TEI

Although SGML was released as an ISO standard in 1986, its usage has grown steadily rather than explosively, and uptake has tended to occur within the documentation departments of major corporations, government departments, and global industries. This is in dramatic contrast to XML, which was released as a W3C Recommendation in 1998 but was able to build on the tremendous level of international awareness about the web and HTML (and, to some extent, on the success of SGML in numerous corporate sectors). As a very simple indicator, on the 20th August 1999 the online catalogue of amazon.co.uk (http://amazon.co.uk) listed only 28 books with 'SGML' in the title, as compared to 68 which mention 'XML' (and 5 of these are common to both!).

One of the best places to find out more about both the SGML and XML standards, their application, relevant websites, discussion lists and newsgroups, journal articles, conferences and the like, is Robin Cover's excellent The SGML/XML Web Page (http://www.oasis-open.org/cover/). It would be pointless to reproduce a selection of Cover's many references here (as they would rapidly go out of date), but readers are strongly urged to visit this website and use it to identify the most relevant information sources. However, it is also important to remember that XML (like SGML), is only one amongst a family of related standards, and that these XML-related standards are developing and changing very rapidly — so you should remember to visit these sites regularly, or risk making the wrong decisions on the basis of out-dated information.

Keeping up-to-date with the Text Encoding Initiative is a much more straightforward matter. The website of the TEI Consortium (http://www.tei-c.org/) provides the best starting point for accessing other TEI-related online resources, whilst the TEI-L@LISTSERV.UIC.EDU discussion list is an active forum for anyone interested in using the TEI's Guidelines and provides an extremely valuable source of advice and support.

© The right of Alan Morrison, Michael Popham and Karen Wikander to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988. 

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or part of any of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service. Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.
Arts and Humanities Data Service 
 