Creating & Documenting Electronic Texts

 

Chapter 5: SGML/XML & TEI

The previous chapter showed what markup is, and how it plays a crucial role in almost every aspect of information processing. Now we shall learn about some crucial applications of descriptive markup which are ideally suited to the types of texts studied by those working in the arts and humanities disciplines.

5.1: The Standard Generalized Markup Language (SGML)

The late 1970s and early 1980s saw a consensus emerging that descriptive markup languages had numerous advantages over other types of text encoding. A number of products and macro languages appeared which were built around their own descriptive markup languages — and whilst these represented a step forward, they were also constrained by the fact that users were required to learn a new markup language each time, and could only describe those textual features which the markup scheme allowed (sometimes extensions were possible, but implementing them was rarely a straightforward process).

The International Organization for Standardization (ISO) also recognized the value of descriptive markup schemes, and in 1986 an ISO committee released a new standard called ISO 8879, the Standard Generalized Markup Language (SGML). This complex document represented several years' effort by an international committee of experts working together under the Chairmanship of Dr Charles Goldfarb (one of the creators of IBM's descriptive markup language, GML). Since SGML was a product of the International Standards process, the committee also had the benefit of input from experts from the numerous national standards bodies associated with the ISO, such as the UK's British Standards Institution (BSI).

5.1.1: SGML as metalanguage

A great deal of largely unjustified mystique surrounds SGML. You do not have to look very hard to find instances of SGML being described as "difficult to learn", "complex to implement", or "expensive to use", when in fact it is none of these things. People all too frequently confuse SGML itself with SGML applications, many of which are indeed highly sophisticated and complex operations, designed to meet the rigorous demands of blue chip companies working in major international industries (automotive, pharmaceutical, or aerospace engineering). It should not be particularly surprising that a documentation system designed to control and support every aspect of the tens of thousands of pages of documentation needed to build and maintain a battleship, fix the latest passenger aircraft, or supplement a legal application for international recognition for a new advanced drug treatment, should appear overwhelmingly complex to an outsider. In fact, despite its name, SGML is not even a markup language. Instead, it would be more appropriate to call SGML a "metalanguage".

In a conventional markup language, such as HTML, users are offered a pre-defined set of markup tags from which they must make appropriate selections; if they introduce new tags which are not part of the HTML specification, the resulting document will not be considered valid HTML, and it may be rejected or incorrectly processed by HTML software (e.g. an HTML-compatible browser). SGML, on the other hand, does not offer a pre-defined set of markup tags. Rather, it offers a grammar and specific vocabulary which can be used to define other markup languages (hence "metalanguage").

SGML is not constrained to any one particular type of application, and it is neither more nor less suited to producing technical documentation and specifications in the semiconductor industry than it is to marking up linguistic features of ancient inscribed tablets of stone. In fact, SGML can be used to create a markup language to do pretty well anything, and that is both its greatest strength and its greatest weakness. SGML cannot be used ‘out-of-the-box’, so to speak, and because of this it has earned an undeserved reputation in some quarters as being troublesome and slow to implement. On the other hand, there are many SGML applications (and later we shall learn about one in particular) which can be used straight away, as they offer a fully documented markup language which can be recognized by any one of a suite of tools and implemented with a minimum of fuss. SGML provides a mechanism for like-minded people with a shared concern to get together and define a common markup language which satisfies their needs and desires, rather than being limited by the vision of the designers of a closed, possibly proprietary markup scheme which only does half the job.

SGML offers another advantage in that it not only allows (groups of) users to define their own markup languages, it also provides a mechanism for ensuring that the rules of any particular markup language can be rigorously enforced by SGML-aware software. For example, within HTML, although there are six different levels of heading defined (e.g. the tags <H1> to <H6>) there is no requirement that they should be applied in a strictly hierarchical fashion; in other words, it is perfectly possible for a series of headings in an HTML document to be marked up as <H1>, then <H3>, followed by <H5>, followed in turn by <H2>, <H4>, and <H6> — all to achieve a particular visual appearance in a particular HTML browser. By contrast, should such a feature be deemed important, an SGML-based markup language could be written in such a way that suitable software can ensure that levels of heading nest in a strictly hierarchical fashion (and the strength of this approach can perhaps become even more evident when encoding other kinds of hierarchical structure, e.g. a <BOOK> must contain one or more <CHAPTER>s, each of which must in turn contain one or more <PARAGRAPH>s, and so on). We shall learn more about this in the following section.
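The kind of rule described above can be sketched in a few lines of Python. This is only an illustration and not an SGML parser: it assumes a hypothetical three-element language (BOOK, CHAPTER, PARAGRAPH), writes each content model as a regular expression over the names of an element's children, and uses Python's standard XML parser to read a small, fully tagged sample.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models: each element name maps to a regular
# expression that the sequence of its child element names must match.
CONTENT_MODELS = {
    "BOOK": r"(CHAPTER )+",        # one or more CHAPTERs
    "CHAPTER": r"(PARAGRAPH )+",   # one or more PARAGRAPHs
    "PARAGRAPH": r"",              # no child elements, text only
}

def validate(element):
    """Return a list of error messages for element and its descendants."""
    errors = []
    children = "".join(child.tag + " " for child in element)
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        errors.append("<%s> is not declared" % element.tag)
    elif re.fullmatch(model, children) is None:
        found = children.strip() or "(none)"
        errors.append("<%s> has invalid content: %s" % (element.tag, found))
    for child in element:
        errors.extend(validate(child))
    return errors

good = ET.fromstring(
    "<BOOK><CHAPTER><PARAGRAPH>Text.</PARAGRAPH></CHAPTER></BOOK>"
)
bad = ET.fromstring(
    "<BOOK><PARAGRAPH>Stray text.</PARAGRAPH><CHAPTER/></BOOK>"
)

print(validate(good))  # an empty list: the structure obeys the rules
print(validate(bad))   # reports the misplaced PARAGRAPH and the empty CHAPTER
```

A real validating parser reads such rules from the DTD rather than from a hand-written table, but the principle is the same: the rules are data, and software checks every document against them.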

There is one final, crucial, difference between SGML-based markup languages and other descriptive markup languages: the process by which International Standards are created, maintained, and updated. ISO Standards are subject to periodic formal review, and each time this work is undertaken it is done in full consultation with the various national standards bodies. The Committee which produced SGML has guaranteed that if and when any changes are introduced to the SGML standard, this will be done in such a way as to ensure backwards compatibility. This is not a decision which has been taken lightly, and the full implications can be inferred from the fact that commercial enterprises rarely make such an explicit commitment (and even when they do, users ought to reflect upon the likelihood that such a commitment will actually be fulfilled given the considerable pressures of a highly competitive marketplace). The essential difference has been characterized thus: the creators of SGML believe that a user's data should belong to that user, and not be tied up inextricably in a proprietary markup system over which that user has no control; whereas the creators of a proprietary markup scheme can reasonably be expected to have little motivation to ensure that data encoded using their scheme can be easily migrated to, or processed by, a competitor's software products.

5.1.2: The SGML Document

The SGML standard gives a very rigid definition as to what constitutes an ‘SGML document’. Whilst there is no need for us to consider this definition in detail at this stage, it is worthwhile reviewing the major concepts as they offer a valuable insight into some crucial aspects of an electronic text. Perhaps first and foremost amongst these is the notion that an SGML document is a single logical entity, even though in practice that document may be composed of any number of physical data files, spread over a storage medium (e.g. a single computer's hard-disk) or even over different types of storage media connected together via a network. As today's electronic publications become more and more complex, mixing (multilingual) text with image and audio data, the need to ensure that they are created in line with accepted standards is reinforced. For example, an article from an electronic journal mounted on a website may be delivered to the end-user in the form of a single HTML document, but that article (and indeed the whole journal) may rely upon dozens or hundreds of data files, a database to manage the entire collection of files, several bespoke scripts to handle the interfacing between the web and the database, and so on. Therefore, whenever we talk about an electronic document, it is vitally important to remember that this single logical entity may, in fact, consist of many separate data files.

SGML operates on the basis of there being three major parts which combine to form a single SGML document. Firstly, there is the SGML declaration, which specifies any system and software constraints. Secondly, there is the prolog, which defines the document structure. Lastly, there is the document instance, which contains what one would ordinarily think of as the document. Whilst this may perhaps appear unnecessarily complicated, in fact it provides an extremely valuable insight into the key components which are essential to the creation of an electronic document.
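To make the division concrete, the following Python sketch separates a tiny sample into its prolog and its document instance. Everything in it is an assumption made for illustration: the `memo` element names are hypothetical, the DTD is given inline within the document type declaration, and the SGML declaration is taken to be a system default and so does not appear in the file at all. In the element declarations, the `- -` marks state that neither the start-tag nor the end-tag may be omitted.

```python
import re

SAMPLE = """<!DOCTYPE memo [
<!ELEMENT memo - - (to, body)>
<!ELEMENT to   - - (#PCDATA)>
<!ELEMENT body - - (#PCDATA)>
]>
<memo><to>Editor</to><body>Proofs attached.</body></memo>
"""

def split_document(text):
    """Split text into (prolog, instance) at the end of the DOCTYPE declaration."""
    match = re.match(r"\s*(<!DOCTYPE\s+\w+\s*\[.*?\]>)\s*(.*)", text, re.DOTALL)
    if match is None:
        raise ValueError("no document type declaration with inline DTD found")
    return match.group(1), match.group(2).strip()

prolog, instance = split_document(SAMPLE)
print(prolog)    # the document type declaration, including the DTD
print(instance)  # the marked-up text itself
```

The split is purely textual and would not cope with a DTD held in a separate file, which is the more common arrangement in practice.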

The SGML declaration tells any software that is going to process an SGML document all that it should need to know. For example, the SGML declaration specifies which character sets have been used in the document (normally ASCII or ISO 646, but more recently this could be Unicode, or ISO 10646). It also establishes any constraints on system variables (e.g. the length of markup tag names, or the depth to which tags can be nested), and states whether or not any of SGML's optional features have been used. The SGML standard offers a default set-up, so that, for example, the characters < and > are used to delimit markup tag names (and with the widespread acceptance of HTML, this has become the accepted way to indicate markup tags). If for any reason this presented a problem for a particular application (e.g. encoding a lot of data in which < and > were heavily used to indicate something else), it would be possible to redefine the delimiters as @ or #, or whatever characters were deemed to be more appropriate.
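The effect of redefining delimiters can be shown with a rough Python sketch. It is not how an SGML parser reads a declaration; it simply assumes that the opening and closing tag delimiters are passed in as parameters, and that the data follow a hypothetical editorial convention in which angle brackets mark letters supplied by an editor (as in k<in>g).

```python
import re

def start_tags(text, open_delim="<", close_delim=">"):
    """Return the names of all start-tags in text for the given delimiters."""
    name = r"([A-Za-z][A-Za-z0-9.-]*)"
    pattern = re.escape(open_delim) + name + re.escape(close_delim)
    return re.findall(pattern, text)

# Default delimiters: the editorial convention k<in>g is mistaken for a tag.
print(start_tags("<line>the k<in>g spoke</line>"))
# ['line', 'in']

# Redefined delimiters: angle brackets in the data are left alone.
print(start_tags("@line#the k<in>g spoke@/line#", "@", "#"))
# ['line']
```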

The SGML declaration is important for a number of reasons. Although it may seem an unduly complicated approach, it is often these fundamental system or application dependencies which make it so difficult to move data around between different software and hardware environments. If the developers of wordprocessing packages had started off by agreeing on a single set of internal markup codes they would all use to indicate a change in font, the centring of a line of text, the occurrence of a pagebreak etc., then users' lives would have been made a great deal easier; however, this did not happen, and hence we are left in a situation where data created in one application cannot easily be read by another. We should also remember that as our reliance upon information technology grows, time passes, and applications and companies appear or go bust, there may be data which we wish to exchange or reuse which were created when the world of computing was a very different place. It is a very telling lesson that although we are still able to access data inscribed on stone tablets or committed to papyri or parchment hundreds (if not thousands) of years ago, we already have masses of computer-based data which are effectively lost to us because of technological progress, the demise of particular markup schemes, and so on. Furthermore, because the SGML standard supplies a default environment, the average end-user of an SGML-based encoding system is unlikely to have to familiarize him- or herself with the intricacies of the SGML declaration. Indeed it should be enough simply to be aware of the existence of the SGML declaration, and how it might affect one's ability to create, access, or exploit a particular source of data.

The next major part of an SGML document is the prolog, which must conform to the specification set out in the formal SGML standard, and the syntax given in the SGML declaration. Although it is hard to discuss the prolog without getting bogged down in the details of SGML, suffice it to say that it contains (at least one) document type declaration, which in turn contains (or references) a Document Type Definition (or DTD). The DTD is one of the single most important features of SGML, and what sets it apart from (not to say above) other descriptive markup schemes. Although we shall learn a little more about the process in the following section, the DTD contains a series of declarations which define the particular markup language which will be used in the document instance, and also specifies how the different parts of that language can interrelate (e.g. which markup tags are required and optional, the contexts in which they can be used, and so on). Often, when people talk about “using SGML”, they are actually talking about using a particular DTD, which is why some of the negative comments that have been made about SGML (e.g. “It's too difficult.”, or “It doesn't allow me to encode those features which I consider to be important”) are erroneous, because such complaints should properly be directed at the DTD (and thus aimed at the DTD designer) rather than at SGML in general. Other than some of the system constraints imposed by the SGML declaration, there are no strictures imposed by the SGML standard regarding how simple or complex the markup language defined in the DTD should be.
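As a rough illustration of what such declarations say, the Python sketch below reads four element declarations for a hypothetical `poem` language and reports which child elements are required and which are optional. The element names are invented for the example, and the reader handles only the simplest case (a comma-separated sequence with the occurrence indicators `?`, `+`, and `*`); real DTDs use a much richer syntax.

```python
import re

# A hypothetical DTD fragment; "- -" means neither tag may be omitted.
DTD = """
<!ELEMENT poem   - - (title?, stanza+)>
<!ELEMENT title  - - (#PCDATA)>
<!ELEMENT stanza - - (line+)>
<!ELEMENT line   - - (#PCDATA)>
"""

DECLARATION = re.compile(r"<!ELEMENT\s+(\w+)\s+- -\s+\((.*?)\)>")
OCCURRENCE = {
    "": "required, exactly once",
    "?": "optional",
    "+": "required, one or more",
    "*": "optional, any number",
}

def read_models(dtd):
    """Map each declared element to a list of (child, occurrence) pairs."""
    models = {}
    for name, model in DECLARATION.findall(dtd):
        pairs = []
        for part in model.split(","):
            match = re.fullmatch(r"(#?\w+)([?+*]?)", part.strip())
            if match is None:
                raise ValueError("unsupported content model: " + model)
            child, indicator = match.groups()
            if child.startswith("#"):
                pairs.append((child, "character data"))
            else:
                pairs.append((child, OCCURRENCE[indicator]))
        models[name] = pairs
    return models

for element, content in read_models(DTD).items():
    print(element, content)
```

Read this way, the first declaration says that a poem may begin with a title and must then contain at least one stanza.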

Whilst the syntax used to write a DTD is fairly straightforward, and most people find that they can start to read and write DTDs with surprising ease, to create a good DTD requires experience and familiarity with the needs and concerns of both data creators and end-users. A good DTD nearly always reflects a designer's understanding of all these aspects, an appreciation of the constraints imposed by the SGML standard, and a thorough process of document analysis (see Chapter 2) and DTD-testing. In many ways this situation is indicative of the fact that the creators of the SGML standard did not envisage that individual users would be very likely to produce their own DTDs for highly specific purposes. Rather, they thought (or perhaps hoped) that groups would form within industry sectors or large-scale enterprises to produce DTDs that were tailored to the needs of their particular application. Indeed, the areas in which the uptake of SGML has been most enthusiastic have been operating under exactly those sorts of conditions: for example, the international Air Transport Authority seeking to standardize aircraft maintenance documentation, or the pharmaceutical industry's attempts to streamline the documentary evidence needed to support applications to the US Food and Drug Administration. As we shall see, the DTD of prime importance to those working within the Arts and Humanities disciplines has already been written and documented by the members of the Text Encoding Initiative, and in that case the designers had the foresight to build in mechanisms to allow users to adapt or extend the DTD to suit their specific purposes. However, as a general rule, if users wish to write their own DTDs, or tweak an SGML declaration, they are entirely free to do so (within the framework set out by the SGML standard), but the vast majority of SGML users prefer to rely upon an SGML declaration and DTD created by others, for all the benefits of interoperability and reusability promised by this approach.

This brings us to the third main part of an SGML document: namely, the document instance itself. This is the part of the document which contains a combination of raw data and markup, and its contents are constrained by both the SGML declaration and the contents of the prolog (especially the declarations in the DTD). Clearly, from the perspective of data creators and end-users, this is the most interesting part of an SGML document, and it is common practice for people to use the term “SGML document” when they are actually referring to a document instance. Such confusion should be largely unproblematic, provided these users always remember that when they are interchanging data (i.e. a document instance) with colleagues, they should also pass on the relevant DTD and SGML declaration. In the next section we shall investigate the practical steps involved in the creation of an SGML document, and the very valuable role that can be played by SGML-aware software.

5.1.3: Creating Valid SGML Documents

How you create SGML documents will be greatly influenced by the aims of your project, the materials you are working with, and the resources available to you. For the purposes of this discussion, let us start by assuming that you have a collection of existing non-electronic materials which you wish to turn into some sort of electronic edition.

If you have worked your way through the chapter on document analysis (Chapter 2), then you will know what features of the source material are important to you, and what you will want to be able to encode with your markup. Similarly, if you have considered the options discussed in the chapter on digitization (Chapter 3), you will have some idea of the type of electronic files with which you will be starting to work. Essentially, if you have chosen to OCR the material yourself, you will be using ‘clear’ or ‘plain ASCII’ text files, which will need to undergo some sort of editing or translation as part of the markup process. Alternatively, if the material has been re-keyed, then you will either have electronic text files which already contain some basic markup, or you will also have plain ASCII text files.

Having identified the features you wish to encode, you will need to find a DTD which meets your requirements. Rather than trying to write your own DTD from scratch, it is usually worthwhile investing some time to look around for existing public DTDs which you might be able to adopt, extend, or adapt to suit your particular purposes. There are many DTDs available in the public domain, or made freely available for others to use (e.g. see The SGML/XML Web Page (http://www.oasis-open.org/cover/)), but even if none of these match your needs, some may be worth investigating to see how others have tackled common problems. Although there are some tools available which are designed to facilitate the process of DTD-authoring, they are probably only worth buying if you intend to be doing a great deal of work with DTDs, and they can never compensate for poor document analysis. However, if you are working with literary or linguistic materials, you should take the time to familiarize yourself with the work of the Text Encoding Initiative (see 5.2: The Text Encoding Initiative and TEI Guidelines), and think very carefully before rejecting use of their DTD.

Before we go any further, let us consider two other scenarios: one where you already have the material in electronic form but you need to convert it to SGML; the other, where you will need to create SGML from scratch. Once again, there are many useful tools available to help convert from one markup scheme to another, but if your target format is SGML this may have some bearing on the likelihood of success (or failure) of any conversion process. As we have seen, SGML lends itself most naturally to a structured, hierarchical view of a document's content (although it is perfectly possible to represent very loose organizational structures, and even non-hierarchical document webs, using SGML markup) and this means that it is much simpler to convert from a proprietary markup scheme to SGML if that scheme also has a strong sense of structure (i.e. adopts a descriptive markup approach) and has been used sensibly. However, if a document has been encoded with a presentational markup scheme which has, for example, used codes to indicate that certain words should be rendered in an italic font (regardless of the fact that sometimes this has been for emphasis, at other times to indicate book and journal titles, and elsewhere to indicate non-English words), then this will dramatically reduce the chances of automatically converting the data from this presentation-oriented markup scheme into one which complies with an SGML DTD.

It is probably worth noting at this point that these conversion problems primarily apply when converting from a non-descriptive, non-SGML markup language into SGML; the opposite process, namely converting from SGML into another target markup scheme, is much more straightforward (because it would simply mean that data variously marked-up with, say, <EMPHASIS>, <TITLE>, and <FOREIGN> tags, had their markup converted into the target scheme's markup tags for <ITALIC>). It is also worth noting that such a conversion might not be a particularly good idea, because you would effectively be throwing information away. In practice it would be much more sensible to retain the descriptive/SGML version of your material, and convert to a presentational markup scheme only when absolutely required for the successful rendering of your data on screen or on paper. Indeed, many dedicated SGML applications support the use of stylesheets to offer some control over the on-screen rendition of SGML-encoded material, whilst preserving the SGML markup behind the scenes.
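The asymmetry between the two directions of conversion is easy to demonstrate. The Python sketch below assumes a hypothetical target scheme with a single <ITALIC> tag and uses a simple regular expression that only recognizes tags without attributes; it is meant as an illustration of the argument rather than as a conversion tool.

```python
import re

# Hypothetical mapping from descriptive tags to a presentational target.
TO_PRESENTATIONAL = {
    "EMPHASIS": "ITALIC",
    "TITLE": "ITALIC",
    "FOREIGN": "ITALIC",
}

def to_presentational(text):
    """Rename descriptive start- and end-tags using TO_PRESENTATIONAL."""
    def rename(match):
        slash, name = match.group(1), match.group(2)
        return "<" + slash + TO_PRESENTATIONAL.get(name, name) + ">"
    return re.sub(r"<(/?)([A-Z]+)>", rename, text)

source = ("She <EMPHASIS>never</EMPHASIS> finished <TITLE>Middlemarch</TITLE>, "
          "a real <FOREIGN>tour de force</FOREIGN>.")
converted = to_presentational(source)
print(converted)

# The reverse direction is ambiguous: ITALIC has three possible sources.
candidates = sorted(k for k, v in TO_PRESENTATIONAL.items() if v == "ITALIC")
print(candidates)  # ['EMPHASIS', 'FOREIGN', 'TITLE']
```

Going from descriptive to presentational markup is a simple table lookup, whereas going back would require a decision about each italic passage that no table can supply.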

If you are creating SGML documents from scratch, or editing existing SGML documents (perhaps the products of a conversion process, or the results of a rekeying exercise) there are several factors to consider. It is essential that you have access to a validating SGML parser, which is a software program that can read an SGML declaration and a document's prolog, understand the declarations in the DTD, and ensure that the SGML markup used throughout the document instance conforms appropriately. In many commercial SGML- and XML-aware software packages, a validating parser is included as standard and is often very closely integrated with the relevant tools (e.g. to ensure that any simple editing operations, such as cut and paste, do not result in the document failing to conform to the rules set out in the DTD because markup has been inserted or removed inappropriately). It is also possible to find freeware and public domain software which has some understanding of the markup rules expressed in the DTD, whilst also allowing users to validate their documents with a separate parser in order to guarantee conformance. Your choice will probably be dictated by the kind of software you currently use (e.g. in the case of editors: Windows-based office-type applications, or Unix-style plain text editors?), the budget you have available, and the files with which you will be working. Whatever your decision, it is important to remember that a parser can only validate markup against the declarations in a DTD, and it cannot pick up semantic errors (e.g. incorrectly tagging a person's name as, say, a place name, or an epigraph as if it were a subtitle).

So for the purposes of creating valid SGML documents, we have seen that there are a number of tools which you may wish to consider. If you already have files in electronic form, you will need to investigate translation or auto-tagging software; and if you have a great many files of the same type, you will probably want software which supports batch processing, rather than anything which requires you to work on one file at a time. If you are creating SGML documents from scratch, or cleaning up the output of a conversion process, you will need some sort of editor (ideally one that is SGML-aware), and if your editor does not incorporate a parser, you will need to obtain one that can be run as a stand-alone application (there are one or two exceptionally good parsers freely available in the public domain). For an idea of the range of SGML and XML tools available, readers should consult The Whirlwind Guide to SGML & XML Tools and Vendors (http://www.infotek.no/sgmltool/guide.htm).

Producing valid SGML files which conform to a DTD is in some respects only the first stage in any project. If you want to search the files for particular words, phrases, or marked-up features, you may prefer to use an SGML-aware search engine, but some people are perfectly happy writing short scripts in a language like Perl. If you want to conduct sophisticated computer-assisted text analysis of your material, you will almost certainly need to look at adapting an existing tool, or writing your own code. Having obtained your SGML text, whether as complete documents or as fragments resulting from a search, you will need to find some way of displaying it. You might choose simply to convert the SGML markup in the data into another format (e.g. HTML for display in a conventional web browser), or you might use one of the specialist SGML viewing packages to publish the results, which is how many commercial SGML-based electronic texts are produced. We do not have sufficient space to consider all the various alternatives in this publication, but once again you can get an idea of the options available by looking at The Whirlwind Guide to SGML & XML Tools and Vendors (http://www.infotek.no/sgmltool/guide.htm) or, more generally, The SGML/XML Web Page (http://www.oasis-open.org/cover/).
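A short search script of the kind mentioned above might look like the following, written here in Python rather than Perl. It is a minimal sketch which assumes that every element has explicit start- and end-tags, carries no attributes, and is never nested inside an element of the same name; the sample text and tag names are invented for the example.

```python
import re

def find_elements(text, tag):
    """Return the content of every <tag>...</tag> element in text."""
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(tag))
    return re.findall(pattern, text, re.DOTALL)

sample = ("<P>He read <TITLE>Bleak House</TITLE> and then "
          "<TITLE>Hard Times</TITLE> in a week.</P>")

print(find_elements(sample, "TITLE"))
# ['Bleak House', 'Hard Times']
```

Because the titles were marked up descriptively, the script can retrieve them directly; had they merely been italicized, it would have had no means of telling titles from emphasized or foreign words.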

5.1.4: XML: The Future for SGML

As we saw in the previous section, an SGML-based markup language usually offers a number of advantages over other types of markup scheme, especially those which rely upon proprietary encoding. However, although SGML has met with considerable success in certain areas of publishing and many commercial, industrial, and governmental sectors, its uptake by the academic community has been relatively limited (with the notable exception of the Text Encoding Initiative, see 5.2: The Text Encoding Initiative and TEI Guidelines, below). We can speculate as to why this might be so: for example, SGML has an undeserved reputation for being difficult and expensive to produce because it imposes prohibitive intellectual overheads, and because the necessary software is lacking (leastways at prices academics can afford). Whilst it is true that performing a thorough document analysis and developing a suitable DTD should not be undertaken lightly, it could be argued that to approach the production of any electronic text without first investing such intellectual resources is likely to lead to difficulties (either in the usefulness or the long-term viability of the resulting resource). The apparent lack of readily available, easy-to-use SGML software is perhaps a more valid criticism, yet the resources have been available for those willing to look, and then invest the time necessary to learn a new package (although freely available software tends to put more of an onus on the user than some of the commercial products). However, what is undoubtedly true is that writing a piece of SGML software (e.g. a validating SGML parser) which fully implements the SGML standard is an extremely demanding task, and this has been reflected in the price and sophistication of some commercial applications.

Whilst SGML is probably more ubiquitous than many people realize, HTML — the markup language of the World Wide Web — is much better known. Nowadays, the notion of “the Web” is effectively synonymous with the global Internet, and HTML plays a fundamental role in the delivery and presentation of information over the Web.

HTML offers a limited, fixed tagset, which makes it easy to implement and to learn (hence its rapid and widespread take-up). HTML was not originally defined as an SGML application, although it did become one from version 2.0. However, HTML is non-extensible and insufficiently descriptive, hence the development of XML.

As an encoding scheme for marking up documents, an SGML-based markup language generally offers tremendous benefits over other, possibly proprietary

5.2: The Text Encoding Initiative and TEI Guidelines

5.2.1: A brief history of the TEI

5.2.2: The TEI Guidelines and TEI-lite

5.3: Where to find out more about SGML/XML and the TEI

Bibliography

The SGML/XML Web Page (http://www.oasis-open.org/cover/), Cover, Robin.

The SGML Handbook, Goldfarb, Charles F., Oxford University Press, 1990.

The Whirlwind Guide to SGML & XML Tools and Vendors (http://www.infotek.no/sgmltool/guide.htm), Pepper, Steve.

The TEI Consortium Homepage

Glossary

DTD
HTML
Markup
SGML
TEI (Text Encoding Initiative)
XML
© 
The right of xxxx to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or part of any of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service. Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.
Arts and Humanities Data Service 
 