Creating and Documenting Electronic Texts

 

Chapter 4: Markup: The key to reusability
 

4.1: What is markup?

Markup is most commonly defined as a form of text added to a document to transmit information about both the physical and electronic source. Do not be surprised if the term sounds familiar; it has been in use for centuries. It was first used within the printing trade as a reference to the instructions inscribed onto copy so that the compositor would know how to prepare the typographical design of the document. As Philip Gaskell points out, 'Many examples of printers' copy have survived from the hand-press period, some of them annotated with instructions concerning layout, italicization, capitalization, etc.' (Gaskell 1995, 41). This concept has evolved slightly through the years but has remained entwined with the printing industry. G.T. Tanselle writes in a 1981 article on scholarly editing, 'one might...choose a particular text to mark up to reflect these editorial decisions, but that text would only be serving as a convenient basis for producing printer's copy...' (Tanselle 1981, 64). There still seems to be some demarcation between the usage of the term for bibliography and for computing, but the boundary is really quite blurred. The leap from markup as a method of labelling instructions on printer's copy to markup as a language used to describe information in an electronic document is not so vast.

When we think of markup, then, there are really three different types (two of which will be discussed below). The first is the markup that relates strictly to formatting instructions found on the physical text, as mentioned above. It is used for the creation of an emended version of the document and, with the exception of the work of textual scholars, is rarely referred to again. The second is the proprietary markup found in electronic document encoding, which is tied to a specific piece of software or developer. This markup is concerned primarily with document formatting: describing which words should be in italics or centred, where the margins should be set, or where to place a bulleted list. The key point about this type of markup is that, being proprietary, it is intimately tied to the software that created it. This does not pose a problem as long as the document will remain within that software program, and as long as the creator recognises that there is no guarantee the software will exist in the future. Proprietary software formats allow users to say where and how they want the document formatted, but the software then inserts its own markup language to accomplish this. When users create a document in Word, or save it as a PDF file, they are adding encoding with every keystroke, usually without being aware of it. As anyone who has created a document in one software format and attempted to transfer it to another is aware, the encoding does not transfer, and if for some reason a bit of it does, it rarely means the same thing.

The third type of markup is non-proprietary: a generalised markup language. There are two critical distinctions between this markup and the previous two. Firstly, as it is a general language and not tied to specific software or hardware, it offers cross-platform capabilities. This makes it far more likely that documents utilising this style of encoding will be readable many years down the line. Secondly, while a generalised markup language, like the others, allows users to insert formatting markup in the document, it also allows for encoding based upon the content of the work. This is a level of control not found in the previous styles of markup. Here the user is able to describe not only the appearance of the document but also the meanings found within it. This is a critical aspect of electronic text creation, and therefore receives more in-depth treatment below.

4.2: Visual/presentational markup vs. structural/descriptive markup

The discussion of visual/presentational markup vs. structural/descriptive markup carries on from the concepts of proprietary and non-proprietary markup. As the name implies, presentational markup is concerned with the visual structure of a text. Whatever processing software is being used, the markup tells the computer how the document should appear. So if the work should be seen in 12 point Tahoma, the software inserts the markup needed to make this happen. Presentational markup is concerned with structure only insofar as it relates to the visual aspect of the document. It does not care whether a heading belongs to a book, a chapter or a paragraph; the only consideration is how that heading should look on the page. Most proprietary formats tend to focus solely on presentational issues. To move into descriptive markup would require that the software give the document creator the ability to formulate their own tags with which to encode the structure and presentation of the work.

In other words, descriptive markup relates less to the visual strategy of the work and more to the reasons behind the structure. It allows the creator to encode the document with a markup that more clearly shows how the presentation, configuration, and content relate to the document as a whole. Once again, the beneficial effects of thorough document analysis can be seen: a holistic sense of the document, together with a detailed listing of its critical elements, shows where descriptive markup will advance a project. In this case, a non-proprietary language will be the most beneficial, as it allows the document creators to arrive at their own tagsets, providing a much-needed level of control over the development of the encoding.
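The practical benefit of descriptive markup can be sketched in a few lines of Python. The tag names used here (poem, title, line) are invented for illustration and do not belong to any standard tagset, and the two functions are only one possible way of processing the text. The point is that a text encoded by what its parts are, rather than by how they look, can be put to more than one use.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptive encoding: the tags say what each part is,
# not how it should look. The tag names are illustrative only.
SOURCE = """<poem>
<title>The Tyger</title>
<line>Tyger Tyger, burning bright,</line>
<line>In the forests of the night;</line>
</poem>"""

def render_plain(root):
    """One presentation: an upper-case title followed by indented lines."""
    out = [root.findtext("title").upper()]
    out.extend("    " + line.text for line in root.findall("line"))
    return "\n".join(out)

def count_lines(root):
    """A different use of the same encoding: count the verse lines."""
    return len(root.findall("line"))

root = ET.fromstring(SOURCE)
print(render_plain(root))
print(count_lines(root))
```

Had the same text been encoded only as "bold, centred" and "indented", the second function would have nothing reliable to count.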

4.2.1: PostScript and Portable Document Format (PDF)

In 1985, Adobe Systems created a programming language for printers called PostScript. In so doing, they produced a system that allowed computers to 'talk' to their printers. This language describes for the printer the appearance of the page, incorporating elements like text, graphics, colour, and images, so that documents maintain their integrity through the transmission from computer to printer. PostScript printers have become an industry standard with corporations, marketers, publishing companies, graphic designers, and more. Printers, slide recorders and imagesetters all utilise PostScript technology. Combine this with PostScript's ability to work across multiple operating systems and it becomes clear why PostScript has become the standard for printing technology (http://www.adobe.com/print/features/psvspdf/main.html). The PostScript language can be found in most printers (Epson, IBM, and Hewlett-Packard, to name just a few), almost guaranteeing that a high standard of printing can be found in both the home and office. Adobe provides a list of compatible products at http://www.adobe.com/print/postscript/oemlist.html.

Portable Document Format (PDF) was created by Adobe in 1993 to complement their PostScript language. PDF allows the user to view a document with a presentational integrity that closely resembles a scanned image of the source. This delivery of visually rich content is the most attractive use of PDF. The format is entirely concerned with keeping the document intact and, to ensure this, allows any combination of text, graphics and images. It also supports full, rich colour presentation and is therefore often used for corporate and marketing graphic arts materials. Another enticing feature is that, depending on the quality of the printer, the hard copy of a PDF file closely replicates the screen image. PDF is also desirable for its delivery strengths. Not only does the document maintain its visual integrity, but it can also be compressed. This compression eases online and CD-ROM distribution and makes the files easier to archive.

PDF files can be read through an Acrobat Reader application that is freely available for download via the web. This application is also capable of serving as a browser plug-in for online document viewing. Creating PDF files is a bit more complicated than the viewing procedure. To write a PDF document it is necessary to purchase Adobe software. PDFWriter allows the user to create the PDF document, and the more expensive Adobe Capture program will convert TIFF files into PDF formatted text versions. If the user would like the document to become more interactive, offering the ability to annotate the document for example, then this functionality can be added with the additional purchase of Acrobat Exchange, which serves an editorial function. Exchange allows the user to annotate and edit the document, search across documents and also has plug-ins that provide highlighting ability.

Taking into consideration the earlier discussion of visual vs. structural markup, it is clear that PostScript and PDF fall into the category of proprietary formats concerned with presentational rather than descriptive markup. This does not imply that they should be avoided. On the contrary, if the only concern is how the document appears both on the screen and through the printer, then formats of this nature are appropriate. However, if the document needs to cross platforms, or the project objectives require control over the encoding or document preservation, then these proprietary formats are not dependable.

4.2.2: HTML 4.0

HyperText Markup Language (or HTML as it is commonly known) is a non-proprietary markup system used for publishing hypertext on the World Wide Web. To date, it has appeared in four main versions (1.0, 2.0, 3.2, 4.0), with the World Wide Web Consortium (W3C) recommending 4.0 as the markup language of choice. HTML is a derivative of SGML, the Standard Generalised Markup Language. SGML will be discussed in greater detail in Chapter 5, but suffice it to say that it is an international standard metalanguage that defines a set of rules for device-independent, system-independent methods of encoding electronic texts. SGML allows you to create your own markup language but provides the rules necessary to ensure its processing and preservation. HTML is a successful implementation of the SGML concepts and, as a result, is accessible to most browsers and platforms. It is also a relatively simple markup language to learn, as it has a limited tagset. HTML is by far the most popular web-publishing language, allowing users to create online text documents that include multimedia elements (such as images, sounds, and video clips), and then put these documents in an environment that allows for instant publication and retrieval.

There are many advantages to a markup language like HTML. As mentioned above, the primary benefit is that a document encoded with HTML can be viewed in almost any browser, an extremely attractive option for a creator who wants documents that can be viewed by an audience with varied systems. However, it is important to note that while the encoding can cross platforms, page appearance consistently differs between browsers. While W3C recommends the use of HTML 4.0, many of its features are simply not available to users with early versions of browsers. Unlike PDF, which is extremely concerned with keeping the document and its format intact, HTML has no true sense of page structure, and files can be neither saved nor printed with any sense of precision.

Besides the benefit of a markup language that crosses platforms with ease, HTML attracts its many users because of the ease with which it can be mastered. For users who do not want to take the time to learn the tagset, the good news is that conversion-to-HTML tools are becoming more accessible and easier to use. Those who cannot even spare the time to learn how to use HTML-creation software, of which only a limited number exist, can sit down with any text creation program (Notepad, for example) and author an HTML document. Then, by using the 'Open File' tool in a browser, the document can immediately be viewed. What this means for novice HTML authors is that they can sit down with a text editor and a browser and teach themselves a markup language in one session. And as David Seaman, Director of the Electronic Text Center at the University of Virginia, points out:

[this] has a real pedagogical value as a form of SGML that makes clear to newcomers the concept of standardized markup. To the novice, the mass of information that constitutes the Text Encoding Initiative Guidelines — the premier tagging scheme for most humanities documents — is not easily grasped. In contrast, the concise guidelines to HTML that are available on-line (and usually as a "help" option from the menu of a Web client) are a good introduction to some of the basic SGML concepts. (Seaman 1994).

This is of real value to the user. The notion of marking up a text is quite often an overwhelming concept. Most people do not realise that markup enters into their life every time they make a keystroke in a word processing program. So for the uninitiated, HTML provides a manageable stepping-stone into the world of more complex encoding. Once this limited tagset is mastered, many users find the jump into an extended markup language less intimidating — and more liberating.
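The text-editor-and-browser approach described above can be sketched as follows. This is only an illustration: the file name first.html, the page content, and the use of a temporary directory are all assumptions made for the example, and Python stands in for the act of typing the page by hand and saving it.

```python
import tempfile
from pathlib import Path

# A small hand-written HTML 4.0 page. The content is a placeholder.
PAGE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<html>
<head><title>My first page</title></head>
<body>
<h1>My first page</h1>
<p>This paragraph was typed in a plain text editor.</p>
</body>
</html>
"""

def write_page(directory, name="first.html"):
    """Save the page; the resulting file can be opened in a browser."""
    target = Path(directory) / name
    target.write_text(PAGE, encoding="utf-8")
    return target

# A temporary directory keeps the example out of the working folder.
saved = write_page(tempfile.mkdtemp())
print(saved)
```

Opening the printed path with a browser's 'Open File' command displays the page; nothing beyond the text file itself is required.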

However, one of the drawbacks to this easy authoring language is that many online documents are created without a DTD. DTD is the abbreviation for document type definition, which outlines the formal specifications for an SGML-encoded document. Basically, a DTD is the method for spelling out the SGML rules that the document is following. It sets the standards for what markup can be used and how each piece of markup interacts with the others. So, for example, if you create an HTML document with a specific software program, say HoTMetaL PRO, the resulting text will begin with a document type declaration stating which DTD is being used. A sample declaration from a HoTMetaL creation looks like this:
<!DOCTYPE HTML PUBLIC "-//SoftQuad//DTD HoTMetaL PRO 4.0::19970714::extensions to HTML 4.0//EN" "hmpro4.dtd">
As the above statement shows, the declaration states that the document will follow the HoTMetaL PRO 4.0 DTD. The markup used must therefore adhere to the rules set out in this specific DTD. If it does not, the document cannot be successfully validated and may not be processed correctly by software that relies on the DTD.
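Whether a document carries a document type declaration at all can be checked mechanically. The sketch below uses the HTMLParser class from Python's standard library; the class name DoctypeFinder, the function find_doctype and the two sample strings are invented for this example. It only reports the declaration it finds. It does not validate the document against the DTD, which would require a validating SGML parser.

```python
from html.parser import HTMLParser

class DoctypeFinder(HTMLParser):
    """Record the document type declaration, if the document has one."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # Called with the text between '<!' and '>'.
        if decl.upper().startswith("DOCTYPE") and self.doctype is None:
            self.doctype = decl

def find_doctype(text):
    """Return the declaration text, or None when there is none."""
    finder = DoctypeFinder()
    finder.feed(text)
    finder.close()
    return finder.doctype

with_decl = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">'
             "<html><body><p>Hi</p></body></html>")
without_decl = "<html><body><p>Hi</p></body></html>"

print(find_doctype(with_decl))
print(find_doctype(without_decl))
```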

As it stands now, web browsers require neither a DTD nor a document type declaration. Browsers are notoriously lax in their HTML requirements, and unless something serious is missing from the encoded document it will be successfully viewed through a Web client. The impact of this is that while HTML provides a convenient and universal markup language for a user, many of the documents floating out in cyberspace are permeated with invalid code. The focus then moves away from authoring documents that conform to a set of encoding guidelines and towards the creation of works that can be viewed in a browser (Seaman 1994). This problem will become more severe with the increased use of Extensible Markup Language, or XML as it is more commonly known. This markup language, which is being lauded as the new lingua franca, combines the visual benefits of HTML with the contextual benefits of SGML/TEI. However, while XML will have the universality of HTML, the web clients will require a more stringent adherence to markup rules. While documents that comply with the rules of an HTML DTD will find the transition relatively simple, the documents that were constructed strictly with viewing in mind will require a good deal of clean up prior to conversion.
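The gap between lax HTML processing and strict XML processing can be illustrated with Python's standard library. In this sketch the lenient html.parser module stands in for a forgiving browser and xml.etree.ElementTree for a strict XML client; that pairing, like the sample markup and the function names, is an assumption made for illustration rather than a description of any particular web client.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Paragraphs opened but never closed, as in much hand-written HTML.
SLOPPY = "<p>First paragraph<p>Second paragraph"
TIDY = "<div><p>First paragraph</p><p>Second paragraph</p></div>"

class TagCollector(HTMLParser):
    """Collect start tags the way a forgiving HTML processor would."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def lenient_tags(text):
    """Parse without complaint and return the start tags seen."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags

def is_well_formed_xml(text):
    """Return True only if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(lenient_tags(SLOPPY))
print(is_well_formed_xml(SLOPPY))
print(is_well_formed_xml(TIDY))
```

The lenient parser reads the unclosed paragraphs without objection, while the XML parser rejects them outright; this is the kind of clean-up that viewing-oriented documents would need before conversion.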

This is not to say that HTML is not a useful tool for creating online documents. As in the case of PostScript and PDF, the choice to use HTML should be document dependent. It is the perfect choice for static documents that will have a short shelf-life. If you are creating course pages or supplementary materials regarding specific readings that will not be necessary or available after the end of term, then HTML is an appropriate choice. If, however, you are concerned about presentational and structural integrity, the markup of document content and/or the long-term preservation of the text, then a user-definable markup language is a much better choice.

4.2.3: User-definable descriptive markup

A user-definable descriptive markup is exactly what its name implies. The content of the markup tags is established solely by the user, not by the software. As a result of SGML and its concept of a DTD, a document can have any kind of markup a creator desires. This frees the document from being married to proprietary hardware or software and from its reliance upon an appearance-based markup language. If you decide to encode the document with a non-proprietary language, which we highly recommend, then this is a good time to evaluate the project goals. While a user-definable markup language gives you control over the content of the markup, and thereby more control over the document, the markup can only be fully understood by you. Although not tied to a proprietary system, it is also not tied to any accepted standard. A markup language defined and implemented by you is simply that — a personal non-proprietary markup system.

However, if the electronic texts require a language that is non-proprietary, more extensive and content-oriented than HTML, and comprehensible and acceptable to a humanities audience, then there is a solution — the Text Encoding Initiative (TEI). TEI is an international implementation of SGML, providing a non-proprietary markup language that has become the de facto standard in Humanities Computing. TEI, which is explained more fully in Chapter 5, provides 'a full set of tags, a methodology, and a set of Document Type Descriptions (DTDs) that allow the detailed (or not so detailed) description of the spatial, intellectual, structural, and typographic form of a work' (Seaman 1994).

4.3: Implications for long-term preservation and reuse

Markup is a critical, and inescapable, part of text creation and processing. Regardless of the method chosen to encode the document, some form of markup will be included in the text. Whether this markup is proprietary or non-proprietary, appearance- or content-based is up to you. Be sure to evaluate the project goals when making the encoding decisions. If the project is short-lived or necessarily software dependent, then the choices are relatively straightforward. However, if you are at all concerned about long-term preservation, cross-platform capabilities, and/or descriptive markup, then a user-definable (preferably TEI) markup language is the best choice. As Peter Shillingsburg corroborates:

...the editor with a universal encoding system developing an electronic edition with a multiplatform application has created a tool available to anyone with a computer and has ensured the longevity of the editorial work through generations to come of software and hardware. It seems worth the effort (Shillingsburg 1996, 163).
© The right of Alan Morrison, Michael Popham and Karen Wikander to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988. 

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or part of any of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service. Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.
Arts and Humanities Data Service 
 