Making e-publishing easy - Electronic Document Format for a New IT Publication

Tom Worthington FACS HLM

Director of Publishing for the Australian Computer Society and Visiting Fellow, Department of Computer Science, Australian National University, Canberra

For: Electronic Documents and Archives Symposium, Australian National University, 27 February 2004
This document is Version 0.3 21 February 2004 – Incomplete draft discussed in Computing 3410 in 2003, at the Australian National University

Introduction

This paper discusses the options for an electronic document format for an on-line information technology publication of the Australian Computer Society. As well as defining a format for a real publication, it is intended to be of use to students of document representation, digital libraries and electronic document management in the Australian National University second semester unit “Information Technology in Electronic Commerce” (COMP3410).

As part of COMP3410 in 2001 a case study was presented on the electronic publishing requirements of the Australian Computer Society [1]. At that time organisations such as the ACM were making a considerable investment in electronic publishing technology. This was supplemented the following year with a discussion [2] of the concept of Preflight systems and an investigation of the generation of XML from ACM and IEEE document templates using the OpenOffice.org product. For 2003 a draft of this document was presented, discussing what is needed to build a prototype system for the ACS. In 2004 an ANU third year student nominated this as a project to undertake. The prototype is expected to be completed by mid 2004.

The first step in building such a system is to decide what information it should hold. Both the content of the actual papers and the metadata about them will be discussed.

Given the rapid development in XML it was considered better for the ACS to wait until the technology was more widely available.

IEEE Xplore, the online delivery system for all the IEEE's journals, magazines, conference proceedings, and standards, is now bigger and better than ever, thanks to its latest release, launched in December. ...
Another enhancement is full-text HTML formats for issues of IEEE Spectrum and Proceeding of the IEEE going back to January 2002. PDF versions are still available, but articles presented in HTML are easier to navigate, Williams says.
From: Upgrade Makes IEEE Xplore Easier to Explore, BY ERICA VONDERHEID, IEEE, 23 February 2004 08:00 AM (GMT -05:00)

The Future of Open Source Software, Bill Appelbe, JRPIT, Volume 35, No. 4, 2003, URL: http://www.jrpit.acs.org.au/JRPITVolumes/JRPIT35/JRPIT35.4.227.pdf

Relatively efficient PDF of only 39 kbytes for 10 pages (one small photo):


Zooming in far enough to read the text results in lines dropping off the right hand side of the screen:


Limited XML Format

Since at least 1994 [3], ACM has been working on systems to convert papers to a structured electronic format (originally SGML and later XML). However, the structure used has not been made public and no generally accepted format for publication of IT papers exists. It is therefore necessary to define a format. Rather than define a new XML format, a subset of XHTML is proposed.

While XML would seem the obvious encoding to use, this would then require additional processing to create a document which can be viewed on pre-XML web browsers. This could be done by storing the XML document and converting it to HTML for display. However, HTML was originally designed for publishing scientific research papers [4], so it would seem reasonable to use this format for IT papers. Using HTML would remove the need to transform documents for display, allowing one version to be used for creation, storage, on-screen display and printing.
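The extra processing step that a custom XML format would need can be illustrated with a minimal Python sketch. The element names in the sample (paper, title, section, heading, para) and the mapping table are hypothetical, invented only for illustration. They are not the ACM structure or any proposed ACS format.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from an invented paper vocabulary to HTML element names.
TAG_MAP = {
    "paper": "body",
    "title": "h1",
    "section": "div",
    "heading": "h2",
    "para": "p",
}

def to_html(xml_text):
    """Rename each element via TAG_MAP and serialise the tree as HTML."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Unknown elements fall back to a neutral inline container.
        element.tag = TAG_MAP.get(element.tag, "span")
    return ET.tostring(root, encoding="unicode", method="html")

SAMPLE = (
    "<paper><title>A Paper</title>"
    "<section><heading>Introduction</heading><para>Some text.</para></section>"
    "</paper>"
)

if __name__ == "__main__":
    print(to_html(SAMPLE))
```

Every stored document would have to pass through a step of this kind before a pre-XML browser could show it. Storing the paper directly in an HTML-based format avoids that step.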

XHTML is a version of HTML modified slightly to conform to XML syntax. Older web browsers which do not support XML directly can display XHTML documents reasonably well. With styling added through CSS, this can provide a high quality display on advanced web browsers, while still being readable on older devices. Additional formatting commands can be used to provide a printed display similar to a PDF document. Web browsers with limited formatting capabilities (because they are older, for hand held devices or for the disabled) will ignore the advanced formatting but still render a readable web page.
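Because XHTML conforms to XML syntax, an ordinary XML parser can check a paper before it is accepted. The following Python sketch parses a document and reports any elements outside a permitted subset. The ALLOWED set is an assumption made for illustration, since the subset itself has not yet been defined here.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset: this paper does not yet say which elements are allowed.
ALLOWED = {
    "html", "head", "title", "meta", "body",
    "h1", "h2", "h3", "p", "a", "ul", "ol", "li",
    "table", "tr", "th", "td", "img",
}

def disallowed_elements(xhtml_text):
    """Parse the text as XML and return the element names outside ALLOWED."""
    # ET.fromstring raises ET.ParseError if the text is not well-formed XML.
    root = ET.fromstring(xhtml_text)
    found = set()
    for element in root.iter():
        # Namespaced tags look like "{namespace-uri}name"; keep only the name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name not in ALLOWED:
            found.add(local_name)
    return sorted(found)

SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample paper</title></head>
<body><h1>Sample paper</h1><p>Text with a <font>styled</font> word.</p></body>
</html>"""

if __name__ == "__main__":
    print(disallowed_elements(SAMPLE))
```

For the sample above the function reports the font element, which is presentational markup that would be left to CSS. A document with an unclosed tag is rejected by the parser before the subset check runs.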

Some advanced features, such as MathML for mathematical equations, will not render on pre-XML web browsers. The conventional approach to this has been to render equations as an image (usually in GIF format) for older browsers. However, this requires generating multiple versions of the document. Instead it will be assumed that IT professionals working on advanced IT concepts will have more modern browsers with MathML and similar features. Those without these features will still be able to follow the discussion of the equations, from reading the accompanying text in the paper, even if they can't see the equations.

CMS Fantasy

In the conventional view of e-publishing, the raw material for the publication is prepared using various tools and then imported into a content management system (CMS). The CMS can then produce many custom versions of the publication for print on demand, the web, or a PDA. Output formats can be dynamically adjusted to suit a particular device, such as a mobile phone.

Existing electronic content will be converted and cleaned, to remove extraneous formatting, before input to the CMS. The CMS will store the content in a pure form, so it can be repurposed easily for different formats of publication. Formatting, such as font types and sizes, is stripped off. Useful information, such as chapter headings, is converted to the CMS's internal format. New content is created using tools which can export in an XML format tailored to the subject matter.
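The cleaning step in this conventional view can be sketched in a few lines of Python using the standard library HTML parser. The lists of presentational tags and attributes are assumptions chosen for illustration. A real CMS would have its own rules and internal format.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical choice of presentational markup to discard.
PRESENTATIONAL_TAGS = {"font", "center"}
PRESENTATIONAL_ATTRS = {"style", "align", "color", "face", "size", "bgcolor"}

class FormattingStripper(HTMLParser):
    """Re-emit a document without presentational tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in PRESENTATIONAL_TAGS:
            return  # drop the tag itself; its text content is still kept
        kept = [(k, v) for k, v in attrs if k not in PRESENTATIONAL_ATTRS]
        rendered = "".join(
            ' {}="{}"'.format(k, escape(v or "", quote=True)) for k, v in kept
        )
        self.parts.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        if tag not in PRESENTATIONAL_TAGS:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Character references were decoded by the parser; encode them again.
        self.parts.append(escape(data, quote=False))

def strip_formatting(html_text):
    parser = FormattingStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)

SAMPLE = (
    '<h1 align="center"><font size="5">Chapter 1</font></h1>'
    '<p style="color:red">Text &amp; more</p>'
)

if __name__ == "__main__":
    print(strip_formatting(SAMPLE))
```

Structural markup such as the chapter heading survives, while font sizes and inline styles are removed.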

This ideal world of the CMS is a fantasy. In reality there are no standards to which CMSs conform. Each time a new CMS is introduced, content has to be laboriously converted from the previous system. There are no widely adopted XML formats, so content has to go through a laborious and error prone conversion process when received from the author or another CMS. The output formats keep changing, so more and more output versions of the document have to be produced. Print on demand books are prohibitively expensive to produce. PDF documents are a poor way to read on screen. Dynamic creation of formats takes considerable system resources, so the CMS delivers a poor user experience. After a few years the only reliable version of the publication which is left is the printed one.

However, the left and right margins of the document waste screen space on a small screen.

However, the text does display reasonably well on a small screen viewer, such as Opera:


Having been stored as a GIF image, the text of tables does not display satisfactorily:


Metadata

Almost more important than the content of scientific papers is the metadata used to find them. As part of the Open Archives Initiative, the Open Citation Project (OpCit) GNU EPrints ...

And that is as far as I have got and just about the state of the art. ;-)

Tom Worthington

21 August 2003

Welcome to the ANU E Press, 10 December 2003, ANU, URL: http://epress.anu.edu.au/index.htm

E Press Titles, ANU, 10 December 2003, URL: http://epress.anu.edu.au/titles.htm



Out of the Ashes: Deconstruction and Reconstruction of East Timor, ANU 10 December 2003, URL: http://epress.anu.edu.au/oota_citation.htm

URL: http://epress.anu.edu.au/oota/frames.php?page=3&chapters=h

URL: http://epress.anu.edu.au/oota/c3.pdf

URL: http://epress.anu.edu.au/oota/c3.pdb

URL: http://epress.anu.edu.au/oota/ch3.htm

[1] Case Study: Electronic Publishing Options for the Australian Computer Society, Tom Worthington, 15 August 2001, URL: http://www.tomw.net.au/2001/acsepub.html

[2] Electronic Publishing Options for Academic Material, Tom Worthington, 20 August 2002, URL: http://www.tomw.net.au/2002/epo.html

[3] Converting ACM Authors' Articles to SGML, Bradley C. Watson, OCLC, 1994, URL: http://www.oclc.org/research/publications/arr/1994/part1/convacm.htm

[4] Information Management: A Proposal, Tim Berners-Lee, CERN, March 1989, URL: http://www.w3.org/History/1989/proposal-msw.html