Tom Worthington FACS
Visiting Fellow, Department of Computer Science, Australian National University, Canberra, and Director of Publications for the Australian Computer Society
For: Computing 3410 Students, The Australian National University
This document is Version 1.0 20 August 2002: http://www.tomw.net.au/2002/epo.html
This material was prepared for the unit Information Technology in Electronic Commerce (COMP3410) at the Australian National University, semester 2, 2001, as part of "A Common Understanding: Metadata, Data Management, the Digital Library and Electronic Document Management".
In this course last year, the case study "Electronic Publishing Options for the Australian Computer Society" asked "How free and open should access to scholarly research be?" This case study asks: what can be done with available technology?
On-line publications in the IT field, such as "The Computing Research Repository (CoRR)", accept documents on-line in electronic formats. Use of TeX/LaTeX is encouraged, and HTML is also accepted. PDF and PostScript are accepted but not encouraged. Word processing formats, such as MS-Word, are not accepted.
CoRR requires a title and abstract form to be submitted along with the document. This can be completed interactively on-line or as a text file for e-mail or FTP submission. The archive responds with a copy of the abstract and a URL for a PostScript version of the document (protected with a username/password). The submitter is expected to check the version sent back and make corrections before the document is made public automatically (there is no further human intervention in publication).
The process of checking electronic documents is called "preflight" in the publishing industry:
Preflight - a term used to test the validity and completeness of a prepared DTP document, ready for supply to a bureau service provider.
From: "Glossary of Terms", Goprint, 2001, URL: http://www.goprint.qld.gov.au/dj_gloss.htm#g_p
Commercial products provide checking services:
When you subscribe to Preflight Online and integrate it into your Web site, you can customize it to prevent customers from delivering jobs with the errors you see most often, such as missing fonts or images, low-resolution images, incorrect color mode or page size, and much more.
From: Extensis Preflight Online, Extensis, 2002, URL: http://www.extensis.com/preflightonline/
Academic publishers appear, in general, not to make use of preflight processes. Adopting them would reduce the manual effort required of the publisher and might also reduce the work required of authors. For example, submission processes require information which is already included in the text of the paper (such as title, author and affiliation) to be entered again in a separate on-line form. Examples are the ANU Digital Theses Deposit Procedures and the E-Print Repository Deposit Procedures. If preflight processes were used, this information could be extracted from the text of the document and presented to the author for checking.
Most authors will be unfamiliar with the discipline of using a template, and it is not clear whether they can easily be educated in its use. However, if popular journals use the same (or similar) style sheets and this speeds up submission, it may be possible. Also, use of a style sheet should allow creation of adaptable documents, rather than just print-line PDF documents.
Using OpenOffice.org to Translate Documents
An example of a document converted using OpenOffice.org is "ICT Development in Australia - A Strategic Policy Review", prepared for the Australian Computer Society by Professor Houghton. The web adaptation of the report was created from the MS-Word version. This was done by first importing the MS-Word document into OpenOffice.org and saving it as HTML. The HTML was run through the "Tidy" utility to replace formatting commands throughout the document with styles. The table of contents was then manually relinked to the document sections and ALT text was placed on images.
Using OpenOffice.org to produce HTML has limitations. A better approach may be to use OpenOffice's internal XML format as an intermediate format. This retains more information about the original MS-Word document than is present in an HTML translation.
OpenOffice files are stored as a ZIP-compressed archive containing a directory of files. The text of the word processing document is stored in a file named "content.xml" at the top level of the archive. Images and other binary files are stored in sub-directories.
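As a minimal sketch of how a preflight system might get at this XML, the Python below opens the package with the standard zipfile module and reads "content.xml". The function names are illustrative only and are not part of OpenOffice.org; any file name passed in (such as "paper.sxw") is likewise an assumption.

```python
import zipfile


def list_members(package):
    """Return the names of all files stored in the package (path or file object)."""
    with zipfile.ZipFile(package) as archive:
        return archive.namelist()


def read_content_xml(package):
    """Return the raw bytes of content.xml, the file holding the document text."""
    with zipfile.ZipFile(package) as archive:
        return archive.read("content.xml")
```

Calling list_members("paper.sxw") on a saved document should show content.xml alongside the other files in the package, which is a quick way to confirm the layout described above.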
Styles from the original style sheet are reflected in text styles in the translated XML documents:
<text:p text:style-name="Title">Preparation of Papers for IEEE T<text:span text:style-name="T9">RANSACTIONS</text:span> and J<text:span text:style-name="T9">OURNALS</text:span><text:span text:style-name="T10"> </text:span>(April 2002)</text:p>
<text:h text:style-name="Heading 1" text:level="1"><text:bookmark-end text:name="PointTmp"/>I<text:span text:style-name="T1">NTRODUCTION</text:span></text:h>
<text:p text:style-name="Title">This Is the Title of the Paper</text:p>
<text:p text:style-name="Primary Head">1. INTRODUCTION </text:p>
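Because the template's style names survive the translation, metadata such as the title could in principle be pulled out by style name. The following is a sketch only, using Python's standard ElementTree parser. It matches on local element and attribute names ("p", "h", "style-name") so that it does not have to assume the exact namespace URI bound to the "text:" prefix; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(name):
    """Strip the '{namespace}' prefix ElementTree attaches to names."""
    return name.rsplit("}", 1)[-1]


def styled_text(xml_bytes, style_name):
    """Return the text of each paragraph or heading that uses style_name."""
    root = ET.fromstring(xml_bytes)
    found = []
    for element in root.iter():
        if local(element.tag) not in ("p", "h"):
            continue
        styles = [value for key, value in element.attrib.items()
                  if local(key) == "style-name"]
        if style_name in styles:
            # itertext() joins the text of nested spans, discarding the
            # character-level markup (such as the small-caps spans above).
            found.append("".join(element.itertext()).strip())
    return found
```

Applied to the first fragment above, styled_text(data, "Title") would return the title with the span markup flattened away. An empty list for "Title" is the kind of condition a preflight check could report back to the author.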
In theory it should be possible to open the files which OpenOffice creates using XML-capable desktop publishing software. In practice, this does not work reliably. The DTD defining the structure of OpenOffice XML files is intended for documentary purposes only and does not appear to have been used to generate or verify the code. Adobe FrameMaker Version 7.0 reported syntax errors in the DTD (which have been reported to the OO project):
Revision 188.8.131.52 May 31 2002 of drawing.mod <http://www.openoffice.org/source/browse/xml/xmloff/dtd/drawing.mod> shows two occurrences of: <!ATTLIST draw:text-box %draw-transform; >
From: Issue 6697, 2002-08-02, Project Issue Tracking: openoffice.org
Revision 1.31 May 6 2002 of chart.mod <http://www.openoffice.org/source/browse/xml/xmloff/dtd/chart.mod> has an occurrence of: fo:direction (ltr|ttb)
From: Issue 6698, 2002-08-02, Project Issue Tracking: openoffice.org
Also, OpenOffice creates a considerable amount of verbose markup. It should be possible to use a simplified translation which ignores unneeded markup and uses the options identified in the stylesheet. A form of Preflight could then be used:
- The Author downloads a template and uses this to create their document in a word processor,
- The Author uploads a copy of the word processing document to the on-line preflight system,
- The Preflight system converts the word processing file to an internal XML format, extracts metadata, creates a PDF draft of the document and reports errors (such as "title not found"),
- The Author examines the error messages, adjusts their document and tries again,
- When the preflight system and the author are happy, the document is submitted for review.
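The checking step in this loop can be sketched as follows. This is an illustration under stated assumptions, not a description of any existing preflight product: the input is taken to be a list of (style name, text) pairs already extracted from the converted document, and the "Author" and "Abstract" style names are hypothetical (only "Title" appears in the templates shown earlier).

```python
def preflight(styled_paragraphs, required=("Title", "Author", "Abstract")):
    """Collect metadata by style name and report what is missing.

    styled_paragraphs: list of (style_name, text) pairs from the document.
    required: style names the template is assumed to define (hypothetical).
    Returns (metadata, errors); the document is ready when errors is empty.
    """
    metadata = {}
    errors = []
    for style in required:
        values = [text.strip() for name, text in styled_paragraphs
                  if name == style and text.strip()]
        if values:
            # Keep the first occurrence so the author can check it.
            metadata[style.lower()] = values[0]
        else:
            errors.append(f"{style.lower()} not found")
    return metadata, errors
```

The author would be shown both results: the extracted metadata to confirm, and the error list to fix before trying again.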
With this system, if the author doesn't use the template correctly, content will be missing from the formatted document, or will not be correctly formatted. In some cases it is possible to identify the error automatically: for example, if the title of the document is not correctly marked, an error message can be generated indicating that the document has no title. Exaggerated formatting might be needed for the preflight draft, because it may otherwise be difficult to see the difference between, for example, normal text and a citation.
Copyright © Tom Worthington. 2001