A Common Understanding: Metadata, Data Management, the Digital Library and Electronic Document Management

Tom Worthington FACS

Visiting Fellow, Department of Computer Science, Australian National University, Canberra

For: Computing 3410 Students, The Australian National University
This document is Version 1.1 5 August 2002: http://www.tomw.net.au/2001/mdm.html

Notes for 2005 also available

Introduction

This material was prepared for the unit Information Technology in Electronic Commerce (COMP3410) at the Australian National University, semester 2, 2001. It is intended to introduce three topics: Metadata, Electronic Document Management and the Digital Library. There have been minor revisions to the material from 2001's lectures and major changes from 2000.

This material is intended to complement other components of the unit on applications of information technology in electronic commerce. Metadata, Data Management, the Digital Library and Electronic Document Management is intended to fit with lectures on document representation, knowledge discovery, trading and security. Case studies are used where appropriate. The material may also be of use to those interested in the issues, but not undertaking formal study.

This document is intended to provide both a set of "slides" for a group presentation and notes. The notes can be read or printed for individual use. For a slide-show group presentation, set your web browser to use a large font size and the accompanying style sheet, then select the frames version of the document. The style sheet is designed to omit the notes sections of the document, which are marked with the class definition "optional" and leave a large margin before titles marked "newslide". These slides may not fit precisely on screen, but provide more flexibility than a conventional slide show.

Topics

  1. The Politics of Data Standards
  2. Metadata
    1. Dublin Core and Derivatives
    2. EDI Expressed in XML
    3. Metadata Examples
  3. XML Schema
    1. Web Services Demonstration
    2. Case Study: Metadata Management Facility And Search Tool for New Zealand
  4. Data management
    1. Records Management and Record Keeping
    2. Digital Libraries
    3. Electronic Book and E-Document Formats
    4. PDF and XML
    5. Case Study: Electronic Publishing Options for the Australian Computer Society

The Politics of Data Standards

The common theme of these lectures is the creation, transmission, storage, discovery and display of information in electronic format. The title "A Common Understanding" comes from the need for those creating electronic information to agree a common format for the information to be understood. The challenge is to create formats which are sufficiently expressive to be able to communicate what is needed, but simple enough to be implemented efficiently. Those involved in creating the standard and in using it must have a common understanding of what is needed and what is enough. In implementing metadata and data management standards IT professionals need to keep the politics of standards development in mind. Most standards need to be profiled, to create a workable subset, before they can be used for practical purposes. Some standards need to be enhanced and others not used at all.

The World Wide Web Consortium (W3C) standard for Scalable Vector Graphics (SVG) provides a way to define images in web pages. As well as the expected features of shapes, filling, symbols, colors and patterns, there is the 'metadata' element:

<!ENTITY % metadataExt "" >

<!ELEMENT metadata (#PCDATA %metadataExt;)* >

<!ATTLIST metadata %stdAttrs; >

From 21.2 The 'metadata' element, Scalable Vector Graphics (SVG) 1.0 Specification W3C Proposed Recommendation 19 July, 2001, URL: http://www.w3.org/TR/SVG/

This apparently technically simple definition is made politically complex by a preceding paragraph:

Individual industries or individual content creators are free to define their own metadata schema but are encouraged to follow existing metadata standards and use standard metadata schema wherever possible to promote interchange and interoperability. If a particular standard metadata schema does not meet your needs, then it is usually better to define an additional metadata schema in an existing framework such as RDF and to use custom metadata schema in combination with standard metadata schema, rather than totally ignore the standard schema.

From 21.1 Introduction, Scalable Vector Graphics (SVG) 1.0 Specification W3C Proposed Recommendation 19 July, 2001, URL: http://www.w3.org/TR/SVG/

The important points here are: "...free to define their own metadata schema but are encouraged to follow existing metadata standards ... better to define an additional metadata schema in an existing framework ...". In some ways the ease of defining metadata using new web based tools has made the standardization process more difficult. It is very tempting, if an existing definition is not quite right, to define a new standard and hope that some tool will allow conversion between the standards. However, having many competing standards creates much the same problem as having no standard at all.
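As a rough illustration of the approach the specification recommends, the sketch below embeds RDF in an SVG 'metadata' element, combining two Dublin Core elements with one custom property, and reads them back using Python's standard library. The "ex" namespace, its "projectCode" property and all of the sample values are assumptions made up for this example; they do not come from any standard.

```python
import xml.etree.ElementTree as ET

# The SVG, RDF and Dublin Core namespace URIs are the published ones.
# "ex" is a hypothetical custom schema invented for this sketch.
NS = {
    "svg": "http://www.w3.org/2000/svg",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ex": "http://example.org/metadata#",
}

SVG_DOC = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:ex="http://example.org/metadata#">
      <rdf:Description rdf:about="">
        <dc:title>Sample diagram</dc:title>
        <dc:creator>A. N. Author</dc:creator>
        <ex:projectCode>P-42</ex:projectCode>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <circle cx="50" cy="50" r="40"/>
</svg>"""


def read_metadata(svg_text):
    """Return (standard, custom) dicts of the properties found in <metadata>."""
    root = ET.fromstring(svg_text)
    description = root.find("svg:metadata/rdf:RDF/rdf:Description", NS)
    standard, custom = {}, {}
    if description is None:
        return standard, custom
    for child in description:
        # ElementTree stores qualified names as "{namespace-uri}local-name".
        uri, _, local = child.tag[1:].partition("}")
        target = standard if uri == NS["dc"] else custom
        target[local] = (child.text or "").strip()
    return standard, custom


standard, custom = read_metadata(SVG_DOC)
print("Dublin Core:", standard)
print("Custom:", custom)
```

A tool that understands only Dublin Core can still use the title and creator, while the custom property remains available to software that knows the additional schema. This is the practical benefit of extending a standard schema rather than replacing it.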

According to Cunningham (1998), the Australian Government Locator Service (AGLS) metadata standard (discussed later) was originally called "AUSGILS" and intended to be based on the U.S. Government Information Locator Service (GILS), but this was abandoned in favor of the Dublin Core metadata standard in 1997. This author's recollection differs: the proposed standard was called "AGILS" for political reasons, to suggest compatibility with the US Government standard, without necessarily any intention to achieve compatibility (OGO 1996).

Standards politics are very important to metadata and electronic document development in the real world. Few decisions are made based on the technical merits of proposals. There are few cases where metadata standards are developed from first principles. Selections are made from existing metadata standards, based on the level of support for those standards, and the perceived importance of those organisations and individuals supporting them. Standards are then adapted, extended, made into subsets or combined.

Standards, Metadata and Documents

Thousands of millions of dollars in business for e-commerce and electronic publishing depend on decisions to be made in the next year over what standards to use. Previously separate standards for electronic commerce, documents and television are converging. Metadata standards for electronic data interchange can now be converted to use the same format (XML) as is used for electronic documents. These same formats are proposed to be used for interactive TV. How rapidly and how effectively will this convergence happen?
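To give a feel for what converting electronic data interchange to XML can involve, the sketch below splits a short EDIFACT-style fragment into segments, data elements and components, and wraps each in an XML element. The sample message and the element names ("message", "segment", "element", "component") are made up for this illustration. It ignores release characters and service segments, and it does not follow any published XML/EDI vocabulary.

```python
import xml.etree.ElementTree as ET

# Illustrative EDIFACT-style fragment: "'" ends a segment, "+" separates
# data elements and ":" separates the components within a data element.
MESSAGE = "BGM+220+PO12345'DTM+137:20010801:102'QTY+21:48'"


def edi_to_xml(message):
    """Convert a simplified EDI string into an ElementTree element."""
    root = ET.Element("message")
    for raw in message.split("'"):
        if not raw:
            continue  # skip the empty string after the final terminator
        tag, *elements = raw.split("+")
        segment = ET.SubElement(root, "segment", {"tag": tag})
        for element in elements:
            node = ET.SubElement(segment, "element")
            for component in element.split(":"):
                ET.SubElement(node, "component").text = component
    return root


xml_text = ET.tostring(edi_to_xml(MESSAGE), encoding="unicode")
print(xml_text)
```

Once the message is in XML it can be handled with the same parsers, style sheets and schema tools used for electronic documents, which is the convergence described above.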

One example of where standards for document formats and commercial interests collide is the Portable Document Format (PDF). Developed by Adobe as an extension to the Postscript format for desk-top publishing, PDF has provided a popular electronic document format. However, PDF has a number of limitations as an on-screen format and for disabled users. Adobe have attempted to address these limitations with "Tagged Adobe PDF", which adds some XML interoperability to the PDF format.

Adobe Acrobat 5.0 software introduces tagged Adobe PDF, an enhancement to the PDF specification that allows PDF files to contain logical document structure. Logical structure refers to the organization of a document, such as the title page, chapters, sections, and subsections. Tagged Adobe PDF documents can be reflowed to fit small-screen devices and offer better support for repurposing content. They also are more accessible to the visually impaired.

From Adobe PDF, Adobe Systems Incorporated, 2001

The technical capabilities of Tagged Adobe PDF documents are not yet proven and may be limited by their dependence on the pdfmark format. Additional work is also needed by document creators to use the new features. There is an inherent contradiction between one of PDF's original selling points, providing an accurate representation of a printed document, and the aim of the enhancement, which is to allow the representation to be transformed. This may result in a loss of market share by Adobe in e-document tools and the replacement of PDF with an XML based format. One possible format is OpenOffice.org's XML Packages format. This packages up XML documents and supplementary binary format data, such as images, in ZIP file format.
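The packaging idea can be sketched in a few lines of Python using the standard zipfile module: XML text and binary data are written as separate members of one ZIP archive and read back by name. The member names and the XML below are simplified placeholders chosen for this example. The result is not a valid OpenOffice.org document and the XML does not follow its schema.

```python
import io
import zipfile

# Placeholder content: simplified XML and a few bytes standing in for an image.
CONTENT_XML = '<?xml version="1.0" encoding="UTF-8"?>\n<document><p>Hello, world.</p></document>\n'
META_XML = '<?xml version="1.0" encoding="UTF-8"?>\n<meta><title>Sample</title></meta>\n'
IMAGE_BYTES = bytes([0x89, 0x50, 0x4E, 0x47])  # not a complete image file


def build_package(members):
    """Write a mapping of member name -> str or bytes into an in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as package:
        for name, data in members.items():
            package.writestr(name, data)  # str data is stored as UTF-8
    return buffer.getvalue()


def list_package(package_bytes):
    """Return the member names in the order they were stored."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def read_member(package_bytes, name):
    """Return the raw bytes of one member of the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.read(name)


pkg = build_package({
    "content.xml": CONTENT_XML,
    "meta.xml": META_XML,
    "Pictures/logo.png": IMAGE_BYTES,
})
print(list_package(pkg))
print(read_member(pkg, "meta.xml").decode("utf-8"))
```

Because the container is an ordinary ZIP file, the document text and its metadata stay readable with generic ZIP and XML tools, without needing the application that produced them.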

References

  1. Cunningham, A. (1998) Enabling Seamless Online Access to Government, National Archives of Australia

Further Information

  1. Next section: Metadata for E-commerce
  2. Tutorial questions
  3. Computing 3410
  4. Last year's notes
  5. Author's home page

Copyright © Tom Worthington 2001.