How free and open should access to scholarly research be?

The issue of how free and open access to scholarly research should be, and how to make it that way, was explored on ABC Radio on August 12th:

On the eighth day God created the Internet so that eventually everyone would know everything. But mankind didn't want to share, and created new technologies to control the miracle of the Internet, and knowledge became a commodity.

Scientists are the first to rebel, and 26 000 have signed a petition. After the first of September they'll refuse to cooperate unless scientific knowledge is set free.

From: "Knowledge Indignation: Road Rage on the Information Superhighway", Background Briefing, ABC Radio National's weekly investigative documentary (August 12th 2001, produced by Stan Correy)

Petition from Public Library of Science

The petition referred to is from the "Public Library of Science". An advocacy group made up of 11 people from US-based academic institutions and one from a UK academic institution is proposing the establishment of international online public libraries of science with the complete text of all published scientific articles:

We believe that the permanent, archival record of scientific research and ideas should neither be owned nor controlled by publishers, but should belong to the public, and should be made freely available.

We support the establishment of international online public libraries of science that contain the complete text of all published scientific articles in searchable and interlinked formats.

From the "Public Library of Science" by Patrick O. Brown and Michael Eisen (undated, as at 15 August 2001).

The group claims 26144 researchers from 170 countries have signed an open letter urging publishers to allow research reports from their journals to be made publicly available. The web site for the group is maintained by Patrick O. Brown of Stanford University School of Medicine and the Howard Hughes Medical Institute, and Michael Eisen of the Lawrence Berkeley National Lab and University of California at Berkeley.

The group is focusing on the life sciences, as that is the area of interest of the originators. An interesting question is where information technology researchers stand on the issue, given their role in creating the technology used for electronic publishing.

Limit commercial journals to six months exclusive use

The group argues that commercial journals could have six months exclusive use of research reports before they are publicly available:

During this time, they would charge subscription fees for print editions and for electronic access to the articles on the journals' websites, just as they do now. It is unlikely that many subscriptions would be cancelled simply because material would be available free of charge six month later, and journals make relatively little money selling access to their archived.. Few scientists who currently subscribe to journals would want to wait six months to read about the latest results in their field. Indeed, many journals have already recognized that they have little to lose by providing free access to archival material, and have voluntarily opened their archives up to the public.

The group argues that making research available through a publisher's web site is not sufficient, and that the material needs to be freely available in a single comprehensive collection. It is not explained what a "single comprehensive collection" is, but one of the benefits claimed is that the archival literature can be searched efficiently in a single search:

... so that researchers can begin to take on the challenge of integrating and interconnecting this fantastically rich but extremely fragmented and unsystematic information, and linking it to other kinds of data, such as genome sequence data, other genomic data, structural data, etc.; so that scientists and teachers can create local online resources for graduate course or high level undergrad courses, or even pre-college courses; so that physicians, including physicians without ready access to a major medical library, can access the original evidence on which to base their "evidence-based" practice; so that scientists can apply their creativity and energy toward making this huge information resource more valuable and accessible ...

Do the articles need to be copied?

This approach assumes that the research articles need to be copied to be accessible and to be built into complex information structures. It also assumes that the articles will be in a format which can be readily manipulated.

In practice, the metadata about the articles may be more valuable than the original text of the articles. This metadata can be harvested from the individual documents, extracted from a database or accessed by interrogating a database.
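As a minimal sketch of harvesting metadata from an individual document, the Python below builds a metadata record from the front matter of a small XML article without touching the body text. The element names (`article`, `front`, `title`, `author`, `date`) and the sample content are hypothetical and are not taken from any publisher's DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; element names are illustrative only.
ARTICLE_XML = """\
<article id="a1">
  <front>
    <title>An Example Paper</title>
    <author>A. Author</author>
    <author>B. Writer</author>
    <date>2001-08-15</date>
  </front>
  <body>
    <section id="s1"><para>The full text is not needed for the metadata.</para></section>
  </body>
</article>
"""

def harvest_metadata(xml_text):
    """Return a metadata record built only from the <front> element."""
    root = ET.fromstring(xml_text)
    front = root.find("front")
    return {
        "id": root.get("id"),
        "title": front.findtext("title"),
        "authors": [a.text for a in front.findall("author")],
        "date": front.findtext("date"),
    }

record = harvest_metadata(ARTICLE_XML)
print(record)
```

The same record could equally be produced by a database query; the point of the sketch is only that the metadata can travel separately from the article text.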

The content of the articles will be of little value if they are not in a standard, machine readable structure. Being able to read a report or copy the text is of little value. What is required is a way to quote the text in context with links to the place in the article. In some cases the structure of the article may be used without the detail of text.
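The idea of quoting text in context can be sketched as follows, assuming the article is held in a structured format where every section and paragraph carries an identifier. The markup, the identifiers and the `http://example.org/` address are hypothetical. The first function returns a quotation together with a link to its place in the article; the second uses the structure of the article without the detail of the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured article; ids and titles are illustrative only.
DOC_XML = """\
<article>
  <section id="s1" title="Introduction">
    <para id="p1">Access to research should be open.</para>
  </section>
  <section id="s2" title="Method">
    <para id="p2">Metadata was harvested from each archive.</para>
    <para id="p3">Records were then merged.</para>
  </section>
</article>
"""

def quote_in_context(xml_text, base_url, para_id):
    """Return the quoted paragraph, its section title and a link to its place."""
    root = ET.fromstring(xml_text)
    for section in root.iter("section"):
        for para in section.findall("para"):
            if para.get("id") == para_id:
                return {
                    "quote": para.text,
                    "section": section.get("title"),
                    "link": f"{base_url}#{para_id}",
                }
    raise KeyError(para_id)

def outline(xml_text):
    """Return only the structure (section titles), without the text."""
    root = ET.fromstring(xml_text)
    return [s.get("title") for s in root.iter("section")]

q = quote_in_context(DOC_XML, "http://example.org/article.xml", "p2")
print(q["quote"], "->", q["link"])
print(outline(DOC_XML))
```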

If a standard format is used for the metadata and for the structure of documents, then it will be possible to make use of the content without requiring copies of the original document to be made. However, this creates further ethical, legal and commercial issues. Collections of metadata and constructs from collections of documents can be more valuable than the documents they are derived from. These collections can also be considered derived works, with a status separate from that of the source documents.

XML-based formats for expressing both the metadata and the structure of the documents provide the opportunity to use more powerful tools for manipulation of the content than were previously available.

It is feasible to consider maintaining the entire life cycle of a document using the same family of formatting standards and tools. This could start from a call for papers and continue through drafts, reviews with annotations, publication, citation and use in multimedia for conferences and training. The W3C XML standards for metadata, document structure, multimedia and annotations could be used for the document life cycle.

One interesting possibility is the use of the Annotea project for collaboration with shared annotations. This allows comments, notes, explanations and external remarks to be attached to a part of a web document without changes to the source document. Annotations are stored on a separate server and retrieved by an enhanced browser.
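The separation between a source document and its annotations can be illustrated with a small in-memory sketch. This is not the Annotea protocol or its RDF schema: the `Annotation` and `AnnotationStore` classes, their fields and the example addresses are all hypothetical, and the store merely stands in for a separate annotation server.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    target: str   # URL of the annotated document
    pointer: str  # identifies the annotated part, e.g. an element id
    author: str
    body: str

@dataclass
class AnnotationStore:
    """Stands in for a separate annotation server (in memory only)."""
    _by_target: dict = field(default_factory=dict)

    def add(self, annotation):
        # The source document is never modified; only the store changes.
        self._by_target.setdefault(annotation.target, []).append(annotation)

    def for_document(self, target):
        """What an annotation-aware browser would fetch when loading `target`."""
        return list(self._by_target.get(target, []))

store = AnnotationStore()
store.add(Annotation("http://example.org/paper.xml", "p2", "reviewer1",
                     "This claim needs a reference."))
store.add(Annotation("http://example.org/paper.xml", "p3", "reviewer2",
                     "Clarify the merge step."))

for note in store.for_document("http://example.org/paper.xml"):
    print(f"[{note.pointer}] {note.author}: {note.body}")
```

Because each annotation carries its own target and pointer, reviews and comments can accumulate around a paper without a new copy of the paper being made.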

Current Approaches to the Electronic Library for the IT Discipline

The ACM Digital Library

The Association for Computing Machinery (ACM) is a professional society that publishes research journals and magazines in computer science. It also organizes a wide variety of conferences, many of which publish proceedings. ACM is typical of the publishers that have moved rapidly into electronic publication of conventional journals. In 1993, the ACM decided that its future publication process would be a computer system that creates a database of journal articles, conference proceedings, magazines and newsletters, all marked up in SGML. Subsequently, ACM also decided to convert large numbers of its older journals and build a digital library covering its publications from 1985. The digital library will eventually extend back to ACM's foundation in 1948.

From: "Preservation of Scientific Serials: Three Current Examples" by William Y. Arms, The Journal of Electronic Publishing, December 1999, Volume 5, Issue 2, ISSN 1080-2711

The ACM collection was made available on-line in 1997. The web interface allows the contents pages of the journals to be browsed and the metadata to be searched. New content is created in SGML, and web, PDF and print versions are generated from it. The online service is by paid subscription for members, non-members and institutions, or by sales per article. The service has proved popular and ACM is considering discontinuing some print titles in the future.

ACM journals accept articles in a number of electronic formats using supplied templates. The PDF versions of documents generated are close in format to the print editions, but the HTML versions use a different format more suited to on-line viewing. For example, graphics are shown as small thumbnail versions, with links to high resolution versions. It is not clear from the literature how much manual effort is required to convert to the SGML format used.

... the current track 1 production process:

1. The paper is received from EIC, and is logged into the system.

2. The paper is converted from whatever original format into SGML (requires intervention). For mathematics, ACM requires that minimum customization be inserted into LaTeX. If you have a very clean LaTeX (with no spacing customization, etc), then conversion is pretty automatic. But macros tend not to translate cleanly. This conversion is done manually by a person inserting tags to match ACM styles. (ACM plans to move to XML in the future as very little work would be required to move the current DTD from SGML to XML).

3. The SGML is copy edited (by the managing editors). When the copy editing process begins, the editor is to send an email notification to the lead author to let them know to expect a galley in one week and that they will have 48 hours to respond to the galley. One editor (George) copy edits the hard copy directly from the author and then updates the SGML accordingly. Another editor (Roma) uses the online SGML output.

4. The reference section is created separately from the SGML file because it has to be citation-linked and the references are not tagged and kept in the same location as the article file (they go into a citation library). Citation linking is time consuming because you have to interface with the database to see if the citation already exists or is new and needs to be data entered.

5. The XyVision process composes the SGML data stream into page layout. Currently ACM employs an outside company to help with copy editing and citation linking. Proofing cannot be done by the outside company and has to be brought back in house to generate the proof (this involves a rough layout and general matching to author's original). Proof is sent to the author before any tweaking takes place. After feedback from the author, layout is tweaked to be as close as possible to the original hard copy provided by the author and the feedback (breaks, spacing).

6. Problems in layout: tables with multiple columns which have different widths (the auto-table generator makes all columns of equal widths, so these must be tweaked by hand during composition).

7. Illustrations and figures are processed separately. If received figures are in TIF or EPS, they can be electronically processed and inserted during composition. Many times, the EPS file is non-standard.

8. When composition is finished, create the postscript file and the pdf file. The postscript is used for printing, the PDF for the DL. Black and White journals are no problem because B/W is used during processing. Four color hi-res graphics must be substituted before creation of the PDF.

There is much concern about authors not getting marked-up galleys, authors not getting enough feedback about timing (especially only getting 48 hours to review galleys with no prior notification), and concerns about how the SGML conversion caused inadvertent errors to creep in to manuscripts...

From: Minutes of the Publications Board Meeting, ACM, May 5, 2000
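The citation linking in step 4 of the minutes amounts to checking whether a reference already exists in a citation library and entering it only if it is new. A minimal sketch of that check follows. It is not ACM's system: the `CitationLibrary` class and the key normalisation rule (first author's surname, year and the first four words of the title) are assumptions made for illustration.

```python
import re

def citation_key(authors, year, title):
    """Build a normalised lookup key; the normalisation rule is an assumption."""
    words = re.sub(r"[^a-z0-9 ]", "", title.lower()).split()
    first_author = authors[0].split()[-1].lower()
    return f"{first_author}:{year}:{'-'.join(words[:4])}"

class CitationLibrary:
    """In-memory stand-in for a shared citation database."""

    def __init__(self):
        self._entries = {}

    def link(self, authors, year, title):
        """Return (key, is_new): reuse an existing citation or enter a new one."""
        key = citation_key(authors, year, title)
        is_new = key not in self._entries
        if is_new:
            self._entries[key] = {"authors": authors, "year": year, "title": title}
        return key, is_new

library = CitationLibrary()
first = library.link(["William Y. Arms"], 1999,
                     "Preservation of Scientific Serials: Three Current Examples")
again = library.link(["W. Y. Arms"], 1999,
                     "Preservation of scientific serials - three current examples")
print(first)
print(again)
```

The second call cites the same paper with different capitalisation and punctuation and is matched to the existing entry. A rule this simple would also produce false matches and misses in practice, which is consistent with the minutes describing citation linking as time consuming.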

The IEEE offers a Digital Library similar in concept to ACM's, with similar guidelines for authors. Both ACM and IEEE accommodate less formal publications, for chapters and special interest groups, to be provided in electronic format as web pages. However, these are separate from the digital libraries.

E-Publishing at the Australian Computer Society

The Australian Computer Society (ACS) publishes a similar range of publications to ACM and IEEE, but on a smaller scale.

Some editions of some publications are made available free on-line in PDF or web format. However, there is no overall digital library. The ACS is now considering publishing strategies, including e-publishing.

International Efforts for The IT Digital Library

In November 1996 the British Computer Society hosted a meeting of national IT professional bodies, to which the ACS sent a representative. While coordination of electronic publishing, including reciprocal access, was discussed, no international system was established. The US-based ACM and IEEE digital libraries have instead acted as de facto international libraries.

Activities such as the Open Archives Initiative are attempting to construct a virtual library of material using distributed document archives and shared metadata:

Digital Library Federation Encourages Use of Open Archives Initiative

The Digital Library Federation (DLF) is supporting the development of a small number of Internet gateways through which users will access distributed digital library holdings as if they were part of a single uniform collection. The gateways will be built using the OAI Metadata Harvesting Protocol. DLF gateways will contribute to a practical evaluation of the OAI's harvesting technique and its application within libraries to encourage digital collection managers to expose metadata and build services.

From: Open Archives Initiative, 2001
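A harvesting gateway of this kind can be sketched in a few lines. The example below assumes the conventions of the later OAI-PMH version 2.0 (the `ListRecords` verb, the `oai_dc` metadata prefix and the associated XML namespaces) rather than the 2001 version of the protocol. The repository address is hypothetical, and the response is a hand-written, abbreviated sample so that the code runs without a network connection.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://archive.example.org/oai"  # hypothetical repository

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the request a harvester would send to one archive."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Abbreviated sample response; a real one carries further elements.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Paper</dc:title>
          <dc:creator>A. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

def parse_records(xml_text):
    """Extract identifier, title and creators from each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "creators": [c.text for c in rec.iterfind(".//dc:creator", NS)],
        })
    return records

print(list_records_url(BASE_URL))
records = parse_records(SAMPLE_RESPONSE)
print(records)
```

A gateway would repeat this against each participating archive and merge the records, so that users search one index while the documents themselves stay where they were published.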

Organisations now considering electronic publication strategies can consider an integrated approach using newer XML tools to create and maintain content. The ACS has a tradition of providing the content of its journal free for non-profit use. This could be extended into an electronic edition in a format suitable for direct citation and annotation, with metadata in a format suitable for harvesting by specialised virtual library tools as well as traditional web search engines. The content could also be available for use in multimedia conference and training formats.
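One way metadata could be exposed to traditional web search engines is to embed it in the web version of each article. The sketch below renders a metadata record as HTML `meta` elements using the common "DC." naming convention for Dublin Core. The record contents and the `dc_meta_tags` function are hypothetical, and this is only one possible encoding.

```python
from html import escape

def dc_meta_tags(record):
    """Render a metadata record as Dublin Core style HTML <meta> elements."""
    tags = []
    for name, value in record.items():
        # A field may hold one value or a list of values (e.g. several creators).
        values = value if isinstance(value, list) else [value]
        for v in values:
            tags.append(f'<meta name="DC.{escape(name)}" content="{escape(v)}">')
    return "\n".join(tags)

# Hypothetical record for one article.
record = {
    "title": "Research & Access",
    "creator": ["A. Author", "B. Writer"],
    "date": "2001-08-15",
}

tags = dc_meta_tags(record)
print(tags)
```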

