How CLOCKSS Works
As libraries and publishers migrate from print to online-only publications, they want to know that their shared investments are protected and preserved for generations to come. As a "dark archive" currently housing 46 million journal articles, over 25,000 serial and 260,000 book titles, and a wide array of supplementary materials and metadata, CLOCKSS preserves this growing corpus of online scholarly content. This unique service assures publishers and libraries that the content they steward will withstand potential technological, economic, environmental, and political disruptions and failures, and that it will remain available to those who want to access it once a trigger event has occurred.
Built with proven LOCKSS open-source technology, CLOCKSS preserves scholarly publications in original formats. The polling-and-repair mechanism ensures the long-term validity of the data.
Mirror repository sites at 12 major academic institutions around the world guarantee long-term preservation and access. Our approach is resilient to threats from potential technological, economic, environmental and political failures.
The following is a step-by-step overview of how CLOCKSS works.
The publisher provides the CLOCKSS system access to either presentation or source files of the content. Presentation files are the HTML pages that are normally displayed to the readers of the content. Source files are minimally formatted content used internally by the publisher.
To allow CLOCKSS crawlers to access the publisher's presentation files, the publisher needs to add to its website a CLOCKSS-provided permission statement that will tell the crawlers what content is available for collection.
To allow CLOCKSS access to the publisher's source files, the publisher needs to place them on a designated FTP site.
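As a rough illustration of the permission step for presentation files, a crawler could look for the permission statement on the publisher's page before collecting anything. The sketch below is only an assumption about how such a check might look: the function name `page_grants_permission` and the placeholder statement text are hypothetical, and the real wording is the one CLOCKSS provides to the publisher.

```python
def page_grants_permission(page_html: str, statement: str) -> bool:
    """Return True if the permission statement appears in the page text."""
    # Collapse whitespace and ignore case so that line wrapping in the
    # HTML source does not hide the statement.
    def normalize(text: str) -> str:
        return " ".join(text.split()).lower()

    return normalize(statement) in normalize(page_html)


# Hypothetical placeholder; the actual statement is supplied by CLOCKSS.
PERMISSION_STATEMENT = "EXAMPLE ARCHIVE PERMISSION STATEMENT"

page = "<html><body><p>Example  archive\npermission statement</p></body></html>"
if page_grants_permission(page, PERMISSION_STATEMENT):
    print("permission found; content may be collected")
else:
    print("no permission statement; skip this site")
```

A plain substring match is the simplest possible check; a production crawler would likely be stricter about where on the site the statement must appear.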
Special CLOCKSS boxes located at Rice, Indiana, and Stanford Universities ingest the content the publisher made available.
The content in each of these CLOCKSS boxes goes through a verification process to confirm that all boxes hold identical versions of it. This establishes the authoritative version of the content.
The majority of the CLOCKSS boxes are preservation machines, performing the main storage and audit functions. Once the content on the ingest machines has been validated, the preservation CLOCKSS boxes collect it from them.
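One simple way to picture this verification is to compare a cryptographic digest of each box's copy. The sketch below is illustrative only: the use of SHA-256, the function names, and the box labels are assumptions, not a description of the actual CLOCKSS implementation.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of one box's copy of the content."""
    return hashlib.sha256(content).hexdigest()


def ingest_copies_agree(copies: dict[str, bytes]) -> bool:
    """True when every box holds byte-identical content.

    An empty mapping yields False, since there is nothing to agree on.
    """
    return len({digest(data) for data in copies.values()}) == 1


# Hypothetical box labels and content, for illustration only.
copies = {
    "ingest-1": b"<article>Volume 1, Issue 1</article>",
    "ingest-2": b"<article>Volume 1, Issue 1</article>",
    "ingest-3": b"<article>Volume 1, Issue 1</article>",
}
print(ingest_copies_agree(copies))
```

Comparing digests rather than full files means the boxes only need to exchange short strings to learn whether their copies match.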
The content is then preserved through a system of audit and repair. The CLOCKSS boxes continually communicate over the Internet to audit the content they are preserving. If the content in one CLOCKSS box is damaged or incomplete, that CLOCKSS box will receive repairs of the content based on other CLOCKSS boxes' holdings and/or by referring to the publisher's original presentation files. This cooperation between the CLOCKSS boxes avoids the need to back them up individually. It also provides unambiguous reassurance that the system is performing its function and that the correct content is always available.
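The audit-and-repair idea can be sketched as a majority vote over content digests, with any dissenting box overwritten from a copy that agrees with the majority. This is a deliberately simplified model: the actual LOCKSS polling protocol is more elaborate, and the function name, the choice of SHA-256, and the box labels here are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def audit_and_repair(boxes: dict[str, bytes]) -> list[str]:
    """Repair boxes whose copy disagrees with the majority.

    Returns the names of the boxes that were repaired.
    """
    if not boxes:
        return []
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in boxes.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    # Without a strict majority there is no safe way to pick the good copy.
    if votes <= len(boxes) // 2:
        raise RuntimeError("no majority; cannot decide which copy is correct")
    good_copy = next(data for name, data in boxes.items() if digests[name] == winner)
    repaired = []
    for name, box_digest in digests.items():
        if box_digest != winner:
            boxes[name] = good_copy  # overwrite the damaged copy
            repaired.append(name)
    return repaired


# Hypothetical box labels; "box-c" holds a damaged copy.
boxes = {"box-a": b"article text", "box-b": b"article text", "box-c": b"artic"}
print(audit_and_repair(boxes))
```

Because each box can be corrected from its peers, no single box needs its own separate backup, which mirrors the cooperation described above.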
When a trigger event occurs and the CLOCKSS Board decides to release the content from the CLOCKSS Archive, two things happen:
- Content is automatically migrated to the newest format.
- Content is copied from the CLOCKSS boxes to a publicly available web server at a CLOCKSS host organization (currently the EDINA Data Center, University of Edinburgh, and Stanford University).
The released content is now freely available from Stanford University and EDINA at the University of Edinburgh. It is also directly available via OpenURLs through Crossref, or through either of the following:
- local library link resolvers
- this list