In the digital age, preserving the scholarly record is a strategic imperative. CLOCKSS stands at the forefront of this mission, offering a resilient, community-governed archive that protects content even when publishers or platforms cease to exist. Behind its preservation goals lies a sophisticated system architecture built for scale, flexibility, and long-term access.
CLOCKSS operates on a two-tier network structure designed to separate preservation from content acquisition and access. A 12-node vault network handles preservation, while a 5-node harvest network manages ingestion. This architecture supports multiple ingestion pathways, including web harvesting, file transfer via FTP or SFTP, and even delivery on physical media. Given the wide range of publisher platforms and file formats involved, CLOCKSS relies on custom-developed plugins to manage ingestion and normalization workflows across diverse systems.
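To make the role of plugins more concrete, here is a minimal Python sketch of how per-publisher plugins might be selected by ingestion pathway. It is an illustration only: the names `Delivery`, `PLUGINS`, `register`, and `ingest`, and the publisher `example-press`, are hypothetical and do not reflect the actual LOCKSS plugin framework or any CLOCKSS code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Delivery:
    """One unit of incoming content (hypothetical structure)."""
    publisher: str
    pathway: str  # e.g. "web", "sftp", "media"
    payload: bytes


# Registry mapping (publisher, pathway) to a plugin function.
# This registry is an assumption made for illustration.
PLUGINS: Dict[Tuple[str, str], Callable[[Delivery], dict]] = {}


def register(publisher: str, pathway: str):
    """Decorator that registers a plugin for one publisher and pathway."""
    def decorator(fn: Callable[[Delivery], dict]):
        PLUGINS[(publisher, pathway)] = fn
        return fn
    return decorator


@register("example-press", "sftp")
def example_press_sftp(delivery: Delivery) -> dict:
    # A real plugin would unpack and normalize publisher-specific files;
    # this stand-in only records basic facts about the delivery.
    return {
        "publisher": delivery.publisher,
        "source": delivery.pathway,
        "size": len(delivery.payload),
    }


def ingest(delivery: Delivery) -> dict:
    """Route a delivery to the plugin registered for it."""
    key = (delivery.publisher, delivery.pathway)
    if key not in PLUGINS:
        raise LookupError(f"no plugin registered for {key}")
    return PLUGINS[key](delivery)
```

Calling `ingest(Delivery("example-press", "sftp", b"abc"))` routes the delivery to the registered plugin, while an unregistered publisher or pathway raises `LookupError`, which mirrors the point that each new platform or format needs its own plugin.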
Much of the content preserved in CLOCKSS is not web native. This means it cannot be automatically harvested or normalized and often requires manual preprocessing and staging. Instead of handling all normalization and cleanup up front, CLOCKSS follows a deferred effort model. This approach defers intensive processing until content is triggered for release, typically when the original content becomes unavailable. While this strategy helps scale the preservation effort by spreading out the workload, it creates operational debt over time. Trigger events often require significant remediation, including resolving metadata gaps, format inconsistencies, and missing or malformed files. Preparing content for access in these scenarios can become a labour-intensive and time-sensitive process.
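The deferred effort model can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the CLOCKSS implementation: the class `DeferredArchive`, its methods, and the required metadata fields `title` and `doi` are all hypothetical. The idea shown is that ingest does only cheap work (storing the bytes and a checksum), while the costly checks run when content is triggered.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class StoredItem:
    raw: bytes
    sha256: str
    metadata: dict = field(default_factory=dict)


class DeferredArchive:
    """Hypothetical archive that postpones remediation until trigger time."""

    def __init__(self) -> None:
        self._items: Dict[str, StoredItem] = {}

    def ingest(self, item_id: str, raw: bytes,
               metadata: Optional[dict] = None) -> None:
        # Cheap step only: keep the bytes as delivered plus a checksum.
        digest = hashlib.sha256(raw).hexdigest()
        self._items[item_id] = StoredItem(raw, digest, dict(metadata or {}))

    def trigger(self, item_id: str,
                required_fields: Sequence[str] = ("title", "doi")) -> List[str]:
        """Run the deferred checks and return the issues needing remediation."""
        item = self._items[item_id]
        issues: List[str] = []
        if hashlib.sha256(item.raw).hexdigest() != item.sha256:
            issues.append("checksum mismatch")
        if not item.raw:
            issues.append("empty file")
        for name in required_fields:
            if not item.metadata.get(name):
                issues.append(f"missing metadata: {name}")
        return issues
```

In this sketch, an item ingested with only a title is accepted without complaint, and the missing `doi` surfaces only when `trigger` is called. That is the trade-off described above: ingest stays fast, but the remediation work accumulates until a trigger event.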
CLOCKSS relies on the LOCKSS (Lots of Copies Keep Stuff Safe) software but has more complex business requirements than other preservation services using the same open-source software. Until recently, CLOCKSS was operating on the original LOCKSS 1.0 architecture. That is now changing with the transition to LOCKSS 2.0, a major upgrade designed to modernize the system and prepare it for the evolving demands of digital preservation.
LOCKSS 2.0 introduces a modular architecture based on microservices with clearly defined APIs. This separation of core functions, such as repository management, auditing, and content replay, makes the system more scalable, easier to maintain, and better suited for future integration with external systems. One of the most significant improvements is the support for direct deposit. This allows content to be ingested without relying on web crawling or manual staging, streamlining the entire ingestion process.
The platform also introduces a multi-crawler and replay framework that supports modern replay engines like Pywb and OpenWayback. These tools are better suited for rendering dynamic and JavaScript-driven web content, which is increasingly common in today’s digital publishing landscape. Another critical advancement is the move from a legacy flat-file system to database-backed operations, enabling better performance, scalability, and reliability, especially when managing large volumes of content.
LOCKSS 2.0 brings concrete benefits: ingest will be faster and more efficient, triggered content will be delivered with an improved user experience, and enhanced automation will improve interoperability with library and publisher systems.
The close collaboration between the LOCKSS team and CLOCKSS has enabled a seamless exchange of expertise, enhancing our ability to provide long-term digital preservation while managing the diverse and often complex needs of publishers, libraries, and institutions.
This modernization of the LOCKSS open-source digital preservation software is a strategic shift. It provides the flexibility needed to keep up with evolving formats, platforms, and user expectations. By investing in this scalable architecture and in adaptive workflows, CLOCKSS continues to fulfil its mission to preserve the scholarly record with integrity, and to support the community of services and users built around our shared LOCKSS software.
