Web Archiving: The process of collecting and storing websites and web pages for long-term access and preservation

Web archiving is the process of collecting and storing websites and web pages for long-term access and preservation. The goal is to capture and retain the content of websites, including text, images, multimedia, and other elements, as they appear at a specific point in time. Web archiving is essential for preserving digital information, especially considering the dynamic and ever-changing nature of the internet.

Key aspects and methods of web archiving, as they relate to the CLOCKSS archive:

Crawling and Capturing: Web archiving involves crawling websites and capturing their content. This process uses automated tools, known as web crawlers or spiders, to navigate through the structure of websites and download the relevant files. These files include HTML pages, images, stylesheets, scripts, and other resources.
Frequency of Capture: Web archiving can be performed periodically to capture changes over time. Some archives capture websites on a regular schedule, while others may focus on capturing specific events, such as elections, product launches, or breaking news.
Seed URLs and Selective Crawling: Archivists typically identify seed URLs, which are the initial web addresses used to start the crawling process. Selective crawling may also be employed to focus on specific parts of a website or to exclude certain content.
National and Institutional Web Archives: Various national libraries, archives, and institutions worldwide operate web archiving initiatives. Examples include the Internet Archive, the Library of Congress Web Archive, and national archives in different countries. These archives play a crucial role in preserving the cultural, historical, and scientific record of the web.
Legal and Ethical Considerations: The CLOCKSS archive respects intellectual property rights and crawls websites only with permission. We have signed agreements with publishers, and we also ask them to add specific permission language to the websites themselves.
Dynamic Content Challenges: Archiving dynamic content, such as JavaScript-based interactivity and social media interactions, poses challenges. As websites increasingly rely on dynamic elements, web archivists continually work to improve methods for capturing and preserving these aspects.
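
The crawling, seed URL, and selective crawling ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the crawler used by CLOCKSS or the LOCKSS software: the names crawl, in_scope, and LinkExtractor, the caller-supplied fetch function, and the prefix and substring scope rules are all hypothetical choices made for the example. The sketch also treats every resource as text and does no network access of its own.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect href and src attribute values found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def in_scope(url, allowed_prefixes, excluded_substrings=()):
    """Selective crawling rule (an assumption for this sketch):
    keep URLs under an allowed prefix unless they contain an excluded substring."""
    if not any(url.startswith(prefix) for prefix in allowed_prefixes):
        return False
    return not any(part in url for part in excluded_substrings)


def crawl(seeds, fetch, allowed_prefixes, excluded_substrings=(), max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch is a caller-supplied function mapping a URL to its text,
    or to None when the resource cannot be retrieved.
    Returns a dict of {url: captured text}.
    """
    queue = deque(seeds)
    seen = set(seeds)
    captured = {}
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        captured[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            absolute, _fragment = urldefrag(urljoin(url, link))
            if absolute in seen:
                continue
            if in_scope(absolute, allowed_prefixes, excluded_substrings):
                seen.add(absolute)
                queue.append(absolute)
    return captured
```

In use, an archivist would pass the seed URLs as the starting list, a prefix such as the journal's base address as the allowed scope, and any patterns to skip (for example search result pages) as exclusions. Injecting fetch keeps the sketch testable against an in-memory dictionary. A real crawler would add polite rate limiting, permission checks, and storage of the captured files.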

Web archiving is a crucial practice in the digital age, ensuring that valuable online content is preserved for future generations and facilitating research on the evolution of the web and its impact on society.

About CLOCKSS

CLOCKSS is a community-led collaboration of academic publishers and research libraries around the world, working together to provide a sustainable online archive. Together we ensure the long-term survival of our shared intellectual heritage.

At CLOCKSS, our services for libraries and publishers are built on the award-winning LOCKSS software. Through them, we instil confidence in authors, scholars, policy makers, libraries, and publishers that scholarship is safely and securely preserved for future generations.

As global leaders in digital preservation, we ensure that all books, journals, and digital collections entrusted to CLOCKSS are protected and preserved indefinitely.

Find out more about our services and how you can become part of the CLOCKSS community.
