Perma Tools
Experiments and Tailored Tooling for
High Fidelity Web Archives
Over almost ten years of work with legal scholars and US courts to preserve their citations, the Perma.cc team has learned a lot about web archives that are authoritative, specific, and information-rich. The concerns of our user base give us a distinct perspective on how web archives might function at every step of the pipeline: from capture, to storage, to playback. This perspective has prompted us to independently develop the base tools and libraries used for web archiving. In tandem with the larger web archiving community, we are also interested in creating tools that support emerging standards like the .wacz file format.
This page outlines the production tools and experimental projects we've built from those explorations, for use by our team and the field at large. They range from quick experiments spun up in a matter of weeks to longer-term work that powers our overarching Perma.cc project.
Each is a response to a question with implications beyond Perma.cc; each is built for a broader context while also addressing the core concerns of our user base: Fidelity, Specificity, and Provenance.
More broadly, we are interested in decentralizing the means of web archiving, which has traditionally (by necessity) been a centralized and packaged process. Facilitated by new browser technology, these tools broaden the resources available to folks interested in building and deploying their own solutions for collecting the web.
All of our code is open source and can be accessed on GitHub, and we can be reached at [email protected] if you have any questions!
Capture Tools
The Scoop Capture Engine
Because we wondered:
What would a browser-based capture engine look like if its main goal was to create evidence for an article, court case, or fact check?
Scoop is a high fidelity, browser-based web archiving capture engine from the Harvard Library Innovation Lab.
Fine-tune this custom web capture software to create robust single-page captures of the internet with accurate and complete provenance information.
With extensive options for asset formats and inclusions, Scoop will create .warc, .warc.gz, or .wacz files to be stored by users and replayed using the web archive replay software of their choosing.
Scoop also comes with built-in support for the WACZ Signing and Verification specification, allowing users to cryptographically sign their captures.
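As a rough illustration, here is what a minimal programmatic capture looks like with Scoop's Node.js API, based on the project's README at the time of writing; exact names and defaults may differ between releases.

```javascript
// Minimal sketch of a programmatic Scoop capture, based on the project's
// documented Node.js API; details may vary between releases.
import fs from "fs/promises";
import { Scoop } from "@harvard-lil/scoop";

const capture = await Scoop.capture("https://lil.law.harvard.edu");
const wacz = await capture.toWACZ(); // capture.toWARC() is also available
await fs.writeFile("archive.wacz", Buffer.from(wacz));
```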
Storage, Hosting and Playback
The wacz-exhibitor hosting boilerplate
Because we wondered:
How could browser-based playback of web archives shift how people access collections of preserved web content?
Experimental proxy and wrapper boilerplate for safely and efficiently embedding web archives (.warc, .warc.gz, .wacz) into web pages.
This implementation:
- Wraps Webrecorder's <replay-web-page> client-side playback technology.
- Serves, proxies and caches web archive files using NGINX.
- Allows for two-way communication between the embedding website and the embedded archive using post messages.
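To make the two-way communication point concrete, here is a hedged sketch of embedding a wacz-exhibitor deployment and exchanging post messages with it. The host name, URL parameters, and message shapes below are hypothetical, not the actual wacz-exhibitor conventions; consult the repository for the real ones.

```html
<!-- Hypothetical embed: the host and URL parameters are illustrative,
     not the actual wacz-exhibitor URL scheme. -->
<iframe id="archive"
        src="https://wacz-exhibitor.example.com/?source=example.wacz#url=https://example.com/">
</iframe>

<script>
  const frame = document.getElementById("archive");
  const origin = "https://wacz-exhibitor.example.com";

  // Inbound: listen for messages relayed by the embedded archive.
  window.addEventListener("message", (event) => {
    if (event.origin !== origin) return; // enforce the trust boundary
    console.log("From embedded archive:", event.data);
  });

  // Outbound: send a message once the frame has loaded.
  // The accepted message shape is deployment-specific (hypothetical here).
  frame.addEventListener("load", () => {
    frame.contentWindow.postMessage({ updateUrl: "https://example.com/" }, origin);
  });
</script>
```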
Support for Emerging Standards
wacz-signing library
Because we wondered:
Given that there is now a recommended way to “sign” a web archive to attest to its authenticity, what would that process look like and how efficient could it be?
wacz-signing is a library for signing and timestamping file hashes. It builds on work by Ilya Kreymer and Webrecorder in authsign, and is intended for use in WACZ signing (and, to a lesser extent, verification), as set forth in the Webrecorder Recommendation WACZ Signing and Verification, which our director Jack Cushman contributed to.
It is an attempt to reduce authsign's footprint and to decouple signing from any specific web API, authentication scheme, and process for obtaining key material. It also omits the optional cross-signing mechanism specified in the recommendation and provided by authsign.
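For intuition, here is a conceptual sketch of the underlying idea, signing a file hash, using only Node's built-in crypto module. This is not the wacz-signing API itself, and "key.pem" stands in for key material obtained elsewhere.

```javascript
// Conceptual sketch of signing a file hash with Node's built-in crypto
// module. This illustrates the idea; it is NOT the wacz-signing API.
import { createHash, createSign } from "node:crypto";
import { readFileSync } from "node:fs";

// Hash the WACZ file.
const digest = createHash("sha256").update(readFileSync("archive.wacz")).digest("hex");

// Sign the hash. "key.pem" is a placeholder for key material obtained elsewhere.
const signer = createSign("SHA256");
signer.update(digest);
const signature = signer.sign(readFileSync("key.pem"), "base64");

console.log({ digest, signature });
```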
js-wacz library and CLI
Because we wondered:
How can we facilitate the creation of .wacz files and expand the WACZ ecosystem?
js-wacz is a JavaScript module and CLI tool for working with web archive data using the WACZ format specification, similar to Webrecorder's py-wacz.
It can be used to combine a set of .warc / .warc.gz files into a single .wacz file, programmatically (Node.js) or from the command line.
js-wacz makes use of workers to process as many WARC files in parallel as the host machine can handle.
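Based on the js-wacz README at the time of writing (the API may change between releases), programmatic use looks roughly like the sketch below; the CLI equivalent is along the lines of js-wacz create -f "collection/*.warc.gz" -o "collection.wacz".

```javascript
// Rough sketch based on the js-wacz README; the API may change between releases.
import { WACZ } from "@harvard-lil/js-wacz";

const archive = new WACZ({
  input: "collection/*.warc.gz", // glob of .warc / .warc.gz files to combine
  output: "collection.wacz",
});

await archive.process(); // WARCs are processed in parallel by workers
```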
wacz-preparator library and CLI
Because we wondered:
Would streamlining the portability of existing web archive collections enhance the community's ability to experiment with emerging playback technology?
CLI and JavaScript library for packaging a remote web archive collection into a single WACZ file.
This pipeline was originally developed in the context of the Library Innovation Lab's partnership with the Radcliffe Institute's Schlesinger Library on experimental access to web archives.
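Purely as an illustration of the workflow, usage might look something like the sketch below. The import, class, and option names are hypothetical, not the actual wacz-preparator API; see the repository's README for real usage.

```javascript
// Hypothetical sketch only: these names are illustrative and are NOT the
// actual wacz-preparator API. See the repository's README for real usage.
import { Preparator } from "@harvard-lil/wacz-preparator"; // hypothetical export

const preparator = new Preparator({
  collection: "https://archive.example.org/collections/123", // remote collection to package
  output: "collection.wacz", // single portable WACZ file
});

await preparator.process(); // hypothetical method name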
Experiments
WARC-GPT
Because we wondered:
Can the techniques used to ground and augment the responses provided by Large Language Models be used to help explore web archive collections?
This is the question we’ve asked ourselves while exploring how artificial intelligence changes our relationship to knowledge, which led us to develop and release WARC-GPT: an experimental open-source retrieval-augmented generation tool for exploring collections of WARC files using AI.
See also: WARC-GPT Case Study
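As a toy illustration of the retrieval step that underpins retrieval-augmented generation: relevant passages are found by vector similarity and prepended to the model's prompt. WARC-GPT itself is far more sophisticated, and the embed() function below is a crude stand-in, not a real embedding model.

```javascript
// Toy illustration of the retrieval step behind retrieval-augmented
// generation. embed() is a crude stand-in for a real embedding model.
function embed(text) {
  const vector = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) vector[i] += 1; // letter-frequency "embedding"
  }
  return vector;
}

function cosine(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b) || 1);
}

// Text chunks extracted from a WARC collection (placeholders here).
const chunks = ["archived page about web preservation", "unrelated archived record"];
const index = chunks.map((text) => ({ text, vector: embed(text) }));

// Retrieve the most similar chunk and prepend it to the LLM prompt,
// grounding the model's answer in the archive's actual contents.
const question = "what does the archive say about preservation?";
const query = embed(question);
const context = index
  .sort((a, b) => cosine(b.vector, query) - cosine(a.vector, query))
  .slice(0, 1)
  .map((entry) => entry.text)
  .join("\n");

console.log(`Context:\n${context}\n\nQuestion: ${question}`);
```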
Save Your Threads - thread-keeper
Because we wondered:
Is it possible to do anything about all of that Twitter content that was disappearing and being faked when a certain someone took over the company?
Thread Keeper is a tool to create high fidelity captures of Twitter threads as sealed PDFs. Here's an example PDF we made from this tweet.
There are lots of screenshots of Twitter threads going around. Some are real, some are fake. You can't tell who made them, or when they were made.
PDFs let us apply document signatures and timestamps, so anyone can check in the future that a PDF downloaded from this site really came from the Harvard Library Innovation Lab and hasn't been edited.
PDFs also let us bundle additional media as attachments. Each signed PDF currently includes all images on the page (so you can see full-size versions of images that are cropped in the PDF view), the primary video on the page if any, and a list of all the t.co links in the thread along with their actual destinations.
See also: social.perma.cc
News
WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI
lil.law.harvard.edu - February 12, 2024
An update on the WACZ format
webrecorder.net - May 5, 2023
Witnessing the web is hard: Why and how we built the Scoop web archiving capture engine
lil.law.harvard.edu - April 13, 2023
New Release: High Fidelity Capture Engine for Witnessing the Web 🍨
blogs.harvard.edu/perma - March 28, 2023
IIPC Technical Speaker Series: Archiving Twitter
lil.law.harvard.edu - January 18, 2023
Use Social.Perma.cc To Preserve Twitter Threads!
blogs.harvard.edu/perma - December 22, 2022
Web Archiving: Opportunities and Challenges of Client-Side Playback
lil.law.harvard.edu - September 15, 2022