Perma Tools

Experiments and Tailored Tooling for
High Fidelity Web Archives

Perma Tools is built on more than a decade of experience developing Perma.cc. Through collaborating with academic libraries, courts, and legal scholars to preserve citations, the Perma team has gained deep expertise in web archives that are authoritative, specific, and information-rich. The concerns of our user base give us a distinct perspective on how web archives might function at every step of the pipeline: from capture, to storage, to playback. This perspective has prompted us to independently develop base tools and libraries for web archiving. Additionally, in tandem with the larger web archiving community, we are interested in creating tools that support emerging standards like the .wacz file format.

This page outlines the production tools and experimental projects that grew out of our explorations, built for use by our team and the field at large. These range from quick experiments that we spun up in a matter of weeks to longer-term work used by our overarching Perma.cc project.

Each is a response to a question that has implications beyond Perma.cc and is built for a broader context, but each also addresses the core concerns of our user base: fidelity, specificity, and provenance.

More broadly, we are interested in decentralizing the means of web archiving, which has traditionally (by necessity) been a centralized and packaged process. Facilitated by new browser technology, these tools broaden the resources available to folks who are interested in building and deploying their own solutions for collecting the web.

All of our code is open source and can be accessed on GitHub, and we can be reached at [email protected] if you have any questions!


Capture Tools

The Scoop Capture Engine

Because we wondered:

What would a browser-based capture engine look like if its main goal was to create evidence for an article, court case, or fact check?

Scoop is a high fidelity, browser-based, web archiving capture engine from the Harvard Library Innovation Lab.

Fine-tune this custom web capture software to create robust single-page captures of the internet with accurate and complete provenance information.

With extensive options for asset formats and inclusions, Scoop will create .warc, .warc.gz, or .wacz files to be stored by users and replayed using the web archive replay software of their choosing.

Scoop also comes with built-in support for the WACZ Signing and Verification specification, allowing users to cryptographically sign their captures.

Scoop on GitHub

Processing Tools

The WARCbench command line tool

Because we wondered:

How might we design a resilient, efficient, and highly configurable tool for working with WARC files in all their variety, letting researchers explore without prior knowledge of the format?

WARCbench is a tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.

WARCbench was designed to make as few assumptions as possible about your familiarity with web archives, the kind of WARC you are working with, or what you want to do with it.

The goal is not to hide the complexity of web archives. It is to make that complexity easier to inspect, manipulate, and learn from so you can experiment and iterate.
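To give a sense of the complexity involved, here is a minimal sketch of how the records in an uncompressed WARC file can be walked using only the Python standard library. It is an illustration of the format, not WARCbench's code or interface: the helper names `iter_warc_records` and `build_record` are hypothetical, and the sample records are deliberately incomplete (real records also carry fields such as a record ID and a date).

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        # Skip the blank lines that separate one record from the next.
        while version in (b"\r\n", b"\n"):
            version = stream.readline()
        if not version:
            return  # End of stream.
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected record start: {version!r}")

        # Header lines are "Name: value" pairs, ended by an empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # Content-Length says exactly how many payload bytes follow.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def build_record(record_type, target_uri, body):
    """Build a simplified WARC record for demonstration (hypothetical helper)."""
    head = (
        "WARC/1.1\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = (
    build_record("resource", "https://example.com/", b"hello")
    + build_record("resource", "https://example.com/about", b"about page")
)

records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Real-world files add compression, truncated or malformed records, and many record types, which is where a tolerant, configurable tool becomes useful.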

WARCbench on GitHub

See also: Blog Post

Storage, Hosting and Playback

The wacz-exhibitor hosting boilerplate

Because we wondered:

How could browser-based playback of web archives shift how people access collections of preserved web content?

Experimental proxy and wrapper boilerplate for safely and efficiently embedding Web Archives (.warc, .warc.gz, .wacz) into web pages.

This implementation:

  • Wraps Webrecorder's <replay-web-page> client-side playback technology.
  • Serves, proxies and caches web archive files using NGINX.
  • Allows for two-way communication between the embedding website and the embedded archive using post messages.

wacz-exhibitor on GitHub

See also: Live Demo, Blog post

Support for Emerging Standards

wacz-signing library

Because we wondered:

Given that there is now a recommended way to “sign” a web archive to attest to its authenticity, what would that process look like and how efficient could it be?

This is a library for signing and timestamping file hashes. This package builds on work by Ilya Kreymer and Webrecorder in authsign. It is intended for use in WACZ signing (and to a lesser extent, verification), as set forth in the Webrecorder Recommendation WACZ Signing and Verification, which our director Jack Cushman contributed to.

It is an attempt to reduce authsign's footprint, and decouple signing from any specific web API, authentication, and the process of obtaining key material. It also omits the optional cross-signing mechanism specified in the recommendation and provided by authsign.
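The idea of signing a file hash, rather than the archive itself, can be sketched as follows. This is not the wacz-signing API: the function names and the `hash` / `created` field names are assumptions made for illustration, and the actual signature and trusted timestamp, which need key material and a timestamping service, are left out.

```python
import hashlib
from datetime import datetime, timezone


def digest(data: bytes) -> str:
    """Return a prefixed SHA-256 digest of the given bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def build_signing_input(manifest_bytes: bytes) -> dict:
    """Assemble what a signer would attest to (hypothetical structure)."""
    return {
        "hash": digest(manifest_bytes),
        "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def matches(manifest_bytes: bytes, signing_input: dict) -> bool:
    """Check that a manifest still corresponds to the hash that was signed."""
    return digest(manifest_bytes) == signing_input["hash"]


manifest = b'{"resources": [{"path": "archive/capture.warc"}]}'
signing_input = build_signing_input(manifest)

print(matches(manifest, signing_input))            # unchanged manifest
print(matches(manifest + b" ", signing_input))     # altered manifest
```

Because only a short digest is signed, the signing step stays small no matter how large the archive is, and any later change to the manifest shows up as a mismatch.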

wacz-signing on GitHub

js-wacz library and CLI

Because we wondered:

How can we facilitate the creation of .wacz files and expand the WACZ ecosystem?

js-wacz is a JavaScript module and CLI tool for working with web archive data using the WACZ format specification, similar to Webrecorder's py-wacz.

It can be used to combine a set of .warc / .warc.gz files into a single .wacz file programmatically (Node.js) or in the command line.

js-wacz makes use of workers to process as many WARC files in parallel as the host machine can handle.
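For readers unfamiliar with the format, a .wacz file is a ZIP package that bundles WARC files with a page list and a manifest. The sketch below shows that general layout using the Python standard library. It is not js-wacz's API and does not produce a complete WACZ: it omits the index files a replay tool needs, runs sequentially rather than in parallel, and the function name `package_wacz`, the page fields, and the version string are assumptions made for illustration.

```python
import hashlib
import io
import json
import zipfile


def package_wacz(warcs, pages):
    """Return the bytes of a ZIP laid out like a simplified WACZ file.

    warcs: dict mapping a file name to that WARC's bytes.
    pages: list of dicts describing pages, e.g. {"url": ..., "title": ...}.
    """
    buffer = io.BytesIO()
    resources = []

    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:

        def add(path, data):
            # Write the entry and record its size and digest for the manifest.
            archive.writestr(path, data)
            resources.append({
                "name": path.rsplit("/", 1)[-1],
                "path": path,
                "hash": "sha256:" + hashlib.sha256(data).hexdigest(),
                "bytes": len(data),
            })

        for name, data in warcs.items():
            add(f"archive/{name}", data)

        # One JSON object per line: a header line, then one line per page.
        lines = [json.dumps({"format": "json-pages-1.0", "id": "pages", "title": "All Pages"})]
        lines += [json.dumps(page) for page in pages]
        add("pages/pages.jsonl", ("\n".join(lines) + "\n").encode("utf-8"))

        # The manifest lists every packaged resource.
        manifest = {
            "profile": "data-package",
            "wacz_version": "1.1.1",  # assumed version string
            "resources": resources,
        }
        archive.writestr("datapackage.json", json.dumps(manifest, indent=2))

    return buffer.getvalue()


wacz_bytes = package_wacz(
    {"capture.warc": b"placeholder WARC bytes"},
    [{"url": "https://example.com/", "title": "Example"}],
)

with zipfile.ZipFile(io.BytesIO(wacz_bytes)) as package:
    names = package.namelist()
    manifest = json.loads(package.read("datapackage.json"))

print(names)
```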

js-wacz on GitHub

wacz-preparator library and CLI

Because we wondered:

Would streamlining the portability of existing web archive collections enhance the community's ability to experiment with emerging playback technology?

CLI and JavaScript library for packaging a remote web archive collection into a single WACZ file.

This pipeline was originally developed in the context of The Library Innovation Lab's partnership with the Radcliffe Institute's Schlesinger Library on experimental access to web archives.

wacz-preparator on GitHub

Experiments

WARC-GPT

Because we wondered:

Can the techniques used to ground and augment the responses provided by Large Language Models be used to help explore web archive collections?

This is the question we’ve asked ourselves while exploring how artificial intelligence changes our relationship to knowledge, which led us to develop and release WARC-GPT: an experimental open-source retrieval-augmented generation tool for exploring collections of WARC files using AI.
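The retrieval half of retrieval-augmented generation can be illustrated with a toy example. The sketch below is not WARC-GPT's pipeline: it assumes text has already been extracted from archived pages, substitutes simple word-count similarity for learned embeddings, and stops at building a prompt rather than calling a model. All names and sample data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(query, Counter(tokenize(chunk["text"]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, chunks):
    """Ground the question in retrieved excerpts, each labeled with its source URL."""
    context = "\n".join(f"[{chunk['url']}] {chunk['text']}" for chunk in chunks)
    return f"Answer using only the excerpts below.\n\n{context}\n\nQuestion: {question}"


# Made-up excerpts standing in for text extracted from archived pages.
chunks = [
    {"url": "https://example.com/a", "text": "The city council approved the library budget in March."},
    {"url": "https://example.com/b", "text": "Local bakery wins regional bread award."},
    {"url": "https://example.com/c", "text": "Library opening hours change next month."},
]

question = "When was the library budget approved?"
top = retrieve(question, chunks)
prompt = build_prompt(question, top)
print(prompt)
```

Keeping the source URL next to each excerpt lets an answer be traced back to the archived page it came from.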

WARC-GPT on GitHub

See also: WARC-GPT Case Study

News

WARCbench: A Swiss Army Knife for WARC Processing
lil.law.harvard.edu - June 6, 2026

WARC-GPT “on tour”: Talk transcript and slide decks
lil.law.harvard.edu - June 28, 2024

WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI
lil.law.harvard.edu - February 12, 2024

An update on the WACZ format
webrecorder.net - May 5, 2023

Witnessing the web is hard: Why and how we built the Scoop web archiving capture engine
lil.law.harvard.edu - April 13, 2023

New Release: High Fidelity Capture Engine for Witnessing the Web 🍨
blogs.harvard.edu/perma - March 28, 2023

IIPC Technical Speaker Series: Archiving Twitter
lil.law.harvard.edu - January 18, 2023

Use Social.Perma.cc To Preserve Twitter Threads!
blogs.harvard.edu/perma - December 22, 2022

Web Archiving: Opportunities and Challenges of Client-Side Playback
lil.law.harvard.edu - September 15, 2022