Web Archive Collection Zipped

Updated 08042024-201726


Format Description Properties Explanation of format description terms

Identification and description

Full name: Web Archive Collection Zipped
Description

Web Archive Collection Zipped (WACZ), a file format for creating and hosting web archives, was announced by the Webrecorder team in January 2021, with an update to the format in May 2023. Webrecorder.net states that the format is "designed to make creating and hosting web archives quicker and easier...WACZ serves as a zipped package format for WARCs...WACZ files take the raw WARC files and zip them up, along with a CDX or compressed CDX index, and a full text index."

WACZ's technical specification, Web Archive Collection Zipped (WACZ) | Webrecorder Recommendation June 2021, is found on Webrecorder's GitHub, along with other useful reference documents, such as Use Cases for Decentralized Web Archives, WACZ Signing and Verification, and Crawl Index JSON (CDJ+XJ).

The WACZ Specification defines the WACZ format as a file "that is used to package up WARC data and metadata into a ZIP file for distribution and replay on the web." It goes on to state that "WACZ is a media type that allows web archive collections to be packaged and shared on the web as a discrete file. A WACZ file includes all the data that is needed for the rendering archived content as well as contextual information required for users to interpret it." Rendering software obtains the data using HTTP Range requests; alternatively, the data can be interacted with through special server software. The WACZ Specification defines the WACZ directory structure, as well as the ZIP format specification that WACZ uses for sharing web archives.

WACZ Specification Goals

The WACZ Specification has two broad goals for web archives:

  • Social: a way to connect and communicate with web archive collections, with the contextual information needed to interpret and interact with those web archives.
    • Contextual information describes what the archive collection contains, including when and how it was created.
  • Technical: dynamically load small amounts of archived data from a remote host without downloading the whole file.

WACZ Object & Directory Layout:

A WACZ Object consists of a 'datapackage.json' file (technical and descriptive metadata), an extensible directory and naming convention, and a method for bundling the directory layout in a ZIP file.

  • The directory structure MUST conform to the FRICTIONLESS-DATA-PACKAGE Specification.
    • As stated in the WACZ Signing and Verification, "The WACZ format builds on top of the [FRICTIONLESS-DATA-PACKAGE]. The data package includes a manifest datapackage.json file which contains the hashes of all the files in the data package...A signed WACZ also contains a datapackage-digest.json, which contains a hash and signature of the datapackage.json."

See the WACZ Specification for an Example of a WACZ Directory Structure.

  • Archive:
    • MUST contain at least one WARC formatted file.
    • SHOULD use the WARC extension.
  • Indexes:
    • MUST include at least one index for the WARC data stored in Archive.
    • MUST contain CDXJ data, which MAY be gzip compressed (index.cdx.gz) (PYWB-CDXJ).
  • pages.jsonl:
    • MUST be present and include a list of 'Page' objects as JSON-Lines.
    • Each line MUST contain:
      • url - URL for the page.
      • ts - datetime string (RFC3339).
    • Each line MAY contain:
      • title - resource description.
      • id - arbitrary resource identifier.
      • text - extracted text of the snapshot.
      • size - integer number of bytes for the page/resources.
      • Other properties that do not conflict with the required properties.
  • datapackage.json:
    • MUST be present at the root of the WACZ; serves as the manifest for the web archive.
    • MUST contain the properties:
      • profile - the string data-package.
      • resources - list of file names, paths, sizes, and fixity for all files.
      • wacz_version - WACZ version used.
    • SHOULD include the properties:
      • title - string or sentence describing the collection.
      • description - longer description of the archive's contents, MUST be Markdown formatted (plain text).
      • created - datetime when the file was created, RFC3339 format.
      • modified - datetime when the file was modified, RFC3339 format.
      • software - description of the software used to create the files.
      • mainPageUrl - optional URL of the main/starting page, used for replay.
      • mainPageDate - optional ISO-formatted date of the main/starting page, used for replay.
  • datapackage-digest.json:
    • SHOULD be included in the root of the WACZ to verify datapackage.json.
    • Used to determine whether datapackage.json has a cryptographic signature present.
    • Contains a hash and signature of datapackage.json.
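
The layout requirements above can be checked mechanically. The following is a minimal sketch, not a replacement for a real validator: it assumes the directory names archive/, indexes/, and pages/ and tests only for the presence of the required files and manifest properties. The function and helper names are hypothetical.

```python
import io
import json
import zipfile

# Properties the manifest MUST contain, per the list above.
REQUIRED_MANIFEST_KEYS = {"profile", "resources", "wacz_version"}


def check_wacz_layout(fileobj):
    """Return a list of problems found; an empty list means these basic checks passed."""
    problems = []
    with zipfile.ZipFile(fileobj) as zf:
        # Ignore explicit directory entries, which end with a slash.
        names = [n for n in zf.namelist() if not n.endswith("/")]
        if "datapackage.json" in names:
            manifest = json.loads(zf.read("datapackage.json"))
            for key in sorted(REQUIRED_MANIFEST_KEYS - set(manifest)):
                problems.append(f"datapackage.json lacks '{key}'")
        else:
            problems.append("datapackage.json missing from root")
        if not any(n.startswith("archive/") for n in names):
            problems.append("no file under archive/")
        if not any(n.startswith("indexes/") for n in names):
            problems.append("no file under indexes/")
        if "pages/pages.jsonl" not in names:
            problems.append("pages/pages.jsonl missing")
    return problems


def build_sample(include_pages=True):
    """Build a tiny, hypothetical WACZ-shaped ZIP in memory for the demo."""
    manifest = {"profile": "data-package", "resources": [], "wacz_version": "1.1.1"}
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", json.dumps(manifest))
        zf.writestr("archive/data.warc", b"")
        zf.writestr("indexes/index.cdx", b"")
        if include_pages:
            zf.writestr("pages/pages.jsonl", b"")
    return io.BytesIO(buf.getvalue())


print(check_wacz_layout(build_sample()))
print(check_wacz_layout(build_sample(include_pages=False)))
```

The sample members are empty placeholders, so the sketch says nothing about whether the WARC, index, or page data inside a package is itself well formed.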

WACZ Zip Format & Processing Model:

As described in the WACZ Specification, the ZIP format provides random access, so archived web pages can be retrieved from large archives without having to transfer the entire WACZ; users "read portions of the ZIP file on-demand using HTTP RANGE requests (RFC7233)."
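
The effect of random access can be illustrated without a network. In the sketch below, a hypothetical CountingReader class stands in for a remote WACZ file: each seek followed by a read corresponds to what a real client would send as one HTTP Range request. The sample package and its member names are assumptions made for the demonstration.

```python
import io
import zipfile


class CountingReader(io.BytesIO):
    """In-memory stand-in for a remote file that counts the bytes actually read."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_sample():
    """Build a ZIP holding one large stored member and one small member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("archive/data.warc", bytes(300_000))
        zf.writestr(
            "pages/pages.jsonl",
            b'{"url": "https://example.com/", "ts": "2021-01-18T00:00:00Z"}\n',
        )
    return buf.getvalue()


blob = build_sample()
remote = CountingReader(blob)
with zipfile.ZipFile(remote) as zf:
    # Only the ZIP index and the requested member are read, not the WARC data.
    pages = zf.read("pages/pages.jsonl")

print("package size:", len(blob), "bytes read:", remote.bytes_read)
```

The reader touches the end-of-archive index and the one requested member, which is why the count stays far below the package size. A real client would replace CountingReader with an object that issues ranged HTTP requests.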

Uses of WACZ

According to Kirsta Stapelfeldt in the article, Strategies for Preserving Digital Scholarship/Humanities Projects, May 2022, WACZ "offers additional tools to extend the richness and functions of web-archives and other configurations of web-based recorded data...The WACZ format allows web archive collections to be loaded incrementally as the user replays in the browser, substantially improving the experience for end users."

Production phase: Middle and final state, as an archive and distribution format. As described in the WACZ Specification, WACZ files package WARC data and metadata into ZIP files for distribution and replay on the web.
Relationship to other formats
    Defined via: ZIP, ZIP File Format (PKWARE). WACZ Standard, "A WACZ object consists of a method for bundling the directory layout in a ZIP file...The entire directory structure MUST be stored in a standard ZIP file."
    Contains: WARC, WARC (Web ARChive) file format. WACZ Standard, "A WACZ object consists of an extensible directory and naming convention for web archive data." WACZ Standard defines Web Archive as "A collection of files that preserve representations of web resources in the WARC format."
    Contains: JSON, JSON (JavaScript Object Notation). WACZ Standard, "A WACZ object consists of a 'datapackage.json' file for recording technical and descriptive metadata."
    Contains: CDX Index Format. WACZ Standard, Web Archive definition states, "A web archive may also include derivative files such as CDX indexes for accessing records within the archive." According to Archive.org's Format Reference, CDX File Format, "A CDX file consists of individual lines of text, each of which summarizes a single web document." Not described separately on this website at this time. (https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/)
    Contains: CDXJ. WACZ Specification, the WACZ "Indexes directory MUST include one or more indexes for the WARC data stored in archive...Index files MUST contain CDXJ data." As described in the Python Package Index (PyPI), py-wacz page, "The pywb system uses a more flexible version of the CDX, called CDXJ, which stores most of the fields in a JSON dictionary." Not described separately on this website at this time.

Local use

LC experience or existing holdings: None.
LC preference

The Library of Congress Recommended Formats Statement (RFS) lists WACZ as an acceptable format for web archives.

Sustainability factors

Disclosure

Open-source, fully documented. The Webrecorder project announced the Web Archive Collection Zipped (WACZ) format on January 18, 2021. The technical specifications used by the Webrecorder project are available on GitHub, including the Web Archive Collection Zipped (WACZ) Webrecorder Recommendation packaging standard for web archives.

As stated by Webrecorder, "The Webrecorder project aims to maintain existing open source tools and develop new ones."

    Documentation

Web Archive Collection Zipped (WACZ) | Webrecorder Recommendation 03 June 2021 https://specs.webrecorder.net/wacz/1.1.1/

Status of the Document - "This is a stable version of the WACZ standard and is in active use by the Webrecorder project."

Adoption

The GitHub Webrecorder py-wacz repository contains a module and command line utility for working with WACZ formatted files. The py-wacz GitHub page states, "The WACZ command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages." As stated by Python Package Index (PyPI), py-wacz page, "The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification."

According to Webrecorder, ReplayWeb.page and ArchiveWeb.page extension support the WACZ format.

ReplayWeb.page states, "ReplayWeb.page supports a new format for bundling raw web archive data (usually WARC files), indices, page lists and other metadata into a single ZIP file...Files bundled into this format can use the .wacz (web archive collection zipped) file extension. ReplayWeb.page will recognize this extension (as well as regular .zip) and will also load it from Google Drive when the Google Drive Integration is installed."

According to Ed Summers in the blog, Web Archives On, Of, and Off, the Web, November 2021, "WACZ, and WACZ enabled tools, will be a game changer for sharing web archives because it makes web archive data into a media-type for the web, where a WACZ file can be moved from place to place as a simple file, without requiring complex server side cloud services to view and interact with it-just your browser."

Scoop, a browser-based web archiving library, supports the use of WACZ formatted files. Matteo Cargnelutti states in the Library Innovation Lab blog, Witnessing the Web is Hard: Why and How We Built the Scoop Web Archiving Capture Engine, April 2023, "Scoop comes with built-in support for the Web Archive Collection Zipped (WACZ) file format, an emerging standard initiated by Webrecorder, and for the WACZ Signing and Verification specification that the Library Innovation Lab helped design." Cargnelutti describes the authsign-compatible server that applies a signature (X.509 certificate) to the WACZ file, allowing the file to be sealed so that it cannot be altered without breaking that seal.

  • X.509 Certificate:
    • Public key certificate format.
    • As stated on SSL.com, "X.509 is a standard format for public key certificates, digital documents that securely associate cryptographic key pairs with identities such as websites, individuals, or organizations."

Browsertrix Cloud, a browser-based crawling system, will support WACZ formatted files, stating in their Features page, "Browsertrix Cloud will support a way to upload externally created WACZ files which can be used to augment content from scheduled crawls...The output of the crawls will be standard WARC or the new portable WACZ format. The WACZ format will contain all the data and metadata for the crawls, including raw WARC data, page indexes, full-text search, and other metadata that may be part of the WACZ format."

Comments welcome.

    Licensing and patents

None. As stated on Webrecorder.net, "All Webrecorder tools are licensed under open-source licenses. Please see individual repositories on GitHub for more info about the licenses and contributing." GitHub does not list any license information for the WACZ format.


Transparency

As stated by Ed Summers in the blog, Save WACZ Now, April 2023, "One nice thing about WACZ files is that they are really just ZIP files, which users can unzip and inspect." Users can find the archive folder (containing the WARC, fdd000236), datapackage.json, the indexes folder, and the pages folder.

Webrecorder blog, Next Generation Web Archiving: Loading Complex Web Archives On-Demand in the Browser, August 2020, explains, "a ZIP file has a built-in index of it content...It is possible to read a portion of the CDX index."

  • See ZIP for more information on ZIP file transparency.
  • See JSON for more information on JSON file transparency.

As described in the WACZ Specification, the WACZ "Indexes directory MUST include one or more indexes for the WARC data stored in archive...Index files MUST contain CDXJ data."

  • Pywb Indexing: CDXJ Format describes CDXJ as a more flexible version of CDX that stores most fields in a JSON (fdd000381) dictionary.

WACZ Specification states, "The pages/pages.jsonl MUST be present and include a list of 'Page' objects as (JSON-Lines)."

  • JSONLines.org defines the JSON Lines text format as "a convenient format for storing structured data that may be processed one record at a time."
  • JSON Lines format allows UTF-8 encoding.
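
Because each line of pages.jsonl is an independent JSON record, the file can be read one page at a time. The sketch below is illustrative only: it enforces the two required properties (url and ts) strictly, whereas a reader for real-world files may need to be more lenient, and the sample records are invented.

```python
import json
from datetime import datetime


def parse_pages(text):
    """Parse JSON-Lines page records, requiring 'url' and 'ts' on every line."""
    pages = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        page = json.loads(line)
        for key in ("url", "ts"):
            if key not in page:
                raise ValueError(f"line {lineno}: missing required '{key}'")
        ts = page["ts"]
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        if ts.endswith("Z"):
            ts = ts[:-1] + "+00:00"
        page["ts"] = datetime.fromisoformat(ts)
        pages.append(page)
    return pages


sample = (
    '{"url": "https://example.com/", "ts": "2021-01-18T12:00:00Z"}\n'
    '{"url": "https://example.com/about", "ts": "2021-01-18T12:05:00Z", "title": "About"}\n'
)
for page in parse_pages(sample):
    print(page["ts"].isoformat(), page["url"], page.get("title", ""))
```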


Self-documentation

Supports metadata. As stated on Webrecorder.net, WACZ files "come packaged with everything users need to create and host a web archive collection: A random-access index of all raw data, a list of entry-point pages into the archive, and a user-defined, editable metadata about the web archive collection."

The WACZ Specification describes the datapackage.json file within the WACZ Object as "recording technical and descriptive metadata specified in Frictionless-Data-Package."


Accessibility Features

No specific features in the file format. Instead, accessibility for web content is supported through adherence to the W3C's Web Content Accessibility Guidelines (WCAG), which define structures and good practice to make web content perceivable (such as text alternatives and captions), operable (such as keyboard navigation), understandable (predictable behavior), and robust (maximizing compatibility with current and future user tools).

External dependencies

None, beyond the availability of software to extract and decompress the files contained in a ZIP file. See ZIP for more information on ZIP's external dependencies.


Technical protection considerations

Webrecorder Specification, WACZ Signing and Verification, describes the "working draft proposal to create signed WACZ packages which allow package's author to be cryptographically proven." WACZ developers want to make WACZ packaged files more secure, "to increase trust in web archives, it becomes necessary to guarantee certain properties about who the web archive was created and when."


Quality and functionality factors

Web Archive
Normal rendering

Supported through Webrecorder's ReplayWeb.page and ArchiveWeb.page, as well as Internet Archive's Wayback Machine.

As stated by Webrecorder, "ReplayWeb.page and the newly announced ArchiveWeb.page extension both support the WACZ format 1.0."

Internet Archive's Wayback Machine can save web archives and email them to users as WACZ files.


Documentation of harvesting context

Allows for substantial information about the time of harvesting. When announcing WACZ Format 1.0, Webrecorder described WACZ packaged files as follows: "they come packaged with everything users need to create and host a web archive collection: A random-access index of all raw data, a list of entry-point pages into the archive, and a user-defined, editable metadata about the web archive collection."

The WACZ Specification details the metadata contained in WACZ files, the "WACZ object consists of a datapackage.json file for recording technical and descriptive metadata specified in FRICTIONLESS-DATA-PACKAGE."

As described on FrictionlessData.io, Data Package, "Metadata that describes the structure and contents of the package...General metadata such as the package's title, license, publisher." View a full list of the required and optional Data Package descriptor properties at FrictionlessData.io.


Efficiency at scale

WACZ files use the ZIP format's index for locating the contents of the web archive and its metadata. The WACZ Specification describes the WACZ object index: the indexes directory contains indexes for the WARC data in the WACZ archive directory, and "These index files allow clients to efficiently look up an a URL to see if it is contained in the WACZ." Using the ZIP file format allows users to quickly "read portions of the ZIP file on-demand using HTTP RANGE requests [RFC7233]."
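
An index lookup of this kind can be sketched as a binary search over sorted CDXJ lines. The example assumes each line has the shape "key timestamp JSON", with most fields held in the JSON dictionary; the keys, the field names (url, filename, offset, length), and the values are illustrative assumptions rather than details taken from the WACZ Specification.

```python
import bisect
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (key, timestamp, fields)."""
    key, timestamp, payload = line.split(" ", 2)
    return key, timestamp, json.loads(payload)


def lookup(sorted_lines, key):
    """Return every entry whose key matches exactly, using binary search."""
    prefix = key + " "
    # All lines that begin with the prefix sit together in sorted order.
    start = bisect.bisect_left(sorted_lines, prefix)
    hits = []
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break
        hits.append(parse_cdxj_line(line))
    return hits


# Hypothetical index content; a real index would be read from the indexes directory.
index = sorted([
    'com,example)/ 20210118120000 {"url": "https://example.com/", "filename": "data.warc.gz", "offset": "0", "length": "1024"}',
    'com,example)/ 20230503090000 {"url": "https://example.com/", "filename": "data.warc.gz", "offset": "5120", "length": "980"}',
    'com,example)/about 20210118120500 {"url": "https://example.com/about", "filename": "data.warc.gz", "offset": "1024", "length": "4096"}',
    'org,example)/ 20210118121000 {"url": "https://example.org/", "filename": "data.warc.gz", "offset": "6100", "length": "512"}',
])

for key, timestamp, fields in lookup(index, "com,example)/"):
    print(timestamp, fields["filename"], fields["offset"], fields["length"])
```

The offset and length recovered from a matching entry are what a client would then use to request just that record from the archive.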

See WARC for more information on WARC's efficiency at scale.


Support for stewardship

WACZ Specification states, "The WACZ format provides a storage approach optimized for efficient random-access to packaged up WARC data that allows the browser to render a page by fetching only what is needed for that particular page. This is done by leveraging the ZIP format's built-in index to locate the contents of the web archive and its constituent metadata. WACZ is not designed to replace other web archiving formats. Rather it establishes a file packaging convention for all the data needed by a browser for efficient rendering of a web archive collection, and its contextualization."


Aggregate
Compression

As stated in the WACZ Specification, "Already compressed files MUST NOT be compressed again to allow for random access."

WACZ Specification lists ZIP, ZIP64, and gzip as compression methods for WACZ.
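
One way to honor the rule against double compression when writing a package is to pick the ZIP method per member. This is a sketch under stated assumptions: the suffix list and the add_member helper are hypothetical, and deflating the remaining members is a choice made for the example, not a requirement quoted from the specification.

```python
import io
import zipfile

# Hypothetical list of suffixes treated as already compressed.
ALREADY_COMPRESSED = (".gz", ".zip")


def add_member(zf, name, data):
    """Store already-compressed data as-is; deflate everything else."""
    if name.endswith(ALREADY_COMPRESSED):
        method = zipfile.ZIP_STORED
    else:
        method = zipfile.ZIP_DEFLATED
    zf.writestr(name, data, compress_type=method)


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    add_member(zf, "archive/data.warc.gz", b"\x1f\x8b placeholder bytes")
    add_member(zf, "pages/pages.jsonl", b'{"url": "https://example.com/"}\n' * 50)

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    for info in zf.infolist():
        print(info.filename, info.compress_type, info.file_size, info.compress_size)
```

Stored members keep their bytes contiguous inside the ZIP, so a byte range within the member maps directly onto a byte range of the package.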


Support for error detection

As stated on Webrecorder.net, "starting from 1.0, WACZ also conforms to the Frictionless Data Package standard. The Data Package manifest adds integrity checks (via SHA-256 or MD5) for each file contained in the WACZ."

The GitHub Webrecorder py-wacz repository states that the command line option '--hash-type' "allows the user to specify the hash type used (sha256 or md5)."

Existing WACZ files can be validated by running: 'wacz validate myfile.wacz'.
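
The manifest-based integrity check can also be sketched directly with the standard library. This is not how the wacz tool is implemented; it assumes each entry in resources carries a path and a hash written as "algorithm:hexdigest", and the helper names and sample data are hypothetical.

```python
import hashlib
import io
import json
import zipfile


def verify_fixity(fileobj):
    """Map each resource path in datapackage.json to True if its hash matches."""
    results = {}
    with zipfile.ZipFile(fileobj) as zf:
        manifest = json.loads(zf.read("datapackage.json"))
        for resource in manifest["resources"]:
            # Assumed hash form: "<algorithm>:<hex digest>".
            algorithm, _, expected = resource["hash"].partition(":")
            if algorithm not in ("sha256", "md5"):
                raise ValueError(f"unsupported hash type: {algorithm}")
            actual = hashlib.new(algorithm, zf.read(resource["path"])).hexdigest()
            results[resource["path"]] = actual == expected
    return results


def build_sample(tamper=False):
    """Build a tiny in-memory package; optionally alter a file after hashing it."""
    data = b'{"url": "https://example.com/", "ts": "2021-01-18T12:00:00Z"}\n'
    manifest = {
        "profile": "data-package",
        "wacz_version": "1.1.1",
        "resources": [{
            "name": "pages.jsonl",
            "path": "pages/pages.jsonl",
            "hash": "sha256:" + hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        }],
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", json.dumps(manifest))
        zf.writestr("pages/pages.jsonl", data + (b"extra" if tamper else b""))
    return io.BytesIO(buf.getvalue())


print(verify_fixity(build_sample()))
print(verify_fixity(build_sample(tamper=True)))
```

A mismatch shows only that a file differs from what the manifest recorded; establishing who produced the manifest is the concern of the separate signing specification.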


Beyond normal functionality

None.


File type signifiers and format identifiers

Tag: Value
    Note

Filename extension: wacz
    WACZ Standard, "A ZIP file that follows this Web Archive Collection format spec MUST use the extension .wacz."
Internet Media Type: application/x-wacz
    WACZ Standard, "WACZ HTTP responses for WACZ files SHOULD be published with the application/wacz media type."
Other: NF00796
    See https://www.archives.gov/files/lod/dpframework/id/NF00439.ttl
Pronom PUID: fmt/1840
    Details for: WACZ in Pronom. See (https://www.nationalarchives.gov.uk/PRONOM/fmt/1840)
Wikidata Title ID: Q104903124
    Web Archive Collection Zipped, Wikidata. (https://www.wikidata.org/wiki/Q104903124)

Notes

General 
History

As described on Rhizome's Wikipedia page, Rhizome, a non-profit platform organization for media art created in 1996, launched the Webrecorder tool to the public in 2016 as "a free web archiving tool that allows users to create their own archives of the dynamic web...Webrecorder is targeted towards archiving social media, video content, and other dynamic content, rather than static webpages...It uses a 'symmetrical web archiving' approach, meaning the same software is used to record and play back the website...While other web archiving tools run a web crawler to capture sites, Webrecorder takes a different method, actually recording a user browsing the site to capture its interactive features."

Webrecorder.net announced the Web Archive Collection Zipped (WACZ) 1.0 format on January 18, 2021, as "a new file format designed to make creating and hosting web archives quicker and easier." The WACZ format update was introduced on May 3, 2023.

Format specifications

Useful references

URLs

Last Updated: 04/29/2024