ISO OKs new Web archiving format

A new file format designed to archive Web sites for posterity has been given the stamp of approval by the International Organization for Standardization (ISO).

Known as WARC (Web ARChive), it offers a convention for concatenating multiple data objects into one long file. The format can be used to build applications for harvesting, managing, accessing and exchanging content.

“For a long time, keeping track of the staggering number of Web sites and pages posed a difficult challenge for digital curators and archivists, and resulted in countless lost data,” says Clément Oury, member of the working group that developed the standard.

“With WARC, ISO 28500 takes Internet archiving to the next level by enabling the effective management, structure and storage of billions of resources collected from the Web and elsewhere. Its standardization offers a guarantee of durability, and will help Web archiving become part of the mainstream activities of heritage institutions and other branches, by for example, fostering the development of new tools and ensuring interoperability between collections,” explains Mr. Oury.

The WARC format is an extension of the ARC file format, which has been used by the Internet Archive since 1996, and by numerous heritage institutions to store “Web crawls” – which represent extracts of entire Web pages and their links.

The motivation to extend the ARC format arose from the discussions and experiences of these organizations within the International Internet Preservation Consortium (IIPC), whose core mission is to acquire, preserve and make accessible knowledge and information from the Internet for future generations. IIPC members were finding it increasingly difficult to store and manage the growing volume of information coming from the Internet.

The WARC format differs from ARC in that it offers new possibilities, notably recording HTTP request headers and arbitrary metadata, assigning an identifier to every contained file, managing duplicates and migrated records, and segmenting records. WARC files are intended to store every type of digital content, whether retrieved by HTTP or another protocol.
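To make the record structure more concrete, here is a minimal Python sketch of how records might be written, concatenated into one file, and read back. It is an illustration under stated assumptions, not a reference implementation: the helper names make_record and read_records are hypothetical, the URL is a placeholder, and the header layout (a WARC/1.0 version line, named fields such as WARC-Type and WARC-Record-ID, a blank line, the content block, then two line breaks) follows a general reading of the format rather than anything quoted in this article.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"

def make_record(record_type, block, content_type, target_uri=None):
    """Build one WARC-style record as bytes (hypothetical helper)."""
    headers = [
        ("WARC-Type", record_type),
        # Every record gets its own identifier.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    if target_uri:
        headers.append(("WARC-Target-URI", target_uri))
    head = b"WARC/1.0" + CRLF
    head += b"".join(f"{k}: {v}".encode("utf-8") + CRLF for k, v in headers)
    # Blank line ends the header; two line breaks end the record.
    return head + CRLF + block + CRLF + CRLF

def read_records(data):
    """Yield (headers, block) pairs from concatenated records."""
    pos = 0
    while pos < len(data):
        head_end = data.index(CRLF + CRLF, pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        headers = dict(line.split(": ", 1) for line in lines[1:])
        start = head_end + 4
        length = int(headers["Content-Length"])
        yield headers, data[start:start + length]
        pos = start + length + 4  # skip the two trailing line breaks

# An HTTP request and a piece of arbitrary metadata, stored side by side.
request_block = b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n"
metadata_block = b"note: hypothetical example\r\n"

archive = (
    make_record("request", request_block,
                "application/http; msgtype=request",
                target_uri="http://example.org/")
    + make_record("metadata", metadata_block, "application/warc-fields")
)

records = list(read_records(archive))
for headers, block in records:
    print(headers["WARC-Type"], headers["WARC-Record-ID"], len(block))
```

Because each record declares its own length, a reader can step from one record to the next without interpreting the content, which is what lets a single file hold material of any type.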

“Several applications are already WARC compliant,” adds Mr. Oury, “such as the Heritrix crawler for harvesting, the WARC tools for data management and exchange, the Wayback Machine, NutchWAX and other search tools for access.”