Archive file formats and archivers

Introduction

This page explains what archives are, and gives background information on some of the more popular archive file formats.

Archive file format details

The pages linked in this list each describe the properties of one archive file format. More formats are likely to be added in the future.

Archives

Archives are files that store other files, plus information about those files. Most of the time an archive consists of a single file, but there are also multi-volume archives. Multi-volume archives are typically used to spread an archive that does not fit on a single floppy disk, CD-R or other storage medium across several of those media. Archives are useful when transferring a set of files or creating a backup copy of it. A single file is easier to handle than several files. In addition, archives store meta-information such as the date of last modification, which is restored when entries are extracted from the archive, and their contents are typically reduced in size by data compression.

Archivers

Programs that read or write archive files are called archivers. Several years ago, archivers could typically understand only one file format, the one they were designed for. Most of the time, the archiver even had the same name as the file format, and that name was also the file extension: tar, (pk)zip, rar, arj. These days, quite a few archivers understand several file formats. Many of them can still create and update only one type, but they can at least read several others. Note that there are both command line and GUI (graphical user interface) archivers. Both have their advantages. Command line programs can be used in scripts to automate backups and file extraction. GUI programs are easier to handle for the average user: they make it convenient to quickly inspect the content of an archive and to perform typical tasks like extracting files.

Number of entries

Some archive file formats can store only one entry (e.g. gzip); others can hold several entries.
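The difference can be sketched with Python's standard library. This is only an illustration: the entry names and the payload are made up, and the archives are built in memory rather than on disk.

```python
import gzip
import io
import zipfile

# gzip: a single compressed stream, with no table of named entries.
payload = b"hello archive\n" * 100
gz_bytes = gzip.compress(payload)
gz_restored = gzip.decompress(gz_bytes)

# ZIP: several named entries inside one archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "first entry")   # hypothetical entry names
    zf.writestr("b.txt", "second entry")

with zipfile.ZipFile(buf) as zf:
    entry_names = zf.namelist()
```

This is why gzip is usually combined with a multi-entry container such as tar when more than one file has to be stored.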

Archiving directories and empty files

Note that archive entries include not only files, but also directories. This may seem unnecessary. After all, if a file with a path in its name (example: images/persons/bob.jpg) is extracted, the required directories should be created automatically. Why store directories (in the example, images/ and its subdirectory images/persons/) as zero-sized entries of their own? Directory entries are valuable because of the metadata associated with them (e.g. access rights or date of last modification). When directories are created during extraction (e.g. when extracting the file bob.jpg to images/persons and either images or images/persons does not exist), that metadata can be restored to faithfully reflect the state of the file system from which bob.jpg was archived. For a similar reason, archiving empty files (file size equals zero) is also a good idea. An empty file with a certain name may mean something to an application, so it should be archived. Entries without data don't take much space anyway.
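As a minimal sketch of this idea, the following uses Python's tarfile module to write a directory entry and an empty file as entries of their own, each with its own metadata. The .keep file name, the access rights and the timestamp are arbitrary values chosen for illustration.

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    # Directory entry: carries no data, only metadata.
    d = tarfile.TarInfo("images/persons")
    d.type = tarfile.DIRTYPE
    d.mode = 0o750               # access rights (illustrative value)
    d.mtime = 1_000_000_000      # date of last modification (illustrative value)
    tar.addfile(d)

    # Empty file entry: size zero, but the name and metadata are preserved.
    e = tarfile.TarInfo("images/persons/.keep")  # hypothetical marker file
    e.size = 0
    e.mtime = 1_000_000_000
    tar.addfile(e)

    # Regular file entry with data (placeholder bytes, not a real image).
    data = b"not a real JPEG"
    f = tarfile.TarInfo("images/persons/bob.jpg")
    f.size = len(data)
    tar.addfile(f, io.BytesIO(data))

# Read the archive back and index the entries by name.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    members = {m.name: m for m in tar.getmembers()}
```

After reading the archive back, the directory entry still reports the mode and modification time it was stored with, which is the information an extractor needs to recreate the directory as it was.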

Data compression

Most archive file formats use data compression to reduce the size of the resulting archives. The compression ratio (the amount of reduction achieved) depends on the compression algorithm used and on the type of data being compressed. Unlike popular lossy compression types such as JPEG or MP3, the compression algorithms used with archives must be lossless. Lossless compression achieves a worse compression ratio, but the original data can be recreated exactly from the compressed data stream, which is not possible (and not necessary) with lossy compression. Typically, the better an algorithm compresses, the more memory and CPU cycles it requires. Files that are already compressed will compress very badly, or not at all.
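These properties can be observed with a short experiment, sketched here using Python's zlib module (an implementation of Deflate). Two assumptions are made for illustration: the ratio is measured as compressed size divided by original size, and random bytes stand in for already compressed data, since both lack exploitable redundancy.

```python
import os
import zlib

# Highly redundant text versus data without redundancy.
text = b"the quick brown fox jumps over the lazy dog\n" * 500
random_like = os.urandom(len(text))  # stand-in for already compressed data


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)


# Lossless: decompressing returns the original bytes exactly.
restored = zlib.decompress(zlib.compress(text))

text_ratio = ratio(text)            # far below 1 for repetitive input
random_ratio = ratio(random_like)   # close to 1, possibly slightly above
```

The repetitive text shrinks to a small fraction of its size, while the random data stays at roughly its original size or even grows slightly because of the container overhead.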

Solid archives

Solid archives are supported by some formats that allow for the inclusion of several files. In a solid archive, all files are compressed as if they were one large file. This improves the compression ratio most of the time, because compression algorithms tend to need a certain minimum amount of input before they work well. Concatenating many small files results in one large file, which can then be compressed as a whole. This works better than compressing each small file independently of the others. As an example, a tar archive of many small files compressed with gzip is typically smaller than a zip file with the same entries, although both use the Deflate algorithm.
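The effect can be sketched by comparing the two strategies directly with zlib (Deflate). The 200 small "files" below are invented sample data, and the per-entry headers that a real archive format would add are ignored, so the numbers only illustrate the trend.

```python
import zlib

# 200 small "files" with similar content (made-up sample data).
files = [
    f"id={i};name=user{i};role=viewer;active=true\n".encode()
    for i in range(200)
]

# Non-solid: every file is compressed on its own.
per_file_total = sum(len(zlib.compress(f, 9)) for f in files)

# Solid: all files are concatenated and compressed as one stream.
solid_total = len(zlib.compress(b"".join(files), 9))

print(f"per file: {per_file_total} bytes, solid: {solid_total} bytes")
```

In the solid case the compressor can reuse what it has seen in earlier files when it encodes later ones, which is not possible when each file starts a fresh stream.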

Solid archiving comes at a cost: a single file cannot be extracted without decompressing everything located before it in the archive. In many cases this is not a problem, because typically the complete archive is extracted anyway. Some archivers also rearrange the input files in a solid archive so that all files of a certain data type are adjacent, which sometimes improves the compression ratio even more.

Which archive file format / which archiver should you use?

Obviously, if there are must-have requirements for the file format (like the ability to store multiple files) or for the program (like availability for a certain platform), they must be met first. Beyond that, go for popularity, unless you control both the creation and the extraction process. Don't use some exotic new file format for which nobody has an archiver installed just to save two percent of space. Use ZIP; it works well in most cases. Tar, in combination with either gzip or bzip2, is a good choice for Unix systems, and StuffIt (.sit) for the Macintosh. All of these handle the metadata of their respective systems well, because they were created for those systems and their file system properties. Info-ZIP has created tools for the ZIP format for almost any platform. If you cannot spend much CPU time, use the program's compression settings to switch from maximum to moderate compression. If your archiver must be very fast, or must compress very well, check out the Archive comparison test.
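The trade-off between compression setting and CPU time can be measured with a small sketch like the following, which uses zlib's compression levels as a stand-in for an archiver's settings. The sample data is invented, and the timings depend entirely on the machine and on the data, so no particular result is implied.

```python
import time
import zlib

# Made-up, moderately repetitive sample data.
data = b"".join(
    f"line {i}: some moderately repetitive log text\n".encode()
    for i in range(50_000)
)

results = {}
for level in (1, 6, 9):  # fast, default, maximum
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    results[level] = (len(compressed), elapsed, compressed)
    print(f"level {level}: {len(compressed)} bytes in {elapsed:.4f} s")
```

Running it on data that resembles your own files gives a rough idea of whether the extra time spent at the maximum setting buys a worthwhile reduction in size.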

Even popular formats can cause problems when they undergo a major version change. For example, WinRAR 3.0 by default uses a compression method that older command line rar programs do not support. PKWare and WinZip Inc. have added features to the ZIP format that few archivers support.