January 28, 2016

To ZIP or not to ZIP, that is the (web archiving) question

Do you use uncompressed (W)ARC files?

It is hard to imagine why you would want to store the material uncompressed. After all, web archives are big. Compression saves space and space is money.

While this seems straightforward, it is worth examining some of the assumptions made here and considering what trade-offs we may be making.

Let's start by considering that a lot of the files on the Internet are already compressed. Images, audio and video files, as well as almost every file format for "large data", are compressed. Sometimes this is done simply by wrapping everything in a ZIP container (e.g. EPUB). There is very little additional benefit gained from compressing these files again (it may even increase the size very slightly).
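To make this concrete, here's a quick check using Python's standard gzip module (the sample data is made up: repetitive text versus random bytes as a stand-in for an already-compressed payload):

    import gzip, os

    html = b"<p>Lorem ipsum dolor sit amet</p>" * 1000
    binary = os.urandom(len(html))  # stand-in for a JPEG/MP4/ZIP payload

    print(len(gzip.compress(html)) / len(html))      # a tiny fraction: big win
    print(len(gzip.compress(binary)) / len(binary))  # just above 1.0: no win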

For some highly specific crawls, it is possible that compression will accomplish very little.

But it is also true that compression costs very little. We'll get back to that point in a bit.

For most general crawls, the amount of HTML, CSS, JavaScript and various other highly compressible material will make up a substantial portion of the overall data. Those files may be smaller, but there are a lot more of them, especially automatically generated HTML pages and other crawler traps that are impossible to avoid entirely.

In our domain crawls, HTML documents alone typically make up around a quarter of the total data downloaded. Given that we then deduplicate images, videos and other, largely static, file formats, HTML files' share of the overall data that needs to be stored is even greater, typically approaching half!

Given that these text files compress heavily (usually by 70-80%), tremendous storage savings can be realized using compression. In practice, our domain crawls' compressed size is usually about 60% of the uncompressed size (after deduplication).

More frequently run crawls (with higher levels of deduplication) will benefit even more. Our weekly crawls' compressed size is usually closer to 35-40% of the uncompressed volume (after deduplication discards about three quarters of the crawled data).
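As a quick back-of-the-envelope check of the domain crawl figure (the numbers below are rounded assumptions drawn from the text above, not measurements):

    # Assumed round numbers based on the figures above.
    html_share = 0.5    # compressible text ~half of the data stored after dedup
    html_ratio = 0.25   # highly compressible text shrinks by ~75%
    other_ratio = 0.95  # already-compressed formats gain almost nothing

    overall = html_share * html_ratio + (1 - html_share) * other_ratio
    print(overall)      # 0.6 -> in line with the ~60% observed for domain crawls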

So you can save anywhere from ten to sixty percent of the storage needed, depending on the types of crawling you do. But at what cost?

On the crawler side the limiting factor is usually disk or network access. Memory is also sometimes a bottleneck. CPU cycles are rarely an issue. Thus the additional overhead of compressing a file, as it is written to disk, is trivial.
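For illustration, the write path amounts to little more than this (a Python sketch, not any particular crawler's code); note that compressed (W)ARCs conventionally store each record as its own gzip member, so records remain individually addressable by byte offset:

    import gzip

    def write_record(out, record_bytes):
        # One gzip member per record: members concatenate into a valid
        # .warc.gz and each record can still be located by its offset.
        out.write(gzip.compress(record_bytes))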

On the access side, you also largely find that CPU isn't a limiting factor. The bottleneck is disk access. And here compression can actually help! It probably doesn't make much difference when only serving up one small slice of a WARC, but when processing entire WARCs it will take less time to lift them off the slow HDD if the file is smaller. The additional overhead of decompression is insignificant in this scenario, except in highly specific circumstances where CPU is very limited (but why would you process entire WARCs in such an environment?).

So, you save space (and money!) and performance is barely affected. It seems like there is no good reason to not compress your (W)ARCs.

But there may just be one: HTTP Range Requests.

To handle an HTTP Range Request, a replay tool using compressed (W)ARCs will have to access a WARC record and then decompress the entire payload (or at least from the start and as far as needed). If uncompressed, the replay tool could simply locate the start of the record and then skip the required number of bytes.
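Sketched in Python (the offsets are assumed to point at the payload within an uncompressed record and at the start of the record's gzip member respectively; WARC and HTTP header handling is omitted):

    import gzip

    def range_from_uncompressed(f, payload_offset, start, length):
        # Uncompressed: seek straight to the requested byte range.
        f.seek(payload_offset + start)
        return f.read(length)

    def range_from_compressed(f, member_offset, start, length):
        # Compressed: decompress from the beginning of the record's gzip
        # member and discard everything before the requested range.
        f.seek(member_offset)
        gz = gzip.GzipFile(fileobj=f)
        gz.read(start)  # decompressed only to be thrown away
        return gz.read(length)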

This only affects large files and is probably most evident when replaying video files. Users may wish to skip ahead, and that is implemented via range requests. Imagine the benefit when skipping to the last few minutes of a movie that is 10 GB on disk!

Thus, it seems to me that a hybrid solution may be the best course of action. Compress everything except files whose content type indicates an already compressed format. Configure it to compress when in doubt. It may also be best to compress records under a certain size threshold regardless of content type, since the highly compressible headers account for most of the bytes in small records. That would need to be evaluated.
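Such a rule might look like this (a sketch; the type list and the 8 KB threshold are illustrative guesses, exactly the kind of thing that would need evaluating):

    # Illustrative, not exhaustive; a real deployment would use a fuller list.
    ALREADY_COMPRESSED = {
        "image/jpeg", "image/png", "image/gif",
        "video/mp4", "audio/mpeg",
        "application/zip", "application/gzip", "application/epub+zip",
    }

    def should_compress(content_type, size, threshold=8192):
        if size < threshold:
            return True  # headers dominate small records; compress anyway
        # Compress when in doubt: skip only types known to be compressed.
        return content_type not in ALREADY_COMPRESSED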

Unfortunately, you can't mix compressed and uncompressed records within the same (W)ARC file. But it is fairly simple to configure the crawler to use separate output files for these content types. Most crawls generate more than one (W)ARC anyway.
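Continuing the sketch, routing records between the two outputs is then trivial (assuming the hypothetical should_compress function above, plus one writer per output file):

    import gzip

    def route_record(content_type, size, record_bytes, gz_out, plain_out):
        # Two output files: a .warc.gz (one gzip member per record) and a
        # plain .warc for payloads that are already compressed.
        if should_compress(content_type, size):
            gz_out.write(gzip.compress(record_bytes))
        else:
            plain_out.write(record_bytes)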

Not only would this resolve the HTTP Range Request issue, it would also avoid a lot of pointless compression/decompression work being done.
