My automatic rule-based time tracker (arbtt), which is written in Haskell, collects a data sample every minute, consisting mainly of the list of currently open windows (window title and program name). Naturally, this log grows rather large: since October of last year, I have collected 70,000 samples. I had already moved from a text-based file format to a binary format using Data.Binary, which gave a big performance boost.
But by now, I was afraid that this would no longer be enough. My log file has grown to 30 MB. Watching the memory graph in gnome-panel, arbtt-stats takes up more than half of my memory: the Haskell runtime system reports 569 MB total memory in use, and the command finishes after 28.5 seconds.
Naturally, the log file is highly redundant: compressing it with bzip2 shrinks it to 1.6 MB. But since I would like to preserve the ability to simply append samples at the end, without having to read the file first, I chose not to just add bzip2 or gzip compression. Instead, I am now exploiting a very obvious redundancy: two adjacent samples usually list exactly the same windows, and a focus change only toggles a flag. So now, when storing a string that is part of a sample, the writer checks whether this string was already present in the previous sample; if so, it just stores the number of that string (one byte). Only if the string was not present does it write a zero byte followed by the string itself. When reading a sample, the process is reversed.
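The scheme can be sketched like this (a simplified model, not arbtt's actual Data.Binary.StringRef code; the `Token` type and function names are made up for illustration, and the real format packs everything into bytes):

```haskell
import Data.List (elemIndex)

-- A string in a sample is stored either as a back-reference into the
-- previous sample's string list, or as a literal (preceded by a zero
-- byte in the real on-disk format).
data Token = Ref Int | Literal String
  deriving (Eq, Show)

encodeString :: [String] -> String -> Token
encodeString prev s = case elemIndex s prev of
  Just i  -> Ref (i + 1)   -- indices start at 1; 0 marks a literal
  Nothing -> Literal s

decodeString :: [String] -> Token -> String
decodeString prev (Ref i)     = prev !! (i - 1)
decodeString _    (Literal s) = s

-- A whole sample is encoded string by string against its predecessor.
encodeSample :: [String] -> [String] -> [Token]
encodeSample prev = map (encodeString prev)

decodeSample :: [String] -> [Token] -> [String]
decodeSample prev = map (decodeString prev)
```

For example, `encodeSample ["xterm","firefox"] ["firefox","xterm","emacs"]` yields `[Ref 2, Ref 1, Literal "emacs"]` — only the newly appeared window title is written out in full. Note that `decodeString` returns the string from the previous sample itself, which is exactly where the sharing described below comes from.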
This greatly reduces the file size: it is down to 6.2 MB. It also improves memory consumption, thanks to Haskell's support for sharing: when a reference to a string from a previous sample is read, only one instance of that string is kept in memory, even if it occurs several times in the log. This brings the memory consumption down to 264 MB and the runtime to 17 seconds.
I released the changes as version 0.4.5.1 to Hackage, Debian and as a Windows installer. Existing log files are not converted automatically, but new samples will be written in the compressed format. If you want to convert your whole file, you have to stop arbtt-capture, run arbtt-recover, and then move the hopefully noticeably smaller ~/.arbtt/capture.log.recovered to ~/.arbtt/capture.log.
The required code changes were not too big. I somewhat isolated the relevant code in the Data.Binary.StringRef module. Unfortunately, I have to use OverlappingInstances to be able to provide the special instance for String – is there a cleaner way (besides the trick used for the Show class)?
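The shape of the overlap problem looks roughly like this (a boiled-down sketch with made-up class and method names, not the actual Data.Binary.StringRef module; "serialization" here is just a list of strings):

```haskell
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE OverlappingInstances #-}
-- (Modern GHC prefers a per-instance {-# OVERLAPPING #-} pragma.)

class StringRefBinary a where
  putSR :: a -> [String]

instance StringRefBinary Char where
  putSR c = [[c]]

-- Generic list instance, analogous to Binary's instance for [a].
instance StringRefBinary a => StringRefBinary [a] where
  putSR = concatMap putSR

-- The special case that causes the overlap: String = [Char] already
-- matches the list instance above, but must be handled as one unit so
-- that it can be replaced by a one-byte back-reference.
instance StringRefBinary String where
  putSR s = [s]
```

GHC resolves the overlap in favour of the most specific instance, so `putSR ("foo" :: String)` gives `["foo"]` rather than three one-character strings. The Show-class trick alluded to above would instead add a `putList`-style method to the class with a default implementation, overridden in the `Char` instance — avoiding the overlap at the cost of an extra class method.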
That’s also an interesting idea. But the lists are relatively small (< 10 elements), and with my compression scheme, each entry costs three bytes (one for a boolean flag, two for string references). I don’t think a delta algorithm would pay off here.