Three weeks ago, I announced the automatic rule-based time tracker here, and it seems that there are actually users out there :-). Since then, it has recorded more than 240 hours of my computer’s uptime in about 15000 samples. Until now, this data was stored in a simple text file, one line per entry, relying on Haskell’s Show/Read instances for serialization. Although not entirely unexpected, this turned out to be a severe bottleneck: it already takes more than 25 seconds to parse the log file on my computer.
Following advice given to me on the #haskell IRC channel, I switched to a binary representation of the data, using the very nice Data.Binary module. The capture program will automatically detect whether the log file is in the old format, move it away and convert the entries to binary data. And voilà, the statistics program now evaluates my data in about two seconds! This should be fast enough for quite a while, I hope.
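To illustrate the general approach (not arbtt’s actual code), here is a minimal sketch of how Data.Binary replaces Show/Read for this kind of append-only log. The names `appendEntry` and `roundTrip` are my own for this example:

```haskell
import Data.Binary (Binary, decode, encode)
import qualified Data.ByteString.Lazy as BL

-- Append a single encoded entry to the log file.  encode produces a
-- lazy ByteString, so each sample costs just one small append.
appendEntry :: Binary a => FilePath -> a -> IO ()
appendEntry path = BL.appendFile path . encode

-- For well-behaved Binary instances, decode . encode is the identity;
-- this is what replaces read . show from the old text format.
roundTrip :: Binary a => a -> a
roundTrip = decode . encode
```

The win over Show/Read comes from skipping textual parsing entirely: decoding is a direct walk over the bytes rather than a character-level parse.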
In my binary instances, which you can find in Data.hs, I prepended a version tag to each entry. This hopefully allows me to add more fields to the log file later, while still being able to parse the old data directly and without conversion. To still be able to manually inspect the recorded data, the program arbtt-dump was added to the package. The new version is uploaded to hackage as 0.3.0.
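A version-tagged Binary instance along these lines might look as follows. The `Entry` type and its fields are hypothetical simplifications; the real instances live in arbtt’s Data.hs:

```haskell
import Data.Binary (Binary (..), decode, encode, getWord8, putWord8)

-- Hypothetical, simplified entry type.  Each variant is serialized
-- with a leading one-byte version tag.
data Entry = EntryV1 Int      -- version 1: just a timestamp
           | EntryV2 Int Int  -- version 2: timestamp plus an extra field
  deriving (Show, Eq)

instance Binary Entry where
  put (EntryV1 t)   = putWord8 1 >> put t
  put (EntryV2 t x) = putWord8 2 >> put t >> put x
  get = do
    v <- getWord8
    case v of
      1 -> EntryV1 <$> get
      2 -> EntryV2 <$> get <*> get
      _ -> fail ("unknown version tag: " ++ show v)
```

Because every record announces its own version, a new field only requires a new tag; old records keep decoding through the old branch, with no one-off conversion step.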
One thing still worries me: with the old format, I could easily throw away unparseable lines in the log file (e.g. from a partial write) and still read the rest of it. The new code has no such robustness: if the file breaks somewhere, it will be unreadable as a whole, or at least from the broken point onwards. I’m not sure what to do about that problem, but given the very low number of bytes written each time, I hope that it will simply not happen.
A text file is a binary file, so you can use the same method. The only difference between a text file and a binary file in this case is that the latter probably has no unused byte values. So all you need to do, if you are only worried about partial writes and not corruption, is pick a start byte value (e.g. 0xFD), an end byte value (e.g. 0xFE), and an escape value (e.g. 0xFF). At the start of each record output 0xFD 0xFD, at the end output 0xFE, and escape any occurrence of 0xFD, 0xFE or 0xFF within the actual data with 0xFF (i.e. replace 0xFD with 0xFF 0xFD). Now if you ever see an unescaped 0xFD, you know the record you are currently reading is incomplete and a new record starts there. This scheme adds 3 + 3n/256 bytes of overhead on average for n bytes of data, assuming the data bytes are uniformly distributed (which is not very likely, so a prudent choice of escape bytes can probably improve on that).
This scheme will work with any data whatsoever. If you have more information about the format (e.g. the records have a fixed size, they have a structured format in which errors can be detected, or certain bit patterns cannot occur), you could potentially simplify the above scheme.
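The framing scheme described above can be sketched in a few lines. This is my own illustration, not code from arbtt, operating on plain `[Word8]` for clarity (a real implementation would use ByteString):

```haskell
import Data.Word (Word8)

recStart, recEnd, escByte :: Word8
recStart = 0xFD
recEnd   = 0xFE
escByte  = 0xFF

-- Frame one record: 0xFD 0xFD, then the escaped payload, then 0xFE.
frame :: [Word8] -> [Word8]
frame payload = recStart : recStart : concatMap esc payload ++ [recEnd]
  where
    esc b | b `elem` [recStart, recEnd, escByte] = [escByte, b]
          | otherwise                            = [b]

-- Undo the escaping on a payload extracted from between the markers.
unescape :: [Word8] -> [Word8]
unescape (e:b:rest) | e == escByte = b : unescape rest
unescape (b:rest)                  = b : unescape rest
unescape []                        = []
```

A reader that hits an unescaped 0xFD mid-record can discard the partial record and resynchronize at that byte, which is exactly the partial-write robustness the old line-based format had.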
The data produced by Data.Binary is not escaped per se, and I’d need to post-process it to escape it properly; that is added complexity that I’d like to avoid. But maybe I’ll just have to do that eventually...
If your records are small and bounded in size, you could adopt the self-synchronizing properties of UTF-8 into your encoding. (UTF-8 parsers can only lose their place up to six bytes in either direction.)
A TLV (type-length-value) scheme may eliminate your need to escape values, though you may still wish to have sigil T values to aid in well-formedness checks.
I already have some kind of type-value scheme, since I prepend each entry with a version tag, but that is only one byte wide (0x01), so not much help there :-). I guess I could make it a bit wider, and have a good guess where to try re-scanning the file if something is broken. I wonder what a good compromise between wasted space and efficiency would be.
OTOH, given that I find it unlikely that breakage will occur, I could really try to find the next intact record at the 0x01 markers, and then check the data for plausibility. Since the first value in each record is a timestamp, I could check whether it is monotonically increasing and not in the future, and if it isn’t, throw the record away. That should be sufficient.
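The plausibility filter could look something like this. It is only a sketch of the idea under my own assumptions: `plausible` receives the timestamps of whatever candidate records were decoded at the 0x01 markers, in file order, and keeps a strictly increasing, not-in-the-future subsequence:

```haskell
-- Hypothetical recovery pass: after resynchronising at 0x01 version
-- tags, keep only records whose timestamps strictly increase and do
-- not lie in the future; anything else is assumed to be garbage that
-- merely looked like a record.
plausible :: Int    -- current time
          -> [Int]  -- timestamps of candidate records, in file order
          -> [Int]
plausible now = go 0
  where
    go _    []     = []
    go prev (t:ts)
      | t > prev && t <= now = t : go t ts
      | otherwise            = go prev ts
```

Since samples are written once per minute, genuine records are guaranteed to have increasing timestamps, so a false positive would need both a spurious 0x01 tag and a byte pattern that decodes to a timestamp inside the right window.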
So if you happen to have a broken file, contact me; I might implement an arbtt-restore program. (And I should probably move all these programs into one "arbtt command" style program :-))
Right, that’s another option. Although I like, in this case, the fact that I can just stupidly append a record to the end of the file. Since the capturing program will be running permanently and becomes active once per minute, I want to keep both the memory footprint and the CPU usage down.