The Case of the Dying Hard Drive That Flipped Bits

February 29, 2016

The symptoms were hard to notice at first: downloaded files would sometimes be corrupted, especially large files; attempts to fix those downloads (using par2) would more often than not fail. Then it became bizarre; calculating the checksum of those files would sometimes, but not always, result in different values.

The last modified date wasn’t changing, so the file I was testing with was not changing either. So why was its MD5 checksum sometimes different? Making it harder to debug was the fact that calculating a specific file’s MD5 over and over always returned the same result. But if I waited a couple of minutes before trying again, then the result would be different!

Bad memory maybe..? memtest said my four RAM sticks were working fine. So maybe the hard drive or its connection was an issue..? I connected the hard drive using another cable, to another SATA port (using a different controller), but the problem persisted.

Then I had an idea: maybe the problem was happening all the time, but the OS disk cache was hiding it whenever the data was served from the cache instead of being read from the disk… So I found how to manually flush the disk cache on Linux (sync ; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches') and re-ran my MD5 calculation test. Sure enough, the file’s checksum was now wrong almost every time!
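A minimal sketch of that test in Python, assuming Linux and root privileges for the cache flush. The function names are illustrative and do not come from the original script:

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def drop_page_cache():
    """Flush dirty pages, then ask the kernel to drop its caches.

    Linux only and requires root; equivalent to
    sync ; echo 3 > /proc/sys/vm/drop_caches
    """
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def checksums_after_cache_drops(path, rounds=5):
    """Hash the same file several times, dropping the cache before each read."""
    results = []
    for _ in range(rounds):
        drop_page_cache()
        results.append(md5_of_file(path))
    return results
```

On a healthy drive every entry returned by checksums_after_cache_drops() should be identical; more than one distinct value for an unmodified file points at the read path rather than at the file.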

Now that I was able to easily reproduce the problem, I had to find what it was… So I created yet-another-PHP-script ^[1] that would: 1) read the file into an array of bytes, 2) flush the disk cache, 3) re-read the file into another array, 4) compare both arrays byte-by-byte. The results were quite astonishing: each time a byte was different between the two arrays, one of the bytes was always exactly 0x10 (decimal 16) lower than the other:

First read MD5: 774d85d8ee57e6bce78533afe309e972
Second read MD5: 40a21030006b86b63becaf9ff0a19cae
Wrong byte at pos 1506201: 0x2c vs 0x3c
Wrong byte at pos 2644889: 0xee vs 0xfe
Wrong byte at pos 33971097: 0xe8 vs 0xf8
Wrong byte at pos 63228825: 0x60 vs 0x70

First read MD5: ef3eff5737e31699aff5f5c77ae28503
Second read MD5: 40a21030006b86b63becaf9ff0a19cae
Wrong byte at pos 21293977: 0x28 vs 0x38
Wrong byte at pos 41180057: 0x8b vs 0x9b
Wrong byte at pos 43686809: 0x03 vs 0x13
Wrong byte at pos 91962265: 0x0a vs 0x1a

First read MD5: 373a1bc985d25c076b70417b703479fc
Second read MD5: 40a21030006b86b63becaf9ff0a19cae
Wrong byte at pos 19692441: 0x21 vs 0x31

First read MD5: 40a21030006b86b63becaf9ff0a19cae
Second read MD5: 93db3cd743c5712895d97f03fcefa832
Wrong byte at pos 17804185: 0x53 vs 0x43

Even better, the read errors never occurred at the same position twice, and the wrong byte was always 0x10 lower than the correct value.
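The byte-by-byte comparison can be sketched in Python as follows. This is an illustration, not the PHP script from ^[1]; the helper names are made up here. It also checks what the output above suggests: every mismatching pair differs in a single bit, the one worth 0x10.

```python
def find_mismatches(first, second):
    """Return (position, first_byte, second_byte) for every differing byte.

    Assumes both reads have the same length; zip() stops at the shorter one.
    """
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(first, second))
        if a != b
    ]


def is_single_0x10_flip(a, b):
    """True when the two bytes differ only in the 0x10 bit."""
    return a ^ b == 0x10


# Byte pairs copied from the output above.
pairs = [
    (0x2C, 0x3C), (0xEE, 0xFE), (0xE8, 0xF8), (0x60, 0x70),
    (0x28, 0x38), (0x8B, 0x9B), (0x03, 0x13), (0x0A, 0x1A),
    (0x21, 0x31), (0x53, 0x43),
]
for a, b in pairs:
    print(f"0x{a:02x} vs 0x{b:02x}: xor=0x{a ^ b:02x}")
```

Every pair XORs to 0x10, which means the bad read always had that one bit cleared where the good read had it set.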

So now I had a plan: copy the data from that evil drive onto another working drive, fix it ^[2] (knowing it was possible, from what I had just discovered), and throw that drive as far as possible from my home server, hopefully soon enough that its bad karma would not have infected my other hard drives!

That plan could be summarized like this:

1. Copy the data from the evil drive onto another drive; let’s call it the savior drive.

sudo rsync -av /mnt/evil_drive/* /mnt/savior_drive/

2. Re-run rsync, this time only to compare the checksums of the source and target files, and log each file whose checksums differ.

sudo rsync -acv --dry-run /mnt/evil_drive/* /mnt/savior_drive/ >> ${HOME}/files_to_check.txt

3. Fix the files on the savior drive by comparing the bytes on there with the bytes on the evil drive, and keeping the larger of the two when they mismatch.

sudo php ${HOME}/fix_files.php ${HOME}/files_to_check.txt
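The repair rule in step 3 can be illustrated with a short Python sketch. It is not the fix_files.php script from ^[2]; the function name and the sample bytes are hypothetical, and it assumes the only corruption is the pattern seen above, where a bad read is always lower than the correct byte.

```python
def repair(copied, reread):
    """Merge two reads of the same file, keeping the larger byte at each position.

    Assumption: a corrupted byte is always lower than the correct one,
    so the larger of the two reads is the good value.
    """
    if len(copied) != len(reread):
        raise ValueError("reads have different lengths")
    return bytes(max(a, b) for a, b in zip(copied, reread))


# Hypothetical example: each read has one bad byte, at different positions.
good = bytes([0x3C, 0xFE, 0x31])
copy_on_savior = bytes([0x2C, 0xFE, 0x31])  # first byte read 0x10 too low
reread_from_evil = bytes([0x3C, 0xFE, 0x21])  # last byte read 0x10 too low
print(repair(copy_on_savior, reread_from_evil) == good)
```

This only works because the errors did not repeat at the same position: if both reads were wrong at the same byte, taking the larger value would not recover it.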

A couple of hours later, I had a copy of all the data from that bad (BAD!) drive somewhere else, and I was (pretty) sure that all that data was OK.

Why did this happen? What caused it? When did it start? I will probably never know. But what I learned is that hard drives can (and will) die in very unusual ways, and for the few souls lucky enough to notice a pattern in the errors that occur, it is possible to not lose any of the data that was stored on those drives.

Not that saving that data was really important, mind you. You have backups, and are keeping multiple copies on different hard drives, of all your important data, right? I am.

Refs:
^[1] The script I used to test: https://gist.github.com/gboudreau/219d90e30acad5131947
^[2] The script I used to fix the copied data: https://gist.github.com/gboudreau/734a80848486a9dd0e2e