
The dumbfuckery of storage drive data integrity

With all the complicated standards that cover data integrity, it is outrageous how useless it all still is.

I got a 4 TB WD Black harddisk that is probably 99.99999999% just fine, but it is telling me it will die in less than 24 hours and that I’d better throw it in the trash.

If your harddisk suffers from a bad sector, the file occupying it is ruined. But you usually won’t even know that until you try to access it manually, directly, intentionally, and that fails.

Generally you might get a warning of some kind, so at least you know there’s a defect.
Now, to find out where it is, you need a special tool like GSmartControl that can show you the drive’s error log, and then you hope that the 24 or so latest entries, all spammed with the same error, point at only one bad sector.
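
For reference, GSmartControl is a front end for smartmontools, so the same log can be pulled from the command line. A minimal sketch, assuming the drive shows up as /dev/sdc (the device name I use further below):

  sudo smartctl -l error /dev/sdc   # ATA error log; UNC read errors include the failing LBA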

Then you would have to use the Windows command-line tool nfi.exe to find out what file, if any, is using that sector. Once you have that info, you know which file is damaged and thus lost.
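
A hedged sketch of that query, from what I remember of the syntax (nfi.exe comes from Microsoft’s old OEM Support Tools; the \Device\Harddisk2\DR2 path and the sector number are placeholders for your own disk):

  rem maps an absolute (physical) sector number to the partition and file that own it
  nfi.exe \Device\Harddisk2\DR2 6960709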

But now the real insanity starts: Unless repair attempts on the sector succeed, no number of failed, time-wasting read attempts will convince the firmware to reallocate the sector, i.e. replace it with a spare or simply block access. I haven’t seen it happen, so I cannot confirm it, but AFAIU reallocation only occurs the next time you WRITE to that sector. BUT… things like a secure erase of the file won’t do, because the sector also needs to be unused by the filesystem before you attempt to write to it.
AND apparently even a secure erase still leaves a filesystem remnant, basically a nameless, undeletable garbage file, which seems to remain part of the filesystem for God knows how long. I verified that the bad sector is not used by any file, and yet chkdsk gets stuck for all eternity on the file check at the start, as if that sector were still part of the filesystem contents. (And if you try chkdsk /b, it may take far too long to even reach the damaged spot, and even then it is not guaranteed to succeed.)
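
For completeness, how one can verify that a sector is unused: NTFS thinks in clusters, not sectors, so the LBA has to be converted first. A sketch with assumed numbers (512-byte sectors, 4 KB clusters, a partition starting at LBA 2048, drive letter D:; check all four against your own disk):

  rem cluster = (LBA - partition_start_LBA) / sectors_per_cluster
  rem         = (6960709 - 2048) / 8 = 869832
  fsutil volume querycluster D: 869832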

So to clarify: Failing to read a single damn spot on the disk for half an hour, which produces tons of SMART log spam about the exact location, does not convince the drive to replace that bit.

Also, DiskGenius can do targeted scans of the disk surface to identify damaged regions, but its repair attempt fails just like chkdsk: an endless freeze. Sadly, it doesn’t offer to reallocate the sector without those futile attempts. (It did manage to repair a weak sector, but since SMART does not report a new reallocation event, I have to assume it merely repaired it through rewrites, so it could start making trouble again soon.)

HDDScan offers such a surface test too, but its output seems to keep reporting bad sectors long after the one that is actually bad.

But the most frustrating thing about all this pretend-SMARTness that is F.U.C.K.I.N.G.D.U.M.B. is this: while I am being informed that “Current Pending Sector Count = 1”, it does not tell me which sector is the pending one I haven’t identified yet, even though the drive must know the position in order to count it.
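
You can see the problem in the raw attribute table itself; a sketch (same assumed device):

  sudo smartctl -A /dev/sdc
  # the relevant rows are 5 Reallocated_Sector_Ct, 197 Current_Pending_Sector
  # and 198 Offline_Uncorrectable: bare counts, no LBAs anywhere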

The more or less realistic options you have:

  • Go to Linux and use command-line hackery with smartctl and hdparm to manually, surgically replace the bad sector (see the sketch after this list for locating it first).
  • Alter the disk’s partitioning to exclude the area with the defective sector. (DiskGenius can tell you the megabyte region on the disk that corresponds to the sector; with 512-byte sectors that is simply sector ÷ 2048, so sector 6960709 sits around the 3399 MB mark.)
  • Slow-format or secure-erase the whole disk. (But AFAIR I have had issues with that in the past: the system pretended the problems had been repaired, but they recurred later, after I had copied all the data back onto it. Not during those writes, apparently, but afterwards.)
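
As for the first option, locating the sector without all the GUI tools: smartctl can run a selective self-test over the whole surface and log the first failing LBA. A hedged sketch, assuming the drive supports selective self-tests and is still /dev/sdc:

  sudo smartctl -t select,0-max /dev/sdc   # surface scan as a drive self-test
  sudo smartctl -l selective /dev/sdc      # progress of the selected span
  sudo smartctl -l selftest /dev/sdc       # completed tests list LBA_of_first_error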

So, since I have an installation of that relatively shitty Linux, I went there, and after some basic hassle this command did the trick:

  sudo hdparm --write-sector 6960709 --yes-i-know-what-i-am-doing /dev/sdc

The manual lists it together with --repair-sector (one is an alias for the other), and I thought ‘damn, not that again’. But it finished instantly, with no delay whatsoever, and reading the sector back only yielded zeroes, so I wondered whether it had even hit the correct sector. Yet after checking back in Windows, the obstacle was gone!
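
For the skeptical: the read-back check is its own hdparm option, a quick sketch with the same sector and device:

  sudo hdparm --read-sector 6960709 /dev/sdc   # dumps the 512-byte sector as hex
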
So apparently it can be easy to accomplish such simple and intelligent things!
(But as I keep pointing out, Linux has its own infuriating shortcomings.)

I am still not over my skepticism, though. The reallocated sector count is still the same, so I have to wonder whether the repair attempts under Windows just failed for some reason, or whether this method under Linux simply doesn’t trigger the counter.
Also, Offline Uncorrectable is still 1, and I have read that it is supposed to go back to 0 when everything is fine rather than being a history-type statistic, although that info is unreliable.
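
What I keep re-checking now, as a one-liner (attribute names vary a bit between vendors):

  sudo smartctl -A /dev/sdc | grep -iE 'realloc|pending|uncorrect'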


Now, what SHOULD happen in this data-safety system, as an integral part of it: the OS warns you and tells you which file is affected by a defective sector and no longer in its original state, and the drive reallocates sectors that are merely weak, because in my view there is no such thing as a repaired sector, as my own experience with this drive has shown. If a sector is weak, it cannot be trusted anymore, and there should be plenty of spares. It is no drama to lose even a couple of megabytes, although it will rarely come to that, and if it does, the harddisk is probably finished anyway. But to turn a single sector into such an obstacle is nuts: a late surprise of data corruption and a huge drama. Very unprofessional and sloppy, kinda defying the purpose of accurately tracking what is happening with the data.

The existing system looks like it was designed not primarily to protect your data, but to convince you to buy a new product.