Discussion:
[zfs-discuss] when to discard a disk
devsk
2015-02-22 15:46:34 UTC
So, I got an email (smartd) that one of my Seagate 1TB drives in my RAIDZ2
pool had developed a few pending sectors, which eventually moved to
Reported_Uncorrect. You can see the smartctl output below.

The question is: is it time to throw away this disk and replace or should I
leave it in there until it gets worse? When do I discard it?

The scrub seems to have found nothing wrong, which tells me that the disk
itself corrected the issue (but then, Reallocated_Sector_Ct is still 0). I'm
somewhat confused about that; it looks to me like those counters are not
being updated correctly.

Thanks for your help folks.
-devsk

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   107   099   006    Pre-fail  Always       -       147604273
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       167
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       37139674
  9 Power_On_Hours          0x0032   066   066   000    Old_age   Always       -       30565
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       167
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   090   090   000    Old_age   Always       -       10
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   053   045    Old_age   Always       -       35 (Min/Max 26/40)
194 Temperature_Celsius     0x0022   035   047   000    Old_age   Always       -       35 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   036   014   000    Old_age   Always       -       147604273
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       30933 (207 229 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2857545767
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1510092958

Turbo Fredriksson
2015-02-22 15:52:04 UTC
Post by devsk
When do I discard it?
Does depend on how much money you got left over, doesn't it? :)

As in, if the cost of a new disk won't even register, then I'd replace it just
on principle.

But on the other hand, if you don't value your data (or are low on funds), I'd
run it until it starts making screeching noises..


Me, I'm in the second category unfortunately :(.
--
Geologists recently discovered that "earthquakes" are
nothing more than Bruce Schneier and Chuck Norris
communicating via a roundhouse kick-based cryptosystem.

devsk
2015-02-22 16:40:40 UTC
Post by devsk
When do I discard it?
Does depend on how much money you got left over, doesn't it? :)
Oh yeah...:) Always looking to save money. That's why the question.

While the scrub was going on, I was running a watch on "smartctl -a /dev/sde
| sed -n '/^ID#/,/^$/p'" and some attributes were heading up:

Raw_Read_Error_Rate,Seek_Error_Rate,Hardware_ECC_Recovered,Head_Flying_Hours
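In full, the watch looks roughly like this (the device name and the 60 s
refresh interval are just examples):

  watch -n 60 'smartctl -a /dev/sde | sed -n "/^ID#/,/^$/p"'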

And then, I did the same thing on another Seagate drive (which is doing
fine right now in terms of Reported_Uncorrect count) and it shows the same
thing. Those attributes are going up all the time on Seagate drives.

-devsk

Durval Menezes
2015-02-22 16:53:18 UTC
Hello Devsk,
Post by devsk
Post by devsk
When do I discard it?
Does depend on how much money you got left over, doesn't it? :)
Oh yeah...:) Always looking to save money. That's why the question.
Does your disk support TLER/SCTERC? If not, I'd discard it immediately (or
use it for offline storage or some other, less critical use). Erroring
disks in ZFS arrays (actually, in *any* arrays), without TLER/SCTERC set,
are an open invitation to disaster.
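For anyone wanting to check: on drives that support it, something along these
lines should work with a reasonably recent smartctl (the device name is just
an example, and the setting does not survive a power cycle, so reapply it at
boot):

  # query the current SCT error recovery timeouts
  smartctl -l scterc /dev/sde
  # set both read and write recovery timeouts to 7.0 seconds (70 deciseconds)
  smartctl -l scterc,70,70 /dev/sde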
Post by devsk
While the scrub was going on, I was doing a watch on "smartctl -a /dev/sde
Raw_Read_Error_Rate,Seek_Error_Rate,Hardware_ECC_Recovered,Head_Flying_Hours
And then, I did the same thing on another Seagate drive (which is doing
fine right now in terms of Reported_Uncorrect count) and it shows the same
thing. Those attributes are going up all the time on Seagate drives.
This is normal:
- Head_Flying_Hours is just how many hours the disk has been operating with
its heads in an unparked position, i.e. it increments by exactly 1 for each
hour of normal operation.
- Raw_Read_Error_Rate, Seek_Error_Rate, Hardware_ECC_Recovered: these mean the
disk's built-in FEC/ECC is doing its job, correcting physical errors and
passing good data up to the host layer. No modern disk operates with zero
errors at the physical layer (and none has for many years); they all employ
heavy built-in FEC/ECC to correct these errors and make them invisible to the
host.

Cheers,
--
Durval.
Post by devsk
-devsk
devsk
2015-02-22 16:57:53 UTC
Post by Durval Menezes
Hello Devsk,
Post by devsk
Post by devsk
When do I discard it?
Does depend on how much money you got left over, doesn't it? :)
Oh yeah...:) Always looking to save money. That's why the question.
Does your disk support TLER/SCTERC? If not, I'd discard it immediately (or
use it for offline storage or some other, less critical use). Erroring
disks in ZFS arrays (actually, in *any* arrays), without TLER/SCTERC set,
are an open invitation to disaster.
Yeah, it does, and that's why the disk is still in the system...:) It would
have killed my pool performance yesterday if that were not the case.
Post by Durval Menezes
Post by devsk
While the scrub was going on, I was doing a watch on "smartctl -a
Raw_Read_Error_Rate,Seek_Error_Rate,Hardware_ECC_Recovered,Head_Flying_Hours
And then, I did the same thing on another Seagate drive (which is doing
fine right now in terms of Reported_Uncorrect count) and it shows the same
thing. Those attributes are going up all the time on Seagate drives.
- Head_Flying_Hours is only how many hours the disk is operating with its
heads in an unparked position, ie, it increments by exactly 1 for each hour
of normal operation
- Raw_Read_Error_Rate,Seek_Error_Rate,Hardware_ECC_Recovered: means the
built-in disk FEC/ECC is doing its job correcting physical errors and
passing good data up to the host layer. No modern disk operates with zero
errors at the physical layer (and actually haven't for many years), and
they all employ heavy built-in FEC/ECC to correct these errors in order to
make them invisible to the host.
That's what I thought as well.

So, I am leaving the drive in there until it really starts making
noises...:)

-devsk

Uncle Stoat
2015-02-23 14:12:52 UTC
Post by devsk
While the scrub was going on, I was doing a watch on "smartctl -a
Raw_Read_Error_Rate,Seek_Error_Rate,Hardware_ECC_Recovered,Head_Flying_Hours
Normal.

The ones to watch are Current Pending Sectors, Reallocated Sectors and
Offline Uncorrectable Sectors - and don't look at the raw figures, look
at the normalised value (it counts down from 100 to the threshold)

A bunch of pending sectors or offline uncorrectable means you should
take the drive offline and try to force it to map them out (usually the
easiest way is an ATA secure erase). It would be nice if ZFS did a 0x00
write(*) when it got bad results before attempting to rewrite the data
but that hasn't happened yet.

(*) Writing 0x00 to a sector will cause the drive to test properly and
map it out if broken. This applies to all SCSI/SAS/SATA/ATA drives with
automatic sector remapping.
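For reference, the secure-erase route with hdparm goes roughly like this - a
sketch only: it wipes the entire drive, the drive must show as "not frozen",
and the device name and password are placeholders:

  # check the Security section of the identify data (look for "not frozen")
  hdparm -I /dev/sdX
  # set a temporary user password, then issue the erase (DESTROYS ALL DATA)
  hdparm --user-master u --security-set-pass p /dev/sdX
  hdparm --user-master u --security-erase p /dev/sdX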

If load cycle count is climbing (especially if it's climbing rapidly)
then you need to tweak your idle timers.
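With hdparm that is typically something along these lines (a sketch only:
/dev/sdX is a placeholder, not every drive honours APM, and WD Greens need
idle3ctl/wdidle3 for their separate idle3 timer):

  # show the current APM level, then raise it so the heads park less aggressively
  hdparm -B /dev/sdX
  hdparm -B 254 /dev/sdX
  # optionally disable the standby (spindown) timeout as well
  hdparm -S 0 /dev/sdX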

High fly writes are associated with excess case vibration.

Temperature is worth keeping an eye on to ensure the drive doesn't
exceed 45C, as temperatures above that are associated with decreasing
drive life. Interestingly, Google's drive survey from a few years back
found that drive temps _below_ ~30C are also associated with decreasing
drive life, so there's merit in letting them stay warm.

"smartctl -x" will give you more info, but be careful interpreting it.

Regular short and long surface read tests are a good idea. Man smartd.
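A minimal smartd.conf line for that would look something like this (device
name, schedule and mail address are only examples - see the -s regex format
in man smartd.conf):

  # monitor everything, short self-test daily at 02:00,
  # long self-test Saturdays at 03:00, mail root on problems
  /dev/sde -a -o on -S on -s (S/../.././02|L/../../6/03) -m root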

Disallowing drive spindown will increase power consumption by a few
dollars/year, but in my experience start/stop cycles kill drives even
quicker than head unload cycles. IMO this is mostly down to uneven
thermal stresses on mechanical components.


Durval Menezes
2015-02-23 14:27:57 UTC
Hello Uncle,
Post by devsk
While the scrub was going on, I was doing a watch on "smartctl -a
Post by devsk
Raw_Read_Error_Rate,Seek_Error_Rate,Hardware_ECC_
Recovered,Head_Flying_Hours
Normal.
The ones to watch are Current Pending Sectors, Reallocated Sectors and
Offline Uncorrectable Sectors - and don't look at the raw figures, look at
the normalised value (it counts down from 100 to the threshold)
I look at both -- I've noticed that some brands/models are way too
optimistic regarding the normalization.
Post by devsk
A bunch of pending sectors or offline uncorrectable means you should take
the drive offline and try to force it to map them out (usually the easiest
way is a ATA secure erase). It would be nice if ZFS did a 0x00 write(*)
when it got bad results before attempting to rewrite the data but that
hasn't happened yet.
(*) Writing 0x00 to a sector will cause the drive to test properly and map
it out if broken. This applies to all SCSI/SAS/SATA/ATA drives with
automatic sector remapping.
I didn't know that. In fact, I thought that *any* write to a known-crapped
sector (ie, one identified as pending by SMART) would result in its
remapping. What's your source for this info?
Post by devsk
If load cycle count is climbing (expecially if it's climbing rapidly) then
you need to tweak your idle timers.
+1.
Post by devsk
High fly writes are associated with excess case vibration.
That also matches my experience when it's constant. If not, perhaps someone
hard-bumped the rack inadvertently... I've seen that once or twice.
Post by devsk
Temperature is worth keeping an eye on to ensure the drive doesn't exceed
45C as temperatures above that are associated with decreasing drive life -
interestingly from Google's drive survey a few years back drive temps
_below_ ~30C are also associated with decreasing drive life so there's
merit in letting them stay warm.
"smartctl -x" will give you more info, but be careful interpreting it.
Regular short and long surface read tests are a good idea. Man smartd.
Actually, "man smartd.conf"
Post by devsk
Disallowing drive spindown will increase power consumption by a few
dollars/year but in my experience start/stop cycles kill drives even
quicker than head unload cycles. IMO This is mostly down to uneven thermal
stresses on mechanical components.
Uncle Stoat
2015-02-23 14:41:30 UTC
Post by Uncle Stoat
(*) Writing 0x00 to a sector will cause the drive to test properly
and map it out if broken. This applies to all SCSI/SAS/SATA/ATA
drives with automatic sector remapping.
I didn't know that. In fact, I thought that *any* write to a
known-crapped sector (ie, one identified as pending by SMART) would
result in its remapping. What's your source for this info?
Discussions many years ago with Andre Hedrick, with subsequent
confirmation from various sources.

linux' "hdparm --repair-sector" command uses 0x00 writes for that
reason. It's covered in the man page.

I believe the reasoning is that 512 * 0x00 will seldom happen under
normal circumstances so it's safe to treat it as a special case.
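Roughly, assuming the offending LBA is known from the kernel log or a SMART
self-test log (the LBA and device below are made-up examples):

  # confirm the sector really is unreadable
  hdparm --read-sector 123456789 /dev/sdX
  # overwrite it with zeros so the drive can remap it if it is genuinely bad
  hdparm --repair-sector 123456789 --yes-i-know-what-i-am-doing /dev/sdX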
Post by Uncle Stoat
High fly writes are associated with excess case vibration.
This is also my experience in case it's constant. If not, perhaps
someone hard-bumped the rack inadvertently... I've seen it once or twice.
Yes, I should have said increasing numbers for that.


Gordan Bobic
2015-02-23 14:56:11 UTC
Post by Uncle Stoat
(*) Writing 0x00 to a sector will cause the drive to test properly
Post by Uncle Stoat
and map it out if broken. This applies to all SCSI/SAS/SATA/ATA
drives with automatic sector remapping.
I didn't know that. In fact, I thought that *any* write to a
known-crapped sector (ie, one identified as pending by SMART) would
result in its remapping. What's your source for this info?
Discussions many years ago with Andre Hedrick, with subsequent
confirmation from various sources.
linux' "hdparm --repair-sector" command uses 0x00 writes for that reason.
It's covered in the man page.
I believe the reasoning is that 512 * 0x00 will seldom happen under normal
circumstances so it's safe to treat it as a special case.
If the sector is marked as pending, _any_ write to it will cause a
reallocation unless the disk firmware is doing something bad (such as
reusing a previously duff sector if the new data stuck to it well enough).

Gordan Bobic
2015-02-23 14:49:22 UTC
Post by Durval Menezes
Post by devsk
A bunch of pending sectors or offline uncorrectable means you should take
the drive offline and try to force it to map them out (usually the easiest
way is a ATA secure erase). It would be nice if ZFS did a 0x00 write(*)
when it got bad results before attempting to rewrite the data but that
hasn't happened yet.
(*) Writing 0x00 to a sector will cause the drive to test properly and
map it out if broken. This applies to all SCSI/SAS/SATA/ATA drives with
automatic sector remapping.
I didn't know that. In fact, I thought that *any* write to a known-crapped
sector (ie, one identified as pending by SMART) would result in its
remapping. What's your source for this info?
I think you are both talking about the same thing. It's just that using
/dev/zero as the data source is faster than using /dev/urandom. :)

Kash Pande
2015-02-22 16:52:31 UTC
Make sure your backups are good on a regular (daily?) basis - then
replace it once performance tanks. I always make sure to have a highly
available setup, so there are a few nodes with identical data - one can
crash without hurting the others. I use ZVOLs and not files - so no
cluster setup is needed.
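For what it's worth, the replication itself can be as simple as a recursive
snapshot piped over ssh - a rough sketch, with made-up pool, snapshot and
host names:

  # snapshot everything, then ship the full replication stream to a second box
  zfs snapshot -r tank@backup-20150222
  zfs send -R tank@backup-20150222 | ssh backuphost zfs receive -Fdu backuppool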


Kash
Post by devsk
So, I got an email (smartd) that one of my Seagate 1TB drives in my
RAIDZ2 pool had developed few pending sectors, which then moved to
Reported_Uncorrect eventually. You can see the smartctl output below.
The question is: is it time to throw away this disk and replace or
should I leave it in there until it gets worse? When do I discard it?
The scrub seems to have found nothing wrong, which tells me that the
disk itself corrected the issue (but then, the Reallocated_Sector_ct
is 0). Sort of confused about that. Looks to me like those counters
are not updated correctly.
Thanks for your help folks.
-devsk
[SMART attribute table snipped; see the original post above]
devsk
2015-02-22 16:59:52 UTC
Post by Kash Pande
Make sure your backups are good on a regular (daily?) basis - then
replace it once performance tanks. I always make sure to have a high
available setup so there's a few nodes with identical data - one can crash
without hurting the others. I use ZVOL and not files - so no cluster setup
is needed.
Kash
Oh yeah. This data is doubly (actually, triply in parts like photos) backed
up.

-devsk

Gordan Bobic
2015-02-22 16:57:47 UTC
Post by devsk
So, I got an email (smartd) that one of my Seagate
After the Subject: line, I got as far as this before my answer would become
"yes" without needing to read further. ;-)

Post by devsk
1TB drives in my RAIDZ2 pool had developed few pending sectors, which then
moved to Reported_Uncorrect eventually. You can see the smartctl output
below.
The question is: is it time to throw away this disk and replace or should
I leave it in there until it gets worse? When do I discard it?
The scrub seems to have found nothing wrong, which tells me that the disk
itself corrected the issue (but then, the Reallocated_Sector_ct is 0). Sort
of confused about that. Looks to me like those counters are not updated
correctly.
Thanks for your help folks.
-devsk
[SMART attribute table snipped; see the original post above]
Do you run regular (daily short, weekly long) smart self-tests?
Do you have Write-Read-Verify enabled on the drive?
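For the former, a self-test can be kicked off and checked by hand as well - a
minimal sketch, with the device name just an example:

  smartctl -t long /dev/sde      # start a long (surface) self-test in the background
  smartctl -l selftest /dev/sde  # read the self-test log once it has finished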

devsk
2015-02-22 17:52:54 UTC
Post by Gordan Bobic
Post by devsk
So, I got an email (smartd) that one of my Seagate
After the Subject: line, I got as far as this before my answer would
become "yes" without needing to read further. ;-)
Surprisingly, my experience with Seagate has been mostly positive. If I
look at the last 10 years of data from my own usage, Western Digital is the
worst brand. I had to replace their RE series drives (the supposedly
semi-enterprise grade drives) more than once. So, in my experience I have
paid for shipping only on WDC drives. Seagate has never failed prematurely
on me.

The Barracuda 7200.12 that we are talking about is 4 years old and it's not
dead yet... and it supports SCTERC (all Seagate drives in my system seem to),
unlike all the WDC drives (mostly Blacks, after having lost faith in RE) I
have in my system right now. That leaves HGST drives, which I have great
faith in (they have never failed on me and support SCTERC), but then, they
are under WDC now. So, I don't know.

-devsk

Durval Menezes
2015-02-22 18:02:48 UTC
Hello Devsk and a Gordan,
Post by devsk
Post by Gordan Bobic
Post by devsk
So, I got an email (smartd) that one of my Seagate
After the Subject: line, I got as far as this before my answer would
become "yes" without needing to read further. ;-)
Post by devsk
Surprisingly, my experience with Seagate has been mostly positive. If I
look at last 10 years of data from my own usage, Western Digital is the
worst brand. I had to replace their RE series drives (the supposedly
semi-enterprise grade drives) more than once. So, I have paid in shipping
only on WDC drives in my experience. Seagate has never failed pre-maturely.
Post by devsk
The Barracuda 7200.12 that we are talking about is 4 years old and its
not dead yet...and it supports SCTERC (all Seagate drives in my system seem
to do), unlike all WDC drives (mostly Blacks, after having lost faith in
RE) I have in my system right now.

My experience mostly agrees with yours, with the exception that I had a lot
of Seagate drives fail before the warranty was over. But with WD my
experience was even worse: despite being an order of magnitude less
numerous than the Seagates in my installed base, *all* of the WDs failed.
Post by devsk
That leaves HGST drives, which I have a great faith in (never failed on
me and support SCTERC) but then, they are under WDC now. So, I don't know

I only started using HGSTs 3 years ago, and I have very few of them,
but they seem really solid so far, way better than the Seagates (let's not
even mention the WDs). Right now they are my go-to brand for new disks, and
will remain so until the new ownership manages to fsck things up completely
(let's hope they take a really long time to do that).

Cheers,
--
Durval.
Post by devsk
-devsk
-devsk
Uncle Stoat
2015-02-23 14:32:01 UTC
Post by Durval Menezes
I've started to use HGSTs only 3 years ago, and i have very few of them,
but they seem really solid so far, way better than the Seagates (let's
not even mention the WDs). Right now they are my go-to brand for new
disks, and will remain so until the new ownership manages to fsck things
up completely (lets hope they take a really lung tone for that).
The only reason it hasn't happened already is that the Chinese
anti-monopoly authority won't let them (they're fully aware that a
duopoly is just as bad as a monopoly).

Thankfully they show no sign of changing that stance. WDC isn't even
allowed to helicopter in execs from that side of the fence.

There's been no meaningful R&D performed on HDDs since 2008 (that's when
the last research lab was closed). What's hitting the market now is the
final fruit of that research. HAMR will be the last development and
then things are likely to stagnate.

Hopefully 3D SSD tech will give us 10TB drives before then (WD have just
announced 10TB spinners, but they're shingled).





Gordan Bobic
2015-02-23 14:54:38 UTC
Post by Durval Menezes
I've started to use HGSTs only 3 years ago, and i have very few of them,
Post by Durval Menezes
but they seem really solid so far, way better than the Seagates (let's
not even mention the WDs). Right now they are my go-to brand for new
disks, and will remain so until the new ownership manages to fsck things
up completely (lets hope they take a really lung tone for that).
The only reason it hasn't happened already is that the chinese
anti-monopoly authority won't let them (they're fully aware that a duopoly
is just as bad as a monopoly)
Technically it's a triopoly. Samsung's 3.5" disk unit went to Toshiba,
which was, IIRC, one of the conditions of Seagate eating the rest of
Samsung's disk manufacturing operations. Toshiba was only manufacturing
2.5" disks until then.
Post by Durval Menezes
Thankfully they show no sign of changing that stance. WDC isn't even
allowed to helicoptor in excs from that side of the fence.
There's been no meaningful R&D performed on HDDs since 2008 (that's when
the last research lab was closed). What's hitting the market now is the
final fruit of that research. HAMR will be the last developement and then
things are likely to stagnate.
Yup. And 3.5" disks are already so capacious that they are dangerous.
It's the main reason why I am switching to using 2.5" disks where possible.
It keeps the disks down to a sensible individual size while allowing
similar physical density.
Post by Durval Menezes
Hopefully 3D SSD tech will give us 10Tb drives before then (WD have just
announced 10TB spinners, but they're shingled)
The main thing at the moment is the cost. It is coming down annoyingly
slowly, but it is coming down with every new SSD model released.

Gordan Bobic
2015-02-22 19:05:45 UTC
Post by devsk
The Barracuda 7200.12 that we are talking about is 4 years old and its not
dead yet...
I have quite a selection of Barracuda 7200.11 and 7200.12 drives, and the
cumulative failure rate on them within the warranty period (5 years at the
time!) in my various servers has been about 120% (as in, on average all
have been replaced at least once, with some more than once).
Post by devsk
and it supports SCTERC (all Seagate drives in my system seem to do),
The 7200.12 Barracudas were the last ones to have SCTERC. After that
Seagate dropped the feature.
Their other redeeming feature is that they have WRV, unlike any other
drives that I own. It's a very handy feature. If you can live with halving
the write performance (both IOPS and MB/s), it will result in the errors
being picked up much sooner than they otherwise would. I suspect this has
also been removed since, as it leads to higher return rates...
Post by devsk
unlike all WDC drives (mostly Blacks, after having lost faith in RE) I
have in my system right now.
I only have WD greens with the exception of one WD blue, and while they
have been a lot more reliable than Seagates (which doesn't exactly say
much), they are painfully slow.
Post by devsk
That leaves HGST drives, which I have a great faith in (never failed on me
and support SCTERC) but then, they are under WDC now. So, I don't know.
Luckily, the Chinese government has not yet approved the WD-HGST merger,
supposedly on the grounds of risk of monopoly abuse. This is an awesome
thing - buy HGSTs before their production facilities are allowed to be
assimilated. :-)

As an aside, I noticed that Hitachi/HGST name was suspiciously absent from
the recent reports of malware-in-disk-firmware. Whether their drives have
genuinely not been malwared (by the malware in question at least) or
whether they just weren't mentioned, is debatable. But if the Chinese have
been stalling the approval process, it makes you wonder if maybe it's on
grounds of something other than risk of monopoly creation. Food for
thought, but only if you are wearing your tin foil hat. ;-)

Uncle Stoatwarbler
2015-02-22 20:18:11 UTC
Post by Gordan Bobic
As an aside, I noticed that Hitachi/HGST name was suspiciously absent
from the recent reports of malware-in-disk-firmware
It just wasn't mentioned. The images and strings of the malware
contained segments for every maker, including many no longer with us.


Durval Menezes
2015-02-22 21:03:01 UTC
Hello Gordan, Uncle,
Post by Uncle Stoatwarbler
Post by Gordan Bobic
As an aside, I noticed that Hitachi/HGST name was suspiciously absent
from the recent reports of malware-in-disk-firmware
As far as I understand, the malicious HDD firmware would infect the machine
at every boot by returning infected code at each "choke point" during the
OS boot sequence (MBR, PBR, kernel, disk drivers, etc). I think this would
work mostly with Windows, as it would be really really hard to interfere in
a Linux boot sequence the same way (custom compiled, tar.bz2 kernels with
embedded disk drivers, for instance). To say nothing of (boot) ZFS and a
myriad of other FSs, each one with a very specific disk structure that
would be hard to handle for every possible case in a (probably very
limited) firmware code size.

Post by Uncle Stoatwarbler
It just wasn't mentioned. The images and strings of the malware
contained segments for every maker, including many no longer with us.
Strange. Isn't HGST a foreign (Japanese) company? I can see the NSA leveraging
its way into US and US-controlled manufacturers, but foreign companies
would be a completely different case.

Cheers,
--
Durval.
Uncle Stoatwarbler
2015-02-22 21:08:23 UTC
Post by Durval Menezes
Strange. Isn't HGST a foreign (japanese) company? I can see NSA
leveraging its way into US and US-controlled manufacturers, but foreign
companies would be a completely different case.
HGST Ultrastars/Deskstars/Travelstars are all derived from the IBM hard
drive unit, which was sold off shortly after the "Deathstar" fiasco in the
late 1990s.



Durval Menezes
2015-02-23 14:30:28 UTC
Howdy Uncle,
Post by Uncle Stoatwarbler
Post by Durval Menezes
Strange. Isn't HGST a foreign (japanese) company? I can see NSA
leveraging its way into US and US-controlled manufacturers, but foreign
companies would be a completely different case.
HGST Ultrastars/deskstars/Trevalstars are all derived from the IBM hard
drive unit which was sold off shortly after the deathstar fiasco in the
late 1990s.
Yeah, I remember those. But that has been quite a while... I would be very
surprised if any of the IBM firmware survived to this day, not only because
Hitachi redid the entire line, but also because the embedded MCUs are
completely different from that time...

Cheers,
--
Durval.
Uncle Stoat
2015-02-23 14:45:08 UTC
Post by Uncle Stoatwarbler
HGST Ultrastars/deskstars/Trevalstars are all derived from the IBM hard
drive unit which was sold off shortly after the deathstar fiasco in the
late 1990s.
Yeah, I remember those. But that has been quite awhile... I would be
very surprised if any of the IBM firmware survived to this day, not only
because Hitachi redid the entire line, but also because the embedded
MCUs are completely different from that time...
Embedded code has a tendency to survive architecture changes because
there's so much stuff done in it.

The fun part would be to disassemble the code and see what exactly's
being done (it's almost always ARM-based) but that would require
circumventing the anti-reverse-engineering bits in the images.




Uncle Stoat
2015-02-23 14:24:12 UTC
Post by devsk
Surprisingly, my experience with Seagate has been mostly positive.
All makers have had good and bad patches over the years, but since 2007
the reliability levels of everyone except HGST have plummeted faster
than the warranty periods, whilst pricing has stayed much the same (or
increased slightly) ever since the Thai floods put most of SGT/WDC
production facilities under 3-5 metres of muddy water.

HGST may cost more but the 5 year enterprise warranty gives some
peace-of-mind that they're confident in their product(*)

Compare and contrast with WDC/SGT reducing enterprise warranties from 5
to 3 years or less and consumer drives from 3 years to 12 months.

The latter is probably legal in the EU as the obligation for a 2 year
warranty is on the retailer, not the manufacturer - but it's surprising
that volume retailers are selling drives with a warranty shorter than
the one they're legally required to put on the system - and even more
surprising that they're not simply bundling SSDs by default given the
standard 3-5 (or 10 in some cases) year warranties.


(*) My standard response when vendors quote stupidly high annual support
costs is "You're not very confident in your product, are you? Why should
I buy it if you think it's going to break down that much?"


Gordan Bobic
2015-02-23 14:47:52 UTC
Compare and contrast with WDC/SGT reducing enterprise waranties from 5 to
3 years or less and consumer drives from 3 years to 12 months
The latter is probably legal
I think you mean _illegal_.
in the EU as the obligation for 2 year warranty is on the retailer not the
manufacturer - but it's suprising that volume retailers are selling drives
with a warranty shorter than the one they're legally required to put on the
system
What will happen is that the retailers will get burned and stop selling
their products (and frankly, good riddance if it means fewer Seagate and WD
drives on the streets).
(*) My standard response when vendors quote stupidly high annual support
costs is "You're not very confident in your product, are you? Why should I
buy it if you think it's going to break down that much?"
+1

Uncle Stoat
2015-02-23 17:03:59 UTC
Post by Uncle Stoat
Compare and contrast with WDC/SGT reducing enterprise waranties from
5 to 3 years or less and consumer drives from 3 years to 12 months
The latter is probably legal
I think you mean _illegal_.
No, I mean legal - specifically because the EU warranty requirement is
between the retailer and the customer, not the customer and the
manufacturer (as with the Sale of Goods Act in the UK and most similar
laws across the EU).

Legal != ethical but I can't see their position failing in court on this
as they're not selling direct to consumers.

http://www.thisismoney.co.uk/money/bills/article-1677034/Two-year-warranty-EU-law.html

http://europa.eu/youreurope/citizens/shopping/shopping-abroad/guarantees/index_en.htm
Post by Uncle Stoat
What will happen is that the retailers will get burned and stop selling
their products (and frankly, good riddance if it means fewer Seagate and
WD drives on the streets).
One can hope.

I've heard of stores trying to wriggle out of EU obligations by claiming
that hard drives or computers are office equipment, not consumer
devices, but such a claim would get bounced hard in court.

I've also heard of a practice where vendors dissolve the company and
form a new one every 12 months. Even with exactly the same
owners/staff/buildings/equipment, your warranty obligation is with the
now-defunct old company.


Gordan Bobic
2015-02-23 17:14:17 UTC
Post by Uncle Stoat
Post by Uncle Stoat
Compare and contrast with WDC/SGT reducing enterprise waranties from
5 to 3 years or less and consumer drives from 3 years to 12 months
The latter is probably legal
I think you mean _illegal_.
No, I mean legal - specifically because the EU warranty requirement is
between the retailer and the customer, not the customer and the
manufacturer (as is the sale of goods act in the UK and most similar laws
across the EU)
Legal != ethical but I can't see their position failing in court on this
as they're not selling direct to consumers.
http://www.thisismoney.co.uk/money/bills/article-1677034/Two-year-warranty-EU-law.html
http://europa.eu/youreurope/citizens/shopping/shopping-abroad/guarantees/index_en.htm
I see what you mean now - you were making the distinction between the retail
and manufacturer warranty obligations.

Under the UK Sale of Goods Act, the warranty can legally be enforced against
the manufacturer as well as the retailer. It also says that the "warranty"
period comes down to how long you would reasonably expect the product to work
reliably. So, for example, because it is reasonable to expect that a laptop
will last more than 12 months without abuse, even if it failed just after its
12 month warranty expired, you should be able to get the retailer or
manufacturer to fix it on the basis of fitness for purpose. But that is the
point where things only tend to happen if you can be bothered to start
getting lawyers involved, and few things are worth enough to bother.

a***@gmail.com
2015-02-26 13:00:21 UTC
Post by Uncle Stoat
Compare and contrast with WDC/SGT reducing enterprise waranties from 5
to 3 years or less and consumer drives from 3 years to 12 months
I don't know about WD, but Seagate have a 2 year warranty on their desktop
drives, 3 year on the cheap (SOHO) NAS drives and the new SMR Archives, and
5 year on all their enterprise drives. (I've just checked the spec sheets
on their website to make sure)

Eg:
http://www.seagate.com/www-content/product-content/desktop-hdd-fam/en-gb/docs/desktop-hdd-ds1770-5-1409gb.pdf

http://www.seagate.com/www-content/product-content/nas-fam/nas-hdd/en-us/docs/nas-hdd-ds1789-3-1409gb.pdf
http://www.seagate.com/www-content/product-content/hdd-fam/seagate-archive-hdd/en-us/docs/archive-hdd-dS1834-3-1411gb.pdf

http://www.seagate.com/www-content/product-content/enterprise-hdd-fam/enterprisenas-hdd/_shared/docs/enterprise-nas-hdd-ds1841-2-1501gb.pdf
http://www.seagate.com/www-content/product-content/enterprise-hdd-fam/enterprise-capacity-3-5-hdd/constellation-es-4/en-gb/docs/enterprise-capacity-3-5-hdd-ds1791-8-1410gb.pdf

Hetz Ben Hamo
2015-02-26 13:10:38 UTC
Small correction: WD RED warranty is 3 years, RED Pro - 5 years.
Post by a***@gmail.com
Post by Uncle Stoat
Compare and contrast with WDC/SGT reducing enterprise waranties from 5
to 3 years or less and consumer drives from 3 years to 12 months
I don't know about WD, but Seagate have a 2 year warranty on their desktop
drives, 3 year on the cheap (SOHO) NAS drives and the new SMR Archives, and
5 year on all their enterprise drives. (I've just checked the spec sheets on
their website to make sure)
http://www.seagate.com/www-content/product-content/desktop-hdd-fam/en-gb/docs/desktop-hdd-ds1770-5-1409gb.pdf
http://www.seagate.com/www-content/product-content/nas-fam/nas-hdd/en-us/docs/nas-hdd-ds1789-3-1409gb.pdf
http://www.seagate.com/www-content/product-content/hdd-fam/seagate-archive-hdd/en-us/docs/archive-hdd-dS1834-3-1411gb.pdf
http://www.seagate.com/www-content/product-content/enterprise-hdd-fam/enterprisenas-hdd/_shared/docs/enterprise-nas-hdd-ds1841-2-1501gb.pdf
http://www.seagate.com/www-content/product-content/enterprise-hdd-fam/enterprise-capacity-3-5-hdd/constellation-es-4/en-gb/docs/enterprise-capacity-3-5-hdd-ds1791-8-1410gb.pdf
Michael Kjörling
2015-02-22 17:37:47 UTC
Post by devsk
So, I got an email (smartd) that one of my Seagate 1TB drives in my RAIDZ2
pool had developed few pending sectors, which then moved to
Reported_Uncorrect eventually. You can see the smartctl output below.
I see little reason to throw out a disk _just_ because it develops a
few bad sectors. That's what spare sectors and vdev-level redundancy
are for! Sure, if you have unlimited funds then go ahead, but if you're
like most of us, throwing out a disk for developing a few bad sectors
is mostly a waste of money. Especially in a setup with double
redundancy.

That said, if you are really worried, you can always buy a replacement
disk now and leave it on the shelf until the values of attributes 5
(Reallocated_Sector_Ct) and possibly 187 (Reported_Uncorrect) on this
one start climbing toward their threshold values. That will allow you
to react more quickly when the drive does start to actually go bad,
without throwing away a disk that is still good.
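If and when that happens, swapping in the shelf spare is a one-liner - a
rough sketch with made-up pool and device names (the old disk can stay
connected while the resilver runs, if it is still readable):

  zpool replace tank ata-ST31000528AS_OLDSERIAL ata-ST31000528AS_NEWSERIAL
  zpool status -v tank   # watch the resilver progress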

As Durval said, it is perfectly normal for the attributes that you saw
climbing during the scrub to go up with usage. Don't worry about them.
--
Michael Kjörling • https://michael.kjorling.se • ***@kjorling.se
OpenPGP B501AC6429EF4514 https://michael.kjorling.se/public-keys/pgp
“People who think they know everything really annoy
those of us who know we don’t.” (Bjarne Stroustrup)

John Drescher
2015-02-26 13:49:28 UTC
Post by devsk
So, I got an email (smartd) that one of my Seagate 1TB drives in my RAIDZ2
pool had developed few pending sectors, which then moved to
Reported_Uncorrect eventually. You can see the smartctl output below.
The question is: is it time to throw away this disk and replace or should I
leave it in there until it gets worse? When do I discard it?
The scrub seems to have found nothing wrong, which tells me that the disk
itself corrected the issue (but then, the Reallocated_Sector_ct is 0). Sort
of confused about that. Looks to me like those counters are not updated
correctly.
Thanks for your help folks.
-devsk
[SMART attribute table snipped; see the original post above]
I discard disks when they cannot make it through two 4-pass badblocks
read/write tests in a row without a single bad block.

badblocks -p2 -wsv /dev/mydevicetotest
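Note that -w is the destructive write-mode test, so it is only suitable for a
drive with no data you care about (or one already pulled from the pool); the
read-only variant is non-destructive:

  badblocks -sv /dev/mydevicetotest   # read-only scan, does not overwrite anything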

John
