Discussion:
'Replaced' disk in pool
Arno None
2014-06-20 00:53:00 UTC
Hi to all,

When copying data I noticed that a pool was degraded. One disk became unavailable.
Nothing had changed, so I shut the machine down and tested the disk on another PC (Windows) with SeaTools.
The tests passed. I did not run the 'Long Generic Test' because it takes a long time.

When I placed the same disk back it started to resilver but it stopped quickly (resilvered 9.07G in 0h3m with 0 errors). It started with ~15h remaining.

Now the pool (raidz1) is ONLINE but the 'replaced' disk has an error (CKSUM 1).
The status now mentions 'One or more devices has experienced an unrecoverable error' but when I read the link it mentions 'ZFS was able to recover from the error and subsequently repair the damaged data'.

One says 'unrecoverable' and the other says 'was able to recover'. Those are opposites.
I don't understand how the error can be 'unrecoverable' while the pool is still online.
Shouldn't it complete the resilvering? One disk can fail in raidz1.

Can someone please explain what happened and why the resilvering stopped?
What are the steps to correct this?

Greets,
Arno



To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Tim
2014-06-20 01:29:48 UTC
If the disk is fine, maybe there's a problem with the controller or cable.
Or perhaps the RAM?
Cédric Lemarchand
2014-06-20 07:21:19 UTC
Hello Arno,
Post by Arno None
Hi to all,
When copying data I noticed that a pool was degraded. One disk became unavailable.
Nothing was changed so I did a shutdown and tested the disk on another
PC (Windows) with SeaTools.
Tests were passed. I did not do the 'Long Generic Test' because it takes long.
When I placed the same disk back it started to resilver but it stopped
quickly (resilvered 9.07G in 0h3m with 0 errors). It started with ~15h
remaining.
Now the pool (raidz1) is ONLINE but the 'replaced' disk has an error (CKSUM 1).
The status now mentions 'One or more devices has experienced an
unrecoverable error' but when I read the link it mentions 'ZFS was
able to recover from the error and subsequently repair the damaged data'.
One 'unrecoverable' and one 'was able to recover'. That are opposites.
I don't understand 'unrecoverable' but still the pool is online.
Shouldn't it complete the resilvering? One disk can fail in raidz1.
That means ZFS encountered an unrecoverable error *on your drive* (i.e. the
bad one), but could still keep the data *in the pool* safe because you had
redundancy (though not any more).
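
To make the distinction concrete: the READ/WRITE/CKSUM counters in `zpool status` are per device, while the 'errors:' line at the bottom is about the data in the pool. Below is a minimal Python sketch, assuming the usual column layout of the config table, that lists devices with non-zero counters. The sample text, the pool name and the device names are made up for illustration; they are not Arno's actual output.

```python
import re

# Hypothetical sample shaped like the "config:" table of `zpool status`.
# Pool name and device names are invented for illustration only.
SAMPLE = """\
config:

        NAME        STATE     READ WRITE CKSUM
        backup      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     1
            sdc     ONLINE       0     0     0

errors: No known data errors
"""

# name, state, then three plain integer counters (READ, WRITE, CKSUM).
# Assumption: counters are plain digits; abbreviated values such as "1.2K"
# are not handled by this sketch.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with a non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header row, blank lines, "errors:" line, etc.
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


if __name__ == "__main__":
    print(devices_with_errors(SAMPLE))
```

With this sample only the one device with CKSUM 1 is reported, while the pool-level 'errors:' line still says no data errors are known. That matches the situation described above: an error on one device, repaired from the redundancy of the others.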
Post by Arno None
Can someone please explain what happened and why the resilvering stopped?
The most probable reason I see is that the defective drive will pass
away soon (dead sectors and no more spares), but it could also be a
bad cable, bad RAM (do you have ECC?), a bad controller, buggy firmware
... well, everything except ZFS ;-)
Post by Arno None
What are the steps to correct this?
Depending on how important the data in the pool is to you, from
highest to lowest priority:

- ensure you have a backup
- buy a new drive
- take the time to do a deep diagnosis of your failed drive and try to
resurrect it from the dust! (full erase, full SMART test)

Cheers
--
Cédric Lemarchand
System & Network Engineer
iXBlue
52, avenue de l'Europe
78160 Marly le Roi
France
Tel. +33 1 30 08 88 88
Mob. +33 6 37 23 40 93
Fax +33 1 30 08 88 00
www.ixblue.com <http://www.ixblue.com>

John McEntee
2014-06-20 07:24:33 UTC
Post by Arno None
When copying data I noticed that a pool was degraded. One disk became unavailable.
Nothing was changed so I did a shutdown and tested the disk on another PC (Windows) with SeaTools.
Tests were passed. I did not do the 'Long Generic Test' because it takes long.
Take it out and perform the Long Generic Test. I have had disks that pass the quick tests but fail the longer ones. The test is there for a reason.
Post by Arno None
When I placed the same disk back it started to resilver but it stopped quickly (resilvered 9.07G in 0h3m with 0 errors). It started with ~15h remaining.
Sounds like a problem with the disk.
Post by Arno None
Now the pool (raidz1) is ONLINE but the 'replaced' disk has an error (CKSUM 1).
The status now mentions 'One or more devices has experienced an unrecoverable error' but when I read the link it mentions 'ZFS was able to recover from the error and subsequently repair the damaged data'.
Sounds like the disk is broken, so ZFS had to recover data from the rest of the array.
Post by Arno None
One 'unrecoverable' and one 'was able to recover'. That are opposites.
I don't understand 'unrecoverable' but still the pool is online.
Shouldn't it complete the resilvering? One disk can fail in raidz1.
If the disk that it is trying to recover to is broken, this will happen. The disk has unrecoverable errors, but the array is still fine as the data is
Post by Arno None
Can someone please explain what happened and why the resilvering stopped?
Resilvering would have stopped because of errors on the hard disk being resilvered to.
Post by Arno None
What are the steps to correct this?
Do a long check on the disk; if it still checks OK, you probably have a SATA cable or port issue.

John


_______________________________________________________________________

The contents of this e-mail and any attachment(s) are strictly confidential and are solely for the person(s) at the e-mail address(es) above. If you are not an addressee, you may not disclose, distribute, copy or use this e-mail, and we request that you send an e-mail to admin-***@public.gmane.org and delete this e-mail. Stirling Dynamics Ltd. accepts no legal liability for the contents of this e-mail including any errors, interception or interference, as internet communications are not secure. Any views or opinions presented are solely those of the author and do not necessarily represent those of Stirling Dynamics Ltd. Registered In England No. 2092114 Registered Office: 26 Regent Street, Clifton, Bristol. BS8 4HG
VAT no. GB 464 6551 29
_______________________________________________________________________

This e-mail has been scanned for all viruses MessageLabs.

Arno None
2014-06-20 12:27:46 UTC
Thanks for the answers. Long Generic Test is running.
But I will replace the disk whatever the test result is. Unfortunately I have to wait over the weekend to receive a replacement.

Cédric: It is my backup pool. The disk enclosure with the affected pool is switched off so nothing can happen with the other disks until I can replace the disk.

Arno
Cédric Lemarchand
2014-06-20 13:18:48 UTC
Post by Arno None
Thanks for the answers. Long Generic Test is running.
But I will replace the disk whatever the test result is. Unfortunately
I have to wait over the weekend to receive a replacement.
OK, then I think you could take some time to check the defunct hard
drive thoroughly. In the best case, if the problem is related to some bad
sectors, a full erase could remap them properly to spare ones (if the
drive has enough of them) and mark them as *dead*. After that, if
the drive successfully passes a full SMART test, it could still run
for some time, maybe years.

OTOH, depending on the age of your drives, your needs and your budget,
it could be a good time to replace them and expand your pool.

Cheers
John McEntee
2014-06-20 13:59:41 UTC
Arno,

Reading through your e-mail again, and the one from Paul, it could well be that the drive resilvered in 3 minutes because most of the data on it was already correct and it only needed to change 9 GB. My last resilver was initially predicted to take 300 hours, but had 12 hours left after 16 hours (at which point I had to lower the priority of the resilver, as the performance of the array was terrible).
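
As a rough check on the figures reported in this thread (9.07G resilvered in 3 minutes, against an initial estimate of ~15h), the arithmetic can be sketched in Python. Treating the 'G' in the status line as GiB and assuming a constant rate are both assumptions made here for illustration.

```python
def rate_mib_per_s(gib_copied, seconds):
    """Average throughput in MiB/s for a given amount of data and duration."""
    return gib_copied * 1024 / seconds


def gib_in_time(rate_mib_s, hours):
    """GiB that could be copied in `hours` at a constant rate (an assumption)."""
    return rate_mib_s * hours * 3600 / 1024


if __name__ == "__main__":
    # Figures from the thread: 9.07G resilvered in 0h3m.
    rate = rate_mib_per_s(9.07, 3 * 60)
    print(f"average rate: {rate:.1f} MiB/s")
    # What the initial ~15h estimate would correspond to at the same rate.
    print(f"15 h at that rate: {gib_in_time(rate, 15):.0f} GiB")
```

This gives about 51.6 MiB/s, and 15 hours at that rate would be about 2721 GiB. In other words, the resilver touched roughly 1/300 of what a full rebuild at that speed would have, which is consistent with only the data that changed while the disk was missing being rewritten.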

I still think a long test is needed though, due to the CKSUM errors.

John

Arno None
2014-06-20 16:31:06 UTC
Cédric,

The SeaTools Long Generic Test is still running (~40% now).
The disks in the pool are 3 months old.
I haven't looked into SMART testing yet, so I will do that and test some more when the running test is finished.

John,

As I replied to Paul it's a lot of data so I have to be patient.

Arno None
2014-06-21 10:22:11 UTC
I'm a little confused now.
The Windows SeaTools test seemed to be stuck at 40% so I cancelled that and started 'smartctl --test=long' in Linux.

The result is:
Extended offline Completed without error
Short offline Completed without error
smartctl --health: PASSED

When I searched for how to read 'Vendor Specific SMART Attributes with Thresholds' I found the raw values are coded hex values.
I don't know what to make of it. Which values should I look for?

ZFS: Errors, SMART: PASSED
Should I format and place it in the pool again? Or replace it with a new one?

Greets,
Arno
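
On the question of which SMART values to look at: a commonly used starting point is the raw value of the sector-related attributes (IDs 5, 197 and 198) and of the interface CRC counter (ID 199), which usually points at cabling or the port rather than the platters. The Python sketch below picks those rows out of an attribute table shaped like the output of `smartctl -A`. The sample table and its numbers are invented for illustration, and the choice of which IDs to watch is an assumption, not a rule from the drive vendor.

```python
# Hypothetical attribute table in the layout printed by `smartctl -A`.
# All values below are made up for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       148579344
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
"""

# Attribute IDs whose raw value is normally a plain count (assumed watch list).
WATCH = {5, 197, 198, 199}


def suspicious_attributes(smart_text):
    """Return {attribute_name: raw_count} for watched attributes with raw > 0."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCH and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


if __name__ == "__main__":
    print(suspicious_attributes(SAMPLE))
```

Attributes with vendor-encoded raw values (such as ID 1 in the sample) are deliberately ignored here, since a large raw number there does not by itself mean the drive is failing. In the invented sample only the CRC counter is non-zero, which would be the pattern to expect if the cable or port were at fault rather than the disk.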

From: arnoads-***@public.gmane.org
To: zfs-discuss-VKpPRiiRko4/***@public.gmane.org
Subject: RE: [zfs-discuss] 'Replaced' disk in pool
Date: Fri, 20 Jun 2014 16:31:06 +0000




Cédric,

SeaTools Long Generic Test is still
running (~40% now).
The disks in the pool are 3 month old.
I haven't looked into SMART testing yet so I will and test some more when the running test is finished.

John,

As I replied to Paul it's a lot of data so I have to be patient.

From: JMcEntee-***@public.gmane.org
To: zfs-discuss-VKpPRiiRko4/***@public.gmane.org
Subject: RE: [zfs-discuss] 'Replaced' disk in pool
Date: Fri, 20 Jun 2014 13:59:41 +0000









Arno,

Reading through your e-mail again, and the one from Paul, It could will be the drive resilvered in 3 minutes as most of the data would have been correct and
it only need to change 9 GB. My last resilver was initially predicted to take 300 hours, but had 12 hours left after 16 hours. (at which point I had to , lower the priority of the resilver as the performance of the array was terriable)


I still think a Long test is needed though, due to cksum errors

John



From: Arno None [mailto:arnoads-***@public.gmane.org]


Sent: 20 June 2014 13:28

To: zfs-discuss-VKpPRiiRko4/***@public.gmane.org

Subject: RE: [zfs-discuss] 'Replaced' disk in pool





Thanks for the answers. Long Generic Test is running.

But I will replace the disk whatever the test result is. Unfortunately I have to wait over the weekend to receive a replacement.



Cédric: It is my backup pool. The disk enclosure with the affected pool is switched off so nothing can happen with the other disks until I can replace the disk.



Arno




From:
JMcEntee-***@public.gmane.org

To: zfs-discuss-VKpPRiiRko4/***@public.gmane.org

Subject: RE: [zfs-discuss] 'Replaced' disk in pool

Date: Fri, 20 Jun 2014 07:24:33 +0000
Post by Arno None
When copying data I noticed that a pool was degraded. One disk became unavailable.
Nothing was changed so I did a shutdown and tested the disk on another PC (Windows) with SeaTools.
Tests were passed. I did not do the 'Long Generic Test' because it takes long.
Take it out an perform the Long Generic test. I have had disk that pass the quick tests but fail the longer ones. The test is there for a reason.
Post by Arno None
When I placed the same disk back it started to resilver but it stopped quickly (resilvered 9.07G in 0h3m with 0 errors). It started with ~15h remaining.
Sound like a problem with the disk
Post by Arno None
Now the pool (raidz1) is ONLINE but the 'replaced' disk has an error (CKSUM 1).
The status now mentions 'One or more devices has experienced an unrecoverable error' but when I read the link it mentions 'ZFS was able to recover from the error and
subsequently repair the damaged data'.
Sound like the disk is broken, so zfs had to recover data from the rest of the array.
Post by Arno None
One 'unrecoverable' and one 'was able to recover'. That are opposites.
I don't understand 'unrecoverable' but still the pool is online.
Shouldn't it complete the resilvering? One disk can fail in raidz1.
If the disk that is trying to recover to is broken this will happen. The disk has unrecoverable errors, but the array is still fine as the data is
Post by Arno None
Can someone please explain what happened and why the resilvering stopped?
resilvering would have stopped with errors on the hard disk being resilvered too.
Post by Arno None
What are the steps to correct this?
Do a long check on the disk, if it still checks ok, you probably have a sata cable or port issue.

John



_______________________________________________________________________



The contents of this e-mail and any attachment(s) are strictly confidential and are solely for the person(s) at the e-mail address(es) above. If you are not an addressee, you may not disclose, distribute, copy or use this e-mail, and we request that you send
an e-mail to admin-***@public.gmane.org and delete this e-mail. Stirling Dynamics Ltd. accepts no legal liability for the contents of this e-mail including any errors, interception or interference, as internet
communications are not secure. Any views or opinions presented are solely those of the author and do not necessarily represent those of Stirling Dynamics Ltd. Registered In England No. 2092114 Registered Office: 26 Regent Street, Clifton, Bristol. BS8 4HG

VAT no. GB 464 6551 29

_______________________________________________________________________



This e-mail has been scanned for all viruses MessageLabs.



John Drescher
2014-06-21 13:13:24 UTC
Permalink
Post by Arno None
I'm a little confused now.
The Windows SeaTools test seemed to be stuck at 40% so I cancelled that and
started 'smartctl --test=long' in Linux.
Extended offline Completed without error
Short offline Completed without error
smartctl --health: PASSED
When I searched for how to read 'Vendor Specific SMART Attributes with
Thresholds' I found the raw values are coded hex values.
I don't know what to make of it. Which values should I look for?
ZFS: Errors, SMART: PASSED
Should I format and place it in the pool again? Or replace it with a new one?
Can you post the output of smartctl --all /dev/disk

where disk is the device for the disk that may be bad?

John

Arno None
2014-06-21 16:48:08 UTC
Permalink
This is the output of smartctl --all
Hajo Möller
2014-06-21 19:06:58 UTC
Permalink
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 371
That line points to a bad cable connection or something similar, apart
from that the disk's smart data is looking fine.
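
As a rough illustration of how to read the row quoted above, here is a minimal Python sketch that splits it into the ten columns of the smartctl attribute table (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The names `SmartAttribute`, `parse_attribute_line` and `has_link_errors` are made up for this example and are not part of smartmontools. The rule "a non-zero raw count on attribute 199 suggests a link problem" is the reading given in this thread, not an official threshold.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    """One row of the smartctl vendor attribute table (hypothetical helper type)."""
    attr_id: int
    name: str
    flag: str
    value: int        # normalized value reported by the drive
    worst: int        # lowest normalized value seen so far
    thresh: int       # failure threshold for the normalized value
    attr_type: str    # "Pre-fail" or "Old_age"
    updated: str
    when_failed: str  # "-" when the attribute has never failed
    raw_value: str    # vendor-specific raw counter, kept as text


def parse_attribute_line(line: str) -> SmartAttribute:
    """Split one whitespace-separated attribute row into named fields."""
    parts = line.split(None, 9)
    if len(parts) != 10:
        raise ValueError(f"unexpected attribute row: {line!r}")
    return SmartAttribute(
        attr_id=int(parts[0]),
        name=parts[1],
        flag=parts[2],
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        attr_type=parts[6],
        updated=parts[7],
        when_failed=parts[8],
        raw_value=parts[9].strip(),
    )


def has_link_errors(attr: SmartAttribute) -> bool:
    """True when attribute 199 carries a non-zero raw CRC error count."""
    return attr.attr_id == 199 and int(attr.raw_value.split()[0]) > 0


row = "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 371"
attr = parse_attribute_line(row)
print(attr.name, attr.raw_value, has_link_errors(attr))
```

Note that the normalized value (200) is still far above the threshold (000) and WHEN_FAILED is "-", which is consistent with the overall health check reporting PASSED even though the raw counter is 371.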
--
Regards,
Hajo Möller

Tim
2014-06-21 21:05:09 UTC
Permalink
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 371
That line points to a bad cable connection or something similar, apart
from that the disk's smart data is looking fine.
I've seen flaky SATA ports on motherboards before, loose solder or just a
poor connector. So the disk and cable might be OK, and so could be the
controller chip itself, just connection problems.

Arno None
2014-06-22 09:57:56 UTC
Permalink
So the disk seems to be OK.
The disk is in a new disk enclosure. The enclosure is connected to a RAID controller with disks in JBOD mode.
I can't do much with the cable connections. All nicely clicked into place.

How do I get it back in the zpool?
In my previous mails I wrote that just placing it back stopped the resilvering with errors.
Should I wipe the disk first (dd if=/dev/zero) or just delete partition(s) or something else?

Thanks for the advice.
Arno

Audun Gangsto
2014-06-23 13:41:27 UTC
Permalink
I had a few disks fail this way: SMART parameters check out fine and life
seems to be good, but the kernel log reveals unrecoverable errors.

Once it was because of a bad SATA cable (I figured it out because the disk in bay X
would fail, no matter which disk I put there), so that's also a thing to look
out for.

Once, the disk would give off a loud whine when I got close to the end of
the drive. All tests and SMART parameters checked out fine, but reading
from the end of the disk would sometimes fail.

Don't trust your disks, but I guess that's why you're using ZFS ;)
Post by Arno None
Thanks for the answers. Long Generic Test is running.
But I will replace the disk whatever the test result is. Unfortunately I
have to wait over the weekend to receive a replacement.
Cédric: It is my backup pool. The disk enclosure with the affected pool is
switched off so nothing can happen with the other disks until I can replace
the disk.
Arno
------------------------------
Subject: RE: [zfs-discuss] 'Replaced' disk in pool
Date: Fri, 20 Jun 2014 07:24:33 +0000
Post by Arno None
When copying data I noticed that a pool was degraded. One disk became
unavailable.
Nothing was changed so I did a shutdown and tested the disk on another
PC (Windows) with SeaTools.
Tests were passed. I did not do the 'Long Generic Test' because it takes
long.
Take it out and perform the Long Generic Test. I have had disks that pass
the quick tests but fail the longer ones. The test is there for a reason.
Post by Arno None
When I placed the same disk back it started to resilver but it stopped
quickly (resilvered 9.07G in 0h3m with 0 errors). It started with ~15h
remaining.
Sounds like a problem with the disk.
Post by Arno None
Now the pool (raidz1) is ONLINE but the 'replaced' disk has an error
(CKSUM 1).
The status now mentions 'One or more devices has experienced an
unrecoverable error' but when I read the link it mentions 'ZFS was able to
recover from the error and subsequently repair the damaged data'.
Sounds like the disk is broken, so ZFS had to recover data from the rest of the array.
Post by Arno None
One 'unrecoverable' and one 'was able to recover'. That are opposites.
I don't understand 'unrecoverable' but still the pool is online.
Shouldn't it complete the resilvering? One disk can fail in raidz1.
If the disk it is trying to recover to is broken, this will happen. The
disk has unrecoverable errors, but the array is still fine as the data is
Post by Arno None
Can someone please explain what happened and why the resilvering stopped?
Resilvering would have stopped because of errors on the hard disk being resilvered to.
Post by Arno None
What are the steps to correct this?
Do a long check on the disk; if it still checks OK, you probably have a
SATA cable or port issue.
John
p***@public.gmane.org
2014-06-20 12:47:32 UTC
Permalink
Arno, I can't speak to the possibility of there being an issue with your
disk, but about the resilver: it will typically take a short
amount of time, even if it estimates something like 15 hours at the start.
Remember, ZFS is very smart, and will only bother looking at places on the
disk that actually have data, not the whole disk. So if the disk is a 1TB
disk with 50GB of data on it, ZFS will only look at 50GB of space on the
disk, not the whole 1TB like dumber systems would.
Arno None
2014-06-20 16:13:31 UTC
Permalink
Paul,

It's a 3TB drive of a 4x 3TB raidz1 pool 90% used. So I have to be patient....
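
To put the numbers from this thread side by side, here is a back-of-the-envelope sketch in Python. It rests on several assumptions that are not stated in the thread: that "9.07G" means GiB, that "0h3m" is close to 180 seconds (the status output rounds it), that "3TB" is a decimal terabyte, that the 90% usage is spread evenly so about 2.7TB per disk holds data, and that the resilver rate stays constant. The function names are made up for the example.

```python
def resilver_rate_mib_s(resilvered_gib: float, minutes: float) -> float:
    """Average throughput implied by 'resilvered X in 0hYm' (assumes GiB and whole minutes)."""
    return resilvered_gib * 1024 / (minutes * 60)


def estimate_hours(data_tb: float, rate_mib_s: float) -> float:
    """Time to rewrite data_tb decimal terabytes at a constant rate in MiB/s."""
    total_bytes = data_tb * 1e12
    bytes_per_second = rate_mib_s * 1024 ** 2
    return total_bytes / bytes_per_second / 3600


# Figures quoted in the thread: 9.07G in 0h3m, 3TB disks, pool 90% used.
rate = resilver_rate_mib_s(9.07, 3)
per_disk_tb = 3 * 0.9  # assumed share of allocated data on one disk
hours = estimate_hours(per_disk_tb, rate)
fraction_done = (9.07 * 1024 ** 3) / (per_disk_tb * 1e12)

print(f"rate: {rate:.1f} MiB/s")
print(f"full resilver at that rate: {hours:.1f} h")
print(f"share of per-disk data covered by 9.07G: {fraction_done:.2%}")
```

Under these assumptions a full resilver of one disk lands in the same range as the ~15h estimate mentioned at the start of the thread, and the 9.07G that was reported as resilvered is well under 1% of the data expected on that disk.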
