Discussion:
ZFS and SSDs and the impending HDD demise (Was: Re: [zfs-discuss] when to discard a disk)
Durval Menezes
2015-02-23 17:51:31 UTC
Permalink
Hi Gordan, Uncle,

(changing the subject to something more fitting)
Hopefully 3D SSD tech will give us 10Tb drives before then (WD have just
announced 10TB spinners, but they're shingled)
Wasn't there a presentation by someone from Seagate on the "shingled"
subject during the last OpenZFS seminar? How is ZFS preparing for this?
The main thing at the moment is the cost. It is coming down annoyingly
slowly, but it is coming down with every new SSD model released.
My impression is that overall SSD reliability is decreasing with time, with
increasing cost-cutting and/or performance cheating by manufacturers. But
it could be just me.

I for one don't expect to convert over to SSDs anytime soon...

I've seen one or three posts here from folks that were bringing up entire
SSD pools. If any of you are reading this, please comment on your
continuing experiences...

Here, I'm using SSDs for SLOG/L2ARC only, and one of them has just started
buying the farm:

Device Model: M4-CT256M4SSD3
[...]
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       57344 (0 1)
[...]
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       9745
[...]
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       30
[...]
202 Perc_Rated_Life_Used    0x0018   099   099   001    Old_age   Offline      -       1


The incredible bit is that theoretically only one percent of the device's
life has been used up, yet it has already reallocated 2^16*1+57344 = 122880
sectors during 9745 hours of operation. This would mean that it (a) has
more than 12 million total sectors (6GB) available for remapping, and/or
(b) has an expected life of 975K hours (more than 111 years!). The first
number does not seem so absurd, but the second one casts grave doubts on
the reliability of that "Perc_..._Used" counter...
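
In case anyone wants to redo the arithmetic, here's a rough Python sketch
of how I'm reading those counters (splitting the raw value into a low
16-bit count plus a high word, and the straight-line extrapolations, are
my own assumptions, not anything the vendor documents):

# My reading of the smartctl output above -- assumptions, not a spec.
LOW_WORD = 57344          # Reallocated_Sector_Ct raw value as printed
HIGH_WORD = 1             # the "(0 1)" part, taken as an upper 16-bit word
POWER_ON_HOURS = 9745
PCT_LIFE_USED = 1         # Perc_Rated_Life_Used raw value
SECTOR_BYTES = 512        # assuming 512-byte sectors

reallocated = HIGH_WORD * 2**16 + LOW_WORD            # 122880 sectors
total_spare = reallocated * 100                       # if that really is 1% of the spare pool
spare_gb = total_spare * SECTOR_BYTES / 1e9           # ~6.3 GB
rated_hours = POWER_ON_HOURS * 100 / PCT_LIFE_USED    # if 9745 h really is 1% of rated life
rated_years = rated_hours / (24 * 365)                # ~111 years

print(f"reallocated: {reallocated} sectors")
print(f"implied spare area: {total_spare} sectors (~{spare_gb:.1f} GB)")
print(f"implied rated life: {rated_hours:.0f} h (~{rated_years:.0f} years)")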

Cheers,
--
Durval.
Luke Olson
2015-02-23 18:22:06 UTC
Permalink
If the wiki is up to date, then ZFS on Linux is not ready for all-SSD pools,
aside from using SSDs as cache and log devices. FreeBSD, however, is ready.

http://open-zfs.org/wiki/Features#TRIM_Support

I personally haven't tried creating a pool with SSDs, but I don't foresee a
need for it anytime soon, at least from a computer graphics pipeline and
video production facility perspective. It simply doesn't make sense
financially, because capacity is the showstopper.

I'm curious to see how the Seagate Archive branded drives perform without
host awareness of SMR. It could be a while before we see host-aware and
host-managed SMR. Until then, drive-managed SMR is what most people will
be talking about and dealing with. The presentation from Seagate was very
informative, but I don't know how it meshes with the future of OpenZFS.

http://open-zfs.org/wiki/OpenZFS_Developer_Summit_2014#Presentations

I'm not going to hold my breath, but if SMR is the future of high-capacity
hard drives, I hope OpenZFS embraces and supports it.

Luke
Post by Durval Menezes
Wasn't there a presentation by someone from Seagate on the "shingled"
subject during the last OpenZFS seminar? How is ZFS preparing for this?
[...]
I've seen one or three posts here from folks that were bringing up entire
SSD pools. If any of you are reading this, please comment on your
continuing experiences...
Hajo Möller
2015-02-23 21:58:57 UTC
Permalink
Not related to SSDs, but to the current HDD development.

The user grarpamp sent this mail to ***@open-zfs.org just
yesterday, collecting information about zoned commands and SMR:

-------- Forwarded Message --------
Subject: [OpenZFS Developer] Zoned Commands ZBC/ZAC, Shingled SMR/SFS, ZFS
Date: Sat, 21 Feb 2015 18:27:05 -0500
From: grarpamp <***@gmail.com>
To: ***@open-zfs.org

FYI, these links may be of general interest,
and for possible integration, if not redundant...

https://github.com/hgst/libzbc
https://github.com/Seagate/SMR_FS-EXT4

Panel: Shingled Disk Drives: File System Vs. Autonomous Block Device
http://storageconference.us/2014/Presentations/Panel4.Bandic.pdf
http://storageconference.us/2014/Presentations/Panel4.Amer.pdf
http://storageconference.us/2014/Presentations/Panel4.Novak.pdf

ZFS on SMR Drives: Enabling Shingled Magnetic Recording (SMR) for
Enterprises
http://storageconference.us/2014/Presentations/Novak.pdf

http://storageconference.us/2014/index.html
http://storageconference.us/history.html

ZFS Host-Aware_SMR
http://open-zfs.org/w/images/2/2a/Host-Aware_SMR-Tim_Feldman.pdf


libzbc - The Linux Foundation
http://events.linuxfoundation.org/sites/events/files/slides/SMR-LinuxConUSA-2014.pdf

http://www.snia.org/sites/default/files/Dunn-Feldman_SNIA_Tutorial_Shingled_Magnetic_Recording-r7_Final.pdf

http://www.opencompute.org/wiki/Storage/Dev

Initial ZAC support
http://www.spinics.net/lists/linux-scsi/msg81545.html
ZAC/ZBC Update
http://www.spinics.net/lists/linux-scsi/msg80161.html
--
Regards,
Hajo Möller

Uncle Stoat
2015-02-23 18:57:50 UTC
Permalink
Post by Durval Menezes
My impression is that overall SSD reliability is decreasing with time,
with increasing cost-cutting and/or performance cheating by
manufacturers. But it could be just me.
Some are, some aren't. I've been "watching this space" for over a decade:

http://www.storagesearch.com/chartingtheriseofssds.html

There's been a bunch of consolidation in the last 18 months and a bunch
more needs to happen.

The big breakthrough is 3D NAND, which allows makers to move from 14nm
back to 40nm - which makes flash more reliable and much faster. That's
what the Samsung 850 range uses, and I expect all makers will be using it
by the end of this year.

Samsung and others claim they can put up to 128 layers on a chip - right
now there are only 32. When you factor in multi-chip encapsulation
(they've all been doing this for a while) it's possible to put 1TB in a
single NAND device.

Lest you think this unlikely, bear in mind that 128GB microSD cards will
have been on the market for a year tomorrow, and they use the same NAND
chips that SSDs do, with a tiny ARM controller on the side to handle
defects. I expect that 256GB+ cards will hit the market during the Mobile
World Congress in Barcelona next week.

It's also worth noting that Samsung stated they _didn't_ bring a 2TB
850 Evo or Pro to market for one simple reason: they didn't think
enough people would buy them (the 1TB Evo only uses 1/3 of the inside of
its 2.5" SSD case).
Post by Durval Menezes
I for one don't expect to convert over to SSDs anytime soon...
I've been putting SSDs in Linux desktops (about 150 of them) for just
under 5 years.

They've paid for themselves simply in terms of not having to change out
busted boot drives (we use a networked /home) - we'd resorted to using RAID1
to cut down on the downtime, so they weren't much more expensive overall.

As soon as 1Tb evo-style drives come under US$300 it's likely we'll
start using them instead of using 2Tb spinners + 64Tb root

Other groups have been resisting putting SSDs in Windows boxes, but I
think we're at the knee point now (no spinners in laptops for more than
a year).
Post by Durval Menezes
I've seen one or three posts here from folks that were bringing up
entire SSD pools. If any of you are reading this, please comment on your
continuing experiences...
That'll probably happen here soon. I've just noticed that 850 Pros are ~4
times the cost (per GB) of 2TB SATA enterprise spinners, which puts them
on the "affordable" side of the spectrum for some specialised uses.

(Things like the backup server are already all-ssd)
Post by Durval Menezes
The incredible bit is that theoretically only one percent of the
device's life has been used up, and it has already reallocated
2^16*1+57344= 122880 sectors during 9745 hours of operation. This would
mean that it (a) has more than 12 million total sectors (6GB) available
for remapping, and/or (b) has an expected life of 975K hours (more than
111 years!). The first number does not seem so absurd, but the second
one brings grave doubts to the reliability of thar "Perc_..._Used"
counter...
Two words: "bathtub curve". Failure rates in flash devices aren't linear
with time.
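
Purely to illustrate the shape (made-up numbers, not a model of any real
drive), a quick Python sketch of a bathtub-shaped hazard rate:

import math

def hazard(t_hours):
    # Decreasing infant-mortality term + flat random-failure term +
    # slowly increasing wear-out term. Constants are invented for shape only.
    infant = 5e-4 * math.exp(-t_hours / 500.0)
    random_failures = 2e-5
    wearout = 1e-9 * t_hours
    return infant + random_failures + wearout

for t in (10, 100, 1000, 10000, 50000):
    print(f"{t:>6} h: ~{hazard(t):.2e} failures/hour")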


Forget MTBF and MTTF - these have been taken over by murketing to tell
you what you'd expect if you replace your drives at the end of warranty.

The hours may seem high to you, but drive life has been predicated on
stupidly large numbers of writes/day and so far testing by various
outfits has validated the manufacturer claims.

http://ssdendurancetest.com/

https://techreport.com/review/27436/the-ssd-endurance-experiment-two-freaking-petabytes

Here's what some of my consumer drives have on them (yes, I like Samsung
for home use; unlike others *ahem*crucial*ahem*sandisk*, they haven't
died on me):

samsung 830 (256Gb raidz1, home desktop)
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       15031
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       47
177 Wear_Leveling_Count     0x0013   052   052   000    Pre-fail  Always       -       1733
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       26316321407


samsung 840 (500Gb raidz1 + scratch space, home desktop)
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       9558
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       80
177 Wear_Leveling_Count     0x0013   091   091   000    Pre-fail  Always       -       106
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       40985552914


samsung 840pro (128Gb zfs l2arc+slog on 32TB FS home system)
  5 Reallocated_Sector_Ct   0x0033   083   083   010    Pre-fail  Always       -       738
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       19648
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       192
177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       8190
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   084   084   010    Pre-fail  Always       -       738
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       196717213815

Workwise:

Intel X25-E (64GB backup spool (D-to-D-to-T), one of 5 in RAID0)
  5 Reallocated_Sector_Ct   0x0002   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0002   100   100   000    Old_age   Always       -       43736
 12 Power_Cycle_Count       0x0002   100   100   000    Old_age   Always       -       143
232 Available_Reservd_Space 0x0003   099   099   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0002   093   093   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0000   192   192   000    Old_age   Offline      -       33203904

Yes, that's really over a petabyte written for "7% worn out".
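
Quick sanity check in Python, assuming attribute 225 really does count
32 MiB units as its name suggests:

HOST_WRITES_32MIB = 33203904      # attribute 225 raw value above
WEAROUT_NORMALIZED = 93           # attribute 233 normalized value above

written = HOST_WRITES_32MIB * 32 * 1024**2
print(f"host writes: {written / 1e15:.2f} PB ({written / 1024**5:.2f} PiB)")
print(f"wear used:   {100 - WEAROUT_NORMALIZED}%")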

By contrast, for this last drive, spinners would shake themselves to
death in about 12-18 months on the same task.



Omen Wild
2015-02-23 20:10:03 UTC
Permalink
tl;dr: I wonder if SLOG and L2ARC have totally different write
patterns based on a SLOG device dying, but an identical L2ARC still
powering on strong.
Post by Uncle Stoat
samsung 840pro (128Gb zfs l2arc+slog on 32TB FS home system)
5 Reallocated_Sector_Ct 0x0033 083 083 010 Pre-fail Always - 738
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 19648
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 192
177 Wear_Leveling_Count 0x0013 001 001 000 Pre-fail Always - 8190
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 084 084 010 Pre-fail Always - 738
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 196717213815
Since the plural of anecdote is not data, I will throw my anecdote out
there. We had a Samsung 840 Pro (128GB) acting as a SLOG device in an
OpenIndiana box that serves all of our department's VM images over NFS.
Since all NFS writes go through the SLOG, it got a fair amount of
traffic. As of December the stats were:

5 Reallocated_Sector_Ct 0x0033 056 056 010 Pre-fail Always - 1981
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 11804
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 14
177 Wear_Leveling_Count 0x0013 009 009 000 Pre-fail Always - 3319
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 057 057 010 Pre-fail Always - 1981
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 98699281447

Overall fairly similar to yours, but around half the writes. The reason I
had to pull the stats from December is that a couple of weeks ago it died,
hard, and would not respond to requests at boot. The OS could see it, but
not communicate with it at all.

ZFS, of course, kicked it out, and happily continued.

Then again, we had another drive bought at the same time (serial # is
almost identical) doing L2ARC duty, and it is fine and the stats are much
better, even though it has done 2.6 times the writes:

5 Reallocated_Sector_Ct 0x0033 099 099 010 Pre-fail Always - 28
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 14946
177 Wear_Leveling_Count 0x0013 001 001 000 Pre-fail Always - 8108
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 099 099 010 Pre-fail Always - 28
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 255858901459
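
For a sense of scale, a quick Python sketch converting those
Total_LBAs_Written counters to terabytes (assuming 512-byte LBAs, which is
what these drives report, but check your own):

LBA_BYTES = 512
slog_lbas = 98699281447       # the dead SLOG drive (December stats)
l2arc_lbas = 255858901459     # the surviving L2ARC drive

print(f"SLOG:  ~{slog_lbas * LBA_BYTES / 1e12:.0f} TB written")
print(f"L2ARC: ~{l2arc_lbas * LBA_BYTES / 1e12:.0f} TB written")
print(f"L2ARC has written {l2arc_lbas / slog_lbas:.1f}x as much")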
--
Klingon function calls do not have 'parameters' - they have 'arguments',
and they ALWAYS WIN THEM!

Uncle Stoatwarbler
2015-02-23 21:40:25 UTC
Permalink
Post by Omen Wild
tl;dr: I wonder if SLOG and L2ARC have totally different write
patterns based on a SLOG device dying, but an identical L2ARC still
powering on strong.
More than likely.
Post by Omen Wild
Overall fairly similar to yours, but around half the writes. The reason I
had to pull the stats from December is that a couple weeks ago, it died,
hard, and will not respond to requests at boot. The OS could see it, but
not communicate with it at all.
Odd. I had a Sandisk pull that stunt, but after sitting on the shelf for
a week it decided to work again.
Post by Omen Wild
ZFS, of course, kicked it out, and happily continued.
Which is exactly the reason I'm not overly worried about mirroring cache
drives, and I don't think other people should be either.
Post by Omen Wild
Then again, we had another drive bought at the same time (serial # is
almost identical) doing L2ARC duty, and it is fine and the stats are much
5 Reallocated_Sector_Ct 0x0033 099 099 010 Pre-fail Always - 28
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 14946
177 Wear_Leveling_Count 0x0013 001 001 000 Pre-fail Always - 8108
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 099 099 010 Pre-fail Always - 28
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 255858901459
Comparing the blocks erased might show some useful detail. IIRC the 830
and 840 series use 1MB erase blocks and 128kB write blocks. A lot of small
writes might well result in more erasure cycles.
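
A back-of-the-envelope Python sketch of that effect, using the 1MB / 128kB
figures above (which are from memory, so treat them as assumptions):

ERASE_BLOCK = 1024 * 1024            # assumed 1MB erase block
PROGRAM_UNIT = 128 * 1024            # assumed 128kB write (program) unit
UNITS_PER_BLOCK = ERASE_BLOCK // PROGRAM_UNIT   # 8

TOTAL_DATA = 100 * 1024**3           # 100GB of logical writes per scenario

for write_size in (4 * 1024, 128 * 1024, 1024 * 1024):
    writes = TOTAL_DATA // write_size
    # Worst case, no coalescing by the FTL: every write burns at least one
    # full program unit, and 8 program units fill one erase block.
    units = writes * max(1, write_size // PROGRAM_UNIT)
    erases = units / UNITS_PER_BLOCK
    print(f"{write_size // 1024:>5} kB writes: ~{erases:,.0f} block erases")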

As you point out, it's all anecdotal. Thousands of drives are needed to
gather any meaningful statistics.

One data point which never seems to be gathered is transport protection.

Quantum produced a couple of whitepapers back in the 1990s which
indicated that careless handling during shipping and jarring during
installation were major contributors to shortened drive lives. They even
showed the scale of G-shocks that drives received when electric
screwdrivers were allowed to torque out of the screw heads and slap back
down during PC assembly. This didn't usually cause head problems, but
bearings tended not to like it.

I'd be interested to know how many people here keep an eye on how their
devices come packed. I've rejected shipments in the past for being
improperly packed (and notified the maker of the serial numbers involved).


Kuba
2015-02-23 22:08:53 UTC
Permalink
Post by Omen Wild
tl;dr: I wonder if SLOG and L2ARC have totally different write
patterns based on a SLOG device dying, but an identical L2ARC still
powering on strong.
In case you haven't stumbled upon this pdf yet, you might find it
interesting, as it contains results of a SLOG device access pattern
analysis (starting from page 19):

http://www.ddrdrive.com/zil_accelerator.pdf

In short: when writing to a single dataset, writes to the SLOG device
are mostly sequential, but when you start writing to multiple datasets,
writes to the SLOG device become increasingly random.
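
A toy Python sketch of that effect (a cartoon that gives each dataset its
own region of the log device, not ZFS's actual ZIL allocator) shows the
sequential fraction of device writes collapsing as datasets are added:

import random

def sequential_fraction(n_datasets, writes_per_dataset=1000, block=4096):
    # Each dataset appends sequentially within its own region, but the
    # commit order on the device interleaves all of them.
    region = writes_per_dataset * block
    next_off = [d * region for d in range(n_datasets)]
    order = [d for d in range(n_datasets) for _ in range(writes_per_dataset)]
    random.shuffle(order)
    last, sequential, total = None, 0, 0
    for d in order:
        off = next_off[d]
        next_off[d] += block
        if last is not None and off == last + block:
            sequential += 1
        last = off
        total += 1
    return sequential / total

for n in (1, 2, 4, 8, 16):
    print(f"{n:>2} dataset(s): ~{sequential_fraction(n):.0%} of writes sequential")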

Kuba
