Discussion:
ZFS consuming twice ARC's size and crashing system
s***@public.gmane.org
2013-08-15 06:55:43 UTC
You might want to have a look at this issue:
https://github.com/zfsonlinux/zfs/issues/1132

No fix as yet, and this is kind of a big deal for production environments,
so as it stands, ZoL is not ready for full-blown deployments.
Thanks for the confirmation. I've reduced arc_max to ~30% of RAM, set
vm.min_free_kbytes to 1GB, and will see how things go.
Thanks
- Trey
It is not unusual for the ARC to exceed arc_max by up to a factor of 2 due to
memory fragmentation. The only thing you can really do is reduce its size
until you are no longer getting memory exhaustion. Increasing
vm.min_free_kbytes (I typically set it to about 1% of RAM) may also help.
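For the record, both knobs can be adjusted at runtime along these lines (a
minimal sketch; the 8 GiB figure is only an example, and zfs_arc_max takes
bytes):

# cap the ARC at 8 GiB (zfs_arc_max is writable on the fly)
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
# keep roughly 1 GiB free for kernel allocations (min_free_kbytes takes kB)
sysctl -w vm.min_free_kbytes=1048576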
I'm working to migrate our FhGFS storage and metadata filesystems from
XFS to ZoL. Initial performance tests with ZoL across the clustered
filesystem and local storage bricks were very promising. I'm now running
parallel-ish rsyncs from our existing filesystem to our new systems
(running ZoL) and have begun to see alarming memory consumption on the
storage and metadata systems. The metadata system has already hung twice
in the past 3 days (root's ext4 reporting hung_task_timeout_secs errors).
While transferring ~58TB I began to notice the 2 storage servers and
single metadata server dropping below 5% free memory.
After reading some of the GitHub issues and various posts I shut down the
transfers, unmounted /tank and did 'modprobe -r zfs', and memory usage
shifted from ~1.8% free to ~98.5% free.
Below are the slab and arcstat numbers from before [1] and after [2] the
unmount/modprobe -r operation.
The systems doing storage (and shown in output below) have 64GB RAM, no
dedup, no compression. Zpool configuration [3] is two 10-disk RAIDZ2's
with mirrored ZIL and striped cache. Both ZIL and cache are SSD (though
they share the same SSDs, just partitioned separately).
I have not changed any zfs parameters. The current values are below [4].
Is there anything I can do to prevent such massive memory consumption?
Any input or advice is greatly appreciated.
Thanks!
- Trey
[1] - BEFORE umount /tank && modprobe -r zfs
# free -m
total used free shared buffers cached
Mem: 64394 63814 580 0 138 447
-/+ buffers/cache: 63227 1166
Swap: 17876 0 17876
# cat /proc/spl/kstat/zfs/arcstats
4 1 0x01 80 3840 14068604476 855326803059960
name type data
hits 4 14282253
misses 4 1620288
demand_data_hits 4 2218799
demand_data_misses 4 4070
demand_metadata_hits 4 12045186
demand_metadata_misses 4 1603226
prefetch_data_hits 4 11433
prefetch_data_misses 4 858
prefetch_metadata_hits 4 6835
prefetch_metadata_misses 4 12134
mru_hits 4 3393387
mru_ghost_hits 4 1026989
mfu_hits 4 10870655
mfu_ghost_hits 4 571654
deleted 4 146421009
recycle_miss 4 2087432
mutex_miss 4 218036
evict_skip 4 11812
evict_l2_cached 4 1965484251136
evict_l2_eligible 4 16744072573440
evict_l2_ineligible 4 46204928
hash_elements 4 3276791
hash_elements_max 4 3276795
hash_collisions 4 113018742
hash_chains 4 970350
hash_chain_max 4 12
p 4 32332809728
c 4 35433480192
c_min 4 4429185024
c_max 4 35433480192
size 4 35433387976
hdr_size 4 284321168
data_size 4 30945415168
other_size 4 3538699496
anon_size 4 16384
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 30879575552
mru_evict_data 4 28208510464
mru_evict_metadata 4 1436280832
mru_ghost_size 4 4553802752
mru_ghost_evict_data 4 933554176
mru_ghost_evict_metadata 4 3620248576
mfu_size 4 65823232
mfu_evict_data 4 0
mfu_evict_metadata 4 65823232
mfu_ghost_size 4 2934171648
mfu_ghost_evict_data 4 2929137664
mfu_ghost_evict_metadata 4 5033984
l2_hits 4 1370716
l2_misses 4 249550
l2_feeds 4 946623
l2_rw_clash 4 4
l2_read_bytes 4 5957497344
l2_write_bytes 4 1993744710656
l2_writes_sent 4 367327
l2_writes_done 4 367327
l2_writes_error 4 0
l2_writes_hdr_miss 4 203
l2_evict_lock_retry 4 157
l2_evict_reading 4 0
l2_free_on_write 4 79142
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 263394997760
l2_hdr_size 4 704066976
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 7224877512
arc_meta_limit 4 8858370048
arc_meta_max 4 7224893896
# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab>
<pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata
<active_slabs> <num_slabs> <sharedavail>
nfs_direct_cache 0 0 200 19 1 : tunables 120 60 8
: slabdata 0 0 0
nfs_commit_data 0 0 704 11 2 : tunables 54 27 8
: slabdata 0 0 0
nfs_write_data 36 36 896 4 1 : tunables 54 27 8
: slabdata 9 9 0
nfs_read_data 0 0 832 9 2 : tunables 54 27 8
: slabdata 0 0 0
nfs_inode_cache 166 168 1048 3 1 : tunables 24 12
8 : slabdata 56 56 0
nfs_page 0 0 128 30 1 : tunables 120 60 8
: slabdata 0 0 0
fscache_cookie_jar 3 106 72 53 1 : tunables 120 60
8 : slabdata 2 2 0
rpc_buffers 8 8 2048 2 1 : tunables 24 12 8
: slabdata 4 4 0
rpc_tasks 8 15 256 15 1 : tunables 120 60 8
: slabdata 1 1 0
rpc_inode_cache 18 24 832 4 1 : tunables 54 27 8
: slabdata 6 6 0
nf_conntrack_expect 0 0 240 16 1 : tunables 120 60
8 : slabdata 0 0 0
nf_conntrack_ffffffff81b16580 0 0 312 12 1 : tunables
54 27 8 : slabdata 0 0 0
fib6_nodes 24 118 64 59 1 : tunables 120 60 8
: slabdata 2 2 0
ip6_dst_cache 16 30 384 10 1 : tunables 54 27 8
: slabdata 3 3 0
ndisc_cache 1 15 256 15 1 : tunables 120 60 8
: slabdata 1 1 0
ip6_mrt_cache 0 0 128 30 1 : tunables 120 60 8
: slabdata 0 0 0
RAWv6 19 20 1024 4 1 : tunables 54 27 8
: slabdata 5 5 0
UDPLITEv6 0 0 1024 4 1 : tunables 54 27 8
: slabdata 0 0 0
UDPv6 8 16 1024 4 1 : tunables 54 27 8
: slabdata 4 4 0
tw_sock_TCPv6 0 0 320 12 1 : tunables 54 27 8
: slabdata 0 0 0
request_sock_TCPv6 0 0 192 20 1 : tunables 120 60
8 : slabdata 0 0 0
TCPv6 3 4 1856 2 1 : tunables 24 12 8
: slabdata 2 2 0
ib_mad 1024 1096 448 8 1 : tunables 54 27 8
: slabdata 137 137 0
ioat2 4096 4110 128 30 1 : tunables 120 60 8
: slabdata 137 137 0
ext4_inode_cache 32088 32152 1024 4 1 : tunables 54 27 8
: slabdata 8038 8038 0
ext4_xattr 0 0 88 44 1 : tunables 120 60 8
: slabdata 0 0 0
ext4_free_block_extents 0 0 56 67 1 : tunables 120
60 8 : slabdata 0 0 0
ext4_alloc_context 1 28 136 28 1 : tunables 120 60
8 : slabdata 1 1 0
ext4_prealloc_space 18 74 104 37 1 : tunables 120 60
8 : slabdata 2 2 0
ext4_system_zone 0 0 40 92 1 : tunables 120 60 8
: slabdata 0 0 0
jbd2_journal_handle 1 288 24 144 1 : tunables 120 60
8 : slabdata 1 2 0
jbd2_journal_head 21 102 112 34 1 : tunables 120 60 8
: slabdata 3 3 5
jbd2_revoke_table 2 202 16 202 1 : tunables 120 60 8
: slabdata 1 1 0
jbd2_revoke_record 0 0 32 112 1 : tunables 120 60
8 : slabdata 0 0 0
bio-1 57 100 192 20 1 : tunables 120 60 8
: slabdata 5 5 0
sd_ext_cdb 2 112 32 112 1 : tunables 120 60 8
: slabdata 1 1 0
sas_task 0 0 320 12 1 : tunables 54 27 8
: slabdata 0 0 0
scsi_sense_cache 27 60 128 30 1 : tunables 120 60 8
: slabdata 2 2 0
scsi_cmd_cache 12 45 256 15 1 : tunables 120 60 8
: slabdata 3 3 0
dm_raid1_read_record 0 0 1064 7 2 : tunables 24 12
8 : slabdata 0 0 0
kcopyd_job 0 0 3240 2 2 : tunables 24 12 8
: slabdata 0 0 0
io 0 0 64 59 1 : tunables 120 60 8
: slabdata 0 0 0
dm_uevent 0 0 2608 3 2 : tunables 24 12 8
: slabdata 0 0 0
dm_rq_clone_bio_info 0 0 16 202 1 : tunables 120 60
8 : slabdata 0 0 0
dm_rq_target_io 0 0 392 10 1 : tunables 54 27 8
: slabdata 0 0 0
dm_target_io 0 0 24 144 1 : tunables 120 60 8
: slabdata 0 0 0
dm_io 0 0 40 92 1 : tunables 120 60 8
: slabdata 0 0 0
flow_cache 0 0 96 40 1 : tunables 120 60 8
: slabdata 0 0 0
uhci_urb_priv 0 0 56 67 1 : tunables 120 60 8
: slabdata 0 0 0
cfq_io_context 0 0 136 28 1 : tunables 120 60 8
: slabdata 0 0 0
cfq_queue 0 0 240 16 1 : tunables 120 60 8
: slabdata 0 0 0
bsg_cmd 0 0 312 12 1 : tunables 54 27 8
: slabdata 0 0 0
mqueue_inode_cache 1 4 896 4 1 : tunables 54 27
8 : slabdata 1 1 0
isofs_inode_cache 0 0 640 6 1 : tunables 54 27 8
: slabdata 0 0 0
hugetlbfs_inode_cache 1 6 608 6 1 : tunables 54 27
8 : slabdata 1 1 0
dquot 0 0 256 15 1 : tunables 120 60 8
: slabdata 0 0 0
kioctx 0 0 384 10 1 : tunables 54 27 8
: slabdata 0 0 0
kiocb 0 0 256 15 1 : tunables 120 60 8
: slabdata 0 0 0
inotify_event_private_data 0 0 32 112 1 : tunables 120
60 8 : slabdata 0 0 0
inotify_inode_mark_entry 174 374 112 34 1 : tunables 120
60 8 : slabdata 11 11 0
dnotify_mark_entry 4 34 112 34 1 : tunables 120 60
8 : slabdata 1 1 0
dnotify_struct 4 112 32 112 1 : tunables 120 60 8
: slabdata 1 1 0
fasync_cache 0 0 24 144 1 : tunables 120 60 8
: slabdata 0 0 0
khugepaged_mm_slot 11 184 40 92 1 : tunables 120 60
8 : slabdata 2 2 0
ksm_mm_slot 0 0 48 77 1 : tunables 120 60 8
: slabdata 0 0 0
ksm_stable_node 0 0 40 92 1 : tunables 120 60 8
: slabdata 0 0 0
ksm_rmap_item 0 0 64 59 1 : tunables 120 60 8
: slabdata 0 0 0
utrace_engine 0 0 56 67 1 : tunables 120 60 8
: slabdata 0 0 0
utrace 0 0 64 59 1 : tunables 120 60 8
: slabdata 0 0 0
pid_namespace 0 0 2120 3 2 : tunables 24 12 8
: slabdata 0 0 0
nsproxy 0 0 48 77 1 : tunables 120 60 8
: slabdata 0 0 0
posix_timers_cache 0 0 176 22 1 : tunables 120 60
8 : slabdata 0 0 0
uid_cache 7 120 128 30 1 : tunables 120 60 8
: slabdata 4 4 0
UNIX 140 175 768 5 1 : tunables 54 27
8 : slabdata 35 35 0
ip_mrt_cache 0 0 128 30 1 : tunables 120 60 8
: slabdata 0 0 0
UDP-Lite 0 0 832 9 2 : tunables 54 27 8
: slabdata 0 0 0
tcp_bind_bucket 83 177 64 59 1 : tunables 120 60 8
: slabdata 3 3 0
inet_peer_cache 2 59 64 59 1 : tunables 120 60 8
: slabdata 1 1 0
secpath_cache 0 0 64 59 1 : tunables 120 60 8
: slabdata 0 0 0
xfrm_dst_cache 0 0 384 10 1 : tunables 54 27 8
: slabdata 0 0 0
ip_fib_alias 1 112 32 112 1 : tunables 120 60 8
: slabdata 1 1 0
ip_fib_hash 14 106 72 53 1 : tunables 120 60 8
: slabdata 2 2 0
ip_dst_cache 38 50 384 10 1 : tunables 54 27 8
: slabdata 5 5 0
arp_cache 6 15 256 15 1 : tunables 120 60 8
: slabdata 1 1 0
PING 0 0 832 9 2 : tunables 54 27 8
: slabdata 0 0 0
RAW 17 18 832 9 2 : tunables 54 27 8
: slabdata 2 2 0
UDP 18 36 832 9 2 : tunables 54 27 8
: slabdata 4 4 0
tw_sock_TCP 33 45 256 15 1 : tunables 120 60 8
: slabdata 3 3 0
request_sock_TCP 22 30 128 30 1 : tunables 120 60 8
: slabdata 1 1 0
TCP 27 36 1664 4 2 : tunables 24 12 8
: slabdata 9 9 0
eventpoll_pwq 74 212 72 53 1 : tunables 120 60 8
: slabdata 4 4 0
eventpoll_epi 74 180 128 30 1 : tunables 120 60 8
: slabdata 6 6 0
sgpool-128 2 2 4096 1 1 : tunables 24 12 8
: slabdata 2 2 0
sgpool-64 2 4 2048 2 1 : tunables 24 12 8
: slabdata 2 2 0
sgpool-32 2 8 1024 4 1 : tunables 54 27 8
: slabdata 2 2 0
sgpool-16 2 16 512 8 1 : tunables 54 27 8
: slabdata 2 2 0
sgpool-8 7 45 256 15 1 : tunables 120 60 8
: slabdata 3 3 0
scsi_data_buffer 0 0 24 144 1 : tunables 120 60 8
: slabdata 0 0 0
blkdev_integrity 0 0 112 34 1 : tunables 120 60 8
: slabdata 0 0 0
blkdev_queue 81 82 2864 2 2 : tunables 24 12 8
: slabdata 41 41 0
blkdev_requests 242 275 352 11 1 : tunables 54 27 8
: slabdata 25 25 0
blkdev_ioc 2 96 80 48 1 : tunables 120 60 8
: slabdata 2 2 0
fsnotify_event_holder 0 0 24 144 1 : tunables 120 60
8 : slabdata 0 0 0
fsnotify_event 0 0 104 37 1 : tunables 120 60 8
: slabdata 0 0 0
bio-0 11 60 192 20 1 : tunables 120 60 8
: slabdata 3 3 1
biovec-256 6 6 4096 1 1 : tunables 24 12 8
: slabdata 6 6 0
biovec-128 0 0 2048 2 1 : tunables 24 12 8
: slabdata 0 0 0
biovec-64 0 0 1024 4 1 : tunables 54 27 8
: slabdata 0 0 0
biovec-16 0 0 256 15 1 : tunables 120 60 8
: slabdata 0 0 0
bip-256 2 2 4224 1 2 : tunables 8 4 0
: slabdata 2 2 0
bip-128 0 0 2176 3 2 : tunables 24 12 8
: slabdata 0 0 0
bip-64 0 0 1152 7 2 : tunables 24 12 8
: slabdata 0 0 0
bip-16 0 0 384 10 1 : tunables 54 27 8
: slabdata 0 0 0
bip-4 0 0 192 20 1 : tunables 120 60 8
: slabdata 0 0 0
bip-1 0 0 128 30 1 : tunables 120 60 8
: slabdata 0 0 0
sock_inode_cache 278 370 704 5 1 : tunables 54 27 8
: slabdata 74 74 0
skbuff_fclone_cache 42 42 512 7 1 : tunables 54 27
8 : slabdata 6 6 0
skbuff_head_cache 610 930 256 15 1 : tunables 120 60 8
: slabdata 62 62 0
file_lock_cache 6 66 176 22 1 : tunables 120 60 8
: slabdata 3 3 0
net_namespace 0 0 2240 3 2 : tunables 24 12 8
: slabdata 0 0 0
shmem_inode_cache 2309 2430 784 5 1 : tunables 54 27 8
: slabdata 486 486 0
Acpi-Operand 66947 67628 72 53 1 : tunables 120 60
8 : slabdata 1276 1276 0
Acpi-ParseExt 0 0 72 53 1 : tunables 120 60 8
: slabdata 0 0 0
Acpi-Parse 0 0 48 77 1 : tunables 120 60 8
: slabdata 0 0 0
Acpi-State 0 0 80 48 1 : tunables 120 60 8
: slabdata 0 0 0
Acpi-Namespace 5988 6072 40 92 1 : tunables 120 60 8
: slabdata 66 66 0
task_delay_info 664 1258 112 34 1 : tunables 120 60 8
: slabdata 37 37 0
taskstats 0 0 328 12 1 : tunables 54 27 8
: slabdata 0 0 0
proc_inode_cache 3486 3534 640 6 1 : tunables 54 27 8
: slabdata 589 589 0
sigqueue 2 48 160 24 1 : tunables 120 60 8
: slabdata 2 2 0
bdev_cache 109 148 896 4 1 : tunables 54 27 8
: slabdata 37 37 0
sysfs_dir_cache 28695 29241 144 27 1 : tunables 120 60 8
: slabdata 1083 1083 0
mnt_cache 28 45 256 15 1 : tunables 120 60 8
: slabdata 3 3 0
filp 1592 3560 192 20 1 : tunables 120 60 8
: slabdata 178 178 0
inode_cache 12529 12552 592 6 1 : tunables 54 27 8
: slabdata 2092 2092 0
dentry 4107669 4108160 192 20 1 : tunables 120 60
8 : slabdata 205408 205408 34
names_cache 21 21 4096 1 1 : tunables 24 12 8
: slabdata 21 21 0
avc_node 0 0 64 59 1 : tunables 120 60 8
: slabdata 0 0 0
selinux_inode_security 11048 11607 72 53 1 : tunables 120 60
8 : slabdata 219 219 0
radix_tree_node 3843 4536 560 7 1 : tunables 54 27 8
: slabdata 648 648 0
key_jar 8 40 192 20 1 : tunables 120 60 8
: slabdata 2 2 0
buffer_head 95556 125541 104 37 1 : tunables 120 60 8
: slabdata 3393 3393 4
vm_area_struct 6079 6707 200 19 1 : tunables 120 60 8
: slabdata 353 353 0
mm_struct 87 130 1408 5 2 : tunables 24 12 8
: slabdata 26 26 0
fs_cache 91 708 64 59 1 : tunables 120 60 8
: slabdata 12 12 0
files_cache 92 286 704 11 2 : tunables 54 27 8
: slabdata 26 26 0
signal_cache 559 770 1088 7 2 : tunables 24 12 8
: slabdata 110 110 0
sighand_cache 559 648 2112 3 2 : tunables 24 12 8
: slabdata 216 216 0
task_xstate 177 297 832 9 2 : tunables 54 27
8 : slabdata 33 33 0
task_struct 659 687 2656 3 2 : tunables 24 12 8
: slabdata 229 229 0
cred_jar 802 1620 192 20 1 : tunables 120 60 8
: slabdata 81 81 0
anon_vma_chain 5164 10626 48 77 1 : tunables 120 60 8
: slabdata 138 138 0
anon_vma 2792 6716 40 92 1 : tunables 120 60 8
: slabdata 73 73 0
pid 663 1170 128 30 1 : tunables 120 60 8
: slabdata 39 39 0
shared_policy_node 0 0 48 77 1 : tunables 120 60
8 : slabdata 0 0 0
numa_policy 15 56 136 28 1 : tunables 120 60 8
: slabdata 2 2 0
idr_layer_cache 411 539 544 7 1 : tunables 54 27 8
: slabdata 68 77 4
size-4194304(DMA) 0 0 4194304 1 1024 : tunables 1 1
0 : slabdata 0 0 0
size-4194304 0 0 4194304 1 1024 : tunables 1 1
0 : slabdata 0 0 0
size-2097152(DMA) 0 0 2097152 1 512 : tunables 1 1
0 : slabdata 0 0 0
size-2097152 0 0 2097152 1 512 : tunables 1 1
0 : slabdata 0 0 0
size-1048576(DMA) 0 0 1048576 1 256 : tunables 1 1
0 : slabdata 0 0 0
size-1048576 0 0 1048576 1 256 : tunables 1 1
0 : slabdata 0 0 0
size-524288(DMA) 0 0 524288 1 128 : tunables 1 1 0
: slabdata 0 0 0
size-524288 4 4 524288 1 128 : tunables 1 1 0
: slabdata 4 4 0
size-262144(DMA) 0 0 262144 1 64 : tunables 1 1 0
: slabdata 0 0 0
size-262144 1 1 262144 1 64 : tunables 1 1 0
: slabdata 1 1 0
size-131072(DMA) 0 0 131072 1 32 : tunables 8 4 0
: slabdata 0 0 0
size-131072 4 7 131072 1 32 : tunables 8 4 0
: slabdata 4 7 0
size-65536(DMA) 0 0 65536 1 16 : tunables 8 4 0
: slabdata 0 0 0
size-65536 126 135 65536 1 16 : tunables 8 4 0
: slabdata 126 135 0
size-32768(DMA) 0 0 32768 1 8 : tunables 8 4 0
: slabdata 0 0 0
size-32768 12 26 32768 1 8 : tunables 8 4 0
: slabdata 12 26 0
size-16384(DMA) 0 0 16384 1 4 : tunables 8 4 0
: slabdata 0 0 0
size-16384 36 63 16384 1 4 : tunables 8 4 0
: slabdata 36 63 0
size-8192(DMA) 0 0 8192 1 2 : tunables 8 4 0
: slabdata 0 0 0
size-8192 70049 70049 8192 1 2 : tunables 8 4 0
: slabdata 70049 70049 0
size-4096(DMA) 0 0 4096 1 1 : tunables 24 12 8
: slabdata 0 0 0
size-4096 2117 2185 4096 1 1 : tunables 24 12
8 : slabdata 2117 2185 0
size-2048(DMA) 0 0 2048 2 1 : tunables 24 12 8
: slabdata 0 0 0
size-2048 2303 2408 2048 2 1 : tunables 24 12
8 : slabdata 1171 1204 0
size-1024(DMA) 0 0 1024 4 1 : tunables 54 27 8
: slabdata 0 0 0
size-1024 12920 13380 1024 4 1 : tunables 54 27 8
: slabdata 3326 3345 11
size-512(DMA) 0 0 512 8 1 : tunables 54 27 8
: slabdata 0 0 0
size-512 5043 5312 512 8 1 : tunables 54 27 8
: slabdata 647 664 0
size-256(DMA) 0 0 256 15 1 : tunables 120 60 8
: slabdata 0 0 0
size-256 1964 2685 256 15 1 : tunables 120 60 8
: slabdata 156 179 0
size-192(DMA) 0 0 192 20 1 : tunables 120 60 8
: slabdata 0 0 0
size-192 20728 22320 192 20 1 : tunables 120 60 8
: slabdata 1110 1116 112
size-128(DMA) 0 0 128 30 1 : tunables 120 60 8
: slabdata 0 0 0
size-64(DMA) 0 0 64 59 1 : tunables 120 60 8
: slabdata 0 0 0
size-64 4899753 4908623 64 59 1 : tunables 120 60
8 : slabdata 83197 83197 0
size-32(DMA) 0 0 32 112 1 : tunables 120 60 8
: slabdata 0 0 0
size-128 2259651 2260530 128 30 1 : tunables 120 60
8 : slabdata 75351 75351 0
size-32 10278112 11060896 32 112 1 : tunables 120 60
8 : slabdata 98758 98758 0
kmem_cache 195 195 32896 1 16 : tunables 8 4 0
: slabdata 195 195 0
# cat /proc/spl/kmem/slab
--------------------- cache
------------------------------------------------------- ----- slab ------
---- object ----- --- emergency ---
name flags size alloc slabsize
objsize total alloc max total alloc max dlock alloc max
spl_vn_cache 0x00020 12288 3696 4096
88 3 2 2 63 42 42 0 0 0
spl_vn_file_cache 0x00020 0 0 4096
96 0 0 0 0 0 0 0 0 0
spl_zlib_workspace_cache 0x00140 0 0 8388608
268072 0 0 0 0 0 0 0 0 0
zio_cache 0x00020 263880704 1468320 32768
1064 8053 64 8052 233537 1380 233508 0 0 0
zio_link_cache 0x00020 21442560 123360 4096
48 5235 79 5234 235575 2570 235530 0 0 0
zio_vdev_cache 0x00040 92274688 72098400 4194304
131088 22 19 21 660 550 630 0 0 0
zio_buf_512 0x00020 2354642944 1140514304
32768 512 71858 71857 71857 2227598 2227567 2227567 0 0
0
zio_data_buf_512 0x00020 68943872 31937024 32768
512 2104 2076 2103 65224 62377 65193 0 0 0
zio_buf_1024 0x00040 19300352 978944 32768
1024 589 55 588 12369 956 12348 0 0 0
zio_data_buf_1024 0x00040 20086784 1822720 32768
1024 613 164 612 12873 1780 12852 0 0 0
zio_buf_1536 0x00040 24248320 1546752 65536
1536 370 36 369 11470 1007 11439 0 0 0
zio_data_buf_1536 0x00040 14286848 2566656 65536
1536 218 95 217 6758 1671 6727 0 0 0
zio_buf_2048 0x00040 26017792 1794048 65536
2048 397 53 396 9925 876 9900 0 0 0
zio_data_buf_2048 0x00040 42401792 3452928 65536
2048 647 151 646 16175 1686 16150 0 0 0
zio_buf_2560 0x00040 25886720 2339840 65536
2560 395 57 394 8295 914 8274 0 0 0
zio_data_buf_2560 0x00040 15269888 3297280 65536
2560 233 116 232 4893 1288 4872 0 0 0
zio_buf_3072 0x00040 20905984 3231744 65536
3072 319 65 318 5742 1052 5724 0 0 0
zio_data_buf_3072 0x00040 12320768 3852288 65536
3072 188 99 187 3384 1254 3366 0 0 0
zio_buf_3584 0x00040 16121856 3691520 131072
3584 123 44 122 3813 1030 3782 0 0 0
zio_data_buf_3584 0x00040 11272192 4168192 131072
3584 86 61 85 2666 1163 2635 0 0 0
zio_buf_4096 0x00040 26476544 10993664 262144
4096 101 100 100 3131 2684 3100 0 0 0
zio_data_buf_4096 0x00040 18874368 2985984 262144
4096 72 39 71 2232 729 2201 0 0 0
zio_buf_5120 0x00040 30801920 2549760 131072
5120 235 29 234 4935 498 4914 0 0 0
zio_data_buf_5120 0x00040 25690112 7203840 131072
5120 196 73 195 4116 1407 4095 0 0 0
zio_buf_6144 0x00040 60030976 3158016 131072
6144 458 50 457 8244 514 8226 0 0 0
zio_data_buf_6144 0x00040 26869760 4866048 131072
6144 205 89 204 3690 792 3672 0 0 0
zio_buf_7168 0x00040 74973184 7820288 262144
7168 286 266 285 8866 1091 8835 0 0 0
zio_data_buf_7168 0x00040 69730304 5211136 262144
7168 266 94 265 8246 727 8215 0 0 0
zio_buf_8192 0x00040 105119744 37617664 262144
8192 401 400 400 8421 4592 8400 0 0 0
zio_data_buf_8192 0x00040 91488256 12943360 262144
8192 349 126 348 7329 1580 7308 0 0 0
zio_buf_10240 0x00040 141033472 115148800 262144
10240 538 537 537 11298 11245 11277 0 0 0
zio_data_buf_10240 0x00040 136839168 32737280 262144
10240 522 181 521 10962 3197 10941 0 0 0
zio_buf_12288 0x00040 15728640 10653696 524288
12288 30 29 29 930 867 899 0 0 0
zio_data_buf_12288 0x00040 136839168 33767424 524288
12288 261 107 260 8091 2748 8060 0 0 0
zio_buf_14336 0x00040 10485760 7526400 524288
14336 20 19 19 620 525 589 0 0 0
zio_data_buf_14336 0x00040 224919552 178741248 524288
14336 429 421 428 13299 12468 13268 0 0 0
zio_buf_16384 0x00040 3381133312 2594439168
524288 16384 6449 6425 6448 161225 158352 161200 0 0 0
zio_data_buf_16384 0x00040 105381888 28966912 524288
16384 201 90 200 5025 1768 5000 0 0 0
zio_buf_20480 0x00040 0 0 524288
20480 0 0 0 0 0 0 0 0 0
zio_data_buf_20480 0x00040 187170816 145899520 524288
20480 357 355 356 7497 7124 7476 0 0 0
zio_buf_24576 0x00040 0 0 524288
24576 0 0 0 0 0 0 0 0 0
zio_data_buf_24576 0x00040 167772160 119881728 524288
24576 320 282 319 5760 4878 5742 0 0 0
zio_buf_28672 0x00040 0 0 1048576
28672 0 0 0 0 0 0 0 0 0
zio_data_buf_28672 0x00040 166723584 43180032 1048576
28672 159 63 158 4929 1506 4898 0 0 0
zio_buf_32768 0x00040 0 0 1048576
32768 0 0 0 0 0 0 0 0 0
zio_data_buf_32768 0x00040 213909504 174718976 1048576
32768 204 203 203 5712 5332 5684 0 0 0
zio_buf_36864 0x00040 0 0 1048576
36864 0 0 0 0 0 0 0 0 0
zio_data_buf_36864 0x00040 155189248 51720192 1048576
36864 148 70 147 3700 1403 3675 0 0 0
zio_buf_40960 0x00040 0 0 1048576
40960 0 0 0 0 0 0 0 0 0
zio_data_buf_40960 0x00040 218103808 98017280 1048576
40960 208 135 207 4784 2393 4761 0 0 0
zio_buf_45056 0x00040 0 0 1048576
45056 0 0 0 0 0 0 0 0 0
zio_data_buf_45056 0x00040 207618048 124399616 1048576
45056 198 137 197 4158 2761 4137 0 0 0
zio_buf_49152 0x00040 0 0 1048576
49152 0 0 0 0 0 0 0 0 0
zio_data_buf_49152 0x00040 256901120 92602368 1048576
49152 245 133 244 4655 1884 4636 0 0 0
zio_buf_53248 0x00040 0 0 1048576
53248 0 0 0 0 0 0 0 0 0
zio_data_buf_53248 0x00040 254803968 49627136 1048576
53248 243 124 242 4374 932 4356 0 0 0
zio_buf_57344 0x00040 0 0 1048576
57344 0 0 0 0 0 0 0 0 0
zio_data_buf_57344 0x00040 176160768 34349056 1048576
57344 168 112 167 2856 599 2839 0 0 0
zio_buf_61440 0x00040 0 0 2097152
61440 0 0 0 0 0 0 0 0 0
zio_data_buf_61440 0x00040 272629760 58920960 2097152
61440 130 78 129 4030 959 3999 0 0 0
zio_buf_65536 0x00040 0 0 2097152
65536 0 0 0 0 0 0 0 0 0
zio_data_buf_65536 0x00040 155189248 70123520 2097152
65536 74 61 73 2220 1070 2190 0 0 0
zio_buf_69632 0x00040 0 0 2097152
69632 0 0 0 0 0 0 0 0 0
zio_data_buf_69632 0x00040 150994944 53755904 2097152
69632 72 63 71 2016 772 1988 0 0 0
zio_buf_73728 0x00040 0 0 2097152
73728 0 0 0 0 0 0 0 0 0
zio_data_buf_73728 0x00040 146800640 47333376 2097152
73728 70 63 69 1820 642 1794 0 0 0
zio_buf_77824 0x00040 0 0 2097152
77824 0 0 0 0 0 0 0 0 0
zio_data_buf_77824 0x00040 142606336 43192320 2097152
77824 68 47 67 1700 555 1675 0 0 0
zio_buf_81920 0x00040 0 0 2097152
81920 0 0 0 0 0 0 0 0 0
zio_data_buf_81920 0x00040 130023424 57016320 2097152
81920 62 48 61 1488 696 1464 0 0 0
zio_buf_86016 0x00040 0 0 2097152
86016 0 0 0 0 0 0 0 0 0
zio_data_buf_86016 0x00040 132120576 59351040 2097152
86016 63 43 62 1449 690 1426 0 0 0
zio_buf_90112 0x00040 0 0 2097152
90112 0 0 0 0 0 0 0 0 0
zio_data_buf_90112 0x00040 121634816 55328768 2097152
90112 58 51 57 1276 614 1254 0 0 0
zio_buf_94208 0x00040 0 0 2097152
94208 0 0 0 0 0 0 0 0 0
zio_data_buf_94208 0x00040 121634816 73576448 2097152
94208 58 50 57 1218 781 1197 0 0 0
zio_buf_98304 0x00040 0 0 2097152
98304 0 0 0 0 0 0 0 0 0
zio_data_buf_98304 0x00040 117440512 60948480 2097152
98304 56 51 55 1120 620 1100 0 0 0
zio_buf_102400 0x00040 0 0 2097152
102400 0 0 0 0 0 0 0 0 0
zio_data_buf_102400 0x00040 117440512 67686400 2097152
102400 56 49 55 1064 661 1045 0 0 0
zio_buf_106496 0x00040 0 0 2097152
106496 0 0 0 0 0 0 0 0 0
zio_data_buf_106496 0x00040 117440512 61128704 2097152
106496 56 47 55 1008 574 990 0 0 0
zio_buf_110592 0x00040 0 0 2097152
110592 0 0 0 0 0 0 0 0 0
zio_data_buf_110592 0x00040 109051904 59056128 2097152
110592 52 46 51 936 534 918 0 0 0
zio_buf_114688 0x00040 31457280 23625728 2097152
114688 15 13 14 255 206 238 0 0 0
zio_data_buf_114688 0x00040 109051904 51724288 2097152
114688 52 51 51 884 451 867 0 0 0
zio_buf_118784 0x00040 0 0 2097152
118784 0 0 0 0 0 0 0 0 0
zio_data_buf_118784 0x00040 109051904 64974848 2097152
118784 52 42 51 884 547 867 0 0 0
zio_buf_122880 0x00040 0 0 2097152
122880 0 0 0 0 0 0 0 0 0
zio_data_buf_122880 0x00040 96468992 68812800 2097152
122880 46 45 45 736 560 720 0 0 0
zio_buf_126976 0x00040 0 0 4194304
126976 0 0 0 0 0 0 0 0 0
zio_data_buf_126976 0x00040 121634816 65519616 4194304
126976 29 22 28 899 516 868 0 0 0
zio_buf_131072 0x00040 7067402240 14155776
4194304 131072 1685 19 1684 52235 108 52204 0 0 0
zio_data_buf_131072 0x00040 36226203648 27071610880
4194304 131072 8637 8019 8637 267747 206540 267740 0 0
0
lz4_cache 0x00040 0 0 524288
16384 0 0 0 0 0 0 0 0 0
sa_cache 0x00020 276295680 178078560 4096
80 67455 67454 67454 2226015 2225982 2225982 0 0 0
spill_cache 0x00040 0 0 4194304
131072 0 0 0 0 0 0 0 0 0
dnode_t 0x00020 2284814336 2070598144
16384 928 139454 139453 139453 2231264 2231248 2231248 0 0
0
dmu_buf_impl_t 0x00020 885243904 756427000 8192
280 108062 108061 108061 2701550 2701525 2701525 0 0 0
arc_buf_hdr_t 0x00020 1032970240 891736768
8192 272 126095 126094 126094 3278470 3278444 3278444 0 0
0
arc_buf_t 0x00020 80920576 55533856 4096
112 19756 19271 19755 513656 495838 513630 0 0 0
zil_lwb_cache 0x00020 17776640 291200 4096
200 4340 100 4339 69440 1456 69424 0 0 0
zfs_znode_cache 0x00020 2431451136 2297262960
32768 1032 74202 74201 74201 2226060 2226030 2226030 0 0
0
[2] - AFTER umount /tank && modprobe -r zfs
# free -m
total used free shared buffers cached
Mem: 64394 1561 62832 0 138 448
-/+ buffers/cache: 974 63419
Swap: 17876 0 17876
# cat /proc/spl/kstat/zfs/arcstats
4 1 0x01 80 3840 855980527057136 855994985024330
name type data
hits 4 2354
misses 4 790
demand_data_hits 4 0
demand_data_misses 4 0
demand_metadata_hits 4 2350
demand_metadata_misses 4 35
prefetch_data_hits 4 0
prefetch_data_misses 4 0
prefetch_metadata_hits 4 4
prefetch_metadata_misses 4 755
mru_hits 4 1821
mru_ghost_hits 4 0
mfu_hits 4 531
mfu_ghost_hits 4 0
deleted 4 9
recycle_miss 4 0
mutex_miss 4 0
evict_skip 4 0
evict_l2_cached 4 0
evict_l2_eligible 4 0
evict_l2_ineligible 4 2048
hash_elements 4 785
hash_elements_max 4 785
hash_collisions 4 0
hash_chains 4 0
hash_chain_max 4 0
p 4 17716740096
c 4 35433480192
c_min 4 4429185024
c_max 4 35433480192
size 4 12590176
hdr_size 4 378144
data_size 4 11922432
other_size 4 289600
anon_size 4 16384
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 11867648
mru_evict_data 4 0
mru_evict_metadata 4 11554304
mru_ghost_size 4 4096
mru_ghost_evict_data 4 0
mru_ghost_evict_metadata 4 4096
mfu_size 4 38400
mfu_evict_data &
...
Sander Smeenk
2013-08-15 07:41:28 UTC
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3 days.
While transferring ~58TB I began to notice the 2 storage servers and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL storage server.
This server has 192GB(!) of memory and only a 26T pool (which is a
mirror vdev across two iSCSI LUNs at the moment).

We run ~25 concurrent rsyncs and mostly write data to the pool. From
time to time memory consumption spikes way past any limits set and the
server grinds to a slow halt. It's hard to put a finger on what actually
triggers this.

We've tried lowering arc_size but that seemed fruitless.

What does *seem* to help, but might drastically impact performance if you
read a lot from your pool, is asking Linux to drop its cached memory:

# sync
# echo 3 > /proc/sys/vm/drop_caches

When I did the above on the server while it was getting memory bound, I
got about 180GB of memory back.
The systems doing storage (and shown in output below) have 64GB RAM, no
dedup, no compression. Zpool configuration [3] is two 10-disk RAIDZ2's
with mirrored zil and striped cached. Both zil and cache are SSD (though
they share the same SSDs, just partitioned separately).
Our pool does compression (lz4) and has no zil/cache attached (yet).


Apparently the above is a known 'bug'. People on this list have stated
that 'ARC [memory] fragmentation can cause the ARC to grow to 2-3x the
allowed size'.


-Sndr.
--
| Two blondes walk into a building.
| You'd think at least one of them would have seen it.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Trey Dockendorf
2013-08-16 18:37:40 UTC
So far setting arc_max to 25-30% of RAM and vm.min_free_kbytes to 512MB
has kept my systems from crashing.
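For anyone else following along, a minimal sketch of how to make those two
settings persistent on EL6 (the byte value is just an example; adjust to
your RAM):

# /etc/modprobe.d/zfs.conf -- cap the ARC at module load (bytes; ~16 GiB = 25% of 64GB)
options zfs zfs_arc_max=17179869184

# /etc/sysctl.d/zfs.conf -- keep 512 MiB free (kB)
vm.min_free_kbytes=524288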
Post by Sander Smeenk
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3 days.
While transferring ~58TB I began to notice the 2 storage servers and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool. From
time to time memory consumption spikes way past any limits set and the
server grinds to a slow halt. It's had to put a finger on what actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance if you
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory bound, i
got about 180GB of memory back.
The systems doing storage (and shown in output below) have 64GB RAM, no
dedup, no compression. Zpool configuration [3] is two 10-disk RAIDZ2's
with mirrored zil and striped cached. Both zil and cache are SSD
(though
they share the same SSDs, just partitioned separately).
Our pool does compression (lz4) and has no zil/cache attached (yet).
Aparently the above is a know 'bug'. People on this list have stated
that 'ARC [memory] fragmentation can cause the ARC to grow to 2/3x the
allowed size'.
-Sndr.
--
| Two blondes walk into a building.
| You'd think at least one of them would have seen it.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Gordan Bobic
2013-08-16 20:13:00 UTC
What I find works quite well is setting arc_max quite small (up to
about 1GB per TB of usable disk space). Then I set up one ZRAM device per
CPU core on the machine so that their total size equals the amount I want
to use for ZFS caching, and I set those ZRAM devices up as L2ARC. Enabling
caching of prefetch data can help achieve better hit ratios in most cases.
If your disk:RAM ratio is very high, set the L2ARC to metadata-only.
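A rough sketch of those last two knobs, assuming a pool named 'tank' (the
pool name is a placeholder, and I'm reading "caching prefetch data" as the
l2arc_noprefetch module parameter; check what your ZoL version exposes):

# cache only metadata in L2ARC when the disk:RAM ratio is very high
zfs set secondarycache=metadata tank
# 0 = also feed prefetched buffers into L2ARC (default is 1, i.e. skip them)
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch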
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3 days.
While transferring ~58TB I began to notice the 2 storage servers and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool. From
time to time memory consumption spikes way past any limits set and the
server grinds to a slow halt. It's had to put a finger on what actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance if you
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory bound, i
got about 180GB of memory back.
The systems doing storage (and shown in output below) have 64GB
RAM, no
dedup, no compression. Zpool configuration [3] is two 10-disk
RAIDZ2's
with mirrored zil and striped cached. Both zil and cache are SSD
(though
they share the same SSDs, just partitioned separately).
Our pool does compression (lz4) and has no zil/cache attached (yet).
Aparently the above is a know 'bug'. People on this list have stated
that 'ARC [memory] fragmentation can cause the ARC to grow to 2/3x the
allowed size'.
-Sndr.
--
| Two blondes walk into a building.
| You'd think at least one of them would have seen it.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Trey Dockendorf
2013-08-19 15:50:17 UTC
Those sound like very interesting ideas. I'm not familiar with some of
those concepts. Would you mind sharing how to set up ZRAM?

Right now having a RAM-backed L2ARC would greatly benefit my FhGFS metadata
server, as it is all SSDs and needs very low latency for small
reads/writes, which are all in 0-byte file xattrs.

- Trey
Post by Gordan Bobic
What I find works quite well is setting the arc_max quite small (up to
about 1GB per TB of usable disk space). Then I set up one ZRAM per CPU core
on the machine so that their total size equals the amount I want to use for
ZFS caching, and I set up those ZRAM devices as L2ARC. Enabling ARC-ing
prefetch data can help achieve better hit ratios in most cases. If your
disk:RAM ratio is very high, set L2ARC to metadata-only.
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3 days.
While transferring ~58TB I began to notice the 2 storage servers
and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool. From
time to time memory consumption spikes way past any limits set and the
server grinds to a slow halt. It's had to put a finger on what actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance if you
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory bound, i
got about 180GB of memory back.
The systems doing storage (and shown in output below) have 64GB
RAM, no
dedup, no compression. Zpool configuration [3] is two 10-disk
RAIDZ2's
with mirrored zil and striped cached. Both zil and cache are SSD
(though
they share the same SSDs, just partitioned separately).
Our pool does compression (lz4) and has no zil/cache attached (yet).
Aparently the above is a know 'bug'. People on this list have stated
that 'ARC [memory] fragmentation can cause the ARC to grow to 2/3x the
allowed size'.
-Sndr.
--
| Two blondes walk into a building.
| You'd think at least one of them would have seen it.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Gordan Bobic
2013-08-19 15:57:18 UTC
Here is what I use on a 12-core machine (EL6 based, config files will
differ on different distros):

# cat /etc/modprobe.d/zram.conf
options zram num_devices=12

# cat /etc/sysconfig/modules/zram.modules
#!/bin/bash
#
# Fix mem to 1GB per zram
mem=1073741824

modprobe zram
sleep 1

pushd /dev > /dev/null
for zram in zram*; do
echo $mem > /sys/block/$zram/disksize
# zpool remove ssd /dev/$zram
# zpool add ssd cache /dev/$zram
done
popd > /dev/null

If you find that your ZRAM devices get set up before the pools get imported,
drop the two zpool lines from zram.modules (commented out here) - if the
zrams are already there, they will get into the pool automatically when the
pool gets imported, provided they were previously added to the pool.
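As a usage example (pool name 'ssd' as in the script above), the one-time
add and a quick check that the zram devices are actually serving as cache
vdevs:

# attach each zram device as an L2ARC cache vdev (first time only)
for zram in /dev/zram*; do zpool add ssd cache $zram; done
# they should now show up under the 'cache' section and slowly fill
zpool status ssd
zpool iostat -v ssd 5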
Post by Trey Dockendorf
Those sound like very interesting ideas. I'm not familiar with some of
those concepts. Would you mind sharing how to setup ZRAM ?
Right now having a RAM backed L2ARC would greatly benefit my FhGFS
metadata server as it is all SSDs and needs to be very low latency for
small read/writes which are all in 0byte file xattrs.
- Trey
Post by Gordan Bobic
What I find works quite well is setting the arc_max quite small (up to
about 1GB per TB of usable disk space). Then I set up one ZRAM per CPU core
on the machine so that their total size equals the amount I want to use for
ZFS caching, and I set up those ZRAM devices as L2ARC. Enabling ARC-ing
prefetch data can help achieve better hit ratios in most cases. If your
disk:RAM ratio is very high, set L2ARC to metadata-only.
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3 days.
While transferring ~58TB I began to notice the 2 storage servers
and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool. From
time to time memory consumption spikes way past any limits set and the
server grinds to a slow halt. It's had to put a finger on what actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance if you
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory bound, i
got about 180GB of memory back.
The systems doing storage (and shown in output below) have 64GB
RAM, no
dedup, no compression. Zpool configuration [3] is two 10-disk
RAIDZ2's
with mirrored zil and striped cached. Both zil and cache are SSD
(though
they share the same SSDs, just partitioned separately).
Our pool does compression (lz4) and has no zil/cache attached (yet).
Aparently the above is a know 'bug'. People on this list have stated
that 'ARC [memory] fragmentation can cause the ARC to grow to 2/3x the
allowed size'.
-Sndr.
--
| Two blondes walk into a building.
| You'd think at least one of them would have seen it.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Trey Dockendorf
2013-08-19 16:07:28 UTC
Thanks! That will be extremely useful. Is zram a module provided by ZoL
or something in the stock kernel?

Thanks,
- Trey
On Aug 19, 2013 10:57 AM, "Gordan Bobic" <gordan.bobic-***@public.gmane.org> wrote:

Here is what I use on a 12-core machine (EL6 based, config files will
differ on different distros):

# cat /etc/modprobe.d/zram.conf
options zram num_devices=12

# cat /etc/sysconfig/modules/zram.modules
#!/bin/bash
#
# Fix mem to 1GB per zram
mem=1073741824

modprobe zram
sleep 1

pushd /dev > /dev/null
for zram in zram*; do
echo $mem > /sys/block/$zram/disksize
# zpool remove ssd /dev/$zram
# zpool add ssd cache /dev/$zram
done
popd > /dev/null

If you find that your ZRAMs get set up before pools get imported, drop the
two zpool lines from zram.modules (commented out here) - if the zrams are
already there, they will get into the poor automatically when the pool gets
imported if they were previously added to the pool.
Post by Trey Dockendorf
Those sound like very interesting ideas. I'm not familiar with some of
those concepts. Would you mind sharing how to setup ZRAM ?
Right now having a RAM backed L2ARC would greatly benefit my FhGFS
metadata server as it is all SSDs and needs to be very low latency for
small read/writes which are all in 0byte file xattrs.
- Trey
Post by Gordan Bobic
What I find works quite well is setting the arc_max quite small (up to
about 1GB per TB of usable disk space). Then I set up one ZRAM per CPU core
on the machine so that their total size equals the amount I want to use for
ZFS caching, and I set up those ZRAM devices as L2ARC. Enabling ARC-ing
prefetch data can help achieve better hit ratios in most cases. If your
disk:RAM ratio is very high, set L2ARC to metadata-only.
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3 days.
While transferring ~58TB I began to notice the 2 storage servers
and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool. From
time to time memory consumption spikes way past any limits set and the
server grinds to a slow halt. It's had to put a finger on what actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance if you
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory bound, i
got about 180GB of memory back.
The systems doing storage (and shown in output below) have 64GB
RAM, no
dedup, no compression. Zpool configuration [3] is two 10-disk
RAIDZ2's
with mirrored zil and striped cached. Both zil and cache are SSD
(though
they share the same SSDs, just partitioned separately).
Our pool does compression (lz4) and has no zil/cache attached (yet).
Aparently the above is a know 'bug'. People on this list have stated
that 'ARC [memory] fragmentation can cause the ARC to grow to 2/3x the
allowed size'.
-Sndr.
--
| Two blondes walk into a building.
| You'd think at least one of them would have seen it.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Gordan Bobic
2013-08-19 16:16:20 UTC
It's in the upstream Linux kernel and has nothing at all to do with ZFS or
ZoL per se. Whether your distro kernel is built with it is another question.
It is certainly enabled in the EL6-based distros.

Gordan
Post by Trey Dockendorf
Thanks! That will be extremely useful. Is zram a module provided by ZoL
or something in stock kernel?
Thanks,
- Trey
Here is what I use on a 12-core machine (EL6 based, config files will
# cat /etc/modprobe.d/zram.conf
options zram num_devices=12
# cat /etc/sysconfig/modules/zram.modules
#!/bin/bash
#
# Fix mem to 1GB per zram
mem=1073741824
modprobe zram
sleep 1
pushd /dev > /dev/null
for zram in zram*; do
echo $mem > /sys/block/$zram/disksize
# zpool remove ssd /dev/$zram
# zpool add ssd cache /dev/$zram
done
popd > /dev/null
If you find that your ZRAMs get set up before pools get imported, drop the
two zpool lines from zram.modules (commented out here) - if the zrams are
already there, they will get into the poor automatically when the pool gets
imported if they were previously added to the pool.
Post by Trey Dockendorf
Those sound like very interesting ideas. I'm not familiar with some of
those concepts. Would you mind sharing how to setup ZRAM ?
Right now having a RAM backed L2ARC would greatly benefit my FhGFS
metadata server as it is all SSDs and needs to be very low latency for
small read/writes which are all in 0byte file xattrs.
- Trey
Post by Gordan Bobic
What I find works quite well is setting the arc_max quite small (up to
about 1GB per TB of usable disk space). Then I set up one ZRAM per CPU core
on the machine so that their total size equals the amount I want to use for
ZFS caching, and I set up those ZRAM devices as L2ARC. Enabling ARC-ing
prefetch data can help achieve better hit ratios in most cases. If your
disk:RAM ratio is very high, set L2ARC to metadata-only.
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3 days.
While transferring ~58TB I began to notice the 2 storage servers
and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool. From
time to time memory consumption spikes way past any limits set and the
server grinds to a slow halt. It's had to put a finger on what actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance if you
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory bound, i
got about 180GB of memory back.
The systems doing storage (and shown in output below) have 64GB
RAM, no
dedup, no compression. Zpool configuration [3] is two 10-disk
RAIDZ2's
with mirrored zil and striped cached. Both zil and cache are SSD
(though
they share the same SSDs, just partitioned separately).
Our pool does compression (lz4) and has no zil/cache attached (yet).
Aparently the above is a know 'bug'. People on this list have stated
that 'ARC [memory] fragmentation can cause the ARC to grow to 2/3x the
allowed size'.
-Sndr.
--
| Two blondes walk into a building.
| You'd think at least one of them would have seen it.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Trey Dockendorf
2013-08-19 16:23:03 UTC
I'm using EL6 so that's good news. Will give it a try.

Will the benefits of L2ARC using zram be negated if the cache also contains
SSD-backed devices? Our storage units have two 240GB SSDs intended for
cache and ZIL.

Thanks
- Trey
Post by Gordan Bobic
It's in the Linux upstream kernel and has nothing at all to do with ZFS or
ZoL per se. Whether you distro kernel is built with it is another question.
It is certainly enabled in the EL6 based distros.
Gordan
Post by Trey Dockendorf
Thanks! That will be extremely useful. Is zram a module provided by ZoL
or something in stock kernel?
Thanks,
- Trey
Here is what I use on a 12-core machine (EL6 based, config files will
# cat /etc/modprobe.d/zram.conf
options zram num_devices=12
# cat /etc/sysconfig/modules/zram.modules
#!/bin/bash
#
# Fix mem to 1GB per zram
mem=1073741824
modprobe zram
sleep 1
pushd /dev > /dev/null
for zram in zram*; do
echo $mem > /sys/block/$zram/disksize
# zpool remove ssd /dev/$zram
# zpool add ssd cache /dev/$zram
done
popd > /dev/null
If you find that your ZRAMs get set up before pools get imported, drop
the two zpool lines from zram.modules (commented out here) - if the zrams
are already there, they will get into the poor automatically when the pool
gets imported if they were previously added to the pool.
Post by Trey Dockendorf
Those sound like very interesting ideas. I'm not familiar with some of
those concepts. Would you mind sharing how to setup ZRAM ?
Right now having a RAM backed L2ARC would greatly benefit my FhGFS
metadata server as it is all SSDs and needs to be very low latency for
small read/writes which are all in 0byte file xattrs.
- Trey
Post by Gordan Bobic
What I find works quite well is setting the arc_max quite small (up to
about 1GB per TB of usable disk space). Then I set up one ZRAM per CPU core
on the machine so that their total size equals the amount I want to use for
ZFS caching, and I set up those ZRAM devices as L2ARC. Enabling ARC-ing
prefetch data can help achieve better hit ratios in most cases. If your
disk:RAM ratio is very high, set L2ARC to metadata-only.
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3 days.
While transferring ~58TB I began to notice the 2 storage
servers and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool. From
time to time memory consumption spikes way past any limits set and the
server grinds to a slow halt. It's had to put a finger on what actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance if you
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory bound, i
got about 180GB of memory back.
The systems doing storage (and shown in output below) have 64GB
RAM, no
dedup, no compression. Zpool configuration [3] is two 10-disk
RAIDZ2's
with mirrored zil and striped cached. Both zil and cache are
SSD
(though
they share the same SSDs, just partitioned separately).
Our pool does compression (lz4) and has no zil/cache attached (yet).
Aparently the above is a know 'bug'. People on this list have stated
that 'ARC [memory] fragmentation can cause the ARC to grow to 2/3x the
allowed size'.
-Sndr.
--
| Two blondes walk into a building.
| You'd think at least one of them would have seen it.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Marcus Sorensen
2013-08-19 16:35:03 UTC
Would it be feasible to have some sort of 'preallocation' zfs module
parameter, so that the effects of fragmentation are limited? I'm
thinking something along the lines of it doing a huge malloc up-front,
maybe some large percentage of arc_max, and never letting go of that.
It may be wasteful, but it seems like most people running zfs are
already assuming it will use that memory and building their machines
as such. Perhaps it would be easier to just fix the problem outright,
but I'm wondering if that might be a valid band-aid.
Post by Trey Dockendorf
I'm using EL6 so that's good news. Will give it a try.
Will the benefits of L2ARC using zram be negated if the cache also contains
SSD backed devices? Our storage units have two 240GB SSDs intended for
cache and zil.
Thanks
- Trey
Post by Gordan Bobic
It's in the Linux upstream kernel and has nothing at all to do with ZFS or
ZoL per se. Whether you distro kernel is built with it is another question.
It is certainly enabled in the EL6 based distros.
Gordan
Post by Trey Dockendorf
Thanks! That will be extremely useful. Is zram a module provided by ZoL
or something in stock kernel?
Thanks,
- Trey
Here is what I use on a 12-core machine (EL6 based, config files will
# cat /etc/modprobe.d/zram.conf
options zram num_devices=12
# cat /etc/sysconfig/modules/zram.modules
#!/bin/bash
#
# Fix mem to 1GB per zram
mem=1073741824
modprobe zram
sleep 1
pushd /dev > /dev/null
for zram in zram*; do
echo $mem > /sys/block/$zram/disksize
# zpool remove ssd /dev/$zram
# zpool add ssd cache /dev/$zram
done
popd > /dev/null
If you find that your ZRAMs get set up before pools get imported, drop
the two zpool lines from zram.modules (commented out here) - if the zrams
are already there, they will get into the poor automatically when the pool
gets imported if they were previously added to the pool.
Post by Trey Dockendorf
Those sound like very interesting ideas. I'm not familiar with some of
those concepts. Would you mind sharing how to setup ZRAM ?
Right now having a RAM backed L2ARC would greatly benefit my FhGFS
metadata server as it is all SSDs and needs to be very low latency for small
read/writes which are all in 0byte file xattrs.
- Trey
Post by Gordan Bobic
What I find works quite well is setting the arc_max quite small (up to
about 1GB per TB of usable disk space). Then I set up one ZRAM per CPU core
on the machine so that their total size equals the amount I want to use for
ZFS caching, and I set up those ZRAM devices as L2ARC. Enabling ARC-ing
prefetch data can help achieve better hit ratios in most cases. If your
disk:RAM ratio is very high, set L2ARC to metadata-only.
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3
days.
While transferring ~58TB I began to notice the 2 storage
servers and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool. From
time to time memory consumption spikes way past any limits set and the
server grinds to a slow halt. It's had to put a finger on what actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance if you
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory bound, i
got about 180GB of memory back.
The systems doing storage (and shown in output below) have 64GB
RAM, no
dedup, no compression. Zpool configuration [3] is two 10-disk
RAIDZ2's
with mirrored zil and striped cached. Both zil and cache are
SSD
(though
they share the same SSDs, just partitioned separately).
Our pool does compression (lz4) and has no zil/cache attached (yet).
Aparently the above is a know 'bug'. People on this list have stated
that 'ARC [memory] fragmentation can cause the ARC to grow to 2/3x the
allowed size'.
-Sndr.
--
| Two blondes walk into a building.
| You'd think at least one of them would have seen it.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Niels de Carpentier
2013-08-19 19:38:36 UTC
Post by Marcus Sorensen
Would it be feasible to have some sort of 'preallocation' zfs module
parameter, so that the effects of fragmentation are limited? I'm
thinking something along the lines of it doing a huge malloc up-front,
maybe some large percentage of arc_max, and never letting go of that.
It may be wasteful, but it seems like most people running zfs are
already assuming it will use that memory and building their machines
as such. Perhaps it would be easier to just fix the problem outright,
but I'm wondering if that might be a valid band-aid.
No, unfortunately that won't help. The problem is that you need to store
objects of different sizes and different lifetimes, and are not able to
move them. In time this will always lead to fragmentation. (Say you fill
the cache with 512B blocks, randomly remove half of them, and want to fill
the freed space with 128kB blocks.) There is also a lot of overhead, as
there are alignment requirements which lead to a lot of wasted space.

One thing that might work is to count the allocated memory towards
arc_size instead of the size used by the objects. This won't fix the core
issue, but will prevent the ARC from using more memory than arc_max. I
suspect it's not easy to do without major code changes, though.

Niels



Uncle Stoat
2013-08-20 12:25:53 UTC
Post by Trey Dockendorf
I'm using EL6 so that's good news. Will give it a try.
Will the benefits of L2ARC using zram be negated if the cache also contains
SSD backed devices? Our storage units have two 240GB SSDs intended for
cache and zil.
Zram expands into swap. To make use of the SSDs with L2ARC-on-zram you need
to reconfigure the L2ARC partitions as swap.
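A sketch of that re-purposing, assuming hypothetical device names /dev/sda4
and /dev/sdb4 for the old L2ARC partitions and a pool named 'tank':

# detach the SSD cache vdevs from the pool first
zpool remove tank /dev/sda4 /dev/sdb4
# then turn the freed partitions into high-priority swap behind the zram L2ARC
mkswap /dev/sda4 && swapon -p 10 /dev/sda4
mkswap /dev/sdb4 && swapon -p 10 /dev/sdb4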


Gordan Bobic
2013-08-20 12:34:13 UTC
Post by Trey Dockendorf
I'm using EL6 so that's good news. Will give it a try.
Will the benefits of L2ARC using zram be negated if the cache also contains
SSD backed devices? Our storage units have two 240GB SSDs intended for
cache and zil.
Zram expands into swap. To make use of them with l2zrc-on-zram you need to
reconfigure the l2arc partitions as swap.
That's actually a pretty neat approach - use lots of zram for L2ARC and
swap onto SSD if the compression ratio happens to get uncharacteristically
bad. Certainly neater than flashcache-ing zram with SSD. Not sure which
would perform faster, though.

Gordan

Sander Smeenk
2013-08-20 07:23:57 UTC
Permalink
Post by Gordan Bobic
What I find works quite well is setting the arc_max quite small (up
to about 1GB per TB of usable disk space). Then I set up one ZRAM
per CPU core on the machine so that their total size equals the
amount I want to use for ZFS caching, and I set up those ZRAM
devices as L2ARC. Enabling ARC-ing prefetch data can help achieve
better hit ratios in most cases. If your disk:RAM ratio is very
high, set L2ARC to metadata-only.
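For reference, the 'set L2ARC to metadata-only' part above is normally done
through the 'secondarycache' property; a minimal sketch, assuming a pool
named 'tank':

| # cache only metadata in the L2ARC for this pool and its datasets
| zfs set secondarycache=metadata tank
| # inspect what the ARC and L2ARC are allowed to cache
| zfs get primarycache,secondarycache tank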
I am now running with 32 ZRAM-caches in my pool and arc_max set to 25%
of memory. The remaining memory, minus 8GB for 'OS stuff', is used by
the ZRAMs.

First tests are promising, though i also had good runs with the stock
config so this could still go anywhere... ;-))

I adapted a script to set up ZRAMs from the 'zram-config' package in
Ubuntu (which just puts ZRAMs as swap) which scales nicely when i
change arcsize and/or number of CPUs:

| #!/bin/sh
| # $Id: zramsetup 284 2013-08-19 18:31:02Z sanders $
|
| nrdevices=$(grep -c ^processor /proc/cpuinfo | sed 's/^0$/1/')
| if modinfo zram | grep -q ' zram_num_devices:' 2>/dev/null; then
| modprobeargs="zram_num_devices=${nrdevices}"
| elif modinfo zram | grep -q ' num_devices:' 2>/dev/null; then
| modprobeargs="num_devices=${nrdevices}"
| else
| echo "Unknown / unsupported 'zram' module?"
| exit 1
| fi
|
| modprobe zram $modprobeargs
|
| # Calculate memory to use for zram [ in bytes! ]
| totalmem=$(free -b | awk '/^Mem: */ {print $2}')
| arcsize=$(cat /sys/module/zfs/parameters/zfs_arc_max)
| ossize=$((8 * 1024 * 1024 * 1024))
| availmem=$((totalmem-$((arcsize+ossize))))
| memperdev=$((availmem / 2 / nrdevices))
|
| echo "Total memory : $totalmem"
| echo "Max ARC size : $arcsize"
| echo "OS memory : $ossize"
| echo ""
| echo "Unused memory: $availmem"
| echo "Per device : $memperdev (x$nrdevices)"
| echo ""
|
| # initialize the devices
| for i in $(seq ${nrdevices}); do
| echo $memperdev > /sys/block/zram$((i - 1))/disksize
| done
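
For completeness, the freshly sized zram devices still have to be added to
the pool as cache vdevs before they act as L2ARC; a sketch, assuming the
pool is called 'tank' and reusing the $nrdevices variable from the script
above:

| for i in $(seq ${nrdevices}); do
|   zpool add tank cache /dev/zram$((i - 1))
| done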
Post by Gordan Bobic
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
I also set:
| # grep -v "^#" /etc/sysctl.d/zfs.conf | grep -v "^$"
| vm.min_free_kbytes=4026532
| vm.dirty_background_ratio=5
| vm.swappiness=10
| vm.vfs_cache_pressure=10000

But by now i have tweaked so much i don't really know what helps and
what doesn't. All these tips/tweaks were gathered from various zfs
mailinglists / websites / howto's etc..

Oh, and i 'echo 3 > /proc/sys/vm/drop_caches' every 30 minutes.
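
A minimal root crontab entry for that (assuming cron runs the job as root)
would look something like:

| */30 * * * * /bin/sync && echo 3 > /proc/sys/vm/drop_caches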


Thanks for all the good tips and help so far!!
-Sndr.
--
| The world is so full of these wonderful things,
| i'm sure we should all be as happy as kings.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2

Fajar A. Nugraha
2013-08-20 07:31:01 UTC
Permalink
Post by Sander Smeenk
| # grep -v "^#" /etc/sysctl.d/zfs.conf | grep -v "^$"
| vm.min_free_kbytes=4026532
| vm.dirty_background_ratio=5
| vm.swappiness=10
| vm.vfs_cache_pressure=10000
Oh, and i 'echo 3 > /proc/sys/vm/drop_caches' every 30 minutes.
You shouldn't need to drop cache every 30 minutes if you have those
tunables above on sysctl.d. I suggest you just remove the cron for the job,
as dropping caches aggressively can have a negative effect on performance.
--
Fajar

Sander Smeenk
2013-08-20 07:35:55 UTC
Permalink
Post by Fajar A. Nugraha
Post by Sander Smeenk
| # grep -v "^#" /etc/sysctl.d/zfs.conf | grep -v "^$"
| vm.min_free_kbytes=4026532
| vm.dirty_background_ratio=5
| vm.swappiness=10
| vm.vfs_cache_pressure=10000
Oh, and i 'echo 3 > /proc/sys/vm/drop_caches' every 30 minutes.
You shouldn't need to drop cache every 30 minutes if you have those
tunables above on sysctl.d. I suggest you just remove the cron for the job,
as dropping caches aggressively can have a negative effect on performance.
Adding this cronjob had the best (positive) influence on stability
compared to all other tweaks and hacks suggested here. I understand it's
not the best thing to do, but until the box stays running for a week i
will keep this cronjob. ;)

-Sndr.
--
| 1 1 was a racehorse, 2 2 was 1 2, 1 1 1 1 race 1 day, 2 2 1 1 2
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2

Prakash Surya
2013-08-20 16:47:58 UTC
Permalink
Post by Sander Smeenk
Post by Fajar A. Nugraha
Post by Sander Smeenk
| # grep -v "^#" /etc/sysctl.d/zfs.conf | grep -v "^$"
| vm.min_free_kbytes=4026532
| vm.dirty_background_ratio=5
| vm.swappiness=10
| vm.vfs_cache_pressure=10000
Oh, and i 'echo 3 > /proc/sys/vm/drop_caches' every 30 minutes.
You shouldn't need to drop cache every 30 minutes if you have those
tunables above on sysctl.d. I suggest you just remove the cron for the job,
as dropping caches aggressively can have a negative effect on performance.
Adding this cronjob had the best (positive) influence on stability
compared to all other tweaks and hacks suggested here. I understand it's
not the best thing to do, but until the box stays running for a week i
will keep this cronjob. ;)
In that case, why not just set the ARC to some small value? Say 1/10th
of RAM or less. By dropping the cache, you're invalidating any work the
ARC (plus many other caches) has done every 30 mins.
--
Cheers, Prakash
Post by Sander Smeenk
-Sndr.
--
| 1 1 was a racehorse, 2 2 was 1 2, 1 1 1 1 race 1 day, 2 2 1 1 2
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Sander Smeenk
2013-08-21 06:45:06 UTC
Permalink
Post by Prakash Surya
Post by Fajar A. Nugraha
Post by Sander Smeenk
Oh, and i 'echo 3 > /proc/sys/vm/drop_caches' every 30 minutes.
You shouldn't need to drop cache every 30 minutes
I understand it's not the best thing to do, but until the box stays
running for a week i will keep this cronjob. ;)
In that case, why not just set the ARC to some small value?
Say 1/10th of RAM or less.
An 'insanely small' ARC with a large zpool is just as bad for performance
as dropping the caches every 30 mins, if not worse. Plus, during those
30 minutes i get the added bonus of having data cached. ;)

It's just that there's a bug(?) in ZFS memory management which causes my
system to fail under heavy IO-loads and this trick seems to keep it
(more) stable.
Post by Prakash Surya
By dropping the cache, you're invalidating any work the ARC (plus many
other caches) has done every 30 mins.
I'm very aware of this and the non-optimal situation this creates,
however, i also want my storage to be available.


On the up-side, the ZRAM/L2ARC trick by Gordan Bobic seems to work
wonders and i have disabled the drop-the-caches cronjob yesterday.
I'll have to see how this holds up...


Thanks,
-Sndr
--
| Today is the first day of the rest of your life
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2

Sander Klein
2013-08-21 07:42:27 UTC
Permalink
Hi,
Post by Sander Smeenk
Post by Prakash Surya
Post by Fajar A. Nugraha
Post by Sander Smeenk
Oh, and i 'echo 3 > /proc/sys/vm/drop_caches' every 30 minutes.
You shouldn't need to drop cache every 30 minutes
I understand it's not the best thing to do, but until the box stays
running for a week i will keep this cronjob. ;)
In that case, why not just set the ARC to some small value?
Say 1/10th of RAM or less.
An 'insanely small' ARC with a large zpool is just as bad or even worse
for performance as dropping the caches every 30 mins. Plus, during the
30 minutes i get the added bonus of having data cached. ;)
It's just that there's a bug(?) in ZFS memory management which causes my
system to fail under heavy IO-loads and this trick seems to keep it
(more) stable.
Post by Prakash Surya
By dropping the cache, you're invalidating any work the ARC (plus many
other caches) has done every 30 mins.
I'm very aware of this and the non-optimal situation this creates,
however, i also want my storage to be available.
On the up-side, the ZRAM/L2ARC trick by Gordan Bobic seems to work
wonders and i have disabled the drop-the-caches cronjob yesterday.
I'll have to see how this holds up...
I wonder why one would want to use ZRAM.

While I do understand ZRAM compresses the L2ARC and RAM is faster than
SSD, I would think that adding relatively cheap SSDs and having a
bigger ARC would be more economical.

With an eye on the next ZFS release, which also compresses the *ARC,
the ZRAM benefit would be even lower.

I might be completely wrong about this, so is someone willing to
explain? :-)

Greets,

Sander

Gordan Bobic
2013-08-21 08:40:57 UTC
Permalink
Prakash Surya
2013-08-21 18:11:49 UTC
Permalink
Post by Sander Klein
Hi,
Post by Sander Smeenk
Post by Prakash Surya
Post by Fajar A. Nugraha
Post by Sander Smeenk
Oh, and i 'echo 3 > /proc/sys/vm/drop_caches' every 30 minutes.
You shouldn't need to drop cache every 30 minutes
I understand it's not the best thing to do, but until the box stays
running for a week i will keep this cronjob. ;)
In that case, why not just set the ARC to some small value?
Say 1/10th of RAM or less.
An 'insanely small' ARC with a large zpool is just as bad or even worse
for performance as dropping the caches every 30 mins. Plus, during the
30 minutes i get the added bonus of having data cached. ;)
It's just that there's a bug(?) in ZFS memory management which causes my
system to fail under heavy IO-loads and this trick seems to keep it
(more) stable.
Post by Prakash Surya
By dropping the cache, you're invalidating any work the ARC
(plus many
other caches) has done every 30 mins.
I'm very aware of this and the non-optimal situation this creates,
however, i also want my storage to be available.
On the up-side, the ZRAM/L2ARC trick by Gordan Bobic seems to work
wonders and i have disabled the drop-the-caches cronjob yesterday.
I'll have to see how this holds up...
I wonder why one would want to use ZRAM.
While I do understand ZRAM compresses the L2ARC and RAM is faster
than SSD, I would think that adding relatively cheap SSD's and
having a bigger ARC would be more economic.
With the eye on the next ZFS release which also compresses the *ARC
ZRAM benefit would be even lower.
I might be completely wrong about this, so is someone willing to
explain? :-)
In this instance, it's not really about compressing the l2arc data, it's
more about reducing the ARC size (to reduce memory fragmentation) and
offloading the cache to the l2arc altogether (to maintain reasonable
performance).

The real issue being "fixed" is reducing the fragmentation caused by ARC
buffers backed by the SPL SLAB. It seems as though pushing that data
into a RAM based l2arc allows for "good enough" cache performance, while
limiting the SPL SLAB fragmentation issues to a tolerable level.

This list has a knack for finding interesting workarounds. ;)
--
Cheers, Prakash
Post by Sander Klein
Greets,
Sander
Prakash Surya
2013-08-21 18:01:32 UTC
Permalink
Post by Sander Smeenk
Post by Prakash Surya
Post by Fajar A. Nugraha
Post by Sander Smeenk
Oh, and i 'echo 3 > /proc/sys/vm/drop_caches' every 30 minutes.
You shouldn't need to drop cache every 30 minutes
I understand it's not the best thing to do, but until the box stays
running for a week i will keep this cronjob. ;)
In that case, why not just set the ARC to some small value?
Say 1/10th of RAM or less.
An 'insanely small' ARC with a large zpool is just as bad or even worse
for performance as dropping the caches every 30 mins. Plus, during the
30 minutes i get the added bonus of having data cached. ;)
It's just that there's a bug(?) in ZFS memory management which causes my
system to fail under heavy IO-loads and this trick seems to keep it
(more) stable.
Right, I'm aware of the memory issue, I was just thinking that setting
the ARC to a smaller value could prevent the fragmentation issue from
making your system unusable. As a bonus, you _might_ get better cache
performance depending on the workload.
Post by Sander Smeenk
Post by Prakash Surya
By dropping the cache, you're invalidating any work the ARC (plus many
other caches) has done every 30 mins.
I'm very aware of this and the non-optimal situation this creates,
however, i also want my storage to be available.
On the up-side, the ZRAM/L2ARC trick by Gordan Bobic seems to work
wonders and i have disabled the drop-the-caches cronjob yesterday.
I'll have to see how this holds up...
Thanks,
-Sndr
--
| Today is the first day of the rest of your life
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Sander Smeenk
2013-08-19 07:47:28 UTC
Permalink
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
Thanks for this!

It feels insane limiting ARC to ~20GB on a 192GB system, but it seems to
help in stabilizing the system a bit... Thing is, there's so many
tunables i can't really tell what combination works and what doesn't.

-Sndr.
--
| Arachnoleptic fit (n.): The frantic dance performed just after you've
| accidentally walked through a spider web.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Gordan Bobic
2013-08-19 08:35:59 UTC
Permalink
I generally find that keeping ARC small and setting up between 4:1 and 8:1
ZRAM L2ARC:ARC works best. L2ARC cannot exceed its allocation, and ZRAM
will still be faster than SSD. So unless you need massively more L2ARC than
you have RAM, ZRAM is probably a good solution.
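To put hypothetical numbers on that ratio: on a 64GB box one might cap the
ARC at 4GB and back it with ~16GB of zram-based L2ARC, roughly:

# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=4294967296
# plus 16 x 1GB zram devices added as cache vdevs, giving ~4:1 L2ARC:ARC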
Post by Sander Smeenk
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
Thanks for this!
It feels insane limiting ARC to ~20GB on a 192GB system, but it seems to
help in stabilizing the system a bit... Thing is, there's so many
tunables i can't really tell what combination works and what doesn't.
-Sndr.
--
| Arachnoleptic fit (n.): The frantic dance performed just after you've
| accidentally walked through a spider web.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Sander Smeenk
2013-08-19 08:45:09 UTC
Permalink
Post by Gordan Bobic
I generally find that keeping ARC small and setting up between 4:1 and
8:1 ZRAM L2ARC:ARC works best. L2ARC cannot exceed its allocation,
and ZRAM will still be faster than SSD. So unless you need massively
more L2ARC than you have RAM, ZRAM is probably a good solution.
Thanks for this. I'm new to this 'zram' kernel module.
Will sure give this a try!
--
| Visitors always give pleasure: if not on arrival, then on the departure.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2
Niels de Carpentier
2013-08-19 10:16:24 UTC
Permalink
Post by Gordan Bobic
I generally find that keeping ARC small and setting up between 4:1 and 8:1
ZRAM L2ARC:ARC works best. L2ARC cannot exceed its allocation, and ZRAM
will still be faster than SSD. So unless you need massively more L2ARC than
you have RAM, ZRAM is probably a good solution.
That's an interesting solution. It doesn't fix the fragmentation issue of
course, just limits it. But a good solution until the core issue is fixed
I think.

It would be interesting to try lowering the number of objects per slab in
the SPL slabcache. This will increase overhead but should reduce
fragmentation.

I'll see if I can find some time to do some tests on this. It won't solve
the issue, but it's a simple thing to change and might improve things a
bit.

Niels


Bryn Hughes
2013-08-19 16:42:21 UTC
Permalink
Post by Gordan Bobic
I generally find that keeping ARC small and setting up between 4:1 and
8:1 ZRAM L2ARC:ARC works best. L2ARC cannot exceed its allocation,
and ZRAM will still be faster than SSD. So unless you need massively
more L2ARC than you have RAM, ZRAM is probably a good solution.
Hey Gordon, do you have a script or something to set this up on boot? I
assume you need to create the ZRAM devices and then add them to the pool
again each time, or ... ?

Bryn
Bryn Hughes
2013-08-19 17:17:45 UTC
Permalink
Post by Bryn Hughes
Post by Gordan Bobic
I generally find that keeping ARC small and setting up between 4:1 and
8:1 ZRAM L2ARC:ARC works best. L2ARC cannot exceed it's allocation,
and ZRAM will still be faster than SSD. So unless you need massively
more L2ARC than you have RAM, ZRAM is probably a good solution.
Hey Gordon, do you have a script or something to set this up on boot? I
assume you need to create the ZRAM devices and then add them to the pool
again each time, or ... ?
Bryn
Whoops, never mind, saw your earlier email with the script...

On my Ubuntu system I've tweaked yours a bit to change how it is
called. I'm making the load of the ZFS module automatically load
the ZRAM module first, so that the devices will always be there when
ZFS loads:

$ cat /etc/modprobe.d/zram.conf
options zram num_devices=2
install zram /sbin/modprobe --ignore-install zram; /etc/default/zram


$ cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=1140850688
install zfs /sbin/modprobe zram; /sbin/modprobe --ignore-install zfs


$ cat /etc/default/zram
#!/bin/bash
#
# Fix mem to 2GB per zram
mem=2147483648

#modprobe zram
#sleep 1

pushd /dev > /dev/null
for zram in zram*; do
echo $mem > /sys/block/$zram/disksize
# zpool remove ssd /dev/$zram
# zpool add ssd cache /dev/$zram
done
popd > /dev/null
Prakash Surya
2013-08-19 20:36:11 UTC
Permalink
Keep in mind, the L2ARC was designed to speed up random read workloads
that do not fit entirely in the ARC. So, if your workload doesn't meet
this criteria (i.e. the l2arc feed thread doesn't populate the l2arc
devices), adding zram l2arc devices won't help much.

Not saying this won't work for "you", just realize adding l2arc devices
is not the same as having a larger arc.
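
One way to check whether the feed thread is actually populating the zram
devices is to watch the l2arc counters (field names as found in ZoL's
/proc/spl/kstat/zfs/arcstats):

grep -E '^l2_(hits|misses|feeds|size)' /proc/spl/kstat/zfs/arcstats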
--
Cheers, Prakash
Post by Bryn Hughes
Post by Bryn Hughes
Post by Gordan Bobic
I generally find that keeping ARC small and setting up between 4:1 and
8:1 ZRAM L2ARC:ARC works best. L2ARC cannot exceed it's allocation,
and ZRAM will still be faster than SSD. So unless you need massively
more L2ARC than you have RAM, ZRAM is probably a good solution.
Hey Gordon, do you have a script or something to set this up on boot? I
assume you need to create the ZRAM devices and then add them to the pool
again each time, or ... ?
Bryn
Whoops, never mind, saw your earlier email with the script...
On my Ubuntu system I've tweaked yours a bit to changed how it is
called. I'm causing the load of the ZFS module to automatically load
the ZRAM module first, that way the devices will always be there when
$ cat /etc/modprobe.d/zram.conf
options zram num_devices=2
install zram /sbin/modprobe --ignore-install zram; /etc/default/zram
$ cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=1140850688
install zfs /sbin/modprobe zram; /sbin/modprobe --ignore-install zfs
$ cat /etc/default/zram
#!/bin/bash
#
# Fix mem to 2GB per zram
mem=2147483648
#modprobe zram
#sleep 1
pushd /dev > /dev/null
for zram in zram*; do
echo $mem > /sys/block/$zram/disksize
# zpool remove ssd /dev/$zram
# zpool add ssd cache /dev/$zram
done
popd > /dev/null
Sander Klein
2013-08-19 08:54:40 UTC
Permalink
Post by Sander Smeenk
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
Thanks for this!
It feels insane limiting ARC to ~20GB on a 192GB system, but it seems to
help in stabilizing the system a bit... Thing is, there's so many
tunables i can't really tell what combination works and what doesn't.
I'm still a bit confused by the fact that I see lots of people who have
problems with the ARC. I also have the fragmentation issue but I do not
have to limit the ARC to keep my system running. Could this be due to
the fact that we always add plenty of L2ARC SSDs?

Greets,

Sander

Niels de Carpentier
2013-08-19 10:24:14 UTC
Permalink
Post by Sander Klein
Post by Sander Smeenk
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
Thanks for this!
It feels insane limiting ARC to ~20GB on a 192GB system, but it seems to
help in stabilizing the system a bit... Thing is, there's so many
tunables i can't really tell what combination works and what doesn't.
I'm still a bit confused by the fact that I see lots of people who have
problems with the ARC. I also have the fragmentation issue but I do not
have to limit the ARC to keep my system running. Could this be to the
fact that we always add plenty L2ARC ssd's?
There are lots of factors that influence this: how the filesystem is used,
the total memory size (the ARC is relatively smaller on small systems),
distribution default kernel settings, etc.

I don't think the size of L2ARC has much influence.

Niels
Post by Sander Klein
Greets,
Sander
Niels de Carpentier
2013-08-19 21:48:02 UTC
Permalink
Post by Niels de Carpentier
Post by Sander Klein
Post by Sander Smeenk
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
Thanks for this!
It feels insane limiting ARC to ~20GB on a 192GB system, but it seems to
help in stabilizing the system a bit... Thing is, there's so many
tunables i can't really tell what combination works and what doesn't.
I'm still a bit confused by the fact that I see lots of people who have
problems with the ARC. I also have the fragmentation issue but I do not
have to limit the ARC to keep my system running. Could this be to the
fact that we always add plenty L2ARC ssd's?
There are lots of factors that influence this. How the filesystem is used,
the total memory size (The ARC is relative smaller for small systems),
distribution default kernel settings etc.
I don't think the size of L2ARC has much influence.
I have to correct myself here. Having a large L2ARC does actually make a
difference.

The reason is that the L2ARC "management" struct has relaxed alignment
requirements, and so won't have a lot of overhead. Also the same struct
will be used for all buffer sizes, which reduces fragmentation.

Small buffer sizes are very inefficient in the slab. A 512B buffer object
will use at least 1024B of actual memory, while only 512B is registered as
used by the ARC. And this is without any fragmentation, which will make
things worse.

The reason is that the slab objects are stored with a small struct
containing pointers to the object, the slab struct and the linked list of
free objects. When using small objects which have the alignment the same
as the object size (which is the case for 512B and 4kB), this will result
in 100% overhead, i.e. 512B (object) + 40B (guessed struct size) = 552B
and will be aligned to 1024B.
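
Worked through with shell arithmetic (numbers illustrative, assuming ~40B of
per-object bookkeeping and 1024B alignment):

$ echo $(( ( (512 + 40 + 1023) / 1024 ) * 1024 ))
1024

i.e. roughly 2x physical memory for every 512B the ARC accounts for, before
any fragmentation is added on top.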

So using a small ARC with a zram backed L2ARC will indeed use memory more
efficiently than using the same amount of memory for just the ARC.

Niels




Uncle Stoat
2013-08-20 12:22:41 UTC
Permalink
Post by Sander Klein
Post by Sander Smeenk
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
Thanks for this!
It feels insane limiting ARC to ~20GB on a 192GB system, but it seems to
help in stabilizing the system a bit... Thing is, there's so many
tunables i can't really tell what combination works and what doesn't.
I'm still a bit confused by the fact that I see lots of people who
have problems with the ARC. I also have the fragmentation issue but I
do not have to limit the ARC to keep my system running. Could this be
to the fact that we always add plenty L2ARC ssd's?
Quite possibly

ARC has to not only hold data about what's in the filesystem but also
keep track of what's on L2ARC, which may drive its size up as you put in
a larger L2ARC (there is zero point in having an L2ARC larger than your hot
data requirement).

You should look at the actual ratios of data/metadata held in ARC and
tune the metadata max upwards to suit. That's not possible by simply
looking at snapshots (you need to look at trends.)

Don't set arc=metadata or the l2arcs will hold virtually nothing at all.

My own experience is that after getting metadata max tuned correctly,
arc growth stopped being a problem.
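
On ZoL that tuning usually comes down to comparing the arc_meta_* counters
against the limit and raising the module parameter; a sketch with
hypothetical values for a ~20GB arc_max (parameter name as in the 0.6.x
module):

# current metadata usage vs. limit
grep arc_meta /proc/spl/kstat/zfs/arcstats
# raise the metadata limit to ~3/4 of arc_max (15GB here)
echo 16106127360 > /sys/module/zfs/parameters/zfs_arc_meta_limit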


Fajar A. Nugraha
2013-08-18 05:21:33 UTC
Permalink
Post by Sander Smeenk
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3 days.
While transferring ~58TB I began to notice the 2 storage servers and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool. From
time to time memory consumption spikes way past any limits set and the
server grinds to a slow halt. It's hard to put a finger on what actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance if you
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory bound, i
got about 180GB of memory back.
You might want to set /proc/sys/vm/swappiness to 10 (see
http://en.wikipedia.org/wiki/Swappiness, or google other resources). The
default is 60, and on my old ext4 system (a mostly-static web server) with
lots of files it caused massive slowdowns since Linux insisted that most
memory be used for cache. I had to use drop_caches manually
every several hours with the default swappiness. Once it's lowered, Linux
seems smart enough to drop the cache automatically.

I haven't tried this on zfs (didn't have as many files on it), but since
you say drop_caches solves your problem it might work for your case as well.
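
For anyone wanting to try it, the setting can be applied on the fly and
persisted (the sysctl.d filename is just an example):

# apply immediately
sysctl -w vm.swappiness=10
# persist across reboots
echo 'vm.swappiness=10' >> /etc/sysctl.d/90-swappiness.conf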
--
Fajar

Gordan Bobic
2013-08-18 13:05:34 UTC
Permalink
Post by Sander Smeenk
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3 days.
While transferring ~58TB I began to notice the 2 storage servers and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool. From
time to time memory consumption spikes way past any limits set and the
server grinds to a slow halt. It's hard to put a finger on what actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance if you
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory bound, i
got about 180GB of memory back.
You might want to set /proc/sys/vm/swappiness to 10 (see
http://en.wikipedia.org/wiki/Swappiness, or google other resources). The
default is 60, and on my old ext4 system (a mostly-static web server)
with lots of files it causes massive slow down since linux insists that
most memory was going to be used for cache. I had to use drop_caches
manually every several hours with the default swappiness. Once it's
lowered, linux seems smart enough to drop the cache automatically.
Surely the correct solution there is to up vm.min_free_kbytes, is it not?

I always run with vm.swappiness set to 100, and usually run high
priority swap on zram.
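
Setting that up is roughly (device name assumed; the priority only needs to
be higher than any disk-backed swap):

mkswap /dev/zram0
swapon -p 100 /dev/zram0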

Gordan

Fajar A. Nugraha
2013-08-19 00:58:28 UTC
Permalink
Post by Gordan Bobic
Post by Sander Smeenk
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3 days.
While transferring ~58TB I began to notice the 2 storage servers
and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool. From
time to time memory consumption spikes way past any limits set and the
server grinds to a slow halt. It's hard to put a finger on what actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance if you
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory bound, i
got about 180GB of memory back.
You might want to set /proc/sys/vm/swappiness to 10 (see
http://en.wikipedia.org/wiki/Swappiness,
or google other resources). The
default is 60, and on my old ext4 system (a mostly-static web server)
with lots of files it causes massive slow down since linux insists that
most memory was going to be used for cache. I had to use drop_caches
manually every several hours with the default swappiness. Once it's
lowered, linux seems smart enough to drop the cache automatically.
Surely the correct solution there is to up vm.min_free_kbytes, is it not?
Not in my case.

This was a somewhat extreme system though, with lots of small files. Google
also found this article, which might be relevant although I haven't tried
it:
http://major.io/2008/12/03/reducing-inode-and-dentry-caches-to-keep-oom-killer-at-bay/
--
Fajar

Uncle Stoat
2013-08-20 12:14:39 UTC
Permalink
Not on my case. This was a somewhat extreme system though, with lots
of small files. Google also find this article, which might be relevant
http://major.io/2008/12/03/reducing-inode-and-dentry-caches-to-keep-oom-killer-at-bay/
On our 192GB systems I've gone the other way and pushed up the dentry
and inode caches as far as they'll go (about 5% of RAM without
recompiling the kernel).

Kernel append at boot: dhash_entries=536870912 ihash_entries=536870912


Brett Dikeman
2013-08-20 00:45:25 UTC
Permalink
Post by Sander Smeenk
Apparently the above is a known 'bug'. People on this list have stated
that 'ARC [memory] fragmentation can cause the ARC to grow to 2-3x the
allowed size'.
Kernel versions 2.6.35 and newer have memory compaction, and it's
triggered when a request for a number of pages can't be satisfied. It
can also be triggered by writing anything to
/proc/sys/vm/compact_memory. See:
http://kernelnewbies.org/Linux_2_6_35#head-9cb0a1275559d40296da42efb7977896ac9edab7
https://lwn.net/Articles/368869/
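
Triggering it by hand is a one-liner, since any write to that file compacts
all zones:

echo 1 > /proc/sys/vm/compact_memory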

Are pages used by the ARC non-moveable?

Brett

Gordan Bobic
2013-08-19 17:15:05 UTC
Permalink
Unfortunately, there are no tiers in L2ARC, so mixing zram with SSDs will make for oddly lopsided performance. You'd need something like a L3ARC, which doesn't exist.

Or you could try to do something new and different and create a flashcache device with SSD as slow tier and zram as fast tier and use that as L2ARC. I am not aware of anyone else having tried this before.
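
If anyone does try it, the shape of it would be roughly the following
(flashcache_create syntax from the flashcache docs; device names hypothetical
and, as said, untested; writethrough is used so the SSD copy stays valid even
though zram is volatile):

flashcache_create -p thru zramcache /dev/zram0 /dev/sdb1
zpool add tank cache /dev/mapper/zramcache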
Post by Trey Dockendorf
I'm using EL6 so that's good news. Will give it a try.
Will the benefits of L2ARC using zram be negated if the cache also contains
SSD backed devices? Our storage units have two 240GB SSDs intended for
cache and zil.
Thanks
- Trey
Post by Gordan Bobic
It's in the Linux upstream kernel and has nothing at all to do with ZFS or
ZoL per se. Whether your distro kernel is built with it is another question.
It is certainly enabled in the EL6 based distros.
Gordan
Post by Trey Dockendorf
Thanks! That will be extremely useful. Is zram a module provided by ZoL
or something in stock kernel?
Thanks,
- Trey
Here is what I use on a 12-core machine (EL6 based, config files will
# cat /etc/modprobe.d/zram.conf
options zram num_devices=12
# cat /etc/sysconfig/modules/zram.modules
#!/bin/bash
#
# Fix mem to 1GB per zram
mem=1073741824
modprobe zram
sleep 1
pushd /dev > /dev/null
for zram in zram*; do
echo $mem > /sys/block/$zram/disksize
# zpool remove ssd /dev/$zram
# zpool add ssd cache /dev/$zram
done
popd > /dev/null
If you find that your ZRAMs get set up before pools get imported, drop
the two zpool lines from zram.modules (commented out here) - if the zrams
are already there, they will get into the pool automatically when the pool
gets imported if they were previously added to the pool.
Post by Trey Dockendorf
Those sound like very interesting ideas. I'm not familiar with some of
those concepts. Would you mind sharing how to setup ZRAM ?
Right now having a RAM backed L2ARC would greatly benefit my FhGFS
metadata server as it is all SSDs and needs to be very low latency for
small read/writes which are all in 0byte file xattrs.
- Trey
Post by Gordan Bobic
What I find works quite well is setting the arc_max quite small (up to
about 1GB per TB of usable disk space). Then I set up one ZRAM per CPU core
on the machine so that their total size equals the amount I want to use for
ZFS caching, and I set up those ZRAM devices as L2ARC. Enabling ARC-ing
prefetch data can help achieve better hit ratios in most cases. If your
disk:RAM ratio is very high, set L2ARC to metadata-only.
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3 days.
While transferring ~58TB I began to notice the 2 storage
servers and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is
a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool.
From
time to time memory consumption spikes way past any limits set and
the
server grinds to a slow halt. It's had to put a finger on what
actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance
if you
read a lot from your pool, is asking Linux to drop it's cached
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory
bound, i
got about 180GB of memory back.
The systems doing storage (and shown in output below) have 64GB
RAM, no
dedup, no compression. Zpool configuration [3] is two 10-disk
RAIDZ2's
with mirrored zil and striped cached. Both zil and cache are
SSD
(though
they share the same SSDs, just partitioned separately).
Our pool does compression (lz4) and has no zil/cache attached
(yet).
Aparently the above is a know 'bug'. People on this list have
stated
that 'ARC [memory] fragmentation can cause the ARC to grow to 2/3x
the
allowed size'.
-Sndr.
--
| Two blondes walk into a building.
| You'd think at least one of them would have seen it.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC
6CD2
Tamas Papp
2013-08-24 06:48:04 UTC
Permalink
Post by Gordan Bobic
Unfortunately, there are no tiers in L2ARC, so mixing zram with SSDs will make for oddly lopsided performance. You'd need something like a L3ARC, which doesn't exist.
Or you could try to do something new and different and create a flashcache device with SSD as slow tier and zram as fast tier and use that as L2ARC. I am not aware of anyone else having tried this before.
For tiering I suggest the btier module (http://tier.sourceforge.net/)

tamas

Gordan Bobic
2013-08-19 17:17:18 UTC
Permalink
I suggested something like this recently with putting ARC into a hugetlb memory block. I doubt it would be better than just limiting its size though.
Post by Marcus Sorensen
Would it be feasible to have some sort of 'preallocation' zfs module
parameter, so that the effects of fragmentation are limited? I'm
thinking something along the lines of it doing a huge malloc up-front,
maybe some large percentage of arc_max, and never letting go of that.
It may be wasteful, but it seems like most people running zfs are
already assuming it will use that memory and building their machines
as such. Perhaps it would be easier to just fix the problem outright,
but I'm wondering if that might be a valid band-aid.
Post by Trey Dockendorf
I'm using EL6 so that's good news. Will give it a try.
Will the benefits of L2ARC using zram be negated if the cache also contains
SSD backed devices? Our storage units have two 240GB SSDs intended for
cache and zil.
Thanks
- Trey
Post by Gordan Bobic
It's in the Linux upstream kernel and has nothing at all to do with ZFS or
ZoL per se. Whether you distro kernel is built with it is another question.
It is certainly enabled in the EL6 based distros.
Gordan
Post by Trey Dockendorf
Thanks! That will be extremely useful. Is zram a module provided by ZoL
or something in stock kernel?
Thanks,
- Trey
Here is what I use on a 12-core machine (EL6 based, config files will
# cat /etc/modprobe.d/zram.conf
options zram num_devices=12
# cat /etc/sysconfig/modules/zram.modules
#!/bin/bash
#
# Fix mem to 1GB per zram
mem=1073741824
modprobe zram
sleep 1
pushd /dev > /dev/null
for zram in zram*; do
echo $mem > /sys/block/$zram/disksize
# zpool remove ssd /dev/$zram
# zpool add ssd cache /dev/$zram
done
popd > /dev/null
If you find that your ZRAMs get set up before pools get imported, drop
the two zpool lines from zram.modules (commented out here) - if the zrams
are already there, they will get into the pool automatically when the pool
gets imported if they were previously added to the pool.
Post by Trey Dockendorf
Those sound like very interesting ideas. I'm not familiar with some of
those concepts. Would you mind sharing how to setup ZRAM ?
Right now having a RAM backed L2ARC would greatly benefit my FhGFS
metadata server as it is all SSDs and needs to be very low latency for small
read/writes which are all in 0byte file xattrs.
- Trey
Post by Gordan Bobic
What I find works quite well is setting the arc_max quite small (up to
about 1GB per TB of usable disk space). Then I set up one ZRAM per CPU core
on the machine so that their total size equals the amount I want to use for
ZFS caching, and I set up those ZRAM devices as L2ARC. Enabling ARC-ing
prefetch data can help achieve better hit ratios in most cases. If your
disk:RAM ratio is very high, set L2ARC to metadata-only.
Post by Trey Dockendorf
So far setting the arc_max to 25-30% of RAM and vm.min.free_kbytes to
512MB has kept my systems from crashing.
The metadata system has got hung (root's ext4 with
hung_task_timeout_secs errors) already twice in the past 3
days.
While transferring ~58TB I began to notice the 2 storage
servers and
single metadata server getting below less than 5% free memory.
I'm seeing the exact same behaviour on our ZoL-storage server.
This server has 192GB(!) memory and only a mere 26T pool (which is
a
mirror vdev across two iscsi LUNs at this moment).
We run ~25 concurrent rsyncs and mostly write data to the pool.
From
time to time memory consumption spikes way past any limits set and
the
server grinds to a slow halt. It's had to put a finger on what
actually
triggers this.
We've tried lowering arc_size but that seemed fruitless.
What does *seem* to help, but might drastically impact performance
if you
read a lot from your pool, is asking Linux to drop it's cached
# sync
# echo 3 > /proc/sys/vm/drop_caches
When i did the above on the server while it was getting memory
bound, i
got about 180GB of memory back.
The systems doing storage (and shown in output below) have 64GB
RAM, no
dedup, no compression. Zpool configuration [3] is two 10-disk
RAIDZ2's
with mirrored zil and striped cached. Both zil and cache are
SSD
(though
they share the same SSDs, just partitioned separately).
Our pool does compression (lz4) and has no zil/cache attached
(yet).
Aparently the above is a know 'bug'. People on this list have
stated
that 'ARC [memory] fragmentation can cause the ARC to grow to 2/3x
the
allowed size'.
-Sndr.
--
| Two blondes walk into a building.
| You'd think at least one of them would have seen it.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC
6CD2
Gordan Bobic
2013-08-19 17:18:16 UTC
Permalink
See a post earlier in this thread for all the configs/scripts required.
Post by Bryn Hughes
Post by Gordan Bobic
I generally find that keeping ARC small and setting up between 4:1 and
8:1 ZRAM L2ARC:ARC works best. L2ARC cannot exceed it's allocation,
and ZRAM will still be faster than SSD. So unless you need massively
more L2ARC than you have RAM, ZRAM is probably a good solution.
Hey Gordon, do you have a script or something to set this up on boot? I
assume you need to create the ZRAM devices and then
Uncle Stoat
2013-08-20 11:39:24 UTC
Permalink
It is not unusual for arc to exceed arc_max by up to a factor of 2 due to
memory fragmentation. The only thing you can really do is reduce it's size
until you are no longer getting memory exhaustion. Increasing
vm.min_free_kbytes (I typically set it to about 1% of RAM) may also help.
Increasing arc_metadata_max to 3/4 of the arc_max size might help too.

The problem with rsyncs is that they are heavy on metadata - and
doubling up on them just makes things worse.

