This is a follow-up to: High speed network writes with large capacity storage. The setup has changed notably.
I have a pool with a single raid-z2 vdev of 6 drives, all Exos X18 CMR drives. Using fio and manual tests I know that the array can sustain around 800 MB/s of sequential writes on average; this is fine and in line with the expected performance of this array. The machine is a Ryzen 5 Pro 2400GE (4C/8T, 3.8 GHz boost) with 32G ECC RAM, an NVMe boot/system drive and 2x 10 Gbps Ethernet ports (Intel X550-T2). I'm running an up-to-date Arch system with zfs 2.1.2-1.
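For reference, the sequential-write measurement was an fio job along these lines (the exact job file here is illustrative rather than the one I ran, and the directory is a placeholder):
# /tank/bench is a placeholder path on the pool
fio --name=seqwrite --directory=/tank/bench --rw=write --bs=1M --size=30G --ioengine=libaio --iodepth=8 --numjobs=1 --group_reporting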
My use case is a video archive of mostly large (~30G), write-once, read-once, compressed video. I've disabled atime and set recordsize=1M, compression=off and dedup=off, since the data is effectively incompressible (testing showed worse performance with compression=lz4 than with it off, despite what the internet said) and there is no duplicate data by design. This pool is shared over the network via Samba. I've tuned my network and Samba to the point where transferring from NVMe NTFS on a Windows machine to NVMe ext4 reaches 1 GB/s, i.e. reasonably close to saturating the 10 Gbps link with 9K jumbo frames.
Here's where I run into problems. I want to be able to transfer one whole 30G video archive at 1 GB/s to the raid-z2 array that can only sustain 800 MB/s of sequential writes. My plan is to use the RAM-based dirty data to absorb the spillover and let it flush to disk after the transfer is "completed" on the client side. I figured that all I would need is (1024 − 800) MB/s × 30 s ≈ 7G of dirty data in RAM, which can get flushed out to disk over ~10 seconds after the transfer completes. I understand the data integrity implications of this and the risk is acceptable, as I can always transfer the file again for up to a month in case a power loss causes the file to be lost or incomplete.
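Spelled out, the back-of-the-envelope numbers (assuming a steady ~1 GB/s from the client) are:
30 GB at ~1024 MB/s ≈ 30 s of ingest
backlog: (1024 − 800) MB/s × 30 s ≈ 6.7 GB, call it 7G of dirty data
drain afterwards: ~7 GB / 800 MB/s ≈ 9 s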
However, I cannot get ZFS to behave in the way I expect... I've edited my /etc/modprobe.d/zfs.conf file like so:
options zfs zfs_dirty_data_max_max=25769803776
options zfs zfs_dirty_data_max_max_percent=50
options zfs zfs_dirty_data_max=25769803776
options zfs zfs_dirty_data_max_percent=50
options zfs zfs_delay_min_dirty_percent=80
I have run the appropriate mkinitcpio -P command to regenerate my initramfs and confirmed that the settings were applied after a reboot:
# arc_summary | grep dirty_data
zfs_dirty_data_max 25769803776
zfs_dirty_data_max_max 25769803776
zfs_dirty_data_max_max_percent 50
zfs_dirty_data_max_percent 50
zfs_dirty_data_sync_percent 20
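As far as I can tell, the runtime-tunable ones can also be changed on the fly through /sys/module/zfs/parameters for quicker experimentation, e.g.:
# takes effect immediately, but does not survive a reboot
echo 25769803776 > /sys/module/zfs/parameters/zfs_dirty_data_max
echo 80 > /sys/module/zfs/parameters/zfs_delay_min_dirty_percent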
I.e. I set the max dirty data to 24G, which is way more than the ~7G I need, and hold off on delaying writes until 80% of that is used. As far as I understand it, the pool should be able to absorb about 19G (80% of 24G) into RAM before it starts to push back on writes from the client (Samba) by adding latency.
However, what I observe when writing from the Windows client is that after around 16 seconds at ~1 GB/s the write performance falls off a cliff (iostat still shows the disks working hard to flush data), which I can only assume is ZFS's write throttling pushing back. This makes no sense to me: even if nothing at all had been flushed out during those 16 seconds, only ~16G of dirty data would have accumulated, so the throttle shouldn't have set in until roughly 3 seconds later (at the ~19G mark). In addition, the speed falls off once again near the end of the transfer, see picture: [write speed graph](https://i.stack.imgur.com/Yd9WH.png)
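One way I know of to check whether the throttle really is what's engaging would be to watch the dirty-data counters in the dmu_tx kstat during a transfer, something like:
# dmu_tx_dirty_delay / dmu_tx_dirty_over_max should increase when the write throttle kicks in
grep dirty /proc/spl/kstat/zfs/dmu_tx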
I've tried lowering zfs_dirty_data_sync_percent so that syncing starts earlier (since the dirty buffer is so much larger than the default), and I've also tried adjusting the active I/O scaling with zfs_vdev_async_write_active_{min,max}_dirty_percent so it kicks in earlier and ramps the disk writes up to speed faster with the large dirty buffer, along the lines of the sketch below. Both of these only moved the position of the cliff slightly, nowhere near as far as I expected.
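For reference, those experiments were in this vein (the values here are illustrative, not exactly what I ended up with):
options zfs zfs_dirty_data_sync_percent=10
options zfs zfs_vdev_async_write_active_min_dirty_percent=5
options zfs zfs_vdev_async_write_active_max_dirty_percent=30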
Questions:
- Have I misunderstood how the write throttling delay works?
- Is what I'm trying to do possible?
- If so, what am I doing wrong?
Yes, I know, I'm literally chasing a couple of seconds and will never recoup the effort spent in achieving this. That's ok, it's personal between me and ZFS at this point, and a matter of principle ;)