ZFS resilver keeps restarting, but no indication why

debacle_3k
NewUser
Posts: 6
Joined: 29 Jun 2020 21:11

#1

Post by debacle_3k »

Version: 12.1.0.4 - Ingva (revision 7542)
Platform: x64-embedded on Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz
Motherboard: ASRock H81 Pro BTC
RAM: 8GB (4x Kingston KVR800D2N6)

I have a Norco RPC-4224 + Chenbro CK23601 connected to my PC with an LSI00300 HBA.
2x RAIDZ2 vdevs, 13 drives + 11 drives.

I have a single-drive replacement resilver (on the 13-drive vdev) that keeps restarting, but I see nothing in the system logs, SMART, or ZFS history to indicate why. I'm looking for ideas on other ways to diagnose it.

Even if I remove the new drive and run in a DEGRADED state, the same resilver behaviour occurs. I've tried to detach/offline the new device, but ZFS refuses with "insufficient replicas exist" even though it's a RAIDZ2 vdev. Anywhere from a few minutes to several hours can pass between restarts.

Some further background:

I recently did a number on my RPC-4224 setup: the fans seized up, and I had to move one row of drives to a new enclosure (ORICO DS500U3) since the backplane seems to have... had a number done on it. Luckily only one drive seemed to NEED replacing as a result, so I'm trying to get that resilver done and then regroup and assess the remaining situation.

I guess at least one other drive, or the connection path to it, is flaky, and since SMART isn't changing I guess it's likely the latter. I know I probably need to migrate away from this enclosure entirely, but before I throw money at the situation I'd like to see SOME clue as to why ZFS is deciding it needs to restart the resilver.

cookiemonster
Advanced User
Posts: 281
Joined: 23 Mar 2014 02:58
Location: UK

#2

Post by cookiemonster »

After you rearranged the disks, did you do a clear config & import so that the OS knows about the changes?
Is there anything in dmesg? What does "zpool status" show?
Main: Xigmanas 11.2.0.4 x64-full-RootOnZFS as ESXi VM with 24GB memory.
Main Host: Supermicro X8DT3 Memory: 72GB ECC; 2 Xeon E5645 CPUs; Storage: (HBA) - LSI SAS 9211-4i with 3 SATA x 1 TB in raidZ1, 1 x 3 TB SAS drive as single stripe, 3 x 4 TB SAS drives in raidZ1.
Spare1: HP DL360 G7; 6 GB ECC RAM; 1 Xeon CPU; 5 x 500 GB disks on H210i
Backup1: HP DL380 G7; 24 GB ECC RAM; 2 Xeon E5645 CPUs; 8 x 500 GB disks on IBM M1015 flashed to LSI9211-IT

debacle_3k
NewUser
Posts: 6
Joined: 29 Jun 2020 21:11

#3

Post by debacle_3k »

Thanks, cookiemonster. No, I hadn't done a clear config & import; I didn't realize that could make a difference. I've done that now and will monitor.

There's nothing in dmesg since boot, and I don't see any complaints during boot.

Current zpool status -v:

Code:

Mon Jun 29 14:26:18 PDT 2020
  pool: parity6
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jun 29 14:18:15 2020
	355G scanned at 753M/s, 21.5G issued at 45.6M/s, 40.8T total
	45.0M resilvered, 0.05% done, 10 days 20:18:31 to go
config:

	NAME                        STATE     READ WRITE CKSUM
	parity6                     DEGRADED     0     0     0
	  raidz2-0                  DEGRADED     0     0     0
	    da0                     ONLINE       0     0     0
	    da14                    ONLINE       0     0     0
	    da2                     ONLINE       0     0     0
	    da16                    ONLINE       0     0     0
	    da1                     ONLINE       0     0     0
	    da5                     ONLINE       0     0     0
	    da7                     ONLINE       0     0     0
	    da4                     ONLINE       0     0     0
	    replacing-8             UNAVAIL      0     0     0
	      3793494453691558979   UNAVAIL      0     0     0  was /dev/da8/old
	      15101296353119072328  FAULTED      0     0     0  was /dev/da18
	    da17                    ONLINE       0     0     0
	    da15                    ONLINE       0     0     0
	    da6                     ONLINE       0     0     0
	    da3                     ONLINE       0     0     0
	  raidz2-1                  ONLINE       0     0     0
	    da19                    ONLINE       0     0     0
	    da21                    ONLINE       0     0     0
	    da18                    ONLINE       0     0     0
	    da8                     ONLINE       0     0     0
	    da9                     ONLINE       0     0     0
	    da10                    ONLINE       0     0     0
	    da11                    ONLINE       0     0     0
	    da22                    ONLINE       0     0     0
	    da12                    ONLINE       0     0     0
	    da20                    ONLINE       0     0     0
	    da13                    ONLINE       0     0     0

errors: No known data errors
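A table like this is easy to skim past, so a small filter that lists only the rows whose STATE is not ONLINE can help when checking it repeatedly. The Python sketch below is hypothetical (the helper name "unhealthy_devices" is invented for illustration and is not a ZFS or XigmaNAS tool) and assumes the column layout shown above: NAME, STATE, READ, WRITE, CKSUM, then an optional note. Applied to a trimmed copy of that output, it reports the pool, raidz2-0, replacing-8 and both of replacing-8's children. It also makes it easier to notice that the replacement disk's note ("was /dev/da18") names a device that now appears as an ONLINE member of raidz2-1, which suggests the daN numbering shifted after the drives were moved.

```python
# Trimmed copy of the "zpool status -v" config table above (real output uses tabs).
STATUS = """\
config:

    NAME                        STATE     READ WRITE CKSUM
    parity6                     DEGRADED     0     0     0
      raidz2-0                  DEGRADED     0     0     0
        da0                     ONLINE       0     0     0
        replacing-8             UNAVAIL      0     0     0
          3793494453691558979   UNAVAIL      0     0     0  was /dev/da8/old
          15101296353119072328  FAULTED      0     0     0  was /dev/da18
        da17                    ONLINE       0     0     0
      raidz2-1                  ONLINE       0     0     0
        da19                    ONLINE       0     0     0
        da18                    ONLINE       0     0     0

errors: No known data errors
"""


def unhealthy_devices(status_text):
    """Return (name, state, note) for each config row whose STATE is not ONLINE.

    Hypothetical helper; assumes the NAME/STATE/READ/WRITE/CKSUM layout above.
    """
    rows = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()  # split() treats tabs and runs of spaces alike
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True  # header row: device rows follow
            continue
        if not in_table:
            continue
        if not fields:
            break  # a blank line ends the device table
        if len(fields) < 2:
            continue  # skip any row without a STATE column
        name, state = fields[0], fields[1]
        # Anything after NAME STATE READ WRITE CKSUM (e.g. "was /dev/da18") is a note.
        note = " ".join(fields[5:])
        if state != "ONLINE":
            rows.append((name, state, note))
    return rows


if __name__ == "__main__":
    for name, state, note in unhealthy_devices(STATUS):
        print(f"{state:<9} {name} {note}".rstrip())
```

To run it against the live pool instead of the pasted sample, STATUS could be replaced with subprocess.run(["zpool", "status", "-v", "parity6"], capture_output=True, text=True).stdout (Python 3.7 or later).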

debacle_3k
NewUser
Posts: 6
Joined: 29 Jun 2020 21:11

#4

Post by debacle_3k »

Update: It has already restarted at least once since the clear config & import.

cookiemonster
Advanced User
Posts: 281
Joined: 23 Mar 2014 02:58
Location: UK

#5

Post by cookiemonster »

I was expecting dmesg to show the device faulting. If you're using the embedded version, you might want to start logging to a location that persists across reboots. There are instructions for this, but if I remember correctly it requires a reboot.
The FAULTED device is likely what's causing the resilver to restart. Is it a WD Red by any chance? Could it be falling foul of the SMR debacle?

debacle_3k
NewUser
Posts: 6
Joined: 29 Jun 2020 21:11

#6

Post by debacle_3k »

Yes, I've been following this OS line for over 10 years now, and in the past dmesg has always pointed out the culprit drive, but not this time. To be clear, the OS isn't restarting, just the resilver. The FAULTED device is the new replacement drive, which I've deliberately removed. With or without it online, I see the same behaviour from ZFS.

The new drive is a Seagate ST4000DM004, which I've been buying for a while now, and apparently, yes, this model has been switched to SMR. However, since I get the same behaviour with the drive physically removed, I wonder whether SMR is really in play here.

I'm considering removing other drives one at a time until I hopefully find one that allows the resilver to stabilize with it removed.

Shperrung
experienced User
Posts: 149
Joined: 04 Apr 2018 16:29

#7

Post by Shperrung »

People say that SMR drives mixed with CMR drives in a RAID may be flagged as "damaged" because their write speed drops significantly once their cache fills up.
ASRock J3710-ITX, 16Gb RAM; RAID-Z 4Tx3HDD, 2T Stripe; UPS
Debian+OMV+ZFS

cookiemonster
Advanced User
Posts: 281
Joined: 23 Mar 2014 02:58
Location: UK

#8

Post by cookiemonster »

Yes, what I don't understand is why it would try to resilver without the replacement drive inserted.
Does "zpool history -il parity6" provide any hints? It needs sudo.

debacle_3k
NewUser
Posts: 6
Joined: 29 Jun 2020 21:11

#9

Post by debacle_3k »

OK! Including the internal events with "zpool history -il parity6" does show the restarts happening, which makes me happy, so thanks for showing me that. Unfortunately, I don't think there's a clue here; it seems to just log the restarts without an explanation:

Code:

2020-07-04.21:03:49 [txg:30619538] open pool version 5000; software version 5000/5; uts  12.1-RELEASE-p3 1201000 amd64 [on ]
2020-07-04.21:03:53 [txg:30619541] import pool version 5000; software version 5000/5; uts  12.1-RELEASE-p3 1201000 amd64 [on ]
2020-07-04.21:04:44 zpool import -d /dev -f -a [user 0 (root) on ]
2020-07-04.21:12:55 [txg:30619573] scan aborted, restarting errors=0 [on freenas.local]
2020-07-04.21:12:55 [txg:30619573] scan setup func=2 mintxg=3 maxtxg=30619201 [on freenas.local]
2020-07-04.21:18:38 [txg:30619585] scan aborted, restarting errors=0 [on freenas.local]
2020-07-04.21:18:38 [txg:30619585] scan setup func=2 mintxg=3 maxtxg=30619201 [on freenas.local]
2020-07-04.21:27:34 [txg:30619613] scan aborted, restarting errors=0 [on freenas.local]
2020-07-04.21:27:34 [txg:30619613] scan setup func=2 mintxg=3 maxtxg=30619201 [on freenas.local]
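Even without a reason attached, these entries carry exact timestamps, so one inexpensive check is to compute the gaps between restarts and compare those times with a persistent system log (for example, disk detach or reset messages around the same minute). The Python sketch below is a hypothetical helper, not part of ZFS, and assumes the internal-event line format shown above: a "YYYY-MM-DD.HH:MM:SS" timestamp followed by "scan aborted, restarting". On the three restarts above it yields gaps of 343 and 536 seconds.

```python
from datetime import datetime

# Lines copied from the "zpool history -il parity6" output above.
HISTORY = """\
2020-07-04.21:04:44 zpool import -d /dev -f -a [user 0 (root) on ]
2020-07-04.21:12:55 [txg:30619573] scan aborted, restarting errors=0 [on freenas.local]
2020-07-04.21:12:55 [txg:30619573] scan setup func=2 mintxg=3 maxtxg=30619201 [on freenas.local]
2020-07-04.21:18:38 [txg:30619585] scan aborted, restarting errors=0 [on freenas.local]
2020-07-04.21:18:38 [txg:30619585] scan setup func=2 mintxg=3 maxtxg=30619201 [on freenas.local]
2020-07-04.21:27:34 [txg:30619613] scan aborted, restarting errors=0 [on freenas.local]
2020-07-04.21:27:34 [txg:30619613] scan setup func=2 mintxg=3 maxtxg=30619201 [on freenas.local]
"""

# Assumed from the sample above; other ZFS versions may word this differently.
RESTART_MARKER = "scan aborted, restarting"
TIMESTAMP_FORMAT = "%Y-%m-%d.%H:%M:%S"


def restart_times(history_text):
    """Return the timestamp of every 'scan aborted, restarting' entry."""
    times = []
    for line in history_text.splitlines():
        if RESTART_MARKER in line:
            stamp = line.split(" ", 1)[0]  # first field is the timestamp
            times.append(datetime.strptime(stamp, TIMESTAMP_FORMAT))
    return times


def restart_intervals(times):
    """Seconds elapsed between consecutive restarts."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    times = restart_times(HISTORY)
    for t in times:
        print("restart at", t.isoformat(sep=" "))
    print("intervals (s):", restart_intervals(times))
```

The same functions should work on the full output of "zpool history -il parity6" saved to a file; only lines containing the restart marker are counted, so the import and "scan setup" entries are ignored.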

I tried Googling around this regardless, and one person said their USB enclosure turned out to be the culprit. I'm not sure how they came to that conclusion, but it does make me wonder whether the USB enclosure I was forced to move some drives into is occasionally making ZFS think it has lost its connection to those drives, or is stalling the resilver for too long, or something along those lines. Of course, that's all speculation, since I still haven't seen any hard evidence of why it's restarting. My hardware's a bit old, and I think the enclosure may be running at USB 2.0 instead of 3.0, so fixing that is one thing I could pursue.

Removing drives one by one in the hope of finding a single bad apple hasn't helped so far, but I still have a few more to go before I can rule that out.

If I decided to try an everything-but-the-drives SAS enclosure with room for 5+ drives, daisy-chained with my Norco RPC-4224, so I can rule out the USB enclosure, are there any recommendations? Apologies if that's not an appropriate question for this forum.

cookiemonster
Advanced User
Posts: 281
Joined: 23 Mar 2014 02:58
Location: UK

#10

Post by cookiemonster »

In my opinion, yes, it's appropriate, although it's possibly best placed in Storage questions. It matters little.
There are engineering tools like zdb for examining transactions, but I don't suggest going there. Even the manual page tries to discourage its use unless you know ZFS internals.
I didn't realise you were using USB. Personally, I'd suspect that as the problem, but I agree that there's no hard evidence in hand. Sorry it's not a suggestion on how to move forward, but my impression is that something is up with the hardware.

debacle_3k
NewUser
Posts: 6
Joined: 29 Jun 2020 21:11

#11

Post by debacle_3k »

Yes, I recently moved 5 drives from my Norco RPC-4224 to a USB enclosure because some ports were failing, but that seems to have just caused a different issue. Thanks for the advice!

cookiemonster
Advanced User
Posts: 281
Joined: 23 Mar 2014 02:58
Location: UK

#12

Post by cookiemonster »

It's logical to use what's available, and yes, I can't offer you hard evidence or official documentation, but everywhere I've read, the people who administer systems and do storage for a living say that USB is not reliable enough.

Return to “Newbie Questions”