Strange state after disk replacement - still status degraded

maddes8cht
NewUser
Posts: 12
Joined: 07 Jun 2017 18:26

Strange state after disk replacement - still status degraded

#1

Post by maddes8cht » 11 Jun 2019 11:06

I've got an old server with four old 1 TB SAS disks.
The RAID controller doesn't allow pass-through, so I put each disk into its own single-drive RAID volume.

I configured them as a raidz with two parity disks (raidz2), as I wasn't sure how much to trust these disks.
It turned out my distrust was justified: the first disk failed a few weeks later.
Okay, I still had time to think about how to replace this disk, as I still had one parity disk left; no hurry. Maybe use this incident to (slowly) switch to 2 TB disks?
Only a few days later the second disk failed. Now I went into alarm mode and ordered four new 2 TB disks.
I replaced and resilvered the two failed ones.

After the resilver, the disks remained in "replacing" mode, but with state "online" and fully operational.

The pool showed 2 permanent errors, pointing to snapshots from around the time of the disk failures.

Then the third disk failed.
I replaced it.

After the resilver, I'm left with this:

Code:

Tue Jun 11 08:57:36 UTC 2019
  pool: NudelPool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 156G in 8h29m with 2 errors on Sat Jun  8 08:14:45 2019
config:

	NAME                       STATE     READ WRITE CKSUM
	NudelPool                  DEGRADED     0     0     2
	  raidz2-0                 DEGRADED     0     0     4
	    replacing-0            DEGRADED     0     0     0
	      2272089153789827426  OFFLINE      0     0     0  was /dev/mfid0p1.nop
	      mfid1                ONLINE       0     0     0
	    mfid0p1.nop            ONLINE       0     0     0
	    replacing-2            UNAVAIL      0     0     0
	      1951472312918734891  UNAVAIL      0     0     0  was /dev/mfid1p1.nop
	      mfid3                ONLINE       0     0     0
	    replacing-3            DEGRADED     0     0     0
	      7330468822100346415  OFFLINE      0     0     0  was /dev/mfid3p1.nop
	      mfid2                ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x333>:<0x4a0b8>
        <0x333>:<0x49ec8>
While the snapshots aged out, the errors remained, without pointing anywhere useful.

The ZFS array is fully functional, but this is a strange state.
I still want to replace the 4th disk, as it may soon fail like the others did.
How do I get rid of this mess?
How do I get rid of these "numbered" disks?

When changing disks, I have to go into the RAID controller manager and declare the new ones. The names seem to have gotten a bit shuffled there, as mfid1 "was" mfid0p1.nop, mfid3 "was" mfid1p1.nop, and mfid2 "was" mfid3p1.nop.
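
For context: the numbered entries are the ZFS GUIDs of the old, now-missing halves of each "replacing" vdev. A rough sketch of the commands normally involved in inspecting them and, once a replace has completed cleanly, detaching a stale half by its GUID (the GUID below is copied from the status output above; treat this as an illustration, not a tested recipe for this pool):

Code:

# list the pool layout together with the verbose error listing
zpool status -v NudelPool

# once a replace has finished cleanly, the old half of a "replacing"
# vdev can normally be detached by the GUID shown in the status output
zpool detach NudelPool 2272089153789827426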

raulfg3
Site Admin
Posts: 4928
Joined: 22 Jun 2012 22:13
Location: Madrid (ESPAÑA)

Re: Strange state after disk replacement - still status degraded

#2

Post by raulfg3 » 11 Jun 2019 11:53

viewtopic.php?f=66&t=14473

perhaps can help

12.0.0.4 (revision 6766)+OBI on SUPERMICRO X8SIL-F 8GB of ECC RAM, 12x3TB disk in 3 vdev in RaidZ1 = 32TB Raw size only 22TB usable

Wiki
Last changes

maddes8cht
NewUser
Posts: 12
Joined: 07 Jun 2017 18:26

Re: Strange state after disk replacement - still status degraded

#3

Post by maddes8cht » 11 Jun 2019 13:07

raulfg3 wrote:
11 Jun 2019 11:53
viewtopic.php?f=66&t=14473

perhaps can help
Sorry, but no.
I didn't use gpart to create a ZFS partition; I used the XigmaNAS web frontend's disk management instead.
I also didn't replace manually, but used the XigmaNAS web frontend to issue the replace command. It still reads in the zpool command history as:
zpool replace NudelPool /dev/mfid0p1.nop mfid2
zpool replace NudelPool /dev/mfid3p1.nop mfid3
zpool replace NudelPool /dev/mfid1p1.nop mfid3

I did the same with another server (which ran much longer before a failure) without getting such a mess.

So this is the state I ended up in after doing the steps described in that forum thread.

ms49434
Developer
Posts: 673
Joined: 03 Sep 2015 18:49
Location: Neuenkirchen-Vörden, Germany - GMT+1

Re: Strange state after disk replacement - still status degraded

#4

Post by ms49434 » 11 Jun 2019 13:35

maddes8cht wrote:
11 Jun 2019 13:07
[...]
So this is the state I ended up in after doing the steps described in that forum thread.
You have PERMANENT errors; that's why you're getting that 'mess'. (It isn't really a mess; this is how ZFS preserves as much information as possible.)
Run a scrub on the pool; the permanent errors seem to belong to a deleted dataset (hex reference). Those errors may be cleared during the scrub. It is probably a good idea to scrub your pool twice.
If the scrub doesn't clear the errors, try issuing a clear command on the pool and on the vdev.
If the errors are still not cleared, you should think about recreating the pool and restoring the data from your backup.
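
For reference, the commands behind that advice would look roughly like this (pool and device names taken from the status output above; the per-device clear is shown only as an illustration):

Code:

zpool scrub NudelPool          # scrub the pool (possibly run it twice)
zpool status -v NudelPool      # watch progress and the error list

zpool clear NudelPool          # clear error counters on the whole pool
zpool clear NudelPool mfid1    # ...or on an individual device if needed
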
1) XigmaNAS 12.0.0.4 amd64-embedded on a Dell T20 running in a VM on ESXi 6.7U2, 22GB out of 32GB ECC RAM, LSI 9300-8i IT mode in passthrough mode. Pool 1: 2x HGST 10TB, mirrored, SLOG: Samsung 850 Pro, L2ARC: Samsung 850 Pro, Pool 2: 1x Samsung 860 EVO 1TB , services: Samba AD, CIFS/SMB, ftp, ctld, rsync, syncthing, zfs snapshots.
2) XigmaNAS 12.0.0.4 amd64-embedded on a Dell T20 running in a VM on ESXi 6.7U2, 8GB out of 32GB ECC RAM, IBM M1215 crossflashed, IT mode, passthrough mode, 2x HGST 10TB , services: rsync.

maddes8cht
NewUser
Posts: 12
Joined: 07 Jun 2017 18:26

Re: Strange state after disk replacement - still status degraded

#5

Post by maddes8cht » 13 Jun 2019 15:13

Okay, I tried all of that.
After clearing the device errors on every drive, it starts resilvering; at the beginning the last line states:

Code:

errors: No known data errors
Full output:

Code:

Thu Jun 13 12:37:03 UTC 2019
  pool: NudelPool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jun 13 12:30:21 2019
	120G scanned out of 620G at 48.6M/s, 2h55m to go
        1.06G resilvered, 19.32% done
config:

	NAME                       STATE     READ WRITE CKSUM
	NudelPool                  DEGRADED     0     0     8
	  raidz2-0                 DEGRADED     0     0    16
	    replacing-0            DEGRADED     0     0     0
	      2272089153789827426  OFFLINE      0     0     0  was /dev/mfid0p1.nop
	      mfid1                ONLINE       0     0     0
	    mfid0p1.nop            ONLINE       0     0     0
	    replacing-2            DEGRADED     0     0     0
	      1951472312918734891  UNAVAIL      0     0     0  was /dev/mfid1p1.nop
	      mfid3                ONLINE       0     0     0
	    replacing-3            DEGRADED     0     0     0
	      7330468822100346415  OFFLINE      0     0     0  was /dev/mfid3p1.nop
	      mfid2                ONLINE       0     0     0

errors: No known data errors
(See, I did it again, so now it's running...)
But after a while it finds 2 "permanent errors", these being in the oldest available snapshot (60 days old).
(The last few times this always happened around 20%, which I have reached right now. Maybe this time I've reached the "final" snapshot that needed to be deleted? We will see... The "messy" disk config is still there; let's see if it vanishes when the scrub finishes...)

Deleting these snapshots only leads to the errors reappearing in the next-oldest snapshot.

Both errors are in snapshots of two VirtualBox machines, and while it wouldn't matter to delete and recreate the Lubuntu machine, I'm unhappy with the idea of recreating the openz machine.

The VM itself is fully functional and working fine.
ms49434 wrote:
If the errors are still not cleared, you should think about recreating the pool and restoring the data from your backup.
Instead of doing that, I would prefer to replicate (zrep) the entire thing into another zpool, then recreate the old zpool and replicate back.
That way I could keep downtime close to zero. Right now I have enough disks available to do so.


I haven't done replication with zrep yet, so: could this replication also "recreate" these "permanent errors", or can I be sure to get rid of them?
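
Under the hood, zrep is essentially a wrapper around zfs send/receive, so the round trip would look roughly like the sketch below. The spare pool name (SparePool) and the snapshot names are made up. Send/receive copies logical data rather than the raw on-disk blocks, so the error counters themselves do not travel with it, and a send that hits a genuinely unreadable block should fail with an I/O error rather than silently copy bad data.

Code:

# take a recursive snapshot and copy the whole tree to the spare pool
zfs snapshot -r NudelPool@migrate
zfs send -R NudelPool@migrate | zfs receive -Fu SparePool/NudelPool

# ...destroy and recreate NudelPool, then send everything back
# (the datasets land one level deeper; zfs rename can flatten them)
zfs snapshot -r SparePool/NudelPool@return
zfs send -R SparePool/NudelPool@return | zfs receive -Fu NudelPool/restore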
