
zpool faulted

Posted: 17 Sep 2015 22:38
by nicolap8
Hi all,
I had(!) a zpool with 5 disks in raid-z2.
One disk started to have problems, so I decided to replace it.
I misread the manual and did:
- zpool offline zpl disk
- shutdown
- changed the drive with another identical
- restarted the pc
- zpool attach zpl ada1
I got no response at all; the console just hung. No messages, no log.
Every command involving ZFS hung and never completed.
After some hours I rebooted the machine.
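For reference, the manual's procedure uses `zpool replace`, not `zpool attach` (which adds a mirror member to an existing vdev). A rough sketch of what the replacement should have looked like - the device name `ada1` is taken from this thread, but treat the whole sequence as illustrative, not something to run verbatim:

```shell
# Take the failing disk offline (this part was done correctly).
zpool offline zpl ada1

# Power down, swap the physical drive in the same slot, boot again,
# then tell ZFS to resilver onto the new disk:
zpool replace zpl ada1

# Watch the resilver progress until the pool is healthy again.
zpool status zpl
```

With a single device argument, `zpool replace` assumes the new disk occupies the same location as the old one.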

The pool ended up in a FAULTED state. I tried to export it, then to import it:
freenas: administrator# zpool import -f
pool: zpl
id: 15065020095725120455
state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
see: http://illumos.org/msg/ZFS-8000-EY
config:

zpl                            FAULTED  corrupted data
  raidz2-0                     DEGRADED
    replacing-0                UNAVAIL  insufficient replicas
      1200431444536697293      OFFLINE
      5458900440138144991      UNAVAIL  cannot open
    da2                        ONLINE
    da3                        ONLINE
    da4                        ONLINE
    da5                        ONLINE
The two devices with the strange names were connected to a SiI controller (siisch1).

Any suggestion?
Thanks

Re: zpool faulted

Posted: 18 Sep 2015 07:01
by b0ssman
Did you offline the wrong disk?

See the FAQ on how to replace a failed disk:

http://wiki.nas4free.org/doku.php?id=faq:0149

Re: zpool faulted

Posted: 18 Sep 2015 11:02
by nicolap8
No. I offlined the right disk, the one with the strange number that does not appear in the GUI!

<RANT>
I'm not sure I followed the right procedure (probably not), but I am sure that EVERY piece of software MUST ask for CONFIRMATION before doing something that can destroy user data.
The zpool command never asked for anything... in 2015, this is nothing short of incompetence!
</RANT>

Re: zpool faulted

Posted: 18 Sep 2015 11:12
by raulfg3
nicolap8 wrote:No. I offlined the right disk, the one with the strange number that does not appear in the GUI!

<RANT>
I'm not sure I followed the right procedure (probably not), but I am sure that EVERY piece of software MUST ask for CONFIRMATION before doing something that can destroy user data.
The zpool command never asked for anything... in 2015, this is nothing short of incompetence!
</RANT>
same for format c: on windows

The lesson is: do not use commands unless you know what you are doing.

Re: zpool faulted

Posted: 18 Sep 2015 13:27
by nicolap8
raulfg3 wrote:same for format c: on windows
:lol:
raulfg3 wrote:The lesson is: do not use commands unless you know what you are doing.
I always read the manual... but this time I read it badly. The lesson is: take your time...

Update
Using zdb -l I checked the LABEL (metadata) on each disk:
- one disk has old metadata (the one that was offlined and physically disconnected)
- two disks (da2 and da3) share the same metadata, txg: 908576
- two disks (da4 and da5) share the same metadata, txg: 908555
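For anyone repeating this check: the txg can be pulled straight out of a `zdb -l` dump. A minimal sketch - the `extract_txg` helper is my own, and the sample text below only imitates the shape of a label dump (the values are this thread's, the layout is illustrative):

```shell
#!/bin/sh
# Pull the txg field out of a "zdb -l"-style label dump.
# Real usage would be: zdb -l /dev/da2 | extract_txg
extract_txg() {
    awk '$1 == "txg:" { print $2; exit }'
}

# Illustrative sample shaped like a zdb -l label dump.
sample_label="    version: 5000
    name: 'zpl'
    txg: 908576
    pool_guid: 15065020095725120455"

echo "$sample_label" | extract_txg   # prints 908576
```

Running this once per member disk makes it easy to spot which devices disagree on the last committed transaction.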

The data on the disks has not been modified, because before starting the whole process I stopped every service that could write to (or read from) it.

The question now is: how can I change the metadata of da2 and da3 so that it matches that of da4 and da5?
Thanks

Re: zpool faulted

Posted: 18 Sep 2015 13:32
by raulfg3
nicolap8 wrote:The question now is: how can I change the metadata of da2 and da3 so that they will be the same of da4 and da5?
Thanks
High-level question; sorry, I do not know the answer.

Perhaps Parkcomm knows it; I remember a recent post about metadata on ZFS.

Re: zpool faulted

Posted: 18 Sep 2015 15:55
by nicolap8
After some(!) reading I tried this:
zpool import -N -o readonly=on -f -F -R /pool -T 908096 zpl

cannot import 'zpl': one or more devices is currently unavailable
Of course, the first time I also used the "-n" switch and got no message :-(

The selected transaction is present on all 5 disks.
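For the record, the candidate transaction numbers for `-T` can be found by dumping the uberblock ring on each member disk. A hedged sketch (zdb flag behaviour varies a little between ZFS versions, so verify against your zdb man page):

```shell
# Dump the labels plus the uberblock list for one member disk.
# Each uberblock line carries a "txg = N" value; any txg present in the
# uberblock ring of every disk is a potential rewind target for
# "zpool import -T".
zdb -lu /dev/da2 | grep -w txg
```

Picking the highest txg that appears on all five disks gives the most recent consistent state to rewind to.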

Update:
freenas: administrator# zpool import -N -o readonly=on -f -F -X -R /pool -V -T 907981 zpl
freenas: administrator# zpool status
pool: zpl
state: FAULTED
status: One or more devices could not be used because the label is missing
or invalid. There are insufficient replicas for the pool to continue
functioning.
action: Destroy and re-create the pool from
a backup source.
see: http://illumos.org/msg/ZFS-8000-5E
scan: none requested
config:

NAME                           STATE    READ WRITE CKSUM
zpl                            FAULTED     0     0     0
  raidz2-0                     DEGRADED    0     0     0
    replacing-0                UNAVAIL     0     0     0
      1200431444536697293      OFFLINE     0     0     0  was /dev/diskid/DISK-JK2171B9HYWX4L
      5458900440138144991      UNAVAIL     0     0     0  was /dev/ada1
      859813622038509516       UNAVAIL     0     0     0  was /dev/da2
      18217603076805749401     UNAVAIL     0     0     0  was /dev/da3
      17699380717749284690     UNAVAIL     0     0     0  was /dev/da4
      1703911041795763760      UNAVAIL     0     0     0  was /dev/da5
freenas: administrator#
Now the pool is imported in read-only mode. The magic was done by the -V switch, but I don't know what it does!
And I don't understand why the disks are all UNAVAIL: they are present and working.

Re: zpool faulted

Posted: 18 Sep 2015 18:37
by nicolap8
Found the -V switch!
https://github.com/zfsonlinux/zfs/blob/ ... ool_main.c
-V Import even in the presence of faulted vdevs. This is an intentionally undocumented option for testing purposes, and treats the pool configuration as complete, leaving any bad vdevs in the FAULTED state. In other words, it does verbatim import.

Re: zpool faulted

Posted: 18 Sep 2015 19:34
by nicolap8
Another update, a small step! I imported the pool.
zpool status
pool: zpl
state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from
a backup source.
see: http://illumos.org/msg/ZFS-8000-72
scan: none requested
config:

NAME                           STATE    READ WRITE CKSUM
zpl                            FAULTED     0     0     1
  raidz2-0                     DEGRADED    0     0     6
    replacing-0                UNAVAIL     0     0     0
      1200431444536697293      OFFLINE     0     0     0  was /dev/diskid/DISK-JK2171B9HYWX4L
      5458900440138144991      UNAVAIL     0     0     0  was /dev/ada1
    da2                        ONLINE      0     0     0
    da3                        ONLINE      0     0     0
    da4                        ONLINE      0     0     0
    da5                        ONLINE      0     0     0
Now I have two possibilities:
a) remove the "replacing-0" devices, add a new one, then resilver
b) correct the bad metadata by hand :lol:
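If option (a) turns out to be viable, it would presumably look something like this - the GUIDs are taken from the status output in this thread, and `ada1` is assumed to be the new disk; this is untested against a faulted pool, so treat it as a sketch only:

```shell
# Cancel the stalled replace operation by detaching its failed half
# (the UNAVAIL device that was /dev/ada1):
zpool detach zpl 5458900440138144991

# ...then retry the replacement of the offlined original disk
# onto the new drive:
zpool replace zpl 1200431444536697293 ada1
```

Detaching one half of a `replacing-0` vdev collapses it back to a single device, after which a normal replace can be attempted.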

Re: zpool faulted

Posted: 25 Sep 2015 18:29
by Parkcomm
Hey Guys - sorry I missed this one.

nicolap8 - please do not try to edit the metadata by hand (it's possible, I believe, but I don't think it's the best idea just yet).

Because you've had two disks fail simultaneously, either:

You've got a fault in some common componentry (SATA card or cable)

OR

You've had a change in drive labels (e.g. /dev/ada0 has become /dev/ada1). Seeing "status: One or more devices could not be used because the label is missing" is a clue that this might be the case. Good news: this is totally fixable!

So let's have a look at those labels - can you post the output of

Code: Select all

zdb -C poolname
and

Code: Select all

zdb -C
The reason I am asking for both is in case you have a zpool.cache file for comparison. If your pool is currently in an exported state, use "zdb -C -e poolname" rather than importing.

In the meantime, have you looked for faults in dmesg? Have you made sure the drives are seated correctly and the cables are connected correctly? Do you have another chassis - or can you borrow a friend's - to rule out common component failure?

Sorry if I'm teaching you to suck eggs, but sometimes in the heat of battle, you can skip the basics.

Re: zpool faulted

Posted: 26 Sep 2015 22:16
by nicolap8
Parkcomm wrote:Hey Guys - sorry I missed this one.
No problem!
Parkcomm wrote:nicolap8 - please do not try to edit the metadata by hand (it's possible, I believe, but I don't think it's the best idea just yet).
I will! :twisted: Then I will report here :)
Parkcomm wrote:Because you've had two disks fail simultaneously, either:

You've got a fault in some common componentry (SATA card or cable)
dmesg was empty. I checked the cables, the cards and the SMART reports: no faults at all.
Parkcomm wrote:You've had a change in drive labels (e.g. /dev/ada0 has become /dev/ada1).
I removed a disk and replaced it with another in the same slot/cable/controller.

I think it was some kind of race condition in the ZFS core (my machine is i386...)
Parkcomm wrote:Seeing "status: One or more devices could not be used because the label is missing" is a clue that this might be the case. Good news, this is totally fixable!

So lets have a look at those labels - can you post the output of

Code: Select all

zdb -C poolname
and

Code: Select all

zdb -C

Code: Select all

$ zdb -C -e zpl
zdb: can't open 'zpl': Input/output error
Parkcomm wrote:In the meantime, have you looked for faults in dmesg? Have you made sure the drives are seated correctly and the cables are connected correctly? Do you have another chassis - or can you borrow a friend's - to rule out common component failure?

Sorry if I'm teaching you to suck eggs, but sometimes in the heat of battle, you can skip the basics.
:lol: you are asking the right things!
Thanks