
[ IMPORTANT ] Last Friday morning, ZFS freeze

Posted: 24 Sep 2013 11:33
by justin
Last Friday our ZFS volumes froze.
This happened on a SAN/NAS used for backups over NFS.
Thursday, 10:00pm: it was still working (ZFS snapshot copy from another SAN).
Friday, 1:00am: a backup tried to start over NFS and never started.
Friday, 4:00am: the system report mail was never sent. smartctl was frozen as well.

I noticed this at 11:00am. I tried to access data on the pool, but it was impossible. The system itself was still running.
ssh access ok.
zfs list ok
zpool status -v ok

NO errors reported in the logs.
The IPMI console doesn't show any error.
No kernel panic seen.

WHAT HAPPENED ??? Mysterious!!

I'm afraid it will happen again on the backup SAN/NAS, and even MORE afraid of it on the iSCSI SAN !! :(

Is it NFS that crashed?
Is there a way to increase logging?

After a few tries, the web interface stopped responding.

I tried the reboot command... it froze too. I waited 5 minutes... nothing. No error message anywhere.

I had to hard reset... (Yeah!!! so happy).

After that, all was good. I even saw the pending backup start and complete.

The SAN/NAS has been in production for 1.5 months. I have to understand what happened.

Thank you very much,

Best regards,

Re: [ IMPORTANT ] Last Friday morning, ZFS freeze

Posted: 24 Sep 2013 21:49
by Lee Sharp
Use dmesg and see if you see disconnect and timeout errors on a SATA card. I had a card slowly going out, and after a while the zpool would lock up. Replaced the card, and all was good again.

Re: [ IMPORTANT ] Last Friday morning, ZFS freeze

Posted: 24 Sep 2013 22:19
by btechnet
You may need to plug a monitor into the server and view the console output when this happens. If there are hard drive timeout errors, then CAM will provide a status change and spit out information about what changed. Example:

NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
CAM status: Command timeout
Error 5, Retries exhausted
NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
CAM status: Command timeout
Error 5, Retries exhausted

This will cause ZFS to stop but it will most likely show that the pool is degraded.

On the other hand, if your RAM is going bad, that can cause a kernel to deadlock. But usually you get a timeout trap when that happens.
Use memtest to check for bad RAM.

If not, then it may be your boot drive. (if it is not embedded)
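
If you save the console or dmesg output to a file, something like the following could pick out the timeout lines for you. This is only a rough sketch: the two patterns are taken from the example output above, the function name find_timeout_lines is made up for illustration, and your controller or driver may word its errors differently.

```python
import re
import sys

# Patterns are an assumption based on the sample CAM output quoted above;
# other drivers may print different messages.
TIMEOUT_PATTERNS = [
    re.compile(r"CAM status: Command timeout"),
    re.compile(r"Retries exhausted"),
]


def find_timeout_lines(log_text):
    """Return (line_number, line) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in TIMEOUT_PATTERNS):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Expects the path of a saved log file as the first argument.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, line in find_timeout_lines(handle.read()):
            print(f"{number}: {line}")
```

For example, save the output with "dmesg > /tmp/dmesg.txt" and pass that file to the script (the path is just a placeholder). No hits does not prove the hardware is fine; it only means these two messages were not in the saved output.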

Re: [ IMPORTANT ] Last Friday morning, ZFS freeze

Posted: 24 Sep 2013 23:14
by Lee Sharp
btechnet wrote:
NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
CAM status: Command timeout
Error 5, Retries exhausted
NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
CAM status: Command timeout
Error 5, Retries exhausted

Thanks for that. I saved my dmesg somewhere, but could not find it at the time. :)

And those timeouts start as degraded zpools and corrupt files, but eventually they can hang the entire pool and more.

Re: [ IMPORTANT ] Last Friday morning, ZFS freeze

Posted: 25 Sep 2013 13:11
by justin
Thanks for the answers, btechnet and Lee Sharp.

dmesg: no errors
Console: I logged in over IPMI and tried to find some errors. Is there a key combination to see more console messages? Alt+F4 through Alt+F12 does nothing.

I don't see any errors now.
But on Friday, dmesg and the console (Alt+F1) didn't show anything either.

If it happens again, I'll run a memtest.

If I redirect syslog to another server, will the console messages be redirected too? Is there a way to do that?

Best regards,

Re: [ IMPORTANT ] Last Friday morning, ZFS freeze

Posted: 26 Sep 2013 00:06
by Lee Sharp
When it happens, run dmesg from the console. Compare it to a dmesg run after boot. If the broken dmesg is longer, those later lines are the key.
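
If you want to automate that comparison, here is a minimal sketch. It assumes you saved one dmesg right after boot and one while the box was hung, and that the kernel message buffer did not wrap in between (if it wrapped, the two files will not share a common start and the function returns everything from the first difference). The function name new_dmesg_lines and the file names are made up for illustration.

```python
import sys


def new_dmesg_lines(baseline_text, current_text):
    """Return the lines of current_text that follow the shared leading part."""
    baseline = baseline_text.splitlines()
    current = current_text.splitlines()
    common = 0
    # Count how many leading lines are identical in both captures.
    for old, new in zip(baseline, current):
        if old != new:
            break
        common += 1
    return current[common:]


if __name__ == "__main__" and len(sys.argv) > 2:
    # Arguments: path of the after-boot capture, then the capture taken during the hang.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        baseline_text = handle.read()
    with open(sys.argv[2], encoding="utf-8", errors="replace") as handle:
        current_text = handle.read()
    for line in new_dmesg_lines(baseline_text, current_text):
        print(line)
```

Take the first capture with something like "dmesg > /root/dmesg-boot.txt" after a clean boot, the second one from the console when it hangs, and run the script with both files. Whatever it prints are the later lines to look at.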