Unstable system crashes performing zfs send/receive

Posted: 27 Jul 2014 07:01
by gregb
I've got a NAS4Free system that is running really well EXCEPT that it crashes. I have swapped out hardware and tried to narrow down the issue without much (any) success.

I have found that I can reliably crash NAS4Free after ~15 minutes by making a copy of one pool to another:

Code:

 # zfs snapshot -r tank@01
 # zfs send -R tank@01 | zfs receive -Fdvu
I had top running through ssh and the last update prior to the crash was:

Code:

last pid:  5289;  load averages:  2.06,  1.60,  1.93                           up 0+00:34:03  16:15:48
33 processes:  1 running, 32 sleeping
CPU:  0.0% user,  0.0% nice, 25.1% system,  0.0% interrupt, 74.9% idle
Mem: 442M Active, 32M Inact, 22G Wired, 12M Buf, 8151M Free
ARC: 20G Total, 586M MFU, 18G MRU, 830M Anon, 529M Header, 8631K Other
Swap: 64G Total, 64G Free

  PID USERNAME      THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 5105 root            1  52    0 37624K  3280K nokva   3   1:14 21.88% zfs
 5104 root            1  29    0 37624K  3332K pipewr  1   0:29  4.59% zfs
 4894 root            1  20    0 59672K 15828K select  1   0:07  0.29% bsnmpd
 4287 root          309  20    0  1438M   423M uwait   3   0:10  0.00% istgt
 5112 root            1  20    0 16596K  2488K CPU2    2   0:01  0.00% top
 4354 root            1  20    0 12084K  1632K select  2   0:01  0.00% powerd
 4317 root            1  20    0 54868K  6240K select  3   0:00  0.00% snmp-ups
 4950 greg            1  20    0 56272K  5096K select  3   0:00  0.00% sshd
 3877 root            1  20    0 12112K  1840K select  3   0:00  0.00% syslogd
 4319 root            1  20    0 20392K  3692K select  2   0:00  0.00% upsd
 3634 root            1  20    0  6280K   740K select  0   0:00  0.00% devd
 4352 root            1  20    0 20404K  3828K nanslp  0   0:00  0.00% upsmon
 4953 root            1  20    0 14508K  3564K pause   2   0:00  0.00% csh
The hardware is:
- e5-2609
- 32G ECC RAM (currently trying another 32G of non-ECC RAM in a single socket desk motherboard - didn't help)
- LSI 2008 controllers (P19 IT firmware)
- x540 nics (currently trying i350's - didn't help)
- two disks in a plain mirror for swap (doesn't seem to help)

I am using:
- ipv4 and ipv6
- 9k MTU
- lagg
- VLANs
- iSCSI and NFS (no CIFS)

As an experiment I tried the same zfs send/receive using FreeNAS. It looked to be going really well (it lasted well past the 15 minutes) but crashed with a kernel problem when I added a VLAN interface (I thought that it was going so well I might add some networking - bad idea).

I found this specific way to crash it because the pool has a vdev with 8 disks (bad idea) and I want to migrate it onto a new pool with a raidz2 vdev of 6 devices. Is that somehow causing an issue?

How can I capture the cause of the crash please? Without understanding what is crashing I am not making any progress on this. When it crashes I don't see anything.

Re: Unstable system crashes performing zfs send/receive

Posted: 27 Jul 2014 08:16
by b0ssman
Try the P16 firmware. The FreeBSD driver has some problems with newer firmware.

Re: Unstable system crashes performing zfs send/receive

Posted: 27 Jul 2014 12:28
by b0ssman
Also be aware that there are fake LSI 2008 controllers around which cause system instability.

The chips did not pass QA, but instead of being destroyed they were "acquired" and used to build these knock-off controllers.

Re: Unstable system crashes performing zfs send/receive

Posted: 28 Jul 2014 14:19
by armandh
also how long does the RAM test go?

and the usual suspects of power supply and incoming power

Re: Unstable system crashes performing zfs send/receive

Posted: 31 Jul 2014 07:58
by gregb
Thanks for the comments. I will look into reflashing the cards down to P16. I used to have P16 firmware on the card and put the latest P19 on to try and overcome these issues.

How do you know if it was a reject chip? The controllers had a sticker with serial numbers. Can I check if they are valid serial numbers?

I only ran the RAM checker for about 12 hours (on the last run). I have run the system with other power supplies (power was the first thing I checked).

How can I get kernel crash information?

Re: Unstable system crashes performing zfs send/receive

Posted: 31 Jul 2014 08:17
by b0ssman
see
http://www.freebsd.org/doc/en/books/dev ... tions.html

Change the default value of the debug.debugger_on_panic sysctl to 0 so that a panic does not stop in the kernel debugger.
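
As a rough sketch of the crash dump side of this, assuming a stock FreeBSD-style rc.conf: the two variables below are standard FreeBSD rc.conf settings, but the dumpdir path is a made-up example, and on an embedded NAS4Free install they would probably need to be entered through the WebGUI rather than by editing the file directly.

Code:

```shell
# rc.conf fragment (sh syntax) for saving kernel crash dumps.
# Assumption: stock FreeBSD rc.conf handling; not verified on NAS4Free.

# Let the system pick a swap device to receive the kernel dump on panic.
dumpdev="AUTO"

# Where savecore copies the dump at the next boot.
# Hypothetical path: it should be on persistent storage, not a RAM disk.
dumpdir="/mnt/data/crash"
```

With these set and debug.debugger_on_panic at 0, a panic should write the dump to swap and reboot, and savecore should then copy it into dumpdir as vmcore.N together with an info.N summary. The 64G of swap shown in the top output is large enough for 32G of RAM, but dumping onto a mirrored swap device may not work, so a single plain swap partition may be needed.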