System crash after ZFS pool import or scrubbing attempt

Posted: 02 Apr 2015 09:46
by clicker
Hi there!

I made a fresh installation of NAS4free on a computer, configured the settings I needed, and copied data to a newly created ZFS pool. After that I started changing some settings, and suddenly the system crashed and went into an endless reboot loop.
So I started looking for the cause, and after several days found that the system crashes when I try to scrub a pool containing more than 100 GB of data. If I scrub the pool before copying data, or after copying only a small amount (4-10 GB), nothing goes wrong. Moreover, if I reinstall the system, or even boot the LiveCD and try to import that pool, the system crashes again anyway (the pool remains offline after the reboot).

Unfortunately, I don't know how to read or save the crash dump log, so if anyone can tell me how to do this, I'll post the logs.

My configuration is:

NAS4free Version: 9.3.0.2 - Nayla (revision 1213)
Platform: x64-full on Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz
HDDs:
zfs disk - ada0 | WDC WD40EFRX-68WT0N0 | 3815448MB | ahcich0 | Intel Cougar Point AHCI SATA controller | Always on | ZFS storage pool device
system disk - ada1 | WDC WD1200JS-00MHB0 | 114474MB | ahcich1 | Intel Cougar Point AHCI SATA controller | Always on | UFS

System crash after ZFS pool import or scrubbing attempt

Posted: 02 Apr 2015 16:13
by b0ssman
Run a memtest. Does your system support ECC? That is a de facto requirement.



Re: System crash after ZFS pool import or scrubbing attempt

Posted: 03 Apr 2015 01:36
by Lee Sharp
First, disconnect the hard drives. You have probably already lost the data there, but why kick them again?

Now, check your foundation. On any server I start with an overnight memtest burn-in by booting memtest and then removing the CD. This thoroughly tests the memory, but also does an incidental test of the CPU and power supply, because if the machine crashes and/or reboots, it will NOT boot back into memtest.

What he is saying about ECC is overly alarming, but somewhat valid. If you have regular memory, and it fails, it WILL corrupt all of your data when you scrub. There is no qualifier here. It will read the data, calculate the checksums and get errors due to the faulty memory, and then "correct" the data on the disk to match those errors. You can run without ECC memory, and I do often. But you had better have good and frequent backups. And it is easy to see when you have a memory error because all of your data goes away.

Re: System crash after ZFS pool import or scrubbing attempt

Posted: 03 Apr 2015 05:08
by Princo
Lee Sharp wrote:What he is saying about ECC is overly alarming,
Sorry, but you have no idea what you're talking about. Your suggestions are irresponsible. ECC is mandatory for using ZFS, there is no discussion about this, and doing backups doesn't help here.

Re: System crash after ZFS pool import or scrubbing attempt

Posted: 03 Apr 2015 09:02
by clicker
Thank you for the fast answers.

Sorry, I forgot to describe the platform when I wrote the post.
I have an Intel R1304BTSSFANR server platform with 4 GB of 1333 MHz Kingston ECC CL9 RAM (KVR13LE9S8/4).

Anyway, I started memtest a few hours ago, just in case. So far no errors have been displayed, but I will let it run until the end of the workday.
As I said, I have spent several days trying to find the cause, changing conditions and searching Google. For example, I tried "zpool import -F -f -o readonly=on -R /mnt/temp pool0" as described in viewtopic.php?t=8173, and it works. But I don't need to recover the existing data. I am looking for a second enterprise storage OS, so I need the full set of ZFS features, including scrubbing (BTW, ZFS works fine until I start a scrub, which crashes the system).
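
For reference, the read-only import quoted above can be broken down flag by flag. The sketch below only prints the command instead of running it. The helper name `ro_import_cmd` is made up for illustration, while the pool name `pool0` and the alternate root `/mnt/temp` come from the post. The flag notes reflect the zpool man page as commonly documented, so check them against the version on your system.

```shell
#!/bin/sh
# Hypothetical helper: prints a read-only recovery import command (dry run).
ro_import_cmd() {
    pool="$1"
    altroot="$2"
    # -F              recovery mode: may discard the last few transactions
    # -f              force the import even if the pool looks in use elsewhere
    # -o readonly=on  import the pool without writing to it
    # -R <dir>        mount the pool's datasets under an alternate root
    printf 'zpool import -F -f -o readonly=on -R %s %s\n' "$altroot" "$pool"
}

# Values taken from the post above.
ro_import_cmd pool0 /mnt/temp
```

Printing the command first makes it easy to review before typing it into a LiveCD shell.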

System crash after ZFS pool import or scrubbing attempt

Posted: 03 Apr 2015 09:23
by b0ssman
To save logs see http://wiki.nas4free.org/doku.php?id=faq:0134
4 GB is very little for that kind of setup. 1 GB of RAM per 1 TB of storage is recommended.



Re: System crash after ZFS pool import or scrubbing attempt

Posted: 03 Apr 2015 17:03
by Lee Sharp
Princo wrote:
Lee Sharp wrote:What he is saying about ECC is overly alarming,
Sorry, but you have no idea what you're talking about. Your suggestions are irresponsible. ECC is mandatory for using ZFS, there is no discussion about this, and doing backups doesn't help here.
Many people keep saying this, and feel very strongly about it. However, many other people who have significant skill disagree. (Including one of the co-founders of ZFS. http://arstechnica.com/civis/viewtopic. ... #p26303271 ) So while there may be "no discussion about this" with some people, there is undoubtedly some dissension. And saying "backups do not help here" truly baffles me. How can having a good copy of your data not help when you lose data?
My point is that not using ECC will not cause your NAS to burst into flame. It does add significant risk, but that risk can be balanced. And that is what good IT professionals do; find cost effective ways to mitigate risk as much as can be afforded. And this is from someone with just over a dozen nas4free boxes in production without ECC ram, and no failures yet.
And let me be clear that I am not saying there is no risk. I am saying that the assessment of that risk may be overstated by some people.

Lastly, as a FOSS project lead and multiple forum moderator myself, I wanted to give you a little advice. In that kind of position, you no longer have the luxury of just being a person. Small, offhand statements carry a lot of weight when you have that kind of authority. So saying things like "Sorry, but you have no idea what you're talking about. Your suggestions are irresponsible." should be considered strongly and perhaps reworded.

Re: System crash after ZFS pool import or scrubbing attempt

Posted: 03 Apr 2015 18:16
by b0ssman
you can run zfs without ecc. i have done it myself.
but you should really know what you are doing and know what risk you are taking.

we should not recommend using zfs without ecc to newcomers who do not know about the risks or how to mitigate them.
we want their data to be safe now and safe in years to come.
most newcomers will set up their nas once and then forget about it and only come back to it when things have gone wrong.
and by then it will be too late.

Re: System crash after ZFS pool import or scrubbing attempt

Posted: 03 Apr 2015 23:16
by Princo
Lee Sharp wrote:And saying "backups do not help here" truly baffles me. How can having a good copy of your data not help when you lose data?
Because corrupted data looks like normally changed data. Yes, it has the same date and size, but the content differs, so it's a candidate for the next backup.
If things go bad (and they do), your backup contains the same corrupted data as your main system. And so backups do not help here.
Lee Sharp wrote:My point is that not using ECC will not cause your NAS to burst into flame.
Nobody said that. We are discussing a high risk of complete data loss.
Lee Sharp wrote:It does add significant risk, but that risk can be balanced.
Balanced in this way?: "I never fasten seat belts, because my car is equipped with air bags."
Lee Sharp wrote:And that is what good IT professionals do;
No, this is what bad businessmen do. :evil:
A good IT professional is not arguing against the benefits of ecc ram. Never ever.
Lee Sharp wrote:find cost effective ways to mitigate risk as much as can be afforded.
"Yes, why not take this old, somewhat dismantled big tower from Walmart as a cheap file server for the office?".
In most cases, this is not good advice.
Lee Sharp wrote:And this is from someone with just over a dozen nas4free boxes in production without ECC ram, and no failures yet.
I also used nas4free for years without ecc ram, with no data loss (I think). But after I heard about that specific problem, I instantly changed my hardware because of irreplaceable data on it. And: yes, I have backups.
Lee Sharp wrote:And let me be clear that I am not saying there is no risk. I am saying that the assessment of that risk may be overstated by some people.
The risk is a complete loss of data, and (maybe) a useless backup.
It takes days to reinstate the system.
With ecc ram you can avoid it.
Lee Sharp wrote:Lastly, as a FOSS project lead and multiple forum moderator myself, I wanted to give you a little advice. In that kind of position, you no longer have the luxury of just being a person. Small, offhand statements carry a lot of weight when you have that kind of authority. So saying things like "Sorry, but you have no idea what you're talking about. Your suggestions are irresponsible." should be considered strongly and perhaps reworded.
I apologize if my words upset you.

Regards
Princo

Re: System crash after ZFS pool import or scrubbing attempt

Posted: 04 Apr 2015 09:25
by b0ssman
princo you can turn off the checksum in zfs.
this will prevent bad memory from destroying existing data.
this will take away an important feature of zfs.
however i would rather run zfs without checksum than the freebsd software raid.

Re: System crash after ZFS pool import or scrubbing attempt

Posted: 04 Apr 2015 09:38
by b0ssman
do this and see what the kernel panic is.
http://blog.hostileadmin.com/2012/09/25 ... ng-kernel/

During a kernel panic, it will simply reboot unless debug.debugger_on_panic is enabled. To enable this execute:

sysctl debug.debugger_on_panic=1
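
Before relying on this sysctl, it may be worth checking whether the running kernel has a debugger backend at all, since without one there is nothing to drop into on panic. A minimal sketch, assuming a FreeBSD-based system where the `debug.kdb.available` sysctl lists the available backends. The function name `debugger_hint` is hypothetical, and this thread does not confirm whether the NAS4free kernel includes one.

```shell
#!/bin/sh
# Hypothetical helper: interpret the list of kernel debugger backends.
debugger_hint() {
    backends="$1"
    if [ -z "$backends" ]; then
        echo "no debugger backend: the kernel will likely just reboot on panic"
    else
        echo "backends available: $backends"
    fi
}

# debug.kdb.available exists on FreeBSD kernels built with KDB support;
# on other systems the lookup fails and the list is treated as empty.
backends=$(sysctl -n debug.kdb.available 2>/dev/null || true)
debugger_hint "$backends"
```

If the list is empty, setting debug.debugger_on_panic is unlikely to change the reboot behaviour, and the panic text has to be captured some other way.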

Re: System crash after ZFS pool import or scrubbing attempt

Posted: 06 Apr 2015 16:40
by Lee Sharp
b0ssman wrote:you can run zfs without ecc. i have done it myself.
but you should really know what you are doing and know what risk you are taking.

we should not recommend using zfs without ecc to newcomers who do not know about the risks or how to mitigate them.
we want their data to be safe now and safe in years to come.
most newcomers will set up their nas once and then forget about it and only come back to it when things have gone wrong.
and by then it will be too late.
Totally agree with this! I do recommend ECC. However, it has a substantial cost. (More than just the price of memory as ECC motherboards are also expensive)

However, that is minor. It is the folks saying "You Must Use ECC. There is no debate." that I take issue with, because that is simply not true. It can work very well without ECC, and there is a heck of a lot of debate!

Re: System crash after ZFS pool import or scrubbing attempt

Posted: 06 Apr 2015 17:02
by Lee Sharp
Princo wrote:
Lee Sharp wrote:And saying "backups do not help here" truly baffles me. How can having a good copy of your data not help when you lose data?
Because corrupted data looks like normally changed data. Yes, it has the same date and size, but the content differs, so it's a candidate for the next backup.
If things go bad (and they do), your backup contains the same corrupted data as your main system. And so backups do not help here.
My fault... I should have specified here, but one copy is NOT a backup. Lots of copies in lots of places with lots of versioning. And then you go back to the point before corruption. Autosnapshot is one of the reasons we use ZFS after all. (And I back up to a system with ZFS and autosnapshot. Which is one of the reasons I am frustrated with the nas4free rsync implementation.)
Princo wrote:
Lee Sharp wrote:My point is that not using ECC will not cause your NAS to burst into flame.
Nobody said that. We are discussing a high risk of complete data loss.
Wait a second... Last line you talked about silent and unnoticed corruption of a few files, and now the whole thing is falling over? It is this kind of argument that I dislike. More fear and less fact. Yes, you can have a complete filesystem loss. But we have seen more of those from failing hard drives than from failing memory...
Princo wrote:
Lee Sharp wrote:It does add significant risk, but that risk can be balanced.
Balanced in this way?: "I never fasten seat belts, because my car is equipped with air bags."
Funny that you mention that. I also ride motorcycles, and I have a reasoned opinion on helmets, based on data. A while back, there was a huge survey of motorcycle accident data called the Hurt Report. http://en.wikipedia.org/wiki/Hurt_Report It showed that most head impacts occur to the jaw and face... So an open face or shorty helmet is somewhat pointless. It also showed that while helmets decrease closed head injury, they also increase spinal trauma by about the same amount. This second line makes helmet use a trade off, and open faced or shorty helmet use a net negative. So, using reasoned data, I use a full face helmet on highway rides, or in poor weather, and no helmet in short rides or out in the country.
Princo wrote:
Lee Sharp wrote:And that is what good IT professionals do;
No, this is what bad businessmen do. :evil:
A good IT professional is not arguing against the benefits of ecc ram. Never ever.
I guess you work in that mythical IT shop with an unlimited budget? We are all businessmen. And if you cannot present a business case for a given expense, there is a good reason it is being denied.
Princo wrote:
Lee Sharp wrote:find cost effective ways to mitigate risk as much as can be afforded.
"Yes, why not take this old, somewhat dismantled big tower from Walmart as a cheap file server for the office?".
In most cases, this is not good advice.
No, but I am often repurposing old servers of varying specs and quality. When it is a choice between nas4free on an older HP with no ECC, and continuing to use the old ReadyNAS, which do you choose? Because sometimes those are your only budget options...
Princo wrote:
Lee Sharp wrote:And this is from someone with just over a dozen nas4free boxes in production without ECC ram, and no failures yet.
I also used nas4free for years without ecc ram, with no data loss (I think). But after I heard about that specific problem, I instantly changed my hardware because of irreplaceable data on it. And: yes, I have backups.
You heard something, did not see it, and reacted. Whereas if I have something with "irreplaceable data on it," which I have had, it is backed up and versioned at several sites. I even had a client that was storing some data (encrypted) with megaupload when they were shut down. Because it was also stored elsewhere (backblaze, encrypted), he was fine.
Princo wrote:
Lee Sharp wrote:And let me be clear that I am not saying there is no risk. I am saying that the assessment of that risk may be overstated by some people.
The risk is a complete loss of data, and (maybe) a useless backup.
It takes days to reinstate the system.
With ecc ram you can avoid it.
If your DR plan takes "Days" you have a much bigger problem than ECC ram! DR of a single system should be hours or even minutes, not days.
Princo wrote:
Lee Sharp wrote:Lastly, as a FOSS project lead and multiple forum moderator myself, I wanted to give you a little advice. In that kind of position, you no longer have the luxury of just being a person. Small, offhand statements carry a lot of weight when you have that kind of authority. So saying things like "Sorry, but you have no idea what you're talking about. Your suggestions are irresponsible." should be considered strongly and perhaps reworded.
I apologize if my words upset you.
I was not upset. But I was not sure if you understood how strong your words were, and how they could be perceived by others. I have seen this kill projects before.

Re: System crash after ZFS pool import or scrubbing attempt

Posted: 07 Apr 2015 11:49
by clicker
Sorry for the long absence. I decided to run additional tests after the weekend, but I read the heated "pro and contra ECC" discussion with interest.
Here is the new information:
Adding this
Variable: clog_logdir
Value: /mnt/data/logs
Description: Log files location.
has no effect at all. Maybe I did something wrong. I booted from the LiveCD, imported the disks, mounted the UFS partition at /mnt/second, and set the variable value to /mnt/second/logs. However, nothing appeared there, either before or after the reboot.
Can someone point out my mistake?
b0ssman wrote:do this and see what the kernel panic is.
http://blog.hostileadmin.com/2012/09/25 ... ng-kernel/

During a kernel panic, it will simply reboot unless debug.debugger_on_panic is enabled. To enable this execute:

sysctl debug.debugger_on_panic=1
This also has no effect. The system still reboots immediately, as before.

Here is what I've done:
I reinstalled the system, applied the same settings, and copied 200 or 300 GB of data to the dataset "Techno" without copying ACLs. After that I started a scrub, and it worked fine!
Then I destroyed the dataset, recreated it, copied 200 GB of data WITH ACLs, and started a scrub again. It still finished fine. Then I added 150 GB more with ACLs and started a scrub. The scrub failed at 64-70% and the system crashed.

I booted from the LiveCD and tried to import the pool from the command line. The only way to capture the error was to record it on a cellphone :? Here it is:

Code:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
Fault virtual address   = 0xa0
Fault code              = supervisor read data, page not present
Instruction pointer     = 0x20:0xffffffff81cc35d7
Stack pointer           = 0x28:0xffffff80a5e764f0
Frame pointer           = 0x28:0xffffff80a5e76560
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 4 (txg_thread_enter)
trap number             = 12
panic: page fault
cpuid = 2
GDB: stack backtrace:
#0 0xffffffff80a32c96 at kdb_backtrace+0x66
#1 0xffffffff809f83ce at panic+0x1ce
#2 0xffffffff80e1a5d0 at trap_fatal+0x290
#3 0xffffffff80e1a931 at trap_pfault+0x211
#4 0xffffffff80e1af33 at trap+0x363
#5 0xffffffff80e04123 at calltrap+0x8
#6 0xffffffff81ca9d01 at vdev_mirror_io_start+0x221
#7 0xffffffff81cc2764 at zio_vdev_io_start+0x254
#8 0xffffffff81cc21ae at zio_execute+0xbe
#9 0xffffffff81c81a1c at dsl_scan_scrub_cb+0x3fc
#10 0xffffffff81c82b1d at dsl_scan_visitbp+0x3fd
#11 and so on
Today I will try to format all disks, reinstall the system, copy 350 GB of data without ACLs, and start a scrub to locate the error precisely.
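
To make a hand-transcribed backtrace like the one above easier to compare between crashes, the function names can be pulled out of the frame lines. A minimal sketch, assuming the `#N 0xADDR at function+0xOFFSET` layout shown in the post; `extract_frames` is a hypothetical helper, not a FreeBSD tool.

```shell
#!/bin/sh
# Hypothetical helper: reads a panic backtrace on stdin and prints
# "frame-number function-name" for every frame line it recognises.
extract_frames() {
    awk '/^#[0-9]+ 0x[0-9a-f]+ at / {
        split($4, parts, "+")          # "panic+0x1ce" -> "panic"
        print substr($1, 2), parts[1]  # "#6" -> "6"
    }'
}

# Three frame lines copied from the trace in this thread.
extract_frames <<'EOF'
#5 0xffffffff80e04123 at calltrap+0x8
#6 0xffffffff81ca9d01 at vdev_mirror_io_start+0x221
#7 0xffffffff81cc2764 at zio_vdev_io_start+0x254
EOF
```

Fed the full list of frame lines from the post, this names frames 6 through 10 as vdev_mirror_io_start, zio_vdev_io_start, zio_execute, dsl_scan_scrub_cb and dsl_scan_visitbp, which fits the observation that the crash happens while the scrub is running.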

Re: System crash after ZFS pool import or scrubbing attempt

Posted: 08 Apr 2015 07:59
by b0ssman
hmm so it seems like a hard crash.

it could be the power supply.

a scrub taxes the entire system and increases its power draw.

can you try with another psu?

Re: System crash after ZFS pool import or scrubbing attempt

Posted: 08 Apr 2015 11:43
by clicker
b0ssman wrote:hmm so it seems like a hard crash.

it could be the power supply.

a scrub taxes the entire system and increases its power draw.

can you try with another psu?
It is definitely NOT the power supply.

What I've done since yesterday:
I made a new installation, formatted the disks, applied the same settings as before, and copied about 400 GB of data without ACLs, then started a scrub. The scrub finished successfully.
Last time I noticed that the system crashed after I copied a 150 GB folder with ACLs. So I decided to start from the beginning: format the ZFS disk, recreate the pool, and copy this folder with ACLs. When I started the scrub, the system crashed immediately. After booting, it crashed again for the same reason I described before (page fault).

Summary: nas4free crashes when the scrub starts checking a certain folder (I'm not sure what exactly is in it) that was copied with ACLs, but it checks the same folder successfully when it was copied without ACLs.

What should I do to fix this? I need to preserve ACLs on the copied data (the total size is 1.5 TB, and I don't know whether it contains more traps like this).

Re: System crash after ZFS pool import or scrubbing attempt

Posted: 14 Apr 2015 07:15
by clicker
A week has gone by, but I still can't understand the cause of the error. Maybe someone has a suggestion for something to test?