Please advise re Data Loss question and proposed solution

XigmaNAS Basic Tune-up
Posts: 50
Joined: 23 Jan 2014 15:31
Location: nelson twp, OH, USA


Post by karlandtanya »

Sometimes our nas4free servers get really slow; sometimes they just stop responding altogether.
Next week I will spend the boss' $$$. I told him I'd ask the community for advice before just going with my plan.
Thanks very much to anybody who can suggest something!

Little more detail:

I've been blessed with the "IT" job at my company although I'm really just an old PLC monkey.

Low end server got bogged down, people in the office got sick of waiting, and I told them to just reboot it.
After they did what I asked there was about a day or day & 1/2 of missing data.
Incomplete transaction?
Solution: More memory or more everything?
Boss will spend reasonable money; I have a plan, just looking for some advice to steer me in the right direction
(Go out and buy a top-o-the-line Windows 10 server super-duper-pooper-scooper from Dell is not in the reasonable list)

Tons o' detail:

My work uses a couple of NAS4Free servers as primary and backup file servers.
Version is - Sayyadina (revision 4303)
Hardware is a Dell T20 x64-embedded on Intel(R) Pentium(R) CPU G3220 @ 3.00GHz, 4G ECC RAM
Disks are 4EA Seagate ST4000DM000-1F2168 in RAIDZ2; the pool shows 16% Frag, 62% Full.
We're using Active Directory authentication and CIFS/SMB; this seems to be working beautifully.
Also using Fritz' scripts to take regular snapshots, which correctly appear in "Previous Versions" for Windows users.
And using zrep to replicate every hour from the Primary to the backup server.
Regular scrubs and SMART monitoring; always passes with no issues.
Extended GUI is installed and rrdgraphs (now Status...Monitoring) tracking things.

From time to time the server gets somehow bogged down:
SMB performance is *really* slow
Web GUI login screen loads, but that's it.
SSH *usually* works, but restarting samba and lighttpd doesn't help.
A "controlled" shutdown/reboot by #shutdown -r now still doesn't seem to have done anything after about 10 minutes.
A "controlled" shutdown by #shutdown -p now eventually gets the ssh session kicked off,
but the machine still answers pings after several hours; never actually finishes shutting down.
In these cases, somebody from the office usually gets sick of waiting on it and leans on the power button.

This time some data was proven to be lost:
Work that users put on the server from Monday Afternoon (hour or so before the server got "slow") until hard reboot Tuesday 9AM is missing.
Last snapshot saved is Monday Midnight (0:00 Tues.); they resume shortly after the hard reboot.
There is an obvious "hole" in the status...monitoring graphs that lines up nicely with the missing snapshots.
*Note there is another "hole" in the snapshots over the preceding weekend, but I don't have any more info about that.

I'm calling the data "lost" because the Office Manager copied timesheets, etc. to the server and is very diligent about her work. So the files at least appeared to be there--but after the reboot it looks like they were not.
Note the office manager keeps another copy just in case and restored from it--this kind of thing is why I do not doubt that, from what she saw as a Windows SMB client, the files were reported as successfully copied to the server.

So....What's going on?
And...What is the solution?

My GUESS...if anybody can help me with this, PLEASE AND THANK YOU!!
User's data was written but some other operation(s) were going on at the same time?
(a scheduled backup, replication during the backup, ???)
My *vague* understanding of these things is zfs uses copy-on-write in a transactional manner along with something called the zfs intent log. And if the transaction fails before it is DONE, then it's all discarded and the previous data is untouched.
So...my continued GUESS:
Before the transaction could be completed, things started to get slow.
People got sick of waiting on it and power cycled; the transaction is still incomplete.
ZFS woke up after the reboot and says Last Known Good state doesn't include the past (ten of!!) hours. Sorry, dude. Other data is still solid.
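
To make the guess above concrete, here is a toy sketch in Python. It is NOT how ZFS is implemented; the class name `ToyStore` and its methods are made up for illustration. It only models the idea that a client can read back writes that are still pending, and that a power cut before the commit throws those away while older committed data survives.

```python
# Toy model only: an illustration of the guess, not ZFS internals.
# All names here (ToyStore, commit, power_cut) are hypothetical.

class ToyStore:
    def __init__(self):
        self.committed = {}  # state that survives a power cut
        self.pending = {}    # writes accepted but not yet committed

    def write(self, name, data):
        self.pending[name] = data

    def read(self, name):
        # Clients see pending writes as if they were already safe.
        if name in self.pending:
            return self.pending[name]
        return self.committed.get(name)

    def commit(self):
        # All-or-nothing: every pending write becomes durable together.
        self.committed.update(self.pending)
        self.pending.clear()

    def power_cut(self):
        # Anything not yet committed is discarded; older data is untouched.
        self.pending.clear()


store = ToyStore()
store.write("old_report.txt", "last week")
store.commit()

store.write("timesheet.xls", "monday afternoon")
visible_before = store.read("timesheet.xls")  # the client sees the file
store.power_cut()                             # somebody leans on the button
visible_after = store.read("timesheet.xls")   # gone after the reboot

print(visible_before, visible_after, store.read("old_report.txt"))
```

If the real system behaves anything like this, the file would look "copied" to the Windows client right up until the hard reboot.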

Is this valid? Is it possible that the office manager could look at the filesystem (by SMB) and see stuff that was *going to be* committed, but not yet complete?
Is there some other explanation what could be going on?

My Proposed Solution
I suspect if we'd waited long enough the transaction would have *eventually* completed, but...I've been expecting this for a while...
Of course--throw money at it!
--SSD for SLOG? (probably not)
This seems like a nifty cool thing that people like to add but most people don't really need.
I'm guessing this is probably not going to help get things written to the disk faster.
--Just add more memory to what we have? (maybe?)
I hear some folks say 1GB per TB and others say that's BS. I have no idea.
Crucial says 16G (we have 4x4T HDDs in there now) ECC is ~180, so total ~400.
If it helps, good. I'm a big fan of tons-o-memory.
--The Tim Taylor solution (more power Hrr Hrr Hrr)
Old ones are $300.00 servers from Dell. Probably $150 of that is the nice Dell case and logo.
Supermicro X11SSL-F (lotsa SATA & supports ECC, i3, and Xeon later if we need)
32G ECC (for each server)
Add 2 more HDDs to each server for total of 6x4T as RaidZ2 (62% is getting full IMO)
That BOM is about $2500.00 with all the other schmutz that you need.
Note--the T20s are DDR3, so if I guess wrong with the memory, it will be wasted.
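
A rough back-of-the-envelope check of the "62% is getting full" point, as a Python sketch. It assumes RAIDZ2 simply costs two disks of parity and ignores ZFS overhead and TB/TiB differences, so the numbers are approximate. The helper name `raidz2_usable_tb` is made up for this example.

```python
def raidz2_usable_tb(disks, disk_tb):
    # RAIDZ2 spends roughly two disks' worth of space on parity.
    return (disks - 2) * disk_tb

current = raidz2_usable_tb(4, 4)    # today's 4x4T layout
proposed = raidz2_usable_tb(6, 4)   # proposed 6x4T layout

used_tb = 0.62 * current            # pool is reported 62% full
pct_after = 100 * used_tb / proposed

print(f"usable now: ~{current} TB, proposed: ~{proposed} TB")
print(f"same data on the proposed layout: ~{pct_after:.0f}% full")
```

Under those assumptions the same data would fill about half as much of the larger pool.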

Advanced User
Advanced User
Posts: 401
Joined: 27 Jun 2012 20:18
Location: Northeast, USA
Re: Please advise re Data Loss question and proposed solution


Post by kenZ71 »

A bit late in my reply...

You mention backups are run often; how are those machines performing? Similar hardware specs? 4GB memory does sound a bit low. How many users access the server concurrently on average?

Is the server generally reasonably quick with read/write speeds?

SMART stats?

11.2-RELEASE-p3 | ZFS Mirror - 2 x 8TB WD Red | 28GB ECC Ram
HP ML10v2 x64-embedded on Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz

Extra memory so I can host a couple VMs
1) Unifi Controller on Ubuntu
2) Librenms on Ubuntu

Posts: 50
Joined: 23 Jan 2014 15:31
Location: nelson twp, OH, USA
Re: Please advise re Data Loss question and proposed solution


Post by karlandtanya »

So--the short answer for anybody who has this
Problem: Server stops responding to web or samba requests.
Can't reboot by console, either--it never finishes shutting down; just hangs.
If you unplug it or lean on the power button, sometimes the most recent writes to disk are lost.
Cause?: samba database file in /var grows too big or gets orphaned or something.
Solution: add a cron job to restart samba every day or so when nobody's using it.

Thanks for the reply; and now I must apologize for the late followup!
(I got sent on the road shortly after I posted!)

Anyhow, before I left I saw something funny on my home server--
It's just the same as the one at work, except only 2 users and it's a month or so ahead in revisions.
Anyhow, I saw the /var directory filling up--not full yet, otherwise I wouldn't be able to see it!
I let it go, and sure enough, Tanya calls to tell me the server locked up.

So... a little more watching and digging--it was the samba database files that were growing.
Somebody said (don't remember where I read it--a kind and helpful person on the web somewhere!!) that a file can be deleted so that only the directory entry showing the file exists is gone. The actual storage space isn't freed up until the process using it goes away. Some kind of stuck lock or something not getting released by the process??
Looking at the files in /var--they do NOT add up to full or anywhere near it...But the OS reports /var is completely full.
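
The deleted-but-still-open behavior described above can be reproduced with a small Python sketch on a POSIX system (FreeBSD, Linux). This is a generic demonstration using a temporary file, not the actual samba database: the name disappears from the directory, but the data stays allocated until the last open handle is closed.

```python
import os
import tempfile

# Create a 1 MiB scratch file and keep its descriptor open.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * (1024 * 1024))

# Delete the directory entry while the file is still open.
os.unlink(path)

exists_in_listing = os.path.exists(path)  # the name is gone
size_still_held = os.fstat(fd).st_size    # the data is still there
links = os.fstat(fd).st_nlink             # typically 0: no names left

print(exists_in_listing, size_still_held, links)

# Only when the last handle closes can the space be reclaimed.
os.close(fd)
```

That would explain why listing the files does not add up to what the OS reports as used: the space belongs to files that no longer have a name.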

After restarting all the extra services that put logs in /var, I saw the usage shrink after samba restart.

Added a cron job to restart samba every day, and continued watching it (then got sent to sunny Detroit!)
Anyhow, folks back home have NOT complained since, and when I can remote in, I see the /var directory is all happy now: never more than 1-2% max.
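
For anybody who wants to keep an eye on /var instead of remoting in, a minimal Python sketch of a usage check follows. It assumes Python is available on the box, and the 80% threshold and the names `THRESHOLD_PERCENT` and `usage_percent` are arbitrary choices for this example. The figure may differ slightly from what df prints.

```python
import os
import shutil

THRESHOLD_PERCENT = 80  # hypothetical alert level, pick your own

def usage_percent(path):
    # shutil.disk_usage returns total, used and free bytes for the filesystem.
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total

# Fall back to / on systems without a /var directory.
path = "/var" if os.path.isdir("/var") else "/"
percent = usage_percent(path)
status = "WARNING" if percent >= THRESHOLD_PERCENT else "ok"

print(f"{path}: {percent:.1f}% used ({status})")
```

Run from cron alongside the samba restart, something like this would flag the problem before the server locks up.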

