Temperature monitoring stopped working - stuck at 15C

nas4me2
NewUser
NewUser
Posts: 9
Joined: 21 Mar 2017 07:47
Status: Offline

Temperature monitoring stopped working - stuck at 15C

#1

Post by nas4me2 » 17 Apr 2019 12:13

Hi and thanks for a great system.

Just noticed the temperature monitoring graph stopped getting the right values about 3 weeks ago. The system has been up for almost 2 months, so the problem started while the system was running, and monitoring had been working for almost a year before that.

Indeed I see the old rrd update command in /usr/local/share/rrdgraphs/rrd-update.sh was:

Code: Select all

T1=`sysctl -q -n dev.cpu.0.temperature | awk '{gsub("C",""); print}'`;      # core 1 temperature

Unfortunately that command seems to be stuck at 15C, even though the real temperature has been well above that for the last few days. The graph also shows normal daily fluctuations from when it was working:

Code: Select all

nas4free: /etc# sysctl -q -n dev.cpu.0.temperature | awk '{gsub("C",""); print}'
15.0

I just changed it to use ipmitool instead, as it is giving more reliable numbers:

Code: Select all

nas4free: /etc# ipmitool sensor reading "CPU Temp" | awk '{print $NF}'
26
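
A minimal sketch of how that swap could look inside rrd-update.sh, assuming you want to keep the sysctl value as a fallback when ipmitool is missing or returns nothing. The function names (`strip_unit`, `last_field`, `read_cpu_temp`) are made up for illustration and are not part of the XigmaNAS script; the sensor name "CPU Temp" is the one from the board above and will differ on other hardware.

```sh
#!/bin/sh
# Hypothetical helper: strip the trailing "C" from a sysctl reading such as "15.0C".
strip_unit() {
    printf '%s\n' "$1" | awk '{gsub("C",""); print}'
}

# Hypothetical helper: keep only the last field of an
# "ipmitool sensor reading" line such as "CPU Temp         | 26".
last_field() {
    printf '%s\n' "$1" | awk '{print $NF}'
}

# Prefer the BMC reading; fall back to the kernel value if ipmitool
# is not installed or prints nothing for this sensor.
read_cpu_temp() {
    if command -v ipmitool >/dev/null 2>&1; then
        line=$(ipmitool sensor reading "CPU Temp" 2>/dev/null)
        if [ -n "$line" ]; then
            last_field "$line"
            return 0
        fi
    fi
    raw=$(sysctl -q -n dev.cpu.0.temperature 2>/dev/null) || return 1
    [ -n "$raw" ] || return 1
    strip_unit "$raw"
}
```

With this in place, the assignment in the script would become something like `T1=$(read_cpu_temp)`.
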
1. Last month temp graph
2. Latest 15 minutes since I updated the file.

Image

Image

Re: Temperature monitoring stopped working - stuck at 15C

#2

Post by nas4me2 » 17 Apr 2019 12:18

Urgh - images at pasteboard . co /

IaxnI58.jpg

Iaxk6AX.jpg

...respectively.

cookiemonster
Advanced User
Advanced User
Posts: 165
Joined: 23 Mar 2014 02:58
Location: UK
Status: Offline

Re: Temperature monitoring stopped working - stuck at 15C

#3

Post by cookiemonster » 17 Apr 2019 21:55

Hi. I'm pretty sure I've read somewhere that the ipmitool values, if available on your system, are always more reliable than the kernel values, as they're collected from the iron.
That said, it seems you're reporting the rrdtool values being stuck. It's fine on my system at the moment, allowing for the graphs only having values for two cores on them, T1 and T2, while my extended-gui correctly reports all 24 cores in my current system.
Maybe a crazy idea, but could the permanent location for your data have run out of space?
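
A quick way to rule that out, as a rough sketch: the helper names and the 10240 KiB default threshold are arbitrary choices for illustration, and the directory to check is whatever your RRD data location happens to be (not stated in this thread).

```sh
#!/bin/sh
# Print the free space, in KiB, of the filesystem holding the given directory.
# Uses POSIX "df -P" output, where column 4 is the available space.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Warn when less than a minimum amount is free.
# The default of 10240 KiB is an arbitrary illustrative value.
check_space() {
    dir=$1
    min_kb=${2:-10240}
    avail=$(free_kb "$dir")
    [ -n "$avail" ] || return 2          # df gave no usable output
    if [ "$avail" -lt "$min_kb" ]; then
        echo "low space on $dir: ${avail} KiB free"
        return 1
    fi
    echo "ok: ${avail} KiB free on $dir"
}
```

Call it as `check_space /path/to/your/rrd/data`, replacing the placeholder path with the real one.
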
Main: Xigmanas 11.2.0.4 x64-full-RootOnZFS on Supermicro X8DT3. zroot on mirrorred pair of CRUCIAL_CT64M225. Memory: 24GB ECC; 2 Xeon E5645 CPUs; Storage: (HBA) - LSI SAS 9211-4i with 3 SATA x 1 Tb in raidZ1, 1 x 3 Tb SAS drive as single stripe.
Spare1: HP DL580 G5; 128 GB ECC RAM; 4 CPU; 8 x 500 GB disks on H210i
Spare2: HP DL360 G7; 6 GB ECC RAM; 1 Xeon CPU; 5 x 500 GB disks on H210i
Spare3: HP DL380 G7; 24 GB ECC RAM; 2 Xeon E5645 CPUs; 8 x 500 GB disks on IBM M1015 flashed to LSI9211-IT

Re: Temperature monitoring stopped working - stuck at 15C

#4

Post by nas4me2 » 18 Apr 2019 03:48

It's both the RRD values and the System Status GUI - the status page correctly shows all 8 cores, but the temperature as reported by sysctl is wildly inaccurate at the moment.

There is over 40GB free so space should not be an issue. Besides, ipmitool is giving the right result - it's only sysctl values that seem to have stopped giving the right values.

As I said, it was working perfectly for the last 9-10 months. It worked perfectly after the most recent upgrade. It was working after the last reboot for about a month.

But sometime about 3 weeks ago it suddenly stopped working and started to report 15C as the only result. (To be honest, it does give slightly different values for the other cores, as shown below) - but the graph clearly indicates that core 0 was 'reported' as 15C for the whole time - when clearly (based on previous graph values) this is not a correct number.

Code: Select all

sysctl -a | grep temper
dev.cpu.7.temperature: 15.0C
dev.cpu.6.temperature: 15.0C
dev.cpu.5.temperature: 14.0C
dev.cpu.4.temperature: 14.0C
dev.cpu.3.temperature: 16.0C
dev.cpu.2.temperature: 16.0C
dev.cpu.1.temperature: 15.0C
dev.cpu.0.temperature: 15.0C
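
For comparing cores over time, output like the above can be reduced to plain "core value" pairs. This is only a sketch: `core_temps` is a made-up name, and it assumes the `dev.cpu.N.temperature: NN.NC` line format shown here.

```sh
#!/bin/sh
# Turn "dev.cpu.N.temperature: 15.0C" lines on stdin into "N 15.0",
# sorted by core number. Lines in any other format are ignored.
core_temps() {
    awk -F'[.:]' '/^dev\.cpu\.[0-9]+\.temperature:/ {
        temp = $0
        sub(/^[^:]*:[ \t]*/, "", temp)   # drop the sysctl key
        sub(/C$/, "", temp)              # drop the unit suffix
        print $3, temp                   # $3 is the core number
    }' | sort -n
}
```

Typical use would be `sysctl -a | grep temper | core_temps`.
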

Code: Select all

# ipmitool sensor 
CPU Temp         | 21.000     | degrees C  | ok    | 0.000     | 0.000     | 0.000     | 93.000    | 98.000    | 98.000    
System Temp      | 27.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 80.000    | 85.000    | 90.000    
Peripheral Temp  | 30.000     | degrees C  | ok    | -9.000    | -7.000    | -5.000    | 80.000    | 85.000    | 90.000    
DIMMA1 Temp      | 29.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000    
DIMMA2 Temp      | 26.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000    
DIMMB1 Temp      | 24.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000    
DIMMB2 Temp      | 26.000     | degrees C  | ok    | 1.000     | 2.000     | 4.000     | 80.000    | 85.000    | 90.000    
FAN1             | 3200.000   | RPM        | ok    | 300.000   | 500.000   | 700.000   | 25300.000 | 25400.000 | 25500.000 
FAN2             | 1300.000   | RPM        | ok    | 300.000   | 500.000   | 700.000   | 25300.000 | 25400.000 | 25500.000
etc...
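
To pull a single reading out of that table (for logging next to the sysctl value, say), something along these lines should work. It is a sketch that assumes the pipe-separated layout shown above; `sensor_value` is a hypothetical helper name.

```sh
#!/bin/sh
# Read "ipmitool sensor" style output on stdin and print the reading
# (second column) for the sensor whose name matches $1 exactly.
sensor_value() {
    awk -F'|' -v name="$1" '
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim padding around the name
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim padding around the value
            if (key == name) { print val; exit }
        }'
}
```

For example: `ipmitool sensor | sensor_value "CPU Temp"`.
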

The only possibility I can think of is that I built a new VM in VirtualBox around 1st March and installed Ubuntu to do some testing (Joomla upgrade php5.x->php7.2), so I wonder if some microcode update could possibly have interfered with the readings from sysctl? Note that this VM was built about 3 weeks before the issue occurred, i.e. it was built around 1st March and the issue didn't appear until the last week of March.

No other graphs appear to be affected - only CPU Temp - but it affects both the monitoring graph and the status page.

Otherwise I'm at a complete loss as to why this has happened.

Will reboot it later today to see if it recovers or is permanently broken now. (Rebooting it is a pain as the startup regenerates the jail configs, and I have one jail that requires its own bridge, otherwise it routes wrongly through the host NIC, and that's not why I bought a 4x1GbE motherboard. Restarting the host takes barely minutes; getting the jails running and routing correctly again can take up to 2 hours of pure frustration [sigh]. vnet is (despite being there for years) still experimental, I guess.)
