Monitoring temperatures with telegraf

29 Aug 2022 15:04 telegraf

One of my NUCs keeps hanging, and I suspect temperature. I wanted to use collectd to gather the temperature sensor values, but it was removed from Ubuntu 22.04. Instead, I’m going to see if I can get telegraf to do it instead.

Background

I have a NUC in the living room, running Emby Server. It keeps dying. The Emby client can’t connect; I can’t SSH to it; I can’t ping it; attempting to wake it up with the keyboard does nothing.

I suspect that it might be a temperature problem, and I have monit set up to keep an eye on that, but monit only alerts at given thresholds; it doesn’t report ongoing values.

So I wanted to install collectd and have it record the temperature every few minutes or so. Rather than report to graphite/grafana (because my k3s cluster is turned off right now), I planned to have it write to local RRD files, and I’d collate them later.

Unfortunately, I upgraded the NUC to Ubuntu 22.04 yesterday, and collectd is no longer in the default repository. So I’m going to play with telegraf instead.

It’s part of the InfluxDB family, but I’m going to see if I can get away without installing InfluxDB just yet (because, again, my k3s cluster is turned off).

Installation

sudo apt update && sudo apt install telegraf

Edit the configuration file at /etc/telegraf/telegraf.config. To enable the sensors plugin, uncomment the [[inputs.sensors]] section. The defaults are probably fine. Similarly, uncomment the [[outputs.file]] section.

Test the configuration

$ telegraf --config /etc/telegraf/telegraf.config --input-filter sensors --test
2022-08-29T15:17:22Z I! Starting Telegraf 1.21.4+ds1-0ubuntu2
2022-08-29T15:17:22Z I! Loaded inputs: sensors
2022-08-29T15:17:22Z I! Loaded aggregators:
2022-08-29T15:17:22Z I! Loaded processors:
2022-08-29T15:17:22Z W! Outputs are not used in testing mode!
2022-08-29T15:17:22Z I! Tags enabled: host=roger-nuc
> sensors,chip=coretemp-isa-0000,feature=package_id_0,host=roger-nuc temp_crit=100,temp_crit_alarm=0,temp_input=49,temp_max=100 1661786243000000000
> sensors,chip=coretemp-isa-0000,feature=core_0,host=roger-nuc temp_crit=100,temp_crit_alarm=0,temp_input=45,temp_max=100 1661786243000000000
> sensors,chip=coretemp-isa-0000,feature=core_1,host=roger-nuc temp_crit=100,temp_crit_alarm=0,temp_input=47,temp_max=100 1661786243000000000
> sensors,chip=pch_skylake-virtual-0,feature=temp1,host=roger-nuc temp_input=38.5 1661786243000000000
> sensors,chip=acpitz-acpi-0,feature=temp1,host=roger-nuc temp_input=-263.2 1661786243000000000
> sensors,chip=nvme-pci-3c00,feature=composite,host=roger-nuc temp_alarm=0,temp_crit=84.85,temp_input=44.85,temp_max=79.85,temp_min=-5.15 1661786243000000000

Restart the daemon

sudo service telegraf restart

Is it working?

$ tail -f /tmp/metrics.out
processes,host=roger-nuc total_threads=502i,idle=68i,zombies=2i,running=0i,unknown=0i,dead=0i,paging=0i,blocked=0i,stopped=0i,sleeping=160i,total=230i 1661787590000000000
sensors,chip=coretemp-isa-0000,feature=package_id_0,host=roger-nuc temp_max=100,temp_crit=100,temp_crit_alarm=0,temp_input=39 1661787590000000000

Now we wait

Now we wait. The next time the NUC freezes, I should be able to restart it and then take a look at the /tmp/metrics.out output file to see if there’s anything obviously wrong.

Follow-up

Like a complete idiot, I forgot that /tmp would be wiped on restart, so there was nothing useful in the output file.

On the other hand, it was spamming /var/log/syslog, so there was a record of the metrics in there. It’s not a temperature thing.

To turn off the syslog spam (which is due to stdout -> systemd -> syslog), and to fix the /tmp thing, edit /etc/telegraf/telegraf.con (as root):

[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["/var/log/telegraf/metrics.out"]

That is: remove stdout; write to somewhere more persistent. You might need to create the directory first (I didn’t). If you do, it’s:

sudo mkdir /var/log/telegraf/
sudo chown _telegraf:_telegraf /var/log/telegraf/
sudo chmod 755 /var/log/telegraf/

Don’t forget to restart the daemon:

sudo service telegraf restart

References