About stats

I’m responsible for a few machines, in several locations, and only connected through the internet.

At some point in time I wanted to collect some stats from those machines, like the load and disk occupation. I could have used SNMP for that, however there are two drawbacks to that approach:

  1. if, for some reason, the internet connection (either mine or on the other side) goes down, then no stats are collected. My ISP performs quite some maintenance (or repairs depending on the way you look at it), and sometimes at night my internet connection is gone.
  2. my ISP doesn’t allow SNMP traffic (or so I suspect, since none of my SNMP GETs are getting through).

So I came up with another plan: have the remote machine collect the information itself, and send it to me via email, every 5 minutes. On my machine, some procmail recipe will automatically put that data into a MySQL table.

Collecting is quite simple: create a cronjob that executes a shell script every 5 minutes. The shell script could look like this:

#!/bin/sh

DATE=`date "+%Y-%m-%d %H:%M"`
SERVER=validmst04
MAILQ=`mailq | grep "Total requests" | awk '{print $3}'`
LOAD=`/sbin/sysctl vm.loadavg | tr '{}=' ' '`
LOAD1=`echo ${LOAD} | awk '{print $2}'`
LOAD5=`echo ${LOAD} | awk '{print $3}'`
LOAD15=`echo ${LOAD} | awk '{print $4}'`
DISKVAR=`df -h | grep '/var$' | awk '{print $5}' | tr '%' ' '`
DISKROOT=`df -h | grep '/$' | awk '{print $5}' | tr '%' ' '`
DISKUSR=`df -h | grep '/usr$' | awk '{print $5}' | tr '%' ' '`
DISKTMP=`df -h | grep '/tmp$' | awk '{print $5}' | tr '%' ' '`
DISKHOME=`df -h | grep '/home$' | awk '{print $5}' | tr '%' ' '`

/usr/sbin/sendmail my-email-address <<EOF
Subject: Validmst04 Stats

Key=SECRETKEY
data ${DATE} ${SERVER} mailq ${MAILQ}
data ${DATE} ${SERVER} load1 ${LOAD1}
data ${DATE} ${SERVER} load5 ${LOAD5}
data ${DATE} ${SERVER} load15 ${LOAD15}
data ${DATE} ${SERVER} diskvar ${DISKVAR}
data ${DATE} ${SERVER} diskroot ${DISKROOT}
data ${DATE} ${SERVER} diskusr ${DISKUSR}
data ${DATE} ${SERVER} disktmp ${DISKTMP}
data ${DATE} ${SERVER} diskhome ${DISKHOME}

EOF

OK, what happens: I have the script perform some queries, and put them into atomical variables. For instance, if I want the current load, I get them with sysctl vm.loadavg and remove some noise with tr. This works on both FreeBSD and OpenBSD (OSes I use). FreeBSD displays like this: ‘vm.loadavg: { 0.00 0.00 0.00 }’ and OpenBSD displays like this: ‘vm.loadavg=0.12 0.11 0.08′, so in order to get rid of the ‘{‘,’}’ and ‘=’ I replace them with a blank. The individual values for 1, 5 and 15 minute average are then split by awk into LOAD1, LOAD5 and LOAD15.

For the partitions I do the same (however may vary per server, since the disk layout may be different). Get me the fifth field from an df output, strip the ‘%’ and you have the percentage of disk (=partition) space used.

With all those data, I then create an email (between <<EOF and EOF), give the email a subject (for diagnostic purposes), a secret key, and all the data, preceded by ‘data’.

The secret key has two distinct purposes (obviously this is not the key I use):

  1. to keep people from poluting my database, that’s why I don’t tell the real key here
  2. to create a specific recipe in procmail, to perform its doings only on emails I intended it to do it on

That’s all needed on the remote side. You could easily expand the collected data with more if you wanted to. But I took a different approach for that. I will tell that in the next article.

If I would stop here, I would be collecting a lot of email, 12 per hour per machine, and in my case that would be 1728 emails per 24 hours (6 machines, but you already figured that out, I guess).

My server runs MySQL, a great and scalable SQL server. In one of my databases I created a table called stats, and this table looks like this:

describe stats;
+--------+-------------+------+-----+---------+----------------+
| Field  | Type        | Null | Key | Default | Extra          |
+--------+-------------+------+-----+---------+----------------+
| id     | int(11)     | NO   | PRI | NULL    | auto_increment |
| date   | datetime    | NO   | MUL | NULL    |                |
| server | varchar(25) | NO   | MUL | NULL    |                |
| type   | varchar(15) | NO   | MUL | NULL    |                |
| value  | float       | YES  |     | NULL    |                |
+--------+-------------+------+-----+---------+----------------+

Nothing much to tell, you have your date, server name (to make some distinction per server), the type and the type’s value.

This type field is the really smart thing: I can add as much different data as I want, because (like I told before) for instance the disk layout could be different, or the function of the server (not all servers are mail servers). The type field would hold values like ‘load1′, ‘load5′, ‘mailq’, ‘diskvar’, etc. The value field holds the actual value for that type. You might notice that the fields date, server and type are part of a ‘MUL’ key: suppose your email gets delayed at some point, it is possible that the email is offered twice, and then you could get some weird results when querying your database. If you assure that the combination of those fields is unique, then you will not have this problem.

Now I have some machines sending email, and a table in MySQL, but how do I connect the two? That’s where procmail starts playing.

I you use procmail, you can easily perform all kind of actions on your incoming email, before putting it into your mailbox (and in this case, I don’t want these messages in my mailbox). So I wrote the following recipe:

:0Bb*
^Key=SECRETKEY
|/home/peter/bin/insert | mysql stats

Not very spectacular: I search the body (‘B’) of the incoming email for the string ‘^Key=SECRETKEY’ (you might recognize the regex ‘^’, which means the string has to be in the beginning of the line, and if it’s found run contents of the body of the email message through the program /home/peter/bin/insert and pipe that through the mysql command (and the database stats).

The insert script looks like this:

#!/usr/bin/awk -f

BEGIN {}

{
  if ($1 == "data")
    printf ("insert into stats values (null, \"%s %s\", \"%s\", \"%s\", %s);\n", $2, $3, $4, $5, $6);
}

You might remember that in the email, sent from the collecting server, I start each line with ‘data’. This awk script searches for that specific word in the first position (awk uses blanks or tabs as separator by default, unless told otherwise – the collecting servers separate the values by blank). If the first word is ‘data’, a printf command echoes a string that can be parsed by the mysql command. If you change the first word on the collecting server, you could easily create different insert command. My insert script looks like this:

#!/usr/bin/awk -f

BEGIN {}

{
  if ($1 == "data")  
    printf ("insert into stats values (null, \"%s %s\", \"%s\", \"%s\", %s);\n", $2, $3, $4, $5, $6);
  if ($1 == "userstats") 
    printf ("insert into userstats values (null, \"%s\", \"%s\", %f);\n", $2, $3, $4);
  if ($1 == "scopus")
    printf ("insert into scopus values (null, \"%s\", \"%s\", %f);\n", $2, $3, $4);
  if ($1 == "webmail")
    printf ("insert into webmail values (null, \"%s\", \"%s\", %f);\n", $2, $3, $4);
  if ($1 == "fileshare") 
    printf ("insert into fileshare values (null, \"%s\", \"%s\", %f);\n", $2, $3, $4);
  if ($1 == "teststats") 
    printf ("insert into t_userstats values (null, \"%s\", \"%s\", %f);\n", $2, $3, $4);
  if ($1 == "blacklist")  
    printf ("insert into blacklist values (null, \"%s\", \"%s\", %d);\n", $2, $3, $4);
  if ($1 == "blc")   
    printf ("insert into blc values (null, \"%s\", \"%s\", \"%s\", \"%s\");\n", $2, $3, $4, $5);
}

Before you start spamming me with all kind of data: you would have to know my secret key to activate this script and you don’t ;-) .

Now where done: if the secret key matches, and all scripts are in place (and the insert script is executable – mode 755), then your database will be filled. What do I have at this moment (when I write this)?

select server, count(*) from stats group by server;
+------------+----------+
| server     | count(*) |
+------------+----------+
| isis       |  1371653 |
| loghost    |   539175 |
| muth       |  1244103 |
| ra         |  1115613 |
| ramses     |   823648 |
| thoth      |    27684 |
| validmst03 |  1244621 |
| validmst04 |  1410799 |
+------------+----------+

7.7 million records, but then again I’m collecting quite some time. Two of the mentioned servers are not active anymore.

So how to query these?

I use gnuplot to create graphs for all of my machines, once every half an hour. For instance, if I wanted to show the diskspace used on my server ra for the last seven days, I would use this sql statement (in a shell script):

#!/bin/sh#

DATE=`date -v -7d +"%Y-%m-%d 00:00:00"`

/usr/local/bin/mysql -Be \
'select distinct date_format(date,"%Y%m%d-%H:%i") as date, \ 
sum(case when type="diskroot" then value else 0 end) as root, \       
sum(case when type="diskusr" then value else 0 end) as usr, \       
sum(case when type="diskvar" then value else 0 end) as var, \       
sum(case when type="disktmp" then value else 0 end) as tmp \       
from stats \       
where server="ra" \       
and date >= "'"$DATE"'" \       
group by date        
order by date' > /tmp/ra.disk

this creates a text file in /tmp, looking like this (truncated):

date    root    usr    var    tmp
20090607-00:00    12    54    34    0
20090607-00:05    12    54    34    0
20090607-00:10    12    54    34    0
20090607-00:15    12    55    34    0
20090607-00:20    12    54    34    0
20090607-00:25    12    54    34    0
20090607-00:30    12    54    34    0
20090607-00:35    12    54    34    0
20090607-00:40    12    54    34    0

my /tmp partition stays empty during those 40 minutes (not exactly, but remember the values art percentages, and a small text file like the one above will not add a spectacular amount to that), and nothing happens in the other partitions.

The graph is created by this gnuplot script:

set term png transparent nocrop enhanced font arial 8 size 600,480
set output "/usr/local/www/admin/htdocs/mailq/disk/ra.png"
unset log
unset label
set grid xtics ytics
set lmargin 10
set bmargin 5
set rmargin 2
set tmargin 2
set border 3
set xtics rotate nomirror auto
set ytics 10 nomirror
set yrange [0:100]
set mytics 2
set title "ra diskspace `date +"%Y-%m-%d"`"
set ylabel "% used"
set xdata time
set timefmt "%Y%m%d-%H:%M"
set format x "%d-%m"
set style fill solid 1.0
plot "/tmp/ra.disk" using 1:2 title col with steps , \
 "" using 1:3 title col with steps , \
 "" using 1:4 title col with steps , \
 "" using 1:5 title col with steps 

The result looks like this:

ra

More on some problems I encountered will be described in the next article…

This entry was posted on Monday, June 15th, 2009 at 7:52 pm and is filed under freebsd. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

Leave a Reply

Spam Protection by WP-SpamFree

Subscribe without commenting