Wednesday, February 26, 2014

Zabbix : Display network interface description in graph names


One very fustrating thing about Zabbix is the inability to display network interface's description in graph names. All your network equipment interfaces graphs look like :



Nice graph, hummm wait ! What the hell is "ge-0/0/8" ?
If you only have one or two switches it's fine but otherwise it becomes very confusing.

Since Zabbix 2.2, they added the possibitlity to interpret items in graph names which means you can add interface's descriptions to your graph names !

To add your interface's descriptions go to "Configuration -> Templates -> Template SNMP Interfaces -> Discovery -> Graph prototypes", choose the "Traffic graph" and use this string as graph's name :
 Traffic on interface {#SNMPVALUE} : {{HOSTNAME}:ifAlias[{#SNMPVALUE}].last(0)}  

The result is pure happiness :


You now have your interface's description in your graph names !

Hope that helps !

UPDATE : The same frustrating bug (yes, yes bug) is also present in triggers, by default you can't have the interface's description in your triggers... This post explains how to solve this issue and get happy !
http://sysnet-adventures.blogspot.fr/2014/04/zabbix-display-network-interface.html

Tuesday, February 25, 2014

Manage your backup retentions policies with retdo

You have folders with all your backups but storage capacity is starting to become low ? Need to clean up all these old files but still want to keep some of them just in case ? Then retdo is the perfect tool !

Retdo is a little script I wrote that allows administrators to clean up files on a custom retention basics.
Retdo can be used to implement production's backups retention plans.

retdo can resolve the following queries :

- I want to keep only one file per week if files are older than 3 months up to 6 months.
- I want to keep only one file per month if files are older than 6 months up to 1 year.
- I want files older than 1 year to be moved to another machine.
- I want a cup of tea (feature in progress)

Code and instructions are available for free at https://github.com/gcharot/retdo

Example : 

 Let's say I have my January daily backups in /data/backup/db/dbname :

#  ll /data/backup/db/dbname/   
 -rw-r--r-- 1 root root 0 Jan 1 12:00 jan01.tgz  
 -rw-r--r-- 1 root root 0 Jan 2 12:00 jan02.tgz  
 -rw-r--r-- 1 root root 0 Jan 3 12:00 jan03.tgz  
 -rw-r--r-- 1 root root 0 Jan 4 12:00 jan04.tgz  
 -rw-r--r-- 1 root root 0 Jan 5 12:00 jan05.tgz  
 -rw-r--r-- 1 root root 0 Jan 6 12:00 jan06.tgz  
 -rw-r--r-- 1 root root 0 Jan 7 12:00 jan07.tgz  
 -rw-r--r-- 1 root root 0 Jan 8 12:00 jan08.tgz  
 -rw-r--r-- 1 root root 0 Jan 9 12:00 jan09.tgz  
 -rw-r--r-- 1 root root 0 Jan 10 12:00 jan10.tgz  
 -rw-r--r-- 1 root root 0 Jan 11 12:00 jan11.tgz  
 -rw-r--r-- 1 root root 0 Jan 12 12:00 jan12.tgz  
 -rw-r--r-- 1 root root 0 Jan 13 12:00 jan13.tgz  
 -rw-r--r-- 1 root root 0 Jan 14 12:00 jan14.tgz  
 -rw-r--r-- 1 root root 0 Jan 15 12:00 jan15.tgz  
 -rw-r--r-- 1 root root 0 Jan 16 12:00 jan16.tgz  
 -rw-r--r-- 1 root root 0 Jan 17 12:00 jan17.tgz  
 -rw-r--r-- 1 root root 0 Jan 18 12:00 jan18.tgz  
 -rw-r--r-- 1 root root 0 Jan 19 12:00 jan19.tgz  
 -rw-r--r-- 1 root root 0 Jan 20 12:00 jan20.tgz  
 -rw-r--r-- 1 root root 0 Jan 21 12:00 jan21.tgz  
 -rw-r--r-- 1 root root 0 Jan 22 12:00 jan22.tgz  
 -rw-r--r-- 1 root root 0 Jan 23 12:00 jan23.tgz  
 -rw-r--r-- 1 root root 0 Jan 24 12:00 jan24.tgz  
 -rw-r--r-- 1 root root 0 Jan 25 12:00 jan25.tgz  
 -rw-r--r-- 1 root root 0 Jan 26 12:00 jan26.tgz  
 -rw-r--r-- 1 root root 0 Jan 27 12:00 jan27.tgz  
 -rw-r--r-- 1 root root 0 Jan 28 12:00 jan28.tgz  
 -rw-r--r-- 1 root root 0 Jan 29 12:00 jan29.tgz  
 -rw-r--r-- 1 root root 0 Jan 30 12:00 jan30.tgz  
 -rw-r--r-- 1 root root 0 Jan 31 12:00 jan31.tgz  

Now I need to free some space up so I'd like to keep only one file per week :

 # retdo -p /data/backup/db/dbname -r "*.tgz" -b 1 -e 92 -d 7  
 26 file(s) processed - 0 file(s) in error  
 # ll /data/backup/db/dbname  
 total 0  
 -rw-r--r-- 1 root root 0 Jan 5 12:00 jan05.tgz  
 -rw-r--r-- 1 root root 0 Jan 12 12:00 jan12.tgz  
 -rw-r--r-- 1 root root 0 Jan 19 12:00 jan19.tgz  
 -rw-r--r-- 1 root root 0 Jan 26 12:00 jan26.tgz  
 -rw-r--r-- 1 root root 0 Jan 31 12:00 jan31.tgz  

As you can see only one file per week (7 days) has been kept, 26 files were deleted.

This commands means  : "find all files matching regexp *.tgz in /data/backup/db/dbname which are older than 1 days up to 92 days (3 months) and keep only one file every week (7 days)"

Hope that helps !

Monday, February 24, 2014

Postfix : Show next deferred delivery attempt

Deferred mails happen, that's the hard life of the Internet. If you want to know when these mails will be requeued, you need to look at the Postfix's spool directory files.

First of all, get the mail's queue ID with the mailq, postqueue commands or the maillog file.
Once you have your queue ID type the following command :
 # find /var/spool/postfix/deferred/ -name QUEUE_ID -exec stat {} \;  

This will find your mail in the deferred spool and show its file's properties.
Postfix stamps mail's files with an access time in the future. This time is the time Postfix will requeue the mail for delivery.

Below a concrete example with the queue ID  3EC905800CB :
 # date  
 Mon Feb 24 18:32:27 CET 2014  
 # find /var/spool/postfix/deferred/ -name 7A5F2580101 -exec stat {} \;  
  File: `/var/spool/postfix/deferred/7/7A5F2580101'  
  Size: 28779      Blocks: 64     IO Block: 4096  regular file  
 Device: 811h/2065d   Inode: 5767425   Links: 1  
 Access: (0700/-rwx------) Uid: (  89/ postfix)  Gid: (  89/ postfix)  
 Access: 2014-02-24 19:22:30.000000000 +0100  
 Modify: 2014-02-24 19:22:30.000000000 +0100  
 Change: 2014-02-24 18:15:50.156029044 +0100  
 
As you can see the access/modify time is a time in the future, that means postfix will requeue the mail at 19:22:30.

Note : Requeuing doesn't mean that the mail will be sent at this exact time, it will depend on your active queue load.

Note 2: You can force a requeuing with the "postsuper -r" command

Friday, February 7, 2014

Smartctl : Linux disk I/O scheduler is reseted back to default's CFQ

Got a weird issue recently, I'm monitoring my SSD's life time with smartctl + Zabbix and realized that my scheduler settings are reseted each time smartctl was executed !

 # echo noop > /sys/block/sda/queue/scheduler  
   
 # cat /sys/block/sda/queue/scheduler  
 [noop] anticipatory deadline cfq  
   
 # smartctl -A --device=sat+megaraid,0 /dev/sda  
 smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.23.2.el6.x86_64] (local build)  
 Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net  
 === START OF READ SMART DATA SECTION ===  
 ...  
 ...  
   
 # cat /sys/block/sda/queue/scheduler  
 noop anticipatory deadline [cfq]  


There is no real solution, but you can work around by specifying  the generic SCSI name i.e "sgX "instead of sdX.

 # echo noop > /sys/block/sda/queue/scheduler  
   
 # cat /sys/block/sda/queue/scheduler  
 [noop] anticipatory deadline cfq  
   
 # smartctl -A --device=sat+megaraid,0 /dev/sg0  
 smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.23.2.el6.x86_64] (local build)  
 Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net  
 === START OF READ SMART DATA SECTION ===  
 ...  
 ...  
   
 # cat /sys/block/sda/queue/scheduler  
  [noop] anticipatory deadline cfq 

And voila ! Problem not really solved but that does the job !

You can use sg_map (part of the sg3_utils package) to check the sdX -> sgX mappings :

 # sg_map -a  
 /dev/sg0 /dev/sda  
 /dev/sg1 /dev/sdb  
 /dev/sg2 /dev/scd0  

Wednesday, February 5, 2014

Omreport fails : object not found

If you get the following message while using omreport :
 $ omreport chassis memory  
 Memory Information  
 Error : Memory object not found  
 $ omreport chassis hwperformance  
 Error! No Hardware Peformance probes found on this system.  

The first thing to do is to restart the srvadmin services :
 # srvadmin-services.sh restart  
 # service ipmi restart  

Check that the services are properly started.

If that doesn't solve the problem, you might have a semaphore issue. In my case Zabbix agent/scripts became nuts and didn't close its semaphores.

To list the current semaphore's arrays use the following command :
 # ipcs -s  

To show the current system limits
 # ipcs -sl  

You can use the following command to count the current number of semaphore's arrays
 # ipcs -us  

If you reached the system limit, it will certainly explain the omreport issue. From now on, you have two possibilities :

  • You've reached the limit because there is an issue on your system (semaphores not closed or whatever reason). You need to cleanup your semaphores with the following command :
 # ipcrm -s semaphore_id  
 To clean all semaphores from a particular user :  
 # ipcs -s | awk '/username/ {system("ipcrm -s" $2)}'   

Important : You need to stop attached process before removing the semaphores.
  • All your semaphores are legit, you need to increase the system limits :
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Setting_Semaphores-Setting_Semaphore_Parameters.html

Hope that helps !