
Friday, February 7, 2014

Smartctl : Linux disk I/O scheduler is reset back to the default CFQ

Got a weird issue recently: I'm monitoring my SSD's lifetime with smartctl + Zabbix and realized that my scheduler settings are reset every time smartctl is executed !

 # echo noop > /sys/block/sda/queue/scheduler  
   
 # cat /sys/block/sda/queue/scheduler  
 [noop] anticipatory deadline cfq  
   
 # smartctl -A --device=sat+megaraid,0 /dev/sda  
 smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.23.2.el6.x86_64] (local build)  
 Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net  
 === START OF READ SMART DATA SECTION ===  
 ...  
 ...  
   
 # cat /sys/block/sda/queue/scheduler  
 noop anticipatory deadline [cfq]  


There is no real solution, but you can work around it by specifying the generic SCSI device name, i.e. "sgX", instead of "sdX".

 # echo noop > /sys/block/sda/queue/scheduler  
   
 # cat /sys/block/sda/queue/scheduler  
 [noop] anticipatory deadline cfq  
   
 # smartctl -A --device=sat+megaraid,0 /dev/sg0  
 smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.23.2.el6.x86_64] (local build)  
 Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net  
 === START OF READ SMART DATA SECTION ===  
 ...  
 ...  
   
 # cat /sys/block/sda/queue/scheduler  
  [noop] anticipatory deadline cfq 

And voila ! The problem is not really solved, but that does the job !

You can use sg_map (part of the sg3_utils package) to check the sdX -> sgX mappings :

 # sg_map -a  
 /dev/sg0 /dev/sda  
 /dev/sg1 /dev/sdb  
 /dev/sg2 /dev/scd0  
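If you're scripting this (e.g. for a Zabbix item), you can resolve the sg device automatically. A minimal sketch, assuming the same sat+megaraid setup as above (the device name and megaraid ID are just examples) :
 # SG_DEV=$(sg_map -a | awk '$2 == "/dev/sda" {print $1}')  
 # smartctl -A --device=sat+megaraid,0 "$SG_DEV"  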

Wednesday, February 5, 2014

Omreport fails : object not found

If you get the following message while using omreport :
 $ omreport chassis memory  
 Memory Information  
 Error : Memory object not found  
 $ omreport chassis hwperformance  
 Error! No Hardware Peformance probes found on this system.  

The first thing to do is to restart the srvadmin services :
 # srvadmin-services.sh restart  
 # service ipmi restart  

Check that the services are properly started.
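For example (the srvadmin init script supports a status action, as far as I know) :
 # srvadmin-services.sh status  
 # service ipmi status  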

If that doesn't solve the problem, you might have a semaphore issue. In my case the Zabbix agent/scripts went nuts and didn't close their semaphores.

To list the current semaphore arrays, use the following command :
 # ipcs -s  

To show the current system limits :
 # ipcs -sl  

You can use the following command to count the current number of semaphore arrays :
 # ipcs -us  

If you have reached the system limit, that most likely explains the omreport issue. From there, you have two possibilities :

  • You've reached the limit because there is an issue on your system (semaphores not closed or whatever other reason). You need to clean up your semaphores with the following command :
 # ipcrm -s semaphore_id  
 To clean all semaphores from a particular user :  
 # ipcs -s | awk '/username/ {system("ipcrm -s " $2)}'  

Important : You need to stop the attached processes before removing the semaphores.
  • All your semaphores are legitimate; you need to increase the system limits (see the sketch below) :
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Setting_Semaphores-Setting_Semaphore_Parameters.html
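For reference, on RHEL/CentOS these limits are controlled by the kernel.sem sysctl (SEMMSL, SEMMNS, SEMOPM and SEMMNI, in that order). A minimal sketch; the increased values are purely illustrative, size them for your workload :
 # sysctl kernel.sem  
 kernel.sem = 250 32000 32 128  
 # echo "kernel.sem = 250 128000 32 512" >> /etc/sysctl.conf  
 # sysctl -p  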

Hope that helps !

Monday, September 2, 2013

Squirrel Mail crashes with "PHP Fatal error: Call to undefined function sqimap_run_literal_command()"

I encountered a bug with Squirrel Mail (version 1.4.8-21): the client's credentials are accepted but the page then gets stuck on "webmail/src/redirect.php".

On the Apache side, I noticed the following PHP error :
  PHP Fatal error: Call to undefined function sqimap_run_literal_command() in /usr/share/squirrelmail/functions/imap_general.php on line 518  

After some debugging, I found out that Squirrel Mail does NOT properly handle special characters; in this case I had an accent in the user's password.
I reset the user's password and everything went back to normal.

Hope that helps !


Friday, August 30, 2013

Bash : Wait for a command with timeout

Here is a very useful little command that waits for a process to finish and kills it if it doesn't exit within a predefined timeout.

The command is called timeout and is part of the coreutils package, which is installed by default on CentOS 6.

Usage is very simple: you just need to specify the timeout, the process to start and, optionally, the signal to send when the timeout expires.

Timeout returns the exit status of the process; if the timeout has been reached, the value 124 is returned.

Here are some examples on how to use timeout :

Successful process : 
 $ timeout 10s ls /tmp  
 ....  
 ....  
 $ echo $?  
 0  

Unsuccessful process :
 $ timeout 10s ls /blabla  
 ls: cannot access /blabla: No such file or directory  
 $ echo $?  
 2  

Timed out process :
 $ timeout 5s sleep 10  
 $ echo $?  
 124  
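You can also choose the signal sent on timeout with the -s option, e.g. SIGKILL instead of the default SIGTERM :
 $ timeout -s KILL 5s sleep 10  
Note that when the process is killed this way, the exit status may be reported as 137 (128+9) instead of 124, depending on the coreutils version.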

More details in the man page.

Hope that helps !

Tuesday, July 16, 2013

Cobbler reposync failed

I run daily cobbler reposync cron jobs and it appears that the process sometimes fails with the following error :
 Exception occured: <class 'cobbler.cexceptions.CX'>  
 Exception value: 'cobbler reposync failed'  
 Exception Info:  
  File "/usr/lib/python2.6/site-packages/cobbler/utils.py", line 126, in die  
   raise CX(msg)  
   
 Exception occured: <class 'cobbler.cexceptions.CX'>  
 Exception value: 'cobbler reposync failed'  
 Exception Info:  
  File "/usr/lib/python2.6/site-packages/cobbler/action_reposync.py", line 125, in run  
   self.sync(repo)  
   File "/usr/lib/python2.6/site-packages/cobbler/action_reposync.py", line 169, in sync  
   return self.yum_sync(repo)  
   File "/usr/lib/python2.6/site-packages/cobbler/action_reposync.py", line 402, in yum_sync  
   utils.die(self.logger,"cobbler reposync failed")  
   File "/usr/lib/python2.6/site-packages/cobbler/utils.py", line 134, in die  
   raise CX(msg)  
...
!!! TASK FAILED !!!

I don't really have an explanation for this, but it seems that the reposync process doesn't actually fail: if I run the reposync command manually, it completes fine.

The reposync command line is shown when you execute "cobbler reposync", for example :
 hello, reposync  
 run, reposync, run!  
 creating: /var/www/cobbler/repo_mirror/Dell-CentOS5/.origin/Dell-CentOS5.repo  
 running: /usr/bin/reposync -l -m -d --config=/var/www/cobbler/repo_mirror/Dell-CentOS5/.origin/Dell-CentOS5.repo --repoid=Dell-CentOS5 --download_path=/var/www/cobbler/repo_mirror  

In this case the command line is :
 reposync -l -m -d --config=/var/www/cobbler/repo_mirror/Dell-CentOS5/.origin/Dell-CentOS5.repo --repoid=Dell-CentOS5 --download_path=/var/www/cobbler/repo_mirror  

The process itself appears to go fine; however, its return value ($?) is 1, which probably explains why the cobbler command fails.
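Since the failure looks transient, a simple workaround is to wrap the cron job in a small retry loop. A minimal sketch (script name, retry count and delay are arbitrary) :
 #!/bin/bash  
 # reposync-retry.sh : retry cobbler reposync a few times before giving up  
 for i in 1 2 3; do  
   cobbler reposync && exit 0  
   sleep 60  
 done  
 echo "cobbler reposync still failing after 3 attempts" >&2  
 exit 1  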

Monday, July 15, 2013

Emulate bad or WAN network performance for a particular IP on a Gigabit LAN network

If you're developing Web or mobile applications, you'll certainly be confronted with poor network conditions.

The problem then is "how can I test my application under bad network conditions?". Well, you could rent a foreign internet connection or use tools that report performance from various remote countries; however, this is not a good debugging environment.

The solution is to use TC and NetEM on your front development server (typically a Web or reverse proxy server), then use filters so only one client station (the debugging station) is impacted.
Don't forget to use filters, otherwise all your clients will be impacted.

Below an example on how to emulate a network with :
  • 1Mbps bandwidth
  • 400ms delay
  • 5% packet loss
  • 1% corrupted packets
  • 1% duplicated packets
The debugging client IP is 192.168.0.42 (i.e. the IP impacted by the bad network performance).
The following commands need to be executed on the front development server; please set the appropriate NIC for your environment (eth0 is used below) :
 # Clean up rules  
   
 tc qdisc del dev eth0 root  
   
 # root htb init 1:  
   
 tc qdisc add dev eth0 handle 1: root htb  
   
 # Create class 1:42 with 1Mbps bandwidth (careful: in tc units "Mbps" means megabytes/s; use "mbit" if you want megabits/s)  
   
 tc class add dev eth0 parent 1:1 classid 1:42 htb rate 1Mbps  
   
 # Set network degradations on class 1:42  
   
 tc qdisc add dev eth0 parent 1:42 handle 30: netem loss 5% delay 400ms duplicate 1% corrupt 1%  
   
 # Filter class 1:42 to 192.168.0.42 only (match destination IP)  
   
 tc filter add dev eth0 protocol ip prio 1 u32 match ip dst 192.168.0.42 flowid 1:42  
   
 # Filter class 1:42 to 192.168.0.42 only (match source IP)  
   
 tc filter add dev eth0 protocol ip prio 1 u32 match ip src 192.168.0.42 flowid 1:42  

To check that the rules are properly set, use the following commands :
 tc qdisc show dev eth0  
 tc class show dev eth0  
 tc filter show dev eth0  

Once you're done with the testing, clean up the rules with the command :
 tc qdisc del dev eth0 root   


There are many other options you can use (correlation, distribution, packet reordering, etc.); please check the documentation available at :

http://www.linuxfoundation.org/collaborate/workgroups/networking/netem

If this setup fits your requirements, I advise you to create a shell script so you can start/stop the rules with custom values; a sketch of such a script is shown below. Be aware that you can also build filters based on source/destination ports, etc.
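Here is a minimal sketch of such a wrapper (interface, client IP and degradation values are purely illustrative, adjust them to your environment; note that the rate is expressed as 1mbit, i.e. one megabit per second) :
 #!/bin/bash  
 # netem-debug.sh start|stop : apply/remove degraded network conditions for one client IP  
 DEV=eth0  
 CLIENT=192.168.0.42  
 case "$1" in  
  start)  
   tc qdisc add dev $DEV handle 1: root htb  
   tc class add dev $DEV parent 1: classid 1:42 htb rate 1mbit  
   tc qdisc add dev $DEV parent 1:42 handle 30: netem loss 5% delay 400ms duplicate 1% corrupt 1%  
   tc filter add dev $DEV protocol ip prio 1 u32 match ip dst $CLIENT flowid 1:42  
   tc filter add dev $DEV protocol ip prio 1 u32 match ip src $CLIENT flowid 1:42  
   ;;  
  stop)  
   tc qdisc del dev $DEV root  
   ;;  
  *)  
   echo "Usage: $0 start|stop"  
   exit 1  
   ;;  
 esac  
Then simply run "./netem-debug.sh start" before testing and "./netem-debug.sh stop" when you're done.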

If you have more complex requirements, you can try WANem, which is a live Linux distribution with a graphical interface on top of NetEM. Be aware that this requires route modifications on your client and server (or some other routing tricks).

http://wanem.sourceforge.net/
http://sourceforge.net/projects/wanem/files/Documents/WANemv11-Setup-Guide.pdf

I haven't had the opportunity to try it; please let me know if you have any feedback.

Wednesday, May 22, 2013

omreport : failed to load external entity "/opt/dell/srvadmin/var/lib/openmanage/xslroot//oma/cli/about.xsl"

If you're getting the following error when executing omreport :
 I/O warning : failed to load external entity "/opt/dell/srvadmin/var/lib/openmanage/xslroot//oma/cli/about.xsl"  
 error  
 xsltParseStylesheetFile : cannot parse /opt/dell/srvadmin/var/lib/openmanage/xslroot//oma/cli/about.xsl  
 Error! XML Transformation failed  

Then install the srvadmin-omcommon package :
 # yum install srvadmin-omcommon  

Tuesday, May 21, 2013

DRAC Firmware update failed : Error: 30001 Method httpCgiErrorPage()

I tried to update an old DRAC4 from firmware 1.5 to 1.75 via the Linux binary and got an unpleasant surprise :
 Dell Remote Access Controller 4/P  
 The version of this Update Package is newer than the currently installed version.  
 Software application name: Dell Remote Access Controller 4/P Firmware  
 Package version: 1.75  
 Installed version: 1.50  
 Continue? Y/N:Y  
 Executing update...  
 WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER DELL PRODUCTS WHILE UPDATE IS IN PROGRESS.  
 THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!  
 ......................................................................................
 /tmp/duptmp.xml:6: parser error : Extra content at the end of the document  
 <SVMExecution lang = "en">  
 ^  
 /tmp/.dellSP-XmlResult12908-32487.M19124:6: parser error : Extra content at the end of the document  
 <SVMExecution lang = "en">  
 ^  
 unable to parse /tmp/.dellSP-XmlResult12908-32487.M19124  
 /tmp/.dellSP-XmlResult12908-32487.M19124:6: parser error : Extra content at the end of the document  
 <SVMExecution lang = "en">  
 ^  
 unable to parse /tmp/.dellSP-XmlResult12908-32487.M19124  

This doesn't look good, and of course when I tried to access the DRAC via HTTPS, I got a nice CGI error :
 Error: 30001 Method httpCgiErrorPage()  

I looked on the web and somebody (who had contacted Dell Support) advised shutting down the server, unplugging the DRAC card for a while and plugging it back in... Well, explain to your CTO that you need to shut down a production server, unrack it and unplug a card just because a DRAC update failed o_O
Reference: http://lists.us.dell.com/pipermail/linux-poweredge/2008-January/034556.html

The solution that worked for me was to install the racadm Dell tool on my bastion host and reset the DRAC remotely.

  • First install racadm :
 # yum install srvadmin-racadm4.x86_64  
Note : This is for DRAC4; I didn't have this issue with newer DRACs.
Note 2 : You need to have the Dell OMSA repository installed on your server:
http://www.openfusion.net/linux/dell_omsa

  •  Then run the following command :
 # racadm -rDRAC_IP -i racreset  
Note : Replace DRAC_IP with your DRAC IP address.
Note 2 : This operation will NOT erase your DRAC configuration.
  •  Wait a while, pray, and if you're as lucky as me you should be back online (with the original firmware version, of course).
As a final word, I stopped being lazy and updated the firmware via the Web GUI, which is a long and annoying process. Of course I used Internet Explorer, as I felt like Murphy's law was hanging around that day ^^

Hope that helps !

Yum stuck/hangs at "Running Transaction Test"

If yum is stuck at the "Running Transaction Test" step, double check that you don't have a stalled network mount (NFS, SMB, etc.) somewhere.

Unmount it and retry your yum/rpm command.
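A quick way to spot a stalled mount is to stat every network mount point with a timeout; a rough sketch (SIGKILL is used because a process blocked on a dead NFS mount usually ignores SIGTERM, and even that may not always work) :
 # for m in $(awk '$3 ~ /nfs|cifs/ {print $2}' /proc/mounts); do timeout -s KILL 5s stat "$m" > /dev/null || echo "$m looks stalled"; done  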

More info on how to umount a stalled NFS share :
http://sysnet-adventures.blogspot.fr/2013/05/umount-stalledfrozen-nfs-mount-point.html

Thursday, March 28, 2013

Omreport doesn't update disk rebuild progress

I had to replace a hard drive on a Dell server and the omreport rebuild progress got stuck at 1%.

The solution is to restart the srvadmin services :

 # srvadmin-services.sh restart  

This is quite dirty but it's the only solution I found. This also happened when I changed a PERC H700 battery.

Another way to check the rebuild progress is to export the controller log with omconfig :

 # omconfig storage controller action=exportlog controller=0  

This creates a /var/log/lsi_MMDD.log file, with the rebuild progress :

 03/09/13 22:07:51: EVT#13296-03/09/13 22:07:51: 99=Rebuild complete on VD 01/1  
 03/09/13 22:07:51: EVT#13297-03/09/13 22:07:51: 100=Rebuild complete on PD 05(e0x20/s5)  
 03/09/13 22:07:51: EVT#13298-03/09/13 22:07:51: 114=State change on PD 05(e0x20/s5) from REBUILD(14) to ONLINE(18)  
 03/09/13 22:07:51: EVT#13299-03/09/13 22:07:51: 81=State change on VD 01/1 from DEGRADED(2) to OPTIMAL(3)  
 03/09/13 22:07:51: EVT#13300-03/09/13 22:07:51: 249=VD 01/1 is now OPTIMAL  

Same thing for the battery learn cycle.
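If you just want to follow the rebuild percentage without reading the whole file, a rough one-liner like this does the trick (same export command and log naming as above) :
 # omconfig storage controller action=exportlog controller=0  
 # grep -i rebuild /var/log/lsi_*.log | tail -5  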

Hope that helps !

Tuesday, March 26, 2013

Dell Openmanage/Omreport failed after updating to CentOS 6.4

After updating a testing machine from CentOS 6.3 to 6.4, the Dell OpenManage tools completely stopped working.
It seems that with the latest CentOS kernel (2.6.32-358.2.1.el6.x86_64), some IPMI drivers were moved from kernel modules to "built-in".

The result is :

 # omreport chassis  
 Health   
 # srvadmin-services.sh start  
 Starting Systems Management Device Drivers:  
 Starting dell_rbu:                     [ OK ]  
 Starting ipmi driver:                   [FAILED]  
 Starting Systems Management Device Drivers:  
 Starting dell_rbu: Already started             [ OK ]  
 Starting ipmi driver:                   [FAILED]  
 Starting DSM SA Shared Services:              [ OK ]  
 /var/log/messages reports :   
 instsvcdrv: /etc/rc.d/init.d//dsm_sa_ipmi start command failed with status 1  

Solution : 

 # yum install OpenIPMI  

Note : There is no need to start or chkconfig the service.

You can check that the IPMI components are seen with the following command :

 # service ipmi status  
 ipmi_msghandler module in kernel.  
 ipmi_si module in kernel.  
 ipmi_devintf module loaded.  
 /dev/ipmi0 exists.  
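If you want to confirm which IPMI drivers are built into the running kernel (=y) rather than shipped as modules (=m), you can also check the kernel config under /boot :
 # grep IPMI /boot/config-$(uname -r)  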

Then start the OpenManage services :
 # srvadmin-services.sh start