/sys/net Adventures: Bugs

Showing posts with label Bugs. Show all posts

Friday, February 7, 2014

Smartctl : Linux disk I/O scheduler is reseted back to default's CFQ

Got a weird issue recently, I'm monitoring my SSD's life time with smartctl + Zabbix and realized that my scheduler settings are reseted each time smartctl was executed !

 # echo noop > /sys/block/sda/queue/scheduler  
   
 # cat /sys/block/sda/queue/scheduler  
 [noop] anticipatory deadline cfq  
   
 # smartctl -A --device=sat+megaraid,0 /dev/sda  
 smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.23.2.el6.x86_64] (local build)  
 Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net  
 === START OF READ SMART DATA SECTION ===  
 ...  
 ...  
   
 # cat /sys/block/sda/queue/scheduler  
 noop anticipatory deadline [cfq]

There is no real solution, but you can work around by specifying the generic SCSI name i.e "sgX "instead of sdX.

 # echo noop > /sys/block/sda/queue/scheduler  
   
 # cat /sys/block/sda/queue/scheduler  
 [noop] anticipatory deadline cfq  
   
 # smartctl -A --device=sat+megaraid,0 /dev/sg0  
 smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.23.2.el6.x86_64] (local build)  
 Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net  
 === START OF READ SMART DATA SECTION ===  
 ...  
 ...  
   
 # cat /sys/block/sda/queue/scheduler  
  [noop] anticipatory deadline cfq

And voila ! Problem not really solved but that does the job !

You can use sg_map (part of the sg3_utils package) to check the sdX -> sgX mappings :

 # sg_map -a  
 /dev/sg0 /dev/sda  
 /dev/sg1 /dev/sdb  
 /dev/sg2 /dev/scd0

Wednesday, February 5, 2014

Omreport fails : object not found

If you get the following message while using omreport :

 $ omreport chassis memory  
 Memory Information  
 Error : Memory object not found  
 $ omreport chassis hwperformance  
 Error! No Hardware Peformance probes found on this system.

The first thing to do is to restart the srvadmin services :

 # srvadmin-services.sh restart  
 # service ipmi restart

Check that the services are properly started.

If that doesn't solve the problem, you might have a semaphore issue. In my case Zabbix agent/scripts became nuts and didn't close its semaphores.

To list the current semaphore's arrays use the following command :

 # ipcs -s

To show the current system limits

 # ipcs -sl

You can use the following command to count the current number of semaphore's arrays

 # ipcs -us

If you reached the system limit, it will certainly explain the omreport issue. From now on, you have two possibilities :

You've reached the limit because there is an issue on your system (semaphores not closed or whatever reason). You need to cleanup your semaphores with the following command :

 # ipcrm -s semaphore_id  
 To clean all semaphores from a particular user :  
 # ipcs -s | awk '/username/ {system("ipcrm -s" $2)}'

Important : You need to stop attached process before removing the semaphores.

All your semaphores are legit, you need to increase the system limits :

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Setting_Semaphores-Setting_Semaphore_Parameters.html

Hope that helps !

Friday, October 11, 2013

Dell Firmware update fails with "mktemp: too many templates"

If you have the following error while updating a Dell server firmware (BIOS, RAID, etc) via Linux binary (*.BIN) :

 mktemp: too many templates

Then check the binary's filename for specials characters, in my case Chrome added a "(1)" at the end of the filename. Remove it, restart the update process and you're good to go !

Linux server sends SYNACK packet only after receiving 8 SYN

Got a really weird issue recently, in some rare case (mostly Mac and Mobile phone clients), the connection to a linux server was really really slow (about 12s).

The issue was not only impacting Apache but all TCP services like SSH, hence it was not a particular service issue/misconfiguration.

The Chrome console on a MacBook Pro showed that the initial connection took about 10s, on the other hand a Win7 client in the same LAN had no problem at all.

After some digging on the client and server side, I found out that the client needs to send 8 SYN packets before the server replies with a SYNACK which explain why the connexion is so slow. Once the SYNACK is send back to the client, the communication speed is back to normal.

One hour headache later, it turn out that I enabled some Sysctl TCP tunning values that somehow introduced the issue.

I disabled the net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse features and everything went back to normal.

I think the problem comes from the net.ipv4.tcp_tw_reuse option, but as the issue impacted a production service (and is really hard to reproduce) I didn't try to re-enable tcp_tw_recycle.

Some posts advice to disable window scaling, I strongly disencourage this as it would result in poor network performances.

Hope that helps !

Below the tcpdump output that shows the 8 client's SYN packets before the SYNACK is sent back. Test was performed on SSH service as you can see, the TCP handshake took 10 secondes.

 # SYN 1  
 15:57:26.303076 IP (tos 0x0, ttl 53, id 9488, offset 0, flags [DF], proto TCP (6), length 64)  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xdf5f (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835124724 ecr 0,sackOK,eol], length 0  
 # SYN 2  
 15:57:27.306416 IP (tos 0x0, ttl 53, id 37141, offset 0, flags [DF], proto TCP (6), length 64)  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xdb71 (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835125730 ecr 0,sackOK,eol], length 0  
 15:57:28.315804 IP (tos 0x0, ttl 53, id 2415, offset 0, flags [DF], proto TCP (6), length 64)  
 # SYN 3  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xd785 (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835126734 ecr 0,sackOK,eol], length 0  
 15:57:29.330233 IP (tos 0x0, ttl 53, id 62758, offset 0, flags [DF], proto TCP (6), length 64)  
 # SYN 4  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xd398 (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835127739 ecr 0,sackOK,eol], length 0  
 15:57:30.335779 IP (tos 0x0, ttl 53, id 29003, offset 0, flags [DF], proto TCP (6), length 64)  
 # SYN 5  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xcfa9 (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835128746 ecr 0,sackOK,eol], length 0  
 15:57:31.345254 IP (tos 0x0, ttl 53, id 5246, offset 0, flags [DF], proto TCP (6), length 64)  
 # SYN 6  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xcbba (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835129753 ecr 0,sackOK,eol], length 0  
 15:57:33.382242 IP (tos 0x0, ttl 53, id 5958, offset 0, flags [DF], proto TCP (6), length 64)  
 # SYN 7  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xc3dc (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835131767 ecr 0,sackOK,eol], length 0  
 15:57:37.881881 IP (tos 0x0, ttl 53, id 21274, offset 0, flags [DF], proto TCP (6), length 48)  
 # SYN 8  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0x5c3d (correct), seq 2356956535, win 65535, options [mss 1460,sackOK,eol], length 0  
 15:57:37.881907 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 48)  
 # SYNACK (at last !!!)  
   server_ip.ssh > client_ip.49316: Flags [S.], cksum 0x7a12 (correct), seq 3228952474, ack 2356956536, win 14600, options [mss 1460,nop,nop,sackOK], length 0  
 15:57:37.885362 IP (tos 0x0, ttl 53, id 62772, offset 0, flags [DF], proto TCP (6), length 40)  
 # ACK  
   client_ip.49316 > server_ip.ssh: Flags [.], cksum 0xdfde (correct), seq 1, ack 1, win 65535, length 0

Monday, September 2, 2013

Squirrel Mail crash with "PHP Fatal error: Call to undefined function sqimap_run_literal_command()"

I have encountered a bug on Squirrel Mail (version 1.4.8-21), the client's credentials are accepted but the page is then stucked on "webmail/src/redirect.php"

On the Apache side, I noticed the following PHP error :

  PHP Fatal error: Call to undefined function sqimap_run_literal_command() in /usr/share/squirrelmail/functions/imap_general.php on line 518

After some debugging, I found out that Squirrel Mail does NOT properly handle specials characters; In this case I had an accent in the user password.
I reseted the user's password and everything went back to normal.

Hope that helps !

Tuesday, May 21, 2013

DRAC Firmware update failed : Error: 30001 Method httpCgiErrorPage()

Have tried to update an old DRAC4 Firmware from firmware 1.5 to 1.75 via Linux binary and came to an unplaisant surprise :

 Dell Remote Access Controller 4/P  
 The version of this Update Package is newer than the currently installed version.  
 Software application name: Dell Remote Access Controller 4/P Firmware  
 Package version: 1.75  
 Installed version: 1.50  
 Continue? Y/N:Y  
 Executing update...  
 WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER DELL PRODUCTS WHILE UPDATE IS IN PROGRESS.  
 THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!  
 ......................................................................................
 /tmp/duptmp.xml:6: parser error : Extra content at the end of the document  
 <SVMExecution lang = "en">  
 ^  
 /tmp/.dellSP-XmlResult12908-32487.M19124:6: parser error : Extra content at the end of the document  
 <SVMExecution lang = "en">  
 ^  
 unable to parse /tmp/.dellSP-XmlResult12908-32487.M19124  
 /tmp/.dellSP-XmlResult12908-32487.M19124:6: parser error : Extra content at the end of the document  
 <SVMExecution lang = "en">  
 ^  
 unable to parse /tmp/.dellSP-XmlResult12908-32487.M19124

Doesn't look good and of course if I try to access the DRAC via HTTPs, I've got a nice CGI error :

 Error: 30001 Method httpCgiErrorPage()

Looked on the web and somebody (who contacted Dell Support) advises to shutdown the server, unplug the DRAC card for a while and plug it in back... Well explain to your CTO that you need to shutdown a production server, unrack it, unplug a card just because a DRAC update failed o_O
Reference: http://lists.us.dell.com/pipermail/linux-poweredge/2008-January/034556.html

The solution that worked for me was to install the racadm Dell tool on my bastion and reset the firmware remotely.

First install racadm :

 # yum install srvadmin-racadm4.x86_64

Note : This is for DRAC4, didn't had the issue with newer DRAC.
Note 2 : You need to have the Dell OSMA repository installed on your server:
http://www.openfusion.net/linux/dell_omsa

Then run the following command :

 # racadm -rDRAC_IP -i racreset

Note : Change DRAC_IP with your DRAC IP.
Note 2 : This operation will NOT erase your DRAC configuration.

Wait a while, pray, and if you're lucky as me you should be back on line (with the original firmware version of course).

Final word, I stopped being lazy and updated the firmware via the Web GUI which is a long and annoying process. Of course I used Internet Explorer as I felt like Murphy's law was around this day ^^

Hope that helps !

Monday, April 29, 2013

Shrew VPN Client + Juniper SRX : "session terminated by gateway" (Autodisconnect)

If like me, you're trying to connect to a Juniper dynamic VPN with Shrew VPN Client, be aware that this not yet possible.

The connection works but the tunnel is constantly disconnected after 60 seconds.

I asked the core developer "Matthiew Grooms" about this and after few debug, it seems like a fix is needed in Shrew's code:

"It's pretty clear whats going on but it won't be possible to fix without 
a rewrite of the modecfg code on the Shrew Soft VPN client, which is probably 
needed anyway."

Full technical details are available at :
https://lists.shrew.net/pipermail/vpn-help/2012-December/014091.html

If anybody found out an alternative solution please share !

Thursday, March 28, 2013

Omreport doesn't update disk rebuild progress

Had to replace a hard drive on a Dell Server and omreport rebuild progress got stuck at 1%.

The solution is to restart the srvadmin service :

 # srvadmin-services.sh restart

This is quite dirty but it's the only solution I found. This also happened when I changed a PERC H700 battery.

Another way to check the rebuild process is to export log with omconfig :

 # omconfig storage controller action=exportlog controller=0

This creates a /var/log/lsi_MMDD.log file, with the rebuild progress :

 03/09/13 22:07:51: EVT#13296-03/09/13 22:07:51: 99=Rebuild complete on VD 01/1  
 03/09/13 22:07:51: EVT#13297-03/09/13 22:07:51: 100=Rebuild complete on PD 05(e0x20/s5)  
 03/09/13 22:07:51: EVT#13298-03/09/13 22:07:51: 114=State change on PD 05(e0x20/s5) from REBUILD(14) to ONLINE(18)  
 03/09/13 22:07:51: EVT#13299-03/09/13 22:07:51: 81=State change on VD 01/1 from DEGRADED(2) to OPTIMAL(3)  
 03/09/13 22:07:51: EVT#13300-03/09/13 22:07:51: 249=VD 01/1 is now OPTIMAL

Same thing for the battery learn cycle.

Hope that helps !

Tuesday, March 26, 2013

Dell Openmanage/Omreport failed after updating to CentOS 6.4

After updating a testing machine from CentOS 6.3 to 6.4, the Dell OpenManage tools stopped working AT ALL.
It seems that with the lastest CentOS kernel (2.6.32-358.2.1.el6.x86_64), they moved away some IPMI drivers from kernel modules to "built-in"

The result is :

 # omreport chassis  
 Health   
 # srvadmin-services.sh start  
 Starting Systems Management Device Drivers:  
 Starting dell_rbu:                     [ OK ]  
 Starting ipmi driver:                   [FAILED]  
 Starting Systems Management Device Drivers:  
 Starting dell_rbu: Already started             [ OK ]  
 Starting ipmi driver:                   [FAILED]  
 Starting DSM SA Shared Services:              [ OK ]  
 /var/log/messages reports :   
 instsvcdrv: /etc/rc.d/init.d//dsm_sa_ipmi start command failed with status 1

Solution :

 # yum install OpenIPMI

Note : There is no need to start or chkconfig the service.

You can check that the IPMI components are seen with the following command :

 # service ipmi status  
 ipmi_msghandler module in kernel.  
 ipmi_si module in kernel.  
 ipmi_devintf module loaded.  
 /dev/ipmi0 exists.

Then start Openmanager services :

 # srvadmin-services.sh start