/sys/net Adventures: Network

Showing posts with label Network. Show all posts

Monday, April 14, 2014

Zabbix : Create a production network interface trigger

Following my two previous posts on how to add interface's description in Zabbix graphs [1] and triggers [2], I will finish this serie of Zabbix posts with the creation of a production interface trigger.

By default Zabbix includes the "Operational status was changed ..." trigger which is (from my opinion) a big joke :

The trigger disappears (status "OK") after the next ifOperStatus check (60 seconds by default)
The trigger is raised when an equipment is plugged in. This is a "good to know information" but I can't rise a high severity trigger each time something is plugged !
I can't tell if the interface was up and went down OR if the interface was down and went up.
If I want to have a "Something was plugged in on GEX/X/X" trigger, I would make a special trigger for that purpose.
The trigger doesn't include the interface's description (which is extremely irritating and makes me want to kill little kittens). Check my previous post [2] if you care about kitten's survival.

This new trigger will have the following properties :

Raise ONLY if the interface was up (something was plugged in) and went down (equipment stopped, interface shut or somebody removed the cable).
Will disappear if the interface come back up.
A "high" severity and will include interface's description.

Go to "Configuration -> Templates -> Template SNMP Interfaces -> Discovery -> Trigger prototypes" and click on "Create trigger prototype".

Use the following line as trigger's name :

 Production Interface status on {HOST.HOST}: {#SNMPVALUE}, {ITEM.VALUE2} : {ITEM.VALUE3}

Use this as trigger's expression :

 {Template SNMP Interfaces:ifOperStatus[{#SNMPVALUE}].avg(3600)}<2&{Template SNMP Interfaces:ifOperStatus[{#SNMPVALUE}].last(0)}=2&{Template SNMP Interfaces:ifAlias[{#SNMPVALUE}].str(this_does_not_exist)}=0

This expression means, raise if interface was up "avg(3600)}<2" AND went down "last(0)}=2". The 3600 value specify how long the trigger will stay up; After 3600s "avg(3600)" will equals 2 and the trigger will disappear.
The .str(this_does_not_exist)}=0 expression is used to show the interface's description and is explained in my previous post [2].

Use this as trigger's description :

 Interface status went up to down !!!  
 Interface : {#SNMPVALUE}, {ITEM.VALUE1} = {ITEM.VALUE3}

Set the severity to "high" (or whatever is your concern), you can override severity for each of your interface/equipement.

Wait until the discovery rule is refreshed (default is 3600s) or temporarily set it to 60s. We can now try to disable an interface to check the results, let's do this on bccsw02 ge/0/0/3 :

The trigger is raised as expected with the hostname, interface name and description, if you configured Zabbix actions, the alert message will look like

"Production Interface status ev-bccsw02: ge-0/0/3, down (2) : EV-ORADB01 - BACK_PROD"

Let's renable the interface :

Trigger goes green as the interface went up, you should receive a message saying :

"Production Interface status ev-bccsw02: ge-0/0/3, up (1) : EV-ORADB01 - BACK_PROD"

Be aware that you can also use SNMP traps for that purpose.

Hope that helps !

[1] : http://sysnet-adventures.blogspot.fr/2014/02/zabbix-display-network-interface.html
[2] : http://sysnet-adventures.blogspot.fr/2014/04/zabbix-display-network-interface.html

Zabbix : Display network interface description in triggers

In a previous post [1], I explained how to solve a very fustrating thing about Zabbix : "How add network interface's description in your graph names."

In this post, I'll explain how to fix another very fustrating thing about Zabbix : "How to add network interface description in your trigger names"

Zabbix has a default interface trigger which is raised when an interface status changes.
Good thing it would have been if we didn't have the same issue we had with the graphs; you don't have the interface description neither in the trigger's name nor in the comment. This is very annonying, especially if you receive alerts during the night.

Below an example of the default Zabbix trigger alert :

Seems like Ge1 operational status changed, good to know, but again what the hell is "ge1" ???
Message to Zabbix team : Do you really think I learnt all my switches port allocations by heart ???

The good news here is you can solve this stupidity with à "crafty" trick !

Trigger names/descriptions don't interpret items so using the "Zabbix Graph" trick [1] won't work...
To get your interface's description, you'll need to insert a "interface alias" item (ifAlias) in your trigger expression and reference it in the trigger name with the Zabbix standard macro "{ITEM.VALUEX}"

Go to "Configuration -> Templates -> Template SNMP Interfaces -> Discovery -> Trigger prototypes"

You should have a trigger named "Operational status was changed on {HOST.NAME} interface {#SNMPVALUE}" which matches the screenshot above.
To get the interface description, we first add a trigger expression that checks if the interface alias (i.e description) equals (str() function) a string that will NEVER match for example "this_does_not_exist" :

 {Template SNMP Interfaces:ifAlias[{#SNMPVALUE}].str(this_does_not_exist)}=0

This line means, the network interface description is NOT "this_does_not_exist" which is always true. Finally we add an AND operator (&) between the original expression and the string comparison which gives us the final trigger expression :

 {Template SNMP Interfaces:ifOperStatus[{#SNMPVALUE}].diff(0)}=1&{Template SNMP Interfaces:ifAlias[{#SNMPVALUE}].str(this_does_not_exist)}=0

EDIT: From an user's comment, it appears that newer versions of Zabbix require to replace the "&" sign by an "AND" string.

This line means there were a interface operational status change AND the interface's alias is NOT "this_does_not_exist".
This alias comparaison is just a trick so we can reference the interface's alias (i.e description) with the "{ITEM.VALUEX}" standard macro.

Now change the trigger name with the following string :

  Operational status was changed on {HOST.NAME} interface {#SNMPVALUE} : {ITEM.VALUE2}

As you can see, I added the macro {ITEM.VALUE2} that returns the name of the second item in the trigger's expression which is, you guessed it, the interface alias !

Wait until the discovery rule is refreshed (default is 3600s) or temporarily set it to 60s and enjoy the happiness of the result :

You can also use the {ITEM.VALUE2} macro in the trigger's description, very handy if you want to include additional information for the on-call guy.

In the next post [2], I'll show how to create a real interface trigger; from my point of view this default trigger is completely useless :

The trigger disappears after the next ifOperStatus check (60 seconds by default)
The trigger is raised when an equipment is plugged in. This is a "good to know information" but I can't rise a high severity trigger each time something is plugged !
I can't tell if the interface was up and went down OR if the interface was down and went up.
If I want to have a "Something was plugged in on GEX/X/X" trigger, I would make a special trigger for that purpose.

[1] http://sysnet-adventures.blogspot.fr/2014/02/zabbix-display-network-interface.html
[2] http://sysnet-adventures.blogspot.fr/2014/04/zabbix-create-production-network.html

Wednesday, February 26, 2014

Zabbix : Display network interface description in graph names

One very fustrating thing about Zabbix is the inability to display network interface's description in graph names. All your network equipment interfaces graphs look like :

Nice graph, hummm wait ! What the hell is "ge-0/0/8" ?
If you only have one or two switches it's fine but otherwise it becomes very confusing.

Since Zabbix 2.2, they added the possibitlity to interpret items in graph names which means you can add interface's descriptions to your graph names !

To add your interface's descriptions go to "Configuration -> Templates -> Template SNMP Interfaces -> Discovery -> Graph prototypes", choose the "Traffic graph" and use this string as graph's name :

 Traffic on interface {#SNMPVALUE} : {{HOSTNAME}:ifAlias[{#SNMPVALUE}].last(0)}

The result is pure happiness :

You now have your interface's description in your graph names !

Hope that helps !

UPDATE : The same frustrating bug (yes, yes bug) is also present in triggers, by default you can't have the interface's description in your triggers... This post explains how to solve this issue and get happy !
http://sysnet-adventures.blogspot.fr/2014/04/zabbix-display-network-interface.html

Friday, October 11, 2013

Linux server sends SYNACK packet only after receiving 8 SYN

Got a really weird issue recently, in some rare case (mostly Mac and Mobile phone clients), the connection to a linux server was really really slow (about 12s).

The issue was not only impacting Apache but all TCP services like SSH, hence it was not a particular service issue/misconfiguration.

The Chrome console on a MacBook Pro showed that the initial connection took about 10s, on the other hand a Win7 client in the same LAN had no problem at all.

After some digging on the client and server side, I found out that the client needs to send 8 SYN packets before the server replies with a SYNACK which explain why the connexion is so slow. Once the SYNACK is send back to the client, the communication speed is back to normal.

One hour headache later, it turn out that I enabled some Sysctl TCP tunning values that somehow introduced the issue.

I disabled the net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse features and everything went back to normal.

I think the problem comes from the net.ipv4.tcp_tw_reuse option, but as the issue impacted a production service (and is really hard to reproduce) I didn't try to re-enable tcp_tw_recycle.

Some posts advice to disable window scaling, I strongly disencourage this as it would result in poor network performances.

Hope that helps !

Below the tcpdump output that shows the 8 client's SYN packets before the SYNACK is sent back. Test was performed on SSH service as you can see, the TCP handshake took 10 secondes.

 # SYN 1  
 15:57:26.303076 IP (tos 0x0, ttl 53, id 9488, offset 0, flags [DF], proto TCP (6), length 64)  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xdf5f (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835124724 ecr 0,sackOK,eol], length 0  
 # SYN 2  
 15:57:27.306416 IP (tos 0x0, ttl 53, id 37141, offset 0, flags [DF], proto TCP (6), length 64)  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xdb71 (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835125730 ecr 0,sackOK,eol], length 0  
 15:57:28.315804 IP (tos 0x0, ttl 53, id 2415, offset 0, flags [DF], proto TCP (6), length 64)  
 # SYN 3  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xd785 (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835126734 ecr 0,sackOK,eol], length 0  
 15:57:29.330233 IP (tos 0x0, ttl 53, id 62758, offset 0, flags [DF], proto TCP (6), length 64)  
 # SYN 4  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xd398 (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835127739 ecr 0,sackOK,eol], length 0  
 15:57:30.335779 IP (tos 0x0, ttl 53, id 29003, offset 0, flags [DF], proto TCP (6), length 64)  
 # SYN 5  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xcfa9 (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835128746 ecr 0,sackOK,eol], length 0  
 15:57:31.345254 IP (tos 0x0, ttl 53, id 5246, offset 0, flags [DF], proto TCP (6), length 64)  
 # SYN 6  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xcbba (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835129753 ecr 0,sackOK,eol], length 0  
 15:57:33.382242 IP (tos 0x0, ttl 53, id 5958, offset 0, flags [DF], proto TCP (6), length 64)  
 # SYN 7  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0xc3dc (correct), seq 2356956535, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 835131767 ecr 0,sackOK,eol], length 0  
 15:57:37.881881 IP (tos 0x0, ttl 53, id 21274, offset 0, flags [DF], proto TCP (6), length 48)  
 # SYN 8  
   client_ip.49316 > server_ip.ssh: Flags [S], cksum 0x5c3d (correct), seq 2356956535, win 65535, options [mss 1460,sackOK,eol], length 0  
 15:57:37.881907 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 48)  
 # SYNACK (at last !!!)  
   server_ip.ssh > client_ip.49316: Flags [S.], cksum 0x7a12 (correct), seq 3228952474, ack 2356956536, win 14600, options [mss 1460,nop,nop,sackOK], length 0  
 15:57:37.885362 IP (tos 0x0, ttl 53, id 62772, offset 0, flags [DF], proto TCP (6), length 40)  
 # ACK  
   client_ip.49316 > server_ip.ssh: Flags [.], cksum 0xdfde (correct), seq 1, ack 1, win 65535, length 0

Monday, July 15, 2013

Emulate bad or WAN network performances from a particular IP on a Gigabit LAN network

If you're developing Web or mobile applications, you'll certainly be confronted to poor network conditions.

The problem now is "how can I test my application under bad network conditions". Well you could rent a forein internet connection or use tools that reports performance from various remote countries however this is not a good debugging environment.

The solution is to use TC and NetEM on your front development server (typically Web or reverse proxy server), then use filters so only one client station (the debugging station) is impacted.
Don't forget to use filter otherwise all your clients will be impacted.

Below an example on how to emulate a network with :

1Mbps bandwidth
400ms delay
5% packet loss
1% Corrupted packet
1% Duplicate packet

The debugging client IP is 192.168.0.42 (i.e the IP impacted by the bad network performance);
The following commands need to be executed on the front developement server, please set the appropriate NIC for you environment (eth0 used below) :

 # Clean up rules  
   
 tc qdisc del dev eth0 root  
   
 # root htb init 1:  
   
 tc qdisc add dev eth0 handle 1: root htb  
   
 # Create class 1:42 with 1Mbps bandwidth  
   
 tc class add dev eth0 parent 1:1 classid 1:42 htb rate 1Mbps  
   
 # Set network degradations on class 1:42  
   
 tc qdisc add dev eth0 parent 1:42 handle 30: netem loss 5% delay 400ms duplicate 1% corrupt 1%  
   
 # Filter class 1:42 to 192.168.0.42 only (match destination IP)  
   
 tc filter add dev eth0 protocol ip prio 1 u32 match ip dst 192.168.0.42 flowid 1:42  
   
 # Filter class 1:42 to 192.168.0.42 only (match source IP)  
   
 tc filter add dev eth0 protocol ip prio 1 u32 match ip src 192.168.0.42 flowid 1:42

To check that the rules are properly set use the following commands :

 tc qdisc show dev eth0  
 tc class show dev eth0  
 tc filter show dev eth0

Once you're done with the testing, cleanup the rules with the command :

 tc qdisc del dev eth0 root

There is many other options you can use (correlation, distribution, packet reordering, etc), please check the documentation available at :

http://www.linuxfoundation.org/collaborate/workgroups/networking/netem

If this setup fits your requirements, I advice you to create a shell script so you can start/stop the rules with custom values. Be aware that you can also make filters based on source/destination ports, etc.

If you have more complex requirements, you can try WANem, which is a live Linux Distribution with a graphical interface on top of NetEM. Please be aware that this requires route modifications on your client and server (or any other routing tricks).

http://wanem.sourceforge.net/
http://sourceforge.net/projects/wanem/files/Documents/WANemv11-Setup-Guide.pdf

I didn't had the opportunity to try it, please let me know if you have any feedback.

Monday, April 29, 2013

Shrew VPN Client + Juniper SRX : "session terminated by gateway" (Autodisconnect)

If like me, you're trying to connect to a Juniper dynamic VPN with Shrew VPN Client, be aware that this not yet possible.

The connection works but the tunnel is constantly disconnected after 60 seconds.

I asked the core developer "Matthiew Grooms" about this and after few debug, it seems like a fix is needed in Shrew's code:

"It's pretty clear whats going on but it won't be possible to fix without 
a rewrite of the modecfg code on the Shrew Soft VPN client, which is probably 
needed anyway."

Full technical details are available at :
https://lists.shrew.net/pipermail/vpn-help/2012-December/014091.html

If anybody found out an alternative solution please share !