Saturday, July 9, 2016

False alerts - Dropped packets

Most monitoring relies on the same mechanism: it sends some kind of probe over ICMP, SNMP, SSH, WinRM, WMI, or an API. More importantly, because it has to poll data from every device, it cannot afford more than a few retries. So it's understandable that from time to time a probe is missed and gets no reply. That is fine as long as it doesn't happen often and occurs more or less randomly.

Usually the problem lies in the TCP/IP stack implementation. Every network interface (whether virtual or physical, on a switch, server, or printer) has an incoming and an outgoing queue (buffer) of limited size. Packets go into this buffer first and are then processed one by one by the TCP/IP layers.

When there is a lot of traffic in a very short period of time, typically when a big file transfer starts, this buffer fills up and newly arriving packets are dropped. For an established TCP connection that is not a problem: dropped packets are never acknowledged, so they are retransmitted. But new connections and stateless protocols (such as ICMP, or SNMP over UDP) cannot recover within the short timeout a monitoring probe allows, so the request times out.
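
To make the buffer effect concrete, here is a toy Python sketch of a single fixed-size receive queue with tail drop. It is a simplified model under assumed numbers (the traffic pattern, capacity and service_rate are hypothetical, and real NICs and drivers are more complex): packets that arrive while the queue is full are discarded, and the stack then drains a fixed number of packets per time step.

```python
def tail_drop(arrivals, capacity, service_rate):
    """Return the number of packets dropped in each time step.

    Toy model of one fixed-size receive queue: arrivals that do not fit
    are discarded (tail drop), then the stack drains up to service_rate
    packets from the queue.
    """
    backlog = 0
    drops = []
    for count in arrivals:
        accepted = min(count, capacity - backlog)  # only free slots are filled
        drops.append(count - accepted)             # the overflow is discarded
        backlog += accepted
        backlog -= min(service_rate, backlog)      # the stack processes packets
    return drops


# Hypothetical traffic: light load with a single burst at step 2
traffic = [2, 2, 40, 2, 2, 2, 2, 2, 2, 2]
print(tail_drop(traffic, capacity=20, service_rate=10))  # [0, 0, 20, 0, 0, 0, 0, 0, 0, 0]
print(tail_drop(traffic, capacity=64, service_rate=10))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

With these assumed numbers the average load (58 packets over 10 steps) is well below what the stack can process per step, yet the 20-slot queue still drops 20 packets during the burst, while a 64-slot queue absorbs the same burst without loss. Any probe that happened to be among those 20 packets would simply get no reply.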

In monitoring we usually consider such a device down, because it's not responding.
So if you see devices flapping irregularly, or more specifically packet loss, the cause might be dropped packets, and you should consider enlarging those buffers.
This is a known issue with VMware, by the way:
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2039495
If you want to prove this is the problem, you just need to read the dropped-packet counters. Here are some tips on how to get them.

For Windows


From Zenoss


winrm -r http://hostname:5985 -u service-account@example.net -a kerberos --dcip 10.10.10.10 -f "select * from Win32_PerfRawData_Tcpip_NetworkInterface" -d

From Windows PowerShell


Get-WmiObject -computername hostname -Query "select Name,PacketsOutboundDiscarded,PacketsReceivedDiscarded from Win32_PerfRawData_Tcpip_NetworkInterface"

Output example:
__GENUS                  : 2
__CLASS                  : Win32_PerfRawData_Tcpip_NetworkInterface
__SUPERCLASS             :
__DYNASTY                :
__RELPATH                : Win32_PerfRawData_Tcpip_NetworkInterface.Name="vmxnet3 Ethernet Adapter _2"
__PROPERTY_COUNT         : 3
__DERIVATION             : {}
__SERVER                 :
__NAMESPACE              :
__PATH                   :
Name                     : vmxnet3 Ethernet Adapter _2
PacketsOutboundDiscarded : 0
PacketsReceivedDiscarded : 50
PacketsReceivedDiscarded (https://msdn.microsoft.com/en-us/library/aa394340(v=vs.85).aspx)
Data type: uint32
Access type: Read-only
Qualifiers: DisplayName ("Packets Received Discarded"), CounterType (65536), DefaultScale (0), PerfDetail (200)
Number of inbound packets that were chosen to be discarded even though no errors had been detected to prevent delivery to a higher-layer protocol. One possible reason for discarding such a packet could be to free up buffer space.
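
Because the counter is a cumulative uint32 value, a single reading such as 50 says little; what matters is whether it keeps growing between polls. The Python sketch below parses two captures of the Get-WmiObject list output shown above and computes the increase per interface, allowing for one 32-bit wraparound. The helper names parse_wmi_records, counter_delta and discards_since are hypothetical, and the second sample (73) is invented for illustration. Note that a reboot resets the counter, which this sketch would misread as a very large increase.

```python
UINT32_WRAP = 2 ** 32  # the property is documented as uint32


def parse_wmi_records(text):
    """Split 'Property : Value' list output into one dict per object."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():          # blank line separates objects
            if current:
                records.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def counter_delta(old, new, wrap=UINT32_WRAP):
    """Increase of a raw counter between two samples, tolerating one wrap."""
    return (new - old) % wrap


def discards_since(before, after):
    """Map interface name -> (received, outbound) discards between polls."""
    old = {r["Name"]: r for r in parse_wmi_records(before)}
    result = {}
    for rec in parse_wmi_records(after):
        prev = old.get(rec["Name"])
        if prev is None:
            continue  # interface missing from the first sample
        result[rec["Name"]] = (
            counter_delta(int(prev["PacketsReceivedDiscarded"]),
                          int(rec["PacketsReceivedDiscarded"])),
            counter_delta(int(prev["PacketsOutboundDiscarded"]),
                          int(rec["PacketsOutboundDiscarded"])),
        )
    return result


BEFORE = """\
__CLASS                  : Win32_PerfRawData_Tcpip_NetworkInterface
Name                     : vmxnet3 Ethernet Adapter _2
PacketsOutboundDiscarded : 0
PacketsReceivedDiscarded : 50
"""
AFTER = BEFORE.replace("Discarded : 50", "Discarded : 73")  # hypothetical later poll

print(discards_since(BEFORE, AFTER))  # {'vmxnet3 Ethernet Adapter _2': (23, 0)}
```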
  

For SNMP Devices

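Use snmpwalk or snmpget on the OIDs below. The last number of each returned OID is the ifIndex; walking ifDescr (.1.3.6.1.2.1.2.2.1.2) first, as in the example, shows which interface each index belongs to.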
ifInDiscards .1.3.6.1.2.1.2.2.1.13 
ifOutDiscards .1.3.6.1.2.1.2.2.1.19
The number of inbound/outbound packets which were chosen to be discarded even though no errors had been detected to prevent their being deliverable to a higher-layer protocol. One possible reason for discarding such a packet could be to free up buffer space.

root@raspberrypi:/home/pi# snmpwalk -v 2c -c public 10.0.0.40 .1.3.6.1.2.1.2.2.1.2
iso.3.6.1.2.1.2.2.1.2.1 = STRING: "eth0"
iso.3.6.1.2.1.2.2.1.2.2 = STRING: "LOOPBACK"
root@raspberrypi:/home/pi# snmpwalk -v 2c -c public 10.0.0.40 .1.3.6.1.2.1.2.2.1.13
iso.3.6.1.2.1.2.2.1.13.1 = Counter32: 0
iso.3.6.1.2.1.2.2.1.13.2 = Counter32: 0
root@raspberrypi:/home/pi# snmpwalk -v 2c -c public 10.0.0.40 .1.3.6.1.2.1.2.2.1.19
iso.3.6.1.2.1.2.2.1.19.1 = Counter32: 0
iso.3.6.1.2.1.2.2.1.19.2 = Counter32: 0
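
On devices with many interfaces it helps to join the three walks by ifIndex. A minimal Python sketch, assuming net-snmp's default output format as shown above (parse_walk and discard_table are hypothetical helper names). ifInDiscards and ifOutDiscards are Counter32 values, so as with WMI, compare successive polls rather than single readings.

```python
import re

# Matches lines like: iso.3.6.1.2.1.2.2.1.13.1 = Counter32: 0
# group 2 is the last OID component, i.e. the ifIndex
LINE = re.compile(r'^(\S+)\.(\d+) = (\w+): (.*)$')


def parse_walk(text):
    """Map ifIndex -> value for one snmpwalk of an ifTable column."""
    values = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip prompts and error lines
        _, index, type_, value = m.groups()
        values[int(index)] = value.strip('"') if type_ == "STRING" else int(value)
    return values


def discard_table(descr_walk, in_walk, out_walk):
    """Map interface name -> (ifInDiscards, ifOutDiscards)."""
    names = parse_walk(descr_walk)
    ins, outs = parse_walk(in_walk), parse_walk(out_walk)
    return {names[i]: (ins.get(i), outs.get(i)) for i in names}


DESCR = '''iso.3.6.1.2.1.2.2.1.2.1 = STRING: "eth0"
iso.3.6.1.2.1.2.2.1.2.2 = STRING: "LOOPBACK"'''
IN = '''iso.3.6.1.2.1.2.2.1.13.1 = Counter32: 0
iso.3.6.1.2.1.2.2.1.13.2 = Counter32: 0'''
OUT = '''iso.3.6.1.2.1.2.2.1.19.1 = Counter32: 0
iso.3.6.1.2.1.2.2.1.19.2 = Counter32: 0'''

print(discard_table(DESCR, IN, OUT))  # {'eth0': (0, 0), 'LOOPBACK': (0, 0)}
```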

For Linux-based devices

On Linux-based devices you can run ifconfig and check the RX/TX dropped counters:
root@raspberrypi:/var/www# ifconfig
eth0      Link encap:Ethernet  HWaddr b8:27:eb:1d:89:fe
          inet addr:10.0.0.37  Bcast:10.0.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:19462418 errors:0 dropped:43030 overruns:0 frame:0
          TX packets:12891339 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000

          RX bytes:4260766110 (3.9 GiB)  TX bytes:2835848497 (2.6 GiB)
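
To check many hosts it can be handy to pull the dropped counters out of the ifconfig text. A minimal Python sketch (dropped_counters is a hypothetical helper); it assumes the classic net-tools layout shown above, and the regex is written so that it should also accept the newer "RX errors 0  dropped 0 ..." layout, although only the classic format is exercised here.

```python
import re

IFACE = re.compile(r'^(\S+)')  # interface name starts at column 0
DROPPED = re.compile(r'^\s+(RX|TX)\b.*?\bdropped[:\s]+(\d+)')


def dropped_counters(ifconfig_text):
    """Return {interface: {"RX": n, "TX": n}} parsed from ifconfig output."""
    result = {}
    iface = None
    for line in ifconfig_text.splitlines():
        m = IFACE.match(line)
        if m:
            iface = m.group(1).rstrip(":")  # newer format prints "eth0:"
            result[iface] = {}
            continue
        m = DROPPED.match(line)
        if m and iface:
            result[iface][m.group(1)] = int(m.group(2))
    return result


SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr b8:27:eb:1d:89:fe
          inet addr:10.0.0.37  Bcast:10.0.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:19462418 errors:0 dropped:43030 overruns:0 frame:0
          TX packets:12891339 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000

          RX bytes:4260766110 (3.9 GiB)  TX bytes:2835848497 (2.6 GiB)
"""

print(dropped_counters(SAMPLE))  # {'eth0': {'RX': 43030, 'TX': 0}}
```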
