A very short post, but it might be handy as a reference for other people out there and for myself.
On Friday I was at a customer site where the customer was losing 5 pings to virtual machines every now and then. Because the losses seemed unrelated to any other events, debugging it was hell. What was also bizarre was that we didn't have any loss on the network (we checked the statistics with ethtool and on the switches). The packets were just delivered too late to the VM, which then of course replied too late and caused a ping timeout (we saw this behavior with a network sniffer).
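To make the difference between "lost" and "late" concrete, here is a minimal Python sketch of the check we did by eye in the sniffer. It assumes you already have the send and receive timestamps per ping sequence number; the function name `classify_pings`, the dictionaries and the 1 second timeout are all made up for illustration and are not what we ran on site.

```python
def classify_pings(sent, received, timeout=1.0):
    """Classify each echo request as 'ok', 'late' or 'lost'.

    sent:     dict of sequence number -> time the request was sent (seconds)
    received: dict of sequence number -> time the reply arrived (seconds)
    timeout:  assumed ping timeout in seconds (hypothetical default)
    """
    result = {}
    for seq, t_sent in sent.items():
        if seq not in received:
            # No reply ever showed up in the capture: a real loss.
            result[seq] = "lost"
        elif received[seq] - t_sent > timeout:
            # The reply exists but arrived after the timeout: ping reports a loss.
            result[seq] = "late"
        else:
            result[seq] = "ok"
    return result
```

In our case every "lost" ping would have ended up in the "late" bucket, which is what pointed away from the network itself.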
We started to analyse the variables in play: what had changed since the network problems first occurred? Well, the SAN had changed. Looking further in the logs (/var/log/messages) revealed a multipathing problem (SATP, PSP and NMP warnings). What had happened was that the client was migrating to a newer version of SVC and, in effect, replacing the nodes. Multipathing worked fine, but the old nodes were still showing up as dead paths. Probably these were fixed paths (because SVC likes fixed) and the ESX hypervisor was trying to jump back to these old paths. Rescanning the HBAs or rebooting the ESX server solved the problem (for now?).
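If you want to check quickly whether a host is logging the same kind of warnings, a rough Python sketch like the one below can filter a copy of the log for the three keywords mentioned above. The plain substring match and the function name `find_multipath_lines` are my own assumptions; the exact message format differs between ESX versions, so treat it as a starting point only.

```python
# Keywords taken from the warnings we saw; matching on plain substrings
# is an assumption and may give false positives.
MULTIPATH_KEYWORDS = ("NMP", "SATP", "PSP")

def find_multipath_lines(lines):
    """Return the log lines that mention NMP, SATP or PSP."""
    hits = []
    for line in lines:
        if any(keyword in line for keyword in MULTIPATH_KEYWORDS):
            hits.append(line.rstrip("\n"))
    return hits

# Possible usage on a copy of the log file:
#   with open("/var/log/messages") as log:
#       for hit in find_multipath_lines(log):
#           print(hit)
```

Any hits that show up around the same time as the ping timeouts are worth a closer look.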