I have a setup with 3 VMs (1 application server on CentOS 6 and 2 database servers on CentOS 7). For the last one to two weeks we have had timeouts when connecting to the database servers (and between the two database servers, which form a cluster).
The database provider (Couchbase) can see from the logs that the connections are being forcibly closed:
WARN com.couchbase.endpoint - [com.couchbase.endpoint][UnexpectedEndpointDisconnectedEvent] The remote side disconnected the endpoint unexpectedly
The logs also show that packets are dropped, like:
[warn] Interface 'ens32' (removedip) failures: RX:2863 / TX:0 - Details:
- RX packets:308,593,167 errors:0 dropped:2,863 overruns:0 frame:0
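To put those counters in proportion, here is a minimal Python sketch that parses an ifconfig-style RX line like the one in the log and computes the share of dropped packets. The function name `rx_drop_ratio` and the regular expressions are my own illustration, not something taken from the Couchbase tooling.

```python
import re

def rx_drop_ratio(line: str) -> float:
    """Return dropped/packets for an ifconfig-style RX counter line."""
    # Counters in the log use thousands separators, e.g. "308,593,167".
    packets = re.search(r"packets:([\d,]+)", line)
    dropped = re.search(r"dropped:([\d,]+)", line)
    if not packets or not dropped:
        raise ValueError("line does not contain packets/dropped counters")
    total = int(packets.group(1).replace(",", ""))
    lost = int(dropped.group(1).replace(",", ""))
    # Avoid dividing by zero on an interface that has not received anything.
    return lost / total if total else 0.0

line = "RX packets:308,593,167 errors:0 dropped:2,863 overruns:0 frame:0"
print(f"{rx_drop_ratio(line):.6%} of received packets were dropped")
```

With the numbers from the log this works out to roughly one dropped packet per 108,000 received, which is a small fraction, although a cumulative counter says nothing about when the drops happened.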
All three VMs run on the same VMware ESXi host (version 6.5), so the connections between them should be good.
What has changed over the last couple of weeks? Security updates to the VM operating systems, and the database server version (from 6.6.0 to 7.0.0). The database upgrade shouldn't change anything on the network, but it is obviously the reason I contacted the database provider first...
Any ideas for finding the culprit are much appreciated :-)
Edit:
Following Cameron's suggestion I just ran a short network trace and loaded it into Wireshark on my local machine. Then I opened the "Expert information" and got this:
I should mention that there is an Nginx proxy server in front of the application server. It handles SSL and terminates it before requests reach the application server. Just looking at the info, I would expect the two "red" blocks to be related to requests coming from the outside, and not to traffic from the application server to the database servers.
But I'm not really sure what to look for in the results. I guess I need to let the capture run a little longer, but perhaps without the traffic from the outside?
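One way to leave out the outside traffic would be a capture filter that only matches the two database servers. The sketch below just assembles such a tcpdump command line in Python. The interface name `eth0`, the output file name and the addresses are placeholders (the addresses come from the RFC 5737 documentation range), not my real values.

```python
import shlex

def tcpdump_command(iface, hosts, outfile):
    """Build a tcpdump argument list that captures only traffic to/from `hosts`."""
    if not hosts:
        raise ValueError("need at least one host to filter on")
    # Capture filter: "host A or host B" matches packets that have either
    # address as source or destination.
    bpf = " or ".join(f"host {h}" for h in hosts)
    # -i: interface, -s 0: capture full packets, -w: write raw packets to a file
    return ["tcpdump", "-i", iface, "-s", "0", "-w", outfile, bpf]

# Placeholder values, assumed for illustration only.
db_hosts = ["192.0.2.11", "192.0.2.12"]
cmd = tcpdump_command("eth0", db_hosts, "db-only.pcap")
print(shlex.join(cmd))
```

The printed command can be run as root on the application server, and the resulting file opened in Wireshark as before.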
Edit 2:
While I was sitting and looking at it, the issue actually occurred, so I quickly started tcpdump again. The results may not contain the root cause, but they should be more relevant than the first capture:
The blocks I have expanded seem to be related to communication with one of the database servers :-)
But what do these results mean and how do I get closer to finding the cause?
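In case it helps to line up the drops with the timeouts, this is a rough sketch of how I could sample the RX drop counter on a database server. It assumes the standard Linux sysfs path `/sys/class/net/<iface>/statistics/rx_dropped`; the function names, interval and sample count are arbitrary choices of mine.

```python
import time
from pathlib import Path

def read_rx_dropped(iface, base="/sys/class/net"):
    """Read the kernel's cumulative RX drop counter for one interface."""
    return int(Path(base, iface, "statistics", "rx_dropped").read_text().strip())

def watch_rx_dropped(iface, interval=5.0, samples=12, base="/sys/class/net"):
    """Yield (timestamp, new_drops) for each interval in which the counter rose."""
    previous = read_rx_dropped(iface, base)
    for _ in range(samples):
        time.sleep(interval)
        current = read_rx_dropped(iface, base)
        if current > previous:
            yield time.strftime("%H:%M:%S"), current - previous
        previous = current

# Example use on a database server (interface name taken from the log above):
# for stamp, drops in watch_rx_dropped("ens32"):
#     print(f"{stamp} +{drops} RX drops")
```

If the timestamps with new drops match the times of the Couchbase disconnect warnings, that would at least suggest the two are connected.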