2009/11/24

VMware ESX (+HA) and correct DNS/NTP settings: a must, must

Today I was at a customer site that is running ESX 4.0 hosts (and 3.5 hosts), both connected to vCenter servers. A strange issue occurred on both clusters. One of the ESX hosts got disconnected and didn't want to reconnect. Connecting directly with the VI Client didn't work, and the web service seemed to be down. On the ESX 4.0 host I got the famous 503 Service Unavailable error (...)

My first thought: let's restart mgmt-vmware (the restart even hung on the 3.5 ESX host). This didn't help :(. The customer then suggested rebooting the server. Although I wasn't hopeful it would help, he did it anyway. After the reboot, still the same issue.

After some reading I found some SSL errors in the logs. However, this didn't make sense at all. Some more reading led me to some guy saying that correcting his DNS settings fixed everything. So I started the typical nslookup
> esxhostname
> esxhostname.f.q.d.n
> ip
The result was... well, timeouts all over the place. The network team had given us new DNS IPs, but they had already shut down the old servers before we were able to change the settings. The result was an unmanageable server.
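If you have more than a handful of names to check, the same forward lookups can be scripted. This is only a minimal sketch in Python, meant to run from an admin workstation rather than on the ESX console. The function names `check_name` and `check_all` are my own invention, and it only does forward lookups (name to IP), not the reverse lookup I also did by hand.

```python
import socket


def check_name(name, resolve=socket.gethostbyname):
    """Return (name, address), or (name, None) if the lookup fails."""
    try:
        return name, resolve(name)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError
        return name, None


def check_all(names, resolve=socket.gethostbyname):
    """Resolve every name and return a dict of name -> address or None."""
    return dict(check_name(name, resolve) for name in names)


if __name__ == "__main__":
    # Hypothetical host names, replace them with your own short names and FQDNs
    for name, address in check_all(["esx01", "esx01.mydomain.com"]).items():
        print(name, "->", address if address else "LOOKUP FAILED")
```

Every name that comes back as a failed lookup is a candidate for the kind of trouble described above. Keep in mind that each failing lookup can block for as long as the resolver timeout.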

So I logged on to the console and changed the DNS settings in /etc/resolv.conf manually. Restarting the mgmt-vmware service now worked (service mgmt-vmware restart). However, the VI Client still could not connect and the connection was not restored. Restarting the webAccess service did the final trick (service webAccess restart). We did a final reboot to check whether the error would reoccur, but luckily it didn't.

Next up was configuring the ntp server. This should have been simple, ...

First of all, start by setting the right time manually. This is something I learned the hard way (checking the logs). If you have too much skew (an offset of more than 1000 seconds or so), the NTP service won't synchronize. It considers the offset too big and abnormal. Result: it discards the offset, no NTP sync :(
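To make the skew rule concrete, here is a small sketch in Python. It assumes the 1000 second limit (which matches ntpd's default panic threshold) and simply tells you whether the clock has to be set by hand first. The names `PANIC_THRESHOLD` and `needs_manual_set` are made up for this example.

```python
from datetime import datetime, timedelta

# Assumption: the limit is 1000 seconds, ntpd's default panic threshold
PANIC_THRESHOLD = timedelta(seconds=1000)


def needs_manual_set(local, reference, threshold=PANIC_THRESHOLD):
    """True if the clock is too far off (in either direction) for ntpd to fix."""
    return abs(local - reference) > threshold


if __name__ == "__main__":
    reference = datetime(2009, 11, 24, 12, 0, 0)
    local = reference + timedelta(minutes=30)  # host clock is 30 minutes fast
    print(needs_manual_set(local, reference))
```

A host that is 30 minutes off is well over the limit, so you would set its clock manually before starting the NTP service.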

Second, enable the NTP client before configuring it. This will open up the proper ports in the firewall.

Then configure the ntp service to start automatically and configure your ntp servers.

Lastly you might see that ESX is not synchronizing. Strange behaviour. I solved it by doing
$ service ntpd stop
$ ntpd -d -q
$ service ntpd start
(a trick I learned here)

After all the clocks were set I checked that everything was in sync. I opened PuTTY sessions to all the ESX hosts and issued
$ watch "date"
This will execute date every 2 seconds and show its output. Makes it easy to compare ;)

Then finally I tried to enable HA. The customer had tried this before, but obviously with bad time and DNS settings the process failed. This time the process started pushing the HA agents and everything ran smoothly. So before you start trying HA, consider the above.

One of my big concerns though is that ESX is really, really, really sensitive to the uptime and reachability of your DNS servers. I know you can supply 2 DNS servers, but still this is scary stuff. I was wondering (and if I find the time to try it I will post some results) whether I could build a DNS-less ESX setup.

The first tryout would be testing whether /etc/hosts is sufficient to replace DNS. What you need to do is add a line for each ESX/VC on every host
ip.addr.put.first hostname hostname.fqdn

For example, on esx01.mydomain.com with 2 other ESX hosts, esx02 and esx03, I would get something like
127.0.0.1 localhost
10.0.0.1 esx01 esx01.mydomain.com
10.0.0.2 esx02 esx02.mydomain.com
10.0.0.3 esx03 esx03.mydomain.com
10.0.1.1 vcenter vcenter.mydomain.com

Then on the ESX hosts, check the nsswitch.conf file and make sure that files comes before dns.
>hosts: files dns
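
If you want to verify this on many hosts, the check is easy to script. Below is a rough sketch in Python that takes the text of an nsswitch.conf file and reports whether files is listed before dns on the hosts line. The function name `files_before_dns` is hypothetical, and how you get the file contents off each host is left out.

```python
def files_before_dns(nsswitch_text):
    """True if the hosts line lists 'files' ahead of 'dns' (or has no dns)."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.startswith("hosts:"):
            sources = line[len("hosts:"):].split()
            if "files" not in sources:
                return False
            if "dns" not in sources:
                return True
            return sources.index("files") < sources.index("dns")
    return False  # no hosts line found at all


if __name__ == "__main__":
    print(files_before_dns("hosts: files dns\n"))
```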

So if that works correctly, we would have an offline DNS solution (assuming we also tested HA by unplugging an ESX host and so on). However, it would be a lot of work if you have a lot of hosts.

You could then write a perl script (assuming you are going to run perl on vCenter) that pulls this config from a central location every 10 minutes, compares it with the current file (md5 checksum?) and replaces it if necessary. By now you must be thinking I'm insane: I'm going to replace a 40-year-old system with a script. What if the central location goes down? No biggie, you still have your local configuration. And let's be honest, if your DNS servers are going down, you won't be busy installing new ESX hosts.
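
To give an idea of the compare-and-replace step, here is an untested-in-production sketch. I wrote it in Python instead of perl, and it only covers the "compare the md5 and replace if different" part. Fetching the file from the central location and scheduling it every 10 minutes are left out, and the names `md5_of` and `update_if_changed` are invented for this example.

```python
import hashlib
import os
import tempfile


def md5_of(data):
    """Hex md5 checksum of a bytes object."""
    return hashlib.md5(data).hexdigest()


def update_if_changed(new_content, target_path):
    """Replace target_path with new_content if the checksums differ.

    Returns True when the file was (re)written, False when it was left alone.
    """
    try:
        with open(target_path, "rb") as handle:
            current = handle.read()
    except FileNotFoundError:
        current = None

    if current is not None and md5_of(current) == md5_of(new_content):
        return False

    # Write to a temp file in the same directory, then swap it into place,
    # so a half-written hosts file is never visible.
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(new_content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; a hosts file must be readable
    os.replace(tmp_path, target_path)
    return True
```

If the central location is unreachable you simply never call `update_if_changed`, so the local file stays as it is, which is exactly the fallback described above.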

*big fat warning: the last part is purely theoretical, and I have not tested it. I wrote this at home in my lazyboy :)
