During an upgrade from ESX 4.0 to ESXi 4.1 we reinstalled a blade server (BL460c G6) by script. One step in our upgrade process is to un-present all LUNs that are presented to the host in question. This is done to be sure that the installer cannot install onto a presented LUN. And here is where the human factor kicked in: we forgot one LUN that was still presented to the host. This LUN was presented from the Test and Development environment, and that was meant to be temporary, but not temporary enough 🙁 So when the installation script ran, it saw that LUN (CLUN002) as the first disk, and the script says to install on the first disk (see this article).
At first we did notice that the installation failed, but after an adjustment to the script it worked, and all virtual machines just kept running. The next morning everything looked fine until some of the virtual machines went for a planned reboot and did not come up again. At that point we knew we had a storage issue, because the rebooted virtual machines showed an Orphaned error. Some machines had not been rebooted; they could not be accessed through the vCenter console option, but they could be reached through RDP and were alive and well.
When we looked at the storage we no longer saw the folders with the virtual machine files, not even the folders of the virtual machines that were still reachable through RDP. Looking at the LUN itself, we saw multiple partitions, which suggested that the (in our eyes) failed installation had in fact been installed on that LUN.
When we went to the console view we saw that VMFS permission was denied on that LUN (see picture below).
At that moment we had a VMFS partition that had been overwritten by an ESXi 4.1 installation, with no permissions for this host or any other host, and with virtual machines still running from the LUN. I must have looked as if I had just seen water burning! The install option says:
autopart --firstdisk --overwritevmfs
So if VMFS is really overwritten, why are my virtual machines still working? Because vCenter was not able to do a Storage vMotion (no permission), we had an idea: let's V2V these machines (instead of P2V). And it worked! We V2V'd several machines that were on that LUN, removed the LUN on the storage and presented a new one to our environment. Lucky us!
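In hindsight, a small check on the install script before each run might have caught this. The sketch below is only an illustration and was not part of our process: it scans a kickstart file for an autopart line that combines a bare --firstdisk (no disk-type filter) with --overwritevmfs, which is the combination quoted above. The function name find_risky_autopart and the sample text are made up for this example. It assumes the documented filter form of the option, such as --firstdisk=local, is what you would use to keep the installer away from SAN LUNs; verify that against the scripted-install documentation for your ESXi version.

```python
def find_risky_autopart(kickstart_text):
    """Return (line_number, line) pairs for autopart lines that use a bare
    --firstdisk (no disk-type filter) together with --overwritevmfs."""
    risky = []
    for number, raw in enumerate(kickstart_text.splitlines(), start=1):
        line = raw.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        tokens = line.split()
        if not tokens or tokens[0] != "autopart":
            continue  # only autopart lines are of interest
        args = tokens[1:]
        # "--firstdisk" matches only the bare option; "--firstdisk=local"
        # is a different token and is treated as filtered.
        unfiltered = "--firstdisk" in args
        overwrite = "--overwritevmfs" in args
        if unfiltered and overwrite:
            risky.append((number, line))
    return risky


# Hypothetical kickstart fragment, for illustration only.
SAMPLE = """\
# example install script
accepteula
autopart --firstdisk --overwritevmfs
"""

if __name__ == "__main__":
    for number, line in find_risky_autopart(SAMPLE):
        print(f"line {number}: {line}")
```

A check like this does not replace un-presenting the LUNs. It only flags the script setting that made the forgotten LUN dangerous, and it does not look at which disk types a filter actually lists.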
I opened a discussion on the VMware forum because I am curious whether there are others out there who have seen this and how they solved it.
While writing this post I received one comment, which suggested:
If there is little disk activity the VMs are quite happy to run in RAM. Most OS’s will cache writes until the disk becomes available again. That may not necessarily explain your situation but I have experienced similar.
This is the only explanation that we had thought of too! If you have any other suggestions, please leave a comment on the VMware forum.
This article is related to Scripted install VMware ESXi on HP Blade hardware