Nov14th2007

ESX: Recover from expanded disk with existing snapshot or corrupted snapshots

I had a nasty shock this week with ESX3.

I was going about expanding virtual disks and reallocating resources for one client. Now, I have done this MANY times, so I thought that “the 2 day old backup is sufficient” and did not wait 3-4 hours for a new backup right before what would be a 10 min task.

I went to expand the virtual disks from the COS and noticed that there were some “Virtual-Disk-000001-delta.vmdk” and “Virtual-Disk-000001.vmdk” files present.

“Oh, a snapshot is here for some reason..?”, I pondered. I then went into the VI3 management console, drilled down to said VPS and went to the snapshot manager, expecting to find a snapshot and then simply commit it to the main disk so I could get back to expanding.

What I found was “No Snapshots for this Virtual Server”.

Hmmm….. “maybe they are old snapshot files that should have been deleted, but weren’t”, I further mused. And =IF= it is a snapshot, surely vmkfstools will not let me run a dangerous or incompatible command. So off I went to expand this virtual disk by another 100GB.

– Expand disk:

“vmkfstools -X 220G Virtual-Disk.vmdk”

Expansion done. All looks good. Fire up VPS…..

“Sorry VPS can’t be started because one of the base files that a given snapshot is based on has been modified and thus can’t be mounted”.

” *^&#*^&@(*&!(@&(!*&(&! “

Ok, no harm no foul. The actual disk is not changed. Doing an expand with vmkfstools just adds a marker for more size… surely I can just remove the extra addition, ‘rollback the expansion’ so to speak and all will be spiffy?

Nup. Even though I knew in the back of my head that shrinking a VMDK was NOT POSSIBLE in ESX3 as it was in ESX2.5, I still went searching in the faint hope that I had overlooked some trick during past information gathering exercises when I was not under so much pressure and panic as I was this time.

No dice. What I knew was confirmed. I can’t shrink it. I can’t even load up ghost and mirror, because the main problem is that this Virtual-Disk-000001-delta.vmdk file still has to be merged into the base disk. And seeing as it was 25GB in size, for what is a 100GB virtual disk, and the date stamp was some 3 months prior – there is A LOT of data and changes that are at risk now.

” *^&#*^&@(*&!(@&(!*&(&! “

OK, on to Google again. It took some searching and a lot of effort refining my query, because information was scant, even though (as I later found out) these are the #4 and #5 global support issues for VMWARE. I did manage to find a couple of blogs with very brief reviews of the recent VMWorld summit, lacking in all technical detail.

So with that hook, I then started to search for detailed info from that summit and managed to get a PPT file from one of the developers. And inside were all the details that I needed. Or thought that I needed. Because with any system as complicated as VMWARE, definitions of words and correct semantics can make it very difficult to get a clear grasp of one problem versus a slight variation of it. And even a slight change can come with very different procedures, and using the wrong ones could make a problem worse. First rule – do no more harm.

I then went to the page that was titled “Expanding the size of a VMDK with an existing Snapshot”. I did not know if this meant, “how to expand a VMDK with an existing snapshot and keep it intact”, or “How to recover from a monumental screw up that only an idiot would do, when expecting vmkfstools to do all due diligence for him and has fucked up the VMDK that happened to have a current and active snapshot that wasn’t committed to the main VMDK file first”

I assumed it meant the latter, being “tech support” and “high rating”… if it was documentation of a feature or process it would have been, well, better documented.

The procedure is this:

– After I was an idiot and issued this command to cause all the problems:

“vmkfstools -X 220G Virtual-Disk.vmdk”

– Check the “Virtual-Disk.vmdk” file with vi and look for the following lines:

RW 482344960 VMFS “Virtual-Disk-000001-delta.vmdk”

– Now check the “Virtual-Disk-000001.vmdk” file and look for the following lines:

RW 209715200 VMFS “Virtual-Disk-000001-delta.vmdk”

What we now know is the current RW value on the newly expanded “Virtual-Disk.vmdk” and it is 482344960. We want to ‘trick’ the system into thinking that the expand never happened. So we replace that value with the one we got from the snapshot’s descriptor, “Virtual-Disk-000001.vmdk”, which still records the original size. That is, we replace 482344960 with 209715200.

– Now we need to commit all snapshots:
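The descriptor edit above can also be sketched as a small shell helper instead of a hand edit in vi. This is only a sketch under assumptions: the function names are made up for illustration, and it assumes the extent line looks exactly like the ones quoted above (RW, sector count, type, file name) and starts at the beginning of the line. It keeps a backup of the base descriptor before touching it, and it also converts the sector count (512-byte sectors) into GiB so you can sanity check the number you are about to write.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and the descriptor layout is
# assumed to match the lines quoted in the article:
#   RW <sectors> VMFS "<file>"

# Print the sector count from the first "RW <sectors> ..." line of a descriptor.
extent_sectors() {
    awk '$1 == "RW" { print $2; exit }' "$1"
}

# Convert a count of 512-byte sectors to whole GiB (2097152 sectors per GiB).
sectors_to_gib() {
    echo $(( $1 / 2097152 ))
}

# Copy the extent size recorded in the snapshot descriptor into the base
# descriptor, keeping a backup of the base descriptor as <base>.bak.
restore_extent_size() {
    base=$1
    snap=$2
    old=$(extent_sectors "$base")
    new=$(extent_sectors "$snap")
    if [ -z "$old" ] || [ -z "$new" ]; then
        echo "no RW extent line found" >&2
        return 1
    fi
    cp -p "$base" "$base.bak" || return 1
    sed "s/^RW $old /RW $new /" "$base.bak" > "$base"
}
```

Usage would be something like restore_extent_size Virtual-Disk.vmdk Virtual-Disk-000001.vmdk, run from the VM’s directory with the VM powered off. sectors_to_gib 209715200 gives 100, which matches the original 100GB disk.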

“vmware-cmd /vmfs/volumes/VMFSVOLUME/VPS/VPS.vmx removesnapshots”

Unfortunately I was not done yet. The system reported back that the virtual machine “VPS.vmx” did not have any snapshots present! “Ah ha” I thought. While this is not good, it is also the reason why vmkfstools went on and screwed everything up at the start. There is a snapshot there – that is a fact – but the system does not believe so.

This is where global common VMWARE problem #5 comes in, “Corrupted .VMSD file”. In a nutshell this means that the file that tracks all this snapshot info (amongst other tidbits) is somehow compromised. So a new one is needed. This is also fairly simple once you know how:

– First rename the current VMSD file:

mv VPS.vmsd VPS.vmsd.old

– Now create a new snapshot to force the system to generate a new all-encompassing VMSD file:

“vmware-cmd VPS.vmx createsnapshot addedforrecovey "You are an IDIOT"”

– Now commit all snapshots like we wanted to do before anyway. You have to commit them all:

“vmware-cmd VPS.vmx removesnapshots”
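The three VMSD steps above can be strung together roughly like this. Treat it as a sketch and not a supported procedure: the function names, the snapshot name and the description are arbitrary, the createsnapshot call just mirrors the two-argument form used above, and the DRY_RUN switch is my own addition so you can see what would be run before letting it loose. Pass the absolute path to the .vmx file.

```shell
#!/bin/sh
# Sketch only: function names, snapshot name and description are arbitrary.
# With DRY_RUN=1 the commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rebuild_vmsd_and_commit() {
    vmx=$1                      # absolute path to the .vmx file
    vmsd="${vmx%.vmx}.vmsd"     # snapshot database sitting next to it

    # 1. Move the suspect snapshot database out of the way.
    run mv "$vmsd" "$vmsd.old" || return 1

    # 2. Take a throwaway snapshot so that a fresh .vmsd gets written.
    run vmware-cmd "$vmx" createsnapshot addedforrecovery "temporary snapshot" || return 1

    # 3. Commit every snapshot, including the throwaway one.
    run vmware-cmd "$vmx" removesnapshots
}
```

Running it first with DRY_RUN=1 only prints the mv and the two vmware-cmd lines, which is a cheap way to check the paths before anything is renamed.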

Now that all the snapshots are committed (the original one and the temp one we made to help recreate the VMSD file) we can continue fixing up our disk expansion issue. And this is as simple as running the initial vmkfstools expand command that we ran before, the one that caused all the problems. This is needed so that the correct RW values are set in “Virtual-Disk.vmdk”, because in the end, the virtual disk IS expanded already.

– So issue the command:

“vmkfstools -X 220G Virtual-Disk.vmdk”

In the end, I am NOT STUPID enough to try and expand a virtual disk with a snapshot. However if you DO SEE delta files in your file system, do not trust the VI3 client’s snapshot manager if it says “No Snapshots present”. As a matter of caution, I would follow the process above to recreate a new VMSD file to be sure and commit the temporary and any other snapshots that may exist. Then you can go on and expand your disks.

Also, make sure that you have backups. I did, and while they weren’t totally fresh, the client was not too upset when briefed on the situation. It could have been much worse.
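A quick pre-flight check along those lines might look like the sketch below. It only looks for files ending in -delta.vmdk in the VM’s directory, which is the naming seen in this post; the function names are made up, and an empty result is not proof that no snapshot exists, just that none of these files are lying around.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical. The "-delta.vmdk" suffix is
# taken from the file names seen in this post.

# Succeeds (exit 0) when at least one *-delta.vmdk file is in the directory.
has_delta_files() {
    for f in "$1"/*-delta.vmdk; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Refuse to go ahead with an expand while delta files are present.
safe_to_expand() {
    if has_delta_files "$1"; then
        echo "delta files found in $1: commit snapshots before expanding" >&2
        return 1
    fi
    echo "no delta files in $1"
}
```

Something like safe_to_expand /vmfs/volumes/VMFSVOLUME/VPS before the vmkfstools call would have stopped me in my tracks here.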

ALWAYS BACKUP!

DON’T LET A JUNIOR TECH TOUCH THINGS!

TAKE THE TIME TO RELAX AND ASSESS THE SITUATION BEFORE YOU POSSIBLY MAKE IT WORSE!

19 Responses to “ESX: Recover from expanded disk with existing snapshot or corrupted snapshots”


  1. Nov16th2007
    1 Matt Nov 16th, 2007 at 10:27 AM

    I love your blog, funny as hell and technically cool. We are using ESX at the school. Thanks for the command line education!!!

  2. Nov16th2007
    2 richard Nov 16th, 2007 at 10:52 AM

    Thanks. I do try to please. This stuff can be dry enough by itself… it does need a dose of colour to aid digestion.

  3. Nov18th2007
    3 Doug Nov 18th, 2007 at 2:25 AM

    Thanks this was a lifesaver. I didn’t expand my disk, but I did have phantom snapshots.

  4. Nov18th2007
    4 richard Nov 18th, 2007 at 3:30 AM

    “Phantom Snapshots”… if that is not =the= word for this situation already, then it had better become the lingua franca…

    It too did scare the crap out of me when it messed up.

  5. Nov18th2007
    5 richard Nov 18th, 2007 at 2:47 PM

    UPDATE: I found these two patches. As luck would have it, the VPS’ in question here, did have 2 virtual disks or more.

    http://www.vmware.com/support/vi3/doc/esx-8258730-patch.html

    http://www.vmware.com/support/vi3/doc/esx-1000077-patch.html

  6. Jan22nd2008
    6 Glen Jan 22nd, 2008 at 3:18 AM

    Thank you for this article. It really saved my bacon.

    The support folks at VMWare told me that I was out of luck. I directed them to your article (which they found interesting). They told me that I could try it, but it wouldn’t be a “supported” solution.

    My only other option was restore from tape. So, I gave it a try. It worked like a charm.

  7. Jan22nd2008
    7 richard Jan 22nd, 2008 at 3:26 AM

    My pleasure. As I said, I could not get anything out of them either! Then they say at their conference that these were “Global support issues 4 and 5”!

  8. Mar15th2008
    8 Steve Mar 15th, 2008 at 2:05 AM

    Very informational. The information about the VMSD file might help me out of a different situation I’m in right now where a VCB snapshot was left to grow and run the volume out of space. Space was added so the VM could run again, but the snapshot won’t commit after 24 hours so far. Looks like corruption.

  9. Mar15th2008
    9 richard Mar 15th, 2008 at 10:02 PM

    Steve, give it a little longer, as snapshots can be painfully slow to commit. Try and reduce all other IO on that particular storage array.

    I had a 25 GB snapshot (so 25 gb worth of delta changes) and it took a good 2 hours to commit on a RAID 50 array with 12 SCSI 320’s at 15K…

  10. Mar22nd2008
    10 Serkan Mar 22nd, 2008 at 10:36 PM

    Hi Richard

    Thank you for this article, but I still have a problem. I followed your steps, but when I try to create a new snapshot it gives the error: “VMControl error -11: No such virtual machine”. Could you please help me to solve this problem?

  11. Mar23rd2008
    11 richard Mar 23rd, 2008 at 10:38 AM

    Yes, when doing any operation on a VMX file (Virtual machine) – make sure that you reference it using the full absolute path, eg;

    vmware-cmd -s register /vmfs/volumes/storage1/VM/VM.vmx

  12. Mar29th2008
    12 ghandi Mar 29th, 2008 at 4:56 AM

    Thanks for this info. It saved me weeks worth of work if I would have had to reproduce my work.

  13. May13th2008
    13 Sam King May 13th, 2008 at 11:56 PM

    This is more of a reply to Serkan’s “VMControl error -11:No such virtual machine”. If you have several ESX servers sharing the same storage, make sure that you’re logged into the host where the problematic VM is currently running. Try issuing “vmware-cmd -l” to list the VMs on that host and make sure that yours is there (it will also give you the “official” name of your VM)

  14. Jun20th2008
    14 Alessandro Jun 20th, 2008 at 11:24 PM

    A huge thank you!!!
    You saved me.

    You are the best…..

  15. Nov15th2008
    15 Rafal Nov 15th, 2008 at 1:30 AM

    It worked 🙂 The OS is not bootable, but I was able to mount it to a different OS and copy the data. Good job.

  16. Dec1st2008
    16 Wade Dec 1st, 2008 at 3:18 PM

    Bravo! I did not have the same problem that you had but I did have multiple snapshots that did not show up in the Snapshot manager. Renaming the .vmsd file, recreating it by making a snapshot and then deleting that snapshot deleted all previous snapshots. Your technique was substantially easier than any other one I found. Thank you for sharing!

  17. Dec1st2008
    17 richard Dec 1st, 2008 at 11:03 PM

    No problems. Glad it could help. You’d have thought that software that costs upwards of 35K USD for a 4 server cluster would:

    A) Not have this problem

    B) Have the issue clearly publicised and not kept for internal partner conferences!

    Working on another one now for screwed storage vMotion tasks.

  18. Apr10th2010
    18 SilentLamb Apr 10th, 2010 at 3:25 AM

    WOW! I cannot believe that VMWare 1) deliberately included a process that would fill the disk, inducing corruption 2) charges out the a$$ for their supposedly superior product and 3) has no simple means of recovering from this Bull$(*&

    What F-TARD decided that this stupid “DELTA to drive capacity” idea was a good one? How hard would it be to monitor the disk for storage space and force you to commit the changes once in a while or at least send a dang email??????

  19. Apr10th2010
    19 richard Apr 10th, 2010 at 12:19 PM

    Happened to me again with vSphere4 last weekend. Phantom snapshot + fill disk. Add in the thin provisioning of vSphere4 and it is a disaster. No data lost - but about 24 hours of uptime was. Good thing this was a private internal testing server for – Zimbra – no less. Oh the poetry of it!
