Probably my major complaint about Oracle VM is its lack of a hot backup solution for virtual machines. There is simply no documented or supported way to back up running virtual machines, which is crazy. Of course you can install good old backup agents inside your VMs, but that is one of the things I want to avoid when deploying virtualization. Or you could suspend or halt a machine, copy the virtual disk somewhere else and be happy. But I feel we should not have to stop our services for a backup. This is 2012, after all!
Sure, the latest release introduced a new feature to export an OCFS2 filesystem via NFS so it can be backed up from another machine. But that is only half of the solution, since a plain file copy of a virtual disk image is very likely to be corrupt if the running machine writes to it while we are reading it.
One workaround suggested across forums and message boards is to clone the running machine and then copy the virtual disk of the stopped clone. This sounds like a dirty workaround, but if it flies, I’ll be happy. Unfortunately, I could not find a way to clone VMs from command-line scripts, so this was not really practical for automated, day-to-day backups.
That is, until today, when I came across instructions describing a new (and provided as-is, unsupported…) CLI for OVM Manager. It requires version 3.1.1 build 365, but updating from build 305 was quite easy.
And sure enough, I can now use ssh to log in to the CLI:
[root@ovm ~]# ssh -l admin -p 10000 localhost
admin@localhost's password:
OVM> showversion
126.96.36.1995
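To script any of this, each CLI command has to run without someone typing at the OVM> prompt. Here is a minimal sketch, assuming (I have not verified this for build 365) that the CLI accepts a command passed as the ssh remote command and that key-based login for admin is configured. The function name ovm_cli and the OVM_SSH override are hypothetical; the override only exists so the wrapper can be dry-run with echo.

```shell
#!/usr/bin/env bash
# Sketch: run a single OVM Manager CLI command non-interactively over ssh.
# ASSUMPTIONS: the CLI accepts the command as the ssh remote command, and
# key-based login for "admin" is set up. OVM_SSH is a hypothetical override
# (e.g. OVM_SSH=echo) that lets you dry-run the wrapper without a server.

ovm_cli() {
  local ssh_cmd="${OVM_SSH:-ssh}"
  # User, port and host match the interactive login shown above.
  # BatchMode=yes makes ssh fail instead of hanging at a password prompt.
  "$ssh_cmd" -o BatchMode=yes -l admin -p 10000 localhost "$*"
}
```

For example, ovm_cli showversion would ask the CLI for its version, while OVM_SSH=echo ovm_cli showversion only prints the ssh arguments. If the CLI refuses non-interactive commands, the provided expect scripts remain the fallback.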
Diving right into the task at hand, I cloned a test VM:
OVM> clone Vm name=BTCminer_01 destType=Vm destName=BTCM01_backup serverPool=ptx_pool
Command: clone Vm name=BTCminer_01 destType=Vm destName=BTCM01_backup serverPool=ptx_pool
Status: Success
Time: 2012-07-19 12:14:08.539
OVM> show VM name=BTCM01_backup
Command: show VM name=BTCM01_backup
Status: Success
Time: 2012-07-19 12:21:38.055
Data:
  Name = BTCM01_backup
  Id = 0004fb0000060000a0d318dfea1a8ecb
  Status = Stopped
  Memory (MB) = 1024
  Max. Memory (MB) = 2048
  Max. Processors = 8
  Processors = 8
  Priority = 10
  Processor Cap = 80
  High Availability = false
  Operating System = Oracle Linux 6
  Mouse Type = Default
  Domain Type = Xen PVM
  Keymap = en-us
  description = bitcoin miner test, burning away CPU
  Server = 08:00:20:ff:ff:ff:ff:ff:ff:ff:00:1b:24:78:cc:62 [ovm01]
  Repository = 0004fb0000030000d4d126daf6f36560 [ovm_repo1tb]
  Vnic 1 = 0004fb0000070000fe61d3745c1e09c4 [00:21:f6:42:42:01]
  VmDiskMapping 1 = 0004fb0000130000e5ff03a3b8fe3a6b
OVM> show VmDiskMapping id=0004fb0000130000e5ff03a3b8fe3a6b
Command: show VmDiskMapping id=0004fb0000130000e5ff03a3b8fe3a6b
Status: Success
Time: 2012-07-19 14:53:32.845
Data:
  Name = 0004fb0000130000e5ff03a3b8fe3a6b
  Id = 0004fb0000130000e5ff03a3b8fe3a6b
  Slot = 0
  Emulated Block Device = false
  Virtual Disk Id = 0004fb000012000054b0f999972b7d64.img [BTCminer_01 (2)]
  Vm Id = 0004fb0000060000a0d318dfea1a8ecb [BTCM01_backup]
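Picking the IDs out of that output by hand does not scale. The sketch below extracts the VM ID from show VM output and the image file name from show VmDiskMapping output using sed. It assumes the one "Key = value" per line layout shown above, and the function names are my own.

```shell
#!/usr/bin/env bash
# Sketch: extract IDs from OVM CLI "show" output read on stdin.
# ASSUMPTION: one "Key = value" pair per line, as in the transcript above.

# From "show VmDiskMapping": print the <file>.img part of the
# "Virtual Disk Id = <file>.img [<name>]" line.
disk_image_from_mapping() {
  sed -n 's/^[[:space:]]*Virtual Disk Id = \([^ ]*\.img\).*/\1/p'
}

# From "show VM": print the value of the first bare "Id = <hex>" line
# (lines such as "Vm Id = ..." do not match because of the leading anchor).
vm_id_from_show() {
  sed -n 's/^[[:space:]]*Id = \([0-9a-f]*\)$/\1/p' | head -n 1
}
```

Piping the CLI output of the clone through these two functions yields exactly the two values needed for the copy step.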
I now have a stopped (and therefore consistent) clone of my running machine, and I know its machine ID and virtual disk image file. Sweet! I had already mounted the repository on my OVM server, so now I can copy the vm.cfg and the virtual disk to another filesystem (a plain local disk in my test case):
[root@ovm ~]# cp -pr /mnt/repository/VirtualMachines/0004fb0000060000a0d318dfea1a8ecb/ /var/www/html/ovmbackup/
[root@ovm ~]# cp -pr /mnt/repository/VirtualDisks/0004fb000012000054b0f999972b7d64.img /var/www/html/ovmbackup/
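The two cp commands generalize into a small function. This is a sketch, not a tested tool: the repository and destination paths are parameters, and the timestamped subdirectory and the .sha256 checksum file are my own additions so that a later restore can be checked against what was copied.

```shell
#!/usr/bin/env bash
# Sketch: copy a stopped clone's vm.cfg directory and disk image into a
# timestamped backup folder, plus a checksum of the image.
# The folder layout and checksum file are my own additions, not from the post.

backup_vm_files() {
  local repo="$1" vm_id="$2" disk_img="$3" dest_root="$4"
  local dest
  dest="$dest_root/$vm_id-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" || return 1
  # -p keeps ownership/modes/timestamps where possible, like the manual copy.
  cp -pr "$repo/VirtualMachines/$vm_id" "$dest/" || return 1
  cp -p "$repo/VirtualDisks/$disk_img" "$dest/" || return 1
  # Store a checksum next to the image so a restore can verify the copy.
  (cd "$dest" && sha256sum "$disk_img" > "$disk_img.sha256") || return 1
  echo "$dest"
}
```

For the clone above, the call would be backup_vm_files /mnt/repository 0004fb0000060000a0d318dfea1a8ecb 0004fb000012000054b0f999972b7d64.img /var/www/html/ovmbackup, which prints the folder it created.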
Of course, I was eager to see how restoring works…
OVM> importVirtualDisk repository name=ovm_repo1tb server=ovm01 url='http://ovmmgr/ovmbackup/backup.img'
Command: importVirtualDisk repository name=ovm_repo1tb server=ovm01 url='http://ovmmgr/ovmbackup/backup.img'
Status: Success
Time: 2012-07-19 15:51:12.712
OVM> create VM name=recoverytest repository=ovm_repo1tb domainType=XEN_PVM memory=1024 on Server name=ovm01
Command: create VM name=recoverytest repository=ovm_repo1tb domainType=XEN_PVM on Server name=ovm01
Status: Success
Time: 2012-07-19 16:04:06.091
OVM> create vmDiskMapping name=recoverMap1 slot=1 storageDevice=backup.img on vm name=recoverytest
Command: create vmDiskMapping name=recoverMap1 slot=1 storageDevice=backup.img on vm name=recoverytest
Status: Success
Time: 2012-07-19 16:05:53.694
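Since the restore boils down to three CLI commands, a script can generate them. The sketch prints the command sequence from the session above for a given repository, server, image URL and VM name. The derived mapping name (the VM name plus "Map1") and the 1024 MB default are illustrative choices, and the syntax is copied from the transcript rather than from Oracle's documentation.

```shell
#!/usr/bin/env bash
# Sketch: print the three OVM CLI commands used for the restore above.
# Syntax is copied from that transcript; the mapping name "<vm>Map1" and the
# 1024 MB default are illustrative choices, not requirements.

restore_commands() {
  local repo="$1" server="$2" url="$3" vm_name="$4" memory_mb="${5:-1024}"
  local image="${url##*/}"   # last path component of the URL, e.g. backup.img
  printf '%s\n' \
    "importVirtualDisk repository name=$repo server=$server url='$url'" \
    "create VM name=$vm_name repository=$repo domainType=XEN_PVM memory=$memory_mb on Server name=$server" \
    "create vmDiskMapping name=${vm_name}Map1 slot=1 storageDevice=$image on vm name=$vm_name"
}
```

The printed lines can be pasted at the OVM> prompt or fed to whatever non-interactive mechanism turns out to work. The virtual network still has to be assigned separately.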
I cheated a little bit and assigned the virtual network in the GUI, then stopped the original VM and started the recovered machine. Eureka! It came up just as I expected. All the pieces are now in place to build an automated backup process for our virtual machines.
The next step is to put the backup steps into a simple script so that it can run automatically. I would love to be able to use public key authentication with that SSH server. If that does not work, I’ll have to play with modifying the provided “expect” scripts to do what I want.
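A first cut of that script could chain the clone, the ID lookups and the copy. It assumes OVM_CLI names a command that runs one CLI command and prints its output (an ssh or expect wrapper, hypothetical here), that the repository is mounted locally as above, and that the VM has a single disk mapping. The clone is not deleted afterwards, and all function and variable names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of an automated clone-and-copy backup for one VM.
# ASSUMPTIONS: $OVM_CLI names a command (e.g. an ssh or expect wrapper, not
# shown) that runs one CLI command and prints its output; the repository is
# mounted locally; the VM has exactly one disk mapping. The clone is left in
# place afterwards. All names here are illustrative.

backup_vm() {
  local vm="$1" pool="$2" repo="$3" dest_root="$4"
  local clone out vm_id mapping_id disk
  clone="${vm}_backup_$(date +%Y%m%d)"

  # 1. Clone the running VM; the clone is stopped, so its files are static.
  "$OVM_CLI" "clone Vm name=$vm destType=Vm destName=$clone serverPool=$pool" >/dev/null || return 1

  # 2. Look up the clone's ID and its disk mapping.
  out="$("$OVM_CLI" "show VM name=$clone")" || return 1
  vm_id="$(printf '%s\n' "$out" | sed -n 's/^[[:space:]]*Id = \([0-9a-f]*\)$/\1/p' | head -n 1)"
  mapping_id="$(printf '%s\n' "$out" | sed -n 's/^[[:space:]]*VmDiskMapping 1 = \([0-9a-f]*\).*/\1/p')"

  # 3. Resolve the mapping to the disk image file name.
  out="$("$OVM_CLI" "show VmDiskMapping id=$mapping_id")" || return 1
  disk="$(printf '%s\n' "$out" | sed -n 's/^[[:space:]]*Virtual Disk Id = \([^ ]*\.img\).*/\1/p')"
  [ -n "$vm_id" ] && [ -n "$disk" ] || return 1

  # 4. Copy the vm.cfg directory and the disk image, as done by hand above.
  mkdir -p "$dest_root/$clone" || return 1
  cp -pr "$repo/VirtualMachines/$vm_id" "$dest_root/$clone/" || return 1
  cp -p "$repo/VirtualDisks/$disk" "$dest_root/$clone/" || return 1
  echo "$dest_root/$clone"
}
```

Keeping the CLI runner behind a variable means the same function works whether ssh with keys or an expect script ends up doing the talking, and it can be exercised with a fake runner before touching a real pool.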
I also don’t want to just trust the backup to work, especially since the clone is taken while the VM is running. In theory, the filesystem inside the VM should survive this crash-consistent state, but I want to make sure it really does. Plus, there are a ton of other things that can go wrong. So in addition to automating the backup process, I’d like to automate the recovery as well. The idea is to import the backup back into OVM, move its virtual network to a sandbox and boot the VM. We can then run a series of basic tests against it to check whether all the services the VM needs come back up the way they should. When this works, we have verified that our backup actually restores, and we also know how long a restore takes.
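For the sandbox checks, here is a minimal sketch: probe a list of TCP ports on the restored VM with bash's /dev/tcp and time the run. The host address and ports are placeholders, a port that accepts a connection is only a rough proxy for a healthy service, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: smoke-test a restored VM in a sandbox network and time the checks.
# Host and ports are placeholders; needs bash (/dev/tcp) and coreutils timeout.

# Succeed if a TCP connection to host:port opens within 3 seconds.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Check every port; print one line per port; fail if any port is closed.
smoke_test() {
  local host="$1"
  shift
  local failures=0 port
  for port in "$@"; do
    if port_open "$host" "$port"; then
      echo "ok   $host:$port"
    else
      echo "FAIL $host:$port"
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}

# Run any command and report its wall-clock duration on stderr.
timed() {
  local start=$SECONDS rc
  "$@"
  rc=$?
  echo "took $((SECONDS - start))s: $*" >&2
  return "$rc"
}
```

For example, timed smoke_test 10.0.0.50 22 80 (a made-up sandbox address) would report each port and how many seconds the checks took; wrapping the import and boot steps in timed as well gives the overall restore time.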