Virsto and Deduplication (first in a series)
Posted Thursday, May 20, 2010 in Technology 4 comments
Since our Virsto One announcement we have received a stream of questions regarding our data deduplication solution. This is understandable given the dedupe hype that media teams are pushing. Even the least sophisticated of hardware distributors feel compelled to put a "we support dedupe" message on the front pages of their data sheets; some even add yellow ribbons.
Virsto One does dramatically reduce the amount of storage consumed by virtual disks (VHDs in Microsoft parlance). However, Virsto One does not perform data deduplication as the industry has coined the term. While dedupe is a useful technique in many environments, including virtualization, the specific nature of virtualization workflows calls for what we at Virsto call no dupe. More on that later.
Deduplication 101
This post is not the best forum to provide a comprehensive guide on the diverse universe of data deduplication technology. For this discussion, I will call data dedupe any technology that analyses data content for the purpose of finding redundancies for subsequent reduction of required space in storage or bandwidth for transmission. Readers old enough to remember their high school years before MySpace may also remember a similar data reduction technique called compression. There are substantial differences between compression and deduplication but both techniques rely on analyzing data content.
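To make the working definition above concrete, here is a minimal sketch of content-based dedupe in Python. It is only an illustration under stated assumptions: the fixed 4096-byte block size, the SHA-256 fingerprint, and the names `dedupe` and `rebuild` are all hypothetical choices, not a description of any particular vendor's product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for this illustration


def dedupe(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep one copy of each distinct block.

    Returns (store, recipe): store maps a content hash to the block bytes,
    recipe is the ordered list of hashes needed to rebuild the data.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # The content of every block is examined to compute its fingerprint.
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the unique blocks and the recipe."""
    return b"".join(store[digest] for digest in recipe)


# Three identical blocks followed by one different block.
data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
store, recipe = dedupe(data)
```

The point to notice is that every block has to be read and fingerprinted before any redundancy can be found, which is exactly the content analysis the definition refers to.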
Virsto One does not use data deduplication in the above sense. We do not look inside data blocks as they pass from virtual machines to I/O channels. (We don't even touch the data, but that's a topic for another post.)
Introduction to Virsto's thinking on VM duplication
Virsto One is built on the realization that in a virtual environment the overwhelming majority of virtual disks are not independent objects. Rather, they are derivatives of each other.
This is an important insight, so let's think about it for a few minutes.
Virtual machine images are fundamentally different from disk images in the physical server world. The magic of virtualization is that now a server (or desktop) is just a big file that is formatted to look like a disk (VHD, VMDK, or other format). This paradigm shift is enormously powerful, and enables all kinds of cool things like live migration of virtual machines across physical servers. It also completely changes the workflow of provisioning servers or desktops.
A new virtual machine is normally – in fact, almost always – created (or provisioned) as a copy of another reference VM. Such reference images are referred to as templates or golden images. As an example, a golden image may contain a reference installation of an operating system with certain applications.
Want a new VM that is similar to one that you created earlier? Just copy the image, simple. Remember, since that VM is just a big file that looks like a disk, making that new VM image means making a copy of the original golden image. Lots of people call that cloning, although you should beware that not every vendor means exactly the same thing when they say "clone".
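The copy-the-file workflow described above can be sketched in a few lines of Python. This is a toy illustration, not a provisioning tool: the file names, the 1 MiB stand-in image, and the helper names `provision_full_copies` and `total_bytes` are assumptions made for the example.

```python
import os
import shutil
import tempfile


def provision_full_copies(golden_path: str, target_dir: str, count: int) -> list:
    """Create `count` independent VM images by copying the golden image file."""
    paths = []
    for i in range(count):
        clone_path = os.path.join(target_dir, f"vm-{i:03d}.img")
        shutil.copyfile(golden_path, clone_path)
        paths.append(clone_path)
    return paths


def total_bytes(paths) -> int:
    """Sum the file sizes of the given images."""
    return sum(os.path.getsize(p) for p in paths)


with tempfile.TemporaryDirectory() as workdir:
    golden = os.path.join(workdir, "golden.img")
    with open(golden, "wb") as f:
        f.write(b"\0" * 1024 * 1024)  # 1 MiB stand-in for a multi-gigabyte image
    clones = provision_full_copies(golden, workdir, 10)
    golden_size = os.path.getsize(golden)
    clones_size = total_bytes(clones)
```

Ten full copies occupy ten times the size of the golden image, even though at the moment of creation every byte in them is a duplicate.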
What is a clone anyway?

In everyday non-technical conversations, when we hear the word "clone", we think of a separate, independent copy of something. The copy is its own entity with a life of its own, but at the moment of creation, the clone is indistinguishable from its antecedent.
Cloning implies repeatability. If you can make one clone, why not two? Ten? A thousand?
An important concept is that each instance is an individual and can change over time. So a year later, those thousand clones may look very similar to each other, but they are no longer identical. Some of the clones may differ in only very small ways, while others may have more noticeable variations.
This commonsense connotation of a clone is exactly the right way to think of VM image clones. Keep this imagery in mind.
The germ of an idea
For fun, you might want to think about how this notion might suggest a vastly different and better storage architecture for VM environments. If you do, you might start to understand why we started Virsto Software a couple of years ago.
In the next post, we'll discuss how this notion of cloning relates to the way virtual machines are provisioned and stored, and how Virsto One optimizes the process. In part three of this triptych, we'll compare and contrast dedupe against no dupe.





Comments
Faras Namus 7:31am PST on May 21st, 2010
Alex,
When you create a clone, you know who you are cloning. In essence it is like fork() for a process. You share pages with COW and only make a copy of pages which are modified. You could do the same thing even when someone asked you to create a complete clone (in VMware parlance). Linked clones even today only store delta differences. I am not sure what else you can do if you don’t look at content.
However, if you had N gold images to start with, i.e. images that were initially created by hand (not cloned), you cannot find duplicate blocks without looking at content. It is debatable what such redundancies could be, but I have seen lots of similar blocks between XP and Server 2003 VMs that were created by hand. How would Virsto deal with such duplication?
Similarly you could argue that you could write the exact same bytes in a clone as in the parent, but if you don’t look at content, you will store them as two different data blocks.
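The copy-on-write sharing described above, and its blind spot, can be sketched with a toy model in Python. This is purely illustrative and is not how VMware linked clones or Virsto One are actually implemented; the class names `BaseImage` and `LinkedClone` and the tiny byte-string "blocks" are assumptions made for the example.

```python
class BaseImage:
    """Read-only parent image: a list of blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def read(self, index):
        return self.blocks[index]


class LinkedClone:
    """Clone that shares the parent's blocks and stores only its own writes."""

    def __init__(self, parent):
        self.parent = parent
        self.delta = {}  # block index -> privately written block

    def read(self, index):
        # Prefer the clone's own copy; fall back to the shared parent block.
        if index in self.delta:
            return self.delta[index]
        return self.parent.read(index)

    def write(self, index, block):
        # No content comparison: every write gets its own block in the delta.
        self.delta[index] = block


golden = BaseImage([b"boot", b"os", b"apps"])
clone_a = LinkedClone(golden)
clone_b = LinkedClone(golden)
clone_a.write(2, b"apps-v2")
clone_b.write(1, b"os")  # same bytes as the parent, still stored separately
```

Unmodified blocks are shared because their origin is known, but `clone_b` ends up with a private copy of a block whose bytes are identical to the parent's, since nothing in this scheme looks at content.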
Would be glad to hear your thoughts.
- Faras
Alex Miroshnichenko 5:52pm PST on June 2nd, 2010
Faras,
Thank you for your comment, and I apologize for not responding to it sooner. I have just posted the second part in the series on Virsto clones and I hope it will clarify some of the issues you raised. In particular, it explains that we do not use CoW in our product.
I do agree with you that any scheme based solely on tracing data block origins cannot recognize duplicate data in independently created data blocks. I would argue, however, that in a large-scale virtualized environment this is not the most pressing issue compared to the VM storage sprawl caused by massive provisioning from a limited set of golden images. Virsto One solves this problem today. Going forward, we do have a technology for dealing with data duplication in independently created blocks.
I hope that you will continue to follow my posts and my reasoning. I am looking forward to your comments.
Thank you,
Alex.
Faras Namus 6:32am PST on June 7th, 2010
Alex,
I read your comment as well as the 2nd post. I agree that a majority of redundancy is due to provisioning from N golden VMs.
Does Virsto today have a tool to import my virtual image library intelligently (i.e. form its clone representation by examining the current VHD linkages), or do I need to import gold images and clone them using your UI once again?
I also see the other benefit of the clones being usable even if the parent has been modified. This is not the case with delta-disks and linked-clones today.
Do you have people testing Virsto in VDI kind of environments? How frequently do they have to build gold masters and reprovision desktops? How do they deal with SP updates etc.? Say you started with XP SP2 and cloned 1000 desktops: do they let XP SP3 updates come in on every desktop (making your dedup/nodup ineffective), or do they make a new gold image, zap old clones and make new ones all over again?
On a slightly divergent note, I had a question about the ‘IO Blender’ solution part of your software. How do ‘IOs from applications in an OS’ intrinsically differ from ‘IOs from different VMs to a hypervisor’? I ask because most modern operating systems have multiple IO schedulers available. Why aren’t such IO schedulers suitable for a hypervisor too? Did you have to do anything fundamentally different? I would also argue that in a hypervisor environment, you should disable any IO scheduling in the guests.
Would be glad to hear your opinions.
Faras
Alex Miroshnichenko 9:07am PST on June 7th, 2010
Faras,
Answer to the first question: the current version of Virsto One does not have a tool to import multiple similar images and perform data dedup while doing so. We do have a quick import feature, but it will treat each external image as a separate object.
The feature you are describing is on our roadmap, however. I interpret your question as additional validation of our roadmap.
Can you elaborate on your vision of the value of our clones when the parent is modified, above and beyond what we have already discussed? Thank you in advance.
VDI: as you correctly noticed, Virsto One can offer huge benefits in VDI environments. We are in the process of putting together a large-scale test bed with a customer to quantify these benefits.
The I/O Blender question: it probably deserves a separate blog post. To make a quick summary: you are correct, multiple applications inside an OS (real or virtual) do generate multiple independent I/O streams. However, they all go through common I/O layers (file system, page cache, and so on). These layers are designed to reorder and optimize the application I/O streams. A hypervisor (without Virsto, of course) just blends multiple optimized streams from different VMs in a random fashion.
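The blending effect summarized above can be illustrated with a small simulation in Python. It is a deliberately simplified sketch under stated assumptions: each VM's stream is modeled as perfectly sequential block addresses, the blending is modeled as round-robin interleaving (chosen so the result is deterministic, where a real hypervisor would mix requests less predictably), and all function names are hypothetical.

```python
def sequential_fraction(addresses):
    """Fraction of requests that directly follow the previous request's address."""
    if len(addresses) < 2:
        return 1.0
    hits = sum(1 for prev, cur in zip(addresses, addresses[1:]) if cur == prev + 1)
    return hits / (len(addresses) - 1)


def vm_stream(start, length):
    """One VM's already-optimized stream: ascending, contiguous block addresses."""
    return list(range(start, start + length))


def blend_round_robin(streams):
    """Interleave the streams one request at a time, as a stand-in for blending."""
    merged = []
    for requests in zip(*streams):
        merged.extend(requests)
    return merged


# Four VMs, each reading 100 contiguous blocks from its own region of the disk.
streams = [vm_stream(vm * 1_000_000, 100) for vm in range(4)]
per_vm = [sequential_fraction(s) for s in streams]
blended_fraction = sequential_fraction(blend_round_robin(streams))
```

Each stream on its own is fully sequential, yet after interleaving none of the requests follows its predecessor on disk, so the storage underneath sees what looks like random I/O.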
Again, this is a very interesting topic; I’ll definitely make it into a separate post or maybe even a series of posts.
Thank you very much for the productive discussions; I am looking forward to continuing them.
Alex.