Tuesday, June 5, 2007

Deep Copy vs. Shallow Copy

The concept of deep vs. shallow copy is one of the decisions that a designers/programmers make on a daily basis yet in my tutoring experience I've seen a lot of people who don't know it by this name or aren't familiar with all the issues & techniques that relate to this issue. This topic is also one of the issues that needs to be thought about and decided when a designer want to use the prototype design pattern.

The question of deep vs. shallow copy usually pops up when a designer has to create a copy (or clone) method for a class that he/she has designed. When creating the clone method (especially when it relates to the prototype pattern) the designer has to decide between these scenarios:

  1. The clone method is only going to create an object exactly the same as the current object but no data copying happens (the simple case of the prototype pattern).
  2. The clone method is going to create an object exactly the same as the current object and is also going to copy the current object's data into the new object.

The second decision will create the deep vs. shallow dilemma. To show the difference let's examine a simple example:



As the example above illustrates the car's member data are body and engine. Body has its own member data, so does engine but the Engine class is also specialized with two different sub-classes. Now suppose we have created an instance of the Car class, filled it with the right information and now we want to copy it. For example:



Car c1 = new Car();
c1.MaxSpeed = 240;
c1.body = new Body();
c1.body.NumOfDoors = 2;
c1.engine = new FastEngine();
c1.engine.CapacityInLiters = 2.0;
Car c2 = MakeCopyOf(c);

The MakeCopyOf() method is responsible for copying the car. Or better yet if our Car class has implemented the ICloneable method then the above line of code will look like this:

Car c2 = (Car)c.Clone();


 

Now the question of how to implement the Clone (or the MakeCopyOf) method. This is where deep and shallow copies come into play. If you implement the Clone method like this:



class Car : ICloneable {
public Body body;
public Engine engine;
public int MaxSpeed;

public object Clone() {
Car c = new Car();
c.MaxSpeed = this.MaxSpeed;
c.body = this.body;
c.engine = this.engine;
return c;
}
}

You have implemented the shallow copy method. It's called shallow copy because your copying only makes sure that the new object has a copy of the current object's memory space. It doesn't really care what the current object's memory space is composed of and if it contains references to other objects it doesn't try to duplicate those other objects either. To explain this better, take a look at the following diagram. It shows the original Car object (c1), its memory space and the fact that the memory space is filled with data (MaxSpeed) & references (body & engine).



Now when you clone the above object and get c2 using the method described above we are only copying whatever is in the c1 memory space into the c2 memory space. This will mean that the data are being copied fresh into the new space, so are the references, but when we manipulate the data held in an object being referenced by the new c2 we are actually changing the same data that c1 is using too. So if I write this code after creating the c2 object:

c2.MaxSpeed = 300;
c2.engine.CapacityInLiters = 2.5;

The change to MaxSpeed is only a change to the c2 object but the change to the c2 engine's capacity also affects c1. The following diagram better explains this:



As can be seen in the above diagram in the case of a shallow copy the objects referenced by the first object are shared between the two objects.

Now this can be a very good thing in many situations: Imagine you have a Receipt object and that Receipt object is referencing a Customer object (for which the receipt was created for), then you make a copy of the Receipt basically to change a few items, prices, etc. and re-save it as a new receipt. In this case you don't want a separate Customer object but you want the Customer object to be shared between the two receipts. That's what the requirements and the problem domain are dictating to us so we use Shallow copy.

As most of you .net developers know there is also a an easier way to implement a shallow copy and that's to use the MemberwiseClone() method. This protected method which is implemented at the object level can be used to create an exact copy of the current object (but a shallow copy) so when all we need is a shallow copy the Clone method is usually implemented like this:


class Car : ICloneable {
public Body body;
public Engine engine;
public int MaxSpeed;

public object Clone() {
return this.MemberwiseClone();
}
}

Other situations do occur in design where a shallow copy will not do, we want the full graph of objects to be copied over and a new graph created so any change to any object in the graph will not affect the previous copy. Suppose in the previous example we wanted to make a copy of the receipt object with all other objects attached so we are sure that we can recover any changes made to the original objects by other pieces of code. In this case we need to perform a deep copy. The deep copy will ensure that the full graph of objects will be copied and a new object graph created.

Going back to the previous car example once we do a deep copy the second car would be a new object with a new engine etc. So changes to the direct fields of the car object or any of the objects it points to (or the objects they point to etc. etc.) are going to be separate from the original copy. So the following piece of code will only change c2's engine capacity:


c2 = c1.DeepCopy();
c2.MaxSpeed = 300;
c2.engine.CapacityInLiters = 2.5;

Now assuming that all the objects in the graph we are going to clone (using deep copy) have implemented the ICloneable interface we can implement our clone method like this:


/* Deep copy clone */
class Car : ICloneable {
public Body body;
public Engine engine;
public int MaxSpeed;

public object Clone() {
Car c = (Car)this.MemberwiseClone(); //create a copy so everything comes across
c.engine = (Engine)engine.Clone(); //for properties that are objects call their clones to get a deep copy
c.body = (Body)body.Clone(); //for properties that are objects call their clones to get a deep copy
}
}

Obviously the above code will only produce a deep copy if the clone methods of engine & body also perform a deep copy. So usually when we have related objects the decision to perform a shallow or deep copy is one that has to be made for that graph of object together. As a designer you might even face situations where you would prefer to implement both deep and shallow copy in one class and use the appropriate clone as needed.

When deciding to use deep copy there is usually a problem to consider: If the graph of objects we want to perform a deep copy on have a cycle implementing the deep copy isn't going to be easy. A cycle occurs when multiple objects reference each other so that these references create a closed circuit. This could be as simple as two objects pointing to each other or a complex path between objects which eventually turns around and points at the original object. Usually when we have two way referencing linked lists or trees we are faced with this issue or any other fairly complex data structure where direct traversal between the elements in the structure in both ways is needed. For example let's say we have a Company class which references multiple Department objects which in turn hold multiple Employee objects. But for some reason the designer has decided that each Employee should also know (directly) which company it works for. So you have a design like this:



Now if we implement a deep copy where each object is going to clone itself by calling the clone of the objects it holds reference to, we will get into a never ending loop and obviously this solution will not work.

As far as deep copy complexity is concerned the above issue also happens when you don't have a cycle but have multiple objects referencing the same object, so again in the above example suppose we had department holding a collection of its employees and the company also holding a collection of all its employees. The employee objects where referenced by the department and by the company so if we tried to clone a graph of objects starting at company we would get two sets of employees (the ones that the departments reference and the ones that the company references). Obviously this wasn't what we wanted either, we wanted one company object referenced by both the company and the department it belongs to.

In these scenarios the designer must either change the way he/she is looking at the problem and consequently the solution. In other words come up with a solution where a deep copy is not needed or we can recalculate one of the links based on some other data therefore not needing Clone across that link and breaking the infinite loop. Alternatively he/she can accept the little more complex scenario of implementing deep copies as described below:

Another more elaborate (and more costly) solution would be to pass a context object along to every clone method. The context object will hold all the original objects that have already been cloned and what they have been cloned to. This way wherever in the cycle of objects we start a clone the whole network will be cloned once and only once. A perfect context for these scenarios is the Dictionary generic collection (or the Hashtable class). So for example the clone of the above mentioned Employee class would look like this:



class Employee : ICloneable
{
/* other properties that we don’t care about */
public Company company;
public object Clone()
{
return Clone(new Dictionary()); //create an empty context and start cloning
}
public Employee Clone(Dictionary context)
{
Employee e = this.MemberwiseClone();
context.Add(this, e);
If (context.ContainsKey(company))
e.company = (Company)context[company];
else
e.company = (Company)this.company.Clone(context);
return e;
}
}

Now before I get into the other classes' code (which would be very similar to the above) let's see what happened. The original Clone basically means that a client code has called and started the clone recursive call. This original clone will create an empty context and call the clone that is context aware. The context aware clone will lookup the company object (the object that might be the cause of the infinite loop) in the context, if it doesn't exist it means that it hasn't been cloned yet, so clone it and use the newly created company in the output, otherwise just use the previously cloned object that is stored in the context. As can be seen the context will get filled by each object after it creates a shallow copy of itself (before getting into anymore clones). This is the only place where we can be sure that this object hasn't been used anywhere else (yet) and it's immediately after the new object has been created (albeit incomplete: it's still only a shallow copy) so let's add the old object (which other objects might be referencing) to the context alongside it's new version, since the reference to the new version isn't going to change it's not going to cause a problem for others and we will complete the new version's other members (such as company in this example) afterwards.

Let's take a look at the other classes:


class Company : ICloneable
{
/* other member variables */
public List departments = new List();
public object Clone()
{
return Clone(new Dictionary());
}
public Company Clone(Dictionary context)
{
Company c = this.MemberwiseClone();
context.Add(this, c);
c.departments = new List();
foreach(Department d in departments)
{
if (context.ContainsKey(d))
c.departments.Add((Department)context[d]);
else
c.departments.Add(d.Clone(context));
}
return c;
}
}

class Department : ICloneable
{
/* other member variables */
public List employees = new List();
public object Clone()
{
return Clone(new Dictionary());
}
public Department Clone(Dictionary context)
{
Department d = this.MemberwiseClone();
context.Add(this, d);
d.employees = new List();
foreach(Employee e in employees)
{
if (context.ContainsKey(e))
d.employees.Add((Employee)context[e]);
else
d.employees.Add(e.Clone(context));
}
return d;
}
}

The above code guarantees the fact that if you call Clone() on any object in the above graph the whole graph will get duplicated without creating any object more than once.

Let's also examine another alternative to the above deep copy problem:

Readers familiar with .net's serialization technology know that the above deep copy can also be achieved among object graphs which have serializable objects. Just serialize one of the objects in the graph and then deserializing it. In other words you can create a deep copy of objects without the need for the ICloneable interface and the time consuming implementation mentioned above by simply using the technology available (the only drawback being the fact that the serialize/deserialize technique is a lot slower than the above code.

So let's say we had the following three classes:


[Serializable]
class Company
{
/* other fields */
public List departments = new List();
}
[Serializable]
class Department
{
/* other fields */
public List employees = new List();
}
[Serializable]
Class Employee
{
/* other fields */
public Company company;
}

So assuming we have the above serializable classes we can create a deep copy of any graph of the above design using serialization:


Company c = //assume we have a fully loaded company with departments and employees;
MemoryStream ms = new MemoryStream();
new BinaryFormatter().Serialize(ms, c);
byte[] data = ms.ToArray();
ms = new MemoryStream(data);
Company c2 = (Company)new BinaryFormatter().Deserialize(ms);
 

Three problems exist with the above technique:

  1. It's slow.
  2. All objects have to be serializable.
  3. Since we are using .net's serialization routines all attributes (or at least attributes that has been marked as serializable) will be copied while in certain custom deep copy implementations we might want to control what gets copied and what is reset in the new object.

1 comment:

Unknown said...

This formula was very interesting for learn to hire php programmer for programming so thanks allot for shearing this blog.