Monday, October 24, 2011

Visitor Pattern in C++ - Intuitive understanding.

Today we will discuss Visitor pattern in an intuitive way as on the face of it, it would seem pretty intricate. First off, we need to understand the scenarios, in which Visitor would be of great helpful. Let me start with basic examples and move on to complex details of Visitor pattern.

When we design any container, our focus should be on data-structure which holds the data but not on algorithms which work on contained data. For instance, STL containers had been designed in such a way that algorithms can be developed independent of data containers. STL achieves this by the concept of iterators and by the uniformity of type of contained elements. Every container that supports iterators, can be used seamlessly in algorithms.

What if the contained elements are of disparate types? Now iterators wouldn't help this situation as types are varying. More importantly, if the contained elements need to be treated differently for different types, even deriving from a same base class wouldn't be much helpful. Consider an example of Shape objects like rectangle and Ellipse. Even though both can be described by Bounding rectangles, calculation of area would differ. So, a common algorithm for area is not possible, even though both are Shape objects. This is one of the scenarios in which Visitor can help.

In a nutshell, Visitor pattern can be used to add a new algorithm to work on a container without modifying the container itself. Now, let us translate the above line in OOP way.

Visitor is a way to add a new behavior to the existing class without modifying the class itself.

Perhaps, this is one more to way to achieve Open-Closed principle of OOP, Let us explore.

Let us implement some trivial examples to proceed further. Implementations are not of production quality; proper memory management is not there. This is just to explain the concepts.

Let us define an interface, IShape from which Rectangle and Ellipse are derived as given below.

struct Bounds
{
Bounds() : left(0), right(0), top(0), bottom(0)
{

}
Bounds(int ileft, int itop, int iright, int ibottom) : left(ileft), top(itop), right(iright), bottom(ibottom) { }
int left;
int right;
int top;
int bottom;
};

class IShape
{
public:
virtual Bounds GetBounds() = 0;
virtual void Draw() = 0;
};

class Rectangle : public IShape
{
public:
Rectangle(Bounds bounds) : currentBounds(bounds)
{

}
Bounds GetBounds()
{
return currentBounds;
}
void Draw()
{
// Draw Rectangle here.
}
private:
Bounds currentBounds;
};

class Ellipse : public IShape
{
public:
Ellipse(Bounds bounds) : currentBounds(bounds)
{

}
Bounds GetBounds()
{
return currentBounds;
}
void Draw()
{
// Draw Ellipse here.
}
private:
Bounds currentBounds;
};

Let us define a Graphics Designer, a composite class for Shapes. At Run-time several shape objects could be added to this toy Graphics Designer and Draw can be invoked to draw all the elements.

class GraphicsDesigner : IShape
{
public:
void AddRectangle(Bounds inBounds)
{
currentElements.push_back(new Rectangle(inBounds));
}

void AddEllipse(Bounds inBounds)
{
currentElements.push_back(new Ellipse(inBounds));
}

void Draw()
{
// Enumerate through the currentElements and Draw each.
}

private:
vector<IShape*> currentElements;
};

Given a class like GraphicsDesigner, How to add a new algorithm to work on contained elements! For instance, we need to find the Minimal rectangle which covers all the elements. In order to achieve this, we would need to add a method in container which enumerates through the elements and execute the algorithm. This violates OOP. And every time, we need to implement a new algorithm, we would end up adding new methods into the container class.

Algorithms are getting bound to the data container.

We will analyse further to understand the problem well.

Adding methods to the container is not a good idea as algorithms just depend upon the contained elements and not on the state of the container.

Needed algorithms can't be defined before hand, just think of possible algorithms on Integers containers, finitely huge, right.

Given that, we need to find out a way to dynamically add the behaviors on to the containers. Without further ado, we will define Visitor pattern and see how it resolves this.

For every algorithm, we should derive a class from Visitor interface; Visitor interface should contain a collection of overloaded Visit methods one for each different types of Visit-able classes. Visit-able classes as per our example, are Rectangle and Ellipse(and general Shape also). Every visit-able class should define a method "Accept".

Let me define a simple PrintVisitor to clarify the needs of all these interfaces.

Before that, given below are the example classes after the changes, made with Visit and Accept interfaces.

class Rectangle;
class Ellipse;

class IShapesVisitor
{
public:
virtual void Visit(Rectangle*) = 0;
virtual void Visit(Ellipse*) = 0;
};

class IShape
{
public:
virtual Bounds GetBounds() = 0;
virtual void Draw() = 0;
virtual void Accept(IShapesVisitor* visitor) = 0;
};

class Rectangle : public IShape
{
public:
Rectangle(Bounds bounds) : currentBounds(bounds)
{

}
Bounds GetBounds()
{
return currentBounds;
}
void Draw()
{
// Draw Rectangle here.
}
void Accept(IShapesVisitor* visitor)
{
visitor->Visit(this);
}
private:
Bounds currentBounds;
};

class Ellipse : public IShape
{
public:
Ellipse(Bounds bounds) : currentBounds(bounds)
{

}
Bounds GetBounds()
{
return currentBounds;
}
void Draw()
{
// Draw Ellipse here.
}
void Accept(IShapesVisitor* visitor)
{
visitor->Visit(this);
}
private:
Bounds currentBounds;
};

class GraphicsDesigner : IShape
{
public:
void AddRectangle(Bounds inBounds)
{
currentElements.push_back(new Rectangle(inBounds));
}

void AddEllipse(Bounds inBounds)
{
currentElements.push_back(new Ellipse(inBounds));
}

void Draw()
{
// Enumerate through the currentElements and Draw each.
}

Bounds GetBounds() { return Bounds(); }

void Accept(IShapesVisitor* visitor)
{
vector::iterator shapeItr = currentElements.begin();
for(; shapeItr != currentElements.end(); shapeItr++)
{
IShape *val = (*shapeItr);
val->Accept(visitor);
}
}

private:
vector<IShape*> currentElements;
};

Now, we will define a simple PrintVisitor to show the power of Visitor pattern.

class PrintVisitor : public IShapesVisitor
{
public:
void Visit(Rectangle* inShape)
{
cout << "This is a Rectangle" << endl;
}

void Visit(Ellipse* inShape)
{
cout << "This is an Ellipse" << endl;
}
};

Of course, "Print" is so trivial to be considered as an algorithm which works on the data. But it clearly avoided the need for having "Print" virtual function in Shape derived classes. Now let us think of a decent algorithm which could work on the data. Let us assume, we need to find the total area occupied by all shapes in GraphicsDesigner. Without Visitor pattern, we would have to add up a new member function in GraphicsDesigner class. But with Visitor pattern, things become very easy that we need to add a new class in Visitor hierarchy as given below.

class TotalAreaVisitor : public IShapesVisitor
{
public:
TotalAreaVisitor() : TotalArea(0.0)
{
}
void Visit(Rectangle* inShape)
{
Bounds bounds = inShape->GetBounds();
int width = bounds.right - bounds.left;
int height = bounds.bottom - bounds.top;
int currentArea = width * height;
TotalArea += currentArea;
}

void Visit(Ellipse* inShape)
{
Bounds bounds = inShape->GetBounds();
int width = bounds.right - bounds.left;
int height = bounds.bottom - bounds.top;
double currentArea = width * height * 3.14 / 4;
TotalArea += currentArea;
}
double GetTotalArea()
{
return TotalArea;
}
private:
double TotalArea;
};

I hope now the need for Visitor pattern is very clear. For every new algorithm, instead of adding a new method in container class, we can add a new visitor. This helps us to achieve one of the SOLID principles, Open-Closed principle. There are disadvantages also like a new Shape derived can't be easily added as it needs modification of all possible Visitors.

So, Visitor pattern trades off the ease of adding new behaviours into Visit-able hierarchy with not easily allowing to create a new derived class in Visit-able hierarchy.

Thanks for Reading.

Friday, October 21, 2011

Bloom filter implementation in C++ - NikBloomFilter

Bloom filter can be considered as a succinct representation of a set of elements. Set of elements might be stored in files or in Database tables, etc. And a set may support operations like insertion, deletion and retrieving elements.

For this discussion, Let us assume a dictionary of passwords which is a set of huge number of strings which is stored in several files according to their starting character like A-List, B-List,etc. We assume that we don't keep all this information in Main memory. So, in order to find a string's presence in the dictionary, Program has to fetch the right file according to the starting character and keep it in main memory for a Binary search. Generally, we don't require frequent insertions and deletions from this set, but queries for the presence of an element. Only way to make the queries fast is to cache complete set of elements in Main memory. But there are some scenarios in which caching is unaffordable.

Bloom filter, as I said early, is a succinct data-structure which registers the presence of elements in a set. While we insert an element in a set, Bloom filter must be updated. Bloom filter is basically a bit array. A small subset of bits in that array would correspond to an element in the set. So, when an element is inserted, corresponding bits should be set. Now query is simply a check for the corresponding bits state; if all the bits corresponding to that element are set, then query would return true. But this true may not be a real true. Since every element maps to a subset of bits in the bit array, overlapping may occur and which could cause a false positive. Before going further, we will analyse this structure mathematically.

Bloom filter is a probabilistic data structure. When an element is inserted, it has to set certain bits in the bits array. In order to find out what are all the bits to be set, we need to hash the content of an element. Now, hashed elements could be used to construct the indices of bits to be set. For instance, MD5 of a byte array would result in 16 bytes which could be used to construct 4 indices. We will get into implementation details later in this discussion. As we discussed early, this data structure may give false positive. Luckily, we could be able to control this by trading off space.

Let us get into the analysis. Let us assume the size of the bloom filter, "M" bits and "N" elements had been inserted into the set, hence in Bloom filter also.

Now, probability of a bit B[i] to be unset or zero is (1 - 1/M) ** kN. Here ** represents "power of" operator. k is the number of bits to be set for an element in a set. The above formula could be easily derived as follows: Since N elements are inserted, setting some bit in the bloom filter would have happened for kN times. One particular bit not set in the first time is M-1 / M, for two consecutive times, is (M-1 / M) * (M-1 / M) and so on. Underlying assumption in this, is setting bits are statistically independent. Asymptotically, this is true. So, we will assume so in this discussion.

P{ B[i] == 0] } is (1 - 1/M) ** kN, approximately e ** (-kN/M).

False positive is a condition in which bloom filter tells a string present in the set which is not actually present. Cause of this condition is the bits corresponding to the string set by other independent strings and due to cumulative effect of that, as described earlier in this discussion. Probability of false positive is all the bits corresponding to the string are set which is,

(1 - e ** (-kN / M) ) ** k => P{ False Positive }
Since the false positive probability depends on the factors k, N, M, we need to minimize the function ( taking derivative and equating to zero) with respect to 'k'. Analysis shows this function becomes minimum if 'k' becomes (M / N) * ln 2. ln 2 is natural logarithm of 2.

If we assign 'k', the value (M/N) * ln2 in the False positive probability function, we will get,

P{ False Probability } = (1 - e ** -(ln 2) ) ** k = (1/2) ** k or (0.6185) ** (M/N).

The above result is the lynch-pin of the Bloom filter implementation, as it gives the way to control False positives. For instance, if we wish to have F.P of 0.2 then M/N must be around 8. "N" is the number of elements in the set. Given that, "N", bit-set size of the Bloom filter, could be easily calculated.

An implementation of Bloom filter in C++ has been done and checked in to the Git-hub.

This implementation doesn't have support for Deletion and merging with some other Bloom filter as it is the initial release. Any comments on this is welcome.

Thanks for Reading.

Saturday, October 15, 2011

Prototype Pattern in C++, Dynamic instantiation.

Before starting the discussion of Prototype pattern, We will understand the problem of Dynamic instantiation.

Given a typename ( a class name ) in C++ as a string, how an object can be created of that type?

First off, there is no language level support for dynamic instantiation in C++ unlike Java and C#. Since in C++, there is no common base class for all classes, dynamic instantiation support is generally not possible. MFC, a C++ framework provides support for Dynamic instantiation as all classed are derived from CObject. Let us try in the same line, making use of common base and Prototype pattern to achieve Dynamic instantiation.

Prototype pattern imposes a class to have a method, Clone, whose sole purpose is to generate a new object of same type. With a clone-able object, we could be able to generate a new object of same type. In order to achieve this, all classes must be derived from a common base class. And in that common base class, we can have a pure virtual member function(interface), Clone. So, all classes derived from that, would be clone-able.

Let us call that common base class "Object" as in Java. My minimalistic implementation of the Object is as given here,

class Object
{
public:
virtual ~Object() {}
virtual Object* Clone() const = 0;
static Object *MakeObject(std::string type);
static void AddNewClassInfo(std::string type, Object* in);
static std::map objectsTable;
};

Clone is the key method of this abstract class. Implementation of the same could be like the one given below.

class Derived : public Object
{
Object *Clone() { new Derived(); }
};

So, all the classes derived from Object can be cloned using this method call. In order to "Clone" an object, we should have one initial object, prototypical instance. And we should be able to get that base prototypical instance from the type name. If we could be able to do this, then Making an object using type-name alone is a cake-walk. "objectsTable" map in Object class holds the aforesaid mapping; a map of type-name and a prototypical instance.

Adding a prototypical instance can be done through the method, "AddNewClassInfo", whose implementation can be like this.

// Populate map with type-name and its corresponding prototype.
void Object::AddNewClassInfo(std::string type, Object *in)
{
objectsTable[type] = in;
}

Now the implementation of "MakeObject" must be easily understandable.

Object* Object::MakeObject(std::string type)
{
std::map::iterator itr;
itr = objectsTable.find(type);
if( itr != objectsTable.end())
{
return itr->second->Clone(); // Clone the prototypical instance.
}
else
return NULL;
}

All the essential implementations of "Object" is done. Since it has a static member, it should be defined in CPP file as given here,

std::map Object::objectsTable;

OK. Now let us create a derived class from Object and test this implementation.

#include "Object.h"

class DynamicInstantiable : public Object
{
public:
DynamicInstantiable() {}
void SayHello()
{
std::cout << "Hello! " << std::endl;
}

Object* Clone() const
{
return new DynamicInstantiable();
}
};

The above class has the implementation for Clone method and is derived from Object. Now, this type must be added in Static Map in "Object" class using Object::AddNewClassInfo. We have several options here. But I would like to keep things simple. So, I have added a new header file which has a global function "InitializeDynamicObjects". And this method must be called in "main()" function in the very first line itself.

A sample implementation is given below.

void InitializeDynamicObjects()
{
Object::AddNewClassInfo("DynamicInstantiable", new DynamicInstantiable());
}

Here is the sample Main function, I have written to test the code.

#include "DynamicInstantiable.h"
#include "ObjectInitializer.h"
using namespace std;

int main()
{
InitializeDynamicObjects();

DynamicInstantiable* newInst = dynamic_cast(Object::MakeObject("DynamicInstantiable"));

if( newInst != NULL)
{
newInst->SayHello();
}

cout << "Hello World" << endl;
}


I hope this post explains Prototype clearly. Comments and Queries are welcome.

Thanks for Reading.

References:

Friday, October 14, 2011

Curiously recurring template Pattern, CRTP - Static Polymorphism in C++

Let us start this discussion with some minimalistic C++ classes, which doesn't convey any meaning but serves the purpose of discussion.

class Base
{
public:
void PrintMe()
{
std::cout << "Print: Base" << std::endl;
}
};

class Derived : public Base
{
public:
void PrintMe()
{
std::cout << "Print: Derived" << std::endl;
}
};

void TestPolymorphism()
{
Base *ptrBase = new Derived();
ptrBase->PrintMe(); // This would print "Print: Base"
}

In the above code, ptrBase is initialized with an object of Derived. Ideally, PrintMe should have printed "Print: Derived". But it would print "Print: Base". Any C++ programmer could identify the issue; early binding according to the type. It means when we say, ptrBase->PrintMe(), compiler would check whether the function called is virtual or not. If it is not virtual, it would bound this call to the address of function defined in the calling type; In this case calling type is Base.

Dynamic binding can be achieved using "Virtual" specifier, in method declaration. When we mark a function with Virtual, compiler wouldn't make early binding; it would resolve the same using V-Table. Every class which has a virtual method, would have a table of function pointers. And a hidden pointer for the table, will be inserted into the class and will be initialized during construction of objects of the class. So, during a real method invocation on objects, two things would happen. Getting the right function address from V-Table and calling the function pointed by the same. Even though this extra indirection doesn't cause much performance degradation, in several scenarios this could be easily avoided. Please find the snippet below, after adding virtual specifiers.

class Base
{
public:
virtual void PrintMe()
{
std::cout << "Print: Base" << std::endl;
}
};

class Derived : public Base
{
public:
virtual void PrintMe()
{
std::cout << "Print: Derived" << std::endl;
}
};

void TestPolymorphism()
{
Base *ptrBase = new Derived();
ptrBase->PrintMe(); // This would print "Print: Base"
}

As we discussed early, even though dynamic polymorphism doesn't cause much performance overhead, it would make a function not "in-line"able. Sometimes, in-lining a simple function would improve performance especially if it is being called several times in code. Since normal function call breaks code execution flow, CPU level, caching like optimizations, are not possible. With static polymorphism, we could achieve the in-lining capability. Now let us get into the topic of Static polymorphism.

Let us try to make Base class function a bit intelligent.

In Base class, if we could cast the "this" pointer's type to the right Derived class, we could be able to solve the issue of Dynamic binding; Means avoiding Virtual specifier.

The below code snippet won't compile, but it gives the idea of what to do.

class Base
{
public:
void PrintMe()
{
static_cast<Derived*>(this)->PrintMe(); // This would call "Print: Derived"
}
};

We had successfully removed Virtual Keyword; So, V-Table wouldn't be created and function can be in-lined. There are some problems; First off, it wouldn't compile, as compiler doesn't know the Derived class yet. Secondly, this is specialized for only one derived class. This specialization can be removed with templates. Yes. that is the whole idea.

template<Derived>
class Base
{
public:
void PrintMe()
{
static_cast<Derived*>(this)->PrintMe(); // This would call "Print: Derived"
}
};

Now, you could derive classes from the above class like this,

class Derived : public Base<Derived>{ };

Deriving from template classes, specialized with the same Derived class type, is called as CRTP, curiously recurring template pattern. Even though the above example is very minimalistic, there are several uses for this pattern. One of the great examples, as given by Wikipedia is Counter base class. In order to get the statistics of objects of a particular type, we can implement a Counter class as below.

template<T>
class Counter
{
public:
static int GetTotalObjectsCreated()
{
return nObjectsCreated;
}
protected:
Counter() { nObjectsCreated++;}
~Counter() { nObjectsCreated--; }
private:
static int nObjectsCreated;
};

template<T>
int Counter<T>::nObjectsCreated = 0;

Let us assume we need to get the number of objects created for a particular type, Test. Derive Test from Counter, specialized with Test itself.

Class Test : private Counter<Test>{ };
// Private derivation is to show that it is not 'is a' relationship.

Now at any point of time, in program execution, we could get the number of objects alive using the call like this.

int nObjects = Counter<Test>::GetTotalObjectsCreated();

CRTP is cool and positive side effect of Code replication based generics mechanism, unlike Java and .NET.



Thursday, October 6, 2011

Doors Toggling Problem.

Today we will discuss a puzzle which is very interesting as its end result is mathematically elegant and intuitive.

There are 100 doors initially kept in closed state. One has to go through 100 iterations and in each iteration, he has to toggle a certain set of doors given by the rule below:


In 'i'th numbered iteration, he has to toggle the doors i, 2i, 3i, 4i, .. n * i. where n * i <= 100 < (n + 1) * i. For instance, in 3rd iteration, doors to be toggled are 3, 6, 9 ... 99.

By toggling, it means if the door is closed it has to be opened and if it is opened, it has to be closed.

Given that, question is after 100 iterations, which are all the doors would be in open state. Please stop reading if you want to try this puzzle on your own.

First off, door would be in OPEN state, if it had been toggled for odd number of times otherwise it would be in CLOSED state. I hope this is clear. This is very essential to understand the following.

Let us try some simple cases now. The first door wouldn't be toggled further, after 1st iteration which means one of the doors in open state after 100 iterations, is first door. Second door would be toggled in 2 iterations, namely first and second. So, it would be in closed state. Third would be toggled in iterations 1, 3 and would be in closed state. Fourth would be toggled in iterations 1, 2, 4 and would be in opened state. Now you could see the pattern clearly.

A 'N'th numbered door would have been toggled in an iteration 'i' if 'i' is a divisor of 'N'. Number of times a door toggled is the total number of divisors of door's number. For door 1, only one divisor is there which is again 1. For door 4, divisors are 1, 2, 4.

So, this question boils down to the simplified version: what are all the numbers( < 100) contains odd number of divisors.

Number theoretically, for what numbers tau(n) is odd. I will discuss Number theoretic functions in a different post.

Generally, if a number 'N' is divisible by 'n' and it would be divisible by (N / n) also. That means number of divisors would come in pair. But if a number is a square number like 25 which has divisor 5 and pair ( 25 / 5) is also 5, then it has odd number of divisors.

All square numbered doors like 1, 4, 9 ... 81, 100 would be in OPEN state and others in CLOSED state.


So, the number of opened doors after 100 iterations is 10. If you simulate the puzzle in computer, the program would be of complexity O(n * H[n]) where H[n] is 'n'th harmonic number.

http://en.wikipedia.org/wiki/Harmonic_number

In this naive approach, 1st iteration has to toggle 100 doors, 2nd 50 doors, 3rd 33 doors .... 100th 1 door.

So, total number of toggles would be ( 100 + 50 + 33 + 25 ... 1) => 100( 1 + 1/2 + 1/3 ... + 1/100), which is 100 * H(100) . I am attaching the code in Java. Remove the comments if you want to check the complexity in terms of iterations.


public static void main(String[] args)
{
List<Boolean> doorState = new ArrayList<Boolean>();
for(int i = 0; i <= 100; i++)
doorState.add(false);
// int totalIterations = 0;
for( int i = 1; i <= 100; i++)
{
// int innerIterations = 0;
for( int j = i; j <= 100; j += i)
{
Boolean state = doorState.get(j);
doorState.set(j, !state);
// totalIterations++;
// innerIterations++;
}
// System.out.println(innerIterations);
}
// System.out.println(totalIterations);
System.out.println("Following doors would be open: ");
for(int i = 1; i <= 100; i++)
{
if(doorState.get(i))
{
System.out.println(i);
}
}
}


Thanks for Reading.