Monday, September 10, 2012

More complex sorting algorithm: Merge sort and Quick sort

Welcome back! In the previous blog, we saw two of our simplest sorting algorithms: selection sort and bubble sort. Though they were simple, they take large time to sort (O(n^2)). For large data set this means a huge amount of time and thus people go for other sorting algorithms like merge sort, quick sort etc. Today, we will see how merge sort and quick sort happens and then compare them.

Another point, time is not the only parameter to judge the performance of an algorithm - another parameter is how much space the algorithm takes. Although you have lots and lots Bytes of memory in your PC and laptop to adjust space for the algorithm, but think of your phone. If we run our algorithm in your phone then there would not have that much of space. In that situation space complexity comes into picture and algorithms have to be compared not only with respect to time but also with respect to space / memory requirement. Thus, space complexity becomes critical in case of embedded systems and cache performance in PCs.

Merge Sort: Before jumping into merge sort algorithm let us discuss how to merge two arrays. You have two sorted arrays and you have to make a single array maintaining order. It's pretty straight forward using extra space.

C function to merge two arrays:
1st array: array[l1:h1], 2nd array: array[l2:h2]

void merge (int* array, int l1, int h1, int l2, int h2, int n)
{
int i=l1,j=l2,k=l1;
int* temp = (int*) malloc(sizeof(int)*n);

for(;;)
{
if (i>h1 || j>h2)
break;
if (array[i] < array[j])
temp[k++] = array[i++];
else
temp[k++] = array[j++];
}
while (i<=h1)
temp[k++] = array[i++];
while (j<=h2)
temp[k++] = array[j++];
for(i=l1;i<k;i++)
array[i] = temp[i];
free(temp);
}

To sort a array in merge sort algorithm, we will break the array in smallest possible unit: a single element. And then we'll keep on merging the parts. Merge sort algorithm goes as follows:

void merge_sort(int* array, int l, int h, int n)
{
int mid;
if (l >= h)
return;
mid = (l+h)/2;
merge_sort(array, l, mid, n);
merge_sort(array, mid+1, h, n);
merge(array, l, mid, mid+1, h, n);
}

You can find a pictorial description of the algorithm here.

Let us see the time complexity graph for merge sort in best case, average case and worst case:


Time Complexity:

Worst case complexity: If the array size is 1 then the merging takes constant time. This forms the basis of the recursion. For merging two arrays each of length n/2, we need n comparisons. To sort n element array using merge sort if we need T(n) time then,
T(n) = 2T(n/2) + n, T(0) = T(1) = 0
Now, let us solve this recursion.
T(n) = 2T(n/2) + n
        = 2[2T(n/4) + n/2] + n
        = 4T(n/4) + 2n
        = 4[2T(n/8) + n/4] + 2n
        = 8T(n/8) + 3n
        ..................
        = nT(n/n) + n * log(n)
        = n log(n)

Best case complexity: If the first element of the right hand side array is greater than the highest element in the left hand side array (for ascending sort) then the merge takes n/2 comparisons only. In this situation, the time complexity comes to:
T(n) = 2T(n/2) + n/2
Solving the recursion we get:
T(n) = n/2 * log(n) ~ O(n log n)

So, both the best case and worst case complexity of merge sort is O(n log n).

Space Complexity:

Typical merging is not in place operation(as our example here). It needs additional memory of n. And the algorithm has a recursion tree of depth of 2n-1. Thus, this has worst case space complexity of O(n).


Quick Sort: 

Quick sorting is based on how to partition an array into sub-arrays such that each element in the left hand sub-array will be smaller than every element in the right hand sub-array. For partitioning the array we choose an element and then compare other elements to it and finally decides which are the elements that are smaller than this element (left sub-array) and which are greater than this (right hand sub-array). This critical element is called as 'pivot'. Time complexity of quick sort depends on the selection process of the pivot element as well.

Partition Example (Pivot as the first element of the array):

Partition algorithm example

Code:

int partition(int* array, int s, int n)
{
int pivot_val = array[s];
int pivot;
int i=s, j=n-1;
while(i<j && i<n-1 && j>=0)
{
while (array[j] >= pivot_val && i<j) j--;
while (array[i] <= pivot_val) i++;
if (i<j)
{
swap(array,i,j);
}
}
swap(array,j,s);
return j;
}


void quick_sort(int* array, int s, int n)
{
int pivot = partition(array, s, n);
if (s < pivot)
{
quick_sort(array, s, pivot);
}
if (pivot+1 < n)
{
quick_sort(array, pivot+1, n);
}
}


Time Complexity:

A quantification of time complexity is how many comparisons is to make for sorting. For quick sort algorithm, number of comparison is:
C(n) = n + C(n-k) + C(k-1), C(0)=C(1) = 0
Where the final position of the pivot is k.

Worst case: k=1 or k=n

C(n) =  n + C(n−1)
        =   n + n−1 + C(n−2)
        =   n + n−1 + n−2 + C(n−3)
        =  n + n−1 + n−2 + ... + 3 + 2 + C(1)
        =    (n + n−1 + n−2 + ... + 3 + 2 + 1) − 1
        =   n(n+1)/2  − 1
        ≈  n^2 / 2


Best case: Each sub-array would be of same size (n/2).

C(n) = n + 2C(n/2). This is same as merge sort recursion. So the solution is,

C(n) = nlog n

Avg case: Here all the values of k are equally likely. We would not be doing the calculation here. The result is:
C(n) = 1.39 n log n, i.e. just 1.39 times than the best case

Pivot selection methods:

Pivot selection can be done in many ways:
1. First / last element in the array
2. Median of the array
3. Largest number in the array
4. Middle element in the array
5. Randomly selected element

Selecting largest element in the array results in worst case time complexity. As the partition would always be 1 and n-1 elements. Selecting median as pivot is the best case as the array is divided always into two equal parts: n/2. For calculating the median of an array in linear time we can use median of median's algorithm. I'm planning discuss this algorithm in detail in later blog. All the other methods gives more or less same performance as they are not exploring any significant property of the data set.

Comparison of Merge sort and Quick sort:


Time taken:


Generally time wise quick sort is faster than merge sort for randomly ordered array.

Space required: For merge sort, the additional space requirement is n or minimum as n/2. However, the recursion depth is log n. For quick sort, partitioning is in-place. Partitioning takes O(1) time however for best case the recursion depth is log n. Thus, the space complexity is O(log n).

Quick sort is not stable. So, if you need a stable sorting for your problem then you better use merge sort.

Next time we'll discuss some popular interview questions in the domain of sorting algorithm... See you...

Friday, August 31, 2012

Practical explanation of sorting algorithm time complexity

Do you fear the mathematical formulas for calculating complexity of a sorting algorithm? This is very normal to have though! There is nothing wrong with it as people like me are not gifted with such a mathematical skills. However,  sorting algorithms are the stepping stones in the study of algorithm and complexity. Here, we will take practical examples and verify our theoretical understanding. Let's take two simple sorting algorithms and analyse with coding and not with those scary mathematical formulas. I hope many of the fundamental stuffs will be clear. We'll gradually go in advanced topics also.

First let me give a brief intro to those who are wondering what the hell is this "sorting"? Sorting is a very well known problem in computer science where you have to put a given set of elements in some order. Here we will deal with a set of integer numbers and we will sort them in ascending order. For example, if our given data set is {3,5,1,6,7,1} then our algorithm output should be: {1,1,3,5,6,7}.

Selection sort:
This is the simplest sorting algorithm that comes to mind whenever a sorting of array problem comes into picture.
Pseudo code:

Function: SelectionSort (Array to sort: array)
  1. n := length of array
  2. For i : 1 to n
  3. do
  4.     MinElementIndex = i
  5.     For j : i+1 to n
  6.     do
  7.         if (array[j] < array[MinElementIndex])
  8.             MinElementIndex = j
  9.     done
  10.     if ( i != MinElementIndex )
  11.         Swap ( array[i], array[MinElementIndex] )
  12. End Function
Brief Explanation of how selection sort works: Find the minimum element and place at the beginning of the array. In each iteration select the min of the remaining array and place it in its place.Now, let's write a simple C code to realize the algorithm.

void selection_sort(int* array, int n)
{
int i=0, j=0, MinElementIndex, temp;
for (i=0; i<n; i++)
{
MinElementIndex = i;
for (j=i+1; j<n; j++)
{
if (array[j] < array[MinElementIndex])
MinElementIndex = j;
}
if (array[i]!=array[MinElementIndex])
{
temp = array[i];
array[i] = array[MinElementIndex];
array[MinElementIndex] = temp;
}
}
}

Having this function in a C file we will analyze this extensively. We will now write a shell script to do that. The script will generate a data set and provide it to our selection sort algorithm. The C program, writes the result in a file after completion of the algorithm and also reports the algorithm running time.The script will then see if the result reported by our algorithm is right or not. If it is right it continues to test our algorithm with next data set. We will run the algorithm with different data sets varying the value of n (the number of elements in the data set) to see how the algorithm running time varies with n.

To generate the data set, I have used environment variable $RANDOM provided by linux shell. This gives a pseudo random integer between 0 and 32767 with each reference. You can refer to my script here.

It's time for seat and watch the graphs created for us by the script. Before that let me explain the nature of the data set we are providing as the running time will vary on that. There are some notions in sorting algorithm domain also on the basis of the nature of the data set.

Best case: Where the given data set is already sorted.
Worst case: Where the given data set is in the reverse order of the required order. In our case it is decreasing order.
Avg case: Where the data set elements are random i.e. there is no particular ascending or descending pattern in them.

For a best case scenario, selection algorithm does not realize that everything is in order. It blindly traverses all the elements and tries to sort. Although, there would not be any swap but it will do all the iteration... Thus, for this algorithm, there is no improvement for best case scenario. Check out the graph below which plots number of elements vs algorithm running time.




Bubble sort:

Let us take another simple sorting approach. Unlike selection sort, its main funda is to find the maximum number in the array and put it in its position. However, doing this it can recognize the best case scenario pretty quickly and thus gives much better performance in best case scenario than avg case or worst case. Pseudo code for bubble sort:

      Function: BubbleSort (Array to sort: array)

  1. n := length of array
  2. swapped = true
  3. while ( swapped == true )
  4. do
  5.    swapped = false
  6.    for i : 1 to n-1
  7.    do
  8.       if ( array[i] > array[i+1] )
  9.          swap( array[i], array[i+1] )
  10.          swapped = true
  11.       End if
  12.    done
  13. done

Now running this algorithm with our random data sets as earlier we get the following graph. Now we can clearly see that for best case scenario the algorithm almost takes constant time which corresponds to our theoretical understanding also.

The C program can be viewed here.

Time Complexity:

Are you wondering as I didn't talk about the actual O(n2)? Well it is very easy to observe from these two graphs though. Both the graphs start from n = 1000 onward. So, for this value of n, nwould be 10,00,000, whereas the highest time goes up to 1,20,000 and 2,50,000 for selection sort and bubble sort respectively. So, the running time is definitely well below the order of nmark. And that matches with the theoretical understanding also.


Wondering how these graphs are plot? I've use gnuplot tool in linux. You can find a good tutorial on this here.

Now, let us compare two sorting algorithms' running time.


From the graph above, we can see that in the worst case scenario, selection sort takes much less time than the bubble sort although they are of same order, O(n2). Let's see why is it so? This is because for sorting of same array, selection sort algorithm makes less number of swaps than the bubble sort algorithm. See yourself with manually simulating the algorithms for some simple array.

How do you feel now? A bit more confident in this domain? Ok, for sake of simplicity I've stored the numbers in an array. Ideally, these arrays should be replaced by linked lists... Next time we will see more complicated sorting algorithms like merge sort: understand and compare them. See you then...

Looking forward for your feedback...