Friday, August 31, 2012

Practical explanation of sorting algorithm time complexity

Do you fear the mathematical formulas for calculating complexity of a sorting algorithm? This is very normal to have though! There is nothing wrong with it as people like me are not gifted with such a mathematical skills. However,  sorting algorithms are the stepping stones in the study of algorithm and complexity. Here, we will take practical examples and verify our theoretical understanding. Let's take two simple sorting algorithms and analyse with coding and not with those scary mathematical formulas. I hope many of the fundamental stuffs will be clear. We'll gradually go in advanced topics also.

First let me give a brief intro to those who are wondering what the hell is this "sorting"? Sorting is a very well known problem in computer science where you have to put a given set of elements in some order. Here we will deal with a set of integer numbers and we will sort them in ascending order. For example, if our given data set is {3,5,1,6,7,1} then our algorithm output should be: {1,1,3,5,6,7}.

Selection sort:
This is the simplest sorting algorithm that comes to mind whenever a sorting of array problem comes into picture.
Pseudo code:

Function: SelectionSort (Array to sort: array)
  1. n := length of array
  2. For i : 1 to n
  3. do
  4.     MinElementIndex = i
  5.     For j : i+1 to n
  6.     do
  7.         if (array[j] < array[MinElementIndex])
  8.             MinElementIndex = j
  9.     done
  10.     if ( i != MinElementIndex )
  11.         Swap ( array[i], array[MinElementIndex] )
  12. End Function
Brief Explanation of how selection sort works: Find the minimum element and place at the beginning of the array. In each iteration select the min of the remaining array and place it in its place.Now, let's write a simple C code to realize the algorithm.

void selection_sort(int* array, int n)
{
int i=0, j=0, MinElementIndex, temp;
for (i=0; i<n; i++)
{
MinElementIndex = i;
for (j=i+1; j<n; j++)
{
if (array[j] < array[MinElementIndex])
MinElementIndex = j;
}
if (array[i]!=array[MinElementIndex])
{
temp = array[i];
array[i] = array[MinElementIndex];
array[MinElementIndex] = temp;
}
}
}

Having this function in a C file we will analyze this extensively. We will now write a shell script to do that. The script will generate a data set and provide it to our selection sort algorithm. The C program, writes the result in a file after completion of the algorithm and also reports the algorithm running time.The script will then see if the result reported by our algorithm is right or not. If it is right it continues to test our algorithm with next data set. We will run the algorithm with different data sets varying the value of n (the number of elements in the data set) to see how the algorithm running time varies with n.

To generate the data set, I have used environment variable $RANDOM provided by linux shell. This gives a pseudo random integer between 0 and 32767 with each reference. You can refer to my script here.

It's time for seat and watch the graphs created for us by the script. Before that let me explain the nature of the data set we are providing as the running time will vary on that. There are some notions in sorting algorithm domain also on the basis of the nature of the data set.

Best case: Where the given data set is already sorted.
Worst case: Where the given data set is in the reverse order of the required order. In our case it is decreasing order.
Avg case: Where the data set elements are random i.e. there is no particular ascending or descending pattern in them.

For a best case scenario, selection algorithm does not realize that everything is in order. It blindly traverses all the elements and tries to sort. Although, there would not be any swap but it will do all the iteration... Thus, for this algorithm, there is no improvement for best case scenario. Check out the graph below which plots number of elements vs algorithm running time.




Bubble sort:

Let us take another simple sorting approach. Unlike selection sort, its main funda is to find the maximum number in the array and put it in its position. However, doing this it can recognize the best case scenario pretty quickly and thus gives much better performance in best case scenario than avg case or worst case. Pseudo code for bubble sort:

      Function: BubbleSort (Array to sort: array)

  1. n := length of array
  2. swapped = true
  3. while ( swapped == true )
  4. do
  5.    swapped = false
  6.    for i : 1 to n-1
  7.    do
  8.       if ( array[i] > array[i+1] )
  9.          swap( array[i], array[i+1] )
  10.          swapped = true
  11.       End if
  12.    done
  13. done

Now running this algorithm with our random data sets as earlier we get the following graph. Now we can clearly see that for best case scenario the algorithm almost takes constant time which corresponds to our theoretical understanding also.

The C program can be viewed here.

Time Complexity:

Are you wondering as I didn't talk about the actual O(n2)? Well it is very easy to observe from these two graphs though. Both the graphs start from n = 1000 onward. So, for this value of n, nwould be 10,00,000, whereas the highest time goes up to 1,20,000 and 2,50,000 for selection sort and bubble sort respectively. So, the running time is definitely well below the order of nmark. And that matches with the theoretical understanding also.


Wondering how these graphs are plot? I've use gnuplot tool in linux. You can find a good tutorial on this here.

Now, let us compare two sorting algorithms' running time.


From the graph above, we can see that in the worst case scenario, selection sort takes much less time than the bubble sort although they are of same order, O(n2). Let's see why is it so? This is because for sorting of same array, selection sort algorithm makes less number of swaps than the bubble sort algorithm. See yourself with manually simulating the algorithms for some simple array.

How do you feel now? A bit more confident in this domain? Ok, for sake of simplicity I've stored the numbers in an array. Ideally, these arrays should be replaced by linked lists... Next time we will see more complicated sorting algorithms like merge sort: understand and compare them. See you then...

Looking forward for your feedback...