Wednesday, March 01, 2006

le prospector

A couple of weeks ago I had an idea about sorting come unexpectedly into my mind, and it wasn't until last night that I'd had a look online to see what, if anything, anyone had written about the idea.

In 1995 (when I was in my first year at university) we were asked to implement any sorting algorithm we liked to sort an input stream of integers (I think in the range 1 to 1000). At the time I knew how to do a bubble sort, probably selection sort, and maybe something else, but nothing as fancy as quick sort or merge sort. As far as I recall, we were learning about asymptotic complexity ("big oh" notation), and the program I wrote wasn't really a sort, but a trick that occurred to me that would give the correct result for the given type of input. The lecturer docked me a mark (so I only received 9/10) saying that my method was not extensible (not that extensibility was stated as a requirement for the assignment, so I was pretty annoyed!).

And now, more than 10 years later, I find that the method I used is known as "Count Sort". Well, very nearly - my method involved creating an array of counts for each possible input value, but was single pass for the input - the output was obtained by checking each count and emitting count copies of the index, so was super fast. :o)

I don't know yet when "Count Sort" was first known, but my lecturer certainly didn't mention it, and I'm a bit miffed that I was led to believe (for 10 years!) that it was useless. In any case, for completely unknown reasons, while trying to get ready to leave home on a Friday morning this hacky sorting method was in my head and I started thinking about ways to make it more general. I soon ended up in a state, not quite panicking, but rather agitated at the notion that I might have stumbled across something new (and potentially extremely valuable!)... I paced around, unable to concentrate on things like putting on my shoes, and started getting worried about what the process is for protecting such an idea.

I SMS'ed a friend, paced around some more, tried to stay calm and work out how to put my shoes on. As far as I knew, quick sort was about as good as it gets, and O(nlogn) was the best you could do. But my idea was more like O(n) ... surely either someone had thought of this before, or the details in the implementation, or the last part of the algorithm, would force the complexity back to nlogn (or worse).

SMS wasn't good enough. I called the friend, interrupting whatever it was they were doing, to see if they knew what to do about intellectual property. I hadn't even sat down to see what the details might be. It was a very strange feeling - a sort of overwhelming emotion attached to a simple idea, and I still have no idea why I was thinking about it at all.

Since then I was in the library at Enmore TAFE (where I am currently a student), looking for a Photoshop tutorial book. Instead of what I was looking for, I noticed a booked that I had read about somewhere called "The Artist's Way", and near to that a very small book titled "A Technique for Producing Ideas" by James Webb Young. I picked it from between the neighboring books and started reading it right there. It was originally published in 1965, and I find it's contents to be very self-affirming, and very possibly explaining exactly the process by which I'd had this sorting idea.

Having now read about Count Sort, and some similar sorts (especially one originally called "histogram sort"), including a reference to something Knuth wrote about needing linked lists, I'm not sure that I want to go through the pain of pursuing the idea any further, but I would like to propose the following notion:

Sorting based on any simple integer-based key (strings of letters are "integers", but floating point values probably aren't!) can be done in O(n + k) time, where n is the number of input values, and k is related to the complexity of the key.

eg for my original program, the array of 1000 counts had to be scanned to produce the result. But give it a thousand, a million, or a billion input values to "sort", and there same array need only be scanned once. So if there are a lot more input values than possible unique keys, the O(k) can be thought of as a constant and ignored. Most sorts aren't on simple keys, though, so it is important to keep that "k" in consideration.

Basically, the idea is not to compare values with each other to determine the order, but rather to simply put each input value into "the right place" straight away. Just as you can place a mark on a grid at the right location from knowing the co-ordinates, you can link the a record with a given key to the right place in an array (or tree... or perhaps some other structure) without care for any other values. Once the keys are put in the right places, just pick them up in order from start to finish.

I wouldn't call it genius, but I'll settle for spark of brilliance! I didn't quite exclaim "Eureka!", but I think that's something like how I felt about it.

[px]