Computational Complexity, Cloud Computing and Big Data

War Story

Most of the data we use here at Datacratic is stored on Amazon’s S3 with the compute jobs running on EC2 instances accessing this data as needed. A few months ago, our operations team noticed that our Amazon bills were unusually high. After analysis, it turned out that the problem was a parameter in a configuration file that was causing our code to perform a large number of list operations. In the process of looking into the cause of the problem we discovered a very interesting thing about the price of an API call.

More specifically, before deciding whether a file in S3 should be processed or not, we need to get information about its size and the time that it was last modified. This information can be obtained via two different S3 API calls:

HEAD (object): returns metadata about the object itself ($0.004 per 10,000 requests)
LIST (bucket, prefix): returns data about all objects that match prefix ($0.005 per 1,000 requests)

Note that the price is given per 1,000 requests in one case and per 10,000 in the other.

In our case, it turned out that we were using the second form with a prefix that matched the object whose metadata was required. Because the second form can return information about multiple objects in the same call, it is considered a LIST operation and its price is higher, even though we were using it to retrieve information about a single object.
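
In code, the two forms look something like this (a sketch using the `head_object` and `list_objects_v2` calls of a client library such as boto3; the bucket and key are placeholders, and the client is passed in as a parameter so the functions can be exercised without AWS credentials):

```python
# Two ways to fetch the size and last-modified time of a single S3 object.
# The client object is assumed to expose the boto3-style methods below.

def metadata_via_head(s3, bucket, key):
    """One HEAD request: billed at the cheaper per-request rate."""
    resp = s3.head_object(Bucket=bucket, Key=key)
    return resp["ContentLength"], resp["LastModified"]

def metadata_via_list(s3, bucket, key):
    """A LIST request whose prefix matches a single object: returns the
    same information, but is billed as the more expensive LIST operation."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=key)
    obj = resp["Contents"][0]  # the single matching object
    return obj["Size"], obj["LastModified"]
```

Both functions return the same metadata; only the billing differs.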

This means that a LIST request for a single item is 12.5 times more expensive than a HEAD request! By switching to the first form we achieved a significant reduction in costs.
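
The 12.5× figure falls straight out of the per-request prices:

```python
# Per-request prices from the S3 rates quoted above.
HEAD_PER_REQUEST = 0.004 / 10_000   # $0.0000004 per HEAD request
LIST_PER_REQUEST = 0.005 / 1_000    # $0.000005 per LIST request

# A LIST used to fetch metadata for a single object costs this many
# times more than the equivalent HEAD:
ratio = LIST_PER_REQUEST / HEAD_PER_REQUEST
print(ratio)
```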

That’s right: with the advent of SaaS, a function call actually costs money and, depending on its nature, the price can vary significantly!

Awareness of the space and runtime cost of a program is one thing but the fact that an API call now has a price was, at least for me, something new whose implications were worth exploring.

Running out of time (and space)

Back in the old days, when computing power and memory were at a premium, a big part of a practising programmer’s mental toolbox was an awareness of an algorithm’s run-time complexity and of a data structure’s efficiency. If creativity is only possible within a system of rules and constraints, those hardware limitations ensured that, as programmers, we were kept on our toes and forced to be mindful of costs.

With the availability of more plentiful memory and faster processors, Moore’s law seemed to make all these considerations moot: the era of the free lunch had arrived, where advances in hardware would more than offset inefficiencies in software. The limits had been pushed back, and while we knew that Moore’s law would surely hit a wall at some point, this was comfortably in the future, and in the long run we are all dead anyway. While a seminal article by Herb Sutter proclaimed that The Free Lunch Is Over and made a plea for more efficiency and discipline in software, it would seem that the arrival of the cloud, with its promise of infinite computing power, has once again re-established the free lunch: why bother with the computational complexity of an algorithm when we can simply throw more hardware at it?

Except, of course, it’s not free. In two cases the cost implications are obvious, but there is a third case in which they are less straightforward, as we will see next. Consider a program with a given run-time and space complexity that runs in the cloud and makes use of SaaS:

  1. If we wish for it to run faster, we need either more machines or more powerful ones, and typically these costs are visible upfront.
  2. Similarly, the more memory a program uses, the beefier the instance type required, with obvious price implications.
  3. However, if the program makes API calls to a service such as S3, these costs are more or less hidden, since their impact is delayed: the bill arrives at the end of the month, long after the calls are made.

Complexity Theory In Practice

With the third consideration in mind, I decided to compare costs: not just the familiar run-time cost of an algorithm, but its actual monetary cost, keeping in mind that a function call now results in a bill at the end of the month. Using the prices above for an S3 API call, we compare an algorithm that is O(N log N) with one that is O(N²). In the latter case, for example, we could have two nested loops, each iteration making a call to HEAD or LIST.


Input Size   Cost (HEAD, O(N²))   Cost (LIST, O(N²))   Cost (HEAD, O(N log N))   Cost (LIST, O(N log N))
100          $0.004               $0.05                $0.0003                   $0.003
1,000        $0.40                $5.00                $0.004                    $0.05
10,000       $40.00               $500.00              $0.05                     $0.66
100,000      $4,000.00            $50,000.00           $0.67                     $8.30

(Rows computed from the per-request prices above, assuming one API call per elementary operation and taking the logarithm base 2.)

The results are eye-opening. For small values of N the difference is not so great, but as N climbs the difference in cost is staggering. With N = 100,000, an O(N²) algorithm would result in an AWS bill of $4,000 using HEAD and $50,000 (!) using LIST. On the other hand, an algorithm with O(N log N) complexity would cost only $0.67 and $8.30 respectively.
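
The figures above can be reproduced with a few lines of arithmetic (assuming one API call per elementary operation and a base-2 logarithm, which is my reading of how the table was produced):

```python
import math

HEAD = 0.004 / 10_000   # $ per HEAD request
LIST = 0.005 / 1_000    # $ per LIST request

for n in (100, 1_000, 10_000, 100_000):
    n_squared = n * n            # calls made by an O(N^2) algorithm
    n_log_n = n * math.log2(n)   # calls made by an O(N log N) algorithm
    print(f"N={n:>7,}  "
          f"HEAD/N^2=${n_squared * HEAD:,.2f}  LIST/N^2=${n_squared * LIST:,.2f}  "
          f"HEAD/NlogN=${n_log_n * HEAD:.2f}  LIST/NlogN=${n_log_n * LIST:.2f}")
```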

Considering that some of our products operate in an environment with low margins, this can be instrumental in deciding whether a given product is viable or not. If we ever needed a good reason to pay attention to the complexity of an algorithm, there was no need to look any further. This led me to look back at the practical justifications I had previously seen for learning about computational complexity.

Memory Lane

My first introduction to the relevance of the Theory of Computation to the Real World(™) was in a third-year Computer Science course, through the book Computers and Intractability: A Guide to the Theory of NP-Completeness by Garey and Johnson. For many of us, the first encounter with these ideas was a bit of a shock to the system: these (big) oh-so-abstract concepts stood in stark contrast to the much more down-to-earth programming courses we had been exposed to until then. The book contains one of my favourite cartoons, which the authors motivate as follows (I have shortened and paraphrased the description from the book):

Suppose you are employed in the halls of industry and your boss calls you into his office and confides that the company wants to enter the highly competitive ‘bandersnatch’ market. Your job is to find an algorithm that can determine whether or not a given set of specifications can be met for a given component and, if so, to construct a design that meets them.

You try various approaches for weeks and the best you can come up with are variations on an algorithm that essentially searches through all possible designs. You don’t want to return to your boss and report “I can’t find an efficient algorithm. I guess I’m just too dumb.”

To avoid damaging your position in the company you would like to be able to confidently stride into your boss’ office and say "I can't find an efficient algorithm because no such algorithm is possible!" Unfortunately, proving inherent intractability can be just as hard as finding efficient algorithms.

 “I can’t find an efficient algorithm, because no such algorithm is possible!”

However, after reading this book, you have discovered something almost as good. The theory of NP-completeness provides many straightforward techniques for proving that a given problem is “just as hard” as a class of problems that no one has been able to prove intractable, but for which no efficient solution has been found either. If you can show that your ‘bandersnatch’ specification problem is NP-complete, you can march into your boss’ office and say, “I can’t find an efficient algorithm, but neither can all these famous people.”

(Cloud) Consciousness Raising

While the above parable brings home the importance of the theory of computational complexity, I have always found it slightly depressing: after all, we are devoting a lot of time and effort to showing that something cannot be done. While this is no doubt a great tool for avoiding blind alleys, it can feel a bit underwhelming.

With the advent of the cloud and SaaS, and as our S3 war story illustrates, being shot at focuses the mind wonderfully. As SaaS becomes more ubiquitous, it will become critical for working programmers to realize that computational complexity is no longer the province of academia or of specialized fields within industry. Instead, it should be part of the mental toolbox of every pragmatic programmer.

I believe that such considerations (while no doubt obvious to those with cloud experience) are novel and probably outside the normal experience of many (most?) working software developers at this point. Even with the best of intentions, however, bugs always creep in in unexpected ways: even with a computationally optimal algorithm, a bug can easily cause an infinite loop in which calls to a billable API have disastrous consequences. In my next blog post I will address some of the ways in which we at Datacratic have addressed, or plan to address, these issues, from detection to prevention.
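
One simple mitigation against the runaway-loop scenario is to route every billable call through a hard budget that fails fast instead of silently accumulating charges until the end of the month. A minimal sketch (the `CallBudget` and `BudgetExceeded` names are hypothetical, not an existing library, and the limit is illustrative):

```python
class BudgetExceeded(RuntimeError):
    """Raised when a program exceeds its allowance of billable API calls."""
    pass

class CallBudget:
    """Counts billable calls and raises once a hard limit is reached."""
    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.calls = 0

    def charge(self):
        self.calls += 1
        if self.calls > self.max_calls:
            raise BudgetExceeded(f"exceeded {self.max_calls} billable calls")

# An illustrative limit: even an infinite loop can only bill this many calls.
budget = CallBudget(max_calls=100_000)

def head_object_with_budget(s3, bucket, key):
    budget.charge()  # fail fast instead of billing forever
    return s3.head_object(Bucket=bucket, Key=key)
```

The point of the design is that the failure is immediate and visible in the program, rather than delayed and visible only on the monthly bill.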