2008-09-19

Digg: Still not interesting

As some of you know, during my job search last year, one of the places that interviewed me and ultimately made me an offer was Digg. Joe Stump, who is now the lead architect at Digg, was someone I knew from the Seattle PHP community and one of the folks who interviewed me. Being interviewed by someone you know is nice because you have a shared set of experiences that allow you to ask good questions, which can help you make a better decision about whether to accept an offer.

One of the questions I asked Joe was, "Are you learning?" Joe essentially responded (and I'm paraphrasing) that he wasn't learning much, but that Digg operated at the largest scale he had ever worked at, which kept things interesting for him. Knowing Joe's background, and that he's a bright guy, I inferred a few things about the technology at that point:

  • Typical LAMP stack
  • No web services
  • Database sharding
  • 'Legacy', organically grown code base

This opinion was reinforced after talking with several engineers. Having worked in that kind of environment for several years, I knew it wasn't exactly what I was looking for, and I ultimately declined the offer.

In a series of recent blog posts, Digg engineers, including Joe, have begun describing their system and software architecture.

One of the things that strikes me is that, from a technology perspective, not a lot has changed in a year. The typical high-traffic LAMP system still consists of:

  • Caching (Memcache)
  • Distributed file system (MogileFS)
  • Monitoring (Nagios)
  • Asynchronous Processing (Gearman)

It's about as vanilla as it gets from an architecture perspective. But what's wrong with that?

Clearly Digg has been successful, and their approach to technology has obviously worked. Anyone who has been tasked with scaling a web application will recognize the building blocks Digg is using. However, by not building a distributed system (the path Digg has chosen), you will run into some of the following issues:

  • Increased coupling of software components
  • Longer ramp-up time for new developers
  • Inability to update individual system components
  • Difficulty parallelizing development tasks
  • Additional risk in new releases

Let's use a Unix pipes analogy for a minute. Assume that each component in a software system is a Unix tool: ls, grep, tail, etc. Imagine the command you are running is:

    ls /bin/ | grep cat | tail

Each of these applications handles a very specific piece of functionality. You can use each application in isolation. You can upgrade any of them without affecting the others. Different developers can work on each application in isolation. There are some obvious advantages to the Unix approach. This is one way to think about a distributed system: instead of pipes you're (probably) using an IP-based transport, and instead of command-line options you're using a well-defined API.
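To make that concrete, here's a rough sketch of what the same pipeline might look like as services. This is purely illustrative; the endpoints are invented, and any IP-based transport (REST, XML-RPC, Thrift, etc.) would do:

    <?php
    // Hypothetical sketch: each "tool" becomes a small service with a
    // well-defined API. The URLs below are made up for illustration.
    function callService($url, array $params) {
        $response = file_get_contents($url . '?' . http_build_query($params));
        return json_decode($response, true);
    }

    // The distributed analogue of: ls /bin/ | grep cat | tail
    $listing  = callService('http://listing.internal/list',
                            array('path' => '/bin/'));
    $filtered = callService('http://filter.internal/match',
                            array('pattern' => 'cat',
                                  'input'   => implode("\n", $listing)));
    $lastTen  = array_slice($filtered, -10); // tail defaults to 10 lines

Each service can still be developed, upgraded, and debugged in isolation, exactly like the individual tools in the pipe.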

Now imagine an application called lsgreptail. It's a single application that handles all of the above functionality. You lose the ability to use each part of the application in isolation (no composability). The code base is larger, so it's more difficult for developers to get up to speed on it or become experts with it. Making a change to the directory listing functionality (ls) requires reinstalling the entire application. Tracking down a performance bug becomes more difficult due to the lack of component isolation. There are some obvious drawbacks to this approach to software development. This is how Digg (and many LAMP-based sites) has built its system.

The point is this: there is more to scalability than the number of simultaneous users you can support. As your business grows and becomes successful, scaling your development team is just as important, and that becomes increasingly difficult on a monolithic, lsgreptail-style application. Digg is digging their own hole (no pun intended) by continuing to build their system in this fashion.

5 comments:

Matt Waggoner said...

Can you give a concrete example of what such a distributed web application might look like, on the mechanical level? I'm not certain that the independence and flexibility of Unix-style command-line programs are analogous to a (by definition) much larger-scale application spanning multiple computers and involving external systems like DNS and long-distance communication over a relatively laggy medium (the Internet, vs. local pipes).

senfo said...

I'm failing to understand. This seems pretty modular to me.

Like ls, grep, cat, and tail, the components you mention (Linux, Apache, MySQL, PHP, Memcache, MogileFS, Nagios, and Gearman) are each individual modules that do a specific task. Is it the API that you're arguing strongly couples them together? If that's the case, it can easily be resolved by abstracting them in the implementation. The code should be written such that it does not *depend* on any specific module, allowing you to plug in any new technology that might come along and do the job better. Additionally, written in this fashion, Digg could easily introduce a distributed system.
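As a rough sketch of what I mean (the interface and class names here are made up):

    <?php
    // Application code depends on this interface, not on Memcache itself.
    interface CacheBackend {
        public function get($key);
        public function set($key, $value, $ttl);
    }

    // One concrete implementation, using the pecl/memcache extension.
    class MemcacheBackend implements CacheBackend {
        private $memcache;

        public function __construct($host = 'localhost', $port = 11211) {
            $this->memcache = new Memcache();
            $this->memcache->connect($host, $port);
        }

        public function get($key) {
            return $this->memcache->get($key);
        }

        public function set($key, $value, $ttl) {
            // pecl/memcache's signature is (key, value, flags, expire)
            return $this->memcache->set($key, $value, 0, $ttl);
        }
    }

If something better than Memcache comes along, you write another implementation of CacheBackend and the rest of the code never notices.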

I haven't seen the Digg code base, so I can't say for sure that it was written properly. But it sounds like you're jumping to a lot of conclusions.

Blake Matheny said...

Matt, quoting Werner Vogels, CTO of Amazon.com (source): "If you hit the Amazon.com gateway page, the application calls more than 100 services to collect data and construct the page for you." This distributed, service-oriented approach is exactly the one being taken by Amazon and many other large websites.

Senfo, yes, you are correct: each of the pieces you mentioned does perform a single task. There is no indication, however, that the Digg application itself has been built in a similar fashion. It's true that I'm drawing some of my own conclusions, but in addition to the technical posts I cited, there are a variety of other interesting data points. The lack of feature and service growth, as well as the recently failed Google acquisition (which, some sources say, was due to technical issues), both point to some serious architectural deficiencies.

Matt Waggoner said...

Blake, thanks for the clarification. (First, a bit from the first link:

"The big architectural change that Amazon went through in the past five years was to move from a two-tier monolith..." Shouldn't that be a dilith? ;) )

Anyway... after reading Vogels's description of what their SOA platform looks like in the abstract, it's not fundamentally different from a monolithic service, and here's why. Here are quotes from him, interspersed with my comments:

"For us service orientation means encapsulating the data with the business logic that operates on the data, with the only access through a published service interface.

This sounds like encapsulation of logic and data, which can be accomplished via classes. Imagine an MVC application where the "business logic that operates on the data" is a model (implemented in a class), and the only access is through that class's public methods. Only that class can talk to the database (or the relevant set of database tables), so the only way to interact with that data is through the interface provided by the class. This "class" can certainly be a web service, accessed via (e.g.) XML-RPC, instead of a local class instantiated directly, but it's conceptually the same: encapsulation of logic through defined interfaces.
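As a quick sketch (the names here are invented, not real Digg or Amazon code), the local version might look like:

    <?php
    // Only UserModel touches the user tables; everyone else goes through
    // its public methods.
    class UserModel {
        private $db;

        public function __construct(PDO $db) {
            $this->db = $db;
        }

        // The "published service interface": public methods only.
        public function getUser($userId) {
            $stmt = $this->db->prepare(
                'SELECT id, name, email FROM users WHERE id = ?');
            $stmt->execute(array($userId));
            return $stmt->fetch(PDO::FETCH_ASSOC);
        }

        public function rename($userId, $newName) {
            $stmt = $this->db->prepare(
                'UPDATE users SET name = ? WHERE id = ?');
            return $stmt->execute(array($newName, $userId));
        }
    }

Expose those same two methods over XML-RPC instead of calling them locally and you have, conceptually, the published service interface Vogels describes.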

"No direct database access is allowed from outside the service, and there's no data sharing among the services."

See above.

"Over time, this grew into hundreds of services and a number of application servers that aggregate the information from the services."

In a monolithic MVC model, the "application servers that aggregate" are the controllers, which normally talk to models to get the relevant data and hand it to a view for display.
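Sketched out (again with invented names):

    <?php
    // The controller plays the aggregator role: pull from several models,
    // hand the combined result to a view. In Amazon's setup these would
    // be remote service calls rather than local method calls.
    class StoryModel {
        public function topStories($count) { /* query story tables */ return array(); }
    }

    class CommentModel {
        public function recentComments($count) { /* query comment tables */ return array(); }
    }

    class FrontPageController {
        private $stories;
        private $comments;

        public function __construct(StoryModel $stories, CommentModel $comments) {
            $this->stories  = $stories;
            $this->comments = $comments;
        }

        public function show() {
            return array(
                'stories'  => $this->stories->topStories(15),
                'comments' => $this->comments->recentComments(5),
            );
        }
    }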

Breaking this apart into separate physical servers, each with their own architectures, is certainly a good way to deal with the scaling issues, but it's not fundamentally different. Presumably Obidos was internally segregated, too, rather than being internally intertwined.

I'm not trying to say that Amazon's platform isn't interesting or well-constructed, but I think it's more of a feat of engineering than of innovation... which is actually just as important, if not more so, in my opinion.

9am Studio said...

This comment has been removed by a blog administrator.