2008-09-22

Oracle does NOT enter the AWS Cloud

Okay, seriously? Did the announcement that Oracle was entering the AWS cloud really get sent out today? Don't get me wrong. I think Jeff Barr is great and I love AWS, but let's be clear about what this announcement really means.

When I saw this headline I thought, "Awesome, Oracle is lowering the barrier to entry for SMB customers." This isn't the case. It's true that Oracle is making it easier for people to boot up a presumably properly tuned Oracle instance in the cloud. It's also true that Oracle will support certain EC2 hardware configurations in the cloud (up to 8 virtual cores, although you can license it for 16). However, when Jeff Barr makes a statement like the following, I get confused.

The variability and flexibility of cloud-based licensing has perplexed users and vendors for some time now. Now that a large software vendor has made a clear statement of direction here, we should see more and more cloud-compatible licenses before too long.

Let's dig into the Oracle licensing terms for a minute. If you read them, it becomes clear that two things have happened:

  1. Oracle will support certain versions of their flagship product running on certain hardware configurations in the cloud.
  2. Oracle will license certain versions of their flagship product running on certain hardware configurations in the cloud.

I'm sorry, but does Oracle just not get it? This type of license is the exact opposite of the utility model. While the hardware (and the software, in the case of Red Hat, S3, SQS, etc.) is "pay for what you use", Oracle has decided that you will pay whether you are using the software or not. On Red Hat's Cloud Computing page we get the following quote:

Cloud computing changes the economics of IT by enabling you to pay only for the capacity that you actually use.

This is what I always believed was the power of cloud computing: pay for what you use. I hope Oracle does not become a model for other software vendors, as Jeff predicts. I hope companies take a page from Red Hat's book in this case, especially if they are looking to enter and stay competitive in the SMB market.

2008-09-19

Digg: Still not interesting

As some of you know, during my job search last year one of the places I interviewed that ended up making me an offer was Digg. Joe Stump, who is now the lead architect at Digg, was someone I knew from the Seattle PHP community and one of the folks who interviewed me. Being interviewed by someone you know is nice because you have a shared set of experiences that allow you to ask good questions and make a better decision about whether or not to accept an offer.

One of the questions I asked Joe was, "Are you learning?" to which Joe essentially responded (and I'm paraphrasing) that he wasn't learning much, but that Digg operated at the largest scale he had been involved with, which made things interesting for him. Knowing Joe's background and that he's a bright guy, here are some of the technology points I inferred at that point:

  • Typical LAMP stack
  • No web services
  • Database sharding
  • 'Legacy', organically grown code base

This was my opinion after talking with several engineers. Having worked in that environment for several years, it wasn't exactly what I was looking for and I ultimately ended up declining the offer.

In a series of recent blog posts, Digg engineers including Joe have begun describing the system and software architecture.

One of the things that strikes me is that from a technology perspective, not a lot has changed in a year. The typical high traffic LAMP system still consists of:

  • Caching (Memcache)
  • Distributed file system (MogileFS)
  • Monitoring (Nagios)
  • Asynchronous Processing (Gearman)

It's about as vanilla as it gets from an architecture perspective. But what's wrong with that?

Clearly Digg has been successful, and as such their approach to technology has obviously worked. Anyone who has been tasked with scaling a web application is going to recognize the building blocks that Digg is using. However, by not building a distributed system (the route Digg has taken), you will run into some of the following issues: increased coupling of software components, longer ramp-up for new developers, inability to update individual system components, difficulty in parallelizing development tasks, and additional risk in new releases.

Let's use a Unix pipes analogy for a minute. Assume that each component in a software system is a Unix tool: ls, grep, tail, etc. Imagine the command you are running is:

    ls /bin/ | grep cat | tail

Each of these applications handles a very specific piece of functionality. You can use each application in isolation. You can upgrade any of these applications without affecting another. Different developers can work on each application in isolation. There are some obvious advantages to the Unix approach. This is one way to think of a distributed system: instead of pipes you're (probably) using an IP-based transport, and instead of command-line options you're using a well-defined API.

Now imagine an application called lsgreptail. It's a single application that handles all of the above functionality. You lose the ability to use each part of the application in isolation (no composability). The code base is larger, so it's more difficult for developers to get up to speed on it or become an expert with it. Making a change to the directory listing functionality (ls) requires reinstalling the entire application. Tracking down a performance bug becomes more difficult due to the lack of component isolation. There are some obvious drawbacks to this approach to software development. This is how you can think of the way Digg (and many LAMP-based sites) has built its system.

The point is this: there is more to scalability than the number of simultaneous users you can support. As your business grows and becomes successful, scaling your development team is just as important, and that becomes increasingly difficult with a monolithic lsgreptail-style application. Digg is digging their own hole (no pun intended) by continuing to build their system in this fashion.

2008-09-18

AWS CDN - Super Sweet

One of the very first things I did with AWS (Amazon Web Services) was to use S3 (Simple Storage Service) and EC2 (Elastic Compute Cloud) to build a CDN. A CDN is essentially a way of distributing static content to your users rapidly and in a scalable fashion. I built mine by publishing data to S3, using UltraDNS to direct users' requests to an appropriate EC2 availability zone (east coast, west coast) based on their geographic location, and serving the S3-hosted content through EC2. Many people choose not to go this route for reasons of simplicity and just serve content directly out of S3. Well, now Amazon is going to do all the hard work for you.

If you are on the early warning radar of Amazon Web Services, I'm sure that you received the following email this morning just like I did:

...we are excited to share some early details with you about a new offering we have under development here at AWS -- a content delivery service.

This new service will provide you a high performance method of distributing content to end users, giving your customers low latency and high data transfer rates when they access your objects. The initial release will help developers and businesses who need to deliver popular, publicly readable content over HTTP connections. Our goal is to create a content delivery service that:

  • Lets developers and businesses get started easily - there are no minimum fees and no commitments. You will only pay for what you actually use.
  • Is simple and easy to use - a single, simple API call is all that is needed to get started delivering your content.
  • Works seamlessly with Amazon S3 - this gives you durable storage for the original, definitive versions of your files while making the content delivery service easier to use.
  • Has a global presence - we use a global network of edge locations on three continents to deliver your content from the most appropriate location.

You'll start by storing the original version of your objects in Amazon S3, making sure they are publicly readable. Then, you'll make a simple API call to register your bucket with the new content delivery service. This API call will return a new domain name for you to include in your web pages or application. When clients request an object using this domain name, they will be automatically routed to the nearest edge location for high performance delivery of your content.

Why is this significant?

  • Lowers the barrier to entry for small businesses wanting to use a CDN
  • Reduces the need to do DNS based geo-distribution on your own
  • Allows you to take advantage of something you are already using (AWS S3)
  • Allows you to simply 'enable' the service for existing items stored in S3

Given the expense of CDN services from companies like Akamai, Limelight and Level3, as well as the term commitments (you typically negotiate a rate in a fashion similar to bandwidth), many smaller companies have often avoided using a CDN. This is despite the fact that using a CDN is one of the easiest ways to significantly improve perceived page load times for end users. By letting users pay for a CDN via the utility model that has become so popular with AWS, Amazon opens the door for Joe Developer to simply start out using a CDN and not have to make those kinds of trade-offs.

Very cool stuff. If anyone from Amazon happens to read this, please add me to your beta group :)

2008-09-16

Practical TDD

It took me several years to drink the TDD kool-aid but now that I have I'm addicted. It's not that I didn't want to automate my testing, it's just that it wasn't particularly practical for me to do so. Having worked at startups over the past several years, I have never been able to find that balance between producing new code and appropriate test coverage for that code.

The problem has typically been that the testable API changed frequently enough that I spent as much time updating tests as I did writing new code, or more. This was the problem with unit tests, and as a result I never seemed to get around to writing them or to using continuous integration tools like CruiseControl. However, at my current job we have managed to create a TDD methodology that works particularly well for us. It essentially works like this:

  1. Agree upon web service API
  2. Write Unit Tests that cover the new service API
  3. Iterate on code until all service tests pass

The primary difference between this and any other testing methodology is that we focus on testing our web services as opposed to the underlying APIs. This gives us a few very concrete benefits:

  • The front-end team can begin coding against the service API immediately.
  • Increased test coverage with fewer tests, due to service dependencies.
  • Immediate feedback on work in progress.
  • Breaking API changes caught immediately, reducing impact on customers.

The introduction of continuous integration via CruiseControl, along with 100% service coverage, has allowed us to see the benefits immediately. The number of bugs introduced into our production environment has dropped measurably since we created the test framework.

2008-09-15

MySQL Multimaster Replication in an Asynchronous Environment

By design, MySQL replication occurs asynchronously. That is to say, replication on a slave doesn't necessarily occur at the same time as on the master. In a multi-master replicated environment (assuming two masters), each master is a slave to the other master. There are a few gotchas to consider when creating or editing data asynchronously in a multi-master environment. You can get bitten by these issues when using AJAX, threads, or even when loading images that are database-protected or database-backed. It's even possible to run across these in an entirely synchronous environment if the replication lag time is high enough.

Let's assume for this discussion we have the following table:

    CREATE TABLE users (
        user_id INTEGER UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        user_name CHAR(64) DEFAULT '',
        user_password CHAR(64) DEFAULT '',
        user_last_login DATETIME DEFAULT '0000-00-00 00:00:00',
        UNIQUE KEY (user_name)
    );

Let's also assume that we have two servers, master1 and master2.

Auto Increment Conflict

Assume that the following is submitted to master1:

    INSERT INTO users VALUES (0, 'user1', 'a609316768619f154ef58db4d847b75e', '1979-09-23');

Let's label this e1 (event 1) and assume that master1 assigns user_id 1 to 'user1'.

The following is then submitted to master2:

    INSERT INTO users VALUES (0, 'user2', 'f522d1d715970073a6413474ca0e0f63', '1984-01-02');

Let's label this e2 (event 2) and assume that master2 assigns user_id 1 to 'user2'.

Oops. When each statement replicates to the other master, we end up with 'user1' and 'user2' having different values for user_id on each server.

This topic has been covered in depth elsewhere so I won't go into details. For fixes and more information see Advanced MySQL Replication Techniques. Note that I recommend using a UUID/GUID instead of AUTO_INCREMENT to avoid this type of problem; however, the MySQL function UUID() doesn't work with statement-based replication.
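
For reference, here is a minimal sketch of the standard fix described in that article: stagger the auto-increment sequences so the two masters can never hand out the same user_id. The specific values below are assumptions for a two-master setup and would normally live in each server's my.cnf (SET GLOBAL only affects connections opened after it runs).

    -- On master1: step by 2, start at 1, so it generates 1, 3, 5, ...
    SET GLOBAL auto_increment_increment = 2;
    SET GLOBAL auto_increment_offset    = 1;

    -- On master2: step by 2, start at 2, so it generates 2, 4, 6, ...
    SET GLOBAL auto_increment_increment = 2;
    SET GLOBAL auto_increment_offset    = 2;

With this in place, e1 and e2 would be assigned different user_id values no matter which master received them first.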

Uniqueness Conflict

Let's assume that you have some AJAX code which attempts to create a user, 'user1', resulting in the following SQL statement being submitted to master1:

    INSERT INTO users VALUES (0, 'user1', 'a609316768619f154ef58db4d847b75e', '1979-09-23');

Let's also assume that, for whatever reason, you attempt to create the user a second time (a timeout occurred, a side effect causes another create to happen, etc.) against master2.

You would think that the UNIQUE constraint would prevent the creation from occurring the second time. However, whether the statement is executed on both servers, and in what order, depends on a variety of factors including server lag (the amount of time between a statement being executed on one server and being replicated and executed on the second server). This means that you can end up with user1 being created on both master1 and master2 but with two different user_id's.

How can you avoid this pitfall? Avoid asynchronous, identical create statements. Make the call synchronous. Additionally, you can configure your application to only write to one database for certain statements, essentially reverting at the application level to a typical master-slave replicated environment.
Yes, I have run into this in a production environment.

Update Conflict

Let's assume you submit the following update request to master1:

    UPDATE users SET user_last_login = '2008-09-15 17:58:30' WHERE user_id=1;

This is executed on master1 one second after the time set in user_last_login (at 2008-09-15 17:58:31).

Now the following is submitted to master2:

    UPDATE users SET user_last_login = '2008-09-15 17:58:32' WHERE user_id=1;

This is executed on master2 one second after the time set in user_last_login (at 2008-09-15 17:58:33).

Lastly, the update on master2 is replicated to master1, and the update on master1 is replicated to master2.

Here is what was executed on master1, in order:

    UPDATE users SET user_last_login = '2008-09-15 17:58:30' WHERE user_id=1;
    UPDATE users SET user_last_login = '2008-09-15 17:58:32' WHERE user_id=1;

and on master2:

    UPDATE users SET user_last_login = '2008-09-15 17:58:32' WHERE user_id=1;
    UPDATE users SET user_last_login = '2008-09-15 17:58:30' WHERE user_id=1;

Now we have different values for user_last_login on each server. Uh oh.
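
One mitigation worth sketching here (it's not part of the original scenario, so treat it as an illustration): make the update monotonic so that replay order stops mattering. With statement-based replication the WHERE guard travels with the statement and is re-evaluated on each server, so a stale update simply matches zero rows.

    -- Only ever move user_last_login forward. An older timestamp arriving
    -- late, whether locally or via replication, matches zero rows and is a no-op.
    UPDATE users
       SET user_last_login = '2008-09-15 17:58:32'
     WHERE user_id = 1
       AND user_last_login < '2008-09-15 17:58:32';

Written this way, both masters in the example above converge on 17:58:32 regardless of the order in which the two updates are applied.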

Delete Conflict

This can occur when a delete is executed on master1 and, before that delete is replicated to master2, a statement that updates the key used in the delete's WHERE clause is executed on master2. Imagine the following is executed on master1:

    DELETE FROM users WHERE user_name='user1';

and the following is executed on master2:

    UPDATE users SET user_name='user3' WHERE user_name='user1';

When the statements replicate, the delete no longer matches anything on master2 (the row is now 'user3') and the update no longer matches anything on master1 (the row is gone). Master2 keeps a user that master1 has deleted, so we have inconsistent data on each server. Uh oh.

Summary

In short, there are a variety of challenges to overcome in a multi-master replication setup. These problems are exacerbated by asynchronous operations on your data set. A few bullets of advice:
  • Write complex transactions to a single server only.
  • Monitor server lag (a quick way to check it is sketched below).
  • Be prepared for failure. The more you distribute your data set and scale your service, the more you will need to deal with failures.
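
On the "monitor server lag" point, here is the simplest starting place I know of, assuming a stock MySQL multi-master setup where each master is also a slave of the other. Seconds_Behind_Master is MySQL's own rough estimate, so treat it as a first approximation rather than a precise measurement.

    -- Run on each master (each is also a slave of the other):
    SHOW SLAVE STATUS\G

    -- Fields worth watching / alerting on:
    --   Slave_IO_Running:      Yes
    --   Slave_SQL_Running:     Yes
    --   Seconds_Behind_Master: 0 (or close to it)

Anything that keeps Seconds_Behind_Master climbing widens the window in which every conflict described above can occur.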

2008-09-07

Google Code as Personal Wiki/VC Tool

I figured I would put this up as some folks might find the idea useful. For a long time I've wanted an externally available, free, reliable, hosted environment that had a personal wiki and version control. I would have loved to find a Trac+Subversion environment, but I didn't trust any of the free ones out there. I've got tons of documentation that I create that seems to get lost in a slew of .txt files in my home directory. Likewise, I've got lots of sample code that gets created as Foo.java, foo.js, foo.php, etc. and ends up disappearing.

Google Code is a hosting service that has version control via Subversion, issue tracking, wiki pages, and a bunch of other features that I didn't really need. Today I noticed the link on Google Code saying, "Create a new project", so I thought, "What the heck?" Looking at the TOS and FAQ, there is nothing that prevents me from using this as a personal wiki and version control system. I already keep all of my source under a friendly LICENSE and have no problem doing the same for scripts and documentation as well.

I put everything I would need to check out my home directory (bash scripts, vim files, etc.) into Subversion. Perfect for when I hop on a new machine. Also good for synchronizing changes between environments. I also threw up a few small Java projects that have been sitting out of source control for too long; I'll keep adding more as they come up. I also got started putting a bunch of the files in my local doc/ directory into the wiki. Fortunately I've been using the Trac/MoinMoin syntax for a long time, so it shouldn't be too difficult to make the transition from local storage to remote. One of my favorite features so far is that I can check out my wiki onto my local workstation and make changes there with my editor of choice. Very cool.

So far I only see two downsides. First, I have to be very careful not to commit anything sensitive to the repository as it's public. This generally isn't a problem but I do double check my commits. Second, unless I switch from Ant to Maven in the near future, jar files are going to send me over the 100MB limit sooner than later. I wonder why I don't have storage similar to GMail or Picasa. Oh well.

2008-02-17

Beliefs and Programming

I don't post too often on here these days; I've moved to blogging with my current employer, Compendium Blogware. You can find new posts here. Occasionally, though, there is a post that doesn't quite belong or isn't quite appropriate for the corporate world. This is one of those posts.

Michael Kimsal put together a survey called Religious affiliation and software development languages, which you can also discuss on his blog here. I downloaded the data set and did the following further analysis:
  • Found the top 25 languages by the number of people who filled out the survey
  • Found the top 5 religious affiliations for each language
  • Normalized all Christian religions into the affiliation 'Christianity'
  • Grouped agnostic and atheist declarations into 'AA'
Given that only ~3815 people took the survey, not a whole lot could be drawn from the numbers. However, here is what I found; draw your own conclusions.
  • The top 10 languages in order were: Python, C, C++, Java, Javascript, Ruby, PHP, Lisp, Perl, Haskell
  • The top two affiliation declarations were AA (atheist, agnostic) and Christian. After that, Buddhist was most common.
  • Without normalization, the top declarations were Atheist, followed by Agnostic, followed by some variety of Christianity.
I'm not sure what the target group was; I didn't even know about the survey until after it was closed. However, I would say the results are in line with what I have observed in my own geeky social circles. Analytical people tend to question doctrine.