On the Ethics of Contracting

For the past couple of months I have been doing contract work for a variety of local companies. When someone takes you on as a contractor, they have certain expectations about what you bring to the table. In particular, clients expect you to bring particular expertise to the company and to help them solve a particular problem more quickly than they could on their own with their given resources. This makes you, as a contractor, particularly well suited for startups, short-term projects, acquisitions and mergers. In many circumstances, you are brought in as a domain expert and simply asked to do the "best" thing for the company while still solving their problem. There are few or no technology requirements.

Herein lies a problem of ethics. If you determine that the best solution is a technology that you have very little or no experience with, do you have a responsibility to inform the company of that fact, and should you charge them for the time it takes you to get up to speed?

First, I'm not sure how you can recommend a technology if you have zero experience with it. Yet I've seen it happen. If you have no experience, how do you know the solution will meet their expectations? I don't care how much you've read about something; experience matters. Assuming that you take some time to work with the technology and then make a recommendation, you have a responsibility to let the company know what your experience level with the technology is. This mitigates risk and allows the company to make an educated decision on how to move forward. Also ask yourself, "How much of this recommendation is based on my own personal desire to become an expert in the technology?" If the answer is "a lot", do the right thing and at least reconsider the recommendation. If you genuinely selected the technology because it is the best fit, read on.
If you make a recommendation to use a technology that you have very little experience with, should you charge the company for your learning time? Assuming you let them know that you aren't an expert, and they still want you to handle the development, I still don't believe you should charge for learning time. The only time I believe that is appropriate is when the technology is specified and you aren't an expert. It shouldn't happen, but it does. So, what do you think? Do contractors have an ethical (and perhaps legal) responsibility to disclose their level of expertise with a given technology and should they charge for the time they spend learning? Until now, I have felt alone in the thought camp of responsible disclosure and appropriate billing. What do other contractors out there do?


Distributed Computing Failures

I went to a talk several months ago given by Alan Robins, a Principal Engineer in the distributed systems engineering group at Amazon. The title was something like "Performance and Availability", but the focus was really the how and why of distributed computing technologies that had failed at Amazon. It was really interesting. I took about three pages of notes, reproduced more or less verbatim below. I wish he had released the slides; there was a lot of really good information that I was unable to get down on paper.
  • Technologies
    • XA Distributed transactions (two phase commit)
      • TP monitors such as Tuxedo
    • RPC
    • Stateful Remote Objects
      • RMI
  • Many dimensions to consider beyond performance and availability
    • Performance (TPS/Host Latency, etc)
    • Availability: How many nines (time up/total time)?
    • Scalability: How much effort to scale?
    • Distributability: How much effort for multiple data centers?
    • Evolvability: Effort to extend and mutate
    • TCO: Hardware, licensing, dev or integration, operations and maintenance
      • Reconsider performance and availability relative to TCO!
  • Distributed Transactions: Atomic transactions across multiple transactional resources
    • Example: Customer changes primary address and hits customer and address db
    • Dark Side
      • Expensive. Reduces scalability of db server
      • Latency of commit is roughly 5x that of a normal transaction
      • Reduces throughput of the application
      • If any resources are down, nothing can happen. Reduces availability
    • Alternate to XA
      • Be optimistic, commit what work you can
      • Do no harm: order commits such that if failure occurs you can live with inconsistent state
      • Compensate: undo previous commit or queue up rest of work for later
      • Design for failure: Minimize cross db foreign key refs, even denormalize
      • Tolerate dangling references and inconsistencies
  • Remote Procedure Calls: make a function call like it's local, but it's not
    • Example: calculate shipping charges on a customer order
    • Dark Side
      • Binary formats create dependencies
      • Evolving API forces client side rebuilds. Expect to evolve.
      • Service owners must run multiple versions of their software.
      • RPC tightly couples availability requirements.
      • Many fine grained requests have high latency over global distances
    • Alternative to RPC
      • Document passing paradigm
        • Self describing wire format (XML)
        • Evolution without affecting old clients possible
        • Good for asynchronous message passing
        • RPC model still possible
      • SOAP Problems
        • Large messages
        • Expensive to parse and build DOMs
  • Stateful Remote Objects (CORBA, EJB)
    • Problem being solved: supports ? for clients, clients can make many fine grained calls, keeps data model on server, complex data model not transferred
    • Dark Side
      • Mapping client session to stateful server is complex.
      • Servers must keep state for each client (reduces scalability)
      • Server failure fails a lot of clients (reduced availability)
    • Alternative: Stateless servers with persistence store
      • Servers handle each request independently
      • Use data in request to establish context
      • Return results to caller
      • Advantage: High performance, high availability, scales great
      • Disadvantage: Pushes state onto data store
  • Asynchronous messaging, Once-only delivery
    • Problem being solved: service developers don't worry about dupes. They can just do what the request wants. Reduces application logic and complexity of handling dupes.
    • Example: Customer 1-clicks on an item
    • Dark Side
      • Almost impossible to guarantee. In order to ? everything must be transactional
        • Double clicks happen all the time
    • Alternative to Only-Once
      • Idempotence (the property that applying an operation multiple times has the same effect as applying it once): dupes handled correctly with respect to the application.
      • Advantages: simple, enables more scalability and availability. Simplifies clients.
      • Disadvantages: Requires services to check their db. Sometimes service has to build look aside cache.
  • In-order delivery: service doesn't worry about temporal discontinuities
    • Example: Order adds A, adds B, adds C, deletes B, submits order
    • Dark Side
      • Very difficult for infrastructure to manage total ordering.
      • Tight coupling.
      • Can't deliver message until current message delivered.
      • Eliminates availability and scalability
    • Alternative: best effort delivery
      • Developers deal with out of order messaging. Requires event to have a time stamp or sequence id.
      • Advantages: high throughput, optimistic delivery policy (deliver events when you can), very high availability
      • Disadvantage: Application developers must deal with out of order messages
  • Stored Procedures
    • Easy for developers to write RPC type applications
    • DBAs can ensure db resources are used efficiently
    • Complex logic performed without moving data across the wire
    • Dark Side
      • Database resources are the most expensive
      • Creates scaling limitations
      • Low performance
      • Application now split between application server and database
    • Alternative
      • Use a database for what it's good for: relational queries and updates
      • Keep business logic on server
  • Centralized Database
    • ACID model is easy to program against, ensures consistency
    • Reads after write guaranteed to reflect write
    • Provides single synchronization point for all applications
    • Provides richest set of capabilities
    • Example: Customer information database
    • Dark Side
      • Doesn't scale
      • Doesn't lend to global distribution
      • Most labor intensive
      • Least available
    • Alternative: Lightweight operation datastores/caches (e.g. bdb)
      • Datastore distributed geographically
      • Updates propagate via asynchronous messaging
      • Read operations are done locally
      • Updates done locally then write back to central or peer
      • Disadvantages
        • Inconsistencies: Read after write not absolutely guaranteed
        • Partitions can cause multiple versions to exist on different peers
        • Requires distributed group management (DHT)
  • The nature of distributed systems
    • Nodes fail
    • Networks partition
    • Data centers go down
  • There is a tradeoff between availability and consistency
    • Via distribution and redundancy you gain availability, scalability and performance but lose consistency
    • Strive for eventual consistency
  • Embrace failure: build in availability
  • Accept inconsistency.
    • Apology oriented development
  • If you are a developer, deal with these things: Potential inconsistencies considering race conditions
  • The infrastructure can't hide these failures from you
  • Model applications as event-driven processes. Include all the info needed in each message, then propagate, replicate and cache it. This provides high performance and high availability
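Two of the alternatives above, idempotence and best-effort ordering, can be sketched together. This is my own illustration in Python, not code from the talk: a toy handler that drops duplicate deliveries by message id (a look-aside cache) and drops stale updates by sequence number (a real service would likely track sequence per item, not globally).

```python
class OrderService:
    """Toy service illustrating idempotent, out-of-order-tolerant handling."""

    def __init__(self):
        self.seen_ids = set()   # look-aside cache of processed message ids
        self.items = {}         # item -> quantity
        self.last_seq = -1      # highest sequence number applied so far

    def handle(self, msg):
        # Idempotence: a duplicate delivery (same id) is a no-op.
        if msg["id"] in self.seen_ids:
            return "duplicate"
        self.seen_ids.add(msg["id"])
        # Best-effort ordering: ignore messages older than what we've applied.
        if msg["seq"] <= self.last_seq:
            return "stale"
        self.last_seq = msg["seq"]
        self.items[msg["item"]] = msg["qty"]
        return "applied"


svc = OrderService()
svc.handle({"id": "a1", "seq": 1, "item": "book", "qty": 1})
svc.handle({"id": "a1", "seq": 1, "item": "book", "qty": 1})  # the 1-click double click
svc.handle({"id": "a2", "seq": 0, "item": "book", "qty": 2})  # delivered late
```

The double click is absorbed by the id check, and the late delivery by the sequence check, so the service logic never sees either.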
There are talks of this nature fairly regularly at UW and Seattle University; I encourage you to go when you can. This was one of the most informative talks I have ever been to, and it was free.

Trac reminds me of Oracle

I just finished getting a Trac installation up on my site; you can find projects here. And while I love Trac, the install process reminded me of Oracle. It took several hours, had a bunch of dependencies, and getting it to work the way I wanted required several more hours of customization. Granted, if you need an "enhanced wiki and issue tracking system" that integrates with Subversion and works pretty well, Trac is tough to beat. But getting it running on a slightly older system was no easy task. If they want the tool to become more widely used, the developers are going to have to fix the installation issues and create some type of automated installer. Of course, I really have no right to complain considering I haven't written a single line of code for the project.


Martin Roesch on Snort 3.0 and Sourcefire

Yesterday Sourcefire put on a two hour presentation at the EMP here in Seattle. With admission you got some swag, including a calendar and a Snort toy, admission to the Sci-Fi museum for the afternoon and an "ice cream social". Below are notes from the presentation; these are not my opinions. Overall I found the presentations pretty interesting, covering the following topics:
  • Sourcefire & Snort; past, present & future
  • Demo of their RNA/ETM tools
  • Snort 3.0
  • Sourcefire 4.7
In particular, I wanted to hear Marty's thoughts on Snort 3.0 and where he is heading. Martin said that the 3.0 release would focus on the following areas:
  • Reduce Manual Tuning & Automate Configuration
    • "Tuning today is a failure"
      • We need dynamic defense for dynamic networks
  • Solve layer 3/4 evasion due to the IDS not being IP stack aware
    • Model the way an endpoint sees, model the IP stack
  • Normalize rules and configuration languages
    • Pro
      • Rules work well
      • Trivial to use for simple stuff
    • Con
      • Ugly
      • Hard to do hard things
      • A bad rule can significantly impact performance
    • Snort is not a language project
      • Lua will be Snort 3.0's next generation language processor
      • Snort 3.0 will include a command shell that will allow Lua commands to be executed
  • Take better advantage of hardware
    • We are getting more cores, not speed. Snort is single threaded, this is a problem.
      • Must multi-thread snort
    • Vendors are accelerating the wrong parts of Snort and have been for years
      • Need explicit locations for optimization.
Martin asserts that tuning, prioritization and evasion are the same problem. The root of this problem is a lack of knowledge about what is being defended. The solution is to impart knowledge about the operating environment directly into the engine, which allows the engine to tune itself, automate anti-evasion and automate prioritization. Above is the Snort 3.0 architecture as described/shown by Martin. Of primary interest, I think, are the rearchitecture and the threading. I will be surprised if Martin is able to release RNA as open source and integrate it into Snort. If that doesn't happen, it means either that the automation features won't make it into Snort or that they won't work nearly as well as RNA.


Work in Progress: CopyBlog

I used to have a Wordpress blog, and although there weren't many posts I wanted to import them into Blogger. I found a bunch of tools for importing from Blogger to Wordpress, but none that did the opposite. I found one tool that claimed to do what I wanted, blogsync-java, but looking at the code it isn't very modular. Given that my technology tastes seem to change every other month, I really wanted a tool that would let me copy posts and comments between any two blog systems. Hence, CopyBlog. CopyBlog is a command line tool that allows you to copy posts and comments between any two blog systems, at least in theory. I spent some time yesterday and this morning writing code to take Wordpress posts and comments and import them into Blogger. Although this currently just replicates the functionality of blogsync-java, the API is much more modular, so you should be able to drop in a single class and immediately copy to and from that blog type. I should have a 0.1 version out the door this week if I can find some free time; it will include full support for Blogger, Wordpress and LiveJournal. You can find source code up at http://mobocracy.net/code/CopyBlog
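To illustrate the kind of modularity I'm after (the class and method names here are my own sketch, not CopyBlog's actual API): each blog system is one pluggable class behind a common interface, and the copy loop is generic.

```python
from abc import ABC, abstractmethod


class BlogSystem(ABC):
    """One pluggable class per blog platform; copying is then generic."""

    @abstractmethod
    def fetch_posts(self):
        """Return a list of post dicts, each with title, body and comments."""

    @abstractmethod
    def publish_post(self, post):
        """Write one post (and its comments) to this blog."""


def copy_blog(source, dest):
    # The copy loop never needs to know which platforms are involved.
    posts = source.fetch_posts()
    for post in posts:
        dest.publish_post(post)
    return len(posts)


class InMemoryBlog(BlogSystem):
    """Stand-in implementation; a real one would speak the Wordpress or Blogger API."""

    def __init__(self, posts=None):
        self.posts = list(posts or [])

    def fetch_posts(self):
        return list(self.posts)

    def publish_post(self, post):
        self.posts.append(post)
```

With that shape, supporting a new blog type really is just one new subclass.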


The Challenges of EC2

I've recently been working on a project building out the development process, environment and tools for a startup client. This includes things like configuration management, release engineering, automated testing, version control, etc. In doing so, I've been wanting to create as part of this several images including:
  • A bootable and optionally installable build system (see this post)
  • A bootable development environment for each developer (preconfigured application server, version control data, preconfigured database, etc)
  • A bootable QA and Beta environment
I was hoping to use AWS for this, particularly EC2 for the development, QA and Beta environments. If you're not familiar with EC2, it is Amazon's Elastic Compute Cloud, and it allows you to essentially boot and run OS images in their cloud. You pay for hourly CPU usage and for data transfer to and from the cloud. Let's compare a fully hosted, dedicated server solution from The Planet with an EC2 image.
                  The Planet    Amazon EC2
  OS              CentOS        Linux w/2.6 Kernel
  Data Transfer   1500GB        Unlimited
  Disk Space      250GB         160GB
The bandwidth comparison above assumes that you use your full 1500GB a month (and the same in EC2) and that inbound traffic is only 25% of the total; that number comes from my own experience and is probably overly generous. If you decrease the percentage of inbound traffic, the price increases. It also assumes 24/7 operation of a machine over a 30 day period. We also amortized a $225 setup fee at The Planet over a 12 month period.

So, EC2 is more expensive for a front-end web server than a hosted environment, but your downtime due to hardware-related failures decreases to almost zero. You also have no setup fee, and you can literally bring images up in minutes (so I'm told; more on that later). However, for an N-tier system, EC2 is a very inexpensive solution for your middleware application servers and backend servers, since traffic between EC2 systems costs you nothing. Your cost for operating an EC2 image on a 24/7 basis doing only inter-image traffic? $72.00/month. So, purely on a cost basis, EC2 seems like a good platform for at least middleware and backend systems. From a downtime perspective, not having hardware to deal with should also increase availability.

In doing some research, however, I found some troubling points. First, the internet is apparently full. The limited beta is currently at capacity, a helpful error message tells me, and I will be notified when there is availability. Okay, that sucks. Second, EC2 doesn't natively have persistent storage for images. That is, if an image fails, aborts or shuts down, any data stored on the local disk is lost to you. Apparently you can mount an S3 partition in an EC2 image, but S3 isn't really meant for random IO like you might have in a database, for instance. I hope that EC2 opens up to new users in the near future, because it does solve many common problems (rapid horizontal scaling for the web tier, ephemeral environments for dev or QA). It is not, however, a silver bullet. Bandwidth to and from the cloud is expensive, and the lack of persistent storage makes many jobs impractical for the platform. If I ever get a chance to test, I'll put more information up.
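For what it's worth, the $72.00/month figure falls out of EC2's launch pricing of $0.10 per instance-hour. A tiny sketch of the arithmetic (the per-GB transfer rate below is my assumption for illustration, not a quoted price):

```python
# EC2 launch-era pricing: $0.10 per instance-hour.
HOURLY_RATE = 0.10
HOURS_PER_MONTH = 24 * 30  # 24/7 operation over a 30 day period

monthly_compute = HOURLY_RATE * HOURS_PER_MONTH  # the $72.00/month quoted above


def monthly_cost(gb_out, gb_in, rate_per_gb=0.20):
    # rate_per_gb is an assumed cloud transfer price; inter-image traffic
    # is free, which is why compute-only instances come out so cheap.
    return monthly_compute + (gb_out + gb_in) * rate_per_gb
```

A middleware box doing only inter-image traffic is `monthly_cost(0, 0)`, while a front-end box pushing hundreds of GB out of the cloud quickly overtakes a flat-rate dedicated server.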


SASAG Meeting for June

Last night was the monthly SASAG (Seattle Area System Administrators Guild) meeting, and it was the first I had been to. As I'm not a system administrator but a software engineer, I wasn't sure what I would gain from going, but it turns out a lot. Last night's topic of discussion was "Project Success: Science, Magic or Luck?", presented by Leeland Artra, who is currently a PM at Qpass. Essentially, Leeland was suggesting applying software development techniques to system administration. For example, he suggested using test-first development methodologies (I always knew them as TDD, Test Driven Development) to work towards a functional system. He also recommended using scrum to manage projects.

My experience with systems engineers is that their programming experience is limited to systems programming, and things like functional and unit tests are foreign to them. If this isn't true, by all means let me know. So the suggestion for non-programmers to write programs to validate systems seemed counterintuitive to me. However, I liked the idea of using TDD to move forward on a project. I'm not sure why Leeland didn't suggest simply using Nagios or another monitoring system as your test platform. You can imagine writing all your monitors for Nagios so that they all start off red, or non-functional, and as parts of the system come online and become functional the monitors go green. This seems like a much more straightforward and intuitive way to use TDD for non-development projects.

Regardless, the idea of applying agile methodologies to non-development projects is an interesting one, and one I hadn't seriously considered before. I'm not sure how well it would apply to projects with serious capital expenditures such as hardware acquisitions, but the ideas should apply pretty well to any project. Leeland also showed off an interface he had developed for testing complex systems, which just seems unfair since I know it's not open source.
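To make the monitors-as-tests idea concrete, here is a minimal sketch following the standard Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The probe here is a stub, not a real service check:

```python
# Nagios plugin exit-code convention
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_service(probe):
    """Run a probe callable and return a Nagios-style (status, message) pair.

    Before a component is built its probe fails and the check shows red;
    once the component comes online, the check goes green.
    """
    try:
        if probe():
            return OK, "OK - service responding"
        return CRITICAL, "CRITICAL - service not responding"
    except Exception as exc:
        return UNKNOWN, "UNKNOWN - %s" % exc


# A probe is just any callable; a real one might open a TCP connection
# to the service under construction. This stub always "passes".
status, message = check_service(lambda: True)
```

Write one of these per component before the component exists, and your monitoring dashboard doubles as the project's test suite.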
On another note, I had my all-time longest interview today: 7 hours and 8 people, including the CTO, the hiring manager, two developers, one systems programmer, one systems engineer, the ops manager and the HR person. I really enjoy interviews that are challenging, because you know you're going to be working with other good people. I mentally collapsed in my last technical interview though, so who knows what they thought. Updates ahead.

What do you want to be when you grow up?

I've been doing software development, system engineering and architecture for almost 10 years. I've worked at large companies, small companies and tiny companies. After leaving Mixxer in February with the intention of going back to school, for the first time in many years I felt a bit lost in terms of "What do I want to do now?" So I spent about 8 weeks traveling, saw family, did some vacationing, and got back feeling refreshed with what I had hoped would be a new perspective on things. I didn't have that, though. What I had instead was the desire to go back to work, but no idea what I wanted to be doing. So I started interviewing with everyone (19 companies, to be exact), doing everything from embedded C & C++ development to Ruby on Rails at companies ranging from Fortune 100 to pre-funding startups. During the interview process I have kept busy consulting for small startups, helping with software development, architecture and direction. The consulting helped me figure out not what I want to do, but the characteristics by which I will be able to identify what I want to do. Characteristics of the right job include:
  • A company that believes they are improving the quality of life for its users
  • Coworkers who are really smart and passionate about the company mission
  • A startup
  • People who get the philosophy behind the technology they are using
  • The hacker ethos is prevalent
  • Decisions are made based on merit, not ego
  • A technical community based on meritocracy, not seniority
After determining that the above list would help me classify the right company for me, my list of companies dropped from 19 to 4. I'm wrapping up interviews with those 4 now, although the way I crumble in my in-person interviews it could drop to zero pretty quickly :) In any case, being able to identify exactly what it is about working that you love is crucial to finding the right job. Sometimes you don't have a choice; you have responsibilities that drive you towards the first available and well paying position. I chose to wait. If you think you fit the environment I described above, and based on my resume I look like a good fit, send me an email.

Botnets and the Convex Hull

Over the past few months I have worked on some computational geometry problems which required computing the convex hull for a set of points. I have been using it for some pattern recognition work, and in doing so I thought to myself: how could you map an IP address into a real vector space? And if you could, is it possible to track an attacker or adversary? More importantly, can you estimate the size of a botnet or infer the location of its master?

Now, I realize that the location of a compromised host has no bearing on the location of the attacker. However, the latency between the compromised host and the attacker (or botnet master) does have a bearing on location. Likewise, there are a number of other useful metrics, such as how recently the machine was compromised, the difference in times for two zombies to receive the same command, etc. Take one of these metrics and assign it to each node you are aware of. Now use that metric as the distance to an arbitrary point P, and compute the convex hull. Perform this same series of steps for each of the metrics you have chosen and overlay the convex hulls. My assumption is that your arbitrary point P could be identified in each one, and that may help identify a master. It may also help estimate the size of the botnet.

The above is very hand-wavy, I realize. However, I'm curious whether any work has been done to determine botnet topology via a similar mechanism. If anyone is aware of any, please let me know.
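For anyone who wants to experiment, the hull computation itself is the easy part; the IP-to-vector-space mapping is the speculative piece. Here is Andrew's monotone chain algorithm in plain Python:

```python
def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    points = sorted(set(points))
    if len(points) <= 2:
        return points

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); <= 0 means a non-left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    # Build the lower hull left to right.
    lower = []
    for p in points:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    # Build the upper hull right to left.
    upper = []
    for p in reversed(points):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # Concatenate, dropping the duplicated endpoints.
    return lower[:-1] + upper[:-1]
```

It runs in O(n log n), so overlaying a hull per metric across even a large zombie population is cheap.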


Reviving WADE

A couple of years ago I started a small project called WADE. WADE stands for "Wireless ADvertising Engine"; its goal was to let coffee shops and other sites providing free wi-fi earn money from advertising to help offset the cost of the wi-fi. The technology would essentially allow sites to insert advertising in place of existing ads on a page, with ads that they get paid for. More specifically, it used something like the Adblock filter list, Apache, mod_rewrite/mod_proxy and some DNS magic to, instead of removing ads, replace them with perhaps more relevant ads from local businesses. When I went to a friend at the EFF, he informed me that there may be an issue with copyright law, as in: the content and layout of a page are protected under copyright law. I'm not sure if I'm liable for providing the software, if the coffee shop is liable for using the software, or if it is a non-issue. In any case, I have received a few emails about WADE over the past couple of months and have thought I would revive the software and send it to a few friends who can use it. The question is, does something similar already exist so that I should just point people there?
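The mechanism can be sketched with stock Apache directives. The pattern and paths below are invented for illustration; a real deployment would generate one rule per Adblock filter entry and point the hotspot's DNS at this box:

```apache
# Assumes the hotspot's DNS resolves ad-server hostnames to this Apache box.
RewriteEngine On
# Illustrative pattern only. [P] proxies the request through mod_proxy,
# substituting a locally served ad for the original one.
RewriteRule ^/.*(banner|adframe).* http://localhost/local-ads/ad.png [P,L]
```

The legal question above applies regardless of how the substitution is implemented.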


REALM Part I: Tomcat & Servlets

This is the first in a five part series on the REALM stack. The previous introduction can be found here. In this posting I will introduce Tomcat and Servlets, as well as review the basic Tomcat installation, the organization of a deployment, the organization of your source and the actual development process. A link to source code is provided at the end of the tutorial; assuming you have Tomcat installed, you should be able to modify the included build.properties file and type "ant install". I will not cover Ant or Tomcat installation, as there are a number of good tutorials on the web, and these applications are available for most Linux distributions. Some of this information has been taken from "Developing Applications with Tomcat", which is available here.

The Project

CalculatorInc.com wants to provide a website where they allow people to do basic arithmetic operations on the web using arbitrarily large numbers. They are sure it will be the next big thing in this "web 2.0" space; however, they haven't yet discovered web services or Rails, so they have implemented the entire thing using JSP & Servlets. Here is a link to the project source. It is a reasonable starting place for any Servlet/JSP project, and it contains the following: an Ant build file, a JavaCC grammar, unit tests, a servlet and a JSP page. Pretty basic, but we're going to use it as a starting point and enhance it with Spring, add remoting to provide a web service and finally hook it up to Rails. Since I recently made fun of a calculator web service, we'll create a calculator web service. It only has the basic operations (+,-,/,*,%,^), but it uses a JavaCC grammar, so if you haven't used JavaCC before it's a good, simple intro. The requirements are in the docs/README.txt file, but in short: Java >= 1.5, Tomcat 6 (5 probably works, 4 perhaps as well), JavaCC and Ant. That should be about it. Basic install instructions (edit build.properties, ant install test) are in that same document.

What is Tomcat?

First and foremost, Tomcat is a web container. A web container, according to Sun, is defined as follows.
"A container that implements the Web component contract of the J2EE architecture. This contract specifies a runtime environment for Web components that includes security, concurrency, life-cycle management, transaction, deployment, and other services. A Web container provides the same services as a JSP container as well as a federated view of the J2EE platform APIs. A Web container is provided by a Web or J2EE server."
That's a wordy definition. I would say, "Tomcat allows Java code to run in a web environment." It also does all of the things described above, but for the purposes of this discussion the shorter definition is fine. Secondly, Tomcat implements the JSP and Servlet API specifications. For Tomcat 6.0, the most recent release of Tomcat, that means the Servlet 2.5 and JavaServer Pages 2.1 specifications. These can be found here and here, respectively. In general you configure Tomcat via XML files found in the conf directory. Tomcat also has its own web server. This is fine for development purposes; however, you almost always want to use something like mod_jk in production settings, assuming you are using Tomcat in the web tier as well as the middleware tier. There is excellent documentation for Tomcat available here. Competitors include Jetty, Geronimo, Resin, BEA WebLogic and JBoss.

What is a Servlet?

A servlet is a Java server application that answers and fulfills requests from clients. It's that simple. Tomcat interacts with and manages your servlets; that is one of its jobs. In terms of implementation, a servlet is a component that extends or implements classes or interfaces from the javax.servlet or javax.servlet.http packages. A servlet allows you to create dynamic content and is most commonly interacted with via the HTTP protocol. Some typical uses of a servlet include:
  • Processing data submitted by an HTML form (and optionally storing it)
  • Providing dynamic content (e.g. information stored in a DB)
  • Managing state information (e.g. sessions)
Servlets have the following advantages over a typical CGI: they don't run in their own process, they stay in memory between requests, and a single instance handles all requests concurrently. Servlets are also typically packaged in a WAR file, that is, a Web ARchive. This is the web analogy to a JAR file.

Your Tomcat Installation

Once you have Tomcat installed (I'll assume it's in /usr/local/java/tomcat), it should at a minimum have the following directories.
  • bin - contains startup, shutdown and other scripts
  • conf - server configuration
  • lib - The jar files used by Tomcat
  • logs - Application and server log files
  • webapps - Location of servlets/web applications
  • work - Automatically generated by Tomcat, these files are often intermediary (such as compiled JSP)
Optionally, you can create the following directories.
  • classes - Classes you want available to a servlet. You may have to configure this.
  • doc - Documentation for tomcat, copy from webapps/docs
  • src - Servlet API source files.
When you deploy a webapp as source or as a WAR file, it will end up under the webapps directory. When you access your servlet on the web, files and directories will be created under the work directory. The above should be enough hierarchy information for you to figure out where things are or should be.

Your Deployment

Containers conforming to the 2.2 or later Servlet specification are required to accept applications packaged as WAR files in a specified format. A WAR file has a specific directory and file hierarchy that must be conformed to, and as such it often makes sense for your development environment to reflect this layout (more on that in the next section). The WAR file when unpacked is useful for development, and packed is useful for deployment. The top-level directory of your web app hierarchy is also the document root of your app. You should put your HTML/JSP/UI files there. When you deploy your application to a server, your application is assigned a context path. If your context path is /catalog, then a request URI of /catalog/index.html will fetch the index.html file from your document root. From the document root, the directory and file hierarchy will look something like this:
  • *.html, *.jsp, images, etc - Files that must be visible to the client. You can break these up into a hierarchy if your application is large.
  • WEB-INF/web.xml - The Web Application Deployment Descriptor. This XML file describes servlets, initialization parameters, container security, etc.
  • WEB-INF/classes/ - Java class files that are required for your application that are not in JAR files. Added to your classpath.
  • WEB-INF/lib/ - Contains the JAR files required for your application. Added to your classpath.
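To make the layout above concrete, here is a minimal, hypothetical WEB-INF/web.xml for the 2.4 specification; the servlet class and URL pattern are invented placeholders, not code from the sample application:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<web-app xmlns="http://java.sun.com/xml/ns/j2ee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee
                             http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd"
         version="2.4">
  <display-name>Example Web Application</display-name>
  <!-- Declare a servlet (class name is a placeholder) -->
  <servlet>
    <servlet-name>example</servlet-name>
    <servlet-class>com.example.ExampleServlet</servlet-class>
  </servlet>
  <!-- Map a URL pattern, relative to the context path, to that servlet -->
  <servlet-mapping>
    <servlet-name>example</servlet-name>
    <url-pattern>/example/*</url-pattern>
  </servlet-mapping>
</web-app>
```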
The WEB-INF/web.xml file contains the Web Application Deployment Descriptor for your app. This is an XML document that defines everything about your app that the server needs to know (except the context path). The complete syntax and semantics of the descriptor are defined in Chapter 13 of the Servlet API specification, version 2.4. Also see doc/appdev/web.xml. The 2.5 API specification does not yet appear to be available in PDF form. A web application must be installed in a server container, even during the development phase. A web application can be installed in several ways; however, when you run "ant deploy" for this application, ant will submit the built WAR file to the Tomcat installation and Tomcat will automatically unpack it for you.

Your Source Code

This section primarily focuses on the directory structure and the build targets of your build.xml ant file. You want to separate your source code (tests and application) from your deployable application as much as possible. This makes both deployment and revision control easier. Below is the recommended hierarchy for the top-level project source directory.
  • build.xml - Your ant build file
  • build.properties - Ant build properties
  • build/ - temporary home for compiled classes and the assembled web app
  • dist/ - temporary home for javadocs and built WAR files
  • docs/ - Documents generated by javadoc or install notes, etc
  • lib/ - Jar files needed for builds and distributions
  • src/
    • src/tests/ - Your unit tests, load tests, etc
    • src/main/ - The primary java code for your servlets and application
  • web/ - User facing components (images, html, etc)
    • web/WEB-INF/ - The application
      • web/WEB-INF/web.xml - Your web application descriptor
      • web/WEB-INF/classes/ - The compiled classes from src/main/. Built for you.
      • web/WEB-INF/lib/ - JAR files from lib/. Built for you.
    • web/images/ - Images for web facing components
    • web/jsp/ - JavaServer Pages
Again, this is simply the recommended hierarchy, and the one that will be used for all the sample code. The top level build.xml file will include a properties file called build.properties. The properties file will contain build related properties such as the tomcat manager username and password and the base of your tomcat installation. The included build file has the following targets available:
  • all - Run clean target followed by compile target, to force a complete recompile.
  • clean - Delete any previous build and dist directories so that you can be sure the application builds from scratch.
  • compile - Transforms source files (from src/main/ directory) into object files, generally unpacked in build/WEB-INF/classes.
  • dist - Creates binary distribution of your application in a directory structure ready to be archived. Runs compile and javadoc.
  • install - Tells tomcat5 to dynamically install the web app and make it available for execution (deploy). Does not cause app to be remembered across restarts. If you just want Tomcat to recognize that you have updated classes (or web.xml) use the reload target instead.
  • javadoc - Creates Javadoc API documentation for the Java classes included in the application. Normally only done for dist.
  • list - List currently running web applications. Useful to check if app has been installed.
  • prepare - Create the build dest directory, copy static content to it. Normally executed indirectly.
  • reload - Signals Tomcat to shut down and reload the web application. Useful when the web application context isn't reloadable and you have updated classes or properties or added new JARs. In order to reload web.xml you must stop and then start the web application.
  • remove - Remove the web app from service (undeploy).
  • start - Start this web application.
  • stop - Stop this web application.
  • test - Run all unit tests.
  • usage - Display a short form of the above, the default target.
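As a rough sketch of how a few of the targets above might be wired together in build.xml (the directory names follow the hierarchy above; the property names are assumptions, not necessarily those of the sample build file):

```xml
<project name="calculator" default="usage" basedir=".">
  <!-- build.properties holds values like the Tomcat manager username/password -->
  <property file="build.properties"/>
  <property name="build.dir" value="build"/>

  <!-- prepare: create the build destination and copy static content into it -->
  <target name="prepare">
    <mkdir dir="${build.dir}/WEB-INF/classes"/>
    <copy todir="${build.dir}">
      <fileset dir="web"/>
    </copy>
  </target>

  <!-- compile: transform src/main into class files under build/WEB-INF/classes -->
  <target name="compile" depends="prepare">
    <javac srcdir="src/main" destdir="${build.dir}/WEB-INF/classes">
      <classpath>
        <fileset dir="lib" includes="*.jar"/>
      </classpath>
    </javac>
  </target>

  <!-- clean: delete previous output so the app builds from scratch -->
  <target name="clean">
    <delete dir="${build.dir}"/>
    <delete dir="dist"/>
  </target>
</project>
```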
In general, you will run "ant install" to build and install your application.

The Development Process

The Servlet/JSP development cycle mantra is "edit, test, deploy". Say it with me now: "edit, test, deploy". After you have created your base directory structure and installed your application at least once (i.e. Tomcat recognizes it), you will employ this mantra frequently.

How Does it Work?

Based on web/WEB-INF/web.xml (which sets our "routes", known as servlet mappings), the controller servlet, net.mobocracy.web.ControllerServlet, is loaded by default. When that servlet receives control, if the request is an HTTP GET it forwards to the JSP file located at /jsp/index.jsp (a forward is essentially handing off control). If the request is an HTTP POST, it does basic validation, instantiates the Arithmetic parser, and evaluates the arithmetic expression. On success it sets the ArithmeticSuccess attribute for the JSP page, and on failure it sets ArithmeticError. The doPost method also forwards to the /jsp/index.jsp page, but only after setting one of those attributes. Once the JSP page is handed control, it checks which attributes (if any) have been set by the servlet and acts appropriately. The JSP page only has a few lines of code, so it probably isn't worth discussing at length. The examples are meant to be evaluated in the source code, both in the comments and in the interaction. The code comes in at only 338 lines, not including the build file and README, so I will not spend a lot of time explaining the source here; if you have questions, please leave a comment or send me an email. You should find the source fairly well documented.

Conclusion

We will build upon this to take the application from a JSP/Servlet architecture to a REALM project, where we can utilize the wide variety of toolkits available to J2EE and the flexible front-end development environment of Ruby on Rails.
After reading this you should have an understanding of one model of Servlet/JSP development using Tomcat. Below is a list of resources.

Resources

Tomcat
  • http://tomcat.apache.org/tomcat-6.0-doc/index.html
  • http://en.wikipedia.org/wiki/Apache_Tomcat
  • http://www.coreservlets.com/Apache-Tomcat-Tutorial/
Java 1.5
  • http://java.sun.com/j2se/1.5.0/docs/api/index.html
JavaCC
  • https://javacc.dev.java.net/doc/docindex.html
  • http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-cooltools.html
  • http://www.idevelopment.info/data/Programming/java/JavaCC/The_JavaCC_FAQ.htm
Ant
  • http://ant.apache.org/manual/index.html
Servlets
  • http://www.servlets.com/
  • http://java.sun.com/products/servlet/
  • http://www.apl.jhu.edu/~hall/java/Servlet-Tutorial/
JSP
  • http://java.sun.com/products/jsp/
  • http://www.jsptut.com/
Source Code
  • http://mobocracy.net/code/calculator

Stop Giving out Your Passwords

Over the past 2-3 years, as this web 2.0 thing has become official jargon, another term that has become popular is "SOA". For those of you who are new to the term, SOA stands for Service Oriented Architecture and is more commonly referred to simply as a "web service". This effectively means that web sites expose functionality via a service using a protocol like REST, XMLRPC or the dreaded SOAP. The typical example is a calculator: a calculator web service that people hook their applications into, providing all their calculating needs without the integrating developers having to know anything about subjects such as addition and subtraction. Some web sites have taken SOA to mean, "Anything you expose via a web page that can be scraped, I can use." This means that more and more frequently, users are being asked to provide their username and password for sites such as GMail, Yahoo Mail and MySpace. Once you have given a third-party site your credentials, it logs in and scrapes information like contacts and friends. While this isn't a new practice, it only seems to have become widely accepted over the past few years. If you are a user and are asked for your credentials, should you provide them? I would say as a general rule, no. In the real world, however, it all depends on a variety of factors, such as what kind of data you are exposing, how much you trust the third party and the level of utility being provided by the service. My assumption is that most users implicitly trust many of these third parties and simply assume that they would not be asked for this information unless it was needed. The additional use of GMail/MySpace/etc corporate logos makes the request seem even more legit. As a third-party site, what are your ethical and legal responsibilities to your users?
I would argue that if a service such as GMail provides an authentication mechanism (they do) which doesn't require you to actually process the login or store any user data, you have a responsibility to use it, even if it doesn't mesh with your corporate branding. Additionally, you should give users the choice of whether to store their credentials rather than doing so automatically. Use of a logo without permission is, I believe, another no-no, as it implies endorsement. As a popular web site, realize that third parties will want to integrate with you. For the sake of your users, provide at a minimum a token-based authentication system that third parties can use. You could also just get on the wagon and embrace web services like everyone else (I'm looking at you, MySpace). In short, stop giving out your passwords and, providers, STOP ASKING FOR THEM. The security community has enough problems without companies instilling in end users the idea that giving out their password is okay.


Mirror your Bookmarks

I've been building and migrating my bookmarks.html file for nearly 10 years. It started out as a Netscape bookmark file, then became a Mozilla bookmark file, then it was transferred to Phoenix and finally to Firefox, where it has happily stayed for several years now. I recently wrote some Perl to automatically check my bookmarks, and found that a large number (around 10%) of the links were invalid (404) or off the net (no server response). It occurs to me that when you bookmark a page, you are often less interested in the URL itself than in being able to find that content again. Many of the resources I bookmark these days are publications, how-tos, FAQs or other informational pieces of content. It is much rarer that I bookmark a site with so many resources that I just want to be able to get back to the site itself. It seems like a useful Firefox extension would be one where, when you bookmark a site, you are asked if you want to mirror it as well (just fetch that page and its images/etc, no real mirroring). Then, when you try to go to a site from your bookmarks, if the site can't be reached or returns a 404, the bookmark pulls up the local copy for you. Is anyone aware of such an extension? If one doesn't exist I'll write one, but I'd rather download it. The closest thing I have found so far is the Resurrect Pages extension, but that seems more useful for recent or highly trafficked sites than for obscure ones. Although it does use the Internet Archive, IA is hardly reliable. On a side note, does anyone know what happened to them? It almost seems that they've stopped archiving most sites.
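The checking script mentioned above isn't shown; as a rough Java equivalent of the idea (the Netscape/Firefox bookmarks.html format stores each bookmark as an <A HREF="..."> anchor, so a simple pattern match suffices for extraction; the file name and URLs below are illustrative):

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BookmarkChecker {
    // Pull HREF values out of Netscape-style bookmarks.html content
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = Pattern.compile("<A HREF=\"([^\"]+)\"",
                                    Pattern.CASE_INSENSITIVE).matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Return the HTTP status code, or -1 when there is no server response
    static int check(String link) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(link).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            return conn.getResponseCode();
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        String html = "<DT><A HREF=\"http://example.com/faq\">FAQ</A>";
        for (String link : extractLinks(html)) {
            System.out.println(link); // prints http://example.com/faq
        }
    }
}
```

Feeding the real bookmarks.html through extractLinks and flagging every link whose check result is 404 or -1 reproduces the dead-link report described above.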


Bloom Filters for Everyone

A Bloom filter, according to Wikipedia, "is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set." Named for the computer scientist Burton Bloom, the Bloom filter has the following informal attributes:
  • The filter can yield false positives (told element is in the set but it isn't) but not false negatives (told element is not in set but it is)
  • Typically implemented using a bit vector
  • The more elements added to the filter, the higher the chance for false positives (assuming no resize)
  • Is space compact with respect to the set it represents (can represent all possibilities with little space)
So, what might you want to use a Bloom filter for? There are plenty of examples in search problems which illustrate their usefulness, so we'll look at two of them. First (taken from Wikipedia), consider a spell checker. Imagine a language whose dictionary is too large to hold in memory, making each spell check expensive. In this case, you map the words in the dictionary into a large bit vector using a Bloom filter. When you spell check the document and your filter indicates that a word is _correct_, you can check an original source (the dictionary file) to ensure you haven't received a false positive. This usage isn't ideal, since realistically in most cases (except mine) the majority of words will be spelled _correctly_. What you really want for a Bloom filter use case is one where the majority of lookups are misses. A better example comes from PlanetPeer, a P2P network. When a peer searches for content, it looks for peers whose Bloom filters have the correct bits set to indicate that the content is available. The peer then checks all found peers to see whether the content actually exists on those nodes. This is the ideal case for Bloom filter usage: it reduces the search space and the number of expensive operations required (in this case, network connections). The mathematics for a Bloom filter is fairly trivial, so I leave most of it to the Wikipedia reference. I also don't know how to embed LaTeX in Blogger yet. Three important choices for a Bloom filter are m (the total number of bits in the vector), k (the number of hash functions; each element added sets up to k bits in the vector) and the hash function itself (which should be chosen to reduce false positives). These values are often functions of the physical limitations of the hardware (disk space, etc.) and the desired false-positive rate.
One should note that the probability of a false positive is (1/2)^k, which for the optimal choice k = (m/n) ln 2 is approximately 0.6185^(m/n), where k is the number of hash functions used (and therefore bits set per element), m is the size of the vector and n is the number of elements inserted into the vector. This makes the choice of k, with respect to m and the potential size of n, the crucial element in reducing false positives. There are a variety of references to implementations in the Wikipedia article.
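To make the moving parts concrete, here is a minimal Bloom filter sketch in Java. Rather than k independent hash functions, it derives k indexes from two base hashes (a common shortcut); everything here is illustrative, not production code:

```java
import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int m; // total number of bits in the vector
    private final int k; // number of hash functions

    public BloomFilter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // Derive the i-th index from two base hashes of the element
    private int index(Object element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x5bd1e995;
        return Math.abs((h1 + i * h2) % m);
    }

    // Set the k bits for this element
    public void add(Object element) {
        for (int i = 0; i < k; i++) {
            bits.set(index(element, i));
        }
    }

    // false means definitely absent; true means probably present
    public boolean mightContain(Object element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(element, i))) {
                return false;
            }
        }
        return true;
    }
}
```

An element passed to add will always report mightContain as true (no false negatives); an element never added will usually report false, with false-positive probability roughly 0.6185^(m/n) as discussed above.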

The REALM Stack

Recently I started building with what I call the REALM stack: Ruby on Rails, EJB, Apache, Linux and MySQL. Although, to be fair, it's never EJB; it's usually Spring or something similar. This has completely changed my attitude from what it was while using the traditional LAMP stack. There is a more appropriate separation of responsibilities for each tier. A lot more design and forethought goes into the middleware tier because it can be expensive to make changes there (rebuild, redeploy, restart, etc). The front-end developers no longer wait on backend functionality to be exposed, which means they can work with stakeholders earlier in the process to provide a functional prototype. The toolkits are fantastic. And it's fun again: new challenges, new things to learn, new problems to solve and new communities to interact with. In the middleware I have been using Hessian as the service protocol between Rails and Spring, and it's fast, although I don't yet have any benchmarks. I'm using Maven 2 or Ant plus a host of applications for continuous integration, bug tracking, deployment, etc. I've also been using AppFuse, which does just what it says: it simplifies web development with Java. Granted, I'm not using any of its web stack, but regardless Matt Raible has done a great job with the project. Hibernate is what bears do, but despite the horrors you can be faced with, it is a fairly well-engineered tool with a lot of nice features. I like the inversion of control/dependency injection model, but it would be nice if there were a bit more convention, as there is with Rails. On the front end (RoR) it's easy to hook into web services provided by Spring remoting, and you can still benefit from the slick script.aculo.us integration, the rapid prototyping (scaffolding, etc) and the instant gratification of reloading the page to see a change (no need to redeploy a WAR file).
You can even generate the models/controllers/views from the XML provided by your servlet definition, so the UI folks can get started as soon as you've enabled remoting. And in general, UI folks seem to like RHTML/erb a lot more than JSP/Velocity/etc. When I was getting started with the stack, a lot of the examples were complex and reading-intensive for grasping relatively simple concepts. Over the next week I'm going to do a 5-part series on getting your REALM stack up and going, using as few lines of code and as few lines of text as possible while still providing all the appropriate references. The REALM series will look something like:
  • Tomcat & Servlets
  • Spring
  • Spring Remoting
  • RoR
  • RoR Web Services
The end goal is that you come away with a general understanding of all the technologies involved, the underlying design principles, how the architecture's components interact and where to go for help and documentation. Although the services protocol is Hessian, after going through this series you should be able to swap it out for REST, XMLRPC or SOAP (or any other services protocol). The first installment will be on Monday, so if you have any suggestions between now and then, please let me know. The platform used will obviously be Linux, although you should be able to adapt the series to another platform.
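As a preview of the Spring Remoting installment, exporting a Spring bean over Hessian is almost entirely configuration. The sketch below is hypothetical: the bean names and the com.example.CalculatorService interface are invented stand-ins, not code from the series.

```xml
<!-- In the Spring web application context: expose an existing bean over Hessian -->
<bean name="/CalculatorService"
      class="org.springframework.remoting.caucho.HessianServiceExporter">
  <!-- calculatorService is an ordinary Spring bean implementing the interface -->
  <property name="service" ref="calculatorService"/>
  <property name="serviceInterface" value="com.example.CalculatorService"/>
</bean>
```

A Hessian client library on the Rails side can then invoke the service at the /CalculatorService path; swapping this exporter for a REST or SOAP endpoint changes the configuration, not the service bean.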

JRuby Hits 1.0

Although not yet announced on the site, JRuby now has its official 1.0 release up here. This is really exciting. Congratulations, JRuby.


Book: The Crack in Space

I like Philip K. Dick, I always have. I've spent a bit of the past few years reading through most of his collection and I finally got to "The Crack in Space" (aka Cantata-140 in the UK), which I had always heard good things about. The plot summary is as follows:
In The Crack in Space, a repairman discovers that a hole in a faulty Jifi-scuttler leads to a parallel world. Jim Briskin, campaigning to be the first black president of the United States, thinks alter-Earth is the solution to the chronic overpopulation that has seventy million people cryogenically frozen; Tito Cravelli, a shadowy private detective, wants to know why Dr Lurton Sands is hiding his mistress on the planet; billionaire mutant George Walt wants to make the empty world all his own. But when the other earth turns out to be inhabited, everything changes.
One thing that really struck me about this work is Dick's projection of American culture onto the future: not only culture such as politics and religion, but subculture as well, such as racism and sexuality. His idea that such a substantial discovery as life on another world would be primarily of financial and political interest, not scientific or historic, holds more true today than ever. The thought that, given the chance, instead of learning from another species we would attempt to proselytize them again seems more pertinent than ever given our current trends toward gentrification and globalization. This work reminded me more of "The Man in the High Castle" than works like "A Scanner Darkly", because you become engaged in the story as opposed to the characters. It's a short read at under 200 pages; I'd recommend it to anyone into PKD or sci-fi.

Filtering Blogger by Label

How do you show only posts with a particular label on your Blogger homepage? Read on. What if you want to write a blog with a primarily technical audience, but also want to be able to write complete cruft without bothering readers with it? I searched for how to show only posts with a particular label on the home page, but the closest thing I found was this post asking how to remove content with a particular label from the homepage. Of course, that post had no responses. So I wrote my own solution, which turned out to be easier than I expected; it could easily support inclusive or exclusive filtering. I haven't been able to find much documentation on the markup for Blogger (if you know where it is, please let me know), but I was able to find some in a few blogs, such as the excellent Hoctro's Place. That gave me enough information to put this hack together. The only thing you really need to know is that you can create a 'function' with b:includable and call it with b:include. From your blog dashboard, click on "Layout", then "Edit HTML", then check the "Expand Widget Templates" checkbox. Now you're ready to edit. As always, I recommend making a backup before you get started. First, search for the line that starts with <b:includable id='main' var='top'>. You are going to replace everything in that function between <b:loop values='data:posts' var='post'> and its closing tag with the following:
 <b:if cond='data:blog.url == data:blog.homepageUrl'>
   <b:if cond='data:post.labels'>
     <b:loop values='data:post.labels' var='label'>
       <b:if cond='data:label.name == "main"'>
         <b:include data='post' name='printPosts'/>
       </b:if>
     </b:loop>
   </b:if>
 <b:else/>
   <b:include data='post' name='printPosts'/>
 </b:if>
The above change essentially says: if you are on the homepage and a post has the "main" label, call printPosts; if you are not on the homepage, call printPosts unconditionally. Now search for the closing tag of the main function (it looks like </b:includable>) and after it paste the following code:
<b:includable id='printPosts' var='post'>
  <b:if cond='data:post.dateHeader'>
    <h2 class='date-header'><data:post.dateHeader/></h2>
  </b:if>

  <b:include data='post' name='post'/>

  <b:if cond='data:blog.pageType == "item"'>
    <b:include data='post' name='comments'/>
  </b:if>
</b:includable>
The above code is your printPosts function. It is identical to what you had before; it was just turned into a function to reduce code duplication. That's it. Now only posts with a "main" label will show up on the front page. For Peter Sachs, who wanted to filter content out, just change the == "main" condition to something like != "nsfw". Enjoy.

Seattle Tech Events in June

I'm a big fan of going to meetups and the like if the content is any good. So far, since I've been in Seattle, the only blogs/news I have found covering tech events are Robert Raketty's and Ignite Seattle. I usually keep track of these kinds of events, so I'll start posting them here. This is just a list of events that I anticipate going to; they will usually make it to my calendar regardless of whether I can make it or not. Below is the public Google calendar link: If anyone wants something added or knows of another event, let me know.

The Difference Between a Developer and an Engineer

I had an interview today that left me asking myself: what constitutes basic CS mathematics? Certainly discrete mathematics such as combinatorics, graph theory, first-order logic, linear algebra and perhaps even some abstract algebra (groups, fields, rings), and probably a couple of others. However, how deep an understanding does one need to possess in any of these areas to call oneself a "computer scientist"? The master theorem? Do you need an understanding of advanced graph and geometric algorithms such as minimum spanning trees and convex hulls? How about encryption algorithms and number theory? Probably. As far as I can tell, computer science is a branch of mathematics. Computer programming is something else entirely, at least potentially. I've met many computer programmers during my interviews over the past few weeks, but very few scientists or hackers (and all three are different). Although this would seem obvious to most, being able to program a computer is an entirely different skill from understanding the why, how, how fast and O(my). It seems that recently there has been a move to distinguish between "Software Developers" and "Software Engineers", the latter being the distinction most prefer. What is the difference between a developer and an engineer? Probably nothing, but if you ask some people it's the difference between an architect and a construction worker: engineers design the framework and programmers put up the walls. Both of them helped build the St. Francis Dam, though, so really, who cares? If you build software, even if it's not for a life-support system, you have a responsibility to your users. Engineer or programmer, no one wants to be Mulholland.


Bootable Development Environments

I've recently been doing a fair amount of contract work and the one constant between all the jobs is the need for a consistent development environment with the following features:
  • Revision Control (preferably subversion)
  • Bug tracking
  • Project collaboration (ala wiki)
  • Documentation
  • Continuous integration
  • Report system (builds, tests, etc)
  • Status updates and notifications (commits, bugs)
Ideally, this environment would be bootable (a VMware image and an ISO) and easily distributed. You boot a new image, do your development, and when you're done you give the image to the client. Obviously this can also be used for open source projects or in a commercial environment. I've looked at Buildix, but it's missing many of these features, and since it's a ThoughtWorks project it uses CruiseControl instead of Continuum. I favor Continuum because it is language agnostic and easy to extend. It's also easy on the eyes, which is nice if you are doing for-pay work. Buildix also isn't that up to date in terms of software revisions. Is anyone interested in this type of system? I imagine using Subversion, Trac and FishEye on a Debian-based system (perhaps Knoppix) using Apache 2. Additionally, some custom management interfaces should be developed for administering users, projects, etc., perhaps using JMX? If I can get some interest, I'll try to get a working prototype together for the next project I start on.

Decentralized News (aka the decline of the interesting)

Back in the day (which was a Wednesday, btw), you only had a few good sources for tech information online: newsgroups (comp.lang.*, etc), a few websites (like Slashdot) and mailing lists. You knew many of the people you read, and felt very much like part of a community. Even without the rose-colored glasses, it was a different time; the landscape of the interesting was very different. Today you have thousands of sites, including news aggregators like Digg, countless mailing lists and more blogs than you can possibly read (such as this one). More and more, though, I find that the best "news" and information comes from individual blogs. The problem is, and always has been, that authors (particularly bloggers) don't consistently write interesting articles. For instance, I like reading Martin Fowler's blog, but I only really enjoy about half the articles, and a number of other blogs fall into the same category. So, as what you find relevant and interesting with respect to your life becomes decentralized, where do you go for news? How do you parse through and filter out all the cruft without taking up too much of your time? I love Netvibes, but I still have to read all the headlines and summaries to determine whether I want to read an article or not. Services like Digg are useful, but only for finding out what my "peers" (I use the term loosely) find interesting. Today Blogstorm launched, which tracks "what's hot" in the blogosphere based on the number of links to a blog entry over some period of time. There are, I'm sure, other services like this, but I particularly like this one. What would be useful is if they tracked which entries you clicked on and recommended stories based on those trends. More generally, I'm surprised that there isn't a web service (XMLRPC, SOAP, etc) that provides a generalized recommendation mechanism for sites and users. It wouldn't have to be very sophisticated to be useful.
If anyone knows of such a service, let me know.


Notebooks Suck

I take lots of notes during the course of a week. Formerly I did this with vim, a directory structure on my Linux machine and a trusty compiler. After I moved my mail to being hosted by Google, I decided I would try out this new "web 2.0" thing and start to move some other functionality online. I migrated my local "cal" to Google Calendar, self-hosted WordPress to Blogger, my Google start page to Netvibes, my todo list to Remember the Milk, and my text editing to Google Notebook. Whoa. Google Notebook sucks. Things I had grown accustomed to, like using tabs or formatting to indicate sections, suddenly didn't work that well. Code suddenly lost all meaning in this new Google world, as I had no syntax highlighting and it was a pain to move between my compiler and Google Notebook. It was time to search for a replacement. I looked at Zoho Notebook and, wow, it was pretty excellent. But it's slow as anything, and I don't know how much I trust having my valuable data with a company that I would rate as "fair weather" for now. So after playing with Zoho for a bit, I moved back to the speed and simplicity of Google Notebook. AHA! They have a Firefox plugin; this might be useful. Actually, the Google Notebook plugin for Firefox is nothing more than a dumber version of the standard online interface. Well, that's not very creative at all. I thought it might have all kinds of neat functionality, like additional formatting options (lists would be nice), the ability to embed rich media (sort of? images work), and just maybe the ability to include other files (text, odt, etc would be great). It seems like there is a union between Google Docs and Google Notebook that has not yet taken place, but needs to. For now, I'll stick with vim.

Workrave Rocks

I've been using Workrave for a few days now and although it keeps telling me to take a break and stop hacking, my RSI has been slowly improving. According to the website: "Workrave is a program that assists in the recovery and prevention of Repetitive Strain Injury (RSI). The program frequently alerts you to take micro-pauses, rest breaks and restricts you to your daily limit." I started off with the author recommended configuration for recovering from RSI: 10 minute rest breaks every 20 minutes, 30 second micro-pauses every 5 minutes and a 6 hour daily limit. The daily limit only counts when you're typing and mousing and seems to have about a 6 second window where it stops recording the time due to inactivity. With this configuration I was being interrupted too often and was having a difficult time feeling productive. Now I'm taking a 7 minute rest break every 53 minutes, 30 second micro-pauses every 7 minutes and an 8 hour daily limit. The micro-pauses can get a bit annoying, but with this configuration I'm feeling much more productive and my RSI isn't painful. This configuration seems to give me about 12-14 hours of computing time in a day (I do a lot of reading I guess). Additionally, Workrave provides me with interesting stats about my daily computer usage such as number of keystrokes, hours of usage, mouse clicks, mouse movement (in meters), etc. There are some nice graphing utilities available from other Workrave users for plotting basic statistics. Today's usage so far: 48 minutes with the mouse, 213m of mouse movement, 1931 clicks and 65279 keystrokes. Forget flamewars, show off your Workrave stats.

Terrible Coffee

I live in Seattle, and Seattleites love their coffee. I never drank coffee until about a month ago, when I got back from Europe and started preparing for interviews. I needed a little help waking up in the morning and realized that Mountain Dew and crack weren't giving me the boost of energy I required. So, as all Seattleites do at some point, I bought myself a little coffee maker and some freeze-dried Folgers (I don't think Seattleites do that part). I did this after spending too much money for too long at places like Zeitgeist, Seattle's Best Coffee and the like. Now, I had no idea how bad coffee could taste until I made my own for the first time. I don't know if it was the $20 coffee maker, the $3 coffee or the $1 filters, but my god, that first cup was like a curry-filled diaper. Just awful. In the weeks since my brilliant acquisition, my spending on coffee has mostly gone down, although I no longer go for the five-year-old Folgers at the local bodega. I've upped the quality of coffee to an "Italian Blend" (their words) made in Seattle, go figure. I can't go on drinking something so horrible. As I continue to attempt coffee perfection on the cheap, I can only hope that I find a tolerable concoction.