Market Basket Analysis in a Nutshell

Market Basket Analysis is the task of determining what products are commonly purchased together. The general method used for MBA uses associational rule mining as implemented by the Apriori Algorithm, a reasonable explanation of the algorithm can be found here. The problem with this approach is that in implementation it can actually be quite slow, especially if the results need to be found in real-time and apriori data generation isn't an option (and it usually isn't for websites). MBA can be used in a couple of different ways. First, a marketing/data analyst can look for long term trends and possible growth or cross-sell opportunities. If this is the application then real-time processing most likely won't be needed and the apriori algorithm should be fine. The other application (and the one I'm interested in) is a real-time application and can be used on websites for carts, product pages and the like. The general apriori algorithm is used for offline analysis. Below I present a fast method for real-time MBA that makes the following assumptions:
  1. The items that you want to find association rules for are known
  2. The number of transactions that the items occur in is small compared to the total number of transactions
This method isn't revolutionary by any means, but I couldn't find any good published equivalents. Say you have a table defined as follows (SQL is specific to MySQL 4.x but can be modified easily):

CREATE TABLE `trans_item` ( `transaction_id` int(10) unsigned NOT NULL default '0', `product_id` int(10) unsigned NOT NULL default '0', PRIMARY KEY (`transaction_id`,`product_id`) ) ENGINE=MyISAM

If you are using MySQL 4.1 and want to find products that are commonly purchased along with product_id 1 you can run the following query:

SELECT product_id, COUNT(*) AS PurchaseCount FROM trans_item WHERE transaction_id IN (SELECT DISTINCT transaction_id FROM trans_item WHERE product_id = 1) AND product_id != 1 GROUP BY product_id ORDER BY PurchaseCount DESC;

This query will give you a list of products purchased with product_id 1 in decreasing order. Of course if you have a large number of rows you will want to put a limit on the subselect so you don't end up with a very slow query. I have found that the last 1-3k orders works just fine and is fast. If you are using MySQL 4.0, instead of using a subselect you can just prefetch that data. This query works very well if a customer is viewing a product page, but what if they are viewing their basket (with many items) or you simply want to show the customer items they might be interested in based on previous orders? If you need to do MBA for multiple items, the following query is an example of how to find commonly purchased items along with product_id 1 and product_id 3:

SELECT product_id, COUNT(*) AS PurchaseCount FROM trans_item WHERE transaction_id IN (SELECT DISTINCT a.transaction_id FROM trans_item AS a, trans_item AS b WHERE a.transaction_id = b.transaction_id AND a.product_id = 1 AND b.product_id = 3) AND product_id != 1 AND product_id != 3 GROUP BY product_id ORDER BY PurchaseCount DESC;

You can see that this is easily generalizable to multiple items in a basket. Feedback welcome on how I might be able to optimize this.


Zire on SuSE 9.3

I have spent the past two days trying to get as much as possible out of my Zire 72 on my SuSE 9.3 machine at home. I ditched kontacts in favor of evolution, despite it being more full featured, for the following reasons:
  • kpilot would crash randomly, not allowing synchronization to happen
  • kontacts worked oddly using the wmi window manager
  • kmail filters don't support (although this feature is in CVS) imap to imap moves or copies
I got evolution working with the Zire by adding the evolution-pilot rpm, which installs some .so files and some config files. I'm not sure why these weren't installed with the packages I had selected, since the libtool generated .la files for those .so files were already installed. I already had an SD reader/writer that worked in Linux (The SanDisk Mobile 5 in 1 Reader/Writer model SDDR-103), so I tried copying MP3 files to an SD card and popped it in my Zire. No luck. You need to create a directory named 'audio' on the SD card and put your files in that folder. Still to do:
  • Figure out how to get PDF files onto the Zire
  • OpenOffice for Zire?
  • See how it imports into a fresh Evolution config (at work)
  • Install MMPlayer (and figure out installs in general)
That's all for now.


Small Technology Disappoints, Details at 11

I hate PDAs, and let me explain why. In 1999 I had a vision of a device that was smaller than a laptop (pocket size), a little less functional then a laptop, and a little less money then a laptop. I had a vision of something that could help me organize my work, allow me to be productive on the go, and might provide a little entertainment. Being young and naive, I assumed that a PDA would meet my needs and would work fine with Linux (Home) and Solaris (Work). I did some research, and ended up purchasing a Palm III. I gave it away less than a month after I bought it. The Palm III had very poor Unix support, virtually no integration with any type of mail/calendar/contact system, poor handwriting recognition (the Newton from apple blew it away and was almost 10 years older), was slow with no entertainment, color screen, or real productivity tools at all, weighed 8 pounds and was about the size of a brick. You could carry it in your pocket if you had a tight belt to keep your pants up. I was disgusted, and people were being taken by Palm and other vendors promising a better life through PDA integration. For the past 6 years I have been without any type of PDA, preferring a written calendar and the contact manager and email provided by Thunderbird (and before that, Mutt). It has worked pretty well for the most part until the past 6 months or so. I started having enough going on that my desktop calendar was filling up, and I was missing meetings and appointments because of schedule conflicts or simply forgetting and not having the calendar in front of me. I tried to start putting appointments in my cell phone, but it lacks any type of Unix synchronization so I was inputting everything by hand and the input was taking up too much time to be useful. I recently upgraded my home and work machines to SuSE 9.3 which included Kontact and Evolution, which surprised me by having native PDA integration for synchronizing contact/calendar/email information. I generally dislike Gnome/KDE applications, finding them to be bloated and slow but at work I spend a good deal of my time working with schedules and people, so I'll put up with a fat app for some functionality. Kontact had it all, and I decided to check it out. After using Kontact for a few days and finding it to be mostly bug free with all the features I needed, I started thinking about my PDA purchase. I wanted the following features in my PDA:
  • Ability to view PDF files
  • Ability to view, edit and create Word/Excel (or even better, OpenOffice) files and at least view PowerPoint
  • Voice Memo capability
  • Audio (MP3 and/or Ogg) and Video Playback
  • SD (or similar) expansion
  • Native Wireless Connectivity
  • Strong Linux Support
I found two devices in my price range that met all these standards, the Palm Zire 72 and the Sharp Zaurus SL-5500. I actually like the Zaurus more than the Zire (it has two expansion cards and more memory by default), but ended up buying the Zire because a.) The Zire was available locally b.) I'm not sure what Sharps commitment to the Zaurus is c.) The Zire had more built in functionality then the Zaurus (1.2MP Camera for instance). So today I'm getting the Zire setup and configured with Kontact. Am I going to be rewarded for my persistence, or are PDAs still not where I want them to be? I'm obviously hoping for the reward, not the revenge.


Utility Based Hosting

The company sysadmin and I spent the past 6 months or so evaluating technologies, hosting companies, colo facilities, network providers and various hardware vendors. We took this information and designed a robust, SAN based system architecture with no single points of failure using what marketing types like to refer to as "best of breed technologies". Two things occurred to me during this process. First, it occurred to me that there really is no good documentation and there are no books on the subject of designing a fault tolerant system for high traffic web sites. Second, it occurred to me that the amount we were about to drop on this system was ridiculous, and there should be companies out there that can offer what we're looking for on a month to month lease that can be rolled out quickly, doesn't break the bank, and allows for quick future expansion. The issue seems to have been mostly addressed by Savvis (http://www.savvis.net), a company that hosts services for companies such as Microsoft and American Airlines. Now, I don't work for Savvis and we haven't decided to go with Savvis, but the system they boast is pretty impressive. I have a call with a Savvis SE tomorrow, I'll report more on my findings then. As for the book? At least some documentation is in order.


When cron won’t cut it

Our production systems at work started off like most, having cron run simple jobs at regular intervals. As the company grew, and the technology evolved however, cron started failing us. Paul Vixie, don't go in there! Our requirements started involving job restarts upon failure, jobs being dependent upon another job successfully completing, and jobs that needed to be run only when certain system parameters were met (load, disk usage, etc). Of course, the cron scripts could be modified to meet these needs, but there exist several expensive commercial solutions that accomplish the same thing and don't require the system administrators to spend all their time coding cron scripts. I looked around, and to my surprise I found no decent opensource job scheduler for unix. Does anyone know of a free and opensource unix job scheduler? I'm going to write one, if anyone is interested in helping. Let me know. To the mines with him!