2005-06-03

Market Basket Analysis in a Nutshell

Market Basket Analysis is the task of determining what products are commonly purchased together. The general method used for MBA uses associational rule mining as implemented by the Apriori Algorithm, a reasonable explanation of the algorithm can be found here. The problem with this approach is that in implementation it can actually be quite slow, especially if the results need to be found in real-time and apriori data generation isn't an option (and it usually isn't for websites). MBA can be used in a couple of different ways. First, a marketing/data analyst can look for long term trends and possible growth or cross-sell opportunities. If this is the application then real-time processing most likely won't be needed and the apriori algorithm should be fine. The other application (and the one I'm interested in) is a real-time application and can be used on websites for carts, product pages and the like. The general apriori algorithm is used for offline analysis. Below I present a fast method for real-time MBA that makes the following assumptions:
  1. The items that you want to find association rules for are known
  2. The number of transactions that the items occur in is small compared to the total number of transactions
This method isn't revolutionary by any means, but I couldn't find any good published equivalents. Say you have a table defined as follows (SQL is specific to MySQL 4.x but can be modified easily):

CREATE TABLE `trans_item` ( `transaction_id` int(10) unsigned NOT NULL default '0', `product_id` int(10) unsigned NOT NULL default '0', PRIMARY KEY (`transaction_id`,`product_id`) ) ENGINE=MyISAM

If you are using MySQL 4.1 and want to find products that are commonly purchased along with product_id 1 you can run the following query:

SELECT product_id, COUNT(*) AS PurchaseCount FROM trans_item WHERE transaction_id IN (SELECT DISTINCT transaction_id FROM trans_item WHERE product_id = 1) AND product_id != 1 GROUP BY product_id ORDER BY PurchaseCount DESC;

This query will give you a list of products purchased with product_id 1 in decreasing order. Of course if you have a large number of rows you will want to put a limit on the subselect so you don't end up with a very slow query. I have found that the last 1-3k orders works just fine and is fast. If you are using MySQL 4.0, instead of using a subselect you can just prefetch that data. This query works very well if a customer is viewing a product page, but what if they are viewing their basket (with many items) or you simply want to show the customer items they might be interested in based on previous orders? If you need to do MBA for multiple items, the following query is an example of how to find commonly purchased items along with product_id 1 and product_id 3:

SELECT product_id, COUNT(*) AS PurchaseCount FROM trans_item WHERE transaction_id IN (SELECT DISTINCT a.transaction_id FROM trans_item AS a, trans_item AS b WHERE a.transaction_id = b.transaction_id AND a.product_id = 1 AND b.product_id = 3) AND product_id != 1 AND product_id != 3 GROUP BY product_id ORDER BY PurchaseCount DESC;

You can see that this is easily generalizable to multiple items in a basket. Feedback welcome on how I might be able to optimize this.