Say you are running a bookstore. (Didn’t you know that Rails is all about selling books?) Like everybody else, you moved to an IT infrastructure for managing sales over the last few years, and now you have 5000 customers in the database and know about all the books these guys bought at your store – on average a whopping 15 books per person.
Now Christmas is coming: you plan a small campaign in which you send a gift coupon to your customers. As the package is nicely designed (so it won’t disappear among all those buy-at-us-for-xmas coupons), it is somewhat expensive, and you want to send it only to your most valuable customers – i.e. to those who bought at least 5 hardcover books.
Easy enough to do:
    User.find(:all).each do |user|
      # count this user's hardcover books
      number_of_hardcovers = user.books.inject(0) do |cnt, book|
        cnt + (book.hardcover? ? 1 : 0)
      end
      print_address(user) if number_of_hardcovers >= 5
    end
(Well, you still need the package designed, packed, and taken to the post office. But this exercise is left to you, reader. If you want my address for some holiday gift, just ask.)
Note: Yes, I know about count. This example is somewhat made up – and should just highlight the point.
Now, 1 year later…
Last year’s X-mas campaign was just a hit! You now have 100,000 customers in the database, and they have bought 30 books each on average. Congratulations! But of course, this year’s campaign will be even better than last year’s! You are well prepared, are you not?
No, in fact you are not. The above code would still work, but really really slooow this time; it will clog your website and drive your online customers away from you! And all that just because that little piece of code collects data on all users, along with all the books they have bought, in your computer’s memory. That adds up to RAM usage for 100k users and 3M books, which easily rises to several gigabytes.
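To get a feeling for the numbers, here is a rough back-of-the-envelope sketch. The figure of about 1 KB per loaded model object is purely my assumption (your records may be much smaller or larger), and `estimated_gigabytes` is a made-up helper, not anything Rails provides.

```ruby
# Rough estimate only: bytes_per_record is an assumed figure, not a measurement.
def estimated_gigabytes(users, books_per_user, bytes_per_record)
  records = users + users * books_per_user   # every user plus every book
  records * bytes_per_record / 1_000_000_000.0
end

# 100k users, 30 books each, assuming roughly 1 KB per loaded object
puts estimated_gigabytes(100_000, 30, 1_000)
```

With these assumed numbers the estimate comes out at around 3 GB, which is in the same ballpark as "several gigabytes".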
The problem is that even though you are requesting the books on a per-user level, ActiveRecord’s built-in “caching” keeps the books for all the users around (for the tech savvy: in the “@books” instance variable of each User object). Usually this is exceptionally smart (TM) behaviour, but this time it is really shit hitting the fan.
There are several ways to deal with that issue. You could nil those instance variables, so that the garbage collector can reclaim the no-longer-needed memory at some point…
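The effect can be imitated in plain Ruby, without ActiveRecord. The sketch below uses a made-up `FakeUser` class whose `books` method memoizes its result in `@books`; it only mimics the behaviour described above and is not how ActiveRecord is actually implemented.

```ruby
# Plain-Ruby stand-in for a model with a cached association (not ActiveRecord).
class FakeUser
  def initialize(id)
    @id = id
    @books = nil
  end

  # Builds the books on first access and keeps them in @books afterwards.
  def books
    @books ||= Array.new(30) { |i| "book #{i} of user #{@id}" }
  end

  def books_loaded?
    !@books.nil?
  end
end

users = Array.new(1_000) { |id| FakeUser.new(id) }

users.each do |user|
  user.books.size   # touching the association fills this user's cache
end

# The array still references every user, and every user still references
# its books, so nothing can be garbage collected yet.
retained = users.count { |user| user.books_loaded? }
puts retained
```

As long as the array returned by the finder is alive, every user in it keeps its books alive as well.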
    User.find(:all).each do |user|
      ...
      user.instance_variable_set "@books", nil
    end

…or you could just reload the user model:

    User.find(:all).each do |user|
      ...
      user.reload
    end
However, the first solution is somewhat dirty, because it relies on an undocumented implementation detail. The second one is inefficient, because it runs an additional database query on an object that we just don’t need any longer…
Just don’t need any longer? Well then, let’s get rid of these objects as soon as possible! Array#each doesn’t do this for us: with it, the earliest point at which these objects can be released is after each has returned. But we can roll our own each, which is just a little bit destructive. It works (almost) like each, but removes every object from the array once it is no longer needed.
    class Array
      def destructive_each!
        until empty?
          yield first
          shift
        end
      end
    end

Note: the “yield first; shift” order is intended and has one significant difference to “yield shift” – think Exceptions!
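To make the Exceptions hint concrete, here is a small sketch comparing both orders. `lossy_each!` is a hypothetical name I made up for the “yield shift” variant; it is only there for contrast.

```ruby
class Array
  # Removes each element only after the block has finished with it.
  def destructive_each!
    until empty?
      yield first
      shift
    end
  end

  # Hypothetical "yield shift" variant: removes the element before the block runs.
  def lossy_each!
    yield shift until empty?
  end
end

processed = []
queue = [1, 2, 3, 4]
begin
  queue.destructive_each! do |n|
    raise ArgumentError, "cannot process #{n}" if n == 3
    processed << n
  end
rescue ArgumentError
  # 3 was never shifted off, so it is still at the front of the queue
end
p processed  # => [1, 2]
p queue      # => [3, 4]

lossy_queue = [1, 2, 3, 4]
begin
  lossy_queue.lossy_each! do |n|
    raise ArgumentError, "cannot process #{n}" if n == 3
  end
rescue ArgumentError
  # 3 was already removed when the block raised, so it is gone
end
p lossy_queue  # => [4]
```

With “yield first; shift” the element that caused the exception stays in the array, so you can inspect it or retry. With “yield shift” it is silently lost.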
Ready for next year?
To have you prepared for next year’s X-mas ad campaign, we will revisit this topic in one year’s time. The issue then will be “How to handle 50 Million users”, and it will quite likely involve map/reduce strategies. Just wait for “0x42 – Divide and Conquer”, in stores Dec 2009! Your job in the meantime: go outgrow Amazon!