D3 bar chart with zoom & hover

This is an example of a reusable chart built using d3. The range (zoom) slider is built in d3 too, as a separate widget, so it can be customized.

Check the sample source code on bitbucket for the full description of how to use it; here is the essence:

    var rangeWidget = d3.elts.startEndSlider().minRange(30);
    var myChart = d3.elts.barChart().rangeWidget(rangeWidget);
    myChart.mouseOver(function(el, d) { showHover(el, d) });
    myChart.mouseOut(function(el, d) { hideHover(el, d) });
    d3.csv('data.csv', function(data) {
        data = _.map(data, function(d) {return [d.id,d.price]});

Experience with Meteor

Meteor Day is upon us!

Here are the slides I am going to present on my experience developing with Meteor. In a nutshell -

  • Love the principles – they made it easy to develop a great webapp quickly
  • I’ve done a few cool things, I think! (like encryption, undo, admin panel with datatables, custom art:accounts-ui)
  • But… SEO is killing me (only two keywords showing!?)
  • The Paypal IPN has been a pain – still unresolved
  • Can’t escape the usual browser issues – and Iron Router adds a few for IE9
  • Speed has been an intermittent problem
  • Still feeling my way towards a good programming style
  • Please check out the app!
(Just click on the slide above and then you can advance through the deck using the arrow keys. Alternatively, click here to see them in a new tab.)

You can see the slides from the talk online here.


Harness your Python style to write good Javascript

Python encourages you to write good code. Javascript does not. How can you harness your Python style to write good Javascript?


Let’s try to write this Python code in Javascript.

class Vehicle(object):
    def __init__(self, size):
        self.size = size
    def __str__(self):
        return "Vehicle of size {0}".format(self.size)

class Hovercraft(Vehicle):
    def __init__(self, *args, **kwargs):
        # initialize just like a vehicle
        super(Hovercraft, self).__init__(*args, **kwargs)
        # add extra custom init if needed
    def hover(self, height):
        print("Hovering at height {0}".format(height))

# eg. make a new size 8 hovercraft
h = Hovercraft(8)
# eg. hover at height 12
# eg. change size and print
h.size = 9

Not obvious? The problem is that Javascript doesn’t have an out-of-the-box analogue to classes. Somehow we need to adapt objects and functions to the task.

Use prototypes

Your first thought might be to define new object types by defining object constructors, which are called with the new keyword, and adding methods to the object’s prototype chain. These objects look a lot like classes, don’t they? So you might write something like this:

// Not the best approach - see below
function Vehicle(size) {
    this.size = size;
    // or use this and arguments pseudo-arguments
Vehicle.prototype.toString = function () {
    return "Vehicle of size " + this.size;
function Hovercraft(size) {
    this.size = size;
Hovercraft.prototype = new Vehicle();
Hovercraft.prototype.hover = function (height) {
    alert("hovering at height "+height);

// eg. make a new size 8 hovercraft
h = new Hovercraft(8);
// eg. hover at height 12
// eg. change size and print
h.size = 9

This has a number of problems:

  • Vehicle and Hovercraft have the same constructor, but you can’t reuse it.
  • It feels strange to define the class’s methods outside the constructor, with the prototype lines.
  • There’s no room for private variables or methods.

Use closures

For a better solution, we need to think laterally – and use Javascript’s closures. I was hesitant to make use of these at first, thinking it would only lead to perverse and impenetrable code. But in fact they are a force for good, as we’ll see.

Put simply, a closure is a function which has variables bound to it. You write the closure as a function inside another function. The trick is that the inner function can refer to any of the outer function’s local variables. (Thanks for this succinct summary StackExchange!)

Here’s a nice example of a closure from a talk on functional programming given by Douglas Crockford, JavaScript architect at PayPal and formerly Yahoo (at 57mins). (In fact the previous example is from this talk too.) In this example, a closure is used to produce an object called singleton, which has a private variable, a private function, and two methods. You call the two methods as singleton.firstMethod(a,b) and singleton.secondMethod(c):

var singleton = (function () {
    var privateVariable;
    function privateFunction(x) {

    return {
        firstMethod: function (a, b) {
        secondMethod: function (c) {
}() );
// note the function is called immediately,
// so the var singleton is its returned value
// the surrounding brackets are just to help the reader

So let’s adapt that to our problem:

function vehicle(size) {
    var that = {
        size: size,
        toString: function () { 
            return "Vehicle of size " + this.size; 
    return that;
function hovercraft(size) {
    var that = vehicle(size);  // inherit from vehicle
    that.hover = function(height) { 
        alert("hovering at height "+height);
        return that; // optional - allows chaining
    return that;

// eg. make a new size 8 hovercraft
h = hovercraft(8);
// eg. hover at height 12
// eg. both at once using chaining
g = hovercraft(8).hover(12);
// eg. change size and print
h.size = 9

In the above, size is accessible to the world, just as it is in Python. But Javascript also lets us make it private:

function vehicle(size) {
    // can define private variables here
    // instead, here we use the fact that parameters are private
    return {
        toString: function () { 
            return "Vehicle of size " + size; 

In general, to make your own constructor function (eg. vehicle, hovercraft), you follow this recipe from the talk:

  1. Make an object
  2. Define (private) variables and functions
  3. Augment the object with methods (which have access to the privates above)
  4. Return the object

This pattern has a name: the module pattern. You’ll find a great writeup of it, and some ways to use it across multiple js files, in this post by Ben Cherry. He also points out the best way to handle dependencies, and to update existing variables.

Use chaining

In the example above, I have added the extra feature of “chaining”. In Python you have to put each effect on a separate line, eg.:

h.size = 9

But in Javascript you can potentially chain it all together into one, like so:


I first discovered the joy of chaining while using Mike Bostock’s super-powerful D3 library. He describes how to do it here – in fact, his description of how to write a reusable chart arrives at the same closure-based solution as we have, with the addition of getters and setters as below.

We made hover chain, but size do it doesn’t yet, as it’s a variable, not a function. Mike solves this by adding a getter/setter function for each public variable, which we could add like this:

function vehicle(startSize) {
    var size = startSize;
    var that = {
        toString: function () { 
            return "Vehicle of size " + size;
    // getter/setter functions
    that.size = function(_) {
        if (!arguments.length) return size;
        size = _;
        return that; // Q: would 'return this' work?
    return that;
// eg. this works now

Then h.size() returns (gets) the hovercraft h‘s size, and h.size(9) sets the size to 9.

Another benefit: the code never refers to this. That’s handy, because I find whenever I refactor code into smaller functions I get tripped up by the meaning of this changing.

You may also recognize such getters and setters from jQuery, where eg. $("body").text() returns the page body’s text, and $("body").text("eels") sets the body’s text to “eels”.

Still, as nice as chaining is, needing to add 5 lines of boilerplate code for every variable, with 3 references to the variable name that must be changed each time, is the sort of thing we became programmers to avoid.

To solve this, I am starting to put all the variables with getters and setters into a single object, eg. xt:

function vehicle(startSize) {
    var ext = {size: startSize};
    if (Object.keys) {
        var extKeys = Object.keys(ext);
    function toString() {
        return "Vehicle of size " + ext.size;
    return {
        toString: toString,
        get: function(name) {
            if (!arguments.length) return ext;
            return ext[name];
        set: function(name, val) {
            if (typeof extKeys!=="undefined" && extKeys.indexOf) {
                if (extKeys.indexOf(name)>=0) {
                    ext[name] = val;
                } else {
                    throw Error("Variable "+name+" not found");
            } else {
                // on browsers without Object.keys or indexOf,
                // don't check the name is valid
                ext[name] = val;
            return this;
// eg. these work

Don’t make it a global

Finally, you probably don’t want to have a new global called vehicle. It’s better to add it to another module, eg. RT (which may or may not already exist), as RT.vehicle. It might also depend on other modules, eg. the underscore (_) library. To do this, wrap the whole function in another closure!

if (typeof _ === 'undefined') { 
    throw new Error('Vehicle requires underscore') 
(function(RT, _) {
    function vehicle(startSize) { 
        ... // copy from above

    // attach vehicle to RT
    if (typeof RT==="undefined") {
        RT = {};
    RT.vehicle = vehicle;
    return RT;
}(typeof RT === "undefined" ? {} : RT, _));
// eg. make a new size 9 vehicle
v = RT.vehicle(9);

Too crazy?

In conclusion

Javascript’s closures give you access to some interesting programming patterns. Foremost among them, it lets you implement Python-style classes, with the added bonus of private variables and functions. And this is not just an academic gimmick that risks complicating your code in the real world: it is championed by the people who develop javascript, and it is used by jQuery and D3 among others. It helps you to write good, reusable code.

So – please let me know if you’ve used this pattern before, and whether my comparison to Python’s classes stacks up.

A final thought – perhaps it is wrong to compare Javascript to Python after all. Perhaps it is better compared to LISP!

Nicely format tables without using table-layout:fixed

I have recently helped to develop Signup Zone, a very handy app where people can easily create their own signup sheets and rosters. People have started using it for interesting purposes, including registering to help injured wildlife! And this has led to people entering in unexpected data, such as very long email addresses; and creating sheets with a lot of columns.

At its heart, the signup sheet is simply an HTML table such as this:

Date Name Contact email address
10/11/2014 Racing Tadpole racingtadpole@example.com

This table has the code:

            <th scope="col">Date</th>
            <th scope="col">Name</th>
            <th scope="col">Contact email address</th>
            <td>Racing Tadpole</th>

Since I don’t know in advance how to size the different columns, I actually really like the browser’s built-in ability to choose its own column widths.

However, the browser will not let any content spill outside a cell, and it only breaks words at white space. So if one user enters a long email address (or URL) into a cell, that column becomes very wide and the table either spills outside the page margins or looks unsightly – or both, like this:

Date Name Email
10/11/2014 Racing Tadpole racingtadpole@example.com
11/12/2014 Another Tadpole anotherracingtadpole@quitelong.verylong.superlong.megalong.example.com

Here are two solutions:

  • Use the new css command hyphens: auto. It won’t work in Opera, Chrome (!) or older browsers (eg. IE 9), and it can hyphenate too many words, but it looks ok, eg:
    Date Name Email
    10/11/2014 Racing Tadpole racingtadpole@example.com
    11/12/2014 Another Tadpole anotherracingtadpole@quitelong.verylong.superlong.megalong.example.com
  • Run some javascript to insert a “zero-width space” or an invisible soft hyphen in places where the offending text can break, eg. at the . or @ symbols. This gives you control over where the words break, eg:
    Date Name Email
    10/11/2014 Racing Tadpole racingtadpole@​example​.com
    11/12/2014 Another Tadpole anotherracingtadpole@​quitelong​.verylong​.superlong​.megalong​.example​.com

    However, if someone copies the email address it will also copy the invisible characters, which could cause trouble.

Here’s some javascript code to implement the second solution in Meteor:

<template name="example">
function htmlEscape(str) {
    return String(str).replace(/&/g, '&amp;')
            .replace(/</g, '&lt;')
            .replace(/>/g, '&gt;')
            .replace(/"/g, '&quot;')
            .replace(/'/g, '&#39;');

function myMarkup(str) {
    return String(str).replace(/@/g, '&#8203;@')
            .replace(/\./g, '&#8203;.')
            .replace(/\//g, '&#8203;/');

Template.example.email = function(){
    // suppose s is the string we want to show
    return myMarkup(htmlEscape(s));

The htmlEscape function is inspired by this Stack Overflow post. An alternative is to use Spacebars.SafeString.

Hope that’s helpful!


9 Lessons from PyConAU 2014

A summary of what I learned at PyCon AU in Brisbane this year. (Videos of the talks are here.)

1. PyCon’s code of conduct

Basically, “Be nice to people. Please.”

I once had a boss who told me he saw his role as maintaining the culture of the group.  At first I thought that seemed a strange goal for someone so senior in the company, but I eventually decided it was enlightened: a place’s culture is key to making it desirable, and making the work sustainable. So I like that PyCon takes the trouble to try to set the tone like this, when it would be so easy for a bunch of programmers to stay focused on the technical.

2. Django was made open-source to give back to the community

Ever wondered why a company like Lawrence Journal-World would want to give away its valuable IP as open source? In a “fireside chat” between Simon Willison (Django co-creator) and Andrew Godwin (South author), it was revealed that the owners knew that much of their CMS framework had been built on open source software, and they wanted to give back to the community. It just goes to show, no matter how conservative the organisation you work for, if you believe some of your work should be made open source, make the case for it.

3. There are still lots more packages and tools to try out

That lesson’s copied from my post last year on PyCon AU. Strangely this list doesn’t seem to be any shorter than last year – but it is at least a different list.

Things to add to your web stack -

  • Varnish – “if your server’s not fast enough, just add another”.  Apparently a scary scripting language is involved, but it can take your server from handling 50 users to 50,000. Fastly is a commercial service that can set this up for you.
  • Solr and elasticsearch are ways to make searches faster; use them with django-haystack.
  • Statsd & graphite for performance monitoring.
  • Docker.io

Some other stuff -

  • mpld3 – convert matplotlib to d3. Wow! I even saw this in action in an ipython notebook.
  • you can use a directed graph (eg using networkx) to determine the order of processes in your code

Here are some wider tools for bioinformaticians (if that’s a word), largely from Clare Sloggett’s talk -

  • rosalind.info – an educational tool for teaching bioinformatics algorithms in python.
  • nectar research cloud – a national cloud for Australian researchers
  • biodalliance – a fast, interactive, genome visualization tool that’s easy to embed in web pages and applications (and ipython notebooks!)
  • ensembl API – an API for genomics – cool!

And some other sciency packages -

  • Natural Language Toolkit NLTK
  • Scikit Learn can count words in docs, and separate data into training and testing sets
  • febrl – to connect user records together when their data may be incorrectly entered

One standout talk for me was Ryan Kelly’s pypy.js, implementing a compliant and fast python in the browser entirely in javascript. The only downside is it’s 15 Mb to download, but he’s working on it!

And finally, check out this alternative to python: Julia, “a high-level, high-performance dynamic programming language for technical computing”, and Scirra’s Construct 2, a game-making program for kids (Windows only).

4. Everyone loves IPython Notebook

I hadn’t thought to embed javascript in notebooks, but you can. You can even use them collaboratively through Google docs using Jupyter‘s colaboratory. You can get a table-of-contents extension too.

5. Browser caching doesn’t have to be hard

Remember, your server is not just generating html – it is generating an http response, and that includes some headers like “last modified”, “etag”, and “cache control”. Use them. Django has decorators to make it easy. See Mark Nottingham’s tutorial. (This from a talk by Tom Eastman.)

6. Making your own packages is a bit hard

I had not heard of wheels before, but they replace eggs as a “distributable unit of python code” – really just a zip file with some meta-data, possibly including operating-system-dependent binaries. Tools that you’ll want to use include tox (to run tests in lots of different environments); sphinx (to auto-generate your documentation) and then ReadTheDocs to host your docs; check-manifest to make sure your manifest.in file has everything it needs; and bumpversion so you don’t have to change your version number in five different places every time you update the code.

If you want users to install your package with “pip install python-fire“, and then import it in Python with “import fire“, then you should name your enclosing folder python_fire, and inside that you should have another folder named fire. Also, you can install this package while you are testing it by cding to the python-fire directory and typing pip install -e . (note the final full-stop; the -e flag makes it editable).

Once you have added a LICENSE, README, docs, tests, MANIFEST.insetup.py and optionally a setup.cfg (to the python-fire directory in the above example) and you have pip installed setuptoolswheel and twine, you run both

python setup.py bdist_wheel [--universal]
python setup.py sdist

The bdist version produces a binary distribution that is operating-system-specific, if required the universal flag says it will run on all operating systems in both Python 2 and Python 3). The sdist version is a source distribution.

To upload the result to pypi, run

twine upload dist/*

(This from a talk by Russell Keith-Magee.)  Incidentally, piprot is a handy tool to check how out-of-date your packages are. Also see the Hitchhiker’s Guide to Packaging.

7. Security is never far from our thoughts

This lesson is also copied from last year’s post. If you offer a free service (like Heroku), some people will try to abuse it. Heroku has ways of detecting potentially fraudulent users very quickly, and hopes to open source them soon. And be careful of your APIs which accept data – XML and YAML in particular have scary features which can let people run bad things on your server.

8. Database considerations

Some tidbits from Andrew Godwin’s talk (of South fame)…

  • Virtual machines are slow at I/O, so don’t put your database on one – put your databases on SSDs. And try not to run other things next to the database.
  • Setting default values on a new column takes a long time on a big database. (Postgres can add a NULL field for free, but not MySQL.)
  • Schema-less (aka NoSQL) databases make a lot of sense for CMSes.
  • If only one field in a table is frequently updated, separate it out into its own table.
  • Try to separate read-heavy tables (and databases) from write-heavy ones.
  • The more separate you can keep your tables from the start, the easier it will be to refactor (eg. shard) later to improve your database speed.

9. Go to the lightning talks

I am constantly amazed at the quality of the 5-minute (strictly enforced) lightning talks. Russell Keith-Magee’s toga provides a way to program native iOS, Mac OS, Windows and linux apps in python (with Android coming). (Keith has also implemented the constraint-based layout engine Cassowary in python, with tests, along the way.) Produce displays of lightning on your screen using the von mises distribution and amazingly quick typing. Run python2 inside python3 with sux (a play on six).  And much much more…

Finally, the two keynotes were very interesting too. One was by Katie Cunningham on making your websites accessible to all, including people with sight or hearing problems, or dyslexia, or colour-blindness, or who have trouble with using the keyboard or the mouse, or may just need more time to make sense of your site. Oddly enough, doing so tends to improve your site for everyone anyway (as Katie said, has anyone ever asked for more flashing effects on the margins of your page?). Examples include captioning videos, being careful with red and green (use vischeck), using aria, reading the standards, and, ideally, having a text-based description of any graphs on the site, like you might describe to a friend over the phone. Thinking of an automated way to do that last one sounds like an interesting challenge…

The other keynote was by James Curran from the University of Sydney on the way in which programming – or better, “computational thinking” – will be taught in schools. Perhaps massaging our egos at a programming conference, he claimed that computational thinking is “the most challenging thing that people do”, as it requires managing a high level of complexity and abstraction. Nonetheless, requiring kindergarteners to learn programming seemed a bit extreme to me – until he explained at that age kids would not be in front of a computer, but rather learning “to be exact”. For example, describing how to make a slice of buttered bread is essentially an algorithm, and it’s easy to miss all the steps required (like opening the cupboard door to get the bread). If you’re interested, some learning resources include MIT’s scratch, alice (using 3D animations), grok learning and the National Computer Science School (NCSS).

All in all, another excellent conference – congratulations to the organisers, and I look forward to next year in Brisbane again.

Make an animated & reusable barchart

Do you need a dynamic bar chart like this in your web page, with positive and negative values, and categories along the x-axis?

Dog breeds are on my mind at the moment as we just bought a new Sheltie puppy – this chart might show a person’s scores for each breed. Click the button above to see the next person’s (random) set of scores.

This is an example of a reusable chart built using d3. Using it is fairly simple, eg.:

<script src="http://d3js.org/d3.v3.min.js"></script>
<script src="barChart.js"></script>
    var points = [['Beagle',-10],
    var myChart = d3.elts.barChart().width(300);

Please check out the source code for barChart.js on bitbucket.

You may also find this helpful even if you don’t need a barchart, but want to understand how to build a reusable chart. I was inspired when I read Mike Bostock’s exposition, Towards reusable charts, but I found it took some work to get my head around how to do it for real – so I hope this example may help others too.

The key tricks were:

  • How to adjust Mike Bostock’s sample code to produce bars instead of a line and area
  • Where the enter(), update and exit() fit in (answer: they are internal to the reusable function)
  • How to call it, including whether to use data vs datum (answer: see code snippet above)
  • How to refer to the x and y coordinates inside the function (answer: the function maps them to d[0] and d[1] regardless of how they started out)

You can find a much fancier version of this chart, with a range slider and hover text, in this post.

Good luck!

Monte Carlo Business Case Analysis in Python with pandas

I recently gave a talk at the Australian Python Convention arguing that for big decisions, it can be risky to rely on business case analysis prepared on spreadsheets, and that one alternative is to use Python with pandas.

The video is available here.

The slides for the talk are shown below – just click on the picture and then you can advance through the slides using the arrow keys. Alternatively, click here to see them in a new tab.

You can see the slides from the talk online here.

I also showed an ipython notebook, which you can find as a pdf here. The slides, the ipython notebook, and the beginnings of a library to make the analysis easier, are all at bitbucket.org/artstr/montepylib.

How is your tax money being spent?

Want to know how your tax money is being spent? The Australian Budget Explorer is an interactive way to explore spending by portfolio, agency, program or even in more detail; compare 2014 against previous years; and search for the terms that interest you.

Australian Budget Explorer

This was produced in collaboration with BudgetAus. This year for the first time, the team at data.gov.au provided the Budget expenditure data in a single spreadsheet, which Rosie Williams (InfoAus) manipulated to include further data on the Social Services portfolio. The collaboration is producing lots of good visualisations, collected at AusViz.

I won’t editorialise about the Budget here; instead here is my data and extensions wishlist:

Look-through to include State budgets

The biggest line item (component) in the Federal Budget is $54 billion for “Administered expenses: Special appropriation GST Revenue Entitlements – Federal Financial Relations Act 2009″, which I take it is revenue to the States. I would love to be able to “look through” this item into how the States spend it.

The BudgetAus team has provided some promising data leads here.

Unique identifiers to track spending over time

One of the most frequent requests I get is to track changes in spending over time.

Unfortunately this is hard, as there are no unique identifiers for a given portfolio, program, agency or component. That means if the name changes from one year to the next, it is hard to work out which old name corresponds to which new name. E.g. In 2014, the Department of Employment & Workplace Relations has been split into the Department of Employment and the Department of Education, while the Environment portfolio used to be “Sustainability, Environment, Water, Population and Communities”.

It would be great to give all spending an identifier, and have a record of how identifiers map from one year to the next.

What money is actually spent?

How does the budget relate to what is spent? There is some info here at BudgetAus, but the upshot is “This might be a good task for a future group of volunteers”…


There is revenue data available here – I haven’t looked at it carefully yet, but I hope to include it, if possible.

Cross-country comparison

It would be great to compare the percentages spent in key areas by governments across the world.
Maybe it’s already being done? To do this I’d need some standard hierarchy of categories (health, education, defence, and subdivisions of these, etc), and we’d need every country’s government (and every State government) to tag their spending by those categories. Sounds simple in concept but I bet it would be hard to make it happen.

In the meantime, my plan is to check quandl for data and see how far I can go with what’s there…


Finally, many thanks to the authors for the awesome d3 package!


If you have any comments or know how to solve any of the data issues raised above, please let me know.


Monte Carlo cashflow modelling in R with dplyr

Business cashflow forecast models are virtually always Excel spreadsheets. Spreadsheets have many advantages – they have a low barrier to entry and are easy for most people to understand.  However as they get more complicated, the disadvantages start to appear. Without good discipline the logical flow can become confusing. The same formula must appear in every cell in a table. When new columns or rows are needed, it is hard to be sure you have updated all the dependencies. Sometimes the first or last cell in a table needs a slightly different formula to the rest. Checking the spreadsheet still works after every change is laborious.  And producing report-ready plots in Excel can be tedious.

Furthermore, most cashflow forecasts are deterministic, ie. they incorporate assumptions about sales, costs, and prices, and calculate the profit (eg. EBITDA) for this single set of assumptions. Everybody knows the profit will not turn out to be exactly equal to the forecast.  But it is not clear from a deterministic model what the range of profits might be, and a deterministic model cannot tell you how likely a loss is.  So many businesses are turning to simulated (“Monte Carlo”) cashflow forecasts, in which the assumptions are ranges; the forecast is then a range of outcomes.

There are tricks to implementing Monte Carlo models in Excel (one approach is to use Data Tables to run the simulation, and some clever dynamic histogram functions to plot the range of profits); alternatively, several good commercial packages exist which make it much easier.  But with all the extra simulation information to accommodate, the problems of spreadsheets only get worse.

Use a programming language!

It doesn’t at first seem like a natural idea, especially if you are well-versed in Excel, but you can actually implement a cashflow model in any programming language, eg. python, R, matlab or Mathematica. Over the years programmers have developed a few principles which directly address many of the problems with spreadsheets:

  • Don’t Repeat Yourself” (DRY): you shouldn’t have the same formula appearing more than once anywhere. No need to copy and paste it from one cell into all the cells in a table! And if the last column of a table needs a slightly different formula, it should be more transparent.
  • Unit testing: If you make a change to the model, you can see immediately if it breaks something. Eg. set up some sample data for which you know the answer you want; then set up a script which checks the sample data gives you that answer. Every time you change the model, run the script.
  • Flexible arrays: Most languages make it easy to change the size of your arrays (ie. add rows and columns to your tables).
  • Configuration in text, not through a user interface: Eg. Define the precise layout of the graphics that you want upfront (eg. colours, fonts, line widths), not by adjusting each plot manually.
  • Object oriented programming (!): Collect all the information and methods relevant to each object in one place… etc…

Use R!

I have recently developed a Monte Carlo cashflow forecast model in R.  There are quite a few subtleties to making it work nicely which may help you if you’re trying the same thing. (I would like to compare this to a python version, most likely using Pandas, as Python has a nicer structure as a programming language than R does.  But each language has its advantages.)

Here is a simplified version of the model:

  • The business has several farms.
  • The farms have different tree crops (apples, pears, and maybe more).
  • In fact there are different varieties of each (Granny Smith, Pink Lady, …) but we’re not sure if we want to model that fact yet; the model will need to be flexible enough to add them in later.
  • Input numbers: the number of trees per farm;
  • Input ranges:
    • how much fruit each tree produces;
    • the price of fruit;
    • the variable cost (per kg of fruit);
    • the fixed cost.
  • The cashflow model itself is very simple:
    • the volume of fruit is the number of trees times the fruit/tree;
    • the cost is the volume of fruit times the variable cost plus the fixed cost;
    • the revenue is the volume of fruit times the price;
    • the profit is revenue minus cost.


The first decision we have to face is: what data structure should we use to hold the forecasts? Remember that we will need a profit estimate for each farm, crop and potentially variety.

My first inclination would be to create a multidimensional table, with three dimensions (farm × crop × variety).  In Python this could be as simple as a dictionary with keys constructed as tuples of indices for each dimension (e.g. profit[('Farm A', 'Apple', 'Granny Smith')]). However, in R, and possibly even in Python with Pandas, a better structure is the dataframe, which is modelled on how you would store the data in a database. So it is a two-dimensional structure, with columns for each dimension and for each value of interest. For example:

         farm   crop numTrees fruitPerTree
1      farm A  apple       10          9.6
2      farm B  apple       20         20.1
3      farm A   pear       30         25.8
4      farm B   pear        1          1.7

These dataframes will also hold the simulation parameters, eg. the mean and standard deviation of the fruitPerTree.

Input data

There is a convenient function in R to read csv files in as dataframes:

require(plyr, quietly=TRUE, warn.conflicts=FALSE);

readDataframe <- function(path) {
    return(read.table(path, row.names=NULL, 
                            header = TRUE, 
                            sep = ",", 
                            strip.white = TRUE, 
                            comment.char = "#", 

numTrees <- readDataframe('data/numTrees.csv');

This would read in data/numTrees.csv, eg.:

# numTrees.csv
farm, crop, variety, numTrees
Farm A, apple, Granny Smith, 100
Farm A, apple, Jonathon, 200
Farm A, pear, Nashi, 50
Farm B, apple, Granny Smith, 200
Farm B, apple, Jonathon, 400
# Farm B has no pears


I have found the dplyr package (“the next generation of plyr“) to be very useful for this.

library('dplyr', quietly=TRUE); # for left_join
dfSim <- function(nSims, distName, paramDf, paramNames, colName) {
    # Args:
    #   nSims:      integer, e.g. 10000
    #               TODO: pass 0 for the mean of the distribution
    #   distName:   distribution function name (as a string), e.g. 'rnorm'
    #   paramDf:    dataframe of parameters, e.g. of mean, sd over farms x crops
    #   paramNames: vector of the col names in paramDf that correspond to parameters, 
    #               e.g. c('mean', 'sd')
    #   colName:    name of the output column, e.g. 'fruitPerTree' 
    #               (optional, defaults to the name of the paramDf)
    # Returns:
    #   a dataframe that is nSims times bigger than the paramDf
    # E.g.:
    #   simPrice <- data.frame(crop=c('apple','pear'), mean=c(10,22), sd=c(2,4));
    #   paramNames <- c('mean','sd')
    #   dfSim(nSims=3, distName='rnorm', paramDf=simPrice, paramNames=paramNames, colName='price');
    if (missing(colName)) {
        colName <- deparse(substitute(paramDf));
    nRows <- nrow(paramDf);
    df <- left_join(paramDf, data.frame(simId=1:nSims));
    df[colName] <- do.call(get(distName), c(list(n=nrow(df)), df[paramNames]));
    # delete the parameters
    df<-dropFields(df, paramNames);

dropFields <- function(df, fieldNames) {
	# Args:
	#   df:         dataframe
	#   fieldNames: vector of field or column names
	# Returns:
	#   the dataframe with those fields/columns dropped
	return(df[,!(names(df) %in% fieldNames), drop=FALSE]); # the drop=FALSE stops a 1D dataframe becoming a vector

In practice you may want to extend this, eg.:

  • To ‘sniff’ which distribution (rnorm etc) to use, based on the parameters chosen
  • To return only the mean of the distribution if nSims==0, so you can run the model in deterministic mode

The first of these can be achieved as follows:

autoDetectDist <- function(df) {
    # Args:
    #   df:   dataframe of parameters, eg. with fields farm, crop, mean, sd
    # Result:
    #   a list "result" with:
    #   result$distName:    distribution function name (eg. 'rnorm')
    #   result$paramNames:  vector of the parameter names (eg. mean, sd)
    #   result$df:          possibly revised dataframe
    #                       (eg. a triangular distn may have min,max,mode mapped to a,b,c) 
    normParams <- c('mean','sd');
    triangleParams <- c('min','mode','max');
    singleValueParams <- c('value');
    if (all(normParams %in% names(df))) {
        return(list(distName='rnorm', paramNames=normParams, df=df));
    if (all(singleValueParams %in% names(df))) {
        # a deterministic single value - just simulate as a uniform distribution with min=max=value
        result <- list(distName='runif', paramNames=c('min','max'));
        result$df <- df %.% mutate(min=value, max=value);
        result$df <- dropFields(result$df, singleValueParams);
    # extend this to other distributions as desired, eg. triangle
    stop(paste('unexpected parameters in ', deparse(substitute(df))));

doSim <- function(nSims, paramDf, colName) {
    # see dfSim for details
    distInfo <- autoDetectDist(paramDf);
    if (missing(colName)) {
        colName <- deparse(substitute(paramDf));
    return(dfSim(nSims=nSims, colName=colName, distName=distInfo$distName,
            paramDf=distInfo$df, paramNames=distInfo$paramNames));

With this code, you can have a datafile such as data/fruitPerTree.csv:

# fruitPerTree.csv
farm, crop, variety, mean, sd
Farm A, apple, Granny Smith, 200, 40
Farm A, apple, Jonathon, 130, 30
Farm A, pear, Nashi, 150, 30
Farm B, apple, Granny Smith, 200, 35
Farm B, apple, Jonathon, 150, 40

(This is a simplified example with dummy data – in practice I would not recommend a Normal distribution for this quantity, since it should not go negative.)

You can simulate with:

nSims <- 3;
fruitPerTree <- readDataframe('data/fruitPerTree.csv');
simFruitPerTree <- doSim(nSims, fruitPerTree);

This produces a dataframe simFruitPerTree such as:

     farm  crop      variety simId fruitPerTree
1  Farm A apple Granny Smith     1    263.07713
2  Farm A apple Granny Smith     2    165.75104
3  Farm A apple Granny Smith     3    208.85228
4  Farm A apple     Jonathon     1     97.53114
5  Farm A apple     Jonathon     2    110.81156
6  Farm A apple     Jonathon     3    175.87662
7  Farm A  pear        Nashi     1    143.44306
8  Farm A  pear        Nashi     2    183.15205
9  Farm A  pear        Nashi     3    159.63510
10 Farm B apple Granny Smith     1    189.73699
11 Farm B apple Granny Smith     2    235.34085
12 Farm B apple Granny Smith     3    260.16784
13 Farm B apple     Jonathon     1    171.05951
14 Farm B apple     Jonathon     2    212.27406
15 Farm B apple     Jonathon     3    110.02394

Implementing the cashflow logic

As an example, let’s multiply the number of trees by the fruit per tree, using the dplyr package:

library('dplyr', quietly=TRUE); # for left_join, %.% and mutate
nSims <- 3;
numTrees <- readDataframe('data/numTrees.csv');
fruitPerTree <- readDataframe('data/fruitPerTree.csv');
simFruitPerTree <- doSim(nSims, fruitPerTree);

simFruit <- simFruitPerTree %.% left_join(numTrees) %.% mutate(fruit = fruitPerTree * numTrees);

When you execute this, you’ll get the message that it is Joining by: c("farm", "crop", "variety"). So in your data files, you do can specify any dimensions you want, and the code will just work. Nice! (There are some caveats to this coming up though…)

You can hopefully see how you could use this approach to add whatever columns you need (eg. profit as revenue minus cost, revenue as volume times price, …).

Another class of calculation that comes up a lot is accumulating or differencing a quantity over time. Suppose we had a “year” column in our simFruitPerTree dataframe above, and we wanted to calculate the total fruit produced over time:

# set up a sample dataframe with a year column
annFruitPerTree <- left_join(fruitPerTree, data.frame(year=2010:2012));
simAnnFruitPerTree <- doSim(3, annFruitPerTree);
simAnnFruit <- simAnnFruitPerTree %.% left_join(numTrees) %.% mutate(annFruit = annFruitPerTree * numTrees);
# find cumulative over year, for each simId
simAnnFruit <- simAnnFruit %.% group_by(simId, farm, crop, variety) %.% mutate(cumFruit=cumsum(annFruit)) %.% ungroup();

The only problem with this is that we have had to hardwire the dimensions farm, crop, variety into the code – to make things worse, not even as variables we could set once-off, but directly into the group_by command. This is unsatisfactory, but I do not know a way around it with dplyr. If you can help, please let me know. (Note – this stackoverflow post may be the answer.)

Note in the code above, simFruit <- simFruitPerTree %.% left_join(numTrees) I have used left_join. This returns all rows from the first table, and all columns from either table. This is fine in this case, since you would expect the same farm, variety and crops in each table. However, in some cases, you want an "outer join", using all possible farm, variety and crops, even if they do not appear in some tables. For example, there may be a head office with an annual fixed cost, as well as the annual fixed cost of the farms. So a fixed cost data file could include "head office" as a "farm". Unfortunately, dplyr does not have an "outer_join" yet, so you need to fall back on the basic merge command instead, eg.:

simCost <- merge(simVarCost, simFixedCost, all=TRUE);

Summarising the results

If you have a profit column in a dataframe sim with lots of simulations, you can sum across all farms, crops and varieties to produce a single profit figure for each iteration as follows:

simProfitSummary <- sim %.% group_by(simId) %.% summarise(profit = sum(profit)) %.% ungroup();
# or to format nicely:
sim %.% group_by(simId) %.% 
    summarise(totalCost = sum(cost), totalRevenue = sum(revenue), totalProfit = sum(profit)) %.%
    format(big.mark=",", scientific=F)

Plotting the results

Here are two examples of useful plots: a histogram and a plot showing the P10-mean-P90 points of the distribution over time. I have included settings so that they are report-ready.

quartz.options(dpi=100);  # prevents plot stretching on Mac terminal
# set ggplot2 default colours
colors = c('#1EA0BA', '#4E809A', '#707070');
lightColors = c('#89D4E3', '#A9D4E3', '#C0C0C0');
colorGray = '#EAE6E3';
scale_colour_discrete <- function(...) scale_colour_manual(values = colors);
scale_fill_discrete <- function(...) scale_fill_manual(values = lightColors);
theme_custom <- function (base_size = 12, base_family = "Helvetica Neue") {
    theme_gray(base_size = base_size, base_family = base_family) %+replace% 
            axis.text = element_text(size=rel(1.5)),
            axis.title.x = element_text(size=rel(2)),
            axis.title.y = element_text(size=rel(2), angle=90),
            panel.background = element_rect(fill=colorGray),
            plot.background = element_rect(fill="white")

# helper routine for showing P10-mean-P90 plots
summaryCI <- function(x) data.frame(
  ymin=quantile(x, .1),
  ymax=quantile(x, .9)
millionsFormatter <- function(x){ 

# histogram of the fruit produced, in millions
simFruitSummary <- simFruit %.% group_by(simId) %.% summarise(fruit = sum(fruit)) %.% ungroup();
p <- qplot(fruit, data=simFruitSummary ) + 
        geom_histogram(fill=colors[1], colour=colorGray) + 
        scale_x_continuous(labels=millionsFormatter) + 
        # use coord_cartesian instead of limits=c(a,b) in scale_x_continuous so the viewport is affected, not the data
        coord_cartesian(xlim = c(-10e6, 50e6)) + # note in original units 
        aes(y = ..count../sum(..count..));
p <- update_labels(p, list(x="Fruit produced (millions)", y="Frequency"));

# P10-mean-P90 plot of the fruit produced over time
p2 <- ggplot(simAnnFruit, aes(x=year, y=annFruit)) + 
        stat_summary(fun.data="summaryCI", geom="smooth", colour=colors[1], size=2, fill=lightColors[1]);
p2 <- update_labels(p2, list(x="Year", y="Fruit produced"));


As you can see, some of the R commands are a little obscure; however, they can be bundled away in function definitions, so that the key cashflow logic stands alone and is easy to check.

[Edit] Since writing this post, I have reproduced and extended the above functionality using python with pandas instead of R. The library, as well as slides and notebooks from a presentation entitled "Monte Carlo Business Case Analysis using pandas", are available at bitbucket.org/artstr/montepylib.

Some future extensions:

  • To help with testing the logic for the mean or base case, I have extended the code so that if you pass nSims -> 0, it returns the mean of the distribution with simId=0. This is not too hard.
  • To help with sensitivity analysis, I am extending the code so that if you pass nSims -> -1, for each input variable it returns low, mean and high cases. This is hard to implement.
  • Add discrete variables, such as the small chance of a disaster.
  • Add unit tests.

I hope you found this useful - please let me know what you think, or if you have suggestions for how to improve the approach, or especially if you have a completely different way of solving the same problem. Thanks!

Hosting Meteor with MongoDB on Webfaction

Meteor is constantly evolving, and so are the ways to host it. I couldn’t find any single source that explained how to deploy a Meteor site to Webfaction using the latest technology (ie. as of March 2014). So here is how I did it for my site, Signup Zone.

First, follow Webfaction’s instructions to install a MongoDB app and add a data directory. I have called the app mongodb1. Note down its port number, eg. 18000.

The instructions say to add a user with the ["userAdminAnyDatabase"] role. I have also added a second user with the roles ['clusterAdmin', 'userAdminAnyDatabase', 'readAnyDatabase'], based on this Stackoverflow answer.

The instructions then say to run this in an SSH window:

./mongodb-linux-x86_64-2.4.9/bin/mongod --auth --dbpath $HOME/webapps/mongodb1/data/ --port 18000

Use Webfaction’s one-click installer to install a node.js app, and note down its port number too, eg. 17000. Upgrade it to the latest stable version using the “n” package, following Webfaction’s instructions for node.

Following this post, I demeteorized my meteor app, pushed it up to a git repo on Webfaction, and pulled it down into $HOME/webapps/nodeapp/demeteoredapp/.

As mentioned in the Webfaction instructions, in the nodeapp directory, type export PATH=$PWD/bin/:$PATH. Then cd demeteoredapp and type npm install in that directory.

In another SSH shell, enter (the DB name is “admin”):

export MONGO_URL="mongodb://mongo_user:mongo_password@localhost:18000/admin?autoReconnect=true"  # 2
export MAIL_URL='smtp://user:password@smtp.webfaction.com'
export PORT="17000"
export ROOT_URL="http://example.com"

Then type:

./bin/node $HOME/webapps/nodeapp/demeteoredapp/main.js

You can adjust the default bin/start script to do this too.

Having installed the app this way, I have developed a process to update the code. I rename the previously demeteorized app directory, eg. to old; go into my Meteor app directory and type: demeteorize -o ../demeteoredapp, then mv ../old/.git ../demeteoredapp. At that point I can do git add --all and commit and push the new version up to my git repo on Webfaction. I then ssh onto the server and type:

cd webapps/nodeapp/demeteoredapp/appname
git pull; ../bin/stop ; ../bin/start

to pull in the new changes and launch them.

Note there is no need for the forever package when you do it this way. I’m not sure we need the demeteorizer package either, or if the Meteor bundler is sufficient.

I found some comments online that there can be memory problems if you use Meteor and MongoDB on Webfaction. I haven’t experienced this problem. (I’m keeping a close eye on it – I have even built an admin panel which includes a call to a Meteor method that executes the memory_usage.py script provided by Webfaction.)

Ongoing jobs

You will then want to set up cron jobs to start your MongoDB app if it fails for any reason. I have modelled this script off Webfaction’s start script for node.js:

mkdir -p $HOME/webapps/mongoapp/run
mkdir -p $HOME/webapps/mongoapp/log
pid=$(/sbin/pidof $HOME/webapps/mongoapp/mongodb-linux-x86_64-2.4.9/bin/mongod)
if echo "$pid" | grep -q " "; then
if [ -n "$pid" ]; then
  user=$(ps -p $pid -o user | tail -n 1)
  if [ $user = "your-username" ]; then
    exit 0
nohup $HOME/webapps/mongoapp/mongodb-linux-x86_64-2.4.9/bin/mongod --auth --fork --logpath $HOME/webapps/mongoapp/log/mongo.log --dbpath $HOME/webapps/mongoapp/data/ --port 18000 > /dev/null 2>&1 &
/sbin/pidof $HOME/webapps/mongoapp/mongodb-linux-x86_64-2.4.9/bin/mongod > $HOME/webapps/mongoapp/run/mongo.pid

Put this into your scheduled tasks using crontab -e.

You will also want to back up your database. A simple command to back up the database to a directory called dbbackup is:

$HOME/webapps/mongoapp/mongodb-linux-x86_64-2.4.9/bin/mongodump --port 18000 -o $HOME/dbbackup

You will need to add the admin username and password.

I like to have daily, weekly and monthly backups, so I make three directories $HOME/mongo_backup/daily, weekly, monthly and have a script for each:

The daily script is:

# $HOME/scripts/cron_daily_mongo_backup.sh
# add this to crontab, e.g. to run at 15:45 server time every day:
# 45 15 * * * ~/scripts/cron_daily_mongo_backup.sh

# back up the database
$HOME/webapps/mongoapp/mongodb-linux-x86_64-2.4.9/bin/mongodump --port 18000 -o $HOME/mongo_backup/daily/dump-`date +\%Y\%m\%d` 2>> $HOME/mongo_backup/daily/cron.log

# delete backup directories older than 7 days
find $HOME/mongo_backup/daily -type d -ctime +7 | xargs -r rm -rf  # Be careful using -rf

The weekly script is:

# add this to crontab as, e.g. to run at 16:20 server time every Sunday:
# 20 16 * * 0 ~/scripts/cron_weekly_mongo_backup.sh
cp -R $HOME/mongo_backup/daily/dump-`date +\%Y\%m\%d` $HOME/mongo_backup/weekly
find $HOME/mongo_backup/weekly -type d -ctime +32 | xargs -r rm -rf

And the monthly script is:

# add this to crontab as, e.g. to run at 16:25 server time every 1st of the month:
# 25 16 1 * * ~/scripts/cron_monthly_mongo_backup.sh
cp -R $HOME/mongo_backup/daily/dump-`date +\%Y\%m\%d` $HOME/mongo_backup/monthly
find $HOME/mongo_backup/monthly -type d -ctime +370 | xargs -r rm -rf

If you’ve read this far, please check out the finished product at signup.zone!


Mathematical web apps & visualisation