Category Archives: Modelling

How to estimate uncertain data

Data Estimator is a tool that helps answer questions about uncertain quantities, eg. “What will our company’s sales be next year?”

It is designed to be used as part of an interview process, where expert judgements are drawn out and quantified.

It’s a reminder that in this world of big data, some things remain hard to measure, especially when it comes to the future.

You will be asked a handful of questions, using “probability wheels” like this to visualise the uncertainties:

probability wheel

When you’re done, you will be able to see and export the resulting probability distribution. For example, the result could be “there is a 60% chance this market will more than double in five years, a 20% chance it will more than treble, but a 10% chance it will shrink.”

probability density

It will also show some alternatives for you how you could place the uncertainty in a decision tree, eg.

decision tree node

Try it out here:

Inside Class Cruncher

You’ll find here the concepts used to create Class Cruncher, a handy webapp for school administrators.

The problem

How should you deploy a certain number of teachers across your school’s classes, given rules about class sizes? You may need to use composite classes.

More precisely, given:

  • The total number of teachers
  • Which grades are being taught (eg. kindergarten, 1st grade, 2nd grade etc)
  • How many children are in each grade
  • Min & max class sizes in each grade
  • Which composite classes are allowed (eg. grades 1 & 2 but not grades 1 & 3 together)
  • The smallest number of children from a grade allowed in a composite class (eg. so you don’t have a single lonely first grader in a class with grade 2).

Then list out the classes that each teacher should take (i.e. how many children from each grade are in each class).

A solution

Solve this as a mixed-integer linear programming problem, using lpsolve. This uses the simplex and branch-and-bound algorithms to maximise an objective function subject to some constraints. Here we have lots of constraints, and just want a feasible solution, rather than an “optimal” solution, so the objective function is not so important.

Step 1: Choose your variables

For every potential class j, we’ll have a binary variable y(j) (ie. it must be either 0 or 1), to tell us if it is being used or not.

Because classes can include students from different grades, we will invent the term “subclass” to refer to the number of children from one particular grade in a class. Then our key variables are x(i), the number of children in each potential subclass i.

Let’s relate the classes and subclasses via a matrix C(j,i), with 1 if subclass i is in class j, and 0 otherwise. Since subclasses are only in a single class, for any subclass i, C(j,i) is 1 for only one class j. So C is mostly 0s, with each row having at least one 1, and no columns having more than one 1.

Then the number of children in class j = the sum over i of C(j,i).x(i).

(In matrix notation, Cx = number of children in each class, where x and the right hand side are column vectors.)

C(j,i) also tells us if subclass i is being used: it is given by the sum over j of C(j,i).y(j).

Recall that y(j) tells us if class j is being used. Also, for a given subclass i, C(j,i) is only 1 for a single class j, so the sum is just a single term.

(In matrix notation, C’y = binaries telling if each subclass is used, where C’ is the transpose of C.)

For example: Suppose you have 3 grades (kindergarten, 1 and 2), and can allow for 1 pure kindy class, 2 composite K/1 classes, 2 pure 1st grade and 1 pure 2nd grade class. That’s a total of 6 classes and 8 subclasses. The matrix C is:

—-8 subclasses—–
| 1 . . . . . . .
| . 1 1 . . . . .
6 . . . 1 1 . . .
classes . . . . . 1 . .
| . . . . . . 1 .
| . . . . . . . 1

To simplify the model, we’ll assume potential composite classes cannot optionally only use some of their subclasses – it’s all or nothing.

So in fact, we also need a preliminary step (before solving) to decide on the number of potential classes and subclasses from the input data. We can make a guess at an upper limit, eg. by dividing the number of children in each grade by the maximum number allowed in a class, and rounding up.

Step 2: Choose your constraints

Our constraints need to achieve these things:

  1. Ensure all the children in each grade are in a class (or equivalently, a subclass)
  2. Classes sizes must be between the minimum and maximum class size – or zero (since not all potential classes need to be used!)
  3. Subclass sizes must be above the minimum subclass size, if the class is being used
  4. The number of classes must match (or be less than) the number of available teachers

We also need to relate the x and y variables. In a linear program, all the constraints need to be linear in the variables z. This means you can express them as matrices A, so that A.z is greater than, equal to, or less than, another vector c. Here z is just a column vector containing both x and y.

Take each of the above in turn.

Ensure all the children in each grade are in a class (or equivalently, a subclass)

We have to express this in terms of the number of children in the subclasses x, because the y’s don’t tell us how many children there are in each class.

In our earlier example, the first point involves 3 constraints, one for each grade. Multiply the below matrix with the column vector x (which has 8 rows).

—-8 subclasses—– sums to
| 1 1 . 1 . . . . = number of kindergarteners
3 grades . . 1 . 1 1 1 . = number of first graders
| . . . . . . . 1 = number of 2nd graders

Classes sizes must be between the minimum and maximum class size, or zero

If we had to use every potential class (ie. all the y variables were known to be 1), then this is just the constraint Cx >= min_class_size, Cx <= max_class_size (where I’m applying the inequality to each row of each side, so these are 6+6=12 constraints in our example.)

However, Cx >= min_class_size should only apply if y is 1. That’s easy to write: Cx – min_class_size.y >= 0. So the constraint matrix has C (N_classes × N_subclasses or 6×8) on the left and -min_class_size (N_classes × N_classes or 6×6, a diagonal matrix of the minimum class sizes) on the right.

I make the same modification for the max class sizes, although it’s not strictly necessary; this has the effect of forcing y >= 0.

Subclass sizes must be above the minimum subclass size, if the class is being used

If we had to use every potential class (ie. all the y variables were known to be 1), and therefore every subclass, then this is a very simple set of constraints: x >= min_subclass_size. However, we only want this constraint to apply if the corresponding binary value is 1. That corresponding y is actually the corresponding row of C’y (as we discussed earlier).

There’s a classic trick to only apply constraints based on the value of a binary variable, called the “big M” method.

Write x >= min_subclass_size as (x – min_subclass_size) >= 0, and then replace the right hand side with M(C’y – 1), where M is a big number (eg. 10000). Now when C’y is 1, the constraint applies; when it is 0, it does not. Cool!

The number of classes must match (or be less than) the number of available teachers

The last constraint is easy – the sum of the y’s must be less than or equal the number of teachers; the x’s don’t come into it.

Do we need to explicitly constrain y to be binary?

Certainly all our variables need to be integers, not floating point. But y already can’t be bigger than 1, because our big M constraint would fail. And it can’t be less than 0 because of the way we wrote our max class size constraint.

So y is already forced to be binary.

Matrix representation of the constraints

A:  (integer vars)      (binary vars)                     b:
 -- num_subclasses -- -- num_classes --
(--------------------+-----------------)                  (--------------)
(                    |                 )      |           (       |      )
( grade_size_matrix  |        0        )  num_grades  =   (  grade_sizes )
(                    |                 )      |           (       |      )
(--------------------+-----------------)                  (--------------)
(                    |-l1              )      |           (       |      )
( class_size_matrix  |     (diag.)     )  num_classes >=  (       0      )
(                    |             -lnc)      |           (       |      )
(--------------------+-----------------)                  (--------------)
(                    |-u1              )      |           (       |      )
( class_size_matrix  |     (diag.)     )  num_classes <=  (       0      )
(                    |             -unc)      |           (       |      )
(--------------------+-----------------)                  (--------------)
( 1                  |                 )      |           (       |      )
(     1              |                 )      |           (       |      )
(        ...         |     -M * C'     ) num_sub_classes>=(    ls - M    )
(              1     |                 )      |           (       |      )
(                  1 |                 )      |           (       |      )
(--------------------+-----------------)                  (--------------)
(         0          | 1 1   ...   1 1 ) num_classes =,<= ( max_classes  )
(--------------------+-----------------)                  (--------------)

where l is the min_class_size
      u is the max_class_size
      ls is the min_subclass_size
      C' is the transpose of the class_size_matrix
      M is a large number

Step 3: Choose the objective function

For a single answer, I just minimise the sum of all the x(i).

Ideally you could get second-best and other feasible solutions from the solver, but I had trouble doing this. As a work-around, I produce different solutions by changing the weights on each subclass. I raise the weight from 1 to 2 for the subclasses in a given grade or composite class, and repeat.

Step 4: Express the solution in an understandable way

We recast the resulting x and y values so that people can understand them, eg. in our example with 8 subclasses we might get an answer like

--------------- x ----------- ------- y -------
[20, 12, 10, 0, 0, 25, 22, 21, 1, 1, 0, 1, 1, 1]

This is more easily understood as being 5 classes with the following compositions:

           K  1st 2nd
class #1: 20,  0,  0
class #2: 12, 10,  0
class #3:  0, 25,  0
class #4:  0, 22,  0
class #5:  0,  0, 21

Step 5: Post-process to humanise the answers

The optimisation produces often solutions with very different numbers of students in each class, eg. two kindergarten classes, one with 10 students, and one with 20. Clearly, a better solution is two classes of 15.

As is often the case, the constraints we thought we needed don't quite fully give us the solutions we expected.

We could tackle this by adding more constraints or changing the objective function, but in this case it is simpler to do some "shuffling" after the optimisation, according to some prescribed rules.

Composite classes make this a bit harder. For another example, given the formatted output (note the composite classes are listed at the end):

[[10,0,0], [20,0,0], [0,25,0], [0,0,10], [25,5,0], [0,5,25]]

We could return:

[[21,0,0], [21,0,0], [0,22,0], [0,0,20], [13,8,0], [0,5,15]]

The rules I came up with took some discovering:

  1. For each grade, find the average class size of classes with subclasses from that grade. (In the above example, it's 20, which includes 5 grade 1s in the comp class)
  2. For each grade, move that grade's students around to make them as close to this average as possible, eg:
    • kindergarten: avg = 20;

      [[20,0,0], [20,0,0], [0,25,0], [0,0,10], [15,5,0], [0,5,25]]

    • grade 1: avg = (25+20+30)/3 = 25; in the example, this cannot be done, since the last class is the problem.
    • grade 2: avg = (10+30)/2 = 20,

      [[20,0,0], [20,0,0], [0,25,0], [0,0,20], [15,5,0], [0,5,15]]

  3. Repeat until class sizes only change by 1, eg:
    • kindergarten: avg = 20, no shuffling required
    • grade 1: avg = (25+20+20)/3 = 21.7;

      [[20,0,0], [20,0,0], [0,22,0], [0,0,20], [15,7,0], [0,6,15]]

    • grade 2: avg = (20+21)/2 = 20.5, no change required
    • kindergarten: avg = (20+20+22)/3 = 20.7,

      [[21,0,0], [20,0,0], [0,22,0], [0,0,20], [14,7,0], [0,6,15]]

    • etc

Step 6: Build it!

Build a web app to take the inputs, perform the optimisation and serve the results!

Curve fitting with javascript & d3

If javascript is up to amazing animations and visualizations with d3, maybe it’s up to non-linear curve fitting too, right?

Something like this, perhaps:

Here’s how I did it:

  • The hard work is done using Cobyla (“Constrained Optimization BY Linear Approximation”), which Anders Gustafsson ported to Java and Reinhard Oldenburg ported to Javascript, as per this stackoverflow post.
    Cobyla minimises a non-linear objective function subject to constraints.
  • The demo and its components (including cobyla) use the module pattern, which has the advantage of keeping the global namespace uncluttered.
  • To adapt Cobyla to this curve fitting problem, I wrote a short wrapper which is added onto the cobyla module as cobyla.nlFit(data, fitFn, start, min, max, constraints, solverParams). This function minimises the sum of squared differences (y1^2-y2^2) between the data points, (x,y1), and the fitted points, (x,y2).
  • The Weibull cumulative distribution function (CDF), inverse CDF and mean are defined in the “distribution” module. Thus distribution.weibull([2,1,5]) .inverseCdf(0.5) gives the median (50th percentile) of a Weibull distribution with shape parameter 2, scale parameter 1 and location parameter 5.
  • The chart is built with d3. I am building an open-source library of a few useful chart types, d3elements, which are added onto the d3 module as d3.elts. This one is called d3.elts.xyChart.
  • So the user interface doesn’t get jammed, I use a javascript web worker to calculate the curve fit. I was surprised how easy it was to set this up.
  • I apologise in advance that this sample code is quite complicated. If you see ways to make it simpler, please let me know.
  • Finally, this may be obvious, but I like the rigour that a tool like jshint imposes. Hence the odd comment at the top of fitdemo.js, /* global fitdemo: true, jQuery, d3, _, distribution, cobyla */

Check out the source code on bitbucket here. You can see it being used for uncertain data estimation here.

Please let me know what you think!

D3 bar chart with zoom & hover

This is an example of a reusable chart built using d3. The range (zoom) slider is built in d3 too, as a separate widget, so it can be customized.

Check the sample source code on bitbucket for the full description of how to use it; here is the essence:

    var rangeWidget = d3.elts.startEndSlider().minRange(30);
    var myChart = d3.elts.barChart().rangeWidget(rangeWidget);
    myChart.mouseOver(function(el, d) { showHover(el, d) });
    myChart.mouseOut(function(el, d) { hideHover(el, d) });
    d3.csv('data.csv', function(data) {
        data =, function(d) {return [,d.price]});"#chart").datum(data).call(myChart);

Monte Carlo Business Case Analysis in Python with pandas

I recently gave a talk at the Australian Python Convention arguing that for big decisions, it can be risky to rely on business case analysis prepared on spreadsheets, and that one alternative is to use Python with pandas.

The video is available here.

The slides for the talk are shown below – just click on the picture and then you can advance through the slides using the arrow keys. Alternatively, click here to see them in a new tab.

You can see the slides from the talk online here.

I also showed an ipython notebook, which you can find as a pdf here. The slides, the ipython notebook, and the beginnings of a library to make the analysis easier, are all at

Monte Carlo cashflow modelling in R with dplyr

Business cashflow forecast models are virtually always Excel spreadsheets. Spreadsheets have many advantages – they have a low barrier to entry and are easy for most people to understand.  However as they get more complicated, the disadvantages start to appear. Without good discipline the logical flow can become confusing. The same formula must appear in every cell in a table. When new columns or rows are needed, it is hard to be sure you have updated all the dependencies. Sometimes the first or last cell in a table needs a slightly different formula to the rest. Checking the spreadsheet still works after every change is laborious.  And producing report-ready plots in Excel can be tedious.

Furthermore, most cashflow forecasts are deterministic, ie. they incorporate assumptions about sales, costs, and prices, and calculate the profit (eg. EBITDA) for this single set of assumptions. Everybody knows the profit will not turn out to be exactly equal to the forecast.  But it is not clear from a deterministic model what the range of profits might be, and a deterministic model cannot tell you how likely a loss is.  So many businesses are turning to simulated (“Monte Carlo”) cashflow forecasts, in which the assumptions are ranges; the forecast is then a range of outcomes.

There are tricks to implementing Monte Carlo models in Excel (one approach is to use Data Tables to run the simulation, and some clever dynamic histogram functions to plot the range of profits); alternatively, several good commercial packages exist which make it much easier.  But with all the extra simulation information to accommodate, the problems of spreadsheets only get worse.

Use a programming language!

It doesn’t at first seem like a natural idea, especially if you are well-versed in Excel, but you can actually implement a cashflow model in any programming language, eg. python, R, matlab or Mathematica. Over the years programmers have developed a few principles which directly address many of the problems with spreadsheets:

  • Don’t Repeat Yourself” (DRY): you shouldn’t have the same formula appearing more than once anywhere. No need to copy and paste it from one cell into all the cells in a table! And if the last column of a table needs a slightly different formula, it should be more transparent.
  • Unit testing: If you make a change to the model, you can see immediately if it breaks something. Eg. set up some sample data for which you know the answer you want; then set up a script which checks the sample data gives you that answer. Every time you change the model, run the script.
  • Flexible arrays: Most languages make it easy to change the size of your arrays (ie. add rows and columns to your tables).
  • Configuration in text, not through a user interface: Eg. Define the precise layout of the graphics that you want upfront (eg. colours, fonts, line widths), not by adjusting each plot manually.
  • Object oriented programming (!): Collect all the information and methods relevant to each object in one place… etc…

Use R!

I have recently developed a Monte Carlo cashflow forecast model in R.  There are quite a few subtleties to making it work nicely which may help you if you’re trying the same thing. (I would like to compare this to a python version, most likely using Pandas, as Python has a nicer structure as a programming language than R does.  But each language has its advantages.)

Here is a simplified version of the model:

  • The business has several farms.
  • The farms have different tree crops (apples, pears, and maybe more).
  • In fact there are different varieties of each (Granny Smith, Pink Lady, …) but we’re not sure if we want to model that fact yet; the model will need to be flexible enough to add them in later.
  • Input numbers: the number of trees per farm;
  • Input ranges:
    • how much fruit each tree produces;
    • the price of fruit;
    • the variable cost (per kg of fruit);
    • the fixed cost.
  • The cashflow model itself is very simple:
    • the volume of fruit is the number of trees times the fruit/tree;
    • the cost is the volume of fruit times the variable cost plus the fixed cost;
    • the revenue is the volume of fruit times the price;
    • the profit is revenue minus cost.


The first decision we have to face is: what data structure should we use to hold the forecasts? Remember that we will need a profit estimate for each farm, crop and potentially variety.

My first inclination would be to create a multidimensional table, with three dimensions (farm × crop × variety).  In Python this could be as simple as a dictionary with keys constructed as tuples of indices for each dimension (e.g. profit[('Farm A', 'Apple', 'Granny Smith')]). However, in R, and possibly even in Python with Pandas, a better structure is the dataframe, which is modelled on how you would store the data in a database. So it is a two-dimensional structure, with columns for each dimension and for each value of interest. For example:

         farm   crop numTrees fruitPerTree
1      farm A  apple       10          9.6
2      farm B  apple       20         20.1
3      farm A   pear       30         25.8
4      farm B   pear        1          1.7

These dataframes will also hold the simulation parameters, eg. the mean and standard deviation of the fruitPerTree.

Input data

There is a convenient function in R to read csv files in as dataframes:

require(plyr, quietly=TRUE, warn.conflicts=FALSE);

readDataframe <- function(path) {
    return(read.table(path, row.names=NULL, 
                            header = TRUE, 
                            sep = ",", 
                            strip.white = TRUE, 
                            comment.char = "#", 

numTrees <- readDataframe('data/numTrees.csv');

This would read in data/numTrees.csv, eg.:

# numTrees.csv
farm, crop, variety, numTrees
Farm A, apple, Granny Smith, 100
Farm A, apple, Jonathon, 200
Farm A, pear, Nashi, 50
Farm B, apple, Granny Smith, 200
Farm B, apple, Jonathon, 400
# Farm B has no pears


I have found the dplyr package (“the next generation of plyr“) to be very useful for this.

library('dplyr', quietly=TRUE); # for left_join
dfSim <- function(nSims, distName, paramDf, paramNames, colName) {
    # Args:
    #   nSims:      integer, e.g. 10000
    #               TODO: pass 0 for the mean of the distribution
    #   distName:   distribution function name (as a string), e.g. 'rnorm'
    #   paramDf:    dataframe of parameters, e.g. of mean, sd over farms x crops
    #   paramNames: vector of the col names in paramDf that correspond to parameters, 
    #               e.g. c('mean', 'sd')
    #   colName:    name of the output column, e.g. 'fruitPerTree' 
    #               (optional, defaults to the name of the paramDf)
    # Returns:
    #   a dataframe that is nSims times bigger than the paramDf
    # E.g.:
    #   simPrice <- data.frame(crop=c('apple','pear'), mean=c(10,22), sd=c(2,4));
    #   paramNames <- c('mean','sd')
    #   dfSim(nSims=3, distName='rnorm', paramDf=simPrice, paramNames=paramNames, colName='price');
    if (missing(colName)) {
        colName <- deparse(substitute(paramDf));
    nRows <- nrow(paramDf);
    df <- left_join(paramDf, data.frame(simId=1:nSims));
    df[colName] <-, c(list(n=nrow(df)), df[paramNames]));
    # delete the parameters
    df<-dropFields(df, paramNames);

dropFields <- function(df, fieldNames) {
	# Args:
	#   df:         dataframe
	#   fieldNames: vector of field or column names
	# Returns:
	#   the dataframe with those fields/columns dropped
	return(df[,!(names(df) %in% fieldNames), drop=FALSE]); # the drop=FALSE stops a 1D dataframe becoming a vector

In practice you may want to extend this, eg.:

  • To ‘sniff’ which distribution (rnorm etc) to use, based on the parameters chosen
  • To return only the mean of the distribution if nSims==0, so you can run the model in deterministic mode

The first of these can be achieved as follows:

autoDetectDist <- function(df) {
    # Args:
    #   df:   dataframe of parameters, eg. with fields farm, crop, mean, sd
    # Result:
    #   a list "result" with:
    #   result$distName:    distribution function name (eg. 'rnorm')
    #   result$paramNames:  vector of the parameter names (eg. mean, sd)
    #   result$df:          possibly revised dataframe
    #                       (eg. a triangular distn may have min,max,mode mapped to a,b,c) 
    normParams <- c('mean','sd');
    triangleParams <- c('min','mode','max');
    singleValueParams <- c('value');
    if (all(normParams %in% names(df))) {
        return(list(distName='rnorm', paramNames=normParams, df=df));
    if (all(singleValueParams %in% names(df))) {
        # a deterministic single value - just simulate as a uniform distribution with min=max=value
        result <- list(distName='runif', paramNames=c('min','max'));
        result$df <- df %.% mutate(min=value, max=value);
        result$df <- dropFields(result$df, singleValueParams);
    # extend this to other distributions as desired, eg. triangle
    stop(paste('unexpected parameters in ', deparse(substitute(df))));

doSim <- function(nSims, paramDf, colName) {
    # see dfSim for details
    distInfo <- autoDetectDist(paramDf);
    if (missing(colName)) {
        colName <- deparse(substitute(paramDf));
    return(dfSim(nSims=nSims, colName=colName, distName=distInfo$distName,
            paramDf=distInfo$df, paramNames=distInfo$paramNames));

With this code, you can have a datafile such as data/fruitPerTree.csv:

# fruitPerTree.csv
farm, crop, variety, mean, sd
Farm A, apple, Granny Smith, 200, 40
Farm A, apple, Jonathon, 130, 30
Farm A, pear, Nashi, 150, 30
Farm B, apple, Granny Smith, 200, 35
Farm B, apple, Jonathon, 150, 40

(This is a simplified example with dummy data – in practice I would not recommend a Normal distribution for this quantity, since it should not go negative.)

You can simulate with:

nSims <- 3;
fruitPerTree <- readDataframe('data/fruitPerTree.csv');
simFruitPerTree <- doSim(nSims, fruitPerTree);

This produces a dataframe simFruitPerTree such as:

     farm  crop      variety simId fruitPerTree
1  Farm A apple Granny Smith     1    263.07713
2  Farm A apple Granny Smith     2    165.75104
3  Farm A apple Granny Smith     3    208.85228
4  Farm A apple     Jonathon     1     97.53114
5  Farm A apple     Jonathon     2    110.81156
6  Farm A apple     Jonathon     3    175.87662
7  Farm A  pear        Nashi     1    143.44306
8  Farm A  pear        Nashi     2    183.15205
9  Farm A  pear        Nashi     3    159.63510
10 Farm B apple Granny Smith     1    189.73699
11 Farm B apple Granny Smith     2    235.34085
12 Farm B apple Granny Smith     3    260.16784
13 Farm B apple     Jonathon     1    171.05951
14 Farm B apple     Jonathon     2    212.27406
15 Farm B apple     Jonathon     3    110.02394

Implementing the cashflow logic

As an example, let’s multiply the number of trees by the fruit per tree, using the dplyr package:

library('dplyr', quietly=TRUE); # for left_join, %.% and mutate
nSims <- 3;
numTrees <- readDataframe('data/numTrees.csv');
fruitPerTree <- readDataframe('data/fruitPerTree.csv');
simFruitPerTree <- doSim(nSims, fruitPerTree);

simFruit <- simFruitPerTree %.% left_join(numTrees) %.% mutate(fruit = fruitPerTree * numTrees);

When you execute this, you’ll get the message that it is Joining by: c("farm", "crop", "variety"). So in your data files, you do can specify any dimensions you want, and the code will just work. Nice! (There are some caveats to this coming up though…)

You can hopefully see how you could use this approach to add whatever columns you need (eg. profit as revenue minus cost, revenue as volume times price, …).

Another class of calculation that comes up a lot is accumulating or differencing a quantity over time. Suppose we had a “year” column in our simFruitPerTree dataframe above, and we wanted to calculate the total fruit produced over time:

# set up a sample dataframe with a year column
annFruitPerTree <- left_join(fruitPerTree, data.frame(year=2010:2012));
simAnnFruitPerTree <- doSim(3, annFruitPerTree);
simAnnFruit <- simAnnFruitPerTree %.% left_join(numTrees) %.% mutate(annFruit = annFruitPerTree * numTrees);
# find cumulative over year, for each simId
simAnnFruit <- simAnnFruit %.% group_by(simId, farm, crop, variety) %.% mutate(cumFruit=cumsum(annFruit)) %.% ungroup();

The only problem with this is that we have had to hardwire the dimensions farm, crop, variety into the code – to make things worse, not even as variables we could set once-off, but directly into the group_by command. This is unsatisfactory, but I do not know a way around it with dplyr. If you can help, please let me know. (Note – this stackoverflow post may be the answer.)

Note in the code above, simFruit <- simFruitPerTree %.% left_join(numTrees) I have used left_join. This returns all rows from the first table, and all columns from either table. This is fine in this case, since you would expect the same farm, variety and crops in each table. However, in some cases, you want an "outer join", using all possible farm, variety and crops, even if they do not appear in some tables. For example, there may be a head office with an annual fixed cost, as well as the annual fixed cost of the farms. So a fixed cost data file could include "head office" as a "farm". Unfortunately, dplyr does not have an "outer_join" yet, so you need to fall back on the basic merge command instead, eg.:

simCost <- merge(simVarCost, simFixedCost, all=TRUE);

Summarising the results

If you have a profit column in a dataframe sim with lots of simulations, you can sum across all farms, crops and varieties to produce a single profit figure for each iteration as follows:

simProfitSummary <- sim %.% group_by(simId) %.% summarise(profit = sum(profit)) %.% ungroup();
# or to format nicely:
sim %.% group_by(simId) %.% 
    summarise(totalCost = sum(cost), totalRevenue = sum(revenue), totalProfit = sum(profit)) %.%
    format(big.mark=",", scientific=F)

Plotting the results

Here are two examples of useful plots: a histogram and a plot showing the P10-mean-P90 points of the distribution over time. I have included settings so that they are report-ready.

quartz.options(dpi=100);  # prevents plot stretching on Mac terminal
# set ggplot2 default colours
colors = c('#1EA0BA', '#4E809A', '#707070');
lightColors = c('#89D4E3', '#A9D4E3', '#C0C0C0');
colorGray = '#EAE6E3';
scale_colour_discrete <- function(...) scale_colour_manual(values = colors);
scale_fill_discrete <- function(...) scale_fill_manual(values = lightColors);
theme_custom <- function (base_size = 12, base_family = "Helvetica Neue") {
    theme_gray(base_size = base_size, base_family = base_family) %+replace% 
            axis.text = element_text(size=rel(1.5)),
            axis.title.x = element_text(size=rel(2)),
            axis.title.y = element_text(size=rel(2), angle=90),
            panel.background = element_rect(fill=colorGray),
            plot.background = element_rect(fill="white")

# helper routine for showing P10-mean-P90 plots
summaryCI <- function(x) data.frame(
  ymin=quantile(x, .1),
  ymax=quantile(x, .9)
millionsFormatter <- function(x){ 

# histogram of the fruit produced, in millions
simFruitSummary <- simFruit %.% group_by(simId) %.% summarise(fruit = sum(fruit)) %.% ungroup();
p <- qplot(fruit, data=simFruitSummary ) + 
        geom_histogram(fill=colors[1], colour=colorGray) + 
        scale_x_continuous(labels=millionsFormatter) + 
        # use coord_cartesian instead of limits=c(a,b) in scale_x_continuous so the viewport is affected, not the data
        coord_cartesian(xlim = c(-10e6, 50e6)) + # note in original units 
        aes(y = ..count../sum(..count..));
p <- update_labels(p, list(x="Fruit produced (millions)", y="Frequency"));

# P10-mean-P90 plot of the fruit produced over time
p2 <- ggplot(simAnnFruit, aes(x=year, y=annFruit)) + 
        stat_summary("summaryCI", geom="smooth", colour=colors[1], size=2, fill=lightColors[1]);
p2 <- update_labels(p2, list(x="Year", y="Fruit produced"));


As you can see, some of the R commands are a little obscure; however, they can be bundled away in function definitions, so that the key cashflow logic stands alone and is easy to check.

[Edit] Since writing this post, I have reproduced and extended the above functionality using python with pandas instead of R. The library, as well as slides and notebooks from a presentation entitled "Monte Carlo Business Case Analysis using pandas", are available at

Some future extensions:

  • To help with testing the logic for the mean or base case, I have extended the code so that if you pass nSims -> 0, it returns the mean of the distribution with simId=0. This is not too hard.
  • To help with sensitivity analysis, I am extending the code so that if you pass nSims -> -1, for each input variable it returns low, mean and high cases. This is hard to implement.
  • Add discrete variables, such as the small chance of a disaster.
  • Add unit tests.

I hope you found this useful - please let me know what you think, or if you have suggestions for how to improve the approach, or especially if you have a completely different way of solving the same problem. Thanks!