# Monte Carlo Business Case Analysis in Python with pandas

I recently gave a talk at the Australian Python Convention arguing that for big decisions, it can be risky to rely on business case analysis prepared on spreadsheets, and that one alternative is to use Python with pandas.

The video is available here.

You can see the slides from the talk online here.

I also showed an ipython notebook, which you can find as a pdf here. The slides, the ipython notebook, and the beginnings of a library to make the analysis easier, are all at bitbucket.org/artstr/montepylib.

# Monte Carlo cashflow modelling in R with dplyr

Business cashflow forecast models are virtually always Excel spreadsheets. Spreadsheets have many advantages – they have a low barrier to entry and are easy for most people to understand.  However, as they get more complicated, the disadvantages start to appear. Without good discipline, the logical flow can become confusing. The same formula must appear in every cell in a table. When new columns or rows are needed, it is hard to be sure you have updated all the dependencies. Sometimes the first or last cell in a table needs a slightly different formula to the rest. Checking the spreadsheet still works after every change is laborious.  And producing report-ready plots in Excel can be tedious.

Furthermore, most cashflow forecasts are deterministic, ie. they incorporate assumptions about sales, costs, and prices, and calculate the profit (eg. EBITDA) for this single set of assumptions. Everybody knows the profit will not turn out to be exactly equal to the forecast.  But it is not clear from a deterministic model what the range of profits might be, and a deterministic model cannot tell you how likely a loss is.  So many businesses are turning to simulated (“Monte Carlo”) cashflow forecasts, in which the assumptions are ranges; the forecast is then a range of outcomes.
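To make the contrast concrete, here is a minimal Python sketch (my illustration, with made-up numbers) of the same forecast run both ways: once deterministically, and once as a simulation that also estimates the chance of a loss:

```python
import numpy as np

rng = np.random.default_rng(42)
n_sims = 100_000

# deterministic: one point estimate per assumption
volume, price, var_cost, fixed_cost = 1000, 10.0, 6.0, 3000.0
profit_point = volume * (price - var_cost) - fixed_cost

# Monte Carlo: the same assumptions as ranges
volume_s = rng.normal(1000, 150, n_sims)  # fruit volume varies
price_s = rng.normal(10.0, 1.0, n_sims)   # so does the price
profit_s = volume_s * (price_s - var_cost) - fixed_cost

# the chance of a loss, which the point estimate cannot show
p_loss = (profit_s < 0).mean()
```

The point estimate says only "the profit is 1000"; the simulation additionally says how wide the range of profits is and how often it dips below zero.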

There are tricks to implementing Monte Carlo models in Excel (one approach is to use Data Tables to run the simulation, and some clever dynamic histogram functions to plot the range of profits); alternatively, several good commercial packages exist which make it much easier.  But with all the extra simulation information to accommodate, the problems of spreadsheets only get worse.

### Use a programming language!

It doesn’t at first seem like a natural idea, especially if you are well-versed in Excel, but you can actually implement a cashflow model in any programming language, eg. python, R, matlab or Mathematica. Over the years programmers have developed a few principles which directly address many of the problems with spreadsheets:

• “Don’t Repeat Yourself” (DRY): you shouldn’t have the same formula appearing more than once anywhere. No need to copy and paste it from one cell into all the cells in a table! And if the last column of a table needs a slightly different formula, that difference is explicit rather than hidden in one cell.
• Unit testing: If you make a change to the model, you can see immediately if it breaks something. Eg. set up some sample data for which you know the answer you want; then set up a script which checks the sample data gives you that answer. Every time you change the model, run the script.
• Flexible arrays: Most languages make it easy to change the size of your arrays (ie. add rows and columns to your tables).
• Configuration in text, not through a user interface: Eg. Define the precise layout of the graphics that you want upfront (eg. colours, fonts, line widths), not by adjusting each plot manually.
• Object oriented programming (!): Collect all the information and methods relevant to each object in one place… etc…
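As a toy illustration of the first two principles (my example, not from the talk): the profit formula lives in exactly one function, and a hand-checkable test guards it.

```python
def profit(volume, price, variable_cost, fixed_cost):
    """The single definition of the profit formula (DRY): change it here, nowhere else."""
    revenue = volume * price
    cost = volume * variable_cost + fixed_cost
    return revenue - cost

# unit test: a case small enough to verify by hand; rerun after every model change
assert profit(volume=100, price=10, variable_cost=6, fixed_cost=300) == 100
```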

### Use R!

I have recently developed a Monte Carlo cashflow forecast model in R.  There are quite a few subtleties to making it work nicely which may help you if you’re trying the same thing. (I would like to compare this to a python version, most likely using Pandas, as Python has a nicer structure as a programming language than R does.  But each language has its advantages.)

Here is a simplified version of the model:

• The business has several farms.
• The farms have different tree crops (apples, pears, and maybe more).
• In fact there are different varieties of each (Granny Smith, Pink Lady, …) but we’re not sure if we want to model that fact yet; the model will need to be flexible enough to add them in later.
• Input numbers: the number of trees per farm.
• Input ranges:
  • how much fruit each tree produces;
  • the price of fruit;
  • the variable cost (per kg of fruit);
  • the fixed cost.
• The cashflow model itself is very simple:
  • the volume of fruit is the number of trees times the fruit per tree;
  • the cost is the volume of fruit times the variable cost, plus the fixed cost;
  • the revenue is the volume of fruit times the price;
  • the profit is revenue minus cost.
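Spelled out in code, that model is only a few lines. Here is a plain Python sketch with illustrative (made-up) numbers for one farm and crop:

```python
num_trees = 100        # input number
fruit_per_tree = 9.6   # in the full model, drawn from a range (kg per tree)
price = 2.0            # per kg of fruit
variable_cost = 0.5    # per kg of fruit
fixed_cost = 500.0

volume = num_trees * fruit_per_tree          # kg of fruit
cost = volume * variable_cost + fixed_cost
revenue = volume * price
profit = revenue - cost
```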

### Dataframes

The first decision we have to face is: what data structure should we use to hold the forecasts? Remember that we will need a profit estimate for each farm, crop and potentially variety.

My first inclination would be to create a multidimensional table, with three dimensions (farm × crop × variety).  In Python this could be as simple as a dictionary with keys constructed as tuples of indices for each dimension (e.g. `profit[('Farm A', 'Apple', 'Granny Smith')]`). However, in R, and possibly even in Python with Pandas, a better structure is the dataframe, which is modelled on how you would store the data in a database. So it is a two-dimensional structure, with columns for each dimension and for each value of interest. For example:

```         farm   crop numTrees fruitPerTree
1      farm A  apple       10          9.6
2      farm B  apple       20         20.1
3      farm A   pear       30         25.8
4      farm B   pear        1          1.7
```
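For comparison, the analogous dataframe in pandas (assuming the pandas route mentioned above, with the same column names as the R example) is built like this:

```python
import pandas as pd

forecast = pd.DataFrame({
    'farm':         ['farm A', 'farm B', 'farm A', 'farm B'],
    'crop':         ['apple',  'apple',  'pear',   'pear'],
    'numTrees':     [10, 20, 30, 1],
    'fruitPerTree': [9.6, 20.1, 25.8, 1.7],
})
```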

These dataframes will also hold the simulation parameters, eg. the mean and standard deviation of the fruitPerTree.

### Input data

There is a convenient function in R, `read.csv`, to read csv files in as dataframes. I wrap it in a small helper (called `readDf` here) so that every input file is parsed the same way:

```
require(plyr, quietly=TRUE, warn.conflicts=FALSE);

readDf <- function(name) {
  # read data/<name>.csv as a dataframe;
  # comment.char lets the data files carry explanatory "#" comments
  return(read.csv(paste0("data/", name, ".csv"),
    sep = ",",
    strip.white = TRUE,
    comment.char = "#",
    stringsAsFactors = FALSE
  ));
}
```

This would read in `data/numTrees.csv`, eg.:

```# numTrees.csv
farm, crop, variety, numTrees
Farm A, apple, Granny Smith, 100
Farm A, apple, Jonathon, 200
Farm A, pear, Nashi, 50
Farm B, apple, Granny Smith, 200
Farm B, apple, Jonathon, 400
# Farm B has no pears
```
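pandas can read the same files directly: `comment='#'` handles the comment lines and `skipinitialspace=True` the spaces after the commas. This is my sketch of the equivalent, using an inline string in place of the data file:

```python
import io

import pandas as pd

csv_text = """# numTrees.csv
farm, crop, variety, numTrees
Farm A, apple, Granny Smith, 100
Farm B, apple, Jonathon, 400
# Farm B has no pears
"""
# comment lines are skipped; spaces after commas are stripped from headers and values
num_trees = pd.read_csv(io.StringIO(csv_text), comment='#', skipinitialspace=True)
```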

### Simulating

I have found the `dplyr` package (“the next generation of `plyr`”) to be very useful for this.

```
library('dplyr', quietly=TRUE);

dfSim <- function(nSims, distName, paramDf, paramNames, colName) {
  #
  # Args:
  #   nSims:      integer, e.g. 10000
  #               TODO: pass 0 for the mean of the distribution
  #   distName:   distribution function name (as a string), e.g. 'rnorm'
  #   paramDf:    dataframe of parameters, e.g. of mean, sd over farms x crops
  #   paramNames: vector of the col names in paramDf that correspond to parameters,
  #               e.g. c('mean', 'sd')
  #   colName:    name of the output column, e.g. 'fruitPerTree'
  #               (optional, defaults to the name of the paramDf)
  #
  # Returns:
  #   a dataframe that is nSims times bigger than the paramDf
  #
  # E.g.:
  #   simPrice <- data.frame(crop=c('apple','pear'), mean=c(10,22), sd=c(2,4));
  #   paramNames <- c('mean','sd');
  #   dfSim(nSims=3, distName='rnorm', paramDf=simPrice, paramNames=paramNames, colName='price');
  #
  if (missing(colName)) {
    colName <- deparse(substitute(paramDf));
  }
  # cross join with the simIds: there are no shared columns,
  # so merge() returns the Cartesian product
  df <- merge(paramDf, data.frame(simId=1:nSims));
  df[colName] <- do.call(get(distName), c(list(n=nrow(df)), df[paramNames]));
  # delete the parameters
  df <- dropFields(df, paramNames);
  return(df);
}

dropFields <- function(df, fieldNames) {
  #
  # Args:
  #   df:         dataframe
  #   fieldNames: vector of field or column names
  #
  # Returns:
  #   the dataframe with those fields/columns dropped
  #
  # (drop=FALSE stops a one-column dataframe collapsing to a vector)
  return(df[, !(names(df) %in% fieldNames), drop=FALSE]);
}
```
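For the record, the same pattern ports to pandas quite directly. The sketch below is my own (the actual montepylib API may differ): cross-join the parameter table with a `simId` column, then draw one value per row, vectorised over the parameter columns.

```python
import numpy as np
import pandas as pd

def df_sim(n_sims, param_df, param_names, col_name, dist_name='normal', seed=None):
    """Return a dataframe n_sims times bigger than param_df, with one
    simulated value per row in col_name (pandas analogue of dfSim)."""
    rng = np.random.default_rng(seed)
    sims = pd.DataFrame({'simId': range(1, n_sims + 1)})
    df = param_df.merge(sims, how='cross')  # Cartesian product with the simIds
    draw = getattr(rng, dist_name)          # e.g. rng.normal(means, sds, size)
    df[col_name] = draw(*[df[p] for p in param_names], size=len(df))
    return df.drop(columns=param_names)

# usage, mirroring the R example:
sim_price_params = pd.DataFrame({'crop': ['apple', 'pear'],
                                 'mean': [10, 22], 'sd': [2, 4]})
sim_price = df_sim(3, sim_price_params, ['mean', 'sd'], 'price', seed=0)
```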

In practice you may want to extend this, eg.:

• To ‘sniff’ which distribution (`rnorm` etc) to use, based on the parameters chosen
• To return only the mean of the distribution if `nSims==0`, so you can run the model in deterministic mode

The first of these can be achieved as follows:

```
autoDetectDist <- function(df) {
  #
  # Args:
  #   df:   dataframe of parameters, eg. with fields farm, crop, mean, sd
  #
  # Result:
  #   a list "result" with:
  #   result$distName:    distribution function name (eg. 'rnorm')
  #   result$paramNames:  vector of the parameter names (eg. mean, sd)
  #   result$df:          possibly revised dataframe
  #                       (eg. a triangular distn may have min,max,mode mapped to a,b,c)
  #
  normParams <- c('mean','sd');
  triangleParams <- c('min','mode','max');
  singleValueParams <- c('value');
  if (all(normParams %in% names(df))) {
    return(list(distName='rnorm', paramNames=normParams, df=df));
  }
  if (all(singleValueParams %in% names(df))) {
    # a deterministic single value - just simulate as a uniform distribution with min=max=value
    result <- list(distName='runif', paramNames=c('min','max'));
    result$df <- df %.% mutate(min=value, max=value);
    result$df <- dropFields(result$df, singleValueParams);
    return(result);
  }
  # extend this to other distributions as desired, eg. triangle
  stop(paste('unexpected parameters in', deparse(substitute(df))));
}

doSim <- function(nSims, paramDf, colName) {
  # see dfSim for details
  distInfo <- autoDetectDist(paramDf);
  if (missing(colName)) {
    colName <- deparse(substitute(paramDf));
  }
  return(dfSim(nSims=nSims, colName=colName, distName=distInfo$distName,
               paramDf=distInfo$df, paramNames=distInfo$paramNames));
}
```

With this code, you can have a datafile such as `data/fruitPerTree.csv`:

```# fruitPerTree.csv
farm, crop, variety, mean, sd
Farm A, apple, Granny Smith, 200, 40
Farm A, apple, Jonathon, 130, 30
Farm A, pear, Nashi, 150, 30
Farm B, apple, Granny Smith, 200, 35
Farm B, apple, Jonathon, 150, 40
```

(This is a simplified example with dummy data – in practice I would not recommend a Normal distribution for this quantity, since it should not go negative.)
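If you do want a strictly positive quantity, one option (my suggestion, not part of the original model) is a lognormal distribution with its two underlying parameters solved from the desired mean and sd:

```python
import numpy as np

mean, sd = 200.0, 40.0
# choose mu, sigma of the underlying normal so that the lognormal
# has exactly this mean and sd
sigma2 = np.log(1 + (sd / mean) ** 2)
mu = np.log(mean) - sigma2 / 2
draws = np.random.default_rng(0).lognormal(mu, np.sqrt(sigma2), 100_000)
# draws are always positive, unlike rnorm(200, 40)
```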

You can simulate with:

```nSims <- 3;
simFruitPerTree <- doSim(nSims, fruitPerTree);
```

This produces a dataframe `simFruitPerTree` such as:

```     farm  crop      variety simId fruitPerTree
1  Farm A apple Granny Smith     1    263.07713
2  Farm A apple Granny Smith     2    165.75104
3  Farm A apple Granny Smith     3    208.85228
4  Farm A apple     Jonathon     1     97.53114
5  Farm A apple     Jonathon     2    110.81156
6  Farm A apple     Jonathon     3    175.87662
7  Farm A  pear        Nashi     1    143.44306
8  Farm A  pear        Nashi     2    183.15205
9  Farm A  pear        Nashi     3    159.63510
10 Farm B apple Granny Smith     1    189.73699
11 Farm B apple Granny Smith     2    235.34085
12 Farm B apple Granny Smith     3    260.16784
13 Farm B apple     Jonathon     1    171.05951
14 Farm B apple     Jonathon     2    212.27406
15 Farm B apple     Jonathon     3    110.02394
```

### Implementing the cashflow logic

As an example, let’s multiply the number of trees by the fruit per tree, using the `dplyr` package:

```library('dplyr', quietly=TRUE); # for left_join, %.% and mutate
nSims <- 3;
simFruitPerTree <- doSim(nSims, fruitPerTree);

simFruit <- simFruitPerTree %.% left_join(numTrees) %.% mutate(fruit = fruitPerTree * numTrees);
```

When you execute this, you’ll get the message that it is `Joining by: c("farm", "crop", "variety")`. So in your data files you can specify any dimensions you want, and the code will just work. Nice! (There are some caveats to this coming up though…)

You can hopefully see how you could use this approach to add whatever columns you need (eg. profit as revenue minus cost, revenue as volume times price, …).
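In pandas, the same join-then-mutate step is a `merge` followed by `assign`. A sketch with dummy numbers (mine, not montepylib’s API):

```python
import pandas as pd

num_trees = pd.DataFrame({'farm': ['A', 'B'], 'crop': ['apple', 'apple'],
                          'numTrees': [100, 200]})
sim_fruit_per_tree = pd.DataFrame({'farm': ['A', 'A', 'B', 'B'],
                                   'crop': ['apple'] * 4,
                                   'simId': [1, 2, 1, 2],
                                   'fruitPerTree': [9.0, 11.0, 20.0, 22.0]})
# merge joins on the shared columns (farm, crop), like dplyr's left_join
sim_fruit = (sim_fruit_per_tree
             .merge(num_trees, how='left')
             .assign(fruit=lambda d: d.fruitPerTree * d.numTrees))
```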

Another class of calculation that comes up a lot is accumulating or differencing a quantity over time. Suppose we had a “year” column in our `simFruitPerTree` dataframe above, and we wanted to calculate the total fruit produced over time:

```
# set up a sample dataframe with a year column
# (no shared columns, so merge() gives the cross join)
annFruitPerTree <- merge(fruitPerTree, data.frame(year=2010:2012));
simAnnFruitPerTree <- doSim(3, annFruitPerTree);
simAnnFruit <- simAnnFruitPerTree %.% left_join(numTrees) %.%
  mutate(annFruit = annFruitPerTree * numTrees);
# find the cumulative total over the years, for each simId
simAnnFruit <- simAnnFruit %.% group_by(simId, farm, crop, variety) %.%
  mutate(cumFruit = cumsum(annFruit)) %.% ungroup();
```

The only problem with this is that we have had to hardwire the dimensions `farm, crop, variety` into the code – to make things worse, not even as variables we could set once-off, but directly into the `group_by` command. This is unsatisfactory, but I do not know a way around it with `dplyr`. If you can help, please let me know. (Note – this stackoverflow post may be the answer.)
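For comparison, pandas does let the grouping dimensions be an ordinary list variable, set once at the top (my sketch):

```python
import pandas as pd

dims = ['simId', 'farm']  # set once; extend with 'crop', 'variety' as needed
df = pd.DataFrame({'simId': [1, 1, 2, 2],
                   'farm': ['A'] * 4,
                   'year': [2010, 2011, 2010, 2011],
                   'annFruit': [10.0, 20.0, 30.0, 40.0]})
df = df.sort_values(dims + ['year'])
df['cumFruit'] = df.groupby(dims)['annFruit'].cumsum()
```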

Note that in the code above (`simFruit <- simFruitPerTree %.% left_join(numTrees)`) I have used `left_join`. This returns all rows from the first table, and all columns from both tables. That is fine here, since you would expect the same farms, crops and varieties in each table. However, in some cases you want an "outer join", using all possible farms, crops and varieties, even if they do not appear in some tables. For example, there may be a head office with an annual fixed cost, as well as the annual fixed cost of the farms; so a fixed cost data file could include "head office" as a "farm". Unfortunately, `dplyr` does not have an "outer_join" yet, so you need to fall back on the base `merge` command instead, eg.:

```simCost <- merge(simVarCost, simFixedCost, all=TRUE);
```
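pandas spells the outer join directly. In this sketch (mine), "head office" appears only in the fixed cost table but survives the join:

```python
import pandas as pd

var_cost = pd.DataFrame({'farm': ['Farm A', 'Farm B'],
                         'varCost': [100, 200]})
fixed_cost = pd.DataFrame({'farm': ['Farm A', 'Farm B', 'head office'],
                           'fixedCost': [50, 60, 500]})
# how='outer' keeps 'head office' even though it has no variable cost
cost = var_cost.merge(fixed_cost, on='farm', how='outer')
```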

### Summarising the results

If you have a profit column in a dataframe `sim` with lots of simulations, you can sum across all farms, crops and varieties to produce a single profit figure for each iteration as follows:

```simProfitSummary <- sim %.% group_by(simId) %.% summarise(profit = sum(profit)) %.% ungroup();
# or to format nicely:
sim %.% group_by(simId) %.%
summarise(totalCost = sum(cost), totalRevenue = sum(revenue), totalProfit = sum(profit)) %.%
format(big.mark=",", scientific=F)
```
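The pandas equivalent of `summarise` is a groupby-aggregate (again my sketch with dummy numbers):

```python
import pandas as pd

sim = pd.DataFrame({'simId': [1, 1, 2, 2],
                    'profit': [100.0, -20.0, 50.0, 60.0]})
# one total profit per simulation iteration
profit_summary = sim.groupby('simId', as_index=False)['profit'].sum()
```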

### Plotting the results

Here are two examples of useful plots: a histogram and a plot showing the P10-mean-P90 points of the distribution over time. I have included settings so that they are report-ready.

```quartz.options(dpi=100);  # prevents plot stretching on Mac terminal
library('ggplot2');
#
# set ggplot2 default colours
#
colors = c('#1EA0BA', '#4E809A', '#707070');
lightColors = c('#89D4E3', '#A9D4E3', '#C0C0C0');
colorGray = '#EAE6E3';
scale_colour_discrete <- function(...) scale_colour_manual(values = colors);
scale_fill_discrete <- function(...) scale_fill_manual(values = lightColors);
theme_custom <- function(base_size = 12, base_family = "Helvetica Neue") {
  theme_gray(base_size = base_size, base_family = base_family) %+replace%
    theme(
      axis.text = element_text(size=rel(1.5)),
      axis.title.x = element_text(size=rel(2)),
      axis.title.y = element_text(size=rel(2), angle=90),
      panel.background = element_rect(fill=colorGray),
      plot.background = element_rect(fill="white")
    )
}
theme_set(theme_custom())

#
# helper routine for showing P10-mean-P90 plots
#
summaryCI <- function(x) data.frame(
  y=mean(x),
  ymin=quantile(x, .1),
  ymax=quantile(x, .9)
)
millionsFormatter <- function(x) {
  x / 1.e6
}

#
# histogram of the fruit produced, in millions
#
simFruitSummary <- simFruit %.% group_by(simId) %.% summarise(fruit = sum(fruit)) %.% ungroup();
p <- qplot(fruit, data=simFruitSummary) +
  geom_histogram(fill=colors[1], colour=colorGray) +
  scale_x_continuous(labels=millionsFormatter) +
  # use coord_cartesian instead of limits=c(a,b) in scale_x_continuous
  # so the viewport is affected, not the data
  coord_cartesian(xlim = c(-10e6, 50e6)) + # note: in original units
  aes(y = ..count../sum(..count..));
p <- update_labels(p, list(x="Fruit produced (millions)", y="Frequency"));

#
# P10-mean-P90 plot of the fruit produced over time
#
p2 <- ggplot(simAnnFruit, aes(x=year, y=annFruit)) +
  stat_summary(fun.data="summaryCI", geom="smooth",
               colour=colors[1], size=2, fill=lightColors[1]);
p2 <- update_labels(p2, list(x="Year", y="Fruit produced"));
```

### Conclusions

As you can see, some of the R commands are a little obscure; however, they can be bundled away in function definitions, so that the key cashflow logic stands alone and is easy to check.

Since writing this post, I have reproduced and extended the above functionality using python with pandas instead of R. The library, as well as slides and notebooks from a presentation entitled "Monte Carlo Business Case Analysis using pandas", are available at bitbucket.org/artstr/montepylib.

Some future extensions:

• To help with testing the logic for the mean or base case, I have extended the code so that if you pass `nSims=0`, it returns the mean of the distribution with `simId=0`. This is not too hard.
• To help with sensitivity analysis, I am extending the code so that if you pass `nSims=-1`, it returns low, mean and high cases for each input variable. This is hard to implement.
• Add discrete variables, such as the small chance of a disaster.