9 Lessons from PyConAU 2014

A summary of what I learned at PyCon AU in Brisbane this year. (Videos of the talks are here.)

1. PyCon’s code of conduct

Basically, “Be nice to people. Please.”

I once had a boss who told me he saw his role as maintaining the culture of the group.  At first I thought that seemed a strange goal for someone so senior in the company, but I eventually decided it was enlightened: a place’s culture is key to making it desirable, and making the work sustainable. So I like that PyCon takes the trouble to try to set the tone like this, when it would be so easy for a bunch of programmers to stay focused on the technical.

2. Django was made open-source to give back to the community

Ever wondered why a company like Lawrence Journal-World would want to give away its valuable IP as open source? In a “fireside chat” between Simon Willison (Django co-creator) and Andrew Godwin (South author), it was revealed that the owners knew that much of their CMS framework had been built on open source software, and they wanted to give back to the community. It just goes to show, no matter how conservative the organisation you work for, if you believe some of your work should be made open source, make the case for it.

3. There are still lots more packages and tools to try out

That lesson’s copied from my post last year on PyCon AU. Strangely this list doesn’t seem to be any shorter than last year – but it is at least a different list.

Things to add to your web stack -

  • Varnish – “if your server’s not fast enough, just add another”.  Apparently a scary scripting language is involved, but it can take your server from handling 50 users to 50,000. Fastly is a commercial service that can set this up for you.
  • Solr and elasticsearch are ways to make searches faster; use them with django-haystack.
  • Statsd & graphite for performance monitoring.
  • Docker.io

Some other stuff -

  • mpld3 – convert matplotlib to d3. Wow! I even saw this in action in an ipython notebook.
  • you can use a directed graph (eg. using networkx) to determine the order in which to run the processes in your code – see the sketch below
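For example (the process names here are invented), a topological sort of the dependency graph gives a valid running order:

import networkx as nx

# Each edge means "run the first process before the second"
G = nx.DiGraph()
G.add_edges_from([
    ("load_data", "clean_data"),
    ("load_config", "fit_model"),
    ("clean_data", "fit_model"),
    ("fit_model", "report"),
])

print(list(nx.topological_sort(G)))
# one valid order, e.g. ['load_data', 'clean_data', 'load_config', 'fit_model', 'report']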

Here are some wider tools for bioinformaticians (if that’s a word), largely from Clare Sloggett’s talk -

  • rosalind.info – an educational tool for teaching bioinformatics algorithms in python.
  • nectar research cloud – a national cloud for Australian researchers
  • biodalliance – a fast, interactive, genome visualization tool that’s easy to embed in web pages and applications (and ipython notebooks!)
  • ensembl API – an API for genomics – cool!

And some other sciency packages -

  • The Natural Language Toolkit (NLTK)
  • Scikit-learn can count words in documents, and separate data into training and test sets (see the sketch after this list)
  • febrl – to connect user records together when their data may be incorrectly entered
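A minimal sketch of those two scikit-learn points, with toy documents and labels invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

docs = ["the cat sat on the mat",
        "the dog ate my homework",
        "cats and dogs and more cats"]
labels = [0, 1, 1]

X = CountVectorizer().fit_transform(docs)   # sparse matrix of word counts per document
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)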

One standout talk for me was Ryan Kelly’s pypy.js, implementing a compliant and fast python in the browser entirely in javascript. The only downside is it’s 15 MB to download, but he’s working on it!

And finally, check out this alternative to python: Julia, “a high-level, high-performance dynamic programming language for technical computing”, and Scirra’s Construct 2, a game-making program for kids (Windows only).

4. Everyone loves IPython Notebook

I hadn’t thought to embed javascript in notebooks, but you can. You can even use them collaboratively through Google Docs using Jupyter’s coLaboratory. You can get a table-of-contents extension too.
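For instance, here is one way to run a snippet of javascript from a notebook cell using IPython’s display machinery (the alert text is just an example):

# Run this inside a notebook cell
from IPython.display import Javascript, display

display(Javascript("alert('Hello from javascript inside the notebook');"))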

5. Browser caching doesn’t have to be hard

Remember, your server is not just generating HTML – it is generating an HTTP response, and that includes headers like Last-Modified, ETag and Cache-Control. Use them. Django has decorators to make this easy. See Mark Nottingham’s caching tutorial. (This from a talk by Tom Eastman.)
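As a rough illustration (the view and model names below are invented, not from the talk), Django’s cache_control and condition decorators will set those headers for you:

# A sketch only: "Entry" is a hypothetical model with a "modified" timestamp field.
from django.shortcuts import render
from django.views.decorators.cache import cache_control
from django.views.decorators.http import condition

from myapp.models import Entry   # hypothetical

def latest_entry(request):
    # when did the underlying data last change?
    return Entry.objects.latest("modified").modified

@cache_control(max_age=600)                  # sets the Cache-Control header
@condition(last_modified_func=latest_entry)  # sets Last-Modified and answers conditional GETs with 304
def entry_list(request):
    return render(request, "entries.html", {"entries": Entry.objects.all()})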

6. Making your own packages is a bit hard

I had not heard of wheels before, but they replace eggs as a “distributable unit of python code” – really just a zip file with some meta-data, possibly including operating-system-dependent binaries. Tools that you’ll want to use include tox (to run tests in lots of different environments); sphinx (to auto-generate your documentation) and then ReadTheDocs to host your docs; check-manifest to make sure your MANIFEST.in file has everything it needs; and bumpversion so you don’t have to change your version number in five different places every time you update the code.

If you want users to install your package with “pip install python-fire”, and then import it in Python with “import fire”, then you should name your enclosing project folder python-fire, and inside that you should have a package folder named fire. Also, you can install this package while you are testing it by cd-ing to the python-fire directory and typing pip install -e . (note the final full-stop; the -e flag makes the install editable, so changes to the code take effect without reinstalling).

Once you have added a LICENSE, README, docs, tests, MANIFEST.in and setup.py, and optionally a setup.cfg (to the python-fire directory in the above example), and you have pip installed setuptools, wheel and twine, you run both

python setup.py bdist_wheel [--universal]
python setup.py sdist

The bdist_wheel command produces a binary distribution, which is operating-system-specific if the package includes compiled code; the optional --universal flag declares that the wheel will run on any operating system, under both Python 2 and Python 3 (only appropriate for pure-Python packages). The sdist command produces a source distribution.
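For reference, a minimal setup.py for the hypothetical python-fire project above might look roughly like this (the metadata values are invented for illustration):

# setup.py for the hypothetical python-fire project
from setuptools import setup, find_packages

setup(
    name="python-fire",               # the name users pip install
    version="0.1.0",                  # let bumpversion manage this for you
    packages=find_packages(),         # picks up the inner "fire" package
    description="An example package",
    long_description=open("README.rst").read(),
    url="https://example.com/python-fire",   # placeholder URL
    license="MIT",
)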

To upload the result to pypi, run

twine upload dist/*

(This from a talk by Russell Keith-Magee.)  Incidentally, piprot is a handy tool to check how out-of-date your packages are. Also see the Hitchhiker’s Guide to Packaging.

7. Security is never far from our thoughts

This lesson is also copied from last year’s post. If you offer a free service (like Heroku), some people will try to abuse it. Heroku has ways of detecting potentially fraudulent users very quickly, and hopes to open source them soon. And be careful of APIs which accept data – XML and YAML in particular have scary features (such as entity expansion and arbitrary object construction) which can let people run bad things on your server.
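For instance, with PyYAML the difference between yaml.load and yaml.safe_load is exactly this kind of scary feature – the payload below is the classic example:

import yaml

safe = yaml.safe_load("{name: Alice, score: 10}")   # returns a plain dict

nasty = "!!python/object/apply:os.system ['echo owned']"
yaml.safe_load(nasty)    # raises a ConstructorError rather than running anything
# yaml.load(nasty)       # DON'T: with the default loader this would execute the shell command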

8. Database considerations

Some tidbits from Andrew Godwin’s talk (of South fame)…

  • Virtual machines are slow at I/O, so don’t put your database on one – put your databases on SSDs. And try not to run other things next to the database.
  • Setting default values on a new column takes a long time on a big database. (Postgres can add a NULL field for free, but not MySQL.)
  • Schema-less (aka NoSQL) databases make a lot of sense for CMSes.
  • If only one field in a table is frequently updated, separate it out into its own table (see the sketch after this list).
  • Try to separate read-heavy tables (and databases) from write-heavy ones.
  • The more separate you can keep your tables from the start, the easier it will be to refactor (eg. shard) later to improve your database speed.
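As a rough Django illustration of the “separate out the frequently-updated field” point (the model names are invented):

# Keep the constantly-updated counter in its own small table,
# linked one-to-one to the main, rarely-changing table.
from django.db import models

class Article(models.Model):
    title = models.CharField(max_length=200)
    body = models.TextField()            # rarely changes

class ArticleStats(models.Model):
    article = models.OneToOneField(Article)
    view_count = models.PositiveIntegerField(default=0)  # updated on every page view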

9. Go to the lightning talks

I am constantly amazed at the quality of the 5-minute (strictly enforced) lightning talks. Russell Keith-Magee’s toga provides a way to program native iOS, Mac OS, Windows and linux apps in python (with Android coming). (He has also implemented the constraint-based layout engine Cassowary in python, with tests, along the way.) Produce displays of lightning on your screen using the von Mises distribution and amazingly quick typing. Run python2 inside python3 with sux (a play on six). And much much more…

Finally, the two keynotes were very interesting too. One was by Katie Cunningham on making your websites accessible to all, including people with sight or hearing problems, or dyslexia, or colour-blindness, or who have trouble using the keyboard or the mouse, or may just need more time to make sense of your site. Oddly enough, doing so tends to improve your site for everyone anyway (as Katie said, has anyone ever asked for more flashing effects on the margins of your page?). Examples include captioning videos, being careful with red and green (use Vischeck), using ARIA, reading the standards, and, ideally, having a text-based description of any graphs on the site, like you might describe to a friend over the phone. Thinking of an automated way to do that last one sounds like an interesting challenge…

The other keynote was by James Curran from the University of Sydney on the way in which programming – or better, “computational thinking” – will be taught in schools. Perhaps massaging our egos at a programming conference, he claimed that computational thinking is “the most challenging thing that people do”, as it requires managing a high level of complexity and abstraction. Nonetheless, requiring kindergarteners to learn programming seemed a bit extreme to me – until he explained that at that age kids would not be in front of a computer, but rather learning “to be exact”. For example, describing how to make a slice of buttered bread is essentially an algorithm, and it’s easy to miss all the steps required (like opening the cupboard door to get the bread). If you’re interested, some learning resources include MIT’s Scratch, Alice (using 3D animations), Grok Learning and the National Computer Science School (NCSS).

All in all, another excellent conference – congratulations to the organisers, and I look forward to next year in Brisbane again.

  

Make an animated & reusable barchart

Do you need a dynamic bar chart like this in your web page, with positive and negative values, and categories along the x-axis?

Dog breeds are on my mind at the moment as we just bought a new Sheltie puppy – this chart might show a person’s scores for each breed. Click the button above to see the next person’s (random) set of scores.

This is an example of a reusable chart built using d3. Using it is fairly simple, eg.:

<script src="http://d3js.org/d3.v3.min.js"></script>
<script src="categoryBarChart.js"></script>
<script>
    var points = [['Beagle',-10],
                  ['Terrier',30],
                  ['Sheltie',55],
                  ['Greyhound',-24]];
    var myChart = categoryBarChart().width(300);
    d3.select("body").datum(points).call(myChart);
</script>

Please check out the source code for categoryBarChart.js on bitbucket.

You may also find this helpful even if you don’t need a barchart, but want to understand how to build a reusable chart. I was inspired when I read Mike Bostock’s exposition, Towards reusable charts, but I found it took some work to get my head around how to do it for real – so I hope this example may help others too.

The key tricks were:

  • How to adjust Mike Bostock’s sample code to produce bars instead of a line and area
  • Where the enter(), update and exit() phases fit in (answer: they are internal to the reusable function)
  • How to call it, including whether to use data vs datum (answer: see code snippet above)
  • How to refer to the x and y coordinates inside the function (answer: the function maps them to d[0] and d[1] regardless of how they started out)

You may wish to extend it by improving the y-axis options too.

Good luck!

  

Monte Carlo Business Case Analysis in Python with pandas

I recently gave a talk at the Australian Python Convention arguing that for big decisions, it can be risky to rely on business case analysis prepared on spreadsheets, and that one alternative is to use Python with pandas.

The video is available here.

The slides for the talk are shown below – just click on the picture and then you can advance through the slides using the arrow keys. Alternatively, click here to see them in a new tab.


You can see the slides from the talk online here.

I also showed an ipython notebook, which you can find as a pdf here. The slides, the ipython notebook, and the beginnings of a library to make the analysis easier, are all at bitbucket.org/artstr/montepylib.
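The library is still embryonic, but the core idea is simple enough to sketch with plain pandas (the numbers below are made up for illustration and are not from the talk):

import numpy as np
import pandas as pd

np.random.seed(0)
n_sims = 10000

# Uncertain inputs: one simulated value per iteration
volume = pd.Series(np.random.triangular(800, 1000, 1300, n_sims))   # units sold
price = pd.Series(np.random.normal(10.0, 1.5, n_sims))              # $ per unit
unit_cost = pd.Series(np.random.normal(6.0, 0.8, n_sims))           # $ per unit
fixed_cost = 2500.0

# The cashflow logic reads like the deterministic version,
# but each line operates on the whole distribution at once
revenue = volume * price
cost = volume * unit_cost + fixed_cost
profit = revenue - cost

print(profit.describe())                  # mean, quartiles, etc.
print("P(loss) =", (profit < 0).mean())   # estimated probability of a loss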

  

How is your tax money being spent?

Want to know how your tax money is being spent? The Australian Budget Explorer is an interactive way to explore spending by portfolio, agency, program or even in more detail; compare 2014 against previous years; and search for the terms that interest you.

Australian Budget Explorer

This was produced in collaboration with BudgetAus. This year for the first time, the team at data.gov.au provided the Budget expenditure data in a single spreadsheet, which Rosie Williams (InfoAus) manipulated to include further data on the Social Services portfolio. The collaboration is producing lots of good visualisations, collected at AusViz.

I won’t editorialise about the Budget here; instead here is my data and extensions wishlist:

Look-through to include State budgets

The biggest line item (component) in the Federal Budget is $54 billion for “Administered expenses: Special appropriation GST Revenue Entitlements – Federal Financial Relations Act 2009”, which I take to be revenue passed to the States. I would love to be able to “look through” this item to see how the States spend it.

The BudgetAus team has provided some promising data leads here.

Unique identifiers to track spending over time

One of the most frequent requests I get is to track changes in spending over time.

Unfortunately this is hard, as there are no unique identifiers for a given portfolio, program, agency or component. That means if the name changes from one year to the next, it is hard to work out which old name corresponds to which new name. E.g. in 2014, the Department of Employment & Workplace Relations has been split into the Department of Employment and the Department of Education, while the Environment portfolio used to be “Sustainability, Environment, Water, Population and Communities”.

It would be great to give all spending an identifier, and have a record of how identifiers map from one year to the next.

What money is actually spent?

How does the budget relate to what is spent? There is some info here at BudgetAus, but the upshot is “This might be a good task for a future group of volunteers”…

Revenue

There is revenue data available here – I haven’t looked at it carefully yet, but I hope to include it, if possible.

Cross-country comparison

It would be great to compare the percentages spent in key areas by governments across the world.
Maybe it’s already being done? To do this I’d need some standard hierarchy of categories (health, education, defence, and subdivisions of these, etc), and we’d need every country’s government (and every State government) to tag their spending by those categories. Sounds simple in concept but I bet it would be hard to make it happen.

In the meantime, my plan is to check quandl for data and see how far I can go with what’s there…

D3

Finally, many thanks to the authors for the awesome d3 package!

Conclusion

If you have any comments or know how to solve any of the data issues raised above, please let me know.

  

Monte Carlo cashflow modelling in R with dplyr

Business cashflow forecast models are virtually always Excel spreadsheets. Spreadsheets have many advantages – they have a low barrier to entry and are easy for most people to understand.  However as they get more complicated, the disadvantages start to appear. Without good discipline the logical flow can become confusing. The same formula must appear in every cell in a table. When new columns or rows are needed, it is hard to be sure you have updated all the dependencies. Sometimes the first or last cell in a table needs a slightly different formula to the rest. Checking the spreadsheet still works after every change is laborious.  And producing report-ready plots in Excel can be tedious.

Furthermore, most cashflow forecasts are deterministic, ie. they incorporate assumptions about sales, costs, and prices, and calculate the profit (eg. EBITDA) for this single set of assumptions. Everybody knows the profit will not turn out to be exactly equal to the forecast.  But it is not clear from a deterministic model what the range of profits might be, and a deterministic model cannot tell you how likely a loss is.  So many businesses are turning to simulated (“Monte Carlo”) cashflow forecasts, in which the assumptions are ranges; the forecast is then a range of outcomes.

There are tricks to implementing Monte Carlo models in Excel (one approach is to use Data Tables to run the simulation, and some clever dynamic histogram functions to plot the range of profits); alternatively, several good commercial packages exist which make it much easier.  But with all the extra simulation information to accommodate, the problems of spreadsheets only get worse.

Use a programming language!

It doesn’t at first seem like a natural idea, especially if you are well-versed in Excel, but you can actually implement a cashflow model in any programming language, eg. python, R, matlab or Mathematica. Over the years programmers have developed a few principles which directly address many of the problems with spreadsheets:

  • “Don’t Repeat Yourself” (DRY): you shouldn’t have the same formula appearing more than once anywhere. No need to copy and paste it from one cell into all the cells in a table! And if the last column of a table needs a slightly different formula, that difference is explicit in the code rather than hidden in a cell.
  • Unit testing: If you make a change to the model, you can see immediately if it breaks something. Eg. set up some sample data for which you know the answer you want; then set up a script which checks the sample data gives you that answer. Every time you change the model, run the script (see the sketch after this list).
  • Flexible arrays: Most languages make it easy to change the size of your arrays (ie. add rows and columns to your tables).
  • Configuration in text, not through a user interface: Eg. Define the precise layout of the graphics that you want upfront (eg. colours, fonts, line widths), not by adjusting each plot manually.
  • Object oriented programming (!): Collect all the information and methods relevant to each object in one place… etc…
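For instance, the unit-testing point might look like this in Python (the toy profit function and numbers are invented; the same idea applies in R with a package such as testthat):

import unittest

def profit(volume, price, unit_cost, fixed_cost):
    return volume * (price - unit_cost) - fixed_cost

class TestCashflowModel(unittest.TestCase):
    def test_profit_on_known_inputs(self):
        # sample data for which we know the answer we want
        self.assertEqual(profit(volume=100, price=10, unit_cost=6, fixed_cost=150), 250)

if __name__ == "__main__":
    unittest.main()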

Use R!

I have recently developed a Monte Carlo cashflow forecast model in R.  There are quite a few subtleties to making it work nicely which may help you if you’re trying the same thing. (I would like to compare this to a python version, most likely using Pandas, as Python has a nicer structure as a programming language than R does.  But each language has its advantages.)

Here is a simplified version of the model:

  • The business has several farms.
  • The farms have different tree crops (apples, pears, and maybe more).
  • In fact there are different varieties of each (Granny Smith, Pink Lady, …) but we’re not sure if we want to model that fact yet; the model will need to be flexible enough to add them in later.
  • Input numbers: the number of trees per farm;
  • Input ranges:
    • how much fruit each tree produces;
    • the price of fruit;
    • the variable cost (per kg of fruit);
    • the fixed cost.
  • The cashflow model itself is very simple:
    • the volume of fruit is the number of trees times the fruit/tree;
    • the cost is the volume of fruit times the variable cost plus the fixed cost;
    • the revenue is the volume of fruit times the price;
    • the profit is revenue minus cost.

Dataframes

The first decision we have to face is: what data structure should we use to hold the forecasts? Remember that we will need a profit estimate for each farm, crop and potentially variety.

My first inclination would be to create a multidimensional table, with three dimensions (farm × crop × variety).  In Python this could be as simple as a dictionary with keys constructed as tuples of indices for each dimension (e.g. profit[('Farm A', 'Apple', 'Granny Smith')]). However, in R, and possibly even in Python with Pandas, a better structure is the dataframe, which is modelled on how you would store the data in a database. So it is a two-dimensional structure, with columns for each dimension and for each value of interest. For example:

         farm   crop numTrees fruitPerTree
1      farm A  apple       10          9.6
2      farm B  apple       20         20.1
3      farm A   pear       30         25.8
4      farm B   pear        1          1.7

These dataframes will also hold the simulation parameters, eg. the mean and standard deviation of the fruitPerTree.

Input data

There is a convenient function in R to read csv files in as dataframes:

require(plyr, quietly=TRUE, warn.conflicts=FALSE);

readDataframe <- function(path) {
    return(read.table(path, row.names=NULL, 
                            header = TRUE, 
                            sep = ",", 
                            strip.white = TRUE, 
                            comment.char = "#", 
                            stringsAsFactors=FALSE
                     )
           );
}

numTrees <- readDataframe('data/numTrees.csv');

This would read in data/numTrees.csv, eg.:

# numTrees.csv
farm, crop, variety, numTrees
Farm A, apple, Granny Smith, 100
Farm A, apple, Jonathon, 200
Farm A, pear, Nashi, 50
Farm B, apple, Granny Smith, 200
Farm B, apple, Jonathon, 400
# Farm B has no pears

Simulating

I have found the dplyr package (“the next generation of plyr”) to be very useful for this.

library('dplyr', quietly=TRUE); # for left_join
dfSim <- function(nSims, distName, paramDf, paramNames, colName) {
    #
    # Args:
    #   nSims:      integer, e.g. 10000
    #               TODO: pass 0 for the mean of the distribution
    #   distName:   distribution function name (as a string), e.g. 'rnorm'
    #   paramDf:    dataframe of parameters, e.g. of mean, sd over farms x crops
    #   paramNames: vector of the col names in paramDf that correspond to parameters, 
    #               e.g. c('mean', 'sd')
    #   colName:    name of the output column, e.g. 'fruitPerTree' 
    #               (optional, defaults to the name of the paramDf)
    #
    # Returns:
    #   a dataframe that is nSims times bigger than the paramDf
    #
    # E.g.:
    #   simPrice <- data.frame(crop=c('apple','pear'), mean=c(10,22), sd=c(2,4));
    #   paramNames <- c('mean','sd')
    #   dfSim(nSims=3, distName='rnorm', paramDf=simPrice, paramNames=paramNames, colName='price');
    #
    if (missing(colName)) {
        colName <- deparse(substitute(paramDf));
    }
    nRows <- nrow(paramDf);
    df <- left_join(paramDf, data.frame(simId=1:nSims));
    df[colName] <- do.call(get(distName), c(list(n=nrow(df)), df[paramNames]));
    # delete the parameters
    df<-dropFields(df, paramNames);
    return(df);
}

dropFields <- function(df, fieldNames) {
	#
	# Args:
	#   df:         dataframe
	#   fieldNames: vector of field or column names
	#
	# Returns:
	#   the dataframe with those fields/columns dropped
	#
	return(df[,!(names(df) %in% fieldNames), drop=FALSE]); # the drop=FALSE stops a 1D dataframe becoming a vector
}

In practice you may want to extend this, eg.:

  • To ‘sniff’ which distribution (rnorm etc) to use, based on the parameters chosen
  • To return only the mean of the distribution if nSims==0, so you can run the model in deterministic mode

The first of these can be achieved as follows:

autoDetectDist <- function(df) {
    #
    # Args:
    #   df:   dataframe of parameters, eg. with fields farm, crop, mean, sd
    #
    # Result:
    #   a list "result" with:
    #   result$distName:    distribution function name (eg. 'rnorm')
    #   result$paramNames:  vector of the parameter names (eg. mean, sd)
    #   result$df:          possibly revised dataframe
    #                       (eg. a triangular distn may have min,max,mode mapped to a,b,c) 
    #
    normParams <- c('mean','sd');
    triangleParams <- c('min','mode','max');
    singleValueParams <- c('value');
    if (all(normParams %in% names(df))) {
        return(list(distName='rnorm', paramNames=normParams, df=df));
    }
    if (all(singleValueParams %in% names(df))) {
        # a deterministic single value - just simulate as a uniform distribution with min=max=value
        result <- list(distName='runif', paramNames=c('min','max'));
        result$df <- df %.% mutate(min=value, max=value);
        result$df <- dropFields(result$df, singleValueParams);
        return(result);
    }
    # extend this to other distributions as desired, eg. triangle
    stop(paste('unexpected parameters in ', deparse(substitute(df))));
}

doSim <- function(nSims, paramDf, colName) {
    # see dfSim for details
    distInfo <- autoDetectDist(paramDf);
    if (missing(colName)) {
        colName <- deparse(substitute(paramDf));
    }
    return(dfSim(nSims=nSims, colName=colName, distName=distInfo$distName,
            paramDf=distInfo$df, paramNames=distInfo$paramNames));
}

With this code, you can have a datafile such as data/fruitPerTree.csv:

# fruitPerTree.csv
farm, crop, variety, mean, sd
Farm A, apple, Granny Smith, 200, 40
Farm A, apple, Jonathon, 130, 30
Farm A, pear, Nashi, 150, 30
Farm B, apple, Granny Smith, 200, 35
Farm B, apple, Jonathon, 150, 40

(This is a simplified example with dummy data – in practice I would not recommend a Normal distribution for this quantity, since it should not go negative.)

You can simulate with:

nSims <- 3;
fruitPerTree <- readDataframe('data/fruitPerTree.csv');
simFruitPerTree <- doSim(nSims, fruitPerTree);

This produces a dataframe simFruitPerTree such as:

     farm  crop      variety simId fruitPerTree
1  Farm A apple Granny Smith     1    263.07713
2  Farm A apple Granny Smith     2    165.75104
3  Farm A apple Granny Smith     3    208.85228
4  Farm A apple     Jonathon     1     97.53114
5  Farm A apple     Jonathon     2    110.81156
6  Farm A apple     Jonathon     3    175.87662
7  Farm A  pear        Nashi     1    143.44306
8  Farm A  pear        Nashi     2    183.15205
9  Farm A  pear        Nashi     3    159.63510
10 Farm B apple Granny Smith     1    189.73699
11 Farm B apple Granny Smith     2    235.34085
12 Farm B apple Granny Smith     3    260.16784
13 Farm B apple     Jonathon     1    171.05951
14 Farm B apple     Jonathon     2    212.27406
15 Farm B apple     Jonathon     3    110.02394

Implementing the cashflow logic

As an example, let’s multiply the number of trees by the fruit per tree, using the dplyr package:

library('dplyr', quietly=TRUE); # for left_join, %.% and mutate
nSims <- 3;
numTrees <- readDataframe('data/numTrees.csv');
fruitPerTree <- readDataframe('data/fruitPerTree.csv');
simFruitPerTree <- doSim(nSims, fruitPerTree);

simFruit <- simFruitPerTree %.% left_join(numTrees) %.% mutate(fruit = fruitPerTree * numTrees);

When you execute this, you’ll get the message that it is Joining by: c("farm", "crop", "variety"). So in your data files, you can specify any dimensions you want, and the code will just work. Nice! (There are some caveats to this coming up though…)

You can hopefully see how you could use this approach to add whatever columns you need (eg. profit as revenue minus cost, revenue as volume times price, …).

Another class of calculation that comes up a lot is accumulating or differencing a quantity over time. Suppose we had a “year” column in our simFruitPerTree dataframe above, and we wanted to calculate the total fruit produced over time:

# set up a sample dataframe with a year column
annFruitPerTree <- left_join(fruitPerTree, data.frame(year=2010:2012));
simAnnFruitPerTree <- doSim(3, annFruitPerTree);
simAnnFruit <- simAnnFruitPerTree %.% left_join(numTrees) %.% mutate(annFruit = annFruitPerTree * numTrees);
# find cumulative over year, for each simId
simAnnFruit <- simAnnFruit %.% group_by(simId, farm, crop, variety) %.% mutate(cumFruit=cumsum(annFruit)) %.% ungroup();

The only problem with this is that we have had to hardwire the dimensions farm, crop, variety into the code – to make things worse, not even as variables we could set once-off, but directly into the group_by command. This is unsatisfactory, but I do not know a way around it with dplyr. If you can help, please let me know. (Note – this stackoverflow post may be the answer.)

Note that in the code above (simFruit <- simFruitPerTree %.% left_join(numTrees)) I have used left_join. This returns all rows from the first table, and all columns from both tables. This is fine in this case, since you would expect the same farm, variety and crops in each table. However, in some cases you want an "outer join", using all possible farm, variety and crop values, even if they do not appear in some tables. For example, there may be a head office with an annual fixed cost, as well as the annual fixed cost of the farms. So a fixed cost data file could include "head office" as a "farm". Unfortunately, dplyr does not have an "outer_join" yet, so you need to fall back on the basic merge command instead, eg.:

simCost <- merge(simVarCost, simFixedCost, all=TRUE);

Summarising the results

If you have a profit column in a dataframe sim with lots of simulations, you can sum across all farms, crops and varieties to produce a single profit figure for each iteration as follows:

simProfitSummary <- sim %.% group_by(simId) %.% summarise(profit = sum(profit)) %.% ungroup();
# or to format nicely:
sim %.% group_by(simId) %.% 
    summarise(totalCost = sum(cost), totalRevenue = sum(revenue), totalProfit = sum(profit)) %.%
    format(big.mark=",", scientific=F)

Plotting the results

Here are two examples of useful plots: a histogram and a plot showing the P10-mean-P90 points of the distribution over time. I have included settings so that they are report-ready.

quartz.options(dpi=100);  # prevents plot stretching on Mac terminal
library('ggplot2');
#
# set ggplot2 default colours
#
colors = c('#1EA0BA', '#4E809A', '#707070');
lightColors = c('#89D4E3', '#A9D4E3', '#C0C0C0');
colorGray = '#EAE6E3';
scale_colour_discrete <- function(...) scale_colour_manual(values = colors);
scale_fill_discrete <- function(...) scale_fill_manual(values = lightColors);
theme_custom <- function (base_size = 12, base_family = "Helvetica Neue") {
    theme_gray(base_size = base_size, base_family = base_family) %+replace% 
        theme(
            axis.text = element_text(size=rel(1.5)),
            axis.title.x = element_text(size=rel(2)),
            axis.title.y = element_text(size=rel(2), angle=90),
            panel.background = element_rect(fill=colorGray),
            plot.background = element_rect(fill="white")
    )   
}
theme_set(theme_custom())

#
# helper routine for showing P10-mean-P90 plots
#
summaryCI <- function(x) data.frame(
  y=mean(x), 
  ymin=quantile(x, .1),
  ymax=quantile(x, .9)
)
millionsFormatter <- function(x){ 
    x/1.e6 
}

#
# histogram of the fruit produced, in millions
#
simFruitSummary <- simFruit %.% group_by(simId) %.% summarise(fruit = sum(fruit)) %.% ungroup();
p <- qplot(fruit, data=simFruitSummary ) + 
        geom_histogram(fill=colors[1], colour=colorGray) + 
        scale_x_continuous(labels=millionsFormatter) + 
        # use coord_cartesian instead of limits=c(a,b) in scale_x_continuous so the viewport is affected, not the data
        coord_cartesian(xlim = c(-10e6, 50e6)) + # note in original units 
        aes(y = ..count../sum(..count..));
p <- update_labels(p, list(x="Fruit produced (millions)", y="Frequency"));

#
# P10-mean-P90 plot of the fruit produced over time
#
p2 <- ggplot(simAnnFruit, aes(x=year, y=annFruit)) + 
        stat_summary(fun.data="summaryCI", geom="smooth", colour=colors[1], size=2, fill=lightColors[1]);
p2 <- update_labels(p2, list(x="Year", y="Fruit produced"));

Conclusions

As you can see, some of the R commands are a little obscure; however, they can be bundled away in function definitions, so that the key cashflow logic stands alone and is easy to check.

[Edit] Since writing this post, I have reproduced and extended the above functionality using python with pandas instead of R. The library, as well as slides and notebooks from a presentation entitled "Monte Carlo Business Case Analysis using pandas", are available at bitbucket.org/artstr/montepylib.

Some future extensions:

  • To help with testing the logic for the mean or base case, I have extended the code so that if you pass nSims -> 0, it returns the mean of the distribution with simId=0. This is not too hard.
  • To help with sensitivity analysis, I am extending the code so that if you pass nSims -> -1, for each input variable it returns low, mean and high cases. This is hard to implement.
  • Add discrete variables, such as the small chance of a disaster.
  • Add unit tests.

I hope you found this useful - please let me know what you think, or if you have suggestions for how to improve the approach, or especially if you have a completely different way of solving the same problem. Thanks!

  

Hosting Meteor with MongoDB on Webfaction

Meteor is constantly evolving, and so are the ways to host it. I couldn’t find any single source that explained how to deploy a Meteor site to Webfaction using the latest technology (ie. as of March 2014). So here is how I did it for my site, Easy Online Rosters.

First, follow Webfaction’s instructions to install a MongoDB app and add a data directory. I have called the app mongodb1. Note down its port number, eg. 18000.

The instructions say to add a user with the ["userAdminAnyDatabase"] role. I have also added a second user with the roles ['clusterAdmin', 'userAdminAnyDatabase', 'readAnyDatabase'], based on this Stackoverflow answer.

The instructions then say to run this in an SSH window:

./mongodb-linux-x86_64-2.4.9/bin/mongod --auth --dbpath $HOME/webapps/mongodb1/data/ --port 18000

Use Webfaction’s one-click installer to install a node.js app, and note down its port number too, eg. 17000. Upgrade it to the latest stable version using the “n” package, following Webfaction’s instructions for node.

Following this post, I demeteorized my meteor app, pushed it up to a git repo on Webfaction, and pulled it down into $HOME/webapps/nodeapp/demeteoredapp/.

As mentioned in the Webfaction instructions, in the nodeapp directory, type export PATH=$PWD/bin/:$PATH. Then cd demeteoredapp and type npm install in that directory.

In another SSH shell, enter (the DB name is “admin”):

export MONGO_URL="mongodb://mongo_user:mongo_password@localhost:18000/admin?autoReconnect=true"
export MAIL_URL='smtp://user:password@smtp.webfaction.com'
export PORT="17000"
export ROOT_URL="http://example.com"

Then type:

./bin/node $HOME/webapps/nodeapp/demeteoredapp/main.js

You can adjust the default bin/start script to do this too.

Having installed the app this way, I have developed a process to update the code. I rename the previously demeteorized app directory, eg. to old; go into my Meteor app directory and type: demeteorize -o ../demeteoredapp, then mv ../old/.git ../demeteoredapp. At that point I can do git add --all and commit and push the new version up to my git repo on Webfaction. I then ssh onto the server and type:

cd webapps/nodeapp/demeteoredapp/appname
git pull; ../bin/stop ; ../bin/start

to pull in the new changes and launch them.

Note there is no need for the forever package when you do it this way. I’m not sure we need the demeteorizer package either, or if the Meteor bundler is sufficient.

I found some comments online that there can be memory problems if you use Meteor and MongoDB on Webfaction. I haven’t experienced this problem. (I’m keeping a close eye on it – I have even built an admin panel which includes a call to a Meteor method that executes the memory_usage.py script provided by Webfaction.)

Ongoing jobs

You will then want to set up cron jobs to start your MongoDB app if it fails for any reason. I have modelled this script off Webfaction’s start script for node.js:

#!/bin/sh
mkdir -p $HOME/webapps/mongoapp/run
mkdir -p $HOME/webapps/mongoapp/log
pid=$(/sbin/pidof $HOME/webapps/mongoapp/mongodb-linux-x86_64-2.4.9/bin/mongod)
if echo "$pid" | grep -q " "; then
  pid=""
fi
if [ -n "$pid" ]; then
  user=$(ps -p $pid -o user | tail -n 1)
  if [ $user = "your-username" ]; then
    exit 0
  fi
fi
nohup $HOME/webapps/mongoapp/mongodb-linux-x86_64-2.4.9/bin/mongod --auth --fork --logpath $HOME/webapps/mongoapp/log/mongo.log --dbpath $HOME/webapps/mongoapp/data/ --port 18000 > /dev/null 2>&1 &
/sbin/pidof $HOME/webapps/mongoapp/mongodb-linux-x86_64-2.4.9/bin/mongod > $HOME/webapps/mongoapp/run/mongo.pid

Put this into your scheduled tasks using crontab -e.

You will also want to back up your database. A simple command to back up the database to a directory called dbbackup is:

$HOME/webapps/mongoapp/mongodb-linux-x86_64-2.4.9/bin/mongodump --port 18000 -o $HOME/dbbackup

You will need to add the admin username and password.

I like to have daily, weekly and monthly backups, so I make three directories $HOME/mongo_backup/daily, weekly, monthly and have a script for each:

The daily script is:

#!/bin/sh
# $HOME/scripts/cron_daily_mongo_backup.sh
# add this to crontab, e.g. to run at 15:45 server time every day:
# 45 15 * * * ~/scripts/cron_daily_mongo_backup.sh

# back up the database
$HOME/webapps/mongoapp/mongodb-linux-x86_64-2.4.9/bin/mongodump --port 18000 -o $HOME/mongo_backup/daily/dump-`date +\%Y\%m\%d` 2>> $HOME/mongo_backup/daily/cron.log

# delete backup directories older than 7 days
find $HOME/mongo_backup/daily -type d -ctime +7 | xargs -r rm -rf  # Be careful using -rf

The weekly script is:

#!/bin/sh
# add this to crontab as, e.g. to run at 16:20 server time every Sunday:
# 20 16 * * 0 ~/scripts/cron_weekly_mongo_backup.sh
cp -R $HOME/mongo_backup/daily/dump-`date +\%Y\%m\%d` $HOME/mongo_backup/weekly
find $HOME/mongo_backup/weekly -type d -ctime +32 | xargs -r rm -rf

And the monthly script is:

#!/bin/sh
# add this to crontab as, e.g. to run at 16:25 server time every 1st of the month:
# 25 16 1 * * ~/scripts/cron_monthly_mongo_backup.sh
#
cp -R $HOME/mongo_backup/daily/dump-`date +\%Y\%m\%d` $HOME/mongo_backup/monthly
find $HOME/mongo_backup/monthly -type d -ctime +370 | xargs -r rm -rf

If you’ve read this far, please check out the finished product at easyonlinerosters.com!

  

Serve datatables with ajax from Django

Datatables is an amazing resource which lets you quickly display lots of data in tables, with sorting, searching and pagination all built in.

The simplest way to use it is to populate the table when you load the page.  Then the sorting, searching and pagination all just happen by themselves.

If you have a lot of data, you can improve page load times by just serving the data you need to, using ajax. On first sight, this is made easy too.  However, be warned: if the server is sending only the data needed, then the server needs to take care of sorting, searching and pagination. You will also need to control the table column sizes more carefully.

There’s quite a lot required to get this right, so I thought I’d share what I’ve learned from doing this in Django.

Start with the following html. This example demonstrates using the render function to insert a link into the table.

<div class="row">
<table class="table table-striped table-bordered" id="example" style="clear: both;">
<thead>
<tr>
<th>Name</th>
<th>Value</th>
</tr>
</thead>
</table>
</div>

and javascript:

$(document).ready(function() {
    exampleTable = $('#example').dataTable( {
        "aaSorting": [[ 2, "asc" ]],
        "aoColumns": [
            { "mData":"name", "sWidth":"150px" },
            { "mData":"supplier", "sWidth":"150px",
              "mRender": function (supplier, type, full)  {
                             return '<a href="'+supplier.slug+'">' + supplier.name + '</a>';
                         },
            },
            { "sType": 'numeric', "sClass": "right", "mData":"price", "sWidth":"70px" },
        ],
        "bServerSide": true,
        "sAjaxSource": "{% url 'api' 'MyClass' %}",
        "bStateSave" : true, // optional
                fnStateSave :function(settings,data){
                        localStorage.setItem("exampleState", JSON.stringify(data));
                },
                fnStateLoad: function(settings) {
                        return JSON.parse(localStorage.getItem("exampleState"));
                },
        fnInitComplete: function() { // use this if you don't hardcode column widths
            this.fnAdjustColumnSizing();
        }
    });
    $('#example').click(function() { // only if you don't hardcode column widths
        exampleTable.fnAdjustColumnSizing();
    });
});

Next you need to write an API for the data. I’ve put my api in its own file, apis.py, and made it a generic class-based view, so I’ve added to urls.py:

from django.conf.urls import patterns, url
from myapp import views, apis

urlpatterns = patterns('',
   ...
   url(r'^api/v1/(?P<cls_name>[\w-]+)/$',apis.MyAPI.as_view(),name='api'),
)

Then in apis.py, I put the following. You could use Django REST framework or TastyPie for a fuller solution, but this is often sufficient. I’ve written it in a way that can work across many classes; just pass the class name in the URL (with the right capitalization). One missing feature here is an ability to sort on multiple columns.

import sys
import json

from django.http import HttpResponse
from django.views.generic import View
from django.core.serializers.json import DjangoJSONEncoder

import myapp.models

class JSONResponse(HttpResponse):
    """
    Return a JSON serialized HTTP response
    """
    def __init__(self, request, data, status=200):
        # pass DjangoJSONEncoder to handle Decimal fields
        json_data = json.dumps(data, cls=DjangoJSONEncoder)
        super(JSONResponse, self).__init__(
            content=json_data,
            content_type='application/json',
            status=status,
        )

class JSONViewMixin(object):
    """
    Return JSON data. Add to a class-based view.
    """
    def json_response(self, data, status=200):
        return JSONResponse(self.request, data, status=status)

# API

# define a map from json column name to model field name
# this would be better placed in the model
col_name_map = {'name': 'name',
                'supplier': 'supplier__name', # can do foreign key look ups
                'price': 'price',
               }
class MyAPI(JSONViewMixin, View):
    "Return the JSON representation of the objects"
    def get(self, request, *args, **kwargs):
        class_name = kwargs.get('cls_name')
        params = request.GET
        # make this api general enough to handle different classes
        klass = getattr(sys.modules['myapp.models'], class_name)

        # TODO: this only pays attention to the first sorting column
        sort_col_num = params.get('iSortCol_0', 0)
        # default to value column
        sort_col_name = params.get('mDataProp_{0}'.format(sort_col_num), 'value')
        search_text = params.get('sSearch', '').lower()
        sort_dir = params.get('sSortDir_0', 'asc')
        start_num = int(params.get('iDisplayStart', 0))
        num = int(params.get('iDisplayLength', 25))
        obj_list = klass.objects.all()
        sort_dir_prefix = (sort_dir=='desc' and '-' or '')
        if sort_col_name in col_name_map:
            sort_col = col_name_map[sort_col_name]
            obj_list = obj_list.order_by('{0}{1}'.format(sort_dir_prefix, sort_col))

        filtered_obj_list = obj_list
        if search_text:
            filtered_obj_list = obj_list.filter_on_search(search_text)

        d = {"iTotalRecords": obj_list.count(),                # num records before applying any filters
            "iTotalDisplayRecords": filtered_obj_list.count(), # num records after applying filters
            "sEcho":params.get('sEcho',1),                     # unaltered from query
            "aaData": [obj.as_dict() for obj in filtered_obj_list[start_num:(start_num+num)]] # the data
        }

        return self.json_response(d)

This API depends on the model for two extra things:

  • the object manager needs a filter_on_search method, and
  • the model needs an as_dict method.

The filter_on_search method is tricky to get right. You need to search with OR on the different fields of the model, and AND on different words in the search text. Here is an example which subclasses the QuerySet and object Manager classes to allow chaining of methods (along the lines of this StackOverflow answer).

from django.db import models
from django.db.models import Q
from django.db.models.query import QuerySet

class Supplier(models.Model):
    name = models.CharField(max_length=60)
    slug = models.SlugField(max_length=200)

class MyClass(models.Model):
    name = models.CharField(max_length=60)
    supplier = models.ForeignKey(Supplier)
    price = models.DecimalField(max_digits=8, decimal_places=2)
    objects = MyClassManager()  # NB: MyClassManager (shown further down for readability) must be defined before this class in the actual file

    def as_dict(self):
        """
        Create data for datatables ajax call.
        """
        return {'name': self.name,
                'supplier': {'name': self.supplier.name, 'slug': self.supplier.slug},
                'price': self.price,
                }

class MyClassMixin(object):
    """
    This will be subclassed by both the Object Manager and the QuerySet.
    By doing it this way, you can chain these functions, along with filter().
    (A simpler approach would define these in MyClassManager(models.Manager),
        but won't let you chain them, as the result of each is a QuerySet, not a Manager.)
    """
    def q_for_search_word(self, word):
        """
        Given a word from the search text, return the Q object which you can filter on,
        to show only objects containing this word.
        Extend this in subclasses to include class-specific fields, if needed.
        """
        return Q(name__icontains=word) | Q(supplier__name__icontains=word)

    def q_for_search(self, search):
        """
        Given the text from the search box, search on each word in this text.
        Return a Q object which you can filter on, to show only those objects with _all_ the words present.
        Do not expect to override/extend this in subclasses.
        """
        q = Q()
        if search:
            searches = search.split()
            for word in searches:
                q = q & self.q_for_search_word(word)
        return q

    def filter_on_search(self, search):
        """
        Return the objects containing the search terms.
        Do not expect to override/extend this in subclasses.
        """
        return self.filter(self.q_for_search(search))

class MyClassQuerySet(QuerySet, MyClassMixin):
    pass

class MyClassManager(models.Manager, MyClassMixin):
    def get_query_set(self):
        return MyClassQuerySet(self.model, using=self._db)

This is a stripped down version of my production code. I haven’t fully tested this stripped down version, so please let me know if you find any problems with it.

Hope it helps!

  

Solve this puzzle – @memo and dynamic programming

Here’s the situation.  You have a puzzle with 15 round coloured plates arranged in an equilateral triangle, and 14 coloured pebbles on top of them.  One plate does not have a pebble – it is called the hole.  Your goal is to rearrange the pebbles so that they are on the matching coloured plates, in the minimum number of moves possible.  For each move, you can only move one pebble in a straight line to the hole, possibly leaping over other pebbles on the way.

The question is – can you design an algorithm to calculate, for any starting board, the minimum number of moves to solve it?

In fact this describes the game Panguru, recently produced by Second Nature Games and available as a board game and online.  In Panguru, there are two pebbles and plates of each colour, and one additional legal move down the centreline of the triangle.  If my quick description is too wordy, the online game will give you a much better feel for it; there are rules available too.

Panguru Online

A dynamic programming solution, with memoization

Here’s the approach I came up with, which I implemented in python. This was informed by the excellent book Python Algorithms, Mastering Basic Algorithms in the Python Language, by Magnus Lie Hetland, which I highly recommend.

Think of all the possible moves used to solve the puzzle as a tree. Number the positions 0 to 14.  The root node of the tree is the hole position, and child nodes give the position of the pebble that is being moved. So with the hole starting at the top of the triangle (position 0), one path through the tree might be 0-3-5.  (If you think about it, the last move always tells you where the hole finishes up.)

It is easy to measure how close we are to the solved board: we just count how many pebbles have the same colour as their plates.  When we hit 14, we are done!

In fact we can turn it around.  Let’s assume we have a solution (14 matching pebbles) after m moves. How did we get here?  Clearly, we must have had 13 matching pebbles the move before. (Because you only move one pebble at a time, each move can only increase or decrease the number of matching pebbles by one, or leave the number unchanged.)

And the move before that, we must have had either 12 or 13 matching pebbles. And so on.

This sounds like induction and lends itself to a recursive solution, like this. First define the tree structure via the Node class. The core of this is the __init__ method which keeps track of the node’s parent. I’ve also added original_pos to help us later, and two extra methods to display nodes nicely on the command line (__repr__) and to make it easier to access the node’s ancestry (__getitem__).

class Node():
    def __init__(self, parent, move, num_matching):
        self.parent = parent
        self.move = move
        self.num_matching = num_matching

    def original_pos(self, pos):
        """
        Return the original position of the pebble now at position pos.
        """
        if not self.parent:
            return pos
        if pos==self.move:
            prev_pos = self.parent.move
        elif pos==self.parent.move:
            prev_pos = self.move
        else:
            prev_pos = pos
        return self.parent.original_pos(prev_pos)

    def __getitem__(self, key):
        """
        If m is a node which you think of as the last move in a sequence,
        m[-3] is the third last move as a Node.
        m[-3].move is the third last move as an integer (which position was moved)
        """
        if key>=0:
            raise IndexError, "You must index moves from the end, e.g. m[-1] for the last move."
        if key==-1:
            return self
        else:
            try:
                return self.parent[key+1]
            except TypeError:
                raise IndexError, "Out of range."

    def __repr__(self):
        return "%s%d" % ((self.parent and ("%s-" % str(self.parent)) or ""),
                         self.move)

Then the puzzle-solving approach could be implemented like this:

class Puzzle():
    def __init__(self, plates, pebbles, allowed_moves):
    """
    Set up a puzzle instance for solving.
    Args:
        plates is a string of 15 chars representing colours
            e.g. "WCBRBOGYGPCORPY"
        pebbles is the same, with "-" for the hole
            e.g. "-PCBRYOGYGBCORP"
        allowed_moves is a list s.t.
            allowed_moves[i] = list of positions that a pebble at pos i can move to
                = list of positions that can move to this position, if it is the hole
            e.g. [[1, 2, 3, 4, 5, 6, 9, 10, 12, 14],
                  [0, 2, 3, 4, 6, 8, 10, 13],
                  [0, 1, 4, 5, 7, 9, 11, 14],
                  [0, 1, 4, 5, 6, 7, 10, 12],
                  [0, 1, 2, 3, 5, 7, 8, 11, 12, 13],
                  [0, 2, 3, 4, 8, 9, 12, 14],
                  [0, 1, 3, 7, 8, 9, 10, 11],
                  [2, 3, 4, 6, 8, 9, 11, 12],
                  [1, 4, 5, 6, 7, 9, 12, 13],
                  [0, 2, 5, 6, 7, 8, 13, 14],
                  [0, 1, 3, 6, 11, 12, 13, 14],
                  [2, 4, 6, 7, 10, 12, 13, 14],
                  [0, 3, 4, 5, 7, 8, 10, 11, 13, 14],
                  [1, 4, 8, 9, 10, 11, 12, 14],
                  [0, 2, 5, 9, 10, 11, 12, 13]]
    """
        self.plates = plates
        self.pebbles = pebbles
        self.allowed_moves = allowed_moves
        self.num_pebbles = len(filter(lambda x: x not in ["-"], self.pebbles))
        self.num_matching = sum(plates[i] == pebbles[i] for i in range(len(pebbles)))
        hole_pos = self.pebbles.find('-')
        self.root = Node(None, hole_pos, self.num_matching)

    def matching_nodes(self, turn, num_matching):
        """
        Return all the series of moves (as BoardNodes with parents)
        that have 'num_matching' matching spots after 'turn' turns.
        """
        if turn==0:
            if num_matching==self.num_matching:
                return [self.root]
            else:
                return []
        result = []
        for change in (-1,0,1):
            for prev_node in self.matching_nodes(turn-1, num_matching+change):
                for move in self.allowed_moves[prev_node.move]:
                    pebble_colour = self.pebbles[prev_node.original_pos(move)]
                    # was the moved pebble on a matching plate already?
                    old_pos_match = (self.plates[move]==pebble_colour)
                    # does the prev board's hole plate match the moved pebble?
                    new_pos_match = (self.plates[prev_node.move]==pebble_colour)
                    # did the move change how many positions were matching,
                    # by exactly the number we're looking at?
                    if (old_pos_match-new_pos_match)==change:
                        result += [Node(prev_node, move, num_matching)]
        return result

The interesting recursion here is going on in the matching_nodes method.  It just implements the idea that the solutions that have, say, 10 matching positions at turn 10, must have had either 9,10 or 11 matching positions at turn 9. It then works back to turn 0, at which point we know how many matching positions there were.

On top of this, we need a further method which finds the right question to ask. It could start by saying – give me all the solutions after 0 moves. If there are none, find all the solutions after 1 move. And keep trying until you find some solutions, e.g.:

    def optimal_moves(self, stop_at=20):
        """
        Return a tuple (fewest possible number of moves, [optimal moves for this board]).
        """
        num = 0
        result = self.matching_nodes(num, self.num_pebbles)
        while not result and num < stop_at:
            num += 1
            result = self.matching_nodes(num, self.num_pebbles)
        return (num, result)

The code above will work, but it will be slow and inefficient, because matching_nodes will often recurse through territory it has already covered. And that’s where Hetland’s memo decorator comes to the rescue. (This is only a few lines of code which you can find by searching at Google books.) This will cache the results of the decorated method, so that it does not need to be recalculated.  To use it, simply apply @memo to the def matching_nodes line, like so:

@memo
def matching_nodes(self, turn, num_matching):

And that’s it!
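For reference, a memo decorator along the lines of Hetland’s can be as small as this (my own sketch, not his exact code; note that it keys the cache on the method’s positional arguments, including self):

from functools import wraps

def memo(func):
    """Cache func's results, keyed on its (hashable) positional arguments."""
    cache = {}
    @wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper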

You can see this code in action when you ask for a hint in the online game. Each time you press the hint button, the javascript sends off an ajax query which triggers the server to run the above code on the player’s board, and return which moves would come next if the player plays optimally.

In fact, to get around annoying ever-expanding memory problems on the server, I’m running this asynchronously using Celery and Redis (as covered in an earlier blog post), and restarting the celery worker after every request.  But that’s another story…

I hope that has been of interest, and please let me know if you have any comments or improvements.

  

Better than jQuery.ajax() – Django with Angular

I have built a website for my games company Second Nature Games using Django.  Django brings lots of benefits like a nice admin panel where my partner can upload new game boards and add new content to the site.

However, to write games on the web you can’t rely too much on a server-side framework like Django. You are going to have to write some javascript as well.  I was keen for the challenge, having used some jQuery before.

In my first attempt, the boards were rendered by Django on the server, and then the javascript on the client-side would manipulate the document object model (DOM) as the game was played. As I described in an earlier post, I used jQuery’s $.ajax() command to alert the server when the game was finished, and then the javascript would update scores and stats as required.

But this is not easily scalable:  as I added a leaderboard and more stats, more and more things needed to be updated, and I got into a tangle.

The solution is to use a javascript MVC framework like Backbone, Angular or Ember (or Meteor even more so).  I chose to use Angular. There is a great website, To Do MVC, which shows the same To Do application written in many different frameworks, so you can make an informed choice if you’re facing the same decision.

Using one of these frameworks changes how you view the server: the server just sends data and the client renders it. This is known as “data on the wire”.  The effect is to turn the server into an API. (Unfortunately for Django fans, I suspect Django is really overkill for this. But having developed the rest of the site with Django, I stuck with it.)  Meanwhile, whenever the data updates, the javascript framework will update the DOM for you.

Here is a sample template for the Panguru leaderboard, using Angular:

<div class="info">
  <div class="leaderboard">
    <h3 class="title">Leaderboard</h3>
    <ul>
      <li ng-repeat="entry in leaderboard.entries">
        <span class="name">{{ entry.name }}</span>
        <span class="score">{{ entry.score }}</span>
      </li>
    </ul>
  </div>
</div>

You can see it’s pretty straightforward and elegant. The code to load it from the API is:

$scope.updateLeaderboard = function() {
  var that = this;
  $http.get('api/leaderboard/')
    .success(function(data, status, headers, config) {
      that.leaderboard.entries = data;
    })
    .error(function(data, status, headers, config) {
      console.log('error updating leaderboard');
    });
};

As you may have noticed, Angular and Django templates both use the double curly-brace notation, {{ ... }}. You don’t want to have them both try to interpret the same file.

One approach is to use Django’s {% verbatim %} tag. I think it’s nicer to separate the Django-rendered templates from the Angular-rendered templates completely, into two separate directories: Django templates stay in their app’s templates directory, while I put Angular’s html files into the app’s static/partials directory.

If the server needs to provide data to the javascript, I let Django render it in its template, and pick it up in the static html file. One example is passing the csrf token. Another arises because I’m using the staticfiles app, so Django renames all the static files, which unfortunately includes the locations of all the partial html files and images you need. E.g. in my Django templates/game_page.html:

<html lang="en" ng-app="panguru">
<head>
  <title>Panguru Online</title>
  ...
  {% addtoblock "css" %}{# if you are using sekizai #}
  <style type="text/css">
    .load {background-image: url('{% static "img/loader.gif" %}');}
  </style>
  {% endaddtoblock %}
  <script type="text/javascript">
    csrf_token = '{{ csrf_token }}';
    container_url = '{% static "partials/container.html" %}';
    ...
  </script>

</head>
<body>
  <div panguru-container></div>
</body>
</html>

This part I’m not especially proud of, and would love to hear if you have a better solution.

You can then either use a Django API framework like TastyPie to serve the api/leaderboard/ URL, or you can write one yourself. I did this starting from a JSON response mixin like the one in the Django docs, or this one by Oz Katz, developed for use with Backbone. Then, in Django views.py, I put:

from django.views.generic import View

class LeaderboardApi(JSONViewMixin, View):
    "Return the JSON representation of the leaderboard"
    def get(self, request, *args, **kwargs):
        return self.json_response(Score.as_list(limit=20))

That just requires a method on your model which returns the object in an appropriate format, e.g.

from django.contrib.auth.models import User
from django.db import models

class Score(models.Model):
    user = models.ForeignKey(User)
    score = models.PositiveIntegerField(default=0)

    class Meta:
        ordering = ['-score']

    @classmethod
    def as_list(cls, limit=None):
        scores = cls.objects.all()  # already ordered best-first, thanks to Meta.ordering
        if limit is not None:
            scores = scores[:limit]
        return [score.as_dict() for score in scores]

    def as_dict(self):
        return {'name': self.user.username, 'score': self.score}
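If you’d rather write the JSON response mixin yourself, it only needs to serialise a Python object and wrap it in an HttpResponse. Here is a minimal sketch of my own (the versions linked above handle more cases):

import json
from django.http import HttpResponse

class JSONViewMixin(object):
    "Serialise a Python object into a JSON HttpResponse."
    def json_response(self, obj, status=200):
        return HttpResponse(json.dumps(obj),
                            content_type='application/json',
                            status=status)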

There’s just one more thing, which is an inconsistency in how Django and Angular apply trailing slashes to URLs. Unfortunately if you use Angular’s $resource approach to interacting with the API, it will strip off any final slash, and then Django will redirect the URL to one with a slash, losing any POST parameters along the way. There’s a fair bit of discussion about this online. My approach has been to just turn off Django’s default behavior in settings.py:

APPEND_SLASH = False

Get ready to thoroughly test your app when you make this change, especially if you are also using Django-CMS, as I am. I uncovered a few places where I had been a bit sloppy in hardwiring URLs, and was inconsistent in my use of slashes in them.

The full result is the logic puzzle game Panguru, which you can play online or buy as a physical boardgame.  Please check it out and let me know what you think!

  

Get started with Python on a Mac

Here are the steps I recommend for getting started with python on a Mac (OS X), particularly if you are fairly new to the language.

  1. The Mac comes with python installed, and you can run this directly from the terminal application with the command “python”.  However there is a very nice interactive environment called the “ipython notebook” which I strongly recommend using too.  Installation instructions are available here - specifically, I downloaded Enthought Canopy and typed two commands into the terminal.  You should now be able to get an ipython notebook running in your internet browser with the terminal commands:
    cd somewhere/sensible/
    ipython notebook --pylab=inline

    The last part of this means that plots get drawn inside the notebook, which is handy. If you forget to add that, you can type into the notebook itself: %pylab inline.
    Note you use shift-Enter to run all the python commands in a cell.

  2. Install pip.  This lets you easily download python packages and will be very handy later.  This is easy – from the terminal type:
    sudo easy_install pip
  3. More advanced users may want to install virtualenv, so that different projects you work on can use different versions of packages. If you’re planning on putting any of your code in production, this is a must. But if you’re just getting started, ignore it for now.
  4. Advanced users will also want to install git so you can incrementally save your own versions of code. If you’re just starting out, leave this for later too.

OK, now let’s dive in the deep end by loading some financial data from Quandl, then manipulating it with Pandas and plotting it with matplotlib. You’ll need Pandas and the Quandl package, which you can get by typing into the terminal:

pip install pandas
pip install Quandl

Now in your ipython notebook type (I recommend doing each group of statements in its own cell, so that you can run them separately; remember it’s shift-enter to run the statements):

import numpy
import pandas
import Quandl

# Quandl codes for some daily commodity price series
assets = ['OFDP.ALUMINIUM_21.3',
          'OFDP.COPPER_6.3',
          'OFDP.LEAD_31.3',
          'OFDP.GOLD_2.3',
          'OFDP.ZINC_26.3']

data = Quandl.get(assets)  # returns a pandas DataFrame indexed by date
data.head()                # peek at the first few rows

data['2012':].plot()       # plot all the series from 2012 onwards

This link gives ways you can develop the plot further.
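For example (a suggestion of my own, not necessarily from that link), you could give the plot a title and label the y-axis; pandas returns a matplotlib Axes object, so the usual matplotlib calls work:

ax = data['2012':].plot(title='Commodity prices since 2012')
ax.set_ylabel('Price')  # label the y-axis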

To give an example of how to manipulate the data, you could try:

# show recent daily returns, normalised by asset volatility (ie. z-scores)
data = data.sort_index() # ensure oldest comes first
returns = (data/data.shift(1))
log_returns = numpy.log(returns)
vols = pandas.ewmstd(log_returns,span=180)  # daily vol, exp weighted moving avg
z = (returns-1)/vols # calc z-scores
z[z>50] = numpy.nan  # remove very bad data
z[z<-50] = numpy.nan
z.tail() # take a look at the most recent z-scores

That’s a very quick introduction to installing and using python on the Mac. I hope it’s helpful!
If you want more info about writing in Python itself, this link looks like a comprehensive site.

Let me know if you have any comments or improvements.

  

Mathematical web apps & visualisation