Tag Archives: data

How to estimate uncertain data

Data Estimator is a tool that helps answer questions about uncertain quantities, e.g. “What will our company’s sales be next year?”

It is designed to be used as part of an interview process, where expert judgements are drawn out and quantified.

It’s a reminder that in this world of big data, some things remain hard to measure, especially when it comes to the future.

You will be asked a handful of questions, using “probability wheels” like this to visualise the uncertainties:

probability wheel

When you’re done, you will be able to see and export the resulting probability distribution. For example, the result could be “there is a 60% chance this market will more than double in five years, a 20% chance it will more than treble, but a 10% chance it will shrink.”

probability density

It will also show some alternatives for how you could place the uncertainty in a decision tree, e.g.

decision tree node

Try it out here: http://racingtadpole.com/more/estimator/

  

Curve fitting with javascript & d3

If javascript is up to amazing animations and visualisations with d3, maybe it’s up to non-linear curve fitting too, right?

Something like this, perhaps:



Here’s how I did it:

  • The hard work is done using Cobyla (“Constrained Optimization BY Linear Approximation”), which Anders Gustafsson ported to Java and Reinhard Oldenburg ported to Javascript, as per this stackoverflow post.
    Cobyla minimises a non-linear objective function subject to constraints.
  • The demo and its components (including cobyla) use the module pattern, which has the advantage of keeping the global namespace uncluttered.
  • To adapt Cobyla to this curve fitting problem, I wrote a short wrapper which is added onto the cobyla module as cobyla.nlFit(data, fitFn, start, min, max, constraints, solverParams). This function minimises the sum of squared differences (y1-y2)^2 between the data points (x,y1) and the fitted points (x,y2); see the Python sketch after this list.
  • The Weibull cumulative distribution function (CDF), inverse CDF and mean are defined in the “distribution” module. Thus distribution.weibull([2,1,5]).inverseCdf(0.5) gives the median (50th percentile) of a Weibull distribution with shape parameter 2, scale parameter 1 and location parameter 5.
  • The chart is built with d3. I am building an open-source library of a few useful chart types, d3elements, which are added onto the d3 module as d3.elts. This one is called d3.elts.xyChart.
  • So that the user interface doesn’t freeze during the calculation, I use a javascript web worker to calculate the curve fit. I was surprised how easy it was to set this up.
  • I apologise in advance that this sample code is quite complicated. If you see ways to make it simpler, please let me know.
  • Finally, this may be obvious, but I like the rigour that a tool like jshint imposes. Hence the odd comment at the top of fitdemo.js, /* global fitdemo: true, jQuery, d3, _, distribution, cobyla */
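
For concreteness, here is a rough Python analog of the same idea – fitting the parameters of a Weibull CDF by minimising the sum of squared differences, subject to constraints – using scipy’s COBYLA solver. This is only an illustrative sketch, not the javascript implementation; the function names and toy data are my own.

import numpy as np
from scipy.optimize import minimize

def weibull_cdf(x, shape, scale, loc):
    # Weibull CDF with shape, scale and location parameters
    z = np.clip((x - loc) / scale, 0, None)
    return 1 - np.exp(-z**shape)

def sum_squared_error(params, x, y):
    # the objective: the sum of squared differences (y1-y2)^2
    return np.sum((y - weibull_cdf(x, *params))**2)

# toy data: noisy samples from a Weibull CDF with shape 2, scale 1, location 5
x = np.linspace(5, 8, 30)
y = weibull_cdf(x, 2.0, 1.0, 5.0) + np.random.normal(0, 0.01, x.size)

# constrain shape and scale to stay positive (Cobyla handles constraints)
result = minimize(sum_squared_error, x0=[1.5, 1.5, 4.5], args=(x, y),
                  method="COBYLA",
                  constraints=[{'type': 'ineq', 'fun': lambda p: p[0]},
                               {'type': 'ineq', 'fun': lambda p: p[1]}])
print(result.x)  # fitted (shape, scale, location)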

Check out the source code on bitbucket here. You can see it being used for uncertain data estimation here.

Please let me know what you think!

  

Visualising Flows in a D3 Chord Diagram with Hover

This is an example of a reusable chart built using d3.

The idea is that you have a matrix of the flows between one category (here, optimist/neutral/pessimist) to another (introvert/extrovert). `d3.elts.flowChord()` then converts this matrix into a chord diagram, with the option of hover text.

Check the sample source code on bitbucket for the full description of how to use it; here is the essence:

  var colors = d3.scale.ordinal().range(["#AAA", "steelblue", "green", "orange", "brown"]);
  var hoverHtml = {'Introvert': '<h1>Introverts</h1>Like to be by themselves', 
      'Extrovert': '<h1>Extroverts</h1>Like the company of other people', 
      'Optimist': '<h1>Optimists</h1>Look on the bright side of life',
      'Neutral': '<h1>Neutrals</h1>Life could be good, it could be bad',
      'Pessimist': '<h1>Pessimists</h1>See the glass half empty'}
  var chordDiagram = d3.elts.flowChord().colors(colors).hoverHtml(hoverHtml).rimWidth(30);
  var data = [['Disposition','Optimist','Neutral','Pessimist'],
              ['Introvert', 0.8, 0.4, 0.67], 
              ['Extrovert', 0.2, 0.6, 0.33]]
  d3.select("#flow").datum(data).call(chordDiagram);
  

9 Lessons from PyConAU 2014

A summary of what I learned at PyCon AU in Brisbane this year. (Videos of the talks are here.)

1. PyCon’s code of conduct

Basically, “Be nice to people. Please.”

I once had a boss who told me he saw his role as maintaining the culture of the group.  At first I thought that seemed a strange goal for someone so senior in the company, but I eventually decided it was enlightened: a place’s culture is key to making it desirable, and making the work sustainable. So I like that PyCon takes the trouble to try to set the tone like this, when it would be so easy for a bunch of programmers to stay focused on the technical.

2. Django was made open-source to give back to the community

Ever wondered why a company like Lawrence Journal-World would want to give away its valuable IP as open source? In a “fireside chat” between Simon Willison (Django co-creator) and Andrew Godwin (South author), it was revealed that the owners knew that much of their CMS framework had been built on open source software, and they wanted to give back to the community. It just goes to show, no matter how conservative the organisation you work for, if you believe some of your work should be made open source, make the case for it.

3. There are still lots more packages and tools to try out

That lesson’s copied from my post last year on PyCon AU. Strangely this list doesn’t seem to be any shorter than last year – but it is at least a different list.

Things to add to your web stack -

  • Varnish – “if your server’s not fast enough, just add another”.  Apparently a scary scripting language is involved, but it can take your server from handling 50 users to 50,000. Fastly is a commercial service that can set this up for you.
  • Solr and elasticsearch are ways to make searches faster; use them with django-haystack.
  • Statsd & graphite for performance monitoring.
  • Docker.io

Some other stuff -

  • mpld3 – convert matplotlib to d3. Wow! I even saw this in action in an ipython notebook.
  • you can use a directed graph (e.g. using networkx) to determine the order of processes in your code; see the sketch below
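
For example, here is a minimal sketch of that last idea using networkx’s topological sort; the process names and dependencies are invented:

import networkx as nx

# each edge says "this process must run before that one"
g = nx.DiGraph()
g.add_edges_from([
    ("load_data", "clean_data"),
    ("clean_data", "fit_model"),
    ("load_config", "fit_model"),
    ("fit_model", "write_report"),
])

# a valid order in which to run the processes
print(list(nx.topological_sort(g)))
# e.g. ['load_data', 'clean_data', 'load_config', 'fit_model', 'write_report']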

Here are some wider tools for bioinformaticians (if that’s a word), largely from Clare Sloggett’s talk -

  • rosalind.info – an educational tool for teaching bioinformatics algorithms in python.
  • nectar research cloud – a national cloud for Australian researchers
  • biodalliance – a fast, interactive, genome visualization tool that’s easy to embed in web pages and applications (and ipython notebooks!)
  • ensembl API – an API for genomics – cool!

And some other sciency packages -

  • Natural Language Toolkit NLTK
  • Scikit Learn can count words in docs, and separate data into training and testing sets (see the sketch after this list)
  • febrl – to connect user records together when their data may be incorrectly entered
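
As a minimal sketch of those two Scikit Learn features (the documents and labels are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in later versions

docs = ["the cat sat on the mat",
        "the dog ate my homework",
        "the cat and the dog are friends"]
labels = [0, 1, 1]  # made-up labels

counts = CountVectorizer().fit_transform(docs)  # sparse document-term count matrix
X_train, X_test, y_train, y_test = train_test_split(counts, labels, test_size=0.33)
print(X_train.shape, X_test.shape)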

One standout talk for me was Ryan Kelly’s pypy.js, implementing a compliant and fast python in the browser entirely in javascript. The only downside is it’s 15 MB to download, but he’s working on it!

And finally, check out this alternative to python: Julia, “a high-level, high-performance dynamic programming language for technical computing”, and Scirra’s Construct 2, a game-making program for kids (Windows only).

4. Everyone loves IPython Notebook

I hadn’t thought to embed javascript in notebooks, but you can. You can even use them collaboratively through Google docs using Jupyter’s colaboratory. You can get a table-of-contents extension too.

5. Browser caching doesn’t have to be hard

Remember, your server is not just generating html – it is generating an http response, and that includes some headers like “last modified”, “etag”, and “cache control”. Use them. Django has decorators to make it easy. See Mark Nottingham’s tutorial. (This from a talk by Tom Eastman.)
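
For instance, here is a minimal sketch of two of those decorators in Django (the view and the Article model are invented for illustration):

from django.http import HttpResponse
from django.views.decorators.cache import cache_control
from django.views.decorators.http import last_modified

from myapp.models import Article  # hypothetical model

def latest_change(request, *args, **kwargs):
    # Django uses this for the Last-Modified header, and answers
    # If-Modified-Since requests with 304 Not Modified
    return Article.objects.latest('updated').updated

@cache_control(max_age=600)   # sets Cache-Control: max-age=600
@last_modified(latest_change)
def article_list(request):
    return HttpResponse("...")  # your usual view logic here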

6. Making your own packages is a bit hard

I had not heard of wheels before, but they replace eggs as a “distributable unit of python code” – really just a zip file with some meta-data, possibly including operating-system-dependent binaries. Tools that you’ll want to use include tox (to run tests in lots of different environments); sphinx (to auto-generate your documentation) and then ReadTheDocs to host your docs; check-manifest to make sure your MANIFEST.in file has everything it needs; and bumpversion so you don’t have to change your version number in five different places every time you update the code.

If you want users to install your package with “pip install python-fire”, and then import it in Python with “import fire”, then you should name your enclosing folder python-fire, and inside that you should have another folder named fire. Also, you can install this package while you are testing it by cd-ing to the python-fire directory and typing pip install -e . (note the final full-stop; the -e flag makes it editable).
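
To make that concrete, a minimal setup.py for the python-fire example might look like this (an untested sketch – fill in your own metadata):

from setuptools import setup, find_packages

setup(
    name='python-fire',        # what users will pip install
    version='0.1.0',
    packages=find_packages(),  # picks up the inner 'fire' folder
    description='An example package',
    long_description=open('README').read(),
    url='https://example.com/python-fire',  # placeholder
    license='MIT',             # match your LICENSE file
)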

Once you have added a LICENSE, README, docs, tests, MANIFEST.in and setup.py, and optionally a setup.cfg (to the python-fire directory in the above example), and you have pip installed setuptools, wheel and twine, you run both

python setup.py bdist_wheel [--universal]
python setup.py sdist

The bdist version produces a binary distribution that is operating-system-specific if required (the --universal flag says it will run on all operating systems, in both Python 2 and Python 3). The sdist version is a source distribution.

To upload the result to pypi, run

twine upload dist/*

(This from a talk by Russell Keith-Magee.)  Incidentally, piprot is a handy tool to check how out-of-date your packages are. Also see the Hitchhiker’s Guide to Packaging.

7. Security is never far from our thoughts

This lesson is also copied from last year’s post. If you offer a free service (like Heroku), some people will try to abuse it. Heroku has ways of detecting potentially fraudulent users very quickly, and hopes to open source them soon. And be careful of your APIs which accept data – XML and YAML in particular have scary features which can let people run bad things on your server.
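
To make the YAML danger concrete, here is a small sketch with PyYAML; the rule of thumb is to use yaml.safe_load on anything you didn’t write yourself:

import yaml

# a YAML document can ask the default loader to build arbitrary python
# objects - this one would run a shell command if passed to yaml.load:
nasty = "!!python/object/apply:os.system ['echo pwned']"
# yaml.load(nasty)  # don't do this with untrusted data!

# safe_load only builds plain data structures, and rejects the tag above
print(yaml.safe_load("colour: green\nsizes: [1, 2, 3]"))
# {'colour': 'green', 'sizes': [1, 2, 3]}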

8. Database considerations

Some tidbits from Andrew Godwin’s talk (of South fame)…

  • Virtual machines are slow at I/O, so don’t put your database on one – put your databases on SSDs. And try not to run other things next to the database.
  • Setting default values on a new column takes a long time on a big database. (Postgres can add a NULL field for free, but not MySQL.)
  • Schema-less (aka NoSQL) databases make a lot of sense for CMSes.
  • If only one field in a table is frequently updated, separate it out into its own table.
  • Try to separate read-heavy tables (and databases) from write-heavy ones.
  • The more separate you can keep your tables from the start, the easier it will be to refactor (eg. shard) later to improve your database speed.

9. Go to the lightning talks

I am constantly amazed at the quality of the 5-minute (strictly enforced) lightning talks. Russell Keith-Magee’s toga provides a way to program native iOS, Mac OS, Windows and linux apps in python (with Android coming). (He has also implemented the constraint-based layout engine Cassowary in python, with tests, along the way.) Produce displays of lightning on your screen using the von Mises distribution and amazingly quick typing. Run python2 inside python3 with sux (a play on six).  And much much more…

Finally, the two keynotes were very interesting too. One was by Katie Cunningham on making your websites accessible to all, including people with sight or hearing problems, or dyslexia, or colour-blindness, or who have trouble using the keyboard or the mouse, or may just need more time to make sense of your site. Oddly enough, doing so tends to improve your site for everyone anyway (as Katie said, has anyone ever asked for more flashing effects on the margins of your page?). Examples include captioning videos, being careful with red and green (use Vischeck), using ARIA, reading the standards, and, ideally, having a text-based description of any graphs on the site, like you might describe to a friend over the phone. Thinking of an automated way to do that last one sounds like an interesting challenge…

The other keynote was by James Curran from the University of Sydney on the way in which programming – or better, “computational thinking” – will be taught in schools. Perhaps massaging our egos at a programming conference, he claimed that computational thinking is “the most challenging thing that people do”, as it requires managing a high level of complexity and abstraction. Nonetheless, requiring kindergarteners to learn programming seemed a bit extreme to me – until he explained that at that age kids would not be in front of a computer, but rather learning “to be exact”. For example, describing how to make a slice of buttered bread is essentially an algorithm, and it’s easy to miss steps required (like opening the cupboard door to get the bread). If you’re interested, some learning resources include MIT’s scratch, alice (using 3D animations), grok learning and the National Computer Science School (NCSS).

All in all, another excellent conference – congratulations to the organisers, and I look forward to next year in Brisbane again.

  

How is your tax money being spent?

Want to know how your tax money is being spent? The Australian Budget Explorer is an interactive way to explore spending by portfolio, agency, program or even in more detail; compare 2014 against previous years; and search for the terms that interest you.

Australian Budget Explorer

This was produced in collaboration with BudgetAus. This year for the first time, the team at data.gov.au provided the Budget expenditure data in a single spreadsheet, which Rosie Williams (InfoAus) manipulated to include further data on the Social Services portfolio. The collaboration is producing lots of good visualisations, collected at AusViz.

I won’t editorialise about the Budget here; instead here is my data and extensions wishlist:

Look-through to include State budgets

The biggest line item (component) in the Federal Budget is $54 billion for “Administered expenses: Special appropriation GST Revenue Entitlements – Federal Financial Relations Act 2009”, which I take to be revenue passed to the States. I would love to be able to “look through” this item into how the States spend it.

The BudgetAus team has provided some promising data leads here.

Unique identifiers to track spending over time

One of the most frequent requests I get is to track changes in spending over time.

Unfortunately this is hard, as there are no unique identifiers for a given portfolio, program, agency or component. That means if the name changes from one year to the next, it is hard to work out which old name corresponds to which new name. For example, in 2014 the Department of Employment & Workplace Relations was split into the Department of Employment and the Department of Education, while the Environment portfolio used to be “Sustainability, Environment, Water, Population and Communities”.

It would be great to give all spending an identifier, and have a record of how identifiers map from one year to the next.

What money is actually spent?

How does the budget relate to what is spent? There is some info here at BudgetAus, but the upshot is “This might be a good task for a future group of volunteers”…

Revenue

There is revenue data available here – I haven’t looked at it carefully yet, but I hope to include it, if possible.

Cross-country comparison

It would be great to compare the percentages spent in key areas by governments across the world. Maybe it’s already being done? To do this I’d need some standard hierarchy of categories (health, education, defence, and subdivisions of these, etc), and we’d need every country’s government (and every State government) to tag their spending by those categories. Sounds simple in concept but I bet it would be hard to make it happen.

In the meantime, my plan is to check quandl for data and see how far I can go with what’s there…

D3

Finally, many thanks to the authors for the awesome d3 package!

Conclusion

If you have any comments or know how to solve any of the data issues raised above, please let me know.

  

Serve datatables with ajax from Django

Datatables is an amazing resource which lets you quickly display lots of data in tables, with sorting, searching and pagination all built in.

The simplest way to use it is to populate the table when you load the page.  Then the sorting, searching and pagination all just happen by themselves.

If you have a lot of data, you can improve page load times by serving only the data you need, using ajax. At first sight, this looks easy too.  However, be warned: if the server is sending only the data needed, then the server needs to take care of sorting, searching and pagination. You will also need to control the table column sizes more carefully.

There’s quite a lot required to get this right, so I thought I’d share what I’ve learned from doing this in Django.

Start with the following html. This example demonstrates using the render function to insert a link into the table.

<div class="row">
<table class="table table-striped table-bordered" id="example" style="clear: both;">
<thead>
<tr>
<th>Name</th>
<th>Supplier</th>
<th>Price</th>
</tr>
</thead>
</table>
</div>

and javascript:

$(document).ready(function() {
    var exampleTable = $('#example').dataTable( {
        "aaSorting": [[ 2, "asc" ]],
        "aoColumns": [
            { "mData":"name", "sWidth":"150px" },
            { "mData":"supplier", "sWidth":"150px",
              "mRender": function (supplier, type, full)  {
                             return '<a href="'+supplier.slug+'">' + supplier.name + '</a>';
                         },
            },
            { "sType": 'numeric', "sClass": "right", "mData":"price", "sWidth":"70px" },
        ],
        "bServerSide": true,
        "sAjaxSource": "{% url 'api' 'MyClass' %}",
        "bStateSave": true, // optional
        "fnStateSave": function(settings, data) {
            localStorage.setItem("exampleState", JSON.stringify(data));
        },
        "fnStateLoad": function(settings) {
            return JSON.parse(localStorage.getItem("exampleState"));
        },
        "fnInitComplete": function() { // use this if you don't hardcode column widths
            this.fnAdjustColumnSizing();
        }
    });
    $('#example').click(function() { // only if you don't hardcode column widths
        exampleTable.fnAdjustColumnSizing();
    });
});
Next you need to write an API for the data. I’ve put my api in its own file, apis.py, and made it a generic class-based view, so I’ve added to urls.py:

from django.conf.urls import patterns, url
from myapp import views, apis

urlpatterns = patterns('',
   ...
   url(r'^api/v1/(?P<cls_name>[\w-]+)/$', apis.MyAPI.as_view(), name='api'),
)

Then in apis.py, I put the following. You could use Django REST framework or TastyPie for a fuller solution, but this is often sufficient. I’ve written it in a way that can work across many classes; just pass the class name in the URL (with the right capitalisation). One missing feature here is an ability to sort on multiple columns.

import sys
import json

from django.http import HttpResponse
from django.views.generic import View
from django.core.serializers.json import DjangoJSONEncoder

import myapp.models

class JSONResponse(HttpResponse):
    """
    Return a JSON serialized HTTP response
    """
    def __init__(self, request, data, status=200):
        # pass DjangoJSONEncoder to handle Decimal fields
        json_data = json.dumps(data, cls=DjangoJSONEncoder)
        super(JSONResponse, self).__init__(
            content=json_data,
            content_type='application/json',
            status=status,
        )

class JSONViewMixin(object):
    """
    Return JSON data. Add to a class-based view.
    """
    def json_response(self, data, status=200):
        return JSONResponse(self.request, data, status=status)

# API

# define a map from json column name to model field name
# this would be better placed in the model
col_name_map = {'name': 'name',
                'supplier': 'supplier__name', # can do foreign key look ups
                'price': 'price',
               }
class MyAPI(JSONViewMixin, View):
    "Return the JSON representation of the objects"
    def get(self, request, *args, **kwargs):
        class_name = kwargs.get('cls_name')
        params = request.GET
        # make this api general enough to handle different classes
        klass = getattr(sys.modules['myapp.models'], class_name)

        # TODO: this only pays attention to the first sorting column
        sort_col_num = params.get('iSortCol_0', 0)
        # default to the name column
        sort_col_name = params.get('mDataProp_{0}'.format(sort_col_num), 'name')
        search_text = params.get('sSearch', '').lower()
        sort_dir = params.get('sSortDir_0', 'asc')
        start_num = int(params.get('iDisplayStart', 0))
        num = int(params.get('iDisplayLength', 25))
        obj_list = klass.objects.all()
        sort_dir_prefix = '-' if sort_dir == 'desc' else ''
        if sort_col_name in col_name_map:
            sort_col = col_name_map[sort_col_name]
            obj_list = obj_list.order_by('{0}{1}'.format(sort_dir_prefix, sort_col))

        filtered_obj_list = obj_list
        if search_text:
            filtered_obj_list = obj_list.filter_on_search(search_text)

        d = {
            "iTotalRecords": obj_list.count(),                 # num records before applying any filters
            "iTotalDisplayRecords": filtered_obj_list.count(), # num records after applying filters
            "sEcho": params.get('sEcho', 1),                   # unaltered from query
            "aaData": [obj.as_dict() for obj in filtered_obj_list[start_num:(start_num + num)]],  # the data
        }

        return self.json_response(d)

This API depends on the model for two extra things:

  • the object manager needs a filter_on_search method, and
  • the model needs an as_dict method.

The filter_on_search method is tricky to get right. You need to search with OR on the different fields of the model, and AND on different words in the search text. Here is an example which subclasses the QuerySet and object Manager classes to allow chaining of methods (along the lines of this StackOverflow answer).

from django.db import models
from django.db.models import Q
from django.db.models.query import QuerySet

class MyClassMixin(object):
    """
    This will be subclassed by both the Object Manager and the QuerySet.
    By doing it this way, you can chain these functions, along with filter().
    (A simpler approach would define these in MyClassManager(models.Manager),
        but won't let you chain them, as the result of each is a QuerySet, not a Manager.)
    """
    def q_for_search_word(self, word):
        """
        Given a word from the search text, return the Q object which you can filter on,
        to show only objects containing this word.
        Extend this in subclasses to include class-specific fields, if needed.
        """
        return Q(name__icontains=word) | Q(supplier__name__icontains=word)

    def q_for_search(self, search):
        """
        Given the text from the search box, search on each word in this text.
        Return a Q object which you can filter on, to show only those objects with _all_ the words present.
        Do not expect to override/extend this in subclasses.
        """
        q = Q()
        if search:
            searches = search.split()
            for word in searches:
                q = q & self.q_for_search_word(word)
        return q

    def filter_on_search(self, search):
        """
        Return the objects containing the search terms.
        Do not expect to override/extend this in subclasses.
        """
        return self.filter(self.q_for_search(search))

class MyClassQuerySet(QuerySet, MyClassMixin):
    pass

class MyClassManager(models.Manager, MyClassMixin):
    def get_query_set(self):  # named get_queryset in Django 1.6+
        return MyClassQuerySet(self.model, using=self._db)

class Supplier(models.Model):
    name = models.CharField(max_length=60)
    slug = models.SlugField(max_length=200)

class MyClass(models.Model):
    name = models.CharField(max_length=60)
    supplier = models.ForeignKey(Supplier)
    price = models.DecimalField(max_digits=8, decimal_places=2)
    objects = MyClassManager()  # note: MyClassManager must be defined before MyClass

    def as_dict(self):
        """
        Create data for datatables ajax call.
        """
        return {'name': self.name,
                'supplier': {'name': self.supplier.name, 'slug': self.supplier.slug},
                'price': self.price,
                }

This is a stripped down version of my production code. I haven’t fully tested this stripped down version, so please let me know if you find any problems with it.

Hope it helps!