I have an optimisation I would like to run when the user presses a button on a Django page. For small cases, it is fine to run it synchronously. However, when it takes more than a second or so, it is not great to have the web server held back by a process of unknown length.
The solution I have settled on is Celery, with Redis as the message broker. I am using Redis over the alternatives since it seems to have much lower memory requirements (I find it uses under 2 MB, vs. 10-30 MB per Celery process). Equivalent commands for redis-queue (which uses about 10 MB per worker) instead of Celery are given in this post.
There is a bit of a learning curve to get started with this, so I am making a guide for the next person by listing all the steps I have taken to get set up on both my development platform (running Mac OS X) and a unix server (hosted by Webfaction). Along the way I hope to answer questions about security and what the right settings are to put in the redis.conf file, the celery config file, and the usual Django settings.py file.
Install Redis
Redis is the message broker. You will need to have this running at all times for Celery’s tasks to be executed.
Installing Redis on Mac OS X is described in this blog. Basically, just download the latest version from redis.io, and in the resulting untarred directory:
make test
make
sudo mv src/redis-server /usr/bin
sudo mv src/redis-cli /usr/bin
mkdir ~/.redis
touch ~/.redis/redis.conf
Installing Redis on your server is similar, though you may need to know how to download the code from the command line first (e.g. see this post):
wget http://redis.googlecode.com/files/redis-2.6.14.tar.gz
tar xzf redis-2.6.14.tar.gz
cd redis-2.6.14
make test
make
On the production server we don’t need to relocate the redis-server or redis-cli executables, as we’ll see in the next section.
Run Redis
To run Redis on your Mac, just type one of:
redis-server    # if no config required, or:
redis-server ~/Python/redis-2.6.14/redis.conf
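If you prefer to check the server from Python rather than with redis-cli, a quick sanity test like the following should work (a sketch; it assumes the redis Python package, which is pulled in below when we install django-celery-with-redis):

# a quick check that Redis is up and reachable
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)
print(r.ping())  # prints True if the server responded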
To run it on your Webfaction server, first add a custom app listening on a port, and note the port number you are assigned.
Now we need to daemonize it (see this post from the Webfaction community). In summary, in your redis directory, edit the redis.conf file like so (feel free to change the location of the pid file):
daemonize yes
...
pidfile /home/username/webapps/mywebapp/redis.pid
...
port xxxxx    # set to the port of the custom app you created
To test this works, type the commands below. If all is well, the pid file will now contain a process id, which you can check by providing it to the ps command.
src/redis-server redis.conf
cat /home/username/webapps/mywebapp/redis.pid
ps xxxxx    # use the number in the pid file
Note – when I did this without assigning the port number of the custom app, I got the following error:
# Warning: no config file specified, using the default config. In order to specify a config file use src/redis-server /path/to/redis.conf
# Unable to set the max number of files limit to 10032 (Operation not permitted), setting the max clients configuration to 4064.
# Opening port 6379: bind: Address already in use
It turns out someone else was already using port 6379, the default Redis port.
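If you hit the same error, a quick way to check whether a port is already taken (before assigning it to Redis) is a couple of lines of standard-library Python, e.g.:

# check whether something is already listening on a port (here, the default Redis port)
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(s.connect_ex(('localhost', 6379)) == 0)  # True means the port is already in use
s.close()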
Now in practice you will want Redis to be managed with cron, so that it restarts if there is a problem. Webfaction has some docs on how to do this here; I used:
crontab -e
# and add this line to the file, changing the path as necessary:
0,10,20,30,40,50 * * * * ~/webapps/redis/redis-2.6.14/src/redis-server ~/webapps/redis/redis-2.6.14/redis.conf
FYI, for me the running Redis process uses 1.7 MB (i.e. nothing compared to each celery process, as we’ll see).
Install Celery
The Celery docs cover this. Installation is simple, on both development and production machines (except that I install it in the web app’s environment with Webfaction, as explained here):
pip install django-celery-with-redis
I have added the following to settings.py, replacing the port number for production:
BROKER_URL = 'redis://localhost:6379/0'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'

import djcelery
djcelery.setup_loader()

INSTALLED_APPS = (
    ...
    'djcelery',
    ...
)
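One optional variation (my own tweak, not from the Celery docs): you can point the broker and the result backend at different Redis database numbers, so that flushing one does not clear the other:

# settings.py variant - separate Redis databases for messages and results
BROKER_URL = 'redis://localhost:6379/0'             # task messages
CELERY_RESULT_BACKEND = 'redis://localhost:6379/1'  # task results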
I also added the suggested lines to the top of wsgi.py:
import djcelery
djcelery.setup_loader()
I found lots more detail here, but I haven’t yet established how much of this is required.
Run a Celery worker
Now you need to start a Celery worker.
On your development server, you can enter your Django project directory and type:
python manage.py celery worker --loglevel=info
On your production server, I started by trying the same command above, to test whether Celery could find the Redis process and run jobs, and it worked fine. But in practice, the Celery docs say: “you will want to run the worker in the background as a daemon”. (Note this link also talks about Celery beat, which “is a scheduler. It kicks off tasks at regular intervals, which are then executed by the worker nodes available in the cluster.” In my case, I do not need this.)
To do this, I copied the CentOS celeryd shell script from the link at the end of the daemonization doc (since the server I am using runs CentOS), and placed it in a new celerydaemon directory in my Django project directory, along with the Django celeryd config file. (I renamed the config file from celeryd, which was confusing as it is the same name as the shell script, to celeryd.sysconfig.) I also created a new directory in my home directory called celery to hold the pid and log output files.
One more change is required, at least if you are using Webfaction to host your site: the call to celeryd_multi does not have a preceding python command by default. While this works in an ssh shell, it does not work with cron, I believe because $PATH is not set up the same way under cron. So I explicitly add the python command at the front, including the path to python.
The config file looks like this:
# Names of nodes to start (space-separated)
CELERYD_NODES="myapp-node_1"

# Where to chdir at start. This could be the root of a virtualenv.
CELERYD_CHDIR="/home/username/webapps/webappname/projectname"

# How to call celeryd-multi (for Django)
# note python (incl. path) added to the front
CELERYD_MULTI="/home/user/bin/python $CELERYD_CHDIR/manage.py celeryd_multi"

# Extra arguments
#CELERYD_OPTS="--app=my_application.path.to.worker --time-limit=300 --concurrency=8 --loglevel=DEBUG"
CELERYD_OPTS="--time-limit=180 --concurrency=2 --loglevel=DEBUG"

# If you want to restart the worker after every 3 tasks, you can use e.g.:
# (I mention it here because I couldn't work out how to use CELERYD_MAX_TASKS_PER_CHILD)
#CELERYD_OPTS="--time-limit=180 --concurrency=2 --loglevel=DEBUG --maxtasksperchild=3"

# Create log/pid dirs, if they don't already exist
CELERY_CREATE_DIRS=1

# %n will be replaced with the nodename
CELERYD_LOG_FILE="/home/username/celery/%n.log"
CELERYD_PID_FILE="/home/username/celery/%n.pid"

# Workers run as an unprivileged user
CELERYD_USER=celery
CELERYD_GROUP=celery

# Name of the project's settings module.
export DJANGO_SETTINGS_MODULE="myproject.settings"
In the shell script, I changed the two references to /var (DEFAULT_PID_FILE and DEFAULT_LOG_FILE) and the reference to /etc (CELERY_DEFAULTS) to directories I can write to, e.g.:
DEFAULT_PID_FILE="/home/username/celery/%n.pid"
DEFAULT_LOG_FILE="/home/username/celery/%n.log"
...
CELERY_DEFAULTS=${CELERY_DEFAULTS:-"/home/username/webapps/webappname/projectname/celerydaemon/celeryd.sysconfig"}
I found a problem in the CentOS script: it calls /etc/init.d/functions, which resets the $PATH variable globally, so that the rest of the script cannot find python any more. I have raised this as an issue, where you can also see my workaround.
To test things out on the production server, you can type the command below (use sh rather than source here, because the script ends with an exit and you don’t want to be logged out of your ssh session each time):
sh celerydaemon/celeryd start
and you should see a new .pid file in ~/celery showing the process id of the new worker(s).
Type the following line to stop all the celery processes:
sh celerydaemon/celeryd stop
Restart celery with cron if needed
As with Redis, you can ensure the celery workers are restarted by cron if they fail. Unlike with Redis, there are a lot of tricks here for the unwary (i.e. me).
- Write a script to check whether a celery process is running. Webfaction provides an example here; I changed its last line to read:

sh /home/username/webapps/webappname/projectname/celerydaemon/celeryd restart
- This is the script we will ask cron to run. Note that I use restart here, not start; I have found in a real case that if the server dies suddenly, celery continues to think it is still running even when it isn’t, so start does nothing. So add to your crontab (assuming the above script is called celery_check.sh):

crontab -e
1,11,21,31,41,51 * * * * ~/webapps/webappname/projectname/celerydaemon/celery_check.sh
- One last thing, pointed out to me in correspondence with Webfaction: the celeryd script file implements restart with:

stop && start

So if stop fails for any reason, the script will not restart celery. For our purposes, we want start to occur regardless, so change this line to:

stop; start;
Your celery workers should now restart if there is a problem.
Controlling the number of processes
If you’re like me, you are now confused about the difference between a node, a worker, a process and a thread. When I run the celeryd start command, it kicks off three processes, one of which has the pid in the node’s pid file. This is despite my request for one node, and “--concurrency=2” in the config file.
When I change the concurrency setting to 1, I get two processes. When I also add another node, I get four processes.
So what I assume is happening is: workers are the same things as nodes, and each worker needs one process for overhead plus “concurrency” additional processes.
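If you want to check this assumption, you can ask the running workers what they report about their pools; a sketch using the Celery 3.x inspect API, run from python manage.py shell:

# ask running workers for their stats (Celery 3.x API)
from celery.task.control import inspect

stats = inspect().stats()  # dict keyed by node name; None if no workers reply
if stats:
    for node, info in stats.items():
        # 'max-concurrency' should match the --concurrency setting for that node
        print(node, info['pool']['max-concurrency'])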
At first, I found each celery process required about 30-35 MB (regardless of the number of nodes or concurrency), so three used about 100 MB. When I looked again a week later, the processes were using only 10 MB each, even when solving tasks. I’m not sure what explains the discrepancy.
Use it
With this much, you can adapt the Celery demo (adding two numbers) to your own site, and it should work.
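For reference, a minimal tasks.py along those lines might look like this (a sketch; solve is a stand-in for your real optimisation, and matches the name used in the views below):

# myapp/tasks.py - minimal sketch based on the Celery add-two-numbers demo
from celery import task

@task()
def solve(x, y):
    # replace this with your real optimisation
    return x + y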
On my site I use ajax and javascript to regularly poll whether the optimisation is finished. The following files hopefully give the basic idea.
urls.py
# urls.py
from myapp.views import OptView, status_view
...
url(r'^opt/', OptView.as_view(), name="opt"),
url(r'^status/', status_view, name="status"),  # for ajax
...
views.py
# views.py
import json

from django.http import HttpResponse
from django.views.generic import TemplateView
from django.core.exceptions import SuspiciousOperation
from celery.result import AsyncResult

from . import tasks


class OptView(TemplateView):
    template_name = 'opt.html'

    def get_context_data(self, **kwargs):
        """ Kick off the optimisation. """
        # replace the next line with a call to your task ('params' is a placeholder)
        result = tasks.solve.delay(params)
        # save the task id so we can query its status via ajax
        self.request.session['task_id'] = result.task_id
        # if you need to cancel the task, use:
        # revoke(self.request.session['task_id'], terminate=True)
        context = super(OptView, self).get_context_data(**kwargs)
        return context


def status_view(request):
    """
    Called by the opt page via ajax to check if the optimisation is
    finished. If it is, return the results in JSON format.
    """
    if not request.is_ajax():
        raise SuspiciousOperation("No access.")
    try:
        result = AsyncResult(request.session['task_id'])
    except KeyError:
        ret = {'error': 'No optimisation (or you may have disabled cookies).'}
        return HttpResponse(json.dumps(ret))
    try:
        if result.ready():
            # to do - check if it is really solved, or if it timed out or failed
            ret = {'status': 'solved'}
            # result.result holds the task's return value; send back the relevant part
            ret.update({'result': result.result})
        else:
            ret = {'status': 'waiting'}
    except AttributeError:
        ret = {'error': 'Cannot find an optimisation task.'}
    return HttpResponse(json.dumps(ret))
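Before wiring up the ajax, you can sanity-check the round trip from python manage.py shell (assuming the sketch tasks.py above and a running worker):

# smoke test: queue a task and wait for the result
from myapp import tasks

result = tasks.solve.delay(2, 3)
result.ready()          # False until a worker has finished the task
result.get(timeout=10)  # blocks until done; returns 5 for the sketch task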
javascript
// include this javascript in your template (needs jQuery)
// also include the {% csrf_token %} tag, not nec. in a form
$(function() {
    function handle_error(xhr, textStatus, errorThrown) {
        clearInterval(interval_id);
        alert("Please report this error: " + errorThrown + xhr.status + xhr.responseText);
    }
    function show_status(data) {
        var obj = JSON.parse(data);
        if (obj.error) {
            clearInterval(interval_id);
            alert(obj.error);
        }
        if (obj.status == "waiting") {
            // do nothing
        } else if (obj.status == "solved") {
            clearInterval(interval_id);
            // show the solution
        } else {
            clearInterval(interval_id);
            alert(data);
        }
    }
    function check_status() {
        $.ajax({
            type: "POST",
            url: "/status/",
            data: {csrfmiddlewaretoken: document.getElementsByName('csrfmiddlewaretoken')[0].value},
            success: show_status,
            error: handle_error
        });
    }
    setTimeout(check_status, 0.05);                     // run the first check right away
    var interval_id = setInterval(check_status, 1000);  // then check every second
});
As mentioned in the comments to the code above, if you need to cancel an optimisation, you can use:
from celery.task.control import revoke  # where revoke lives in Celery 3.x
revoke(task_id, terminate=True)
Monitoring
You can monitor what’s happening in celery with celery flower, at least on dev:
pip install flower
celery flower --broker=redis://localhost:PORTNUM/0
And then go to localhost:5555 in your web browser.
When you use djcelery, you will also find a djcelery app in the admin panel, where you can view workers and tasks. There is a little bit of set-up required to populate these tables; I believe you need the snapshot camera (python manage.py celerycam) running alongside your worker. More info about this is provided in the celery docs.
Security
Some links on this topic:
- http://redis.io/topics/security
- http://docs.celeryproject.org/en/latest/userguide/security.html
I’ll add to this section as I learn more about it.
I hope that’s helpful – please let me know what you think.