Introducing django-bakery
A set of helpers for baking out your Django site as flat files
When Web traffic spikes and your site starts to sag, your first impulse might be to scale up: add more servers, shard the database and cache, cache, cache. Provided you have the skill, time and money, that will get the job done.
Lacking any of those three ingredients, the only guaranteed way to avoid a database crash is to not have a database. That sounds flippant, but it’s true. When faced with high traffic demands and little time or funding, the Data Desk does exactly that. We save every page generated by a database-backed site as a flat file and then host them all using a static file service like Amazon S3.
We call this process “baking.” It’s our path to cheaper, more stable hosting for simple sites. We use it for publishing election results, timelines, documents, interactive tables, special projects and even this blog.
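Stripped of the framework, "baking" is just rendering every page the database could serve and writing each one to disk. Here is a minimal, framework-free sketch of the idea, with a hypothetical render callable standing in for Django's template layer:

```python
import os


def bake(records, build_dir, render):
    """Render each record to HTML and write it out as a flat file."""
    paths = []
    for record in records:
        # Mirror the live URL structure: /tables/<slug>/ becomes
        # tables/<slug>/index.html inside the build directory.
        page_dir = os.path.join(build_dir, 'tables', record['slug'])
        os.makedirs(page_dir)
        path = os.path.join(page_dir, 'index.html')
        with open(path, 'w') as f:
            f.write(render(record))
        paths.append(path)
    return paths
```

Once the files exist, any static file host can serve them; the database never sees a visitor.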
The system comes with some major advantages, like:
- No database crashes
- Zero server configuration and upkeep
- No need to optimize your app code
- You don’t pay to host CPUs, only bandwidth
- An offline administration panel is more secure
- Less stress (This one can change your life)
There are drawbacks. For one, you have to build the bakery into your code base. More important, a flat site can only be so complex. No online database means your site is all read and no write, which means no user-generated content and no complex searches. Sites we host that could not be baked include Mapping L.A. and NHTSA Vehicle Complaints, each of which allows users to interact with a large and shifting dataset.
So what’s the trick?
To streamline the process, we developed an open-source Django library called django-bakery. It makes baking out your site easier by integrating the steps into Django’s standard project layout.
To try it out, the first thing you need to do is install the library from PyPI, like so:
$ pip install django-bakery
Then edit your settings.py and add bakery to INSTALLED_APPS:

INSTALLED_APPS = (
    ...
    'bakery',
    ...
)
Then add a BUILD_DIR setting with the directory path where the flattened site will be baked.

import os
ROOT_PATH = os.path.dirname(__file__)
BUILD_DIR = os.path.join(ROOT_PATH, 'build')
The crucial step is to refactor your views to inherit our class-based views. They are designed to automatically flatten themselves. Here is a list view and a detail view using our system.
from yourapp.models import DummyModel
from bakery.views import BuildableDetailView, BuildableListView


class DummyListView(BuildableListView):
    """
    A list of all tables.
    """
    queryset = DummyModel.live.all()


class DummyDetailView(BuildableDetailView):
    """
    All about one table.
    """
    queryset = DummyModel.live.all()
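One detail worth knowing: a buildable detail view has to decide where on disk each object's page belongs, and django-bakery derives that from the object's get_absolute_url. The helper below is a framework-free illustration of that URL-to-path mapping, not the library's actual code:

```python
import os


def url_to_build_path(build_dir, url):
    """Map a URL like /tables/my-slug/ to build/tables/my-slug/index.html,
    so a static file server resolves the same address the live site did."""
    relative = url.lstrip('/')
    if url.endswith('/'):
        # Directory-style URLs get an index.html the server picks up.
        relative = os.path.join(relative, 'index.html')
    return os.path.join(build_dir, relative)
```

This is why the flat site can keep the exact same URLs as the dynamic one.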
After you’ve converted your views, add them to a list in settings.py where all buildable views will be stored.

BAKERY_VIEWS = [
    'yourapp.views.DummyListView',
    'yourapp.views.DummyDetailView',
]
Then run the management command that will bake them out.
$ python manage.py build
That should create your build directory and flatten all the designated views into it. You can review its work by firing up the buildserver, which will locally host your flat files the same way Django’s runserver hosts your database-driven pages.
$ python manage.py buildserver
To publish the site on Amazon S3, all that remains is to create a bucket. You can go to aws.amazon.com/s3/ to set up an account. If you need some basic instructions you can find them here. Now set your bucket name in the settings.py file:
AWS_BUCKET_NAME = 'my-bucket'
Next, install s3cmd, a utility we’ll use to move files back and forth between your desktop and S3. In Ubuntu, that’s as simple as:
$ sudo apt-get install s3cmd
If you’re using Mac or Windows, you’ll need to download this file and follow the installation instructions you find there.
Once it’s installed, we need to configure s3cmd with your Amazon login credentials. Go to Amazon’s security credentials page and get your access key and secret access key. Then, from your terminal, run:
$ s3cmd --configure
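Running that walks you through a few prompts and saves your credentials to a ~/.s3cfg file, which looks roughly like this (placeholder keys, abridged to the relevant lines):

```ini
[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
```

Keep that file out of version control; it grants full access to your buckets.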
Finally, now that everything is set up, publishing your files to S3 is as simple as:
$ python manage.py publish
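Under the hood, publishing is a one-way sync: push up any local file whose contents differ from what the bucket already holds. The sketch below stubs the remote side as a dict of path-to-MD5 mappings; it illustrates the idea and is not django-bakery's or s3cmd's actual code:

```python
import hashlib
import os


def files_to_upload(build_dir, remote_hashes):
    """Walk the build directory and return the relative paths of files
    whose content differs from, or is absent in, the remote listing."""
    changed = []
    for root, _, names in os.walk(build_dir):
        for name in names:
            path = os.path.join(root, name)
            key = os.path.relpath(path, build_dir)
            with open(path, 'rb') as f:
                digest = hashlib.md5(f.read()).hexdigest()
            if remote_hashes.get(key) != digest:
                changed.append(key)
    return sorted(changed)
```

Because unchanged files are skipped, repeat publishes move only what your last edit touched.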
The next level
If your site publishes a large database, the build-and-publish routine can take a long time to run. Sometimes that’s acceptable, but if you’re periodically making small updates to the site it can be frustrating to wait for the entire database to rebuild every time there’s a minor edit.
We tackle this problem by hooking targeted build routines to our Django models. When an object is edited, the model is able to rebuild only those pages that object is connected to. We accomplish this with a build() method you can inherit. All that’s necessary is that you define a list of the detail views connected to an object.
from django.db import models
from bakery.models import BuildableModel


class DummyModel(BuildableModel):
    detail_views = ('yourapp.views.DummyDetailView',)
    title = models.CharField(max_length=100)
    description = models.TextField()
Now, when obj.build() is called, only that object’s detail pages will be rebuilt. If other pages ought to be updated as well, particularly if they come from views that don’t take the object as an input, you should include those in the pre-defined _build_related model method, which is called at the end of build.
from django.db import models
from bakery.models import BuildableModel


class DummyModel(BuildableModel):
    detail_views = ('yourapp.views.DummyDetailView',)
    title = models.CharField(max_length=100)
    description = models.TextField()

    def _build_related(self):
        """
        Rebuild the sitemap and RSS feed as part of the build routine.
        """
        import views
        views.SitemapView().build_queryset()
        views.DummyRSSFeed().build_queryset()
With this system in place, an update posted to the database by an entrant using the Django admin can set into motion a small build that is then synced with your live site on Amazon S3. We use that system to host applications with in-house Django administration panels that, for the entrant, walk and talk like a live database, but then automatically figure out how to serve themselves on the Web as flat files. That’s how a site like timelines.latimes.com is managed.
Finally, to speed the process a bit more, we hand off the build from the user’s save request in the admin to a job server that does the work in the background. This prevents a push-button save in the admin from having to wait for the entire build to complete before returning a response. Here is the save override on the Timeline model that assesses whether the publication status of an object has changed, and then passes off build instructions to a Celery job server.
@transaction.commit_manually
def save(self, *args, **kwargs):
    """
    A custom save that bakes the page and republishes it when necessary.
    """
    logger.debug("Saving %s" % self)
    # If obj.save(build=False) has been passed, we skip everything.
    if not kwargs.pop('build', True):
        super(Timeline, self).save(*args, **kwargs)
        transaction.commit()
    else:
        # First figure out what we're going to have to do after we save.
        # If the timeline has not yet been created...
        if not self.id:
            if self.is_published:
                action = 'publish'
            else:
                action = None
        else:
            current = Timeline.objects.get(id=self.id)
            # If it's been unpublished...
            if not self.is_published and current.is_published:
                action = 'unpublish'
            # If it's being published...
            elif self.is_published:
                action = 'publish'
            # If it's remaining unpublished...
            else:
                action = None
        super(Timeline, self).save(*args, **kwargs)
        transaction.commit()
        logger.debug("Post-save action: %s" % action)
        # Do whatever needs to be done
        if action:
            if action == 'publish':
                tasks.publish.delay(self)
            elif action == 'unpublish':
                tasks.unpublish.delay(self)
The tasks don’t have to be complicated. Ours are as simple as:
import sys
import logging
from django.conf import settings
from django.core import management
from celery.decorators import task

logger = logging.getLogger('timelines.tasks')


@task()
def publish(obj):
    """
    Bake all pages related to a timeline, and then sync with S3.
    """
    try:
        # Here the object is built
        obj.build()
        # And if the settings allow publication from this environment...
        if settings.PUBLISH:
            # ... the publish command is called to sync with S3.
            management.call_command("publish")
    except Exception:
        logger.error("Task Error: publish", exc_info=sys.exc_info(),
            extra={'status_code': 500, 'request': None})


@task()
def unpublish(obj):
    """
    Unbake all pages related to a timeline, and then sync to S3.
    """
    try:
        obj.unbuild()
        if settings.PUBLISH:
            management.call_command("publish")
    except Exception:
        logger.error("Task Error: unpublish", exc_info=sys.exc_info(),
            extra={'status_code': 500, 'request': None})
And that’s it. These tools have already proven useful for us, but they have only been sketched out. All of the code is free and open on GitHub and any contributions or advice are welcome.
Much respect due
This application was made in close collaboration with Ken Schwencke, my partner here at the Data Desk. Without his imagination, criticism and contributions, most of our work would be impossible, including this library.
Also, I presented the ideas behind django-bakery last month to a group of our peers at the 2012 conference of the National Institute for Computer-Assisted Reporting. The NICAR community is a constant source of challenge and inspiration. Many of our ideas, here and elsewhere, have been adapted from things the community has taught us.