Introducing django-bakery
A set of helpers for baking out your Django site as flat files
When Web traffic spikes and your site starts to sag, your first impulse might be to scale up: add more servers, shard the database and cache, cache, cache. Provided you have the skill, time and money, that will get the job done.
Lacking any of those three ingredients, the only guaranteed way to avoid a database crash is to not have a database. That sounds flippant, but it’s true. When faced with high traffic demands and little time or funding, the Data Desk does exactly that. We save every page generated by a database-backed site as a flat file and then host them all using a static file service like Amazon S3.
We call this process “baking.” It’s our path to cheaper, more stable hosting for simple sites. We use it for publishing election results, timelines, documents, interactive tables, special projects and even this blog.
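Stripped of the framework, "baking" is just rendering every page the database could serve and writing each one to disk. Here is a minimal, framework-free sketch of the idea, with a hypothetical render callable standing in for Django's template layer:

```python
import os


def bake(records, build_dir, render):
    """Render each record to HTML and write it out as a flat file."""
    paths = []
    for record in records:
        # Mirror the live URL structure: /tables/<slug>/ becomes
        # tables/<slug>/index.html inside the build directory.
        page_dir = os.path.join(build_dir, 'tables', record['slug'])
        os.makedirs(page_dir)
        path = os.path.join(page_dir, 'index.html')
        with open(path, 'w') as f:
            f.write(render(record))
        paths.append(path)
    return paths
```

Once the files exist, any static file host can serve them; the database never sees a visitor.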
The system comes with some major advantages, like:
- No database crashes
- Zero server configuration and upkeep
- No need to optimize your app code
- You don’t pay to host CPUs, only bandwidth
- An offline administration panel is more secure
- Less stress (This one can change your life)
There are drawbacks. For one, you have to build the bakery into your code base. More important, a flat site can only be so complex. No online database means your site is all read and no write, which means no user-generated content and no complex searches. Sites we host that could not be baked include Mapping L.A. and NHTSA Vehicle Complaints, each of which allows users to interact with a large and shifting dataset.
So what’s the trick?
To streamline the process, we developed an open-source Django library called django-bakery. It makes baking out your site easier by integrating the steps into Django’s standard project layout.
To try it out, the first thing you need to do is install the library from PyPI, like so:
$ pip install django-bakery
Then edit your settings.py and add bakery to INSTALLED_APPS:

INSTALLED_APPS = (
    ...
    'bakery',
    ...
)
Then add a BUILD_DIR setting with the directory path where the flattened site will be baked.

import os
ROOT_PATH = os.path.dirname(__file__)
BUILD_DIR = os.path.join(ROOT_PATH, 'build')
The crucial step is to refactor your views to inherit our class-based views. They are designed to automatically flatten themselves. Here is a list view and a detail view using our system.
from yourapp.models import DummyModel
from bakery.views import BuildableDetailView, BuildableListView


class DummyListView(BuildableListView):
    """
    A list of all tables.
    """
    queryset = DummyModel.live.all()


class DummyDetailView(BuildableDetailView):
    """
    All about one table.
    """
    queryset = DummyModel.live.all()
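One detail worth knowing: a buildable detail view has to decide where on disk each object's page belongs, and django-bakery derives that from the object's get_absolute_url. The helper below is a framework-free illustration of that URL-to-path mapping, not the library's actual code:

```python
import os


def url_to_build_path(build_dir, url):
    """Map a URL like /tables/my-slug/ to build/tables/my-slug/index.html,
    so a static file server resolves the same address the live site did."""
    relative = url.lstrip('/')
    if url.endswith('/'):
        # Directory-style URLs get an index.html the server picks up.
        relative = os.path.join(relative, 'index.html')
    return os.path.join(build_dir, relative)
```

This is why the flat site can keep the exact same URLs as the dynamic one.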
After you’ve converted your views, add them to a list in settings.py where all buildable views will be stored.

BAKERY_VIEWS = [
    'yourapp.views.DummyListView',
    'yourapp.views.DummyDetailView',
]
Then run the management command that will bake them out.
$ python manage.py build
That should create your build directory and flatten all the designated views into it. You can review its work by firing up the buildserver, which will locally host your flat files the same way Django’s runserver hosts your database-driven pages.
$ python manage.py buildserver
To publish the site on Amazon S3, all that remains is to create a bucket. You can go to aws.amazon.com/s3/ to set up an account. If you need some basic instructions you can find them here. Now set your bucket name in the settings.py file:
AWS_BUCKET_NAME = 'my-bucket'
Next, install s3cmd, a utility we’ll use to move files back and forth between your desktop and S3. In Ubuntu, that’s as simple as:
$ sudo apt-get install s3cmd
If you’re using Mac or Windows, you’ll need to download this file and follow the installation instructions you find there.
Once it’s installed, we need to configure s3cmd with your Amazon login credentials. Go to Amazon’s security credentials page and get your access key and secret access key. Then, from your terminal, run:
$ s3cmd --configure
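Running that walks you through a few prompts and saves your credentials to a ~/.s3cfg file, which looks roughly like this (placeholder keys, abridged to the relevant lines):

```ini
[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
```

Keep that file out of version control; it grants full access to your buckets.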
Finally, now that everything is set up, publishing your files to S3 is as simple as:
$ python manage.py publish
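Under the hood, publishing is a one-way sync: push up any local file whose contents differ from what the bucket already holds. The sketch below stubs the remote side as a dict of path-to-MD5 mappings; it illustrates the idea and is not django-bakery's or s3cmd's actual code:

```python
import hashlib
import os


def files_to_upload(build_dir, remote_hashes):
    """Walk the build directory and return the relative paths of files
    whose content differs from, or is absent in, the remote listing."""
    changed = []
    for root, _, names in os.walk(build_dir):
        for name in names:
            path = os.path.join(root, name)
            key = os.path.relpath(path, build_dir)
            with open(path, 'rb') as f:
                digest = hashlib.md5(f.read()).hexdigest()
            if remote_hashes.get(key) != digest:
                changed.append(key)
    return sorted(changed)
```

Because unchanged files are skipped, repeat publishes move only what your last edit touched.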
The next level
If your site publishes a large database, the build-and-publish routine can take a long time to run. Sometimes that’s acceptable, but if you’re periodically making small updates to the site it can be frustrating to wait for the entire database to rebuild every time there’s a minor edit.
We tackle this problem by hooking targeted build routines to our Django models. When an object is edited, the model is able to rebuild only those pages that object is connected to. We accomplish this with a build() method you can inherit. All that’s necessary is that you define a list of the detail views connected to an object.
from django.db import models
from bakery.models import BuildableModel


class DummyModel(BuildableModel):
    detail_views = ('yourapp.views.DummyDetailView',)
    title = models.CharField(max_length=100)
    description = models.TextField()
Now, when obj.build() is called, only that object’s detail pages will be rebuilt. If other pages ought to be updated as well, particularly if they come from views that don’t take the object as an input, you should include those in the pre-defined _build_related model method, which is called at the end of build.
from django.db import models
from bakery.models import BuildableModel


class DummyModel(BuildableModel):
    detail_views = ('yourapp.views.DummyDetailView',)
    title = models.CharField(max_length=100)
    description = models.TextField()

    def _build_related(self):
        """
        Rebuild the sitemap and RSS feed as part of the build routine.
        """
        import views
        views.SitemapView().build_queryset()
        views.DummyRSSFeed().build_queryset()
With this system in place, an update posted to the database by an entrant using the Django admin can set into motion a small build that is then synced with your live site on Amazon S3. We use that system to host applications with in-house Django administration panels that, for the entrant, walk and talk like a live database, but then automatically figure out how to serve themselves on the Web as flat files. That’s how a site like timelines.latimes.com is managed.
Finally, to speed the process a bit more, we hand off the build from the user’s save request in the admin to a job server that does the work in the background. This prevents a push-button save in the admin from having to wait for the entire build to complete before returning a response. Here is the save override on the Timeline model that assesses whether the publication status of an object has changed, and then passes off build instructions to a Celery job server.
@transaction.commit_manually
def save(self, *args, **kwargs):
    """
    A custom save that bakes the page and republishes it when necessary.
    """
    logger.debug("Saving %s" % self)
    # If obj.save(build=False) has been passed, we skip everything.
    if not kwargs.pop('build', True):
        super(Timeline, self).save(*args, **kwargs)
        transaction.commit()
    else:
        # First figure out what we're going to have to do after we save.
        # If the timeline has not yet been created...
        if not self.id:
            if self.is_published:
                action = 'publish'
            else:
                action = None
        else:
            current = Timeline.objects.get(id=self.id)
            # If it's been unpublished...
            if not self.is_published and current.is_published:
                action = 'unpublish'
            # If it's being published...
            elif self.is_published:
                action = 'publish'
            # If it's remaining unpublished...
            else:
                action = None
        super(Timeline, self).save(*args, **kwargs)
        transaction.commit()
        logger.debug("Post-save action: %s" % action)
        # Do whatever needs to be done
        if action:
            if action == 'publish':
                tasks.publish.delay(self)
            elif action == 'unpublish':
                tasks.unpublish.delay(self)
The tasks don’t have to be complicated. Ours are as simple as:
import sys
import logging
from django.conf import settings
from django.core import management
from celery.decorators import task

logger = logging.getLogger('timelines.tasks')


@task()
def publish(obj):
    """
    Bake all pages related to a timeline, and then sync with S3.
    """
    try:
        # Here the object is built
        obj.build()
        # And if the settings allow publication from this environment...
        if settings.PUBLISH:
            # ... the publish command is called to sync with S3.
            management.call_command("publish")
    except Exception:
        logger.error("Task Error: publish", exc_info=sys.exc_info(),
            extra={'status_code': 500, 'request': None})


@task()
def unpublish(obj):
    """
    Unbake all pages related to a timeline, and then sync to S3.
    """
    try:
        obj.unbuild()
        if settings.PUBLISH:
            management.call_command("publish")
    except Exception:
        logger.error("Task Error: unpublish", exc_info=sys.exc_info(),
            extra={'status_code': 500, 'request': None})
And that’s it. These tools have already proven useful for us, but they have only been sketched out. All of the code is free and open on GitHub and any contributions or advice are welcome.
Much respect due
This application was made in close collaboration with Ken Schwencke, my partner here at the Data Desk. Without his imagination, criticism and contributions, most of our work would be impossible, including this library.
Also, I presented the ideas behind django-bakery last month to a group of our peers at the 2012 conference of the National Institute for Computer-Assisted Reporting. The NICAR community is a constant source of challenge and inspiration. Many of our ideas, here and elsewhere, have been adapted from things the community has taught us.