Introducing django-bakery

    A set of helpers for baking out your Django site as flat files

    [Photo: Los Angeles Times photographic archive, UCLA Library. 1988: Confectionary carrots come off the line at Van de Kamp's bakery in Los Angeles and are placed on cakes by Martha Cofre, left, and fellow workers.]

    When Web traffic spikes and your site starts to sag, your first impulse might be to architect up: add more servers, shard the database and cache, cache, cache. Provided you have the skill, time and money, that will get the job done.

    Lacking any of those three ingredients, the only guaranteed way to avoid a database crash is to not have a database. That sounds flippant, but it’s true. When faced with high traffic demands and little time or funding, the Data Desk does exactly that. We save every page generated by a database-backed site as a flat file and then host them all using a static file service like Amazon S3.

    We call this process “baking.” It’s our path to cheaper, more stable hosting for simple sites. We use it for publishing election results, timelines, documents, interactive tables, special projects and even this blog.
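
    If you’ve never baked a site before, the underlying idea is simple enough to sketch by hand. The snippet below is a conceptual illustration only, not django-bakery’s actual internals: render a URL with Django’s test client and write the response out to disk.

    # Conceptual sketch of "baking" a single URL to a flat file.
    # This is an illustration, not how django-bakery is implemented.
    import os
    from django.test import Client

    def bake(url, build_dir):
        response = Client().get(url)
        path = os.path.join(build_dir, url.lstrip('/'), 'index.html')
        directory = os.path.dirname(path)
        if not os.path.exists(directory):
            os.makedirs(directory)
        with open(path, 'wb') as outfile:
            outfile.write(response.content)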

    The system comes with some major advantages, like:

    1. No database crashes
    2. Zero server configuration and upkeep
    3. No need to optimize your app code
    4. You don’t pay to host CPUs, only bandwidth
    5. An offline administration panel is more secure
    6. Less stress (This one can change your life)

    There are drawbacks. For one, you have to build the bakery into your code base. More important, a flat site can only be so complex. No online database means your site is all read and no write, which means no user-generated content and no complex searches. Sites we host that could not be baked include Mapping L.A. and NHTSA Vehicle Complaints, each of which allows users to interact with a large and shifting dataset.

    So what’s the trick?

    To streamline the process, we developed an open-source Django library called django-bakery. It makes baking out your site easier by integrating the steps into Django’s standard project layout.

    To try it out, the first thing you need to do is install the library from PyPI, like so:

    $ pip install django-bakery
    

    Then edit your settings.py and add bakery to INSTALLED_APPS.

    INSTALLED_APPS = (
        ...
        'bakery',
        ...
    )
    

    Then add a BUILD_DIR setting with the directory path where the flattened site will be baked.

    import os
    ROOT_PATH = os.path.dirname(__file__)
    BUILD_DIR = os.path.join(ROOT_PATH, 'build')
    

    The crucial step is to refactor your views to inherit from our class-based views. They are designed to automatically flatten themselves. Here is a list view and a detail view using our system.

    from yourapp.models import DummyModel
    from bakery.views import BuildableDetailView, BuildableListView
    
    class DummyListView(BuildableListView):
        """
        A list of all tables.
        """
        queryset = DummyModel.live.all()
    
    class DummyDetailView(BuildableDetailView):
        """
        All about one table.
        """
        queryset = DummyModel.live.all()
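
    One assumption worth spelling out here, based on how Django’s detail views typically resolve URLs: the detail view needs to know where each object lives so it can write its flat file to the right path, which generally means the model defines get_absolute_url (check the library’s docs for the exact hook). A minimal sketch, with a made-up URL scheme:

    # Hedged sketch: give the model a canonical URL so the detail view knows
    # where each object's flat file belongs. The '/dummy/<pk>/' scheme below
    # is invented for illustration.
    from django.db import models

    class DummyModel(models.Model):
        """
        A sketch of the model behind the views above, with a canonical URL.
        """
        title = models.CharField(max_length=100)
        description = models.TextField()

        def get_absolute_url(self):
            return '/dummy/%s/' % self.pk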
    

    After you’ve converted your views, add them to a list in settings.py where all buildable views will be stored.

    BAKERY_VIEWS = [
        'yourapp.views.DummyListView',
        'yourapp.views.DummyDetailView',
    ]
    

    Then run the management command that will bake them out.

    $ python manage.py build
    

    That should create your build directory and flatten all the designated views into it. You can review its work by firing up the buildserver, which will locally host your flat files in the same way Django’s runserver hosts your database-driven pages.

    $ python manage.py buildserver
    

    To publish the site on Amazon S3, all that’s left to do is create a bucket. You can go to aws.amazon.com/s3/ to set up an account. If you need some basic instructions, you can find them here. Now set your bucket name in the settings.py file:

    AWS_BUCKET_NAME = 'my-bucket'
    

    Next, install s3cmd, a utility we’ll use to move files back and forth between your desktop and S3. In Ubuntu, that’s as simple as:

    $ sudo apt-get install s3cmd
    

    If you’re using Mac or Windows, you’ll need to download the s3cmd package and follow the installation instructions that come with it.

    Once it’s installed, we need to configure s3cmd with your Amazon login credentials. Go to Amazon’s security credentials page and get your access key and secret access key. Then, from your terminal, run:

    $ s3cmd --configure
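
    The prompts will walk you through it. As an assumption about a typical setup rather than a required step, the end result is a ~/.s3cfg file whose relevant lines look something like this:

    [default]
    access_key = YOUR-ACCESS-KEY
    secret_key = YOUR-SECRET-ACCESS-KEY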
    

    Finally, now that everything is set up, publishing your files to S3 is as simple as:

    $ python manage.py publish
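
    If you’d rather push the files by hand, or just see what is moving across the wire, the command amounts to roughly the same thing as syncing the build directory with s3cmd yourself. This is a simplification of what publish actually does, but the idea is:

    $ s3cmd sync --acl-public --guess-mime-type build/ s3://my-bucket/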
    

    The next level

    If your site publishes a large database, the build-and-publish routine can take a long time to run. Sometimes that’s acceptable, but if you’re periodically making small updates to the site it can be frustrating to wait for the entire database to rebuild every time there’s a minor edit.

    We tackle this problem by hooking targeted build routines to our Django models. When an object is edited, the model is able to rebuild only those pages that object is connected to. We accomplish this with a build method you can inherit. All that’s necessary is that you define a list of the detail views connected to an object.

    from django.db import models
    from bakery.models import BuildableModel
    
    class DummyModel(BuildableModel):
        detail_views = ('yourapp.views.DummyDetailView',)
        title = models.CharField(max_length=100)
        description = models.TextField()
    

    Now, when obj.build() is called, only that object’s detail pages will be rebuilt. If other pages ought to be updated as well, particularly if they come from views that don’t take the object as an input, you should include those in the pre-defined _build_related model method called at the end of build.

    from django.db import models
    from bakery.models import BuildableModel
    
    class DummyModel(BuildableModel):
        detail_views = ('yourapp.views.DummyDetailView',)
        title = models.CharField(max_length=100)
        description = models.TextField()
    
        def _build_related(self):
            """
            Rebuild the sitemap and RSS feed as part of the build routine.
            """
            import views
            views.SitemapView().build_queryset()
            views.DummyRSSFeed().build_queryset()
    

    With this system in place, an update posted to the database by an entrant using the Django admin can set into motion a small build that is then synced with your live site on Amazon S3. We use that system to host applications with in-house Django administration panels that, for the entrant, walk and talk like a live database, but then automatically figure out how to serve themselves on the Web as flat files. That’s how a site like timelines.latimes.com is managed.

    Finally, to speed the process a bit more, we hand off the build from the user’s save request in the admin to a job server that does the work in the background. This prevents a push-button save in the admin from having to wait for the entire build to complete before returning a response. Here is the save override on the Timeline model that assesses whether the publication status of an object has changed, and then passes off build instructions to a Celery job server.

    @transaction.commit_manually
    def save(self, *args, **kwargs):
        """
        A custom save that bakes the page and republishes it when necessary.
        """
        logger.debug("Saving %s" % self)
        # if obj.save(build=False) has been passed, we skip everything.
        if not kwargs.pop('build', True):
            super(Timeline, self).save(*args, **kwargs)
            transaction.commit()
        else:
            # First figure out what we're going to have to do after we save.
            # If the timeline has not yet been created...
            if not self.id:
                if self.is_published:
                    action = 'publish'
                else:
                    action = None
            else:
                current = Timeline.objects.get(id=self.id)
                # If it's been unpublished...
                if not self.is_published and current.is_published:
                    action = 'unpublish'
                # If it's being published...
                elif self.is_published:
                    action = 'publish'
                # If it's remaining unpublished...
                else:
                    action = None
            super(Timeline, self).save(*args, **kwargs)
            transaction.commit()
            logger.debug("Post-save action: %s" % action)
            # Do whatever needs to be done
            if action:
                if action == 'publish':
                    tasks.publish.delay(self)
                elif action == 'unpublish':
                    tasks.unpublish.delay(self)
    

    The tasks don’t have to be complicated. Ours are as simple as:

    import sys
    import logging
    from django.conf import settings
    from celery.decorators import task
    from django.core import management
    logger = logging.getLogger('timelines.tasks')
    
    @task()
    def publish(obj):
        """
        Bake all pages related to a timeline, and then sync with S3.
        """
        try:
            # Here the object is built
            obj.build()
            # And if the settings allow publication from this environment...
            if settings.PUBLISH:
                # ... the publish command is called to sync with S3.
                management.call_command("publish")
        except Exception, exc:
            logger.error("Task Error: publish",
                exc_info=sys.exc_info(),
                extra={
                    'status_code': 500,
                    'request': None
                })
    
    @task()
    def unpublish(obj):
        """
        Unbake all pages related to a timeline, and then sync to S3.
        """
        try:
            obj.unbuild()
            if settings.PUBLISH:
                management.call_command("publish")
        except Exception, exc:
            logger.error("Task Error: unpublish",
                exc_info=sys.exc_info(),
                extra={
                    'status_code': 500,
                    'request': None
                })
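
    For the tasks to actually run, a Celery worker has to be listening. The post doesn’t cover that setup, so treat this as an assumption about a typical django-celery configuration of the time: with djcelery installed, a worker could be started with the management command below.

    $ python manage.py celeryd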
    

    And that’s it. These tools have already proven useful for us, but have only been sketched out. All of the code is free and open on GitHub and any contributions or advice are welcome.

    Much respect due

    This application was made in close collaboration with Ken Schwencke, my partner here at the Data Desk. Without his imagination, criticism and contributions, most of our work would be impossible, including this library.

    Also, I presented the ideas behind django-bakery last month to a group of our peers at the 2012 conference of the National Institute for Computer-Assisted Reporting. The NICAR community is a constant source of challenge and inspiration. Many of our ideas, here and elsewhere, have been adapted from things the community has taught us.
