Introducing django-softhyphen

    New open-source Python library automates hyphenation of text, allows easy formatting of HTML in more bookish style

    [Image: Justified. An example of automatically hyphenated text in a timeline about Frank McCourt's time as owner of the Los Angeles Dodgers. Photo: Robert Gauthier, Los Angeles Times]

    Today we an­nounce the re­lease of django-softhy­phen, a free and open-source Py­thon lib­rary for auto­mat­ic­ally hy­phen­at­ing HTML text. It al­lows on­line pub­lish­ers to more smoothly align copy to both the left and right mar­gins, a prac­tice com­mon in print but rarely seen on­line.

    By de­fault, Eng­lish-lan­guage HTML is aligned to the left mar­gin, but “ragged” on the right. Let’s look at an ex­ample.

    The text below is drawn from an April 21, 1946, Los Angeles Times article by Leslie Lieber titled “Gypsy Genius.” It profiles guitarist Django Reinhardt, who today is also the namesake of the Data Desk’s preferred development framework.

    Django Reinhardt is a temperamental Gypsy whose deficiencies, which include illiteracy and two paralyzed fingers on his left hand, have not prevented him from becoming the most sophisticated hot-guitar player in the world.

    Look at the right-hand side of the para­graph and you will see what I mean by “ragged.” That’s the style you’ll find all over the Web, in­clud­ing news sites like latimes.com.

    Books, magazines and newspapers, on the other hand, tend to snap to the right margin as well as the left. That style is called “justified” alignment. It can be achieved in HTML by setting a paragraph’s CSS text-align property to justify, as I have done here:

    Django Reinhardt is a temperamental Gypsy whose deficiencies, which include illiteracy and two paralyzed fingers on his left hand, have not prevented him from becoming the most sophisticated hot-guitar player in the world.

    This will look OK much of the time, but in some cases the space between words gets stretched out. This comes up more of­ten as the column’s width gets nar­row­er. Here, for in­stance:

    Django Reinhardt is a temperamental Gypsy whose deficiencies, which include illiteracy and two paralyzed fingers on his left hand, have not prevented him from becoming the most sophisticated hot-guitar player in the world.

    django-softhyphen helps address this problem by automatically hyphenating words that run up against the right margin. It scans the text and inserts the HTML entity &shy; between syllables inside each word. The &shy; entity, also known as a “soft hyphen,” is an instruction for the browser to insert a hyphen and line break when necessary. See it in action here:

    Django Re­in­hardt is a tem­pera­ment­al Gypsy whose de­fi­cien­cies, which in­clude il­lit­er­acy and two para­lyzed fin­gers on his left hand, have not pre­ven­ted him from be­com­ing the most soph­ist­ic­ated hot-gui­tar play­er in the world.
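
    The scanning step can be illustrated with a toy sketch. This is not the library's implementation: the TOY_BREAKS lookup table and both function names are hypothetical, and a real hyphenator derives break points from a language dictionary rather than a hand-written list. The sketch only shows the idea of inserting &shy; inside words while leaving HTML tags untouched.

```python
import re

# Hypothetical, hand-written break points. A real hyphenator would derive
# these from a language dictionary rather than a lookup table.
TOY_BREAKS = {
    "hyphenation": ["hy", "phen", "a", "tion"],
    "guitar": ["gui", "tar"],
}

SOFT_HYPHEN = "&shy;"


def toy_hyphenate_word(word):
    """Join a known word's syllables with soft hyphens; return others as-is."""
    parts = TOY_BREAKS.get(word.lower())
    if parts is None:
        return word
    # Slice the original word so its capitalization is preserved.
    pieces, start = [], 0
    for part in parts:
        pieces.append(word[start:start + len(part)])
        start += len(part)
    return SOFT_HYPHEN.join(pieces)


def toy_hyphenate_html(html):
    """Hyphenate words in text runs, leaving anything inside <...> alone."""
    # The capture group makes re.split keep the tags in the result list.
    chunks = re.split(r"(<[^>]+>)", html)
    out = []
    for chunk in chunks:
        if chunk.startswith("<"):
            out.append(chunk)  # a tag: copy through unchanged
        else:
            out.append(
                re.sub(r"[A-Za-z]+",
                       lambda m: toy_hyphenate_word(m.group(0)),
                       chunk)
            )
    return "".join(out)
```

    Skipping the tags matters: a soft hyphen dropped into a class name or URL would change the markup rather than the way the text wraps.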

    Although &shy; is supported by most browsers, there are reasons why soft hyphens are a rare sight. Of course, it is inconvenient to go back through everything you write and instruct the computer where you want it to hyphenate, but that’s not all.

    For one, hy­phen­at­ing text doesn’t guar­an­tee every line will look per­fect. Print de­sign­ers ex­er­cise more fine-grained con­trol than a simple hy­phen­a­tion hint, as seen in this demon­stra­tion of the cap­ab­il­it­ies of Adobe In­Des­ign pub­lish­ing soft­ware. The next gen­er­a­tion of today’s HTML styl­ing rules, known as CSS3, prom­ises to move the Web closer to that level of con­trol, but it will be some time be­fore the bulk of In­ter­net surfers use a browser that sup­ports it.

    Furthermore, the &shy; entity itself has been criticized, as long ago as 1997, as being fundamentally flawed. One small but frustrating example is that if you copy and paste hyphenated HTML into other programs, like a text editor, the pasted text will sometimes appear with all the hyphenation hints spelled out, breaking up longer words into syllables.
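
    If pasted text does arrive littered with hints, they can be stripped back out. The helper below is a minimal sketch and not part of django-softhyphen; the function name is hypothetical. It assumes the hints appear either as the literal soft hyphen character (U+00AD) or as the &shy; entity and its numeric forms.

```python
import re

SOFT_HYPHEN_CHAR = "\u00ad"  # the character that &shy; decodes to

# Matches &shy; plus the decimal (&#173;) and hexadecimal (&#xAD;) forms.
SOFT_HYPHEN_ENTITY = re.compile(r"&(?:shy|#0*173|#x0*ad);", re.IGNORECASE)


def strip_soft_hyphens(text):
    """Remove soft hyphens in both character and entity form."""
    text = text.replace(SOFT_HYPHEN_CHAR, "")
    return SOFT_HYPHEN_ENTITY.sub("", text)
```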

    All that said, we hope that django-softhyphen proves useful to those who want to use &shy;, or at least brings a small measure of attention to the issue of hyphenation, an important feature of typography that has been handled with grace in print but has yet to be solved on the Web.

    Get­ting star­ted

    The code is on­line and can be in­stalled from PyPI’s pack­age re­pos­it­ory us­ing pip or easy_install.

    $ pip install django-softhyphen
    # Or, if you prefer...
    $ easy_install django-softhyphen
    

    If you’d prefer to work with the source code, the trunk is avail­able on Git­Hub. It can be forked there or cloned in the usu­al man­ner.

    $ git clone https://github.com/datadesk/django-softhyphen.git
    $ cd django-softhyphen
    $ python setup.py install
    

    The code is pack­aged as a plug­gable Django ap­plic­a­tion, but you can im­port it and use it from your shell like any oth­er Py­thon mod­ule. Hy­phen­at­ing HTML text is as simple as jump­ing in­to your Py­thon shell and run­ning the fol­low­ing code.

    >>> from softhyphen.html import hyphenate
    >>> hyphenate("<h1>I love hyphenation</h1>")
    u'<h1>I love hy&shy;phen&shy;a&shy;tion</h1>'
    

    And if you want to hy­phen­ate text from a dif­fer­ent lan­guage, simply provide the lan­guage’s code as an ex­tra keyword ar­gu­ment. By de­fault the meth­od hy­phen­ates us­ing an Amer­ic­an Eng­lish dic­tion­ary. Here’s my at­tempt at Span­ish:

    >>> hyphenate("<h1>Me encanta guiones</h1>", language="es-es")
    u'<h1>Me en&shy;can&shy;ta gu&shy;io&shy;nes</h1>'
    

    What not to do

    An ob­vi­ous ap­plic­a­tion of the lib­rary is in a blog, where it could hy­phen­ate the body of a post. To in­teg­rate django-softhy­phen in your Django pro­ject, you must add it to the INSTALLED_APPS in your settings.py file after fol­low­ing the in­stall­a­tion in­struc­tions above.

    INSTALLED_APPS = (
        ...
        'softhyphen',
        ...
    )
    

    We’ve created a custom Django template filter that allows you to quickly process a text block in your template.

    {% load softhyphen_tags %}
    <div class="body">
        {{ post.body|softhyphen }}
    </div>
    

    But that might not be a good idea. If your blog, like many Django projects, is served out of the database dynamically, then the hyphenation process would have to run every time the template is rendered.

    A solu­tion is offered by James Ben­nett in his ex­cel­lent book Prac­tic­al Django Pro­jects. He sug­gests adding an ex­tra field to your mod­el that stores pre-rendered HTML and is up­dated when the post is saved. For him, it was a way to more ef­fi­ciently pro­cess Mark­down and Pyg­ments format­ting, but it can work just as well in our case.

    Here’s a simplified example adapted from Bennett’s open-source blog application, Coltrane.

    from django.db import models
    from softhyphen.html import hyphenate
    
    class BlogPost(models.Model):
        pub_date = models.DateTimeField()
        title = models.CharField(max_length=250)
        body = models.TextField()
        # Pre-rendered, hyphenated copy of body; filled in by save()
        body_html = models.TextField(editable=False, blank=True)
    
        def __unicode__(self):
            return self.title
    
        def save(self, *args, **kwargs):
            # Hyphenate once here rather than on every template render
            self.body_html = hyphenate(self.body)
            # Pass arguments through so callers can still use save() options
            super(BlogPost, self).save(*args, **kwargs)
    

    No­tice the over­ride of the mod­el’s save meth­od, and how it fills the body_html field just be­fore a re­cord is saved to the data­base. If you use this fix, you can change your tem­plate to call the rendered field and re­duce the num­ber of times you need to hy­phen­ate text.

    <div class="body">
        {{ post.body_html|safe }}
    </div>
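
    To make the trade-off concrete, here is a framework-free sketch that counts how often hyphenation runs under each approach. Everything in it is a stand-in: counting_hyphenate only counts its calls and does not hyphenate anything, and Post is a plain Python class, not a Django model.

```python
calls = {"count": 0}


def counting_hyphenate(html):
    """Stand-in for a real hyphenation function; it only counts its calls."""
    calls["count"] += 1
    return html  # a real hyphenator would return the text with &shy; added


class Post(object):
    """Plain-Python stand-in for a model with a pre-rendered field."""

    def __init__(self, body):
        self.body = body
        self.body_html = ""

    def save(self):
        # Hyphenate once, at save time, and keep the result.
        self.body_html = counting_hyphenate(self.body)


def render_with_filter(post):
    # Mimics hyphenating inside the template on every request.
    return counting_hyphenate(post.body)


def render_prerendered(post):
    # Mimics reading the stored field: no hyphenation at request time.
    return post.body_html


post = Post("<p>I love hyphenation</p>")
post.save()
for _ in range(3):
    render_prerendered(post)
saved_calls = calls["count"]  # hyphenation calls after one save, three views

for _ in range(3):
    render_with_filter(post)
filter_calls = calls["count"] - saved_calls  # calls caused by three views
```

    With the pre-rendered field, the work scales with the number of saves. With the filter, it scales with the number of page views, which on most blogs is far larger.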
    

    Much re­spect due

    Our code­base is ad­ap­ted from the work of Filipe For­tes, who gen­er­ously open-sourced the guts of www.softhy­phen.com. His code does most of the work, passing the provided text through a dic­tion­ary provided by Open­Of­fice and de­tect­ing where hy­phens ought to be placed. We have re­fact­ored it as a plug­gable Django ap­plic­a­tion, ad­ded a Django tem­plate tag, writ­ten a num­ber of unit tests and pack­aged the code for dis­tri­bu­tion on PyPI. And there’s more. Filipe’s work is it­self built on top of code pub­lished by Wil­bert Ber­end­sen, who also de­serves our thanks.

      About The Data Desk

      This page was created by the Data Desk, a team of reporters and Web developers in downtown L.A.