Introducing django-softhyphen
New open-source Python library automates hyphenation of text, allows easy formatting of HTML in more bookish style
Today we announce the release of django-softhyphen, a free and open-source Python library for automatically hyphenating HTML text. It allows online publishers to more smoothly align copy to both the left and right margins, a practice common in print but rarely seen online.
By default, English-language HTML is aligned to the left margin, but “ragged” on the right. Let’s look at an example.
The text below is drawn from an April 21, 1946 Los Angeles Times article by Leslie Lieber titled “Gypsy Genius.” It profiles guitarist Django Reinhardt, who today is also the namesake of the Data Desk’s preferred development framework.
Django Reinhardt is a temperamental Gypsy whose deficiencies, which include illiteracy and two paralyzed fingers on his left hand, have not prevented him from becoming the most sophisticated hot-guitar player in the world.
Look at the right-hand side of the paragraph and you will see what I mean by “ragged.” That’s the style you’ll find all over the Web, including news sites like latimes.com.
Books, magazines and newspapers, on the other hand, tend to snap to the right margin as well as the left. That style is called “justified” alignment. It can be achieved in HTML by setting a paragraph’s text-align CSS property to justify, as I have done here:
Django Reinhardt is a temperamental Gypsy whose deficiencies, which include illiteracy and two paralyzed fingers on his left hand, have not prevented him from becoming the most sophisticated hot-guitar player in the world.
This will look OK much of the time, but in some cases the space between words gets stretched out. This comes up more often as the column’s width gets narrower. Here, for instance:
Django Reinhardt is a temperamental Gypsy whose deficiencies, which include illiteracy and two paralyzed fingers on his left hand, have not prevented him from becoming the most sophisticated hot-guitar player in the world.
django-softhyphen helps address this problem by automatically hyphenating words that run up against the right margin. It scans the text and inserts the HTML entity &shy; between syllables inside each word. The &shy; entity, also known as a “soft hyphen,” is an instruction for the browser to insert a hyphen and line break when necessary. See it in action here:
Django Reinhardt is a temperamental Gypsy whose deficiencies, which include illiteracy and two paralyzed fingers on his left hand, have not prevented him from becoming the most sophisticated hot-guitar player in the world.
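The mechanics of inserting soft hyphens can be sketched in a few lines of Python. This is only an illustration: the BREAKS table and the soften and soften_text helpers are hypothetical names invented here, and the break positions are written by hand, whereas the library looks them up in a hyphenation dictionary.

```python
# Hypothetical, hand-written break positions for illustration only.
# The real library finds these with a hyphenation dictionary.
BREAKS = {
    "hyphenation": [2, 6, 7],         # hy-phen-a-tion
    "temperamental": [3, 6, 7, 10],   # tem-per-a-men-tal
}

SOFT_HYPHEN = "&shy;"


def soften(word, breaks=BREAKS):
    """Return word with &shy; inserted at each known break position."""
    positions = breaks.get(word.lower())
    if not positions:
        # Unknown words pass through unchanged.
        return word
    pieces = []
    start = 0
    for position in positions:
        pieces.append(word[start:position])
        start = position
    pieces.append(word[start:])
    return SOFT_HYPHEN.join(pieces)


def soften_text(text):
    """Soften every space-separated word in a plain-text string."""
    return " ".join(soften(word) for word in text.split(" "))
```

The browser ignores each &shy; unless the word lands at the end of a line, in which case it breaks there and shows a visible hyphen.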
Although &shy; is supported by most browsers, there are reasons why soft hyphens are a rare sight. Of course, it is inconvenient to go back through everything you write and instruct the computer where you want it to hy&shy;phe&shy;nate, but that’s not all.
For one, hyphenating text doesn’t guarantee every line will look perfect. Print designers exercise more fine-grained control than a simple hyphenation hint, as seen in this demonstration of the capabilities of Adobe InDesign publishing software. The next generation of today’s HTML styling rules, known as CSS3, promises to move the Web closer to that level of control, but it will be some time before the bulk of Internet surfers use a browser that supports it.
Furthermore, the &shy; entity itself has been criticized, as long ago as 1997, as being fundamentally flawed. One small but frustrating example is that if you copy and paste hyphenated HTML into other programs, like a text editor, the pasted text will sometimes appear with all the hyphenation hints spelled out, breaking up longer words into syllables.
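If pasted text does arrive with its hints showing, they can be cleaned out after the fact. The sketch below is a workaround written for this post, not part of django-softhyphen, and strip_soft_hyphens is a hypothetical helper name. It assumes the hints appear either as the named entity, as the decimal character reference, or as the decoded U+00AD character.

```python
import re

# Matches the named entity, the decimal reference and the raw U+00AD character.
SOFT_HYPHEN_RE = re.compile(u"&shy;|&#173;|\u00ad")


def strip_soft_hyphens(text):
    """Remove soft-hyphen hints so words read normally when pasted elsewhere."""
    return SOFT_HYPHEN_RE.sub(u"", text)
```

Running the hyphenated headline from later in this post through the helper would give back plain "I love hyphenation".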
All that said, we hope that django-softhyphen proves useful to those who want to use &shy;, or at least brings a small measure of attention to the issue of hyphenation, an important feature of typography that has been handled with grace in print but has yet to be solved on the Web.
Getting started
The code is online and can be installed from PyPI’s package repository using pip or easy_install.
    $ pip install django-softhyphen
    # Or, if you prefer...
    $ easy_install django-softhyphen
If you’d prefer to work with the source code, the trunk is available on GitHub. It can be forked there or cloned in the usual manner.
    $ git clone https://github.com/datadesk/django-softhyphen.git
    $ cd django-softhyphen
    $ python setup.py install
The code is packaged as a pluggable Django application, but you can import it and use it from your shell like any other Python module. Hyphenating HTML text is as simple as jumping into your Python shell and running the following code.
    >>> from softhyphen.html import hyphenate
    >>> hyphenate("<h1>I love hyphenation</h1>")
    u'<h1>I love hy&shy;phen&shy;a&shy;tion</h1>'
And if you want to hyphenate text from a different language, simply provide the language’s code as an extra keyword argument. By default the function hyphenates using an American English dictionary. Here’s my attempt at Spanish:
    >>> hyphenate("<h1>Me encanta guiones</h1>", language="es-es")
    u'<h1>Me en&shy;can&shy;ta gu&shy;io&shy;nes</h1>'
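Because hyphenate takes whole HTML fragments, the tags themselves have to come through untouched while the words between them change. One simple way to get that behavior is sketched below. It is an assumption about the general approach, not a description of how django-softhyphen is implemented, and transform_text_nodes and shout are hypothetical names. The regular expression is deliberately naive: it would misfire on a ">" inside an attribute value and does not treat scripts, styles or comments specially.

```python
import re

# Splits on anything that looks like a tag; the capturing group keeps the tags.
TAG_RE = re.compile(r"(<[^>]+>)")


def transform_text_nodes(html, transform):
    """Apply transform to the text between tags and leave the tags untouched."""
    parts = TAG_RE.split(html)
    # re.split with a capturing group puts the matched tags at odd indexes.
    return "".join(
        part if index % 2 else transform(part)
        for index, part in enumerate(parts)
    )


def shout(text):
    # Hypothetical stand-in for a real hyphenation function.
    return text.upper()


result = transform_text_nodes('<h1 class="title">I love hyphenation</h1>', shout)
```

Here the class attribute keeps its lowercase value while the headline text is rewritten, which is the same separation a hyphenator needs so that it never inserts a soft hyphen into a tag name or attribute.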
What not to do
An obvious application of the library is in a blog, where it could hyphenate the body of a post. To integrate django-softhyphen in your Django project, you must add it to the INSTALLED_APPS in your settings.py file after following the installation instructions above.
    INSTALLED_APPS = (
        ...
        'softhyphen',
        ...
    )
We’ve created a custom Django template filter that allows you to quickly process a text block in your template.
    {% load softhyphen_tags %}
    <div class="body">
        {{ post.body|softhyphen }}
    </div>
But that might not be a good idea. If your blog, like many Django projects, is served out of the database dynamically, then the hyphenation process would have to run every time the template is rendered.
A solution is offered by James Bennett in his excellent book Practical Django Projects. He suggests adding an extra field to your model that stores pre-rendered HTML and is updated when the post is saved. For him, it was a way to more efficiently process Markdown and Pygments formatting, but it can work just as well in our case.
Here’s a simplified example adapted from Bennett’s open-source blog application, Coltrane.
    from django.db import models
    from softhyphen.html import hyphenate


    class BlogPost(models.Model):
        pub_date = models.DateTimeField()
        title = models.CharField(max_length=250)
        body = models.TextField()
        # Pre-rendered, hyphenated copy of body; filled in by save()
        body_html = models.TextField(editable=False, blank=True)

        def __unicode__(self):
            return self.title

        def save(self, *args, **kwargs):
            # Hyphenate once at write time instead of on every page view
            self.body_html = hyphenate(self.body)
            super(BlogPost, self).save(*args, **kwargs)
Notice the override of the model’s save method, and how it fills the body_html field just before a record is saved to the database. If you use this fix, you can change your template to call the rendered field and reduce the number of times you need to hyphenate text.
    <div class="body">
        {{ post.body_html|safe }}
    </div>
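The savings are easy to see outside Django. The sketch below is a toy model of the same idea, with a hypothetical Post class and a fake_hyphenate stand-in that counts how often it is called. Neither is part of django-softhyphen, and the stand-in only knows one word.

```python
# Counts how many times the stand-in hyphenator runs.
calls = {"count": 0}


def fake_hyphenate(html):
    """Hypothetical stand-in for softhyphen.html.hyphenate."""
    calls["count"] += 1
    return html.replace("hyphenation", "hy&shy;phen&shy;a&shy;tion")


class Post(object):
    """Toy record that stores a pre-rendered copy of its body."""

    def __init__(self, body):
        self.body = body
        self.body_html = ""

    def save(self):
        # Hyphenate once, at write time.
        self.body_html = fake_hyphenate(self.body)


post = Post("<p>I love hyphenation</p>")
post.save()

# Simulate 1,000 page views that read the stored field.
pages = [post.body_html for _ in range(1000)]
```

After a thousand simulated page views the counter still reads one, because the work happened when the post was saved. With the filter applied in the template, it would instead run once per view.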
Much respect due
Our codebase is adapted from the work of Filipe Fortes, who generously open-sourced the guts of www.softhyphen.com. His code does most of the work, passing the provided text through a dictionary provided by OpenOffice and detecting where hyphens ought to be placed. We have refactored it as a pluggable Django application, added a Django template filter, written a number of unit tests and packaged the code for distribution on PyPI. And there’s more. Filipe’s work is itself built on top of code published by Wilbert Berendsen, who also deserves our thanks.