Natural Language Processing in the kitchen

    Los Angeles Times
    Database: The Times California Cookbook website.

    Natural Language Processing is a field that covers computer understanding and manipulation of human language, and it’s ripe with possibilities for newsgathering. You usually hear about it in the context of analyzing large pools of legislation or other document sets, attempting to discover patterns or root out corruption. I decided to take it into the kitchen for my latest project: The Times California Cookbook recipe database.

    The first phase of the project, the holiday edition, launched with more than 600 holiday-themed recipes from The Times Test Kitchen. It’s a large number, but there’s much more to come next year – we have close to another 5,000 recipes staged and nearly ready to go.

    With only four months between the concept stage of the site and launch, the Data Desk had a tight time frame and limited resources to complete two parallel tasks: build the website and prepare the recipes for publication. The biggest challenge was preparing the recipes, which were stored in The Times library archive as, essentially, unstructured plain text. Parsing thousands of records by hand was unmanageable, so we needed a programmatic solution to get us most of the way there.

    We had a pile of a couple thousand records – news stories, columns and more – and each record contained one or more recipes. We needed to do the following:

    1. Separate the recipes from the rest of the story, while keeping the story intact for display alongside the recipe later.
    2. Determine how many recipes there were – more than one in many cases, and counts up to a dozen weren’t particularly unusual.
    3. For each recipe, find the name, ingredients, steps, prep time, servings, nutrition and more.
    4. Load these into a database, preserving the relationships between the recipes that ran together in the newspaper.
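    Put another way, each free-text record needed to end up as structured data. A minimal sketch of the target shape (the field names here are illustrative, not our actual schema):

```python
# A rough sketch of the structured form each record needed to end up in.
# Field names are illustrative -- the real database schema had more detail.
record = {
    'story_text': 'The full news story, kept intact for display...',
    'recipes': [
        {
            'name': 'Mini ricotta latkes with sour cherry sauce',
            'ingredients': ['3 eggs', '1/4 cup sugar'],
            'steps': ['Heat the oven to 375 degrees.'],
            'prep_time': '30 minutes',
            'servings': '4',
            'nutrition': 'Each serving: 250 calories...',
        },
    ],
}

# The relationship between recipes that ran together in the paper is
# preserved simply by keeping them inside the same record.
len(record['recipes'])  # 1
```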

    Where to start?

    The well-worn path here at the Data Desk would be to write a parser that looks for common patterns in formatting and punctuation. You can break up the text line by line, then look for one or more regular expression matches on each line. It might go something like this:

    import re
    # Define our patterns for a step and an ingredient.
    # A step could have a number in front
    # followed by a period like "1." or an "*"
    step_pattern = re.compile(r'^(?:[0-9]{1,2}\.\s|\*)')
    # An ingredient could have a fraction in front like "1/2" or "1 1/4"
    ingredient_pattern = re.compile(r'^(?:[0-9]{1,3}\s|[0-9,/]{1,4}\s|[0-9]\s[0-9][/][0-9])')
    def tag(text):
        """
        Attempt to classify a line of text as a "step" or "ingredient"
        based on the formatting or leading text.
        """
        if step_pattern.match(text):
            return 'step'
        if ingredient_pattern.match(text):
            return 'ingredient'
        return None
    # Test it out
    tag('3 eggs')
    >>> 'ingredient'
    tag('1. Heat the oven to 375 degrees.')
    >>> 'step'

    Then you can make an attempt to tag each line of the story with a recipe field – description, name, ingredient, step, nutrition, etc. – and write another script to assemble those parts into recipes that can be loaded into a database.
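    The assembly step can be sketched in plain Python. Assuming each line has already been tagged (the tag names and function below are illustrative stand-ins, not our production script), grouping tagged lines back into a recipe might look like:

```python
def assemble(tagged_lines):
    """
    Group a list of (tag, text) pairs into a recipe dict.
    A simplified sketch -- the real script also had to split
    records containing several recipes, handle untagged lines
    and so on.
    """
    recipe = {'ingredients': [], 'steps': [], 'other': []}
    for tag, text in tagged_lines:
        if tag == 'ingredient':
            recipe['ingredients'].append(text)
        elif tag == 'step':
            recipe['steps'].append(text)
        else:
            recipe['other'].append(text)
    return recipe

recipe = assemble([
    ('ingredient', '3 eggs'),
    ('ingredient', '1/4 cup sugar'),
    ('step', 'Heat the oven to 375 degrees.'),
])
# recipe['ingredients'] -> ['3 eggs', '1/4 cup sugar']
```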

    After looking at a few records, it was immediately evident we wouldn’t be able to use pure regular expressions to parse them. We had decided to try to grab all of the recipes The Times had published from the year 2000 to present, and there were enormous differences in the formatting and structure over the years. We needed natural language processing and machine learning to parse it.

    Enter NLTK

    Natural language processing is a big field, and you can do a lot with it – the vast majority of which I will not cover here. Python, my programming language of choice, has an excellent library for natural language processing and machine learning called Natural Language Toolkit, or NLTK, which I primarily used for this process. At left is an example of what the raw recipes looked like coming out of our library archive.

    One of the more common uses of NLTK is tagging text. You could, for example, have it tag a news story with topics or analyze an email to see if it’s spam. The very basic approach is to tokenize the text into words, then pass those words to a classifier that you’ve trained with a set of already-tagged examples. The classifier then returns the best-fitting tag for the text.

    For recipes, we already have well-defined fields we need to extract. There will be ingredients, steps, nutrition, servings, prep time and possibly a couple more. We just need to train a classifier to tell the difference by passing it some examples we’ve done manually. After a bit of research and testing, I chose to go with a Maximum Entropy classifier because it seemed to fit the project best and was very accurate.

    A basic approach might look something like this:

    import nltk
    import pickle
    from nltk.classify import MaxentClassifier
    # Set up our training material in a nice dictionary.
    training = {
        'ingredients': [
            'Pastry for 9-inch tart pan',
            'Apple cider vinegar',
            '3 eggs',
            '1/4 cup sugar',
        ],
        'steps': [
            'Sift the powdered sugar and cocoa powder together.',
            'Coarsely crush the peppercorns using a mortar and pestle.',
            'While the vegetables are cooking, scrub the pig ears clean and cut away any knobby bits of cartilage so they will lie flat.',
            'Heat the oven to 375 degrees.',
        ],
    }
    # Set up a list that will contain all of our tagged examples,
    # which we will pass into the classifier at the end.
    training_set = []
    for key, val in training.items():
        for example in val:
            # Set up a list we can use for all of our features,
            # which are just individual words in this case.
            feats = []
            # Before we can tokenize words, we need to break the
            # text out into sentences.
            sentences = nltk.sent_tokenize(example)
            for sentence in sentences:
                feats = feats + nltk.word_tokenize(sentence)
            # For this example, it's a good idea to normalize for case.
            # You may or may not need to do this.
            feats = [word.lower() for word in feats]
            # Each feature needs a value. A typical choice for a case like this
            # is True or 1, though you can use almost any value for
            # a more complicated application or analysis.
            feats = dict([(word, True) for word in feats])
            # NLTK expects you to feed a classifier a list of tuples
            # where each tuple is (features, tag).
            training_set.append((feats, key))
    # Train up our classifier
    classifier = MaxentClassifier.train(training_set)
    # Test it out!
    # You need to feed the classifier your data in the same format you used
    # to train it, in this case individual lowercase words.
    classifier.classify({'apple': True, 'cider': True, 'vinegar': True})
    >>> 'ingredients'
    # Save it to disk, if you want, because these can take a long time to train.
    with open('my_pickle.pickle', 'wb') as outfile:
        pickle.dump(classifier, outfile)

    The built-in Maximum Entropy classifier can take an exceedingly long time to train, but NLTK can interface with several external machine-learning applications to make that process much quicker. I was able to install MegaM on my Mac, with some modifications, and used it with NLTK to great effect.

    Deeper analysis

    But that’s just a beginning – what I’ve described is typically called a “bag of words” approach. To put it simply, the classifier learns how to tag your text based on the frequency of some of the words. It doesn’t account for the order of the words, or common phrases or anything else. Using this method I was able to tag fields with slightly more than 90% accuracy, which is pretty good. But we can do better.

    If you think about how a recipe is written, there are more differences between the fields than the individual words like “butter” or “fry.” There might be common phrases like “heat the oven” or “at room temperature.”

    There also might be differences in the grammar. For example, how can you correctly tag “Mini ricotta latkes with sour cherry sauce” as a recipe title and not an ingredient? Ingredients might have a reasonably predictable mix of adjectives, nouns and proper nouns, while steps might have more verbs and determiners. A title would rarely have a pronoun but could include prepositions fairly often.

    NLTK comes with a few methods to make this type of analysis much easier. It has a great part-of-speech tagger, for instance, as well as functions for pulling bigrams and trigrams (two- and three-word phrases) out of blocks of text. You can easily write a function that tokenizes text into sentences, then words, then trigrams and parts of speech. Feed all of that into your classifier and you can tag text much more accurately.

    It could look something like this:

    import nltk
    # Note: simplify_wsj_tag is part of NLTK 2. In NLTK 3 you can get a
    # similar effect with nltk.pos_tag(words, tagset='universal').
    from nltk.tag.simplify import simplify_wsj_tag
    def get_features(text):
        words = []
        # Same steps to start as before
        sentences = nltk.sent_tokenize(text)
        for sentence in sentences:
            words = words + nltk.word_tokenize(sentence)
        # Part-of-speech tag each of the words
        pos = nltk.pos_tag(words)
        # Sometimes it's helpful to simplify the tags NLTK returns by default.
        # I saw an increase in accuracy if I did this, but you may not,
        # depending on the application.
        pos = [simplify_wsj_tag(tag) for word, tag in pos]
        # Then, convert the words to lowercase like before
        words = [word.lower() for word in words]
        # Grab the trigrams
        trigrams = nltk.trigrams(words)
        # We need to concatenate the trigrams into single strings to process
        trigrams = ["%s/%s/%s" % (t[0], t[1], t[2]) for t in trigrams]
        # Pile the words, parts of speech and trigrams into our feature dict
        features = words + pos + trigrams
        features = dict([(feature, True) for feature in features])
        return features
    # Try it out
    text = "Transfer the pan to a wire rack to cool for 15 minutes."
    get_features(text)
    >>> {'DET': True, 'transfer/the/pan': True, 'for/15/minutes': True, 'rack/to/cool': True, 'wire': True, 'wire/rack/to': True, 'for': True, 'to': True, 'transfer': True, 'to/a/wire': True, '.': True, 'TO': True, 'NUM': True, 'NP': True, 'pan': True, 'a/wire/rack': True, 'the/pan/to': True, 'N': True, 'P': True, 'pan/to/a': True, '15/minutes/.': True, 'cool': True, 'a': True, '15': True, 'to/cool/for': True, 'cool/for/15': True, 'the': True, 'minutes': True, 'rack': True}

    Wrapping it up

    Using a combination of these methods I was able to pull recipes out of news stories very successfully. To get the classifier working really well, you need to train it on a large, random sample of your data.

    I parsed about 10 or 20 records by hand to get started, then created a small Django app to randomly load a record and attempt to parse it. I corrected the tags that were wrong, saved the correct version to a database, and periodically retrained the classifier using the new samples. I ended up with a couple hundred parsed records, and the classifier (which has some built-in methods for testing) was about 98% accurate.
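    The built-in testing amounts to measuring accuracy on a held-out set of hand-tagged examples (nltk.classify.accuracy does this for a trained classifier). The idea is simple enough to sketch without NLTK; the toy classifier and examples below are purely illustrative:

```python
def accuracy(classify, gold):
    """
    Fraction of hand-tagged examples the classifier gets right.
    `classify` is any function mapping a feature dict to a tag;
    `gold` is a list of (features, tag) pairs, the same shape
    NLTK uses for training sets.
    """
    correct = sum(1 for feats, tag in gold if classify(feats) == tag)
    return correct / float(len(gold))

# A toy stand-in classifier: anything mentioning "oven" is a step.
toy = lambda feats: 'steps' if 'oven' in feats else 'ingredients'
gold = [
    ({'heat': True, 'the': True, 'oven': True}, 'steps'),
    ({'3': True, 'eggs': True}, 'ingredients'),
    ({'apple': True, 'cider': True, 'vinegar': True}, 'ingredients'),
    ({'sift': True, 'the': True, 'sugar': True}, 'steps'),
]
accuracy(toy, gold)  # 3 of 4 right -> 0.75
```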

    I wrote a parsing script that incorporated some regular expressions and a bit of if/else logic to try to tag as much as I could from formatting, then used NLTK to tag the rest. After the tagging, the story still had to be assembled into one or more discrete recipes and loaded into a database so that humans could review them.
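    That hybrid approach – cheap formatting rules first, the classifier as a fallback – can be sketched like this. The rule and the fallback function here are simplified stand-ins, not the production script:

```python
import re

# A cheap formatting rule, as in the first example:
# a leading "1." style number marks a step.
step_pattern = re.compile(r'^[0-9]{1,2}\.\s')

def classify_line(text, classifier):
    """
    Tag a line using formatting rules where they're unambiguous,
    falling back to the classifier everywhere else.
    """
    if step_pattern.match(text):
        return 'step'
    return classifier(text)

# Stand-in for the trained NLTK classifier.
fallback = lambda text: 'ingredient' if text[0].isdigit() else 'other'

classify_line('1. Heat the oven to 375 degrees.', fallback)  # 'step'
classify_line('3 eggs', fallback)  # 'ingredient'
```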

    That process was relatively straightforward, but I did have to build a custom admin for a small group of people to compare the original record and parsed output side by side. In the end every record had to be reviewed by hand, and many of them needed one or more small tweaks. Only about one in 20 had structural problems. A big thanks to Maloy Moore, Tenny Tatusian and the Food section staff for combing through all of the records by hand. Computers can really only do so much.

    If you want to learn more, I highly recommend the book Natural Language Processing with Python, which I read before embarking on this project.




      About The Data Desk

      This page was created by the Data Desk, a team of reporters and Web developers in downtown L.A.