Earley’s algorithm in C++

Introduction

Earley’s algorithm is a dynamic programming algorithm for generalised parsing, i.e., parsing with grammars that might be ambiguous. For example, the Earley algorithm could parse sentences generated by the following toy English grammar, which I am going to use in the explanation and code:

S -> NP VP
NP -> NP PP
NP -> Noun
VP -> Verb NP
VP -> VP PP
PP -> Prep NP
Noun -> John
Noun -> Mary
Noun -> Denver
Verb -> called
Prep -> from

As an example of the ambiguity of this grammar, consider the sentence “John called Mary from Denver”. Was John calling from Denver, or was Mary from Denver?

Data structures

The Earley algorithm uses three data structures, states, statelists, and the chart.

States

A state represents a state in the process of evaluating a grammar rule. In particular, it represents how far along the right hand side of the rule the parse has progressed. This is usually represented as a dot between symbols of the right hand side showing the position, hence a state is sometimes known as a “dotted rule”.

For example, the state below is for the grammar rule “PP -> Prep NP” from the grammar above:

(PP -> Prep @ NP, [3 , 4])

The position of the dot (represented by a “@”) indicates that the “Prep” has been processed and the “NP” should be next. The “3” and “4” are the positions within the input that the processed part of this state refers to (i.e., the span of the “Prep”).

Statelists

A statelist is an ordered list of states. The states in a statelist are unique, so it is sometimes called a state set.
Once a state has been added to a statelist, it is never removed or modified – instead new states are created in this or the next statelist based on the state by the procedures described below.

The chart

The chart is a set of statelists. There is one statelist for every point in the input, from before the first word to after the last. The statelists are generated one at a time as the parse progresses. There are only two statelists in action at any time, the current one being processed and the next one being generated.

The algorithm

The algorithm processes the chart from the first statelist to the last, constructing the statelists ahead of itself as it goes.
Rather than looping over the input, it loops over the statelists, the processing of which adds new statelists ahead of the iteration. The processing of states within the statelists consumes the input.

There are three procedures, the Predictor, the Scanner, and the Completer.

The Predictor

The Predictor processes states in which the dot is before a non-terminal. For example, in the state below,

(S -> @ NP VP, [0 , 0])

the dot is before the non-terminal “NP”, so the Predictor would process this state.
The Predictor inserts into the current statelist a state for each rule where that symbol after the dot (“NP”) is on the left (“NP -> NP PP” and “NP -> Noun”), and places the dot before the first symbol of these new rules’ right-hand sides. So the state above gives rise to the following two states:

(NP -> @ NP PP, [0 , 0])
(NP -> @ Noun, [0 , 0])

The Scanner

The Scanner identifies states in the statelist where the dot is before a terminal (in other words, those states where the dot is to the left a symbol not processed by the Predictor). It reads the next word of input, and inserts a state into the next statelist containing the same rule with the dot moved past the terminal.
For example, consider this state in which the Predictor has predicted a “Noun”, whose right hand side is a terminal (Noun -> John):

(Noun -> @ John, [0 , 0])

This causes the Scanner to read the next word of input, which, if it is “John”, results in the following state being added to the next statelist:

(Noun -> John @, [0 , 1])

The Completer

The Completer processes those states not processed by the Predictor or Scanner, in other words, those in which the dot is at the right-hand end, after every symbol in the rule. These states are now complete. It finds states that have the left-hand symbol of those completed states on their right-hand side, and inserts states that are identical, except that the dot is moved to the other side of that particular symbol, as the completion of the original state represents the completion of this symbol. So the Completer’s job is to drive the dots forward in rules that are being processed.
Continuing the example above, as soon as the Scanner has added the state for “Noun -> John”, that state has the dot on the extreme right, so the Completer adds a state marking the completion of the “Noun” part of the rule “NP -> Noun”:

(NP -> Noun @, [0 , 1])

Summary

The Predictor and Completer insert states into the current statelist. The Scanner inserts states into the next statelist (because it advances the reading position). The Predictor and Completer are duals of each other and book-end the work of the Scanner. The Predictor moves down the grammar hierarchy to the bottom in order to introduce states with lower level rules. It feeds the Scanner by working its way down to the terminals and inserting states for their rules. The Completer moves up the grammar hierarchy to the top to advance the dots. It is propagating the work of the Scanner upwards, all the way to the start symbol. The parse is successful if the start state is in the last statelist, with the dot on the right of its right hand side.

The code

Below is the code in C++. The class earley_parser is instantiated with a grammar (see Context-free grammar in C++), and its parse() method attempts to recognise a sentence, and prints the statelists if it is passed a std::ostream. The Predictor, Scanner, and Completer are implemented as methods with those names.

#ifndef EARLEY_HPP
#define EARLEY_HPP

#include <string>
#include <vector>
#include <list>
#include <iostream>
#include <iterator>
#include <algorithm>

#include <grammar.hpp>

namespace MB
{

namespace detail
{

// An element of a statelist, also known as a "dotted rule" or "Earley item"
struct state
{
    MB::rule::const_ptr rule_; // The grammar rule
    const unsigned int right_; // Index of right hand side alternative
    unsigned int dot_; // Position of dot within symbols on right hand side
    unsigned int i_, j_; // Positions within the input
    char added_by_; // Which function added this state
    state(MB::rule::const_ptr rule, unsigned int right, unsigned int i, unsigned int j)
        : rule_(rule), right_(right), dot_(0), i_(i), j_(j), added_by_(0)
    {
    }

    // Is dot all the way to the right?
    bool completed() const
    {
        return dot_ == rule_->right()[right_].size();
    }

    // Get symbol to the right of dot
    std::string next_symbol() const
    {
        return rule_->right()[right_][dot_];
    }
};

// Pretty-print state
std::ostream& operator <<(std::ostream& os, const state& st)
{
    const std::vector<std::string>& right = st.rule_->right()[st.right_];
    size_t rlen = right.size();
    os << '(';
    os << st.rule_->left() << " -> ";
    unsigned int s;
    for (s = 0; s < rlen; ++s) {
        if (s == st.dot_) {
            os << "@ ";
        }
        os << right[s];
        if (s < rlen - 1) {
            os << ' ';
        }
    }
    if (s == st.dot_) {
        os << " @";
    }
    os << ", [" << st.i_ << " , " << st.j_ << "]) ";
    switch (st.added_by_) {
        case 'P':
            os << "predictor";
            break;
        case 'S':
            os << "scanner";
            break;
        case 'C':
            os << "completer";
            break;
        default:
            os << "start state";
    }
    return os;
}

// Needed to check for duplicate states
bool operator ==(const state& state1, const state& state2)
{
    return state1.rule_->left() == state2.rule_->left()
        && state1.rule_->right() == state2.rule_->right()
        && state1.right_ == state2.right_
        && state1.dot_ == state2.dot_
        && state1.i_ == state2.i_
        && state1.j_ == state2.j_;
}

// A statelist is a list of states
typedef std::list<state> statelist;

// A chart is a vector of statelists
struct chart
{
    const MB::grammar& grammar_;
    std::vector<statelist> chart_;

    chart(const MB::grammar& grammar)
        : grammar_(grammar),
        // Chart begins with 1 statelist containing 1 dummy state used to track
        // successful parse
        chart_(1, statelist(1, state(MB::rule::create("$", grammar.get_start_left()),
                        0, 0, 0)))
    {
    }

    // Add state st to statelist s
    void add_state(state& st, unsigned int s)
    {
        if (s < chart_.size()) {
            // Adding to the last statelist
            statelist::iterator it = std::find(chart_[s].begin(), chart_[s].end(), st);
            if (it == chart_[s].end()) {
                chart_[s].push_back(st);
            }
        }
        else {
            // Adding to a new statelist
            chart_.push_back(statelist(1, st));
        }
    }

    // Add predictions for the next symbol in this state
    void predictor(state& st)
    {
        std::vector<MB::rule::const_ptr> rules;
        grammar_.get_rules_for_left(st.next_symbol(), std::back_inserter(rules));
        for (MB::rule::const_ptr r : rules) {
            for (unsigned int a = 0; a < r->right().size(); ++a) {
                state prediction = state(r, a, st.j_, st.j_);
                prediction.added_by_ = 'P';
                add_state(prediction, st.j_);
            }
        }
    }

    // Scan input for next symbol
    void scanner(const state& st, const std::vector<std::string>& input)
    {
        const std::string& word = input[st.j_];
        if (word == st.rule_->right()[st.right_][st.dot_]) {
            state scanned = state(st.rule_, st.right_, st.i_, st.j_ + 1);
            scanned.dot_ = st.dot_ + 1;
            scanned.added_by_ = 'S';
            add_state(scanned, st.j_ + 1);
        }
    }

    // Complete states
    void completer(const state& st)
    {
        const statelist& list = chart_[st.i_];
        const unsigned int i = st.i_;
        const unsigned int j = st.j_;
        for (const state& candidate : list) {
            if (candidate.j_ == i
                    && !candidate.completed()
                    && candidate.next_symbol() == st.rule_->left())
            {
                state completed = state(candidate.rule_, candidate.right_, candidate.i_, j);
                completed.dot_ = candidate.dot_ + 1;
                completed.added_by_ = 'C';
                add_state(completed, j);
            }
        }
    }

    // We have succeeded if the completed dummy state is in the final statelist
    bool succeeded() const
    {
        const statelist& list = chart_[chart_.size() - 1];
        return std::find_if(list.begin(), list.end(),
                [](const state &st)->bool {
                return st.rule_->left() == "$" && st.completed(); })
            != list.end();
    }

    // The main algorithm
    bool parse(const std::vector<std::string>& input, std::ostream *os)
    {
        for (unsigned int i = 0; i <= input.size(); ++i) {
            if (chart_.size() > i) { // Check for running out of statelists when parse fails
                for (statelist::iterator it = chart_[i].begin(); it != chart_[i].end(); ++it) {
                    state& st = *it;
                    if (!st.completed()
                            && !grammar_.symbol_is_terminal(st.next_symbol())) {
                        predictor(st);
                    }
                    else if (!st.completed()) {
                        if (i < input.size()) {
                            scanner(st, input);
                        }
                    }
                    else {
                        completer(st);
                    }
                }
                if (os) {
                    *os << *this;
                    *os << '\n';
                }
            }
        }
        return succeeded();
    }

    // Pretty-print chart
    friend std::ostream& operator <<(std::ostream& os, const chart &ch)
    {
        for (unsigned int i = 0; i < ch.chart_.size(); ++i) {
            os << "S" << i << ": ";
            os << '[';
            unsigned int j = 0;
            for (const state& st : ch.chart_[i]) {
                os << st;
                if (j < ch.chart_[i].size() - 1) {
                    os << ",\n";
                }
                ++j;
            }
            os << ']';
            os << "\n\n";
        }
        return os;
    }
};

} // namespace detail

class earley_parser
{
public:
    earley_parser(const MB::grammar& grammar)
        : grammar_(grammar)
    {
    }
    template <class InputIt>
    bool parse(InputIt begin, InputIt end) const
    {
        std::vector<std::string> input;
        std::copy(begin, end, std::back_inserter(input));
        return detail::chart(grammar_).parse(input, nullptr);
    }
    template <class InputIt>
    bool parse(InputIt begin, InputIt end, std::ostream& os) const
    {
        std::vector<std::string> input;
        std::copy(begin, end, std::back_inserter(input));
        return detail::chart(grammar_).parse(input, &os);
    }
private:
    const MB::grammar& grammar_;
};

} // namespace MB

#endif // EARLEY_HPP

Here is an example program that attempts to recognise the ambiguous sentence “John called Mary from Denver” mentioned in the introduction:

#include <fstream>
#include <iostream>
#include <string>

#include <grammar.hpp>
#include <earley.hpp>

int main()
{
    std::string filename = "earley.dat";
    std::ifstream ifs(filename);
    if (ifs) {
        MB::grammar grammar(ifs);
        std::string sentence[] = {"John", "called", "Mary", "from", "Denver"};
        const size_t len = sizeof(sentence) / sizeof(sentence[0]);
        bool success = MB::earley_parser(grammar).parse(sentence, sentence + len, std::cout);
        std::cout << "Success: " << std::boolalpha << success << '\n';
    }
    else {
        std::cerr << "Couldn't open " << filename << " for reading\n";
    }
}

Here is the output. The entire chart is printed for each iteration. Notice how each iteration adds another statelist. The parsing process is a process of driving the dots forward. At the end of the parse, the presence of the completed start rule in the last statelist indicates success.

S0: [($ -> @ S, [0 , 0]) start state,
(S -> @ NP VP, [0 , 0]) predictor,
(NP -> @ NP PP, [0 , 0]) predictor,
(NP -> @ Noun, [0 , 0]) predictor,
(Noun -> @ John, [0 , 0]) predictor,
(Noun -> @ Mary, [0 , 0]) predictor,
(Noun -> @ Denver, [0 , 0]) predictor]

S1: [(Noun -> John @, [0 , 1]) scanner]


S0: [($ -> @ S, [0 , 0]) start state,
(S -> @ NP VP, [0 , 0]) predictor,
(NP -> @ NP PP, [0 , 0]) predictor,
(NP -> @ Noun, [0 , 0]) predictor,
(Noun -> @ John, [0 , 0]) predictor,
(Noun -> @ Mary, [0 , 0]) predictor,
(Noun -> @ Denver, [0 , 0]) predictor]

S1: [(Noun -> John @, [0 , 1]) scanner,
(NP -> Noun @, [0 , 1]) completer,
(S -> NP @ VP, [0 , 1]) completer,
(NP -> NP @ PP, [0 , 1]) completer,
(VP -> @ Verb NP, [1 , 1]) predictor,
(VP -> @ VP PP, [1 , 1]) predictor,
(PP -> @ Prep NP, [1 , 1]) predictor,
(Verb -> @ called, [1 , 1]) predictor,
(Prep -> @ from, [1 , 1]) predictor]

S2: [(Verb -> called @, [1 , 2]) scanner]


S0: [($ -> @ S, [0 , 0]) start state,
(S -> @ NP VP, [0 , 0]) predictor,
(NP -> @ NP PP, [0 , 0]) predictor,
(NP -> @ Noun, [0 , 0]) predictor,
(Noun -> @ John, [0 , 0]) predictor,
(Noun -> @ Mary, [0 , 0]) predictor,
(Noun -> @ Denver, [0 , 0]) predictor]

S1: [(Noun -> John @, [0 , 1]) scanner,
(NP -> Noun @, [0 , 1]) completer,
(S -> NP @ VP, [0 , 1]) completer,
(NP -> NP @ PP, [0 , 1]) completer,
(VP -> @ Verb NP, [1 , 1]) predictor,
(VP -> @ VP PP, [1 , 1]) predictor,
(PP -> @ Prep NP, [1 , 1]) predictor,
(Verb -> @ called, [1 , 1]) predictor,
(Prep -> @ from, [1 , 1]) predictor]

S2: [(Verb -> called @, [1 , 2]) scanner,
(VP -> Verb @ NP, [1 , 2]) completer,
(NP -> @ NP PP, [2 , 2]) predictor,
(NP -> @ Noun, [2 , 2]) predictor,
(Noun -> @ John, [2 , 2]) predictor,
(Noun -> @ Mary, [2 , 2]) predictor,
(Noun -> @ Denver, [2 , 2]) predictor]

S3: [(Noun -> Mary @, [2 , 3]) scanner]


S0: [($ -> @ S, [0 , 0]) start state,
(S -> @ NP VP, [0 , 0]) predictor,
(NP -> @ NP PP, [0 , 0]) predictor,
(NP -> @ Noun, [0 , 0]) predictor,
(Noun -> @ John, [0 , 0]) predictor,
(Noun -> @ Mary, [0 , 0]) predictor,
(Noun -> @ Denver, [0 , 0]) predictor]

S1: [(Noun -> John @, [0 , 1]) scanner,
(NP -> Noun @, [0 , 1]) completer,
(S -> NP @ VP, [0 , 1]) completer,
(NP -> NP @ PP, [0 , 1]) completer,
(VP -> @ Verb NP, [1 , 1]) predictor,
(VP -> @ VP PP, [1 , 1]) predictor,
(PP -> @ Prep NP, [1 , 1]) predictor,
(Verb -> @ called, [1 , 1]) predictor,
(Prep -> @ from, [1 , 1]) predictor]

S2: [(Verb -> called @, [1 , 2]) scanner,
(VP -> Verb @ NP, [1 , 2]) completer,
(NP -> @ NP PP, [2 , 2]) predictor,
(NP -> @ Noun, [2 , 2]) predictor,
(Noun -> @ John, [2 , 2]) predictor,
(Noun -> @ Mary, [2 , 2]) predictor,
(Noun -> @ Denver, [2 , 2]) predictor]

S3: [(Noun -> Mary @, [2 , 3]) scanner,
(NP -> Noun @, [2 , 3]) completer,
(VP -> Verb NP @, [1 , 3]) completer,
(NP -> NP @ PP, [2 , 3]) completer,
(S -> NP VP @, [0 , 3]) completer,
(VP -> VP @ PP, [1 , 3]) completer,
(PP -> @ Prep NP, [3 , 3]) predictor,
($ -> S @, [0 , 3]) completer,
(Prep -> @ from, [3 , 3]) predictor]

S4: [(Prep -> from @, [3 , 4]) scanner]


S0: [($ -> @ S, [0 , 0]) start state,
(S -> @ NP VP, [0 , 0]) predictor,
(NP -> @ NP PP, [0 , 0]) predictor,
(NP -> @ Noun, [0 , 0]) predictor,
(Noun -> @ John, [0 , 0]) predictor,
(Noun -> @ Mary, [0 , 0]) predictor,
(Noun -> @ Denver, [0 , 0]) predictor]

S1: [(Noun -> John @, [0 , 1]) scanner,
(NP -> Noun @, [0 , 1]) completer,
(S -> NP @ VP, [0 , 1]) completer,
(NP -> NP @ PP, [0 , 1]) completer,
(VP -> @ Verb NP, [1 , 1]) predictor,
(VP -> @ VP PP, [1 , 1]) predictor,
(PP -> @ Prep NP, [1 , 1]) predictor,
(Verb -> @ called, [1 , 1]) predictor,
(Prep -> @ from, [1 , 1]) predictor]

S2: [(Verb -> called @, [1 , 2]) scanner,
(VP -> Verb @ NP, [1 , 2]) completer,
(NP -> @ NP PP, [2 , 2]) predictor,
(NP -> @ Noun, [2 , 2]) predictor,
(Noun -> @ John, [2 , 2]) predictor,
(Noun -> @ Mary, [2 , 2]) predictor,
(Noun -> @ Denver, [2 , 2]) predictor]

S3: [(Noun -> Mary @, [2 , 3]) scanner,
(NP -> Noun @, [2 , 3]) completer,
(VP -> Verb NP @, [1 , 3]) completer,
(NP -> NP @ PP, [2 , 3]) completer,
(S -> NP VP @, [0 , 3]) completer,
(VP -> VP @ PP, [1 , 3]) completer,
(PP -> @ Prep NP, [3 , 3]) predictor,
($ -> S @, [0 , 3]) completer,
(Prep -> @ from, [3 , 3]) predictor]

S4: [(Prep -> from @, [3 , 4]) scanner,
(PP -> Prep @ NP, [3 , 4]) completer,
(NP -> @ NP PP, [4 , 4]) predictor,
(NP -> @ Noun, [4 , 4]) predictor,
(Noun -> @ John, [4 , 4]) predictor,
(Noun -> @ Mary, [4 , 4]) predictor,
(Noun -> @ Denver, [4 , 4]) predictor]

S5: [(Noun -> Denver @, [4 , 5]) scanner]


S0: [($ -> @ S, [0 , 0]) start state,
(S -> @ NP VP, [0 , 0]) predictor,
(NP -> @ NP PP, [0 , 0]) predictor,
(NP -> @ Noun, [0 , 0]) predictor,
(Noun -> @ John, [0 , 0]) predictor,
(Noun -> @ Mary, [0 , 0]) predictor,
(Noun -> @ Denver, [0 , 0]) predictor]

S1: [(Noun -> John @, [0 , 1]) scanner,
(NP -> Noun @, [0 , 1]) completer,
(S -> NP @ VP, [0 , 1]) completer,
(NP -> NP @ PP, [0 , 1]) completer,
(VP -> @ Verb NP, [1 , 1]) predictor,
(VP -> @ VP PP, [1 , 1]) predictor,
(PP -> @ Prep NP, [1 , 1]) predictor,
(Verb -> @ called, [1 , 1]) predictor,
(Prep -> @ from, [1 , 1]) predictor]

S2: [(Verb -> called @, [1 , 2]) scanner,
(VP -> Verb @ NP, [1 , 2]) completer,
(NP -> @ NP PP, [2 , 2]) predictor,
(NP -> @ Noun, [2 , 2]) predictor,
(Noun -> @ John, [2 , 2]) predictor,
(Noun -> @ Mary, [2 , 2]) predictor,
(Noun -> @ Denver, [2 , 2]) predictor]

S3: [(Noun -> Mary @, [2 , 3]) scanner,
(NP -> Noun @, [2 , 3]) completer,
(VP -> Verb NP @, [1 , 3]) completer,
(NP -> NP @ PP, [2 , 3]) completer,
(S -> NP VP @, [0 , 3]) completer,
(VP -> VP @ PP, [1 , 3]) completer,
(PP -> @ Prep NP, [3 , 3]) predictor,
($ -> S @, [0 , 3]) completer,
(Prep -> @ from, [3 , 3]) predictor]

S4: [(Prep -> from @, [3 , 4]) scanner,
(PP -> Prep @ NP, [3 , 4]) completer,
(NP -> @ NP PP, [4 , 4]) predictor,
(NP -> @ Noun, [4 , 4]) predictor,
(Noun -> @ John, [4 , 4]) predictor,
(Noun -> @ Mary, [4 , 4]) predictor,
(Noun -> @ Denver, [4 , 4]) predictor]

S5: [(Noun -> Denver @, [4 , 5]) scanner,
(NP -> Noun @, [4 , 5]) completer,
(PP -> Prep NP @, [3 , 5]) completer,
(NP -> NP @ PP, [4 , 5]) completer,
(NP -> NP PP @, [2 , 5]) completer,
(VP -> VP PP @, [1 , 5]) completer,
(PP -> @ Prep NP, [5 , 5]) predictor,
(VP -> Verb NP @, [1 , 5]) completer,
(NP -> NP @ PP, [2 , 5]) completer,
(S -> NP VP @, [0 , 5]) completer,
(VP -> VP @ PP, [1 , 5]) completer,
(Prep -> @ from, [5 , 5]) predictor,
($ -> S @, [0 , 5]) completer]

Success: true

Reference: Dynamic Programming and the Earley Parser