Unger’s method in C++

Unger’s method is a top down, non-directional, generalised parsing algorithm. Top-down means that the parsing process begins with the grammar’s start symbol and attempts to derive the input string by applying the production rules from the start rule downwards. Non-directional means that the input is not processed from left to right (or right to left), but instead is processed in arbitrary order. This requires that the entire input is in memory from the beginning. Generalised means that the parser can handle ambiguous grammars, which can produce sentences that have multiple correct parses.

To choose between production rules to apply, Unger’s method works by evaluating all possible partitions of the input string that can correspond to the symbols on the right-hand-side of each grammar rule. It proceeds as follows:

  1. Find all of the right-hand side alternatives of rules applicable to the input (these are the right-hand sides of the start rules at first)
  2. For each right-hand side, count how many symbols it contains, and form all of the partitions of the input into that number of parts – each cell in the partition is a possible product of the symbol in the corresponding place in the alternative
  3. Reject any partitions that can’t possibly be correct because they don’t have terminals in cells that require them, i.e., where the symbol in the right-hand side is a terminal, the corresponding cell of the partition must contain only that terminal
  4. For the partitions that remain, repeat the whole process for each cell of the partition, using the rules that can produce the symbol corresponding to that cell, and the cell’s content as the input
  5. The process terminates when there are no more rules to try, or we have successfully broken the input down to terminals

For example, given the grammar:

E -> E + T
E -> T
T -> T * a
T -> a

And the sentence:

a * a + a

We find all of the right hand side alternatives of rules applicable to the input -these are the two start rules first

E + T
T

For each right-hand side, we count how many symbols it contains, and form of all of the partitions of the input into that number of parts.
The first start rule alternative “E + T” contains 3 symbols, so we form all of the partitions of the input into 3 parts as follows:

# E + T
0 a * a + a
1 a * a + a
2 a * a + a
3 a * a + a
4 a * a + a
5 a * a + a

We reject any partitions that can’t possibly be correct because they don’t have terminals in cells that require them. So we reject all of the above except row #5, because it is the only one with the “+” in the right place.

For the partitions that remain, we repeat the process for each cell of the partition, using the rules that can produce the symbol corresponding to that cell (the column header). So, in the example above, we repeat the process with “a * a” and all of the rules that can produce E, which is just “E -> E + T”.

Eventually, we will get down to the terminals if the parse succeeds, or run out of rules to try. Note that:

  • A table (rule) succeeds if any one of its rows succeeds
  • A row succeeds only if all of its cells succeed
  • We can check terminals for validity first in order to prevent unnecessary recursion

Here is the code in C++. The class unger_parser is instantiated with a grammar (see Context-free grammar in C++), and its parse method takes an input string which it will attempt to recognise, and will print the rules matched if given a std::ostream.

#ifndef UNGER_HPP
#define UNGER_HPP

#include <vector>
#include <iterator>
#include <iostream>
#include <iterator>

#include <grammar.hpp>


namespace MB
{

namespace detail
{
// First partition of sequence into k parts
void first_sequence_partition(unsigned int *arr, unsigned int k)
{
    unsigned int p;
    for (p = 0; p < k - 1; p++) {
        arr[p] = p + 1;
    }
}

// Next partition of sequence into k parts
bool next_sequence_partition(unsigned int *arr, size_t len, unsigned int k)
{
    int p; // Separator number
    unsigned int s; // Its highest place
    bool changed = 0;
    for (p = k - 2, s = len - 1; p >= 0 && !changed; p--, s--) {
        if (arr[p] < s) {
            arr[p]++;
            changed = true;
            if (static_cast<unsigned int>(p) < k - 2) {
                p++;
                while (static_cast<unsigned int>(p) <= k - 2) {
                    arr[p] = arr[p - 1] + 1;
                    p++;
                }
            }
        }
    }
    if (!changed) {
        first_sequence_partition(arr, k);
    }
    return changed;
}

// Fill a vector with len elements starting at start
void fill(const std::vector<std::string>& input, unsigned int start, unsigned int len,
        std::vector<std::string>& output)
{
    unsigned int i, j;
    for (i = start, j = 0; j < len; ++i, ++j) {
        output.push_back(input[i]);
    }
}

} // namespace detail

class unger_parser
{
public:
    unger_parser(const MB::grammar& grammar)
        : grammar_(grammar), os_(nullptr)
    {
    }

    template <class InputIt>
    bool parse(InputIt begin, InputIt end)
    {
        return parse(begin, end, nullptr);
    }

    template <class InputIt>
    bool parse(InputIt begin, InputIt end, std::ostream& os)
    {
        os_ = &os;
        bool success = parse(begin, end, &os);
        os_ = nullptr;
        return success;
    }

private:
    template <class InputIt>
    bool parse(InputIt begin, InputIt end, std::ostream *os)
    {
        std::vector<std::string> input;
        std::copy(begin, end, std::back_inserter(input));
        std::vector<MB::rule::const_ptr> rules;
        grammar_.get_start_rules(std::back_inserter(rules));
        return valid_rules(input, rules);
    }

    // If symbol is a single terminal, it needs to match cell
    bool valid_cell_terminal(const std::vector<std::string>& cell,
            const std::string& symbol)
    {
        return cell.size() == 1 && cell[0] == symbol;
    }

    // If symbol is a non-terminal, find its rules and try to match one of them
    bool valid_cell_nonterminal(const std::vector<std::string>& cell,
            const std::string& symbol)
    {
        // Try all of the rules where symbol is the left hand side
        std::vector<MB::rule::const_ptr> rules;
        grammar_.get_rules_for_left(symbol, std::back_inserter(rules));
        return valid_rules(cell, rules);
    }

    // A cell is valid if it matches a terminal or a rule for a non-terminal succeeds
    bool valid_cell(const std::vector<std::string>& cell,
            const std::string& symbol)
    {
        bool terminal = grammar_.symbol_is_terminal(symbol);
        return (terminal && valid_cell_terminal(cell, symbol))
            || (!terminal && valid_cell_nonterminal(cell, symbol));
    }

    // All of the cells in a row need to succeed
    bool valid_row(const std::vector<std::string>& input, 
            const std::vector<std::string>& right, const unsigned int *arr, unsigned int k)
    {
        const size_t len = input.size();
        bool success = true;
        std::vector<std::vector<std::string> > cells;
        for (unsigned int p = 0; p < k && success; p++) {
            std::vector<std::string> cell;
            if (p == 0) {
                // First partition goes from the beginning to arr[0]
                detail::fill(input, 0, arr[0], cell);
            }
            else if (p == k - 1) {
                // Last partition goes from arr[k - 2] to the end
                detail::fill(input, arr[k - 2], len - arr[k - 2], cell);
            }
            else {
                // Partition goes from separator p - 1 to separator p
                detail::fill(input, arr[p - 1], arr[p] - arr[p - 1], cell);
            }
            cells.push_back(cell);
            std::string symbol = right[p];
            if (grammar_.symbol_is_terminal(symbol)) {
                success = valid_cell_terminal(cell, symbol);
            }
        }
        if (success) {
            for (unsigned int p = 0; p < k && success; p++) {
                std::string symbol = right[p];
                success = valid_cell(cells[p], symbol);
            }
        }
        return success;
    }

    // One of the rows in a table needs to succeed
    bool valid_table(const std::vector<std::string>& input,
            const std::vector<std::string>& right)
    {
        const unsigned int k = right.size(); // Number of parts
        const size_t len = input.size(); // Number of tokens
        bool success = false;
        if (k <= len) {
            if (k > 1) {
                std::vector<unsigned int> arr(k);
                detail::first_sequence_partition(arr.data(), k);
                do {
                    success = valid_row(input, right, arr.data(), k);
                } while (detail::next_sequence_partition(arr.data(), len, k) && !success);
            }
            else {
                success = valid_cell(input, right[0]);
            }
        }
        return success;
    }
    // One of the right hand side alternatives of one of the rules needs to succeed 
    bool valid_rules(const std::vector<std::string>& input,
            const std::vector<MB::rule::const_ptr>& rules)
    {
        for (MB::rule::const_ptr r : rules) {
            for (const std::vector<std::string>& alternative : r->right()) {
                if (valid_table(input, alternative)) {
                    if (os_) {
                        *os_ << "Succeeded in matching rule ";
                        *os_ << *r;
                        *os_ << " with input ";
                        std::copy(input.begin(), input.end(), 
                                std::ostream_iterator<std::string>(*os_, " "));
                        *os_ << '\n';
                    }
                    return true;
                }
            }
        }
        return false;
    }

private:
    const MB::grammar& grammar_;
    std::ostream* os_;
};

} // namespace MB

#endif // UNGER_HPP

Here is a program for the example above:

#include <fstream>
#include <iostream>
#include <string>

#include <grammar.hpp>
#include <unger.hpp>

int main()
{
    std::string filename = "unger.dat";
    std::ifstream ifs(filename);
    if (ifs) {
        MB::grammar grammar(ifs);
        std::string sentence[] = {"a", "*", "a", "+", "a"};
        const size_t len = sizeof(sentence) / sizeof(sentence[0]);
        bool success = MB::unger_parser(grammar).parse(sentence, sentence + len, std::cout);
        std::cout << "Success: " << std::boolalpha << success << '\n';
    }
    else {
        std::cerr << "Couldn't open " << filename << " for reading\n";
    }
}

Here is the output:

Succeeded in matching rule T -> a  with input a
Succeeded in matching rule T -> T * a  with input a * a
Succeeded in matching rule E -> T  with input a * a
Succeeded in matching rule T -> a  with input a
Succeeded in matching rule E -> E + T  with input a * a + a
Success: true

Reference: