Tag Archives: C++11

CYK algorithm in C++

The Cocke-Younger-Kasami (CYK) algorithm is a dynamic programming algorithm for generalised parsing – i.e., parsing with ambiguous grammars where there may be more than one possible parse. The algorithm works by populating a table with the possible ways of parsing successively larger portions of the input. If the grammar’s start symbol is in the top cell of the table when it has been completed, the parse is successful.

Below is an implementation in C++. The class MB::cyk_parser is instantiated with a grammar (see Context-free grammar in C++). The parse() method attempts to recognise a sentence and displays the parsing table if it is called with a std::ostream.

The grammar must be in Chomsky Normal Form (CNF), which means that every left-hand side must be a single non-terminal symbol, and every right-hand side alternative must either be a single terminal symbol or a pair of non-terminal symbols. This is a requirement of the classic CYK algorithm, but some variants of the algorithm can handle grammars that are not in CNF.

The parsing table is an upper-right triangular matrix internally, but is printed as a lower-left triangular matrix, as this is easier to read.

#ifndef CYK_HPP
#define CYK_HPP

#include <vector>
#include <set>
#include <iostream>
#include <iterator>

#include <grammar.hpp>

namespace MB
{

class cyk_parser
{
public:
    cyk_parser(const MB::grammar& grammar)
        : grammar_(grammar)
    {
    }

    template <class InputIt>
    bool parse(InputIt begin, InputIt end)
    {
        return parse(begin, end, nullptr);
    }

    template <class InputIt>
    bool parse(InputIt begin, InputIt end, std::ostream& os)
    {
        return parse(begin, end, &os);
    }

private:
    template <class InputIt>
    bool parse(InputIt begin, InputIt end, std::ostream *os)
    {
        std::vector<std::string> str;
        std::copy(begin, end, std::back_inserter(str));
        const size_t len = str.size();
        std::vector<std::vector<std::set<std::string> > > table(len,
                std::vector<std::set<std::string> >(len));
        unsigned int i, j, k;

        // Populate main diagonal
        i = 0;
        for (const std::string& c : str) {
            std::vector<MB::rule::const_ptr> rules;
            grammar_.get_rules_for_symbol(c, std::back_inserter(rules));
            for (MB::rule::const_ptr r : rules) {
                table[i][i].insert(r->left());
            }
            i++;
        }
        // Populate upper-right triangle
        for (i = 1; i < len; i++) {
            for (j = i; j < len; j++) {
                for (k = j - i; k < j; k++) {
                    std::vector<std::pair<std::string, std::string> > product;
                    MB::detail::cartesian_product(table[j - i][k].begin(), table[j - i][k].end(), 
                            table[k + 1][j].begin(),
                            table[k + 1][j].end(), std::back_inserter(product));
                    std::vector<MB::rule::const_ptr> rules;
                    grammar_.get_rules_for_symbols(product.begin(), product.end(), 
                            std::back_inserter(rules));
                    for (MB::rule::const_ptr r : rules) {
                        table[j - i][j].insert(r->left());
                    }
                }
            }
        }
        if (os) {
            // Print the table 
            for (i = 0; i < len; i++) {
                k = 0;
                for (j = len - i - 1; j < len; j++) {
                    std::copy(table[k][j].begin(), table[k][j].end(), 
                            std::ostream_iterator<std::string>(*os, " "));
                    ++k;
                    *os << '\t';
                }
                *os << '\n';
            }
            for (const std::string& c : str) {
                *os << c << '\t';
            }
            *os << '\n';
        }
             
        // Successful if the start symbol is in the top-right cell    
        return table[0][len - 1].find(grammar_.get_start_left()) != table[0][len - 1].end();
    }

private:
    const MB::grammar& grammar_;
};

} // namespace MB

#endif // CYK_HPP

Here is an example program:

#include <fstream>
#include <iostream>

#include <grammar.hpp>
#include <cyk.hpp>

int main()
{
    std::string filename = "cyk.dat";
    std::ifstream ifs(filename);
    if (ifs) {
        MB::grammar grammar(ifs);
        std::string sentence[] = {"b", "a", "a", "b", "a"};
        const size_t len = sizeof(sentence) / sizeof(sentence[0]);
        bool success = MB::cyk_parser(grammar).parse(sentence, sentence + len, std::cout);
        std::cout << "Success: " << std::boolalpha << success << '\n';
    }
    else {
        std::cerr << "Couldn't open " << filename << " for reading\n";
    }
}

This is the grammar file:

S -> A B | B C
A -> B A | a
B -> C C | b
C -> A B | a

Here is the output. Notice the start symbol S in the top-left cell, which indicates a successful parse:

A C S
        A C S
        B       B
A S     B       C S     A S
B       A C     A C     B       A C
b       a       a       b       a
Success: true

Reference: The CYK Algorithm

Aho-Corasick algorithm in C++

The Aho-Corasick algorithm constructs a finite-state automaton (FSA) that can match multiple strings simultaneously. It begins by constructing a conventional prefix tree and then adds transition edges that allow the automaton to recover efficiently from failures.

Below is an implementation in C++. The constructor takes a pair of iterators to a sequence of strings, which the automaton will then be able to match. Once the automaton has been constructed, its search() method can be used to search a string of text. It takes an output iterator, to which it writes pairs containing the strings found and their starting positions.

#ifndef AHO_CORASICK_HPP
#define AHO_CORASICK_HPP

#include <queue>
#include <string>
#include <vector>
#include <set>

namespace MB
{

namespace detail
{

// Maximum number of characters in alphabet
static constexpr int MAXCHARS = 256;

struct state
{
    std::set<int> output_function;
    int failure_function;
    std::vector<int> goto_function;
    state()
        : failure_function(-1),
          goto_function(detail::MAXCHARS, -1)
    {
    }
};

} // namespace detail

class ac_automaton
{
public:
    template <class InputIt>
    ac_automaton(InputIt first, InputIt last)
        : states(1)
    {
        // Build the prefix tree
        for (InputIt it = first; it != last; ++it) {
            add_word(*it);
        }
        // Turn it into an automaton
        construct_automaton();
    }

    template <class OutputIt>
    OutputIt search(std::string text, OutputIt it) const
    {
        int current_state = 0;
     
        for (int i = 0; i < text.size(); ++i) {
            current_state = find_next_state(current_state, text[i]);
            if (states[current_state].output_function.size()) {
                for (auto length : states[current_state].output_function) {
                    *it++ = std::make_pair(std::string(text, i - length + 1, length),
                            i - length + 1);
                }
            }
        }
        return it;
    }

private:
    std::vector<detail::state> states;

private:
    void add_word(const std::string &word)
    {
        int current_state = 0;
        // Add to prefix tree
        for (int c : word) {
            if (states[current_state].goto_function == -1) {
                states[current_state].goto_function = states.size();
                states.push_back(detail::state());
            }
            current_state = states[current_state].goto_function;
        }
        // Add to output function for this state
        states[current_state].output_function.insert(word.size());
    }

    void construct_automaton()
    {
        // Complete the goto function by setting it to 0 for each
        // letter that doesn't have an edge from the root
        for (int c = 0; c < detail::MAXCHARS; ++c) {
            if (states[0].goto_function == -1) {
                states[0].goto_function = 0;
            }
        }
     
        // Compute failure and output functions
        std::queue<int> state_queue;
        for (int c = 0; c < detail::MAXCHARS; ++c) {
            if (states[0].goto_function != 0) {
                states[states[0].goto_function].failure_function = 0;
                state_queue.push(states[0].goto_function);
            }
        }
        while (state_queue.size()) {
            int s = state_queue.front();
            state_queue.pop();
     
            for (int c = 0; c < detail::MAXCHARS; ++c) {
                if (states[s].goto_function != -1) {
                    int failure = states[s].failure_function;
                    while (states[failure].goto_function == -1) {
                         failure = states[failure].failure_function;
                    }
                    failure = states[failure].goto_function;
                    states[states[s].goto_function].failure_function = failure;
                    for (auto length : states[failure].output_function) {
                        states[states[s].goto_function].output_function.insert(length);
                    }
                    state_queue.push(states[s].goto_function);
                }
            }
        }
    }
 
    int find_next_state(int current_state, char c) const
    {
        int next_state = current_state;
     
        while (states[next_state].goto_function == -1) {
            next_state = states[next_state].failure_function;
        }
     
        return states[next_state].goto_function;
    }
 
};

} // namespace MB

#endif // AHO_CORASICK_HPP

Here is an example program:

#include <iostream>

#include "aho_corasick.hpp"

int main()
{
    std::string words[] = {"brown", "dog", "fox", "jumps", "lazy", "over", "quick", "the"};
    const size_t len = sizeof(words) / sizeof(words[0]);
    std::string text = "the quick brown fox jumps over the lazy dog";
 
    MB::ac_automaton automaton(words, words + len);
    std::vector<std::pair<std::string, int> > results;

    automaton.search(text, std::back_inserter(results));
    for (auto it = results.begin(); it != results.end(); ++it) {
        std::cout << it->first << " starting at " << it->second << '\n';
    }
 
    return 0;
}

Output:

the starting at 0
quick starting at 4
brown starting at 10
fox starting at 16
jumps starting at 20
over starting at 26
the starting at 31
lazy starting at 35
dog starting at 40

Reference: Biosequence Algorithms, Spring 2005 Lecture 4: Set Matching and Aho-Corasick Algorithm

BK-Tree in C++

A BK-tree is an efficient data structure for storing items for which there is some concept of distance between them. A common use is to store words for approximate string matching, using the Levenshtein distance metric. Another use is to store phrases using some measure of semantic distance in order to implement an AI system like a chatterbot.

This is a simple implementation of a BK-tree in C++.

#ifndef BK_TREE_HPP
#define BK_TREE_HPP

#include <map>
#include <memory>
#include <type_traits>

namespace MB
{

template <typename T, typename Unit, typename Metric> class bktree;

namespace detail
{

template <typename T, typename Unit, typename Metric>
class bktree_node
{
    friend class bktree<T, Unit, Metric>;
	typedef bktree_node<T, Unit, Metric> node_type;
    typedef std::unique_ptr<node_type> node_ptr_type;
	
	T value_;
	std::map<Unit, node_ptr_type> children_;

	bktree_node(const T &value)
		: value_(value)
    {
    }

	bool insert(const T& value, const Metric& distance)
    {
        bool inserted = false;
		Unit dist = distance(value, this->value_);
		if (dist > 0) {
            // Not already here 
            auto it = children_.find(dist);
            if (it == children_.end()) {
                // Attach a new edge here
                children_.insert(std::make_pair(dist, node_ptr_type(new node_type(value))));
                inserted = true;
            }
            else {
                // Follow existing edge
                inserted = it->second->insert(value, distance);
            }
            return inserted;
        }
	}

    template <class OutputIt>
	OutputIt find(const T &value, const Unit& limit, const Metric& distance, OutputIt it) const
    {
		Unit dist = distance(value, this->value_);
		if (dist <= limit) {
            // Found one
			*it++ = std::make_pair(this->value_, dist);
        }
		for (auto iter = children_.begin(); iter != children_.end(); ++iter) {
            // Follow edges between dist + limit and dist - limit
			if (dist - limit <= iter->first && dist + limit >= iter->first) {
				iter->second->find(value, limit, distance, it);
            }
		}
        return it;
	}
};

}; // namespace detail

template <typename T, typename Unit, typename Metric>
class bktree
{
private:
	typedef typename detail::bktree_node<T, Unit, Metric>::node_type node_type;
    typedef typename detail::bktree_node<T, Unit, Metric>::node_ptr_type node_ptr_type;

    node_ptr_type head_;
    const Metric distance_;
	size_t size_;

public:
	bktree(const Metric& distance = Metric())
        : head_(nullptr), distance_(distance), size_(0L)
    {
        static_assert(std::is_integral<Unit>::value, "Integral unit type required.");
    }

	bool insert(const T &value)
    {
        bool inserted = false;
		if (head_ == nullptr) {
            // Inserting the first value
			head_ = node_ptr_type(new node_type(value));
			size_ = 1;
            inserted = true;
		}
        else if (head_->insert(value, distance_)) {
            ++size_;
            inserted = true;
        }
        return inserted;
	}

    template <class OutputIt>
	OutputIt find(const T& value, const Unit& limit, OutputIt it) const
    {
		return head_->find(value, limit, distance_, it);
	}

	size_t size() const
    {
		return size_;
	}
};

}; // namespace MB

#endif // BK_TREE_HPP

Here is an example program. It loads a dictionary from a file, and then prompts the user to enter words and a limit on distance, and it will return all words within that Levenshtein distance of the word entered. I used this dictionary with it:

#include <fstream>
#include <iostream>
#include <string>

#include "bk_tree.hpp"
#include "levenshtein_distance.hpp"

int main()
{
    const char filename[] = "dictionary.txt";
    std::ifstream ifs(filename);
    if (!ifs) {
        std::cerr << "Couldn't open " << filename << " for reading\n";
        return 1;
    }
    MB::bktree<std::string, int, levenshtein_distance> tree;
    std::string word;
    while (ifs >> word) {
        tree.insert(word);
    }
    std::cout << "Inserted " << tree.size() << " words\n";
    std::vector<std::pair<std::string, int> > results;
    do {
        std::cout << "Enter a word (\"quit!\" to quit)> ";
        std::cin >> word;
        if (word != "quit!") {
            int limit;
            std::cout << "Enter a limit> ";
            std::cin >> limit;
            tree.find(word, limit, std::back_inserter(results));
            for (auto it = results.begin(); it != results.end(); ++it) {
                std::cout << it->first << "(distance " << it->second << ")\n";
            }
            results.clear();
        }
    }  while (word != "quit!");
}

Example run:

$ ./dictionary
Inserted 127142 words
Enter a word ("quit!" to quit)> mispelling
Enter a limit> 2
spelling(distance 2)
misdealing(distance 2)
impelling(distance 2)
miscalling(distance 2)
mispenning(distance 2)
misrelying(distance 2)
respelling(distance 2)
dispelling(distance 1)
misbilling(distance 2)
misspelling(distance 1)
misspellings(distance 2)
Enter a word ("quit!" to quit)> quit!
$

Here is an implementation of the Levenshtein distance:

#include <string>
#include <vector>
#include <algorithm>

namespace MB
{

namespace detail
{

template <typename T>
T min3(const T& a, const T& b, const T& c)
{
   return std::min(std::min(a, b), c);
}

}; // namespace detail

class levenshtein_distance 
{
    mutable std::vector<std::vector<unsigned int> > matrix_;

public:
    explicit levenshtein_distance(size_t initial_size = 8)
        : matrix_(initial_size, std::vector<unsigned int>(initial_size))
    {
    }

    unsigned int operator()(const std::string& s, const std::string& t) const
    {
        const size_t m = s.size();
        const size_t n = t.size();
        // The distance between a string and the empty string is the string's length
        if (m == 0) {
            return n;
        }
        if (n == 0) {
            return m;
        }
        // Size the matrix as necessary
        if (matrix_.size() < m + 1) {
            matrix_.resize(m + 1, matrix_[0]);
        }
        if (matrix_[0].size() < n + 1) {
            for (auto& mat : matrix_) {
                mat.resize(n + 1);
            }
        }
        // The top row and left column are prefixes that can be reached by
        // insertions and deletions alone
        unsigned int i, j;
        for (i = 1;  i <= m; ++i) {
            matrix_[i][0] = i;
        }
        for (j = 1; j <= n; ++j) {
            matrix_[0][j] = j;
        }
        // Fill in the rest of the matrix
        for (j = 1; j <= n; ++j) {
            for (i = 1; i <= m; ++i) {
                unsigned int substitution_cost = s[i - 1] == t[j - 1] ? 0 : 1;
                matrix_[i][j] =
                    detail::min3(matrix_[i - 1][j] + 1,         // Deletion
                    matrix_[i][j - 1] + 1,                      // Insertion
                    matrix_[i - 1][j - 1] + substitution_cost); // Substitution
            }
        }
        return matrix_[m][n];
    }
};

}; // namespace MB

Typedef super

A useful addition to a derived class is a typedef aliasing the base class as super:

class Derived : public Base
{
private:
    typedef Base super;
};

This allows you to use super to call the base class constructor from the derived class:

    Derived(int i, int j)
        : super(i), j_(j)
    {
    }

You can also use super to call the base class version of a virtual function in a derived class override:

 virtual void Show()
{
    super::Show();
    std::cout << "Derived Show()\n";
}

It means that if you need to add another layer of inheritance, you only need to change the inheritance part of the class declaration and this typedef.

class Derived : public Intermediate
{
private:
    typedef Intermediate super;
};

You should make the typedef private, to prevent it from being inherited and accidentally used from derived classes that do not declare it.

Visual C++ has a non-standard extension __super: __super.

It is available in Java: Using the Keyword super.

Nowadays you should probably use using:

class Derived : public Base
{
private:
    using super = Base;
};

So it should probably be called the "using super =" idiom.

Looping over the keys and values in a map

Say we have this map:

#include <map>
#include <string>

typedef std::map<std::string, unsigned int> map_string_to_uint;
const size_t n = 12;

map_string_to_uint monthdays;
const char *months[] 
        = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"};
unsigned int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
for (int m = 0; m < n; ++m) {
    monthdays[months[m]] = days[m];
}

And we want to loop over the keys and values.

Method 1: Use an iterator

    for (map_string_to_uint::const_iterator it = monthdays.begin(); it != monthdays.end(); ++it) {
        std::cout << it->first << ": " << it->second << "\n";
    }

Method 2: Use a range-based for loop (C++11)

    for (const auto& pair : monthdays) {
        std::cout << pair.first << ": " << pair.second << "\n";
    }

Method 3: Use Boost.Foreach

#include <boost/foreach.hpp>

BOOST_FOREACH(const map_string_to_uint::value_type& pair, monthdays) {
    std::cout << pair.first << ": " << pair.second << "\n";
}

Reference: Boost.Foreach

Template specialisation with a non-type template parameter

The problem

Say you have a generic matrix class like this with non-type template parameters for the number of rows and columns:

template <typename ElementType, size_t Rows, size_t Columns>
class Matrix
{
    ElementType entries_[Rows][Columns];
    // etc...
};

You would like to create a specialization for a column vector, where the number of columns is always 1. However, the following typedef doesn’t compile:

typedef Matrix<size_t, Rows, 1> ColumnVector;

C++03

In C++03, you have 2 options.
Firstly, you can use inheritance:

template <typename ElementType, size_t Rows>
class ColumnVector : public Matrix<ElementType, Rows, 1>
{
};

Secondly, you could use a class with a nested typedef:

template <typename ElementType, size_t Rows>
struct ColumnVector
{
    typedef Matrix<ElementType, Rows, 1> type;
};

In this case, you could declare a column vector with:

ColumnVector<int, 3>::type vec;

C++11

In C++11, you can use an alias declaration to solve the problem more neatly:

template <typename ElementType, size_t Rows>
using ColumnVector = Matrix<ElementType, Rows, 1>;

This is a good example of how much more powerful the new alias declarations with using are than typedefs.

How to create, launch, and join a thread in C++

Method 1: Boost.Thread

#include <iostream>
#include <string>
#include <boost/thread.hpp>

class Hello
{
public:
	void operator()(const std::string& msg) const
	{
		std::cout << msg << "\n";
	}
};

int main()
{
	boost::thread t = boost::thread(Hello(), "Hello");
	t.join();
}

Method 2: pthreads (POSIX)

#include <iostream>

#include <pthread.h>

void* hello(void *msg)
{
    std::cout << static_cast<const char*>(msg) << "\n";
    return NULL;
}

int main()
{
    pthread_t thread;
    pthread_attr_t attr;
	pthread_attr_init(&attr);
	pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    char msg[] = "Hello";
	int status = pthread_create(&thread, &attr, hello, msg);
	pthread_attr_destroy(&attr);
	if (status != 0) {
        std::cerr << "Failed to create thread\n";
        return 1;
	}
    status = pthread_join(thread, NULL);
    if (status != 0) {
        std::cerr << "Failed to join thread\n";
        return 1;
    }
}

Method 3: Windows threads

#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#endif

#include <Windows.h>
#include <process.h>

#include <iostream>

void hello(void *msg)
{
	std::cout << static_cast<const char*>(msg) << "\n";
}

int main()
{
	char msg[] = "Hello";
	HANDLE thread = reinterpret_cast<HANDLE>(_beginthread(hello, 0, msg));
	if (thread == INVALID_HANDLE_VALUE) {
		std::cerr << "Failed to create thread\n";
		return 1;
	}
	WaitForSingleObject(thread, INFINITE);
}

Method 4: std::thread (C++11)

#include <string>
#include <iostream>
#include <thread>

void hello(const std::string& msg)
{
    std::cout << msg << "\n";
}

int main()
{
    std::thread t(hello, "Hello");
    t.join();
}

String formatting in C++

Method 1: Use vsnprintf() (C++11 and POSIX)

#include <iostream>
#include <vector>
#include <string>
#include <cstdarg>
#include <cstring>

std::string format(const std::string& format, ...)
{
    va_list args;
    va_start (args, format);
    size_t len = std::vsnprintf(NULL, 0, format.c_str(), args);
    va_end (args);
    std::vector<char> vec(len + 1);
    va_start (args, format);
    std::vsnprintf(&vec[0], len + 1, format.c_str(), args);
    va_end (args);
    return &vec[0];
}

int main()
{
    char str[] = "%s => %d";
    std::cout << string_format(str, "apples", 7) << "\n";
}

Method 2: Use Boost.Format

#include <iostream>
#include <boost/format.hpp>

int main()
{
    char str[] = "%s => %d";
    std::cout <<  boost::str(boost::format(str) % "apples" % 7) << "\n";
}

Reference: Boost Format library

Method 3: Use folly::format()

#include <iostream>
#include <folly/Format.h>

int main()
{
    char str[] = "{} => {}";
    std::cout <<  folly::format(str, "apples", 7) << "\n";
}

Reference: folly/Format.h

std::move()

The std::move() operator is a cast that gets an xvalue reference to an object, allowing it to be moved. Moving the object instead of copying it increases efficiency.

For example, consider the following swap() function:

template <typename T>
void swap(T& a, T& b)
{
    T tmp(a);
    a = b;
    b = tmp;
}

When this runs, a is first copied to tmp, then b is copied to a, and finally tmp is copied to b. If a and b were expensive objects to copy, this would be very inefficient.

With move(), we can rewrite the swap() function this way:

#include <utility>
template <typename T>
void swap(T& a, T& b)
{
    T tmp(std::move(a));
    a = std::move(b);   
    b = std::move(tmp);
}

Now if a and b are of a class that has a move constructor and assignment operator, they will be efficiently moved rather than copied.

Finding the size of arrays and containers

Arrays

You can find the size of an array with the following template function:

template <typename T, size_t N>
size_t size(T (&)[N])
{
    return N;
}

Example:

#include <iostream>

int main()
{
    int ints[] = {1, 2, 3, 4};
    const char *pchars[] = {"a", "b", "c", "d"};
 
    std::cout << size(ints) << "\n";
    std::cout << size(pchars) << "\n";
}
4
4

Note that once an array has decayed to a pointer by being passed to a function, this will not work.

Containers

You can extend this to containers with the following two functions, which take template template parameters:

template <template<class, class> class Container, class T, class Allocator>
size_t size(const Container<T, Allocator>& cont)
{
    return cont.size();
}

template <template<class, class, class, class> class Container, class Key, class T, class Compare,
        class Allocator>
size_t size(const Container<Key, T, Compare, Allocator>& cont)
{
    return cont.size();
}

Example:

#include <iostream>
#include <vector>
#include <algorithm>
#include <map>

int main()
{
    std::vector<int> vec(10);
    std::iota(vec.begin(), vec.end(), 0);
    std::map<std::string, int> dict;
    dict["apples"] = 7;
    dict["pears"] = 9;

    std::cout << size(vec) << "\n";
    std::cout << size(dict) << "\n";
}
10
2

C++11

In C++11, you can use decltype to select the container overload, so the two template template functions can be replaced with this:

template <class Container> 
auto size(const Container& cont) -> decltype(cont.size())
{
    return cont.size();
}

C++17

C++17 has added the std::size() function, which probably uses the size_t overload for arrays, and the one using decltype for containers.