Machine Learning models and algorithms

This is a list of machine learning models and algorithms, with links to library implementations.

Contents

AdaBoost

  • Boosting
  • Ensemble

Affinity Propagation

  • Clustering
  • Unsupervised

Apriori

  • Association rules learning

Averaged One-Dependence Estimators (AODE)

  • Bayesian
  • Classification

Averaged One-Dependence Estimators (AODE) with Subsumption Resolution

  • Bayesian
  • Classification

Bagging

  • Averaging
  • Ensemble

Bayesian Logistic Regression

  • Bayesian
  • Regression

Bayesian Network (BN)

  • Bayesian
  • Classification

Bernoulli Naive Bayes

  • Bayesian
  • Classification

C4.5 and C5.0

  • Classification
  • Decision tree
  • Regression
  • Supervised

Chi-squared Automatic Interaction Detection (CHAID)

  • Classification
  • Decision tree
  • Regression
  • Supervised

Classification And Regression Tree (CART)

  • Classification
  • Decision tree
  • Regression
  • Supervised

Conditional Decision Trees

  • Classification
  • Decision tree
  • Regression
  • Supervised

DBSCAN

  • Clustering

Decision Stump

  • Classification
  • Decision tree
  • Regression
  • Supervised

Discriminative Multinomial Naive Bayes

  • Bayesian
  • Classification

Eclat

  • Association rules learning

Elastic Net

  • Regression
  • Regularisation

Expectation Maximisation (EM)

  • Clustering
  • Unsupervised

Feedforward Network

  • Neural network

Gaussian Naive Bayes

  • Bayesian
  • Classification

Gaussian Processes

  • Regression

Gradient Boosted Trees (GBT)

  • Boosting
  • Ensemble

Hidden Naive Bayes

  • Bayesian
  • Classification

Hierarchical Clustering

  • Clustering
  • Unsupervised

Isotonic Regression

  • Regression

Iterative Dichotomiser 3 (ID3)

  • Classification
  • Decision tree
  • Regression
  • Supervised

K-Means

  • Clustering
  • Unsupervised

K-Medians

  • Clustering
  • Unsupervised

K-Nearest Neighbour

  • Classification
  • Instance based

Kernel Perceptron

Learning Vector Quantization (LVQ)

  • Instance based
  • Neural network
  • Unsupervised

Least Absolute Shrinkage And Selection Operator (LASSO)

  • Regression
  • Regularisation

Least Median of Squares Regression

  • Regression

Least-Angle Regression (LARS)

  • Regression
  • Regularisation

Locally Weighted Learning (LWL)

  • Instance based
  • Regression

Logistic Regression

  • Classification

M5

  • Classification
  • Decision tree
  • Regression
  • Supervised

Mean Shift

  • Clustering
  • Unsupervised

Multilayer Perceptron

  • Neural network

Multinomial Naive Bayes

  • Bayesian
  • Classification

Naive Bayes

  • Bayesian
  • Classification

Ordinary Least Squares

  • Regression

Perceptron

  • Neural network

Polynomial Regression

  • Regression

Radial Basis Function (RBF) Networks

  • Instance based
  • Neural network
  • Regression

Random Forest

  • Averaging
  • Ensemble

Recurrent Neural Network

  • Neural network

Ridge Regression

  • Regression
  • Regularisation

Self-Organizing Map (SOM), Kohonnen Network

  • Instance based
  • Neural network
  • Unsupervised

Spectral Clustering

  • Clustering
  • Unsupervised

Support Vector Machine (SVM)

  • Classification
  • Instance based
  • Kernel machine
  • Regression
  • Supervised

Voted Perceptron

  • Neural network

Weightily Averaged One-Dependence Estimators

  • Bayesian
  • Classification


Dynamic array in C

A dynamic array is one that grows to accommodate new items, overcoming the limitations of fixed-size arrays. New items can be added at the head or tail, or inserted in the middle. Existing elements are moved to accommodate new ones as necessary.

Here is an example program:

#include <stdio.h>

#include <dynarray.h>

int main(void)
{
    dynarray *array = dynarray_create(0); /* No preferred size */
    unsigned int i;
    void * data;
    char *elements[] = {"A", "B", "C", "D", "E", "F"};
    const unsigned int n = sizeof(elements) / sizeof(const char*);

    for (i = 0; i < n; i++) {
        if (i % 2 == 0) {
            dynarray_add_head(array, elements[i]);
        }
        else {
            dynarray_add_tail(array, elements[i]);
        }
    }
        
    dynarray_insert(array, 0, "X");                                /* Same as dynarray_addhead */
    dynarray_insert(array, dynarray_get_count(array) / 2, "Y");    /* Insert in the middle */
    dynarray_insert(array, dynarray_get_count(array), "Z");        /* Same as dynarray_add_tail */

    dynarray_set(array, dynarray_get_count(array) / 2, "P");
    dynarray_set(array, dynarray_get_count(array), "Q");           /* Same as dynarray_add_tail */

    for (i = 0; i < dynarray_get_count(array); i++) {
        printf("%d: %s\n", i, (const char*)dynarray_get(array, i));
    }
    putchar('\n');
    
    for (i = 0; dynarray_get_count(array); i++) {
        const unsigned int action = i % 3;
        if (action == 0) {
            data = dynarray_remove_head(array);
        }
        else if (action == 1) {
            data = dynarray_remove_tail(array);
        }
        else {
            data = dynarray_remove(array, dynarray_get_count(array) / 2);
        }
        printf("Removed: %s\n", (const char*)data);
    }

    dynarray_delete(array);

    return 0;
}

The header file:

#ifndef DYNARRAY_H
#define DYNARRAY_H

struct dynarray {
	void ** buffer;
	unsigned int size;
	unsigned int count;
};

typedef struct dynarray dynarray;

typedef void (*dynarray_forfn)(void *);

dynarray * dynarray_create(unsigned int startsize);
void dynarray_empty(dynarray * array);
void dynarray_delete(dynarray * array);
void dynarray_add_tail(dynarray * array, void * data);
void dynarray_add_head(dynarray * array, void * data);
void * dynarray_remove_tail(dynarray * array);
void * dynarray_remove_head(dynarray * array);
void dynarray_insert(dynarray *array, unsigned int pos, void *data);
void * dynarray_remove(dynarray *array, unsigned int pos);
void * dynarray_get(const dynarray * array, unsigned int pos);
void * dynarray_set(dynarray * array, unsigned int pos, void * data);
void dynarray_for_each(const dynarray * array, dynarray_forfn fun);
unsigned int dynarray_get_count(const dynarray * array);
void dynarray_set_size(dynarray * array, unsigned int size);

#endif /* DYNARRAY_H */

The implementation:

#include <stdlib.h>
#include <string.h> /* For memcpy and memmove */

#include <dynarray.h>

#define START_SIZE 4 /* Initial buffer size if not specified */

dynarray * dynarray_create(unsigned int size)
{
	dynarray * array = malloc(sizeof(dynarray));
	if (array != NULL) {
		if (size) {
			array->buffer = malloc(size * sizeof(void*));
			if (array->buffer) {
				array->size = size;
			}
			else {
				free(array);
				array = NULL;
			}
		}
		else {
			array->buffer = NULL;
			array->size = 0;
		}
		array->count = 0;
	}
	return array;
}

void dynarray_empty(dynarray * array)
{
	array->count = 0;
}

void dynarray_delete(dynarray * array)
{
	if (array) {
		free(array->buffer);
		free(array);
	}
}

void dynarray_add_tail(dynarray * array, void * data)
{
	if (array->count == array->size) {
		/* No more space */
		if (array->buffer != NULL) {
			void **buffer = realloc(array->buffer, array->size * 2 * sizeof(void*));
			array->buffer = buffer;
			array->size *= 2;
		}
		else {
			array->buffer = malloc(START_SIZE * sizeof(void*));
			array->size = START_SIZE;
		}
	}
	if (array->buffer != NULL) {
		array->buffer[array->count] = data;
		array->count++;
	}
}

void dynarray_add_head(dynarray * array, void * data)
{
	if (array->count == array->size) {
		/* No more space */
		if (array->buffer != NULL) {
			void **temp = malloc(array->size * 2 * sizeof(void*));
			if (temp) {
				/* Copy the elements one space to the right */
				memcpy(temp + 1, array->buffer, array->count * sizeof(void*));
				free(array->buffer);
				array->buffer = temp;
				array->size *= 2;
			}
		}
		else {
			array->buffer = malloc(START_SIZE * sizeof(void*));
			if (array->buffer) {
				array->size = START_SIZE;
			}
		}
	}
	else {
        /* Move the elements one space to the right */
		memmove(array->buffer + 1, array->buffer, array->count * sizeof(void*));
	}
	if (array->buffer != NULL) {
		array->buffer[0] = data;
		array->count++;
	}
}

void * dynarray_remove_tail(dynarray * array)
{
	void * data = NULL;
	if (array->count > 0) {
		data = array->buffer[array->count - 1];
		array->count--;
	}
	return data;
}

void * dynarray_remove_head(dynarray * array)
{
	void * data = NULL;
	if (array->count > 0) {
		data = array->buffer[0];
        /* Move the elements one space to the left */
        memmove(array->buffer, array->buffer + 1, (array->count - 1) * sizeof(void*));
		array->count--;
	}
	return data;
}

void dynarray_insert(dynarray *array, unsigned int pos, void *data)
{
	if (pos == 0) {
		dynarray_add_head(array, data);
	}
	else if (pos == array->count) {
		dynarray_add_tail(array, data);
	}
	else if (pos < array->count) {
		unsigned int i;
		if (array->count == array->size) {
			/* Reallocate the buffer and copy, leaving a space */
			void **temp = malloc(array->size * 2 * sizeof(void*));
			if (temp) {
				memcpy(temp, array->buffer, pos * sizeof(void*));
				memcpy(temp + pos + 1, array->buffer + pos, (array->count - pos) * sizeof(void*));
				free(array->buffer);
				array->buffer = temp;
				array->size *= 2;
			}
		}
		else {
			/* Move the elements after to the right */
            memmove(array->buffer + pos + 1, array->buffer + pos,
                    (array->count - pos) * sizeof(void*));
		}
		array->buffer[pos] = data;
		array->count++;
	}
}

void * dynarray_remove(dynarray *array, unsigned int pos)
{
	void *data;
	if (array->count < pos + 1) {
		data = NULL;
	}
	else if (pos == 0) {
		data = dynarray_remove_head(array);
	}
	else if (pos == array->count - 1) {
		data = dynarray_remove_tail(array);
	}
	else {
		unsigned int i;
		data = array->buffer[pos];
		/* Move the following elements left */
        memmove(array->buffer + pos, array->buffer + pos + 1,
                (array->count - pos - 1) * sizeof(void*)); 
		array->count--;
	}
	return data;
}

void * dynarray_get(const dynarray * array, unsigned int pos)
{
	void * data = NULL;
	if (pos < array->count) {
		data = array->buffer[pos];
	}
	return data;
}

void * dynarray_set(dynarray * array, unsigned int pos, void * data)
{
	void * temp = NULL;
	if (pos == array->count) {
		dynarray_add_tail(array, data);
	}
	else if (pos < array->count) {
		temp = array->buffer[pos];
		array->buffer[pos] = data;
	}
	return temp;
}

void dynarray_set_size(dynarray * array, unsigned int size)
{
	
	array->buffer = realloc(array->buffer, size);
	if (array->buffer) {
		array->size = size;
		if (array->size < array->count) {
			array->count = array->size;
		}
	}
	else {
		array->size = 0;
		array->count = 0;
	}
}

void dynarray_for_each(const dynarray * array, dynarray_forfn fun)
{
	unsigned int i;

	for (i = 0; i < array->count; i++) {
		fun(array->buffer[i]);
	}
}

unsigned int dynarray_get_count(const dynarray * array)
{
	return array->count;
}

Hash table in C

This is a hash table in C using singly-linked lists for overflow chains.

Here is an example program:

#include <stdio.h>
#include <string.h>

#include <hashtable.h>

int main(void)
{
    hashtable * table;
    const char * result;
    unsigned int e;
    char * elements[] = {"A", "B", "C", "D", "E", "F"};
    const unsigned int n = sizeof(elements) / sizeof(const char*);

    table = hashtable_create(7, (hashtable_cmpfn)strcmp);
    for (e = 0; e < n; e++) {
        hashtable_add(table, elements[e]);
    }
    hashtable_for_each(table, (hashtable_forfn)puts);
    for (e = 0; e < n; e++) {
        result = hashtable_find(table, elements[e]);
        if (result) {
            printf("Found: %s\n", result);
        }
        else {
            printf("Couldn't find %s\n", elements[e]);
        }
    }
    printf("The load factor is %.2f\n", hashtable_get_load_factor(table));
    for (e = 0; e < n; e++) {
        result = hashtable_remove(table, elements[e]);
        if (result) {
            printf("Removed: %s\n", result);
        }
        else {
            printf("Couldn't remove %s\n", elements[e]);
        }
    }
    hashtable_delete(table);

    return 0;
}

The header file:

#ifndef HASHTABLE_H
#define HASHTABLE_H

struct hashnode {
	void * data;
	struct hashnode * next;
};
typedef struct hashnode hashnode;

typedef int (*hashtable_cmpfn)(const void*, const void*);
typedef unsigned long (*hashtable_hashfn)(const void*);
typedef void (*hashtable_forfn)(void *);
typedef void (*hashtable_forfn2)(void *, void *);

struct hashtable {
	unsigned int size;
	hashtable_hashfn hash;
	hashtable_cmpfn compare;
	hashnode **buckets;
	size_t count;
};
typedef struct hashtable hashtable;

hashtable *hashtable_create(size_t size, hashtable_cmpfn compare);
void hashtable_empty(hashtable * table);
void hashtable_delete(hashtable * table);
void *hashtable_add(hashtable * table, void * data);
void *hashtable_find(const hashtable * table, const void * data);
void *hashtable_remove(hashtable * table, const void * data);
float hashtable_get_load_factor(const hashtable * table);
unsigned int hashtable_get_count(const hashtable * table);
unsigned int hashtable_find_count(const hashtable *table);
void hashtable_for_each(const hashtable * table, hashtable_forfn fun);
void hashtable_for_each2(const hashtable * table, hashtable_forfn2 fun, void *data);
void hashtable_set_hashfn(hashtable * table, hashtable_hashfn hash);

#endif /* HASHTABLE_H */

The implementation:

#include <stdlib.h>

#include <hashtable.h>

hashnode * hashnode_create(void * data)
{
	hashnode * node = malloc(sizeof(hashnode));
	if (node) {
		node->data = data;
		node->next = NULL;
	}
	return node;
}

void hashnode_delete(hashnode * node)
{
	free(node);
}

static unsigned long sdbm(const char *str)
{
    unsigned long hash = 0;
    int c;

    while ((c = *str++))
        hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}

hashtable * hashtable_create(size_t size, hashtable_cmpfn compare)
{
	hashtable * table = malloc(sizeof (hashtable));
	if (table) {
		table->size = size;
		table->hash = (hashtable_hashfn)sdbm;
		table->compare = compare;
		table->count = 0;
		table->buckets = malloc(size * sizeof(hashnode *));
		if (table->buckets) {
			unsigned int b;
			for (b = 0; b < size; b++) {
				table->buckets[b] = NULL;
			}
		}
		else {
			free(table);
			table = NULL;
		}
	}
	return table;
}

void hashtable_empty(hashtable * table)
{
	unsigned int i;
	hashnode * temp;
	for (i = 0; i < table->size; i++) {
		hashnode * current = table->buckets[i];
		while (current != NULL) {
			temp = current->next;
			hashnode_delete(current);
			current = temp;
		}
        table->buckets[i] = NULL;
	}
	table->count = 0;
}

void hashtable_delete(hashtable * table)
{
	if (table) {
		hashtable_empty(table);
		free(table->buckets);
		free(table);
	}
}

void * hashtable_add(hashtable * table, void * data)
{
	const unsigned int bucket = table->hash(data) % table->size;
	void * found = NULL;

	if (table->buckets[bucket] == NULL) {
		/* An empty bucket */
		table->buckets[bucket] = hashnode_create(data);
	}
	else {
        unsigned int added = 0;
		hashnode * current, * previous = NULL;
		for (current = table->buckets[bucket]; current != NULL && !found && !added; current = current->next) {
			const int result = table->compare(current->data, data);
			if (result == 0) {
				/* Changing an existing entry */
				found = current->data;
				current->data = data;
			}
			else if (result > 0) {
				/* Add before current */
				hashnode * node = hashnode_create(data);
				node->next = current;
				if (previous == NULL) {
					/* Adding at the front */
					table->buckets[bucket] = node;
				}
				else {
					previous->next = node;
				}
                added = 1;
			}
            previous = current;
		}
		if (!found && !added && current == NULL) {
			/* Adding at the end */
			previous->next = hashnode_create(data);
		}
	}
	if (found == NULL) {
		table->count++;
	}

	return found;
}

void * hashtable_find(const hashtable * table, const void * data)
{
	hashnode * current;
	const unsigned int bucket = table->hash(data) % table->size;
	void * found = NULL;
	unsigned int passed = 0;
	for (current = table->buckets[bucket]; current != NULL && !found && !passed; current = current->next) {
		const int result = table->compare(current->data, data);
		if (result == 0) {
			found = current->data;
		}
		else if (result > 0) {
			passed = 1;
		}
	}
	return found;
}

void * hashtable_remove(hashtable * table, const void * data)
{
	hashnode * current, * previous = NULL;
	const unsigned int bucket = table->hash(data) % table->size;
	void * found = NULL;
	unsigned int passed = 0;

	current = table->buckets[bucket];
	while (current != NULL && !found && !passed) {
		const int result = table->compare(current->data, data);
		if (result == 0) {
			found = current->data;
			if (previous == NULL) {
				/* Removing the first entry */
				table->buckets[bucket] = current->next;
			}
			else {
				previous->next = current->next;
			}
			hashnode_delete(current);
			table->count--;
		}
		else if (result > 0) {
			passed = 1;
		}
		else {
			previous = current;
	 		current = current->next;
		}
	}
	return found;
}


float hashtable_get_load_factor(const hashtable * table)
{
	unsigned int touched = 0;
	float loadfactor;
	unsigned int b;
	for (b = 0; b < table->size; b++) {
		if (table->buckets[b] != NULL) {
			touched++;
		}
	}
	loadfactor = (float)touched / (float)table->size;
	return loadfactor;
}

unsigned int hashtable_get_count(const hashtable * table)
{
	return table->count;
}

unsigned int hashtable_find_count(const hashtable *table)
{
	unsigned int b;
	const hashnode *node;
	unsigned int count = 0;
	for (b = 0; b < table->size; b++) {
		for (node = table->buckets[b]; node != NULL; node = node->next) {
			count++;
		}
	}
	return count;
}

void hashtable_for_each(const hashtable * table, hashtable_forfn fun)
{
	unsigned int b;

	for (b = 0; b < table->size; b++) {
		const hashnode *node;
		for (node = table->buckets[b]; node != NULL; node = node->next) {
			fun(node->data);
		}
	}
}


void hashtable_for_each2(const hashtable * table, hashtable_forfn2 fun, void *data)
{
	unsigned int b;

	for (b = 0; b < table->size; b++) {
		const hashnode *node;
		for (node = table->buckets[b]; node != NULL; node = node->next) {
			fun(node->data, data);
		}
	}
}

void hashtable_set_hashfn(hashtable * table, hashtable_hashfn hash)
{
    table->hash = hash;
}

Open source XQuery implementations

These are all of the open source XQuery implementations I’m aware of, as described in their own words.

BaseX

Java

BaseX is a very light-weight, high-performance and scalable XML Database engine and XPath/XQuery 3.0 Processor, including full support for the W3C Update and Full Text extensions. An interactive and user-friendly GUI frontend gives you great insight into your XML documents.

  • High-performance database storage with text, attribute, full-text and path indexes.
  • Efficient support of the W3C XPath/XQuery Recommendations and Full Text and Update Extensions.
  • One of the highest available compliance rates for all supported specifications.
  • Client/Server architecture, supporting ACID safe transactions, user management, logging.
  • Highly interactive visualizations, supporting very large XML documents.
  • The only realtime XQuery editor available, with syntax highlighting and error feedback.
  • Wide range of interfaces: REST/RESTXQ, WebDAV, XQJ, XML:DB; clients in many different languages

Berkeley DB XML

C

Oracle Berkeley DB XML is an open source, embeddable XML database with XQuery-based access to documents stored in containers and indexed based on their content. Oracle Berkeley DB XML is built on top of Oracle Berkeley DB and inherits its rich features and attributes.

  • Built on top of Oracle Berkeley DB, inheriting all its rich features, such as transactions and replication
  • Provides a document parser, XML indexer, and XQuery engine on top of Oracle Berkeley DB to enable the fastest, most efficient retrieval of data
  • Applications perform database administration, eliminating the need for a DBA and allowing continuous, unattended operation
  • Lowers total cost of ownership with extreme performance to reduce hardware costs

eXist-db

Java

Schema-less Database

The high-performance native XML database engine stores textual or binary data and documents without requiring a database schema.

Rapid Prototyping

Upload your data and start writing code immediately.

Application Packages

eXist-db applications are packaged as single archive files that are installed directly in the database. Deployment, upgrading to new versions and distribution become a breeze.

Open

eXist-db is fully based upon Open Standards and Open Source making it a future-proof and substainable choice.

Browser-based IDE

A Browser-based IDE allows managing and editing all artifacts belonging to an application. Syntax-coloring, code-completion and error-checking help to get it right.

Forms Framework

Being a complete solution, eXist-db tightly integrates with XForms for complex form development.

Rich Stack of Libraries

Develop entire applications in XQuery using eXist-db’s rich set of libraries.

Community Driven

Being Open Source since 2001, eXist-db development has always been driven by the needs of a large user community.

Galax

OCaml

Galax is an open-source implementation of XQuery, the W3C XML Query Language. It includes several advanced extensions for XML updates, scripting, and distributed programming.

Galax comes with a state of the art compiler and optimizer. Most of Galax’s architecture is formally documented, making it ideal for users interested in teaching XQuery, in building new language extensions, or developing new optimizations.

Galax is implemented in O’Caml, a fast, modern, type-inferring, functional programming language descended from the ML (Meta Language) family.

Key Features.

  • XQuery 1.0 and XPath 2.0 support with 99.4% conformance.
  • XQuery type checking, modules, and XML schema import.
  • XQuery Updates, XQuery Scripting.
  • Web Services, and Distributed XQuery.

GCX

C++

The G(arbage) C(ollected) X(Query) engine is an in-memory XQuery engine, which is the first streaming XQuery engine that implements active garbage collection, a novel buffer management strategy in which both static and dynamic analysis are exploited. This technique actively purges main memory buffers at runtime based on the current status of query evaluation. In our paper, we show the various stages in evaluating composition-free XQuery with the GCX engine. Our technique has a significant impact on reducing both main memory consumption and query execution time, as can be seen in our experiments.

HXQ

Haskell

HXQ is a fast and space-efficient translator from XQuery (the standard query language for XML) to embedded Haskell code. The translation is based on Template Haskell. HXQ takes full advantage of Haskell’s lazy evaluation to keep in memory only those parts of XML data needed at each point of evaluation, thus performing stream-based evaluation for forward queries (queries that do not contain backward steps). This results to an implementation that is as fast and space-efficient as any stream-based implementation based on SAX filters or finite state machines. Furthermore, the coding is far simpler and extensible since it is based on XML trees, rather than SAX events. Since HXQ uses lazy evaluation, you get the first results of non-blocking queries immediately, while the non-streaming XQuery processors must first parse the entire input file and construct the whole XML tree in memory before they produce any output.

Finally, HXQ can store XML documents in a relational database (currently MySQL or SQLite), by shredding XML into relational tuples, and by translating XQueries over the shredded documents into optimized SQL queries. The mapping to relational tables is based on the document’s structural summary, which is derived from the document data rather than from a schema. It uses hybrid inlining to inline attributes and non-repeating elements into a single table, thus resulting to a compact relational schema. For each such mapping, HXQ synthesizes an XQuery that reconstructs the original XML document from the shredded data. This XQuery is fused with the user queries using partial evaluation techniques and parts of the resulting query are mapped to SQL queries using code folding rules so that all relevant predicates are promoted to SQL. This pushes most evaluation to the database query engine, thus resulting to a fast execution over large data sets.

MXQuery

Java

MXQuery is a low-footprint, extensible Open-Source XQuery Engine implemented in Java
Besides a high level of compliance with XQuery 1.0, XQuery Update Facility 1.0, XPath/XQuery Fulltext 1.0, it provides a wide coverage of upcoming W3C standards proposals (XQuery 3.0, Scripting), support for a wide range of Java Platforms (including mobile/embedded devices), cross-compilation to Javascript (for browser usage) and support for data stream processing/CEP .

Nux

Java

Nux is an open-source Java toolkit making efficient and powerful XML processing easy. It is geared towards embedded use in high-throughput XML messaging middleware such as large-scale Peer-to-Peer infrastructures, message queues, publish-subscribe and matchmaking systems for Blogs/newsfeeds, text chat, data acquisition and distribution systems, application level routers, firewalls, classifiers, etc.

Have you ever tried to take advantage of a robust and natural commodity Java tool set for XML, XQuery, XPath, schema validation, binary XML, fuzzy fulltext similarity search and related technologies, yet were not ready to accept a significant performance penalty? Chances are most tool sets turned out not to be particularly robust and natural, that they incurred dramatic penalties when used in straightforward manners, and that their complex idiosyncracies had a strong tendency to distract from the real job and use cases you wanted to get done in a timely manner.

Nux helps to avoid XML nightmares, enabling you to mix and match powerful main-memory XML tools in natural, straightforward, seamless, effective and standards compliant manners.

Nux reliably processes whatever data fits into main memory (even, say, 250 MB messages), but it is not an XML database system, and does not attempt to be one. Nux integrates best-of-breed components, containing extensions of the XOM, Saxon and Lucene open-source libraries.

OrientX

Java

OrientX is a native XML database system,developed under Renmin University of China.
The word ‘OrientX’ is an abbreviation of Original RUC IDKE Native XML.
The name is pronouced orient-X.
OrientX system stores XML data and preserves its tree structure.
It also allows users to retrieve XML data in the form of XPath/XQuery query language.

Patternist

C++

Patternist is an XPath 2.0, XQuery 1.0 and XSL-T 2.0 implementation, licensed under the GNU LGPL license.

Development priority is:

  • Conformance and interoperability
  • The HCI aspect, that it is user friendly and has good usability
  • Clean, compact implementation that compiles without warnings and is well documented

PHP XML Classes

PHP

A collection of classes and resources to process XML using PHP

Qexo

Java

Qexo is a partial implementation of the XML Query language. It achieves high performance because a query is compiled down to Java bytecodes using the Kawa framework.
Kawa also includes a proof-of-concept implementation of XSLT.

Rainbow

Java

As more and more data becomes available in XML format, a general purpose XML repository management system (XRMS) is needed. Rather than developing it from scratch, there is great interest to exploit existing relational database technology as backend engine to store, retrieve, and query XML data set due to its maturity and performance.

However, due to the mismatch between the complexity of semi-structured XML data model and the simplicity of flat relational model, there are many ways to store one document in a relational database, and a number of heuristic techniques need to overcome. These include the data model mapping strategies, the transformation of XML queries into SQL queries, XML order handling, XML update propagation, XML query optimization, and XML indexing.

The Rainbow System we are proposing is based on a flexible strategy of mapping XML to the relational data model using generic XQuery loading statements, supports optimized XML order-sensitive query processing via an XML Query Algebra, XQuery to SQL translation, and serve as a solid yet scalable foundation for extended XML-based applications.

Specific next research goals within the larger Rainbow project that we are targeting include Update Query Processing through the XML virtual Views, Incremental XML view maintenance, XQuery Multiple Query Optimization, and XML Query Optimization by exploiting Materialized XML Views.

Sedna

C and C++

Sedna is a free native XML database which provides a full range of core database services – persistent storage, ACID transactions, security, indices, hot backup. Flexible XML processing facilities include W3C XQuery implementation, tight integration of XQuery with full-text search facilities and a node-level update language.

XBird

Java

XBird is a light-weight XQuery processor and database system written in Java. The light-weight means reasonably fast and embeddable.

Features

XBird introduces the following features:

  • XQuery Processor
  • Native XML Database Engine
  • Embeddable Database Engine
  • Distributed XQuery Processor
  • Support for HTML web page scraping

XBird is currently optimized for read-oriented workloads. It passes about 91% of the minimal conformance of XQuery Test Suite.

Xidel

Pascal

Xidel is a command line tool to download and extract data from HTML/XML pages as well as JSON APIs.

xq2xml

Java and XSLT

This distribution consists of a set of XSLT2 stylesheets to manipulate XQuery Expressions and express the query using other syntaxes. Currently stylesheets converting to XQueryX and XSLT are supplied. Also stylesheets to transform Xquery expressions, firstly an identity transform and secondly a stylesheet to remove any use of axes not supported in systems that do not implement the full axis feature of XQuery.

As XSLT requires XML input, or at least an XML view of non-XML input, and XQuery does not use an XML syntax, the stylesheet distribution is augmented by a Java based parser that parses an XQuery expression (or in its most general form, a series of XQuery modules) and returns an XML document. This parser is a trivial (10 or so line) wrapper around the Java class provided by Scott Boag on behalf of the the XQuery/XSLT working groups as part of the test parser applet distribution. (Scott has kindly included this functionality now in the parser distribution, so a separate wrapper is no longer needed).

The XQuery working group also provides an XQuery test suite, and this distribution contains (for xq2xqx) a set of test files converted to XQueryX syntax) and (for xq2xsl) a set of test files in XSLT syntax, and a test report in the official test report syntax for xq2xsl once coupled to an XSLT2 execution system (assumed to be Saxon8, for this release) considered as a new implementation of XQuery. The stylesheets and auxiliary Java code to process the test suite are also provided.

XQEngine

Java

XQEngine is a full-text search engine for XML documents. Utilizing XQuery as its front-end query language, it lets you interrogate collections of XML documents for boolean combinations of keywords, much as Google and other search engines let you do for HTML. XQuery, however, provides much more powerful search capabilities than equivalent HTML-based engines, since its XPath component lets you specify constraints on attributes and element hierarchies, in addition to the specific word content you’re searching on. Refer to the W3C’s XML Query website to see what the W3C and other vendors are doing with XQuery and XPath.

XQEngine is a compact (roughly 300K) embeddable component written in Java. It’s not a standalone application and requires a reasonable amount of Java programming skill to use. It has a straightforward programming interface that makes that fairly easy to do. It should work well as a personal productivity tool on a single desktop, as part of a CD-based application, or on a server with low to moderate traffic.

XQP

Java

XQP is a dynamic and scalable architecture for indexing and querying a large distributed repository of schema-less XML data, implemented on top of a structured peer-to-peer (P2P) system (Pastry). Unlike other approaches, XQP can process most forms of XQuery extended with full-text search, even those queries that search for multiple documents that are related through join conditions. The indexing is based on both the text terms and the structural summary of a document. Given an XQuery, our system can find all the plausible structural summaries applicable to the query using one peer lookup only. Each structural summary is associated with a small, dynamically adapting sub-space of peers who share the inverted lists related to all the documents that conform to this particular structural summary. Peers may participate in multiple sub-spaces, while the size of each sub-space may grow and shrink dynamically, depending on the workload. To locate multiple documents that are related through join conditions, XQP uses value histograms distributed over the P2P network.

XQuare

Java

The XQuare project provides a set of Java components for extending J2EE platforms with XML-based, heterogeneous information integration capabilities, using the XQuery language. The XQuare components allow enterprise information stored in a large variety of data sources to be accessed in real-time and shared as uniform XML views, ready for further filtering, processing and publishing. Incoming XML data can also be stored directly in relational databases, without the need to transform it first into procedural API calls (such as JDBC) or Java objects.

The XQuare components are designed to be embedded into Java-based Web or application servers, and rely on the standard J2EE services for exchanging, processing and publishing XML information. By adding to those services a new configurable data integration service, able to collect in real-time business information in distributed, heterogeneous data sources, the XQuare components can help dramatically reduce development costs for applications such as enterprise portals, B2B information exchange automation, database publishing…

XQuilla

C++

XQilla is an XQuery and XPath 2 library and command line utility written in C++, implemented on top of the Xerces-C library. It is made available under the terms of the Apache License v2.

Zorba

C++

Modern Query Processing

By providing a high level query language, Zorba allows you to process large amounts of data productively and. For example, it supports complex SQL-like queries such as (outer-)joins, grouping, sorting, transformation, full-text, indexes, and declarative updates. Zorba brings powerful queries database users take for granted to NoSQL without giving up on scalability.

Designed for Flexible Data: JSON and XML

Our query language is not your grandma’s SQL. It supports novel concepts purposely designed for flexible, schema-less data. The language covers the entire spectrum, from the simplicity of JSON to the completeness of XML.

Advanced Data Processing Libraries

Zorba users have immediate access to advanced, out of the box, data processing libraries. Built-in functions empower queries to use advanced math processing, string matching, data cleaning, or full-text features. Beyond that, conversion (e.g. CSV, PDF, HTML, ZIP), communication (e.g. HTTP, SMTP, IMAP), and connector (e.g. SQLLite, JDBC, CouchBase) libraries allow the user to productively process the Web’s data.

Pluggable Store

Zorba allows for plug-in alternative store implementations under the hood of the query processor. This enables users to seamlessly process data stored in different places: main memory, browser, disk, or cloud stores.

Graph data structures

Contents

Introduction

In this post, I introduce the concept of a graph and describe some ways of representing graphs in C.

Definitions

Graphs, vertices and edges

A graph is a collection of nodes called vertices, and the connections between them, called edges.

Undirected and directed graphs

When the edges in a graph have a direction, the graph is called a directed graph or digraph, and the edges are called directed edges or arcs.
Here, I shall be exclusively concerned with directed graphs, and so when I refer to an edge, I mean a directed edge.
This is not a limitation, since an undirected graph can easily be implemented as a directed graph by adding edges between connected vertices in both directions.

A representation can often be simplified if it is only being used for undirected graphs, and I’ll mention in passing how this can be achieved.

Neighbours and adjacency

A vertex that is the end-point of an edge is called a neighbour of the vertex that is its starting-point.
The first vertex is said to be adjacent to the second.

An example

The following diagram shows a graph with 5 vertices and 7 edges.
The edges between A and D and B and C are pairs that make a bidirectional connection, represented here by a double-headed arrow.

Graph showing 5 vertices labelled A to E

Mathematical definition

More formally, a graph is an ordered pair, G = <V, A>, where V is the set of vertices, and A, the set of arcs, is itself a set of ordered pairs of vertices.

For example, the following expressions describe the graph shown above in set-theoretic language:

V = {A, B, C, D, E}
A = {<A, B>, <A, D>, <B, C>, <C, B>, <D, A>, <D, C>, <D, E>}

Functions

A graph implementation needs a basic set of functions to assemble and modify graphs, and to enumerate vertices, edges and neighbours.

The following functions are provided by each representation.
These are the declarations for the intuitive representation, graph1:

graph1 *graph1_create(void);
Create an empty graph
void graph1_delete(graph1 *graph);
Delete a graph
vertex *graph1_add(graph1 *graph, const char *name, void *data);
Add a vertex to the graph with a name, and optionally some data
vertex *graph1_get_vertex(const graph1 *graph, const char *name);
Retrieve a vertex by name
void *graph1_remove(graph1 *graph, vertex *vertex);
Remove a vertex
void graph1_add_edge(graph1 *graph, vertex *vertex1, vertex *vertex2);
Create a directed edge between vertex1 and vertex2
void graph1_remove_edge(graph1 *graph, vertex *vertex1, vertex *vertex2);
Remove the directed edge from vertex1 to vertex2
unsigned int graph1_get_adjacent(const graph1 *graph, const vertex *vertex1, const vertex *vertex2);
Determine if there is an edge from vertex1 to vertex2
iterator *graph1_get_neighbours(const graph1 *graph, const vertex *vertex);
Get the neighbours of a vertex
iterator *graph1_get_edges(const graph1 *graph);
Get all of the edges in the graph
iterator *graph1_get_vertices(const graph1 *graph);
Get all of the vertices in the graph
unsigned int graph1_get_neighbour_count(const graph1 *graph, const vertex *vertex);
Get the count of neighbours of a vertex
unsigned int graph1_get_edge_count(const graph1 *graph);
Get the count of edges in the graph
unsigned int graph1_get_vertex_count(const graph1 *graph);
Get the count of vertices in the graph

Representation of vertices and edges

Vertices

All of the graph representations use the following definition of a vertex:

typedef struct {
    char *name;
    void *data;
    void *body;
    deletefn del;
} vertex;

Note the body field, which is not of interest to clients, but is used by some representations (Adjacency List and Incidence List) to add per-vertex strucure.

The following functions are provided for working with vertices:

const char *vertex_get_name(const vertex *v);
Get the vertex’s name
void *vertex_get_data(const vertex *v);
Get the data associated with a vertex

Edges

How edges are implemented internally varies with the representation.
In fact, in three representations, Adjacency List, Adjacency Matrix and Incidence Matrix, edges do not exist internally as objects at all.
From the viewpoint of clients however, edges, as enumerated by the iterator returned by the function to retrieve edges, are this structure:

typedef struct {
    vertex *from;
    vertex *to;
} edge;

The following functions are provided for working with edges:

const vertex *edge_get_from(const edge *e);
Get the vertex that is the starting-point of an edge
const vertex * edge_get_to(const edge *e);
Get the vertex that is the end-point of an edge

Example program

The following program constructs the graph shown in the introduction using the intuitive representation, graph1, and then enumerates the vertices, neighbours and edges:

#include <stdio.h>

#include <graph1.h>

int main(void)
{
    graph1 *graph;
    vertex *v;
    vertex *A, *B, *C, *D, *E;
    iterator *vertices, *edges;
    edge *e;

    /* Create a graph */
    graph = graph1_create();

    /* Add vertices */
    A = graph1_add(graph, "A", NULL);
    B = graph1_add(graph, "B", NULL);
    C = graph1_add(graph, "C", NULL);
    D = graph1_add(graph, "D", NULL);
    E = graph1_add(graph, "E", NULL);

    /* Add edges */
    graph1_add_edge(graph, A, B);
    graph1_add_edge(graph, A, D);
    graph1_add_edge(graph, B, C);
    graph1_add_edge(graph, C, B);
    graph1_add_edge(graph, D, A);
    graph1_add_edge(graph, D, C);
    graph1_add_edge(graph, D, E);

    /* Display */
    printf("Vertices (%d) and their neighbours:\n\n", graph1_get_vertex_count(graph));
    vertices = graph1_get_vertices(graph);
    while ((v = iterator_get(vertices))) {
        iterator *neighbours;
        vertex *neighbour;
        unsigned int n = 0;
        printf("%s (%d): ", vertex_get_name(v), graph1_get_neighbour_count(graph, v));
        neighbours = graph1_get_neighbours(graph, v);
        while ((neighbour = iterator_get(neighbours))) {
            printf("%s", vertex_get_name(neighbour));
            if (n < graph1_get_neighbour_count(graph, vertex) - 1) {
                fputs(", ", stdout);
            }
            n++;
        }
        putchar('\n');
        iterator_delete(neighbours);
    }
    putchar('\n');
    iterator_delete(vertices);
    printf("Edges (%d):\n\n", graph1_get_edge_count(graph));
    edges = graph1_get_edges(graph);
    while ((e = iterator_get(edges))) {
        printf("<%s, %s>\n", vertex_get_name(edge_get_from(e)), vertex_get_name(edge_get_to(e)));
    }
    putchar('\n');
    iterator_delete(edges);

    /* Delete */
    graph1_delete(graph);

    return 0;
}

Graph representations

There are essentially 5 ways of representing a graph:

The intuitive representation: graph1

What I call the "intuitive" and can also called the "object-oriented" representation is a direct translation of the mathematical definition of a graph into a data type:

typedef struct {
    set *vertices;
    set *edges;
} graph1;
  • Adding a vertex simply requires adding it to the vertex set.
  • Adding an edge simply requires adding it to the edge set.
  • Removing vertices and edges simply means removing them from the respective sets.
  • To find a vertex’s neighbours, search the edge set for edges having the vertex as the from field.
  • To determine if two vertices are adjacent, search the edge set for an edge having the first vertex as its from field, and the second vertex as its to field.
  • Getting all of the edges is easy; just return an iterator over the edge set.
  • For undirected graphs, each edge would be stored only once, and getting neighbours and adjacency testing would look at both vertices.

    The edge object would not be from and to but simply first and second, i.e., an unordered pair.

  • This is one of the representations where edges exist internally as objects (Incidence List is the other).
  • This is most like a sparse Adjacency Matrix, with the edge set holding those pairs that are adjacent, and non-adjacent pairs being absent.

    Adjacency List: graph2

    The graph is made up of a set of vertices.
    Each vertex contains a set of vertices for its neighbours.

    typedef struct {
        set *vertices;
    } graph2;
    
    
    typedef struct {
        set *neighbours;
    } vertex_body;
    

    For the graph shown in the introduction, the sets of neighbours would look like this:

    A: {B, D}
    B: {C}
    C: {B}
    D: {A, C, E}
    E: {}
    
    • Adding a vertex just means adding it to the vertex set.
    • Adding an edge means adding the end-point of it to the starting vertex’s neighbour set.
    • It is easy to go from a vertex to its neighbours, because the vertex stores them all.

      Just return an iterator over them.

      This makes the graph argument in the function to retrieve neighbours unnecessary in this implementation.

    • Testing for adjacency is easy; just search the first vertex’s neighbours for the second vertex.
    • Getting all edges is more difficult to implement in this representation, because edges don’t exist as objects.

      You need to iterate over the neighbours of each vertex in turn, and construct the edge from the vertex and the neighbour.

    Adjacency Matrix: graph3

    The graph is made up of a set of vertices and a matrix, whose rows and columns are indexed by vertices, and which contains a 1 entry if the vertices are connected.

    typedef struct {
        set    *vertices;
        matrix *edges;
    } graph3;
    

    The adjacency matrix for the graph shown in the introduction would look like this:

    \(
    \begin{array}{c c} &
    \begin{array}{c c c} A & B & C & D & E \\
    \end{array}
    \\
    \begin{array}{c c c}
    A\\
    B\\
    C\\
    D\\
    E\\
    \end{array}
    &
    \left[
    \begin{array}{c c c}
    0 & 1 & 0 & 1 & 0 \\
    0 & 0 & 1 & 0 & 0 \\
    0 & 1 & 0 & 0 & 0 \\
    1 & 0 & 1 & 0 & 1 \\
    0 & 0 & 0 & 0 & 0
    \end{array}
    \right]
    \end{array}
    \)

    • When adding a vertex, add a row and column to the matrix.
    • When removing a vertex, remove its row and column.
      As adding and removing rows and columns is expensive, these make the adjacency matrix unsuitable for graphs in which vertices are frequently added and removed.

    • Adding and removing edges is easy however, and requires no allocation or deallocation of memory, just setting a matrix element.
    • To get neighbours, look along the vertex’s row for 1s.
    • To determine adjacency, look for a 1 at the intersection of the first vertex’s row and the second vertex’s column.
    • To get the edge set, find all of the 1s in the matrix and construct the edges from the corresponding vertices.
    • If the graph is undirected, the matrix will be symmetrical about the main diagonal.
      This means that you can drop half of it, making a triangular matrix.

    • The vertex set needs to be ordered so that the index number of vertices can be looked up, or the matrix needs to be a 2-d map keyed by the vertices themselves.
    • Memory used for edges is a constant |V|2.
      The best use of this is a graph that is nearly complete, i.e., has a lot of edges.

    • The matrix can be sparse; this relates the memory usage more closely to the number of edges.
      It also makes addition and removal of columns easier (no block shifts), but requires renumbering afterwards.

    • You can use booleans or bits in the matrix to save memory.

    Incidence Matrix: graph4

    The graph is made up of a set of vertices and a matrix, as in Adjacency Matrix, but the matrix is vertices × edges, with each column containing two non-zero entries, one for the starting-point vertex and one for the end-point.

    typedef struct {
        set    *vertices;
        matrix *edges;
    } graph4;
    

    The incidence matrix for the graph shown in the introduction looks like this (1 means "from" and 2 means "to"):

    \(
    \begin{array}{c c} &
    \\
    \begin{array}{c c c}
    A\\
    B\\
    C\\
    D\\
    E\\
    \end{array}
    &
    \left[
    \begin{array}{c c c}
    1 & 1 & 0 & 0 & 2 & 0 & 0 \\
    2 & 0 & 1 & 2 & 0 & 0 & 0 \\
    0 & 0 & 2 & 1 & 0 & 2 & 0 \\
    0 & 2 & 0 & 0 & 1 & 1 & 1 \\
    0 & 0 & 0 & 0 & 0 & 0 & 2 \\
    \end{array}
    \right]
    \end{array}
    \)

    • When you add a vertex, you add a row to the matrix.
    • When you add an edge, you add a column to the matrix.
    • When you remove a vertex, you need to remove all of the columns containing the vertex from the matrix.
    • Getting the edges means iterating over the columns and constructing the edges from the two values.
    • To find neighbours, look for 1s in the vertex’s row, and in each such column look for the 2 value, which is the neighbour.
    • To determine adjacency, find a column containing a 1 in the starting-point vertex’s row, and a 2 in the end-point’s row.
    • For an undirected graph, you have one column per edge, and just the value 1 for "connected", so each column contains two 1s.

    Incidence List: graph5

    There is a set of vertices as in Adjacency List, but each vertex stores a list of the edges that it is the starting-point of, rather than neighbours.

    typedef struct {
        set * vertices;
    } graph5;
    
    typedef struct {
        set *edges;
    } vertex_body;
    

    For the graph shown in the introduction, the sets of edges would look like this:

    A: {<A, B>, <A, D>}
    B: {<B, C>}
    C: {<C, B>}
    D: {<D, A>, <D, C>, <D, E>}
    E: {}
    
    • Adding a vertex just means adding it to the vertex set.
    • Adding an edge means adding it to its starting vertex’s edge set.
    • Finding if two vertices are adjacent requires searching the first vertex’s edge set for an edge containing the second vertex as its to field.
    • Getting the neighbours requires retrieving them from the pairs in the set of edges for the vertex.
    • Getting the edge set requires enumerating each of the vertices’ edge sets in turn.
    • You can store the edges in the graph object as well as in each vertex.
  • Boggle in Python

    This is a solver for the game “Boggle”. The aim of the game is to find as many words as possible in a 4-by-4 grid randomly filled with letters. You are allowed to go up, down, left, right, or diagonally, but not use the same letter more than once.

    First, here’s a prefix tree, which is the ideal structure for looking up words one letter at a time:

    class PrefixTree(object):
        def __init__(self, letter=None):
            self.letter = letter
            self.children = {}
            self.stop = False
    
        def add(self, word):
            if len(word):
                letter = word[0]
                word = word[1:]
                if letter not in self.children:
                    self.children[letter] = PrefixTree(letter);
                return self.children[letter].add(word)
            else:
                self.stop = True
                return self
    
        def find_letter(self, letter):
            if letter not in self.children:
                return None
            return self.children[letter]
    
        def __repr__(self):
            return "PrefixTree(letter={0}, stop={1})".format(self.letter, self.stop)
    

    Here’s the code for the game solver itself. It simply searches recursively starting at each cell in the game, and looks up the sequences of letters it builds up in the prefix tree:

    from random import choice
    from prefixtree import PrefixTree
    
    class Boggle(object):
        def __init__(self, board=None):
            self.size = 4
            if board is None:
                self.board = []
                for i in range(0, self.size):
                    self.board.append([])
                    for j in range(0, self.size):
                        self.board[i].append(Boggle.random_letter())
            else:
                self.board = board
    
        @staticmethod
        def random_letter():
            return chr(choice(range(65,91)))
    
        def play(self, tree, found):
            for r in range(0, self.size):
                for c in range(0, self.size):
                    self.search_r(tree, found, r, c)
    
        def search_r(self, tree, found, row, col, path=None, node=None, word=None):
            letter = self.board[row][col]
            if node is None or path is None or word is None:
                node = tree.find_letter(letter)
                path = [(row, col)]
                word = letter
            else:
                node = node.find_letter(letter)
                path.append((row, col))
                word = word + letter
            if node is None:
                return
            elif node.stop:
                found.add(word)
            for r in range(row - 1, row + 2):
                for c in range(col - 1, col + 2):
                    if (r >= 0 and r < self.size
                            and c >= 0 and c < self.size 
                            and not (r == row and c == col) 
                            and (r, c) not in path):
                        self.search_r(tree, found, r, c, path[:], node, word[:])
    
        def __repr__(self):
            return "Boggle(size={0}, board={1})".format(self.size, self.board)
    

    I used this dictionary as a source of words. You may want to use a slightly smaller one.
    Here is the main program:

    def load_tree(tree):
        with open('dictionary.txt') as f:
            for line in f:
                word = line.rstrip().upper()
                tree.add(word)
    
    def main():
        boggle = Boggle()
        tree = PrefixTree()
        load_tree(tree)
        found = set()
        boggle.play(tree, found)
        for word in sorted(found):
            print word
        print boggle
    
    if __name__ == '__main__':
        main()
    

    Reverse an array without affecting special characters

    I found this little programming challenge here. The task is to take a string and reverse only the alphabetic characters in it, leaving any other characters unaffected.
    So, given the string:

    a,b$c
    

    The program should output:

    c,b$a
    

    I think that the best method is to walk a pair of pointers inwards from both ends of the string at the same time, examining the characters they point to. If the character being examined at either end is not a letter, keep going inwards. Once both characters are letters, swap them. Finish when the pointers meet in the middle.

    Here is my implementation in C:

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    
    void swap(char *a, char *b)
    {
        char temp = *a;
        *a = *b;
        *b = temp;
    }
    
    void reverse(char *str)
    {
        char *p1 = str;
        char *p2 = str + strlen(str) - 1;
        while (p1 < p2) {
            while (!isalpha(*p1)) p1++;
            while (!isalpha(*p2)) p2--;
            if (p1 < p2) {
                swap(p1, p2);
                p1++;
                p2--;
            }
        }
    }
    
    int main(void)
    {
        char str[6];
        strcpy(str, "a,b$c");
        reverse2(str);
        printf("%s\n", str);
        return 0;
    }
    

    While I was writing this it occurred to me that it would be nice if you could have a filtered view into the string that hid the special characters, and then you could just reverse what you can see. Then I remembered boost::filter_iterator, and found that it can do just that, so here’s my C++ version using filter_iterator from Boost:

    #include <iostream>
    #include <algorithm>
    #include <cstring>
    #include <boost/iterator/filter_iterator.hpp>
    
    struct is_alpha
    {
        bool operator()(char c)
        {
            return std::isalpha(c);
        }
    };
    
    void reverse(char *str)
    {
        const size_t len = std::strlen(str);
        typedef boost::filter_iterator<is_alpha, char*> FilterIter;
        is_alpha predicate;
        FilterIter first(predicate, str, str + len);
        FilterIter last(predicate, str + len, str + len);
        std::reverse(first, last);
    }
    
    int main()
    {
        char str[6];
        strcpy(str, "a,b$c");
        reverse(str);
        std::cout << str << "\n";
    }
    

    Reference: Filter Iterator

    Finding the minimum depth of a binary tree recursively

    This is the counterpart of finding the height of a binary tree recursively. We start with a depth of 0, and only if a node has 2 children, we add the minimum of their depths.

    static int min(int a, int b)
    {
        if (a <= b) {
            return a;
        }
        return b;
    }
    
    unsigned int binarytree_minimum_depth_recursive(const btnode *node)
    {
        unsigned int depth = 0;
        if (node->left && node->right) {
            depth = min(binarytree_minimum_depth_recursive(node->left), 
                    binarytree_minimum_depth_recursive(node->right)) + 1;
        }
        return depth;
    }
    
    unsigned int binarytree_minimum_depth(const binarytree *tree)
    {
        unsigned int depth = 0;
        if (tree->root) {
            depth = binarytree_minimum_depth_recursive(tree->root);
        }
        return depth;
    }
    

    Related

    How to list all files in a directory with a certain extension in Python

    We want to be able to do this:

    def main():
        directory = '/home/martin/Python'
        files = list_files(directory, "py")
        for f in files:
            print f
    
    spam.py
    eggs.py
    ham.py
    

    There are 3 methods we can use:

    1. Use os.listdir()
    2. Use os.walk()
    3. Use glob.glob()

    Method1: Use os.listdir()

    from os import listdir
    
    def list_files1(directory, extension):
        return (f for f in listdir(directory) if f.endswith('.' + extension))
    

    Method 2: Use os.walk()

    Note that walk() will recurse into subdirectories, but we avoid this by returning on the first iteration of the loop.

    from os import walk
    
    def list_files2(directory, extension):
        for (dirpath, dirnames, filenames) in walk(directory):
            return (f for f in filenames if f.endswith('.' + extension))
    

    Method 3: Use glob.glob()

    You need to change into the directory to use glob(), so it’s good manners to save the current working directory and then change back into it at the end.

    from glob import glob
    from os import getcwd, chdir
    
    ef list_files3(directory, extension):
        saved = getcwd()
        chdir(directory)
        it = glob('*.' + extension)
        chdir(saved)
        return it
    

    Using boost::enable_shared_from_this

    The boost::enable_shared_from_this class is designed to allow a method of an object that has already been instantiated wrapped in a shared_ptr to return this as a shared_ptr correctly. This is necessary because simply returning a shared_ptr(this) will create another shared_ptr with its own reference count, which would conflict with the exisiting one and subvert reference counting.

    For example, consider the class outline below which is fairly typical of a heap-allocated, non-copyable class using shared_ptr:

    #include <boost/shared_ptr.hpp>
    #include <boost/core/noncopyable.hpp>
    
    class X : private boost::noncopyable
    {
    public:
        typedef boost::shared_ptr<X> Ptr;
        typedef boost::shared_ptr<const X> ConstPtr;
    private:
        X()
        {
        }
    public:
        static Ptr New()
        {
            return Ptr(new X());
        }
    };
    

    Now imagine we add the following method, which returns this wrapped in a shared_ptr, perhaps to facilitate method chaining:

        Ptr F()
        {
            // Do something...
            return Ptr(this);
        }
    

    The problem now is that the Ptr returned by F() is in addition to the Ptr already wrapping this instance, so although it refers to the same underlying raw pointer, it has its own reference count, as the following program demonstrates:

    int main()
    {
        X::Ptr x1 = X::New();
        X::Ptr x2 = x1->F();
        std::cout << std::boolalpha << "x1 == x2?: "<< (x1 == x2) << "\n";
        std::cout << std::boolalpha << "!(x1 < x2 || x < x2)?: " <<  (!(x1 < x2 || x2 < x1)) << "\n";
    }
    

    This outputs:

    x1 == x2?: true
    !(x1 < x2 || x < x2)?: false
    

    Operator== is showing that the two variables are holding the same raw pointer, but operator<, which is overloaded to show whether two pointers actually share ownership, shows that they do not.

    When the first pointer goes out of scope, it will decrement its reference count to 0 and delete the object. When the second one too goes out of scope, it will attempt to do the same, with undefined results.

    The solution is to use boost::enable_shared_from_this. This just involves:

    1. Include boost/enable_shared_from_this.hpp
    2. Derive your class publicly from boost_enable_shared_from_this, which takes the class as a template parameter (the Curiously Recurring Template Pattern)
    3. Whenever you want to return this wrapped in a shared_ptr, use the inherited shared_from_this() method.

    Here is our example class with those changes added:

    #include <boost/shared_ptr.hpp>
    #include <boost/enable_shared_from_this.hpp>
    #include <boost/core/noncopyable.hpp>
    
    class X : public boost::enable_shared_from_this<X>,
       private boost::noncopyable
    {
    public:
        typedef boost::shared_ptr<X> Ptr;
        typedef boost::shared_ptr<const X> ConstPtr;
    private:
        X()
        {
        }
    public:
        static Ptr New()
        {
            return Ptr(new X());
        }
        Ptr F()
        {
            return shared_from_this();
        }
    };
    

    This is the result of running the test program now:

    x1 == x2?: true
    !(x1 < x2 || x < x2)?: true
    

    So now the ownership of the raw pointer is correctly shared.

    References:
    noncopyable
    shared_ptr class template
    enable_shared_from_this