Tag Archives: Python

Binary search in Python

I needed to do binary search of a sorted list with a custom comparator, something that isn’t supported by the Python standard library, so I’ve written a suite of functions for the purpose. They’re modeled on the C++ standard library functions lower_bound(), upper_bound(), equal_range(), and binary_search().

lower_bound()

lower_bound() returns the index of the first element greater than or equal to the sought value, or the length of the list if it isn’t found:

def lower_bound(sequence, value, compare=None):
    """Find the index of the first element in sequence >= value"""
    if compare is None:
        # Three-way comparison; Python 3 no longer has a built-in cmp()
        compare = lambda a, b: (a > b) - (a < b)
    elements = len(sequence)
    offset = 0
    found = len(sequence)

    while elements > 0:
        middle = elements // 2
        if compare(value, sequence[offset + middle]) > 0:
            offset = offset + middle + 1
            elements = elements - (middle + 1)
        else:
            found = offset + middle
            elements = middle
    return found

upper_bound()

upper_bound() returns the index of the first element greater than the sought value, or the length of the list if there is no such value:

def upper_bound(sequence, value, compare=None):
    """Find the index of the first element in sequence > value"""
    if compare is None:
        # Three-way comparison; Python 3 no longer has a built-in cmp()
        compare = lambda a, b: (a > b) - (a < b)
    elements = len(sequence)
    offset = 0
    found = 0

    while elements > 0:
        middle = elements // 2
        if compare(value, sequence[offset + middle]) < 0:
            elements = middle
        else:
            offset = offset + middle + 1
            found = offset
            elements = elements - (middle + 1)
    return found

equal_range()

equal_range() returns a pair containing the results of lower_bound() and upper_bound(), so it defines the bounds of a slice whose elements are equal to the value:

def equal_range(sequence, value, compare=None):
    """Find the range of sequence equal to value"""
    if compare is None:
        compare = lambda a, b: (a > b) - (a < b)
    return (lower_bound(sequence, value, compare),
            upper_bound(sequence, value, compare))

binary_search()

binary_search() just returns a Boolean value indicating whether or not a value is present:

def binary_search(sequence, value, compare=None):
    """Find out if value is in sequence"""
    if compare is None:
        compare = lambda a, b: (a > b) - (a < b)
    index = lower_bound(sequence, value, compare)
    return index < len(sequence) and compare(value, sequence[index]) == 0
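
For the default ordering (i.e. no custom comparator), the standard library’s bisect module already provides the first two operations: bisect_left() corresponds to lower_bound() and bisect_right() to upper_bound():

```python
from bisect import bisect_left, bisect_right

data = [1, 2, 2, 2, 3, 5]

lo = bisect_left(data, 2)   # first index with data[i] >= 2
hi = bisect_right(data, 2)  # first index with data[i] > 2

print(lo, hi)        # 1 4
print(data[lo:hi])   # [2, 2, 2]
```

The functions above are only needed when the order is defined by a comparator rather than by the elements’ natural ordering.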

British National Corpus (BNC) parser in Python

I needed to build an inverted index of the British National Corpus (BNC) files, and for that I needed a parser that would return each word in the files with its precise location. I thought I would share the Python class I developed here.

The class BNCParser is optionally initialised with an XML parser object, the default being an instance of xml.etree.ElementTree.XMLParser. Its parse() method takes a filename and is a generator, yielding Word objects that contain the following information about each word found:

  • Level 1 div number
  • Level 2 div number
  • Level 3 div number
  • Level 4 div number
  • Sentence number
  • Word number
  • CLAWS tag
  • Headword
  • POS tag
  • Text (the actual word as it appears)

All positional fields (div numbers, sentence, and word) are 1-based. Lower level divs may not be present, in which case they will appear as 0.

Here is the code:

from collections import namedtuple
import xml.etree.ElementTree as ET

# Represents all of the info about a single word occurrence
Word = namedtuple('Word', ['div1', 'div2', 'div3', 'div4', 'sentence',
        'word', 'c5', 'hw', 'pos', 'text'])

class BNCParser(object):
    """A parser for British National Corpus (BNC) files"""
    def __init__(self, parser=None):
        if parser is None:
            parser = ET.XMLParser(encoding='utf-8')
        self.parser = parser

    def parse(self, filename):
        """Parse `filename` and yield `Word` objects for each word"""
        tree = ET.parse(filename, self.parser)
        root = tree.getroot()
        divs = [None, 0, 0, 0, 0] # 1-based
        sentence = 0
        word = 0
        for neighbour in root.iter('*'):
            if neighbour.tag == 'div':
                level = int(neighbour.attrib['level'])
                divs[level] += 1
                # Reset all lower-level divs to 0
                for i in range(level + 1, 5):
                    divs[i] = 0
                sentence = 0
            elif neighbour.tag == 's':
                sentence += 1
                word = 0
            elif neighbour.tag == 'w':
                word += 1
                yield Word(divs[1], divs[2], divs[3], divs[4], sentence, word, 
                        neighbour.attrib['c5'], neighbour.attrib['hw'], 
                        neighbour.attrib['pos'], neighbour.text)

Notice the algorithm for keeping track of positional information. The parse is a depth-first traversal of the element tree, with div numbers being updated as they are found. Whenever a new div begins, the div number at that level is incremented, and all lower level div numbers are set to 0, as is the sentence number, because they are relative to the parent div. Similarly, when a new sentence begins, the word number is set to 0.

Here is a small example program that parses a single file and prints the word information it finds:

def main():
    source = '2553/download/Texts/aca/HWV.xml'
    parser = BNCParser()
    for word in parser.parse(source):
        print(word)

if __name__ == '__main__':
    main()

Here is the beginning of the output:

Word(div1=1, div2=0, div3=0, div4=0, sentence=1, word=1, c5='NN2', hw='article', pos='SUBST', text='ARTICLES')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=1, c5='NN1', hw='immunogenicity', pos='SUBST', text='Immunogenicity ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=2, c5='PRF', hw='of', pos='PREP', text='of ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=3, c5='AT0', hw='a', pos='ART', text='a ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=4, c5='AJ0', hw='supplemental', pos='ADJ', text='supplemental ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=5, c5='NN1', hw='dose', pos='SUBST', text='dose ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=6, c5='PRF', hw='of', pos='PREP', text='of ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=7, c5='NN1-AJ0', hw='oral', pos='SUBST', text='oral ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=8, c5='PRP', hw='versus', pos='PREP', text='versus ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=9, c5='AJ0', hw='inactivated', pos='ADJ', text='inactivated ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=10, c5='NN1', hw='poliovirus', pos='SUBST', text='poliovirus ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=11, c5='NN1', hw='vaccine', pos='SUBST', text='vaccine')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=1, c5='PRP', hw='in', pos='PREP', text='In ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=2, c5='DT0', hw='many', pos='ADJ', text='many ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=3, c5='AJ0', hw='developing', pos='ADJ', text='developing ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=4, c5='NN2', hw='country', pos='SUBST', text='countries')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=5, c5='AT0', hw='the', pos='ART', text='the ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=6, c5='NN1', hw='immunogenicity', pos='SUBST', text='immunogenicity ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=7, c5='PRF', hw='of', pos='PREP', text='of ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=8, c5='CRD', hw='three', pos='ADJ', text='three ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=9, c5='NN2', hw='dose', pos='SUBST', text='doses ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=10, c5='PRF', hw='of', pos='PREP', text='of ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=11, c5='AJ0', hw='live', pos='ADJ', text='live')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=12, c5='VVD', hw='attenuate', pos='VERB', text='attenuated')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=13, c5='AJ0', hw='oral', pos='ADJ', text='oral ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=14, c5='NN1', hw='poliovirus', pos='SUBST', text='poliovirus ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=15, c5='NN1', hw='vaccine', pos='SUBST', text='vaccine ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=16, c5='NP0', hw='opv', pos='SUBST', text='OPV')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=17, c5='VBZ', hw='be', pos='VERB', text='is ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=18, c5='AJC', hw='low', pos='ADJ', text='lower ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=19, c5='CJS', hw='than', pos='CONJ', text='than ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=20, c5='DT0', hw='that', pos='ADJ', text='that ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=21, c5='PRP', hw='in', pos='PREP', text='in ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=22, c5='AJ0', hw='industrialised', pos='ADJ', text='industrialised ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=23, c5='NN2', hw='country', pos='SUBST', text='countries')

Boyer-Moore search of a list for a sub-list in Python

I had the need to search a list for a sub-list in Python, rather like std::search() in C++, but there isn’t a built-in equivalent in Python. A naive way of implementing it would be to try to match the sub-list against every position in the list, but that wouldn’t be very efficient. An efficient algorithm for string searching is the Boyer-Moore algorithm, and this can be adapted to search a list.

Below is an implementation of Boyer-Moore for searching a list in Python. It’s based on the Wikipedia article. The function search() takes a list to search (haystack), and a sub-list to search for (needle), and returns the starting index of needle, or -1 if it isn’t found.

def search(haystack, needle):
    """
    Search list `haystack` for sub-list `needle`.
    """
    if len(needle) == 0:
        return 0
    char_table = make_char_table(needle)
    offset_table = make_offset_table(needle)
    i = len(needle) - 1
    while i < len(haystack):
        j = len(needle) - 1
        while needle[j] == haystack[i]:
            if j == 0:
                return i
            i -= 1
            j -= 1
        # The default of len(needle) covers elements that don't
        # appear in the needle at all
        i += max(offset_table[len(needle) - 1 - j],
                 char_table.get(haystack[i], len(needle)))
    return -1

    
def make_char_table(needle):
    """
    Makes the jump table based on the mismatched character information.
    """
    table = {}
    for i in range(len(needle) - 1):
        table[needle[i]] = len(needle) - 1 - i
    return table
    
def make_offset_table(needle):
    """
    Makes the jump table based on the scan offset in which mismatch occurs.
    """
    table = []
    last_prefix_position = len(needle)
    for i in reversed(range(len(needle))):
        if is_prefix(needle, i + 1):
            last_prefix_position = i + 1
        table.append(last_prefix_position - i + len(needle) - 1)
    for i in range(len(needle) - 1):
        slen = suffix_length(needle, i)
        table[slen] = len(needle) - 1 - i + slen
    return table
    
def is_prefix(needle, p):
    """
    Is needle[p:] a prefix of needle?
    """
    j = 0
    for i in range(p, len(needle)):
        if needle[i] != needle[j]:
            return False
        j += 1
    return True
    
def suffix_length(needle, p):
    """
    Returns the maximum length of the substring ending at p that is a suffix.
    """
    length = 0
    j = len(needle) - 1
    for i in reversed(range(p + 1)):
        if needle[i] == needle[j]:
            length += 1
        else:
            break
        j -= 1
    return length
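
As a quick illustration of the bad-character rule, here is the table that make_char_table() (repeated from above so the snippet stands alone) builds for a short needle:

```python
def make_char_table(needle):
    # Shift for each element: the distance from its last occurrence
    # (excluding the final position) to the end of the needle
    table = {}
    for i in range(len(needle) - 1):
        table[needle[i]] = len(needle) - 1 - i
    return table

print(make_char_table([0, 0, 1]))  # {0: 1}
```

Elements that don’t appear in the needle get no entry, so a mismatch on one of them lets the search jump a full needle length.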

An example program:

def main():
    haystack = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1]
    needle = [0, 0, 1]
    index = search(haystack, needle)
    print(index)

if __name__ == '__main__':
    main()

Output:

4

Shunting Yard Algorithm in Python

The Shunting Yard Algorithm is a classic algorithm for parsing mathematical expressions invented by Edsger Dijkstra. Its name comes from the use of a stack to rearrange the operators and operands into the correct order for evaluation, which is rather reminiscent of a railway siding.

Here is a very simple implementation in Python:

import re

def is_number(token):
    try:
        int(token)
        return True
    except ValueError:
        return False

def is_name(token):
    return re.match(r"\w+", token)

def peek(stack):
    return stack[-1] if stack else None

def apply_operator(operators, values):
    operator = operators.pop()
    right = values.pop()
    left = values.pop()
    values.append(eval("{0}{1}{2}".format(left, operator, right)))

def greater_precedence(op1, op2):
    precedences = {'+' : 0, '-' : 0, '*' : 1, '/' : 1}
    # '>=' rather than '>' so that operators of equal precedence are
    # applied left to right (all four operators are left-associative);
    # with '>' an expression like 10 - 5 - 3 would evaluate incorrectly
    return precedences[op1] >= precedences[op2]

def evaluate(expression):
    tokens = re.findall(r"[+/*()-]|\d+", expression)
    values = []
    operators = []
    for token in tokens:
        if is_number(token):
            values.append(int(token))
        elif token == '(':
            operators.append(token)
        elif token == ')':
            top = peek(operators)
            while top is not None and top != '(':
                apply_operator(operators, values)
                top = peek(operators)
            operators.pop() # Discard the '('
        else:
            # Operator
            top = peek(operators)
            while top is not None and top not in "()" and greater_precedence(top, token):
                apply_operator(operators, values)
                top = peek(operators)
            operators.append(token)
    while peek(operators) is not None:
        apply_operator(operators, values)

    return values[0]

Example program:

def main():
    expression = '((20 - 10 ) * (30 - 20) / 10 + 10 ) * 2'
    print("Shunting Yard Algorithm: {0}".format(evaluate(expression)))
    print("Python: {0}".format(eval(expression)))

if __name__ == '__main__':
    main()

Output:

Shunting Yard Algorithm: 40
Python: 40

Integer partitions in Python

These functions produce integer partitions in reverse lexicographic and lexicographic order respectively.

def revlex_partitions(n):
    """
    Generate all integer partitions of n
    in reverse lexicographic order
    """
    if n == 0:
        yield []
        return
    for p in revlex_partitions(n - 1):
        if len(p) == 1 or (len(p) > 1 and p[-1] < p[-2]):
            p[-1] += 1
            yield p
            p[-1] -= 1
        p.append(1)
        yield p
        p.pop()

def lex_partitions(n):
    """
    Generate all integer partitions of n
    in lexicographic order
    """
    if n == 0:
        yield []
        return
    for p in lex_partitions(n - 1):
        p.append(1)
        yield p
        p.pop()
        if len(p) == 1 or (len(p) > 1 and p[-1] < p[-2]):
            p[-1] += 1
            yield p
            p[-1] -= 1
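
One subtlety: for efficiency, both generators yield the same list object every time, mutating it in place, so each partition must be copied if it is to be stored rather than consumed immediately. For example (revlex_partitions repeated from above so the snippet stands alone):

```python
def revlex_partitions(n):
    """Generate all integer partitions of n in reverse lexicographic order"""
    if n == 0:
        yield []
        return
    for p in revlex_partitions(n - 1):
        if len(p) == 1 or (len(p) > 1 and p[-1] < p[-2]):
            p[-1] += 1
            yield p
            p[-1] -= 1
        p.append(1)
        yield p
        p.pop()

# list(p) copies each partition; storing the raw references would leave
# every entry pointing at the same list
parts = [list(p) for p in revlex_partitions(4)]
print(parts)  # [[4], [3, 1], [2, 2], [2, 1, 1], [1, 1, 1, 1]]
```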

Example program:

def main():
    for p in revlex_partitions(5):
        print(p)
    print()
    for p in lex_partitions(5):
        print(p)

if __name__ == '__main__':
    main()

Output:

[5]
[4, 1]
[3, 2]
[3, 1, 1]
[2, 2, 1]
[2, 1, 1, 1]
[1, 1, 1, 1, 1]

[1, 1, 1, 1, 1]
[2, 1, 1, 1]
[2, 2, 1]
[3, 1, 1]
[3, 2]
[4, 1]
[5]

Reference: David Eppstein, IntegerPartitions.py

Greedy Set Cover in Python

In the Set Cover problem, we are given a universal set \(U\) and a family of subsets \(S_1, \ldots, S_k \subseteq U\). A set cover is a collection of subsets \(C\) from \(S_1, \ldots, S_k\) whose union is the universal set \(U\). We would like to minimise \(\left\vert{C}\right\vert\).

The problem of finding the optimum \(C\) is NP-complete, but a greedy algorithm gives an \(O(\ln n)\) approximation to the optimal solution.

The greedy algorithm selects the set \(S_i\) containing the largest number of uncovered points at each step, until all of the points have been covered.

Below is an implementation in Python:

def set_cover(universe, subsets):
    """Find a family of subsets that covers the universal set"""
    elements = set(e for s in subsets for e in s)
    # Check the subsets cover the universe
    if elements != universe:
        return None
    covered = set()
    cover = []
    # Greedily add the subsets with the most uncovered points
    while covered != elements:
        subset = max(subsets, key=lambda s: len(s - covered))
        cover.append(subset)
        covered |= subset

    return cover

def main():
    universe = set(range(1, 11))
    subsets = [set([1, 2, 3, 8, 9, 10]),
        set([1, 2, 3, 4, 5]),
        set([4, 5, 7]),
        set([5, 6, 7]),
        set([6, 7, 8, 9, 10])]
    cover = set_cover(universe, subsets)
    print(cover)

if __name__ == '__main__':
    main()

Output:

[set([1, 2, 3, 8, 9, 10]), set([4, 5, 7]), set([5, 6, 7])]

Note that this result is not optimal: the two subsets set([1, 2, 3, 4, 5]) and set([6, 7, 8, 9, 10]) cover the universe by themselves, so here the greedy algorithm returns one more subset than the minimum.

Greedy HornSAT in Python

Consider this logic puzzle:

If Mr. Smith has a dog, then Mrs. Brown has a cat.
If Mr. Jones has a dog, then he has a cat too.
If Mr. Smith has a dog and Mr. Jones has a cat, then Mrs. Peacock has a dog.
If Mrs. Brown and Mr. Jones share a pet of the same species, then Mr. Smith has a cat.
All the men have dogs.

What can we say about who has what kind of pet?

We can encode the problem as Boolean variables. We write s, j, b, p for Boolean variables that are true if Smith, Jones, Brown, or Peacock respectively have a dog, and S, J, B, P for Boolean variables that are true if they have a cat. The puzzle can then be expressed as the following logical formula:

\(
\begin{matrix}
(s & \Rightarrow & B) & \land \\
(j & \Rightarrow & J) & \land \\
((s \land J) & \Rightarrow & p) & \land \\
((b \land j) & \Rightarrow & S) & \land \\
((B \land J) & \Rightarrow & S) & \land \\
& & (s) & \land \\
& & (j). &
\end{matrix}
\)

A formula that is a conjunction of implications in which no variable is complemented is called a Horn formula. To solve the puzzle, we need to find a satisfying truth assignment for the variables in the formula. While the problem of Boolean Satisfiability (SAT) is in general intractable, Boolean satisfiability for Horn clauses – HornSAT – can be solved in linear time using a greedy algorithm.

The algorithm works as follows:

  1. Make a set of truth assignments for the variables, starting with them all set to False
  2. Make a set W of all of the variables that appear on the right hand side of an empty implication (one whose left hand side is empty)
  3. While W is not empty:
    1. Remove an arbitrary element v from W
    2. Set v to True
    3. For each clause c where v appears on the left hand side:
      1. Delete v from the left hand side of c
      2. If this makes c an empty implication, add the variable on the right hand side of c to W, unless it is already there
  4. Return the set of truth assignments

Here is the implementation in Python:

def greedy_hornsat(variables, clauses):
    """Take a list of variables and horn clauses and return a truth assignment
    to the variables that satisfies all of them"""
    assignment = dict((v, False) for v in variables)
    working = set(t[1] for t in clauses if not len(t[0]))
    while working:
        v = working.pop()
        assignment[v] = True
        vclauses = [c for c in clauses if v in c[0]]
        for vclause in vclauses:
            vclause[0].remove(v)
            if not vclause[0]:
                working.add(vclause[1])

    return assignment

An example program using the puzzle above:

def main():
    variables = ['s', 'j', 'b', 'p', 'S', 'J', 'B', 'P']
    clauses = [(['s'], 'B'),
            (['j'], 'J'),
            (['s', 'J'], 'p'),
            (['b', 'j'], 'S'),
            (['B', 'J'], 'S'),
            ([], 's'),
            ([], 'j')]
    assignment = greedy_hornsat(variables, clauses)
    print(assignment)

if __name__ == '__main__':
    main()

The output:

{'b': False, 'P': False, 'J': True, 'j': True, 'S': True, 'p': True, 's': True, 'B': True}

So we know that Mr. Jones and Mr. Smith both have a dog and a cat, Mrs. Brown has a cat, and Mrs. Peacock has a dog. Mrs. Brown may also have a dog, and Mrs. Peacock may also have a cat, but we don’t know those.

Reference: Notes for CS 170

Removing duplicates from a list while preserving order in Python

You can remove duplicates from a list easily by putting the elements in a set and then making a new list from the set’s contents:

def unique(sequence):
    return list(set(sequence))

However, this will put the elements in an arbitrary order. To preserve the order, you can use a set to keep track of which elements you’ve seen while populating the new list with a list comprehension:

def unique(sequence):
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]

Note that this relies on the fact that set.add() returns None: the expression x in seen or seen.add(x) is falsy exactly when x hasn’t been seen before, adding it to seen as a side effect.
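
For example, duplicates are dropped while the first occurrences keep their positions:

```python
def unique(sequence):
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]

print(unique([3, 1, 3, 2, 1, 4]))  # [3, 1, 2, 4]
```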

Count the number of occurrences of characters in a string in Python

To count the number of occurrences of a character in a string, use the string’s count() method:

string = 'The quick brown fox jumps over the lazy dog'
print(string.count('z'))
1

Note that you can also count substrings with count():

string = 'spam, eggs, ham, spam, spam, spam'
print(string.count('spam'))
4

To count the number of occurrences of all characters in a string, use a collections.Counter:

from collections import Counter

string = 'spam, eggs, ham, spam, spam, spam'
counter = Counter(string)
print(counter)
Counter({'a': 5, ' ': 5, 'm': 5, ',': 5, 's': 5, 'p': 4, 'g': 2, 'e': 1, 'h': 1})

Printing the attributes of an object in Python

vars()

The vars() function simply returns the __dict__ attribute for any object that has one. Modules, user-defined classes, and their instances have a __dict__ attribute, but many other kinds of objects don’t.

dir()

The dir() function will call the object’s __dir__() method if it is present. If __dir__() is not present, dir() falls back to gathering names from the object’s __dict__ attribute and from its class hierarchy.

import pprint

def main():
    parrot = Parrot('Norwegian Blue', 'sleeping')
    pp = pprint.PrettyPrinter()
    print("vars()")
    pp.pprint(vars(parrot))
    print("dir()")
    pp.pprint(dir(parrot))

class Parrot(object):
    def __init__(self, breed, condition):
        self.breed = breed
        self.condition = condition


if __name__ == '__main__':
    main()

Output:

vars()
{'breed': 'Norwegian Blue', 'condition': 'sleeping'}
dir()
['__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__hash__',
 '__init__',
 '__module__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'breed',
 'condition']