British National Corpus (BNC) parser in Python

I needed to build an inverted index of the British National Corpus (BNC) files, and for that I needed a parser that returns each word in the files together with its precise location. Here is the Python class I wrote to do that.

The class BNCParser is optionally initialised with an XML parser object, the default being an instance of xml.etree.ElementTree.XMLParser. Its parse() method takes a filename and is a generator: it yields one Word object for each word found, containing the following information:

  • Level 1 div number
  • Level 2 div number
  • Level 3 div number
  • Level 4 div number
  • Sentence number
  • Word number
  • CLAWS C5 tag (the c5 attribute)
  • Headword (the hw attribute)
  • Simplified POS tag (the pos attribute)
  • Text (the actual word as it appears)

All positional fields (div numbers, sentence, and word) are 1-based. Lower-level divs may be absent, in which case their fields are 0.

Here is the code:

from collections import namedtuple
import xml.etree.ElementTree as ET

# Represents all of the info about a single word occurrence
Word = namedtuple('Word', ['div1', 'div2', 'div3', 'div4', 'sentence',
        'word', 'c5', 'hw', 'pos', 'text'])

class BNCParser(object):
    """A parser for British National Corpus (BNC) files"""
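    def __init__(self, parser=None):
        # A used XMLParser cannot parse a second document, so when no parser
        # is supplied, parse() creates a fresh one for each file.
        self.parser = parser

    def parse(self, filename):
        """Parse `filename` and yield `Word` objects for each word"""
        parser = self.parser
        if parser is None:
            parser = ET.XMLParser(encoding='utf-8')
        tree = ET.parse(filename, parser)
        root = tree.getroot()
        divs = [None, 0, 0, 0, 0]  # index 0 unused so div levels are 1-based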
        sentence = 0
        word = 0
        for neighbour in root.iter('*'):
            if neighbour.tag == 'div':
                level = int(neighbour.attrib['level'])
                divs[level] += 1
                # Reset all lower-level divs to 0
                for i in range(level + 1, 5):
                    divs[i] = 0
                sentence = 0
            elif neighbour.tag == 's':
                # A new sentence starts, so word numbering restarts
                sentence += 1
                word = 0
            elif neighbour.tag == 'w':
                word += 1
                yield Word(divs[1], divs[2], divs[3], divs[4], sentence, word,
                        neighbour.attrib['c5'], neighbour.attrib['hw'],
                        neighbour.attrib['pos'], neighbour.text)

Notice how the positional information is tracked. The parse is a depth-first traversal of the element tree in document order, so every div, sentence, and word is visited in the order it appears in the text. Whenever a new div begins, the div number at that level is incremented, and all lower-level div numbers are reset to 0, as is the sentence number, because they are relative to the parent div. Similarly, when a new sentence begins, the word number is reset to 0.
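
To see the counters in action without a BNC file to hand, here is a minimal sketch that runs the same bookkeeping over a tiny inline document. The sample XML and the positions() helper are made up for illustration; the sample only borrows the div, s, and w element names and uses two div levels instead of four.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; not real BNC data
SAMPLE = """<text>
  <div level="1">
    <s><w>Title</w></s>
    <div level="2">
      <s><w>First </w><w>sentence</w></s>
      <s><w>Second</w></s>
    </div>
    <div level="2">
      <s><w>Third</w></s>
    </div>
  </div>
</text>"""

def positions(xml_text):
    """Yield (div1, div2, sentence, word, text) for each <w> element."""
    root = ET.fromstring(xml_text)
    divs = [None, 0, 0]  # index 0 unused; two div levels suffice here
    sentence = 0
    word = 0
    for element in root.iter():
        if element.tag == 'div':
            level = int(element.attrib['level'])
            divs[level] += 1
            # Reset all lower-level divs and the sentence counter
            for i in range(level + 1, len(divs)):
                divs[i] = 0
            sentence = 0
        elif element.tag == 's':
            sentence += 1
            word = 0
        elif element.tag == 'w':
            word += 1
            yield (divs[1], divs[2], sentence, word, element.text)

for position in positions(SAMPLE):
    print(position)
```

The word "Title" sits directly under the level 1 div, so its level 2 field is 0, and the sentence count starts again from 1 inside each level 2 div.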

Here is a small example program that parses a single file and prints the word information it finds:

def main():
    source = '2553/download/Texts/aca/HWV.xml'
    parser = BNCParser()
    for word in parser.parse(source):
        print(word)

if __name__ == '__main__':
    main()

Here is the beginning of the output:

Word(div1=1, div2=0, div3=0, div4=0, sentence=1, word=1, c5='NN2', hw='article', pos='SUBST', text='ARTICLES')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=1, c5='NN1', hw='immunogenicity', pos='SUBST', text='Immunogenicity ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=2, c5='PRF', hw='of', pos='PREP', text='of ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=3, c5='AT0', hw='a', pos='ART', text='a ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=4, c5='AJ0', hw='supplemental', pos='ADJ', text='supplemental ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=5, c5='NN1', hw='dose', pos='SUBST', text='dose ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=6, c5='PRF', hw='of', pos='PREP', text='of ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=7, c5='NN1-AJ0', hw='oral', pos='SUBST', text='oral ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=8, c5='PRP', hw='versus', pos='PREP', text='versus ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=9, c5='AJ0', hw='inactivated', pos='ADJ', text='inactivated ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=10, c5='NN1', hw='poliovirus', pos='SUBST', text='poliovirus ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=1, word=11, c5='NN1', hw='vaccine', pos='SUBST', text='vaccine')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=1, c5='PRP', hw='in', pos='PREP', text='In ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=2, c5='DT0', hw='many', pos='ADJ', text='many ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=3, c5='AJ0', hw='developing', pos='ADJ', text='developing ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=4, c5='NN2', hw='country', pos='SUBST', text='countries')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=5, c5='AT0', hw='the', pos='ART', text='the ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=6, c5='NN1', hw='immunogenicity', pos='SUBST', text='immunogenicity ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=7, c5='PRF', hw='of', pos='PREP', text='of ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=8, c5='CRD', hw='three', pos='ADJ', text='three ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=9, c5='NN2', hw='dose', pos='SUBST', text='doses ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=10, c5='PRF', hw='of', pos='PREP', text='of ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=11, c5='AJ0', hw='live', pos='ADJ', text='live')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=12, c5='VVD', hw='attenuate', pos='VERB', text='attenuated')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=13, c5='AJ0', hw='oral', pos='ADJ', text='oral ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=14, c5='NN1', hw='poliovirus', pos='SUBST', text='poliovirus ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=15, c5='NN1', hw='vaccine', pos='SUBST', text='vaccine ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=16, c5='NP0', hw='opv', pos='SUBST', text='OPV')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=17, c5='VBZ', hw='be', pos='VERB', text='is ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=18, c5='AJC', hw='low', pos='ADJ', text='lower ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=19, c5='CJS', hw='than', pos='CONJ', text='than ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=20, c5='DT0', hw='that', pos='ADJ', text='that ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=21, c5='PRP', hw='in', pos='PREP', text='in ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=22, c5='AJ0', hw='industrialised', pos='ADJ', text='industrialised ')
Word(div1=1, div2=1, div3=0, div4=0, sentence=2, word=23, c5='NN2', hw='country', pos='SUBST', text='countries')
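
As a rough sketch of how output like this could feed an inverted index (the original motivation), the following maps each headword to the locations where it occurs. The build_index function and the choice of keying on hw are assumptions made for illustration, and the sample words are copied by hand from the output above rather than read from a file.

```python
from collections import defaultdict, namedtuple

Word = namedtuple('Word', ['div1', 'div2', 'div3', 'div4', 'sentence',
        'word', 'c5', 'hw', 'pos', 'text'])

def build_index(words):
    """Map each headword to a list of (div1, div2, div3, div4, sentence, word)."""
    index = defaultdict(list)
    for w in words:
        # The first six fields of a Word are its location
        index[w.hw].append(w[:6])
    return index

# Hand-copied sample; in practice, pass BNCParser().parse(filename) instead
sample = [
    Word(1, 1, 0, 0, 1, 5, 'NN1', 'dose', 'SUBST', 'dose '),
    Word(1, 1, 0, 0, 1, 6, 'PRF', 'of', 'PREP', 'of '),
    Word(1, 1, 0, 0, 2, 9, 'NN2', 'dose', 'SUBST', 'doses '),
]

index = build_index(sample)
print(index['dose'])
```

Keying on the headword rather than the surface text means that "dose" and "doses" land in the same entry; keying on text or on a (hw, c5) pair would be equally easy if finer distinctions are needed.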