1 Setting up Python

At the time of writing, the latest stable release of Python is version 3.10.5. If you already have Python installed, please ensure that you have at least version 3.6 or newer.

To install or upgrade Python, you can either:

  • download and run the latest installer from https://www.python.org (relevant for Windows and macOS)
  • use a package manager (relevant if you are using Linux, or the Homebrew package manager on macOS)

You can verify whether Python was successfully installed on your system by opening your Terminal or PowerShell and entering the command python --version or python3 --version.

In addition to installing Python, you will also need a text editor. Here is a non-exhaustive list of suggestions:

  • Mu: a simple Python editor for beginner programmers (limited to Python but very simple to the point of being suitable for children)
  • Geany: a lightweight Integrated Development Environment (IDE) supporting over 50 programming languages, including Python
  • VSCodium: the open-source version of Microsoft Visual Studio Code, i.e. a very similar product but without telemetry built-in. Both editors are fully-featured and contain many creature comforts for advanced programmers. However, the wealth of options may be overwhelming to beginners.
  • Long-established text editors such as Vim or Emacs (only use if you are already familiar with them due to the steep learning curve)

2 Running Python: Hello World!

A traditional first program to run is “Hello World!”. We will use this to ensure that everyone has Python properly installed and can execute Python scripts.

print("Hello World!")
## Hello World!

We can launch Python directly from our command line (Terminal or PowerShell) and execute the program interactively from there, or save the program to a file (e.g. called hello.py) and execute the program stored within the file by running Python on it: python3 hello.py. For the latter option, our Terminal needs to have the folder in which we saved the script file as its working directory. We can set the working directory with the cd command followed by a space and the folder path. (NB: You may also be able to drag the folder onto the Terminal after having typed cd followed by a space.) You can verify the path with the command pwd (“print working directory”) and/or list the contents of the directory with ls (macOS/Linux) or dir (Windows) to see whether your script file appears in the listing.

A further option is to make our Python script executable. For this, we place a line at the very top of the script (a so-called shebang line) that tells our system how the script should be executed. NB: The path provided explicitly applies to Unix-like systems (Linux and macOS among others) but should also be interpretable (or even unnecessary) on Windows.

#!/usr/bin/env python3
print("Hello World!")

On Unix-like systems, we should then give executable permissions to the file with chmod +x hello.py. Once this is done, we can run the script as ./hello.py.

3 Python basics: A (hopefully) gentle introduction

Let us start with a “cultural studies flavoured” approach: We will generate the American song 99 bottles of beer on the wall in Python to illustrate some of the basic concepts of the language. We will be aiming for the following lyrics:

99 bottles of beer on the wall, 99 bottles of beer. Take one down and pass it around, 98 bottles of beer on the wall.

[…]

1 bottle of beer on the wall, 1 bottle of beer. Take one down and pass it around, no more bottles of beer on the wall.

No more bottles of beer on the wall, no more bottles of beer. Go to the store and buy some more, 99 bottles of beer on the wall.

3.1 What do we need to do to generate this song?

  • Count down from 99 to 0
  • Use the appropriate plural or singular of the noun bottle
  • Write 0 as “No more” or “no more”
  • Handle the ending, which differs from the rest of the song
  • Export the result to a file
  • Extras:
    • Make alternate versions: switch the beverage and the starting number
    • Run the script with alterations from the command line
  • Luxury extras (challenges for when you are more familiar with Python):
    • Further alterations: switch the container
    • Spell out the numbers

3.2 How do we count down from 99?

We can repeat operations a certain number of times using a for loop. At each repetition, a specified variable (conventionally called i) is assigned a different value. To print the numbers from 1 to 10, we can do the following:

for i in range(1, 11):
    print(i)
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10

Notice that the endpoint of the range is not included: the count stops one number short of it.

The range() function may take a third argument specifying the step between values. We can skip the even numbers by specifying a step of 2.

for i in range(1, 11, 2):
    print(i)
## 1
## 3
## 5
## 7
## 9

To count down, we can specify a negative step. We also don’t have to stick with i as the name of the loop variable, but can choose something more expressive.

for bottle_count in range(99, -1, -1):
    print(bottle_count)

3.3 How do we produce the appropriate plural or singular form?

Our condition is that we only need the singular form when the number of bottles is 1. We can achieve this with a conditional if statement.

if bottle_count == 1:
    print(str(bottle_count) + " bottle")
else:
    print(str(bottle_count) + " bottles")

There are more elegant ways to insert variables into strings:

print("%s bottles" % bottle_count) # "old school" way
print(f"{bottle_count} bottles") # new f-string introduced in Python 3.6

We can now combine our for-loop and our if-statement:

for bottle_count in range(99, -1, -1):
    if bottle_count == 1:
        print(f"{bottle_count} bottle")
    else:
        print(f"{bottle_count} bottles")

3.4 Add the rest of the song

for bottle_count in range(99, -1, -1):
    if bottle_count == 1:
        print(f"{bottle_count} bottle of beer on the wall, {bottle_count} bottle of beer.\n\tTake one down and pass it around, no more bottles of beer on the wall.\n")
    else:
        print(f"{bottle_count} bottles of beer on the wall, {bottle_count} bottles of beer.\n\tTake one down and pass it around, {bottle_count - 1} bottles of beer on the wall.\n")

It is generally advised to keep lines of code under 80 characters for better readability. But how can we keep the lines short when our text is long? While this length limit is more a suggestion than a requirement, we can still comply with it by splitting the string into parts, as long as the parts are enclosed in parentheses: adjacent string literals are automatically joined into a single string. NB: Do not place any commas between the strings, as Python would then interpret them as a tuple, i.e. a data type akin to a list that cannot be modified.
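
A minimal illustration of the difference (the variable names joined and accidental_tuple are made up for this example):

joined = (
    "99 bottles "
    "of beer"  # adjacent string literals are merged into one string
)
print(joined)
## 99 bottles of beer

accidental_tuple = (
    "99 bottles ",
    "of beer"  # with a comma, we get a tuple of two strings instead
)
print(accidental_tuple)
## ('99 bottles ', 'of beer')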

for bottle_count in range(99, -1, -1):
    if bottle_count == 1:
        print(
            f"{bottle_count} bottle of beer on the wall, "
            f"{bottle_count} bottle of beer.\n\t"
            "Take one down and pass it around, "
            "no more bottles of beer on the wall.\n"
        )
    else:
        print(
            f"{bottle_count} bottles of beer on the wall, "
            f"{bottle_count} bottles of beer.\n\t"
            "Take one down and pass it around, "
            f"{bottle_count - 1} bottles of beer on the wall.\n"
        )

3.5 How do we write out 0 as “no more”?

There are multiple ways to achieve this. We could handle it with an if-statement, but that would quickly become convoluted due to the existing if-statement for handling the singular, and the fact that 0 occurs both when the bottle count is 1 (due to the subtraction) and in the final line. A more elegant solution is to substitute the character 0 with the character string “no more”. For this, we use regular expressions, which requires importing the module re. Imported modules are typically listed at the very top of a Python script. We can even use regular expressions to get rid of the if-statement handling the plural/singular distinction and reserve conditionals for the last line. To perform the substitutions, we first store a default version of the line and then apply the substitutions to it.

Some things to note:

  • If we just substitute “0 bottles” with “no more bottles”, we will end up with “9no more bottles, 8no more bottles”, etc. We therefore need to specify that the 0 should be preceded by a word boundary. The regular expression wild card for word boundaries is \b. In Python, this particular wild card is somewhat confusing because, in an ordinary string, the sequence \b is the escape sequence for a backspace character, so we either need to escape the \ escape character itself, hence \\b, or use the raw string notation r"\b". In a nutshell, raw strings ignore any Python-specific uses of the backslash character and pass everything as is to the regular expression parser. If you are confused, you can place double backslashes for any wild card, whether they are necessary (e.g. \\b, \\1) or not (e.g. \w or \\w). You can also play around with the raw string notation to see whether you prefer it (e.g. r"\b", r"\1", r"\w").
  • To have a capitalized “N” when “no more” occurs at the beginning of a line, we can use the anchor ^ to specify the beginning of the string.
  • If we just substitute “1 bottles” with “1 bottle”, we will end up with “91 bottle, 81 bottle” etc. We therefore specify that the 1 should be preceded by a word boundary.
import re

for bottle_count in range(99, -1, -1):
    default_line = (
        f"{bottle_count} bottles of beer on the wall, "
        f"{bottle_count} bottles of beer.\n\t"
        "Take one down and pass it around, "
        f"{bottle_count - 1} bottles of beer on the wall.\n"
    )
    updated_line = re.sub("\\b0", "no more", default_line)
    updated_line = re.sub("^n", "N", updated_line)
    updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
    print(updated_line)
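
As a quick check that the double-backslash and raw-string notations describe the same regular expression (the example string is made up):

import re

print(re.sub("\\b0", "no more", "10 bottles, 0 bottles"))
## 10 bottles, no more bottles
print(re.sub(r"\b0", "no more", "10 bottles, 0 bottles"))
## 10 bottles, no more bottles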

3.6 How do we handle the ending?

Our song looks fine except for the ending. We can use an if-statement to treat the ending differently. We can now also skip the capitalization of “no” at line beginnings.

import re

for bottle_count in range(99, -1, -1):
    if bottle_count > 0:
        default_line = (
            f"{bottle_count} bottles of beer on the wall, "
            f"{bottle_count} bottles of beer.\n\t"
            "Take one down and pass it around, "
            f"{bottle_count - 1} bottles of beer on the wall.\n"
        )
        updated_line = re.sub("\\b0", "no more", default_line)
        updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
    else:
        updated_line = (
            "No more bottles of beer on the wall, "
            "no more bottles of beer.\n\t"
            "Go to the store and buy some more, "
            "99 bottles of beer on the wall.\n"
        )
    print(updated_line)

Alternatively, we can also let the for-loop stop at 1 and just write the ending after the loop.

import re

for bottle_count in range(99, 0, -1):
    default_line = (
        f"{bottle_count} bottles of beer on the wall, "
        f"{bottle_count} bottles of beer.\n\t"
        "Take one down and pass it around, "
        f"{bottle_count - 1} bottles of beer on the wall.\n"
    )
    updated_line = re.sub("\\b0", "no more", default_line)
    updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
    print(updated_line)

print(
    "No more bottles of beer on the wall, "
    "no more bottles of beer.\n\t"
    "Go to the store and buy some more, "
    "99 bottles of beer on the wall.\n"
)

3.7 A different way of looping

In addition to a for-loop, we could also use a while-loop. Unlike in for-loops, the loop variable is not updated automatically, so we have to be extra careful to specify how the loop should end or we may be stuck in an infinite loop. In general, prefer for-loops over while-loops unless there is a good reason not to.

import re

bottle_count = 99

while bottle_count > 0:
    default_line = (
        f"{bottle_count} bottles of beer on the wall, "
        f"{bottle_count} bottles of beer.\n\t"
        "Take one down and pass it around, "
        f"{bottle_count - 1} bottles of beer on the wall.\n"
    )
    updated_line = re.sub("\\b0", "no more", default_line)
    updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
    print(updated_line)
    bottle_count = bottle_count - 1 # infinite loop without this line ! ! !

print(
    "No more bottles of beer on the wall, "
    "no more bottles of beer.\n\t"
    "Go to the store and buy some more, "
    "99 bottles of beer on the wall.\n"
)

3.8 Saving the output to a file

There are different ways we can write the output to a file:

  • We can add each line to the output file as it is generated.
import re

for bottle_count in range(99, 0, -1):
    default_line = (
        f"{bottle_count} bottles of beer on the wall, "
        f"{bottle_count} bottles of beer.\n\t"
        "Take one down and pass it around, "
        f"{bottle_count - 1} bottles of beer on the wall.\n\n"
    )
    updated_line = re.sub("\\b0", "no more", default_line)
    updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
    with open("99bottles.txt", "a") as outp:
        outp.write(updated_line)
with open("99bottles.txt", "a") as outp:
    outp.write(
        "No more bottles of beer on the wall, "
        "no more bottles of beer.\n\t"
        "Go to the store and buy some more, "
        "99 bottles of beer on the wall.\n\n"
    )
  • Or we can add each line to a string variable and write the entire song to the output file in one go.
import re

song_lyrics = ""

for bottle_count in range(99, 0, -1):
    default_line = (
        f"{bottle_count} bottles of beer on the wall, "
        f"{bottle_count} bottles of beer.\n\t"
        "Take one down and pass it around, "
        f"{bottle_count - 1} bottles of beer on the wall.\n\n"
    )
    updated_line = re.sub("\\b0", "no more", default_line)
    updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
    song_lyrics += updated_line

song_lyrics += (
    "No more bottles of beer on the wall, "
    "no more bottles of beer.\n\t"
    "Go to the store and buy some more, "
    "99 bottles of beer on the wall.\n\n"
)

with open("99bottles.txt", "w") as outp:
    outp.write(song_lyrics)

The open() function has three basic modes:

  • “r”: read
  • “w”: write, which overwrites an existing file
  • “a”: append, which adds to the content of an existing file rather than overwriting it
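
A small demonstration of the difference between “w” and “a” (the file name demo.txt is arbitrary):

with open("demo.txt", "w") as outp:
    outp.write("first line\n")  # "w" starts from an empty file
with open("demo.txt", "a") as outp:
    outp.write("second line\n")  # "a" keeps what is already there
with open("demo.txt", "r") as inp:
    print(inp.read())
## first line
## second line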

3.9 How do we make alternate versions?

So far, we’ve just replicated the song lyrics. This has shown us how to use for-loops and while-loops, how to insert variables into strings, some basics of regular expression substitutions, and exporting output to a file. The song lyrics themselves haven’t changed though. Boring! We can learn more about using variables by modifying the song.

First, we may want to change the number at which we start counting. Maybe the song is too long (or too short?). We could start at ten or a million. One obvious way to adapt the code is to insert the number we want in the range() function, e.g. for bottle_count in range(10, 0, -1). A more readable way to do this is to define a variable prior to the loop so it is easier to spot and edit.

import re

song_lyrics = ""
count_start = 10 # define our counting start here

for bottle_count in range(count_start, 0, -1):
    default_line = (
        f"{bottle_count} bottles of beer on the wall, "
        f"{bottle_count} bottles of beer.\n\t"
        "Take one down and pass it around, "
        f"{bottle_count - 1} bottles of beer on the wall.\n\n"
    )
    updated_line = re.sub("\\b0", "no more", default_line)
    updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
    song_lyrics += updated_line

song_lyrics += (
    "No more bottles of beer on the wall, "
    "no more bottles of beer.\n\t"
    "Go to the store and buy some more, "
    f"{count_start} bottles of beer on the wall.\n\n"
)

print(song_lyrics)

Another alteration we could make is to change the beverage. Perhaps you prefer wine over beer, or just plain water. Again, we can define our beverage before the loop starts, and then pass that variable to the f-strings.

import re

song_lyrics = ""
count_start = 5 # define our counting start here
beverage = "water" # define drink of choice here

for bottle_count in range(count_start, 0, -1):
    default_line = (
        f"{bottle_count} bottles of {beverage} on the wall, "
        f"{bottle_count} bottles of {beverage}.\n\t"
        "Take one down and pass it around, "
        f"{bottle_count - 1} bottles of {beverage} on the wall.\n\n"
    )
    updated_line = re.sub("\\b0", "no more", default_line)
    updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
    song_lyrics += updated_line

song_lyrics += (
    f"No more bottles of {beverage} on the wall, "
    f"no more bottles of {beverage}.\n\t"
    "Go to the store and buy some more, "
    f"{count_start} bottles of {beverage} on the wall.\n\n"
)

print(song_lyrics)

3.10 Create our alternate versions from the command line

It’s neat that we can now make our own alternate versions of the song, but having to edit the program every time we change anything is cumbersome and error-prone. We can instead set up the program so that variables such as the starting count and the beverage are specified when we run the program from the command line. For this, we use the module argparse (argument parser).

import argparse
import re

parser = argparse.ArgumentParser(
    description = (
        "Generate lyrics to '99 bottles of beer on the wall'"
        " (or variants thereof)"
    ),
    epilog = "Please drink responsibly!"
)

parser.add_argument(
    "-b",
    "--beverage",
    default = "beer",
    type = str,
    help = "Beverage of choice (default: beer)"
)

parser.add_argument(
    "-c",
    "--count",
    default = 99,
    type = int,
    help = "Number at which to start our countdown (default: 99)"
)

args = parser.parse_args()

beverage = args.beverage
count_start = args.count

song_lyrics = ""

for bottle_count in range(count_start, 0, -1):
    default_line = (
        f"{bottle_count} bottles of {beverage} on the wall, "
        f"{bottle_count} bottles of {beverage}.\n\t"
        "Take one down and pass it around, "
        f"{bottle_count - 1} bottles of {beverage} on the wall.\n\n"
    )
    updated_line = re.sub("\\b0", "no more", default_line)
    updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
    song_lyrics += updated_line

song_lyrics += (
    f"No more bottles of {beverage} on the wall, "
    f"no more bottles of {beverage}.\n\t"
    "Go to the store and buy some more, "
    f"{count_start} bottles of {beverage} on the wall.\n\n"
)

with open(f"{count_start}-bottles-of-{beverage}.txt", "w") as outp:
    outp.write(song_lyrics)
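
Assuming we saved the script under a name such as bottles.py, we can now run it from the command line as python3 bottles.py --beverage wine --count 10 (or with the short options: python3 bottles.py -b wine -c 10). Running python3 bottles.py --help prints the usage information that argparse generates from our descriptions.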

3.11 Working version

Our working version of the script now looks something like this. We could of course keep going and keep adding further improvements (add a choice of container such as glasses, kegs, cans, canteens, amphorae, etc., and spell out numbers, e.g. ninety-nine), but let us now move on to applications closer to our purpose.

#!/usr/bin/env python3

import argparse
import re

parser = argparse.ArgumentParser(
    description = (
        "Generate lyrics to '99 bottles of beer on the wall'"
        " (or variants thereof)"
    ),
    epilog = "Please drink responsibly!"
)

parser.add_argument(
    "-b",
    "--beverage",
    default = "beer",
    type = str,
    help = "beverage of choice (default: beer)"
)

parser.add_argument(
    "-c",
    "--count",
    default = 99,
    type = int,
    help = "number at which to start our countdown (default: 99)"
)

args = parser.parse_args()

beverage = args.beverage
count_start = args.count

song_lyrics = ""

for bottle_count in range(count_start, 0, -1):
    default_line = (
        f"{bottle_count} bottles of {beverage} on the wall, "
        f"{bottle_count} bottles of {beverage}.\n\t"
        "Take one down and pass it around, "
        f"{bottle_count - 1} bottles of {beverage} on the wall.\n\n"
    )
    updated_line = re.sub("\\b0", "no more", default_line)
    updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
    song_lyrics += updated_line

song_lyrics += (
    f"No more bottles of {beverage} on the wall, "
    f"no more bottles of {beverage}.\n\t"
    "Go to the store and buy some more, "
    f"{count_start} bottles of {beverage} on the wall.\n\n"
)

with open(f"{count_start}-bottles-of-{beverage}.txt", "w") as outp:
    outp.write(song_lyrics)

4 Handling a CorpusSearch output file

We will create a Python script that will take a CorpusSearch output file and convert it to a table that we can import in R or in a spreadsheet application such as Excel. We will be working on the file french-based-verbs-with-indirect-objects-PPCME2.out, but our script should work on any CorpusSearch output file from the PPCME2.

We first look at our CorpusSearch file to examine its structure. What is the information we want to extract, and how could a script reliably extract it?

Each sentence is surrounded by a pair of /~* and *~/, and the indices of the matching elements are surrounded by a pair of /* and */. The latter pair is also used for the preface and the header at the very top of the file, so we have to be careful to treat those differently.

4.1 Reading an input file

Before we process the file, however, we need to open it and read its contents into Python. As we want to handle any CorpusSearch output file and not just the one we are working with now, we want to be able to specify the file name when we run the program. We can parse the arguments as we did earlier, except that this time we read the file name. We also make the code more “pythonic” by storing individual steps as custom functions, which we then assemble in a main function. This is done for two main reasons:

  1. The functions can be imported and reused in other scripts;
  2. The code is easier to read, as a reader can understand what the code does without going into the nitty-gritty by just reading the main function (provided that names of the custom functions are transparent enough to indicate what they do).
  3. As an additional bonus, many editors use function definitions as folding points, which means that we can more easily focus on specific parts of the code.
import argparse

def get_arguments():
    parser = argparse.ArgumentParser(
        description = (
            "Extract information from a CorpusSearch output file "
            "to a CSV table"
        )
    )
    parser.add_argument(
        "file_name",
        help = "Name of the file to be processed"
    )
    arguments = parser.parse_args()
    return arguments

def read_file(file_name):
    with open(file_name, "r") as inp:
        return inp.read()

def main():
    args = get_arguments()
    content = read_file(args.file_name)
    print(content[0:1000]) # test to verify that the file has been read

if __name__ == "__main__":
    main()

When running the script, the functions are stored but not executed unless called. If we execute the script directly, the main() function is executed, but not when we import the script. This coding style allows us to reuse functions, e.g. from output_to_csv import read_file.

4.2 Handling errors

As it stands, our script relies on the user entering the name of the file to be processed correctly. If there is a typo or the intended file does not exist at this location, our script crashes. We can handle this more gracefully by catching the error and letting the user know what likely went wrong.

import argparse

def get_arguments():
    parser = argparse.ArgumentParser(
        description = (
            "Extract information from a CorpusSearch output file "
            "to a CSV table"
        )
    )
    parser.add_argument(
        "file_name",
        help = "name of the file to be processed"
    )
    arguments = parser.parse_args()
    return arguments

def read_file(file_name):
    try:
        with open(file_name, "r") as inp:
            return inp.read()
    except FileNotFoundError:
        print(
            "\nWARNING: "
            "The file you specified does not exist.\n"
            "Please verify its spelling or location and try again.\n"
        )
        quit()

def main():
    args = get_arguments()
    content = read_file(args.file_name)
    print(content[0:1000]) # test to verify that the file has been read

if __name__ == "__main__":
    main()

4.3 Segmenting the file into its sentences

As each sentence is surrounded by a pair of /~* and *~/, we can use the character sequence /~* as a reliable marker to segment the text into units that correspond to query hits. We can do this with the .split() method, which yields a list containing all our hits as distinct elements.

NB: Methods are similar to functions, except that they are always applied to objects and never in a standalone manner.
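
For instance, len() is a function that we call on its own, whereas .upper() is a method attached to a string object:

print(len("hello"))     # a function takes the object as an argument
## 5
print("hello".upper())  # a method is called on the object with a dot
## HELLO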

sentences = content.split("/~*")
print(sentences[0]) # test to see the header
print(sentences[1]) # test to see the first actual query hit
print(sentences[-1]) # test to see the last query hit

The splitting sequence is not included in the list elements. However, the closing tag is still present, which gives a way to exclude the file header.

4.4 Iterating through our hits

We can treat each hit with a for-loop.

for sentence in sentences:
    sentence[0:10] # first 10 characters as a test

As you may have noticed, our first hit is the preface section. We should exclude it by ensuring that only items containing the *~/ closing tag are processed. We can use the .find() method, which returns the position of the hit, or -1 in case there is no match. An alternative would be to use the re.search() function, which returns a match object if there is a match, and None if there is none. When placed in an if-statement, Python interprets the presence of a match object as logical “true”, and its absence as logical “false”. We could therefore also use if re.search("\*~/", sentence):, without forgetting to escape the asterisk character.

for sentence in sentences:
    if sentence.find("*~/") != -1:
        sentence[0:10]

4.5 Extracting relevant information

What information do we want to extract from the CorpusSearch output file? This of course depends on specific use cases. Here, we will extract the corpus file name, the sentence ID, the verb information (lemma, MED-ID, etymology) and the sentence itself.

4.5.1 Corpus file and sentence ID

The corpus file name is actually contained in the sentence ID, so we can kill two birds with one stone here. All sentence IDs are in parentheses and begin with CM. We can use this pattern to extract the sentence ID reliably using regular expressions. We need to escape the parentheses with a backslash, as parentheses are special characters to mark groups in regular expressions. We use the re.search() function to find our match. We can then extract our match and remove the surrounding parentheses.

sentence = sentences[1] # use the first actual hit as a test case
sentence_id = re.search("\(CM.*\)", sentence) # perform our search
sentence_id = sentence_id[0] # extract the actual match
# NB: sentence_id.group() or sentence_id.group(0) would also work
sentence_id = re.sub("\((.*)\)", "\\1", sentence_id)

We can define this sequence as a function so that our main() function is easier to read.

def extract_id(hit):
    id_tag = re.search("\(CM.*\)", hit)
    sentence_id = re.sub("\((.*)\)", "\\1", id_tag[0])
    return sentence_id

Extracting the corpus file name is then a one-liner that doesn’t necessarily warrant its own function.

identifier = extract_id(sentence)
# Extract content before the comma
#     to obtain the corpus file name from the identifier
corpusfile_name = re.sub("(.*),.*", "\\1", identifier)
# Alternative solution:
#    re.search("[^,]*", identifier)[0]
#    Any character that is not a comma

4.5.2 Lemma information

To extract the lemma information of the verb match in our query, we can rely on the section surrounded by the tags /* and */ listing the indices of the matched items from the query. There are a few things to consider when formulating a suitable regular expression:

  1. These tags include asterisks, which are normally quantifiers, so we need to escape them with a backslash, so /\* for the opening tag and \*/ for the closing tag;

  2. The wildcard . can refer to any character except line breaks. As the section listing the matched items is spread over multiple lines, we need to define a group that includes line breaks as well: (.|\n);

  3. The flag re.MULTILINE, or re.M for short, is not what we need here: it only changes the behaviour of the anchors ^ and $ so that they match at the beginning and end of each line rather than of the whole string.

  4. What we want instead is to redefine the . wildcard to include line breaks with the flag re.DOTALL, or re.S for short. This replaces the (.|\n) group from point 2 and keeps the regular expression a bit simpler.

query_matches = re.search("/\*.*\*/", sentence, re.DOTALL)[0]
# Sequence /* followed by any characters including line breaks until
#     the closing sequence */

It turns out our regular expression was too broad as we also include the header/footer sections that occur whenever results from a new corpus file are displayed. We can fix this by specifying that the line occurring after the opening tag should begin with a digit and making the quantifier non-greedy by placing a ? character after it.

query_matches = re.search("/\*\n\d.*?\*/", sentence, re.DOTALL)[0]
# Sequence /* followed by a line break, the next line beginning with a digit,
#     until the next closing sequence */

Now that we have isolated the section listing the query hits, we can think about how we want to extract the lemma information. We have to consider the fact that there can be more than one query hit per sentence (e.g. in CMCLOUD,131.811 or CMMIRK,13.358). The function re.search() stops at the first match it finds, but the function re.findall() generates a list of all the matches it finds. This means we can loop over such a list to process every verb matching the query. Let us define a function that handles the extraction of lemma information as a first step. We can deal with the looping when it is time to write the data to a table.

def extract_lemmas(hit):
    query_matches = re.search("/\*\n\d.*?\*/", hit, re.DOTALL)[0]
    # Sequence /* followed by a line break,
    #     the next line beginning with a digit,
    #     until the next closing sequence */
    lemmas = re.findall("@.*@", query_matches)
    # To include verb form: "\w*@.*@"
    return lemmas

4.5.3 Text

To extract the actual text of the sentence to include it in our table, we need to reliably identify its boundaries. The markers are /~* (which we chopped off when we segmented the file) and *~/. We can extract the content by gathering the material occurring up to this closing tag, then removing the sentence ID and the closing tag. We do this by marking the material we want to keep as a group by placing it in parentheses, then accessing it with .group(1) (alternatively [1]). Finally, there are line breaks that have been hard-coded by CorpusSearch. We can remove the leading and trailing line breaks with the .strip() method, and replace the line breaks within the sentence with spaces using the re.sub() function.

text_sequence = re.search(".*\*~/", sentence, re.DOTALL)[0]
# Select from beginning of segment to closing tag
sentence_only = re.search("(.*)\(CM.*", text_sequence, re.DOTALL).group(1)
# Keep group that precedes the sentence ID marker beginning with (CM
cleaned_sentence = sentence_only.strip() # remove leading and trailing white space
cleaned_sentence = re.sub("\n", " ", cleaned_sentence)

We have now extracted the actual sentence, but perhaps we want to remove the inserted lemma information, as it may make the sentence too long or cumbersome to read in a table. We can leave the decision whether to keep the annotation to the user via a flag in the program call. We can wrap our code to extract the sentence in a function. The optional removal of the annotation marking is best done outside of this function, in the main function, as it is a one-liner and saves us the trouble of carrying the argument variable over into our custom function. For the one-liner to remove the lemma annotation, we have to be careful how we formulate our regular expression. Using @.*@ would capture everything from the first to the last verb, and making it non-greedy with @.*?@ would not cover the annotation sequences, as they contain multiple @ markers. The easiest solution is to specify non-whitespace characters as our wildcard: @\S*@.

def get_arguments():
    parser = argparse.ArgumentParser(
        description = (
            "Extract information from a CorpusSearch output file "
            "to a CSV table"
        )
    )
    parser.add_argument(
        "file_name",
        help = "Name of the file to be processed"
    )
    parser.add_argument(
        "-p",
        "--plain",
        action = "store_true",
        help = "Toggle plain text by removing lemma annotation"
    )
    arguments = parser.parse_args()
    return arguments

def extract_text(hit):
    text_sequence = re.search(".*\*~/", hit, re.DOTALL)[0]
    # Select from beginning of segment to closing tag
    sentence_only = re.search("(.*)\(CM.*", text_sequence, re.DOTALL).group(1)
    # Keep group that precedes the sentence ID marker beginning with (CM
    cleaned_sentence = sentence_only.strip() # remove leading and trailing white space
    cleaned_sentence = re.sub("\\n", " ", cleaned_sentence)
    return cleaned_sentence

def main():
    args = get_arguments()
    content = read_file(args.file_name)
    sentences = content.split("/~*")
    for sentence in sentences:
        if sentence.find("*~/") != -1: # exclude header
            identifier = extract_id(sentence)
            corpusfile_name = re.sub("(.*),.*", "\\1", identifier)
            verbs = extract_lemmas(sentence)
            text = extract_text(sentence)
            if args.plain:
                text = re.sub("@\S*@", "", text)
            print(text)

4.6 Writing the table

We have extracted all the information we want, and it is now time to assemble the puzzle pieces. There are two options to write out our data (see our prior song example): (1) write each line to the export file as we go, or (2) gather everything in a variable and write to the export file in bulk. Here we go with option (1) for reasons of simplicity and debugging: if there is indeed an error, we can check in our export file which sentence was last processed correctly, and which sentence caused problems. We could then identify whether there were patterns that we had not anticipated.

We need to define a name for our export file. We can base this on the name of our input file, substituting the ending .out with .csv. We can then write a header for the table prior to our for-loop. We will be using a tab-separated format, as the other common delimiters, comma and semicolon, may occur within the text field. (NB: Text delimiters such as single or double quotation marks are also problematic as they may also occur in the text fields.) Within our loop, we then write a table row for each hit. Remember that there could be more than one hit per sentence. We therefore extract the information that is constant within a sentence first, i.e. corpus file, ID, and text. Then, we run a for-loop for each matching verb within that sentence.

def write_table_header(name):
    with open(name, "w") as outp:
        outp.write("CorpusFile\tsentence_id\tVerbInfo\tSentence\n")

def main():
    args = get_arguments()
    content = read_file(args.file_name)
    output_name = re.sub("\.out$", ".csv", args.file_name)
    write_table_header(output_name)
    sentences = content.split("/~*")
    for sentence in sentences:
        if sentence.find("*~/") != -1: # exclude header
            identifier = extract_id(sentence)
            corpus_file_name = re.sub("(.*),.*", "\\1", identifier)
            verbs = extract_lemmas(sentence)
            text = extract_text(sentence)
            if args.plain:
                text = re.sub("@\S*@", "", text)
            for verb in verbs:
                with open(output_name, "a") as outp:
                    outp.write(f"{corpus_file_name}\t"
                    f"{identifier}\t"
                    f"{verb}\t"
                    f"{text}\n"
                    )
            print(f"\nSentence {identifier} processed...\n")
            # Inform user of progress
    print("\nProcess complete!\nNow quitting.\n")
    # Inform user of completion

4.7 Working version

Our working version then looks something like this. (NB: Let us refrain from calling it a final or finished version, as there are still improvements we could make.)

#!/usr/bin/env python3

import argparse
import re

def get_arguments():
    parser = argparse.ArgumentParser(
        description = (
            "Extract information from a CorpusSearch output file "
            "to a CSV table"
        )
    )
    parser.add_argument(
        "file_name",
        help = "Name of the file to be processed"
    )
    parser.add_argument(
        "-p",
        "--plain",
        action = "store_true",
        help = "Toggle plain text by removing lemma annotation"
    )
    arguments = parser.parse_args()
    return arguments

def read_file(file_name):
    try:
        with open(file_name, "r") as inp:
            return inp.read()
    except FileNotFoundError:
        print(
            "\nWARNING: "
            "The file you specified does not exist.\n"
            "Please verify and try again.\n"
        )
        quit()

def extract_id(hit):
    id_tag = re.search("\(CM.*\)", hit)
    sentence_id = re.sub("\((.*)\)", "\\1", id_tag[0])
    return sentence_id

def extract_lemmas(hit):
    query_matches = re.search("/\*\n\d.*?\*/", hit, re.DOTALL)[0]
    # Sequence /* followed by a line break,
    #     the next line beginning with a digit,
    #     until the next closing sequence */
    lemmas = re.findall("@.*@", query_matches)
    # To include verb form: "\w*@.*@"
    return lemmas

def extract_text(hit):
    text_sequence = re.search(".*\*~/", hit, re.DOTALL)[0]
    # Select from beginning of segment to closing tag
    sentence_only = re.search("(.*)\(CM.*", text_sequence, re.DOTALL).group(1)
    # Keep group that precedes the sentence ID marker beginning with (CM
    cleaned_sentence = sentence_only.strip() # remove leading and trailing white space
    cleaned_sentence = re.sub("\n", " ", cleaned_sentence)
    return cleaned_sentence

def write_table_header(name):
    with open(name, "w") as outp:
        outp.write("CorpusFile\tSentenceID\tVerbInfo\tSentence\n")

def main():
    args = get_arguments()
    content = read_file(args.file_name)
    output_name = re.sub("\.out$", ".csv", args.file_name)
    write_table_header(output_name)
    sentences = content.split("/~*")
    for sentence in sentences:
        if sentence.find("*~/") != -1: # exclude header
            identifier = extract_id(sentence)
            corpus_file_name = re.sub("(.*),.*", "\\1", identifier)
            verbs = extract_lemmas(sentence)
            text = extract_text(sentence)
            if args.plain:
                text = re.sub("@\S*@", "", text)
            for verb in verbs:
                with open(output_name, "a") as outp:
                    outp.write(f"{corpus_file_name}\t"
                               f"{identifier}\t"
                               f"{verb}\t"
                               f"{text}\n"
                               )
            print(f"\nSentence {identifier} processed...\n")
            # Inform user of progress
    print("\nProcess complete!\nNow quitting.\n")
    # Inform user of completion

if __name__ == "__main__":
    main()
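
Assuming the script is saved as output_to_csv.py (the name we will import from in the next section), it can be run as python3 output_to_csv.py french-based-verbs-with-indirect-objects-PPCME2.out, optionally with -p (or --plain) to strip the lemma annotation from the exported sentences. The resulting table is then written to french-based-verbs-with-indirect-objects-PPCME2.csv.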

5 Extracting data from manually annotated files

Using Tara’s data of 14th-century Dutch, we will extract the manual annotation into a spreadsheet-ready table. As the data are spread over multiple files, we will also need to learn how to process such data. We will first handle the single file WestVlaanderen_1390_1399.txt before extending our script to handle multiple files.

Another challenge is that the custom annotation format allows nesting/recursion in that a clause may contain another clause. Our best bet is to convert the annotation scheme to XML so that we can use tried-and-tested XML parsers rather than having to reinvent the wheel. To do so, we first need to investigate the annotation scheme:

  • Clauses of interest are enclosed with the opening tag [ and the closing tag ./], and may be nested.
  • Within these clauses, tags have been inserted as s/, a/, o/, etc. They have no closing tag, so we can either assume that their scope extends only to the following word, or that it extends to the next tag or the end of the sentence. As the scheme seems to annotate constituents rather than words, we will assume the latter.

5.1 Conversion to XML tags

We will first read our test file to see how we can best convert the current scheme to XML. We can reuse the read_file() function from our previous script. NB: The scripts have to be in the same directory for this. For better organisation, it may be worth storing custom functions that are used across several scripts in a dedicated script.

To determine our file path in a way that works on different operating systems, we can use the os package and its function os.path.join().

import re
from output_to_csv import read_file
from os import path
file_path = path.join("dutch-data", "WestVlaanderen_1390_1399.txt")
content = read_file(file_path)
content[0:1000] # escape sequence for non-breakable spaces makes this illegible
print(content[0:1000]) # easier to read

It turns out that the export from Word inserted all kinds of special characters into the plain-text file. They are rendered properly when viewed with print(), but they are still present under the hood and we need to keep them in mind when processing the data, especially the non-breakable space characters, which show up as the escape sequence \xa0. There are multiple ways in which we could handle them (although we luckily don’t even have to):

  1. Substitute the \xa0 sequences with ordinary space characters;

  2. Normalize the text to Unicode with a function from the unicodedata package, e.g. unicodedata.normalize("NFKD", text);

  3. Handle the escape sequences after conversion to XML, as the package we will use (“Beautiful Soup”) has methods to clean up the data conveniently if we absolutely wanted to (but we won’t actually need to do that).

We can use the regular expression wild card \s should we need to refer to spaces, as this covers all kinds of space characters.
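
A quick illustration with a made-up two-word string containing a non-breakable space:

import re

print(re.sub("\s", " ", "two\xa0words"))
## two words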

To convert the current annotation to XML, we have to undertake two types of conversion:

  1. The easy one: change [ to <sentence> and ./] to </sentence>;

  2. The harder one: change the other tags to corresponding XML tags, e.g. v/ to <v>, and find a suitable placement for the closing tag. We first replace the opening tags, so that all tags are now marked by a common < character. This way, we can define the scope of each tag as any run of characters other than <.

sentence_opening = re.sub("\[", "<sentence>", content)
sentence_closing = re.sub("\./\]", "</sentence>", sentence_opening)
tags_opening = re.sub("\\b(\w)/","<\\1>", sentence_closing)
# Capture the word character only, as we don't want to keep the slash
tags_closing = re.sub("<(\w)>([^<]*)", "<\\1>\\2</\\1>", tags_opening)
# Capture the character used in the opening tag,
# then capture the group of following characters other than <
# the closing tag is the same as the opening one but with a slash

The <c> tag is problematic when it contains a nested clause of interest (as in line 15 of our example file). We therefore handle such cases separately, immediately after inserting our sentence markers and before we handle the other tags. For these cases we first place a different set of temporary tags made up of non-word characters (<.> and <..>), because scope works differently for such <c> tags than for the other tags. We replace these temporary markers in our last step.

sentence_opening = re.sub("\[", "<sentence>", content)
sentence_closing = re.sub("\./\]", "</sentence>", sentence_opening)
clause_marking = re.sub(
    "c/<sentence>(.*?)</sentence>",
    "<.><sentence>\\1</sentence><..>",
    sentence_closing)
    # Place unique temporary markers for clauses containing
    #    an embedded sentence of interest
tags_opening = re.sub("\\b(\w)/","<\\1>", clause_marking)
# Capture the word character only, as we don't want to keep the slash
tags_closing = re.sub("<(\w)>([^<]*)", "<\\1>\\2</\\1>", tags_opening)
# Capture the character used in the opening tag,
# then capture the group of following characters other than <
# the closing tag is the same as the opening one but with a slash
clause_opening = re.sub("<\.>", "<c>", tags_closing)
clause_closing = re.sub("<\.\.>", "</c>", clause_opening)
# Replace our temporary markers

We can now store all this as a function for later use.

def convert_to_xml(text):
    sentence_opening = re.sub("\[", "<sentence>", text)
    sentence_closing = re.sub("\./\]", "</sentence>", sentence_opening)
    clause_marking = re.sub(
        "c/<sentence>(.*?)</sentence>",
        "<.><sentence>\\1</sentence><..>",
        sentence_closing)
    # Place unique temporary markers for clauses containing
    #    an embedded sentence of interest
    tags_opening = re.sub("\\b(\w)/","<\\1>", clause_marking)
    # Capture the word character only, as we don't want to keep the slash
    tags_closing = re.sub("<(\w)>([^<]*)", "<\\1>\\2</\\1>", tags_opening)
    # Capture the character used in the opening tag,
    # then capture the group of following characters other than <
    # the closing tag is the same as the opening one but with a slash
    clause_opening = re.sub("<\.>", "<c>", tags_closing)
    clause_closing = re.sub("<\.\.>", "</c>", clause_opening)
    # Replace our temporary markers
    return clause_closing
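
To see the function in action, we can feed it a made-up toy line that follows the annotation scheme described above (AAA, BBB and CCC stand in for actual constituents):

toy = "c/[ s/ AAA o/ BBB v/ CCC ./]"
print(convert_to_xml(toy))
## <c><sentence> <s> AAA </s><o> BBB </o><v> CCC </v></sentence></c>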

5.2 XML parsing

To parse our converted data, we will use the package “Beautiful Soup”, which is a very powerful tool for parsing HTML and XML data. Unlike the re module, it is not built in and needs to be installed. The easiest way to install additional packages in Python is to use pip (Package Installer for Python). From our terminal, we can install “Beautiful Soup” with the command python -m pip install beautifulsoup4 or python3 -m pip install beautifulsoup4. For XML parsing, we also need the package lxml, which we can install with python -m pip install lxml or python3 -m pip install lxml.

To parse the text as XML, we pass it to the BeautifulSoup() function and specify the use of the XML parser.

from bs4 import BeautifulSoup

xml_ready = convert_to_xml(content)
xml = BeautifulSoup(xml_ready, "xml")
print(xml)

We only obtain an XML declaration line, but nothing else. Why is that? To be parsed, the XML text as a whole needs to be enclosed in a single root element, i.e. one outer set of tags. Given the extensible nature of XML, we can easily do that with a custom set of tags such as <text> or <xml>.

xml = BeautifulSoup(f"<text>{xml_ready}</text>", "xml")
print(xml)
print(xml.prettify())

We can search for specific tags, e.g. all <o> or all <v> tags. We can access different attributes from these tags, such as .name and .string.

xml.find_all("o")
for tag in xml.find_all("v"):
    print(f"{tag.name}:\t{tag.string}")

We can search for all tags by passing True as the search term to .find_all().

for tag in xml.find_all(True):
    print(f"{tag.name}:\t{tag.string}")

The .string attribute only contains a value when the tag contains text, otherwise it is defined as None. This allows us to see that the <c> tag in the third sentence of interest contains an embedded sentence of interest.

5.3 Extracting the information we want

Now that we have a way to access the sentences of interest and their tags, even the recursively embedded ones, we can begin extracting the information we want:

  • the list of tags

  • whether the order is OV or VO

  • the actual sentence

5.3.1 Tag list

We can obtain a tag’s children with the .contents attribute. To iterate over them, we can use the .children generator instead.

sentences = xml.find_all("sentence")
sentences[0].contents # look at the first sentence before we loop over all
for child in sentences[0].children:
    print(child.name)

We can use this to build up a string variable containing the list of tags in the sentence (excluding None values).

tag_list = ""
for child in sentences[0].children:
    if child.name is not None:
        tag_list += child.name
        
print(tag_list)

We can wrap this in a function for later use.

def get_tags(parent):
    tag_list = ""
    for child in parent.children:
        if child.name is not None:
            tag_list += child.name
    return tag_list.upper() # constituent labels in uppercase

5.3.2 VO/OV order

As we have a full list of constituents, we can use regular expressions on it to check whether V occurs before or after O (or place NA values if one or both are not present).

tags = get_tags(sentences[0])
if re.search("V.*O", tags):
    vo_order = "VO"
elif re.search("O.*V", tags): # elif = else if
    vo_order = "OV"
else:
    vo_order = "NA"

print(vo_order)

Again, we can store this in a function for later use.

def get_ov_order(tag_list):
    if re.search("V.*O", tag_list):
        vo_order = "VO"
    elif re.search("O.*V", tag_list):
        vo_order = "OV"
    else:
        vo_order = "NA"
    return vo_order
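
We can quickly sanity-check the function with a few made-up tag strings:

print(get_ov_order("SVO"))
## VO
print(get_ov_order("SOV"))
## OV
print(get_ov_order("SV"))
## NA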

5.3.3 Extracting the sentence

The .string attribute is inadequate for extracting the whole sentence, as we get None values whenever there are embedded tags. To obtain the entire “human-readable” text, we can use the .get_text() method. As this is a one-liner, we do not necessarily need a dedicated function.

text = sentences[0].get_text()
print(text)

5.4 Looping through all the sentences of a file

Now that we have found ways to extract the information we need, we can loop through all the sentences of a file.

for sentence in sentences:
    tags = get_tags(sentence)
    ov = get_ov_order(tags)
    text = sentence.get_text()
    print("---\n"
          f"Tags:\t{tags}\n"
          f"OV:\t{ov}\n"
          f"Text:\t{text}\n"
          "---\n"
    )

Before we export this data to a table, we shouldn’t forget that we want to handle multiple files.

5.5 Looping through files

The os package contains a function os.listdir() that returns a list of files in a given directory/folder.

from os import listdir
listdir(".") # contents of the current working directory
listdir("dutch-data") # contents of a specific directory

This way, we can generate a list of files over which we can loop. We can also catch potential errors if the user provides a wrong directory name, and exclude files that are not of the type we want, e.g. invisible files that may have been placed by the OS or by syncing services such as Dropbox, Nextcloud, or GitHub.

def get_file_list(directory):
    try:
        file_names = listdir(directory)
        return file_names
    except FileNotFoundError:
        print("\nThe directory you specified does not exist."
              "\nPlease verify its existence and spelling and try again."
              "\nQuitting now...\n"
        )
        quit()

files = get_file_list("dutch-data")

for file in files:
    if re.search("\.txt$", file):
        print(file)

files = get_file_list("ditch-data") # typo in the directory name

5.6 Putting it all together

Now that we have all the “puzzle pieces”, we can assemble them. First, we write a function to parse the command-line arguments to obtain the name of the directory. We can adapt the function we defined in our previous script.

def get_arguments():
    parser = argparse.ArgumentParser(
        description = (
            "Extract information from annotated text files "
            "to a CSV table"
        )
    )
    parser.add_argument(
        "dir_name",
        help = "Name of the directory to be processed"
    )
    arguments = parser.parse_args()
    return arguments

One tiny missing puzzle piece is a function to prepare a table header before we iterate through our files and sentences.

def write_table_header(file_name):
    with open(file_name, "w") as outp:
        outp.write("File\tConstituents\tOV_order\tText\n")

Now we can start working on our main() function, in which we assemble our “puzzle”.

from os import listdir, path

def main():
    args = get_arguments()
    files = get_file_list(args.dir_name)
    output_name = f"{args.dir_name}.csv"
    write_table_header(output_name)
    for file in files:
        if re.search("\.txt$", file):
            file_path = path.join(args.dir_name, file)
            content = read_file(file_path)
            xml_ready = convert_to_xml(content)
            xml = BeautifulSoup(f"<text>{xml_ready}</text>", "xml")
            sentences = xml.find_all("sentence")
            for sentence in sentences:
                tags = get_tags(sentence)
                ov = get_ov_order(tags)
                text = sentence.get_text()
                with open(output_name, "a") as outp:
                    outp.write(f"{file}\t{tags}\t{ov}\t{text}\n"
            print(f"\nFile {file} processed...\n")
    print(
        "All files processed.\n"
        f"Table saved as {output_name}\n"
        "Now quitting.\n"
    )

5.7 Working version

The working version of our script looks something like this.

#!/usr/bin/env python3

from os import listdir, path
import re
import argparse
from bs4 import BeautifulSoup
from output_to_csv import read_file

def get_arguments():
    parser = argparse.ArgumentParser(
        description = (
            "Extract information from annotated text files "
            "to a CSV table"
        )
    )
    parser.add_argument(
        "dir_name",
        help = "Name of the directory to be processed"
    )
    arguments = parser.parse_args()
    return arguments

def get_file_list(directory):
    try:
        file_names = listdir(directory)
        return file_names
    except FileNotFoundError:
        print("\nThe directory you specified does not exist."
              "\nPlease verify its existence and spelling and try again."
              "\nQuitting now...\n"
        )
        quit()

def write_table_header(file_name):
    with open(file_name, "w") as outp:
        outp.write("File\tConstituents\tOV_order\tText\n")

def convert_to_xml(text):
    sentence_opening = re.sub("\[", "<sentence>", text)
    sentence_closing = re.sub("\./\]", "</sentence>", sentence_opening)
    clause_marking = re.sub(
        "c/<sentence>(.*?)</sentence>",
        "<.><sentence>\\1</sentence><..>",
        sentence_closing)
    tags_opening = re.sub("\\b(\w)/","<\\1>", clause_marking)
    # Capture the word character only, as we don't want to keep the slash
    tags_closing = re.sub("<(\w)>([^<]*)", "<\\1>\\2</\\1>", tags_opening)
    # Capture the character used in the opening tag,
    # then capture the group of following characters other than <
    # the closing tag is the same as the opening one but with a slash
    clause_opening = re.sub("<\.>", "<c>", tags_closing)
    clause_closing = re.sub("<\.\.>", "</c>", clause_opening)
    return clause_closing

def get_tags(parent):
    tag_list = ""
    for child in parent.children:
        if child.name is not None:
            tag_list += child.name
    return tag_list.upper() # constituent labels in uppercase

def get_ov_order(tag_list):
    if re.search("V.*O", tag_list):
        vo_order = "VO"
    elif re.search("O.*V", tag_list):
        vo_order = "OV"
    else:
        vo_order = "NA"
    return vo_order

def main():
    args = get_arguments()
    files = get_file_list(args.dir_name)
    output_name = f"{args.dir_name}.csv"
    write_table_header(output_name)
    for file in files:
        if re.search("\.txt$", file):
            file_path = path.join(args.dir_name, file)
            content = read_file(file_path)
            xml_ready = convert_to_xml(content)
            xml = BeautifulSoup(f"<text>{xml_ready}</text>", "xml")
            sentences = xml.find_all("sentence")
            for sentence in sentences:
                tags = get_tags(sentence)
                ov = get_ov_order(tags)
                text = sentence.get_text()
                with open(output_name, "a") as outp:
                    outp.write(f"{file}\t{tags}\t{ov}\t{text}\n")
            print(f"\nFile {file} processed...\n")
    print(
        "All files processed.\n"
        f"Table saved as {output_name}\n"
        "Now quitting.\n"
    )

if __name__ == "__main__":
    main()

6 Determining the distance between primes and targets

Using a file from Lena’s Middle English data, Brut.out, we will determine the distance between primes and targets. Our file is a CorpusSearch output file that was generated by a query for double-object constructions and prepositional object constructions. To start out small and test the methodology, the query was carried out only on the corpus file CMBRUT3 (The Brut or The Chronicles of England). The intended measuring units for the distances between primes and targets are corpus units, i.e. the sentences marked with identifiers in the corpus.

These identifiers are the first option for measuring the distance between prime and target, provided that the sentence IDs are numbered consistently. Based on the hits in our result file, we see that the numbering scheme does not simply number sentences, but also divides the file into sections, e.g. CMBRUT3,2.29 and CMBRUT3,3.33, so we cannot measure the distance between these two units without knowing how many units section 2 contains in total. Looking at the source corpus file m3-fr.cmbrut3.m3.lemma.psd, we can see that even within a section, the numbering is not consistent, as certain numbers are skipped, e.g. there is no CMBRUT3,1.4.

Our safest bet is therefore to renumber the corpus units in a consistent manner. Rather than overwriting the identifiers used in the corpus, we will map them to a simpler natural number scheme.

6.1 Renumbering the corpus file(s)

We open our corpus file and gather all sentence IDs. We then assign each of them a running number in a for-loop, incrementing the running number by 1 at every iteration. To store the mapping of IDs and running numbers, we will use a data structure called a dictionary (called hash in many other programming languages). Dictionaries store pairs of keys and associated values under a single variable. A dictionary is defined with a pair of curly brackets; individual key/value pairs are created or accessed with square brackets. Let us do the first two IDs manually to get a feel for how dictionaries work.

import re
from output_to_csv import read_file

corpus_file = read_file("m3-fr.cmbrut3.m3.lemma.psd")

id_list = re.findall("\(ID\s.*?\)", corpus_file)
# Sequence (ID until the next closing bracket
numbering_scheme = {} # create empty dictionary
running_number = 1
# First ID
current_id = re.sub("\(ID\s(.*)\)", "\\1", id_list[0])
numbering_scheme[current_id] = running_number
numbering_scheme
# Second ID
current_id = re.sub("\(ID\s(.*)\)", "\\1", id_list[1])
running_number += 1
numbering_scheme[current_id] = running_number
numbering_scheme
# Access values by their keys
numbering_scheme[current_id]
numbering_scheme["CMBRUT3,1.3"]
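
Before we move on, here are a few more dictionary operations that we will rely on later (a quick illustration using the two entries we just created):

"CMBRUT3,1.3" in numbering_scheme # check whether a key exists
## True
len(numbering_scheme) # number of key/value pairs stored so far
## 2
list(numbering_scheme) # the keys as a list; we will use this pattern later on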

Now that we have had a first taste of how dictionaries work, we can wrap everything in a function in our script and treat all the IDs in the file in a for-loop.

import re
from output_to_csv import read_file

def renumber(content):
    id_list = re.findall("\(ID\s.*?\)", content)
    # Sequence (ID until the next closing bracket
    numbering_scheme = {} # create empty dictionary
    running_number = 1
    for unit in id_list:
        current_id = re.sub("\(ID\s(.*)\)", "\\1", unit)
        numbering_scheme[current_id] = running_number
        running_number += 1
    return numbering_scheme

def main():
    corpus_file = read_file("m3-fr.cmbrut3.m3.lemma.psd")
    id_map = renumber(corpus_file)

6.2 Making our renumbered scheme reusable

So far, we have always exported our output as CSV tables to be used by other programs (e.g. a spreadsheet application or R). Given that we intend to use our scheme in Python, it seems redundant to export it as a table and reimport it as a table later. We could also continue working with the scheme within the same Python script, but ideally we would like to separate the renumbering from the determination of prime and target distances: we want to perform the renumbering only once and then investigate the prime and target distances for multiple queries. Exporting and reimporting is therefore the more viable solution.

However, there is a way to save data directly as a Python object, called pickling. We avoid potentially lengthy conversion processes, as the pickle is “preserved” as is. Note that unpickling can execute arbitrary code, which is why one should only load pickled data from trustworthy sources.

An alternative that is better suited for sharing is the JSON (JavaScript Object Notation) format: it is human-readable and can be accessed from programming languages other than Python. Python’s json module offers the same .dump() and .load() interface as the pickle module. However, using JSON means converting back and forth between Python and JSON, and JSON only supports a limited set of data types and requires all dictionary keys to be strings. As we will use our numbering scheme only locally and in Python, we will stick to pickling: we get the advantages (speed and 100% Python compatibility) with none of the security issues, since we only load pickles that we created ourselves.

import pickle

def main():
    corpus_file = read_file("m3-fr.cmbrut3.m3.lemma.psd")
    id_map = renumber(corpus_file)
    with open("id_map.pickle", "wb") as outp:
        pickle.dump(id_map, outp) # object, file

If we restart Python (as would happen if we were to reuse the object in a different script), we can load the pickle to obtain the exact same object.

import pickle

with open("id_map.pickle", "rb") as inp:
    id_map = pickle.load(inp)
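
For comparison, the JSON equivalent would look very similar; the following is just a sketch to show the parallel interface (note that the file is opened in text mode, and that only basic data types such as strings, numbers, lists, and dictionaries with string keys survive the round trip):

import json

# Export: same pattern as pickle.dump(), but with a text-mode file
with open("id_map.json", "w") as outp:
    json.dump(id_map, outp)

# Re-import in a later session
with open("id_map.json", "r") as inp:
    id_map = json.load(inp)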

One remaining issue is that we are only handling one corpus file. Once we add more files, how will we handle the transition from one file to the next in the numbering scheme stored in our dictionary? To make our script future-proof, we will address this now by nesting dictionaries: each corpus file will have its own dictionary, and thus its own running count, within the main dictionary. (NB: We could also have a separate pickle file for each corpus file, but this may be more difficult to manage in the long run.)

We will add a second corpus file (m4-en.cmcapser.m4.lemma.psd) to see whether our renumbering system works on multiple files. We place our files in a separate directory called ppcme2.

We will also need to update our renumber() function to insert a nested dictionary rather than creating one from scratch.
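
The structure we are aiming for looks roughly like this (a sketch with made-up entries; we assume for illustration that the text ID in the second file is CMCAPSER):

id_map = {
    "CMBRUT3": {
        "CMBRUT3,1.3": 1,
        # ... one entry per corpus unit, numbered consecutively
    },
    "CMCAPSER": {
        # the second file gets its own nested dictionary
        # and its own running count starting again at 1
        "CMCAPSER,1.1": 1,  # made-up ID for illustration
        # ...
    },
}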

We can save the code below as a script called renumber_corpus_units.py and run it once on the directory ppcme2, e.g. with python3 renumber_corpus_units.py ppcme2. We will use the resulting file ppcme2.pickle in our script that measures the distance between primes and targets.

import argparse, pickle, re
from os import listdir, path
from output_to_csv import read_file
from dutch_data import get_file_list

def get_arguments():
    parser = argparse.ArgumentParser(
        description = (
            "Map corpus units to consistent numbering scheme"
        )
    )
    parser.add_argument(
        "dir_name",
        help = "Name of the directory to be processed"
    )
    arguments = parser.parse_args()
    return arguments

def renumber(content, main_dictionary, file_id):
    id_list = re.findall("\(ID\s.*?\)", content)
    # Sequence (ID until the next closing bracket
    main_dictionary[file_id] = {} # create nested dictionary
    running_number = 1
    for unit in id_list:
        current_id = re.sub("\(ID\s(.*)\)", "\\1", unit)
        main_dictionary[file_id][current_id] = running_number
        running_number += 1

def main():
    id_map = {} # main dictionary across corpus files
    args = get_arguments()
    files = get_file_list(args.dir_name)
    for file in files:
        if re.search("\.psd$", file):
            file_path = path.join(args.dir_name, file)
            with open(file_path, "r") as inp:
                file_content = inp.read()
            file_id = re.search("\(ID\s(.*?),", file_content).group(1)
            # (ID followed by a space, we catch the group until the comma
            renumber(file_content, id_map, file_id)
            print(f"File {file} processed...\n")
    with open(f"{args.dir_name}.pickle", "wb") as outp:
        pickle.dump(id_map, outp)
    print(
        f"Dictionary pickled as {args.dir_name}.pickle\n"
        "Now quitting."
        )

if __name__ == "__main__":
    main()

6.3 Handling the corpus query results

Now that we have a consistently numbered version of our corpus files, we can return to our actual goal: determining the distance between primes and their targets. Our CorpusSearch output file Brut.out contains double objects (DO) and prepositional objects (PO). We want to generate a table that contains:

  • the text ID
  • the sentence ID
  • information on whether we have a DO or a PO
  • the distance to the next hit (with the value NA for the last hit in the file)
  • the actual sentence

We first load our pickled numbering scheme.

import pickle

with open("ppcme2.pickle", "rb") as inp:
    id_map = pickle.load(inp)

We now turn to the file Brut.out. While we are, for now, treating an output file from a query that was run on a single corpus file, we can already make our script ready for queries run on multiple files. The first level of keys in our dictionary will therefore correspond to individual corpus texts, even though this appears redundant at the moment, as all hits come from the same text.

We won’t write a table directly to an output file as in our previous script that converted an output file to a CSV table (output_to_csv.py). The reason for this is that we will perform further operations once we have extracted information from the sentences, namely determining the distance to the next hit. In a first step, we will therefore store the relevant information about each hit in a dictionary, including the number we assigned in our new scheme. We will estimate the distances based on these numbers later on.

We can import some functions from our previous script output_to_csv.py and adapt some other lines from it. We store some of the operations as functions so that our main function doesn’t become too long or convoluted. As there may be more than one indirect object per sentence, we store them as a list. For now, we extract the information from each sentence and store it in our dictionary. We print the dictionary and pickle it to check whether the extraction process that constitutes step 1 worked correctly before we proceed with step 2.

#!/usr/bin/env python3

import argparse, pickle, re
from output_to_csv import read_file, extract_text

def get_arguments():
    parser = argparse.ArgumentParser(
        description = (
            "Determine distance between query hits"
        )
    )
    parser.add_argument(
        "dot_out_file",
        help = "Name of the CorpusSearch .out file to be processed"
    )
    arguments = parser.parse_args()
    return arguments

def extract_objects(sentence):
    query_matches = re.search("/\*\n\d.*?\*/", sentence, re.DOTALL)[0]
    # Sequence /* followed by a line break,
    #     the next line beginning with a digit,
    #     until the next closing sequence */
    objects = re.findall("(PP|OB2)", query_matches)
    return objects

def add_sentence_information(
    sentence,
    text_id,
    sentence_id,
    main_dict,
    map_dict
    ):
    text = extract_text(sentence)
    objects = extract_objects(sentence)
    running_number = map_dict[text_id][sentence_id]
    # for now, just copy the new sentence number
    if text_id not in list(main_dict):
        main_dict[text_id] = {}
        # if this is the first entry for a new text,
        #    we need to create a new nested dictionary
        #    before inserting the information for our sentence
    main_dict[text_id][sentence_id] = {}
    main_dict[text_id][sentence_id]["text"] = text
    main_dict[text_id][sentence_id]["running_number"] = running_number
    main_dict[text_id][sentence_id]["objects"] = objects

def main():
    # Step 1: extract information from query hits,
    #    including running sentence number
    with open("ppcme2.pickle", "rb") as inp:
        id_map = pickle.load(inp) # load running number scheme
    hit_dict = {} # main dictionary to hold sentence information
    args = get_arguments()
    file_name = args.dot_out_file
    file_content = read_file(file_name)
    sentences = file_content.split("/~*")
    for sentence in sentences:
        if re.search("\*~/", sentence): # exclude header
            sentence_id = re.search("\(\d+\sID\s(.*?)\)", sentence).group(1)
            # (index space ID space, we catch the following group
            #    until the first closing bracket
            text_id = re.search("(.*?),", sentence_id).group(1)
            # we catch any characters until the first comma
            add_sentence_information(
                sentence,
                text_id,
                sentence_id,
                hit_dict,
                id_map
            )
    # End of step 1: print and export intermediate results
    print(hit_dict) # check how well it worked
    with open("hit_dict.pickle", "wb") as outp:
        pickle.dump(hit_dict, outp)
        # export as a pickle so we can run some tests
        #    in our live session to work on step 2

if __name__ == "__main__":
    main()

Now that we have extracted the information from the query hits, we want to create our export table. We can loop over the dictionary keys and gather the information that we need for each entry. As corpus units may contain more than one hit, we should write one table row per hit rather than one per corpus unit. This means that we will only write a table row at the level of an embedded loop iterating through the list of indirect objects. We also have to be careful about the distance between hits within the same unit: only the last item in the list should be measured against the next unit; all the others should get the value 0.
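
For example, a corpus unit containing two hits whose next hit is three units further along would yield two rows along these lines (columns are tab-separated in the actual file; IDs and distance invented for illustration):

TextID     SentenceID      Construction    Distance    Sentence
CMBRUT3    CMBRUT3,2.29    PP              0           (sentence text)
CMBRUT3    CMBRUT3,2.29    OB2             3           (sentence text)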

The main point of the script is to obtain the distance to the next hit. To do so, we need to access the sentence number of the next corpus unit. Dictionaries are not numbered as such, so we cannot directly access the next item. However, we can work around this by converting the inventory of keys into a list, in which we can increment the index to obtain the next key. We have to be aware that if we are at the last key in the list, fetching the non-existent next item will result in an IndexError, which we have to catch to prevent our program from crashing.

def get_next_key(unit_keys, unit_key):
    current_index = unit_keys.index(unit_key)
    try:
        next_key = unit_keys[current_index + 1]
    except IndexError:
        next_key = "NA"
    return next_key

def write_table_header(output_file_name):
    with open(output_file_name, "w") as outp:
        outp.write(
            "TextID\t"
            "SentenceID\t"
            "Construction\t"
            "Distance\t"
            "Sentence\n"
            )

def write_table_row(
    output_file_name,
    text_id,
    sentence_id,
    construction,
    distance,
    sentence
    ):
    with open(output_file_name, "a") as outp:
        outp.write(
            f"{text_id}\t"
            f"{sentence_id}\t"
            f"{construction}\t"
            f"{distance}\t"
            f"{sentence}\n"
            )

def main():
    # Step 1: extract information from query hits,
    #    including running sentence number
    with open("ppcme2.pickle", "rb") as inp:
        id_map = pickle.load(inp) # load running number scheme
    hit_dict = {} # main dictionary to hold sentence information
    args = get_arguments()
    file_name = args.dot_out_file
    file_content = read_file(file_name)
    sentences = file_content.split("/~*")
    for sentence in sentences:
        if re.search("\*~/", sentence): # exclude header
            sentence_id = re.search("\(\d+\sID\s(.*?)\)", sentence).group(1)
            # (index space ID space, we catch the following group
            #    until the first closing bracket
            text_id = re.search("(.*?),", sentence_id).group(1)
            # we catch any characters until the first comma
            add_sentence_information(
                sentence,
                text_id,
                sentence_id,
                hit_dict,
                id_map
            )

    # Step 2: Iterate through our dictionary of hits,
    #    gather the information including the distance to the next hit,
    #    and write to our output table
    text_keys = list(hit_dict)
    output_file_name = re.sub("\.out$", ".csv", file_name)
    write_table_header(output_file_name)
    for text_key in text_keys:
        unit_keys = list(hit_dict[text_key])
        for unit_key in unit_keys:
            current_text = hit_dict[text_key][unit_key]['text']
            object_list = hit_dict[text_key][unit_key]['objects']
            try:
                current_number = hit_dict[text_key][unit_key]['running_number']
                next_key = get_next_key(unit_keys, unit_key)
                # If the function get_next_key() returns "NA",
                #    we have to catch the ensuing key error
                #    (as we don't have a corpus unit with the key "NA").
                next_number = hit_dict[text_key][next_key]['running_number']
                distance = next_number - current_number
            except KeyError:
                distance = "NA"
            for item in object_list:
                if object_list.index(item) == len(object_list) -1:
                    report_distance = distance
                else:
                    report_distance = 0
                write_table_row(
                    output_file_name,
                    text_key,
                    unit_key,
                    item,
                    report_distance,
                    current_text
                    )
            print(f"Processed unit {unit_key}...\n")
        print(f"Processed text {text_key}...\n")
    print(
        "Process complete.\n"
        f"Table saved as {output_file_name}.\n"
        "Now quitting.\n"
        )

Our script runs without errors and generates a table that looks like what we want. Upon closer inspection, however, we can see that multiple hits within a unit sometimes (but not always) have the value 0 even for the last hit. We just caught a bug! Just because our script runs without errors doesn’t mean that it works as intended.

The bug shows the following pattern: when the multiple items within a corpus unit are identical (e.g. in CMBRUT3,7.167), we get 0 across the board, but when they are not identical (e.g. CMBRUT3,17.514), we get the expected behaviour. How did this happen? When we check whether an item is the last in the list, .index() returns the position of the first matching element, so when the same object type occurs multiple times, we always get the index of the first occurrence and the test for the last item fails.

We should therefore remove each item from the list after we have treated it, but we don’t want to modify a list over which we are iterating, as this is likely to cause problems, such as our loop skipping an item. A quick fix is to wrap our list in the list() function, which creates a copy. This way, we iterate over the copy, which does not change as we remove items from the original list. The original list will be empty once we are done iterating, which is not a problem for our purpose, but if we were to need the list after that, we’d need to come up with a different fix.
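
We can see the cause in isolation in an interactive session (a quick check with two identical hits):

object_list = ["PP", "PP"]
object_list.index("PP") # always returns the position of the first occurrence
## 0
len(object_list) - 1 # the position the last item would need to have
## 1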

def main():
    # ...
    # Step 2: Iterate through our dictionary of hits,
    #    gather the information including the distance to the next hit,
    #    and write to our output table
    text_keys = list(hit_dict)
    output_file_name = re.sub("\.out$", ".csv", file_name)
    write_table_header(output_file_name)
    for text_key in text_keys:
        unit_keys = list(hit_dict[text_key])
        for unit_key in unit_keys:
            current_text = hit_dict[text_key][unit_key]['text']
            object_list = hit_dict[text_key][unit_key]['objects']
            try:
                current_number = hit_dict[text_key][unit_key]['running_number']
                next_key = get_next_key(unit_keys, unit_key)
                next_number = hit_dict[text_key][next_key]['running_number']
                distance = next_number - current_number
            except KeyError:
                distance = "NA"
            for item in list(object_list):
                if object_list.index(item) == len(object_list) -1:
                    report_distance = distance
                else:
                    report_distance = 0
                write_table_row(
                    output_file_name,
                    text_key,
                    unit_key,
                    item,
                    report_distance,
                    current_text
                    )
                object_list.remove(item)
            print(f"Processed unit {unit_key}...\n")
        print(f"Processed text {text_key}...\n")
    print(
        "Process complete.\n"
        f"Table saved as {output_file_name}.\n"
        "Now quitting.\n"
        )
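
If we did need the original list afterwards, one alternative fix (just a sketch of the idea, not what we will use below) would be to track each item’s position with enumerate() instead of looking it up with .index(); this avoids both the ambiguity of repeated items and the need to empty the list:

object_list = ["PP", "PP", "OB2"]
for position, item in enumerate(object_list):
    # enumerate() yields the position alongside the item,
    # so repeated items no longer confuse the check for the last item
    if position == len(object_list) - 1:
        print(f"{item}: last hit in the unit, report the real distance")
    else:
        print(f"{item}: not the last hit, report 0")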

6.4 Working version

The working version of our script looks something like this.

#!/usr/bin/env python3

import argparse, pickle, re
from output_to_csv import read_file, extract_text

def get_arguments():
    parser = argparse.ArgumentParser(
        description = (
            "Determine distance between query hits"
        )
    )
    parser.add_argument(
        "dot_out_file",
        help = "Name of the CorpusSearch .out file to be processed"
    )
    arguments = parser.parse_args()
    return arguments

def extract_objects(sentence):
    query_matches = re.search("/\*\n\d.*?\*/", sentence, re.DOTALL)[0]
    # Sequence /* followed by a line break,
    #     the next line beginning with a digit,
    #     until the next closing sequence */
    objects = re.findall("(PP|OB2)", query_matches)
    return objects

def add_sentence_information(
    sentence,
    text_id,
    sentence_id,
    main_dict,
    map_dict
    ):
    text = extract_text(sentence)
    objects = extract_objects(sentence)
    running_number = map_dict[text_id][sentence_id]
    # for now, just copy the new sentence number
    if text_id not in list(main_dict):
        # if this is the first entry for a new text,
        #    we need to create a new nested dictionary
        #    before inserting the information for our sentence
        main_dict[text_id] = {}
    main_dict[text_id][sentence_id] = {}
    main_dict[text_id][sentence_id]["text"] = text
    main_dict[text_id][sentence_id]["running_number"] = running_number
    main_dict[text_id][sentence_id]["objects"] = objects


def get_next_key(unit_keys, unit_key):
    current_index = unit_keys.index(unit_key)
    try:
        next_key = unit_keys[current_index + 1]
    except IndexError:
        next_key = "NA"
    return next_key

def write_table_header(output_file_name):
    with open(output_file_name, "w") as outp:
        outp.write(
            "TextID\t"
            "SentenceID\t"
            "Construction\t"
            "Distance\t"
            "Sentence\n"
            )

def write_table_row(
    output_file_name,
    text_id,
    sentence_id,
    construction,
    distance,
    sentence
    ):
    with open(output_file_name, "a") as outp:
        outp.write(
            f"{text_id}\t"
            f"{sentence_id}\t"
            f"{construction}\t"
            f"{distance}\t"
            f"{sentence}\n"
            )

def main():
    # Step 1: extract information from query hits,
    #    including running sentence number
    with open("ppcme2.pickle", "rb") as inp:
        id_map = pickle.load(inp) # load running number scheme
    hit_dict = {} # main dictionary to hold sentence information
    args = get_arguments()
    file_name = args.dot_out_file
    file_content = read_file(file_name)
    sentences = file_content.split("/~*")
    for sentence in sentences:
        if re.search("\*~/", sentence): # exclude header
            sentence_id = re.search("\(\d+\sID\s(.*?)\)", sentence).group(1)
            # (index space ID space, we catch the following group
            #    until the first closing bracket
            text_id = re.search("(.*?),", sentence_id).group(1)
            # we catch any characters until the first comma
            add_sentence_information(
                sentence,
                text_id,
                sentence_id,
                hit_dict,
                id_map
            )
    # Step 2: Iterate through our dictionary of hits,
    #    gather the information including the distance to the next hit,
    #    and write to our output table
    text_keys = list(hit_dict)
    output_file_name = re.sub("\.out$", ".csv", file_name)
    write_table_header(output_file_name)
    for text_key in text_keys:
        unit_keys = list(hit_dict[text_key])
        for unit_key in unit_keys:
            current_text = hit_dict[text_key][unit_key]['text']
            object_list = hit_dict[text_key][unit_key]['objects']
            try:
                current_number = hit_dict[text_key][unit_key]['running_number']
                next_key = get_next_key(unit_keys, unit_key)
                next_number = hit_dict[text_key][next_key]['running_number']
                distance = next_number - current_number
            except KeyError:
                distance = "NA"
            for item in list(object_list):
                if object_list.index(item) == len(object_list) -1:
                    report_distance = distance
                else:
                    report_distance = 0
                write_table_row(
                    output_file_name,
                    text_key,
                    unit_key,
                    item,
                    report_distance,
                    current_text
                    )
                object_list.remove(item)
            print(f"Processed unit {unit_key}...\n")
        print(f"Processed text {text_key}...\n")
    print(
        "Process complete.\n"
        f"Table saved as {output_file_name}.\n"
        "Now quitting.\n"
        )

if __name__ == "__main__":
    main()

There are still some improvements that could be made:

  • For now, we have hard-coded the pickled renumbering scheme for the PPCME2, but we could expand this to cover any pickled inventory for other corpora (e.g. PLAEME, PCMEP, or a combination of PPCME2, PLAEME, and PCMEP). We could add the pickled file to be loaded as an argument.

  • The features under investigation, i.e. double object versus prepositional object constructions, are also hard-coded, so we could think of ways to expand the applications of our script, e.g. provide a list of tags as an argument, or write them in a text file that can be specified and loaded with an argument (see the sketch after this list).
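
A sketch of what such additional arguments could look like in get_arguments(); the option names --id-map and --tags are suggestions rather than something our script currently provides:

def get_arguments():
    parser = argparse.ArgumentParser(
        description = (
            "Determine distance between query hits"
        )
    )
    parser.add_argument(
        "dot_out_file",
        help = "Name of the CorpusSearch .out file to be processed"
    )
    parser.add_argument(
        "--id-map",
        default = "ppcme2.pickle",
        help = "Pickled renumbering scheme to load"
    )
    parser.add_argument(
        "--tags",
        nargs = "+",
        default = ["PP", "OB2"],
        help = "Constituent tags to look for in the query hits"
    )
    arguments = parser.parse_args()
    return arguments

In main(), we would then open args.id_map instead of the hard-coded "ppcme2.pickle", and extract_objects() could build its regular expression from args.tags (e.g. with "|".join(args.tags)).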

But for now, our script works for our immediate purpose.