At the time of writing, the latest stable release of Python is version 3.10.5. If you already have Python installed, please ensure that you have at least version 3.6 or newer.
To install or upgrade Python, you can either:
You can verify whether Python was successfully installed on your system by opening your Terminal or PowerShell and entering the command python --version or python3 --version.
In addition to installing Python, you will also need a text editor. Here is a non-exhaustive list of suggestions:
A traditional first program to run is “Hello World!”. We will use this to ensure that everyone has Python properly installed and can execute Python scripts.
print("Hello World!")
## Hello World!
We can directly launch Python from our command line (Terminal or PowerShell) and execute the program interactively from there, or save the program to a file (e.g. called hello.py) and execute the program stored within the file by running Python on it: python3 hello.py. For the latter option, we need our Terminal to have the folder in which we saved the script file as its working directory. We can set the working directory with the cd command followed by a space and the path to that folder. (NB: You may also be able to drag the folder to the Terminal after having typed cd followed by a space.) You can verify the path with the command pwd (“print working directory”) and/or list the contents of the directory with ls (macOS/Linux) or dir (Windows) to see whether your script file appears in the list of contents.
A further option is to make our Python script executable. For this, we place a line at the very top of the script that tells our system how the script should be executed. NB: The path provided explicitly applies to Unix-like systems (Linux and macOS among others) but should also be interpretable (or even unnecessary) on Windows.
#!/usr/bin/env python3
print("Hello World!")
On Unix-like systems, we should then give executable permissions to the file with chmod +x hello.py. Once this is done, we can run the script as ./hello.py.
Let us start with a “cultural studies flavoured” approach: we will generate the American song 99 bottles of beer on the wall in Python to see some of the basic concepts of the language. We will be aiming for the following lyrics:
99 bottles of beer on the wall, 99 bottles of beer. Take one down and pass it around, 98 bottles of beer on the wall.
[…]
1 bottle of beer on the wall, 1 bottle of beer. Take one down and pass it around, no more bottles of beer on the wall.
No more bottles of beer on the wall, no more bottles of beer. Go to the store and buy some more, 99 bottles of beer on the wall.
We can repeat operations a certain number of times using a for loop. At each repetition, a specified variable (often called i, for iteration) is assigned a different value. To print numbers from 1 to 10, we can do the following:
for i in range(1, 11):
print(i)
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
Notice how the count is one number short of the endpoint of the range.
The range() function may take a further argument specifying the step to be used. We could skip even numbers by specifying a step of 2.
for i in range(1, 11, 2):
print(i)
## 1
## 3
## 5
## 7
## 9
To count down, we can specify a negative step. We also don’t have to stick with i for the name of the iterating variable, but can choose something more expressive.
for bottle_count in range(99, -1, -1):
print(bottle_count)
Our condition is that we only need the singular form when the number of bottles is 1. We can achieve this with a conditional if statement.
if bottle_count == 1:
print(str(bottle_count) + " bottle")
else:
print(str(bottle_count) + " bottles")
There are more elegant ways to insert variables into strings:
print("%s bottles" % bottle_count) # "old school" way
print(f"{bottle_count} bottles") # new f-string introduced in Python 3.6
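For reference, the str.format() method is a third common option (shown here on the same variable):

print("{} bottles".format(bottle_count)) # the str.format() method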
We can now combine our for-loop and our if-statement:
for bottle_count in range(99, -1, -1):
if bottle_count == 1:
print(f"{bottle_count} bottle")
else:
print(f"{bottle_count} bottles")
for bottle_count in range(99, -1, -1):
if bottle_count == 1:
print(f"{bottle_count} bottle of beer on the wall, {bottle_count} bottle of beer.\n\tTake one down and pass it around, no more bottles of beer on the wall.\n")
else:
print(f"{bottle_count} bottles of beer on the wall, {bottle_count} bottles of beer.\n\tTake one down and pass it around, {bottle_count - 1} bottles of beer on the wall.\n")
It is generally advised to keep lines of code under 80 characters for better readability. But how can we keep lines short when our text is long? While this length limit is more a suggestion than a requirement, we can still comply with it by splitting the string into several adjacent parts, which Python automatically concatenates as long as they are within brackets. NB: Do not place any commas between the strings, as Python would interpret this as a tuple, i.e. a data type akin to a list that cannot be edited.
for bottle_count in range(99, -1, -1):
if bottle_count == 1:
print(
f"{bottle_count} bottle of beer on the wall, "
f"{bottle_count} bottle of beer.\n\t"
"Take one down and pass it around, "
"no more bottles of beer on the wall.\n"
)
else:
print(
f"{bottle_count} bottles of beer on the wall, "
f"{bottle_count} bottles of beer.\n\t"
"Take one down and pass it around, "
f"{bottle_count - 1} bottles of beer on the wall.\n"
)
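As a quick illustration of the NB above, here is a minimal sketch of the difference (the variable names are just for illustration):

joined = ("99 bottles " "of beer")  # adjacent string literals are concatenated
print(joined)                       # 99 bottles of beer
oops = ("99 bottles ", "of beer")   # a comma between them creates a tuple instead
print(oops)                         # ('99 bottles ', 'of beer')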
The end of our song is not quite right yet: it should say “no more bottles” rather than “0 bottles”. There are multiple ways to achieve this. We could handle it with an if-statement, but that would quickly become convoluted due to the existing if-statement for handling the singular, and the fact that 0 occurs both when the bottle count is 1 (due to the subtraction) and in the final line. A more elegant solution is to substitute the character 0 with the character string “no more”. For this, we use regular expressions, for which we need to import the module re. Imported modules are typically listed at the very top of a Python script. We can even use regular expressions to get rid of the if-statement handling the plural-singular distinction and reserve the conditional for handling the last line. To substitute the text we want, we first store a default version of the line and then perform the substitutions we want on it.
Some things to note:
- To make sure we only replace a standalone 0 (and not, say, the 0 in 10 or 20), we precede it with the word boundary wild card \b. In Python, this particular wild card is somewhat confusing because the sequence \b is also an escape sequence (for the backspace character), so we either need to escape the \ escape character itself, hence \\b, or use the raw string notation r"\b". In a nutshell, raw strings ignore any Python-specific uses of the backslash character and pass everything as is to the regular expression parser. If you are confused, you can place double backslashes for any wild card, whether they are necessary (e.g. \\b, \\1) or not (e.g. \w or \\w). You can also play around with the raw string notation to see whether you prefer it (e.g. r"\b", r"\1", r"\w").
- To capitalize the “no” that ends up at the beginning of the line, we use the anchor ^ to specify the beginning of the string.

import re
for bottle_count in range(99, -1, -1):
default_line = (
f"{bottle_count} bottles of beer on the wall, "
f"{bottle_count} bottles of beer.\n\t"
"Take one down and pass it around, "
f"{bottle_count - 1} bottles of beer on the wall.\n"
)
updated_line = re.sub("\\b0", "no more", default_line)
updated_line = re.sub("^n", "N", updated_line)
updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
print(updated_line)
Our song looks fine except for the ending. We can use an if-statement to treat the ending differently. We can now also skip the capitalization of “no” at line beginnings.
import re
for bottle_count in range(99, -1, -1):
if bottle_count > 0:
default_line = (
f"{bottle_count} bottles of beer on the wall, "
f"{bottle_count} bottles of beer.\n\t"
"Take one down and pass it around, "
f"{bottle_count - 1} bottles of beer on the wall.\n"
)
updated_line = re.sub("\\b0", "no more", default_line)
updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
else:
updated_line = (
"No more bottles of beer on the wall, "
"no more bottles of beer.\n\t"
"Go to the store and buy some more, "
"99 bottles of beer on the wall.\n"
)
print(updated_line)
Alternatively, we can also let the for-loop stop at 1 and just write the ending after the loop.
import re
for bottle_count in range(99, 0, -1):
default_line = (
f"{bottle_count} bottles of beer on the wall, "
f"{bottle_count} bottles of beer.\n\t"
"Take one down and pass it around, "
f"{bottle_count - 1} bottles of beer on the wall.\n"
)
updated_line = re.sub("\\b0", "no more", default_line)
updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
print(updated_line)
print(
"No more bottles of beer on the wall, "
"no more bottles of beer.\n\t"
"Go to the store and buy some more, "
"99 bottles of beer on the wall.\n"
)
In addition to a for-loop, we could also use a while-loop. Unlike in a for-loop, the looping variable is not updated automatically, so we have to be extra careful to specify how the loop should end, or we may be stuck in an infinite loop. In general, prefer for-loops over while-loops unless there is a good reason to use a while-loop.
import re
bottle_count = 99
while bottle_count > 0:
default_line = (
f"{bottle_count} bottles of beer on the wall, "
f"{bottle_count} bottles of beer.\n\t"
"Take one down and pass it around, "
f"{bottle_count - 1} bottles of beer on the wall.\n"
)
updated_line = re.sub("\\b0", "no more", default_line)
updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
print(updated_line)
bottle_count = bottle_count - 1 # infinite loop without this line ! ! !
print(
"No more bottles of beer on the wall, "
"no more bottles of beer.\n\t"
"Go to the store and buy some more, "
"99 bottles of beer on the wall.\n"
)
There are different ways we can write the output to a file:
import re
for bottle_count in range(99, 0, -1):
default_line = (
f"{bottle_count} bottles of beer on the wall, "
f"{bottle_count} bottles of beer.\n\t"
"Take one down and pass it around, "
f"{bottle_count - 1} bottles of beer on the wall.\n\n"
)
updated_line = re.sub("\\b0", "no more", default_line)
updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
with open("99bottles.txt", "a") as outp:
outp.write(updated_line)
with open("99bottles.txt", "a") as outp:
outp.write(
"No more bottles of beer on the wall, "
"no more bottles of beer.\n\t"
"Go to the store and buy some more, "
"99 bottles of beer on the wall.\n\n"
)
import re
song_lyrics = ""
for bottle_count in range(99, 0, -1):
default_line = (
f"{bottle_count} bottles of beer on the wall, "
f"{bottle_count} bottles of beer.\n\t"
"Take one down and pass it around, "
f"{bottle_count - 1} bottles of beer on the wall.\n\n"
)
updated_line = re.sub("\\b0", "no more", default_line)
updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
song_lyrics += updated_line
song_lyrics += (
"No more bottles of beer on the wall, "
"no more bottles of beer.\n\t"
"Go to the store and buy some more, "
"99 bottles of beer on the wall.\n\n"
)
with open("99bottles.txt", "w") as outp:
outp.write(song_lyrics)
The open() function has three basic modes: "r" to read from an existing file, "w" to write to a file (replacing any existing content), and "a" to append to the end of a file. The latter two create the file if it does not yet exist.
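As a minimal sketch of the difference between "w" and "a" (the file name demo.txt is just for illustration):

with open("demo.txt", "w") as outp:   # "w" starts a fresh file, discarding previous content
    outp.write("first line\n")
with open("demo.txt", "a") as outp:   # "a" keeps existing content and adds to the end
    outp.write("second line\n")
with open("demo.txt", "r") as inp:    # "r" reads the file back in
    print(inp.read())                 # both lines are present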
So far, we’ve just replicated the song lyrics. This has shown us how to use for-loops and while-loops, how to insert variables into strings, some basics of regular expression substitutions, and exporting output to a file. The song lyrics themselves haven’t changed though. Boring! We can learn more about using variables by modifying the song.
First, we may want to change the number at which we start counting. Maybe the song is too long (or too short?). We could start at ten or a million. One obvious way to adapt the code is to insert the number we want in the range() function, e.g. for bottle_count in range(10, 0, -1). A more readable way to do this is to define a variable prior to the loop, so it is easier to spot and edit.
import re
song_lyrics = ""
count_start = 10 # define our counting start here
for bottle_count in range(count_start, 0, -1):
default_line = (
f"{bottle_count} bottles of beer on the wall, "
f"{bottle_count} bottles of beer.\n\t"
"Take one down and pass it around, "
f"{bottle_count - 1} bottles of beer on the wall.\n\n"
)
updated_line = re.sub("\\b0", "no more", default_line)
updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
song_lyrics += updated_line
song_lyrics += (
"No more bottles of beer on the wall, "
"no more bottles of beer.\n\t"
"Go to the store and buy some more, "
f"{count_start} bottles of beer on the wall.\n\n"
)
print(song_lyrics)
Another alteration we could make is to change the beverage. Perhaps you prefer wine over beer, or just plain water. Again, we can define our beverage before the loop starts, and then pass that variable to the f-strings.
import re
song_lyrics = ""
count_start = 5 # define our counting start here
beverage = "water" # define drink of choice here
for bottle_count in range(count_start, 0, -1):
default_line = (
f"{bottle_count} bottles of {beverage} on the wall, "
f"{bottle_count} bottles of {beverage}.\n\t"
"Take one down and pass it around, "
f"{bottle_count - 1} bottles of {beverage} on the wall.\n\n"
)
updated_line = re.sub("\\b0", "no more", default_line)
updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
song_lyrics += updated_line
song_lyrics += (
f"No more bottles of {beverage} on the wall, "
f"no more bottles of {beverage}.\n\t"
"Go to the store and buy some more, "
f"{count_start} bottles of {beverage} on the wall.\n\n"
)
print(song_lyrics)
It’s neat that we can now make our own alternate versions of the song, but having to edit the program every time we change anything is cumbersome and error-prone. We can instead set up the program so that we can specify variables such as the starting count and the beverage when we run the program from the command line. For this, we use the module argparse (argument parser).
import argparse
import re
parser = argparse.ArgumentParser(
description = (
"Generate lyrics to '99 bottles of beer on the wall'"
" (or variants thereof)"
),
epilog = "Please drink responsibly!"
)
parser.add_argument(
"-b",
"--beverage",
default = "beer",
type = str,
help = "Beverage of choice (default: beer)"
)
parser.add_argument(
"-c",
"--count",
default = 99,
type = int,
help = "Number at which to start our countdown (default: 99)"
)
args = parser.parse_args()
beverage = args.beverage
count_start = args.count
song_lyrics = ""
for bottle_count in range(count_start, 0, -1):
default_line = (
f"{bottle_count} bottles of {beverage} on the wall, "
f"{bottle_count} bottles of {beverage}.\n\t"
"Take one down and pass it around, "
f"{bottle_count - 1} bottles of {beverage} on the wall.\n\n"
)
updated_line = re.sub("\\b0", "no more", default_line)
updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
song_lyrics += updated_line
song_lyrics += (
f"No more bottles of {beverage} on the wall, "
f"no more bottles of {beverage}.\n\t"
"Go to the store and buy some more, "
f"{count_start} bottles of {beverage} on the wall.\n\n"
)
with open(f"{count_start}-bottles-of-{beverage}.txt", "w") as outp:
outp.write(song_lyrics)
Our working version of the script now looks something like this. We could of course keep going and keep adding further improvements (add a choice of container such as glasses, kegs, cans, canteens, amphorae, etc., and spell out numbers, e.g. ninety-nine), but let us now move on to applications closer to our purpose.
#!/usr/bin/env python3
import argparse
import re
parser = argparse.ArgumentParser(
description = (
"Generate lyrics to '99 bottles of beer on the wall'"
" (or variants thereof)"
),
epilog = "Please drink responsibly!"
)
parser.add_argument(
"-b",
"--beverage",
default = "beer",
type = str,
help = "beverage of choice (default: beer)"
)
parser.add_argument(
"-c",
"--count",
default = 99,
type = int,
help = "number at which to start our countdown (default: 99)"
)
args = parser.parse_args()
beverage = args.beverage
count_start = args.count
song_lyrics = ""
for bottle_count in range(count_start, 0, -1):
default_line = (
f"{bottle_count} bottles of {beverage} on the wall, "
f"{bottle_count} bottles of {beverage}.\n\t"
"Take one down and pass it around, "
f"{bottle_count - 1} bottles of {beverage} on the wall.\n\n"
)
updated_line = re.sub("\\b0", "no more", default_line)
updated_line = re.sub("\\b1 bottles", "1 bottle", updated_line)
song_lyrics += updated_line
song_lyrics += (
f"No more bottles of {beverage} on the wall, "
f"no more bottles of {beverage}.\n\t"
"Go to the store and buy some more, "
f"{count_start} bottles of {beverage} on the wall.\n\n"
)
with open(f"{count_start}-bottles-of-{beverage}.txt", "w") as outp:
outp.write(song_lyrics)
We will create a Python script that takes a CorpusSearch output file and converts it to a table that we can import into R or into a spreadsheet application such as Excel. We will be working on the file french-based-verbs-with-indirect-objects-PPCME2.out, but our script should work on any CorpusSearch output file from the PPCME2.
We first look at our CorpusSearch file to examine its structure. What is the information we want to extract, and how could a script reliably extract it?
Each sentence is surrounded by a pair of /~* and *~/ markers, and the indices of the matching elements are surrounded by pairs of /* and */. The latter pair is also used for the preface and the header at the very top of the file, so we have to be careful to treat those differently.
Before we treat the file, however, we need to open it and read its contents into Python. As we want to treat any CorpusSearch output file and not just the one we are working with now, we want to be able to specify the file name when we run the program. We can parse the arguments as we did earlier, except this time we will read the file name. We also make the code more “pythonic” by storing individual steps as custom functions, which we then assemble in a main function. This is done for two main reasons: it keeps the main function short and readable, and it allows individual functions to be reused in other scripts.
import argparse
def get_arguments():
parser = argparse.ArgumentParser(
description = (
"Extract information from a CorpusSearch output file "
"to a CSV table"
)
)
parser.add_argument(
"file_name",
help = "Name of the file to be processed"
)
arguments = parser.parse_args()
return arguments
def read_file(file_name):
with open(file_name, "r") as inp:
return inp.read()
def main():
args = get_arguments()
content = read_file(args.file_name)
print(content[0:1000]) # test to verify that the file has been read
if __name__ == "__main__":
main()
When running the script, the functions are stored but not executed unless called. If we execute the script directly, the main() function is executed, but not when we import the script. This coding style allows us to reuse functions, e.g. from output_to_csv import read_file.
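As a minimal sketch of such reuse (assuming the script above is saved as output_to_csv.py and the new file lives in the same folder; the file names here are purely illustrative):

# reuse_example.py (hypothetical file name)
from output_to_csv import read_file  # imports the function; main() is not executed

content = read_file("some-query.out")  # placeholder file name
print(content[0:200])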
As it stands, our script relies on the user correctly entering the name of the file to be processed. If there is a typo or the intended file does not exist at this location, our script crashes. We can handle this more gracefully by catching the error and letting the user know what likely went wrong.
import argparse
def get_arguments():
parser = argparse.ArgumentParser(
description = (
"Extract information from a CorpusSearch output file "
"to a CSV table"
)
)
parser.add_argument(
"file_name",
help = "name of the file to be processed"
)
arguments = parser.parse_args()
return arguments
def read_file(file_name):
try:
with open(file_name, "r") as inp:
return inp.read()
except FileNotFoundError:
print(
"\nWARNING: "
"The file you specified does not exist.\n"
"Please verify its spelling or location and try again.\n"
)
quit()
def main():
args = get_arguments()
content = read_file(args.file_name)
print(content[0:1000]) # test to verify that the file has been read
if __name__ == "__main__":
main()
As each sentence is surrounded by a pair of /~* and *~/ markers, we can use the character sequence /~* as a reliable marker to segment the text into units that correspond to query hits. We can do this by using the .split() method. This yields a list containing all our hits as distinct elements.
NB: Methods are similar to functions, except that they are always applied to objects and never in a standalone manner.
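For instance (a minimal illustration, not taken from our script):

greeting = "hello world"
print(len(greeting))     # len() is a standalone function
print(greeting.upper())  # .upper() is a method applied to the string object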
sentences = content.split("/~*")
print(sentences[0]) # test to see the header
print(sentences[1]) # test to see the first actual query hit
print(sentences[-1]) # test to see the last query hit
The splitting sequence is not included in the list elements. However, the closing tag is still present, which gives a way to exclude the file header.
We can treat each hit with a for-loop.
for sentence in sentences:
sentence[0:10] # first 10 characters as a test
As you may have noticed, our first hit is the preface section. We should exclude it by ensuring that only items containing the *~/ closing tag are treated. We can use the .find() method, which returns the position of the hit, or -1 in case there are no matches. An alternative would be to use the re.search() function, which returns a match object if there is a match, and nothing if there is no match. When placed in an if statement, Python interprets the presence of a match object as logical “true”, and its absence as logical “false”. We could therefore also use if re.search("\*~/", sentence):, without forgetting to escape the asterisk character.
for sentence in sentences:
if sentence.find("*~/") != -1:
sentence[0:10]
What information do we want to extract from the CorpusSearch output file? This of course depends on specific use cases. Here, we will extract the corpus file name, the sentence ID, the verb information (lemma, MED-ID, etymology) and the sentence itself.
The corpus file name is actually contained in the sentence ID, so we can kill two birds with one stone here. All sentence IDs are in parentheses and begin with CM. We can use this pattern to extract the sentence ID reliably using regular expressions. We need to escape the parentheses with a backslash, as parentheses are special characters that mark groups in regular expressions. We use the re.search() function to find our match. We can then extract the actual match and remove the surrounding parentheses.
sentence = sentences[1] # use the first actual hit as a test case
sentence_id = re.search("\(CM.*\)", sentence) # perform our search
sentence_id = sentence_id[0] # extract the actual match
# NB: sentence_id.group() or sentence_id.group(0) would also work
sentence_id = re.sub("\((.*)\)", "\\1", sentence_id)
We can define this sequence as a function so that our main() function is easier to read.
def extract_id(hit):
id_tag = re.search("\(CM.*\)", hit)
sentence_id = re.sub("\((.*)\)", "\\1", id_tag[0])
return sentence_id
Extracting the corpus file name is then a one-liner that doesn’t necessarily warrant its own function.
identifier = extract_id(sentence)
# Extract content before the comma
# to obtain the corpus file name from the identifier
corpusfile_name = re.sub("(.*),.*", "\\1", identifier)
# Alternative solution:
# re.search("[^,]*", identifier)[0]
# Any character that is not a comma
To extract the lemma information of the verb matched by our query, we can rely on the section surrounded by the tags /* and */ listing the indices of the matched items from the query. There are a few things to consider when formulating a suitable regular expression:

1. These tags include asterisks, which are normally quantifiers, so we need to escape them with a backslash: /\* for the opening tag and \*/ for the closing tag.
2. The wildcard . refers to any character except line breaks. As the section listing the matched items is spread over multiple lines, we need to define a group that includes line breaks as well: (.|\n).
3. The flag re.MULTILINE, or re.M for short, is sometimes suggested for matching over multiple lines, but it only changes how the anchors ^ and $ behave (matching at each line rather than only at the start and end of the whole string); it does not affect the . wildcard.

Another option is to redefine the . wildcard to include line breaks with the flag re.DOTALL, or re.S for short. This covers the requirement from point 2 while keeping the regular expression a bit simpler.
query_matches = re.search("/\*.*\*/", sentence, re.DOTALL)[0]
# Sequence /* followed by any characters including line breaks until
# the closing sequence */
It turns out our regular expression was too broad as we also include
the header/footer sections that occur whenever results from a new corpus
file are displayed. We can fix this by specifying that the line
occurring after the opening tag should begin with a digit and making the
quantifier non-greedy by placing a ?
character after
it.
query_matches = re.search("/\*\n\d.*?\*/", sentence, re.DOTALL)[0]
# Sequence /* followed by a line break, the next line beginning with a digit,
# until the next closing sequence */
Now that we have isolated the section listing the query hits, we can think about how we want to extract the lemma information. We have to consider the fact that there can be more than one query hit per sentence (e.g. in CMCLOUD,131.811 or CMMIRK,13.358). The function re.search() stops at the first match it finds, but the function re.findall() generates a list of all the matches it finds. This means we can loop over such a list to treat any verbs matching the query. Let us define a function that handles the extraction of lemma information as a first step. We can deal with the looping when it is time to write the data to a table.
def extract_lemmas(hit):
query_matches = re.search("/\*\n\d.*?\*/", hit, re.DOTALL)[0]
# Sequence /* followed by a line break,
# the next line beginning with a digit,
# until the next closing sequence */
lemmas = re.findall("@.*@", query_matches)
# To include verb form: "\w*@.*@"
return lemmas
To extract the actual text of the sentence to include it in our table, we need to reliably identify its boundaries. The markers are /~* (which we chopped off when we segmented the file) and *~/. We can extract the content by gathering material occurring up to this closing tag, then removing the sentence ID and the closing tag. We can do this by specifying the material we want to keep as a group, placing it in parentheses, and then accessing it with .group(1) (alternatively [1]). Finally, there are line breaks that have been hard-coded by CorpusSearch. We can remove the trailing line breaks with the .strip() method, and replace the line breaks within the sentence with spaces using the re.sub() function.
text_sequence = re.search(".*\*~/", sentence, re.DOTALL)[0]
# Select from beginning of segment to closing tag
sentence_only = re.search("(.*)\(CM.*", text_sequence, re.DOTALL).group(1)
# Keep group that precedes the sentence ID marker beginning with (CM
cleaned_sentence = sentence_only.strip() # remove trailing white space
cleaned_sentence = re.sub("\n", " ", cleaned_sentence)
We have now extracted the actual sentence, but perhaps we want to remove the inserted lemma information, as it may make the sentence too long or cumbersome to read in a table. We can leave the decision whether to keep the annotation to the user via a flag in the program call. We can wrap our code to extract the sentence in a function. The optional removal of the annotation marking is best done outside of this, in the main function, as it is a one-liner and saves us the trouble of carrying the argument variable over into our custom function. For this one-liner, we have to be careful how we formulate our regular expression. Using @.*@ will capture everything from the first to the last verb, and making it non-greedy with @.*?@ will not cover the annotation sequences, as they contain multiple @ markers. The easiest is to specify non-whitespace characters as our wildcard: @\S*@.
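To see the difference, here is a small sketch on a made-up string (the annotation format is purely hypothetical and only meant to illustrate the greediness behaviour):

import re

toy = "he yaf@given@MED-123@OE@ it and nam@taken@MED-456@ON@ it"  # invented example
print(re.sub("@.*@", "", toy))    # greedy: removes everything between the first and last @
print(re.sub("@.*?@", "", toy))   # non-greedy: leaves parts of each annotation behind
print(re.sub("@\\S*@", "", toy))  # non-whitespace wildcard: removes each annotation cleanly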
def get_arguments():
parser = argparse.ArgumentParser(
description = (
"Extract information from a CorpusSearch output file "
"to a CSV table"
)
)
parser.add_argument(
"file_name",
help = "Name of the file to be processed"
)
parser.add_argument(
"-p",
"--plain",
action = "store_true",
help = "Toggle plain text by removing lemma annotation"
)
arguments = parser.parse_args()
return arguments
def extract_text(hit):
text_sequence = re.search(".*\*~/", hit, re.DOTALL)[0]
# Select from beginning of segment to closing tag
sentence_only = re.search("(.*)\(CM.*", text_sequence, re.DOTALL).group(1)
# Keep group that precedes the sentence ID marker beginning with (CM
cleaned_sentence = sentence_only.strip() # remove trailing white space
cleaned_sentence = re.sub("\\n", " ", cleaned_sentence)
return cleaned_sentence
def main():
args = get_arguments()
content = read_file(args.file_name)
sentences = content.split("/~*")
for sentence in sentences:
if sentence.find("*~/") != -1: # exclude header
identifier = extract_id(sentence)
corpusfile_name = re.sub("(.*),.*", "\\1", identifier)
verbs = extract_lemmas(sentence)
text = extract_text(sentence)
if args.plain:
text = re.sub("@\S*@", "", text)
print(text)
We have extracted all the information we want, and it is now time to assemble the puzzle pieces. There are two options to write out our data (see our prior song example): (1) write each line to the export file as we go, or (2) gather everything in a variable and write to the export file in bulk. Here we go with option (1) for reasons of simplicity and debugging: if there is indeed an error, we can check in our export file which sentence was last processed correctly, and which sentence caused problems. We could then identify whether there were patterns that we had not anticipated.
We need to define a name for our export file. We can base this on the name of our input file, substituting the ending .out with .csv. We can then write a header for the table prior to our for-loop. We will be using a tab-separated format, as the other common separators, comma and semicolon, may occur within the text field. (NB: Text delimiters such as single or double quotation marks are also problematic, as they may occur in the text fields too.) Within our loop, we then write a table row for each hit. Remember that there could be more than one hit per sentence. We therefore first extract the information that is constant within a sentence, i.e. corpus file, ID, and text. Then, we run a for-loop for each matching verb within that sentence.
def write_table_header(name):
with open(name, "w") as outp:
outp.write("CorpusFile\tSentenceID\tVerbInfo\tSentence\n")
def main():
args = get_arguments()
content = read_file(args.file_name)
output_name = re.sub("\.out$", ".csv", args.file_name)
write_table_header(output_name)
sentences = content.split("/~*")
for sentence in sentences:
if sentence.find("*~/") != -1: # exclude header
identifier = extract_id(sentence)
corpus_file_name = re.sub("(.*),.*", "\\1", identifier)
verbs = extract_lemmas(sentence)
text = extract_text(sentence)
if args.plain:
text = re.sub("@\S*@", "", text)
for verb in verbs:
with open(output_name, "a") as outp:
outp.write(f"{corpus_file_name}\t"
f"{identifier}\t"
f"{verb}\t"
f"{text}\n"
)
print(f"\nSentence {identifier} processed...\n")
# Inform user of progress
print("\nProcess complete!\nNow quitting.\n")
# Inform user of completion
Our working version then looks something like this. (NB: Let us refrain from calling it a final or finished version, as there are still improvements we could make.)
#!/usr/bin/env python3
import argparse
import re
def get_arguments():
parser = argparse.ArgumentParser(
description = (
"Extract information from a CorpusSearch output file "
"to a CSV table"
)
)
parser.add_argument(
"file_name",
help = "Name of the file to be processed"
)
parser.add_argument(
"-p",
"--plain",
action = "store_true",
help = "Toggle plain text by removing lemma annotation"
)
arguments = parser.parse_args()
return arguments
def read_file(file_name):
try:
with open(file_name, "r") as inp:
return inp.read()
except FileNotFoundError:
print(
"\nWARNING: "
"The file you specified does not exist.\n"
"Please verify and try again.\n"
)
quit()
def extract_id(hit):
id_tag = re.search("\(CM.*\)", hit)
sentence_id = re.sub("\((.*)\)", "\\1", id_tag[0])
return sentence_id
def extract_lemmas(hit):
query_matches = re.search("/\*\n\d.*?\*/", hit, re.DOTALL)[0]
# Sequence /* followed by a line break,
# the next line beginning with a digit,
# until the next closing sequence */
lemmas = re.findall("@.*@", query_matches)
# To include verb form: "\w*@.*@"
return lemmas
def extract_text(hit):
text_sequence = re.search(".*\*~/", hit, re.DOTALL)[0]
# Select from beginning of segment to closing tag
sentence_only = re.search("(.*)\(CM.*", text_sequence, re.DOTALL).group(1)
# Keep group that precedes the sentence ID marker beginning with (CM
cleaned_sentence = sentence_only.strip() # remove trailing white space
cleaned_sentence = re.sub("\n", " ", cleaned_sentence)
return cleaned_sentence
def write_table_header(name):
with open(name, "w") as outp:
outp.write("CorpusFile\tSentenceID\tVerbInfo\tSentence\n")
def main():
args = get_arguments()
content = read_file(args.file_name)
output_name = re.sub("\.out$", ".csv", args.file_name)
write_table_header(output_name)
sentences = content.split("/~*")
for sentence in sentences:
if sentence.find("*~/") != -1: # exclude header
identifier = extract_id(sentence)
corpus_file_name = re.sub("(.*),.*", "\\1", identifier)
verbs = extract_lemmas(sentence)
text = extract_text(sentence)
if args.plain:
text = re.sub("@\S*@", "", text)
for verb in verbs:
with open(output_name, "a") as outp:
outp.write(f"{corpus_file_name}\t"
f"{identifier}\t"
f"{verb}\t"
f"{text}\n"
)
print(f"\nSentence {identifier} processed...\n")
# Inform user of progress
print("\nProcess complete!\nNow quitting.\n")
# Inform user of completion
if __name__ == "__main__":
main()
Using Tara’s data of 14th-century Dutch, we will extract manual annotation to a spreadsheet. As the data are spread over multiple files, we will need to learn how to process such data. We will first handle the single file WestVlaanderen_1390_1399.txt before extending our script to handle multiple files.
Another challenge is that the custom annotation format allows nesting/recursion in that a clause may contain another clause. Our best bet is to convert the annotation scheme to XML so that we can use tried-and-tested XML parsers rather than having to reinvent the wheel. To do so, we first need to investigate the annotation scheme:
- Sentences of interest are surrounded by the opening tag [ and the closing tag ./], and may be nested.
- Constituents are marked with tags such as s/, a/, o/, etc. These tags have no closing counterpart, so we can either assume that their scope extends only to the following word, or until the next tag or the end of the sentence. As the scheme seems to annotate constituents rather than words, we will assume the latter.

To parse our converted data, we will use the package “Beautiful Soup”, which is a very powerful tool for parsing HTML and XML data. Unlike the re package, it is not built in and needs to be installed. The easiest way to install additional packages in Python is to use pip (Package Installer for Python). From our terminal, we can install “Beautiful Soup” with the command python -m pip install beautifulsoup4 or python3 -m pip install beautifulsoup4. For XML parsing, we also need the package lxml, which we can install with python -m pip install lxml or python3 -m pip install lxml.
To parse the text as XML, we pass it to the BeautifulSoup() function and specify the use of the XML parser.
xml_ready = convert_to_xml(content)
xml = BeautifulSoup(xml_ready, "xml")
print(xml)
We only obtain an XML declaration line, but nothing else. Why is that? To be parsed, the XML text as a whole needs to be enclosed in a set of tags. Given the extensible nature of XML, we can easily do that with a custom set of tags such as <text> or <xml>.
xml = BeautifulSoup(f"<text>{xml_ready}</text>", "xml")
print(xml)
print(xml.prettify())
We can search for specific tags, e.g. all <o> or all <v> tags. We can access different attributes of these tags, such as .name and .string.
xml.find_all("o")
for tag in xml.find_all("v"):
print(f"{tag.name}:\t{tag.string}")
We can search for all tags by using .find_all() and specifying True as the search term.
for tag in xml.find_all(True):
print(f"{tag.name}:\t{tag.string}")
The .string attribute only contains a value when the tag contains text; otherwise it is defined as None. This allows us to see that the <c> tag in the third sentence of interest contains an embedded sentence of interest.
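As a quick sketch of how we might spot such embeddings (using the xml object from above):

for tag in xml.find_all(True):
    if tag.string is None:  # .string is None when the tag holds more than a single text node
        print(f"{tag.name} contains embedded markup")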
Now that we have a way to access the sentences of interest and their tags, even the recursively embedded ones, we can begin extracting the information we want:
- the list of tags
- whether the order is OV or VO
- the actual sentence
We can obtain a tag’s children with the .contents attribute. To iterate over them, we can use the .children generator instead.
sentences = xml.find_all("sentence")
sentences[0].contents # look at the first sentence before we loop over all
for child in sentences[0].children:
print(child.name)
We can use this to build up a string variable holding the list of tags in the sentence (excluding None values).
tag_list = ""
for child in sentences[0].children:
if child.name != None:
tag_list += child.name
print(tag_list)
We can wrap this in a function for later use.
def get_tags(parent):
tag_list = ""
for child in parent.children:
if child.name != None:
tag_list += child.name
return tag_list.upper() # constituent labels in uppercase
As we now have a full list of constituents, we can use regular expressions on it to check whether V occurs before or after O (or assign NA if one or both are not present).
tags = get_tags(sentences[0])
if re.search("V.*O", tags):
vo_order = "VO"
elif re.search("O.*V", tags): # elif = else if
vo_order = "OV"
else:
vo_order = "NA"
print(vo_order)
Again, we can store this in a function for later use.
def get_ov_order(tag_list):
if re.search("V.*O", tag_list):
vo_order = "VO"
elif re.search("O.*V", tag_list):
vo_order = "OV"
else:
vo_order = "NA"
return vo_order
The .string attribute is inadequate for extracting the whole sentence, as we get None values whenever there are embedded tags. To obtain the entire “human-readable” text, we can use the .get_text() method. As this is a one-liner, we do not necessarily need a dedicated function.
text = sentences[0].get_text()
print(text)
Now that we have found ways to extract the information we need, we can loop through all the sentences of a file.
for sentence in sentences:
tags = get_tags(sentence)
ov = get_ov_order(tags)
text = sentence.get_text()
print("---\n"
f"Tags:\t{tags}\n"
f"OV:\t{ov}\n"
f"Text:\t{text}\n"
"---\n"
)
Before we export this data to a table, we shouldn’t forget that we want to handle multiple files.
The os package contains a function os.listdir() that returns a list of files in a given directory/folder.
from os import listdir
listdir(".") # contents of the current working directory
listdir("dutch-data") # contents of a specific directory
This way, we can generate a list of files through which we can loop. Also, we can catch potential errors if the user provides the wrong directory name. We can exclude files that are not of the type we want, e.g. invisible files that may have been placed by the OS, or syncing services such as Dropbox, Nextcloud, or GitHub.
def get_file_list(directory):
try:
file_names = listdir(directory)
return file_names
except FileNotFoundError:
print("\nThe directory you specified does not exist."
"\nPlease verify its existence and spelling and try again."
"\nQuitting now...\n"
)
quit()
files = get_file_list("dutch-data")
for file in files:
if re.search("\.txt$", file):
print(file)
files = get_file_list("ditch-data") # typo in the directory name
Now that we have all the “puzzle pieces”, we can combine all the elements together. First, we write a function to parse the argument call to obtain the name of the directory. We can adapt the function we defined in our previous script.
def get_arguments():
parser = argparse.ArgumentParser(
description = (
"Extract information from annotated text files "
"to a CSV table"
)
)
parser.add_argument(
"dir_name",
help = "Name of the directory to be processed"
)
arguments = parser.parse_args()
return arguments
One tiny missing puzzle piece is a function to prepare a table header before we iterate through our files and sentences.
def write_table_header(file_name):
with open(file_name, "w") as outp:
outp.write("File\tConstituents\tOV_order\tText\n")
Now we can start working on our main() function, in which we assemble our “puzzle”.
from os import listdir, path
def main():
args = get_arguments()
files = get_file_list(args.dir_name)
output_name = f"{args.dir_name}.csv"
write_table_header(output_name)
for file in files:
if re.search("\.txt$", file):
file_path = path.join(args.dir_name, file)
content = read_file(file_path)
xml_ready = convert_to_xml(content)
xml = BeautifulSoup(f"<text>{xml_ready}</text>", "xml")
sentences = xml.find_all("sentence")
for sentence in sentences:
tags = get_tags(sentence)
ov = get_ov_order(tags)
text = sentence.get_text()
with open(output_name, "a") as outp:
outp.write(f"{file}\t{tags}\t{ov}\t{text}\n")
print(f"\nFile {file} processed...\n")
print(
"All files processed.\n"
f"Table saved as {output_name}\n"
"Now quitting.\n"
)
The working version of our script looks something like this.
#!/usr/bin/env python3
from os import listdir, path
import re
import argparse
from bs4 import BeautifulSoup
from output_to_csv import read_file
def get_arguments():
parser = argparse.ArgumentParser(
description = (
"Extract information from annotated text files "
"to a CSV table"
)
)
parser.add_argument(
"dir_name",
help = "Name of the directory to be processed"
)
arguments = parser.parse_args()
return arguments
def get_file_list(directory):
try:
file_names = listdir(directory)
return file_names
except FileNotFoundError:
print("\nThe directory you specified does not exist."
"\nPlease verify its existence and spelling and try again."
"\nQuitting now...\n"
)
quit()
def write_table_header(file_name):
with open(file_name, "w") as outp:
outp.write("File\tConstituents\tOV_order\tText\n")
def convert_to_xml(text):
sentence_opening = re.sub("\[", "<sentence>", text)
sentence_closing = re.sub("\./\]", "</sentence>", sentence_opening)
clause_marking = re.sub(
"c/<sentence>(.*?)</sentence>",
"<.><sentence>\\1</sentence><..>",
sentence_closing)
tags_opening = re.sub("\\b(\w)/","<\\1>", clause_marking)
# Capture the word character only, as we don't want to keep the slash
tags_closing = re.sub("<(\w)>([^<]*)", "<\\1>\\2</\\1>", tags_opening)
# Capture the character used in the opening tag,
# then capture the group of following characters other than <
# the closing tag is the same as the opening one but with slash
clause_opening = re.sub("<\.>", "<c>", tags_closing)
clause_closing = re.sub("<\.\.>", "</c>", clause_opening)
return clause_closing
def get_tags(parent):
tag_list = ""
for child in parent.children:
if child.name != None:
tag_list += child.name
return tag_list.upper() # constituent labels in uppercase
def get_ov_order(tag_list):
if re.search("V.*O", tag_list):
vo_order = "VO"
elif re.search("O.*V", tag_list):
vo_order = "OV"
else:
vo_order = "NA"
return vo_order
def main():
args = get_arguments()
files = get_file_list(args.dir_name)
output_name = f"{args.dir_name}.csv"
write_table_header(output_name)
for file in files:
if re.search("\.txt$", file):
file_path = path.join(args.dir_name, file)
content = read_file(file_path)
xml_ready = convert_to_xml(content)
xml = BeautifulSoup(f"<text>{xml_ready}</text>", "xml")
sentences = xml.find_all("sentence")
for sentence in sentences:
tags = get_tags(sentence)
ov = get_ov_order(tags)
text = sentence.get_text()
with open(output_name, "a") as outp:
outp.write(f"{file}\t{tags}\t{ov}\t{text}\n")
print(f"\nFile {file} processed...\n")
print(
"All files processed.\n"
f"Table saved as {output_name}\n"
"Now quitting.\n"
)
if __name__ == "__main__":
main()
Using a file from Lena’s Middle English data, Brut.out, we will determine the distance between primes and targets. Our file is a CorpusSearch output file that was generated by a query for double-object constructions and prepositional-object constructions. To start out small and test the methodology, the query was carried out only on the corpus file CMBRUT3 (The Brut or The Chronicles of England). The intended measuring units for the distances between primes and targets are corpus units, i.e. the sentences marked with identifiers in the corpus.
These identifiers are the first option for checking the distance between prime and target, provided that sentence IDs are numbered consistently. Based on the hits in our result file, we see that the sentence numbering scheme does not simply number sentences, but also divides the file into sections, e.g. CMBRUT3,2.29 and CMBRUT3,3.33, so we cannot measure the distance between these two units without knowing how many units there are in the second section. Looking at the source corpus file m3-fr.cmbrut3.m3.lemma.psd, we can see that even within a section, the numbering is not consistent, as certain numbers are skipped; e.g. there is no CMBRUT3,1.4.
Our safest bet is therefore to renumber the corpus units in a consistent manner. Rather than overwriting the identifiers used in the corpus, we will map them to a simpler natural number scheme.
We open our corpus file and gather all sentence IDs. We then assign each of them a running number in a for-loop, incrementing the running number by 1 at every iteration. To store a mapping of IDs and running numbers, we will use a data structure called a dictionary (called hash in many other programming languages). Dictionaries store combinations of keys and associated values under a single variable. They are defined by pairs of curly brackets and keyword/value combinations that can be created or accessed with square brackets. Let us do the first two IDs manually to get a feel of how dictionaries work.
import re
from output_to_csv import read_file
corpus_file = read_file("m3-fr.cmbrut3.m3.lemma.psd")
id_list = re.findall("\(ID\s.*?\)", corpus_file)
# Sequence (ID until the next closing bracket
numbering_scheme = {} # create empty dictionary
running_number = 1
# First ID
current_id = re.sub("\(ID\s(.*)\)", "\\1", id_list[0])
numbering_scheme[current_id] = running_number
numbering_scheme
# Second ID
current_id = re.sub("\(ID\s(.*)\)", "\\1", id_list[1])
running_number += 1
numbering_scheme[current_id] = running_number
numbering_scheme
# Access values by their keys
numbering_scheme[current_id]
numbering_scheme["CMBRUT3,1.3"]
Now that we have had a first taste of how dictionaries work, we can wrap everything in a function in our script and treat all the IDs in the file in a for-loop.
import re
from output_to_csv import read_file
def renumber(content):
id_list = re.findall("\(ID\s.*?\)", content)
# Sequence (ID until the next closing bracket
numbering_scheme = {} # create empty dictionary
running_number = 1
for unit in id_list:
current_id = re.sub("\(ID\s(.*)\)", "\\1", unit)
numbering_scheme[current_id] = running_number
running_number += 1
return numbering_scheme
def main():
corpus_file = read_file("m3-fr.cmbrut3.m3.lemma.psd")
id_map = renumber(corpus_file)
So far, we have always exported our output as CSV tables to be used
by other programs (e.g. a spreadsheet application or R). Given that we
intend to use our scheme in Python, it seems redundant to export it as a
table and reimport it as a table later. We could also continue working
with the scheme within the same Python script, but ideally we’d like to
separate the renumbering from the determination of prime and target
distances. We want to perform the renumbering only once, and investigate
the prime and target distances for multiple queries. Exporting and
reimporting is the more viable solution then. However, there is a way to
save data directly as a Python object, called pickling.
We avoid potentially lengthy conversion processes, as the pickle is “preserved” as is. It should be noted that, for this very reason, one should only load pickled data from trustworthy sources. As an alternative that is better suited for sharing, is human-readable, and can be accessed in programming languages other than Python, the JSON (JavaScript Object Notation) format may be a better option. Python has a json module that uses the same .dump() and .load() methods as the pickle module. However, using JSON would mean converting back and forth between Python objects and JSON text whenever we save or load the scheme. As we will use our numbering scheme only locally and in Python, we will stick to pickling, as we get to enjoy the advantages (speed and 100% Python compatibility) with none of the security issues.
import pickle
def main():
corpus_file = read_file("m3-fr.cmbrut3.m3.lemma.psd")
id_map = renumber(corpus_file)
with open("id_map.pickle", "wb") as outp:
pickle.dump(id_map, outp) # object, file
If we restart Python (as would happen if we were to reuse the object in a different script), we can load the pickle to obtain the exact same object.
import pickle
with open("id_map.pickle", "rb") as inp:
id_map = pickle.load(inp)
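For comparison, a JSON-based version of the same save-and-load step might look like this (a sketch; the file name id_map.json is just for illustration):

import json

with open("id_map.json", "w") as outp:
    json.dump(id_map, outp)  # write the dictionary as JSON text

with open("id_map.json", "r") as inp:
    id_map = json.load(inp)  # read it back into a Python dictionary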
One remaining issue is that we are only handling one corpus file. Once we add more files, how will we handle the transition from one file to the next in the numbering scheme stored in our dictionary? To make our script future-proof, we will address this now by nesting dictionaries: each corpus file will have its own dictionary and its own running count within the main dictionary. (NB: We could also have a separate pickle file for each corpus file, but this may be more difficult to manage in the long run.)
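A sketch of the intended nested structure (the keys follow the real ID format, but the entries and running numbers shown here are made up):

id_map = {
    "CMBRUT3": {              # one nested dictionary per corpus file
        "CMBRUT3,1.3": 1,     # corpus unit ID mapped to its running number
        "CMBRUT3,2.29": 2,
        # ...
    },
    "CMCAPSER": {             # hypothetical ID for the second corpus file
        "CMCAPSER,1.1": 1,
        # ...
    },
}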
We will add a second corpus file (m4-en.cmcapser.m4.lemma.psd) to see whether our renumbering system works on multiple files. We place our files in a separate directory called ppcme2.
We will also need to update our renumber() function to insert a nested dictionary rather than creating one from scratch. We can save the code below as a script called renumber_corpus_units.py and run it once on the directory ppcme2. We will use the resulting file ppcme2.pickle for our script to measure the distance between primes and targets.
import argparse, pickle, re
from os import listdir, path
from output_to_csv import read_file
from dutch_data import get_file_list
def get_arguments():
parser = argparse.ArgumentParser(
description = (
"Map corpus units to consistent numbering scheme"
)
)
parser.add_argument(
"dir_name",
help = "Name of the directory to be processed"
)
arguments = parser.parse_args()
return arguments
def renumber(content, main_dictionary, file_id):
id_list = re.findall("\(ID\s.*?\)", content)
# Sequence (ID until the next closing bracket
main_dictionary[file_id] = {} # create nested dictionary
running_number = 1
for unit in id_list:
current_id = re.sub("\(ID\s(.*)\)", "\\1", unit)
main_dictionary[file_id][current_id] = running_number
running_number += 1
def main():
id_map = {} # main dictionary across corpus files
args = get_arguments()
files = get_file_list(args.dir_name)
for file in files:
if re.search("\.psd$", file):
file_path = path.join(args.dir_name, file)
with open(file_path, "r") as inp:
file_content = inp.read()
file_id = re.search("\(ID\s(.*?),", file_content).group(1)
# (ID followed by a space, we catch the group until the comma
renumber(file_content, id_map, file_id)
print(f"File {file} processed...\n")
with open(f"{args.dir_name}.pickle", "wb") as outp:
pickle.dump(id_map, outp)
print(
f"Dictionary pickled as {args.dir_name}.pickle\n"
"Now quitting."
)
if __name__ == "__main__":
main()
Now that we have a consistently numbered version of our corpus files, we can return to our actual goal: determining the distance between primes and their targets. Our CorpusSearch output file Brut.out contains double objects (DO) and prepositional objects (PO). We want to generate a table that contains:
- the text ID
- the sentence ID
- the construction found in the hit
- the distance to the next hit (or NA for the last hit of the file)
- the sentence itself

We first load our pickled numbering scheme.
import pickle
with open("ppcme2.pickle", "rb") as inp:
id_map = pickle.load(inp)
We now turn to the file Brut.out. While we are treating an output file from a query that was run on a single corpus file for now, we can already make our script ready for queries run on multiple files, so the first level of keys in our dictionary will be that of individual corpus texts, even though this appears redundant now, as all hits will be from the same text.
We won’t write a table directly to an output file as in our previous script that converted an output file to a CSV table (output_to_csv.py). The reason for this is that we will perform further operations once we have extracted information from the sentences, namely the distance from the next hit. So in a first step, we will store the relevant information about each hit in a dictionary, including the number we assigned in our new scheme. We will estimate the distances based on these numbers later on.
We can import some functions from our previous script output_to_csv.py and adapt some other lines from it. We store some of the operations as functions so that our main function doesn’t appear too long or convoluted. As there may be more than one indirect object per sentence, we store them as a list. For now, we extract the information from each sentence and store it in our dictionary. We print the dictionary and pickle it to check whether the extraction process that constitutes step 1 worked correctly before we proceed with the work on step 2.
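The dictionary we are about to build will roughly have the following shape (a sketch with made-up values; the keys mirror those used in the code below):

hit_dict = {
    "CMBRUT3": {                   # text ID
        "CMBRUT3,2.29": {          # sentence ID (illustrative)
            "text": "...",         # cleaned sentence text
            "running_number": 42,  # number from our new scheme (value made up here)
            "objects": ["PP"],     # object constructions found in the hit
        },
    },
}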
#!/usr/bin/env python3
import argparse, pickle, re
from output_to_csv import read_file, extract_text
def get_arguments():
parser = argparse.ArgumentParser(
description = (
"Determine distance between query hits"
)
)
parser.add_argument(
"dot_out_file",
help = "Name of the CorpusSearch .out file to be processed"
)
arguments = parser.parse_args()
return arguments
def extract_objects(sentence):
query_matches = re.search("/\*\n\d.*?\*/", sentence, re.DOTALL)[0]
# Sequence /* followed by a line break,
# the next line beginning with a digit,
# until the next closing sequence */
objects = re.findall("(PP|OB2)", query_matches)
return objects
def add_sentence_information(
sentence,
text_id,
sentence_id,
main_dict,
map_dict
):
text = extract_text(sentence)
objects = extract_objects(sentence)
running_number = map_dict[text_id][sentence_id]
# for now, just copy the new sentence number
if text_id not in list(main_dict):
main_dict[text_id] = {}
# if this is the first entry for a new text,
# we need to create a new nested dictionary
# before inserting the information for our sentence
main_dict[text_id][sentence_id] = {}
main_dict[text_id][sentence_id]["text"] = text
main_dict[text_id][sentence_id]["running_number"] = running_number
main_dict[text_id][sentence_id]["objects"] = objects
def main():
# Step 1: extract information from query hits,
# including running sentence number
with open("ppcme2.pickle", "rb") as inp:
id_map = pickle.load(inp) # load running number scheme
hit_dict = {} # main dictionary to hold sentence information
args = get_arguments()
file_name = args.dot_out_file
file_content = read_file(file_name)
sentences = file_content.split("/~*")
for sentence in sentences:
if re.search("\*~/", sentence): # exclude header
sentence_id = re.search("\(\d+\sID\s(.*?)\)", sentence).group(1)
# (index space ID space, we catch the following group
# until the first closing bracket
text_id = re.search("(.*?),", sentence_id).group(1)
# we catch any characters until the first comma
add_sentence_information(
sentence,
text_id,
sentence_id,
hit_dict,
id_map
)
# End of step 1: print and export intermediate results
print(hit_dict) # check how well it worked
with open("hit_dict.pickle", "wb") as outp:
pickle.dump(hit_dict, outp)
# export as a pickle so we can run some tests
# in our live session to work on step 2
if __name__ == "__main__":
main()
Now that we have extracted the information from the query hits, we want to create our export table. We can loop over the dictionary keys and gather the information that we need for each. As corpus units may contain more than one hit, we should write one table row per hit rather than one per corpus unit. This means that we will only write the table row at the level of an embedded loop iterating through the list of indirect objects. We should also be careful about the distance between tokens within the same unit: the last item in the list should be measured against the next unit; all the others should have the value 0.
The main point of the script is to obtain the distance from the next hit. To do so, we need to access the sentence number of the next corpus unit. Dictionaries are not numbered as such, so we can’t directly access the next item. However, we can work around this by converting the inventory of keys into a list, from which we can increment the index to obtain the next key. We have to be aware that if we are at the last key in a list, fetching the non-existent next item will result in an index error, which we have to catch to prevent our program from crashing.
def get_next_key(unit_keys, unit_key):
current_index = unit_keys.index(unit_key)
try:
next_key = unit_keys[current_index + 1]
except IndexError:
next_key = "NA"
return next_key
def write_table_header(output_file_name):
with open(output_file_name, "w") as outp:
outp.write(
"TextID\t"
"SentenceID\t"
"Construction\t"
"Distance\t"
"Sentence\n"
)
def write_table_row(
output_file_name,
text_id,
sentence_id,
construction,
distance,
sentence
):
with open(output_file_name, "a") as outp:
outp.write(
f"{text_id}\t"
f"{sentence_id}\t"
f"{construction}\t"
f"{distance}\t"
f"{sentence}\n"
)
def main():
# Step 1: extract information from query hits,
# including running sentence number
with open("ppcme2.pickle", "rb") as inp:
id_map = pickle.load(inp) # load running number scheme
hit_dict = {} # main dictionary to hold sentence information
args = get_arguments()
file_name = args.dot_out_file
file_content = read_file(file_name)
sentences = file_content.split("/~*")
for sentence in sentences:
if re.search(r"\*~/", sentence): # exclude header
sentence_id = re.search(r"\(\d+\sID\s(.*?)\)", sentence).group(1)
# (opening bracket, index, space, ID, space;
# we capture the following group up to the first closing bracket)
text_id = re.search("(.*?),", sentence_id).group(1)
# we catch any characters until the first comma
add_sentence_information(
sentence,
text_id,
sentence_id,
hit_dict,
id_map
)
# Step 2: Iterate through our dictionary of hits,
# gather the information including the distance to the next hit,
# and write to our output table
text_keys = list(hit_dict)
output_file_name = re.sub(r"\.out$", ".csv", file_name)
write_table_header(output_file_name)
for text_key in text_keys:
unit_keys = list(hit_dict[text_key])
for unit_key in unit_keys:
current_text = hit_dict[text_key][unit_key]['text']
object_list = hit_dict[text_key][unit_key]['objects']
try:
current_number = hit_dict[text_key][unit_key]['running_number']
next_key = get_next_key(unit_keys, unit_key)
# If the function get_next_key() returns "NA",
# we have to catch the ensuing key error
# (as we don't have a corpus unit with the key "NA").
next_number = hit_dict[text_key][next_key]['running_number']
distance = next_number - current_number
except KeyError:
distance = "NA"
for item in object_list:
if object_list.index(item) == len(object_list) -1:
report_distance = distance
else:
report_distance = 0
write_table_row(
output_file_name,
text_key,
unit_key,
item,
report_distance,
current_text
)
print(f"Processed unit {unit_key}...\n")
print(f"Processed text {text_key}...\n")
print(
"Process complete.\n"
f"Table saved as {output_file_name}.\n"
"Now quitting.\n"
)
Our script runs without errors and generates a table that looks like what we want. Upon closer inspection, however, we can see that multiple hits within a unit sometimes (but not always) have the value 0 even for the last hit. We just caught a bug! Just because our script runs without errors doesn't mean that it works as intended. The bug shows the following pattern: when the multiple items within a corpus unit have the same value (e.g. in CMBRUT3,7.167), we get 0 across the board, but when the multiple items are not identical (e.g. CMBRUT3,17.514), we get the expected behaviour.

How did this happen? When we checked whether an item was the last one on the list, the index() method always returned the position of the first occurrence of that value, so when the same object type appears multiple times, we always get the index of the first item. We should therefore remove each item from the list after we have treated it, but we don't want to modify a list while we are iterating over it, as this is likely to cause problems, such as our loop skipping an item. A quick fix is to wrap our list in a call to list(). This way, we iterate over a copy, which will not change as we remove items from the original. The original list will be empty once we are done iterating, which is not a problem for our purpose, but if we were to need the list afterwards, we would have to come up with a different fix.
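To see the mechanism in isolation, here is a minimal sketch on a toy list (the values are invented):

tags = ["PP", "PP"]
for tag in tags:
    print(tags.index(tag))  # index() returns the position of the first "PP" both times
## 0
## 0

tags = ["PP", "PP"]
for tag in list(tags):  # iterate over a copy ...
    print(tags.index(tag) == len(tags) - 1)
    tags.remove(tag)  # ... while removing items from the original
## False
## True

With this in mind, the fixed step 2 of main() looks as follows.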
def main():
# ...
# Step 2: Iterate through our dictionary of hits,
# gather the information including the distance to the next hit,
# and write to our output table
text_keys = list(hit_dict)
output_file_name = re.sub(r"\.out$", ".csv", file_name)
write_table_header(output_file_name)
for text_key in text_keys:
unit_keys = list(hit_dict[text_key])
for unit_key in unit_keys:
current_text = hit_dict[text_key][unit_key]['text']
object_list = hit_dict[text_key][unit_key]['objects']
try:
current_number = hit_dict[text_key][unit_key]['running_number']
next_key = get_next_key(unit_keys, unit_key)
next_number = hit_dict[text_key][next_key]['running_number']
distance = next_number - current_number
except KeyError:
distance = "NA"
for item in list(object_list):
if object_list.index(item) == len(object_list) -1:
report_distance = distance
else:
report_distance = 0
write_table_row(
output_file_name,
text_key,
unit_key,
item,
report_distance,
current_text
)
object_list.remove(item)
print(f"Processed unit {unit_key}...\n")
print(f"Processed text {text_key}...\n")
print(
"Process complete.\n"
f"Table saved as {output_file_name}.\n"
"Now quitting.\n"
)
The working version of our script looks something like this.
#!/usr/bin/env python3
import argparse, pickle, re
from output_to_csv import read_file, extract_text
def get_arguments():
parser = argparse.ArgumentParser(
description = (
"Determine distance between query hits"
)
)
parser.add_argument(
"dot_out_file",
help = "Name of the CorpusSearch .out file to be processed"
)
arguments = parser.parse_args()
return arguments
def extract_objects(sentence):
query_matches = re.search(r"/\*\n\d.*?\*/", sentence, re.DOTALL)[0]
# Sequence /* followed by a line break,
# the next line beginning with a digit,
# until the next closing sequence */
objects = re.findall("(PP|OB2)", query_matches)
return objects
def add_sentence_information(
sentence,
text_id,
sentence_id,
main_dict,
map_dict
):
text = extract_text(sentence)
objects = extract_objects(sentence)
running_number = map_dict[text_id][sentence_id]
# for now, just copy the new sentence number
if text_id not in list(main_dict):
# if this is the first entry for a new text,
# we need to create a new nested dictionary
# before inserting the information for our sentence
main_dict[text_id] = {}
main_dict[text_id][sentence_id] = {}
main_dict[text_id][sentence_id]["text"] = text
main_dict[text_id][sentence_id]["running_number"] = running_number
main_dict[text_id][sentence_id]["objects"] = objects
def get_next_key(unit_keys, unit_key):
current_index = unit_keys.index(unit_key)
try:
next_key = unit_keys[current_index + 1]
except IndexError:
next_key = "NA"
return next_key
def write_table_header(output_file_name):
with open(output_file_name, "w") as outp:
outp.write(
"TextID\t"
"SentenceID\t"
"Construction\t"
"Distance\t"
"Sentence\n"
)
def write_table_row(
output_file_name,
text_id,
sentence_id,
construction,
distance,
sentence
):
with open(output_file_name, "a") as outp:
outp.write(
f"{text_id}\t"
f"{sentence_id}\t"
f"{construction}\t"
f"{distance}\t"
f"{sentence}\n"
)
def main():
# Step 1: extract information from query hits,
# including running sentence number
with open("ppcme2.pickle", "rb") as inp:
id_map = pickle.load(inp) # load running number scheme
hit_dict = {} # main dictionary to hold sentence information
args = get_arguments()
file_name = args.dot_out_file
file_content = read_file(file_name)
sentences = file_content.split("/~*")
for sentence in sentences:
if re.search(r"\*~/", sentence): # exclude header
sentence_id = re.search(r"\(\d+\sID\s(.*?)\)", sentence).group(1)
# (opening bracket, index, space, ID, space;
# we capture the following group up to the first closing bracket)
text_id = re.search("(.*?),", sentence_id).group(1)
# we catch any characters until the first comma
add_sentence_information(
sentence,
text_id,
sentence_id,
hit_dict,
id_map
)
# Step 2: Iterate through our dictionary of hits,
# gather the information including the distance to the next hit,
# and write to our output table
text_keys = list(hit_dict)
output_file_name = re.sub(r"\.out$", ".csv", file_name)
write_table_header(output_file_name)
for text_key in text_keys:
unit_keys = list(hit_dict[text_key])
for unit_key in unit_keys:
current_text = hit_dict[text_key][unit_key]['text']
object_list = hit_dict[text_key][unit_key]['objects']
try:
current_number = hit_dict[text_key][unit_key]['running_number']
next_key = get_next_key(unit_keys, unit_key)
next_number = hit_dict[text_key][next_key]['running_number']
distance = next_number - current_number
except KeyError:
distance = "NA"
for item in list(object_list):
if object_list.index(item) == len(object_list) -1:
report_distance = distance
else:
report_distance = 0
write_table_row(
output_file_name,
text_key,
unit_key,
item,
report_distance,
current_text
)
object_list.remove(item)
print(f"Processed unit {unit_key}...\n")
print(f"Processed text {text_key}...\n")
print(
"Process complete.\n"
f"Table saved as {output_file_name}.\n"
"Now quitting.\n"
)
if __name__ == "__main__":
main()
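Assuming we save the script under a name like distance_between_hits.py (the file name is just an example), in the same folder as output_to_csv.py and ppcme2.pickle, we can run it on a CorpusSearch output file from our Terminal (the .out file name is likewise an example):

python3 distance_between_hits.py myquery.out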
There are still some improvements that could be made:
For now, we have hard-coded the pickled renumbering scheme for the PPCME2, but we could expand this to cover any pickled inventory for other corpora (e.g. PLAEME, PCMEP, or a combination of PPCME2, PLAEME, and PCMEP). We could pass the pickled file to be loaded as an argument.
The features under investigation, i.e. the double object versus the prepositional object construction, are also hard-coded, so we could think of ways to broaden the script's applications, e.g. by providing a list of tags as an argument, or by storing them in a text file whose name can be specified and loaded via an argument (see the sketch below).
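As a sketch of what such an extension might look like (the option names --pickle-file and --tags are assumptions for illustration, not part of the script above), the argument parser could be expanded along these lines:

def get_arguments():
    parser = argparse.ArgumentParser(
        description="Determine distance between query hits"
    )
    parser.add_argument(
        "dot_out_file",
        help="Name of the CorpusSearch .out file to be processed"
    )
    # hypothetical option: which pickled renumbering scheme to load
    parser.add_argument(
        "--pickle-file",
        default="ppcme2.pickle",
        help="Pickled running-number scheme to load"
    )
    # hypothetical option: which tags to look for in the query hits
    parser.add_argument(
        "--tags",
        nargs="+",
        default=["PP", "OB2"],
        help="Tags to extract from the query hits"
    )
    return parser.parse_args()

extract_objects() could then build its pattern from the tag list, e.g. with "|".join(arguments.tags), and main() could open the file given under --pickle-file instead of the hard-coded ppcme2.pickle.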
But for now, our script works for our immediate purpose.