Workshop: Working with corpora (1 April 2022, University of Mannheim)
Held by Carola Trips and Michael Percillier
Programme
- Morning (10:00–12:30):
  - Introduction to the Penn corpora (corpus design, website, CorpusSearch)
  - How to install and work with CorpusSearch?
  - New Middle English corpora
- Lunch break (12:30–14:00)
- Afternoon (14:00–16:00):
  - Alternative search method (presented by Tara Struik)
  - How to proceed with your data
Resources
- CorpusSearch: http://corpussearch.sourceforge.net
- Penn Parsed Corpora of English: https://www.ling.upenn.edu/hist-corpora/index.html
- Middle English Dictionary: https://quod.lib.umich.edu/m/middle-english-dictionary/dictionary
- Toolbox Anglistik IV: http://anglistik-toolbox.uni-mannheim.de
- BASICS Toolkit: http://basics-toolkit.spdns.org
Setting up CorpusSearch
For this workshop, we will be running CorpusSearch from the search platform on the Toolbox Anglistik IV of Mannheim University. The credentials will be provided during the workshop.
If you wish to run CorpusSearch on your own computer, you will need to set it up first. The official installation instructions are available on the CorpusSearch website (see Resources); the guide below is more detailed. The instructions differ slightly between Windows and Linux/macOS (referred to jointly as Unix-like from now on).
Windows
- Install Java from https://java.com. You can test whether the installation was successful by opening the program PowerShell and entering the command
java -version
- Download the latest version of CorpusSearch and place the file
CS_2.003.04.jar
in a convenient location, e.g. C:\Program Files
- You can now run CorpusSearch with the following command in Windows Terminal or PowerShell (or cmd.exe on older versions of Windows):
java -classpath "C:\Program Files\CS_2.003.04.jar" csearch/CorpusSearch
- To avoid typing or copy-pasting the lengthy command from the previous step over and over, you can define an alias with the following command (doskey macros are a cmd.exe feature and may not be picked up by PowerShell):
doskey cs=java -classpath "C:\Program Files\CS_2.003.04.jar" csearch/CorpusSearch $*
You can now use the command cs to run CorpusSearch. However, this alias is only active for the current session, so you will have to redefine it each time you open a new terminal. If you wish to have a permanent alias as described for Unix-like systems below, you may consider installing Cygwin or the Windows Subsystem for Linux.
Unix-like
- A suitable version of Java may already be installed on your system. Verify this by entering the command
java -version
in your Terminal. Should Java not be installed, install it from https://java.com
- Download the latest version of CorpusSearch and place the file
CS_2.003.04.jar
in a convenient location, e.g. ~/corpussearch
- You can now run CorpusSearch with the following command in the Terminal:
java -classpath ~/corpussearch/CS_2.003.04.jar csearch/CorpusSearch
- To simplify running CorpusSearch, you can create an alias so that you only have to type cs rather than the entire command from the previous step. The exact instructions vary slightly depending on which shell your Terminal uses. If you are unsure, type either
echo $0
or
echo $SHELL
in your Terminal.
- bash (default on most Linux distributions and on macOS prior to version 10.15 Catalina):
  - In your Terminal, move to your home directory with the command
  cd ~
  - List the contents of your home directory, including hidden files whose names start with a dot, with the command
  ls -la
  - Look for the file
  .bashrc
  (Linux) or
  .bash_profile
  (macOS). If it doesn’t exist, create it with
  touch .bashrc
  or
  touch .bash_profile
  - Open the file with a text editor, e.g.
  nano .bashrc
  or
  nano .bash_profile
  and add the following line:
  alias cs='java -classpath $HOME/corpussearch/CS_2.003.04.jar csearch/CorpusSearch'
  Then save and exit the editor. If you do not feel at ease using a terminal-based text editor, you can use the command
  open -e .bash_profile
  to open the file in TextEdit on macOS.
  - Restart your Terminal: the command
  cs
  now launches CorpusSearch.
- zsh (default on macOS since version 10.15 Catalina):
  - In your Terminal, move to your home directory with the command
  cd ~
  - List the contents of your home directory, including hidden files whose names start with a dot, with the command
  ls -la
  - Look for the file
  .zshrc
  If it doesn’t exist, create it with
  touch .zshrc
  - Open the file with a text editor, e.g.
  nano .zshrc
  and add the following line:
  alias cs='java -classpath $HOME/corpussearch/CS_2.003.04.jar csearch/CorpusSearch'
  Then save and exit the editor. If you do not feel at ease using a terminal-based text editor, you can use the command
  open -e .zshrc
  to open the file in TextEdit on macOS.
  - Restart your Terminal: the command
  cs
  now launches CorpusSearch.
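- fish:
  - Enter the following command:
  abbr -a cs java -classpath $HOME/corpussearch/CS_2.003.04.jar csearch/CorpusSearch
  - Restart your Terminal: the command
  cs
  is now an abbreviation for the longer command used to run CorpusSearch above. Should the abbreviation be gone after the restart (recent fish versions may not store abbreviations permanently), add the abbr command to the file ~/.config/fish/config.fish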
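The bash and zsh steps above can also be scripted. The following is only a minimal sketch under the assumptions of this guide (jar file in ~/corpussearch, alias named cs); the function names rc_file_for and add_cs_alias are hypothetical and not part of CorpusSearch. It does not cover fish.

```shell
#!/bin/sh
# Hypothetical helper: print the startup file used in the steps above.
# $1 = shell name (bash or zsh), $2 = kernel name as printed by `uname -s`
rc_file_for() {
    case "$1" in
        zsh)
            printf '%s\n' "$HOME/.zshrc" ;;
        bash)
            # macOS reports "Darwin"; the guide uses .bash_profile there
            if [ "$2" = "Darwin" ]; then
                printf '%s\n' "$HOME/.bash_profile"
            else
                printf '%s\n' "$HOME/.bashrc"
            fi ;;
        *)
            # Other shells (e.g. fish) are not handled by this sketch
            return 1 ;;
    esac
}

# Hypothetical helper: append the cs alias to a startup file.
# $1 = startup file to edit
add_cs_alias() {
    # \$HOME keeps the variable unexpanded, as in the line shown in the guide
    alias_line="alias cs='java -classpath \$HOME/corpussearch/CS_2.003.04.jar csearch/CorpusSearch'"
    # Create the file if it does not exist yet
    touch "$1"
    # Append the line only if no cs alias is defined in the file so far
    if ! grep -q '^alias cs=' "$1"; then
        printf '%s\n' "$alias_line" >> "$1"
    fi
}

# Example call, commented out so that nothing is modified by accident:
# add_cs_alias "$(rc_file_for "$(basename "$SHELL")" "$(uname -s)")"
```

As with the manual steps, the alias only becomes available after restarting the Terminal.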
Running CorpusSearch
Assuming that you have a cs alias already set up, CorpusSearch can be run as follows:
cs query.q corpus.psd
where query.q is your query file and corpus.psd is your corpus file. This also assumes that both files are in your current working directory, e.g. C:\Documents\CorpusSearch (Windows) or ~/Documents/CorpusSearch (Unix-like). You should therefore switch to that directory beforehand with the cd command.
By default, this creates an output file with the same name as the query file and the ending .out, so query.out in our example. To give the output file a different name, specify it with the -o flag:
cs query.q corpus.psd -o results.out
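The naming rule for the default output file can be mirrored in a small shell helper, for instance to check whether an earlier result is still present before starting a new run. This is a minimal sketch: the function name default_out is hypothetical, and it is an assumption here that the output file is written next to the query file.

```shell
#!/bin/sh
# Hypothetical helper: derive the default output name described above
# (same name as the query file, ending .out) from a query file name.
default_out() {
    # ${1%.q} strips a trailing ".q" from the first argument
    printf '%s\n' "${1%.q}.out"
}

# Example: report an existing result file before running the query again
# (assumes the output file is written next to the query file).
out=$(default_out query.q)
if [ -e "$out" ]; then
    echo "Note: $out already exists" >&2
fi
```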
As a corpus generally consists of multiple files, you may want to query them all at once rather than individually. This can be achieved with the * wildcard. Assuming that your corpus files are stored in the directory C:\Documents\Corpus (Windows) or ~/Documents/Corpus (Unix-like), you can query all of them with
cs query.q C:\Documents\Corpus\*.psd
(Windows) or
cs query.q ~/Documents/Corpus/*.psd
(Unix-like).
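If you would rather have one output file per corpus file, the query can be run in a loop on Unix-like systems. The sketch below makes several assumptions: the jar file is in ~/corpussearch as in the setup section, the -o flag works as described above, and the names run_per_file and the results directory are hypothetical. Aliases defined in a startup file are usually not available inside scripts, so the sketch defines cs as a shell function instead.

```shell
#!/bin/sh
# Wrap the full command in a function, since the cs alias from the
# startup file is usually not available inside scripts.
cs() {
    java -classpath "$HOME/corpussearch/CS_2.003.04.jar" csearch/CorpusSearch "$@"
}

# Hypothetical helper: run one query on every .psd file in a directory
# and write one output file per corpus file.
# $1 = query file, $2 = corpus directory, $3 = directory for the results
run_per_file() {
    query=$1
    corpus_dir=$2
    out_dir=$3
    mkdir -p "$out_dir"
    for f in "$corpus_dir"/*.psd; do
        # If no file matches, the pattern is left unexpanded; skip it
        [ -e "$f" ] || continue
        # Strip the directory and the .psd ending to name the output file
        name=$(basename "$f" .psd)
        cs "$query" "$f" -o "$out_dir/$name.out"
    done
}

# Example call, commented out because it needs Java and real corpus files:
# run_per_file query.q ~/Documents/Corpus ~/Documents/CorpusSearch/results
```

Running the query once with the * wildcard remains the simpler option when a single combined output file is what you need.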
A guide on how to define your query within the .q query file is provided on the Toolbox Anglistik IV of Mannheim University.