Workshop: Working with corpora (1 April 2022, University of Mannheim)

Held by Carola Trips and Michael Percillier

Programme

  • Morning (10:00–12:30):
    • Introduction to the Penn corpora (corpus design, website, CorpusSearch)
    • How to install and work with CorpusSearch?
    • New Middle English corpora
  • Lunch break (12:30–14:00)
  • Afternoon (14:00–16:00):
    • Alternative search method (presented by Tara Struik)
    • How to proceed with your data

Resources

Setting up CorpusSearch

For this workshop, we will be running CorpusSearch from the search platform on the Toolbox Anglistik IV of Mannheim University. The credentials will be provided during the workshop.

If you wish to run CorpusSearch on your own computer, you will need to set it up. The official installation instructions for CorpusSearch are available here. We provide a more detailed guide to get you started. The instructions are slightly different for Windows and Linux/macOS (from now on referred to jointly as Unix-like).

Windows

  1. Install Java from https://java.com. You can test whether the installation was successful by opening the program PowerShell and entering the command java -version
  2. Download the latest version of CorpusSearch and place the file CS_2.003.04.jar in a convenient location, e.g. C:\Program Files
  3. You can now run CorpusSearch with the following command in Windows Terminal or PowerShell (or cmd.exe on older versions of Windows): java -classpath "C:\Program Files\CS_2.003.04.jar" csearch/CorpusSearch
  4. To avoid having to type or copy-paste the lengthy command from point 3 over and over, you can define an alias with the following command in cmd.exe (doskey macros only take effect there, not in PowerShell): doskey cs=java -classpath "C:\Program Files\CS_2.003.04.jar" csearch/CorpusSearch $*. You can now use the command cs to run CorpusSearch. However, this alias is only active for the current session, so you will have to redefine it each time you open a new terminal. If you wish to have a permanent alias as described for Unix-like systems below, you may consider installing Cygwin or the Windows Subsystem for Linux.

Unix-like

  1. A suitable version of Java may already be installed on your system. Verify this by entering the command java -version in your Terminal. Should Java not be installed, install it from https://java.com
  2. Download the latest version of CorpusSearch and place the file CS_2.003.04.jar in a convenient location, e.g. ~/corpussearch
  3. You can now run CorpusSearch with the following command in the Terminal: java -classpath ~/corpussearch/CS_2.003.04.jar csearch/CorpusSearch
  4. To simplify running CorpusSearch, you can create an alias so that you only have to type in cs rather than the entire command listed in point 3. The exact instructions to do this will vary slightly depending on which Terminal shell you are using on your system. If you are unsure which shell your system uses, type either echo $0 or echo $SHELL in your Terminal.
  • bash (default on most Linux distributions and macOS prior to version 10.15 Catalina):
    • In your Terminal, move to your home directory with the command cd ~
    • List the contents of your home directory, including hidden files, with the command ls -la
    • Look for the file .bashrc (Linux) or .bash_profile (macOS); if it doesn’t exist, create it with touch .bashrc or touch .bash_profile
    • Open the file with a text editor, e.g. nano .bashrc or nano .bash_profile, and add the following line: alias cs='java -classpath $HOME/corpussearch/CS_2.003.04.jar csearch/CorpusSearch', then save and exit the editor. If you do not feel at ease using a terminal-based text editor, you can use the command open -e .bash_profile to open the file in TextEdit on macOS
    • Restart your Terminal: the command cs now launches CorpusSearch
  • zsh (default on macOS since version 10.15 Catalina):
    • In your Terminal, move to your home directory with the command cd ~
    • List the contents of your home directory, including hidden files, with the command ls -la
    • Look for the file .zshrc; if it doesn’t exist, create it with touch .zshrc
    • Open the file with a text editor, e.g. nano .zshrc, and add the following line: alias cs='java -classpath $HOME/corpussearch/CS_2.003.04.jar csearch/CorpusSearch', then save and exit the editor. If you do not feel at ease using a terminal-based text editor, you can use the command open -e .zshrc to open the file in TextEdit on macOS
    • Restart your Terminal: the command cs now launches CorpusSearch
  • fish:
    • Enter the following command: abbr -a cs java -classpath $HOME/corpussearch/CS_2.003.04.jar csearch/CorpusSearch
    • Restart your Terminal: the command cs is now an abbreviation for the longer command listed in point 3
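
For bash and zsh, the editing steps above can also be scripted. The following is only a minimal sketch, assuming the jar sits in ~/corpussearch as in point 2; the function name add_cs_alias is made up for this illustration and is not part of CorpusSearch. It appends the alias line to a startup file only if that exact line is not already there, so running it twice does no harm.

```shell
# Hypothetical helper (not part of CorpusSearch): append the cs alias
# to the given shell startup file, unless the exact line already exists.
add_cs_alias() {
    rc_file="$1"
    # \$HOME is escaped so that the literal text $HOME ends up in the file.
    alias_line="alias cs='java -classpath \$HOME/corpussearch/CS_2.003.04.jar csearch/CorpusSearch'"
    # Create the startup file if it does not exist yet.
    touch "$rc_file"
    # -F: fixed string, -x: match the whole line, -q: no output.
    if ! grep -Fxq "$alias_line" "$rc_file"; then
        printf '%s\n' "$alias_line" >> "$rc_file"
    fi
}
```

After defining the function, you would call it with the file that matches your shell, e.g. add_cs_alias ~/.bashrc or add_cs_alias ~/.zshrc, and then restart your Terminal as described above.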

Running CorpusSearch

Assuming that you have set up a cs alias, CorpusSearch can be run as follows: cs query.q corpus.psd, where query.q is your query file and corpus.psd is your corpus file. This assumes that both files are in your current working directory, e.g. C:\Documents\CorpusSearch (Windows) or ~/Documents/CorpusSearch (Unix-like). You should therefore switch to that directory beforehand with the cd command.
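
On Unix-like systems, a small check can catch a wrong working directory before CorpusSearch is started. This is a sketch only; check_inputs is a hypothetical helper name, not a CorpusSearch feature. It reports every named file that is missing and returns a non-zero status if any is.

```shell
# Hypothetical helper: verify that every file named on the command line
# exists, printing the missing ones to standard error.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    # 0 if all files were found, 1 otherwise.
    return "$missing"
}
```

A possible use in an interactive Terminal would be: cd ~/Documents/CorpusSearch && check_inputs query.q corpus.psd && cs query.q corpus.psd. The search then only starts if both files were found.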

By default, this creates an output file with the ending .out and the same name as the query file, so in our example query.out. To create a file with a different name, you can specify it with the -o flag as in cs query.q corpus.psd -o results.out.

As a corpus generally consists of multiple files, you may want to query them all at once rather than individually. This can be achieved with the * wildcard. Assuming that your corpus files are stored in the directory C:\Documents\Corpus (Windows) or ~/Documents/Corpus (Unix-like), you can query all the corpus files in the directory with cs query.q C:\Documents\Corpus\*.psd (Windows) or cs query.q ~/Documents/Corpus/*.psd (Unix-like).
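
If you have several query files, the same wildcard approach can be wrapped in a loop. The following bash sketch rests on a few assumptions: the jar is in ~/corpussearch as in the setup steps, and the names CS_JAR, run_cs and run_all_queries are invented for this example. It spells out the full java command rather than using the cs alias, because bash does not expand aliases in scripts by default.

```shell
# Assumed jar location, matching the setup steps; set CS_JAR to override it.
CS_JAR="${CS_JAR:-$HOME/corpussearch/CS_2.003.04.jar}"

# Hypothetical wrapper: the full command from point 3, without the alias.
run_cs() {
    java -classpath "$CS_JAR" csearch/CorpusSearch "$@"
}

# Hypothetical helper: run every .q file in query_dir against all
# .psd files in corpus_dir, one CorpusSearch call per query file.
run_all_queries() {
    query_dir="$1"
    corpus_dir="$2"
    for q in "$query_dir"/*.q; do
        # If no .q file matches, the pattern stays unexpanded; skip it.
        [ -f "$q" ] || continue
        run_cs "$q" "$corpus_dir"/*.psd
    done
}
```

With the directories used above, the call would be run_all_queries ~/Documents/CorpusSearch ~/Documents/Corpus. Since no output name is given, each query should produce its own default .out file as described earlier.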

A guide on how to define your query within the .q query file is provided on the Toolbox Anglistik IV of Mannheim University.