We are data mining a corpus of ancient texts to train machine learning classifiers that distinguish between different genres.
Replication code for Gianitsos et al., "Stylometric Classification of Ancient Greek Literary Texts by Genre," LaTeCH-CLfL 2019
Link to paper: https://www.aclweb.org/anthology/W19-2507/
Open the Terminal app
-
Check that you have Python 3.6 installed:
which python3.6
If it is installed, this command outputs a path, for example:
/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6
If nothing is output, download Python 3.6 here: https://www.python.org/downloads/release/python-368/
-
Ensure that you have the Xcode command-line tools installed on your Mac by running the following command. If the tools are already installed, it does nothing harmful. This step ensures that git and svn, which are necessary to run the code in this project, are installed:
xcode-select --install
-
Install pipenv (footnote 1). If it is already installed, this command does nothing harmful:
pip install pipenv
-
Clone this repository. Click the green 'clone' button on the right side of the GitHub webpage for this repo to copy the link, then run:
git clone <link you just copied>
-
Navigate inside the project folder:
cd <the project folder you just cloned>
-
Now that you are in the project directory, run the following command. This will generate a virtual environment called .venv in the current directory (footnote 2) that will contain the Python dependencies for this project:
PIPENV_VENV_IN_PROJECT=true pipenv install
-
Activate the virtual environment with the following command. After activation, running Python commands will ignore the system-level Python version and packages, and use only the packages from the virtual environment:
pipenv shell
Using exit will leave the virtual environment, i.e. it restores the system-level Python configuration to your shell. You can also simply close the terminal. Whenever you want to resume working on the project, run pipenv shell while in the project directory to activate the virtual environment again.
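The setup steps above can be sanity-checked from Python. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and the tool list is simply the set of commands named in the steps above. It looks up each tool on the PATH (as which does) and reports whether the current interpreter is running inside a virtual environment.

```python
import shutil
import sys

# Tools the setup steps rely on; this list is an assumption drawn from the steps above.
REQUIRED_TOOLS = ["python3.6", "git", "svn", "pipenv"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH (similar to `which`)."""
    return [tool for tool in tools if shutil.which(tool) is None]


def in_virtual_environment():
    """Return True when this interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer ones (and venv)
    # make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


def report():
    """Print a short summary of the checks above."""
    missing = find_missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
    print("Virtual environment active:", in_virtual_environment())
```

Calling report() after pipenv shell should show the virtual environment as active; calling it after exit should show it as inactive.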
Here are examples of commands you can run:
Run the demo (this does a feature extraction for a small sample of files, and analyzes the results in one step):
python demo.py
Extract features from all files:
python run_feature_extraction.py all_data.pickle
Extract features from only drama and epic files:
python run_feature_extraction.py drama_epic_data.pickle drama epic
Run all model analyzer functions on the data from all files to classify prose from verse:
python run_ml_analyzers.py all_data.pickle labels/prosody_labels.csv all
Run all model analyzer functions on the data from only drama and epic files to classify drama from epic:
python run_ml_analyzers.py drama_epic_data.pickle labels/genre_labels.csv all
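The two-stage workflow above (extract features, then analyze) can also be driven from a small Python script. This is a rough sketch under stated assumptions: the helper names are hypothetical and not part of the repository, the argument order is copied from the example commands above, and the final "all" argument is assumed to select all model analyzer functions, as those examples suggest.

```python
import subprocess


def extraction_command(output_file, genres=()):
    """Build the feature-extraction command shown above as an argument list."""
    # With no genres given, the command matches the "all files" example.
    return ["python", "run_feature_extraction.py", output_file, *genres]


def analyzer_command(data_file, labels_file, analyzer="all"):
    """Build the model-analyzer command shown above as an argument list."""
    return ["python", "run_ml_analyzers.py", data_file, labels_file, analyzer]


def run_pipeline(data_file, labels_file, genres=()):
    """Run extraction, then analysis; stop as soon as one step fails."""
    commands = (
        extraction_command(data_file, genres),
        analyzer_command(data_file, labels_file),
    )
    for command in commands:
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(command, check=True)
```

For example, run_pipeline("drama_epic_data.pickle", "labels/genre_labels.csv", ["drama", "epic"]) would issue the drama/epic extraction and analysis commands in order. It is meant to be called from the project directory with the virtual environment active.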
1) The pipenv tool works by making a project-specific directory, called a virtual environment, that holds the dependencies for that project. After a virtual environment is activated, newly installed dependencies automatically go into the virtual environment instead of being placed among your system-level Python packages. This prevents different projects on the same machine from having dependencies that conflict with one another.
2) Setting the PIPENV_VENV_IN_PROJECT variable to true tells pipenv to create the virtual environment within the same directory as the project, so that all the files belonging to a project are in the same place. This is not the default behavior (e.g. on Mac, environments are normally placed in ~/.local/share/virtualenvs/).
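To confirm that the environment was created inside the project as footnote 2 describes, one can look for the .venv directory. The helper below is a hypothetical sketch, not part of the repository; it only checks that the directory exists and does not inspect its contents.

```python
from pathlib import Path


def project_venv_path(project_dir="."):
    """Return the in-project .venv directory as a Path, or None if it is absent."""
    candidate = Path(project_dir) / ".venv"
    return candidate if candidate.is_dir() else None
```

If this returns None after running pipenv install, the environment was probably created in pipenv's default location instead.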