This repository contains various cheminformatic tools.
Running this code requires an Anaconda distribution for Python. This can be downloaded from the Anaconda website. You will need to add Anaconda to your path variable. Here is a good tutorial about installing Anaconda and adding it to your path.
If you have git installed, you may clone this repository by the following command:
git clone https://github.com/russodanielp/cheminf_tools.git
Alternatively, you can click the 'Clone or download' button (upper right corner) to download these files as zip.
To get started, you must first create the appropriate Python environment which is required for this project. If you have conda installed from the command line, this is a very simple process. You can check by running:
conda
If this does not work, refer to the link about about installing Anaconda.
To create this Python environment, simply change directories to the cloned repository (or the downloaded folder containing this project) and run the following command:
conda env create -f environment.yml
If this does not work you can create the correct environment by running:
conda create -n cheminformatics -c rdkit rdkit scikit-learn
After the necessary dependecies are installed, you must activate this environment before running these scripts. To do that, run the following command.
On windows:
activate cheminformatics
On Linux/Mac:
source activate cheminformatics
IMPORTANT: This will need to be done before running any of these Python scripts.
This script builds a Random Forest model from a set of training compounds contained in .sdf
file format. It is assumed
that in the sdf there are at least two fields/columns/properties: (1) Biological activity & (2) compound names.
It is assumed that the biological activity is already binarized, that is, 1 represents an 'active' response and 0 represents an 'inactive' response.
The script accepts three parameters:
-ds
-- The name of the training set.sdf
file.-nc
-- The name of the field/column/property that contains the compound names.-ac
-- The name of the field/column/property that contains the biological activities.
An example code can be run like this:
python build_random_forest.py -ds data/CERAPP_agonist_training.sdf -nc Name -ac Class
After running the following files will be generated:
Random_forest_5fcv_results.csv
--.csv
file containing five-fold cross validation scores.Random_forest_5fcv_predictions.csv
--.csv
file containing five-fold cross validation predictions.Random_Forest_Model.pkl
-- The saved model required for making predictions usingpredict_random_forest.py
.
This script used the Random Forest model to predict a set of test compounds in .sdf
file format. It is assumed
that in the sdf there are at least two fields/columns/properties: (1) Biological activity & (2) compound names. The
biological activity will be used to generate statistics.
The script accepts three parameters:
-ts
-- The name of the test set.sdf
file.-nc
-- The name of the field/column/property that contains the compound names.-ac
-- The name of the field/column/property that contains the biological activities.
An example code can be run like this:
python predict_random_forest.py -ds data/CERAPP_agonist_test.sdf -nc Name -ac Class
After running the following files will be generated:
Random_forest_test_set_results.csv
--.csv
file containing test set scores.Random_forest_test_set_predictions.csv
--.csv
file containing test set predictions.
CONTACT: danrusso@rutgers.edu