Move in projects (#40)
* change callbacks

* pbar

* T0 works with torch==2.0.1 and pyl==2.0.4

* accum inference outputs

* update config for finetuning and fix progress

* fix inference in t0

* update reqs

* move to projects/mhr

* create mhr config

* remove old import

* update t0-3b args

* added scripts

* added tests

* fix value input for Z tuning

* update setup instructions

* Loosen python dependencies.

* Fix setup and tests.

* Increase test verbosity. Removed dependabot pipeline.

* Bump pytorch-lightning from 2.0.4 to 2.0.5

* Fix T0 dataset creation script

* use /tmp for the datasets preparation

* move dataset scripts inside projects/mhr

* move some files around

* temporarily move _set_defaults back to mttl/config

* remove output folder and ignore it anywhere in the repo.

* removed bb

* fix missing files in the setup bundle

* move finetune scripts to scripts/finetune

* remove pl_zeroshot

* Review scripts to use same envvar. Add instructions to readme. Save processed data inside the projects/mhr folder.

* Removed hardcoded train_dir

* add env var to load storycloze dataset

* add STORYCLOZE_DIR to readme

---------

Co-authored-by: Alessandro Sordoni <alessandro.sordoni@gmail.com>
Co-authored-by: Lucas Caccia <lucas.page-caccia@mail.mcgill.ca>
Co-authored-by: matheper <matpereira@microsoft.com>
4 people authored Aug 1, 2023
1 parent f84f266 commit ce4ca51
Showing 71 changed files with 291 additions and 299 deletions.
11 changes: 0 additions & 11 deletions .github/dependabot.yml

This file was deleted.

2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -37,4 +37,4 @@ jobs:
# flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
pytest
pytest -vv
6 changes: 6 additions & 0 deletions .gitignore
@@ -4,6 +4,12 @@ amulet_*/
wandb
.amltconfig
.amltignore
cache/
output/
**/output
data/
**/data
.vscode/

# Byte-compiled / optimized / DLL files
__pycache__/
68 changes: 58 additions & 10 deletions README.md
@@ -4,33 +4,81 @@ MTTL - Multi-Task Transfer Learning

## Setup

Install Python packages:
MTTL supports `Python 3.8` and `Python 3.9`. It is recommended to create a virtual environment for MTTL using `virtualenv` or `conda`. For example, with `conda`:

`pip install -r requirements.txt`
    conda create -n mttl python=3.9
    conda activate mttl

_The package `promptsource` currently requires Python 3.7. Alternative versions require local installations (see their [documentation](https://github.com/bigscience-workshop/promptsource#setup))._
Install the required Python packages:

Download the datasets:
    pip install -e .

`bash scripts/create_datasets.sh`

## Multi-task Pre-training

The general command:
## Multi-Head Adapter Routing

`python pl_train.py -c $CONFIG_FILES -k $KWARGS`
Please ensure that you have navigated to the `projects/mhr` directory before running the Multi-Head Adapter Routing scripts:

    cd projects/mhr


### Data Preparation

Download and prepare the datasets for the experiments using the following script:

    bash datasets/create_datasets.sh


### Environment Variables

Based on your experiments, you may need to export one or more of the following environment variables:

T0_DATA_DIR: `data/t0_data/processed` if you ran the `create_datasets.sh`
NI_DATA_DIR: `data/ni_data/processed` if you ran the `create_datasets.sh`
XFIT_DATA_DIR: `data/ni_data/processed` if you ran the `create_datasets.sh`
CHECKPOINT_DIR
OUTPUT_DIR
CACHE_DIR
STORYCLOZE_DIR: path to your downloaded `.csv` files. See [the storycloze official website](https://cs.rochester.edu/nlp/rocstories/)
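
A minimal sketch of how these variables could be resolved from Python, assuming the same `os.getenv`-style fallback that `mttl/config.py` applies to `CACHE_DIR`, `TRAIN_DIR`, and `OUTPUT_DIR` in this commit. The helper name `resolve_dirs` and the use of `None` for unset data directories are illustrative assumptions, not part of the repository.

```python
import os


def resolve_dirs(environ=None):
    """Return directory settings, falling back to defaults when a variable is unset.

    `resolve_dirs` is a hypothetical helper written for this note.
    """
    env = os.environ if environ is None else environ
    return {
        # These three defaults mirror the values set in mttl/config.py.
        "cache_dir": env.get("CACHE_DIR", "./cache"),
        "train_dir": env.get("TRAIN_DIR", "/tmp/"),
        "output_dir": env.get("OUTPUT_DIR", "./output"),
        # No default is documented for these, so None marks "not set" (assumption).
        "t0_data_dir": env.get("T0_DATA_DIR"),
        "ni_data_dir": env.get("NI_DATA_DIR"),
        "storycloze_dir": env.get("STORYCLOZE_DIR"),
    }


if __name__ == "__main__":
    for name, value in resolve_dirs().items():
        print(f"{name} = {value}")
```

Passing an explicit mapping instead of relying on `os.environ` keeps the lookup easy to check in isolation.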


### Multi-task Pre-training

The general command for pre-training a model is:

    python pl_train.py -c $CONFIG_FILES -k $KWARGS

Multiple `CONFIG_FILES` can be concatenated as `file1+file2`. To modify defaults, `KWARGS` can be expressed as `key=value`.
You can check [scripts/pretrain](scripts/pretrain) for examples.
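
The `file1+file2` and `key=value` conventions can be illustrated with a standalone sketch that follows the argument handling visible in `Config.parse` and `Config.update_kwargs` in this commit. The function name `parse_cli` and the file names in the usage note are hypothetical, and the real code applies the values to a `Config` object rather than returning them.

```python
import argparse
import ast
import itertools


def parse_cli(argv):
    """Split `-c a.json+b.json -k key=value ...` into config files and overrides.

    Hypothetical helper; it only reproduces the parsing step, not Config itself.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--config_files", required=False)
    # Each -k occurrence collects a list, so several -k flags yield a list of lists.
    parser.add_argument("-k", "--kwargs", nargs="*", action="append")
    args = parser.parse_args(argv)

    # Config files are joined with "+" on the command line.
    files = args.config_files.split("+") if args.config_files else []

    overrides = {}
    for item in itertools.chain(*(args.kwargs or [])):
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, raw = item.partition("=")
        try:
            # Turn "1e-3", "True", "[1, 2]" into Python literals.
            overrides[key] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything that is not a literal stays a plain string.
            overrides[key] = raw
    return files, overrides
```

For instance, `parse_cli(["-c", "a.json+b.json", "-k", "learning_rate=1e-3", "dataset=t0"])` returns the two file names plus a float and a string override.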

## Test Fine-Tuning
### Test Fine-Tuning

To fine-tune on a test task, use the script `pl_finetune.py`.

## Hyper-parameter Search for Test Fine-Tuning
### Hyper-parameter Search for Test Fine-Tuning

To perform a hyperparameter search for a test task, use the script `pl_finetune_tune.py`.
The script calls the functions in `pl_finetune.py` in a loop and defines the hyperparameter ranges for the different fine-tuning types.


### Pre-Configured Scripts

Alternatively, you can run the pre-configured scripts from the `scripts` folder. For example:

    bash scripts/mhr_pretrain.sh

### Known Issues
If you run into issues with protoc `TypeError: Descriptors cannot not be created directly.`, you can try to downgrade protobuf to 3.20.*:

    pip install protobuf==3.20.*


## Running Tests

    pip install -e ".[test]"
    pytest -vv tests


## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
3 changes: 0 additions & 3 deletions configs/t0/3b.json

This file was deleted.

200 changes: 102 additions & 98 deletions mttl/config.py
@@ -6,16 +6,113 @@


class Config:

    def __init__(self, filenames=None, kwargs=None, raise_error=True):
        # Stores personalization of the config file in a dict (json serializable)
        self._updated_kwargs = {}
        self.filenames = filenames
        self._set_defaults()

        if filenames:
            for filename in filenames.split("+"):
                if not os.path.exists(filename):
                    filename = os.path.join(os.getenv("CONFIG_PATH", default="configs"), filename)

                self.update_kwargs(json.load(open(filename)), eval=False, raise_error=raise_error)

        if kwargs:
            self.update_kwargs(kwargs, raise_error=raise_error)

        self.save_config(self.output_dir)

    def was_overridden(self, key):
        return key in self._updated_kwargs

    def was_default(self, key):
        return key not in self._updated_kwargs

    def update_kwargs(self, kwargs, eval=True, raise_error=True):
        for (k, v) in kwargs.items():
            if eval:
                try:
                    v = ast.literal_eval(v)
                except (ValueError, SyntaxError):
                    v = v
            else:
                v = v
            if not hasattr(self, k) and raise_error:
                raise ValueError(f"{k} is not in the config")

            if eval:
                print("Overwriting {} to {}".format(k, v))

            if k == 'finegrained':
                k = 'poly_granularity'
                v = 'finegrained' if v else 'coarsegrained'
            elif k in ['train_dir', 'output_dir']:
                # this raises an error if the env. var does not exist
                v = Template(v).substitute(os.environ)

            setattr(self, k, v)
            self._updated_kwargs[k] = v

    def __getitem__(self, item):
        return getattr(self, item, None)

    def to_json(self):
        """
        Converts parameter values in config to json
        :return: json
        """
        import copy

        to_save = copy.deepcopy(self.__dict__)
        to_save.pop("_updated_kwargs")

        return json.dumps(to_save, indent=4, sort_keys=False)

    def save_config(self, output_dir):
        """
        Saves the config
        """
        os.makedirs(output_dir, exist_ok=True)

        with open(os.path.join(output_dir, "config.json"), "w+") as fout:
            fout.write(self.to_json())
            fout.write("\n")

    @classmethod
    def parse(cls, extra_kwargs=None, raise_error=True):
        import itertools

        parser = argparse.ArgumentParser()
        parser.add_argument("-c", "--config_files", required=False)
        parser.add_argument("-k", "--kwargs", nargs="*", action='append')
        args = parser.parse_args()

        kwargs = {}
        if args.kwargs:
            kwargs_opts = list(itertools.chain(*args.kwargs))
            for value in kwargs_opts:
                key, _, value = value.partition('=')
                kwargs[key] = value
        args.kwargs = kwargs
        if extra_kwargs:
            args.kwargs.update(extra_kwargs)

        config = cls(args.config_files, args.kwargs, raise_error=raise_error)

        print(config.to_json())
        return config

    def _set_defaults(self):
        self.cache_dir = os.getenv("CACHE_DIR", "./cache")
        self.free_up_space = False
        # Data config
        self.dataset = None
        self.custom_tasks_splits = None
        self.train_dir = os.getenv("AMLT_DATA_DIR", "/tmp/")
        self.output_dir = os.getenv("AMLT_OUTPUT_DIR", "./output")
        self.train_dir = os.getenv("TRAIN_DIR", "/tmp/")
        self.output_dir = os.getenv("OUTPUT_DIR", "./output")
        self.finetune_task_name = None
        self.example_to_ids_path = None  # path to clustering of data
        self.embeddings_path = None
@@ -103,12 +200,12 @@ def __init__(self, filenames=None, kwargs=None, raise_error=True):
        self.poly_use_shared_skill = False  # use one skill shared by all tasks

        """
        poly_granularity : how granular is the module selection :
        poly_granularity : how granular is the module selection :
        coarsegrained : 1 single selector across all linear layers
        coderwise : 2 selectors (1 for encoder, 1 for decoder)
        blockwise : 1 selector for each block of K attention layers (and layernorm)
        layerwise : 1 selector for each attention layer (and layernorm)
        finegrained : 1 selector for every linear layer
        layerwise : 1 selector for each attention layer (and layernorm)
        finegrained : 1 selector for every linear layer
        """
        self.poly_granularity = 'finegrained'

@@ -119,75 +216,6 @@ def __init__(self, filenames=None, kwargs=None, raise_error=True):
        self.adapters_weight_decay = None
        self.module_logits_dropout = 0.
        self.module_logits_l2_norm = False
        self.filenames = filenames

        if filenames:
            for filename in filenames.split("+"):
                if not os.path.exists(filename):
                    filename = os.path.join(os.getenv("CONFIG_PATH", default="configs"), filename)

                self.update_kwargs(json.load(open(filename)), eval=False, raise_error=raise_error)

        if kwargs:
            self.update_kwargs(kwargs, raise_error=raise_error)

        self.save_config(self.output_dir)

    def was_overridden(self, key):
        return key in self._updated_kwargs

    def was_default(self, key):
        return key not in self._updated_kwargs

    def update_kwargs(self, kwargs, eval=True, raise_error=True):
        for (k, v) in kwargs.items():
            if eval:
                try:
                    v = ast.literal_eval(v)
                except (ValueError, SyntaxError):
                    v = v
            else:
                v = v
            if not hasattr(self, k) and raise_error:
                raise ValueError(f"{k} is not in the config")

            if eval:
                print("Overwriting {} to {}".format(k, v))

            if k == 'finegrained':
                k = 'poly_granularity'
                v = 'finegrained' if v else 'coarsegrained'
            elif k in ['train_dir', 'output_dir']:
                # this raises an error if the env. var does not exist
                v = Template(v).substitute(os.environ)

            setattr(self, k, v)
            self._updated_kwargs[k] = v

    def __getitem__(self, item):
        return getattr(self, item, None)

    def to_json(self):
        """
        Converts parameter values in config to json
        :return: json
        """
        import copy

        to_save = copy.deepcopy(self.__dict__)
        to_save.pop("_updated_kwargs")

        return json.dumps(to_save, indent=4, sort_keys=False)

    def save_config(self, output_dir):
        """
        Saves the config
        """
        os.makedirs(output_dir, exist_ok=True)

        with open(os.path.join(output_dir, "config.json"), "w+") as fout:
            fout.write(self.to_json())
            fout.write("\n")


class ParseKwargs(argparse.Action):
@@ -196,27 +224,3 @@ def __call__(self, parser, namespace, values, option_string=None):
        for value in values:
            key, value = value.split('=')
            getattr(namespace, self.dest)[key] = value
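
Note that `ParseKwargs` splits each item with `split('=')`, while `Config.parse` uses `partition('=')`. The sketch below contrasts the two on a value that itself contains `=`. The helper names `split_strict` and `split_lenient` are hypothetical and exist only for this comparison.

```python
def split_strict(item):
    """Mimic ParseKwargs: unpacking fails unless there is exactly one "="."""
    key, value = item.split("=")
    return key, value


def split_lenient(item):
    """Mimic Config.parse: split on the first "=" and keep the rest as the value."""
    key, _, value = item.partition("=")
    return key, value


if __name__ == "__main__":
    item = "prompt=a=b"
    print(split_lenient(item))
    try:
        split_strict(item)
    except ValueError as err:
        print("split_strict failed:", err)
```

If values containing `=` are expected on the command line, the `partition` form is the more forgiving choice.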


def parse_config(extra_kwargs=None, raise_error=True):
    import itertools

    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--config_files", required=False)
    parser.add_argument("-k", "--kwargs", nargs="*", action='append')
    args = parser.parse_args()

    kwargs = {}
    if args.kwargs:
        kwargs_opts = list(itertools.chain(*args.kwargs))
        for value in kwargs_opts:
            key, _, value = value.partition('=')
            kwargs[key] = value
    args.kwargs = kwargs
    if extra_kwargs:
        args.kwargs.update(extra_kwargs)

    config = Config(args.config_files, args.kwargs, raise_error=raise_error)

    print(config.to_json())
    return config
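
For reference, the `train_dir` / `output_dir` branch of `update_kwargs` relies on `string.Template.substitute`, which raises `KeyError` when a referenced variable is missing. A minimal sketch of that behavior follows, assuming a plain dict in place of `os.environ`. The helper name `expand_env` and the example path are hypothetical.

```python
from string import Template


def expand_env(value, environ):
    """Expand $VAR / ${VAR} placeholders; raise KeyError if a variable is missing."""
    return Template(value).substitute(environ)


if __name__ == "__main__":
    env = {"OUTPUT_DIR": "/tmp/out"}  # stand-in for os.environ
    print(expand_env("$OUTPUT_DIR/run1", env))
    try:
        expand_env("$TRAIN_DIR/t0", env)
    except KeyError as err:
        print("missing variable:", err)
```

Failing loudly here means a mistyped variable name surfaces at config time rather than as a wrong path later.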