Update the documentation on the definition of idle GPUs.
HAL-42 committed Jul 29, 2024
1 parent 9fe5796 commit d215a4a
Showing 3 changed files with 22 additions and 31 deletions.
25 changes: 10 additions & 15 deletions README.md
@@ -818,7 +818,7 @@ parser = argparse.ArgumentParser(description='Tuning AlchemyCat MNIST Example')
parser.add_argument('-c', '--cfg2tune', type=str)
args = parser.parse_args()

# Set `pool_size` to GPU num, will run `pool_size` of configs in parallel
# Will run `torch.cuda.device_count() // work_gpu_num` of configs in parallel
runner = Cfg2TuneRunner(args.cfg2tune, experiment_root='/tmp/experiment', work_gpu_num=1)

@runner.register_work_fn # How to run config
@@ -1021,22 +1021,17 @@ For `config C + algorithm code A ——> reproducible experiment E(C, A)`, meani
We also provide a [script](alchemy_cat/torch_tools/scripts/tag_exps.py): running `python -m alchemy_cat.torch_tools.scripts.tag_exps -s commit_ID -a commit_ID` interactively lists the configs newly added by the commit and tags the commit according to the config paths. This helps quickly trace back the config and algorithm of a historical experiment.

### Allocate GPU for Child Processes Manually
The `work` function of `Cfg2TuneRunner` sometimes needs to allocate GPUs for subprocesses. Besides using the `cuda_env` parameter, you can use `allocate_cuda_by_group_rank` to manually assign idle GPUs based on `pkl_idx`:
The `work` function receives the idle GPUs automatically allocated by `Cfg2TuneRunner` through the `cuda_env` parameter. The definition of an 'idle GPU' can be further controlled:
```python
from alchemy_cat.cuda_tools import allocate_cuda_by_group_rank

# ... Code before

@runner.register_work_fn # How to run config
def work(pkl_idx: int, cfg: Config, cfg_pkl: str, cfg_rslt_dir: str, cuda_env: dict[str, str]) -> ...:
current_cudas, env_with_current_cuda = allocate_cuda_by_group_rank(group_rank=pkl_idx, group_cuda_num=2, block=True, verbosity=True)
subprocess.run([sys.executable, 'train.py', '-c', cfg_pkl], env=env_with_current_cuda)

# ... Code after
runner = Cfg2TuneRunner(args.cfg2tune, experiment_root='/tmp/experiment', work_gpu_num=1,
block=True, # Try to allocate idle GPU
memory_need=10 * 1024, # Need 10 GB memory
                       max_process=2)         # At most 2 processes already running on each GPU
```
`group_rank` is commonly `pkl_idx`, and `group_cuda_num` is the number of GPUs the task needs. If `block` is `True`, the call waits when the allocated GPUs are occupied. If `verbosity` is `True`, it prints the blocking status.

The return value `current_cudas` is a list containing the allocated GPU numbers. `env_with_current_cuda` is an environment variable dictionary with `CUDA_VISIBLE_DEVICES` set, which can be passed directly to the `env` parameter of `subprocess.run`.
where:
- `block`: Default is `True`. If set to `False`, GPUs are allocated sequentially, regardless of whether they are idle.
- `memory_need`: GPU memory required by each sub-config, in MB. An idle GPU must have free memory ≥ `memory_need`. Default is `-1.`, meaning all of the GPU's memory must be free.
- `max_process`: Maximum number of existing processes. An idle GPU must have ≤ `max_process` existing processes. Default is `-1`, meaning no limit.
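These rules can be paraphrased as a filter over per-GPU statistics. The sketch below illustrates the semantics only; it is not the library's implementation, and the `filter_idle_gpus` helper and the stats dict are made up for illustration:

```python
def filter_idle_gpus(gpu_stats: dict[int, tuple[int, int, int]],
                     memory_need: float = -1., max_process: int = -1) -> list[int]:
    """Return the ids of GPUs that count as idle.

    gpu_stats maps gpu_id -> (free_mb, total_mb, process_count).
    """
    idle = []
    for gpu_id, (free_mb, total_mb, proc_count) in gpu_stats.items():
        # memory_need == -1. means the GPU's entire memory must be free.
        need = total_mb if memory_need == -1. else memory_need
        if free_mb >= need and (max_process == -1 or proc_count <= max_process):
            idle.append(gpu_id)
    return idle

# Hypothetical stats: GPU 0 is fully free, GPU 1 lacks free memory, GPU 2 qualifies.
stats = {0: (24_000, 24_000, 0), 1: (8_000, 24_000, 3), 2: (12_000, 24_000, 1)}
print(filter_idle_gpus(stats, memory_need=10 * 1024, max_process=2))  # -> [0, 2]
print(filter_idle_gpus(stats))  # defaults: only a fully free GPU qualifies -> [0]
```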

### Pickling Lambda Functions
Sub-configs generated by `Cfg2Tune` are saved with pickle. However, if `Cfg2Tune` defines dependencies like `DEP(lambda c: ...)`, these lambda functions cannot be pickled. Workarounds include:
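For context, the constraint is easy to reproduce: pickle serializes a function by its qualified name, and a lambda has no importable name. A minimal sketch (the `dep` dependency here is a made-up stand-in):

```python
import pickle

# A dependency defined as an anonymous function, as in DEP(lambda c: ...).
dep = lambda c: c['lr'] * 10

try:
    pickle.dumps(dep)
    picklable = True
except Exception:  # pickle raises PicklingError: no resolvable name for <lambda>
    picklable = False

print(picklable)  # -> False
```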
Expand Down
26 changes: 11 additions & 15 deletions README_CN.md
@@ -817,7 +817,7 @@ parser = argparse.ArgumentParser(description='Tuning AlchemyCat MNIST Example')
parser.add_argument('-c', '--cfg2tune', type=str)
args = parser.parse_args()

# Set `pool_size` to GPU num, will run `pool_size` of configs in parallel
# Will run `torch.cuda.device_count() // work_gpu_num` of configs in parallel
runner = Cfg2TuneRunner(args.cfg2tune, experiment_root='/tmp/experiment', work_gpu_num=1)

@runner.register_work_fn # How to run config
@@ -1019,23 +1019,19 @@ cfg.sched.epochs = 15

We also provide a [script](alchemy_cat/torch_tools/scripts/tag_exps.py): running `python -m alchemy_cat.torch_tools.scripts.tag_exps -s commit_ID -a commit_ID` interactively lists the configs newly added by the commit and tags the commit according to the config paths. This helps quickly trace back the config and algorithm of a historical experiment.

### Allocate GPUs for Subtasks Manually
The `work` function of `Cfg2TuneRunner` sometimes needs to allocate GPUs for subprocesses. Besides using the `cuda_env` parameter, you can use `allocate_cuda_by_group_rank` to manually assign idle GPUs based on `pkl_idx`:
### Automatically Allocate Idle GPUs
The `work` function receives the idle GPUs automatically allocated by `Cfg2TuneRunner` through the `cuda_env` parameter. The definition of an 'idle GPU' can be further controlled:
```python
from alchemy_cat.cuda_tools import allocate_cuda_by_group_rank

# ... Code before

@runner.register_work_fn # How to run config
def work(pkl_idx: int, cfg: Config, cfg_pkl: str, cfg_rslt_dir: str, cuda_env: dict[str, str]) -> ...:
current_cudas, env_with_current_cuda = allocate_cuda_by_group_rank(group_rank=pkl_idx, group_cuda_num=2, block=True, verbosity=True)
subprocess.run([sys.executable, 'train.py', '-c', cfg_pkl], env=env_with_current_cuda)

# ... Code after
runner = Cfg2TuneRunner(args.cfg2tune, experiment_root='/tmp/experiment', work_gpu_num=1,
block=True, # Try to allocate idle GPU
memory_need=10 * 1024, # Need 10 GB memory
                       max_process=2)         # At most 2 processes already running on each GPU
```
`group_rank` is commonly `pkl_idx`, and `group_cuda_num` is the number of GPUs the task needs. If `block` is `True`, the call blocks until the allocated GPUs become idle when they are occupied. If `verbosity` is `True`, it prints the blocking status.
where:
* `block`: Default is `True`. If set to `False`, GPUs are allocated sequentially, regardless of whether they are idle.
* `memory_need`: GPU memory required by each sub-config, in MB. An idle GPU must have free memory ≥ `memory_need`. Default is `-1.`, meaning all of the GPU's memory must be exclusively available.
* `max_process`: Maximum number of existing processes. An idle GPU must have ≤ `max_process` existing processes. Default is `-1`, meaning no limit.

The return value `current_cudas` is a list of the allocated GPU ids. `env_with_current_cuda` is an environment-variable dict with `CUDA_VISIBLE_DEVICES` set, which can be passed directly to the `env` parameter of `subprocess.run`.
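As a minimal, self-contained illustration of that last point (the GPU ids here are hypothetical; in practice they come from the allocator), such a dict can be forwarded to `subprocess.run`:

```python
import os
import subprocess
import sys

# Hypothetical allocation result; normally returned by the allocator.
current_cudas = [0, 1]
env_with_current_cuda = {**os.environ,
                         "CUDA_VISIBLE_DEVICES": ",".join(str(c) for c in current_cudas)}

# The child process only sees the GPUs listed in CUDA_VISIBLE_DEVICES.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env_with_current_cuda, capture_output=True, text=True)
print(result.stdout.strip())  # -> 0,1
```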

### Lambda Functions Cannot Be Pickled
Sub-configs generated by `Cfg2Tune` are saved with pickle. However, if `Cfg2Tune` defines dependencies like `DEP(lambda c: ...)`, the stored lambda functions cannot be pickled. Workarounds include:
Expand Down
2 changes: 1 addition & 1 deletion alchemy_cat/dl_config/examples/tune_train.py
@@ -6,7 +6,7 @@
parser.add_argument('-c', '--cfg2tune', type=str)
args = parser.parse_args()

# Set `pool_size` to GPU num, will run `pool_size` of configs in parallel
# Will run `torch.cuda.device_count() // work_gpu_num` of configs in parallel
runner = Cfg2TuneRunner(args.cfg2tune, experiment_root='/tmp/experiment', work_gpu_num=1)

@runner.register_work_fn # How to run config
Expand Down
