Merge pull request #1 from rdnfn/dev/zwei
Dev/zwei
rdnfn authored Jan 21, 2025
2 parents 9a10210 + 3878bc3 commit 0de87dd
Showing 112 changed files with 49,459 additions and 399 deletions.
26 changes: 26 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,26 @@
name: Lint

on:
  pull_request:
    branches:
      - main

jobs:
  lint:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e ".[dev]"
      - name: Check formatting with Black
        run: black --check --version && black --check .
26 changes: 26 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,26 @@
name: Tests

on:
  pull_request:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e ".[dev]"
      - name: Run tests
        run: pytest
99 changes: 61 additions & 38 deletions README.md
@@ -1,58 +1,39 @@

# Inverse Constitutional AI

Repository containing code for the paper "Inverse Constitutional AI: Compressing Preferences into Principles" ([pdf](https://arxiv.org/abs/2406.06560)). The figure below provides an overview of the *Inverse Constitutional AI* (ICAI) problem we introduce: starting from a set of pairwise preference feedback, we derive a set of natural language principles (a *constitution*) that explain the preference data.
For validation, we reconstruct the original preferences with an LLM judging according to the generated constitution. The constitution represents a (highly compact) compression of the preferences.

<p align="center">
<img src="./docs/img/02_complete_overview.png" width="1000px" align="center">
<img src="./docs/img/00_logo_v0.png" width="110px" align="center">
</p>

# Motivation

Feedback data plays an important role in fine-tuning and evaluating state-of-the-art AI models. Pairwise text preferences are often used: given two texts, human (or AI) annotators select the “better” one. Such feedback data is widely used to align models to human preferences (e.g., reinforcement learning from human feedback) or to rank models according to human preferences (e.g., Chatbot Arena). Despite its widespread use, prior work has demonstrated that human-annotated pairwise text preference data often exhibits unintended biases. For example, human annotators have been shown to prefer assertive over truthful texts in certain contexts. Models trained or evaluated on this data may implicitly encode these biases in ways that are hard to identify. To better understand existing pairwise text preference data, we formulate its interpretation as a compression task: the *Inverse Constitutional AI* problem. Read the [full paper](https://arxiv.org/abs/2406.06560) for more background.

# Algorithm
# Inverse Constitutional AI

We introduce a first *Inverse Constitutional AI* (ICAI) algorithm that generates a set of principles based on a feedback dataset. See the figure below for an overview of the algorithm. Given a dataset of pairwise rankings, in Step 1, candidate principles are generated using an LLM. In Step 2, these principles are clustered using an embedding model. In Step 3, similar principles are “de-duplicated” by sampling one principle per cluster. In Step 4, each principle is tested to evaluate its ability to help an LLM reconstruct the original annotations. Finally, in Step 5, the principles are filtered according to the testing results, and the filtered set is returned as the final constitution. Optionally, this last step is augmented with additional clustering and subsampling steps to ensure diverse principles. The implementation is provided in this repository.
This repository contains the official implementation of the *Inverse Constitutional AI* (ICAI) algorithm [[paper]](https://arxiv.org/abs/2406.06560). ICAI compresses pairwise preference datasets into a readable list of principles (a *constitution*) that the annotations appear to follow (e.g., "select the friendlier response"). ICAI principles provide an interpretable overview of a feedback dataset, enabling users to discover *problematic annotation biases* or to *better understand differences between datasets, user groups or models*.

<p align="center">
<img src="./docs/img/03_algorithm.png" width="800px" align="center">
<img src="./docs/img/01_basic_overview_v2.png" width="750px" align="center">
</p>


## Installation

1. *Pip install the package*

- **Non-contributors**
```
pip install git+https://github.com/rdnfn/icai.git
```
- **Contributors:** clone the repo locally, e.g.
```
git clone git@github.com:rdnfn/icai.git
```
Then (inside the repo folder) install the package in editable mode:
```
pip install -e .
```
2. *Set up API secrets:* inside the main directory of the cloned repo (or wherever else you prefer), set up a `secrets.toml` file like the one below. You only need to include keys for the APIs you want to use.
1. Pip install the package (for a development installation, see [here](#dev-installation))
```
pip install git+https://github.com/rdnfn/icai.git
```
2. Set up API secrets: inside the main directory of the cloned repo (or wherever else you prefer), set up a `secrets.toml` file like the one below. You only need to include keys for the APIs you want to use.
```toml
OPENAI_API_KEY="<YOURKEY>"
ANTHROPIC_API_KEY="<YOURKEY>"
```
3. *Download data*, or use your own feedback data. You can use the data notebook (see Quickstart) to download the supported data sources. Currently supported data sources:
- https://github.com/anthropics/hh-rlhf
- https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
## Quickstart
Given a feedback dataset (use the [data notebook](https://github.com/rdnfn/icai/blob/main/notebooks/01_data_prepocessing.ipynb) to download one), you can run your first Inverse Constitutional AI (ICAI) experiment using the `icai-exp` command:
You can run your first Inverse Constitutional AI (ICAI) experiment using the `icai-exp` command:
```
icai-exp secrets_path="./secrets.toml" data_path="data/processed/example/example.csv"
icai-exp data_path="data/processed/example/example.csv"
```
This will run the ICAI algorithm on the toy `example.csv` pairwise feedback dataset and generate a constitution for this dataset.
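For an optional sanity check of the input format, the toy dataset can be inspected with pandas. This is only an illustration (not part of the `icai-exp` workflow) and assumes the example data exists at the path used above:

```python
# Optional sanity check of the expected input format (illustration only).
import pandas as pd

df = pd.read_csv("data/processed/example/example.csv")
# The input format described below uses text_a, text_b, preferred_text.
print(df.columns.tolist())
print(df["preferred_text"].value_counts())  # should only contain "text_a" / "text_b"
```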
To get the available experiment parameters and instructions on how to adjust them, run
```
icai-exp --help
```
@@ -64,7 +45,7 @@ icai-exp --help
### Inspecting results
By default, all experiment results are saved in the `./outputs/<DATE>_<TIME>` directory. These outputs contain a full record of API calls, as well as intermediate states of the algorithm (proposed principles, clusters, distilled principles, etc.). Each result output follows the structure below:
By default, all experiment results are saved in the `./outputs/<DATE>_<TIME>` directory. The exact result file location is also printed to the console at the end of any (completed) `icai-exp` call. These outputs contain a full record of API calls, as well as intermediate states of the algorithm (proposed principles, clusters, distilled principles, etc.). Each result output follows the structure below:
```text
./outputs
@@ -86,6 +67,14 @@ By default all experiment results are saved in the `./outputs/<DATE>_<TIME>` dir
└── 093_results_testset.json
```
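The result files are plain JSON and can be inspected programmatically. A minimal sketch (the exact sub-directory layout may differ between runs and versions, so this simply searches for JSON files under `./outputs`):

```python
# Minimal sketch for listing result files; directory layout details may vary per run.
import glob
import json

for path in sorted(glob.glob("./outputs/**/*.json", recursive=True)):
    with open(path) as f:
        data = json.load(f)
    print(path, type(data).__name__)
```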

## Run experiment with your own data

To run ICAI on your dataset, you first need to convert it to a `csv` file with the following three columns: `text_a`, `text_b`, `preferred_text`. The first two should be strings. Note that the ICAI implementation currently does not use a separate "prompt" column. If such a column exists in your dataset, you likely want to prepend the prompt to each response column (`text_a`, `text_b`) so that the ICAI algorithm can see the full context of the preference label. Entries in the `preferred_text` column should take one of two values: `"text_a"` or `"text_b"`. Ties or other annotation values are currently not used by the algorithm. To run ICAI on your prepared dataset, simply use:

```
icai-exp data_path="<path/to/your-data.csv>"
```
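As an illustration of the conversion described above, the sketch below turns a dataset with separate `prompt`, `response_a`, `response_b`, and `label` columns (hypothetical names; yours will differ) into the required `text_a`/`text_b`/`preferred_text` format by prepending the prompt to each response:

```python
# Illustrative conversion sketch; input column names are hypothetical.
import pandas as pd

raw = pd.read_csv("my_preferences.csv")  # your own dataset

converted = pd.DataFrame(
    {
        # Prepend the prompt so ICAI sees the full context of each preference.
        "text_a": "Prompt: " + raw["prompt"] + "\n\nResponse: " + raw["response_a"],
        "text_b": "Prompt: " + raw["prompt"] + "\n\nResponse: " + raw["response_b"],
        # Map your labels onto the two allowed values.
        "preferred_text": raw["label"].map({"a": "text_a", "b": "text_b"}),
    }
)

# Drop ties or unmapped labels, which are currently not used by the algorithm.
converted = converted.dropna(subset=["preferred_text"])
converted.to_csv("data/processed/my_dataset.csv", index=False)
```

The resulting file can then be passed to `icai-exp` via the `data_path` argument as shown above.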

## Run experiment from config file

The `exp/configs` folder contains a number of configs to recreate experiments. You can run these experiments using the command:
@@ -106,12 +95,23 @@ icai-exp -cd ./exp/configs/001_synthetic_orthogonal
## Development

### Dev installation

Clone the repo locally, e.g.
```
git clone git@github.com:rdnfn/icai.git
```
Then (inside the repo folder) install the package in editable mode:
```
pip install -e .
```

### Running test cases

To run the test cases for the code, use the following command:
Tests are included as part of the package. Run them using:

```
pytest ./tests
```
```bash
pytest ./src
```

### Simplest way to run experiment script
@@ -121,7 +121,30 @@ This doesn't do any meaningful experimental work but allows running the experime
```
icai-exp generate_constitution=false annotator.constitution=null annotator.other_annotator_configs="[]"
```

# Background

## Motivation

Feedback data plays an important role in fine-tuning and evaluating state-of-the-art AI models. Pairwise text preferences are often used: given two texts, human (or AI) annotators select the “better” one. Such feedback data is widely used to align models to human preferences (e.g., reinforcement learning from human feedback) or to rank models according to human preferences (e.g., Chatbot Arena). Despite its widespread use, prior work has demonstrated that human-annotated pairwise text preference data often exhibits unintended biases. For example, human annotators have been shown to prefer assertive over truthful texts in certain contexts. Models trained or evaluated on this data may implicitly encode these biases in ways that are hard to identify. To better understand existing pairwise text preference data, we formulate its interpretation as a compression task: the *Inverse Constitutional AI* problem. Read the [full paper](https://arxiv.org/abs/2406.06560) for more background.

## Method overview

The figure below provides an overview of the *Inverse Constitutional AI* (ICAI) problem we introduce: starting from a set of pairwise preference feedback, we derive a set of natural language principles (a *constitution*) that explain the preference data.
For validation, we reconstruct the original preferences with an LLM judging according to the generated constitution. The constitution represents a (highly compact) compression of the preferences.

<p align="center">
<img src="./docs/img/02_complete_overview.png" width="1000px" align="center">
</p>

## Algorithm

We introduce a first *Inverse Constitutional AI* (ICAI) algorithm that generates a set of principles based on a feedback dataset. See the figure below for an overview of the algorithm. Given a dataset of pairwise rankings, in Step 1, candidate principles are generated using an LLM. In Step 2, these principles are clustered using an embedding model. In Step 3, similar principles are “de-duplicated” by sampling one principle per cluster. In Step 4, each principle is tested to evaluate its ability to help an LLM reconstruct the original annotations. Finally, in Step 5, the principles are filtered according to the testing results, and the filtered set is returned as the final constitution. Optionally, this last step is augmented with additional clustering and subsampling steps to ensure diverse principles. The implementation is provided in this repository.

<p align="center">
<img src="./docs/img/03_algorithm.png" width="800px" align="center">
</p>
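For intuition, the five steps can be summarized as a small, self-contained sketch. The LLM and embedding calls are replaced by trivial stand-ins so the snippet runs on its own; it is a schematic illustration, not the implementation in this repository.

```python
# Schematic, runnable sketch of the five-step ICAI pipeline (stand-ins replace LLM calls).
import random
from collections import defaultdict


def propose_principles(pair):
    # Step 1 stand-in: an LLM would propose principles explaining the annotation.
    return ["Select the more polite response.", "Select the shorter response."]


def cluster_principles(principles):
    # Steps 2-3 stand-in: real code clusters embeddings; here we group identical strings.
    groups = defaultdict(list)
    for p in principles:
        groups[p].append(p)
    return list(groups.values())


def reconstruction_accuracy(principle, pairs):
    # Step 4 stand-in: an LLM judge would re-annotate the pairs using only this principle.
    return random.random()


def icai_sketch(pairs, threshold=0.5):
    candidates = [p for pair in pairs for p in propose_principles(pair)]      # Step 1
    clusters = cluster_principles(candidates)                                 # Steps 2-3
    representatives = [random.choice(cluster) for cluster in clusters]
    scores = {p: reconstruction_accuracy(p, pairs) for p in representatives}  # Step 4
    return [p for p, score in scores.items() if score >= threshold]           # Step 5


toy_pairs = [{"text_a": "Hi.", "text_b": "Hello there, happy to help!", "preferred_text": "text_b"}]
print(icai_sketch(toy_pairs))
```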


### License
# License

All code in this repo is licensed under [Apache-2.0](./LICENSE).
@@ -0,0 +1,25 @@
<|im_start|>system
You are a highly efficient assistant who evaluates and ranks large language models (LLMs) based on the quality of their responses to given prompts. This process will create a leaderboard reflecting the most accurate and human-preferred answers.
<|im_end|>
<|im_start|>user
I require a leaderboard for various large language models. I'll provide you with prompts given to these models and their corresponding responses. Your task is to assess these responses, ranking the models in order of preference from a human perspective. Once ranked, please output the results in a structured JSON format for the make_partial_leaderboard function.

## Model Outputs

Here are the unordered outputs from the models. Each output is associated with a specific model, identified by a unique model identifier.

{
{
"model": "m",
"output": """{output_1}"""
},
{
"model": "M",
"output": """{output_2}"""
}
}

## Task

Evaluate and rank the models based on the quality and relevance of their outputs. The ranking should be such that the model with the highest quality output is ranked first.
<|im_end|>
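For reference, the `{output_1}` and `{output_2}` placeholders in templates such as the one above are filled with the two responses being compared before the prompt is sent to the judge model. A minimal sketch of that substitution, using a shortened stand-in template (this is not AlpacaEval's actual templating code):

```python
# Illustrative placeholder substitution only; not AlpacaEval's templating code.
template = '{\n  "model": "m",\n  "output": """{output_1}"""\n},\n{\n  "model": "M",\n  "output": """{output_2}"""\n}'

prompt_section = (
    template
    .replace("{output_1}", "Sure, here is a short poem about autumn ...")
    .replace("{output_2}", "Sorry, I cannot help with that request.")
)
print(prompt_section)
```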
@@ -0,0 +1,36 @@
alpaca_eval_gpt4_turbo_fn:
  prompt_template: "alpaca_eval_gpt4o_fn_noinstruction_flipped/alpaca_eval_fn.txt"
  fn_completions: "openai_completions"
  completions_kwargs:
    model_name: "gpt-4o-2024-05-13"
    max_tokens: 100
    temperature: 0
    function_call:
      name: "make_partial_leaderboard"
    functions:
      - name: "make_partial_leaderboard"
        description: "Make a leaderboard of models given a list of the models ordered by the preference of their outputs."
        parameters:
          type: "object"
          properties:
            ordered_models:
              type: "array"
              description: "A list of models ordered by the preference of their outputs. The first model in the list has the best output."
              items:
                type: "object"
                properties:
                  model:
                    type: "string"
                    description: "The name of the model"
                  rank:
                    type: "number"
                    description: "Order of preference of the model, 1 has the best output"
          "required": [ "ordered_models" ]
  fn_completion_parser: "pipeline_meta_parser"
  completion_parser_kwargs:
    parsers_to_kwargs:
      json_parser:
        annotation_key: "ordered_models"
      ranking_parser:
        model_1_name: "M" # flipped from alpaca_eval_gpt4o_fn_noinstruction
  batch_size: 1
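The `ranking_parser` setting above controls how the judge's ranked output is mapped back to a pairwise preference: `model_1_name` names the model identifier that corresponds to `output_1`, so setting it to `"M"` (rather than `"m"`) yields the "flipped" variant noted in the comment. A hedged sketch of that mapping logic (not the actual `pipeline_meta_parser` implementation):

```python
# Illustrative sketch of mapping a ranked function-call result to a pairwise
# preference; this is not the actual alpaca_eval parser implementation.
def parse_preference(ordered_models, model_1_name):
    # ordered_models is the JSON returned via make_partial_leaderboard,
    # e.g. [{"model": "m", "rank": 1}, {"model": "M", "rank": 2}].
    ranks = {entry["model"]: entry["rank"] for entry in ordered_models}
    model_2_name = next(name for name in ranks if name != model_1_name)
    # Return 1 if output_1 is preferred, otherwise 2.
    return 1 if ranks[model_1_name] < ranks[model_2_name] else 2


judge_output = [{"model": "m", "rank": 1}, {"model": "M", "rank": 2}]
print(parse_preference(judge_output, model_1_name="m"))  # -> 1
print(parse_preference(judge_output, model_1_name="M"))  # -> 2 (flipped config)
```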
@@ -0,0 +1,25 @@
<|im_start|>system
You are a highly efficient assistant who evaluates and ranks large language models (LLMs) based on the quality of their responses to given prompts. This process will create a leaderboard reflecting the most accurate and human-preferred answers.
<|im_end|>
<|im_start|>user
I require a leaderboard for various large language models. I'll provide you with prompts given to these models and their corresponding responses. Your task is to assess these responses, ranking the models in order of preference from a human perspective. Once ranked, please output the results in a structured JSON format for the make_partial_leaderboard function.

## Model Outputs

Here are the unordered outputs from the models. Each output is associated with a specific model, identified by a unique model identifier.

{
{
"model": "m",
"output": """{output_1}"""
},
{
"model": "M",
"output": """{output_2}"""
}
}

## Task

Evaluate and rank the models based on the quality and relevance of their outputs. The ranking should be such that the model with the highest quality output is ranked first.
<|im_end|>
@@ -0,0 +1,36 @@
alpaca_eval_gpt4_turbo_fn:
  prompt_template: "alpaca_eval_gpt4omini_fn_noinstruction/alpaca_eval_fn.txt"
  fn_completions: "openai_completions"
  completions_kwargs:
    model_name: "gpt-4o-mini-2024-07-18"
    max_tokens: 100
    temperature: 0
    function_call:
      name: "make_partial_leaderboard"
    functions:
      - name: "make_partial_leaderboard"
        description: "Make a leaderboard of models given a list of the models ordered by the preference of their outputs."
        parameters:
          type: "object"
          properties:
            ordered_models:
              type: "array"
              description: "A list of models ordered by the preference of their outputs. The first model in the list has the best output."
              items:
                type: "object"
                properties:
                  model:
                    type: "string"
                    description: "The name of the model"
                  rank:
                    type: "number"
                    description: "Order of preference of the model, 1 has the best output"
          "required": [ "ordered_models" ]
  fn_completion_parser: "pipeline_meta_parser"
  completion_parser_kwargs:
    parsers_to_kwargs:
      json_parser:
        annotation_key: "ordered_models"
      ranking_parser:
        model_1_name: "m"
  batch_size: 1
@@ -0,0 +1,26 @@
<|im_start|>system
You are a highly efficient assistant who evaluates and ranks large language models (LLMs) based on the quality of their responses to given prompts. This process will create a leaderboard reflecting the most accurate and human-preferred answers.
<|im_end|>
<|im_start|>user
I require a leaderboard for various large language models. I'll provide you with prompts given to these models and their corresponding responses. Your task is to assess these responses, ranking the models in order of preference from a human perspective. Once ranked, please output the results in a structured JSON format for the make_partial_leaderboard function.

## Model Outputs

Here are the unordered outputs from the models. Each output is associated with a specific model, identified by a unique model identifier.

{
{
"model": "m",
"output": """{output_1}"""
},
{
"model": "M",
"output": """{output_2}"""
}
}

## Task

Evaluate and rank the models based on the quality and relevance of their outputs. The ranking should be such that the model with the highest quality output is ranked first. Focus on the last response by the assistant.

<|im_end|>