Merge pull request #1 from rdnfn/dev/zwei
Dev/zwei
Showing 112 changed files with 49,459 additions and 399 deletions.
@@ -0,0 +1,26 @@
name: Lint

on:
  pull_request:
    branches:
      - main

jobs:
  lint:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e ".[dev]"
      - name: Check formatting with Black
        run: black --check --version && black --check .
@@ -0,0 +1,26 @@
name: Tests

on:
  pull_request:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e ".[dev]"
      - name: Run tests
        run: pytest
25 changes: 25 additions & 0 deletions
data/annotator_configs/alpaca_eval_gpt4o_fn_noinstruction_flipped/alpaca_eval_fn.txt
@@ -0,0 +1,25 @@
<|im_start|>system
You are a highly efficient assistant, who evaluates and rank large language models (LLMs) based on the quality of their responses to given prompts. This process will create a leaderboard reflecting the most accurate and human-preferred answers.
<|im_end|>
<|im_start|>user
I require a leaderboard for various large language models. I'll provide you with prompts given to these models and their corresponding responses. Your task is to assess these responses, ranking the models in order of preference from a human perspective. Once ranked, please output the results in a structured JSON format for the make_partial_leaderboard function.

## Model Outputs

Here are the unordered outputs from the models. Each output is associated with a specific model, identified by a unique model identifier.

{
    {
        "model": "m",
        "output": """{output_1}"""
    },
    {
        "model": "M",
        "output": """{output_2}"""
    }
}

## Task

Evaluate and rank the models based on the quality and relevance of their outputs. The ranking should be such that the model with the highest quality output is ranked first.
<|im_end|>
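The template above mixes literal braces with the `{output_1}` and `{output_2}` placeholders, so a plain `str.format` call would most likely trip over the literal braces. A minimal sketch of one way to fill it is shown below; `fill_template` and `PLACEHOLDER` are hypothetical names for illustration, not functions from this repository or from AlpacaEval.

```python
import re

# Abbreviated copy of the "Model Outputs" part of the template.
TEMPLATE = '''{
    {
        "model": "m",
        "output": """{output_1}"""
    },
    {
        "model": "M",
        "output": """{output_2}"""
    }
}'''

# Match only the two known placeholders; every other brace is left alone.
PLACEHOLDER = re.compile(r"\{(output_1|output_2)\}")


def fill_template(template: str, output_1: str, output_2: str) -> str:
    """Substitute both placeholders in a single pass (hypothetical helper)."""
    values = {"output_1": output_1, "output_2": output_2}
    # A function replacement avoids backslash processing, and re.sub does not
    # rescan inserted text, so placeholder-like text inside an output survives.
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


prompt = fill_template(TEMPLATE, "Paris {output_2}", "Lyon")
print(prompt)
```

The single-pass substitution matters when a model output itself contains text that looks like a placeholder: two chained `str.replace` calls would substitute it a second time.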
36 changes: 36 additions & 0 deletions
data/annotator_configs/alpaca_eval_gpt4o_fn_noinstruction_flipped/configs.yaml
@@ -0,0 +1,36 @@
alpaca_eval_gpt4_turbo_fn:
  prompt_template: "alpaca_eval_gpt4o_fn_noinstruction_flipped/alpaca_eval_fn.txt"
  fn_completions: "openai_completions"
  completions_kwargs:
    model_name: "gpt-4o-2024-05-13"
    max_tokens: 100
    temperature: 0
    function_call:
      name: "make_partial_leaderboard"
    functions:
      - name: "make_partial_leaderboard"
        description: "Make a leaderboard of models given a list of the models ordered by the preference of their outputs."
        parameters:
          type: "object"
          properties:
            ordered_models:
              type: "array"
              description: "A list of models ordered by the preference of their outputs. The first model in the list has the best output."
              items:
                type: "object"
                properties:
                  model:
                    type: "string"
                    description: "The name of the model"
                  rank:
                    type: "number"
                    description: "Order of preference of the model, 1 has the best output"
          "required": [ "ordered_models" ]
  fn_completion_parser: "pipeline_meta_parser"
  completion_parser_kwargs:
    parsers_to_kwargs:
      json_parser:
        annotation_key: "ordered_models"
      ranking_parser:
        model_1_name: "M" # flipped from alpaca_eval_gpt4o_fn_noinstruction
  batch_size: 1
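To illustrate what changing `model_1_name` from `"m"` to `"M"` does, here is a rough sketch of a parser that reads the `ordered_models` argument and returns the rank of the model named by `model_1_name`. The function `rank_of_model_1` is a hypothetical stand-in, written under the assumption that the configured parsers behave roughly this way; it is not the actual `json_parser` or `ranking_parser` code.

```python
import json


def rank_of_model_1(arguments_json: str, model_1_name: str) -> int:
    """Return the rank (1 or 2) given to the model labelled model_1_name.

    Hypothetical sketch: arguments_json is assumed to be the JSON string
    passed to make_partial_leaderboard by the annotator model.
    """
    ordered = json.loads(arguments_json)["ordered_models"]
    ranks = [entry["rank"] for entry in ordered if entry["model"] == model_1_name]
    if len(ranks) != 1:
        raise ValueError(f"expected exactly one entry for {model_1_name!r}")
    rank = int(ranks[0])  # the schema declares rank as a JSON number
    if rank not in (1, 2):
        raise ValueError(f"unexpected rank {rank}")
    return rank


# Example annotator response in which the output labelled "M" is preferred.
example = '{"ordered_models": [{"model": "M", "rank": 1}, {"model": "m", "rank": 2}]}'

default_result = rank_of_model_1(example, "m")  # unflipped setting
flipped_result = rank_of_model_1(example, "M")  # setting used in this config
print(default_result, flipped_result)
```

Under this assumption, the same annotator response yields opposite results for the two settings, which is the only difference the flipped config introduces at parse time.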
25 changes: 25 additions & 0 deletions
data/annotator_configs/alpaca_eval_gpt4omini_fn_noinstruction/alpaca_eval_fn.txt
@@ -0,0 +1,25 @@
<|im_start|>system
You are a highly efficient assistant, who evaluates and rank large language models (LLMs) based on the quality of their responses to given prompts. This process will create a leaderboard reflecting the most accurate and human-preferred answers.
<|im_end|>
<|im_start|>user
I require a leaderboard for various large language models. I'll provide you with prompts given to these models and their corresponding responses. Your task is to assess these responses, ranking the models in order of preference from a human perspective. Once ranked, please output the results in a structured JSON format for the make_partial_leaderboard function.

## Model Outputs

Here are the unordered outputs from the models. Each output is associated with a specific model, identified by a unique model identifier.

{
    {
        "model": "m",
        "output": """{output_1}"""
    },
    {
        "model": "M",
        "output": """{output_2}"""
    }
}

## Task

Evaluate and rank the models based on the quality and relevance of their outputs. The ranking should be such that the model with the highest quality output is ranked first.
<|im_end|>
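The `<|im_start|>` and `<|im_end|>` markers in the template delimit chat turns by role. As a rough sketch of how such a template could be split into role/content messages before being sent to a chat model, consider the following; `chatml_to_messages` is a hypothetical helper written for illustration and is not taken from this repository.

```python
import re

# One block: start marker, role name, newline, content, end marker.
CHATML_BLOCK = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def chatml_to_messages(prompt: str) -> list[dict[str, str]]:
    """Split a ChatML-style prompt into a list of role/content messages."""
    return [
        {"role": role, "content": content.strip()}
        for role, content in CHATML_BLOCK.findall(prompt)
    ]


# Shortened stand-in for the template text.
PROMPT = (
    "<|im_start|>system\n"
    "You are a highly efficient assistant.\n"
    "<|im_end|>\n"
    "<|im_start|>user\n"
    "Rank the two outputs.\n"
    "<|im_end|>"
)

messages = chatml_to_messages(PROMPT)
print(messages)
```

Note that this simple pattern would split incorrectly if a model output substituted into the template contained the literal text `<|im_end|>`.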
36 changes: 36 additions & 0 deletions
data/annotator_configs/alpaca_eval_gpt4omini_fn_noinstruction/configs.yaml
@@ -0,0 +1,36 @@
alpaca_eval_gpt4_turbo_fn:
  prompt_template: "alpaca_eval_gpt4omini_fn_noinstruction/alpaca_eval_fn.txt"
  fn_completions: "openai_completions"
  completions_kwargs:
    model_name: "gpt-4o-mini-2024-07-18"
    max_tokens: 100
    temperature: 0
    function_call:
      name: "make_partial_leaderboard"
    functions:
      - name: "make_partial_leaderboard"
        description: "Make a leaderboard of models given a list of the models ordered by the preference of their outputs."
        parameters:
          type: "object"
          properties:
            ordered_models:
              type: "array"
              description: "A list of models ordered by the preference of their outputs. The first model in the list has the best output."
              items:
                type: "object"
                properties:
                  model:
                    type: "string"
                    description: "The name of the model"
                  rank:
                    type: "number"
                    description: "Order of preference of the model, 1 has the best output"
          "required": [ "ordered_models" ]
  fn_completion_parser: "pipeline_meta_parser"
  completion_parser_kwargs:
    parsers_to_kwargs:
      json_parser:
        annotation_key: "ordered_models"
      ranking_parser:
        model_1_name: "m"
  batch_size: 1
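For orientation, the `completions_kwargs` above resemble the arguments of a chat completions request that uses the legacy `functions` / `function_call` parameters. The sketch below shows how such a request body might be assembled; `build_request` is a hypothetical helper, and the assumption that the kwargs are forwarded nearly unchanged (with `model_name` renamed to `model`) is mine, not something stated in the diff. No API call is made.

```python
# Hypothetical Python mirror of completions_kwargs from configs.yaml.
COMPLETIONS_KWARGS = {
    "model_name": "gpt-4o-mini-2024-07-18",
    "max_tokens": 100,
    "temperature": 0,
    "function_call": {"name": "make_partial_leaderboard"},
    "functions": [
        {
            "name": "make_partial_leaderboard",
            "description": "Make a leaderboard of models given a list of "
            "the models ordered by the preference of their outputs.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ordered_models": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "model": {"type": "string"},
                                "rank": {"type": "number"},
                            },
                        },
                    }
                },
                "required": ["ordered_models"],
            },
        }
    ],
}


def build_request(messages: list[dict[str, str]], kwargs: dict) -> dict:
    """Assemble a request body from config kwargs (assumed mapping)."""
    request = {key: value for key, value in kwargs.items() if key != "model_name"}
    request["model"] = kwargs["model_name"]  # assumed rename of the model key
    request["messages"] = messages
    return request


request = build_request(
    [{"role": "user", "content": "Rank the two outputs."}], COMPLETIONS_KWARGS
)
print(sorted(request))
```

Forcing `function_call` to a named function means the model is expected to reply with arguments for `make_partial_leaderboard` rather than free text, which is what lets the parsers read `ordered_models` as JSON.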
26 changes: 26 additions & 0 deletions
data/annotator_configs/alpaca_eval_gpt4omini_fn_noinstruction_v2_mt/alpaca_eval_fn.txt
@@ -0,0 +1,26 @@
<|im_start|>system
You are a highly efficient assistant, who evaluates and rank large language models (LLMs) based on the quality of their responses to given prompts. This process will create a leaderboard reflecting the most accurate and human-preferred answers.
<|im_end|>
<|im_start|>user
I require a leaderboard for various large language models. I'll provide you with prompts given to these models and their corresponding responses. Your task is to assess these responses, ranking the models in order of preference from a human perspective. Once ranked, please output the results in a structured JSON format for the make_partial_leaderboard function.

## Model Outputs

Here are the unordered outputs from the models. Each output is associated with a specific model, identified by a unique model identifier.

{
    {
        "model": "m",
        "output": """{output_1}"""
    },
    {
        "model": "M",
        "output": """{output_2}"""
    }
}

## Task

Evaluate and rank the models based on the quality and relevance of their outputs. The ranking should be such that the model with the highest quality output is ranked first. Focus on the last response by the assistant.

<|im_end|>