Make namespace default to swebench, clean up links
john-b-yang committed Feb 4, 2025
1 parent bc400fd commit 193fde5
Showing 13 changed files with 52 additions and 54 deletions.
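
The headline change makes the evaluation harness default its Docker image namespace to `swebench`, which is why the README and `assets/evaluation.md` examples below drop the explicit `--namespace swebench` flag. The harness code itself is not part of the diff shown here, so the following is only a minimal sketch (assumed argparse wiring, hypothetical argument subset) of how such a default behaves from the caller's point of view:

```python
# Minimal sketch only: assumes the harness parses CLI flags with argparse.
# The argument names mirror the documented flags; the wiring is illustrative,
# not the actual swebench.harness.run_evaluation implementation.
import argparse

parser = argparse.ArgumentParser(description="sketch of the run_evaluation CLI")
parser.add_argument("--run_id", required=True, help="name for this evaluation run")
parser.add_argument(
    "--namespace",
    default="swebench",  # assumed default per the commit title ("Make namespace default to swebench")
    help="Docker namespace to pull evaluation images from (override to use your own)",
)

args = parser.parse_args(["--run_id", "demo"])  # no --namespace passed by the caller
print(args.namespace)  # -> "swebench", matching the simplified commands in this commit
```

In this sketch, passing `--namespace` explicitly still overrides the default, so older invocations that spelled it out keep working.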
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.yml
@@ -7,7 +7,7 @@ body:
attributes:
value: >
#### Before submitting a bug, please make sure the issue hasn't been already
- addressed by searching through [the past issues](https://github.com/princeton-nlp/SWE-agent/issues).
+ addressed by searching through [the past issues](https://github.com/swe-bench/SWE-bench/issues).
- type: textarea
attributes:
label: Describe the bug
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/config.yml
@@ -4,5 +4,5 @@ contact_links:
url: https://discord.gg/AVEFbBn2rH
about: Developers and users can be found on the Discord server
- name: Blank issue
- url: https://github.com/princeton-nlp/SWE-bench/issues/new
+ url: https://github.com/swe-bench/SWE-bench/issues/new
about: None of the above? Open a blank issue
2 changes: 1 addition & 1 deletion .github/workflows/pytest.yaml
@@ -54,4 +54,4 @@ jobs:
uses: codecov/codecov-action@v4.0.1
with:
token: ${{ secrets.CODECOV_TOKEN }}
- slug: princeton-nlp/SWE-bench
+ slug: swe-bench/SWE-bench
20 changes: 10 additions & 10 deletions CHANGELOG.md
@@ -45,7 +45,7 @@ Major release - the SWE-bench evaluation harness has been upgraded to incorporat
* Significant modifications to underlying evaluation logic
* Minor updates to installation specifications for different repos + versions.

- Read the full report [here](https://github.com/princeton-nlp/SWE-bench/tree/main/docs/20240627_docker)
+ Read the full report [here](https://github.com/swe-bench/SWE-bench/tree/main/docs/20240627_docker)

## [1.1.5] - 5/15/2024
* Add support for HumanEvalFix (Python, JS, Go, Java) ([source](https://huggingface.co/datasets/bigcode/humanevalpack))
@@ -64,22 +64,22 @@ Read the full report [here](https://github.com/princeton-nlp/SWE-bench/tree/main
* Rewrite `swebench.metrics.get_model_report`.

## [1.0.5] - 4/7/2024
- * Fix log parsing for `pydicom`, `pylint`, and `requests` libraries. [5cb448](https://github.com/princeton-nlp/SWE-bench/commit/5cb448140a8cd05490650b0671d860765180f26c)
+ * Fix log parsing for `pydicom`, `pylint`, and `requests` libraries. [5cb448](https://github.com/swe-bench/SWE-bench/commit/5cb448140a8cd05490650b0671d860765180f26c)

## [1.0.4] - 4/5/2024
- * Fixed `env_list` parsing. [5be59d](https://github.com/princeton-nlp/SWE-bench/commit/5be59d665233ffb63b9beb30b2740cc41098e51f)
- * Updated `ExecWrapper`, `LogWrapper` logic for harness. [231a2b](https://github.com/princeton-nlp/SWE-bench/commit/231a2b205c5ca9ddcb126b73b22667d79e1b6108)
+ * Fixed `env_list` parsing. [5be59d](https://github.com/swe-bench/SWE-bench/commit/5be59d665233ffb63b9beb30b2740cc41098e51f)
+ * Updated `ExecWrapper`, `LogWrapper` logic for harness. [231a2b](https://github.com/swe-bench/SWE-bench/commit/231a2b205c5ca9ddcb126b73b22667d79e1b6108)

## [1.0.2] - 4/2/2024
- * Added `try/catch` around `lsof` based clean up for `run_evaluation.py`. [3fb217](https://github.com/princeton-nlp/SWE-bench/commit/3fb2179a5c69737465f916898e8708adffff9914)
- * Fixed `get_eval_refs` function. [12a287](https://github.com/princeton-nlp/SWE-bench/commit/12a287a9591cb4a0d65483f0c8bfaa3375285bfc)
- * Fixed `seaborn` log parser. [0372b6](https://github.com/princeton-nlp/SWE-bench/commit/0372b6a9ff62516067fb26f602163c231d818163)
+ * Added `try/catch` around `lsof` based clean up for `run_evaluation.py`. [3fb217](https://github.com/swe-bench/SWE-bench/commit/3fb2179a5c69737465f916898e8708adffff9914)
+ * Fixed `get_eval_refs` function. [12a287](https://github.com/swe-bench/SWE-bench/commit/12a287a9591cb4a0d65483f0c8bfaa3375285bfc)
+ * Fixed `seaborn` log parser. [0372b6](https://github.com/swe-bench/SWE-bench/commit/0372b6a9ff62516067fb26f602163c231d818163)

## [1.0.1] - 3/31/2024
First working version. We strongly recommend not using versions older than this one.
- * Added logging for failed installations. [58d24d](https://github.com/princeton-nlp/SWE-bench/commit/58d24d1b65b95ed96d57805604aca7adca49861d)
- * Added missing `datasets` dependency. [68e89e](https://github.com/princeton-nlp/SWE-bench/commit/68e89ef8d099ca5c23a8fd5681e3f990cf729fd6)
- * Reorganized repository to be directly build-able as a PyPI package. [548bdb](https://github.com/princeton-nlp/SWE-bench/commit/548bdbffb2ac5f0a09c1d7eb95bbee1bce126233)
+ * Added logging for failed installations. [58d24d](https://github.com/swe-bench/SWE-bench/commit/58d24d1b65b95ed96d57805604aca7adca49861d)
+ * Added missing `datasets` dependency. [68e89e](https://github.com/swe-bench/SWE-bench/commit/68e89ef8d099ca5c23a8fd5681e3f990cf729fd6)
+ * Reorganized repository to be directly build-able as a PyPI package. [548bdb](https://github.com/swe-bench/SWE-bench/commit/548bdbffb2ac5f0a09c1d7eb95bbee1bce126233)

## [0.6.9 - 0.6.9.2] - 3/31/2024
> ⚠️ Do NOT use these versions. The PyPI package for these versions was under development. Specifically, some of the evaluation configurations required re-validation. A detailed report for the failures and our recovery from it are detailed in [Bug Report 4/5/2024](docs/reports/20240405_eval_bug/README.md).
19 changes: 9 additions & 10 deletions README.md
@@ -6,7 +6,7 @@

<div align="center">

- | [日本語](docs/README_JP.md) | [English](https://github.com/princeton-nlp/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |
+ | [日本語](docs/README_JP.md) | [English](https://github.com/swe-bench/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |

</div>

@@ -27,14 +27,14 @@ Code and data for our ICLR 2024 paper <a href="http://swe-bench.github.io/paper.
</a>
</p>

- Please refer our [website](http://swe-bench.github.io) for the public leaderboard and the [change log](https://github.com/princeton-nlp/SWE-bench/blob/main/CHANGELOG.md) for information on the latest updates to the SWE-bench benchmark.
+ Please refer our [website](http://swe-bench.github.io) for the public leaderboard and the [change log](https://github.com/swe-bench/SWE-bench/blob/main/CHANGELOG.md) for information on the latest updates to the SWE-bench benchmark.

## 📰 News
* **[Jan. 13, 2025]**: We've integrated [SWE-bench Multimodal](https://swebench.github.io/multimodal) ([paper](https://arxiv.org/abs/2410.03859), [dataset](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Multimodal)) into this repository! Unlike SWE-bench, we've kept evaluation for the test split *private*. Submit to the leaderboard using [sb-cli](https://github.com/swe-bench/sb-cli/tree/main), our new cloud-based evaluation tool.
- * **[Jan. 11, 2025]**: Thanks to [Modal](https://modal.com/), you can now run evaluations entirely on the cloud! See [here](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md#-evaluation-with-modal) for more details.
+ * **[Jan. 11, 2025]**: Thanks to [Modal](https://modal.com/), you can now run evaluations entirely on the cloud! See [here](https://github.com/swe-bench/SWE-bench/blob/main/assets/evaluation.md#%EF%B8%8F-evaluation-with-modal) for more details.
* **[Aug. 13, 2024]**: Introducing *SWE-bench Verified*! Part 2 of our collaboration with [OpenAI Preparedness](https://openai.com/preparedness/). A subset of 500 problems that real software engineers have confirmed are solvable. Check out more in the [report](https://openai.com/index/introducing-swe-bench-verified/)!
- * **[Jun. 27, 2024]**: We have an exciting update for SWE-bench - with support from [OpenAI's Preparedness](https://openai.com/preparedness/) team: We're moving to a fully containerized evaluation harness using Docker for more reproducible evaluations! Read more in our [report](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).
- * **[Apr. 2, 2024]**: We have released [SWE-agent](https://github.com/princeton-nlp/SWE-agent), which sets the state-of-the-art on the full SWE-bench test set! ([Tweet 🔗](https://twitter.com/jyangballin/status/1775114444370051582))
+ * **[Jun. 27, 2024]**: We have an exciting update for SWE-bench - with support from [OpenAI's Preparedness](https://openai.com/preparedness/) team: We're moving to a fully containerized evaluation harness using Docker for more reproducible evaluations! Read more in our [report](https://github.com/swe-bench/SWE-bench/blob/main/docs/20240627_docker/README.md).
+ * **[Apr. 2, 2024]**: We have released [SWE-agent](https://github.com/SWE-agent/SWE-agent), which sets the state-of-the-art on the full SWE-bench test set! ([Tweet 🔗](https://twitter.com/jyangballin/status/1775114444370051582))
* **[Jan. 16, 2024]**: SWE-bench has been accepted to ICLR 2024 as an oral presentation! ([OpenReview 🔗](https://openreview.net/forum?id=VTF8yNQM66))

## 👋 Overview
@@ -77,8 +77,7 @@ python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Lite \
--predictions_path <path_to_predictions> \
--max_workers <num_workers> \
- --run_id <run_id> \
- --namespace swebench
+ --run_id <run_id>
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the evaluation run
```
@@ -104,12 +103,12 @@ python -m swebench.harness.run_evaluation --help
See the [evaluation tutorial](./assets/evaluation.md) for the full rundown on datasets you can evaluate.
If you're looking for non-local, cloud based evaluations, check out...
* [sb-cli](https://github.com/swe-bench/sb-cli), our tool for running evaluations automatically on AWS, or...
- * Running SWE-bench evaluation on [Modal](https://modal.com/). Details [here](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md#-evaluation-with-modal)
+ * Running SWE-bench evaluation on [Modal](https://modal.com/). Details [here](https://github.com/swe-bench/SWE-bench/blob/main/assets/evaluation.md#%EF%B8%8F-evaluation-with-modal)

Additionally, you can also:
* [Train](https://github.com/swe-bench/SWE-bench/tree/main/swebench/inference/make_datasets) your own models on our pre-processed datasets.
- * Run [inference](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/inference/README.md) on existing models (both local and API models). The inference step is where you give the model a repo + issue and have it generate a fix.
- * Run SWE-bench's [data collection procedure](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/collect/) ([tutorial](./assets/evaluation.md)) on your own repositories, to make new SWE-Bench tasks.
+ * Run [inference](https://github.com/swe-bench/SWE-bench/blob/main/swebench/inference/README.md) on existing models (both local and API models). The inference step is where you give the model a repo + issue and have it generate a fix.
+ * Run SWE-bench's [data collection procedure](https://github.com/swe-bench/SWE-bench/blob/main/swebench/collect/) ([tutorial](./assets/evaluation.md)) on your own repositories, to make new SWE-Bench tasks.

## ⬇️ Downloads
| Datasets | Models | RAG |
7 changes: 3 additions & 4 deletions assets/evaluation.md
@@ -26,14 +26,13 @@ python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Lite \
--predictions_path <path_to_predictions> \
--max_workers <num_workers> \
- --run_id <run_id> \
- --namespace swebench
+ --run_id <run_id>
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the run, logs will be written to ./logs/run_evaluation/<run_id>
# use --split to specify which split to evaluate on, usually `dev` or `test`
```

- You can run evaluation for the following (`dataset_name`, `--split`)
+ You can run evaluation for the following (`dataset_name`, `split`)
* `princeton-nlp/SWE-bench_Lite`, `test` (300 task instances)
* `princeton-nlp/SWE-bench_Verified`, `test` (500)
* `princeton-nlp/SWE-bench`, `dev` (225)
@@ -42,7 +41,7 @@ You can run evaluation for the following (`dataset_name`, `--split`)

You *cannot* run evaluation on the `test` split of `princeton-nlp/SWE-bench_Multimodal` using this repository (517 instances).
To encourage less intentional climbing of the leaderboard, we have intentionally made specifications for evaluating the test split private.
- You can submit to the leaderboard using
+ Use [sb-cli](https://github.com/swe-bench/sb-cli/) for SWE-bench Multimodal evaluation.

### 🌩️ Evaluation with Modal
You can also run evaluations entirely on the cloud using [Modal](https://modal.com/) to avoid local setup and resource constraints:
6 changes: 3 additions & 3 deletions docs/20240415_eval_bug/README.md
@@ -82,7 +82,7 @@ Requirement already satisfied: pytest in /n/fs/p-swe-bench/temp/seaborn/tmphkkam
```

Over time, the version number may increase. The solution we use for this is to explicitly specify versions for PyPI packages that are installed (e.g. `click==8.0.1`).
- Examples of this can be found throughout the `swebench/harness/constants.py` file, such as [here](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/harness/constants.py#L32).
+ Examples of this can be found throughout the `swebench/harness/constants.py` file, such as [here](https://github.com/swe-bench/SWE-bench/blob/main/swebench/harness/constants.py#L32).
</details>

**4. PyPI Package Dependency Updates**: Assuming failure modes #1, #2, and #3 don't occur (conda environment is created + is set up correctly), a last source of potential error is that the PyPI packages for dependencies is updated by the maintainers. At this time, based on the extent of our investigation, this is not a source of error for any task instances. However, if future versions of a PyPI package break prior functionality, this may cause an error.
@@ -91,7 +91,7 @@ Examples of this can be found throughout the `swebench/harness/constants.py` fil

The fix shown for Failure Mode #3 also resolves this situation.

- **5. P2P Tests with Machine-Specific Paths**: We found that some P2P tests collected automatically via the harness picked up on tests with machine-specific paths to local testing files (e.g. a test named `test-validation.py:[/n/fs/p-swe-bench/temp/pytest-tmpdir/TEXT_001.txt]`), which are impossible to resolve on other machines. To remedy this, we have either 1. Rewritten the log parsing logic to refactor machine-specific paths into just keeping the file name (see [here](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/harness/log_parsers.py#L28)), or 2. Removed these tests entirely.
+ **5. P2P Tests with Machine-Specific Paths**: We found that some P2P tests collected automatically via the harness picked up on tests with machine-specific paths to local testing files (e.g. a test named `test-validation.py:[/n/fs/p-swe-bench/temp/pytest-tmpdir/TEXT_001.txt]`), which are impossible to resolve on other machines. To remedy this, we have either 1. Rewritten the log parsing logic to refactor machine-specific paths into just keeping the file name (see [here](https://github.com/swe-bench/SWE-bench/blob/main/swebench/harness/log_parsers.py#L28)), or 2. Removed these tests entirely.
- 🟡 Low (< 10)
- Affected Repositories: requests, sympy

@@ -117,7 +117,7 @@ To identify and then fix these issues, we carried out the following steps:
To perform multiple rounds of validation, we run the `sweep_conda_links.py` script multiple times. We use manual inspection + several on the fly scripts to identify, then rename or remove any problematic tests.

## Outcomes
- We introduce the following fixes that are solutions to the discussed problems, which is generally starting from [#65](https://github.com/princeton-nlp/SWE-bench/pull/65) to the latest `1.0.x` releases:
+ We introduce the following fixes that are solutions to the discussed problems, which is generally starting from [#65](https://github.com/swe-bench/SWE-bench/pull/65) to the latest `1.0.x` releases:
* Fix conda version to `py39_23.10.0-1`.
* Specify specific pip package versions to install (e.g. `contourpy==1.1.0`).
* Add missing pip packages that need to be installed due to changes in the conda resolution logic.
12 changes: 6 additions & 6 deletions docs/README_CN.md
@@ -6,7 +6,7 @@

<div align="center">

- | [日本語](docs/README_JP.md) | [English](https://github.com/princeton-nlp/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |
+ | [日本語](docs/README_JP.md) | [English](https://github.com/swe-bench/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |

</div>

@@ -27,7 +27,7 @@
</a>
</p>

- 请访问我们的[网站](http://swe-bench.github.io)查看公共排行榜,并查看[更改日志](https://github.com/princeton-nlp/SWE-bench/blob/master/CHANGELOG.md)以获取有关 SWE-bench 基准最新更新的信息。
+ 请访问我们的[网站](http://swe-bench.github.io)查看公共排行榜,并查看[更改日志](https://github.com/swe-bench/SWE-bench/blob/master/CHANGELOG.md)以获取有关 SWE-bench 基准最新更新的信息。

## 👋 概述

@@ -97,8 +97,8 @@ python -m swebench.harness.run_evaluation --help

此外,SWE-Bench仓库还可以帮助你:
* 在我们预处理的数据集上训练你自己的模型
- * 在现有模型上运行[推理](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/inference/README.md)(无论是你本地的模型如LLaMA,还是你通过API访问的模型如GPT-4)。推理步骤是指给定一个仓库和一个问题,让模型尝试生成修复方案。
- * 在你自己的仓库上运行SWE-bench的[数据收集程序](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/collect/),以创建新的SWE-Bench任务。
+ * 在现有模型上运行[推理](https://github.com/swe-bench/SWE-bench/blob/main/swebench/inference/README.md)(无论是你本地的模型如LLaMA,还是你通过API访问的模型如GPT-4)。推理步骤是指给定一个仓库和一个问题,让模型尝试生成修复方案。
+ * 在你自己的仓库上运行SWE-bench的[数据收集程序](https://github.com/swe-bench/SWE-bench/blob/main/swebench/collect/),以创建新的SWE-Bench任务。

## ⬇️ 下载

@@ -115,8 +115,8 @@ python -m swebench.harness.run_evaluation --help

我们还编写了以下博客文章,介绍如何使用SWE-bench的不同部分。
如果你想看到关于特定主题的文章,请通过issue告诉我们。
- * [2023年11月1日] 为SWE-Bench收集评估任务 ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/collection.md))
- * [2023年11月6日] 在SWE-bench上进行评估 ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md))
+ * [2023年11月1日] 为SWE-Bench收集评估任务 ([🔗](https://github.com/swe-bench/SWE-bench/blob/main/assets/collection.md))
+ * [2023年11月6日] 在SWE-bench上进行评估 ([🔗](https://github.com/swe-bench/SWE-bench/blob/main/assets/evaluation.md))

## 💫 贡献
