Make namespace default to swebench, clean up links
john-b-yang committed Feb 4, 2025
1 parent bc400fd commit 193fde5
Showing 13 changed files with 52 additions and 54 deletions.
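
The headline change makes the evaluation harness default its Docker image namespace to `swebench`, which is why the README and `assets/evaluation.md` examples below drop the explicit `--namespace swebench` flag. The harness code itself is not part of the diff shown here, so the following is only a minimal sketch (assumed argparse wiring, hypothetical argument subset) of how such a default behaves from the caller's point of view:

```python
# Minimal sketch only: assumes the harness parses CLI flags with argparse.
# The argument names mirror the documented flags; the wiring is illustrative,
# not the actual swebench.harness.run_evaluation implementation.
import argparse

parser = argparse.ArgumentParser(description="sketch of the run_evaluation CLI")
parser.add_argument("--run_id", required=True, help="name for this evaluation run")
parser.add_argument(
    "--namespace",
    default="swebench",  # assumed default per the commit title ("Make namespace default to swebench")
    help="Docker namespace to pull evaluation images from (override to use your own)",
)

args = parser.parse_args(["--run_id", "demo"])  # no --namespace passed by the caller
print(args.namespace)  # -> "swebench", matching the simplified commands in this commit
```

In this sketch, passing `--namespace` explicitly still overrides the default, so older invocations that spelled it out keep working.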
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.yml
@@ -7,7 +7,7 @@ body:
attributes:
value: >
#### Before submitting a bug, please make sure the issue hasn't been already
- addressed by searching through [the past issues](https://github.com/princeton-nlp/SWE-agent/issues).
+ addressed by searching through [the past issues](https://github.com/swe-bench/SWE-bench/issues).
- type: textarea
attributes:
label: Describe the bug
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/config.yml
@@ -4,5 +4,5 @@ contact_links:
url: https://discord.gg/AVEFbBn2rH
about: Developers and users can be found on the Discord server
- name: Blank issue
- url: https://github.com/princeton-nlp/SWE-bench/issues/new
+ url: https://github.com/swe-bench/SWE-bench/issues/new
about: None of the above? Open a blank issue
2 changes: 1 addition & 1 deletion .github/workflows/pytest.yaml
@@ -54,4 +54,4 @@ jobs:
uses: codecov/codecov-action@v4.0.1
with:
token: ${{ secrets.CODECOV_TOKEN }}
- slug: princeton-nlp/SWE-bench
+ slug: swe-bench/SWE-bench
20 changes: 10 additions & 10 deletions CHANGELOG.md
@@ -45,7 +45,7 @@ Major release - the SWE-bench evaluation harness has been upgraded to incorporat
* Significant modifications to underlying evaluation logic
* Minor updates to installation specifications for different repos + versions.

- Read the full report [here](https://github.com/princeton-nlp/SWE-bench/tree/main/docs/20240627_docker)
+ Read the full report [here](https://github.com/swe-bench/SWE-bench/tree/main/docs/20240627_docker)

## [1.1.5] - 5/15/2024
* Add support for HumanEvalFix (Python, JS, Go, Java) ([source](https://huggingface.co/datasets/bigcode/humanevalpack))
@@ -64,22 +64,22 @@ Read the full report [here](https://github.com/princeton-nlp/SWE-bench/tree/main
* Rewrite `swebench.metrics.get_model_report`.

## [1.0.5] - 4/7/2024
- * Fix log parsing for `pydicom`, `pylint`, and `requests` libraries. [5cb448](https://github.com/princeton-nlp/SWE-bench/commit/5cb448140a8cd05490650b0671d860765180f26c)
+ * Fix log parsing for `pydicom`, `pylint`, and `requests` libraries. [5cb448](https://github.com/swe-bench/SWE-bench/commit/5cb448140a8cd05490650b0671d860765180f26c)

## [1.0.4] - 4/5/2024
- * Fixed `env_list` parsing. [5be59d](https://github.com/princeton-nlp/SWE-bench/commit/5be59d665233ffb63b9beb30b2740cc41098e51f)
- * Updated `ExecWrapper`, `LogWrapper` logic for harness. [231a2b](https://github.com/princeton-nlp/SWE-bench/commit/231a2b205c5ca9ddcb126b73b22667d79e1b6108)
+ * Fixed `env_list` parsing. [5be59d](https://github.com/swe-bench/SWE-bench/commit/5be59d665233ffb63b9beb30b2740cc41098e51f)
+ * Updated `ExecWrapper`, `LogWrapper` logic for harness. [231a2b](https://github.com/swe-bench/SWE-bench/commit/231a2b205c5ca9ddcb126b73b22667d79e1b6108)

## [1.0.2] - 4/2/2024
- * Added `try/catch` around `lsof` based clean up for `run_evaluation.py`. [3fb217](https://github.com/princeton-nlp/SWE-bench/commit/3fb2179a5c69737465f916898e8708adffff9914)
- * Fixed `get_eval_refs` function. [12a287](https://github.com/princeton-nlp/SWE-bench/commit/12a287a9591cb4a0d65483f0c8bfaa3375285bfc)
- * Fixed `seaborn` log parser. [0372b6](https://github.com/princeton-nlp/SWE-bench/commit/0372b6a9ff62516067fb26f602163c231d818163)
+ * Added `try/catch` around `lsof` based clean up for `run_evaluation.py`. [3fb217](https://github.com/swe-bench/SWE-bench/commit/3fb2179a5c69737465f916898e8708adffff9914)
+ * Fixed `get_eval_refs` function. [12a287](https://github.com/swe-bench/SWE-bench/commit/12a287a9591cb4a0d65483f0c8bfaa3375285bfc)
+ * Fixed `seaborn` log parser. [0372b6](https://github.com/swe-bench/SWE-bench/commit/0372b6a9ff62516067fb26f602163c231d818163)

## [1.0.1] - 3/31/2024
First working version. We strongly recommend not using versions older than this one.
- * Added logging for failed installations. [58d24d](https://github.com/princeton-nlp/SWE-bench/commit/58d24d1b65b95ed96d57805604aca7adca49861d)
- * Added missing `datasets` dependency. [68e89e](https://github.com/princeton-nlp/SWE-bench/commit/68e89ef8d099ca5c23a8fd5681e3f990cf729fd6)
- * Reorganized repository to be directly build-able as a PyPI package. [548bdb](https://github.com/princeton-nlp/SWE-bench/commit/548bdbffb2ac5f0a09c1d7eb95bbee1bce126233)
+ * Added logging for failed installations. [58d24d](https://github.com/swe-bench/SWE-bench/commit/58d24d1b65b95ed96d57805604aca7adca49861d)
+ * Added missing `datasets` dependency. [68e89e](https://github.com/swe-bench/SWE-bench/commit/68e89ef8d099ca5c23a8fd5681e3f990cf729fd6)
+ * Reorganized repository to be directly build-able as a PyPI package. [548bdb](https://github.com/swe-bench/SWE-bench/commit/548bdbffb2ac5f0a09c1d7eb95bbee1bce126233)

## [0.6.9 - 0.6.9.2] - 3/31/2024
> ⚠️ Do NOT use these versions. The PyPI package for these versions was under development. Specifically, some of the evaluation configurations required re-validation. A detailed report for the failures and our recovery from it are detailed in [Bug Report 4/5/2024](docs/reports/20240405_eval_bug/README.md).
19 changes: 9 additions & 10 deletions README.md
@@ -6,7 +6,7 @@

<div align="center">

- | [日本語](docs/README_JP.md) | [English](https://github.com/princeton-nlp/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |
+ | [日本語](docs/README_JP.md) | [English](https://github.com/swe-bench/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |

</div>

@@ -27,14 +27,14 @@ Code and data for our ICLR 2024 paper <a href="http://swe-bench.github.io/paper.
</a>
</p>

- Please refer our [website](http://swe-bench.github.io) for the public leaderboard and the [change log](https://github.com/princeton-nlp/SWE-bench/blob/main/CHANGELOG.md) for information on the latest updates to the SWE-bench benchmark.
+ Please refer our [website](http://swe-bench.github.io) for the public leaderboard and the [change log](https://github.com/swe-bench/SWE-bench/blob/main/CHANGELOG.md) for information on the latest updates to the SWE-bench benchmark.

## 📰 News
* **[Jan. 13, 2025]**: We've integrated [SWE-bench Multimodal](https://swebench.github.io/multimodal) ([paper](https://arxiv.org/abs/2410.03859), [dataset](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Multimodal)) into this repository! Unlike SWE-bench, we've kept evaluation for the test split *private*. Submit to the leaderboard using [sb-cli](https://github.com/swe-bench/sb-cli/tree/main), our new cloud-based evaluation tool.
- * **[Jan. 11, 2025]**: Thanks to [Modal](https://modal.com/), you can now run evaluations entirely on the cloud! See [here](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md#-evaluation-with-modal) for more details.
+ * **[Jan. 11, 2025]**: Thanks to [Modal](https://modal.com/), you can now run evaluations entirely on the cloud! See [here](https://github.com/swe-bench/SWE-bench/blob/main/assets/evaluation.md#%EF%B8%8F-evaluation-with-modal) for more details.
* **[Aug. 13, 2024]**: Introducing *SWE-bench Verified*! Part 2 of our collaboration with [OpenAI Preparedness](https://openai.com/preparedness/). A subset of 500 problems that real software engineers have confirmed are solvable. Check out more in the [report](https://openai.com/index/introducing-swe-bench-verified/)!
- * **[Jun. 27, 2024]**: We have an exciting update for SWE-bench - with support from [OpenAI's Preparedness](https://openai.com/preparedness/) team: We're moving to a fully containerized evaluation harness using Docker for more reproducible evaluations! Read more in our [report](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).
- * **[Apr. 2, 2024]**: We have released [SWE-agent](https://github.com/princeton-nlp/SWE-agent), which sets the state-of-the-art on the full SWE-bench test set! ([Tweet 🔗](https://twitter.com/jyangballin/status/1775114444370051582))
+ * **[Jun. 27, 2024]**: We have an exciting update for SWE-bench - with support from [OpenAI's Preparedness](https://openai.com/preparedness/) team: We're moving to a fully containerized evaluation harness using Docker for more reproducible evaluations! Read more in our [report](https://github.com/swe-bench/SWE-bench/blob/main/docs/20240627_docker/README.md).
+ * **[Apr. 2, 2024]**: We have released [SWE-agent](https://github.com/SWE-agent/SWE-agent), which sets the state-of-the-art on the full SWE-bench test set! ([Tweet 🔗](https://twitter.com/jyangballin/status/1775114444370051582))
* **[Jan. 16, 2024]**: SWE-bench has been accepted to ICLR 2024 as an oral presentation! ([OpenReview 🔗](https://openreview.net/forum?id=VTF8yNQM66))

## 👋 Overview
@@ -77,8 +77,7 @@ python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Lite \
--predictions_path <path_to_predictions> \
--max_workers <num_workers> \
- --run_id <run_id> \
- --namespace swebench
+ --run_id <run_id>
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the evaluation run
```
@@ -104,12 +103,12 @@ python -m swebench.harness.run_evaluation --help
See the [evaluation tutorial](./assets/evaluation.md) for the full rundown on datasets you can evaluate.
If you're looking for non-local, cloud based evaluations, check out...
* [sb-cli](https://github.com/swe-bench/sb-cli), our tool for running evaluations automatically on AWS, or...
- * Running SWE-bench evaluation on [Modal](https://modal.com/). Details [here](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md#-evaluation-with-modal)
+ * Running SWE-bench evaluation on [Modal](https://modal.com/). Details [here](https://github.com/swe-bench/SWE-bench/blob/main/assets/evaluation.md#%EF%B8%8F-evaluation-with-modal)

Additionally, you can also:
* [Train](https://github.com/swe-bench/SWE-bench/tree/main/swebench/inference/make_datasets) your own models on our pre-processed datasets.
- * Run [inference](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/inference/README.md) on existing models (both local and API models). The inference step is where you give the model a repo + issue and have it generate a fix.
- * Run SWE-bench's [data collection procedure](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/collect/) ([tutorial](./assets/evaluation.md)) on your own repositories, to make new SWE-Bench tasks.
+ * Run [inference](https://github.com/swe-bench/SWE-bench/blob/main/swebench/inference/README.md) on existing models (both local and API models). The inference step is where you give the model a repo + issue and have it generate a fix.
+ * Run SWE-bench's [data collection procedure](https://github.com/swe-bench/SWE-bench/blob/main/swebench/collect/) ([tutorial](./assets/evaluation.md)) on your own repositories, to make new SWE-Bench tasks.

## ⬇️ Downloads
| Datasets | Models | RAG |
7 changes: 3 additions & 4 deletions assets/evaluation.md
@@ -26,14 +26,13 @@ python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Lite \
--predictions_path <path_to_predictions> \
--max_workers <num_workers> \
- --run_id <run_id> \
- --namespace swebench
+ --run_id <run_id>
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the run, logs will be written to ./logs/run_evaluation/<run_id>
# use --split to specify which split to evaluate on, usually `dev` or `test`
```

- You can run evaluation for the following (`dataset_name`, `--split`)
+ You can run evaluation for the following (`dataset_name`, `split`)
* `princeton-nlp/SWE-bench_Lite`, `test` (300 task instances)
* `princeton-nlp/SWE-bench_Verified`, `test` (500)
* `princeton-nlp/SWE-bench`, `dev` (225)
@@ -42,7 +41,7 @@ You can run evaluation for the following (`dataset_name`, `--split`)

You *cannot* run evaluation on the `test` split of `princeton-nlp/SWE-bench_Multimodal` using this repository (517 instances).
To encourage less intentional climbing of the leaderboard, we have intentionally made specifications for evaluating the test split private.
- You can submit to the leaderboard using
+ Use [sb-cli](https://github.com/swe-bench/sb-cli/) for SWE-bench Multimodal evaluation.

### 🌩️ Evaluation with Modal
You can also run evaluations entirely on the cloud using [Modal](https://modal.com/) to avoid local setup and resource constraints:
6 changes: 3 additions & 3 deletions docs/20240415_eval_bug/README.md
@@ -82,7 +82,7 @@ Requirement already satisfied: pytest in /n/fs/p-swe-bench/temp/seaborn/tmphkkam
```

Over time, the version number may increase. The solution we use for this is to explicitly specify versions for PyPI packages that are installed (e.g. `click==8.0.1`).
- Examples of this can be found throughout the `swebench/harness/constants.py` file, such as [here](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/harness/constants.py#L32).
+ Examples of this can be found throughout the `swebench/harness/constants.py` file, such as [here](https://github.com/swe-bench/SWE-bench/blob/main/swebench/harness/constants.py#L32).
</details>

**4. PyPI Package Dependency Updates**: Assuming failure modes #1, #2, and #3 don't occur (conda environment is created + is set up correctly), a last source of potential error is that the PyPI packages for dependencies is updated by the maintainers. At this time, based on the extent of our investigation, this is not a source of error for any task instances. However, if future versions of a PyPI package break prior functionality, this may cause an error.
@@ -91,7 +91,7 @@ Examples of this can be found throughout the `swebench/harness/constants.py` fil

The fix shown for Failure Mode #3 also resolves this situation.

- **5. P2P Tests with Machine-Specific Paths**: We found that some P2P tests collected automatically via the harness picked up on tests with machine-specific paths to local testing files (e.g. a test named `test-validation.py:[/n/fs/p-swe-bench/temp/pytest-tmpdir/TEXT_001.txt]`), which are impossible to resolve on other machines. To remedy this, we have either 1. Rewritten the log parsing logic to refactor machine-specific paths into just keeping the file name (see [here](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/harness/log_parsers.py#L28)), or 2. Removed these tests entirely.
+ **5. P2P Tests with Machine-Specific Paths**: We found that some P2P tests collected automatically via the harness picked up on tests with machine-specific paths to local testing files (e.g. a test named `test-validation.py:[/n/fs/p-swe-bench/temp/pytest-tmpdir/TEXT_001.txt]`), which are impossible to resolve on other machines. To remedy this, we have either 1. Rewritten the log parsing logic to refactor machine-specific paths into just keeping the file name (see [here](https://github.com/swe-bench/SWE-bench/blob/main/swebench/harness/log_parsers.py#L28)), or 2. Removed these tests entirely.
- 🟡 Low (< 10)
- Affected Repositories: requests, sympy

@@ -117,7 +117,7 @@ To identify and then fix these issues, we carried out the following steps:
To perform multiple rounds of validation, we run the `sweep_conda_links.py` script multiple times. We use manual inspection + several on the fly scripts to identify, then rename or remove any problematic tests.

## Outcomes
- We introduce the following fixes that are solutions to the discussed problems, which is generally starting from [#65](https://github.com/princeton-nlp/SWE-bench/pull/65) to the latest `1.0.x` releases:
+ We introduce the following fixes that are solutions to the discussed problems, which is generally starting from [#65](https://github.com/swe-bench/SWE-bench/pull/65) to the latest `1.0.x` releases:
* Fix conda version to `py39_23.10.0-1`.
* Specify specific pip package versions to install (e.g. `contourpy==1.1.0`).
* Add missing pip packages that need to be installed due to changes in the conda resolution logic.
12 changes: 6 additions & 6 deletions docs/README_CN.md
@@ -6,7 +6,7 @@

<div align="center">

- | [日本語](docs/README_JP.md) | [English](https://github.com/princeton-nlp/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |
+ | [日本語](docs/README_JP.md) | [English](https://github.com/swe-bench/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |

</div>

@@ -27,7 +27,7 @@
</a>
</p>

- 请访问我们的[网站](http://swe-bench.github.io)查看公共排行榜,并查看[更改日志](https://github.com/princeton-nlp/SWE-bench/blob/master/CHANGELOG.md)以获取有关 SWE-bench 基准最新更新的信息。
+ 请访问我们的[网站](http://swe-bench.github.io)查看公共排行榜,并查看[更改日志](https://github.com/swe-bench/SWE-bench/blob/master/CHANGELOG.md)以获取有关 SWE-bench 基准最新更新的信息。

## 👋 概述

@@ -97,8 +97,8 @@ python -m swebench.harness.run_evaluation --help

此外,SWE-Bench仓库还可以帮助你:
* 在我们预处理的数据集上训练你自己的模型
- * 在现有模型上运行[推理](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/inference/README.md)(无论是你本地的模型如LLaMA,还是你通过API访问的模型如GPT-4)。推理步骤是指给定一个仓库和一个问题,让模型尝试生成修复方案。
- * 在你自己的仓库上运行SWE-bench的[数据收集程序](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/collect/),以创建新的SWE-Bench任务。
+ * 在现有模型上运行[推理](https://github.com/swe-bench/SWE-bench/blob/main/swebench/inference/README.md)(无论是你本地的模型如LLaMA,还是你通过API访问的模型如GPT-4)。推理步骤是指给定一个仓库和一个问题,让模型尝试生成修复方案。
+ * 在你自己的仓库上运行SWE-bench的[数据收集程序](https://github.com/swe-bench/SWE-bench/blob/main/swebench/collect/),以创建新的SWE-Bench任务。

## ⬇️ 下载

@@ -115,8 +115,8 @@ python -m swebench.harness.run_evaluation --help

我们还编写了以下博客文章,介绍如何使用SWE-bench的不同部分。
如果你想看到关于特定主题的文章,请通过issue告诉我们。
- * [2023年11月1日] 为SWE-Bench收集评估任务 ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/collection.md))
- * [2023年11月6日] 在SWE-bench上进行评估 ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md))
+ * [2023年11月1日] 为SWE-Bench收集评估任务 ([🔗](https://github.com/swe-bench/SWE-bench/blob/main/assets/collection.md))
+ * [2023年11月6日] 在SWE-bench上进行评估 ([🔗](https://github.com/swe-bench/SWE-bench/blob/main/assets/evaluation.md))

## 💫 贡献
