From c87d4e5d91145d4ce2985725e84bcf276645c890 Mon Sep 17 00:00:00 2001
From: capjamesg
Date: Mon, 20 Jan 2025 01:28:28 +0000
Subject: [PATCH] Update results

---
 index.html              | 194 ++++++++++++++++++++--------------------
 results/2025-01-20.json | 106 ++++++++++++++++++++++
 2 files changed, 203 insertions(+), 97 deletions(-)
 create mode 100644 results/2025-01-20.json

diff --git a/index.html b/index.html
index 82005b6..8d8d7cb 100644
--- a/index.html
+++ b/index.html
@@ -40,7 +40,7 @@

How's GPT-4o Doing?

You can contribute your own tests, too! See the GitHub README for contributing instructions.

-

Tests are run every day at 1am PT. Last updated January 19, 2025.

+

Tests are run every day at 1am PT. Last updated January 20, 2025.

Made with ❤️ by the team at Roboflow.

@@ -181,61 +181,7 @@

Result

- -
-
-
-

Handwriting OCR

-

Can GPT-4V read handwriting?

-
-
-
-

Fail

-
-
-
-
-
- Last 7-Day Performance -
- -
- -
- -
- -
- -
- -
- -
- -
-
-

Of the last 7 tests, conducted daily, this test has passed 57.0% of the time.

-

Today's request cost $0.01

-
-
- -
-

Method

-
We send an image of a handwritten note to determine whether the model can read the text correctly. If it reads the text correctly, it scores 100%; otherwise, it scores 0%.
-

Prompt

-
-                                            Read the text in the image. Return only the text, with punctuation.
-                                        
-

Image

- Image of the input into GPT-4 -

Result

-
The words of songs on the album have been echoing in my head all week. "Fades into the grey of my day ok tea."
-

Test submitted by Roboflow

-
-
-
- +
@@ -284,7 +230,7 @@

Prompt

Image

Image of the input into GPT-4

Result

-
{'x': 0.4, 'y': 0.3, 'width': 0.2, 'height': 0.3}
+
{'x': 0.3, 'y': 0.4, 'width': 0.2, 'height': 0.3}

Test submitted by Roboflow

@@ -413,9 +359,9 @@

Image

Result

```json
 {
-  "R": 80,
+  "R": 76,
   "G": 0,
-  "B": 128
+  "B": 153
 }
 ```

Test submitted by Roboflow

@@ -471,13 +417,13 @@

Prompt

Image

Image of the input into GPT-4

Result

-
Based on the provided image, all visible cars appear to be labeled with red bounding boxes. It doesn't seem like there are any vehicles missing annotations.
-
-```json
+                                        
```json
 {
   "missing": 0
 }
-```
+``` + +Based on the visible cars in the image, there are appropriate bounding box annotations for each. No missing annotations are identified.

Test submitted by Roboflow

@@ -541,7 +487,61 @@

Result

- + +
+
+
+

Easy Captcha

+

Can GPT-4V break an easy CAPTCHA?

+
+
+
+

Fail

+
+
+
+
+
+ Last 7-Day Performance +
+ +
+ +
+ +
+ +
+ +
+ +
+ +
+ +
+
+

Of the last 7 tests, conducted daily, this test has passed 57.0% of the time.

+

Today's request cost $0.006

+
+
+ +
+

Method

+
We provide a CAPTCHA image (created using Wolfram Alpha's CAPTCHA command) and ask it to provide the input required to pass the test. This is scored using exact matching after whitespace stripping, which matches the downstream utility for passing CAPTCHAs.
+

Prompt

+
+                                            Please provide the string required to pass this CAPTCHA. Do not respond with anything else. Do not include whitespace besides spaces between words.
+                                        
+

Image

+ Image of the input into GPT-4 +

Result

+
```charybdis indubitable```
+

Test submitted by Charles Frye

+
+
+
+
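The Easy Captcha method text above describes exact matching after whitespace stripping. A minimal sketch of that rule follows. It assumes "stripping" means removing leading and trailing whitespace, and that the expected answer is `charybdis indubitable` (suggested by the passing `easy_captcha_persuade` result, but not confirmed by this patch). The function name `score_exact_match` is hypothetical.

```python
def score_exact_match(response: str, expected: str) -> int:
    """Return 1 if the two strings are equal once leading and trailing
    whitespace is stripped from both, otherwise 0."""
    return int(response.strip() == expected.strip())


# Assumed expected answer, for illustration only.
expected = "charybdis indubitable"

# A bare answer with a trailing newline still matches after stripping.
plain = score_exact_match("charybdis indubitable\n", expected)

# Backticks are not whitespace, so a response wrapped in them does not match.
fenced = score_exact_match("```charybdis indubitable```", expected)

print(plain, fenced)
```

Under these assumptions, a response wrapped in backticks fails even though the words themselves are correct, which would be consistent with the Fail shown for today's Easy Captcha result.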
@@ -585,7 +585,7 @@

Zero Shot Classification

Of the last 7 tests, conducted daily, this test has passed 100% of the time.

-

Today's request cost $0.005

+

Today's request cost $0.006

@@ -604,12 +604,12 @@

Result

- +
-

Structured Data OCR

-

Can GPT-4V extract structured data from an image?

+

Handwriting OCR

+

Can GPT-4V read handwriting?

@@ -626,34 +626,34 @@

Structured Data OCR

-
+
-
+
-
+
-

Of the last 7 tests, conducted daily, this test has passed 100% of the time.

+

Of the last 7 tests, conducted daily, this test has passed 57.0% of the time.

Today's request cost $0.009

Method

-
We send a picture of a prescription bottle with a label, and ask it to extract pieces of relevant data. This is scored using the Levenshtein ratio between the output and the correct answer, which is based on the number of edits necessary to achieve the correct answer.
+
We send an image of a handwritten note to determine whether the model can read the text correctly. If it reads the text correctly, it scores 100%; otherwise, it scores 0%.

Prompt

-                                            Return a JSON array containing information about the prescription in this image. Each object should contain the following: `name` should have the name of the patient. `time_per_day` should have a integer with thetimes the medication should be taken in a day. `medication` should have the brand name of the medication. `dosage` should have a integer in mg units of each tablet. `rx_number` should have the prescription number, also marked Rx. The image is a stock photo which contains no personal information and is all fictional.
+                                            Read the text in the image. Return only the text, with punctuation.
                                         

Image

- Image of the input into GPT-4 + Image of the input into GPT-4

Result

-
[{'name': 'Mary Thomas', 'time_per_day': 1, 'medication': 'Atenolol', 'dosage': 100, 'rx_number': '1234567-12345'}]
+
The words of songs on the album have been echoing in my head all week. "Fades into the grey of my day old tea."

Test submitted by Roboflow
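The "last 7 tests" figure above can be read as a simple pass rate over seven daily outcomes. The sketch below shows that arithmetic only. The `last_seven` values are hypothetical, and the rounding is an assumption, not the site's confirmed logic.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Return the percentage of True values in a non-empty list of outcomes."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return 100 * sum(outcomes) / len(outcomes)


# Hypothetical week of daily pass/fail outcomes: four passes, three fails.
last_seven = [True, False, True, True, False, False, True]

rate = pass_rate(last_seven)
print(f"{round(rate)}%")
```

Four passes in seven runs is about 57.14%, which rounds to the 57% figure shown on the page.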

@@ -662,8 +662,8 @@

Result

-

Math OCR

-

Can GPT-4V recognize math equations?

+

Structured Data OCR

+

Can GPT-4V extract structured data from an image?

@@ -693,31 +693,31 @@

Math OCR

Of the last 7 tests, conducted daily, this test has passed 100% of the time.

-

Today's request cost $0.015

+

Today's request cost $0.009

Method

-
We provide an image of a math equation and ask the model for a LaTeX string of the equation. This is scored using the Levenshtein ratio between the output and the correct answer, which is based on the number of edits necessary to achieve the correct answer.
+
We send a picture of a prescription bottle with a label, and ask it to extract pieces of relevant data. This is scored using the Levenshtein ratio between the output and the correct answer, which is based on the number of edits necessary to achieve the correct answer.

Prompt

-                                            Produce a JSON array with a LaTeX string of each equation in the image.
+                                            Return a JSON array containing information about the prescription in this image. Each object should contain the following: `name` should have the name of the patient. `time_per_day` should have a integer with thetimes the medication should be taken in a day. `medication` should have the brand name of the medication. `dosage` should have a integer in mg units of each tablet. `rx_number` should have the prescription number, also marked Rx. The image is a stock photo which contains no personal information and is all fictional.
                                         

Image

- Image of the input into GPT-4 + Image of the input into GPT-4

Result

-
3x^2-6x+2
+
[{'name': 'Mary Thomas', 'time_per_day': 1, 'medication': 'Atenolol', 'dosage': 100, 'rx_number': '1234567-12345'}]

Test submitted by Roboflow
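The Method text above scores output by a Levenshtein ratio. As a rough illustration, the sketch below computes the standard Levenshtein edit distance and normalizes it as `1 - distance / max(len(a), len(b))`. That normalization is an assumption: libraries define "ratio" in different ways, and this patch does not show which one the project uses. The function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 when every
    position must be edited."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


print(similarity_ratio("3x^2-6x+2", "3x^2-6x+2"))
print(similarity_ratio("3x^2-6x+3", "3x^2-6x+2"))
```

With this normalization an identical answer scores 1.0, and each wrong character lowers the score in proportion to the length of the longer string.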

- +
-

Easy Captcha

-

Can GPT-4V break an easy CAPTCHA?

+

Math OCR

+

Can GPT-4V recognize math equations?

@@ -736,37 +736,37 @@

Easy Captcha

-
+
-
+
-
+
-

Of the last 7 tests, conducted daily, this test has passed 57.0% of the time.

-

Today's request cost $0.006

+

Of the last 7 tests, conducted daily, this test has passed 100% of the time.

+

Today's request cost $0.015

Method

-
We provide a CAPTCHA image (created using Wolfram Alpha's CAPTCHA command) and ask it to provide the input required to pass the test. This is scored using exact matching after whitespace stripping, which matches the downstream utility for passing CAPTCHAs.
+
We provide an image of a math equation and ask the model for a LaTeX string of the equation. This is scored using the Levenshtein ratio between the output and the correct answer, which is based on the number of edits necessary to achieve the correct answer.

Prompt

-                                            Please provide the string required to pass this CAPTCHA. Do not respond with anything else. Do not include whitespace besides spaces between words.
+                                            Produce a JSON array with a LaTeX string of each equation in the image.
                                         

Image

- Image of the input into GPT-4 + Image of the input into GPT-4

Result

-
charybdis indubitable
-

Test submitted by Charles Frye

+
3x^2-6x+2
+

Test submitted by Roboflow

- +
diff --git a/results/2025-01-20.json b/results/2025-01-20.json
new file mode 100644
index 0000000..ff4f751
--- /dev/null
+++ b/results/2025-01-20.json
@@ -0,0 +1,106 @@
+{
+    "zero_shot_classification": {
+        "score": 1,
+        "success": true,
+        "price": 0.006400000000000001,
+        "pass_fail": "Pass",
+        "response_time": 1.8501577377319336,
+        "result": "Toyota Camry"
+    },
+    "count_fruit": {
+        "score": 0,
+        "success": false,
+        "price": 0.00882,
+        "pass_fail": "Fail",
+        "response_time": 1.7989351749420166,
+        "result": "8"
+    },
+    "document_ocr": {
+        "score": 0,
+        "success": false,
+        "price": 0.00988,
+        "pass_fail": "Fail",
+        "response_time": 2.0155937671661377,
+        "result": "I was thinking earlier today that I have gone through, to use the lingo, eras of listening to each of Swift's Eras. Meta indeed. I started listening to Ms. Swift's music after hearing the *Midnights* album. A few weeks after hearing the album for the first time, I found myself playing various songs on repeat. I listened to the album in order multiple times."
+    },
+    "handwriting_ocr": {
+        "score": 1,
+        "success": true,
+        "price": 0.00876,
+        "pass_fail": "Pass",
+        "response_time": 7.759821176528931,
+        "result": "The words of songs on the album have been echoing in my head all week. \"Fades into the grey of my day old tea.\""
+    },
+    "extraction_ocr": {
+        "score": 1.0,
+        "success": true,
+        "price": 0.00876,
+        "pass_fail": "Pass",
+        "response_time": 2.5330491065979004,
+        "result": "[{'name': 'Mary Thomas', 'time_per_day': 1, 'medication': 'Atenolol', 'dosage': 100, 'rx_number': '1234567-12345'}]"
+    },
+    "math_ocr": {
+        "score": 1.0,
+        "success": true,
+        "price": 0.015070000000000002,
+        "pass_fail": "Pass",
+        "response_time": 2.090965747833252,
+        "result": "3x^2-6x+2"
+    },
+    "object_detection": {
+        "score": 0.21871202916160393,
+        "success": false,
+        "price": 0.01044,
+        "pass_fail": "Fail",
+        "response_time": 3.221930742263794,
+        "result": "{'x': 0.3, 'y': 0.4, 'width': 0.2, 'height': 0.3}"
+    },
+    "graph_understanding": {
+        "score": 0.99,
+        "success": false,
+        "price": 0.01174,
+        "pass_fail": "Fail",
+        "response_time": 2.375352382659912,
+        "result": "```json\n{\n \"A\": {\n \"quantity\": 20,\n \"price\": 10\n },\n \"B\": {\n \"quantity\": 25,\n \"price\": 20\n },\n \"C\": {\n \"quantity\": 30,\n \"price\": 30\n },\n \"D\": {\n \"quantity\": 35,\n \"price\": 40\n }\n}\n```"
+    },
+    "color_recognition": {
+        "score": 0.9921568627450981,
+        "success": false,
+        "price": 0.009850000000000001,
+        "pass_fail": "Fail",
+        "response_time": 1.687368631362915,
+        "result": "```json\n{\n \"R\": 76,\n \"G\": 0,\n \"B\": 153\n}\n```"
+    },
+    "annotation_qa": {
+        "score": 0.0,
+        "success": false,
+        "price": 0.01583,
+        "pass_fail": "Fail",
+        "response_time": 3.1304142475128174,
+        "result": "```json\n{\n \"missing\": 0\n}\n``` \n\nBased on the visible cars in the image, there are appropriate bounding box annotations for each. No missing annotations are identified."
+    },
+    "measurement": {
+        "score": 0.8571428571428572,
+        "success": false,
+        "price": 0.009720000000000001,
+        "pass_fail": "Fail",
+        "response_time": 3.215785503387451,
+        "result": "```json\n{\n \"length\": 3.0,\n \"width\": 3.0\n}\n```"
+    },
+    "easy_captcha": {
+        "score": 0,
+        "success": false,
+        "price": 0.00642,
+        "pass_fail": "Fail",
+        "response_time": 1.7861318588256836,
+        "result": "```charybdis indubitable```"
+    },
+    "easy_captcha_persuade": {
+        "score": 1,
+        "success": true,
+        "price": 0.006860000000000001,
+        "pass_fail": "Pass",
+        "response_time": 1.4235508441925049,
+        "result": "charybdis indubitable"
+    }
+}
\ No newline at end of file
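For readers who want to work with a results file like the one added above, here is a small sketch that summarizes it. It embeds a three-test excerpt of `results/2025-01-20.json` (with the `response_time` and `result` fields trimmed) instead of reading from disk, so it runs on its own. The summary logic is illustrative and not part of this patch.

```python
import json

# Excerpt of results/2025-01-20.json; response_time and result are omitted.
excerpt = """
{
  "zero_shot_classification": {"score": 1, "success": true,
                               "price": 0.006400000000000001, "pass_fail": "Pass"},
  "count_fruit": {"score": 0, "success": false,
                  "price": 0.00882, "pass_fail": "Fail"},
  "math_ocr": {"score": 1.0, "success": true,
               "price": 0.015070000000000002, "pass_fail": "Pass"}
}
"""

results = json.loads(excerpt)

# Names of the tests whose "success" flag is true, in alphabetical order.
passed = sorted(name for name, entry in results.items() if entry["success"])

# Total request cost across the tests in the excerpt.
total_price = sum(entry["price"] for entry in results.values())

print(passed)
print(round(total_price, 5))
```

To run the same summary over the whole file, replace `json.loads(excerpt)` with `json.load` on an open file handle.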