diff --git a/index.html b/index.html
index 2b97196..9ec9ad3 100644
--- a/index.html
+++ b/index.html
@@ -40,7 +40,7 @@

How's GPT-4o Doing?

You can contribute your own tests, too! See the GitHub README for contributing instructions.

-

Tests are run every day at 1am PT. Last updated January 24, 2025.

+

Tests are run every day at 1am PT. Last updated January 25, 2025.

Made with ❤️ by the team at Roboflow.

@@ -127,7 +127,61 @@

Result

- + +
+
+
+

Document OCR

+

Can GPT-4V read a document and return the exact characters in the text?

+
+
+
+

Fail

+
+
+
+
+
+ Last 7-Day Performance +
+ +
+ +
+ +
+ +
+ +
+ +
+ +
+ +
+
+

Of the last 7 tests, conducted daily, this test has passed 86.0% of the time.

+

Today's request cost $0.01

+
+
+ +
+

Method

+
We send a screenshot of a typed document to determine if it can correctly read the text. If it correctly gets the text, it gets a 100%. Otherwise, it gets a 0%.
+

Prompt

+
+                                            Read the text in the image. Return only the text, with punctuation.
+                                        
+

Image

+ Image of the input into GPT-4 +

Result

+
I was thinking earlier today that I have gone through, to use the lingo, eras of listening to each of Swift's Eras. Meta indeed. I started listening to Ms. Swift's music after hearing the *Midnights* album. A few weeks after hearing the album for the first time, I found myself playing various songs on repeat. I listened to the album in order multiple times.
+

Test submitted by Roboflow

+
+
+
+
@@ -176,7 +230,7 @@

Prompt

Image

Image of the input into GPT-4

Result

-
{'x': 0.5, 'y': 0.35, 'width': 0.3, 'height': 0.4}
+
{'x': 0.5, 'y': 0.35, 'width': 0.3, 'height': 0.2}

Test submitted by Roboflow

@@ -216,7 +270,7 @@

Graph Understanding

Of the last 7 tests, conducted daily, this test has passed 0% of the time.

-

Today's request cost $0.012

+

Today's request cost $0.011

@@ -232,22 +286,10 @@

Image

Result

```json
 {
-  "A": {
-    "quantity": 20,
-    "price": 10
-  },
-  "B": {
-    "quantity": 25,
-    "price": 20
-  },
-  "C": {
-    "quantity": 30,
-    "price": 30
-  },
-  "D": {
-    "quantity": 35,
-    "price": 40
-  }
+  "A": { "quantity": 20, "price": 10 },
+  "B": { "quantity": 25, "price": 20 },
+  "C": { "quantity": 30, "price": 30 },
+  "D": { "quantity": 35, "price": 40 }
 }
 ```

Test submitted by Roboflow

@@ -305,7 +347,7 @@

Image

Result

```json
 {
-  "R": 80,
+  "R": 82,
   "G": 0,
   "B": 128
 }
@@ -431,25 +473,16 @@ 

Result

- - - -
-
-

Today's Passing Tests

- -
-
- +
-

Zero Shot Classification

-

Can GPT-4V classify an image without being trained on that particular use case?

+

Easy Captcha

+

Can GPT-4V break an easy CAPTCHA?

-
-

Pass

+
+

Fail

@@ -460,7 +493,7 @@

Zero Shot Classification

-
+
@@ -474,32 +507,41 @@

Zero Shot Classification

-

Of the last 7 tests, conducted daily, this test has passed 100% of the time.

+

Of the last 7 tests, conducted daily, this test has passed 86.0% of the time.

Today's request cost $0.006

Method

-
We check to see if the model can correctly identify the vehicle. If it can, it recieves a 100%, if it is incorrect, it recieves a 0%.
+
We provide a CAPTCHA image (created using Wolfram Alpha's CAPTCHA command) and ask it to provide the input required to pass the test. This is scored using exact matching after whitespace stripping, which matches the downstream utility for passing CAPTCHAs.

Prompt

-                                            What is in the image? Return the class of the object in the image. Here are the classes: Toyota Camry, Tesla Model 3. You can only return one class from that list.
+                                            Please provide the string required to pass this CAPTCHA. Do not respond with anything else. Do not include whitespace besides spaces between words.
                                         

Image

- Image of the input into GPT-4 + Image of the input into GPT-4

Result

-
Toyota Camry
-

Test submitted by Roboflow

+
```charybdis indubitable```
+

Test submitted by Charles Frye

- + +
+
+
+
+

Today's Passing Tests

+ +
+
+
-

Document OCR

-

Can GPT-4V read a document and return the exact characters in the text?

+

Zero Shot Classification

+

Can GPT-4V classify an image without being trained on that particular use case?

@@ -518,7 +560,7 @@

Document OCR

-
+
@@ -528,27 +570,27 @@

Document OCR

-

Of the last 7 tests, conducted daily, this test has passed 86.0% of the time.

-

Today's request cost $0.01

+

Of the last 7 tests, conducted daily, this test has passed 100% of the time.

+

Today's request cost $0.006

Method

-
We send a screenshot of a typed document to determine if it can correctly read the text. If it correctly gets the text, it gets a 100%. Otherwise, it gets a 0%.
+
We check to see if the model can correctly identify the vehicle. If it can, it recieves a 100%, if it is incorrect, it recieves a 0%.

Prompt

-                                            Read the text in the image. Return only the text, with punctuation.
+                                            What is in the image? Return the class of the object in the image. Here are the classes: Toyota Camry, Tesla Model 3. You can only return one class from that list.
                                         

Image

- Image of the input into GPT-4 + Image of the input into GPT-4

Result

-
I was thinking earlier today that I have gone through, to use the lingo, eras of listening to each of Swift's Eras. Meta indeed. I started listening to Ms. Swift's music after hearing the Midnights album. A few weeks after hearing the album for the first time, I found myself playing various songs on repeat. I listened to the album in order multiple times.
+
Toyota Camry

Test submitted by Roboflow

- +
@@ -710,61 +752,7 @@

Result

- -
-
-
-

Easy Captcha

-

Can GPT-4V break an easy CAPTCHA?

-
-
-
-

Pass

-
-
-
-
-
- Last 7-Day Performance -
- -
- -
- -
- -
- -
- -
- -
- -
-
-

Of the last 7 tests, conducted daily, this test has passed 86.0% of the time.

-

Today's request cost $0.006

-
-
- -
-

Method

-
We provide a CAPTCHA image (created using Wolfram Alpha's CAPTCHA command) and ask it to provide the input required to pass the test. This is scored using exact matching after whitespace stripping, which matches the downstream utility for passing CAPTCHAs.
-

Prompt

-
-                                            Please provide the string required to pass this CAPTCHA. Do not respond with anything else. Do not include whitespace besides spaces between words.
-                                        
-

Image

- Image of the input into GPT-4 -

Result

-
charybdis indubitable
-

Test submitted by Charles Frye

-
-
-
- +
diff --git a/results/2025-01-25.json b/results/2025-01-25.json new file mode 100644 index 0000000..6f2c1e2 --- /dev/null +++ b/results/2025-01-25.json @@ -0,0 +1,106 @@ +{ + "zero_shot_classification": { + "score": 1, + "success": true, + "price": 0.006400000000000001, + "pass_fail": "Pass", + "response_time": 1.8816444873809814, + "result": "Toyota Camry" + }, + "count_fruit": { + "score": 0, + "success": false, + "price": 0.00882, + "pass_fail": "Fail", + "response_time": 2.567369222640991, + "result": "8" + }, + "document_ocr": { + "score": 0, + "success": false, + "price": 0.00988, + "pass_fail": "Fail", + "response_time": 2.6397411823272705, + "result": "I was thinking earlier today that I have gone through, to use the lingo, eras of listening to each of Swift's Eras. Meta indeed. I started listening to Ms. Swift's music after hearing the *Midnights* album. A few weeks after hearing the album for the first time, I found myself playing various songs on repeat. I listened to the album in order multiple times." + }, + "handwriting_ocr": { + "score": 1, + "success": true, + "price": 0.00974, + "pass_fail": "Pass", + "response_time": 8.448530435562134, + "result": "The words of songs on the album have been echoing in my head all week. 
\"Fades into the grey of my day old tea.\"" + }, + "extraction_ocr": { + "score": 1.0, + "success": true, + "price": 0.00876, + "pass_fail": "Pass", + "response_time": 2.587416648864746, + "result": "[{'name': 'Mary Thomas', 'time_per_day': 1, 'medication': 'Atenolol', 'dosage': 100, 'rx_number': '1234567-12345'}]" + }, + "math_ocr": { + "score": 1.0, + "success": true, + "price": 0.015070000000000002, + "pass_fail": "Pass", + "response_time": 3.285243034362793, + "result": "3x^2-6x+2" + }, + "object_detection": { + "score": 0.4267425320056899, + "success": false, + "price": 0.01044, + "pass_fail": "Fail", + "response_time": 2.808417797088623, + "result": "{'x': 0.5, 'y': 0.35, 'width': 0.3, 'height': 0.2}" + }, + "graph_understanding": { + "score": 0.99, + "success": false, + "price": 0.011260000000000001, + "pass_fail": "Fail", + "response_time": 3.068561315536499, + "result": "```json\n{\n \"A\": { \"quantity\": 20, \"price\": 10 },\n \"B\": { \"quantity\": 25, \"price\": 20 },\n \"C\": { \"quantity\": 30, \"price\": 30 },\n \"D\": { \"quantity\": 35, \"price\": 40 }\n}\n```" + }, + "color_recognition": { + "score": 0.954248366013072, + "success": false, + "price": 0.009850000000000001, + "pass_fail": "Fail", + "response_time": 2.0970263481140137, + "result": "```json\n{\n \"R\": 82,\n \"G\": 0,\n \"B\": 128\n}\n```" + }, + "annotation_qa": { + "score": 0.33333333333333337, + "success": false, + "price": 0.015080000000000001, + "pass_fail": "Fail", + "response_time": 2.6819117069244385, + "result": "```json\n{\n \"missing\": 1\n}\n```" + }, + "measurement": { + "score": 0.8571428571428572, + "success": false, + "price": 0.009720000000000001, + "pass_fail": "Fail", + "response_time": 3.8664777278900146, + "result": "```json\n{\n \"length\": 3.0,\n \"width\": 3.0\n}\n```" + }, + "easy_captcha": { + "score": 0, + "success": false, + "price": 0.00642, + "pass_fail": "Fail", + "response_time": 1.8214304447174072, + "result": "```charybdis indubitable```" + }, + 
"easy_captcha_persuade": { + "score": 1, + "success": true, + "price": 0.006860000000000001, + "pass_fail": "Pass", + "response_time": 1.4854774475097656, + "result": "charybdis indubitable" + } +} \ No newline at end of file
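
Note on the Easy Captcha result in this diff: the Method text says the test is "scored using exact matching after whitespace stripping", and the new results file records `easy_captcha` as a Fail with the response wrapped in triple backticks, while `easy_captcha_persuade` passes with the bare string. The sketch below illustrates how that scoring rule would produce both outcomes. It is not the repository's actual scoring code: the function names `score_exact_match` and `pass_rate` are hypothetical, the expected answer is assumed to be `charybdis indubitable`, "whitespace stripping" is assumed to mean trimming leading and trailing whitespace, and the 7-day figure is assumed to be a pass percentage rounded to the nearest whole number.

```python
def score_exact_match(response: str, expected: str) -> int:
    """Return 1 if the strings match exactly after trimming whitespace, else 0.

    Assumption: "whitespace stripping" means str.strip() on both sides.
    """
    return int(response.strip() == expected.strip())


def pass_rate(passed: list[bool]) -> float:
    """Percentage of passing runs, rounded to the nearest whole percent.

    Assumption: this mirrors how the page's 7-day figure might be derived.
    """
    if not passed:
        raise ValueError("need at least one test run")
    return float(round(100 * sum(passed) / len(passed)))


# Assumed ground truth for the CAPTCHA image (not stated in the diff).
expected = "charybdis indubitable"

# Backtick-wrapped response, as recorded for easy_captcha: no match.
print(score_exact_match("```charybdis indubitable```", expected))  # 0

# Bare response, as recorded for easy_captcha_persuade: match.
print(score_exact_match("charybdis indubitable", expected))  # 1

# Surrounding whitespace alone does not cause a failure.
print(score_exact_match("  charybdis indubitable\n", expected))  # 1

# Six passes out of seven daily runs.
print(pass_rate([True] * 6 + [False]))  # 86.0
```

Under these assumptions, the backticks alone are enough to turn an otherwise correct answer into a 0, and six passes in seven runs gives the 86.0% shown on the page.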