Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated for datafusion 44.0.0 #301

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

pmcgleenon
Copy link
Contributor

This PR updates the datafusion Clickbench results for release 44.0.0

Ran the benchmarks on an EC2 instance described in the README: c6a.4xlarge, 500gb gp2 running Amazon Linux 2 AMI

@CLAassistant
Copy link

CLAassistant commented Feb 4, 2025

CLA assistant check
All committers have signed the CLA.

@pmcgleenon
Copy link
Contributor Author

@rschu1ze here's the updated results for datafusion 44.0.0. Not sure what's happening with the bot user commit to index.html, I didn't change that file

@alamb
Copy link

alamb commented Feb 4, 2025

In case anyone would like to see more backstory / discussion there is quite a bit here

@@ -19,7 +19,7 @@ fi
TRIES=3
QUERY_NUM=1
echo $1
cat queries.sql | while read -r query; do
cat queries.sql | while read query; do
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reverts the change from #252 from @JelteF for reasons explained here: apache/datafusion#13983 (comment)

I will freely admit to not being an expert in multi-level escaping in bash but reverting the change has fixed our results. I have not looked more carefully into this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That change should not be reverted. The Q28 query was being executed incorrectly before, to be much simpler than was intended.

I'm not sure why the query now fails for datafusion (the comment you linked does not contain an error). But reverting my fix is not the way that failure should be resolved.

Copy link
Contributor Author

@pmcgleenon pmcgleenon Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JelteF that makes sense, got it! Looking back to your original change read -r is required to capture the group from the regex (it seems to be picking up domains from the Referer column).

As you point out the original query didn't have the capture group \1

SELECT REGEXP_REPLACE("Referer", '^https?://(?:www.)?([^/]+)/.*$', '1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;

With your change from #252 the \1 appears which is needed to pick up the domain:

SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;

datafusion gives this error running with that query:

datafusion-cli -f create_single.sql query-with-r.sql
DataFusion CLI v43.0.0
0 row(s) fetched.
Elapsed 0.028 seconds.

SQL error: TokenizerError("unsupported escape char: '\\.'")

It runs fine if we make a couple of tweaks to the query:

  • fixed: (?:www\.)
  • used double escaping : \\1 instead of \1
SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k
HAVING COUNT(*) > 100000
ORDER BY l DESC LIMIT 25;

@alamb @JelteF does the modification above to that query look valid?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @JelteF does the modification above to that query look valid?

Yes, I think it does and follows the examples in
https://datafusion.apache.org/user-guide/sql/scalar_functions.html#regexp-replace

I am somewhat concerned that the changes on #252 basically invalidated a bunch of previous results, but I will comment on that ticket directly.

Here is the query I tested

SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k
HAVING COUNT(*) > 100000
ORDER BY l DESC LIMIT 25;

Making that change however does give different answers (and takes about 3x longer):

DataFusion CLI v45.0.0
+------------------------------+--------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| k                            | l                  | c       | min(hits.parquet.Referer)                                                                                                                                                                                                                                                                                                                                                                                                       |
+------------------------------+--------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| svpressa.ru                  | 283.7283855401811  | 242465  | http://svpressa.ru/                                                                                                                                                                                                                                                                                                                                                                                                             |
| msuzie-showforumdisplay      | 245.21045437713738 | 183636  | http://msuzie-showforumdisplay/63/~2/?name=&cost_neu%3D400%26retpath=default777TWCWkI9lalUwTXpJMU1q                                                                                                                                                                                                                                                                                                                             |
| saint-peters-total=меньше 80 | 223.21598396017976 | 200501  | http://saint-peters-total=меньше 80/cash/station=search&input_age1=&int[2274/trashbox.ru/details&cad=rjt&fu=0&input_who1=2&input_who1=2&input_city-4400047.html?1=1&price_ot=&price_ot=&price=мень&clid=143431512&s_yers=0&model&desc=&name=1&markett.ru%2Frenazheltyj-95692465&text=поздравстраницу&clid=4405404/menu_29.html&ei=UIX1UZTFNJTI0ci8guSn3tW.pl?cmd=showthreadreplies=978-4121-469171&history47/11/?from=&district |
| domics                       | 198.50872791952895 | 326080  | http://domics/825179.11931861234499792                                                                                                                                                                                                                                                                                                                                                                                          |
| e96.ru                       | 196.10750462709447 | 1018998 | http://e96.ru/%3Ffrom]=&input_act[count_num=0&dff=arian-carrina1201517&cad=rjt&fu=0&input_state/apartments-sale/list.htm?size=30&input_with_photo-takovo-zuevo-shpilki.ru/real.nn.ru/yandex.ua/searchAuto=on&input_who1=2&input_online.ru/clck/51cbd&ll=&top=http://loveplane=42621/page9                                                                                                                                       |
| gadgets.irr.ru               | 126.33208836777007 | 349675  | https://gadgets.irr.ru/2jmj7l5rSw0yVb                                                                                                                                                                                                                                                                                                                                                                                           |
| go.mail                      | 104.57850246727648 | 8227493 | http://go.mail/04/detskaia-moda-zhiensmed                                                                                                                                                                                                                                                                                                                                                                                       |
| google.ru                    | 104.02219755312633 | 2158346 | http://google.ru/                                                                                                                                                                                                                                                                                                                                                                                                               |
| msouz.ru                     | 103.07106523287989 | 301765  | http://msouz.ru/?ffshop                                                                                                                                                                                                                                                                                                                                                                                                         |
| state=19945206               | 102.14149048904294 | 512409  | http://state=19945206/foto-4/login%20NoTs3M&where=all&filmId=u8aGGqtWs3M&where=all&server=sc.cheloveplanet.ru/forum.little&categoriya/muzhskaya_istory.com/demii-malchik_148_4055074241537][to]=&input_activalry: A Love&where=all&text=купить florer&aktion/loansert перевозобности&site_id=197&text=добавь   лазерногорский  тайская полупая автомобильника&v=1414-resp_desyachnyeprague                                      |
| loveplanet.ru                | 99.86426938807375  | 461134  | http://loveplanet.ru/%3Faw_opel/page=2013                                                                                                                                                                                                                                                                                                                                                                                       |
| bonprix.ru                   | 99.77964049821476  | 1125057 | http://bonprix.ru/                                                                                                                                                                                                                                                                                                                                                                                                              |
| geomethiettai.ru             | 91.85402826428313  | 115906  | https://geomethiettai.ru/GameMain.aspx?group=houses/list=266559j7077&num=7&prune_day=lastdiscussd59394303&price_ot=&priceup&page3/sankt-petushhiblock                                                                                                                                                                                                                                                                           |
| yaroslavens.ru               | 88.5815401902163   | 124595  | http://yaroslavens.ru/main.aspx#catalog%2F1004-1100000147-otvet/actions/dislocal                                                                                                                                                                                                                                                                                                                                                |
| novjob.ru                    | 87.7909191350555   | 133049  | http://novjob.ru/                                                                                                                                                                                                                                                                                                                                                                                                               |
| kino                         | 86.9073791984018   | 120135  | http://kino/6/21/2/women.asp?whichpage4/#oversion=unreadm&uid                                                                                                                                                                                                                                                                                                                                                                   |
| mysw.info                    | 86.22567355918362  | 984546  | http://mysw.info/                                                                                                                                                                                                                                                                                                                                                                                                               |
| m.myloveplanet.ru            | 86.15602069365993  | 151544  | http://m.myloveplanet.ru/                                                                                                                                                                                                                                                                                                                                                                                                       |
| cn.ru                        | 84.85155686029164  | 124674  | http://cn.ru/GameMain.aspx#catalog/100523&tails.xml?market_pc.html?pid=9403&lr=47&msgid=36305f39386                                                                                                                                                                                                                                                                                                                             |
| povarenok.ru                 | 80.4026351589313   | 144811  | http://povarenok.ru/                                                                                                                                                                                                                                                                                                                                                                                                            |
| gorod                        | 78.1314030778123   | 110728  | http://gorod/%3Fauto.ria.ua%2Fjob                                                                                                                                                                                                                                                                                                                                                                                               |
| myloveplanet.ru              | 77.49123727188874  | 110582  | http://myloveplanet.ru/#associety/auto                                                                                                                                                                                                                                                                                                                                                                                          |
| tambov.irr.ru                | 75.0797639208672   | 315318  | http://tambov.irr.ru/0/c1/tgFtaeLDK0yb01A7xvQF08sjCFqQxn51                                                                                                                                                                                                                                                                                                                                                                      |
| state                        | 73.33104463740268  | 191521  | http://state/1721038250%26ie%3D31634&lr=2055b71ce92d72e9800039                                                                                                                                                                                                                                                                                                                                                                  |
| kurortmag.ru                 | 73.15373913939574  | 155263  | http://kurortmag.ru/                                                                                                                                                                                                                                                                                                                                                                                                            |
+------------------------------+--------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
25 row(s) fetched.
Elapsed 6.296 seconds.

Here is the result with . and 1 (no escapes):

DataFusion CLI v45.0.0
+-----------------------------------------------------------------------+-------------------+----------+-----------------------------------------------------------------------------------------------+
| k                                                                     | l                 | c        | min(hits.parquet.Referer)                                                                     |
+-----------------------------------------------------------------------+-------------------+----------+-----------------------------------------------------------------------------------------------+
| 1                                                                     | 81.60577503194366 | 66284655 | http://%26ad%3D1%25EA%25D0%26utm_source=web&cd=19590&input_onlist/би-2 место будущей кондицин |
| http:%2F%2Fwwww.regnancies/search&evL8gE&where=all&filmId=bEmYZc_WTDE | 69.0              | 207347   | http:%2F%2Fwwww.regnancies/search&evL8gE&where=all&filmId=bEmYZc_WTDE                         |
| http://loveche.html?ctid                                              | 24.0              | 144901   | http://loveche.html?ctid                                                                      |
| http://rukodeliveresult                                               | 23.0              | 226135   | http://rukodeliveresult                                                                       |
| http://holodilnik.ru                                                  | 20.0              | 133893   | http://holodilnik.ru                                                                          |
| http:%2F%2Fviewtopic                                                  | 20.0              | 391115   | http:%2F%2Fviewtopic                                                                          |
| http://smeshariki.ru                                                  | 20.0              | 210736   | http://smeshariki.ru                                                                          |
| http:%2F%2FviewType                                                   | 19.0              | 148907   | http:%2F%2FviewType                                                                           |
| http://новострашная                                                   | 19.0              | 731499   | http://новострашная                                                                           |
| http:%2F%2Fwwww.ukr                                                   | 19.0              | 655178   | http:%2F%2Fwwww.ukr                                                                           |
| http://state=2008                                                     | 17.0              | 139630   | http://state=2008                                                                             |
+-----------------------------------------------------------------------+-------------------+----------+-----------------------------------------------------------------------------------------------+
11 row(s) fetched.
Elapsed 1.967 seconds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Thanks for confirming, I've updated Q28 to include the additional escape characters and re-run the results. As expected, the Q28 results have regressed. It seems this is expected since that query is now working correctly.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you 🙏

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW the need for double escaping in datafusion-cli is trackd here:

Hopefully we'll fix it in the next release

Copy link

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @pmcgleenon 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants