-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated for datafusion 44.0.0 #301
base: main
Are you sure you want to change the base?
Conversation
@rschu1ze here's the updated results for datafusion |
In case anyone would like to see more backstory / discussion there is quite a bit here |
datafusion/run.sh
Outdated
@@ -19,7 +19,7 @@ fi | |||
TRIES=3 | |||
QUERY_NUM=1 | |||
echo $1 | |||
cat queries.sql | while read -r query; do | |||
cat queries.sql | while read query; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reverts the change from #252 from @JelteF for reasons explained here: apache/datafusion#13983 (comment)
I will freely admit to not being an expert in multi-level escaping in bash but reverting the change has fixed our results. I have not looked more carefully into this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That change should not be reverted. The Q28 query was being executed incorrectly before, to be much simpler than was intended.
I'm not sure why the query now fails for datafusion (the comment you linked does not contain an error). But reverting my fix is not the way that failure should be resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JelteF that makes sense, got it! Looking back to your original change read -r
is required to capture the group from the regex (it seems to be picking up domains from the Referer column).
As you point out the original query didn't have the capture group \1
SELECT REGEXP_REPLACE("Referer", '^https?://(?:www.)?([^/]+)/.*$', '1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
With your change from #252 the \1
appears which is needed to pick up the domain:
SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
datafusion gives this error running with that query:
datafusion-cli -f create_single.sql query-with-r.sql
DataFusion CLI v43.0.0
0 row(s) fetched.
Elapsed 0.028 seconds.
SQL error: TokenizerError("unsupported escape char: '\\.'")
It runs fine if we make a couple of tweaks to the query:
- fixed: (?:www\.)
- used double escaping :
\\1
instead of\1
SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k
HAVING COUNT(*) > 100000
ORDER BY l DESC LIMIT 25;
@alamb @JelteF does the modification above to that query look valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb @JelteF does the modification above to that query look valid?
Yes, I think it does and follows the examples in
https://datafusion.apache.org/user-guide/sql/scalar_functions.html#regexp-replace
I am somewhat concerned that the changes on #252 basically invalidated a bunch of previous results, but I will comment on that ticket directly.
Here is the query I tested
SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k
HAVING COUNT(*) > 100000
ORDER BY l DESC LIMIT 25;
Making that change however does give different answers (and takes about 3x longer):
DataFusion CLI v45.0.0
+------------------------------+--------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| k | l | c | min(hits.parquet.Referer) |
+------------------------------+--------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| svpressa.ru | 283.7283855401811 | 242465 | http://svpressa.ru/ |
| msuzie-showforumdisplay | 245.21045437713738 | 183636 | http://msuzie-showforumdisplay/63/~2/?name=&cost_neu%3D400%26retpath=default777TWCWkI9lalUwTXpJMU1q |
| saint-peters-total=меньше 80 | 223.21598396017976 | 200501 | http://saint-peters-total=меньше 80/cash/station=search&input_age1=&int[2274/trashbox.ru/details&cad=rjt&fu=0&input_who1=2&input_who1=2&input_city-4400047.html?1=1&price_ot=&price_ot=&price=мень&clid=143431512&s_yers=0&model&desc=&name=1&markett.ru%2Frenazheltyj-95692465&text=поздравстраницу&clid=4405404/menu_29.html&ei=UIX1UZTFNJTI0ci8guSn3tW.pl?cmd=showthreadreplies=978-4121-469171&history47/11/?from=&district |
| domics | 198.50872791952895 | 326080 | http://domics/825179.11931861234499792 |
| e96.ru | 196.10750462709447 | 1018998 | http://e96.ru/%3Ffrom]=&input_act[count_num=0&dff=arian-carrina1201517&cad=rjt&fu=0&input_state/apartments-sale/list.htm?size=30&input_with_photo-takovo-zuevo-shpilki.ru/real.nn.ru/yandex.ua/searchAuto=on&input_who1=2&input_online.ru/clck/51cbd&ll=&top=http://loveplane=42621/page9 |
| gadgets.irr.ru | 126.33208836777007 | 349675 | https://gadgets.irr.ru/2jmj7l5rSw0yVb |
| go.mail | 104.57850246727648 | 8227493 | http://go.mail/04/detskaia-moda-zhiensmed |
| google.ru | 104.02219755312633 | 2158346 | http://google.ru/ |
| msouz.ru | 103.07106523287989 | 301765 | http://msouz.ru/?ffshop |
| state=19945206 | 102.14149048904294 | 512409 | http://state=19945206/foto-4/login%20NoTs3M&where=all&filmId=u8aGGqtWs3M&where=all&server=sc.cheloveplanet.ru/forum.little&categoriya/muzhskaya_istory.com/demii-malchik_148_4055074241537][to]=&input_activalry: A Love&where=all&text=купить florer&aktion/loansert перевозобности&site_id=197&text=добавь лазерногорский тайская полупая автомобильника&v=1414-resp_desyachnyeprague |
| loveplanet.ru | 99.86426938807375 | 461134 | http://loveplanet.ru/%3Faw_opel/page=2013 |
| bonprix.ru | 99.77964049821476 | 1125057 | http://bonprix.ru/ |
| geomethiettai.ru | 91.85402826428313 | 115906 | https://geomethiettai.ru/GameMain.aspx?group=houses/list=266559j7077&num=7&prune_day=lastdiscussd59394303&price_ot=&priceup&page3/sankt-petushhiblock |
| yaroslavens.ru | 88.5815401902163 | 124595 | http://yaroslavens.ru/main.aspx#catalog%2F1004-1100000147-otvet/actions/dislocal |
| novjob.ru | 87.7909191350555 | 133049 | http://novjob.ru/ |
| kino | 86.9073791984018 | 120135 | http://kino/6/21/2/women.asp?whichpage4/#oversion=unreadm&uid |
| mysw.info | 86.22567355918362 | 984546 | http://mysw.info/ |
| m.myloveplanet.ru | 86.15602069365993 | 151544 | http://m.myloveplanet.ru/ |
| cn.ru | 84.85155686029164 | 124674 | http://cn.ru/GameMain.aspx#catalog/100523&tails.xml?market_pc.html?pid=9403&lr=47&msgid=36305f39386 |
| povarenok.ru | 80.4026351589313 | 144811 | http://povarenok.ru/ |
| gorod | 78.1314030778123 | 110728 | http://gorod/%3Fauto.ria.ua%2Fjob |
| myloveplanet.ru | 77.49123727188874 | 110582 | http://myloveplanet.ru/#associety/auto |
| tambov.irr.ru | 75.0797639208672 | 315318 | http://tambov.irr.ru/0/c1/tgFtaeLDK0yb01A7xvQF08sjCFqQxn51 |
| state | 73.33104463740268 | 191521 | http://state/1721038250%26ie%3D31634&lr=2055b71ce92d72e9800039 |
| kurortmag.ru | 73.15373913939574 | 155263 | http://kurortmag.ru/ |
+------------------------------+--------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
25 row(s) fetched.
Elapsed 6.296 seconds.
Here is the result with .
and 1
(no escapes):
DataFusion CLI v45.0.0
+-----------------------------------------------------------------------+-------------------+----------+-----------------------------------------------------------------------------------------------+
| k | l | c | min(hits.parquet.Referer) |
+-----------------------------------------------------------------------+-------------------+----------+-----------------------------------------------------------------------------------------------+
| 1 | 81.60577503194366 | 66284655 | http://%26ad%3D1%25EA%25D0%26utm_source=web&cd=19590&input_onlist/би-2 место будущей кондицин |
| http:%2F%2Fwwww.regnancies/search&evL8gE&where=all&filmId=bEmYZc_WTDE | 69.0 | 207347 | http:%2F%2Fwwww.regnancies/search&evL8gE&where=all&filmId=bEmYZc_WTDE |
| http://loveche.html?ctid | 24.0 | 144901 | http://loveche.html?ctid |
| http://rukodeliveresult | 23.0 | 226135 | http://rukodeliveresult |
| http://holodilnik.ru | 20.0 | 133893 | http://holodilnik.ru |
| http:%2F%2Fviewtopic | 20.0 | 391115 | http:%2F%2Fviewtopic |
| http://smeshariki.ru | 20.0 | 210736 | http://smeshariki.ru |
| http:%2F%2FviewType | 19.0 | 148907 | http:%2F%2FviewType |
| http://новострашная | 19.0 | 731499 | http://новострашная |
| http:%2F%2Fwwww.ukr | 19.0 | 655178 | http:%2F%2Fwwww.ukr |
| http://state=2008 | 17.0 | 139630 | http://state=2008 |
+-----------------------------------------------------------------------+-------------------+----------+-----------------------------------------------------------------------------------------------+
11 row(s) fetched.
Elapsed 1.967 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb Thanks for confirming, I've updated Q28 to include the additional escape characters and re-run the results. As expected, the Q28 results have regressed. It seems this is expected since that query is now working correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW the need for double escaping in datafusion-cli is trackd here:
Hopefully we'll fix it in the next release
4dc6cad
to
b89c76f
Compare
b89c76f
to
7c0a8ed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @pmcgleenon 🙏
This PR updates the datafusion Clickbench results for release 44.0.0
Ran the benchmarks on an EC2 instance described in the README: c6a.4xlarge, 500gb gp2 running Amazon Linux 2 AMI