-
I have an ibis dataframe, df, obtained from Trino through a series of data transformations, and I want to store it in ClickHouse. Because storing it directly leads to timeouts, I plan to create the table first and then insert into ClickHouse in batches.

First, attempting to create the table like this results in an error:

So I created the table manually in ClickHouse using DDL.

Second, batch insertion using to_pyarrow_batches also causes errors:

However, using to_pandas_batches works.

To guard against errors such as timeouts while inserting a batch into ClickHouse, I would like to add logic for retrying or handling failures. I can implement this myself, but I'd like to know whether there is a better approach.

Also, if I want to run this ETL operation in parallel using multiple threads, are there any best practices?
-
Thanks for opening a discussion!
I don't think con.create_table("table_name", df.schema(), overwrite=True) is supported in ClickHouse right now, probably because we haven't implemented it. IIRC ClickHouse does not support
This code
for batch in df.to_pyarrow_batches(limit=100, chunk_size=10):
    con.insert("algo_user", batch)
tries to insert a
Yep, because that API produces pandas DataFrames, which most backends support for insertion.
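For reference, here is a minimal sketch of the batched insert that worked, assuming `con` is an Ibis ClickHouse connection, `expr` is the Ibis table expression, and the target table already exists. The helper name `insert_in_batches` and the default `chunk_size` are made up for illustration and are not part of Ibis.

```python
def insert_in_batches(con, expr, table_name, chunk_size=10_000):
    """Insert an Ibis expression into an existing table one pandas batch at a time.

    `con` is assumed to expose `insert(name, obj)` and `expr` is assumed to
    expose `to_pandas_batches(chunk_size=...)`, as Ibis backends and table
    expressions do. Returns the total number of rows sent.
    """
    total_rows = 0
    for batch in expr.to_pandas_batches(chunk_size=chunk_size):
        # Each batch is a pandas DataFrame holding at most `chunk_size` rows.
        con.insert(table_name, batch)
        total_rows += len(batch)
    return total_rows


# Hypothetical usage, with the table name from the question:
# rows = insert_in_batches(con, df, "algo_user", chunk_size=10_000)
```

Smaller batches keep each insert short, which may help with the timeouts, at the cost of more round trips.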
I would suggest evaluating libraries that help with this. I'm not aware of any popular libraries for dealing with timeouts, but I'm sure they exist. Ibis is unlikely to grow specific support for handling timeouts, and instead we try to build the library to work well within the Python ecosystem so that you can compose tools and libraries together to suit your needs.
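If you do end up writing the retry logic by hand, a small wrapper with exponential backoff is one way to compose it around the insert call. This is only a sketch using the standard library: `insert_with_retry`, its parameters, and the choice of which exceptions to retry are all assumptions, and none of it is an Ibis API.

```python
import time


def insert_with_retry(insert_fn, batch, max_attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call insert_fn(batch), retrying with exponential backoff on failure.

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return insert_fn(batch)
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical usage with the batch loop from the question:
# for batch in df.to_pandas_batches(chunk_size=10_000):
#     insert_with_retry(lambda b: con.insert("algo_user", b), batch)
```

One caveat: a client-side timeout does not necessarily mean the insert failed on the server, so a retried batch could end up inserted twice. Whether that matters depends on the table and how you deduplicate.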
Personally, my advice here would be to avoid threads or any kind of concurrency until you're reasonably sure you need it. Single-threaded inserts might be fast enough for you, and they are a lot less complex to deal with 😃
Your specific question can be accomplished with your original code minus the overwrite=True bit: