How to write a None value to a column #174
I am converting Postgres-stored data into a parquet file, and I am going to deal with a sparse list of strings. How do I use the column writer to write those Option<String> values?

With this schema:

I get this error:

Is OPTIONAL not what I think it is?

Comments
There is no concept of a Null/None value type in Parquet. Null values are managed by definition levels: depending on the value of the definition level, you get the null value at the nesting level you need. So basically, if you use the low-level API, a simple example would be something like this:

```rust
writer.write_batch(&[1, 2, 3], Some(&[1, 1, 0, 1]), None);
```

This would result in a column that reads back as 1, 2, NULL, 3: the three values fill the slots whose definition level is 1, and the 0 marks the null record.
Of course, it gets more complicated when you have arrays or nested structs; but it looks like you are writing primitive values only, not complex ones like maps. OPTIONAL in the schema tells the reader/writer that this field is expected to have nulls, so it needs to create data structures for definition and repetition levels; otherwise, they are omitted. Hope that helps. It looks like we do need some high-level API for writes.
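For reference, a minimal sketch of declaring such an OPTIONAL field and checking that it parses, assuming parquet-rs's schema parser and a made-up column name:

```rust
use std::rc::Rc;
use parquet::schema::{parser::parse_message_type, printer::print_schema};

fn main() {
    // OPTIONAL tells the writer this column can hold nulls, so it will
    // track definition levels for it; a REQUIRED field would skip them.
    let schema = Rc::new(
        parse_message_type("message sample { OPTIONAL BYTE_ARRAY name (UTF8); }")
            .expect("valid schema"),
    );
    // Round-trip the schema to stdout to confirm it parsed as expected.
    print_schema(&mut std::io::stdout(), &schema);
}
```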
Thanks for the tips. I read over some Twitter blog posts and now I get it a little more: the definition level specifies how much of the structure is defined. It sounded like a bitmask at first, but I see the number quantifies how fully defined the value is. Now I'm doing two passes over the list of structs with an optional string: one pass to gather up the ByteArrays, the other to mark whether each value should be written or not. Unfortunately, I'm not getting it to work just yet! The file is always the same size, about 2MB. Here is the code I'm trying:
Got any tips on how to debug this? I'm going to try dumping the parquet file as-is to make sure the schema is correct.
Well, it looks alright. A couple of things. You probably want to make it like this:

```rust
if count % page_count != 0 {
    page_count += 1;
}
```

Another thing is that the values slice should only contain non-null elements. For example, if you have four records where the third is null, the values slice holds just the three non-null values, while the definition levels have one entry per record. But it looks like we need to make the write API better or more usable.
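A minimal sketch of that rule, reusing the numbers from the earlier example; this assumes the parquet-rs API of this era, where ColumnWriter is a plain enum handed out by a row group's next_column:

```rust
use parquet::{column::writer::ColumnWriter, errors::Result};

// Four records, the third one null: the values slice has three entries
// while the definition levels have four, one per record. Writing a
// placeholder into the values slice for the null record is the mistake
// to avoid.
fn write_sparse_ints(col_writer: &mut ColumnWriter) -> Result<()> {
    if let ColumnWriter::Int32ColumnWriter(ref mut typed) = *col_writer {
        let values = [1i32, 2, 3];
        let def_levels = [1i16, 1, 0, 1];
        typed.write_batch(&values, Some(&def_levels), None)?;
    }
    Ok(())
}
```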
Agreed. Ideally a user of this library should not need to know about repetition and definition levels. Maybe the column writer could have an "append" API which lets the user keep adding values and then generates the column once it's done.
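Something along these lines, purely as a hypothetical shape for that suggestion; none of these methods exist in the crate today:

```rust
// Hypothetical append-style API: nulls are passed as None and the
// writer derives definition levels internally.
let mut col = row_group.append_column("name")?; // hypothetical method
col.append(Some("alice"))?;
col.append(None::<&str>)?; // a null, with no def levels in sight
col.append(Some("bob"))?;
col.close()?; // the column is generated here
```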
Looks like my writer code had a variety of errors, namely:
I was able to track down those last two by creating a sample in-memory data set and making sure it could be read back. The last code snippet above was actually generating corrupt parquet files, failing to write the footer. I think this ties back to my "Help A Newbie Out" ticket #173. Here's the working sample, using the same parquet schema as above:
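(A minimal sketch along those lines, assuming the Rc-based parquet-rs API of this era and a single OPTIONAL BYTE_ARRAY column; the file name and data are made up:)

```rust
use std::{fs::File, rc::Rc};

use parquet::{
    column::writer::ColumnWriter,
    data_type::ByteArray,
    file::{
        properties::WriterProperties,
        writer::{FileWriter, SerializedFileWriter},
    },
    schema::parser::parse_message_type,
};

fn main() {
    let schema = Rc::new(
        parse_message_type("message sample { OPTIONAL BYTE_ARRAY name (UTF8); }")
            .expect("valid schema"),
    );
    let props = Rc::new(WriterProperties::builder().build());
    let file = File::create("sample.parquet").expect("create file");
    let mut writer = SerializedFileWriter::new(file, schema, props).expect("file writer");

    let mut row_group = writer.next_row_group().expect("row group");
    while let Some(mut col_writer) = row_group.next_column().expect("column") {
        if let ColumnWriter::ByteArrayColumnWriter(ref mut typed) = col_writer {
            // Three records; the middle one is null, so it shows up only
            // in the definition levels, never in the values slice.
            let values = [ByteArray::from("alice"), ByteArray::from("bob")];
            let def_levels = [1i16, 0, 1];
            typed
                .write_batch(&values, Some(&def_levels), None)
                .expect("write batch");
        }
        row_group.close_column(col_writer).expect("close column");
    }
    writer.close_row_group(row_group).expect("close row group");
    writer.close().expect("close file");
}
```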
Then I tested with the parquet-reader CLI (fetched with
and it looks right!
A little bit of refactoring and I've got something that will be useful as I expand my collection of writers. I imagine I'll be writing a lot of sparse lists of ByteArray data:
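(A sketch of the shape described, with made-up names: one pass that splits a sparse list of strings into the values/definition-levels pair and writes it to an OPTIONAL BYTE_ARRAY column, assuming the same era of the parquet-rs API as above.)

```rust
use parquet::{column::writer::ColumnWriter, data_type::ByteArray, errors::Result};

// Hypothetical reusable helper for sparse string columns. The values
// vector carries only the non-null entries; the definition levels carry
// one entry per record (1 = present, 0 = null).
fn write_optional_strings(col: &mut ColumnWriter, rows: &[Option<String>]) -> Result<()> {
    let mut values = Vec::new();
    let mut def_levels = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            Some(s) => {
                values.push(ByteArray::from(s.as_str()));
                def_levels.push(1i16);
            }
            None => def_levels.push(0i16),
        }
    }
    if let ColumnWriter::ByteArrayColumnWriter(ref mut typed) = *col {
        typed.write_batch(&values, Some(&def_levels), None)?;
    }
    Ok(())
}
```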
When called, it looks something like:
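(Again with made-up data; this assumes it runs inside a function returning parquet::errors::Result, with col_writer obtained from next_column:)

```rust
let rows = vec![
    Some("alice".to_string()),
    None, // this record's string is absent
    Some("bob".to_string()),
];
write_optional_strings(&mut col_writer, &rows)?;
```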
I've read that Rust's iterator code is sometimes suboptimal, so I will benchmark these two flavors of writing a column.

Edit: if it wasn't clear, I'm looking for feedback! Please let me know if this API is somewhat useful.
@sunchao Yes, you are right. I am also thinking about doing something similar to #173, to make sure users close column writers, row groups, and files. I was personally in favour of adding a Row API similar to the one for reads, where you give us a list of records and we take care of the levels.

@xrl Looks great! It would be fantastic if you could post the pain points you have encountered, so we could address them at some point. We could literally create commands to convert CSV and JSON into parquet; I am not sure how useful that would be. Let me know. Cheers!
Would it make sense to have a serde-style procedural macro for generating the writer?
Combined with a schema, I'm not sure the compiler can guarantee that what is generated is compatible, but it could alleviate boilerplate; see the sketch below.
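(Purely hypothetical: no such derive macro or trait exists in parquet-rs at this point. The idea would be a derive that expands each field into the write_batch boilerplate shown earlier in this thread.)

```rust
// Hypothetical derive; ParquetRecordWriter and the generated
// write_to_row_group method are invented for illustration.
#[derive(ParquetRecordWriter)]
struct LogEntry {
    host: String,            // would map to REQUIRED BYTE_ARRAY (UTF8)
    referer: Option<String>, // OPTIONAL BYTE_ARRAY (UTF8), nulls via def levels
    status: i32,             // REQUIRED INT32
}

// Usage the macro might enable:
// entries.write_to_row_group(&mut row_group)?;
```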
Yes, it could work. But I would suggest implementing a simple approach first, just to see how much work it is and what the potential pitfalls are; and it looks like you have already done some work! If we implemented a Row API like we did for reading, it would result in a much easier-to-use interface: the user would give us a list of records and potentially a schema, and we would create a proper parquet file and hide all of the complexity behind it. We could start working on a proposal for that and/or updating the low-level API! But it is a bit off topic from the original issue :)