Optimization: Provide a way to reduce kuzu db disk file (data.kz) size (VACUUM) #4798

Open
bakkiaraj opened this issue Jan 27, 2025 · 6 comments

Comments

@bakkiaraj

Description

Is there a way to vacuum the Kuzu DB data files? They take up more space than expected in my simple use cases. I am looking for a "VACUUM"-style command to repack and shrink the DB so that I can commit it to Git. (In my use case, the DB is collateral that downstream jobs refer to, and it needs to be safeguarded in version control for build reproduction.)

P.S.: I am using Git LFS for now, but I feel we should have a way to reduce the on-disk size of a Kuzu DB.

@ray6080
Contributor

ray6080 commented Jan 27, 2025

Hi @bakkiaraj, may I know whether you are performing lots of DELETE and DROP statements in your use case?
Unfortunately, we don't have a VACUUM command yet, though it is on our roadmap, and I agree it's something the database should have.

@bakkiaraj
Author

@ray6080, at work I am building an application that determines how the IPs inside an SoC are networked. This data is then stored in Kuzu as nodes and relations, the relations being IP bus types such as AXI. The resulting Kuzu DB is an asset that multiple downstream build jobs use to generate the RTL code and run further analysis such as power consumption, security, etc.

I do not have many nodes; they are on the order of <1000. As of now we are not deleting or updating nodes, but this is planned.

To give some perspective: I have a table for nodes (the nodes have a few properties, including STRUCT properties). Python code CREATEs the nodes in a loop and then CREATEs the relations. For 72 nodes, the data.kz file is ~13 MB (Linux OS). That feels like too much to me.

This is the reason I was asking for a compact / vacuum facility.
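
To put the numbers above in context, here is a minimal sketch for measuring a database's on-disk footprint and dividing it by the node count. It uses only the Python standard library and does not call Kuzu at all. The helper names on_disk_bytes and bytes_per_node are hypothetical, and treating ~13 MB as 13 MiB is an assumption made only for the arithmetic.

```python
import os


def on_disk_bytes(path):
    """Return the size in bytes of a file, or of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def bytes_per_node(total_bytes, node_count):
    """Average on-disk bytes per node; node_count must be positive."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return total_bytes / node_count


# Figures reported in this thread: ~13 MB for 72 nodes.
# Assumption: "13 MB" is read as 13 MiB for this calculation.
reported_bytes = 13 * 1024 * 1024
per_node_kib = bytes_per_node(reported_bytes, 72) / 1024
print(f"about {per_node_kib:.0f} KiB per node")

# To measure a real database, pass its path (hypothetical path shown):
# print(on_disk_bytes("path/to/db/data.kz"))
```

With the reported figures this works out to roughly 185 KiB per node, which is the ratio that prompted the request.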

@ray6080
Contributor

ray6080 commented Jan 27, 2025

Hi @bakkiaraj, may I know your table schema? I'd like to try reproducing this on my side. The unexpected data.kz file size might be because we reserve some extra space when there are few tuples, as we usually expect more tuples to arrive and amortize the space usage. But maybe we can make this less aggressive.
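
The amortization argument can be illustrated with a toy model. This is only a sketch under stated assumptions, not Kuzu's actual allocator: the function name bytes_per_tuple, the 1 MiB reservation, and the 200-byte payload are all hypothetical values chosen to show the shape of the effect.

```python
def bytes_per_tuple(num_tuples, reserved_bytes, payload_bytes_per_tuple):
    """Effective file bytes per tuple when a fixed amount is reserved up front.

    Toy model: the file occupies at least `reserved_bytes`, and grows with
    the payload once the payload exceeds the reservation.
    """
    if num_tuples <= 0:
        raise ValueError("num_tuples must be positive")
    used = num_tuples * payload_bytes_per_tuple
    file_bytes = max(reserved_bytes, used)
    return file_bytes / num_tuples


# HYPOTHETICAL numbers, for illustration only.
RESERVED = 1024 * 1024   # assumed up-front reservation, in bytes
PAYLOAD = 200            # assumed real data per tuple, in bytes

for n in (72, 1_000, 100_000):
    print(n, round(bytes_per_tuple(n, RESERVED, PAYLOAD)))
```

In this model the per-tuple cost is dominated by the reservation when there are few tuples and falls toward the payload size as the table grows, which is the behavior described above.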

@bakkiaraj
Author

@ray6080, here is a representative schema:

CREATE NODE TABLE MYNODE (
                node_name STRING,
                type STRING,
                timer STRING,
                reset STRING,
                powerup STRING,
                voltas STRING,
                port_time FLOAT,
                defines_data JSON,
                band JSON,
                obs JSON,
                interface_type STRING,
                floorsuffix STRING,
                stitch_p STRING,
                address_blocks STRUCT(
                    block_name STRING,
                    address_end INT64,
                    address_start INT64,
                    is_internal_ap BOOLEAN,
                    has_fbl BOOLEAN,
                    rsf STRING,
                    offset_based_decoding INT8,
                    prog_sa BOOLEAN
                    )[],
                fifo_properties JSON,
                s_id INT64,
                ch_instance_name STRING,
                b_interfaces JSON,
                fs STRING,
                fs_req STRING,
                c_id UINT16,
                legacy STRUCT(
                    wrapper_info STRING
                ),
                PRIMARY KEY (node_name)
                );


CREATE REL TABLE AXIBUS_REL (
                        FROM MYNODE TO MYNODE,
                        version INT8,
                        addr_width INT16,
                        data_channel STRING,
                        data_width INT32,
                        id_width INT8,
                        user_data JSON,
                        axi_flavor JSON
                        );

There are more REL schemas, but they are similar to AXIBUS_REL above.

@ray6080
Contributor

ray6080 commented Feb 2, 2025

@bakkiaraj thanks for sharing this! I think this is mainly because we optimistically reserve pages even when there are only a few tuples. We will see if we can find a better way to handle this.

@bakkiaraj
Author

@ray6080 awesome. Thanks. Will wait for the update.


2 participants