Optimization: Provide a way to reduce kuzu db disk file (data.kz) size (VACUUM) #4798

bakkiaraj · 2025-01-27T06:59:03Z

Description

Is there a way to vacuum the Kuzu DB data files? The database takes more disk space than expected in my simple use cases. I am looking for a "VACUUM"-style command to repack or shrink the DB so that I can commit it to Git. (In my use case, the DB is collateral that downstream jobs refer to, and it needs to be safeguarded in version control for build reproduction.)

P.S.: I am using Git LFS for now, but I feel we should have a way to reduce the disk size of a Kuzu DB.

ray6080 · 2025-01-27T07:17:48Z

Hi @bakkiaraj, may I know whether you are performing lots of DELETE and DROP statements in your use case?
Unfortunately, we don't have a VACUUM command yet, though it's on our roadmap, and I agree it's something the database should have.

bakkiaraj · 2025-01-27T09:15:18Z

@ray6080, at work I am building an application that determines how the IPs inside an SoC are networked. This data is then stored in Kuzu as nodes and relations, the relations being IP bus types such as AXI. The resulting Kuzu DB is an asset used by multiple downstream build jobs to generate RTL code and run further analyses such as power consumption, security, etc.

I do not have many nodes; they number fewer than 1000. As of now we are not deleting or updating nodes, but this is planned.

To give some perspective: I have a table for nodes (each node has a few properties, including STRUCT properties). Python code CREATEs the nodes in a loop and then CREATEs the relations. For 72 nodes, the data.kz file is ~13 MB (Linux). That feels like too much.

This is why I was asking for a compact / vacuum facility.
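
For anyone who wants to report comparable numbers, here is a minimal sketch of how the on-disk footprint could be measured from Python. It uses only the standard library and does not call Kuzu at all. The helper names `db_size_bytes` and `bytes_per_node` and the path in the usage line are hypothetical, and it assumes the database is stored either as a single file (such as data.kz) or as a directory of files.

```python
import os


def db_size_bytes(path: str) -> int:
    """Return the total on-disk size of a database path.

    Works for a single file (e.g. data.kz) or for a directory,
    in which case the sizes of all contained files are summed.
    """
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def bytes_per_node(path: str, node_count: int) -> float:
    """Average number of bytes on disk per stored node."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return db_size_bytes(path) / node_count
```

Example usage (the path is an assumption): `bytes_per_node("my_db/data.kz", 72)`. Logging this after each batch of CREATE statements would show whether the file grows with the data or starts large and stays flat.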

ray6080 · 2025-01-27T13:07:36Z

Hi @bakkiaraj, may I know your table schema? I'd like to try reproducing this on my side. The unexpected data.kz file size might be because we reserve some extra space when there are few tuples, since we usually expect more tuples to arrive and amortize the space usage. Maybe we can make this less aggressive.

bakkiaraj · 2025-01-28T06:11:16Z

@ray6080, here is a representative schema:

CREATE NODE TABLE MYNODE (
    node_name STRING,
    type STRING,
    timer STRING,
    reset STRING,
    powerup STRING,
    voltas STRING,
    port_time FLOAT,
    defines_data JSON,
    band JSON,
    obs JSON,
    interface_type STRING,
    floorsuffix STRING,
    stitch_p STRING,
    address_blocks STRUCT(
        block_name STRING,
        address_end INT64,
        address_start INT64,
        is_internal_ap BOOLEAN,
        has_fbl BOOLEAN,
        rsf STRING,
        offset_based_decoding INT8,
        prog_sa BOOLEAN
    )[],
    fifo_properties JSON,
    s_id INT64,
    ch_instance_name STRING,
    b_interfaces JSON,
    fs STRING,
    fs_req STRING,
    c_id UINT16,
    legacy STRUCT(
        wrapper_info STRING
    ),
    PRIMARY KEY (node_name)
);

CREATE REL TABLE AXIBUS_REL (
    FROM MYNODE TO MYNODE,
    version INT8,
    addr_width INT16,
    data_channel STRING,
    data_width INT32,
    id_width INT8,
    user_data JSON,
    axi_flavor JSON
);

There are more REL schemas, but they are similar to AXIBUS_REL above.

ray6080 · 2025-02-02T04:03:14Z

@bakkiaraj thanks for sharing this! I think this is mainly because we optimistically reserve pages even when there are only a few tuples. We will see if we can find a better way to handle this.

bakkiaraj · 2025-02-02T04:12:58Z

@ray6080 Awesome, thanks. I will wait for the update.

bakkiaraj added the performance optimization label Jan 27, 2025
