Suitability and parameters for frequently updated data storage #477
Unanswered · alexander0042 asked this question in Q&A
Replies: 1 comment · 7 replies
- PR here adds a docs page to the website + improved switches: #482
I'm the developer behind Pirate Weather, a weather API that takes NOAA forecasts and shares them through a free, open, documented API. Currently, my back end works by overwriting the weather data (saved as Zarr arrays) on disk; however, because there are two processes (a sync process that keeps the arrays updated and a second process that handles queries and returns data), I'm hitting occasional file-locking issues and would rather switch to a database solution.
The speed, licensing, and hybrid storage make Garnet look ideal for this sort of task. The data I'm storing consists of ~50 GB of ~1 MB pages, updated roughly every few hours and stored on AWS direct-attached NVMe drives. I don't need any persistence or recovery, since all the data is saved elsewhere, but I do need to make sure that old values aren't kept on disk for long after new values are written.
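To make the access pattern concrete, here is roughly what the two processes do. The key names, paths, and port are made up for illustration, and I'm using redis-cli only because Garnet speaks RESP, so any Redis client would work:

```bash
# Hypothetical sync-side loop: every forecast cycle overwrites the same keys,
# so the previous values become dead data on disk until compaction reclaims them.
for chunk in /data/gfs/*.chunk; do
  redis-cli -p 6379 -x SET "zarr:gfs:$(basename "$chunk")" < "$chunk"
done

# The query process then does plain point reads, e.g.:
redis-cli -p 6379 GET "zarr:gfs:0.0.12" > /tmp/chunk.bin
```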
I'm trialling Garnet in Docker using these commands:
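Roughly along these lines — the memory and compaction values below are illustrative placeholders rather than my exact invocation, and I'm assuming the ghcr.io/microsoft/garnet image with extra arguments passed through to the server:

```bash
# Illustrative placeholder values, not a tuned configuration
docker run --network=host --ulimit memlock=-1 \
  ghcr.io/microsoft/garnet \
  --port 6379 \
  --memory 16g \
  --compaction-freq 1800 \
  --compaction-max-segments 32
```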
I've read through the docs in detail and am a little confused about the difference between the compaction types (I want to delete anything that's been overwritten), and about how the `--memory`, `--compaction-freq`, and `--compaction-max-segments` parameters relate: which one actually drives compaction? Are checkpoints required for compaction? I wasn't planning on using that feature.
I know this is a bit of an unusual use case, but as Zarr becomes more popular I could imagine it becoming a more common one! Thank you in advance for any help, and I'm happy to clarify anything about my setup.