
Cherry-pick TiKV related changes to 8.10.fb #364

Closed
wants to merge 38 commits into 8.10.fb from 8.10-tikv

Conversation


@v01dstar v01dstar commented Apr 11, 2024

6.29 (last TiKV base) diff: facebook/rocksdb@6.29.fb...tikv:rocksdb:6.29.tikv

Already exist in upstream (fb):

No longer needed:

Need triage:

To be verified:

Complications:

  • WriteBufferManager has changed significantly upstream
  • SST file epoch numbers were introduced; instance merge needs to accommodate that change
  • The write-stall behavior change introduced by the multi-instance support project made some tests fail
  • RocksDB now uses the C++17 standard

v01dstar and others added 8 commits February 6, 2024 23:21
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
compaction_filter: add bottommost_level into context (tikv#160)

Signed-off-by: qupeng <qupeng@pingcap.com>
Signed-off-by: tabokie <xy.tao@outlook.com>

add range for compaction filter context (tikv#192)

* add range for compaction filter context

Signed-off-by: qupeng <qupeng@pingcap.com>
Signed-off-by: tabokie <xy.tao@outlook.com>

allow no_io for VersionSet::GetTableProperties (tikv#211)

* allow no_io for VersionSet::GetTableProperties

Signed-off-by: qupeng <qupeng@pingcap.com>
Signed-off-by: tabokie <xy.tao@outlook.com>

expose seqno from compaction filter and iterator (tikv#215)

This PR supports accessing `seqno` for every key/value pair in a compaction filter or iterator.
It helps enhance GC in TiKV's compaction filter.

Signed-off-by: qupeng <qupeng@pingcap.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
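
To make the intent of the commit above concrete, here is a minimal sketch of how a GC-style filter could use an exposed sequence number; the `ShouldDrop` hook and its parameter are illustrative assumptions, not the fork's actual signature.

  #include <cstdint>

  #include "rocksdb/compaction_filter.h"

  // Sketch only: a GC filter that drops versions whose sequence number is
  // older than a GC safe point. `ShouldDrop(value_seqno)` is a hypothetical
  // stand-in for whatever hook the tikv fork exposes for per-entry seqnos.
  class GcCompactionFilter : public rocksdb::CompactionFilter {
   public:
    explicit GcCompactionFilter(uint64_t gc_safe_point_seqno)
        : gc_safe_point_seqno_(gc_safe_point_seqno) {}

    bool ShouldDrop(uint64_t value_seqno) const {
      return value_seqno < gc_safe_point_seqno_;
    }

    const char* Name() const override { return "GcCompactionFilter"; }

   private:
    uint64_t gc_safe_point_seqno_;
  };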

allow to query DB stall status (tikv#226)

This PR adds a new property, is-write-stalled, to query whether the column family is in a write stall.

In TiKV there is a compaction filter used for GC, in which DB::Write is called. If we can query whether the DB instance is stalled, we can skip creating more compaction filter instances to save some resources.

Signed-off-by: qupeng <qupeng@pingcap.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
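
A minimal usage sketch of the idea described above, assuming the property is read through the standard DB::GetProperty path; the exact property string is inferred from the commit title and may differ in the fork.

  #include <string>

  #include "rocksdb/db.h"

  // Sketch: skip scheduling extra GC compaction-filter work while the
  // column family is stalled. The property name is an assumption.
  bool IsWriteStalled(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
    std::string value;
    if (db->GetProperty(cf, "rocksdb.is-write-stalled", &value)) {
      return value == "1";
    }
    return false;  // Property unavailable; assume not stalled.
  }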

Fix compatibility issue with Titan

Signed-off-by: v01dstar <yang.zhang@pingcap.com>

filter deletion in compaction filter (tikv#344)

And delay the buffer initialization of the writable file to the first actual write.

---------

Signed-off-by: tabokie <xy.tao@outlook.com>

Adjustments for compatibility with 8.10.facebook

Signed-off-by: v01dstar <yang.zhang@pingcap.com>

Adjust tikv-related changes to align with upstream

Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Ref tikv#277

When the iterator reads keys in reverse order, each Prev() call costs O(log n) time. So I added a prev pointer to every node in the skiplist to improve the Prev() function.

Signed-off-by: Little-Wallace <liuwei@pingcap.com>

Implemented new virtual functions:
- `InsertWithHintConcurrently`
- `FindRandomEntry`

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
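
Conceptually, the change above replaces a per-step O(log n) re-search with a stored back-pointer; a heavily simplified, fork-agnostic sketch follows (the real InlineSkipList uses arena allocation, atomics, and per-level links).

  // Simplified node layout for illustration only.
  struct Node {
    const char* key;
    Node* prev;     // Back-pointer at the lowest level, set on insert.
    Node* next[1];  // Forward pointers; only level 0 shown here.
  };

  // Without `prev`, Prev() must search from the head for the largest key
  // smaller than the current one: O(log n) per step.
  // With `prev`, it is a single pointer chase: O(1) per step.
  inline Node* PrevNode(Node* n) { return n->prev; }
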
Add WAL write duration metric

UCP tikv/tikv#6541

Signed-off-by: Wangweizhen <hawking.rei@gmail.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
I want to use the rocksdb::WriteBatch format to encode TiKV's key-value pairs, and I need a more efficient method to copy data from Entry into a WriteBatch directly so that I can avoid the CPU cost of decoding.

Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
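
For context, the WriteBatch representation being targeted here is a simple byte layout: an 8-byte sequence number, a 4-byte count, then one record per operation (a type tag followed by varint-length-prefixed key and value). Below is a rough, self-contained sketch of appending a Put record by hand, independent of the fork's actual API.

  #include <cstdint>
  #include <string>

  namespace sketch {

  // LevelDB/RocksDB-style varint32 encoding.
  void PutVarint32(std::string* dst, uint32_t v) {
    while (v >= 0x80) {
      dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
      v >>= 7;
    }
    dst->push_back(static_cast<char>(v));
  }

  void PutLengthPrefixed(std::string* dst, const std::string& s) {
    PutVarint32(dst, static_cast<uint32_t>(s.size()));
    dst->append(s);
  }

  // Append one Put record (kTypeValue == 0x1) to an existing batch rep.
  // The 12-byte header (8-byte sequence + 4-byte count) is assumed to be
  // maintained elsewhere, as WriteBatchInternal does in RocksDB.
  void AppendPutRecord(std::string* rep, const std::string& key,
                       const std::string& value) {
    rep->push_back(0x1);  // kTypeValue
    PutLengthPrefixed(rep, key);
    PutLengthPrefixed(rep, value);
  }

  }  // namespace sketch
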
@zhangjinpeng87 zhangjinpeng87 requested a review from Connor1996 May 10, 2024 02:13
v01dstar and others added 6 commits May 19, 2024 00:12
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Implement multi-batch write

Signed-off-by: v01dstar <yang.zhang@pingcap.com>

Fix SIGABRT caused by uninitialized mutex (tikv#296) (tikv#298)

* Fix SIGABRT caused by uninitialized mutex

Signed-off-by: Wenbo Zhang <ethercflow@gmail.com>

* Use spinlock instead of mutex to reduce writer ctor cost

Signed-off-by: Wenbo Zhang <ethercflow@gmail.com>

* Update db/write_thread.h

Co-authored-by: Xinye Tao <xy.tao@outlook.com>
Signed-off-by: Wenbo Zhang <ethercflow@gmail.com>

Co-authored-by: Xinye Tao <xy.tao@outlook.com>
Signed-off-by: Wenbo Zhang <ethercflow@gmail.com>

Co-authored-by: Xinye Tao <xy.tao@outlook.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
YuJuncen and others added 4 commits May 21, 2024 01:18
Signed-off-by: hillium <yujuncen@pingcap.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
v01dstar and others added 9 commits May 21, 2024 19:25
* Add copy constructor for ColumnFamilyHandleImpl

Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
* return sequence number of writes

Signed-off-by: 5kbpers <hustmh@gmail.com>

* fix compile error

Signed-off-by: 5kbpers <hustmh@gmail.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
…nFlushBegin event (tikv#300)

* add largest seqno of memtable

Signed-off-by: 5kbpers <hustmh@gmail.com>

* add test

Signed-off-by: 5kbpers <hustmh@gmail.com>

* address comment

Signed-off-by: 5kbpers <hustmh@gmail.com>

* address comment

Signed-off-by: 5kbpers <hustmh@gmail.com>

* format

Signed-off-by: 5kbpers <hustmh@gmail.com>

* memtable info

Signed-off-by: 5kbpers <hustmh@gmail.com>

Signed-off-by: 5kbpers <hustmh@gmail.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
A callback that is called after a write succeeds and the changes have been applied to the memtable.

Titan change: tikv/titan#270

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
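
As a sketch of the shape such a post-write hook can take (illustrative names only; the fork's actual interface should be checked alongside its WriteCallback headers):

  #include <atomic>
  #include <cstdint>

  // Illustrative only: a post-write hook invoked after the write has been
  // applied to the memtable, e.g. to publish the now-visible sequence number.
  class PostWriteCallbackSketch {
   public:
    virtual ~PostWriteCallbackSketch() = default;
    // Called once per successful write with the sequence assigned to it.
    virtual void OnApplied(uint64_t last_sequence) = 0;
  };

  class PublishSeqno : public PostWriteCallbackSketch {
   public:
    void OnApplied(uint64_t last_sequence) override {
      visible_seqno_.store(last_sequence, std::memory_order_release);
    }
    uint64_t Visible() const {
      return visible_seqno_.load(std::memory_order_acquire);
    }

   private:
    std::atomic<uint64_t> visible_seqno_{0};
  };
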
Summary:

Modify existing write buffer manager to support multiple instances.

Previously, a flush was triggered before user writes if `ShouldFlush()` returned
true. But in the multi-instance context, this would cause flushing for all
DBs that are undergoing writes.

In this patch, column families are registered to a shared linked list inside
the write buffer manager. When the flush condition is triggered, the column
family with the highest score from this list is chosen and flushed. The score
can be either size or age.

The flush condition calculation is also changed to exclude immutable memtables.
This is because RocksDB schedules a flush every time an immutable memtable is
generated, so they will eventually be evicted from memory as long as flush
bandwidth is not the bottleneck.

Test plan:

- Unit test cases
  - Trigger flush of largest/oldest memtable in another DB
  - Resolve flush condition by destroying CF/DB
  - Dynamically change flush threshold
- Manual tests with insert, update, and read-write workloads, [script](https://gist.github.com/tabokie/d38d27dc3843946c7813ab7bafd0f753).

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
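
A simplified sketch of the selection step described above; the registry entry and the scoring (size first, age as a tie-breaker) are illustrative, not the fork's actual types.

  #include <cstdint>
  #include <list>

  // Illustrative only: pick the registered column family with the highest
  // score when the write buffer manager decides a flush is needed.
  struct CfEntry {
    void* cf_handle;            // Opaque handle to the column family.
    uint64_t mutable_mem_size;  // Immutable memtables are excluded.
    uint64_t oldest_ts;         // Creation time of the oldest mutable memtable.
  };

  CfEntry* PickFlushCandidate(std::list<CfEntry>& registry) {
    CfEntry* best = nullptr;
    for (auto& e : registry) {
      if (best == nullptr || e.mutable_mem_size > best->mutable_mem_size ||
          (e.mutable_mem_size == best->mutable_mem_size &&
           e.oldest_ts < best->oldest_ts)) {
        best = &e;
      }
    }
    return best;  // Caller schedules a flush for this CF's DB.
  }
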
* fix bug of using post write callback with empty batch

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix nullptr

Signed-off-by: tabokie <xy.tao@outlook.com>

Signed-off-by: tabokie <xy.tao@outlook.com>
Add support for merging multiple DBs that have no overlapping data (tombstones included).

Memtables are frozen and then referenced by the target DB. Table files are hard linked
with new file numbers into the target DB. After merge, the sequence numbers of memtables
and L0 files will appear out-of-order compared to a single DB. But for any given user
key, the ordering still holds because there will only be one unique source DB that
contains the key and the source DB's ordering is inherited by the target DB.

If the source and target instances share the same block cache, the target instance will be able
to reuse the cache. This is done by cloning the source instances' table readers into the
target instance. Because the cache key is stored in the table reader, reads after the merge
can still retrieve the source instances' blocks via the old cache keys.

Under a release build, it takes 8 ms to merge a 25 GB DB (500 files) into another.

Signed-off-by: tabokie <xy.tao@outlook.com>
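
The table-file part of this boils down to hard-linking each source SST into the target DB under a freshly allocated file number, so no data is copied; a bare-bones sketch under that assumption (paths and numbering scheme are illustrative):

  #include <cstdint>
  #include <cstdio>
  #include <filesystem>
  #include <string>
  #include <vector>

  // Illustrative only: link source SST files into the target DB directory
  // with new file numbers taken from the target's counter.
  std::vector<std::string> LinkTables(const std::vector<std::string>& src_files,
                                      const std::string& target_dir,
                                      uint64_t* next_file_number) {
    std::vector<std::string> linked;
    for (const auto& src : src_files) {
      char name[32];
      std::snprintf(name, sizeof(name), "%06llu.sst",
                    static_cast<unsigned long long>((*next_file_number)++));
      std::string dst = target_dir + "/" + name;
      std::filesystem::create_hard_link(src, dst);  // Throws on failure.
      linked.push_back(dst);
    }
    return linked;
  }
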
* exclude uninitialized files when estimating compression ratio

Signed-off-by: tabokie <xy.tao@outlook.com>

* add comment

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix flaky test

Signed-off-by: tabokie <xy.tao@outlook.com>

---------

Signed-off-by: tabokie <xy.tao@outlook.com>
* hook delete dir in encrypted env

Signed-off-by: tabokie <xy.tao@outlook.com>

* add a comment

Signed-off-by: tabokie <xy.tao@outlook.com>

---------

Signed-off-by: tabokie <xy.tao@outlook.com>
tabokie and others added 7 commits May 21, 2024 23:49
* add toggle

Signed-off-by: tabokie <xy.tao@outlook.com>

* protect underflow

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix build

Signed-off-by: tabokie <xy.tao@outlook.com>

* remove deadline and add penalty for l0 files

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix build

Signed-off-by: tabokie <xy.tao@outlook.com>

* consider compaction trigger

Signed-off-by: tabokie <xy.tao@outlook.com>

---------

Signed-off-by: tabokie <xy.tao@outlook.com>
Also added a new option to detect whether manual compaction is disabled. In practice we use this to avoid blocking on flushing a tablet that will be destroyed shortly afterwards.

---------

Signed-off-by: tabokie <xy.tao@outlook.com>
…heckpoint (tikv#338)

* fix renaming encrypted directory

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix build

Signed-off-by: tabokie <xy.tao@outlook.com>

* patch test manager

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix build

Signed-off-by: tabokie <xy.tao@outlook.com>

* check compaction paused during checkpoint

Signed-off-by: tabokie <xy.tao@outlook.com>

* add comment

Signed-off-by: tabokie <xy.tao@outlook.com>

---------

Signed-off-by: tabokie <xy.tao@outlook.com>
And delay the buffer initialization of the writable file to the first actual write.

---------

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: Spade A <u6748471@anu.edu.au>
Signed-off-by: SpadeA-Tang <u6748471@anu.edu.au>
@v01dstar v01dstar force-pushed the 8.10-tikv branch 5 times, most recently from 2e011bf to 3ea895b on May 31, 2024 01:37
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
v01dstar added 2 commits May 31, 2024 23:33
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
return;
}

++total_requests_[pri];
Member

It looks like TiKV ran into a segfault on this line. I'm still trying to understand why, but I guess it's related to the addition of new IO priorities.

Here's the evidence:

  1. Segfault information:
[Mon Jun 24 10:59:14 2024] apply-1[2562783]: segfault at 7f68952f40a0 ip 000055bd6eacc09a sp 00007f6c1d621f90 error 6 in tikv-server[55bd69e00000+6052000]
  2. The static address of the segfault code can be calculated as ip (000055bd6eacc09a) - base_addr_of_tikv (55bd69e00000) = 0x4ccc09a.
  3. With the help of gdb, we can locate the code at that address.
$ gdb tikv-server
(gdb) info line *0x4ccc09a
Line 172 of "/workspace/.cargo/git/checkouts/rust-rocksdb-9e01d192e8b6561d/af14652/librocksdb_sys/rocksdb/utilities/rate_limiters/write_amp_based_rate_limiter.cc"                                                                                               
   starts at address 0x4ccc093 <rocksdb::WriteAmpBasedRateLimiter::Request(long, rocksdb::Env::IOPriority, rocksdb::Statistics*)+147>                                                                                                                            
   and ends at 0x4ccc0a2 <rocksdb::WriteAmpBasedRateLimiter::Request(long, rocksdb::Env::IOPriority, rocksdb::Statistics*)+162>.

Author

@v01dstar v01dstar Jun 24, 2024

RocksDB has introduced more priority types (User, Mid, etc.) since 8.x, while WriteAmpBasedRateLimiter only considered 3 of them. I thought it could still work, but maybe this is causing some problems.

Member

@hbisheng hbisheng Jun 25, 2024

One interesting thing: the segfault seemed to happen only on Linux machines; I wasn't able to reproduce it on Mac, so it could be arch/compiler related.

Also, I managed to get a coredump of the segfault on Linux.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005595b5961efa in rocksdb::WriteAmpBasedRateLimiter::Request (this=0x7ffa1e214000, bytes=46541, pri=<optimized out>, stats=0x0)
    at /root/tabokie/packages/cargo/.cargo/git/checkouts/rust-rocksdb-da3ff04b5d606849/ca1f1dd/librocksdb_sys/rocksdb/utilities/rate_limiters/write_amp_based_rate_limiter.cc:172
172	  ++total_requests_[pri];

It shows that total_requests_ was initialized with a length of 4 as expected, but pri was optimized out.

(gdb) print total_requests_
$1 = {1407, 16, 274, 0}
(gdb) print pri
$2 = <optimized out>

Given the definition of IOPriority, the only way for it to cause a segfault is when pri equals IO_TOTAL, but I don't think that's how we expect pri to be used...

  enum IOPriority {
    IO_LOW = 0,
    IO_MID = 1,
    IO_HIGH = 2,
    IO_USER = 3,
    IO_TOTAL = 4
  };

Still investigating...

Member

@hbisheng hbisheng Jun 26, 2024

Found the problem! It turns out that one of the Writer constructors did not initialize rate_limiter_priority. With this one-line fix, the segfault went away.

We might want to check how the bug was introduced and whether there could be other similar problems.
Update: The upstream 8.10.fb branch does not have this problem (db/write_thread.h), so it was likely an oversight when we cherry-picked the commits.
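
For readers unfamiliar with this failure mode, here is a minimal, self-contained illustration (not the fork's actual Writer class): an enum member left uninitialized by one constructor holds an indeterminate value and can index past the end of a fixed-size array, which matches the gdb evidence above.

  #include <cstdint>

  namespace sketch {

  enum IOPriority { IO_LOW = 0, IO_MID = 1, IO_HIGH = 2, IO_USER = 3, IO_TOTAL = 4 };

  // Buggy shape: the default constructor forgets `pri`, so its value is
  // indeterminate and ++counters[pri] below may write out of bounds.
  struct BuggyWriter {
    BuggyWriter() {}
    IOPriority pri;
  };

  // Fixed shape: every constructor (or a default member initializer) gives
  // the member a well-defined value. The concrete default chosen in the
  // actual fix may differ.
  struct FixedWriter {
    FixedWriter() : pri(IO_HIGH) {}
    IOPriority pri;
  };

  inline void Account(int64_t (&counters)[IO_TOTAL], IOPriority pri) {
    ++counters[pri];  // Safe only while pri < IO_TOTAL.
  }

  }  // namespace sketch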

Member

👍

Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
@ti-chi-bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Sep 22, 2024
@v01dstar v01dstar closed this Oct 21, 2024