
Cherry-pick TiKV related changes to 8.10.fb #364

Closed
wants to merge 38 commits into 8.10.fb from 8.10-tikv

Conversation


@v01dstar v01dstar commented Apr 11, 2024

6.29 (last TiKV base) diff: facebook/rocksdb@6.29.fb...tikv:rocksdb:6.29.tikv

Already exist in upstream (fb):

No longer needed:

Need triage:

To be verified:

Complications:

  • WriteBufferManager has changed significantly upstream
  • SST file epoch numbers were introduced; instance merge needs to accommodate that change
  • The write-stall behavior change introduced by the multi-instance support project made some tests fail
  • RocksDB now uses the C++17 standard

v01dstar and others added 8 commits February 6, 2024 23:21
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
compaction_filter: add bottommost_level into context (tikv#160)

Signed-off-by: qupeng <qupeng@pingcap.com>
Signed-off-by: tabokie <xy.tao@outlook.com>

add range for compaction filter context (tikv#192)

* add range for compaction filter context

Signed-off-by: qupeng <qupeng@pingcap.com>
Signed-off-by: tabokie <xy.tao@outlook.com>

allow no_io for VersionSet::GetTableProperties (tikv#211)

* allow no_io for VersionSet::GetTableProperties

Signed-off-by: qupeng <qupeng@pingcap.com>
Signed-off-by: tabokie <xy.tao@outlook.com>

expose seqno from compaction filter and iterator (tikv#215)

This PR supports accessing `seqno` for every key/value pair in a compaction filter or iterator.
It helps enhance GC in TiKV's compaction filter.

Signed-off-by: qupeng <qupeng@pingcap.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
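
To make the intent of the commit above concrete, here is a minimal sketch of how a GC-style filter could use an exposed sequence number; the `ShouldDrop` hook and its parameter are illustrative assumptions, not the fork's actual signature.

  #include <cstdint>

  #include "rocksdb/compaction_filter.h"

  // Sketch only: a GC filter that drops versions whose sequence number is
  // older than a GC safe point. `ShouldDrop(value_seqno)` is a hypothetical
  // stand-in for whatever hook the tikv fork exposes for per-entry seqnos.
  class GcCompactionFilter : public rocksdb::CompactionFilter {
   public:
    explicit GcCompactionFilter(uint64_t gc_safe_point_seqno)
        : gc_safe_point_seqno_(gc_safe_point_seqno) {}

    bool ShouldDrop(uint64_t value_seqno) const {
      return value_seqno < gc_safe_point_seqno_;
    }

    const char* Name() const override { return "GcCompactionFilter"; }

   private:
    uint64_t gc_safe_point_seqno_;
  };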

allow to query DB stall status (tikv#226)

This PR adds a new property, is-write-stalled, to query whether the column family is in a write stall.

In TiKV there is a compaction filter used for GC, in which DB::Write is called. If we can query whether the DB instance is stalled, we can skip creating more compaction filter instances to save some resources.

Signed-off-by: qupeng <qupeng@pingcap.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
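
A minimal usage sketch of the idea described above, assuming the property is read through the standard DB::GetProperty path; the exact property string is inferred from the commit title and may differ in the fork.

  #include <string>

  #include "rocksdb/db.h"

  // Sketch: skip scheduling extra GC compaction-filter work while the
  // column family is stalled. The property name is an assumption.
  bool IsWriteStalled(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
    std::string value;
    if (db->GetProperty(cf, "rocksdb.is-write-stalled", &value)) {
      return value == "1";
    }
    return false;  // Property unavailable; assume not stalled.
  }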

Fix compatibility issue with Titan

Signed-off-by: v01dstar <yang.zhang@pingcap.com>

filter deletion in compaction filter (tikv#344)

And delay the buffer initialization of the writable file to the first actual write.

---------

Signed-off-by: tabokie <xy.tao@outlook.com>

Adjustments for compatibility with 8.10.facebook

Signed-off-by: v01dstar <yang.zhang@pingcap.com>

Adjust tikv-related changes to align with upstream

Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Ref tikv#277

When the iterator reads keys in reverse order, each Prev() call costs O(log n) time. So I added a prev pointer to every node in the skiplist to improve the Prev() function.

Signed-off-by: Little-Wallace <liuwei@pingcap.com>

Implemented new virtual functions:
- `InsertWithHintConcurrently`
- `FindRandomEntry`

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
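
Conceptually, the change above replaces a per-step O(log n) re-search with a stored back-pointer; a heavily simplified, fork-agnostic sketch follows (the real InlineSkipList uses arena allocation, atomics, and per-level links).

  // Simplified node layout for illustration only.
  struct Node {
    const char* key;
    Node* prev;     // Back-pointer at the lowest level, set on insert.
    Node* next[1];  // Forward pointers; only level 0 shown here.
  };

  // Without `prev`, Prev() must search from the head for the largest key
  // smaller than the current one: O(log n) per step.
  // With `prev`, it is a single pointer chase: O(1) per step.
  inline Node* PrevNode(Node* n) { return n->prev; }
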
Add WAL write duration metric

UCP tikv/tikv#6541

Signed-off-by: Wangweizhen <hawking.rei@gmail.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
I want to use the rocksdb::WriteBatch format to encode TiKV's key-value pairs, and I need a more efficient method to copy data from Entry into a WriteBatch directly so that I can avoid the CPU cost of decoding.

Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
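
For context, the WriteBatch representation being targeted here is a simple byte layout: an 8-byte sequence number, a 4-byte count, then one record per operation (a type tag followed by varint-length-prefixed key and value). Below is a rough, self-contained sketch of appending a Put record by hand, independent of the fork's actual API.

  #include <cstdint>
  #include <string>

  namespace sketch {

  // LevelDB/RocksDB-style varint32 encoding.
  void PutVarint32(std::string* dst, uint32_t v) {
    while (v >= 0x80) {
      dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
      v >>= 7;
    }
    dst->push_back(static_cast<char>(v));
  }

  void PutLengthPrefixed(std::string* dst, const std::string& s) {
    PutVarint32(dst, static_cast<uint32_t>(s.size()));
    dst->append(s);
  }

  // Append one Put record (kTypeValue == 0x1) to an existing batch rep.
  // The 12-byte header (8-byte sequence + 4-byte count) is assumed to be
  // maintained elsewhere, as WriteBatchInternal does in RocksDB.
  void AppendPutRecord(std::string* rep, const std::string& key,
                       const std::string& value) {
    rep->push_back(0x1);  // kTypeValue
    PutLengthPrefixed(rep, key);
    PutLengthPrefixed(rep, value);
  }

  }  // namespace sketch
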
@zhangjinpeng87 zhangjinpeng87 requested a review from Connor1996 May 10, 2024 02:13
v01dstar and others added 6 commits May 19, 2024 00:12
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Implement multi-batch write

Signed-off-by: v01dstar <yang.zhang@pingcap.com>

Fix SIGABRT caused by uninitialized mutex (tikv#296) (tikv#298)

* Fix SIGABRT caused by uninitialized mutex

Signed-off-by: Wenbo Zhang <ethercflow@gmail.com>

* Use spinlock instead of mutex to reduce writer ctor cost

Signed-off-by: Wenbo Zhang <ethercflow@gmail.com>

* Update db/write_thread.h

Co-authored-by: Xinye Tao <xy.tao@outlook.com>
Signed-off-by: Wenbo Zhang <ethercflow@gmail.com>

Co-authored-by: Xinye Tao <xy.tao@outlook.com>
Signed-off-by: Wenbo Zhang <ethercflow@gmail.com>

Co-authored-by: Xinye Tao <xy.tao@outlook.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
YuJuncen and others added 4 commits May 21, 2024 01:18
Signed-off-by: hillium <yujuncen@pingcap.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
v01dstar and others added 9 commits May 21, 2024 19:25
* Add copy constructor for ColumnFamilyHandleImpl

Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
* return sequence number of writes

Signed-off-by: 5kbpers <hustmh@gmail.com>

* fix compile error

Signed-off-by: 5kbpers <hustmh@gmail.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
…nFlushBegin event (tikv#300)

* add largest seqno of memtable

Signed-off-by: 5kbpers <hustmh@gmail.com>

* add test

Signed-off-by: 5kbpers <hustmh@gmail.com>

* address comment

Signed-off-by: 5kbpers <hustmh@gmail.com>

* address comment

Signed-off-by: 5kbpers <hustmh@gmail.com>

* format

Signed-off-by: 5kbpers <hustmh@gmail.com>

* memtable info

Signed-off-by: 5kbpers <hustmh@gmail.com>

Signed-off-by: 5kbpers <hustmh@gmail.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
A callback that is called after a write succeeds and the changes have been applied to the memtable.

Titan change: tikv/titan#270

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
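
As a sketch of the shape such a post-write hook can take (illustrative names only; the fork's actual interface should be checked alongside its WriteCallback headers):

  #include <atomic>
  #include <cstdint>

  // Illustrative only: a post-write hook invoked after the write has been
  // applied to the memtable, e.g. to publish the now-visible sequence number.
  class PostWriteCallbackSketch {
   public:
    virtual ~PostWriteCallbackSketch() = default;
    // Called once per successful write with the sequence assigned to it.
    virtual void OnApplied(uint64_t last_sequence) = 0;
  };

  class PublishSeqno : public PostWriteCallbackSketch {
   public:
    void OnApplied(uint64_t last_sequence) override {
      visible_seqno_.store(last_sequence, std::memory_order_release);
    }
    uint64_t Visible() const {
      return visible_seqno_.load(std::memory_order_acquire);
    }

   private:
    std::atomic<uint64_t> visible_seqno_{0};
  };
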
Summary:

Modify existing write buffer manager to support multiple instances.

Previously, a flush was triggered before user writes if `ShouldFlush()` returned
true. But in the multi-instance context, this would cause flushing for all
DBs that are undergoing writes.

In this patch, column families are registered to a shared linked list inside
the write buffer manager. When the flush condition is triggered, the column
family with the highest score from this list is chosen and flushed. The score
can be either size or age.

The flush condition calculation is also changed to exclude immutable memtables.
This is because RocksDB schedules a flush every time an immutable memtable is
generated, so they will eventually be evicted from memory as long as flush
bandwidth is not the bottleneck.

Test plan:

- Unit test cases
  - Trigger flush of largest/oldest memtable in another DB
  - Resolve flush condition by destroying CF/DB
  - Dynamically change flush threshold
- Manual tests with insert, update, and read-write workloads, [script](https://gist.github.com/tabokie/d38d27dc3843946c7813ab7bafd0f753).

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
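
A simplified sketch of the selection step described above; the registry entry and the scoring (size first, age as a tie-breaker) are illustrative, not the fork's actual types.

  #include <cstdint>
  #include <list>

  // Illustrative only: pick the registered column family with the highest
  // score when the write buffer manager decides a flush is needed.
  struct CfEntry {
    void* cf_handle;            // Opaque handle to the column family.
    uint64_t mutable_mem_size;  // Immutable memtables are excluded.
    uint64_t oldest_ts;         // Creation time of the oldest mutable memtable.
  };

  CfEntry* PickFlushCandidate(std::list<CfEntry>& registry) {
    CfEntry* best = nullptr;
    for (auto& e : registry) {
      if (best == nullptr || e.mutable_mem_size > best->mutable_mem_size ||
          (e.mutable_mem_size == best->mutable_mem_size &&
           e.oldest_ts < best->oldest_ts)) {
        best = &e;
      }
    }
    return best;  // Caller schedules a flush for this CF's DB.
  }
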
* fix bug of using post write callback with empty batch

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix nullptr

Signed-off-by: tabokie <xy.tao@outlook.com>

Signed-off-by: tabokie <xy.tao@outlook.com>
Add support for merging multiple DBs that have no overlapping data (tombstones included).

Memtables are frozen and then referenced by the target DB. Table files are hard linked
with new file numbers into the target DB. After merge, the sequence numbers of memtables
and L0 files will appear out-of-order compared to a single DB. But for any given user
key, the ordering still holds because there will only be one unique source DB that
contains the key and the source DB's ordering is inherited by the target DB.

If the source and target instances share the same block cache, the target instance will be able
to reuse the cache. This is done by cloning the source instances' table readers into the
target instance. Because the cache key is stored in the table reader, reads after the merge
can still retrieve the source instances' blocks via the old cache keys.

Under a release build, it takes 8 ms to merge a 25 GB DB (500 files) into another.

Signed-off-by: tabokie <xy.tao@outlook.com>
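
The table-file part of this boils down to hard-linking each source SST into the target DB under a freshly allocated file number, so no data is copied; a bare-bones sketch under that assumption (paths and numbering scheme are illustrative):

  #include <cstdint>
  #include <cstdio>
  #include <filesystem>
  #include <string>
  #include <vector>

  // Illustrative only: link source SST files into the target DB directory
  // with new file numbers taken from the target's counter.
  std::vector<std::string> LinkTables(const std::vector<std::string>& src_files,
                                      const std::string& target_dir,
                                      uint64_t* next_file_number) {
    std::vector<std::string> linked;
    for (const auto& src : src_files) {
      char name[32];
      std::snprintf(name, sizeof(name), "%06llu.sst",
                    static_cast<unsigned long long>((*next_file_number)++));
      std::string dst = target_dir + "/" + name;
      std::filesystem::create_hard_link(src, dst);  // Throws on failure.
      linked.push_back(dst);
    }
    return linked;
  }
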
* exclude uninitialized files when estimating compression ratio

Signed-off-by: tabokie <xy.tao@outlook.com>

* add comment

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix flaky test

Signed-off-by: tabokie <xy.tao@outlook.com>

---------

Signed-off-by: tabokie <xy.tao@outlook.com>
* hook delete dir in encrypted env

Signed-off-by: tabokie <xy.tao@outlook.com>

* add a comment

Signed-off-by: tabokie <xy.tao@outlook.com>

---------

Signed-off-by: tabokie <xy.tao@outlook.com>
tabokie and others added 7 commits May 21, 2024 23:49
* add toggle

Signed-off-by: tabokie <xy.tao@outlook.com>

* protect underflow

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix build

Signed-off-by: tabokie <xy.tao@outlook.com>

* remove deadline and add penalty for l0 files

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix build

Signed-off-by: tabokie <xy.tao@outlook.com>

* consider compaction trigger

Signed-off-by: tabokie <xy.tao@outlook.com>

---------

Signed-off-by: tabokie <xy.tao@outlook.com>
Also added a new option to detect whether manual compaction is disabled. In practice we use this to avoid blocking on flushing a tablet that will be destroyed shortly afterwards.

---------

Signed-off-by: tabokie <xy.tao@outlook.com>
…heckpoint (tikv#338)

* fix renaming encrypted directory

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix build

Signed-off-by: tabokie <xy.tao@outlook.com>

* patch test manager

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix build

Signed-off-by: tabokie <xy.tao@outlook.com>

* check compaction paused during checkpoint

Signed-off-by: tabokie <xy.tao@outlook.com>

* add comment

Signed-off-by: tabokie <xy.tao@outlook.com>

---------

Signed-off-by: tabokie <xy.tao@outlook.com>
And delay the buffer initialization of the writable file to the first actual write.

---------

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: Spade A <u6748471@anu.edu.au>
Signed-off-by: SpadeA-Tang <u6748471@anu.edu.au>
@v01dstar v01dstar force-pushed the 8.10-tikv branch 5 times, most recently from 2e011bf to 3ea895b on May 31, 2024 01:37
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
v01dstar added 2 commits May 31, 2024 23:33
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
return;
}

++total_requests_[pri];
Member

It looks like TiKV ran into a segfault on this line. I'm still trying to understand why, but I guess it's related to the addition of new IO priorities.

Here's the evidence:

  1. Segfault information:
[Mon Jun 24 10:59:14 2024] apply-1[2562783]: segfault at 7f68952f40a0 ip 000055bd6eacc09a sp 00007f6c1d621f90 error 6 in tikv-server[55bd69e00000+6052000]
  2. The static address of the segfault code can be calculated as ip (000055bd6eacc09a) - base_addr_of_tikv (55bd69e00000) = 0x4ccc09a.
  3. With the help of gdb, we can locate the code at that address.
$ gdb tikv-server
(gdb) info line *0x4ccc09a
Line 172 of "/workspace/.cargo/git/checkouts/rust-rocksdb-9e01d192e8b6561d/af14652/librocksdb_sys/rocksdb/utilities/rate_limiters/write_amp_based_rate_limiter.cc"                                                                                               
   starts at address 0x4ccc093 <rocksdb::WriteAmpBasedRateLimiter::Request(long, rocksdb::Env::IOPriority, rocksdb::Statistics*)+147>                                                                                                                            
   and ends at 0x4ccc0a2 <rocksdb::WriteAmpBasedRateLimiter::Request(long, rocksdb::Env::IOPriority, rocksdb::Statistics*)+162>.

Author

@v01dstar v01dstar Jun 24, 2024

RocksDB has introduced more priority types (User, Mid, etc.) since 8.x, while WriteAmpBasedRateLimiter only considered 3 of them. I thought it could still work, but maybe this is causing some problems.

Member

@hbisheng hbisheng Jun 25, 2024

One interesting thing: the segfault seemed to happen only on Linux machines; I wasn't able to reproduce it on Mac, so it could be arch/compiler related.

Also, I managed to get a coredump of the segfault on Linux.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005595b5961efa in rocksdb::WriteAmpBasedRateLimiter::Request (this=0x7ffa1e214000, bytes=46541, pri=<optimized out>, stats=0x0)
    at /root/tabokie/packages/cargo/.cargo/git/checkouts/rust-rocksdb-da3ff04b5d606849/ca1f1dd/librocksdb_sys/rocksdb/utilities/rate_limiters/write_amp_based_rate_limiter.cc:172
172	  ++total_requests_[pri];

It shows that total_requests_ was initialized with a length of 4 as expected, but pri was optimized out.

(gdb) print total_requests_
$1 = {1407, 16, 274, 0}
(gdb) print pri
$2 = <optimized out>

Given the definition of IOPriority, the only way for it to cause a segfault is when pri equals IO_TOTAL, but I don't think that's how we expect pri to be used...

  enum IOPriority {
    IO_LOW = 0,
    IO_MID = 1,
    IO_HIGH = 2,
    IO_USER = 3,
    IO_TOTAL = 4
  };

Still investigating...

Member

@hbisheng hbisheng Jun 26, 2024

Found the problem! It turns out that one of the Writer constructors did not initialize rate_limiter_priority. With this one-line fix, the segfault went away.

We might want to check how the bug was introduced and whether there could be other similar problems.
Update: The upstream 8.10.fb branch does not have this problem (db/write_thread.h), so it was likely an oversight when we cherry-picked the commits.
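
For readers unfamiliar with this failure mode, here is a minimal, self-contained illustration (not the fork's actual Writer class): an enum member left uninitialized by one constructor holds an indeterminate value and can index past the end of a fixed-size array, which matches the gdb evidence above.

  #include <cstdint>

  namespace sketch {

  enum IOPriority { IO_LOW = 0, IO_MID = 1, IO_HIGH = 2, IO_USER = 3, IO_TOTAL = 4 };

  // Buggy shape: the default constructor forgets `pri`, so its value is
  // indeterminate and ++counters[pri] below may write out of bounds.
  struct BuggyWriter {
    BuggyWriter() {}
    IOPriority pri;
  };

  // Fixed shape: every constructor (or a default member initializer) gives
  // the member a well-defined value. The concrete default chosen in the
  // actual fix may differ.
  struct FixedWriter {
    FixedWriter() : pri(IO_HIGH) {}
    IOPriority pri;
  };

  inline void Account(int64_t (&counters)[IO_TOTAL], IOPriority pri) {
    ++counters[pri];  // Safe only while pri < IO_TOTAL.
  }

  }  // namespace sketch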

Member

👍

Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
Signed-off-by: v01dstar <yang.zhang@pingcap.com>
@ti-chi-bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Sep 22, 2024
@v01dstar v01dstar closed this Oct 21, 2024