Regarding t_cid in Neon heap WAL records #8499

Closed
hlinnaka opened this issue Jul 24, 2024 · 5 comments
Labels
c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug

Comments

@hlinnaka
Contributor

hlinnaka commented Jul 24, 2024

Reported by Muhammad Malik on pgsql-hackers:

https://www.postgresql.org/message-id/EA2P220MB0947DC092E176F5AEA297520A6AA2%40EA2P220MB0947.NAMP220.PROD.OUTLOOK.COM

Neon added a t_cid field to heap WAL records https://github.com/yibit/neon-postgresql/blob/main/docs/core_changes.md#add-t_cid-to-heap-wal-records.

However, when replaying the delete log record, it is discarding the combo flag and storing the raw cmax on the old tuple https://github.com/neondatabase/neon/blob/main/pgxn/neon_rmgr/neon_rmgr.c#L376. This will make the tuple header different from what is in the buffer cache if the deleted tuple was using a combocid. Similarly, there was no t_cid added for the old tuple in xl_neon_heap_update, and it is using the t_cid of the new tuple to set cmax on the old tuple during redo_neon_heap_update.

Why is this not a problem when a visibility check is performed on the tuple after reading from storage, since it won't get the correct cmin value on the old tuple?
Also, what is the need of adding the t_cid of the new tuple in xl_neon_heap_update when it is already present in the xl_neon_heap_header? Seems like it is sending the same t_cid twice with the update WAL record.
Thanks,
Muhammad

@hlinnaka hlinnaka added the t/bug Issue Type: Bug label Jul 24, 2024
@hlinnaka
Contributor Author

> However, when replaying the delete log record, it is discarding the combo flag and storing the raw cmax on the old tuple https://github.com/neondatabase/neon/blob/main/pgxn/neon_rmgr/neon_rmgr.c#L376. This will make the tuple header different from what is in the buffer cache if the deleted tuple was using a combocid.

Hmm, yes I think you're right, we're not setting the HEAP_COMBOCID flag correctly. Thanks for the report!
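
To make the mismatch concrete, here is a toy Python model of the delete case. It is only a sketch, not the C code in neon_rmgr.c: `TupleHeader` and `set_cmax` are hypothetical names, `set_cmax` mimics what PostgreSQL's `HeapTupleHeaderSetCmax` macro does with its `iscombo` argument, and it assumes the WAL record carries the raw `t_cid` value unchanged.

```python
from dataclasses import dataclass

HEAP_COMBOCID = 0x0020  # t_infomask bit: t_cid holds a combo command id


@dataclass
class TupleHeader:
    """Toy stand-in for the two tuple header fields that matter here."""
    t_cid: int = 0
    t_infomask: int = 0


def set_cmax(tup: TupleHeader, cid: int, iscombo: bool) -> None:
    """Model of HeapTupleHeaderSetCmax: store cid, then set or clear the flag."""
    tup.t_cid = cid
    if iscombo:
        tup.t_infomask |= HEAP_COMBOCID
    else:
        tup.t_infomask &= ~HEAP_COMBOCID


# On the primary, the deleting transaction also inserted the tuple, so the
# header in the buffer cache stores combo id 0 with the flag set.
in_cache = TupleHeader()
set_cmax(in_cache, cid=0, iscombo=True)

# Redo as described in the report: the same t_cid is applied, but with
# iscombo=False, so the flag is lost (assumed behaviour, per the issue text).
replayed = TupleHeader()
set_cmax(replayed, cid=0, iscombo=False)

headers_match = in_cache == replayed  # False: same t_cid, different infomask
```

With the flag cleared, a reader of the replayed tuple would interpret `t_cid` as a plain command id rather than as an index into the combo cid mapping.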

> Similarly, there was no t_cid added for the old tuple in xl_neon_heap_update, and it is using the t_cid of the new tuple to set cmax on the old tuple during redo_neon_heap_update.
>
> Why is this not a problem when a visibility check is performed on the tuple after reading from storage, since it won't get the correct cmin value on the old tuple?

It probably does cause problems; we apparently don't have enough test coverage for combocids.

> Also, what is the need of adding the t_cid of the new tuple in xl_neon_heap_update when it is already present in the xl_neon_heap_header? Seems like it is sending the same t_cid twice with the update WAL record.

Hmm, yeah, if no combocids are involved, the old and the new tuple will have the same command id, as cmax on the old tuple and as cmin on the new tuple. But with a combocid, that's not so, so we need to store both in the WAL record. But we're missing setting the HEAP_COMBOCID flag there as well.
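
The two update cases can be sketched with a small Python model. This is an illustration under stated assumptions, not the real implementation: `combo_cids`, `get_combo_cid` and `update_cids` are hypothetical names, and the list stands in for the backend-local mapping from a combo id to its (cmin, cmax) pair.

```python
from typing import List, Optional, Tuple

# Backend-local mapping: combo id (list index) -> (cmin, cmax).
combo_cids: List[Tuple[int, int]] = []


def get_combo_cid(cmin: int, cmax: int) -> int:
    """Return the combo id for (cmin, cmax), allocating one if needed."""
    pair = (cmin, cmax)
    if pair not in combo_cids:
        combo_cids.append(pair)
    return combo_cids.index(pair)


def update_cids(old_cmin_same_xact: Optional[int], cur_cid: int):
    """Return (old_t_cid, old_is_combo, new_t_cid) for an UPDATE at cur_cid.

    old_cmin_same_xact is the old tuple's cmin if the updating transaction
    also inserted it, else None.
    """
    if old_cmin_same_xact is None:
        # Old tuple came from another transaction: cmax is the plain cid.
        return cur_cid, False, cur_cid
    # Same transaction inserted and updated: old tuple needs both cmin
    # and cmax, so its t_cid becomes a combo id.
    return get_combo_cid(old_cmin_same_xact, cur_cid), True, cur_cid


plain = update_cids(None, 3)  # (3, False, 3): one value serves both tuples
combo = update_cids(1, 3)     # (0, True, 3): old and new t_cid differ
```

In the second case a single `t_cid` in the WAL record cannot describe both tuples, which is why the record needs one value for each.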

@knizhnik
Contributor

I wonder if it is enough to just store t_cid directly instead of using the HeapTupleHeaderSetCmin/Cmax macros, which affect the HEAP_COMBOCID flag. This flag is already stored in t_infomask, so there is no need to change it in the redo handler.

#8503
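
A minimal sketch of that idea in Python, assuming the WAL record carries the raw `t_cid` together with the combo bit. `DeleteRecord`, `log_delete` and `redo_delete` are hypothetical names for illustration only; the actual record layout and redo code in the PR may differ.

```python
from dataclasses import dataclass

HEAP_COMBOCID = 0x0020  # t_infomask bit: t_cid holds a combo command id


@dataclass
class TupleHeader:
    """Toy stand-in for the two tuple header fields that matter here."""
    t_cid: int = 0
    t_infomask: int = 0


@dataclass
class DeleteRecord:
    """Hypothetical WAL payload: raw t_cid plus the combo bit."""
    t_cid: int
    combocid: bool


def log_delete(tup: TupleHeader) -> DeleteRecord:
    """Capture t_cid as stored, along with the HEAP_COMBOCID flag."""
    return DeleteRecord(tup.t_cid, bool(tup.t_infomask & HEAP_COMBOCID))


def redo_delete(page_tup: TupleHeader, rec: DeleteRecord) -> None:
    """Store t_cid directly and restore the flag from the record."""
    page_tup.t_cid = rec.t_cid
    if rec.combocid:
        page_tup.t_infomask |= HEAP_COMBOCID
    else:
        page_tup.t_infomask &= ~HEAP_COMBOCID


primary = TupleHeader(t_cid=0, t_infomask=HEAP_COMBOCID)
replayed = TupleHeader()
redo_delete(replayed, log_delete(primary))

round_trip_ok = replayed == primary
```

Because the redo path never reinterprets `t_cid`, the replayed header ends up identical to the one the primary had in its buffer cache.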

@knizhnik
Contributor

It looks like our current test (test_runner/sql_regress/sql/neon-cid.sql) does not test CID handling at all.
At least, I commented out the assignment of t_cid in neon_rmgr.c and test_pg_regress still passed.

As far as I understand, to reproduce the problem we need to make some changes in a transaction, then force these pages to be thrown out of the cache, and then access them again.

I have added test_combocid.py in my PR; it does reproduce incorrect behaviour if we do not restore the CID.
But I failed to reproduce the problem with COMBOCID.

@ololobus
Member

ololobus commented Jul 30, 2024

This week:

@ololobus ololobus added the c/compute Component: compute, excluding postgres itself label Jul 30, 2024
@ololobus
Member

ololobus commented Aug 13, 2024

This week:

knizhnik added a commit that referenced this issue Aug 14, 2024
## Problem

See #8499

## Summary of changes

Save HEAP_COMBOCID flag in WAL and do not clear it in redo handlers.

Related Postgres PRs:
neondatabase/postgres#457
neondatabase/postgres#458
neondatabase/postgres#459


## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>