Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include metadata in mcap merge #958

Merged
merged 2 commits into from
Nov 2, 2023

Conversation

wkalt
Copy link
Contributor

@wkalt wkalt commented Aug 29, 2023

Includes metadata records from input files in mcap merge via a new read option. This required a breaking change to read options to avoid a dependency cycle: since I need to supply a callback option to apply to metadata records, the readopts package required awareness of "mcap" while "mcap" required awareness of readopts for configuration.

To address this I have moved readopts.go under the mcap package. Users who upgrade the library will need to swap out the package name if they are using any options.

@wkalt wkalt requested review from jtbandes and james-rms August 29, 2023 14:56
@wkalt
Copy link
Contributor Author

wkalt commented Aug 29, 2023

when I was looking at this I initially figured I would knock out attachments as well, but attachments is actually going to require some additional thought from us. The problem is that the iterators used by the call to Messages are constructed when NewReader is called, not when Messages is called. The lexer is going to need to receive an attachment callback option so the construction is happening in the wrong place currently & needs to get unwound somehow.

if err != nil {
return nil, err
}
for _, idx := range info.MetadataIndexes {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of checking the MetadataIndexes if the offset is already known? This just seems like it would prevent reading metadata if there are no indexes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that's true. I guess I figured the only time anyone would be doing a read of a metadata record at a known offset would be through the index, but I could just blindly seek and try to parse and maybe that's more useful to someone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated that

@@ -40,6 +40,7 @@ type indexedMessageIterator struct {
hasReadSummarySection bool

compressedChunkAndMessageIndex []byte
metadataCallback func(*Metadata) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: conceptually/name-wise it's a bit weird that this is a "message iterator", yet it can also yield Metadata records.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it can't yield metadata records - it handles them with a callback. But maybe that's still weird? Not sure.

@@ -237,13 +238,54 @@ func (it *indexedMessageIterator) loadChunk(chunkIndex *ChunkIndex) error {
return nil
}

func readRecord(r io.Reader) (TokenType, []byte, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

surprising there wasn't already a function for this somewhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

func (it *indexedMessageIterator) Next(p []byte) (*Schema, *Channel, *Message, error) {
if !it.hasReadSummarySection {
err := it.parseSummarySection()
if err != nil {
return nil, nil, nil, err
}
// take care of the metadata here
if it.metadataCallback != nil {
for _, idx := range it.metadataIndexes {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be any advantage to trying to insert the metadata between messages based on where it appeared in the original file order? I guess since metadata records don't have a timestamp, that kind of implies there's no expected ordering with respect to messages. Seems like maybe this is at least worth a doc comment pointing out that an indexed message iterator will read all metadata records up front (if indexed).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah - not only do they not have timestamps, but they don't get interleaved with messages in the file, since the metadatas are written outside the chunks.

I think a bigger concern is whether metadata records are ordered with respect to each other. If you have collisions between merged metadata records, which should take precedence? We can make this deterministic but I'm not sure if we can make a principled decision.

@james-rms
Copy link
Collaborator

@wkalt are you still interested in pursuing this?

Includes metadata records from input files in mcap merge via a new read
option. This required a breaking change to read options to avoid a
dependency cycle: since I need to supply a callback option to apply to
metadata records, the readopts package required awareness of "mcap"
while "mcap" required awareness of readopts for configuration.

To address this I have moved readopts.go under the mcap package. Users
who upgrade the library will need to swap out the package name if they
are using any options.
@wkalt wkalt force-pushed the ticket/fg-4490/mcap-merge-include-metadata branch from 80b5b85 to 9f71938 Compare November 2, 2023 17:17
When supplied, metadata with duplicate names will be permitted. By
default, metadata with duplicate names will throw an error in merging.
Metadata with 100% duplicate content will be deduplicated in either
case.
@wkalt wkalt force-pushed the ticket/fg-4490/mcap-merge-include-metadata branch from 5fe1256 to 2d5279a Compare November 2, 2023 18:03
@wkalt wkalt merged commit a164ec1 into main Nov 2, 2023
@wkalt wkalt deleted the ticket/fg-4490/mcap-merge-include-metadata branch November 2, 2023 18:13
pezy pushed a commit to pezy/mcap that referenced this pull request Jan 11, 2024
Includes metadata records from input files in mcap merge via a new read
option. This required a breaking change to read options to avoid a
dependency cycle: since I need to supply a callback option to apply to
metadata records, the readopts package required awareness of "mcap"
while "mcap" required awareness of readopts for configuration.

To address this I have moved readopts.go under the mcap package. Users
who upgrade the library will need to swap out the package name if they
are using any options.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants