-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include metadata in mcap merge #958
Conversation
when I was looking at this I initially figured I would knock out attachments as well, but attachments is actually going to require some additional thought from us. The problem is that the iterators used by the call to Messages are constructed when NewReader is called, not when Messages is called. The lexer is going to need to receive an attachment callback option so the construction is happening in the wrong place currently & needs to get unwound somehow. |
6997448
to
80b5b85
Compare
go/mcap/reader.go
Outdated
if err != nil { | ||
return nil, err | ||
} | ||
for _, idx := range info.MetadataIndexes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of checking the MetadataIndexes if the offset is already known? This just seems like it would prevent reading metadata if there are no indexes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose that's true. I guess I figured the only time anyone would be doing a read of a metadata record at a known offset would be through the index, but I could just blindly seek and try to parse and maybe that's more useful to someone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated that
@@ -40,6 +40,7 @@ type indexedMessageIterator struct { | |||
hasReadSummarySection bool | |||
|
|||
compressedChunkAndMessageIndex []byte | |||
metadataCallback func(*Metadata) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: conceptually/name-wise it's a bit weird that this is a "message iterator", yet it can also yield Metadata records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it can't yield metadata records - it handles them with a callback. But maybe that's still weird? Not sure.
@@ -237,13 +238,54 @@ func (it *indexedMessageIterator) loadChunk(chunkIndex *ChunkIndex) error { | |||
return nil | |||
} | |||
|
|||
func readRecord(r io.Reader) (TokenType, []byte, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
surprising there wasn't already a function for this somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed
func (it *indexedMessageIterator) Next(p []byte) (*Schema, *Channel, *Message, error) { | ||
if !it.hasReadSummarySection { | ||
err := it.parseSummarySection() | ||
if err != nil { | ||
return nil, nil, nil, err | ||
} | ||
// take care of the metadata here | ||
if it.metadataCallback != nil { | ||
for _, idx := range it.metadataIndexes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be any advantage to trying to insert the metadata between messages based on where it appeared in the original file order? I guess since metadata records don't have a timestamp, that kind of implies there's no expected ordering with respect to messages. Seems like maybe this is at least worth a doc comment pointing out that an indexed message iterator will read all metadata records up front (if indexed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah - not only do they not have timestamps, but they don't get interleaved with messages in the file, since the metadatas are written outside the chunks.
I think a bigger concern is whether metadata records are ordered with respect to each other. If you have collisions between merged metadata records, which should take precedence? We can make this deterministic but I'm not sure if we can make a principled decision.
@wkalt are you still interested in pursuing this? |
Includes metadata records from input files in mcap merge via a new read option. This required a breaking change to read options to avoid a dependency cycle: since I need to supply a callback option to apply to metadata records, the readopts package required awareness of "mcap" while "mcap" required awareness of readopts for configuration. To address this I have moved readopts.go under the mcap package. Users who upgrade the library will need to swap out the package name if they are using any options.
80b5b85
to
9f71938
Compare
When supplied, metadata with duplicate names will be permitted. By default, metadata with duplicate names will throw an error in merging. Metadata with 100% duplicate content will be deduplicated in either case.
5fe1256
to
2d5279a
Compare
Includes metadata records from input files in mcap merge via a new read option. This required a breaking change to read options to avoid a dependency cycle: since I need to supply a callback option to apply to metadata records, the readopts package required awareness of "mcap" while "mcap" required awareness of readopts for configuration. To address this I have moved readopts.go under the mcap package. Users who upgrade the library will need to swap out the package name if they are using any options.
Includes metadata records from input files in mcap merge via a new read option. This required a breaking change to read options to avoid a dependency cycle: since I need to supply a callback option to apply to metadata records, the readopts package required awareness of "mcap" while "mcap" required awareness of readopts for configuration.
To address this I have moved readopts.go under the mcap package. Users who upgrade the library will need to swap out the package name if they are using any options.