Followup about recent question about Pickle #925

bobby-c-shen · 2020-05-25T23:48:12Z

This is a follow-up question to issue #652 about 25 days ago.
If this is an inappropriate place to ask this question, I completely understand. (In this case, I would also be very grateful if you could refer me to somewhere I might find the answer to my own questions, as I don't know where.)

As a follow-up to this recent issue, I am wondering if my outline of decoding, at least for the AAC and AC3 audio codecs, is accurate.

Let's assume that all sample rates are 48000 and that all audio is mono.

Main section 1. Is my outline of the decoding process correct? Which points are wrong? Some of the points are from https://libav.org/documentation/doxygen/master/group__lavc__encdec.html.

1. When decoding an audio stream, we segment the stream into an ordered list of packets, e.g. P[0], P[1], ..., P[999]. (Assume 1000 packets.) Concretely, packets are byte strings, and the first one may start after byte 0 because of a header, e.g. P[0] is bytes [40:540] and P[1] is bytes [540:1020].
2. This segmentation relies on syncwords to guard against total data corruption in case 1 byte is lost. If 1 byte is lost, then usually only 1 or 2 packets are affected. I think a Parser class in C helps with this.
3. For AAC, the packets have different numbers of bytes. AC3 files usually have a constant packet size.
4. In the C process, a decoding object D is initialized.
5. We pass packet P[0] to the D.avcodec_send_packet() method, returning output Y[0]. This effectively passes a small binary data string on the order of 500 bytes.
6. Since I'm assuming everything is mono audio, this method returns a 1-d array of floats. This method may change the internal state of D. We then pass packet P[1], then P[2], ..., P[999]. These successive calls return Y[1], Y[2], ..., Y[999], respectively. Because of the possible state change, it is important to pass the packets in a specific order.
   6.1. The internal state of D is potentially very complex. (How complex?)
7. This array has length (always?) 1024 for AAC and 1536 for AC3.
8. This page claims that frames stand alone. Does that mean that packets are decoded independently? Or does it just mean that 1024-sample frames are encoded independently? Or am I just misunderstanding?
   https://wiki.multimedia.cx/index.php/Understanding_AAC
9. (less important for me) If the packet timestamps of the stream are very uniform, then we will simply concatenate all of the returned arrays Y[0], ..., Y[999] into the full array, and this is the decoded array. If the packets have nonuniform timestamps, then we still might concatenate all of the arrays, or maybe insert zero samples, depending on the other parameters of the FFmpeg call.
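
To make the syncword point in the outline concrete, here is a minimal sketch on a toy byte stream. It is not how FFmpeg's parsers work. The two-byte marker happens to have the same value as the AC-3 syncword (0x0B77), but the constant packet size, the payload layout, and the names `build_stream` and `segment` are all assumptions made up for illustration.

```python
SYNC = b"\x0b\x77"  # two-byte marker; everything else about this format is a toy
PACKET_SIZE = 16    # hypothetical constant packet size, sync bytes included


def build_stream(n_packets):
    """Build toy packets: SYNC followed by a payload that never contains 0x0B."""
    packets = []
    for i in range(n_packets):
        payload = bytes([0x20 + (i % 64)]) * (PACKET_SIZE - len(SYNC))
        packets.append(SYNC + payload)
    return b"".join(packets)


def segment(stream):
    """Split the stream at every occurrence of SYNC and keep only full-size packets."""
    starts = []
    pos = stream.find(SYNC)
    while pos != -1:
        starts.append(pos)
        pos = stream.find(SYNC, pos + len(SYNC))
    ends = starts[1:] + [len(stream)]
    chunks = [stream[s:e] for s, e in zip(starts, ends)]
    return [c for c in chunks if len(c) == PACKET_SIZE]


clean = build_stream(10)
damaged = clean[:40] + clean[41:]  # drop one payload byte inside packet 2 (bytes 32..47)

good_clean = segment(clean)
good_damaged = segment(damaged)
lost = [p for p in good_clean if p not in good_damaged]
print(len(good_clean), len(good_damaged), len(lost))  # prints: 10 9 1
```

In this toy setting a single lost byte costs exactly one packet, because the next syncword lets the segmentation resynchronize. Real streams can contain the syncword value inside a payload, which is one reason real parsers do more than a plain search.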

--

Main section 2. Let's suppose that my outline in section 1 is accurate. If not, then the rest of my message might be moot.

Let's suppose we have initial decoder object D and either the AAC or AC3 codec and packets P[0], P[1], ..., P[999]. Assuming that the decoder state matters a lot, I'd like to consider 3 orders of passing the packets to D.

Order 1: The same order as the packets. P[0], P[1], ..., P[999]
Order 2: we remove P[0] completely. P[1], P[2], ..., P[999]
Order 3: We replace P[0] with an arbitrary packet, P_new. (e.g. P_new = P[1], but P_new could be an arbitrary packet not in the list.) P_new, P[1], ..., P[999]

In order 1, suppose that the output arrays are Y[0], Y[1], ..., Y[999]
In order 2, since the state may matter, we can't say that the first array output is Y[1]. Instead, we use different symbols Y2[1], Y2[2], ..., Y2[999]. (indexing from 1. This output list has 999 elements.)
In order 3, suppose that the output arrays are Y3[0], Y3[1], ..., Y3[999]. (1000 elements).
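
The three orders can be tried on a toy model to pin down the notation. This is only a sketch under a strong assumption: the hypothetical `ToyDecoder` below keeps nothing but the tail of the previous packet as state (loosely in the spirit of overlap-add), and packets are lists of floats instead of compressed bytes. It says nothing about how much state FFmpeg's AAC or AC3 decoders actually keep.

```python
HOP = 4  # hypothetical number of output samples per packet


class ToyDecoder:
    """Toy decoder whose only state is the tail of the previous packet."""

    def __init__(self):
        self.tail = [0.0] * HOP

    def decode(self, packet):
        # A toy packet is a list of 2 * HOP floats: a head followed by a tail.
        head, tail = packet[:HOP], packet[HOP:]
        out = [h + t for h, t in zip(head, self.tail)]  # add the saved tail
        self.tail = tail  # the state is replaced on every call
        return out


def run(packets):
    """Decode the packets in order with a fresh decoder and return the outputs."""
    d = ToyDecoder()
    return [d.decode(p) for p in packets]


P = [[float(i + 1)] * (2 * HOP) for i in range(6)]  # P[0], ..., P[5]
P_new = [100.0] * (2 * HOP)                         # arbitrary replacement packet

Y = run(P)                 # order 1
Y2 = [None] + run(P[1:])   # order 2, padded so that Y2[n] lines up with P[n]
Y3 = run([P_new] + P[1:])  # order 3

for n in range(1, len(P)):
    print(n, Y[n] == Y2[n], Y[n] == Y3[n])
```

With this one-packet state, only n = 1 differs and all three outputs agree from n = 2 onward. A decoder with longer-lived state would push that index later or never converge, which is exactly what the question is about.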

My main questions are: Is the state of D flushed fairly quickly, or is the state so persistent that any sequence 'mutation' will significantly change it, or is it somewhere in between? Although the lists Y, Y2, and Y3 are clearly similar waveforms perceptually, are they completely different at a low level, or do they converge?

If hypothetically the state of D is flushed after 50 packets, then would Y[n], Y2[n], Y3[n] be approximately equal length-1024 float arrays for n >= 51? Is there any such value of n? Or maybe the state of D depends on how many packets have been decoded and is otherwise flushed after 50 packets? If so, is Y[n] ~ Y3[n] for n >= 51 but Y[n] != Y2[n] for any large n, because the decoder processed n packets before outputting Y[n] but only n-1 packets before Y2[n]?
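
One way to phrase "is there any such value of n" as a check: given two runs indexed by packet number, find the smallest n from which every later pair of arrays agrees within a tolerance. This is a sketch with made-up data. The helper names `arrays_close` and `first_converged_index` and the tolerance are assumptions, not part of PyAV or FFmpeg.

```python
import math


def arrays_close(a, b, tol=1e-6):
    """True if a and b have the same length and every sample pair is within tol."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=tol) for x, y in zip(a, b)
    )


def first_converged_index(ref, other, tol=1e-6):
    """Return the smallest n with ref[k] ~ other[k] for every shared k >= n, or None.

    ref and other map packet index -> list of float samples, so a run with a
    deleted packet can still be compared index by index.
    """
    common = sorted(set(ref) & set(other))
    answer = None
    for n in reversed(common):  # walk back from the last packet
        if arrays_close(ref[n], other[n], tol):
            answer = n
        else:
            break
    return answer


# Made-up outputs: the runs disagree up to index 2 and agree from index 3 on.
Y = {n: [0.1 * n, 0.2 * n] for n in range(6)}
Y2 = {n: ([9.0, 9.0] if n <= 2 else [0.1 * n, 0.2 * n]) for n in range(1, 6)}

print(first_converged_index(Y, Y2))  # prints 3
```

Keying the outputs by packet index, not by position in the output list, avoids an off-by-one comparison when a packet has been deleted.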

Note that I have experimented with PyAV, and I suspect that for the AC3 codec and a deletion mutation, there is no such value of n: the decoder states will always be different. I do not know about a substitution mutation or the AAC codec, or whether I am doing my PyAV analysis correctly. I have only experimented with PyAV since I am not used to using C.

Sincerely,
Bobby
