I've looked at the RFC but it's not clear how channels are encoded. Are channels implicitly associated with frames in a VBR Code 3 packet? I can see a stereo flag in the format but it's not clear to me how this would interact with multiple channels either.
Related
So I have an esp32 which captures images and sound. The esp32-camera library already returns the jpeg encoded buffer. The audio however is uncompressed and is just a digital representation of signal strength at high sample rate.
I use esp32 to host a webpage which contains <image> element and a JavaScript snippet, which constantly sends GET requests to a branching url for image data and updates the element. This approach is not very good, especially that now I've added audio capabilities to the circuit.
I'm curious if it would be possible to combine jpeg encoded frames and some audio data into a chunk of h264 and then send it directly as a response to a GET request making it a stream?
This not only would simplify the whole serving multiple webpages thing, but also remove the issues of syncing the audio and video if they are sent separately.
In particular I'm also curious how easy would it be to do on esp32 since it doesn't have a whole bunch of ram and computational power. It would be challenging to find or port large libraries which could help as well, so i guess I would have to code it myself.
I also am not sure if h264 is the best option. I know its supported on most browser out of the box and is using jpeg compression behind the scenes for the frames, but perhaps a simpler format exists which is also widely supported.
So to sum it up: Is h264 a best bet in the provided context? Is combining jpeg and uncompressed mono audio into h264 possible in the provided context? If an answer to either of previous questions is a no, what alternatives do i have if any?
I'm curious if it would be possible to combine jpeg encoded frames and some audio data into a chunk of h264 and then send it directly as a response to a GET request making it a stream?
H.264 is a video codec. It doesn't have anything to do with audio.
I know its supported on most browser out of the box and is using jpeg compression behind the scenes for the frames
No, this isn't true. H.264 is its own thing. It's far more powerful than JPEG and is specifically designed for motion, whereas JPEG was not.
You need a few things:
A video codec, to efficiently handle your frames. Most of these embedded camera libraries can give you an MJPEG stream. I'd use that if possible. I don't think your ESP32 has other video encoding capability, does it? H.264 is a good choice, but only if you can actually encode it.
A container format, to aid in streaming your audio and video streams together. ISOBMFF/MP4 is common, as is WebM/Matroska.
If you're only streaming to a single client (which seems likely given the limited horsepower of the board), and if you have enough capability to do the audio/video encoding, you can generate a WebM stream on the fly that is directly playable in a <video> element. This seems exactly what you are asking for.
I am currently trying to challange myself and write a little meeting tool in order to learn stuff. I am currently stuck at the audio part. My audio stack is working, each user sends an OPUS stream (20ms packets) to the server. I am not thinking about how I will handle the audio. The outcome shall be: All users receive all audio, but not the own audio (so he does not hear himself). I want the meeting to support as many concurrent users as possible.
I have the following ideas, but none feels quite right:
send all audio streams to all users, which would mean bigger traffic, mixing would be done on the client side
mix the audio on a per-user-basis, which means if I have n users I would need to include n encodings in each frame.
mix the whole audio together, for alle users, send it to all users, each user will receive a second opus package containing the "own" opus audio packet which was sent to the server (or it will be numbered and stored on the client side so it does not need retransmission). I dont know if after decoding I can remove the "own" audio from the stream without getting some unclean audio.
How is this normally done? Are there options I am missing? I have profiled all steps involved, the most expensive part is encoding (one 20ms encoding takes about 600ns), decoding and mixing need near to no-time at all (5-10ns for each step).
Currently I would prefer option 3, but I do not find informations if the audio will be clean or if it will result in washy audio or some cracking.
The whole thing is written in C++, but I did not include the tag since I dont need code examples, just informations on this topic. I tried googling a lot, read a lot of documentation of opus, but did not find anything related to this.
I have two RTP streams (one for each call direction) that I want to mix in a single WAV file.
The problem is that the two streams may use different codecs (and therefore different sampling frequency, encoding, etc).
Is it possible to store the two RTP streams in a WAV file using two channels (i.e. stereo)? Asked differently, is there a way to store multiple channels with different encoding, sampling frequency, etc?
Structure of the WAV file assumes that sampling rate and channel bitness is the same for all channels of the feed. Encoding applies the entire feed (with many encodings/formats/codecs you cannot separate a channel without decoding the feed).
You will need to store feeds in separate files, or you need a file format which supports multiple audio tracks (MP4, MKV for example) though they all have their own restrictions.
As Roman R. mentioned it is not immediate. You will need to take an extra step in between to convert whatever you have on your RTP stream into a proper WAV file. The idea is to use a software like ffmpeg to do so:
2 × mono → stereo: ffmpeg -i left.mp3 -i right.mp3 -ac 2 output.wav
After that you could try something of the flavor (untested):
ffmpeg -i rtp://leftrtp -i rtp://rightrtp -ac 2 output.wav
Most likely you will need to tune the codec settings to make it work as you want. You can Google around and find some infos on the subject or read the ffmpeg doc.
im trying to write a little client for rtmp(audio only). so far i got the communication working (red5 server) but now im stuck with the audio data.
the server is sending in MP3 44KHz 16bit stereo.
i get my Audiomessage which consists of the byte identifying the codec (0x2f) and the audio data which looks for example like this
ff:fb:92:64:eb:80:03:98:58:d2:e9:26:1b:7e:5d:e7:4a:1a:19:26:5c:8b:89:07:47:44:98:6b:91:2d:9c:28:b4:33:15:70:82:c9:29:87:8d:e4:8f:31:83:84:7b:e5:82:b5:57:62:00:02:e5:bb:f1:86:15:7a:8f:da:9e:ca:4f:83:9d:0a:c4:56:7b:b3:3d:56:43:ba:2b:28:b8:9d:0c:e1:82:0c:08:36:24:f3:39:67:54:b7:41:d9:8e:ef:36:96:56:22:d2:b9:9f:ae:40:43:8e:ea:39:52:0c:a4:48:25:02:54:91:c7:35:37:2d:be:f2:37:23:61:65:35:d9:0f:aa:18:b4:37:d9:d4:c8:68:21:3c:bd:ea:c1:d0:98:df:eb:96:59:99:88:09:37:36:c3:8b:47:80:64:84:41:ba:35:ea:a6:0a:d6:74:9e:09:f6:a5:d7:3f:1f:53:d8:fb:8d:d9:d3:f8:ee:c7:c1:68:25:25:8e:ae:6a:1c:08:52:9d:58:cf:cf:87:c1:ba:a4:f0:63:76:b0:b4:65:79:1b:3b:21:5f:2f:b5:7a:18:43:af:f7:fd:15:0c:87:c9:73:54:95:22:94:cc:cb:e3:da:4d:e0:f3:8a:95:69:69:eb:32:71:57:08:49:76:e0:f3:84:8c:4b:4c:84:6b:5d:7a:c8:c9:d7:df:d5:e2:68:bb:5f:6c:9f:ba:f4:0a:6c:6e:51:8a:b3:59:9a:07:0c:e4:2a:9d:ec:d1:99:53:48:f2:8b:22:b2:d3:bf:e1:5b:9f:ee:49:9f:2c:ee:63:1f:6f:da:90:e7:65:00:55:99:97:77:b9:e8:97:43:81:fd:32:e4:81:20:d0:78:f5:4f:59:47:39:f2:57:5d:f4:d5:91:48:c9:45:10:52:49:4d:04:87:6b:0e:a5:72:ed:34:74:08:93:5b:8a:54:3a:d9:7e:53:8f:c7:5e:b1:99:f3:55:63:72:49:99:55:3a:b8:0d:73:3b:2a:ea:9a:b5:32:d2:3b:61:c2:4e:e9:56:78:99:14:4a:a7:46:f4:ee:ae:6f:ff:c8:85:2d:07:68:ad:e2:84:dd:0a:bd:2e:93:12:43
i dont find a little thing about the data format. as the first byte is always 0xff i assume every chunk of audio data has a little header describing its contents.
the rtmp spec from adobe doesnt loose a single word about the format of the audio message package (just two lines saying its an audio message... wow).
does anyone know the format for the audio messages or at least a source where i find something?
The Adobe spec doesn't document the elementary stream formats because they are covered in their own documents, and usually quite large. MP3 is covered by ISO/IEC 11172-3.
There is a good rundown available here:
http://www.mpgedit.org/mpgedit/mpeg_format/mpeghdr.htm
Recently i have been trying to convert an audio file from one format to another through ffmpeg. i was trying to do some google but results made me a little confused about the difference between encoding and decoding an audio file and converting from one format to another.
Let me describe it this way: There are several different file formats for video files (sometimes also called "wrappers"). There are also several different codecs which can be used to encode (or compress) the audio and video. Audio and video use different codecs - and the encoded formats can be sorted in different file types/formats.
So when you talk about "encoding" vs. "converting" a couple of things come into play.
"Encoding" would be the act of taking audio/video and encoding them into a given codec(s). "Converting" implies having stuff in one format, but wanting it in another. There are two ways of looking at this:
Often called "repackaging" - this is when the video (for example) has been encoded correctly (let's say h264, with a bunch of parameters), but you want it in a different file-type - maybe it's an .AVI and you wanted it in an .MP4. This doesn't involve changing the actual video - just re-wraping the h264 stream in a new "wrapper", and is thus a fast operation.
Re-encoding. Let's say your audio was in a MP3 format, and you wanted it in an AAC format. This would require decoding the entire MP3 stream, and re-encoding it into AAC.
Obviously you can also do "1" and "2" together.
Refer Formats and Codecs for detailed information.
Hope it helps!