Odd ball question for somebody just getting started with html5 players and streaming video....
When using YouTube long videos can be scrolled towards then end then played from there. Assuming YouTube first pulls down metadata like total video start/stop points and a bunch of thumbnails for scrolling.
Is this possible with an open html5 video player (like projekkter)? Reason asking is that I have video data inside a mongo database that I would like to stream similar to the YouTube player.
Inside mongo I have a bunch of smaller h264 files each in a document: actual raw h264 usually 1000kb (max 2 seconds), creation timestamp (long), and potentially a converted format (like mp4) for known clients. Idea is to query off a time range and order by creation time then piping the results into readable stream. There is a nice ffmpeg module to take streams and reformat if needed. Thought about piping the stream to the client with binaryjs and appending it into the player.
But the source directives in the documentation are usually URLs plus I need to lock down the start/stop point for the total video being played plus thumbnails.
Related
I come to a technical problem and I need you.
Situation data:
I record the screen as well as 1 to 2 audio tracks (microphone and speaker).
These three recordings are done separately (it could be mixed but I don't prefer) and every 10s (this is configurable), I send the chunk of recorded data to my backend. We, therefore, have 2 to 3 chunks sent every 10s.
These data chunks are interdependent. Example: The 1st video chunk starts with the headers and a keyframe. The second chunk can be in the middle of a frame. It's like having the entire video and doing a random one-bit split.
The video stream is in h264 in a WebM container. I don't have a lot of control over it.
The audio stream is in opus in a WebM container. I can't use aac directly, nor do I have much control.
Given the reality, the server may be restarted randomly (crash, update, scaled, ...). It doesn't happen often (4 times a week). In addition, the customer can, once the recording ends on his side, close the application or his computer. This will prevent the end of the recording from being sent. Once it reconnects, the missing data chunks are sent. This, therefore, prevents the use of a "live" stream on the backend side.
Goals :
Store video and audio as it is received on the server in cloud storage.
Be able to start playing the video/audio even when the upload has not finished (so in a live stream)
As soon as the last chunks have been received on the server, I want the entire video to be already available in VoD (Video On Demand) with as little delay as possible.
Everything must be distributed with the audios in AAC. The audios can be mixed or not, and mixed or not with the video.
Current and blocking solution:
The most promising solution I have seen is using HLS to support the Live and VoD mode that I need. It would also bring a lot of optimization possibilities for the future.
Video isn't a problem in this context, here's what I do:
Every time I get a data chunk, I append it to a screen.webm file.
Then I spit the file with ffmpeg
ffmpeg -ss {total_duration_in_storage} -i screen.webm -c: v copy -f hls -hls_time 8 -hls_list_size 0 output.m3u8
I ignore the last file unless it's the last chunk.
I upload all the files to the cloud storage along with a newly updated output.m3u8 with the new file information.
Note: total_duration_in_storage corresponds to the time already uploaded
on cloud storage. So the sum of the parts presents in the last output.m3u8.
Note 2: I ignore the last file in point 3 because it allows me to have keyframes in each song of my playlist and therefore to be able to use a seeking which allows segmenting only the parts necessary for each new chunk.
My problem is with the audio. I can use the same method and it works fine, I don't re-encode. But I need to re-encode in aac to be compatible with HLS but also with Safari.
If I re-encode only the new chunks that arrive, there is an auditory glitch
The only possible avenue I have found is to re-encode and segment all the files each time a new chunk comes along. This will be problematic for long recordings (multiple hours).
Do you have any solutions for this problem or another way to achieve my goal?
Thanks a lot for your help!
youtube-dl can be used to see what formats are used to store YouTube content:
youtube-dl -F https://youtu.be/??????
The above command hints that the audio and video are mostly stored separately. Is it right? Does YouTube streaming combine audio and video in real-time?
Formats for a sample YouTube content
Most large streaming services will use ABR streaming (see: https://stackoverflow.com/a/42365034/334402).
The two most common ABR streaming formats are HLS and MPEG-DASH and both provide a manifest or index file which the player downloads first and which will contain links to the media streams, typically audio, video, subtitle tracks etc.
For encrypted content the audio and video, and even different bit rate video tracks, may all have separate encryption keys.
The player will download the audio and video tracks and synchronise them for playback.
in general streaming video and audio are sent in separate channels .... ditto for multi track audio like 5+1 ... during transport these channels are wrapped by a media container like mp4 etc
motive is partly due to distinct compression algorithms ... some algos are best for audio versus others for video and baked into these algos is the spread and sharing of data over time across video frames see B-frames for details ... these channels are not limited to video and audio ... if you own the sending and receiving sides you can send arbitrary data in many distinct channels by making up your own data protocol ... as an aside modern codec like H.256 allow data to get sent from receiver back to sender when you think you are simply viewing a movie (read the RFC)
youtube stores each of its various flavors of video and audio in separate files on its end then combines them based in desired streaming quality choices on a per download basis
I use a video player called MPV to transcode a dynamic playlist of media files.
I pipe MPV's encoded output into FFMPEG and format it for rtmp delivery.
However the playlist may contain media with misaligned audio and video, ie - the audio track may be shorter / longer than the video track.
No matter what MPV will only output what it's given. So if my media file has audio that is 1 second long and video that is 2 seconds long, it will output a media stream with exactly the same misalignment, rather than generating null audio or skipping to the next item in the playlist when it first encounters an active stream ending (eof).
For example, assuming my playlist was full of problematic media where the audio and video of each file was misaligned:
If I output this media stream to a popular streaming service's server, it could lead to stuttering and/or loss of a/v sync.
Similarly, if I output this media stream to a file and played it back in MPV or another video player, the result appears to be more like this:
I have tried to fix this in MPV in all sorts of ways, trying every relevant command line option available. I even wrote a user script that detects 'eof' audio and skips to the next item in the playlist, but it is not fast enough and still leads to small gaps of audio.
So my only hope is correcting it in ffmpeg. In the event of null audio/video, I need a fallback or a generative filter that can fill these empty gaps with silence (audio) or a colour/image (video).
I'm open to any ideas, and if my understanding in a/v encoding is a little off please educate me.
Due to the richness and complexity of my app's audio content, I am using AVAudioEngine to manage all audio across the app. I am converting every audio source to be represented as a node in my AVAudioEngine graph.
For example, instead using AVAudioPlayer objects to play mp3 files in my app, I create AVAudioPlayerNode objects using buffers of those audio files.
However, I do have a video player in my app that plays video files with audio using the AVPlayer framework (I know of nothing else in iOS that can play video files). Unfortunately, there seems to be no way I can obtain the audio output stream as a node in my AVAudioEngine graph.
Any pointers?
If you have a video file, you can extract audio data and pull it out from the video.
Then you can set the volume of AVPlayer to 0. (If you didn't remove audio data from the video)
and Play AVAudioPlayerNode.
If you receive the video data through network, You should make parser of the packet and divide them.
But AV-sync is very tough thing.
Basically I'm trying to replicate YouTube's ability to begin video playback from any part of hosted movie. So if you have a 60 minute video, a user could skip straight to the 30 minute mark without streaming the first 30 minutes of video. Does anyone have an idea how YouTube accomplishes this?
Well the player opens the HTTP resource like normal. When you hit the seek bar, the player requests a different portion of the file.
It passes a header like this:
RANGE: bytes-unit = 10001\n\n
and the server serves the resource from that byte range. Depending on the codec it will need to read until it gets to a sync frame to begin playback
Video is a series of frames, played at a frame rate. That said, there are some rules about the order of what frames can be decoded.
Essentially, you have reference frames (called I-Frames) and you have modification frames (class P-Frames and B-Frames)... It is generally true that a properly configured decoder will be able to join a stream on any I-Frame (that is, start decoding), but not on P and B frames... So, when the user drags the slider, you're going to need to find the closest I frame and decode that...
This may of course be hidden under the hood of Flash for you, but that is what it will be doing...
I don't know how YouTube does it, but if you're looking to replicate the functionality, check out Annodex. It's an open standard that is based on Ogg Theora, but with an extra XML metadata stream.
Annodex allows you to have links to named sections within the video or temporal URIs to specific times in the video. Using libannodex, the server can seek to the relevant part of the video and start serving it from there.
If I were to guess, it would be some sort of selective data retrieval, like the Range header in HTTP. that might even be what they use. You can find more about it here.