How many .ts files should there be in a .m3u8 file for live streaming?

For HTTP Live Streaming, I initially added 3 .ts files to the .m3u8 file and it starts playing. How should I add each incoming .ts file to the .m3u8 file?
Shall I keep appending?
Shall I replace the older files with new ones? If so, in what order, i.e. which set of files?

The best method I've seen is to decide on the amount of "history" you want, for example H = 20 files, and then publish only the last X files in the playlist (if each segment is 10 seconds, then X = 3 files is a good choice).
You start by publishing movie_000, movie_001, and movie_002.
After 10 seconds you publish movie_001, movie_002, and movie_003,
...
and so on, until you reach the number of files you wish to keep; then you start overwriting the older files (this way your hard drive doesn't overflow with data).
So after H × 10 seconds you will have movie_018, movie_019, movie_000.
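For illustration, here is a minimal sketch of such a sliding-window playlist writer in Python (the movie_NNN.ts names, the window of 3 entries, and the write_playlist helper are my assumptions, not part of the answer above). Note that EXT-X-MEDIA-SEQUENCE has to keep increasing even though the segment filenames wrap around:

# Hypothetical sliding-window .m3u8 writer; names and constants are
# assumptions for illustration only.
H = 20               # how many segment filenames exist before wrapping around
WINDOW = 3           # how many segments the playlist publishes at once
SEGMENT_SECONDS = 10

def write_playlist(latest_index, path="live.m3u8"):
    """Publish the WINDOW most recent segments, ending at latest_index."""
    first = latest_index - WINDOW + 1
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{SEGMENT_SECONDS}",
        # the media sequence must keep increasing even though filenames wrap
        f"#EXT-X-MEDIA-SEQUENCE:{first}",
    ]
    for i in range(first, latest_index + 1):
        lines.append(f"#EXTINF:{SEGMENT_SECONDS}.0,")
        lines.append(f"movie_{i % H:03d}.ts")   # reuse the ring of H filenames
    with open(path, "w") as f:                  # no #EXT-X-ENDLIST: stream is live
        f.write("\n".join(lines) + "\n")

# After the 21st segment (index 20) the playlist lists
# movie_018.ts, movie_019.ts, movie_000.ts -- matching the example above.
write_playlist(20)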

Related

Combine audio files based on the timestamps in their filenames

I just received dozens of audio files that are recorded radio transmissions. Each transmission is its own file, and each file has its transmission time as part of its filename.
How can I programmatically combine the files into a single mp3 file, with each transmission starting at the correct time relative to the first?
Filename format:
PD_YYYY_MM_DD_HH_MM_SS.wav
Examples:
PD_2022_01_22_16_21_52.wav
PD_2022_01_22_16_21_55.wav
PD_2022_01_22_16_22_02.wav
PD_2022_01_22_16_22_05.wav
PD_2022_01_22_16_23_03.wav
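One way to do this (a sketch only; it assumes the pydub library plus an ffmpeg install for the mp3 export, and that all files sit in the current directory) is to parse the timestamp out of each filename, compute each file's offset from the earliest one, and overlay everything onto a silent canvas:

# Hypothetical sketch: combine timestamped WAV transmissions into one mp3.
import glob
import os
from datetime import datetime
from pydub import AudioSegment

def start_time(path):
    # PD_YYYY_MM_DD_HH_MM_SS.wav -> datetime
    stem = os.path.basename(path)[3:-4]          # strip "PD_" and ".wav"
    return datetime.strptime(stem, "%Y_%m_%d_%H_%M_%S")

files = sorted(glob.glob("PD_*.wav"), key=start_time)
t0 = start_time(files[0])

# (offset in milliseconds relative to the first transmission, audio clip)
clips = [(int((start_time(f) - t0).total_seconds() * 1000), AudioSegment.from_wav(f))
         for f in files]

# one silent canvas long enough to hold the last clip, then overlay each clip
total_ms = max(offset + len(clip) for offset, clip in clips)
combined = AudioSegment.silent(duration=total_ms)
for offset, clip in clips:
    combined = combined.overlay(clip, position=offset)

combined.export("combined.mp3", format="mp3")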

Downloading S3 files in Google Colab

I am working on a project where some of the data is provided via an S3FileSystem. I can read that data using S3FileSystem.open(path), but there are more than 360 files and it takes at least 3 minutes to read a single file. I was wondering: is there any way of downloading these files to my system and reading them from there, instead of reading them directly from the S3FileSystem? There is another reason: although I can read all those files, once my Colab session reconnects I have to re-read them all, which again takes a lot of time. I am using the following code to read the files:
import s3fs
import xarray as xr

fs_s3 = s3fs.S3FileSystem(anon=True)       # anonymous (public) access
s3path = 'file_name'
remote_file_obj = fs_s3.open(s3path, mode='rb')
ds = xr.open_dataset(remote_file_obj, engine='h5netcdf')
Is there any way of downloading those files?
You can use another s3fs (the FUSE-based s3fs tool, not the Python package) to mount the bucket and then copy the files to Colab; its documentation explains how to mount.
After mounting, you can
!cp /s3/yourfile.zip /content/
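Alternatively (a sketch of my own, not part of the answer above; the bucket prefix and file name are placeholders), the same Python s3fs handle from the question can copy the objects to the Colab VM's local disk once, so later reads hit the local copies:

# Hypothetical sketch: download the objects once with the Python s3fs package,
# then open the local copies with xarray.
import os
import s3fs
import xarray as xr

fs_s3 = s3fs.S3FileSystem(anon=True)
local_dir = "/content/data"
os.makedirs(local_dir, exist_ok=True)

for s3path in fs_s3.ls("bucket-name/prefix"):        # placeholder bucket/prefix
    local_path = os.path.join(local_dir, os.path.basename(s3path))
    if not os.path.exists(local_path):               # skip files already downloaded
        fs_s3.get(s3path, local_path)

ds = xr.open_dataset(os.path.join(local_dir, "some_file.nc"), engine="h5netcdf")

Note that /content/ is also cleared when the Colab VM is recycled, so for persistence across sessions you would need to copy the files to a mounted Google Drive instead.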

Is it possible to remove characters from a compressed file without extracting it?

I have a compressed file that's about 200 MB, in the form of a tar.gz file. I understand that I can extract the XML files in it; it contains several small XML files and one 5 GB XML file. I'm trying to remove certain characters from the XML files.
So my very basic question is: is it even possible to accomplish this without ever extracting the content of the compressed file?
I'm trying to speed up the process of reading through the XML files looking for characters to remove.
You will have to decompress, change, and then recompress the files. There's no way around that.
However, this does not necessarily mean writing the files to storage. You may be able to make the changes you want in a streaming fashion, i.e. everything is done in memory without the complete decompressed file ever existing anywhere. Unix uses pipes for such tasks.
Here is an example of how to do it:
Create two random files:
echo "hello world" > a
echo "hello world" > b
Create a compressed archive containing both:
tar -c -z -f x.tgz a b
Pipe the contents of the archive through a "changer". Unfortunately I haven't found any purely shell-based way to do this, but you also specified Python in the tags, and with the tarfile module you can achieve it:
Here is the file tar.py:
#!/usr/bin/env python3
import sys
import tarfile

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if tar_info.path == 'a':  # my example file names are "a" and "b"
        # now comes the code which makes our change:
        # we just skip the first two bytes in each file:
        reader.read(2)        # skip two bytes
        tar_info.size -= 2    # reduce size in info object as well
    # add the (maybe changed) file to the output:
    tar_out.addfile(tar_info, reader)
tar_out.close()
tar_in.close()
This can be called like this:
./tar.py < x.tgz > y.tgz
y.tgz will contain both files again, but in a the first two bytes will have been skipped (so its contents will be llo world).
You will have noticed that you need to know the resulting size of your change beforehand. tar is designed to handle files, so it needs to write the size of each entry into the tar header that precedes every entry in the resulting archive; I see no way around this. With a compressed output it also isn't possible to seek back after writing everything and adjust the entry's size.
But as you phrased your question, this might be possible in your case.
All you have to do is provide a file-like object (it could be a Popen object's output stream) like reader in my simple example.
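For the character-removal case specifically, here is a sketch of my own (an extension of the example above, not part of it): it streams each member through a byte filter and spools the filtered data to a temporary file, so the new size is known before the header is written. The REMOVE bytes are placeholders.

#!/usr/bin/env python3
import sys
import tarfile
import tempfile

REMOVE = b"\x00\x0b"    # placeholder: the characters/bytes you want to strip

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if reader is None:
        # directories and other non-file entries carry no data
        tar_out.addfile(tar_info)
        continue
    # filter into a temporary file because the tar header needs the final size
    with tempfile.TemporaryFile() as tmp:
        while True:
            chunk = reader.read(1 << 20)              # 1 MiB at a time
            if not chunk:
                break
            tmp.write(chunk.translate(None, REMOVE))  # drop unwanted bytes
        tar_info.size = tmp.tell()                    # size known only after filtering
        tmp.seek(0)
        tar_out.addfile(tar_info, tmp)
tar_out.close()
tar_in.close()

It is invoked the same way as the script above, e.g. ./filter.py < x.tgz > y.tgz (where filter.py is whatever you name this sketch).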

How to read / readStream a directory containing files with completely different schemas

What if I have this:
Data:
/user/1_data/1.parquet
/user/1_data/2.parquet
/user/1_data/3.parquet
/user/2_data/1.parquet
/user/2_data/2.parquet
/user/3_data/1.parquet
/user/3_data/2.parquet
Each directory has files containing completely different schemas.
I don't want to have to create a separate stream job for each folder by hand. At the same time, I also want to save them in different locations.
How would I read / readStream them all without having to collect data to the driver or hard-code the directory paths?
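One possible approach (a sketch only: it assumes PySpark, that the /user/<n>_data directories can be listed from the driver, and that the /output and /chk locations are placeholders) is to list the subdirectories once, which touches only metadata, and start one stream per directory programmatically:

# Hypothetical sketch: one readStream per subdirectory, built from a listing
# rather than hard-coded paths. Adjust the listing for HDFS/S3 if needed.
import glob
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

base = "/user"                                        # placeholder base path
dirs = [d for d in glob.glob(os.path.join(base, "*_data")) if os.path.isdir(d)]

queries = []
for d in dirs:
    name = os.path.basename(d)
    schema = spark.read.parquet(d).schema             # infer schema with a batch read
    df = spark.readStream.schema(schema).parquet(d)   # each folder keeps its own schema
    q = (df.writeStream
           .format("parquet")
           .option("checkpointLocation", f"/chk/{name}")  # placeholder checkpoint dir
           .start(f"/output/{name}"))                     # placeholder output location
    queries.append(q)

spark.streams.awaitAnyTermination()                   # keep the driver alive for all queries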

Design a batch job to process multiple files in an FTP folder

I want to design a batch job to process multiple zip files in a folder. Each input zip file contains a directory structure whose leaf directory holds a CSV index file and a set of PDFs. The job should take a zip file, unzip it, and upload the contents to an external system and a database, based on the index file in the leaf folder.
Example input zip file structure:
input1.zip
-- Folder 1
   --> Folder2
       --> abc.pdf
       ...
       --> cdf.pdf
       --> metadata.csv
I can add Spring Integration and invoke the job just after the FTP copying has completed. My question is: how should I design the job to pick up multiple zip files and process them in parallel?
Since each zip file takes around 10 minutes to process, I need multiple instances to process the zip files efficiently.
Appreciate any suggestions. Thank you.
