Joining mp3 files to form a stream - audio

I'm wondering if it's possible to, in real time, concatenate a series of mp3 files to form a live stream.
For example, in some directory I have file1.mp3, file2.mp3, file3.mp3 - each file is 1 minute in duration.
I want to serve an mp3 stream that I can open in a web browser or on a phone, etc., which joins all these files together to form a 3-minute stream. However, say I'm 2 minutes into the stream and I upload another 1-minute file, file4.mp3, to that directory. I would want that to be added to the end of the live stream automatically, so that when file3.mp3 finishes, file4.mp3 starts straight away.
I hope I explained myself well. I am just keen to know:
1) If there is a name for what I am trying to achieve?
2) Whether what I am doing is possible with current technologies.

I think HTTP Live Streaming is what you're looking for. http://en.m.wikipedia.org/wiki/HTTP_Live_Streaming
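To give a feel for how HLS models this: the "stream" is just a playlist (.m3u8) that the player re-polls, so appending a newly uploaded file extends the stream. Below is a minimal Python sketch that regenerates such a playlist from a directory. It assumes every upload is exactly one minute long and that your player accepts MP3 segments; a real deployment would normally re-segment the audio into short chunks with a tool like ffmpeg.

import glob
import os
import time

SEGMENT_SECONDS = 60  # each uploaded file is one minute, per the question

def write_playlist(directory, playlist_path):
    # An "event"-style live playlist: it only grows, and it omits
    # #EXT-X-ENDLIST so players keep polling for new segments.
    segments = sorted(glob.glob(os.path.join(directory, "*.mp3")))
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{SEGMENT_SECONDS}",
        "#EXT-X-MEDIA-SEQUENCE:0",
    ]
    for seg in segments:
        lines.append(f"#EXTINF:{SEGMENT_SECONDS:.1f},")
        lines.append(os.path.basename(seg))
    with open(playlist_path, "w") as f:
        f.write("\n".join(lines) + "\n")

# Regenerate every few seconds so a newly uploaded file4.mp3
# shows up in the playlist before file3.mp3 finishes playing.
while True:
    write_playlist("/var/media/stream", "/var/media/stream/stream.m3u8")
    time.sleep(5)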

Related

Best Way to Save Sensor Data, Split Every x Megabytes in Python

I'm saving sensor data at 64 samples per second into a csv file. The file is about 150 MB at the end of 24 hours. It takes a bit longer than I'd like to process, and I need to do some processing in real time.
value = str(milivolts)
logFile.write(str(datet) + ',' + value + "\n")
So I end up with single lines of date and millivolts, up to 150 MB. At the end of 24 hours it makes a new file and starts saving to it.
I'd like to know if there is a better way to do this. I have searched but can't find any good information on a compression format to use while saving sensor data. Is there a way to compress while streaming/saving? What format is best for this?
While saving the sensor data, is there an easy way to split it into x megabyte files without data gaps?
Thanks for any input.
I'd like to know if there is a better way to do this.
One of the simplest ways is to use a logging framework: it will allow you to configure which compressor to use (if any), the approximate size of a file, and when to rotate logs. You could start with this question. Try experimenting with several different compressors to see if the speed/size trade-off is OK for your app.
While saving the sensor data, is there an easy way to split it into x megabyte files without data gaps?
A logging framework would do this for you based on the configuration. You could combine several different options: have fixed-size logs and rotate at least once a day, for example.
Generally, rotation is accurate up to the size of a logged line, so if the data is split into lines of reasonable size this makes life super easy: one line ends one file, and the next is written into the new file.
Files also rotate, so you can have order of the data encoded in the file names:
raw_data_<date>.gz
raw_data_<date>.gz.1
raw_data_<date>.gz.2
In pseudocode it will look like this:
# Parse where to save data, should we compress data,
# what's the log pattern, how to rotate logs etc
loadLogConfig(...)
# any compression, rotation, flushing etc happens here
# but we don't care, and just write to file
logger.trace(data)
# on shutdown, save any temporary buffer to the files
logger.flush()
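For a concrete (if minimal) Python version of that sketch, the stdlib's RotatingFileHandler can do the size-based splitting, and its rotator/namer hooks can gzip each finished file. The 50 MB limit, backup count, and variable values below are arbitrary stand-ins:

import gzip
import logging
import os
import shutil
from datetime import datetime
from logging.handlers import RotatingFileHandler

def gzip_rotator(source, dest):
    # Compress the just-closed log file, then drop the uncompressed original.
    with open(source, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    os.remove(source)

handler = RotatingFileHandler("raw_data.csv", maxBytes=50 * 1024 * 1024, backupCount=48)
handler.namer = lambda name: name + ".gz"   # raw_data.csv.1.gz, raw_data.csv.2.gz, ...
handler.rotator = gzip_rotator
handler.setFormatter(logging.Formatter("%(message)s"))

logger = logging.getLogger("sensor")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# One CSV line per sample; rotation happens on a size boundary and always
# between lines, so no readings are lost across files.
millivolts = 123.4  # stand-in for a real sensor reading
logger.info("%s,%s", datetime.now().isoformat(), millivolts)

logging.shutdown()  # flush any buffered lines on exit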

Why is Spark much faster at reading a directory compared to a list of filepaths?

I have a directory in S3 containing millions of small files. They are small (<10 MB) and gzipped, and I know that's inefficient for Spark. I am running a simple batch job to convert these files to parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in a whole directory, there isn't as much delay at the beginning while the driver indexes the files (around 20 minutes before the batch starts). In the UI for the single-directory run there is 1 task after this 20 minutes, which looks like the conversion itself.
However, with individual filenames, the indexing time increases to 2+ hours, and the conversion job doesn't show up in the UI until then. For the list of files there are 2 tasks: (1) the first is listing leaf files for the 8 million paths, and then (2) a job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
spark assumes every path passed in is a directory
so when given a list of paths, it has to do a LIST call on each
which for S3 means: 8M LIST calls against the S3 servers
which are rate limited to about 3K/second, ignoring details like thread count on the client, HTTP connections etc
and with LIST billed at $0.005 per 1,000 calls, 8M requests comes to $40
oh, and as the LIST returns nothing, the client falls back to a HEAD, which adds another S3 API call per path, doubling execution time and adding further to the query cost
in contrast,
listing a dir with 8M entries kicks off a single LIST request for the first 1K entries
and 7,999 follow-ups
s3a releases do async prefetch of the next page of results (faster, especially if the incremental list iterators are used): one thread fetches, one processes, and it will cost you about 4 cents
The big directory listing is the more efficient and cost-effective strategy, even ignoring EC2 server costs
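A rough boto3 illustration of the difference (bucket and prefix names taken from the question; this is a sketch of the access pattern, not what Spark literally runs):

import boto3

s3 = boto3.client("s3")

# Directory case: one paginated LIST walks the whole prefix,
# roughly one request per 1,000 keys (~8,000 calls for 8M files).
paginator = s3.get_paginator("list_objects_v2")
keys = []
for page in paginator.paginate(Bucket="input_bucket_name", Prefix="data/"):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# File-list case degenerates into one probe per path:
# millions of round trips instead of thousands.
# for key in keys:
#     s3.head_object(Bucket="input_bucket_name", Key=key)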

HLS protocol: get absolute elapsed time during a live streaming

I have a very basic question, and I can't tell whether I googled the wrong things or the answer is so simple that I missed it.
I'm implementing a web app using hls.js as the JavaScript library, and I need a way to get the absolute elapsed time of a live stream, e.g. if a user joins the live stream after 10 minutes, I need a way to detect that the user's 1st second is the 601st second of the stream.
Inspecting the stream fragments I found some information like startPTS and endPTS, but all of it is relative to the retrieved chunks rather than the whole stream, e.g. if a user joins the live stream after 10 minutes and the chunk duration is 2 seconds, the first chunk I get will have startPTS = 0 and endPTS = 2, the second chunk startPTS = 2 and endPTS = 4, and so on (rounding the values to the nearest integer).
Is there a way to extract the absolute elapsed time, as I need, from an HLS live stream?
I'm having the exact same need on iOS (AVPlayer) and came up with the following solution:
Read the m3u8 manifest; for me it looks like this:
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-MEDIA-SEQUENCE:410
#EXT-X-TARGETDURATION:8
#EXTINF:8.333,
410.ts
#EXTINF:8.333,
411.ts
#EXTINF:8.334,
412.ts
#EXTINF:8.333,
413.ts
#EXTINF:8.333,
414.ts
#EXTINF:8.334,
415.ts
Observe that the first 409 segments are not part of the manifest.
Multiply EXT-X-MEDIA-SEQUENCE by EXT-X-TARGETDURATION and you have an approximation of the stream-clock time of the first available segment.
Let's also notice that each segment is not exactly 8 s long, so when using the target duration I'm actually accumulating an error of about 333 ms per segment:
410 * 8 = 3280 seconds = 54.6666 minutes
In this case the segments are always 8.333 or 8.334 seconds, so multiplying by the EXTINF duration instead, I get:
410 * 8.333 = 3416.53 seconds = 56.9421 minutes
Those 56.9421 minutes are still an approximation (since we don't know exactly how many times the remaining 0.001 s error accumulated), but it's much, much closer to the real clock time.
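The same computation in a few lines of Python (a sketch: it fetches the manifest, reads EXT-X-MEDIA-SEQUENCE, and uses the average EXTINF duration of the visible segments as the best guess for the ones that have already dropped off):

import re
import urllib.request

def seconds_elapsed_at_join(manifest_url):
    # Approximate stream-clock time of the first segment still in the playlist.
    m3u8 = urllib.request.urlopen(manifest_url).read().decode("utf-8")
    media_seq = int(re.search(r"#EXT-X-MEDIA-SEQUENCE:(\d+)", m3u8).group(1))
    durations = [float(d) for d in re.findall(r"#EXTINF:([\d.]+)", m3u8)]
    avg = sum(durations) / len(durations)
    return media_seq * avg

# For the manifest above: 410 * 8.333... ≈ 3416.6 seconds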

Merge multiple audio files into one file

I want to merge two audio files and produce one final file. For example, if file1 is 5 minutes long and file2 is 4 minutes long, I want the result to be a single 5-minute file, because both files will start at 0:00 and play together (i.e. overlapping).
You can use the APIs in the Windows.Media.Audio namespace to create audio graphs for audio routing, mixing, and processing scenarios. For how to create audio graphs, see this article.
An audio graph is a set of interconnected audio nodes. The two audio files you want to merge feed the audio input nodes, and an audio output node is the destination for the audio processed by the graph, in your case a single file.
Scenario 4 (Submix) of the official AudioCreation sample provides just the feature you want: given two files, it outputs the mixed audio. Just change the output node to an AudioFileOutputNode to save to a new file, since the sample creates an AudioDeviceOutputNode for playback.
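If you are not tied to UWP, the same overlap-style merge can also be sketched in a few lines of Python with pydub (a wrapper around ffmpeg, offered here as an alternative, not as part of the AudioGraph API). Its overlay() keeps the length of the calling segment, so a 5-minute and a 4-minute file mix down to 5 minutes:

from pydub import AudioSegment  # requires ffmpeg on the PATH

# File names are placeholders; any format ffmpeg understands will work.
file1 = AudioSegment.from_file("file1.mp3")  # 5 minutes
file2 = AudioSegment.from_file("file2.mp3")  # 4 minutes

# Both tracks start at 0:00; the result is as long as file1.
mixed = file1.overlay(file2)
mixed.export("merged.mp3", format="mp3")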

Server logs / Webalizer, 206 partial content for audio and video files – how do I calculate the number of downloads?

I need to calculate the number of video and audio file downloads from our media server. Our media server only hosts audio/video files (mp3 and mp4) and we parse our IIS log files monthly using Stone Steps Webalizer.
When I look at the Webalizer stats most of the ‘hits’ are ‘code 206 partial content’ and most of the remainder are ‘code 200 ok’. So for instance our most recent monthly Webalizer stats look something like this -
Total hits: 1,600,000
Code 200 - ok: 300,000
Code 206 - Partial Content: 1,300,000
The total hits figure is much larger than I would expect it to be in relation to the amount of data being served (Total Kbytes).
When I analyse the log files it looks as though media players (iTunes, QuickTime etc.) create multiple 206s for a single download/play, and I suspect that Webalizer does not group these multiple 206s from the same IP/visit, instead recording each 206 as a 'hit'. Because of this the total hits figure is vastly inflated. There is a criticism of Webalizer on its Wikipedia page which appears to confirm this - http://en.wikipedia.org/wiki/Webalizer
Am I correct about the 206's and Webalizer, and if I am correct how would I calculate the number of downloads? Is there an industry standard methodology and/or are there alternative web analytics applications that would be better suited to the task?
Any help or advice would be much appreciated.
Didn't receive any response to my question but thought I would give an update.
We have analysed a one hour sample of our log files and we have done some testing of different browsers / media players on an mp3 and mp4 file.
Here are our findings -
- Some media players, particularly iTunes/QuickTime, produce a series of 206 requests but do not produce a 200 request.
- Most but not all web browsers (Chrome is the exception) produce a 200 request and no 206 requests when downloading a media file, i.e. downloading to the desktop as opposed to playing in a desktop media player or media-player plug-in.
- If the file is cached by the browser/media player, it may produce a 304 request and no 200 and no 206 requests.
Given the above we think it's impossible to count 'downloads' of media files from log file analysis unless the software has an intelligent algorithm designed specifically for that purpose. For example, it would need to group all requests for a specific media file from the same IP within a set time period (say 30 minutes) and count that as one download. As far as I'm aware there isn't any log file analysis software on the market which can offer that functionality.
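For what it's worth, that grouping rule is simple to prototype once the log is parsed. A Python sketch, using a 30-minute window and assuming the 200/206 lines have already been reduced to (timestamp, ip, path) tuples sorted by time:

from collections import defaultdict
from datetime import timedelta

WINDOW = timedelta(minutes=30)

def count_downloads(hits):
    # hits: iterable of (timestamp, ip, path), sorted by timestamp.
    last_seen = {}
    downloads = defaultdict(int)
    for ts, ip, path in hits:
        key = (ip, path)
        # A request counts as a new download only if this IP hasn't
        # touched this file within the window.
        if key not in last_seen or ts - last_seen[key] > WINDOW:
            downloads[path] += 1
        last_seen[key] = ts
    return downloads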
I did a quick Google search to find out more about podcast/video metrics / log file analysis and it does seem to be a very real, albeit niche problem. Google Analytics and other web metrics tools that use web beacons e.g. SiteStat, are not an option unless your media files are only available for download from your website i.e. no RSS or iTunes syndication etc. Even then I'm not sure if they could do the job.
I think this is why companies such as podtrac and blubrry offer specialised podcast/video measurement tools using redirects as opposed to log file analysis.
Podtrac
http://podtrac.com/publisher/measurement
Blubrry
http://www.blubrry.com/podcast_statistics/
If anyone has experience or expertise in this area feel free to chime in and offer advice or correct me if I'm wrong.
Try my software. I encountered the same issue with MP3s being split into multiple streams for iPods and iPhones. It is really easy to implement and works a treat.
Github
This is probably WAY too late to help you specifically, but if you have parsed your server logs and stored them somewhere sensible, like a DBMS, a quick bit of SQL will give you the combined results you're after. Given a very simple log table where each 206 is recorded with a 'hit time', the IP address of the endpoint, and an id/foreign key of the item fetched, you could run this query:
select min(hit_time) as hit_time, ip_address, episode_id
from podcast_hit
group by DATE(hit_time), ip_address, episode_id
This will group all the 206 records and make them unique by day and user, giving you more accurate stats. Hope this helps someone!