I'm just starting to look into Deep Learning for an idea I have for a project. I'm very new to it and have a general question I hope someone can answer for me before I start down what is undoubtedly going to be a long dark path.
If I provide the NN with (for example) 1000 unprocessed audio files and then the same 1000 files processed in a specific way,
could it learn a transformation so that, when I provide a new unprocessed audio file, it attempts to generate the processed version? Or is that just some next-level stuff?
Yes, look at the Magenta project
I have a flow where iOS app users will record a large video file and upload it to our server. After the fact, the user might want to extract certain portions of that larger video based on specific time stamps and generate a highlight reel that can be viewed and shared locally back on the iOS device.
As a FE developer I don't really have much experience with where to even start here. Our BE will be built in NodeJS. It seems to me that this should be a relatively straightforward problem to solve, but I don't know.
Are there APIs that make movie manipulation easy? Can I easily extract a clip based on a start and stop time and save that as a separate file? Are those costly tasks? Or not too bad?
I'm guessing that the response to this call would be a list of file names for the generated clips, which the iOS app could then pull down and load.
It's not quite as straightforward as it might seem as video files are quite structured with header information and indexing into the individual video and audio tracks and frames. Any splitting up or cropping needs to allow for this and also create new files with the correct headers and indexing etc.
Fortunately, there are indeed libraries that you can use to do this type of thing, one of the most powerful being ffmpeg.
There are projects which allow the ffmpeg command line tool to be used programmatically - the advantage of this approach is that you get to leverage the vast community knowledge base for the ffmpeg command line.
One of the popular ones for nodejs is:
https://github.com/damianociarla/node-ffmpeg
You can then look at the ffmpeg documentation or community answers to find the particular functionality you need - for example, to cut a video at a start and end time as you asked:
https://stackoverflow.com/a/42827058/334402
https://superuser.com/a/704118
The general idea is quite simple and will be of the format:
ffmpeg -i yourInputVideo.mp4 -ss 01:30:00 -to 02:30:00 -c copy yourNewOutputVideo.mp4
It's worth taking a look at the seeking info in the ffmpeg online documentation (https://ffmpeg.org/ffmpeg.html) to help understand the examples, especially the second one above:
-ss position (input/output)
When used as an input option (before -i), seeks in this input file to position. Note that in most formats it is not possible to seek exactly, so ffmpeg will seek to the closest seek point before position. When transcoding and -accurate_seek is enabled (the default), this extra segment between the seek point and position will be decoded and discarded. When doing stream copy or when -noaccurate_seek is used, it will be preserved.
When used as an output option (before an output url), decodes but discards input until the timestamps reach position.
position must be a time duration specification; see the Time duration section in the ffmpeg-utils(1) manual.
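As a rough illustration of driving the command above from a script, here is a minimal sketch in Python (the same idea applies from NodeJS via a wrapper or a child process). It assumes ffmpeg is installed and on the PATH; the helper names `build_clip_command` and `extract_clips` and the `clip_N.mp4` output naming are made up for this example. It uses `-ss` as an output option, exactly as in the command above, and because it stream-copies, the cut points may land on nearby keyframes rather than the exact timestamps.

```python
import subprocess


def build_clip_command(source, start, end, output):
    """Return the ffmpeg argument list that copies [start, end] of source into output."""
    # -c copy avoids re-encoding, so the operation is cheap but not frame-exact.
    return ["ffmpeg", "-i", source, "-ss", start, "-to", end, "-c", "copy", output]


def extract_clips(source, ranges, prefix="clip"):
    """Run ffmpeg once per (start, end) pair and return the generated file names."""
    outputs = []
    for index, (start, end) in enumerate(ranges, start=1):
        output = f"{prefix}_{index}.mp4"  # hypothetical naming scheme
        subprocess.run(build_clip_command(source, start, end, output), check=True)
        outputs.append(output)
    return outputs
```

Something like `extract_clips("upload.mp4", [("00:01:00", "00:01:30"), ("00:05:00", "00:05:20")])` would then return the list of clip file names that the server could hand back to the app.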
I know this has already been posted more than 10 years ago but I want to believe that some progress has been made on this side. (we have Deepfake nowadays, so much progress on the AI side).
I tried some tutorials with audacity but was highly disappointed with the result (to be fair the resulting output is not that bad, but not good enough for prod).
What reputable algorithm could I use to process an MP3 file myself and remove the vocals, including vocal echo, while preserving the drums and centered instruments?
This task is known in the community as "Vocal Source Separation", "Vocal Signal Separation" or "Singing Voice Source Separation". These are specialized "Music Source Separation" tasks, which are in turn an example of the more general "Source Separation" task.
Here are some papers: Music Source Separation.
One of the most actively developed open source solutions is Spleeter, which has been used commercially in various audio products. There is an online tool based on it that you can try out at Splitter.ai. The "2 stem" version will give you one track with vocals, and one track with everything else.
I've been asked to find the actual runtime of a batch of files. Each of these files contains voice and silences (guided meditation type), and I need to find a way to measure the runtime of just the voice.
The manual way of doing this is opening a file, looking at the wave, identifying the silences and removing them so the final duration of the file is the "just voice" runtime. This can take me 3-4 minutes per file, and that's just too much for a batch of 1800 files.
So my question is: is there a way to automatically delete the silent parts? And if so, can it be scripted or automated in any way?
In my studio we work with Sound Forge and ProTools.
ProTools has this built in: select the region and use edit->strip silence.
SoX can do this if you want to set up some scripts without using ProTools (nice blog post).
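If all you need is the "just voice" runtime rather than an edited file, the idea behind strip silence can be sketched in a few lines of Python. This is only an illustration, not what ProTools or SoX do internally: it assumes the audio has already been decoded into a list of float samples between -1.0 and 1.0, and the function name, the 20 ms frame size and the 0.02 RMS threshold are arbitrary choices you would need to tune for your recordings.

```python
import math


def voiced_seconds(samples, sample_rate, threshold=0.02, frame_ms=20):
    """Sum the duration of all frames whose RMS level reaches the threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    voiced_samples = 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Root-mean-square level of this frame; quiet frames count as silence.
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            voiced_samples += len(frame)
    return voiced_samples / sample_rate
```

Run over a whole folder, a loop around this function would give one voiced duration per file without opening any of them by hand.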
I'm looking for a program that is able to recognize individual audio samples from my computer and reroute them to trigger WAV files from a library. In my project, it would need to be realtime as the latency would not be a desired result. I tried using dictation software that would recognize words to trigger opening a file and that's the direction where I want to go, but instead of words I want it to be sounds and it would happen in realtime. I'm not sure where to go and am just looking for some guidance. Does anyone have any suggestions of what I should do?
That's a fairly broad question, but I can tell you how I would do it. (Hardly the only way, but where I would start.)
If you're looking for real time input, the Java Sound library (excellent tutorial here) allows for that. (Just note that microphone input from a web page is difficult on anything, due to major security concerns, so this would be a desktop application.)
If it needs to be real time, the first thing I would suggest is stream and multithread the hell out of it. I would suggest the Java 8 Stream API, but since you're looking for subsamples that match a specific pattern, then each data point will have to be aware of the state of its neighbors, and that isn't easy with streams.
You will probably want to know if a sound roughly resembles an audio profile, so for that, I would pick a tolerance on just how close you want it to be for a match (remembering that samples may not line up 100% anyway, so "exact" is not an option), and then look up Hidden Markov Models. I suggest these because they're what voice recognition software typically uses, and while your sounds may not be voices, it will give you an idea of what has already been done.
You'll also want to maintain a limited list of audio samples in memory. Specifically, you will likely need the most recent data, because an audio signal is a time-variant signal, and you can't get a match from just one point. I wouldn't make it much longer than the longest sample you're looking to recognize, as audio takes up a boatload of memory.
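To make the "recent data plus tolerance" idea concrete, here is a small sketch in Python rather than Java. It is deliberately much simpler than a Hidden Markov Model: it keeps only the most recent samples in a fixed-size buffer and compares them to a stored template with a normalized correlation. The class name `RollingMatcher`, the `similarity` helper and the 0.9 tolerance are all invented for this example.

```python
from collections import deque
import math


def similarity(window, template):
    """Normalized correlation between two equal-length sequences, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(window, template))
    norm = math.sqrt(sum(a * a for a in window)) * math.sqrt(sum(b * b for b in template))
    return dot / norm if norm else 0.0


class RollingMatcher:
    """Keeps only as many recent samples as the template is long."""

    def __init__(self, template, tolerance=0.9):
        self.template = list(template)
        self.tolerance = tolerance
        # deque(maxlen=...) silently drops the oldest sample on overflow.
        self.buffer = deque(maxlen=len(self.template))

    def push(self, sample):
        """Add one sample; return True if the buffer now resembles the template."""
        self.buffer.append(sample)
        if len(self.buffer) < len(self.template):
            return False
        return similarity(self.buffer, self.template) >= self.tolerance
```

Comparing on every single sample is expensive for real audio, so in practice you would likely compare once per block of samples or on extracted features instead.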
Lastly (for audio), I would recommend picking a standard format for comparison. Make it as good as gets you decent results, and start high. You will want to convert everything to that format before you compare it.
Once you recognize a specific sound, it's basically a Command Pattern. Specific sounds can be mapped, even with a java.util.HashMap, to specific files, which (if there are few enough) you might even have pre-loaded.
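The mapping step could look roughly like this, with a Python dict standing in for the java.util.HashMap. The labels, file paths and the `loader` and `play` callables are placeholders for whatever your recognizer and audio library actually provide.

```python
# Hypothetical mapping from a recognized sound label to the WAV file it triggers.
SOUND_TO_FILE = {
    "clap": "samples/snare.wav",
    "snap": "samples/hihat.wav",
}


def preload(mapping, loader):
    """Load every mapped file up front so triggering adds no disk latency."""
    return {label: loader(path) for label, path in mapping.items()}


def on_sound_recognized(label, preloaded, play):
    """Play the clip mapped to label; return False if nothing is mapped."""
    clip = preloaded.get(label)
    if clip is None:
        return False
    play(clip)
    return True
```

Passing the loader and player in as functions keeps the lookup logic independent of the audio backend you end up using.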
Lastly, it's worth looking at the Java Speech API. It's not part of the JDK and it's quite dated, but you might get some good advice from its implementation.
This is of course the advice of a Java-preferring programmer, but I imagine that there might be some decent libraries in Python and Ruby to help you as well; and of course there's something in C somewhere. This may sound like a lot, but most of the material is already implemented and ready-to-go.
Hopefully this helps, let's look forward to other answers.
I am doing an experiment in which I need to capture skeleton data from a Kinect and then apply that data to a model. I have captured the data and stored it in a file, i.e. the file contains the location of each joint in each frame.
Now I want my model in Blender to take the joint positions from the file and move accordingly, but I don't have any idea how to start.
I have also written a small Python script to read a position from the file and update the position of one bone:
obj.channels['head'].location = Vector((float(xs),float(ys),float(zs)))
but it does not move anything. Am I doing it the wrong way, or can the armature not be moved by just updating the position?
Please guide me on this topic, as I am completely new to Python and Blender.
I don't think that this is the best solution; you can simply export your data to a BVH file and save yourself a lot of headaches.
You can find a lot of Kinect SDK to BVH tutorials on the net, and BVH is the de facto standard for storing motion capture data, so there is no reason to reinvent the wheel and do extra work.
To use your BVH file in Blender, you can simply follow one of the many tutorials on the subject.