Not sure if this is the right place to ask this, but can you tell how similar 2 songs sound from their waveform?
For example, could you search for a certain pattern in an audio file's waveform to tell whether it was metal, rock, or classical music?
So, I have spent quite a bit of time looking for an answer, but I keep finding the exact opposite of what I need (something I already know how to do), namely muting audio below a certain threshold (which can be done with either ffmpeg or sox).
What I need is a command that searches my audio for anything that exceeds -18 dB and mutes all such segments. (The reason: I have a recording in which I am playing an electric piano, with its output recorded directly through the cable, while two microphones record what I am speaking between my playing. When I am playing the keyboard, the mics pick up a lot of key-action noise that I want to mute. That should be easy to automate, because the channel that records my piano input has piano sound at exactly the moments when I want the audio from the mics to be muted, or at least attenuated.)
Does anyone know if such a feat is possible?
If it cannot be done with just one command, that's fine; I'll try anything.
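A minimal sketch of the kind of sidechain gating described above: measure the level of the piano channel in short windows and mute (or attenuate) the corresponding stretch of the mic track. This assumes numpy and soundfile are available; the file names and parameters are placeholders, not a definitive solution. ffmpeg's sidechaincompress filter is also worth looking at for doing this in a single call.

```python
# Mute (or attenuate) the mic track wherever the piano track exceeds a threshold.
import numpy as np
import soundfile as sf

THRESHOLD_DB = -18.0   # gate closes the mics when the piano channel exceeds this level
WINDOW_S = 0.05        # analysis window length in seconds
ATTENUATION = 0.0      # 0.0 = full mute; use e.g. 0.1 for roughly -20 dB instead

piano, sr = sf.read("piano.wav")   # key / sidechain signal (placeholder name)
mics, sr2 = sf.read("mics.wav")    # signal to be gated (placeholder name)
assert sr == sr2, "both tracks must share the same sample rate"

# Use a mono mixdown of the piano channel for level detection.
piano_mono = piano if piano.ndim == 1 else piano.mean(axis=1)

win = int(sr * WINDOW_S)
gain = np.ones(len(mics))

for i in range(len(piano_mono) // win):
    seg = piano_mono[i * win:(i + 1) * win]
    rms = np.sqrt(np.mean(seg ** 2)) + 1e-12
    level_db = 20 * np.log10(rms)
    if level_db > THRESHOLD_DB:
        gain[i * win:(i + 1) * win] = ATTENUATION

# Apply the gain curve (broadcast over channels if the mic file is stereo).
out = mics * (gain[:, None] if mics.ndim == 2 else gain)
sf.write("mics_gated.wav", out, sr)
```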
I am planning to develop a music app that includes a function for finding similar songs, much like what KKBOX and Shazam do, but I'm not familiar with this area. I've found that they apply an FFT to compare songs so that the user can search for similar ones.
However, I am wondering: what if I generate the waveform of each song and then compare the waveforms directly? Is that idea feasible?
As your objective is to find "similar" songs, comparing a 2D waveform is highly unlikely to work. However, it's a good idea to explore the feasibility of your approach first, before rejecting it out of hand.
I would suggest picking a set of 5 songs:
1 song and 1 song you think is very similar to it
1 song that's different from the first one, and a song by the same band on the same album (or from the same time period)
1 audio file that's completely different (e.g. from an audiobook or podcast)
Run through the librosa tutorials (https://librosa.org/doc/main/tutorial.html) and/or some of the walkthroughs on Medium (e.g. https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d), but stop before you get to the part that uses MFCCs. Just focus on the waveform images.
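As a concrete starting point, here is a minimal sketch of that waveform-viewing step with librosa and matplotlib (the file names are placeholders; librosa.display.waveshow is the current name of the plotting helper, older versions call it waveplot):

```python
# Load two songs and look at their raw waveforms side by side.
import librosa
import librosa.display
import matplotlib.pyplot as plt

song_a, sr_a = librosa.load("song_a.mp3")   # placeholder file names
song_b, sr_b = librosa.load("song_b.mp3")

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
librosa.display.waveshow(song_a, sr=sr_a, ax=ax1)
ax1.set_title("song A")
librosa.display.waveshow(song_b, sr=sr_b, ax=ax2)
ax2.set_title("song B")
plt.tight_layout()
plt.show()
```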
Looking at the visualizations for your songs and thinking through this problem, reason about a) why the waveform comparison ought to work, and b) why the waveform comparison won't work.
So think about things like tempo, timbre and timing - what would be the effect on the waveform of playing the same song on different instruments, with a different effects treatment, at a different tempo, or in a different order (same song, but changing the order of verses and chorus)?
Setting aside the non-trivial question of which waveform you'd be using (amplitude? of which frequency or frequencies?), at this point you should see how many problems there are with just looking at the waveform, and why MFCC (or something similar) is better. Additionally, you'll be better prepared to think about how MFCC parameters might be selected: how much of the song do you need to sample, and when should you start sampling?
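To make the contrast concrete, here is a rough sketch of an MFCC-based comparison. This is not a production fingerprinting method, just a crude way to compare two songs once you move past the raw waveform; the file names and the n_mfcc value are placeholders:

```python
# Compare two songs by the distance between their average MFCC vectors.
import librosa
import numpy as np

def mfcc_summary(path, n_mfcc=20):
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    return mfcc.mean(axis=1)                                # one vector per song

a = mfcc_summary("song_a.mp3")
b = mfcc_summary("song_b.mp3")

# Smaller distance = more similar, under this (very crude) summary.
print("euclidean distance:", np.linalg.norm(a - b))
```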
Is your idea possible? Probably not in the way you are thinking. Maybe you could experiment with transforming the data of the song in some way and then comparing that representation (e.g. looking at changes in amplitude or tempo). The problem with audio is that it encapsulates a lot of features in its signal:
key
tempo
effects treatment (e.g. reverb)
instruments
tone
dynamics
etc.
Watch a tutorial on audio mixing and you'll see/hear just how much the output signal of the exact same song can be changed without actually changing the song being played.
Innovation sometimes emerges when curious people try things that 'probably won't work', so anything is worth a shot, but once you've figured out for yourself why something won't work, it's useful to accept commonly used techniques, and look for opportunities for innovation in other ways.
OK, I'm going to try to explain the issue I am having in a few lines.
I am a blind game developer, and as such I make games using only audio and not graphics, so other blind people can play.
I am trying to build a binaural audio scene in a game. However, the sounds that play represent only an object's center position on the screen, x/2 and y/2.
This works fine for people, cars, and other small objects. However, when there is a door that occupies 5 squares, or a wall that spans the whole x or y axis, I am clueless as to how to represent that in sound.
Does anyone have any ideas?
I don't think this has anything to do with the sound library that I use; rather, it's to do with maths or geometry etc.
I thought about creating multiple copies of the sound, one for each position, but people say it's a really bad idea.
So what do you suggest?
Thanks.
I would create a set of audio filters that get applied to the sound of any object to indicate attributes like its size and whether it's moving towards or away from the POV.
Experiment with defining these filters to modulate an object's volume up and down, or perhaps to modulate the frequency of its audio.
In effect, these filters apply an agreed-upon audio texture to the object's sound.
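A rough sketch of what such a filter could look like, assuming the engine hands you raw mono sample buffers as numpy arrays; the mapping from object size to modulation rate and depth below is entirely made up and would need tuning:

```python
# Apply a slow amplitude modulation ("texture") to an object's audio,
# with the modulation rate/depth derived from the object's size.
import numpy as np

def apply_size_texture(samples, sample_rate, object_width):
    """samples: mono float array; object_width: size in game squares (assumed unit)."""
    t = np.arange(len(samples)) / sample_rate
    # Bigger objects -> slower, deeper modulation (arbitrary mapping, tune to taste).
    mod_rate_hz = max(0.5, 4.0 / object_width)
    mod_depth = min(0.6, 0.1 * object_width)
    envelope = 1.0 - mod_depth * 0.5 * (1.0 + np.sin(2 * np.pi * mod_rate_hz * t))
    return samples * envelope

# Example: a 5-square-wide door gets a slower, deeper tremolo than a 1-square object.
# door_audio = apply_size_texture(door_audio, 44100, object_width=5)
```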
I have a situation where I have a video capture of HD content via HDMI, with audio from a sound board that goes through an impedance drop into the microphone input of a camcorder. That same signal is split at line level to a 'line in' jack on the same computer that is capturing the HDMI. Alternatively, I can capture the audio via USB from the sound board, which is probably the best plan, but it carries with it the same issue.
The point is that the line-in or USB capture will be much higher quality than the one on HDMI, because the line out -> impedance change -> mic in path produces inferior quality: simply brushing the mic jack on the camera while trying to change the zoom (close proximity) can cause noise on the recording.
So I can do this today:
Take the good sound and the camera-captured sound, load each into Audacity, and fairly quickly use the Time Shift tool to fit the good audio exactly to the questionable audio from the HDMI capture, then cut the good audio to the exact length of the video. Then I can use ffmpeg or other video editing software to replace the questionable audio with the better audio.
But while somewhat quick and easy, it always carries with it a bit of human error and time. I'd like to automate this if possible as this process is repeated at least weekly throughout the year.
Does anyone have a suggestion if any of these ideas have merit or could suggest another approach?
I suspect, but have yet to confirm, that the system timestamp of the start time may be recorded both in audio captured with something like Audacity or the USB capture tool from the sound board, and in the HDMI MPEG-2 video. I tried ffprobe on a couple of Audacity-captured .wav files but didn't see anything in the results about such a time code; perhaps other audio formats or other probing tools include this info. Can anyone advise whether this is common with any particular capture tools or file formats?
If so, I think I could get the best results by extracting this information and then using simple adelay and atrim filters in ffmpeg to sync reliably, directly from the two sources, in one ffmpeg call. This is all theoretical for me right now -- I've never tried either of these filters yet -- I'm just trying to avoid blind alleys by asking for advice up front.
If such timestamps are not embedded, possibly I can use the file-system timestamp for the same idea expressed in 1a, but I suspect the file open of the two capture tools may have different inherent delays. Possibly these delays will turn out to be nearly constant, and the approach could work with a built-in constant anticipation delay, but that sounds messy and less reliable than idea 1. Still, I'd take it if it turns out reasonably reliable.
Are there any ffmpeg or general digital-audio experts out there who know of particular filters that can be employed on the actual data to look for similarities? For example: normalizing the peak amplitudes (or normalizing the amplification of the two to some RMS value), then stepping through a short 10-second snippet of audio, repeatedly shifting one time stream 0.01 s against the other, subtracting the two, and looking for a minimum. It sounds like it could take a while, but if it could do this in less than a minute and be reliable, I suspect it could work. I have only rudimentary knowledge of audio streams, and perhaps what I suggest is just not plausible, but since each stream starts from the same source, I think there should be a chance. I am just way out of my depth as to how to go down this road, so if someone out there knows such magic, or can throw me the names of filters and example calls, I can explore whether I can make it work.
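That idea is essentially cross-correlation, and on a 10-second snippet it runs in well under a minute. A rough sketch with numpy, scipy and soundfile (the file names are placeholders; it assumes both captures are at the same sample rate):

```python
# Estimate the offset between the "good" audio and the HDMI-captured audio
# by cross-correlating a short snippet of both.
import numpy as np
import soundfile as sf
from scipy.signal import correlate

SNIPPET_S = 10  # length of the snippet to compare, in seconds

good, sr = sf.read("good_audio.wav")   # placeholder names
cam, sr2 = sf.read("hdmi_audio.wav")
assert sr == sr2, "resample one file first so the rates match"

def mono(x):
    return x if x.ndim == 1 else x.mean(axis=1)

a = mono(good)[: SNIPPET_S * sr]
b = mono(cam)[: SNIPPET_S * sr]

# Normalize so level differences between the two capture chains don't dominate.
a = (a - a.mean()) / (a.std() + 1e-12)
b = (b - b.mean()) / (b.std() + 1e-12)

corr = correlate(a, b, mode="full")
lag = corr.argmax() - (len(b) - 1)
# Positive lag: the shared content appears later in the good audio
# (i.e. the good capture started earlier and has extra material at the head).
print("offset: %.3f s" % (lag / sr))
```

The printed offset could then be fed into atrim/-ss (trim the head of the good audio if the lag is positive) or adelay (delay it if the lag is negative) in a single ffmpeg call.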
Any hardware-level suggestions for taking a line-level output down to a mic-level input without the problems I am seeing with a simple in-line impedance-drop module, so that I can simply rely on the audio from the HDMI?
Thanks in advance for any pointers or suggestions!
I need to develop a program that toggles a particular audio track on or off when it recognizes a parrot scream or screech. The software would need to recognize a particular range of sounds and allow some variation within that range (as a parrot likely won't replicate its screeches EXACTLY each time).
Example: Bird screeches, no audio. Bird stops screeching for five seconds, audio track praising the bird plays. Regular chattering needs to be ignored completely, as it is not to be discouraged.
I've heard of Java libraries that have speech recognition with built-in dictionaries, but the software would need to be taught the particular sounds that my particular parrot makes - not words or any random bird sound. In addition, as I mentioned above, it would need to allow for slight variation in the sound, as the screech will likely never be 100% identical to the recorded version.
What would be the best way to go about this/what language should I look into?
Edit: Alternatively (and perhaps this would be a simpler solution), is there a way to make the audio toggle based on the volume of the input? So it wouldn't matter what kind of sound the parrot makes, just how loud it is.
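The volume-based fallback is straightforward in most languages. Here is a rough sketch in Python using the sounddevice library; the threshold, block length, and the play/stop hooks are placeholders, and it reacts only to loudness, not to which sound the parrot makes:

```python
# Toggle playback based on input loudness: mute while the mic is above a
# threshold, and play the "praise" track after 5 quiet seconds.
import time
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 44100
BLOCK_S = 0.25
THRESHOLD_DB = -20.0      # tune by watching the printed levels with your parrot
QUIET_REQUIRED_S = 5.0

quiet_since = None
with sd.InputStream(samplerate=SAMPLE_RATE, channels=1) as stream:
    while True:
        block, _ = stream.read(int(SAMPLE_RATE * BLOCK_S))
        rms = np.sqrt(np.mean(block ** 2)) + 1e-12
        level_db = 20 * np.log10(rms)
        print("level: %.1f dB" % level_db)
        if level_db > THRESHOLD_DB:       # screech (or any loud sound)
            quiet_since = None
            # stop_praise_track()         # placeholder for your audio player
        else:
            quiet_since = quiet_since or time.time()
            if time.time() - quiet_since >= QUIET_REQUIRED_S:
                # play_praise_track()     # placeholder for your audio player
                quiet_since = None
```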
This question seems to be tightly related to voice recognition. I would recommend taking a look at this post: How to convert human voice into digital format?