I want to horizontally cut the spectrogram of a wav file into 24 pieces, measure the power of each piece, and finally rank the pieces by power. What should I do, please?
Could you show some code that you have written to try out the same? It would be easier to help if we have something to build upon and rectify issues, if any.
Additionally, you could try basic image manipulation to do the same: instead of cutting, divide the spectrogram image into N (here 24) regions and analyze them in parallel using multiprocessing.
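In case it helps as a starting point, here is a minimal sketch that works directly on the audio (rather than the rendered image) with SciPy; the file name and the default spectrogram parameters are just placeholders:

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, data = wavfile.read("input.wav")        # hypothetical file name
if data.ndim > 1:
    data = data.mean(axis=1)                  # mix stereo down to mono
data = data.astype(np.float64)

freqs, times, sxx = spectrogram(data, fs=rate)

# Split the spectrogram horizontally into 24 frequency bands.
bands = np.array_split(sxx, 24, axis=0)
freq_bands = np.array_split(freqs, 24)
powers = [band.mean() for band in bands]      # mean power per band

# Rank the bands from strongest to weakest.
for rank, idx in enumerate(np.argsort(powers)[::-1], start=1):
    print(f"{rank}. band {idx}: {freq_bands[idx][0]:.0f}-{freq_bands[idx][-1]:.0f} Hz, "
          f"mean power {powers[idx]:.3g}")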
I am looking to scale a PNG file according to a provided audio file, a frequency range (20 Hz-1000 Hz, for example), and a threshold, for a smooth effect.
For example, when there is a kick, the scale should go to 120% smoothly. I would like to make the kind of audio visualizers you see for dubstep and similar genres, where the image "pumps" whenever a kick comes in.
First, is it doable with ffmpeg?
Where to start?
I found showcqt, which takes frequencies as input, but its output is a video, so I don't think I can use it in my case. Any help appreciated.
If you are able to read the PCM values as they are being output, then you might consider using a rolling RMS average in order to get a continuous stream of amplitudes. I don't know the best length of the array. Perhaps it should correspond to the number of audio frames that would give you an update for each visual frame? The folks at the DSP site would have the best insights.
If you do a rolling average, computations are not terribly expensive. You square the incoming sample, add it to a ring buffer (circular queue), and drop the outgoing one. Only those two data points need to be applied when computing the new rolling average, since the denominator is fixed and known. I found a video that describes the basic RMS math here using Matlab.
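A minimal sketch of that ring-buffer idea in Python (the window length is an assumption, e.g. one audio block per visual frame):

from collections import deque
import math

class RollingRMS:
    """Rolling RMS over the last `window_len` samples."""
    def __init__(self, window_len):
        self.window = deque(maxlen=window_len)
        self.sum_sq = 0.0

    def push(self, sample):
        if len(self.window) == self.window.maxlen:
            self.sum_sq -= self.window[0] ** 2   # drop the outgoing square
        self.window.append(sample)
        self.sum_sq += sample ** 2               # add the incoming square
        return math.sqrt(self.sum_sq / len(self.window))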
It might be necessary to add some smoothing to the visualizer that is receiving the volume updates. Also, handing off data from the audio thread should likely employ some form of loose coupling. It would not be good if the thread that is processing the audio was also handling graphics.
I'm a little over my head, but I think this is what is generally done for visualizers.
For a school project, I am supposed to analyze a short sound recording in wav format. I am done with the project: I DFT'd it, filtered out unwanted frequencies, and got the correct result. What eludes me, though, is the meaning of the values of the individual samples of my wav file. I have tens of thousands of samples that look like this:
[ 0.06234258 0.16020246 0.14122963 ... -0.01704375 -0.08993937 -0.09293508]
However, no matter what number I multiply these values by, the resulting sound stays the same. If I multiply every sample by 1000, it sounds just as it did before. The same goes for dividing. So what do these samples mean, if not volume?
EDIT:
Here is the code I'm using:
import soundfile as sf
from IPython.display import display, Audio

samples, sampling_freq = sf.read('recording.wav')
display(Audio(samples, rate=sampling_freq))  # This displays a playable audio bar.
The samples (basically a long array of floating-point numbers) in the file are the Pulse Code Modulated (PCM) data representing the audio.
Given that audio players use this data to recreate the original audio wave, multiplying every sample by some factor should increase the volume. However, some audio players scale down (re-normalize) the samples to prevent clipping, which is likely why it sounds the same.
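In this case the likely culprit is IPython.display.Audio itself, which normalizes the array to full scale before playback (recent IPython versions accept a normalize=False argument to turn this off). One way to convince yourself that the samples really are amplitudes is to write the scaled data back to a file, for example (the file names are just placeholders):

import numpy as np
import soundfile as sf

samples, sampling_freq = sf.read('recording.wav')

# Attenuate by 20 dB and write out; the clip() guards against values
# leaving the [-1, 1] range if you scale up instead of down.
quiet = np.clip(samples * 0.1, -1.0, 1.0)
sf.write('recording_quiet.wav', quiet, sampling_freq)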
A good way to visualize the audio is Audacity, which can show the audio waveform in real time.
Thanks for dropping in here.
I'm currently working on a project, and I'm not that strong with python yet. So I was hoping for some constructive feedback on this question.
I have a dataset containing core samples, all stored with sample id, latitude, longitude, content and other data irrelevant for this question.
Now I've imported this dataset and sliced it as I want it to be. For the images, I'm using the rasterio module to open 2 satellite images that cover the region. I'm using the utm module to convert back and forth between lat/long -> UTM -> pixel values (which also seems to be giving me strange coordinates at some points).
Annoyingly enough, the two Sentinel-2 images are cut right across the center of the map.
As I'm doing bounding boxes on top of where the samples are taken, this is a problem as I need to extract 10x10 pixel cut outs of that region. This leads to a lot of the samples not getting a proper cut out.
So I thought, why not merge the two images into one large rectangular piece? But I still need to retain the metadata with the UTM coordinates.
How would you suggest I proceed? Can it be done in an easy way? Is there another angle on this I've overlooked?
Thank you for your time.
I'm not sure I completely understand the question, but if you are simply trying to merge 2 images, have you looked at the command line tool gdal_merge.py?
A very simple example:
gdal_merge.py -o merged_image.tif image1.tif image2.tif
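Since you are already using rasterio, roughly the same merge can also be done in Python while keeping the georeferencing; a minimal sketch (file names are placeholders):

import rasterio
from rasterio.merge import merge

srcs = [rasterio.open(p) for p in ("image1.tif", "image2.tif")]
mosaic, out_transform = merge(srcs)           # pixel data + new affine transform

out_meta = srcs[0].meta.copy()
out_meta.update(height=mosaic.shape[1], width=mosaic.shape[2],
                transform=out_transform)      # keep CRS, dtype, band count

with rasterio.open("merged_image.tif", "w", **out_meta) as dst:
    dst.write(mosaic)

for src in srcs:
    src.close()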
I need to apply an impulse response from an audio file that is 48 kHz to an audio file that is 44.1 kHz; it could be someone speaking, for example. If I'm using the correct term, I need to convolve the two audio files together so it would sound like someone is speaking inside a cathedral.
What I don't know is how to go about doing this. I looked at the Minim library, since it's the only audio library I remember using, and I found an example that applies an impulse response of a low-pass filter to an audio file. Is there a way to convolve two audio files together to output a new sound? Audio processing isn't my forte, so please don't mind my ignorance. Thanks, I'm trying to figure this out along the way.
Yes, convolution is what you want, but first your sources need to be at the same sample rate, so you will have to resample one of them. Once they match, you have two options for performing the convolution: 1. you can do it "directly", which is the most straightforward way but takes M*N time, or 2. you can do it using the Fourier transform, which is more complex but much faster; for that you will also need to implement the overlap-add algorithm.

Looking at the docs of Minim, it looks to me like they use a standard IIR filter, not convolution by an impulse response, so I don't think that will help. You would have to do a lot of work on top of what Minim gives you to do FFT-based convolution. If you want to go the "direct" route, it will look something like this:
float[] output = new float[input.length];   // output truncated to the input length for simplicity
for (int i = 0; i < input.length; i++)
    for (int j = 0; j < conv.length; j++)
        if (i - j >= 0)
            output[i] += input[i - j] * conv[j];
more details here: http://www-rohan.sdsu.edu/~jiracek/DAGSAW/4.3.html or google "discrete convolution"
Update: Minim does give you convolution: http://code.compartmental.net/minim/javadoc/ddf/minim/effects/Convolver.html
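If you end up prototyping outside Minim, the whole pipeline (resample the impulse response, then FFT-convolve) is only a few lines with SciPy; a sketch, assuming mono files with placeholder names:

import soundfile as sf
from scipy.signal import fftconvolve, resample_poly

speech, fs = sf.read('speech_44k.wav')        # 44.1 kHz dry signal
ir, fs_ir = sf.read('cathedral_ir_48k.wav')   # 48 kHz impulse response

# Bring the IR down to 44.1 kHz: 44100/48000 reduces to 147/160.
ir = resample_poly(ir, 147, 160)

wet = fftconvolve(speech, ir)                 # FFT-based convolution
wet /= abs(wet).max()                         # normalize to avoid clipping
sf.write('speech_cathedral.wav', wet, fs)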
I'm working on an openGL project that involves a speaking cartoon face. My hope is to play the speech (encoded as mp3s) and animate its mouth using the audio data. I've never really worked with audio before so I'm not sure where to start, but some googling led me to believe my first step would be converting the mp3 to pcm.
I don't really anticipate the need for any Fourier transforms, though that could be nice. The mouth really just needs to move around when there's audio (I was thinking of basing it on volume).
Any tips on how to implement something like this, or pointers to resources, would be much appreciated. Thanks!
-S
Whatever you do, you're going to need to decode the MP3s into PCM data first. There are a number of third-party libraries that can do this for you. Then, you'll need to analyze the PCM data and do some signal processing on it.
Automatically generating realistic lipsync data from audio is a very hard problem, and you're wise to not try to tackle it. I like your idea of simply basing it on the volume. One way you could compute the current volume is to use a rolling window of some size (e.g. 1/16 second), and compute the average power in the sound wave over that window. That is, at frame T, you compute the average power over frames [T-N, T], where N is the number of frames in your window.
Thanks to Parseval's theorem, we can easily compute the power in a wave without having to take the Fourier transform or anything complicated -- the average power is just the sum of the squares of the PCM values in the window, divided by the number of frames in the window. Then, you can convert the power into a decibel rating by dividing it by some base power (which can be 1 for simplicity), taking the logarithm, and multiplying by 10.
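As a rough illustration of that window-power-to-decibels calculation (the window length, names, and the reference power of 1 are assumptions):

import numpy as np

def window_db(pcm, end_frame, fs, window_sec=1/16):
    """Average power, in dB, over the window ending at end_frame.
    pcm is an array of float samples in [-1, 1]."""
    n = int(fs * window_sec)
    chunk = pcm[max(0, end_frame - n):end_frame]
    power = np.mean(chunk ** 2) if len(chunk) else 0.0
    return 10 * np.log10(max(power, 1e-12))   # floor avoids log10(0)

# Map e.g. -60 dB .. 0 dB onto a 0 .. 1 mouth-opening amount:
# mouth_open = np.interp(window_db(pcm, t, fs), [-60, 0], [0.0, 1.0])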