Audio data format being rejected in Speech Studio - speech-to-text

I'm uploading a zip file of audio data to a Custom Speech project in Speech Studio. However, the files are being rejected after upload.
I've tried sox and ffmpeg to do the file conversion. The sox output matches the requirements on the doc pages, so I don't understand why the files are being rejected.
sox.exe --i audio1.wav
Input File : 'audio1.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:02.27 = 36320 samples ~ 170.25 CDDA sectors
File Size : 72.7k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
I zip up the file and upload it. I believe this matches with the requirements below.
File format: RIFF (WAV)
Sample rate: 8,000 Hz or 16,000 Hz
Channels: 1 (mono)
Maximum length per audio: 2 hours
Sample format: PCM, 16-bit
Archive format: .zip
Maximum archive size: 2 GB
The UI displays "Failed to upload data. Please check your data format and try to upload again."
I can only believe that there's an issue with the service.

I have little experience with sox, but you can use ffmpeg with:
ffmpeg.exe -i <input> -ac 1 -ar 16000 <output>.wav
You can find ffmpeg here: https://www.ffmpeg.org/
It is free.
Hope this helps.
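Before zipping, it may help to check each file against the listed requirements programmatically. The following is only a sketch using Python's standard wave module; check_wav and make_wav are hypothetical helper names, and passing these checks does not guarantee that Speech Studio will accept the file (for example, it does not inspect extra header chunks or the zip layout).

```python
import io
import wave

ALLOWED_RATES = (8000, 16000)   # Hz, from the requirements quoted in the question
MAX_SECONDS = 2 * 60 * 60       # "Maximum length per audio: 2 hours"

def check_wav(source):
    """Return a list of problems; an empty list means the basic checks pass.

    `source` is a path or a binary file object. Hypothetical helper, not part
    of any Speech SDK.
    """
    try:
        with wave.open(source, "rb") as wav:
            channels = wav.getnchannels()
            rate = wav.getframerate()
            width = wav.getsampwidth()
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if channels != 1:
        problems.append(f"channels = {channels}, expected 1 (mono)")
    if rate not in ALLOWED_RATES:
        problems.append(f"sample rate = {rate} Hz, expected one of {ALLOWED_RATES}")
    if width != 2:
        problems.append(f"sample width = {8 * width} bits, expected 16")
    if rate and frames / rate > MAX_SECONDS:
        problems.append("longer than 2 hours")
    return problems

def make_wav(channels, rate, seconds=1):
    """Build a silent 16-bit WAV in memory, only to exercise check_wav."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * rate * seconds)
    buf.seek(0)
    return buf

print(check_wav(make_wav(1, 16000)))   # mono, 16 kHz
print(check_wav(make_wav(2, 44100)))   # stereo, 44.1 kHz
```

For a real file, call check_wav("audio1.wav") with your own path.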

Related

Sox audio concatenation file length wrong

I am trying to concatenate multiple audio files using Sox. Each file is very hi-res: 4ch, PCM, 256k sampling (yes, not a typo), 24 bit. Each file is approx 2 mins long. I can concatenate up to 9 files successfully with: sox file1.wav file2.wav file3.wav outfile.wav.
After 9 files I have the following sox summary which is correct:
Channels : 4
Sample Rate : 256000
Precision : 24-bit
Duration : 00:21:33.89 = 331236000 samples ~ 97041.8 CDDA sectors
File Size : 3.97G
Bit Rate : 24.6M
Sample Encoding: 24-bit Signed Integer PCM
When I add a 10th ~2 minute file I get:
Channels : 4
Sample Rate : 256000
Precision : 24-bit
Duration : 00:00:25.18 = 6445658 samples ~ 1888.38 CDDA sectors
File Size : 4.37G
Bit Rate : 1.39G
Sample Encoding: 24-bit Signed Integer PCM
You'll note here that we went from a length of 00:21:33.89 to a length of 00:00:25.18 with a corresponding drop in samples. Expected result would be a file of ~00:23:xx.xx with the 2 minutes added. The actual file size grew from 3.97GB to 4.37GB so the data is there, it appears to be a problem in the header.
Does anyone know of an upper limit in sox that we might be meeting?
Alternatively does anyone know how I might fix the file post facto? I tried sox --ignore-length infile.wav outfile.wav but the output file was identical.
Thanks
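One possible explanation, offered as an assumption rather than a confirmed diagnosis: the size fields in a standard RIFF/WAV header are 32-bit, so they cannot describe more than 2^32 bytes (about 4.29 GB). The arithmetic below uses only the numbers from the two summaries above and assumes the size field wrapped around exactly once.

```python
UINT32_RANGE = 2 ** 32            # RIFF/WAV chunk sizes are 32-bit fields

channels = 4
bytes_per_sample = 3              # 24-bit samples
sample_rate = 256_000
frame_bytes = channels * bytes_per_sample

frames_9_files = 331_236_000      # sample count from the 9-file summary
frames_reported = 6_445_658       # sample count from the 10-file summary

# Nine files still fit below the 32-bit limit.
bytes_9_files = frames_9_files * frame_bytes
print(bytes_9_files, bytes_9_files < UINT32_RANGE)

# Assumption: with ten files the size field wrapped exactly once, so the
# real payload is the reported size plus 2**32 bytes (approximate, because
# the reported sample count is rounded to whole frames).
bytes_estimated = frames_reported * frame_bytes + UINT32_RANGE
frames_estimated = bytes_estimated // frame_bytes
minutes, seconds = divmod(frames_estimated / sample_rate, 60)
print(f"{bytes_estimated / 1e9:.2f} GB, about {int(minutes)}:{seconds:05.2f}")
```

Under that assumption the estimate comes out at roughly 4.37 GB and a little under 24 minutes, which lines up with the reported file size and the expected ~00:23:xx duration. If this is the cause, a 64-bit container such as RF64 or Wave64 would avoid the limit.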

What is the difference between these 2 wav files?

I am trying to use a program called arss to create a spectrogram from a wav file. I have 2 wav files, one works and the other does not (it was converted to wav from mp3).
The error that arss throws at me is:
This WAVE file is not currently supported.
Which is fine, but I have no idea what parts of my wav file to change so that it will be supported. The docs don't help here (as far as I can tell)
When I run mediainfo on both wav files, I get the following specs:
working wav:
General
Complete name : working.wav
Format : Wave
File size : 1.15 MiB
Duration : 6 s 306 ms
Overall bit rate mode : Constant
Overall bit rate : 1 536 kb/s
Audio
Format : PCM
Format settings : Little / Signed
Codec ID : 1
Duration : 6 s 306 ms
Bit rate mode : Constant
Bit rate : 1 536 kb/s
Channel(s) : 2 channels
Sampling rate : 48.0 kHz
Bit depth : 16 bits
Stream size : 1.15 MiB (100%)
not working wav:
General
Complete name : not_working.wav
Format : Wave
File size : 5.49 MiB
Duration : 30 s 0 ms
Overall bit rate mode : Constant
Overall bit rate : 1 536 kb/s
Writing application : Lavf57.83.100
Audio
Format : PCM
Format settings : Little / Signed
Codec ID : 1
Duration : 30 s 0 ms
Bit rate mode : Constant
Bit rate : 1 536 kb/s
Channel(s) : 2 channels
Sampling rate : 48.0 kHz
Bit depth : 16 bits
Stream size : 5.49 MiB (100%)
Comparing the audio specs of both files, I can't tell any difference between anything other than the file size and duration. I even updated the Sampling rate of the non-working wav using ffmpeg so that it would match the working one at 48.0kHz, but no luck.
Any idea?
Both wav files are available here.
FFmpeg, by default, writes a LIST chunk, with some metadata, before the data chunk. ARSS has a rigid parser and expects the data chunk to start at a fixed byte offset (0x24). FFmpeg can be told to skip writing the LIST chunk using the bitexact option.
ffmpeg -i not_working.wav -c copy -bitexact new.wav
Note that ARSS doesn't check for sampling rate, only that WAVs have little endian PCM.
Here's a related Q, not quite a duplicate, linked for future readers:
ffmpeg - Making a Clean WAV file
I have an older version of ffmpeg and the -bitexact option alone did not help much. That is, it did remove some of the data from the chunk, but the LIST chunk was still present with other data.
You may also have to ask for no metadata at all with the -map_metadata option, like so:
ffmpeg ... -i <input> ... -flags +bitexact -map_metadata -1 ... <output>
(The ... marks places for other command-line options as required in your case.)
Adding -map_metadata -1 really removed everything, and the LIST chunk is now fully gone.
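To see whether a given WAV file has a LIST chunk ahead of its data chunk, you can walk the RIFF chunks yourself. This is a minimal sketch; list_chunks is a hypothetical helper, and the two in-memory files exist only for the demonstration (one written by Python's wave module, one with a tiny LIST chunk spliced in by hand).

```python
import io
import struct
import wave

def list_chunks(data):
    """Return [(chunk_id, offset_of_chunk_id, payload_size), ...] for WAV bytes."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunks = []
    pos = 12                                   # first chunk follows "RIFF<size>WAVE"
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        chunks.append((chunk_id.decode("ascii", "replace"), pos, size))
        pos += 8 + size + (size & 1)           # chunks are padded to an even length
    return chunks

# A plain file: 16-byte fmt chunk, so the data chunk sits at offset 0x24 (36).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)
    w.setframerate(48000)
    w.writeframes(b"\x00" * 400)
plain = buf.getvalue()

# The same file with a small LIST chunk inserted before the data chunk.
list_chunk = b"LIST" + struct.pack("<I", 4) + b"INFO"
with_list = plain[:36] + list_chunk + plain[36:]
with_list = with_list[:4] + struct.pack("<I", len(with_list) - 8) + with_list[8:]

print(list_chunks(plain))
print(list_chunks(with_list))
```

For a file on disk, pass open(path, "rb").read() to list_chunks and check where "data" lands.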

ffmpeg, stretch audio to x seconds

I am trying to make an audio file exactly x seconds long.
So far I have tried using the atempo filter with the following calculation:
Audio length / desired length = atempo.
But this is not accurate, and I am having to tweak the tempo manually to get an exact fit.
Are there any other solutions to get this to work? Or am I doing this incorrectly?
My original file is a wav file, and my output is an mp3.
Here is a sample command
ffmpeg -i input.wav -codec:a libmp3lame -filter:a "atempo=0.9992323" -b:a 320K output.mp3
UPDATE:
I was able to correctly calculate the tempo by changing the way I am receiving the audio length.
I am now calculating the current audio length using the actual file size and the sample rate.
Audio Length = file size / (sample rate * 2)
Sample rate is something like 16000 Hz. You can get it by using ffprobe or ffmpeg.
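Double-check the direction of the ratio. atempo is a speed factor: values above 1 play the audio faster and make it shorter, so the output lasts Audio length / atempo seconds. For a given target duration that means:
Audio length / desired length = atempo
and not:
desired length / Audio length = atempo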
This answer was posted as an edit to the question ffmpeg, stretch audio to x seconds by the OP Max Doumit under CC BY-SA 3.0.
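As a sketch of the calculation: since atempo is a speed factor (values above 1 shorten the audio), the factor for a target duration is the current duration divided by the desired one. The helpers below are hypothetical names. They read the duration from the WAV header via Python's wave module rather than from the file size, because file size / (sample rate * 2) assumes 16-bit mono and also counts the header bytes.

```python
import io
import wave

def wav_duration_seconds(source):
    """Duration from the frame count in the header (path or binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def atempo_for(current_seconds, desired_seconds):
    """Speed factor that turns current_seconds of audio into desired_seconds."""
    if current_seconds <= 0 or desired_seconds <= 0:
        raise ValueError("durations must be positive")
    return current_seconds / desired_seconds

# Demo on an in-memory file: 3 seconds of 16 kHz, 16-bit mono silence.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000 * 3)

size_estimate = len(buf.getvalue()) / (16000 * 2)   # includes the header bytes
buf.seek(0)
exact = wav_duration_seconds(buf)

print(size_estimate, exact, atempo_for(exact, 2.0))
```

The size-based estimate is slightly longer than the exact duration, which is enough to throw off a tempo with several decimal places. Note that atempo only accepts a limited range of values, so check the filter documentation for your build before passing extreme factors.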

Convert audio to 8-bit signed PCM

I have a .mp4 audio file that I want to convert to an 8-bit unsigned PCM format for an Arduino Uno using the TMRpcm library.
It also could be a .wav file. Anyways, I have tried many things to no avail. The closest I got was with Audacity using the NIST Sphere codec. I tried to do this with FFmpeg, but it only supports demuxing NIST Sphere files. How do I convert audio to this format on Mac OS X (10.10.2)?
avconv is a fork of ffmpeg, so use ffmpeg instead if you wish:
avconv -i input.mp4 -ar 8000 -acodec pcm_u8 -ac 1 output.wav
WAV is a container format for PCM audio, so if you MUST have raw PCM, open the WAV file in a binary file editor (wxHexEditor is a nice one) and delete the first 44 bytes (its header).
The command above gives you 8000 samples per second, a bit depth of 8 bits, and mono.
Verify this using:
avprobe some_video_audio_file.wav
See the list of bit depths available in avconv here.
I realized that I was trying to convert a corrupt audio file. Audacity converted a valid file correctly.
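If you do need the headerless PCM bytes, an alternative to cutting a fixed 44 bytes in a hex editor is to let a WAV parser find the data chunk, since the header can be longer when extra chunks are present. A minimal sketch with Python's wave module follows; wav_to_raw_pcm is a hypothetical helper and the in-memory file is only for the demonstration.

```python
import io
import wave

def wav_to_raw_pcm(source):
    """Return ((channels, sample_width_bytes, sample_rate), raw_pcm_bytes)."""
    with wave.open(source, "rb") as wav:
        params = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        return params, wav.readframes(wav.getnframes())

# Demo: an 8-bit, 8 kHz mono WAV. 8-bit WAV samples are unsigned, so 128 is silence.
payload = bytes([128, 130, 126, 128, 140, 116, 128, 128])
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(1)
    wav.setframerate(8000)
    wav.writeframes(payload)
header_bytes = len(buf.getvalue()) - len(payload)

buf.seek(0)
params, pcm = wav_to_raw_pcm(buf)
print(params, header_bytes, pcm == payload)
```

For a file on disk, write the returned bytes out with open("output.raw", "wb").write(pcm).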

FFMPEG audio decoding

I have used the avcodec_decode_audio3 function to decode the AMR content in the frame order.
I get 640 bytes output for each frame, with sample format being float and I have saved the output as a raw output file.
Now, I want to validate this output content. But I can't play it in any player as it does not have any header or media info. And I am not able to find any command in ffmpeg which gives me raw audio output.
Now, if I want to re-encode that raw output content in FFMPEG, what input format do I need to give?
Can anybody give some suggestion on this?
If the audio data is saved in a binary file as raw (headerless), you can use Audacity to import it as raw data and play it back. You would need to provide the sample encoding, sample rate and number of channels.
If there are any problems you can perform conversion to a raw file using ffmpeg, and use the result for comparison. For example:
ffmpeg -i input.wav -f f32le output.raw
This produces a raw audio file with 32-bit little-endian float samples, keeping the original sample rate and number of channels. Alternatively, the output sample rate and number of channels can be specified, for example with -ar 44100 and -ac 2.
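Another way to make a headerless float dump playable is to wrap it in a WAV file yourself. The sketch below converts 32-bit little-endian float samples to 16-bit PCM with Python's standard library. f32le_to_wav is a hypothetical helper, it assumes the floats are in the usual -1.0 to 1.0 range, and the 8000 Hz mono used in the demo is only a guess: you have to supply the real sample rate and channel count of your decoded stream.

```python
import io
import struct
import wave

def f32le_to_wav(raw, dest, sample_rate, channels):
    """Write raw little-endian float32 samples to `dest` as a 16-bit PCM WAV."""
    count = len(raw) // 4                       # ignore any trailing partial sample
    floats = struct.unpack(f"<{count}f", raw[:count * 4])
    # Clamp to [-1.0, 1.0], then scale to the signed 16-bit range.
    ints = [round(max(-1.0, min(1.0, s)) * 32767) for s in floats]
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{count}h", *ints))

# Demo with four samples, including one that is out of range and gets clamped.
raw = struct.pack("<4f", 0.0, 1.0, -1.0, 2.0)
out = io.BytesIO()
f32le_to_wav(raw, out, sample_rate=8000, channels=1)   # assumed rate and channels

out.seek(0)
with wave.open(out, "rb") as wav:
    decoded = struct.unpack("<4h", wav.readframes(4))
print(decoded)
```

For a real dump, read the bytes with open("output.raw", "rb").read() and pass a file path as dest.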
